production server down

Started by Joe Conway · over 21 years ago · 32 messages · pgsql-hackers
#1 Joe Conway
mail@joeconway.com

I've got a down production server (will not restart) with the following
tail to its log file:

2004-12-13 15:05:52 LOG: recycled transaction log file "000001650000004C"
2004-12-13 15:26:01 LOG: recycled transaction log file "000001650000004D"
2004-12-13 16:39:55 LOG: database system was shut down at 2004-11-02
17:05:33 PST
2004-12-13 16:39:55 LOG: checkpoint record is at 0/9B0B8C
2004-12-13 16:39:55 LOG: redo record is at 0/9B0B8C; undo record is at
0/0; shutdown TRUE
2004-12-13 16:39:55 LOG: next transaction ID: 536; next OID: 17142
2004-12-13 16:39:55 LOG: database system is ready
2004-12-14 15:36:20 FATAL: IDENT authentication failed for user "colprod"
2004-12-14 15:36:58 FATAL: IDENT authentication failed for user "colprod"
2004-12-14 15:39:26 LOG: received smart shutdown request
2004-12-14 15:39:26 LOG: shutting down
2004-12-14 15:39:28 PANIC: could not open file
"/replica/pgdata/pg_xlog/0000000000000000" (log file 0, segment 0): No
such file or directory
2004-12-14 15:39:28 LOG: shutdown process (PID 23202) was terminated by
signal 6
2004-12-14 15:39:39 LOG: database system shutdown was interrupted at
2004-12-14 15:39:26 PST
2004-12-14 15:39:39 LOG: could not open file
"/replica/pgdata/pg_xlog/0000000000000000" (log file 0, segment 0): No
such file or directory
2004-12-14 15:39:39 LOG: invalid primary checkpoint record
2004-12-14 15:39:39 LOG: could not open file
"/replica/pgdata/pg_xlog/0000000000000000" (log file 0, segment 0): No
such file or directory
2004-12-14 15:39:39 LOG: invalid secondary checkpoint record
2004-12-14 15:39:39 PANIC: could not locate a valid checkpoint record
2004-12-14 15:39:39 LOG: startup process (PID 23298) was terminated by
signal 6
2004-12-14 15:39:39 LOG: aborting startup due to startup process failure

This is a SuSE 9, 8-way Xeon IBM x445, with nfs mounted Network
Appliance for database storage, postgresql-7.4.5-36.4.

The server experienced a hang (as yet unexplained) yesterday and was
restarted at 2004-12-13 16:38:49 according to syslog. I'm told by the
network admin that there was a problem with the network card on restart,
so the nfs mount most probably disappeared and then reappeared
underneath a quiescent postgresql at some point between 2004-12-13
16:39:55 and 2004-12-14 15:36:20 (but much closer to the former than the
latter).

Any help would be much appreciated. Is our only option pg_resetxlog?

Thanks,

Joe

#2 Bruce Momjian
bruce@momjian.us
In reply to: Joe Conway (#1)
Re: production server down

Joe Conway wrote:

This is a SuSE 9, 8-way Xeon IBM x445, with nfs mounted Network
Appliance for database storage, postgresql-7.4.5-36.4.

The server experienced a hang (as yet unexplained) yesterday and was
restarted at 2004-12-13 16:38:49 according to syslog. I'm told by the
network admin that there was a problem with the network card on restart,
so the nfs mount most probably disappeared and then reappeared
underneath a quiescent postgresql at some point between 2004-12-13
16:39:55 and 2004-12-14 15:36:20 (but much closer to the former than the
latter).

Well, my first reaction is that if the file system storage was not
always 100% reliable, then there is no way to know the data is correct
except by restoring from backup. The startup failure indicates that
there were surely storage problems in the past. There is no way to know
how far that corruption goes.

You can use pg_resetxlog to clear it out and look to see how accurate it
is, but there is no way to be sure. I would back up the file system
with the server down in case you want to do some more serious recovery
attempts later though.

The Freenode IRC channel can probably walk you through more details of
the recovery process.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#3 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joe Conway (#1)
Re: production server down

Joe Conway <mail@joeconway.com> writes:

I've got a down production server (will not restart) with the following
tail to its log file:

Please show the output of pg_controldata, or a hex dump of pg_control
if pg_controldata fails.
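
[A minimal sketch of what Tom is asking for, with the cluster path from this thread assumed; in 7.4, pg_control lives under global/ inside the data directory. The fallback hex dump is one way to do it, not necessarily the tool Tom had in mind.]

```shell
# Show pg_control contents, falling back to a raw hex dump if pg_controldata
# itself fails. /replica/pgdata is the cluster discussed in this thread.
PGDATA=/replica/pgdata
out=$(pg_controldata "$PGDATA" 2>/dev/null) \
  || out=$(od -A x -t x1z "$PGDATA/global/pg_control" 2>/dev/null) \
  || out="cannot read $PGDATA/global/pg_control"
echo "$out"
```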

The server experienced a hang (as yet unexplained) yesterday and was
restarted at 2004-12-13 16:38:49 according to syslog. I'm told by the
network admin that there was a problem with the network card on restart,
so the nfs mount most probably disappeared and then reappeared
underneath a quiescent postgresql at some point between 2004-12-13
16:39:55 and 2004-12-14 15:36:20 (but much closer to the former than the
latter).

I've always felt that running a database across NFS was a Bad Idea ;-)

Any help would be much appreciated. Is our only option pg_resetxlog?

Possibly, but let's try to dig first. I suppose the DB is too large
to save an image aside for forensics later?

regards, tom lane

#4 Joe Conway
mail@joeconway.com
In reply to: Tom Lane (#3)
Re: production server down

Tom Lane wrote:

Joe Conway <mail@joeconway.com> writes:

I've got a down production server (will not restart) with the following
tail to its log file:

Please show the output of pg_controldata, or a hex dump of pg_control
if pg_controldata fails.

OK, will do shortly.

The server experienced a hang (as yet unexplained) yesterday and was
restarted at 2004-12-13 16:38:49 according to syslog. I'm told by the
network admin that there was a problem with the network card on restart,
so the nfs mount most probably disappeared and then reappeared
underneath a quiescent postgresql at some point between 2004-12-13
16:39:55 and 2004-12-14 15:36:20 (but much closer to the former than the
latter).

I've always felt that running a database across NFS was a Bad Idea ;-)

Yeah, I knew I had that coming :-)

Any help would be much appreciated. Is our only option pg_resetxlog?

Possibly, but let's try to dig first. I suppose the DB is too large
to save an image aside for forensics later?

Actually, although the database is about 400 GB, we do have room and are
in the process of saving an image now.

Joe

#5 Joe Conway
mail@joeconway.com
In reply to: Tom Lane (#3)
Re: production server down

Tom Lane wrote:

Joe Conway <mail@joeconway.com> writes:

I've got a down production server (will not restart) with the following
tail to its log file:

Please show the output of pg_controldata, or a hex dump of pg_control
if pg_controldata fails.

OK, here it is:

# pg_controldata /replica/pgdata
pg_control version number: 72
Catalog version number: 200310211
Database cluster state: shutting down
pg_control last modified: Tue Dec 14 15:39:26 2004
Current log file ID: 0
Next log file segment: 1
Latest checkpoint location: 0/9B0B8C
Prior checkpoint location: 0/9AA1B4
Latest checkpoint's REDO location: 0/9B0B8C
Latest checkpoint's UNDO location: 0/0
Latest checkpoint's StartUpID: 12
Latest checkpoint's NextXID: 536
Latest checkpoint's NextOID: 17142
Time of latest checkpoint: Tue Nov 2 17:05:32 2004
Database block size: 8192
Blocks per segment of large relation: 131072
Maximum length of identifiers: 64
Maximum number of function arguments: 32
Date/time type storage: 64-bit integers
Maximum length of locale name: 128
LC_COLLATE: C
LC_CTYPE: C

Joe

#6 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joe Conway (#5)
Re: production server down

Joe Conway <mail@joeconway.com> writes:

Tom Lane wrote:

Please show the output of pg_controldata, or a hex dump of pg_control
if pg_controldata fails.

OK, here it is:

...
pg_control last modified: Tue Dec 14 15:39:26 2004
...
Time of latest checkpoint: Tue Nov 2 17:05:32 2004

[ blink... ] That seems like an unreasonable gap between checkpoints,
especially for a production server. Can you see an explanation?

regards, tom lane

#7 Joe Conway
mail@joeconway.com
In reply to: Tom Lane (#6)
Re: production server down

Tom Lane wrote:

...
pg_control last modified: Tue Dec 14 15:39:26 2004
...
Time of latest checkpoint: Tue Nov 2 17:05:32 2004

[ blink... ] That seems like an unreasonable gap between checkpoints,
especially for a production server. Can you see an explanation?

Hmmm, this is even more scary. We have two database clusters on this
server, one on /replica/pgdata, and one on /production/pgdata (ignore
the names -- /replica is actually the "production" instance at the moment).

# pg_controldata /replica/pgdata
pg_control version number: 72
Catalog version number: 200310211
Database cluster state: shutting down
pg_control last modified: Tue Dec 14 15:39:26 2004
Current log file ID: 0
Next log file segment: 1
Latest checkpoint location: 0/9B0B8C
Prior checkpoint location: 0/9AA1B4
Latest checkpoint's REDO location: 0/9B0B8C
Latest checkpoint's UNDO location: 0/0
Latest checkpoint's StartUpID: 12
Latest checkpoint's NextXID: 536
Latest checkpoint's NextOID: 17142
Time of latest checkpoint: Tue Nov 2 17:05:32 2004
Database block size: 8192
Blocks per segment of large relation: 131072
Maximum length of identifiers: 64
Maximum number of function arguments: 32
Date/time type storage: 64-bit integers
Maximum length of locale name: 128
LC_COLLATE: C
LC_CTYPE: C

# pg_controldata /production/pgdata
pg_control version number: 72
Catalog version number: 200310211
Database cluster state: shutting down
pg_control last modified: Tue Nov 2 21:57:49 2004
Current log file ID: 0
Next log file segment: 1
Latest checkpoint location: 0/9B0B8C
Prior checkpoint location: 0/9AA1B4
Latest checkpoint's REDO location: 0/9B0B8C
Latest checkpoint's UNDO location: 0/0
Latest checkpoint's StartUpID: 12
Latest checkpoint's NextXID: 536
Latest checkpoint's NextOID: 17142
Time of latest checkpoint: Tue Nov 2 17:05:32 2004
Database block size: 8192
Blocks per segment of large relation: 131072
Maximum length of identifiers: 64
Maximum number of function arguments: 32
Date/time type storage: 64-bit integers
Maximum length of locale name: 128
LC_COLLATE: C
LC_CTYPE: C

I have no idea how this happened, but those look too similar except for
the "last modified" date. The space used is quite what I'd expect:

# du -h --max-depth=1 /replica
403G /replica/pgdata

# du -h --max-depth=1 /production
201G /production/pgdata

The "/production/pgdata" cluster has not been in use since Nov 2. But
we've been loading data aggressively into "/replica/pgdata".

Any theories on how we screwed up?

Joe

#8 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joe Conway (#7)
Re: production server down

Joe Conway <mail@joeconway.com> writes:

Any theories on how we screwed up?

I hesitate to suggest this, but maybe a cron job blindly copying data
from point A to point B?

I'm not sure that that could entirely explain the facts. My
recollection of the xlog.c logic is that the pg_control file is read
into shared memory during postmaster boot, and after that it's
write-only: at checkpoint times we update the file image in shared
memory and then write it out to pg_control.

Offhand my bets would revolve around (a) multiple postmasters trying to
run the same PGDATA directory (we have interlocks to protect against
this, but I have no faith that they work against an NFS-mounted data
directory), or (b) you somehow wiped a PGDATA directory and restored it
from backup tapes underneath a running postmaster.

regards, tom lane

#9 Joe Conway
mail@joeconway.com
In reply to: Tom Lane (#8)
Re: production server down

Tom Lane wrote:

Joe Conway <mail@joeconway.com> writes:

Any theories on how we screwed up?

I hesitate to suggest this, but maybe a cron job blindly copying data
from point A to point B?

Not likely, but I'll check.

Offhand my bets would revolve around (a) multiple postmasters trying
to run the same PGDATA directory (we have interlocks to protect
against this, but I have no faith that they work against an
NFS-mounted data directory)

This might be possible I suppose. I know we have two init scripts.
Perhaps there is an error in them that caused both postmasters to point
to the same place when the server was rebooted. I'll look them over.

or (b) you somehow wiped a PGDATA directory and restored it from
backup tapes underneath a running postmaster.

This seems highly unlikely because our *nix admin would have had to
deliberately do it, and I don't think he'd fail to tell me about
something like that. But all the same, I'll ask him tomorrow.

Assuming the only real problem here is the control data (long shot, I
know), and the actual database files and transaction logs are OK, is
there any reasonable way to reconstruct the correct control data? Or is
that the point at which you use pg_resetxlog?

Joe

#10 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joe Conway (#9)
Re: production server down

Joe Conway <mail@joeconway.com> writes:

Assuming the only real problem here is the control data (long shot, I
know), and the actual database files and transaction logs are OK, is
there any reasonable way to reconstruct the correct control data? Or is
that the point at which you use pg_resetxlog?

Well, the problem is that if you can't trust pg_control you don't know
what you can trust.

My advice is to backup the $PGDATA tree (which you said was in
progress), then pg_resetxlog, then cross-check the hell out of the data
you see. Only if you can detect some data problems can we guess at
something else to do ...
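
[The sequence Tom outlines could look roughly like the dry-run sketch below. Nothing here touches a real cluster unless DO_IT=1 is set; the backup destination is an assumed path, and the bare `pg_resetxlog -f` invocation is a placeholder — the actual override flags were worked out later in the thread.]

```shell
# Dry-run sketch of: back up $PGDATA, reset the xlog, restart, then audit.
PGDATA=/replica/pgdata
run() { if [ "${DO_IT:-0}" = 1 ]; then "$@"; else echo "would run: $*"; fi; }

# 1. image the data directory with the server down (assumed destination)
run tar -C "$(dirname "$PGDATA")" -cf /backup/pgdata-image.tar "$(basename "$PGDATA")"
# 2. rebuild pg_control / reset WAL; -f forces it past a suspect pg_control
run pg_resetxlog -f "$PGDATA"
# 3. bring the postmaster back up
run pg_ctl -D "$PGDATA" start
# 4. then cross-check: row counts, constraints, application-level audits
```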

regards, tom lane

#11 Joe Conway
mail@joeconway.com
In reply to: Tom Lane (#10)
Re: production server down

Tom Lane wrote:

My advice is to backup the $PGDATA tree (which you said was in
progress), then pg_resetxlog, then cross-check the hell out of the data
you see. Only if you can detect some data problems can we guess at
something else to do ...

OK. I plan to gather the usual suspects and try to get an accurate
picture of the chain of events first thing tomorrow. Then we'll likely
proceed as you suggest.

Thinking about your comments and reading xlog.c, it almost seems as
though the mount points were momentarily reversed between /replica and
/production. I.e. that the /production mount point was used near the
beginning of StartupXLOG() for ReadControlFile(), and the /replica mount
point was used at the end of StartupXLOG() for UpdateControlFile(). But
I have no idea how that could happen.

Thanks,

Joe

#12 Alvaro Herrera
alvherre@dcc.uchile.cl
In reply to: Joe Conway (#5)
Re: production server down

On Tue, Dec 14, 2004 at 09:22:42PM -0800, Joe Conway wrote:

# pg_controldata /replica/pgdata

Current log file ID: 0
Next log file segment: 1
Latest checkpoint location: 0/9B0B8C
Prior checkpoint location: 0/9AA1B4
Latest checkpoint's REDO location: 0/9B0B8C
Latest checkpoint's UNDO location: 0/0
Latest checkpoint's StartUpID: 12
Latest checkpoint's NextXID: 536
Latest checkpoint's NextOID: 17142

Isn't it strange that these values are so close to the values found in a
just-initdb'd cluster?

--
Alvaro Herrera (<alvherre[@]dcc.uchile.cl>)
"Take the book your religion considers the right one for finding the prayer
that brings peace to your soul. Then reboot the computer
and see if it works" (Carlos Duclós)

#13 Joe Conway
mail@joeconway.com
In reply to: Tom Lane (#10)
Re: production server down

Tom Lane wrote:

My advice is to backup the $PGDATA tree (which you said was in
progress), then pg_resetxlog, then cross-check the hell out of the data
you see. Only if you can detect some data problems can we guess at
something else to do ...

Before running pg_resetxlog, a couple of questions:

1. Since it appears that pg_control is suspect, should I force it to be
rebuilt, and if so, how?

2. At the end of GuessControlValues is this comment:
/*
* XXX eventually, should try to grovel through old XLOG to develop
* more accurate values for startupid, nextXID, and nextOID.
*/
What would be involved in doing this, and do you think it would be
worth it?

Thanks,

Joe

#14 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joe Conway (#13)
Re: production server down

Joe Conway <mail@joeconway.com> writes:

Before running pg_resetxlog, a couple of questions:

1. Since it appears that pg_control is suspect, should I force it to be
rebuilt, and if so, how?

pg_resetxlog will rebuild it in any case. However it will re-use the
existing contents as much as it can (if you don't use any of the command
line options to override values). Given Alvaro's observation that the
existing file looks suspiciously close to a freshly-initdb'd one, I
don't think you want to trust the existing contents.

2. At the end of GuessControlValues is this comment:
/*
* XXX eventually, should try to grovel through old XLOG to develop
* more accurate values for startupid, nextXID, and nextOID.
*/
What would be involved in doing this, and do you think it would be
worth it?

What if anything have you got in $PGDATA/pg_xlog?

regards, tom lane

#15 Joe Conway
mail@joeconway.com
In reply to: Tom Lane (#14)
Re: production server down

Tom Lane wrote:

pg_resetxlog will rebuild it in any case. However it will re-use the
existing contents as much as it can (if you don't use any of the command
line options to override values). Given Alvaro's observation that the
existing file looks suspiciously close to a freshly-initdb'd one, I
don't think you want to trust the existing contents.

I don't trust it at all. So does that imply that I should override next
transaction id and WAL starting address per the manpage?

What if anything have you got in $PGDATA/pg_xlog?

# pwd
/replica/pgdata/pg_xlog
# ll
total 688836
drwx------ 2 postgres postgres 32768 Dec 13 15:47 .
drwx------ 6 postgres postgres 4096 Dec 14 17:45 ..
-rw------- 1 postgres postgres 16777216 Dec 13 16:02 000001650000004E
-rw------- 1 postgres postgres 16777216 Dec 13 06:42 000001650000004F
-rw------- 1 postgres postgres 16777216 Dec 13 06:55 0000016500000050
-rw------- 1 postgres postgres 16777216 Dec 13 07:21 0000016500000051
-rw------- 1 postgres postgres 16777216 Dec 13 07:41 0000016500000052
-rw------- 1 postgres postgres 16777216 Dec 13 07:57 0000016500000053
-rw------- 1 postgres postgres 16777216 Dec 13 08:00 0000016500000054
-rw------- 1 postgres postgres 16777216 Dec 13 08:04 0000016500000055
-rw------- 1 postgres postgres 16777216 Dec 13 08:09 0000016500000056
-rw------- 1 postgres postgres 16777216 Dec 13 08:13 0000016500000057
-rw------- 1 postgres postgres 16777216 Dec 13 08:26 0000016500000058
-rw------- 1 postgres postgres 16777216 Dec 13 08:42 0000016500000059
-rw------- 1 postgres postgres 16777216 Dec 13 09:09 000001650000005A
-rw------- 1 postgres postgres 16777216 Dec 13 09:23 000001650000005B
-rw------- 1 postgres postgres 16777216 Dec 13 09:40 000001650000005C
-rw------- 1 postgres postgres 16777216 Dec 13 09:51 000001650000005D
-rw------- 1 postgres postgres 16777216 Dec 13 09:58 000001650000005E
-rw------- 1 postgres postgres 16777216 Dec 13 10:03 000001650000005F
-rw------- 1 postgres postgres 16777216 Dec 13 10:09 0000016500000060
-rw------- 1 postgres postgres 16777216 Dec 13 10:24 0000016500000061
-rw------- 1 postgres postgres 16777216 Dec 13 10:37 0000016500000062
-rw------- 1 postgres postgres 16777216 Dec 13 10:56 0000016500000063
-rw------- 1 postgres postgres 16777216 Dec 13 11:11 0000016500000064
-rw------- 1 postgres postgres 16777216 Dec 13 11:38 0000016500000065
-rw------- 1 postgres postgres 16777216 Dec 13 11:52 0000016500000066
-rw------- 1 postgres postgres 16777216 Dec 13 11:56 0000016500000067
-rw------- 1 postgres postgres 16777216 Dec 13 12:04 0000016500000068
-rw------- 1 postgres postgres 16777216 Dec 13 12:07 0000016500000069
-rw------- 1 postgres postgres 16777216 Dec 13 12:17 000001650000006A
-rw------- 1 postgres postgres 16777216 Dec 13 12:29 000001650000006B
-rw------- 1 postgres postgres 16777216 Dec 13 12:52 000001650000006C
-rw------- 1 postgres postgres 16777216 Dec 13 13:15 000001650000006D
-rw------- 1 postgres postgres 16777216 Dec 13 13:36 000001650000006E
-rw------- 1 postgres postgres 16777216 Dec 13 13:51 000001650000006F
-rw------- 1 postgres postgres 16777216 Dec 13 13:59 0000016500000070
-rw------- 1 postgres postgres 16777216 Dec 13 14:06 0000016500000071
-rw------- 1 postgres postgres 16777216 Dec 13 14:10 0000016500000072
-rw------- 1 postgres postgres 16777216 Dec 13 14:15 0000016500000073
-rw------- 1 postgres postgres 16777216 Dec 13 14:37 0000016500000074
-rw------- 1 postgres postgres 16777216 Dec 13 14:51 0000016500000075
-rw------- 1 postgres postgres 16777216 Dec 13 15:17 0000016500000076
-rw------- 1 postgres postgres 16777216 Dec 13 15:39 0000016500000077

Joe

#16 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joe Conway (#15)
Re: production server down

Joe Conway <mail@joeconway.com> writes:

I don't trust it at all. So does that imply that I should override next
transaction id and WAL starting address per the manpage?

Yes, override everything there's a switch for. Also check that the
other values shown by pg_controldata look reasonable (the locale
settings are probably the only ones you might get burned on).

What if anything have you got in $PGDATA/pg_xlog?

-rw------- 1 postgres postgres 16777216 Dec 13 15:39 0000016500000077

Um. That's so far from the values shown in pg_control that it's not funny.

This is 7.4, right? I have a crude xlog dump tool that I'll send you
off-list. We should be able to identify the latest checkpoint in the
existing XLOG files, and that will give you something to work with.

regards, tom lane
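
[Tom's "not funny" distance can be made concrete. A sketch under the assumption of 7.4's WAL naming: 16 MB segments, 255 usable segments per log file id, file names being an 8-hex-digit log id followed by an 8-hex-digit segment number. The pg_resetxlog -l line it prints is an illustration of the kind of override involved, not the value actually used in this recovery.]

```shell
# Decode the newest on-disk WAL segment name and compare it with what
# pg_control claims (log file ID 0, next segment 1 -- a freshly-initdb'd look).
newest=0000016500000077            # newest file in /replica/pgdata/pg_xlog
logid=$((16#${newest:0:8}))        # high 8 hex digits: log file id
segno=$((16#${newest:8:8}))        # low 8 hex digits: segment number
# assuming 255 usable 16 MB segments per log file id (the 7.4-era layout):
segments_behind=$(( logid * 255 + segno - 1 ))
echo "pg_control is $segments_behind 16MB segments behind the newest WAL file"
# an override pointing past the newest existing segment would look like:
printf 'pg_resetxlog -l 0x%X,0x%X /replica/pgdata\n' "$logid" "$((segno + 1))"
```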

#17 Joe Conway
mail@joeconway.com
In reply to: Tom Lane (#16)
Re: production server down

Tom Lane wrote:

Joe Conway <mail@joeconway.com> writes:

I don't trust it at all. So does that imply that I should override
next transaction id and WAL starting address per the manpage?

Yes, override everything there's a switch for. Also check that the
other values shown by pg_controldata look reasonable (the locale
settings are probably the only ones you might get burned on).

OK

What if anything have you got in $PGDATA/pg_xlog?

-rw------- 1 postgres postgres 16777216 Dec 13 15:39
0000016500000077

Um. That's so far from the values shown in pg_control that it's not
funny.

This is 7.4, right?

Correct.

I have a crude xlog dump tool that I'll send you off-list. We should
be able to identify the latest checkpoint in the existing XLOG files,
and that will give you something to work with.

Thanks,

Joe

#18 Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#3)
Re: production server down

Tom Lane <tgl@sss.pgh.pa.us> writes:

The server experienced a hang (as yet unexplained) yesterday and was
restarted at 2004-12-13 16:38:49 according to syslog. I'm told by the
network admin that there was a problem with the network card on restart,
so the nfs mount most probably disappeared and then reappeared
underneath a quiescent postgresql at some point between 2004-12-13
16:39:55 and 2004-12-14 15:36:20 (but much closer to the former than the
latter).

I've always felt that running a database across NFS was a Bad Idea ;-)

Well not that I disagree with that sentiment, but NFS was specifically
designed to handle this particular scenario. *UNLESS* you use the "soft"
option. As popular as it is, this is precisely the scenario where it causes
problems.

(The "intr" option as well, but I don't think that would be relevant for
postgres).
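
[For concreteness, the distinction being drawn could look like the /etc/fstab sketch below. The filer name and export path are made up; "hard" makes the NFS client block and retry indefinitely when the server disappears (instead of returning I/O errors to the database, as "soft" does after its retries are exhausted).]

```
# hypothetical fstab entry for a database on a NetApp filer:
# hard mount, no interruptible waits -- avoid "soft" under a database
filer:/vol/pgdata  /replica  nfs  rw,hard,nointr,noatime  0  0
```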

--
greg

#19 Joe Conway
mail@joeconway.com
In reply to: Tom Lane (#16)
Re: production server down

Tom Lane wrote:

Yes, override everything there's a switch for. Also check that the
other values shown by pg_controldata look reasonable (the locale
settings are probably the only ones you might get burned on).

What if anything have you got in $PGDATA/pg_xlog?

-rw------- 1 postgres postgres 16777216 Dec 13 15:39 0000016500000077

Um. That's so far from the values shown in pg_control that it's not funny.

This is 7.4, right? I have a crude xlog dump tool that I'll send you
off-list. We should be able to identify the latest checkpoint in the
existing XLOG files, and that will give you something to work with.

Just wanted to close the loop for the sake of the list archives. With
Tom's xlog dump tool I was able (with a bunch of his help off-list) to
identify the needed parameters for pg_resetxlog. Running pg_resetxlog
got us back a running database. We're now involved in checking the data.

Thank you to everyone for your help -- especially Tom!

Joe

#20 Michael Fuhr
mike@fuhr.org
In reply to: Joe Conway (#19)
Re: production server down

On Wed, Dec 15, 2004 at 11:41:02AM -0800, Joe Conway wrote:

Just wanted to close the loop for the sake of the list archives. With
Tom's xlog dump tool I was able (with a bunch of his help off-list) to
identify the needed parameters for pg_resetxlog. Running pg_resetxlog
got us back a running database. We're now involved in checking the data.

Any chance you could write up a summary of the thread: what caused
the problem, how you diagnosed it, how you fixed it, and how to
avoid it? Might make a useful "lessons learned" document.

--
Michael Fuhr
http://www.fuhr.org/~mfuhr/

#21 Joe Conway
mail@joeconway.com
In reply to: Bruce Momjian (#18)
#22 Joe Conway
mail@joeconway.com
In reply to: Michael Fuhr (#20)
#23 Bruce Momjian
bruce@momjian.us
In reply to: Joe Conway (#22)
#24 Alvaro Herrera
alvherre@dcc.uchile.cl
In reply to: Joe Conway (#22)
#25 Joe Conway
mail@joeconway.com
In reply to: Alvaro Herrera (#24)
#26 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#24)
#27 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joe Conway (#25)
#28 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joe Conway (#22)
#29 Joe Conway
mail@joeconway.com
In reply to: Tom Lane (#26)
#30 Andrew Dunstan
andrew@dunslane.net
In reply to: Joe Conway (#29)
#31 Joe Conway
mail@joeconway.com
In reply to: Andrew Dunstan (#30)
#32 Joe Conway
mail@joeconway.com
In reply to: Tom Lane (#27)