Failing to recover after panic shutdown

Started by Per Lauvåsalmost 18 years ago6 messagesgeneral
Jump to latest
#1Per Lauvås
per.lauvaas@mintra.no

Hi

I am running Postgres 8.2 on Windows 2003 server SP2.

Every now and then (2-3 times a year) our Postgres service is down and we need to manually start it. This is what we find:

In log when going down:
2008-06-02 13:40:02 PANIC: could not open file "pg_xlog/000000010000001C00000081" (log file 28, segment 129): Invalid argument

This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.
2008-06-02 13:40:02 LOG: server process (PID 5792) exited with exit code 3
2008-06-02 13:40:02 LOG: terminating any other active server processes
2008-06-02 13:40:02 WARNING: terminating connection because of crash of another server process
2008-06-02 13:40:02 DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2008-06-02 13:40:02 HINT: In a moment you should be able to reconnect to the database and repeat your command.
(WARNING, DETAIL and HINT is repeated a few times...)
2008-06-02 13:40:02 LOG: all server processes terminated; reinitializing
2008-06-02 13:40:02 LOG: database system was interrupted at 2008-06-02 13:39:39 W. Europe Daylight Time
2008-06-02 13:40:02 LOG: checkpoint record is at 1C/80F646B0
2008-06-02 13:40:02 LOG: redo record is at 1C/80F646B0; undo record is at 0/0; shutdown FALSE
2008-06-02 13:40:02 LOG: next transaction ID: 0/316291661; next OID: 44447343
2008-06-02 13:40:02 LOG: next MultiXactId: 1929; next MultiXactOffset: 4093
2008-06-02 13:40:02 LOG: database system was not properly shut down; automatic recovery in progress
2008-06-02 13:40:02 LOG: redo starts at 1C/80F64700
2008-06-02 13:40:02 LOG: unexpected pageaddr 1C/79000000 in log file 28, segment 129, offset 0
2008-06-02 13:40:02 LOG: redo done at 1C/80FFD6E0
2008-06-02 13:40:02 FATAL: the database system is starting up
2008-06-02 13:40:03 FATAL: the database system is starting up
2008-06-02 13:40:03 FATAL: the database system is starting up
2008-06-02 13:40:03 FATAL: the database system is starting up
2008-06-02 13:40:03 FATAL: could not open file "pg_xlog/000000010000001C00000081" (log file 28, segment 129): Invalid argument
2008-06-02 13:40:03 FATAL: the database system is starting up
2008-06-02 13:40:03 LOG: startup process (PID 4420) exited with exit code 1
2008-06-02 13:40:03 LOG: aborting startup due to startup process failure
2008-06-02 13:40:05 LOG: archived transaction log file "000000010000001C00000080"
2008-06-02 13:40:05 LOG: logger shutting down

In log when manually started:

2008-06-02 13:50:34 LOG: database system was shut down at 2008-06-02 13:40:03 W. Europe Daylight Time
2008-06-02 13:50:34 LOG: checkpoint record is at 1C/80FFE990
2008-06-02 13:50:34 LOG: redo record is at 1C/80FFE990; undo record is at 0/0; shutdown TRUE
2008-06-02 13:50:34 LOG: next transaction ID: 0/316292191; next OID: 44447343
2008-06-02 13:50:34 LOG: next MultiXactId: 1929; next MultiXactOffset: 4093
2008-06-02 13:50:34 LOG: database system is ready
2008-06-02 13:50:35 LOG: archived transaction log file "000000010000001C00000080"
2008-06-02 13:50:54 LOG: archived transaction log file "000000010000001C00000080"

We are archiving WAL-segments at a remote machine, and we are copying non-filled WAL-segments every 10 minutes to be able to rebuild the DB with a maximum of 10 minutes of missing data. (I don't know if that has anything to do with it).

Does anyone have a clue on what the problem could be?

Vennlig hilsen
Per Lauvås
Systemutvikler
Fax: +47 22 41 60 61
Direct: +47 24 15 55 51
Mintra as
Storgata 1
P.O. Box 8945 Youngstorget
N-0028 Oslo
Tel: +47 24 15 55 00
Fax: +47 22 41 60 61
http://www.mintra.no
Faglig forum: Helhetlig læring og kompetanse
Motta Mintras faglige e-postmagasin Helhetlig LÆRING & KOMPETANSE - gratis hver måned. Svar på denne e-posten med "HLK" i emnefeltet eller se www.mintra.no for å melde din interesse.

Hyllevare e-læringskurs på Trainingportal.no
Få tilgang til Mintras hyllevarer av e-læringskurs innen HMS, applikasjonsopplæring, prosjektledelse, teknisk opplæring med mer på www.trainingportal.no

#2Magnus Hagander
magnus@hagander.net
In reply to: Per Lauvås (#1)
Re: Failing to recover after panic shutdown

Per Lauvås wrote:

Hi

I am running Postgres 8.2 on Windows 2003 server SP2.

Every now and then (2-3 times a year) our Postgres service is down
and we need to manually start it. This is what we find:

In log when going down:
2008-06-02 13:40:02 PANIC: could not open file
"pg_xlog/000000010000001C00000081" (log file 28, segment 129):
Invalid argument

Are you by any chance running an antivirus or other "security software"
on this server?

We are archiving WAL-segments at a remote machine, and we are copying
non-filled WAL-segments every 10 minutes to be able to rebuild the DB
with a maximum of 10 minutes of missing data. (I don't know if that
has anything to do with it).

How are you copying these files? Are you saying you're actually copying
the files out of the pg_xlog directory, or are you using the
archive_command along with archive_timeout?

//Magnus

#3Per Lauvås
per.lauvaas@mintra.no
In reply to: Per Lauvås (#1)
Re: Failing to recover after panic shutdown

Yes, we are copying from pg_xlog. By doing so we let the WAL-segments fill up (not using timeout) and we are able to recover within a 10 minute interval.

Could it be that this copy operation is causing the problem?

Per

-----Original Message-----
From: Magnus Hagander [mailto:magnus@hagander.net]
Sent: 3. juni 2008 15:47
To: Per Lauvås
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Failing to recover after panic shutdown

Per Lauvås wrote:

Hi

I am running Postgres 8.2 on Windows 2003 server SP2.

Every now and then (2-3 times a year) our Postgres service is down
and we need to manually start it. This is what we find:

In log when going down:
2008-06-02 13:40:02 PANIC: could not open file
"pg_xlog/000000010000001C00000081" (log file 28, segment 129):
Invalid argument

Are you by any chance running an antivirus or other "security software"
on this server?

We are archiving WAL-segments at a remote machine, and we are copying
non-filled WAL-segments every 10 minutes to be able to rebuild the DB
with a maximum of 10 minutes of missing data. (I don't know if that
has anything to do with it).

How are you copying these files? Are you saying you're actually copying
the files out of the pg_xlog directory, or are you using the
archive_command along with archive_timeout?

//Magnus

#4Magnus Hagander
magnus@hagander.net
In reply to: Per Lauvås (#3)
Re: Failing to recover after panic shutdown

Hi!

Yes, almost certianly. Windows has major issues with more than one
process opening the same file, so it's very likely that this is your
issue. The only way you can safely get the file off the system without
affecting the running PostgreSQL instance is to use a Volume Shadow
Copy snapshot.

That said, I believe what you are trying to do is not safe even if you
do that. You can't just copy WAL segments out of there - if that was
actually safe, you wouldn't really need archive_command at all. To be
safe to just "grab files out of the $PGDATA directory" you can again
use a VSS snapshot, but that will require you to copy all of PGDATA -
both the data and the xlog directories.

Bottom line: you really should be using archive_command and
archive_timeout for this :-)

//Magnus

Per Lauvås wrote:

Show quoted text

Yes, we are copying from pg_xlog. By doing so we let the WAL-segments
fill up (not using timeout) and we are able to recover within a 10
minute interval.

Could it be that this copy operation is causing the problem?

Per

-----Original Message-----
From: Magnus Hagander [mailto:magnus@hagander.net]
Sent: 3. juni 2008 15:47
To: Per Lauvås
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Failing to recover after panic shutdown

Per Lauvås wrote:

Hi

I am running Postgres 8.2 on Windows 2003 server SP2.

Every now and then (2-3 times a year) our Postgres service is down
and we need to manually start it. This is what we find:

In log when going down:
2008-06-02 13:40:02 PANIC: could not open file
"pg_xlog/000000010000001C00000081" (log file 28, segment 129):
Invalid argument

Are you by any chance running an antivirus or other "security
software" on this server?

We are archiving WAL-segments at a remote machine, and we are
copying non-filled WAL-segments every 10 minutes to be able to
rebuild the DB with a maximum of 10 minutes of missing data. (I
don't know if that has anything to do with it).

How are you copying these files? Are you saying you're actually
copying the files out of the pg_xlog directory, or are you using the
archive_command along with archive_timeout?

//Magnus

#5Per Lauvås
per.lauvaas@mintra.no
In reply to: Per Lauvås (#1)
Re: Failing to recover after panic shutdown

Hi, and thanks for the replies!

OK. I think we will reconsider this. The backup procedure was set up a few years ago. I have personally made several point in time recoveries using this technique (for testing purposes), and it works. But I guess an undesirable side-effect is a recovery failure every now and then.

And, Magnus: The DB is producing about 25 WALs each day (I guess it will increase to at least 144 with a 10 min timeout). Do you know how often a base backup is taken out there by the average administrator? I am getting fed up with doing a new base backup each week. Could the base backup operation be automated?

And good luck with the Euro championship (if you are from Sweden).

Per

-----Original Message-----
From: Magnus Hagander [mailto:magnus@hagander.net]
Sent: 4. juni 2008 09:04
To: Per Lauvås
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Failing to recover after panic shutdown

Hi!

Yes, almost certianly. Windows has major issues with more than one
process opening the same file, so it's very likely that this is your
issue. The only way you can safely get the file off the system without
affecting the running PostgreSQL instance is to use a Volume Shadow
Copy snapshot.

That said, I believe what you are trying to do is not safe even if you
do that. You can't just copy WAL segments out of there - if that was
actually safe, you wouldn't really need archive_command at all. To be
safe to just "grab files out of the $PGDATA directory" you can again
use a VSS snapshot, but that will require you to copy all of PGDATA -
both the data and the xlog directories.

Bottom line: you really should be using archive_command and
archive_timeout for this :-)

//Magnus

Per Lauvås wrote:

Show quoted text

Yes, we are copying from pg_xlog. By doing so we let the WAL-segments
fill up (not using timeout) and we are able to recover within a 10
minute interval.

Could it be that this copy operation is causing the problem?

Per

-----Original Message-----
From: Magnus Hagander [mailto:magnus@hagander.net]
Sent: 3. juni 2008 15:47
To: Per Lauvås
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Failing to recover after panic shutdown

Per Lauvås wrote:

Hi

I am running Postgres 8.2 on Windows 2003 server SP2.

Every now and then (2-3 times a year) our Postgres service is down
and we need to manually start it. This is what we find:

In log when going down:
2008-06-02 13:40:02 PANIC: could not open file
"pg_xlog/000000010000001C00000081" (log file 28, segment 129):
Invalid argument

Are you by any chance running an antivirus or other "security
software" on this server?

We are archiving WAL-segments at a remote machine, and we are
copying non-filled WAL-segments every 10 minutes to be able to
rebuild the DB with a maximum of 10 minutes of missing data. (I
don't know if that has anything to do with it).

How are you copying these files? Are you saying you're actually
copying the files out of the pg_xlog directory, or are you using the
archive_command along with archive_timeout?

//Magnus

#6Magnus Hagander
magnus@hagander.net
In reply to: Per Lauvås (#5)
Re: Failing to recover after panic shutdown

Per Lauvås wrote:

Hi, and thanks for the replies!

OK. I think we will reconsider this. The backup procedure was set up
a few years ago. I have personally made several point in time
recoveries using this technique (for testing purposes), and it works.
But I guess an undesirable side-effect is a recovery failure every
now and then.

I'm afraid it may be one of those things that appears to work, because
it works much of the time, but once you have an actual failure, it
might not work :-( I wouldn't trust it, really.

And, Magnus: The DB is producing about 25 WALs each day (I guess it
will increase to at least 144 with a 10 min timeout). Do you know how
often a base backup is taken out there by the average administrator?
I am getting fed up with doing a new base backup each week. Could the
base backup operation be automated?

Uh, yeah, sure. You can script it (though scripting on Windows
using .bat files certainly isn't fun, you can use any kind of scripting
language to do it). A pretty nice way is actually to use VSS - you just
do a pg_start_backup() and pg_stop_backup(). If your regular backup
software uses VSS and supports pre/post scripts, you can use those to
implement it...

And good luck with the Euro championship (if you are from Sweden).

I am - thanks :-)

//Magnus