question about warm standby databases in 8.2.5

Started by Brett Neumeierover 18 years ago3 messagesgeneral
Jump to latest
#1Brett Neumeier
bneumeier@gmail.com

Hi,

I set up a warm standby failover system on Redhat, using built-from-source
postgresql 8.2.5 on (of course) both the master and standby systems.

The setup of the system was very easy, and the recovery script we have in
place on the standby system correctly copies in the archived WAL log files,
which are then applied.

What seems odd is what happens when we abort the continuous recovery so the
standby database becomes primary.

It seems that the recovery command always copies the source WAL file (with a
name like 00000001000000020000009C) to a file path "pg_xlog/RECOVERYXLOG",
which is fine. However, then when we abort recovery, postgresql seems to
expect that the most recent WAL log should be in pg_xlog with its original
filename, e.g., the 0....9C filename from above.

This seems broken -- if the WAL file should wind up in the pg_xlog directory
with the 0...9C name, why isn't postgresql copying it there?

Here are the log messages that show what I'm talking about. Note that
everything is fine for quite a while; then we triggered the standby database
to come online before 0...B4 was archived...and postgresql then bails out
because 0...B3 (which has already been restored) doesn't exist!

We're working around this, for now, by having the recovery command script
copy archived WAL files to the specified location pg_xlog/RECOVERYXLOG, and
also to the pg_xlog directory with the file's original basename. But that
seems awfully sloppy, and isn't the process documented in the manual.

Advice is eagerly solicited!

LOG: starting archive recovery
LOG: restore_command = "/home/pgsql/bin/recover_script.rb %f %p"
LOG: restored log file "0000000100000002000000A1.001FAD68.backup" from
archive
LOG: restored log file "0000000100000002000000A1" from archive
LOG: checkpoint record is at 2/A11FAD68
LOG: redo record is at 2/A11FAD68; undo record is at 0/0; shutdown FALSE
LOG: next transaction ID: 0/82464990; next OID: 45282
LOG: next MultiXactId: 28; next MultiXactOffset: 55
LOG: automatic recovery in progress
LOG: redo starts at 2/A11FADB0
LOG: restored log file "0000000100000002000000A2" from archive
[a bunch of similar messages omitted]
LOG: restored log file "0000000100000002000000B3" from archive
LOG: could not open file "pg_xlog/0000000100000002000000B4" (log file 2,
segment 180): No such file or directory
LOG: redo done at 2/B354BDD0
PANIC: could not open file "pg_xlog/0000000100000002000000B3" (log file 2,
segment 179): No such file or directory
LOG: startup process (PID 17604) was terminated by signal 6
LOG: aborting startup due to startup process failure
LOG: database system was interrupted while in recovery at log time
2007-12-10 16:57:42 EST
HINT: If this has occurred more than once some data may be corrupted and
you may need to choose an earlier recovery target.

Cheers,

bn

--
Brett Neumeier (bneumeier@gmail.com)

#2Greg Smith
gsmith@gregsmith.com
In reply to: Brett Neumeier (#1)
Re: question about warm standby databases in 8.2.5

On Mon, 10 Dec 2007, Brett Neumeier wrote:

It seems that the recovery command always copies the source WAL file (with a
name like 00000001000000020000009C) to a file path "pg_xlog/RECOVERYXLOG",
which is fine. However, then when we abort recovery, postgresql seems to
expect that the most recent WAL log should be in pg_xlog with its original
filename, e.g., the 0....9C filename from above.

I think your expectation for what the recovery command script you provide
does and what the server actually requires are a little mismatched.
RECOVERYXLOG is strictly a temporary file and as you've discovered the
server may want the original back again by its original name. This has
come up before--check out this thread, from this message to the end:

http://archives.postgresql.org/pgsql-admin/2007-08/msg00425.php

It's pointed out there that the documentation is little sparse in this
area.

Not that many people run into this because there is a reference
implementation of a recovery command that correctly implements the
required behavior:

http://developer.postgresql.org/pgdocs/postgres/pgstandby.html

pg_standby ships with the upcoming 8.3, but if you grab that code you can
run it just fine against an 8.2 system. You'd be better off in the long
run (and probably the short run too) replacing whatever script you've
developed with that one, so you can just get updates to it rather than
keeping one updated in-house. Consider the time you've spend working on
your own not wasted but educational--you can never know too much about
disaster recovery of your database.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

#3Simon Riggs
simon@2ndQuadrant.com
In reply to: Brett Neumeier (#1)
Re: question about warm standby databases in 8.2.5

On Mon, 2007-12-10 at 22:43 -0600, Brett Neumeier wrote:

It seems that the recovery command always copies the source WAL file
(with a name like 00000001000000020000009C) to a file path
"pg_xlog/RECOVERYXLOG", which is fine. However, then when we abort
recovery, postgresql seems to expect that the most recent WAL log
should be in pg_xlog with its original filename, e.g., the 0....9C
filename from above.

This allows the recovery to be restartable, which you'll want even if if
you haven't realised it yet.

This seems broken -- if the WAL file should wind up in the pg_xlog
directory with the 0...9C name, why isn't postgresql copying it there?

This part of the design specifically allows infinitely long recoveries.

You can only delete WAL files prior to the last restartpoint.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com