odd output in restore mode
I have just been working on setting up a continuous recovery failover
system, and noticed some odd log lines, shown below. (Using 8.3).
First note that our parsing of recovery.conf in xlog.c is pretty bad,
and at least we need to document the quirks if it's not going to be
fixed. log_restartpoints is said to be boolean, but when I set it to an
unquoted true I got a fatal error, while a quoted 'on' sets it to false,
as seen. Ick. What is more, I apparently managed to get the recovery
server to lose a WAL file and hang totally by having a bad
recovery.conf. Triple ick.
Second, what is all this about .history files? My understanding is that
they are not necessary, so surely we should try to stat them to see if
they are present before trying to copy them. These lines are going to
confuse a lot of people, I suspect (including me).
Lastly, not quite related to this output, but in the same general area,
should we have an option on pg_standby to allow removing the archive
file after it has been restored?
cheers
andrew
LOG: database system was interrupted; last known up at 2008-05-12
15:18:23 EDT
LOG: starting archive recovery
LOG: log_restartpoints = false
LOG: restore_command = '../bin/pg_standby -t ../common_archive/failover
../common_archive %f %p %r '
cp: cannot stat `../common_archive/00000001.history': No such file or
directory
cp: cannot stat `../common_archive/00000001.history': No such file or
directory
cp: cannot stat `../common_archive/00000001.history': No such file or
directory
LOG: restored log file "0000000100000000000000A5.00000068.backup" from
archive
LOG: restored log file "0000000100000000000000A5" from archive
LOG: automatic recovery in progress
LOG: redo starts at 0/A50000B0
LOG: restored log file "0000000100000000000000A6" from archive
LOG: restored log file "0000000100000000000000A7" from archive
LOG: restored log file "0000000100000000000000A8" from archive
LOG: restored log file "0000000100000000000000A9" from archive
trigger file found
LOG: could not open file "pg_xlog/0000000100000000000000AA" (log file
0, segment 170): No such file or directory
LOG: redo done at 0/A9000068
LOG: restored log file "0000000100000000000000A9" from archive
cp: cannot stat `../common_archive/00000002.history': No such file or
directory
cp: cannot stat `../common_archive/00000002.history': No such file or
directory
cp: cannot stat `../common_archive/00000002.history': No such file or
directory
LOG: selected new timeline ID: 2
cp: cannot stat `../common_archive/00000001.history': No such file or
directory
cp: cannot stat `../common_archive/00000001.history': No such file or
directory
cp: cannot stat `../common_archive/00000001.history': No such file or
directory
LOG: archive recovery complete
LOG: database system is ready to accept connections
LOG: autovacuum launcher started
On Mon, 2008-05-12 at 16:57 -0400, Andrew Dunstan wrote:
I have just been working on setting up a continuous recovery failover
system, and noticed some odd log lines, shown below. (Using 8.3).
Hmmm, well, the first time you use something complex, there are some
surprising features, I guess. Most especially the log lines are there to
allow production issues to be diagnosed, not to create a beautiful log.
Many of the things that look somewhat strange are there for a reason,
since a wide range of options and save-your-customers-ass scenarios are
covered by the recovery code.
Suggestions for improvement are always welcome and you are welcome to
suggest doc changes, as many people do.
First note that our parsing of recovery.conf in xlog.c is pretty bad,
and at least we need to document the quirks if it's not going to be
fixed. log_restartpoints is said to be boolean, but when I set it to an
unquoted true I got a fatal error, while a quoted 'on' sets it to false,
as seen. Ick.
Yes, some improvements are definitely due there.
What is more, I apparently managed to get the recovery
server to lose a WAL file and hang totally by having a bad
recovery.conf. Triple ick.
Sounds like a bug you should report in the normal way. Correctness is
paramount. Or are you confusing the message in the log for file AA with
an error?
Second, what is all this about .history files? My understanding is that
they are not necessary, so surely we should try to stat them to see if
they are present before trying to copy them. These lines are going to
confuse a lot of people, I suspect (including me).
I try to keep it as simple as possible, since much of this code only
gets run when you really need it to work. The request for the .history
file and the cp are examples of that.
Lastly, not quite related to this output, but in the same general area,
should we have an option on pg_standby to allow removing the archive
file after it has been restored?
There already is one, but its more complex than that. (%r)
LOG: database system was interrupted; last known up at 2008-05-12
15:18:23 EDT
LOG: starting archive recovery
LOG: log_restartpoints = false
LOG: restore_command = '../bin/pg_standby -t ../common_archive/failover
../common_archive %f %p %r '
cp: cannot stat `../common_archive/00000001.history': No such file or
directory
cp: cannot stat `../common_archive/00000001.history': No such file or
directory
cp: cannot stat `../common_archive/00000001.history': No such file or
directory
LOG: restored log file "0000000100000000000000A5.00000068.backup" from
archive
LOG: restored log file "0000000100000000000000A5" from archive
LOG: automatic recovery in progress
LOG: redo starts at 0/A50000B0
LOG: restored log file "0000000100000000000000A6" from archive
LOG: restored log file "0000000100000000000000A7" from archive
LOG: restored log file "0000000100000000000000A8" from archive
LOG: restored log file "0000000100000000000000A9" from archive
trigger file found
LOG: could not open file "pg_xlog/0000000100000000000000AA" (log file
0, segment 170): No such file or directory
LOG: redo done at 0/A9000068
LOG: restored log file "0000000100000000000000A9" from archive
cp: cannot stat `../common_archive/00000002.history': No such file or
directory
cp: cannot stat `../common_archive/00000002.history': No such file or
directory
cp: cannot stat `../common_archive/00000002.history': No such file or
directory
LOG: selected new timeline ID: 2
cp: cannot stat `../common_archive/00000001.history': No such file or
directory
cp: cannot stat `../common_archive/00000001.history': No such file or
directory
cp: cannot stat `../common_archive/00000001.history': No such file or
directory
LOG: archive recovery complete
LOG: database system is ready to accept connections
LOG: autovacuum launcher started
There is an outstanding Windows issue with pg_standby that your help
would be appreciated with, shown on latest commitfest page. It's a
Windows issue and I don't maintain a Windows dev environment.
--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com
Simon Riggs wrote:
What is more, I apparently managed to get the recovery
server to lose a WAL file and hang totally by having a bad
recovery.conf. Triple ick.Sounds like a bug you should report in the normal way. Correctness is
paramount. Or are you confusing the message in the log for file AA with
an error?
No, it had to do with pg_standby waiting for a WAL file that had already
gone, somehow. I will try to reproduce it when I get a spare moment.
Second, what is all this about .history files? My understanding is that
they are not necessary, so surely we should try to stat them to see if
they are present before trying to copy them. These lines are going to
confuse a lot of people, I suspect (including me).I try to keep it as simple as possible, since much of this code only
gets run when you really need it to work. The request for the .history
file and the cp are examples of that.
I don't follow. AFAICT no .history file was in fact archived. ISTM that
in this case we should only call RestoreWALFileForRecovery if the file
in fact exists.
Lastly, not quite related to this output, but in the same general area,
should we have an option on pg_standby to allow removing the archive
file after it has been restored?There already is one, but its more complex than that. (%r)
I was using %r. But the WAL files that have been restored (according to
the log) are still in the archive dir. So it looks like %r isn't working
properly.
There is an outstanding Windows issue with pg_standby that your help
would be appreciated with, shown on latest commitfest page. It's a
Windows issue and I don't maintain a Windows dev environment.
The patch has been rejected for now, according to the Commitfest page.
Not sure what you want my help on.
BTW, none of what I reported was on Windows - it's on Linux.
cheers
andrew
On Mon, 2008-05-12 at 18:58 -0400, Andrew Dunstan wrote:
No, it had to do with pg_standby waiting for a WAL file that had already
gone, somehow. I will try to reproduce it when I get a spare moment.
Sounds like the bug I just fixed.
There is an outstanding Windows issue with pg_standby that your help
would be appreciated with, shown on latest commitfest page. It's a
Windows issue and I don't maintain a Windows dev environment.
The patch has been rejected for now, according to the Commitfest page.
Not sure what you want my help on.
Well, the patch was rejected long ago, not sure why its in this
commitfest. But its an open issue on the Windows port.
--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com
On Monday 12 May 2008 18:58:37 Andrew Dunstan wrote:
Simon Riggs wrote:
Lastly, not quite related to this output, but in the same general area,
should we have an option on pg_standby to allow removing the archive
file after it has been restored?There already is one, but its more complex than that. (%r)
I was using %r. But the WAL files that have been restored (according to
the log) are still in the archive dir. So it looks like %r isn't working
properly.
Are you sure you've moved passed the latest restart point? Just because a WAL
file has been processed doesn't mean it can be deleted.
--
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL
Simon Riggs wrote:
On Mon, 2008-05-12 at 18:58 -0400, Andrew Dunstan wrote:
No, it had to do with pg_standby waiting for a WAL file that had already
gone, somehow. I will try to reproduce it when I get a spare moment.Sounds like the bug I just fixed.
Yes, so I see. I didn't have that fix, so I'll test again with the patch.
There is an outstanding Windows issue with pg_standby that your help
would be appreciated with, shown on latest commitfest page. It's a
Windows issue and I don't maintain a Windows dev environment.The patch has been rejected for now, according to the Commitfest page.
Not sure what you want my help on.Well, the patch was rejected long ago, not sure why its in this
commitfest. But its an open issue on the Windows port.
Surely the right fix is to use the recently implemented
pgwin32_safestat() (if we aren't already - I suspect we probably are)
and remove the kluge in pg_standby.c.
cheers
andrew
Andrew Dunstan <andrew@dunslane.net> writes:
Simon Riggs wrote:
Well, the patch was rejected long ago, not sure why its in this
commitfest. But its an open issue on the Windows port.
Surely the right fix is to use the recently implemented
pgwin32_safestat() (if we aren't already - I suspect we probably are)
and remove the kluge in pg_standby.c.
I think the open issue is how to know whether pgwin32_safestat fixes the
problem that the kluge tried to work around.
regards, tom lane
Robert Treat wrote:
On Monday 12 May 2008 18:58:37 Andrew Dunstan wrote:
Simon Riggs wrote:
Lastly, not quite related to this output, but in the same general area,
should we have an option on pg_standby to allow removing the archive
file after it has been restored?There already is one, but its more complex than that. (%r)
I was using %r. But the WAL files that have been restored (according to
the log) are still in the archive dir. So it looks like %r isn't working
properly.Are you sure you've moved passed the latest restart point? Just because a WAL
file has been processed doesn't mean it can be deleted.
Thanks. It wasn't that, but when I ran with the very latest patches this
problem went away.
cheers
andrew
Tom Lane wrote:
Andrew Dunstan <andrew@dunslane.net> writes:
Simon Riggs wrote:
Well, the patch was rejected long ago, not sure why its in this
commitfest. But its an open issue on the Windows port.Surely the right fix is to use the recently implemented
pgwin32_safestat() (if we aren't already - I suspect we probably are)
and remove the kluge in pg_standby.c.I think the open issue is how to know whether pgwin32_safestat fixes the
problem that the kluge tried to work around.
Well, I think we need to consider quite a number of scenarios. The
archive directory could be local, on a remote Windows machine, or on a
remote Samba server. The archive file could be copied by Windows copy,
or Unix cp, or scp, or rsync, among others.
I'd like to know the setup that was found to produce the error, to start
with.
cheers
andrew
On Mon, 2008-05-12 at 22:40 -0400, Andrew Dunstan wrote:
Robert Treat wrote:
On Monday 12 May 2008 18:58:37 Andrew Dunstan wrote:
Simon Riggs wrote:
Lastly, not quite related to this output, but in the same general area,
should we have an option on pg_standby to allow removing the archive
file after it has been restored?There already is one, but its more complex than that. (%r)
I was using %r. But the WAL files that have been restored (according to
the log) are still in the archive dir. So it looks like %r isn't working
properly.
Are you sure you've moved passed the latest restart point? Just because a WAL
file has been processed doesn't mean it can be deleted.Thanks. It wasn't that, but when I ran with the very latest patches this
problem went away.
Thanks for testing.
--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com
On Mon, 2008-05-12 at 23:03 -0400, Andrew Dunstan wrote:
Tom Lane wrote:
Andrew Dunstan <andrew@dunslane.net> writes:
Simon Riggs wrote:
Well, the patch was rejected long ago, not sure why its in this
commitfest. But its an open issue on the Windows port.Surely the right fix is to use the recently implemented
pgwin32_safestat() (if we aren't already - I suspect we probably are)
and remove the kluge in pg_standby.c.I think the open issue is how to know whether pgwin32_safestat fixes the
problem that the kluge tried to work around.Well, I think we need to consider quite a number of scenarios. The
archive directory could be local, on a remote Windows machine, or on a
remote Samba server. The archive file could be copied by Windows copy,
or Unix cp, or scp, or rsync, among others.I'd like to know the setup that was found to produce the error, to start
with.
It's a race condition, not a deterministic bug with recreatable
conditions. My understanding is the current code was introduced to work
around the implementation of stat on Windows which says the filesize is
correct even while it is still copying it. The 1sec delay fixed that but
is clearly not a foolproof fix and introduces a delay also, which was
the original complaint.
--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com
On Tue, May 13, 2008 at 2:38 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Andrew Dunstan <andrew@dunslane.net> writes:
Simon Riggs wrote:
Well, the patch was rejected long ago, not sure why its in this
commitfest. But its an open issue on the Windows port.Surely the right fix is to use the recently implemented
pgwin32_safestat() (if we aren't already - I suspect we probably are)
and remove the kluge in pg_standby.c.I think the open issue is how to know whether pgwin32_safestat fixes the
problem that the kluge tried to work around.
Per the comments on the commitfest page, I don't believe it is.
pgwin32_safestat fixes a bug in which stat() returns stale information
(if memory serves). The hack in pg_standby was added because copy in
Windows appears to preallocate the required space for the file it's
copying, thus checking the file size to verify that the copy has
completed is not a valid test.
--
Dave Page
EnterpriseDB UK: http://www.enterprisedb.com
On Tue, 2008-05-13 at 08:42 +0100, Dave Page wrote:
On Tue, May 13, 2008 at 2:38 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Andrew Dunstan <andrew@dunslane.net> writes:
Simon Riggs wrote:
Well, the patch was rejected long ago, not sure why its in this
commitfest. But its an open issue on the Windows port.Surely the right fix is to use the recently implemented
pgwin32_safestat() (if we aren't already - I suspect we probably are)
and remove the kluge in pg_standby.c.I think the open issue is how to know whether pgwin32_safestat fixes the
problem that the kluge tried to work around.Per the comments on the commitfest page, I don't believe it is.
pgwin32_safestat fixes a bug in which stat() returns stale information
(if memory serves). The hack in pg_standby was added because copy in
Windows appears to preallocate the required space for the file it's
copying, thus checking the file size to verify that the copy has
completed is not a valid test.
Could somebody suggest and test an improvement to the Windows code, to
fix the kluge?
--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com
Simon Riggs wrote:
On Tue, 2008-05-13 at 08:42 +0100, Dave Page wrote:
On Tue, May 13, 2008 at 2:38 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Andrew Dunstan <andrew@dunslane.net> writes:
Simon Riggs wrote:
Well, the patch was rejected long ago, not sure why its in this
commitfest. But its an open issue on the Windows port.Surely the right fix is to use the recently implemented
pgwin32_safestat() (if we aren't already - I suspect we probably are)
and remove the kluge in pg_standby.c.I think the open issue is how to know whether pgwin32_safestat fixes the
problem that the kluge tried to work around.Per the comments on the commitfest page, I don't believe it is.
pgwin32_safestat fixes a bug in which stat() returns stale information
(if memory serves). The hack in pg_standby was added because copy in
Windows appears to preallocate the required space for the file it's
copying, thus checking the file size to verify that the copy has
completed is not a valid test.Could somebody suggest and test an improvement to the Windows code, to
fix the kluge?
Given what Dave says, I'm not sure there is an easy one, at least
without a lot of testing. Greg Stark's suggestion might or might not work.
However, we should probably make the behaviour switchable. If the
archive_command populating the archive_directory were rsync, for
example, this problem should not occur, because it copies to a temp
file, and then renames it, so we should never see an incomplete file
even though rsync also apparently preallocates space.
We should also document it better in the code, along the lines of Dave's
comment above.
cheers
andrew
I wrote:
However, we should probably make the behaviour switchable. If the
archive_command populating the archive_directory were rsync, for
example, this problem should not occur, because it copies to a temp
file, and then renames it, so we should never see an incomplete file
even though rsync also apparently preallocates space.
Another and probably simpler thing to try would be the GnuWin32 version
of cp. If we can verify that it behaves itself, we should probably
recommend it for use in archive_command instead of the native Windows copy.
I'm still not sure how to construct a test, though.
cheers
andrew
Andrew Dunstan wrote:
Another and probably simpler thing to try would be the GnuWin32 version
of cp. If we can verify that it behaves itself, we should probably
recommend it for use in archive_command instead of the native Windows
copy.
Perhaps use xcopy, which should be more ubiquitous?
--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera wrote:
Andrew Dunstan wrote:
Another and probably simpler thing to try would be the GnuWin32 version
of cp. If we can verify that it behaves itself, we should probably
recommend it for use in archive_command instead of the native Windows
copy.Perhaps use xcopy, which should be more ubiquitous?
I would be very surprised if xcopy did not exhibit the same
preallocating behaviour as copy.
cheers
andrew
Andrew Dunstan wrote:
I would be very surprised if xcopy did not exhibit the same
preallocating behaviour as copy.
I, on the other hand, would not say anything until someone tried it, and
then wouldn't be surprised if it behaved either way :-)
--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
On Tue, May 13, 2008 at 5:11 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
Andrew Dunstan wrote:
I would be very surprised if xcopy did not exhibit the same
preallocating behaviour as copy.I, on the other hand, would not say anything until someone tried it, and
then wouldn't be surprised if it behaved either way :-)
It pre-allocates the space as copy does. And yes, I did test :-p
--
Dave Page
EnterpriseDB UK: http://www.enterprisedb.com
Dave Page wrote:
On Tue, May 13, 2008 at 5:11 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:Andrew Dunstan wrote:
I would be very surprised if xcopy did not exhibit the same
preallocating behaviour as copy.I, on the other hand, would not say anything until someone tried it, and
then wouldn't be surprised if it behaved either way :-)It pre-allocates the space as copy does. And yes, I did test :-p
Dave,
I don't know how you tested, but could you please repeat the test with
GnuWin32's cp.exe? If it doesn't preallocate the space then I think our
way forward is reasonably clear:
. we recommend its use for Windows archive_command settings
. we provide the delay kluge as switchable behaviour on Windows instead
of having it always on.
cheers
andrew