pg_stop_backup does not complete
Simon, Fujii, All:
While demoing HS/SR at SCALE, I ran into a problem which is likely to be
a commonly encountered bug when people first set up HS/SR. Here's the
sequence:
1) Set up a brand new master with an archive_command and archive_mode=on.
2) Start the master
3) Do a pg_start_backup()
4) Realize, based on log error messages, that I've misconfigured the
archive_command.
5) Attempt to shut down the master. Master tells me that pg_stop_backup
must be run in order to shut down.
6) Execute pg_stop_backup.
7) pg_stop_backup waits forever without ever stopping backup. Every 60
seconds, it gives me a helpful "still waiting" message, but at least in
the amount of time I was willing to wait (5 minutes), it never completed.
8) Do an immediate shutdown, as it's the only way I can get the database
unstuck.
With some experimentation, the problem seems to occur when you have a
failing archive_command and a master which currently has no database
traffic; for example, if I did some database write activity (a createdb)
then pg_stop_backup would complete after about 60 seconds (which, btw,
is extremely annoying, but at least tolerable).
This issue is 100% reproducible.
--Josh Berkus
This issue is 100% reproducible.
Oh, btw, this is on Alpha4.
--Josh Berkus
On Tue, 2010-02-23 at 09:45 -0800, Josh Berkus wrote:
Simon, Fujii, All:
While demoing HS/SR at SCALE, I ran into a problem which is likely to be
a commonly encountered bug when people first set up HS/SR. Here's the
sequence:
1) Set up a brand new master with an archive_command and archive_mode=on.
2) Start the master
3) Do a pg_start_backup()
4) Realize, based on log error messages, that I've misconfigured the
archive_command.
5) Attempt to shut down the master. Master tells me that pg_stop_backup
must be run in order to shut down.
If I issue a shutdown, PostgreSQL should do whatever it needs to do to
shutdown; including issuing a pg_stop_backup.
Joshua D. Drake
--
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564
Consulting, Training, Support, Custom Development, Engineering
Respect is earned, not gained through arbitrary and repetitive use or Mr. or Sir.
"Joshua D. Drake" <jd@commandprompt.com> wrote:
If I issue a shutdown, PostgreSQL should do whatever it needs to
do to shutdown; including issuing a pg_stop_backup.
Should we have a pg_fail_backup function, so that it doesn't put out
a file which suggests that we have a complete backup?
-Kevin
On Tue, 2010-02-23 at 09:45 -0800, Josh Berkus wrote:
1) Set up a brand new master with an archive_command and archive_mode=on.
2) Start the master
3) Do a pg_start_backup()
4) Realize, based on log error messages, that I've misconfigured the
archive_command.
5) Attempt to shut down the master. Master tells me that pg_stop_backup
must be run in order to shut down.
6) Execute pg_stop_backup.
7) pg_stop_backup waits forever without ever stopping backup. Every 60
seconds, it gives me a helpful "still waiting" message, but at least in
the amount of time I was willing to wait (5 minutes), it never completed.
8) Do an immediate shutdown, as it's the only way I can get the database
unstuck.
With some experimentation, the problem seems to occur when you have a
failing archive_command and a master which currently has no database
traffic; for example, if I did some database write activity (a createdb)
then pg_stop_backup would complete after about 60 seconds (which, btw,
is extremely annoying, but at least tolerable).
This issue is 100% reproducible.
IMHO there is no problem with that behaviour. If somebody requests a
backup then we should wait for it to complete. Kevin's suggestion of
pg_fail_backup() is the only sensible conclusion there because it gives
an explicit way out of deadlock.
ISTM the problem is that you didn't test. Steps 3 and 4 should have been
reversed. Perhaps we should put something in the docs to say "and test".
The correct resolution is to put in an archive_command that works.
We can put in an extra step to prevent a pg_start_backup() if there are
a significant number of outstanding files to be archived. Doing that
seems like closing the door after the horse has bolted, since we just
introduced streaming replication that doesn't rely on archived files. In
any case, I don't see many people working on a production system hitting
a problem on an archive_command and then deciding to shut down.
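A check along those lines could be sketched as follows. This is a minimal illustration, not PostgreSQL code: it assumes the archiver's status directory (pg_xlog/archive_status in releases of this era) holds one `.ready` file per WAL segment still waiting to be archived, and the function names and the `max_pending` threshold are hypothetical.

```python
import os


def count_pending_wal(archive_status_dir):
    """Count WAL segments still waiting to be archived (their *.ready files)."""
    return sum(1 for name in os.listdir(archive_status_dir)
               if name.endswith(".ready"))


def safe_to_start_backup(archive_status_dir, max_pending=2):
    """Return True if the archiver backlog is small enough to start a backup.

    max_pending is an arbitrary illustrative threshold, not a server setting.
    """
    return count_pending_wal(archive_status_dir) <= max_pending
```

A growing count of `.ready` files is usually the first visible sign that archive_command is failing, so the same count is also useful as a manual check before running pg_start_backup().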
So I don't see this as something that needs fixing for 9.0. There is
already too much non-essential code there, all of which needs to be
tested. I don't think adding in new corner cases to "help" people makes
any sense until we have automated testing that allows us to rerun the
regression tests to check all this stuff still works.
--
Simon Riggs www.2ndQuadrant.com
On Tue, 2010-02-23 at 18:58 +0000, Simon Riggs wrote:
On Tue, 2010-02-23 at 09:45 -0800, Josh Berkus wrote:
This issue is 100% reproducible.
IMHO there is no problem with that behaviour. If somebody requests a
backup then we should wait for it to complete. Kevin's suggestion of
pg_fail_backup() is the only sensible conclusion there because it gives
an explicit way out of deadlock.
ISTM the problem is that you didn't test. Steps 3 and 4 should have been
reversed. Perhaps we should put something in the docs to say "and test".
The correct resolution is to put in an archive_command that works.
The problem isn't that it is a bad archive_command; it is that
PostgreSQL has no way to deal with this gracefully. Yes, people should
test, but are we dealing with the real world or not?
So I don't see this as something that needs fixing for 9.0. There is
already too much non-essential code there, all of which needs to be
tested. I don't think adding in new corner cases to "help" people makes
any sense until we have automated testing that allows us to rerun the
regression tests to check all this stuff still works.
This will bite us if we release like this.
Joshua D. Drake
Simon Riggs <simon@2ndQuadrant.com> wrote:
The correct resolution is to put in an archive_command that works.
One really should ensure that WAL files (or should I now say data?
;-) are flowing before running the pg_start_backup()
function. The documentation has always been pretty explicit about
that:
http://www.postgresql.org/docs/8.4/interactive/continuous-archiving.html
| 24.3.2. Making a Base Backup
|
| The procedure for making a base backup is relatively simple:
|
| 1. Ensure that WAL archiving is enabled and working.
|
| 2. Connect to the database as a superuser, and issue the command:
|
| SELECT pg_start_backup('label');
| ...
As long as the SR documentation is equally explicit on this point,
you'd have to be blatantly going against the instructions to hit
this.
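One way to act on step 1 is to dry-run the archive_command by hand before calling pg_start_backup(). A minimal sketch in Python, assuming a POSIX shell; the helper name and the dummy file name are made up for illustration, and only the %p (path), %f (file name) and %% placeholders are handled:

```python
import os
import re
import subprocess
import tempfile

DUMMY_NAME = "archive_test_segment"  # hypothetical name, not a real WAL segment


def archive_command_works(archive_command):
    """Run archive_command once on a dummy file; True if it exits with status 0."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, DUMMY_NAME)
        with open(path, "wb") as f:
            f.write(b"dummy WAL contents")
        # Substitute the placeholders: %p is the full path, %f the file name,
        # and %% a literal percent sign.
        values = {"p": path, "f": DUMMY_NAME, "%": "%"}
        command = re.sub(r"%([pf%])", lambda m: values[m.group(1)],
                         archive_command)
        # The command is run through a shell, as the server would run it.
        return subprocess.run(command, shell=True).returncode == 0
```

A successful run leaves the dummy file in the archive destination, so it should be removed afterwards. A command that fails here would also have failed after pg_start_backup(), which is exactly the situation described at the top of the thread.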
Which makes me think that while pg_fail_backup() might actually be a
good idea, it's not really needed to solve this, so it's 9.1
material at best.
-Kevin
On Tue, Feb 23, 2010 at 06:58:22PM +0000, Simon Riggs wrote:
On Tue, 2010-02-23 at 09:45 -0800, Josh Berkus wrote:
1) Set up a brand new master with an archive_command and
archive_mode=on.
2) Start the master
3) Do a pg_start_backup()
4) Realize, based on log error messages, that I've misconfigured
the archive_command.
5) Attempt to shut down the master. Master tells me that
pg_stop_backup must be run in order to shut down.
6) Execute pg_stop_backup.
7) pg_stop_backup waits forever without ever stopping backup.
Every 60 seconds, it gives me a helpful "still waiting" message, but
at least in the amount of time I was willing to wait (5 minutes),
it never completed.
8) Do an immediate shutdown, as it's the only way I can get the
database unstuck.
With some experimentation, the problem seems to occur when you
have a failing archive_command and a master which currently has no
database traffic; for example, if I did some database write
activity (a createdb) then pg_stop_backup would complete after
about 60 seconds (which, btw, is extremely annoying, but at least
tolerable).
This issue is 100% reproducible.
IMHO there is no problem with that behaviour. If somebody requests a
backup then we should wait for it to complete. Kevin's suggestion of
pg_fail_backup() is the only sensible conclusion there because it
gives an explicit way out of deadlock.
ISTM the problem is that you didn't test. Steps 3 and 4 should have
been reversed. Perhaps we should put something in the docs to say
"and test". The correct resolution is to put in an archive_command
that works.
+1 for clarifying and extending the docs.
Cheers,
David.
--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics
Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
On Tue, 2010-02-23 at 11:24 -0800, Joshua D. Drake wrote:
This will bite us if we release like this.
No it won't. The current behaviour was put there by user request a few
releases back. This isn't a 9.0 issue, and as I've said it's addressing
something that we no longer see as mainstream going forwards.
There are plenty of things that will bite us, but not this.
--
Simon Riggs www.2ndQuadrant.com
On Tue, Feb 23, 2010 at 12:52 PM, Joshua D. Drake <jd@commandprompt.com> wrote:
On Tue, 2010-02-23 at 09:45 -0800, Josh Berkus wrote:
Simon, Fujii, All:
While demoing HS/SR at SCALE, I ran into a problem which is likely to be
a commonly encountered bug when people first set up HS/SR. Here's the
sequence:
1) Set up a brand new master with an archive_command and archive_mode=on.
2) Start the master
3) Do a pg_start_backup()
4) Realize, based on log error messages, that I've misconfigured the
archive_command.
5) Attempt to shut down the master. Master tells me that pg_stop_backup
must be run in order to shut down.
If I issue a shutdown, PostgreSQL should do whatever it needs to do to
shutdown; including issuing a pg_stop_backup.
Maybe. But for sure, if it doesn't, and instead tells the user to
issue pg_stop_backup(), then pg_stop_backup() had better WORK when the
user tries to execute it. I gather that the problem is that it has to
finish all that outstanding archiving before shutting down, in which
case Kevin's suggestion of having a command to abort the backup seems
reasonable. I might call it pg_abort_backup() rather than
pg_fail_backup(), but...
...Robert
On Tue, 2010-02-23 at 14:49 -0500, Robert Haas wrote:
If I issue a shutdown, PostgreSQL should do whatever it needs to do to
shutdown; including issuing a pg_stop_backup.
Maybe. But for sure, if it doesn't, and instead tells the user to
issue pg_stop_backup(), then pg_stop_backup() had better WORK when the
user tries to execute it.
Right.
I gather that the problem is that it has to
finish all that outstanding archiving before shutting down, in which
case Kevin's suggestion of having a command to abort the backup seems
reasonable. I might call it pg_abort_backup() rather than
pg_fail_backup(), but...
But...?
Joshua D. Drake
On Tue, Feb 23, 2010 at 3:09 PM, Joshua D. Drake <jd@commandprompt.com> wrote:
On Tue, 2010-02-23 at 14:49 -0500, Robert Haas wrote:
If I issue a shutdown, PostgreSQL should do whatever it needs to do to
shutdown; including issuing a pg_stop_backup.
Maybe. But for sure, if it doesn't, and instead tells the user to
issue pg_stop_backup(), then pg_stop_backup() had better WORK when the
user tries to execute it.
Right.
I gather that the problem is that it has to
finish all that outstanding archiving before shutting down, in which
case Kevin's suggestion of having a command to abort the backup seems
reasonable. I might call it pg_abort_backup() rather than
pg_fail_backup(), but...
But...?
But it seems like a good idea other than that.
...Robert
On Wed, Feb 24, 2010 at 4:49 AM, Robert Haas <robertmhaas@gmail.com> wrote:
Maybe. But for sure, if it doesn't, and instead tells the user to
issue pg_stop_backup(), then pg_stop_backup() had better WORK when the
user tries to execute it. I gather that the problem is that it has to
finish all that outstanding archiving before shutting down, in which
case Kevin's suggestion of having a command to abort the backup seems
reasonable. I might call it pg_abort_backup() rather than
pg_fail_backup(), but...
Or how about adding a new boolean parameter to pg_stop_backup() that
specifies whether it should wait for WAL archiving?
pg_stop_backup([wait boolean])
This parameter is optional. If true (default), it waits for archiving.
In warm-standby and SR, we don't need to wait for archiving before starting
the standby from the base backup. So pg_stop_backup(false) would be
useful for speeding up the setup of log-shipping.
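The proposed semantics can be illustrated with a toy model. This is not PostgreSQL code; `segment_archived`, `max_polls` and `sleep` are hypothetical parameters added only so the sketch is self-contained and testable:

```python
import time


def stop_backup(segment_archived, wait=True, poll_interval=1.0,
                max_polls=None, sleep=time.sleep):
    """Toy model of the proposed pg_stop_backup([wait boolean]).

    segment_archived: callable returning True once the last WAL segment
    needed by the backup has been archived.
    Returns True if archiving was confirmed, False otherwise.
    """
    if not wait:
        # wait=false: report the current state and return immediately.
        return segment_archived()
    polls = 0
    while not segment_archived():
        if max_polls is not None and polls >= max_polls:
            # Illustrative escape hatch; with max_polls=None this loops
            # forever, like the behaviour reported at the top of the thread.
            return False
        sleep(poll_interval)
        polls += 1
    return True
```

With `wait=True` and a permanently failing archiver, the loop never ends unless `max_polls` is set, which is the hang being discussed; `wait=False` returns at once regardless of the archiver's state.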
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On 2/23/10 10:58 AM, Simon Riggs wrote:
So I don't see this as something that needs fixing for 9.0. There is
already too much non-essential code there, all of which needs to be
tested. I don't think adding in new corner cases to "help" people makes
any sense until we have automated testing that allows us to rerun the
regression tests to check all this stuff still works.
So, you're going to personally field the roughly 10,000 bug reports we
get on pgsql-general about this behaviour? 24/7?
The fact that we ran into this issue on the *first* day of testing the
new alpha4 is indicative of how common it will be -- it is not a corner
case, it is a common setup error which will affect probably 20% of new
users who try 9.0. And new users are going to panic when they can't
shut down postgresql, not just test for issues.
Any situation where postgresql cannot be safely shut down because of a
common setup mistake (typoing an archive_command) is, IMNSHO, not
something we can release with.
--Josh Berkus
On Tue, 2010-02-23 at 17:46 -0800, Josh Berkus wrote:
On 2/23/10 10:58 AM, Simon Riggs wrote:
So I don't see this as something that needs fixing for 9.0. There is
already too much non-essential code there, all of which needs to be
tested. I don't think adding in new corner cases to "help" people makes
any sense until we have automated testing that allows us to rerun the
regression tests to check all this stuff still works.
So, you're going to personally field the roughly 10,000 bug reports we
get on pgsql-general about this behaviour? 24/7?
The fact that we ran into this issue on the *first* day of testing the
new alpha4 is indicative of how common it will be -- it is not a corner
case, it is a common setup error which will affect probably 20% of new
users who try 9.0. And new users are going to panic when they can't
shut down postgresql, not just test for issues.
Any situation where postgresql cannot be safely shut down because of a
common setup mistake (typoing an archive_command) is, IMNSHO, not
something we can release with.
It's not a common setup mistake. Nothing changed in this release and
this has never been reported before.
The behaviour to wait for pg_stop_backup() was added by user request.
The behaviour for shutdown to wait for pg_stop_backup() was also added
by user request.
Your mistake was not typoing an archive_command, it was not correctly
testing that what you had done was actually working. The fix is to read
the manual and correct the typo. Shutting down the server after failing
to configure it is not likely to be a normal reaction to experiencing an
error in configuration. Better docs might help you, but I doubt it.
ISTM you should collect test reports, then analyse and prioritise them.
This rates pretty low for me: low severity, low frequency.
(If new users panic when they can't shut down the server, they
probably won't like smart shutdown very much either.)
--
Simon Riggs www.2ndQuadrant.com
Simon,
It's not a common setup mistake. Nothing changed in this release and
this has never been reported before.
The behaviour to wait for pg_stop_backup() was added by user request.
The behaviour for shutdown to wait for pg_stop_backup() was also added
by user request.
Your two statements above contradict each other.
And, while it makes sense for smart shutdown to wait for
pg_stop_backup(), it does not make sense for fast shutdown to wait.
Aside from that, the main issue is not having shutdown wait for
pg_stop_backup; it's pg_stop_backup never completing. An issue, I'll
note, you're ignoring. If you're going to be this defensive whenever
anyone reports a bug, it's going to be veeeeeeery slow going to
troubleshoot HS.
As Robert Haas said: "But for sure, if it doesn't, and instead tells the
user to issue pg_stop_backup(), then pg_stop_backup() had better WORK
when the user tries to execute it."
Your mistake was not typoing an archive_command, it was not correctly
testing that what you had done was actually working. The fix is to read
the manual and correct the typo. Shutting down the server after failing
to configure it is not likely to be a normal reaction to experiencing an
error in configuration.
The problem is you're thinking of an experienced PostgreSQL DBA doing
setup on a production server. That's not what I'm talking about. I'm
talking about the thousands of new users who are going to try PostgreSQL
for the first time because of HS/SR on a test installation. If they
encounter this issue, they will decide (again) that PostgreSQL is too
hard to use and give up on us for another 5 years.
We've spent the last few years overcoming the image of PostgreSQL being
too complicated for most people to use. You seem hell-bent on restoring
it. Given the timing, our project has one chance to establish a new
reputation as the SQL database for everybody. User-hostile behavior
like this will ruin that chance.
Saying "RTFM and test, you newbie!" is not a valid response, and that's
what your "you should have read the docs" amounts to. Heck, I *did*
read the docs.
ISTM you should collect test reports, then analyse and prioritise them.
This rates pretty low for me: low severity, low frequency.
To date, I, Robert Haas, Joe Conway, Josh Drake, and the members of
LAPUG all find this highly problematic behavior. So consider it 6
problem reports, not just one.
--Josh Berkus
On Wed, 2010-02-24 at 10:07 -0800, Josh Berkus wrote:
Simon,
Your mistake was not typoing an archive_command, it was not correctly
testing that what you had done was actually working. The fix is to read
the manual and correct the typo. Shutting down the server after failing
to configure it is not likely to be a normal reaction to experiencing an
error in configuration.
The problem is you're thinking of an experienced PostgreSQL DBA doing
setup on a production server. That's not what I'm talking about. I'm
talking about the thousands of new users who are going to try PostgreSQL
for the first time because of HS/SR on a test installation. If they
encounter this issue, they will decide (again) that PostgreSQL is too
hard to use and give up on us for another 5 years.
Shoot, forget the "new users"; I am thinking about the hundreds of
thousands of existing NOT DBA users, e.g. 90% of our user base.
Saying "RTFM and test, you newbie!" is not a valid response, and that's
what your "you should have read the docs" amounts to. Heck, I *did*
read the docs.
Agreed. Although RTFM is important, we shouldn't have RTFM for something
that is clearly a user visible behavior mistake on our part.
ISTM you should collect test reports, then analyse and prioritise them.
This rates pretty low for me: low severity, low frequency.
To date, I, Robert Haas, Joe Conway, Josh Drake, and the members of
LAPUG all find this highly problematic behavior. So consider it 6
problem reports, not just one.
Basically the reports boil down to people who are actually going to be
dealing with this in the field. Simon, with respect, you have been 6 feet
deep in code for too long on this. You need to step back and take some
constructive feedback from those who are dealing with the production
issues, and do so with a smile.
Sincerely,
Joshua D. Drake
On Wed, Feb 24, 2010 at 1:07 PM, Josh Berkus <josh@agliodbs.com> wrote:
And, while it makes sense for smart shutdown to wait for
pg_stop_backup(), it does not make sense for fast shutdown to wait.
TFM in fact says:
http://www.postgresql.org/docs/8.4/static/app-pg-ctl.html#APP-PG-CTL-DESCRIPTION
In stop mode, the server that is running in the specified data
directory is shut down. Three different shutdown methods can be
selected with the -m option: "Smart" mode waits for online backup mode
to finish and all the clients to disconnect. This is the default.
"Fast" mode does not wait for clients to disconnect and will terminate
an online backup in progress. All active transactions are rolled back
and clients are forcibly disconnected, then the server is shut down.
"Immediate" mode will abort all server processes without a clean
shutdown. This will lead to a recovery run on restart.
Your OP was not too clear about whether you tried a smart shutdown or
a fast shutdown, but if you meant a fast shutdown, this is apparently
(he says without testing) a regression.
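For reference, the three modes map onto `pg_ctl stop -m`. A small sketch that only builds the argument list (it does not run pg_ctl); the helper name is hypothetical and the one-line summaries paraphrase the 8.4 text quoted above:

```python
# Shutdown modes as described in the 8.4 pg_ctl documentation quoted above.
SHUTDOWN_MODES = {
    "smart": "waits for online backup mode to finish and clients to disconnect",
    "fast": "does not wait for clients; terminates an online backup in progress",
    "immediate": "aborts all server processes without a clean shutdown",
}


def pg_ctl_stop_args(data_dir, mode="smart"):
    """Build the argv for `pg_ctl stop`; smart is the documented default."""
    if mode not in SHUTDOWN_MODES:
        raise ValueError("unknown shutdown mode: %r" % (mode,))
    return ["pg_ctl", "stop", "-D", data_dir, "-m", mode]
```

For example, `pg_ctl_stop_args("/path/to/data", "fast")` yields the command that, per the documentation, should terminate an online backup rather than wait for pg_stop_backup().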
...Robert
Your OP was not too clear about whether you tried a smart shutdown or
a fast shutdown, but if you meant a fast shutdown, this is apparently
(he says without testing) a regression.
Ah, sorry. Yes, I attempted a fast shutdown.
--Josh Berkus
Josh Berkus wrote:
And, while it makes sense for smart shutdown to wait for
pg_stop_backup(), it does not make sense for fast shutdown to wait.
Hang on, fast shutdown does *not* wait for backup to finish.
Aside from that, the main issue is not having shutdown wait for
pg_stop_backup; it's pg_stop_backup never completing. An issue, I'll
note, you're ignoring.
Ahh, that's a detail I missed too.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com