BUG #7643: Issuing a shutdown request while server startup leads to server hang
The following bug has been logged on the website:
Bug reference: 7643
Logged by: Haribabu
Email address: haribabu.kommi@huawei.com
PostgreSQL version: 9.2.1
Operating system: Suse Linux 10.2
Description:
Problem Reproduction:
1. Add recovery.conf to the database directory.
2. Start the server
3. Issue the shutdown request
and the shutdown request timing should be such that below server logs should
print.
Log:
./postgres -D data -p 3335
LOG: database system was shut down in recovery at 2012-11-08 19:42:42 IST
LOG: entering standby mode
LOG: received fast shutdown request
LOG: consistent recovery state reached at 0/17D0700
LOG: record with zero length at 0/17D0700
Problem reproduced in 9.3 head.
I think the following things are leading to the problem:
1. The postmaster is queued up with the following signals in the order of
SIGINT (shutdown) and SIGUSR1 (Here we create bgwriter and checkpoint) and
the current state of postmaster is PM_STARTUP.
we are creating the bgwriter and checkpoint after processing the shutdown
request, because of this reason no one is there is to send the SIGTERM to
bgwriter and checkpoint.
haribabu.kommi@huawei.com writes:
Problem Reproduction:
1. Add recovery.conf to the database directory.
2. Start the server
3. Issue the shutdown request
and the shutdown request timing should be such that below server logs should
print.
Log:
./postgres -D data -p 3335
LOG: database system was shut down in recovery at 2012-11-08 19:42:42 IST
LOG: entering standby mode
LOG: received fast shutdown request
LOG: consistent recovery state reached at 0/17D0700
LOG: record with zero length at 0/17D0700
Problem reproduced in 9.3 head.
After further investigation, I can't reproduce this and I don't believe
your patch fixes it. What happens when I try this is
* postmaster gets SIGINT, sends SIGTERM to startup process
* startup process exits with exit(1)
* postmaster sees that as a startup crash and exits, per the first
test in reaper()
So the log trace I'm getting looks like
LOG: received fast shutdown request
LOG: startup process (PID 9772) exited with exit code 1
LOG: aborting startup due to startup process failure
Now, transitioning to PM_WAIT_BACKENDS state earlier, as your patch
proposes, might make the log look a bit nicer because the logic in
reaper() wouldn't think the exit was a "crash". But it's not going to
have anything to do with whether the startup process exits on the signal
or not. What seems to have happened for you is that the startup process
ignored the SIGTERM signal, but it's not at all obvious why.
We're going to need more details about how to reproduce this.
I speculate it might have something to do with the specific
restore_command you're using.
regards, tom lane
haribabu.kommi@huawei.com writes:
Problem Reproduction:
1. Add recovery.conf to the database directory.
2. Start the server
3. Issue the shutdown request
and the shutdown request timing should be such that below server logs
should
print.
Log:
./postgres -D data -p 3335
LOG: database system was shut down in recovery at 2012-11-08 19:42:42 IST
LOG: entering standby mode
LOG: received fast shutdown request
LOG: consistent recovery state reached at 0/17D0700
LOG: record with zero length at 0/17D0700
Problem reproduced in 9.3 head.
After further investigation, I can't reproduce this and I don't believe
your patch fixes it. What happens when I try this is
* postmaster gets SIGINT, sends SIGTERM to startup process
* startup process exits with exit(1)
* postmaster sees that as a startup crash and exits, per the first
test in reaper()
So the log trace I'm getting looks like
LOG: received fast shutdown request
LOG: startup process (PID 9772) exited with exit code 1
LOG: aborting startup due to startup process failure
Now, transitioning to PM_WAIT_BACKENDS state earlier, as your patch
proposes, might make the log look a bit nicer because the logic in
reaper() wouldn't think the exit was a "crash". But it's not going to
have anything to do with whether the startup process exits on the signal
or not. What seems to have happened for you is that the startup process
ignored the SIGTERM signal, but it's not at all obvious why.
We're going to need more details about how to reproduce this.
I speculate it might have something to do with the specific
restore_command you're using.
The problem occurs only when active server is restarting by just adding a
recovery.conf file to the data directory.
No need of specifying any restore command. or the standby server restart
also can lead to this problem.
The startup process sends "PMSIGNAL_RECOVERY_STARTED" to postmaster only
incase of "InArchiveRecovery" flag is enabled.
The SIGINT signal should reach postmaster before the
"PMSIGNAL_RECOVERY_STARTED" sent by the startup process.
with the following code change in the startupXlog function, the issue can
reproduce very easily.
if (InArchiveRecovery && IsUnderPostmaster)
{
PublishStartupProcessInformation();
SetForwardFsyncRequests();
kill (PostmasterPid, SIGINT);
SendPostmasterSignal(PMSIGNAL_RECOVERY_STARTED);
bgwriterLaunched = true;
}
Please let me know if I miss anything.
Regards,
Hari babu.
Hari Babu <haribabu.kommi@huawei.com> writes:
We're going to need more details about how to reproduce this.
The problem occurs only when active server is restarting by just adding a
recovery.conf file to the data directory.
Well, you can't just put an empty file there, but I eventually managed
to reproduce this with the suggested hack in xlog.c.
I think the key problem is that postmaster.c's sigusr1_handler() is
willing to start new children even after shutdown has been initiated.
I don't see any good reason for it to do that, so I think the
appropriate patch is as attached.
Changing that still leaves us with the postmaster thinking that the
eventual exit(1) of the startup process is a "crash". This is mostly
cosmetic since it still shuts down okay, but we can fix it by reversing
the order of the first two checks in reaper() --- that is, if Shutdown
is set, we should prefer that code path even if we're in PM_STARTUP
state.
I concluded that it probably wasn't a good idea to have the additional
state transition in SIGINT handling. Generally PM_STARTUP means "we're
running the startup process and nothing else", and that's useful state
info that we shouldn't throw away lightly.
regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us > writes:
I concluded that it probably wasn't a good idea to have the additional
state transition in SIGINT handling. Generally >PM_STARTUP means "we're
running the startup process and nothing else", and that's useful state info
that we shouldn't >throw away lightly.
I tested the patch, it is working fine.
I have a query regarding the changing of pmState only incase of SIGTERM but
not incase of SIGINT.
Is there any reason to handle like this? or Is it ok to modify the handling
of pmState in SIGTERM same as SIGINT?
Regards,
Hari babu.
Hari Babu <haribabu.kommi@huawei.com> writes:
I have a query regarding the changing of pmState only incase of SIGTERM but
not incase of SIGINT.
Is there any reason to handle like this? or Is it ok to modify the handling
of pmState in SIGTERM same as SIGINT?
AFAICT the "smart shutdown" code path is okay: it transitions to
PM_WAIT_READONLY state, and then PostmasterStateMachine issues SIGTERM
to the startup process and goes to PM_WAIT_BACKENDS state. Perhaps this
could be done more clearly but I'm disinclined to mess with it.
regards, tom lane