Race conditions with checkpointer and shutdown
Hi all,
This is a continuation of the following thread, but I prefer spawning
a new thread for clarity:
/messages/by-id/20190416064512.GJ2673@paquier.xyz
The buildfarm has reported two similar failures when shutting down a
node:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=piculet&dt=2019-03-23%2022%3A28%3A59
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&dt=2019-04-16%2006%3A14%3A01
In both cases, the instance cannot shut down because it times out
waiting for the shutdown checkpoint to finish, but I suspect that this
checkpoint never actually happens.
The first case involves piculet, which uses --disable-atomics and gcc 6,
in the recovery test 016_min_consistency, where we trigger a checkpoint
and then issue a fast shutdown on a standby. At this point the test
waits forever.
The second case involves dragonet, which has JIT enabled and builds
with clang. The failure is in test 009_twophase.pl, after preparing
transaction xact_009_11, when a *standby* gets restarted. Again, the
test waits forever for the instance to shut down.
The most recent commits that have touched checkpoints are 0dfe3d0e
and c6c9474a, which map roughly to the point where the failures
began to happen, so something related to standby clean shutdowns
may have broken since then.
Thanks,
--
Michael
Michael Paquier <michael@paquier.xyz> writes:
The buildfarm has reported two similar failures when shutting down a
node:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=piculet&dt=2019-03-23%2022%3A28%3A59
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&dt=2019-04-16%2006%3A14%3A01
In both cases, the instance cannot shut down because it times out,
waiting for the shutdown checkpoint to finish but I suspect that this
checkpoint actually never happens.
Hmm, I don't think that that is actually where the problem is. In
piculet's failure, the test script times out waiting for a "fast"
shutdown of the standby server, and what we see in the standby's log is
2019-03-23 22:44:12.181 UTC [9731] LOG: received fast shutdown request
2019-03-23 22:44:12.181 UTC [9731] LOG: aborting any active transactions
2019-03-23 22:44:12.181 UTC [9960] FATAL: terminating walreceiver process due to administrator command
2019-03-23 22:50:13.088 UTC [9731] LOG: received immediate shutdown request
where the last line indicates that the test script lost patience and
issued an immediate shutdown. However, in a successful run of the
test, the log looks like
2019-03-24 03:33:25.592 UTC [23816] LOG: received fast shutdown request
2019-03-24 03:33:25.592 UTC [23816] LOG: aborting any active transactions
2019-03-24 03:33:25.592 UTC [23895] FATAL: terminating walreceiver process due to administrator command
2019-03-24 03:33:25.595 UTC [23819] LOG: shutting down
2019-03-24 03:33:25.600 UTC [23816] LOG: database system is shut down
2019-03-24 03:33:25.696 UTC [23903] LOG: starting PostgreSQL 12devel on x86_64-pc-linux-gnu, compiled by gcc (Debian 8.2.0-12) 8.2.0, 64-bit
where the last line reflects restarting the server for the next test step.
So in the failure case we don't see the "shutting down" message, which
means we never got to ShutdownXLOG, so no checkpoint request was made.
Even if we had got to ShutdownXLOG, the process is just executing the
operation directly, it's not sending a signal asking some other process
to do the checkpoint; so it's hard to see how either of the commits
you mention could be involved.
I think what we need to look for is reasons why (1) the postmaster
never sends SIGUSR2 to the checkpointer, or (2) the checkpointer's
main loop doesn't get to noticing shutdown_requested.
A rather scary point for (2) is that said main loop seems to be
assuming that MyLatch a/k/a MyProc->procLatch is not used for any
other purposes in the checkpointer process. If there were something,
like say a condition variable wait, that would reset MyLatch at any
time during a checkpoint, then we could very easily go to sleep at the
bottom of the loop and not notice that there's a pending shutdown request.
Now, c6c9474aa did not break this, because the latch resets that
it added happen in other processes not the checkpointer. But I'm
feeling suspicious that some other change we made recently might've
borked it. And in general, it seems like we've managed to load a
lot of potentially conflicting roles onto process latches.
Do we need to think harder about establishing rules for multiplexed
use of the process latch? I'm imagining some rule like "if you are
not the outermost event loop of a process, you do not get to
summarily clear MyLatch. Make sure to leave it set after waiting,
if there was any possibility that it was set by something other than
the specific event you're concerned with".
regards, tom lane
On Tue, Apr 16, 2019 at 6:45 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Do we need to think harder about establishing rules for multiplexed
use of the process latch? I'm imagining some rule like "if you are
not the outermost event loop of a process, you do not get to
summarily clear MyLatch. Make sure to leave it set after waiting,
if there was any possibility that it was set by something other than
the specific event you're concerned with".
Hmm, yeah. If the latch is left set, then the outer loop will just go
through an extra and unnecessary iteration, which seems fine. If the
latch is left clear, then the outer loop might miss a wakeup intended
for it and hang forever.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,
On 2019-04-16 18:59:37 -0400, Robert Haas wrote:
On Tue, Apr 16, 2019 at 6:45 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Do we need to think harder about establishing rules for multiplexed
use of the process latch? I'm imagining some rule like "if you are
not the outermost event loop of a process, you do not get to
summarily clear MyLatch. Make sure to leave it set after waiting,
if there was any possibility that it was set by something other than
the specific event you're concerned with".
Hmm, yeah. If the latch is left set, then the outer loop will just go
through an extra and unnecessary iteration, which seems fine. If the
latch is left clear, then the outer loop might miss a wakeup intended
for it and hang forever.
Arguably that's a sign that the latch using code in the outer loop(s) isn't
written correctly? If you do it as:
while (true)
{
CHECK_FOR_INTERRUPTS();
ResetLatch(MyLatch);
if (work_needed)
{
Plenty();
Code();
Using(MyLatch);
}
else
{
WaitLatch(MyLatch);
}
}
I think that's not a danger? I think the problem really is that we
suggest doing that WaitLatch() unconditionally:
* The correct pattern to wait for event(s) is:
*
* for (;;)
* {
* ResetLatch();
* if (work to do)
* Do Stuff();
* WaitLatch();
* }
*
* It's important to reset the latch *before* checking if there's work to
* do. Otherwise, if someone sets the latch between the check and the
* ResetLatch call, you will miss it and Wait will incorrectly block.
*
* Another valid coding pattern looks like:
*
* for (;;)
* {
* if (work to do)
* Do Stuff(); // in particular, exit loop if some condition satisfied
* WaitLatch();
* ResetLatch();
* }
Obviously there's the issue that a lot of latch-using code isn't written
that way - but I also don't think there's that much latch-using code
that then also uses the latch internally. Seems like we could fix that.
While this obviously has the danger of not being followed, so does the
'always-set-latch-unless-outermost-loop' approach.
I'm not sure I like the idea of incurring another unnecessary SetLatch()
call for most latch using places.
I guess there's a bit bigger danger of taking longer to notice
postmaster-death. But I'm not sure I can quite see that being
problematic - seems like all we should incur is another cycle through
the loop, as the latch shouldn't be set anymore.
Greetings,
Andres Freund
Hi,
On 2019-04-16 17:05:36 -0700, Andres Freund wrote:
On 2019-04-16 18:59:37 -0400, Robert Haas wrote:
On Tue, Apr 16, 2019 at 6:45 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Do we need to think harder about establishing rules for multiplexed
use of the process latch? I'm imagining some rule like "if you are
not the outermost event loop of a process, you do not get to
summarily clear MyLatch. Make sure to leave it set after waiting,
if there was any possibility that it was set by something other than
the specific event you're concerned with".
Hmm, yeah. If the latch is left set, then the outer loop will just go
through an extra and unnecessary iteration, which seems fine. If the
latch is left clear, then the outer loop might miss a wakeup intended
for it and hang forever.
Arguably that's a sign that the latch using code in the outer loop(s) isn't
written correctly? If you do it as:
while (true)
{
CHECK_FOR_INTERRUPTS();
ResetLatch(MyLatch);
if (work_needed)
{
Plenty();
Code();
Using(MyLatch);
}
else
{
WaitLatch(MyLatch);
}
}
I think that's not a danger? I think the problem really is that we
suggest doing that WaitLatch() unconditionally:
* The correct pattern to wait for event(s) is:
*
* for (;;)
* {
* ResetLatch();
* if (work to do)
* Do Stuff();
* WaitLatch();
* }
*
* It's important to reset the latch *before* checking if there's work to
* do. Otherwise, if someone sets the latch between the check and the
* ResetLatch call, you will miss it and Wait will incorrectly block.
*
* Another valid coding pattern looks like:
*
* for (;;)
* {
* if (work to do)
* Do Stuff(); // in particular, exit loop if some condition satisfied
* WaitLatch();
* ResetLatch();
* }
Obviously there's the issue that a lot of latch using code isn't written
that way - but I also don't think there's that many latch using code
that then also uses latch. Seems like we could fix that. While it has
obviously dangers of not being followed, so does the
'always-set-latch-unless-outermost-loop' approach.
I'm not sure I like the idea of incurring another unnecessary SetLatch()
call for most latch using places.
I guess there's a bit bigger danger of taking longer to notice
postmaster-death. But I'm not sure I can quite see that being
problematic - seems like all we should incur is another cycle through
the loop, as the latch shouldn't be set anymore.
I think we should thus change our latch documentation to say:
something like:
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index fc995819d35..dc46dd94c5b 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -44,22 +44,31 @@
* {
* ResetLatch();
* if (work to do)
- * Do Stuff();
- * WaitLatch();
+ * DoStuff();
+ * else
+ * WaitLatch();
* }
*
* It's important to reset the latch *before* checking if there's work to
* do. Otherwise, if someone sets the latch between the check and the
* ResetLatch call, you will miss it and Wait will incorrectly block.
*
+ * The reason to only wait on the latch in case there is nothing to do is that
+ * code inside DoStuff() might use the same latch, and leave it reset, even
+ * though a SetLatch() aimed for the outer loop arrived. Which again could
+ * lead to incorrectly blocking in Wait.
+ *
* Another valid coding pattern looks like:
*
* for (;;)
* {
* if (work to do)
- * Do Stuff(); // in particular, exit loop if some condition satisfied
- * WaitLatch();
- * ResetLatch();
+ * DoStuff(); // in particular, exit loop if some condition satisfied
+ * else
+ * {
+ * WaitLatch();
+ * ResetLatch();
+ * }
* }
*
* This is useful to reduce latch traffic if it's expected that the loop's
and adapt code to match (at least in the outer loops).
Greetings,
Andres Freund
On Wed, Apr 17, 2019 at 10:45 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I think what we need to look for is reasons why (1) the postmaster
never sends SIGUSR2 to the checkpointer, or (2) the checkpointer's
main loop doesn't get to noticing shutdown_requested.
A rather scary point for (2) is that said main loop seems to be
assuming that MyLatch a/k/a MyProc->procLatch is not used for any
other purposes in the checkpointer process. If there were something,
like say a condition variable wait, that would reset MyLatch at any
time during a checkpoint, then we could very easily go to sleep at the
bottom of the loop and not notice that there's a pending shutdown request.
Agreed on the non-composability of that coding, but if there actually
is anything in that loop that can reach ResetLatch(), it's well
hidden...
--
Thomas Munro
https://enterprisedb.com
Thomas Munro <thomas.munro@gmail.com> writes:
On Wed, Apr 17, 2019 at 10:45 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I think what we need to look for is reasons why (1) the postmaster
never sends SIGUSR2 to the checkpointer, or (2) the checkpointer's
main loop doesn't get to noticing shutdown_requested.
A rather scary point for (2) is that said main loop seems to be
assuming that MyLatch a/k/a MyProc->procLatch is not used for any
other purposes in the checkpointer process. If there were something,
like say a condition variable wait, that would reset MyLatch at any
time during a checkpoint, then we could very easily go to sleep at the
bottom of the loop and not notice that there's a pending shutdown request.
Agreed on the non-composability of that coding, but if there actually
is anything in that loop that can reach ResetLatch(), it's well
hidden...
Well, it's easy to see that there's no other ResetLatch call in
checkpointer.c. It's much less obvious that there's no such
call anywhere in the code reachable from e.g. CreateCheckPoint().
To try to investigate that, I hacked things up to force an assertion
failure if ResetLatch was called from any other place in the
checkpointer process (dirty patch attached for amusement's sake).
This gets through check-world without any assertions. That does not
really prove that there aren't corner timing cases where a latch
wait and reset could happen, but it does put a big dent in my theory.
Question is, what other theory has anybody got?
regards, tom lane
Attachments:
assert-there-is-only-one.patch (text/x-diff, +10 -0)
Robert Haas <robertmhaas@gmail.com> wrote (in the other thread):
Any idea whether it's something newly-introduced or of long standing?
It's the latter. I searched the buildfarm database for failure logs
including the string "server does not shut down" within the last three
years, and got all of the hits attached. Not all of these look like
the failure pattern Michael pointed to, but enough of them do to say
that the problem has existed since at least mid-2017. To be concrete,
we have quite a sample of cases where a standby server has received a
"fast shutdown" signal and acknowledged that in its log, but it never
gets to the expected "shutting down" message, meaning it never starts
the shutdown checkpoint let alone finishes it. The oldest case that
clearly looks like that is
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=nightjar&dt=2017-06-02%2018%3A54%3A29
A significant majority of the recent cases look just like the piculet
failure Michael pointed to, that is we fail to shut down the "london"
server while it's acting as standby in the recovery/t/009_twophase.pl
test. But there are very similar failures in other tests.
I also notice that the population of machines showing the problem seems
heavily skewed towards, um, weird cases. For instance, in the set
that have shown this type of failure since January, we have
dragonet: uses JIT
francolin: --disable-spinlocks
gull: armv7
mereswine: armv7
piculet: --disable-atomics
sidewinder: amd64, but running netbsd 7 (and this was 9.6, note)
spurfowl: fairly generic amd64
This leads me to suspect that the problem is (a) some very low-level issue
in spinlocks or latches or the like, or (b) a timing problem that just
doesn't show up on generic Intel-oid platforms. The timing theory is
maybe a bit stronger given that one test case shows this more often than
others. I've not got any clear ideas beyond that.
Anyway, this is *not* new in v12.
regards, tom lane
Attachments:
server-does-not-shut-down.txt (text/plain)
On Fri, Apr 19, 2019 at 2:53 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Question is, what other theory has anybody got?
I wondered if there might be a way for PostmasterStateMachine() to be
reached without signals blocked, in the case where we fork a
fresh checkpointer, which then misses the SIGUSR2 that we
immediately send because it hasn't installed its handler yet. But I
can't see it.
This is a curious thing from dragonet's log:
2019-04-16 08:23:24.178 CEST [8335] LOG: received fast shutdown request
2019-04-16 08:23:24.178 CEST [8335] LOG: aborting any active transactions
2019-04-16 08:23:24.178 CEST [8393] FATAL: terminating walreceiver
process due to administrator command
2019-04-16 08:28:23.166 CEST [8337] LOG: restartpoint starting: time
LogCheckpointStart() is the thing that writes "starting: ...", and it
prefers to report "shutdown" over "time", but it didn't.
--
Thomas Munro
https://enterprisedb.com
Thomas Munro <thomas.munro@gmail.com> writes:
This is a curious thing from dragonet's log:
2019-04-16 08:23:24.178 CEST [8335] LOG: received fast shutdown request
2019-04-16 08:23:24.178 CEST [8335] LOG: aborting any active transactions
2019-04-16 08:23:24.178 CEST [8393] FATAL: terminating walreceiver
process due to administrator command
2019-04-16 08:28:23.166 CEST [8337] LOG: restartpoint starting: time
LogCheckpointStart() is the thing that writes "starting: ...", and it
prefers to report "shutdown" over "time", but it didn't.
Yeah, but since we don't see "shutting down", we know that the shutdown
checkpoint hasn't begun. Here's another similar case:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mereswine&dt=2018-11-30%2011%3A44%3A54
The relevant fragment of the standby server's log is
2018-11-30 05:09:22.996 PST [4229] LOG: received fast shutdown request
2018-11-30 05:09:23.628 PST [4229] LOG: aborting any active transactions
2018-11-30 05:09:23.649 PST [4231] LOG: checkpoint complete: wrote 17 buffers (13.3%); 0 WAL file(s) added, 0 removed, 1 recycled; write=3.021 s, sync=0.000 s, total=3.563 s; sync files=0, longest=0.000 s, average=0.000 s; distance=16563 kB, estimate=16563 kB
2018-11-30 05:09:23.679 PST [4229] LOG: background worker "logical replication launcher" (PID 4276) exited with exit code 1
2018-11-30 05:11:23.757 PST [4288] master LOG: unexpected EOF on standby connection
2018-11-30 05:11:23.883 PST [4229] LOG: received immediate shutdown request
2018-11-30 05:11:23.907 PST [4229] LOG: database system is shut down
To the extent that I've found logs in which the checkpointer prints
anything at all during this interval, it seems to be just quietly
plodding along with its usual business, without any hint that it's
aware of the pending shutdown request. It'd be very easy to believe
that the postmaster -> checkpointer SIGUSR2 is simply getting dropped,
or never issued.
Hmm ... actually, looking at the postmaster's logic, it won't issue
SIGUSR2 to the checkpointer until the walreceiver (if any) is gone.
And now that I think about it, several of these logs contain traces
showing that the walreceiver is still live. Like the one quoted above:
seems like the line from PID 4288 has to be from a walreceiver.
Maybe what we should be looking for is "why doesn't the walreceiver
shut down"? But the dragonet log you quote above shows the walreceiver
exiting, or at least starting to exit. Tis a puzzlement.
I'm a bit tempted to add temporary debug logging to the postmaster so
that walreceiver start/stop is recorded at LOG level. We'd have to wait
a few weeks to have any clear result from the buildfarm, but I'm not sure
how we'll get any hard data without some such measures.
regards, tom lane
On Thu, Apr 18, 2019 at 05:57:39PM -0400, Tom Lane wrote:
It's the latter. I searched the buildfarm database for failure logs
including the string "server does not shut down" within the last three
years, and got all of the hits attached. Not all of these look like
the failure pattern Michael pointed to, but enough of them do to say
that the problem has existed since at least mid-2017. To be concrete,
we have quite a sample of cases where a standby server has received a
"fast shutdown" signal and acknowledged that in its log, but it never
gets to the expected "shutting down" message, meaning it never starts
the shutdown checkpoint let alone finishes it. The oldest case that
clearly looks like that is
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=nightjar&dt=2017-06-02%2018%3A54%3A29
Interesting. I was sort of thinking about c6c3334 first but this
failed based on 9fcf670, which does not include the former.
This leads me to suspect that the problem is (a) some very low-level issue
in spinlocks or latches or the like, or (b) a timing problem that just
doesn't show up on generic Intel-oid platforms. The timing theory is
maybe a bit stronger given that one test case shows this more often than
others. I've not got any clear ideas beyond that.
Anyway, this is *not* new in v12.
Indeed. It seems to me that v12 makes the problem more likely to
appear, though, and I have to wonder whether c6c9474 is contributing, as
more cases have been popping up since mid-March.
--
Michael
On Fri, Apr 19, 2019 at 2:30 PM Michael Paquier <michael@paquier.xyz> wrote:
On Thu, Apr 18, 2019 at 05:57:39PM -0400, Tom Lane wrote:
Anyway, this is *not* new in v12.
Indeed. It seems to me that v12 makes the problem more likely to
appear, though, and I have to wonder whether c6c9474 is contributing, as
more cases have been popping up since mid-March.
Interesting, but I'm not sure how that could be. Perhaps, a
bit like the other thing that cropped up in the build farm after that
commit, removing ~200ms of needless sleeping around an earlier online
CHECKPOINT made some other pre-existing race condition more likely to
go wrong.
--
Thomas Munro
https://enterprisedb.com
Michael Paquier <michael@paquier.xyz> writes:
On Thu, Apr 18, 2019 at 05:57:39PM -0400, Tom Lane wrote:
Anyway, this is *not* new in v12.
Indeed. It seems to me that v12 makes the problem more likely to
appear, though, and I have to wonder whether c6c9474 is contributing, as
more cases have been popping up since mid-March.
Yeah. Whether that's due to a server code change, or new or modified
test cases, is unknown. But looking at my summary of buildfarm runs
that failed like this, there's a really clear breakpoint at
gull | 2018-08-24 03:27:16 | recoveryCheck | pg_ctl: server does not shut down
Since that, most of the failures with this message have been in the
recoveryCheck step. Before that, the failures were all over the
place, and now that I look closely a big fraction of them were
in bursts on particular animals, suggesting it was more about
some local problem on that animal than any real code issue.
So it might be worth groveling around in the commit logs from
last August...
regards, tom lane
Thomas Munro <thomas.munro@gmail.com> writes:
Interesting, but I'm not sure how that could be though. Perhaps, a
bit like the other thing that cropped up in the build farm after that
commit, removing ~200ms of needless sleeping around an earlier online
CHECKPOINT made some other pre-existing race condition more likely to
go wrong.
The data that we've got is entirely consistent with the idea that
there's a timing-sensitive bug that gets made more or less likely
to trigger by "unrelated" changes in test cases or server code.
regards, tom lane
On Fri, Apr 19, 2019 at 10:22 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Thomas Munro <thomas.munro@gmail.com> writes:
2019-04-16 08:23:24.178 CEST [8393] FATAL: terminating walreceiver
process due to administrator command
Maybe what we should be looking for is "why doesn't the walreceiver
shut down"? But the dragonet log you quote above shows the walreceiver
exiting, or at least starting to exit. Tis a puzzlement.
One thing I noticed about this message: if you receive SIGTERM at a
rare time when WalRcvImmediateInterruptOK is true, then that ereport()
runs directly in the signal handler context. That's not strictly
allowed, and could cause nasal demons. On the other hand, it probably
wouldn't have managed to get the FATAL message out if that was the
problem here (previously we've seen reports of signal handlers
deadlocking while trying to ereport() but they couldn't get their
message out at all, because malloc or some such was already locked in
the user context). Is there some way that the exit code could hang
*after* that due to corruption of libc resources (FILE streams,
malloc, ...)? It doesn't seem likely to me (we'd hopefully see some
more clues) but I thought I'd mention the idea.
--
Thomas Munro
https://enterprisedb.com
Thomas Munro <thomas.munro@gmail.com> writes:
On Fri, Apr 19, 2019 at 10:22 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Maybe what we should be looking for is "why doesn't the walreceiver
shut down"? But the dragonet log you quote above shows the walreceiver
exiting, or at least starting to exit. Tis a puzzlement.
... Is there some way that the exit code could hang
*after* that due to corruption of libc resources (FILE streams,
malloc, ...)? It doesn't seem likely to me (we'd hopefully see some
more clues) but I thought I'd mention the idea.
I agree it's not likely ... but that's part of the reason I was thinking
about adding some postmaster logging. Whatever we're chasing here is
"not likely", per the observed buildfarm failure rate.
regards, tom lane
Maybe what we should be looking for is "why doesn't the walreceiver
shut down"? But the dragonet log you quote above shows the walreceiver
exiting, or at least starting to exit. Tis a puzzlement.
huh ... take a look at this little stanza in PostmasterStateMachine:
if (pmState == PM_SHUTDOWN_2)
{
/*
* PM_SHUTDOWN_2 state ends when there's no other children than
* dead_end children left. There shouldn't be any regular backends
* left by now anyway; what we're really waiting for is walsenders and
* archiver.
*
* Walreceiver should normally be dead by now, but not when a fast
* shutdown is performed during recovery.
*/
if (PgArchPID == 0 && CountChildren(BACKEND_TYPE_ALL) == 0 &&
WalReceiverPID == 0)
{
pmState = PM_WAIT_DEAD_END;
}
}
I'm too tired to think through exactly what that last comment might be
suggesting, but it sure seems like it might be relevant to our problem.
If the walreceiver *isn't* dead yet, what's going to ensure that we
can move forward later?
regards, tom lane
I have spent a fair amount of time trying to replicate these failures
locally, with little success. I now think that the most promising theory
is Munro's idea in [1] that the walreceiver is hanging up during its
unsafe attempt to do ereport(FATAL) from inside a signal handler. It's
extremely plausible that that could result in a deadlock inside libc's
malloc/free, or some similar place. Moreover, if that's what's causing
it, then the windows for trouble are fixed by the length of time that
malloc might hold internal locks, which fits with the results I've gotten
that inserting delays in various promising-looking places doesn't do a
thing towards making this reproducible.
Even if that isn't the proximate cause of the current reports, it's
clearly trouble waiting to happen, and we should get rid of it.
Accordingly, see attached proposed patch. This just flushes the
"immediate interrupt" stuff in favor of making sure that
libpqwalreceiver.c will take care of any signals received while
waiting for input.
The existing code does not use PQsetnonblocking, which means that it's
theoretically at risk of blocking while pushing out data to the remote
server. In practice I think that risk is negligible because (IIUC) we
don't send very large amounts of data at one time. So I didn't bother to
change that. Note that for the most part, if that happened, the existing
code was at risk of slow response to SIGTERM anyway since it didn't have
Enable/DisableWalRcvImmediateExit around the places that send data.
My thought is to apply this only to HEAD for now; it's kind of a large
change to shove into the back branches to handle a failure mode that's
not been reported from the field. Maybe we could back-patch after we
have more confidence in it.
regards, tom lane
[1]: /messages/by-id/CA+hUKG+=1G98m61VjNS-qGboJPwdZcF+rAPu2eC4XuWRTR3UPw@mail.gmail.com
Attachments:
dont-try-to-exit-from-signal-handler-1.patch (text/x-diff, +85 -101)
On Sun, Apr 28, 2019 at 12:56 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Even if that isn't the proximate cause of the current reports, it's
clearly trouble waiting to happen, and we should get rid of it.
Accordingly, see attached proposed patch. This just flushes the
"immediate interrupt" stuff in favor of making sure that
libpqwalreceiver.c will take care of any signals received while
waiting for input.
+1
I see that we removed the code that this was modelled on back in 2015,
and in fact your patch even removes a dangling reference in a comment:
- * This is very much like what regular backends do with ImmediateInterruptOK,
The existing code does not use PQsetnonblocking, which means that it's
theoretically at risk of blocking while pushing out data to the remote
server. In practice I think that risk is negligible because (IIUC) we
don't send very large amounts of data at one time. So I didn't bother to
change that. Note that for the most part, if that happened, the existing
code was at risk of slow response to SIGTERM anyway since it didn't have
Enable/DisableWalRcvImmediateExit around the places that send data.
Right.
My thought is to apply this only to HEAD for now; it's kind of a large
change to shove into the back branches to handle a failure mode that's
not been reported from the field. Maybe we could back-patch after we
have more confidence in it.
+1
That reminds me, we should probably also clean up at least the
ereport-from-signal-handler hazard identified over in this thread:
/messages/by-id/CAEepm=10MtmKeDc1WxBM0PQM9OgtNy+RCeWqz40pZRRS3PNo5Q@mail.gmail.com
--
Thomas Munro
https://enterprisedb.com
Thomas Munro <thomas.munro@gmail.com> writes:
On Sun, Apr 28, 2019 at 12:56 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Even if that isn't the proximate cause of the current reports, it's
clearly trouble waiting to happen, and we should get rid of it.
+1
That reminds me, we should probably also clean up at least the
ereport-from-signal-handler hazard identified over in this thread:
/messages/by-id/CAEepm=10MtmKeDc1WxBM0PQM9OgtNy+RCeWqz40pZRRS3PNo5Q@mail.gmail.com
Yeah, probably. I imagine the reason we've not already seen complaints
about that is that not that many custom bgworkers exist.
I do not think we can get away with back-patching a change in that area,
though, since it'll move the goalposts about what bgworker code has to
do to cope with SIGTERMs. It might already be too late for v12, unless
we want to treat that as an Open Item.
regards, tom lane