Unduly short fuse in RequestCheckpoint
I noticed an odd buildfarm failure today:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2019-03-16%2012%3A12%3A20
of which the key bit seems to be
2019-03-16 15:20:43.835 UTC [10879304] 003_promote.pl LOG: received replication command: BASE_BACKUP LABEL 'pg_basebackup base backup' NOWAIT
2019-03-16 15:20:45.857 UTC [10879304] 003_promote.pl ERROR: could not request checkpoint because checkpointer not running
2019-03-16 15:20:47.227 UTC [61604144] LOG: received immediate shutdown request
Digging in the buildfarm archives finds seven other occurrences of the
same error in the past three months (I didn't look back further).
The cause of this error is that RequestCheckpoint will give up and fail
after just 2 seconds, which evidently is not long enough on slow or
heavily loaded machines. Since there isn't any good reason why the
checkpointer wouldn't be running, I'm inclined to swing a large hammer
and kick this timeout up to 60 seconds. Thoughts?
regards, tom lane
I wrote:
> The cause of this error is that RequestCheckpoint will give up and fail
> after just 2 seconds, which evidently is not long enough on slow or
> heavily loaded machines. Since there isn't any good reason why the
> checkpointer wouldn't be running, I'm inclined to swing a large hammer
> and kick this timeout up to 60 seconds. Thoughts?
So I had imagined this as about a 2-line patch, s/2/60/g and be done.
Looking closer, though, there are other pre-existing problems in this code:
1. As it's currently coded, the requesting process can wait for up to 2
seconds for the checkpointer to start *even if the caller did not say
CHECKPOINT_WAIT*. That seems a little bogus, and an unwanted 60-second
wait would be a lot bogus.
2. If the timeout does elapse, and we didn't have the CHECKPOINT_WAIT
flag, we just log the failure and return. When the checkpointer
ultimately does start, it will have no idea that it should set to work
right away on a checkpoint. (I wonder if this accounts for any other
of the irreproducible buildfarm failures we get on slow machines. From
the calling code's viewpoint, it'd seem like it was taking a darn long
time to perform a successfully-requested checkpoint. Given that most
checkpoint requests are non-WAIT, this seems not very nice.)
After some thought I came up with the attached proposed patch. The
basic idea here is that we record a checkpoint request by ensuring
that the shared-memory ckpt_flags word is nonzero. (It's not clear
to me that a valid request would always have at least one of the
existing flag bits set, so I just added an extra always-set bit to
guarantee this.) Then, whether the signal gets sent or not, there is
a persistent record of the request in shmem, which the checkpointer
will eventually notice. In the expected case where the problem is
that the checkpointer hasn't started just yet, it will see the flag
during its first main loop and begin a checkpoint right away.
I took out the local checkpoint_requested flag altogether.
A possible objection to this fix is that up to now, it's been possible
to trigger a checkpoint just by sending SIGINT to the checkpointer
process, without touching shmem at all. None of the core code depends
on that, and since the checkpointer's PID is difficult to find out
from "outside", it's hard to believe that anybody's got custom tooling
that depends on it, but perhaps they do. I thought about keeping the
checkpoint_requested flag to allow that to continue to work, but if
we do so then we have a race condition: the checkpointer could see the
shmem flag set and start a checkpoint, then receive the signal a moment
later and believe that that represents a second, independent request
requiring a second checkpoint. So I think we should just blow off that
hypothetical possibility and do it like this.
Comments?
regards, tom lane
Attachments:
fix-checkpoint-request-waits.patch (text/x-diff; +33 −24)
On Sun, Mar 17, 2019 at 3:41 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> So I think we should just blow off that
> hypothetical possibility and do it like this.
Makes sense to me.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company