Another reason why the recovery tests take a long time
I've found another edge-case bug through investigation of unexpectedly
slow recovery test runs. It goes like this:
* While streaming from master to slave, test script shuts down master
while slave is left running. We soon restart the master, but meanwhile:
* slave's walreceiver process fails, reporting
2017-06-26 16:06:50.209 UTC [13209] LOG: replication terminated by primary server
2017-06-26 16:06:50.209 UTC [13209] DETAIL: End of WAL reached on timeline 1 at 0/3000098.
2017-06-26 16:06:50.209 UTC [13209] FATAL: could not send end-of-streaming message to primary: no COPY in progress
* slave's startup process observes that walreceiver is gone and sends
PMSIGNAL_START_WALRECEIVER to ask for a new one
* more often than you would guess, in fact nearly 100% reproducibly for
me, the postmaster receives/services the PMSIGNAL before it receives
SIGCHLD for the walreceiver. In this situation sigusr1_handler just
throws away the walreceiver start request, reasoning that the walreceiver
is already running.
* eventually, it dawns on the startup process that the walreceiver
isn't starting, and it asks for a new one. But that takes ten seconds
(WALRCV_STARTUP_TIMEOUT).
So this looks like a pretty obvious race condition in the postmaster,
which should be resolved by having it set a flag on receipt of
PMSIGNAL_START_WALRECEIVER that's cleared only when it does start a
new walreceiver. But I wonder whether it's intentional that the old
walreceiver dies in the first place. That FATAL exit looks suspiciously
like it wasn't originally-designed-in behavior.
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Hi,
On 2017-06-26 12:32:00 -0400, Tom Lane wrote:
I've found another edge-case bug through investigation of unexpectedly
slow recovery test runs. It goes like this:
* While streaming from master to slave, test script shuts down master
while slave is left running. We soon restart the master, but meanwhile:
* slave's walreceiver process fails, reporting
2017-06-26 16:06:50.209 UTC [13209] LOG: replication terminated by primary server
2017-06-26 16:06:50.209 UTC [13209] DETAIL: End of WAL reached on timeline 1 at 0/3000098.
2017-06-26 16:06:50.209 UTC [13209] FATAL: could not send end-of-streaming message to primary: no COPY in progress
* slave's startup process observes that walreceiver is gone and sends
PMSIGNAL_START_WALRECEIVER to ask for a new one
* more often than you would guess, in fact nearly 100% reproducibly for
me, the postmaster receives/services the PMSIGNAL before it receives
SIGCHLD for the walreceiver. In this situation sigusr1_handler just
throws away the walreceiver start request, reasoning that the walreceiver
is already running.
Yuck.
I've recently seen a bunch of symptoms vaguely around this, e.g. I can
atm frequently reconnect to a new backend after a
PANIC/segfault/whatnot, before the postmaster gets the message. I've not
yet figured out whether that's a kernel change, or whether some of the
more recent tinkering in postmaster.c changed this.
* eventually, it dawns on the startup process that the walreceiver
isn't starting, and it asks for a new one. But that takes ten seconds
(WALRCV_STARTUP_TIMEOUT).
So this looks like a pretty obvious race condition in the postmaster,
which should be resolved by having it set a flag on receipt of
PMSIGNAL_START_WALRECEIVER that's cleared only when it does start a
new walreceiver. But I wonder whether it's intentional that the old
walreceiver dies in the first place. That FATAL exit looks suspiciously
like it wasn't originally-designed-in behavior.
It's quite intentional afaik - I've complained about the bad error
message recently (we really shouldn't say "no COPY in progress"), but
exiting seems quite reasonable. Otherwise we'd have to add separate
retry logic into the walreceiver, so that it reconnects without a new
walreceiver being started.
- Andres
Andres Freund <andres@anarazel.de> writes:
On 2017-06-26 12:32:00 -0400, Tom Lane wrote:
... But I wonder whether it's intentional that the old
walreceiver dies in the first place. That FATAL exit looks suspiciously
like it wasn't originally-designed-in behavior.
It's quite intentional afaik - I've complained about the bad error
message recently (we really shouldn't say "no COPY in progress"), but
exiting seems quite reasonable. Otherwise we'd have to add separate
retry logic into the walreceiver, so that it reconnects without a new
walreceiver being started.
Ah, I see. I'd misinterpreted the purpose of the infinite loop in
WalReceiverMain() --- now I see that's for receiving requests from the
startup proc for different parts of the WAL stream, not for reconnecting
to the master.
regards, tom lane
On 2017-06-26 13:42:52 -0400, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
On 2017-06-26 12:32:00 -0400, Tom Lane wrote:
... But I wonder whether it's intentional that the old
walreceiver dies in the first place. That FATAL exit looks suspiciously
like it wasn't originally-designed-in behavior.
It's quite intentional afaik - I've complained about the bad error
message recently (we really shouldn't say "no COPY in progress"), but
exiting seems quite reasonable. Otherwise we'd have to add separate
retry logic into the walreceiver, so that it reconnects without a new
walreceiver being started.
Ah, I see. I'd misinterpreted the purpose of the infinite loop in
WalReceiverMain() --- now I see that's for receiving requests from the
startup proc for different parts of the WAL stream, not for reconnecting
to the master.
Right. And if the connection fails, we intentionally (whether that's
good or bad is another question) switch to restore_command (or
pg_xlog...) based recovery, in which case we certainly do not want the
walreceiver around.
- Andres
I wrote:
So this looks like a pretty obvious race condition in the postmaster,
which should be resolved by having it set a flag on receipt of
PMSIGNAL_START_WALRECEIVER that's cleared only when it does start a
new walreceiver.
Concretely, I propose the attached patch. Together with reducing
wal_retrieve_retry_interval to 500ms, which I propose having
PostgresNode::init do in its standard postgresql.conf adjustments,
this takes the runtime of the recovery TAP tests down from 2m50s
(after the patches I posted yesterday) to 1m30s.
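For reference, the PostgresNode::init change would amount to appending one setting to each test node's postgresql.conf (sketch only; the value is from the proposal above, and the default being overridden is 5s):

```
# speed up walreceiver restart in TAP tests (default 5s)
wal_retrieve_retry_interval = 500ms
```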
I think there's still gold to be mined, because "top" is still
showing pretty low CPU load over most of the run, but this is
lots better than 4m30s.
regards, tom lane
Attachments:
walreceiver-restart-fix.patch (text/x-diff, +39/-13)
On 26 June 2017 at 19:06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I wrote:
So this looks like a pretty obvious race condition in the postmaster,
which should be resolved by having it set a flag on receipt of
PMSIGNAL_START_WALRECEIVER that's cleared only when it does start a
new walreceiver.
Concretely, I propose the attached patch. Together with reducing
wal_retrieve_retry_interval to 500ms, which I propose having
PostgresNode::init do in its standard postgresql.conf adjustments,
this takes the runtime of the recovery TAP tests down from 2m50s
(after the patches I posted yesterday) to 1m30s.
Patch looks good
I think there's still gold to be mined, because "top" is still
showing pretty low CPU load over most of the run, but this is
lots better than 4m30s.
Thanks for looking into this
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services