Another reason why the recovery tests take a long time

Started by Tom Lane, over 8 years ago, 6 messages
#1Tom Lane
tgl@sss.pgh.pa.us

I've found another edge-case bug through investigation of unexpectedly
slow recovery test runs. It goes like this:

* While streaming from master to slave, test script shuts down master
while slave is left running. We soon restart the master, but meanwhile:

* slave's walreceiver process fails, reporting

2017-06-26 16:06:50.209 UTC [13209] LOG: replication terminated by primary server
2017-06-26 16:06:50.209 UTC [13209] DETAIL: End of WAL reached on timeline 1 at 0/3000098.
2017-06-26 16:06:50.209 UTC [13209] FATAL: could not send end-of-streaming message to primary: no COPY in progress

* slave's startup process observes that walreceiver is gone and sends
PMSIGNAL_START_WALRECEIVER to ask for a new one

* more often than you would guess, in fact nearly 100% reproducibly for
me, the postmaster receives/services the PMSIGNAL before it receives
SIGCHLD for the walreceiver. In this situation sigusr1_handler just
throws away the walreceiver start request, reasoning that the walreceiver
is already running.

* eventually, it dawns on the startup process that the walreceiver
isn't starting, and it asks for a new one. But that takes ten seconds
(WALRCV_STARTUP_TIMEOUT).

So this looks like a pretty obvious race condition in the postmaster,
which should be resolved by having it set a flag on receipt of
PMSIGNAL_START_WALRECEIVER that's cleared only when it does start a
new walreceiver. But I wonder whether it's intentional that the old
walreceiver dies in the first place. That FATAL exit looks suspiciously
like it wasn't originally-designed-in behavior.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#1)
Re: Another reason why the recovery tests take a long time

Hi,

On 2017-06-26 12:32:00 -0400, Tom Lane wrote:

> I've found another edge-case bug through investigation of unexpectedly
> slow recovery test runs. It goes like this:
>
> * While streaming from master to slave, test script shuts down master
> while slave is left running. We soon restart the master, but meanwhile:
>
> * slave's walreceiver process fails, reporting
>
> 2017-06-26 16:06:50.209 UTC [13209] LOG: replication terminated by primary server
> 2017-06-26 16:06:50.209 UTC [13209] DETAIL: End of WAL reached on timeline 1 at 0/3000098.
> 2017-06-26 16:06:50.209 UTC [13209] FATAL: could not send end-of-streaming message to primary: no COPY in progress
>
> * slave's startup process observes that walreceiver is gone and sends
> PMSIGNAL_START_WALRECEIVER to ask for a new one
>
> * more often than you would guess, in fact nearly 100% reproducibly for
> me, the postmaster receives/services the PMSIGNAL before it receives
> SIGCHLD for the walreceiver. In this situation sigusr1_handler just
> throws away the walreceiver start request, reasoning that the walreceiver
> is already running.

Yuck.

I've recently seen a bunch of symptoms vaguely around this, e.g. I can
atm frequently reconnect to a new backend after a
PANIC/segfault/whatnot, before the postmaster gets the message. I've not
yet figured out whether that's a kernel change, or whether some of the
more recent tinkering in postmaster.c changed this.

> * eventually, it dawns on the startup process that the walreceiver
> isn't starting, and it asks for a new one. But that takes ten seconds
> (WALRCV_STARTUP_TIMEOUT).
>
> So this looks like a pretty obvious race condition in the postmaster,
> which should be resolved by having it set a flag on receipt of
> PMSIGNAL_START_WALRECEIVER that's cleared only when it does start a
> new walreceiver. But I wonder whether it's intentional that the old
> walreceiver dies in the first place. That FATAL exit looks suspiciously
> like it wasn't originally-designed-in behavior.

It's quite intentional afaik - I've complained about the bad error
message recently (we really shouldn't say "no COPY in progress"), but
exiting seems quite reasonable. Otherwise we'd have to add separate
retry logic into the walreceiver, so that it reconnects without a new
walreceiver being started.

- Andres


#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#2)
Re: Another reason why the recovery tests take a long time

Andres Freund <andres@anarazel.de> writes:

> On 2017-06-26 12:32:00 -0400, Tom Lane wrote:
>> ... But I wonder whether it's intentional that the old
>> walreceiver dies in the first place. That FATAL exit looks suspiciously
>> like it wasn't originally-designed-in behavior.
>
> It's quite intentional afaik - I've complained about the bad error
> message recently (we really shouldn't say "no COPY in progress"), but
> exiting seems quite reasonable. Otherwise we'd have to add separate
> retry logic into the walreceiver, so that it reconnects without a new
> walreceiver being started.

Ah, I see. I'd misinterpreted the purpose of the infinite loop in
WalReceiverMain() --- now I see that's for receiving requests from the
startup proc for different parts of the WAL stream, not for reconnecting
to the master.

regards, tom lane


#4Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#3)
Re: Another reason why the recovery tests take a long time

On 2017-06-26 13:42:52 -0400, Tom Lane wrote:

> Andres Freund <andres@anarazel.de> writes:
>> On 2017-06-26 12:32:00 -0400, Tom Lane wrote:
>>> ... But I wonder whether it's intentional that the old
>>> walreceiver dies in the first place. That FATAL exit looks suspiciously
>>> like it wasn't originally-designed-in behavior.
>
>> It's quite intentional afaik - I've complained about the bad error
>> message recently (we really shouldn't say "no COPY in progress"), but
>> exiting seems quite reasonable. Otherwise we'd have to add separate
>> retry logic into the walreceiver, so that it reconnects without a new
>> walreceiver being started.
>
> Ah, I see. I'd misinterpreted the purpose of the infinite loop in
> WalReceiverMain() --- now I see that's for receiving requests from the
> startup proc for different parts of the WAL stream, not for reconnecting
> to the master.

Right. And if the connection fails, we intentionally (whether that's
good or bad is another question) switch to restore_command (or
pg_xlog...) based recovery, in which case we certainly do not want the
walsender around.

- Andres


#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#1)
1 attachment(s)
Re: Another reason why the recovery tests take a long time

I wrote:

> So this looks like a pretty obvious race condition in the postmaster,
> which should be resolved by having it set a flag on receipt of
> PMSIGNAL_START_WALRECEIVER that's cleared only when it does start a
> new walreceiver.

Concretely, I propose the attached patch. Together with reducing
wal_retrieve_retry_interval to 500ms, which I propose having
PostgresNode::init do in its standard postgresql.conf adjustments,
this takes the runtime of the recovery TAP tests down from 2m50s
(after the patches I posted yesterday) to 1m30s.

I think there's still gold to be mined, because "top" is still
showing pretty low CPU load over most of the run, but this is
lots better than 4m30s.

regards, tom lane

Attachments:

walreceiver-restart-fix.patch (text/x-diff; charset=us-ascii)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 83e99b7..6c79c36 100644
*** a/src/backend/postmaster/postmaster.c
--- b/src/backend/postmaster/postmaster.c
*************** static volatile sig_atomic_t start_autov
*** 357,362 ****
--- 357,365 ----
  /* the launcher needs to be signalled to communicate some condition */
  static volatile bool avlauncher_needs_signal = false;
  
+ /* received START_WALRECEIVER signal */
+ static volatile sig_atomic_t WalReceiverRequested = false;
+ 
  /* set when there's a worker that needs to be started up */
  static volatile bool StartWorkerNeeded = true;
  static volatile bool HaveCrashedWorker = false;
*************** static void maybe_start_bgworkers(void);
*** 426,431 ****
--- 429,435 ----
  static bool CreateOptsFile(int argc, char *argv[], char *fullprogname);
  static pid_t StartChildProcess(AuxProcType type);
  static void StartAutovacuumWorker(void);
+ static void MaybeStartWalReceiver(void);
  static void InitPostmasterDeathWatchHandle(void);
  
  /*
*************** ServerLoop(void)
*** 1810,1815 ****
--- 1814,1823 ----
  				kill(AutoVacPID, SIGUSR2);
  		}
  
+ 		/* If we need to start a WAL receiver, try to do that now */
+ 		if (WalReceiverRequested)
+ 			MaybeStartWalReceiver();
+ 
  		/* Get other worker processes running, if needed */
  		if (StartWorkerNeeded || HaveCrashedWorker)
  			maybe_start_bgworkers();
*************** reaper(SIGNAL_ARGS)
*** 2958,2964 ****
  		/*
  		 * Was it the wal receiver?  If exit status is zero (normal) or one
  		 * (FATAL exit), we assume everything is all right just like normal
! 		 * backends.
  		 */
  		if (pid == WalReceiverPID)
  		{
--- 2966,2973 ----
  		/*
  		 * Was it the wal receiver?  If exit status is zero (normal) or one
  		 * (FATAL exit), we assume everything is all right just like normal
! 		 * backends.  (If we need a new wal receiver, we'll start one at the
! 		 * next iteration of the postmaster's main loop.)
  		 */
  		if (pid == WalReceiverPID)
  		{
*************** sigusr1_handler(SIGNAL_ARGS)
*** 5066,5079 ****
  		StartAutovacuumWorker();
  	}
  
! 	if (CheckPostmasterSignal(PMSIGNAL_START_WALRECEIVER) &&
! 		WalReceiverPID == 0 &&
! 		(pmState == PM_STARTUP || pmState == PM_RECOVERY ||
! 		 pmState == PM_HOT_STANDBY || pmState == PM_WAIT_READONLY) &&
! 		Shutdown == NoShutdown)
  	{
  		/* Startup Process wants us to start the walreceiver process. */
! 		WalReceiverPID = StartWalReceiver();
  	}
  
  	if (CheckPostmasterSignal(PMSIGNAL_ADVANCE_STATE_MACHINE) &&
--- 5075,5086 ----
  		StartAutovacuumWorker();
  	}
  
! 	if (CheckPostmasterSignal(PMSIGNAL_START_WALRECEIVER))
  	{
  		/* Startup Process wants us to start the walreceiver process. */
! 		/* Start immediately if possible, else remember request for later. */
! 		WalReceiverRequested = true;
! 		MaybeStartWalReceiver();
  	}
  
  	if (CheckPostmasterSignal(PMSIGNAL_ADVANCE_STATE_MACHINE) &&
*************** StartAutovacuumWorker(void)
*** 5410,5415 ****
--- 5417,5440 ----
  }
  
  /*
+  * MaybeStartWalReceiver
+  *		Start the WAL receiver process, if requested and our state allows.
+  */
+ static void
+ MaybeStartWalReceiver(void)
+ {
+ 	if (WalReceiverPID == 0 &&
+ 		(pmState == PM_STARTUP || pmState == PM_RECOVERY ||
+ 		 pmState == PM_HOT_STANDBY || pmState == PM_WAIT_READONLY) &&
+ 		Shutdown == NoShutdown)
+ 	{
+ 		WalReceiverPID = StartWalReceiver();
+ 		WalReceiverRequested = false;
+ 	}
+ }
+ 
+ 
+ /*
   * Create the opts file
   */
  static bool
#6Simon Riggs
simon@2ndquadrant.com
In reply to: Tom Lane (#5)
Re: Another reason why the recovery tests take a long time

On 26 June 2017 at 19:06, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> I wrote:
>> So this looks like a pretty obvious race condition in the postmaster,
>> which should be resolved by having it set a flag on receipt of
>> PMSIGNAL_START_WALRECEIVER that's cleared only when it does start a
>> new walreceiver.
>
> Concretely, I propose the attached patch. Together with reducing
> wal_retrieve_retry_interval to 500ms, which I propose having
> PostgresNode::init do in its standard postgresql.conf adjustments,
> this takes the runtime of the recovery TAP tests down from 2m50s
> (after the patches I posted yesterday) to 1m30s.

Patch looks good.

> I think there's still gold to be mined, because "top" is still
> showing pretty low CPU load over most of the run, but this is
> lots better than 4m30s.

Thanks for looking into this

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
