Failed assertion on standby while shutdown

Started by Maxim Orlovalmost 5 years ago10 messages
#1Maxim Orlov
m.orlov@postgrespro.ru

Hi, haсkers!

Recently, I was doing some experiments with primary/standby instances
interaction. In certain conditions I’ve got and was able to reproduce
crash on failed assertion.

The scenario is the following:
1. start primary server
2. start standby server by pg_basebackup -P -R -X stream -c fast -p5432
-D data
3. apply some load to the primary server by pgbench -p5432 -i -s 150
postgres
4. kill primary server (with kill -9) and keep it down
5. stop standby server by pg_ctl
6. run standby server

Then any standby server termination will result in a failed assertion.

The log with a backtrace is following:

2021-03-19 18:54:25.352 MSK [3508443] LOG: received fast shutdown
request
2021-03-19 18:54:25.379 MSK [3508443] LOG: aborting any active
transactions
TRAP: FailedAssertion("SHMQueueEmpty(&(MyProc->myProcLocks[i]))", File:
"/home/ziva/projects/pgpro/build-secondary/../postgrespro/src/backend/storage/lmgr/proc.c",
Line: 592, PID: 3508452)
postgres: walreceiver (ExceptionalCondition+0xd0)[0x555555d0526f]
postgres: walreceiver (InitAuxiliaryProcess+0x31c)[0x555555b43e31]
postgres: walreceiver (AuxiliaryProcessMain+0x54f)[0x55555574ae32]
postgres: walreceiver (+0x530bff)[0x555555a84bff]
postgres: walreceiver (+0x531044)[0x555555a85044]
postgres: walreceiver (+0x530959)[0x555555a84959]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0)[0x7ffff7a303c0]
/lib/x86_64-linux-gnu/libc.so.6(__select+0x1a)[0x7ffff72a40da]
postgres: walreceiver (+0x52bea4)[0x555555a7fea4]
postgres: walreceiver (PostmasterMain+0x129f)[0x555555a7f7c1]
postgres: walreceiver (+0x41ff1f)[0x555555973f1f]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7ffff71b30b3]
postgres: walreceiver (_start+0x2e)[0x55555561abfe]

After a brief investigation I found out that I can get this assert with
100% probability if I insert a sleep for about 5 sec into
InitAuxiliaryProcess(void) in src/backend/storage/lmgr/proc.c:

diff --git a/src/backend/storage/lmgr/proc.c 
b/src/backend/storage/lmgr/proc.c
index 897045ee272..b5f365f426d 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -525,7 +525,7 @@ InitAuxiliaryProcess(void)

if (MyProc != NULL)
elog(ERROR, "you already exist");
-
+ pg_usleep(5000000L);
/*
* We use the ProcStructLock to protect assignment and releasing
of
* AuxiliaryProcs entries.

Maybe, this kinda behaviour would appear if a computer hosting instances
is under significant side load, which cause delay to start db-instances
under a heavy load.

Configuration for a primary server is default with "wal_level = logical"

Configuration for a standby server is default with "wal_level = logical"
and "primary_conninfo = 'port=5432'"

I'm puzzled with this behavor. I'm pretty sure it is not what should be.
Any ideas how this can be fixed?

---
Best regards,
Maxim Orlov.

#2Fujii Masao
masao.fujii@oss.nttdata.com
In reply to: Maxim Orlov (#1)
1 attachment(s)
Re: Failed assertion on standby while shutdown

On 2021/03/20 2:25, Maxim Orlov wrote:

Hi, haсkers!

Recently, I was doing some experiments with primary/standby instances interaction. In certain conditions I’ve got and was able to reproduce crash on failed assertion.

The scenario is the following:
1. start primary server
2. start standby server by pg_basebackup -P -R -X stream -c fast -p5432 -D data
3. apply some load to the primary server by pgbench -p5432 -i -s 150 postgres
4. kill primary server (with kill -9) and keep it down
5. stop standby server by pg_ctl
6. run standby server

Then any standby server termination will result in a failed assertion.

The log with a backtrace is following:

2021-03-19 18:54:25.352 MSK [3508443] LOG:  received fast shutdown request
2021-03-19 18:54:25.379 MSK [3508443] LOG:  aborting any active transactions
TRAP: FailedAssertion("SHMQueueEmpty(&(MyProc->myProcLocks[i]))", File: "/home/ziva/projects/pgpro/build-secondary/../postgrespro/src/backend/storage/lmgr/proc.c", Line: 592, PID: 3508452)
postgres: walreceiver (ExceptionalCondition+0xd0)[0x555555d0526f]
postgres: walreceiver (InitAuxiliaryProcess+0x31c)[0x555555b43e31]
postgres: walreceiver (AuxiliaryProcessMain+0x54f)[0x55555574ae32]
postgres: walreceiver (+0x530bff)[0x555555a84bff]
postgres: walreceiver (+0x531044)[0x555555a85044]
postgres: walreceiver (+0x530959)[0x555555a84959]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0)[0x7ffff7a303c0]
/lib/x86_64-linux-gnu/libc.so.6(__select+0x1a)[0x7ffff72a40da]
postgres: walreceiver (+0x52bea4)[0x555555a7fea4]
postgres: walreceiver (PostmasterMain+0x129f)[0x555555a7f7c1]
postgres: walreceiver (+0x41ff1f)[0x555555973f1f]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7ffff71b30b3]
postgres: walreceiver (_start+0x2e)[0x55555561abfe]

After a brief investigation I found out that I can get this assert with 100% probability if I insert a sleep for about 5 sec into InitAuxiliaryProcess(void) in src/backend/storage/lmgr/proc.c:

diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 897045ee272..b5f365f426d 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -525,7 +525,7 @@ InitAuxiliaryProcess(void)

        if (MyProc != NULL)
                elog(ERROR, "you already exist");
-
+       pg_usleep(5000000L);
        /*
         * We use the ProcStructLock to protect assignment and releasing of
         * AuxiliaryProcs entries.

Thanks for the report! I could reproduce this issue by adding that sleep
into InitAuxiliaryProcess().

Maybe, this kinda behaviour would appear if a computer hosting instances is under significant side load, which cause delay to start db-instances under a heavy load.

Configuration for a primary server is default with "wal_level = logical"

Configuration for a standby server is default with "wal_level = logical" and "primary_conninfo = 'port=5432'"

I'm puzzled with this behavor. I'm pretty sure it is not what should be. Any ideas how this can be fixed?

ISTM that the cause of this issue is that the startup process exits
without releasing the locks that it was holding when shutdown is
requested. To address this issue, IMO the startup process should
call ShutdownRecoveryTransactionEnvironment() at its exit.
Attached is the POC patch that changes the startup process that way.

I've not tested the patch enough yet..

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Attachments:

fix_assertion_failure_walreceiver.patchtext/plain; charset=UTF-8; name=fix_assertion_failure_walreceiver.patch; x-mac-creator=0; x-mac-type=0Download
diff --git a/src/backend/postmaster/startup.c b/src/backend/postmaster/startup.c
index 22135d5e07..69077bd207 100644
--- a/src/backend/postmaster/startup.c
+++ b/src/backend/postmaster/startup.c
@@ -62,6 +62,9 @@ static volatile sig_atomic_t in_restore_command = false;
 static void StartupProcTriggerHandler(SIGNAL_ARGS);
 static void StartupProcSigHupHandler(SIGNAL_ARGS);
 
+/* Callbacks */
+static void StartupProcExit(int code, Datum arg);
+
 
 /* --------------------------------
  *		signal handler routines
@@ -183,6 +186,19 @@ HandleStartupProcInterrupts(void)
 }
 
 
+/* --------------------------------
+ *		signal handler routines
+ * --------------------------------
+ */
+static void
+StartupProcExit(int code, Datum arg)
+{
+	/* Shutdown the recovery environment */
+	if (standbyState != STANDBY_DISABLED)
+		ShutdownRecoveryTransactionEnvironment();
+}
+
+
 /* ----------------------------------
  *	Startup Process main entry point
  * ----------------------------------
@@ -190,6 +206,9 @@ HandleStartupProcInterrupts(void)
 void
 StartupProcessMain(void)
 {
+	/* Arrange to clean up at startup process exit */
+	on_shmem_exit(StartupProcExit, 0);
+
 	/*
 	 * Properly accept or ignore signals the postmaster might send us.
 	 */
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 17de5a6d0e..aae6d28f93 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -131,6 +131,12 @@ InitRecoveryTransactionEnvironment(void)
 void
 ShutdownRecoveryTransactionEnvironment(void)
 {
+	static bool	called = false;
+
+	/* Prevent transaction tracking from being shut down more than once */
+	if (called)
+		return;
+
 	/* Mark all tracked in-progress transactions as finished. */
 	ExpireAllKnownAssignedTransactionIds();
 
@@ -143,6 +149,8 @@ ShutdownRecoveryTransactionEnvironment(void)
 
 	/* Cleanup our VirtualTransaction */
 	VirtualXactLockTableCleanup();
+
+	called = true;
 }
 
 
#3Maxim Orlov
m.orlov@postgrespro.ru
In reply to: Fujii Masao (#2)
Re: Failed assertion on standby while shutdown

On 2021-03-22 16:40, Fujii Masao wrote:

On 2021/03/20 2:25, Maxim Orlov wrote:

Hi, haсkers!

Recently, I was doing some experiments with primary/standby instances
interaction. In certain conditions I’ve got and was able to reproduce
crash on failed assertion.

The scenario is the following:
1. start primary server
2. start standby server by pg_basebackup -P -R -X stream -c fast
-p5432 -D data
3. apply some load to the primary server by pgbench -p5432 -i -s 150
postgres
4. kill primary server (with kill -9) and keep it down
5. stop standby server by pg_ctl
6. run standby server

Then any standby server termination will result in a failed assertion.

The log with a backtrace is following:

2021-03-19 18:54:25.352 MSK [3508443] LOG:  received fast shutdown
request
2021-03-19 18:54:25.379 MSK [3508443] LOG:  aborting any active
transactions
TRAP: FailedAssertion("SHMQueueEmpty(&(MyProc->myProcLocks[i]))",
File:
"/home/ziva/projects/pgpro/build-secondary/../postgrespro/src/backend/storage/lmgr/proc.c",
Line: 592, PID: 3508452)
postgres: walreceiver (ExceptionalCondition+0xd0)[0x555555d0526f]
postgres: walreceiver (InitAuxiliaryProcess+0x31c)[0x555555b43e31]
postgres: walreceiver (AuxiliaryProcessMain+0x54f)[0x55555574ae32]
postgres: walreceiver (+0x530bff)[0x555555a84bff]
postgres: walreceiver (+0x531044)[0x555555a85044]
postgres: walreceiver (+0x530959)[0x555555a84959]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0)[0x7ffff7a303c0]
/lib/x86_64-linux-gnu/libc.so.6(__select+0x1a)[0x7ffff72a40da]
postgres: walreceiver (+0x52bea4)[0x555555a7fea4]
postgres: walreceiver (PostmasterMain+0x129f)[0x555555a7f7c1]
postgres: walreceiver (+0x41ff1f)[0x555555973f1f]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7ffff71b30b3]
postgres: walreceiver (_start+0x2e)[0x55555561abfe]

After a brief investigation I found out that I can get this assert
with 100% probability if I insert a sleep for about 5 sec into
InitAuxiliaryProcess(void) in src/backend/storage/lmgr/proc.c:

diff --git a/src/backend/storage/lmgr/proc.c 
b/src/backend/storage/lmgr/proc.c
index 897045ee272..b5f365f426d 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -525,7 +525,7 @@ InitAuxiliaryProcess(void)

        if (MyProc != NULL)
                elog(ERROR, "you already exist");
-
+       pg_usleep(5000000L);
        /*
         * We use the ProcStructLock to protect assignment and
releasing of
         * AuxiliaryProcs entries.

Thanks for the report! I could reproduce this issue by adding that
sleep
into InitAuxiliaryProcess().

Maybe, this kinda behaviour would appear if a computer hosting
instances is under significant side load, which cause delay to start
db-instances under a heavy load.

Configuration for a primary server is default with "wal_level =
logical"

Configuration for a standby server is default with "wal_level =
logical" and "primary_conninfo = 'port=5432'"

I'm puzzled with this behavor. I'm pretty sure it is not what should
be. Any ideas how this can be fixed?

ISTM that the cause of this issue is that the startup process exits
without releasing the locks that it was holding when shutdown is
requested. To address this issue, IMO the startup process should
call ShutdownRecoveryTransactionEnvironment() at its exit.
Attached is the POC patch that changes the startup process that way.

I've not tested the patch enough yet..

Regards,

Thank you for reply! As far as I understand, this is really the case.
I've test your patch a bit. This annoying failed assertion is gone now.

I think I should test more and report later about results.

Should we put this patch to CF?

---
Best regards,
Maxim Orlov.

#4Fujii Masao
masao.fujii@oss.nttdata.com
In reply to: Maxim Orlov (#3)
Re: Failed assertion on standby while shutdown

On 2021/03/24 14:02, Maxim Orlov wrote:

Thank you for reply! As far as I understand, this is really the case. I've test your patch a bit.

Thanks for testing the patch!

This annoying failed assertion is gone now.

Good news!

I think I should test more and report later about results.

Should we put this patch to CF?

Yes. Since this is a bug, we can review and commit the patch
without waiting for next CF. But I agree that it's better to
add the patch to next CF so that we don't forget the patch.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#5Maxim Orlov
m.orlov@postgrespro.ru
In reply to: Fujii Masao (#4)
Re: Failed assertion on standby while shutdown

The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: not tested
Documentation: not tested

All the tests passed successfully.

The new status of this patch is: Ready for Committer

#6igor levshin
i.levshin@postgrespro.ru
In reply to: Maxim Orlov (#5)
Re: Failed assertion on standby while shutdown

отбой: Маша сказала: уже оплатили :)

вт 30.03.21 20:44, Maxim Orlov пишет:

Show quoted text

The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: not tested
Documentation: not tested

All the tests passed successfully.

The new status of this patch is: Ready for Committer

#7Maxim Orlov
m.orlov@postgrespro.ru
In reply to: Maxim Orlov (#5)
Re: Failed assertion on standby while shutdown

On 2021-03-30 20:44, Maxim Orlov wrote:

The following review has been posted through the commitfest
application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: not tested
Documentation: not tested

All the tests passed successfully.

The new status of this patch is: Ready for Committer

The patch is good. One note, should we put a comment about
ShutdownRecoveryTransactionEnvironment not reentrant behaviour? Or maybe
rename it to ShutdownRecoveryTransactionEnvironmentOnce?

---
Best regards,
Maxim Orlov.

#8Fujii Masao
masao.fujii@oss.nttdata.com
In reply to: Maxim Orlov (#7)
1 attachment(s)
Re: Failed assertion on standby while shutdown

On 2021/03/31 19:51, Maxim Orlov wrote:

On 2021-03-30 20:44, Maxim Orlov wrote:

The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:       tested, passed
Spec compliant:           not tested
Documentation:            not tested

All the tests passed successfully.

The new status of this patch is: Ready for Committer

The patch is good. One note, should we put a comment about ShutdownRecoveryTransactionEnvironment not reentrant behaviour? Or maybe rename it to ShutdownRecoveryTransactionEnvironmentOnce?

+1 to add more comments into ShutdownRecoveryTransactionEnvironment().
I did that. What about the attached patch?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Attachments:

fix_assertion_failure_walreceiver_v2.patchtext/plain; charset=UTF-8; name=fix_assertion_failure_walreceiver_v2.patch; x-mac-creator=0; x-mac-type=0Download
diff --git a/src/backend/postmaster/startup.c b/src/backend/postmaster/startup.c
index 22135d5e07..69077bd207 100644
--- a/src/backend/postmaster/startup.c
+++ b/src/backend/postmaster/startup.c
@@ -62,6 +62,9 @@ static volatile sig_atomic_t in_restore_command = false;
 static void StartupProcTriggerHandler(SIGNAL_ARGS);
 static void StartupProcSigHupHandler(SIGNAL_ARGS);
 
+/* Callbacks */
+static void StartupProcExit(int code, Datum arg);
+
 
 /* --------------------------------
  *		signal handler routines
@@ -183,6 +186,19 @@ HandleStartupProcInterrupts(void)
 }
 
 
+/* --------------------------------
+ *		signal handler routines
+ * --------------------------------
+ */
+static void
+StartupProcExit(int code, Datum arg)
+{
+	/* Shutdown the recovery environment */
+	if (standbyState != STANDBY_DISABLED)
+		ShutdownRecoveryTransactionEnvironment();
+}
+
+
 /* ----------------------------------
  *	Startup Process main entry point
  * ----------------------------------
@@ -190,6 +206,9 @@ HandleStartupProcInterrupts(void)
 void
 StartupProcessMain(void)
 {
+	/* Arrange to clean up at startup process exit */
+	on_shmem_exit(StartupProcExit, 0);
+
 	/*
 	 * Properly accept or ignore signals the postmaster might send us.
 	 */
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 17de5a6d0e..1465ee44a1 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -127,10 +127,25 @@ InitRecoveryTransactionEnvironment(void)
  *
  * Prepare to switch from hot standby mode to normal operation. Shut down
  * recovery-time transaction tracking.
+ *
+ * This must be called even in shutdown of startup process if transaction
+ * tracking has been initialized. Otherwise some locks the tracked
+ * transactions were holding will not be released and and may interfere with
+ * the processes still running (but will exit soon later) at the exit of
+ * startup process.
  */
 void
 ShutdownRecoveryTransactionEnvironment(void)
 {
+	/*
+	 * Do nothing if RecoveryLockLists is NULL because which means that
+	 * transaction tracking has not been yet initialized or has been already
+	 * shutdowned. This prevents transaction tracking from being shutdowned
+	 * unexpectedly more than once.
+	 */
+	if (RecoveryLockLists == NULL)
+		return;
+
 	/* Mark all tracked in-progress transactions as finished. */
 	ExpireAllKnownAssignedTransactionIds();
 
#9Maxim Orlov
m.orlov@postgrespro.ru
In reply to: Fujii Masao (#8)
Re: Failed assertion on standby while shutdown

On 2021-04-01 15:02, Fujii Masao wrote:

On 2021/03/31 19:51, Maxim Orlov wrote:

On 2021-03-30 20:44, Maxim Orlov wrote:

The following review has been posted through the commitfest
application:
make installcheck-world:  tested, passed
Implements feature:       tested, passed
Spec compliant:           not tested
Documentation:            not tested

All the tests passed successfully.

The new status of this patch is: Ready for Committer

The patch is good. One note, should we put a comment about
ShutdownRecoveryTransactionEnvironment not reentrant behaviour? Or
maybe rename it to ShutdownRecoveryTransactionEnvironmentOnce?

+1 to add more comments into ShutdownRecoveryTransactionEnvironment().
I did that. What about the attached patch?

Regards,

Well done! In my view is just what it's needed.

---
Best regards,
Maxim Orlov.

#10Fujii Masao
masao.fujii@oss.nttdata.com
In reply to: Maxim Orlov (#9)
Re: Failed assertion on standby while shutdown

On 2021/04/05 16:30, Maxim Orlov wrote:

Well done! In my view is just what it's needed.

Thanks for the review! I pushed the patch.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION