Deadlock between backend and recovery may not be detected

Started by Fujii Masao about 5 years ago, 20 messages
#1 Fujii Masao
masao.fujii@oss.nttdata.com
1 attachment(s)

Hi,

While reviewing the patch proposed at [1], I found that there is a case
where a deadlock involving a recovery conflict on a lock may not be
detected. This deadlock can happen between a backend and the startup
process in the standby server. Please see the following procedure to
reproduce the deadlock.

#1. Set up streaming replication.

#2. Set max_standby_streaming_delay to -1 in the standby.
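
For example (just a sketch of one way to apply this): set the parameter in
the standby's postgresql.conf, then reload and verify it:

[STANDBY]
SELECT pg_reload_conf();
SHOW max_standby_streaming_delay;   -- should report -1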

#3. Create two tables in the primary.

[PRIMARY: SESSION1]
CREATE TABLE t1 ();
CREATE TABLE t2 ();

#4. Start a transaction and access the table t1, in the standby.

[STANDBY: SESSION2]
BEGIN;
SELECT * FROM t1;

#5. Start a transaction and lock the table t2 in access exclusive mode,
in the primary. Also execute pg_switch_wal() to transfer the WAL record
for the access exclusive lock to the standby.

[PRIMARY: SESSION1]
BEGIN;
LOCK TABLE t2 IN ACCESS EXCLUSIVE MODE;
SELECT pg_switch_wal();
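
As a rough sanity check (illustration only, not required for the
reproduction), the replayed access exclusive lock on t2 should now be
visible in pg_locks on the standby, held by the startup process:

[STANDBY: another session]
SELECT locktype, relation::regclass, pid, mode, granted
FROM pg_locks
WHERE mode = 'AccessExclusiveLock';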

#6. Access the table t2 within the transaction started in #4, in the standby.
This query blocks, because the startup process has already acquired the
access exclusive lock on t2 while replaying the WAL from #5.

[STANDBY: SESSION2]
SELECT * FROM t2;

#7. Lock the table t1 in access exclusive mode within the transaction that
started in #5, in the primary. Also execute pg_switch_wal() to transfer the
WAL record for the access exclusive lock to the standby.

[PRIMARY: SESSION1]
LOCK TABLE t1 IN ACCESS EXCLUSIVE MODE;
SELECT pg_switch_wal();

After doing this procedure, you can see that the startup process and the
backend each wait for the other's table lock, i.e., a deadlock. But this
deadlock remains undetected even after deadlock_timeout passes.
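
For illustration, one rough way to observe this on the standby (from another
session, and assuming a release where pg_stat_activity also shows the
startup process) is:

[STANDBY: another session]
SELECT a.pid, a.backend_type, a.wait_event_type, a.wait_event,
       l.relation::regclass, l.mode
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE NOT l.granted;

Both the startup process (waiting for the lock on t1) and the backend of
SESSION2 (waiting for the lock on t2) remain listed as ungranted waiters.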

This seems like a bug to me.

* Deadlocks involving the Startup process and an ordinary backend process
* will be detected by the deadlock detector within the ordinary backend.

The cause of this issue seems to be that ResolveRecoveryConflictWithLock(),
which the startup process calls when a recovery conflict on a lock happens,
doesn't take care of the deadlock case at all. You can see this by reading
the source code comment for ResolveRecoveryConflictWithLock() quoted above.

To fix this issue, I think that we should enable the STANDBY_DEADLOCK_TIMEOUT
timer in ResolveRecoveryConflictWithLock() so that the startup process can
send the PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK signal to the backend.
Then, when the PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK signal arrives,
the backend should check whether the deadlock actually happens or not.
Attached is a POC patch implementing this.

Thoughts?

Regards,

[1]: /messages/by-id/9a60178c-a853-1440-2cdc-c3af916cff59@amazon.com

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Attachments:

recovery_conflict_lock_deadlock_v1.patch (text/plain)
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 4ea3cf1f5c..c398f9a3a5 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -407,7 +407,7 @@ ResolveRecoveryConflictWithLock(LOCKTAG locktag)
 
 	ltime = GetStandbyLimitTime();
 
-	if (GetCurrentTimestamp() >= ltime)
+	if (GetCurrentTimestamp() >= ltime && ltime > 0)
 	{
 		/*
 		 * We're already behind, so clear a path as quickly as possible.
@@ -431,12 +431,21 @@ ResolveRecoveryConflictWithLock(LOCKTAG locktag)
 		/*
 		 * Wait (or wait again) until ltime
 		 */
-		EnableTimeoutParams timeouts[1];
+		EnableTimeoutParams timeouts[2];
+		int		i = 0;
 
-		timeouts[0].id = STANDBY_LOCK_TIMEOUT;
-		timeouts[0].type = TMPARAM_AT;
-		timeouts[0].fin_time = ltime;
-		enable_timeouts(timeouts, 1);
+		if (ltime > 0)
+		{
+			timeouts[i].id = STANDBY_LOCK_TIMEOUT;
+			timeouts[i].type = TMPARAM_AT;
+			timeouts[i].fin_time = ltime;
+			i++;
+		}
+		timeouts[i].id = STANDBY_DEADLOCK_TIMEOUT;
+		timeouts[i].type = TMPARAM_AFTER;
+		timeouts[i].delay_ms = DeadlockTimeout;
+		i++;
+		enable_timeouts(timeouts, i);
 	}
 
 	/* Wait to be signaled by the release of the Relation Lock */
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 3679799e50..3717381a6a 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -2923,7 +2923,11 @@ RecoveryConflictInterrupt(ProcSignalReason reason)
 				 * more to do.
 				 */
 				if (!HoldingBufferPinThatDelaysRecovery())
+				{
+					if (reason == PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK)
+						CheckDeadLockAlert();
 					return;
+				}
 
 				MyProc->recoveryConflictPending = true;
 
#2 Victor Yegorov
vyegorov@gmail.com
In reply to: Fujii Masao (#1)
Re: Deadlock between backend and recovery may not be detected

On Wed, 16 Dec 2020 at 13:49, Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

After doing this procedure, you can see the startup process and backend
wait for the table lock each other, i.e., deadlock. But this deadlock
remains
even after deadlock_timeout passes.

This seems a bug to me.

* Deadlocks involving the Startup process and an ordinary backend process
* will be detected by the deadlock detector within the ordinary backend.

The cause of this issue seems that ResolveRecoveryConflictWithLock() that
the startup process calls when recovery conflict on lock happens doesn't
take care of deadlock case at all. You can see this fact by reading the
above
source code comment for ResolveRecoveryConflictWithLock().

To fix this issue, I think that we should enable STANDBY_DEADLOCK_TIMEOUT
timer in ResolveRecoveryConflictWithLock() so that the startup process can
send PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK signal to the backend.
Then if PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK signal arrives,
the backend should check whether the deadlock actually happens or not.
Attached is the POC patch implimenting this.

I agree that this is a bug.

Unfortunately, we've been hit by it in production.
Such a deadlock will eventually make all sessions wait on the startup
process, making the streaming replica unusable. If the replica is used
for balancing out read-only queries from the primary, it causes downtime
for the project.

If I understand things right, the session will release its locks
when max_standby_streaming_delay is reached.
But it'd be much better if the conflict were resolved faster,
around deadlock_timeout.

So — huge +1 from me for fixing it.

--
Victor Yegorov

#3 Drouvot, Bertrand
bdrouvot@amazon.com
In reply to: Victor Yegorov (#2)
Re: Deadlock between backend and recovery may not be detected

Hi,

On 12/16/20 2:36 PM, Victor Yegorov wrote:

On Wed, 16 Dec 2020 at 13:49, Fujii Masao <masao.fujii@oss.nttdata.com <mailto:masao.fujii@oss.nttdata.com>> wrote:

After doing this procedure, you can see the startup process and backend
wait for the table lock each other, i.e., deadlock. But this deadlock remains
even after deadlock_timeout passes.

This seems a bug to me.

+1

* Deadlocks involving the Startup process and an ordinary backend process
* will be detected by the deadlock detector within the ordinary backend.

The cause of this issue seems that ResolveRecoveryConflictWithLock() that
the startup process calls when recovery conflict on lock happens doesn't
take care of deadlock case at all. You can see this fact by reading the above
source code comment for ResolveRecoveryConflictWithLock().

To fix this issue, I think that we should enable STANDBY_DEADLOCK_TIMEOUT
timer in ResolveRecoveryConflictWithLock() so that the startup process can
send PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK signal to the backend.
Then if PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK signal arrives,
the backend should check whether the deadlock actually happens or not.
Attached is the POC patch implementing this.

good catch!

I don't see any obvious reasons why the STANDBY_DEADLOCK_TIMEOUT
shouldn't be set in ResolveRecoveryConflictWithLock() too (it is already
set in ResolveRecoveryConflictWithBufferPin()).

So +1 for considering this a bug and for the way the patch proposes to
fix it.

Bertrand

#4 Fujii Masao
masao.fujii@oss.nttdata.com
In reply to: Drouvot, Bertrand (#3)
1 attachment(s)
Re: Deadlock between backend and recovery may not be detected

On 2020/12/16 23:28, Drouvot, Bertrand wrote:

Hi,

On 12/16/20 2:36 PM, Victor Yegorov wrote:

On Wed, 16 Dec 2020 at 13:49, Fujii Masao <masao.fujii@oss.nttdata.com <mailto:masao.fujii@oss.nttdata.com>> wrote:

After doing this procedure, you can see the startup process and backend
wait for the table lock each other, i.e., deadlock. But this deadlock remains
even after deadlock_timeout passes.

This seems a bug to me.

+1

* Deadlocks involving the Startup process and an ordinary backend process
* will be detected by the deadlock detector within the ordinary backend.

The cause of this issue seems that ResolveRecoveryConflictWithLock() that
the startup process calls when recovery conflict on lock happens doesn't
take care of deadlock case at all. You can see this fact by reading the above
source code comment for ResolveRecoveryConflictWithLock().

To fix this issue, I think that we should enable STANDBY_DEADLOCK_TIMEOUT
timer in ResolveRecoveryConflictWithLock() so that the startup process can
send PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK signal to the backend.
Then if PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK signal arrives,
the backend should check whether the deadlock actually happens or not.
Attached is the POC patch implimenting this.

good catch!

I don't see any obvious reasons why the STANDBY_DEADLOCK_TIMEOUT shouldn't be set in ResolveRecoveryConflictWithLock() too (it is already set in ResolveRecoveryConflictWithBufferPin()).

So + 1 to consider this as a bug and for the way the patch proposes to fix it.

Thanks Victor and Bertrand for agreeing!
Attached is the updated version of the patch.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Attachments:

recovery_conflict_lock_deadlock_v2.patch (text/plain)
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index ee912b9d5e..7c1a3f8b3f 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -3324,6 +3324,13 @@ GetConflictingVirtualXIDs(TransactionId limitXmin, Oid dbOid)
  */
 pid_t
 CancelVirtualTransaction(VirtualTransactionId vxid, ProcSignalReason sigmode)
+{
+	return SignalVirtualTransaction(vxid, sigmode, true);
+}
+
+pid_t
+SignalVirtualTransaction(VirtualTransactionId vxid, ProcSignalReason sigmode,
+						 bool conflictPending)
 {
 	ProcArrayStruct *arrayP = procArray;
 	int			index;
@@ -3342,7 +3349,7 @@ CancelVirtualTransaction(VirtualTransactionId vxid, ProcSignalReason sigmode)
 		if (procvxid.backendId == vxid.backendId &&
 			procvxid.localTransactionId == vxid.localTransactionId)
 		{
-			proc->recoveryConflictPending = true;
+			proc->recoveryConflictPending = conflictPending;
 			pid = proc->pid;
 			if (pid != 0)
 			{
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 4ea3cf1f5c..4b7762644e 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -42,10 +42,15 @@ int			max_standby_streaming_delay = 30 * 1000;
 
 static HTAB *RecoveryLockLists;
 
+/* Is a deadlock check pending? */
+static volatile sig_atomic_t got_standby_deadlock_timeout;
+
 static void ResolveRecoveryConflictWithVirtualXIDs(VirtualTransactionId *waitlist,
 												   ProcSignalReason reason,
 												   uint32 wait_event_info,
 												   bool report_waiting);
+static void SendRecoveryConflictWithLock(ProcSignalReason reason,
+										 LOCKTAG locktag);
 static void SendRecoveryConflictWithBufferPin(ProcSignalReason reason);
 static XLogRecPtr LogCurrentRunningXacts(RunningTransactions CurrRunningXacts);
 static void LogAccessExclusiveLocks(int nlocks, xl_standby_lock *locks);
@@ -395,8 +400,10 @@ ResolveRecoveryConflictWithDatabase(Oid dbid)
  * lock.  As we are already queued to be granted the lock, no new lock
  * requests conflicting with ours will be granted in the meantime.
  *
- * Deadlocks involving the Startup process and an ordinary backend process
- * will be detected by the deadlock detector within the ordinary backend.
+ * We also must check for deadlocks involving the Startup process and
+ * hot-standby backend processes. If deadlock_timeout is reached in
+ * this function, the backends are requested to check themselves for
+ * deadlocks.
  */
 void
 ResolveRecoveryConflictWithLock(LOCKTAG locktag)
@@ -407,7 +414,7 @@ ResolveRecoveryConflictWithLock(LOCKTAG locktag)
 
 	ltime = GetStandbyLimitTime();
 
-	if (GetCurrentTimestamp() >= ltime)
+	if (GetCurrentTimestamp() >= ltime && ltime != 0)
 	{
 		/*
 		 * We're already behind, so clear a path as quickly as possible.
@@ -429,19 +436,47 @@ ResolveRecoveryConflictWithLock(LOCKTAG locktag)
 	else
 	{
 		/*
-		 * Wait (or wait again) until ltime
+		 * Wait (or wait again) until ltime, and check for deadlocks as well
+		 * if we will be waiting longer than deadlock_timeout
 		 */
-		EnableTimeoutParams timeouts[1];
+		EnableTimeoutParams timeouts[2];
+		int			cnt = 0;
 
-		timeouts[0].id = STANDBY_LOCK_TIMEOUT;
-		timeouts[0].type = TMPARAM_AT;
-		timeouts[0].fin_time = ltime;
-		enable_timeouts(timeouts, 1);
+		if (ltime != 0)
+		{
+			timeouts[cnt].id = STANDBY_LOCK_TIMEOUT;
+			timeouts[cnt].type = TMPARAM_AT;
+			timeouts[cnt].fin_time = ltime;
+			cnt++;
+		}
+
+		timeouts[cnt].id = STANDBY_DEADLOCK_TIMEOUT;
+		timeouts[cnt].type = TMPARAM_AFTER;
+		timeouts[cnt].delay_ms = DeadlockTimeout;
+		cnt++;
+
+		enable_timeouts(timeouts, cnt);
 	}
 
 	/* Wait to be signaled by the release of the Relation Lock */
+	got_standby_deadlock_timeout = false;
 	ProcWaitForSignal(PG_WAIT_LOCK | locktag.locktag_type);
 
+	if (got_standby_deadlock_timeout)
+	{
+		/*
+		 * Send out a request for hot-standby backends to check themselves for
+		 * deadlocks.
+		 */
+		SendRecoveryConflictWithLock(
+									 PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK,
+									 locktag);
+		got_standby_deadlock_timeout = false;
+
+		/* Wait again to be signaled by the release of the Relation Lock */
+		ProcWaitForSignal(PG_WAIT_LOCK | locktag.locktag_type);
+	}
+
 	/*
 	 * Clear any timeout requests established above.  We assume here that the
 	 * Startup process doesn't have any other outstanding timeouts than those
@@ -520,8 +555,23 @@ ResolveRecoveryConflictWithBufferPin(void)
 	}
 
 	/* Wait to be signaled by UnpinBuffer() */
+	got_standby_deadlock_timeout = false;
 	ProcWaitForSignal(PG_WAIT_BUFFER_PIN);
 
+	if (got_standby_deadlock_timeout)
+	{
+		/*
+		 * Send out a request for hot-standby backends to check themselves for
+		 * deadlocks.
+		 */
+		SendRecoveryConflictWithBufferPin(
+										  PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK);
+		got_standby_deadlock_timeout = false;
+
+		/* Wait again to be signaled by UnpinBuffer() */
+		ProcWaitForSignal(PG_WAIT_BUFFER_PIN);
+	}
+
 	/*
 	 * Clear any timeout requests established above.  We assume here that the
 	 * Startup process doesn't have any other timeouts than what this function
@@ -531,6 +581,30 @@ ResolveRecoveryConflictWithBufferPin(void)
 	disable_all_timeouts(false);
 }
 
+static void
+SendRecoveryConflictWithLock(ProcSignalReason reason, LOCKTAG locktag)
+{
+	VirtualTransactionId *backends;
+
+	Assert(reason == PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK);
+
+	backends = GetLockConflicts(&locktag, AccessExclusiveLock, NULL);
+
+	/* Quick exit if there's no work to be done */
+	if (!VirtualTransactionIdIsValid(*backends))
+		return;
+
+	/*
+	 * We send signal to all the backends holding the conflicting locks, to
+	 * ask them to check for deadlocks.
+	 */
+	while (VirtualTransactionIdIsValid(*backends))
+	{
+		SignalVirtualTransaction(*backends, reason, false);
+		backends++;
+	}
+}
+
 static void
 SendRecoveryConflictWithBufferPin(ProcSignalReason reason)
 {
@@ -588,13 +662,17 @@ CheckRecoveryConflictDeadlock(void)
 
 /*
  * StandbyDeadLockHandler() will be called if STANDBY_DEADLOCK_TIMEOUT
- * occurs before STANDBY_TIMEOUT.  Send out a request for hot-standby
- * backends to check themselves for deadlocks.
+ * occurs before STANDBY_TIMEOUT.
  */
 void
 StandbyDeadLockHandler(void)
 {
-	SendRecoveryConflictWithBufferPin(PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK);
+	int			save_errno = errno;
+
+	got_standby_deadlock_timeout = true;
+
+	SetLatch(MyLatch);
+	errno = save_errno;
 }
 
 /*
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 3679799e50..385a749e46 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -2919,11 +2919,21 @@ RecoveryConflictInterrupt(ProcSignalReason reason)
 			case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
 
 				/*
-				 * If we aren't blocking the Startup process there is nothing
-				 * more to do.
+				 * If PROCSIG_RECOVERY_CONFLICT_BUFFERPIN is requested but we
+				 * aren't blocking the Startup process there is nothing more
+				 * to do.
+				 *
+				 * If PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK is requested
+				 * and we're waiting for locks but don't hold buffer pins
+				 * blocking the Startup process, we set the flag so that
+				 * ProcSleep() will check for deadlocks.
 				 */
 				if (!HoldingBufferPinThatDelaysRecovery())
+				{
+					if (reason == PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK)
+						CheckDeadLockAlert();
 					return;
+				}
 
 				MyProc->recoveryConflictPending = true;
 
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index ea8a876ca4..4d272c7287 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -72,6 +72,8 @@ extern VirtualTransactionId *GetCurrentVirtualXIDs(TransactionId limitXmin,
 												   int *nvxids);
 extern VirtualTransactionId *GetConflictingVirtualXIDs(TransactionId limitXmin, Oid dbOid);
 extern pid_t CancelVirtualTransaction(VirtualTransactionId vxid, ProcSignalReason sigmode);
+extern pid_t SignalVirtualTransaction(VirtualTransactionId vxid, ProcSignalReason sigmode,
+									  bool conflictPending);
 
 extern bool MinimumActiveBackends(int min);
 extern int	CountDBBackends(Oid databaseid);
#5 Fujii Masao
masao.fujii@oss.nttdata.com
In reply to: Fujii Masao (#4)
1 attachment(s)
Re: Deadlock between backend and recovery may not be detected

On 2020/12/17 2:15, Fujii Masao wrote:

On 2020/12/16 23:28, Drouvot, Bertrand wrote:

Hi,

On 12/16/20 2:36 PM, Victor Yegorov wrote:

On Wed, 16 Dec 2020 at 13:49, Fujii Masao <masao.fujii@oss.nttdata.com <mailto:masao.fujii@oss.nttdata.com>> wrote:

    After doing this procedure, you can see the startup process and backend
    wait for the table lock each other, i.e., deadlock. But this deadlock remains
    even after deadlock_timeout passes.

    This seems a bug to me.

+1

    > * Deadlocks involving the Startup process and an ordinary backend process
    > * will be detected by the deadlock detector within the ordinary backend.

    The cause of this issue seems that ResolveRecoveryConflictWithLock() that
    the startup process calls when recovery conflict on lock happens doesn't
    take care of deadlock case at all. You can see this fact by reading the above
    source code comment for ResolveRecoveryConflictWithLock().

    To fix this issue, I think that we should enable STANDBY_DEADLOCK_TIMEOUT
    timer in ResolveRecoveryConflictWithLock() so that the startup process can
    send PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK signal to the backend.
    Then if PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK signal arrives,
    the backend should check whether the deadlock actually happens or not.
    Attached is the POC patch implimenting this.

good catch!

I don't see any obvious reasons why the STANDBY_DEADLOCK_TIMEOUT shouldn't be set in ResolveRecoveryConflictWithLock() too (it is already set in ResolveRecoveryConflictWithBufferPin()).

So + 1 to consider this as a bug and for the way the patch proposes to fix it.

Thanks Victor and Bertrand for agreeing!
Attached is the updated version of the patch.

Attached is v3 of the patch. Could you review this version?

While the startup process is waiting on a recovery conflict on a buffer pin,
it repeatedly sends the request for a deadlock check to all the backends
every deadlock_timeout. This may increase the workload in the startup
process and the backends, but since this is the original behavior, the patch
doesn't change that. Also, in practice this may not be so harmful, because
the period during which the buffer is kept pinned is basically not so long.

On the other hand, IMO we should avoid this issue while the startup
process is waiting on a recovery conflict on locks, since it can take
a long time for the locks to be released. So the patch tries to fix that case.

Or am I overthinking this? If this doesn't need to be handled,
the patch can be simplified further. Thoughts?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Attachments:

recovery_conflict_lock_deadlock_v3.patch (text/plain)
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index ee912b9d5e..7c1a3f8b3f 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -3324,6 +3324,13 @@ GetConflictingVirtualXIDs(TransactionId limitXmin, Oid dbOid)
  */
 pid_t
 CancelVirtualTransaction(VirtualTransactionId vxid, ProcSignalReason sigmode)
+{
+	return SignalVirtualTransaction(vxid, sigmode, true);
+}
+
+pid_t
+SignalVirtualTransaction(VirtualTransactionId vxid, ProcSignalReason sigmode,
+						 bool conflictPending)
 {
 	ProcArrayStruct *arrayP = procArray;
 	int			index;
@@ -3342,7 +3349,7 @@ CancelVirtualTransaction(VirtualTransactionId vxid, ProcSignalReason sigmode)
 		if (procvxid.backendId == vxid.backendId &&
 			procvxid.localTransactionId == vxid.localTransactionId)
 		{
-			proc->recoveryConflictPending = true;
+			proc->recoveryConflictPending = conflictPending;
 			pid = proc->pid;
 			if (pid != 0)
 			{
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 92d9027776..3164aed423 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -42,6 +42,10 @@ int			max_standby_streaming_delay = 30 * 1000;
 
 static HTAB *RecoveryLockLists;
 
+/* Flags set by timeout handlers */
+static volatile sig_atomic_t got_standby_deadlock_timeout = false;
+static volatile sig_atomic_t got_standby_lock_timeout = false;
+
 static void ResolveRecoveryConflictWithVirtualXIDs(VirtualTransactionId *waitlist,
 												   ProcSignalReason reason,
 												   uint32 wait_event_info,
@@ -395,8 +399,10 @@ ResolveRecoveryConflictWithDatabase(Oid dbid)
  * lock.  As we are already queued to be granted the lock, no new lock
  * requests conflicting with ours will be granted in the meantime.
  *
- * Deadlocks involving the Startup process and an ordinary backend process
- * will be detected by the deadlock detector within the ordinary backend.
+ * We also must check for deadlocks involving the Startup process and
+ * hot-standby backend processes. If deadlock_timeout is reached in
+ * this function, all the backends holding the conflicting locks are
+ * requested to check themselves for deadlocks.
  */
 void
 ResolveRecoveryConflictWithLock(LOCKTAG locktag)
@@ -407,7 +413,7 @@ ResolveRecoveryConflictWithLock(LOCKTAG locktag)
 
 	ltime = GetStandbyLimitTime();
 
-	if (GetCurrentTimestamp() >= ltime)
+	if (GetCurrentTimestamp() >= ltime && ltime != 0)
 	{
 		/*
 		 * We're already behind, so clear a path as quickly as possible.
@@ -429,19 +435,76 @@ ResolveRecoveryConflictWithLock(LOCKTAG locktag)
 	else
 	{
 		/*
-		 * Wait (or wait again) until ltime
+		 * Wait (or wait again) until ltime, and check for deadlocks as well
+		 * if we will be waiting longer than deadlock_timeout
 		 */
-		EnableTimeoutParams timeouts[1];
+		EnableTimeoutParams timeouts[2];
+		int			cnt = 0;
 
-		timeouts[0].id = STANDBY_LOCK_TIMEOUT;
-		timeouts[0].type = TMPARAM_AT;
-		timeouts[0].fin_time = ltime;
-		enable_timeouts(timeouts, 1);
+		if (ltime != 0)
+		{
+			got_standby_lock_timeout = false;
+			timeouts[cnt].id = STANDBY_LOCK_TIMEOUT;
+			timeouts[cnt].type = TMPARAM_AT;
+			timeouts[cnt].fin_time = ltime;
+			cnt++;
+		}
+
+		got_standby_deadlock_timeout = false;
+		timeouts[cnt].id = STANDBY_DEADLOCK_TIMEOUT;
+		timeouts[cnt].type = TMPARAM_AFTER;
+		timeouts[cnt].delay_ms = DeadlockTimeout;
+		cnt++;
+
+		enable_timeouts(timeouts, cnt);
 	}
 
 	/* Wait to be signaled by the release of the Relation Lock */
 	ProcWaitForSignal(PG_WAIT_LOCK | locktag.locktag_type);
 
+	/*
+	 * Exit if ltime is reached. Then all the backends holding conflicting
+	 * locks will be canceled in the next ResolveRecoveryConflictWithLock()
+	 * call.
+	 */
+	if (got_standby_lock_timeout)
+		goto cleanup;
+
+	if (got_standby_deadlock_timeout)
+	{
+		VirtualTransactionId *backends;
+
+		backends = GetLockConflicts(&locktag, AccessExclusiveLock, NULL);
+
+		/* Quick exit if there's no work to be done */
+		if (!VirtualTransactionIdIsValid(*backends))
+			goto cleanup;
+
+		/*
+		 * Send signals to all the backends holding the conflicting locks, to
+		 * ask them to check themselves for deadlocks.
+		 */
+		while (VirtualTransactionIdIsValid(*backends))
+		{
+			SignalVirtualTransaction(*backends,
+									 PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK,
+									 false);
+			backends++;
+		}
+
+		/*
+		 * Wait again here to be signaled by the release of the Relation Lock,
+		 * to prevent the subsequent RecoveryConflictWithLock() from causing
+		 * deadlock_timeout and sending a request for deadlocks check again.
+		 * Otherwise the request continues to be sent every deadlock_timeout
+		 * until the relation locks are released or ltime is reached.
+		 */
+		got_standby_deadlock_timeout = false;
+		ProcWaitForSignal(PG_WAIT_LOCK | locktag.locktag_type);
+	}
+
+cleanup:
+
 	/*
 	 * Clear any timeout requests established above.  We assume here that the
 	 * Startup process doesn't have any other outstanding timeouts than those
@@ -449,6 +512,8 @@ ResolveRecoveryConflictWithLock(LOCKTAG locktag)
 	 * timeouts individually, but that'd be slower.
 	 */
 	disable_all_timeouts(false);
+	got_standby_lock_timeout = false;
+	got_standby_deadlock_timeout = false;
 }
 
 /*
@@ -513,9 +578,12 @@ ResolveRecoveryConflictWithBufferPin(void)
 		timeouts[0].id = STANDBY_TIMEOUT;
 		timeouts[0].type = TMPARAM_AT;
 		timeouts[0].fin_time = ltime;
+
+		got_standby_deadlock_timeout = false;
 		timeouts[1].id = STANDBY_DEADLOCK_TIMEOUT;
 		timeouts[1].type = TMPARAM_AFTER;
 		timeouts[1].delay_ms = DeadlockTimeout;
+
 		enable_timeouts(timeouts, 2);
 	}
 
@@ -529,6 +597,26 @@ ResolveRecoveryConflictWithBufferPin(void)
 	 */
 	ProcWaitForSignal(PG_WAIT_BUFFER_PIN);
 
+	if (got_standby_deadlock_timeout)
+	{
+		/*
+		 * Send out a request for hot-standby backends to check themselves for
+		 * deadlocks.
+		 *
+		 * XXX The subsequent ResolveRecoveryConflictWithBufferPin() will wait
+		 * to be signaled by UnpinBuffer() again and send a request for
+		 * deadlocks check if deadlock_timeout happens. This causes the
+		 * request to continue to be sent every deadlock_timeout until the
+		 * buffer is unpinned or ltime is reached. This would increase the
+		 * workload in the startup process and backends. In practice it may
+		 * not be so harmful because the period that the buffer is kept pinned
+		 * is basically no so long. But we should fix this?
+		 */
+		SendRecoveryConflictWithBufferPin(
+										  PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK);
+		got_standby_deadlock_timeout = false;
+	}
+
 	/*
 	 * Clear any timeout requests established above.  We assume here that the
 	 * Startup process doesn't have any other timeouts than what this function
@@ -595,13 +683,12 @@ CheckRecoveryConflictDeadlock(void)
 
 /*
  * StandbyDeadLockHandler() will be called if STANDBY_DEADLOCK_TIMEOUT
- * occurs before STANDBY_TIMEOUT.  Send out a request for hot-standby
- * backends to check themselves for deadlocks.
+ * occurs before STANDBY_TIMEOUT.
  */
 void
 StandbyDeadLockHandler(void)
 {
-	SendRecoveryConflictWithBufferPin(PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK);
+	got_standby_deadlock_timeout = true;
 }
 
 /*
@@ -620,11 +707,11 @@ StandbyTimeoutHandler(void)
 
 /*
  * StandbyLockTimeoutHandler() will be called if STANDBY_LOCK_TIMEOUT is exceeded.
- * This doesn't need to do anything, simply waking up is enough.
  */
 void
 StandbyLockTimeoutHandler(void)
 {
+	got_standby_lock_timeout = true;
 }
 
 /*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 7dc3911590..45a99ac0a4 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -1793,6 +1793,9 @@ CheckDeadLockAlert(void)
 	 * Have to set the latch again, even if handle_sig_alarm already did. Back
 	 * then got_deadlock_timeout wasn't yet set... It's unlikely that this
 	 * ever would be a problem, but setting a set latch again is cheap.
+	 *
+	 * Note that, when this function runs inside procsignal_sigusr1_handler(),
+	 * the handler function sets the latch again after the latch is set here.
 	 */
 	SetLatch(MyLatch);
 	errno = save_errno;
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 3679799e50..385a749e46 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -2919,11 +2919,21 @@ RecoveryConflictInterrupt(ProcSignalReason reason)
 			case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
 
 				/*
-				 * If we aren't blocking the Startup process there is nothing
-				 * more to do.
+				 * If PROCSIG_RECOVERY_CONFLICT_BUFFERPIN is requested but we
+				 * aren't blocking the Startup process there is nothing more
+				 * to do.
+				 *
+				 * If PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK is requested
+				 * and we're waiting for locks but don't hold buffer pins
+				 * blocking the Startup process, we set the flag so that
+				 * ProcSleep() will check for deadlocks.
 				 */
 				if (!HoldingBufferPinThatDelaysRecovery())
+				{
+					if (reason == PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK)
+						CheckDeadLockAlert();
 					return;
+				}
 
 				MyProc->recoveryConflictPending = true;
 
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index ea8a876ca4..4d272c7287 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -72,6 +72,8 @@ extern VirtualTransactionId *GetCurrentVirtualXIDs(TransactionId limitXmin,
 												   int *nvxids);
 extern VirtualTransactionId *GetConflictingVirtualXIDs(TransactionId limitXmin, Oid dbOid);
 extern pid_t CancelVirtualTransaction(VirtualTransactionId vxid, ProcSignalReason sigmode);
+extern pid_t SignalVirtualTransaction(VirtualTransactionId vxid, ProcSignalReason sigmode,
+									  bool conflictPending);
 
 extern bool MinimumActiveBackends(int min);
 extern int	CountDBBackends(Oid databaseid);
#6 Drouvot, Bertrand
bdrouvot@amazon.com
In reply to: Fujii Masao (#5)
Re: Deadlock between backend and recovery may not be detected

Hi,

On 12/18/20 10:35 AM, Fujii Masao wrote:

On 2020/12/17 2:15, Fujii Masao wrote:

On 2020/12/16 23:28, Drouvot, Bertrand wrote:

Hi,

On 12/16/20 2:36 PM, Victor Yegorov wrote:

On Wed, 16 Dec 2020 at 13:49, Fujii Masao <masao.fujii@oss.nttdata.com <mailto:masao.fujii@oss.nttdata.com>> wrote:

    After doing this procedure, you can see the startup process and backend
    wait for the table lock each other, i.e., deadlock. But this deadlock remains
    even after deadlock_timeout passes.

    This seems a bug to me.

+1

    > * Deadlocks involving the Startup process and an ordinary backend process
    > * will be detected by the deadlock detector within the ordinary backend.

    The cause of this issue seems that ResolveRecoveryConflictWithLock() that
    the startup process calls when recovery conflict on lock happens doesn't
    take care of deadlock case at all. You can see this fact by reading the above
    source code comment for ResolveRecoveryConflictWithLock().

    To fix this issue, I think that we should enable STANDBY_DEADLOCK_TIMEOUT
    timer in ResolveRecoveryConflictWithLock() so that the startup process can
    send PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK signal to the backend.
    Then if PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK signal arrives,
    the backend should check whether the deadlock actually happens or not.
    Attached is the POC patch implementing this.

good catch!

I don't see any obvious reasons why the STANDBY_DEADLOCK_TIMEOUT
shouldn't be set in ResolveRecoveryConflictWithLock() too (it is already
set in ResolveRecoveryConflictWithBufferPin()).

So +1 to consider this as a bug and for the way the patch proposes to fix it.

Thanks Victor and Bertrand for agreeing!
Attached is the updated version of the patch.

Attached is v3 of the patch. Could you review this version?

While the startup process is waiting for recovery conflict on buffer pin,
it repeats sending the request for deadlock check to all the backends
every deadlock_timeout. This may increase the workload in the startup
process and backends, but since this is the original behavior, the patch
doesn't change that.

Agree.

IMHO that may need to be rethought (as we are already in a conflict
situation, I am tempted to say that the less that is done the better),
but I think that's outside the scope of this patch.

Also in practice this may not be so harmful because
the period that the buffer is kept pinned is basically not so long.

Agreed that we are less likely to stay in this mode for a "long" duration
(as compared to the lock conflict).

On the other hand, IMO we should avoid this issue while the startup
process is waiting for recovery conflict on locks since it can take
a long time to release the locks. So the patch tries to fix it.

Agree

Or I'm overthinking this? If this doesn't need to be handled,
the patch can be simplified more. Thought?

I do think it's good to handle it that way for the lock conflict: the less
work done the better (especially in a conflict situation).

The patch does look good to me.

Bertrand

#7 Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Fujii Masao (#5)
Re: Deadlock between backend and recovery may not be detected

On Fri, Dec 18, 2020 at 6:36 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

On 2020/12/17 2:15, Fujii Masao wrote:

On 2020/12/16 23:28, Drouvot, Bertrand wrote:

Hi,

On 12/16/20 2:36 PM, Victor Yegorov wrote:

On Wed, 16 Dec 2020 at 13:49, Fujii Masao <masao.fujii@oss.nttdata.com <mailto:masao.fujii@oss.nttdata.com>> wrote:

After doing this procedure, you can see the startup process and backend
wait for the table lock each other, i.e., deadlock. But this deadlock remains
even after deadlock_timeout passes.

This seems a bug to me.

+1

* Deadlocks involving the Startup process and an ordinary backend process
* will be detected by the deadlock detector within the ordinary backend.

The cause of this issue seems that ResolveRecoveryConflictWithLock() that
the startup process calls when recovery conflict on lock happens doesn't
take care of deadlock case at all. You can see this fact by reading the above
source code comment for ResolveRecoveryConflictWithLock().

To fix this issue, I think that we should enable STANDBY_DEADLOCK_TIMEOUT
timer in ResolveRecoveryConflictWithLock() so that the startup process can
send PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK signal to the backend.
Then if PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK signal arrives,
the backend should check whether the deadlock actually happens or not.
Attached is the POC patch implimenting this.

good catch!

I don't see any obvious reasons why the STANDBY_DEADLOCK_TIMEOUT shouldn't be set in ResolveRecoveryConflictWithLock() too (it is already set in ResolveRecoveryConflictWithBufferPin()).

So + 1 to consider this as a bug and for the way the patch proposes to fix it.

Thanks Victor and Bertrand for agreeing!
Attached is the updated version of the patch.

Attached is v3 of the patch. Could you review this version?

While the startup process is waiting for recovery conflict on buffer pin,
it repeats sending the request for deadlock check to all the backends
every deadlock_timeout. This may increase the workload in the startup
process and backends, but since this is the original behavior, the patch
doesn't change that. Also in practice this may not be so harmful because
the period that the buffer is kept pinned is basically not so long.

@@ -529,6 +603,26 @@ ResolveRecoveryConflictWithBufferPin(void)
*/
ProcWaitForSignal(PG_WAIT_BUFFER_PIN);

+   if (got_standby_deadlock_timeout)
+   {
+       /*
+        * Send out a request for hot-standby backends to check themselves for
+        * deadlocks.
+        *
+        * XXX The subsequent ResolveRecoveryConflictWithBufferPin() will wait
+        * to be signaled by UnpinBuffer() again and send a request for
+        * deadlocks check if deadlock_timeout happens. This causes the
+        * request to continue to be sent every deadlock_timeout until the
+        * buffer is unpinned or ltime is reached. This would increase the
+        * workload in the startup process and backends. In practice it may
+        * not be so harmful because the period that the buffer is kept pinned
+        * is basically no so long. But we should fix this?
+        */
+       SendRecoveryConflictWithBufferPin(
+
PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK);
+       got_standby_deadlock_timeout = false;
+   }
+

Since SendRecoveryConflictWithBufferPin() sends the signal to all
backends, every backend that is waiting on a lock at ProcSleep() and not
holding a buffer pin blocking the startup process will end up doing a
deadlock check, which seems expensive. What is worse is that the
deadlock will not be detected, because a deadlock involving a buffer
pin isn't detected by CheckDeadLock(). I thought we could replace
PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK with
PROCSIG_RECOVERY_CONFLICT_BUFFERPIN, but that's not good because a
backend that has a buffer pin blocking the startup process and is not
waiting on a lock would also be canceled after deadlock_timeout. We could
have the backend return from RecoveryConflictInterrupt() when it receives
PROCSIG_RECOVERY_CONFLICT_BUFFERPIN and is not waiting on any lock,
but that's also not good because then we cannot cancel, after
max_standby_streaming_delay, a backend that has a buffer pin blocking the
startup process. So I wonder if we can have a new signal. When the backend
receives this signal, it returns from RecoveryConflictInterrupt()
without a deadlock check if it's either not waiting on any lock or not
holding a buffer pin blocking the startup process. Otherwise it's
cancelled.

Regards,

--
Masahiko Sawada
EnterpriseDB: https://www.enterprisedb.com/

#8 Fujii Masao
masao.fujii@oss.nttdata.com
In reply to: Drouvot, Bertrand (#6)
Re: Deadlock between backend and recovery may not be detected

On 2020/12/19 1:43, Drouvot, Bertrand wrote:

Hi,

On 12/18/20 10:35 AM, Fujii Masao wrote:

On 2020/12/17 2:15, Fujii Masao wrote:

On 2020/12/16 23:28, Drouvot, Bertrand wrote:

Hi,

On 12/16/20 2:36 PM, Victor Yegorov wrote:

On Wed, 16 Dec 2020 at 13:49, Fujii Masao <masao.fujii@oss.nttdata.com <mailto:masao.fujii@oss.nttdata.com>> wrote:

    After doing this procedure, you can see the startup process and backend
    wait for the table lock each other, i.e., deadlock. But this deadlock remains
    even after deadlock_timeout passes.

    This seems a bug to me.

+1

    > * Deadlocks involving the Startup process and an ordinary backend process
    > * will be detected by the deadlock detector within the ordinary backend.

    The cause of this issue seems that ResolveRecoveryConflictWithLock() that
    the startup process calls when recovery conflict on lock happens doesn't
    take care of deadlock case at all. You can see this fact by reading the above
    source code comment for ResolveRecoveryConflictWithLock().

    To fix this issue, I think that we should enable STANDBY_DEADLOCK_TIMEOUT
    timer in ResolveRecoveryConflictWithLock() so that the startup process can
    send PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK signal to the backend.
    Then if PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK signal arrives,
    the backend should check whether the deadlock actually happens or not.
    Attached is the POC patch implimenting this.

good catch!

I don't see any obvious reasons why the STANDBY_DEADLOCK_TIMEOUT shouldn't be set in ResolveRecoveryConflictWithLock() too (it is already set in ResolveRecoveryConflictWithBufferPin()).

So + 1 to consider this as a bug and for the way the patch proposes to fix it.

Thanks Victor and Bertrand for agreeing!
Attached is the updated version of the patch.

Attached is v3 of the patch. Could you review this version?

While the startup process is waiting for recovery conflict on buffer pin,
it repeats sending the request for deadlock check to all the backends
every deadlock_timeout. This may increase the workload in the startup
process and backends, but since this is the original behavior, the patch
doesn't change that.

Agree.

IMHO that may need to be rethink (as we are already in a conflict situation, i am tempted to say that the less is being done the better it is), but i think that's outside the scope of this patch.

Yes, I agree that's better. I think that we should improve that in a separate
patch, only for the master branch, after first fixing the bug and
back-patching the fix.

Also in practice this may not be so harmful because
the period that the buffer is kept pinned is basically not so long.

Agree that chances are less to be in this mode for a "long" duration (as compare to the lock conflict).

On the other hand, IMO we should avoid this issue while the startup
process is waiting for recovery conflict on locks since it can take
a long time to release the locks. So the patch tries to fix it.

Agree

Or I'm overthinking this? If this doesn't need to be handled,
the patch can be simplified more. Thought?

I do think that's good to handle it that way for the lock conflict: the less work is done the better it is (specially in a conflict situation).

The patch does look good to me.

Thanks for the review!

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#9 Fujii Masao
masao.fujii@oss.nttdata.com
In reply to: Masahiko Sawada (#7)
Re: Deadlock between backend and recovery may not be detected

On 2020/12/22 10:25, Masahiko Sawada wrote:

On Fri, Dec 18, 2020 at 6:36 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

On 2020/12/17 2:15, Fujii Masao wrote:

On 2020/12/16 23:28, Drouvot, Bertrand wrote:

Hi,

On 12/16/20 2:36 PM, Victor Yegorov wrote:

On Wed, 16 Dec 2020 at 13:49, Fujii Masao <masao.fujii@oss.nttdata.com <mailto:masao.fujii@oss.nttdata.com>> wrote:

After doing this procedure, you can see the startup process and backend
wait for the table lock each other, i.e., deadlock. But this deadlock remains
even after deadlock_timeout passes.

This seems a bug to me.

+1

* Deadlocks involving the Startup process and an ordinary backend process
* will be detected by the deadlock detector within the ordinary backend.

The cause of this issue seems that ResolveRecoveryConflictWithLock() that
the startup process calls when recovery conflict on lock happens doesn't
take care of deadlock case at all. You can see this fact by reading the above
source code comment for ResolveRecoveryConflictWithLock().

To fix this issue, I think that we should enable STANDBY_DEADLOCK_TIMEOUT
timer in ResolveRecoveryConflictWithLock() so that the startup process can
send PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK signal to the backend.
Then if PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK signal arrives,
the backend should check whether the deadlock actually happens or not.
Attached is the POC patch implimenting this.

good catch!

I don't see any obvious reasons why the STANDBY_DEADLOCK_TIMEOUT shouldn't be set in ResolveRecoveryConflictWithLock() too (it is already set in ResolveRecoveryConflictWithBufferPin()).

So + 1 to consider this as a bug and for the way the patch proposes to fix it.

Thanks Victor and Bertrand for agreeing!
Attached is the updated version of the patch.

Attached is v3 of the patch. Could you review this version?

While the startup process is waiting for recovery conflict on buffer pin,
it repeats sending the request for deadlock check to all the backends
every deadlock_timeout. This may increase the workload in the startup
process and backends, but since this is the original behavior, the patch
doesn't change that. Also in practice this may not be so harmful because
the period that the buffer is kept pinned is basically not so long.

@@ -529,6 +603,26 @@ ResolveRecoveryConflictWithBufferPin(void)
*/
ProcWaitForSignal(PG_WAIT_BUFFER_PIN);

+   if (got_standby_deadlock_timeout)
+   {
+       /*
+        * Send out a request for hot-standby backends to check themselves for
+        * deadlocks.
+        *
+        * XXX The subsequent ResolveRecoveryConflictWithBufferPin() will wait
+        * to be signaled by UnpinBuffer() again and send a request for
+        * deadlocks check if deadlock_timeout happens. This causes the
+        * request to continue to be sent every deadlock_timeout until the
+        * buffer is unpinned or ltime is reached. This would increase the
+        * workload in the startup process and backends. In practice it may
+        * not be so harmful because the period that the buffer is kept pinned
+        * is basically no so long. But we should fix this?
+        */
+       SendRecoveryConflictWithBufferPin(
+
PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK);
+       got_standby_deadlock_timeout = false;
+   }
+

Since SendRecoveryConflictWithBufferPin() sends the signal to all
backends every backend who is waiting on a lock at ProcSleep() and not
holding a buffer pin blocking the startup process will end up doing a
deadlock check, which seems expensive. What is worse is that the
deadlock will not be detected because the deadlock involving a buffer
pin isn't detected by CheckDeadLock(). I thought we can replace
PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK with
PROCSIG_RECOVERY_CONFLICT_BUFFERPIN but it’s not good because the
backend who has a buffer pin blocking the startup process and not
waiting on a lock is also canceled after deadlock_timeout. We can have
the backend return from RecoveryConflictInterrupt() when it received
PROCSIG_RECOVERY_CONFLICT_BUFFERPIN and is not waiting on any lock,
but it’s also not good because we cannot cancel the backend after
max_standby_streaming_delay that has a buffer pin blocking the startup
process. So I wonder if we can have a new signal. When the backend
received this signal it returns from RecoveryConflictInterrupt()
without deadlock check either if it’s not waiting on any lock or if it
doesn’t have a buffer pin blocking the startup process. Otherwise it's
cancelled.

Thanks for pointing out that issue! Using a new signal is one idea. Another
idea is to make a backend skip the deadlock check if
GetStartupBufferPinWaitBufId() returns -1, i.e., the startup process is not
waiting for a buffer pin. So, what I'm thinking is:

In RecoveryConflictInterrupt(), when a backend receives
PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK,

1. If a backend isn't waiting for a lock, it does nothing.
2. If a backend is waiting for a lock and also holding a buffer pin that
   delays recovery, it may be canceled.
3. If a backend is waiting for a lock and the startup process is not waiting
   for a buffer pin (i.e., the startup process is also waiting for a lock),
   it checks for deadlocks.
4. If a backend is waiting for a lock and isn't holding a buffer pin that
   delays recovery, even though the startup process is waiting for a buffer
   pin, it does nothing.

Thoughts?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#10 Fujii Masao
masao.fujii@oss.nttdata.com
In reply to: Fujii Masao (#9)
1 attachment(s)
Re: Deadlock between backend and recovery may not be detected

On 2020/12/22 20:42, Fujii Masao wrote:

On 2020/12/22 10:25, Masahiko Sawada wrote:

On Fri, Dec 18, 2020 at 6:36 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

On 2020/12/17 2:15, Fujii Masao wrote:

On 2020/12/16 23:28, Drouvot, Bertrand wrote:

Hi,

On 12/16/20 2:36 PM, Victor Yegorov wrote:

On Wed, 16 Dec 2020 at 13:49, Fujii Masao <masao.fujii@oss.nttdata.com <mailto:masao.fujii@oss.nttdata.com>> wrote:

     After doing this procedure, you can see the startup process and backend
     wait for the table lock each other, i.e., deadlock. But this deadlock remains
     even after deadlock_timeout passes.

     This seems a bug to me.

+1

     > * Deadlocks involving the Startup process and an ordinary backend process
     > * will be detected by the deadlock detector within the ordinary backend.

     The cause of this issue seems that ResolveRecoveryConflictWithLock() that
     the startup process calls when recovery conflict on lock happens doesn't
     take care of deadlock case at all. You can see this fact by reading the above
     source code comment for ResolveRecoveryConflictWithLock().

     To fix this issue, I think that we should enable STANDBY_DEADLOCK_TIMEOUT
     timer in ResolveRecoveryConflictWithLock() so that the startup process can
     send PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK signal to the backend.
     Then if PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK signal arrives,
     the backend should check whether the deadlock actually happens or not.
     Attached is the POC patch implimenting this.

good catch!

I don't see any obvious reasons why the STANDBY_DEADLOCK_TIMEOUT shouldn't be set in ResolveRecoveryConflictWithLock() too (it is already set in ResolveRecoveryConflictWithBufferPin()).

So + 1 to consider this as a bug and for the way the patch proposes to fix it.

Thanks Victor and Bertrand for agreeing!
Attached is the updated version of the patch.

Attached is v3 of the patch. Could you review this version?

While the startup process is waiting for recovery conflict on buffer pin,
it repeats sending the request for deadlock check to all the backends
every deadlock_timeout. This may increase the workload in the startup
process and backends, but since this is the original behavior, the patch
doesn't change that. Also in practice this may not be so harmful because
the period that the buffer is kept pinned is basically not so long.

@@ -529,6 +603,26 @@ ResolveRecoveryConflictWithBufferPin(void)
      */
     ProcWaitForSignal(PG_WAIT_BUFFER_PIN);

+   if (got_standby_deadlock_timeout)
+   {
+       /*
+        * Send out a request for hot-standby backends to check themselves for
+        * deadlocks.
+        *
+        * XXX The subsequent ResolveRecoveryConflictWithBufferPin() will wait
+        * to be signaled by UnpinBuffer() again and send a request for
+        * deadlocks check if deadlock_timeout happens. This causes the
+        * request to continue to be sent every deadlock_timeout until the
+        * buffer is unpinned or ltime is reached. This would increase the
+        * workload in the startup process and backends. In practice it may
+        * not be so harmful because the period that the buffer is kept pinned
+        * is basically no so long. But we should fix this?
+        */
+       SendRecoveryConflictWithBufferPin(
+
PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK);
+       got_standby_deadlock_timeout = false;
+   }
+

Since SendRecoveryConflictWithBufferPin() sends the signal to all
backends every backend who is waiting on a lock at ProcSleep() and not
holding a buffer pin blocking the startup process will end up doing a
deadlock check, which seems expensive. What is worse is that the
deadlock will not be detected because the deadlock involving a buffer
pin isn't detected by CheckDeadLock(). I thought we can replace
PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK with
PROCSIG_RECOVERY_CONFLICT_BUFFERPIN but it’s not good because the
backend who has a buffer pin blocking the startup process and not
waiting on a lock is also canceled after deadlock_timeout. We can have
the backend return from RecoveryConflictInterrupt() when it received
PROCSIG_RECOVERY_CONFLICT_BUFFERPIN and is not waiting on any lock,
but it’s also not good because we cannot cancel the backend after
max_standby_streaming_delay that has a buffer pin blocking the startup
process. So I wonder if we can have a new signal. When the backend
received this signal it returns from RecoveryConflictInterrupt()
without deadlock check either if it’s not waiting on any lock or if it
doesn’t have a buffer pin blocking the startup process. Otherwise it's
cancelled.

Thanks for pointing out that issue! Using new signal is an idea. Another idea
is to make a backend skip check the deadlock if GetStartupBufferPinWaitBufId()
returns -1, i.e., the startup process is not waiting for buffer pin. So,
what I'm thinkins is;

In RecoveryConflictInterrupt(), when a backend receive
PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK,

1. If a backend isn't waiting for a lock, it does nothing .
2. If a backend is waiting for a lock and also holding a buffer pin that
   delays recovery, it may be canceled.
3. If a backend is waiting for a lock and the startup process is not waiting
   for buffer pin (i.e., the startup process is also waiting for a lock),
   it checks for the deadlocks.
4. If a backend is waiting for a lock and isn't holding a buffer pin that
   delays recovery though the startup process is waiting for buffer pin,
   it does nothing.

I implemented this. Patch attached.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Attachments:

recovery_conflict_lock_deadlock_v4.patch (text/plain)
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index ee912b9d5e..7c1a3f8b3f 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -3324,6 +3324,13 @@ GetConflictingVirtualXIDs(TransactionId limitXmin, Oid dbOid)
  */
 pid_t
 CancelVirtualTransaction(VirtualTransactionId vxid, ProcSignalReason sigmode)
+{
+	return SignalVirtualTransaction(vxid, sigmode, true);
+}
+
+pid_t
+SignalVirtualTransaction(VirtualTransactionId vxid, ProcSignalReason sigmode,
+						 bool conflictPending)
 {
 	ProcArrayStruct *arrayP = procArray;
 	int			index;
@@ -3342,7 +3349,7 @@ CancelVirtualTransaction(VirtualTransactionId vxid, ProcSignalReason sigmode)
 		if (procvxid.backendId == vxid.backendId &&
 			procvxid.localTransactionId == vxid.localTransactionId)
 		{
-			proc->recoveryConflictPending = true;
+			proc->recoveryConflictPending = conflictPending;
 			pid = proc->pid;
 			if (pid != 0)
 			{
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 92d9027776..3164aed423 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -42,6 +42,10 @@ int			max_standby_streaming_delay = 30 * 1000;
 
 static HTAB *RecoveryLockLists;
 
+/* Flags set by timeout handlers */
+static volatile sig_atomic_t got_standby_deadlock_timeout = false;
+static volatile sig_atomic_t got_standby_lock_timeout = false;
+
 static void ResolveRecoveryConflictWithVirtualXIDs(VirtualTransactionId *waitlist,
 												   ProcSignalReason reason,
 												   uint32 wait_event_info,
@@ -395,8 +399,10 @@ ResolveRecoveryConflictWithDatabase(Oid dbid)
  * lock.  As we are already queued to be granted the lock, no new lock
  * requests conflicting with ours will be granted in the meantime.
  *
- * Deadlocks involving the Startup process and an ordinary backend process
- * will be detected by the deadlock detector within the ordinary backend.
+ * We also must check for deadlocks involving the Startup process and
+ * hot-standby backend processes. If deadlock_timeout is reached in
+ * this function, all the backends holding the conflicting locks are
+ * requested to check themselves for deadlocks.
  */
 void
 ResolveRecoveryConflictWithLock(LOCKTAG locktag)
@@ -407,7 +413,7 @@ ResolveRecoveryConflictWithLock(LOCKTAG locktag)
 
 	ltime = GetStandbyLimitTime();
 
-	if (GetCurrentTimestamp() >= ltime)
+	if (GetCurrentTimestamp() >= ltime && ltime != 0)
 	{
 		/*
 		 * We're already behind, so clear a path as quickly as possible.
@@ -429,19 +435,76 @@ ResolveRecoveryConflictWithLock(LOCKTAG locktag)
 	else
 	{
 		/*
-		 * Wait (or wait again) until ltime
+		 * Wait (or wait again) until ltime, and check for deadlocks as well
+		 * if we will be waiting longer than deadlock_timeout
 		 */
-		EnableTimeoutParams timeouts[1];
+		EnableTimeoutParams timeouts[2];
+		int			cnt = 0;
 
-		timeouts[0].id = STANDBY_LOCK_TIMEOUT;
-		timeouts[0].type = TMPARAM_AT;
-		timeouts[0].fin_time = ltime;
-		enable_timeouts(timeouts, 1);
+		if (ltime != 0)
+		{
+			got_standby_lock_timeout = false;
+			timeouts[cnt].id = STANDBY_LOCK_TIMEOUT;
+			timeouts[cnt].type = TMPARAM_AT;
+			timeouts[cnt].fin_time = ltime;
+			cnt++;
+		}
+
+		got_standby_deadlock_timeout = false;
+		timeouts[cnt].id = STANDBY_DEADLOCK_TIMEOUT;
+		timeouts[cnt].type = TMPARAM_AFTER;
+		timeouts[cnt].delay_ms = DeadlockTimeout;
+		cnt++;
+
+		enable_timeouts(timeouts, cnt);
 	}
 
 	/* Wait to be signaled by the release of the Relation Lock */
 	ProcWaitForSignal(PG_WAIT_LOCK | locktag.locktag_type);
 
+	/*
+	 * Exit if ltime is reached. Then all the backends holding conflicting
+	 * locks will be canceled in the next ResolveRecoveryConflictWithLock()
+	 * call.
+	 */
+	if (got_standby_lock_timeout)
+		goto cleanup;
+
+	if (got_standby_deadlock_timeout)
+	{
+		VirtualTransactionId *backends;
+
+		backends = GetLockConflicts(&locktag, AccessExclusiveLock, NULL);
+
+		/* Quick exit if there's no work to be done */
+		if (!VirtualTransactionIdIsValid(*backends))
+			goto cleanup;
+
+		/*
+		 * Send signals to all the backends holding the conflicting locks, to
+		 * ask them to check themselves for deadlocks.
+		 */
+		while (VirtualTransactionIdIsValid(*backends))
+		{
+			SignalVirtualTransaction(*backends,
+									 PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK,
+									 false);
+			backends++;
+		}
+
+		/*
+		 * Wait again here to be signaled by the release of the Relation Lock,
+		 * to prevent the subsequent RecoveryConflictWithLock() from causing
+		 * deadlock_timeout and sending a request for deadlocks check again.
+		 * Otherwise the request continues to be sent every deadlock_timeout
+		 * until the relation locks are released or ltime is reached.
+		 */
+		got_standby_deadlock_timeout = false;
+		ProcWaitForSignal(PG_WAIT_LOCK | locktag.locktag_type);
+	}
+
+cleanup:
+
 	/*
 	 * Clear any timeout requests established above.  We assume here that the
 	 * Startup process doesn't have any other outstanding timeouts than those
@@ -449,6 +512,8 @@ ResolveRecoveryConflictWithLock(LOCKTAG locktag)
 	 * timeouts individually, but that'd be slower.
 	 */
 	disable_all_timeouts(false);
+	got_standby_lock_timeout = false;
+	got_standby_deadlock_timeout = false;
 }
 
 /*
@@ -513,9 +578,12 @@ ResolveRecoveryConflictWithBufferPin(void)
 		timeouts[0].id = STANDBY_TIMEOUT;
 		timeouts[0].type = TMPARAM_AT;
 		timeouts[0].fin_time = ltime;
+
+		got_standby_deadlock_timeout = false;
 		timeouts[1].id = STANDBY_DEADLOCK_TIMEOUT;
 		timeouts[1].type = TMPARAM_AFTER;
 		timeouts[1].delay_ms = DeadlockTimeout;
+
 		enable_timeouts(timeouts, 2);
 	}
 
@@ -529,6 +597,26 @@ ResolveRecoveryConflictWithBufferPin(void)
 	 */
 	ProcWaitForSignal(PG_WAIT_BUFFER_PIN);
 
+	if (got_standby_deadlock_timeout)
+	{
+		/*
+		 * Send out a request for hot-standby backends to check themselves for
+		 * deadlocks.
+		 *
+		 * XXX The subsequent ResolveRecoveryConflictWithBufferPin() will wait
+		 * to be signaled by UnpinBuffer() again and send a request for
+		 * deadlocks check if deadlock_timeout happens. This causes the
+		 * request to continue to be sent every deadlock_timeout until the
+		 * buffer is unpinned or ltime is reached. This would increase the
+		 * workload in the startup process and backends. In practice it may
+		 * not be so harmful because the period that the buffer is kept pinned
+        * is basically not so long. But we should fix this?
+		 */
+		SendRecoveryConflictWithBufferPin(
+										  PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK);
+		got_standby_deadlock_timeout = false;
+	}
+
 	/*
 	 * Clear any timeout requests established above.  We assume here that the
 	 * Startup process doesn't have any other timeouts than what this function
@@ -595,13 +683,12 @@ CheckRecoveryConflictDeadlock(void)
 
 /*
  * StandbyDeadLockHandler() will be called if STANDBY_DEADLOCK_TIMEOUT
- * occurs before STANDBY_TIMEOUT.  Send out a request for hot-standby
- * backends to check themselves for deadlocks.
+ * occurs before STANDBY_TIMEOUT.
  */
 void
 StandbyDeadLockHandler(void)
 {
-	SendRecoveryConflictWithBufferPin(PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK);
+	got_standby_deadlock_timeout = true;
 }
 
 /*
@@ -620,11 +707,11 @@ StandbyTimeoutHandler(void)
 
 /*
  * StandbyLockTimeoutHandler() will be called if STANDBY_LOCK_TIMEOUT is exceeded.
- * This doesn't need to do anything, simply waking up is enough.
  */
 void
 StandbyLockTimeoutHandler(void)
 {
+	got_standby_lock_timeout = true;
 }
 
 /*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 7dc3911590..45a99ac0a4 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -1793,6 +1793,9 @@ CheckDeadLockAlert(void)
 	 * Have to set the latch again, even if handle_sig_alarm already did. Back
 	 * then got_deadlock_timeout wasn't yet set... It's unlikely that this
 	 * ever would be a problem, but setting a set latch again is cheap.
+	 *
+	 * Note that, when this function runs inside procsignal_sigusr1_handler(),
+	 * the handler function sets the latch again after the latch is set here.
 	 */
 	SetLatch(MyLatch);
 	errno = save_errno;
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 3679799e50..708cd8b9e5 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -2919,11 +2919,23 @@ RecoveryConflictInterrupt(ProcSignalReason reason)
 			case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
 
 				/*
-				 * If we aren't blocking the Startup process there is nothing
-				 * more to do.
+				 * If PROCSIG_RECOVERY_CONFLICT_BUFFERPIN is requested but we
+				 * aren't blocking the Startup process there is nothing more
+				 * to do.
+				 *
+				 * When PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK is
+				 * requested, if we're waiting for locks and the startup
+				 * process is not waiting for buffer pin (i.e., also waiting
+				 * for locks), we set the flag so that ProcSleep() will check
+				 * for deadlocks.
 				 */
 				if (!HoldingBufferPinThatDelaysRecovery())
+				{
+					if (reason == PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK &&
+						GetStartupBufferPinWaitBufId() < 0)
+						CheckDeadLockAlert();
 					return;
+				}
 
 				MyProc->recoveryConflictPending = true;
 
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index ea8a876ca4..4d272c7287 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -72,6 +72,8 @@ extern VirtualTransactionId *GetCurrentVirtualXIDs(TransactionId limitXmin,
 												   int *nvxids);
 extern VirtualTransactionId *GetConflictingVirtualXIDs(TransactionId limitXmin, Oid dbOid);
 extern pid_t CancelVirtualTransaction(VirtualTransactionId vxid, ProcSignalReason sigmode);
+extern pid_t SignalVirtualTransaction(VirtualTransactionId vxid, ProcSignalReason sigmode,
+									  bool conflictPending);
 
 extern bool MinimumActiveBackends(int min);
 extern int	CountDBBackends(Oid databaseid);
#11 Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Fujii Masao (#10)
Re: Deadlock between backend and recovery may not be detected

On Tue, Dec 22, 2020 at 11:58 PM Fujii Masao
<masao.fujii@oss.nttdata.com> wrote:

On 2020/12/22 20:42, Fujii Masao wrote:

On 2020/12/22 10:25, Masahiko Sawada wrote:

On Fri, Dec 18, 2020 at 6:36 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

On 2020/12/17 2:15, Fujii Masao wrote:

On 2020/12/16 23:28, Drouvot, Bertrand wrote:

Hi,

On 12/16/20 2:36 PM, Victor Yegorov wrote:

Wed, 16 Dec 2020 at 13:49, Fujii Masao <masao.fujii@oss.nttdata.com>:

After doing this procedure, you can see the startup process and backend
wait for the table lock each other, i.e., deadlock. But this deadlock remains
even after deadlock_timeout passes.

This seems a bug to me.

+1

* Deadlocks involving the Startup process and an ordinary backend process
* will be detected by the deadlock detector within the ordinary backend.

The cause of this issue seems that ResolveRecoveryConflictWithLock() that
the startup process calls when recovery conflict on lock happens doesn't
take care of deadlock case at all. You can see this fact by reading the above
source code comment for ResolveRecoveryConflictWithLock().

To fix this issue, I think that we should enable STANDBY_DEADLOCK_TIMEOUT
timer in ResolveRecoveryConflictWithLock() so that the startup process can
send PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK signal to the backend.
Then if PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK signal arrives,
the backend should check whether the deadlock actually happens or not.
Attached is the POC patch implimenting this.

good catch!

I don't see any obvious reasons why the STANDBY_DEADLOCK_TIMEOUT shouldn't be set in ResolveRecoveryConflictWithLock() too (it is already set in ResolveRecoveryConflictWithBufferPin()).

So + 1 to consider this as a bug and for the way the patch proposes to fix it.

Thanks Victor and Bertrand for agreeing!
Attached is the updated version of the patch.

Attached is v3 of the patch. Could you review this version?

While the startup process is waiting for recovery conflict on buffer pin,
it repeats sending the request for deadlock check to all the backends
every deadlock_timeout. This may increase the workload in the startup
process and backends, but since this is the original behavior, the patch
doesn't change that. Also in practice this may not be so harmful because
the period that the buffer is kept pinned is basically not so long.

@@ -529,6 +603,26 @@ ResolveRecoveryConflictWithBufferPin(void)
*/
ProcWaitForSignal(PG_WAIT_BUFFER_PIN);

+   if (got_standby_deadlock_timeout)
+   {
+       /*
+        * Send out a request for hot-standby backends to check themselves for
+        * deadlocks.
+        *
+        * XXX The subsequent ResolveRecoveryConflictWithBufferPin() will wait
+        * to be signaled by UnpinBuffer() again and send a request for
+        * deadlocks check if deadlock_timeout happens. This causes the
+        * request to continue to be sent every deadlock_timeout until the
+        * buffer is unpinned or ltime is reached. This would increase the
+        * workload in the startup process and backends. In practice it may
+        * not be so harmful because the period that the buffer is kept pinned
+        * is basically not so long. But we should fix this?
+        */
+       SendRecoveryConflictWithBufferPin(
+                                         PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK);
+       got_standby_deadlock_timeout = false;
+   }
+

Since SendRecoveryConflictWithBufferPin() sends the signal to all
backends, every backend that is waiting on a lock in ProcSleep() and not
holding a buffer pin blocking the startup process will end up doing a
deadlock check, which seems expensive. What is worse is that the
deadlock will not be detected, because a deadlock involving a buffer
pin isn't detected by CheckDeadLock(). I thought we could replace
PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK with
PROCSIG_RECOVERY_CONFLICT_BUFFERPIN, but that's not good because a
backend that has a buffer pin blocking the startup process and is not
waiting on a lock would also be canceled after deadlock_timeout. We
could have the backend return from RecoveryConflictInterrupt() when it
receives PROCSIG_RECOVERY_CONFLICT_BUFFERPIN and is not waiting on any
lock, but that's also not good because then we cannot cancel, after
max_standby_streaming_delay, a backend that has a buffer pin blocking
the startup process. So I wonder if we can have a new signal. When the
backend receives this signal, it returns from RecoveryConflictInterrupt()
without a deadlock check if either it's not waiting on any lock or it
doesn't have a buffer pin blocking the startup process. Otherwise it's
cancelled.

Thanks for pointing out that issue! Using a new signal is one idea. Another
idea is to make a backend skip the deadlock check if
GetStartupBufferPinWaitBufId() returns -1, i.e., the startup process is not
waiting for a buffer pin. So what I'm thinking is the following:

In RecoveryConflictInterrupt(), when a backend receives
PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK,

1. If a backend isn't waiting for a lock, it does nothing.
2. If a backend is waiting for a lock and also holding a buffer pin that
   delays recovery, it may be canceled.
3. If a backend is waiting for a lock and the startup process is not waiting
   for a buffer pin (i.e., the startup process is also waiting for a lock),
   it checks for deadlocks.
4. If a backend is waiting for a lock and isn't holding a buffer pin that
   delays recovery, even though the startup process is waiting for a buffer
   pin, it does nothing.

Good idea! It could still happen that the backend ends up doing a
pointless deadlock check, if the startup process sets
startupBufferPinWaitBufId to -1 after sending the signal and before
the backend checks it. But the likelihood would be low in practice.

I have two small comments on ResolveRecoveryConflictWithBufferPin() in
the v4 patch:

The code currently has three branches, as follows:

if (ltime == 0)
{
    enable a timeout for deadlock;
}
else if (GetCurrentTimestamp() >= ltime)
{
    send recovery conflict signal;
}
else
{
    enable two timeouts: ltime and deadlock;
}

I think we can rearrange the code similarly to the changes you made to
ResolveRecoveryConflictWithLock():

if (GetCurrentTimestamp() >= ltime && ltime != 0)
{
    resolve recovery conflict;
}
else
{
    enable one or two timeouts: ltime and deadlock;
}
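
In terms of the actual code, the rearranged branch could look roughly like
this (just a sketch of the structure I have in mind, reusing the
timeout-array pattern from your ResolveRecoveryConflictWithLock() changes):

    if (GetCurrentTimestamp() >= ltime && ltime != 0)
    {
        /* We're already behind, so clear a path as quickly as possible. */
        SendRecoveryConflictWithBufferPin(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN);
    }
    else
    {
        /*
         * Enable the limit-time timeout only when ltime is set, and always
         * enable the deadlock timeout.
         */
        EnableTimeoutParams timeouts[2];
        int         cnt = 0;

        if (ltime != 0)
        {
            timeouts[cnt].id = STANDBY_TIMEOUT;
            timeouts[cnt].type = TMPARAM_AT;
            timeouts[cnt].fin_time = ltime;
            cnt++;
        }

        got_standby_deadlock_timeout = false;
        timeouts[cnt].id = STANDBY_DEADLOCK_TIMEOUT;
        timeouts[cnt].type = TMPARAM_AFTER;
        timeouts[cnt].delay_ms = DeadlockTimeout;
        cnt++;

        enable_timeouts(timeouts, cnt);
    }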

It's more consistent with ResolveRecoveryConflictWithLock(). Also, the
patch currently doesn't reset got_standby_deadlock_timeout in the
(ltime == 0) case, but that would also be resolved by this rearrangement.

---
This isn't strictly necessary if we always reset got_standby_deadlock_timeout
before waiting, but we might also want to clear got_standby_deadlock_timeout
after disabling all timeouts, to ensure that it's cleared at the end of the
function. In ResolveRecoveryConflictWithLock() we clear both
got_standby_lock_timeout and got_standby_deadlock_timeout after disabling
all timeouts, but we don't do that in
ResolveRecoveryConflictWithBufferPin().

Regards,

--
Masahiko Sawada
EnterpriseDB: https://www.enterprisedb.com/

#12 Fujii Masao
masao.fujii@oss.nttdata.com
In reply to: Masahiko Sawada (#11)
1 attachment(s)
Re: Deadlock between backend and recovery may not be detected

On 2020/12/23 19:28, Masahiko Sawada wrote:

On Tue, Dec 22, 2020 at 11:58 PM Fujii Masao
<masao.fujii@oss.nttdata.com> wrote:

On 2020/12/22 20:42, Fujii Masao wrote:

On 2020/12/22 10:25, Masahiko Sawada wrote:

On Fri, Dec 18, 2020 at 6:36 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

On 2020/12/17 2:15, Fujii Masao wrote:

On 2020/12/16 23:28, Drouvot, Bertrand wrote:

Hi,

On 12/16/20 2:36 PM, Victor Yegorov wrote:

Wed, 16 Dec 2020 at 13:49, Fujii Masao <masao.fujii@oss.nttdata.com>:

After doing this procedure, you can see the startup process and backend
wait for the table lock each other, i.e., deadlock. But this deadlock remains
even after deadlock_timeout passes.

This seems a bug to me.

+1

* Deadlocks involving the Startup process and an ordinary backend process
* will be detected by the deadlock detector within the ordinary backend.

The cause of this issue seems that ResolveRecoveryConflictWithLock() that
the startup process calls when recovery conflict on lock happens doesn't
take care of deadlock case at all. You can see this fact by reading the above
source code comment for ResolveRecoveryConflictWithLock().

To fix this issue, I think that we should enable STANDBY_DEADLOCK_TIMEOUT
timer in ResolveRecoveryConflictWithLock() so that the startup process can
send PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK signal to the backend.
Then if PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK signal arrives,
the backend should check whether the deadlock actually happens or not.
Attached is the POC patch implimenting this.

good catch!

I don't see any obvious reasons why the STANDBY_DEADLOCK_TIMEOUT shouldn't be set in ResolveRecoveryConflictWithLock() too (it is already set in ResolveRecoveryConflictWithBufferPin()).

So + 1 to consider this as a bug and for the way the patch proposes to fix it.

Thanks Victor and Bertrand for agreeing!
Attached is the updated version of the patch.

Attached is v3 of the patch. Could you review this version?

While the startup process is waiting for recovery conflict on buffer pin,
it repeats sending the request for deadlock check to all the backends
every deadlock_timeout. This may increase the workload in the startup
process and backends, but since this is the original behavior, the patch
doesn't change that. Also in practice this may not be so harmful because
the period that the buffer is kept pinned is basically not so long.

@@ -529,6 +603,26 @@ ResolveRecoveryConflictWithBufferPin(void)
*/
ProcWaitForSignal(PG_WAIT_BUFFER_PIN);

+   if (got_standby_deadlock_timeout)
+   {
+       /*
+        * Send out a request for hot-standby backends to check themselves for
+        * deadlocks.
+        *
+        * XXX The subsequent ResolveRecoveryConflictWithBufferPin() will wait
+        * to be signaled by UnpinBuffer() again and send a request for
+        * deadlocks check if deadlock_timeout happens. This causes the
+        * request to continue to be sent every deadlock_timeout until the
+        * buffer is unpinned or ltime is reached. This would increase the
+        * workload in the startup process and backends. In practice it may
+        * not be so harmful because the period that the buffer is kept pinned
+        * is basically not so long. But we should fix this?
+        */
+       SendRecoveryConflictWithBufferPin(
+                                         PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK);
+       got_standby_deadlock_timeout = false;
+   }
+

Since SendRecoveryConflictWithBufferPin() sends the signal to all
backends, every backend that is waiting on a lock in ProcSleep() and not
holding a buffer pin blocking the startup process will end up doing a
deadlock check, which seems expensive. What is worse is that the
deadlock will not be detected, because a deadlock involving a buffer
pin isn't detected by CheckDeadLock(). I thought we could replace
PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK with
PROCSIG_RECOVERY_CONFLICT_BUFFERPIN, but that's not good because a
backend that has a buffer pin blocking the startup process and is not
waiting on a lock would also be canceled after deadlock_timeout. We
could have the backend return from RecoveryConflictInterrupt() when it
receives PROCSIG_RECOVERY_CONFLICT_BUFFERPIN and is not waiting on any
lock, but that's also not good because then we cannot cancel, after
max_standby_streaming_delay, a backend that has a buffer pin blocking
the startup process. So I wonder if we can have a new signal. When the
backend receives this signal, it returns from RecoveryConflictInterrupt()
without a deadlock check if either it's not waiting on any lock or it
doesn't have a buffer pin blocking the startup process. Otherwise it's
cancelled.

Thanks for pointing out that issue! Using a new signal is one idea. Another
idea is to make a backend skip the deadlock check if
GetStartupBufferPinWaitBufId() returns -1, i.e., the startup process is not
waiting for a buffer pin. So what I'm thinking is the following:

In RecoveryConflictInterrupt(), when a backend receives
PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK,

1. If a backend isn't waiting for a lock, it does nothing.
2. If a backend is waiting for a lock and also holding a buffer pin that
   delays recovery, it may be canceled.
3. If a backend is waiting for a lock and the startup process is not waiting
   for a buffer pin (i.e., the startup process is also waiting for a lock),
   it checks for deadlocks.
4. If a backend is waiting for a lock and isn't holding a buffer pin that
   delays recovery, even though the startup process is waiting for a buffer
   pin, it does nothing.

Good idea! It could still happen that the backend ends up doing a
pointless deadlock check, if the startup process sets
startupBufferPinWaitBufId to -1 after sending the signal and before
the backend checks it. But the likelihood would be low in practice.

I have two small comments on ResolveRecoveryConflictWithBufferPin() in
the v4 patch:

The code currently has three branches, as follows:

if (ltime == 0)
{
    enable a timeout for deadlock;
}
else if (GetCurrentTimestamp() >= ltime)
{
    send recovery conflict signal;
}
else
{
    enable two timeouts: ltime and deadlock;
}

I think we can rearrange the code similarly to the changes you made to
ResolveRecoveryConflictWithLock():

if (GetCurrentTimestamp() >= ltime && ltime != 0)
{
    resolve recovery conflict;
}
else
{
    enable one or two timeouts: ltime and deadlock;
}

It's more consistent with ResolveRecoveryConflictWithLock(). Also, the
patch currently doesn't reset got_standby_deadlock_timeout in the
(ltime == 0) case, but that would also be resolved by this rearrangement.

I wanted to avoid changing the code structure as much as possible because
the patch needs to be back-patched. But it's a good idea to make the code
structures in ResolveRecoveryConflictWithLock() and ...WithBufferPin() similar,
to make the code simpler and easier to read. So I agree with you. Attached
is the updated version of the patch. What about this version?

---
This isn't strictly necessary if we always reset got_standby_deadlock_timeout
before waiting, but we might also want to clear got_standby_deadlock_timeout
after disabling all timeouts, to ensure that it's cleared at the end of the
function. In ResolveRecoveryConflictWithLock() we clear both
got_standby_lock_timeout and got_standby_deadlock_timeout after disabling
all timeouts, but we don't do that in
ResolveRecoveryConflictWithBufferPin().

I agree that it's better to reset got_standby_deadlock_timeout after
all the timeouts are disabled. So I changed the patch that way. OTOH
got_standby_lock_timeout doesn't need to be reset because it's never
enabled in ResolveRecoveryConflictWithBufferPin(). No?
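
To be concrete, the tail of ResolveRecoveryConflictWithBufferPin() in the
attached patch now looks like this (comment omitted):

    disable_all_timeouts(false);
    got_standby_deadlock_timeout = false;
}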

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Attachments:

recovery_conflict_lock_deadlock_v5.patch (text/plain)
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index ee912b9d5e..7c1a3f8b3f 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -3324,6 +3324,13 @@ GetConflictingVirtualXIDs(TransactionId limitXmin, Oid dbOid)
  */
 pid_t
 CancelVirtualTransaction(VirtualTransactionId vxid, ProcSignalReason sigmode)
+{
+	return SignalVirtualTransaction(vxid, sigmode, true);
+}
+
+pid_t
+SignalVirtualTransaction(VirtualTransactionId vxid, ProcSignalReason sigmode,
+						 bool conflictPending)
 {
 	ProcArrayStruct *arrayP = procArray;
 	int			index;
@@ -3342,7 +3349,7 @@ CancelVirtualTransaction(VirtualTransactionId vxid, ProcSignalReason sigmode)
 		if (procvxid.backendId == vxid.backendId &&
 			procvxid.localTransactionId == vxid.localTransactionId)
 		{
-			proc->recoveryConflictPending = true;
+			proc->recoveryConflictPending = conflictPending;
 			pid = proc->pid;
 			if (pid != 0)
 			{
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 92d9027776..fd2cf8ae8a 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -42,6 +42,10 @@ int			max_standby_streaming_delay = 30 * 1000;
 
 static HTAB *RecoveryLockLists;
 
+/* Flags set by timeout handlers */
+static volatile sig_atomic_t got_standby_deadlock_timeout = false;
+static volatile sig_atomic_t got_standby_lock_timeout = false;
+
 static void ResolveRecoveryConflictWithVirtualXIDs(VirtualTransactionId *waitlist,
 												   ProcSignalReason reason,
 												   uint32 wait_event_info,
@@ -395,8 +399,10 @@ ResolveRecoveryConflictWithDatabase(Oid dbid)
  * lock.  As we are already queued to be granted the lock, no new lock
  * requests conflicting with ours will be granted in the meantime.
  *
- * Deadlocks involving the Startup process and an ordinary backend process
- * will be detected by the deadlock detector within the ordinary backend.
+ * We also must check for deadlocks involving the Startup process and
+ * hot-standby backend processes. If deadlock_timeout is reached in
+ * this function, all the backends holding the conflicting locks are
+ * requested to check themselves for deadlocks.
  */
 void
 ResolveRecoveryConflictWithLock(LOCKTAG locktag)
@@ -407,7 +413,7 @@ ResolveRecoveryConflictWithLock(LOCKTAG locktag)
 
 	ltime = GetStandbyLimitTime();
 
-	if (GetCurrentTimestamp() >= ltime)
+	if (GetCurrentTimestamp() >= ltime && ltime != 0)
 	{
 		/*
 		 * We're already behind, so clear a path as quickly as possible.
@@ -429,19 +435,76 @@ ResolveRecoveryConflictWithLock(LOCKTAG locktag)
 	else
 	{
 		/*
-		 * Wait (or wait again) until ltime
+		 * Wait (or wait again) until ltime, and check for deadlocks as well
+		 * if we will be waiting longer than deadlock_timeout
 		 */
-		EnableTimeoutParams timeouts[1];
+		EnableTimeoutParams timeouts[2];
+		int			cnt = 0;
 
-		timeouts[0].id = STANDBY_LOCK_TIMEOUT;
-		timeouts[0].type = TMPARAM_AT;
-		timeouts[0].fin_time = ltime;
-		enable_timeouts(timeouts, 1);
+		if (ltime != 0)
+		{
+			got_standby_lock_timeout = false;
+			timeouts[cnt].id = STANDBY_LOCK_TIMEOUT;
+			timeouts[cnt].type = TMPARAM_AT;
+			timeouts[cnt].fin_time = ltime;
+			cnt++;
+		}
+
+		got_standby_deadlock_timeout = false;
+		timeouts[cnt].id = STANDBY_DEADLOCK_TIMEOUT;
+		timeouts[cnt].type = TMPARAM_AFTER;
+		timeouts[cnt].delay_ms = DeadlockTimeout;
+		cnt++;
+
+		enable_timeouts(timeouts, cnt);
 	}
 
 	/* Wait to be signaled by the release of the Relation Lock */
 	ProcWaitForSignal(PG_WAIT_LOCK | locktag.locktag_type);
 
+	/*
+	 * Exit if ltime is reached. Then all the backends holding conflicting
+	 * locks will be canceled in the next ResolveRecoveryConflictWithLock()
+	 * call.
+	 */
+	if (got_standby_lock_timeout)
+		goto cleanup;
+
+	if (got_standby_deadlock_timeout)
+	{
+		VirtualTransactionId *backends;
+
+		backends = GetLockConflicts(&locktag, AccessExclusiveLock, NULL);
+
+		/* Quick exit if there's no work to be done */
+		if (!VirtualTransactionIdIsValid(*backends))
+			goto cleanup;
+
+		/*
+		 * Send signals to all the backends holding the conflicting locks, to
+		 * ask them to check themselves for deadlocks.
+		 */
+		while (VirtualTransactionIdIsValid(*backends))
+		{
+			SignalVirtualTransaction(*backends,
+									 PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK,
+									 false);
+			backends++;
+		}
+
+		/*
+		 * Wait again here to be signaled by the release of the Relation Lock,
+		 * to prevent the subsequent RecoveryConflictWithLock() from causing
+		 * deadlock_timeout and sending a request for deadlocks check again.
+		 * Otherwise the request continues to be sent every deadlock_timeout
+		 * until the relation locks are released or ltime is reached.
+		 */
+		got_standby_deadlock_timeout = false;
+		ProcWaitForSignal(PG_WAIT_LOCK | locktag.locktag_type);
+	}
+
+cleanup:
+
 	/*
 	 * Clear any timeout requests established above.  We assume here that the
 	 * Startup process doesn't have any other outstanding timeouts than those
@@ -449,6 +512,8 @@ ResolveRecoveryConflictWithLock(LOCKTAG locktag)
 	 * timeouts individually, but that'd be slower.
 	 */
 	disable_all_timeouts(false);
+	got_standby_lock_timeout = false;
+	got_standby_deadlock_timeout = false;
 }
 
 /*
@@ -487,15 +552,7 @@ ResolveRecoveryConflictWithBufferPin(void)
 
 	ltime = GetStandbyLimitTime();
 
-	if (ltime == 0)
-	{
-		/*
-		 * We're willing to wait forever for conflicts, so set timeout for
-		 * deadlock check only
-		 */
-		enable_timeout_after(STANDBY_DEADLOCK_TIMEOUT, DeadlockTimeout);
-	}
-	else if (GetCurrentTimestamp() >= ltime)
+	if (GetCurrentTimestamp() >= ltime && ltime != 0)
 	{
 		/*
 		 * We're already behind, so clear a path as quickly as possible.
@@ -509,14 +566,23 @@ ResolveRecoveryConflictWithBufferPin(void)
 		 * waiting longer than deadlock_timeout
 		 */
 		EnableTimeoutParams timeouts[2];
+		int			cnt = 0;
 
-		timeouts[0].id = STANDBY_TIMEOUT;
-		timeouts[0].type = TMPARAM_AT;
-		timeouts[0].fin_time = ltime;
-		timeouts[1].id = STANDBY_DEADLOCK_TIMEOUT;
-		timeouts[1].type = TMPARAM_AFTER;
-		timeouts[1].delay_ms = DeadlockTimeout;
-		enable_timeouts(timeouts, 2);
+		if (ltime != 0)
+		{
+			timeouts[cnt].id = STANDBY_TIMEOUT;
+			timeouts[cnt].type = TMPARAM_AT;
+			timeouts[cnt].fin_time = ltime;
+			cnt++;
+		}
+
+		got_standby_deadlock_timeout = false;
+		timeouts[cnt].id = STANDBY_DEADLOCK_TIMEOUT;
+		timeouts[cnt].type = TMPARAM_AFTER;
+		timeouts[cnt].delay_ms = DeadlockTimeout;
+		cnt++;
+
+		enable_timeouts(timeouts, cnt);
 	}
 
 	/*
@@ -529,6 +595,25 @@ ResolveRecoveryConflictWithBufferPin(void)
 	 */
 	ProcWaitForSignal(PG_WAIT_BUFFER_PIN);
 
+	if (got_standby_deadlock_timeout)
+	{
+		/*
+		 * Send out a request for hot-standby backends to check themselves for
+		 * deadlocks.
+		 *
+		 * XXX The subsequent ResolveRecoveryConflictWithBufferPin() will wait
+		 * to be signaled by UnpinBuffer() again and send a request for
+		 * deadlocks check if deadlock_timeout happens. This causes the
+		 * request to continue to be sent every deadlock_timeout until the
+		 * buffer is unpinned or ltime is reached. This would increase the
+		 * workload in the startup process and backends. In practice it may
+		 * not be so harmful because the period that the buffer is kept pinned
+        * is basically not so long. But we should fix this?
+		 */
+		SendRecoveryConflictWithBufferPin(
+										  PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK);
+	}
+
 	/*
 	 * Clear any timeout requests established above.  We assume here that the
 	 * Startup process doesn't have any other timeouts than what this function
@@ -536,6 +621,7 @@ ResolveRecoveryConflictWithBufferPin(void)
 	 * individually, but that'd be slower.
 	 */
 	disable_all_timeouts(false);
+	got_standby_deadlock_timeout = false;
 }
 
 static void
@@ -595,13 +681,12 @@ CheckRecoveryConflictDeadlock(void)
 
 /*
  * StandbyDeadLockHandler() will be called if STANDBY_DEADLOCK_TIMEOUT
- * occurs before STANDBY_TIMEOUT.  Send out a request for hot-standby
- * backends to check themselves for deadlocks.
+ * occurs before STANDBY_TIMEOUT.
  */
 void
 StandbyDeadLockHandler(void)
 {
-	SendRecoveryConflictWithBufferPin(PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK);
+	got_standby_deadlock_timeout = true;
 }
 
 /*
@@ -620,11 +705,11 @@ StandbyTimeoutHandler(void)
 
 /*
  * StandbyLockTimeoutHandler() will be called if STANDBY_LOCK_TIMEOUT is exceeded.
- * This doesn't need to do anything, simply waking up is enough.
  */
 void
 StandbyLockTimeoutHandler(void)
 {
+	got_standby_lock_timeout = true;
 }
 
 /*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 7dc3911590..45a99ac0a4 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -1793,6 +1793,9 @@ CheckDeadLockAlert(void)
 	 * Have to set the latch again, even if handle_sig_alarm already did. Back
 	 * then got_deadlock_timeout wasn't yet set... It's unlikely that this
 	 * ever would be a problem, but setting a set latch again is cheap.
+	 *
+	 * Note that, when this function runs inside procsignal_sigusr1_handler(),
+	 * the handler function sets the latch again after the latch is set here.
 	 */
 	SetLatch(MyLatch);
 	errno = save_errno;
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 3679799e50..708cd8b9e5 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -2919,11 +2919,23 @@ RecoveryConflictInterrupt(ProcSignalReason reason)
 			case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
 
 				/*
-				 * If we aren't blocking the Startup process there is nothing
-				 * more to do.
+				 * If PROCSIG_RECOVERY_CONFLICT_BUFFERPIN is requested but we
+				 * aren't blocking the Startup process there is nothing more
+				 * to do.
+				 *
+				 * When PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK is
+				 * requested, if we're waiting for locks and the startup
+				 * process is not waiting for buffer pin (i.e., also waiting
+				 * for locks), we set the flag so that ProcSleep() will check
+				 * for deadlocks.
 				 */
 				if (!HoldingBufferPinThatDelaysRecovery())
+				{
+					if (reason == PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK &&
+						GetStartupBufferPinWaitBufId() < 0)
+						CheckDeadLockAlert();
 					return;
+				}
 
 				MyProc->recoveryConflictPending = true;
 
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index ea8a876ca4..4d272c7287 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -72,6 +72,8 @@ extern VirtualTransactionId *GetCurrentVirtualXIDs(TransactionId limitXmin,
 												   int *nvxids);
 extern VirtualTransactionId *GetConflictingVirtualXIDs(TransactionId limitXmin, Oid dbOid);
 extern pid_t CancelVirtualTransaction(VirtualTransactionId vxid, ProcSignalReason sigmode);
+extern pid_t SignalVirtualTransaction(VirtualTransactionId vxid, ProcSignalReason sigmode,
+									  bool conflictPending);
 
 extern bool MinimumActiveBackends(int min);
 extern int	CountDBBackends(Oid databaseid);
#13 Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Fujii Masao (#12)
Re: Deadlock between backend and recovery may not be detected

On Wed, Dec 23, 2020 at 9:42 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

On 2020/12/23 19:28, Masahiko Sawada wrote:

On Tue, Dec 22, 2020 at 11:58 PM Fujii Masao
<masao.fujii@oss.nttdata.com> wrote:

On 2020/12/22 20:42, Fujii Masao wrote:

On 2020/12/22 10:25, Masahiko Sawada wrote:

On Fri, Dec 18, 2020 at 6:36 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

On 2020/12/17 2:15, Fujii Masao wrote:

On 2020/12/16 23:28, Drouvot, Bertrand wrote:

Hi,

On 12/16/20 2:36 PM, Victor Yegorov wrote:

Wed, 16 Dec 2020 at 13:49, Fujii Masao <masao.fujii@oss.nttdata.com>:

After doing this procedure, you can see the startup process and backend
wait for the table lock each other, i.e., deadlock. But this deadlock remains
even after deadlock_timeout passes.

This seems a bug to me.

+1

* Deadlocks involving the Startup process and an ordinary backend process
* will be detected by the deadlock detector within the ordinary backend.

The cause of this issue seems that ResolveRecoveryConflictWithLock() that
the startup process calls when recovery conflict on lock happens doesn't
take care of deadlock case at all. You can see this fact by reading the above
source code comment for ResolveRecoveryConflictWithLock().

To fix this issue, I think that we should enable STANDBY_DEADLOCK_TIMEOUT
timer in ResolveRecoveryConflictWithLock() so that the startup process can
send PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK signal to the backend.
Then if PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK signal arrives,
the backend should check whether the deadlock actually happens or not.
Attached is the POC patch implimenting this.

good catch!

I don't see any obvious reasons why the STANDBY_DEADLOCK_TIMEOUT shouldn't be set in ResolveRecoveryConflictWithLock() too (it is already set in ResolveRecoveryConflictWithBufferPin()).

So + 1 to consider this as a bug and for the way the patch proposes to fix it.

Thanks Victor and Bertrand for agreeing!
Attached is the updated version of the patch.

Attached is v3 of the patch. Could you review this version?

While the startup process is waiting for recovery conflict on buffer pin,
it repeats sending the request for deadlock check to all the backends
every deadlock_timeout. This may increase the workload in the startup
process and backends, but since this is the original behavior, the patch
doesn't change that. Also in practice this may not be so harmful because
the period that the buffer is kept pinned is basically not so long.

@@ -529,6 +603,26 @@ ResolveRecoveryConflictWithBufferPin(void)
*/
ProcWaitForSignal(PG_WAIT_BUFFER_PIN);

+   if (got_standby_deadlock_timeout)
+   {
+       /*
+        * Send out a request for hot-standby backends to check themselves for
+        * deadlocks.
+        *
+        * XXX The subsequent ResolveRecoveryConflictWithBufferPin() will wait
+        * to be signaled by UnpinBuffer() again and send a request for
+        * deadlocks check if deadlock_timeout happens. This causes the
+        * request to continue to be sent every deadlock_timeout until the
+        * buffer is unpinned or ltime is reached. This would increase the
+        * workload in the startup process and backends. In practice it may
+        * not be so harmful because the period that the buffer is kept pinned
+        * is basically not so long. But we should fix this?
+        */
+       SendRecoveryConflictWithBufferPin(
+                                         PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK);
+       got_standby_deadlock_timeout = false;
+   }
+

Since SendRecoveryConflictWithBufferPin() sends the signal to all
backends, every backend that is waiting on a lock in ProcSleep() and not
holding a buffer pin blocking the startup process will end up doing a
deadlock check, which seems expensive. What is worse is that the
deadlock will not be detected, because a deadlock involving a buffer
pin isn't detected by CheckDeadLock(). I thought we could replace
PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK with
PROCSIG_RECOVERY_CONFLICT_BUFFERPIN, but that's not good because a
backend that has a buffer pin blocking the startup process and is not
waiting on a lock would also be canceled after deadlock_timeout. We
could have the backend return from RecoveryConflictInterrupt() when it
receives PROCSIG_RECOVERY_CONFLICT_BUFFERPIN and is not waiting on any
lock, but that's also not good because then we cannot cancel, after
max_standby_streaming_delay, a backend that has a buffer pin blocking
the startup process. So I wonder if we can have a new signal. When the
backend receives this signal, it returns from RecoveryConflictInterrupt()
without a deadlock check if either it's not waiting on any lock or it
doesn't have a buffer pin blocking the startup process. Otherwise it's
cancelled.

Thanks for pointing out that issue! Using a new signal is one idea. Another
idea is to make a backend skip the deadlock check if
GetStartupBufferPinWaitBufId() returns -1, i.e., the startup process is not
waiting for a buffer pin. So what I'm thinking is the following:

In RecoveryConflictInterrupt(), when a backend receives
PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK,

1. If a backend isn't waiting for a lock, it does nothing.
2. If a backend is waiting for a lock and also holding a buffer pin that
   delays recovery, it may be canceled.
3. If a backend is waiting for a lock and the startup process is not waiting
   for a buffer pin (i.e., the startup process is also waiting for a lock),
   it checks for deadlocks.
4. If a backend is waiting for a lock and isn't holding a buffer pin that
   delays recovery, even though the startup process is waiting for a buffer
   pin, it does nothing.

Good idea! It could still happen that the backend ends up doing a
pointless deadlock check, if the startup process sets
startupBufferPinWaitBufId to -1 after sending the signal and before
the backend checks it. But the likelihood would be low in practice.

I have two small comments on ResolveRecoveryConflictWithBufferPin() in
the v4 patch:

The code currently has three branches, as follows:

if (ltime == 0)
{
    enable a timeout for deadlock;
}
else if (GetCurrentTimestamp() >= ltime)
{
    send recovery conflict signal;
}
else
{
    enable two timeouts: ltime and deadlock;
}

I think we can rearrange the code similarly to the changes you made to
ResolveRecoveryConflictWithLock():

if (GetCurrentTimestamp() >= ltime && ltime != 0)
{
    resolve recovery conflict;
}
else
{
    enable one or two timeouts: ltime and deadlock;
}

It's more consistent with ResolveRecoveryConflictWithLock(). Also, the
patch currently doesn't reset got_standby_deadlock_timeout in the
(ltime == 0) case, but that would also be resolved by this rearrangement.

I wanted to avoid changing the code structure as much as possible because
the patch needs to be back-patched. But it's a good idea to make the code
structures in ResolveRecoveryConflictWithLock() and ...WithBufferPin() similar,
to make the code simpler and easier to read. So I agree with you. Attached
is the updated version of the patch. What about this version?

Thank you for updating the patch! The patch looks good to me.

---
If we always reset got_standby_deadlock_timeout before waiting it's
not necessary but we might want to clear got_standby_deadlock_timeout
also after disabling all timeouts to ensure that it's cleared at the
end of the function. In ResolveRecoveryConflictWithLock() we clear
both got_standby_lock_timeout and got_standby_deadlock_timeout after
disabling all timeouts but we don't do that in
ResolveRecoveryConflictWithBufferPin().

I agree that it's better to reset got_standby_deadlock_timeout after
all the timeouts are disabled. So I changed the patch that way. OTOH
got_standby_lock_timeout doesn't need to be reset because it's never
enabled in ResolveRecoveryConflictWithBufferPin(). No?

Yes, you're right. We need to clear only got_standby_deadlock_timeout.

Regards,

--
Masahiko Sawada
EnterpriseDB: https://www.enterprisedb.com/

#14 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Fujii Masao (#12)
Re: Deadlock between backend and recovery may not be detected

At Wed, 23 Dec 2020 21:42:47 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in

you. Attached
is the updated version of the patch. What about this version?

The patch contains a hunk in the following structure.

+	if (got_standby_lock_timeout)
+		goto cleanup;
+
+	if (got_standby_deadlock_timeout)
+	{
...
+	}
+
+cleanup:

It is equivalent to

+	if (!got_standby_lock_timeout && got_standby_deadlock_timeout)
+	{
...
+	}

Is there any reason for the goto?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#15 Fujii Masao
masao.fujii@oss.nttdata.com
In reply to: Kyotaro Horiguchi (#14)
Re: Deadlock between backend and recovery may not be detected

On 2020/12/25 13:16, Kyotaro Horiguchi wrote:

At Wed, 23 Dec 2020 21:42:47 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in

you. Attached
is the updated version of the patch. What about this version?

The patch contains a hunk in the following structure.

+	if (got_standby_lock_timeout)
+		goto cleanup;
+
+	if (got_standby_deadlock_timeout)
+	{
...
+	}
+
+cleanup:

It is equivalent to

+	if (!got_standby_lock_timeout && got_standby_deadlock_timeout)
+	{
...
+	}

Is there any reason for the goto?

Yes. That's because we have the following code using goto.

+               /* Quick exit if there's no work to be done */
+               if (!VirtualTransactionIdIsValid(*backends))
+                       goto cleanup;

Regarding the back-patch, I was thinking to back-patch this to all the
supported branches. But I found that it's not easy to do that to v9.5
because v9.5 doesn't have some infrastructure code that this bug fix
patch depends on. So at least the commit 37c54863cf as the infrastructure
also needs to be back-patched to v9.5. And ISTM that some related commits
f868a8143a and 8f0de712c3 need to be back-patched. Probably there might
be some other changes to be back-patched. Unfortunately they cannot be
applied to v9.5 cleanly and additional changes are necessary.

This situation makes me feel that I'm inclined to skip the back-patch to v9.5.
Because the next minor version release is the final one for v9.5. So if we
unexpectedly introduce the bug to v9.5 by the back-patch, there is no
chance to fix that. OTOH, of course, if we don't do the back-patch, there is
no chance to fix the deadlock detection bug since the final minor version
for v9.5 doesn't include that bug fix. But I'm afraid that the back-patch
to v9.5 may give more risk than gain.

Thought?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#16 Drouvot, Bertrand
bdrouvot@amazon.com
In reply to: Fujii Masao (#15)
Re: Deadlock between backend and recovery may not be detected

Hi,

On 1/5/21 7:26 AM, Fujii Masao wrote:

On 2020/12/25 13:16, Kyotaro Horiguchi wrote:

At Wed, 23 Dec 2020 21:42:47 +0900, Fujii Masao
<masao.fujii@oss.nttdata.com> wrote in

you. Attached
is the updated version of the patch. What about this version?

The patch contains a hunk in the following structure.

+     if (got_standby_lock_timeout)
+             goto cleanup;
+
+     if (got_standby_deadlock_timeout)
+     {
...
+     }
+
+cleanup:

It is equivalent to

+     if (!got_standby_lock_timeout && got_standby_deadlock_timeout)
+     {
...
+     }

Is there any reason for the goto?

Yes. That's because we have the following code using goto.

+               /* Quick exit if there's no work to be done */
+               if (!VirtualTransactionIdIsValid(*backends))
+                       goto cleanup;

Regarding the back-patch, I was thinking to back-patch this to all the
supported branches. But I found that it's not easy to do that to v9.5
because v9.5 doesn't have some infrastructure code that this bug fix
patch depends on. So at least the commit 37c54863cf as the infrastructure
also needs to be back-patched to v9.5. And ISTM that some related commits
f868a8143a and 8f0de712c3 need to be back-patched. Probably there might
be some other changes to be back-patched. Unfortunately they cannot be
applied to v9.5 cleanly and additional changes are necessary.

This situation makes me feel that I'm inclined to skip the back-patch to v9.5.
Because the next minor version release is the final one for v9.5. So if we
unexpectedly introduce the bug to v9.5 by the back-patch, there is no
chance to fix that. OTOH, of course, if we don't do the back-patch, there is
no chance to fix the deadlock detection bug since the final minor version
for v9.5 doesn't include that bug fix. But I'm afraid that the back-patch
to v9.5 may give more risk than gain.

Thought?

Reading your arguments, I am inclined to think the same.

Bertrand

#17 Victor Yegorov
vyegorov@gmail.com
In reply to: Fujii Masao (#15)
Re: Deadlock between backend and recovery may not be detected

Tue, 5 Jan 2021 at 07:26, Fujii Masao <masao.fujii@oss.nttdata.com>:

This situation makes me feel that I'm inclined to skip the back-patch to v9.5.
Because the next minor version release is the final one for v9.5. So if we
unexpectedly introduce the bug to v9.5 by the back-patch, there is no
chance to fix that. OTOH, of course, if we don't do the back-patch, there is
no chance to fix the deadlock detection bug since the final minor version
for v9.5 doesn't include that bug fix. But I'm afraid that the back-patch
to v9.5 may give more risk than gain.

Thought?

Honestly, I was thinking that this would not be backpatched at all,
and I'm really glad we're getting this fixed in the back branches as well.

Therefore I think it's fine to skip 9.5, though I would have mentioned
this in the commit message.

--
Victor Yegorov

#18 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Fujii Masao (#15)
Re: Deadlock between backend and recovery may not be detected

At Tue, 5 Jan 2021 15:26:50 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in

On 2020/12/25 13:16, Kyotaro Horiguchi wrote:

At Wed, 23 Dec 2020 21:42:47 +0900, Fujii Masao
<masao.fujii@oss.nttdata.com> wrote in

you. Attached
is the updated version of the patch. What about this version?

The patch contains a hunk in the following structure.
+	if (got_standby_lock_timeout)
+		goto cleanup;
+
+	if (got_standby_deadlock_timeout)
+	{
...
+	}
+
+cleanup:
It is equivalent to
+	if (!got_standby_lock_timeout && got_standby_deadlock_timeout)
+	{
...
+	}
Is there any reason for the goto?

Yes. That's because we have the following code using goto.

+               /* Quick exit if there's no work to be done */
+               if (!VirtualTransactionIdIsValid(*backends))
+                       goto cleanup;

It doesn't seem to be the *cause*. Such straightforward logic with
three levels of indentation is not something that should be avoided by
using goto, even if PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK is too
lengthy and sticks out past 80 columns.

Regarding the back-patch, I was thinking to back-patch this to all the
supported branches. But I found that it's not easy to do that to v9.5
because v9.5 doesn't have some infrastructure code that this bug fix
patch depends on. So at least the commit 37c54863cf as the infrastructure
also needs to be back-patched to v9.5. And ISTM that some related commits
f868a8143a and 8f0de712c3 need to be back-patched. Probably there might
be some other changes to be back-patched. Unfortunately they cannot be
applied to v9.5 cleanly and additional changes are necessary.

This situation makes me feel that I'm inclined to skip the back-patch to v9.5.
Because the next minor version release is the final one for v9.5. So if we
unexpectedly introduce the bug to v9.5 by the back-patch, there is no
chance to fix that. OTOH, of course, if we don't do the back-patch, there is
no chance to fix the deadlock detection bug since the final minor version
for v9.5 doesn't include that bug fix. But I'm afraid that the back-patch
to v9.5 may give more risk than gain.

Thought?

It seems to me that the final minor release should get fixes only for
issues that we have actually gotten complaints about, or for critical-ish
known issues such as ones that lead to server crashes in normal paths.
This particular issue is neither of those.

So +1 for not back-patching to 9.5.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#19 Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Fujii Masao (#15)
Re: Deadlock between backend and recovery may not be detected

On Tue, Jan 5, 2021 at 3:26 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

On 2020/12/25 13:16, Kyotaro Horiguchi wrote:

At Wed, 23 Dec 2020 21:42:47 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in

you. Attached
is the updated version of the patch. What about this version?

The patch contains a hunk in the following structure.

+     if (got_standby_lock_timeout)
+             goto cleanup;
+
+     if (got_standby_deadlock_timeout)
+     {
...
+     }
+
+cleanup:

It is equivalent to

+     if (!got_standby_lock_timeout && got_standby_deadlock_timeout)
+     {
...
+     }

Is there any reason for the goto?

Yes. That's because we have the following code using goto.

+               /* Quick exit if there's no work to be done */
+               if (!VirtualTransactionIdIsValid(*backends))
+                       goto cleanup;

Regarding the back-patch, I was thinking to back-patch this to all the
supported branches. But I found that it's not easy to do that to v9.5
because v9.5 doesn't have some infrastructure code that this bug fix
patch depends on. So at least the commit 37c54863cf as the infrastructure
also needs to be back-patched to v9.5. And ISTM that some related commits
f868a8143a and 8f0de712c3 need to be back-patched. Probably there might
be some other changes to be back-patched. Unfortunately they cannot be
applied to v9.5 cleanly and additional changes are necessary.

This situation makes me feel that I'm inclined to skip the back-patch to v9.5.
Because the next minor version release is the final one for v9.5. So if we
unexpectedly introduce the bug to v9.5 by the back-patch, there is no
chance to fix that. OTOH, of course, if we don't do the back-patch, there is
no chance to fix the deadlock detection bug since the final minor version
for v9.5 doesn't include that bug fix. But I'm afraid that the back-patch
to v9.5 may give more risk than gain.

Thought?

+1 for not-backpatching to 9.5.

Regards,

--
Masahiko Sawada
EnterpriseDB: https://www.enterprisedb.com/

#20 Fujii Masao
masao.fujii@oss.nttdata.com
In reply to: Masahiko Sawada (#19)
Re: Deadlock between backend and recovery may not be detected

On 2021/01/06 11:48, Masahiko Sawada wrote:

On Tue, Jan 5, 2021 at 3:26 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

On 2020/12/25 13:16, Kyotaro Horiguchi wrote:

At Wed, 23 Dec 2020 21:42:47 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in

you. Attached
is the updated version of the patch. What about this version?

The patch contains a hunk in the following structure.

+     if (got_standby_lock_timeout)
+             goto cleanup;
+
+     if (got_standby_deadlock_timeout)
+     {
...
+     }
+
+cleanup:

It is equivalent to

+     if (!got_standby_lock_timeout && got_standby_deadlock_timeout)
+     {
...
+     }

Is there any reason for the goto?

Yes. That's because we have the following code using goto.

+               /* Quick exit if there's no work to be done */
+               if (!VirtualTransactionIdIsValid(*backends))
+                       goto cleanup;

Regarding the back-patch, I was thinking to back-patch this to all the
supported branches. But I found that it's not easy to do that to v9.5
because v9.5 doesn't have some infrastructure code that this bug fix
patch depends on. So at least the commit 37c54863cf as the infrastructure
also needs to be back-patched to v9.5. And ISTM that some related commits
f868a8143a and 8f0de712c3 need to be back-patched. Probably there might
be some other changes to be back-patched. Unfortunately they cannot be
applied to v9.5 cleanly and additional changes are necessary.

This situation makes me feel that I'm inclined to skip the back-patch to v9.5.
Because the next minor version release is the final one for v9.5. So if we
unexpectedly introduce the bug to v9.5 by the back-patch, there is no
chance to fix that. OTOH, of course, if we don't do the back-patch, there is
no chance to fix the deadlock detection bug since the final minor version
for v9.5 doesn't include that bug fix. But I'm afraid that the back-patch
to v9.5 may give more risk than gain.

Thought?

+1 for not-backpatching to 9.5.

Thanks, all! I pushed the patch and back-patched it to v9.6.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION