Latches vs lwlock contention

Started by Thomas Munro, about 3 years ago, 12 messages
#1 Thomas Munro
thomas.munro@gmail.com
6 attachment(s)

Hi,

We usually want to release LWLocks, and definitely spinlocks, before
calling SetLatch(), to avoid putting a system call inside the locked
region and so minimise the time the lock is held. There are a few
places where we don't do that, possibly because there isn't a single
latch we could hold a pointer to, but rather a set of them that has to
be collected from some data structure, and we don't have infrastructure
to help with that. There are also cases where we semi-reliably create
lock contention, because the backends that wake up immediately try to
acquire the very same lock.
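
To illustrate the pattern, here is a minimal usage sketch of the
batching API added by the 0001 patch below; SomeLock and waiter are
placeholder names, not real ones:

    LatchBatch  wakeups = {0};

    LWLockAcquire(SomeLock, LW_EXCLUSIVE);
    /* ... update shared state, find the backend(s) to wake ... */
    AddLatch(&wakeups, &waiter->procLatch);
    LWLockRelease(SomeLock);

    /* kill()/SetEvent() calls happen here, outside the locked region */
    SetLatches(&wakeups);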

One example is heavyweight lock wakeups. If you run BEGIN; LOCK TABLE
t; ... and then N other sessions wait in SELECT * FROM t;, and then
you run ... COMMIT;, you'll see the first session wake all the others
while it still holds the partition lock itself. They'll all wake up
and begin to re-acquire the same partition lock in exclusive mode,
immediately go back to sleep on *that* wait list, and then wake each
other up one at a time in a chain. We could avoid the first
double-bounce by not setting the latches until after we've released
the partition lock. We could avoid the rest of them by not
re-acquiring the partition lock at all, which ... if I'm reading right
... shouldn't actually be necessary in modern PostgreSQL? Or, if there
is another reason to re-acquire, maybe the comment should be updated.

Presumably no one really does that repeatedly while there is a long
queue of non-conflicting waiters, so I'm not claiming it's a major
improvement, but it's at least a micro-optimisation.
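
To make that concrete, with the 0003 patch below the committing
session's release path looks roughly like this (simplified; the real
code path goes through CleanUpLock()/ProcLockWakeup()):

    LatchBatch  wakeups = {0};

    LWLockAcquire(partitionLock, LW_EXCLUSIVE);
    /* ... grant the lock to compatible waiters, buffering their latches ... */
    ProcLockWakeup(lockMethodTable, lock, &wakeups);
    LWLockRelease(partitionLock);

    /* waiters only wake up after we've let go of the partition lock */
    SetLatches(&wakeups);

The 0004 patch then stops the woken backends from re-acquiring the
partition lock in ProcSleep(), removing the second source of
contention.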

There are some other, simpler mechanical changes for synchronous
replication, SERIALIZABLE DEFERRABLE and condition variables (the last
one inspired by Yura Sokolov's patches[1]). Actually, I'm not at all
sure about the CV implementation; I feel like a more ambitious change
is needed to make our CVs perform.

See attached sketch patches. I guess the main thing that may not be
good enough is the use of a fixed-size latch buffer. Memory
allocation in don't-throw-here environments like the guts of lock code
might be an issue, which is why it just gives up and flushes when
full; maybe it should try to allocate and fall back to flushing only
if that fails. These sketch patches aren't proposals, just
observations in need of more study.
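
A rough sketch of that allocate-then-fall-back idea, assuming the
batch is extended to carry a growable buffer (the field names below
follow the v2 version posted later in this thread) and using
palloc_extended() with MCXT_ALLOC_NO_OOM so an allocation failure
returns NULL instead of throwing:

    void
    AddLatch(LatchGroup *group, Latch *latch)
    {
        if (group->size == group->capacity)
        {
            /* Try to grow the buffer without risking an error here. */
            Latch     **bigger = palloc_extended(sizeof(Latch *) * group->capacity * 2,
                                                 MCXT_ALLOC_NO_OOM);

            if (bigger == NULL)
            {
                /* Couldn't allocate; flush what we have and start over. */
                SetLatches(group);
            }
            else
            {
                memcpy(bigger, group->latches, sizeof(Latch *) * group->size);
                if (group->latches != group->in_place_buffer)
                    pfree(group->latches);
                group->latches = bigger;
                group->capacity *= 2;
            }
        }
        group->latches[group->size++] = latch;
    }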

[1]: https://postgr.es/m/1edbb61981fe1d99c3f20e3d56d6c88999f4227c.camel%40postgrespro.ru

Attachments:

0001-Provide-SetLatches-for-batched-deferred-latches.patch (text/x-patch; charset=US-ASCII)
From b64d5782e2c3a2e34274a3bf9df4449afaee94dc Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 26 Oct 2022 15:51:45 +1300
Subject: [PATCH 1/8] Provide SetLatches() for batched deferred latches.

If we have a way to buffer a set of wakeup targets and process them at a
later time, we can:

* move SetLatch() system calls out from under LWLocks, so that locks can
  be released faster; this is especially interesting in cases where the
  target backends will immediately try to acquire the same lock, or
  generally when the lock is heavily contended

* possibly gain some micro-optimization from issuing only two memory
  barriers for the whole batch of latches, not two for each latch to be
  set

* provide the opportunity for potential future latch implementation
  mechanisms to deliver wakeups in a single system call

Individual users of this facility will follow in separate patches.
---
 src/backend/storage/ipc/latch.c  | 187 ++++++++++++++++++-------------
 src/include/storage/latch.h      |  13 +++
 src/tools/pgindent/typedefs.list |   1 +
 3 files changed, 123 insertions(+), 78 deletions(-)

diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index eb3a569aae..71fdc388c8 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -576,105 +576,136 @@ WaitLatchOrSocket(Latch *latch, int wakeEvents, pgsocket sock,
 }
 
 /*
- * Sets a latch and wakes up anyone waiting on it.
- *
- * This is cheap if the latch is already set, otherwise not so much.
- *
- * NB: when calling this in a signal handler, be sure to save and restore
- * errno around it.  (That's standard practice in most signal handlers, of
- * course, but we used to omit it in handlers that only set a flag.)
- *
- * NB: this function is called from critical sections and signal handlers so
- * throwing an error is not a good idea.
+ * Set multiple latches at the same time.
+ * Note: modifies input array.
  */
-void
-SetLatch(Latch *latch)
+static void
+SetLatchV(Latch **latches, int nlatches)
 {
-#ifndef WIN32
-	pid_t		owner_pid;
-#else
-	HANDLE		handle;
-#endif
-
-	/*
-	 * The memory barrier has to be placed here to ensure that any flag
-	 * variables possibly changed by this process have been flushed to main
-	 * memory, before we check/set is_set.
-	 */
+	/* Flush any other changes out to main memory just once. */
 	pg_memory_barrier();
 
-	/* Quick exit if already set */
-	if (latch->is_set)
-		return;
+	/* Keep only latches that are not already set, and set them. */
+	for (int i = 0; i < nlatches; ++i)
+	{
+		Latch *latch = latches[i];
 
-	latch->is_set = true;
+		if (!latch->is_set)
+			latch->is_set = true;
+		else
+			latches[i] = NULL;
+	}
 
 	pg_memory_barrier();
-	if (!latch->maybe_sleeping)
-		return;
 
+	/* Wake the remaining latches that might be sleeping. */
 #ifndef WIN32
-
-	/*
-	 * See if anyone's waiting for the latch. It can be the current process if
-	 * we're in a signal handler. We use the self-pipe or SIGURG to ourselves
-	 * to wake up WaitEventSetWaitBlock() without races in that case. If it's
-	 * another process, send a signal.
-	 *
-	 * Fetch owner_pid only once, in case the latch is concurrently getting
-	 * owned or disowned. XXX: This assumes that pid_t is atomic, which isn't
-	 * guaranteed to be true! In practice, the effective range of pid_t fits
-	 * in a 32 bit integer, and so should be atomic. In the worst case, we
-	 * might end up signaling the wrong process. Even then, you're very
-	 * unlucky if a process with that bogus pid exists and belongs to
-	 * Postgres; and PG database processes should handle excess SIGUSR1
-	 * interrupts without a problem anyhow.
-	 *
-	 * Another sort of race condition that's possible here is for a new
-	 * process to own the latch immediately after we look, so we don't signal
-	 * it. This is okay so long as all callers of ResetLatch/WaitLatch follow
-	 * the standard coding convention of waiting at the bottom of their loops,
-	 * not the top, so that they'll correctly process latch-setting events
-	 * that happen before they enter the loop.
-	 */
-	owner_pid = latch->owner_pid;
-	if (owner_pid == 0)
-		return;
-	else if (owner_pid == MyProcPid)
+	for (int i = 0; i < nlatches; ++i)
 	{
+		Latch *latch = latches[i];
+		pid_t		owner_pid;
+
+		if (!latch || !latch->maybe_sleeping)
+			continue;
+
+		/*
+		 * See if anyone's waiting for the latch. It can be the current process
+		 * if we're in a signal handler. We use the self-pipe or SIGURG to
+		 * ourselves to wake up WaitEventSetWaitBlock() without races in that
+		 * case. If it's another process, send a signal.
+		 *
+		 * Fetch owner_pid only once, in case the latch is concurrently getting
+		 * owned or disowned. XXX: This assumes that pid_t is atomic, which
+		 * isn't guaranteed to be true! In practice, the effective range of
+		 * pid_t fits in a 32 bit integer, and so should be atomic. In the
+		 * worst case, we might end up signaling the wrong process. Even then,
+		 * you're very unlucky if a process with that bogus pid exists and
+		 * belongs to Postgres; and PG database processes should handle excess
+		 * SIGURG interrupts without a problem anyhow.
+		 *
+		 * Another sort of race condition that's possible here is for a new
+		 * process to own the latch immediately after we look, so we don't
+		 * signal it. This is okay so long as all callers of
+		 * ResetLatch/WaitLatch follow the standard coding convention of
+		 * waiting at the bottom of their loops, not the top, so that they'll
+		 * correctly process latch-setting events that happen before they enter
+		 * the loop.
+		 */
+		owner_pid = latch->owner_pid;
+
+		if (owner_pid == MyProcPid)
+		{
+			if (waiting)
+			{
 #if defined(WAIT_USE_SELF_PIPE)
-		if (waiting)
-			sendSelfPipeByte();
+				sendSelfPipeByte();
 #else
-		if (waiting)
-			kill(MyProcPid, SIGURG);
+				kill(MyProcPid, SIGURG);
 #endif
+			}
+		}
+		else
+			kill(owner_pid, SIGURG);
 	}
-	else
-		kill(owner_pid, SIGURG);
-
 #else
-
-	/*
-	 * See if anyone's waiting for the latch. It can be the current process if
-	 * we're in a signal handler.
-	 *
-	 * Use a local variable here just in case somebody changes the event field
-	 * concurrently (which really should not happen).
-	 */
-	handle = latch->event;
-	if (handle)
+	for (int i = 0; i < nlatches; ++i)
 	{
-		SetEvent(handle);
+		Latch *latch = latches[i];
 
-		/*
-		 * Note that we silently ignore any errors. We might be in a signal
-		 * handler or other critical path where it's not safe to call elog().
-		 */
+		if (latch && latch->maybe_sleeping)
+		{
+			HANDLE event = latch->event;
+
+			if (event)
+				SetEvent(event);
+		}
 	}
 #endif
 }
 
+/*
+ * Sets a latch and wakes up anyone waiting on it.
+ *
+ * This is cheap if the latch is already set, otherwise not so much.
+ *
+ * NB: when calling this in a signal handler, be sure to save and restore
+ * errno around it.  (That's standard practice in most signal handlers, of
+ * course, but we used to omit it in handlers that only set a flag.)
+ *
+ * NB: this function is called from critical sections and signal handlers so
+ * throwing an error is not a good idea.
+ */
+void
+SetLatch(Latch *latch)
+{
+	SetLatchV(&latch, 1);
+}
+
+/*
+ * Add a latch to a batch, to be set later as a group.
+ */
+void
+AddLatch(LatchBatch *batch, Latch *latch)
+{
+	if (batch->size == lengthof(batch->latches))
+		SetLatches(batch);
+
+	batch->latches[batch->size++] = latch;
+}
+
+/*
+ * Set all the latches accumulated in 'batch'.
+ */
+void
+SetLatches(LatchBatch *batch)
+{
+	if (batch->size > 0)
+	{
+		SetLatchV(batch->latches, batch->size);
+		batch->size = 0;
+	}
+}
+
 /*
  * Clear the latch. Calling WaitLatch after this will sleep, unless
  * the latch is set again before the WaitLatch call.
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index 68ab740f16..0edf364637 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -118,6 +118,17 @@ typedef struct Latch
 #endif
 } Latch;
 
+#define LATCH_BATCH_SIZE 64
+
+/*
+ * Container for setting multiple latches at a time.
+ */
+typedef struct LatchBatch
+{
+	int			size;
+	Latch	   *latches[LATCH_BATCH_SIZE];
+}			LatchBatch;
+
 /*
  * Bitmasks for events that may wake-up WaitLatch(), WaitLatchOrSocket(), or
  * WaitEventSetWait().
@@ -163,6 +174,8 @@ extern void InitSharedLatch(Latch *latch);
 extern void OwnLatch(Latch *latch);
 extern void DisownLatch(Latch *latch);
 extern void SetLatch(Latch *latch);
+extern void AddLatch(LatchBatch * batch, Latch *latch);
+extern void SetLatches(LatchBatch * batch);
 extern void ResetLatch(Latch *latch);
 extern void ShutdownLatchSupport(void);
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 2f02cc8f42..f05c41043d 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1389,6 +1389,7 @@ LagTracker
 LargeObjectDesc
 LastAttnumInfo
 Latch
+LatchBatch
 LerpFunc
 LexDescr
 LexemeEntry
-- 
2.35.1

0002-Use-SetLatches-for-condition-variables.patch (text/x-patch; charset=US-ASCII)
From 7d68b9bad4e172fc5d56df8630af2c22986b4f81 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 26 Oct 2022 17:36:34 +1300
Subject: [PATCH 2/8] Use SetLatches() for condition variables.

Drain condition variable wait lists in larger batches, not one at a
time, while broadcasting.  This is an idea from Yura Sokolov, which I've
now combined with SetLatches() for slightly more efficiency.

Since we're now performing loops, change the internal spinlock to an
lwlock.

XXX There is probably a better data structure/arrangement that would
allow us to hold the lock for a shorter time.  This patch is not very
satisfying yet.

Discussion: https://postgr.es/m/1edbb61981fe1d99c3f20e3d56d6c88999f4227c.camel%40postgrespro.ru
---
 src/backend/storage/lmgr/condition_variable.c | 97 +++++++++----------
 src/backend/storage/lmgr/lwlock.c             |  2 +
 src/include/storage/condition_variable.h      |  4 +-
 src/include/storage/lwlock.h                  |  1 +
 4 files changed, 49 insertions(+), 55 deletions(-)

diff --git a/src/backend/storage/lmgr/condition_variable.c b/src/backend/storage/lmgr/condition_variable.c
index de65dac3ae..4b9749ecdd 100644
--- a/src/backend/storage/lmgr/condition_variable.c
+++ b/src/backend/storage/lmgr/condition_variable.c
@@ -36,7 +36,7 @@ static ConditionVariable *cv_sleep_target = NULL;
 void
 ConditionVariableInit(ConditionVariable *cv)
 {
-	SpinLockInit(&cv->mutex);
+	LWLockInitialize(&cv->mutex, LWTRANCHE_CONDITION_VARIABLE);
 	proclist_init(&cv->wakeup);
 }
 
@@ -74,9 +74,9 @@ ConditionVariablePrepareToSleep(ConditionVariable *cv)
 	cv_sleep_target = cv;
 
 	/* Add myself to the wait queue. */
-	SpinLockAcquire(&cv->mutex);
+	LWLockAcquire(&cv->mutex, LW_EXCLUSIVE);
 	proclist_push_tail(&cv->wakeup, pgprocno, cvWaitLink);
-	SpinLockRelease(&cv->mutex);
+	LWLockRelease(&cv->mutex);
 }
 
 /*
@@ -180,13 +180,13 @@ ConditionVariableTimedSleep(ConditionVariable *cv, long timeout,
 		 * by something other than ConditionVariableSignal; though we don't
 		 * guarantee not to return spuriously, we'll avoid this obvious case.
 		 */
-		SpinLockAcquire(&cv->mutex);
+		LWLockAcquire(&cv->mutex, LW_EXCLUSIVE);
 		if (!proclist_contains(&cv->wakeup, MyProc->pgprocno, cvWaitLink))
 		{
 			done = true;
 			proclist_push_tail(&cv->wakeup, MyProc->pgprocno, cvWaitLink);
 		}
-		SpinLockRelease(&cv->mutex);
+		LWLockRelease(&cv->mutex);
 
 		/*
 		 * Check for interrupts, and return spuriously if that caused the
@@ -233,12 +233,12 @@ ConditionVariableCancelSleep(void)
 	if (cv == NULL)
 		return;
 
-	SpinLockAcquire(&cv->mutex);
+	LWLockAcquire(&cv->mutex, LW_EXCLUSIVE);
 	if (proclist_contains(&cv->wakeup, MyProc->pgprocno, cvWaitLink))
 		proclist_delete(&cv->wakeup, MyProc->pgprocno, cvWaitLink);
 	else
 		signaled = true;
-	SpinLockRelease(&cv->mutex);
+	LWLockRelease(&cv->mutex);
 
 	/*
 	 * If we've received a signal, pass it on to another waiting process, if
@@ -265,10 +265,10 @@ ConditionVariableSignal(ConditionVariable *cv)
 	PGPROC	   *proc = NULL;
 
 	/* Remove the first process from the wakeup queue (if any). */
-	SpinLockAcquire(&cv->mutex);
+	LWLockAcquire(&cv->mutex, LW_EXCLUSIVE);
 	if (!proclist_is_empty(&cv->wakeup))
 		proc = proclist_pop_head_node(&cv->wakeup, cvWaitLink);
-	SpinLockRelease(&cv->mutex);
+	LWLockRelease(&cv->mutex);
 
 	/* If we found someone sleeping, set their latch to wake them up. */
 	if (proc != NULL)
@@ -287,6 +287,7 @@ ConditionVariableBroadcast(ConditionVariable *cv)
 {
 	int			pgprocno = MyProc->pgprocno;
 	PGPROC	   *proc = NULL;
+	bool		inserted_sentinel = false;
 	bool		have_sentinel = false;
 
 	/*
@@ -313,52 +314,42 @@ ConditionVariableBroadcast(ConditionVariable *cv)
 	if (cv_sleep_target != NULL)
 		ConditionVariableCancelSleep();
 
-	/*
-	 * Inspect the state of the queue.  If it's empty, we have nothing to do.
-	 * If there's exactly one entry, we need only remove and signal that
-	 * entry.  Otherwise, remove the first entry and insert our sentinel.
-	 */
-	SpinLockAcquire(&cv->mutex);
-	/* While we're here, let's assert we're not in the list. */
-	Assert(!proclist_contains(&cv->wakeup, pgprocno, cvWaitLink));
-
-	if (!proclist_is_empty(&cv->wakeup))
+	do
 	{
-		proc = proclist_pop_head_node(&cv->wakeup, cvWaitLink);
-		if (!proclist_is_empty(&cv->wakeup))
-		{
-			proclist_push_tail(&cv->wakeup, pgprocno, cvWaitLink);
-			have_sentinel = true;
-		}
-	}
-	SpinLockRelease(&cv->mutex);
-
-	/* Awaken first waiter, if there was one. */
-	if (proc != NULL)
-		SetLatch(&proc->procLatch);
+		int max_batch_size = Min(LATCH_BATCH_SIZE, 8);	/* XXX */
+		LatchBatch batch = {0};
 
-	while (have_sentinel)
-	{
-		/*
-		 * Each time through the loop, remove the first wakeup list entry, and
-		 * signal it unless it's our sentinel.  Repeat as long as the sentinel
-		 * remains in the list.
-		 *
-		 * Notice that if someone else removes our sentinel, we will waken one
-		 * additional process before exiting.  That's intentional, because if
-		 * someone else signals the CV, they may be intending to waken some
-		 * third process that added itself to the list after we added the
-		 * sentinel.  Better to give a spurious wakeup (which should be
-		 * harmless beyond wasting some cycles) than to lose a wakeup.
-		 */
-		proc = NULL;
-		SpinLockAcquire(&cv->mutex);
-		if (!proclist_is_empty(&cv->wakeup))
+		LWLockAcquire(&cv->mutex, LW_EXCLUSIVE);
+		while (!proclist_is_empty(&cv->wakeup))
+		{
+			if (batch.size == max_batch_size)
+			{
+				if (!inserted_sentinel)
+				{
+					/* First batch of many.  Add sentinel. */
+					proclist_push_tail(&cv->wakeup, pgprocno, cvWaitLink);
+					inserted_sentinel = true;
+					have_sentinel = true;
+				}
+				break;
+			}
 			proc = proclist_pop_head_node(&cv->wakeup, cvWaitLink);
-		have_sentinel = proclist_contains(&cv->wakeup, pgprocno, cvWaitLink);
-		SpinLockRelease(&cv->mutex);
+			if (proc == MyProc)
+			{
+				/* We hit our sentinel.  We're done. */
+				have_sentinel = false;
+				break;
+			}
+			else if (!proclist_contains(&cv->wakeup, pgprocno, cvWaitLink))
+			{
+				/* Someone else hit our sentinel.  We're done. */
+				have_sentinel = false;
+			}
+			AddLatch(&batch, &proc->procLatch);
+		}
+		LWLockRelease(&cv->mutex);
 
-		if (proc != NULL && proc != MyProc)
-			SetLatch(&proc->procLatch);
-	}
+		/* Awaken this batch of waiters, if there were some. */
+		SetLatches(&batch);
+	} while (have_sentinel);
 }
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index d274c9b1dc..03186da0a3 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -183,6 +183,8 @@ static const char *const BuiltinTrancheNames[] = {
 	"PgStatsHash",
 	/* LWTRANCHE_PGSTATS_DATA: */
 	"PgStatsData",
+	/* LWTRANCHE_CONDITION_VARIABLE: */
+	"ConditionVariable",
 };
 
 StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/include/storage/condition_variable.h b/src/include/storage/condition_variable.h
index e89175ebd5..82b8eeabf5 100644
--- a/src/include/storage/condition_variable.h
+++ b/src/include/storage/condition_variable.h
@@ -22,12 +22,12 @@
 #ifndef CONDITION_VARIABLE_H
 #define CONDITION_VARIABLE_H
 
+#include "storage/lwlock.h"
 #include "storage/proclist_types.h"
-#include "storage/spin.h"
 
 typedef struct
 {
-	slock_t		mutex;			/* spinlock protecting the wakeup list */
+	LWLock		mutex;			/* lock protecting the wakeup list */
 	proclist_head wakeup;		/* list of wake-able processes */
 } ConditionVariable;
 
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index ca4eca76f4..dcbffe11a2 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -193,6 +193,7 @@ typedef enum BuiltinTrancheIds
 	LWTRANCHE_PGSTATS_DSA,
 	LWTRANCHE_PGSTATS_HASH,
 	LWTRANCHE_PGSTATS_DATA,
+	LWTRANCHE_CONDITION_VARIABLE,
 	LWTRANCHE_FIRST_USER_DEFINED
 }			BuiltinTrancheIds;
 
-- 
2.35.1

0003-Use-SetLatches-for-heavyweight-locks.patch (text/x-patch; charset=US-ASCII)
From 1dca369e595f1259736a015efcf13b1041d7780d Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 26 Oct 2022 16:43:31 +1300
Subject: [PATCH 3/8] Use SetLatches() for heavyweight locks.

Collect wakeups into a LatchBatch to be set at once after the
LockManager's internal partition lock is released.  This avoids holding
busy locks while issuing system calls.

Currently, waiters immediately try to acquire that lock themselves, so
deferring until it's released makes sense to avoid contention (a later
patch may remove that lock acquisition though).
---
 src/backend/storage/lmgr/deadlock.c |  4 ++--
 src/backend/storage/lmgr/lock.c     | 36 +++++++++++++++++++----------
 src/backend/storage/lmgr/proc.c     | 29 ++++++++++++++---------
 src/include/storage/lock.h          |  5 ++--
 src/include/storage/proc.h          |  5 ++--
 5 files changed, 49 insertions(+), 30 deletions(-)

diff --git a/src/backend/storage/lmgr/deadlock.c b/src/backend/storage/lmgr/deadlock.c
index cd9c0418ec..1b7454a8fe 100644
--- a/src/backend/storage/lmgr/deadlock.c
+++ b/src/backend/storage/lmgr/deadlock.c
@@ -214,7 +214,7 @@ InitDeadLockChecking(void)
  * and (b) we are typically invoked inside a signal handler.
  */
 DeadLockState
-DeadLockCheck(PGPROC *proc)
+DeadLockCheck(PGPROC *proc, LatchBatch *wakeups)
 {
 	int			i,
 				j;
@@ -272,7 +272,7 @@ DeadLockCheck(PGPROC *proc)
 #endif
 
 		/* See if any waiters for the lock can be woken up now */
-		ProcLockWakeup(GetLocksMethodTable(lock), lock);
+		ProcLockWakeup(GetLocksMethodTable(lock), lock, wakeups);
 	}
 
 	/* Return code tells caller if we had to escape a deadlock or not */
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 3d1049cf75..f59ae20069 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -374,14 +374,14 @@ static PROCLOCK *SetupLockInTable(LockMethod lockMethodTable, PGPROC *proc,
 static void GrantLockLocal(LOCALLOCK *locallock, ResourceOwner owner);
 static void BeginStrongLockAcquire(LOCALLOCK *locallock, uint32 fasthashcode);
 static void FinishStrongLockAcquire(void);
-static void WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner);
+static void WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner, LatchBatch *wakeups);
 static void ReleaseLockIfHeld(LOCALLOCK *locallock, bool sessionLock);
 static void LockReassignOwner(LOCALLOCK *locallock, ResourceOwner parent);
 static bool UnGrantLock(LOCK *lock, LOCKMODE lockmode,
 						PROCLOCK *proclock, LockMethod lockMethodTable);
 static void CleanUpLock(LOCK *lock, PROCLOCK *proclock,
 						LockMethod lockMethodTable, uint32 hashcode,
-						bool wakeupNeeded);
+						bool wakeupNeeded, LatchBatch *wakeups);
 static void LockRefindAndRelease(LockMethod lockMethodTable, PGPROC *proc,
 								 LOCKTAG *locktag, LOCKMODE lockmode,
 								 bool decrement_strong_lock_count);
@@ -787,6 +787,7 @@ LockAcquireExtended(const LOCKTAG *locktag,
 	LWLock	   *partitionLock;
 	bool		found_conflict;
 	bool		log_lock = false;
+	LatchBatch	wakeups = {0};
 
 	if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
 		elog(ERROR, "unrecognized lock method: %d", lockmethodid);
@@ -1098,7 +1099,7 @@ LockAcquireExtended(const LOCKTAG *locktag,
 										 locktag->locktag_type,
 										 lockmode);
 
-		WaitOnLock(locallock, owner);
+		WaitOnLock(locallock, owner, &wakeups);
 
 		TRACE_POSTGRESQL_LOCK_WAIT_DONE(locktag->locktag_field1,
 										locktag->locktag_field2,
@@ -1138,6 +1139,8 @@ LockAcquireExtended(const LOCKTAG *locktag,
 
 	LWLockRelease(partitionLock);
 
+	SetLatches(&wakeups);
+
 	/*
 	 * Emit a WAL record if acquisition of this lock needs to be replayed in a
 	 * standby server.
@@ -1634,7 +1637,7 @@ UnGrantLock(LOCK *lock, LOCKMODE lockmode,
 static void
 CleanUpLock(LOCK *lock, PROCLOCK *proclock,
 			LockMethod lockMethodTable, uint32 hashcode,
-			bool wakeupNeeded)
+			bool wakeupNeeded, LatchBatch *wakeups)
 {
 	/*
 	 * If this was my last hold on this lock, delete my entry in the proclock
@@ -1674,7 +1677,7 @@ CleanUpLock(LOCK *lock, PROCLOCK *proclock,
 	else if (wakeupNeeded)
 	{
 		/* There are waiters on this lock, so wake them up. */
-		ProcLockWakeup(lockMethodTable, lock);
+		ProcLockWakeup(lockMethodTable, lock, wakeups);
 	}
 }
 
@@ -1811,7 +1814,7 @@ MarkLockClear(LOCALLOCK *locallock)
  * The appropriate partition lock must be held at entry.
  */
 static void
-WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner)
+WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner, LatchBatch *wakeups)
 {
 	LOCKMETHODID lockmethodid = LOCALLOCK_LOCKMETHOD(*locallock);
 	LockMethod	lockMethodTable = LockMethods[lockmethodid];
@@ -1856,7 +1859,7 @@ WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner)
 	 */
 	PG_TRY();
 	{
-		if (ProcSleep(locallock, lockMethodTable) != PROC_WAIT_STATUS_OK)
+		if (ProcSleep(locallock, lockMethodTable, wakeups) != PROC_WAIT_STATUS_OK)
 		{
 			/*
 			 * We failed as a result of a deadlock, see CheckDeadLock(). Quit
@@ -1915,7 +1918,7 @@ WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner)
  * NB: this does not clean up any locallock object that may exist for the lock.
  */
 void
-RemoveFromWaitQueue(PGPROC *proc, uint32 hashcode)
+RemoveFromWaitQueue(PGPROC *proc, uint32 hashcode, LatchBatch *wakeups)
 {
 	LOCK	   *waitLock = proc->waitLock;
 	PROCLOCK   *proclock = proc->waitProcLock;
@@ -1957,7 +1960,7 @@ RemoveFromWaitQueue(PGPROC *proc, uint32 hashcode)
 	 */
 	CleanUpLock(waitLock, proclock,
 				LockMethods[lockmethodid], hashcode,
-				true);
+				true, wakeups);
 }
 
 /*
@@ -1982,6 +1985,7 @@ LockRelease(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
 	PROCLOCK   *proclock;
 	LWLock	   *partitionLock;
 	bool		wakeupNeeded;
+	LatchBatch	wakeups = {0};
 
 	if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
 		elog(ERROR, "unrecognized lock method: %d", lockmethodid);
@@ -2160,10 +2164,12 @@ LockRelease(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
 
 	CleanUpLock(lock, proclock,
 				lockMethodTable, locallock->hashcode,
-				wakeupNeeded);
+				wakeupNeeded, &wakeups);
 
 	LWLockRelease(partitionLock);
 
+	SetLatches(&wakeups);
+
 	RemoveLocalLock(locallock);
 	return true;
 }
@@ -2188,6 +2194,7 @@ LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
 	PROCLOCK   *proclock;
 	int			partition;
 	bool		have_fast_path_lwlock = false;
+	LatchBatch	wakeups = {0};
 
 	if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
 		elog(ERROR, "unrecognized lock method: %d", lockmethodid);
@@ -2434,10 +2441,12 @@ LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
 			CleanUpLock(lock, proclock,
 						lockMethodTable,
 						LockTagHashCode(&lock->tag),
-						wakeupNeeded);
+						wakeupNeeded, &wakeups);
 		}						/* loop over PROCLOCKs within this partition */
 
 		LWLockRelease(partitionLock);
+
+		SetLatches(&wakeups);
 	}							/* loop over partitions */
 
 #ifdef LOCK_DEBUG
@@ -3137,6 +3146,7 @@ LockRefindAndRelease(LockMethod lockMethodTable, PGPROC *proc,
 	uint32		proclock_hashcode;
 	LWLock	   *partitionLock;
 	bool		wakeupNeeded;
+	LatchBatch	wakeups = {0};
 
 	hashcode = LockTagHashCode(locktag);
 	partitionLock = LockHashPartitionLock(hashcode);
@@ -3190,10 +3200,12 @@ LockRefindAndRelease(LockMethod lockMethodTable, PGPROC *proc,
 
 	CleanUpLock(lock, proclock,
 				lockMethodTable, hashcode,
-				wakeupNeeded);
+				wakeupNeeded, &wakeups);
 
 	LWLockRelease(partitionLock);
 
+	SetLatches(&wakeups);
+
 	/*
 	 * Decrement strong lock count.  This logic is needed only for 2PC.
 	 */
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 13fa07b0ff..f0e19b74a7 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -700,6 +700,7 @@ LockErrorCleanup(void)
 {
 	LWLock	   *partitionLock;
 	DisableTimeoutParams timeouts[2];
+	LatchBatch	wakeups = {0};
 
 	HOLD_INTERRUPTS();
 
@@ -733,7 +734,7 @@ LockErrorCleanup(void)
 	if (MyProc->links.next != NULL)
 	{
 		/* We could not have been granted the lock yet */
-		RemoveFromWaitQueue(MyProc, lockAwaited->hashcode);
+		RemoveFromWaitQueue(MyProc, lockAwaited->hashcode, &wakeups);
 	}
 	else
 	{
@@ -750,6 +751,7 @@ LockErrorCleanup(void)
 	lockAwaited = NULL;
 
 	LWLockRelease(partitionLock);
+	SetLatches(&wakeups);
 
 	RESUME_INTERRUPTS();
 }
@@ -1042,7 +1044,7 @@ ProcQueueInit(PROC_QUEUE *queue)
  * NOTES: The process queue is now a priority queue for locking.
  */
 ProcWaitStatus
-ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
+ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable, LatchBatch *wakeups)
 {
 	LOCKMODE	lockmode = locallock->tag.mode;
 	LOCK	   *lock = locallock->lock;
@@ -1188,7 +1190,7 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
 	 */
 	if (early_deadlock)
 	{
-		RemoveFromWaitQueue(MyProc, hashcode);
+		RemoveFromWaitQueue(MyProc, hashcode, wakeups);
 		return PROC_WAIT_STATUS_ERROR;
 	}
 
@@ -1204,6 +1206,7 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
 	 * LockErrorCleanup will clean up if cancel/die happens.
 	 */
 	LWLockRelease(partitionLock);
+	SetLatches(wakeups);
 
 	/*
 	 * Also, now that we will successfully clean up after an ereport, it's
@@ -1662,8 +1665,8 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
  * to twiddle the lock's request counts too --- see RemoveFromWaitQueue.
  * Hence, in practice the waitStatus parameter must be PROC_WAIT_STATUS_OK.
  */
-PGPROC *
-ProcWakeup(PGPROC *proc, ProcWaitStatus waitStatus)
+static PGPROC *
+ProcWakeup(PGPROC *proc, ProcWaitStatus waitStatus, LatchBatch *wakeups)
 {
 	PGPROC	   *retProc;
 
@@ -1686,8 +1689,8 @@ ProcWakeup(PGPROC *proc, ProcWaitStatus waitStatus)
 	proc->waitStatus = waitStatus;
 	pg_atomic_write_u64(&MyProc->waitStart, 0);
 
-	/* And awaken it */
-	SetLatch(&proc->procLatch);
+	/* Schedule it to be awoken */
+	AddLatch(wakeups, &proc->procLatch);
 
 	return retProc;
 }
@@ -1700,7 +1703,7 @@ ProcWakeup(PGPROC *proc, ProcWaitStatus waitStatus)
  * The appropriate lock partition lock must be held by caller.
  */
 void
-ProcLockWakeup(LockMethod lockMethodTable, LOCK *lock)
+ProcLockWakeup(LockMethod lockMethodTable, LOCK *lock, LatchBatch *wakeups)
 {
 	PROC_QUEUE *waitQueue = &(lock->waitProcs);
 	int			queue_size = waitQueue->size;
@@ -1728,7 +1731,7 @@ ProcLockWakeup(LockMethod lockMethodTable, LOCK *lock)
 		{
 			/* OK to waken */
 			GrantLock(lock, proc->waitProcLock, lockmode);
-			proc = ProcWakeup(proc, PROC_WAIT_STATUS_OK);
+			proc = ProcWakeup(proc, PROC_WAIT_STATUS_OK, wakeups);
 
 			/*
 			 * ProcWakeup removes proc from the lock's waiting process queue
@@ -1762,6 +1765,7 @@ static void
 CheckDeadLock(void)
 {
 	int			i;
+	LatchBatch	wakeups = {0};
 
 	/*
 	 * Acquire exclusive lock on the entire shared lock data structures. Must
@@ -1796,7 +1800,7 @@ CheckDeadLock(void)
 #endif
 
 	/* Run the deadlock check, and set deadlock_state for use by ProcSleep */
-	deadlock_state = DeadLockCheck(MyProc);
+	deadlock_state = DeadLockCheck(MyProc, &wakeups);
 
 	if (deadlock_state == DS_HARD_DEADLOCK)
 	{
@@ -1813,7 +1817,8 @@ CheckDeadLock(void)
 		 * return from the signal handler.
 		 */
 		Assert(MyProc->waitLock != NULL);
-		RemoveFromWaitQueue(MyProc, LockTagHashCode(&(MyProc->waitLock->tag)));
+		RemoveFromWaitQueue(MyProc, LockTagHashCode(&(MyProc->waitLock->tag)),
+							&wakeups);
 
 		/*
 		 * We're done here.  Transaction abort caused by the error that
@@ -1837,6 +1842,8 @@ CheckDeadLock(void)
 check_done:
 	for (i = NUM_LOCK_PARTITIONS; --i >= 0;)
 		LWLockRelease(LockHashPartitionLockByIndex(i));
+
+	SetLatches(&wakeups);
 }
 
 /*
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index e4e1495b24..163bf41293 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -19,6 +19,7 @@
 #endif
 
 #include "storage/backendid.h"
+#include "storage/latch.h"
 #include "storage/lockdefs.h"
 #include "storage/lwlock.h"
 #include "storage/shmem.h"
@@ -575,7 +576,7 @@ extern bool LockCheckConflicts(LockMethod lockMethodTable,
 							   LOCK *lock, PROCLOCK *proclock);
 extern void GrantLock(LOCK *lock, PROCLOCK *proclock, LOCKMODE lockmode);
 extern void GrantAwaitedLock(void);
-extern void RemoveFromWaitQueue(PGPROC *proc, uint32 hashcode);
+extern void RemoveFromWaitQueue(PGPROC *proc, uint32 hashcode, LatchBatch *wakeups);
 extern Size LockShmemSize(void);
 extern LockData *GetLockStatusData(void);
 extern BlockedProcsData *GetBlockerStatusData(int blocked_pid);
@@ -592,7 +593,7 @@ extern void lock_twophase_postabort(TransactionId xid, uint16 info,
 extern void lock_twophase_standby_recover(TransactionId xid, uint16 info,
 										  void *recdata, uint32 len);
 
-extern DeadLockState DeadLockCheck(PGPROC *proc);
+extern DeadLockState DeadLockCheck(PGPROC *proc, LatchBatch *wakeups);
 extern PGPROC *GetBlockingAutoVacuumPgproc(void);
 extern void DeadLockReport(void) pg_attribute_noreturn();
 extern void RememberSimpleDeadLock(PGPROC *proc1,
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 8d096fdeeb..4dba46cfcb 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -449,9 +449,8 @@ extern bool HaveNFreeProcs(int n);
 extern void ProcReleaseLocks(bool isCommit);
 
 extern void ProcQueueInit(PROC_QUEUE *queue);
-extern ProcWaitStatus ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable);
-extern PGPROC *ProcWakeup(PGPROC *proc, ProcWaitStatus waitStatus);
-extern void ProcLockWakeup(LockMethod lockMethodTable, LOCK *lock);
+extern ProcWaitStatus ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable, LatchBatch *wakeups);
+extern void ProcLockWakeup(LockMethod lockMethodTable, LOCK *lock, LatchBatch *wakeups);
 extern void CheckDeadLockAlert(void);
 extern bool IsWaitingForLock(void);
 extern void LockErrorCleanup(void);
-- 
2.35.1

0004-Don-t-re-acquire-LockManager-partition-lock-after-wa.patch (text/x-patch; charset=US-ASCII)
From 2cef9eb9236a19845f967de3f392124016bbbba9 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Thu, 27 Oct 2022 10:56:39 +1300
Subject: [PATCH 4/8] Don't re-acquire LockManager partition lock after wakeup.

Change the contract of WaitOnLock() and ProcSleep() to return with the
partition lock released.  ProcSleep() re-acquired the lock to prevent
interrupts, which was a problem at the time the ancestor of that code
was committed in fe548629c50, because then we had signal handlers that
longjmp'd out of there.  Now, that can happen only at points where we
explicitly run CHECK_FOR_INTERRUPTS(), so we can remove the lock, which
leads to the idea of making the function always release.

While an earlier commit fixed the problem that the backend woke us up
before it had even released the lock itself, this lock reacquisition
point was still contended when multiple backends woke at the same time.

XXX Right?  Or what am I missing?
---
 src/backend/storage/lmgr/lock.c | 18 ++++++++++++------
 src/backend/storage/lmgr/proc.c | 20 ++++++++++++--------
 2 files changed, 24 insertions(+), 14 deletions(-)

diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index f59ae20069..42c244f5fb 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -787,6 +787,7 @@ LockAcquireExtended(const LOCKTAG *locktag,
 	LWLock	   *partitionLock;
 	bool		found_conflict;
 	bool		log_lock = false;
+	bool		partitionLocked = false;
 	LatchBatch	wakeups = {0};
 
 	if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
@@ -995,6 +996,7 @@ LockAcquireExtended(const LOCKTAG *locktag,
 	partitionLock = LockHashPartitionLock(hashcode);
 
 	LWLockAcquire(partitionLock, LW_EXCLUSIVE);
+	partitionLocked = true;
 
 	/*
 	 * Find or create lock and proclock entries with this tag
@@ -1099,7 +1101,10 @@ LockAcquireExtended(const LOCKTAG *locktag,
 										 locktag->locktag_type,
 										 lockmode);
 
+		Assert(LWLockHeldByMeInMode(partitionLock, LW_EXCLUSIVE));
 		WaitOnLock(locallock, owner, &wakeups);
+		Assert(!LWLockHeldByMeInMode(partitionLock, LW_EXCLUSIVE));
+		partitionLocked = false;
 
 		TRACE_POSTGRESQL_LOCK_WAIT_DONE(locktag->locktag_field1,
 										locktag->locktag_field2,
@@ -1124,7 +1129,6 @@ LockAcquireExtended(const LOCKTAG *locktag,
 			PROCLOCK_PRINT("LockAcquire: INCONSISTENT", proclock);
 			LOCK_PRINT("LockAcquire: INCONSISTENT", lock, lockmode);
 			/* Should we retry ? */
-			LWLockRelease(partitionLock);
 			elog(ERROR, "LockAcquire failed");
 		}
 		PROCLOCK_PRINT("LockAcquire: granted", proclock);
@@ -1137,9 +1141,11 @@ LockAcquireExtended(const LOCKTAG *locktag,
 	 */
 	FinishStrongLockAcquire();
 
-	LWLockRelease(partitionLock);
-
-	SetLatches(&wakeups);
+	if (partitionLocked)
+	{
+		LWLockRelease(partitionLock);
+		SetLatches(&wakeups);
+	}
 
 	/*
 	 * Emit a WAL record if acquisition of this lock needs to be replayed in a
@@ -1811,7 +1817,8 @@ MarkLockClear(LOCALLOCK *locallock)
  * Caller must have set MyProc->heldLocks to reflect locks already held
  * on the lockable object by this process.
  *
- * The appropriate partition lock must be held at entry.
+ * The appropriate partition lock must be held at entry.  It is not held on
+ * exit.
  */
 static void
 WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner, LatchBatch *wakeups)
@@ -1868,7 +1875,6 @@ WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner, LatchBatch *wakeups)
 			awaitedLock = NULL;
 			LOCK_PRINT("WaitOnLock: aborting on lock",
 					   locallock->lock, locallock->tag.mode);
-			LWLockRelease(LockHashPartitionLock(locallock->hashcode));
 
 			/*
 			 * Now that we aren't holding the partition lock, we can give an
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index f0e19b74a7..2a03cd4b1f 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -1033,7 +1033,7 @@ ProcQueueInit(PROC_QUEUE *queue)
  * Caller must have set MyProc->heldLocks to reflect locks already held
  * on the lockable object by this process (under all XIDs).
  *
- * The lock table's partition lock must be held at entry, and will be held
+ * The lock table's partition lock must be held at entry, and will be released
  * at exit.
  *
  * Result: PROC_WAIT_STATUS_OK if we acquired the lock, PROC_WAIT_STATUS_ERROR if not (deadlock).
@@ -1062,6 +1062,12 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable, LatchBatch *wakeups)
 	PGPROC	   *leader = MyProc->lockGroupLeader;
 	int			i;
 
+	/*
+	 * Every way out of this function will release the partition lock and send
+	 * buffered latch wakeups.
+	 */
+	Assert(LWLockHeldByMeInMode(partitionLock, LW_EXCLUSIVE));
+
 	/*
 	 * If group locking is in use, locks held by members of my locking group
 	 * need to be included in myHeldLocks.  This is not required for relation
@@ -1148,6 +1154,8 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable, LatchBatch *wakeups)
 					/* Skip the wait and just grant myself the lock. */
 					GrantLock(lock, proclock, lockmode);
 					GrantAwaitedLock();
+					LWLockRelease(partitionLock);
+					SetLatches(wakeups);
 					return PROC_WAIT_STATUS_OK;
 				}
 				/* Break out of loop to put myself before him */
@@ -1191,6 +1199,8 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable, LatchBatch *wakeups)
 	if (early_deadlock)
 	{
 		RemoveFromWaitQueue(MyProc, hashcode, wakeups);
+		LWLockRelease(partitionLock);
+		SetLatches(wakeups);
 		return PROC_WAIT_STATUS_ERROR;
 	}
 
@@ -1525,6 +1535,7 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable, LatchBatch *wakeups)
 			}
 
 			LWLockRelease(partitionLock);
+			SetLatches(wakeups);
 
 			if (deadlock_state == DS_SOFT_DEADLOCK)
 				ereport(LOG,
@@ -1626,13 +1637,6 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable, LatchBatch *wakeups)
 							standbyWaitStart, GetCurrentTimestamp(),
 							NULL, false);
 
-	/*
-	 * Re-acquire the lock table's partition lock.  We have to do this to hold
-	 * off cancel/die interrupts before we can mess with lockAwaited (else we
-	 * might have a missed or duplicated locallock update).
-	 */
-	LWLockAcquire(partitionLock, LW_EXCLUSIVE);
-
 	/*
 	 * We no longer want LockErrorCleanup to do anything.
 	 */
-- 
2.35.1

0005-Use-SetLatches-for-SERIALIZABLE-DEFERRABLE-wakeups.patch (text/x-patch; charset=US-ASCII)
From e1ab9a2f56127c3731c0d5e0aa4fd0fc8243cdfa Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Thu, 27 Oct 2022 12:40:12 +1300
Subject: [PATCH 5/8] Use SetLatches() for SERIALIZABLE DEFERRABLE wakeups.

Don't issue SetLatch()'s system call while holding a highly contended
LWLock.  Collect them in a LatchBatch to be set after the lock is
released.  Once woken, other backends will immediately try to acquire
that lock anyway so it's better to wake after releasing.

Take this opportunity to retire the confusingly named ProcSendSignal()
and replace it with something that gives the latch pointer we need.
There was only one other caller, in bufmgr.c, which is easily changed.
---
 src/backend/storage/buffer/bufmgr.c  |  2 +-
 src/backend/storage/ipc/procsignal.c |  2 --
 src/backend/storage/lmgr/predicate.c |  5 ++++-
 src/backend/storage/lmgr/proc.c      | 12 ------------
 src/include/storage/proc.h           |  4 +++-
 5 files changed, 8 insertions(+), 17 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 6b95381481..4d754caa30 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1918,7 +1918,7 @@ UnpinBuffer(BufferDesc *buf)
 
 				buf_state &= ~BM_PIN_COUNT_WAITER;
 				UnlockBufHdr(buf, buf_state);
-				ProcSendSignal(wait_backend_pgprocno);
+				SetLatch(GetProcLatchByNumber(wait_backend_pgprocno));
 			}
 			else
 				UnlockBufHdr(buf, buf_state);
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 7767657f27..280e4b91dd 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -254,8 +254,6 @@ CleanupProcSignalState(int status, Datum arg)
  *
  * On success (a signal was sent), zero is returned.
  * On error, -1 is returned, and errno is set (typically to ESRCH or EPERM).
- *
- * Not to be confused with ProcSendSignal
  */
 int
 SendProcSignal(pid_t pid, ProcSignalReason reason, BackendId backendId)
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index e8120174d6..a9e9fa01e7 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -3339,6 +3339,7 @@ ReleasePredicateLocks(bool isCommit, bool isReadOnlySafe)
 				nextConflict,
 				possibleUnsafeConflict;
 	SERIALIZABLEXACT *roXact;
+	LatchBatch	wakeups = {0};
 
 	/*
 	 * We can't trust XactReadOnly here, because a transaction which started
@@ -3672,7 +3673,7 @@ ReleasePredicateLocks(bool isCommit, bool isReadOnlySafe)
 			 */
 			if (SxactIsDeferrableWaiting(roXact) &&
 				(SxactIsROUnsafe(roXact) || SxactIsROSafe(roXact)))
-				ProcSendSignal(roXact->pgprocno);
+				AddLatch(&wakeups, GetProcLatchByNumber(roXact->pgprocno));
 
 			possibleUnsafeConflict = nextConflict;
 		}
@@ -3698,6 +3699,8 @@ ReleasePredicateLocks(bool isCommit, bool isReadOnlySafe)
 
 	LWLockRelease(SerializableXactHashLock);
 
+	SetLatches(&wakeups);
+
 	LWLockAcquire(SerializableFinishedListLock, LW_EXCLUSIVE);
 
 	/* Add this to the list of transactions to check for later cleanup. */
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 2a03cd4b1f..c644e7351d 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -1890,18 +1890,6 @@ ProcWaitForSignal(uint32 wait_event_info)
 	CHECK_FOR_INTERRUPTS();
 }
 
-/*
- * ProcSendSignal - set the latch of a backend identified by pgprocno
- */
-void
-ProcSendSignal(int pgprocno)
-{
-	if (pgprocno < 0 || pgprocno >= ProcGlobal->allProcCount)
-		elog(ERROR, "pgprocno out of range");
-
-	SetLatch(&ProcGlobal->allProcs[pgprocno].procLatch);
-}
-
 /*
  * BecomeLockGroupLeader - designate process as lock group leader
  *
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 4dba46cfcb..735e863ed1 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -413,6 +413,9 @@ extern PGDLLIMPORT PGPROC *PreparedXactProcs;
 /* Accessor for PGPROC given a pgprocno. */
 #define GetPGProcByNumber(n) (&ProcGlobal->allProcs[(n)])
 
+/* Accessor for procLatch given a pgprocno. */
+#define GetProcLatchByNumber(n) (&(GetPGProcByNumber(n))->procLatch)
+
 /*
  * We set aside some extra PGPROC structures for auxiliary processes,
  * ie things that aren't full-fledged backends but need shmem access.
@@ -456,7 +459,6 @@ extern bool IsWaitingForLock(void);
 extern void LockErrorCleanup(void);
 
 extern void ProcWaitForSignal(uint32 wait_event_info);
-extern void ProcSendSignal(int pgprocno);
 
 extern PGPROC *AuxiliaryPidGetProc(int pid);
 
-- 
2.35.1

0006-Use-SetLatches-for-synchronous-replication-wakeups.patch (text/x-patch; charset=US-ASCII)
From 351ce81b4d1b2a61fcfcc9902d7122e5410c8b63 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Thu, 27 Oct 2022 12:59:45 +1300
Subject: [PATCH 6/8] Use SetLatches() for synchronous replication wakeups.

Don't issue SetLatch() system calls while holding the central
synchronous replication lock.
---
 src/backend/replication/syncrep.c | 21 ++++++++++++++-------
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index 1a022b11a0..31cf05ed2f 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -100,7 +100,7 @@ static int	SyncRepWaitMode = SYNC_REP_NO_WAIT;
 
 static void SyncRepQueueInsert(int mode);
 static void SyncRepCancelWait(void);
-static int	SyncRepWakeQueue(bool all, int mode);
+static int	SyncRepWakeQueue(bool all, int mode, LatchBatch *wakeups);
 
 static bool SyncRepGetSyncRecPtr(XLogRecPtr *writePtr,
 								 XLogRecPtr *flushPtr,
@@ -449,6 +449,7 @@ SyncRepReleaseWaiters(void)
 	int			numwrite = 0;
 	int			numflush = 0;
 	int			numapply = 0;
+	LatchBatch	wakeups = {0};
 
 	/*
 	 * If this WALSender is serving a standby that is not on the list of
@@ -518,21 +519,23 @@ SyncRepReleaseWaiters(void)
 	if (walsndctl->lsn[SYNC_REP_WAIT_WRITE] < writePtr)
 	{
 		walsndctl->lsn[SYNC_REP_WAIT_WRITE] = writePtr;
-		numwrite = SyncRepWakeQueue(false, SYNC_REP_WAIT_WRITE);
+		numwrite = SyncRepWakeQueue(false, SYNC_REP_WAIT_WRITE, &wakeups);
 	}
 	if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < flushPtr)
 	{
 		walsndctl->lsn[SYNC_REP_WAIT_FLUSH] = flushPtr;
-		numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH);
+		numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH, &wakeups);
 	}
 	if (walsndctl->lsn[SYNC_REP_WAIT_APPLY] < applyPtr)
 	{
 		walsndctl->lsn[SYNC_REP_WAIT_APPLY] = applyPtr;
-		numapply = SyncRepWakeQueue(false, SYNC_REP_WAIT_APPLY);
+		numapply = SyncRepWakeQueue(false, SYNC_REP_WAIT_APPLY, &wakeups);
 	}
 
 	LWLockRelease(SyncRepLock);
 
+	SetLatches(&wakeups);
+
 	elog(DEBUG3, "released %d procs up to write %X/%X, %d procs up to flush %X/%X, %d procs up to apply %X/%X",
 		 numwrite, LSN_FORMAT_ARGS(writePtr),
 		 numflush, LSN_FORMAT_ARGS(flushPtr),
@@ -876,7 +879,7 @@ SyncRepGetStandbyPriority(void)
  * The caller must hold SyncRepLock in exclusive mode.
  */
 static int
-SyncRepWakeQueue(bool all, int mode)
+SyncRepWakeQueue(bool all, int mode, LatchBatch *wakeups)
 {
 	volatile WalSndCtlData *walsndctl = WalSndCtl;
 	PGPROC	   *proc = NULL;
@@ -929,7 +932,7 @@ SyncRepWakeQueue(bool all, int mode)
 		/*
 		 * Wake only when we have set state and removed from queue.
 		 */
-		SetLatch(&(thisproc->procLatch));
+		AddLatch(wakeups, &thisproc->procLatch);
 
 		numprocs++;
 	}
@@ -951,6 +954,8 @@ SyncRepUpdateSyncStandbysDefined(void)
 
 	if (sync_standbys_defined != WalSndCtl->sync_standbys_defined)
 	{
+		LatchBatch		wakeups = {0};
+
 		LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
 
 		/*
@@ -963,7 +968,7 @@ SyncRepUpdateSyncStandbysDefined(void)
 			int			i;
 
 			for (i = 0; i < NUM_SYNC_REP_WAIT_MODE; i++)
-				SyncRepWakeQueue(true, i);
+				SyncRepWakeQueue(true, i, &wakeups);
 		}
 
 		/*
@@ -976,6 +981,8 @@ SyncRepUpdateSyncStandbysDefined(void)
 		WalSndCtl->sync_standbys_defined = sync_standbys_defined;
 
 		LWLockRelease(SyncRepLock);
+
+		SetLatches(&wakeups);
 	}
 }
 
-- 
2.35.1

#2 Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#1)
7 attachment(s)
Re: Latches vs lwlock contention

On Fri, Oct 28, 2022 at 4:56 PM Thomas Munro <thomas.munro@gmail.com> wrote:

See attached sketch patches. I guess the main thing that may not be
good enough is the use of a fixed-size latch buffer. Memory
allocation in don't-throw-here environments like the guts of lock code
might be an issue, which is why it just gives up and flushes when
full; maybe it should try to allocate and fall back to flushing only
if that fails.

Here's an attempt at that. There aren't actually any uses of this
stuff in critical sections here, so perhaps I shouldn't bother
with that part. The part I'd most like some feedback on is the
heavyweight lock bits. I'll add this to the commitfest.

Attachments:

v2-0001-Allow-palloc_extended-NO_OOM-in-critical-sections.patch (application/octet-stream)
From 383e2d9b0d24e02ca468b32f8bd3775bbecf1e86 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 1 Nov 2022 23:29:38 +1300
Subject: [PATCH v2 1/9] Allow palloc_extended(NO_OOM) in critical sections.

Commit 4a170ee9e0e banned palloc() and similar in critical sections, because an
allocation failure would produce a panic.  Make an exception for allocation
with NULL on failure, for code that has a backup plan.

Discussion: https://postgr.es/m/CA%2BhUKGJHudZMG_vh6GiPB61pE%2BGgiBk5jxzd7inijqx5nEZLCw%40mail.gmail.com
---
 src/backend/utils/mmgr/mcxt.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/src/backend/utils/mmgr/mcxt.c b/src/backend/utils/mmgr/mcxt.c
index f526ca82c1..460fa9d6c0 100644
--- a/src/backend/utils/mmgr/mcxt.c
+++ b/src/backend/utils/mmgr/mcxt.c
@@ -1112,7 +1112,8 @@ MemoryContextAllocExtended(MemoryContext context, Size size, int flags)
 	void	   *ret;
 
 	Assert(MemoryContextIsValid(context));
-	AssertNotInCriticalSection(context);
+	if ((flags & MCXT_ALLOC_NO_OOM) == 0)
+		AssertNotInCriticalSection(context);
 
 	if (!((flags & MCXT_ALLOC_HUGE) != 0 ? AllocHugeSizeIsValid(size) :
 		  AllocSizeIsValid(size)))
@@ -1267,7 +1268,8 @@ palloc_extended(Size size, int flags)
 	MemoryContext context = CurrentMemoryContext;
 
 	Assert(MemoryContextIsValid(context));
-	AssertNotInCriticalSection(context);
+	if ((flags & MCXT_ALLOC_NO_OOM) == 0)
+		AssertNotInCriticalSection(context);
 
 	if (!((flags & MCXT_ALLOC_HUGE) != 0 ? AllocHugeSizeIsValid(size) :
 		  AllocSizeIsValid(size)))
@@ -1368,7 +1370,8 @@ repalloc_extended(void *pointer, Size size, int flags)
 		  AllocSizeIsValid(size)))
 		elog(ERROR, "invalid memory alloc request size %zu", size);
 
-	AssertNotInCriticalSection(context);
+	if ((flags & MCXT_ALLOC_NO_OOM) == 0)
+		AssertNotInCriticalSection(context);
 
 	/* isReset must be false already */
 	Assert(!context->isReset);
-- 
2.37.0 (Apple Git-136)

v2-0002-Provide-SetLatches-for-batched-deferred-latches.patch (application/octet-stream)
From e30aab2276b05212dbfe23d2a28636036610645d Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 26 Oct 2022 15:51:45 +1300
Subject: [PATCH v2 2/9] Provide SetLatches() for batched deferred latches.

If we have a way to buffer a set of wakeup targets and process them at a
later time, we can:

* move SetLatch() system calls out from under LWLocks, so that locks can
  be released faster; this is especially interesting in cases where the
  target backends will immediately try to acquire the same lock, or
  generally when the lock is heavily contended

* possibly gain some micro-optimization from issuing only two memory
  barriers for the whole batch of latches, not two for each latch to be
  set

* prepare for future IPC mechanisms which might allow us to wake
  multiple backends in a single system call

Individual users of this facility will follow in separate patches.

Discussion: https://postgr.es/m/CA%2BhUKGKmO7ze0Z6WXKdrLxmvYa%3DzVGGXOO30MMktufofVwEm1A%40mail.gmail.com
---
 src/backend/storage/ipc/latch.c  | 223 +++++++++++++++++++++----------
 src/include/storage/latch.h      |  13 ++
 src/tools/pgindent/typedefs.list |   1 +
 3 files changed, 163 insertions(+), 74 deletions(-)

diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index eb3a569aae..199ccc1e65 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -575,6 +575,94 @@ WaitLatchOrSocket(Latch *latch, int wakeEvents, pgsocket sock,
 	return ret;
 }
 
+/*
+ * Set multiple latches at the same time.
+ * Note: modifies input array.
+ */
+static void
+SetLatchV(Latch **latches, int nlatches)
+{
+	/* Flush any other changes out to main memory just once. */
+	pg_memory_barrier();
+
+	/* Keep only latches that are not already set, and set them. */
+	for (int i = 0; i < nlatches; ++i)
+	{
+		Latch	   *latch = latches[i];
+
+		if (!latch->is_set)
+			latch->is_set = true;
+		else
+			latches[i] = NULL;
+	}
+
+	pg_memory_barrier();
+
+	/* Wake the remaining latches that might be sleeping. */
+#ifndef WIN32
+	for (int i = 0; i < nlatches; ++i)
+	{
+		Latch	   *latch = latches[i];
+		pid_t		owner_pid;
+
+		if (!latch || !latch->maybe_sleeping)
+			continue;
+
+		/*
+		 * See if anyone's waiting for the latch. It can be the current
+		 * process if we're in a signal handler. We use the self-pipe or
+		 * SIGURG to ourselves to wake up WaitEventSetWaitBlock() without
+		 * races in that case. If it's another process, send a signal.
+		 *
+		 * Fetch owner_pid only once, in case the latch is concurrently
+		 * getting owned or disowned. XXX: This assumes that pid_t is atomic,
+		 * which isn't guaranteed to be true! In practice, the effective range
+		 * of pid_t fits in a 32 bit integer, and so should be atomic. In the
+		 * worst case, we might end up signaling the wrong process. Even then,
+		 * you're very unlucky if a process with that bogus pid exists and
+		 * belongs to Postgres; and PG database processes should handle excess
+		 * SIGURG interrupts without a problem anyhow.
+		 *
+		 * Another sort of race condition that's possible here is for a new
+		 * process to own the latch immediately after we look, so we don't
+		 * signal it. This is okay so long as all callers of
+		 * ResetLatch/WaitLatch follow the standard coding convention of
+		 * waiting at the bottom of their loops, not the top, so that they'll
+		 * correctly process latch-setting events that happen before they
+		 * enter the loop.
+		 */
+		owner_pid = latch->owner_pid;
+
+		if (owner_pid == MyProcPid)
+		{
+			if (waiting)
+			{
+#if defined(WAIT_USE_SELF_PIPE)
+				sendSelfPipeByte();
+#else
+				kill(MyProcPid, SIGURG);
+#endif
+			}
+		}
+		else
+			kill(owner_pid, SIGURG);
+	}
+#else
+	for (int i = 0; i < nlatches; ++i)
+	{
+		Latch	   *latch = latches[i];
+
+		if (latch && latch->maybe_sleeping)
+		{
+			HANDLE		event = latch->event;
+
+			if (event)
+				SetEvent(event);
+		}
+	}
+#endif
+}
+
 /*
  * Sets a latch and wakes up anyone waiting on it.
  *
@@ -590,89 +678,76 @@ WaitLatchOrSocket(Latch *latch, int wakeEvents, pgsocket sock,
 void
 SetLatch(Latch *latch)
 {
-#ifndef WIN32
-	pid_t		owner_pid;
-#else
-	HANDLE		handle;
-#endif
-
-	/*
-	 * The memory barrier has to be placed here to ensure that any flag
-	 * variables possibly changed by this process have been flushed to main
-	 * memory, before we check/set is_set.
-	 */
-	pg_memory_barrier();
-
-	/* Quick exit if already set */
-	if (latch->is_set)
-		return;
-
-	latch->is_set = true;
-
-	pg_memory_barrier();
-	if (!latch->maybe_sleeping)
-		return;
+	SetLatchV(&latch, 1);
+}
 
-#ifndef WIN32
+/*
+ * Add a latch to a group, to be set later.
+ */
+void
+AddLatch(LatchGroup *group, Latch *latch)
+{
+	/* First time.  Set up the in-place buffer. */
+	if (!group->latches)
+	{
+		group->latches = group->in_place_buffer;
+		group->capacity = lengthof(group->in_place_buffer);
+		Assert(group->size == 0);
+	}
 
-	/*
-	 * See if anyone's waiting for the latch. It can be the current process if
-	 * we're in a signal handler. We use the self-pipe or SIGURG to ourselves
-	 * to wake up WaitEventSetWaitBlock() without races in that case. If it's
-	 * another process, send a signal.
-	 *
-	 * Fetch owner_pid only once, in case the latch is concurrently getting
-	 * owned or disowned. XXX: This assumes that pid_t is atomic, which isn't
-	 * guaranteed to be true! In practice, the effective range of pid_t fits
-	 * in a 32 bit integer, and so should be atomic. In the worst case, we
-	 * might end up signaling the wrong process. Even then, you're very
-	 * unlucky if a process with that bogus pid exists and belongs to
-	 * Postgres; and PG database processes should handle excess SIGUSR1
-	 * interrupts without a problem anyhow.
-	 *
-	 * Another sort of race condition that's possible here is for a new
-	 * process to own the latch immediately after we look, so we don't signal
-	 * it. This is okay so long as all callers of ResetLatch/WaitLatch follow
-	 * the standard coding convention of waiting at the bottom of their loops,
-	 * not the top, so that they'll correctly process latch-setting events
-	 * that happen before they enter the loop.
-	 */
-	owner_pid = latch->owner_pid;
-	if (owner_pid == 0)
-		return;
-	else if (owner_pid == MyProcPid)
+	/* Are we full? */
+	if (group->size == group->capacity)
 	{
-#if defined(WAIT_USE_SELF_PIPE)
-		if (waiting)
-			sendSelfPipeByte();
-#else
-		if (waiting)
-			kill(MyProcPid, SIGURG);
-#endif
+		Latch	  **new_latches;
+		int			new_capacity;
+
+		/* Try to allocate more space. */
+		new_capacity = group->capacity * 2;
+		new_latches = palloc_extended(sizeof(latch) * new_capacity,
+									  MCXT_ALLOC_NO_OOM);
+		if (!new_latches)
+		{
+			/*
+			 * Allocation failed.  This is very unlikely to happen, but it
+			 * might be useful to be able to use this function in critical
+			 * sections, so we handle allocation failure by flushing instead
+			 * of throwing.
+			 */
+			SetLatches(group);
+		}
+		else
+		{
+			memcpy(new_latches, group->latches, sizeof(latch) * group->size);
+			if (group->latches != group->in_place_buffer)
+				pfree(group->latches);
+			group->latches = new_latches;
+			group->capacity = new_capacity;
+		}
 	}
-	else
-		kill(owner_pid, SIGURG);
 
-#else
+	Assert(group->size < group->capacity);
+	group->latches[group->size++] = latch;
+}
 
-	/*
-	 * See if anyone's waiting for the latch. It can be the current process if
-	 * we're in a signal handler.
-	 *
-	 * Use a local variable here just in case somebody changes the event field
-	 * concurrently (which really should not happen).
-	 */
-	handle = latch->event;
-	if (handle)
+/*
+ * Set all the latches accumulated in 'group'.
+ */
+void
+SetLatches(LatchGroup *group)
+{
+	if (group->size > 0)
 	{
-		SetEvent(handle);
+		SetLatchV(group->latches, group->size);
+		group->size = 0;
 
-		/*
-		 * Note that we silently ignore any errors. We might be in a signal
-		 * handler or other critical path where it's not safe to call elog().
-		 */
+		/* If we allocated a large buffer, it's time to free it. */
+		if (group->latches != group->in_place_buffer)
+		{
+			pfree(group->latches);
+			group->latches = group->in_place_buffer;
+			group->capacity = lengthof(group->in_place_buffer);
+		}
 	}
-#endif
 }
 
 /*
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index 68ab740f16..09c2b338c6 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -118,6 +118,17 @@ typedef struct Latch
 #endif
 } Latch;
 
+/*
+ * A buffer for setting multiple latches at a time.
+ */
+typedef struct LatchGroup
+{
+	int			size;
+	int			capacity;
+	Latch	  **latches;
+	Latch	   *in_place_buffer[64];
+} LatchGroup;
+
 /*
  * Bitmasks for events that may wake-up WaitLatch(), WaitLatchOrSocket(), or
  * WaitEventSetWait().
@@ -163,6 +174,8 @@ extern void InitSharedLatch(Latch *latch);
 extern void OwnLatch(Latch *latch);
 extern void DisownLatch(Latch *latch);
 extern void SetLatch(Latch *latch);
+extern void AddLatch(LatchGroup *group, Latch *latch);
+extern void SetLatches(LatchGroup *group);
 extern void ResetLatch(Latch *latch);
 extern void ShutdownLatchSupport(void);
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 2f02cc8f42..c86e49c758 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1389,6 +1389,7 @@ LagTracker
 LargeObjectDesc
 LastAttnumInfo
 Latch
+LatchGroup
 LerpFunc
 LexDescr
 LexemeEntry
-- 
2.37.0 (Apple Git-136)
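
To make the intended call pattern concrete, here is a sketch (not part of
the patch) of how a caller uses the API above: collect latches under the
lock with AddLatch(), release the lock, then issue all the wakeups with
SetLatches().  The lock name and queue-walking helper are invented for
illustration; the real callers appear in the follow-up patches.

	void
	WakeAllWaiters(void)
	{
		LatchGroup	wakeups = {0};	/* AddLatch() sets up the in-place buffer */
		PGPROC	   *proc;

		LWLockAcquire(SomeSharedLock, LW_EXCLUSIVE);	/* hypothetical lock */
		while ((proc = PopNextWaiter()) != NULL)	/* hypothetical queue walk */
			AddLatch(&wakeups, &proc->procLatch);	/* just remember the latch */
		LWLockRelease(SomeSharedLock);

		/* Only now issue the SIGURG/SetEvent system calls, outside the lock. */
		SetLatches(&wakeups);
	}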

v2-0003-Use-SetLatches-for-condition-variables.patchapplication/octet-stream; name=v2-0003-Use-SetLatches-for-condition-variables.patchDownload
From cc97393e11ed3027ae5b78623a2903e70134aade Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 26 Oct 2022 17:36:34 +1300
Subject: [PATCH v2 3/9] Use SetLatches() for condition variables.

Drain condition variable wait lists in larger batches, not one at a
time, while broadcasting.  This is an idea from Yura Sokolov, which I've
now combined with SetLatches() for slightly more efficiency.

Since we're now performing loops, change the internal spinlock to an
lwlock.

XXX There is probably a better data structure/arrangement that would
allow us to hold the lock for a shorter time.  This patch is not very
satisfying yet.

Discussion: https://postgr.es/m/CA%2BhUKGKmO7ze0Z6WXKdrLxmvYa%3DzVGGXOO30MMktufofVwEm1A%40mail.gmail.com
Discussion: https://postgr.es/m/1edbb61981fe1d99c3f20e3d56d6c88999f4227c.camel%40postgrespro.ru
---
 src/backend/storage/lmgr/condition_variable.c | 96 +++++++++----------
 src/backend/storage/lmgr/lwlock.c             |  2 +
 src/include/storage/condition_variable.h      |  4 +-
 src/include/storage/lwlock.h                  |  1 +
 4 files changed, 49 insertions(+), 54 deletions(-)

diff --git a/src/backend/storage/lmgr/condition_variable.c b/src/backend/storage/lmgr/condition_variable.c
index de65dac3ae..b975a91172 100644
--- a/src/backend/storage/lmgr/condition_variable.c
+++ b/src/backend/storage/lmgr/condition_variable.c
@@ -36,7 +36,7 @@ static ConditionVariable *cv_sleep_target = NULL;
 void
 ConditionVariableInit(ConditionVariable *cv)
 {
-	SpinLockInit(&cv->mutex);
+	LWLockInitialize(&cv->mutex, LWTRANCHE_CONDITION_VARIABLE);
 	proclist_init(&cv->wakeup);
 }
 
@@ -74,9 +74,9 @@ ConditionVariablePrepareToSleep(ConditionVariable *cv)
 	cv_sleep_target = cv;
 
 	/* Add myself to the wait queue. */
-	SpinLockAcquire(&cv->mutex);
+	LWLockAcquire(&cv->mutex, LW_EXCLUSIVE);
 	proclist_push_tail(&cv->wakeup, pgprocno, cvWaitLink);
-	SpinLockRelease(&cv->mutex);
+	LWLockRelease(&cv->mutex);
 }
 
 /*
@@ -180,13 +180,13 @@ ConditionVariableTimedSleep(ConditionVariable *cv, long timeout,
 		 * by something other than ConditionVariableSignal; though we don't
 		 * guarantee not to return spuriously, we'll avoid this obvious case.
 		 */
-		SpinLockAcquire(&cv->mutex);
+		LWLockAcquire(&cv->mutex, LW_EXCLUSIVE);
 		if (!proclist_contains(&cv->wakeup, MyProc->pgprocno, cvWaitLink))
 		{
 			done = true;
 			proclist_push_tail(&cv->wakeup, MyProc->pgprocno, cvWaitLink);
 		}
-		SpinLockRelease(&cv->mutex);
+		LWLockRelease(&cv->mutex);
 
 		/*
 		 * Check for interrupts, and return spuriously if that caused the
@@ -233,12 +233,12 @@ ConditionVariableCancelSleep(void)
 	if (cv == NULL)
 		return;
 
-	SpinLockAcquire(&cv->mutex);
+	LWLockAcquire(&cv->mutex, LW_EXCLUSIVE);
 	if (proclist_contains(&cv->wakeup, MyProc->pgprocno, cvWaitLink))
 		proclist_delete(&cv->wakeup, MyProc->pgprocno, cvWaitLink);
 	else
 		signaled = true;
-	SpinLockRelease(&cv->mutex);
+	LWLockRelease(&cv->mutex);
 
 	/*
 	 * If we've received a signal, pass it on to another waiting process, if
@@ -265,10 +265,10 @@ ConditionVariableSignal(ConditionVariable *cv)
 	PGPROC	   *proc = NULL;
 
 	/* Remove the first process from the wakeup queue (if any). */
-	SpinLockAcquire(&cv->mutex);
+	LWLockAcquire(&cv->mutex, LW_EXCLUSIVE);
 	if (!proclist_is_empty(&cv->wakeup))
 		proc = proclist_pop_head_node(&cv->wakeup, cvWaitLink);
-	SpinLockRelease(&cv->mutex);
+	LWLockRelease(&cv->mutex);
 
 	/* If we found someone sleeping, set their latch to wake them up. */
 	if (proc != NULL)
@@ -287,6 +287,7 @@ ConditionVariableBroadcast(ConditionVariable *cv)
 {
 	int			pgprocno = MyProc->pgprocno;
 	PGPROC	   *proc = NULL;
+	bool		inserted_sentinel = false;
 	bool		have_sentinel = false;
 
 	/*
@@ -313,52 +314,43 @@ ConditionVariableBroadcast(ConditionVariable *cv)
 	if (cv_sleep_target != NULL)
 		ConditionVariableCancelSleep();
 
-	/*
-	 * Inspect the state of the queue.  If it's empty, we have nothing to do.
-	 * If there's exactly one entry, we need only remove and signal that
-	 * entry.  Otherwise, remove the first entry and insert our sentinel.
-	 */
-	SpinLockAcquire(&cv->mutex);
-	/* While we're here, let's assert we're not in the list. */
-	Assert(!proclist_contains(&cv->wakeup, pgprocno, cvWaitLink));
-
-	if (!proclist_is_empty(&cv->wakeup))
+	do
 	{
-		proc = proclist_pop_head_node(&cv->wakeup, cvWaitLink);
-		if (!proclist_is_empty(&cv->wakeup))
-		{
-			proclist_push_tail(&cv->wakeup, pgprocno, cvWaitLink);
-			have_sentinel = true;
-		}
-	}
-	SpinLockRelease(&cv->mutex);
+		LatchGroup	wakeups = {0};
 
-	/* Awaken first waiter, if there was one. */
-	if (proc != NULL)
-		SetLatch(&proc->procLatch);
+		LWLockAcquire(&cv->mutex, LW_EXCLUSIVE);
+		while (!proclist_is_empty(&cv->wakeup))
+		{
+			/* Wake up in small sized batches.  XXX tune */
+			if (wakeups.size == 8)
+			{
+				if (!inserted_sentinel)
+				{
+					/* First batch of many to wake up.  Add sentinel. */
+					proclist_push_tail(&cv->wakeup, pgprocno, cvWaitLink);
+					inserted_sentinel = true;
+					have_sentinel = true;
+				}
+				break;
+			}
 
-	while (have_sentinel)
-	{
-		/*
-		 * Each time through the loop, remove the first wakeup list entry, and
-		 * signal it unless it's our sentinel.  Repeat as long as the sentinel
-		 * remains in the list.
-		 *
-		 * Notice that if someone else removes our sentinel, we will waken one
-		 * additional process before exiting.  That's intentional, because if
-		 * someone else signals the CV, they may be intending to waken some
-		 * third process that added itself to the list after we added the
-		 * sentinel.  Better to give a spurious wakeup (which should be
-		 * harmless beyond wasting some cycles) than to lose a wakeup.
-		 */
-		proc = NULL;
-		SpinLockAcquire(&cv->mutex);
-		if (!proclist_is_empty(&cv->wakeup))
 			proc = proclist_pop_head_node(&cv->wakeup, cvWaitLink);
-		have_sentinel = proclist_contains(&cv->wakeup, pgprocno, cvWaitLink);
-		SpinLockRelease(&cv->mutex);
+			if (proc == MyProc)
+			{
+				/* We hit our sentinel.  We're done. */
+				have_sentinel = false;
+				break;
+			}
+			else if (!proclist_contains(&cv->wakeup, pgprocno, cvWaitLink))
+			{
+				/* Someone else hit our sentinel.  We're done. */
+				have_sentinel = false;
+			}
+			AddLatch(&wakeups, &proc->procLatch);
+		}
+		LWLockRelease(&cv->mutex);
 
-		if (proc != NULL && proc != MyProc)
-			SetLatch(&proc->procLatch);
-	}
+		/* Awaken this batch of waiters, if there were some. */
+		SetLatches(&wakeups);
+	} while (have_sentinel);
 }
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 0fc0cf6ebb..77986bac7e 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -183,6 +183,8 @@ static const char *const BuiltinTrancheNames[] = {
 	"PgStatsHash",
 	/* LWTRANCHE_PGSTATS_DATA: */
 	"PgStatsData",
+	/* LWTRANCHE_CONDITION_VARIABLE: */
+	"ConditionVariable",
 };
 
 StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/include/storage/condition_variable.h b/src/include/storage/condition_variable.h
index e89175ebd5..82b8eeabf5 100644
--- a/src/include/storage/condition_variable.h
+++ b/src/include/storage/condition_variable.h
@@ -22,12 +22,12 @@
 #ifndef CONDITION_VARIABLE_H
 #define CONDITION_VARIABLE_H
 
+#include "storage/lwlock.h"
 #include "storage/proclist_types.h"
-#include "storage/spin.h"
 
 typedef struct
 {
-	slock_t		mutex;			/* spinlock protecting the wakeup list */
+	LWLock		mutex;			/* lock protecting the wakeup list */
 	proclist_head wakeup;		/* list of wake-able processes */
 } ConditionVariable;
 
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index ca4eca76f4..dcbffe11a2 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -193,6 +193,7 @@ typedef enum BuiltinTrancheIds
 	LWTRANCHE_PGSTATS_DSA,
 	LWTRANCHE_PGSTATS_HASH,
 	LWTRANCHE_PGSTATS_DATA,
+	LWTRANCHE_CONDITION_VARIABLE,
 	LWTRANCHE_FIRST_USER_DEFINED
 }			BuiltinTrancheIds;
 
-- 
2.37.0 (Apple Git-136)
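
For context, the waiter side is unchanged by this patch.  A rough sketch
of the usual wait loop is below (the predicate and wait event name are
placeholders).  Note that each waiter woken by the broadcast re-takes
cv->mutex inside ConditionVariableSleep() to put itself back on the
wakeup list, which is the contention that draining in batches is meant
to reduce.

	ConditionVariablePrepareToSleep(&shared->cv);
	while (!state_is_ready())	/* hypothetical predicate */
		ConditionVariableSleep(&shared->cv, WAIT_EVENT_SOMETHING);	/* placeholder */
	ConditionVariableCancelSleep();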

v2-0004-Use-SetLatches-for-heavyweight-locks.patchapplication/octet-stream; name=v2-0004-Use-SetLatches-for-heavyweight-locks.patchDownload
From fda454d2246ef050465f48ba18c0144c25dd2834 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 26 Oct 2022 16:43:31 +1300
Subject: [PATCH v2 4/9] Use SetLatches() for heavyweight locks.

Collect wakeups into a LatchGroup to be set at once after the
LockManager's internal partition lock is released.  This avoids holding
busy locks while issuing system calls.

Currently, waiters immediately try to acquire that lock themselves, so
deferring until it's released makes sense to avoid contention (a later
patch may remove that lock acquisition though).

Discussion: https://postgr.es/m/CA%2BhUKGKmO7ze0Z6WXKdrLxmvYa%3DzVGGXOO30MMktufofVwEm1A%40mail.gmail.com
---
 src/backend/storage/lmgr/deadlock.c |  4 ++--
 src/backend/storage/lmgr/lock.c     | 36 +++++++++++++++++++----------
 src/backend/storage/lmgr/proc.c     | 29 ++++++++++++++---------
 src/include/storage/lock.h          |  5 ++--
 src/include/storage/proc.h          |  5 ++--
 5 files changed, 49 insertions(+), 30 deletions(-)

diff --git a/src/backend/storage/lmgr/deadlock.c b/src/backend/storage/lmgr/deadlock.c
index cd9c0418ec..2876be16f3 100644
--- a/src/backend/storage/lmgr/deadlock.c
+++ b/src/backend/storage/lmgr/deadlock.c
@@ -214,7 +214,7 @@ InitDeadLockChecking(void)
  * and (b) we are typically invoked inside a signal handler.
  */
 DeadLockState
-DeadLockCheck(PGPROC *proc)
+DeadLockCheck(PGPROC *proc, LatchGroup *wakeups)
 {
 	int			i,
 				j;
@@ -272,7 +272,7 @@ DeadLockCheck(PGPROC *proc)
 #endif
 
 		/* See if any waiters for the lock can be woken up now */
-		ProcLockWakeup(GetLocksMethodTable(lock), lock);
+		ProcLockWakeup(GetLocksMethodTable(lock), lock, wakeups);
 	}
 
 	/* Return code tells caller if we had to escape a deadlock or not */
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 3d1049cf75..4ce7d53368 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -374,14 +374,14 @@ static PROCLOCK *SetupLockInTable(LockMethod lockMethodTable, PGPROC *proc,
 static void GrantLockLocal(LOCALLOCK *locallock, ResourceOwner owner);
 static void BeginStrongLockAcquire(LOCALLOCK *locallock, uint32 fasthashcode);
 static void FinishStrongLockAcquire(void);
-static void WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner);
+static void WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner, LatchGroup *wakeups);
 static void ReleaseLockIfHeld(LOCALLOCK *locallock, bool sessionLock);
 static void LockReassignOwner(LOCALLOCK *locallock, ResourceOwner parent);
 static bool UnGrantLock(LOCK *lock, LOCKMODE lockmode,
 						PROCLOCK *proclock, LockMethod lockMethodTable);
 static void CleanUpLock(LOCK *lock, PROCLOCK *proclock,
 						LockMethod lockMethodTable, uint32 hashcode,
-						bool wakeupNeeded);
+						bool wakeupNeeded, LatchGroup *wakeups);
 static void LockRefindAndRelease(LockMethod lockMethodTable, PGPROC *proc,
 								 LOCKTAG *locktag, LOCKMODE lockmode,
 								 bool decrement_strong_lock_count);
@@ -787,6 +787,7 @@ LockAcquireExtended(const LOCKTAG *locktag,
 	LWLock	   *partitionLock;
 	bool		found_conflict;
 	bool		log_lock = false;
+	LatchGroup	wakeups = {0};
 
 	if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
 		elog(ERROR, "unrecognized lock method: %d", lockmethodid);
@@ -1098,7 +1099,7 @@ LockAcquireExtended(const LOCKTAG *locktag,
 										 locktag->locktag_type,
 										 lockmode);
 
-		WaitOnLock(locallock, owner);
+		WaitOnLock(locallock, owner, &wakeups);
 
 		TRACE_POSTGRESQL_LOCK_WAIT_DONE(locktag->locktag_field1,
 										locktag->locktag_field2,
@@ -1138,6 +1139,8 @@ LockAcquireExtended(const LOCKTAG *locktag,
 
 	LWLockRelease(partitionLock);
 
+	SetLatches(&wakeups);
+
 	/*
 	 * Emit a WAL record if acquisition of this lock needs to be replayed in a
 	 * standby server.
@@ -1634,7 +1637,7 @@ UnGrantLock(LOCK *lock, LOCKMODE lockmode,
 static void
 CleanUpLock(LOCK *lock, PROCLOCK *proclock,
 			LockMethod lockMethodTable, uint32 hashcode,
-			bool wakeupNeeded)
+			bool wakeupNeeded, LatchGroup *wakeups)
 {
 	/*
 	 * If this was my last hold on this lock, delete my entry in the proclock
@@ -1674,7 +1677,7 @@ CleanUpLock(LOCK *lock, PROCLOCK *proclock,
 	else if (wakeupNeeded)
 	{
 		/* There are waiters on this lock, so wake them up. */
-		ProcLockWakeup(lockMethodTable, lock);
+		ProcLockWakeup(lockMethodTable, lock, wakeups);
 	}
 }
 
@@ -1811,7 +1814,7 @@ MarkLockClear(LOCALLOCK *locallock)
  * The appropriate partition lock must be held at entry.
  */
 static void
-WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner)
+WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner, LatchGroup *wakeups)
 {
 	LOCKMETHODID lockmethodid = LOCALLOCK_LOCKMETHOD(*locallock);
 	LockMethod	lockMethodTable = LockMethods[lockmethodid];
@@ -1856,7 +1859,7 @@ WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner)
 	 */
 	PG_TRY();
 	{
-		if (ProcSleep(locallock, lockMethodTable) != PROC_WAIT_STATUS_OK)
+		if (ProcSleep(locallock, lockMethodTable, wakeups) != PROC_WAIT_STATUS_OK)
 		{
 			/*
 			 * We failed as a result of a deadlock, see CheckDeadLock(). Quit
@@ -1915,7 +1918,7 @@ WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner)
  * NB: this does not clean up any locallock object that may exist for the lock.
  */
 void
-RemoveFromWaitQueue(PGPROC *proc, uint32 hashcode)
+RemoveFromWaitQueue(PGPROC *proc, uint32 hashcode, LatchGroup *wakeups)
 {
 	LOCK	   *waitLock = proc->waitLock;
 	PROCLOCK   *proclock = proc->waitProcLock;
@@ -1957,7 +1960,7 @@ RemoveFromWaitQueue(PGPROC *proc, uint32 hashcode)
 	 */
 	CleanUpLock(waitLock, proclock,
 				LockMethods[lockmethodid], hashcode,
-				true);
+				true, wakeups);
 }
 
 /*
@@ -1982,6 +1985,7 @@ LockRelease(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
 	PROCLOCK   *proclock;
 	LWLock	   *partitionLock;
 	bool		wakeupNeeded;
+	LatchGroup	wakeups = {0};
 
 	if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
 		elog(ERROR, "unrecognized lock method: %d", lockmethodid);
@@ -2160,10 +2164,12 @@ LockRelease(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
 
 	CleanUpLock(lock, proclock,
 				lockMethodTable, locallock->hashcode,
-				wakeupNeeded);
+				wakeupNeeded, &wakeups);
 
 	LWLockRelease(partitionLock);
 
+	SetLatches(&wakeups);
+
 	RemoveLocalLock(locallock);
 	return true;
 }
@@ -2188,6 +2194,7 @@ LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
 	PROCLOCK   *proclock;
 	int			partition;
 	bool		have_fast_path_lwlock = false;
+	LatchGroup	wakeups = {0};
 
 	if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
 		elog(ERROR, "unrecognized lock method: %d", lockmethodid);
@@ -2434,10 +2441,12 @@ LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
 			CleanUpLock(lock, proclock,
 						lockMethodTable,
 						LockTagHashCode(&lock->tag),
-						wakeupNeeded);
+						wakeupNeeded, &wakeups);
 		}						/* loop over PROCLOCKs within this partition */
 
 		LWLockRelease(partitionLock);
+
+		SetLatches(&wakeups);
 	}							/* loop over partitions */
 
 #ifdef LOCK_DEBUG
@@ -3137,6 +3146,7 @@ LockRefindAndRelease(LockMethod lockMethodTable, PGPROC *proc,
 	uint32		proclock_hashcode;
 	LWLock	   *partitionLock;
 	bool		wakeupNeeded;
+	LatchGroup	wakeups = {0};
 
 	hashcode = LockTagHashCode(locktag);
 	partitionLock = LockHashPartitionLock(hashcode);
@@ -3190,10 +3200,12 @@ LockRefindAndRelease(LockMethod lockMethodTable, PGPROC *proc,
 
 	CleanUpLock(lock, proclock,
 				lockMethodTable, hashcode,
-				wakeupNeeded);
+				wakeupNeeded, &wakeups);
 
 	LWLockRelease(partitionLock);
 
+	SetLatches(&wakeups);
+
 	/*
 	 * Decrement strong lock count.  This logic is needed only for 2PC.
 	 */
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 13fa07b0ff..a593c11f45 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -700,6 +700,7 @@ LockErrorCleanup(void)
 {
 	LWLock	   *partitionLock;
 	DisableTimeoutParams timeouts[2];
+	LatchGroup	wakeups = {0};
 
 	HOLD_INTERRUPTS();
 
@@ -733,7 +734,7 @@ LockErrorCleanup(void)
 	if (MyProc->links.next != NULL)
 	{
 		/* We could not have been granted the lock yet */
-		RemoveFromWaitQueue(MyProc, lockAwaited->hashcode);
+		RemoveFromWaitQueue(MyProc, lockAwaited->hashcode, &wakeups);
 	}
 	else
 	{
@@ -750,6 +751,7 @@ LockErrorCleanup(void)
 	lockAwaited = NULL;
 
 	LWLockRelease(partitionLock);
+	SetLatches(&wakeups);
 
 	RESUME_INTERRUPTS();
 }
@@ -1042,7 +1044,7 @@ ProcQueueInit(PROC_QUEUE *queue)
  * NOTES: The process queue is now a priority queue for locking.
  */
 ProcWaitStatus
-ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
+ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable, LatchGroup *wakeups)
 {
 	LOCKMODE	lockmode = locallock->tag.mode;
 	LOCK	   *lock = locallock->lock;
@@ -1188,7 +1190,7 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
 	 */
 	if (early_deadlock)
 	{
-		RemoveFromWaitQueue(MyProc, hashcode);
+		RemoveFromWaitQueue(MyProc, hashcode, wakeups);
 		return PROC_WAIT_STATUS_ERROR;
 	}
 
@@ -1204,6 +1206,7 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
 	 * LockErrorCleanup will clean up if cancel/die happens.
 	 */
 	LWLockRelease(partitionLock);
+	SetLatches(wakeups);
 
 	/*
 	 * Also, now that we will successfully clean up after an ereport, it's
@@ -1662,8 +1665,8 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
  * to twiddle the lock's request counts too --- see RemoveFromWaitQueue.
  * Hence, in practice the waitStatus parameter must be PROC_WAIT_STATUS_OK.
  */
-PGPROC *
-ProcWakeup(PGPROC *proc, ProcWaitStatus waitStatus)
+static PGPROC *
+ProcWakeup(PGPROC *proc, ProcWaitStatus waitStatus, LatchGroup *wakeups)
 {
 	PGPROC	   *retProc;
 
@@ -1686,8 +1689,8 @@ ProcWakeup(PGPROC *proc, ProcWaitStatus waitStatus)
 	proc->waitStatus = waitStatus;
 	pg_atomic_write_u64(&MyProc->waitStart, 0);
 
-	/* And awaken it */
-	SetLatch(&proc->procLatch);
+	/* Schedule it to be awoken */
+	AddLatch(wakeups, &proc->procLatch);
 
 	return retProc;
 }
@@ -1700,7 +1703,7 @@ ProcWakeup(PGPROC *proc, ProcWaitStatus waitStatus)
  * The appropriate lock partition lock must be held by caller.
  */
 void
-ProcLockWakeup(LockMethod lockMethodTable, LOCK *lock)
+ProcLockWakeup(LockMethod lockMethodTable, LOCK *lock, LatchGroup *wakeups)
 {
 	PROC_QUEUE *waitQueue = &(lock->waitProcs);
 	int			queue_size = waitQueue->size;
@@ -1728,7 +1731,7 @@ ProcLockWakeup(LockMethod lockMethodTable, LOCK *lock)
 		{
 			/* OK to waken */
 			GrantLock(lock, proc->waitProcLock, lockmode);
-			proc = ProcWakeup(proc, PROC_WAIT_STATUS_OK);
+			proc = ProcWakeup(proc, PROC_WAIT_STATUS_OK, wakeups);
 
 			/*
 			 * ProcWakeup removes proc from the lock's waiting process queue
@@ -1762,6 +1765,7 @@ static void
 CheckDeadLock(void)
 {
 	int			i;
+	LatchGroup	wakeups = {0};
 
 	/*
 	 * Acquire exclusive lock on the entire shared lock data structures. Must
@@ -1796,7 +1800,7 @@ CheckDeadLock(void)
 #endif
 
 	/* Run the deadlock check, and set deadlock_state for use by ProcSleep */
-	deadlock_state = DeadLockCheck(MyProc);
+	deadlock_state = DeadLockCheck(MyProc, &wakeups);
 
 	if (deadlock_state == DS_HARD_DEADLOCK)
 	{
@@ -1813,7 +1817,8 @@ CheckDeadLock(void)
 		 * return from the signal handler.
 		 */
 		Assert(MyProc->waitLock != NULL);
-		RemoveFromWaitQueue(MyProc, LockTagHashCode(&(MyProc->waitLock->tag)));
+		RemoveFromWaitQueue(MyProc, LockTagHashCode(&(MyProc->waitLock->tag)),
+							&wakeups);
 
 		/*
 		 * We're done here.  Transaction abort caused by the error that
@@ -1837,6 +1842,8 @@ CheckDeadLock(void)
 check_done:
 	for (i = NUM_LOCK_PARTITIONS; --i >= 0;)
 		LWLockRelease(LockHashPartitionLockByIndex(i));
+
+	SetLatches(&wakeups);
 }
 
 /*
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index e4e1495b24..790c15c7ea 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -19,6 +19,7 @@
 #endif
 
 #include "storage/backendid.h"
+#include "storage/latch.h"
 #include "storage/lockdefs.h"
 #include "storage/lwlock.h"
 #include "storage/shmem.h"
@@ -575,7 +576,7 @@ extern bool LockCheckConflicts(LockMethod lockMethodTable,
 							   LOCK *lock, PROCLOCK *proclock);
 extern void GrantLock(LOCK *lock, PROCLOCK *proclock, LOCKMODE lockmode);
 extern void GrantAwaitedLock(void);
-extern void RemoveFromWaitQueue(PGPROC *proc, uint32 hashcode);
+extern void RemoveFromWaitQueue(PGPROC *proc, uint32 hashcode, LatchGroup *wakeups);
 extern Size LockShmemSize(void);
 extern LockData *GetLockStatusData(void);
 extern BlockedProcsData *GetBlockerStatusData(int blocked_pid);
@@ -592,7 +593,7 @@ extern void lock_twophase_postabort(TransactionId xid, uint16 info,
 extern void lock_twophase_standby_recover(TransactionId xid, uint16 info,
 										  void *recdata, uint32 len);
 
-extern DeadLockState DeadLockCheck(PGPROC *proc);
+extern DeadLockState DeadLockCheck(PGPROC *proc, LatchGroup *wakeups);
 extern PGPROC *GetBlockingAutoVacuumPgproc(void);
 extern void DeadLockReport(void) pg_attribute_noreturn();
 extern void RememberSimpleDeadLock(PGPROC *proc1,
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 8d096fdeeb..0ae47325e4 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -449,9 +449,8 @@ extern bool HaveNFreeProcs(int n);
 extern void ProcReleaseLocks(bool isCommit);
 
 extern void ProcQueueInit(PROC_QUEUE *queue);
-extern ProcWaitStatus ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable);
-extern PGPROC *ProcWakeup(PGPROC *proc, ProcWaitStatus waitStatus);
-extern void ProcLockWakeup(LockMethod lockMethodTable, LOCK *lock);
+extern ProcWaitStatus ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable, LatchGroup *wakeups);
+extern void ProcLockWakeup(LockMethod lockMethodTable, LOCK *lock, LatchGroup *wakeups);
 extern void CheckDeadLockAlert(void);
 extern bool IsWaitingForLock(void);
 extern void LockErrorCleanup(void);
-- 
2.37.0 (Apple Git-136)

v2-0005-Don-t-re-acquire-LockManager-partition-lock-after.patchapplication/octet-stream; name=v2-0005-Don-t-re-acquire-LockManager-partition-lock-after.patchDownload
From 162b7203e5f355649de2381ad6b3abc92acc3f3a Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Thu, 27 Oct 2022 10:56:39 +1300
Subject: [PATCH v2 5/9] Don't re-acquire LockManager partition lock after
 wakeup.

Change the contract of WaitOnLock() and ProcSleep() to return with the
partition lock released.  ProcSleep() re-acquired the lock to hold off
cancel/die interrupts, which mattered at the time the ancestor of that
code was committed in fe548629c50, because back then signal handlers
could longjmp out of there.  Now that can happen only at points where we
explicitly run CHECK_FOR_INTERRUPTS(), so the re-acquisition can be
removed, which leads to the idea of making the function always release
the lock.

While an earlier commit fixed the problem that the backend woke us up
before it had even released the lock itself, this lock reacquisition
point was still contended when multiple backends woke at the same time.

XXX Right?  Or what am I missing?

Discussion: https://postgr.es/m/CA%2BhUKGKmO7ze0Z6WXKdrLxmvYa%3DzVGGXOO30MMktufofVwEm1A%40mail.gmail.com
---
 src/backend/storage/lmgr/lock.c | 18 ++++++++++++------
 src/backend/storage/lmgr/proc.c | 20 ++++++++++++--------
 2 files changed, 24 insertions(+), 14 deletions(-)

diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 4ce7d53368..cd3b84101c 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -787,6 +787,7 @@ LockAcquireExtended(const LOCKTAG *locktag,
 	LWLock	   *partitionLock;
 	bool		found_conflict;
 	bool		log_lock = false;
+	bool		partitionLocked = false;
 	LatchGroup	wakeups = {0};
 
 	if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
@@ -995,6 +996,7 @@ LockAcquireExtended(const LOCKTAG *locktag,
 	partitionLock = LockHashPartitionLock(hashcode);
 
 	LWLockAcquire(partitionLock, LW_EXCLUSIVE);
+	partitionLocked = true;
 
 	/*
 	 * Find or create lock and proclock entries with this tag
@@ -1099,7 +1101,10 @@ LockAcquireExtended(const LOCKTAG *locktag,
 										 locktag->locktag_type,
 										 lockmode);
 
+		Assert(LWLockHeldByMeInMode(partitionLock, LW_EXCLUSIVE));
 		WaitOnLock(locallock, owner, &wakeups);
+		Assert(!LWLockHeldByMeInMode(partitionLock, LW_EXCLUSIVE));
+		partitionLocked = false;
 
 		TRACE_POSTGRESQL_LOCK_WAIT_DONE(locktag->locktag_field1,
 										locktag->locktag_field2,
@@ -1124,7 +1129,6 @@ LockAcquireExtended(const LOCKTAG *locktag,
 			PROCLOCK_PRINT("LockAcquire: INCONSISTENT", proclock);
 			LOCK_PRINT("LockAcquire: INCONSISTENT", lock, lockmode);
 			/* Should we retry ? */
-			LWLockRelease(partitionLock);
 			elog(ERROR, "LockAcquire failed");
 		}
 		PROCLOCK_PRINT("LockAcquire: granted", proclock);
@@ -1137,9 +1141,11 @@ LockAcquireExtended(const LOCKTAG *locktag,
 	 */
 	FinishStrongLockAcquire();
 
-	LWLockRelease(partitionLock);
-
-	SetLatches(&wakeups);
+	if (partitionLocked)
+	{
+		LWLockRelease(partitionLock);
+		SetLatches(&wakeups);
+	}
 
 	/*
 	 * Emit a WAL record if acquisition of this lock needs to be replayed in a
@@ -1811,7 +1817,8 @@ MarkLockClear(LOCALLOCK *locallock)
  * Caller must have set MyProc->heldLocks to reflect locks already held
  * on the lockable object by this process.
  *
- * The appropriate partition lock must be held at entry.
+ * The appropriate partition lock must be held at entry.  It is not held on
+ * exit.
  */
 static void
 WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner, LatchGroup *wakeups)
@@ -1868,7 +1875,6 @@ WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner, LatchGroup *wakeups)
 			awaitedLock = NULL;
 			LOCK_PRINT("WaitOnLock: aborting on lock",
 					   locallock->lock, locallock->tag.mode);
-			LWLockRelease(LockHashPartitionLock(locallock->hashcode));
 
 			/*
 			 * Now that we aren't holding the partition lock, we can give an
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index a593c11f45..281260dab3 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -1033,7 +1033,7 @@ ProcQueueInit(PROC_QUEUE *queue)
  * Caller must have set MyProc->heldLocks to reflect locks already held
  * on the lockable object by this process (under all XIDs).
  *
- * The lock table's partition lock must be held at entry, and will be held
+ * The lock table's partition lock must be held at entry, and will be released
  * at exit.
  *
  * Result: PROC_WAIT_STATUS_OK if we acquired the lock, PROC_WAIT_STATUS_ERROR if not (deadlock).
@@ -1062,6 +1062,12 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable, LatchGroup *wakeups)
 	PGPROC	   *leader = MyProc->lockGroupLeader;
 	int			i;
 
+	/*
+	 * Every way out of this function will release the partition lock and send
+	 * buffered latch wakeups.
+	 */
+	Assert(LWLockHeldByMeInMode(partitionLock, LW_EXCLUSIVE));
+
 	/*
 	 * If group locking is in use, locks held by members of my locking group
 	 * need to be included in myHeldLocks.  This is not required for relation
@@ -1148,6 +1154,8 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable, LatchGroup *wakeups)
 					/* Skip the wait and just grant myself the lock. */
 					GrantLock(lock, proclock, lockmode);
 					GrantAwaitedLock();
+					LWLockRelease(partitionLock);
+					SetLatches(wakeups);
 					return PROC_WAIT_STATUS_OK;
 				}
 				/* Break out of loop to put myself before him */
@@ -1191,6 +1199,8 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable, LatchGroup *wakeups)
 	if (early_deadlock)
 	{
 		RemoveFromWaitQueue(MyProc, hashcode, wakeups);
+		LWLockRelease(partitionLock);
+		SetLatches(wakeups);
 		return PROC_WAIT_STATUS_ERROR;
 	}
 
@@ -1525,6 +1535,7 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable, LatchGroup *wakeups)
 			}
 
 			LWLockRelease(partitionLock);
+			SetLatches(wakeups);
 
 			if (deadlock_state == DS_SOFT_DEADLOCK)
 				ereport(LOG,
@@ -1626,13 +1637,6 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable, LatchGroup *wakeups)
 							standbyWaitStart, GetCurrentTimestamp(),
 							NULL, false);
 
-	/*
-	 * Re-acquire the lock table's partition lock.  We have to do this to hold
-	 * off cancel/die interrupts before we can mess with lockAwaited (else we
-	 * might have a missed or duplicated locallock update).
-	 */
-	LWLockAcquire(partitionLock, LW_EXCLUSIVE);
-
 	/*
 	 * We no longer want LockErrorCleanup to do anything.
 	 */
-- 
2.37.0 (Apple Git-136)

v2-0006-Use-SetLatches-for-SERIALIZABLE-DEFERRABLE-wakeup.patchapplication/octet-stream; name=v2-0006-Use-SetLatches-for-SERIALIZABLE-DEFERRABLE-wakeup.patchDownload
From 0e7ac0c8ae5f7d151412fb61ba582394e60271cf Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Thu, 27 Oct 2022 12:40:12 +1300
Subject: [PATCH v2 6/9] Use SetLatches() for SERIALIZABLE DEFERRABLE wakeups.

Don't issue SetLatch() system calls while holding a highly contended
LWLock.  Collect them in a LatchGroup to be set after the lock is
released.  Once woken, other backends will immediately try to acquire
that lock anyway, so it's better to wake them after releasing.

Take this opportunity to retire the confusingly named ProcSendSignal()
and replace it with something that gives the latch pointer we need.
There was only one other caller, in bufmgr.c, which is easily changed.

Discussion: https://postgr.es/m/CA%2BhUKGKmO7ze0Z6WXKdrLxmvYa%3DzVGGXOO30MMktufofVwEm1A%40mail.gmail.com
---
 src/backend/storage/buffer/bufmgr.c  |  2 +-
 src/backend/storage/ipc/procsignal.c |  2 --
 src/backend/storage/lmgr/predicate.c |  5 ++++-
 src/backend/storage/lmgr/proc.c      | 12 ------------
 src/include/storage/proc.h           |  4 +++-
 5 files changed, 8 insertions(+), 17 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 73d30bf619..df15f5e6c2 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1918,7 +1918,7 @@ UnpinBuffer(BufferDesc *buf)
 
 				buf_state &= ~BM_PIN_COUNT_WAITER;
 				UnlockBufHdr(buf, buf_state);
-				ProcSendSignal(wait_backend_pgprocno);
+				SetLatch(GetProcLatchByNumber(wait_backend_pgprocno));
 			}
 			else
 				UnlockBufHdr(buf, buf_state);
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 7767657f27..280e4b91dd 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -254,8 +254,6 @@ CleanupProcSignalState(int status, Datum arg)
  *
  * On success (a signal was sent), zero is returned.
  * On error, -1 is returned, and errno is set (typically to ESRCH or EPERM).
- *
- * Not to be confused with ProcSendSignal
  */
 int
 SendProcSignal(pid_t pid, ProcSignalReason reason, BackendId backendId)
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index e8120174d6..c62adea6b3 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -3339,6 +3339,7 @@ ReleasePredicateLocks(bool isCommit, bool isReadOnlySafe)
 				nextConflict,
 				possibleUnsafeConflict;
 	SERIALIZABLEXACT *roXact;
+	LatchGroup	wakeups = {0};
 
 	/*
 	 * We can't trust XactReadOnly here, because a transaction which started
@@ -3672,7 +3673,7 @@ ReleasePredicateLocks(bool isCommit, bool isReadOnlySafe)
 			 */
 			if (SxactIsDeferrableWaiting(roXact) &&
 				(SxactIsROUnsafe(roXact) || SxactIsROSafe(roXact)))
-				ProcSendSignal(roXact->pgprocno);
+				AddLatch(&wakeups, GetProcLatchByNumber(roXact->pgprocno));
 
 			possibleUnsafeConflict = nextConflict;
 		}
@@ -3698,6 +3699,8 @@ ReleasePredicateLocks(bool isCommit, bool isReadOnlySafe)
 
 	LWLockRelease(SerializableXactHashLock);
 
+	SetLatches(&wakeups);
+
 	LWLockAcquire(SerializableFinishedListLock, LW_EXCLUSIVE);
 
 	/* Add this to the list of transactions to check for later cleanup. */
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 281260dab3..48f962e6bd 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -1890,18 +1890,6 @@ ProcWaitForSignal(uint32 wait_event_info)
 	CHECK_FOR_INTERRUPTS();
 }
 
-/*
- * ProcSendSignal - set the latch of a backend identified by pgprocno
- */
-void
-ProcSendSignal(int pgprocno)
-{
-	if (pgprocno < 0 || pgprocno >= ProcGlobal->allProcCount)
-		elog(ERROR, "pgprocno out of range");
-
-	SetLatch(&ProcGlobal->allProcs[pgprocno].procLatch);
-}
-
 /*
  * BecomeLockGroupLeader - designate process as lock group leader
  *
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 0ae47325e4..ab17cb66b8 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -413,6 +413,9 @@ extern PGDLLIMPORT PGPROC *PreparedXactProcs;
 /* Accessor for PGPROC given a pgprocno. */
 #define GetPGProcByNumber(n) (&ProcGlobal->allProcs[(n)])
 
+/* Accessor for procLatch given a pgprocno. */
+#define GetProcLatchByNumber(n) (&(GetPGProcByNumber(n))->procLatch)
+
 /*
  * We set aside some extra PGPROC structures for auxiliary processes,
  * ie things that aren't full-fledged backends but need shmem access.
@@ -456,7 +459,6 @@ extern bool IsWaitingForLock(void);
 extern void LockErrorCleanup(void);
 
 extern void ProcWaitForSignal(uint32 wait_event_info);
-extern void ProcSendSignal(int pgprocno);
 
 extern PGPROC *AuxiliaryPidGetProc(int pid);
 
-- 
2.37.0 (Apple Git-136)

v2-0007-Use-SetLatches-for-synchronous-replication-wakeup.patchapplication/octet-stream; name=v2-0007-Use-SetLatches-for-synchronous-replication-wakeup.patchDownload
From e195322f6ba9f2bfbd830b6d13f0a5d8e6f0d67c Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Thu, 27 Oct 2022 12:59:45 +1300
Subject: [PATCH v2 7/9] Use SetLatches() for synchronous replication wakeups.

Don't issue SetLatch() system calls while holding SyncRepLock.

Discussion: https://postgr.es/m/CA%2BhUKGKmO7ze0Z6WXKdrLxmvYa%3DzVGGXOO30MMktufofVwEm1A%40mail.gmail.com
---
 src/backend/replication/syncrep.c | 21 ++++++++++++++-------
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index 1a022b11a0..e2410ff494 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -100,7 +100,7 @@ static int	SyncRepWaitMode = SYNC_REP_NO_WAIT;
 
 static void SyncRepQueueInsert(int mode);
 static void SyncRepCancelWait(void);
-static int	SyncRepWakeQueue(bool all, int mode);
+static int	SyncRepWakeQueue(bool all, int mode, LatchGroup *wakeups);
 
 static bool SyncRepGetSyncRecPtr(XLogRecPtr *writePtr,
 								 XLogRecPtr *flushPtr,
@@ -449,6 +449,7 @@ SyncRepReleaseWaiters(void)
 	int			numwrite = 0;
 	int			numflush = 0;
 	int			numapply = 0;
+	LatchGroup	wakeups = {0};
 
 	/*
 	 * If this WALSender is serving a standby that is not on the list of
@@ -518,21 +519,23 @@ SyncRepReleaseWaiters(void)
 	if (walsndctl->lsn[SYNC_REP_WAIT_WRITE] < writePtr)
 	{
 		walsndctl->lsn[SYNC_REP_WAIT_WRITE] = writePtr;
-		numwrite = SyncRepWakeQueue(false, SYNC_REP_WAIT_WRITE);
+		numwrite = SyncRepWakeQueue(false, SYNC_REP_WAIT_WRITE, &wakeups);
 	}
 	if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < flushPtr)
 	{
 		walsndctl->lsn[SYNC_REP_WAIT_FLUSH] = flushPtr;
-		numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH);
+		numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH, &wakeups);
 	}
 	if (walsndctl->lsn[SYNC_REP_WAIT_APPLY] < applyPtr)
 	{
 		walsndctl->lsn[SYNC_REP_WAIT_APPLY] = applyPtr;
-		numapply = SyncRepWakeQueue(false, SYNC_REP_WAIT_APPLY);
+		numapply = SyncRepWakeQueue(false, SYNC_REP_WAIT_APPLY, &wakeups);
 	}
 
 	LWLockRelease(SyncRepLock);
 
+	SetLatches(&wakeups);
+
 	elog(DEBUG3, "released %d procs up to write %X/%X, %d procs up to flush %X/%X, %d procs up to apply %X/%X",
 		 numwrite, LSN_FORMAT_ARGS(writePtr),
 		 numflush, LSN_FORMAT_ARGS(flushPtr),
@@ -876,7 +879,7 @@ SyncRepGetStandbyPriority(void)
  * The caller must hold SyncRepLock in exclusive mode.
  */
 static int
-SyncRepWakeQueue(bool all, int mode)
+SyncRepWakeQueue(bool all, int mode, LatchGroup *wakeups)
 {
 	volatile WalSndCtlData *walsndctl = WalSndCtl;
 	PGPROC	   *proc = NULL;
@@ -929,7 +932,7 @@ SyncRepWakeQueue(bool all, int mode)
 		/*
 		 * Wake only when we have set state and removed from queue.
 		 */
-		SetLatch(&(thisproc->procLatch));
+		AddLatch(wakeups, &thisproc->procLatch);
 
 		numprocs++;
 	}
@@ -951,6 +954,8 @@ SyncRepUpdateSyncStandbysDefined(void)
 
 	if (sync_standbys_defined != WalSndCtl->sync_standbys_defined)
 	{
+		LatchGroup		wakeups = {0};
+
 		LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
 
 		/*
@@ -963,7 +968,7 @@ SyncRepUpdateSyncStandbysDefined(void)
 			int			i;
 
 			for (i = 0; i < NUM_SYNC_REP_WAIT_MODE; i++)
-				SyncRepWakeQueue(true, i);
+				SyncRepWakeQueue(true, i, &wakeups);
 		}
 
 		/*
@@ -976,6 +981,8 @@ SyncRepUpdateSyncStandbysDefined(void)
 		WalSndCtl->sync_standbys_defined = sync_standbys_defined;
 
 		LWLockRelease(SyncRepLock);
+
+		SetLatches(&wakeups);
 	}
 }
 
-- 
2.37.0 (Apple Git-136)

#3vignesh C
vignesh21@gmail.com
In reply to: Thomas Munro (#2)
Re: Latches vs lwlock contention

On Tue, 1 Nov 2022 at 16:40, Thomas Munro <thomas.munro@gmail.com> wrote:

On Fri, Oct 28, 2022 at 4:56 PM Thomas Munro <thomas.munro@gmail.com> wrote:

See attached sketch patches. I guess the main thing that may not be
good enough is the use of a fixed sized latch buffer. Memory
allocation in don't-throw-here environments like the guts of lock code
might be an issue, which is why it just gives up and flushes when
full; maybe it should try to allocate and fall back to flushing only
if that fails.

Here's an attempt at that. There aren't actually any uses of this
stuff in critical sections here, so perhaps I shouldn't bother
with that part. The part I'd most like some feedback on is the
heavyweight lock bits. I'll add this to the commitfest.

The patch does not apply on top of HEAD as in [1], please post a rebased patch:

=== Applying patches on top of PostgreSQL commit ID
456fa635a909ee36f73ca84d340521bd730f265f ===
=== applying patch ./v2-0003-Use-SetLatches-for-condition-variables.patch
patching file src/backend/storage/lmgr/condition_variable.c
patching file src/backend/storage/lmgr/lwlock.c
Hunk #1 FAILED at 183.
1 out of 1 hunk FAILED -- saving rejects to file
src/backend/storage/lmgr/lwlock.c.rej
patching file src/include/storage/condition_variable.h
patching file src/include/storage/lwlock.h
Hunk #1 FAILED at 193.
1 out of 1 hunk FAILED -- saving rejects to file
src/include/storage/lwlock.h.rej

[1]: http://cfbot.cputube.org/patch_41_3998.log

Regards,
Vignesh

#4Thomas Munro
thomas.munro@gmail.com
In reply to: vignesh C (#3)
6 attachment(s)
Re: Latches vs lwlock contention

On Sat, Jan 28, 2023 at 3:39 AM vignesh C <vignesh21@gmail.com> wrote:

On Tue, 1 Nov 2022 at 16:40, Thomas Munro <thomas.munro@gmail.com> wrote:

Here's an attempt at that. There aren't actually any uses of this
stuff in critical sections here, so perhaps I shouldn't bother
with that part. The part I'd most like some feedback on is the
heavyweight lock bits. I'll add this to the commitfest.

The patch does not apply on top of HEAD as in [1], please post a rebased patch:

Rebased. I dropped the CV patch for now.

Attachments:

v3-0001-Allow-palloc_extended-NO_OOM-in-critical-sections.patchtext/x-patch; charset=US-ASCII; name=v3-0001-Allow-palloc_extended-NO_OOM-in-critical-sections.patchDownload
From d678e740c61ed5408ad254d2ca5f8e2fff7d2cd1 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 1 Nov 2022 23:29:38 +1300
Subject: [PATCH v3 1/6] Allow palloc_extended(NO_OOM) in critical sections.

Commit 4a170ee9e0e banned palloc() and similar in critical sections, because an
allocation failure would produce a panic.  Make an exception for allocation
with NULL on failure, for code that has a backup plan.

Discussion: https://postgr.es/m/CA%2BhUKGJHudZMG_vh6GiPB61pE%2BGgiBk5jxzd7inijqx5nEZLCw%40mail.gmail.com
---
 src/backend/utils/mmgr/mcxt.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/src/backend/utils/mmgr/mcxt.c b/src/backend/utils/mmgr/mcxt.c
index 0b00802df7..eb18b89368 100644
--- a/src/backend/utils/mmgr/mcxt.c
+++ b/src/backend/utils/mmgr/mcxt.c
@@ -1123,7 +1123,8 @@ MemoryContextAllocExtended(MemoryContext context, Size size, int flags)
 	void	   *ret;
 
 	Assert(MemoryContextIsValid(context));
-	AssertNotInCriticalSection(context);
+	if ((flags & MCXT_ALLOC_NO_OOM) == 0)
+		AssertNotInCriticalSection(context);
 
 	if (!((flags & MCXT_ALLOC_HUGE) != 0 ? AllocHugeSizeIsValid(size) :
 		  AllocSizeIsValid(size)))
@@ -1278,7 +1279,8 @@ palloc_extended(Size size, int flags)
 	MemoryContext context = CurrentMemoryContext;
 
 	Assert(MemoryContextIsValid(context));
-	AssertNotInCriticalSection(context);
+	if ((flags & MCXT_ALLOC_NO_OOM) == 0)
+		AssertNotInCriticalSection(context);
 
 	if (!((flags & MCXT_ALLOC_HUGE) != 0 ? AllocHugeSizeIsValid(size) :
 		  AllocSizeIsValid(size)))
@@ -1509,7 +1511,8 @@ repalloc_extended(void *pointer, Size size, int flags)
 		  AllocSizeIsValid(size)))
 		elog(ERROR, "invalid memory alloc request size %zu", size);
 
-	AssertNotInCriticalSection(context);
+	if ((flags & MCXT_ALLOC_NO_OOM) == 0)
+		AssertNotInCriticalSection(context);
 
 	/* isReset must be false already */
 	Assert(!context->isReset);
-- 
2.39.2
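
To illustrate what this permits (a sketch only, with invented names):
code running between START_CRIT_SECTION() and END_CRIT_SECTION() may now
attempt an allocation, provided it checks for NULL and has a fallback.

	char	   *copy = palloc_extended(len, MCXT_ALLOC_NO_OOM);

	if (copy == NULL)
		use_preallocated_buffer();	/* hypothetical fallback, instead of PANIC */
	else
		memcpy(copy, src, len);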

v3-0002-Provide-SetLatches-for-batched-deferred-latches.patchtext/x-patch; charset=US-ASCII; name=v3-0002-Provide-SetLatches-for-batched-deferred-latches.patchDownload
From a3917505d7bd7cec7e98010d05bc016ddcd0dbac Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 26 Oct 2022 15:51:45 +1300
Subject: [PATCH v3 2/6] Provide SetLatches() for batched deferred latches.

If we have a way to buffer a set of wakeup targets and process them at a
later time, we can:

* move SetLatch() system calls out from under LWLocks, so that locks can
  be released faster; this is especially interesting in cases where the
  target backends will immediately try to acquire the same lock, or
  generally when the lock is heavily contended

* possibly gain some micro-optimization from issuing only two memory
  barriers for the whole batch of latches, not two for each latch to be
  set

* prepare for future IPC mechanisms which might allow us to wake
  multiple backends in a single system call

Individual users of this facility will follow in separate patches.

Discussion: https://postgr.es/m/CA%2BhUKGKmO7ze0Z6WXKdrLxmvYa%3DzVGGXOO30MMktufofVwEm1A%40mail.gmail.com
---
 src/backend/storage/ipc/latch.c  | 214 ++++++++++++++++++++-----------
 src/include/storage/latch.h      |  13 ++
 src/tools/pgindent/typedefs.list |   1 +
 3 files changed, 154 insertions(+), 74 deletions(-)

diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index f4123e7de7..96dd47575a 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -591,6 +591,85 @@ WaitLatchOrSocket(Latch *latch, int wakeEvents, pgsocket sock,
 	return ret;
 }
 
+/*
+ * Set multiple latches at the same time.
+ * Note: modifies input array.
+ */
+static void
+SetLatchV(Latch **latches, int nlatches)
+{
+	/* Flush any other changes out to main memory just once. */
+	pg_memory_barrier();
+
+	/* Keep only latches that are not already set, and set them. */
+	for (int i = 0; i < nlatches; ++i)
+	{
+		Latch	   *latch = latches[i];
+
+		if (!latch->is_set)
+			latch->is_set = true;
+		else
+			latches[i] = NULL;
+	}
+
+	pg_memory_barrier();
+
+	/* Wake the remaining latches that might be sleeping. */
+	for (int i = 0; i < nlatches; ++i)
+	{
+		Latch	   *latch = latches[i];
+
+		/*
+		 * See if anyone's waiting for the latch. It can be the current
+		 * process if we're in a signal handler. We use the self-pipe or
+		 * SIGURG to ourselves to wake up WaitEventSetWaitBlock() without
+		 * races in that case. If it's another process, send a signal.
+		 *
+		 * Fetch owner_pid only once, in case the latch is concurrently
+		 * getting owned or disowned. XXX: This assumes that pid_t is atomic,
+		 * which isn't guaranteed to be true! In practice, the effective range
+		 * of pid_t fits in a 32 bit integer, and so should be atomic. In the
+		 * worst case, we might end up signaling the wrong process. Even then,
+		 * you're very unlucky if a process with that bogus pid exists and
+		 * belongs to Postgres; and PG database processes should handle excess
+		 * SIGURG interrupts without a problem anyhow.
+		 *
+		 * Another sort of race condition that's possible here is for a new
+		 * process to own the latch immediately after we look, so we don't
+		 * signal it. This is okay so long as all callers of
+		 * ResetLatch/WaitLatch follow the standard coding convention of
+		 * waiting at the bottom of their loops, not the top, so that they'll
+		 * correctly process latch-setting events that happen before they
+		 * enter the loop.
+		 */
+		if (latch && latch->maybe_sleeping)
+		{
+#ifndef WIN32
+			pid_t		owner_pid = latch->owner_pid;
+
+			if (owner_pid == MyProcPid)
+			{
+				if (waiting)
+				{
+#if defined(WAIT_USE_SELF_PIPE)
+					sendSelfPipeByte();
+#else
+					kill(MyProcPid, SIGURG);
+#endif
+				}
+			}
+			else
+				kill(owner_pid, SIGURG);
+#else
+			HANDLE		event = latch->event;
+
+			if (event)
+				SetEvent(event);
+#endif
+		}
+	}
+}
+
 /*
  * Sets a latch and wakes up anyone waiting on it.
  *
@@ -606,89 +685,76 @@ WaitLatchOrSocket(Latch *latch, int wakeEvents, pgsocket sock,
 void
 SetLatch(Latch *latch)
 {
-#ifndef WIN32
-	pid_t		owner_pid;
-#else
-	HANDLE		handle;
-#endif
-
-	/*
-	 * The memory barrier has to be placed here to ensure that any flag
-	 * variables possibly changed by this process have been flushed to main
-	 * memory, before we check/set is_set.
-	 */
-	pg_memory_barrier();
-
-	/* Quick exit if already set */
-	if (latch->is_set)
-		return;
-
-	latch->is_set = true;
-
-	pg_memory_barrier();
-	if (!latch->maybe_sleeping)
-		return;
+	SetLatchV(&latch, 1);
+}
 
-#ifndef WIN32
+/*
+ * Add a latch to a group, to be set later.
+ */
+void
+AddLatch(LatchGroup *group, Latch *latch)
+{
+	/* First time.  Set up the in-place buffer. */
+	if (!group->latches)
+	{
+		group->latches = group->in_place_buffer;
+		group->capacity = lengthof(group->in_place_buffer);
+		Assert(group->size == 0);
+	}
 
-	/*
-	 * See if anyone's waiting for the latch. It can be the current process if
-	 * we're in a signal handler. We use the self-pipe or SIGURG to ourselves
-	 * to wake up WaitEventSetWaitBlock() without races in that case. If it's
-	 * another process, send a signal.
-	 *
-	 * Fetch owner_pid only once, in case the latch is concurrently getting
-	 * owned or disowned. XXX: This assumes that pid_t is atomic, which isn't
-	 * guaranteed to be true! In practice, the effective range of pid_t fits
-	 * in a 32 bit integer, and so should be atomic. In the worst case, we
-	 * might end up signaling the wrong process. Even then, you're very
-	 * unlucky if a process with that bogus pid exists and belongs to
-	 * Postgres; and PG database processes should handle excess SIGUSR1
-	 * interrupts without a problem anyhow.
-	 *
-	 * Another sort of race condition that's possible here is for a new
-	 * process to own the latch immediately after we look, so we don't signal
-	 * it. This is okay so long as all callers of ResetLatch/WaitLatch follow
-	 * the standard coding convention of waiting at the bottom of their loops,
-	 * not the top, so that they'll correctly process latch-setting events
-	 * that happen before they enter the loop.
-	 */
-	owner_pid = latch->owner_pid;
-	if (owner_pid == 0)
-		return;
-	else if (owner_pid == MyProcPid)
+	/* Are we full? */
+	if (group->size == group->capacity)
 	{
-#if defined(WAIT_USE_SELF_PIPE)
-		if (waiting)
-			sendSelfPipeByte();
-#else
-		if (waiting)
-			kill(MyProcPid, SIGURG);
-#endif
+		Latch	  **new_latches;
+		int			new_capacity;
+
+		/* Try to allocate more space. */
+		new_capacity = group->capacity * 2;
+		new_latches = palloc_extended(sizeof(latch) * new_capacity,
+									  MCXT_ALLOC_NO_OOM);
+		if (!new_latches)
+		{
+			/*
+			 * Allocation failed.  This is very unlikely to happen, but it
+			 * might be useful to be able to use this function in critical
+			 * sections, so we handle allocation failure by flushing instead
+			 * of throwing.
+			 */
+			SetLatches(group);
+		}
+		else
+		{
+			memcpy(new_latches, group->latches, sizeof(latch) * group->size);
+			if (group->latches != group->in_place_buffer)
+				pfree(group->latches);
+			group->latches = new_latches;
+			group->capacity = new_capacity;
+		}
 	}
-	else
-		kill(owner_pid, SIGURG);
 
-#else
+	Assert(group->size < group->capacity);
+	group->latches[group->size++] = latch;
+}
 
-	/*
-	 * See if anyone's waiting for the latch. It can be the current process if
-	 * we're in a signal handler.
-	 *
-	 * Use a local variable here just in case somebody changes the event field
-	 * concurrently (which really should not happen).
-	 */
-	handle = latch->event;
-	if (handle)
+/*
+ * Set all the latches accumulated in 'group'.
+ */
+void
+SetLatches(LatchGroup *group)
+{
+	if (group->size > 0)
 	{
-		SetEvent(handle);
+		SetLatchV(group->latches, group->size);
+		group->size = 0;
 
-		/*
-		 * Note that we silently ignore any errors. We might be in a signal
-		 * handler or other critical path where it's not safe to call elog().
-		 */
+		/* If we allocated a large buffer, it's time to free it. */
+		if (group->latches != group->in_place_buffer)
+		{
+			pfree(group->latches);
+			group->latches = group->in_place_buffer;
+			group->capacity = lengthof(group->in_place_buffer);
+		}
 	}
-#endif
 }
 
 /*
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index 99cc47874a..7b81806825 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -118,6 +118,17 @@ typedef struct Latch
 #endif
 } Latch;
 
+/*
+ * A buffer for setting multiple latches at a time.
+ */
+typedef struct LatchGroup
+{
+	int			size;
+	int			capacity;
+	Latch	  **latches;
+	Latch	   *in_place_buffer[64];
+} LatchGroup;
+
 /*
  * Bitmasks for events that may wake-up WaitLatch(), WaitLatchOrSocket(), or
  * WaitEventSetWait().
@@ -170,6 +181,8 @@ extern void InitSharedLatch(Latch *latch);
 extern void OwnLatch(Latch *latch);
 extern void DisownLatch(Latch *latch);
 extern void SetLatch(Latch *latch);
+extern void AddLatch(LatchGroup *group, Latch *latch);
+extern void SetLatches(LatchGroup *group);
 extern void ResetLatch(Latch *latch);
 extern void ShutdownLatchSupport(void);
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 86a9303bf5..a5bf7d699d 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1399,6 +1399,7 @@ LagTracker
 LargeObjectDesc
 LastAttnumInfo
 Latch
+LatchGroup
 LerpFunc
 LexDescr
 LexemeEntry
-- 
2.39.2

v3-0003-Use-SetLatches-for-synchronous-replication-wakeup.patchtext/x-patch; charset=US-ASCII; name=v3-0003-Use-SetLatches-for-synchronous-replication-wakeup.patchDownload
From 631100e021206afe23a6801737fe57fae3a356e0 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Thu, 27 Oct 2022 12:59:45 +1300
Subject: [PATCH v3 3/6] Use SetLatches() for synchronous replication wakeups.

Don't issue SetLatch() system calls while holding SyncRepLock.

Discussion: https://postgr.es/m/CA%2BhUKGKmO7ze0Z6WXKdrLxmvYa%3DzVGGXOO30MMktufofVwEm1A%40mail.gmail.com
---
 src/backend/replication/syncrep.c | 21 ++++++++++++++-------
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index 889e20b5dd..446a9e1239 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -100,7 +100,7 @@ static int	SyncRepWaitMode = SYNC_REP_NO_WAIT;
 
 static void SyncRepQueueInsert(int mode);
 static void SyncRepCancelWait(void);
-static int	SyncRepWakeQueue(bool all, int mode);
+static int	SyncRepWakeQueue(bool all, int mode, LatchGroup *wakeups);
 
 static bool SyncRepGetSyncRecPtr(XLogRecPtr *writePtr,
 								 XLogRecPtr *flushPtr,
@@ -440,6 +440,7 @@ SyncRepReleaseWaiters(void)
 	int			numwrite = 0;
 	int			numflush = 0;
 	int			numapply = 0;
+	LatchGroup	wakeups = {0};
 
 	/*
 	 * If this WALSender is serving a standby that is not on the list of
@@ -509,21 +510,23 @@ SyncRepReleaseWaiters(void)
 	if (walsndctl->lsn[SYNC_REP_WAIT_WRITE] < writePtr)
 	{
 		walsndctl->lsn[SYNC_REP_WAIT_WRITE] = writePtr;
-		numwrite = SyncRepWakeQueue(false, SYNC_REP_WAIT_WRITE);
+		numwrite = SyncRepWakeQueue(false, SYNC_REP_WAIT_WRITE, &wakeups);
 	}
 	if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < flushPtr)
 	{
 		walsndctl->lsn[SYNC_REP_WAIT_FLUSH] = flushPtr;
-		numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH);
+		numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH, &wakeups);
 	}
 	if (walsndctl->lsn[SYNC_REP_WAIT_APPLY] < applyPtr)
 	{
 		walsndctl->lsn[SYNC_REP_WAIT_APPLY] = applyPtr;
-		numapply = SyncRepWakeQueue(false, SYNC_REP_WAIT_APPLY);
+		numapply = SyncRepWakeQueue(false, SYNC_REP_WAIT_APPLY, &wakeups);
 	}
 
 	LWLockRelease(SyncRepLock);
 
+	SetLatches(&wakeups);
+
 	elog(DEBUG3, "released %d procs up to write %X/%X, %d procs up to flush %X/%X, %d procs up to apply %X/%X",
 		 numwrite, LSN_FORMAT_ARGS(writePtr),
 		 numflush, LSN_FORMAT_ARGS(flushPtr),
@@ -867,7 +870,7 @@ SyncRepGetStandbyPriority(void)
  * The caller must hold SyncRepLock in exclusive mode.
  */
 static int
-SyncRepWakeQueue(bool all, int mode)
+SyncRepWakeQueue(bool all, int mode, LatchGroup *wakeups)
 {
 	volatile WalSndCtlData *walsndctl = WalSndCtl;
 	int			numprocs = 0;
@@ -908,7 +911,7 @@ SyncRepWakeQueue(bool all, int mode)
 		/*
 		 * Wake only when we have set state and removed from queue.
 		 */
-		SetLatch(&(proc->procLatch));
+		AddLatch(wakeups, &proc->procLatch);
 
 		numprocs++;
 	}
@@ -930,6 +933,8 @@ SyncRepUpdateSyncStandbysDefined(void)
 
 	if (sync_standbys_defined != WalSndCtl->sync_standbys_defined)
 	{
+		LatchGroup		wakeups = {0};
+
 		LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
 
 		/*
@@ -942,7 +947,7 @@ SyncRepUpdateSyncStandbysDefined(void)
 			int			i;
 
 			for (i = 0; i < NUM_SYNC_REP_WAIT_MODE; i++)
-				SyncRepWakeQueue(true, i);
+				SyncRepWakeQueue(true, i, &wakeups);
 		}
 
 		/*
@@ -955,6 +960,8 @@ SyncRepUpdateSyncStandbysDefined(void)
 		WalSndCtl->sync_standbys_defined = sync_standbys_defined;
 
 		LWLockRelease(SyncRepLock);
+
+		SetLatches(&wakeups);
 	}
 }
 
-- 
2.39.2

v3-0004-Use-SetLatches-for-SERIALIZABLE-DEFERRABLE-wakeup.patch (text/x-patch)
From 0c31264ae68785095d7678e5ee60f0378dae5b93 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Thu, 27 Oct 2022 12:40:12 +1300
Subject: [PATCH v3 4/6] Use SetLatches() for SERIALIZABLE DEFERRABLE wakeups.

Don't issue SetLatch()'s system call while holding a highly contended
LWLock.  Collect them in a LatchGroup to be set after the lock is
released.  Once woken, other backends will immediately try to acquire
that lock anyway so it's better to wake after releasing.

Take this opportunity to retire the confusingly named ProcSendSignal()
and replace it with something that gives the latch pointer we need.
There was only one other caller, in bufmgr.c, which is easily changed.

Discussion: https://postgr.es/m/CA%2BhUKGKmO7ze0Z6WXKdrLxmvYa%3DzVGGXOO30MMktufofVwEm1A%40mail.gmail.com
---
 src/backend/storage/buffer/bufmgr.c  |  2 +-
 src/backend/storage/ipc/procsignal.c |  2 --
 src/backend/storage/lmgr/predicate.c |  5 ++++-
 src/backend/storage/lmgr/proc.c      | 12 ------------
 src/include/storage/proc.h           |  4 +++-
 5 files changed, 8 insertions(+), 17 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 0a05577b68..f297958f1b 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1966,7 +1966,7 @@ UnpinBuffer(BufferDesc *buf)
 
 				buf_state &= ~BM_PIN_COUNT_WAITER;
 				UnlockBufHdr(buf, buf_state);
-				ProcSendSignal(wait_backend_pgprocno);
+				SetLatch(GetProcLatchByNumber(wait_backend_pgprocno));
 			}
 			else
 				UnlockBufHdr(buf, buf_state);
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 395b2cf690..13ec2317fc 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -255,8 +255,6 @@ CleanupProcSignalState(int status, Datum arg)
  *
  * On success (a signal was sent), zero is returned.
  * On error, -1 is returned, and errno is set (typically to ESRCH or EPERM).
- *
- * Not to be confused with ProcSendSignal
  */
 int
 SendProcSignal(pid_t pid, ProcSignalReason reason, BackendId backendId)
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 0f7f7ba5f1..86b046999b 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -3235,6 +3235,7 @@ ReleasePredicateLocks(bool isCommit, bool isReadOnlySafe)
 	bool		needToClear;
 	SERIALIZABLEXACT *roXact;
 	dlist_mutable_iter iter;
+	LatchGroup	wakeups = {0};
 
 	/*
 	 * We can't trust XactReadOnly here, because a transaction which started
@@ -3538,7 +3539,7 @@ ReleasePredicateLocks(bool isCommit, bool isReadOnlySafe)
 			 */
 			if (SxactIsDeferrableWaiting(roXact) &&
 				(SxactIsROUnsafe(roXact) || SxactIsROSafe(roXact)))
-				ProcSendSignal(roXact->pgprocno);
+				AddLatch(&wakeups, GetProcLatchByNumber(roXact->pgprocno));
 		}
 	}
 
@@ -3562,6 +3563,8 @@ ReleasePredicateLocks(bool isCommit, bool isReadOnlySafe)
 
 	LWLockRelease(SerializableXactHashLock);
 
+	SetLatches(&wakeups);
+
 	LWLockAcquire(SerializableFinishedListLock, LW_EXCLUSIVE);
 
 	/* Add this to the list of transactions to check for later cleanup. */
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 22b4278610..09eda65fd1 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -1802,18 +1802,6 @@ ProcWaitForSignal(uint32 wait_event_info)
 	CHECK_FOR_INTERRUPTS();
 }
 
-/*
- * ProcSendSignal - set the latch of a backend identified by pgprocno
- */
-void
-ProcSendSignal(int pgprocno)
-{
-	if (pgprocno < 0 || pgprocno >= ProcGlobal->allProcCount)
-		elog(ERROR, "pgprocno out of range");
-
-	SetLatch(&ProcGlobal->allProcs[pgprocno].procLatch);
-}
-
 /*
  * BecomeLockGroupLeader - designate process as lock group leader
  *
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 4258cd92c9..ca39ae9486 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -413,6 +413,9 @@ extern PGDLLIMPORT PGPROC *PreparedXactProcs;
 /* Accessor for PGPROC given a pgprocno. */
 #define GetPGProcByNumber(n) (&ProcGlobal->allProcs[(n)])
 
+/* Accessor for procLatch given a pgprocno. */
+#define GetProcLatchByNumber(n) (&(GetPGProcByNumber(n))->procLatch)
+
 /*
  * We set aside some extra PGPROC structures for auxiliary processes,
  * ie things that aren't full-fledged backends but need shmem access.
@@ -456,7 +459,6 @@ extern bool IsWaitingForLock(void);
 extern void LockErrorCleanup(void);
 
 extern void ProcWaitForSignal(uint32 wait_event_info);
-extern void ProcSendSignal(int pgprocno);
 
 extern PGPROC *AuxiliaryPidGetProc(int pid);
 
-- 
2.39.2

v3-0005-Use-SetLatches-for-heavyweight-locks.patch (text/x-patch)
From 4fea49d303dfc55edef7dc61350a6ae4f160c6a7 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 26 Oct 2022 16:43:31 +1300
Subject: [PATCH v3 5/6] Use SetLatches() for heavyweight locks.

Collect wakeups into a LatchGroup to be set at once after the
LockManager's internal partition lock is released.  This avoids holding
busy locks while issuing system calls.

Currently, waiters immediately try to acquire that lock themselves, so
deferring until it's released makes sense to avoid contention (a later
patch may remove that lock acquisition though).

Discussion: https://postgr.es/m/CA%2BhUKGKmO7ze0Z6WXKdrLxmvYa%3DzVGGXOO30MMktufofVwEm1A%40mail.gmail.com
---
 src/backend/storage/lmgr/deadlock.c |  4 ++--
 src/backend/storage/lmgr/lock.c     | 36 +++++++++++++++++++----------
 src/backend/storage/lmgr/proc.c     | 29 ++++++++++++++---------
 src/include/storage/lock.h          |  5 ++--
 src/include/storage/proc.h          |  6 ++---
 5 files changed, 50 insertions(+), 30 deletions(-)

diff --git a/src/backend/storage/lmgr/deadlock.c b/src/backend/storage/lmgr/deadlock.c
index 617ebacdd4..63192647c8 100644
--- a/src/backend/storage/lmgr/deadlock.c
+++ b/src/backend/storage/lmgr/deadlock.c
@@ -214,7 +214,7 @@ InitDeadLockChecking(void)
  * and (b) we are typically invoked inside a signal handler.
  */
 DeadLockState
-DeadLockCheck(PGPROC *proc)
+DeadLockCheck(PGPROC *proc, LatchGroup *wakeups)
 {
 	/* Initialize to "no constraints" */
 	nCurConstraints = 0;
@@ -266,7 +266,7 @@ DeadLockCheck(PGPROC *proc)
 #endif
 
 		/* See if any waiters for the lock can be woken up now */
-		ProcLockWakeup(GetLocksMethodTable(lock), lock);
+		ProcLockWakeup(GetLocksMethodTable(lock), lock, wakeups);
 	}
 
 	/* Return code tells caller if we had to escape a deadlock or not */
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 42595b38b2..fec6e52145 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -374,14 +374,14 @@ static PROCLOCK *SetupLockInTable(LockMethod lockMethodTable, PGPROC *proc,
 static void GrantLockLocal(LOCALLOCK *locallock, ResourceOwner owner);
 static void BeginStrongLockAcquire(LOCALLOCK *locallock, uint32 fasthashcode);
 static void FinishStrongLockAcquire(void);
-static void WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner);
+static void WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner, LatchGroup *wakeups);
 static void ReleaseLockIfHeld(LOCALLOCK *locallock, bool sessionLock);
 static void LockReassignOwner(LOCALLOCK *locallock, ResourceOwner parent);
 static bool UnGrantLock(LOCK *lock, LOCKMODE lockmode,
 						PROCLOCK *proclock, LockMethod lockMethodTable);
 static void CleanUpLock(LOCK *lock, PROCLOCK *proclock,
 						LockMethod lockMethodTable, uint32 hashcode,
-						bool wakeupNeeded);
+						bool wakeupNeeded, LatchGroup *wakeups);
 static void LockRefindAndRelease(LockMethod lockMethodTable, PGPROC *proc,
 								 LOCKTAG *locktag, LOCKMODE lockmode,
 								 bool decrement_strong_lock_count);
@@ -787,6 +787,7 @@ LockAcquireExtended(const LOCKTAG *locktag,
 	LWLock	   *partitionLock;
 	bool		found_conflict;
 	bool		log_lock = false;
+	LatchGroup	wakeups = {0};
 
 	if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
 		elog(ERROR, "unrecognized lock method: %d", lockmethodid);
@@ -1098,7 +1099,7 @@ LockAcquireExtended(const LOCKTAG *locktag,
 										 locktag->locktag_type,
 										 lockmode);
 
-		WaitOnLock(locallock, owner);
+		WaitOnLock(locallock, owner, &wakeups);
 
 		TRACE_POSTGRESQL_LOCK_WAIT_DONE(locktag->locktag_field1,
 										locktag->locktag_field2,
@@ -1138,6 +1139,8 @@ LockAcquireExtended(const LOCKTAG *locktag,
 
 	LWLockRelease(partitionLock);
 
+	SetLatches(&wakeups);
+
 	/*
 	 * Emit a WAL record if acquisition of this lock needs to be replayed in a
 	 * standby server.
@@ -1629,7 +1632,7 @@ UnGrantLock(LOCK *lock, LOCKMODE lockmode,
 static void
 CleanUpLock(LOCK *lock, PROCLOCK *proclock,
 			LockMethod lockMethodTable, uint32 hashcode,
-			bool wakeupNeeded)
+			bool wakeupNeeded, LatchGroup *wakeups)
 {
 	/*
 	 * If this was my last hold on this lock, delete my entry in the proclock
@@ -1669,7 +1672,7 @@ CleanUpLock(LOCK *lock, PROCLOCK *proclock,
 	else if (wakeupNeeded)
 	{
 		/* There are waiters on this lock, so wake them up. */
-		ProcLockWakeup(lockMethodTable, lock);
+		ProcLockWakeup(lockMethodTable, lock, wakeups);
 	}
 }
 
@@ -1806,7 +1809,7 @@ MarkLockClear(LOCALLOCK *locallock)
  * The appropriate partition lock must be held at entry.
  */
 static void
-WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner)
+WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner, LatchGroup *wakeups)
 {
 	LOCKMETHODID lockmethodid = LOCALLOCK_LOCKMETHOD(*locallock);
 	LockMethod	lockMethodTable = LockMethods[lockmethodid];
@@ -1839,7 +1842,7 @@ WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner)
 	 */
 	PG_TRY();
 	{
-		if (ProcSleep(locallock, lockMethodTable) != PROC_WAIT_STATUS_OK)
+		if (ProcSleep(locallock, lockMethodTable, wakeups) != PROC_WAIT_STATUS_OK)
 		{
 			/*
 			 * We failed as a result of a deadlock, see CheckDeadLock(). Quit
@@ -1890,7 +1893,7 @@ WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner)
  * NB: this does not clean up any locallock object that may exist for the lock.
  */
 void
-RemoveFromWaitQueue(PGPROC *proc, uint32 hashcode)
+RemoveFromWaitQueue(PGPROC *proc, uint32 hashcode, LatchGroup *wakeups)
 {
 	LOCK	   *waitLock = proc->waitLock;
 	PROCLOCK   *proclock = proc->waitProcLock;
@@ -1931,7 +1934,7 @@ RemoveFromWaitQueue(PGPROC *proc, uint32 hashcode)
 	 */
 	CleanUpLock(waitLock, proclock,
 				LockMethods[lockmethodid], hashcode,
-				true);
+				true, wakeups);
 }
 
 /*
@@ -1956,6 +1959,7 @@ LockRelease(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
 	PROCLOCK   *proclock;
 	LWLock	   *partitionLock;
 	bool		wakeupNeeded;
+	LatchGroup	wakeups = {0};
 
 	if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
 		elog(ERROR, "unrecognized lock method: %d", lockmethodid);
@@ -2134,10 +2138,12 @@ LockRelease(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
 
 	CleanUpLock(lock, proclock,
 				lockMethodTable, locallock->hashcode,
-				wakeupNeeded);
+				wakeupNeeded, &wakeups);
 
 	LWLockRelease(partitionLock);
 
+	SetLatches(&wakeups);
+
 	RemoveLocalLock(locallock);
 	return true;
 }
@@ -2161,6 +2167,7 @@ LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
 	LOCK	   *lock;
 	int			partition;
 	bool		have_fast_path_lwlock = false;
+	LatchGroup	wakeups = {0};
 
 	if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
 		elog(ERROR, "unrecognized lock method: %d", lockmethodid);
@@ -2399,10 +2406,12 @@ LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
 			CleanUpLock(lock, proclock,
 						lockMethodTable,
 						LockTagHashCode(&lock->tag),
-						wakeupNeeded);
+						wakeupNeeded, &wakeups);
 		}						/* loop over PROCLOCKs within this partition */
 
 		LWLockRelease(partitionLock);
+
+		SetLatches(&wakeups);
 	}							/* loop over partitions */
 
 #ifdef LOCK_DEBUG
@@ -3095,6 +3104,7 @@ LockRefindAndRelease(LockMethod lockMethodTable, PGPROC *proc,
 	uint32		proclock_hashcode;
 	LWLock	   *partitionLock;
 	bool		wakeupNeeded;
+	LatchGroup	wakeups = {0};
 
 	hashcode = LockTagHashCode(locktag);
 	partitionLock = LockHashPartitionLock(hashcode);
@@ -3148,10 +3158,12 @@ LockRefindAndRelease(LockMethod lockMethodTable, PGPROC *proc,
 
 	CleanUpLock(lock, proclock,
 				lockMethodTable, hashcode,
-				wakeupNeeded);
+				wakeupNeeded, &wakeups);
 
 	LWLockRelease(partitionLock);
 
+	SetLatches(&wakeups);
+
 	/*
 	 * Decrement strong lock count.  This logic is needed only for 2PC.
 	 */
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 09eda65fd1..156d293210 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -699,6 +699,7 @@ LockErrorCleanup(void)
 {
 	LWLock	   *partitionLock;
 	DisableTimeoutParams timeouts[2];
+	LatchGroup	wakeups = {0};
 
 	HOLD_INTERRUPTS();
 
@@ -732,7 +733,7 @@ LockErrorCleanup(void)
 	if (!dlist_node_is_detached(&MyProc->links))
 	{
 		/* We could not have been granted the lock yet */
-		RemoveFromWaitQueue(MyProc, lockAwaited->hashcode);
+		RemoveFromWaitQueue(MyProc, lockAwaited->hashcode, &wakeups);
 	}
 	else
 	{
@@ -749,6 +750,7 @@ LockErrorCleanup(void)
 	lockAwaited = NULL;
 
 	LWLockRelease(partitionLock);
+	SetLatches(&wakeups);
 
 	RESUME_INTERRUPTS();
 }
@@ -1001,7 +1003,7 @@ AuxiliaryPidGetProc(int pid)
  * NOTES: The process queue is now a priority queue for locking.
  */
 ProcWaitStatus
-ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
+ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable, LatchGroup *wakeups)
 {
 	LOCKMODE	lockmode = locallock->tag.mode;
 	LOCK	   *lock = locallock->lock;
@@ -1137,7 +1139,7 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
 	 */
 	if (early_deadlock)
 	{
-		RemoveFromWaitQueue(MyProc, hashcode);
+		RemoveFromWaitQueue(MyProc, hashcode, wakeups);
 		return PROC_WAIT_STATUS_ERROR;
 	}
 
@@ -1153,6 +1155,7 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
 	 * LockErrorCleanup will clean up if cancel/die happens.
 	 */
 	LWLockRelease(partitionLock);
+	SetLatches(wakeups);
 
 	/*
 	 * Also, now that we will successfully clean up after an ereport, it's
@@ -1594,7 +1597,7 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
 
 
 /*
- * ProcWakeup -- wake up a process by setting its latch.
+ * ProcWakeup -- prepare to wake up a process by setting its latch.
  *
  *	 Also remove the process from the wait queue and set its links invalid.
  *
@@ -1606,7 +1609,7 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
  * Hence, in practice the waitStatus parameter must be PROC_WAIT_STATUS_OK.
  */
 void
-ProcWakeup(PGPROC *proc, ProcWaitStatus waitStatus)
+ProcWakeup(PGPROC *proc, ProcWaitStatus waitStatus, LatchGroup *wakeups)
 {
 	if (dlist_node_is_detached(&proc->links))
 		return;
@@ -1622,8 +1625,8 @@ ProcWakeup(PGPROC *proc, ProcWaitStatus waitStatus)
 	proc->waitStatus = waitStatus;
 	pg_atomic_write_u64(&MyProc->waitStart, 0);
 
-	/* And awaken it */
-	SetLatch(&proc->procLatch);
+	/* Schedule it to be awoken */
+	AddLatch(wakeups, &proc->procLatch);
 }
 
 /*
@@ -1634,7 +1637,7 @@ ProcWakeup(PGPROC *proc, ProcWaitStatus waitStatus)
  * The appropriate lock partition lock must be held by caller.
  */
 void
-ProcLockWakeup(LockMethod lockMethodTable, LOCK *lock)
+ProcLockWakeup(LockMethod lockMethodTable, LOCK *lock, LatchGroup *wakeups)
 {
 	dclist_head *waitQueue = &lock->waitProcs;
 	LOCKMASK	aheadRequests = 0;
@@ -1659,7 +1662,7 @@ ProcLockWakeup(LockMethod lockMethodTable, LOCK *lock)
 			/* OK to waken */
 			GrantLock(lock, proc->waitProcLock, lockmode);
 			/* removes proc from the lock's waiting process queue */
-			ProcWakeup(proc, PROC_WAIT_STATUS_OK);
+			ProcWakeup(proc, PROC_WAIT_STATUS_OK, wakeups);
 		}
 		else
 		{
@@ -1685,6 +1688,7 @@ static void
 CheckDeadLock(void)
 {
 	int			i;
+	LatchGroup	wakeups = {0};
 
 	/*
 	 * Acquire exclusive lock on the entire shared lock data structures. Must
@@ -1719,7 +1723,7 @@ CheckDeadLock(void)
 #endif
 
 	/* Run the deadlock check, and set deadlock_state for use by ProcSleep */
-	deadlock_state = DeadLockCheck(MyProc);
+	deadlock_state = DeadLockCheck(MyProc, &wakeups);
 
 	if (deadlock_state == DS_HARD_DEADLOCK)
 	{
@@ -1736,7 +1740,8 @@ CheckDeadLock(void)
 		 * return from the signal handler.
 		 */
 		Assert(MyProc->waitLock != NULL);
-		RemoveFromWaitQueue(MyProc, LockTagHashCode(&(MyProc->waitLock->tag)));
+		RemoveFromWaitQueue(MyProc, LockTagHashCode(&(MyProc->waitLock->tag)),
+							&wakeups);
 
 		/*
 		 * We're done here.  Transaction abort caused by the error that
@@ -1760,6 +1765,8 @@ CheckDeadLock(void)
 check_done:
 	for (i = NUM_LOCK_PARTITIONS; --i >= 0;)
 		LWLockRelease(LockHashPartitionLockByIndex(i));
+
+	SetLatches(&wakeups);
 }
 
 /*
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index 6ae434596a..8ee2f3e855 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -20,6 +20,7 @@
 
 #include "lib/ilist.h"
 #include "storage/backendid.h"
+#include "storage/latch.h"
 #include "storage/lockdefs.h"
 #include "storage/lwlock.h"
 #include "storage/shmem.h"
@@ -583,7 +584,7 @@ extern bool LockCheckConflicts(LockMethod lockMethodTable,
 							   LOCK *lock, PROCLOCK *proclock);
 extern void GrantLock(LOCK *lock, PROCLOCK *proclock, LOCKMODE lockmode);
 extern void GrantAwaitedLock(void);
-extern void RemoveFromWaitQueue(PGPROC *proc, uint32 hashcode);
+extern void RemoveFromWaitQueue(PGPROC *proc, uint32 hashcode, LatchGroup *wakeups);
 extern Size LockShmemSize(void);
 extern LockData *GetLockStatusData(void);
 extern BlockedProcsData *GetBlockerStatusData(int blocked_pid);
@@ -600,7 +601,7 @@ extern void lock_twophase_postabort(TransactionId xid, uint16 info,
 extern void lock_twophase_standby_recover(TransactionId xid, uint16 info,
 										  void *recdata, uint32 len);
 
-extern DeadLockState DeadLockCheck(PGPROC *proc);
+extern DeadLockState DeadLockCheck(PGPROC *proc, LatchGroup *wakeups);
 extern PGPROC *GetBlockingAutoVacuumPgproc(void);
 extern void DeadLockReport(void) pg_attribute_noreturn();
 extern void RememberSimpleDeadLock(PGPROC *proc1,
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index ca39ae9486..799db535c9 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -451,9 +451,9 @@ extern int	GetStartupBufferPinWaitBufId(void);
 extern bool HaveNFreeProcs(int n, int *nfree);
 extern void ProcReleaseLocks(bool isCommit);
 
-extern ProcWaitStatus ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable);
-extern void ProcWakeup(PGPROC *proc, ProcWaitStatus waitStatus);
-extern void ProcLockWakeup(LockMethod lockMethodTable, LOCK *lock);
+extern ProcWaitStatus ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable, LatchGroup *wakeups);
+extern void ProcWakeup(PGPROC *proc, ProcWaitStatus waitStatus, LatchGroup *wakeups);
+extern void ProcLockWakeup(LockMethod lockMethodTable, LOCK *lock, LatchGroup *wakeups);
 extern void CheckDeadLockAlert(void);
 extern bool IsWaitingForLock(void);
 extern void LockErrorCleanup(void);
-- 
2.39.2

v3-0006-Don-t-re-acquire-LockManager-partition-lock-after.patch (text/x-patch)
From 77c6f747011c0ff8771414ded1a23306e6e953c9 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Thu, 27 Oct 2022 10:56:39 +1300
Subject: [PATCH v3 6/6] Don't re-acquire LockManager partition lock after
 wakeup.

Change the contract of WaitOnLock() and ProcSleep() to return with the
partition lock released.  ProcSleep() re-acquired the lock to prevent
interrupts, which was a problem at the time the ancestor of that code
was committed in fe548629c50, because then we had signal handlers that
longjmp'd out of there.  Now, that can happen only at points where we
explicitly run CHECK_FOR_INTERRUPTS(), so we can remove the lock, which
leads to the idea of making the function always release.

While an earlier commit fixed the problem that the backend woke us up
before it had even released the lock itself, this lock reacquisition
point was still contended when multiple backends woke at the same time.

XXX Right?  Or what am I missing?

Discussion: https://postgr.es/m/CA%2BhUKGKmO7ze0Z6WXKdrLxmvYa%3DzVGGXOO30MMktufofVwEm1A%40mail.gmail.com
---
 src/backend/storage/lmgr/lock.c | 18 ++++++++++++------
 src/backend/storage/lmgr/proc.c | 20 ++++++++++++--------
 2 files changed, 24 insertions(+), 14 deletions(-)

diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index fec6e52145..97be94dbd3 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -787,6 +787,7 @@ LockAcquireExtended(const LOCKTAG *locktag,
 	LWLock	   *partitionLock;
 	bool		found_conflict;
 	bool		log_lock = false;
+	bool		partitionLocked = false;
 	LatchGroup	wakeups = {0};
 
 	if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
@@ -995,6 +996,7 @@ LockAcquireExtended(const LOCKTAG *locktag,
 	partitionLock = LockHashPartitionLock(hashcode);
 
 	LWLockAcquire(partitionLock, LW_EXCLUSIVE);
+	partitionLocked = true;
 
 	/*
 	 * Find or create lock and proclock entries with this tag
@@ -1099,7 +1101,10 @@ LockAcquireExtended(const LOCKTAG *locktag,
 										 locktag->locktag_type,
 										 lockmode);
 
+		Assert(LWLockHeldByMeInMode(partitionLock, LW_EXCLUSIVE));
 		WaitOnLock(locallock, owner, &wakeups);
+		Assert(!LWLockHeldByMeInMode(partitionLock, LW_EXCLUSIVE));
+		partitionLocked = false;
 
 		TRACE_POSTGRESQL_LOCK_WAIT_DONE(locktag->locktag_field1,
 										locktag->locktag_field2,
@@ -1124,7 +1129,6 @@ LockAcquireExtended(const LOCKTAG *locktag,
 			PROCLOCK_PRINT("LockAcquire: INCONSISTENT", proclock);
 			LOCK_PRINT("LockAcquire: INCONSISTENT", lock, lockmode);
 			/* Should we retry ? */
-			LWLockRelease(partitionLock);
 			elog(ERROR, "LockAcquire failed");
 		}
 		PROCLOCK_PRINT("LockAcquire: granted", proclock);
@@ -1137,9 +1141,11 @@ LockAcquireExtended(const LOCKTAG *locktag,
 	 */
 	FinishStrongLockAcquire();
 
-	LWLockRelease(partitionLock);
-
-	SetLatches(&wakeups);
+	if (partitionLocked)
+	{
+		LWLockRelease(partitionLock);
+		SetLatches(&wakeups);
+	}
 
 	/*
 	 * Emit a WAL record if acquisition of this lock needs to be replayed in a
@@ -1806,7 +1812,8 @@ MarkLockClear(LOCALLOCK *locallock)
  * Caller must have set MyProc->heldLocks to reflect locks already held
  * on the lockable object by this process.
  *
- * The appropriate partition lock must be held at entry.
+ * The appropriate partition lock must be held at entry.  It is not held on
+ * exit.
  */
 static void
 WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner, LatchGroup *wakeups)
@@ -1851,7 +1858,6 @@ WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner, LatchGroup *wakeups)
 			awaitedLock = NULL;
 			LOCK_PRINT("WaitOnLock: aborting on lock",
 					   locallock->lock, locallock->tag.mode);
-			LWLockRelease(LockHashPartitionLock(locallock->hashcode));
 
 			/*
 			 * Now that we aren't holding the partition lock, we can give an
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 156d293210..35720346f2 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -992,7 +992,7 @@ AuxiliaryPidGetProc(int pid)
  * Caller must have set MyProc->heldLocks to reflect locks already held
  * on the lockable object by this process (under all XIDs).
  *
- * The lock table's partition lock must be held at entry, and will be held
+ * The lock table's partition lock must be held at entry, and will be released
  * at exit.
  *
  * Result: PROC_WAIT_STATUS_OK if we acquired the lock, PROC_WAIT_STATUS_ERROR if not (deadlock).
@@ -1020,6 +1020,12 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable, LatchGroup *wakeups)
 	ProcWaitStatus myWaitStatus;
 	PGPROC	   *leader = MyProc->lockGroupLeader;
 
+	/*
+	 * Every way out of this function will release the partition lock and send
+	 * buffered latch wakeups.
+	 */
+	Assert(LWLockHeldByMeInMode(partitionLock, LW_EXCLUSIVE));
+
 	/*
 	 * If group locking is in use, locks held by members of my locking group
 	 * need to be included in myHeldLocks.  This is not required for relation
@@ -1104,6 +1110,8 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable, LatchGroup *wakeups)
 					/* Skip the wait and just grant myself the lock. */
 					GrantLock(lock, proclock, lockmode);
 					GrantAwaitedLock();
+					LWLockRelease(partitionLock);
+					SetLatches(wakeups);
 					return PROC_WAIT_STATUS_OK;
 				}
 
@@ -1140,6 +1148,8 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable, LatchGroup *wakeups)
 	if (early_deadlock)
 	{
 		RemoveFromWaitQueue(MyProc, hashcode, wakeups);
+		LWLockRelease(partitionLock);
+		SetLatches(wakeups);
 		return PROC_WAIT_STATUS_ERROR;
 	}
 
@@ -1469,6 +1479,7 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable, LatchGroup *wakeups)
 			}
 
 			LWLockRelease(partitionLock);
+			SetLatches(wakeups);
 
 			if (deadlock_state == DS_SOFT_DEADLOCK)
 				ereport(LOG,
@@ -1570,13 +1581,6 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable, LatchGroup *wakeups)
 							standbyWaitStart, GetCurrentTimestamp(),
 							NULL, false);
 
-	/*
-	 * Re-acquire the lock table's partition lock.  We have to do this to hold
-	 * off cancel/die interrupts before we can mess with lockAwaited (else we
-	 * might have a missed or duplicated locallock update).
-	 */
-	LWLockAcquire(partitionLock, LW_EXCLUSIVE);
-
 	/*
 	 * We no longer want LockErrorCleanup to do anything.
 	 */
-- 
2.39.2

#5Peter Eisentraut
peter@eisentraut.org
In reply to: Thomas Munro (#4)
Re: Latches vs lwlock contention

On 04.03.23 20:50, Thomas Munro wrote:

Subject: [PATCH v3 1/6] Allow palloc_extended(NO_OOM) in critical sections.

Commit 4a170ee9e0e banned palloc() and similar in critical sections, because an
allocation failure would produce a panic. Make an exception for allocation
with NULL on failure, for code that has a backup plan.

I suppose this assumes that out of memory is the only possible error
condition that we are concerned about for this?

For example, we sometimes see "invalid memory alloc request size" either
because of corrupted data or because code does things we didn't expect.
This would then possibly panic? Also, the realloc code paths
potentially do more work with possibly more error conditions, and/or
they error out right away because it's not supported by the context type.

Maybe this is all ok, but it would be good to make the assumptions more
explicit.
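
For illustration, a guard along these lines in AddLatch() would keep the
growth path away from the "invalid memory alloc request size" check
entirely (just a sketch reusing the existing AllocSizeIsValid() macro,
not something I've tested):

```
	/*
	 * Sketch only: pre-check the request size so that palloc_extended()
	 * cannot reach its "invalid memory alloc request size" error, which
	 * MCXT_ALLOC_NO_OOM does not suppress.  If the doubled capacity would
	 * be too large, fall back to flushing -- the same backup plan already
	 * used for plain allocation failure.
	 */
	new_capacity = group->capacity * 2;
	if (!AllocSizeIsValid(sizeof(Latch *) * (Size) new_capacity))
		SetLatches(group);
	else
	{
		new_latches = palloc_extended(sizeof(Latch *) * new_capacity,
									  MCXT_ALLOC_NO_OOM);
		if (!new_latches)
			SetLatches(group);
		else
		{
			/* ... copy, swap buffers, bump capacity as in the patch ... */
		}
	}
```

Either way, spelling out which elog()/ereport() paths remain reachable
from a critical section would make the assumption explicit.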

#6Aleksander Alekseev
aleksander@timescale.com
In reply to: Peter Eisentraut (#5)
6 attachment(s)
Re: Latches vs lwlock contention

Hi,

Maybe this is all ok, but it would be good to make the assumptions more
explicit.

Here are my two cents.

```
static void
SetLatchV(Latch **latches, int nlatches)
{
/* Flush any other changes out to main memory just once. */
pg_memory_barrier();

/* Keep only latches that are not already set, and set them. */
for (int i = 0; i < nlatches; ++i)
{
Latch *latch = latches[i];

if (!latch->is_set)
latch->is_set = true;
else
latches[i] = NULL;
}

pg_memory_barrier();

[...]

void
SetLatches(LatchGroup *group)
{
if (group->size > 0)
{
SetLatchV(group->latches, group->size);

[...]
```

I suspect this API may be error-prone without some additional
comments. A caller (possibly an extension author) might rely on the
implementation details of SetLatches() / SetLatchV() and inspect the
modified group->latches[] entries, e.g. to figure out which latches it
attempted to set. Even worse, one could mistakenly assume that the
result indicates whether the caller was the one who actually changed
the state of a given latch. That being said, I see why this particular
implementation was chosen.
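
To spell out the intended usage for the archives, the calling pattern
the group API is designed for looks roughly like this (a sketch;
SomeLock and waiter_proc are placeholders, not identifiers from the
patch):

```
	/*
	 * Collect wakeup targets while holding the lock, then issue the
	 * system calls only after the lock has been released.
	 */
	LatchGroup	wakeups = {0};

	LWLockAcquire(SomeLock, LW_EXCLUSIVE);
	/* ... for each waiter removed from the shared queue ... */
	AddLatch(&wakeups, &waiter_proc->procLatch);
	LWLockRelease(SomeLock);

	SetLatches(&wakeups);	/* SetLatch() work happens here, lock-free */
```

In other words, the group is write-only from the caller's point of
view: fill it while holding the lock, flush it once after release, and
never read its contents back.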

I added corresponding comments to SetLatchV() and SetLatches(). Also
the patchset needed a rebase. PFA v4.

It passes `make installcheck-world` on three of my machines: macOS x64,
Linux x64 and Linux RISC-V.

--
Best regards,
Aleksander Alekseev

Attachments:

v4-0002-Provide-SetLatches-for-batched-deferred-latches.patch (application/x-patch)
From 006499ee28230e8c1b5c8d2f047276c5661f50ad Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 26 Oct 2022 15:51:45 +1300
Subject: [PATCH v4 2/6] Provide SetLatches() for batched deferred latches.

If we have a way to buffer a set of wakeup targets and process them at a
later time, we can:

* move SetLatch() system calls out from under LWLocks, so that locks can
  be released faster; this is especially interesting in cases where the
  target backends will immediately try to acquire the same lock, or
  generally when the lock is heavily contended

* possibly gain some micro-optimization from issuing only two memory
  barriers for the whole batch of latches, not two for each latch to be
  set

* prepare for future IPC mechanisms which might allow us to wake
  multiple backends in a single system call

Individual users of this facility will follow in separate patches.

Discussion: https://postgr.es/m/CA%2BhUKGKmO7ze0Z6WXKdrLxmvYa%3DzVGGXOO30MMktufofVwEm1A%40mail.gmail.com
---
 src/backend/storage/ipc/latch.c  | 221 ++++++++++++++++++++-----------
 src/include/storage/latch.h      |  13 ++
 src/tools/pgindent/typedefs.list |   1 +
 3 files changed, 161 insertions(+), 74 deletions(-)

diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index db889385b7..5b7113dac8 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -589,6 +589,88 @@ WaitLatchOrSocket(Latch *latch, int wakeEvents, pgsocket sock,
 	return ret;
 }
 
+/*
+ * Sets multiple 'latches' at the same time.
+ *
+ * Modifies the input array. The nature of this modification is
+ * an implementation detail of SetLatchV(). The caller shouldn't try
+ * interpreting it and/or using the input array after the call.
+ */
+static void
+SetLatchV(Latch **latches, int nlatches)
+{
+	/* Flush any other changes out to main memory just once. */
+	pg_memory_barrier();
+
+	/* Keep only latches that are not already set, and set them. */
+	for (int i = 0; i < nlatches; ++i)
+	{
+		Latch	   *latch = latches[i];
+
+		if (!latch->is_set)
+			latch->is_set = true;
+		else
+			latches[i] = NULL;
+	}
+
+	pg_memory_barrier();
+
+	/* Wake the remaining latches that might be sleeping. */
+	for (int i = 0; i < nlatches; ++i)
+	{
+		Latch	   *latch = latches[i];
+
+		/*
+		 * See if anyone's waiting for the latch. It can be the current
+		 * process if we're in a signal handler. We use the self-pipe or
+		 * SIGURG to ourselves to wake up WaitEventSetWaitBlock() without
+		 * races in that case. If it's another process, send a signal.
+		 *
+		 * Fetch owner_pid only once, in case the latch is concurrently
+		 * getting owned or disowned. XXX: This assumes that pid_t is atomic,
+		 * which isn't guaranteed to be true! In practice, the effective range
+		 * of pid_t fits in a 32 bit integer, and so should be atomic. In the
+		 * worst case, we might end up signaling the wrong process. Even then,
+		 * you're very unlucky if a process with that bogus pid exists and
+		 * belongs to Postgres; and PG database processes should handle excess
+		 * SIGURG interrupts without a problem anyhow.
+		 *
+		 * Another sort of race condition that's possible here is for a new
+		 * process to own the latch immediately after we look, so we don't
+		 * signal it. This is okay so long as all callers of
+		 * ResetLatch/WaitLatch follow the standard coding convention of
+		 * waiting at the bottom of their loops, not the top, so that they'll
+		 * correctly process latch-setting events that happen before they
+		 * enter the loop.
+		 */
+		if (latch && latch->maybe_sleeping)
+		{
+#ifndef WIN32
+			pid_t		owner_pid = latch->owner_pid;
+
+			if (owner_pid == MyProcPid)
+			{
+				if (waiting)
+				{
+#if defined(WAIT_USE_SELF_PIPE)
+					sendSelfPipeByte();
+#else
+					kill(MyProcPid, SIGURG);
+#endif
+				}
+			}
+			else
+				kill(owner_pid, SIGURG);
+#else
+			HANDLE		event = latch->event;
+
+			if (event)
+				SetEvent(event);
+#endif
+		}
+	}
+}
+
 /*
  * Sets a latch and wakes up anyone waiting on it.
  *
@@ -604,89 +686,80 @@ WaitLatchOrSocket(Latch *latch, int wakeEvents, pgsocket sock,
 void
 SetLatch(Latch *latch)
 {
-#ifndef WIN32
-	pid_t		owner_pid;
-#else
-	HANDLE		handle;
-#endif
-
-	/*
-	 * The memory barrier has to be placed here to ensure that any flag
-	 * variables possibly changed by this process have been flushed to main
-	 * memory, before we check/set is_set.
-	 */
-	pg_memory_barrier();
-
-	/* Quick exit if already set */
-	if (latch->is_set)
-		return;
-
-	latch->is_set = true;
-
-	pg_memory_barrier();
-	if (!latch->maybe_sleeping)
-		return;
+	SetLatchV(&latch, 1);
+}
 
-#ifndef WIN32
+/*
+ * Add a latch to a group, to be set later.
+ */
+void
+AddLatch(LatchGroup *group, Latch *latch)
+{
+	/* First time.  Set up the in-place buffer. */
+	if (!group->latches)
+	{
+		group->latches = group->in_place_buffer;
+		group->capacity = lengthof(group->in_place_buffer);
+		Assert(group->size == 0);
+	}
 
-	/*
-	 * See if anyone's waiting for the latch. It can be the current process if
-	 * we're in a signal handler. We use the self-pipe or SIGURG to ourselves
-	 * to wake up WaitEventSetWaitBlock() without races in that case. If it's
-	 * another process, send a signal.
-	 *
-	 * Fetch owner_pid only once, in case the latch is concurrently getting
-	 * owned or disowned. XXX: This assumes that pid_t is atomic, which isn't
-	 * guaranteed to be true! In practice, the effective range of pid_t fits
-	 * in a 32 bit integer, and so should be atomic. In the worst case, we
-	 * might end up signaling the wrong process. Even then, you're very
-	 * unlucky if a process with that bogus pid exists and belongs to
-	 * Postgres; and PG database processes should handle excess SIGUSR1
-	 * interrupts without a problem anyhow.
-	 *
-	 * Another sort of race condition that's possible here is for a new
-	 * process to own the latch immediately after we look, so we don't signal
-	 * it. This is okay so long as all callers of ResetLatch/WaitLatch follow
-	 * the standard coding convention of waiting at the bottom of their loops,
-	 * not the top, so that they'll correctly process latch-setting events
-	 * that happen before they enter the loop.
-	 */
-	owner_pid = latch->owner_pid;
-	if (owner_pid == 0)
-		return;
-	else if (owner_pid == MyProcPid)
+	/* Are we full? */
+	if (group->size == group->capacity)
 	{
-#if defined(WAIT_USE_SELF_PIPE)
-		if (waiting)
-			sendSelfPipeByte();
-#else
-		if (waiting)
-			kill(MyProcPid, SIGURG);
-#endif
+		Latch	  **new_latches;
+		int			new_capacity;
+
+		/* Try to allocate more space. */
+		new_capacity = group->capacity * 2;
+		new_latches = palloc_extended(sizeof(latch) * new_capacity,
+									  MCXT_ALLOC_NO_OOM);
+		if (!new_latches)
+		{
+			/*
+			 * Allocation failed.  This is very unlikely to happen, but it
+			 * might be useful to be able to use this function in critical
+			 * sections, so we handle allocation failure by flushing instead
+			 * of throwing.
+			 */
+			SetLatches(group);
+		}
+		else
+		{
+			memcpy(new_latches, group->latches, sizeof(latch) * group->size);
+			if (group->latches != group->in_place_buffer)
+				pfree(group->latches);
+			group->latches = new_latches;
+			group->capacity = new_capacity;
+		}
 	}
-	else
-		kill(owner_pid, SIGURG);
 
-#else
+	Assert(group->size < group->capacity);
+	group->latches[group->size++] = latch;
+}
 
-	/*
-	 * See if anyone's waiting for the latch. It can be the current process if
-	 * we're in a signal handler.
-	 *
-	 * Use a local variable here just in case somebody changes the event field
-	 * concurrently (which really should not happen).
-	 */
-	handle = latch->event;
-	if (handle)
+/*
+ * Sets all the latches accumulated in the 'group'.
+ *
+ * Modifies the 'group'. The nature of this modification is
+ * an implementation detail of SetLatches(). The caller shouldn't try
+ * interpreting it and/or using the 'group' again after the call.
+ */
+void
+SetLatches(LatchGroup *group)
+{
+	if (group->size > 0)
 	{
-		SetEvent(handle);
+		SetLatchV(group->latches, group->size);
+		group->size = 0;
 
-		/*
-		 * Note that we silently ignore any errors. We might be in a signal
-		 * handler or other critical path where it's not safe to call elog().
-		 */
+		/* If we allocated a large buffer, it's time to free it. */
+		if (group->latches != group->in_place_buffer)
+		{
+			pfree(group->latches);
+			group->latches = group->in_place_buffer;
+			group->capacity = lengthof(group->in_place_buffer);
+		}
 	}
-#endif
 }
 
 /*
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index 99cc47874a..7b81806825 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -118,6 +118,17 @@ typedef struct Latch
 #endif
 } Latch;
 
+/*
+ * A buffer for setting multiple latches at a time.
+ */
+typedef struct LatchGroup
+{
+	int			size;
+	int			capacity;
+	Latch	  **latches;
+	Latch	   *in_place_buffer[64];
+} LatchGroup;
+
 /*
  * Bitmasks for events that may wake-up WaitLatch(), WaitLatchOrSocket(), or
  * WaitEventSetWait().
@@ -170,6 +181,8 @@ extern void InitSharedLatch(Latch *latch);
 extern void OwnLatch(Latch *latch);
 extern void DisownLatch(Latch *latch);
 extern void SetLatch(Latch *latch);
+extern void AddLatch(LatchGroup *group, Latch *latch);
+extern void SetLatches(LatchGroup *group);
 extern void ResetLatch(Latch *latch);
 extern void ShutdownLatchSupport(void);
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index e941fb6c82..34f9b32b29 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1409,6 +1409,7 @@ LabelProvider
 LagTracker
 LargeObjectDesc
 Latch
+LatchGroup
 LauncherLastStartTimesEntry
 LerpFunc
 LexDescr
-- 
2.41.0

v4-0003-Use-SetLatches-for-synchronous-replication-wakeup.patch (application/x-patch)
From c4c30181ef039107d0e03068a114032eb883fd05 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Thu, 27 Oct 2022 12:59:45 +1300
Subject: [PATCH v4 3/6] Use SetLatches() for synchronous replication wakeups.

Don't issue SetLatch() system calls while holding SyncRepLock.

Discussion: https://postgr.es/m/CA%2BhUKGKmO7ze0Z6WXKdrLxmvYa%3DzVGGXOO30MMktufofVwEm1A%40mail.gmail.com
---
 src/backend/replication/syncrep.c | 21 ++++++++++++++-------
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index 0ea71b5c43..f954c39b38 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -100,7 +100,7 @@ static int	SyncRepWaitMode = SYNC_REP_NO_WAIT;
 
 static void SyncRepQueueInsert(int mode);
 static void SyncRepCancelWait(void);
-static int	SyncRepWakeQueue(bool all, int mode);
+static int	SyncRepWakeQueue(bool all, int mode, LatchGroup *wakeups);
 
 static bool SyncRepGetSyncRecPtr(XLogRecPtr *writePtr,
 								 XLogRecPtr *flushPtr,
@@ -440,6 +440,7 @@ SyncRepReleaseWaiters(void)
 	int			numwrite = 0;
 	int			numflush = 0;
 	int			numapply = 0;
+	LatchGroup	wakeups = {0};
 
 	/*
 	 * If this WALSender is serving a standby that is not on the list of
@@ -509,21 +510,23 @@ SyncRepReleaseWaiters(void)
 	if (walsndctl->lsn[SYNC_REP_WAIT_WRITE] < writePtr)
 	{
 		walsndctl->lsn[SYNC_REP_WAIT_WRITE] = writePtr;
-		numwrite = SyncRepWakeQueue(false, SYNC_REP_WAIT_WRITE);
+		numwrite = SyncRepWakeQueue(false, SYNC_REP_WAIT_WRITE, &wakeups);
 	}
 	if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < flushPtr)
 	{
 		walsndctl->lsn[SYNC_REP_WAIT_FLUSH] = flushPtr;
-		numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH);
+		numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH, &wakeups);
 	}
 	if (walsndctl->lsn[SYNC_REP_WAIT_APPLY] < applyPtr)
 	{
 		walsndctl->lsn[SYNC_REP_WAIT_APPLY] = applyPtr;
-		numapply = SyncRepWakeQueue(false, SYNC_REP_WAIT_APPLY);
+		numapply = SyncRepWakeQueue(false, SYNC_REP_WAIT_APPLY, &wakeups);
 	}
 
 	LWLockRelease(SyncRepLock);
 
+	SetLatches(&wakeups);
+
 	elog(DEBUG3, "released %d procs up to write %X/%X, %d procs up to flush %X/%X, %d procs up to apply %X/%X",
 		 numwrite, LSN_FORMAT_ARGS(writePtr),
 		 numflush, LSN_FORMAT_ARGS(flushPtr),
@@ -867,7 +870,7 @@ SyncRepGetStandbyPriority(void)
  * The caller must hold SyncRepLock in exclusive mode.
  */
 static int
-SyncRepWakeQueue(bool all, int mode)
+SyncRepWakeQueue(bool all, int mode, LatchGroup *wakeups)
 {
 	volatile WalSndCtlData *walsndctl = WalSndCtl;
 	int			numprocs = 0;
@@ -908,7 +911,7 @@ SyncRepWakeQueue(bool all, int mode)
 		/*
 		 * Wake only when we have set state and removed from queue.
 		 */
-		SetLatch(&(proc->procLatch));
+		AddLatch(wakeups, &proc->procLatch);
 
 		numprocs++;
 	}
@@ -930,6 +933,8 @@ SyncRepUpdateSyncStandbysDefined(void)
 
 	if (sync_standbys_defined != WalSndCtl->sync_standbys_defined)
 	{
+		LatchGroup		wakeups = {0};
+
 		LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
 
 		/*
@@ -942,7 +947,7 @@ SyncRepUpdateSyncStandbysDefined(void)
 			int			i;
 
 			for (i = 0; i < NUM_SYNC_REP_WAIT_MODE; i++)
-				SyncRepWakeQueue(true, i);
+				SyncRepWakeQueue(true, i, &wakeups);
 		}
 
 		/*
@@ -955,6 +960,8 @@ SyncRepUpdateSyncStandbysDefined(void)
 		WalSndCtl->sync_standbys_defined = sync_standbys_defined;
 
 		LWLockRelease(SyncRepLock);
+
+		SetLatches(&wakeups);
 	}
 }
 
-- 
2.41.0

v4-0004-Use-SetLatches-for-SERIALIZABLE-DEFERRABLE-wakeup.patch (application/x-patch)
From 1dd76ab9bfadd070958a7c077ebffb5dde0afbdc Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Thu, 27 Oct 2022 12:40:12 +1300
Subject: [PATCH v4 4/6] Use SetLatches() for SERIALIZABLE DEFERRABLE wakeups.

Don't issue SetLatch()'s system call while holding a highly contended
LWLock.  Collect them in a LatchGroup to be set after the lock is
released.  Once woken, other backends will immediately try to acquire
that lock anyway so it's better to wake after releasing.

Take this opportunity to retire the confusingly named ProcSendSignal()
and replace it with something that gives the latch pointer we need.
There was only one other caller, in bufmgr.c, which is easily changed.

Discussion: https://postgr.es/m/CA%2BhUKGKmO7ze0Z6WXKdrLxmvYa%3DzVGGXOO30MMktufofVwEm1A%40mail.gmail.com
---
 src/backend/storage/buffer/bufmgr.c  |  2 +-
 src/backend/storage/ipc/procsignal.c |  2 --
 src/backend/storage/lmgr/predicate.c |  5 ++++-
 src/backend/storage/lmgr/proc.c      | 12 ------------
 src/include/storage/proc.h           |  4 +++-
 5 files changed, 8 insertions(+), 17 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index a7e3b9bb1d..de5908876d 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2449,7 +2449,7 @@ UnpinBuffer(BufferDesc *buf)
 
 				buf_state &= ~BM_PIN_COUNT_WAITER;
 				UnlockBufHdr(buf, buf_state);
-				ProcSendSignal(wait_backend_pgprocno);
+				SetLatch(GetProcLatchByNumber(wait_backend_pgprocno));
 			}
 			else
 				UnlockBufHdr(buf, buf_state);
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index c85cb5cc18..ec4c3888b0 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -255,8 +255,6 @@ CleanupProcSignalState(int status, Datum arg)
  *
  * On success (a signal was sent), zero is returned.
  * On error, -1 is returned, and errno is set (typically to ESRCH or EPERM).
- *
- * Not to be confused with ProcSendSignal
  */
 int
 SendProcSignal(pid_t pid, ProcSignalReason reason, BackendId backendId)
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 1af41213b4..4d95ca94df 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -3249,6 +3249,7 @@ ReleasePredicateLocks(bool isCommit, bool isReadOnlySafe)
 	bool		needToClear;
 	SERIALIZABLEXACT *roXact;
 	dlist_mutable_iter iter;
+	LatchGroup	wakeups = {0};
 
 	/*
 	 * We can't trust XactReadOnly here, because a transaction which started
@@ -3553,7 +3554,7 @@ ReleasePredicateLocks(bool isCommit, bool isReadOnlySafe)
 			 */
 			if (SxactIsDeferrableWaiting(roXact) &&
 				(SxactIsROUnsafe(roXact) || SxactIsROSafe(roXact)))
-				ProcSendSignal(roXact->pgprocno);
+				AddLatch(&wakeups, GetProcLatchByNumber(roXact->pgprocno));
 		}
 	}
 
@@ -3583,6 +3584,8 @@ ReleasePredicateLocks(bool isCommit, bool isReadOnlySafe)
 
 	LWLockRelease(SerializableXactHashLock);
 
+	SetLatches(&wakeups);
+
 	LWLockAcquire(SerializableFinishedListLock, LW_EXCLUSIVE);
 
 	/* Add this to the list of transactions to check for later cleanup. */
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 5b663a2997..a71b34ba0b 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -1802,18 +1802,6 @@ ProcWaitForSignal(uint32 wait_event_info)
 	CHECK_FOR_INTERRUPTS();
 }
 
-/*
- * ProcSendSignal - set the latch of a backend identified by pgprocno
- */
-void
-ProcSendSignal(int pgprocno)
-{
-	if (pgprocno < 0 || pgprocno >= ProcGlobal->allProcCount)
-		elog(ERROR, "pgprocno out of range");
-
-	SetLatch(&ProcGlobal->allProcs[pgprocno].procLatch);
-}
-
 /*
  * BecomeLockGroupLeader - designate process as lock group leader
  *
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index ef74f32693..56a31d2a89 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -413,6 +413,9 @@ extern PGDLLIMPORT PGPROC *PreparedXactProcs;
 /* Accessor for PGPROC given a pgprocno. */
 #define GetPGProcByNumber(n) (&ProcGlobal->allProcs[(n)])
 
+/* Accessor for procLatch given a pgprocno. */
+#define GetProcLatchByNumber(n) (&(GetPGProcByNumber(n))->procLatch)
+
 /*
  * We set aside some extra PGPROC structures for auxiliary processes,
  * ie things that aren't full-fledged backends but need shmem access.
@@ -456,7 +459,6 @@ extern bool IsWaitingForLock(void);
 extern void LockErrorCleanup(void);
 
 extern void ProcWaitForSignal(uint32 wait_event_info);
-extern void ProcSendSignal(int pgprocno);
 
 extern PGPROC *AuxiliaryPidGetProc(int pid);
 
-- 
2.41.0

v4-0005-Use-SetLatches-for-heavyweight-locks.patch (application/x-patch)
From f488319ab14d7698793b1b7e1b9c8098063b2294 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 26 Oct 2022 16:43:31 +1300
Subject: [PATCH v4 5/6] Use SetLatches() for heavyweight locks.

Collect wakeups into a LatchGroup to be set at once after the
LockManager's internal partition lock is released.  This avoids holding
busy locks while issuing system calls.

Currently, waiters immediately try to acquire that lock themselves, so
deferring until it's released makes sense to avoid contention (a later
patch may remove that lock acquisition though).

Discussion: https://postgr.es/m/CA%2BhUKGKmO7ze0Z6WXKdrLxmvYa%3DzVGGXOO30MMktufofVwEm1A%40mail.gmail.com
---
 src/backend/storage/lmgr/deadlock.c |  4 ++--
 src/backend/storage/lmgr/lock.c     | 36 +++++++++++++++++++----------
 src/backend/storage/lmgr/proc.c     | 29 ++++++++++++++---------
 src/include/storage/lock.h          |  5 ++--
 src/include/storage/proc.h          |  6 ++---
 5 files changed, 50 insertions(+), 30 deletions(-)

diff --git a/src/backend/storage/lmgr/deadlock.c b/src/backend/storage/lmgr/deadlock.c
index 2bdd20b1fe..3657186d9f 100644
--- a/src/backend/storage/lmgr/deadlock.c
+++ b/src/backend/storage/lmgr/deadlock.c
@@ -214,7 +214,7 @@ InitDeadLockChecking(void)
  * and (b) we are typically invoked inside a signal handler.
  */
 DeadLockState
-DeadLockCheck(PGPROC *proc)
+DeadLockCheck(PGPROC *proc, LatchGroup *wakeups)
 {
 	/* Initialize to "no constraints" */
 	nCurConstraints = 0;
@@ -266,7 +266,7 @@ DeadLockCheck(PGPROC *proc)
 #endif
 
 		/* See if any waiters for the lock can be woken up now */
-		ProcLockWakeup(GetLocksMethodTable(lock), lock);
+		ProcLockWakeup(GetLocksMethodTable(lock), lock, wakeups);
 	}
 
 	/* Return code tells caller if we had to escape a deadlock or not */
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index f595bce31b..126073a0d3 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -362,14 +362,14 @@ static PROCLOCK *SetupLockInTable(LockMethod lockMethodTable, PGPROC *proc,
 static void GrantLockLocal(LOCALLOCK *locallock, ResourceOwner owner);
 static void BeginStrongLockAcquire(LOCALLOCK *locallock, uint32 fasthashcode);
 static void FinishStrongLockAcquire(void);
-static void WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner);
+static void WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner, LatchGroup *wakeups);
 static void ReleaseLockIfHeld(LOCALLOCK *locallock, bool sessionLock);
 static void LockReassignOwner(LOCALLOCK *locallock, ResourceOwner parent);
 static bool UnGrantLock(LOCK *lock, LOCKMODE lockmode,
 						PROCLOCK *proclock, LockMethod lockMethodTable);
 static void CleanUpLock(LOCK *lock, PROCLOCK *proclock,
 						LockMethod lockMethodTable, uint32 hashcode,
-						bool wakeupNeeded);
+						bool wakeupNeeded, LatchGroup *wakeups);
 static void LockRefindAndRelease(LockMethod lockMethodTable, PGPROC *proc,
 								 LOCKTAG *locktag, LOCKMODE lockmode,
 								 bool decrement_strong_lock_count);
@@ -775,6 +775,7 @@ LockAcquireExtended(const LOCKTAG *locktag,
 	LWLock	   *partitionLock;
 	bool		found_conflict;
 	bool		log_lock = false;
+	LatchGroup	wakeups = {0};
 
 	if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
 		elog(ERROR, "unrecognized lock method: %d", lockmethodid);
@@ -1079,7 +1080,7 @@ LockAcquireExtended(const LOCKTAG *locktag,
 										 locktag->locktag_type,
 										 lockmode);
 
-		WaitOnLock(locallock, owner);
+		WaitOnLock(locallock, owner, &wakeups);
 
 		TRACE_POSTGRESQL_LOCK_WAIT_DONE(locktag->locktag_field1,
 										locktag->locktag_field2,
@@ -1119,6 +1120,8 @@ LockAcquireExtended(const LOCKTAG *locktag,
 
 	LWLockRelease(partitionLock);
 
+	SetLatches(&wakeups);
+
 	/*
 	 * Emit a WAL record if acquisition of this lock needs to be replayed in a
 	 * standby server.
@@ -1605,7 +1608,7 @@ UnGrantLock(LOCK *lock, LOCKMODE lockmode,
 static void
 CleanUpLock(LOCK *lock, PROCLOCK *proclock,
 			LockMethod lockMethodTable, uint32 hashcode,
-			bool wakeupNeeded)
+			bool wakeupNeeded, LatchGroup *wakeups)
 {
 	/*
 	 * If this was my last hold on this lock, delete my entry in the proclock
@@ -1645,7 +1648,7 @@ CleanUpLock(LOCK *lock, PROCLOCK *proclock,
 	else if (wakeupNeeded)
 	{
 		/* There are waiters on this lock, so wake them up. */
-		ProcLockWakeup(lockMethodTable, lock);
+		ProcLockWakeup(lockMethodTable, lock, wakeups);
 	}
 }
 
@@ -1782,7 +1785,7 @@ MarkLockClear(LOCALLOCK *locallock)
  * The appropriate partition lock must be held at entry.
  */
 static void
-WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner)
+WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner, LatchGroup *wakeups)
 {
 	LOCKMETHODID lockmethodid = LOCALLOCK_LOCKMETHOD(*locallock);
 	LockMethod	lockMethodTable = LockMethods[lockmethodid];
@@ -1815,7 +1818,7 @@ WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner)
 	 */
 	PG_TRY();
 	{
-		if (ProcSleep(locallock, lockMethodTable) != PROC_WAIT_STATUS_OK)
+		if (ProcSleep(locallock, lockMethodTable, wakeups) != PROC_WAIT_STATUS_OK)
 		{
 			/*
 			 * We failed as a result of a deadlock, see CheckDeadLock(). Quit
@@ -1866,7 +1869,7 @@ WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner)
  * NB: this does not clean up any locallock object that may exist for the lock.
  */
 void
-RemoveFromWaitQueue(PGPROC *proc, uint32 hashcode)
+RemoveFromWaitQueue(PGPROC *proc, uint32 hashcode, LatchGroup *wakeups)
 {
 	LOCK	   *waitLock = proc->waitLock;
 	PROCLOCK   *proclock = proc->waitProcLock;
@@ -1907,7 +1910,7 @@ RemoveFromWaitQueue(PGPROC *proc, uint32 hashcode)
 	 */
 	CleanUpLock(waitLock, proclock,
 				LockMethods[lockmethodid], hashcode,
-				true);
+				true, wakeups);
 }
 
 /*
@@ -1932,6 +1935,7 @@ LockRelease(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
 	PROCLOCK   *proclock;
 	LWLock	   *partitionLock;
 	bool		wakeupNeeded;
+	LatchGroup	wakeups = {0};
 
 	if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
 		elog(ERROR, "unrecognized lock method: %d", lockmethodid);
@@ -2110,10 +2114,12 @@ LockRelease(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
 
 	CleanUpLock(lock, proclock,
 				lockMethodTable, locallock->hashcode,
-				wakeupNeeded);
+				wakeupNeeded, &wakeups);
 
 	LWLockRelease(partitionLock);
 
+	SetLatches(&wakeups);
+
 	RemoveLocalLock(locallock);
 	return true;
 }
@@ -2137,6 +2143,7 @@ LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
 	LOCK	   *lock;
 	int			partition;
 	bool		have_fast_path_lwlock = false;
+	LatchGroup	wakeups = {0};
 
 	if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
 		elog(ERROR, "unrecognized lock method: %d", lockmethodid);
@@ -2375,10 +2382,12 @@ LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
 			CleanUpLock(lock, proclock,
 						lockMethodTable,
 						LockTagHashCode(&lock->tag),
-						wakeupNeeded);
+						wakeupNeeded, &wakeups);
 		}						/* loop over PROCLOCKs within this partition */
 
 		LWLockRelease(partitionLock);
+
+		SetLatches(&wakeups);
 	}							/* loop over partitions */
 
 #ifdef LOCK_DEBUG
@@ -3071,6 +3080,7 @@ LockRefindAndRelease(LockMethod lockMethodTable, PGPROC *proc,
 	uint32		proclock_hashcode;
 	LWLock	   *partitionLock;
 	bool		wakeupNeeded;
+	LatchGroup	wakeups = {0};
 
 	hashcode = LockTagHashCode(locktag);
 	partitionLock = LockHashPartitionLock(hashcode);
@@ -3124,10 +3134,12 @@ LockRefindAndRelease(LockMethod lockMethodTable, PGPROC *proc,
 
 	CleanUpLock(lock, proclock,
 				lockMethodTable, hashcode,
-				wakeupNeeded);
+				wakeupNeeded, &wakeups);
 
 	LWLockRelease(partitionLock);
 
+	SetLatches(&wakeups);
+
 	/*
 	 * Decrement strong lock count.  This logic is needed only for 2PC.
 	 */
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index a71b34ba0b..8821517897 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -699,6 +699,7 @@ LockErrorCleanup(void)
 {
 	LWLock	   *partitionLock;
 	DisableTimeoutParams timeouts[2];
+	LatchGroup	wakeups = {0};
 
 	HOLD_INTERRUPTS();
 
@@ -732,7 +733,7 @@ LockErrorCleanup(void)
 	if (!dlist_node_is_detached(&MyProc->links))
 	{
 		/* We could not have been granted the lock yet */
-		RemoveFromWaitQueue(MyProc, lockAwaited->hashcode);
+		RemoveFromWaitQueue(MyProc, lockAwaited->hashcode, &wakeups);
 	}
 	else
 	{
@@ -749,6 +750,7 @@ LockErrorCleanup(void)
 	lockAwaited = NULL;
 
 	LWLockRelease(partitionLock);
+	SetLatches(&wakeups);
 
 	RESUME_INTERRUPTS();
 }
@@ -1001,7 +1003,7 @@ AuxiliaryPidGetProc(int pid)
  * NOTES: The process queue is now a priority queue for locking.
  */
 ProcWaitStatus
-ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
+ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable, LatchGroup *wakeups)
 {
 	LOCKMODE	lockmode = locallock->tag.mode;
 	LOCK	   *lock = locallock->lock;
@@ -1137,7 +1139,7 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
 	 */
 	if (early_deadlock)
 	{
-		RemoveFromWaitQueue(MyProc, hashcode);
+		RemoveFromWaitQueue(MyProc, hashcode, wakeups);
 		return PROC_WAIT_STATUS_ERROR;
 	}
 
@@ -1153,6 +1155,7 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
 	 * LockErrorCleanup will clean up if cancel/die happens.
 	 */
 	LWLockRelease(partitionLock);
+	SetLatches(wakeups);
 
 	/*
 	 * Also, now that we will successfully clean up after an ereport, it's
@@ -1594,7 +1597,7 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
 
 
 /*
- * ProcWakeup -- wake up a process by setting its latch.
+ * ProcWakeup -- prepare to wake up a process by setting its latch.
  *
  *	 Also remove the process from the wait queue and set its links invalid.
  *
@@ -1606,7 +1609,7 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
  * Hence, in practice the waitStatus parameter must be PROC_WAIT_STATUS_OK.
  */
 void
-ProcWakeup(PGPROC *proc, ProcWaitStatus waitStatus)
+ProcWakeup(PGPROC *proc, ProcWaitStatus waitStatus, LatchGroup *wakeups)
 {
 	if (dlist_node_is_detached(&proc->links))
 		return;
@@ -1622,8 +1625,8 @@ ProcWakeup(PGPROC *proc, ProcWaitStatus waitStatus)
 	proc->waitStatus = waitStatus;
 	pg_atomic_write_u64(&MyProc->waitStart, 0);
 
-	/* And awaken it */
-	SetLatch(&proc->procLatch);
+	/* Schedule it to be awoken */
+	AddLatch(wakeups, &proc->procLatch);
 }
 
 /*
@@ -1634,7 +1637,7 @@ ProcWakeup(PGPROC *proc, ProcWaitStatus waitStatus)
  * The appropriate lock partition lock must be held by caller.
  */
 void
-ProcLockWakeup(LockMethod lockMethodTable, LOCK *lock)
+ProcLockWakeup(LockMethod lockMethodTable, LOCK *lock, LatchGroup *wakeups)
 {
 	dclist_head *waitQueue = &lock->waitProcs;
 	LOCKMASK	aheadRequests = 0;
@@ -1659,7 +1662,7 @@ ProcLockWakeup(LockMethod lockMethodTable, LOCK *lock)
 			/* OK to waken */
 			GrantLock(lock, proc->waitProcLock, lockmode);
 			/* removes proc from the lock's waiting process queue */
-			ProcWakeup(proc, PROC_WAIT_STATUS_OK);
+			ProcWakeup(proc, PROC_WAIT_STATUS_OK, wakeups);
 		}
 		else
 		{
@@ -1685,6 +1688,7 @@ static void
 CheckDeadLock(void)
 {
 	int			i;
+	LatchGroup	wakeups = {0};
 
 	/*
 	 * Acquire exclusive lock on the entire shared lock data structures. Must
@@ -1719,7 +1723,7 @@ CheckDeadLock(void)
 #endif
 
 	/* Run the deadlock check, and set deadlock_state for use by ProcSleep */
-	deadlock_state = DeadLockCheck(MyProc);
+	deadlock_state = DeadLockCheck(MyProc, &wakeups);
 
 	if (deadlock_state == DS_HARD_DEADLOCK)
 	{
@@ -1736,7 +1740,8 @@ CheckDeadLock(void)
 		 * return from the signal handler.
 		 */
 		Assert(MyProc->waitLock != NULL);
-		RemoveFromWaitQueue(MyProc, LockTagHashCode(&(MyProc->waitLock->tag)));
+		RemoveFromWaitQueue(MyProc, LockTagHashCode(&(MyProc->waitLock->tag)),
+							&wakeups);
 
 		/*
 		 * We're done here.  Transaction abort caused by the error that
@@ -1760,6 +1765,8 @@ CheckDeadLock(void)
 check_done:
 	for (i = NUM_LOCK_PARTITIONS; --i >= 0;)
 		LWLockRelease(LockHashPartitionLockByIndex(i));
+
+	SetLatches(&wakeups);
 }
 
 /*
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index 8575bea25c..b089bf341e 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -20,6 +20,7 @@
 
 #include "lib/ilist.h"
 #include "storage/backendid.h"
+#include "storage/latch.h"
 #include "storage/lockdefs.h"
 #include "storage/lwlock.h"
 #include "storage/shmem.h"
@@ -583,7 +584,7 @@ extern bool LockCheckConflicts(LockMethod lockMethodTable,
 							   LOCK *lock, PROCLOCK *proclock);
 extern void GrantLock(LOCK *lock, PROCLOCK *proclock, LOCKMODE lockmode);
 extern void GrantAwaitedLock(void);
-extern void RemoveFromWaitQueue(PGPROC *proc, uint32 hashcode);
+extern void RemoveFromWaitQueue(PGPROC *proc, uint32 hashcode, LatchGroup *wakeups);
 extern Size LockShmemSize(void);
 extern LockData *GetLockStatusData(void);
 extern BlockedProcsData *GetBlockerStatusData(int blocked_pid);
@@ -600,7 +601,7 @@ extern void lock_twophase_postabort(TransactionId xid, uint16 info,
 extern void lock_twophase_standby_recover(TransactionId xid, uint16 info,
 										  void *recdata, uint32 len);
 
-extern DeadLockState DeadLockCheck(PGPROC *proc);
+extern DeadLockState DeadLockCheck(PGPROC *proc, LatchGroup *wakeups);
 extern PGPROC *GetBlockingAutoVacuumPgproc(void);
 extern void DeadLockReport(void) pg_attribute_noreturn();
 extern void RememberSimpleDeadLock(PGPROC *proc1,
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 56a31d2a89..6944927222 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -451,9 +451,9 @@ extern int	GetStartupBufferPinWaitBufId(void);
 extern bool HaveNFreeProcs(int n, int *nfree);
 extern void ProcReleaseLocks(bool isCommit);
 
-extern ProcWaitStatus ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable);
-extern void ProcWakeup(PGPROC *proc, ProcWaitStatus waitStatus);
-extern void ProcLockWakeup(LockMethod lockMethodTable, LOCK *lock);
+extern ProcWaitStatus ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable, LatchGroup *wakeups);
+extern void ProcWakeup(PGPROC *proc, ProcWaitStatus waitStatus, LatchGroup *wakeups);
+extern void ProcLockWakeup(LockMethod lockMethodTable, LOCK *lock, LatchGroup *wakeups);
 extern void CheckDeadLockAlert(void);
 extern bool IsWaitingForLock(void);
 extern void LockErrorCleanup(void);
-- 
2.41.0

v4-0001-Allow-palloc_extended-NO_OOM-in-critical-sections.patchapplication/x-patch; name=v4-0001-Allow-palloc_extended-NO_OOM-in-critical-sections.patchDownload
From fd907df99388e3694d7ea31d3ebe0dea8acbc742 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 1 Nov 2022 23:29:38 +1300
Subject: [PATCH v4 1/6] Allow palloc_extended(NO_OOM) in critical sections.

Commit 4a170ee9e0e banned palloc() and similar in critical sections, because an
allocation failure would produce a panic.  Make an exception for allocations
that return NULL on failure, for code that has a backup plan.

Discussion: https://postgr.es/m/CA%2BhUKGJHudZMG_vh6GiPB61pE%2BGgiBk5jxzd7inijqx5nEZLCw%40mail.gmail.com
---
 src/backend/utils/mmgr/mcxt.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/src/backend/utils/mmgr/mcxt.c b/src/backend/utils/mmgr/mcxt.c
index 9fc83f11f6..440c8e684e 100644
--- a/src/backend/utils/mmgr/mcxt.c
+++ b/src/backend/utils/mmgr/mcxt.c
@@ -1139,7 +1139,8 @@ MemoryContextAllocExtended(MemoryContext context, Size size, int flags)
 	void	   *ret;
 
 	Assert(MemoryContextIsValid(context));
-	AssertNotInCriticalSection(context);
+	if ((flags & MCXT_ALLOC_NO_OOM) == 0)
+		AssertNotInCriticalSection(context);
 
 	if (!((flags & MCXT_ALLOC_HUGE) != 0 ? AllocHugeSizeIsValid(size) :
 		  AllocSizeIsValid(size)))
@@ -1294,7 +1295,8 @@ palloc_extended(Size size, int flags)
 	MemoryContext context = CurrentMemoryContext;
 
 	Assert(MemoryContextIsValid(context));
-	AssertNotInCriticalSection(context);
+	if ((flags & MCXT_ALLOC_NO_OOM) == 0)
+		AssertNotInCriticalSection(context);
 
 	if (!((flags & MCXT_ALLOC_HUGE) != 0 ? AllocHugeSizeIsValid(size) :
 		  AllocSizeIsValid(size)))
@@ -1529,7 +1531,8 @@ repalloc_extended(void *pointer, Size size, int flags)
 		  AllocSizeIsValid(size)))
 		elog(ERROR, "invalid memory alloc request size %zu", size);
 
-	AssertNotInCriticalSection(context);
+	if ((flags & MCXT_ALLOC_NO_OOM) == 0)
+		AssertNotInCriticalSection(context);
 
 	/* isReset must be false already */
 	Assert(!context->isReset);
-- 
2.41.0

v4-0006-Don-t-re-acquire-LockManager-partition-lock-after.patchapplication/x-patch; name=v4-0006-Don-t-re-acquire-LockManager-partition-lock-after.patchDownload
From d43e82290028ef61e62d55627cc5e602dd20c27d Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Thu, 27 Oct 2022 10:56:39 +1300
Subject: [PATCH v4 6/6] Don't re-acquire LockManager partition lock after
 wakeup.

Change the contract of WaitOnLock() and ProcSleep() to return with the
partition lock released.  ProcSleep() re-acquired the lock to prevent
interrupts, which was a problem at the time the ancestor of that code
was committed in fe548629c50, because then we had signal handlers that
longjmp'd out of there.  Now, that can happen only at points where we
explicitly run CHECK_FOR_INTERRUPTS(), so we can remove that re-acquisition,
which leads to the idea of making the function always release the lock.

While an earlier commit fixed the problem that the backend woke us up
before it had even released the lock itself, this lock reacquisition
point was still contended when multiple backends woke at the same time.

XXX Right?  Or what am I missing?

Discussion: https://postgr.es/m/CA%2BhUKGKmO7ze0Z6WXKdrLxmvYa%3DzVGGXOO30MMktufofVwEm1A%40mail.gmail.com
---
 src/backend/storage/lmgr/lock.c | 18 ++++++++++++------
 src/backend/storage/lmgr/proc.c | 20 ++++++++++++--------
 2 files changed, 24 insertions(+), 14 deletions(-)

diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 126073a0d3..207abf93d0 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -775,6 +775,7 @@ LockAcquireExtended(const LOCKTAG *locktag,
 	LWLock	   *partitionLock;
 	bool		found_conflict;
 	bool		log_lock = false;
+	bool		partitionLocked = false;
 	LatchGroup	wakeups = {0};
 
 	if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
@@ -976,6 +977,7 @@ LockAcquireExtended(const LOCKTAG *locktag,
 	partitionLock = LockHashPartitionLock(hashcode);
 
 	LWLockAcquire(partitionLock, LW_EXCLUSIVE);
+	partitionLocked = true;
 
 	/*
 	 * Find or create lock and proclock entries with this tag
@@ -1080,7 +1082,10 @@ LockAcquireExtended(const LOCKTAG *locktag,
 										 locktag->locktag_type,
 										 lockmode);
 
+		Assert(LWLockHeldByMeInMode(partitionLock, LW_EXCLUSIVE));
 		WaitOnLock(locallock, owner, &wakeups);
+		Assert(!LWLockHeldByMeInMode(partitionLock, LW_EXCLUSIVE));
+		partitionLocked = false;
 
 		TRACE_POSTGRESQL_LOCK_WAIT_DONE(locktag->locktag_field1,
 										locktag->locktag_field2,
@@ -1105,7 +1110,6 @@ LockAcquireExtended(const LOCKTAG *locktag,
 			PROCLOCK_PRINT("LockAcquire: INCONSISTENT", proclock);
 			LOCK_PRINT("LockAcquire: INCONSISTENT", lock, lockmode);
 			/* Should we retry ? */
-			LWLockRelease(partitionLock);
 			elog(ERROR, "LockAcquire failed");
 		}
 		PROCLOCK_PRINT("LockAcquire: granted", proclock);
@@ -1118,9 +1122,11 @@ LockAcquireExtended(const LOCKTAG *locktag,
 	 */
 	FinishStrongLockAcquire();
 
-	LWLockRelease(partitionLock);
-
-	SetLatches(&wakeups);
+	if (partitionLocked)
+	{
+		LWLockRelease(partitionLock);
+		SetLatches(&wakeups);
+	}
 
 	/*
 	 * Emit a WAL record if acquisition of this lock needs to be replayed in a
@@ -1782,7 +1788,8 @@ MarkLockClear(LOCALLOCK *locallock)
  * Caller must have set MyProc->heldLocks to reflect locks already held
  * on the lockable object by this process.
  *
- * The appropriate partition lock must be held at entry.
+ * The appropriate partition lock must be held at entry.  It is not held on
+ * exit.
  */
 static void
 WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner, LatchGroup *wakeups)
@@ -1827,7 +1834,6 @@ WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner, LatchGroup *wakeups)
 			awaitedLock = NULL;
 			LOCK_PRINT("WaitOnLock: aborting on lock",
 					   locallock->lock, locallock->tag.mode);
-			LWLockRelease(LockHashPartitionLock(locallock->hashcode));
 
 			/*
 			 * Now that we aren't holding the partition lock, we can give an
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 8821517897..2c5a6ebe83 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -992,7 +992,7 @@ AuxiliaryPidGetProc(int pid)
  * Caller must have set MyProc->heldLocks to reflect locks already held
  * on the lockable object by this process (under all XIDs).
  *
- * The lock table's partition lock must be held at entry, and will be held
+ * The lock table's partition lock must be held at entry, and will be released
  * at exit.
  *
  * Result: PROC_WAIT_STATUS_OK if we acquired the lock, PROC_WAIT_STATUS_ERROR if not (deadlock).
@@ -1020,6 +1020,12 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable, LatchGroup *wakeups)
 	ProcWaitStatus myWaitStatus;
 	PGPROC	   *leader = MyProc->lockGroupLeader;
 
+	/*
+	 * Every way out of this function will release the partition lock and send
+	 * buffered latch wakeups.
+	 */
+	Assert(LWLockHeldByMeInMode(partitionLock, LW_EXCLUSIVE));
+
 	/*
 	 * If group locking is in use, locks held by members of my locking group
 	 * need to be included in myHeldLocks.  This is not required for relation
@@ -1104,6 +1110,8 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable, LatchGroup *wakeups)
 					/* Skip the wait and just grant myself the lock. */
 					GrantLock(lock, proclock, lockmode);
 					GrantAwaitedLock();
+					LWLockRelease(partitionLock);
+					SetLatches(wakeups);
 					return PROC_WAIT_STATUS_OK;
 				}
 
@@ -1140,6 +1148,8 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable, LatchGroup *wakeups)
 	if (early_deadlock)
 	{
 		RemoveFromWaitQueue(MyProc, hashcode, wakeups);
+		LWLockRelease(partitionLock);
+		SetLatches(wakeups);
 		return PROC_WAIT_STATUS_ERROR;
 	}
 
@@ -1469,6 +1479,7 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable, LatchGroup *wakeups)
 			}
 
 			LWLockRelease(partitionLock);
+			SetLatches(wakeups);
 
 			if (deadlock_state == DS_SOFT_DEADLOCK)
 				ereport(LOG,
@@ -1570,13 +1581,6 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable, LatchGroup *wakeups)
 							standbyWaitStart, GetCurrentTimestamp(),
 							NULL, false);
 
-	/*
-	 * Re-acquire the lock table's partition lock.  We have to do this to hold
-	 * off cancel/die interrupts before we can mess with lockAwaited (else we
-	 * might have a missed or duplicated locallock update).
-	 */
-	LWLockAcquire(partitionLock, LW_EXCLUSIVE);
-
 	/*
 	 * We no longer want LockErrorCleanup to do anything.
 	 */
-- 
2.41.0

#7Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Thomas Munro (#1)
Re: Latches vs lwlock contention

On 28/10/2022 06:56, Thomas Munro wrote:

One example is heavyweight lock wakeups. If you run BEGIN; LOCK TABLE
t; ... and then N other sessions wait in SELECT * FROM t;, and then
you run ... COMMIT;, you'll see the first session wake all the others
while it still holds the partition lock itself. They'll all wake up
and begin to re-acquire the same partition lock in exclusive mode,
immediately go back to sleep on *that* wait list, and then wake each
other up one at a time in a chain. We could avoid the first
double-bounce by not setting the latches until after we've released
the partition lock. We could avoid the rest of them by not
re-acquiring the partition lock at all, which ... if I'm reading right
... shouldn't actually be necessary in modern PostgreSQL? Or if there
is another reason to re-acquire then maybe the comment should be
updated.

ISTM that the change to not re-acquire the lock in ProcSleep is
independent of the other changes. Let's split that off into a separate
patch.

I agree it should be safe. Acquiring a lock just to hold off interrupts
is overkill anyway; HOLD_INTERRUPTS() would be enough.
LockErrorCleanup() uses HOLD_INTERRUPTS() already.

There are no CHECK_FOR_INTERRUPTS() in GrantAwaitedLock(), so cancel/die
interrupts can't happen here. But could we add HOLD_INTERRUPTS(), just
pro forma, to document the assumption? It's a little awkward: you really
should hold interrupts until the caller has done "awaitedLock = NULL;".
So it's not quite enough to add a pair of HOLD_ and RESUME_INTERRUPTS()
at the end of ProcSleep(). You'd need to do the HOLD_INTERRUPTS() in
ProcSleep() and require the caller to do RESUME_INTERRUPTS(). In a
sense, ProcSleep downgrades the lock on the partition to just holding
off interrupts.
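
To make that concrete, here is a rough sketch of the asymmetric
arrangement I mean (hypothetical code, not taken from the attached
patches):

	/* At the end of ProcSleep(), instead of re-acquiring the partition lock: */
	HOLD_INTERRUPTS();		/* hold off cancel/die on behalf of the caller */
	if (MyProc->waitStatus == PROC_WAIT_STATUS_OK)
		GrantAwaitedLock();
	return MyProc->waitStatus;

	/* ... and in the caller (WaitOnLock), once ProcSleep() has returned: */
	awaitedLock = NULL;
	RESUME_INTERRUPTS();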

Overall +1 on this change to not re-acquire the partition lock.

--
Heikki Linnakangas
Neon (https://neon.tech)

#8Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Heikki Linnakangas (#7)
8 attachment(s)
Re: Latches vs lwlock contention

On 28/09/2023 12:58, Heikki Linnakangas wrote:

On 28/10/2022 06:56, Thomas Munro wrote:

One example is heavyweight lock wakeups. If you run BEGIN; LOCK TABLE
t; ... and then N other sessions wait in SELECT * FROM t;, and then
you run ... COMMIT;, you'll see the first session wake all the others
while it still holds the partition lock itself. They'll all wake up
and begin to re-acquire the same partition lock in exclusive mode,
immediately go back to sleep on *that* wait list, and then wake each
other up one at a time in a chain. We could avoid the first
double-bounce by not setting the latches until after we've released
the partition lock. We could avoid the rest of them by not
re-acquiring the partition lock at all, which ... if I'm reading right
... shouldn't actually be necessary in modern PostgreSQL? Or if there
is another reason to re-acquire then maybe the comment should be
updated.

ISTM that the change to not re-acquire the lock in ProcSleep is
independent of the other changes. Let's split that off into a separate
patch.

I spent some time on splitting that off. I had to start from scratch,
because commit 2346df6fc373df9c5ab944eebecf7d3036d727de conflicted
heavily with your patch.

I split ProcSleep() into two functions: JoinWaitQueue does the first
part of ProcSleep(), adding the process to the wait queue and checking
for the dontWait and early deadlock cases. What remains in ProcSleep()
does just the sleeping part. JoinWaitQueue is called with the partition
lock held, and ProcSleep() is called without it. This way, the partition
lock is acquired and released in the same function
(LockAcquireExtended), avoiding awkward "lock is held on enter, but
might be released on exit depending on the outcome" logic.
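
The resulting control flow in LockAcquireExtended() ends up looking
roughly like this (heavily simplified; the real thing is in patch 0008):

	LWLockAcquire(partitionLock, LW_EXCLUSIVE);
	/* ... find or create the LOCK and PROCLOCK entries ... */
	waitResult = JoinWaitQueue(locallock, lockMethodTable, dontWait);
	if (waitResult == PROC_WAIT_STATUS_ERROR)
	{
		/* early deadlock or dontWait: undo the shared entries, release, bail */
	}
	else if (waitResult == PROC_WAIT_STATUS_WAITING)
	{
		/* queued; sleep without holding the partition lock */
		LWLockRelease(partitionLock);
		waitResult = WaitOnLock(locallock, owner);	/* calls ProcSleep() */
	}
	else
		LWLockRelease(partitionLock);	/* lock was granted immediately */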

This is actually a set of 8 patches. The first 7 are independent tiny
fixes and refactorings in these functions. See individual commit messages.

I agree it should be safe. Acquiring a lock just to hold off interrupts
is overkill anyway; HOLD_INTERRUPTS() would be enough.
LockErrorCleanup() uses HOLD_INTERRUPTS() already.

There are no CHECK_FOR_INTERRUPTS() in GrantAwaitedLock(), so cancel/die
interrupts can't happen here. But could we add HOLD_INTERRUPTS(), just
pro forma, to document the assumption? It's a little awkward: you really
should hold interrupts until the caller has done "awaitedLock = NULL;".
So it's not quite enough to add a pair of HOLD_ and RESUME_INTERRUPTS()
at the end of ProcSleep(). You'd need to do the HOLD_INTERRUPTS() in
ProcSleep() and require the caller to do RESUME_INTERRUPTS(). In a
sense, ProcSleep downgrades the lock on the partition to just holding
off interrupts.

I didn't use HOLD/RESUME_INTERRUPTS() after all. Like you did, I'm just
relying on the fact that there are no CHECK_FOR_INTERRUPTS() calls in
places where they might cause trouble. Those sections are short, so I
think it's fine.

--
Heikki Linnakangas
Neon (https://neon.tech)

Attachments:

0001-Remove-LOCK_PRINT-call-that-could-point-to-garbage.patchtext/x-patch; charset=UTF-8; name=0001-Remove-LOCK_PRINT-call-that-could-point-to-garbage.patchDownload
From 989069eaca788cfd61010e5f88c0feb359b40180 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Mon, 22 Jul 2024 01:21:20 +0300
Subject: [PATCH 1/8] Remove LOCK_PRINT() call that could point to garbage

If a deadlock is detected in ProcSleep, it removes the process from
the wait queue and the shared LOCK and PROCLOCK entries. As soon as it
does that, the other process holding the lock might release it, and
remove the LOCK entry altogether. So on return from ProcSleep(), it's
possible that the LOCK entry is no longer valid.
---
 src/backend/storage/lmgr/lock.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 0400a507779..05b3a243b5c 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -1861,8 +1861,6 @@ WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner, bool dontWait)
 			 * now.
 			 */
 			awaitedLock = NULL;
-			LOCK_PRINT("WaitOnLock: aborting on lock",
-					   locallock->lock, locallock->tag.mode);
 			LWLockRelease(LockHashPartitionLock(locallock->hashcode));
 
 			/*
-- 
2.39.2

0002-Fix-comment-in-LockReleaseAll-on-when-locallock-nLoc.patchtext/x-patch; charset=UTF-8; name=0002-Fix-comment-in-LockReleaseAll-on-when-locallock-nLoc.patchDownload
From 22bc7a3b7826dfbb991b5d63ad1fba52bb297923 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Mon, 22 Jul 2024 00:43:13 +0300
Subject: [PATCH 2/8] Fix comment in LockReleaseAll() on when locallock->nLock
 can be zero

We reach this case also e.g. when a deadlock is detected, not only
when we run out of memory.
---
 src/backend/storage/lmgr/lock.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 05b3a243b5c..7eb5c2e5226 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -2208,9 +2208,8 @@ LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
 	while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
 	{
 		/*
-		 * If the LOCALLOCK entry is unused, we must've run out of shared
-		 * memory while trying to set up this lock.  Just forget the local
-		 * entry.
+		 * If the LOCALLOCK entry is unused, something must've gone wrong
+		 * while trying to acquire this lock.  Just forget the local entry.
 		 */
 		if (locallock->nLocks == 0)
 		{
-- 
2.39.2

0003-Set-MyProc-heldLocks-in-ProcSleep.patchtext/x-patch; charset=UTF-8; name=0003-Set-MyProc-heldLocks-in-ProcSleep.patchDownload
From ae8d12bab05ede48166d15312b222e046aae14a0 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Mon, 22 Jul 2024 13:23:59 +0300
Subject: [PATCH 3/8] Set MyProc->heldLocks in ProcSleep

Previously, the ProcSleep() caller was responsible for setting it, and
we had comments to remind about that. But it seems simpler to make
ProcSleep() itself responsible for it. ProcSleep() already sets the
other info about the lock it's waiting for (waitLock, waitProcLock and
waitLockMode), so it seems natural for it to set heldLocks too.
---
 src/backend/storage/lmgr/lock.c |  8 --------
 src/backend/storage/lmgr/proc.c | 10 ++++++----
 2 files changed, 6 insertions(+), 12 deletions(-)

diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 7eb5c2e5226..489073259b6 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -1047,11 +1047,6 @@ LockAcquireExtended(const LOCKTAG *locktag,
 	}
 	else
 	{
-		/*
-		 * Set bitmask of locks this process already holds on this object.
-		 */
-		MyProc->heldLocks = proclock->holdMask;
-
 		/*
 		 * Sleep till someone wakes me up. We do this even in the dontWait
 		 * case, because while trying to go to sleep, we may discover that we
@@ -1808,9 +1803,6 @@ MarkLockClear(LOCALLOCK *locallock)
 /*
  * WaitOnLock -- wait to acquire a lock
  *
- * Caller must have set MyProc->heldLocks to reflect locks already held
- * on the lockable object by this process.
- *
  * The appropriate partition lock must be held at entry, and will still be
  * held at exit.
  */
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 1b23efb26f3..5c1d2792c3d 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -1040,9 +1040,6 @@ AuxiliaryPidGetProc(int pid)
 /*
  * ProcSleep -- put a process to sleep on the specified lock
  *
- * Caller must have set MyProc->heldLocks to reflect locks already held
- * on the lockable object by this process (under all XIDs).
- *
  * It's not actually guaranteed that we need to wait when this function is
  * called, because it could be that when we try to find a position at which
  * to insert ourself into the wait queue, we discover that we must be inserted
@@ -1072,7 +1069,7 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable, bool dontWait)
 	LWLock	   *partitionLock = LockHashPartitionLock(hashcode);
 	dclist_head *waitQueue = &lock->waitProcs;
 	PGPROC	   *insert_before = NULL;
-	LOCKMASK	myHeldLocks = MyProc->heldLocks;
+	LOCKMASK	myHeldLocks;
 	TimestampTz standbyWaitStart = 0;
 	bool		early_deadlock = false;
 	bool		allow_autovacuum_cancel = true;
@@ -1080,6 +1077,11 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable, bool dontWait)
 	ProcWaitStatus myWaitStatus;
 	PGPROC	   *leader = MyProc->lockGroupLeader;
 
+	/*
+	 * Set bitmask of locks this process already holds on this object.
+	 */
+	myHeldLocks = MyProc->heldLocks = proclock->holdMask;
+
 	/*
 	 * If group locking is in use, locks held by members of my locking group
 	 * need to be included in myHeldLocks.  This is not required for relation
-- 
2.39.2

0004-Move-TRACE-calls-into-WaitOnLock.patchtext/x-patch; charset=UTF-8; name=0004-Move-TRACE-calls-into-WaitOnLock.patchDownload
From 05b8b4a7222795377f0a967b6408e61aa6b8c62f Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Mon, 22 Jul 2024 13:12:57 +0300
Subject: [PATCH 4/8] Move TRACE calls into WaitOnLock()

LockAcquire is a long and complex function. Pushing more stuff to its
subroutines makes it a little more manageable.
---
 src/backend/storage/lmgr/lock.c | 29 ++++++++++++++---------------
 1 file changed, 14 insertions(+), 15 deletions(-)

diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 489073259b6..bac950efdf3 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -1052,23 +1052,8 @@ LockAcquireExtended(const LOCKTAG *locktag,
 		 * case, because while trying to go to sleep, we may discover that we
 		 * can acquire the lock immediately after all.
 		 */
-
-		TRACE_POSTGRESQL_LOCK_WAIT_START(locktag->locktag_field1,
-										 locktag->locktag_field2,
-										 locktag->locktag_field3,
-										 locktag->locktag_field4,
-										 locktag->locktag_type,
-										 lockmode);
-
 		WaitOnLock(locallock, owner, dontWait);
 
-		TRACE_POSTGRESQL_LOCK_WAIT_DONE(locktag->locktag_field1,
-										locktag->locktag_field2,
-										locktag->locktag_field3,
-										locktag->locktag_field4,
-										locktag->locktag_type,
-										lockmode);
-
 		/*
 		 * NOTE: do not do any material change of state between here and
 		 * return.  All required changes in locktable state must have been
@@ -1812,6 +1797,13 @@ WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner, bool dontWait)
 	LOCKMETHODID lockmethodid = LOCALLOCK_LOCKMETHOD(*locallock);
 	LockMethod	lockMethodTable = LockMethods[lockmethodid];
 
+	TRACE_POSTGRESQL_LOCK_WAIT_START(locallock->tag.lock.locktag_field1,
+									 locallock->tag.lock.locktag_field2,
+									 locallock->tag.lock.locktag_field3,
+									 locallock->tag.lock.locktag_field4,
+									 locallock->tag.lock.locktag_type,
+									 locallock->tag.mode);
+
 	LOCK_PRINT("WaitOnLock: sleeping on lock",
 			   locallock->lock, locallock->tag.mode);
 
@@ -1882,6 +1874,13 @@ WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner, bool dontWait)
 
 	LOCK_PRINT("WaitOnLock: wakeup on lock",
 			   locallock->lock, locallock->tag.mode);
+
+	TRACE_POSTGRESQL_LOCK_WAIT_DONE(locallock->tag.lock.locktag_field1,
+									locallock->tag.lock.locktag_field2,
+									locallock->tag.lock.locktag_field3,
+									locallock->tag.lock.locktag_field4,
+									locallock->tag.lock.locktag_type,
+									locallock->tag.mode);
 }
 
 /*
-- 
2.39.2

0005-Remove-redundant-lockAwaited-global-variable.patchtext/x-patch; charset=UTF-8; name=0005-Remove-redundant-lockAwaited-global-variable.patchDownload
From 6f70b1f3fad97a89725706ee974ad2665471c6d6 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Mon, 22 Jul 2024 17:23:33 +0300
Subject: [PATCH 5/8] Remove redundant lockAwaited global variable

We had two global (static) variables to hold the LOCALLOCK that we are
currently acquiring: awaitedLock in lock.c and lockAwaited in
proc.c. They were set at slightly different times, but when they were
both set, they were the same thing. Remove the lockAwaited variable
and rely on awaitedLock everywhere.
---
 src/backend/storage/lmgr/lock.c |  9 +++++++++
 src/backend/storage/lmgr/proc.c | 30 ++++--------------------------
 src/backend/tcop/postgres.c     |  2 +-
 src/include/storage/lock.h      |  2 ++
 src/include/storage/proc.h      |  1 -
 5 files changed, 16 insertions(+), 28 deletions(-)

diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index bac950efdf3..9035145169e 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -1771,6 +1771,12 @@ GrantAwaitedLock(void)
 	GrantLockLocal(awaitedLock, awaitedOwner);
 }
 
+LOCALLOCK *
+GetAwaitedLock(void)
+{
+	return awaitedLock;
+}
+
 /*
  * MarkLockClear -- mark an acquired lock as "clear"
  *
@@ -1867,6 +1873,9 @@ WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner, bool dontWait)
 	}
 	PG_END_TRY();
 
+	/*
+	 * We no longer want LockErrorCleanup to do anything.
+	 */
 	awaitedLock = NULL;
 
 	/* reset ps display to remove the suffix */
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 5c1d2792c3d..7a207d0104d 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -79,9 +79,6 @@ PROC_HDR   *ProcGlobal = NULL;
 NON_EXEC_STATIC PGPROC *AuxiliaryProcs = NULL;
 PGPROC	   *PreparedXactProcs = NULL;
 
-/* If we are waiting for a lock, this points to the associated LOCALLOCK */
-static LOCALLOCK *lockAwaited = NULL;
-
 static DeadLockState deadlock_state = DS_NOT_YET_CHECKED;
 
 /* Is a deadlock check pending? */
@@ -706,18 +703,6 @@ HaveNFreeProcs(int n, int *nfree)
 	return (*nfree == n);
 }
 
-/*
- * Check if the current process is awaiting a lock.
- */
-bool
-IsWaitingForLock(void)
-{
-	if (lockAwaited == NULL)
-		return false;
-
-	return true;
-}
-
 /*
  * Cancel any pending wait for lock, when aborting a transaction, and revert
  * any strong lock count acquisition for a lock being acquired.
@@ -729,6 +714,7 @@ IsWaitingForLock(void)
 void
 LockErrorCleanup(void)
 {
+	LOCALLOCK  *lockAwaited;
 	LWLock	   *partitionLock;
 	DisableTimeoutParams timeouts[2];
 
@@ -737,6 +723,7 @@ LockErrorCleanup(void)
 	AbortStrongLockAcquire();
 
 	/* Nothing to do if we weren't waiting for a lock */
+	lockAwaited = GetAwaitedLock();
 	if (lockAwaited == NULL)
 	{
 		RESUME_INTERRUPTS();
@@ -1212,9 +1199,6 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable, bool dontWait)
 		return PROC_WAIT_STATUS_ERROR;
 	}
 
-	/* mark that we are waiting for a lock */
-	lockAwaited = locallock;
-
 	/*
 	 * Release the lock table's partition lock.
 	 *
@@ -1639,17 +1623,11 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable, bool dontWait)
 							NULL, false);
 
 	/*
-	 * Re-acquire the lock table's partition lock.  We have to do this to hold
-	 * off cancel/die interrupts before we can mess with lockAwaited (else we
-	 * might have a missed or duplicated locallock update).
+	 * Re-acquire the lock table's partition lock, because the caller expects
+	 * us to still hold it.
 	 */
 	LWLockAcquire(partitionLock, LW_EXCLUSIVE);
 
-	/*
-	 * We no longer want LockErrorCleanup to do anything.
-	 */
-	lockAwaited = NULL;
-
 	/*
 	 * If we got the lock, be sure to remember it in the locallock table.
 	 */
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index ea517f4b9bb..fce65c7313e 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3052,7 +3052,7 @@ ProcessRecoveryConflictInterrupt(ProcSignalReason reason)
 			/*
 			 * If we aren't waiting for a lock we can never deadlock.
 			 */
-			if (!IsWaitingForLock())
+			if (GetAwaitedLock() == NULL)
 				return;
 
 			/* Intentional fall through to check wait for pin */
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index cc1f6e78c39..45752ac1408 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -583,6 +583,8 @@ extern bool LockCheckConflicts(LockMethod lockMethodTable,
 							   LOCK *lock, PROCLOCK *proclock);
 extern void GrantLock(LOCK *lock, PROCLOCK *proclock, LOCKMODE lockmode);
 extern void GrantAwaitedLock(void);
+extern LOCALLOCK *GetAwaitedLock(void);
+
 extern void RemoveFromWaitQueue(PGPROC *proc, uint32 hashcode);
 extern Size LockShmemSize(void);
 extern LockData *GetLockStatusData(void);
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 7d3fc2bfa60..71e05727e54 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -476,7 +476,6 @@ extern ProcWaitStatus ProcSleep(LOCALLOCK *locallock,
 extern void ProcWakeup(PGPROC *proc, ProcWaitStatus waitStatus);
 extern void ProcLockWakeup(LockMethod lockMethodTable, LOCK *lock);
 extern void CheckDeadLockAlert(void);
-extern bool IsWaitingForLock(void);
 extern void LockErrorCleanup(void);
 
 extern void ProcWaitForSignal(uint32 wait_event_info);
-- 
2.39.2

0006-Update-local-lock-table-in-ProcSleep-s-caller.patchtext/x-patch; charset=UTF-8; name=0006-Update-local-lock-table-in-ProcSleep-s-caller.patchDownload
From 7cb00c043d47490263aa67d2f1dfce9c40f02af0 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Mon, 22 Jul 2024 17:27:16 +0300
Subject: [PATCH 6/8] Update local lock table in ProcSleep's caller

ProcSleep is now responsible only for the shared state.  Seems a
little nicer that way.
---
 src/backend/storage/lmgr/lock.c | 4 +++-
 src/backend/storage/lmgr/proc.c | 7 -------
 2 files changed, 3 insertions(+), 8 deletions(-)

diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 9035145169e..8d1d57c0533 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -1043,7 +1043,6 @@ LockAcquireExtended(const LOCKTAG *locktag,
 	{
 		/* No conflict with held or previously requested locks */
 		GrantLock(lock, proclock, lockmode);
-		GrantLockLocal(locallock, owner);
 	}
 	else
 	{
@@ -1123,6 +1122,9 @@ LockAcquireExtended(const LOCKTAG *locktag,
 		LOCK_PRINT("LockAcquire: granted", lock, lockmode);
 	}
 
+	/* The lock was granted to us.  Update the local lock entry accordingly */
+	GrantLockLocal(locallock, owner);
+
 	/*
 	 * Lock state is fully up-to-date now; if we error out after this, no
 	 * special error cleanup is required.
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 7a207d0104d..8e74a04b932 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -1152,7 +1152,6 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable, bool dontWait)
 				{
 					/* Skip the wait and just grant myself the lock. */
 					GrantLock(lock, proclock, lockmode);
-					GrantAwaitedLock();
 					return PROC_WAIT_STATUS_OK;
 				}
 
@@ -1628,12 +1627,6 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable, bool dontWait)
 	 */
 	LWLockAcquire(partitionLock, LW_EXCLUSIVE);
 
-	/*
-	 * If we got the lock, be sure to remember it in the locallock table.
-	 */
-	if (MyProc->waitStatus == PROC_WAIT_STATUS_OK)
-		GrantAwaitedLock();
-
 	/*
 	 * We don't have to do anything else, because the awaker did all the
 	 * necessary update of the lock table and MyProc.
-- 
2.39.2

0007-Release-partition-lock-a-little-earlier.patchtext/x-patch; charset=UTF-8; name=0007-Release-partition-lock-a-little-earlier.patchDownload
From 94804a2380d18107040887fa424919eeac97ba90 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Mon, 22 Jul 2024 17:38:06 +0300
Subject: [PATCH 7/8] Release partition lock a little earlier

We don't need to hold the lock to prevent die/cancel interrupts, they
are only processed at explicit CHECK_FOR_INTERRUPTS() points now.
---
 src/backend/storage/lmgr/lock.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 8d1d57c0533..52f2867ffea 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -1043,6 +1043,7 @@ LockAcquireExtended(const LOCKTAG *locktag,
 	{
 		/* No conflict with held or previously requested locks */
 		GrantLock(lock, proclock, lockmode);
+		LWLockRelease(partitionLock);
 	}
 	else
 	{
@@ -1118,8 +1119,10 @@ LockAcquireExtended(const LOCKTAG *locktag,
 				elog(ERROR, "LockAcquire failed");
 			}
 		}
+
 		PROCLOCK_PRINT("LockAcquire: granted", proclock);
 		LOCK_PRINT("LockAcquire: granted", lock, lockmode);
+		LWLockRelease(partitionLock);
 	}
 
 	/* The lock was granted to us.  Update the local lock entry accordingly */
@@ -1131,8 +1134,6 @@ LockAcquireExtended(const LOCKTAG *locktag,
 	 */
 	FinishStrongLockAcquire();
 
-	LWLockRelease(partitionLock);
-
 	/*
 	 * Emit a WAL record if acquisition of this lock needs to be replayed in a
 	 * standby server.
-- 
2.39.2

0008-Split-ProcSleep-function-into-JoinWaitQueue-and-Proc.patchtext/x-patch; charset=UTF-8; name=0008-Split-ProcSleep-function-into-JoinWaitQueue-and-Proc.patchDownload
From 0df6db8607c5d10d14ae4b7bbc5077ad394b4e6b Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Mon, 22 Jul 2024 21:53:18 +0300
Subject: [PATCH 8/8] Split ProcSleep function into JoinWaitQueue and ProcSleep

Split ProcSleep into two functions: JoinWaitQueue and ProcSleep.
JoinWaitQueue is called while holding the partition lock, and inserts
the current process into the wait queue, while ProcSleep() does the
actual sleeping. ProcSleep() is now called without holding the
partition lock, and it no longer re-acquires the partition lock before
returning. That makes the wakeup a little cheaper. Once upon a time,
re-acquiring the partition lock was done to prevent a signal handler
from longjmping out at a bad time, but nowadays our signal handlers
just set flags, and longjmping can only happen at points where we
explicitly run CHECK_FOR_INTERRUPTS().

If JoinWaitQueue detects an "early deadlock" before even joining the
wait queue, it returns without changing the shared lock entry, leaving
the cleanup of the shared lock entry to the caller. This makes early
deadlock behave the same as the dontWait=true case.

One small side-effect of this refactoring is that we now only set the
'ps' title to say "waiting" when we actually enter the sleep, not when
the lock is skipped because dontWait=true, or when a deadlock is
detected early before entering the sleep.

Based on Thomas Munro's earlier patch and observation that ProcSleep
doesn't really need to re-acquire the partition lock.

Discussion: https://postgr.es/m/CA%2BhUKGKmO7ze0Z6WXKdrLxmvYa%3DzVGGXOO30MMktufofVwEm1A%40mail.gmail.com
---
 src/backend/storage/lmgr/lock.c | 182 ++++++++++++++++----------------
 src/backend/storage/lmgr/proc.c | 107 +++++++++++--------
 src/include/storage/proc.h      |   6 +-
 3 files changed, 156 insertions(+), 139 deletions(-)

diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 52f2867ffea..c51c98f2665 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -360,8 +360,7 @@ static PROCLOCK *SetupLockInTable(LockMethod lockMethodTable, PGPROC *proc,
 static void GrantLockLocal(LOCALLOCK *locallock, ResourceOwner owner);
 static void BeginStrongLockAcquire(LOCALLOCK *locallock, uint32 fasthashcode);
 static void FinishStrongLockAcquire(void);
-static void WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner,
-					   bool dontWait);
+static ProcWaitStatus WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner);
 static void ReleaseLockIfHeld(LOCALLOCK *locallock, bool sessionLock);
 static void LockReassignOwner(LOCALLOCK *locallock, ResourceOwner parent);
 static bool UnGrantLock(LOCK *lock, LOCKMODE lockmode,
@@ -1047,85 +1046,105 @@ LockAcquireExtended(const LOCKTAG *locktag,
 	}
 	else
 	{
-		/*
-		 * Sleep till someone wakes me up. We do this even in the dontWait
-		 * case, because while trying to go to sleep, we may discover that we
-		 * can acquire the lock immediately after all.
-		 */
-		WaitOnLock(locallock, owner, dontWait);
+		ProcWaitStatus waitResult;
 
 		/*
-		 * NOTE: do not do any material change of state between here and
-		 * return.  All required changes in locktable state must have been
-		 * done when the lock was granted to us --- see notes in WaitOnLock.
+		 * Join the lock's wait queue.  We do this even in the dontWait case,
+		 * because while joining the queue, we may discover that we can
+		 * acquire the lock immediately after all.
 		 */
+		waitResult = JoinWaitQueue(locallock, lockMethodTable, dontWait);
 
-		/*
-		 * Check the proclock entry status. If dontWait = true, this is an
-		 * expected case; otherwise, it will only happen if something in the
-		 * ipc communication doesn't work correctly.
-		 */
-		if (!(proclock->holdMask & LOCKBIT_ON(lockmode)))
+		if (waitResult == PROC_WAIT_STATUS_ERROR)
 		{
+			/*
+			 * We're not getting the lock because a deadlock was detected
+			 * already while trying to join the wait queue, or because we
+			 * would have to wait but the caller requested no blocking.
+			 *
+			 * Undo the changes to shared entries before releasing the
+			 * partition lock.
+			 */
 			AbortStrongLockAcquire();
 
+			if (proclock->holdMask == 0)
+			{
+				uint32		proclock_hashcode;
+
+				proclock_hashcode = ProcLockHashCode(&proclock->tag,
+													 hashcode);
+				dlist_delete(&proclock->lockLink);
+				dlist_delete(&proclock->procLink);
+				if (!hash_search_with_hash_value(LockMethodProcLockHash,
+												 &(proclock->tag),
+												 proclock_hashcode,
+												 HASH_REMOVE,
+												 NULL))
+					elog(PANIC, "proclock table corrupted");
+			}
+			else
+				PROCLOCK_PRINT("LockAcquire: did not join wait queue", proclock);
+			lock->nRequested--;
+			lock->requested[lockmode]--;
+			LOCK_PRINT("LockAcquire: did not join wait queue",
+					   lock, lockmode);
+			Assert((lock->nRequested > 0) &&
+				   (lock->requested[lockmode] >= 0));
+			Assert(lock->nGranted <= lock->nRequested);
+			LWLockRelease(partitionLock);
+			if (locallock->nLocks == 0)
+				RemoveLocalLock(locallock);
+
 			if (dontWait)
 			{
-				/*
-				 * We can't acquire the lock immediately.  If caller specified
-				 * no blocking, remove useless table entries and return
-				 * LOCKACQUIRE_NOT_AVAIL without waiting.
-				 */
-				if (proclock->holdMask == 0)
-				{
-					uint32		proclock_hashcode;
-
-					proclock_hashcode = ProcLockHashCode(&proclock->tag,
-														 hashcode);
-					dlist_delete(&proclock->lockLink);
-					dlist_delete(&proclock->procLink);
-					if (!hash_search_with_hash_value(LockMethodProcLockHash,
-													 &(proclock->tag),
-													 proclock_hashcode,
-													 HASH_REMOVE,
-													 NULL))
-						elog(PANIC, "proclock table corrupted");
-				}
-				else
-					PROCLOCK_PRINT("LockAcquire: NOWAIT", proclock);
-				lock->nRequested--;
-				lock->requested[lockmode]--;
-				LOCK_PRINT("LockAcquire: conditional lock failed",
-						   lock, lockmode);
-				Assert((lock->nRequested > 0) &&
-					   (lock->requested[lockmode] >= 0));
-				Assert(lock->nGranted <= lock->nRequested);
-				LWLockRelease(partitionLock);
-				if (locallock->nLocks == 0)
-					RemoveLocalLock(locallock);
 				if (locallockp)
 					*locallockp = NULL;
 				return LOCKACQUIRE_NOT_AVAIL;
 			}
 			else
+			{
+				DeadLockReport();
+				/* DeadLockReport() will not return */
+			}
+		}
+
+		/*
+		 * We are now in the lock queue, or the lock was already granted.  If
+		 * queued, go to sleep.
+		 */
+		if (waitResult == PROC_WAIT_STATUS_WAITING)
+		{
+			Assert(!dontWait);
+			PROCLOCK_PRINT("LockAcquire: sleeping on lock", proclock);
+			LOCK_PRINT("LockAcquire: sleeping on lock", lock, lockmode);
+			LWLockRelease(partitionLock);
+
+			waitResult = WaitOnLock(locallock, owner);
+
+			/*
+			 * NOTE: do not do any material change of state between here and
+			 * return.  All required changes in locktable state must have been
+			 * done when the lock was granted to us --- see notes in WaitOnLock.
+			 */
+
+			if (waitResult == PROC_WAIT_STATUS_ERROR)
 			{
 				/*
-				 * We should have gotten the lock, but somehow that didn't
-				 * happen. If we get here, it's a bug.
+				 * We failed as a result of a deadlock, see CheckDeadLock(). Quit
+				 * now.
 				 */
-				PROCLOCK_PRINT("LockAcquire: INCONSISTENT", proclock);
-				LOCK_PRINT("LockAcquire: INCONSISTENT", lock, lockmode);
-				LWLockRelease(partitionLock);
-				elog(ERROR, "LockAcquire failed");
+				Assert(!dontWait);
+				DeadLockReport();
+				/* DeadLockReport() will not return */
 			}
 		}
-
-		PROCLOCK_PRINT("LockAcquire: granted", proclock);
-		LOCK_PRINT("LockAcquire: granted", lock, lockmode);
-		LWLockRelease(partitionLock);
+		else
+			LWLockRelease(partitionLock);
+		Assert(waitResult == PROC_WAIT_STATUS_OK);
 	}
 
 	/* The lock was granted to us.  Update the local lock entry accordingly */
+	Assert((proclock->holdMask & LOCKBIT_ON(lockmode)) != 0);
 	GrantLockLocal(locallock, owner);
 
 	/*
@@ -1797,14 +1816,12 @@ MarkLockClear(LOCALLOCK *locallock)
 /*
  * WaitOnLock -- wait to acquire a lock
  *
- * The appropriate partition lock must be held at entry, and will still be
- * held at exit.
+ * This is a wrapper around ProcSleep, with extra tracing and bookkeeping.
  */
-static void
-WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner, bool dontWait)
+static ProcWaitStatus
+WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner)
 {
-	LOCKMETHODID lockmethodid = LOCALLOCK_LOCKMETHOD(*locallock);
-	LockMethod	lockMethodTable = LockMethods[lockmethodid];
+	ProcWaitStatus result;
 
 	TRACE_POSTGRESQL_LOCK_WAIT_START(locallock->tag.lock.locktag_field1,
 									 locallock->tag.lock.locktag_field2,
@@ -1813,12 +1830,13 @@ WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner, bool dontWait)
 									 locallock->tag.lock.locktag_type,
 									 locallock->tag.mode);
 
-	LOCK_PRINT("WaitOnLock: sleeping on lock",
-			   locallock->lock, locallock->tag.mode);
-
 	/* adjust the process title to indicate that it's waiting */
 	set_ps_display_suffix("waiting");
 
+	/*
+	 * Record the fact that we are waiting for a lock, so that
+	 * LockErrorCleanup will clean up if cancel/die happens.
+	 */
 	awaitedLock = locallock;
 	awaitedOwner = owner;
 
@@ -1841,28 +1859,7 @@ WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner, bool dontWait)
 	 */
 	PG_TRY();
 	{
-		/*
-		 * If dontWait = true, we handle success and failure in the same way
-		 * here. The caller will be able to sort out what has happened.
-		 */
-		if (ProcSleep(locallock, lockMethodTable, dontWait) != PROC_WAIT_STATUS_OK
-			&& !dontWait)
-		{
-
-			/*
-			 * We failed as a result of a deadlock, see CheckDeadLock(). Quit
-			 * now.
-			 */
-			awaitedLock = NULL;
-			LWLockRelease(LockHashPartitionLock(locallock->hashcode));
-
-			/*
-			 * Now that we aren't holding the partition lock, we can give an
-			 * error report including details about the detected deadlock.
-			 */
-			DeadLockReport();
-			/* not reached */
-		}
+		result = ProcSleep(locallock);
 	}
 	PG_CATCH();
 	{
@@ -1884,15 +1881,14 @@ WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner, bool dontWait)
 	/* reset ps display to remove the suffix */
 	set_ps_display_remove_suffix();
 
-	LOCK_PRINT("WaitOnLock: wakeup on lock",
-			   locallock->lock, locallock->tag.mode);
-
 	TRACE_POSTGRESQL_LOCK_WAIT_DONE(locallock->tag.lock.locktag_field1,
 									locallock->tag.lock.locktag_field2,
 									locallock->tag.lock.locktag_field3,
 									locallock->tag.lock.locktag_field4,
 									locallock->tag.lock.locktag_type,
 									locallock->tag.mode);
+
+	return result;
 }
 
 /*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 8e74a04b932..6d0adbb8018 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -1025,7 +1025,7 @@ AuxiliaryPidGetProc(int pid)
 
 
 /*
- * ProcSleep -- put a process to sleep on the specified lock
+ * JoinWaitQueue -- join the wait queue on the specified lock
  *
  * It's not actually guaranteed that we need to wait when this function is
  * called, because it could be that when we try to find a position at which
@@ -1034,20 +1034,24 @@ AuxiliaryPidGetProc(int pid)
  * we get the lock immediately. Because of this, it's sensible for this function
  * to have a dontWait argument, despite the name.
  *
- * The lock table's partition lock must be held at entry, and will be held
- * at exit.
+ * On entry, the caller has already set up LOCK and PROCLOCK entries to
+ * reflect that we have "requested" the lock.  The caller is responsible for
+ * cleaning that up, if we end up not joining the queue after all.
  *
- * Result: PROC_WAIT_STATUS_OK if we acquired the lock, PROC_WAIT_STATUS_ERROR
- * if not (if dontWait = true, we would have had to wait; if dontWait = false,
- * this is a deadlock).
+ * The lock table's partition lock must be held at entry, and is still held
+ * at exit.  The caller must release it before calling ProcSleep().
  *
- * ASSUME: that no one will fiddle with the queue until after
- *		we release the partition lock.
+ * Result is one of the following:
+ *
+ *  PROC_WAIT_STATUS_OK       - lock was immediately granted
+ *  PROC_WAIT_STATUS_WAITING  - joined the wait queue; call ProcSleep()
+ *  PROC_WAIT_STATUS_ERROR    - immediate deadlock was detected, or would
+ *                              need to wait and dontWait == true
  *
  * NOTES: The process queue is now a priority queue for locking.
  */
 ProcWaitStatus
-ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable, bool dontWait)
+JoinWaitQueue(LOCALLOCK *locallock, LockMethod lockMethodTable, bool dontWait)
 {
 	LOCKMODE	lockmode = locallock->tag.mode;
 	LOCK	   *lock = locallock->lock;
@@ -1057,13 +1061,16 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable, bool dontWait)
 	dclist_head *waitQueue = &lock->waitProcs;
 	PGPROC	   *insert_before = NULL;
 	LOCKMASK	myHeldLocks;
-	TimestampTz standbyWaitStart = 0;
 	bool		early_deadlock = false;
-	bool		allow_autovacuum_cancel = true;
-	bool		logged_recovery_conflict = false;
-	ProcWaitStatus myWaitStatus;
 	PGPROC	   *leader = MyProc->lockGroupLeader;
 
+	Assert(LWLockHeldByMeInMode(partitionLock, LW_EXCLUSIVE));
+
+	/*
+	 * Set bitmask of locks this process already holds on this object.
+	 */
+	myHeldLocks = MyProc->heldLocks = proclock->holdMask;
+
 	/*
 	 * Set bitmask of locks this process already holds on this object.
 	 */
@@ -1164,6 +1171,13 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable, bool dontWait)
 		}
 	}
 
+	/*
+	 * If we detected deadlock, give up without waiting.  This must agree with
+	 * CheckDeadLock's recovery code.
+	 */
+	if (early_deadlock)
+		return PROC_WAIT_STATUS_ERROR;
+
 	/*
 	 * At this point we know that we'd really need to sleep. If we've been
 	 * commanded not to do that, bail out.
@@ -1188,31 +1202,43 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable, bool dontWait)
 
 	MyProc->waitStatus = PROC_WAIT_STATUS_WAITING;
 
-	/*
-	 * If we detected deadlock, give up without waiting.  This must agree with
-	 * CheckDeadLock's recovery code.
-	 */
-	if (early_deadlock)
-	{
-		RemoveFromWaitQueue(MyProc, hashcode);
-		return PROC_WAIT_STATUS_ERROR;
-	}
+	return PROC_WAIT_STATUS_WAITING;
+}
 
-	/*
-	 * Release the lock table's partition lock.
-	 *
-	 * NOTE: this may also cause us to exit critical-section state, possibly
-	 * allowing a cancel/die interrupt to be accepted. This is OK because we
-	 * have recorded the fact that we are waiting for a lock, and so
-	 * LockErrorCleanup will clean up if cancel/die happens.
-	 */
-	LWLockRelease(partitionLock);
+/*
+ * ProcSleep -- put process to sleep waiting on lock
+ *
+ * This must be called when JoinWaitQueue() returns PROC_WAIT_STATUS_WAITING.
+ * Returns after the lock has been granted, or if a deadlock is detected.  Can
+ * also bail out with ereport(ERROR), if some other error condition, or a
+ * timeout or cancellation is triggered.
+ *
+ * Result is one of the following:
+ *
+ *  PROC_WAIT_STATUS_OK      - lock was granted
+ *  PROC_WAIT_STATUS_ERROR   - a deadlock was detected
+ */
+ProcWaitStatus
+ProcSleep(LOCALLOCK *locallock)
+{
+	LOCKMODE	lockmode = locallock->tag.mode;
+	LOCK	   *lock = locallock->lock;
+	uint32		hashcode = locallock->hashcode;
+	LWLock	   *partitionLock = LockHashPartitionLock(hashcode);
+	TimestampTz standbyWaitStart = 0;
+	bool		allow_autovacuum_cancel = true;
+	bool		logged_recovery_conflict = false;
+	ProcWaitStatus myWaitStatus;
+
+	/* The caller must've armed the on-error cleanup mechanism */
+	Assert(GetAwaitedLock() == locallock);
+	Assert(!LWLockHeldByMe(partitionLock));
 
 	/*
-	 * Also, now that we will successfully clean up after an ereport, it's
-	 * safe to check to see if there's a buffer pin deadlock against the
-	 * Startup process.  Of course, that's only necessary if we're doing Hot
-	 * Standby and are not the Startup process ourselves.
+	 * Now that we will successfully clean up after an ereport, it's safe to
+	 * check to see if there's a buffer pin deadlock against the Startup
+	 * process.  Of course, that's only necessary if we're doing Hot Standby
+	 * and are not the Startup process ourselves.
 	 */
 	if (RecoveryInProgress() && !InRecovery)
 		CheckRecoveryConflictDeadlock();
@@ -1621,17 +1647,12 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable, bool dontWait)
 							standbyWaitStart, GetCurrentTimestamp(),
 							NULL, false);
 
-	/*
-	 * Re-acquire the lock table's partition lock, because the caller expects
-	 * us to still hold it.
-	 */
-	LWLockAcquire(partitionLock, LW_EXCLUSIVE);
-
 	/*
 	 * We don't have to do anything else, because the awaker did all the
-	 * necessary update of the lock table and MyProc.
+	 * necessary updates of the lock table and MyProc. (The caller is
+	 * responsible for updating the local lock table.)
 	 */
-	return MyProc->waitStatus;
+	return myWaitStatus;
 }
 
 
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 71e05727e54..89dd7578377 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -470,9 +470,9 @@ extern int	GetStartupBufferPinWaitBufId(void);
 extern bool HaveNFreeProcs(int n, int *nfree);
 extern void ProcReleaseLocks(bool isCommit);
 
-extern ProcWaitStatus ProcSleep(LOCALLOCK *locallock,
-								LockMethod lockMethodTable,
-								bool dontWait);
+extern ProcWaitStatus JoinWaitQueue(LOCALLOCK *locallock,
+									LockMethod lockMethodTable, bool dontWait);
+extern ProcWaitStatus ProcSleep(LOCALLOCK *locallock);
 extern void ProcWakeup(PGPROC *proc, ProcWaitStatus waitStatus);
 extern void ProcLockWakeup(LockMethod lockMethodTable, LOCK *lock);
 extern void CheckDeadLockAlert(void);
-- 
2.39.2

#9Maxim Orlov
orlovmg@gmail.com
In reply to: Heikki Linnakangas (#8)
8 attachment(s)
Re: Latches vs lwlock contention

I looked at the patch set and found it quite useful.

The first 7 patches are just refactoring and may be committed separately if
needed.
There were a couple of minor problems: patch #5 no longer applies cleanly, and
#8 produces a compiler warning about partitionLock being unused when building
without asserts. So I added a PG_USED_FOR_ASSERTS_ONLY marker to solve the
latter issue.
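
For the archives: the warning fix is just the usual annotation for a variable
that is only read from Assert(), so that assert-disabled builds don't warn
about it being unused. Roughly, in JoinWaitQueue() (abbreviated from the
v5-0008 hunk below):

    LWLock     *partitionLock PG_USED_FOR_ASSERTS_ONLY =
        LockHashPartitionLock(hashcode);

    /* partitionLock is only referenced here; without the marker, builds
     * configured without assertions complain about an unused variable. */
    Assert(LWLockHeldByMeInMode(partitionLock, LW_EXCLUSIVE));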

Again, the patch set overall looks good and seems useful to me. Here is the
rebased v5 version, based on Heikki's patch set above.

--
Best regards,
Maxim Orlov.

Attachments:

v5-0004-Move-TRACE-calls-into-WaitOnLock.patchapplication/octet-stream; name=v5-0004-Move-TRACE-calls-into-WaitOnLock.patchDownload
From 6f381e04a3201c6926190b234f128377ca3f84f4 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Mon, 22 Jul 2024 13:12:57 +0300
Subject: [PATCH v5 4/8] Move TRACE calls into WaitOnLock()

LockAcquire is a long and complex function. Pushing more stuff to its
subroutines makes it a little more manageable.
---
 src/backend/storage/lmgr/lock.c | 29 ++++++++++++++---------------
 1 file changed, 14 insertions(+), 15 deletions(-)

diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index abac7134da..e82a041097 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -1051,23 +1051,8 @@ LockAcquireExtended(const LOCKTAG *locktag,
 		 * case, because while trying to go to sleep, we may discover that we
 		 * can acquire the lock immediately after all.
 		 */
-
-		TRACE_POSTGRESQL_LOCK_WAIT_START(locktag->locktag_field1,
-										 locktag->locktag_field2,
-										 locktag->locktag_field3,
-										 locktag->locktag_field4,
-										 locktag->locktag_type,
-										 lockmode);
-
 		WaitOnLock(locallock, owner, dontWait);
 
-		TRACE_POSTGRESQL_LOCK_WAIT_DONE(locktag->locktag_field1,
-										locktag->locktag_field2,
-										locktag->locktag_field3,
-										locktag->locktag_field4,
-										locktag->locktag_type,
-										lockmode);
-
 		/*
 		 * NOTE: do not do any material change of state between here and
 		 * return.  All required changes in locktable state must have been
@@ -1811,6 +1796,13 @@ WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner, bool dontWait)
 	LOCKMETHODID lockmethodid = LOCALLOCK_LOCKMETHOD(*locallock);
 	LockMethod	lockMethodTable = LockMethods[lockmethodid];
 
+	TRACE_POSTGRESQL_LOCK_WAIT_START(locallock->tag.lock.locktag_field1,
+									 locallock->tag.lock.locktag_field2,
+									 locallock->tag.lock.locktag_field3,
+									 locallock->tag.lock.locktag_field4,
+									 locallock->tag.lock.locktag_type,
+									 locallock->tag.mode);
+
 	LOCK_PRINT("WaitOnLock: sleeping on lock",
 			   locallock->lock, locallock->tag.mode);
 
@@ -1881,6 +1873,13 @@ WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner, bool dontWait)
 
 	LOCK_PRINT("WaitOnLock: wakeup on lock",
 			   locallock->lock, locallock->tag.mode);
+
+	TRACE_POSTGRESQL_LOCK_WAIT_DONE(locallock->tag.lock.locktag_field1,
+									locallock->tag.lock.locktag_field2,
+									locallock->tag.lock.locktag_field3,
+									locallock->tag.lock.locktag_field4,
+									locallock->tag.lock.locktag_type,
+									locallock->tag.mode);
 }
 
 /*
-- 
2.34.1

v5-0001-Remove-LOCK_PRINT-call-that-could-point-to-garbag.patchapplication/octet-stream; name=v5-0001-Remove-LOCK_PRINT-call-that-could-point-to-garbag.patchDownload
From e7c1d682a6d7ff2104a17659a0546c91b390601a Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Mon, 22 Jul 2024 01:21:20 +0300
Subject: [PATCH v5 1/8] Remove LOCK_PRINT() call that could point to garbage

If a deadlock is detected in ProcSleep, it removes the process from
the wait queue and the shared LOCK and PROCLOCK entries. As soon as it
does that, the other process holding the lock might release it, and
remove the LOCK entry altogether. So on return from ProcSleep(), it's
possible that the LOCK entry is no longer valid.
---
 src/backend/storage/lmgr/lock.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 83b99a98f0..4d2b510835 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -1860,8 +1860,6 @@ WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner, bool dontWait)
 			 * now.
 			 */
 			awaitedLock = NULL;
-			LOCK_PRINT("WaitOnLock: aborting on lock",
-					   locallock->lock, locallock->tag.mode);
 			LWLockRelease(LockHashPartitionLock(locallock->hashcode));
 
 			/*
-- 
2.34.1

v5-0003-Set-MyProc-heldLocks-in-ProcSleep.patchapplication/octet-stream; name=v5-0003-Set-MyProc-heldLocks-in-ProcSleep.patchDownload
From f1ee1178fa214cce8885b99ff76441781e0816af Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Mon, 22 Jul 2024 13:23:59 +0300
Subject: [PATCH v5 3/8] Set MyProc->heldLocks in ProcSleep

Previously, the ProcSleep() caller was responsible for setting it, and
we had comments to remind about that. But it seems simpler to make
ProcSleep() itself responsible for it. ProcSleep() already sets the
other info about the lock it's waiting for (waitLock, waitProcLock and
waitLockMode), so it seems natural for it to set heldLocks too.
---
 src/backend/storage/lmgr/lock.c |  8 --------
 src/backend/storage/lmgr/proc.c | 10 ++++++----
 2 files changed, 6 insertions(+), 12 deletions(-)

diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 68dff44495..abac7134da 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -1046,11 +1046,6 @@ LockAcquireExtended(const LOCKTAG *locktag,
 	}
 	else
 	{
-		/*
-		 * Set bitmask of locks this process already holds on this object.
-		 */
-		MyProc->heldLocks = proclock->holdMask;
-
 		/*
 		 * Sleep till someone wakes me up. We do this even in the dontWait
 		 * case, because while trying to go to sleep, we may discover that we
@@ -1807,9 +1802,6 @@ MarkLockClear(LOCALLOCK *locallock)
 /*
  * WaitOnLock -- wait to acquire a lock
  *
- * Caller must have set MyProc->heldLocks to reflect locks already held
- * on the lockable object by this process.
- *
  * The appropriate partition lock must be held at entry, and will still be
  * held at exit.
  */
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index ac66da8638..9aff57a519 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -1046,9 +1046,6 @@ AuxiliaryPidGetProc(int pid)
 /*
  * ProcSleep -- put a process to sleep on the specified lock
  *
- * Caller must have set MyProc->heldLocks to reflect locks already held
- * on the lockable object by this process (under all XIDs).
- *
  * It's not actually guaranteed that we need to wait when this function is
  * called, because it could be that when we try to find a position at which
  * to insert ourself into the wait queue, we discover that we must be inserted
@@ -1078,7 +1075,7 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable, bool dontWait)
 	LWLock	   *partitionLock = LockHashPartitionLock(hashcode);
 	dclist_head *waitQueue = &lock->waitProcs;
 	PGPROC	   *insert_before = NULL;
-	LOCKMASK	myHeldLocks = MyProc->heldLocks;
+	LOCKMASK	myHeldLocks;
 	TimestampTz standbyWaitStart = 0;
 	bool		early_deadlock = false;
 	bool		allow_autovacuum_cancel = true;
@@ -1086,6 +1083,11 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable, bool dontWait)
 	ProcWaitStatus myWaitStatus;
 	PGPROC	   *leader = MyProc->lockGroupLeader;
 
+	/*
+	 * Set bitmask of locks this process already holds on this object.
+	 */
+	myHeldLocks = MyProc->heldLocks = proclock->holdMask;
+
 	/*
 	 * If group locking is in use, locks held by members of my locking group
 	 * need to be included in myHeldLocks.  This is not required for relation
-- 
2.34.1

v5-0005-Remove-redundant-lockAwaited-global-variable.patchapplication/octet-stream; name=v5-0005-Remove-redundant-lockAwaited-global-variable.patchDownload
From ae12e4be3b34ffac896d7c406f8a5f4ad5815579 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <m.orlov@postgrespro.ru>
Date: Tue, 10 Sep 2024 18:58:09 +0300
Subject: [PATCH v5 5/8] Remove redundant lockAwaited global variable

We had two global (static) variables to hold the LOCALLOCK that we are
currently acquiring: awaitedLock in lock.c and lockAwaited in
proc.c. They were set at slightly different times, but when they were
both set, they were the same thing. Remove the lockAwaited variable
and rely on awaitedLock everywhere.
---
 src/backend/storage/lmgr/lock.c |  9 +++++++++
 src/backend/storage/lmgr/proc.c | 30 ++++--------------------------
 src/backend/tcop/postgres.c     |  2 +-
 src/include/storage/lock.h      |  2 ++
 src/include/storage/proc.h      |  1 -
 5 files changed, 16 insertions(+), 28 deletions(-)

diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index e82a041097..4e1890276f 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -1770,6 +1770,12 @@ GrantAwaitedLock(void)
 	GrantLockLocal(awaitedLock, awaitedOwner);
 }
 
+LOCALLOCK *
+GetAwaitedLock(void)
+{
+	return awaitedLock;
+}
+
 /*
  * MarkLockClear -- mark an acquired lock as "clear"
  *
@@ -1866,6 +1872,9 @@ WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner, bool dontWait)
 	}
 	PG_END_TRY();
 
+	/*
+	 * We no longer want LockErrorCleanup to do anything.
+	 */
 	awaitedLock = NULL;
 
 	/* reset ps display to remove the suffix */
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 9aff57a519..e8084dd3d5 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -80,9 +80,6 @@ PROC_HDR   *ProcGlobal = NULL;
 NON_EXEC_STATIC PGPROC *AuxiliaryProcs = NULL;
 PGPROC	   *PreparedXactProcs = NULL;
 
-/* If we are waiting for a lock, this points to the associated LOCALLOCK */
-static LOCALLOCK *lockAwaited = NULL;
-
 static DeadLockState deadlock_state = DS_NOT_YET_CHECKED;
 
 /* Is a deadlock check pending? */
@@ -707,18 +704,6 @@ HaveNFreeProcs(int n, int *nfree)
 	return (*nfree == n);
 }
 
-/*
- * Check if the current process is awaiting a lock.
- */
-bool
-IsWaitingForLock(void)
-{
-	if (lockAwaited == NULL)
-		return false;
-
-	return true;
-}
-
 /*
  * Cancel any pending wait for lock, when aborting a transaction, and revert
  * any strong lock count acquisition for a lock being acquired.
@@ -730,6 +715,7 @@ IsWaitingForLock(void)
 void
 LockErrorCleanup(void)
 {
+	LOCALLOCK  *lockAwaited;
 	LWLock	   *partitionLock;
 	DisableTimeoutParams timeouts[2];
 
@@ -738,6 +724,7 @@ LockErrorCleanup(void)
 	AbortStrongLockAcquire();
 
 	/* Nothing to do if we weren't waiting for a lock */
+	lockAwaited = GetAwaitedLock();
 	if (lockAwaited == NULL)
 	{
 		RESUME_INTERRUPTS();
@@ -1218,9 +1205,6 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable, bool dontWait)
 		return PROC_WAIT_STATUS_ERROR;
 	}
 
-	/* mark that we are waiting for a lock */
-	lockAwaited = locallock;
-
 	/*
 	 * Release the lock table's partition lock.
 	 *
@@ -1645,17 +1629,11 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable, bool dontWait)
 							NULL, false);
 
 	/*
-	 * Re-acquire the lock table's partition lock.  We have to do this to hold
-	 * off cancel/die interrupts before we can mess with lockAwaited (else we
-	 * might have a missed or duplicated locallock update).
+	 * Re-acquire the lock table's partition lock, because the caller expects
+	 * us to still hold it.
 	 */
 	LWLockAcquire(partitionLock, LW_EXCLUSIVE);
 
-	/*
-	 * We no longer want LockErrorCleanup to do anything.
-	 */
-	lockAwaited = NULL;
-
 	/*
 	 * If we got the lock, be sure to remember it in the locallock table.
 	 */
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 8bc6bea113..9b800b72bf 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3062,7 +3062,7 @@ ProcessRecoveryConflictInterrupt(ProcSignalReason reason)
 			/*
 			 * If we aren't waiting for a lock we can never deadlock.
 			 */
-			if (!IsWaitingForLock())
+			if (GetAwaitedLock() == NULL)
 				return;
 
 			/* Intentional fall through to check wait for pin */
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index 8b328a06d9..787f3db06a 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -585,6 +585,8 @@ extern bool LockCheckConflicts(LockMethod lockMethodTable,
 							   LOCK *lock, PROCLOCK *proclock);
 extern void GrantLock(LOCK *lock, PROCLOCK *proclock, LOCKMODE lockmode);
 extern void GrantAwaitedLock(void);
+extern LOCALLOCK *GetAwaitedLock(void);
+
 extern void RemoveFromWaitQueue(PGPROC *proc, uint32 hashcode);
 extern LockData *GetLockStatusData(void);
 extern BlockedProcsData *GetBlockerStatusData(int blocked_pid);
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index deeb06c9e0..7ddfc8f392 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -481,7 +481,6 @@ extern ProcWaitStatus ProcSleep(LOCALLOCK *locallock,
 extern void ProcWakeup(PGPROC *proc, ProcWaitStatus waitStatus);
 extern void ProcLockWakeup(LockMethod lockMethodTable, LOCK *lock);
 extern void CheckDeadLockAlert(void);
-extern bool IsWaitingForLock(void);
 extern void LockErrorCleanup(void);
 
 extern void ProcWaitForSignal(uint32 wait_event_info);
-- 
2.34.1

v5-0002-Fix-comment-in-LockReleaseAll-on-when-locallock-n.patchapplication/octet-stream; name=v5-0002-Fix-comment-in-LockReleaseAll-on-when-locallock-n.patchDownload
From 4660001bf25079c35e584be03c8a4dcf7b8ff1e0 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Mon, 22 Jul 2024 00:43:13 +0300
Subject: [PATCH v5 2/8] Fix comment in LockReleaseAll() on when
 locallock->nLock can be zero

We reach this case also e.g. when a deadlock is detected, not only
when we run out of memory.
---
 src/backend/storage/lmgr/lock.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 4d2b510835..68dff44495 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -2207,9 +2207,8 @@ LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
 	while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
 	{
 		/*
-		 * If the LOCALLOCK entry is unused, we must've run out of shared
-		 * memory while trying to set up this lock.  Just forget the local
-		 * entry.
+		 * If the LOCALLOCK entry is unused, something must've gone wrong
+		 * while trying to acquire this lock.  Just forget the local entry.
 		 */
 		if (locallock->nLocks == 0)
 		{
-- 
2.34.1

v5-0007-Release-partition-lock-a-little-earlier.patchapplication/octet-stream; name=v5-0007-Release-partition-lock-a-little-earlier.patchDownload
From dc32923f690390f120fccc154d656ff6529d6302 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Mon, 22 Jul 2024 17:38:06 +0300
Subject: [PATCH v5 7/8] Release partition lock a little earlier

We don't need to hold the lock to prevent die/cancel interrupts, they
are only processed at explicit CHECK_FOR_INTERRUPTS() points now.
---
 src/backend/storage/lmgr/lock.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 91d2402609..fc77c88747 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -1042,6 +1042,7 @@ LockAcquireExtended(const LOCKTAG *locktag,
 	{
 		/* No conflict with held or previously requested locks */
 		GrantLock(lock, proclock, lockmode);
+		LWLockRelease(partitionLock);
 	}
 	else
 	{
@@ -1117,8 +1118,10 @@ LockAcquireExtended(const LOCKTAG *locktag,
 				elog(ERROR, "LockAcquire failed");
 			}
 		}
+
 		PROCLOCK_PRINT("LockAcquire: granted", proclock);
 		LOCK_PRINT("LockAcquire: granted", lock, lockmode);
+		LWLockRelease(partitionLock);
 	}
 
 	/* The lock was granted to us.  Update the local lock entry accordingly */
@@ -1130,8 +1133,6 @@ LockAcquireExtended(const LOCKTAG *locktag,
 	 */
 	FinishStrongLockAcquire();
 
-	LWLockRelease(partitionLock);
-
 	/*
 	 * Emit a WAL record if acquisition of this lock needs to be replayed in a
 	 * standby server.
-- 
2.34.1

v5-0006-Update-local-lock-table-in-ProcSleep-s-caller.patchapplication/octet-stream; name=v5-0006-Update-local-lock-table-in-ProcSleep-s-caller.patchDownload
From 7d5556cabd3955ee25afc55a145d82cefb39b663 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Mon, 22 Jul 2024 17:27:16 +0300
Subject: [PATCH v5 6/8] Update local lock table in ProcSleep's caller

ProcSleep is now responsible only for the shared state.  Seems a
little nicer that way.
---
 src/backend/storage/lmgr/lock.c | 4 +++-
 src/backend/storage/lmgr/proc.c | 7 -------
 2 files changed, 3 insertions(+), 8 deletions(-)

diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 4e1890276f..91d2402609 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -1042,7 +1042,6 @@ LockAcquireExtended(const LOCKTAG *locktag,
 	{
 		/* No conflict with held or previously requested locks */
 		GrantLock(lock, proclock, lockmode);
-		GrantLockLocal(locallock, owner);
 	}
 	else
 	{
@@ -1122,6 +1121,9 @@ LockAcquireExtended(const LOCKTAG *locktag,
 		LOCK_PRINT("LockAcquire: granted", lock, lockmode);
 	}
 
+	/* The lock was granted to us.  Update the local lock entry accordingly */
+	GrantLockLocal(locallock, owner);
+
 	/*
 	 * Lock state is fully up-to-date now; if we error out after this, no
 	 * special error cleanup is required.
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index e8084dd3d5..f29e02e902 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -1158,7 +1158,6 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable, bool dontWait)
 				{
 					/* Skip the wait and just grant myself the lock. */
 					GrantLock(lock, proclock, lockmode);
-					GrantAwaitedLock();
 					return PROC_WAIT_STATUS_OK;
 				}
 
@@ -1634,12 +1633,6 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable, bool dontWait)
 	 */
 	LWLockAcquire(partitionLock, LW_EXCLUSIVE);
 
-	/*
-	 * If we got the lock, be sure to remember it in the locallock table.
-	 */
-	if (MyProc->waitStatus == PROC_WAIT_STATUS_OK)
-		GrantAwaitedLock();
-
 	/*
 	 * We don't have to do anything else, because the awaker did all the
 	 * necessary update of the lock table and MyProc.
-- 
2.34.1

v5-0008-Split-ProcSleep-function-into-JoinWaitQueue-and-P.patchapplication/octet-stream; name=v5-0008-Split-ProcSleep-function-into-JoinWaitQueue-and-P.patchDownload
From ecc3beda97fa00621998b304809a4156f2c1c3c9 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Mon, 22 Jul 2024 21:53:18 +0300
Subject: [PATCH v5 8/8] Split ProcSleep function into JoinWaitQueue and
 ProcSleep

Split ProcSleep into two functions: JoinWaitQueue and ProcSleep.
JoinWaitQueue is called while holding the partition lock, and inserts
the current process to the wait queue, while ProcSleep() does the
actual sleeping. ProcSleep() is now called without holding the
partition lock, and it no longer re-acquires the partition lock before
returning. That makes the wakeup a little cheaper. Once upon a time,
re-acquiring the partition lock was ago to prevent a signal handler
from longjmping out at a bad time, but nowadays our signal handlers
just set flags, and longjmping can only happen at points where we
explicitly run CHECK_FOR_INTERRUPTS().

If JoinWaitQueue detects an "early deadlock" before even joining the
wait queue, it returns without changing the shared lock entry, leaving
the cleanup of the shared lock entry to the caller. This makes early
deadlock behave the same as the dontWait=true case.

One small side-effect of this refactoring is that we now only set the
'ps' title to say "waiting" when we actually enter the sleep, not when
the lock is skipped because dontWait=true, or when a deadlock is
detected early before entering the sleep.

Based on Thomas Munro's earlier patch and observation that ProcSleep
doesn't really need to re-acquire the partition lock.

Discussion: https://postgr.es/m/CA%2BhUKGKmO7ze0Z6WXKdrLxmvYa%3DzVGGXOO30MMktufofVwEm1A%40mail.gmail.com
---
 src/backend/storage/lmgr/lock.c | 182 ++++++++++++++++----------------
 src/backend/storage/lmgr/proc.c | 109 +++++++++++--------
 src/include/storage/proc.h      |   6 +-
 3 files changed, 157 insertions(+), 140 deletions(-)

diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index fc77c88747..ede791c9ff 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -360,8 +360,7 @@ static PROCLOCK *SetupLockInTable(LockMethod lockMethodTable, PGPROC *proc,
 static void GrantLockLocal(LOCALLOCK *locallock, ResourceOwner owner);
 static void BeginStrongLockAcquire(LOCALLOCK *locallock, uint32 fasthashcode);
 static void FinishStrongLockAcquire(void);
-static void WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner,
-					   bool dontWait);
+static ProcWaitStatus WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner);
 static void ReleaseLockIfHeld(LOCALLOCK *locallock, bool sessionLock);
 static void LockReassignOwner(LOCALLOCK *locallock, ResourceOwner parent);
 static bool UnGrantLock(LOCK *lock, LOCKMODE lockmode,
@@ -1046,85 +1045,105 @@ LockAcquireExtended(const LOCKTAG *locktag,
 	}
 	else
 	{
-		/*
-		 * Sleep till someone wakes me up. We do this even in the dontWait
-		 * case, because while trying to go to sleep, we may discover that we
-		 * can acquire the lock immediately after all.
-		 */
-		WaitOnLock(locallock, owner, dontWait);
+		ProcWaitStatus waitResult;
 
 		/*
-		 * NOTE: do not do any material change of state between here and
-		 * return.  All required changes in locktable state must have been
-		 * done when the lock was granted to us --- see notes in WaitOnLock.
+		 * Join the lock's wait queue.  We do this even in the dontWait case,
+		 * because while joining the queue, we may discover that we can
+		 * acquire the lock immediately after all.
 		 */
+		waitResult = JoinWaitQueue(locallock, lockMethodTable, dontWait);
 
-		/*
-		 * Check the proclock entry status. If dontWait = true, this is an
-		 * expected case; otherwise, it will only happen if something in the
-		 * ipc communication doesn't work correctly.
-		 */
-		if (!(proclock->holdMask & LOCKBIT_ON(lockmode)))
+		if (waitResult == PROC_WAIT_STATUS_ERROR)
 		{
+			/*
+			 * We're not getting the lock because a deadlock was detected
+			 * already while trying to join the wait queue, or because we
+			 * would have to wait but the caller requested no blocking.
+			 *
+			 * Undo the changes to shared entries before releasing the
+			 * partition lock.
+			 */
 			AbortStrongLockAcquire();
 
+			if (proclock->holdMask == 0)
+			{
+				uint32		proclock_hashcode;
+
+				proclock_hashcode = ProcLockHashCode(&proclock->tag,
+													 hashcode);
+				dlist_delete(&proclock->lockLink);
+				dlist_delete(&proclock->procLink);
+				if (!hash_search_with_hash_value(LockMethodProcLockHash,
+												 &(proclock->tag),
+												 proclock_hashcode,
+												 HASH_REMOVE,
+												 NULL))
+					elog(PANIC, "proclock table corrupted");
+			}
+			else
+				PROCLOCK_PRINT("LockAcquire: did not join wait queue", proclock);
+			lock->nRequested--;
+			lock->requested[lockmode]--;
+			LOCK_PRINT("LockAcquire: did not join wait queue",
+					   lock, lockmode);
+			Assert((lock->nRequested > 0) &&
+				   (lock->requested[lockmode] >= 0));
+			Assert(lock->nGranted <= lock->nRequested);
+			LWLockRelease(partitionLock);
+			if (locallock->nLocks == 0)
+				RemoveLocalLock(locallock);
+
 			if (dontWait)
 			{
-				/*
-				 * We can't acquire the lock immediately.  If caller specified
-				 * no blocking, remove useless table entries and return
-				 * LOCKACQUIRE_NOT_AVAIL without waiting.
-				 */
-				if (proclock->holdMask == 0)
-				{
-					uint32		proclock_hashcode;
-
-					proclock_hashcode = ProcLockHashCode(&proclock->tag,
-														 hashcode);
-					dlist_delete(&proclock->lockLink);
-					dlist_delete(&proclock->procLink);
-					if (!hash_search_with_hash_value(LockMethodProcLockHash,
-													 &(proclock->tag),
-													 proclock_hashcode,
-													 HASH_REMOVE,
-													 NULL))
-						elog(PANIC, "proclock table corrupted");
-				}
-				else
-					PROCLOCK_PRINT("LockAcquire: NOWAIT", proclock);
-				lock->nRequested--;
-				lock->requested[lockmode]--;
-				LOCK_PRINT("LockAcquire: conditional lock failed",
-						   lock, lockmode);
-				Assert((lock->nRequested > 0) &&
-					   (lock->requested[lockmode] >= 0));
-				Assert(lock->nGranted <= lock->nRequested);
-				LWLockRelease(partitionLock);
-				if (locallock->nLocks == 0)
-					RemoveLocalLock(locallock);
 				if (locallockp)
 					*locallockp = NULL;
 				return LOCKACQUIRE_NOT_AVAIL;
 			}
 			else
+			{
+				DeadLockReport();
+				/* DeadLockReport() will not return */
+			}
+		}
+
+		/*
+		 * We are now in the lock queue, or the lock was already granted.  If
+		 * queued, go to sleep.
+		 */
+		if (waitResult == PROC_WAIT_STATUS_WAITING)
+		{
+			Assert(!dontWait);
+			PROCLOCK_PRINT("LockAcquire: sleeping on lock", proclock);
+			LOCK_PRINT("LockAcquire: sleeping on lock", lock, lockmode);
+			LWLockRelease(partitionLock);
+
+			waitResult = WaitOnLock(locallock, owner);
+
+			/*
+			 * NOTE: do not do any material change of state between here and
+			 * return.  All required changes in locktable state must have been
+			 * done when the lock was granted to us --- see notes in WaitOnLock.
+			 */
+
+			if (waitResult == PROC_WAIT_STATUS_ERROR)
 			{
 				/*
-				 * We should have gotten the lock, but somehow that didn't
-				 * happen. If we get here, it's a bug.
+				 * We failed as a result of a deadlock, see CheckDeadLock(). Quit
+				 * now.
 				 */
-				PROCLOCK_PRINT("LockAcquire: INCONSISTENT", proclock);
-				LOCK_PRINT("LockAcquire: INCONSISTENT", lock, lockmode);
-				LWLockRelease(partitionLock);
-				elog(ERROR, "LockAcquire failed");
+				Assert(!dontWait);
+				DeadLockReport();
+				/* DeadLockReport() will not return */
 			}
 		}
-
-		PROCLOCK_PRINT("LockAcquire: granted", proclock);
-		LOCK_PRINT("LockAcquire: granted", lock, lockmode);
-		LWLockRelease(partitionLock);
+		else
+			LWLockRelease(partitionLock);
+		Assert(waitResult == PROC_WAIT_STATUS_OK);
 	}
 
 	/* The lock was granted to us.  Update the local lock entry accordingly */
+	Assert((proclock->holdMask & LOCKBIT_ON(lockmode)) != 0);
 	GrantLockLocal(locallock, owner);
 
 	/*
@@ -1796,14 +1815,12 @@ MarkLockClear(LOCALLOCK *locallock)
 /*
  * WaitOnLock -- wait to acquire a lock
  *
- * The appropriate partition lock must be held at entry, and will still be
- * held at exit.
+ * This is a wrapper around ProcSleep, with extra tracing and bookkeeping.
  */
-static void
-WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner, bool dontWait)
+static ProcWaitStatus
+WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner)
 {
-	LOCKMETHODID lockmethodid = LOCALLOCK_LOCKMETHOD(*locallock);
-	LockMethod	lockMethodTable = LockMethods[lockmethodid];
+	ProcWaitStatus result;
 
 	TRACE_POSTGRESQL_LOCK_WAIT_START(locallock->tag.lock.locktag_field1,
 									 locallock->tag.lock.locktag_field2,
@@ -1812,12 +1829,13 @@ WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner, bool dontWait)
 									 locallock->tag.lock.locktag_type,
 									 locallock->tag.mode);
 
-	LOCK_PRINT("WaitOnLock: sleeping on lock",
-			   locallock->lock, locallock->tag.mode);
-
 	/* adjust the process title to indicate that it's waiting */
 	set_ps_display_suffix("waiting");
 
+	/*
+	 * Record the fact that we are waiting for a lock, so that
+	 * LockErrorCleanup will clean up if cancel/die happens.
+	 */
 	awaitedLock = locallock;
 	awaitedOwner = owner;
 
@@ -1840,28 +1858,7 @@ WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner, bool dontWait)
 	 */
 	PG_TRY();
 	{
-		/*
-		 * If dontWait = true, we handle success and failure in the same way
-		 * here. The caller will be able to sort out what has happened.
-		 */
-		if (ProcSleep(locallock, lockMethodTable, dontWait) != PROC_WAIT_STATUS_OK
-			&& !dontWait)
-		{
-
-			/*
-			 * We failed as a result of a deadlock, see CheckDeadLock(). Quit
-			 * now.
-			 */
-			awaitedLock = NULL;
-			LWLockRelease(LockHashPartitionLock(locallock->hashcode));
-
-			/*
-			 * Now that we aren't holding the partition lock, we can give an
-			 * error report including details about the detected deadlock.
-			 */
-			DeadLockReport();
-			/* not reached */
-		}
+		result = ProcSleep(locallock);
 	}
 	PG_CATCH();
 	{
@@ -1883,15 +1880,14 @@ WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner, bool dontWait)
 	/* reset ps display to remove the suffix */
 	set_ps_display_remove_suffix();
 
-	LOCK_PRINT("WaitOnLock: wakeup on lock",
-			   locallock->lock, locallock->tag.mode);
-
 	TRACE_POSTGRESQL_LOCK_WAIT_DONE(locallock->tag.lock.locktag_field1,
 									locallock->tag.lock.locktag_field2,
 									locallock->tag.lock.locktag_field3,
 									locallock->tag.lock.locktag_field4,
 									locallock->tag.lock.locktag_type,
 									locallock->tag.mode);
+
+	return result;
 }
 
 /*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index f29e02e902..1ec357854e 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -1031,7 +1031,7 @@ AuxiliaryPidGetProc(int pid)
 
 
 /*
- * ProcSleep -- put a process to sleep on the specified lock
+ * JoinWaitQueue -- join the wait queue on the specified lock
  *
  * It's not actually guaranteed that we need to wait when this function is
  * called, because it could be that when we try to find a position at which
@@ -1040,36 +1040,43 @@ AuxiliaryPidGetProc(int pid)
  * we get the lock immediately. Because of this, it's sensible for this function
  * to have a dontWait argument, despite the name.
  *
- * The lock table's partition lock must be held at entry, and will be held
- * at exit.
+ * On entry, the caller has already set up LOCK and PROCLOCK entries to
+ * reflect that we have "requested" the lock.  The caller is responsible for
+ * cleaning that up, if we end up not joining the queue after all.
  *
- * Result: PROC_WAIT_STATUS_OK if we acquired the lock, PROC_WAIT_STATUS_ERROR
- * if not (if dontWait = true, we would have had to wait; if dontWait = false,
- * this is a deadlock).
+ * The lock table's partition lock must be held at entry, and is still held
+ * at exit.  The caller must release it before calling ProcSleep().
  *
- * ASSUME: that no one will fiddle with the queue until after
- *		we release the partition lock.
+ * Result is one of the following:
+ *
+ *  PROC_WAIT_STATUS_OK       - lock was immediately granted
+ *  PROC_WAIT_STATUS_WAITING  - joined the wait queue; call ProcSleep()
+ *  PROC_WAIT_STATUS_ERROR    - immediate deadlock was detected, or would
+ *                              need to wait and dontWait == true
  *
  * NOTES: The process queue is now a priority queue for locking.
  */
 ProcWaitStatus
-ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable, bool dontWait)
+JoinWaitQueue(LOCALLOCK *locallock, LockMethod lockMethodTable, bool dontWait)
 {
 	LOCKMODE	lockmode = locallock->tag.mode;
 	LOCK	   *lock = locallock->lock;
 	PROCLOCK   *proclock = locallock->proclock;
 	uint32		hashcode = locallock->hashcode;
-	LWLock	   *partitionLock = LockHashPartitionLock(hashcode);
+	LWLock	   *partitionLock PG_USED_FOR_ASSERTS_ONLY = LockHashPartitionLock(hashcode);
 	dclist_head *waitQueue = &lock->waitProcs;
 	PGPROC	   *insert_before = NULL;
 	LOCKMASK	myHeldLocks;
-	TimestampTz standbyWaitStart = 0;
 	bool		early_deadlock = false;
-	bool		allow_autovacuum_cancel = true;
-	bool		logged_recovery_conflict = false;
-	ProcWaitStatus myWaitStatus;
 	PGPROC	   *leader = MyProc->lockGroupLeader;
 
+	Assert(LWLockHeldByMeInMode(partitionLock, LW_EXCLUSIVE));
+
+	/*
+	 * Set bitmask of locks this process already holds on this object.
+	 */
+	myHeldLocks = MyProc->heldLocks = proclock->holdMask;
+
 	/*
 	 * Set bitmask of locks this process already holds on this object.
 	 */
@@ -1170,6 +1177,13 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable, bool dontWait)
 		}
 	}
 
+	/*
+	 * If we detected deadlock, give up without waiting.  This must agree with
+	 * CheckDeadLock's recovery code.
+	 */
+	if (early_deadlock)
+		return PROC_WAIT_STATUS_ERROR;
+
 	/*
 	 * At this point we know that we'd really need to sleep. If we've been
 	 * commanded not to do that, bail out.
@@ -1194,31 +1208,43 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable, bool dontWait)
 
 	MyProc->waitStatus = PROC_WAIT_STATUS_WAITING;
 
-	/*
-	 * If we detected deadlock, give up without waiting.  This must agree with
-	 * CheckDeadLock's recovery code.
-	 */
-	if (early_deadlock)
-	{
-		RemoveFromWaitQueue(MyProc, hashcode);
-		return PROC_WAIT_STATUS_ERROR;
-	}
+	return PROC_WAIT_STATUS_WAITING;
+}
 
-	/*
-	 * Release the lock table's partition lock.
-	 *
-	 * NOTE: this may also cause us to exit critical-section state, possibly
-	 * allowing a cancel/die interrupt to be accepted. This is OK because we
-	 * have recorded the fact that we are waiting for a lock, and so
-	 * LockErrorCleanup will clean up if cancel/die happens.
-	 */
-	LWLockRelease(partitionLock);
+/*
+ * ProcSleep -- put process to sleep waiting on lock
+ *
+ * This must be called when JoinWaitQueue() returns PROC_WAIT_STATUS_WAITING.
+ * Returns after the lock has been granted, or if a deadlock is detected.  Can
+ * also bail out with ereport(ERROR), if some other error condition, or a
+ * timeout or cancellation is triggered.
+ *
+ * Result is one of the following:
+ *
+ *  PROC_WAIT_STATUS_OK      - lock was granted
+ *  PROC_WAIT_STATUS_ERROR   - a deadlock was detected
+ */
+ProcWaitStatus
+ProcSleep(LOCALLOCK *locallock)
+{
+	LOCKMODE	lockmode = locallock->tag.mode;
+	LOCK	   *lock = locallock->lock;
+	uint32		hashcode = locallock->hashcode;
+	LWLock	   *partitionLock = LockHashPartitionLock(hashcode);
+	TimestampTz standbyWaitStart = 0;
+	bool		allow_autovacuum_cancel = true;
+	bool		logged_recovery_conflict = false;
+	ProcWaitStatus myWaitStatus;
+
+	/* The caller must've armed the on-error cleanup mechanism */
+	Assert(GetAwaitedLock() == locallock);
+	Assert(!LWLockHeldByMe(partitionLock));
 
 	/*
-	 * Also, now that we will successfully clean up after an ereport, it's
-	 * safe to check to see if there's a buffer pin deadlock against the
-	 * Startup process.  Of course, that's only necessary if we're doing Hot
-	 * Standby and are not the Startup process ourselves.
+	 * Now that we will successfully clean up after an ereport, it's safe to
+	 * check to see if there's a buffer pin deadlock against the Startup
+	 * process.  Of course, that's only necessary if we're doing Hot Standby
+	 * and are not the Startup process ourselves.
 	 */
 	if (RecoveryInProgress() && !InRecovery)
 		CheckRecoveryConflictDeadlock();
@@ -1627,17 +1653,12 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable, bool dontWait)
 							standbyWaitStart, GetCurrentTimestamp(),
 							NULL, false);
 
-	/*
-	 * Re-acquire the lock table's partition lock, because the caller expects
-	 * us to still hold it.
-	 */
-	LWLockAcquire(partitionLock, LW_EXCLUSIVE);
-
 	/*
 	 * We don't have to do anything else, because the awaker did all the
-	 * necessary update of the lock table and MyProc.
+	 * necessary updates of the lock table and MyProc. (The caller is
+	 * responsible for updating the local lock table.)
 	 */
-	return MyProc->waitStatus;
+	return myWaitStatus;
 }
 
 
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 7ddfc8f392..daae929593 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -475,9 +475,9 @@ extern int	GetStartupBufferPinWaitBufId(void);
 extern bool HaveNFreeProcs(int n, int *nfree);
 extern void ProcReleaseLocks(bool isCommit);
 
-extern ProcWaitStatus ProcSleep(LOCALLOCK *locallock,
-								LockMethod lockMethodTable,
-								bool dontWait);
+extern ProcWaitStatus JoinWaitQueue(LOCALLOCK *locallock,
+									LockMethod lockMethodTable, bool dontWait);
+extern ProcWaitStatus ProcSleep(LOCALLOCK *locallock);
 extern void ProcWakeup(PGPROC *proc, ProcWaitStatus waitStatus);
 extern void ProcLockWakeup(LockMethod lockMethodTable, LOCK *lock);
 extern void CheckDeadLockAlert(void);
-- 
2.34.1

#10Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Maxim Orlov (#9)
Re: Latches vs lwlock contention

On 10/09/2024 19:53, Maxim Orlov wrote:

I looked at the patch set and found it quite useful.

The first 7 patches are just refactoring and may be committed separately
if needed.
There were a couple of minor problems: patch #5 no longer applies cleanly,
and #8 produces a compiler warning about partitionLock being unused when
building without asserts. So I added a PG_USED_FOR_ASSERTS_ONLY marker to
solve the latter issue.

Again, the patch set overall looks good and seems useful to me. Here is the
rebased v5 version, based on Heikki's patch set above.

Committed, thanks for the review!

In case you're wondering, I committed some of the smaller patches
separately, but squashed some of them into the main patch. On closer look,
the first patch, "Remove LOCK_PRINT() call that could point to garbage",
wasn't fixing any existing issue: the LOCK_PRINT() was fine, because we held
the partition lock. But removing it did become necessary with the main patch,
so I squashed it into that one. The others that I squashed were just not that
interesting on their own.
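
To spell the hazard out: LOCK_PRINT() dereferences the shared LOCK entry
through locallock->lock, which is only safe while the lock table's partition
lock pins that entry. With the main patch, the deadlock path runs after
ProcSleep() has returned and we no longer hold the partition lock, so by then
the backend that held the lock may have released it and removed the shared
entry altogether. Schematically (a sketch of the ordering, not a verbatim
hunk):

    /* in LockAcquireExtended(), after ProcSleep() has returned and the
     * partition lock is no longer held */
    if (waitResult == PROC_WAIT_STATUS_ERROR)
    {
        /*
         * Another backend may already have released the lock and removed
         * the shared LOCK entry, so a debugging call like
         *
         *     LOCK_PRINT("WaitOnLock: aborting on lock",
         *                locallock->lock, locallock->tag.mode);
         *
         * could now read freed shared memory.  Hence its removal.
         */
        DeadLockReport();       /* does not return */
    }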

The rest of Thomas's SetLatches work remains, so I left the commitfest
entry in "Needs review" state.

--
Heikki Linnakangas
Neon (https://neon.tech)

#11Alexander Lakhin
exclusion@gmail.com
In reply to: Heikki Linnakangas (#10)
Re: Latches vs lwlock contention

Hello Heikki,

04.11.2024 18:08, Heikki Linnakangas wrote:

Committed, thanks for the review!

I've discovered that the following script:
export PGOPTIONS='-c lock_timeout=1s'
createdb regression
for i in {1..100}; do
echo "ITERATION: $i"
psql -c "CREATE TABLE t(i int);"
cat << 'EOF' | psql &
DO $$
DECLARE
    i int;
BEGIN
   FOR i IN 1 .. 5000000 LOOP
    INSERT INTO t VALUES (1);
  END LOOP;
END;
$$;
EOF
sleep 1
psql -c "DROP TABLE t" &
cat << 'EOF' | psql &
COPY t FROM STDIN;
0
\.
EOF
wait

psql -c "DROP TABLE t" || break;
done

causes a segmentation fault on master (it fails on iterations 5, 4, 26 for me):
ITERATION: 26
CREATE TABLE
ERROR:  canceling statement due to lock timeout
ERROR:  canceling statement due to lock timeout
invalid command \.
ERROR:  syntax error at or near "0"
LINE 1: 0
        ^
server closed the connection unexpectedly

Core was generated by `postgres: law regression [local] idle                                         '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  GrantLockLocal (locallock=0x5a1d75c35ba8, owner=0x5a1d75c18630) at lock.c:1805
1805            lockOwners[i].owner = owner;
(gdb) bt
#0  GrantLockLocal (locallock=0x5a1d75c35ba8, owner=0x5a1d75c18630) at lock.c:1805
#1  0x00005a1d51e93ee7 in GrantAwaitedLock () at lock.c:1887
#2  0x00005a1d51ea1e58 in LockErrorCleanup () at proc.c:814
#3  0x00005a1d51b9a1a7 in AbortTransaction () at xact.c:2853
#4  0x00005a1d51b9abc6 in AbortCurrentTransactionInternal () at xact.c:3579
#5  AbortCurrentTransaction () at xact.c:3457
#6  0x00005a1d51eafeda in PostgresMain (dbname=<optimized out>, username=0x5a1d75c139b8 "law") at postgres.c:4440

(gdb) p lockOwners
$1 = (LOCALLOCKOWNER *) 0x0

git bisect led me to 3c0fd64fe.
Could you please take a look?

Best regards,
Alexander Lakhin

#12Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Alexander Lakhin (#11)
Re: Latches vs lwlock contention

On 27/03/2025 07:00, Alexander Lakhin wrote:

I've discovered that the following script:
export PGOPTIONS='-c lock_timeout=1s'
createdb regression
for i in {1..100}; do
echo "ITERATION: $i"
psql -c "CREATE TABLE t(i int);"
cat << 'EOF' | psql &
DO $$
DECLARE
    i int;
BEGIN
   FOR i IN 1 .. 5000000 LOOP
    INSERT INTO t VALUES (1);
  END LOOP;
END;
$$;
EOF
sleep 1
psql -c "DROP TABLE t" &
cat << 'EOF' | psql &
COPY t FROM STDIN;
0
\.
EOF
wait

psql -c "DROP TABLE t" || break;
done

causes a segmentation fault on master (it fails on iterations 5, 4, 26
for me):
ITERATION: 26
CREATE TABLE
ERROR:  canceling statement due to lock timeout
ERROR:  canceling statement due to lock timeout
invalid command \.
ERROR:  syntax error at or near "0"
LINE 1: 0
        ^
server closed the connection unexpectedly

Core was generated by `postgres: law regression [local]
idle                                         '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  GrantLockLocal (locallock=0x5a1d75c35ba8, owner=0x5a1d75c18630) at
lock.c:1805
1805            lockOwners[i].owner = owner;
(gdb) bt
#0  GrantLockLocal (locallock=0x5a1d75c35ba8, owner=0x5a1d75c18630) at
lock.c:1805
#1  0x00005a1d51e93ee7 in GrantAwaitedLock () at lock.c:1887
#2  0x00005a1d51ea1e58 in LockErrorCleanup () at proc.c:814
#3  0x00005a1d51b9a1a7 in AbortTransaction () at xact.c:2853
#4  0x00005a1d51b9abc6 in AbortCurrentTransactionInternal () at xact.c:3579
#5  AbortCurrentTransaction () at xact.c:3457
#6  0x00005a1d51eafeda in PostgresMain (dbname=<optimized out>,
username=0x5a1d75c139b8 "law") at postgres.c:4440

(gdb) p lockOwners
$1 = (LOCALLOCKOWNER *) 0x0

git bisect led me to 3c0fd64fe.
Could you please take a look?

Great, thanks for the repro! With that, I was able to capture the
failure with 'rr' and understand what happens: Commit 3c0fd64fe removed
"lockAwaited = NULL;" from LockErrorCleanup(). Because of that, if the
lock had been granted to us, and if LockErrorCleanup() was called twice,
the second call would call GrantAwaitedLock() even if the lock was
already released and cleaned up.

I've pushed a fix to put that back.
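
For readers following along, the essence of the fix is to clear the
awaited-lock pointer again once LockErrorCleanup() has dealt with it, so that
a repeated call during abort processing becomes a no-op. Very roughly (a
sketch of the shape of the fix, not the committed hunk; ResetAwaitedLock() is
an illustrative name for whatever clears the pointer that GetAwaitedLock()
returns):

    void
    LockErrorCleanup(void)
    {
        LOCALLOCK  *lockAwaited = GetAwaitedLock();

        if (lockAwaited == NULL)
            return;             /* not waiting, or already cleaned up */

        /* ... disable timeouts, take the partition lock, etc. ... */

        if (!dlist_node_is_detached(&MyProc->links))
        {
            /* still on the wait queue: remove ourselves */
            RemoveFromWaitQueue(MyProc, lockAwaited->hashcode);
        }
        else if (MyProc->waitStatus == PROC_WAIT_STATUS_OK)
        {
            /* woken with the lock already granted: record it locally */
            GrantAwaitedLock();
        }

        /*
         * Forget the awaited lock, so a second LockErrorCleanup() call does
         * nothing.  Leaving the pointer set is what made the second call
         * re-run GrantAwaitedLock() on a lock that was already released.
         */
        ResetAwaitedLock();     /* illustrative; equivalent to the old
                                 * "lockAwaited = NULL;" */
    }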

--
Heikki Linnakangas
Neon (https://neon.tech)