heavily contended lwlocks with long wait queues scale badly
Hi,
I am working on posting a patch series to make relation extension more
scalable. As part of that I was running some benchmarks for workloads that I
thought should not be affected, or only positively impacted - but I was wrong:
there was some very significant degradation at very high client counts. After
pulling my hair out for quite a while trying to understand that behaviour, I
figured out that it's just a side-effect of *removing* some other contention.
This morning (turns out sleeping helps) I managed to reproduce it in an
unmodified postgres.
$ cat ~/tmp/txid.sql
SELECT txid_current();
$ for c in 1 2 4 8 16 32 64 128 256 512 768 1024 2048 4096; do echo -n "$c ";pgbench -n -M prepared -f ~/tmp/txid.sql -c$c -j$c -T5 2>&1|grep '^tps'|awk '{print $3}';done
1 60174
2 116169
4 208119
8 373685
16 515247
32 554726
64 497508
128 415097
256 334923
512 243679
768 192959
1024 157734
2048 82904
4096 32007
(I didn't properly round TPS, but that doesn't matter here)
Performance completely falls off a cliff starting at ~256 clients. There's
actually plenty of CPU available here, so this isn't a case of running out of
CPU time.
Rather, the problem is very bad contention on the "spinlock" for the lwlock
wait list. I realized that something in that direction was off when trying to
investigate why I was seeing spin delays of substantial duration (>100ms).
The problem isn't a fundamental issue with lwlocks, it's that
LWLockDequeueSelf() does this:
	LWLockWaitListLock(lock);

	/*
	 * Can't just remove ourselves from the list, but we need to iterate over
	 * all entries as somebody else could have dequeued us.
	 */
	proclist_foreach_modify(iter, &lock->waiters, lwWaitLink)
	{
		if (iter.cur == MyProc->pgprocno)
		{
			found = true;
			proclist_delete(&lock->waiters, iter.cur, lwWaitLink);
			break;
		}
	}
I.e. it iterates over the whole waitlist to "find itself". The longer the
waitlist gets, the longer this takes. And the longer it takes for
LWLockWakeup() to actually wake up all waiters, the more likely it becomes
that LWLockDequeueSelf() needs to be called in the first place - so the total
work grows quadratically with the number of waiters.
We can't make the trivial optimization of using proclist_contains(), because
PGPROC->lwWaitLink is also used for the list of processes to wake up in
LWLockWakeup() - so being linked doesn't tell us which of the two lists we
are on.
But I think we can solve that fairly reasonably nonetheless. We can change
PGPROC->lwWaiting to not just be a boolean, but have three states:
0: not waiting
1: waiting in waitlist
2: waiting to be woken up
which we then can use in LWLockDequeueSelf() to only remove ourselves from the
list if we're on it. As removal from that list is protected by the wait list
lock, there's no race to worry about.
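For illustration, here is a self-contained toy model of the idea (the names
are invented for this sketch, nothing here is lwlock.c code): each entry of an
intrusive wait list carries a state, so an entry can tell in O(1) whether it
still has to unlink itself, instead of scanning the list to find out whether
somebody else already removed it.

#include <assert.h>
#include <stdio.h>

/* toy equivalents of the three lwWaiting states described above */
enum wait_state {NOT_WAITING, WAITING, PENDING_WAKEUP};

struct waiter
{
	struct waiter *prev;
	struct waiter *next;
	enum wait_state state;
};

/* circular list with a dummy head */
static void
list_init(struct waiter *head)
{
	head->prev = head->next = head;
}

static void
enqueue(struct waiter *head, struct waiter *w)
{
	w->prev = head->prev;
	w->next = head;
	head->prev->next = w;
	head->prev = w;
	w->state = WAITING;
}

/* wakeup path: unlink the first waiter and mark it as pending wakeup */
static struct waiter *
wakeup_one(struct waiter *head)
{
	struct waiter *w = head->next;

	if (w == head)
		return NULL;			/* list is empty */
	w->prev->next = w->next;
	w->next->prev = w->prev;
	w->state = PENDING_WAKEUP;	/* off the list, but not yet signalled */
	return w;
}

/*
 * With the per-entry state there is no need to walk the list to check
 * whether somebody already removed us; we only look at our own state
 * (in the real code this happens under the wait list lock).
 */
static void
dequeue_self(struct waiter *w)
{
	if (w->state == WAITING)
	{
		w->prev->next = w->next;
		w->next->prev = w->prev;
		w->state = NOT_WAITING;
	}
	/* else: already unlinked; the real code then absorbs the pending wakeup */
}

int
main(void)
{
	struct waiter head, a, b;
	struct waiter *w;

	list_init(&head);
	enqueue(&head, &a);
	enqueue(&head, &b);

	w = wakeup_one(&head);				/* a is now PENDING_WAKEUP */
	assert(w == &a);
	dequeue_self(&a);					/* no-op, a is already off the list */
	dequeue_self(&b);					/* unlinks b without any scan */
	assert(head.next == &head && head.prev == &head);
	printf("ok\n");
	return 0;
}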
client patched HEAD
1 60109 60174
2 112694 116169
4 214287 208119
8 377459 373685
16 524132 515247
32 565772 554726
64 587716 497508
128 581297 415097
256 550296 334923
512 486207 243679
768 449673 192959
1024 410836 157734
2048 326224 82904
4096 250252 32007
Not perfect with the patch, but not awful either.
I suspect this issue might actually explain quite a few odd performance
behaviours we've seen at the larger end in the past. I think it has gotten a
bit worse with the conversion of lwlock.c to proclists (I see lots of
expensive multiplications to deal with sizeof(PGPROC)), but otherwise likely
exists at least as far back as ab5194e6f61, in 9.5.
I guess there's an argument for considering this a bug that we should
backpatch a fix for? But given the vintage, probably not? The only thing that
gives me pause is that this is quite hard to pinpoint as happening.
I've attached my quick-and-dirty patch. Obviously it'd need a few defines etc,
but I wanted to get this out to discuss before spending further time.
Greetings,
Andres Freund
Attachments:
fix-lwlock-queue.diff (text/x-diff; charset=us-ascii)
diff --git i/src/include/storage/proc.h w/src/include/storage/proc.h
index 8d096fdeeb1..9a2615666a1 100644
--- i/src/include/storage/proc.h
+++ w/src/include/storage/proc.h
@@ -217,7 +217,8 @@ struct PGPROC
bool recoveryConflictPending;
/* Info about LWLock the process is currently waiting for, if any. */
- bool lwWaiting; /* true if waiting for an LW lock */
+ int lwWaiting; /* 0 if not waiting, 1 if on waitlist, 2 if
+ * waiting to be woken */
uint8 lwWaitMode; /* lwlock mode being waited for */
proclist_node lwWaitLink; /* position in LW lock wait list */
diff --git i/src/backend/storage/lmgr/lwlock.c w/src/backend/storage/lmgr/lwlock.c
index d274c9b1dc9..ffb852c91d7 100644
--- i/src/backend/storage/lmgr/lwlock.c
+++ w/src/backend/storage/lmgr/lwlock.c
@@ -987,6 +987,9 @@ LWLockWakeup(LWLock *lock)
wokeup_somebody = true;
}
+ /* signal that the process isn't on the wait list anymore */
+ waiter->lwWaiting = 2;
+
/*
* Once we've woken up an exclusive lock, there's no point in waking
* up anybody else.
@@ -1044,7 +1047,7 @@ LWLockWakeup(LWLock *lock)
* another lock.
*/
pg_write_barrier();
- waiter->lwWaiting = false;
+ waiter->lwWaiting = 0;
PGSemaphoreUnlock(waiter->sem);
}
}
@@ -1073,7 +1076,7 @@ LWLockQueueSelf(LWLock *lock, LWLockMode mode)
/* setting the flag is protected by the spinlock */
pg_atomic_fetch_or_u32(&lock->state, LW_FLAG_HAS_WAITERS);
- MyProc->lwWaiting = true;
+ MyProc->lwWaiting = 1;
MyProc->lwWaitMode = mode;
/* LW_WAIT_UNTIL_FREE waiters are always at the front of the queue */
@@ -1101,7 +1104,6 @@ static void
LWLockDequeueSelf(LWLock *lock)
{
bool found = false;
- proclist_mutable_iter iter;
#ifdef LWLOCK_STATS
lwlock_stats *lwstats;
@@ -1114,17 +1116,14 @@ LWLockDequeueSelf(LWLock *lock)
LWLockWaitListLock(lock);
/*
- * Can't just remove ourselves from the list, but we need to iterate over
- * all entries as somebody else could have dequeued us.
+ * Remove ourselves from the waitlist, unless we've already been
+ * removed. The removal happens with the wait list lock held, so there's
+ * no race in this check.
*/
- proclist_foreach_modify(iter, &lock->waiters, lwWaitLink)
+ if (MyProc->lwWaiting == 1)
{
- if (iter.cur == MyProc->pgprocno)
- {
- found = true;
- proclist_delete(&lock->waiters, iter.cur, lwWaitLink);
- break;
- }
+ proclist_delete(&lock->waiters, MyProc->pgprocno, lwWaitLink);
+ found = true;
}
if (proclist_is_empty(&lock->waiters) &&
@@ -1138,7 +1137,7 @@ LWLockDequeueSelf(LWLock *lock)
/* clear waiting state again, nice for debugging */
if (found)
- MyProc->lwWaiting = false;
+ MyProc->lwWaiting = 0;
else
{
int extraWaits = 0;
At Thu, 27 Oct 2022 09:59:14 -0700, Andres Freund <andres@anarazel.de> wrote in
But I think we can solve that fairly reasonably nonetheless. We can change
PGPROC->lwWaiting to not just be a boolean, but have three states:
0: not waiting
1: waiting in waitlist
2: waiting to be woken up
which we then can use in LWLockDequeueSelf() to only remove ourselves from the
list if we're on it. As removal from that list is protected by the wait list
lock, there's no race to worry about.
Since LWLockDequeueSelf() is only called in restricted situations, there's
no chance that the proc to be removed is on another lock's wait list at the
time the function is called. So it seems to work well. It is simple and
requires no additional memory or cycles... well, almost: it enlarges PGPROC
by 8 bytes, but changing lwWaiting to int8/uint8 keeps the size as it is
(Rocky8/x86-64).
It just shaves off looping cycles. So +1 for what the patch does.
client patched HEAD
1 60109 60174
2 112694 116169
4 214287 208119
8 377459 373685
16 524132 515247
32 565772 554726
64 587716 497508
128 581297 415097
256 550296 334923
512 486207 243679
768 449673 192959
1024 410836 157734
2048 326224 82904
4096 250252 32007
Not perfect with the patch, but not awful either.
Fairly good? Agreed. The performance peak is improved by 6% and
shifted to a larger number of clients (32->128).
I suspect this issue might actually explain quite a few odd performance
behaviours we've seen at the larger end in the past. I think it has gotten a
bit worse with the conversion of lwlock.c to proclists (I see lots of
expensive multiplications to deal with sizeof(PGPROC)), but otherwise likely
exists at least as far back as ab5194e6f61, in 9.5.
I guess there's an argument for considering this a bug that we should
backpatch a fix for? But given the vintage, probably not? The only thing that
gives me pause is that this is quite hard to pinpoint as happening.
I don't think this is a bug, but I think it might be back-patchable, since
it doesn't change the memory footprint (if adjusted) and causes no additional
cost or interface breakage; thus it might be OK to backpatch. Since this
"bug" has a positive-feedback nature, reducing its coefficient is beneficial
beyond the direct effect of the change.
I've attached my quick-and-dirty patch. Obviously it'd need a few defines etc,
but I wanted to get this out to discuss before spending further time.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Mon, Oct 31, 2022 at 11:03 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
At Thu, 27 Oct 2022 09:59:14 -0700, Andres Freund <andres@anarazel.de> wrote in
But I think we can solve that fairly reasonably nonetheless. We can change
PGPROC->lwWaiting to not just be a boolean, but have three states:
0: not waiting
1: waiting in waitlist
2: waiting to be woken up
which we then can use in LWLockDequeueSelf() to only remove ourselves from the
list if we're on it. As removal from that list is protected by the wait list
lock, there's no race to worry about.
This looks like a good idea.
Well, almost: it enlarges PGPROC by 8 bytes, but changing lwWaiting to
int8/uint8 keeps the size as it is (Rocky8/x86-64).
I agree
It just shaves off looping cycles. So +1 for what the patch does.
client patched HEAD
1 60109 60174
2 112694 116169
4 214287 208119
8 377459 373685
16 524132 515247
32 565772 554726
64 587716 497508
128 581297 415097
256 550296 334923
512 486207 243679
768 449673 192959
1024 410836 157734
2048 326224 82904
4096 250252 32007
Not perfect with the patch, but not awful either.
Fairly good? Agreed. The performance peak is improved by 6% and
shifted to a larger number of clients (32->128).
The performance results are promising, so +1.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Thu, Oct 27, 2022 at 10:29 PM Andres Freund <andres@anarazel.de> wrote:
But I think we can solve that fairly reasonably nonetheless. We can change
PGPROC->lwWaiting to not just be a boolean, but have three states:
0: not waiting
1: waiting in waitlist
2: waiting to be woken up
which we then can use in LWLockDequeueSelf() to only remove ourselves from the
list if we're on it. As removal from that list is protected by the wait list
lock, there's no race to worry about.
client patched HEAD
1 60109 60174
2 112694 116169
4 214287 208119
8 377459 373685
16 524132 515247
32 565772 554726
64 587716 497508
128 581297 415097
256 550296 334923
512 486207 243679
768 449673 192959
1024 410836 157734
2048 326224 82904
4096 250252 32007
Not perfect with the patch, but not awful either.
Here are results from my testing [1]. Results look impressive with the
patch at a higher number of clients; for instance, on HEAD the TPS with
1024 clients is 103587 whereas it is 248702 with the patch.
HEAD, run 1:
1 34534
2 72088
4 135249
8 213045
16 243507
32 304108
64 375148
128 390658
256 345503
512 284510
768 146417
1024 103587
2048 34702
4096 12450
HEAD, run 2:
1 34110
2 72403
4 134421
8 211263
16 241606
32 295198
64 353580
128 385147
256 341672
512 295001
768 142341
1024 97721
2048 30229
4096 13179
PATCHED, run 1:
1 34412
2 71733
4 139141
8 211526
16 241692
32 308198
64 406198
128 385643
256 338464
512 295559
768 272639
1024 248702
2048 191402
4096 112074
PATCHED, run 2:
1 34087
2 73567
4 135624
8 211901
16 242819
32 310534
64 352663
128 381780
256 342483
512 301968
768 272596
1024 251014
2048 184939
4096 108186
I've attached my quick-and-dirty patch. Obviously it'd need a few defines etc,
but I wanted to get this out to discuss before spending further time.
Just for the record, here are some review comments posted in the other
thread: /messages/by-id/CALj2ACXktNbG=K8Xi7PSqbofTZozavhaxjatVc14iYaLu4Maag@mail.gmail.com
BTW, I've seen a sporadic crash (SEGV) with the patch in the bgwriter
with the same setup [1]. I'm not sure if it's really because of the
patch. I'm unable to reproduce it now and unfortunately I didn't
capture further details when it occurred.
[1]: ./configure --prefix=$PWD/inst/ --enable-tap-tests CFLAGS="-O3" >
install.log && make -j 8 install > install.log 2>&1 &
shared_buffers = 8GB
max_wal_size = 32GB
max_connections = 4096
checkpoint_timeout = 10min
ubuntu: cat << EOF >> txid.sql
SELECT txid_current();
EOF
ubuntu: for c in 1 2 4 8 16 32 64 128 256 512 768 1024 2048 4096; do
echo -n "$c ";./pgbench -n -M prepared -U ubuntu postgres -f txid.sql
-c$c -j$c -T5 2>&1|grep '^tps'|awk '{print $3}';done
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
I was working on optimizing the LWLock queue in a little different way,
and I also benchmarked Andres' original patch from this thread [1].
The results are quite impressive, indeed. Please feel free to see the
results and join the discussion in [1] if you want.
Best regards,
Pavel
[1]: /messages/by-id/CALT9ZEEz+=Nepc5eti6x531q64Z6+DxtP3h-h_8O5HDdtkJcPw@mail.gmail.com
Hi Andres,
Thank you for your patch. The results are impressive.
On Mon, Oct 31, 2022 at 2:10 PM Pavel Borisov <pashkin.elfe@gmail.com> wrote:
I was working on optimizing the LWLock queue in a little different way
and I also did a benchmarking of Andres' original patch from this
thread. [1]
The results are quite impressive, indeed. Please feel free to see the
results and join the discussion in [1] if you want.
Best regards,
Pavel
[1] /messages/by-id/CALT9ZEEz+=Nepc5eti6x531q64Z6+DxtP3h-h_8O5HDdtkJcPw@mail.gmail.com
Pavel posted a patch implementing a lock-less queue for LWLock. The
results are interesting indeed, but slightly lower than those of your
current patch. Pavel's current patch probably doesn't utilize the full
potential of the lock-less idea. I wonder what you think about this
direction? We would be grateful for your guidance. Thank you.
------
Regards,
Alexander Korotkov
On Thu, Oct 27, 2022 at 12:59 PM Andres Freund <andres@anarazel.de> wrote:
After pulling my
hair out for quite a while to try to understand that behaviour, I figured out
that it's just a side-effect of *removing* some other contention.
I've seen this kind of pattern on multiple occasions. I don't know if
they were all caused by this, or what, but I certainly like the idea
of making it better.
--
Robert Haas
EDB: http://www.enterprisedb.com
Hi,
On 2022-10-31 16:21:06 +0530, Bharath Rupireddy wrote:
BTW, I've seen a sporadic crash (SEGV) with the patch in bg writer
with the same set up [1], I'm not sure if it's really because of the
patch. I'm unable to reproduce it now and unfortunately I didn't
capture further details when it occurred.
That's likely because the prototype patch I submitted in this thread missed
updating LWLockUpdateVar().
Updated patch attached.
Greetings,
Andres Freund
Attachments:
v3-0001-lwlock-fix-quadratic-behaviour-with-very-long-wai.patch (text/x-diff; charset=us-ascii)
From fbdd1345936d0e9909061fb6101831ade657b1b0 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Thu, 27 Oct 2022 10:18:13 -0700
Subject: [PATCH v3] lwlock: fix quadratic behaviour with very long wait lists
Author: Andres Freund <andres@anarazel.de>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Discussion: https://postgr.es/m/20221027165914.2hofzp4cvutj6gin@awork3.anarazel.de
Discussion: https://postgr.es/m/CALj2ACXktNbG=K8Xi7PSqbofTZozavhaxjatVc14iYaLu4Maag@mail.gmail.com
Backpatch:
---
src/include/storage/lwlock.h | 8 +++++
src/include/storage/proc.h | 2 +-
src/backend/access/transam/twophase.c | 2 +-
src/backend/storage/lmgr/lwlock.c | 50 +++++++++++++++------------
src/backend/storage/lmgr/proc.c | 4 +--
5 files changed, 39 insertions(+), 27 deletions(-)
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index ca4eca76f41..a494cb598ff 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -23,6 +23,14 @@
struct PGPROC;
+/* what state of the wait process is a backend in */
+typedef enum LWLockWaitState
+{
+ LW_WS_NOT_WAITING, /* not currently waiting / woken up */
+ LW_WS_WAITING, /* currently waiting */
+ LW_WS_PENDING_WAKEUP, /* removed from waitlist, but not yet signalled */
+} LWLockWaitState;
+
/*
* Code outside of lwlock.c should not manipulate the contents of this
* structure directly, but we have to declare it here to allow LWLocks to be
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 8d096fdeeb1..aa13e1d66e8 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -217,7 +217,7 @@ struct PGPROC
bool recoveryConflictPending;
/* Info about LWLock the process is currently waiting for, if any. */
- bool lwWaiting; /* true if waiting for an LW lock */
+ uint8 lwWaiting; /* see LWLockWaitState */
uint8 lwWaitMode; /* lwlock mode being waited for */
proclist_node lwWaitLink; /* position in LW lock wait list */
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 803d169f578..5017f4451eb 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -485,7 +485,7 @@ MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
proc->roleId = owner;
proc->tempNamespaceId = InvalidOid;
proc->isBackgroundWorker = false;
- proc->lwWaiting = false;
+ proc->lwWaiting = LW_WS_NOT_WAITING;
proc->lwWaitMode = 0;
proc->waitLock = NULL;
proc->waitProcLock = NULL;
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 0fc0cf6ebbd..a8a439ad516 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -987,6 +987,14 @@ LWLockWakeup(LWLock *lock)
wokeup_somebody = true;
}
+ /*
+ * Signal that the process isn't on the wait list anymore. This allows
+ * LWLockDequeueSelf() to remove itself of the waitlist with a
+ * proclist_delete(), rather than having to check if it has been
+ * removed from the list.
+ */
+ waiter->lwWaiting = LW_WS_PENDING_WAKEUP;
+
/*
* Once we've woken up an exclusive lock, there's no point in waking
* up anybody else.
@@ -1044,7 +1052,7 @@ LWLockWakeup(LWLock *lock)
* another lock.
*/
pg_write_barrier();
- waiter->lwWaiting = false;
+ waiter->lwWaiting = LW_WS_NOT_WAITING;
PGSemaphoreUnlock(waiter->sem);
}
}
@@ -1065,7 +1073,7 @@ LWLockQueueSelf(LWLock *lock, LWLockMode mode)
if (MyProc == NULL)
elog(PANIC, "cannot wait without a PGPROC structure");
- if (MyProc->lwWaiting)
+ if (MyProc->lwWaiting != LW_WS_NOT_WAITING)
elog(PANIC, "queueing for lock while waiting on another one");
LWLockWaitListLock(lock);
@@ -1073,7 +1081,7 @@ LWLockQueueSelf(LWLock *lock, LWLockMode mode)
/* setting the flag is protected by the spinlock */
pg_atomic_fetch_or_u32(&lock->state, LW_FLAG_HAS_WAITERS);
- MyProc->lwWaiting = true;
+ MyProc->lwWaiting = LW_WS_WAITING;
MyProc->lwWaitMode = mode;
/* LW_WAIT_UNTIL_FREE waiters are always at the front of the queue */
@@ -1100,8 +1108,7 @@ LWLockQueueSelf(LWLock *lock, LWLockMode mode)
static void
LWLockDequeueSelf(LWLock *lock)
{
- bool found = false;
- proclist_mutable_iter iter;
+ bool on_waitlist;
#ifdef LWLOCK_STATS
lwlock_stats *lwstats;
@@ -1114,18 +1121,13 @@ LWLockDequeueSelf(LWLock *lock)
LWLockWaitListLock(lock);
/*
- * Can't just remove ourselves from the list, but we need to iterate over
- * all entries as somebody else could have dequeued us.
+ * Remove ourselves from the waitlist, unless we've already been
+ * removed. The removal happens with the wait list lock held, so there's
+ * no race in this check.
*/
- proclist_foreach_modify(iter, &lock->waiters, lwWaitLink)
- {
- if (iter.cur == MyProc->pgprocno)
- {
- found = true;
- proclist_delete(&lock->waiters, iter.cur, lwWaitLink);
- break;
- }
- }
+ on_waitlist = MyProc->lwWaiting == LW_WS_WAITING;
+ if (on_waitlist)
+ proclist_delete(&lock->waiters, MyProc->pgprocno, lwWaitLink);
if (proclist_is_empty(&lock->waiters) &&
(pg_atomic_read_u32(&lock->state) & LW_FLAG_HAS_WAITERS) != 0)
@@ -1137,8 +1139,8 @@ LWLockDequeueSelf(LWLock *lock)
LWLockWaitListUnlock(lock);
/* clear waiting state again, nice for debugging */
- if (found)
- MyProc->lwWaiting = false;
+ if (on_waitlist)
+ MyProc->lwWaiting = LW_WS_NOT_WAITING;
else
{
int extraWaits = 0;
@@ -1162,7 +1164,7 @@ LWLockDequeueSelf(LWLock *lock)
for (;;)
{
PGSemaphoreLock(MyProc->sem);
- if (!MyProc->lwWaiting)
+ if (MyProc->lwWaiting == LW_WS_NOT_WAITING)
break;
extraWaits++;
}
@@ -1313,7 +1315,7 @@ LWLockAcquire(LWLock *lock, LWLockMode mode)
for (;;)
{
PGSemaphoreLock(proc->sem);
- if (!proc->lwWaiting)
+ if (proc->lwWaiting == LW_WS_NOT_WAITING)
break;
extraWaits++;
}
@@ -1478,7 +1480,7 @@ LWLockAcquireOrWait(LWLock *lock, LWLockMode mode)
for (;;)
{
PGSemaphoreLock(proc->sem);
- if (!proc->lwWaiting)
+ if (proc->lwWaiting == LW_WS_NOT_WAITING)
break;
extraWaits++;
}
@@ -1694,7 +1696,7 @@ LWLockWaitForVar(LWLock *lock, uint64 *valptr, uint64 oldval, uint64 *newval)
for (;;)
{
PGSemaphoreLock(proc->sem);
- if (!proc->lwWaiting)
+ if (proc->lwWaiting == LW_WS_NOT_WAITING)
break;
extraWaits++;
}
@@ -1772,6 +1774,8 @@ LWLockUpdateVar(LWLock *lock, uint64 *valptr, uint64 val)
proclist_delete(&lock->waiters, iter.cur, lwWaitLink);
proclist_push_tail(&wakeup, iter.cur, lwWaitLink);
+
+ waiter->lwWaiting = LW_WS_PENDING_WAKEUP;
}
/* We are done updating shared state of the lock itself. */
@@ -1787,7 +1791,7 @@ LWLockUpdateVar(LWLock *lock, uint64 *valptr, uint64 val)
proclist_delete(&wakeup, iter.cur, lwWaitLink);
/* check comment in LWLockWakeup() about this barrier */
pg_write_barrier();
- waiter->lwWaiting = false;
+ waiter->lwWaiting = LW_WS_NOT_WAITING;
PGSemaphoreUnlock(waiter->sem);
}
}
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 13fa07b0ff7..01af5ab4a2e 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -397,7 +397,7 @@ InitProcess(void)
/* NB -- autovac launcher intentionally does not set IS_AUTOVACUUM */
if (IsAutoVacuumWorkerProcess())
MyProc->statusFlags |= PROC_IS_AUTOVACUUM;
- MyProc->lwWaiting = false;
+ MyProc->lwWaiting = LW_WS_NOT_WAITING;
MyProc->lwWaitMode = 0;
MyProc->waitLock = NULL;
MyProc->waitProcLock = NULL;
@@ -579,7 +579,7 @@ InitAuxiliaryProcess(void)
MyProc->isBackgroundWorker = IsBackgroundWorker;
MyProc->delayChkptFlags = 0;
MyProc->statusFlags = 0;
- MyProc->lwWaiting = false;
+ MyProc->lwWaiting = LW_WS_NOT_WAITING;
MyProc->lwWaitMode = 0;
MyProc->waitLock = NULL;
MyProc->waitProcLock = NULL;
--
2.38.0
On Mon, Oct 31, 2022 at 4:51 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2022-10-31 16:21:06 +0530, Bharath Rupireddy wrote:
BTW, I've seen a sporadic crash (SEGV) with the patch in bg writer
with the same set up [1], I'm not sure if it's really because of the
patch. I'm unable to reproduce it now and unfortunately I didn't
capture further details when it occurred.
That's likely because the prototype patch I submitted in this thread missed
updating LWLockUpdateVar().
Updated patch attached.
Greetings,
Andres Freund
Hi,
Minor comment:
+ uint8 lwWaiting; /* see LWLockWaitState */
Why not declare `lwWaiting` of type LWLockWaitState ?
Cheers
Hi,
On 2022-10-31 17:17:03 -0700, Zhihong Yu wrote:
On Mon, Oct 31, 2022 at 4:51 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2022-10-31 16:21:06 +0530, Bharath Rupireddy wrote:
BTW, I've seen a sporadic crash (SEGV) with the patch in bg writer
with the same set up [1], I'm not sure if it's really because of the
patch. I'm unable to reproduce it now and unfortunately I didn't
capture further details when it occurred.
That's likely because the prototype patch I submitted in this thread missed
updating LWLockUpdateVar().
Updated patch attached.
Greetings,
Andres Freund
Hi,
Minor comment:
+ uint8 lwWaiting; /* see LWLockWaitState */
Why not declare `lwWaiting` of type LWLockWaitState ?
Unfortunately C99 (*) doesn't allow specifying the width of an enum
field. With most compilers we'd end up using 4 bytes.
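For what it's worth, a tiny standalone illustration of that size difference
(just my own snippet; the exact sizes are compiler- and ABI-dependent, but on
the usual x86-64 ABIs an enum field is int-sized):

#include <stdint.h>
#include <stdio.h>

typedef enum {WS_NOT_WAITING, WS_WAITING, WS_PENDING_WAKEUP} WaitState;

struct with_enum
{
	WaitState	lwWaiting;		/* typically 4 bytes */
	uint8_t		lwWaitMode;
};

struct with_uint8
{
	uint8_t		lwWaiting;		/* 1 byte; valid values documented via the enum */
	uint8_t		lwWaitMode;
};

int
main(void)
{
	/* commonly prints "8 2" on x86-64, alignment padding included */
	printf("%zu %zu\n", sizeof(struct with_enum), sizeof(struct with_uint8));
	return 0;
}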
Greetings,
Andres Freund
(*) C++ has allowed specifying this for quite a few years now and I think C23
will support it too, but that doesn't help us at this point.
On Mon, Oct 31, 2022 at 5:19 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2022-10-31 17:17:03 -0700, Zhihong Yu wrote:
On Mon, Oct 31, 2022 at 4:51 PM Andres Freund <andres@anarazel.de>
wrote:
Hi,
On 2022-10-31 16:21:06 +0530, Bharath Rupireddy wrote:
BTW, I've seen a sporadic crash (SEGV) with the patch in bg writer
with the same set up [1], I'm not sure if it's really because of the
patch. I'm unable to reproduce it now and unfortunately I didn't
capture further details when it occurred.
That's likely because the prototype patch I submitted in this thread missed
updating LWLockUpdateVar().
Updated patch attached.
Greetings,
Andres Freund
Hi,
Minor comment:
+ uint8 lwWaiting; /* see LWLockWaitState */
Why not declare `lwWaiting` of type LWLockWaitState ?
Unfortunately C99 (*) doesn't allow specifying the width of an enum
field. With most compilers we'd end up using 4 bytes.
Greetings,
Andres Freund
(*) C++ has allowed specifying this for quite a few years now and I think C23
will support it too, but that doesn't help us at this point.
Hi,
Thanks for the response.
If possible, it would be better to put your explanation in the code comment
(so that other people know the reasoning).
On Tue, Nov 1, 2022 at 5:21 AM Andres Freund <andres@anarazel.de> wrote:
On 2022-10-31 16:21:06 +0530, Bharath Rupireddy wrote:
BTW, I've seen a sporadic crash (SEGV) with the patch in bg writer
with the same set up [1], I'm not sure if it's really because of the
patch. I'm unable to reproduce it now and unfortunately I didn't
capture further details when it occurred.
That's likely because the prototype patch I submitted in this thread missed
updating LWLockUpdateVar().
Updated patch attached.
Thanks. It looks good to me. However, some minor comments on the v3 patch:
1.
- if (MyProc->lwWaiting)
+ if (MyProc->lwWaiting != LW_WS_NOT_WAITING)
elog(PANIC, "queueing for lock while waiting on another one");
Can the above condition be MyProc->lwWaiting == LW_WS_WAITING ||
MyProc->lwWaiting == LW_WS_PENDING_WAKEUP for better readability?
Or add an assertion Assert(MyProc->lwWaiting != LW_WS_WAITING &&
MyProc->lwWaiting != LW_WS_PENDING_WAKEUP); before setting
LW_WS_WAITING?
2.
/* Awaken any waiters I removed from the queue. */
proclist_foreach_modify(iter, &wakeup, lwWaitLink)
{
@@ -1044,7 +1052,7 @@ LWLockWakeup(LWLock *lock)
* another lock.
*/
pg_write_barrier();
- waiter->lwWaiting = false;
+ waiter->lwWaiting = LW_WS_NOT_WAITING;
PGSemaphoreUnlock(waiter->sem);
}
/*
* Awaken any waiters I removed from the queue.
*/
proclist_foreach_modify(iter, &wakeup, lwWaitLink)
{
PGPROC *waiter = GetPGProcByNumber(iter.cur);
proclist_delete(&wakeup, iter.cur, lwWaitLink);
/* check comment in LWLockWakeup() about this barrier */
pg_write_barrier();
waiter->lwWaiting = LW_WS_NOT_WAITING;
Can we add an assertion Assert(waiter->lwWaiting ==
LW_WS_PENDING_WAKEUP) in the above two places? We prepare the wakeup
list and set the LW_WS_NOT_WAITING flag in above loops, having an
assertion is better here IMO.
Below are test results with v3 patch. +1 for back-patching it.
Clients HEAD PATCHED
1 34142 34289
2 72760 69720
4 136300 131848
8 210809 210192
16 240718 242744
32 297587 297354
64 341939 343036
128 383615 383801
256 342094 337680
512 263194 288629
768 145526 261553
1024 107267 241811
2048 35716 188389
4096 12415 120300
Clients PG15 PATCHED
1 34503 34078
2 73708 72054
4 139415 133321
8 212396 211390
16 242227 242584
32 303441 309288
64 362680 339211
128 378645 344291
256 340016 344291
512 290044 293337
768 140277 264618
1024 96191 247636
2048 35158 181488
4096 12164 118610
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Tue, Nov 1, 2022 at 3:17 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
Below are test results with v3 patch. +1 for back-patching it.
The problem with back-patching stuff like this is that it can have
unanticipated consequences. I think that the chances of something like
this backfiring are less than for a patch that changes plans, but I
don't think that they're nil, either. It could turn out that this
patch, which has really promising results on the workloads we've
tested, harms some other workload due to some other contention pattern
we can't foresee. It could also turn out that improving performance at
the database level actually has negative consequences for some
application using the database, because the application could be
unknowingly relying on the database to throttle its activity.
It's hard for me to estimate exactly what the risk of a patch like
this is. I think that if we back-patched this, and only this, perhaps
the chances of something bad happening aren't incredibly high. But if
we get into the habit of back-patching seemingly-innocuous performance
improvements, it's only a matter of time before one of them turns out
not to be so innocuous as we thought. I would guess that the number of
times we have to back-patch something like this before somebody starts
complaining about a regression is likely to be somewhere between 3 and
5.
It's possible that I'm too pessimistic, though.
--
Robert Haas
EDB: http://www.enterprisedb.com
On 11/1/22 8:37 AM, Robert Haas wrote:
On Tue, Nov 1, 2022 at 3:17 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
Below are test results with v3 patch. +1 for back-patching it.
First, awesome find and proposed solution!
The problem with back-patching stuff like this is that it can have
unanticipated consequences. I think that the chances of something like
this backfiring are less than for a patch that changes plans, but I
don't think that they're nil, either. It could turn out that this
patch, which has really promising results on the workloads we've
tested, harms some other workload due to some other contention pattern
we can't foresee. It could also turn out that improving performance at
the database level actually has negative consequences for some
application using the database, because the application could be
unknowingly relying on the database to throttle its activity.
If someone is using the database to throttle activity for their app, I
have a bunch of follow up questions to understand why.
It's hard for me to estimate exactly what the risk of a patch like
this is. I think that if we back-patched this, and only this, perhaps
the chances of something bad happening aren't incredibly high. But if
we get into the habit of back-patching seemingly-innocuous performance
improvements, it's only a matter of time before one of them turns out
not to be so innocuous as we thought. I would guess that the number of
times we have to back-patch something like this before somebody starts
complaining about a regression is likely to be somewhere between 3 and
5.
Having the privilege of reading through the release notes for every
update release, I see on average 1-2 "performance improvements" in each
release. I believe those tend to be more negligible than this one, though.
I do understand the concerns. Say you discover your workload does have a
regression with this patch and then there's a CVE that you want to
accept -- what do you do? Reading the thread / patch, it seems as if
this is a lower risk "performance fix", but still nonzero.
While this does affect all supported versions, we could also consider
backpatching only for PG15. That at least 1/ limits impact on users
running older versions (opting into a major version upgrade) and 2/
we're still very early in the major upgrade cycle for PG15 that it's
lower risk if there are issues.
Users are generally happy when they can perform a simple upgrade and get
a performance boost, particularly the set of users that this patch
affects most (high throughput, high connection count). This is the type
of fix that would make headlines in a major release announcement (10x
TPS improvement w/4096 connections?!). Part of the tradeoff of
backpatching this is that we may lose some of the higher-visibility
marketing opportunities to discuss this (though I'm sure there will be
plenty of blog posts, etc.)
Andres: when you suggested backpatching, were you thinking of the Nov
2022 release or the Feb 2023 release?
Thanks,
Jonathan
Hi,
On 2022-11-01 08:37:39 -0400, Robert Haas wrote:
On Tue, Nov 1, 2022 at 3:17 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
Below are test results with v3 patch. +1 for back-patching it.
The problem with back-patching stuff like this is that it can have
unanticipated consequences. I think that the chances of something like
this backfiring are less than for a patch that changes plans, but I
don't think that they're nil, either. It could turn out that this
patch, which has really promising results on the workloads we've
tested, harms some other workload due to some other contention pattern
we can't foresee. It could also turn out that improving performance at
the database level actually has negative consequences for some
application using the database, because the application could be
unknowingly relying on the database to throttle its activity.
It's hard for me to estimate exactly what the risk of a patch like
this is. I think that if we back-patched this, and only this, perhaps
the chances of something bad happening aren't incredibly high. But if
we get into the habit of back-patching seemingly-innocuous performance
improvements, it's only a matter of time before one of them turns out
not to be so innocuous as we thought. I would guess that the number of
times we have to back-patch something like this before somebody starts
complaining about a regression is likely to be somewhere between 3 and
5.
In general I agree, we shouldn't default to backpatching performance
fixes. The reason I am even considering it in this case is that it's a
readily reproducible issue, leading to a quadratic behaviour that's extremely
hard to pinpoint. There's no increase in CPU usage, no wait event for
spinlocks, the system doesn't even get stuck (because the wait list lock is
held after the lwlock lock release). I don't think users have a decent chance
at figuring out that this is the issue.
I'm not at all convinced we should backpatch either, just to be clear.
Greetings,
Andres Freund
Hi,
On 2022-11-01 11:19:02 -0400, Jonathan S. Katz wrote:
This is the type of fix that would make headlines in a major release
announcement (10x TPS improvement w/4096 connections?!). That is also part
of the tradeoff of backpatching this, is that we may lose some of the higher
visibility marketing opportunities to discuss this (though I'm sure there
will be plenty of blog posts, etc.)
(read the next paragraph with the caveat that results below prove it somewhat
wrong)
I don't think the fix is as big a deal as the above makes it sound - you need
to do somewhat extreme things to hit the problem. Yes, it drastically improves
the scalability of e.g. doing SELECT txid_current() across as many sessions as
possible - but that's not something you normally do (it was a good candidate
to show the problem because it's a single lock but doesn't trigger WAL flushes
at commit).
You can probably hit the problem with many concurrent single-tx INSERTs, but
you'd need to have synchronous_commit=off or fsync=off (or a very expensive
server class SSD with battery backup) and the effect is likely smaller.
Andres: when you suggested backpatching, were you thinking of the Nov 2022
release or the Feb 2023 release?
I wasn't thinking that concretely. Even if we decide to backpatch, I'd be very
hesitant to do it in a few days.
<goes and runs test while in meeting>
I tested with browser etc running, so this is plenty noisy. I used the best of
the two pgbench -T21 -P5 tps, after ignoring the first two periods (they're
too noisy). I used an ok-ish NVMe SSD, rather than the expensive one that
has "free" fsync.
synchronous_commit=on:
clients master fix
16 6196 6202
64 25716 25545
256 90131 90240
1024 128556 151487
2048 59417 157050
4096 32252 178823
synchronous_commit=off:
clients master fix
16 409828 409016
64 454257 455804
256 304175 452160
1024 135081 334979
2048 66124 291582
4096 27019 245701
Hm. That's a bigger effect than I anticipated. I guess sc=off isn't actually
required, due to the level of concurrency making group commit very
effective.
This is without an index, serial column or anything. But a quick comparison
for just 4096 clients shows that there is still a big difference if I create a
serial primary key:
master: 26172
fix: 155813
Greetings,
Andres Freund
On 11/1/22 1:41 PM, Andres Freund wrote:
Andres: when you suggested backpatching, were you thinking of the Nov 2022
release or the Feb 2023 release?
I wasn't thinking that concretely. Even if we decide to backpatch, I'd be very
hesitant to do it in a few days.
Yeah this was my thinking (and also why I took a few days to reply given
the lack of urgency for this release). It would at least give some more
time for others to test it to feel confident that we're not introducing
noticeable regressions.
<goes and runs test while in meeting>
I tested with browser etc running, so this is plenty noisy. I used the best of
the two pgbench -T21 -P5 tps, after ignoring the first two periods (they're
too noisy). I used an ok-ish NVMe SSD, rather than the the expensive one that
has "free" fsync.synchronous_commit=on:
clients master fix
16 6196 6202
64 25716 25545
256 90131 90240
1024 128556 151487
2048 59417 157050
4096 32252 178823
synchronous_commit=off:
clients master fix
16 409828 409016
64 454257 455804
256 304175 452160
1024 135081 334979
2048 66124 291582
4096 27019 245701
Hm. That's a bigger effect than I anticipated. I guess sc=off isn't actually
required, due to the level of concurrency making group commit very
effective.
This is without an index, serial column or anything. But a quick comparison
for just 4096 clients shows that there is still a big difference if I create a
serial primary key:
master: 26172
fix: 155813
🤯 (seeing if my exploding head makes it into the archives).
Given the lack of ABI changes (hesitant to say low-risk until after more
testing, but seemingly low-risk), I can get behind backpatching, especially if
we're targeting Feb 2023 so we can test some more.
With my advocacy hat on, it bums me that we may not get as much buzz
about this change given it's not in a major release, but 1/ it'll fix an
issue that will help users with high-concurrency workloads and 2/ users would be
able to perform a simpler update to get the change.
Thanks,
Jonathan
On Tue, Nov 1, 2022 at 12:46 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
Updated patch attached.
Thanks. It looks good to me. However, some minor comments on the v3 patch:
1.
- if (MyProc->lwWaiting)
+ if (MyProc->lwWaiting != LW_WS_NOT_WAITING)
elog(PANIC, "queueing for lock while waiting on another one");
Can the above condition be MyProc->lwWaiting == LW_WS_WAITING ||
MyProc->lwWaiting == LW_WS_PENDING_WAKEUP for better readability?
Or add an assertion Assert(MyProc->lwWaiting != LW_WS_WAITING &&
MyProc->lwWaiting != LW_WS_PENDING_WAKEUP); before setting
LW_WS_WAITING?
2.
/* Awaken any waiters I removed from the queue. */
proclist_foreach_modify(iter, &wakeup, lwWaitLink)
{
@@ -1044,7 +1052,7 @@ LWLockWakeup(LWLock *lock)
* another lock.
*/
pg_write_barrier();
- waiter->lwWaiting = false;
+ waiter->lwWaiting = LW_WS_NOT_WAITING;
PGSemaphoreUnlock(waiter->sem);
}
/*
* Awaken any waiters I removed from the queue.
*/
proclist_foreach_modify(iter, &wakeup, lwWaitLink)
{
PGPROC *waiter = GetPGProcByNumber(iter.cur);
proclist_delete(&wakeup, iter.cur, lwWaitLink);
/* check comment in LWLockWakeup() about this barrier */
pg_write_barrier();
waiter->lwWaiting = LW_WS_NOT_WAITING;
Can we add an assertion Assert(waiter->lwWaiting ==
LW_WS_PENDING_WAKEUP) in the above two places? We prepare the wakeup
list and set the LW_WS_NOT_WAITING flag in above loops, having an
assertion is better here IMO.
Below are test results with v3 patch. +1 for back-patching it.
Clients HEAD PATCHED
1 34142 34289
2 72760 69720
4 136300 131848
8 210809 210192
16 240718 242744
32 297587 297354
64 341939 343036
128 383615 383801
256 342094 337680
512 263194 288629
768 145526 261553
1024 107267 241811
2048 35716 188389
Clients PG15 PATCHED
1 34503 34078
2 73708 72054
4 139415 133321
8 212396 211390
16 242227 242584
32 303441 309288
64 362680 339211
128 378645 344291
256 340016 344291
512 290044 293337
768 140277 264618
1024 96191 247636
2048 35158 181488
4096 12164 118610
I looked at the v3 patch again today and ran some performance tests.
The results look as impressive as they did earlier. Andres, any plans to
get this in?
pgbench with SELECT txid_current();:
Clients HEAD PATCHED
1 34613 33611
2 72634 70546
4 137885 136911
8 216470 216076
16 242535 245392
32 299952 304740
64 329788 347401
128 378296 386873
256 344939 343832
512 292196 295839
768 144212 260102
1024 101525 250263
2048 35594 185878
4096 11842 104227
pgbench with insert into pgbench_accounts table:
Clients HEAD PATCHED
1 1660 1600
2 1848 1746
4 3547 3395
8 7330 6754
16 13103 13613
32 26011 26372
64 52331 52594
128 93313 95526
256 127373 126182
512 126712 127857
768 116765 119227
1024 111464 112499
2048 58838 92756
4096 26066 60543
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,
On 2022-11-09 15:54:16 +0530, Bharath Rupireddy wrote:
On Tue, Nov 1, 2022 at 12:46 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
Updated patch attached.
Thanks. It looks good to me. However, some minor comments on the v3 patch:
1.
- if (MyProc->lwWaiting)
+ if (MyProc->lwWaiting != LW_WS_NOT_WAITING)
elog(PANIC, "queueing for lock while waiting on another one");
Can the above condition be MyProc->lwWaiting == LW_WS_WAITING ||
MyProc->lwWaiting == LW_WS_PENDING_WAKEUP for better readability?
Or add an assertion Assert(MyProc->lwWaiting != LW_WS_WAITING &&
MyProc->lwWaiting != LW_WS_PENDING_WAKEUP); before setting
LW_WS_WAITING?
I don't think that's a good idea - it'll just mean we have to modify more
places if we add another state, without making anything more robust.
2.
/* Awaken any waiters I removed from the queue. */
proclist_foreach_modify(iter, &wakeup, lwWaitLink)
{
@@ -1044,7 +1052,7 @@ LWLockWakeup(LWLock *lock)
* another lock.
*/
pg_write_barrier();
- waiter->lwWaiting = false;
+ waiter->lwWaiting = LW_WS_NOT_WAITING;
PGSemaphoreUnlock(waiter->sem);
}
/*
* Awaken any waiters I removed from the queue.
*/
proclist_foreach_modify(iter, &wakeup, lwWaitLink)
{
PGPROC *waiter = GetPGProcByNumber(iter.cur);
proclist_delete(&wakeup, iter.cur, lwWaitLink);
/* check comment in LWLockWakeup() about this barrier */
pg_write_barrier();
waiter->lwWaiting = LW_WS_NOT_WAITING;
Can we add an assertion Assert(waiter->lwWaiting ==
LW_WS_PENDING_WAKEUP) in the above two places? We prepare the wakeup
list and set the LW_WS_NOT_WAITING flag in above loops, having an
assertion is better here IMO.
I guess it can't hurt - but it's not really related to the changes in the
patch, no?
I looked at the v3 patch again today and ran some performance tests.
The results look as impressive as they did earlier. Andres, any plans to
get this in?
I definitely didn't want to backpatch before this point release. But it seems
we haven't quite reached an agreement on what to do about backpatching. It's
probably best to just commit it to HEAD and let the backpatch discussion
happen concurrently.
I'm on a hike, without any connectivity, Thu afternoon - Sun. I think it's OK
to push it to HEAD if I get it done in the next few hours. Bigger issues,
which I do not expect, should show up before tomorrow afternoon. Smaller
things could wait till Sunday if necessary.
Greetings,
Andres Freund
On 2022-11-09 09:38:08 -0800, Andres Freund wrote:
I'm on a hike, without any connectivity, Thu afternoon - Sun. I think it's OK
to push it to HEAD if I get it done in the next few hours. Bigger issues,
which I do not expect, should show up before tomorrow afternoon. Smaller
things could wait till Sunday if necessary.
I didn't get to it in time, so I'll leave it for when I'm back.
Hi,
On 2022-11-09 17:03:13 -0800, Andres Freund wrote:
On 2022-11-09 09:38:08 -0800, Andres Freund wrote:
I'm on a hike, without any connectivity, Thu afternoon - Sun. I think it's OK
to push it to HEAD if I get it done in the next few hours. Bigger issues,
which I do not expect, should show up before tomorrow afternoon. Smaller
things could wait till Sunday if necessary.
I didn't get to it in time, so I'll leave it for when I'm back.
Took a few days longer, partially because I encountered an independent issue
(see 8c954168cff) while testing.
I pushed it to HEAD now.
I still think it might be worth backpatching in a bit, but so far the votes
on that weren't clear enough for me to feel comfortable.
Regards,
Andres
On 11/20/22 2:56 PM, Andres Freund wrote:
Hi,
On 2022-11-09 17:03:13 -0800, Andres Freund wrote:
On 2022-11-09 09:38:08 -0800, Andres Freund wrote:
I'm on a hike, without any connectivity, Thu afternoon - Sun. I think it's OK
to push it to HEAD if I get it done in the next few hours. Bigger issues,
which I do not expect, should show up before tomorrow afternoon. Smaller
things could wait till Sunday if necessary.
I didn't get to it in time, so I'll leave it for when I'm back.
Took a few days longer, partially because I encountered an independent issue
(see 8c954168cff) while testing.
I pushed it to HEAD now.
Thanks!
I still think it might be worth backpatching in a bit, but so far the votes
on that weren't clear enough for me to feel comfortable.
My general feeling is "yes" on backpatching, particularly if this is a
bug and it's fixable without ABI breaks.
My comments were around performing additional workload benchmarking just
to ensure people feel comfortable that we're not introducing any
performance regressions, and to consider the Feb 2023 release as the
time to introduce this (vs. Nov 2022). That gives us ample time to
determine if there are any performance regressions introduced.
Thanks,
Jonathan
On Mon, Nov 21, 2022 at 10:31:14AM -0500, Jonathan S. Katz wrote:
On 11/20/22 2:56 PM, Andres Freund wrote:
I still think it might be worth backpatching in a bit, but so far the votes
on that weren't clear enough for me to feel comfortable.
My general feeling is "yes" on backpatching, particularly if this is a bug
and it's fixable without ABI breaks.
Now that commit a4adc31 has had some time to bake and concerns about
unintended consequences may have abated, I wanted to revive this
back-patching discussion. I see a few possibly-related reports [0] [1]
[2], and I'm now seeing this in the field, too. While it is debatable
whether this is a bug, it's a quite nasty issue for users, and it's both
difficult to detect and difficult to work around.
Thoughts?
[0]: /messages/by-id/CAM527d-uDn5osa6QPKxHAC6srOfBH3M8iXUM=ewqHV6n=w1u8Q@mail.gmail.com
[1]: /messages/by-id/VI1PR05MB620666631A41186ACC3FC91ACFC70@VI1PR05MB6206.eurprd05.prod.outlook.com
[2]: /messages/by-id/dd0e070809430a31f7ddd8483fbcce59@mail.gmail.com
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Wed, Jan 10, 2024 at 09:17:47PM -0600, Nathan Bossart wrote:
Now that commit a4adc31 has had some time to bake and concerns about
unintended consequences may have abated, I wanted to revive this
back-patching discussion. I see a few possibly-related reports [0] [1]
[2], and I'm now seeing this in the field, too. While it is debatable
whether this is a bug, it's a quite nasty issue for users, and it's both
difficult to detect and difficult to work around.
+1, I've seen this becoming a PITA for a few things. Knowing that the
size of PGPROC does not change at all, I would be in favor for a
backpatch, especially since it's been in the tree for more than 1
year, and even more knowing that we have 16 released with this stuff
in.
--
Michael
On 1/10/24 10:45 PM, Michael Paquier wrote:
On Wed, Jan 10, 2024 at 09:17:47PM -0600, Nathan Bossart wrote:
Now that commit a4adc31 has had some time to bake and concerns about
unintended consequences may have abated, I wanted to revive this
back-patching discussion. I see a few possibly-related reports [0] [1]
[2], and I'm now seeing this in the field, too. While it is debatable
whether this is a bug, it's a quite nasty issue for users, and it's both
difficult to detect and difficult to work around.+1, I've seen this becoming a PITA for a few things. Knowing that the
size of PGPROC does not change at all, I would be in favor for a
backpatch, especially since it's been in the tree for more than 1
year, and even more knowing that we have 16 released with this stuff
in.
I have similar data sources to Nathan/Michael and I'm trying to avoid
piling on, but one case that's interesting occurred after a major
version upgrade from PG10 to PG14 on a database supporting a very
active/highly concurrent workload. On inspection, it seems like
backpatching would help this particular case.
With 10/11 EOL, I do wonder if we'll see more of these reports on
upgrade to < PG16.
(I was in favor of backpatching prior; opinion is unchanged).
Thanks,
Jonathan
On Thu, Jan 11, 2024 at 09:47:33AM -0500, Jonathan S. Katz wrote:
I have similar data sources to Nathan/Michael and I'm trying to avoid piling
on, but one case that's interesting occurred after a major version upgrade
from PG10 to PG14 on a database supporting a very active/highly concurrent
workload. On inspection, it seems like backpatching would help this
particular case.
With 10/11 EOL, I do wonder if we'll see more of these reports on upgrade to
< PG16.
(I was in favor of backpatching prior; opinion is unchanged).
Hearing nothing, I have prepared a set of patches for v12~v15,
checking all the lwlock paths for all the branches. In the end, the
set of changes looks rather sane to me regarding the queue handling.
I have also run some numbers on all the branches, and the test case
posted upthread falls off dramatically after 512 concurrent
connections at the top of all the stable branches :(
For example on REL_12_STABLE with and without the patch attached:
num v12 v12+patch
1 29717.151665 29096.707588
2 63257.709301 61889.476318
4 127921.873393 124575.901330
8 231400.571662 230562.725174
16 343911.185351 312432.897015
32 291748.985280 281011.787701
64 268998.728648 269975.605115
128 297332.597018 286449.176950
256 243902.817657 240559.122309
512 190069.602270 194510.718508
768 58915.650225 165714.707198
1024 39920.950552 149433.836901
2048 16922.391688 108164.301054
4096 6229.063321 69032.338708
I'd like to apply that, just let me know if you have any comments
and/or objections.
--
Michael
Attachments:
0001-lwlock-Fix-quadratic-behavior-with-very-long-wai.v15.patch (text/x-diff; charset=us-ascii)
From 83706cd5792bc002a668fd54cb93c15011a97063 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sun, 20 Nov 2022 11:56:32 -0800
Subject: [PATCH] lwlock: Fix quadratic behavior with very long wait lists
Until now LWLockDequeueSelf() sequentially searched the list of waiters to see
if the current proc is still is on the list of waiters, or has already been
removed. In extreme workloads, where the wait lists are very long, this leads
to a quadratic behavior. #backends iterating over a list #backends
long. Additionally, the likelihood of needing to call LWLockDequeueSelf() in
the first place also increases with the increased length of the wait queue, as
it becomes more likely that a lock is released while waiting for the wait list
lock, which is held for longer during lock release.
Due to the exponential back-off in perform_spin_delay() this is surprisingly
hard to detect. We should make that easier, e.g. by adding a wait event around
the pg_usleep() - but that's a separate patch.
The fix is simple - track whether a proc is currently waiting in the wait list
or already removed but waiting to be woken up in PGPROC->lwWaiting.
In some workloads with a lot of clients contending for a small number of
lwlocks (e.g. WALWriteLock), the fix can substantially increase throughput.
As the quadratic behavior arguably is a bug, we might want to decide to
backpatch this fix in the future.
Author: Andres Freund <andres@anarazel.de>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Discussion: https://postgr.es/m/20221027165914.2hofzp4cvutj6gin@awork3.anarazel.de
Discussion: https://postgr.es/m/CALj2ACXktNbG=K8Xi7PSqbofTZozavhaxjatVc14iYaLu4Maag@mail.gmail.com
---
src/include/storage/lwlock.h | 8 ++++
src/include/storage/proc.h | 2 +-
src/backend/access/transam/twophase.c | 2 +-
src/backend/storage/lmgr/lwlock.c | 53 +++++++++++++++------------
src/backend/storage/lmgr/proc.c | 4 +-
5 files changed, 42 insertions(+), 27 deletions(-)
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index e03d317eea..d88fa4b4ad 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -23,6 +23,14 @@
struct PGPROC;
+/* what state of the wait process is a backend in */
+typedef enum LWLockWaitState
+{
+ LW_WS_NOT_WAITING, /* not currently waiting / woken up */
+ LW_WS_WAITING, /* currently waiting */
+ LW_WS_PENDING_WAKEUP /* removed from waitlist, but not yet signalled */
+} LWLockWaitState;
+
/*
* Code outside of lwlock.c should not manipulate the contents of this
* structure directly, but we have to declare it here to allow LWLocks to be
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 2579e619eb..2ea773098b 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -211,7 +211,7 @@ struct PGPROC
bool recoveryConflictPending;
/* Info about LWLock the process is currently waiting for, if any. */
- bool lwWaiting; /* true if waiting for an LW lock */
+ uint8 lwWaiting; /* see LWLockWaitState */
uint8 lwWaitMode; /* lwlock mode being waited for */
proclist_node lwWaitLink; /* position in LW lock wait list */
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 5293c6920b..ca7037eb2f 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -486,7 +486,7 @@ MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
proc->roleId = owner;
proc->tempNamespaceId = InvalidOid;
proc->isBackgroundWorker = false;
- proc->lwWaiting = false;
+ proc->lwWaiting = LW_WS_NOT_WAITING;
proc->lwWaitMode = 0;
proc->waitLock = NULL;
proc->waitProcLock = NULL;
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 3bb448627c..ea34ee785c 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -987,6 +987,15 @@ LWLockWakeup(LWLock *lock)
wokeup_somebody = true;
}
+ /*
+ * Signal that the process isn't on the wait list anymore. This allows
+ * LWLockDequeueSelf() to remove itself of the waitlist with a
+ * proclist_delete(), rather than having to check if it has been
+ * removed from the list.
+ */
+ Assert(waiter->lwWaiting == LW_WS_WAITING);
+ waiter->lwWaiting = LW_WS_PENDING_WAKEUP;
+
/*
* Once we've woken up an exclusive lock, there's no point in waking
* up anybody else.
@@ -1044,7 +1053,7 @@ LWLockWakeup(LWLock *lock)
* another lock.
*/
pg_write_barrier();
- waiter->lwWaiting = false;
+ waiter->lwWaiting = LW_WS_NOT_WAITING;
PGSemaphoreUnlock(waiter->sem);
}
}
@@ -1065,7 +1074,7 @@ LWLockQueueSelf(LWLock *lock, LWLockMode mode)
if (MyProc == NULL)
elog(PANIC, "cannot wait without a PGPROC structure");
- if (MyProc->lwWaiting)
+ if (MyProc->lwWaiting != LW_WS_NOT_WAITING)
elog(PANIC, "queueing for lock while waiting on another one");
LWLockWaitListLock(lock);
@@ -1073,7 +1082,7 @@ LWLockQueueSelf(LWLock *lock, LWLockMode mode)
/* setting the flag is protected by the spinlock */
pg_atomic_fetch_or_u32(&lock->state, LW_FLAG_HAS_WAITERS);
- MyProc->lwWaiting = true;
+ MyProc->lwWaiting = LW_WS_WAITING;
MyProc->lwWaitMode = mode;
/* LW_WAIT_UNTIL_FREE waiters are always at the front of the queue */
@@ -1100,8 +1109,7 @@ LWLockQueueSelf(LWLock *lock, LWLockMode mode)
static void
LWLockDequeueSelf(LWLock *lock)
{
- bool found = false;
- proclist_mutable_iter iter;
+ bool on_waitlist;
#ifdef LWLOCK_STATS
lwlock_stats *lwstats;
@@ -1114,18 +1122,13 @@ LWLockDequeueSelf(LWLock *lock)
LWLockWaitListLock(lock);
/*
- * Can't just remove ourselves from the list, but we need to iterate over
- * all entries as somebody else could have dequeued us.
+ * Remove ourselves from the waitlist, unless we've already been
+ * removed. The removal happens with the wait list lock held, so there's
+ * no race in this check.
*/
- proclist_foreach_modify(iter, &lock->waiters, lwWaitLink)
- {
- if (iter.cur == MyProc->pgprocno)
- {
- found = true;
- proclist_delete(&lock->waiters, iter.cur, lwWaitLink);
- break;
- }
- }
+ on_waitlist = MyProc->lwWaiting == LW_WS_WAITING;
+ if (on_waitlist)
+ proclist_delete(&lock->waiters, MyProc->pgprocno, lwWaitLink);
if (proclist_is_empty(&lock->waiters) &&
(pg_atomic_read_u32(&lock->state) & LW_FLAG_HAS_WAITERS) != 0)
@@ -1137,8 +1140,8 @@ LWLockDequeueSelf(LWLock *lock)
LWLockWaitListUnlock(lock);
/* clear waiting state again, nice for debugging */
- if (found)
- MyProc->lwWaiting = false;
+ if (on_waitlist)
+ MyProc->lwWaiting = LW_WS_NOT_WAITING;
else
{
int extraWaits = 0;
@@ -1162,7 +1165,7 @@ LWLockDequeueSelf(LWLock *lock)
for (;;)
{
PGSemaphoreLock(MyProc->sem);
- if (!MyProc->lwWaiting)
+ if (MyProc->lwWaiting == LW_WS_NOT_WAITING)
break;
extraWaits++;
}
@@ -1313,7 +1316,7 @@ LWLockAcquire(LWLock *lock, LWLockMode mode)
for (;;)
{
PGSemaphoreLock(proc->sem);
- if (!proc->lwWaiting)
+ if (proc->lwWaiting == LW_WS_NOT_WAITING)
break;
extraWaits++;
}
@@ -1478,7 +1481,7 @@ LWLockAcquireOrWait(LWLock *lock, LWLockMode mode)
for (;;)
{
PGSemaphoreLock(proc->sem);
- if (!proc->lwWaiting)
+ if (proc->lwWaiting == LW_WS_NOT_WAITING)
break;
extraWaits++;
}
@@ -1694,7 +1697,7 @@ LWLockWaitForVar(LWLock *lock, uint64 *valptr, uint64 oldval, uint64 *newval)
for (;;)
{
PGSemaphoreLock(proc->sem);
- if (!proc->lwWaiting)
+ if (proc->lwWaiting == LW_WS_NOT_WAITING)
break;
extraWaits++;
}
@@ -1772,6 +1775,10 @@ LWLockUpdateVar(LWLock *lock, uint64 *valptr, uint64 val)
proclist_delete(&lock->waiters, iter.cur, lwWaitLink);
proclist_push_tail(&wakeup, iter.cur, lwWaitLink);
+
+ /* see LWLockWakeup() */
+ Assert(waiter->lwWaiting == LW_WS_WAITING);
+ waiter->lwWaiting = LW_WS_PENDING_WAKEUP;
}
/* We are done updating shared state of the lock itself. */
@@ -1787,7 +1794,7 @@ LWLockUpdateVar(LWLock *lock, uint64 *valptr, uint64 val)
proclist_delete(&wakeup, iter.cur, lwWaitLink);
/* check comment in LWLockWakeup() about this barrier */
pg_write_barrier();
- waiter->lwWaiting = false;
+ waiter->lwWaiting = LW_WS_NOT_WAITING;
PGSemaphoreUnlock(waiter->sem);
}
}
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 1b514f44e3..c60707b9db 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -397,7 +397,7 @@ InitProcess(void)
/* NB -- autovac launcher intentionally does not set IS_AUTOVACUUM */
if (IsAutoVacuumWorkerProcess())
MyProc->statusFlags |= PROC_IS_AUTOVACUUM;
- MyProc->lwWaiting = false;
+ MyProc->lwWaiting = LW_WS_NOT_WAITING;
MyProc->lwWaitMode = 0;
MyProc->waitLock = NULL;
MyProc->waitProcLock = NULL;
@@ -579,7 +579,7 @@ InitAuxiliaryProcess(void)
MyProc->isBackgroundWorker = IsBackgroundWorker;
MyProc->delayChkptFlags = 0;
MyProc->statusFlags = 0;
- MyProc->lwWaiting = false;
+ MyProc->lwWaiting = LW_WS_NOT_WAITING;
MyProc->lwWaitMode = 0;
MyProc->waitLock = NULL;
MyProc->waitProcLock = NULL;
--
2.43.0
Attachment: 0001-lwlock-Fix-quadratic-behavior-with-very-long-wai.v14.patch (text/x-diff)
From 6066487bc301e5f725a1f0c2e696ccde0f5e06bf Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sun, 20 Nov 2022 11:56:32 -0800
Subject: [PATCH] lwlock: Fix quadratic behavior with very long wait lists
Until now LWLockDequeueSelf() sequentially searched the list of waiters to see
if the current proc is still on the list of waiters, or has already been
removed. In extreme workloads, where the wait lists are very long, this leads
to a quadratic behavior. #backends iterating over a list #backends
long. Additionally, the likelihood of needing to call LWLockDequeueSelf() in
the first place also increases with the increased length of the wait queue, as
it becomes more likely that a lock is released while waiting for the wait list
lock, which is held for longer during lock release.
Due to the exponential back-off in perform_spin_delay() this is surprisingly
hard to detect. We should make that easier, e.g. by adding a wait event around
the pg_usleep() - but that's a separate patch.
The fix is simple - track whether a proc is currently waiting in the wait list
or already removed but waiting to be woken up in PGPROC->lwWaiting.
In some workloads with a lot of clients contending for a small number of
lwlocks (e.g. WALWriteLock), the fix can substantially increase throughput.
As the quadratic behavior arguably is a bug, we might want to decide to
backpatch this fix in the future.
Author: Andres Freund <andres@anarazel.de>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Discussion: https://postgr.es/m/20221027165914.2hofzp4cvutj6gin@awork3.anarazel.de
Discussion: https://postgr.es/m/CALj2ACXktNbG=K8Xi7PSqbofTZozavhaxjatVc14iYaLu4Maag@mail.gmail.com
---
src/include/storage/lwlock.h | 8 ++++
src/include/storage/proc.h | 2 +-
src/backend/access/transam/twophase.c | 2 +-
src/backend/storage/lmgr/lwlock.c | 53 +++++++++++++++------------
src/backend/storage/lmgr/proc.c | 4 +-
5 files changed, 42 insertions(+), 27 deletions(-)
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 78afe57fb2..b2c9a9dbbc 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -23,6 +23,14 @@
struct PGPROC;
+/* what state of the wait process is a backend in */
+typedef enum LWLockWaitState
+{
+ LW_WS_NOT_WAITING, /* not currently waiting / woken up */
+ LW_WS_WAITING, /* currently waiting */
+ LW_WS_PENDING_WAKEUP /* removed from waitlist, but not yet signalled */
+} LWLockWaitState;
+
/*
* Code outside of lwlock.c should not manipulate the contents of this
* structure directly, but we have to declare it here to allow LWLocks to be
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 1464fad9b9..9cc37aa0d9 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -180,7 +180,7 @@ struct PGPROC
bool recoveryConflictPending;
/* Info about LWLock the process is currently waiting for, if any. */
- bool lwWaiting; /* true if waiting for an LW lock */
+ uint8 lwWaiting; /* see LWLockWaitState */
uint8 lwWaitMode; /* lwlock mode being waited for */
proclist_node lwWaitLink; /* position in LW lock wait list */
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 076315e7a2..db68b14443 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -482,7 +482,7 @@ MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
proc->roleId = owner;
proc->tempNamespaceId = InvalidOid;
proc->isBackgroundWorker = false;
- proc->lwWaiting = false;
+ proc->lwWaiting = LW_WS_NOT_WAITING;
proc->lwWaitMode = 0;
proc->waitLock = NULL;
proc->waitProcLock = NULL;
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index d63dc1b93e..28a540961e 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -993,6 +993,15 @@ LWLockWakeup(LWLock *lock)
wokeup_somebody = true;
}
+ /*
+ * Signal that the process isn't on the wait list anymore. This allows
+ * LWLockDequeueSelf() to remove itself of the waitlist with a
+ * proclist_delete(), rather than having to check if it has been
+ * removed from the list.
+ */
+ Assert(waiter->lwWaiting == LW_WS_WAITING);
+ waiter->lwWaiting = LW_WS_PENDING_WAKEUP;
+
/*
* Once we've woken up an exclusive lock, there's no point in waking
* up anybody else.
@@ -1050,7 +1059,7 @@ LWLockWakeup(LWLock *lock)
* another lock.
*/
pg_write_barrier();
- waiter->lwWaiting = false;
+ waiter->lwWaiting = LW_WS_NOT_WAITING;
PGSemaphoreUnlock(waiter->sem);
}
}
@@ -1071,7 +1080,7 @@ LWLockQueueSelf(LWLock *lock, LWLockMode mode)
if (MyProc == NULL)
elog(PANIC, "cannot wait without a PGPROC structure");
- if (MyProc->lwWaiting)
+ if (MyProc->lwWaiting != LW_WS_NOT_WAITING)
elog(PANIC, "queueing for lock while waiting on another one");
LWLockWaitListLock(lock);
@@ -1079,7 +1088,7 @@ LWLockQueueSelf(LWLock *lock, LWLockMode mode)
/* setting the flag is protected by the spinlock */
pg_atomic_fetch_or_u32(&lock->state, LW_FLAG_HAS_WAITERS);
- MyProc->lwWaiting = true;
+ MyProc->lwWaiting = LW_WS_WAITING;
MyProc->lwWaitMode = mode;
/* LW_WAIT_UNTIL_FREE waiters are always at the front of the queue */
@@ -1107,8 +1116,7 @@ LWLockQueueSelf(LWLock *lock, LWLockMode mode)
static void
LWLockDequeueSelf(LWLock *lock)
{
- bool found = false;
- proclist_mutable_iter iter;
+ bool on_waitlist;
#ifdef LWLOCK_STATS
lwlock_stats *lwstats;
@@ -1121,18 +1129,13 @@ LWLockDequeueSelf(LWLock *lock)
LWLockWaitListLock(lock);
/*
- * Can't just remove ourselves from the list, but we need to iterate over
- * all entries as somebody else could have dequeued us.
+ * Remove ourselves from the waitlist, unless we've already been
+ * removed. The removal happens with the wait list lock held, so there's
+ * no race in this check.
*/
- proclist_foreach_modify(iter, &lock->waiters, lwWaitLink)
- {
- if (iter.cur == MyProc->pgprocno)
- {
- found = true;
- proclist_delete(&lock->waiters, iter.cur, lwWaitLink);
- break;
- }
- }
+ on_waitlist = MyProc->lwWaiting == LW_WS_WAITING;
+ if (on_waitlist)
+ proclist_delete(&lock->waiters, MyProc->pgprocno, lwWaitLink);
if (proclist_is_empty(&lock->waiters) &&
(pg_atomic_read_u32(&lock->state) & LW_FLAG_HAS_WAITERS) != 0)
@@ -1144,8 +1147,8 @@ LWLockDequeueSelf(LWLock *lock)
LWLockWaitListUnlock(lock);
/* clear waiting state again, nice for debugging */
- if (found)
- MyProc->lwWaiting = false;
+ if (on_waitlist)
+ MyProc->lwWaiting = LW_WS_NOT_WAITING;
else
{
int extraWaits = 0;
@@ -1169,7 +1172,7 @@ LWLockDequeueSelf(LWLock *lock)
for (;;)
{
PGSemaphoreLock(MyProc->sem);
- if (!MyProc->lwWaiting)
+ if (MyProc->lwWaiting == LW_WS_NOT_WAITING)
break;
extraWaits++;
}
@@ -1320,7 +1323,7 @@ LWLockAcquire(LWLock *lock, LWLockMode mode)
for (;;)
{
PGSemaphoreLock(proc->sem);
- if (!proc->lwWaiting)
+ if (proc->lwWaiting == LW_WS_NOT_WAITING)
break;
extraWaits++;
}
@@ -1485,7 +1488,7 @@ LWLockAcquireOrWait(LWLock *lock, LWLockMode mode)
for (;;)
{
PGSemaphoreLock(proc->sem);
- if (!proc->lwWaiting)
+ if (proc->lwWaiting == LW_WS_NOT_WAITING)
break;
extraWaits++;
}
@@ -1701,7 +1704,7 @@ LWLockWaitForVar(LWLock *lock, uint64 *valptr, uint64 oldval, uint64 *newval)
for (;;)
{
PGSemaphoreLock(proc->sem);
- if (!proc->lwWaiting)
+ if (proc->lwWaiting == LW_WS_NOT_WAITING)
break;
extraWaits++;
}
@@ -1779,6 +1782,10 @@ LWLockUpdateVar(LWLock *lock, uint64 *valptr, uint64 val)
proclist_delete(&lock->waiters, iter.cur, lwWaitLink);
proclist_push_tail(&wakeup, iter.cur, lwWaitLink);
+
+ /* see LWLockWakeup() */
+ Assert(waiter->lwWaiting == LW_WS_WAITING);
+ waiter->lwWaiting = LW_WS_PENDING_WAKEUP;
}
/* We are done updating shared state of the lock itself. */
@@ -1794,7 +1801,7 @@ LWLockUpdateVar(LWLock *lock, uint64 *valptr, uint64 val)
proclist_delete(&wakeup, iter.cur, lwWaitLink);
/* check comment in LWLockWakeup() about this barrier */
pg_write_barrier();
- waiter->lwWaiting = false;
+ waiter->lwWaiting = LW_WS_NOT_WAITING;
PGSemaphoreUnlock(waiter->sem);
}
}
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 1e06d6580c..f62e622a7b 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -400,7 +400,7 @@ InitProcess(void)
/* NB -- autovac launcher intentionally does not set IS_AUTOVACUUM */
if (IsAutoVacuumWorkerProcess())
MyProc->statusFlags |= PROC_IS_AUTOVACUUM;
- MyProc->lwWaiting = false;
+ MyProc->lwWaiting = LW_WS_NOT_WAITING;
MyProc->lwWaitMode = 0;
MyProc->waitLock = NULL;
MyProc->waitProcLock = NULL;
@@ -583,7 +583,7 @@ InitAuxiliaryProcess(void)
MyProc->delayChkpt = false;
MyProc->delayChkptEnd = false;
MyProc->statusFlags = 0;
- MyProc->lwWaiting = false;
+ MyProc->lwWaiting = LW_WS_NOT_WAITING;
MyProc->lwWaitMode = 0;
MyProc->waitLock = NULL;
MyProc->waitProcLock = NULL;
--
2.43.0
Attachment: 0001-lwlock-Fix-quadratic-behavior-with-very-long-wai.v13.patch (text/x-diff)
From 2cecd5135e294fbb81ff726ab2f46baf2a73d69c Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sun, 20 Nov 2022 11:56:32 -0800
Subject: [PATCH] lwlock: Fix quadratic behavior with very long wait lists
Until now LWLockDequeueSelf() sequentially searched the list of waiters to see
if the current proc is still on the list of waiters, or has already been
removed. In extreme workloads, where the wait lists are very long, this leads
to a quadratic behavior. #backends iterating over a list #backends
long. Additionally, the likelihood of needing to call LWLockDequeueSelf() in
the first place also increases with the increased length of the wait queue, as
it becomes more likely that a lock is released while waiting for the wait list
lock, which is held for longer during lock release.
Due to the exponential back-off in perform_spin_delay() this is surprisingly
hard to detect. We should make that easier, e.g. by adding a wait event around
the pg_usleep() - but that's a separate patch.
The fix is simple - track whether a proc is currently waiting in the wait list
or already removed but waiting to be woken up in PGPROC->lwWaiting.
In some workloads with a lot of clients contending for a small number of
lwlocks (e.g. WALWriteLock), the fix can substantially increase throughput.
As the quadratic behavior arguably is a bug, we might want to decide to
backpatch this fix in the future.
Author: Andres Freund <andres@anarazel.de>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Discussion: https://postgr.es/m/20221027165914.2hofzp4cvutj6gin@awork3.anarazel.de
Discussion: https://postgr.es/m/CALj2ACXktNbG=K8Xi7PSqbofTZozavhaxjatVc14iYaLu4Maag@mail.gmail.com
---
src/include/storage/lwlock.h | 8 ++++
src/include/storage/proc.h | 2 +-
src/backend/access/transam/twophase.c | 2 +-
src/backend/storage/lmgr/lwlock.c | 53 +++++++++++++++------------
src/backend/storage/lmgr/proc.c | 4 +-
5 files changed, 42 insertions(+), 27 deletions(-)
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index cdbfbed118..a1197a8007 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -24,6 +24,14 @@
struct PGPROC;
+/* what state of the wait process is a backend in */
+typedef enum LWLockWaitState
+{
+ LW_WS_NOT_WAITING, /* not currently waiting / woken up */
+ LW_WS_WAITING, /* currently waiting */
+ LW_WS_PENDING_WAKEUP /* removed from waitlist, but not yet signalled */
+} LWLockWaitState;
+
/*
* Code outside of lwlock.c should not manipulate the contents of this
* structure directly, but we have to declare it here to allow LWLocks to be
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 7c85b5645b..ffb185d3df 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -140,7 +140,7 @@ struct PGPROC
bool recoveryConflictPending;
/* Info about LWLock the process is currently waiting for, if any. */
- bool lwWaiting; /* true if waiting for an LW lock */
+ uint8 lwWaiting; /* see LWLockWaitState */
uint8 lwWaitMode; /* lwlock mode being waited for */
proclist_node lwWaitLink; /* position in LW lock wait list */
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index d375b1012b..f8272c2f0d 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -484,7 +484,7 @@ MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
proc->roleId = owner;
proc->tempNamespaceId = InvalidOid;
proc->isBackgroundWorker = false;
- proc->lwWaiting = false;
+ proc->lwWaiting = LW_WS_NOT_WAITING;
proc->lwWaitMode = 0;
proc->waitLock = NULL;
proc->waitProcLock = NULL;
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index db89137c03..49423ea4d2 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -998,6 +998,15 @@ LWLockWakeup(LWLock *lock)
wokeup_somebody = true;
}
+ /*
+ * Signal that the process isn't on the wait list anymore. This allows
+ * LWLockDequeueSelf() to remove itself of the waitlist with a
+ * proclist_delete(), rather than having to check if it has been
+ * removed from the list.
+ */
+ Assert(waiter->lwWaiting == LW_WS_WAITING);
+ waiter->lwWaiting = LW_WS_PENDING_WAKEUP;
+
/*
* Once we've woken up an exclusive lock, there's no point in waking
* up anybody else.
@@ -1055,7 +1064,7 @@ LWLockWakeup(LWLock *lock)
* another lock.
*/
pg_write_barrier();
- waiter->lwWaiting = false;
+ waiter->lwWaiting = LW_WS_NOT_WAITING;
PGSemaphoreUnlock(waiter->sem);
}
}
@@ -1076,7 +1085,7 @@ LWLockQueueSelf(LWLock *lock, LWLockMode mode)
if (MyProc == NULL)
elog(PANIC, "cannot wait without a PGPROC structure");
- if (MyProc->lwWaiting)
+ if (MyProc->lwWaiting != LW_WS_NOT_WAITING)
elog(PANIC, "queueing for lock while waiting on another one");
LWLockWaitListLock(lock);
@@ -1084,7 +1093,7 @@ LWLockQueueSelf(LWLock *lock, LWLockMode mode)
/* setting the flag is protected by the spinlock */
pg_atomic_fetch_or_u32(&lock->state, LW_FLAG_HAS_WAITERS);
- MyProc->lwWaiting = true;
+ MyProc->lwWaiting = LW_WS_WAITING;
MyProc->lwWaitMode = mode;
/* LW_WAIT_UNTIL_FREE waiters are always at the front of the queue */
@@ -1112,8 +1121,7 @@ LWLockQueueSelf(LWLock *lock, LWLockMode mode)
static void
LWLockDequeueSelf(LWLock *lock)
{
- bool found = false;
- proclist_mutable_iter iter;
+ bool on_waitlist;
#ifdef LWLOCK_STATS
lwlock_stats *lwstats;
@@ -1126,18 +1134,13 @@ LWLockDequeueSelf(LWLock *lock)
LWLockWaitListLock(lock);
/*
- * Can't just remove ourselves from the list, but we need to iterate over
- * all entries as somebody else could have dequeued us.
+ * Remove ourselves from the waitlist, unless we've already been
+ * removed. The removal happens with the wait list lock held, so there's
+ * no race in this check.
*/
- proclist_foreach_modify(iter, &lock->waiters, lwWaitLink)
- {
- if (iter.cur == MyProc->pgprocno)
- {
- found = true;
- proclist_delete(&lock->waiters, iter.cur, lwWaitLink);
- break;
- }
- }
+ on_waitlist = MyProc->lwWaiting == LW_WS_WAITING;
+ if (on_waitlist)
+ proclist_delete(&lock->waiters, MyProc->pgprocno, lwWaitLink);
if (proclist_is_empty(&lock->waiters) &&
(pg_atomic_read_u32(&lock->state) & LW_FLAG_HAS_WAITERS) != 0)
@@ -1149,8 +1152,8 @@ LWLockDequeueSelf(LWLock *lock)
LWLockWaitListUnlock(lock);
/* clear waiting state again, nice for debugging */
- if (found)
- MyProc->lwWaiting = false;
+ if (on_waitlist)
+ MyProc->lwWaiting = LW_WS_NOT_WAITING;
else
{
int extraWaits = 0;
@@ -1174,7 +1177,7 @@ LWLockDequeueSelf(LWLock *lock)
for (;;)
{
PGSemaphoreLock(MyProc->sem);
- if (!MyProc->lwWaiting)
+ if (MyProc->lwWaiting == LW_WS_NOT_WAITING)
break;
extraWaits++;
}
@@ -1325,7 +1328,7 @@ LWLockAcquire(LWLock *lock, LWLockMode mode)
for (;;)
{
PGSemaphoreLock(proc->sem);
- if (!proc->lwWaiting)
+ if (proc->lwWaiting == LW_WS_NOT_WAITING)
break;
extraWaits++;
}
@@ -1490,7 +1493,7 @@ LWLockAcquireOrWait(LWLock *lock, LWLockMode mode)
for (;;)
{
PGSemaphoreLock(proc->sem);
- if (!proc->lwWaiting)
+ if (proc->lwWaiting == LW_WS_NOT_WAITING)
break;
extraWaits++;
}
@@ -1706,7 +1709,7 @@ LWLockWaitForVar(LWLock *lock, uint64 *valptr, uint64 oldval, uint64 *newval)
for (;;)
{
PGSemaphoreLock(proc->sem);
- if (!proc->lwWaiting)
+ if (proc->lwWaiting == LW_WS_NOT_WAITING)
break;
extraWaits++;
}
@@ -1787,6 +1790,10 @@ LWLockUpdateVar(LWLock *lock, uint64 *valptr, uint64 val)
proclist_delete(&lock->waiters, iter.cur, lwWaitLink);
proclist_push_tail(&wakeup, iter.cur, lwWaitLink);
+
+ /* see LWLockWakeup() */
+ Assert(waiter->lwWaiting == LW_WS_WAITING);
+ waiter->lwWaiting = LW_WS_PENDING_WAKEUP;
}
/* We are done updating shared state of the lock itself. */
@@ -1802,7 +1809,7 @@ LWLockUpdateVar(LWLock *lock, uint64 *valptr, uint64 val)
proclist_delete(&wakeup, iter.cur, lwWaitLink);
/* check comment in LWLockWakeup() about this barrier */
pg_write_barrier();
- waiter->lwWaiting = false;
+ waiter->lwWaiting = LW_WS_NOT_WAITING;
PGSemaphoreUnlock(waiter->sem);
}
}
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 199c0eae07..81a2025eb5 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -402,7 +402,7 @@ InitProcess(void)
/* NB -- autovac launcher intentionally does not set IS_AUTOVACUUM */
if (IsAutoVacuumWorkerProcess())
MyPgXact->vacuumFlags |= PROC_IS_AUTOVACUUM;
- MyProc->lwWaiting = false;
+ MyProc->lwWaiting = LW_WS_NOT_WAITING;
MyProc->lwWaitMode = 0;
MyProc->waitLock = NULL;
MyProc->waitProcLock = NULL;
@@ -582,7 +582,7 @@ InitAuxiliaryProcess(void)
MyProc->delayChkpt = false;
MyProc->delayChkptEnd = false;
MyPgXact->vacuumFlags = 0;
- MyProc->lwWaiting = false;
+ MyProc->lwWaiting = LW_WS_NOT_WAITING;
MyProc->lwWaitMode = 0;
MyProc->waitLock = NULL;
MyProc->waitProcLock = NULL;
--
2.43.0
Attachment: 0001-lwlock-Fix-quadratic-behavior-with-very-long-wai.v12.patch (text/x-diff)
From 52b403443ac4a9986af6d95d175606332bbc538e Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sun, 20 Nov 2022 11:56:32 -0800
Subject: [PATCH] lwlock: Fix quadratic behavior with very long wait lists
Until now LWLockDequeueSelf() sequentially searched the list of waiters to see
if the current proc is still on the list of waiters, or has already been
removed. In extreme workloads, where the wait lists are very long, this leads
to a quadratic behavior. #backends iterating over a list #backends
long. Additionally, the likelihood of needing to call LWLockDequeueSelf() in
the first place also increases with the increased length of the wait queue, as
it becomes more likely that a lock is released while waiting for the wait list
lock, which is held for longer during lock release.
Due to the exponential back-off in perform_spin_delay() this is surprisingly
hard to detect. We should make that easier, e.g. by adding a wait event around
the pg_usleep() - but that's a separate patch.
The fix is simple - track whether a proc is currently waiting in the wait list
or already removed but waiting to be woken up in PGPROC->lwWaiting.
In some workloads with a lot of clients contending for a small number of
lwlocks (e.g. WALWriteLock), the fix can substantially increase throughput.
As the quadratic behavior arguably is a bug, we might want to decide to
backpatch this fix in the future.
Author: Andres Freund <andres@anarazel.de>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Discussion: https://postgr.es/m/20221027165914.2hofzp4cvutj6gin@awork3.anarazel.de
Discussion: https://postgr.es/m/CALj2ACXktNbG=K8Xi7PSqbofTZozavhaxjatVc14iYaLu4Maag@mail.gmail.com
Backpatch-through: 12
---
src/include/storage/lwlock.h | 8 ++++
src/include/storage/proc.h | 2 +-
src/backend/access/transam/twophase.c | 2 +-
src/backend/storage/lmgr/lwlock.c | 53 +++++++++++++++------------
src/backend/storage/lmgr/proc.c | 4 +-
5 files changed, 42 insertions(+), 27 deletions(-)
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 87b1ce083d..a30a7d08e9 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -24,6 +24,14 @@
struct PGPROC;
+/* what state of the wait process is a backend in */
+typedef enum LWLockWaitState
+{
+ LW_WS_NOT_WAITING, /* not currently waiting / woken up */
+ LW_WS_WAITING, /* currently waiting */
+ LW_WS_PENDING_WAKEUP /* removed from waitlist, but not yet signalled */
+} LWLockWaitState;
+
/*
* Code outside of lwlock.c should not manipulate the contents of this
* structure directly, but we have to declare it here to allow LWLocks to be
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 7024df206d..5edb22f49d 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -134,7 +134,7 @@ struct PGPROC
bool recoveryConflictPending;
/* Info about LWLock the process is currently waiting for, if any. */
- bool lwWaiting; /* true if waiting for an LW lock */
+ uint8 lwWaiting; /* see LWLockWaitState */
uint8 lwWaitMode; /* lwlock mode being waited for */
proclist_node lwWaitLink; /* position in LW lock wait list */
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index e9ab612717..71b54e0292 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -485,7 +485,7 @@ MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
proc->roleId = owner;
proc->tempNamespaceId = InvalidOid;
proc->isBackgroundWorker = false;
- proc->lwWaiting = false;
+ proc->lwWaiting = LW_WS_NOT_WAITING;
proc->lwWaitMode = 0;
proc->waitLock = NULL;
proc->waitProcLock = NULL;
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 6b6e920704..933831edf5 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -913,6 +913,15 @@ LWLockWakeup(LWLock *lock)
wokeup_somebody = true;
}
+ /*
+ * Signal that the process isn't on the wait list anymore. This allows
+ * LWLockDequeueSelf() to remove itself of the waitlist with a
+ * proclist_delete(), rather than having to check if it has been
+ * removed from the list.
+ */
+ Assert(waiter->lwWaiting == LW_WS_WAITING);
+ waiter->lwWaiting = LW_WS_PENDING_WAKEUP;
+
/*
* Once we've woken up an exclusive lock, there's no point in waking
* up anybody else.
@@ -970,7 +979,7 @@ LWLockWakeup(LWLock *lock)
* another lock.
*/
pg_write_barrier();
- waiter->lwWaiting = false;
+ waiter->lwWaiting = LW_WS_NOT_WAITING;
PGSemaphoreUnlock(waiter->sem);
}
}
@@ -991,7 +1000,7 @@ LWLockQueueSelf(LWLock *lock, LWLockMode mode)
if (MyProc == NULL)
elog(PANIC, "cannot wait without a PGPROC structure");
- if (MyProc->lwWaiting)
+ if (MyProc->lwWaiting != LW_WS_NOT_WAITING)
elog(PANIC, "queueing for lock while waiting on another one");
LWLockWaitListLock(lock);
@@ -999,7 +1008,7 @@ LWLockQueueSelf(LWLock *lock, LWLockMode mode)
/* setting the flag is protected by the spinlock */
pg_atomic_fetch_or_u32(&lock->state, LW_FLAG_HAS_WAITERS);
- MyProc->lwWaiting = true;
+ MyProc->lwWaiting = LW_WS_WAITING;
MyProc->lwWaitMode = mode;
/* LW_WAIT_UNTIL_FREE waiters are always at the front of the queue */
@@ -1027,8 +1036,7 @@ LWLockQueueSelf(LWLock *lock, LWLockMode mode)
static void
LWLockDequeueSelf(LWLock *lock)
{
- bool found = false;
- proclist_mutable_iter iter;
+ bool on_waitlist;
#ifdef LWLOCK_STATS
lwlock_stats *lwstats;
@@ -1041,18 +1049,13 @@ LWLockDequeueSelf(LWLock *lock)
LWLockWaitListLock(lock);
/*
- * Can't just remove ourselves from the list, but we need to iterate over
- * all entries as somebody else could have dequeued us.
+ * Remove ourselves from the waitlist, unless we've already been
+ * removed. The removal happens with the wait list lock held, so there's
+ * no race in this check.
*/
- proclist_foreach_modify(iter, &lock->waiters, lwWaitLink)
- {
- if (iter.cur == MyProc->pgprocno)
- {
- found = true;
- proclist_delete(&lock->waiters, iter.cur, lwWaitLink);
- break;
- }
- }
+ on_waitlist = MyProc->lwWaiting == LW_WS_WAITING;
+ if (on_waitlist)
+ proclist_delete(&lock->waiters, MyProc->pgprocno, lwWaitLink);
if (proclist_is_empty(&lock->waiters) &&
(pg_atomic_read_u32(&lock->state) & LW_FLAG_HAS_WAITERS) != 0)
@@ -1064,8 +1067,8 @@ LWLockDequeueSelf(LWLock *lock)
LWLockWaitListUnlock(lock);
/* clear waiting state again, nice for debugging */
- if (found)
- MyProc->lwWaiting = false;
+ if (on_waitlist)
+ MyProc->lwWaiting = LW_WS_NOT_WAITING;
else
{
int extraWaits = 0;
@@ -1089,7 +1092,7 @@ LWLockDequeueSelf(LWLock *lock)
for (;;)
{
PGSemaphoreLock(MyProc->sem);
- if (!MyProc->lwWaiting)
+ if (MyProc->lwWaiting == LW_WS_NOT_WAITING)
break;
extraWaits++;
}
@@ -1239,7 +1242,7 @@ LWLockAcquire(LWLock *lock, LWLockMode mode)
for (;;)
{
PGSemaphoreLock(proc->sem);
- if (!proc->lwWaiting)
+ if (proc->lwWaiting == LW_WS_NOT_WAITING)
break;
extraWaits++;
}
@@ -1399,7 +1402,7 @@ LWLockAcquireOrWait(LWLock *lock, LWLockMode mode)
for (;;)
{
PGSemaphoreLock(proc->sem);
- if (!proc->lwWaiting)
+ if (proc->lwWaiting == LW_WS_NOT_WAITING)
break;
extraWaits++;
}
@@ -1611,7 +1614,7 @@ LWLockWaitForVar(LWLock *lock, uint64 *valptr, uint64 oldval, uint64 *newval)
for (;;)
{
PGSemaphoreLock(proc->sem);
- if (!proc->lwWaiting)
+ if (proc->lwWaiting == LW_WS_NOT_WAITING)
break;
extraWaits++;
}
@@ -1690,6 +1693,10 @@ LWLockUpdateVar(LWLock *lock, uint64 *valptr, uint64 val)
proclist_delete(&lock->waiters, iter.cur, lwWaitLink);
proclist_push_tail(&wakeup, iter.cur, lwWaitLink);
+
+ /* see LWLockWakeup() */
+ Assert(waiter->lwWaiting == LW_WS_WAITING);
+ waiter->lwWaiting = LW_WS_PENDING_WAKEUP;
}
/* We are done updating shared state of the lock itself. */
@@ -1705,7 +1712,7 @@ LWLockUpdateVar(LWLock *lock, uint64 *valptr, uint64 val)
proclist_delete(&wakeup, iter.cur, lwWaitLink);
/* check comment in LWLockWakeup() about this barrier */
pg_write_barrier();
- waiter->lwWaiting = false;
+ waiter->lwWaiting = LW_WS_NOT_WAITING;
PGSemaphoreUnlock(waiter->sem);
}
}
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 80a8b48c3c..83b0610dce 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -403,7 +403,7 @@ InitProcess(void)
/* NB -- autovac launcher intentionally does not set IS_AUTOVACUUM */
if (IsAutoVacuumWorkerProcess())
MyPgXact->vacuumFlags |= PROC_IS_AUTOVACUUM;
- MyProc->lwWaiting = false;
+ MyProc->lwWaiting = LW_WS_NOT_WAITING;
MyProc->lwWaitMode = 0;
MyProc->waitLock = NULL;
MyProc->waitProcLock = NULL;
@@ -583,7 +583,7 @@ InitAuxiliaryProcess(void)
MyPgXact->delayChkpt = false;
MyProc->delayChkptEnd = false;
MyPgXact->vacuumFlags = 0;
- MyProc->lwWaiting = false;
+ MyProc->lwWaiting = LW_WS_NOT_WAITING;
MyProc->lwWaitMode = 0;
MyProc->waitLock = NULL;
MyProc->waitProcLock = NULL;
--
2.43.0
On 1/16/24 1:11 AM, Michael Paquier wrote:
On Thu, Jan 11, 2024 at 09:47:33AM -0500, Jonathan S. Katz wrote:
I have similar data sources to Nathan/Michael and I'm trying to avoid piling
on, but one case that's interesting occurred after a major version upgrade
from PG10 to PG14 on a database supporting a very active/highly concurrent
workload. On inspection, it seems like backpatching would help this
particular case.
With 10/11 EOL, I do wonder if we'll see more of these reports on upgrades to
< PG16.
(I was in favor of backpatching prior; opinion is unchanged.)
Hearing nothing, I have prepared a set of patches for v12~v15,
checking all the lwlock paths for all the branches. At the end, the
set of changes looks rather sane to me regarding the queue handling.
I have also run some numbers on all the branches, and the test case
posted upthread falls off dramatically after 512 concurrent
connections at the top of all the stable branches :(
For example, on REL_12_STABLE with and without the patch attached:
num v12 v12+patch
1 29717.151665 29096.707588
2 63257.709301 61889.476318
4 127921.873393 124575.901330
8 231400.571662 230562.725174
16 343911.185351 312432.897015
32 291748.985280 281011.787701
64 268998.728648 269975.605115
128 297332.597018 286449.176950
256 243902.817657 240559.122309
512 190069.602270 194510.718508
768 58915.650225 165714.707198
1024 39920.950552 149433.836901
2048 16922.391688 108164.301054
4096 6229.063321 69032.338708
I'd like to apply that, just let me know if you have any comments
and/or objections.
Wow. All I can say is that my opinion remains unchanged on going forward
with backpatching.
Looking at the code, I understand an argument for not backpatching given
we modify the struct, but this does seem low-risk/high-reward and should
help PostgreSQL run better on these higher-throughput workloads.
Thanks,
Jonathan
On Tue, Jan 16, 2024 at 11:24:49PM -0500, Jonathan S. Katz wrote:
On 1/16/24 1:11 AM, Michael Paquier wrote:
I'd like to apply that, just let me know if you have any comments
and/or objections.
Looking at the code, I understand an argument for not backpatching given we
modify the struct, but this does seem low-risk/high-reward and should help
PostgreSQL run better on these higher-throughput workloads.
Just to be clear here: I repeated the tests on all the stable
branches yesterday, and the TPS falls off drastically around 256
concurrent sessions for all of them, with patterns similar to what I've
posted for 12; the patch gets back a lot of performance for the cases with
more than 1k connections.
--
Michael
On Tue, Jan 16, 2024 at 03:11:48PM +0900, Michael Paquier wrote:
I'd like to apply that, just let me know if you have any comments
and/or objections.
And done on 12~15.
While on it, I have also looked at external source code references on github
and debian that involve lwWaiting; all of them only check whether a process
is waiting or not, so the zero-valued LW_WS_NOT_WAITING keeps those checks
equivalent.
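As a minimal illustration of why those external checks keep working, here is a
simplified, self-contained sketch (a hypothetical cut-down PGPROC and a made-up
proc_is_waiting() helper, not code from any of those repositories): a
boolean-style test on lwWaiting behaves the same because LW_WS_NOT_WAITING is
the enum's zero value.

#include <stdbool.h>
#include <stdio.h>

/* Simplified, hypothetical stand-ins for the PostgreSQL definitions. */
typedef enum LWLockWaitState
{
    LW_WS_NOT_WAITING,       /* 0: not currently waiting / woken up */
    LW_WS_WAITING,           /* 1: currently waiting */
    LW_WS_PENDING_WAKEUP     /* 2: removed from waitlist, not yet signalled */
} LWLockWaitState;

typedef struct PGPROC
{
    unsigned char lwWaiting; /* was a bool, now stores a LWLockWaitState */
} PGPROC;

/*
 * External code that treated lwWaiting as a bool keeps the same behaviour:
 * "not waiting" is the only state that compares equal to zero/false.
 */
static bool
proc_is_waiting(const PGPROC *proc)
{
    return proc->lwWaiting != LW_WS_NOT_WAITING;
}

int
main(void)
{
    PGPROC      proc = {.lwWaiting = LW_WS_PENDING_WAKEUP};

    printf("waiting: %d\n", proc_is_waiting(&proc));   /* prints 1 */
    proc.lwWaiting = LW_WS_NOT_WAITING;
    printf("waiting: %d\n", proc_is_waiting(&proc));   /* prints 0 */
    return 0;
}

The only property such callers depend on is that LW_WS_NOT_WAITING stays the
first (zero) member of the enum, which matches how the patch defines it.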
--
Michael
On Thu, Jan 18, 2024 at 7:17 AM Michael Paquier <michael@paquier.xyz> wrote:
On Tue, Jan 16, 2024 at 03:11:48PM +0900, Michael Paquier wrote:
I'd like to apply that, just let me know if you have any comments
and/or objections.
And done on 12~15.
Hi Michael, just to reassure you that it is a good thing. We have a
customer who reported much better performance on 16.x than on 13~15 in
very heavy duty LWLock/lockmanager scenarios (ofc, before that was
committed/released), so I gave it a try here today to see how much can
be attributed to that single commit.
Given:
# $s=10, $p=10,100, DURATION=10s, m=prepared,simple, no reruns, just a
# single $DURATION run to save time
pgbench -i -s $s --partitions $p $DBNAME
ALTER TABLE pgbench_accounts ADD COLUMN aid_parent INT;
UPDATE pgbench_accounts SET aid_parent = aid;
CREATE INDEX ON pgbench_accounts(aid_parent);
pgbench -n -M $m -T $DURATION -c $c -j $c -f join.sql $DBNAME
join.sql was:
\set aid random(1, 100000 * :scale)
select * from pgbench_accounts pa join pgbench_branches pb on pa.bid =
pb.bid where pa.aid_parent = :aid;
See attached results. The benefits are observable (at least when active
working sessions >= vCPUs [threads, not cores]) and give up to a ~2.65x
boost in certain cases, at least for this test case. Hopefully others
will find it useful.
-J.
On Fri, Jan 19, 2024 at 01:49:59PM +0100, Jakub Wartak wrote:
Hi Michael, just to reassure you that it is a good thing. We have a
customer who reported much better performance on 16.x than on 13~15 in
very heavy duty LWLock/lockmanager scenarios (ofc, before that was
committed/released), so I gave it a try here today to see how much can
be attributed to that single commit.
Ahh. Thanks for the feedback.
--
Michael