heavily contended lwlocks with long wait queues scale badly
Hi,
I am working on posting a patch series to make relation extension more
scalable. As part of that I was running some benchmarks for workloads that I
thought should not be affected, or only positively impacted - but I was wrong:
there was some very significant degradation at very high client counts. After
pulling my hair out for quite a while trying to understand that behaviour, I
figured out that it's just a side-effect of *removing* some other contention.
This morning (turns out sleeping helps) I managed to reproduce it in an
unmodified postgres.
$ cat ~/tmp/txid.sql
SELECT txid_current();
$ for c in 1 2 4 8 16 32 64 128 256 512 768 1024 2048 4096; do echo -n "$c ";pgbench -n -M prepared -f ~/tmp/txid.sql -c$c -j$c -T5 2>&1|grep '^tps'|awk '{print $3}';done
1 60174
2 116169
4 208119
8 373685
16 515247
32 554726
64 497508
128 415097
256 334923
512 243679
768 192959
1024 157734
2048 82904
4096 32007
(I didn't properly round TPS, but that doesn't matter here)
Performance completely falls off a cliff starting at ~256 clients. There's
actually plenty of CPU available here, so this isn't a case of running out of
CPU time.
Rather, the problem is very bad contention on the "spinlock" for the lwlock
wait list. I realized that something in that direction was off when trying to
investigate why I was seeing spin delays of substantial duration (>100ms).
The problem isn't a fundamental issue with lwlocks, it's that
LWLockDequeueSelf() does this:
	LWLockWaitListLock(lock);

	/*
	 * Can't just remove ourselves from the list, but we need to iterate over
	 * all entries as somebody else could have dequeued us.
	 */
	proclist_foreach_modify(iter, &lock->waiters, lwWaitLink)
	{
		if (iter.cur == MyProc->pgprocno)
		{
			found = true;
			proclist_delete(&lock->waiters, iter.cur, lwWaitLink);
			break;
		}
	}
I.e. it iterates over the whole waitlist to "find itself". The longer the
waitlist gets, the longer this takes. And the longer it takes for
LWLockWakeup() to actually wake up all waiters, the more likely it becomes
that LWLockDequeueSelf() needs to be called in the first place - so the total
work grows quadratically with the number of waiters.
We can't make the trivial optimization of using proclist_contains(), because
PGPROC->lwWaitLink is also used for the list of processes to wake up in
LWLockWakeup() - so being linked doesn't tell us which of the two lists we
are on.
But I think we can solve that fairly reasonably nonetheless. We can change
PGPROC->lwWaiting to not just be a boolean, but have three states:
0: not waiting
1: waiting in waitlist
2: waiting to be woken up
which we then can use in LWLockDequeueSelf() to only remove ourselves from the
list if we're on it. As removal from that list is protected by the wait list
lock, there's no race to worry about.
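For illustration, here is a self-contained toy model of the idea (the names
are invented for this sketch, nothing here is lwlock.c code): each entry of an
intrusive wait list carries a state, so an entry can tell in O(1) whether it
still has to unlink itself, instead of scanning the list to find out whether
somebody else already removed it.

#include <assert.h>
#include <stdio.h>

/* toy equivalents of the three lwWaiting states described above */
enum wait_state {NOT_WAITING, WAITING, PENDING_WAKEUP};

struct waiter
{
	struct waiter *prev;
	struct waiter *next;
	enum wait_state state;
};

/* circular list with a dummy head */
static void
list_init(struct waiter *head)
{
	head->prev = head->next = head;
}

static void
enqueue(struct waiter *head, struct waiter *w)
{
	w->prev = head->prev;
	w->next = head;
	head->prev->next = w;
	head->prev = w;
	w->state = WAITING;
}

/* wakeup path: unlink the first waiter and mark it as pending wakeup */
static struct waiter *
wakeup_one(struct waiter *head)
{
	struct waiter *w = head->next;

	if (w == head)
		return NULL;			/* list is empty */
	w->prev->next = w->next;
	w->next->prev = w->prev;
	w->state = PENDING_WAKEUP;	/* off the list, but not yet signalled */
	return w;
}

/*
 * With the per-entry state there is no need to walk the list to check
 * whether somebody already removed us; we only look at our own state
 * (in the real code this happens under the wait list lock).
 */
static void
dequeue_self(struct waiter *w)
{
	if (w->state == WAITING)
	{
		w->prev->next = w->next;
		w->next->prev = w->prev;
		w->state = NOT_WAITING;
	}
	/* else: already unlinked; the real code then absorbs the pending wakeup */
}

int
main(void)
{
	struct waiter head, a, b;
	struct waiter *w;

	list_init(&head);
	enqueue(&head, &a);
	enqueue(&head, &b);

	w = wakeup_one(&head);				/* a is now PENDING_WAKEUP */
	assert(w == &a);
	dequeue_self(&a);					/* no-op, a is already off the list */
	dequeue_self(&b);					/* unlinks b without any scan */
	assert(head.next == &head && head.prev == &head);
	printf("ok\n");
	return 0;
}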
client patched HEAD
1 60109 60174
2 112694 116169
4 214287 208119
8 377459 373685
16 524132 515247
32 565772 554726
64 587716 497508
128 581297 415097
256 550296 334923
512 486207 243679
768 449673 192959
1024 410836 157734
2048 326224 82904
4096 250252 32007
Not perfect with the patch, but not awful either.
I suspect this issue might actually explain quite a few odd performance
behaviours we've seen at the larger end in the past. I think it has gotten a
bit worse with the conversion of lwlock.c to proclists (I see lots of
expensive multiplications to deal with sizeof(PGPROC)), but otherwise likely
exists at least as far back as ab5194e6f61, in 9.5.
I guess there's an argument for considering this a bug that we should
backpatch a fix for? But given the vintage, probably not? The only thing that
gives me pause is that this is quite hard to pinpoint as happening.
I've attached my quick-and-dirty patch. Obviously it'd need a few defines etc,
but I wanted to get this out to discuss before spending further time.
Greetings,
Andres Freund
Attachments:
fix-lwlock-queue.diff (text/x-diff; charset=us-ascii)
diff --git i/src/include/storage/proc.h w/src/include/storage/proc.h
index 8d096fdeeb1..9a2615666a1 100644
--- i/src/include/storage/proc.h
+++ w/src/include/storage/proc.h
@@ -217,7 +217,8 @@ struct PGPROC
bool recoveryConflictPending;
/* Info about LWLock the process is currently waiting for, if any. */
- bool lwWaiting; /* true if waiting for an LW lock */
+ int lwWaiting; /* 0 if not waiting, 1 if on waitlist, 2 if
+ * waiting to be woken */
uint8 lwWaitMode; /* lwlock mode being waited for */
proclist_node lwWaitLink; /* position in LW lock wait list */
diff --git i/src/backend/storage/lmgr/lwlock.c w/src/backend/storage/lmgr/lwlock.c
index d274c9b1dc9..ffb852c91d7 100644
--- i/src/backend/storage/lmgr/lwlock.c
+++ w/src/backend/storage/lmgr/lwlock.c
@@ -987,6 +987,9 @@ LWLockWakeup(LWLock *lock)
wokeup_somebody = true;
}
+ /* signal that the process isn't on the wait list anymore */
+ waiter->lwWaiting = 2;
+
/*
* Once we've woken up an exclusive lock, there's no point in waking
* up anybody else.
@@ -1044,7 +1047,7 @@ LWLockWakeup(LWLock *lock)
* another lock.
*/
pg_write_barrier();
- waiter->lwWaiting = false;
+ waiter->lwWaiting = 0;
PGSemaphoreUnlock(waiter->sem);
}
}
@@ -1073,7 +1076,7 @@ LWLockQueueSelf(LWLock *lock, LWLockMode mode)
/* setting the flag is protected by the spinlock */
pg_atomic_fetch_or_u32(&lock->state, LW_FLAG_HAS_WAITERS);
- MyProc->lwWaiting = true;
+ MyProc->lwWaiting = 1;
MyProc->lwWaitMode = mode;
/* LW_WAIT_UNTIL_FREE waiters are always at the front of the queue */
@@ -1101,7 +1104,6 @@ static void
LWLockDequeueSelf(LWLock *lock)
{
bool found = false;
- proclist_mutable_iter iter;
#ifdef LWLOCK_STATS
lwlock_stats *lwstats;
@@ -1114,17 +1116,14 @@ LWLockDequeueSelf(LWLock *lock)
LWLockWaitListLock(lock);
/*
- * Can't just remove ourselves from the list, but we need to iterate over
- * all entries as somebody else could have dequeued us.
+ * Remove ourselves from the waitlist, unless we've already been
+ * removed. The removal happens with the wait list lock held, so there's
+ * no race in this check.
*/
- proclist_foreach_modify(iter, &lock->waiters, lwWaitLink)
+ if (MyProc->lwWaiting == 1)
{
- if (iter.cur == MyProc->pgprocno)
- {
- found = true;
- proclist_delete(&lock->waiters, iter.cur, lwWaitLink);
- break;
- }
+ proclist_delete(&lock->waiters, MyProc->pgprocno, lwWaitLink);
+ found = true;
}
if (proclist_is_empty(&lock->waiters) &&
@@ -1138,7 +1137,7 @@ LWLockDequeueSelf(LWLock *lock)
/* clear waiting state again, nice for debugging */
if (found)
- MyProc->lwWaiting = false;
+ MyProc->lwWaiting = 0;
else
{
int extraWaits = 0;
At Thu, 27 Oct 2022 09:59:14 -0700, Andres Freund <andres@anarazel.de> wrote in
But I think we can solve that fairly reasonably nonetheless. We can change
PGPROC->lwWaiting to not just be a boolean, but have three states:
0: not waiting
1: waiting in waitlist
2: waiting to be woken up
which we then can use in LWLockDequeueSelf() to only remove ourselves from the
list if we're on it. As removal from that list is protected by the wait list
lock, there's no race to worry about.
Since LWLockDequeueSelf() is only called in restricted situations, there's
no chance that the proc to be removed is on another lock's wait list at the
time the function is called. So it seems to work well. It is simple and
requires no additional memory or cycles... well, almost: it enlarges PGPROC
by 8 bytes, but changing lwWaiting to int8/uint8 keeps the size as it is
(Rocky8/x86-64).
It just shaves off looping cycles. So +1 for what the patch does.
client patched HEAD
1 60109 60174
2 112694 116169
4 214287 208119
8 377459 373685
16 524132 515247
32 565772 554726
64 587716 497508
128 581297 415097
256 550296 334923
512 486207 243679
768 449673 192959
1024 410836 157734
2048 326224 82904
4096 250252 32007
Not perfect with the patch, but not awful either.
Fairly good? Agreed. The performance peak is improved by 6% and
shifted to a larger number of clients (32->128).
I suspect this issue might actually explain quite a few odd performance
behaviours we've seen at the larger end in the past. I think it has gotten a
bit worse with the conversion of lwlock.c to proclists (I see lots of
expensive multiplications to deal with sizeof(PGPROC)), but otherwise likely
exists at least as far back as ab5194e6f61, in 9.5.
I guess there's an argument for considering this a bug that we should
backpatch a fix for? But given the vintage, probably not? The only thing that
gives me pause is that this is quite hard to pinpoint as happening.
I don't think this is a bug, but I think it might be back-patchable, since
it doesn't change the memory footprint (if adjusted) and causes no additional
cost or interface breakage; thus it might be OK to backpatch. Since this
"bug" has a positive-feedback nature, reducing its coefficient is beneficial
beyond the direct effect of the change.
I've attached my quick-and-dirty patch. Obviously it'd need a few defines etc,
but I wanted to get this out to discuss before spending further time.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Mon, Oct 31, 2022 at 11:03 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
At Thu, 27 Oct 2022 09:59:14 -0700, Andres Freund <andres@anarazel.de> wrote in
But I think we can solve that fairly reasonably nonetheless. We can change
PGPROC->lwWaiting to not just be a boolean, but have three states:
0: not waiting
1: waiting in waitlist
2: waiting to be woken up
which we then can use in LWLockDequeueSelf() to only remove ourselves from the
list if we're on it. As removal from that list is protected by the wait list
lock, there's no race to worry about.
This looks like a good idea.
Well, almost: it enlarges PGPROC by 8 bytes, but changing lwWaiting to
int8/uint8 keeps the size as it is (Rocky8/x86-64).
I agree
It just shaves off looping cycles. So +1 for what the patch does.
client patched HEAD
1 60109 60174
2 112694 116169
4 214287 208119
8 377459 373685
16 524132 515247
32 565772 554726
64 587716 497508
128 581297 415097
256 550296 334923
512 486207 243679
768 449673 192959
1024 410836 157734
2048 326224 82904
4096 250252 32007
Not perfect with the patch, but not awful either.
Fairly good? Agreed. The performance peak is improved by 6% and
shifted to a larger number of clients (32->128).
The performance results are promising, so +1.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Thu, Oct 27, 2022 at 10:29 PM Andres Freund <andres@anarazel.de> wrote:
But I think we can solve that fairly reasonably nonetheless. We can change
PGPROC->lwWaiting to not just be a boolean, but have three states:
0: not waiting
1: waiting in waitlist
2: waiting to be woken up
which we then can use in LWLockDequeueSelf() to only remove ourselves from the
list if we're on it. As removal from that list is protected by the wait list
lock, there's no race to worry about.
client patched HEAD
1 60109 60174
2 112694 116169
4 214287 208119
8 377459 373685
16 524132 515247
32 565772 554726
64 587716 497508
128 581297 415097
256 550296 334923
512 486207 243679
768 449673 192959
1024 410836 157734
2048 326224 82904
4096 250252 32007
Not perfect with the patch, but not awful either.
Here are results from my testing [1]. Results look impressive with the
patch at a higher number of clients; for instance, on HEAD the TPS with
1024 clients is 103587 whereas it is 248702 with the patch.
HEAD, run 1:
1 34534
2 72088
4 135249
8 213045
16 243507
32 304108
64 375148
128 390658
256 345503
512 284510
768 146417
1024 103587
2048 34702
4096 12450
HEAD, run 2:
1 34110
2 72403
4 134421
8 211263
16 241606
32 295198
64 353580
128 385147
256 341672
512 295001
768 142341
1024 97721
2048 30229
4096 13179
PATCHED, run 1:
1 34412
2 71733
4 139141
8 211526
16 241692
32 308198
64 406198
128 385643
256 338464
512 295559
768 272639
1024 248702
2048 191402
4096 112074
PATCHED, run 2:
1 34087
2 73567
4 135624
8 211901
16 242819
32 310534
64 352663
128 381780
256 342483
512 301968
768 272596
1024 251014
2048 184939
4096 108186
I've attached my quick-and-dirty patch. Obviously it'd need a few defines etc,
but I wanted to get this out to discuss before spending further time.
Just for the record, here are some review comments posted in the other
thread: /messages/by-id/CALj2ACXktNbG=K8Xi7PSqbofTZozavhaxjatVc14iYaLu4Maag@mail.gmail.com
BTW, I've seen a sporadic crash (SEGV) with the patch in the bgwriter
with the same setup [1]. I'm not sure if it's really because of the
patch. I'm unable to reproduce it now and unfortunately I didn't
capture further details when it occurred.
[1]: ./configure --prefix=$PWD/inst/ --enable-tap-tests CFLAGS="-O3" >
install.log && make -j 8 install > install.log 2>&1 &
shared_buffers = 8GB
max_wal_size = 32GB
max_connections = 4096
checkpoint_timeout = 10min
ubuntu: cat << EOF >> txid.sql
SELECT txid_current();
EOF
ubuntu: for c in 1 2 4 8 16 32 64 128 256 512 768 1024 2048 4096; do
echo -n "$c ";./pgbench -n -M prepared -U ubuntu postgres -f txid.sql
-c$c -j$c -T5 2>&1|grep '^tps'|awk '{print $3}';done
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
I was working on optimizing the LWLock queue in a little different way,
and I also benchmarked Andres' original patch from this thread [1].
The results are quite impressive, indeed. Please feel free to see the
results and join the discussion in [1] if you want.
Best regards,
Pavel
[1]: /messages/by-id/CALT9ZEEz+=Nepc5eti6x531q64Z6+DxtP3h-h_8O5HDdtkJcPw@mail.gmail.com
Hi Andres,
Thank you for your patch. The results are impressive.
On Mon, Oct 31, 2022 at 2:10 PM Pavel Borisov <pashkin.elfe@gmail.com> wrote:
I was working on optimizing the LWLock queue in a little different way
and I also did a benchmarking of Andres' original patch from this
thread. [1]
The results are quite impressive, indeed. Please feel free to see the
results and join the discussion in [1] if you want.
Best regards,
Pavel
[1] /messages/by-id/CALT9ZEEz+=Nepc5eti6x531q64Z6+DxtP3h-h_8O5HDdtkJcPw@mail.gmail.com
Pavel posted a patch implementing a lock-less queue for LWLock. The
results are interesting indeed, but slightly lower than those of your
current patch. Pavel's current patch probably doesn't utilize the full
potential of the lock-less idea. I wonder what you think about this
direction? We would be grateful for your guidance. Thank you.
------
Regards,
Alexander Korotkov
On Thu, Oct 27, 2022 at 12:59 PM Andres Freund <andres@anarazel.de> wrote:
After pulling my
hair out for quite a while to try to understand that behaviour, I figured out
that it's just a side-effect of *removing* some other contention.
I've seen this kind of pattern on multiple occasions. I don't know if
they were all caused by this, or what, but I certainly like the idea
of making it better.
--
Robert Haas
EDB: http://www.enterprisedb.com
Hi,
On 2022-10-31 16:21:06 +0530, Bharath Rupireddy wrote:
BTW, I've seen a sporadic crash (SEGV) with the patch in bg writer
with the same set up [1], I'm not sure if it's really because of the
patch. I'm unable to reproduce it now and unfortunately I didn't
capture further details when it occurred.
That's likely because the prototype patch I submitted in this thread missed
updating LWLockUpdateVar().
Updated patch attached.
Greetings,
Andres Freund
Attachments:
v3-0001-lwlock-fix-quadratic-behaviour-with-very-long-wai.patch (text/x-diff; charset=us-ascii)
From fbdd1345936d0e9909061fb6101831ade657b1b0 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Thu, 27 Oct 2022 10:18:13 -0700
Subject: [PATCH v3] lwlock: fix quadratic behaviour with very long wait lists
Author: Andres Freund <andres@anarazel.de>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Discussion: https://postgr.es/m/20221027165914.2hofzp4cvutj6gin@awork3.anarazel.de
Discussion: https://postgr.es/m/CALj2ACXktNbG=K8Xi7PSqbofTZozavhaxjatVc14iYaLu4Maag@mail.gmail.com
Backpatch:
---
src/include/storage/lwlock.h | 8 +++++
src/include/storage/proc.h | 2 +-
src/backend/access/transam/twophase.c | 2 +-
src/backend/storage/lmgr/lwlock.c | 50 +++++++++++++++------------
src/backend/storage/lmgr/proc.c | 4 +--
5 files changed, 39 insertions(+), 27 deletions(-)
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index ca4eca76f41..a494cb598ff 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -23,6 +23,14 @@
struct PGPROC;
+/* what state of the wait process is a backend in */
+typedef enum LWLockWaitState
+{
+ LW_WS_NOT_WAITING, /* not currently waiting / woken up */
+ LW_WS_WAITING, /* currently waiting */
+ LW_WS_PENDING_WAKEUP, /* removed from waitlist, but not yet signalled */
+} LWLockWaitState;
+
/*
* Code outside of lwlock.c should not manipulate the contents of this
* structure directly, but we have to declare it here to allow LWLocks to be
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 8d096fdeeb1..aa13e1d66e8 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -217,7 +217,7 @@ struct PGPROC
bool recoveryConflictPending;
/* Info about LWLock the process is currently waiting for, if any. */
- bool lwWaiting; /* true if waiting for an LW lock */
+ uint8 lwWaiting; /* see LWLockWaitState */
uint8 lwWaitMode; /* lwlock mode being waited for */
proclist_node lwWaitLink; /* position in LW lock wait list */
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 803d169f578..5017f4451eb 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -485,7 +485,7 @@ MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
proc->roleId = owner;
proc->tempNamespaceId = InvalidOid;
proc->isBackgroundWorker = false;
- proc->lwWaiting = false;
+ proc->lwWaiting = LW_WS_NOT_WAITING;
proc->lwWaitMode = 0;
proc->waitLock = NULL;
proc->waitProcLock = NULL;
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 0fc0cf6ebbd..a8a439ad516 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -987,6 +987,14 @@ LWLockWakeup(LWLock *lock)
wokeup_somebody = true;
}
+ /*
+ * Signal that the process isn't on the wait list anymore. This allows
+ * LWLockDequeueSelf() to remove itself of the waitlist with a
+ * proclist_delete(), rather than having to check if it has been
+ * removed from the list.
+ */
+ waiter->lwWaiting = LW_WS_PENDING_WAKEUP;
+
/*
* Once we've woken up an exclusive lock, there's no point in waking
* up anybody else.
@@ -1044,7 +1052,7 @@ LWLockWakeup(LWLock *lock)
* another lock.
*/
pg_write_barrier();
- waiter->lwWaiting = false;
+ waiter->lwWaiting = LW_WS_NOT_WAITING;
PGSemaphoreUnlock(waiter->sem);
}
}
@@ -1065,7 +1073,7 @@ LWLockQueueSelf(LWLock *lock, LWLockMode mode)
if (MyProc == NULL)
elog(PANIC, "cannot wait without a PGPROC structure");
- if (MyProc->lwWaiting)
+ if (MyProc->lwWaiting != LW_WS_NOT_WAITING)
elog(PANIC, "queueing for lock while waiting on another one");
LWLockWaitListLock(lock);
@@ -1073,7 +1081,7 @@ LWLockQueueSelf(LWLock *lock, LWLockMode mode)
/* setting the flag is protected by the spinlock */
pg_atomic_fetch_or_u32(&lock->state, LW_FLAG_HAS_WAITERS);
- MyProc->lwWaiting = true;
+ MyProc->lwWaiting = LW_WS_WAITING;
MyProc->lwWaitMode = mode;
/* LW_WAIT_UNTIL_FREE waiters are always at the front of the queue */
@@ -1100,8 +1108,7 @@ LWLockQueueSelf(LWLock *lock, LWLockMode mode)
static void
LWLockDequeueSelf(LWLock *lock)
{
- bool found = false;
- proclist_mutable_iter iter;
+ bool on_waitlist;
#ifdef LWLOCK_STATS
lwlock_stats *lwstats;
@@ -1114,18 +1121,13 @@ LWLockDequeueSelf(LWLock *lock)
LWLockWaitListLock(lock);
/*
- * Can't just remove ourselves from the list, but we need to iterate over
- * all entries as somebody else could have dequeued us.
+ * Remove ourselves from the waitlist, unless we've already been
+ * removed. The removal happens with the wait list lock held, so there's
+ * no race in this check.
*/
- proclist_foreach_modify(iter, &lock->waiters, lwWaitLink)
- {
- if (iter.cur == MyProc->pgprocno)
- {
- found = true;
- proclist_delete(&lock->waiters, iter.cur, lwWaitLink);
- break;
- }
- }
+ on_waitlist = MyProc->lwWaiting == LW_WS_WAITING;
+ if (on_waitlist)
+ proclist_delete(&lock->waiters, MyProc->pgprocno, lwWaitLink);
if (proclist_is_empty(&lock->waiters) &&
(pg_atomic_read_u32(&lock->state) & LW_FLAG_HAS_WAITERS) != 0)
@@ -1137,8 +1139,8 @@ LWLockDequeueSelf(LWLock *lock)
LWLockWaitListUnlock(lock);
/* clear waiting state again, nice for debugging */
- if (found)
- MyProc->lwWaiting = false;
+ if (on_waitlist)
+ MyProc->lwWaiting = LW_WS_NOT_WAITING;
else
{
int extraWaits = 0;
@@ -1162,7 +1164,7 @@ LWLockDequeueSelf(LWLock *lock)
for (;;)
{
PGSemaphoreLock(MyProc->sem);
- if (!MyProc->lwWaiting)
+ if (MyProc->lwWaiting == LW_WS_NOT_WAITING)
break;
extraWaits++;
}
@@ -1313,7 +1315,7 @@ LWLockAcquire(LWLock *lock, LWLockMode mode)
for (;;)
{
PGSemaphoreLock(proc->sem);
- if (!proc->lwWaiting)
+ if (proc->lwWaiting == LW_WS_NOT_WAITING)
break;
extraWaits++;
}
@@ -1478,7 +1480,7 @@ LWLockAcquireOrWait(LWLock *lock, LWLockMode mode)
for (;;)
{
PGSemaphoreLock(proc->sem);
- if (!proc->lwWaiting)
+ if (proc->lwWaiting == LW_WS_NOT_WAITING)
break;
extraWaits++;
}
@@ -1694,7 +1696,7 @@ LWLockWaitForVar(LWLock *lock, uint64 *valptr, uint64 oldval, uint64 *newval)
for (;;)
{
PGSemaphoreLock(proc->sem);
- if (!proc->lwWaiting)
+ if (proc->lwWaiting == LW_WS_NOT_WAITING)
break;
extraWaits++;
}
@@ -1772,6 +1774,8 @@ LWLockUpdateVar(LWLock *lock, uint64 *valptr, uint64 val)
proclist_delete(&lock->waiters, iter.cur, lwWaitLink);
proclist_push_tail(&wakeup, iter.cur, lwWaitLink);
+
+ waiter->lwWaiting = LW_WS_PENDING_WAKEUP;
}
/* We are done updating shared state of the lock itself. */
@@ -1787,7 +1791,7 @@ LWLockUpdateVar(LWLock *lock, uint64 *valptr, uint64 val)
proclist_delete(&wakeup, iter.cur, lwWaitLink);
/* check comment in LWLockWakeup() about this barrier */
pg_write_barrier();
- waiter->lwWaiting = false;
+ waiter->lwWaiting = LW_WS_NOT_WAITING;
PGSemaphoreUnlock(waiter->sem);
}
}
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 13fa07b0ff7..01af5ab4a2e 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -397,7 +397,7 @@ InitProcess(void)
/* NB -- autovac launcher intentionally does not set IS_AUTOVACUUM */
if (IsAutoVacuumWorkerProcess())
MyProc->statusFlags |= PROC_IS_AUTOVACUUM;
- MyProc->lwWaiting = false;
+ MyProc->lwWaiting = LW_WS_NOT_WAITING;
MyProc->lwWaitMode = 0;
MyProc->waitLock = NULL;
MyProc->waitProcLock = NULL;
@@ -579,7 +579,7 @@ InitAuxiliaryProcess(void)
MyProc->isBackgroundWorker = IsBackgroundWorker;
MyProc->delayChkptFlags = 0;
MyProc->statusFlags = 0;
- MyProc->lwWaiting = false;
+ MyProc->lwWaiting = LW_WS_NOT_WAITING;
MyProc->lwWaitMode = 0;
MyProc->waitLock = NULL;
MyProc->waitProcLock = NULL;
--
2.38.0
On Mon, Oct 31, 2022 at 4:51 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2022-10-31 16:21:06 +0530, Bharath Rupireddy wrote:
BTW, I've seen a sporadic crash (SEGV) with the patch in bg writer
with the same set up [1], I'm not sure if it's really because of the
patch. I'm unable to reproduce it now and unfortunately I didn't
capture further details when it occurred.
That's likely because the prototype patch I submitted in this thread missed
updating LWLockUpdateVar().
Updated patch attached.
Greetings,
Andres Freund
Hi,
Minor comment:
+ uint8 lwWaiting; /* see LWLockWaitState */
Why not declare `lwWaiting` of type LWLockWaitState ?
Cheers
Hi,
On 2022-10-31 17:17:03 -0700, Zhihong Yu wrote:
On Mon, Oct 31, 2022 at 4:51 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2022-10-31 16:21:06 +0530, Bharath Rupireddy wrote:
BTW, I've seen a sporadic crash (SEGV) with the patch in bg writer
with the same set up [1], I'm not sure if it's really because of the
patch. I'm unable to reproduce it now and unfortunately I didn't
capture further details when it occurred.
That's likely because the prototype patch I submitted in this thread missed
updating LWLockUpdateVar().
Updated patch attached.
Greetings,
Andres Freund
Hi,
Minor comment:
+ uint8 lwWaiting; /* see LWLockWaitState */
Why not declare `lwWaiting` of type LWLockWaitState ?
Unfortunately C99 (*) doesn't allow specifying the width of an enum
field. With most compilers we'd end up using 4 bytes.
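For what it's worth, a tiny standalone illustration of that size difference
(just my own snippet; the exact sizes are compiler- and ABI-dependent, but on
the usual x86-64 ABIs an enum field is int-sized):

#include <stdint.h>
#include <stdio.h>

typedef enum {WS_NOT_WAITING, WS_WAITING, WS_PENDING_WAKEUP} WaitState;

struct with_enum
{
	WaitState	lwWaiting;		/* typically 4 bytes */
	uint8_t		lwWaitMode;
};

struct with_uint8
{
	uint8_t		lwWaiting;		/* 1 byte; valid values documented via the enum */
	uint8_t		lwWaitMode;
};

int
main(void)
{
	/* commonly prints "8 2" on x86-64, alignment padding included */
	printf("%zu %zu\n", sizeof(struct with_enum), sizeof(struct with_uint8));
	return 0;
}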
Greetings,
Andres Freund
(*) C++ has allowed specifying this for quite a few years now and I think C23
will support it too, but that doesn't help us at this point.
On Mon, Oct 31, 2022 at 5:19 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2022-10-31 17:17:03 -0700, Zhihong Yu wrote:
On Mon, Oct 31, 2022 at 4:51 PM Andres Freund <andres@anarazel.de>
wrote:
Hi,
On 2022-10-31 16:21:06 +0530, Bharath Rupireddy wrote:
BTW, I've seen a sporadic crash (SEGV) with the patch in bg writer
with the same set up [1], I'm not sure if it's really because of the
patch. I'm unable to reproduce it now and unfortunately I didn't
capture further details when it occurred.
That's likely because the prototype patch I submitted in this thread missed
updating LWLockUpdateVar().
Updated patch attached.
Greetings,
Andres Freund
Hi,
Minor comment:
+ uint8 lwWaiting; /* see LWLockWaitState */
Why not declare `lwWaiting` of type LWLockWaitState ?
Unfortunately C99 (*) doesn't allow specifying the width of an enum
field. With most compilers we'd end up using 4 bytes.
Greetings,
Andres Freund
(*) C++ has allowed specifying this for quite a few years now and I think C23
will support it too, but that doesn't help us at this point.
Hi,
Thanks for the response.
If possible, it would be better to put your explanation in the code comment
(so that other people know the reasoning).
On Tue, Nov 1, 2022 at 5:21 AM Andres Freund <andres@anarazel.de> wrote:
On 2022-10-31 16:21:06 +0530, Bharath Rupireddy wrote:
BTW, I've seen a sporadic crash (SEGV) with the patch in bg writer
with the same set up [1], I'm not sure if it's really because of the
patch. I'm unable to reproduce it now and unfortunately I didn't
capture further details when it occurred.
That's likely because the prototype patch I submitted in this thread missed
updating LWLockUpdateVar().
Updated patch attached.
Thanks. It looks good to me. However, some minor comments on the v3 patch:
1.
- if (MyProc->lwWaiting)
+ if (MyProc->lwWaiting != LW_WS_NOT_WAITING)
elog(PANIC, "queueing for lock while waiting on another one");
Can the above condition be MyProc->lwWaiting == LW_WS_WAITING ||
MyProc->lwWaiting == LW_WS_PENDING_WAKEUP for better readability?
Or add an assertion Assert(MyProc->lwWaiting != LW_WS_WAITING &&
MyProc->lwWaiting != LW_WS_PENDING_WAKEUP); before setting
LW_WS_WAITING?
2.
/* Awaken any waiters I removed from the queue. */
proclist_foreach_modify(iter, &wakeup, lwWaitLink)
{
@@ -1044,7 +1052,7 @@ LWLockWakeup(LWLock *lock)
* another lock.
*/
pg_write_barrier();
- waiter->lwWaiting = false;
+ waiter->lwWaiting = LW_WS_NOT_WAITING;
PGSemaphoreUnlock(waiter->sem);
}
/*
* Awaken any waiters I removed from the queue.
*/
proclist_foreach_modify(iter, &wakeup, lwWaitLink)
{
PGPROC *waiter = GetPGProcByNumber(iter.cur);
proclist_delete(&wakeup, iter.cur, lwWaitLink);
/* check comment in LWLockWakeup() about this barrier */
pg_write_barrier();
waiter->lwWaiting = LW_WS_NOT_WAITING;
Can we add an assertion Assert(waiter->lwWaiting ==
LW_WS_PENDING_WAKEUP) in the above two places? We prepare the wakeup
list and set the LW_WS_NOT_WAITING flag in above loops, having an
assertion is better here IMO.
Below are test results with v3 patch. +1 for back-patching it.
Clients HEAD PATCHED
1 34142 34289
2 72760 69720
4 136300 131848
8 210809 210192
16 240718 242744
32 297587 297354
64 341939 343036
128 383615 383801
256 342094 337680
512 263194 288629
768 145526 261553
1024 107267 241811
2048 35716 188389
4096 12415 120300
Clients PG15 PATCHED
1 34503 34078
2 73708 72054
4 139415 133321
8 212396 211390
16 242227 242584
32 303441 309288
64 362680 339211
128 378645 344291
256 340016 344291
512 290044 293337
768 140277 264618
1024 96191 247636
2048 35158 181488
4096 12164 118610
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Tue, Nov 1, 2022 at 3:17 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
Below are test results with v3 patch. +1 for back-patching it.
The problem with back-patching stuff like this is that it can have
unanticipated consequences. I think that the chances of something like
this backfiring are less than for a patch that changes plans, but I
don't think that they're nil, either. It could turn out that this
patch, which has really promising results on the workloads we've
tested, harms some other workload due to some other contention pattern
we can't foresee. It could also turn out that improving performance at
the database level actually has negative consequences for some
application using the database, because the application could be
unknowingly relying on the database to throttle its activity.
It's hard for me to estimate exactly what the risk of a patch like
this is. I think that if we back-patched this, and only this, perhaps
the chances of something bad happening aren't incredibly high. But if
we get into the habit of back-patching seemingly-innocuous performance
improvements, it's only a matter of time before one of them turns out
not to be so innocuous as we thought. I would guess that the number of
times we have to back-patch something like this before somebody starts
complaining about a regression is likely to be somewhere between 3 and
5.
It's possible that I'm too pessimistic, though.
--
Robert Haas
EDB: http://www.enterprisedb.com
On 11/1/22 8:37 AM, Robert Haas wrote:
On Tue, Nov 1, 2022 at 3:17 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
Below are test results with v3 patch. +1 for back-patching it.
First, awesome find and proposed solution!
The problem with back-patching stuff like this is that it can have
unanticipated consequences. I think that the chances of something like
this backfiring are less than for a patch that changes plans, but I
don't think that they're nil, either. It could turn out that this
patch, which has really promising results on the workloads we've
tested, harms some other workload due to some other contention pattern
we can't foresee. It could also turn out that improving performance at
the database level actually has negative consequences for some
application using the database, because the application could be
unknowingly relying on the database to throttle its activity.
If someone is using the database to throttle activity for their app, I
have a bunch of follow up questions to understand why.
It's hard for me to estimate exactly what the risk of a patch like
this is. I think that if we back-patched this, and only this, perhaps
the chances of something bad happening aren't incredibly high. But if
we get into the habit of back-patching seemingly-innocuous performance
improvements, it's only a matter of time before one of them turns out
not to be so innocuous as we thought. I would guess that the number of
times we have to back-patch something like this before somebody starts
complaining about a regression is likely to be somewhere between 3 and
5.
Having the privilege of reading through the release notes for every
update release, I see on average 1-2 "performance improvements" in each
release. I believe those tend to be more negligible than this one, though.
I do understand the concerns. Say you discover your workload does have a
regression with this patch and then there's a CVE that you want to
accept -- what do you do? Reading the thread / patch, it seems as if
this is a lower risk "performance fix", but still nonzero.
While this does affect all supported versions, we could also consider
backpatching only for PG15. That at least 1/ limits impact on users
running older versions (opting into a major version upgrade) and 2/
we're still very early in the major upgrade cycle for PG15 that it's
lower risk if there are issues.
Users are generally happy when they can perform a simple upgrade and get
a performance boost, particularly the set of users that this patch
affects most (high throughput, high connection count). This is the type
of fix that would make headlines in a major release announcement (10x
TPS improvement w/4096 connections?!). Part of the tradeoff of
backpatching this is that we may lose some of the higher-visibility
marketing opportunities to discuss this (though I'm sure there will be
plenty of blog posts, etc.)
Andres: when you suggested backpatching, were you thinking of the Nov
2022 release or the Feb 2023 release?
Thanks,
Jonathan
Hi,
On 2022-11-01 08:37:39 -0400, Robert Haas wrote:
On Tue, Nov 1, 2022 at 3:17 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
Below are test results with v3 patch. +1 for back-patching it.
The problem with back-patching stuff like this is that it can have
unanticipated consequences. I think that the chances of something like
this backfiring are less than for a patch that changes plans, but I
don't think that they're nil, either. It could turn out that this
patch, which has really promising results on the workloads we've
tested, harms some other workload due to some other contention pattern
we can't foresee. It could also turn out that improving performance at
the database level actually has negative consequences for some
application using the database, because the application could be
unknowingly relying on the database to throttle its activity.
It's hard for me to estimate exactly what the risk of a patch like
this is. I think that if we back-patched this, and only this, perhaps
the chances of something bad happening aren't incredibly high. But if
we get into the habit of back-patching seemingly-innocuous performance
improvements, it's only a matter of time before one of them turns out
not to be so innocuous as we thought. I would guess that the number of
times we have to back-patch something like this before somebody starts
complaining about a regression is likely to be somewhere between 3 and
5.
In general I agree, we shouldn't default to backpatching performance
fixes. The reason I am even considering it in this case is that it's a
readily reproducible issue, leading to a quadratic behaviour that's extremely
hard to pinpoint. There's no increase in CPU usage, no wait event for
spinlocks, the system doesn't even get stuck (because the wait list lock is
held after the lwlock lock release). I don't think users have a decent chance
at figuring out that this is the issue.
I'm not at all convinced we should backpatch either, just to be clear.
Greetings,
Andres Freund
Hi,
On 2022-11-01 11:19:02 -0400, Jonathan S. Katz wrote:
This is the type of fix that would make headlines in a major release
announcement (10x TPS improvement w/4096 connections?!). That is also part
of the tradeoff of backpatching this, is that we may lose some of the higher
visibility marketing opportunities to discuss this (though I'm sure there
will be plenty of blog posts, etc.)
(read the next paragraph with the caveat that results below prove it somewhat
wrong)
I don't think the fix is as big a deal as the above makes it sound - you need
to do somewhat extreme things to hit the problem. Yes, it drastically improves
the scalability of e.g. doing SELECT txid_current() across as many sessions as
possible - but that's not something you normally do (it was a good candidate
to show the problem because it's a single lock but doesn't trigger WAL flushes
at commit).
You can probably hit the problem with many concurrent single-tx INSERTs, but
you'd need to have synchronous_commit=off or fsync=off (or a very expensive
server class SSD with battery backup) and the effect is likely smaller.
Andres: when you suggested backpatching, were you thinking of the Nov 2022
release or the Feb 2023 release?
I wasn't thinking that concretely. Even if we decide to backpatch, I'd be very
hesitant to do it in a few days.
<goes and runs test while in meeting>
I tested with browser etc running, so this is plenty noisy. I used the best of
the two pgbench -T21 -P5 tps, after ignoring the first two periods (they're
too noisy). I used an ok-ish NVMe SSD, rather than the expensive one that
has "free" fsync.
synchronous_commit=on:
clients master fix
16 6196 6202
64 25716 25545
256 90131 90240
1024 128556 151487
2048 59417 157050
4096 32252 178823
synchronous_commit=off:
clients master fix
16 409828 409016
64 454257 455804
256 304175 452160
1024 135081 334979
2048 66124 291582
4096 27019 245701
Hm. That's a bigger effect than I anticipated. I guess sc=off isn't actually
required, due to the level of concurrency making group commit very
effective.
This is without an index, serial column or anything. But a quick comparison
for just 4096 clients shows that there is still a big difference if I create a
serial primary key:
master: 26172
fix: 155813
Greetings,
Andres Freund
On 11/1/22 1:41 PM, Andres Freund wrote:
Andres: when you suggested backpatching, were you thinking of the Nov 2022
release or the Feb 2023 release?
I wasn't thinking that concretely. Even if we decide to backpatch, I'd be very
hesitant to do it in a few days.
Yeah this was my thinking (and also why I took a few days to reply given
the lack of urgency for this release). It would at least give some more
time for others to test it to feel confident that we're not introducing
noticeable regressions.
<goes and runs test while in meeting>
I tested with browser etc running, so this is plenty noisy. I used the best of
the two pgbench -T21 -P5 tps, after ignoring the first two periods (they're
too noisy). I used an ok-ish NVMe SSD, rather than the the expensive one that
has "free" fsync.synchronous_commit=on:
clients master fix
16 6196 6202
64 25716 25545
256 90131 90240
1024 128556 151487
2048 59417 157050
4096 32252 178823
synchronous_commit=off:
clients master fix
16 409828 409016
64 454257 455804
256 304175 452160
1024 135081 334979
2048 66124 291582
4096 27019 245701
Hm. That's a bigger effect than I anticipated. I guess sc=off isn't actually
required, due to the level of concurrency making group commit very
effective.
This is without an index, serial column or anything. But a quick comparison
for just 4096 clients shows that there is still a big difference if I create a
serial primary key:
master: 26172
fix: 155813
🤯 (seeing if my exploding head makes it into the archives).
Given the lack of ABI changes (hesitant to say low-risk until after more
testing, but seemingly low-risk), I can get behind backpatching, especially if
we're targeting Feb 2023 so we can test some more.
With my advocacy hat on, it bums me that we may not get as much buzz
about this change given it's not in a major release, but 1/ it'll fix an
issue that will help users with high-concurrency workloads and 2/ users would be
able to perform a simpler update to get the change.
Thanks,
Jonathan
On Tue, Nov 1, 2022 at 12:46 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
Updated patch attached.
Thanks. It looks good to me. However, some minor comments on the v3 patch:
1.
- if (MyProc->lwWaiting)
+ if (MyProc->lwWaiting != LW_WS_NOT_WAITING)
elog(PANIC, "queueing for lock while waiting on another one");
Can the above condition be MyProc->lwWaiting == LW_WS_WAITING ||
MyProc->lwWaiting == LW_WS_PENDING_WAKEUP for better readability?
Or add an assertion Assert(MyProc->lwWaiting != LW_WS_WAITING &&
MyProc->lwWaiting != LW_WS_PENDING_WAKEUP); before setting
LW_WS_WAITING?
2.
/* Awaken any waiters I removed from the queue. */
proclist_foreach_modify(iter, &wakeup, lwWaitLink)
{
@@ -1044,7 +1052,7 @@ LWLockWakeup(LWLock *lock)
* another lock.
*/
pg_write_barrier();
- waiter->lwWaiting = false;
+ waiter->lwWaiting = LW_WS_NOT_WAITING;
PGSemaphoreUnlock(waiter->sem);
}
/*
* Awaken any waiters I removed from the queue.
*/
proclist_foreach_modify(iter, &wakeup, lwWaitLink)
{
PGPROC *waiter = GetPGProcByNumber(iter.cur);
proclist_delete(&wakeup, iter.cur, lwWaitLink);
/* check comment in LWLockWakeup() about this barrier */
pg_write_barrier();
waiter->lwWaiting = LW_WS_NOT_WAITING;
Can we add an assertion Assert(waiter->lwWaiting ==
LW_WS_PENDING_WAKEUP) in the above two places? We prepare the wakeup
list and set the LW_WS_NOT_WAITING flag in above loops, having an
assertion is better here IMO.
Below are test results with v3 patch. +1 for back-patching it.
Clients HEAD PATCHED
1 34142 34289
2 72760 69720
4 136300 131848
8 210809 210192
16 240718 242744
32 297587 297354
64 341939 343036
128 383615 383801
256 342094 337680
512 263194 288629
768 145526 261553
1024 107267 241811
2048 35716 188389
Clients PG15 PATCHED
1 34503 34078
2 73708 72054
4 139415 133321
8 212396 211390
16 242227 242584
32 303441 309288
64 362680 339211
128 378645 344291
256 340016 344291
512 290044 293337
768 140277 264618
1024 96191 247636
2048 35158 181488
4096 12164 118610
I looked at the v3 patch again today and ran some performance tests.
The results look as impressive as they did earlier. Andres, any plans to
get this in?
pgbench with SELECT txid_current();:
Clients HEAD PATCHED
1 34613 33611
2 72634 70546
4 137885 136911
8 216470 216076
16 242535 245392
32 299952 304740
64 329788 347401
128 378296 386873
256 344939 343832
512 292196 295839
768 144212 260102
1024 101525 250263
2048 35594 185878
4096 11842 104227
pgbench with insert into pgbench_accounts table:
Clients HEAD PATCHED
1 1660 1600
2 1848 1746
4 3547 3395
8 7330 6754
16 13103 13613
32 26011 26372
64 52331 52594
128 93313 95526
256 127373 126182
512 126712 127857
768 116765 119227
1024 111464 112499
2048 58838 92756
4096 26066 60543
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,
On 2022-11-09 15:54:16 +0530, Bharath Rupireddy wrote:
On Tue, Nov 1, 2022 at 12:46 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
Updated patch attached.
Thanks. It looks good to me. However, some minor comments on the v3 patch:
1.
- if (MyProc->lwWaiting)
+ if (MyProc->lwWaiting != LW_WS_NOT_WAITING)
elog(PANIC, "queueing for lock while waiting on another one");
Can the above condition be MyProc->lwWaiting == LW_WS_WAITING ||
MyProc->lwWaiting == LW_WS_PENDING_WAKEUP for better readability?
Or add an assertion Assert(MyProc->lwWaiting != LW_WS_WAITING &&
MyProc->lwWaiting != LW_WS_PENDING_WAKEUP); before setting
LW_WS_WAITING?
I don't think that's a good idea - it'll just mean we have to modify more
places if we add another state, without making anything more robust.
2.
/* Awaken any waiters I removed from the queue. */
proclist_foreach_modify(iter, &wakeup, lwWaitLink)
{
@@ -1044,7 +1052,7 @@ LWLockWakeup(LWLock *lock)
* another lock.
*/
pg_write_barrier();
- waiter->lwWaiting = false;
+ waiter->lwWaiting = LW_WS_NOT_WAITING;
PGSemaphoreUnlock(waiter->sem);
}
/*
* Awaken any waiters I removed from the queue.
*/
proclist_foreach_modify(iter, &wakeup, lwWaitLink)
{
PGPROC *waiter = GetPGProcByNumber(iter.cur);
proclist_delete(&wakeup, iter.cur, lwWaitLink);
/* check comment in LWLockWakeup() about this barrier */
pg_write_barrier();
waiter->lwWaiting = LW_WS_NOT_WAITING;
Can we add an assertion Assert(waiter->lwWaiting ==
LW_WS_PENDING_WAKEUP) in the above two places? We prepare the wakeup
list and set the LW_WS_NOT_WAITING flag in above loops, having an
assertion is better here IMO.
I guess it can't hurt - but it's not really related to the changes in the
patch, no?
I looked at the v3 patch again today and ran some performance tests.
The results look as impressive as they did earlier. Andres, any plans to
get this in?
I definitely didn't want to backpatch before this point release. But it seems
we haven't quite reached an agreement on what to do about backpatching. It's
probably best to just commit it to HEAD and let the backpatch discussion
happen concurrently.
I'm on a hike, without any connectivity, Thu afternoon - Sun. I think it's OK
to push it to HEAD if I get it done in the next few hours. Bigger issues,
which I do not expect, should show up before tomorrow afternoon. Smaller
things could wait till Sunday if necessary.
Greetings,
Andres Freund
On 2022-11-09 09:38:08 -0800, Andres Freund wrote:
I'm on a hike, without any connectivity, Thu afternoon - Sun. I think it's OK
to push it to HEAD if I get it done in the next few hours. Bigger issues,
which I do not expect, should show up before tomorrow afternoon. Smaller
things could wait till Sunday if necessary.
I didn't get to it in time, so I'll leave it for when I'm back.
Hi,
On 2022-11-09 17:03:13 -0800, Andres Freund wrote:
On 2022-11-09 09:38:08 -0800, Andres Freund wrote:
I'm on a hike, without any connectivity, Thu afternoon - Sun. I think it's OK
to push it to HEAD if I get it done in the next few hours. Bigger issues,
which I do not expect, should show up before tomorrow afternoon. Smaller
things could wait till Sunday if necessary.
I didn't get to it in time, so I'll leave it for when I'm back.
Took a few days longer, partially because I encountered an independent issue
(see 8c954168cff) while testing.
I pushed it to HEAD now.
I still think it might be worth backpatching in a bit, but so far the votes
on that weren't clear enough for me to feel comfortable.
Regards,
Andres
On 11/20/22 2:56 PM, Andres Freund wrote:
Hi,
On 2022-11-09 17:03:13 -0800, Andres Freund wrote:
On 2022-11-09 09:38:08 -0800, Andres Freund wrote:
I'm on a hike, without any connectivity, Thu afternoon - Sun. I think it's OK
to push it to HEAD if I get it done in the next few hours. Bigger issues,
which I do not expect, should show up before tomorrow afternoon. Smaller
things could wait till Sunday if necessary.
I didn't get to it in time, so I'll leave it for when I'm back.
Took a few days longer, partially because I encountered an independent issue
(see 8c954168cff) while testing.
I pushed it to HEAD now.
Thanks!
I still think it might be worth backpatching in a bit, but so far the votes
on that weren't clear enough for me to feel comfortable.
My general feeling is "yes" on backpatching, particularly if this is a
bug and it's fixable without ABI breaks.
My comments were around performing additional workload benchmarking just
to ensure people feel comfortable that we're not introducing any
performance regressions, and to consider the Feb 2023 release as the
time to introduce this (vs. Nov 2022). That gives us ample time to
determine if there are any performance regressions introduced.
Thanks,
Jonathan
On Mon, Nov 21, 2022 at 10:31:14AM -0500, Jonathan S. Katz wrote:
On 11/20/22 2:56 PM, Andres Freund wrote:
I still think it might be worth backpatching in a bit, but so far the votes
on that weren't clear enough for me to feel comfortable.
My general feeling is "yes" on backpatching, particularly if this is a bug
and it's fixable without ABI breaks.
Now that commit a4adc31 has had some time to bake and concerns about
unintended consequences may have abated, I wanted to revive this
back-patching discussion. I see a few possibly-related reports [0] [1]
[2], and I'm now seeing this in the field, too. While it is debatable
whether this is a bug, it's a quite nasty issue for users, and it's both
difficult to detect and difficult to work around.
Thoughts?
[0]: /messages/by-id/CAM527d-uDn5osa6QPKxHAC6srOfBH3M8iXUM=ewqHV6n=w1u8Q@mail.gmail.com
[1]: /messages/by-id/VI1PR05MB620666631A41186ACC3FC91ACFC70@VI1PR05MB6206.eurprd05.prod.outlook.com
[2]: /messages/by-id/dd0e070809430a31f7ddd8483fbcce59@mail.gmail.com
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Wed, Jan 10, 2024 at 09:17:47PM -0600, Nathan Bossart wrote:
Now that commit a4adc31 has had some time to bake and concerns about
unintended consequences may have abated, I wanted to revive this
back-patching discussion. I see a few possibly-related reports [0] [1]
[2], and I'm now seeing this in the field, too. While it is debatable
whether this is a bug, it's a quite nasty issue for users, and it's both
difficult to detect and difficult to work around.
+1, I've seen this becoming a PITA for a few things. Knowing that the
size of PGPROC does not change at all, I would be in favor for a
backpatch, especially since it's been in the tree for more than 1
year, and even more knowing that we have 16 released with this stuff
in.
--
Michael
On 1/10/24 10:45 PM, Michael Paquier wrote:
On Wed, Jan 10, 2024 at 09:17:47PM -0600, Nathan Bossart wrote:
Now that commit a4adc31 has had some time to bake and concerns about
unintended consequences may have abated, I wanted to revive this
back-patching discussion. I see a few possibly-related reports [0] [1]
[2], and I'm now seeing this in the field, too. While it is debatable
whether this is a bug, it's a quite nasty issue for users, and it's both
difficult to detect and difficult to work around.+1, I've seen this becoming a PITA for a few things. Knowing that the
size of PGPROC does not change at all, I would be in favor for a
backpatch, especially since it's been in the tree for more than 1
year, and even more knowing that we have 16 released with this stuff
in.
I have similar data sources to Nathan/Michael and I'm trying to avoid
piling on, but one case that's interesting occurred after a major
version upgrade from PG10 to PG14 on a database supporting a very
active/highly concurrent workload. On inspection, it seems like
backpatching would help this particular case.
With 10/11 EOL, I do wonder if we'll see more of these reports on
upgrade to < PG16.
(I was in favor of backpatching prior; opinion is unchanged).
Thanks,
Jonathan
On Thu, Jan 11, 2024 at 09:47:33AM -0500, Jonathan S. Katz wrote:
I have similar data sources to Nathan/Michael and I'm trying to avoid piling
on, but one case that's interesting occurred after a major version upgrade
from PG10 to PG14 on a database supporting a very active/highly concurrent
workload. On inspection, it seems like backpatching would help this
particular case.
With 10/11 EOL, I do wonder if we'll see more of these reports on upgrade to
< PG16.
(I was in favor of backpatching prior; opinion is unchanged).
Hearing nothing, I have prepared a set of patches for v12~v15,
checking all the lwlock paths for all the branches. In the end, the
set of changes looks rather sane to me regarding the queue handling.
I have also run some numbers on all the branches, and the test case
posted upthread falls off dramatically after 512 concurrent
connections at the top of all the stable branches :(
For example on REL_12_STABLE with and without the patch attached:
num v12 v12+patch
1 29717.151665 29096.707588
2 63257.709301 61889.476318
4 127921.873393 124575.901330
8 231400.571662 230562.725174
16 343911.185351 312432.897015
32 291748.985280 281011.787701
64 268998.728648 269975.605115
128 297332.597018 286449.176950
256 243902.817657 240559.122309
512 190069.602270 194510.718508
768 58915.650225 165714.707198
1024 39920.950552 149433.836901
2048 16922.391688 108164.301054
4096 6229.063321 69032.338708
I'd like to apply that, just let me know if you have any comments
and/or objections.
--
Michael
Attachments:
0001-lwlock-Fix-quadratic-behavior-with-very-long-wai.v15.patch (text/x-diff; charset=us-ascii)
From 83706cd5792bc002a668fd54cb93c15011a97063 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sun, 20 Nov 2022 11:56:32 -0800
Subject: [PATCH] lwlock: Fix quadratic behavior with very long wait lists
Until now LWLockDequeueSelf() sequentially searched the list of waiters to see
if the current proc is still is on the list of waiters, or has already been
removed. In extreme workloads, where the wait lists are very long, this leads
to a quadratic behavior. #backends iterating over a list #backends
long. Additionally, the likelihood of needing to call LWLockDequeueSelf() in
the first place also increases with the increased length of the wait queue, as
it becomes more likely that a lock is released while waiting for the wait list
lock, which is held for longer during lock release.
Due to the exponential back-off in perform_spin_delay() this is surprisingly
hard to detect. We should make that easier, e.g. by adding a wait event around
the pg_usleep() - but that's a separate patch.
The fix is simple - track whether a proc is currently waiting in the wait list
or already removed but waiting to be woken up in PGPROC->lwWaiting.
In some workloads with a lot of clients contending for a small number of
lwlocks (e.g. WALWriteLock), the fix can substantially increase throughput.
As the quadratic behavior arguably is a bug, we might want to decide to
backpatch this fix in the future.
Author: Andres Freund <andres@anarazel.de>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Discussion: https://postgr.es/m/20221027165914.2hofzp4cvutj6gin@awork3.anarazel.de
Discussion: https://postgr.es/m/CALj2ACXktNbG=K8Xi7PSqbofTZozavhaxjatVc14iYaLu4Maag@mail.gmail.com
---
src/include/storage/lwlock.h | 8 ++++
src/include/storage/proc.h | 2 +-
src/backend/access/transam/twophase.c | 2 +-
src/backend/storage/lmgr/lwlock.c | 53 +++++++++++++++------------
src/backend/storage/lmgr/proc.c | 4 +-
5 files changed, 42 insertions(+), 27 deletions(-)
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index e03d317eea..d88fa4b4ad 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -23,6 +23,14 @@
struct PGPROC;
+/* what state of the wait process is a backend in */
+typedef enum LWLockWaitState
+{
+ LW_WS_NOT_WAITING, /* not currently waiting / woken up */
+ LW_WS_WAITING, /* currently waiting */
+ LW_WS_PENDING_WAKEUP /* removed from waitlist, but not yet signalled */
+} LWLockWaitState;
+
/*
* Code outside of lwlock.c should not manipulate the contents of this
* structure directly, but we have to declare it here to allow LWLocks to be
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 2579e619eb..2ea773098b 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -211,7 +211,7 @@ struct PGPROC
bool recoveryConflictPending;
/* Info about LWLock the process is currently waiting for, if any. */
- bool lwWaiting; /* true if waiting for an LW lock */
+ uint8 lwWaiting; /* see LWLockWaitState */
uint8 lwWaitMode; /* lwlock mode being waited for */
proclist_node lwWaitLink; /* position in LW lock wait list */
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 5293c6920b..ca7037eb2f 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -486,7 +486,7 @@ MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
proc->roleId = owner;
proc->tempNamespaceId = InvalidOid;
proc->isBackgroundWorker = false;
- proc->lwWaiting = false;
+ proc->lwWaiting = LW_WS_NOT_WAITING;
proc->lwWaitMode = 0;
proc->waitLock = NULL;
proc->waitProcLock = NULL;
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 3bb448627c..ea34ee785c 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -987,6 +987,15 @@ LWLockWakeup(LWLock *lock)
wokeup_somebody = true;
}
+ /*
+ * Signal that the process isn't on the wait list anymore. This allows
+ * LWLockDequeueSelf() to remove itself of the waitlist with a
+ * proclist_delete(), rather than having to check if it has been
+ * removed from the list.
+ */
+ Assert(waiter->lwWaiting == LW_WS_WAITING);
+ waiter->lwWaiting = LW_WS_PENDING_WAKEUP;
+
/*
* Once we've woken up an exclusive lock, there's no point in waking
* up anybody else.
@@ -1044,7 +1053,7 @@ LWLockWakeup(LWLock *lock)
* another lock.
*/
pg_write_barrier();
- waiter->lwWaiting = false;
+ waiter->lwWaiting = LW_WS_NOT_WAITING;
PGSemaphoreUnlock(waiter->sem);
}
}
@@ -1065,7 +1074,7 @@ LWLockQueueSelf(LWLock *lock, LWLockMode mode)
if (MyProc == NULL)
elog(PANIC, "cannot wait without a PGPROC structure");
- if (MyProc->lwWaiting)
+ if (MyProc->lwWaiting != LW_WS_NOT_WAITING)
elog(PANIC, "queueing for lock while waiting on another one");
LWLockWaitListLock(lock);
@@ -1073,7 +1082,7 @@ LWLockQueueSelf(LWLock *lock, LWLockMode mode)
/* setting the flag is protected by the spinlock */
pg_atomic_fetch_or_u32(&lock->state, LW_FLAG_HAS_WAITERS);
- MyProc->lwWaiting = true;
+ MyProc->lwWaiting = LW_WS_WAITING;
MyProc->lwWaitMode = mode;
/* LW_WAIT_UNTIL_FREE waiters are always at the front of the queue */
@@ -1100,8 +1109,7 @@ LWLockQueueSelf(LWLock *lock, LWLockMode mode)
static void
LWLockDequeueSelf(LWLock *lock)
{
- bool found = false;
- proclist_mutable_iter iter;
+ bool on_waitlist;
#ifdef LWLOCK_STATS
lwlock_stats *lwstats;
@@ -1114,18 +1122,13 @@ LWLockDequeueSelf(LWLock *lock)
LWLockWaitListLock(lock);
/*
- * Can't just remove ourselves from the list, but we need to iterate over
- * all entries as somebody else could have dequeued us.
+ * Remove ourselves from the waitlist, unless we've already been
+ * removed. The removal happens with the wait list lock held, so there's
+ * no race in this check.
*/
- proclist_foreach_modify(iter, &lock->waiters, lwWaitLink)
- {
- if (iter.cur == MyProc->pgprocno)
- {
- found = true;
- proclist_delete(&lock->waiters, iter.cur, lwWaitLink);
- break;
- }
- }
+ on_waitlist = MyProc->lwWaiting == LW_WS_WAITING;
+ if (on_waitlist)
+ proclist_delete(&lock->waiters, MyProc->pgprocno, lwWaitLink);
if (proclist_is_empty(&lock->waiters) &&
(pg_atomic_read_u32(&lock->state) & LW_FLAG_HAS_WAITERS) != 0)
@@ -1137,8 +1140,8 @@ LWLockDequeueSelf(LWLock *lock)
LWLockWaitListUnlock(lock);
/* clear waiting state again, nice for debugging */
- if (found)
- MyProc->lwWaiting = false;
+ if (on_waitlist)
+ MyProc->lwWaiting = LW_WS_NOT_WAITING;
else
{
int extraWaits = 0;
@@ -1162,7 +1165,7 @@ LWLockDequeueSelf(LWLock *lock)
for (;;)
{
PGSemaphoreLock(MyProc->sem);
- if (!MyProc->lwWaiting)
+ if (MyProc->lwWaiting == LW_WS_NOT_WAITING)
break;
extraWaits++;
}
@@ -1313,7 +1316,7 @@ LWLockAcquire(LWLock *lock, LWLockMode mode)
for (;;)
{
PGSemaphoreLock(proc->sem);
- if (!proc->lwWaiting)
+ if (proc->lwWaiting == LW_WS_NOT_WAITING)
break;
extraWaits++;
}
@@ -1478,7 +1481,7 @@ LWLockAcquireOrWait(LWLock *lock, LWLockMode mode)
for (;;)
{
PGSemaphoreLock(proc->sem);
- if (!proc->lwWaiting)
+ if (proc->lwWaiting == LW_WS_NOT_WAITING)
break;
extraWaits++;
}
@@ -1694,7 +1697,7 @@ LWLockWaitForVar(LWLock *lock, uint64 *valptr, uint64 oldval, uint64 *newval)
for (;;)
{
PGSemaphoreLock(proc->sem);
- if (!proc->lwWaiting)
+ if (proc->lwWaiting == LW_WS_NOT_WAITING)
break;
extraWaits++;
}
@@ -1772,6 +1775,10 @@ LWLockUpdateVar(LWLock *lock, uint64 *valptr, uint64 val)
proclist_delete(&lock->waiters, iter.cur, lwWaitLink);
proclist_push_tail(&wakeup, iter.cur, lwWaitLink);
+
+ /* see LWLockWakeup() */
+ Assert(waiter->lwWaiting == LW_WS_WAITING);
+ waiter->lwWaiting = LW_WS_PENDING_WAKEUP;
}
/* We are done updating shared state of the lock itself. */
@@ -1787,7 +1794,7 @@ LWLockUpdateVar(LWLock *lock, uint64 *valptr, uint64 val)
proclist_delete(&wakeup, iter.cur, lwWaitLink);
/* check comment in LWLockWakeup() about this barrier */
pg_write_barrier();
- waiter->lwWaiting = false;
+ waiter->lwWaiting = LW_WS_NOT_WAITING;
PGSemaphoreUnlock(waiter->sem);
}
}
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 1b514f44e3..c60707b9db 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -397,7 +397,7 @@ InitProcess(void)
/* NB -- autovac launcher intentionally does not set IS_AUTOVACUUM */
if (IsAutoVacuumWorkerProcess())
MyProc->statusFlags |= PROC_IS_AUTOVACUUM;
- MyProc->lwWaiting = false;
+ MyProc->lwWaiting = LW_WS_NOT_WAITING;
MyProc->lwWaitMode = 0;
MyProc->waitLock = NULL;
MyProc->waitProcLock = NULL;
@@ -579,7 +579,7 @@ InitAuxiliaryProcess(void)
MyProc->isBackgroundWorker = IsBackgroundWorker;
MyProc->delayChkptFlags = 0;
MyProc->statusFlags = 0;
- MyProc->lwWaiting = false;
+ MyProc->lwWaiting = LW_WS_NOT_WAITING;
MyProc->lwWaitMode = 0;
MyProc->waitLock = NULL;
MyProc->waitProcLock = NULL;
--
2.43.0
Attachment: 0001-lwlock-Fix-quadratic-behavior-with-very-long-wai.v14.patch (text/x-diff)
From 6066487bc301e5f725a1f0c2e696ccde0f5e06bf Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sun, 20 Nov 2022 11:56:32 -0800
Subject: [PATCH] lwlock: Fix quadratic behavior with very long wait lists
Until now LWLockDequeueSelf() sequentially searched the list of waiters to see
if the current proc is still on the list of waiters, or has already been
removed. In extreme workloads, where the wait lists are very long, this leads
to a quadratic behavior. #backends iterating over a list #backends
long. Additionally, the likelihood of needing to call LWLockDequeueSelf() in
the first place also increases with the increased length of the wait queue, as
it becomes more likely that a lock is released while waiting for the wait list
lock, which is held for longer during lock release.
Due to the exponential back-off in perform_spin_delay() this is surprisingly
hard to detect. We should make that easier, e.g. by adding a wait event around
the pg_usleep() - but that's a separate patch.
The fix is simple - track whether a proc is currently waiting in the wait list
or already removed but waiting to be woken up in PGPROC->lwWaiting.
In some workloads with a lot of clients contending for a small number of
lwlocks (e.g. WALWriteLock), the fix can substantially increase throughput.
As the quadratic behavior arguably is a bug, we might want to decide to
backpatch this fix in the future.
Author: Andres Freund <andres@anarazel.de>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Discussion: https://postgr.es/m/20221027165914.2hofzp4cvutj6gin@awork3.anarazel.de
Discussion: https://postgr.es/m/CALj2ACXktNbG=K8Xi7PSqbofTZozavhaxjatVc14iYaLu4Maag@mail.gmail.com
---
src/include/storage/lwlock.h | 8 ++++
src/include/storage/proc.h | 2 +-
src/backend/access/transam/twophase.c | 2 +-
src/backend/storage/lmgr/lwlock.c | 53 +++++++++++++++------------
src/backend/storage/lmgr/proc.c | 4 +-
5 files changed, 42 insertions(+), 27 deletions(-)
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 78afe57fb2..b2c9a9dbbc 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -23,6 +23,14 @@
struct PGPROC;
+/* what state of the wait process is a backend in */
+typedef enum LWLockWaitState
+{
+ LW_WS_NOT_WAITING, /* not currently waiting / woken up */
+ LW_WS_WAITING, /* currently waiting */
+ LW_WS_PENDING_WAKEUP /* removed from waitlist, but not yet signalled */
+} LWLockWaitState;
+
/*
* Code outside of lwlock.c should not manipulate the contents of this
* structure directly, but we have to declare it here to allow LWLocks to be
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 1464fad9b9..9cc37aa0d9 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -180,7 +180,7 @@ struct PGPROC
bool recoveryConflictPending;
/* Info about LWLock the process is currently waiting for, if any. */
- bool lwWaiting; /* true if waiting for an LW lock */
+ uint8 lwWaiting; /* see LWLockWaitState */
uint8 lwWaitMode; /* lwlock mode being waited for */
proclist_node lwWaitLink; /* position in LW lock wait list */
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 076315e7a2..db68b14443 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -482,7 +482,7 @@ MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
proc->roleId = owner;
proc->tempNamespaceId = InvalidOid;
proc->isBackgroundWorker = false;
- proc->lwWaiting = false;
+ proc->lwWaiting = LW_WS_NOT_WAITING;
proc->lwWaitMode = 0;
proc->waitLock = NULL;
proc->waitProcLock = NULL;
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index d63dc1b93e..28a540961e 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -993,6 +993,15 @@ LWLockWakeup(LWLock *lock)
wokeup_somebody = true;
}
+ /*
+ * Signal that the process isn't on the wait list anymore. This allows
+ * LWLockDequeueSelf() to remove itself of the waitlist with a
+ * proclist_delete(), rather than having to check if it has been
+ * removed from the list.
+ */
+ Assert(waiter->lwWaiting == LW_WS_WAITING);
+ waiter->lwWaiting = LW_WS_PENDING_WAKEUP;
+
/*
* Once we've woken up an exclusive lock, there's no point in waking
* up anybody else.
@@ -1050,7 +1059,7 @@ LWLockWakeup(LWLock *lock)
* another lock.
*/
pg_write_barrier();
- waiter->lwWaiting = false;
+ waiter->lwWaiting = LW_WS_NOT_WAITING;
PGSemaphoreUnlock(waiter->sem);
}
}
@@ -1071,7 +1080,7 @@ LWLockQueueSelf(LWLock *lock, LWLockMode mode)
if (MyProc == NULL)
elog(PANIC, "cannot wait without a PGPROC structure");
- if (MyProc->lwWaiting)
+ if (MyProc->lwWaiting != LW_WS_NOT_WAITING)
elog(PANIC, "queueing for lock while waiting on another one");
LWLockWaitListLock(lock);
@@ -1079,7 +1088,7 @@ LWLockQueueSelf(LWLock *lock, LWLockMode mode)
/* setting the flag is protected by the spinlock */
pg_atomic_fetch_or_u32(&lock->state, LW_FLAG_HAS_WAITERS);
- MyProc->lwWaiting = true;
+ MyProc->lwWaiting = LW_WS_WAITING;
MyProc->lwWaitMode = mode;
/* LW_WAIT_UNTIL_FREE waiters are always at the front of the queue */
@@ -1107,8 +1116,7 @@ LWLockQueueSelf(LWLock *lock, LWLockMode mode)
static void
LWLockDequeueSelf(LWLock *lock)
{
- bool found = false;
- proclist_mutable_iter iter;
+ bool on_waitlist;
#ifdef LWLOCK_STATS
lwlock_stats *lwstats;
@@ -1121,18 +1129,13 @@ LWLockDequeueSelf(LWLock *lock)
LWLockWaitListLock(lock);
/*
- * Can't just remove ourselves from the list, but we need to iterate over
- * all entries as somebody else could have dequeued us.
+ * Remove ourselves from the waitlist, unless we've already been
+ * removed. The removal happens with the wait list lock held, so there's
+ * no race in this check.
*/
- proclist_foreach_modify(iter, &lock->waiters, lwWaitLink)
- {
- if (iter.cur == MyProc->pgprocno)
- {
- found = true;
- proclist_delete(&lock->waiters, iter.cur, lwWaitLink);
- break;
- }
- }
+ on_waitlist = MyProc->lwWaiting == LW_WS_WAITING;
+ if (on_waitlist)
+ proclist_delete(&lock->waiters, MyProc->pgprocno, lwWaitLink);
if (proclist_is_empty(&lock->waiters) &&
(pg_atomic_read_u32(&lock->state) & LW_FLAG_HAS_WAITERS) != 0)
@@ -1144,8 +1147,8 @@ LWLockDequeueSelf(LWLock *lock)
LWLockWaitListUnlock(lock);
/* clear waiting state again, nice for debugging */
- if (found)
- MyProc->lwWaiting = false;
+ if (on_waitlist)
+ MyProc->lwWaiting = LW_WS_NOT_WAITING;
else
{
int extraWaits = 0;
@@ -1169,7 +1172,7 @@ LWLockDequeueSelf(LWLock *lock)
for (;;)
{
PGSemaphoreLock(MyProc->sem);
- if (!MyProc->lwWaiting)
+ if (MyProc->lwWaiting == LW_WS_NOT_WAITING)
break;
extraWaits++;
}
@@ -1320,7 +1323,7 @@ LWLockAcquire(LWLock *lock, LWLockMode mode)
for (;;)
{
PGSemaphoreLock(proc->sem);
- if (!proc->lwWaiting)
+ if (proc->lwWaiting == LW_WS_NOT_WAITING)
break;
extraWaits++;
}
@@ -1485,7 +1488,7 @@ LWLockAcquireOrWait(LWLock *lock, LWLockMode mode)
for (;;)
{
PGSemaphoreLock(proc->sem);
- if (!proc->lwWaiting)
+ if (proc->lwWaiting == LW_WS_NOT_WAITING)
break;
extraWaits++;
}
@@ -1701,7 +1704,7 @@ LWLockWaitForVar(LWLock *lock, uint64 *valptr, uint64 oldval, uint64 *newval)
for (;;)
{
PGSemaphoreLock(proc->sem);
- if (!proc->lwWaiting)
+ if (proc->lwWaiting == LW_WS_NOT_WAITING)
break;
extraWaits++;
}
@@ -1779,6 +1782,10 @@ LWLockUpdateVar(LWLock *lock, uint64 *valptr, uint64 val)
proclist_delete(&lock->waiters, iter.cur, lwWaitLink);
proclist_push_tail(&wakeup, iter.cur, lwWaitLink);
+
+ /* see LWLockWakeup() */
+ Assert(waiter->lwWaiting == LW_WS_WAITING);
+ waiter->lwWaiting = LW_WS_PENDING_WAKEUP;
}
/* We are done updating shared state of the lock itself. */
@@ -1794,7 +1801,7 @@ LWLockUpdateVar(LWLock *lock, uint64 *valptr, uint64 val)
proclist_delete(&wakeup, iter.cur, lwWaitLink);
/* check comment in LWLockWakeup() about this barrier */
pg_write_barrier();
- waiter->lwWaiting = false;
+ waiter->lwWaiting = LW_WS_NOT_WAITING;
PGSemaphoreUnlock(waiter->sem);
}
}
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 1e06d6580c..f62e622a7b 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -400,7 +400,7 @@ InitProcess(void)
/* NB -- autovac launcher intentionally does not set IS_AUTOVACUUM */
if (IsAutoVacuumWorkerProcess())
MyProc->statusFlags |= PROC_IS_AUTOVACUUM;
- MyProc->lwWaiting = false;
+ MyProc->lwWaiting = LW_WS_NOT_WAITING;
MyProc->lwWaitMode = 0;
MyProc->waitLock = NULL;
MyProc->waitProcLock = NULL;
@@ -583,7 +583,7 @@ InitAuxiliaryProcess(void)
MyProc->delayChkpt = false;
MyProc->delayChkptEnd = false;
MyProc->statusFlags = 0;
- MyProc->lwWaiting = false;
+ MyProc->lwWaiting = LW_WS_NOT_WAITING;
MyProc->lwWaitMode = 0;
MyProc->waitLock = NULL;
MyProc->waitProcLock = NULL;
--
2.43.0
Attachment: 0001-lwlock-Fix-quadratic-behavior-with-very-long-wai.v13.patch (text/x-diff)
From 2cecd5135e294fbb81ff726ab2f46baf2a73d69c Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sun, 20 Nov 2022 11:56:32 -0800
Subject: [PATCH] lwlock: Fix quadratic behavior with very long wait lists
Until now LWLockDequeueSelf() sequentially searched the list of waiters to see
if the current proc is still on the list of waiters, or has already been
removed. In extreme workloads, where the wait lists are very long, this leads
to a quadratic behavior. #backends iterating over a list #backends
long. Additionally, the likelihood of needing to call LWLockDequeueSelf() in
the first place also increases with the increased length of the wait queue, as
it becomes more likely that a lock is released while waiting for the wait list
lock, which is held for longer during lock release.
Due to the exponential back-off in perform_spin_delay() this is surprisingly
hard to detect. We should make that easier, e.g. by adding a wait event around
the pg_usleep() - but that's a separate patch.
The fix is simple - track whether a proc is currently waiting in the wait list
or already removed but waiting to be woken up in PGPROC->lwWaiting.
In some workloads with a lot of clients contending for a small number of
lwlocks (e.g. WALWriteLock), the fix can substantially increase throughput.
As the quadratic behavior arguably is a bug, we might want to decide to
backpatch this fix in the future.
Author: Andres Freund <andres@anarazel.de>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Discussion: https://postgr.es/m/20221027165914.2hofzp4cvutj6gin@awork3.anarazel.de
Discussion: https://postgr.es/m/CALj2ACXktNbG=K8Xi7PSqbofTZozavhaxjatVc14iYaLu4Maag@mail.gmail.com
---
src/include/storage/lwlock.h | 8 ++++
src/include/storage/proc.h | 2 +-
src/backend/access/transam/twophase.c | 2 +-
src/backend/storage/lmgr/lwlock.c | 53 +++++++++++++++------------
src/backend/storage/lmgr/proc.c | 4 +-
5 files changed, 42 insertions(+), 27 deletions(-)
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index cdbfbed118..a1197a8007 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -24,6 +24,14 @@
struct PGPROC;
+/* what state of the wait process is a backend in */
+typedef enum LWLockWaitState
+{
+ LW_WS_NOT_WAITING, /* not currently waiting / woken up */
+ LW_WS_WAITING, /* currently waiting */
+ LW_WS_PENDING_WAKEUP /* removed from waitlist, but not yet signalled */
+} LWLockWaitState;
+
/*
* Code outside of lwlock.c should not manipulate the contents of this
* structure directly, but we have to declare it here to allow LWLocks to be
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 7c85b5645b..ffb185d3df 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -140,7 +140,7 @@ struct PGPROC
bool recoveryConflictPending;
/* Info about LWLock the process is currently waiting for, if any. */
- bool lwWaiting; /* true if waiting for an LW lock */
+ uint8 lwWaiting; /* see LWLockWaitState */
uint8 lwWaitMode; /* lwlock mode being waited for */
proclist_node lwWaitLink; /* position in LW lock wait list */
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index d375b1012b..f8272c2f0d 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -484,7 +484,7 @@ MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
proc->roleId = owner;
proc->tempNamespaceId = InvalidOid;
proc->isBackgroundWorker = false;
- proc->lwWaiting = false;
+ proc->lwWaiting = LW_WS_NOT_WAITING;
proc->lwWaitMode = 0;
proc->waitLock = NULL;
proc->waitProcLock = NULL;
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index db89137c03..49423ea4d2 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -998,6 +998,15 @@ LWLockWakeup(LWLock *lock)
wokeup_somebody = true;
}
+ /*
+ * Signal that the process isn't on the wait list anymore. This allows
+ * LWLockDequeueSelf() to remove itself of the waitlist with a
+ * proclist_delete(), rather than having to check if it has been
+ * removed from the list.
+ */
+ Assert(waiter->lwWaiting == LW_WS_WAITING);
+ waiter->lwWaiting = LW_WS_PENDING_WAKEUP;
+
/*
* Once we've woken up an exclusive lock, there's no point in waking
* up anybody else.
@@ -1055,7 +1064,7 @@ LWLockWakeup(LWLock *lock)
* another lock.
*/
pg_write_barrier();
- waiter->lwWaiting = false;
+ waiter->lwWaiting = LW_WS_NOT_WAITING;
PGSemaphoreUnlock(waiter->sem);
}
}
@@ -1076,7 +1085,7 @@ LWLockQueueSelf(LWLock *lock, LWLockMode mode)
if (MyProc == NULL)
elog(PANIC, "cannot wait without a PGPROC structure");
- if (MyProc->lwWaiting)
+ if (MyProc->lwWaiting != LW_WS_NOT_WAITING)
elog(PANIC, "queueing for lock while waiting on another one");
LWLockWaitListLock(lock);
@@ -1084,7 +1093,7 @@ LWLockQueueSelf(LWLock *lock, LWLockMode mode)
/* setting the flag is protected by the spinlock */
pg_atomic_fetch_or_u32(&lock->state, LW_FLAG_HAS_WAITERS);
- MyProc->lwWaiting = true;
+ MyProc->lwWaiting = LW_WS_WAITING;
MyProc->lwWaitMode = mode;
/* LW_WAIT_UNTIL_FREE waiters are always at the front of the queue */
@@ -1112,8 +1121,7 @@ LWLockQueueSelf(LWLock *lock, LWLockMode mode)
static void
LWLockDequeueSelf(LWLock *lock)
{
- bool found = false;
- proclist_mutable_iter iter;
+ bool on_waitlist;
#ifdef LWLOCK_STATS
lwlock_stats *lwstats;
@@ -1126,18 +1134,13 @@ LWLockDequeueSelf(LWLock *lock)
LWLockWaitListLock(lock);
/*
- * Can't just remove ourselves from the list, but we need to iterate over
- * all entries as somebody else could have dequeued us.
+ * Remove ourselves from the waitlist, unless we've already been
+ * removed. The removal happens with the wait list lock held, so there's
+ * no race in this check.
*/
- proclist_foreach_modify(iter, &lock->waiters, lwWaitLink)
- {
- if (iter.cur == MyProc->pgprocno)
- {
- found = true;
- proclist_delete(&lock->waiters, iter.cur, lwWaitLink);
- break;
- }
- }
+ on_waitlist = MyProc->lwWaiting == LW_WS_WAITING;
+ if (on_waitlist)
+ proclist_delete(&lock->waiters, MyProc->pgprocno, lwWaitLink);
if (proclist_is_empty(&lock->waiters) &&
(pg_atomic_read_u32(&lock->state) & LW_FLAG_HAS_WAITERS) != 0)
@@ -1149,8 +1152,8 @@ LWLockDequeueSelf(LWLock *lock)
LWLockWaitListUnlock(lock);
/* clear waiting state again, nice for debugging */
- if (found)
- MyProc->lwWaiting = false;
+ if (on_waitlist)
+ MyProc->lwWaiting = LW_WS_NOT_WAITING;
else
{
int extraWaits = 0;
@@ -1174,7 +1177,7 @@ LWLockDequeueSelf(LWLock *lock)
for (;;)
{
PGSemaphoreLock(MyProc->sem);
- if (!MyProc->lwWaiting)
+ if (MyProc->lwWaiting == LW_WS_NOT_WAITING)
break;
extraWaits++;
}
@@ -1325,7 +1328,7 @@ LWLockAcquire(LWLock *lock, LWLockMode mode)
for (;;)
{
PGSemaphoreLock(proc->sem);
- if (!proc->lwWaiting)
+ if (proc->lwWaiting == LW_WS_NOT_WAITING)
break;
extraWaits++;
}
@@ -1490,7 +1493,7 @@ LWLockAcquireOrWait(LWLock *lock, LWLockMode mode)
for (;;)
{
PGSemaphoreLock(proc->sem);
- if (!proc->lwWaiting)
+ if (proc->lwWaiting == LW_WS_NOT_WAITING)
break;
extraWaits++;
}
@@ -1706,7 +1709,7 @@ LWLockWaitForVar(LWLock *lock, uint64 *valptr, uint64 oldval, uint64 *newval)
for (;;)
{
PGSemaphoreLock(proc->sem);
- if (!proc->lwWaiting)
+ if (proc->lwWaiting == LW_WS_NOT_WAITING)
break;
extraWaits++;
}
@@ -1787,6 +1790,10 @@ LWLockUpdateVar(LWLock *lock, uint64 *valptr, uint64 val)
proclist_delete(&lock->waiters, iter.cur, lwWaitLink);
proclist_push_tail(&wakeup, iter.cur, lwWaitLink);
+
+ /* see LWLockWakeup() */
+ Assert(waiter->lwWaiting == LW_WS_WAITING);
+ waiter->lwWaiting = LW_WS_PENDING_WAKEUP;
}
/* We are done updating shared state of the lock itself. */
@@ -1802,7 +1809,7 @@ LWLockUpdateVar(LWLock *lock, uint64 *valptr, uint64 val)
proclist_delete(&wakeup, iter.cur, lwWaitLink);
/* check comment in LWLockWakeup() about this barrier */
pg_write_barrier();
- waiter->lwWaiting = false;
+ waiter->lwWaiting = LW_WS_NOT_WAITING;
PGSemaphoreUnlock(waiter->sem);
}
}
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 199c0eae07..81a2025eb5 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -402,7 +402,7 @@ InitProcess(void)
/* NB -- autovac launcher intentionally does not set IS_AUTOVACUUM */
if (IsAutoVacuumWorkerProcess())
MyPgXact->vacuumFlags |= PROC_IS_AUTOVACUUM;
- MyProc->lwWaiting = false;
+ MyProc->lwWaiting = LW_WS_NOT_WAITING;
MyProc->lwWaitMode = 0;
MyProc->waitLock = NULL;
MyProc->waitProcLock = NULL;
@@ -582,7 +582,7 @@ InitAuxiliaryProcess(void)
MyProc->delayChkpt = false;
MyProc->delayChkptEnd = false;
MyPgXact->vacuumFlags = 0;
- MyProc->lwWaiting = false;
+ MyProc->lwWaiting = LW_WS_NOT_WAITING;
MyProc->lwWaitMode = 0;
MyProc->waitLock = NULL;
MyProc->waitProcLock = NULL;
--
2.43.0
Attachment: 0001-lwlock-Fix-quadratic-behavior-with-very-long-wai.v12.patch (text/x-diff)
From 52b403443ac4a9986af6d95d175606332bbc538e Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sun, 20 Nov 2022 11:56:32 -0800
Subject: [PATCH] lwlock: Fix quadratic behavior with very long wait lists
Until now LWLockDequeueSelf() sequentially searched the list of waiters to see
if the current proc is still on the list of waiters, or has already been
removed. In extreme workloads, where the wait lists are very long, this leads
to a quadratic behavior. #backends iterating over a list #backends
long. Additionally, the likelihood of needing to call LWLockDequeueSelf() in
the first place also increases with the increased length of the wait queue, as
it becomes more likely that a lock is released while waiting for the wait list
lock, which is held for longer during lock release.
Due to the exponential back-off in perform_spin_delay() this is surprisingly
hard to detect. We should make that easier, e.g. by adding a wait event around
the pg_usleep() - but that's a separate patch.
The fix is simple - track whether a proc is currently waiting in the wait list
or already removed but waiting to be woken up in PGPROC->lwWaiting.
In some workloads with a lot of clients contending for a small number of
lwlocks (e.g. WALWriteLock), the fix can substantially increase throughput.
As the quadratic behavior arguably is a bug, we might want to decide to
backpatch this fix in the future.
Author: Andres Freund <andres@anarazel.de>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Discussion: https://postgr.es/m/20221027165914.2hofzp4cvutj6gin@awork3.anarazel.de
Discussion: https://postgr.es/m/CALj2ACXktNbG=K8Xi7PSqbofTZozavhaxjatVc14iYaLu4Maag@mail.gmail.com
Backpatch-through: 12
---
src/include/storage/lwlock.h | 8 ++++
src/include/storage/proc.h | 2 +-
src/backend/access/transam/twophase.c | 2 +-
src/backend/storage/lmgr/lwlock.c | 53 +++++++++++++++------------
src/backend/storage/lmgr/proc.c | 4 +-
5 files changed, 42 insertions(+), 27 deletions(-)
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 87b1ce083d..a30a7d08e9 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -24,6 +24,14 @@
struct PGPROC;
+/* what state of the wait process is a backend in */
+typedef enum LWLockWaitState
+{
+ LW_WS_NOT_WAITING, /* not currently waiting / woken up */
+ LW_WS_WAITING, /* currently waiting */
+ LW_WS_PENDING_WAKEUP /* removed from waitlist, but not yet signalled */
+} LWLockWaitState;
+
/*
* Code outside of lwlock.c should not manipulate the contents of this
* structure directly, but we have to declare it here to allow LWLocks to be
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 7024df206d..5edb22f49d 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -134,7 +134,7 @@ struct PGPROC
bool recoveryConflictPending;
/* Info about LWLock the process is currently waiting for, if any. */
- bool lwWaiting; /* true if waiting for an LW lock */
+ uint8 lwWaiting; /* see LWLockWaitState */
uint8 lwWaitMode; /* lwlock mode being waited for */
proclist_node lwWaitLink; /* position in LW lock wait list */
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index e9ab612717..71b54e0292 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -485,7 +485,7 @@ MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
proc->roleId = owner;
proc->tempNamespaceId = InvalidOid;
proc->isBackgroundWorker = false;
- proc->lwWaiting = false;
+ proc->lwWaiting = LW_WS_NOT_WAITING;
proc->lwWaitMode = 0;
proc->waitLock = NULL;
proc->waitProcLock = NULL;
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 6b6e920704..933831edf5 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -913,6 +913,15 @@ LWLockWakeup(LWLock *lock)
wokeup_somebody = true;
}
+ /*
+ * Signal that the process isn't on the wait list anymore. This allows
+ * LWLockDequeueSelf() to remove itself of the waitlist with a
+ * proclist_delete(), rather than having to check if it has been
+ * removed from the list.
+ */
+ Assert(waiter->lwWaiting == LW_WS_WAITING);
+ waiter->lwWaiting = LW_WS_PENDING_WAKEUP;
+
/*
* Once we've woken up an exclusive lock, there's no point in waking
* up anybody else.
@@ -970,7 +979,7 @@ LWLockWakeup(LWLock *lock)
* another lock.
*/
pg_write_barrier();
- waiter->lwWaiting = false;
+ waiter->lwWaiting = LW_WS_NOT_WAITING;
PGSemaphoreUnlock(waiter->sem);
}
}
@@ -991,7 +1000,7 @@ LWLockQueueSelf(LWLock *lock, LWLockMode mode)
if (MyProc == NULL)
elog(PANIC, "cannot wait without a PGPROC structure");
- if (MyProc->lwWaiting)
+ if (MyProc->lwWaiting != LW_WS_NOT_WAITING)
elog(PANIC, "queueing for lock while waiting on another one");
LWLockWaitListLock(lock);
@@ -999,7 +1008,7 @@ LWLockQueueSelf(LWLock *lock, LWLockMode mode)
/* setting the flag is protected by the spinlock */
pg_atomic_fetch_or_u32(&lock->state, LW_FLAG_HAS_WAITERS);
- MyProc->lwWaiting = true;
+ MyProc->lwWaiting = LW_WS_WAITING;
MyProc->lwWaitMode = mode;
/* LW_WAIT_UNTIL_FREE waiters are always at the front of the queue */
@@ -1027,8 +1036,7 @@ LWLockQueueSelf(LWLock *lock, LWLockMode mode)
static void
LWLockDequeueSelf(LWLock *lock)
{
- bool found = false;
- proclist_mutable_iter iter;
+ bool on_waitlist;
#ifdef LWLOCK_STATS
lwlock_stats *lwstats;
@@ -1041,18 +1049,13 @@ LWLockDequeueSelf(LWLock *lock)
LWLockWaitListLock(lock);
/*
- * Can't just remove ourselves from the list, but we need to iterate over
- * all entries as somebody else could have dequeued us.
+ * Remove ourselves from the waitlist, unless we've already been
+ * removed. The removal happens with the wait list lock held, so there's
+ * no race in this check.
*/
- proclist_foreach_modify(iter, &lock->waiters, lwWaitLink)
- {
- if (iter.cur == MyProc->pgprocno)
- {
- found = true;
- proclist_delete(&lock->waiters, iter.cur, lwWaitLink);
- break;
- }
- }
+ on_waitlist = MyProc->lwWaiting == LW_WS_WAITING;
+ if (on_waitlist)
+ proclist_delete(&lock->waiters, MyProc->pgprocno, lwWaitLink);
if (proclist_is_empty(&lock->waiters) &&
(pg_atomic_read_u32(&lock->state) & LW_FLAG_HAS_WAITERS) != 0)
@@ -1064,8 +1067,8 @@ LWLockDequeueSelf(LWLock *lock)
LWLockWaitListUnlock(lock);
/* clear waiting state again, nice for debugging */
- if (found)
- MyProc->lwWaiting = false;
+ if (on_waitlist)
+ MyProc->lwWaiting = LW_WS_NOT_WAITING;
else
{
int extraWaits = 0;
@@ -1089,7 +1092,7 @@ LWLockDequeueSelf(LWLock *lock)
for (;;)
{
PGSemaphoreLock(MyProc->sem);
- if (!MyProc->lwWaiting)
+ if (MyProc->lwWaiting == LW_WS_NOT_WAITING)
break;
extraWaits++;
}
@@ -1239,7 +1242,7 @@ LWLockAcquire(LWLock *lock, LWLockMode mode)
for (;;)
{
PGSemaphoreLock(proc->sem);
- if (!proc->lwWaiting)
+ if (proc->lwWaiting == LW_WS_NOT_WAITING)
break;
extraWaits++;
}
@@ -1399,7 +1402,7 @@ LWLockAcquireOrWait(LWLock *lock, LWLockMode mode)
for (;;)
{
PGSemaphoreLock(proc->sem);
- if (!proc->lwWaiting)
+ if (proc->lwWaiting == LW_WS_NOT_WAITING)
break;
extraWaits++;
}
@@ -1611,7 +1614,7 @@ LWLockWaitForVar(LWLock *lock, uint64 *valptr, uint64 oldval, uint64 *newval)
for (;;)
{
PGSemaphoreLock(proc->sem);
- if (!proc->lwWaiting)
+ if (proc->lwWaiting == LW_WS_NOT_WAITING)
break;
extraWaits++;
}
@@ -1690,6 +1693,10 @@ LWLockUpdateVar(LWLock *lock, uint64 *valptr, uint64 val)
proclist_delete(&lock->waiters, iter.cur, lwWaitLink);
proclist_push_tail(&wakeup, iter.cur, lwWaitLink);
+
+ /* see LWLockWakeup() */
+ Assert(waiter->lwWaiting == LW_WS_WAITING);
+ waiter->lwWaiting = LW_WS_PENDING_WAKEUP;
}
/* We are done updating shared state of the lock itself. */
@@ -1705,7 +1712,7 @@ LWLockUpdateVar(LWLock *lock, uint64 *valptr, uint64 val)
proclist_delete(&wakeup, iter.cur, lwWaitLink);
/* check comment in LWLockWakeup() about this barrier */
pg_write_barrier();
- waiter->lwWaiting = false;
+ waiter->lwWaiting = LW_WS_NOT_WAITING;
PGSemaphoreUnlock(waiter->sem);
}
}
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 80a8b48c3c..83b0610dce 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -403,7 +403,7 @@ InitProcess(void)
/* NB -- autovac launcher intentionally does not set IS_AUTOVACUUM */
if (IsAutoVacuumWorkerProcess())
MyPgXact->vacuumFlags |= PROC_IS_AUTOVACUUM;
- MyProc->lwWaiting = false;
+ MyProc->lwWaiting = LW_WS_NOT_WAITING;
MyProc->lwWaitMode = 0;
MyProc->waitLock = NULL;
MyProc->waitProcLock = NULL;
@@ -583,7 +583,7 @@ InitAuxiliaryProcess(void)
MyPgXact->delayChkpt = false;
MyProc->delayChkptEnd = false;
MyPgXact->vacuumFlags = 0;
- MyProc->lwWaiting = false;
+ MyProc->lwWaiting = LW_WS_NOT_WAITING;
MyProc->lwWaitMode = 0;
MyProc->waitLock = NULL;
MyProc->waitProcLock = NULL;
--
2.43.0
On 1/16/24 1:11 AM, Michael Paquier wrote:
On Thu, Jan 11, 2024 at 09:47:33AM -0500, Jonathan S. Katz wrote:
I have similar data sources to Nathan/Michael and I'm trying to avoid piling
on, but one case that's interesting occurred after a major version upgrade
from PG10 to PG14 on a database supporting a very active/highly concurrent
workload. On inspection, it seems like backpatching would help this
particular case.
With 10/11 EOL, I do wonder if we'll see more of these reports on upgrades to
< PG16.
(I was in favor of backpatching prior; opinion is unchanged.)
Hearing nothing, I have prepared a set of patches for v12~v15,
checking all the lwlock paths for all the branches. At the end, the
set of changes looks rather sane to me regarding the queue handling.
I have also run some numbers on all the branches, and the test case
posted upthread falls off dramatically after 512 concurrent
connections at the top of all the stable branches :(
For example, on REL_12_STABLE with and without the patch attached:
num v12 v12+patch
1 29717.151665 29096.707588
2 63257.709301 61889.476318
4 127921.873393 124575.901330
8 231400.571662 230562.725174
16 343911.185351 312432.897015
32 291748.985280 281011.787701
64 268998.728648 269975.605115
128 297332.597018 286449.176950
256 243902.817657 240559.122309
512 190069.602270 194510.718508
768 58915.650225 165714.707198
1024 39920.950552 149433.836901
2048 16922.391688 108164.301054
4096 6229.063321 69032.338708
I'd like to apply that, just let me know if you have any comments
and/or objections.
Wow. All I can say is that my opinion remains unchanged on going forward
with backpatching.
Looking at the code, I understand an argument for not backpatching given
we modify the struct, but this does seem low-risk/high-reward and should
help PostgreSQL run better on these higher-throughput workloads.
Thanks,
Jonathan
On Tue, Jan 16, 2024 at 11:24:49PM -0500, Jonathan S. Katz wrote:
On 1/16/24 1:11 AM, Michael Paquier wrote:
I'd like to apply that, just let me know if you have any comments
and/or objections.
Looking at the code, I understand an argument for not backpatching given we
modify the struct, but this does seem low-risk/high-reward and should help
PostgreSQL run better on these higher-throughput workloads.
Just to be clear here: I repeated the tests on all the stable
branches yesterday, and the TPS falls off drastically around 256
concurrent sessions for all of them, with patterns similar to what I've
posted for 12; the patch gets back a lot of performance for the cases with
more than 1k connections.
--
Michael
On Tue, Jan 16, 2024 at 03:11:48PM +0900, Michael Paquier wrote:
I'd like to apply that, just let me know if you have any comments
and/or objections.
And done on 12~15.
While on it, I have also looked at external source code references on github
and debian that involve lwWaiting; all of them only check whether a process
is waiting or not, so the zero-valued LW_WS_NOT_WAITING keeps those checks
equivalent.
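As a minimal illustration of why those external checks keep working, here is a
simplified, self-contained sketch (a hypothetical cut-down PGPROC and a made-up
proc_is_waiting() helper, not code from any of those repositories): a
boolean-style test on lwWaiting behaves the same because LW_WS_NOT_WAITING is
the enum's zero value.

#include <stdbool.h>
#include <stdio.h>

/* Simplified, hypothetical stand-ins for the PostgreSQL definitions. */
typedef enum LWLockWaitState
{
    LW_WS_NOT_WAITING,       /* 0: not currently waiting / woken up */
    LW_WS_WAITING,           /* 1: currently waiting */
    LW_WS_PENDING_WAKEUP     /* 2: removed from waitlist, not yet signalled */
} LWLockWaitState;

typedef struct PGPROC
{
    unsigned char lwWaiting; /* was a bool, now stores a LWLockWaitState */
} PGPROC;

/*
 * External code that treated lwWaiting as a bool keeps the same behaviour:
 * "not waiting" is the only state that compares equal to zero/false.
 */
static bool
proc_is_waiting(const PGPROC *proc)
{
    return proc->lwWaiting != LW_WS_NOT_WAITING;
}

int
main(void)
{
    PGPROC      proc = {.lwWaiting = LW_WS_PENDING_WAKEUP};

    printf("waiting: %d\n", proc_is_waiting(&proc));   /* prints 1 */
    proc.lwWaiting = LW_WS_NOT_WAITING;
    printf("waiting: %d\n", proc_is_waiting(&proc));   /* prints 0 */
    return 0;
}

The only property such callers depend on is that LW_WS_NOT_WAITING stays the
first (zero) member of the enum, which matches how the patch defines it.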
--
Michael
On Thu, Jan 18, 2024 at 7:17 AM Michael Paquier <michael@paquier.xyz> wrote:
On Tue, Jan 16, 2024 at 03:11:48PM +0900, Michael Paquier wrote:
I'd like to apply that, just let me know if you have any comments
and/or objections.
And done on 12~15.
Hi Michael, just to reassure you that it is a good thing. We have a
customer who reported much better performance on 16.x than on 13~15 in
very heavy duty LWLock/lockmanager scenarios (ofc, before that was
committed/released), so I gave it a try here today to see how much can
be attributed to that single commit.
Given:
# $s=10, $p=10,100, DURATION=10s, m=prepared,simple, no reruns, just a
# single $DURATION run to save time
pgbench -i -s $s --partitions $p $DBNAME
ALTER TABLE pgbench_accounts ADD COLUMN aid_parent INT;
UPDATE pgbench_accounts SET aid_parent = aid;
CREATE INDEX ON pgbench_accounts(aid_parent);
pgbench -n -M $m -T $DURATION -c $c -j $c -f join.sql $DBNAME
join.sql was:
\set aid random(1, 100000 * :scale)
select * from pgbench_accounts pa join pgbench_branches pb on pa.bid =
pb.bid where pa.aid_parent = :aid;
See attached results. The benefits are observable (at least when active
working sessions >= vCPUs [threads, not cores]) and give up to a ~2.65x
boost in certain cases, at least for this test case. Hopefully others
will find it useful.
-J.
On Fri, Jan 19, 2024 at 01:49:59PM +0100, Jakub Wartak wrote:
Hi Michael, just to reassure you that it is a good thing. We have a
customer who reported much better performance on 16.x than on 13~15 in
very heavy duty LWLock/lockmanager scenarios (ofc, before that was
committed/released), so I gave it a try here today to see how much can
be attributed to that single commit.
Ahh. Thanks for the feedback.
--
Michael