heavily contended lwlocks with long wait queues scale badly

Started by Andres Freund · over 3 years ago · 31 messages · pgsql-hackers
#1 Andres Freund
andres@anarazel.de

Hi,

I am working on posting a patch series making relation extension more
scalable. As part of that I was running some benchmarks for workloads that I
thought should not be affected, or only be positively impacted - but I was
wrong: there was
some very significant degradation at very high client counts. After pulling my
hair out for quite a while to try to understand that behaviour, I figured out
that it's just a side-effect of *removing* some other contention. This
morning, turns out sleeping helps, I managed to reproduce it in an unmodified
postgres.

$ cat ~/tmp/txid.sql
SELECT txid_current();
$ for c in 1 2 4 8 16 32 64 128 256 512 768 1024 2048 4096; do echo -n "$c ";pgbench -n -M prepared -f ~/tmp/txid.sql -c$c -j$c -T5 2>&1|grep '^tps'|awk '{print $3}';done
1 60174
2 116169
4 208119
8 373685
16 515247
32 554726
64 497508
128 415097
256 334923
512 243679
768 192959
1024 157734
2048 82904
4096 32007

(I didn't properly round TPS, but that doesn't matter here)

Performance completely falls off a cliff starting at ~256 clients. There's
actually plenty CPU available here, so this isn't a case of running out of
CPU time.

Rather, the problem is very bad contention on the "spinlock" for the lwlock
wait list. I realized that something in that direction was off when trying to
investigate why I was seeing spin delays of substantial duration (>100ms).

The problem isn't a fundamental issue with lwlocks, it's that
LWLockDequeueSelf() does this:

LWLockWaitListLock(lock);

/*
 * Can't just remove ourselves from the list, but we need to iterate over
 * all entries as somebody else could have dequeued us.
 */
proclist_foreach_modify(iter, &lock->waiters, lwWaitLink)
{
    if (iter.cur == MyProc->pgprocno)
    {
        found = true;
        proclist_delete(&lock->waiters, iter.cur, lwWaitLink);
        break;
    }
}

I.e. it iterates over the whole waitlist to "find itself". The longer the
waitlist gets, the longer this takes. And the longer it takes for
LWLockWakeup() to actually wake up all waiters, the more likely it becomes
that LWLockDequeueSelf() needs to be called.

We can't make the trivial optimization and use proclist_contains(), because
PGPROC->lwWaitLink is also used for the list of processes to wake up in
LWLockWakeup().

But I think we can solve that fairly reasonably nonetheless. We can change
PGPROC->lwWaiting to not just be a boolean, but have three states:
0: not waiting
1: waiting in waitlist
2: waiting to be woken up

which we then can use in LWLockDequeueSelf() to only remove ourselves from the
list if we're on it. As removal from that list is protected by the wait list
lock, there's no race to worry about.

client patched HEAD
1 60109 60174
2 112694 116169
4 214287 208119
8 377459 373685
16 524132 515247
32 565772 554726
64 587716 497508
128 581297 415097
256 550296 334923
512 486207 243679
768 449673 192959
1024 410836 157734
2048 326224 82904
4096 250252 32007

Not perfect with the patch, but not awful either.

I suspect this issue might actually explain quite a few odd performance
behaviours we've seen at the larger end in the past. I think it has gotten a
bit worse with the conversion of lwlock.c to proclists (I see lots of
expensive multiplications to deal with sizeof(PGPROC)), but otherwise likely
exists at least as far back as ab5194e6f61, in 9.5.

I guess there's an argument for considering this a bug that we should
backpatch a fix for? But given the vintage, probably not? The only thing that
gives me pause is that this is quite hard to pinpoint as happening.

I've attached my quick-and-dirty patch. Obviously it'd need a few defines etc,
but I wanted to get this out to discuss before spending further time.

Greetings,

Andres Freund

Attachments:

fix-lwlock-queue.diff (text/x-diff; +14 -14)
#2 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Andres Freund (#1)
Re: heavily contended lwlocks with long wait queues scale badly

At Thu, 27 Oct 2022 09:59:14 -0700, Andres Freund <andres@anarazel.de> wrote in

But I think we can solve that fairly reasonably nonetheless. We can change
PGPROC->lwWaiting to not just be a boolean, but have three states:
0: not waiting
1: waiting in waitlist
2: waiting to be woken up

which we then can use in LWLockDequeueSelf() to only remove ourselves from the
list if we're on it. As removal from that list is protected by the wait list
lock, there's no race to worry about.

Since LWLockDequeueSelf() is defined to be called only in certain
restricted situations, there's no chance that the proc to remove is on
another lock's waiters list at the time the function is called. So it
seems to work well. It is simple and requires no additional memory or cycles...

No. It enlarges PGPROC by 8 bytes, but changing lwWaiting to int8/uint8
keeps the size as it is. (Rocky8/x86-64)

It just shaves off looping cycles. So +1 for what the patch does.

client patched HEAD
1 60109 60174
2 112694 116169
4 214287 208119
8 377459 373685
16 524132 515247
32 565772 554726
64 587716 497508
128 581297 415097
256 550296 334923
512 486207 243679
768 449673 192959
1024 410836 157734
2048 326224 82904
4096 250252 32007

Not perfect with the patch, but not awful either.

Fairly good? Agreed. The performance peak is improved by 6% and
shifted to a larger number of clients (32->128).

I suspect this issue might actually explain quite a few odd performance
behaviours we've seen at the larger end in the past. I think it has gotten a
bit worse with the conversion of lwlock.c to proclists (I see lots of
expensive multiplications to deal with sizeof(PGPROC)), but otherwise likely
exists at least as far back as ab5194e6f61, in 9.5.

I guess there's an argument for considering this a bug that we should
backpatch a fix for? But given the vintage, probably not? The only thing that
gives me pause is that this is quite hard to pinpoint as happening.

I don't think this is a bug, but it might be back-patchable, since it
doesn't change the memory footprint (if adjusted) and causes no
additional cost or interface breakage. And since this "bug" has a
positive-feedback nature, reducing the coefficient is more beneficial
than the direct cost of the change alone would suggest.

I've attached my quick-and-dirty patch. Obviously it'd need a few defines etc,
but I wanted to get this out to discuss before spending further time.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#3 Dilip Kumar
dilipbalaut@gmail.com
In reply to: Kyotaro Horiguchi (#2)
Re: heavily contended lwlocks with long wait queues scale badly

On Mon, Oct 31, 2022 at 11:03 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

At Thu, 27 Oct 2022 09:59:14 -0700, Andres Freund <andres@anarazel.de> wrote in

But I think we can solve that fairly reasonably nonetheless. We can change
PGPROC->lwWaiting to not just be a boolean, but have three states:
0: not waiting
1: waiting in waitlist
2: waiting to be woken up

which we then can use in LWLockDequeueSelf() to only remove ourselves from the
list if we're on it. As removal from that list is protected by the wait list
lock, there's no race to worry about.

This looks like a good idea.

No. It enlarges PGPROC by 8 bytes, but changing lwWaiting to int8/uint8
keeps the size as it is. (Rocky8/x86-64)

I agree

It just shaves off looping cycles. So +1 for what the patch does.

client patched HEAD
1 60109 60174
2 112694 116169
4 214287 208119
8 377459 373685
16 524132 515247
32 565772 554726
64 587716 497508
128 581297 415097
256 550296 334923
512 486207 243679
768 449673 192959
1024 410836 157734
2048 326224 82904
4096 250252 32007

Not perfect with the patch, but not awful either.

Fairly good? Agreed. The performance peak is improved by 6% and
shifted to larger number of clients (32->128).

The performance result is promising so +1

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#4 Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Andres Freund (#1)
Re: heavily contended lwlocks with long wait queues scale badly

On Thu, Oct 27, 2022 at 10:29 PM Andres Freund <andres@anarazel.de> wrote:

But I think we can solve that fairly reasonably nonetheless. We can change
PGPROC->lwWaiting to not just be a boolean, but have three states:
0: not waiting
1: waiting in waitlist
2: waiting to be woken up

which we then can use in LWLockDequeueSelf() to only remove ourselves from the
list if we're on it. As removal from that list is protected by the wait list
lock, there's no race to worry about.

client patched HEAD
1 60109 60174
2 112694 116169
4 214287 208119
8 377459 373685
16 524132 515247
32 565772 554726
64 587716 497508
128 581297 415097
256 550296 334923
512 486207 243679
768 449673 192959
1024 410836 157734
2048 326224 82904
4096 250252 32007

Not perfect with the patch, but not awful either.

Here are results from my testing with the setup noted in [1]. Results
look impressive with the patch at a higher number of clients; for
instance, on HEAD, TPS with 1024 clients is 103587 whereas it is 248702
with the patch.

HEAD, run 1:
1 34534
2 72088
4 135249
8 213045
16 243507
32 304108
64 375148
128 390658
256 345503
512 284510
768 146417
1024 103587
2048 34702
4096 12450

HEAD, run 2:
1 34110
2 72403
4 134421
8 211263
16 241606
32 295198
64 353580
128 385147
256 341672
512 295001
768 142341
1024 97721
2048 30229
4096 13179

PATCHED, run 1:
1 34412
2 71733
4 139141
8 211526
16 241692
32 308198
64 406198
128 385643
256 338464
512 295559
768 272639
1024 248702
2048 191402
4096 112074

PATCHED, run 2:
1 34087
2 73567
4 135624
8 211901
16 242819
32 310534
64 352663
128 381780
256 342483
512 301968
768 272596
1024 251014
2048 184939
4096 108186

I've attached my quick-and-dirty patch. Obviously it'd need a few defines etc,
but I wanted to get this out to discuss before spending further time.

Just for the record, here are some review comments posted in the other
thread: /messages/by-id/CALj2ACXktNbG=K8Xi7PSqbofTZozavhaxjatVc14iYaLu4Maag@mail.gmail.com

BTW, I've seen a sporadic crash (SEGV) with the patch in bg writer
with the same set up [1]. I'm not sure if it's really because of the
patch. I'm unable to reproduce it now and unfortunately I didn't
capture further details when it occurred.

[1]: ./configure --prefix=$PWD/inst/ --enable-tap-tests CFLAGS="-O3" >
install.log && make -j 8 install > install.log 2>&1 &
shared_buffers = 8GB
max_wal_size = 32GB
max_connections = 4096
checkpoint_timeout = 10min

ubuntu: cat << EOF >> txid.sql
SELECT txid_current();
EOF
ubuntu: for c in 1 2 4 8 16 32 64 128 256 512 768 1024 2048 4096; do
echo -n "$c ";./pgbench -n -M prepared -U ubuntu postgres -f txid.sql
-c$c -j$c -T5 2>&1|grep '^tps'|awk '{print $3}';done

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#5 Pavel Borisov
pashkin.elfe@gmail.com
In reply to: Bharath Rupireddy (#4)
Re: heavily contended lwlocks with long wait queues scale badly

I was working on optimizing the LWLock queue in a little different way
and I also did some benchmarking of Andres' original patch from this
thread [1]. The results are quite impressive, indeed. Please feel free
to see the results and join the discussion in [1] if you want.

Best regards,
Pavel

[1]: /messages/by-id/CALT9ZEEz+=Nepc5eti6x531q64Z6+DxtP3h-h_8O5HDdtkJcPw@mail.gmail.com

#6 Alexander Korotkov
aekorotkov@gmail.com
In reply to: Pavel Borisov (#5)
Re: heavily contended lwlocks with long wait queues scale badly

Hi Andres,

Thank you for your patch. The results are impressive.

On Mon, Oct 31, 2022 at 2:10 PM Pavel Borisov <pashkin.elfe@gmail.com> wrote:

I was working on optimizing the LWLock queue in a little different way
and I also did a benchmarking of Andres' original patch from this
thread. [1]
The results are quite impressive, indeed. Please feel free to see the
results and join the discussion in [1] if you want.

Best regards,
Pavel

[1] /messages/by-id/CALT9ZEEz+=Nepc5eti6x531q64Z6+DxtP3h-h_8O5HDdtkJcPw@mail.gmail.com

Pavel posted a patch implementing a lock-less queue for LWLock. The
results are interesting indeed, but slightly lower than what your
current patch achieves. Pavel's current patch probably doesn't utilize
the full potential of the lock-less idea. I wonder what you think about
this direction? We would be grateful for your guidance. Thank you.

------
Regards,
Alexander Korotkov

#7 Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#1)
Re: heavily contended lwlocks with long wait queues scale badly

On Thu, Oct 27, 2022 at 12:59 PM Andres Freund <andres@anarazel.de> wrote:

After pulling my
hair out for quite a while to try to understand that behaviour, I figured out
that it's just a side-effect of *removing* some other contention.

I've seen this kind of pattern on multiple occasions. I don't know if
they were all caused by this, or what, but I certainly like the idea
of making it better.

--
Robert Haas
EDB: http://www.enterprisedb.com

#8 Andres Freund
andres@anarazel.de
In reply to: Bharath Rupireddy (#4)
Re: heavily contended lwlocks with long wait queues scale badly

Hi,

On 2022-10-31 16:21:06 +0530, Bharath Rupireddy wrote:

BTW, I've seen a sporadic crash (SEGV) with the patch in bg writer
with the same set up [1], I'm not sure if it's really because of the
patch. I'm unable to reproduce it now and unfortunately I didn't
capture further details when it occurred.

That's likely because the prototype patch I submitted in this thread missed
updating LWLockUpdateVar().

Updated patch attached.

Greetings,

Andres Freund

Attachments:

v3-0001-lwlock-fix-quadratic-behaviour-with-very-long-wai.patch (text/x-diff; +39 -28)
#9 Zhihong Yu
zyu@yugabyte.com
In reply to: Andres Freund (#8)
Re: heavily contended lwlocks with long wait queues scale badly

On Mon, Oct 31, 2022 at 4:51 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2022-10-31 16:21:06 +0530, Bharath Rupireddy wrote:

BTW, I've seen a sporadic crash (SEGV) with the patch in bg writer
with the same set up [1], I'm not sure if it's really because of the
patch. I'm unable to reproduce it now and unfortunately I didn't
capture further details when it occurred.

That's likely because the prototype patch I submitted in this thread missed
updating LWLockUpdateVar().

Updated patch attached.

Greetings,

Andres Freund

Hi,
Minor comment:

+ uint8 lwWaiting; /* see LWLockWaitState */

Why not declare `lwWaiting` of type LWLockWaitState ?

Cheers

#10 Andres Freund
andres@anarazel.de
In reply to: Zhihong Yu (#9)
Re: heavily contended lwlocks with long wait queues scale badly

Hi,

On 2022-10-31 17:17:03 -0700, Zhihong Yu wrote:

On Mon, Oct 31, 2022 at 4:51 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2022-10-31 16:21:06 +0530, Bharath Rupireddy wrote:

BTW, I've seen a sporadic crash (SEGV) with the patch in bg writer
with the same set up [1], I'm not sure if it's really because of the
patch. I'm unable to reproduce it now and unfortunately I didn't
capture further details when it occurred.

That's likely because the prototype patch I submitted in this thread missed
updating LWLockUpdateVar().

Updated patch attached.

Greetings,

Andres Freund

Hi,
Minor comment:

+ uint8 lwWaiting; /* see LWLockWaitState */

Why not declare `lwWaiting` of type LWLockWaitState ?

Unfortunately C99 (*) doesn't allow specifying the width of an enum
field. With most compilers we'd end up using 4 bytes.

Greetings,

Andres Freund

(*) C++ has allowed specifying this for quite a few years now and I think C23
will support it too, but that doesn't help us at this point.

#11 Zhihong Yu
zyu@yugabyte.com
In reply to: Andres Freund (#10)
Re: heavily contended lwlocks with long wait queues scale badly

On Mon, Oct 31, 2022 at 5:19 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2022-10-31 17:17:03 -0700, Zhihong Yu wrote:

On Mon, Oct 31, 2022 at 4:51 PM Andres Freund <andres@anarazel.de>

wrote:

Hi,

On 2022-10-31 16:21:06 +0530, Bharath Rupireddy wrote:

BTW, I've seen a sporadic crash (SEGV) with the patch in bg writer
with the same set up [1], I'm not sure if it's really because of the
patch. I'm unable to reproduce it now and unfortunately I didn't
capture further details when it occurred.

That's likely because the prototype patch I submitted in this thread

missed

updating LWLockUpdateVar().

Updated patch attached.

Greetings,

Andres Freund

Hi,
Minor comment:

+ uint8 lwWaiting; /* see LWLockWaitState */

Why not declare `lwWaiting` of type LWLockWaitState ?

Unfortunately C99 (*) doesn't allow to specify the width of an enum
field. With most compilers we'd end up using 4 bytes.

Greetings,

Andres Freund

(*) C++ has allowed specifying this for quite a few years now and I think
C23
will support it too, but that doesn't help us at this point.

Hi,
Thanks for the response.

If possible, it would be better to put your explanation in the code comment
(so that other people know the reasoning).

#12 Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Andres Freund (#8)
Re: heavily contended lwlocks with long wait queues scale badly

On Tue, Nov 1, 2022 at 5:21 AM Andres Freund <andres@anarazel.de> wrote:

On 2022-10-31 16:21:06 +0530, Bharath Rupireddy wrote:

BTW, I've seen a sporadic crash (SEGV) with the patch in bg writer
with the same set up [1], I'm not sure if it's really because of the
patch. I'm unable to reproduce it now and unfortunately I didn't
capture further details when it occurred.

That's likely because the prototype patch I submitted in this thread missed
updating LWLockUpdateVar().

Updated patch attached.

Thanks. It looks good to me. However, some minor comments on the v3 patch:

1.
-    if (MyProc->lwWaiting)
+    if (MyProc->lwWaiting != LW_WS_NOT_WAITING)
         elog(PANIC, "queueing for lock while waiting on another one");

Can the above condition be MyProc->lwWaiting == LW_WS_WAITING ||
MyProc->lwWaiting == LW_WS_PENDING_WAKEUP for better readability?

Or add an assertion Assert(MyProc->lwWaiting != LW_WS_WAITING &&
MyProc->lwWaiting != LW_WS_PENDING_WAKEUP); before setting
LW_WS_WAITING?

2.
/* Awaken any waiters I removed from the queue. */
proclist_foreach_modify(iter, &wakeup, lwWaitLink)
{

@@ -1044,7 +1052,7 @@ LWLockWakeup(LWLock *lock)
          * another lock.
          */
         pg_write_barrier();
-        waiter->lwWaiting = false;
+        waiter->lwWaiting = LW_WS_NOT_WAITING;
         PGSemaphoreUnlock(waiter->sem);
     }

/*
* Awaken any waiters I removed from the queue.
*/
proclist_foreach_modify(iter, &wakeup, lwWaitLink)
{
PGPROC *waiter = GetPGProcByNumber(iter.cur);

proclist_delete(&wakeup, iter.cur, lwWaitLink);
/* check comment in LWLockWakeup() about this barrier */
pg_write_barrier();
waiter->lwWaiting = LW_WS_NOT_WAITING;

Can we add an assertion Assert(waiter->lwWaiting ==
LW_WS_PENDING_WAKEUP) in the above two places? We prepare the wakeup
list and set the LW_WS_NOT_WAITING flag in above loops, having an
assertion is better here IMO.

Below are test results with v3 patch. +1 for back-patching it.

HEAD PATCHED
1 34142 34289
2 72760 69720
4 136300 131848
8 210809 210192
16 240718 242744
32 297587 297354
64 341939 343036
128 383615 383801
256 342094 337680
512 263194 288629
768 145526 261553
1024 107267 241811
2048 35716 188389
4096 12415 120300

PG15 PATCHED
1 34503 34078
2 73708 72054
4 139415 133321
8 212396 211390
16 242227 242584
32 303441 309288
64 362680 339211
128 378645 344291
256 340016 344291
512 290044 293337
768 140277 264618
1024 96191 247636
2048 35158 181488
4096 12164 118610

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#13 Robert Haas
robertmhaas@gmail.com
In reply to: Bharath Rupireddy (#12)
Re: heavily contended lwlocks with long wait queues scale badly

On Tue, Nov 1, 2022 at 3:17 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

Below are test results with v3 patch. +1 for back-patching it.

The problem with back-patching stuff like this is that it can have
unanticipated consequences. I think that the chances of something like
this backfiring are less than for a patch that changes plans, but I
don't think that they're nil, either. It could turn out that this
patch, which has really promising results on the workloads we've
tested, harms some other workload due to some other contention pattern
we can't foresee. It could also turn out that improving performance at
the database level actually has negative consequences for some
application using the database, because the application could be
unknowingly relying on the database to throttle its activity.

It's hard for me to estimate exactly what the risk of a patch like
this is. I think that if we back-patched this, and only this, perhaps
the chances of something bad happening aren't incredibly high. But if
we get into the habit of back-patching seemingly-innocuous performance
improvements, it's only a matter of time before one of them turns out
not to be so innocuous as we thought. I would guess that the number of
times we have to back-patch something like this before somebody starts
complaining about a regression is likely to be somewhere between 3 and
5.

It's possible that I'm too pessimistic, though.

--
Robert Haas
EDB: http://www.enterprisedb.com

#14 Jonathan S. Katz
jkatz@postgresql.org
In reply to: Robert Haas (#13)
Re: heavily contended lwlocks with long wait queues scale badly

On 11/1/22 8:37 AM, Robert Haas wrote:

On Tue, Nov 1, 2022 at 3:17 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

Below are test results with v3 patch. +1 for back-patching it.

First, awesome find and proposed solution!

The problem with back-patching stuff like this is that it can have
unanticipated consequences. I think that the chances of something like
this backfiring are less than for a patch that changes plans, but I
don't think that they're nil, either. It could turn out that this
patch, which has really promising results on the workloads we've
tested, harms some other workload due to some other contention pattern
we can't foresee. It could also turn out that improving performance at
the database level actually has negative consequences for some
application using the database, because the application could be
unknowingly relying on the database to throttle its activity.

If someone is using the database to throttle activity for their app, I
have a bunch of follow up questions to understand why.

It's hard for me to estimate exactly what the risk of a patch like
this is. I think that if we back-patched this, and only this, perhaps
the chances of something bad happening aren't incredibly high. But if
we get into the habit of back-patching seemingly-innocuous performance
improvements, it's only a matter of time before one of them turns out
not to be so innocuous as we thought. I would guess that the number of
times we have to back-patch something like this before somebody starts
complaining about a regression is likely to be somewhere between 3 and
5.

Having the privilege of reading through the release notes for every
update release, I see on average 1-2 "performance improvements" in each
release. I believe they tend to be more negligible, though.

I do understand the concerns. Say you discover your workload does have a
regression with this patch and then there's a CVE that you want to
accept -- what do you do? Reading the thread / patch, it seems as if
this is a lower risk "performance fix", but still nonzero.

While this does affect all supported versions, we could also consider
backpatching only for PG15. That at least 1/ limits impact on users
running older versions (opting into a major version upgrade) and 2/
we're still very early in the major upgrade cycle for PG15 that it's
lower risk if there are issues.

Users are generally happy when they can perform a simple upgrade and get
a performance boost, particularly the set of users that this patch
affects most (high throughput, high connection count). This is the type
of fix that would make headlines in a major release announcement (10x
TPS improvement w/4096 connections?!). That is also part of the tradeoff
of backpatching this, is that we may lose some of the higher visibility
marketing opportunities to discuss this (though I'm sure there will be
plenty of blog posts, etc.)

Andres: when you suggested backpatching, were you thinking of the Nov
2022 release or the Feb 2023 release?

Thanks,

Jonathan

#15 Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#13)
Re: heavily contended lwlocks with long wait queues scale badly

Hi,

On 2022-11-01 08:37:39 -0400, Robert Haas wrote:

On Tue, Nov 1, 2022 at 3:17 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

Below are test results with v3 patch. +1 for back-patching it.

The problem with back-patching stuff like this is that it can have
unanticipated consequences. I think that the chances of something like
this backfiring are less than for a patch that changes plans, but I
don't think that they're nil, either. It could turn out that this
patch, which has really promising results on the workloads we've
tested, harms some other workload due to some other contention pattern
we can't foresee. It could also turn out that improving performance at
the database level actually has negative consequences for some
application using the database, because the application could be
unknowingly relying on the database to throttle its activity.

It's hard for me to estimate exactly what the risk of a patch like
this is. I think that if we back-patched this, and only this, perhaps
the chances of something bad happening aren't incredibly high. But if
we get into the habit of back-patching seemingly-innocuous performance
improvements, it's only a matter of time before one of them turns out
not to be so innocuous as we thought. I would guess that the number of
times we have to back-patch something like this before somebody starts
complaining about a regression is likely to be somewhere between 3 and
5.

In general I agree, we shouldn't default to backpatching performance
fixes. The reason I am even considering it in this case, is that it's a
readily reproducible issue, leading to a quadratic behaviour that's extremely
hard to pinpoint. There's no increase in CPU usage, no wait event for
spinlocks, the system doesn't even get stuck (because the wait list lock is
held after the lwlock lock release). I don't think users have a decent chance
at figuring out that this is the issue.

I'm not at all convinced we should backpatch either, just to be clear.

Greetings,

Andres Freund

#16 Andres Freund
andres@anarazel.de
In reply to: Jonathan S. Katz (#14)
Re: heavily contended lwlocks with long wait queues scale badly

Hi,

On 2022-11-01 11:19:02 -0400, Jonathan S. Katz wrote:

This is the type of fix that would make headlines in a major release
announcement (10x TPS improvement w/4096 connections?!). That is also part
of the tradeoff of backpatching this, is that we may lose some of the higher
visibility marketing opportunities to discuss this (though I'm sure there
will be plenty of blog posts, etc.)

(read the next paragraph with the caveat that results below prove it somewhat
wrong)

I don't think the fix is as big a deal as the above make it sound - you need
to do somewhat extreme things to hit the problem. Yes, it drastically improves
the scalability of e.g. doing SELECT txid_current() across as many sessions as
possible - but that's not something you normally do (it was a good candidate
to show the problem because it's a single lock but doesn't trigger WAL flushes
at commit).

You can probably hit the problem with many concurrent single-tx INSERTs, but
you'd need to have synchronous_commit=off or fsync=off (or a very expensive
server class SSD with battery backup) and the effect is likely smaller.

Andres: when you suggested backpatching, were you thinking of the Nov 2022
release or the Feb 2023 release?

I wasn't thinking that concretely. Even if we decide to backpatch, I'd be very
hesitant to do it in a few days.

<goes and runs test while in meeting>

I tested with browser etc running, so this is plenty noisy. I used the best of
the two pgbench -T21 -P5 tps, after ignoring the first two periods (they're
too noisy). I used an ok-ish NVMe SSD, rather than the expensive one that
has "free" fsync.

synchronous_commit=on:

clients master fix
16 6196 6202
64 25716 25545
256 90131 90240
1024 128556 151487
2048 59417 157050
4096 32252 178823

synchronous_commit=off:

clients master fix
16 409828 409016
64 454257 455804
256 304175 452160
1024 135081 334979
2048 66124 291582
4096 27019 245701

Hm. That's a bigger effect than I anticipated. I guess sc=off isn't actually
required, due to the level of concurrency making group commit very
effective.

This is without an index, serial column or anything. But a quick comparison
for just 4096 clients shows that to still be a big difference if I create an
serial primary key:
master: 26172
fix: 155813

Greetings,

Andres Freund

#17 Jonathan S. Katz
jkatz@postgresql.org
In reply to: Andres Freund (#16)
Re: heavily contended lwlocks with long wait queues scale badly

On 11/1/22 1:41 PM, Andres Freund wrote:

Andres: when you suggested backpatching, were you thinking of the Nov 2022
release or the Feb 2023 release?

I wasn't thinking that concretely. Even if we decide to backpatch, I'd be very
hesitant to do it in a few days.

Yeah this was my thinking (and also why I took a few days to reply given
the lack of urgency for this release). It would at least give some more
time for others to test it to feel confident that we're not introducing
noticeable regressions.

<goes and runs test while in meeting>

I tested with browser etc running, so this is plenty noisy. I used the best of
the two pgbench -T21 -P5 tps, after ignoring the first two periods (they're
too noisy). I used an ok-ish NVMe SSD, rather than the expensive one that
has "free" fsync.

synchronous_commit=on:

clients master fix
16 6196 6202
64 25716 25545
256 90131 90240
1024 128556 151487
2048 59417 157050
4096 32252 178823

synchronous_commit=off:

clients master fix
16 409828 409016
64 454257 455804
256 304175 452160
1024 135081 334979
2048 66124 291582
4096 27019 245701

Hm. That's a bigger effect than I anticipated. I guess sc=off isn't actually
required, due to the level of concurrency making group commit very
effective.

This is without an index, serial column or anything. But a quick comparison
for just 4096 clients shows that to still be a big difference if I create a
serial primary key:
master: 26172
fix: 155813

🤯 (seeing if my exploding head makes it into the archives).

Given the lack of ABI changes (hesitant to say low-risk until after more
testing, but seemingly low-risk), I can get behind backpatching esp if
we're targeting Feb 2023 so we can tests some more.

With my advocacy hat on, it bums me out that we may not get as much buzz
about this change given it's not in a major release, but 1/ it'll fix an
issue that will help users with high-concurrency and 2/ users would be
able to perform a simpler update to get the change.

Thanks,

Jonathan

#18Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Bharath Rupireddy (#12)
Re: heavily contended lwlocks with long wait queues scale badly

On Tue, Nov 1, 2022 at 12:46 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

Updated patch attached.

Thanks. It looks good to me. However, some minor comments on the v3 patch:

1.
-    if (MyProc->lwWaiting)
+    if (MyProc->lwWaiting != LW_WS_NOT_WAITING)
elog(PANIC, "queueing for lock while waiting on another one");

Can the above condition be MyProc->lwWaiting == LW_WS_WAITING ||
MyProc->lwWaiting == LW_WS_PENDING_WAKEUP for better readability?

Or add an assertion Assert(MyProc->lwWaiting != LW_WS_WAITING &&
MyProc->lwWaiting != LW_WS_PENDING_WAKEUP); before setting
LW_WS_WAITING?

2.
/* Awaken any waiters I removed from the queue. */
proclist_foreach_modify(iter, &wakeup, lwWaitLink)
{

@@ -1044,7 +1052,7 @@ LWLockWakeup(LWLock *lock)
* another lock.
*/
pg_write_barrier();
-        waiter->lwWaiting = false;
+        waiter->lwWaiting = LW_WS_NOT_WAITING;
PGSemaphoreUnlock(waiter->sem);
}

/*
* Awaken any waiters I removed from the queue.
*/
proclist_foreach_modify(iter, &wakeup, lwWaitLink)
{
PGPROC *waiter = GetPGProcByNumber(iter.cur);

proclist_delete(&wakeup, iter.cur, lwWaitLink);
/* check comment in LWLockWakeup() about this barrier */
pg_write_barrier();
waiter->lwWaiting = LW_WS_NOT_WAITING;

Can we add an assertion Assert(waiter->lwWaiting ==
LW_WS_PENDING_WAKEUP) in the above two places? We prepare the wakeup
list and set the LW_WS_NOT_WAITING flag in the above loops; having an
assertion is better here, IMO.

Below are test results with v3 patch. +1 for back-patching it.

Clients HEAD PATCHED
1 34142 34289
2 72760 69720
4 136300 131848
8 210809 210192
16 240718 242744
32 297587 297354
64 341939 343036
128 383615 383801
256 342094 337680
512 263194 288629
768 145526 261553
1024 107267 241811
2048 35716 188389
4096 12415 120300

Clients PG15 PATCHED
1 34503 34078
2 73708 72054
4 139415 133321
8 212396 211390
16 242227 242584
32 303441 309288
64 362680 339211
128 378645 344291
256 340016 344291
512 290044 293337
768 140277 264618
1024 96191 247636
2048 35158 181488
4096 12164 118610

I looked at the v3 patch again today and ran some performance tests.
The results look impressive as they were earlier. Andres, any plans to
get this in?

pgbench with SELECT txid_current();:
Clients HEAD PATCHED
1 34613 33611
2 72634 70546
4 137885 136911
8 216470 216076
16 242535 245392
32 299952 304740
64 329788 347401
128 378296 386873
256 344939 343832
512 292196 295839
768 144212 260102
1024 101525 250263
2048 35594 185878
4096 11842 104227

pgbench with insert into pgbench_accounts table:
Clients HEAD PATCHED
1 1660 1600
2 1848 1746
4 3547 3395
8 7330 6754
16 13103 13613
32 26011 26372
64 52331 52594
128 93313 95526
256 127373 126182
512 126712 127857
768 116765 119227
1024 111464 112499
2048 58838 92756
4096 26066 60543

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#19Andres Freund
andres@anarazel.de
In reply to: Bharath Rupireddy (#18)
Re: heavily contended lwlocks with long wait queues scale badly

Hi,

On 2022-11-09 15:54:16 +0530, Bharath Rupireddy wrote:

On Tue, Nov 1, 2022 at 12:46 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

Updated patch attached.

Thanks. It looks good to me. However, some minor comments on the v3 patch:

1.
-    if (MyProc->lwWaiting)
+    if (MyProc->lwWaiting != LW_WS_NOT_WAITING)
elog(PANIC, "queueing for lock while waiting on another one");

Can the above condition be MyProc->lwWaiting == LW_WS_WAITING ||
MyProc->lwWaiting == LW_WS_PENDING_WAKEUP for better readability?

Or add an assertion Assert(MyProc->lwWaiting != LW_WS_WAITING &&
MyProc->lwWaiting != LW_WS_PENDING_WAKEUP); before setting
LW_WS_WAITING?

I don't think that's a good idea - it'll just mean we have to modify more
places if we add another state, without making anything more robust.

2.
/* Awaken any waiters I removed from the queue. */
proclist_foreach_modify(iter, &wakeup, lwWaitLink)
{

@@ -1044,7 +1052,7 @@ LWLockWakeup(LWLock *lock)
* another lock.
*/
pg_write_barrier();
-        waiter->lwWaiting = false;
+        waiter->lwWaiting = LW_WS_NOT_WAITING;
PGSemaphoreUnlock(waiter->sem);
}

/*
* Awaken any waiters I removed from the queue.
*/
proclist_foreach_modify(iter, &wakeup, lwWaitLink)
{
PGPROC *waiter = GetPGProcByNumber(iter.cur);

proclist_delete(&wakeup, iter.cur, lwWaitLink);
/* check comment in LWLockWakeup() about this barrier */
pg_write_barrier();
waiter->lwWaiting = LW_WS_NOT_WAITING;

Can we add an assertion Assert(waiter->lwWaiting ==
LW_WS_PENDING_WAKEUP) in the above two places? We prepare the wakeup
list and set the LW_WS_NOT_WAITING flag in the above loops; having an
assertion is better here, IMO.

I guess it can't hurt - but it's not really related to the changes in the
patch, no?

I looked at the v3 patch again today and ran some performance tests.
The results look impressive as they were earlier. Andres, any plans to
get this in?

I definitely didn't want to backpatch before this point release. But it seems
we haven't quite got to an agreement what to do about backpatching. It's
probably best to just commit it to HEAD and let the backpatch discussion
happen concurrently.

I'm on a hike, without any connectivity, Thu afternoon - Sun. I think it's OK
to push it to HEAD if I get it done in the next few hours. Bigger issues,
which I do not expect, should show up before tomorrow afternoon. Smaller
things could wait till Sunday if necessary.

Greetings,

Andres Freund

#20Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#19)
Re: heavily contended lwlocks with long wait queues scale badly

On 2022-11-09 09:38:08 -0800, Andres Freund wrote:

I'm on a hike, without any connectivity, Thu afternoon - Sun. I think it's OK
to push it to HEAD if I get it done in the next few hours. Bigger issues,
which I do not expect, should show up before tomorrow afternoon. Smaller
things could wait till Sunday if necessary.

I didn't get to it in time, so I'll leave it for when I'm back.

#21Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#20)
#22Jonathan S. Katz
jkatz@postgresql.org
In reply to: Andres Freund (#21)
#23Nathan Bossart
nathandbossart@gmail.com
In reply to: Jonathan S. Katz (#22)
#24Michael Paquier
michael@paquier.xyz
In reply to: Nathan Bossart (#23)
#25Jonathan S. Katz
jkatz@postgresql.org
In reply to: Michael Paquier (#24)
#26Michael Paquier
michael@paquier.xyz
In reply to: Jonathan S. Katz (#25)
#27Jonathan S. Katz
jkatz@postgresql.org
In reply to: Michael Paquier (#26)
#28Michael Paquier
michael@paquier.xyz
In reply to: Jonathan S. Katz (#27)
#29Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#26)
#30Jakub Wartak
jakub.wartak@enterprisedb.com
In reply to: Michael Paquier (#29)
#31Michael Paquier
michael@paquier.xyz
In reply to: Jakub Wartak (#30)