Avoid LWLockWaitForVar() for currently held WAL insertion lock in WaitXLogInsertionsToFinish()
Hi,
While working on something else, I noticed that
WaitXLogInsertionsToFinish() goes the LWLockWaitForVar() route even
for a process that's holding a WAL insertion lock. Typically, a
process holding WAL insertion lock reaches
WaitXLogInsertionsToFinish() when it's in need of WAL buffer pages for
its insertion and waits for other older in-progress WAL insertions to
finish. This fact guarantees that the process holding a WAL insertion
lock will never have its insertingAt less than 'upto'.
With that said, here's a small improvement I can think of, that is, to
avoid calling LWLockWaitForVar() for the WAL insertion lock the caller
of WaitXLogInsertionsToFinish() currently holds. Since
LWLockWaitForVar() does a bunch of things - holds interrupts, does
atomic reads, acquires and releases wait list lock and so on, avoiding
it may be a good idea IMO.
I'm attaching a patch herewith. Here's the cirrus-ci tests -
https://github.com/BRupireddy/postgres/tree/avoid_LWLockWaitForVar_for_currently_held_wal_ins_lock_v1.
Thoughts?
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v1-0001-Avoid-LWLockWaitForVar-for-currently-held-WAL-ins.patch (application/octet-stream)
Hi,
On 2022-11-24 18:13:10 +0530, Bharath Rupireddy wrote:
With that said, here's a small improvement I can think of, that is, to
avoid calling LWLockWaitForVar() for the WAL insertion lock the caller
of WaitXLogInsertionsToFinish() currently holds. Since
LWLockWaitForVar() does a bunch of things - holds interrupts, does
atomic reads, acquires and releases wait list lock and so on, avoiding
it may be a good idea IMO.
That doesn't seem like a big win. We're still going to call LWLockWaitForVar()
for all the other locks.
I think we could improve this code more significantly by avoiding the call to
LWLockWaitForVar() for all locks that aren't acquired or don't have a
conflicting insertingAt, that'd require just a bit more work to handle systems
without tear-free 64bit writes/reads.
The easiest way would probably be to just make insertingAt a 64bit atomic,
that transparently does the required work to make even non-atomic read/writes
tear free. Then we could trivially avoid the spinlock in
LWLockConflictsWithVar(), LWLockReleaseClearVar() and with just a bit more
work add a fastpath to LWLockUpdateVar(). We don't need to acquire the wait
list lock if there aren't any waiters.
I'd bet that starts to have visible effects in a workload with many small
records.
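To make the tearing point concrete, here is a toy model in C11 stdatomic, used as a rough stand-in for pg_atomic (the ToyWALInsertLock type and helper names are invented for this sketch, not PostgreSQL code): wrapping the 64-bit field in an atomic type makes even relaxed reads and writes tear-free on platforms without native 64-bit loads/stores.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* A plain uint64_t access may be split into two 32-bit accesses on
 * platforms without native 64-bit loads/stores, so a concurrent reader
 * could observe a "torn" half-old/half-new value.  Declaring the field
 * _Atomic obliges the implementation to make every access tear-free,
 * even with relaxed ordering (barriers are a separate concern). */
typedef struct
{
	_Atomic uint64_t insertingAt;	/* stand-in for pg_atomic_uint64 */
} ToyWALInsertLock;

static void
toy_set_inserting_at(ToyWALInsertLock *lock, uint64_t ptr)
{
	/* relaxed is enough for tear-freedom */
	atomic_store_explicit(&lock->insertingAt, ptr, memory_order_relaxed);
}

static uint64_t
toy_read_inserting_at(ToyWALInsertLock *lock)
{
	return atomic_load_explicit(&lock->insertingAt, memory_order_relaxed);
}
```

This is only the tear-freedom half of the proposal; the ordering guarantees come from the full-barrier operations discussed further down the thread.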
Greetings,
Andres Freund
On Fri, Nov 25, 2022 at 12:16 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2022-11-24 18:13:10 +0530, Bharath Rupireddy wrote:
With that said, here's a small improvement I can think of, that is, to
avoid calling LWLockWaitForVar() for the WAL insertion lock the caller
of WaitXLogInsertionsToFinish() currently holds. Since
LWLockWaitForVar() does a bunch of things - holds interrupts, does
atomic reads, acquires and releases wait list lock and so on, avoiding
it may be a good idea IMO.
That doesn't seem like a big win. We're still going to call LWLockWaitForVar()
for all the other locks.
I think we could improve this code more significantly by avoiding the call to
LWLockWaitForVar() for all locks that aren't acquired or don't have a
conflicting insertingAt, that'd require just a bit more work to handle systems
without tear-free 64bit writes/reads.
The easiest way would probably be to just make insertingAt a 64bit atomic,
that transparently does the required work to make even non-atomic read/writes
tear free. Then we could trivially avoid the spinlock in
LWLockConflictsWithVar(), LWLockReleaseClearVar() and with just a bit more
work add a fastpath to LWLockUpdateVar(). We don't need to acquire the wait
list lock if there aren't any waiters.
I'd bet that starts to have visible effects in a workload with many small
records.
Thanks Andres! I quickly came up with the attached patch. I also ran
an insert test [1], below are the results. I also attached the results
graph. The cirrus-ci is happy with the patch -
https://github.com/BRupireddy/postgres/tree/wal_insertion_lock_improvements_v1_2.
Please let me know if the direction of the patch seems right.
clients HEAD PATCHED
1 1354 1499
2 1451 1464
4 3069 3073
8 5712 5797
16 11331 11157
32 22020 22074
64 41742 42213
128 71300 76638
256 103652 118944
512 111250 161582
768 99544 161987
1024 96743 164161
2048 72711 156686
4096 54158 135713
[1]:
./configure --prefix=$PWD/inst/ CFLAGS="-O3" > install.log && make -j
8 install > install.log 2>&1 &
cd inst/bin
./pg_ctl -D data -l logfile stop
rm -rf data logfile
free -m
sudo su -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
free -m
./initdb -D data
./pg_ctl -D data -l logfile start
./psql -d postgres -c 'ALTER SYSTEM SET shared_buffers = "8GB";'
./psql -d postgres -c 'ALTER SYSTEM SET max_wal_size = "16GB";'
./psql -d postgres -c 'ALTER SYSTEM SET max_connections = "4096";'
./pg_ctl -D data -l logfile restart
./pgbench -i -s 1 -d postgres
./psql -d postgres -c "ALTER TABLE pgbench_accounts DROP CONSTRAINT
pgbench_accounts_pkey;"
cat << EOF >> insert.sql
\set aid random(1, 10 * :scale)
\set delta random(1, 100000 * :scale)
INSERT INTO pgbench_accounts (aid, bid, abalance) VALUES (:aid, :aid, :delta);
EOF
ulimit -S -n 5000
for c in 1 2 4 8 16 32 64 128 256 512 768 1024 2048 4096; do echo -n
"$c ";./pgbench -n -M prepared -U ubuntu postgres -f insert.sql -c$c
-j$c -T5 2>&1|grep '^tps'|awk '{print $3}';done
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On 2022-11-25 16:54:19 +0530, Bharath Rupireddy wrote:
On Fri, Nov 25, 2022 at 12:16 AM Andres Freund <andres@anarazel.de> wrote:
I think we could improve this code more significantly by avoiding the call to
LWLockWaitForVar() for all locks that aren't acquired or don't have a
conflicting insertingAt, that'd require just a bit more work to handle systems
without tear-free 64bit writes/reads.
The easiest way would probably be to just make insertingAt a 64bit atomic,
that transparently does the required work to make even non-atomic read/writes
tear free. Then we could trivially avoid the spinlock in
LWLockConflictsWithVar(), LWLockReleaseClearVar() and with just a bit more
work add a fastpath to LWLockUpdateVar(). We don't need to acquire the wait
list lock if there aren't any waiters.
I'd bet that starts to have visible effects in a workload with many small
records.
Thanks Andres! I quickly came up with the attached patch. I also ran
an insert test [1], below are the results. I also attached the results
graph. The cirrus-ci is happy with the patch -
https://github.com/BRupireddy/postgres/tree/wal_insertion_lock_improvements_v1_2.
Please let me know if the direction of the patch seems right.
clients HEAD PATCHED
1 1354 1499
2 1451 1464
4 3069 3073
8 5712 5797
16 11331 11157
32 22020 22074
64 41742 42213
128 71300 76638
256 103652 118944
512 111250 161582
768 99544 161987
1024 96743 164161
2048 72711 156686
4096 54158 135713
Nice.
From 293e789f9c1a63748147acd613c556961f1dc5c4 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Fri, 25 Nov 2022 10:53:56 +0000
Subject: [PATCH v1] WAL Insertion Lock Improvements
---
src/backend/access/transam/xlog.c | 8 +++--
src/backend/storage/lmgr/lwlock.c | 56 +++++++++++++++++--------------
src/include/storage/lwlock.h | 7 ++--
3 files changed, 41 insertions(+), 30 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a31fbbff78..b3f758abb3 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -376,7 +376,7 @@ typedef struct XLogwrtResult
 typedef struct
 {
 	LWLock		lock;
-	XLogRecPtr	insertingAt;
+	pg_atomic_uint64 insertingAt;
 	XLogRecPtr	lastImportantAt;
 } WALInsertLock;
@@ -1482,6 +1482,10 @@ WaitXLogInsertionsToFinish(XLogRecPtr upto)
{
 		XLogRecPtr	insertingat = InvalidXLogRecPtr;
+		/* Quickly check and continue if no one holds the lock. */
+		if (!IsLWLockHeld(&WALInsertLocks[i].l.lock))
+			continue;
I'm not sure this is quite right - don't we need a memory barrier. But I don't
see a reason to not just leave this code as-is. I think this should be
optimized entirely in lwlock.c
I'd probably split the change to an atomic from other changes either way.
Greetings,
Andres Freund
On Fri, Dec 2, 2022 at 6:10 AM Andres Freund <andres@anarazel.de> wrote:
On 2022-11-25 16:54:19 +0530, Bharath Rupireddy wrote:
On Fri, Nov 25, 2022 at 12:16 AM Andres Freund <andres@anarazel.de> wrote:
I think we could improve this code more significantly by avoiding the call to
LWLockWaitForVar() for all locks that aren't acquired or don't have a
conflicting insertingAt, that'd require just a bit more work to handle systems
without tear-free 64bit writes/reads.
The easiest way would probably be to just make insertingAt a 64bit atomic,
that transparently does the required work to make even non-atomic read/writes
tear free. Then we could trivially avoid the spinlock in
LWLockConflictsWithVar(), LWLockReleaseClearVar() and with just a bit more
work add a fastpath to LWLockUpdateVar(). We don't need to acquire the wait
list lock if there aren't any waiters.
I'd bet that starts to have visible effects in a workload with many small
records.
Thanks Andres! I quickly came up with the attached patch. I also ran
an insert test [1], below are the results. I also attached the results
graph. The cirrus-ci is happy with the patch -
https://github.com/BRupireddy/postgres/tree/wal_insertion_lock_improvements_v1_2.
Please let me know if the direction of the patch seems right.
clients HEAD PATCHED
1 1354 1499
2 1451 1464
4 3069 3073
8 5712 5797
16 11331 11157
32 22020 22074
64 41742 42213
128 71300 76638
256 103652 118944
512 111250 161582
768 99544 161987
1024 96743 164161
2048 72711 156686
4096 54158 135713
Nice.
Thanks for taking a look at it.
From 293e789f9c1a63748147acd613c556961f1dc5c4 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Fri, 25 Nov 2022 10:53:56 +0000
Subject: [PATCH v1] WAL Insertion Lock Improvements
---
src/backend/access/transam/xlog.c | 8 +++--
src/backend/storage/lmgr/lwlock.c | 56 +++++++++++++++++--------------
src/include/storage/lwlock.h | 7 ++--
3 files changed, 41 insertions(+), 30 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a31fbbff78..b3f758abb3 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -376,7 +376,7 @@ typedef struct XLogwrtResult
 typedef struct
 {
 	LWLock		lock;
-	XLogRecPtr	insertingAt;
+	pg_atomic_uint64 insertingAt;
 	XLogRecPtr	lastImportantAt;
 } WALInsertLock;
@@ -1482,6 +1482,10 @@ WaitXLogInsertionsToFinish(XLogRecPtr upto)
{
 		XLogRecPtr	insertingat = InvalidXLogRecPtr;
+		/* Quickly check and continue if no one holds the lock. */
+		if (!IsLWLockHeld(&WALInsertLocks[i].l.lock))
+			continue;
I'm not sure this is quite right - don't we need a memory barrier. But I don't
see a reason to not just leave this code as-is. I think this should be
optimized entirely in lwlock.c
Actually, we don't need that at all as LWLockWaitForVar() will return
immediately if the lock is free. So, I removed it.
I'd probably split the change to an atomic from other changes either way.
Done. I've added commit messages to each of the patches.
I've also brought the patch from [1] here as 0003.
Thoughts?
[1]: /messages/by-id/CALj2ACXtQdrGXtb=rbUOXddm1wU1vD9z6q_39FQyX0166dq==A@mail.gmail.com
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v2-0001-Make-insertingAt-64-bit-atomic.patch (application/octet-stream)
v2-0002-Add-fastpath-to-LWLockUpdateVar.patch (application/octet-stream)
v2-0003-Make-lastImportantAt-64-bit-atomic.patch (application/octet-stream)
On Fri, Dec 02, 2022 at 04:32:38PM +0530, Bharath Rupireddy wrote:
On Fri, Dec 2, 2022 at 6:10 AM Andres Freund <andres@anarazel.de> wrote:
I'm not sure this is quite right - don't we need a memory barrier. But I don't
see a reason to not just leave this code as-is. I think this should be
optimized entirely in lwlock.c
Actually, we don't need that at all as LWLockWaitForVar() will return
immediately if the lock is free. So, I removed it.
I briefly looked at the latest patch set, and I'm curious how this change
avoids introducing memory ordering bugs. Perhaps I am missing something
obvious.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Hi,
FWIW, I don't see an advantage in 0003. If it allows us to make something else
simpler / faster, cool, but on its own it doesn't seem worthwhile.
On 2022-12-02 16:31:58 -0800, Nathan Bossart wrote:
On Fri, Dec 02, 2022 at 04:32:38PM +0530, Bharath Rupireddy wrote:
On Fri, Dec 2, 2022 at 6:10 AM Andres Freund <andres@anarazel.de> wrote:
I'm not sure this is quite right - don't we need a memory barrier. But I don't
see a reason to not just leave this code as-is. I think this should be
optimized entirely in lwlock.c
Actually, we don't need that at all as LWLockWaitForVar() will return
immediately if the lock is free. So, I removed it.
I briefly looked at the latest patch set, and I'm curious how this change
avoids introducing memory ordering bugs. Perhaps I am missing something
obvious.
I'm a bit confused too - the comment above talks about LWLockWaitForVar(), but
the patches seem to optimize LWLockUpdateVar().
I think it'd be safe to optimize LWLockConflictsWithVar(), due to some
pre-existing, quite crufty, code. LWLockConflictsWithVar() says:
* Test first to see if it the slot is free right now.
*
* XXX: the caller uses a spinlock before this, so we don't need a memory
* barrier here as far as the current usage is concerned. But that might
* not be safe in general.
which happens to be true in the single, indirect, caller:
/* Read the current insert position */
SpinLockAcquire(&Insert->insertpos_lck);
bytepos = Insert->CurrBytePos;
SpinLockRelease(&Insert->insertpos_lck);
reservedUpto = XLogBytePosToEndRecPtr(bytepos);
I think at the very least we ought to have a comment in
WaitXLogInsertionsToFinish() highlighting this.
It's not at all clear to me that the proposed fast-path for LWLockUpdateVar()
is safe. I think at the very least we could end up missing waiters that we
should have woken up.
I think it ought to be safe to do something like
pg_atomic_exchange_u64()..
if (!(pg_atomic_read_u32(&lock->state) & LW_FLAG_HAS_WAITERS))
return;
because the pg_atomic_exchange_u64() will provide the necessary memory
barrier.
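A minimal sketch of that fastpath, modeled with C11 atomics rather than the real lwlock.c internals (the ToyLWLock type, the flag bit value, and the wait-list step are invented for illustration):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_LW_FLAG_HAS_WAITERS (1u << 30)	/* stand-in for the real flag bit */

typedef struct
{
	_Atomic uint32_t state;		/* stand-in for LWLock->state */
} ToyLWLock;

static bool wait_list_walked;	/* records whether the slow path ran */

/* Sketch of LWLockUpdateVar() with the proposed fastpath: publish the
 * variable with a full-barrier exchange, then take the wait list lock
 * only if the HAS_WAITERS flag is set.  The exchange's barrier keeps the
 * flag read from being reordered before the variable update, so a waiter
 * that queued itself before the exchange is always noticed. */
static void
toy_lwlock_update_var(ToyLWLock *lock, _Atomic uint64_t *valptr, uint64_t val)
{
	(void) atomic_exchange(valptr, val);	/* seq_cst: full barrier */

	if (!(atomic_load(&lock->state) & TOY_LW_FLAG_HAS_WAITERS))
		return;					/* fast path: nobody to wake */

	/* slow path: acquire wait list lock and wake waiters (elided) */
	wait_list_walked = true;
}
```

In the real patch the slow path is the LWLockWaitListLock()/wake-up sequence; the point of the sketch is only the ordering, where the exchange's barrier pairs with the barrier a waiter performs when it queues itself.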
Greetings,
Andres Freund
On Tue, Dec 6, 2022 at 12:00 AM Andres Freund <andres@anarazel.de> wrote:
FWIW, I don't see an advantage in 0003. If it allows us to make something else
simpler / faster, cool, but on its own it doesn't seem worthwhile.
Thanks. I will discard it.
I think it'd be safe to optimize LWLockConflictsWithVar(), due to some
pre-existing, quite crufty, code. LWLockConflictsWithVar() says:
* Test first to see if it the slot is free right now.
*
* XXX: the caller uses a spinlock before this, so we don't need a memory
* barrier here as far as the current usage is concerned. But that might
* not be safe in general.
which happens to be true in the single, indirect, caller:
/* Read the current insert position */
SpinLockAcquire(&Insert->insertpos_lck);
bytepos = Insert->CurrBytePos;
SpinLockRelease(&Insert->insertpos_lck);
reservedUpto = XLogBytePosToEndRecPtr(bytepos);
I think at the very least we ought to have a comment in
WaitXLogInsertionsToFinish() highlighting this.
So, using a spinlock ensures no memory ordering occurs while reading
lock->state in LWLockConflictsWithVar()? How does spinlock that gets
acquired and released in the caller WaitXLogInsertionsToFinish()
itself and the memory barrier in the called function
LWLockConflictsWithVar() relate here? Can you please help me
understand this a bit?
It's not at all clear to me that the proposed fast-path for LWLockUpdateVar()
is safe. I think at the very least we could end up missing waiters that we
should have woken up.
I think it ought to be safe to do something like
pg_atomic_exchange_u64()..
if (!(pg_atomic_read_u32(&lock->state) & LW_FLAG_HAS_WAITERS))
return;
pg_atomic_exchange_u64(&lock->state, exchange_with_what_?. Exchange
will change the value no?
because the pg_atomic_exchange_u64() will provide the necessary memory
barrier.
I'm reading some comments [1], are these also true for 64-bit atomic
CAS? Does it mean that an atomic CAS operation inherently provides a
memory barrier? Can you please point me if it's described better
somewhere else?
[1]:
* Full barrier semantics.
*/
static inline uint32
pg_atomic_exchange_u32(volatile pg_atomic_uint32 *ptr,
/*
* Get and clear the flags that are set for this backend. Note that
* pg_atomic_exchange_u32 is a full barrier, so we're guaranteed that the
* read of the barrier generation above happens before we atomically
* extract the flags, and that any subsequent state changes happen
* afterward.
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,
On 2022-12-08 12:29:54 +0530, Bharath Rupireddy wrote:
On Tue, Dec 6, 2022 at 12:00 AM Andres Freund <andres@anarazel.de> wrote:
I think it'd be safe to optimize LWLockConflictsWithVar(), due to some
pre-existing, quite crufty, code. LWLockConflictsWithVar() says:
* Test first to see if it the slot is free right now.
*
* XXX: the caller uses a spinlock before this, so we don't need a memory
* barrier here as far as the current usage is concerned. But that might
* not be safe in general.
which happens to be true in the single, indirect, caller:
/* Read the current insert position */
SpinLockAcquire(&Insert->insertpos_lck);
bytepos = Insert->CurrBytePos;
SpinLockRelease(&Insert->insertpos_lck);
reservedUpto = XLogBytePosToEndRecPtr(bytepos);
I think at the very least we ought to have a comment in
WaitXLogInsertionsToFinish() highlighting this.
So, using a spinlock ensures no memory ordering occurs while reading
lock->state in LWLockConflictsWithVar()?
No, a spinlock *does* imply ordering. But your patch does remove several of
the spinlock acquisitions (via LWLockWaitListLock()). And moved the assignment
in LWLockUpdateVar() out from under the spinlock.
If you remove spinlock operations (or other barrier primitives), you need to
make sure that such modifications don't break the required memory ordering.
How does spinlock that gets acquired and released in the caller
WaitXLogInsertionsToFinish() itself and the memory barrier in the called
function LWLockConflictsWithVar() relate here? Can you please help me
understand this a bit?
The caller's barrier means that we'll see values that are at least as "up to
date" as at the time of the barrier (it's a bit more complicated than that, a
barrier always needs to be paired with another barrier).
It's not at all clear to me that the proposed fast-path for LWLockUpdateVar()
is safe. I think at the very least we could end up missing waiters that we
should have woken up.
I think it ought to be safe to do something like
pg_atomic_exchange_u64()..
if (!(pg_atomic_read_u32(&lock->state) & LW_FLAG_HAS_WAITERS))
return;
pg_atomic_exchange_u64(&lock->state, exchange_with_what_?. Exchange will
change the value no?
Not lock->state, but the atomic passed to LWLockUpdateVar(), which we do want
to update. An pg_atomic_exchange_u64() includes a memory barrier.
because the pg_atomic_exchange_u64() will provide the necessary memory
barrier.
I'm reading some comments [1], are these also true for 64-bit atomic
CAS?
Yes. See
/* ----
* The 64 bit operations have the same semantics as their 32bit counterparts
* if they are available. Check the corresponding 32bit function for
* documentation.
* ----
*/
Does it mean that an atomic CAS operation inherently provides a
memory barrier?
Yes.
Can you please point me if it's described better somewhere else?
I'm not sure what you'd like to have described more extensively, tbh.
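As a runnable reminder of the semantics in question, C11's atomic_exchange behaves the way pg_atomic_exchange_u64 is described here: it atomically stores the new value, returns the old one, and is a full barrier under the default seq_cst ordering (swap_value is just an illustrative wrapper, not a PostgreSQL function):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* atomic_exchange stores the new value and hands back the previous one in
 * a single atomic step; with the default memory_order_seq_cst it is a
 * full barrier, so no load or store can be reordered across it.  That is
 * the property the fastpath relies on: the variable update is globally
 * visible before the subsequent read of lock->state. */
static uint64_t
swap_value(_Atomic uint64_t *ptr, uint64_t newval)
{
	return atomic_exchange(ptr, newval);
}
```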
Greetings,
Andres Freund
On Tue, Dec 6, 2022 at 12:00 AM Andres Freund <andres@anarazel.de> wrote:
Hi
Thanks for reviewing.
FWIW, I don't see an advantage in 0003. If it allows us to make something else
simpler / faster, cool, but on its own it doesn't seem worthwhile.
I've discarded this change.
On 2022-12-02 16:31:58 -0800, Nathan Bossart wrote:
On Fri, Dec 02, 2022 at 04:32:38PM +0530, Bharath Rupireddy wrote:
On Fri, Dec 2, 2022 at 6:10 AM Andres Freund <andres@anarazel.de> wrote:
I'm not sure this is quite right - don't we need a memory barrier. But I don't
see a reason to not just leave this code as-is. I think this should be
optimized entirely in lwlock.c
Actually, we don't need that at all as LWLockWaitForVar() will return
immediately if the lock is free. So, I removed it.
I briefly looked at the latest patch set, and I'm curious how this change
avoids introducing memory ordering bugs. Perhaps I am missing something
obvious.
I'm a bit confused too - the comment above talks about LWLockWaitForVar(), but
the patches seem to optimize LWLockUpdateVar().
I think it'd be safe to optimize LWLockConflictsWithVar(), due to some
pre-existing, quite crufty, code. LWLockConflictsWithVar() says:
* Test first to see if it the slot is free right now.
*
* XXX: the caller uses a spinlock before this, so we don't need a memory
* barrier here as far as the current usage is concerned. But that might
* not be safe in general.
which happens to be true in the single, indirect, caller:
/* Read the current insert position */
SpinLockAcquire(&Insert->insertpos_lck);
bytepos = Insert->CurrBytePos;
SpinLockRelease(&Insert->insertpos_lck);
reservedUpto = XLogBytePosToEndRecPtr(bytepos);
I think at the very least we ought to have a comment in
WaitXLogInsertionsToFinish() highlighting this.
Done.
It's not at all clear to me that the proposed fast-path for LWLockUpdateVar()
is safe. I think at the very least we could end up missing waiters that we
should have woken up.
I think it ought to be safe to do something like
pg_atomic_exchange_u64()..
if (!(pg_atomic_read_u32(&lock->state) & LW_FLAG_HAS_WAITERS))
return;
because the pg_atomic_exchange_u64() will provide the necessary memory
barrier.
Done.
I'm attaching the v3 patch with the above review comments addressed.
Hopefully, no memory ordering issues now. FWIW, I've added it to CF
https://commitfest.postgresql.org/42/4141/.
Test results with the v3 patch and insert workload are the same as
that of the earlier run - TPS starts to scale at higher clients as
expected after 512 clients and peaks at 2X with 2048 and 4096 clients.
HEAD:
1 1380.411086
2 1358.378988
4 2701.974332
8 5925.380744
16 10956.501237
32 20877.513953
64 40838.046774
128 70251.744161
256 108114.321299
512 120478.988268
768 99140.425209
1024 93645.984364
2048 70111.159909
4096 55541.804826
v3 PATCHED:
1 1493.800209
2 1569.414953
4 3154.186605
8 5965.578904
16 11912.587645
32 22720.964908
64 42001.094528
128 78361.158983
256 110457.926232
512 148941.378393
768 167256.590308
1024 155510.675372
2048 147499.376882
4096 119375.457779
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v3-0001-Optimize-WAL-insertion-lock-acquisition-and-relea.patch (application/octet-stream)
On Tue, Jan 24, 2023 at 7:00 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
I'm attaching the v3 patch with the above review comments addressed.
Hopefully, no memory ordering issues now. FWIW, I've added it to CF
https://commitfest.postgresql.org/42/4141/.
Test results with the v3 patch and insert workload are the same as
that of the earlier run - TPS starts to scale at higher clients as
expected after 512 clients and peaks at 2X with 2048 and 4096 clients.
HEAD:
1 1380.411086
2 1358.378988
4 2701.974332
8 5925.380744
16 10956.501237
32 20877.513953
64 40838.046774
128 70251.744161
256 108114.321299
512 120478.988268
768 99140.425209
1024 93645.984364
2048 70111.159909
4096 55541.804826
v3 PATCHED:
1 1493.800209
2 1569.414953
4 3154.186605
8 5965.578904
16 11912.587645
32 22720.964908
64 42001.094528
128 78361.158983
256 110457.926232
512 148941.378393
768 167256.590308
1024 155510.675372
2048 147499.376882
4096 119375.457779
I slightly modified the comments and attached the v4 patch for further
review. I also took a perf report - there's a clear reduction in the
functions that are affected by the patch - LWLockWaitListLock,
WaitXLogInsertionsToFinish, LWLockWaitForVar and
LWLockConflictsWithVar. Note that I compiled the source code with
-ggdb for capturing symbols for perf, still the benefit stands at > 2X
for a higher number of clients.
HEAD:
+ 16.87% 0.01% postgres [.] CommitTransactionCommand
+ 16.86% 0.00% postgres [.] finish_xact_command
+ 16.81% 0.01% postgres [.] CommitTransaction
+ 15.09% 0.20% postgres [.] LWLockWaitListLock
+ 14.53% 0.01% postgres [.] WaitXLogInsertionsToFinish
+ 14.51% 0.02% postgres [.] LWLockWaitForVar
+ 11.70% 11.63% postgres [.] pg_atomic_read_u32_impl
+ 11.66% 0.08% postgres [.] pg_atomic_read_u32
+ 9.96% 0.03% postgres [.] LWLockConflictsWithVar
+ 4.78% 0.00% postgres [.] LWLockQueueSelf
+ 1.91% 0.01% postgres [.] pg_atomic_fetch_or_u32
+ 1.91% 1.89% postgres [.] pg_atomic_fetch_or_u32_impl
+ 1.73% 0.00% postgres [.] XLogInsert
+ 1.69% 0.01% postgres [.] XLogInsertRecord
+ 1.41% 0.01% postgres [.] LWLockRelease
+ 1.37% 0.47% postgres [.] perform_spin_delay
+ 1.11% 1.11% postgres [.] spin_delay
+ 1.10% 0.03% postgres [.] exec_bind_message
+ 0.91% 0.00% postgres [.] WALInsertLockRelease
+ 0.91% 0.00% postgres [.] LWLockReleaseClearVar
+ 0.72% 0.02% postgres [.] LWLockAcquire
+ 0.60% 0.00% postgres [.] LWLockDequeueSelf
+ 0.58% 0.00% postgres [.] GetTransactionSnapshot
0.58% 0.49% postgres [.] GetSnapshotData
+ 0.58% 0.00% postgres [.] WALInsertLockAcquire
+ 0.55% 0.00% postgres [.] XactLogCommitRecord
TPS (compiled with -ggdb for capturing symbols for perf)
1 1392.512967
2 1435.899119
4 3104.091923
8 6159.305522
16 11477.641780
32 22701.000718
64 41662.425880
128 23743.426209
256 89837.651619
512 65164.221500
768 66015.733370
1024 56421.223080
2048 52909.018072
4096 40071.146985
PATCHED:
+ 2.19% 0.05% postgres [.] LWLockWaitListLock
+ 2.10% 0.01% postgres [.] LWLockQueueSelf
+ 1.73% 1.71% postgres [.] pg_atomic_read_u32_impl
+ 1.73% 0.02% postgres [.] pg_atomic_read_u32
+ 1.72% 0.02% postgres [.] LWLockRelease
+ 1.65% 0.04% postgres [.] exec_bind_message
+ 1.43% 0.00% postgres [.] XLogInsert
+ 1.42% 0.01% postgres [.] WaitXLogInsertionsToFinish
+ 1.40% 0.03% postgres [.] LWLockWaitForVar
+ 1.38% 0.02% postgres [.] XLogInsertRecord
+ 0.93% 0.03% postgres [.] LWLockAcquireOrWait
+ 0.91% 0.00% postgres [.] GetTransactionSnapshot
+ 0.91% 0.79% postgres [.] GetSnapshotData
+ 0.91% 0.00% postgres [.] WALInsertLockRelease
+ 0.91% 0.00% postgres [.] LWLockReleaseClearVar
+ 0.53% 0.02% postgres [.] ExecInitModifyTable
TPS (compiled with -ggdb for capturing symbols for perf)
1 1295.296611
2 1459.079162
4 2865.688987
8 5533.724983
16 10771.697842
32 20557.499312
64 39436.423783
128 42555.639048
256 73139.060227
512 124649.665196
768 131162.826976
1024 132185.160007
2048 117377.586644
4096 88240.336940
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v4-0001-Optimize-WAL-insertion-lock-acquisition-and-relea.patch (application/x-patch)
+ pg_atomic_exchange_u64(valptr, val);
nitpick: I'd add a (void) at the beginning of these calls to
pg_atomic_exchange_u64() so that it's clear that we are discarding the
return value.
+ /*
+ * Update the lock variable atomically first without having to acquire wait
+ * list lock, so that if anyone looking for the lock will have chance to
+ * grab it a bit quickly.
+ *
+ * NB: Note the use of pg_atomic_exchange_u64 as opposed to just
+ * pg_atomic_write_u64 to update the value. Since pg_atomic_exchange_u64 is
+ * a full barrier, we're guaranteed that the subsequent atomic read of lock
+ * state to check if it has any waiters happens after we set the lock
+ * variable to new value here. Without a barrier, we could end up missing
+ * waiters that otherwise should have been woken up.
+ */
+ pg_atomic_exchange_u64(valptr, val);
+
+ /*
+ * Quick exit when there are no waiters. This avoids unnecessary lwlock's
+ * wait list lock acquisition and release.
+ */
+ if ((pg_atomic_read_u32(&lock->state) & LW_FLAG_HAS_WAITERS) == 0)
+ return;
I think this makes sense. A waiter could queue itself after the exchange,
but it'll recheck after queueing. IIUC this is basically how this works
today. We update the value and release the lock before waking up any
waiters, so the same principle applies.
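That no-lost-wakeup argument can be modeled as two halves of a protocol. This is a toy simulation with invented names, not the real LWLockWaitForVar()/LWLockUpdateVar() code (among other simplifications, the real waiter also checks whether the lock is held at all before deciding to wait):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_LW_FLAG_HAS_WAITERS (1u << 30)	/* stand-in for the real flag bit */

typedef struct
{
	_Atomic uint32_t state;
	_Atomic uint64_t insertingAt;
} ToyLock;

/* Waiter side: queue first (sets HAS_WAITERS), then recheck the variable
 * before sleeping.  Returns true if it would actually block. */
static bool
toy_waiter_would_sleep(ToyLock *lock, uint64_t waitfor)
{
	atomic_fetch_or(&lock->state, TOY_LW_FLAG_HAS_WAITERS);	/* queue self */
	/* recheck after queueing: if the updater already published a value at
	 * or past what we're waiting for, bail out instead of sleeping */
	return atomic_load(&lock->insertingAt) < waitfor;
}

/* Updater side: publish the value first (full-barrier exchange), then look
 * for waiters.  Returns true if it would walk the wait list to wake them. */
static bool
toy_updater_would_wake(ToyLock *lock, uint64_t val)
{
	(void) atomic_exchange(&lock->insertingAt, val);
	return (atomic_load(&lock->state) & TOY_LW_FLAG_HAS_WAITERS) != 0;
}
```

Whichever side goes first, the wakeup is never lost: if the update lands first, the waiter's recheck sees the new value and never sleeps; if the waiter queues first, the updater's flag read (ordered after the exchange) sees HAS_WAITERS and takes the slow path.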
Overall, I think this patch is in reasonable shape.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Thu, Feb 9, 2023 at 3:36 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
+ pg_atomic_exchange_u64(valptr, val);
nitpick: I'd add a (void) at the beginning of these calls to
pg_atomic_exchange_u64() so that it's clear that we are discarding the
return value.
I did that in the attached v5 patch, although usage is mixed elsewhere;
some call sites cast the return value to (void) explicitly and some do not.
+ /*
+ * Update the lock variable atomically first without having to acquire wait
+ * list lock, so that if anyone looking for the lock will have chance to
+ * grab it a bit quickly.
+ *
+ * NB: Note the use of pg_atomic_exchange_u64 as opposed to just
+ * pg_atomic_write_u64 to update the value. Since pg_atomic_exchange_u64 is
+ * a full barrier, we're guaranteed that the subsequent atomic read of lock
+ * state to check if it has any waiters happens after we set the lock
+ * variable to new value here. Without a barrier, we could end up missing
+ * waiters that otherwise should have been woken up.
+ */
+ pg_atomic_exchange_u64(valptr, val);
+
+ /*
+ * Quick exit when there are no waiters. This avoids unnecessary lwlock's
+ * wait list lock acquisition and release.
+ */
+ if ((pg_atomic_read_u32(&lock->state) & LW_FLAG_HAS_WAITERS) == 0)
+ return;
I think this makes sense. A waiter could queue itself after the exchange,
but it'll recheck after queueing. IIUC this is basically how this works
today. We update the value and release the lock before waking up any
waiters, so the same principle applies.
Yes, right after self-queuing (LWLockQueueSelf), a waiter re-checks the
value (LWLockConflictsWithVar) before it goes to sleep in
LWLockWaitForVar until awakened. A waiter added to the queue is
guaranteed to be woken up by LWLockUpdateVar, but before that the lock
value is set, and pg_atomic_exchange_u64 acts as a memory barrier, so
there is no memory reordering. Essentially, the order of these
operations isn't changed. The benefit we're seeing comes from avoiding
LWLock's waitlist lock for reading and updating the lock value,
relying on 64-bit atomics instead.
Overall, I think this patch is in reasonable shape.
Thanks for reviewing. Please see the attached v5 patch.
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v5-0001-Optimize-WAL-insertion-lock-acquisition-and-relea.patchapplication/octet-stream; name=v5-0001-Optimize-WAL-insertion-lock-acquisition-and-relea.patchDownload+55-32
On Thu, Feb 09, 2023 at 11:51:28AM +0530, Bharath Rupireddy wrote:
On Thu, Feb 9, 2023 at 3:36 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
Overall, I think this patch is in reasonable shape.
Thanks for reviewing. Please see the attached v5 patch.
I'm marking this as ready-for-committer. I think a couple of the comments
could use some small adjustments, but that probably doesn't need to hold up
this patch.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Hi Andres Freund
This patch improves performance significantly, and Commitfest 2023-03 is coming to an end. Has it not been committed yet because the patch still needs further work?
Best wishes
________________________________
From: Nathan Bossart <nathandbossart@gmail.com>
Sent: February 21, 2023 13:49
To: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Cc: Andres Freund <andres@anarazel.de>; PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>
Subject: Re: WAL Insertion Lock Improvements
On Thu, Feb 09, 2023 at 11:51:28AM +0530, Bharath Rupireddy wrote:
On Thu, Feb 9, 2023 at 3:36 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
Overall, I think this patch is in reasonable shape.
Thanks for reviewing. Please see the attached v5 patch.
I'm marking this as ready-for-committer. I think a couple of the comments
could use some small adjustments, but that probably doesn't need to hold up
this patch.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Mon, Feb 20, 2023 at 09:49:48PM -0800, Nathan Bossart wrote:
I'm marking this as ready-for-committer. I think a couple of the comments
could use some small adjustments, but that probably doesn't need to hold up
this patch.
Apologies. I was planning to have a thorough look at this patch but
life got in the way and I have not been able to study what's happening
on this thread this close to the feature freeze.
Anyway, I am attaching two modules I have written for the sake of this
thread while beginning my lookup of the patch:
- lwlock_test.tar.gz, validation module for LWLocks with variable
waits. This module can be loaded with shared_preload_libraries to
have two LWLocks and two variables in shmem, then have 2 backends play
ping-pong with each other's locks. An isolation test may be possible,
though I have not thought hard about it. Just use a SQL sequence like
that, for example, with N > 1 (see README):
Backend 1: SELECT lwlock_test_acquire();
Backend 2: SELECT lwlock_test_wait(N);
Backend 1: SELECT lwlock_test_update(N);
Backend 1: SELECT lwlock_test_release();
- custom_wal.tar.gz, thin wrapper for LogLogicalMessage() able to
generate N records of size M bytes in a single SQL call. This can be
used to generate records of various sizes for benchmarking, limiting
the overhead of individual calls to pg_logical_emit_message_bytea().
I have begun gathering numbers with WAL records of various size and
length, using pgbench like:
$ cat script.sql
\set record_size 1
\set record_number 5000
SELECT custom_wal(:record_size, :record_number);
$ pgbench -n -c 500 -t 100 -f script.sql
So this limits most of the overhead coming from parsing, planning, and
most of the INSERT logic.
I have been trying to get some reproducible numbers, but I think that
I am going to need a bigger machine than what I have been using for
the last few days, up to 400 connections. It is worth noting that
00d1e02b may influence a bit the results, so we may want to have more
numbers with that in place particularly with INSERTs, and one of the
tests used upthread uses single row INSERTs.
Another question I had: would it be worth having some tests with
pg_wal/ mounted to a tmpfs so as I/O would not be a bottleneck? It
should be instructive to get more measurement with a fixed number of
transactions and a rather high amount of concurrent connections (1k at
least?), where the contention would be on the variable waits. My
first impression is that records should not be too small if you want
to see more the effects of this patch, either.
Looking at the patch.. LWLockConflictsWithVar() and
LWLockReleaseClearVar() are the trivial bits. These are OK.
+ * NB: Note the use of pg_atomic_exchange_u64 as opposed to just
+ * pg_atomic_write_u64 to update the value. Since pg_atomic_exchange_u64 is
+ * a full barrier, we're guaranteed that the subsequent shared memory
+ * reads/writes, if any, happen after we reset the lock variable.
This mentions that the subsequent read/write operations are safe, so
this refers to anything happening after the variable is reset. As
a full barrier, should be also mention that this is also ordered with
respect to anything that the caller did before clearing the variable?
From this perspective using pg_atomic_exchange_u64() makes sense to me
in LWLockReleaseClearVar().
+ * XXX: Use of a spinlock at the beginning of this function to read
+ * current insert position implies memory ordering. That means that
+ * the immediate loads and stores to shared memory (for instance,
+ * in LWLockUpdateVar called via LWLockWaitForVar) don't need an
+ * explicit memory barrier as far as the current usage is
+ * concerned. But that might not be safe in general.
*/
What's the part where this is not safe? Based on what I see, this
code path is safe because of the previous spinlock. This is the same
comment as at the beginning of LWLockConflictsWithVar(). Is that
something that we ought to document at the top of LWLockWaitForVar()
as well? We have one caller of this function currently, but there may
be more in the future.
- * you're about to write out.
+ * you're about to write out. Using an atomic variable for insertingAt avoids
+ * taking any explicit lock for reads and writes.
Hmm. Not sure that we need to comment at all.
-LWLockUpdateVar(LWLock *lock, uint64 *valptr, uint64 val)
+LWLockUpdateVar(LWLock *lock, pg_atomic_uint64 *valptr, uint64 val)
[...]
Assert(pg_atomic_read_u32(&lock->state) & LW_VAL_EXCLUSIVE);
- /* Update the lock's value */
- *valptr = val;
The sensitive change is in LWLockUpdateVar(). I am not completely
sure to understand this removal, though. Does that influence the case
where there are waiters?
Another thing I was wondering about: how much does the fast-path used
in LWLockUpdateVar() influence the performance numbers? Am I right to
guess that it counts for most of the gain seen? Or could it be that
the removal of the spin lock in
LWLockConflictsWithVar()/LWLockWaitForVar() is the point that has the
highest effect?
--
Michael
On Mon, Apr 10, 2023 at 9:38 AM Michael Paquier <michael@paquier.xyz> wrote:
I have been trying to get some reproducible numbers, but I think that
I am going to need a bigger maching than what I have been using for
the last few days, up to 400 connections. It is worth noting that
00d1e02b may influence a bit the results, so we may want to have more
numbers with that in place particularly with INSERTs, and one of the
tests used upthread uses single row INSERTs.
I ran performance tests on the patch with different use-cases. Clearly
the patch reduces burden on LWLock's waitlist lock (evident from perf
reports [1]). However, to see visible impact in the output, the txns
must be generating small (between 16 bytes to 2 KB) amounts of WAL in
a highly concurrent manner, check the results below (FWIW, I've zipped
and attached perf images for better illustration along with test
setup).
When the txns are generating a small amount of WAL i.e. between 16
bytes to 2 KB in a highly concurrent manner, the benefit is clearly
visible in the TPS more than 2.3X improvement. When the txns are
generating more WAL i.e. more than 2 KB, the gain from reduced burden
on waitlist lock is offset by increase in the wait/release for WAL
insertion locks and no visible benefit is seen.
As the amount of WAL each txn generates increases, it looks like the
benefit gained from reduced burden on waitlist lock is offset by
increase in the wait for WAL insertion locks.
Note that I've used pg_logical_emit_message() for ease of
understanding about the txns generating various amounts of WAL, but
the pattern is the same if txns are generating various amounts of WAL
say with inserts.
test-case 1: -T5, WAL ~16 bytes
clients HEAD PATCHED
1 1437 1352
2 1376 1419
4 2919 2774
8 5875 6371
16 11148 12242
32 22108 23532
64 41414 46478
128 85304 85235
256 83771 152901
512 61970 141021
768 56514 118899
1024 51784 110960
2048 39141 84150
4096 16901 45759
test-case 1: -t1000, WAL ~16 bytes
clients HEAD PATCHED
1 1417 1333
2 1363 1791
4 2978 2970
8 5954 6198
16 11179 11164
32 23742 24043
64 45537 44103
128 84683 91762
256 80369 146293
512 61080 132079
768 57236 118046
1024 53497 114574
2048 46423 93588
4096 42067 85790
test-case 2: -T5, WAL ~256 bytes
clients HEAD PATCHED
1 1521 1386
2 1647 1637
4 3088 3270
8 6011 5631
16 12778 10317
32 24117 20006
64 43966 38199
128 72660 67936
256 93096 121261
512 57247 142418
768 53782 126218
1024 50279 109153
2048 35109 91602
4096 21184 39848
test-case 2: -t1000, WAL ~256 bytes
clients HEAD PATCHED
1 1265 1389
2 1522 1258
4 2802 2775
8 5875 5422
16 11664 10853
32 21961 22145
64 44304 40851
128 73278 80494
256 91172 122287
512 60966 136734
768 56590 125050
1024 52481 124341
2048 47878 104760
4096 42838 94121
test-case 3: -T5, WAL 512 bytes
clients HEAD PATCHED
1 1464 1284
2 1520 1381
4 2985 2877
8 6237 5261
16 11296 10621
32 22257 20789
64 40548 37243
128 66507 59891
256 92516 97506
512 56404 119716
768 51127 112482
1024 48463 103484
2048 38079 81424
4096 18977 40942
test-case 3: -t1000, WAL 512 bytes
clients HEAD PATCHED
1 1452 1434
2 1604 1649
4 3051 2971
8 5967 5650
16 10471 10702
32 20257 20899
64 39412 36750
128 62767 61110
256 81050 89768
512 56888 122786
768 51238 114444
1024 48972 106867
2048 43451 98847
4096 40018 111079
test-case 4: -T5, WAL 1024 bytes
clients HEAD PATCHED
1 1405 1395
2 1638 1607
4 3176 3207
8 6271 6024
16 11653 11103
32 20530 20260
64 34313 32367
128 55939 52079
256 74355 76420
512 56506 90983
768 50088 100410
1024 44589 99025
2048 39640 90931
4096 20942 36035
test-case 4: -t1000, WAL 1024 bytes
clients HEAD PATCHED
1 1330 1304
2 1615 1366
4 3117 2667
8 6179 5390
16 10524 10426
32 19819 18620
64 34844 29731
128 52180 48869
256 73284 71396
512 55714 96014
768 49336 108100
1024 46113 102789
2048 44627 104721
4096 44979 106189
test-case 5: -T5, WAL 2048 bytes
clients HEAD PATCHED
1 1407 1377
2 1518 1559
4 2589 2870
8 4883 5493
16 9075 9201
32 15957 16295
64 27471 25029
128 37493 38642
256 46369 45787
512 61755 62836
768 59144 68419
1024 52495 68933
2048 48608 72500
4096 26463 61252
test-case 5: -t1000, WAL 2048 bytes
clients HEAD PATCHED
1 1289 1366
2 1489 1628
4 2960 3036
8 5536 5965
16 9248 10399
32 15770 18140
64 27626 27800
128 36817 39483
256 48533 52105
512 64453 64007
768 59146 64160
1024 57637 61756
2048 59063 62109
4096 58268 61206
test-case 6: -T5, WAL 4096 bytes
clients HEAD PATCHED
1 1322 1325
2 1504 1551
4 2811 2880
8 5330 5159
16 8625 8315
32 12820 13534
64 19737 19965
128 26298 24633
256 34630 29939
512 34382 36669
768 33421 33316
1024 33525 32821
2048 37053 37752
4096 37334 39114
test-case 6: -t1000, WAL 4096 bytes
clients HEAD PATCHED
1 1212 1371
2 1383 1566
4 2858 2967
8 5092 5035
16 8233 8486
32 13353 13678
64 19052 20072
128 24803 24726
256 34065 33139
512 31590 32029
768 31432 31404
1024 31357 31366
2048 31465 31508
4096 32157 32180
test-case 7: -T5, WAL 8192 bytes
clients HEAD PATCHED
1 1287 1233
2 1552 1521
4 2658 2617
8 4680 4532
16 6732 7110
32 9649 9198
64 13276 12042
128 17100 17187
256 17408 17448
512 16595 16358
768 16599 16500
1024 16975 17300
2048 19073 19137
4096 21368 21735
test-case 7: -t1000, WAL 8192 bytes
clients HEAD PATCHED
1 1144 1190
2 1414 1395
4 2618 2438
8 4645 4485
16 6766 7001
32 9620 9804
64 12943 13023
128 15904 17148
256 16645 16035
512 15800 15796
768 15788 15810
1024 15814 15817
2048 17775 17771
4096 31715 31682
Looking at the patch.. LWLockConflictsWithVar() and
LWLockReleaseClearVar() are the trivial bits. These are OK.
Hm, the crux of the patch is avoiding LWLock's waitlist lock for
reading/writing the lock variable. Essentially, they are important
bits.
+ * NB: Note the use of pg_atomic_exchange_u64 as opposed to just
+ * pg_atomic_write_u64 to update the value. Since pg_atomic_exchange_u64 is
+ * a full barrier, we're guaranteed that the subsequent shared memory
+ * reads/writes, if any, happen after we reset the lock variable.
This mentions that the subsequent read/write operations are safe, so
this refers to anything happening after the variable is reset. As
a full barrier, should be also mention that this is also ordered with
respect to anything that the caller did before clearing the variable?
From this perspective using pg_atomic_exchange_u64() makes sense to me
in LWLockReleaseClearVar().
Wordsmithed that comment a bit.
+ * XXX: Use of a spinlock at the beginning of this function to read
+ * current insert position implies memory ordering. That means that
+ * the immediate loads and stores to shared memory (for instance,
+ * in LWLockUpdateVar called via LWLockWaitForVar) don't need an
+ * explicit memory barrier as far as the current usage is
+ * concerned. But that might not be safe in general.
 */
What's the part where this is not safe? Based on what I see, this
code path is safe because of the previous spinlock. This is the same
comment as at the beginning of LWLockConflictsWithVar(). Is that
something that we ought to document at the top of LWLockWaitForVar()
as well? We have one caller of this function currently, but there may
be more in the future.
'But that might not be safe in general' applies only for
LWLockWaitForVar not for WaitXLogInsertionsToFinish for sure. My bad.
If there's another caller for LWLockWaitForVar without any spinlock,
that's when the LWLockWaitForVar needs to have an explicit memory
barrier.
Per a comment upthread
/messages/by-id/20221205183007.s72oygp63s43dqyz@awork3.anarazel.de,
I had a note in WaitXLogInsertionsToFinish before LWLockWaitForVar. I
now have modified that comment.
- * you're about to write out.
+ * you're about to write out. Using an atomic variable for insertingAt avoids
+ * taking any explicit lock for reads and writes.
Hmm. Not sure that we need to comment at all.
Removed. I was being verbose. One who understands pg_atomic_uint64 can
get to that point easily.
-LWLockUpdateVar(LWLock *lock, uint64 *valptr, uint64 val)
+LWLockUpdateVar(LWLock *lock, pg_atomic_uint64 *valptr, uint64 val)
[...]
 Assert(pg_atomic_read_u32(&lock->state) & LW_VAL_EXCLUSIVE);
- /* Update the lock's value */
- *valptr = val;
The sensitive change is in LWLockUpdateVar(). I am not completely
sure to understand this removal, though. Does that influence the case
where there are waiters?
I'll send about this in a follow-up email to not overload this
response with too much data.
Another thing I was wondering about: how much does the fast-path used
in LWLockUpdateVar() influence the performance numbers? Am I right to
guess that it counts for most of the gain seen?
I'll send about this in a follow-up email to not overload this
response with too much data.
Or could it be that
the removal of the spin lock in
LWLockConflictsWithVar()/LWLockWaitForVar() is the point that has the
highest effect?
I'll send about this in a follow-up email to not overload this
response with too much data.
I've addressed the above review comments and attached the v6 patch.
[1]
test-case 1: -T5, WAL ~16 bytes HEAD:
+ 81.52% 0.03% postgres [.] __vstrfmon_l_internal
+ 81.52% 0.00% postgres [.] startup_hacks
+ 81.52% 0.00% postgres [.] PostmasterMain
+ 63.95% 1.01% postgres [.] LWLockWaitListLock
+ 61.93% 0.02% postgres [.] WaitXLogInsertionsToFinish
+ 61.89% 0.05% postgres [.] LWLockWaitForVar
+ 48.83% 48.33% postgres [.] pg_atomic_read_u32_impl
+ 48.78% 0.40% postgres [.] pg_atomic_read_u32
+ 43.19% 0.12% postgres [.] LWLockConflictsWithVar
+ 19.81% 0.01% postgres [.] LWLockQueueSelf
+ 7.86% 2.46% postgres [.] perform_spin_delay
+ 6.14% 6.06% postgres [.] spin_delay
+ 5.82% 0.01% postgres [.] pg_atomic_fetch_or_u32
+ 5.81% 5.76% postgres [.] pg_atomic_fetch_or_u32_impl
+ 4.00% 0.01% postgres [.] XLogInsert
+ 3.93% 0.03% postgres [.] XLogInsertRecord
+ 2.13% 0.02% postgres [.] LWLockRelease
+ 2.10% 0.03% postgres [.] LWLockAcquire
+ 1.92% 0.00% postgres [.] LWLockDequeueSelf
+ 1.87% 0.01% postgres [.] WALInsertLockAcquire
+ 1.68% 0.04% postgres [.] LWLockAcquireOrWait
+ 1.64% 0.01% postgres [.] pg_analyze_and_rewrite_fixedparams
+ 1.62% 0.00% postgres [.] WALInsertLockRelease
+ 1.62% 0.00% postgres [.] LWLockReleaseClearVar
+ 1.55% 0.01% postgres [.] parse_analyze_fixedparams
+ 1.51% 0.00% postgres [.] transformTopLevelStmt
+ 1.50% 0.00% postgres [.] transformOptionalSelectInto
+ 1.50% 0.01% postgres [.] transformStmt
+ 1.47% 0.02% postgres [.] transformSelectStmt
+ 1.29% 0.01% postgres [.] XactLogCommitRecord
test-case 1: -T5, WAL ~16 bytes PATCHED:
+ 74.49% 0.04% postgres [.] __vstrfmon_l_internal
+ 74.49% 0.00% postgres [.] startup_hacks
+ 74.49% 0.00% postgres [.] PostmasterMain
+ 51.60% 0.01% postgres [.] finish_xact_command
+ 51.60% 0.02% postgres [.] CommitTransactionCommand
+ 51.37% 0.03% postgres [.] CommitTransaction
+ 49.43% 0.05% postgres [.] RecordTransactionCommit
+ 46.55% 0.05% postgres [.] XLogFlush
+ 46.37% 0.85% postgres [.] LWLockWaitListLock
+ 43.79% 0.02% postgres [.] LWLockQueueSelf
+ 38.87% 0.03% postgres [.] WaitXLogInsertionsToFinish
+ 38.79% 0.11% postgres [.] LWLockWaitForVar
+ 34.99% 34.49% postgres [.] pg_atomic_read_u32_impl
+ 34.93% 0.35% postgres [.] pg_atomic_read_u32
+ 6.99% 2.12% postgres [.] perform_spin_delay
+ 6.64% 0.01% postgres [.] XLogInsert
+ 6.54% 0.06% postgres [.] XLogInsertRecord
+ 6.26% 0.08% postgres [.] LWLockAcquireOrWait
+ 5.31% 5.22% postgres [.] spin_delay
+ 4.23% 0.04% postgres [.] LWLockRelease
+ 3.74% 0.01% postgres [.] pg_atomic_fetch_or_u32
+ 3.73% 3.68% postgres [.] pg_atomic_fetch_or_u32_impl
+ 3.33% 0.06% postgres [.] LWLockAcquire
+ 2.97% 0.01% postgres [.] pg_plan_queries
+ 2.95% 0.01% postgres [.] WALInsertLockAcquire
+ 2.94% 0.02% postgres [.] planner
+ 2.94% 0.01% postgres [.] pg_plan_query
+ 2.92% 0.01% postgres [.] LWLockDequeueSelf
+ 2.89% 0.04% postgres [.] standard_planner
+ 2.81% 0.00% postgres [.] WALInsertLockRelease
+ 2.80% 0.00% postgres [.] LWLockReleaseClearVar
+ 2.38% 0.07% postgres [.] subquery_planner
+ 2.35% 0.01% postgres [.] XactLogCommitRecord
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Mon, May 8, 2023 at 5:57 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
On Mon, Apr 10, 2023 at 9:38 AM Michael Paquier <michael@paquier.xyz> wrote:
-LWLockUpdateVar(LWLock *lock, uint64 *valptr, uint64 val)
+LWLockUpdateVar(LWLock *lock, pg_atomic_uint64 *valptr, uint64 val)
[...]
 Assert(pg_atomic_read_u32(&lock->state) & LW_VAL_EXCLUSIVE);
- /* Update the lock's value */
- *valptr = val;
The sensitive change is in LWLockUpdateVar(). I am not completely
sure to understand this removal, though. Does that influence the case
where there are waiters?
I'll send about this in a follow-up email to not overload this
response with too much data.
It helps the case when there are no waiters. IOW, it updates the value
without the waitlist lock when there are no waiters, so there is no
extra waitlist lock acquisition/release just to update the value. In
turn, it helps any other backend wanting to flush the WAL that is
looking for the newly updated value of insertingAt in
WaitXLogInsertionsToFinish(); the flushing backend can get the new
value faster.
Another thing I was wondering about: how much does the fast-path used
in LWLockUpdateVar() influence the performance numbers? Am I right to
guess that it counts for most of the gain seen?
I'll send about this in a follow-up email to not overload this
response with too much data.
The fastpath exit in LWLockUpdateVar() doesn't seem to influence the
results much, see below results. However, it avoids waitlist lock
acquisition when there are no waiters.
test-case 1: -T5, WAL ~16 bytes
clients HEAD PATCHED with fastpath PATCHED no fast path
1 1482 1486 1457
2 1617 1620 1569
4 3174 3233 3031
8 6136 6365 5725
16 12566 12269 11685
32 24284 23621 23177
64 50135 45528 46653
128 94903 89791 89103
256 82289 152915 152835
512 62498 138838 142084
768 57083 125074 126768
1024 51308 113593 115930
2048 41084 88764 85110
4096 19939 42257 43917
Or could it be that
the removal of the spin lock in
LWLockConflictsWithVar()/LWLockWaitForVar() is the point that has the
highest effect?
I'll send about this in a follow-up email to not overload this
response with too much data.
Out of 3 functions that got rid of waitlist lock
LWLockConflictsWithVar/LWLockWaitForVar, LWLockUpdateVar,
LWLockReleaseClearVar, perf reports tell that the biggest gain (for
the use-cases that I've tried) is for
LWLockConflictsWithVar/LWLockWaitForVar:
test-case 1: -T5, WAL ~16 bytes
HEAD:
+ 61.89% 0.05% postgres [.] LWLockWaitForVar
+ 43.19% 0.12% postgres [.] LWLockConflictsWithVar
+ 1.62% 0.00% postgres [.] LWLockReleaseClearVar
PATCHED:
+ 38.79% 0.11% postgres [.] LWLockWaitForVar
0.40% 0.02% postgres [.] LWLockConflictsWithVar
+ 2.80% 0.00% postgres [.] LWLockReleaseClearVar
test-case 6: -T5, WAL 4096 bytes
HEAD:
+ 29.66% 0.07% postgres [.] LWLockWaitForVar
+ 20.94% 0.08% postgres [.] LWLockConflictsWithVar
0.19% 0.03% postgres [.] LWLockUpdateVar
PATCHED:
+ 3.95% 0.08% postgres [.] LWLockWaitForVar
0.19% 0.03% postgres [.] LWLockConflictsWithVar
+ 1.73% 0.00% postgres [.] LWLockReleaseClearVar
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Mon, May 08, 2023 at 05:57:09PM +0530, Bharath Rupireddy wrote:
I ran performance tests on the patch with different use-cases. Clearly
the patch reduces burden on LWLock's waitlist lock (evident from perf
reports [1]). However, to see visible impact in the output, the txns
must be generating small (between 16 bytes to 2 KB) amounts of WAL in
a highly concurrent manner, check the results below (FWIW, I've zipped
and attached perf images for better illustration along with test
setup).
When the txns are generating a small amount of WAL i.e. between 16
bytes to 2 KB in a highly concurrent manner, the benefit is clearly
visible in the TPS more than 2.3X improvement. When the txns are
generating more WAL i.e. more than 2 KB, the gain from reduced burden
on waitlist lock is offset by increase in the wait/release for WAL
insertion locks and no visible benefit is seen.
As the amount of WAL each txn generates increases, it looks like the
benefit gained from reduced burden on waitlist lock is offset by
increase in the wait for WAL insertion locks.
Nice.
test-case 1: -T5, WAL ~16 bytes
test-case 1: -t1000, WAL ~16 bytes
I wonder if it's worth doing a couple of long-running tests, too.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Mon, May 08, 2023 at 04:04:10PM -0700, Nathan Bossart wrote:
On Mon, May 08, 2023 at 05:57:09PM +0530, Bharath Rupireddy wrote:
test-case 1: -T5, WAL ~16 bytes
test-case 1: -t1000, WAL ~16 bytes
I wonder if it's worth doing a couple of long-running tests, too.
Yes, 5s or 1000 transactions per client is too small, though it shows
that things are going in the right direction.
(Will reply to the rest in a bit..)
--
Michael