walsender performance regression due to logical decoding on standby changes
Hi,
Unfortunately I have found the following commit to have caused a performance
regression:
commit e101dfac3a53c20bfbf1ca85d30a368c2954facf
Author: Andres Freund <andres@anarazel.de>
Date: 2023-04-08 00:24:24 -0700
For cascading replication, wake physical and logical walsenders separately
Physical walsenders can't send data until it's been flushed; logical
walsenders can't decode and send data until it's been applied. On the
standby, the WAL is flushed first, which will only wake up physical
walsenders; and then applied, which will only wake up logical
walsenders.
Previously, all walsenders were awakened when the WAL was flushed. That
was fine for logical walsenders on the primary; but on the standby the
flushed WAL would have been not applied yet, so logical walsenders were
awakened too early.
Per idea from Jeff Davis and Amit Kapila.
Author: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com>
Reviewed-By: Jeff Davis <pgsql@j-davis.com>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: /messages/by-id/CAA4eK1+zO5LUeisabX10c81LU-fWMKO4M9Wyg1cdkbW7Hqh6vQ@mail.gmail.com
The problem is that, on a standby, after the change - as needed for the
approach to work - the call to WalSndWakeup() in ApplyWalRecord() happens for
every record, instead of only happening when the timeline is changed (or WAL
is flushed or ...).
WalSndWakeup() iterates over all walsender slots, regardless of whether they
are in use. For each of the walsender slots it acquires a spinlock.
When replaying a lot of small-ish WAL records I found the startup process to
spend the majority of the time in WalSndWakeup(). I've not measured it very
precisely yet, but the overhead is significant (~35% slowdown), even with the
default max_wal_senders. If that's increased substantially, it obviously gets
worse.
The only saving grace is that this is not an issue on the primary.
I unfortunately spent less time on this commit of the
logical-decoding-on-standby series than on the others. There were so many
other senior contributors discussing it that I "relaxed" a bit too much.
I don't think the approach of not having any sort of "registry" of whether
anybody is waiting for the replay position to be updated is
feasible. Iterating over all walsender slots is just too expensive -
WalSndWakeup() shows up even if I remove the spinlock (which we likely could,
especially when just checking if the walsender is connected).
My current guess is that mis-using a condition variable is the best bet. I
think it should work to use ConditionVariablePrepareToSleep() before a
WalSndWait(), and then ConditionVariableCancelSleep(). I.e. to never use
ConditionVariableSleep(). The latch set from ConditionVariableBroadcast()
would still cause the necessary wakeup.
Greetings,
Andres Freund
On Tue, 2023-05-09 at 12:02 -0700, Andres Freund wrote:
I don't think the approach of not having any sort of "registry" of
whether
anybody is waiting for the replay position to be updated is
feasible. Iterating over all walsenders slots is just too expensive -
Would it work to use a shared counter for the waiters (or, two
counters, one for physical and one for logical), and just early exit if
the count is zero?
Regards,
Jeff Davis
Hi,
On 2023-05-09 13:38:24 -0700, Jeff Davis wrote:
On Tue, 2023-05-09 at 12:02 -0700, Andres Freund wrote:
I don't think the approach of not having any sort of "registry" of
whether
anybody is waiting for the replay position to be updated is
feasible. Iterating over all walsenders slots is just too expensive -
Would it work to use a shared counter for the waiters (or, two
counters, one for physical and one for logical), and just early exit if
the count is zero?
That doesn't really fix the problem - once you have a single walsender
connected, performance is bad again.
Greetings,
Andres Freund
On Wed, May 10, 2023 at 12:33 AM Andres Freund <andres@anarazel.de> wrote:
Unfortunately I have found the following commit to have caused a performance
regression:
commit e101dfac3a53c20bfbf1ca85d30a368c2954facf
The problem is that, on a standby, after the change - as needed to for the
approach to work - the call to WalSndWakeup() in ApplyWalRecord() happens for
every record, instead of only happening when the timeline is changed (or WAL
is flushed or ...).
WalSndWakeup() iterates over all walsender slots, regardless of whether in
use. For each of the walsender slots it acquires a spinlock.
When replaying a lot of small-ish WAL records I found the startup process to
spend the majority of the time in WalSndWakeup(). I've not measured it very
precisely yet, but the overhead is significant (~35% slowdown), even with the
default max_wal_senders. If that's increased substantially, it obviously gets
worse.
I played it with a simple primary -> standby1 -> standby2 setup. I ran
a pgbench script [1] on primary and counted the number of times
WalSndWakeup() gets called from ApplyWalRecord() and the number of
times spinlock is acquired/released in WalSndWakeup(). It's a whopping
21 million times spinlock is acquired/released on the standby 1 and
standby 2 for just a < 5min of pgbench run on the primary:
standby 1:
2023-05-10 05:32:43.249 UTC [1595600] LOG: FOO WalSndWakeup() in
ApplyWalRecord() was called 2176352 times
2023-05-10 05:32:43.249 UTC [1595600] LOG: FOO spinlock
acquisition/release count in WalSndWakeup() is 21763530
standby 2:
2023-05-10 05:32:43.249 UTC [1595625] LOG: FOO WalSndWakeup() in
ApplyWalRecord() was called 2176352 times
2023-05-10 05:32:43.249 UTC [1595625] LOG: FOO spinlock
acquisition/release count in WalSndWakeup() is 21763530
In this case, there is no timeline switch or logical decoding on
the standby or such, but WalSndWakeup() gets called because the
standby can't make out whether a slot is for logical or physical
replication unless the spinlock is acquired. Before e101dfac3a,
WalSndWakeup() was getting called only when there was a timeline
switch.
The only saving grace is that this is not an issue on the primary.
Yeah.
I don't think the approach of not having any sort of "registry" of whether
anybody is waiting for the replay position to be updated is
feasible. Iterating over all walsenders slots is just too expensive -
WalSndWakeup() shows up even if I remove the spinlock (which we likely could,
especially when just checking if the the walsender is connected).
Right.
My current guess is that mis-using a condition variable is the best bet. I
think it should work to use ConditionVariablePrepareToSleep() before a
WalSndWait(), and then ConditionVariableCancelSleep(). I.e. to never use
ConditionVariableSleep(). The latch set from ConditionVariableBroadcast()
would still cause the necessary wakeup.
How about something like the attached? Recovery and subscription tests
don't complain with the patch.
[1]:
./pg_ctl -D data -l logfile stop
./pg_ctl -D sbdata1 -l logfile1 stop
./pg_ctl -D sbdata2 -l logfile2 stop
cd /home/ubuntu/postgres
rm -rf inst
make distclean
./configure --prefix=$PWD/inst/ CFLAGS="-O3" > install.log && make -j
8 install > install.log 2>&1 &
primary -> standby1 -> standby2
free -m
sudo su -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
free -m
cd inst/bin
./initdb -D data
mkdir archived_wal
cat << EOF >> data/postgresql.conf
shared_buffers = '8GB'
wal_buffers = '1GB'
max_wal_size = '16GB'
max_connections = '5000'
archive_mode = 'on'
archive_command='cp %p /home/ubuntu/postgres/inst/bin/archived_wal/%f'
EOF
./pg_ctl -D data -l logfile start
./pg_basebackup -D sbdata1
./psql -c "select pg_create_physical_replication_slot('sb1_slot',
true, false)" postgres
cat << EOF >> sbdata1/postgresql.conf
port=5433
primary_conninfo='host=localhost port=5432 dbname=postgres user=ubuntu
application_name=sb1'
primary_slot_name='sb1_slot'
restore_command='cp /home/ubuntu/postgres/inst/bin/archived_wal/%f %p'
EOF
touch sbdata1/standby.signal
./pg_ctl -D sbdata1 -l logfile1 start
./pg_basebackup -D sbdata2 -p 5433
./psql -p 5433 -c "select
pg_create_physical_replication_slot('sb2_slot', true, false)" postgres
cat << EOF >> sbdata2/postgresql.conf
port=5434
primary_conninfo='host=localhost port=5433 dbname=postgres user=ubuntu
application_name=sb2'
primary_slot_name='sb2_slot'
restore_command='cp /home/ubuntu/postgres/inst/bin/archived_wal/%f %p'
EOF
touch sbdata2/standby.signal
./pg_ctl -D sbdata2 -l logfile2 start
./psql -c "select pid, backend_type from pg_stat_activity where
backend_type in ('walreceiver', 'walsender');" postgres
./psql -p 5433 -c "select pid, backend_type from pg_stat_activity
where backend_type in ('walreceiver', 'walsender');" postgres
./psql -p 5434 -c "select pid, backend_type from pg_stat_activity
where backend_type in ('walreceiver', 'walsender');" postgres
ulimit -S -n 5000
./pgbench --initialize --scale=300 postgres
for c in 1 2 4 8 16 32 64 128 256 512 768 1024 2048 4096; do echo -n
"$c ";./pgbench -n -M prepared -U ubuntu postgres -b tpcb-like -c$c
-j$c -T5 2>&1|grep '^tps'|awk '{print $3}';done
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v1-0001-Use-conditional-variable-to-wake-up-walsenders.patch (application/octet-stream, +11 -2)
Hi,
On 5/9/23 11:00 PM, Andres Freund wrote:
Hi,
On 2023-05-09 13:38:24 -0700, Jeff Davis wrote:
On Tue, 2023-05-09 at 12:02 -0700, Andres Freund wrote:
I don't think the approach of not having any sort of "registry" of
whether
anybody is waiting for the replay position to be updated is
feasible. Iterating over all walsenders slots is just too expensive -
Would it work to use a shared counter for the waiters (or, two
counters, one for physical and one for logical), and just early exit if
the count is zero?
That doesn't really fix the problem - once you have a single walsender
connected, performance is bad again.
Just to clarify, do you mean that if there is only one remaining active walsender that, say,
has been located at slot n, then we’d still have to loop from 0 to n in WalSndWakeup()?
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,
On 5/10/23 8:36 AM, Bharath Rupireddy wrote:
On Wed, May 10, 2023 at 12:33 AM Andres Freund <andres@anarazel.de> wrote:
Unfortunately I have found the following commit to have caused a performance
regression:
commit e101dfac3a53c20bfbf1ca85d30a368c2954facf
The problem is that, on a standby, after the change - as needed to for the
approach to work - the call to WalSndWakeup() in ApplyWalRecord() happens for
every record, instead of only happening when the timeline is changed (or WAL
is flushed or ...).
WalSndWakeup() iterates over all walsender slots, regardless of whether in
use. For each of the walsender slots it acquires a spinlock.
When replaying a lot of small-ish WAL records I found the startup process to
spend the majority of the time in WalSndWakeup(). I've not measured it very
precisely yet, but the overhead is significant (~35% slowdown), even with the
default max_wal_senders. If that's increased substantially, it obviously gets
worse.
Thanks Andres for the call out! I do agree that this is a concern.
The only saving grace is that this is not an issue on the primary.
Yeah.
+1
My current guess is that mis-using a condition variable is the best bet. I
think it should work to use ConditionVariablePrepareToSleep() before a
WalSndWait(), and then ConditionVariableCancelSleep(). I.e. to never use
ConditionVariableSleep(). The latch set from ConditionVariableBroadcast()
would still cause the necessary wakeup.
Yeah, I think that "mis-using" a condition variable is a valid option. Unless I'm missing
something, I don't think there is anything wrong with using a CV that way (aka not using
ConditionVariableTimedSleep() or ConditionVariableSleep() in this particular case).
How about something like the attached? Recovery and subscription tests
don't complain with the patch.
Thanks Bharath for looking at it!
I launched a full Cirrus CI test with it but it failed on one environment (did not look in details,
just sharing this here): https://cirrus-ci.com/task/6570140767092736
Also I have a few comments:
@@ -1958,7 +1959,7 @@ ApplyWalRecord(XLogReaderState *xlogreader, XLogRecord *record, TimeLineID *repl
* ------
*/
if (AllowCascadeReplication())
- WalSndWakeup(switchedTLI, true);
+ ConditionVariableBroadcast(&WalSndCtl->cv);
I think the comment above this change needs to be updated.
@@ -3368,9 +3370,13 @@ WalSndWait(uint32 socket_events, long timeout, uint32 wait_event)
WaitEvent event;
ModifyWaitEvent(FeBeWaitSet, FeBeWaitSetSocketPos, socket_events, NULL);
+
+ ConditionVariablePrepareToSleep(&WalSndCtl->cv);
if (WaitEventSetWait(FeBeWaitSet, timeout, &event, 1, wait_event) == 1 &&
(event.events & WL_POSTMASTER_DEATH))
proc_exit(1);
+
+ ConditionVariableCancelSleep();
Maybe worth updating the comment above WalSndWait() too? (to explain why CV handling is part of it).
@@ -108,6 +109,8 @@ typedef struct
*/
bool sync_standbys_defined;
+ ConditionVariable cv;
Worth giving it a more meaningful name? (to give a clue as to when it is used)
I think we also need to update the comment above WalSndWakeup():
"
* For cascading replication we need to wake up physical walsenders separately
* from logical walsenders (see the comment before calling WalSndWakeup() in
* ApplyWalRecord() for more details).
"
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Wednesday, May 10, 2023 2:36 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
On Wed, May 10, 2023 at 12:33 AM Andres Freund <andres@anarazel.de>
wrote:Unfortunately I have found the following commit to have caused a
performance
regression:
commit e101dfac3a53c20bfbf1ca85d30a368c2954facf
The problem is that, on a standby, after the change - as needed to for the
approach to work - the call to WalSndWakeup() in ApplyWalRecord() happens for
every record, instead of only happening when the timeline is changed (or
WAL
is flushed or ...).
WalSndWakeup() iterates over all walsender slots, regardless of whether in
use. For each of the walsender slots it acquires a spinlock.
When replaying a lot of small-ish WAL records I found the startup process to
spend the majority of the time in WalSndWakeup(). I've not measured it very
precisely yet, but the overhead is significant (~35% slowdown), even with the
default max_wal_senders. If that's increased substantially, it obviously gets
worse.
I played it with a simple primary -> standby1 -> standby2 setup. I ran
a pgbench script [1] on primary and counted the number of times
WalSndWakeup() gets called from ApplyWalRecord() and the number of
times spinlock is acquired/released in WalSndWakeup(). It's a whopping
21 million times spinlock is acquired/released on the standby 1 and
standby 2 for just a < 5min of pgbench run on the primary:
standby 1:
2023-05-10 05:32:43.249 UTC [1595600] LOG: FOO WalSndWakeup() in
ApplyWalRecord() was called 2176352 times
2023-05-10 05:32:43.249 UTC [1595600] LOG: FOO spinlock
acquisition/release count in WalSndWakeup() is 21763530
standby 2:
2023-05-10 05:32:43.249 UTC [1595625] LOG: FOO WalSndWakeup() in
ApplyWalRecord() was called 2176352 times
2023-05-10 05:32:43.249 UTC [1595625] LOG: FOO spinlock
acquisition/release count in WalSndWakeup() is 21763530
In this case, there is no timeline switch or no logical decoding on
the standby or such, but WalSndWakeup() gets called because the
standby can't make out if the slot is for logical or physical
replication unless spinlock is acquired. Before e101dfac3a,
WalSndWakeup() was getting called only when there was a timeline
switch.
The only saving grace is that this is not an issue on the primary.
Yeah.
I don't think the approach of not having any sort of "registry" of whether
anybody is waiting for the replay position to be updated is
feasible. Iterating over all walsenders slots is just too expensive -
WalSndWakeup() shows up even if I remove the spinlock (which we likely could,
especially when just checking if the the walsender is connected).
Right.
My current guess is that mis-using a condition variable is the best bet. I
think it should work to use ConditionVariablePrepareToSleep() before a
WalSndWait(), and then ConditionVariableCancelSleep(). I.e. to never use
ConditionVariableSleep(). The latch set from ConditionVariableBroadcast()
would still cause the necessary wakeup.
How about something like the attached? Recovery and subscription tests
don't complain with the patch.
Thanks for the patch. I noticed one place where the logic is different from before and
just to confirm:
if (AllowCascadeReplication())
- WalSndWakeup(switchedTLI, true);
+ ConditionVariableBroadcast(&WalSndCtl->cv);
After the change, we wake up physical walsenders regardless of the switchedTLI flag.
Is this intentional? If so, I think it would be better to update the comments above this.
Best Regards,
Hou zj
On Wed, May 10, 2023 at 3:41 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
On Wednesday, May 10, 2023 2:36 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
My current guess is that mis-using a condition variable is the best bet. I
think it should work to use ConditionVariablePrepareToSleep() before a
WalSndWait(), and then ConditionVariableCancelSleep(). I.e. to never use
ConditionVariableSleep(). The latch set from ConditionVariableBroadcast()
would still cause the necessary wakeup.
How about something like the attached? Recovery and subscription tests
don't complain with the patch.
Thanks for the patch. I noticed one place where the logic is different from before and
just to confirm:
if (AllowCascadeReplication())
- WalSndWakeup(switchedTLI, true);
+ ConditionVariableBroadcast(&WalSndCtl->cv);
After the change, we wakeup physical walsender regardless of switchedTLI flag.
Is this intentional ? if so, I think It would be better to update the comments above this.
This raises the question of whether we need this condition variable
logic for physical walsenders?
--
With Regards,
Amit Kapila.
On Wed, May 10, 2023 at 7:33 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, May 10, 2023 at 3:41 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
On Wednesday, May 10, 2023 2:36 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
My current guess is that mis-using a condition variable is the best bet. I
think it should work to use ConditionVariablePrepareToSleep() before a
WalSndWait(), and then ConditionVariableCancelSleep(). I.e. to never use
ConditionVariableSleep(). The latch set from ConditionVariableBroadcast()
would still cause the necessary wakeup.
How about something like the attached? Recovery and subscription tests
don't complain with the patch.
Thanks for the patch. I noticed one place where the logic is different from before and
just to confirm:
if (AllowCascadeReplication())
- WalSndWakeup(switchedTLI, true);
+ ConditionVariableBroadcast(&WalSndCtl->cv);
After the change, we wakeup physical walsender regardless of switchedTLI flag.
Is this intentional ? if so, I think It would be better to update the comments above this.
This raises the question of whether we need this condition variable
logic for physical walsenders?
It sounds like a good idea. We can have two condition variables for
logical and physical walsenders, and selectively wake up walsenders
sleeping on the condition variables. It should work, though it seems
like a bit of a hack.
Regards,
[1]: /messages/by-id/2d314c22b9e03415aa1c7d8fd1f698dae60effa7.camel@j-davis.com
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Wednesday, May 10, 2023 3:03 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
Unfortunately I have found the following commit to have caused a
performance
regression:
commit e101dfac3a53c20bfbf1ca85d30a368c2954facf
Author: Andres Freund <andres@anarazel.de>
Date: 2023-04-08 00:24:24 -0700
For cascading replication, wake physical and logical walsenders separately
Physical walsenders can't send data until it's been flushed; logical
walsenders can't decode and send data until it's been applied. On the
standby, the WAL is flushed first, which will only wake up physical
walsenders; and then applied, which will only wake up logical
walsenders.
Previously, all walsenders were awakened when the WAL was flushed. That
was fine for logical walsenders on the primary; but on the standby the
flushed WAL would have been not applied yet, so logical walsenders were
awakened too early.
Per idea from Jeff Davis and Amit Kapila.
Author: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com>
Reviewed-By: Jeff Davis <pgsql@j-davis.com>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: /messages/by-id/CAA4eK1+zO5LUeisabX10c81LU-fWMKO4M9Wyg1cdkbW7Hqh6vQ@mail.gmail.com
The problem is that, on a standby, after the change - as needed to for the
approach to work - the call to WalSndWakeup() in ApplyWalRecord() happens for
every record, instead of only happening when the timeline is changed (or WAL
is flushed or ...).
WalSndWakeup() iterates over all walsender slots, regardless of whether in
use. For each of the walsender slots it acquires a spinlock.
When replaying a lot of small-ish WAL records I found the startup process to
spend the majority of the time in WalSndWakeup(). I've not measured it very
precisely yet, but the overhead is significant (~35% slowdown), even with the
default max_wal_senders. If that's increased substantially, it obviously gets
worse.
I did some simple tests for this to see the performance impact on
streaming replication, just sharing it here for reference.
1) Sync primary-standby setup: load data on the primary and count the time spent on
replication. The degradation becomes more obvious as the value of max_wal_senders
increases.
max_wal_senders before(ms) after(ms) degradation
100 13394.4013 14141.2615 5.58%
200 13280.8507 14597.1173 9.91%
300 13535.0232 16735.7379 23.65%
2) Similar to 1), but count the time that the standby startup process spent on
replaying WAL (via gprof).
10 senders
===========
before
% cumulative self self total
time seconds seconds calls s/call s/call name
4.12 0.45 0.11 1 0.11 2.46 PerformWalRecovery
after
% cumulative self self total
time seconds seconds calls s/call s/call name
17.99 0.59 0.59 4027383 0.00 0.00 WalSndWakeup
8.23 0.86 0.27 1 0.27 3.11 PerformWalRecovery
100 senders
===========
before
% cumulative self self total
time seconds seconds calls s/call s/call name
5.56 0.36 0.18 1 0.18 2.91 PerformWalRecovery
after
% cumulative self self total
time seconds seconds calls s/call s/call name
64.65 4.39 4.39 4027383 0.00 0.00 WalSndWakeup
2.95 4.59 0.20 1 0.20 6.62 PerformWalRecovery
Will test after applying the latest patch in this thread later.
Best Regards,
Hou zj
On Wed, May 10, 2023 at 3:23 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:
My current guess is that mis-using a condition variable is the best bet. I
think it should work to use ConditionVariablePrepareToSleep() before a
WalSndWait(), and then ConditionVariableCancelSleep(). I.e. to never use
ConditionVariableSleep(). The latch set from ConditionVariableBroadcast()
would still cause the necessary wakeup.
How about something like the attached? Recovery and subscription tests
don't complain with the patch.
I launched a full Cirrus CI test with it but it failed on one environment (did not look in details,
just sharing this here): https://cirrus-ci.com/task/6570140767092736
Yeah, v1 had ConditionVariableInit() such that the CV was getting
initialized for every backend as opposed to just once after the WAL
sender shmem was created.
Also I have a few comments:
Indeed, v1 was a WIP patch. Please have a look at the attached v2
patch, which has comments and passing CI runs on all platforms -
https://github.com/BRupireddy/postgres/tree/optimize_walsender_wakeup_logic_v2.
On Wed, May 10, 2023 at 3:41 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
if (AllowCascadeReplication())
- WalSndWakeup(switchedTLI, true);
+ ConditionVariableBroadcast(&WalSndCtl->cv);
After the change, we wakeup physical walsender regardless of switchedTLI flag.
Is this intentional ? if so, I think It would be better to update the comments above this.
That's not the case with the attached v2 patch. Please have a look.
On Thu, May 11, 2023 at 10:27 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
We can have two condition variables for
logical and physical walsenders, and selectively wake up walsenders
sleeping on the condition variables. It should work, it seems like
much of a hack, though.
Andres rightly put it - 'mis-using' the CV infrastructure. It is simple,
it works, and it keeps WalSndWakeup() cheap, solving the performance
regression.
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v2-0001-Optimize-walsender-wake-up-logic-with-Conditional.patch (application/x-patch, +55 -25)
On Friday, May 12, 2023 7:58 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
On Wed, May 10, 2023 at 3:23 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:
My current guess is that mis-using a condition variable is the best
bet. I think it should work to use
ConditionVariablePrepareToSleep() before a WalSndWait(), and then
ConditionVariableCancelSleep(). I.e. to never use
ConditionVariableSleep(). The latch set from
ConditionVariableBroadcast() would still cause the necessary wakeup.
How about something like the attached? Recovery and subscription
tests don't complain with the patch.
I launched a full Cirrus CI test with it but it failed on one
environment (did not look in details, just sharing this here):
https://cirrus-ci.com/task/6570140767092736
Yeah, v1 had ConditionVariableInit() such that the CV was getting initialized for
every backend as opposed to just once after the WAL sender shmem was
created.
Also I have a few comments:
Indeed, v1 was a WIP patch. Please have a look at the attached v2 patch, which
has comments and passing CI runs on all platforms -
https://github.com/BRupireddy/postgres/tree/optimize_walsender_wakeup_logic_v2.
On Wed, May 10, 2023 at 3:41 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
wrote:
if (AllowCascadeReplication())
- WalSndWakeup(switchedTLI, true);
+ ConditionVariableBroadcast(&WalSndCtl->cv);
After the change, we wakeup physical walsender regardless of switchedTLI
flag.
Is this intentional ? if so, I think It would be better to update the comments
above this.
That's not the case with the attached v2 patch. Please have a look.
Thanks for updating the patch. I did some simple primary->standby replication test for the
patch and can see the degradation doesn't happen in the replication after applying it [1].
One nitpick in the comment:
+ * walsenders. It makes WalSndWakeup() callers life easy.
callers life easy => callers' life easy.
[1]:
max_wal_senders = 100
before regression(ms) after regression(ms) v2 patch(ms)
13394.4013 14141.2615 13455.2543
Compared with before regression 5.58% 0.45%
max_wal_senders = 200
before regression(ms) after regression(ms) v2 patch(ms)
13280.8507 14597.1173 13632.0606
Compared with before regression 9.91% 1.64%
max_wal_senders = 300
before regression(ms) after regression(ms) v2 patch(ms)
13535.0232 16735.7379 13705.7135
Compared with before regression 23.65% 1.26%
Best Regards,
Hou zj
On Fri, May 12, 2023 at 11:58 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
Andres, rightly put it - 'mis-using' CV infrastructure. It is simple,
works, and makes the WalSndWakeup() easy solving the performance
regression.
Yeah, this seems OK, and better than the complicated alternatives. If
one day we want to implement CVs some other way so that this
I-know-that-CVs-are-really-made-out-of-latches abstraction leak
becomes a problem, and we still need this, well then we can make a
separate latch-wait-list thing.
Hi,
On 5/15/23 4:19 AM, Zhijie Hou (Fujitsu) wrote:
On Friday, May 12, 2023 7:58 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
On Wed, May 10, 2023 at 3:23 PM Drouvot, Bertrand
That's not the case with the attached v2 patch. Please have a look.
Thanks for V2! It does look good to me and I like the fact that
WalSndWakeup() does not need to loop over all the walsender slots
anymore (for both the physical and logical cases).
Thanks for updating the patch. I did some simple primary->standby replication test for the
patch and can see the degradation doesn't happen in the replication after applying it[1].
Thanks for the performance regression measurement!
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Mon, May 15, 2023 at 1:49 PM Thomas Munro <thomas.munro@gmail.com> wrote:
On Fri, May 12, 2023 at 11:58 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
Andres, rightly put it - 'mis-using' CV infrastructure. It is simple,
works, and makes the WalSndWakeup() easy solving the performance
regression.
Yeah, this seems OK, and better than the complicated alternatives. If
one day we want to implement CVs some other way so that this
I-know-that-CVs-are-really-made-out-of-latches abstraction leak
becomes a problem, and we still need this, well then we can make a
separate latch-wait-list thing.
+1
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Mon, May 15, 2023 at 6:14 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Mon, May 15, 2023 at 1:49 PM Thomas Munro <thomas.munro@gmail.com> wrote:
On Fri, May 12, 2023 at 11:58 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
Andres, rightly put it - 'mis-using' CV infrastructure. It is simple,
works, and makes the WalSndWakeup() easy solving the performance
regression.
Yeah, this seems OK, and better than the complicated alternatives. If
one day we want to implement CVs some other way so that this
I-know-that-CVs-are-really-made-out-of-latches abstraction leak
becomes a problem, and we still need this, well then we can make a
separate latch-wait-list thing.
+1
Thanks for acknowledging the approach.
On Mon, May 15, 2023 at 2:11 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:
Thanks for V2! It does look good to me and I like the fact that
WalSndWakeup() does not need to loop on all the Walsenders slot
anymore (for both the physical and logical cases).
Indeed, it doesn't have to.
On Mon, May 15, 2023 at 7:50 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
Thanks for updating the patch. I did some simple primary->standby replication test for the
patch and can see the degradation doesn't happen in the replication after applying it [1].
One nitpick in the comment:
+ * walsenders. It makes WalSndWakeup() callers life easy.
callers life easy => callers' life easy.
Changed.
[1]
max_wal_senders = 100
before regression(ms) after regression(ms) v2 patch(ms)
13394.4013 14141.2615 13455.2543
Compared with before regression 5.58% 0.45%
max_wal_senders = 200
before regression(ms) after regression(ms) v2 patch(ms)
13280.8507 14597.1173 13632.0606
Compared with before regression 9.91% 1.64%max_wal_senders = 300
before regression(ms) after regression(ms) v2 patch(ms)
13535.0232 16735.7379 13705.7135
Compared with before regression 23.65% 1.26%
Yes, the numbers with v2 patch look close to where we were before.
Thanks for confirming. Just wondering, where is this extra
0.45%/1.64%/1.26% coming from?
Please find the attached v3 with the review comment addressed.
Do we need to add an open item for this issue in
https://wiki.postgresql.org/wiki/PostgreSQL_16_Open_Items? If yes, can
anyone in this loop add one?
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v3-0001-Optimize-walsender-wake-up-logic-with-Conditional.patch (application/octet-stream, +55 -25)
Hi,
On 5/15/23 4:39 PM, Bharath Rupireddy wrote:
On Mon, May 15, 2023 at 6:14 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Mon, May 15, 2023 at 1:49 PM Thomas Munro <thomas.munro@gmail.com> wrote:
Do we need to add an open item for this issue in
https://wiki.postgresql.org/wiki/PostgreSQL_16_Open_Items? If yes, can
anyone in this loop add one?
I do think we need one for this issue and then just added it.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,
On 2023-05-10 08:39:08 +0200, Drouvot, Bertrand wrote:
On 5/9/23 11:00 PM, Andres Freund wrote:
Hi,
On 2023-05-09 13:38:24 -0700, Jeff Davis wrote:
On Tue, 2023-05-09 at 12:02 -0700, Andres Freund wrote:
I don't think the approach of not having any sort of "registry" of whether
anybody is waiting for the replay position to be updated is feasible.
Iterating over all walsender slots is just too expensive -
Would it work to use a shared counter for the waiters (or, two
counters, one for physical and one for logical), and just early exit if
the count is zero?
That doesn't really fix the problem - once you have a single walsender
connected, performance is bad again.
Just to clarify, do you mean that if there is only one remaining active walsender that, say,
has been located at slot n, then we’d still have to loop from 0 to n in WalSndWakeup()?
I understood Jeff's proposal to just have an early exit if there are no
walsenders connected at all. But yes, even if we stopped iterating after
finding the number of slots we needed to, having to iterate over empty slots
would be an issue.
But TBH, even if we only did work for connected walsenders, I think this would
still be a performance issue. Acquiring O(#connected-walsenders) spinlocks for
every record is just too expensive.
Greetings,
Andres Freund
Hi,
On 2023-05-11 09:42:39 +0000, Zhijie Hou (Fujitsu) wrote:
I did some simple tests for this to see the performance impact on
the streaming replication, just share it here for reference.
1) sync primary-standby setup, load data on primary and count the time spent on
replication. The degradation will be more obvious as the value of max_wal_senders
increases.
FWIW, using syncrep likely under-estimates the overhead substantially, because
that includes a lot of overhead on the WAL-generating side. I saw well over 20%
overhead for the default max_wal_senders=10.
I just created a standby, shut it down, then ran a deterministically-sized
workload on the primary, started the standby, and measured how long it took to
catch up. I just used the log messages to measure the time.
2) Similar as 1) but count the time that the standby startup process spent on
replaying WAL(via gprof).
I don't think that's the case here, but IME gprof's overhead is so high that
it can move bottlenecks quite drastically. The problem is that it adds code to
every function entry and exit - for simple functions, that overhead is much higher
than the "original" cost of the function.
gprof style instrumentation is good for things like code coverage, but for
performance evaluation it's normally better to use a sampling profiler like
perf. That also causes slowdowns, but largely only in places that already take
up substantial execution time.
Greetings,
Andres Freund
On Wed, 2023-05-17 at 12:34 -0700, Andres Freund wrote:
I understood Jeff's proposal to just have an early exit if there are no
walsenders connected at all.
My suggestion was we early exit unless there is at least one *waiting*
walsender of the appropriate type. In other words, increment the
counter on entry to WalSndWait() and decrement on exit. I didn't look
in detail yet, so perhaps it's not easy/cheap to do that safely.
Regards,
Jeff Davis