walsender performance regression due to logical decoding on standby changes

Started by Andres Freund · almost 3 years ago · 31 messages · pgsql-hackers
#1Andres Freund
andres@anarazel.de

Hi,

Unfortunately I have found the following commit to have caused a performance
regression:

commit e101dfac3a53c20bfbf1ca85d30a368c2954facf
Author: Andres Freund <andres@anarazel.de>
Date: 2023-04-08 00:24:24 -0700

For cascading replication, wake physical and logical walsenders separately

Physical walsenders can't send data until it's been flushed; logical
walsenders can't decode and send data until it's been applied. On the
standby, the WAL is flushed first, which will only wake up physical
walsenders; and then applied, which will only wake up logical
walsenders.

Previously, all walsenders were awakened when the WAL was flushed. That
was fine for logical walsenders on the primary; but on the standby the
flushed WAL would have been not applied yet, so logical walsenders were
awakened too early.

Per idea from Jeff Davis and Amit Kapila.

Author: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com>
Reviewed-By: Jeff Davis <pgsql@j-davis.com>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: /messages/by-id/CAA4eK1+zO5LUeisabX10c81LU-fWMKO4M9Wyg1cdkbW7Hqh6vQ@mail.gmail.com

The problem is that, on a standby, after the change - as needed for the
approach to work - the call to WalSndWakeup() in ApplyWalRecord() happens for
every record, instead of only happening when the timeline is changed (or WAL
is flushed or ...).

WalSndWakeup() iterates over all walsender slots, regardless of whether they
are in use. For each of the walsender slots it acquires a spinlock.

When replaying a lot of small-ish WAL records I found the startup process to
spend the majority of the time in WalSndWakeup(). I've not measured it very
precisely yet, but the overhead is significant (~35% slowdown), even with the
default max_wal_senders. If that's increased substantially, it obviously gets
worse.

The only saving grace is that this is not an issue on the primary.

I unfortunately spent less time on this commit of the
logical-decoding-on-standby series than on the others. There were so many
other senior contributors discussing it that I "relaxed" a bit too much.

I don't think the approach of not having any sort of "registry" of whether
anybody is waiting for the replay position to be updated is
feasible. Iterating over all walsender slots is just too expensive -
WalSndWakeup() shows up even if I remove the spinlock (which we likely could,
especially when just checking if the walsender is connected).

My current guess is that mis-using a condition variable is the best bet. I
think it should work to use ConditionVariablePrepareToSleep() before a
WalSndWait(), and then ConditionVariableCancelSleep(). I.e. to never use
ConditionVariableSleep(). The latch set from ConditionVariableBroadcast()
would still cause the necessary wakeup.

Greetings,

Andres Freund

#2Jeff Davis
pgsql@j-davis.com
In reply to: Andres Freund (#1)
Re: walsender performance regression due to logical decoding on standby changes

On Tue, 2023-05-09 at 12:02 -0700, Andres Freund wrote:

I don't think the approach of not having any sort of "registry" of
whether
anybody is waiting for the replay position to be updated is
feasible. Iterating over all walsenders slots is just too expensive -

Would it work to use a shared counter for the waiters (or, two
counters, one for physical and one for logical), and just early exit if
the count is zero?

Regards,
Jeff Davis

#3Andres Freund
andres@anarazel.de
In reply to: Jeff Davis (#2)
Re: walsender performance regression due to logical decoding on standby changes

Hi,

On 2023-05-09 13:38:24 -0700, Jeff Davis wrote:

On Tue, 2023-05-09 at 12:02 -0700, Andres Freund wrote:

I don't think the approach of not having any sort of "registry" of
whether
anybody is waiting for the replay position to be updated is
feasible. Iterating over all walsenders slots is just too expensive -

Would it work to use a shared counter for the waiters (or, two
counters, one for physical and one for logical), and just early exit if
the count is zero?

That doesn't really fix the problem - once you have a single walsender
connected, performance is bad again.

Greetings,

Andres Freund

#4Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Andres Freund (#1)
Re: walsender performance regression due to logical decoding on standby changes

On Wed, May 10, 2023 at 12:33 AM Andres Freund <andres@anarazel.de> wrote:

Unfortunately I have found the following commit to have caused a performance
regression:

commit e101dfac3a53c20bfbf1ca85d30a368c2954facf

The problem is that, on a standby, after the change - as needed for the
approach to work - the call to WalSndWakeup() in ApplyWalRecord() happens for
every record, instead of only happening when the timeline is changed (or WAL
is flushed or ...).

WalSndWakeup() iterates over all walsender slots, regardless of whether in
use. For each of the walsender slots it acquires a spinlock.

When replaying a lot of small-ish WAL records I found the startup process to
spend the majority of the time in WalSndWakeup(). I've not measured it very
precisely yet, but the overhead is significant (~35% slowdown), even with the
default max_wal_senders. If that's increased substantially, it obviously gets
worse.

I played with a simple primary -> standby1 -> standby2 setup. I ran
a pgbench script [1] on primary and counted the number of times
WalSndWakeup() gets called from ApplyWalRecord() and the number of
times spinlock is acquired/released in WalSndWakeup(). It's a whopping
21 million times spinlock is acquired/released on the standby 1 and
standby 2 for just a < 5min of pgbench run on the primary:

standby 1:
2023-05-10 05:32:43.249 UTC [1595600] LOG: FOO WalSndWakeup() in
ApplyWalRecord() was called 2176352 times
2023-05-10 05:32:43.249 UTC [1595600] LOG: FOO spinlock
acquisition/release count in WalSndWakeup() is 21763530

standby 2:
2023-05-10 05:32:43.249 UTC [1595625] LOG: FOO WalSndWakeup() in
ApplyWalRecord() was called 2176352 times
2023-05-10 05:32:43.249 UTC [1595625] LOG: FOO spinlock
acquisition/release count in WalSndWakeup() is 21763530

In this case, there is no timeline switch or no logical decoding on
the standby or such, but WalSndWakeup() gets called because the
standby can't make out if the slot is for logical or physical
replication unless the spinlock is acquired. Before e101dfac3a,
WalSndWakeup() was getting called only when there was a timeline
switch.

The only saving grace is that this is not an issue on the primary.

Yeah.

I don't think the approach of not having any sort of "registry" of whether
anybody is waiting for the replay position to be updated is
feasible. Iterating over all walsenders slots is just too expensive -
WalSndWakeup() shows up even if I remove the spinlock (which we likely could,
especially when just checking if the walsender is connected).

Right.

My current guess is that mis-using a condition variable is the best bet. I
think it should work to use ConditionVariablePrepareToSleep() before a
WalSndWait(), and then ConditionVariableCancelSleep(). I.e. to never use
ConditionVariableSleep(). The latch set from ConditionVariableBroadcast()
would still cause the necessary wakeup.

How about something like the attached? Recovery and subscription tests
don't complain with the patch.

[1]:
./pg_ctl -D data -l logfile stop
./pg_ctl -D sbdata1 -l logfile1 stop
./pg_ctl -D sbdata2 -l logfile2 stop

cd /home/ubuntu/postgres
rm -rf inst
make distclean
./configure --prefix=$PWD/inst/ CFLAGS="-O3" > install.log && make -j
8 install > install.log 2>&1 &

primary -> standby1 -> standby2

free -m
sudo su -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
free -m

cd inst/bin
./initdb -D data
mkdir archived_wal

cat << EOF >> data/postgresql.conf
shared_buffers = '8GB'
wal_buffers = '1GB'
max_wal_size = '16GB'
max_connections = '5000'
archive_mode = 'on'
archive_command='cp %p /home/ubuntu/postgres/inst/bin/archived_wal/%f'
EOF

./pg_ctl -D data -l logfile start
./pg_basebackup -D sbdata1
./psql -c "select pg_create_physical_replication_slot('sb1_slot',
true, false)" postgres

cat << EOF >> sbdata1/postgresql.conf
port=5433
primary_conninfo='host=localhost port=5432 dbname=postgres user=ubuntu
application_name=sb1'
primary_slot_name='sb1_slot'
restore_command='cp /home/ubuntu/postgres/inst/bin/archived_wal/%f %p'
EOF

touch sbdata1/standby.signal

./pg_ctl -D sbdata1 -l logfile1 start
./pg_basebackup -D sbdata2 -p 5433
./psql -p 5433 -c "select
pg_create_physical_replication_slot('sb2_slot', true, false)" postgres

cat << EOF >> sbdata2/postgresql.conf
port=5434
primary_conninfo='host=localhost port=5433 dbname=postgres user=ubuntu
application_name=sb2'
primary_slot_name='sb2_slot'
restore_command='cp /home/ubuntu/postgres/inst/bin/archived_wal/%f %p'
EOF

touch sbdata2/standby.signal

./pg_ctl -D sbdata2 -l logfile2 start
./psql -c "select pid, backend_type from pg_stat_activity where
backend_type in ('walreceiver', 'walsender');" postgres
./psql -p 5433 -c "select pid, backend_type from pg_stat_activity
where backend_type in ('walreceiver', 'walsender');" postgres
./psql -p 5434 -c "select pid, backend_type from pg_stat_activity
where backend_type in ('walreceiver', 'walsender');" postgres

ulimit -S -n 5000
./pgbench --initialize --scale=300 postgres
for c in 1 2 4 8 16 32 64 128 256 512 768 1024 2048 4096; do echo -n
"$c ";./pgbench -n -M prepared -U ubuntu postgres -b tpcb-like -c$c
-j$c -T5 2>&1|grep '^tps'|awk '{print $3}';done

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v1-0001-Use-conditional-variable-to-wake-up-walsenders.patch (application/octet-stream, +11 -2)
#5Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Andres Freund (#3)
Re: walsender performance regression due to logical decoding on standby changes

Hi,

On 5/9/23 11:00 PM, Andres Freund wrote:

Hi,

On 2023-05-09 13:38:24 -0700, Jeff Davis wrote:

On Tue, 2023-05-09 at 12:02 -0700, Andres Freund wrote:

I don't think the approach of not having any sort of "registry" of
whether
anybody is waiting for the replay position to be updated is
feasible. Iterating over all walsenders slots is just too expensive -

Would it work to use a shared counter for the waiters (or, two
counters, one for physical and one for logical), and just early exit if
the count is zero?

That doesn't really fix the problem - once you have a single walsender
connected, performance is bad again.

Just to clarify, do you mean that if there is only one remaining active walsender that, say,
has been located at slot n, then we’d still have to loop from 0 to n in WalSndWakeup()?

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#6Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Bharath Rupireddy (#4)
Re: walsender performance regression due to logical decoding on standby changes

Hi,

On 5/10/23 8:36 AM, Bharath Rupireddy wrote:

On Wed, May 10, 2023 at 12:33 AM Andres Freund <andres@anarazel.de> wrote:

Unfortunately I have found the following commit to have caused a performance
regression:

commit e101dfac3a53c20bfbf1ca85d30a368c2954facf

The problem is that, on a standby, after the change - as needed for the
approach to work - the call to WalSndWakeup() in ApplyWalRecord() happens for
every record, instead of only happening when the timeline is changed (or WAL
is flushed or ...).

WalSndWakeup() iterates over all walsender slots, regardless of whether in
use. For each of the walsender slots it acquires a spinlock.

When replaying a lot of small-ish WAL records I found the startup process to
spend the majority of the time in WalSndWakeup(). I've not measured it very
precisely yet, but the overhead is significant (~35% slowdown), even with the
default max_wal_senders. If that's increased substantially, it obviously gets
worse.

Thanks Andres for the call out! I do agree that this is a concern.

The only saving grace is that this is not an issue on the primary.

Yeah.

+1

My current guess is that mis-using a condition variable is the best bet. I
think it should work to use ConditionVariablePrepareToSleep() before a
WalSndWait(), and then ConditionVariableCancelSleep(). I.e. to never use
ConditionVariableSleep(). The latch set from ConditionVariableBroadcast()
would still cause the necessary wakeup.

Yeah, I think that "mis-using" a condition variable is a valid option. Unless I'm missing
something, I don't think there is anything wrong with using a CV that way (aka not using
ConditionVariableTimedSleep() or ConditionVariableSleep() in this particular case).

How about something like the attached? Recovery and subscription tests
don't complain with the patch.

Thanks Bharath for looking at it!

I launched a full Cirrus CI test with it but it failed on one environment (did not look in details,
just sharing this here): https://cirrus-ci.com/task/6570140767092736

Also I have a few comments:

@@ -1958,7 +1959,7 @@ ApplyWalRecord(XLogReaderState *xlogreader, XLogRecord *record, TimeLineID *repl
          * ------
          */
         if (AllowCascadeReplication())
-               WalSndWakeup(switchedTLI, true);
+               ConditionVariableBroadcast(&WalSndCtl->cv);

I think the comment above this change needs to be updated.

@@ -3368,9 +3370,13 @@ WalSndWait(uint32 socket_events, long timeout, uint32 wait_event)
WaitEvent event;

         ModifyWaitEvent(FeBeWaitSet, FeBeWaitSetSocketPos, socket_events, NULL);
+
+       ConditionVariablePrepareToSleep(&WalSndCtl->cv);
         if (WaitEventSetWait(FeBeWaitSet, timeout, &event, 1, wait_event) == 1 &&
                 (event.events & WL_POSTMASTER_DEATH))
                 proc_exit(1);
+
+       ConditionVariableCancelSleep();

May be worth to update the comment above WalSndWait() too? (to explain why a CV handling is part of it).

@@ -108,6 +109,8 @@ typedef struct
*/
bool sync_standbys_defined;

+ ConditionVariable cv;

Worth to give it a more meaning full name? (to give a clue as when it is used)

I think we also need to update the comment above WalSndWakeup():

"
* For cascading replication we need to wake up physical walsenders separately
* from logical walsenders (see the comment before calling WalSndWakeup() in
* ApplyWalRecord() for more details).
"

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#7Zhijie Hou (Fujitsu)
houzj.fnst@fujitsu.com
In reply to: Bharath Rupireddy (#4)
RE: walsender performance regression due to logical decoding on standby changes

On Wednesday, May 10, 2023 2:36 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:

On Wed, May 10, 2023 at 12:33 AM Andres Freund <andres@anarazel.de>
wrote:

Unfortunately I have found the following commit to have caused a performance
regression:

commit e101dfac3a53c20bfbf1ca85d30a368c2954facf

The problem is that, on a standby, after the change - as needed for the
approach to work - the call to WalSndWakeup() in ApplyWalRecord() happens for
every record, instead of only happening when the timeline is changed (or WAL
is flushed or ...).

WalSndWakeup() iterates over all walsender slots, regardless of whether in
use. For each of the walsender slots it acquires a spinlock.

When replaying a lot of small-ish WAL records I found the startup process to
spend the majority of the time in WalSndWakeup(). I've not measured it very
precisely yet, but the overhead is significant (~35% slowdown), even with the
default max_wal_senders. If that's increased substantially, it obviously gets
worse.

I played it with a simple primary -> standby1 -> standby2 setup. I ran
a pgbench script [1] on primary and counted the number of times
WalSndWakeup() gets called from ApplyWalRecord() and the number of
times spinlock is acquired/released in WalSndWakeup(). It's a whopping
21 million times spinlock is acquired/released on the standby 1 and
standby 2 for just a < 5min of pgbench run on the primary:

standby 1:
2023-05-10 05:32:43.249 UTC [1595600] LOG: FOO WalSndWakeup() in
ApplyWalRecord() was called 2176352 times
2023-05-10 05:32:43.249 UTC [1595600] LOG: FOO spinlock
acquisition/release count in WalSndWakeup() is 21763530

standby 2:
2023-05-10 05:32:43.249 UTC [1595625] LOG: FOO WalSndWakeup() in
ApplyWalRecord() was called 2176352 times
2023-05-10 05:32:43.249 UTC [1595625] LOG: FOO spinlock
acquisition/release count in WalSndWakeup() is 21763530

In this case, there is no timeline switch or no logical decoding on
the standby or such, but WalSndWakeup() gets called because the
standby can't make out if the slot is for logical or physical
replication unless spinlock is acquired. Before e101dfac3a,
WalSndWakeup() was getting called only when there was a timeline
switch.

The only saving grace is that this is not an issue on the primary.

Yeah.

I don't think the approach of not having any sort of "registry" of whether
anybody is waiting for the replay position to be updated is
feasible. Iterating over all walsenders slots is just too expensive -
WalSndWakeup() shows up even if I remove the spinlock (which we likely could,
especially when just checking if the walsender is connected).

Right.

My current guess is that mis-using a condition variable is the best bet. I
think it should work to use ConditionVariablePrepareToSleep() before a
WalSndWait(), and then ConditionVariableCancelSleep(). I.e. to never use
ConditionVariableSleep(). The latch set from ConditionVariableBroadcast()
would still cause the necessary wakeup.

How about something like the attached? Recovery and subscription tests
don't complain with the patch.

Thanks for the patch. I noticed one place where the logic is different from before and
just to confirm:

	if (AllowCascadeReplication())
-		WalSndWakeup(switchedTLI, true);
+		ConditionVariableBroadcast(&WalSndCtl->cv);

After the change, we wake up physical walsenders regardless of the switchedTLI flag.
Is this intentional? If so, I think it would be better to update the comments above this.

Best Regards,
Hou zj

#8Amit Kapila
amit.kapila16@gmail.com
In reply to: Zhijie Hou (Fujitsu) (#7)
Re: walsender performance regression due to logical decoding on standby changes

On Wed, May 10, 2023 at 3:41 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Wednesday, May 10, 2023 2:36 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:

My current guess is that mis-using a condition variable is the best bet. I
think it should work to use ConditionVariablePrepareToSleep() before a
WalSndWait(), and then ConditionVariableCancelSleep(). I.e. to never use
ConditionVariableSleep(). The latch set from ConditionVariableBroadcast()
would still cause the necessary wakeup.

How about something like the attached? Recovery and subscription tests
don't complain with the patch.

Thanks for the patch. I noticed one place where the logic is different from before and
just to confirm:

if (AllowCascadeReplication())
-               WalSndWakeup(switchedTLI, true);
+               ConditionVariableBroadcast(&WalSndCtl->cv);

After the change, we wakeup physical walsender regardless of switchedTLI flag.
Is this intentional ? if so, I think It would be better to update the comments above this.

This raises the question of whether we need this condition variable
logic for physical walsenders?

--
With Regards,
Amit Kapila.

#9Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#8)
Re: walsender performance regression due to logical decoding on standby changes

On Wed, May 10, 2023 at 7:33 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, May 10, 2023 at 3:41 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Wednesday, May 10, 2023 2:36 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:

My current guess is that mis-using a condition variable is the best bet. I
think it should work to use ConditionVariablePrepareToSleep() before a
WalSndWait(), and then ConditionVariableCancelSleep(). I.e. to never use
ConditionVariableSleep(). The latch set from ConditionVariableBroadcast()
would still cause the necessary wakeup.

How about something like the attached? Recovery and subscription tests
don't complain with the patch.

Thanks for the patch. I noticed one place where the logic is different from before and
just to confirm:

if (AllowCascadeReplication())
-               WalSndWakeup(switchedTLI, true);
+               ConditionVariableBroadcast(&WalSndCtl->cv);

After the change, we wakeup physical walsender regardless of switchedTLI flag.
Is this intentional ? if so, I think It would be better to update the comments above this.

This raises the question of whether we need this condition variable
logic for physical walsenders?

It sounds like a good idea. We can have two condition variables for
logical and physical walsenders, and selectively wake up walsenders
sleeping on the condition variables. It should work, though it seems
like a bit of a hack.

Regards,

[1]: /messages/by-id/2d314c22b9e03415aa1c7d8fd1f698dae60effa7.camel@j-davis.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#10Zhijie Hou (Fujitsu)
houzj.fnst@fujitsu.com
In reply to: Andres Freund (#1)
RE: walsender performance regression due to logical decoding on standby changes

On Wednesday, May 10, 2023 3:03 AM Andres Freund <andres@anarazel.de> wrote:

Hi,

Unfortunately I have found the following commit to have caused a performance
regression:

commit e101dfac3a53c20bfbf1ca85d30a368c2954facf
Author: Andres Freund <andres@anarazel.de>
Date: 2023-04-08 00:24:24 -0700

For cascading replication, wake physical and logical walsenders separately

Physical walsenders can't send data until it's been flushed; logical
walsenders can't decode and send data until it's been applied. On the
standby, the WAL is flushed first, which will only wake up physical
walsenders; and then applied, which will only wake up logical
walsenders.

Previously, all walsenders were awakened when the WAL was flushed. That
was fine for logical walsenders on the primary; but on the standby the
flushed WAL would have been not applied yet, so logical walsenders were
awakened too early.

Per idea from Jeff Davis and Amit Kapila.

Author: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com>
Reviewed-By: Jeff Davis <pgsql@j-davis.com>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: /messages/by-id/CAA4eK1+zO5LUeisabX10c81LU-fWMKO4M9Wyg1cdkbW7Hqh6vQ@mail.gmail.com

The problem is that, on a standby, after the change - as needed for the
approach to work - the call to WalSndWakeup() in ApplyWalRecord() happens for
every record, instead of only happening when the timeline is changed (or WAL
is flushed or ...).

WalSndWakeup() iterates over all walsender slots, regardless of whether in
use. For each of the walsender slots it acquires a spinlock.

When replaying a lot of small-ish WAL records I found the startup process to
spend the majority of the time in WalSndWakeup(). I've not measured it very
precisely yet, but the overhead is significant (~35% slowdown), even with the
default max_wal_senders. If that's increased substantially, it obviously gets
worse.

I did some simple tests to see the performance impact on streaming
replication, just sharing them here for reference.

1) Sync primary-standby setup: load data on the primary and count the time spent on
replication. The degradation becomes more obvious as the value of max_wal_senders
increases.

max_wal_senders before(ms) after(ms) degradation
100 13394.4013 14141.2615 5.58%
200 13280.8507 14597.1173 9.91%
300 13535.0232 16735.7379 23.65%

2) Similar as 1) but count the time that the standby startup process spent on
replaying WAL(via gprof).

10 senders
===========
before
% cumulative self self total
time seconds seconds calls s/call s/call name
4.12 0.45 0.11 1 0.11 2.46 PerformWalRecovery

after
% cumulative self self total
time seconds seconds calls s/call s/call name
17.99 0.59 0.59 4027383 0.00 0.00 WalSndWakeup
8.23 0.86 0.27 1 0.27 3.11 PerformWalRecovery

100 senders
===========
before
% cumulative self self total
time seconds seconds calls s/call s/call name
5.56 0.36 0.18 1 0.18 2.91 PerformWalRecovery

after
% cumulative self self total
time seconds seconds calls s/call s/call name
64.65 4.39 4.39 4027383 0.00 0.00 WalSndWakeup
2.95 4.59 0.20 1 0.20 6.62 PerformWalRecovery

Will test after applying the latest patch in this thread later.

Best Regards,
Hou zj

#11Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Bertrand Drouvot (#6)
Re: walsender performance regression due to logical decoding on standby changes

On Wed, May 10, 2023 at 3:23 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:

My current guess is that mis-using a condition variable is the best bet. I
think it should work to use ConditionVariablePrepareToSleep() before a
WalSndWait(), and then ConditionVariableCancelSleep(). I.e. to never use
ConditionVariableSleep(). The latch set from ConditionVariableBroadcast()
would still cause the necessary wakeup.

How about something like the attached? Recovery and subscription tests
don't complain with the patch.

I launched a full Cirrus CI test with it but it failed on one environment (did not look in details,
just sharing this here): https://cirrus-ci.com/task/6570140767092736

Yeah, v1 had ConditionVariableInit() such that the CV was getting
initialized for every backend as opposed to just once after the WAL
sender shmem was created.

Also I have a few comments:

Indeed, v1 was a WIP patch. Please have a look at the attached v2
patch, which has comments and passing CI runs on all platforms -
https://github.com/BRupireddy/postgres/tree/optimize_walsender_wakeup_logic_v2.

On Wed, May 10, 2023 at 3:41 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

if (AllowCascadeReplication())
-               WalSndWakeup(switchedTLI, true);
+               ConditionVariableBroadcast(&WalSndCtl->cv);

After the change, we wakeup physical walsender regardless of switchedTLI flag.
Is this intentional ? if so, I think It would be better to update the comments above this.

That's not the case with the attached v2 patch. Please have a look.

On Thu, May 11, 2023 at 10:27 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

We can have two condition variables for
logical and physical walsenders, and selectively wake up walsenders
sleeping on the condition variables. It should work, it seems like
much of a hack, though.

Andres rightly put it - 'mis-using' CV infrastructure. It is simple,
works, and makes WalSndWakeup() easy, solving the performance
regression.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v2-0001-Optimize-walsender-wake-up-logic-with-Conditional.patch (application/x-patch, +55 -25)
#12Zhijie Hou (Fujitsu)
houzj.fnst@fujitsu.com
In reply to: Bharath Rupireddy (#11)
RE: walsender performance regression due to logical decoding on standby changes

On Friday, May 12, 2023 7:58 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:

On Wed, May 10, 2023 at 3:23 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:

My current guess is that mis-using a condition variable is the best bet. I
think it should work to use ConditionVariablePrepareToSleep() before a
WalSndWait(), and then ConditionVariableCancelSleep(). I.e. to never use
ConditionVariableSleep(). The latch set from ConditionVariableBroadcast()
would still cause the necessary wakeup.

How about something like the attached? Recovery and subscription
tests don't complain with the patch.

I launched a full Cirrus CI test with it but it failed on one
environment (did not look in details, just sharing this here):
https://cirrus-ci.com/task/6570140767092736

Yeah, v1 had ConditionVariableInit() such that the CV was getting initialized for
every backend as opposed to just once after the WAL sender shmem was
created.

Also I have a few comments:

Indeed, v1 was a WIP patch. Please have a look at the attached v2 patch, which
has comments and passing CI runs on all platforms -
https://github.com/BRupireddy/postgres/tree/optimize_walsender_wakeup_logic_v2.

On Wed, May 10, 2023 at 3:41 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
wrote:

if (AllowCascadeReplication())
-               WalSndWakeup(switchedTLI, true);
+               ConditionVariableBroadcast(&WalSndCtl->cv);

After the change, we wake up physical walsenders regardless of the switchedTLI flag.
Is this intentional? If so, I think it would be better to update the comments above this.

That's not the case with the attached v2 patch. Please have a look.

Thanks for updating the patch. I did some simple primary->standby replication tests for the
patch and can see the degradation doesn't happen in the replication after applying it [1].

One nitpick in the comment:

+ * walsenders. It makes WalSndWakeup() callers life easy.

callers life easy => callers' life easy.

[1]:
max_wal_senders = 100
before regression(ms) after regression(ms) v2 patch(ms)
13394.4013 14141.2615 13455.2543
Compared with before regression 5.58% 0.45%

max_wal_senders = 200
before regression(ms) after regression(ms) v2 patch(ms)
13280.8507 14597.1173 13632.0606
Compared with before regression 9.91% 1.64%

max_wal_senders = 300
before regression(ms) after regression(ms) v2 patch(ms)
13535.0232 16735.7379 13705.7135
Compared with before regression 23.65% 1.26%

Best Regards,
Hou zj

#13Thomas Munro
thomas.munro@gmail.com
In reply to: Bharath Rupireddy (#11)
Re: walsender performance regression due to logical decoding on standby changes

On Fri, May 12, 2023 at 11:58 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

Andres, rightly put it - 'mis-using' CV infrastructure. It is simple,
works, and makes the WalSndWakeup() easy solving the performance
regression.

Yeah, this seems OK, and better than the complicated alternatives. If
one day we want to implement CVs some other way so that this
I-know-that-CVs-are-really-made-out-of-latches abstraction leak
becomes a problem, and we still need this, well then we can make a
separate latch-wait-list thing.

#14Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Zhijie Hou (Fujitsu) (#12)
Re: walsender performance regression due to logical decoding on standby changes

Hi,

On 5/15/23 4:19 AM, Zhijie Hou (Fujitsu) wrote:

On Friday, May 12, 2023 7:58 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:

On Wed, May 10, 2023 at 3:23 PM Drouvot, Bertrand

That's not the case with the attached v2 patch. Please have a look.

Thanks for V2! It looks good to me, and I like the fact that
WalSndWakeup() does not need to loop over all the walsender slots
anymore (for both the physical and logical cases).

Thanks for updating the patch. I ran some simple primary->standby replication tests for the
patch and can see that the replication degradation no longer happens after applying it [1].

Thanks for the performance regression measurement!

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#15Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Thomas Munro (#13)
Re: walsender performance regression due to logical decoding on standby changes

On Mon, May 15, 2023 at 1:49 PM Thomas Munro <thomas.munro@gmail.com> wrote:

On Fri, May 12, 2023 at 11:58 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

Andres rightly put it - 'mis-using' CV infrastructure. It is simple,
works, and keeps WalSndWakeup() easy while solving the performance
regression.

Yeah, this seems OK, and better than the complicated alternatives. If
one day we want to implement CVs some other way so that this
I-know-that-CVs-are-really-made-out-of-latches abstraction leak
becomes a problem, and we still need this, well then we can make a
separate latch-wait-list thing.

+1

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#16Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Masahiko Sawada (#15)
Re: walsender performance regression due to logical decoding on standby changes

On Mon, May 15, 2023 at 6:14 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, May 15, 2023 at 1:49 PM Thomas Munro <thomas.munro@gmail.com> wrote:

On Fri, May 12, 2023 at 11:58 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

Andres rightly put it - 'mis-using' CV infrastructure. It is simple,
works, and keeps WalSndWakeup() easy while solving the performance
regression.

Yeah, this seems OK, and better than the complicated alternatives. If
one day we want to implement CVs some other way so that this
I-know-that-CVs-are-really-made-out-of-latches abstraction leak
becomes a problem, and we still need this, well then we can make a
separate latch-wait-list thing.

+1

Thanks for acknowledging the approach.

On Mon, May 15, 2023 at 2:11 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:

Thanks for V2! It does look good to me and I like the fact that
WalSndWakeup() does not need to loop over all the walsender slots
anymore (for both the physical and logical cases).

Indeed, it doesn't have to.

On Mon, May 15, 2023 at 7:50 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

Thanks for updating the patch. I ran some simple primary->standby replication tests for the
patch and can see that the replication degradation no longer happens after applying it [1].

One nitpick in the comment:

+ * walsenders. It makes WalSndWakeup() callers life easy.

callers life easy => callers' life easy.

Changed.

[1]
max_wal_senders = 100
before regression(ms) after regression(ms) v2 patch(ms)
13394.4013 14141.2615 13455.2543
Compared with before regression 5.58% 0.45%

max_wal_senders = 200
before regression(ms) after regression(ms) v2 patch(ms)
13280.8507 14597.1173 13632.0606
Compared with before regression 9.91% 1.64%

max_wal_senders = 300
before regression(ms) after regression(ms) v2 patch(ms)
13535.0232 16735.7379 13705.7135
Compared with before regression 23.65% 1.26%

Yes, the numbers with the v2 patch look close to where we were before.
Thanks for confirming. Just wondering, where is this extra
0.45%/1.64%/1.26% coming from?

Please find the attached v3 with the review comment addressed.

Do we need to add an open item for this issue in
https://wiki.postgresql.org/wiki/PostgreSQL_16_Open_Items? If yes, can
anyone in this loop add one?

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v3-0001-Optimize-walsender-wake-up-logic-with-Conditional.patch (+55/-25)
#17Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Bharath Rupireddy (#16)
Re: walsender performance regression due to logical decoding on standby changes

Hi,

On 5/15/23 4:39 PM, Bharath Rupireddy wrote:

On Mon, May 15, 2023 at 6:14 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, May 15, 2023 at 1:49 PM Thomas Munro <thomas.munro@gmail.com> wrote:

Do we need to add an open item for this issue in
https://wiki.postgresql.org/wiki/PostgreSQL_16_Open_Items? If yes, can
anyone in this loop add one?

I do think we need one for this issue, so I just added it.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#18Andres Freund
andres@anarazel.de
In reply to: Bertrand Drouvot (#5)
Re: walsender performance regression due to logical decoding on standby changes

Hi,

On 2023-05-10 08:39:08 +0200, Drouvot, Bertrand wrote:

On 5/9/23 11:00 PM, Andres Freund wrote:

Hi,

On 2023-05-09 13:38:24 -0700, Jeff Davis wrote:

On Tue, 2023-05-09 at 12:02 -0700, Andres Freund wrote:

I don't think the approach of not having any sort of "registry" of
whether
anybody is waiting for the replay position to be updated is
feasible. Iterating over all walsender slots is just too expensive -

Would it work to use a shared counter for the waiters (or, two
counters, one for physical and one for logical), and just early exit if
the count is zero?

That doesn't really fix the problem - once you have a single walsender
connected, performance is bad again.

Just to clarify, do you mean that if there is only one remaining active walsender that, say,
has been located at slot n, then we’d still have to loop from 0 to n in WalSndWakeup()?

I understood Jeff's proposal to just have an early exit if there are no
walsenders connected at all. But yes, even if we stopped iterating after
finding the number of slots we needed to, having to iterate over empty slots
would be an issue.

But TBH, even if we only did work for connected walsenders, I think this would
still be a performance issue. Acquiring O(#connected-walsenders) spinlocks for
every record is just too expensive.

Greetings,

Andres Freund

#19Andres Freund
andres@anarazel.de
In reply to: Zhijie Hou (Fujitsu) (#10)
Re: walsender performance regression due to logical decoding on standby changes

Hi,

On 2023-05-11 09:42:39 +0000, Zhijie Hou (Fujitsu) wrote:

I did some simple tests for this to see the performance impact on
the streaming replication, just share it here for reference.

1) sync primary-standby setup, load data on primary and count the time spent on
replication. the degradation will be more obvious as the value of max_wal_senders
increases.

FWIW, using syncrep likely under-estimates the overhead substantially, because
that includes a lot of overhead on the WAL-generating side. I saw well over 20%
overhead for the default max_wal_senders=10.

I just created a standby, shut it down, then ran a deterministically-sized
workload on the primary, started the standby, and measured how long it took to
catch up. I just used the log messages to measure the time.

2) Similar as 1) but count the time that the standby startup process spent on
replaying WAL(via gprof).

I don't think that's the case here, but IME gprof's overhead is so high that
it can move bottlenecks quite drastically. The problem is that it adds code to
every function entry/exit - for simple functions, that overhead is much higher
than the "original" cost of the function.

gprof style instrumentation is good for things like code coverage, but for
performance evaluation it's normally better to use a sampling profiler like
perf. That also causes slowdowns, but largely only in places that already take
up substantial execution time.

Greetings,

Andres Freund

#20Jeff Davis
pgsql@j-davis.com
In reply to: Andres Freund (#18)
Re: walsender performance regression due to logical decoding on standby changes

On Wed, 2023-05-17 at 12:34 -0700, Andres Freund wrote:

I understood Jeff's proposal to just have an early exit if there are
no
walsenders connected at all.

My suggestion was that we early-exit unless there is at least one *waiting*
walsender of the appropriate type. In other words, increment the
counter on entry to WalSndWait() and decrement it on exit. I haven't looked
in detail yet, so perhaps it's not easy/cheap to do that safely.

Regards,
Jeff Davis

#21Andres Freund
andres@anarazel.de
In reply to: Bharath Rupireddy (#16)
#22Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#21)
#23Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Andres Freund (#21)
#24Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Bharath Rupireddy (#23)
#25Zhijie Hou (Fujitsu)
houzj.fnst@fujitsu.com
In reply to: Kyotaro Horiguchi (#24)
#26Andres Freund
andres@anarazel.de
In reply to: Kyotaro Horiguchi (#24)
#27Andres Freund
andres@anarazel.de
In reply to: Bharath Rupireddy (#23)
#28Zhijie Hou (Fujitsu)
houzj.fnst@fujitsu.com
In reply to: Andres Freund (#26)
#29Andres Freund
andres@anarazel.de
In reply to: Zhijie Hou (Fujitsu) (#28)
#30Zhijie Hou (Fujitsu)
houzj.fnst@fujitsu.com
In reply to: Andres Freund (#29)
#31Andres Freund
andres@anarazel.de
In reply to: Zhijie Hou (Fujitsu) (#30)