walsender performance regression due to logical decoding on standby changes
Hi,
Unfortunately I have found the following commit to have caused a performance
regression:
commit e101dfac3a53c20bfbf1ca85d30a368c2954facf
Author: Andres Freund <andres@anarazel.de>
Date: 2023-04-08 00:24:24 -0700
For cascading replication, wake physical and logical walsenders separately
Physical walsenders can't send data until it's been flushed; logical
walsenders can't decode and send data until it's been applied. On the
standby, the WAL is flushed first, which will only wake up physical
walsenders; and then applied, which will only wake up logical
walsenders.
Previously, all walsenders were awakened when the WAL was flushed. That
was fine for logical walsenders on the primary; but on the standby the
flushed WAL would have been not applied yet, so logical walsenders were
awakened too early.
Per idea from Jeff Davis and Amit Kapila.
Author: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com>
Reviewed-By: Jeff Davis <pgsql@j-davis.com>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: /messages/by-id/CAA4eK1+zO5LUeisabX10c81LU-fWMKO4M9Wyg1cdkbW7Hqh6vQ@mail.gmail.com
The problem is that, on a standby, after the change - as needed for the
approach to work - the call to WalSndWakeup() in ApplyWalRecord() happens for
every record, instead of only happening when the timeline is changed (or WAL
is flushed or ...).
WalSndWakeup() iterates over all walsender slots, regardless of whether they
are in use. For each of the walsender slots it acquires a spinlock.
When replaying a lot of small-ish WAL records I found the startup process to
spend the majority of the time in WalSndWakeup(). I've not measured it very
precisely yet, but the overhead is significant (~35% slowdown), even with the
default max_wal_senders. If that's increased substantially, it obviously gets
worse.
The only saving grace is that this is not an issue on the primary.
I unfortunately spent less time on this commit of the
logical-decoding-on-standby series than on the others. There were so many
other senior contributors discussing it, that I "relaxed" a bit too much.
I don't think the approach of not having any sort of "registry" of whether
anybody is waiting for the replay position to be updated is
feasible. Iterating over all walsender slots is just too expensive -
WalSndWakeup() shows up even if I remove the spinlock (which we likely could,
especially when just checking if the walsender is connected).
My current guess is that mis-using a condition variable is the best bet. I
think it should work to use ConditionVariablePrepareToSleep() before a
WalSndWait(), and then ConditionVariableCancelSleep(). I.e. to never use
ConditionVariableSleep(). The latch set from ConditionVariableBroadcast()
would still cause the necessary wakeup.
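A rough sketch of that idea ('wakeupCV' is a field that would have to be
added to WalSndCtlData; everything else is the existing API):

    /* in WalSndWait(), before waiting: join the CV's waitlist */
    ConditionVariablePrepareToSleep(&WalSndCtl->wakeupCV);

    /* wait on the socket/latch as today; a CV broadcast sets our latch */
    if (WaitEventSetWait(FeBeWaitSet, timeout, &event, 1, wait_event) == 1 &&
        (event.events & WL_POSTMASTER_DEATH))
        proc_exit(1);

    /* done waiting: leave the waitlist again */
    ConditionVariableCancelSleep();

    /* and on the wakeup side, instead of scanning all walsender slots: */
    ConditionVariableBroadcast(&WalSndCtl->wakeupCV);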
Greetings,
Andres Freund
On Tue, 2023-05-09 at 12:02 -0700, Andres Freund wrote:
I don't think the approach of not having any sort of "registry" of
whether
anybody is waiting for the replay position to be updated is
feasible. Iterating over all walsender slots is just too expensive -
Would it work to use a shared counter for the waiters (or, two
counters, one for physical and one for logical), and just early exit if
the count is zero?
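Roughly (with hypothetical pg_atomic_uint32 counters added to WalSndCtlData):

    /* at the top of WalSndWakeup(): skip the slot scan if nobody waits */
    if ((!physical ||
         pg_atomic_read_u32(&WalSndCtl->physical_waiters) == 0) &&
        (!logical ||
         pg_atomic_read_u32(&WalSndCtl->logical_waiters) == 0))
        return;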
Regards,
Jeff Davis
Hi,
On 2023-05-09 13:38:24 -0700, Jeff Davis wrote:
On Tue, 2023-05-09 at 12:02 -0700, Andres Freund wrote:
I don't think the approach of not having any sort of "registry" of
whether
anybody is waiting for the replay position to be updated is
feasible. Iterating over all walsender slots is just too expensive -

Would it work to use a shared counter for the waiters (or, two
counters, one for physical and one for logical), and just early exit if
the count is zero?
That doesn't really fix the problem - once you have a single walsender
connected, performance is bad again.
Greetings,
Andres Freund
On Wed, May 10, 2023 at 12:33 AM Andres Freund <andres@anarazel.de> wrote:
Unfortunately I have found the following commit to have caused a performance
regression:

commit e101dfac3a53c20bfbf1ca85d30a368c2954facf
The problem is that, on a standby, after the change - as needed for the
approach to work - the call to WalSndWakeup() in ApplyWalRecord() happens for
every record, instead of only happening when the timeline is changed (or WAL
is flushed or ...).

WalSndWakeup() iterates over all walsender slots, regardless of whether they
are in use. For each of the walsender slots it acquires a spinlock.

When replaying a lot of small-ish WAL records I found the startup process to
spend the majority of the time in WalSndWakeup(). I've not measured it very
precisely yet, but the overhead is significant (~35% slowdown), even with the
default max_wal_senders. If that's increased substantially, it obviously gets
worse.
I played with it on a simple primary -> standby1 -> standby2 setup. I ran
a pgbench script [1] on the primary and counted the number of times
WalSndWakeup() gets called from ApplyWalRecord() and the number of
times the spinlock is acquired/released in WalSndWakeup(). The spinlock is
acquired/released a whopping 21 million times on standby 1 and standby 2 for
just a <5 min pgbench run on the primary:
standby 1:
2023-05-10 05:32:43.249 UTC [1595600] LOG: FOO WalSndWakeup() in
ApplyWalRecord() was called 2176352 times
2023-05-10 05:32:43.249 UTC [1595600] LOG: FOO spinlock
acquisition/release count in WalSndWakeup() is 21763530
standby 2:
2023-05-10 05:32:43.249 UTC [1595625] LOG: FOO WalSndWakeup() in
ApplyWalRecord() was called 2176352 times
2023-05-10 05:32:43.249 UTC [1595625] LOG: FOO spinlock
acquisition/release count in WalSndWakeup() is 21763530
In this case, there is no timeline switch and no logical decoding on
the standby or such, but WalSndWakeup() gets called because the
standby can't make out whether a slot is for logical or physical
replication unless the spinlock is acquired. Before e101dfac3a,
WalSndWakeup() was getting called only when there was a timeline
switch.
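(The counting was done with a quick debugging hack along these lines; the
counters and LOG lines are not part of any patch:)

    /* in ApplyWalRecord(), next to the WalSndWakeup() call */
    static uint64 wakeup_calls = 0;
    wakeup_calls++;

    /* in WalSndWakeup(), once per iteration of the per-slot loop */
    static uint64 spinlock_ops = 0;
    spinlock_ops++;

Both counters were then reported with elog(LOG, ...). Note that 21763530 is
almost exactly 10x the call count, which matches scanning max_wal_senders =
10 (the default) slots per call.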
The only saving grace is that this is not an issue on the primary.
Yeah.
I don't think the approach of not having any sort of "registry" of whether
anybody is waiting for the replay position to be updated is
feasible. Iterating over all walsender slots is just too expensive -
WalSndWakeup() shows up even if I remove the spinlock (which we likely could,
especially when just checking if the walsender is connected).
Right.
My current guess is that mis-using a condition variable is the best bet. I
think it should work to use ConditionVariablePrepareToSleep() before a
WalSndWait(), and then ConditionVariableCancelSleep(). I.e. to never use
ConditionVariableSleep(). The latch set from ConditionVariableBroadcast()
would still cause the necessary wakeup.
How about something like the attached? Recovery and subscription tests
don't complain with the patch.
[1]:
./pg_ctl -D data -l logfile stop
./pg_ctl -D sbdata1 -l logfile1 stop
./pg_ctl -D sbdata2 -l logfile2 stop
cd /home/ubuntu/postgres
rm -rf inst
make distclean
./configure --prefix=$PWD/inst/ CFLAGS="-O3" > install.log && make -j 8 install > install.log 2>&1 &
primary -> standby1 -> standby2
free -m
sudo su -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
free -m
cd inst/bin
./initdb -D data
mkdir archived_wal
cat << EOF >> data/postgresql.conf
shared_buffers = '8GB'
wal_buffers = '1GB'
max_wal_size = '16GB'
max_connections = '5000'
archive_mode = 'on'
archive_command='cp %p /home/ubuntu/postgres/inst/bin/archived_wal/%f'
EOF
./pg_ctl -D data -l logfile start
./pg_basebackup -D sbdata1
./psql -c "select pg_create_physical_replication_slot('sb1_slot',
true, false)" postgres
cat << EOF >> sbdata1/postgresql.conf
port=5433
primary_conninfo='host=localhost port=5432 dbname=postgres user=ubuntu
application_name=sb1'
primary_slot_name='sb1_slot'
restore_command='cp /home/ubuntu/postgres/inst/bin/archived_wal/%f %p'
EOF
touch sbdata1/standby.signal
./pg_ctl -D sbdata1 -l logfile1 start
./pg_basebackup -D sbdata2 -p 5433
./psql -p 5433 -c "select
pg_create_physical_replication_slot('sb2_slot', true, false)" postgres
cat << EOF >> sbdata2/postgresql.conf
port=5434
primary_conninfo='host=localhost port=5433 dbname=postgres user=ubuntu
application_name=sb2'
primary_slot_name='sb2_slot'
restore_command='cp /home/ubuntu/postgres/inst/bin/archived_wal/%f %p'
EOF
touch sbdata2/standby.signal
./pg_ctl -D sbdata2 -l logfile2 start
./psql -c "select pid, backend_type from pg_stat_activity where
backend_type in ('walreceiver', 'walsender');" postgres
./psql -p 5433 -c "select pid, backend_type from pg_stat_activity
where backend_type in ('walreceiver', 'walsender');" postgres
./psql -p 5434 -c "select pid, backend_type from pg_stat_activity
where backend_type in ('walreceiver', 'walsender');" postgres
ulimit -S -n 5000
./pgbench --initialize --scale=300 postgres
for c in 1 2 4 8 16 32 64 128 256 512 768 1024 2048 4096; do echo -n
"$c ";./pgbench -n -M prepared -U ubuntu postgres -b tpcb-like -c$c
-j$c -T5 2>&1|grep '^tps'|awk '{print $3}';done
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v1-0001-Use-conditional-variable-to-wake-up-walsenders.patch
From 10e384581362cb1323a5638bbaaeb36600c73f27 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Wed, 10 May 2023 06:29:25 +0000
Subject: [PATCH v1] Use conditional variable to wake up walsenders
---
src/backend/access/transam/xlogrecovery.c | 3 ++-
src/backend/replication/walsender.c | 6 ++++++
src/include/replication/walsender_private.h | 3 +++
3 files changed, 11 insertions(+), 1 deletion(-)
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 188f6d6f85..547862058d 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -50,6 +50,7 @@
#include "postmaster/startup.h"
#include "replication/slot.h"
#include "replication/walreceiver.h"
+#include "replication/walsender_private.h"
#include "storage/fd.h"
#include "storage/ipc.h"
#include "storage/latch.h"
@@ -1958,7 +1959,7 @@ ApplyWalRecord(XLogReaderState *xlogreader, XLogRecord *record, TimeLineID *repl
* ------
*/
if (AllowCascadeReplication())
- WalSndWakeup(switchedTLI, true);
+ ConditionVariableBroadcast(&WalSndCtl->cv);
/*
* If rm_redo called XLogRequestWalReceiverReply, then we wake up the
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 45b8b3684f..177b0db844 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -3310,6 +3310,8 @@ WalSndShmemInit(void)
SpinLockInit(&walsnd->mutex);
}
}
+
+ ConditionVariableInit(&WalSndCtl->cv);
}
/*
@@ -3368,9 +3370,13 @@ WalSndWait(uint32 socket_events, long timeout, uint32 wait_event)
WaitEvent event;
ModifyWaitEvent(FeBeWaitSet, FeBeWaitSetSocketPos, socket_events, NULL);
+
+ ConditionVariablePrepareToSleep(&WalSndCtl->cv);
if (WaitEventSetWait(FeBeWaitSet, timeout, &event, 1, wait_event) == 1 &&
(event.events & WL_POSTMASTER_DEATH))
proc_exit(1);
+
+ ConditionVariableCancelSleep();
}
/*
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index ff25aa70a8..c05f31e03c 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -17,6 +17,7 @@
#include "nodes/nodes.h"
#include "nodes/replnodes.h"
#include "replication/syncrep.h"
+#include "storage/condition_variable.h"
#include "storage/latch.h"
#include "storage/shmem.h"
#include "storage/spin.h"
@@ -108,6 +109,8 @@ typedef struct
*/
bool sync_standbys_defined;
+ ConditionVariable cv;
+
WalSnd walsnds[FLEXIBLE_ARRAY_MEMBER];
} WalSndCtlData;
--
2.34.1
Hi,
On 5/9/23 11:00 PM, Andres Freund wrote:
Hi,
On 2023-05-09 13:38:24 -0700, Jeff Davis wrote:
On Tue, 2023-05-09 at 12:02 -0700, Andres Freund wrote:
I don't think the approach of not having any sort of "registry" of
whether
anybody is waiting for the replay position to be updated is
feasible. Iterating over all walsender slots is just too expensive -

Would it work to use a shared counter for the waiters (or, two
counters, one for physical and one for logical), and just early exit if
the count is zero?

That doesn't really fix the problem - once you have a single walsender
connected, performance is bad again.
Just to clarify, do you mean that if there is only one remaining active walsender that, say,
has been located at slot n, then we’d still have to loop from 0 to n in WalSndWakeup()?
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,
On 5/10/23 8:36 AM, Bharath Rupireddy wrote:
On Wed, May 10, 2023 at 12:33 AM Andres Freund <andres@anarazel.de> wrote:
Unfortunately I have found the following commit to have caused a performance
regression:

commit e101dfac3a53c20bfbf1ca85d30a368c2954facf

The problem is that, on a standby, after the change - as needed for the
approach to work - the call to WalSndWakeup() in ApplyWalRecord() happens for
every record, instead of only happening when the timeline is changed (or WAL
is flushed or ...).

WalSndWakeup() iterates over all walsender slots, regardless of whether they
are in use. For each of the walsender slots it acquires a spinlock.

When replaying a lot of small-ish WAL records I found the startup process to
spend the majority of the time in WalSndWakeup(). I've not measured it very
precisely yet, but the overhead is significant (~35% slowdown), even with the
default max_wal_senders. If that's increased substantially, it obviously gets
worse.
Thanks Andres for the call out! I do agree that this is a concern.
The only saving grace is that this is not an issue on the primary.
Yeah.
+1
My current guess is that mis-using a condition variable is the best bet. I
think it should work to use ConditionVariablePrepareToSleep() before a
WalSndWait(), and then ConditionVariableCancelSleep(). I.e. to never use
ConditionVariableSleep(). The latch set from ConditionVariableBroadcast()
would still cause the necessary wakeup.
Yeah, I think that "mis-using" a condition variable is a valid option. Unless I'm missing
something, I don't think there is anything wrong with using a CV that way (aka not using
ConditionVariableTimedSleep() or ConditionVariableSleep() in this particular case).
How about something like the attached? Recovery and subscription tests
don't complain with the patch.
Thanks Bharath for looking at it!
I launched a full Cirrus CI test with it but it failed on one environment (I did not look into the details,
just sharing this here): https://cirrus-ci.com/task/6570140767092736
Also I have a few comments:
@@ -1958,7 +1959,7 @@ ApplyWalRecord(XLogReaderState *xlogreader, XLogRecord *record, TimeLineID *repl
* ------
*/
if (AllowCascadeReplication())
- WalSndWakeup(switchedTLI, true);
+ ConditionVariableBroadcast(&WalSndCtl->cv);
I think the comment above this change needs to be updated.
@@ -3368,9 +3370,13 @@ WalSndWait(uint32 socket_events, long timeout, uint32 wait_event)
WaitEvent event;
ModifyWaitEvent(FeBeWaitSet, FeBeWaitSetSocketPos, socket_events, NULL);
+
+ ConditionVariablePrepareToSleep(&WalSndCtl->cv);
if (WaitEventSetWait(FeBeWaitSet, timeout, &event, 1, wait_event) == 1 &&
(event.events & WL_POSTMASTER_DEATH))
proc_exit(1);
+
+ ConditionVariableCancelSleep();
Maybe worth updating the comment above WalSndWait() too? (to explain why CV handling is part of it).
@@ -108,6 +109,8 @@ typedef struct
*/
bool sync_standbys_defined;
+ ConditionVariable cv;
Worth giving it a more meaningful name? (to give a clue as to when it is used)
I think we also need to update the comment above WalSndWakeup():
"
* For cascading replication we need to wake up physical walsenders separately
* from logical walsenders (see the comment before calling WalSndWakeup() in
* ApplyWalRecord() for more details).
"
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Wednesday, May 10, 2023 2:36 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
On Wed, May 10, 2023 at 12:33 AM Andres Freund <andres@anarazel.de>
wrote:

Unfortunately I have found the following commit to have caused a
performance
regression:
commit e101dfac3a53c20bfbf1ca85d30a368c2954facf
The problem is that, on a standby, after the change - as needed for the
approach to work - the call to WalSndWakeup() in ApplyWalRecord() happens for
every record, instead of only happening when the timeline is changed (or WAL
is flushed or ...).

WalSndWakeup() iterates over all walsender slots, regardless of whether they
are in use. For each of the walsender slots it acquires a spinlock.

When replaying a lot of small-ish WAL records I found the startup process to
spend the majority of the time in WalSndWakeup(). I've not measured it very
precisely yet, but the overhead is significant (~35% slowdown), even with the
default max_wal_senders. If that's increased substantially, it obviously gets
worse.

I played with it on a simple primary -> standby1 -> standby2 setup. I ran
a pgbench script [1] on the primary and counted the number of times
WalSndWakeup() gets called from ApplyWalRecord() and the number of
times the spinlock is acquired/released in WalSndWakeup(). The spinlock is
acquired/released a whopping 21 million times on standby 1 and standby 2 for
just a <5 min pgbench run on the primary:

standby 1:
2023-05-10 05:32:43.249 UTC [1595600] LOG: FOO WalSndWakeup() in
ApplyWalRecord() was called 2176352 times
2023-05-10 05:32:43.249 UTC [1595600] LOG: FOO spinlock
acquisition/release count in WalSndWakeup() is 21763530

standby 2:
2023-05-10 05:32:43.249 UTC [1595625] LOG: FOO WalSndWakeup() in
ApplyWalRecord() was called 2176352 times
2023-05-10 05:32:43.249 UTC [1595625] LOG: FOO spinlock
acquisition/release count in WalSndWakeup() is 21763530

In this case, there is no timeline switch and no logical decoding on
the standby or such, but WalSndWakeup() gets called because the
standby can't make out whether a slot is for logical or physical
replication unless the spinlock is acquired. Before e101dfac3a,
WalSndWakeup() was getting called only when there was a timeline
switch.

The only saving grace is that this is not an issue on the primary.
Yeah.
I don't think the approach of not having any sort of "registry" of whether
anybody is waiting for the replay position to be updated is
feasible. Iterating over all walsender slots is just too expensive -
WalSndWakeup() shows up even if I remove the spinlock (which we likely could,
especially when just checking if the walsender is connected).
Right.
My current guess is that mis-using a condition variable is the best bet. I
think it should work to use ConditionVariablePrepareToSleep() before a
WalSndWait(), and then ConditionVariableCancelSleep(). I.e. to never use
ConditionVariableSleep(). The latch set from ConditionVariableBroadcast()
would still cause the necessary wakeup.

How about something like the attached? Recovery and subscription tests
don't complain with the patch.
Thanks for the patch. I noticed one place where the logic is different from before and
just to confirm:
if (AllowCascadeReplication())
- WalSndWakeup(switchedTLI, true);
+ ConditionVariableBroadcast(&WalSndCtl->cv);
After the change, we wake up physical walsenders regardless of the switchedTLI flag.
Is this intentional? If so, I think it would be better to update the comments above this.
Best Regards,
Hou zj
On Wed, May 10, 2023 at 3:41 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
On Wednesday, May 10, 2023 2:36 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
My current guess is that mis-using a condition variable is the best bet. I
think it should work to use ConditionVariablePrepareToSleep() before a
WalSndWait(), and then ConditionVariableCancelSleep(). I.e. to never use
ConditionVariableSleep(). The latch set from ConditionVariableBroadcast()
would still cause the necessary wakeup.

How about something like the attached? Recovery and subscription tests
don't complain with the patch.

Thanks for the patch. I noticed one place where the logic is different from before and
just to confirm:

if (AllowCascadeReplication())
-     WalSndWakeup(switchedTLI, true);
+     ConditionVariableBroadcast(&WalSndCtl->cv);

After the change, we wake up physical walsenders regardless of the switchedTLI flag.
Is this intentional? If so, I think it would be better to update the comments above this.
This raises the question of whether we need this condition variable
logic for physical walsenders?
--
With Regards,
Amit Kapila.
On Wed, May 10, 2023 at 7:33 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, May 10, 2023 at 3:41 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Wednesday, May 10, 2023 2:36 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
My current guess is that mis-using a condition variable is the best bet. I
think it should work to use ConditionVariablePrepareToSleep() before a
WalSndWait(), and then ConditionVariableCancelSleep(). I.e. to never use
ConditionVariableSleep(). The latch set from ConditionVariableBroadcast()
would still cause the necessary wakeup.

How about something like the attached? Recovery and subscription tests
don't complain with the patch.

Thanks for the patch. I noticed one place where the logic is different from before and
just to confirm:

if (AllowCascadeReplication())
-     WalSndWakeup(switchedTLI, true);
+     ConditionVariableBroadcast(&WalSndCtl->cv);

After the change, we wake up physical walsenders regardless of the switchedTLI flag.
Is this intentional? If so, I think it would be better to update the comments above this.

This raises the question of whether we need this condition variable
logic for physical walsenders?
It sounds like a good idea. We can have two condition variables for
logical and physical walsenders, and selectively wake up walsenders
sleeping on the condition variables. It should work, though it seems
like a bit of a hack.
Regards,
[1]: /messages/by-id/2d314c22b9e03415aa1c7d8fd1f698dae60effa7.camel@j-davis.com
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Wednesday, May 10, 2023 3:03 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
Unfortunately I have found the following commit to have caused a
performance
regression:

commit e101dfac3a53c20bfbf1ca85d30a368c2954facf
Author: Andres Freund <andres@anarazel.de>
Date: 2023-04-08 00:24:24 -0700

For cascading replication, wake physical and logical walsenders separately
Physical walsenders can't send data until it's been flushed; logical
walsenders can't decode and send data until it's been applied. On the
standby, the WAL is flushed first, which will only wake up physical
walsenders; and then applied, which will only wake up logical
walsenders.

Previously, all walsenders were awakened when the WAL was flushed. That
was fine for logical walsenders on the primary; but on the standby the
flushed WAL would have been not applied yet, so logical walsenders were
awakened too early.

Per idea from Jeff Davis and Amit Kapila.
Author: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com>
Reviewed-By: Jeff Davis <pgsql@j-davis.com>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: /messages/by-id/CAA4eK1+zO5LUeisabX10c81LU-fWMKO4M9Wyg1cdkbW7Hqh6vQ@mail.gmail.com

The problem is that, on a standby, after the change - as needed for the
approach to work - the call to WalSndWakeup() in ApplyWalRecord() happens for
every record, instead of only happening when the timeline is changed (or WAL
is flushed or ...).

WalSndWakeup() iterates over all walsender slots, regardless of whether they
are in use. For each of the walsender slots it acquires a spinlock.

When replaying a lot of small-ish WAL records I found the startup process to
spend the majority of the time in WalSndWakeup(). I've not measured it very
precisely yet, but the overhead is significant (~35% slowdown), even with the
default max_wal_senders. If that's increased substantially, it obviously gets
worse.
I did some simple tests for this to see the performance impact on
streaming replication; just sharing it here for reference.

1) Sync primary-standby setup: load data on the primary and count the time spent on
replication. The degradation becomes more obvious as the value of max_wal_senders
increases.
max_wal_senders before(ms) after(ms) degradation
100 13394.4013 14141.2615 5.58%
200 13280.8507 14597.1173 9.91%
300 13535.0232 16735.7379 23.65%
2) Similar to 1), but count the time that the standby startup process spent on
replaying WAL (via gprof).
10 senders
===========
before
% cumulative self self total
time seconds seconds calls s/call s/call name
4.12 0.45 0.11 1 0.11 2.46 PerformWalRecovery
after
% cumulative self self total
time seconds seconds calls s/call s/call name
17.99 0.59 0.59 4027383 0.00 0.00 WalSndWakeup
8.23 0.86 0.27 1 0.27 3.11 PerformWalRecovery
100 senders
===========
before
% cumulative self self total
time seconds seconds calls s/call s/call name
5.56 0.36 0.18 1 0.18 2.91 PerformWalRecovery
after
% cumulative self self total
time seconds seconds calls s/call s/call name
64.65 4.39 4.39 4027383 0.00 0.00 WalSndWakeup
2.95 4.59 0.20 1 0.20 6.62 PerformWalRecovery
Will test after applying the latest patch in this thread later.
Best Regards,
Hou zj
On Wed, May 10, 2023 at 3:23 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:
My current guess is that mis-using a condition variable is the best bet. I
think it should work to use ConditionVariablePrepareToSleep() before a
WalSndWait(), and then ConditionVariableCancelSleep(). I.e. to never use
ConditionVariableSleep(). The latch set from ConditionVariableBroadcast()
would still cause the necessary wakeup.

How about something like the attached? Recovery and subscription tests
don't complain with the patch.

I launched a full Cirrus CI test with it but it failed on one environment (I did not look into the details,
just sharing this here): https://cirrus-ci.com/task/6570140767092736
Yeah, v1 had ConditionVariableInit() such that the CV was getting
initialized for every backend as opposed to just once after the WAL
sender shmem was created.
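(I.e., in v2 the initialization sits in the first-time-through branch of
WalSndShmemInit(), roughly like this - other init steps elided:)

    WalSndCtl = (WalSndCtlData *)
        ShmemInitStruct("Wal Sender Ctl", WalSndShmemSize(), &found);

    if (!found)
    {
        /* first time through: initialize shared state exactly once */
        MemSet(WalSndCtl, 0, WalSndShmemSize());

        for (i = 0; i < max_wal_senders; i++)
            SpinLockInit(&WalSndCtl->walsnds[i].mutex);

        ConditionVariableInit(&WalSndCtl->physicalWALSndCV);
        ConditionVariableInit(&WalSndCtl->logicalWALSndCV);
    }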
Also I have a few comments:
Indeed, v1 was a WIP patch. Please have a look at the attached v2
patch, which has comments and passing CI runs on all platforms -
https://github.com/BRupireddy/postgres/tree/optimize_walsender_wakeup_logic_v2.
On Wed, May 10, 2023 at 3:41 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
if (AllowCascadeReplication())
-     WalSndWakeup(switchedTLI, true);
+     ConditionVariableBroadcast(&WalSndCtl->cv);

After the change, we wake up physical walsenders regardless of the switchedTLI flag.
Is this intentional? If so, I think it would be better to update the comments above this.
That's not the case with the attached v2 patch. Please have a look.
On Thu, May 11, 2023 at 10:27 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
We can have two condition variables for
logical and physical walsenders, and selectively wake up walsenders
sleeping on the condition variables. It should work, though it seems
like a bit of a hack.
Andres rightly put it - 'mis-using' CV infrastructure. It is simple,
works, and makes WalSndWakeup() cheap, solving the performance
regression.
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v2-0001-Optimize-walsender-wake-up-logic-with-Conditional.patch
From 02c4770964737d1e7a3c7d579080bc6c5bc01fdc Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Fri, 12 May 2023 11:30:14 +0000
Subject: [PATCH v2] Optimize walsender wake up logic with Conditional
Variables
WalSndWakeup() currently loops through all the walsenders slots,
with a spinlock acquisition and release for every iteration, just
to wake up only the waiting walsenders. WalSndWakeup() gets
costlier even when there's a single walsender available on the
standby (i.e., a cascading replica or a logical decoding client).
This wasn't a problem before e101dfac3a53c. But, for allowing
logical decoding on standby, we needed to wake up logical
walsenders after every WAL record is applied on the standby. This
really made WalSndWakeup() costly, causing performance regression.
To solve this, we use condition variable (CV) to efficiently wake
up walsenders in WalSndWakeup().
Every walsender prepares to sleep on a shared memory CV. Note that
it just prepares to sleep on the CV (i.e., adds itself to the CV's
waitlist), but not actually waits on the CV (IOW, it never calls
ConditionVariableSleep()). It still uses WaitEventSetWait() for
waiting, because CV infrastructure doesn't handle FeBe socket
events currently. The processes (startup process, walreceiver
etc.) wanting to wake up walsenders use
ConditionVariableBroadcast(), which in turn calls SetLatch(),
helping walsenders come out of WaitEventSetWait().
This approach is simple and efficient because it makes
WalSndWakeup() life easy.
When available, WaitEventSetWait() can be replaced with its CV's
counterpart.
And, we use separate shared memory CVs for physical and logical
walsenders for selective wake ups, see WalSndWakeup() for more
details.
Reported-by: Andres Freund
Suggested-by: Andres Freund
Author: Bharath Rupireddy
Reviewed-by: Drouvot, Bertrand
Reviewed-by: Zhijie Hou
Discussion: https://www.postgresql.org/message-id/20230509190247.3rrplhdgem6su6cg%40awork3.anarazel.de
---
src/backend/replication/walsender.c | 72 ++++++++++++++-------
src/include/replication/walsender_private.h | 7 ++
2 files changed, 55 insertions(+), 24 deletions(-)
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 45b8b3684f..dc4d376e7c 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -3309,6 +3309,9 @@ WalSndShmemInit(void)
SpinLockInit(&walsnd->mutex);
}
+
+ ConditionVariableInit(&WalSndCtl->physicalWALSndCV);
+ ConditionVariableInit(&WalSndCtl->logicalWALSndCV);
}
}
@@ -3330,31 +3333,17 @@ WalSndShmemInit(void)
void
WalSndWakeup(bool physical, bool logical)
{
- int i;
-
- for (i = 0; i < max_wal_senders; i++)
- {
- Latch *latch;
- ReplicationKind kind;
- WalSnd *walsnd = &WalSndCtl->walsnds[i];
-
- /*
- * Get latch pointer with spinlock held, for the unlikely case that
- * pointer reads aren't atomic (as they're 8 bytes). While at it, also
- * get kind.
- */
- SpinLockAcquire(&walsnd->mutex);
- latch = walsnd->latch;
- kind = walsnd->kind;
- SpinLockRelease(&walsnd->mutex);
-
- if (latch == NULL)
- continue;
+ /*
+ * Let's wake up all the waiting physical and/or logical walsenders using
+ * their respective condition variables (CVs). Note that every walsender
+ * would have prepared to sleep on the CV (i.e., added itself to the CV's
+ * waitlist) in WalSndWait before actually waiting.
+ */
+ if (physical)
+ ConditionVariableBroadcast(&WalSndCtl->physicalWALSndCV);
- if ((physical && kind == REPLICATION_KIND_PHYSICAL) ||
- (logical && kind == REPLICATION_KIND_LOGICAL))
- SetLatch(latch);
- }
+ if (logical)
+ ConditionVariableBroadcast(&WalSndCtl->logicalWALSndCV);
}
/*
@@ -3368,9 +3357,44 @@ WalSndWait(uint32 socket_events, long timeout, uint32 wait_event)
WaitEvent event;
ModifyWaitEvent(FeBeWaitSet, FeBeWaitSetSocketPos, socket_events, NULL);
+
+ /*
+ * We use condition variable (CV) to efficiently wake up walsenders in
+ * WalSndWakeup().
+ *
+ * Every walsender prepares to sleep on a shared memory CV. Note that it
+ * just prepares to sleep on the CV (i.e., adds itself to the CV's
+ * waitlist), but not actually waits on the CV (IOW, it never calls
+ * ConditionVariableSleep()). It still uses WaitEventSetWait() for waiting,
+ * because CV infrastructure doesn't handle FeBe socket events currently.
+ * The processes (startup process, walreceiver etc.) wanting to wake up
+ * walsenders use ConditionVariableBroadcast(), which in turn calls
+ * SetLatch(), helping walsenders come out of WaitEventSetWait().
+ *
+ * This approach is simple and efficient because, one doesn't have to loop
+ * through all the walsenders slots, with a spinlock acquisition and
+ * release for every iteration, just to wake up only the waiting
+ * walsenders. It makes WalSndWakeup() callers life easy.
+ *
+ * XXX: When available, WaitEventSetWait() can be replaced with its CV's
+ * counterpart.
+ *
+ * And, we use separate shared memory CVs for physical and logical
+ * walsenders for selective wake ups, see WalSndWakeup() for more details.
+ */
+ if (MyWalSnd->kind == REPLICATION_KIND_PHYSICAL)
+ ConditionVariablePrepareToSleep(&WalSndCtl->physicalWALSndCV);
+ else if (MyWalSnd->kind == REPLICATION_KIND_LOGICAL)
+ ConditionVariablePrepareToSleep(&WalSndCtl->logicalWALSndCV);
+
if (WaitEventSetWait(FeBeWaitSet, timeout, &event, 1, wait_event) == 1 &&
(event.events & WL_POSTMASTER_DEATH))
+ {
+ ConditionVariableCancelSleep();
proc_exit(1);
+ }
+
+ ConditionVariableCancelSleep();
}
/*
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index ff25aa70a8..4b33a96d25 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -17,6 +17,7 @@
#include "nodes/nodes.h"
#include "nodes/replnodes.h"
#include "replication/syncrep.h"
+#include "storage/condition_variable.h"
#include "storage/latch.h"
#include "storage/shmem.h"
#include "storage/spin.h"
@@ -108,6 +109,12 @@ typedef struct
*/
bool sync_standbys_defined;
+ /* Used to wake up physical walsenders */
+ ConditionVariable physicalWALSndCV;
+
+ /* Used to wake up logical walsenders */
+ ConditionVariable logicalWALSndCV;
+
WalSnd walsnds[FLEXIBLE_ARRAY_MEMBER];
} WalSndCtlData;
--
2.34.1
On Friday, May 12, 2023 7:58 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
On Wed, May 10, 2023 at 3:23 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:

My current guess is that mis-using a condition variable is the best
bet. I think it should work to use
ConditionVariablePrepareToSleep() before a WalSndWait(), and then
ConditionVariableCancelSleep(). I.e. to never use
ConditionVariableSleep(). The latch set from ConditionVariableBroadcast() would still cause the necessary wakeup.

How about something like the attached? Recovery and subscription
tests don't complain with the patch.

I launched a full Cirrus CI test with it but it failed on one
environment (I did not look into the details, just sharing this here):
https://cirrus-ci.com/task/6570140767092736

Yeah, v1 had ConditionVariableInit() such that the CV was getting initialized for
every backend as opposed to just once after the WAL sender shmem was
created.

Also I have a few comments:
Indeed, v1 was a WIP patch. Please have a look at the attached v2 patch, which
has comments and passing CI runs on all platforms -
https://github.com/BRupireddy/postgres/tree/optimize_walsender_wakeup_logic_v2.

On Wed, May 10, 2023 at 3:41 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
wrote:

if (AllowCascadeReplication())
-     WalSndWakeup(switchedTLI, true);
+     ConditionVariableBroadcast(&WalSndCtl->cv);

After the change, we wake up physical walsenders regardless of the switchedTLI
flag.
Is this intentional? If so, I think it would be better to update the comments
above this.
That's not the case with the attached v2 patch. Please have a look.
Thanks for updating the patch. I did a simple primary->standby replication test for the
patch and can see that the degradation doesn't happen in replication after applying it [1].
One nitpick in the comment:
+ * walsenders. It makes WalSndWakeup() callers life easy.
callers life easy => callers' life easy.
[1]:
max_wal_senders = 100
before regression(ms) after regression(ms) v2 patch(ms)
13394.4013 14141.2615 13455.2543
Compared with before regression 5.58% 0.45%
max_wal_senders = 200
before regression(ms) after regression(ms) v2 patch(ms)
13280.8507 14597.1173 13632.0606
Compared with before regression 9.91% 1.64%
max_wal_senders = 300
before regression(ms) after regression(ms) v2 patch(ms)
13535.0232 16735.7379 13705.7135
Compared with before regression 23.65% 1.26%
Best Regards,
Hou zj
On Fri, May 12, 2023 at 11:58 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
Andres rightly put it - 'mis-using' CV infrastructure. It is simple,
works, and makes WalSndWakeup() cheap, solving the performance
regression.
Yeah, this seems OK, and better than the complicated alternatives. If
one day we want to implement CVs some other way so that this
I-know-that-CVs-are-really-made-out-of-latches abstraction leak
becomes a problem, and we still need this, well then we can make a
separate latch-wait-list thing.
Hi,
On 5/15/23 4:19 AM, Zhijie Hou (Fujitsu) wrote:
On Friday, May 12, 2023 7:58 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
On Wed, May 10, 2023 at 3:23 PM Drouvot, Bertrand
That's not the case with the attached v2 patch. Please have a look.
Thanks for V2! It does look good to me and I like the fact that
WalSndWakeup() does not need to loop over all the walsender slots
anymore (for both the physical and logical cases).
Thanks for updating the patch. I did some simple primary->standby replication test for the
patch and can see the degradation doesn't happen in the replication after applying it[1].
Thanks for the performance regression measurement!
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Mon, May 15, 2023 at 1:49 PM Thomas Munro <thomas.munro@gmail.com> wrote:
On Fri, May 12, 2023 at 11:58 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

Andres rightly put it - 'mis-using' CV infrastructure. It is simple,
works, and makes WalSndWakeup() cheap, solving the performance
regression.

Yeah, this seems OK, and better than the complicated alternatives. If
one day we want to implement CVs some other way so that this
I-know-that-CVs-are-really-made-out-of-latches abstraction leak
becomes a problem, and we still need this, well then we can make a
separate latch-wait-list thing.
+1
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Mon, May 15, 2023 at 6:14 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Mon, May 15, 2023 at 1:49 PM Thomas Munro <thomas.munro@gmail.com> wrote:
On Fri, May 12, 2023 at 11:58 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

Andres rightly put it - 'mis-using' CV infrastructure. It is simple,
works, and makes WalSndWakeup() cheap, solving the performance
regression.

Yeah, this seems OK, and better than the complicated alternatives. If
one day we want to implement CVs some other way so that this
I-know-that-CVs-are-really-made-out-of-latches abstraction leak
becomes a problem, and we still need this, well then we can make a
separate latch-wait-list thing.

+1
Thanks for acknowledging the approach.
On Mon, May 15, 2023 at 2:11 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:
Thanks for V2! It does look good to me and I like the fact that
WalSndWakeup() does not need to loop over all the walsender slots
anymore (for both the physical and logical cases).
Indeed, it doesn't have to.
On Mon, May 15, 2023 at 7:50 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
Thanks for updating the patch. I did a simple primary->standby replication test for the
patch and can see that the degradation doesn't happen in replication after applying it [1].

One nitpick in the comment:
+ * walsenders. It makes WalSndWakeup() callers life easy.
callers life easy => callers' life easy.
Changed.
[1]
max_wal_senders = 100
before regression(ms) after regression(ms) v2 patch(ms)
13394.4013 14141.2615 13455.2543
Compared with before regression 5.58% 0.45%

max_wal_senders = 200
before regression(ms) after regression(ms) v2 patch(ms)
13280.8507 14597.1173 13632.0606
Compared with before regression 9.91% 1.64%

max_wal_senders = 300
before regression(ms) after regression(ms) v2 patch(ms)
13535.0232 16735.7379 13705.7135
Compared with before regression 23.65% 1.26%
Yes, the numbers with v2 patch look close to where we were before.
Thanks for confirming. Just wondering, where is this extra
0.45%/1.64%/1.26% coming from?
Please find the attached v3 with the review comment addressed.
Do we need to add an open item for this issue in
https://wiki.postgresql.org/wiki/PostgreSQL_16_Open_Items? If yes, can
anyone in this loop add one?
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v3-0001-Optimize-walsender-wake-up-logic-with-Conditional.patch
From 5a18b515efc8fe8226a4a9f4f56d46e84510dd39 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Mon, 15 May 2023 14:30:55 +0000
Subject: [PATCH v3] Optimize walsender wake up logic with Conditional
Variables
WalSndWakeup() currently loops through all the walsenders slots,
with a spinlock acquisition and release for every iteration, just
to wake up only the waiting walsenders. WalSndWakeup() gets
costlier even when there's a single walsender available on the
standby (i.e., a cascading replica or a logical decoding client).
This wasn't a problem before e101dfac3a53c. But, for allowing
logical decoding on standby, we needed to wake up logical
walsenders after every WAL record is applied on the standby. This
really made WalSndWakeup() costly, causing performance regression.
To solve this, we use condition variable (CV) to efficiently wake
up walsenders in WalSndWakeup().
Every walsender prepares to sleep on a shared memory CV. Note that
it just prepares to sleep on the CV (i.e., adds itself to the CV's
waitlist), but not actually waits on the CV (IOW, it never calls
ConditionVariableSleep()). It still uses WaitEventSetWait() for
waiting, because CV infrastructure doesn't handle FeBe socket
events currently. The processes (startup process, walreceiver
etc.) wanting to wake up walsenders use
ConditionVariableBroadcast(), which in turn calls SetLatch(),
helping walsenders come out of WaitEventSetWait().
This approach is simple and efficient because it makes
WalSndWakeup() life easy.
When available, WaitEventSetWait() can be replaced with its CV's
counterpart.
And, we use separate shared memory CVs for physical and logical
walsenders for selective wake ups, see WalSndWakeup() for more
details.
Reported-by: Andres Freund
Suggested-by: Andres Freund
Author: Bharath Rupireddy
Reviewed-by: Drouvot, Bertrand
Reviewed-by: Zhijie Hou
Discussion: https://www.postgresql.org/message-id/20230509190247.3rrplhdgem6su6cg%40awork3.anarazel.de
---
src/backend/replication/walsender.c | 72 ++++++++++++++-------
src/include/replication/walsender_private.h | 7 ++
2 files changed, 55 insertions(+), 24 deletions(-)
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 45b8b3684f..cf688cee0b 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -3309,6 +3309,9 @@ WalSndShmemInit(void)
SpinLockInit(&walsnd->mutex);
}
+
+ ConditionVariableInit(&WalSndCtl->physicalWALSndCV);
+ ConditionVariableInit(&WalSndCtl->logicalWALSndCV);
}
}
@@ -3330,31 +3333,17 @@ WalSndShmemInit(void)
void
WalSndWakeup(bool physical, bool logical)
{
- int i;
-
- for (i = 0; i < max_wal_senders; i++)
- {
- Latch *latch;
- ReplicationKind kind;
- WalSnd *walsnd = &WalSndCtl->walsnds[i];
-
- /*
- * Get latch pointer with spinlock held, for the unlikely case that
- * pointer reads aren't atomic (as they're 8 bytes). While at it, also
- * get kind.
- */
- SpinLockAcquire(&walsnd->mutex);
- latch = walsnd->latch;
- kind = walsnd->kind;
- SpinLockRelease(&walsnd->mutex);
-
- if (latch == NULL)
- continue;
+ /*
+ * Let's wake up all the waiting physical and/or logical walsenders using
+ * their respective condition variables (CVs). Note that every walsender
+ * would have prepared to sleep on the CV (i.e., added itself to the CV's
+ * waitlist) in WalSndWait before actually waiting.
+ */
+ if (physical)
+ ConditionVariableBroadcast(&WalSndCtl->physicalWALSndCV);
- if ((physical && kind == REPLICATION_KIND_PHYSICAL) ||
- (logical && kind == REPLICATION_KIND_LOGICAL))
- SetLatch(latch);
- }
+ if (logical)
+ ConditionVariableBroadcast(&WalSndCtl->logicalWALSndCV);
}
/*
@@ -3368,9 +3357,44 @@ WalSndWait(uint32 socket_events, long timeout, uint32 wait_event)
WaitEvent event;
ModifyWaitEvent(FeBeWaitSet, FeBeWaitSetSocketPos, socket_events, NULL);
+
+ /*
+ * We use condition variable (CV) to efficiently wake up walsenders in
+ * WalSndWakeup().
+ *
+ * Every walsender prepares to sleep on a shared memory CV. Note that it
+ * just prepares to sleep on the CV (i.e., adds itself to the CV's
+ * waitlist), but not actually waits on the CV (IOW, it never calls
+ * ConditionVariableSleep()). It still uses WaitEventSetWait() for waiting,
+ * because CV infrastructure doesn't handle FeBe socket events currently.
+ * The processes (startup process, walreceiver etc.) wanting to wake up
+ * walsenders use ConditionVariableBroadcast(), which in turn calls
+ * SetLatch(), helping walsenders come out of WaitEventSetWait().
+ *
+ * This approach is simple and efficient because, one doesn't have to loop
+ * through all the walsenders slots, with a spinlock acquisition and
+ * release for every iteration, just to wake up only the waiting
+ * walsenders. It makes WalSndWakeup() callers' life easy.
+ *
+ * XXX: When available, WaitEventSetWait() can be replaced with its CV's
+ * counterpart.
+ *
+ * And, we use separate shared memory CVs for physical and logical
+ * walsenders for selective wake ups, see WalSndWakeup() for more details.
+ */
+ if (MyWalSnd->kind == REPLICATION_KIND_PHYSICAL)
+ ConditionVariablePrepareToSleep(&WalSndCtl->physicalWALSndCV);
+ else if (MyWalSnd->kind == REPLICATION_KIND_LOGICAL)
+ ConditionVariablePrepareToSleep(&WalSndCtl->logicalWALSndCV);
+
if (WaitEventSetWait(FeBeWaitSet, timeout, &event, 1, wait_event) == 1 &&
(event.events & WL_POSTMASTER_DEATH))
+ {
+ ConditionVariableCancelSleep();
proc_exit(1);
+ }
+
+ ConditionVariableCancelSleep();
}
/*
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index ff25aa70a8..4b33a96d25 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -17,6 +17,7 @@
#include "nodes/nodes.h"
#include "nodes/replnodes.h"
#include "replication/syncrep.h"
+#include "storage/condition_variable.h"
#include "storage/latch.h"
#include "storage/shmem.h"
#include "storage/spin.h"
@@ -108,6 +109,12 @@ typedef struct
*/
bool sync_standbys_defined;
+ /* Used to wake up physical walsenders */
+ ConditionVariable physicalWALSndCV;
+
+ /* Used to wake up logical walsenders */
+ ConditionVariable logicalWALSndCV;
+
WalSnd walsnds[FLEXIBLE_ARRAY_MEMBER];
} WalSndCtlData;
--
2.34.1
Hi,
On 5/15/23 4:39 PM, Bharath Rupireddy wrote:
On Mon, May 15, 2023 at 6:14 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Mon, May 15, 2023 at 1:49 PM Thomas Munro <thomas.munro@gmail.com> wrote:
Do we need to add an open item for this issue in
https://wiki.postgresql.org/wiki/PostgreSQL_16_Open_Items? If yes, can
anyone in this loop add one?
I do think we need one for this issue, so I just added it.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,
On 2023-05-10 08:39:08 +0200, Drouvot, Bertrand wrote:
On 5/9/23 11:00 PM, Andres Freund wrote:
Hi,
On 2023-05-09 13:38:24 -0700, Jeff Davis wrote:
On Tue, 2023-05-09 at 12:02 -0700, Andres Freund wrote:
I don't think the approach of not having any sort of "registry" of
whether
anybody is waiting for the replay position to be updated is
feasible. Iterating over all walsender slots is just too expensive -

Would it work to use a shared counter for the waiters (or, two
counters, one for physical and one for logical), and just early exit if
the count is zero?

That doesn't really fix the problem - once you have a single walsender
connected, performance is bad again.

Just to clarify, do you mean that if there is only one remaining active walsender that, say,
has been located at slot n, then we’d still have to loop from 0 to n in WalSndWakeup()?
I understood Jeff's proposal to just have an early exit if there are no
walsenders connected at all. But yes, even if we stopped iterating after
finding the number of slots we needed to, having to iterate over empty slots
would be an issue.
But TBH, even if we only did work for connected walsenders, I think this would
still be a performance issue. Acquiring O(#connected-walsenders) spinlocks for
every record is just too expensive.
Greetings,
Andres Freund
Hi,
On 2023-05-11 09:42:39 +0000, Zhijie Hou (Fujitsu) wrote:
I did some simple tests for this to see the performance impact on
streaming replication; just sharing it here for reference.

1) Sync primary-standby setup: load data on the primary and count the time spent on
replication. The degradation becomes more obvious as the value of max_wal_senders
increases.
FWIW, using syncrep likely under-estimates the overhead substantially, because
that includes a lot overhead on the WAL generating side. I saw well over 20%
overhead for the default max_wal_senders=10.
I just created a standby, shut it down, then ran a deterministically-sized
workload on the primary, started the standby, and measured how long it took to
catch up. I just used the log messages to measure the time.
2) Similar to 1), but count the time that the standby startup process spent on
replaying WAL (via gprof).
I don't think that's the case here, but IME gprof's overhead is so high that
it can move bottlenecks quite drastically. The problem is that it adds code to
every function enter/exit - for simple functions, that overhead is much higher
than the "original" cost of the function.
gprof style instrumentation is good for things like code coverage, but for
performance evaluation it's normally better to use a sampling profiler like
perf. That also causes slowdowns, but largely only in places that already take
up substantial execution time.
Greetings,
Andres Freund
On Wed, 2023-05-17 at 12:34 -0700, Andres Freund wrote:
I understood Jeff's proposal to just have an early exit if there are
no
walsenders connected at all.
My suggestion was we early exit unless there is at least one *waiting*
walsender of the appropriate type. In other words, increment the
counter on entry to WalSndWait() and decrement on exit. I didn't look
in detail yet, so perhaps it's not easy/cheap to do that safely.
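Something like this, with hypothetical per-kind atomic counters in
WalSndCtlData:

    /* in WalSndWait(): advertise ourselves as a waiter of our kind */
    pg_atomic_uint32 *waiters =
        (MyWalSnd->kind == REPLICATION_KIND_PHYSICAL)
        ? &WalSndCtl->physical_waiters
        : &WalSndCtl->logical_waiters;

    pg_atomic_fetch_add_u32(waiters, 1);
    /* ... WaitEventSetWait() as today ... */
    pg_atomic_fetch_sub_u32(waiters, 1);

The "safely" part is the race where WalSndWakeup() reads zero just before
our increment becomes visible; the waiter would have to recheck its wakeup
condition after incrementing, probably with a barrier in between.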
Regards,
Jeff Davis
Hi,
Thanks for working on the patch!
On 2023-05-15 20:09:00 +0530, Bharath Rupireddy wrote:
[1]
max_wal_senders = 100
before regression(ms) after regression(ms) v2 patch(ms)
13394.4013 14141.2615 13455.2543
Compared with before regression 5.58% 0.45%

max_wal_senders = 200
before regression(ms) after regression(ms) v2 patch(ms)
13280.8507 14597.1173 13632.0606
Compared with before regression 9.91% 1.64%

max_wal_senders = 300
before regression(ms) after regression(ms) v2 patch(ms)
13535.0232 16735.7379 13705.7135
Compared with before regression 23.65% 1.26%

Yes, the numbers with v2 patch look close to where we were before.
Thanks for confirming. Just wondering, where is this extra
0.45%/1.64%/1.26% coming from?
We still do more work for each WAL record than before, so I'd expect something
small. I'd say right now the main overhead with the patch comes from the
spinlock acquisitions in ConditionVariableBroadcast(), which happen even when
nobody is waiting.
I'll try to come up with a benchmark without the issues I pointed out in
/messages/by-id/20230517194331.ficfy5brpfq5lrmz@awork3.anarazel.de
+	ConditionVariableInit(&WalSndCtl->physicalWALSndCV);
+	ConditionVariableInit(&WalSndCtl->logicalWALSndCV);
It's not obvious to me that it's worth having two CVs, because it's more
expensive to find no waiters in two CVs than to find no waiters in one CV.
+	/*
+	 * We use condition variable (CV) to efficiently wake up walsenders in
+	 * WalSndWakeup().
+	 *
+	 * Every walsender prepares to sleep on a shared memory CV. Note that it
+	 * just prepares to sleep on the CV (i.e., adds itself to the CV's
+	 * waitlist), but not actually waits on the CV (IOW, it never calls
+	 * ConditionVariableSleep()). It still uses WaitEventSetWait() for waiting,
+	 * because CV infrastructure doesn't handle FeBe socket events currently.
+	 * The processes (startup process, walreceiver etc.) wanting to wake up
+	 * walsenders use ConditionVariableBroadcast(), which in turn calls
+	 * SetLatch(), helping walsenders come out of WaitEventSetWait().
+	 *
+	 * This approach is simple and efficient because, one doesn't have to loop
+	 * through all the walsenders slots, with a spinlock acquisition and
+	 * release for every iteration, just to wake up only the waiting
+	 * walsenders. It makes WalSndWakeup() callers life easy.
+	 *
+	 * XXX: When available, WaitEventSetWait() can be replaced with its CV's
+	 * counterpart.
I don't really understand that XXX - the potential bright future would be to
add support for CVs into wait event sets, not to replace WES with a CV?
Greetings,
Andres Freund
Hi,
On 2023-05-17 12:53:15 -0700, Andres Freund wrote:
I'll try to come up with a benchmark without the issues I pointed out in
/messages/by-id/20230517194331.ficfy5brpfq5lrmz@awork3.anarazel.de
Here we go:
setup:
create primary
SELECT pg_create_physical_replication_slot('reserve', true);
create standby using pg_basebackup
create WAL:
psql -c 'CREATE TABLE testtable_logged(other_data int default 1);' && \
c=16; PGOPTIONS='-c synchronous_commit=off' /path-to-pgbench --random-seed=0 -n -c$c -j$c -t1000 -P1 -f <( echo "INSERT INTO testtable_logged SELECT generate_series(1, 1000)" ) && \
psql -c "SELECT pg_create_restore_point('end');"
benchmark:
rm -rf /tmp/test && \
cp -ar /srv/dev/pgdev-dev-standby /tmp/test && \
cp -ar /srv/dev/pgdev-dev/pg_wal/* /tmp/test/pg_wal/ && \
sync && \
/usr/bin/time -f '%es' /path-to-postgres -D /tmp/test -c recovery_target_action=shutdown -c recovery_target_name=end -c shared_buffers=1GB -c fsync=off
That way I can measure how long it takes to replay exactly the same WAL, and
also take profiles of exactly the same work, without influencing time results.
I copy the WAL files to the primary to ensure that walreceiver (standby) /
walsender (primary) performance doesn't make the result variability higher.
max_walsenders=10 max_walsenders=100
e101dfac3a5 reverted 7.01s 7.02s
093e5c57d50 / HEAD 8.25s 19.91s
bharat-v3 7.14s 7.13s
So indeed, bharat-v3 largely fixes the issue.
The regression of v3 compared to e101dfac3a5-reverted seems pretty constant at
~0.982x (7.01s/7.14s), independent of the concrete max_walsenders value. Which
makes sense, the work is constant.
To make it more extreme, I also tested a workload that is basically free to replay:
c=16; /srv/dev/build/m-opt/src/bin/pgbench/pgbench --random-seed=0 -n -c$c -j$c -t5000 -P1 -f <( echo "SELECT pg_logical_emit_message(false, 'c', 'a') FROM generate_series(1, 1000)" ) && psql -c "SELECT pg_create_restore_point('end');"
max_walsenders=10 max_walsenders=100
e101dfac3a5 reverted 1.70s 1.70s
093e5c57d50 / HEAD 3.00s 14.56s
bharat-v3 1.88s 1.88s
In this extreme workload we still regress, to ~0.904x (1.70s/1.88s).
I'm not sure how much it's worth worrying about that - this is a quite
unrealistic testcase.
FWIW, if I just make WalSndWakeup() do nothing, I still see a very small, but
reproducible, overhead: 1.72s - that's just the cost of the additional
external function call.
If I add a no-waiters fastpath using proclist_is_empty() to
ConditionVariableBroadcast(), I get 1.77s. So the majority of the remaining
slowdown indeed comes from the spinlock acquisition in
ConditionVariableBroadcast().
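A minimal sketch of such a fastpath (illustrative only, not necessarily the
exact change tested here; reading the waitlist without holding cv->mutex is
racy and would need memory-ordering analysis):

/* At the top of ConditionVariableBroadcast(), before taking cv->mutex: */
if (proclist_is_empty(&cv->wakeup))
	return;				/* nobody has prepared to sleep; nothing to wake */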
Greetings,
Andres Freund
On Thu, May 18, 2023 at 1:23 AM Andres Freund <andres@anarazel.de> wrote:
On 2023-05-15 20:09:00 +0530, Bharath Rupireddy wrote:
[1]
max_wal_senders = 100
before regression(ms)   after regression(ms)   v2 patch(ms)
13394.4013              14141.2615             13455.2543
Compared with before regression: 5.58% (after), 0.45% (v2 patch)

max_wal_senders = 200
before regression(ms)   after regression(ms)   v2 patch(ms)
13280.8507              14597.1173             13632.0606
Compared with before regression: 9.91% (after), 1.64% (v2 patch)

max_wal_senders = 300
before regression(ms)   after regression(ms)   v2 patch(ms)
13535.0232              16735.7379             13705.7135
Compared with before regression: 23.65% (after), 1.26% (v2 patch)

Yes, the numbers with v2 patch look close to where we were before.
Thanks for confirming. Just wondering, where is this extra
0.45%/1.64%/1.26% coming from?

We still do more work for each WAL record than before, so I'd expect something
small. I'd say right now the main overhead with the patch comes from the
spinlock acquisitions in ConditionVariableBroadcast(), which happen even when
nobody is waiting.
+ ConditionVariableInit(&WalSndCtl->physicalWALSndCV);
+ ConditionVariableInit(&WalSndCtl->logicalWALSndCV);

It's not obvious to me that it's worth having two CVs, because it's more
expensive to find no waiters in two CVs than to find no waiters in one CV.
I disagree. In the tight per-WAL record recovery loop, WalSndWakeup
wakes up logical walsenders for every WAL record, but it wakes up
physical walsenders only if the applied WAL record causes a TLI
switch. Therefore, the extra cost of spinlock acquire-release for per
WAL record applies only for logical walsenders. On the other hand, if
we were to use a single CV, we would be unnecessarily waking up (if at
all they are sleeping) physical walsenders for every WAL record -
which is costly IMO.
I still think separate CVs are good for selective wake-ups, given that
TLI-switch WAL records are relatively rare.
+ *
+ * XXX: When available, WaitEventSetWait() can be replaced with its CV's
+ * counterpart.

I don't really understand that XXX - the potential bright future would be to
add support for CVs into wait event sets, not to replace WES with a CV?
Yes, that's what I meant; I modified that part to:
* XXX: A desirable future improvement would be to add support for CVs into
* WaitEventSetWait().
FWIW, if I just make WalSndWakeup() do nothing, I still see a very small, but
reproducible, overhead: 1.72s - that's just the cost of the additional
external function call.
Hm, this is unavoidable.
If I add a no-waiters fastpath using proclist_is_empty() to
ConditionVariableBroadcast(), I get 1.77s. So the majority of the remaining
slowdown indeed comes from the spinlock acquisition in
ConditionVariableBroadcast().
Acquiring a spinlock for every replayed WAL record seems not great. In
general, this problem exists for the CV infrastructure as a whole whenever
ConditionVariableBroadcast()/CVB() is called in a loop/hot path. I
think this can be shown easily by running something like SELECT
pg_replication_origin_session_setup()/pg_wal_replay_resume()/pg_create_physical_replication_slot()
FROM generate_series(1, 1000000...); all of these functions end up in
CVB().
Do you think adding a no-waiters fastpath exit to CVB() is worth
implementing? Something like an atomic state variable in each CV,
similar to LWLock's state variable: set it when adding waiters, reset
it when removing waiters, and have CVB() read the state atomically for
the fastpath (of course, we need memory barriers here). This might
complicate things, as each CV structure gets an extra state variable (4
bytes), plus memory barriers and atomic writes and reads.
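To illustrate the idea (the field name and exact placement here are
assumptions, not a worked-out design):

/* Hypothetical extra field in ConditionVariable: */
pg_atomic_uint32 nwaiters;	/* number of processes prepared to sleep */

/* In ConditionVariablePrepareToSleep(), count ourselves as a waiter: */
pg_atomic_fetch_add_u32(&cv->nwaiters, 1);

/* In ConditionVariableCancelSleep(), we are no longer waiting: */
pg_atomic_fetch_sub_u32(&cv->nwaiters, 1);

/* In ConditionVariableBroadcast(), skip the spinlock when nobody waits.
 * A plain atomic read has no barrier semantics, so the ordering against
 * the waiter side would need careful analysis. */
if (pg_atomic_read_u32(&cv->nwaiters) == 0)
	return;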
Or, given that with the patch recovery is the only caller of CVB() in a
hot path, maybe we can add an atomic waiter count to WalSndCtl:
increment it atomically in WalSndWait() before the wait, decrement it
after the wait, and add a fastpath in WalSndWakeup() that reads it
atomically to avoid the CVB() calls. IIUC, this is something Jeff
proposed upthread.
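Roughly like this (the counter names are invented here purely for
illustration):

/* Hypothetical counters in WalSndCtlData: */
pg_atomic_uint32 numPhysicalWaiters;
pg_atomic_uint32 numLogicalWaiters;

/* In WalSndWait(), for a logical walsender, bracketing the wait: */
pg_atomic_fetch_add_u32(&WalSndCtl->numLogicalWaiters, 1);
/* ... ConditionVariablePrepareToSleep() + WaitEventSetWait() as today ... */
pg_atomic_fetch_sub_u32(&WalSndCtl->numLogicalWaiters, 1);

/* In WalSndWakeup(), skip the broadcast when no one is waiting: */
if (logical && pg_atomic_read_u32(&WalSndCtl->numLogicalWaiters) > 0)
	ConditionVariableBroadcast(&WalSndCtl->logicalWALSndCV);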
Thoughts?
Please find the attached v4 patch addressing the review comment (not
the fastpath one).
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v4-0001-Optimize-walsender-wake-up-logic-with-Conditional.patch
From 00234c83cedcb684f4b6e12d22608f929f4d3c0b Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Thu, 18 May 2023 14:15:05 +0000
Subject: [PATCH v4] Optimize walsender wake up logic with Conditional
Variables
WalSndWakeup() currently loops through all the walsenders slots,
with a spinlock acquisition and release for every iteration, just
to wake up only the waiting walsenders. WalSndWakeup() gets
costlier even when there's a single walsender available on the
standby (i.e., a cascading replica or a logical decoding client).
This wasn't a problem before e101dfac3a53c. But, for allowing
logical decoding on standby, we needed to wake up logical
walsenders after every WAL record is applied on the standby. This
really made WalSndWakeup() costly, causing performance regression.
To solve this, we use condition variable (CV) to efficiently wake
up walsenders in WalSndWakeup().
Every walsender prepares to sleep on a shared memory CV. Note that
it just prepares to sleep on the CV (i.e., adds itself to the CV's
waitlist), but does not actually wait on the CV (IOW, it never
calls ConditionVariableSleep()). It still uses WaitEventSetWait()
for waiting, because CV infrastructure doesn't handle FeBe socket
events currently. The processes (startup process, walreceiver etc.)
wanting to wake up walsenders use ConditionVariableBroadcast(),
which in turn calls SetLatch(), helping walsenders come out of
WaitEventSetWait().
This approach is simple and efficient because it makes
WalSndWakeup() callers' life easy. A desirable future improvement would be to
add support for CVs into WaitEventSetWait().
And, we use separate shared memory CVs for physical and logical
walsenders for selective wake ups, see WalSndWakeup() for more
details.
Reported-by: Andres Freund
Suggested-by: Andres Freund
Author: Bharath Rupireddy
Reviewed-by: Drouvot, Bertrand
Reviewed-by: Zhijie Hou
Discussion: https://www.postgresql.org/message-id/20230509190247.3rrplhdgem6su6cg%40awork3.anarazel.de
---
src/backend/replication/walsender.c | 72 ++++++++++++++-------
src/include/replication/walsender_private.h | 7 ++
2 files changed, 55 insertions(+), 24 deletions(-)
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 45b8b3684f..57ef3dbb67 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -3309,6 +3309,9 @@ WalSndShmemInit(void)
SpinLockInit(&walsnd->mutex);
}
+
+ ConditionVariableInit(&WalSndCtl->physicalWALSndCV);
+ ConditionVariableInit(&WalSndCtl->logicalWALSndCV);
}
}
@@ -3330,31 +3333,17 @@ WalSndShmemInit(void)
void
WalSndWakeup(bool physical, bool logical)
{
- int i;
-
- for (i = 0; i < max_wal_senders; i++)
- {
- Latch *latch;
- ReplicationKind kind;
- WalSnd *walsnd = &WalSndCtl->walsnds[i];
-
- /*
- * Get latch pointer with spinlock held, for the unlikely case that
- * pointer reads aren't atomic (as they're 8 bytes). While at it, also
- * get kind.
- */
- SpinLockAcquire(&walsnd->mutex);
- latch = walsnd->latch;
- kind = walsnd->kind;
- SpinLockRelease(&walsnd->mutex);
-
- if (latch == NULL)
- continue;
+ /*
+ * Let's wake up all the waiting physical and/or logical walsenders using
+ * their respective condition variables (CVs). Note that every walsender
+ * would have prepared to sleep on the CV (i.e., added itself to the CV's
+ * waitlist) in WalSndWait before actually waiting.
+ */
+ if (physical)
+ ConditionVariableBroadcast(&WalSndCtl->physicalWALSndCV);
- if ((physical && kind == REPLICATION_KIND_PHYSICAL) ||
- (logical && kind == REPLICATION_KIND_LOGICAL))
- SetLatch(latch);
- }
+ if (logical)
+ ConditionVariableBroadcast(&WalSndCtl->logicalWALSndCV);
}
/*
@@ -3368,9 +3357,44 @@ WalSndWait(uint32 socket_events, long timeout, uint32 wait_event)
WaitEvent event;
ModifyWaitEvent(FeBeWaitSet, FeBeWaitSetSocketPos, socket_events, NULL);
+
+ /*
+ * We use condition variable (CV) to efficiently wake up walsenders in
+ * WalSndWakeup().
+ *
+ * Every walsender prepares to sleep on a shared memory CV. Note that it
+ * just prepares to sleep on the CV (i.e., adds itself to the CV's
+ * waitlist), but does not actually wait on the CV (IOW, it never calls
+ * ConditionVariableSleep()). It still uses WaitEventSetWait() for waiting,
+ * because CV infrastructure doesn't handle FeBe socket events currently.
+ * The processes (startup process, walreceiver etc.) wanting to wake up
+ * walsenders use ConditionVariableBroadcast(), which in turn calls
+ * SetLatch(), helping walsenders come out of WaitEventSetWait().
+ *
+ * This approach is simple and efficient because, one doesn't have to loop
+ * through all the walsenders slots, with a spinlock acquisition and
+ * release for every iteration, just to wake up only the waiting
+ * walsenders. It makes WalSndWakeup() callers' life easy.
+ *
+ * XXX: A desirable future improvement would be to add support for CVs into
+ * WaitEventSetWait().
+ *
+ * And, we use separate shared memory CVs for physical and logical
+ * walsenders for selective wake ups, see WalSndWakeup() for more details.
+ */
+ if (MyWalSnd->kind == REPLICATION_KIND_PHYSICAL)
+ ConditionVariablePrepareToSleep(&WalSndCtl->physicalWALSndCV);
+ else if (MyWalSnd->kind == REPLICATION_KIND_LOGICAL)
+ ConditionVariablePrepareToSleep(&WalSndCtl->logicalWALSndCV);
+
if (WaitEventSetWait(FeBeWaitSet, timeout, &event, 1, wait_event) == 1 &&
(event.events & WL_POSTMASTER_DEATH))
+ {
+ ConditionVariableCancelSleep();
proc_exit(1);
+ }
+
+ ConditionVariableCancelSleep();
}
/*
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index ff25aa70a8..4b33a96d25 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -17,6 +17,7 @@
#include "nodes/nodes.h"
#include "nodes/replnodes.h"
#include "replication/syncrep.h"
+#include "storage/condition_variable.h"
#include "storage/latch.h"
#include "storage/shmem.h"
#include "storage/spin.h"
@@ -108,6 +109,12 @@ typedef struct
*/
bool sync_standbys_defined;
+ /* Used to wake up physical walsenders */
+ ConditionVariable physicalWALSndCV;
+
+ /* Used to wake up logical walsenders */
+ ConditionVariable logicalWALSndCV;
+
WalSnd walsnds[FLEXIBLE_ARRAY_MEMBER];
} WalSndCtlData;
--
2.34.1
At Thu, 18 May 2023 20:11:11 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in
+ ConditionVariableInit(&WalSndCtl->physicalWALSndCV);
+ ConditionVariableInit(&WalSndCtl->logicalWALSndCV);

It's not obvious to me that it's worth having two CVs, because it's more
expensive to find no waiters in two CVs than to find no waiters in one CV.

I disagree. In the tight per-WAL record recovery loop, WalSndWakeup
wakes up logical walsenders for every WAL record, but it wakes up
physical walsenders only if the applied WAL record causes a TLI
switch. Therefore, the extra cost of spinlock acquire-release for per
WAL record applies only for logical walsenders. On the other hand, if
we were to use a single CV, we would be unnecessarily waking up (if at
all they are sleeping) physical walsenders for every WAL record -
which is costly IMO.
As I was reading this, I started thinking that one reason for the
regression could be the excessive frequency of wakeups during logical
replication. In physical replication, we make sure to avoid excessive
wakeups when the stream is tightly packed. I'm just wondering why
logical replication doesn't (or can't) do the same thing, IMHO.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Friday, May 19, 2023 11:08 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
At Thu, 18 May 2023 20:11:11 +0530, Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote in

+ ConditionVariableInit(&WalSndCtl->physicalWALSndCV);
+ ConditionVariableInit(&WalSndCtl->logicalWALSndCV);

It's not obvious to me that it's worth having two CVs, because it's more
expensive to find no waiters in two CVs than to find no waiters in one CV.

I disagree. In the tight per-WAL record recovery loop, WalSndWakeup
wakes up logical walsenders for every WAL record, but it wakes up
physical walsenders only if the applied WAL record causes a TLI
switch. Therefore, the extra cost of spinlock acquire-release for per
WAL record applies only for logical walsenders. On the other hand, if
we were to use a single CV, we would be unnecessarily waking up (if at
all they are sleeping) physical walsenders for every WAL record -
which is costly IMO.

As I was reading this, I started thinking that one reason for the
regression could be the excessive frequency of wakeups during logical
replication. In physical replication, we make sure to avoid excessive
wakeups when the stream is tightly packed. I'm just wondering why
logical replication doesn't (or can't) do the same thing, IMHO.
I thought (from e101dfa's commit message) that physical walsenders can't send
data until it's been flushed, so they only get a wakeup at flush time or on a
TLI switch. A logical walsender can start to decode changes after the change
is applied, and to avoid the case where the walsender is asleep while there's
work to be done, we wake up logical walsenders when applying each record.
Or maybe you mean to wake up (via ConditionVariableBroadcast()) walsenders
only after applying some number of WAL records, like [1]. But it seems that
may delay the wakeup of a walsender a bit (the walsender may be asleep before
reaching the threshold).
[1]
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1833,6 +1833,7 @@ ApplyWalRecord(XLogReaderState *xlogreader, XLogRecord *record, TimeLineID *repl
{
ErrorContextCallback errcallback;
bool switchedTLI = false;
+ static int nreplay = 0;
/* Setup error traceback support for ereport() */
errcallback.callback = rm_redo_error_callback;
@@ -1957,8 +1958,12 @@ ApplyWalRecord(XLogReaderState *xlogreader, XLogRecord *record, TimeLineID *repl
* be created otherwise)
* ------
*/
- if (AllowCascadeReplication())
+ if (AllowCascadeReplication() &&
+ nreplay++ == 100)
+ {
WalSndWakeup(switchedTLI, true);
+ nreplay = 0;
+ }
Best Regards,
Hou zj
Hi,
On 2023-05-19 12:07:56 +0900, Kyotaro Horiguchi wrote:
At Thu, 18 May 2023 20:11:11 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in
+ ConditionVariableInit(&WalSndCtl->physicalWALSndCV);
+ ConditionVariableInit(&WalSndCtl->logicalWALSndCV);

It's not obvious to me that it's worth having two CVs, because it's more
expensive to find no waiters in two CVs than to find no waiters in one CV.

I disagree. In the tight per-WAL record recovery loop, WalSndWakeup
wakes up logical walsenders for every WAL record, but it wakes up
physical walsenders only if the applied WAL record causes a TLI
switch. Therefore, the extra cost of spinlock acquire-release for per
WAL record applies only for logical walsenders. On the other hand, if
we were to use a single CV, we would be unnecessarily waking up (if at
all they are sleeping) physical walsenders for every WAL record -
which is costly IMO.

As I was reading this, I started thinking that one reason for the
regression could be the excessive frequency of wakeups during logical
replication. In physical replication, we make sure to avoid excessive
wakeups when the stream is tightly packed. I'm just wondering why
logical replication doesn't (or can't) do the same thing, IMHO.
It's possible we could try to reduce the frequency by issuing wakeups only at
specific points. The most obvious thing to do would be to wake only when
waiting for more WAL or when crossing a page boundary, or such. Unfortunately
that could easily lead to deadlocks, because the startup process might be
blocked waiting for a lock, held by a backend doing logical decoding - which
can't progress until the startup process wakes the backend up.
So I don't think this is a promising avenue in the near term.
Greetings,
Andres Freund
Hi,
On 2023-05-18 20:11:11 +0530, Bharath Rupireddy wrote:
Please find the attached v4 patch addressing the review comment (not
the fastpath one).
I pushed a mildly edited version of this. I didn't like the name of the CVs
much, so I renamed them to wal_flush_cv/wal_replay_cv. I also did some minor
comment polishing.
Greetings,
Andres Freund
Hi,
On Monday, May 22, 2023 12:11 AM Andres Freund <andres@anarazel.de> wrote:
On 2023-05-19 12:07:56 +0900, Kyotaro Horiguchi wrote:
At Thu, 18 May 2023 20:11:11 +0530, Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote in

+ ConditionVariableInit(&WalSndCtl->physicalWALSndCV);
+ ConditionVariableInit(&WalSndCtl->logicalWALSndCV);

It's not obvious to me that it's worth having two CVs, because it's more
expensive to find no waiters in two CVs than to find no waiters in one CV.

I disagree. In the tight per-WAL record recovery loop, WalSndWakeup
wakes up logical walsenders for every WAL record, but it wakes up
physical walsenders only if the applied WAL record causes a TLI
switch. Therefore, the extra cost of spinlock acquire-release for per
WAL record applies only for logical walsenders. On the other hand, if
we were to use a single CV, we would be unnecessarily waking up (if at
all they are sleeping) physical walsenders for every WAL record -
which is costly IMO.

As I was reading this, I started thinking that one reason for the
regression could be the excessive frequency of wakeups during logical
replication. In physical replication, we make sure to avoid excessive
wakeups when the stream is tightly packed. I'm just wondering why
logical replication doesn't (or can't) do the same thing, IMHO.

It's possible we could try to reduce the frequency by issuing wakeups only at
specific points. The most obvious thing to do would be to wake only when
waiting for more WAL or when crossing a page boundary, or such. Unfortunately
that could easily lead to deadlocks, because the startup process might be
blocked waiting for a lock, held by a backend doing logical decoding - which
can't progress until the startup process wakes the backend up.
Just out of curiosity about the mentioned deadlock scenario; this is not a
comment on the committed patch.

About "a backend doing logical decoding", do you mean the case where a user
starts a backend and invokes pg_logical_slot_get_changes() to do the logical
decoding? If so, it seems the logical decoding in a backend won't be woken up
by the startup process, because the backend isn't registered as a walsender
and so won't be found in WalSndWakeup().

Or do you mean a deadlock between a real logical walsender and the startup
process? (I might be missing something.) I think logical decoding doesn't
lock the target user relation when decoding, because it can normally get the
needed information from WAL. Besides, the walsender sometimes locks a system
table (e.g. using RelidByRelfilenumber() to get the relid), but it unlocks it
after finishing the systable scan.

So, if possible, would you be able to share some details about the deadlock
case you mentioned earlier? It's helpful as it can prevent similar problems
in future development.
Best Regards,
Hou zj
Hi,
On 2023-05-22 12:15:07 +0000, Zhijie Hou (Fujitsu) wrote:
About "a backend doing logical decoding", do you mean the case when a user
start a backend and invoke pg_logical_slot_get_changes() to do the logical
decoding ? If so, it seems the logical decoding in a backend won't be waked up
by startup process because the backend won't be registered as a walsender so
the backend won't be found in WalSndWakeup().
I meant logical decoding happening inside a walsender instance.
Or do you mean a deadlock between a real logical walsender and the startup
process? (I might be missing something.) I think logical decoding doesn't
lock the target user relation when decoding, because it can normally get the
needed information from WAL.
It does lock catalog tables briefly. There's no guarantee that such locks are
released immediately. I forgot the details, but IIRC there are some outfuncs
(enum?) that intentionally delay releasing locks till transaction commit.
Greetings,
Andres Freund
On Tuesday, May 23, 2023 1:53 AM Andres Freund <andres@anarazel.de> wrote:
On 2023-05-22 12:15:07 +0000, Zhijie Hou (Fujitsu) wrote:
About "a backend doing logical decoding", do you mean the case when a
user
start a backend and invoke pg_logical_slot_get_changes() to do the logical
decoding ? If so, it seems the logical decoding in a backend won't be wakedup
by startup process because the backend won't be registered as a walsender
so
the backend won't be found in WalSndWakeup().
I meant logical decoding happening inside a walsender instance.
Or do you mean the deadlock between the real logical walsender and startup
process ? (I might miss something) I think the logical decoding doesn't lock
the target user relation when decoding because it normally can get theneeded
information from WAL.
It does lock catalog tables briefly. There's no guarantee that such locks are
released immediately. I forgot the details, but IIRC there's some outfuncs
(enum?) that intentionally delay releasing locks till transaction commit.
Thanks for the explanation!

I understand that the startup process can take a lock on a catalog (when
replaying a record) which may conflict with a lock in the walsender.

But in the walsender, I think we only start a transaction after entering
ReorderBufferProcessTXN(), and that transaction is ended soon after
processing and outputting the decoded transaction's data (as the comment in
ReorderBufferProcessTXN() says: "all locks acquired in here to be released,
not reassigned to the parent and we do not want any database access have
persistent effects.").

Besides, while processing and outputting the decoded transaction, the
walsender doesn't wait for a wakeup from the startup process (e.g. in
WalSndWaitForWal()); it only waits while data is being sent to the
subscriber. So it seems the lock conflict here won't cause a deadlock for
now, although there may be a risk if we change this logic later. Sorry if I
missed something, and thanks again for your patience in explaining.
Best Regards,
Hou zj
Hi,
On 2023-05-24 05:53:51 +0000, Zhijie Hou (Fujitsu) wrote:
On Tuesday, May 23, 2023 1:53 AM Andres Freund <andres@anarazel.de> wrote:
On 2023-05-22 12:15:07 +0000, Zhijie Hou (Fujitsu) wrote:
About "a backend doing logical decoding", do you mean the case when a
user
start a backend and invoke pg_logical_slot_get_changes() to do the logical
decoding ? If so, it seems the logical decoding in a backend won't be wakedup
by startup process because the backend won't be registered as a walsender
so
the backend won't be found in WalSndWakeup().
I meant logical decoding happening inside a walsender instance.
Or do you mean the deadlock between the real logical walsender and startup
process ? (I might miss something) I think the logical decoding doesn't lock
the target user relation when decoding because it normally can get theneeded
information from WAL.
It does lock catalog tables briefly. There's no guarantee that such locks are
released immediately. I forgot the details, but IIRC there's some outfuncs
(enum?) that intentionally delay releasing locks till transaction commit.Thanks for the explanation !
I understand that the startup process can take lock on the catalog(when
replaying record) which may conflict with the lock in walsender.But in walsender, I think we only start transaction after entering
ReorderBufferProcessTXN(), and the transaction started here will be released
soon after processing and outputting the decoded transaction's data(as the
comment in ReorderBufferProcessTXN() says:" all locks acquired in here to be
released, not reassigned to the parent and we do not want any database access
have persistent effects.").
It's possible that there's no immediate danger - but I wouldn't want to bet on
it staying that way. Think about things like streaming out large transactions
etc. There also are potential dangers outside of plain things like locks -
there could be recovery conflicts on the snapshot or such (although that
normally should be prevented via hot_standby_feedback or such).
Even if there's no hard deadlock issue - not waking up a
walsender-in-logical-decoding that's currently waiting because (replay_count %
N) != 0 means that the walsender might be delayed for quite a while - there
might not be any further records from the primary in the near future.
I don't think any approach purely based on record counts has any chance to
work well. You could combine it with something else, e.g. with always waking
up in XLogPageRead() and WaitForWALToBecomeAvailable(). But then you end up
with something relatively complicated again.
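For illustration, such a combined scheme might look roughly like this (the
threshold constant and the flush location are invented here, and this still
doesn't address the case where the startup process blocks on a lock rather
than on WAL):

/* Hypothetical file-scope state in xlogrecovery.c: */
#define WAKEUP_EVERY_N_RECORDS 100
static int	nreplayed = 0;

/* In ApplyWalRecord(): wake logical walsenders only every N records,
 * or immediately on a timeline switch. */
if (AllowCascadeReplication() &&
	(switchedTLI || ++nreplayed >= WAKEUP_EVERY_N_RECORDS))
{
	WalSndWakeup(switchedTLI, true);
	nreplayed = 0;
}

/* In WaitForWALToBecomeAvailable(): before the startup process itself
 * goes to sleep, flush any pending wakeup so walsenders aren't stranded
 * on the last few replayed records. */
if (AllowCascadeReplication() && nreplayed > 0)
{
	WalSndWakeup(false, true);
	nreplayed = 0;
}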
Greetings,
Andres Freund