[BUG]: the walsender does not update its IO statistics until it exits

Started by Bertrand Drouvot, 11 months ago, 47 messages
#1 Bertrand Drouvot
bertranddrouvot.pg@gmail.com
1 attachment(s)

Hi hackers,

while doing some tests for [1], I observed that $SUBJECT.

To observe this behavior on master:

1. create a logical replication slot

postgres=# SELECT * FROM pg_create_logical_replication_slot('logical_slot', 'test_decoding', false, true);
slot_name | lsn
--------------+------------
logical_slot | 0/40749508
(1 row)

2. create a table and add some data

postgres=# create table bdt (a int);
CREATE TABLE
postgres=# insert into bdt select a from generate_series(1,10000) a ;
INSERT 0 10000

3. start pg_recvlogical this way:

pg_recvlogical -d postgres -S logical_slot -f - --no-loop --start

4. query pg_stat_io

postgres=# select backend_type,object,context,reads,read_bytes from pg_stat_io where backend_type = 'walsender';
backend_type | object | context | reads | read_bytes
--------------+---------------+-----------+-------+------------
walsender | relation | bulkread | 0 | 0
walsender | relation | bulkwrite | 0 | 0
walsender | relation | init | 0 | 0
walsender | relation | normal | 6 | 49152
walsender | relation | vacuum | 0 | 0
walsender | temp relation | normal | 0 | 0
walsender | wal | init | |
walsender | wal | normal | 0 | 0
(8 rows)

The non-zero stats that we see here are due to the pgstat_report_stat() call in
PostgresMain(), not to the walsender decoding activity per se (proof is that
the wal object values are still empty while the walsender certainly had to read
some WAL).
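
As a side check (just a sketch, not required for the repro; output will vary),
one can confirm from a shell that the walsender is connected and streaming
while its wal rows above stay at zero:

psql -d postgres -c "SELECT pid, state, sent_lsn FROM pg_stat_replication;"
psql -d postgres -c "SELECT reads, read_bytes FROM pg_stat_io
                     WHERE backend_type = 'walsender' AND object = 'wal';"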

5. Once pg_recvlogical is interrupted with Ctrl-C, we get:

postgres=# select backend_type,object,context,reads,read_bytes from pg_stat_io where backend_type = 'walsender';
backend_type | object | context | reads | read_bytes
--------------+---------------+-----------+-------+------------
walsender | relation | bulkread | 0 | 0
walsender | relation | bulkwrite | 0 | 0
walsender | relation | init | 0 | 0
walsender | relation | normal | 9 | 73728
walsender | relation | vacuum | 0 | 0
walsender | temp relation | normal | 0 | 0
walsender | wal | init | |
walsender | wal | normal | 98 | 793856
(8 rows)

Now we can see that the numbers increased for the relation object and that we
get non-zero numbers for the wal object too (which makes perfect sense).

With the attached patch applied, we would already get the same numbers in
step 4 (meaning the stats are flushed without waiting for the walsender
to exit).

Remarks:

R1. The extra flushes are done in WalSndLoop(): I believe this is the right place
for them.
R2. A test is added in 035_standby_logical_decoding.pl: while this TAP test
is already racy ([2]), it looks like a good place as we don't want pg_recvlogical
to stop/exit.
R3. The test can also be back-patched down to 16_STABLE, as 035_standby_logical_decoding.pl
was introduced in 16 (and so was pg_stat_io).
R4. The test fails if the extra flushes are not done, which makes perfect
sense.

[1]: /messages/by-id/Z3zqc4o09dM/Ezyz@ip-10-97-1-34.eu-west-3.compute.internal
[2]: /messages/by-id/Z6oRgmD8m7zBo732@ip-10-97-1-34.eu-west-3.compute.internal

Looking forward to your feedback,

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v1-0001-Flush-the-IO-statistics-of-active-walsenders.patch (text/x-diff; charset=us-ascii)
From b16a0ccada3c2acc80bf733e71935677b9ab6382 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Date: Tue, 25 Feb 2025 10:18:05 +0000
Subject: [PATCH v1] Flush the IO statistics of active walsenders

The walsender does not flush its IO statistics until it exits.
The issue is there since pg_stat_io has been introduced in a9c70b46dbe.
This commit:

1. ensures it does not wait to exit to flush its IO statistics
2. adds a test
---
 src/backend/replication/walsender.c               |  7 +++++++
 .../recovery/t/035_standby_logical_decoding.pl    | 15 +++++++++++++++
 2 files changed, 22 insertions(+)
  26.2% src/backend/replication/
  73.7% src/test/recovery/t/

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 446d10c1a7d..3664d7d0c75 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -90,6 +90,7 @@
 #include "utils/guc.h"
 #include "utils/memutils.h"
 #include "utils/pg_lsn.h"
+#include "utils/pgstat_internal.h"
 #include "utils/ps_status.h"
 #include "utils/timeout.h"
 #include "utils/timestamp.h"
@@ -2793,6 +2794,12 @@ WalSndLoop(WalSndSendDataCallback send_data)
 		if (pq_flush_if_writable() != 0)
 			WalSndShutdown();
 
+		/*
+		 * Report IO statistics
+		 */
+		pgstat_flush_io(false);
+		(void) pgstat_flush_backend(false, PGSTAT_BACKEND_FLUSH_IO);
+
 		/* If nothing remains to be sent right now ... */
 		if (WalSndCaughtUp && !pq_is_send_pending())
 		{
diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index 8903177d883..ab34d290165 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -543,6 +543,9 @@ $node_subscriber->stop;
 reactive_slots_change_hfs_and_wait_for_xmins('behaves_ok_', 'vacuum_full_',
 	0, 1);
 
+# To check that an active walsender updates its IO statistics below.
+$node_standby->safe_psql('testdb', "SELECT pg_stat_reset_shared('io')");
+
 # Ensure that replication slot stats are not empty before triggering the
 # conflict.
 $node_primary->safe_psql('testdb',
@@ -552,6 +555,18 @@ $node_standby->poll_query_until('testdb',
 	qq[SELECT total_txns > 0 FROM pg_stat_replication_slots WHERE slot_name = 'vacuum_full_activeslot']
 ) or die "replication slot stats of vacuum_full_activeslot not updated";
 
+# Ensure an active walsender updates its IO statistics.
+is( $node_standby->safe_psql(
+		'testdb',
+		qq(SELECT reads > 0
+		FROM pg_catalog.pg_stat_io
+		WHERE backend_type = 'walsender'
+		AND object = 'relation'
+		AND context = 'normal')
+	),
+	qq(t),
+	"Check that an active walsender updates its IO statistics");
+
 # This should trigger the conflict
 wait_until_vacuum_can_remove(
 	'full', 'CREATE TABLE conflict_test(x integer, y text);
-- 
2.34.1

#2 Michael Paquier
michael@paquier.xyz
In reply to: Bertrand Drouvot (#1)
Re: [BUG]: the walsender does not update its IO statistics until it exits

On Tue, Feb 25, 2025 at 01:42:08PM +0000, Bertrand Drouvot wrote:

Now we can see that the numbers increased for the relation object and that we
get non zeros numbers for the wal object too (which makes fully sense).

With the attached patch applied, we would get the same numbers already in
step 4. (means the stats are flushed without the need to wait for the walsender
to exit).

@@ -2793,6 +2794,12 @@ WalSndLoop(WalSndSendDataCallback send_data)
if (pq_flush_if_writable() != 0)
WalSndShutdown();

+        /*
+         * Report IO statistics
+         */
+        pgstat_flush_io(false);
+        (void) pgstat_flush_backend(false, PGSTAT_BACKEND_FLUSH_IO);
+
         /* If nothing remains to be sent right now ... */
         if (WalSndCaughtUp && !pq_is_send_pending())
         {

That's bad, worse for a logical WAL sender, because it means that we
have no idea what kind of I/O happens in this process until it exits,
and logical WAL senders could loop forever, since v16 where we've
begun tracking I/O.

A non-forced periodic flush like you are proposing here sounds OK to
me, but the position of the flush could be positioned better in the
loop. If there is a SIGUSR2 (aka got_SIGUSR2 is true), WAL senders
would shut down, so it seems rather pointless to do a flush just
before exiting the process in WalSndDone(), no? I'd suggest to move
the flush attempt closer to where we wait for some activity, just
after WalSndKeepaliveIfNecessary().
--
Michael

#3 Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Michael Paquier (#2)
1 attachment(s)
Re: [BUG]: the walsender does not update its IO statistics until it exits

Hi,

On Wed, Feb 26, 2025 at 03:37:10PM +0900, Michael Paquier wrote:

On Tue, Feb 25, 2025 at 01:42:08PM +0000, Bertrand Drouvot wrote:

Now we can see that the numbers increased for the relation object and that we
get non zeros numbers for the wal object too (which makes fully sense).

With the attached patch applied, we would get the same numbers already in
step 4. (means the stats are flushed without the need to wait for the walsender
to exit).

@@ -2793,6 +2794,12 @@ WalSndLoop(WalSndSendDataCallback send_data)
if (pq_flush_if_writable() != 0)
WalSndShutdown();

+        /*
+         * Report IO statistics
+         */
+        pgstat_flush_io(false);
+        (void) pgstat_flush_backend(false, PGSTAT_BACKEND_FLUSH_IO);
+
/* If nothing remains to be sent right now ... */
if (WalSndCaughtUp && !pq_is_send_pending())
{

That's bad, worse for a logical WAL sender, because it means that we
have no idea what kind of I/O happens in this process until it exits,
and logical WAL senders could loop forever, since v16 where we've
begun tracking I/O.

Yeah... And while the example shared up-thread is related to a logical walsender,
the same issue exists for a physical walsender.

OTOH, it's also great to see that the newly added stats (the WAL ones) helped
to spot the issue.

A non-forced periodic flush like you are proposing here sounds OK to
me,

Thanks for looking at it!

but the position of the flush could be positioned better in the
loop. If there is a SIGUSR2 (aka got_SIGUSR2 is true), WAL senders
would shut down,

That's true for a physical walsender but I'm not sure it is for a logical
walsender (due to the "sentPtr == replicatedPtr" check in WalSndDone()).

so it seems rather pointless to do a flush just
before exiting the process in WalSndDone(), no? I'd suggest to move
the flush attempt closer to where we wait for some activity, just
after WalSndKeepaliveIfNecessary().

Yeah I think that makes sense, done that way in the attached.

Speaking of physical walsenders, I moved the test to 001_stream_rep.pl instead
(it would also fail without the fix).

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v2-0001-Flush-the-IO-statistics-of-active-walsenders.patch (text/x-diff; charset=us-ascii)
From 0b610eaea0f71c3f61e1cded137dc59db596322e Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Date: Tue, 25 Feb 2025 10:18:05 +0000
Subject: [PATCH v2] Flush the IO statistics of active walsenders

The walsender does not flush its IO statistics until it exits.
The issue is there since pg_stat_io has been introduced in a9c70b46dbe.
This commit:

1. ensures it does not wait to exit to flush its IO statistics
2. adds a test for a physical walsender (a logical walsender had the same issue
but the fix is in the same code path)
---
 src/backend/replication/walsender.c   |  7 +++++++
 src/test/recovery/t/001_stream_rep.pl | 14 ++++++++++++++
 2 files changed, 21 insertions(+)
  26.8% src/backend/replication/
  73.1% src/test/recovery/t/

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 446d10c1a7d..9ddf111af25 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -90,6 +90,7 @@
 #include "utils/guc.h"
 #include "utils/memutils.h"
 #include "utils/pg_lsn.h"
+#include "utils/pgstat_internal.h"
 #include "utils/ps_status.h"
 #include "utils/timeout.h"
 #include "utils/timestamp.h"
@@ -2829,6 +2830,12 @@ WalSndLoop(WalSndSendDataCallback send_data)
 		/* Send keepalive if the time has come */
 		WalSndKeepaliveIfNecessary();
 
+		/*
+		 * Report IO statistics
+		 */
+		pgstat_flush_io(false);
+		(void) pgstat_flush_backend(false, PGSTAT_BACKEND_FLUSH_IO);
+
 		/*
 		 * Block if we have unsent data.  XXX For logical replication, let
 		 * WalSndWaitForWal() handle any other blocking; idle receivers need
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index ee57d234c86..aea32f68b79 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -42,6 +42,9 @@ $node_standby_2->init_from_backup($node_standby_1, $backup_name,
 	has_streaming => 1);
 $node_standby_2->start;
 
+# To check that an active walsender updates its IO statistics below.
+$node_primary->safe_psql('postgres', "SELECT pg_stat_reset_shared('io')");
+
 # Create some content on primary and check its presence in standby nodes
 $node_primary->safe_psql('postgres',
 	"CREATE TABLE tab_int AS SELECT generate_series(1,1002) AS a");
@@ -69,6 +72,17 @@ ALTER EVENT TRIGGER on_login_trigger ENABLE ALWAYS;
 $node_primary->wait_for_replay_catchup($node_standby_1);
 $node_standby_1->wait_for_replay_catchup($node_standby_2, $node_primary);
 
+# Ensure an active walsender updates its IO statistics.
+is( $node_primary->safe_psql(
+		'postgres',
+		qq(SELECT sum(reads) > 0
+       FROM pg_catalog.pg_stat_io
+       WHERE backend_type = 'walsender'
+       AND object = 'wal')
+	),
+	qq(t),
+	"Check that the walsender updates its IO statistics");
+
 my $result =
   $node_standby_1->safe_psql('postgres', "SELECT count(*) FROM tab_int");
 print "standby 1: $result\n";
-- 
2.34.1

#4 Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#2)
Re: [BUG]: the walsender does not update its IO statistics until it exits

Hi,

On 2025-02-26 15:37:10 +0900, Michael Paquier wrote:

That's bad, worse for a logical WAL sender, because it means that we
have no idea what kind of I/O happens in this process until it exits,
and logical WAL senders could loop forever, since v16 where we've
begun tracking I/O.

FWIW, I think medium term we need to work on splitting stats flushing into two
separate kinds of flushes:
1) non-transactional stats, which should be flushed at a regular interval,
unless a process is completely idle
2) transaction stats, which can only be flushed at transaction boundaries,
because before the transaction boundary we don't know if e.g. newly
inserted rows should be counted as live or dead

So far we have some timer logic for 2), but we have basically no support for
1). Which means we have weird ad-hoc logic in various kinds of
non-plain-connection processes. And that will often have holes, as Bertrand
noticed here.

I think it's also bad that we don't have a solution for 1), even just for
normal connections. If a backend causes a lot of IO we might want to know
about that long before the longrunning transaction commits.

I suspect the right design here would be to have a generalized form of the
timeout mechanism we have for 2).

For that we'd need to make sure that pgstat_report_stat() can be safely called
inside a transaction. The second part would be to redesign the
IdleStatsUpdateTimeoutPending mechanism so it is triggered independent of
idleness, without introducing unacceptable overhead - I think that's doable.

Greetings,

Andres Freund

#5 Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#4)
Re: [BUG]: the walsender does not update its IO statistics until it exits

On Wed, Feb 26, 2025 at 05:08:17AM -0500, Andres Freund wrote:

I think it's also bad that we don't have a solution for 1), even just for
normal connections. If a backend causes a lot of IO we might want to know
about that long before the longrunning transaction commits.

I suspect the right design here would be to have a generalized form of the
timeout mechanism we have for 2).

For that we'd need to make sure that pgstat_report_stat() can be safely called
inside a transaction. The second part would be to redesign the
IdleStatsUpdateTimeoutPending mechanism so it is triggered independent of
idleness, without introducing unacceptable overhead - I think that's doable.

Agreed that something along the lines of non-transactional updates of the
entries could be adapted in some cases, so +1 for the idea. I suspect
that there would be cases where a single stats kind should be able to
handle both transactional and non-transactional flush cases.
Splitting that across two stats kinds would lead to a lot of
duplication.
--
Michael

#6 Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Andres Freund (#4)
Re: [BUG]: the walsender does not update its IO statistics until it exits

Hi,

On Wed, Feb 26, 2025 at 05:08:17AM -0500, Andres Freund wrote:

Hi,

On 2025-02-26 15:37:10 +0900, Michael Paquier wrote:

That's bad, worse for a logical WAL sender, because it means that we
have no idea what kind of I/O happens in this process until it exits,
and logical WAL senders could loop forever, since v16 where we've
begun tracking I/O.

FWIW, I think medium term we need to work on splitting stats flushing into two
separate kinds of flushes:
1) non-transactional stats, which should be flushed at a regular interval,
unless a process is completely idle
2) transaction stats, which can only be flushed at transaction boundaries,
because before the transaction boundary we don't know if e.g. newly
inserted rows should be counted as live or dead

So far we have some timer logic for 2), but we have basically no support for
1). Which means we have weird ad-hoc logic in various kinds of
non-plain-connection processes. And that will often have holes, as Bertrand
noticed here.

Thanks for sharing your thoughts on it!

Yeah, agree that's a good medium term idea to avoid missing flushing some
stats.

I think it's also bad that we don't have a solution for 1), even just for
normal connections. If a backend causes a lot of IO we might want to know
about that long before the longrunning transaction commits.

+++1 There is no need to wait for the transaction boundary for some stats, as
those would be more "useful"/"actionable" if flushed while the transaction is
in progress.

I suspect the right design here would be to have a generalized form of the
timeout mechanism we have for 2).

For that we'd need to make sure that pgstat_report_stat() can be safely called
inside a transaction. The second part would be to redesign the
IdleStatsUpdateTimeoutPending mechanism so it is triggered independent of
idleness, without introducing unacceptable overhead - I think that's doable.

Adding this to my bucket unless someone beats me to it.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#7 Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#5)
Re: [BUG]: the walsender does not update its IO statistics until it exits

On 2025-02-27 12:09:46 +0900, Michael Paquier wrote:

I suspect that there would be cases where a single stats kind should be able
to handle both transactional and non-transactional flush cases. Splitting
that across two stats kinds would lead to a lot of duplication.

Agreed. Table stats will be the prime case where that's important.

#8 Michael Paquier
michael@paquier.xyz
In reply to: Bertrand Drouvot (#3)
Re: [BUG]: the walsender does not update its IO statistics until it exits

On Wed, Feb 26, 2025 at 09:48:50AM +0000, Bertrand Drouvot wrote:

Yeah I think that makes sense, done that way in the attached.

Speaking about physical walsender, I moved the test to 001_stream_rep.pl instead
(would also fail without the fix).

Hmm. I was doing some more checks with this patch, and on closer look
I am wondering if the location you have chosen for the stats reports
is too aggressive: this requires a LWLock for the WAL sender backend
type taken in exclusive mode, with each step of WalSndLoop() taken
roughly each time a record or a batch of records is sent. A single
installcheck with a primary/standby setup can lead to up to 50k stats
report calls.

With smaller records, the loop can become hotter, can't it? Also,
there can be a high number of WAL senders on a single node, and I've
heard of some customers with complex logical decoding deployments with
dozens of logical WAL senders. Isn't there a risk of having this code
path become a point of contention? It seems to me that we should
benchmark this change more carefully, perhaps even reduce the
frequency of the report calls.
--
Michael

#9 Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#8)
Re: [BUG]: the walsender does not update its IO statistics until it exits

On Fri, Feb 28, 2025 at 02:41:34PM +0900, Michael Paquier wrote:

With smaller records, the loop can become hotter, can't it? Also,
there can be a high number of WAL senders on a single node, and I've
heard of some customers with complex logical decoding deployments with
dozens of logical WAL senders. Isn't there a risk of having this code
path become a point of contention? It seems to me that we should
benchmark this change more carefully, perhaps even reduce the
frequency of the report calls.

One idea here would be to have on a single host one server with a set
of N pg_receivewal processes dumping their WAL segments into a tmpfs,
while a single session generates a bunch of records with a minimal
size using pg_logical_emit_message(). Monitoring the maximum
replication lag with pg_stat_replication and looking at some perf
profiles of the cluster should show how these stats reports affect the
replication setup efficiency.
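
A rough sketch of such a setup could be (the tmpfs path and process count are
illustrative assumptions, and the workload uses pg_logical_emit_message() as
mentioned above):

WALDIR=/mnt/tmpfs/walsink        # assumed tmpfs mount point
for i in $(seq 1 4); do
    mkdir -p "$WALDIR/$i"
    pg_receivewal -d postgres -D "$WALDIR/$i" --no-loop &
done

# A single session generating a stream of minimal records.
pgbench -n -c1 -j1 -T60 \
    -f <(echo "SELECT pg_logical_emit_message(true, 'test', repeat('0', 1));")

# Maximum replication lag while the workload runs.
psql -d postgres -c "SELECT pg_size_pretty(max(sent_lsn - write_lsn)) AS max_lag
                     FROM pg_stat_replication;"
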
--
Michael

#10 Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Michael Paquier (#8)
Re: [BUG]: the walsender does not update its IO statistics until it exits

Hi,

On Fri, Feb 28, 2025 at 02:41:34PM +0900, Michael Paquier wrote:

On Wed, Feb 26, 2025 at 09:48:50AM +0000, Bertrand Drouvot wrote:

Yeah I think that makes sense, done that way in the attached.

Speaking about physical walsender, I moved the test to 001_stream_rep.pl instead
(would also fail without the fix).

Hmm. I was doing some more checks with this patch, and on closer look
I am wondering if the location you have chosen for the stats reports
is too aggressive: this requires a LWLock for the WAL sender backend
type taken in exclusive mode, with each step of WalSndLoop() taken
roughly each time a record or a batch of records is sent. A single
installcheck with a primary/standby setup can lead to up to 50k stats
report calls.

Yeah, here is what I can observe (installcheck with a primary/standby):

- the extra flushes are called about 70K times.

Among those 70K:

- about 575 get past the "have_iostats" check in pgstat_io_flush_cb()
- about 575 get past the PendingBackendStats pg_memory_is_all_zeros()
check in pgstat_flush_backend() (makes sense due to the above)
- about 575 get past the PendingBackendStats.pending_io pg_memory_is_all_zeros()
check in pgstat_flush_backend_entry_io() (makes sense due to the above)

It means that only very few of them are "really" flushing IO stats.

With smaller records, the loop can become hotter, can't it? Also,
there can be a high number of WAL senders on a single node, and I've
heard of some customers with complex logical decoding deployments with
dozens of logical WAL senders. Isn't there a risk of having this code
path become a point of contention? It seems to me that we should
benchmark this change more carefully, perhaps even reduce the
frequency of the report calls.

That sounds like a good idea, to measure the impact of those extra calls and see
if we'd need to mitigate the impact. I'll collect some data.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#11 Michael Paquier
michael@paquier.xyz
In reply to: Bertrand Drouvot (#10)
Re: [BUG]: the walsender does not update its IO statistics until it exits

On Fri, Feb 28, 2025 at 10:39:31AM +0000, Bertrand Drouvot wrote:

That sounds a good idea to measure the impact of those extra calls and see
if we'd need to mitigate the impacts. I'll collect some data.

Thanks.
--
Michael

#12 Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Michael Paquier (#11)
2 attachment(s)
Re: [BUG]: the walsender does not update its IO statistics until it exits

Hi,

On Mon, Mar 03, 2025 at 10:51:19AM +0900, Michael Paquier wrote:

On Fri, Feb 28, 2025 at 10:39:31AM +0000, Bertrand Drouvot wrote:

That sounds a good idea to measure the impact of those extra calls and see
if we'd need to mitigate the impacts. I'll collect some data.

So I did some tests using only one walsender (given the fact that the extra
lock you mentioned above is "only" for this particular backend).

=== Test with pg_receivewal

I was using one pg_receivewal process and did some tests that way:

pgbench -n -c8 -j8 -T60 -f <(echo "SELECT pg_logical_emit_message(true, 'test', repeat('0', 1));";)

I did not measure any noticeable extra lag (I did measure the time it took
for pg_size_pretty(sent_lsn - write_lsn) from pg_stat_replication to be back
to zero).

During the pgbench run a "perf record --call-graph fp -p <walsender_pid>" would
report (perf report -n):

1. pgstat_flush_backend() appears at about 3%
2. pg_memory_is_all_zeros() at about 2.8%
3. pgstat_flush_io() at about 0.4%
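
For reference, a minimal sketch of that profiling workflow (assuming a single
walsender) could be:

WALSENDER_PID=$(psql -At -d postgres -c \
    "SELECT pid FROM pg_stat_activity WHERE backend_type = 'walsender'")
perf record --call-graph fp -p "$WALSENDER_PID" -- sleep 60
perf report -n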

So it does not look like what we're adding here can be seen as a primary bottleneck.

That said, it looks like there is room for improvement in pgstat_flush_backend()
and that relying on a "have_iostats"-like variable would be better than those
pg_memory_is_all_zeros() calls.

That's done in 0001 attached; by doing so, pgstat_flush_backend() now appears at
about 0.2%.

=== Test with pg_recvlogical

Now it does not look like pg_receivewal had a lot of IO stats to report (looking at
pg_stat_get_backend_io() output for the walsender).

Doing the same test with "pg_recvlogical -d postgres -S logical_slot -f /dev/null --start"
reports much more IO stats.

What I observe without the "have_iostats" optimization is:

1. I did not measure any noticeable extra lag
2. pgstat_flush_io() at about 5.5% (pgstat_io_flush_cb() at about 5.3%)
3. pgstat_flush_backend() at about 4.8%

and with the "have_iostats" optimization I now see pgstat_flush_backend() at
about 2.51%.

So it does not look like what we're adding here can be seen as a primary bottleneck,
but it is probably worth implementing the "have_iostats" optimization attached.

Also, while I did not measure any noticeable extra lag, given that
pgstat_flush_io() shows at about 5.5% and pgstat_flush_backend() at about 2.5%,
it could still make sense to reduce the frequency of the flush calls. Thoughts?

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v3-0001-Flush-the-IO-statistics-of-active-walsenders.patch (text/x-diff; charset=us-ascii)
From 226f767f90e82ee128319f43226d4d0056756fa9 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Date: Tue, 25 Feb 2025 10:18:05 +0000
Subject: [PATCH v3 1/2] Flush the IO statistics of active walsenders

The walsender does not flush its IO statistics until it exits.
The issue is there since pg_stat_io has been introduced in a9c70b46dbe.
This commit:

1. ensures it does not wait to exit to flush its IO statistics
2. adds a test for a physical walsender (a logical walsender had the same issue
but the fix is in the same code path)
---
 src/backend/replication/walsender.c   |  7 +++++++
 src/test/recovery/t/001_stream_rep.pl | 14 ++++++++++++++
 2 files changed, 21 insertions(+)
  26.8% src/backend/replication/
  73.1% src/test/recovery/t/

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 446d10c1a7d..9ddf111af25 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -90,6 +90,7 @@
 #include "utils/guc.h"
 #include "utils/memutils.h"
 #include "utils/pg_lsn.h"
+#include "utils/pgstat_internal.h"
 #include "utils/ps_status.h"
 #include "utils/timeout.h"
 #include "utils/timestamp.h"
@@ -2829,6 +2830,12 @@ WalSndLoop(WalSndSendDataCallback send_data)
 		/* Send keepalive if the time has come */
 		WalSndKeepaliveIfNecessary();
 
+		/*
+		 * Report IO statistics
+		 */
+		pgstat_flush_io(false);
+		(void) pgstat_flush_backend(false, PGSTAT_BACKEND_FLUSH_IO);
+
 		/*
 		 * Block if we have unsent data.  XXX For logical replication, let
 		 * WalSndWaitForWal() handle any other blocking; idle receivers need
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index ee57d234c86..aea32f68b79 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -42,6 +42,9 @@ $node_standby_2->init_from_backup($node_standby_1, $backup_name,
 	has_streaming => 1);
 $node_standby_2->start;
 
+# To check that an active walsender updates its IO statistics below.
+$node_primary->safe_psql('postgres', "SELECT pg_stat_reset_shared('io')");
+
 # Create some content on primary and check its presence in standby nodes
 $node_primary->safe_psql('postgres',
 	"CREATE TABLE tab_int AS SELECT generate_series(1,1002) AS a");
@@ -69,6 +72,17 @@ ALTER EVENT TRIGGER on_login_trigger ENABLE ALWAYS;
 $node_primary->wait_for_replay_catchup($node_standby_1);
 $node_standby_1->wait_for_replay_catchup($node_standby_2, $node_primary);
 
+# Ensure an active walsender updates its IO statistics.
+is( $node_primary->safe_psql(
+		'postgres',
+		qq(SELECT sum(reads) > 0
+       FROM pg_catalog.pg_stat_io
+       WHERE backend_type = 'walsender'
+       AND object = 'wal')
+	),
+	qq(t),
+	"Check that the walsender updates its IO statistics");
+
 my $result =
   $node_standby_1->safe_psql('postgres', "SELECT count(*) FROM tab_int");
 print "standby 1: $result\n";
-- 
2.34.1

v3-0002-Add-a-new-backend_has_iostats-global-variable.patch (text/x-diff; charset=us-ascii)
From 42cce18fd4565118c430bdd422f509361c8f86e5 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Date: Mon, 3 Mar 2025 10:02:40 +0000
Subject: [PATCH v3 2/2] Add a new backend_has_iostats global variable

It behaves like the existing have_iostats and replaces the existing pg_memory_is_all_zeros()
calls in the backend stats flushing functions. Indeed, some perf measurements report
that those calls are responsible for a large part of the cost of the backend stats
flushing functions.
---
 src/backend/utils/activity/pgstat_backend.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)
 100.0% src/backend/utils/activity/

diff --git a/src/backend/utils/activity/pgstat_backend.c b/src/backend/utils/activity/pgstat_backend.c
index a9343b7b59e..0ae3a1d4d23 100644
--- a/src/backend/utils/activity/pgstat_backend.c
+++ b/src/backend/utils/activity/pgstat_backend.c
@@ -37,6 +37,7 @@
  * memory allocation.
  */
 static PgStat_BackendPending PendingBackendStats;
+static bool backend_has_iostats = false;
 
 /*
  * Utility routines to report I/O stats for backends, kept here to avoid
@@ -68,6 +69,8 @@ pgstat_count_backend_io_op(IOObject io_object, IOContext io_context,
 
 	PendingBackendStats.pending_io.counts[io_object][io_context][io_op] += cnt;
 	PendingBackendStats.pending_io.bytes[io_object][io_context][io_op] += bytes;
+
+	backend_has_iostats = true;
 }
 
 /*
@@ -150,8 +153,7 @@ pgstat_flush_backend_entry_io(PgStat_EntryRef *entry_ref)
 	 * statistics.  In this case, avoid unnecessarily modifying the stats
 	 * entry.
 	 */
-	if (pg_memory_is_all_zeros(&PendingBackendStats.pending_io,
-							   sizeof(struct PgStat_PendingIO)))
+	if (!backend_has_iostats)
 		return;
 
 	shbackendent = (PgStatShared_Backend *) entry_ref->shared_stats;
@@ -182,6 +184,8 @@ pgstat_flush_backend_entry_io(PgStat_EntryRef *entry_ref)
 	 * Clear out the statistics buffer, so it can be re-used.
 	 */
 	MemSet(&PendingBackendStats.pending_io, 0, sizeof(PgStat_PendingIO));
+
+	backend_has_iostats = false;
 }
 
 /*
@@ -198,8 +202,7 @@ pgstat_flush_backend(bool nowait, bits32 flags)
 	if (!pgstat_tracks_backend_bktype(MyBackendType))
 		return false;
 
-	if (pg_memory_is_all_zeros(&PendingBackendStats,
-							   sizeof(struct PgStat_BackendPending)))
+	if (!pgstat_backend_have_pending_cb())
 		return false;
 
 	entry_ref = pgstat_get_entry_ref_locked(PGSTAT_KIND_BACKEND, InvalidOid,
@@ -222,8 +225,7 @@ pgstat_flush_backend(bool nowait, bits32 flags)
 bool
 pgstat_backend_have_pending_cb(void)
 {
-	return (!pg_memory_is_all_zeros(&PendingBackendStats,
-									sizeof(struct PgStat_BackendPending)));
+	return backend_has_iostats;
 }
 
 /*
-- 
2.34.1

#13 Michael Paquier
michael@paquier.xyz
In reply to: Bertrand Drouvot (#12)
Re: [BUG]: the walsender does not update its IO statistics until it exits

On Mon, Mar 03, 2025 at 11:54:39AM +0000, Bertrand Drouvot wrote:

So it does not look like what we're adding here can be seen as a primary bottleneck
but that is probably worth implementing the "have_iostats" optimization attached.

Also, while I did not measure any noticeable extra lag, given the fact that
pgstat_flush_io() shows at about 5.5% and pgstat_flush_backend() at about 2.5%,
that could still make sense to reduce the frequency of the flush calls, thoughts?

Okay. The number of reports as proposed in the patch really worries me
anyway. It's not really complicated to see dozens of pgstats calls in
a single millisecond in the WAL sender, even with some of the
regression tests. That's more present in the logirep scenarios, it
seems...

For example, here are some numbers based on some elog() calls planted
in walreceiver.c, looking at the logs of the TAP tests. Through a
single check-world, I am able to gather the following information:
- The location proposed by the patch leads to 680k stats reports,
in WalSndLoop() at the "Block if we have unsent data" step. Some
tests show quite some contention.
- A different location, in WalSndLoop() just before WalSndWait(), leads
to 59k calls. And there's some contention in the logirep tests...
- I've spotted a third candidate which looks pretty solid, actually:
WalSndWaitForWal() before WalSndWait(). This leads to 2.9k reports in
the whole test suite, with much less contention in the reports. These
can still be rather frequent, up to ~50 calls per second, but that's
really less than the two others.

Stats data is useful as long as it is possible to get an idea of how
the system behaves, particularly with a steady workload. More
frequent reports are useful for spiky data detection, but show more
noise. Still, too many reports may cause the part gathering the
reports to become a bottleneck, while we want it to offer hints about
bottlenecks. So I would argue in favor of a more conservative choice
in the back branches than what the patch is proposing. Choice 3 I'm
quoting above is tempting by design: not too much, still frequent
enough to offer enough relevant information in the stats.
--
Michael

#14 Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Michael Paquier (#13)
Re: [BUG]: the walsender does not update its IO statistics until it exits

Hi,

On Mon, Mar 10, 2025 at 09:23:50AM +0900, Michael Paquier wrote:

On Mon, Mar 03, 2025 at 11:54:39AM +0000, Bertrand Drouvot wrote:

So it does not look like what we're adding here can be seen as a primary bottleneck
but that is probably worth implementing the "have_iostats" optimization attached.

Also, while I did not measure any noticeable extra lag, given the fact that
pgstat_flush_io() shows at about 5.5% and pgstat_flush_backend() at about 2.5%,
that could still make sense to reduce the frequency of the flush calls, thoughts?

- I've spotted a third candidate which looks pretty solid, actually:
WalSndWaitForWal() before WalSndWait(). This leads to 2.9k reports in
the whole test suite, with much less contention in the reports. These
can still be rather frequent, up to ~50 calls per seconds, but that's
really less than the two others.

Stats data is useful as long as it is possible to get an idea of how
the system behaves, particularly with a steady workload. More
frequent reports are useful for spikey data detection, showing more
noise. Still, too many reports may cause the part gathering the
reports to become a bottleneck, while we want it to offer hints about
bottlenecks. So I would argue in favor of a more conservative choice
in the back branches than what the patch is proposing.

Yeah, fully agree. Anyway, having such frequent stats reports makes little sense.

Choice 3 i'm
quoting above is tempting by design: not too much, still frequent
enough to offer enough relevant information in the stats.

Yeah, I also agree that we should reduce the number of "reports". I'll look
at the third option.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#15 Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Bertrand Drouvot (#14)
Re: [BUG]: the walsender does not update its IO statistics until it exits

Hi,

On Mon, Mar 10, 2025 at 02:52:42PM +0000, Bertrand Drouvot wrote:

Hi,

On Mon, Mar 10, 2025 at 09:23:50AM +0900, Michael Paquier wrote:

On Mon, Mar 03, 2025 at 11:54:39AM +0000, Bertrand Drouvot wrote:

So it does not look like what we're adding here can be seen as a primary bottleneck
but that is probably worth implementing the "have_iostats" optimization attached.

Also, while I did not measure any noticeable extra lag, given the fact that
pgstat_flush_io() shows at about 5.5% and pgstat_flush_backend() at about 2.5%,
that could still make sense to reduce the frequency of the flush calls, thoughts?

- I've spotted a third candidate which looks pretty solid, actually:
WalSndWaitForWal() before WalSndWait(). This leads to 2.9k reports in
the whole test suite, with much less contention in the reports. These
can still be rather frequent, up to ~50 calls per seconds, but that's
really less than the two others.

Stats data is useful as long as it is possible to get an idea of how
the system behaves, particularly with a steady workload. More
frequent reports are useful for spikey data detection, showing more
noise. Still, too many reports may cause the part gathering the
reports to become a bottleneck, while we want it to offer hints about
bottlenecks. So I would argue in favor of a more conservative choice
in the back branches than what the patch is proposing.

Yeah, fully agree. Anyway, having so many frequent stats reports makes little sense.

Choice 3 i'm
quoting above is tempting by design: not too much, still frequent
enough to offer enough relevant information in the stats.

Yeah, I also agree that we should reduce the number of "reports". I'll look
at the third option.

WalSndWaitForWal() is only used for logical walsenders. So we'd need to
find another location for the physical walsender case. One option is to keep the
WalSndLoop() location and control the report frequency.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#16 Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Bertrand Drouvot (#15)
1 attachment(s)
Re: [BUG]: the walsender does not update its IO statistics until it exits

Hi,

On Tue, Mar 11, 2025 at 06:10:32AM +0000, Bertrand Drouvot wrote:

WalSndWaitForWal() is being used only for logical walsender. So we'd need to
find another location for the physical walsender case. One option is to keep the
WalSndLoop() location and control the reports frequency.

I ended up with an idea in the same vein as your third one, but let's do it
in WalSndLoop() instead for both logical and physical walsenders. The idea
is to report the stats when the walsender is caught up or has pending data to
send. Also, we flush only periodically (based on PGSTAT_MIN_INTERVAL).

Remarks/Questions:

1. maybe relying on PGSTAT_IDLE_INTERVAL would make more sense? In both cases
(PGSTAT_MIN_INTERVAL or PGSTAT_IDLE_INTERVAL), I'm not sure there is a need to
update the related doc.

2. a bonus point of doing it here is that we don't need an extra call to
GetCurrentTimestamp(): we simply assign the one that is already done to "now".

3. I had to use poll_query_until() as I've seen failures on the CI with this
new approach. I also moved the test far away from the stats reset to minimize
the risk of polling for too long.

With the attached in place, the number of times the stats are flushed is
reduced drastically, but it still seems enough to get an accurate idea of the walsender
IO activity. Thoughts?

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v4-0001-Flush-the-IO-statistics-of-active-walsenders.patch (text/x-diff; charset=us-ascii)
From 7978b1bce3d3f5cd47c983911431225b03d42efe Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Date: Tue, 25 Feb 2025 10:18:05 +0000
Subject: [PATCH v4] Flush the IO statistics of active walsenders

The walsender does not flush its IO statistics until it exits.
The issue is there since pg_stat_io has been introduced in a9c70b46dbe.
This commit:

1. ensures it does not wait to exit to flush its IO statistics
2. flush its IO statistics periodically to not overload the WAL sender
3. adds a test for a physical walsender (a logical walsender had the same issue
but the fix is in the same code path)
---
 src/backend/replication/walsender.c   | 60 +++++++++++++++++++--------
 src/backend/utils/activity/pgstat.c   |  2 -
 src/include/pgstat.h                  |  2 +
 src/test/recovery/t/001_stream_rep.pl | 16 +++++++
 4 files changed, 60 insertions(+), 20 deletions(-)
  70.5% src/backend/replication/
   3.1% src/backend/utils/activity/
   3.6% src/include/
  22.6% src/test/recovery/t/

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 446d10c1a7d..9b44d4ae600 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -90,6 +90,7 @@
 #include "utils/guc.h"
 #include "utils/memutils.h"
 #include "utils/pg_lsn.h"
+#include "utils/pgstat_internal.h"
 #include "utils/ps_status.h"
 #include "utils/timeout.h"
 #include "utils/timestamp.h"
@@ -2740,6 +2741,8 @@ WalSndCheckTimeOut(void)
 static void
 WalSndLoop(WalSndSendDataCallback send_data)
 {
+	TimestampTz last_flush = 0;
+
 	/*
 	 * Initialize the last reply timestamp. That enables timeout processing
 	 * from hereon.
@@ -2834,30 +2837,51 @@ WalSndLoop(WalSndSendDataCallback send_data)
 		 * WalSndWaitForWal() handle any other blocking; idle receivers need
 		 * its additional actions.  For physical replication, also block if
 		 * caught up; its send_data does not block.
+		 *
+		 * When the WAL sender is caught up or has pending data to send, we
+		 * also periodically report I/O statistics. It's done periodically to
+		 * not overload the WAL sender.
 		 */
-		if ((WalSndCaughtUp && send_data != XLogSendLogical &&
-			 !streamingDoneSending) ||
-			pq_is_send_pending())
+		if ((WalSndCaughtUp && !streamingDoneSending) || pq_is_send_pending())
 		{
-			long		sleeptime;
-			int			wakeEvents;
+			TimestampTz now;
 
-			if (!streamingDoneReceiving)
-				wakeEvents = WL_SOCKET_READABLE;
-			else
-				wakeEvents = 0;
+			now = GetCurrentTimestamp();
 
-			/*
-			 * Use fresh timestamp, not last_processing, to reduce the chance
-			 * of reaching wal_sender_timeout before sending a keepalive.
-			 */
-			sleeptime = WalSndComputeSleeptime(GetCurrentTimestamp());
+			if (TimestampDifferenceExceeds(last_flush, now, PGSTAT_MIN_INTERVAL))
+			{
+				/*
+				 * Report IO statistics
+				 */
+				pgstat_flush_io(false);
+				(void) pgstat_flush_backend(false, PGSTAT_BACKEND_FLUSH_IO);
+				last_flush = now;
+			}
 
-			if (pq_is_send_pending())
-				wakeEvents |= WL_SOCKET_WRITEABLE;
+			if (send_data != XLogSendLogical || pq_is_send_pending())
+			{
+				long		sleeptime;
+				int			wakeEvents;
+
+				if (!streamingDoneReceiving)
+					wakeEvents = WL_SOCKET_READABLE;
+				else
+					wakeEvents = 0;
+
+				/*
+				 * Use fresh timestamp, not last_processing, to reduce the
+				 * chance of reaching wal_sender_timeout before sending a
+				 * keepalive.
+				 */
+				sleeptime = WalSndComputeSleeptime(now);
+
+				if (pq_is_send_pending())
+					wakeEvents |= WL_SOCKET_WRITEABLE;
+
+				/* Sleep until something happens or we time out */
+				WalSndWait(wakeEvents, sleeptime, WAIT_EVENT_WAL_SENDER_MAIN);
+			}
 
-			/* Sleep until something happens or we time out */
-			WalSndWait(wakeEvents, sleeptime, WAIT_EVENT_WAL_SENDER_MAIN);
 		}
 	}
 }
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 3168b825e25..1bf84cbf64e 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -123,8 +123,6 @@
  * ----------
  */
 
-/* minimum interval non-forced stats flushes.*/
-#define PGSTAT_MIN_INTERVAL			1000
 /* how long until to block flushing pending stats updates */
 #define PGSTAT_MAX_INTERVAL			60000
 /* when to call pgstat_report_stat() again, even when idle */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index def6b370ac1..d1b15bf7757 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -22,6 +22,8 @@
 #include "utils/relcache.h"
 #include "utils/wait_event.h"	/* for backward compatibility */	/* IWYU pragma: export */
 
+/* minimum interval non-forced stats flushes.*/
+#define PGSTAT_MIN_INTERVAL                 1000
 
 /* ----------
  * Paths for the statistics files (relative to installation's $PGDATA).
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index 3945f00ab88..3371895ab1d 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -42,6 +42,9 @@ $node_standby_2->init_from_backup($node_standby_1, $backup_name,
 	has_streaming => 1);
 $node_standby_2->start;
 
+# To check that an active walsender updates its IO statistics below.
+$node_primary->safe_psql('postgres', "SELECT pg_stat_reset_shared('io')");
+
 # Create some content on primary and check its presence in standby nodes
 $node_primary->safe_psql('postgres',
 	"CREATE TABLE tab_int AS SELECT generate_series(1,1002) AS a");
@@ -329,6 +332,19 @@ $node_primary->psql(
 
 note "switching to physical replication slot";
 
+# Wait for the walsender to update its IO statistics.
+# Has to be done before the next restart and far enough from the
+# pg_stat_reset_shared('io') to minimize the risk of polling for too long.
+$node_primary->poll_query_until(
+	'postgres',
+	qq[SELECT sum(reads) > 0
+       FROM pg_catalog.pg_stat_io
+       WHERE backend_type = 'walsender'
+       AND object = 'wal']
+  )
+  or die
+  "Timed out while waiting for the walsender to update its IO statistics";
+
 # Switch to using a physical replication slot. We can do this without a new
 # backup since physical slots can go backwards if needed. Do so on both
 # standbys. Since we're going to be testing things that affect the slot state,
-- 
2.34.1

#17 Xuneng Zhou
xunengzhou@gmail.com
In reply to: Bertrand Drouvot (#1)
Fwd: [BUG]: the walsender does not update its IO statistics until it exits

Forgot to cc...

---------- Forwarded message ---------
From: Xuneng Zhou <xunengzhou@gmail.com>
Date: Thu, Mar 13, 2025, 19:15
Subject: Re: [BUG]: the walsender does not update its IO statistics until it exits
To: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>

Hi,

Thanks for working on this! I'm glad to see that the patch
(/messages/by-id/Z3zqc4o09dM/Ezyz@ip-10-97-1-34.eu-west-3.compute.internal)
has been committed.

Regarding patch 0001, the optimization in pgstat_backend_have_pending_cb
looks good:

bool
pgstat_backend_have_pending_cb(void)
{
- return (!pg_memory_is_all_zeros(&PendingBackendStats,
- sizeof(struct PgStat_BackendPending)));
+ return backend_has_iostats;
}

Additionally, the function pgstat_flush_backend includes the check:

+ if (!pgstat_backend_have_pending_cb())
return false;

However, I think we might need to revise the comment (and possibly the
function name) for clarity:

/*
* Check if there are any backend stats waiting to be flushed.
*/

Originally, this function was intended to check multiple types of backend
statistics, which made sense when PendingBackendStats was the centralized
structure for various pending backend stats. However, since
PgStat_PendingWalStats was removed from PendingBackendStats earlier, and
now this patch introduces the backend_has_iostats variable, the scope of
this function appears even narrower. This narrowed functionality no longer
aligns closely with the original function name and its associated comment.


#18 Xuneng Zhou
xunengzhou@gmail.com
In reply to: Bertrand Drouvot (#1)
Fwd: [BUG]: the walsender does not update its IO statistics until it exits

Sorry, forgot to cc this one too.

---------- Forwarded message ---------
From: Xuneng Zhou <xunengzhou@gmail.com>
Date: Thu, Mar 13, 2025, 19:29
Subject: Re: [BUG]: the walsender does not update its IO statistics until it exits
To: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>

Hi,
Given that PGSTAT_MIN_INTERVAL is the go-to setting for stats flushes
* Unless called with 'force', pending stats updates are flushed happen once
* per PGSTAT_MIN_INTERVAL (1000ms). When not forced, stats flushes do not
* block on lock acquisition, except if stats updates have been pending for
* longer than PGSTAT_MAX_INTERVAL (60000ms).
*
* Whenever pending stats updates remain at the end of pgstat_report_stat() a
* suggested idle timeout is returned. Currently this is always
* PGSTAT_IDLE_INTERVAL (10000ms). Callers can use the returned time to set up
* a timeout after which to call pgstat_report_stat(true), but are not
* required to do so.

I am curious about the reason for this:

1. maybe relying on PGSTAT_IDLE_INTERVAL would make more sense? In both cases
(PGSTAT_MIN_INTERVAL or PGSTAT_IDLE_INTERVAL), I'm not sure there is a need to
update the related doc.

PGSTAT_IDLE_INTERVAL seems to reduce the frequency even more.


#19 Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Xuneng Zhou (#17)
2 attachment(s)
Re: Fwd: [BUG]: the walsender does not update its IO statistics until it exits

Hi,

On Thu, Mar 13, 2025 at 07:33:24PM +0800, Xuneng Zhou wrote:

Regarding patch 0001, the optimization in pgstat_backend_have_pending_cb
looks good:

Thanks for looking at it!

bool
pgstat_backend_have_pending_cb(void)
{
- return (!pg_memory_is_all_zeros(&PendingBackendStats,
- sizeof(struct PgStat_BackendPending)));
+ return backend_has_iostats;
}

Additionally, the function pgstat_flush_backend includes the check:

+ if (!pgstat_backend_have_pending_cb())
return false;

However, I think we might need to revise the comment (and possibly the
function name) for clarity:

/*
* Check if there are any backend stats waiting to be flushed.
*/

The comment is not exactly this one on current HEAD; it looks like you're
looking at a previous version of the code.

Originally, this function was intended to check multiple types of backend
statistics, which made sense when PendingBackendStats was the centralized
structure for various pending backend stats. However, since
PgStat_PendingWalStats was removed from PendingBackendStats earlier, and
now this patch introduces the backend_has_iostats variable, the scope of
this function appears even narrower. This narrowed functionality no longer
aligns closely with the original function name and its associated comment.

I don't think so: since 76def4cdd7c, a pgstat_backend_wal_have_pending() check
has been added to pgstat_backend_have_pending_cb(). You're probably looking at
a version prior to 76def4cdd7c.

This particular sub-patch needs a rebase though, done in the attached. 0001
remains unchanged as compared to the v4 one just shared up-thread. If 0001 goes
in, merging 0002 would be less beneficial (as compared to v3).

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v4-0001-Flush-the-IO-statistics-of-active-walsenders.patch (text/x-diff; charset=us-ascii)
From 3fa8bd30c15a99919601283f59f579e64b03c121 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Date: Tue, 25 Feb 2025 10:18:05 +0000
Subject: [PATCH v4 1/2] Flush the IO statistics of active walsenders

The walsender does not flush its IO statistics until it exits.
The issue is there since pg_stat_io has been introduced in a9c70b46dbe.
This commit:

1. ensures it does not wait to exit to flush its IO statistics
2. flush its IO statistics periodically to not overload the WAL sender
3. adds a test for a physical walsender (a logical walsender had the same issue
but the fix is in the same code path)
---
 src/backend/replication/walsender.c   | 60 +++++++++++++++++++--------
 src/backend/utils/activity/pgstat.c   |  2 -
 src/include/pgstat.h                  |  2 +
 src/test/recovery/t/001_stream_rep.pl | 16 +++++++
 4 files changed, 60 insertions(+), 20 deletions(-)
  70.5% src/backend/replication/
   3.1% src/backend/utils/activity/
   3.6% src/include/
  22.6% src/test/recovery/t/

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 446d10c1a7d..9b44d4ae600 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -90,6 +90,7 @@
 #include "utils/guc.h"
 #include "utils/memutils.h"
 #include "utils/pg_lsn.h"
+#include "utils/pgstat_internal.h"
 #include "utils/ps_status.h"
 #include "utils/timeout.h"
 #include "utils/timestamp.h"
@@ -2740,6 +2741,8 @@ WalSndCheckTimeOut(void)
 static void
 WalSndLoop(WalSndSendDataCallback send_data)
 {
+	TimestampTz last_flush = 0;
+
 	/*
 	 * Initialize the last reply timestamp. That enables timeout processing
 	 * from hereon.
@@ -2834,30 +2837,51 @@ WalSndLoop(WalSndSendDataCallback send_data)
 		 * WalSndWaitForWal() handle any other blocking; idle receivers need
 		 * its additional actions.  For physical replication, also block if
 		 * caught up; its send_data does not block.
+		 *
+		 * When the WAL sender is caught up or has pending data to send, we
+		 * also periodically report I/O statistics. It's done periodically to
+		 * not overload the WAL sender.
 		 */
-		if ((WalSndCaughtUp && send_data != XLogSendLogical &&
-			 !streamingDoneSending) ||
-			pq_is_send_pending())
+		if ((WalSndCaughtUp && !streamingDoneSending) || pq_is_send_pending())
 		{
-			long		sleeptime;
-			int			wakeEvents;
+			TimestampTz now;
 
-			if (!streamingDoneReceiving)
-				wakeEvents = WL_SOCKET_READABLE;
-			else
-				wakeEvents = 0;
+			now = GetCurrentTimestamp();
 
-			/*
-			 * Use fresh timestamp, not last_processing, to reduce the chance
-			 * of reaching wal_sender_timeout before sending a keepalive.
-			 */
-			sleeptime = WalSndComputeSleeptime(GetCurrentTimestamp());
+			if (TimestampDifferenceExceeds(last_flush, now, PGSTAT_MIN_INTERVAL))
+			{
+				/*
+				 * Report IO statistics
+				 */
+				pgstat_flush_io(false);
+				(void) pgstat_flush_backend(false, PGSTAT_BACKEND_FLUSH_IO);
+				last_flush = now;
+			}
 
-			if (pq_is_send_pending())
-				wakeEvents |= WL_SOCKET_WRITEABLE;
+			if (send_data != XLogSendLogical || pq_is_send_pending())
+			{
+				long		sleeptime;
+				int			wakeEvents;
+
+				if (!streamingDoneReceiving)
+					wakeEvents = WL_SOCKET_READABLE;
+				else
+					wakeEvents = 0;
+
+				/*
+				 * Use fresh timestamp, not last_processing, to reduce the
+				 * chance of reaching wal_sender_timeout before sending a
+				 * keepalive.
+				 */
+				sleeptime = WalSndComputeSleeptime(now);
+
+				if (pq_is_send_pending())
+					wakeEvents |= WL_SOCKET_WRITEABLE;
+
+				/* Sleep until something happens or we time out */
+				WalSndWait(wakeEvents, sleeptime, WAIT_EVENT_WAL_SENDER_MAIN);
+			}
 
-			/* Sleep until something happens or we time out */
-			WalSndWait(wakeEvents, sleeptime, WAIT_EVENT_WAL_SENDER_MAIN);
 		}
 	}
 }
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 3168b825e25..1bf84cbf64e 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -123,8 +123,6 @@
  * ----------
  */
 
-/* minimum interval non-forced stats flushes.*/
-#define PGSTAT_MIN_INTERVAL			1000
 /* how long until to block flushing pending stats updates */
 #define PGSTAT_MAX_INTERVAL			60000
 /* when to call pgstat_report_stat() again, even when idle */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index def6b370ac1..d1b15bf7757 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -22,6 +22,8 @@
 #include "utils/relcache.h"
 #include "utils/wait_event.h"	/* for backward compatibility */	/* IWYU pragma: export */
 
+/* minimum interval non-forced stats flushes.*/
+#define PGSTAT_MIN_INTERVAL                 1000
 
 /* ----------
  * Paths for the statistics files (relative to installation's $PGDATA).
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index 3945f00ab88..3371895ab1d 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -42,6 +42,9 @@ $node_standby_2->init_from_backup($node_standby_1, $backup_name,
 	has_streaming => 1);
 $node_standby_2->start;
 
+# To check that an active walsender updates its IO statistics below.
+$node_primary->safe_psql('postgres', "SELECT pg_stat_reset_shared('io')");
+
 # Create some content on primary and check its presence in standby nodes
 $node_primary->safe_psql('postgres',
 	"CREATE TABLE tab_int AS SELECT generate_series(1,1002) AS a");
@@ -329,6 +332,19 @@ $node_primary->psql(
 
 note "switching to physical replication slot";
 
+# Wait for the walsender to update its IO statistics.
+# Has to be done before the next restart and far enough from the
+# pg_stat_reset_shared('io') to minimize the risk of polling for too long.
+$node_primary->poll_query_until(
+	'postgres',
+	qq[SELECT sum(reads) > 0
+       FROM pg_catalog.pg_stat_io
+       WHERE backend_type = 'walsender'
+       AND object = 'wal']
+  )
+  or die
+  "Timed out while waiting for the walsender to update its IO statistics";
+
 # Switch to using a physical replication slot. We can do this without a new
 # backup since physical slots can go backwards if needed. Do so on both
 # standbys. Since we're going to be testing things that affect the slot state,
-- 
2.34.1

v4-0002-Add-a-new-backend_has_iostats-global-variable.patch (text/x-diff; charset=us-ascii)
From abe58c82c35b8193d2cb8448b8147debdd2cb2f7 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Date: Mon, 3 Mar 2025 10:02:40 +0000
Subject: [PATCH v4 2/2] Add a new backend_has_iostats global variable

It behaves as the existing have_iostats and replace the existing pg_memory_is_all_zeros()
calls in flushing backend stats functions. Indeed some perf measurements report
that those calls are responsible for a large part of the flushing backend stats
functions.
---
 src/backend/utils/activity/pgstat_backend.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)
 100.0% src/backend/utils/activity/

diff --git a/src/backend/utils/activity/pgstat_backend.c b/src/backend/utils/activity/pgstat_backend.c
index a8cb54a7732..82dae972c4c 100644
--- a/src/backend/utils/activity/pgstat_backend.c
+++ b/src/backend/utils/activity/pgstat_backend.c
@@ -37,6 +37,7 @@
  * memory allocation.
  */
 static PgStat_BackendPending PendingBackendStats;
+static bool backend_has_iostats = false;
 
 /*
  * WAL usage counters saved from pgWalUsage at the previous call to
@@ -76,6 +77,8 @@ pgstat_count_backend_io_op(IOObject io_object, IOContext io_context,
 
 	PendingBackendStats.pending_io.counts[io_object][io_context][io_op] += cnt;
 	PendingBackendStats.pending_io.bytes[io_object][io_context][io_op] += bytes;
+
+	backend_has_iostats = true;
 }
 
 /*
@@ -158,8 +161,7 @@ pgstat_flush_backend_entry_io(PgStat_EntryRef *entry_ref)
 	 * statistics.  In this case, avoid unnecessarily modifying the stats
 	 * entry.
 	 */
-	if (pg_memory_is_all_zeros(&PendingBackendStats.pending_io,
-							   sizeof(struct PgStat_PendingIO)))
+	if (!backend_has_iostats)
 		return;
 
 	shbackendent = (PgStatShared_Backend *) entry_ref->shared_stats;
@@ -190,6 +192,8 @@ pgstat_flush_backend_entry_io(PgStat_EntryRef *entry_ref)
 	 * Clear out the statistics buffer, so it can be re-used.
 	 */
 	MemSet(&PendingBackendStats.pending_io, 0, sizeof(PgStat_PendingIO));
+
+	backend_has_iostats = false;
 }
 
 /*
@@ -259,9 +263,7 @@ pgstat_flush_backend(bool nowait, bits32 flags)
 		return false;
 
 	/* Some IO data pending? */
-	if ((flags & PGSTAT_BACKEND_FLUSH_IO) &&
-		!pg_memory_is_all_zeros(&PendingBackendStats.pending_io,
-								sizeof(struct PgStat_PendingIO)))
+	if ((flags & PGSTAT_BACKEND_FLUSH_IO) && backend_has_iostats)
 		has_pending_data = true;
 
 	/* Some WAL data pending? */
@@ -298,9 +300,7 @@ pgstat_backend_have_pending_cb(void)
 	if (!pgstat_tracks_backend_bktype(MyBackendType))
 		return false;
 
-	return (!pg_memory_is_all_zeros(&PendingBackendStats,
-									sizeof(struct PgStat_BackendPending)) ||
-			pgstat_backend_wal_have_pending());
+	return (backend_has_iostats || pgstat_backend_wal_have_pending());
 }
 
 /*
-- 
2.34.1

#20Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Xuneng Zhou (#18)
Re: Fwd: [BUG]: the walsender does not update its IO statistics until it exits

Hi,

On Thu, Mar 13, 2025 at 07:35:24PM +0800, Xuneng Zhou wrote:

Hi,
Given that PGSTAT_MIN_INTERVAL is the go-to setting for stats flushes:

* Unless called with 'force', pending stats updates are flushed happen once
* per PGSTAT_MIN_INTERVAL (1000ms). When not forced, stats flushes do not
* block on lock acquisition, except if stats updates have been pending for
* longer than PGSTAT_MAX_INTERVAL (60000ms).
*
* Whenever pending stats updates remain at the end of pgstat_report_stat() a
* suggested idle timeout is returned. Currently this is always
* PGSTAT_IDLE_INTERVAL (10000ms). Callers can use the returned time to set up
* a timeout after which to call pgstat_report_stat(true), but are not
* required to do so.

I am curious about the reason for this:

Thanks for looking at it!

1. maybe relying on PGSTAT_IDLE_INTERVAL would make more sense? In both
cases, PGSTAT_MIN_INTERVAL or PGSTAT_IDLE_INTERVAL, I'm not sure there is a need to
update the related doc.

PGSTAT_IDLE_INTERVAL seems to reduce the frequency even more.

Yeah, I think that PGSTAT_MIN_INTERVAL is the one to use (that's why that's the
one the patch is using). I just mentioned PGSTAT_IDLE_INTERVAL as an open door
for conversation in case one prefers a "larger" frequency.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#21Michael Paquier
michael@paquier.xyz
In reply to: Bertrand Drouvot (#19)
Re: Fwd: [BUG]: the walsender does not update its IO statistics until it exits

On Thu, Mar 13, 2025 at 01:18:45PM +0000, Bertrand Drouvot wrote:

This particular sub-patch needs a rebase though, done in the attached. 0001
remains unchanged as compared to the v4 one just shared up-thread. If 0001 goes
in, merging 0002 would be less beneficial (as compared to v3).

PgStat_PendingIO is already quite large, could get larger, and the
backend stats would be a natural victim of that. What you are
proposing in v4-0002 is beneficial in any case outside of the specific
problem of this thread with WAL senders. It is new code on HEAD and
pgstat_flush_backend_entry_io() being the only entry point where IO
data is appended makes it easy to reason about. Just to say that I am
planning to apply this part on HEAD.
--
Michael

#22Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Michael Paquier (#21)
Re: Fwd: [BUG]: the walsender does not update its IO statistics until it exits

Hi,

On Tue, Mar 18, 2025 at 05:11:02PM +0900, Michael Paquier wrote:

On Thu, Mar 13, 2025 at 01:18:45PM +0000, Bertrand Drouvot wrote:

This particular sub-patch needs a rebase though, done in the attached. 0001
remains unchanged as compared to the v4 one just shared up-thread. If 0001 goes
in, merging 0002 would be less beneficial (as compared to v3).

PgStat_PendingIO is already quite large, could get larger, and the
backend stats would be a natural victim of that. What you are
proposing in v4-0002 is beneficial in any case outside of the specific
problem of this thread with WAL senders.

Yeah, I also think it makes sense independently of what has been discussed
in this thread.

Just to say that I am planning to apply this part on HEAD.

Thanks!

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#23Xuneng Zhou
xunengzhou@gmail.com
In reply to: Bertrand Drouvot (#20)
Re: Fwd: [BUG]: the walsender does not update its IO statistics until it exits

Hi,

I performed some tests using elog() (not so sure whether this is the proper way
to do it) to monitor the new method. Here are the findings:

• With PGSTAT_MIN_INTERVAL set to 1000ms, the number of flush reports was
reduced to approximately 40–50 during the installcheck test suite.

• With PGSTAT_IDLE_INTERVAL set to 10000ms, the reports dropped to fewer
than 5.

• In contrast, the previous approach—flushing after every
WalSndKeepaliveIfNecessary()—resulted in roughly 50,000 flushes.

This reduction is significant, so the overhead from the flush reports is no
longer a concern. However, we still need to determine whether this
frequency is sufficient to capture the system’s state during periods of
high WAL activity. Based on my tests, using PGSTAT_MIN_INTERVAL seems to
provide a better balance than PGSTAT_IDLE_INTERVAL.
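
For reference, the instrumentation was roughly along these lines (a minimal
sketch only; the counter name and the log message are made up), placed next to
the new flush added in WalSndLoop(), so the flush count can be grepped from the
server logs:

	/* Illustration only: count how often the walsender flushes its IO stats */
	static uint64 ioflush_count = 0;
	...
	if (TimestampDifferenceExceeds(last_flush, now, PGSTAT_MIN_INTERVAL))
	{
		pgstat_flush_io(false);
		(void) pgstat_flush_backend(false, PGSTAT_BACKEND_FLUSH_IO);
		last_flush = now;

		ioflush_count++;
		elog(LOG, "walsender IO stats flush #" UINT64_FORMAT, ioflush_count);
	}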

1. maybe relying on PGSTAT_IDLE_INTERVAL would make more sense? In both
cases, PGSTAT_MIN_INTERVAL or PGSTAT_IDLE_INTERVAL, I'm not sure there is a
need to update the related doc.

PGSTAT_IDLE_INTERVAL seems to reduce the frequency even more.

Yeah, I think that PGSTAT_MIN_INTERVAL is the one to use (that's why that's the
one the patch is using). I just mentioned PGSTAT_IDLE_INTERVAL as an open door
for conversation in case one prefers a "larger" frequency.

#24Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Xuneng Zhou (#23)
Re: Fwd: [BUG]: the walsender does not update its IO statistics until it exits

Hi,

On Tue, Mar 18, 2025 at 07:14:12PM +0800, Xuneng Zhou wrote:

Hi,

I performed some tests using elog

Thanks for the testing!

(not so sure whether this is the proper way
to do it) to monitor the new method.

Well, that's simple enough and it works well if the goal is just to "count" ;-)

Here are the findings:

• With PGSTAT_MIN_INTERVAL set to 1000ms, the number of flush reports was
reduced to approximately 40–50 during the installcheck test suite.

• With PGSTAT_IDLE_INTERVAL set to 10000ms, the reports dropped to fewer
than 5.

• In contrast, the previous approach—flushing after every
WalSndKeepaliveIfNecessary()—resulted in roughly 50,000 flushes.

This reduction is significant, so the overhead from the flush reports is no
longer a concern.

Yeah, I observed the same and I think the same.

However, we still need to determine whether this
frequency is sufficient to capture the system’s state during periods of
high WAL activity.

I think that reporting at PGSTAT_MIN_INTERVAL is fine and more than enough. I
mean, I'm not sure that there is a real use case to query the statistics-related
views more often than once per second anyway. Or are you concerned that we may not
enter the "WAL sender is caught up or has pending data to send" branch frequently
enough?

Based on my tests, using PGSTAT_MIN_INTERVAL seems to
provide a better balance than PGSTAT_IDLE_INTERVAL.

Same here.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#25Michael Paquier
michael@paquier.xyz
In reply to: Bertrand Drouvot (#22)
Re: Fwd: [BUG]: the walsender does not update its IO statistics until it exits

On Tue, Mar 18, 2025 at 09:51:14AM +0000, Bertrand Drouvot wrote:

Thanks!

Done that part for now. It will be possible to look at the bug fix
after the release freeze as it impacts stable branches as well.
--
Michael

#26Xuneng Zhou
xunengzhou@gmail.com
In reply to: Bertrand Drouvot (#24)
Re: Fwd: [BUG]: the walsender does not update its IO statistics until it exits

Hi,

I am OK with PGSTAT_MIN_INTERVAL. It has been used for flushing other
statistics as well. And monitoring systems are generally configured to poll
at one-second or longer intervals.

I think that reporting at PGSTAT_MIN_INTERVAL is fine and more than
enough. I mean, I'm not sure that there is a real use case to query the
statistics-related views more often than once per second anyway.

I think these two conditions are good too. In a busy system, they are met
frequently, so the flush routine will be executed at least once every
second. Conversely, when WAL generation is low, there's simply less data to
record, and the flush frequency naturally decreases.

Or are you concerned that we may not
enter the "WAL sender is caught up or has pending data to send" branch
frequently enough?

Based on my tests, using PGSTAT_MIN_INTERVAL seems to
provide a better balance than PGSTAT_IDLE_INTERVAL.

Same here.

#27Michael Paquier
michael@paquier.xyz
In reply to: Xuneng Zhou (#26)
Re: Fwd: [BUG]: the walsender does not update its IO statistics until it exits

On Wed, Mar 19, 2025 at 09:53:16AM +0800, Xuneng Zhou wrote:

I think these two conditions are good too. In a busy system, they are met
frequently, so the flush routine will be executed at least once every
second. Conversely, when WAL generation is low, there's simply less data to
record, and the flush frequency naturally decreases.

Hmm, yeah, perhaps this is acceptable. The changes in pgstat.c seem
inconsistent, though, only moving the min interval while the max and
idle times stay around.

This also makes me wonder if we should work towards extending
pgstat_report_stat(), so as to save on GetCurrentTimestamp() calls while
keeping the internals local to pgstat.c. Or perhaps that is not in the
scope of a backpatchable design.
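
Something along these lines, purely as an illustration (the function name is
made up, this is not an existing API):

	/*
	 * Hypothetical variant of pgstat_report_stat(): the caller hands over a
	 * timestamp it has already computed, so pgstat.c keeps the
	 * PGSTAT_MIN_INTERVAL/PGSTAT_MAX_INTERVAL bookkeeping internal without
	 * forcing an extra GetCurrentTimestamp() call in hot code paths.
	 */
	extern long pgstat_report_stat_at(bool force, TimestampTz now);

	/* e.g. in WalSndLoop(), reusing the timestamp computed for the sleep time */
	now = GetCurrentTimestamp();
	sleeptime = WalSndComputeSleeptime(now);
	(void) pgstat_report_stat_at(false, now);
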
--
Michael

#28Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Michael Paquier (#25)
Re: Fwd: [BUG]: the walsender does not update its IO statistics until it exits

Hi,

On Wed, Mar 19, 2025 at 08:19:42AM +0900, Michael Paquier wrote:

Done that part for now.

Thanks!

It will be possible to look at the bug fix
after the release freeze as it impacts stable branches as well.

Yes, let's do that. Adding a reminder on my side.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#29Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Michael Paquier (#27)
Re: Fwd: [BUG]: the walsender does not update its IO statistics until it exits

Hi,

On Wed, Mar 19, 2025 at 02:26:47PM +0900, Michael Paquier wrote:

On Wed, Mar 19, 2025 at 09:53:16AM +0800, Xuneng Zhou wrote:

I think these two conditions are good too. In a busy system, they are met
frequently, so the flush routine will be executed at least once every
second. Conversely, when WAL generation is low, there's simply less data to
record, and the flush frequency naturally decreases.

Hmm, yeah, perhaps this is acceptable. The changes in pgstat.c seem
inconsistent, though, only moving the min interval while the max and
idle times stay around.

That's right. OTOH it sounds weird to move the other 2: that would create
wider visibility for them without a real need. That's not a big issue,
but it could impact extensions or friends that would start using those should we
change their values in the future.

Another option could be to create a dedicated one in walsender.c but I'm not
sure I like it more.

This also makes me wonder if we should work towards extending
pgstat_report_stat(), so as to save on GetCurrentTimestamp() calls while
keeping the internals local to pgstat.c. Or perhaps that is not in the
scope of a backpatchable design.

I see. I think I'd vote for making the same changes on HEAD and the STABLE branches
as a first step. And then think about a new "design" that could be part of the
ideas suggested by Andres in [1]. Thoughts?

Regards,

[1]: /messages/by-id/erpzwxoptqhuptdrtehqydzjapvroumkhh7lc6poclbhe7jk7l@l3yfsq5q4pw7

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#30Xuneng Zhou
xunengzhou@gmail.com
In reply to: Bertrand Drouvot (#29)
Re: Fwd: [BUG]: the walsender does not update its IO statistics until it exits

Hi,
Moving the other two provides a more complete view of the settings. For
newcomers (like me) to the codebase, seeing all three related values in one
place helps avoid a narrow view of the settings.

But I am not sure that I understand the cons of this well.

Bertrand Drouvot <bertranddrouvot.pg@gmail.com> 于2025年3月19日周三 13:50写道:

Hi,

On Wed, Mar 19, 2025 at 02:26:47PM +0900, Michael Paquier wrote:

On Wed, Mar 19, 2025 at 09:53:16AM +0800, Xuneng Zhou wrote:

I think these two conditions are good too. In a busy system, they are met
frequently, so the flush routine will be executed at least once every
second. Conversely, when WAL generation is low, there's simply less data to
record, and the flush frequency naturally decreases.

Hmm, yeah, perhaps this is acceptable. The changes in pgstat.c seem
inconsistent, though, only moving the min interval while the max and
idle times stay around.

That's right. OTOH it sounds weird to move the other 2: that would create
wider visibility for them without a real need. That's not a big issue,
but it could impact extensions or friends that would start using those should we
change their values in the future.

The current name of the min interval appears consistent within the context
of the surrounding code in this patch.

Another option could be to create a dedicated one in walsender.c but I'm not
sure I like it more.

#31Michael Paquier
michael@paquier.xyz
In reply to: Xuneng Zhou (#30)
Re: Fwd: [BUG]: the walsender does not update its IO statistics until it exits

On Wed, Mar 19, 2025 at 04:00:49PM +0800, Xuneng Zhou wrote:

Hi,
Moving the other two provides a more complete view of the settings. For
newcomers(like me) to the codebase, seeing all three related values in one
place helps avoid a narrow view of the settings.

But I am not sure that I understand the cons of this well.

While I don't disagree with the use of a hardcoded interval of time to
control timing the flush of the WAL sender stats, do we really need to
rely on the timing defined by pgstat.c? Wouldn't it be simpler to
assign one in walsender.c and pick up a different, perhaps higher,
value?

At the end the timestamp calculations are free because we can rely on
the existing call of GetCurrentTimestamp() for the physical WAL
senders to get an idea of the current time. For the logical WAL
senders, perhaps we'd better do the reports in WalSndWaitForWal(),
actually. There is also a call to GetCurrentTimestamp() that we can
rely on in this path.
--
Michael

#32Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Michael Paquier (#31)
1 attachment(s)
Re: Fwd: [BUG]: the walsender does not update its IO statistics until it exits

Hi,

On Mon, Mar 24, 2025 at 08:41:20AM +0900, Michael Paquier wrote:

On Wed, Mar 19, 2025 at 04:00:49PM +0800, Xuneng Zhou wrote:

Hi,
Moving the other two provides a more complete view of the settings. For
newcomers (like me) to the codebase, seeing all three related values in one
place helps avoid a narrow view of the settings.

But I am not sure that I understand the cons of this well.

While I don't disagree with the use of a hardcoded interval of time to
control timing the flush of the WAL sender stats, do we really need to
rely on the timing defined by pgstat.c?

No but I thought it could make sense.

Wouldn't it be simpler to
assign one in walsender.c and pick up a different, perhaps higher,
value?

I don't have a strong opinion on it so done as suggested above in the attached.

I think that the 1s value is fine because: 1. it is consistent with
PGSTAT_MIN_INTERVAL, and 2. it already requires the sender to be caught up or
to have pending data to send (meaning the effective interval could already be
higher than 1s). That said, it would not hurt if you prefer a higher value.

At the end the timestamp calculations are free because we can rely on
the existing call of GetCurrentTimestamp() for the physical WAL
senders to get an idea of the current time.

Yup

For the logical WAL
senders, perhaps we'd better do the reports in WalSndWaitForWal(),
actually. There is also a call to GetCurrentTimestamp() that we can
rely on in this path.

I think it's better to flush the stats in a shared code path. I think it's
easier to maintain, and there are no differences between logical and
physical walsenders that would justify flushing the stats in specific code
paths.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v5-0001-Flush-the-IO-statistics-of-active-walsenders.patch (text/x-diff; charset=us-ascii)
From 1102b1bccaa5ff88b4045a4e8751e43094e946e5 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Date: Tue, 25 Feb 2025 10:18:05 +0000
Subject: [PATCH v5] Flush the IO statistics of active walsenders

The walsender does not flush its IO statistics until it exits.
The issue is there since pg_stat_io has been introduced in a9c70b46dbe.
This commits:

1. ensures it does not wait to exit to flush its IO statistics
2. flush its IO statistics periodically to not overload the walsender
3. adds a test for a physical walsender (a logical walsender had the same issue
but the fix is in the same code path)
---
 src/backend/replication/walsender.c   | 63 +++++++++++++++++++--------
 src/test/recovery/t/001_stream_rep.pl | 16 +++++++
 2 files changed, 61 insertions(+), 18 deletions(-)
  76.6% src/backend/replication/
  23.3% src/test/recovery/t/

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 1028919aecb..9eed37b5de9 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -91,10 +91,14 @@
 #include "utils/guc.h"
 #include "utils/memutils.h"
 #include "utils/pg_lsn.h"
+#include "utils/pgstat_internal.h"
 #include "utils/ps_status.h"
 #include "utils/timeout.h"
 #include "utils/timestamp.h"
 
+/* Minimum interval walsender IO stats flushes */
+#define MIN_IOSTATS_FLUSH_INTERVAL         1000
+
 /*
  * Maximum data payload in a WAL data message.  Must be >= XLOG_BLCKSZ.
  *
@@ -2742,6 +2746,8 @@ WalSndCheckTimeOut(void)
 static void
 WalSndLoop(WalSndSendDataCallback send_data)
 {
+	TimestampTz last_flush = 0;
+
 	/*
 	 * Initialize the last reply timestamp. That enables timeout processing
 	 * from hereon.
@@ -2836,30 +2842,51 @@ WalSndLoop(WalSndSendDataCallback send_data)
 		 * WalSndWaitForWal() handle any other blocking; idle receivers need
 		 * its additional actions.  For physical replication, also block if
 		 * caught up; its send_data does not block.
+		 *
+		 * When the WAL sender is caught up or has pending data to send, we
+		 * also periodically report I/O statistics. It's done periodically to
+		 * not overload the WAL sender.
 		 */
-		if ((WalSndCaughtUp && send_data != XLogSendLogical &&
-			 !streamingDoneSending) ||
-			pq_is_send_pending())
+		if ((WalSndCaughtUp && !streamingDoneSending) || pq_is_send_pending())
 		{
-			long		sleeptime;
-			int			wakeEvents;
+			TimestampTz now;
 
-			if (!streamingDoneReceiving)
-				wakeEvents = WL_SOCKET_READABLE;
-			else
-				wakeEvents = 0;
+			now = GetCurrentTimestamp();
 
-			/*
-			 * Use fresh timestamp, not last_processing, to reduce the chance
-			 * of reaching wal_sender_timeout before sending a keepalive.
-			 */
-			sleeptime = WalSndComputeSleeptime(GetCurrentTimestamp());
+			if (TimestampDifferenceExceeds(last_flush, now, MIN_IOSTATS_FLUSH_INTERVAL))
+			{
+				/*
+				 * Report IO statistics
+				 */
+				pgstat_flush_io(false);
+				(void) pgstat_flush_backend(false, PGSTAT_BACKEND_FLUSH_IO);
+				last_flush = now;
+			}
 
-			if (pq_is_send_pending())
-				wakeEvents |= WL_SOCKET_WRITEABLE;
+			if (send_data != XLogSendLogical || pq_is_send_pending())
+			{
+				long		sleeptime;
+				int			wakeEvents;
+
+				if (!streamingDoneReceiving)
+					wakeEvents = WL_SOCKET_READABLE;
+				else
+					wakeEvents = 0;
+
+				/*
+				 * Use fresh timestamp, not last_processing, to reduce the
+				 * chance of reaching wal_sender_timeout before sending a
+				 * keepalive.
+				 */
+				sleeptime = WalSndComputeSleeptime(now);
+
+				if (pq_is_send_pending())
+					wakeEvents |= WL_SOCKET_WRITEABLE;
+
+				/* Sleep until something happens or we time out */
+				WalSndWait(wakeEvents, sleeptime, WAIT_EVENT_WAL_SENDER_MAIN);
+			}
 
-			/* Sleep until something happens or we time out */
-			WalSndWait(wakeEvents, sleeptime, WAIT_EVENT_WAL_SENDER_MAIN);
 		}
 	}
 }
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index ccd8417d449..2508295eca6 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -42,6 +42,9 @@ $node_standby_2->init_from_backup($node_standby_1, $backup_name,
 	has_streaming => 1);
 $node_standby_2->start;
 
+# To check that an active walsender updates its IO statistics below.
+$node_primary->safe_psql('postgres', "SELECT pg_stat_reset_shared('io')");
+
 # Create some content on primary and check its presence in standby nodes
 $node_primary->safe_psql('postgres',
 	"CREATE TABLE tab_int AS SELECT generate_series(1,1002) AS a");
@@ -333,6 +336,19 @@ $node_primary->psql(
 
 note "switching to physical replication slot";
 
+# Wait for the walsender to update its IO statistics.
+# Has to be done before the next restart and far enough from the
+# pg_stat_reset_shared('io') to minimize the risk of polling for too long.
+$node_primary->poll_query_until(
+	'postgres',
+	qq[SELECT sum(reads) > 0
+       FROM pg_catalog.pg_stat_io
+       WHERE backend_type = 'walsender'
+       AND object = 'wal']
+  )
+  or die
+  "Timed out while waiting for the walsender to update its IO statistics";
+
 # Switch to using a physical replication slot. We can do this without a new
 # backup since physical slots can go backwards if needed. Do so on both
 # standbys. Since we're going to be testing things that affect the slot state,
-- 
2.34.1

#33vignesh C
vignesh21@gmail.com
In reply to: Bertrand Drouvot (#32)
Re: Fwd: [BUG]: the walsender does not update its IO statistics until it exits

On Mon, 31 Mar 2025 at 18:35, Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

Hi,

On Mon, Mar 24, 2025 at 08:41:20AM +0900, Michael Paquier wrote:

On Wed, Mar 19, 2025 at 04:00:49PM +0800, Xuneng Zhou wrote:

Hi,
Moving the other two provides a more complete view of the settings. For
newcomers (like me) to the codebase, seeing all three related values in one
place helps avoid a narrow view of the settings.

But I am not sure that I understand the cons of this well.

While I don't disagree with the use of a hardcoded interval of time to
control timing the flush of the WAL sender stats, do we really need to
rely on the timing defined by pgstat.c?

No but I thought it could make sense.

Wouldn't it be simpler to
assign one in walsender.c and pick up a different, perhaps higher,
value?

I don't have a strong opinion on it so done as suggested above in the attached.

I think that the 1s value is fine because: 1. it is consistent with
PGSTAT_MIN_INTERVAL, and 2. it already requires the sender to be caught up or
to have pending data to send (meaning the effective interval could already be
higher than 1s). That said, it would not hurt if you prefer a higher value.

At the end the timestamp calculations are free because we can rely on
the existing call of GetCurrentTimestamp() for the physical WAL
senders to get an idea of the current time.

Yup

For the logical WAL
senders, perhaps we'd better do the reports in WalSndWaitForWal(),
actually. There is also a call to GetCurrentTimestamp() that we can
rely on in this path.

I think it's better to flush the stats in a shared code path. I think it's
easier to maintain, and there are no differences between logical and
physical walsenders that would justify flushing the stats in specific code
paths.

Couple of suggestions:
1) I felt we can include a similar verification for one of the logical
replication tests too:
+# Wait for the walsender to update its IO statistics.

+# Has to be done before the next restart and far enough from the
+# pg_stat_reset_shared('io') to minimize the risk of polling for too long.
+$node_primary->poll_query_until(
+ 'postgres',
+ qq[SELECT sum(reads) > 0
+       FROM pg_catalog.pg_stat_io
+       WHERE backend_type = 'walsender'
+       AND object = 'wal']
+  )
+  or die
+  "Timed out while waiting for the walsender to update its IO statistics";
+
2) We can comment this in a single line itself:
+ /*
+ * Report IO statistics
+ */

Regards,
Vignesh

#34Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: vignesh C (#33)
Re: Fwd: [BUG]: the walsender does not update its IO statistics until it exits

Hi,

On Thu, Apr 03, 2025 at 09:25:02AM +0530, vignesh C wrote:

On Mon, 31 Mar 2025 at 18:35, Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

Couple of suggestions:

Thanks for looking at it!

1) I felt we can include a similar verification for one of the logical
replication tests too:
+# Wait for the walsender to update its IO statistics.

+# Has to be done before the next restart and far enough from the
+# pg_stat_reset_shared('io') to minimize the risk of polling for too long.
+$node_primary->poll_query_until(
+ 'postgres',
+ qq[SELECT sum(reads) > 0
+       FROM pg_catalog.pg_stat_io
+       WHERE backend_type = 'walsender'
+       AND object = 'wal']
+  )
+  or die
+  "Timed out while waiting for the walsender to update its IO statistics";
+

Initially ([1]) it was added in 035_standby_logical_decoding.pl. But this
test (035) is already racy enough that I felt it better to move it to 001_stream_rep.pl
(see [2]) instead.

I don't think that's a big issue, as the code path being changed in the patch is
shared between logical and physical walsenders. That said, it would not hurt to
add a logical walsender test, but I would prefer it to be outside
035_standby_logical_decoding.pl. Do you have a suggestion for the location of
such a test?

2) We can comment this in a single line itself:
+ /*
+ * Report IO statistics
+ */

Right, but:

- it's already done that way in walsender.c (see "Try to flush any pending output to the client"
for example).
- the exact same comment is written that way in pgstat_bgwriter.c and
pgstat_checkpointer.c

So I just copy/pasted it.

[1]: /messages/by-id/Z73IsKBceoVd4t55@ip-10-97-1-34.eu-west-3.compute.internal
[2]: /messages/by-id/Z77jgvhwOu9S0a5r@ip-10-97-1-34.eu-west-3.compute.internal

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#35vignesh C
vignesh21@gmail.com
In reply to: Bertrand Drouvot (#34)
Re: Fwd: [BUG]: the walsender does not update its IO statistics until it exits

On Thu, 3 Apr 2025 at 12:54, Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

Hi,

On Thu, Apr 03, 2025 at 09:25:02AM +0530, vignesh C wrote:

On Mon, 31 Mar 2025 at 18:35, Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

Couple of suggestions:

Thanks for looking at it!

1) I felt we can include a similar verification for one of the logical
replication tests too:
+# Wait for the walsender to update its IO statistics.

+# Has to be done before the next restart and far enough from the
+# pg_stat_reset_shared('io') to minimize the risk of polling for too long.
+$node_primary->poll_query_until(
+ 'postgres',
+ qq[SELECT sum(reads) > 0
+       FROM pg_catalog.pg_stat_io
+       WHERE backend_type = 'walsender'
+       AND object = 'wal']
+  )
+  or die
+  "Timed out while waiting for the walsender to update its IO statistics";
+

Initially ([1]) it was added in 035_standby_logical_decoding.pl. But this
test (035) is already racy enough that I felt it better to move it to 001_stream_rep.pl
(see [2]) instead.

I don't think that's a big issue, as the code path being changed in the patch is
shared between logical and physical walsenders. That said, it would not hurt to
add a logical walsender test, but I would prefer it to be outside
035_standby_logical_decoding.pl. Do you have a suggestion for the location of
such a test?

Can we add it to one of the subscription tests, such as 001_rep_changes.pl?

Regards,
Vignesh

#36Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: vignesh C (#35)
1 attachment(s)
Re: Fwd: [BUG]: the walsender does not update its IO statistics until it exits

Hi,

On Thu, Apr 03, 2025 at 03:23:31PM +0530, vignesh C wrote:

Can we add it to one of the subscription tests, such as 001_rep_changes.pl?

Yeah that sounds like a good place for it. Done in the attached.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v6-0001-Flush-the-IO-statistics-of-active-walsenders.patch (text/x-diff; charset=us-ascii)
From 49d9c73a29f5c05567d01f57424b7b8fd4df4eee Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Date: Tue, 25 Feb 2025 10:18:05 +0000
Subject: [PATCH v6] Flush the IO statistics of active walsenders

The walsender does not flush its IO statistics until it exits.
The issue is there since pg_stat_io has been introduced in a9c70b46dbe.
This commits:

1. ensures it does not wait to exit to flush its IO statistics
2. flush its IO statistics periodically to not overload the walsender
3. adds a test for a physical walsender and a test for a logical walsender
---
 src/backend/replication/walsender.c        | 63 +++++++++++++++-------
 src/test/recovery/t/001_stream_rep.pl      | 16 ++++++
 src/test/subscription/t/001_rep_changes.pl | 13 +++++
 3 files changed, 74 insertions(+), 18 deletions(-)
  65.8% src/backend/replication/
  20.0% src/test/recovery/t/
  14.0% src/test/subscription/t/

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 1028919aecb..9eed37b5de9 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -91,10 +91,14 @@
 #include "utils/guc.h"
 #include "utils/memutils.h"
 #include "utils/pg_lsn.h"
+#include "utils/pgstat_internal.h"
 #include "utils/ps_status.h"
 #include "utils/timeout.h"
 #include "utils/timestamp.h"
 
+/* Minimum interval walsender IO stats flushes */
+#define MIN_IOSTATS_FLUSH_INTERVAL         1000
+
 /*
  * Maximum data payload in a WAL data message.  Must be >= XLOG_BLCKSZ.
  *
@@ -2742,6 +2746,8 @@ WalSndCheckTimeOut(void)
 static void
 WalSndLoop(WalSndSendDataCallback send_data)
 {
+	TimestampTz last_flush = 0;
+
 	/*
 	 * Initialize the last reply timestamp. That enables timeout processing
 	 * from hereon.
@@ -2836,30 +2842,51 @@ WalSndLoop(WalSndSendDataCallback send_data)
 		 * WalSndWaitForWal() handle any other blocking; idle receivers need
 		 * its additional actions.  For physical replication, also block if
 		 * caught up; its send_data does not block.
+		 *
+		 * When the WAL sender is caught up or has pending data to send, we
+		 * also periodically report I/O statistics. It's done periodically to
+		 * not overload the WAL sender.
 		 */
-		if ((WalSndCaughtUp && send_data != XLogSendLogical &&
-			 !streamingDoneSending) ||
-			pq_is_send_pending())
+		if ((WalSndCaughtUp && !streamingDoneSending) || pq_is_send_pending())
 		{
-			long		sleeptime;
-			int			wakeEvents;
+			TimestampTz now;
 
-			if (!streamingDoneReceiving)
-				wakeEvents = WL_SOCKET_READABLE;
-			else
-				wakeEvents = 0;
+			now = GetCurrentTimestamp();
 
-			/*
-			 * Use fresh timestamp, not last_processing, to reduce the chance
-			 * of reaching wal_sender_timeout before sending a keepalive.
-			 */
-			sleeptime = WalSndComputeSleeptime(GetCurrentTimestamp());
+			if (TimestampDifferenceExceeds(last_flush, now, MIN_IOSTATS_FLUSH_INTERVAL))
+			{
+				/*
+				 * Report IO statistics
+				 */
+				pgstat_flush_io(false);
+				(void) pgstat_flush_backend(false, PGSTAT_BACKEND_FLUSH_IO);
+				last_flush = now;
+			}
 
-			if (pq_is_send_pending())
-				wakeEvents |= WL_SOCKET_WRITEABLE;
+			if (send_data != XLogSendLogical || pq_is_send_pending())
+			{
+				long		sleeptime;
+				int			wakeEvents;
+
+				if (!streamingDoneReceiving)
+					wakeEvents = WL_SOCKET_READABLE;
+				else
+					wakeEvents = 0;
+
+				/*
+				 * Use fresh timestamp, not last_processing, to reduce the
+				 * chance of reaching wal_sender_timeout before sending a
+				 * keepalive.
+				 */
+				sleeptime = WalSndComputeSleeptime(now);
+
+				if (pq_is_send_pending())
+					wakeEvents |= WL_SOCKET_WRITEABLE;
+
+				/* Sleep until something happens or we time out */
+				WalSndWait(wakeEvents, sleeptime, WAIT_EVENT_WAL_SENDER_MAIN);
+			}
 
-			/* Sleep until something happens or we time out */
-			WalSndWait(wakeEvents, sleeptime, WAIT_EVENT_WAL_SENDER_MAIN);
 		}
 	}
 }
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index ccd8417d449..2508295eca6 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -42,6 +42,9 @@ $node_standby_2->init_from_backup($node_standby_1, $backup_name,
 	has_streaming => 1);
 $node_standby_2->start;
 
+# To check that an active walsender updates its IO statistics below.
+$node_primary->safe_psql('postgres', "SELECT pg_stat_reset_shared('io')");
+
 # Create some content on primary and check its presence in standby nodes
 $node_primary->safe_psql('postgres',
 	"CREATE TABLE tab_int AS SELECT generate_series(1,1002) AS a");
@@ -333,6 +336,19 @@ $node_primary->psql(
 
 note "switching to physical replication slot";
 
+# Wait for the walsender to update its IO statistics.
+# Has to be done before the next restart and far enough from the
+# pg_stat_reset_shared('io') to minimize the risk of polling for too long.
+$node_primary->poll_query_until(
+	'postgres',
+	qq[SELECT sum(reads) > 0
+       FROM pg_catalog.pg_stat_io
+       WHERE backend_type = 'walsender'
+       AND object = 'wal']
+  )
+  or die
+  "Timed out while waiting for the walsender to update its IO statistics";
+
 # Switch to using a physical replication slot. We can do this without a new
 # backup since physical slots can go backwards if needed. Do so on both
 # standbys. Since we're going to be testing things that affect the slot state,
diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 8726fe04ad2..fc17bbd48a0 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -184,6 +184,19 @@ $result =
   $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_no_col");
 is($result, qq(2), 'check replicated changes for table having no columns');
 
+# Wait for the walsender to update its IO statistics.
+# Has to be done far enough from the CREATE SUBSCRIPTION to minimize the risk
+# of polling for too long.
+$node_publisher->poll_query_until(
+	'postgres',
+	qq[SELECT sum(reads) > 0
+       FROM pg_catalog.pg_stat_io
+       WHERE backend_type = 'walsender'
+       AND object = 'wal']
+  )
+  or die
+  "Timed out while waiting for the walsender to update its IO statistics";
+
 # insert some duplicate rows
 $node_publisher->safe_psql('postgres',
 	"INSERT INTO tab_full SELECT generate_series(1,10)");
-- 
2.34.1

#37vignesh C
vignesh21@gmail.com
In reply to: Bertrand Drouvot (#36)
Re: Fwd: [BUG]: the walsender does not update its IO statistics until it exits

On Fri, 4 Apr 2025 at 10:31, Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

Hi,

On Thu, Apr 03, 2025 at 03:23:31PM +0530, vignesh C wrote:

Can we add it to one of the subscription tests, such as 001_rep_changes.pl?

Yeah that sounds like a good place for it. Done in the attached.

The new test added currently passes even without the patch. It would
be ideal to have a test that fails without the patch and passes once
the patch is applied.

Regards,
Vignesh

#38Michael Paquier
michael@paquier.xyz
In reply to: vignesh C (#37)
1 attachment(s)
Re: Fwd: [BUG]: the walsender does not update its IO statistics until it exits

On Fri, Apr 04, 2025 at 09:33:46PM +0530, vignesh C wrote:

The new test added currently passes even without the patch. It would
be ideal to have a test that fails without the patch and passes once
the patch is applied.

Right. The subscription test and logical WAL senders pass without
the patch, because we are able to catch some WAL activity through
pgoutput. The recovery test for the physical WAL sender fails without the
patch on a timeout.

We would need something more advanced here for the logical case, where
we could use pg_recvlogical started in the background with a hardcoded
endpos, or just kill the pg_recvlogical command once we have checked
the state of the stats. I am not sure if this is worth the cycles
spent on it, TBH, so I would be happy with just the physical case checked
in TAP, as it's simpler because streaming replication makes that easy
to work with.

One thing that I'm a bit unhappy about in the patch is the choice to
do the stats updates in WalSndLoop() for the logical WAL sender case,
because this forces an extra GetCurrentTimestamp() call for each loop,
and that's never a cheap system call in what can be a hot code path.
How about doing the calculations in WalSndWaitForWal() for the logical
part, relying on the existing GetCurrentTimestamp() done there?
That's also where the waits are handled for the logical part, so there
may be a good point in keeping this code more symmetric for now,
rather than split it.

Saying that, here is a version 7 with all that included, which is
simpler to read.
--
Michael

Attachments:

v7-0001-Flush-the-IO-statistics-of-active-walsenders.patch (text/x-diff; charset=us-ascii)
From 9f6dadc679cdc0da7c005d005a06dad6e81020ad Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Date: Tue, 25 Feb 2025 10:18:05 +0000
Subject: [PATCH v7] Flush the IO statistics of active walsenders

The walsender does not flush its IO statistics until it exits.
The issue is there since pg_stat_io has been introduced in a9c70b46dbe.
This commits:

1. ensures it does not wait to exit to flush its IO statistics
2. flush its IO statistics periodically to not overload the walsender
3. adds a test for a physical walsender and a test for a logical walsender
---
 src/backend/replication/walsender.c   | 36 +++++++++++++++++++++++++--
 src/test/recovery/t/001_stream_rep.pl | 15 +++++++++++
 2 files changed, 49 insertions(+), 2 deletions(-)

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 1028919aecb1..216baeda5cd2 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -91,10 +91,14 @@
 #include "utils/guc.h"
 #include "utils/memutils.h"
 #include "utils/pg_lsn.h"
+#include "utils/pgstat_internal.h"
 #include "utils/ps_status.h"
 #include "utils/timeout.h"
 #include "utils/timestamp.h"
 
+/* Minimum interval used by walsender for stats flushes, in ms */
+#define WALSENDER_STATS_FLUSH_INTERVAL         1000
+
 /*
  * Maximum data payload in a WAL data message.  Must be >= XLOG_BLCKSZ.
  *
@@ -1797,6 +1801,7 @@ WalSndWaitForWal(XLogRecPtr loc)
 	int			wakeEvents;
 	uint32		wait_event = 0;
 	static XLogRecPtr RecentFlushPtr = InvalidXLogRecPtr;
+	TimestampTz last_flush = 0;
 
 	/*
 	 * Fast path to avoid acquiring the spinlock in case we already know we
@@ -1817,6 +1822,7 @@ WalSndWaitForWal(XLogRecPtr loc)
 	{
 		bool		wait_for_standby_at_stop = false;
 		long		sleeptime;
+		TimestampTz now;
 
 		/* Clear any already-pending wakeups */
 		ResetLatch(MyLatch);
@@ -1927,7 +1933,8 @@ WalSndWaitForWal(XLogRecPtr loc)
 		 * new WAL to be generated.  (But if we have nothing to send, we don't
 		 * want to wake on socket-writable.)
 		 */
-		sleeptime = WalSndComputeSleeptime(GetCurrentTimestamp());
+		now = GetCurrentTimestamp();
+		sleeptime = WalSndComputeSleeptime(now);
 
 		wakeEvents = WL_SOCKET_READABLE;
 
@@ -1936,6 +1943,15 @@ WalSndWaitForWal(XLogRecPtr loc)
 
 		Assert(wait_event != 0);
 
+		/* Report IO statistics, if needed */
+		if (TimestampDifferenceExceeds(last_flush, now,
+									   WALSENDER_STATS_FLUSH_INTERVAL))
+		{
+			pgstat_flush_io(false);
+			(void) pgstat_flush_backend(false, PGSTAT_BACKEND_FLUSH_IO);
+			last_flush = now;
+		}
+
 		WalSndWait(wakeEvents, sleeptime, wait_event);
 	}
 
@@ -2742,6 +2758,8 @@ WalSndCheckTimeOut(void)
 static void
 WalSndLoop(WalSndSendDataCallback send_data)
 {
+	TimestampTz last_flush = 0;
+
 	/*
 	 * Initialize the last reply timestamp. That enables timeout processing
 	 * from hereon.
@@ -2836,6 +2854,9 @@ WalSndLoop(WalSndSendDataCallback send_data)
 		 * WalSndWaitForWal() handle any other blocking; idle receivers need
 		 * its additional actions.  For physical replication, also block if
 		 * caught up; its send_data does not block.
+		 *
+		 * The IO statistics are reported in WalSndWaitForWal() for the
+		 * logical WAL senders.
 		 */
 		if ((WalSndCaughtUp && send_data != XLogSendLogical &&
 			 !streamingDoneSending) ||
@@ -2843,6 +2864,7 @@ WalSndLoop(WalSndSendDataCallback send_data)
 		{
 			long		sleeptime;
 			int			wakeEvents;
+			TimestampTz now;
 
 			if (!streamingDoneReceiving)
 				wakeEvents = WL_SOCKET_READABLE;
@@ -2853,11 +2875,21 @@ WalSndLoop(WalSndSendDataCallback send_data)
 			 * Use fresh timestamp, not last_processing, to reduce the chance
 			 * of reaching wal_sender_timeout before sending a keepalive.
 			 */
-			sleeptime = WalSndComputeSleeptime(GetCurrentTimestamp());
+			now = GetCurrentTimestamp();
+			sleeptime = WalSndComputeSleeptime(now);
 
 			if (pq_is_send_pending())
 				wakeEvents |= WL_SOCKET_WRITEABLE;
 
+			/* Report IO statistics, if needed */
+			if (TimestampDifferenceExceeds(last_flush, now,
+										   WALSENDER_STATS_FLUSH_INTERVAL))
+			{
+				pgstat_flush_io(false);
+				(void) pgstat_flush_backend(false, PGSTAT_BACKEND_FLUSH_IO);
+				last_flush = now;
+			}
+
 			/* Sleep until something happens or we time out */
 			WalSndWait(wakeEvents, sleeptime, WAIT_EVENT_WAL_SENDER_MAIN);
 		}
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index ccd8417d449f..e55d8ec0ec17 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -42,6 +42,9 @@ $node_standby_2->init_from_backup($node_standby_1, $backup_name,
 	has_streaming => 1);
 $node_standby_2->start;
 
+# Reset IO statistics, for the WAL sender check with pg_stat_io.
+$node_primary->safe_psql('postgres', "SELECT pg_stat_reset_shared('io')");
+
 # Create some content on primary and check its presence in standby nodes
 $node_primary->safe_psql('postgres',
 	"CREATE TABLE tab_int AS SELECT generate_series(1,1002) AS a");
@@ -333,6 +336,18 @@ $node_primary->psql(
 
 note "switching to physical replication slot";
 
+# Wait for the walsender to update its IO statistics.  This is done before
+# the next restart and far enough from the reset done above.
+$node_primary->poll_query_until(
+	'postgres',
+	qq[SELECT sum(reads) > 0
+       FROM pg_catalog.pg_stat_io
+       WHERE backend_type = 'walsender'
+       AND object = 'wal']
+  )
+  or die
+  "Timed out while waiting for the walsender to update its IO statistics";
+
 # Switch to using a physical replication slot. We can do this without a new
 # backup since physical slots can go backwards if needed. Do so on both
 # standbys. Since we're going to be testing things that affect the slot state,
-- 
2.49.0

#39Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Michael Paquier (#38)
1 attachment(s)
Re: Fwd: [BUG]: the walsender does not update its IO statistics until it exits

Hi,

On Mon, Apr 07, 2025 at 03:31:14PM +0900, Michael Paquier wrote:

On Fri, Apr 04, 2025 at 09:33:46PM +0530, vignesh C wrote:

The new test added currently passes even without the patch. It would
be ideal to have a test that fails without the patch and passes once
the patch is applied.

Right. The subscription test and logical WAL senders pass without
the patch, because we are able to catch some WAL activity through
pgoutput. The recovery test for the physical WAL sender fails without the
patch on a timeout.

We would need something more advanced here for the logical case, where
we could use pg_recvlogical started in the background with a hardcoded
endpos, or just kill the pg_recvlogical command once we have checked
the state of the stats. I am not sure if this is worth the cycles
spent on it,

Another option for the logical walsender is to reset the stats once the
subscription is created, as done in the attached. With that in place, the
test now fails without the patch (and passes with the patch).

One thing that I'm a bit unhappy about in the patch is the choice to
do the stats updates in WalSndLoop() for the logical WAL sender case,
because this forces an extra GetCurrentTimestamp() call for each loop,
and that's never a cheap system call in what can be a hot code path.
How about doing the calculations in WalSndWaitForWal() for the logical
part, relying on the existing GetCurrentTimestamp() done there?
That's also where the waits are handled for the logical part, so there
may be a good point in keeping this code more symmetric for now,
rather than split it.

Well, while I was working on a reproducer case for the logical walsender (which
led to the attached), I realized that the test was looping for a relatively long
time before passing. Then I changed the patch to do *exactly* what you
propose in v7, and the logical walsender test now passes quickly enough...

So it looks like the idea of using a "shared" code path was not that good
and that splitting physical/logical as done in v7 is a good choice.

But...

TBH, so I would be happy with just the physical case checked
in TAP as it's simpler because streaming replication makes that easy
to work with.

As we now have 2 code paths I think we "really" need 2 tests. The attached
(to apply on top of v7) seems to do the job.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

add_logical_walsender_test.txt (text/plain; charset=us-ascii)
commit e3d9337aa01539ce77553d95db47aa919d1f501c
Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Date:   Mon Apr 7 06:56:08 2025 +0000

    logical walsender test

diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 8726fe04ad2..2253ee48312 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -113,6 +113,9 @@ $node_subscriber->safe_psql('postgres',
 # Wait for initial table sync to finish
 $node_subscriber->wait_for_subscription_sync($node_publisher, 'tap_sub');
 
+# To check that an active walsender updates its IO statistics below.
+$node_publisher->safe_psql('postgres', "SELECT pg_stat_reset_shared('io')");
+
 my $result =
   $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_notrep");
 is($result, qq(0), 'check non-replicated table is empty on subscriber');
@@ -184,6 +187,19 @@ $result =
   $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_no_col");
 is($result, qq(2), 'check replicated changes for table having no columns');
 
+# Wait for the walsender to update its IO statistics.
+# Has to be done before the next restart and far enough from the
+# pg_stat_reset_shared('io') to minimize the risk of polling for too long.
+$node_publisher->poll_query_until(
+	'postgres',
+	qq[SELECT sum(reads) > 0
+       FROM pg_catalog.pg_stat_io
+       WHERE backend_type = 'walsender'
+       AND object = 'wal']
+  )
+  or die
+  "Timed out while waiting for the walsender to update its IO statistics";
+
 # insert some duplicate rows
 $node_publisher->safe_psql('postgres',
 	"INSERT INTO tab_full SELECT generate_series(1,10)");
#40Michael Paquier
michael@paquier.xyz
In reply to: Bertrand Drouvot (#39)
Re: Fwd: [BUG]: the walsender does not update its IO statistics until it exits

On Mon, Apr 07, 2025 at 07:13:20AM +0000, Bertrand Drouvot wrote:

As we now have 2 code paths I think we "really" need 2 tests. The attached
(to apply on top of v7) seems to do the job.

Confirmed. I am sold on this extra location on HEAD, and it does not
impact the run time of the test as there is enough activity happening
between the resets and the poll check. So, done down to v16.
--
Michael

#41Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Michael Paquier (#5)
Re: [BUG]: the walsender does not update its IO statistics until it exits

Hi,

On Thu, Feb 27, 2025 at 12:09:46PM +0900, Michael Paquier wrote:

On Wed, Feb 26, 2025 at 05:08:17AM -0500, Andres Freund wrote:

I think it's also bad that we don't have a solution for 1), even just for
normal connections. If a backend causes a lot of IO we might want to know
about that long before the longrunning transaction commits.

I suspect the right design here would be to have a generalized form of the
timeout mechanism we have for 2).

For that we'd need to make sure that pgstat_report_stat() can be safely called
inside a transaction. The second part would be to redesign the
IdleStatsUpdateTimeoutPending mechanism so it is triggered independent of
idleness, without introducing unacceptable overhead - I think that's doable.

Agreed that something along the lines of a non-transactional update of the
entries could be adapted in some cases, so +1 for the idea. I suspect
that there would be cases where a single stats kind should be able to
handle both transactional and non-transactional flush cases.
Splitting that across two stats kinds would lead to a lot of
duplication.

One option could be to use 2 pending lists for the variable-numbered stats and 2 flush
callbacks for the fixed-numbered ones. Thoughts?
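
Roughly speaking, something like this (all names purely illustrative, these are
not the actual pgstat structures):

	/* Two pending lists for variable-numbered stats, one per flush "class" */
	static dlist_head pgStatPendingXact = DLIST_STATIC_INIT(pgStatPendingXact);
	static dlist_head pgStatPendingAnyTime = DLIST_STATIC_INIT(pgStatPendingAnyTime);

	/* And, for fixed-numbered stats, two flush callbacks per stats kind */
	typedef struct PgStat_FixedFlushCBs
	{
		bool		(*flush_xact_cb) (bool nowait);		/* at xact boundaries only */
		bool		(*flush_anytime_cb) (bool nowait);	/* safe inside a transaction */
	} PgStat_FixedFlushCBs;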

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#42Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Andres Freund (#4)
Re: [BUG]: the walsender does not update its IO statistics until it exits

Hi,

On Wed, Feb 26, 2025 at 05:08:17AM -0500, Andres Freund wrote:

Hi,

On 2025-02-26 15:37:10 +0900, Michael Paquier wrote:

That's bad, and worse for a logical WAL sender, because it means that we
have no idea what kind of I/O happens in this process until it exits,
and logical WAL senders could run forever (the case since v16, where
we began tracking I/O).

FWIW, I think medium term we need to work on splitting stats flushing into two
separate kinds of flushes:
1) non-transactional stats, which should be flushed at a regular interval,
unless a process is completely idle
2) transaction stats, which can only be flushed at transaction boundaries,
because before the transaction boundary we don't know if e.g. newly
inserted rows should be counted as live or dead

So far we have some timer logic for 2), but we have basically no support for
1). Which means we have weird ad-hoc logic in various kinds of
non-plain-connection processes. And that will often have holes, as Bertrand
noticed here.

I think it's also bad that we don't have a solution for 1), even just for
normal connections. If a backend causes a lot of IO, we might want to know
about that long before the long-running transaction commits.

I suspect the right design here would be to have a generalized form of the
timeout mechanism we have for 2).

For that we'd need to make sure that pgstat_report_stat() can be safely called
inside a transaction.

I can see an issue with GetCurrentTransactionStopTimestamp(); do you have any
other issue/concern in mind?

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#43Michael Paquier
michael@paquier.xyz
In reply to: Bertrand Drouvot (#41)
Re: [BUG]: the walsender does not update its IO statistics until it exits

On Tue, Sep 30, 2025 at 07:14:14AM +0000, Bertrand Drouvot wrote:

On Thu, Feb 27, 2025 at 12:09:46PM +0900, Michael Paquier wrote:

Agreed that something along the lines of a non-transactional update of the
entries could be adapted in some cases, so +1 for the idea. I suspect
that there would be cases where a single stats kind should be able to
handle both transactional and non-transactional flush cases.
Splitting that across two stats kinds would lead to a lot of
duplication.

One option could be to use 2 pending lists for the variable stats and 2 flush
callbacks for the fixed stats. Thoughts?

Hmm. I would have thought first about one pending area, and two
callbacks for the variable-sized stats, called with different
timings, because the stats to be flushed are the same, aren't they? For
example, if we are in a long analytical query, we would flush the IO
stats periodically, reset the pending data, rinse and repeat, and do
a last round when we are done with the query in postgres.c.

Do we really need a second callback, by the way? It could as well be
the same flush callback, with an option to mark stats kinds that allow
a periodic flush. The trick is knowing where the new report calls
should happen. The executor is the primary target area.
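
Something like this, then (a pure sketch; the flag, the table and the
function below are placeholders, not existing pgstat code):

#include <stdbool.h>
#include <stddef.h>

/*
 * Sketch of marking stats kinds that may be flushed outside transaction
 * boundaries, plus a periodic report point that only flushes those.
 */
typedef struct SketchKindInfo
{
    const char *name;
    bool        allow_periodic_flush;   /* safe to flush mid-transaction? */
} SketchKindInfo;

static const SketchKindInfo sketch_kinds[] = {
    {"io", true},           /* read/write counters, no commit dependency */
    {"relation", false},    /* live/dead tuple counts need the commit outcome */
};

/* Called from a periodic report point, e.g. somewhere in the executor. */
static void
sketch_report_periodic(void)
{
    for (size_t i = 0; i < sizeof(sketch_kinds) / sizeof(sketch_kinds[0]); i++)
    {
        if (!sketch_kinds[i].allow_periodic_flush)
            continue;
        /* flush the pending data of this kind, without waiting on locks */
    }
}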

Or perhaps you think that the pending data of a stats kind could be
different if a kind allows transactional and non-transactional
flushes?
--
Michael

#44Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Michael Paquier (#43)
Re: [BUG]: the walsender does not update its IO statistics until it exits

Hi,

On Tue, Sep 30, 2025 at 04:37:25PM +0900, Michael Paquier wrote:

On Tue, Sep 30, 2025 at 07:14:14AM +0000, Bertrand Drouvot wrote:

On Thu, Feb 27, 2025 at 12:09:46PM +0900, Michael Paquier wrote:

Agreed that something along the lines of a non-transactional update of the
entries could be adapted in some cases, so +1 for the idea. I suspect
that there would be cases where a single stats kind should be able to
handle both transactional and non-transactional flush cases.
Splitting that across two stats kinds would lead to a lot of
duplication.

One option could be to use 2 pending lists for the variable stats and 2 flush
callbacks for the fixed stats. Thoughts?

Hmm. I would have thought first about one pending area, and two
callbacks for the variable-sized stats, called with different
timings, because the stats to be flushed are the same, aren't they? For
example, if we are in a long analytical query, we would flush the IO
stats periodically, reset the pending data, rinse and repeat, and do
a last round when we are done with the query in postgres.c.

Do we really need a second callback, by the way? It could as well be
the same flush callback, with an option to mark stats kinds that allow
a periodic flush. The trick is knowing where the new report calls
should happen. The executor is the primary target area.

Or perhaps you think that the pending data of a stats kind could be
different if a kind allows transactional and non-transactional
flushes?

Yeah, that was my thought: one stats kind could have some metrics that are
transactional and some that are non-transactional. This could be the case for
both variable and fixed stats, which is why I was thinking about having
2 pending lists for the variable stats and 2 flush callbacks for the fixed stats.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#45Michael Paquier
michael@paquier.xyz
In reply to: Bertrand Drouvot (#44)
Re: [BUG]: the walsender does not update its IO statistics until it exits

On Tue, Sep 30, 2025 at 10:05:43AM +0000, Bertrand Drouvot wrote:

On Tue, Sep 30, 2025 at 04:37:25PM +0900, Michael Paquier wrote:

Or perhaps you think that the pending data of a stats kind could be
different if a kind allows transactional and non-transactional
flushes?

Yeah, that was my thought: one stats kind could have some metrics that are
transactional and some that are non-transactional. This could be the case for
both variable and fixed stats, which is why I was thinking about having
2 pending lists for the variable stats and 2 flush callbacks for the fixed stats.

Okay. The first use case that comes to mind here is pgstat_io.c, to
offer periodic reports on analytics (please extend that to the backend
stats for WAL and I/O). In this case, it seems that it would be good
to know about all the stats fields, meaning that a single pending list
is OK. My feeling would be to reuse pgstat_report_stat() and plant a
new argument for the transactional state (or just a bits32 that works
along with the force option) with the same threshold for reports, combined
with a new option to allow non-transactional reports (I don't see a
point in forcing these and waiting on locks; 1-min reports not happening
would not matter if an analytical query takes a few hours).
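
In rough terms, that interface could look like the following (a sketch only:
the names and the new bitmask argument are placeholders for the discussion;
today pgstat_report_stat() simply takes a boolean "force"):

/*
 * Sketch of a flags-based report function; none of these names exist.
 */
typedef unsigned int sketch_bits32;     /* stand-in for the backend's bits32 */

#define SKETCH_REPORT_FORCE             0x0001  /* flush now, wait on locks if needed */
#define SKETCH_REPORT_NONTRANSACTIONAL  0x0002  /* only kinds safe to flush mid-transaction */

extern long sketch_report_stat(sketch_bits32 flags);

/*
 * Periodic call from inside a long query: never force, never wait on locks,
 * only touch the non-transactional kinds (IO, WAL, backend stats, ...):
 *
 *     sketch_report_stat(SKETCH_REPORT_NONTRANSACTIONAL);
 *
 * Call at transaction boundaries, behaving like today's report:
 *
 *     sketch_report_stat(0);
 */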

It would be better to determine a list of the stats we'd be
interested in seeing updated without waiting for a query or a
transaction to finish, then design the needs around these
requirements, with a new thread to discuss the matter. The IO stats
are just the first case coming to mind.
--
Michael

#46Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Michael Paquier (#45)
Re: [BUG]: the walsender does not update its IO statistics until it exits

Hi,

On Wed, Oct 01, 2025 at 07:43:31AM +0900, Michael Paquier wrote:

On Tue, Sep 30, 2025 at 10:05:43AM +0000, Bertrand Drouvot wrote:

On Tue, Sep 30, 2025 at 04:37:25PM +0900, Michael Paquier wrote:

Or perhaps you think that the pending data of a stats kind could be
different if a kind allows transactional and non-transactional
flushes?

Yeah that was my thought: one stats kind could have metrics that are transactional
and some metrics that are non-transactional. This could be possible for
both variable and fixed stats, that's why I was thinking about having
2 pending lists for the variables stats and 2 flush callbacks for fixed stats.

Okay. The first use case that comes to mind here is pgstat_io.c, to
offer periodic reports on analytics (please extend that to the backend
stats for WAL and I/O). In this case, it seems that it would be good
to know about all the stats fields, meaning that a single pending list
is OK. My feeling would be to reuse pgstat_report_stat() and plant a
new argument for the transactional state (or just a bits32 that works
along with the force option) with the same threshold for reports, combined
with a new option to allow non-transactional reports (I don't see a
point in forcing these and waiting on locks; 1-min reports not happening
would not matter if an analytical query takes a few hours).

It would be better to determine a list of the stats we'd be
interested in seeing updated without waiting for a query or a
transaction to finish, then design the needs around these
requirements, with a new thread to discuss the matter. The IO stats
are just the first case coming to mind.

Yeah. Also for the relation stats we'd have a mix (for example, there is no
need to wait for a transaction to finish to report its seq_scan count, if
any).
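
To illustrate the kind of split I mean, purely as an example (the field names
are made up for the illustration and do not match PgStat_TableCounts):

/* counters that do not depend on the transaction outcome */
typedef struct RelCountsNonTransactional
{
    long long   seq_scans;          /* a scan happened, whatever the outcome */
    long long   idx_scans;
    long long   blocks_read;
} RelCountsNonTransactional;

/* counters that only make sense once we know commit vs. abort */
typedef struct RelCountsTransactional
{
    long long   tuples_inserted;    /* live or dead depends on the outcome */
    long long   tuples_updated;
    long long   tuples_deleted;
} RelCountsTransactional;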

I'll work on a list and open a dedicated thread.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#47Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Andres Freund (#4)
Re: [BUG]: the walsender does not update its IO statistics until it exits

Hi,

On Wed, Feb 26, 2025 at 05:08:17AM -0500, Andres Freund wrote:

FWIW, I think medium term we need to work on splitting stats flushing into two
separate kinds of flushes:
1) non-transactional stats, which should be flushed at a regular interval,
unless a process is completely idle
2) transaction stats, which can only be flushed at transaction boundaries,
because before the transaction boundary we don't know if e.g. newly
inserted rows should be counted as live or dead

So far we have some timer logic for 2), but we have basically no support for
1). Which means we have weird ad-hoc logic in various kinds of
non-plain-connection processes. And that will often have holes, as Bertrand
noticed here.

I think it's also bad that we don't have a solution for 1), even just for
normal connections. If a backend causes a lot of IO, we might want to know
about that long before the long-running transaction commits.

I suspect the right design here would be to have a generalized form of the
timeout mechanism we have for 2).

For that we'd need to make sure that pgstat_report_stat() can be safely called
inside a transaction. The second part would be to redesign the
IdleStatsUpdateTimeoutPending mechanism so it is triggered independent of
idleness, without introducing unacceptable overhead - I think that's doable.

FWIW, I just started to look at it. I'll update once I have something to share.
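
In the meantime, here is a very rough sketch of the shape it could take (all
names below are placeholders; the real thing would build on the existing
timeout infrastructure and on IdleStatsUpdateTimeoutPending):

#include <stdbool.h>

/* set from a timeout handler, checked from safe points in long code paths */
static volatile bool periodic_stats_flush_pending = false;

/* timeout handler: only set a flag, do no real work in signal context */
static void
periodic_stats_timeout_handler(void)
{
    periodic_stats_flush_pending = true;
}

/*
 * Called from safe points (executor loop, walsender loop, ...): flush only
 * the stats kinds that do not depend on the transaction outcome, then re-arm
 * the timer for the next interval.
 */
static void
maybe_flush_periodic_stats(void)
{
    if (!periodic_stats_flush_pending)
        return;

    periodic_stats_flush_pending = false;
    /* flush IO/WAL/backend stats here, skip transaction-bound counters */
    /* re-arm the timeout */
}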

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com