[BUG]: the walsender does not update its IO statistics until it exits
Hi hackers,
While doing some tests for [1], I observed that $SUBJECT.
To observe this behavior on master:
1. create a logical replication slot
postgres=# SELECT * FROM pg_create_logical_replication_slot('logical_slot', 'test_decoding', false, true);
slot_name | lsn
--------------+------------
logical_slot | 0/40749508
(1 row)
2. create a table and add some data
postgres=# create table bdt (a int);
CREATE TABLE
postgres=# insert into bdt select a from generate_series(1,10000) a ;
INSERT 0 10000
3. start pg_recvlogical this way:
pg_recvlogical -d postgres -S logical_slot -f - --no-loop --start
4. query pg_stat_io
postgres=# select backend_type,object,context,reads,read_bytes from pg_stat_io where backend_type = 'walsender';
backend_type | object | context | reads | read_bytes
--------------+---------------+-----------+-------+------------
walsender | relation | bulkread | 0 | 0
walsender | relation | bulkwrite | 0 | 0
walsender | relation | init | 0 | 0
walsender | relation | normal | 6 | 49152
walsender | relation | vacuum | 0 | 0
walsender | temp relation | normal | 0 | 0
walsender | wal | init | |
walsender | wal | normal | 0 | 0
(8 rows)
The non-zero stats that we see here are due to the pgstat_report_stat() call in
PostgresMain(), not to the walsender decoding activity per se (the proof being that
the wal object values are still empty, while the walsender certainly had to read
some WAL).
5. Once pg_recvlogical is stopped with ctrl-c, we get:
postgres=# select backend_type,object,context,reads,read_bytes from pg_stat_io where backend_type = 'walsender';
backend_type | object | context | reads | read_bytes
--------------+---------------+-----------+-------+------------
walsender | relation | bulkread | 0 | 0
walsender | relation | bulkwrite | 0 | 0
walsender | relation | init | 0 | 0
walsender | relation | normal | 9 | 73728
walsender | relation | vacuum | 0 | 0
walsender | temp relation | normal | 0 | 0
walsender | wal | init | |
walsender | wal | normal | 98 | 793856
(8 rows)
Now we can see that the numbers have increased for the relation object and that we
get non-zero numbers for the wal object too (which makes full sense).
With the attached patch applied, we would already get the same numbers in
step 4 (meaning the stats are flushed without having to wait for the walsender
to exit).
Remarks:
R1. The extra flushes are done in WalSndLoop(): I believe this is the right place
for them.
R2. A test is added in 035_standby_logical_decoding.pl: while this TAP test
is already racy ([2]), it looks like a good place since we don't want pg_recvlogical
to stop/exit.
R3. The test can also be back-patched down to 16_STABLE, as 035_standby_logical_decoding.pl
was introduced in 16 (and so was pg_stat_io).
R4. The test fails if the extra flushes are not done, which makes full
sense.
[1]: /messages/by-id/Z3zqc4o09dM/Ezyz@ip-10-97-1-34.eu-west-3.compute.internal
[2]: /messages/by-id/Z6oRgmD8m7zBo732@ip-10-97-1-34.eu-west-3.compute.internal
Looking forward to your feedback,
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v1-0001-Flush-the-IO-statistics-of-active-walsenders.patch (text/x-diff)
On Tue, Feb 25, 2025 at 01:42:08PM +0000, Bertrand Drouvot wrote:
Now we can see that the numbers have increased for the relation object and that we
get non-zero numbers for the wal object too (which makes full sense).

With the attached patch applied, we would already get the same numbers in
step 4 (meaning the stats are flushed without having to wait for the walsender
to exit).
@@ -2793,6 +2794,12 @@ WalSndLoop(WalSndSendDataCallback send_data)
if (pq_flush_if_writable() != 0)
WalSndShutdown();
+ /*
+ * Report IO statistics
+ */
+ pgstat_flush_io(false);
+ (void) pgstat_flush_backend(false, PGSTAT_BACKEND_FLUSH_IO);
+
/* If nothing remains to be sent right now ... */
if (WalSndCaughtUp && !pq_is_send_pending())
{
That's bad, worse for a logical WAL sender, because it means that we
have no idea what kind of I/O happens in this process until it exits,
and logical WAL senders could loop forever, since v16 where we've
begun tracking I/O.
A non-forced periodic flush like you are proposing here sounds OK to
me, but the flush could be positioned better in the
loop. If there is a SIGUSR2 (aka got_SIGUSR2 is true), WAL senders
would shut down, so it seems rather pointless to do a flush just
before exiting the process in WalSndDone(), no? I'd suggest moving
the flush attempt closer to where we wait for some activity, just
after WalSndKeepaliveIfNecessary().
--
Michael
Hi,
On Wed, Feb 26, 2025 at 03:37:10PM +0900, Michael Paquier wrote:
On Tue, Feb 25, 2025 at 01:42:08PM +0000, Bertrand Drouvot wrote:
Now we can see that the numbers have increased for the relation object and that we
get non-zero numbers for the wal object too (which makes full sense).

With the attached patch applied, we would already get the same numbers in
step 4 (meaning the stats are flushed without having to wait for the walsender
to exit).

@@ -2793,6 +2794,12 @@ WalSndLoop(WalSndSendDataCallback send_data)
 if (pq_flush_if_writable() != 0)
 WalSndShutdown();
+ /*
+ * Report IO statistics
+ */
+ pgstat_flush_io(false);
+ (void) pgstat_flush_backend(false, PGSTAT_BACKEND_FLUSH_IO);
+
 /* If nothing remains to be sent right now ... */
 if (WalSndCaughtUp && !pq_is_send_pending())
 {

That's bad, worse for a logical WAL sender, because it means that we
have no idea what kind of I/O happens in this process until it exits,
and logical WAL senders could loop forever, since v16 where we've
begun tracking I/O.
Yeah... And while the example shared up-thread is related to a logical walsender,
the same issue exists for a physical walsender.
OTOH, it's also great to see that the new stats that have been added (the WAL
ones) helped to spot the issue.
A non-forced periodic flush like you are proposing here sounds OK to
me,
Thanks for looking at it!
but the flush could be positioned better in the
loop. If there is a SIGUSR2 (aka got_SIGUSR2 is true), WAL senders
would shut down,
That's true for a physical walsender but I'm not sure it is for a logical
walsender (due to the "sentPtr == replicatedPtr" check in WalSndDone()).
so it seems rather pointless to do a flush just
before exiting the process in WalSndDone(), no? I'd suggest moving
the flush attempt closer to where we wait for some activity, just
after WalSndKeepaliveIfNecessary().
Yeah I think that makes sense, done that way in the attached.
Speaking about physical walsender, I moved the test to 001_stream_rep.pl instead
(would also fail without the fix).
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v2-0001-Flush-the-IO-statistics-of-active-walsenders.patch (text/x-diff)
Hi,
On 2025-02-26 15:37:10 +0900, Michael Paquier wrote:
That's bad, worse for a logical WAL sender, because it means that we
have no idea what kind of I/O happens in this process until it exits,
and logical WAL senders could loop forever, since v16 where we've
begun tracking I/O.
FWIW, I think medium term we need to work on splitting stats flushing into two
separate kinds of flushes:
1) non-transactional stats, which should be flushed at a regular interval,
unless a process is completely idle
2) transaction stats, which can only be flushed at transaction boundaries,
because before the transaction boundary we don't know if e.g. newly
inserted rows should be counted as live or dead
So far we have some timer logic for 2), but we have basically no support for
1). Which means we have weird ad-hoc logic in various kinds of
non-plain-connection processes. And that will often have holes, as Bertrand
noticed here.
I think it's also bad that we don't have a solution for 1), even just for
normal connections. If a backend causes a lot of IO we might want to know
about that long before the long-running transaction commits.
I suspect the right design here would be to have a generalized form of the
timeout mechanism we have for 2).
For that we'd need to make sure that pgstat_report_stat() can be safely called
inside a transaction. The second part would be to redesign the
IdleStatsUpdateTimeoutPending mechanism so it is triggered independent of
idleness, without introducing unacceptable overhead - I think that's doable.
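
To illustrate with entirely hypothetical names (none of this exists today), the
non-transactional side could end up being a small, transaction-safe flush routine
that a timer can drive in any process, while transactional stats keep waiting for
the boundary; only the pgstat_flush_io()/pgstat_flush_backend() calls below are
existing functions:

static void
pgstat_periodic_flush_hypothetical(void)
{
	/* Safe even inside a transaction: flush only non-transactional kinds */
	pgstat_flush_io(false);
	(void) pgstat_flush_backend(false, PGSTAT_BACKEND_FLUSH_IO);

	/*
	 * Transactional kinds (e.g. per-relation live/dead tuple counts) stay
	 * pending until pgstat_report_stat() runs at the transaction boundary.
	 */
}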
Greetings,
Andres Freund
On Wed, Feb 26, 2025 at 05:08:17AM -0500, Andres Freund wrote:
I think it's also bad that we don't have a solution for 1), even just for
normal connections. If a backend causes a lot of IO we might want to know
about that long before the long-running transaction commits.

I suspect the right design here would be to have a generalized form of the
timeout mechanism we have for 2).

For that we'd need to make sure that pgstat_report_stat() can be safely called
inside a transaction. The second part would be to redesign the
IdleStatsUpdateTimeoutPending mechanism so it is triggered independent of
idleness, without introducing unacceptable overhead - I think that's doable.
Agreed that something along the lines of non-transactional updates of the
entries could be adapted in some cases, so +1 for the idea. I suspect
that there would be cases where a single stats kind should be able to
handle both transactional and non-transactional flush cases.
Splitting that across two stats kinds would lead to a lot of
duplication.
--
Michael
Hi,
On Wed, Feb 26, 2025 at 05:08:17AM -0500, Andres Freund wrote:
Hi,
On 2025-02-26 15:37:10 +0900, Michael Paquier wrote:
That's bad, worse for a logical WAL sender, because it means that we
have no idea what kind of I/O happens in this process until it exits,
and logical WAL senders could loop forever, since v16 where we've
begun tracking I/O.

FWIW, I think medium term we need to work on splitting stats flushing into two
separate kinds of flushes:
1) non-transactional stats, which should be flushed at a regular interval,
unless a process is completely idle
2) transaction stats, which can only be flushed at transaction boundaries,
because before the transaction boundary we don't know if e.g. newly
inserted rows should be counted as live or dead

So far we have some timer logic for 2), but we have basically no support for
1). Which means we have weird ad-hoc logic in various kinds of
non-plain-connection processes. And that will often have holes, as Bertrand
noticed here.
Thanks for sharing your thoughts on it!
Yeah, I agree that's a good medium-term idea to avoid missing some stats
flushes.
I think it's also bad that we don't have a solution for 1), even just for
normal connections. If a backend causes a lot of IO we might want to know
about that long before the long-running transaction commits.
+++1 There is no need to wait for the transaction boundary for some stats, as
those would be more "useful"/"actionable" if flushed while the transaction is
in progress.
I suspect the right design here would be to have a generalized form of the
timeout mechanism we have for 2).

For that we'd need to make sure that pgstat_report_stat() can be safely called
inside a transaction. The second part would be to redesign the
IdleStatsUpdateTimeoutPending mechanism so it is triggered independent of
idleness, without introducing unacceptable overhead - I think that's doable.
Adding this to my bucket unless someone beats me to it.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On 2025-02-27 12:09:46 +0900, Michael Paquier wrote:
I suspect that there would be cases where a single stats kind should be able
to handle both transactional and non-transactional flush cases. Splitting
that across two stats kinds would lead to a lot of duplication.
Agreed. Table stats will be the prime case where that's important.
On Wed, Feb 26, 2025 at 09:48:50AM +0000, Bertrand Drouvot wrote:
Yeah I think that makes sense, done that way in the attached.
Speaking about physical walsender, I moved the test to 001_stream_rep.pl instead
(would also fail without the fix).
Hmm. I was doing some more checks with this patch, and on closer look
I am wondering if the location you have chosen for the stats reports
is too aggressive: this requires a LWLock for the WAL sender backend
type taken in exclusive mode, with each step of WalSndLoop() taken
roughly each time a record or a batch of records is sent. A single
installcheck with a primary/standby setup can lead to up to 50k stats
report calls.
With smaller records, the loop can become hotter, can't it? Also,
there can be a high number of WAL senders on a single node, and I've
heard of some customers with complex logical decoding deployments with
dozens of logical WAL senders. Isn't there a risk of having this code
path become a point of contention? It seems to me that we should
benchmark this change more carefully, perhaps even reduce the
frequency of the report calls.
--
Michael
On Fri, Feb 28, 2025 at 02:41:34PM +0900, Michael Paquier wrote:
With smaller records, the loop can become hotter, can't it? Also,
there can be a high number of WAL senders on a single node, and I've
heard of some customers with complex logical decoding deployments with
dozens of logical WAL senders. Isn't there a risk of having this code
path become a point of contention? It seems to me that we should
benchmark this change more carefully, perhaps even reduce the
frequency of the report calls.
One idea here would be to have on a single host one server with a set
of N pg_receivewal processes dumping their WAL segments into a tmpfs,
while a single session generates a bunch of records with a minimal
size using pg_logical_emit_message(). Monitoring the maximum
replication lag with pg_stat_replication and looking at some perf
profiles of the cluster should show how these stats reports affect the
replication setup efficiency.
--
Michael
Hi,
On Fri, Feb 28, 2025 at 02:41:34PM +0900, Michael Paquier wrote:
On Wed, Feb 26, 2025 at 09:48:50AM +0000, Bertrand Drouvot wrote:
Yeah I think that makes sense, done that way in the attached.
Speaking about physical walsender, I moved the test to 001_stream_rep.pl instead
(would also fail without the fix).

Hmm. I was doing some more checks with this patch, and on closer look
I am wondering if the location you have chosen for the stats reports
is too aggressive: this requires a LWLock for the WAL sender backend
type taken in exclusive mode, with each step of WalSndLoop() taken
roughly each time a record or a batch of records is sent. A single
installcheck with a primary/standby setup can lead to up to 50k stats
report calls.
Yeah, what I can observe (installcheck with a primary/standby):
- the extra flushes are called about 70K times.
Among those 70K:
- about 575 get past the "have_iostats" check in pgstat_io_flush_cb()
- about 575 get past the PendingBackendStats pg_memory_is_all_zeros()
check in pgstat_flush_backend() (which makes sense given the above)
- about 575 get past the PendingBackendStats.pending_io pg_memory_is_all_zeros()
check in pgstat_flush_backend_entry_io() (which makes sense given the above)
It means that only very few of them are "really" flushing IO stats.
With smaller records, the loop can become hotter, can't it? Also,
there can be a high number of WAL senders on a single node, and I've
heard of some customers with complex logical decoding deployments with
dozens of logical WAL senders. Isn't there a risk of having this code
path become a point of contention? It seems to me that we should
benchmark this change more carefully, perhaps even reduce the
frequency of the report calls.
That sounds like a good idea to measure the impact of those extra calls and see
if we'd need to mitigate the impacts. I'll collect some data.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Fri, Feb 28, 2025 at 10:39:31AM +0000, Bertrand Drouvot wrote:
That sounds like a good idea to measure the impact of those extra calls and see
if we'd need to mitigate the impacts. I'll collect some data.
Thanks.
--
Michael
Hi,
On Mon, Mar 03, 2025 at 10:51:19AM +0900, Michael Paquier wrote:
On Fri, Feb 28, 2025 at 10:39:31AM +0000, Bertrand Drouvot wrote:
That sounds like a good idea to measure the impact of those extra calls and see
if we'd need to mitigate the impacts. I'll collect some data.
So I did some tests using only one walsender (given the fact that the extra
lock you mentioned above is "only" for this particular backend).
=== Test with pg_receivewal
I was using one pg_receivewal process and did some tests this way:
pgbench -n -c8 -j8 -T60 -f <(echo "SELECT pg_logical_emit_message(true, 'test', repeat('0', 1));";)
I did not measure any noticeable extra lag (I did measure the time it took
for pg_size_pretty(sent_lsn - write_lsn) from pg_stat_replication to be back
to zero).
During the pgbench run a "perf record --call-graph fp -p <walsender_pid>" would
report (perf report -n):
1. pgstat_flush_backend() appears at about 3%
2. pg_memory_is_all_zeros() at about 2.8%
3. pgstat_flush_io() at about 0.4%
So it does not look like what we're adding here can be seen as a primary bottleneck.
That said, it looks like there is room for improvement in pgstat_flush_backend()
and that relying on a "have_iostats"-like variable would be better than those
pg_memory_is_all_zeros() calls.
That's done in 0001 attached; by doing so, pgstat_flush_backend() now appears at
about 0.2%.
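
For reference, here is a minimal sketch of what that could look like in
pgstat_backend.c (the exact hooks are an assumption on my side; the attached
0001 is the authoritative version):

static bool backend_has_iostats = false;

/*
 * The IO counting paths (pgstat_count_backend_io_op() and friends) would set
 * backend_has_iostats to true, and the flush of the pending backend IO entry
 * would reset it to false once the counters have been pushed to shared stats.
 */

bool
pgstat_backend_have_pending_cb(void)
{
	/* Cheap flag check instead of pg_memory_is_all_zeros() on the struct */
	return backend_has_iostats;
}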
=== Test with pg_recvlogical
Now it does not look like pg_receivewal had a lot of IO stats to report (looking at
pg_stat_get_backend_io() output for the walsender).
Doing the same test with "pg_recvlogical -d postgres -S logical_slot -f /dev/null --start"
reports much more IO stats.
What I observe without the "have_iostats" optimization is:
1. I did not measure any noticeable extra lag
2. pgstat_flush_io() at about 5.5% (pgstat_io_flush_cb() at about 5.3%)
3. pgstat_flush_backend() at about 4.8%
and with the "have_iostats" optimization I now see pgstat_flush_backend() at
about 2.51%.
So it does not look like what we're adding here can be seen as a primary bottleneck
but it is probably worth implementing the "have_iostats" optimization attached.
Also, while I did not measure any noticeable extra lag, given the fact that
pgstat_flush_io() shows at about 5.5% and pgstat_flush_backend() at about 2.5%,
it could still make sense to reduce the frequency of the flush calls. Thoughts?
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Mon, Mar 03, 2025 at 11:54:39AM +0000, Bertrand Drouvot wrote:
So it does not look like what we're adding here can be seen as a primary bottleneck
but it is probably worth implementing the "have_iostats" optimization attached.

Also, while I did not measure any noticeable extra lag, given the fact that
pgstat_flush_io() shows at about 5.5% and pgstat_flush_backend() at about 2.5%,
it could still make sense to reduce the frequency of the flush calls. Thoughts?
Okay. The number of reports as proposed in the patch really worries me
anyway. It's not really complicated to see dozens of pgstats calls in
a single millisecond in the WAL sender, even with some of the
regression tests. That's more present in the logirep scenarios, it
seems.
For example, here are some numbers based on some elog() calls planted
in walsender.c, looking at the logs of the TAP tests. Through a
single check-world, I am able to gather the following information:
- The location proposed by the patch leads to 680k stats reports,
in WalSndLoop(), because of the "Block if we have unsent data" part. Some
tests show quite some contention.
tests show quite some contention.
- A different location, in WalSndLoop() just before WalSndWait(), leads
to 59k calls. And there's some contention in the logirep tests.
- I've spotted a third candidate which looks pretty solid, actually:
WalSndWaitForWal() before WalSndWait(). This leads to 2.9k reports in
the whole test suite, with much less contention in the reports. These
can still be rather frequent, up to ~50 calls per second, but that's
really less than the two others.
Stats data is useful as long as it is possible to get an idea of how
the system behaves, particularly with a steady workload. More
frequent reports are useful for spikey data detection, showing more
noise. Still, too many reports may cause the part gathering the
reports to become a bottleneck, while we want it to offer hints about
bottlenecks. So I would argue in favor of a more conservative choice
in the back branches than what the patch is proposing. Choice 3 I'm
quoting above is tempting by design: not too much, still frequent
enough to offer enough relevant information in the stats.
--
Michael
Hi,
On Mon, Mar 10, 2025 at 09:23:50AM +0900, Michael Paquier wrote:
On Mon, Mar 03, 2025 at 11:54:39AM +0000, Bertrand Drouvot wrote:
So it does not look like what we're adding here can be seen as a primary bottleneck
but it is probably worth implementing the "have_iostats" optimization attached.

Also, while I did not measure any noticeable extra lag, given the fact that
pgstat_flush_io() shows at about 5.5% and pgstat_flush_backend() at about 2.5%,
it could still make sense to reduce the frequency of the flush calls. Thoughts?

- I've spotted a third candidate which looks pretty solid, actually:
WalSndWaitForWal() before WalSndWait(). This leads to 2.9k reports in
the whole test suite, with much less contention in the reports. These
can still be rather frequent, up to ~50 calls per second, but that's
really less than the two others.

Stats data is useful as long as it is possible to get an idea of how
the system behaves, particularly with a steady workload. More
frequent reports are useful for spikey data detection, showing more
noise. Still, too many reports may cause the part gathering the
reports to become a bottleneck, while we want it to offer hints about
bottlenecks. So I would argue in favor of a more conservative choice
in the back branches than what the patch is proposing.

Yeah, fully agree. Anyway, having such frequent stats reports makes little sense.

Choice 3 I'm
quoting above is tempting by design: not too much, still frequent
enough to offer enough relevant information in the stats.
Yeah, I also agree that we should reduce the number of "reports". I'll look
at the third option.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,
On Mon, Mar 10, 2025 at 02:52:42PM +0000, Bertrand Drouvot wrote:
Hi,
On Mon, Mar 10, 2025 at 09:23:50AM +0900, Michael Paquier wrote:
On Mon, Mar 03, 2025 at 11:54:39AM +0000, Bertrand Drouvot wrote:
So it does not look like what we're adding here can be seen as a primary bottleneck
but it is probably worth implementing the "have_iostats" optimization attached.

Also, while I did not measure any noticeable extra lag, given the fact that
pgstat_flush_io() shows at about 5.5% and pgstat_flush_backend() at about 2.5%,
it could still make sense to reduce the frequency of the flush calls. Thoughts?

- I've spotted a third candidate which looks pretty solid, actually:
WalSndWaitForWal() before WalSndWait(). This leads to 2.9k reports in
the whole test suite, with much less contention in the reports. These
can still be rather frequent, up to ~50 calls per second, but that's
really less than the two others.

Stats data is useful as long as it is possible to get an idea of how
the system behaves, particularly with a steady workload. More
frequent reports are useful for spikey data detection, showing more
noise. Still, too many reports may cause the part gathering the
reports to become a bottleneck, while we want it to offer hints about
bottlenecks. So I would argue in favor of a more conservative choice
in the back branches than what the patch is proposing.

Yeah, fully agree. Anyway, having such frequent stats reports makes little sense.

Choice 3 I'm
quoting above is tempting by design: not too much, still frequent
enough to offer enough relevant information in the stats.

Yeah, I also agree that we should reduce the number of "reports". I'll look
at the third option.
WalSndWaitForWal() is used only by logical walsenders, so we'd need to
find another location for the physical walsender case. One option is to keep the
WalSndLoop() location and control the frequency of the reports.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,
On Tue, Mar 11, 2025 at 06:10:32AM +0000, Bertrand Drouvot wrote:
WalSndWaitForWal() is used only by logical walsenders, so we'd need to
find another location for the physical walsender case. One option is to keep the
WalSndLoop() location and control the frequency of the reports.
I ended up with an idea in the same vein as your third one, but let's do it
in WalSndLoop() instead, for both logical and physical walsenders. The idea
is to report the stats when the walsender is caught up or has pending data to
send, and to flush only periodically (based on PGSTAT_MIN_INTERVAL).
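
To make the idea concrete, here is a rough sketch of the shape it could take in
WalSndLoop() (an illustration of the approach, not the attached patch verbatim;
it assumes PGSTAT_MIN_INTERVAL, currently private to pgstat.c, is made visible
to the walsender):

	TimestampTz last_flush = 0;	/* before entering the main loop */
	...
	/*
	 * Report IO statistics, at most once per PGSTAT_MIN_INTERVAL, reusing
	 * the "now" timestamp already computed in the loop (see remark 2 below),
	 * so no extra GetCurrentTimestamp() call is needed.
	 */
	if (TimestampDifferenceExceeds(last_flush, now, PGSTAT_MIN_INTERVAL))
	{
		pgstat_flush_io(false);
		(void) pgstat_flush_backend(false, PGSTAT_BACKEND_FLUSH_IO);
		last_flush = now;
	}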
Remarks/Questions:
1. maybe relying on PGSTAT_IDLE_INTERVAL would make more sense? In both cases
(PGSTAT_MIN_INTERVAL or PGSTAT_IDLE_INTERVAL), I'm not sure there is a need to
update the related doc.
2. a bonus point of doing it here is that we don't need an extra call to
GetCurrentTimestamp(): we simply reuse the one that is already assigned to "now".
3. I had to use poll_query_until() as I've seen failures on the CI with this
new approach. I also moved the test far away from the stats reset to minimize
the risk of polling for too long.
With the attached in place, the number of times the stats are flushed is reduced
drastically but still seems enough to get an accurate idea of the walsender
IO activity. Thoughts?
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v4-0001-Flush-the-IO-statistics-of-active-walsenders.patch (text/x-diff)
Forgot to cc...
---------- Forwarded message ---------
From: Xuneng Zhou <xunengzhou@gmail.com>
Date: Thursday, March 13, 2025, 19:15
Subject: Re: [BUG]: the walsender does not update its IO statistics until
it exits
To: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Hi,
Thanks for working on this! I'm glad to see that the patch (
/messages/by-id/Z3zqc4o09dM/Ezyz@ip-10-97-1-34.eu-west-3.compute.internal)
has been committed.
Regarding patch 0001, the optimization in pgstat_backend_have_pending_cb
looks good:
bool
pgstat_backend_have_pending_cb(void)
{
- return (!pg_memory_is_all_zeros(&PendingBackendStats,
- sizeof(struct PgStat_BackendPending)));
+ return backend_has_iostats;
}
Additionally, the function pgstat_flush_backend includes the check:
+ if (!pgstat_backend_have_pending_cb())
return false;
However, I think we might need to revise the comment (and possibly the
function name) for clarity:
/*
* Check if there are any backend stats waiting to be flushed.
*/
Originally, this function was intended to check multiple types of backend
statistics, which made sense when PendingBackendStats was the centralized
structure for various pending backend stats. However, since
PgStat_PendingWalStats was removed from PendingBackendStats earlier, and
now this patch introduces the backend_has_iostats variable, the scope of
this function appears even narrower. This narrowed functionality no longer
aligns closely with the original function name and its associated comment.
Sorry, forgot to cc this one too.
---------- Forwarded message ---------
From: Xuneng Zhou <xunengzhou@gmail.com>
Date: Thursday, March 13, 2025, 19:29
Subject: Re: [BUG]: the walsender does not update its IO statistics until
it exits
To: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Hi,
Given that PGSTAT_MIN_INTERVAL is the go-to setting for stats flushes
* Unless called with 'force', pending stats updates are flushed happen once
* per PGSTAT_MIN_INTERVAL (1000ms). When not forced, stats flushes do not
* block on lock acquisition, except if stats updates have been pending for
* longer than PGSTAT_MAX_INTERVAL (60000ms).
*
* Whenever pending stats updates remain at the end of pgstat_report_stat() a
* suggested idle timeout is returned. Currently this is always
* PGSTAT_IDLE_INTERVAL (10000ms). Callers can use the returned time to set up
* a timeout after which to call pgstat_report_stat(true), but are not
* required to do so.
I am curious about the reason for this:
1. maybe relying on PGSTAT_IDLE_INTERVAL would make more sense? In both cases
(PGSTAT_MIN_INTERVAL or PGSTAT_IDLE_INTERVAL), I'm not sure there is a need to
update the related doc.
PGSTAT_IDLE_INTERVAL seems to reduce the frequency even more.
Hi,
On Thu, Mar 13, 2025 at 07:33:24PM +0800, Xuneng Zhou wrote:
Regarding patch 0001, the optimization in pgstat_backend_have_pending_cb
looks good:
Thanks for looking at it!
bool
pgstat_backend_have_pending_cb(void)
{
- return (!pg_memory_is_all_zeros(&PendingBackendStats,
- sizeof(struct PgStat_BackendPending)));
+ return backend_has_iostats;
}

Additionally, the function pgstat_flush_backend includes the check:

+ if (!pgstat_backend_have_pending_cb())
 return false;

However, I think we might need to revise the comment (and possibly the
function name) for clarity:

/*
 * Check if there are any backend stats waiting to be flushed.
 */
The comment is not exactly this one on current HEAD; it looks like you're
looking at a previous version of the core code.
Originally, this function was intended to check multiple types of backend
statistics, which made sense when PendingBackendStats was the centralized
structure for various pending backend stats. However, since
PgStat_PendingWalStats was removed from PendingBackendStats earlier, and
now this patch introduces the backend_has_iostats variable, the scope of
this function appears even narrower. This narrowed functionality no longer
aligns closely with the original function name and its associated comment.
I don't think so, as since 76def4cdd7c, a pgstat_backend_wal_have_pending() check
has been added to pgstat_backend_have_pending_cb(). You're probably looking at
a version prior to 76def4cdd7c.
This particular sub-patch needs a rebase though, done in the attached. 0001
remains unchanged as compared to the v4 one just shared up-thread. If 0001 goes
in, merging 0002 would be less beneficial (as compared to v3).
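
To give an idea, on current HEAD the rebased callback would essentially combine
both checks, something like (the attached patch is the authoritative version):

bool
pgstat_backend_have_pending_cb(void)
{
	/* IO flag from 0001, plus the pending-WAL check added by 76def4cdd7c */
	return (backend_has_iostats || pgstat_backend_wal_have_pending());
}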
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,
On Thu, Mar 13, 2025 at 07:35:24PM +0800, Xuneng Zhou wrote:
Hi,
Given that PGSTAT_MIN_INTERVAL is the go-to setting for stats flushes
* Unless called with 'force', pending stats updates are flushed happen once
* per PGSTAT_MIN_INTERVAL (1000ms). When not forced, stats flushes do not
* block on lock acquisition, except if stats updates have been pending for
* longer than PGSTAT_MAX_INTERVAL (60000ms).
*
* Whenever pending stats updates remain at the end of pgstat_report_stat() a
* suggested idle timeout is returned. Currently this is always
* PGSTAT_IDLE_INTERVAL (10000ms). Callers can use the returned time to set up
* a timeout after which to call pgstat_report_stat(true), but are not
* required to do so.

I am curious about the reason for this:
Thanks for looking at it!
1. maybe relying on PGSTAT_IDLE_INTERVAL would make more sense? In both cases
(PGSTAT_MIN_INTERVAL or PGSTAT_IDLE_INTERVAL), I'm not sure there is a need to
update the related doc.

PGSTAT_IDLE_INTERVAL seems to reduce the frequency even more.
Yeah, I think that PGSTAT_MIN_INTERVAL is the one to use (that's why that's the
one the patch is using). I just mentioned PGSTAT_IDLE_INTERVAL as an open door
for conversation in case one prefers a "larger" frequency.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com