Monitoring gaps in XLogWalRcvWrite() for the WAL receiver
Hi all,
While doing some monitoring of a replication setup for a stable
branch, I have been surprised by the fact that we have never tracked
WAL statistics for the WAL receiver in pg_stat_wal because we have
never bothered to update its code so that WAL stats are reported. This
is relevant for the write and sync counts and timings. On HEAD, this
information has been moved to pg_stat_io, but the stats reports happen
with the same routine (pgstat_report_wal in 15~, and pgstat_send_wal
in ~14).
As of f4694e0f35b2, the situation is better thanks to the addition of
a pgstat_report_wal() in the WAL receiver main loop, so we have some
data. However, we are only able to gather the data for segment syncs
and initializations, not the writes themselves as these are managed by
an independent code path, XLogWalRcvWrite().
A second thing missing in XLogWalRcvWrite() is a wait event around
the pg_pwrite() call, which is useful as the WAL receiver is listed in
pg_stat_activity. Note that it is possible to re-use the same wait
event as XLogWrite() for the WAL receiver, WAL_WRITE, because the WAL
receiver does not rely on the write and flush calls from xlog.c when
doing its work, and both have the same meaning, aka they write WAL.
The fsync calls use issue_xlog_fsync() and the segment inits happen in
XLogFileInit().
Perhaps there's a point in backpatching a portion of what's in the
attached patch (the wait event?), but I am not planning to bother much
with the stable branches based on the lack of complaints. If you
have an opinion about that, please feel free to share it.
Thoughts?
--
Michael
Attachments:
0001-Report-WAL-writes-in-stats-of-the-WAL-receiver.patch (text/x-diff)
From 31baeaf0b0b3291a9edafa76f59f40110d57d00b Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Wed, 5 Mar 2025 12:35:01 +0900
Subject: [PATCH] Report WAL writes in stats of the WAL receiver
These have never been counted since pg_stat_wal exists, as these writes
happen in a code path of their own. Now these are moved to pg_stat_io.
A wait event is added around the pg_pwrite() call in charge of the
writes, using the existing WAIT_EVENT_WAL_WRITE.
---
src/backend/replication/walreceiver.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 83129cb92afe..2e5dd6deb2c2 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -922,6 +922,7 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr, TimeLineID tli)
 {
 	int			startoff;
 	int			byteswritten;
+	instr_time	start;
 
 	Assert(tli != 0);
 
@@ -952,7 +953,18 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr, TimeLineID tli)
 		/* OK to write the logs */
 		errno = 0;
 
+		/*
+		 * Measure I/O timing to write WAL data, for pg_stat_io.
+		 */
+		start = pgstat_prepare_io_time(track_wal_io_timing);
+
+		pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
 		byteswritten = pg_pwrite(recvFile, buf, segbytes, (off_t) startoff);
+		pgstat_report_wait_end();
+
+		pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL,
+								IOOP_WRITE, start, 1, byteswritten);
+
 		if (byteswritten <= 0)
 		{
 			char		xlogfname[MAXFNAMELEN];
--
2.47.2
Hi,
On Wed, Mar 05, 2025 at 12:35:26PM +0900, Michael Paquier wrote:
Hi all,
While doing some monitoring of a replication setup for a stable
branch, I have been surprised by the fact that we have never tracked
WAL statistics for the WAL receiver in pg_stat_wal because we have
never bothered to update its code so that WAL stats are reported.
Nice catch!
This
is relevant for the write and sync counts and timings.
Also for sync? sync looks fine as issue_xlog_fsync() is being called in
XLogWalRcvFlush(), no?
As of f4694e0f35b2, the situation is better thanks to the addition of
a pgstat_report_wal() in the WAL receiver main loop, so we have some
data. However, we are only able to gather the data for segment syncs
and initializations, not the writes themselves as these are managed by
an independent code path, XLogWalRcvWrite().

A second thing missing in XLogWalRcvWrite() is a wait event around
the pg_pwrite() call, which is useful as the WAL receiver is listed in
pg_stat_activity. Note that it is possible to re-use the same wait
event as XLogWrite() for the WAL receiver, WAL_WRITE, because the WAL
receiver does not rely on the write and flush calls from xlog.c when
doing its work, and both have the same meaning, aka they write WAL.
The fsync calls use issue_xlog_fsync() and the segment inits happen in
XLogFileInit().

Perhaps there's a point in backpatching a portion of what's in the
attached patch (the wait event?), but I am not planning to bother much
with the stable branches based on the lack of complaints.
We're not emitting some statistics, so I think that it's hard for users to
complain about something they don't/can't see.
If you
have an opinion about that, please feel free to share it.
I'm tempted to say that the wal receiver part of f4694e0f35b2 should be
backpatched as well as what you're doing here.
+ /*
+ * Measure I/O timing to write WAL data, for pg_stat_io.
+ */
+ start = pgstat_prepare_io_time(track_wal_io_timing);
+
+ pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
byteswritten = pg_pwrite(recvFile, buf, segbytes, (off_t) startoff);
+ pgstat_report_wait_end();
Same logic as in XLogWrite() and I don't think there is a need for a
dedicated wait event, so LGTM.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Wed, Mar 05, 2025 at 08:04:44AM +0000, Bertrand Drouvot wrote:
On Wed, Mar 05, 2025 at 12:35:26PM +0900, Michael Paquier wrote:
Also for sync? sync looks fine as issue_xlog_fsync() is being called in
XLogWalRcvFlush(), no?
Yes, we're OK for the sync data aggregated in the WAL receiver on
HEAD, as mentioned below, not in the back-branches.
We're not emitting some statistics, so I think that it's hard for users to
complain about something they don't/can't see.
One would see idle data in pg_stat_wal on a standby, so the lack of
data could be annoying, but I'm perhaps the only one who noticed
that..
Same logic as in XLogWrite() and I don't think there is a need for a
dedicated wait event, so LGTM.
Thanks.
--
Michael
On Wed, Mar 05, 2025 at 08:04:44AM +0000, Bertrand Drouvot wrote:
On Wed, Mar 05, 2025 at 12:35:26PM +0900, Michael Paquier wrote:
Perhaps there's a point in backpatching a portion of what's in the
attached patch (the wait event?), but I am not planning to bother much
with the stable branches based on the lack of complaints.

We're not emitting some statistics, so I think that it's hard for users to
complain about something they don't/can't see.
Hmm, not exactly actually. I've missed that ff99918c625a mentions
that WAL receiver was discarded on purpose, but this was still when
pgstats was not in shared memory and still file-based. We scale much
better now. It is not that difficult to make XLogWrite() hot enough
that it would trigger a lot of calls of pgstat_count_io_op_time() per
ms, either, like the WAL receiver, so as long as the timings are
behind track_wal_io_timing we're fine.
Let's do this at the end, without a backpatch. At least we'll be able
to get better IO metrics for replication setups.
--
Michael
Hi,
On Thu, Mar 06, 2025 at 10:12:37AM +0900, Michael Paquier wrote:
On Wed, Mar 05, 2025 at 08:04:44AM +0000, Bertrand Drouvot wrote:
On Wed, Mar 05, 2025 at 12:35:26PM +0900, Michael Paquier wrote:
Perhaps there's a point in backpatching a portion of what's in the
attached patch (the wait event?), but I am not planning to bother much
with the stable branches based on the lack of complaints.

We're not emitting some statistics, so I think that it's hard for users to
complain about something they don't/can't see.

Hmm, not exactly actually. I've missed that ff99918c625a mentions
that WAL receiver was discarded on purpose, but this was still when
pgstats was not in shared memory and still file-based. We scale much
better now. It is not that difficult to make XLogWrite() hot enough
that it would trigger a lot of calls of pgstat_count_io_op_time() per
ms, either, like the WAL receiver, so as long as the timings are
behind track_wal_io_timing we're fine.

Let's do this at the end, without a backpatch. At least we'll be able
to get better IO metrics for replication setups.
Good catch about the comment in ff99918c625a, so yeah I think it makes sense
to not backpatch then.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com