Show WAL write and fsync stats in pg_stat_io

Started by Nazir Bilal Yavuz over 2 years ago, 94 messages

byavuz81@gmail.com

over 2 years ago

1 attachment(s)

Hi,

This is a WIP patch to add WAL write and fsync stats to the pg_stat_io
view. There is a track_io_timing variable to control pg_stat_io
timings and a track_wal_io_timing variable to control WAL timings. I
couldn't decide which of them should enable WAL timings in pg_stat_io.
For now, both track_io_timing and track_wal_io_timing need to be
enabled to track WAL timings in pg_stat_io.

Also, if you compare the WAL stats in pg_stat_wal and pg_stat_io, you
may see differences. These differences are caused by the
background writer's WAL stats not being flushed. Because of that, the
background writer's WAL stats are seen in pg_stat_io but not in
pg_stat_wal. I already sent a patch [1] to fix that.

[1]: /messages/by-id/CAN55FZ2FPYngovZstr=3w1KSEHe6toiZwrurbhspfkXe5UDocg@mail.gmail.com

Any kind of feedback would be appreciated.

Regards,
Nazir Bilal Yavuz
Microsoft

Attachments:

v1-0001-WIP-Show-WAL-write-and-fsync-stats-on-pg_stat_io.patch (application/octet-stream)

From eb557114161e865807e39a53a24d03f5f0abe4f5 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Wed, 28 Jun 2023 12:34:28 +0300
Subject: [PATCH v1] WIP: Show WAL write and fsync stats on pg_stat_io

This patch aims to show WAL write and fsync stats per backend in the
pg_stat_io view.
---
 src/backend/access/transam/xlog.c      | 8 ++++++++
 src/backend/utils/activity/pgstat_io.c | 9 ++++++++-
 src/include/pgstat.h                   | 3 ++-
 3 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8b0710abe6..2ee6c21398 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2207,6 +2207,10 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
 
 					INSTR_TIME_SET_CURRENT(duration);
 					INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_write_time, duration, start);
+					pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_WRITE, start, 1);
+				} else
+				{
+					pgstat_count_io_op_n(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_WRITE, 1);
 				}
 
 				PendingWalStats.wal_write++;
@@ -8233,6 +8237,10 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 
 		INSTR_TIME_SET_CURRENT(duration);
 		INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_sync_time, duration, start);
+		pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_FSYNC, start, 1);
+	} else
+	{
+		pgstat_count_io_op_n(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_FSYNC, 1);
 	}
 
 	PendingWalStats.wal_sync++;
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index eb7d35d422..e5cb3c7219 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -231,6 +231,8 @@ pgstat_get_io_object_name(IOObject io_object)
 			return "relation";
 		case IOOBJECT_TEMP_RELATION:
 			return "temp relation";
+		case IOOBJECT_WAL:
+			return "wal";
 	}
 
 	elog(ERROR, "unrecognized IOObject value: %d", io_object);
@@ -313,9 +315,9 @@ pgstat_tracks_io_bktype(BackendType bktype)
 		case B_ARCHIVER:
 		case B_LOGGER:
 		case B_WAL_RECEIVER:
-		case B_WAL_WRITER:
 			return false;
 
+		case B_WAL_WRITER:
 		case B_AUTOVAC_LAUNCHER:
 		case B_AUTOVAC_WORKER:
 		case B_BACKEND:
@@ -350,6 +352,11 @@ pgstat_tracks_io_object(BackendType bktype, IOObject io_object,
 	if (!pgstat_tracks_io_bktype(bktype))
 		return false;
 
+
+	if (io_context != IOCONTEXT_NORMAL &&
+		io_object == IOOBJECT_WAL)
+		return false;
+
 	/*
 	 * Currently, IO on temporary relations can only occur in the
 	 * IOCONTEXT_NORMAL IOContext.
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 57a2c0866a..fb7d3abf22 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -277,9 +277,10 @@ typedef enum IOObject
 {
 	IOOBJECT_RELATION,
 	IOOBJECT_TEMP_RELATION,
+	IOOBJECT_WAL,
 } IOObject;
 
-#define IOOBJECT_NUM_TYPES (IOOBJECT_TEMP_RELATION + 1)
+#define IOOBJECT_NUM_TYPES (IOOBJECT_WAL + 1)
 
 typedef enum IOContext
 {
-- 
2.25.1

Melanie Plageman

melanieplageman@gmail.com

over 2 years ago

In reply to: Nazir Bilal Yavuz (#1)

Re: Show WAL write and fsync stats in pg_stat_io

On Wed, Jun 28, 2023 at 6:09 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:

This is a WIP patch to add WAL write and fsync stats to pg_stat_io
view.

Thanks for working on this! I have some feedback on the content of the
patch as well as some items that I feel are missing.

I think it would be good to count WAL reads even though they are not
currently represented in pg_stat_wal. Here is a thread discussing this
[1].

Eventually, the docs will need an update as well. You can wait until a
later version of the patch to do this, but I would include it in a list
of the remaining TODOs in your next version.

I think we will also want to add an IOContext for WAL initialization.
Then we can track how long is spent doing WAL init (including filling
the WAL file with zeroes). XLogFileInitInternal() is likely where we
would want to add it. And op_bytes for this would likely be
wal_segment_size. I thought I heard about someone proposing adding WAL
init to pg_stat_wal, but I can't find the thread.

I think there is also an argument for counting WAL files recycled as
IOOP_REUSES. We should start thinking about how to interpret the
different IOOps within the two IOContexts and discussing what would be
useful to count. For example, should removing a logfile count as an
IOOP_EVICT? Maybe it is not directly related to "IO" enough or even an
interesting statistic, but we should think about what kinds of
IO-related WAL statistics we want to track.

Any that we decide not to count for now should be "banned" in
pgstat_tracks_io_op() for clarity. For example, if we create a separate
IOContext for WAL file init, I'm not sure what would count as an
IOOP_EXTEND in IOCONTEXT_NORMAL for IOOBJECT_WAL.

Also, I think there are some backend types which will not generate WAL
and we should determine which those are and skip those rows in
pgstat_tracks_io_object().
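
To illustrate the idea of explicitly "banning" combinations that are not counted, here is a minimal sketch. It is not the code in pgstat_io.c: the enums, the function name and the WAL-init context are hypothetical stand-ins, and the allowed sets are only an assumption, since the thread has not yet decided which operations to track.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in enums for this sketch only; not the definitions in pgstat.h. */
typedef enum { OBJ_RELATION, OBJ_TEMP_RELATION, OBJ_WAL } SketchIOObject;
typedef enum { CTX_BULKREAD, CTX_BULKWRITE, CTX_NORMAL, CTX_VACUUM,
			   CTX_WAL_INIT } SketchIOContext;
typedef enum { OP_EVICT, OP_EXTEND, OP_FSYNC, OP_READ, OP_REUSE,
			   OP_WRITE } SketchIOOp;

/*
 * Return false for WAL combinations that are deliberately not counted.
 * The allowed sets below are assumptions chosen for illustration.
 */
static bool
sketch_tracks_wal_io_op(SketchIOObject obj, SketchIOContext ctx, SketchIOOp op)
{
	assert((unsigned int) op <= OP_WRITE);

	if (obj != OBJ_WAL)
		return true;			/* non-WAL rules are out of scope here */

	switch (ctx)
	{
		case CTX_NORMAL:
			/* e.g. no "extend" of WAL in the normal context */
			return op == OP_READ || op == OP_WRITE || op == OP_FSYNC;
		case CTX_WAL_INIT:
			return op == OP_WRITE || op == OP_FSYNC;
		default:
			/* WAL IO is assumed not to occur in the other contexts */
			return false;
	}
}
```

Keeping the rules in one predicate means that a counting call for a banned combination can be caught by an assertion instead of silently producing a row that is always zero.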

diff --git a/src/backend/access/transam/xlog.c
b/src/backend/access/transam/xlog.c
index 8b0710abe6..2ee6c21398 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2207,6 +2207,10 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID
tli, bool flexible)

I think we should likely follow the pattern of using
pgstat_prepare_io_time() and pgstat_count_io_op_time() as it is done
elsewhere. You could pass the IOObject as a parameter to
pgstat_prepare_io_time() in order to determine if we should check
track_io_timing or track_wal_io_timing. And we'll want to check
track_wal_io_timing if IOObject is IOOBJECT_WAL in
pgstat_count_io_op_time().
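
As a rough sketch of the suggestion above, the prepare step could take the object type and consult the matching setting. Everything here is a stand-in: the two booleans mimic track_io_timing and track_wal_io_timing, the function names are hypothetical, and C11 timespec_get() is used in place of the instr_time macros.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Stand-in enum for this sketch only; not the definition in pgstat.h. */
typedef enum { OBJ_RELATION, OBJ_TEMP_RELATION, OBJ_WAL } SketchIOObject;

/* Stand-ins for the track_io_timing and track_wal_io_timing settings. */
static bool sketch_track_io_timing = false;
static bool sketch_track_wal_io_timing = false;

/* WAL IO follows the WAL setting; all other IO follows the general one. */
static bool
sketch_timing_enabled(SketchIOObject obj)
{
	return (obj == OBJ_WAL) ? sketch_track_wal_io_timing
							: sketch_track_io_timing;
}

/* Return the current time only when timing applies, otherwise zero. */
static struct timespec
sketch_prepare_io_time(SketchIOObject obj)
{
	struct timespec start = {0, 0};

	if (sketch_timing_enabled(obj))
	{
		int			rc = timespec_get(&start, TIME_UTC);

		assert(rc == TIME_UTC);
		(void) rc;				/* keep builds with NDEBUG warning-free */
	}
	return start;
}
```

With this shape, enabling only the WAL setting makes the WAL call sites pay for a clock read while relation IO still skips it.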

INSTR_TIME_SET_CURRENT(duration);

INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_write_time, duration,
start);

+                    pgstat_count_io_op_time(IOOBJECT_WAL,
IOCONTEXT_NORMAL, IOOP_WRITE, start, 1);
+                } else
+                {

Other users of pgstat_count_io_op_time()/io_op_n() which write multiple
pages at a time pass the number of pages in as the cnt parameter. (see
ExtendBufferedRelLocal() as an example). I think we want to do that for
WAL also. In this case, it would be the local variable "npages" and we
can do it outside of this loop.
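
The difference between counting write calls and counting blocks can be shown with a small simulation. This is only a sketch: the counters, the short-write helper and the block size constant are hypothetical stand-ins, not the code in XLogWrite().

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_XLOG_BLCKSZ 8192

typedef struct
{
	unsigned long write_calls;		/* like wal_write: one per write call */
	unsigned long blocks_written;	/* like pg_stat_io: one per block */
} SketchCounters;

/* Hypothetical write call that may write fewer bytes than requested. */
static size_t
sketch_short_write(size_t nleft, size_t max_per_call)
{
	return nleft < max_per_call ? nleft : max_per_call;
}

static void
sketch_write_pages(SketchCounters *c, unsigned int npages, size_t max_per_call)
{
	size_t		nleft = (size_t) npages * SKETCH_XLOG_BLCKSZ;

	assert(npages > 0 && max_per_call > 0);

	do
	{
		nleft -= sketch_short_write(nleft, max_per_call);
		c->write_calls++;		/* incremented for every call in the loop */
	} while (nleft > 0);

	/* Counted once, outside the loop, using the number of pages. */
	c->blocks_written += npages;
}
```

For example, writing 4 pages when each call accepts at most 3 pages' worth of bytes gives 2 write calls but 4 blocks, which is why the two views can legitimately report different write counts.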

It is true that the existing WAL stats count wal_writes here. However,
this is essentially counting write system calls, which is probably not
what we want for pg_stat_io. See [2] for a discussion about whether to
count blocks written back or writeback system calls for a previous
pg_stat_io feature. All of the other block-based IO statistics in
pg_stat_io count the number of blocks.

This being said, we probably want to just leave
PendingWalStats.wal_write++ here. We would normally move it into
pg_stat_io like we have with pgBufferUsage and the db IO stats that are
updated in pgstat_count_io_op_time(). This consolidation makes it easier
to eventually reduce the duplication. However, in this case, it seems
wal_write counts something we don't count in pg_stat_io, so it can
probably be left here. I would still move the
PendingWalStats.wal_write_time into pgstat_count_io_op_time(), since
that seems like it is the same as what will be in pg_stat_io.

Also, op_bytes for IOOBJECT_WAL/IOCONTEXT_NORMAL should be XLOG_BLCKSZ
(see comment in pg_stat_get_io() in pgstatfuncs.c). Those default to the
same value but can be made to be different.
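
A small sketch of why op_bytes matters when the two block sizes differ. The constants and function names are hypothetical stand-ins; the sizes are deliberately set to different values here, which is an assumption for illustration and not the default.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Stand-in values for this sketch. By default both block sizes are the
 * same; they are set differently here so the effect is visible.
 */
#define SKETCH_BLCKSZ		8192
#define SKETCH_XLOG_BLCKSZ	16384

/* Stand-in enum for this sketch only; not the definition in pgstat.h. */
typedef enum { OBJ_RELATION, OBJ_TEMP_RELATION, OBJ_WAL } SketchIOObject;

/* Bytes represented by one counted operation on the given object. */
static int64_t
sketch_op_bytes(SketchIOObject obj)
{
	return (obj == OBJ_WAL) ? SKETCH_XLOG_BLCKSZ : SKETCH_BLCKSZ;
}

/* Convert an operation count into bytes, as a reader of the view would. */
static int64_t
sketch_total_bytes(SketchIOObject obj, int64_t op_count)
{
	assert(op_count >= 0);
	return op_count * sketch_op_bytes(obj);
}
```

With these values, 3 WAL writes represent 49152 bytes while 3 relation writes represent 24576 bytes, so reporting the relation block size for WAL rows would understate WAL volume by half.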

+ pgstat_count_io_op_n(IOOBJECT_WAL,
IOCONTEXT_NORMAL, IOOP_WRITE, 1);
}

PendingWalStats.wal_write++;

@@ -8233,6 +8237,10 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)

INSTR_TIME_SET_CURRENT(duration);
INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_sync_time, duration, start);
+ pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL,
IOOP_FSYNC, start, 1);

I would wrap this line and check other lines to make sure they are not
too long.

+    } else
+    {
+        pgstat_count_io_op_n(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_FSYNC, 1);
     }

PendingWalStats.wal_sync++;

Same feedback as above about using the prepare/count pattern used for
pg_stat_io elsewhere. In this case, you should be able to move
PendingWalStats.wal_sync into there as well.

diff --git a/src/backend/utils/activity/pgstat_io.c
b/src/backend/utils/activity/pgstat_io.c
@@ -350,6 +352,11 @@ pgstat_tracks_io_object(BackendType bktype,
IOObject io_object,
     if (!pgstat_tracks_io_bktype(bktype))
         return false;

+
+    if (io_context != IOCONTEXT_NORMAL &&
+        io_object == IOOBJECT_WAL)
+        return false;

We should add more restrictions. See the top of my email for details.

There is a track_io_timing variable to control pg_stat_io
timings and a track_wal_io_timing variable to control WAL timings. I
couldn't decide which of them should enable WAL timings in pg_stat_io.
For now, both track_io_timing and track_wal_io_timing need to be
enabled to track WAL timings in pg_stat_io.

Hmm. I could see a case where someone doesn't want to incur the
overhead of track_io_timing for regular IO but does want to do so for
WAL because they are interested in a specific issue. I'm not sure
though. I could be convinced otherwise (based on relative overhead,
etc).

Also, if you compare WAL stats in pg_stat_wal and pg_stat_io; you can
come across differences. These differences are caused by the
background writer's WAL stats not being flushed. Because of that,
background writer's WAL stats are not seen in pg_stat_wal but in
pg_stat_io. I already sent a patch [1] to fix that.

Cool! Thanks for doing that.

- Melanie

[1]: /messages/by-id/20230216191138.jotc73lqb7xhfqbi@awork3.anarazel.de
[2]: /messages/by-id/20230504165738.4e2hfoddoels542c@awork3.anarazel.de

Nazir Bilal Yavuz

byavuz81@gmail.com

over 2 years ago

In reply to: Melanie Plageman (#2)

1 attachment(s)

Re: Show WAL write and fsync stats in pg_stat_io

Hi,

Thanks for the review!

Current status of the patch is:
- 'WAL read' stats in xlogrecovery.c are added to pg_stat_io.
- IOCONTEXT_INIT is added to count 'WAL init'. 'WAL init' stats are
added to pg_stat_io.
- pg_stat_io shows different op_bytes for the IOOBJECT_WAL operations.
- Working on which 'BackendType / IOContext / IOOp' should be banned
in pg_stat_io.
- Working on adding 'WAL read' to the xlogreader.c and walsender.c.
- PendingWalStats.wal_sync and
PendingWalStats.wal_write_time/PendingWalStats.wal_sync_time are moved
to pgstat_count_io_op_n()/pgstat_count_io_op_time() respectively.

TODOs:
- Documentation.
- Thinking about how to interpret the different IOOps within the two
IOContexts and discussing what would be useful to count.
- Decide which 'BackendType / IOContext / IOOp' should not be tracked.
- Adding 'WAL read' to the xlogreader.c and walsender.c. (This could
be another patch)
- Adding WAIT_EVENT_WAL_COPY_* operations to pg_stat_io if needed.
(This could be another patch)

On Sat, 22 Jul 2023 at 01:30, Melanie Plageman
<melanieplageman@gmail.com> wrote:

I think it would be good to count WAL reads even though they are not
currently represented in pg_stat_wal. Here is a thread discussing this
[1].

I used the same implementation in the thread link [1]. I added 'WAL
read' to only xlogrecovery.c for now. I didn't add 'WAL read' to
xlogreader.c and walsender.c because they cause some failures on:
'!pgStatLocal.shmem->is_shutdown' asserts. I will spend more time on
these. Also, I added Bharath to CC. I have a question about 'WAL
read':
1. There are two places where 'WAL read' happens.
a. In WALRead() in xlogreader.c, it reads 'count' bytes. Most of the
time count is equal to XLOG_BLCKSZ, but there are some cases where it
is not. For example:
- in XLogSendPhysical() in walsender.c, WALRead() is called with nbytes
- in WALDumpReadPage() in pg_waldump.c, WALRead() is called with count
These nbytes and count variables could be different from XLOG_BLCKSZ.

b. In XLogPageRead() in xlogrecovery.c, it reads exactly XLOG_BLCKSZ bytes:
pg_pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);

So, what should op_bytes be set to for 'WAL read' operations?

Eventually, the docs will need an update as well. You can wait until a
later version of the patch to do this, but I would include it in a list
of the remaining TODOs in your next version.

Done. I shared TODOs at the top.

I think we will also want to add an IOContext for WAL initialization.
Then we can track how long is spent doing 'WAL init' (including filling
the WAL file with zeroes). XLogFileInitInternal() is likely where we
would want to add it. And op_bytes for this would likely be
wal_segment_size. I thought I heard about someone proposing adding WAL
init to pg_stat_wal, but I can't find the thread.

Done. I created a new IOCONTEXT_INIT IOContext for the 'WAL init'. I
have a question there:
1. Some of the WAL processing happens during initdb (standalone backend
IOCONTEXT_NORMAL/(IOOP_READ & IOOP_WRITE) and
IOCONTEXT_INIT/(IOOP_WRITE & IOOP_FSYNC)). Since this happens during
initdb, AFAIK there is no way to set the 'track_wal_io_timing' and
'track_io_timing' variables there. So, their timings appear as 0.
Should I use IsBootstrapProcessingMode() to enable WAL IO timings
during initdb, or are they not that important?

I think there is also an argument for counting WAL files recycled as
IOOP_REUSES. We should start thinking about how to interpret the
different IOOps within the two IOContexts and discussing what would be
useful to count. For example, should removing a logfile count as an
IOOP_EVICT? Maybe it is not directly related to "IO" enough or even an
interesting statistic, but we should think about what kinds of
IO-related WAL statistics we want to track.

I added that to TODOs.

Any that we decide not to count for now should be "banned" in
pgstat_tracks_io_op() for clarity. For example, if we create a separate
IOContext for WAL file init, I'm not sure what would count as an
IOOP_EXTEND in IOCONTEXT_NORMAL for IOOBJECT_WAL.

Also, I think there are some backend types which will not generate WAL
and we should determine which those are and skip those rows in
pgstat_tracks_io_object().

I agree, I am working on this. I have a couple of questions:
1. Can client backend and background worker do IOCONTEXT_NORMAL/IOOP_READ?
2. Is there an easy way to check if 'BackendType / IOOBJECT_WAL' does
specific IOOp operations?

diff --git a/src/backend/access/transam/xlog.c
b/src/backend/access/transam/xlog.c
index 8b0710abe6..2ee6c21398 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2207,6 +2207,10 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID
tli, bool flexible)
I think we should likely follow the pattern of using
pgstat_prepare_io_time() and pgstat_count_io_op_time() as it is done
elsewhere. You could pass the IOObject as a parameter to
pgstat_prepare_io_time() in order to determine if we should check
track_io_timing or track_wal_io_timing. And we'll want to check
track_wal_io_timing if IOObject is IOOBJECT_WAL in
pgstat_count_io_op_time().

Done. Instead of passing parameters to pgstat_prepare_io_time(), I
used a slightly different implementation. I return the current time if
there is a chance that any 'time' can be tracked.
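
The approach described above can be sketched roughly as follows. This is a simplified illustration, not the attached patch: the booleans stand in for track_io_timing and track_wal_io_timing, the function names are hypothetical, and C11 timespec_get() replaces the instr_time macros.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Stand-in enum for this sketch only; not the definition in pgstat.h. */
typedef enum { OBJ_RELATION, OBJ_WAL } SketchIOObject;

/* Stand-ins for the track_io_timing and track_wal_io_timing settings. */
static bool sketch_track_io_timing = false;
static bool sketch_track_wal_io_timing = false;

/* Take a timestamp whenever either setting could need it. */
static struct timespec
sketch_prepare_io_time_any(void)
{
	struct timespec start = {0, 0};

	if (sketch_track_io_timing || sketch_track_wal_io_timing)
	{
		int			rc = timespec_get(&start, TIME_UTC);

		assert(rc == TIME_UTC);
		(void) rc;				/* keep builds with NDEBUG warning-free */
	}
	return start;
}

/* The per-object decision is made later, when the operation is counted. */
static bool
sketch_should_track_io_time(SketchIOObject obj)
{
	return (obj == OBJ_WAL) ? sketch_track_wal_io_timing
							: sketch_track_io_timing;
}
```

The trade-off in this shape is that the prepare step keeps its argument-free signature, but a clock read can happen for an object whose timing ends up not being recorded, for instance a WAL write when only the general setting is on.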

INSTR_TIME_SET_CURRENT(duration);

INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_write_time, duration,
start);
+                    pgstat_count_io_op_time(IOOBJECT_WAL,
IOCONTEXT_NORMAL, IOOP_WRITE, start, 1);
+                } else
+                {
Other users of pgstat_count_io_op_time()/io_op_n() which write multiple
pages at a time pass the number of pages in as the cnt parameter. (see
ExtendBufferedRelLocal() as an example). I think we want to do that for
WAL also. In this case, it would be the local variable "npages" and we
can do it outside of this loop.

It is true that the existing WAL stats count wal_writes here. However,
this is essentially counting write system calls, which is probably not
what we want for pg_stat_io. See [2] for a discussion about whether to
count blocks written back or writeback system calls for a previous
pg_stat_io feature. All of the other block-based IO statistics in
pg_stat_io count the number of blocks.

This being said, we probably want to just leave
PendingWalStats.wal_write++ here. We would normally move it into
pg_stat_io like we have with pgBufferUsage and the db IO stats that are
updated in pgstat_count_io_op_time(). This consolidation makes it easier
to eventually reduce the duplication. However, in this case, it seems
wal_write counts something we don't count in pg_stat_io, so it can
probably be left here. I would still move the
PendingWalStats.wal_write_time into pgstat_count_io_op_time(), since
that seems like it is the same as what will be in pg_stat_io.

Done. I moved PendingWalStats.wal_sync and
PendingWalStats.wal_write_time/PendingWalStats.wal_sync_time to
pgstat_count_io_op_n()/pgstat_count_io_op_time() respectively. Because
of this change, pg_stat_wal's and pg_stat_io's
IOOBJECT_WAL/IOCONTEXT_NORMAL/IOOP_WRITE counts are different but the
rest are the same.

Also, op_bytes for IOOBJECT_WAL/IOCONTEXT_NORMAL should be XLOG_BLCKSZ
(see comment in pg_stat_get_io() in pgstatfuncs.c). Those default to the
same value but can be made to be different.

Done.

I would wrap this line and check other lines to make sure they are not
too long.

Done.

+    } else
+    {
+        pgstat_count_io_op_n(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_FSYNC, 1);
}
PendingWalStats.wal_sync++;

Same feedback as above about using the prepare/count pattern used for
pg_stat_io elsewhere. In this case, you should be able to move
PendingWalStats.wal_sync into there as well.

Done.

There is a track_io_timing variable to control pg_stat_io
timings and a track_wal_io_timing variable to control WAL timings. I
couldn't decide which of them should enable WAL timings in pg_stat_io.
For now, both track_io_timing and track_wal_io_timing need to be
enabled to track WAL timings in pg_stat_io.

Hmm. I could see a case where someone doesn't want to incur the
overhead of track_io_timing for regular IO but does want to do so for
WAL because they are interested in a specific issue. I'm not sure
though. I could be convinced otherwise (based on relative overhead,
etc).

Done. IOOBJECT_WAL uses track_wal_io_timing regardless of
track_io_timing for now.

[1] /messages/by-id/20230216191138.jotc73lqb7xhfqbi@awork3.anarazel.de
[2] /messages/by-id/20230504165738.4e2hfoddoels542c@awork3.anarazel.de

In addition to these, do the WAIT_EVENT_WAL_COPY_* operations need to
be added to pg_stat_io? If the answer is yes, should I add them to the
current patch?

Any kind of feedback would be appreciated.

Regards,
Nazir Bilal Yavuz
Microsoft

Attachments:

v2-0001-Show-WAL-stats-on-pg_stat_io.patch (text/x-diff; charset=US-ASCII)

From 574fdec6ed8073dbc49053e6933db0310c7c62f5 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Thu, 3 Aug 2023 16:11:16 +0300
Subject: [PATCH v2] Show WAL stats on pg_stat_io

This patch aims to show WAL stats per backend in the pg_stat_io view.

With this patch, the pg_stat_io view shows how many WAL operations each
backend performs, along with their context, type and total time.
---
 src/backend/access/transam/xlog.c         | 58 ++++++----------
 src/backend/access/transam/xlogrecovery.c | 17 +++++
 src/backend/utils/activity/pgstat_io.c    | 80 +++++++++++++++++++++--
 src/backend/utils/adt/pgstatfuncs.c       |  4 +-
 src/include/pgstat.h                      | 16 ++++-
 5 files changed, 131 insertions(+), 44 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 60c0b7ec3af..ee7b85e18ca 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2188,38 +2188,22 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
 			Size		nbytes;
 			Size		nleft;
 			int			written;
-			instr_time	start;
+			instr_time	io_start;
 
 			/* OK to write the page(s) */
 			from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
 			nbytes = npages * (Size) XLOG_BLCKSZ;
 			nleft = nbytes;
+
+			io_start = pgstat_prepare_io_time();
 			do
 			{
 				errno = 0;
 
-				/* Measure I/O timing to write WAL data */
-				if (track_wal_io_timing)
-					INSTR_TIME_SET_CURRENT(start);
-				else
-					INSTR_TIME_SET_ZERO(start);
-
 				pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
 				written = pg_pwrite(openLogFile, from, nleft, startoffset);
 				pgstat_report_wait_end();
 
-				/*
-				 * Increment the I/O timing and the number of times WAL data
-				 * were written out to disk.
-				 */
-				if (track_wal_io_timing)
-				{
-					instr_time	duration;
-
-					INSTR_TIME_SET_CURRENT(duration);
-					INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_write_time, duration, start);
-				}
-
 				PendingWalStats.wal_write++;
 
 				if (written <= 0)
@@ -2245,6 +2229,9 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
 				startoffset += written;
 			} while (nleft > 0);
 
+			pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL,
+									IOOP_WRITE, io_start, npages);
+
 			npages = 0;
 
 			/*
@@ -2938,6 +2925,7 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	int			fd;
 	int			save_errno;
 	int			open_flags = O_RDWR | O_CREAT | O_EXCL | PG_BINARY;
+	instr_time	io_start;
 
 	Assert(logtli != 0);
 
@@ -2981,6 +2969,8 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 				(errcode_for_file_access(),
 				 errmsg("could not create file \"%s\": %m", tmppath)));
 
+	io_start = pgstat_prepare_io_time();
+
 	pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_WRITE);
 	save_errno = 0;
 	if (wal_init_zero)
@@ -3016,6 +3006,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	}
 	pgstat_report_wait_end();
 
+	pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_INIT, IOOP_WRITE,
+						io_start, 1);
+
 	if (save_errno)
 	{
 		/*
@@ -3032,6 +3025,8 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 				 errmsg("could not write to file \"%s\": %m", tmppath)));
 	}
 
+	io_start = pgstat_prepare_io_time();
+
 	pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_SYNC);
 	if (pg_fsync(fd) != 0)
 	{
@@ -3044,6 +3039,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	}
 	pgstat_report_wait_end();
 
+	pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_INIT,
+							IOOP_FSYNC, io_start, 1);
+
 	if (close(fd) != 0)
 		ereport(ERROR,
 				(errcode_for_file_access(),
@@ -8174,7 +8172,7 @@ void
 issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 {
 	char	   *msg = NULL;
-	instr_time	start;
+	instr_time	io_start;
 
 	Assert(tli != 0);
 
@@ -8187,11 +8185,7 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 		sync_method == SYNC_METHOD_OPEN_DSYNC)
 		return;
 
-	/* Measure I/O timing to sync the WAL file */
-	if (track_wal_io_timing)
-		INSTR_TIME_SET_CURRENT(start);
-	else
-		INSTR_TIME_SET_ZERO(start);
+	io_start = pgstat_prepare_io_time();
 
 	pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC);
 	switch (sync_method)
@@ -8235,18 +8229,8 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 
 	pgstat_report_wait_end();
 
-	/*
-	 * Increment the I/O timing and the number of times WAL files were synced.
-	 */
-	if (track_wal_io_timing)
-	{
-		instr_time	duration;
-
-		INSTR_TIME_SET_CURRENT(duration);
-		INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_sync_time, duration, start);
-	}
-
-	PendingWalStats.wal_sync++;
+	pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_FSYNC,
+							io_start, 1);
 }
 
 /*
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index becc2bda62e..ee850af5514 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1587,6 +1587,7 @@ PerformWalRecovery(void)
 	XLogRecord *record;
 	bool		reachedRecoveryTarget = false;
 	TimeLineID	replayTLI;
+	uint32		pgstat_report_wal_frequency = 0;
 
 	/*
 	 * Initialize shared variables for tracking progress of WAL replay, as if
@@ -1745,6 +1746,16 @@ PerformWalRecovery(void)
 			 */
 			ApplyWalRecord(xlogreader, record, &replayTLI);
 
+			/*
+			 * Report pending statistics to the cumulative stats system once
+			 * every PGSTAT_REPORT_FREQUENCY times to not hinder performance.
+			 */
+			if (pgstat_report_wal_frequency++ == PGSTAT_REPORT_FREQUENCY)
+			{
+				pgstat_report_wal(false);
+				pgstat_report_wal_frequency = 0;
+			}
+
 			/* Exit loop if we reached inclusive recovery target */
 			if (recoveryStopsAfter(xlogreader))
 			{
@@ -3220,6 +3231,7 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 	uint32		targetPageOff;
 	XLogSegNo	targetSegNo PG_USED_FOR_ASSERTS_ONLY;
 	int			r;
+	instr_time	io_start;
 
 	XLByteToSeg(targetPagePtr, targetSegNo, wal_segment_size);
 	targetPageOff = XLogSegmentOffset(targetPagePtr, wal_segment_size);
@@ -3312,6 +3324,8 @@ retry:
 	/* Read the requested page */
 	readOff = targetPageOff;
 
+	io_start = pgstat_prepare_io_time();
+
 	pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
 	r = pg_pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);
 	if (r != XLOG_BLCKSZ)
@@ -3340,6 +3354,9 @@ retry:
 	}
 	pgstat_report_wait_end();
 
+	pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_READ,
+							io_start, 1);
+
 	Assert(targetSegNo == readSegNo);
 	Assert(targetPageOff == readOff);
 	Assert(reqLen <= readLen);
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index eb7d35d4225..97689a0637e 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -87,17 +87,25 @@ pgstat_count_io_op_n(IOObject io_object, IOContext io_context, IOOp io_op, uint3
 	Assert((unsigned int) io_op < IOOP_NUM_TYPES);
 	Assert(pgstat_tracks_io_op(MyBackendType, io_object, io_context, io_op));
 
+	if(io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL &&
+	   io_op == IOOP_FSYNC)
+		PendingWalStats.wal_sync += cnt;
+
 	PendingIOStats.counts[io_object][io_context][io_op] += cnt;
 
 	have_iostats = true;
 }
 
+/*
+ * Prepares io_time for pgstat_count_io_op_time() function. It needs to return
+ * current time if there is a chance that any 'time' can be tracked.
+ */
 instr_time
 pgstat_prepare_io_time(void)
 {
 	instr_time	io_start;
 
-	if (track_io_timing)
+	if(track_io_timing || track_wal_io_timing)
 		INSTR_TIME_SET_CURRENT(io_start);
 	else
 		INSTR_TIME_SET_ZERO(io_start);
@@ -105,6 +113,22 @@ pgstat_prepare_io_time(void)
 	return io_start;
 }
 
+/*
+ * Decide if the io timing needs be tracked
+ */
+bool
+pgstat_should_track_io_time(IOObject io_object, IOContext io_context)
+{
+	/*
+	 * io times of IOOBJECT_WAL IOObject needs to be tracked when
+	 * 'track_wal_io_timing' is set regardless of 'track_io_timing'.
+	 */
+	if (io_object == IOOBJECT_WAL)
+		return track_wal_io_timing;
+
+	return track_io_timing;
+}
+
 /*
  * Like pgstat_count_io_op_n() except it also accumulates time.
  */
@@ -112,7 +136,7 @@ void
 pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
 						instr_time start_time, uint32 cnt)
 {
-	if (track_io_timing)
+	if (pgstat_should_track_io_time(io_object, io_context))
 	{
 		instr_time	io_time;
 
@@ -124,6 +148,9 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
 			pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
 			if (io_object == IOOBJECT_RELATION)
 				INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);
+			/* Track IOOBJECT_WAL/IOCONTEXT_NORMAL times on PendingWalStats */
+			else if (io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL)
+				INSTR_TIME_ADD(PendingWalStats.wal_write_time, io_time);
 		}
 		else if (io_op == IOOP_READ)
 		{
@@ -131,6 +158,12 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
 			if (io_object == IOOBJECT_RELATION)
 				INSTR_TIME_ADD(pgBufferUsage.blk_read_time, io_time);
 		}
+		else if (io_op == IOOP_FSYNC)
+		{
+			/* Track IOOBJECT_WAL/IOCONTEXT_NORMAL times on PendingWalStats */
+			if (io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL)
+				INSTR_TIME_ADD(PendingWalStats.wal_sync_time, io_time);
+		}
 
 		INSTR_TIME_ADD(PendingIOStats.pending_times[io_object][io_context][io_op],
 					   io_time);
@@ -216,12 +249,32 @@ pgstat_get_io_context_name(IOContext io_context)
 			return "normal";
 		case IOCONTEXT_VACUUM:
 			return "vacuum";
+		case IOCONTEXT_INIT:
+			return "init";
 	}
 
 	elog(ERROR, "unrecognized IOContext value: %d", io_context);
 	pg_unreachable();
 }
 
+/*
+ * op_bytes can change according to IOObject and IOContext.
+ * Return BLCKSZ as default.
+ */
+int
+pgstat_get_io_op_btyes(IOObject io_object, IOContext io_context)
+{
+	if (io_object == IOOBJECT_WAL)
+	{
+		if (io_context == IOCONTEXT_NORMAL)
+			return XLOG_BLCKSZ;
+		else if (io_context == IOCONTEXT_INIT)
+			return wal_segment_size;
+	}
+
+	return BLCKSZ;
+}
+
 const char *
 pgstat_get_io_object_name(IOObject io_object)
 {
@@ -231,6 +284,8 @@ pgstat_get_io_object_name(IOObject io_object)
 			return "relation";
 		case IOOBJECT_TEMP_RELATION:
 			return "temp relation";
+		case IOOBJECT_WAL:
+			return "wal";
 	}
 
 	elog(ERROR, "unrecognized IOObject value: %d", io_object);
@@ -312,10 +367,10 @@ pgstat_tracks_io_bktype(BackendType bktype)
 		case B_INVALID:
 		case B_ARCHIVER:
 		case B_LOGGER:
-		case B_WAL_RECEIVER:
-		case B_WAL_WRITER:
 			return false;
 
+		case B_WAL_RECEIVER:
+		case B_WAL_WRITER:
 		case B_AUTOVAC_LAUNCHER:
 		case B_AUTOVAC_WORKER:
 		case B_BACKEND:
@@ -350,6 +405,15 @@ pgstat_tracks_io_object(BackendType bktype, IOObject io_object,
 	if (!pgstat_tracks_io_bktype(bktype))
 		return false;
 
+	/*
+	 * Currently, IO on IOOBJECT_WAL IOObject can only occur in the
+	 * IOCONTEXT_NORMAL and IOCONTEXT_INIT IOContext.
+	 */
+	if (io_object == IOOBJECT_WAL &&
+		(io_context != IOCONTEXT_NORMAL &&
+	     io_context != IOCONTEXT_INIT))
+		return false;
+
 	/*
 	 * Currently, IO on temporary relations can only occur in the
 	 * IOCONTEXT_NORMAL IOContext.
@@ -439,6 +503,14 @@ pgstat_tracks_io_op(BackendType bktype, IOObject io_object,
 	if (io_context == IOCONTEXT_BULKREAD && io_op == IOOP_EXTEND)
 		return false;
 
+	if(io_object == IOOBJECT_WAL && io_context == IOCONTEXT_INIT &&
+	   !(io_op == IOOP_WRITE || io_op == IOOP_FSYNC))
+	   return false;
+
+	if(io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL &&
+	   !(io_op == IOOP_WRITE || io_op == IOOP_READ || io_op == IOOP_FSYNC))
+	   return false;
+
 	strategy_io_context = io_context == IOCONTEXT_BULKREAD ||
 		io_context == IOCONTEXT_BULKWRITE || io_context == IOCONTEXT_VACUUM;
 
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 2a4c8ef87ff..c467891e5fb 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1386,6 +1386,7 @@ pg_stat_get_io(PG_FUNCTION_ARGS)
 			for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
 			{
 				const char *context_name = pgstat_get_io_context_name(io_context);
+				int			op_bytes;
 
 				Datum		values[IO_NUM_COLUMNS] = {0};
 				bool		nulls[IO_NUM_COLUMNS] = {0};
@@ -1409,7 +1410,8 @@ pg_stat_get_io(PG_FUNCTION_ARGS)
 				 * and constant multipliers, once non-block-oriented IO (e.g.
 				 * temporary file IO) is tracked.
 				 */
-				values[IO_COL_CONVERSION] = Int64GetDatum(BLCKSZ);
+				op_bytes = pgstat_get_io_op_btyes(io_obj, io_context);
+				values[IO_COL_CONVERSION] = Int64GetDatum(op_bytes);
 
 				for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
 				{
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 57a2c0866a2..04deb2aa509 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -277,9 +277,10 @@ typedef enum IOObject
 {
 	IOOBJECT_RELATION,
 	IOOBJECT_TEMP_RELATION,
+	IOOBJECT_WAL,
 } IOObject;
 
-#define IOOBJECT_NUM_TYPES (IOOBJECT_TEMP_RELATION + 1)
+#define IOOBJECT_NUM_TYPES (IOOBJECT_WAL + 1)
 
 typedef enum IOContext
 {
@@ -287,9 +288,10 @@ typedef enum IOContext
 	IOCONTEXT_BULKWRITE,
 	IOCONTEXT_NORMAL,
 	IOCONTEXT_VACUUM,
+	IOCONTEXT_INIT,
 } IOContext;
 
-#define IOCONTEXT_NUM_TYPES (IOCONTEXT_VACUUM + 1)
+#define IOCONTEXT_NUM_TYPES (IOCONTEXT_INIT + 1)
 
 typedef enum IOOp
 {
@@ -439,6 +441,12 @@ typedef struct PgStat_WalStats
 	TimestampTz stat_reset_timestamp;
 } PgStat_WalStats;
 
+/*
+ * When stats captured in resource-intensive/tight loops, report pending stats
+ * once every PGSTAT_REPORT_FREQUENCY times.
+ */
+#define PGSTAT_REPORT_FREQUENCY		1024
+
 /*
  * This struct stores wal-related durations as instr_time, which makes it
  * cheaper and easier to accumulate them, by not requiring type
@@ -521,10 +529,12 @@ extern bool pgstat_bktype_io_stats_valid(PgStat_BktypeIO *backend_io,
 extern void pgstat_count_io_op(IOObject io_object, IOContext io_context, IOOp io_op);
 extern void pgstat_count_io_op_n(IOObject io_object, IOContext io_context, IOOp io_op, uint32 cnt);
 extern instr_time pgstat_prepare_io_time(void);
+extern bool pgstat_should_track_io_time(IOObject io_object, IOContext io_context);
 extern void pgstat_count_io_op_time(IOObject io_object, IOContext io_context,
 									IOOp io_op, instr_time start_time, uint32 cnt);
 
 extern PgStat_IO *pgstat_fetch_stat_io(void);
+extern int pgstat_get_io_op_btyes(IOObject io_object, IOContext io_context);
 extern const char *pgstat_get_io_context_name(IOContext io_context);
 extern const char *pgstat_get_io_object_name(IOObject io_object);
 
@@ -770,6 +780,8 @@ extern PGDLLIMPORT SessionEndType pgStatSessionEndCause;
 /*
  * Variables in pgstat_wal.c
  */
+extern PGDLLIMPORT bool track_wal_io_timing;
+extern PGDLLIMPORT int wal_segment_size;
 
 /* updated directly by backends and background processes */
 extern PGDLLIMPORT PgStat_PendingWalStats PendingWalStats;
-- 
2.40.1

Melanie Plageman

melanieplageman@gmail.com

over 2 years ago

In reply to: Nazir Bilal Yavuz (#3)

Re: Show WAL write and fsync stats in pg_stat_io

On Thu, Aug 03, 2023 at 04:38:41PM +0300, Nazir Bilal Yavuz wrote:

Current status of the patch is:
- 'WAL read' stats in xlogrecovery.c are added to pg_stat_io.
- IOCONTEXT_INIT is added to count 'WAL init'. 'WAL init' stats are
added to pg_stat_io.
- pg_stat_io shows different op_bytes for the IOOBJECT_WAL operations.
- Working on which 'BackendType / IOContext / IOOp' should be banned
in pg_stat_io.
- Working on adding 'WAL read' to the xlogreader.c and walsender.c.
- PendingWalStats.wal_sync and
PendingWalStats.wal_write_time/PendingWalStats.wal_sync_time are moved
to pgstat_count_io_op_n()/pgstat_count_io_op_time() respectively.

Cool! Thanks for the summary and for continuing to work on this.

TODOs:
- Documentation.
- Thinking about how to interpret the different IOOps within the two
IOContexts and discussing what would be useful to count.
- Decide which 'BackendType / IOContext / IOOp' should not be tracked.
- Adding 'WAL read' to the xlogreader.c and walsender.c. (This could
be another patch)

Yes, I would be explicit that you are not including WAL IO done exclusively in
the context of replication.

- Adding WAIT_EVENT_WAL_COPY_* operations to pg_stat_io if needed.
(This could be another patch)

Yes, I think it makes sense as another patch.

On Sat, 22 Jul 2023 at 01:30, Melanie Plageman
<melanieplageman@gmail.com> wrote:

I think it would be good to count WAL reads even though they are not
currently represented in pg_stat_wal. Here is a thread discussing this
[1].

I used the same implementation in the thread link [1]. I added 'WAL
read' to only xlogrecovery.c for now. I didn't add 'WAL read' to
xlogreader.c and walsender.c because they cause some failures on:
'!pgStatLocal.shmem->is_shutdown' asserts. I will spend more time on
these. Also, I added Bharath to CC. I have a question about 'WAL
read':
1. There are two places where 'WAL read' happens.
a. In WALRead() in xlogreader.c, it reads 'count' bytes, most of the
time count is equal to XLOG_BLCKSZ but there are some cases it is not.
For example
- in XLogSendPhysical() in walsender.c WALRead() is called by nbytes
- in WALDumpReadPage() in pg_waldump.c WALRead() is called by count
These nbytes and count variables could be different from XLOG_BLCKSZ.

b. in XLogPageRead() in xlogreader.c, it reads exactly XLOG_BLCKSZ bytes:
pg_pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);

So, what should op_bytes be set to for 'WAL read' operations?

If there is any combination of BackendType and IOContext which will
always read XLOG_BLCKSZ bytes, we could use XLOG_BLCKSZ for that row's
op_bytes. For other cases, we may have to consider using op_bytes 1 and
tracking reads and write IOOps in number of bytes (instead of number of
pages). I don't actually know if there is a clear separation by
BackendType for these different cases.

The other alternative I see is to use XLOG_BLCKSZ as the op_bytes and
treat op_bytes * number of reads as an approximation of the number of
bytes read. I don't actually know what makes more sense. I don't think I
would like having a number for bytes that is not accurate.
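
To make the trade-off between the two op_bytes choices concrete, here is a minimal sketch in plain C. It is not part of the patch: bytes_exact(), bytes_approx() and SKETCH_XLOG_BLCKSZ are hypothetical names, and 8192 is only the default value of XLOG_BLCKSZ. It compares exact byte accounting (op_bytes = 1) with the block-count approximation (op_bytes = XLOG_BLCKSZ) for a mix of full-page and shorter reads.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: XLOG_BLCKSZ is a compile-time setting; 8192 is the default. */
#define SKETCH_XLOG_BLCKSZ 8192

/* Option 1: op_bytes = 1, so the counter accumulates exact byte counts. */
static uint64_t
bytes_exact(const size_t *read_sizes, size_t nreads)
{
	uint64_t	total = 0;

	assert(read_sizes != NULL || nreads == 0);

	for (size_t i = 0; i < nreads; i++)
		total += read_sizes[i];
	return total;
}

/* Option 2: op_bytes = XLOG_BLCKSZ, so every read counts as one full block. */
static uint64_t
bytes_approx(size_t nreads)
{
	return (uint64_t) nreads * SKETCH_XLOG_BLCKSZ;
}
```

For two 8192-byte reads followed by one 5000-byte read, the exact total is 21384 bytes, while the block-count approximation reports 24576.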

I think we will also want to add an IOContext for WAL initialization.
Then we can track how long is spent doing 'WAL init' (including filling
the WAL file with zeroes). XLogFileInitInternal() is likely where we
would want to add it. And op_bytes for this would likely be
wal_segment_size. I thought I heard about someone proposing adding WAL
init to pg_stat_wal, but I can't find the thread.

Done. I created a new IOCONTEXT_INIT IOContext for the 'WAL init'. I
have a question there:
1. Some of the WAL processing happens at initdb (standalone backend
IOCONTEXT_NORMAL/(IOOP_READ & IOOP_WRITE) and
IOCONTEXT_INIT/(IOOP_WRITE & IOOP_FSYNC)). Since this happens at
initdb, AFAIK there is no way to set the 'track_wal_io_timing' and
'track_io_timing' variables there, so their timings appear as 0.
Should I use IsBootstrapProcessingMode() to enable WAL IO timings at
initdb, or are they not that important?

I don't have an opinion about this. I can see an argument for doing it
either way. We do track other IO during initdb in pg_stat_io.

Any that we decide not to count for now should be "banned" in
pgstat_tracks_io_op() for clarity. For example, if we create a separate
IOContext for WAL file init, I'm not sure what would count as an
IOOP_EXTEND in IOCONTEXT_NORMAL for IOOBJECT_WAL.

Also, I think there are some backend types which will not generate WAL
and we should determine which those are and skip those rows in
pgstat_tracks_io_object().

I agree, I am working on this. I have a couple of questions:
1. Can client backend and background worker do IOCONTEXT_NORMAL/IOOP_READ?

I don't know the answer to this.

2. Is there an easy way to check if 'BackendType / IOOBJECT_WAL' does
specific IOOp operations?

I don't think there is a general answer to this. You'll have to look at
the code and think about specific things that backend might do that
would require WAL. I think we'll definitely need other community members
to check our work for the valid combinations.

Completing the matrix of valid combinations of BackendType, IOOp, and
IOContext and defining each one is the biggest area where we could use
help from community members.

As an additional TODO, I would explore adding some tests to prevent
accidental removal of the pg_stat_io WAL tracking.

I think we can easily test IOCONTEXT_NORMAL WAL writes in
src/test/regress/sql/stats.sql (perhaps it is worth checking that
synchronous_commit is on in the test). IOCONTEXT_NORMAL WAL fsyncs
should again be easy to test if synchronous_commit is on and fsync is
on.

I'm not sure how to reliably test WAL reads (given timing). Logically,
you can sum WAL reads before a crash is initiated in one of the tests in
the recovery suite, and then sum them after the db has restarted and
there should definitely be an increase in WAL reads, but I don't know if
we need to do something to guarantee that there will have been WAL reads
(to avoid test flakes).

I'm also not sure how to reliably test any IOCONTEXT_INIT operations. We
need a before and after and I can't think of a cheap operation to ensure
a new WAL segment is written to or fsyncd in between a before and after
for the purposes of testing.

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8b0710abe6..2ee6c21398 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2207,6 +2207,10 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)

I think we should likely follow the pattern of using
pgstat_prepare_io_time() and pgstat_count_io_op_time() as it is done
elsewhere. You could pass the IOObject as a parameter to
pgstat_prepare_io_time() in order to determine if we should check
track_io_timing or track_wal_io_timing. And we'll want to check
track_wal_io_timing if IOObject is IOOBJECT_WAL in
pgstat_count_io_op_time().

Done. Instead of passing parameters to pgstat_prepare_io_time(), I
used a slightly different implementation. I return the current time if
there is a chance that any 'time' can be tracked.

Cool!
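
The two-step timing decision discussed here can be sketched roughly as follows. This is only an illustration under assumptions: the Sketch* enum, the sketch_* variables and both helper functions are hypothetical stand-ins, and the body of the patch's pgstat_should_track_io_time() is not shown in this thread. The sketch assumes that WAL timing follows track_wal_io_timing alone and that all other IO follows track_io_timing.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for IOObject; the real enum lives in pgstat.h. */
typedef enum SketchIOObject
{
	SKETCH_IOOBJECT_RELATION,
	SKETCH_IOOBJECT_TEMP_RELATION,
	SKETCH_IOOBJECT_WAL,
	SKETCH_IOOBJECT_NUM_TYPES
} SketchIOObject;

/* Stand-ins for the track_io_timing and track_wal_io_timing settings. */
static bool sketch_track_io_timing = false;
static bool sketch_track_wal_io_timing = false;

/*
 * The prepare step does not know which object will be counted, so it has
 * to read the clock whenever either setting is enabled.
 */
static bool
sketch_prepare_needs_clock(void)
{
	return sketch_track_io_timing || sketch_track_wal_io_timing;
}

/*
 * The count step knows the object and can pick the matching setting.
 * Assumption: WAL uses its own setting, everything else the general one.
 */
static bool
sketch_should_track_io_time(SketchIOObject io_object)
{
	assert((unsigned int) io_object < (unsigned int) SKETCH_IOOBJECT_NUM_TYPES);

	if (io_object == SKETCH_IOOBJECT_WAL)
		return sketch_track_wal_io_timing;
	return sketch_track_io_timing;
}
```

With only the WAL setting enabled, the clock is still read for every IO, but the elapsed time is recorded for WAL operations only.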

From 574fdec6ed8073dbc49053e6933db0310c7c62f5 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Thu, 3 Aug 2023 16:11:16 +0300
Subject: [PATCH v2] Show WAL stats on pg_stat_io

This patch aims to showing WAL stats per backend on pg_stat_io view.

With this patch, it can be seen how many WAL operations it makes, their
context, types and total timings per backend in pg_stat_io view.

In the commit message, I would describe what kinds of WAL IO this
patchset currently covers -- i.e. not streaming replication WAL IO.

---
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 60c0b7ec3af..ee7b85e18ca 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2245,6 +2229,9 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
startoffset += written;
} while (nleft > 0);

I'm not sure if the right location is here or in
pgstat_count_io_op_time(), but I would explain why you did not move
PendingWalStats.wal_writes counter into pg_stat_io code (and why you did
move the other PendingWalStats counters there).

+			pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL,
+									IOOP_WRITE, io_start, npages);
+
npages = 0;
/*
@@ -2938,6 +2925,7 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
int fd;
int save_errno;
int open_flags = O_RDWR | O_CREAT | O_EXCL | PG_BINARY;
+ instr_time io_start;

Assert(logtli != 0);

@@ -2981,6 +2969,8 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
(errcode_for_file_access(),
errmsg("could not create file \"%s\": %m", tmppath)));

Since you have two calls to pgstat_prepare_io_time() in this function, I
think it would be nice to have a comment above each to the effect of
"start timing writes for stats" and "start timing fsyncs for stats"

+	io_start = pgstat_prepare_io_time();
+
pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_WRITE);

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index becc2bda62e..ee850af5514 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1587,6 +1587,7 @@ PerformWalRecovery(void)
XLogRecord *record;
bool		reachedRecoveryTarget = false;
TimeLineID	replayTLI;
+	uint32		pgstat_report_wal_frequency = 0;

/*
* Initialize shared variables for tracking progress of WAL replay, as if
@@ -1745,6 +1746,16 @@ PerformWalRecovery(void)
*/
ApplyWalRecord(xlogreader, record, &replayTLI);

+			/*
+			 * Report pending statistics to the cumulative stats system once
+			 * every PGSTAT_REPORT_FREQUENCY times to not hinder performance.
+			 */
+			if (pgstat_report_wal_frequency++ == PGSTAT_REPORT_FREQUENCY)
+			{
+				pgstat_report_wal(false);
+				pgstat_report_wal_frequency = 0;
+			}
+

Is the above needed for your patch to work? What does it do? It should
probably be in a separate commit and should definitely have an
explanation.
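
As a rough illustration of what the quoted hunk does, the following self-contained sketch replays the same compare-then-increment pattern and counts how often the report would fire. The names sketch_count_reports() and SKETCH_REPORT_FREQUENCY are hypothetical, and the increment of `reports` merely stands in for the call to pgstat_report_wal(false).

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_REPORT_FREQUENCY 1024

/* Returns how many times the report fires while replaying nrecords records. */
static uint32_t
sketch_count_reports(uint32_t nrecords)
{
	uint32_t	counter = 0;
	uint32_t	reports = 0;

	for (uint32_t i = 0; i < nrecords; i++)
	{
		/* Invariant: the counter never exceeds the frequency at loop start. */
		assert(counter <= SKETCH_REPORT_FREQUENCY);

		/* Same shape as the hunk: compare first, then increment. */
		if (counter++ == SKETCH_REPORT_FREQUENCY)
		{
			reports++;			/* stands in for pgstat_report_wal(false) */
			counter = 0;
		}
	}
	return reports;
}
```

With this shape the report first fires on the 1025th record and then once every 1025 records, because the counter is compared before it is incremented.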

--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -87,17 +87,25 @@ pgstat_count_io_op_n(IOObject io_object, IOContext io_context, IOOp io_op, uint3
Assert((unsigned int) io_op < IOOP_NUM_TYPES);
Assert(pgstat_tracks_io_op(MyBackendType, io_object, io_context, io_op));

I would add a comment here explaining that pg_stat_wal doesn't count WAL
init or WAL reads.

+	if(io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL &&
+	   io_op == IOOP_FSYNC)
+		PendingWalStats.wal_sync += cnt;
+
PendingIOStats.counts[io_object][io_context][io_op] += cnt;

have_iostats = true;
}

+/*
+ * Prepares io_time for pgstat_count_io_op_time() function. It needs to return
+ * current time if there is a chance that any 'time' can be tracked.
+ */
instr_time
pgstat_prepare_io_time(void)
{
instr_time	io_start;

-	if (track_io_timing)
+	if(track_io_timing || track_wal_io_timing)
INSTR_TIME_SET_CURRENT(io_start);
else
INSTR_TIME_SET_ZERO(io_start);

Since you asked me off-list why we had to do INSTR_TIME_SET_ZERO() and I
couldn't remember, it is probably worth a comment.

pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
instr_time start_time, uint32 cnt)
{
-	if (track_io_timing)
+	if (pgstat_should_track_io_time(io_object, io_context))
{
instr_time	io_time;
@@ -124,6 +148,9 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));

Now that we are adding more if statements to this function, I think we
should start adding more comments.

We should explain what the different counters here are for e.g.
pgBufferUsage for EXPLAIN, PendingWalStats for pg_stat_wal.

We should also explain what is tracked for each and why it differs --
e.g. some track time and some don't, some track only reads or writes,
etc.

Also we should mention why we are consolidating them here. That is, we
want to eventually deduplicate these counters, so we are consolidating
them first. This also makes it easy to compare what is tracked for which
stats or instrumentation purpose.

And for those IO counters that we haven't moved here, we should mention
it is because they track at a different level of granularity or at a
different point in the call stack.

if (io_object == IOOBJECT_RELATION)
INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);
+			/* Track IOOBJECT_WAL/IOCONTEXT_NORMAL times on PendingWalStats */
+			else if (io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL)
+				INSTR_TIME_ADD(PendingWalStats.wal_write_time, io_time);
}

Also, I would reorder the if statements to be in order of the enum
values (e.g. FSYNC, READ, WRITE).

else if (io_op == IOOP_READ)
{
@@ -131,6 +158,12 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
if (io_object == IOOBJECT_RELATION)
INSTR_TIME_ADD(pgBufferUsage.blk_read_time, io_time);
}
+		else if (io_op == IOOP_FSYNC)
+		{
+			/* Track IOOBJECT_WAL/IOCONTEXT_NORMAL times on PendingWalStats */

I wouldn't squeeze this comment here like this. It is hard to read

+			if (io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL)
+				INSTR_TIME_ADD(PendingWalStats.wal_sync_time, io_time);

+ * op_bytes can change according to IOObject and IOContext.
+ * Return BLCKSZ as default.
+ */
+int
+pgstat_get_io_op_btyes(IOObject io_object, IOContext io_context)
+{

Small typo in function name:
pgstat_get_io_op_btyes -> pgstat_get_io_op_bytes
I'd also mention why BLCKSZ is the default

+	if (io_object == IOOBJECT_WAL)
+	{
+		if (io_context == IOCONTEXT_NORMAL)
+			return XLOG_BLCKSZ;
+		else if (io_context == IOCONTEXT_INIT)
+			return wal_segment_size;
+	}
+
+	return BLCKSZ;
+}

@@ -350,6 +405,15 @@ pgstat_tracks_io_object(BackendType bktype, IOObject io_object,
if (!pgstat_tracks_io_bktype(bktype))
return false;
+	/*
+	 * Currently, IO on IOOBJECT_WAL IOObject can only occur in the
+	 * IOCONTEXT_NORMAL and IOCONTEXT_INIT IOContext.
+	 */
+	if (io_object == IOOBJECT_WAL &&
+		(io_context != IOCONTEXT_NORMAL &&

Little bit of errant whitespace here.

/*
* Currently, IO on temporary relations can only occur in the
* IOCONTEXT_NORMAL IOContext.
@@ -439,6 +503,14 @@ pgstat_tracks_io_op(BackendType bktype, IOObject io_object,
if (io_context == IOCONTEXT_BULKREAD && io_op == IOOP_EXTEND)
return false;

I would expand on the comment to explain what NORMAL is for WAL -- what
we consider normal to be and why. And why it is different than INIT.

+	if(io_object == IOOBJECT_WAL && io_context == IOCONTEXT_INIT &&
+	   !(io_op == IOOP_WRITE || io_op == IOOP_FSYNC))
+	   return false;
+
+	if(io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL &&
+	   !(io_op == IOOP_WRITE || io_op == IOOP_READ || io_op == IOOP_FSYNC))
+	   return false;

These are the first "bans" that we have for an IOOp for a specific
combination of io_context and io_object. We should add a new comment for
this and perhaps consider what ordering makes most sense. I tried to
organize the bans from most broad to most specific at the bottom.

--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1409,7 +1410,8 @@ pg_stat_get_io(PG_FUNCTION_ARGS)
* and constant multipliers, once non-block-oriented IO (e.g.
* temporary file IO) is tracked.
*/
-				values[IO_COL_CONVERSION] = Int64GetDatum(BLCKSZ);

There's a comment above this in the code that says this is hard-coded to
BLCKSZ. That comment needs to be updated or removed (in lieu of the
comment in your pgstat_get_io_op_bytes() function).

+				op_bytes = pgstat_get_io_op_btyes(io_obj, io_context);
+				values[IO_COL_CONVERSION] = Int64GetDatum(op_bytes);

+extern PGDLLIMPORT bool track_wal_io_timing;
+extern PGDLLIMPORT int wal_segment_size;

These shouldn't be in two places (i.e. they are already in xlog.h and
you added them in pgstat.h). pgstat_io.c includes bufmgr.h for
track_io_timing, so you can probably justify including xlog.h.

- Melanie

Nazir Bilal Yavuz

byavuz81@gmail.com

over 2 years ago

In reply to: Melanie Plageman (#4)

2 attachment(s)

Re: Show WAL write and fsync stats in pg_stat_io

Hi,

Thanks for the review!

Current status of the patch is:
- IOOBJECT_WAL / IOCONTEXT_NORMAL read, write and fsync stats are added.
- IOOBJECT_WAL / IOCONTEXT_NORMAL write and fsync tests are added.
- IOOBJECT_WAL / IOCONTEXT_INIT stats are added.
- pg_stat_io shows different op_bytes for the IOOBJECT_WAL operations.
- Working on which 'BackendType / IOContext / IOOp' should be banned in
pg_stat_io.
- PendingWalStats.wal_sync and PendingWalStats.wal_write_time /
PendingWalStats.wal_sync_time are moved to pgstat_count_io_op_n() /
pgstat_count_io_op_time() respectively.

TODOs:
- Documentation.
- Try to set op_bytes for BackendType / IOContext.
- Decide which 'BackendType / IOContext / IOOp' should not be tracked.
- Add IOOBJECT_WAL / IOCONTEXT_NORMAL read tests.
- Add IOOBJECT_WAL / IOCONTEXT_INIT tests.

I am adding the tracking status of BackendType / IOContext / IOOp as
tables; an empty cell means it is not decided yet:

IOCONTEXT_NORMAL / Backend / IOOp table:

╔═════════════════════╦═══════╦═══════╦═══════╗
║ IOCONTEXT_NORMAL    ║ read  ║ write ║ fsync ║
╠═════════════════════╬═══════╬═══════╬═══════╣
║ autovacuum launcher ║ FALSE ║ FALSE ║ FALSE ║
╠═════════════════════╬═══════╬═══════╬═══════╣
║ autovacuum worker   ║ FALSE ║ TRUE  ║ TRUE  ║
╠═════════════════════╬═══════╬═══════╬═══════╣
║ client backend      ║       ║ TRUE  ║ TRUE  ║
╠═════════════════════╬═══════╬═══════╬═══════╣
║ background worker   ║       ║       ║       ║
╠═════════════════════╬═══════╬═══════╬═══════╣
║ background writer   ║       ║ TRUE  ║ TRUE  ║
╠═════════════════════╬═══════╬═══════╬═══════╣
║ checkpointer        ║       ║ TRUE  ║ TRUE  ║
╠═════════════════════╬═══════╬═══════╬═══════╣
║ standalone backend  ║ TRUE  ║ TRUE  ║ TRUE  ║
╠═════════════════════╬═══════╬═══════╬═══════╣
║ startup             ║ TRUE  ║       ║       ║
╠═════════════════════╬═══════╬═══════╬═══════╣
║ walreceiver         ║       ║       ║       ║
╠═════════════════════╬═══════╬═══════╬═══════╣
║ walsender           ║       ║       ║       ║
╠═════════════════════╬═══════╬═══════╬═══════╣
║ walwriter           ║       ║ TRUE  ║ TRUE  ║
╚═════════════════════╩═══════╩═══════╩═══════╝
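
One way such a matrix could be carried into code is a small tri-state lookup table, sketched below for a few rows of the IOCONTEXT_NORMAL table. Everything here is hypothetical (the Sketch* types, the SK_* constants and sketch_tracks()); the patch itself expresses these rules as if statements in pgstat_tracks_io_op(). Treating an undecided cell as "tracked" is an assumption made only for this sketch.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { SK_UNDECIDED, SK_FALSE, SK_TRUE } SketchTracked;
typedef enum { SK_READ, SK_WRITE, SK_FSYNC, SK_NUM_OPS } SketchOp;
typedef enum
{
	SK_AUTOVAC_LAUNCHER,
	SK_AUTOVAC_WORKER,
	SK_CLIENT_BACKEND,
	SK_STANDALONE_BACKEND,
	SK_STARTUP,
	SK_NUM_BACKENDS
} SketchBackend;

/* A subset of the rows above; columns are read, write, fsync. */
static const SketchTracked sketch_normal[SK_NUM_BACKENDS][SK_NUM_OPS] = {
	[SK_AUTOVAC_LAUNCHER] = {SK_FALSE, SK_FALSE, SK_FALSE},
	[SK_AUTOVAC_WORKER] = {SK_FALSE, SK_TRUE, SK_TRUE},
	[SK_CLIENT_BACKEND] = {SK_UNDECIDED, SK_TRUE, SK_TRUE},
	[SK_STANDALONE_BACKEND] = {SK_TRUE, SK_TRUE, SK_TRUE},
	[SK_STARTUP] = {SK_TRUE, SK_UNDECIDED, SK_UNDECIDED},
};

/*
 * Assumption: an undecided cell is allowed for now, so only an explicit
 * FALSE rejects the combination.
 */
static bool
sketch_tracks(SketchBackend backend, SketchOp op)
{
	assert((unsigned int) backend < (unsigned int) SK_NUM_BACKENDS);
	assert((unsigned int) op < (unsigned int) SK_NUM_OPS);

	return sketch_normal[backend][op] != SK_FALSE;
}
```

Keeping the undecided state explicit makes it easy to list the cells that still need review by other community members.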

IOCONTEXT_WAL_INIT / Backend / IOOp table:

╔═════════════════════╦═══════╦═══════╗
║ IOCONTEXT_WAL_INIT  ║ write ║ fsync ║
╠═════════════════════╬═══════╬═══════╣
║ autovacuum launcher ║       ║       ║
╠═════════════════════╬═══════╬═══════╣
║ autovacuum worker   ║       ║       ║
╠═════════════════════╬═══════╬═══════╣
║ client backend      ║ TRUE  ║ TRUE  ║
╠═════════════════════╬═══════╬═══════╣
║ background worker   ║       ║       ║
╠═════════════════════╬═══════╬═══════╣
║ background writer   ║       ║       ║
╠═════════════════════╬═══════╬═══════╣
║ checkpointer        ║       ║       ║
╠═════════════════════╬═══════╬═══════╣
║ standalone backend  ║ TRUE  ║ TRUE  ║
╠═════════════════════╬═══════╬═══════╣
║ startup             ║       ║       ║
╠═════════════════════╬═══════╬═══════╣
║ walreceiver         ║       ║       ║
╠═════════════════════╬═══════╬═══════╣
║ walsender           ║       ║       ║
╠═════════════════════╬═══════╬═══════╣
║ walwriter           ║       ║       ║
╚═════════════════════╩═══════╩═══════╝

On Wed, 9 Aug 2023 at 21:52, Melanie Plageman <melanieplageman@gmail.com>
wrote:

On Sat, 22 Jul 2023 at 01:30, Melanie Plageman
<melanieplageman@gmail.com> wrote:

I think it would be good to count WAL reads even though they are not
currently represented in pg_stat_wal. Here is a thread discussing this
[1].

I used the same implementation in the thread link [1]. I added 'WAL
read' to only xlogrecovery.c for now. I didn't add 'WAL read' to
xlogreader.c and walsender.c because they cause some failures on:
'!pgStatLocal.shmem->is_shutdown' asserts. I will spend more time on
these. Also, I added Bharath to CC. I have a question about 'WAL
read':
1. There are two places where 'WAL read' happens.
a. In WALRead() in xlogreader.c, it reads 'count' bytes, most of the
time count is equal to XLOG_BLCKSZ but there are some cases it is not.
For example
- in XLogSendPhysical() in walsender.c WALRead() is called by nbytes
- in WALDumpReadPage() in pg_waldump.c WALRead() is called by count
These nbytes and count variables could be different from XLOG_BLCKSZ.

b. in XLogPageRead() in xlogreader.c, it reads exactly XLOG_BLCKSZ bytes:
pg_pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);

So, what should op_bytes be set to for 'WAL read' operations?

If there is any combination of BackendType and IOContext which will
always read XLOG_BLCKSZ bytes, we could use XLOG_BLCKSZ for that row's
op_bytes. For other cases, we may have to consider using op_bytes 1 and
tracking reads and write IOOps in number of bytes (instead of number of
pages). I don't actually know if there is a clear separation by
BackendType for these different cases.

I agree. I will edit that later, added to TODOs.

The other alternative I see is to use XLOG_BLCKSZ as the op_bytes and
treat op_bytes * number of reads as an approximation of the number of
bytes read. I don't actually know what makes more sense. I don't think I
would like having a number for bytes that is not accurate.

Yes, the former makes more sense to me.

Should I use IsBootstrapProcessingMode() to enable WAL IO timings at
initdb, or are they not that important?

I don't have an opinion about this. I can see an argument for doing it
either way. We do track other IO during initdb in pg_stat_io.

I didn't add it for now. It is an easy change and could be added later.

As an additional TODO, I would explore adding some tests to prevent
accidental removal of the pg_stat_io WAL tracking.

I think we can easily test IOCONTEXT_NORMAL WAL writes in
src/test/regress/sql/stats.sql (perhaps it is worth checking that
synchronous_commit is on in the test). IOCONTEXT_NORMAL WAL fsyncs
should again be easy to test if synchronous_commit is on and fsync is
on.

I'm not sure how to reliably test WAL reads (given timing). Logically,
you can sum WAL reads before a crash is initiated in one of the tests in
the recovery suite, and then sum them after the db has restarted and
there should definitely be an increase in WAL reads, but I don't know if
we need to do something to guarantee that there will have been WAL reads
(to avoid test flakes).

I'm also not sure how to reliably test any IOCONTEXT_INIT operations. We
need a before and after and I can't think of a cheap operation to ensure
a new WAL segment is written to or fsyncd in between a before and after
for the purposes of testing.

IOOBJECT_WAL / IOCONTEXT_NORMAL write and fsync tests are added.
For the IOCONTEXT_NORMAL reads and IOCONTEXT_INIT tests, I couldn't find a
way to avoid test flakes. I am open to suggestions. I added these to TODOs.

---
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 60c0b7ec3af..ee7b85e18ca 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2245,6 +2229,9 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
startoffset += written;
} while (nleft > 0);

I'm not sure if the right location is here or in
pgstat_count_io_op_time(), but I would explain why you did not move
PendingWalStats.wal_writes counter into pg_stat_io code (and why you did
move the other PendingWalStats counters there).

+			pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL,
+									IOOP_WRITE, io_start, npages);
+
npages = 0;
/*
@@ -2938,6 +2925,7 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
int fd;
int save_errno;
int open_flags = O_RDWR | O_CREAT | O_EXCL | PG_BINARY;
+ instr_time io_start;

Assert(logtli != 0);

@@ -2981,6 +2969,8 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
(errcode_for_file_access(),
errmsg("could not create file \"%s\": %m", tmppath)));

Since you have two calls to pgstat_prepare_io_time() in this function, I
think it would be nice to have a comment above each to the effect of
"start timing writes for stats" and "start timing fsyncs for stats"

Done.

+     io_start = pgstat_prepare_io_time();
+
pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_WRITE);

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index becc2bda62e..ee850af5514 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1587,6 +1587,7 @@ PerformWalRecovery(void)
XLogRecord *record;
bool		reachedRecoveryTarget = false;
TimeLineID	replayTLI;
+	uint32		pgstat_report_wal_frequency = 0;

/*
* Initialize shared variables for tracking progress of WAL replay, as if
@@ -1745,6 +1746,16 @@ PerformWalRecovery(void)
*/
ApplyWalRecord(xlogreader, record, &replayTLI);

+			/*
+			 * Report pending statistics to the cumulative stats system once
+			 * every PGSTAT_REPORT_FREQUENCY times to not hinder performance.
+			 */
+			if (pgstat_report_wal_frequency++ == PGSTAT_REPORT_FREQUENCY)
+			{
+				pgstat_report_wal(false);
+				pgstat_report_wal_frequency = 0;
+			}
+

Is the above needed for your patch to work? What does it do? It should
probably be in a separate commit and should definitely have an
explanation.

Done, I omitted that part.

--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -87,17 +87,25 @@ pgstat_count_io_op_n(IOObject io_object, IOContext io_context, IOOp io_op, uint3
Assert((unsigned int) io_op < IOOP_NUM_TYPES);
Assert(pgstat_tracks_io_op(MyBackendType, io_object, io_context, io_op));

I would add a comment here explaining that pg_stat_wal doesn't count WAL
init or WAL reads.

Done.

+     if(io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL &&
+        io_op == IOOP_FSYNC)
+             PendingWalStats.wal_sync += cnt;
+
PendingIOStats.counts[io_object][io_context][io_op] += cnt;

have_iostats = true;
}

+/*
+ * Prepares io_time for pgstat_count_io_op_time() function. It needs to return
+ * current time if there is a chance that any 'time' can be tracked.
+ */
instr_time
pgstat_prepare_io_time(void)
{
instr_time      io_start;

-     if (track_io_timing)
+     if(track_io_timing || track_wal_io_timing)
INSTR_TIME_SET_CURRENT(io_start);
else
INSTR_TIME_SET_ZERO(io_start);

Since you asked me off-list why we had to do INSTR_TIME_SET_ZERO() and I
couldn't remember, it is probably worth a comment.

Done.

pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
instr_time start_time, uint32 cnt)
{
-	if (track_io_timing)
+	if (pgstat_should_track_io_time(io_object, io_context))
{
instr_time	io_time;
@@ -124,6 +148,9 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));

Now that we are adding more if statements to this function, I think we
should start adding more comments.

We should explain what the different counters here are for e.g.
pgBufferUsage for EXPLAIN, PendingWalStats for pg_stat_wal.

We should also explain what is tracked for each and why it differs --
e.g. some track time and some don't, some track only reads or writes,
etc.

Also we should mention why we are consolidating them here. That is, we
want to eventually deduplicate these counters, so we are consolidating
them first. This also makes it easy to compare what is tracked for which
stats or instrumentation purpose.

And for those IO counters that we haven't moved here, we should mention
it is because they track at a different level of granularity or at a
different point in the call stack.

Done.

if (io_object == IOOBJECT_RELATION)
INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);
+			/* Track IOOBJECT_WAL/IOCONTEXT_NORMAL times on PendingWalStats */
+			else if (io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL)
+				INSTR_TIME_ADD(PendingWalStats.wal_write_time, io_time);
}

Also, I would reorder the if statements to be in order of the enum
values (e.g. FSYNC, READ, WRITE).

Done.

else if (io_op == IOOP_READ)
{
@@ -131,6 +158,12 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
if (io_object == IOOBJECT_RELATION)
INSTR_TIME_ADD(pgBufferUsage.blk_read_time, io_time);
}
+		else if (io_op == IOOP_FSYNC)
+		{
+			/* Track IOOBJECT_WAL/IOCONTEXT_NORMAL times on PendingWalStats */

I wouldn't squeeze this comment here like this. It is hard to read

Done.

+                     if (io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL)
+                             INSTR_TIME_ADD(PendingWalStats.wal_sync_time, io_time);

+ * op_bytes can change according to IOObject and IOContext.
+ * Return BLCKSZ as default.
+ */
+int
+pgstat_get_io_op_btyes(IOObject io_object, IOContext io_context)
+{
Small typo in function name:
pgstat_get_io_op_btyes -> pgstat_get_io_op_bytes
I'd also mention why BLCKSZ is the default

Done.

+     if (io_object == IOOBJECT_WAL)
+     {
+             if (io_context == IOCONTEXT_NORMAL)
+                     return XLOG_BLCKSZ;
+             else if (io_context == IOCONTEXT_INIT)
+                     return wal_segment_size;
+     }
+
+     return BLCKSZ;
+}
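
The op_bytes selection quoted above can be sketched as a stand-alone program. This is only an illustration: the enum and function names prefixed with `SK_`/`sketch_` are hypothetical stand-ins rather than the real declarations in pgstat.h, and the three size constants assume a default build (8 kB BLCKSZ, 8 kB XLOG_BLCKSZ, 16 MB WAL segments) instead of reading them from a server.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins; the real enums in pgstat.h have more members. */
typedef enum { SK_IOOBJECT_RELATION, SK_IOOBJECT_WAL } SketchIOObject;
typedef enum { SK_IOCONTEXT_NORMAL, SK_IOCONTEXT_INIT } SketchIOContext;

/* Assumed default build values, not taken from a running server. */
#define SK_BLCKSZ           8192
#define SK_XLOG_BLCKSZ      8192
#define SK_WAL_SEGMENT_SIZE (16 * 1024 * 1024)

/* Same decision order as the quoted function: WAL cases first, BLCKSZ otherwise. */
static int
sketch_io_op_bytes(SketchIOObject obj, SketchIOContext ctx)
{
    if (obj == SK_IOOBJECT_WAL)
    {
        if (ctx == SK_IOCONTEXT_NORMAL)
            return SK_XLOG_BLCKSZ;
        if (ctx == SK_IOCONTEXT_INIT)
            return SK_WAL_SEGMENT_SIZE;
    }
    return SK_BLCKSZ;
}

/* Convert an operation count from one row into a byte total. */
static uint64_t
sketch_ops_to_bytes(uint64_t ops, SketchIOObject obj, SketchIOContext ctx)
{
    int op_bytes = sketch_io_op_bytes(obj, ctx);

    assert(op_bytes > 0);
    return ops * (uint64_t) op_bytes;
}
```

Under these assumed sizes, one IOCONTEXT_INIT write on WAL is counted as a whole segment, while a NORMAL write is counted per WAL page, which is why a single conversion factor no longer fits every row.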

@@ -350,6 +405,15 @@ pgstat_tracks_io_object(BackendType bktype, IOObject io_object,
if (!pgstat_tracks_io_bktype(bktype))
return false;

+     /*
+      * Currently, IO on IOOBJECT_WAL IOObject can only occur in the
+      * IOCONTEXT_NORMAL and IOCONTEXT_INIT IOContext.
+      */
+     if (io_object == IOOBJECT_WAL &&
+             (io_context != IOCONTEXT_NORMAL &&

Little bit of errant whitespace here.

Done.

/*
* Currently, IO on temporary relations can only occur in the
* IOCONTEXT_NORMAL IOContext.
@@ -439,6 +503,14 @@ pgstat_tracks_io_op(BackendType bktype, IOObject io_object,
if (io_context == IOCONTEXT_BULKREAD && io_op == IOOP_EXTEND)
return false;

I would expand on the comment to explain what NORMAL is for WAL -- what
we consider normal to be and why. And why it is different than INIT.

Done.

+     if(io_object == IOOBJECT_WAL && io_context == IOCONTEXT_INIT &&
+        !(io_op == IOOP_WRITE || io_op == IOOP_FSYNC))
+        return false;
+
+     if(io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL &&
+        !(io_op == IOOP_WRITE || io_op == IOOP_READ || io_op == IOOP_FSYNC))
+        return false;

These are the first "bans" that we have for an IOOp for a specific
combination of io_context and io_object. We should add a new comment for
this and perhaps consider what ordering makes most sense. I tried to
organize the bans from most broad to most specific at the bottom.

Done.
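
The broad-to-specific ordering of the WAL "bans" discussed above could look roughly like the following stand-alone sketch. The `Sk*` types and `sk_tracks_wal_op()` are hypothetical names used only for illustration; the real checks live in pgstat_tracks_io_object() and pgstat_tracks_io_op() and cover many more combinations.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, reduced stand-ins for the enums in pgstat.h. */
typedef enum { SK_OBJ_RELATION, SK_OBJ_WAL } SkObject;
typedef enum { SK_CTX_NORMAL, SK_CTX_INIT, SK_CTX_VACUUM } SkContext;
typedef enum { SK_OP_EXTEND, SK_OP_FSYNC, SK_OP_READ, SK_OP_WRITE } SkOp;

static bool
sk_tracks_wal_op(SkObject obj, SkContext ctx, SkOp op)
{
    /* This sketch models only the WAL rules; everything else passes. */
    if (obj != SK_OBJ_WAL)
        return true;

    /* Broadest ban: WAL IO is tracked only in the NORMAL and INIT contexts. */
    if (ctx != SK_CTX_NORMAL && ctx != SK_CTX_INIT)
        return false;

    /* INIT covers creating a new segment: writes and fsyncs only. */
    if (ctx == SK_CTX_INIT)
        return op == SK_OP_WRITE || op == SK_OP_FSYNC;

    /* NORMAL covers already-created segments: read, write and fsync. */
    return op == SK_OP_READ || op == SK_OP_WRITE || op == SK_OP_FSYNC;
}

/* A few combinations, written out as checks. */
static void
sk_check_examples(void)
{
    assert(sk_tracks_wal_op(SK_OBJ_WAL, SK_CTX_INIT, SK_OP_WRITE));
    assert(!sk_tracks_wal_op(SK_OBJ_WAL, SK_CTX_INIT, SK_OP_READ));
    assert(!sk_tracks_wal_op(SK_OBJ_WAL, SK_CTX_VACUUM, SK_OP_WRITE));
}
```

Putting the context-level ban first means the per-operation checks below it only ever see a context that is valid for WAL.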

--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1409,7 +1410,8 @@ pg_stat_get_io(PG_FUNCTION_ARGS)
* and constant multipliers, once non-block-oriented IO (e.g.
* temporary file IO) is tracked.
*/
- values[IO_COL_CONVERSION] = Int64GetDatum(BLCKSZ);

There's a comment above this in the code that says this is hard-coded to
BLCKSZ. That comment needs to be updated or removed (in lieu of the
comment in your pgstat_get_io_op_bytes() function).

Done.

+ op_bytes = pgstat_get_io_op_btyes(io_obj, io_context);
+ values[IO_COL_CONVERSION] = Int64GetDatum(op_bytes);

+extern PGDLLIMPORT bool track_wal_io_timing;
+extern PGDLLIMPORT int wal_segment_size;
These shouldn't be in two places (i.e. they are already in xlog.h and
you added them in pgstat.h). pgstat_io.c includes bufmgr.h for
track_io_timing, so you can probably justify including xlog.h.

Done.

Any kind of feedback would be appreciated.

Regards,
Nazir Bilal Yavuz
Microsoft

Attachments:

v3-0001-Show-WAL-stats-except-streaming-replication-WAL-o.patch (text/x-diff; charset=US-ASCII)

From fbbd64e8bf53c0ab149bac7970df9b4c1ac92589 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Tue, 19 Sep 2023 17:56:56 +0300
Subject: [PATCH v3 1/2] Show WAL stats (except streaming replication WAL) on
 pg_stat_io

This patch aims to show WAL stats per backend in the pg_stat_io view.

With this patch, the pg_stat_io view shows how many WAL operations each
backend performs, along with their context, type and total timings.

This patchset currently covers:
- IOOBJECT_WAL / IOCONTEXT_NORMAL read, write and fsync.
- IOOBJECT_WAL / IOCONTEXT_INIT write and fsync.

It doesn't cover:
- Streaming replication WAL IO.
---
 src/backend/access/transam/xlog.c         |  60 ++++------
 src/backend/access/transam/xlogrecovery.c |   9 ++
 src/backend/utils/activity/pgstat_io.c    | 137 ++++++++++++++++++++--
 src/backend/utils/adt/pgstatfuncs.c       |  10 +-
 src/include/pgstat.h                      |   8 +-
 5 files changed, 172 insertions(+), 52 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index fcbde10529b..5c0da34bc45 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2200,38 +2200,22 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
 			Size		nbytes;
 			Size		nleft;
 			int			written;
-			instr_time	start;
+			instr_time	io_start;
 
 			/* OK to write the page(s) */
 			from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
 			nbytes = npages * (Size) XLOG_BLCKSZ;
 			nleft = nbytes;
+
+			io_start = pgstat_prepare_io_time();
 			do
 			{
 				errno = 0;
 
-				/* Measure I/O timing to write WAL data */
-				if (track_wal_io_timing)
-					INSTR_TIME_SET_CURRENT(start);
-				else
-					INSTR_TIME_SET_ZERO(start);
-
 				pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
 				written = pg_pwrite(openLogFile, from, nleft, startoffset);
 				pgstat_report_wait_end();
 
-				/*
-				 * Increment the I/O timing and the number of times WAL data
-				 * were written out to disk.
-				 */
-				if (track_wal_io_timing)
-				{
-					instr_time	end;
-
-					INSTR_TIME_SET_CURRENT(end);
-					INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_write_time, end, start);
-				}
-
 				PendingWalStats.wal_write++;
 
 				if (written <= 0)
@@ -2256,6 +2240,9 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
 				startoffset += written;
 			} while (nleft > 0);
 
+			pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL,
+			                        IOOP_WRITE, io_start, npages);
+
 			npages = 0;
 
 			/*
@@ -2949,6 +2936,7 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	int			fd;
 	int			save_errno;
 	int			open_flags = O_RDWR | O_CREAT | O_EXCL | PG_BINARY;
+	instr_time	io_start;
 
 	Assert(logtli != 0);
 
@@ -2992,6 +2980,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 				(errcode_for_file_access(),
 				 errmsg("could not create file \"%s\": %m", tmppath)));
 
+	/* start timing writes for stats */
+	io_start = pgstat_prepare_io_time();
+
 	pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_WRITE);
 	save_errno = 0;
 	if (wal_init_zero)
@@ -3027,6 +3018,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	}
 	pgstat_report_wait_end();
 
+	pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_INIT, IOOP_WRITE,
+	                        io_start, 1);
+
 	if (save_errno)
 	{
 		/*
@@ -3043,6 +3037,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 				 errmsg("could not write to file \"%s\": %m", tmppath)));
 	}
 
+	/* start timing fsyncs for stats */
+	io_start = pgstat_prepare_io_time();
+
 	pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_SYNC);
 	if (pg_fsync(fd) != 0)
 	{
@@ -3055,6 +3052,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	}
 	pgstat_report_wait_end();
 
+	pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_INIT,
+	                        IOOP_FSYNC, io_start, 1);
+
 	if (close(fd) != 0)
 		ereport(ERROR,
 				(errcode_for_file_access(),
@@ -8186,7 +8186,7 @@ void
 issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 {
 	char	   *msg = NULL;
-	instr_time	start;
+	instr_time	io_start;
 
 	Assert(tli != 0);
 
@@ -8199,11 +8199,7 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 		sync_method == SYNC_METHOD_OPEN_DSYNC)
 		return;
 
-	/* Measure I/O timing to sync the WAL file */
-	if (track_wal_io_timing)
-		INSTR_TIME_SET_CURRENT(start);
-	else
-		INSTR_TIME_SET_ZERO(start);
+	io_start = pgstat_prepare_io_time();
 
 	pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC);
 	switch (sync_method)
@@ -8247,18 +8243,8 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 
 	pgstat_report_wait_end();
 
-	/*
-	 * Increment the I/O timing and the number of times WAL files were synced.
-	 */
-	if (track_wal_io_timing)
-	{
-		instr_time	end;
-
-		INSTR_TIME_SET_CURRENT(end);
-		INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_sync_time, end, start);
-	}
-
-	PendingWalStats.wal_sync++;
+	pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_FSYNC,
+	                        io_start, 1);
 }
 
 /*
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index becc2bda62e..ca1041c5fd7 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1745,6 +1745,9 @@ PerformWalRecovery(void)
 			 */
 			ApplyWalRecord(xlogreader, record, &replayTLI);
 
+			/*  Report pending statistics to the cumulative stats system  */
+			pgstat_report_wal(false);
+
 			/* Exit loop if we reached inclusive recovery target */
 			if (recoveryStopsAfter(xlogreader))
 			{
@@ -3220,6 +3223,7 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 	uint32		targetPageOff;
 	XLogSegNo	targetSegNo PG_USED_FOR_ASSERTS_ONLY;
 	int			r;
+	instr_time	io_start;
 
 	XLByteToSeg(targetPagePtr, targetSegNo, wal_segment_size);
 	targetPageOff = XLogSegmentOffset(targetPagePtr, wal_segment_size);
@@ -3312,6 +3316,8 @@ retry:
 	/* Read the requested page */
 	readOff = targetPageOff;
 
+	io_start = pgstat_prepare_io_time();
+
 	pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
 	r = pg_pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);
 	if (r != XLOG_BLCKSZ)
@@ -3340,6 +3346,9 @@ retry:
 	}
 	pgstat_report_wait_end();
 
+	pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_READ,
+	                        io_start, 1);
+
 	Assert(targetSegNo == readSegNo);
 	Assert(targetPageOff == readOff);
 	Assert(reqLen <= readLen);
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index eb7d35d4225..d5345ba2536 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -18,6 +18,7 @@
 
 #include "executor/instrument.h"
 #include "storage/bufmgr.h"
+#include "access/xlog.h"
 #include "utils/pgstat_internal.h"
 
 
@@ -87,24 +88,58 @@ pgstat_count_io_op_n(IOObject io_object, IOContext io_context, IOOp io_op, uint3
 	Assert((unsigned int) io_op < IOOP_NUM_TYPES);
 	Assert(pgstat_tracks_io_op(MyBackendType, io_object, io_context, io_op));
 
+	/*
+	 * Only count PendingWalStats.wal_sync in pg_stat_io because
+	 * PendingWalStats.wal_write could count different than what pg_stat_io
+	 * counts e.g. system calls. Also, PendingWalStats doesn't count WAL init
+	 * or WAL reads. So, they are not included too.
+	 */
+	if (io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL &&
+	    io_op == IOOP_FSYNC)
+		PendingWalStats.wal_sync += cnt;
+
 	PendingIOStats.counts[io_object][io_context][io_op] += cnt;
 
 	have_iostats = true;
 }
 
+/*
+ * Prepares io_time for pgstat_count_io_op_time() function. It needs to return
+ * current time if there is a chance that any 'time' can be tracked.
+ */
 instr_time
 pgstat_prepare_io_time(void)
 {
 	instr_time	io_start;
 
-	if (track_io_timing)
+	if (track_io_timing || track_wal_io_timing)
 		INSTR_TIME_SET_CURRENT(io_start);
 	else
+		/*
+		 * If time won't be tracked, there is no need to set io_start but
+		 * compiler complains about uninitialized use. So, set it to zero.
+		 */
 		INSTR_TIME_SET_ZERO(io_start);
 
 	return io_start;
 }
 
+/*
+ * Decide if the io timing needs be tracked
+ */
+bool
+pgstat_should_track_io_time(IOObject io_object, IOContext io_context)
+{
+	/*
+	 * io times of IOOBJECT_WAL IOObject needs to be tracked when
+	 * 'track_wal_io_timing' is set regardless of 'track_io_timing'.
+	 */
+	if (io_object == IOOBJECT_WAL)
+		return track_wal_io_timing;
+
+	return track_io_timing;
+}
+
 /*
  * Like pgstat_count_io_op_n() except it also accumulates time.
  */
@@ -112,18 +147,40 @@ void
 pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
 						instr_time start_time, uint32 cnt)
 {
-	if (track_io_timing)
+	/*
+	 * Accumulate different type of times here. We want to eventually
+	 * deduplicate these counters, so we are consolidating them first.
+	 * This also makes it easy to compare what is tracked for which
+	 * stats or instrumentation purpose.
+	 *
+	 * Some IO counters didn't moved here because they track at a different
+	 * level of granularity or at a different point in the call stack.
+	 *
+	 * pgstat_count_buffer is for pgstat_database. Since pg_stat_database only
+	 * counts blk_read_time and blk_write_time, it is set for IOOP_READ and
+	 * IOOP_WRITE.
+	 *
+	 * pgBufferUsage is for EXPLAIN. pgBufferUsage has only write and read stats
+	 * for shared/local and temporary blocks. Only shared/local blocks are counted
+	 * here.
+	 *
+	 * PendingWalStats is for counting WAL stats on pg_stat_wal. It
+	 * only counts IOOBJECT_WAL / IOCONTEXT_NORMAL.
+	 *
+	 * At the end of the if case, accumulate time for pg_stat_io.
+	 */
+	if (pgstat_should_track_io_time(io_object, io_context))
 	{
 		instr_time	io_time;
 
 		INSTR_TIME_SET_CURRENT(io_time);
 		INSTR_TIME_SUBTRACT(io_time, start_time);
 
-		if (io_op == IOOP_WRITE)
+		if (io_op == IOOP_FSYNC)
 		{
-			pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
-			if (io_object == IOOBJECT_RELATION)
-				INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);
+			/* Track IOOBJECT_WAL/IOCONTEXT_NORMAL times on PendingWalStats */
+			if (io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL)
+				INSTR_TIME_ADD(PendingWalStats.wal_sync_time, io_time);
 		}
 		else if (io_op == IOOP_READ)
 		{
@@ -131,11 +188,22 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
 			if (io_object == IOOBJECT_RELATION)
 				INSTR_TIME_ADD(pgBufferUsage.blk_read_time, io_time);
 		}
+		else if (io_op == IOOP_WRITE)
+		{
+			pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
+			if (io_object == IOOBJECT_RELATION)
+				INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);
+
+			/* Track IOOBJECT_WAL/IOCONTEXT_NORMAL times on PendingWalStats */
+			else if (io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL)
+				INSTR_TIME_ADD(PendingWalStats.wal_write_time, io_time);
+		}
 
 		INSTR_TIME_ADD(PendingIOStats.pending_times[io_object][io_context][io_op],
 					   io_time);
 	}
 
+	/* IO OP times are counted, now count IO OPs */
 	pgstat_count_io_op_n(io_object, io_context, io_op, cnt);
 }
 
@@ -216,12 +284,33 @@ pgstat_get_io_context_name(IOContext io_context)
 			return "normal";
 		case IOCONTEXT_VACUUM:
 			return "vacuum";
+		case IOCONTEXT_INIT:
+			return "init";
 	}
 
 	elog(ERROR, "unrecognized IOContext value: %d", io_context);
 	pg_unreachable();
 }
 
+/*
+ * op_bytes can change according to IOObject and IOContext.
+ * Return BLCKSZ as default because most of the
+ * IOObject / IOContext uses BLCKSZ.
+ */
+int
+pgstat_get_io_op_bytes(IOObject io_object, IOContext io_context)
+{
+	if (io_object == IOOBJECT_WAL)
+	{
+		if (io_context == IOCONTEXT_NORMAL)
+			return XLOG_BLCKSZ;
+		else if (io_context == IOCONTEXT_INIT)
+			return wal_segment_size;
+	}
+
+	return BLCKSZ;
+}
+
 const char *
 pgstat_get_io_object_name(IOObject io_object)
 {
@@ -231,6 +320,8 @@ pgstat_get_io_object_name(IOObject io_object)
 			return "relation";
 		case IOOBJECT_TEMP_RELATION:
 			return "temp relation";
+		case IOOBJECT_WAL:
+			return "wal";
 	}
 
 	elog(ERROR, "unrecognized IOObject value: %d", io_object);
@@ -312,10 +403,10 @@ pgstat_tracks_io_bktype(BackendType bktype)
 		case B_INVALID:
 		case B_ARCHIVER:
 		case B_LOGGER:
-		case B_WAL_RECEIVER:
-		case B_WAL_WRITER:
 			return false;
 
+		case B_WAL_RECEIVER:
+		case B_WAL_WRITER:
 		case B_AUTOVAC_LAUNCHER:
 		case B_AUTOVAC_WORKER:
 		case B_BACKEND:
@@ -350,6 +441,15 @@ pgstat_tracks_io_object(BackendType bktype, IOObject io_object,
 	if (!pgstat_tracks_io_bktype(bktype))
 		return false;
 
+	/*
+	 * Currently, IO on IOOBJECT_WAL IOObject can only occur in the
+	 * IOCONTEXT_NORMAL and IOCONTEXT_INIT IOContext.
+	 */
+	if (io_object == IOOBJECT_WAL &&
+	    (io_context != IOCONTEXT_NORMAL &&
+	     io_context != IOCONTEXT_INIT))
+		return false;
+
 	/*
 	 * Currently, IO on temporary relations can only occur in the
 	 * IOCONTEXT_NORMAL IOContext.
@@ -448,6 +548,27 @@ pgstat_tracks_io_op(BackendType bktype, IOObject io_object,
 	if (!strategy_io_context && io_op == IOOP_REUSE)
 		return false;
 
+	/*
+	 * Some IOOps are not valid in certain IOContexts & IOObjects and some
+	 * IOOps are only valid in certain IOContexts & IOObjects.
+	 */
+
+	/*
+	 * In IOOBJECT_WAL io_object, IOCONTEXT_INIT io_context means
+	 * operations done while creating new WAL segments.
+	 */
+	if(io_object == IOOBJECT_WAL && io_context == IOCONTEXT_INIT &&
+	   !(io_op == IOOP_WRITE || io_op == IOOP_FSYNC))
+	   return false;
+
+	/*
+	 * In IOOBJECT_WAL io_object, IOCONTEXT_NORMAL io_context means
+	 * operations done on already created WAL segments.
+	 */
+	if(io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL &&
+	   !(io_op == IOOP_WRITE || io_op == IOOP_READ || io_op == IOOP_FSYNC))
+	   return false;
+
 	/*
 	 * IOOP_FSYNC IOOps done by a backend using a BufferAccessStrategy are
 	 * counted in the IOCONTEXT_NORMAL IOContext. See comment in
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 3b44af80066..18bff382363 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1383,6 +1383,7 @@ pg_stat_get_io(PG_FUNCTION_ARGS)
 			for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
 			{
 				const char *context_name = pgstat_get_io_context_name(io_context);
+				int			op_bytes;
 
 				Datum		values[IO_NUM_COLUMNS] = {0};
 				bool		nulls[IO_NUM_COLUMNS] = {0};
@@ -1401,12 +1402,11 @@ pg_stat_get_io(PG_FUNCTION_ARGS)
 				values[IO_COL_RESET_TIME] = TimestampTzGetDatum(reset_time);
 
 				/*
-				 * Hard-code this to the value of BLCKSZ for now. Future
-				 * values could include XLOG_BLCKSZ, once WAL IO is tracked,
-				 * and constant multipliers, once non-block-oriented IO (e.g.
-				 * temporary file IO) is tracked.
+				 * op_bytes can change according to IOObject and IOContext.
+				 * Get the correct op_bytes.
 				 */
-				values[IO_COL_CONVERSION] = Int64GetDatum(BLCKSZ);
+				op_bytes = pgstat_get_io_op_bytes(io_obj, io_context);
+				values[IO_COL_CONVERSION] = Int64GetDatum(op_bytes);
 
 				for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
 				{
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 57a2c0866a2..21b6b8ca31c 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -277,9 +277,10 @@ typedef enum IOObject
 {
 	IOOBJECT_RELATION,
 	IOOBJECT_TEMP_RELATION,
+	IOOBJECT_WAL,
 } IOObject;
 
-#define IOOBJECT_NUM_TYPES (IOOBJECT_TEMP_RELATION + 1)
+#define IOOBJECT_NUM_TYPES (IOOBJECT_WAL + 1)
 
 typedef enum IOContext
 {
@@ -287,9 +288,10 @@ typedef enum IOContext
 	IOCONTEXT_BULKWRITE,
 	IOCONTEXT_NORMAL,
 	IOCONTEXT_VACUUM,
+	IOCONTEXT_INIT,
 } IOContext;
 
-#define IOCONTEXT_NUM_TYPES (IOCONTEXT_VACUUM + 1)
+#define IOCONTEXT_NUM_TYPES (IOCONTEXT_INIT + 1)
 
 typedef enum IOOp
 {
@@ -521,10 +523,12 @@ extern bool pgstat_bktype_io_stats_valid(PgStat_BktypeIO *backend_io,
 extern void pgstat_count_io_op(IOObject io_object, IOContext io_context, IOOp io_op);
 extern void pgstat_count_io_op_n(IOObject io_object, IOContext io_context, IOOp io_op, uint32 cnt);
 extern instr_time pgstat_prepare_io_time(void);
+extern bool pgstat_should_track_io_time(IOObject io_object, IOContext io_context);
 extern void pgstat_count_io_op_time(IOObject io_object, IOContext io_context,
 									IOOp io_op, instr_time start_time, uint32 cnt);
 
 extern PgStat_IO *pgstat_fetch_stat_io(void);
+extern int pgstat_get_io_op_bytes(IOObject io_object, IOContext io_context);
 extern const char *pgstat_get_io_context_name(IOContext io_context);
 extern const char *pgstat_get_io_object_name(IOObject io_object);
 
-- 
2.40.1

v3-0002-Add-IOOBJECT_WAL-IOCONTEXT_NORMAL-write-and-fsync.patch (text/x-diff; charset=US-ASCII)

From 8a06deb704b51010f36a448edf37dfdc1c829d43 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Thu, 14 Sep 2023 12:44:29 +0300
Subject: [PATCH v3 2/2] Add IOOBJECT_WAL / IOCONTEXT_NORMAL write and fsync
 tests

---
 src/test/regress/expected/stats.out | 25 +++++++++++++++++++++++++
 src/test/regress/sql/stats.sql      | 10 ++++++++++
 2 files changed, 35 insertions(+)

diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index 94187e59cfb..98f94b189ea 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -1246,6 +1246,9 @@ SELECT sum(extends) AS io_sum_shared_before_extends
 SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
   FROM pg_stat_io
   WHERE object = 'relation' \gset io_sum_shared_before_
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'normal' and object = 'wal' \gset io_sum_wal_normal_before_
 CREATE TABLE test_io_shared(a int);
 INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
 SELECT pg_stat_force_next_flush();
@@ -1283,6 +1286,28 @@ SELECT current_setting('fsync') = 'off'
  t
 (1 row)
 
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'normal' and object = 'wal' \gset io_sum_wal_normal_after_
+SELECT current_setting('synchronous_commit') = 'on';
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT :io_sum_wal_normal_after_writes > :io_sum_wal_normal_before_writes;
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT current_setting('fsync') = 'off'
+ OR :io_sum_wal_normal_after_fsyncs > :io_sum_wal_normal_before_fsyncs;
+ ?column? 
+----------
+ t
+(1 row)
+
 -- Change the tablespace so that the table is rewritten directly, then SELECT
 -- from it to cause it to be read back into shared buffers.
 SELECT sum(reads) AS io_sum_shared_before_reads
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index 1e21e55c6d9..c91ec43f039 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -603,6 +603,9 @@ SELECT sum(extends) AS io_sum_shared_before_extends
 SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
   FROM pg_stat_io
   WHERE object = 'relation' \gset io_sum_shared_before_
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'normal' and object = 'wal' \gset io_sum_wal_normal_before_
 CREATE TABLE test_io_shared(a int);
 INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
 SELECT pg_stat_force_next_flush();
@@ -621,6 +624,13 @@ SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
 SELECT :io_sum_shared_after_writes > :io_sum_shared_before_writes;
 SELECT current_setting('fsync') = 'off'
   OR :io_sum_shared_after_fsyncs > :io_sum_shared_before_fsyncs;
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'normal' and object = 'wal' \gset io_sum_wal_normal_after_
+SELECT current_setting('synchronous_commit') = 'on';
+SELECT :io_sum_wal_normal_after_writes > :io_sum_wal_normal_before_writes;
+SELECT current_setting('fsync') = 'off'
+ OR :io_sum_wal_normal_after_fsyncs > :io_sum_wal_normal_before_fsyncs;
 
 -- Change the tablespace so that the table is rewritten directly, then SELECT
 -- from it to cause it to be read back into shared buffers.
-- 
2.40.1

Michael Paquier

michael@paquier.xyz

about 2 years ago

In reply to: Nazir Bilal Yavuz (#5)

Re: Show WAL write and fsync stats in pg_stat_io

On Wed, Sep 20, 2023 at 10:57:48AM +0300, Nazir Bilal Yavuz wrote:

Any kind of feedback would be appreciated.

This was registered in the CF, so I have given it a look. Note that
0001 has a conflict with pgstat_count_io_op_time(), so it cannot be
applied.

+pgstat_should_track_io_time(IOObject io_object, IOContext io_context)
+{
+	/*
+	 * io times of IOOBJECT_WAL IOObject needs to be tracked when
+	 * 'track_wal_io_timing' is set regardless of 'track_io_timing'.
+	 */
+	if (io_object == IOOBJECT_WAL)
+		return track_wal_io_timing;
+
+	return track_io_timing;

I can see the temptation to do that, but I have mixed feelings about
the approach of mixing two GUCs in a code path dedicated to pg_stat_io
where now we only rely on track_io_timing. The result brings
confusion, while making pg_stat_io, which is itself only used for
block-based operations, harder to read.
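
To make the interaction of the two GUCs concrete, here is a minimal stand-alone sketch of the rule in the quoted hunk, together with the start-of-IO check that the v3 patch puts in pgstat_prepare_io_time(). The struct and the `sketch_`/`GUC_` names are hypothetical; in the server these are plain global booleans.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the IOObject values and the two GUCs. */
typedef enum { GUC_OBJ_RELATION, GUC_OBJ_WAL } GucSketchObject;

typedef struct
{
    bool track_io_timing;
    bool track_wal_io_timing;
} GucSketchSettings;

/* Rule from the quoted hunk: WAL follows its own GUC, the rest follow track_io_timing. */
static bool
sketch_should_track_io_time(const GucSketchSettings *s, GucSketchObject obj)
{
    if (obj == GUC_OBJ_WAL)
        return s->track_wal_io_timing;
    return s->track_io_timing;
}

/* The start timestamp is taken whenever either GUC is on, before the object is known. */
static bool
sketch_takes_start_timestamp(const GucSketchSettings *s)
{
    return s->track_io_timing || s->track_wal_io_timing;
}

/* One mixed setting: only WAL timing enabled. */
static void
sketch_check_wal_only(void)
{
    GucSketchSettings wal_only = { false, true };

    assert(sketch_should_track_io_time(&wal_only, GUC_OBJ_WAL));
    assert(!sketch_should_track_io_time(&wal_only, GUC_OBJ_RELATION));
    /* Relation IO still takes a start timestamp that is later discarded. */
    assert(sketch_takes_start_timestamp(&wal_only));
}
```

The mixed case is where the two-GUC design shows: with only track_wal_io_timing enabled, relation IO is not timed in the view, yet the start-of-IO clock read still happens.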

The suggestion I am seeing here to have a pg_stat_io_wal (with a SRF)
is quite tempting, actually, creating a neat separation between the
existing pg_stat_io and pg_stat_wal (not a SRF), with a third view
that provides more details about the contexts and backend types for
the WAL stats with its relevant fields:
/messages/by-id/CAAKRu_bM55pj3pPRW0nd_-paWHLRkOU69r816AeztBBa-N1HLA@mail.gmail.com

And perhaps just putting that everything that calls
pgstat_count_io_op_time() under track_io_timing is just natural?
What's the performance regression you would expect if both WAL and
block I/O are controlled by that, still one would expect only one of
them?

On top of that pg_stat_io is now for block-based I/O operations, so
that does not fit entirely in the picture, though I guess that Melanie
has thought more on the matter than me. That may be also a matter of
taste.

+      /*  Report pending statistics to the cumulative stats system */
+      pgstat_report_wal(false);

This is hidden in 0001, still would be better if handled as a patch on
its own and optionally backpatch it as we did for the bgwriter with
e64c733bb1?

Side note: I think that we should spend more effort documenting
what IOContext and IOOp mean. Not something directly related to this
patch, still this patch or things similar make it a bit harder to
understand which part of it is used for what by reading pgstat.h.
--
Michael

Nazir Bilal Yavuz

byavuz81@gmail.com

about 2 years ago

In reply to: Michael Paquier (#6)

Re: Show WAL write and fsync stats in pg_stat_io

Hi,

Thank you for the feedback!

On Thu, 26 Oct 2023 at 09:28, Michael Paquier <michael@paquier.xyz> wrote:

On Wed, Sep 20, 2023 at 10:57:48AM +0300, Nazir Bilal Yavuz wrote:

Any kind of feedback would be appreciated.

This was registered in the CF, so I have given it a look. Note that
0001 has a conflict with pgstat_count_io_op_time(), so it cannot be
applied.
+pgstat_should_track_io_time(IOObject io_object, IOContext io_context)
+{
+       /*
+        * io times of IOOBJECT_WAL IOObject needs to be tracked when
+        * 'track_wal_io_timing' is set regardless of 'track_io_timing'.
+        */
+       if (io_object == IOOBJECT_WAL)
+               return track_wal_io_timing;
+
+       return track_io_timing;
I can see the temptation to do that, but I have mixed feelings about
the approach of mixing two GUCs in a code path dedicated to pg_stat_io
where now we only rely on track_io_timing. The result brings
confusion, while making pg_stat_io, which is itself only used for
block-based operations, harder to read.

The suggestion I am seeing here to have a pg_stat_io_wal (with a SRF)
is quite tempting, actually, creating a neat separation between the
existing pg_stat_io and pg_stat_wal (not a SRF), with a third view
that provides more details about the contexts and backend types for
the WAL stats with its relevant fields:
/messages/by-id/CAAKRu_bM55pj3pPRW0nd_-paWHLRkOU69r816AeztBBa-N1HLA@mail.gmail.com

And perhaps just putting that everything that calls
pgstat_count_io_op_time() under track_io_timing is just natural?
What's the performance regression you would expect if both WAL and
block I/O are controlled by that, still one would expect only one of
them?

I will check these and I hope I will come back with something meaningful.

+      /*  Report pending statistics to the cumulative stats system */
+      pgstat_report_wal(false);
This is hidden in 0001, still would be better if handled as a patch on
its own and optionally backpatch it as we did for the bgwriter with
e64c733bb1?

I thought about it again and found the use of
'pgstat_report_wal(false);' here wrong. This was mainly for flushing
WAL stats because of the WAL reads but pg_stat_wal doesn't have WAL
read stats, so there is no need to flush WAL stats here. I think this
should be replaced with 'pgstat_flush_io(false);'.

Side note: I think that we should spend more effort documenting
what IOContext and IOOp mean. Not something directly related to this
patch, still this patch or things similar make it a bit harder to
understand which part of it is used for what by reading pgstat.h.

I agree.

Regards,
Nazir Bilal Yavuz
Microsoft

Nazir Bilal Yavuz

byavuz81@gmail.com

about 2 years ago

In reply to: Nazir Bilal Yavuz (#7)

Re: Show WAL write and fsync stats in pg_stat_io

Hi,

On Tue, 31 Oct 2023 at 16:57, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:

On Thu, 26 Oct 2023 at 09:28, Michael Paquier <michael@paquier.xyz> wrote:

And perhaps just putting that everything that calls
pgstat_count_io_op_time() under track_io_timing is just natural?
What's the performance regression you would expect if both WAL and
block I/O are controlled by that, still one would expect only one of
them?

I will check these and I hope I will come back with something meaningful.

I applied the patches on upstream postgres and then ran pgbench several
times for each available clock source:
# Set fsync = off and track_io_timing = on
# pgbench -i -s 100 test
# pgbench -M prepared -c16 -j8 -f <( echo "SELECT pg_logical_emit_message(true, \:client_id::text, '1234567890');") -T60 test

Results are:

╔══════════════╦═══════════════════════════════╦════════╗
║              ║      track_wal_io_timing      ║        ║
╠══════════════╬═══════════════╦═══════════════╬════════╣
║ clock source ║      on       ║      off      ║ change ║
╠══════════════╬═══════════════╬═══════════════╬════════╣
║ tsc          ║ 514814.459170 ║ 519826.284139 ║   1%   ║
╠══════════════╬═══════════════╬═══════════════╬════════╣
║ hpet         ║ 132116.272121 ║ 141820.548447 ║   7%   ║
╠══════════════╬═══════════════╬═══════════════╬════════╣
║ acpi_pm      ║ 394793.092255 ║ 403723.874719 ║   2%   ║
╚══════════════╩═══════════════╩═══════════════╩════════╝
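
For reference, the "change" column appears to be the relative drop of the "on" figure against the "off" figure, rounded to a whole percent. A small sketch of that arithmetic follows; `relative_drop_percent()` is a hypothetical helper, and treating "off" as the baseline is an assumption.

```c
#include <assert.h>

/*
 * Relative drop, in percent, of the figure measured with
 * track_wal_io_timing = on against the figure measured with it off.
 */
static double
relative_drop_percent(double with_timing, double without_timing)
{
    assert(without_timing > 0.0);
    return (without_timing - with_timing) / without_timing * 100.0;
}
```

With the figures from the table, this gives a little under 1% for tsc, just under 7% for hpet and a little over 2% for acpi_pm, which matches the rounded values reported.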

Regards,
Nazir Bilal Yavuz
Microsoft

Michael Paquier

michael@paquier.xyz

about 2 years ago

In reply to: Nazir Bilal Yavuz (#8)

Re: Show WAL write and fsync stats in pg_stat_io

On Mon, Nov 06, 2023 at 03:35:01PM +0300, Nazir Bilal Yavuz wrote:

Results are:

╔══════════════╦═══════════════════════════════╦════════╗
║              ║      track_wal_io_timing      ║        ║
╠══════════════╬═══════════════╦═══════════════╬════════╣
║ clock source ║      on       ║      off      ║ change ║
╠══════════════╬═══════════════╬═══════════════╬════════╣
║ tsc          ║ 514814.459170 ║ 519826.284139 ║   1%   ║
╠══════════════╬═══════════════╬═══════════════╬════════╣
║ hpet         ║ 132116.272121 ║ 141820.548447 ║   7%   ║
╠══════════════╬═══════════════╬═══════════════╬════════╣
║ acpi_pm      ║ 394793.092255 ║ 403723.874719 ║   2%   ║
╚══════════════╩═══════════════╩═══════════════╩════════╝

Thanks for the tests. That's indeed noticeable under this load.
Better to keep track_io_timing and track_wal_io_timing as two
separated beasts, at least that's clear.
--
Michael

#10

Andres Freund

andres@anarazel.de

about 2 years ago

In reply to: Michael Paquier (#6)

Re: Show WAL write and fsync stats in pg_stat_io

Hi,

On 2023-10-26 15:28:32 +0900, Michael Paquier wrote:

On top of that pg_stat_io is now for block-based I/O operations, so
that does not fit entirely in the picture, though I guess that Melanie
has thought more on the matter than me. That may be also a matter of
taste.

I strongly disagree. A significant part of the design of pg_stat_io was to
make it possible to collect multiple sources of IO in a single view, so that
sysadmins don't have to look in dozens of places to figure out what is causing
what kind of IO.

We should over time collect all sources of IO in pg_stat_io. For some things
we might want to also have more detailed information in other views (e.g. it
doesn't make sense to track FPIs in pg_stat_io, but does make sense in
pg_stat_wal) - but that should be in addition, not instead of.

Greetings,

Andres Freund

#11

Michael Paquier

michael@paquier.xyz

about 2 years ago

In reply to: Andres Freund (#10)

Re: Show WAL write and fsync stats in pg_stat_io

On Tue, Nov 07, 2023 at 03:30:48PM -0800, Andres Freund wrote:

I strongly disagree. A significant part of the design of pg_stat_io was to
make it possible to collect multiple sources of IO in a single view, so that
sysadmins don't have to look in dozens of places to figure out what is causing
what kind of IO.

Okay. Point taken.

We should over time collect all sources of IO in pg_stat_io. For some things
we might want to also have more detailed information in other views (e.g. it
doesn't make sense to track FPIs in pg_stat_io, but does make sense in
pg_stat_wal) - but that should be in addition, not instead of.

Sure. I understand here that you mean the number of FPIs counted when
a record is inserted, different from the path where we decide to write
and/or flush WAL. The proposed patch seems to be a bit inconsistent
regarding wal_sync_time, by the way.

By the way, if the write/sync quantities and times begin to be tracked
by pg_stat_io, I'd see a pretty good argument in removing the
equivalent columns in pg_stat_wal. It looks like this would reduce
the confusion related to the handling of PendingWalStats added in
pgstat_io.c, for one.
--
Michael

#12

Andres Freund

andres@anarazel.de

about 2 years ago

In reply to: Michael Paquier (#11)

Re: Show WAL write and fsync stats in pg_stat_io

Hi,

On 2023-11-08 09:52:16 +0900, Michael Paquier wrote:

By the way, if the write/sync quantities and times begin to be tracked
by pg_stat_io, I'd see a pretty good argument in removing the
equivalent columns in pg_stat_wal. It looks like this would reduce
the confusion related to the handling of PendingWalStats added in
pgstat_io.c, for one.

Another approach would be to fetch the relevant columns from pg_stat_io in the
pg_stat_wal view. That'd avoid double accounting and breaking existing
monitoring.

Greetings,

Andres Freund
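A minimal sketch of the "fetch the relevant columns from pg_stat_io" idea, as a standalone model and not PostgreSQL code. All names prefixed with SK_ or sk_ are hypothetical, and the enums are trimmed down. It assumes the derived WAL timings come only from the normal context, in line with the comment in the v4 patch in this thread that says PendingWalStats only counts IOOBJECT_WAL / IOCONTEXT_NORMAL, and that they are summed over backend types because pg_stat_io keeps one row per backend type.

```c
#include <assert.h>

/* Hypothetical, trimmed-down dimensions; the real enums have more members. */
enum { SK_BK_BACKEND, SK_BK_WAL_WRITER, SK_NUM_BKTYPE };
enum { SK_OBJ_RELATION, SK_OBJ_WAL, SK_NUM_OBJ };
enum { SK_CTX_NORMAL, SK_CTX_INIT, SK_NUM_CTX };
enum { SK_OP_READ, SK_OP_WRITE, SK_OP_FSYNC, SK_NUM_OP };

/* Accumulated I/O time in microseconds, one cell per row and operation. */
typedef struct
{
	long long	time_us[SK_NUM_BKTYPE][SK_NUM_OBJ][SK_NUM_CTX][SK_NUM_OP];
} SkIOStats;

/* Sum one (object, context, op) cell over all backend types. */
long long
sk_io_time_us(const SkIOStats *io, int obj, int ctx, int op)
{
	long long	total = 0;

	assert(obj >= 0 && obj < SK_NUM_OBJ);
	assert(ctx >= 0 && ctx < SK_NUM_CTX);
	assert(op >= 0 && op < SK_NUM_OP);

	for (int bk = 0; bk < SK_NUM_BKTYPE; bk++)
		total += io->time_us[bk][obj][ctx][op];

	return total;
}

/* What a derived wal_write_time column could read. */
long long
sk_wal_write_time_us(const SkIOStats *io)
{
	return sk_io_time_us(io, SK_OBJ_WAL, SK_CTX_NORMAL, SK_OP_WRITE);
}

/* What a derived wal_sync_time column could read. */
long long
sk_wal_sync_time_us(const SkIOStats *io)
{
	return sk_io_time_us(io, SK_OBJ_WAL, SK_CTX_NORMAL, SK_OP_FSYNC);
}
```

With this shape there is a single accumulator for each timing, so the two views cannot drift apart, and time spent initializing new segments (the init context) stays out of the WAL view's numbers.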

#13

Michael Paquier

michael@paquier.xyz

about 2 years ago

In reply to: Andres Freund (#12)

Re: Show WAL write and fsync stats in pg_stat_io

On Tue, Nov 07, 2023 at 05:19:28PM -0800, Andres Freund wrote:

Another approach would be to fetch the relevant columns from pg_stat_io in the
pg_stat_wal view. That'd avoid double accounting and breaking existing
monitoring.

Yep, I'd be OK with that as well to maintain compatibility.
--
Michael

#14

Michael Paquier

michael@paquier.xyz

about 2 years ago

In reply to: Michael Paquier (#13)

Re: Show WAL write and fsync stats in pg_stat_io

On Wed, Nov 08, 2023 at 10:27:44AM +0900, Michael Paquier wrote:

Yep, I'd be OK with that as well to maintain compatibility.

By the way, note that the patch is failing to apply, and that I've
switched it as waiting on author on 10/26.
--
Michael

#15

Bharath Rupireddy

bharath.rupireddyforpostgres@gmail.com

about 2 years ago

In reply to: Nazir Bilal Yavuz (#5)

Re: Show WAL write and fsync stats in pg_stat_io

On Wed, Sep 20, 2023 at 1:28 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:

Hi,

Thanks for the review!

Current status of the patch is:
- IOOBJECT_WAL / IOCONTEXT_NORMAL read, write and fsync stats are added.
- IOOBJECT_WAL / IOCONTEXT_NORMAL write and fsync tests are added.
- IOOBJECT_WAL / IOCONTEXT_INIT stats are added.
- pg_stat_io shows different op_bytes for the IOOBJECT_WAL operations.
- Working on which 'BackendType / IOContext / IOOp' should be banned in pg_stat_io.
- PendingWalStats.wal_sync and PendingWalStats.wal_write_time / PendingWalStats.wal_sync_time are moved to pgstat_count_io_op_n() / pgstat_count_io_op_time() respectively.

TODOs:
- Documentation.
- Try to set op_bytes for BackendType / IOContext.
- Decide which 'BackendType / IOContext / IOOp' should not be tracked.
- Add IOOBJECT_WAL / IOCONTEXT_NORMAL read tests.
- Add IOOBJECT_WAL / IOCONTEXT_INIT tests.

This patchset currently covers:
- IOOBJECT_WAL / IOCONTEXT_NORMAL read, write and fsync.
- IOOBJECT_WAL / IOCONTEXT_INIT write and fsync.

doesn't cover:
- Streaming replication WAL IO.

Is there any plan to account for WAL read stats in the WALRead()
function which will cover walsenders i.e. WAL read by logical and
streaming replication, WAL read by pg_walinspect and so on? I see the
patch already covers WAL read stats by recovery in XLogPageRead(), but
not other page_read callbacks which will end up in WALRead()
eventually. If added, the feature at
/messages/by-id/CALj2ACXKKK=wbiG5_t6dGao5GoecMwRkhr7GjVBM_jg54+Na=Q@mail.gmail.com
can then extend it to cover WAL read from WAL buffer stats.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
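The status list quoted above mentions that pg_stat_io shows a different op_bytes for the IOOBJECT_WAL rows. A minimal standalone sketch of why that matters when turning operation counts into bytes follows. The Sk/SK_ names are hypothetical; the sizes are the stock PostgreSQL defaults (8 kB BLCKSZ and XLOG_BLCKSZ, 16 MB WAL segments) and would differ on a build or cluster configured otherwise.

```c
#include <assert.h>

typedef enum { SK_OBJ_RELATION, SK_OBJ_TEMP_RELATION, SK_OBJ_WAL } SkIOObject;
typedef enum { SK_CTX_NORMAL, SK_CTX_INIT } SkIOContext;

/* Assumed default sizes; not read from a real server. */
#define SK_BLCKSZ			8192
#define SK_XLOG_BLCKSZ		8192
#define SK_WAL_SEGMENT_SIZE	(16 * 1024 * 1024)

/* Bytes represented by one counted operation in a given row. */
int
sk_io_op_bytes(SkIOObject obj, SkIOContext ctx)
{
	if (obj == SK_OBJ_WAL)
	{
		if (ctx == SK_CTX_NORMAL)
			return SK_XLOG_BLCKSZ;		/* one WAL page per counted op */
		if (ctx == SK_CTX_INIT)
			return SK_WAL_SEGMENT_SIZE; /* one whole segment per counted op */
	}

	return SK_BLCKSZ;					/* block-based relation I/O */
}

/* Convert an operation count from a row into a byte volume. */
long long
sk_io_bytes(SkIOObject obj, SkIOContext ctx, long long op_count)
{
	assert(op_count >= 0);
	return op_count * (long long) sk_io_op_bytes(obj, ctx);
}
```

Two writes in the wal/init row then stand for 32 MB of zero-filled segment data, while two writes in the wal/normal row stand for 16 kB, so the counts cannot be compared without the multiplier.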

#16

Nazir Bilal Yavuz

byavuz81@gmail.com

about 2 years ago

In reply to: Michael Paquier (#14)

2 attachment(s)

Re: Show WAL write and fsync stats in pg_stat_io

Hi,

Thanks for all the feedback!

On Wed, 8 Nov 2023 at 08:59, Michael Paquier <michael@paquier.xyz> wrote:

By the way, note that the patch is failing to apply, and that I've
switched it as waiting on author on 10/26.

Here is an updated patchset in attachment. Rebased on the latest HEAD
and changed 'pgstat_report_wal(false)' to 'pgstat_flush_io(false)' in
xlogrecovery.c. I will share the new version of the patchset once I
address the feedback.

Regards,
Nazir Bilal Yavuz
Microsoft

Attachments:

v4-0001-Show-WAL-stats-except-streaming-replication-WAL-o.patch (text/x-diff; charset=US-ASCII)

From e5db5cd6d8c47cadde0539f06bbee22368d17a41 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Thu, 26 Oct 2023 12:12:32 +0300
Subject: [PATCH v4 1/2] Show WAL stats (except streaming replication WAL) on
 pg_stat_io

This patch aims to show WAL stats per backend in the pg_stat_io view.

With this patch, the pg_stat_io view shows how many WAL operations each
backend performs, along with their context, type and total timing.

This patchset currently covers:
- IOOBJECT_WAL / IOCONTEXT_NORMAL read, write and fsync.
- IOOBJECT_WAL / IOCONTEXT_INIT write and fsync.

doesn't cover:
- Streaming replication WAL IO.
---
 src/backend/access/transam/xlog.c         |  60 ++++-----
 src/backend/access/transam/xlogrecovery.c |  10 ++
 src/backend/utils/activity/pgstat_io.c    | 144 ++++++++++++++++++++--
 src/backend/utils/adt/pgstatfuncs.c       |  10 +-
 src/include/pgstat.h                      |   8 +-
 5 files changed, 178 insertions(+), 54 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b541be8eec..d265b8c032 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2256,38 +2256,22 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
 			Size		nbytes;
 			Size		nleft;
 			int			written;
-			instr_time	start;
+			instr_time	io_start;
 
 			/* OK to write the page(s) */
 			from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
 			nbytes = npages * (Size) XLOG_BLCKSZ;
 			nleft = nbytes;
+
+			io_start = pgstat_prepare_io_time();
 			do
 			{
 				errno = 0;
 
-				/* Measure I/O timing to write WAL data */
-				if (track_wal_io_timing)
-					INSTR_TIME_SET_CURRENT(start);
-				else
-					INSTR_TIME_SET_ZERO(start);
-
 				pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
 				written = pg_pwrite(openLogFile, from, nleft, startoffset);
 				pgstat_report_wait_end();
 
-				/*
-				 * Increment the I/O timing and the number of times WAL data
-				 * were written out to disk.
-				 */
-				if (track_wal_io_timing)
-				{
-					instr_time	end;
-
-					INSTR_TIME_SET_CURRENT(end);
-					INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_write_time, end, start);
-				}
-
 				PendingWalStats.wal_write++;
 
 				if (written <= 0)
@@ -2312,6 +2296,9 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
 				startoffset += written;
 			} while (nleft > 0);
 
+			pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL,
+									IOOP_WRITE, io_start, npages);
+
 			npages = 0;
 
 			/*
@@ -3005,6 +2992,7 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	int			fd;
 	int			save_errno;
 	int			open_flags = O_RDWR | O_CREAT | O_EXCL | PG_BINARY;
+	instr_time	io_start;
 
 	Assert(logtli != 0);
 
@@ -3048,6 +3036,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 				(errcode_for_file_access(),
 				 errmsg("could not create file \"%s\": %m", tmppath)));
 
+	/* start timing writes for stats */
+	io_start = pgstat_prepare_io_time();
+
 	pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_WRITE);
 	save_errno = 0;
 	if (wal_init_zero)
@@ -3083,6 +3074,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	}
 	pgstat_report_wait_end();
 
+	pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_INIT, IOOP_WRITE,
+							io_start, 1);
+
 	if (save_errno)
 	{
 		/*
@@ -3099,6 +3093,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 				 errmsg("could not write to file \"%s\": %m", tmppath)));
 	}
 
+	/* start timing fsyncs for stats */
+	io_start = pgstat_prepare_io_time();
+
 	pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_SYNC);
 	if (pg_fsync(fd) != 0)
 	{
@@ -3111,6 +3108,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	}
 	pgstat_report_wait_end();
 
+	pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_INIT,
+							IOOP_FSYNC, io_start, 1);
+
 	if (close(fd) != 0)
 		ereport(ERROR,
 				(errcode_for_file_access(),
@@ -8282,7 +8282,7 @@ void
 issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 {
 	char	   *msg = NULL;
-	instr_time	start;
+	instr_time	io_start;
 
 	Assert(tli != 0);
 
@@ -8295,11 +8295,7 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 		wal_sync_method == WAL_SYNC_METHOD_OPEN_DSYNC)
 		return;
 
-	/* Measure I/O timing to sync the WAL file */
-	if (track_wal_io_timing)
-		INSTR_TIME_SET_CURRENT(start);
-	else
-		INSTR_TIME_SET_ZERO(start);
+	io_start = pgstat_prepare_io_time();
 
 	pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC);
 	switch (wal_sync_method)
@@ -8343,18 +8339,8 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 
 	pgstat_report_wait_end();
 
-	/*
-	 * Increment the I/O timing and the number of times WAL files were synced.
-	 */
-	if (track_wal_io_timing)
-	{
-		instr_time	end;
-
-		INSTR_TIME_SET_CURRENT(end);
-		INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_sync_time, end, start);
-	}
-
-	PendingWalStats.wal_sync++;
+	pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_FSYNC,
+							io_start, 1);
 }
 
 /*
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index c61566666a..5557f846a0 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -60,6 +60,7 @@
 #include "utils/builtins.h"
 #include "utils/datetime.h"
 #include "utils/guc_hooks.h"
+#include "utils/pgstat_internal.h"
 #include "utils/pg_lsn.h"
 #include "utils/ps_status.h"
 #include "utils/pg_rusage.h"
@@ -1773,6 +1774,9 @@ PerformWalRecovery(void)
 			 */
 			ApplyWalRecord(xlogreader, record, &replayTLI);
 
+			/* Report pending statistics to the cumulative stats system  */
+			pgstat_flush_io(false);
+
 			/* Exit loop if we reached inclusive recovery target */
 			if (recoveryStopsAfter(xlogreader))
 			{
@@ -3248,6 +3252,7 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 	uint32		targetPageOff;
 	XLogSegNo	targetSegNo PG_USED_FOR_ASSERTS_ONLY;
 	int			r;
+	instr_time	io_start;
 
 	XLByteToSeg(targetPagePtr, targetSegNo, wal_segment_size);
 	targetPageOff = XLogSegmentOffset(targetPagePtr, wal_segment_size);
@@ -3340,6 +3345,8 @@ retry:
 	/* Read the requested page */
 	readOff = targetPageOff;
 
+	io_start = pgstat_prepare_io_time();
+
 	pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
 	r = pg_pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);
 	if (r != XLOG_BLCKSZ)
@@ -3368,6 +3375,9 @@ retry:
 	}
 	pgstat_report_wait_end();
 
+	pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_READ,
+							io_start, 1);
+
 	Assert(targetSegNo == readSegNo);
 	Assert(targetPageOff == readOff);
 	Assert(reqLen <= readLen);
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 490d5a9ab7..a25fa67703 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -18,6 +18,7 @@
 
 #include "executor/instrument.h"
 #include "storage/bufmgr.h"
+#include "access/xlog.h"
 #include "utils/pgstat_internal.h"
 
 
@@ -87,24 +88,59 @@ pgstat_count_io_op_n(IOObject io_object, IOContext io_context, IOOp io_op, uint3
 	Assert((unsigned int) io_op < IOOP_NUM_TYPES);
 	Assert(pgstat_tracks_io_op(MyBackendType, io_object, io_context, io_op));
 
+	/*
+	 * Only count PendingWalStats.wal_sync in pg_stat_io because
+	 * PendingWalStats.wal_write could count different than what pg_stat_io
+	 * counts e.g. system calls. Also, PendingWalStats doesn't count WAL init
+	 * or WAL reads. So, they are not included too.
+	 */
+	if (io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL &&
+		io_op == IOOP_FSYNC)
+		PendingWalStats.wal_sync += cnt;
+
 	PendingIOStats.counts[io_object][io_context][io_op] += cnt;
 
 	have_iostats = true;
 }
 
+/*
+ * Prepares io_time for pgstat_count_io_op_time() function. It needs to return
+ * current time if there is a chance that any 'time' can be tracked.
+ */
 instr_time
 pgstat_prepare_io_time(void)
 {
 	instr_time	io_start;
 
-	if (track_io_timing)
+	if (track_io_timing || track_wal_io_timing)
 		INSTR_TIME_SET_CURRENT(io_start);
 	else
+
+		/*
+		 * If time won't be tracked, there is no need to set io_start but
+		 * compiler complains about uninitialized use. So, set it to zero.
+		 */
 		INSTR_TIME_SET_ZERO(io_start);
 
 	return io_start;
 }
 
+/*
+ * Decide if the io timing needs be tracked
+ */
+bool
+pgstat_should_track_io_time(IOObject io_object, IOContext io_context)
+{
+	/*
+	 * io times of IOOBJECT_WAL IOObject needs to be tracked when
+	 * 'track_wal_io_timing' is set regardless of 'track_io_timing'.
+	 */
+	if (io_object == IOOBJECT_WAL)
+		return track_wal_io_timing;
+
+	return track_io_timing;
+}
+
 /*
  * Like pgstat_count_io_op_n() except it also accumulates time.
  */
@@ -112,21 +148,43 @@ void
 pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
 						instr_time start_time, uint32 cnt)
 {
-	if (track_io_timing)
+	/*
+	 * Accumulate different type of times here. We want to eventually
+	 * deduplicate these counters, so we are consolidating them first. This
+	 * also makes it easy to compare what is tracked for which stats or
+	 * instrumentation purpose.
+	 *
+	 * Some IO counters didn't moved here because they track at a different
+	 * level of granularity or at a different point in the call stack.
+	 *
+	 * pgstat_count_buffer is for pgstat_database. Since pg_stat_database only
+	 * counts blk_read_time and blk_write_time, it is set for IOOP_READ and
+	 * IOOP_WRITE.
+	 *
+	 * pgBufferUsage is for EXPLAIN. pgBufferUsage has only write and read
+	 * stats for shared/local and temporary blocks. Only shared/local blocks
+	 * are counted here.
+	 *
+	 * PendingWalStats is for counting WAL stats on pg_stat_wal. It only
+	 * counts IOOBJECT_WAL / IOCONTEXT_NORMAL.
+	 *
+	 * At the end of the if case, accumulate time for pg_stat_io.
+	 */
+	if (pgstat_should_track_io_time(io_object, io_context))
 	{
 		instr_time	io_time;
 
 		INSTR_TIME_SET_CURRENT(io_time);
 		INSTR_TIME_SUBTRACT(io_time, start_time);
 
-		if (io_op == IOOP_WRITE || io_op == IOOP_EXTEND)
+
+		if (io_op == IOOP_FSYNC)
 		{
-			pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
-			if (io_object == IOOBJECT_RELATION)
-				INSTR_TIME_ADD(pgBufferUsage.shared_blk_write_time, io_time);
-			else if (io_object == IOOBJECT_TEMP_RELATION)
-				INSTR_TIME_ADD(pgBufferUsage.local_blk_write_time, io_time);
+			/* Track IOOBJECT_WAL / IOCONTEXT_NORMAL times on PendingWalStats */
+			if (io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL)
+				INSTR_TIME_ADD(PendingWalStats.wal_sync_time, io_time);
 		}
+
 		else if (io_op == IOOP_READ)
 		{
 			pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
@@ -135,11 +193,24 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
 			else if (io_object == IOOBJECT_TEMP_RELATION)
 				INSTR_TIME_ADD(pgBufferUsage.local_blk_read_time, io_time);
 		}
+		else if (io_op == IOOP_WRITE || io_op == IOOP_EXTEND)
+		{
+			pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
+			if (io_object == IOOBJECT_RELATION)
+				INSTR_TIME_ADD(pgBufferUsage.shared_blk_write_time, io_time);
+			else if (io_object == IOOBJECT_TEMP_RELATION)
+				INSTR_TIME_ADD(pgBufferUsage.local_blk_write_time, io_time);
+
+			/* Track IOOBJECT_WAL / IOCONTEXT_NORMAL times on PendingWalStats */
+			else if (io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL)
+				INSTR_TIME_ADD(PendingWalStats.wal_write_time, io_time);
+		}
 
 		INSTR_TIME_ADD(PendingIOStats.pending_times[io_object][io_context][io_op],
 					   io_time);
 	}
 
+	/* IO OP times are counted, now count IO OPs */
 	pgstat_count_io_op_n(io_object, io_context, io_op, cnt);
 }
 
@@ -220,12 +291,33 @@ pgstat_get_io_context_name(IOContext io_context)
 			return "normal";
 		case IOCONTEXT_VACUUM:
 			return "vacuum";
+		case IOCONTEXT_INIT:
+			return "init";
 	}
 
 	elog(ERROR, "unrecognized IOContext value: %d", io_context);
 	pg_unreachable();
 }
 
+/*
+ * op_bytes can change according to IOObject and IOContext.
+ * Return BLCKSZ as default because most of the
+ * IOObject / IOContext uses BLCKSZ.
+ */
+int
+pgstat_get_io_op_bytes(IOObject io_object, IOContext io_context)
+{
+	if (io_object == IOOBJECT_WAL)
+	{
+		if (io_context == IOCONTEXT_NORMAL)
+			return XLOG_BLCKSZ;
+		else if (io_context == IOCONTEXT_INIT)
+			return wal_segment_size;
+	}
+
+	return BLCKSZ;
+}
+
 const char *
 pgstat_get_io_object_name(IOObject io_object)
 {
@@ -235,6 +327,8 @@ pgstat_get_io_object_name(IOObject io_object)
 			return "relation";
 		case IOOBJECT_TEMP_RELATION:
 			return "temp relation";
+		case IOOBJECT_WAL:
+			return "wal";
 	}
 
 	elog(ERROR, "unrecognized IOObject value: %d", io_object);
@@ -316,10 +410,10 @@ pgstat_tracks_io_bktype(BackendType bktype)
 		case B_INVALID:
 		case B_ARCHIVER:
 		case B_LOGGER:
-		case B_WAL_RECEIVER:
-		case B_WAL_WRITER:
 			return false;
 
+		case B_WAL_RECEIVER:
+		case B_WAL_WRITER:
 		case B_AUTOVAC_LAUNCHER:
 		case B_AUTOVAC_WORKER:
 		case B_BACKEND:
@@ -354,6 +448,15 @@ pgstat_tracks_io_object(BackendType bktype, IOObject io_object,
 	if (!pgstat_tracks_io_bktype(bktype))
 		return false;
 
+	/*
+	 * Currently, IO on IOOBJECT_WAL IOObject can only occur in the
+	 * IOCONTEXT_NORMAL and IOCONTEXT_INIT IOContext.
+	 */
+	if (io_object == IOOBJECT_WAL &&
+		(io_context != IOCONTEXT_NORMAL &&
+		 io_context != IOCONTEXT_INIT))
+		return false;
+
 	/*
 	 * Currently, IO on temporary relations can only occur in the
 	 * IOCONTEXT_NORMAL IOContext.
@@ -452,6 +555,27 @@ pgstat_tracks_io_op(BackendType bktype, IOObject io_object,
 	if (!strategy_io_context && io_op == IOOP_REUSE)
 		return false;
 
+	/*
+	 * Some IOOps are not valid in certain IOContexts & IOObjects and some
+	 * IOOps are only valid in certain IOContexts & IOObjects.
+	 */
+
+	/*
+	 * In IOOBJECT_WAL io_object, IOCONTEXT_INIT io_context means operations
+	 * done while creating new WAL segments.
+	 */
+	if (io_object == IOOBJECT_WAL && io_context == IOCONTEXT_INIT &&
+		!(io_op == IOOP_WRITE || io_op == IOOP_FSYNC))
+		return false;
+
+	/*
+	 * In IOOBJECT_WAL io_object, IOCONTEXT_NORMAL io_context means operations
+	 * done on already created WAL segments.
+	 */
+	if (io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL &&
+		!(io_op == IOOP_WRITE || io_op == IOOP_READ || io_op == IOOP_FSYNC))
+		return false;
+
 	/*
 	 * IOOP_FSYNC IOOps done by a backend using a BufferAccessStrategy are
 	 * counted in the IOCONTEXT_NORMAL IOContext. See comment in
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 1fb8b31863..4329ac517f 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1377,6 +1377,7 @@ pg_stat_get_io(PG_FUNCTION_ARGS)
 			for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
 			{
 				const char *context_name = pgstat_get_io_context_name(io_context);
+				int			op_bytes;
 
 				Datum		values[IO_NUM_COLUMNS] = {0};
 				bool		nulls[IO_NUM_COLUMNS] = {0};
@@ -1395,12 +1396,11 @@ pg_stat_get_io(PG_FUNCTION_ARGS)
 				values[IO_COL_RESET_TIME] = TimestampTzGetDatum(reset_time);
 
 				/*
-				 * Hard-code this to the value of BLCKSZ for now. Future
-				 * values could include XLOG_BLCKSZ, once WAL IO is tracked,
-				 * and constant multipliers, once non-block-oriented IO (e.g.
-				 * temporary file IO) is tracked.
+				 * op_bytes can change according to IOObject and IOContext.
+				 * Get the correct op_bytes.
 				 */
-				values[IO_COL_CONVERSION] = Int64GetDatum(BLCKSZ);
+				op_bytes = pgstat_get_io_op_bytes(io_obj, io_context);
+				values[IO_COL_CONVERSION] = Int64GetDatum(op_bytes);
 
 				for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
 				{
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index f95d8db0c4..1d9428b3a8 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -276,9 +276,10 @@ typedef enum IOObject
 {
 	IOOBJECT_RELATION,
 	IOOBJECT_TEMP_RELATION,
+	IOOBJECT_WAL,
 } IOObject;
 
-#define IOOBJECT_NUM_TYPES (IOOBJECT_TEMP_RELATION + 1)
+#define IOOBJECT_NUM_TYPES (IOOBJECT_WAL + 1)
 
 typedef enum IOContext
 {
@@ -286,9 +287,10 @@ typedef enum IOContext
 	IOCONTEXT_BULKWRITE,
 	IOCONTEXT_NORMAL,
 	IOCONTEXT_VACUUM,
+	IOCONTEXT_INIT,
 } IOContext;
 
-#define IOCONTEXT_NUM_TYPES (IOCONTEXT_VACUUM + 1)
+#define IOCONTEXT_NUM_TYPES (IOCONTEXT_INIT + 1)
 
 typedef enum IOOp
 {
@@ -520,10 +522,12 @@ extern bool pgstat_bktype_io_stats_valid(PgStat_BktypeIO *backend_io,
 extern void pgstat_count_io_op(IOObject io_object, IOContext io_context, IOOp io_op);
 extern void pgstat_count_io_op_n(IOObject io_object, IOContext io_context, IOOp io_op, uint32 cnt);
 extern instr_time pgstat_prepare_io_time(void);
+extern bool pgstat_should_track_io_time(IOObject io_object, IOContext io_context);
 extern void pgstat_count_io_op_time(IOObject io_object, IOContext io_context,
 									IOOp io_op, instr_time start_time, uint32 cnt);
 
 extern PgStat_IO *pgstat_fetch_stat_io(void);
+extern int	pgstat_get_io_op_bytes(IOObject io_object, IOContext io_context);
 extern const char *pgstat_get_io_context_name(IOContext io_context);
 extern const char *pgstat_get_io_object_name(IOObject io_object);
 
-- 
2.42.0
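As read from the patch above, two separate decisions are involved in timing an operation: pgstat_prepare_io_time() reads the clock when either GUC is on, while pgstat_count_io_op_time() only accumulates the elapsed time when the GUC that matches the object is on. The standalone sketch below models just that decision logic; the Sk/sk_ names are hypothetical and nothing here calls into PostgreSQL.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { SK_OBJ_RELATION, SK_OBJ_TEMP_RELATION, SK_OBJ_WAL } SkIOObject;

/* Stand-ins for the two GUCs discussed in the thread. */
typedef struct
{
	bool		track_io_timing;
	bool		track_wal_io_timing;
} SkTimingGucs;

/* Start of an operation: the clock is read if any timing could be wanted. */
bool
sk_reads_clock_at_start(const SkTimingGucs *gucs)
{
	assert(gucs != 0);
	return gucs->track_io_timing || gucs->track_wal_io_timing;
}

/* End of an operation: time is accumulated only under the object's own GUC. */
bool
sk_accumulates_time(const SkTimingGucs *gucs, SkIOObject obj)
{
	assert(gucs != 0);
	if (obj == SK_OBJ_WAL)
		return gucs->track_wal_io_timing;

	return gucs->track_io_timing;
}
```

One consequence of this shape: with track_io_timing on and track_wal_io_timing off, a WAL write still pays for the clock read at the start, even though no time is recorded for it. Whether that cost matters would need measuring, for example with the clock sources compared earlier in the thread.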

v4-0002-Add-IOOBJECT_WAL-IOCONTEXT_NORMAL-write-and-fsync.patch (text/x-diff; charset=US-ASCII)

From 27b43a8568e65121db3599b3c415fca37a3fb345 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Thu, 14 Sep 2023 12:44:29 +0300
Subject: [PATCH v4 2/2] Add IOOBJECT_WAL / IOCONTEXT_NORMAL write and fsync
 tests

---
 src/test/regress/expected/stats.out | 25 +++++++++++++++++++++++++
 src/test/regress/sql/stats.sql      | 10 ++++++++++
 2 files changed, 35 insertions(+)

diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index 494cef07d3..82dbe72dcd 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -1292,6 +1292,9 @@ SELECT sum(extends) AS io_sum_shared_before_extends
 SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
   FROM pg_stat_io
   WHERE object = 'relation' \gset io_sum_shared_before_
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'normal' and object = 'wal' \gset io_sum_wal_normal_before_
 CREATE TABLE test_io_shared(a int);
 INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
 SELECT pg_stat_force_next_flush();
@@ -1329,6 +1332,28 @@ SELECT current_setting('fsync') = 'off'
  t
 (1 row)
 
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'normal' and object = 'wal' \gset io_sum_wal_normal_after_
+SELECT current_setting('synchronous_commit') = 'on';
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT :io_sum_wal_normal_after_writes > :io_sum_wal_normal_before_writes;
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT current_setting('fsync') = 'off'
+ OR :io_sum_wal_normal_after_fsyncs > :io_sum_wal_normal_before_fsyncs;
+ ?column? 
+----------
+ t
+(1 row)
+
 -- Change the tablespace so that the table is rewritten directly, then SELECT
 -- from it to cause it to be read back into shared buffers.
 SELECT sum(reads) AS io_sum_shared_before_reads
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index 7ae8b8a276..3945a7e3a3 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -620,6 +620,9 @@ SELECT sum(extends) AS io_sum_shared_before_extends
 SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
   FROM pg_stat_io
   WHERE object = 'relation' \gset io_sum_shared_before_
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'normal' and object = 'wal' \gset io_sum_wal_normal_before_
 CREATE TABLE test_io_shared(a int);
 INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
 SELECT pg_stat_force_next_flush();
@@ -638,6 +641,13 @@ SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
 SELECT :io_sum_shared_after_writes > :io_sum_shared_before_writes;
 SELECT current_setting('fsync') = 'off'
   OR :io_sum_shared_after_fsyncs > :io_sum_shared_before_fsyncs;
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'normal' and object = 'wal' \gset io_sum_wal_normal_after_
+SELECT current_setting('synchronous_commit') = 'on';
+SELECT :io_sum_wal_normal_after_writes > :io_sum_wal_normal_before_writes;
+SELECT current_setting('fsync') = 'off'
+ OR :io_sum_wal_normal_after_fsyncs > :io_sum_wal_normal_before_fsyncs;
 
 -- Change the tablespace so that the table is rewritten directly, then SELECT
 -- from it to cause it to be read back into shared buffers.
-- 
2.42.0

#17

Nazir Bilal Yavuz

byavuz81@gmail.com

about 2 years ago

In reply to: Andres Freund (#12)

Re: Show WAL write and fsync stats in pg_stat_io

Hi,

On Wed, 8 Nov 2023 at 04:19, Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2023-11-08 09:52:16 +0900, Michael Paquier wrote:

By the way, if the write/sync quantities and times begin to be tracked
by pg_stat_io, I'd see a pretty good argument in removing the
equivalent columns in pg_stat_wal. It looks like this would reduce
the confusion related to the handling of PendingWalStats added in
pgstat_io.c, for one.

Another approach would be to fetch the relevant columns from pg_stat_io in the
pg_stat_wal view. That'd avoid double accounting and breaking existing
monitoring.

There are some differences between pg_stat_wal and pg_stat_io while
collecting WAL stats. For example, in the XLogWrite() function in
xlog.c, pg_stat_wal counts wal_write as the number of write system
calls. This is not something we want for pg_stat_io, since pg_stat_io
counts the number of blocks rather than the system calls, so I
incremented pg_stat_io by npages instead.

Could that cause a problem since pg_stat_wal's behaviour will be
changed? Of course, as an alternative we could change pg_stat_io's
behaviour but in the end either pg_stat_wal's or pg_stat_io's
behaviour will be changed.

Regards,
Nazir Bilal Yavuz
Microsoft
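The difference between counting system calls and counting blocks can be shown with a small standalone model of the write loop. It is a sketch only: the Sk/sk_ names are hypothetical, the page size is the default 8 kB XLOG_BLCKSZ, and short writes are simulated with a per-call byte limit where the real loop calls pg_pwrite().

```c
#include <assert.h>
#include <stddef.h>

#define SK_XLOG_BLCKSZ 8192		/* assumed default WAL page size */

typedef struct
{
	unsigned long wal_write_calls;	/* pg_stat_wal style: one per write call */
	unsigned long io_pages;			/* pg_stat_io style: one per WAL page */
} SkWriteCounters;

/*
 * Write npages WAL pages, where each simulated write call transfers at most
 * max_per_call bytes. The call counter is bumped inside the loop, the page
 * counter once after the loop.
 */
void
sk_write_pages(SkWriteCounters *c, size_t npages, size_t max_per_call)
{
	size_t		nleft = npages * SK_XLOG_BLCKSZ;

	assert(max_per_call > 0);

	while (nleft > 0)
	{
		size_t		written = nleft < max_per_call ? nleft : max_per_call;

		c->wal_write_calls++;
		nleft -= written;
	}

	c->io_pages += npages;
}
```

Writing four pages in one full call adds 1 to the call counter and 4 to the page counter. If the same four pages need two calls because the first one comes up short, the call counter grows by 2 while the page counter still grows by 4, which is why the two numbers cannot simply be shared between the views.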

#18

Michael Paquier

michael@paquier.xyz

about 2 years ago

In reply to: Nazir Bilal Yavuz (#17)

Re: Show WAL write and fsync stats in pg_stat_io

On Thu, Nov 09, 2023 at 02:39:26PM +0300, Nazir Bilal Yavuz wrote:

There are some differences between pg_stat_wal and pg_stat_io while
collecting WAL stats. For example, in the XLogWrite() function in
xlog.c, pg_stat_wal counts wal_write as the number of write system
calls. This is not something we want for pg_stat_io, since pg_stat_io
counts the number of blocks rather than the system calls, so I
incremented pg_stat_io by npages instead.

Could that cause a problem since pg_stat_wal's behaviour will be
changed? Of course, as an alternative we could change pg_stat_io's
behaviour but in the end either pg_stat_wal's or pg_stat_io's
behaviour will be changed.

Yep, that could be confusing for existing applications that track the
information of pg_stat_wal. The number of writes is not something
that can be correctly shared between both. The timings for the writes
and the syncs could be shared at least, right?

This slightly relates to pgstat_count_io_op_n() in your latest patch,
where it feels a bit weird to see an update of
PendingWalStats.wal_sync sit in the middle of a routine dedicated to
pg_stat_io. I am not completely sure what the right balance is here,
but I would try to implement things so that the pg_stat_io paths do
not need to know about PendingWalStats.
--
Michael

#19

Nazir Bilal Yavuz

byavuz81@gmail.com

about 2 years ago

In reply to: Michael Paquier (#18)

Re: Show WAL write and fsync stats in pg_stat_io

Hi,

Thanks for the feedback.

On Mon, 20 Nov 2023 at 10:47, Michael Paquier <michael@paquier.xyz> wrote:

On Thu, Nov 09, 2023 at 02:39:26PM +0300, Nazir Bilal Yavuz wrote:

There are some differences between pg_stat_wal and pg_stat_io while
collecting WAL stats. For example, in the XLogWrite() function in
xlog.c, pg_stat_wal counts wal_write as the number of write system
calls. This is not something we want for pg_stat_io, since pg_stat_io
counts the number of blocks rather than the system calls, so I
incremented pg_stat_io by npages instead.

Could that cause a problem since pg_stat_wal's behaviour will be
changed? Of course, as an alternative we could change pg_stat_io's
behaviour but in the end either pg_stat_wal's or pg_stat_io's
behaviour will be changed.

Yep, that could be confusing for existing applications that track the
information of pg_stat_wal. The number of writes is not something
that can be correctly shared between both. The timings for the writes
and the syncs could be shared at least, right?

Yes, the timings for the writes and the syncs should work. Another
question I have in mind is the pg_stat_reset_shared() function. When
we call it with 'io' it will reset pg_stat_wal's timings and when we
call it with 'wal' it won't reset them, right?

This slightly relates to pgstat_count_io_op_n() in your latest patch,
where it feels a bit weird to see an update of
PendingWalStats.wal_sync sit in the middle of a routine dedicated to
pg_stat_io. I am not completely sure what the right balance is here,
but I would try to implement things so that the pg_stat_io paths do
not need to know about PendingWalStats.

Writes are counted differently in the two views (blocks vs. system
calls), but syncs are counted the same way in both. That is why I put
the PendingWalStats.wal_sync update into the pg_stat_io path, but I
agree that it looks a bit weird.

--
Regards,
Nazir Bilal Yavuz
Microsoft

#20

Michael Paquier

michael@paquier.xyz

about 2 years ago

In reply to: Nazir Bilal Yavuz (#19)

Re: Show WAL write and fsync stats in pg_stat_io

On Mon, Nov 20, 2023 at 05:43:17PM +0300, Nazir Bilal Yavuz wrote:

Yes, the timings for the writes and the syncs should work. Another
question I have in mind is the pg_stat_reset_shared() function. When
we call it with 'io' it will reset pg_stat_wal's timings and when we
call it with 'wal' it won't reset them, right?

pg_stat_reset_shared() with a target is IMO a very edge case, so I'm
OK with the approach of resetting timings in pg_stat_wal even if 'io'
was implied because pg_stat_wal would feed partially from pg_stat_io.
I'd take that as a side-cost in favor of compatibility while making
the stats gathering cheaper overall. I'm OK as well if people
counter-argue on this point, though that would mean keeping entirely
separate views with duplicated fields that serve the same purpose,
impacting all deployments because it would make the stats gathering
heavier for all.
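
The reset behaviour being discussed can be modelled in a few lines. This is a simplified sketch under stated assumptions, not PostgreSQL code: the Sketch types and sketch_ functions are hypothetical, and the only property it models is that the WAL view takes its timings from the IO stats, so resetting 'io' clears them while resetting 'wal' does not.

```c
#include <assert.h>
#include <string.h>

/* Timings kept on the pg_stat_io side (microseconds). */
typedef struct SketchIoTimings
{
	double		wal_write_time_us;
	double		wal_sync_time_us;
} SketchIoTimings;

/* Counters kept on the pg_stat_wal side. */
typedef struct SketchWalCounters
{
	long		wal_write;
	long		wal_sync;
} SketchWalCounters;

typedef struct SketchSharedStats
{
	SketchIoTimings io;
	SketchWalCounters wal;
} SketchSharedStats;

/* What a pg_stat_wal-like view would display. */
typedef struct SketchWalView
{
	long		wal_write;
	long		wal_sync;
	double		wal_write_time_ms;
	double		wal_sync_time_ms;
} SketchWalView;

/* Hypothetical analogue of pg_stat_reset_shared(target). */
static void
sketch_reset_shared(SketchSharedStats *s, const char *target)
{
	if (strcmp(target, "io") == 0)
		s->io = (SketchIoTimings) {0};
	else if (strcmp(target, "wal") == 0)
		s->wal = (SketchWalCounters) {0};
}

/* Build the WAL view: counters come from 'wal', timings come from 'io'. */
static SketchWalView
sketch_wal_view(const SketchSharedStats *s)
{
	SketchWalView v;

	v.wal_write = s->wal.wal_write;
	v.wal_sync = s->wal.wal_sync;
	v.wal_write_time_ms = s->io.wal_write_time_us / 1000.0;
	v.wal_sync_time_ms = s->io.wal_sync_time_us / 1000.0;
	return v;
}
```

In this model a 'wal' reset zeroes the counters but leaves the timings visible, and only an 'io' reset clears the timings, which is the side-cost described above.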
--
Michael

#21

Nazir Bilal Yavuz

byavuz81@gmail.com

about 2 years ago

In reply to: Bharath Rupireddy (#15)

Re: Show WAL write and fsync stats in pg_stat_io

Hi,

On Wed, 8 Nov 2023 at 10:34, Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

Is there any plan to account for WAL read stats in the WALRead()
function which will cover walsenders i.e. WAL read by logical and
streaming replication, WAL read by pg_walinspect and so on? I see the
patch already covers WAL read stats by recovery in XLogPageRead(), but
not other page_read callbacks which will end up in WALRead()
eventually. If added, the feature at
/messages/by-id/CALj2ACXKKK=wbiG5_t6dGao5GoecMwRkhr7GjVBM_jg54+Na=Q@mail.gmail.com
can then extend it to cover WAL read from WAL buffer stats.

Yes, I am planning to create a patch for that after this patch is
done. Thanks for letting me know!

--
Regards,
Nazir Bilal Yavuz
Microsoft

#22

Nazir Bilal Yavuz

byavuz81@gmail.com

about 2 years ago

In reply to: Nazir Bilal Yavuz (#16)

4 attachment(s)

Re: Show WAL write and fsync stats in pg_stat_io

Hi,

Thanks for all the feedback. I am sharing the new version of the patchset.

Current status of the patchset is:
- IOOBJECT_WAL / IOCONTEXT_NORMAL / read, write, fsync stats and their
tests are added.
- IOOBJECT_WAL / IOCONTEXT_INIT stats and their tests are added.
- Documentation is updated.
- pg_stat_io shows different op_bytes for the IOOBJECT_WAL operations.
- PendingWalStats.wal_sync and PendingWalStats.wal_write_time /
PendingWalStats.wal_sync_time are moved to pgstat_count_io_op_n() /
pgstat_count_io_op_time() respectively.
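
As a rough illustration of the op_bytes item above, the following standalone sketch shows how a per-row unit size turns operation counts into bytes. The names are hypothetical and the sizes are assumed defaults (8 kB blocks, 16 MB WAL segments); actual values depend on how the server was built and initialized.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed defaults; real servers may use other sizes. */
#define SKETCH_BLCKSZ			8192
#define SKETCH_XLOG_BLCKSZ		8192
#define SKETCH_WAL_SEGMENT_SIZE	(16 * 1024 * 1024)

typedef enum SketchObject
{
	SK_OBJ_RELATION,
	SK_OBJ_WAL
} SketchObject;

typedef enum SketchContext
{
	SK_CTX_NORMAL,
	SK_CTX_INIT
} SketchContext;

/*
 * Unit size of one counted operation: WAL blocks for normal WAL IO, a whole
 * segment for WAL initialization, and the relation block size otherwise.
 */
static int64_t
sketch_op_bytes(SketchObject obj, SketchContext ctx)
{
	if (obj == SK_OBJ_WAL)
	{
		if (ctx == SK_CTX_NORMAL)
			return SKETCH_XLOG_BLCKSZ;
		if (ctx == SK_CTX_INIT)
			return SKETCH_WAL_SEGMENT_SIZE;
	}
	return SKETCH_BLCKSZ;
}

/* Convert an operation count into bytes for a given row. */
static int64_t
sketch_total_bytes(SketchObject obj, SketchContext ctx, int64_t ops)
{
	assert(ops >= 0);
	return ops * sketch_op_bytes(obj, ctx);
}
```

With these assumed sizes, one counted 'init' write stands for a whole 16 MB segment, while one counted 'normal' write stands for a single 8 kB WAL block.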

Updates & Discussion items:
- Try to set op_bytes for BackendType / IOContext: I think we don't
need this now; we will need it when we add streaming replication WAL
IOs.

- Decide which 'BackendType / IOContext / IOOp' should not be tracked:
-- IOOBJECT_WAL / IOCONTEXT_INIT + IOCONTEXT_NORMAL / write and fsync
IOs can be done on every backend that tracks IO statistics. Because of
that and since we have a pgstat_tracks_io_bktype(bktype) check, I
didn't add another check for this.
-- I found that only the standalone backend and startup backend do
IOOBJECT_WAL / IOCONTEXT_NORMAL / read IOs. So, I added a check for
that, but I am not sure whether there are more backends that do WAL
reads during WAL recovery.

- For the IOOBJECT_WAL / IOCONTEXT_INIT and IOOBJECT_WAL /
IOCONTEXT_NORMAL / read tests, I used initial WAL IOs to check these
stats. I am not sure if that is the correct way or enough to test
these stats.

- To avoid calculating WAL timings separately for both the pg_stat_wal
and pg_stat_io views, the pg_stat_wal view's WAL timings are fetched
from pg_stat_io. Since these timings are fetched from pg_stat_io,
pg_stat_reset_shared('io') will reset pg_stat_wal's timings too.

- I didn't move 'PendingWalStats.wal_sync' out of the
'pgstat_count_io_op_n' function because both views count syncs the
same way (unlike writes, where it is blocks vs system calls), but I
agree that this doesn't look good.
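
The tracking rules listed in the discussion items can be summarized in a simplified model. This is a sketch only: the enums below are hypothetical and cover just a few backend types, and the rule that only the startup and standalone backends read WAL is the tentative finding stated above, not an established fact.

```c
#include <assert.h>
#include <stdbool.h>

/* Small hypothetical subset of backend types. */
typedef enum SketchBackend
{
	SK_BK_BACKEND,
	SK_BK_STARTUP,
	SK_BK_STANDALONE,
	SK_BK_WAL_WRITER
} SketchBackend;

typedef enum SketchWalContext
{
	SK_WAL_NORMAL,				/* IO on already created WAL segments */
	SK_WAL_INIT					/* IO done while creating WAL segments */
} SketchWalContext;

typedef enum SketchOp
{
	SK_OP_READ,
	SK_OP_WRITE,
	SK_OP_FSYNC,
	SK_OP_EXTEND
} SketchOp;

/* Decide whether a WAL IO operation would be tracked in this model. */
static bool
sketch_tracks_wal_op(SketchBackend bk, SketchWalContext ctx, SketchOp op)
{
	/* Segment creation only writes and fsyncs. */
	if (ctx == SK_WAL_INIT)
		return op == SK_OP_WRITE || op == SK_OP_FSYNC;

	/* Assumption from the thread: only these two backends read WAL. */
	if (op == SK_OP_READ)
		return bk == SK_BK_STARTUP || bk == SK_BK_STANDALONE;

	/* Any tracked backend may write or fsync existing segments. */
	return op == SK_OP_WRITE || op == SK_OP_FSYNC;
}
```

If more backends turn out to read WAL during recovery, only the read branch of this model would need to change.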

Any kind of feedback would be appreciated.

--
Regards,
Nazir Bilal Yavuz
Microsoft

Attachments:

v5-0003-Add-IOOBJECT_WAL-IOCONTEXT_INIT-write-and-fsync-t.patch (text/x-diff; charset=US-ASCII)

From 4ad85b11d418ae78237ed70eced6e3b46d086ef5 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Fri, 1 Dec 2023 10:03:21 +0300
Subject: [PATCH v5 3/4] Add IOOBJECT_WAL / IOCONTEXT_INIT / write and fsync
 tests

---
 src/test/regress/expected/stats.out | 19 +++++++++++++++++++
 src/test/regress/sql/stats.sql      | 10 ++++++++++
 2 files changed, 29 insertions(+)

diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index 4d3a515bdd..4adda9e479 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -862,6 +862,25 @@ WHERE pg_stat_get_backend_pid(beid) = pg_backend_pid();
  t
 (1 row)
 
+-- Test pg_stat_io IOOBJECT_WAL / IOCONTEXT_INIT / write and fsync.
+-- When the servers starts, the initial WAL file must be created,
+-- so check these stats before stats get resetted.
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'init' and object = 'wal' \gset io_sum_wal_init_
+SELECT :io_sum_wal_init_writes > 0;
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT current_setting('fsync') = 'off'
+  OR :io_sum_wal_init_fsyncs > 0;
+ ?column? 
+----------
+ t
+(1 row)
+
 -----
 -- Test that resetting stats works for reset timestamp
 -----
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index aa48e65dc8..72e864a0d2 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -442,6 +442,16 @@ SELECT (current_schemas(true))[1] = ('pg_temp_' || beid::text) AS match
 FROM pg_stat_get_backend_idset() beid
 WHERE pg_stat_get_backend_pid(beid) = pg_backend_pid();
 
+-- Test pg_stat_io IOOBJECT_WAL / IOCONTEXT_INIT / write and fsync.
+-- When the servers starts, the initial WAL file must be created,
+-- so check these stats before stats get resetted.
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'init' and object = 'wal' \gset io_sum_wal_init_
+SELECT :io_sum_wal_init_writes > 0;
+SELECT current_setting('fsync') = 'off'
+  OR :io_sum_wal_init_fsyncs > 0;
+
 -----
 -- Test that resetting stats works for reset timestamp
 -----
-- 
2.43.0

v5-0002-Add-IOOBJECT_WAL-IOCONTEXT_NORMAL-write-and-fsync.patch (text/x-diff; charset=US-ASCII)

From c7ae6c12cd02806d9d8201d738920179985cee7a Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Fri, 10 Nov 2023 14:52:22 +0300
Subject: [PATCH v5 2/4] Add IOOBJECT_WAL / IOCONTEXT_NORMAL / write and fsync
 tests

---
 src/test/regress/expected/stats.out | 26 ++++++++++++++++++++++++++
 src/test/regress/sql/stats.sql      | 11 +++++++++++
 2 files changed, 37 insertions(+)

diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index 346e10a3d2..4d3a515bdd 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -1255,6 +1255,7 @@ SELECT pg_stat_get_subscription_stats(NULL);
 -- - extends of relations using shared buffers
 -- - fsyncs done to ensure the durability of data dirtying shared buffers
 -- - shared buffer hits
+-- - WAL writes and fsyncs in IOContext IOCONTEXT_NORMAL
 -- There is no test for blocks evicted from shared buffers, because we cannot
 -- be sure of the state of shared buffers at the point the test is run.
 -- Create a regular table and insert some data to generate IOCONTEXT_NORMAL
@@ -1264,6 +1265,9 @@ SELECT sum(extends) AS io_sum_shared_before_extends
 SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
   FROM pg_stat_io
   WHERE object = 'relation' \gset io_sum_shared_before_
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'normal' and object = 'wal' \gset io_sum_wal_normal_before_
 CREATE TABLE test_io_shared(a int);
 INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
 SELECT pg_stat_force_next_flush();
@@ -1301,6 +1305,28 @@ SELECT current_setting('fsync') = 'off'
  t
 (1 row)
 
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'normal' and object = 'wal' \gset io_sum_wal_normal_after_
+SELECT current_setting('synchronous_commit') = 'on';
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT :io_sum_wal_normal_after_writes > :io_sum_wal_normal_before_writes;
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT current_setting('fsync') = 'off'
+  OR :io_sum_wal_normal_after_fsyncs > :io_sum_wal_normal_before_fsyncs;
+ ?column? 
+----------
+ t
+(1 row)
+
 -- Change the tablespace so that the table is rewritten directly, then SELECT
 -- from it to cause it to be read back into shared buffers.
 SELECT sum(reads) AS io_sum_shared_before_reads
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index e3b4ca96e8..aa48e65dc8 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -601,6 +601,7 @@ SELECT pg_stat_get_subscription_stats(NULL);
 -- - extends of relations using shared buffers
 -- - fsyncs done to ensure the durability of data dirtying shared buffers
 -- - shared buffer hits
+-- - WAL writes and fsyncs in IOContext IOCONTEXT_NORMAL
 
 -- There is no test for blocks evicted from shared buffers, because we cannot
 -- be sure of the state of shared buffers at the point the test is run.
@@ -612,6 +613,9 @@ SELECT sum(extends) AS io_sum_shared_before_extends
 SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
   FROM pg_stat_io
   WHERE object = 'relation' \gset io_sum_shared_before_
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'normal' and object = 'wal' \gset io_sum_wal_normal_before_
 CREATE TABLE test_io_shared(a int);
 INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
 SELECT pg_stat_force_next_flush();
@@ -630,6 +634,13 @@ SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
 SELECT :io_sum_shared_after_writes > :io_sum_shared_before_writes;
 SELECT current_setting('fsync') = 'off'
   OR :io_sum_shared_after_fsyncs > :io_sum_shared_before_fsyncs;
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'normal' and object = 'wal' \gset io_sum_wal_normal_after_
+SELECT current_setting('synchronous_commit') = 'on';
+SELECT :io_sum_wal_normal_after_writes > :io_sum_wal_normal_before_writes;
+SELECT current_setting('fsync') = 'off'
+  OR :io_sum_wal_normal_after_fsyncs > :io_sum_wal_normal_before_fsyncs;
 
 -- Change the tablespace so that the table is rewritten directly, then SELECT
 -- from it to cause it to be read back into shared buffers.
-- 
2.43.0

v5-0001-Show-WAL-stats-on-pg_stat_io-except-streaming-rep.patch (text/x-diff; charset=US-ASCII)

From 5538140043a975f8d3a469fe83c0dec7919f8dfa Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Thu, 26 Oct 2023 12:12:32 +0300
Subject: [PATCH v5 1/4] Show WAL stats on pg_stat_io (except streaming
 replication)

This patch aims to show WAL stats per backend in the pg_stat_io view.

With this patch, the pg_stat_io view shows how many WAL operations each
backend performs, along with their context, types and total timings.

To avoid calculating WAL timings separately for both the pg_stat_wal and
pg_stat_io views, the pg_stat_wal view's WAL timings are fetched from
pg_stat_io. Since these timings are fetched from pg_stat_io,
pg_stat_reset_shared('io') will reset pg_stat_wal's timings too.

In this patch a new IOContext, IOCONTEXT_INIT, is introduced; it is for
IO operations done while creating things. Currently, it is used only
together with IOObject IOOBJECT_WAL.

IOOBJECT_WAL means IO operations related to WAL.
IOOBJECT_WAL / IOCONTEXT_NORMAL means IO operations done on already
created WAL segments.
IOOBJECT_WAL / IOCONTEXT_INIT means IO operations done while creating
the WAL segments.

This patch currently covers:
- Documentation
- IOOBJECT_WAL / IOCONTEXT_NORMAL / read, write and fsync stats on
  pg_stat_io.
- IOOBJECT_WAL / IOCONTEXT_INIT / write and fsync stats on pg_stat_io.

It doesn't cover:
- Streaming replication WAL IO.
---
 doc/src/sgml/monitoring.sgml              |  28 ++++-
 src/backend/access/transam/xlog.c         |  60 ++++------
 src/backend/access/transam/xlogrecovery.c |  10 ++
 src/backend/utils/activity/pgstat_io.c    | 136 +++++++++++++++++++++-
 src/backend/utils/adt/pgstatfuncs.c       |  48 ++++++--
 src/include/pgstat.h                      |   8 +-
 6 files changed, 232 insertions(+), 58 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 42509042ad..679c527f46 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2499,9 +2499,10 @@ description | Waiting for a newly initialized WAL file to reach durable storage
   </para>
 
   <para>
-   Currently, I/O on relations (e.g. tables, indexes) is tracked. However,
-   relation I/O which bypasses shared buffers (e.g. when moving a table from one
-   tablespace to another) is currently not tracked.
+   Currently, I/O on relations (e.g. tables, indexes) and WAL activities are
+   tracked. However, relation I/O which bypasses shared buffers
+   (e.g. when moving a table from one tablespace to another) is currently
+   not tracked.
   </para>
 
   <table id="pg-stat-io-view" xreflabel="pg_stat_io">
@@ -2554,6 +2555,11 @@ description | Waiting for a newly initialized WAL file to reach durable storage
           <literal>temp relation</literal>: Temporary relations.
          </para>
         </listitem>
+        <listitem>
+         <para>
+          <literal>wal</literal>: Write Ahead Logs.
+         </para>
+        </listitem>
        </itemizedlist>
        </para>
       </entry>
@@ -2578,6 +2584,13 @@ description | Waiting for a newly initialized WAL file to reach durable storage
           <literal>normal</literal>.
          </para>
         </listitem>
+        <listitem>
+         <para>
+          <literal>init</literal>: I/O operations performed while creating the
+          WAL segments are tracked in <varname>context</varname>
+          <literal>init</literal>.
+         </para>
+        </listitem>
         <listitem>
          <para>
           <literal>vacuum</literal>: I/O operations performed outside of shared
@@ -4740,7 +4753,9 @@ description | Waiting for a newly initialized WAL file to reach durable storage
         <listitem>
          <para>
           <literal>io</literal>: Reset all the counters shown in the
-          <structname>pg_stat_io</structname> view.
+          <structname>pg_stat_io</structname> view. Note that, this will
+          reset <structname>pg_stat_wal</structname> view's timing counters
+          too.
          </para>
         </listitem>
         <listitem>
@@ -4757,8 +4772,9 @@ description | Waiting for a newly initialized WAL file to reach durable storage
         </listitem>
         <listitem>
          <para>
-          <literal>wal</literal>: Reset all the counters shown in the
-          <structname>pg_stat_wal</structname> view.
+          <literal>wal</literal>: Reset all the counters except timings shown
+          in the <structname>pg_stat_wal</structname> view. These timing
+          counters can be reset by calling pg_stat_reset_shared with 'io'.
          </para>
         </listitem>
         <listitem>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6526bd4f43..55da2d7812 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2275,38 +2275,22 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
 			Size		nbytes;
 			Size		nleft;
 			int			written;
-			instr_time	start;
+			instr_time	io_start;
 
 			/* OK to write the page(s) */
 			from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
 			nbytes = npages * (Size) XLOG_BLCKSZ;
 			nleft = nbytes;
+
+			io_start = pgstat_prepare_io_time();
 			do
 			{
 				errno = 0;
 
-				/* Measure I/O timing to write WAL data */
-				if (track_wal_io_timing)
-					INSTR_TIME_SET_CURRENT(start);
-				else
-					INSTR_TIME_SET_ZERO(start);
-
 				pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
 				written = pg_pwrite(openLogFile, from, nleft, startoffset);
 				pgstat_report_wait_end();
 
-				/*
-				 * Increment the I/O timing and the number of times WAL data
-				 * were written out to disk.
-				 */
-				if (track_wal_io_timing)
-				{
-					instr_time	end;
-
-					INSTR_TIME_SET_CURRENT(end);
-					INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_write_time, end, start);
-				}
-
 				PendingWalStats.wal_write++;
 
 				if (written <= 0)
@@ -2331,6 +2315,9 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
 				startoffset += written;
 			} while (nleft > 0);
 
+			pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL,
+									IOOP_WRITE, io_start, npages);
+
 			npages = 0;
 
 			/*
@@ -3037,6 +3024,7 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	int			fd;
 	int			save_errno;
 	int			open_flags = O_RDWR | O_CREAT | O_EXCL | PG_BINARY;
+	instr_time	io_start;
 
 	Assert(logtli != 0);
 
@@ -3080,6 +3068,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 				(errcode_for_file_access(),
 				 errmsg("could not create file \"%s\": %m", tmppath)));
 
+	/* start timing writes for stats */
+	io_start = pgstat_prepare_io_time();
+
 	pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_WRITE);
 	save_errno = 0;
 	if (wal_init_zero)
@@ -3115,6 +3106,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	}
 	pgstat_report_wait_end();
 
+	pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_INIT, IOOP_WRITE,
+							io_start, 1);
+
 	if (save_errno)
 	{
 		/*
@@ -3131,6 +3125,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 				 errmsg("could not write to file \"%s\": %m", tmppath)));
 	}
 
+	/* start timing fsyncs for stats */
+	io_start = pgstat_prepare_io_time();
+
 	pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_SYNC);
 	if (pg_fsync(fd) != 0)
 	{
@@ -3143,6 +3140,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	}
 	pgstat_report_wait_end();
 
+	pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_INIT,
+							IOOP_FSYNC, io_start, 1);
+
 	if (close(fd) != 0)
 		ereport(ERROR,
 				(errcode_for_file_access(),
@@ -8315,7 +8315,7 @@ void
 issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 {
 	char	   *msg = NULL;
-	instr_time	start;
+	instr_time	io_start;
 
 	Assert(tli != 0);
 
@@ -8328,11 +8328,7 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 		wal_sync_method == WAL_SYNC_METHOD_OPEN_DSYNC)
 		return;
 
-	/* Measure I/O timing to sync the WAL file */
-	if (track_wal_io_timing)
-		INSTR_TIME_SET_CURRENT(start);
-	else
-		INSTR_TIME_SET_ZERO(start);
+	io_start = pgstat_prepare_io_time();
 
 	pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC);
 	switch (wal_sync_method)
@@ -8376,18 +8372,8 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 
 	pgstat_report_wait_end();
 
-	/*
-	 * Increment the I/O timing and the number of times WAL files were synced.
-	 */
-	if (track_wal_io_timing)
-	{
-		instr_time	end;
-
-		INSTR_TIME_SET_CURRENT(end);
-		INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_sync_time, end, start);
-	}
-
-	PendingWalStats.wal_sync++;
+	pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_FSYNC,
+							io_start, 1);
 }
 
 /*
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index c61566666a..5557f846a0 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -60,6 +60,7 @@
 #include "utils/builtins.h"
 #include "utils/datetime.h"
 #include "utils/guc_hooks.h"
+#include "utils/pgstat_internal.h"
 #include "utils/pg_lsn.h"
 #include "utils/ps_status.h"
 #include "utils/pg_rusage.h"
@@ -1773,6 +1774,9 @@ PerformWalRecovery(void)
 			 */
 			ApplyWalRecord(xlogreader, record, &replayTLI);
 
+			/* Report pending statistics to the cumulative stats system  */
+			pgstat_flush_io(false);
+
 			/* Exit loop if we reached inclusive recovery target */
 			if (recoveryStopsAfter(xlogreader))
 			{
@@ -3248,6 +3252,7 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 	uint32		targetPageOff;
 	XLogSegNo	targetSegNo PG_USED_FOR_ASSERTS_ONLY;
 	int			r;
+	instr_time	io_start;
 
 	XLByteToSeg(targetPagePtr, targetSegNo, wal_segment_size);
 	targetPageOff = XLogSegmentOffset(targetPagePtr, wal_segment_size);
@@ -3340,6 +3345,8 @@ retry:
 	/* Read the requested page */
 	readOff = targetPageOff;
 
+	io_start = pgstat_prepare_io_time();
+
 	pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
 	r = pg_pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);
 	if (r != XLOG_BLCKSZ)
@@ -3368,6 +3375,9 @@ retry:
 	}
 	pgstat_report_wait_end();
 
+	pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_READ,
+							io_start, 1);
+
 	Assert(targetSegNo == readSegNo);
 	Assert(targetPageOff == readOff);
 	Assert(reqLen <= readLen);
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 490d5a9ab7..f2f544bb73 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -18,6 +18,7 @@
 
 #include "executor/instrument.h"
 #include "storage/bufmgr.h"
+#include "access/xlog.h"
 #include "utils/pgstat_internal.h"
 
 
@@ -87,24 +88,59 @@ pgstat_count_io_op_n(IOObject io_object, IOContext io_context, IOOp io_op, uint3
 	Assert((unsigned int) io_op < IOOP_NUM_TYPES);
 	Assert(pgstat_tracks_io_op(MyBackendType, io_object, io_context, io_op));
 
+	/*
+	 * Only count PendingWalStats.wal_sync in pg_stat_io because
+	 * PendingWalStats.wal_write could count different than what pg_stat_io
+	 * counts e.g. system calls. Also, PendingWalStats doesn't count WAL init
+	 * or WAL reads. So, they are not included too.
+	 */
+	if (io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL &&
+		io_op == IOOP_FSYNC)
+		PendingWalStats.wal_sync += cnt;
+
 	PendingIOStats.counts[io_object][io_context][io_op] += cnt;
 
 	have_iostats = true;
 }
 
+/*
+ * Prepares the io_time for pgstat_count_io_op_time() function. It needs to
+ * return current time if there is a chance that any 'time' can be tracked.
+ */
 instr_time
 pgstat_prepare_io_time(void)
 {
 	instr_time	io_start;
 
-	if (track_io_timing)
+	if (track_io_timing || track_wal_io_timing)
 		INSTR_TIME_SET_CURRENT(io_start);
 	else
+
+		/*
+		 * If the time won't be tracked, there is no need to set io_start but
+		 * compiler complains about uninitialized use. So, set it to zero.
+		 */
 		INSTR_TIME_SET_ZERO(io_start);
 
 	return io_start;
 }
 
+/*
+ * Decide if the io timing needs be tracked.
+ */
+bool
+pgstat_should_track_io_time(IOObject io_object, IOContext io_context)
+{
+	/*
+	 * io times of IOOBJECT_WAL IOObject needs to be tracked when
+	 * 'track_wal_io_timing' is set regardless of 'track_io_timing'.
+	 */
+	if (io_object == IOOBJECT_WAL)
+		return track_wal_io_timing;
+
+	return track_io_timing;
+}
+
 /*
  * Like pgstat_count_io_op_n() except it also accumulates time.
  */
@@ -112,14 +148,35 @@ void
 pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
 						instr_time start_time, uint32 cnt)
 {
-	if (track_io_timing)
+	/*
+	 * Accumulate different type of times here. We want to eventually
+	 * deduplicate these counters, so we are consolidating them first. This
+	 * also makes it easy to compare what is tracked for which stats or
+	 * instrumentation purpose.
+	 *
+	 * Some of the IO counters didn't moved here because they track at a
+	 * different level of granularity or at a different point in the call
+	 * stack. Also, some of the IO counters are moved to pgstat_count_io_op_n
+	 * function since they track only numbers not timings.
+	 *
+	 * pgstat_count_buffer is for pgstat_database. Since pg_stat_database only
+	 * counts blk_read_time and blk_write_time, it is set for IOOP_READ and
+	 * IOOP_WRITE.
+	 *
+	 * pgBufferUsage is for EXPLAIN. pgBufferUsage has only write and read
+	 * stats for shared/local and temporary blocks. Only shared/local blocks
+	 * are counted here.
+	 *
+	 * At the end of the if case, accumulate time for the pg_stat_io.
+	 */
+	if (pgstat_should_track_io_time(io_object, io_context))
 	{
 		instr_time	io_time;
 
 		INSTR_TIME_SET_CURRENT(io_time);
 		INSTR_TIME_SUBTRACT(io_time, start_time);
 
-		if (io_op == IOOP_WRITE || io_op == IOOP_EXTEND)
+		if (io_op == IOOP_EXTEND || io_op == IOOP_WRITE)
 		{
 			pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
 			if (io_object == IOOBJECT_RELATION)
@@ -140,6 +197,7 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
 					   io_time);
 	}
 
+	/* The IO timings are counted, now count the IO numbers */
 	pgstat_count_io_op_n(io_object, io_context, io_op, cnt);
 }
 
@@ -220,12 +278,33 @@ pgstat_get_io_context_name(IOContext io_context)
 			return "normal";
 		case IOCONTEXT_VACUUM:
 			return "vacuum";
+		case IOCONTEXT_INIT:
+			return "init";
 	}
 
 	elog(ERROR, "unrecognized IOContext value: %d", io_context);
 	pg_unreachable();
 }
 
+/*
+ * op_bytes can change according to IOObject and IOContext.
+ * Return BLCKSZ as default because most of the
+ * IOObject / IOContext uses BLCKSZ.
+ */
+int
+pgstat_get_io_op_bytes(IOObject io_object, IOContext io_context)
+{
+	if (io_object == IOOBJECT_WAL)
+	{
+		if (io_context == IOCONTEXT_NORMAL)
+			return XLOG_BLCKSZ;
+		else if (io_context == IOCONTEXT_INIT)
+			return wal_segment_size;
+	}
+
+	return BLCKSZ;
+}
+
 const char *
 pgstat_get_io_object_name(IOObject io_object)
 {
@@ -235,6 +314,8 @@ pgstat_get_io_object_name(IOObject io_object)
 			return "relation";
 		case IOOBJECT_TEMP_RELATION:
 			return "temp relation";
+		case IOOBJECT_WAL:
+			return "wal";
 	}
 
 	elog(ERROR, "unrecognized IOObject value: %d", io_object);
@@ -316,10 +397,10 @@ pgstat_tracks_io_bktype(BackendType bktype)
 		case B_INVALID:
 		case B_ARCHIVER:
 		case B_LOGGER:
-		case B_WAL_RECEIVER:
-		case B_WAL_WRITER:
 			return false;
 
+		case B_WAL_RECEIVER:
+		case B_WAL_WRITER:
 		case B_AUTOVAC_LAUNCHER:
 		case B_AUTOVAC_WORKER:
 		case B_BACKEND:
@@ -354,6 +435,15 @@ pgstat_tracks_io_object(BackendType bktype, IOObject io_object,
 	if (!pgstat_tracks_io_bktype(bktype))
 		return false;
 
+	/*
+	 * Currently, IO on IOOBJECT_WAL IOObject can only occur in the
+	 * IOCONTEXT_NORMAL and IOCONTEXT_INIT IOContext.
+	 */
+	if (io_object == IOOBJECT_WAL &&
+		(io_context != IOCONTEXT_NORMAL &&
+		 io_context != IOCONTEXT_INIT))
+		return false;
+
 	/*
 	 * Currently, IO on temporary relations can only occur in the
 	 * IOCONTEXT_NORMAL IOContext.
@@ -412,6 +502,7 @@ pgstat_tracks_io_op(BackendType bktype, IOObject io_object,
 					IOContext io_context, IOOp io_op)
 {
 	bool		strategy_io_context;
+	bool		no_wal_normal_read;
 
 	/* if (io_context, io_object) will never collect stats, we're done */
 	if (!pgstat_tracks_io_object(bktype, io_object, io_context))
@@ -428,6 +519,20 @@ pgstat_tracks_io_op(BackendType bktype, IOObject io_object,
 		 bktype == B_CHECKPOINTER) && io_op == IOOP_EXTEND)
 		return false;
 
+	/*
+	 * Some BackendTypes / IOObjects will not do certain IOOps.
+	 */
+	no_wal_normal_read = bktype == B_AUTOVAC_LAUNCHER ||
+		bktype == B_AUTOVAC_WORKER || bktype == B_BACKEND ||
+		bktype == B_BG_WORKER || bktype == B_BG_WRITER ||
+		bktype == B_CHECKPOINTER || bktype == B_WAL_RECEIVER ||
+		bktype == B_WAL_SENDER || bktype == B_WAL_WRITER;
+
+	if (no_wal_normal_read &&
+		(io_object == IOOBJECT_WAL &&
+		 io_op == IOOP_READ))
+		return false;
+
 	/*
 	 * Temporary tables are not logged and thus do not require fsync'ing.
 	 * Writeback is not requested for temporary tables.
@@ -452,6 +557,27 @@ pgstat_tracks_io_op(BackendType bktype, IOObject io_object,
 	if (!strategy_io_context && io_op == IOOP_REUSE)
 		return false;
 
+	/*
+	 * Some IOOps are not valid in certain IOContexts / IOObjects and some
+	 * IOOps are only valid in certain IOContexts / IOObjects.
+	 */
+
+	/*
+	 * In IOOBJECT_WAL io_object, IOCONTEXT_INIT io_context means operations
+	 * done while creating new WAL segments.
+	 */
+	if (io_object == IOOBJECT_WAL && io_context == IOCONTEXT_INIT &&
+		!(io_op == IOOP_WRITE || io_op == IOOP_FSYNC))
+		return false;
+
+	/*
+	 * In IOOBJECT_WAL io_object, IOCONTEXT_NORMAL io_context means operations
+	 * done on already created WAL segments.
+	 */
+	if (io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL &&
+		!(io_op == IOOP_WRITE || io_op == IOOP_READ || io_op == IOOP_FSYNC))
+		return false;
+
 	/*
 	 * IOOP_FSYNC IOOps done by a backend using a BufferAccessStrategy are
 	 * counted in the IOCONTEXT_NORMAL IOContext. See comment in
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 0cea320c00..36cba20856 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1377,6 +1377,7 @@ pg_stat_get_io(PG_FUNCTION_ARGS)
 			for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
 			{
 				const char *context_name = pgstat_get_io_context_name(io_context);
+				int			op_bytes;
 
 				Datum		values[IO_NUM_COLUMNS] = {0};
 				bool		nulls[IO_NUM_COLUMNS] = {0};
@@ -1395,12 +1396,11 @@ pg_stat_get_io(PG_FUNCTION_ARGS)
 				values[IO_COL_RESET_TIME] = TimestampTzGetDatum(reset_time);
 
 				/*
-				 * Hard-code this to the value of BLCKSZ for now. Future
-				 * values could include XLOG_BLCKSZ, once WAL IO is tracked,
-				 * and constant multipliers, once non-block-oriented IO (e.g.
-				 * temporary file IO) is tracked.
+				 * op_bytes can change according to IOObject and IOContext.
+				 * Get the correct op_bytes.
 				 */
-				values[IO_COL_CONVERSION] = Int64GetDatum(BLCKSZ);
+				op_bytes = pgstat_get_io_op_bytes(io_obj, io_context);
+				values[IO_COL_CONVERSION] = Int64GetDatum(op_bytes);
 
 				for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
 				{
@@ -1446,6 +1446,27 @@ pg_stat_get_io(PG_FUNCTION_ARGS)
 	return (Datum) 0;
 }
 
+/*
+ * Return total IOOp time by IOObject and IOContext.
+ */
+static double
+pg_stat_get_io_time(IOObject io_obj, IOContext io_context, IOOp io_op)
+{
+	double		sum_time = 0;
+	PgStat_IO  *backends_io_stats = pgstat_fetch_stat_io();
+
+	for (int bktype = 0; bktype < BACKEND_NUM_TYPES; bktype++)
+	{
+		PgStat_BktypeIO *bktype_stats = &backends_io_stats->stats[bktype];
+
+		if (!pgstat_tracks_io_bktype(bktype) || !pgstat_tracks_io_op(bktype, io_obj, io_context, io_op))
+			continue;
+
+		sum_time += bktype_stats->times[io_obj][io_context][io_op];
+	}
+	return pg_stat_us_to_ms(sum_time);
+}
+
 /*
  * Returns statistics of WAL activity
  */
@@ -1500,9 +1521,20 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
 	values[4] = Int64GetDatum(wal_stats->wal_write);
 	values[5] = Int64GetDatum(wal_stats->wal_sync);
 
-	/* Convert counters from microsec to millisec for display */
-	values[6] = Float8GetDatum(((double) wal_stats->wal_write_time) / 1000.0);
-	values[7] = Float8GetDatum(((double) wal_stats->wal_sync_time) / 1000.0);
+	/*
+	 * There is no need to calculate timings for both pg_stat_wal and
+	 * pg_stat_io. So, fetch timings from pg_stat_io to make stats gathering
+	 * cheaper. Note that, since timings are fetched from pg_stat_io;
+	 * pg_stat_reset_shared('io') will reset pg_stat_wal's timings too.
+	 *
+	 * Convert counters from microsec to millisec for display
+	 */
+	values[6] = Float8GetDatum(pg_stat_get_io_time(IOOBJECT_WAL,
+												   IOCONTEXT_NORMAL,
+												   IOOP_WRITE));
+	values[7] = Float8GetDatum(pg_stat_get_io_time(IOOBJECT_WAL,
+												   IOCONTEXT_NORMAL,
+												   IOOP_FSYNC));
 
 	values[8] = TimestampTzGetDatum(wal_stats->stat_reset_timestamp);
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index f95d8db0c4..1d9428b3a8 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -276,9 +276,10 @@ typedef enum IOObject
 {
 	IOOBJECT_RELATION,
 	IOOBJECT_TEMP_RELATION,
+	IOOBJECT_WAL,
 } IOObject;
 
-#define IOOBJECT_NUM_TYPES (IOOBJECT_TEMP_RELATION + 1)
+#define IOOBJECT_NUM_TYPES (IOOBJECT_WAL + 1)
 
 typedef enum IOContext
 {
@@ -286,9 +287,10 @@ typedef enum IOContext
 	IOCONTEXT_BULKWRITE,
 	IOCONTEXT_NORMAL,
 	IOCONTEXT_VACUUM,
+	IOCONTEXT_INIT,
 } IOContext;
 
-#define IOCONTEXT_NUM_TYPES (IOCONTEXT_VACUUM + 1)
+#define IOCONTEXT_NUM_TYPES (IOCONTEXT_INIT + 1)
 
 typedef enum IOOp
 {
@@ -520,10 +522,12 @@ extern bool pgstat_bktype_io_stats_valid(PgStat_BktypeIO *backend_io,
 extern void pgstat_count_io_op(IOObject io_object, IOContext io_context, IOOp io_op);
 extern void pgstat_count_io_op_n(IOObject io_object, IOContext io_context, IOOp io_op, uint32 cnt);
 extern instr_time pgstat_prepare_io_time(void);
+extern bool pgstat_should_track_io_time(IOObject io_object, IOContext io_context);
 extern void pgstat_count_io_op_time(IOObject io_object, IOContext io_context,
 									IOOp io_op, instr_time start_time, uint32 cnt);
 
 extern PgStat_IO *pgstat_fetch_stat_io(void);
+extern int	pgstat_get_io_op_bytes(IOObject io_object, IOContext io_context);
 extern const char *pgstat_get_io_context_name(IOContext io_context);
 extern const char *pgstat_get_io_object_name(IOObject io_object);
 
-- 
2.43.0

v5-0004-Add-IOOBJECT_WAL-IOCONTEXT_NORMAL-read-tests.patch (text/x-diff; charset=US-ASCII)

From b0a54f44e4b5699d62c071d87619f19084e5475a Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Wed, 29 Nov 2023 15:30:03 +0300
Subject: [PATCH v5 4/4] Add IOOBJECT_WAL / IOCONTEXT_NORMAL / read tests

---
 src/test/regress/expected/stats.out | 12 ++++++++++++
 src/test/regress/sql/stats.sql      |  8 ++++++++
 2 files changed, 20 insertions(+)

diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index 4adda9e479..7f5340cd7e 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -881,6 +881,18 @@ SELECT current_setting('fsync') = 'off'
  t
 (1 row)
 
+-- Test pg_stat_io IOOBJECT_WAL / IOCONTEXT_NORMAL / read.
+-- When the servers starts, StartupXLOG function must be called by postmaster
+-- or standalone-backend startup and WAL read must be done.
+-- So, check these before stats get resetted.
+SELECT SUM(reads) > 0
+  FROM pg_stat_io
+  WHERE context = 'normal' and object = 'wal';
+ ?column? 
+----------
+ t
+(1 row)
+
 -----
 -- Test that resetting stats works for reset timestamp
 -----
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index 72e864a0d2..b13f17f4d2 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -452,6 +452,14 @@ SELECT :io_sum_wal_init_writes > 0;
 SELECT current_setting('fsync') = 'off'
   OR :io_sum_wal_init_fsyncs > 0;
 
+-- Test pg_stat_io IOOBJECT_WAL / IOCONTEXT_NORMAL / read.
+-- When the servers starts, StartupXLOG function must be called by postmaster
+-- or standalone-backend startup and WAL read must be done.
+-- So, check these before stats get resetted.
+SELECT SUM(reads) > 0
+  FROM pg_stat_io
+  WHERE context = 'normal' and object = 'wal';
+
 -----
 -- Test that resetting stats works for reset timestamp
 -----
-- 
2.43.0

#23

Michael Paquier

michael@paquier.xyz

about 2 years ago

In reply to: Nazir Bilal Yavuz (#22)

Re: Show WAL write and fsync stats in pg_stat_io

On Fri, Dec 01, 2023 at 12:02:05PM +0300, Nazir Bilal Yavuz wrote:

Thanks for all the feedback. I am sharing the new version of the patchset.

- I didn't move 'PendingWalStats.wal_sync' out from the
'pgstat_count_io_op_n' function because they count the same thing
(block vs system calls) but I agree that this doesn't look good.

-       if (io_op == IOOP_WRITE || io_op == IOOP_EXTEND)
+       if (io_op == IOOP_EXTEND || io_op == IOOP_WRITE)

Unrelated diff.

+   if (io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL &&
+       io_op == IOOP_FSYNC)
+       PendingWalStats.wal_sync += cnt;

Nah, I really don't think that adding this dependency within
pg_stat_io is a good idea.

-   PendingWalStats.wal_sync++;
+   pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_FSYNC,
+                           io_start, 1);

This is the only caller where this matters, and the count is always 1.

+	no_wal_normal_read = bktype == B_AUTOVAC_LAUNCHER ||
+		bktype == B_AUTOVAC_WORKER || bktype == B_BACKEND ||
+		bktype == B_BG_WORKER || bktype == B_BG_WRITER ||
+		bktype == B_CHECKPOINTER || bktype == B_WAL_RECEIVER ||
+		bktype == B_WAL_SENDER || bktype == B_WAL_WRITER;
+
+	if (no_wal_normal_read &&
+		(io_object == IOOBJECT_WAL &&
+		 io_op == IOOP_READ))
+		return false;

This may be more readable as a switch on the backend type enum, without
a default clause so that it is not forgotten if a new type is added,
perhaps in its own little routine.
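
A minimal sketch of one way to read this suggestion, assuming it means a switch over the backend type with no default clause, placed in a small helper. The enum and function below (DemoBackendType, demo_bktype_reads_wal) are hypothetical stand-ins, not PostgreSQL's real BackendType or any routine from the patch, and the split between the two groups is only illustrative:

```c
#include <stdbool.h>

/* Hypothetical, trimmed-down stand-in for PostgreSQL's BackendType enum. */
typedef enum DemoBackendType
{
	DEMO_B_BACKEND,
	DEMO_B_CHECKPOINTER,
	DEMO_B_WAL_WRITER,
	DEMO_B_STANDALONE_BACKEND,
	DEMO_B_STARTUP
} DemoBackendType;

/*
 * Returns true if this backend type is expected to read WAL in the
 * normal context.  There is deliberately no default clause: with
 * warnings such as GCC's -Wswitch enabled, adding a new enum value
 * without handling it here produces a compiler warning.
 */
bool
demo_bktype_reads_wal(DemoBackendType bktype)
{
	switch (bktype)
	{
		case DEMO_B_BACKEND:
		case DEMO_B_CHECKPOINTER:
		case DEMO_B_WAL_WRITER:
			return false;

		case DEMO_B_STANDALONE_BACKEND:
		case DEMO_B_STARTUP:
			return true;
	}

	/* Only reached for values outside the enum; keeps the compiler quiet. */
	return false;
}
```

Compared with a long chain of `bktype == ...` comparisons, the switch makes each backend type an explicit case, so a forgotten type shows up at compile time rather than as silently wrong stats.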

-   if (track_io_timing)
+   if (track_io_timing || track_wal_io_timing)
        INSTR_TIME_SET_CURRENT(io_start);
    else

This interface for pgstat_prepare_io_time() is not really good,
because we could end up setting io_start in the existing code paths
calling this routine even if track_io_timing is false when
track_wal_io_timing is true. Why not change this interface a bit
and pass down a GUC (track_io_timing or track_wal_io_timing) as an
argument of the function, depending on what we expect to trigger the
timings?
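
A rough sketch of the interface shape proposed here, assuming the caller passes in whichever GUC should gate the timing. The names (demo_instr_time, demo_prepare_io_time, demo_time_is_zero) are hypothetical stand-ins rather than PostgreSQL's instr_time machinery, and clock_gettime() is used only to keep the example self-contained:

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L
#endif

#include <stdbool.h>
#include <time.h>

/* Hypothetical stand-in for PostgreSQL's instr_time. */
typedef struct demo_instr_time
{
	struct timespec ts;
} demo_instr_time;

/*
 * Capture a start time only if the GUC passed by the caller is enabled.
 * Relation I/O callers would pass track_io_timing and WAL callers would
 * pass track_wal_io_timing, so enabling one GUC never triggers timing
 * in code paths governed by the other.
 */
demo_instr_time
demo_prepare_io_time(bool track_io_guc)
{
	demo_instr_time io_start;

	if (track_io_guc)
		clock_gettime(CLOCK_MONOTONIC, &io_start.ts);
	else
	{
		/* A zeroed value signals "timing disabled" to the counting side. */
		io_start.ts.tv_sec = 0;
		io_start.ts.tv_nsec = 0;
	}

	return io_start;
}

/* True if the start time was never set, i.e. timing was disabled. */
bool
demo_time_is_zero(demo_instr_time t)
{
	return t.ts.tv_sec == 0 && t.ts.tv_nsec == 0;
}
```

A WAL write path would then call something like `demo_prepare_io_time(track_wal_io_timing)`, while the existing relation I/O callers keep passing track_io_timing.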

-	/* Convert counters from microsec to millisec for display */
-	values[6] = Float8GetDatum(((double) wal_stats->wal_write_time) / 1000.0);
-	values[7] = Float8GetDatum(((double) wal_stats->wal_sync_time) / 1000.0);
+	/*
+	 * There is no need to calculate timings for both pg_stat_wal and
+	 * pg_stat_io. So, fetch timings from pg_stat_io to make stats gathering
+	 * cheaper. Note that, since timings are fetched from pg_stat_io;
+	 * pg_stat_reset_shared('io') will reset pg_stat_wal's timings too.
+	 *
+	 * Convert counters from microsec to millisec for display
+	 */
+	values[6] = Float8GetDatum(pg_stat_get_io_time(IOOBJECT_WAL,
+												   IOCONTEXT_NORMAL,
+												   IOOP_WRITE));
+	values[7] = Float8GetDatum(pg_stat_get_io_time(IOOBJECT_WAL,
+												   IOCONTEXT_NORMAL,
+												   IOOP_FSYNC));

Perhaps it is simpler to remove these columns from pg_stat_get_wal()
and plug an SQL upgrade to the view definition of pg_stat_wal?

+int
+pgstat_get_io_op_bytes(IOObject io_object, IOContext io_context)

This interface looks like a good idea even if there is only one
caller.

Finding a good balance between the subroutines, the two GUCs, the
contexts, the I/O operation type and the objects is the tricky part of
this patch. If the dependency to PendingWalStats is removed and if
the interface of pgstat_prepare_io_time is improved, things are a bit
cleaner, but it feels like we could do more.. Nya.
--
Michael

#24

Nazir Bilal Yavuz

byavuz81@gmail.com

about 2 years ago

In reply to: Michael Paquier (#23)

6 attachment(s)

Re: Show WAL write and fsync stats in pg_stat_io

Hi,

Thanks for the feedback! The new version of the patch is attached.

On Tue, 5 Dec 2023 at 09:16, Michael Paquier <michael@paquier.xyz> wrote:

-       if (io_op == IOOP_WRITE || io_op == IOOP_EXTEND)
+       if (io_op == IOOP_EXTEND || io_op == IOOP_WRITE)

Unrelated diff.

Done.

+   if (io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL &&
+       io_op == IOOP_FSYNC)
+       PendingWalStats.wal_sync += cnt;

Nah, I really don't think that adding this dependency within
pg_stat_io is a good idea.

-   PendingWalStats.wal_sync++;
+   pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_FSYNC,
+                           io_start, 1);

This is the only caller where this matters, and the count is always 1.

I reverted that; pgstat_count_io_op_n no longer counts
PendingWalStats.wal_sync.
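
As an illustration only, the decoupling described above can be sketched as two independent counters that the fsync caller updates itself, so the generic I/O counting routine knows nothing about WAL stats. All names below (DemoWalStats, DemoIoStats, demo_count_io_fsync, demo_issue_wal_fsync) are hypothetical and do not reproduce the patch's code:

```c
#include <stdint.h>

/* Hypothetical stand-ins for the pending WAL and pending I/O counters. */
typedef struct DemoWalStats
{
	int64_t		wal_sync;
} DemoWalStats;

typedef struct DemoIoStats
{
	int64_t		fsyncs;
} DemoIoStats;

DemoWalStats demo_pending_wal = {0};
DemoIoStats demo_pending_io = {0};

/* Generic I/O counting: touches only the I/O stats, never the WAL stats. */
void
demo_count_io_fsync(uint32_t cnt)
{
	demo_pending_io.fsyncs += cnt;
}

/*
 * The WAL fsync caller.  It bumps the WAL counter itself and then reports
 * the same operation to the generic I/O stats, so the dependency lives in
 * the one place that knows it is syncing WAL.
 */
void
demo_issue_wal_fsync(void)
{
	/* The actual fsync of the WAL file would happen here. */
	demo_pending_wal.wal_sync++;
	demo_count_io_fsync(1);
}
```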

+       no_wal_normal_read = bktype == B_AUTOVAC_LAUNCHER ||
+               bktype == B_AUTOVAC_WORKER || bktype == B_BACKEND ||
+               bktype == B_BG_WORKER || bktype == B_BG_WRITER ||
+               bktype == B_CHECKPOINTER || bktype == B_WAL_RECEIVER ||
+               bktype == B_WAL_SENDER || bktype == B_WAL_WRITER;
+
+       if (no_wal_normal_read &&
+               (io_object == IOOBJECT_WAL &&
+                io_op == IOOP_READ))
+               return false;

This may be more readable as a switch on the backend type enum, without
a default clause so that it is not forgotten if a new type is added,
perhaps in its own little routine.

Done.

-   if (track_io_timing)
+   if (track_io_timing || track_wal_io_timing)
        INSTR_TIME_SET_CURRENT(io_start);
    else

This interface for pgstat_prepare_io_time() is not really good,
because we could end up setting io_start in the existing code paths
calling this routine even if track_io_timing is false when
track_wal_io_timing is true. Why not change this interface a bit
and pass down a GUC (track_io_timing or track_wal_io_timing) as an
argument of the function, depending on what we expect to trigger the
timings?

Done in 0001.

-       /* Convert counters from microsec to millisec for display */
-       values[6] = Float8GetDatum(((double) wal_stats->wal_write_time) / 1000.0);
-       values[7] = Float8GetDatum(((double) wal_stats->wal_sync_time) / 1000.0);
+       /*
+        * There is no need to calculate timings for both pg_stat_wal and
+        * pg_stat_io. So, fetch timings from pg_stat_io to make stats gathering
+        * cheaper. Note that, since timings are fetched from pg_stat_io;
+        * pg_stat_reset_shared('io') will reset pg_stat_wal's timings too.
+        *
+        * Convert counters from microsec to millisec for display
+        */
+       values[6] = Float8GetDatum(pg_stat_get_io_time(IOOBJECT_WAL,
+                                                                                                  IOCONTEXT_NORMAL,
+                                                                                                  IOOP_WRITE));
+       values[7] = Float8GetDatum(pg_stat_get_io_time(IOOBJECT_WAL,
+                                                                                                  IOCONTEXT_NORMAL,
+                                                                                                  IOOP_FSYNC));

Perhaps it is simpler to remove these columns from pg_stat_get_wal()
and plug an SQL upgrade to the view definition of pg_stat_wal?

Done in 0003 but I am not sure if that is what you expected.

Finding a good balance between the subroutines, the two GUCs, the
contexts, the I/O operation type and the objects is the tricky part of
this patch. If the dependency to PendingWalStats is removed and if
the interface of pgstat_prepare_io_time is improved, things are a bit
cleaner, but it feels like we could do more.. Nya.

I agree. The patch is not logically complicated, but it is hard to
choose the best approach.

Any kind of feedback would be appreciated.

--
Regards,
Nazir Bilal Yavuz
Microsoft

Attachments:

v6-0006-Add-IOOBJECT_WAL-IOCONTEXT_NORMAL-read-tests.patch (application/x-patch)

From b7bf7b92fa274775136314ecfde90fa32ed435cb Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Wed, 29 Nov 2023 15:30:03 +0300
Subject: [PATCH v6 6/6] Add IOOBJECT_WAL / IOCONTEXT_NORMAL / read tests

---
 src/test/regress/expected/stats.out | 12 ++++++++++++
 src/test/regress/sql/stats.sql      |  8 ++++++++
 2 files changed, 20 insertions(+)

diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index 4adda9e479..7f5340cd7e 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -881,6 +881,18 @@ SELECT current_setting('fsync') = 'off'
  t
 (1 row)
 
+-- Test pg_stat_io IOOBJECT_WAL / IOCONTEXT_NORMAL / read.
+-- When the servers starts, StartupXLOG function must be called by postmaster
+-- or standalone-backend startup and WAL read must be done.
+-- So, check these before stats get resetted.
+SELECT SUM(reads) > 0
+  FROM pg_stat_io
+  WHERE context = 'normal' and object = 'wal';
+ ?column? 
+----------
+ t
+(1 row)
+
 -----
 -- Test that resetting stats works for reset timestamp
 -----
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index 72e864a0d2..b13f17f4d2 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -452,6 +452,14 @@ SELECT :io_sum_wal_init_writes > 0;
 SELECT current_setting('fsync') = 'off'
   OR :io_sum_wal_init_fsyncs > 0;
 
+-- Test pg_stat_io IOOBJECT_WAL / IOCONTEXT_NORMAL / read.
+-- When the servers starts, StartupXLOG function must be called by postmaster
+-- or standalone-backend startup and WAL read must be done.
+-- So, check these before stats get resetted.
+SELECT SUM(reads) > 0
+  FROM pg_stat_io
+  WHERE context = 'normal' and object = 'wal';
+
 -----
 -- Test that resetting stats works for reset timestamp
 -----
-- 
2.43.0

v6-0003-Fetch-pg_stat_wal-s-timings-from-pg_stat_io.patch (application/x-patch)

From afcc55a01f38ab79d7dd17d68e7b1d5c4be41d8d Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Tue, 12 Dec 2023 11:17:42 +0300
Subject: [PATCH v6 3/6] Fetch pg_stat_wal's timings from pg_stat_io

To avoid calculating WAL timings for both the pg_stat_wal and pg_stat_io
views, the pg_stat_wal view's WAL timings are fetched from pg_stat_io.
Since these timings are fetched from pg_stat_io,
pg_stat_reset_shared('io') will reset pg_stat_wal's timings too.
---
 doc/src/sgml/monitoring.sgml         |  9 ++++++---
 src/backend/catalog/system_views.sql | 15 ++++++++++++---
 src/backend/utils/adt/pgstatfuncs.c  | 14 +++-----------
 src/include/catalog/pg_proc.dat      |  6 +++---
 src/test/regress/expected/rules.out  | 27 +++++++++++++++++----------
 5 files changed, 41 insertions(+), 30 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 0450f91ccb..679c527f46 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -4753,7 +4753,9 @@ description | Waiting for a newly initialized WAL file to reach durable storage
         <listitem>
          <para>
           <literal>io</literal>: Reset all the counters shown in the
-          <structname>pg_stat_io</structname> view.
+          <structname>pg_stat_io</structname> view. Note that this will
+          also reset the timing counters of the
+          <structname>pg_stat_wal</structname> view.
          </para>
         </listitem>
         <listitem>
@@ -4770,8 +4772,9 @@ description | Waiting for a newly initialized WAL file to reach durable storage
         </listitem>
         <listitem>
          <para>
-          <literal>wal</literal>: Reset all the counters shown in the
-          <structname>pg_stat_wal</structname> view.
+          <literal>wal</literal>: Reset all the counters except timings shown
+          in the <structname>pg_stat_wal</structname> view. These timing
+          counters can be reset by calling pg_stat_reset_shared with 'io'.
          </para>
         </listitem>
         <listitem>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 11d18ed9dd..91c6acf754 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1169,6 +1169,13 @@ SELECT
 FROM pg_stat_get_io() b;
 
 CREATE VIEW pg_stat_wal AS
+    WITH pgsio_sum_write_fsync_time AS (
+        SELECT
+            sum(write_time) as sum_write_time,
+            sum(fsync_time) as sum_fsync_time
+        FROM pg_stat_io
+        WHERE context = 'normal' and object = 'wal'
+    )
     SELECT
         w.wal_records,
         w.wal_fpi,
@@ -1176,10 +1183,12 @@ CREATE VIEW pg_stat_wal AS
         w.wal_buffers_full,
         w.wal_write,
         w.wal_sync,
-        w.wal_write_time,
-        w.wal_sync_time,
+        p.sum_write_time as write_time,
+        p.sum_fsync_time as fsync_time,
         w.stats_reset
-    FROM pg_stat_get_wal() w;
+    FROM pg_stat_get_wal() w
+    CROSS JOIN pgsio_sum_write_fsync_time p;
+
 
 CREATE VIEW pg_stat_progress_analyze AS
     SELECT
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 8d14a4183c..96e318bf69 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1452,7 +1452,7 @@ pg_stat_get_io(PG_FUNCTION_ARGS)
 Datum
 pg_stat_get_wal(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_COLS	9
+#define PG_STAT_GET_WAL_COLS	7
 	TupleDesc	tupdesc;
 	Datum		values[PG_STAT_GET_WAL_COLS] = {0};
 	bool		nulls[PG_STAT_GET_WAL_COLS] = {0};
@@ -1473,11 +1473,7 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
 					   INT8OID, -1, 0);
 	TupleDescInitEntry(tupdesc, (AttrNumber) 6, "wal_sync",
 					   INT8OID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 7, "wal_write_time",
-					   FLOAT8OID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 8, "wal_sync_time",
-					   FLOAT8OID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 9, "stats_reset",
+	TupleDescInitEntry(tupdesc, (AttrNumber) 7, "stats_reset",
 					   TIMESTAMPTZOID, -1, 0);
 
 	BlessTupleDesc(tupdesc);
@@ -1500,11 +1496,7 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
 	values[4] = Int64GetDatum(wal_stats->wal_write);
 	values[5] = Int64GetDatum(wal_stats->wal_sync);
 
-	/* Convert counters from microsec to millisec for display */
-	values[6] = Float8GetDatum(((double) wal_stats->wal_write_time) / 1000.0);
-	values[7] = Float8GetDatum(((double) wal_stats->wal_sync_time) / 1000.0);
-
-	values[8] = TimestampTzGetDatum(wal_stats->stat_reset_timestamp);
+	values[6] = TimestampTzGetDatum(wal_stats->stat_reset_timestamp);
 
 	/* Returns the record as Datum */
 	PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 77e8b13764..38050cbb09 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5770,9 +5770,9 @@
 { oid => '1136', descr => 'statistics: information about WAL activity',
   proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
   proparallel => 'r', prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int8,int8,numeric,int8,int8,int8,float8,float8,timestamptz}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o}',
-  proargnames => '{wal_records,wal_fpi,wal_bytes,wal_buffers_full,wal_write,wal_sync,wal_write_time,wal_sync_time,stats_reset}',
+  proallargtypes => '{int8,int8,numeric,int8,int8,int8,timestamptz}',
+  proargmodes => '{o,o,o,o,o,o,o}',
+  proargnames => '{wal_records,wal_fpi,wal_bytes,wal_buffers_full,wal_write,wal_sync,stats_reset}',
   prosrc => 'pg_stat_get_wal' },
 { oid => '6248', descr => 'statistics: information about WAL prefetching',
   proname => 'pg_stat_get_recovery_prefetch', prorows => '1', proretset => 't',
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 05070393b9..d3f6301c9d 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2222,16 +2222,23 @@ pg_stat_user_tables| SELECT relid,
     autoanalyze_count
    FROM pg_stat_all_tables
   WHERE ((schemaname <> ALL (ARRAY['pg_catalog'::name, 'information_schema'::name])) AND (schemaname !~ '^pg_toast'::text));
-pg_stat_wal| SELECT wal_records,
-    wal_fpi,
-    wal_bytes,
-    wal_buffers_full,
-    wal_write,
-    wal_sync,
-    wal_write_time,
-    wal_sync_time,
-    stats_reset
-   FROM pg_stat_get_wal() w(wal_records, wal_fpi, wal_bytes, wal_buffers_full, wal_write, wal_sync, wal_write_time, wal_sync_time, stats_reset);
+pg_stat_wal| WITH pgsio_sum_write_fsync_time AS (
+         SELECT sum(pg_stat_io.write_time) AS sum_write_time,
+            sum(pg_stat_io.fsync_time) AS sum_fsync_time
+           FROM pg_stat_io
+          WHERE ((pg_stat_io.context = 'normal'::text) AND (pg_stat_io.object = 'wal'::text))
+        )
+ SELECT w.wal_records,
+    w.wal_fpi,
+    w.wal_bytes,
+    w.wal_buffers_full,
+    w.wal_write,
+    w.wal_sync,
+    p.sum_write_time AS write_time,
+    p.sum_fsync_time AS fsync_time,
+    w.stats_reset
+   FROM (pg_stat_get_wal() w(wal_records, wal_fpi, wal_bytes, wal_buffers_full, wal_write, wal_sync, stats_reset)
+     CROSS JOIN pgsio_sum_write_fsync_time p);
 pg_stat_wal_receiver| SELECT pid,
     status,
     receive_start_lsn,
-- 
2.43.0

v6-0005-Add-IOOBJECT_WAL-IOCONTEXT_INIT-write-and-fsync-t.patch (application/x-patch)

From 6367c92c957b05ad4a098c35e0df9ef51b2126c6 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Fri, 1 Dec 2023 10:03:21 +0300
Subject: [PATCH v6 5/6] Add IOOBJECT_WAL / IOCONTEXT_INIT / write and fsync
 tests

---
 src/test/regress/expected/stats.out | 19 +++++++++++++++++++
 src/test/regress/sql/stats.sql      | 10 ++++++++++
 2 files changed, 29 insertions(+)

diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index 4d3a515bdd..4adda9e479 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -862,6 +862,25 @@ WHERE pg_stat_get_backend_pid(beid) = pg_backend_pid();
  t
 (1 row)
 
+-- Test pg_stat_io IOOBJECT_WAL / IOCONTEXT_INIT / write and fsync.
+-- When the servers starts, the initial WAL file must be created,
+-- so check these stats before stats get resetted.
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'init' and object = 'wal' \gset io_sum_wal_init_
+SELECT :io_sum_wal_init_writes > 0;
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT current_setting('fsync') = 'off'
+  OR :io_sum_wal_init_fsyncs > 0;
+ ?column? 
+----------
+ t
+(1 row)
+
 -----
 -- Test that resetting stats works for reset timestamp
 -----
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index aa48e65dc8..72e864a0d2 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -442,6 +442,16 @@ SELECT (current_schemas(true))[1] = ('pg_temp_' || beid::text) AS match
 FROM pg_stat_get_backend_idset() beid
 WHERE pg_stat_get_backend_pid(beid) = pg_backend_pid();
 
+-- Test pg_stat_io IOOBJECT_WAL / IOCONTEXT_INIT / write and fsync.
+-- When the servers starts, the initial WAL file must be created,
+-- so check these stats before stats get resetted.
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'init' and object = 'wal' \gset io_sum_wal_init_
+SELECT :io_sum_wal_init_writes > 0;
+SELECT current_setting('fsync') = 'off'
+  OR :io_sum_wal_init_fsyncs > 0;
+
 -----
 -- Test that resetting stats works for reset timestamp
 -----
-- 
2.43.0

v6-0004-Add-IOOBJECT_WAL-IOCONTEXT_NORMAL-write-and-fsync.patch (application/x-patch)

From c328c9b8862e600328b4d91592031c393cc27464 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Fri, 10 Nov 2023 14:52:22 +0300
Subject: [PATCH v6 4/6] Add IOOBJECT_WAL / IOCONTEXT_NORMAL / write and fsync
 tests

---
 src/test/regress/expected/stats.out | 26 ++++++++++++++++++++++++++
 src/test/regress/sql/stats.sql      | 11 +++++++++++
 2 files changed, 37 insertions(+)

diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index 346e10a3d2..4d3a515bdd 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -1255,6 +1255,7 @@ SELECT pg_stat_get_subscription_stats(NULL);
 -- - extends of relations using shared buffers
 -- - fsyncs done to ensure the durability of data dirtying shared buffers
 -- - shared buffer hits
+-- - WAL writes and fsyncs in IOContext IOCONTEXT_NORMAL
 -- There is no test for blocks evicted from shared buffers, because we cannot
 -- be sure of the state of shared buffers at the point the test is run.
 -- Create a regular table and insert some data to generate IOCONTEXT_NORMAL
@@ -1264,6 +1265,9 @@ SELECT sum(extends) AS io_sum_shared_before_extends
 SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
   FROM pg_stat_io
   WHERE object = 'relation' \gset io_sum_shared_before_
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'normal' and object = 'wal' \gset io_sum_wal_normal_before_
 CREATE TABLE test_io_shared(a int);
 INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
 SELECT pg_stat_force_next_flush();
@@ -1301,6 +1305,28 @@ SELECT current_setting('fsync') = 'off'
  t
 (1 row)
 
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'normal' and object = 'wal' \gset io_sum_wal_normal_after_
+SELECT current_setting('synchronous_commit') = 'on';
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT :io_sum_wal_normal_after_writes > :io_sum_wal_normal_before_writes;
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT current_setting('fsync') = 'off'
+  OR :io_sum_wal_normal_after_fsyncs > :io_sum_wal_normal_before_fsyncs;
+ ?column? 
+----------
+ t
+(1 row)
+
 -- Change the tablespace so that the table is rewritten directly, then SELECT
 -- from it to cause it to be read back into shared buffers.
 SELECT sum(reads) AS io_sum_shared_before_reads
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index e3b4ca96e8..aa48e65dc8 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -601,6 +601,7 @@ SELECT pg_stat_get_subscription_stats(NULL);
 -- - extends of relations using shared buffers
 -- - fsyncs done to ensure the durability of data dirtying shared buffers
 -- - shared buffer hits
+-- - WAL writes and fsyncs in IOContext IOCONTEXT_NORMAL
 
 -- There is no test for blocks evicted from shared buffers, because we cannot
 -- be sure of the state of shared buffers at the point the test is run.
@@ -612,6 +613,9 @@ SELECT sum(extends) AS io_sum_shared_before_extends
 SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
   FROM pg_stat_io
   WHERE object = 'relation' \gset io_sum_shared_before_
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'normal' and object = 'wal' \gset io_sum_wal_normal_before_
 CREATE TABLE test_io_shared(a int);
 INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
 SELECT pg_stat_force_next_flush();
@@ -630,6 +634,13 @@ SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
 SELECT :io_sum_shared_after_writes > :io_sum_shared_before_writes;
 SELECT current_setting('fsync') = 'off'
   OR :io_sum_shared_after_fsyncs > :io_sum_shared_before_fsyncs;
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'normal' and object = 'wal' \gset io_sum_wal_normal_after_
+SELECT current_setting('synchronous_commit') = 'on';
+SELECT :io_sum_wal_normal_after_writes > :io_sum_wal_normal_before_writes;
+SELECT current_setting('fsync') = 'off'
+  OR :io_sum_wal_normal_after_fsyncs > :io_sum_wal_normal_before_fsyncs;
 
 -- Change the tablespace so that the table is rewritten directly, then SELECT
 -- from it to cause it to be read back into shared buffers.
-- 
2.43.0

v6-0002-Show-WAL-stats-on-pg_stat_io-except-streaming-rep.patch (application/x-patch)

From ee53af3bfaf4582a940c0b2bd93f5b98739ef423 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Tue, 12 Dec 2023 10:40:02 +0300
Subject: [PATCH v6 2/6] Show WAL stats on pg_stat_io (except streaming
 replication)

This patch aims to show WAL stats per backend type in the pg_stat_io view.

With this patch, the pg_stat_io view shows how many WAL operations each
backend type performs, along with their context, type and total timings.

This patch introduces a new IOContext, IOCONTEXT_INIT, for IO
operations done while creating new objects. Currently, it is used only
together with IOObject IOOBJECT_WAL.

IOOBJECT_WAL means IO operations related to WAL.
IOOBJECT_WAL / IOCONTEXT_NORMAL means IO operations done on already
created WAL segments.
IOOBJECT_WAL / IOCONTEXT_INIT means IO operations done while creating
the WAL segments.

This patch currently covers:
- Documentation
- IOOBJECT_WAL / IOCONTEXT_NORMAL / read, write and fsync stats on
  pg_stat_io.
- IOOBJECT_WAL / IOCONTEXT_INIT / write and fsync stats on pg_stat_io.

It does not cover:
- Streaming replication WAL IO.
---
 doc/src/sgml/monitoring.sgml              |  19 +++-
 src/backend/access/transam/xlog.c         |  58 ++++------
 src/backend/access/transam/xlogrecovery.c |  10 ++
 src/backend/utils/activity/pgstat_io.c    | 124 +++++++++++++++++++++-
 src/backend/utils/adt/pgstatfuncs.c       |  10 +-
 src/include/pgstat.h                      |  10 +-
 6 files changed, 182 insertions(+), 49 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 42509042ad..0450f91ccb 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2499,9 +2499,10 @@ description | Waiting for a newly initialized WAL file to reach durable storage
   </para>
 
   <para>
-   Currently, I/O on relations (e.g. tables, indexes) is tracked. However,
-   relation I/O which bypasses shared buffers (e.g. when moving a table from one
-   tablespace to another) is currently not tracked.
+   Currently, I/O on relations (e.g. tables, indexes) and WAL activities are
+   tracked. However, relation I/O which bypasses shared buffers
+   (e.g. when moving a table from one tablespace to another) is currently
+   not tracked.
   </para>
 
   <table id="pg-stat-io-view" xreflabel="pg_stat_io">
@@ -2554,6 +2555,11 @@ description | Waiting for a newly initialized WAL file to reach durable storage
           <literal>temp relation</literal>: Temporary relations.
          </para>
         </listitem>
+        <listitem>
+         <para>
+          <literal>wal</literal>: Write Ahead Logs.
+         </para>
+        </listitem>
        </itemizedlist>
        </para>
       </entry>
@@ -2578,6 +2584,13 @@ description | Waiting for a newly initialized WAL file to reach durable storage
           <literal>normal</literal>.
          </para>
         </listitem>
+        <listitem>
+         <para>
+          <literal>init</literal>: I/O operations performed while creating the
+          WAL segments are tracked in <varname>context</varname>
+          <literal>init</literal>.
+         </para>
+        </listitem>
         <listitem>
          <para>
           <literal>vacuum</literal>: I/O operations performed outside of shared
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 01e0484584..6f7149084f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2277,38 +2277,22 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
 			Size		nbytes;
 			Size		nleft;
 			int			written;
-			instr_time	start;
+			instr_time	io_start;
 
 			/* OK to write the page(s) */
 			from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
 			nbytes = npages * (Size) XLOG_BLCKSZ;
 			nleft = nbytes;
+
+			io_start = pgstat_prepare_io_time(track_wal_io_timing);
 			do
 			{
 				errno = 0;
 
-				/* Measure I/O timing to write WAL data */
-				if (track_wal_io_timing)
-					INSTR_TIME_SET_CURRENT(start);
-				else
-					INSTR_TIME_SET_ZERO(start);
-
 				pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
 				written = pg_pwrite(openLogFile, from, nleft, startoffset);
 				pgstat_report_wait_end();
 
-				/*
-				 * Increment the I/O timing and the number of times WAL data
-				 * were written out to disk.
-				 */
-				if (track_wal_io_timing)
-				{
-					instr_time	end;
-
-					INSTR_TIME_SET_CURRENT(end);
-					INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_write_time, end, start);
-				}
-
 				PendingWalStats.wal_write++;
 
 				if (written <= 0)
@@ -2333,6 +2317,9 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
 				startoffset += written;
 			} while (nleft > 0);
 
+			pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL,
+									IOOP_WRITE, io_start, npages);
+
 			npages = 0;
 
 			/*
@@ -3039,6 +3026,7 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	int			fd;
 	int			save_errno;
 	int			open_flags = O_RDWR | O_CREAT | O_EXCL | PG_BINARY;
+	instr_time	io_start;
 
 	Assert(logtli != 0);
 
@@ -3082,6 +3070,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 				(errcode_for_file_access(),
 				 errmsg("could not create file \"%s\": %m", tmppath)));
 
+	/* start timing writes for stats */
+	io_start = pgstat_prepare_io_time(track_wal_io_timing);
+
 	pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_WRITE);
 	save_errno = 0;
 	if (wal_init_zero)
@@ -3117,6 +3108,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	}
 	pgstat_report_wait_end();
 
+	pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_INIT, IOOP_WRITE,
+							io_start, 1);
+
 	if (save_errno)
 	{
 		/*
@@ -3133,6 +3127,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 				 errmsg("could not write to file \"%s\": %m", tmppath)));
 	}
 
+	/* start timing fsyncs for stats */
+	io_start = pgstat_prepare_io_time(track_wal_io_timing);
+
 	pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_SYNC);
 	if (pg_fsync(fd) != 0)
 	{
@@ -3145,6 +3142,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	}
 	pgstat_report_wait_end();
 
+	pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_INIT,
+							IOOP_FSYNC, io_start, 1);
+
 	if (close(fd) != 0)
 		ereport(ERROR,
 				(errcode_for_file_access(),
@@ -8317,7 +8317,7 @@ void
 issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 {
 	char	   *msg = NULL;
-	instr_time	start;
+	instr_time	io_start;
 
 	Assert(tli != 0);
 
@@ -8330,11 +8330,7 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 		wal_sync_method == WAL_SYNC_METHOD_OPEN_DSYNC)
 		return;
 
-	/* Measure I/O timing to sync the WAL file */
-	if (track_wal_io_timing)
-		INSTR_TIME_SET_CURRENT(start);
-	else
-		INSTR_TIME_SET_ZERO(start);
+	io_start = pgstat_prepare_io_time(track_wal_io_timing);
 
 	pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC);
 	switch (wal_sync_method)
@@ -8378,16 +8374,8 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 
 	pgstat_report_wait_end();
 
-	/*
-	 * Increment the I/O timing and the number of times WAL files were synced.
-	 */
-	if (track_wal_io_timing)
-	{
-		instr_time	end;
-
-		INSTR_TIME_SET_CURRENT(end);
-		INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_sync_time, end, start);
-	}
+	pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_FSYNC,
+							io_start, 1);
 
 	PendingWalStats.wal_sync++;
 }
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index a2c8fa3981..16c1b9ba99 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -60,6 +60,7 @@
 #include "utils/builtins.h"
 #include "utils/datetime.h"
 #include "utils/guc_hooks.h"
+#include "utils/pgstat_internal.h"
 #include "utils/pg_lsn.h"
 #include "utils/ps_status.h"
 #include "utils/pg_rusage.h"
@@ -1771,6 +1772,9 @@ PerformWalRecovery(void)
 			 */
 			ApplyWalRecord(xlogreader, record, &replayTLI);
 
+			/* Report pending statistics to the cumulative stats system  */
+			pgstat_flush_io(false);
+
 			/* Exit loop if we reached inclusive recovery target */
 			if (recoveryStopsAfter(xlogreader))
 			{
@@ -3246,6 +3250,7 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 	uint32		targetPageOff;
 	XLogSegNo	targetSegNo PG_USED_FOR_ASSERTS_ONLY;
 	int			r;
+	instr_time	io_start;
 
 	XLByteToSeg(targetPagePtr, targetSegNo, wal_segment_size);
 	targetPageOff = XLogSegmentOffset(targetPagePtr, wal_segment_size);
@@ -3338,6 +3343,8 @@ retry:
 	/* Read the requested page */
 	readOff = targetPageOff;
 
+	io_start = pgstat_prepare_io_time(track_wal_io_timing);
+
 	pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
 	r = pg_pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);
 	if (r != XLOG_BLCKSZ)
@@ -3366,6 +3373,9 @@ retry:
 	}
 	pgstat_report_wait_end();
 
+	pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_READ,
+							io_start, 1);
+
 	Assert(targetSegNo == readSegNo);
 	Assert(targetPageOff == readOff);
 	Assert(reqLen <= readLen);
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 7263a80c72..36bf5ce708 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -18,6 +18,7 @@
 
 #include "executor/instrument.h"
 #include "storage/bufmgr.h"
+#include "access/xlog.h"
 #include "utils/pgstat_internal.h"
 
 
@@ -114,6 +115,22 @@ pgstat_prepare_io_time(bool track_time_guc)
 	return io_start;
 }
 
+/*
+ * Decide if the io timing needs be tracked.
+ */
+bool
+pgstat_should_track_io_time(IOObject io_object, IOContext io_context)
+{
+	/*
+	 * io times of IOOBJECT_WAL IOObject needs to be tracked when
+	 * 'track_wal_io_timing' is set regardless of 'track_io_timing'.
+	 */
+	if (io_object == IOOBJECT_WAL)
+		return track_wal_io_timing;
+
+	return track_io_timing;
+}
+
 /*
  * Like pgstat_count_io_op_n() except it also accumulates time.
  */
@@ -121,7 +138,27 @@ void
 pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
 						instr_time start_time, uint32 cnt)
 {
-	if (track_io_timing)
+	/*
+	 * Accumulate different type of times here. We want to eventually
+	 * deduplicate these counters, so we are consolidating them first. This
+	 * also makes it easy to compare what is tracked for which stats or
+	 * instrumentation purpose.
+	 *
+	 * Some of the IO counters didn't moved here because they track at a
+	 * different level of granularity or at a different point in the call
+	 * stack.
+	 *
+	 * pgstat_count_buffer is for pgstat_database. Since pg_stat_database only
+	 * counts blk_read_time and blk_write_time, it is set for IOOP_READ and
+	 * IOOP_WRITE.
+	 *
+	 * pgBufferUsage is for EXPLAIN. pgBufferUsage has only write and read
+	 * stats for shared/local and temporary blocks. Only shared/local blocks
+	 * are counted here.
+	 *
+	 * At the end of the if case, accumulate time for the pg_stat_io.
+	 */
+	if (pgstat_should_track_io_time(io_object, io_context))
 	{
 		instr_time	io_time;
 
@@ -149,6 +186,7 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
 					   io_time);
 	}
 
+	/* The IO timings are counted, now count the IO numbers */
 	pgstat_count_io_op_n(io_object, io_context, io_op, cnt);
 }
 
@@ -229,12 +267,33 @@ pgstat_get_io_context_name(IOContext io_context)
 			return "normal";
 		case IOCONTEXT_VACUUM:
 			return "vacuum";
+		case IOCONTEXT_INIT:
+			return "init";
 	}
 
 	elog(ERROR, "unrecognized IOContext value: %d", io_context);
 	pg_unreachable();
 }
 
+/*
+ * op_bytes can change according to IOObject and IOContext.
+ * Return BLCKSZ as default because most of the
+ * IOObject / IOContext uses BLCKSZ.
+ */
+int
+pgstat_get_io_op_bytes(IOObject io_object, IOContext io_context)
+{
+	if (io_object == IOOBJECT_WAL)
+	{
+		if (io_context == IOCONTEXT_NORMAL)
+			return XLOG_BLCKSZ;
+		else if (io_context == IOCONTEXT_INIT)
+			return wal_segment_size;
+	}
+
+	return BLCKSZ;
+}
+
 const char *
 pgstat_get_io_object_name(IOObject io_object)
 {
@@ -244,6 +303,8 @@ pgstat_get_io_object_name(IOObject io_object)
 			return "relation";
 		case IOOBJECT_TEMP_RELATION:
 			return "temp relation";
+		case IOOBJECT_WAL:
+			return "wal";
 	}
 
 	elog(ERROR, "unrecognized IOObject value: %d", io_object);
@@ -325,10 +386,10 @@ pgstat_tracks_io_bktype(BackendType bktype)
 		case B_INVALID:
 		case B_ARCHIVER:
 		case B_LOGGER:
-		case B_WAL_RECEIVER:
-		case B_WAL_WRITER:
 			return false;
 
+		case B_WAL_RECEIVER:
+		case B_WAL_WRITER:
 		case B_AUTOVAC_LAUNCHER:
 		case B_AUTOVAC_WORKER:
 		case B_BACKEND:
@@ -363,6 +424,15 @@ pgstat_tracks_io_object(BackendType bktype, IOObject io_object,
 	if (!pgstat_tracks_io_bktype(bktype))
 		return false;
 
+	/*
+	 * Currently, IO on IOOBJECT_WAL IOObject can only occur in the
+	 * IOCONTEXT_NORMAL and IOCONTEXT_INIT IOContext.
+	 */
+	if (io_object == IOOBJECT_WAL &&
+		(io_context != IOCONTEXT_NORMAL &&
+		 io_context != IOCONTEXT_INIT))
+		return false;
+
 	/*
 	 * Currently, IO on temporary relations can only occur in the
 	 * IOCONTEXT_NORMAL IOContext.
@@ -437,6 +507,33 @@ pgstat_tracks_io_op(BackendType bktype, IOObject io_object,
 		 bktype == B_CHECKPOINTER) && io_op == IOOP_EXTEND)
 		return false;
 
+	/*
+	 * Some BackendTypes / IOObjects will not do certain IOOps.
+	 */
+	if (io_object == IOOBJECT_WAL && io_op == IOOP_READ)
+	{
+		switch (bktype)
+		{
+			case B_STANDALONE_BACKEND:
+			case B_STARTUP:
+				break;
+
+			case B_INVALID:
+			case B_ARCHIVER:
+			case B_AUTOVAC_LAUNCHER:
+			case B_AUTOVAC_WORKER:
+			case B_BACKEND:
+			case B_BG_WORKER:
+			case B_BG_WRITER:
+			case B_CHECKPOINTER:
+			case B_LOGGER:
+			case B_WAL_RECEIVER:
+			case B_WAL_SENDER:
+			case B_WAL_WRITER:
+				return false;
+		}
+	}
+
 	/*
 	 * Temporary tables are not logged and thus do not require fsync'ing.
 	 * Writeback is not requested for temporary tables.
@@ -461,6 +558,27 @@ pgstat_tracks_io_op(BackendType bktype, IOObject io_object,
 	if (!strategy_io_context && io_op == IOOP_REUSE)
 		return false;
 
+	/*
+	 * Some IOOps are not valid in certain IOContexts / IOObjects and some
+	 * IOOps are only valid in certain IOContexts / IOObjects.
+	 */
+
+	/*
+	 * In IOOBJECT_WAL io_object, IOCONTEXT_INIT io_context means operations
+	 * done while creating new WAL segments.
+	 */
+	if (io_object == IOOBJECT_WAL && io_context == IOCONTEXT_INIT &&
+		!(io_op == IOOP_WRITE || io_op == IOOP_FSYNC))
+		return false;
+
+	/*
+	 * In IOOBJECT_WAL io_object, IOCONTEXT_NORMAL io_context means operations
+	 * done on already created WAL segments.
+	 */
+	if (io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL &&
+		!(io_op == IOOP_WRITE || io_op == IOOP_READ || io_op == IOOP_FSYNC))
+		return false;
+
 	/*
 	 * IOOP_FSYNC IOOps done by a backend using a BufferAccessStrategy are
 	 * counted in the IOCONTEXT_NORMAL IOContext. See comment in
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 0cea320c00..8d14a4183c 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1377,6 +1377,7 @@ pg_stat_get_io(PG_FUNCTION_ARGS)
 			for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
 			{
 				const char *context_name = pgstat_get_io_context_name(io_context);
+				int			op_bytes;
 
 				Datum		values[IO_NUM_COLUMNS] = {0};
 				bool		nulls[IO_NUM_COLUMNS] = {0};
@@ -1395,12 +1396,11 @@ pg_stat_get_io(PG_FUNCTION_ARGS)
 				values[IO_COL_RESET_TIME] = TimestampTzGetDatum(reset_time);
 
 				/*
-				 * Hard-code this to the value of BLCKSZ for now. Future
-				 * values could include XLOG_BLCKSZ, once WAL IO is tracked,
-				 * and constant multipliers, once non-block-oriented IO (e.g.
-				 * temporary file IO) is tracked.
+				 * op_bytes can change according to IOObject and IOContext.
+				 * Get the correct op_bytes.
 				 */
-				values[IO_COL_CONVERSION] = Int64GetDatum(BLCKSZ);
+				op_bytes = pgstat_get_io_op_bytes(io_obj, io_context);
+				values[IO_COL_CONVERSION] = Int64GetDatum(op_bytes);
 
 				for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
 				{
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index f95d8db0c4..2a3d131dce 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -276,9 +276,10 @@ typedef enum IOObject
 {
 	IOOBJECT_RELATION,
 	IOOBJECT_TEMP_RELATION,
+	IOOBJECT_WAL,
 } IOObject;
 
-#define IOOBJECT_NUM_TYPES (IOOBJECT_TEMP_RELATION + 1)
+#define IOOBJECT_NUM_TYPES (IOOBJECT_WAL + 1)
 
 typedef enum IOContext
 {
@@ -286,9 +287,10 @@ typedef enum IOContext
 	IOCONTEXT_BULKWRITE,
 	IOCONTEXT_NORMAL,
 	IOCONTEXT_VACUUM,
+	IOCONTEXT_INIT,
 } IOContext;
 
-#define IOCONTEXT_NUM_TYPES (IOCONTEXT_VACUUM + 1)
+#define IOCONTEXT_NUM_TYPES (IOCONTEXT_INIT + 1)
 
 typedef enum IOOp
 {
@@ -519,11 +521,13 @@ extern bool pgstat_bktype_io_stats_valid(PgStat_BktypeIO *backend_io,
 										 BackendType bktype);
 extern void pgstat_count_io_op(IOObject io_object, IOContext io_context, IOOp io_op);
 extern void pgstat_count_io_op_n(IOObject io_object, IOContext io_context, IOOp io_op, uint32 cnt);
-extern instr_time pgstat_prepare_io_time(void);
+extern instr_time pgstat_prepare_io_time(bool track_time_guc);
+extern bool pgstat_should_track_io_time(IOObject io_object, IOContext io_context);
 extern void pgstat_count_io_op_time(IOObject io_object, IOContext io_context,
 									IOOp io_op, instr_time start_time, uint32 cnt);
 
 extern PgStat_IO *pgstat_fetch_stat_io(void);
+extern int	pgstat_get_io_op_bytes(IOObject io_object, IOContext io_context);
 extern const char *pgstat_get_io_context_name(IOContext io_context);
 extern const char *pgstat_get_io_object_name(IOObject io_object);
 
-- 
2.43.0

v6-0001-Use-timing-GUCs-on-pgstat_prepare_io_time-functio.patch (application/x-patch)

From e439ddc128c9b9ea1e2832f339e872c27d308499 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Tue, 12 Dec 2023 10:33:55 +0300
Subject: [PATCH v6 1/6] Use timing GUCs on pgstat_prepare_io_time function

---
 src/backend/storage/buffer/bufmgr.c    | 10 +++++-----
 src/backend/storage/buffer/localbuf.c  |  4 ++--
 src/backend/storage/smgr/md.c          |  4 ++--
 src/backend/utils/activity/pgstat_io.c | 13 +++++++++++--
 4 files changed, 20 insertions(+), 11 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f7c67d504c..a652907fc9 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1143,7 +1143,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		MemSet((char *) bufBlock, 0, BLCKSZ);
 	else
 	{
-		instr_time	io_start = pgstat_prepare_io_time();
+		instr_time	io_start = pgstat_prepare_io_time(track_io_timing);
 
 		smgrread(smgr, forkNum, blockNum, bufBlock);
 
@@ -2070,7 +2070,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 		}
 	}
 
-	io_start = pgstat_prepare_io_time();
+	io_start = pgstat_prepare_io_time(track_io_timing);
 
 	/*
 	 * Note: if smgrzeroextend fails, we will end up with buffers that are
@@ -3523,7 +3523,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	 */
 	bufToWrite = PageSetChecksumCopy((Page) bufBlock, buf->tag.blockNum);
 
-	io_start = pgstat_prepare_io_time();
+	io_start = pgstat_prepare_io_time(track_io_timing);
 
 	/*
 	 * bufToWrite is either the shared buffer or a copy, as appropriate.
@@ -4181,7 +4181,7 @@ FlushRelationBuffers(Relation rel)
 
 				PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
 
-				io_start = pgstat_prepare_io_time();
+				io_start = pgstat_prepare_io_time(track_io_timing);
 
 				smgrwrite(RelationGetSmgr(rel),
 						  BufTagGetForkNum(&bufHdr->tag),
@@ -5614,7 +5614,7 @@ IssuePendingWritebacks(WritebackContext *wb_context, IOContext io_context)
 	sort_pending_writebacks(wb_context->pending_writebacks,
 							wb_context->nr_pending);
 
-	io_start = pgstat_prepare_io_time();
+	io_start = pgstat_prepare_io_time(track_io_timing);
 
 	/*
 	 * Coalesce neighbouring writes, but nothing else. For that we iterate
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index aebcf146b4..903ea97b85 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -246,7 +246,7 @@ GetLocalVictimBuffer(void)
 
 		PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
 
-		io_start = pgstat_prepare_io_time();
+		io_start = pgstat_prepare_io_time(track_io_timing);
 
 		/* And write... */
 		smgrwrite(oreln,
@@ -411,7 +411,7 @@ ExtendBufferedRelLocal(BufferManagerRelation bmr,
 		}
 	}
 
-	io_start = pgstat_prepare_io_time();
+	io_start = pgstat_prepare_io_time(track_io_timing);
 
 	/* actually extend relation */
 	smgrzeroextend(bmr.smgr, fork, first_block, extend_by, false);
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index fdecbad170..62b2e79106 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -1172,7 +1172,7 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
 		ereport(DEBUG1,
 				(errmsg_internal("could not forward fsync request because request queue is full")));
 
-		io_start = pgstat_prepare_io_time();
+		io_start = pgstat_prepare_io_time(track_io_timing);
 
 		if (FileSync(seg->mdfd_vfd, WAIT_EVENT_DATA_FILE_SYNC) < 0)
 			ereport(data_sync_elevel(ERROR),
@@ -1569,7 +1569,7 @@ mdsyncfiletag(const FileTag *ftag, char *path)
 		need_to_close = true;
 	}
 
-	io_start = pgstat_prepare_io_time();
+	io_start = pgstat_prepare_io_time(track_io_timing);
 
 	/* Sync the file. */
 	result = FileSync(file, WAIT_EVENT_DATA_FILE_SYNC);
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 490d5a9ab7..7263a80c72 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -92,14 +92,23 @@ pgstat_count_io_op_n(IOObject io_object, IOContext io_context, IOOp io_op, uint3
 	have_iostats = true;
 }
 
+/*
+ * Prepares the io_time for pgstat_count_io_op_time() function. This returns
+ * the current time if the 'track_time_guc' is true, otherwise returns zero.
+ */
 instr_time
-pgstat_prepare_io_time(void)
+pgstat_prepare_io_time(bool track_time_guc)
 {
 	instr_time	io_start;
 
-	if (track_io_timing)
+	if (track_time_guc)
 		INSTR_TIME_SET_CURRENT(io_start);
 	else
+
+		/*
+		 * If the time won't be tracked, there is no need to set io_start but
+		 * compiler complains about uninitialized use. So, set it to zero.
+		 */
 		INSTR_TIME_SET_ZERO(io_start);
 
 	return io_start;
-- 
2.43.0

#25

Michael Paquier

michael@paquier.xyz

about 2 years ago

In reply to: Nazir Bilal Yavuz (#24)

Re: Show WAL write and fsync stats in pg_stat_io

On Tue, Dec 12, 2023 at 02:29:03PM +0300, Nazir Bilal Yavuz wrote:

On Tue, 5 Dec 2023 at 09:16, Michael Paquier <michael@paquier.xyz> wrote:

This interface from pgstat_prepare_io_time() is not really good,
because we could finish by setting io_start in the existing code paths
calling this routine even if track_io_timing is false when
track_wal_io_timing is true. Why not changing this interface a bit
and pass down a GUC (track_io_timing or track_wal_io_timing) as an
argument of the function depending on what we expect to trigger the
timings?

Done in 0001.
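
For readers following along, the interface change discussed above can be sketched outside the tree as follows. This is only an illustration under stated assumptions: every sketch_* name is hypothetical, struct timespec stands in for instr_time, and C11 timespec_get() replaces INSTR_TIME_SET_CURRENT(). The point is that the caller chooses which GUC gates the clock read, so enabling track_wal_io_timing alone does not cause relation I/O paths to take timestamps.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical stand-ins for track_io_timing and track_wal_io_timing. */
static bool sketch_track_io_timing = false;
static bool sketch_track_wal_io_timing = true;

/* Hypothetical analogue of instr_time; all-zero means "not measured". */
typedef struct
{
	struct timespec ts;
} sketch_instr_time;

/*
 * Return the current time if the GUC passed by the caller is enabled,
 * otherwise return zero.  The caller picks the GUC, not this function.
 */
static sketch_instr_time
sketch_prepare_io_time(bool track_time_guc)
{
	sketch_instr_time start = {{0, 0}};

	if (track_time_guc)
	{
		/* Wall clock is enough for a sketch; timespec_get is C11. */
		int			rc = timespec_get(&start.ts, TIME_UTC);

		assert(rc == TIME_UTC);
		(void) rc;
	}

	return start;
}

/* True when no timestamp was taken. */
static bool
sketch_time_is_zero(sketch_instr_time t)
{
	return t.ts.tv_sec == 0 && t.ts.tv_nsec == 0;
}
```

With the flags as initialized above, a relation I/O call site passing sketch_track_io_timing gets a zero start time, while a WAL call site passing sketch_track_wal_io_timing gets a real timestamp.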

One thing that 0001 missed is an update of the header where the
function is declared. I've edited a few things, and applied it to
start on this stuff. The rest will have to wait a bit more..
--
Michael

#26

Michael Paquier

michael@paquier.xyz

about 2 years ago

In reply to: Michael Paquier (#25)

1 attachment(s)

Re: Show WAL write and fsync stats in pg_stat_io

On Sat, Dec 16, 2023 at 08:20:57PM +0100, Michael Paquier wrote:

One thing that 0001 missed is an update of the header where the
function is declared. I've edited a few things, and applied it to
start on this stuff. The rest will have to wait a bit more..

I have been reviewing the whole patch set, and spotted a couple of issues.

+	 * At the end of the if case, accumulate time for the pg_stat_io.
+	 */
+	if (pgstat_should_track_io_time(io_object, io_context))

There was a bug here. WAL operations can do IOOP_WRITE or IOOP_READ,
and this would cause pgstat_count_buffer_read_time() and
pgstat_count_buffer_write_time() to be called, incrementing
pgStatBlock{Read,Write}Time, which would be incorrect when it comes to
a WAL page or a WAL segment. I was wondering what to do here first,
but we could just avoid calling these routines when working on an
IOOBJECT_WAL as that's the only object not doing a buffer operation.
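
A standalone sketch of that fix, for illustration only: the enums and counters below are hypothetical stand-ins (the real ones live in pgstat.h and pgstat_database.c), and the function is a simplified model of pgstat_count_io_op_time(), not the patched code itself. Buffer read/write time is accumulated only for non-WAL objects, while the per-object time that would feed pg_stat_io is always accumulated.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for IOObject and IOOp. */
typedef enum
{
	SKETCH_OBJ_RELATION,
	SKETCH_OBJ_TEMP_RELATION,
	SKETCH_OBJ_WAL
} SketchIOObject;

typedef enum
{
	SKETCH_OP_READ,
	SKETCH_OP_WRITE,
	SKETCH_OP_FSYNC
} SketchIOOp;

/* Models pgStatBlockReadTime / pgStatBlockWriteTime, in microseconds. */
static uint64_t sketch_block_read_time = 0;
static uint64_t sketch_block_write_time = 0;

/* Models the per-object time shown in pg_stat_io. */
static uint64_t sketch_io_time[3] = {0, 0, 0};

static void
sketch_count_io_time(SketchIOObject obj, SketchIOOp op, uint64_t usec)
{
	assert(obj <= SKETCH_OBJ_WAL);

	/* WAL pages and segments are not buffer I/O: skip the block counters. */
	if (obj != SKETCH_OBJ_WAL)
	{
		if (op == SKETCH_OP_READ)
			sketch_block_read_time += usec;
		else if (op == SKETCH_OP_WRITE)
			sketch_block_write_time += usec;
	}

	/* The pg_stat_io time is accumulated for every object type. */
	sketch_io_time[obj] += usec;
}
```

In this model, a 100-microsecond WAL write leaves sketch_block_write_time untouched but shows up under the WAL object, whereas a relation write updates both.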

A comment at the top of pgstat_tracks_io_bktype() is incorrect,
because this patch adds the WAL writer to the I/O tracking.

+           case B_WAL_RECEIVER:
+           case B_WAL_SENDER:
+           case B_WAL_WRITER:
+               return false;

pgstat_tracks_io_op() now needs B_WAL_SUMMARIZER.

pgstat_should_track_io_time() is used only in pgstat_io.c, so it can
be static rather than published in pgstat.h.

pgstat_tracks_io_bktype() does not look correct to me. Why is the WAL
receiver considered as something correct in the list of backend types,
while the intention is to *not* add it to pg_stat_io? I have tried to
switche to the correct behavior of returning false for a
B_WAL_RECEIVER, to notice that pg_rewind's test 002_databases.pl
freezes on its shutdown sequence. Something weird is going on here.
Could you look at it? See the XXX comment in the attached, which is
the same behavior as v6-0002. It looks to me that the patch has
introduced an infinite loop tweaking pgstat_tracks_io_bktype() in an
incorrect way to avoid the root issue.

I have also spent more time polishing the rest, touching a few things
while reviewing. Not sure that I see a point in splitting the tests
from the main patch.
--
Michael

Attachments:

v7-0001-Show-WAL-stats-on-pg_stat_io-except-streaming.patch (text/x-diff; charset=us-ascii)

From 2ca5a656fbd2a359cc585cf7f59331daf77d760b Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Mon, 25 Dec 2023 15:16:34 +0900
Subject: [PATCH v7] Show WAL stats on pg_stat_io (except streaming 
 replication)

This patch aims to show WAL stats per backend in the pg_stat_io view.

With this patch, it can be seen how many WAL operations it makes, their
context, types and total timings per backend in pg_stat_io view.

In this patch, a new IOContext, IOCONTEXT_INIT, is introduced for IO
operations done while creating new objects. Currently, it is used only
together with IOObject IOOBJECT_WAL.

IOOBJECT_WAL means IO operations related to WAL.
IOOBJECT_WAL / IOCONTEXT_NORMAL means IO operations done on already
created WAL segments.
IOOBJECT_WAL / IOCONTEXT_INIT means IO operations done while creating
the WAL segments.

This patch currently covers:
- Documentation
- IOOBJECT_WAL / IOCONTEXT_NORMAL / read, write and fsync stats on
  pg_stat_io.
- IOOBJECT_WAL / IOCONTEXT_INIT / write and fsync stats on pg_stat_io.

It does not cover:
- Streaming replication WAL IO.
---
 src/include/catalog/pg_proc.dat           |   6 +-
 src/include/pgstat.h                      |   6 +-
 src/backend/access/transam/xlog.c         |  58 +++++-------
 src/backend/access/transam/xlogrecovery.c |  10 +++
 src/backend/catalog/system_views.sql      |  15 +++-
 src/backend/utils/activity/pgstat_io.c    | 102 ++++++++++++++++++++--
 src/backend/utils/adt/pgstatfuncs.c       |  24 ++---
 src/test/regress/expected/rules.out       |  27 +++---
 src/test/regress/expected/stats.out       |  53 +++++++++++
 src/test/regress/sql/stats.sql            |  25 ++++++
 doc/src/sgml/monitoring.sgml              |  29 ++++--
 11 files changed, 274 insertions(+), 81 deletions(-)

diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 9052f5262a..ddccddd6bd 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5785,9 +5785,9 @@
 { oid => '1136', descr => 'statistics: information about WAL activity',
   proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
   proparallel => 'r', prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int8,int8,numeric,int8,int8,int8,float8,float8,timestamptz}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o}',
-  proargnames => '{wal_records,wal_fpi,wal_bytes,wal_buffers_full,wal_write,wal_sync,wal_write_time,wal_sync_time,stats_reset}',
+  proallargtypes => '{int8,int8,numeric,int8,int8,int8,timestamptz}',
+  proargmodes => '{o,o,o,o,o,o,o}',
+  proargnames => '{wal_records,wal_fpi,wal_bytes,wal_buffers_full,wal_write,wal_sync,stats_reset}',
   prosrc => 'pg_stat_get_wal' },
 { oid => '6248', descr => 'statistics: information about WAL prefetching',
   proname => 'pg_stat_get_recovery_prefetch', prorows => '1', proretset => 't',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index ab91b3b367..c23970da49 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -279,14 +279,16 @@ typedef enum IOObject
 {
 	IOOBJECT_RELATION,
 	IOOBJECT_TEMP_RELATION,
+	IOOBJECT_WAL,
 } IOObject;
 
-#define IOOBJECT_NUM_TYPES (IOOBJECT_TEMP_RELATION + 1)
+#define IOOBJECT_NUM_TYPES (IOOBJECT_WAL + 1)
 
 typedef enum IOContext
 {
 	IOCONTEXT_BULKREAD,
 	IOCONTEXT_BULKWRITE,
+	IOCONTEXT_INIT,
 	IOCONTEXT_NORMAL,
 	IOCONTEXT_VACUUM,
 } IOContext;
@@ -527,6 +529,8 @@ extern void pgstat_count_io_op_time(IOObject io_object, IOContext io_context,
 									IOOp io_op, instr_time start_time, uint32 cnt);
 
 extern PgStat_IO *pgstat_fetch_stat_io(void);
+extern int	pgstat_get_io_op_bytes(IOObject io_object,
+								   IOContext io_context);
 extern const char *pgstat_get_io_context_name(IOContext io_context);
 extern const char *pgstat_get_io_object_name(IOObject io_object);
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1e9019156a..aad01a0ef3 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2281,38 +2281,22 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
 			Size		nbytes;
 			Size		nleft;
 			int			written;
-			instr_time	start;
+			instr_time	io_start;
 
 			/* OK to write the page(s) */
 			from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
 			nbytes = npages * (Size) XLOG_BLCKSZ;
 			nleft = nbytes;
+
+			io_start = pgstat_prepare_io_time(track_wal_io_timing);
 			do
 			{
 				errno = 0;
 
-				/* Measure I/O timing to write WAL data */
-				if (track_wal_io_timing)
-					INSTR_TIME_SET_CURRENT(start);
-				else
-					INSTR_TIME_SET_ZERO(start);
-
 				pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
 				written = pg_pwrite(openLogFile, from, nleft, startoffset);
 				pgstat_report_wait_end();
 
-				/*
-				 * Increment the I/O timing and the number of times WAL data
-				 * were written out to disk.
-				 */
-				if (track_wal_io_timing)
-				{
-					instr_time	end;
-
-					INSTR_TIME_SET_CURRENT(end);
-					INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_write_time, end, start);
-				}
-
 				PendingWalStats.wal_write++;
 
 				if (written <= 0)
@@ -2337,6 +2321,9 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
 				startoffset += written;
 			} while (nleft > 0);
 
+			pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL,
+									IOOP_WRITE, io_start, npages);
+
 			npages = 0;
 
 			/*
@@ -3043,6 +3030,7 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	int			fd;
 	int			save_errno;
 	int			open_flags = O_RDWR | O_CREAT | O_EXCL | PG_BINARY;
+	instr_time	io_start;
 
 	Assert(logtli != 0);
 
@@ -3086,6 +3074,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 				(errcode_for_file_access(),
 				 errmsg("could not create file \"%s\": %m", tmppath)));
 
+	/* start timing writes for stats */
+	io_start = pgstat_prepare_io_time(track_wal_io_timing);
+
 	pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_WRITE);
 	save_errno = 0;
 	if (wal_init_zero)
@@ -3121,6 +3112,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	}
 	pgstat_report_wait_end();
 
+	pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_INIT, IOOP_WRITE,
+							io_start, 1);
+
 	if (save_errno)
 	{
 		/*
@@ -3137,6 +3131,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 				 errmsg("could not write to file \"%s\": %m", tmppath)));
 	}
 
+	/* start timing fsyncs for stats */
+	io_start = pgstat_prepare_io_time(track_wal_io_timing);
+
 	pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_SYNC);
 	if (pg_fsync(fd) != 0)
 	{
@@ -3149,6 +3146,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	}
 	pgstat_report_wait_end();
 
+	pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_INIT,
+							IOOP_FSYNC, io_start, 1);
+
 	if (close(fd) != 0)
 		ereport(ERROR,
 				(errcode_for_file_access(),
@@ -8414,7 +8414,7 @@ void
 issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 {
 	char	   *msg = NULL;
-	instr_time	start;
+	instr_time	io_start;
 
 	Assert(tli != 0);
 
@@ -8427,11 +8427,7 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 		wal_sync_method == WAL_SYNC_METHOD_OPEN_DSYNC)
 		return;
 
-	/* Measure I/O timing to sync the WAL file */
-	if (track_wal_io_timing)
-		INSTR_TIME_SET_CURRENT(start);
-	else
-		INSTR_TIME_SET_ZERO(start);
+	io_start = pgstat_prepare_io_time(track_wal_io_timing);
 
 	pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC);
 	switch (wal_sync_method)
@@ -8475,16 +8471,8 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 
 	pgstat_report_wait_end();
 
-	/*
-	 * Increment the I/O timing and the number of times WAL files were synced.
-	 */
-	if (track_wal_io_timing)
-	{
-		instr_time	end;
-
-		INSTR_TIME_SET_CURRENT(end);
-		INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_sync_time, end, start);
-	}
+	pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_FSYNC,
+							io_start, 1);
 
 	PendingWalStats.wal_sync++;
 }
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 6f4f81f992..bb42263f2d 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -60,6 +60,7 @@
 #include "utils/builtins.h"
 #include "utils/datetime.h"
 #include "utils/guc_hooks.h"
+#include "utils/pgstat_internal.h"
 #include "utils/pg_lsn.h"
 #include "utils/ps_status.h"
 #include "utils/pg_rusage.h"
@@ -1777,6 +1778,9 @@ PerformWalRecovery(void)
 			 */
 			ApplyWalRecord(xlogreader, record, &replayTLI);
 
+			/* Report pending statistics to the cumulative stats system  */
+			pgstat_flush_io(false);
+
 			/* Exit loop if we reached inclusive recovery target */
 			if (recoveryStopsAfter(xlogreader))
 			{
@@ -3252,6 +3256,7 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 	uint32		targetPageOff;
 	XLogSegNo	targetSegNo PG_USED_FOR_ASSERTS_ONLY;
 	int			r;
+	instr_time	io_start;
 
 	XLByteToSeg(targetPagePtr, targetSegNo, wal_segment_size);
 	targetPageOff = XLogSegmentOffset(targetPagePtr, wal_segment_size);
@@ -3344,6 +3349,8 @@ retry:
 	/* Read the requested page */
 	readOff = targetPageOff;
 
+	io_start = pgstat_prepare_io_time(track_wal_io_timing);
+
 	pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
 	r = pg_pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);
 	if (r != XLOG_BLCKSZ)
@@ -3372,6 +3379,9 @@ retry:
 	}
 	pgstat_report_wait_end();
 
+	pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_READ,
+							io_start, 1);
+
 	Assert(targetSegNo == readSegNo);
 	Assert(targetPageOff == readOff);
 	Assert(reqLen <= readLen);
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 058fc47c91..c57b5428ae 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1172,6 +1172,13 @@ SELECT
 FROM pg_stat_get_io() b;
 
 CREATE VIEW pg_stat_wal AS
+    WITH pgsio_sum_write_fsync_time AS (
+        SELECT
+            sum(write_time) as sum_write_time,
+            sum(fsync_time) as sum_fsync_time
+        FROM pg_stat_io
+        WHERE context = 'normal' and object = 'wal'
+    )
     SELECT
         w.wal_records,
         w.wal_fpi,
@@ -1179,10 +1186,12 @@ CREATE VIEW pg_stat_wal AS
         w.wal_buffers_full,
         w.wal_write,
         w.wal_sync,
-        w.wal_write_time,
-        w.wal_sync_time,
+        p.sum_write_time as write_time,
+        p.sum_fsync_time as fsync_time,
         w.stats_reset
-    FROM pg_stat_get_wal() w;
+    FROM pg_stat_get_wal() w
+        CROSS JOIN pgsio_sum_write_fsync_time p;
+
 
 CREATE VIEW pg_stat_progress_analyze AS
     SELECT
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 0dd9b98b3e..f3bf703b11 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -16,6 +16,7 @@
 
 #include "postgres.h"
 
+#include "access/xlog.h"
 #include "executor/instrument.h"
 #include "storage/bufmgr.h"
 #include "utils/pgstat_internal.h"
@@ -115,6 +116,20 @@ pgstat_prepare_io_time(bool track_io_guc)
 	return io_start;
 }
 
+/*
+ * Decide if IO timings need to be tracked.  Timings associated to
+ * IOOBJECT_WAL objects are tracked if track_wal_io_timing is enabled,
+ * else rely on track_io_timing.
+ */
+static bool
+pgstat_should_track_io_time(IOObject io_object)
+{
+	if (io_object == IOOBJECT_WAL)
+		return track_wal_io_timing;
+
+	return track_io_timing;
+}
+
 /*
  * Like pgstat_count_io_op_n() except it also accumulates time.
  */
@@ -122,7 +137,16 @@ void
 pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
 						instr_time start_time, uint32 cnt)
 {
-	if (track_io_timing)
+	/*
+	 * Accumulate timing data.  pgstat_count_buffer is for pgstat_database.
+	 * As pg_stat_database only counts blk_read_time and blk_write_time, it
+	 * is set for IOOP_READ and IOOP_WRITE.
+	 *
+	 * pgBufferUsage is for EXPLAIN.  pgBufferUsage has write and read stats
+	 * for shared, local and temporary blocks.  Temporary blocks are ignored
+	 * here.
+	 */
+	if (pgstat_should_track_io_time(io_object))
 	{
 		instr_time	io_time;
 
@@ -131,7 +155,9 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
 
 		if (io_op == IOOP_WRITE || io_op == IOOP_EXTEND)
 		{
-			pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
+			if (io_object != IOOBJECT_WAL)
+				pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
+
 			if (io_object == IOOBJECT_RELATION)
 				INSTR_TIME_ADD(pgBufferUsage.shared_blk_write_time, io_time);
 			else if (io_object == IOOBJECT_TEMP_RELATION)
@@ -139,7 +165,9 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
 		}
 		else if (io_op == IOOP_READ)
 		{
-			pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
+			if (io_object != IOOBJECT_WAL)
+				pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
+
 			if (io_object == IOOBJECT_RELATION)
 				INSTR_TIME_ADD(pgBufferUsage.shared_blk_read_time, io_time);
 			else if (io_object == IOOBJECT_TEMP_RELATION)
@@ -226,6 +254,8 @@ pgstat_get_io_context_name(IOContext io_context)
 			return "bulkread";
 		case IOCONTEXT_BULKWRITE:
 			return "bulkwrite";
+		case IOCONTEXT_INIT:
+			return "init";
 		case IOCONTEXT_NORMAL:
 			return "normal";
 		case IOCONTEXT_VACUUM:
@@ -236,6 +266,30 @@ pgstat_get_io_context_name(IOContext io_context)
 	pg_unreachable();
 }
 
+/*
+ * Get the number of bytes associated to an operation depending on
+ * an IOObject and an IOContext.
+ *
+ * Returns BLCKSZ by default because that is what most of the
+ * IOObject/IOContext combinations use.  The number of bytes for an
+ * IOOBJECT_WAL depends on the context.
+ */
+int
+pgstat_get_io_op_bytes(IOObject io_object, IOContext io_context)
+{
+	if (io_object == IOOBJECT_WAL)
+	{
+		if (io_context == IOCONTEXT_NORMAL)
+			return XLOG_BLCKSZ;
+		else if (io_context == IOCONTEXT_INIT)
+			return wal_segment_size;
+
+		Assert(false);		/* not reachable */
+	}
+
+	return BLCKSZ;
+}
+
 const char *
 pgstat_get_io_object_name(IOObject io_object)
 {
@@ -245,6 +299,8 @@ pgstat_get_io_object_name(IOObject io_object)
 			return "relation";
 		case IOOBJECT_TEMP_RELATION:
 			return "temp relation";
+		case IOOBJECT_WAL:
+			return "wal";
 	}
 
 	elog(ERROR, "unrecognized IOObject value: %d", io_object);
@@ -306,8 +362,8 @@ pgstat_io_snapshot_cb(void)
 * - Syslogger because it is not connected to shared memory
 * - Archiver because most relevant archiving IO is delegated to a
 *   specialized command or module
-* - WAL Receiver, WAL Writer, and WAL Summarizer IO are not tracked in
-*   pg_stat_io for now
+* - WAL Receiver and WAL Summarizer IO are not tracked in pg_stat_io for
+*   now.
 *
 * Function returns true if BackendType participates in the cumulative stats
 * subsystem for IO and false if it does not.
@@ -327,8 +383,6 @@ pgstat_tracks_io_bktype(BackendType bktype)
 		case B_INVALID:
 		case B_ARCHIVER:
 		case B_LOGGER:
-		case B_WAL_RECEIVER:
-		case B_WAL_WRITER:
 		case B_WAL_SUMMARIZER:
 			return false;
 
@@ -340,7 +394,9 @@ pgstat_tracks_io_bktype(BackendType bktype)
 		case B_CHECKPOINTER:
 		case B_STANDALONE_BACKEND:
 		case B_STARTUP:
+		case B_WAL_RECEIVER:	/* XXX: isn't that incorrect? */
 		case B_WAL_SENDER:
+		case B_WAL_WRITER:
 			return true;
 	}
 
@@ -366,6 +422,15 @@ pgstat_tracks_io_object(BackendType bktype, IOObject io_object,
 	if (!pgstat_tracks_io_bktype(bktype))
 		return false;
 
+	/*
+	 * Currently, IO on IOOBJECT_WAL objects can only occur in the
+	 * IOCONTEXT_NORMAL and IOCONTEXT_INIT IOContexts.
+	 */
+	if (io_object == IOOBJECT_WAL &&
+		(io_context != IOCONTEXT_NORMAL &&
+		 io_context != IOCONTEXT_INIT))
+		return false;
+
 	/*
 	 * Currently, IO on temporary relations can only occur in the
 	 * IOCONTEXT_NORMAL IOContext.
@@ -440,6 +505,13 @@ pgstat_tracks_io_op(BackendType bktype, IOObject io_object,
 		 bktype == B_CHECKPOINTER) && io_op == IOOP_EXTEND)
 		return false;
 
+	/*
+	 * Most BackendTypes don't do reads with IOOBJECT_WAL.
+	 */
+	if (io_object == IOOBJECT_WAL && io_op == IOOP_READ &&
+		!(bktype == B_STANDALONE_BACKEND || bktype == B_STARTUP))
+		return false;
+
 	/*
 	 * Temporary tables are not logged and thus do not require fsync'ing.
 	 * Writeback is not requested for temporary tables.
@@ -464,6 +536,22 @@ pgstat_tracks_io_op(BackendType bktype, IOObject io_object,
 	if (!strategy_io_context && io_op == IOOP_REUSE)
 		return false;
 
+	/*
+	 * An IOCONTEXT_INIT done for an IOOBJECT_WAL io_object does writes
+	 * and syncs.
+	 */
+	if (io_object == IOOBJECT_WAL && io_context == IOCONTEXT_INIT &&
+		!(io_op == IOOP_WRITE || io_op == IOOP_FSYNC))
+		return false;
+
+	/*
+	 * An IOCONTEXT_NORMAL done for an IOOBJECT_WAL io_object does writes,
+	 * reads and syncs.
+	 */
+	if (io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL &&
+		!(io_op == IOOP_WRITE || io_op == IOOP_READ || io_op == IOOP_FSYNC))
+		return false;
+
 	/*
 	 * IOOP_FSYNC IOOps done by a backend using a BufferAccessStrategy are
 	 * counted in the IOCONTEXT_NORMAL IOContext. See comment in
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index e65cbf41e9..a77c2ce062 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1395,6 +1395,7 @@ pg_stat_get_io(PG_FUNCTION_ARGS)
 			for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
 			{
 				const char *context_name = pgstat_get_io_context_name(io_context);
+				int			op_bytes;
 
 				Datum		values[IO_NUM_COLUMNS] = {0};
 				bool		nulls[IO_NUM_COLUMNS] = {0};
@@ -1413,12 +1414,11 @@ pg_stat_get_io(PG_FUNCTION_ARGS)
 				values[IO_COL_RESET_TIME] = TimestampTzGetDatum(reset_time);
 
 				/*
-				 * Hard-code this to the value of BLCKSZ for now. Future
-				 * values could include XLOG_BLCKSZ, once WAL IO is tracked,
-				 * and constant multipliers, once non-block-oriented IO (e.g.
-				 * temporary file IO) is tracked.
+				 * op_bytes can change according to IOObject and IOContext.
+				 * Get the correct op_bytes.
 				 */
-				values[IO_COL_CONVERSION] = Int64GetDatum(BLCKSZ);
+				op_bytes = pgstat_get_io_op_bytes(io_obj, io_context);
+				values[IO_COL_CONVERSION] = Int64GetDatum(op_bytes);
 
 				for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
 				{
@@ -1470,7 +1470,7 @@ pg_stat_get_io(PG_FUNCTION_ARGS)
 Datum
 pg_stat_get_wal(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_COLS	9
+#define PG_STAT_GET_WAL_COLS	7
 	TupleDesc	tupdesc;
 	Datum		values[PG_STAT_GET_WAL_COLS] = {0};
 	bool		nulls[PG_STAT_GET_WAL_COLS] = {0};
@@ -1491,11 +1491,7 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
 					   INT8OID, -1, 0);
 	TupleDescInitEntry(tupdesc, (AttrNumber) 6, "wal_sync",
 					   INT8OID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 7, "wal_write_time",
-					   FLOAT8OID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 8, "wal_sync_time",
-					   FLOAT8OID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 9, "stats_reset",
+	TupleDescInitEntry(tupdesc, (AttrNumber) 7, "stats_reset",
 					   TIMESTAMPTZOID, -1, 0);
 
 	BlessTupleDesc(tupdesc);
@@ -1518,11 +1514,7 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
 	values[4] = Int64GetDatum(wal_stats->wal_write);
 	values[5] = Int64GetDatum(wal_stats->wal_sync);
 
-	/* Convert counters from microsec to millisec for display */
-	values[6] = Float8GetDatum(((double) wal_stats->wal_write_time) / 1000.0);
-	values[7] = Float8GetDatum(((double) wal_stats->wal_sync_time) / 1000.0);
-
-	values[8] = TimestampTzGetDatum(wal_stats->stat_reset_timestamp);
+	values[6] = TimestampTzGetDatum(wal_stats->stat_reset_timestamp);
 
 	/* Returns the record as Datum */
 	PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index f645e8486b..1b9db4e45a 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2225,16 +2225,23 @@ pg_stat_user_tables| SELECT relid,
     autoanalyze_count
    FROM pg_stat_all_tables
   WHERE ((schemaname <> ALL (ARRAY['pg_catalog'::name, 'information_schema'::name])) AND (schemaname !~ '^pg_toast'::text));
-pg_stat_wal| SELECT wal_records,
-    wal_fpi,
-    wal_bytes,
-    wal_buffers_full,
-    wal_write,
-    wal_sync,
-    wal_write_time,
-    wal_sync_time,
-    stats_reset
-   FROM pg_stat_get_wal() w(wal_records, wal_fpi, wal_bytes, wal_buffers_full, wal_write, wal_sync, wal_write_time, wal_sync_time, stats_reset);
+pg_stat_wal| WITH pgsio_sum_write_fsync_time AS (
+         SELECT sum(pg_stat_io.write_time) AS sum_write_time,
+            sum(pg_stat_io.fsync_time) AS sum_fsync_time
+           FROM pg_stat_io
+          WHERE ((pg_stat_io.context = 'normal'::text) AND (pg_stat_io.object = 'wal'::text))
+        )
+ SELECT w.wal_records,
+    w.wal_fpi,
+    w.wal_bytes,
+    w.wal_buffers_full,
+    w.wal_write,
+    w.wal_sync,
+    p.sum_write_time AS write_time,
+    p.sum_fsync_time AS fsync_time,
+    w.stats_reset
+   FROM (pg_stat_get_wal() w(wal_records, wal_fpi, wal_bytes, wal_buffers_full, wal_write, wal_sync, stats_reset)
+     CROSS JOIN pgsio_sum_write_fsync_time p);
 pg_stat_wal_receiver| SELECT pid,
     status,
     receive_start_lsn,
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index 346e10a3d2..c566ddbdd4 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -862,6 +862,33 @@ WHERE pg_stat_get_backend_pid(beid) = pg_backend_pid();
  t
 (1 row)
 
+-- Test pg_stat_io for WAL in an init context, that should do writes
+-- and syncs.
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'init' and object = 'wal' \gset io_sum_wal_init_
+SELECT :io_sum_wal_init_writes > 0;
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT current_setting('fsync') = 'off'
+  OR :io_sum_wal_init_fsyncs > 0;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- Test pg_stat_io for WAL in a normal context, that should do reads as well.
+SELECT SUM(reads) > 0
+  FROM pg_stat_io
+  WHERE context = 'normal' and object = 'wal';
+ ?column? 
+----------
+ t
+(1 row)
+
 -----
 -- Test that resetting stats works for reset timestamp
 -----
@@ -1255,6 +1282,7 @@ SELECT pg_stat_get_subscription_stats(NULL);
 -- - extends of relations using shared buffers
 -- - fsyncs done to ensure the durability of data dirtying shared buffers
 -- - shared buffer hits
+-- - WAL writes and fsyncs in IOContext IOCONTEXT_NORMAL
 -- There is no test for blocks evicted from shared buffers, because we cannot
 -- be sure of the state of shared buffers at the point the test is run.
 -- Create a regular table and insert some data to generate IOCONTEXT_NORMAL
@@ -1264,6 +1292,9 @@ SELECT sum(extends) AS io_sum_shared_before_extends
 SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
   FROM pg_stat_io
   WHERE object = 'relation' \gset io_sum_shared_before_
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'normal' and object = 'wal' \gset io_sum_wal_normal_before_
 CREATE TABLE test_io_shared(a int);
 INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
 SELECT pg_stat_force_next_flush();
@@ -1301,6 +1332,28 @@ SELECT current_setting('fsync') = 'off'
  t
 (1 row)
 
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'normal' and object = 'wal' \gset io_sum_wal_normal_after_
+SELECT current_setting('synchronous_commit') = 'on';
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT :io_sum_wal_normal_after_writes > :io_sum_wal_normal_before_writes;
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT current_setting('fsync') = 'off'
+  OR :io_sum_wal_normal_after_fsyncs > :io_sum_wal_normal_before_fsyncs;
+ ?column? 
+----------
+ t
+(1 row)
+
 -- Change the tablespace so that the table is rewritten directly, then SELECT
 -- from it to cause it to be read back into shared buffers.
 SELECT sum(reads) AS io_sum_shared_before_reads
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index e3b4ca96e8..aa6caf7fd7 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -442,6 +442,20 @@ SELECT (current_schemas(true))[1] = ('pg_temp_' || beid::text) AS match
 FROM pg_stat_get_backend_idset() beid
 WHERE pg_stat_get_backend_pid(beid) = pg_backend_pid();
 
+-- Test pg_stat_io for WAL in an init context, that should do writes
+-- and syncs.
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'init' and object = 'wal' \gset io_sum_wal_init_
+SELECT :io_sum_wal_init_writes > 0;
+SELECT current_setting('fsync') = 'off'
+  OR :io_sum_wal_init_fsyncs > 0;
+
+-- Test pg_stat_io for WAL in a normal context, that should do reads as well.
+SELECT SUM(reads) > 0
+  FROM pg_stat_io
+  WHERE context = 'normal' and object = 'wal';
+
 -----
 -- Test that resetting stats works for reset timestamp
 -----
@@ -601,6 +615,7 @@ SELECT pg_stat_get_subscription_stats(NULL);
 -- - extends of relations using shared buffers
 -- - fsyncs done to ensure the durability of data dirtying shared buffers
 -- - shared buffer hits
+-- - WAL writes and fsyncs in IOContext IOCONTEXT_NORMAL
 
 -- There is no test for blocks evicted from shared buffers, because we cannot
 -- be sure of the state of shared buffers at the point the test is run.
@@ -612,6 +627,9 @@ SELECT sum(extends) AS io_sum_shared_before_extends
 SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
   FROM pg_stat_io
   WHERE object = 'relation' \gset io_sum_shared_before_
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'normal' and object = 'wal' \gset io_sum_wal_normal_before_
 CREATE TABLE test_io_shared(a int);
 INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
 SELECT pg_stat_force_next_flush();
@@ -630,6 +648,13 @@ SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
 SELECT :io_sum_shared_after_writes > :io_sum_shared_before_writes;
 SELECT current_setting('fsync') = 'off'
   OR :io_sum_shared_after_fsyncs > :io_sum_shared_before_fsyncs;
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'normal' and object = 'wal' \gset io_sum_wal_normal_after_
+SELECT current_setting('synchronous_commit') = 'on';
+SELECT :io_sum_wal_normal_after_writes > :io_sum_wal_normal_before_writes;
+SELECT current_setting('fsync') = 'off'
+  OR :io_sum_wal_normal_after_fsyncs > :io_sum_wal_normal_before_fsyncs;
 
 -- Change the tablespace so that the table is rewritten directly, then SELECT
 -- from it to cause it to be read back into shared buffers.
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index b804eb8b5e..a721f7b345 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2499,9 +2499,10 @@ description | Waiting for a newly initialized WAL file to reach durable storage
   </para>
 
   <para>
-   Currently, I/O on relations (e.g. tables, indexes) is tracked. However,
-   relation I/O which bypasses shared buffers (e.g. when moving a table from one
-   tablespace to another) is currently not tracked.
+   Currently, I/O on relations (e.g. tables, indexes) and WAL activity are
+   tracked. However, relation I/O which bypasses shared buffers
+   (e.g. when moving a table from one tablespace to another) is currently
+   not tracked.
   </para>
 
   <table id="pg-stat-io-view" xreflabel="pg_stat_io">
@@ -2554,6 +2555,11 @@ description | Waiting for a newly initialized WAL file to reach durable storage
           <literal>temp relation</literal>: Temporary relations.
          </para>
         </listitem>
+        <listitem>
+         <para>
+          <literal>wal</literal>: Write Ahead Logs.
+         </para>
+        </listitem>
        </itemizedlist>
        </para>
       </entry>
@@ -2578,6 +2584,13 @@ description | Waiting for a newly initialized WAL file to reach durable storage
           <literal>normal</literal>.
          </para>
         </listitem>
+        <listitem>
+         <para>
+          <literal>init</literal>: I/O operations performed while creating the
+          WAL segments are tracked in <varname>context</varname>
+          <literal>init</literal>.
+         </para>
+        </listitem>
         <listitem>
          <para>
           <literal>vacuum</literal>: I/O operations performed outside of shared
@@ -4767,7 +4780,9 @@ description | Waiting for a newly initialized WAL file to reach durable storage
         <listitem>
          <para>
           <literal>io</literal>: Reset all the counters shown in the
-          <structname>pg_stat_io</structname> view.
+          <structname>pg_stat_io</structname> view. This resets the timing
+          counter stored in the <structname>pg_stat_wal</structname> view
+          too.
          </para>
         </listitem>
         <listitem>
@@ -4784,8 +4799,10 @@ description | Waiting for a newly initialized WAL file to reach durable storage
         </listitem>
         <listitem>
          <para>
-          <literal>wal</literal>: Reset all the counters shown in the
-          <structname>pg_stat_wal</structname> view.
+          <literal>wal</literal>: Reset all the counters except timings shown
+          in the <structname>pg_stat_wal</structname> view. The timing
+          counters can be reset by calling pg_stat_reset_shared set to
+          <literal>io</literal>.
          </para>
         </listitem>
         <listitem>
-- 
2.43.0

#27

Michael Paquier

michael@paquier.xyz

about 2 years ago

In reply to: Michael Paquier (#26)

Re: Show WAL write and fsync stats in pg_stat_io

On Mon, Dec 25, 2023 at 03:20:58PM +0900, Michael Paquier wrote:

pgstat_tracks_io_bktype() does not look correct to me. Why is the WAL
receiver considered as something correct in the list of backend types,
while the intention is to *not* add it to pg_stat_io? I have tried to
switch to the correct behavior of returning false for a
B_WAL_RECEIVER, to notice that pg_rewind's test 002_databases.pl
freezes on its shutdown sequence. Something weird is going on here.
Could you look at it? See the XXX comment in the attached, which is
the same behavior as v6-0002. It looks to me that the patch has
introduced an infinite loop tweaking pgstat_tracks_io_bktype() in an
incorrect way to avoid the root issue.

Ah, that's because it would trigger an assertion failure:
TRAP: failed Assert("pgstat_tracks_io_op(MyBackendType, io_object,
io_context, io_op)"), File: "pgstat_io.c", Line: 89, PID: 6824
postgres: standby_local: walreceiver
(ExceptionalCondition+0xa8)[0x560d1b4dd38a]

And the backtrace just tells that this is the WAL receiver
initializing a WAL segment:
#5 0x0000560d1b3322c8 in pgstat_count_io_op_n
(io_object=IOOBJECT_WAL, io_context=IOCONTEXT_INIT, io_op=IOOP_WRITE,
cnt=1) at pgstat_io.c:89
#6 0x0000560d1b33254a in pgstat_count_io_op_time
(io_object=IOOBJECT_WAL, io_context=IOCONTEXT_INIT, io_op=IOOP_WRITE,
start_time=..., cnt=1) at pgstat_io.c:181
#7 0x0000560d1ae7f932 in XLogFileInitInternal (logsegno=3, logtli=1,
added=0x7ffd2733c6eb, path=0x7ffd2733c2e0 "pg_wal/00000001", '0'
<repeats 15 times>, "3") at xlog.c:3115
#8 0x0000560d1ae7fc4e in XLogFileInit (logsegno=3, logtli=1) at
xlog.c:3215

Wouldn't it be simpler to just bite the bullet in this case and handle
WAL receivers in the IO tracking?
--
Michael

#28

Nazir Bilal Yavuz

byavuz81@gmail.com

about 2 years ago

In reply to: Michael Paquier (#27)

Re: Show WAL write and fsync stats in pg_stat_io

Hi,

Thanks for the review and feedback on your previous reply!

On Mon, 25 Dec 2023 at 09:40, Michael Paquier <michael@paquier.xyz> wrote:

On Mon, Dec 25, 2023 at 03:20:58PM +0900, Michael Paquier wrote:

pgstat_tracks_io_bktype() does not look correct to me. Why is the WAL
receiver considered as something correct in the list of backend types,
while the intention is to *not* add it to pg_stat_io? I have tried to
switch to the correct behavior of returning false for a
B_WAL_RECEIVER, to notice that pg_rewind's test 002_databases.pl
freezes on its shutdown sequence. Something weird is going on here.
Could you look at it? See the XXX comment in the attached, which is
the same behavior as v6-0002. It looks to me that the patch has
introduced an infinite loop tweaking pgstat_tracks_io_bktype() in an
incorrect way to avoid the root issue.

Ah, that's because it would trigger an assertion failure:
TRAP: failed Assert("pgstat_tracks_io_op(MyBackendType, io_object,
io_context, io_op)"), File: "pgstat_io.c", Line: 89, PID: 6824
postgres: standby_local: walreceiver
(ExceptionalCondition+0xa8)[0x560d1b4dd38a]

And the backtrace just tells that this is the WAL receiver
initializing a WAL segment:
#5 0x0000560d1b3322c8 in pgstat_count_io_op_n
(io_object=IOOBJECT_WAL, io_context=IOCONTEXT_INIT, io_op=IOOP_WRITE,
cnt=1) at pgstat_io.c:89
#6 0x0000560d1b33254a in pgstat_count_io_op_time
(io_object=IOOBJECT_WAL, io_context=IOCONTEXT_INIT, io_op=IOOP_WRITE,
start_time=..., cnt=1) at pgstat_io.c:181
#7 0x0000560d1ae7f932 in XLogFileInitInternal (logsegno=3, logtli=1,
added=0x7ffd2733c6eb, path=0x7ffd2733c2e0 "pg_wal/00000001", '0'
<repeats 15 times>, "3") at xlog.c:3115
#8 0x0000560d1ae7fc4e in XLogFileInit (logsegno=3, logtli=1) at
xlog.c:3215

Correct.

Wouldn't it be simpler to just bite the bullet in this case and handle
WAL receivers in the IO tracking?

There is one problem and I couldn't decide how to solve it. We need to
handle read IO in WALRead() in xlogreader.c. How many bytes the
WALRead() function will read is controlled by a variable and it can be
different from XLOG_BLCKSZ. This is a problem because pg_stat_io's
op_bytes column is a constant.

Here are all WALRead() function calls:

1- read_local_xlog_page_guts() in xlogutils.c => WALRead(XLOG_BLCKSZ)
=> always reads XLOG_BLCKSZ.

2- summarizer_read_local_xlog_page() in walsummarizer.c =>
WALRead(XLOG_BLCKSZ) => always reads XLOG_BLCKSZ.

3- logical_read_xlog_page() in walsender.c => WALRead(XLOG_BLCKSZ) =>
always reads XLOG_BLCKSZ.

4- XLogSendPhysical() in walsender.c => WALRead(nbytes) => nbytes can
be different from XLOG_BLCKSZ.

5- WALDumpReadPage() in pg_waldump.c => WALRead(count) => count can be
different from XLOG_BLCKSZ.

4 and 5 are the problematic calls.
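
To make the mismatch concrete, here is a minimal standalone sketch (not PostgreSQL code). All sketch_* names and the 8192 value are assumptions for illustration; the real XLOG_BLCKSZ is chosen at build time. It compares what a reader of the view would infer from reads * op_bytes with what was actually read.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 8192 is a common default; the real XLOG_BLCKSZ is a build-time value. */
#define SKETCH_XLOG_BLCKSZ 8192

/* What a reader of the view would compute: reads * op_bytes. */
static uint64_t
sketch_inferred_bytes(uint64_t reads, uint64_t op_bytes)
{
	return reads * op_bytes;
}

/* What was really transferred: the sum of each call's request size. */
static uint64_t
sketch_actual_bytes(const size_t *sizes, size_t ncalls)
{
	uint64_t	total = 0;

	assert(sizes != NULL || ncalls == 0);
	for (size_t i = 0; i < ncalls; i++)
		total += sizes[i];
	return total;
}
```

With three page-sized reads both functions give 24576 bytes. With two walsender-style reads of 100 and 20000 bytes, the inferred value is 16384 while 20100 bytes were actually read.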

Melanie's answer to this problem on previous discussions:

On Wed, 9 Aug 2023 at 21:52, Melanie Plageman <melanieplageman@gmail.com> wrote:

If there is any combination of BackendType and IOContext which will
always read XLOG_BLCKSZ bytes, we could use XLOG_BLCKSZ for that row's
op_bytes. For other cases, we may have to consider using op_bytes 1 and
tracking reads and write IOOps in number of bytes (instead of number of
pages). I don't actually know if there is a clear separation by
BackendType for these different cases.

Using op_bytes as 1 solves this problem but since it will be different
from the rest of the pg_stat_io view it could be hard to understand.
There is no clear separation by backends as it can be seen from the walsender.

The other alternative I see is to use XLOG_BLCKSZ as the op_bytes and
treat op_bytes * number of reads as an approximation of the number of
bytes read. I don't actually know what makes more sense. I don't think I
would like having a number for bytes that is not accurate.

Also, we have a similar problem in XLogPageRead() in xlogrecovery.c.
pg_pread() call tries to read XLOG_BLCKSZ but it is not certain and we
don't count IO if it couldn't read XLOG_BLCKSZ. IMO, this is not as
important as the previous problem but it still is a problem.

I would be glad to hear opinions on these problems.

--
Regards,
Nazir Bilal Yavuz
Microsoft

#29

Michael Paquier

michael@paquier.xyz

about 2 years ago

In reply to: Nazir Bilal Yavuz (#28)

Re: Show WAL write and fsync stats in pg_stat_io

On Mon, Dec 25, 2023 at 04:09:34PM +0300, Nazir Bilal Yavuz wrote:

On Wed, 9 Aug 2023 at 21:52, Melanie Plageman <melanieplageman@gmail.com> wrote:

If there is any combination of BackendType and IOContext which will
always read XLOG_BLCKSZ bytes, we could use XLOG_BLCKSZ for that row's
op_bytes. For other cases, we may have to consider using op_bytes 1 and
tracking reads and write IOOps in number of bytes (instead of number of
pages). I don't actually know if there is a clear separation by
BackendType for these different cases.

Using op_bytes as 1 solves this problem but since it will be different
from the rest of the pg_stat_io view it could be hard to understand.
There is no clear separation by backends as it can be seen from the walsender.

I find the use of 1 in this context a bit confusing, because when
referring to a counter at N, then it can be understood as doing N
times an operation, but it would be much less than that. Another
solution would be to use NULL (as a synonym of "I don't know") and
then document that in this case all the bigint counters of pg_stat_io
track the number of bytes rather than the number of operations?

The other alternative I see is to use XLOG_BLCKSZ as the op_bytes and
treat op_bytes * number of reads as an approximation of the number of
bytes read. I don't actually know what makes more sense. I don't think I
would like having a number for bytes that is not accurate.

Also, we have a similar problem in XLogPageRead() in xlogrecovery.c.
pg_pread() call tries to read XLOG_BLCKSZ but it is not certain and we
don't count IO if it couldn't read XLOG_BLCKSZ. IMO, this is not as
important as the previous problem but it still is a problem.

I would be glad to hear opinions on these problems.

Correctness matters a lot for monitoring, IMO.
--
Michael

#30

Nazir Bilal Yavuz

byavuz81@gmail.com

about 2 years ago

In reply to: Michael Paquier (#29)

Re: Show WAL write and fsync stats in pg_stat_io

Hi,

On Tue, 26 Dec 2023 at 03:06, Michael Paquier <michael@paquier.xyz> wrote:

On Mon, Dec 25, 2023 at 04:09:34PM +0300, Nazir Bilal Yavuz wrote:

On Wed, 9 Aug 2023 at 21:52, Melanie Plageman <melanieplageman@gmail.com> wrote:

If there is any combination of BackendType and IOContext which will
always read XLOG_BLCKSZ bytes, we could use XLOG_BLCKSZ for that row's
op_bytes. For other cases, we may have to consider using op_bytes 1 and
tracking reads and write IOOps in number of bytes (instead of number of
pages). I don't actually know if there is a clear separation by
BackendType for these different cases.

Using op_bytes as 1 solves this problem but since it will be different
from the rest of the pg_stat_io view it could be hard to understand.
There is no clear separation by backends as it can be seen from the walsender.

I find the use of 1 in this context a bit confusing, because when
referring to a counter at N, then it can be understood as doing N
times an operation, but it would be much less than that. Another
solution would be to use NULL (as a synonym of "I don't know") and
then document that in this case all the bigint counters of pg_stat_io
track the number of bytes rather than the number of operations?

Yes, that makes sense.

Maybe it is better to create a pg_stat_io_wal view like you said
before. We could remove unused columns and add op_bytes for each
writes and reads. Also, we can track both the number of bytes and the
number of the operations. This doesn't fully solve the problem but it
will be easier to modify it to meet our needs.

--
Regards,
Nazir Bilal Yavuz
Microsoft

#31

Michael Paquier

michael@paquier.xyz

about 2 years ago

In reply to: Nazir Bilal Yavuz (#30)

Re: Show WAL write and fsync stats in pg_stat_io

On Tue, Dec 26, 2023 at 11:27:16AM +0300, Nazir Bilal Yavuz wrote:

Maybe it is better to create a pg_stat_io_wal view like you said
before. We could remove unused columns and add op_bytes for each
writes and reads. Also, we can track both the number of bytes and the
number of the operations. This doesn't fully solve the problem but it
will be easier to modify it to meet our needs.

I am not sure, as the whole point of the exercise is to have all the
I/O related data in a single view. Something that I've also found a
bit disturbing yesterday while looking at your patch is the fact that
the operation size is guessed from the context and object type when
querying the view because now everything is tied to BLCKSZ. This
patch extends it with two more operation sizes, and there are even
cases where it may be a variable. Could it be a better option to
extend pgstat_count_io_op_time() so as callers can themselves give the
size of the operation?

The whole patch is kind of itself complicated enough, so I'd be OK to
discard the case of the WAL receiver for now. Now, if we do so, the
code stack of pgstat_io.c should handle WAL receivers as something
entirely disabled until all the known issues are solved. There is
still a lot of value in tracking WAL data associated to the WAL
writer, normal backends and WAL senders.
--
Michael

#32

Nazir Bilal Yavuz

byavuz81@gmail.com

about 2 years ago

In reply to: Michael Paquier (#31)

Re: Show WAL write and fsync stats in pg_stat_io

Hi,

On Tue, 26 Dec 2023 at 13:10, Michael Paquier <michael@paquier.xyz> wrote:

On Tue, Dec 26, 2023 at 11:27:16AM +0300, Nazir Bilal Yavuz wrote:

Maybe it is better to create a pg_stat_io_wal view like you said
before. We could remove unused columns and add op_bytes for each
writes and reads. Also, we can track both the number of bytes and the
number of the operations. This doesn't fully solve the problem but it
will be easier to modify it to meet our needs.

I am not sure, as the whole point of the exercise is to have all the
I/O related data in a single view. Something that I've also found a
bit disturbing yesterday while looking at your patch is the fact that
the operation size is guessed from the context and object type when
querying the view because now everything is tied to BLCKSZ. This
patch extends it with two more operation sizes, and there are even
cases where it may be a variable. Could it be a better option to
extend pgstat_count_io_op_time() so as callers can themselves give the
size of the operation?

Do you mean removing the op_bytes column and tracking the number of
bytes in reads, writes, and extends? If so, that makes sense to me but
I don't want to remove the number of operations; I believe that has a
value too. We can extend the pgstat_count_io_op_time() so it can both
track the number of bytes and the number of operations.
Also, it is not directly related to this patch but vectored IO [1] is
coming soon; so the number of operations could be wrong since vectored
IO could merge a couple of operations.

The whole patch is kind of itself complicated enough, so I'd be OK to
discard the case of the WAL receiver for now. Now, if we do so, the
code stack of pgstat_io.c should handle WAL receivers as something
entirely disabled until all the known issues are solved. There is
still a lot of value in tracking WAL data associated to the WAL
writer, normal backends and WAL senders.

Why can't we add comments and leave it as it is? Is it because this
could cause misunderstandings?

If we want to entirely disable it, we can add

if (MyBackendType == B_WAL_RECEIVER && io_object == IOOBJECT_WAL)
return;

to the top of pgstat_count_io_op_time(), since all IOOBJECT_WAL calls
go through this function; then we can disable it in
pgstat_tracks_io_bktype().

[1]: /messages/by-id/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com

--
Regards,
Nazir Bilal Yavuz
Microsoft

#33

Michael Paquier

michael@paquier.xyz

about 2 years ago

In reply to: Nazir Bilal Yavuz (#32)

Re: Show WAL write and fsync stats in pg_stat_io

On Tue, Dec 26, 2023 at 03:35:52PM +0300, Nazir Bilal Yavuz wrote:

On Tue, 26 Dec 2023 at 13:10, Michael Paquier <michael@paquier.xyz> wrote:

I am not sure while the whole point of the exercise is to have all the
I/O related data in a single view. Something that I've also found a
bit disturbing yesterday while looking at your patch is the fact that
the operation size is guessed from the context and object type when
querying the view because now everything is tied to BLCKSZ. This
patch extends it with two more operation sizes, and there are even
cases where it may be a variable. Could it be a better option to
extend pgstat_count_io_op_time() so as callers can themselves give the
size of the operation?

Do you mean removing the op_bytes column and tracking the number of
bytes in reads, writes, and extends? If so, that makes sense to me but
I don't want to remove the number of operations; I believe that has a
value too. We can extend the pgstat_count_io_op_time() so it can both
track the number of bytes and the number of operations.

Apologies if my previous wording sounded confusing. The idea I had in
mind was to keep op_bytes in pg_stat_io, and extend it so that a value
of NULL (or 0, or -1) means that "writes", "extends" and "reads" are
reported as a number of bytes.

Also, it is not directly related to this patch but vectored IO [1] is
coming soon; so the number of operations could be wrong since vectored
IO could merge a couple of operations.

Hmm. I have not checked this patch series so I cannot say for sure,
but we'd likely just want to track the number of bytes if a single
operation has a non-equal size rather than registering in pg_stat_io N
rows with different op_bytes, no? I am looping in Thomas Munro in CC
for comments.

The whole patch is kind of itself complicated enough, so I'd be OK to
discard the case of the WAL receiver for now. Now, if we do so, the
code stack of pgstat_io.c should handle WAL receivers as something
entirely disabled until all the known issues are solved. There is
still a lot of value in tracking WAL data associated to the WAL
writer, normal backends and WAL senders.

Why can't we add comments and leave it as it is? Is it because this
could cause misunderstandings?

If we want to entirely disable it, we can add

if (MyBackendType == B_WAL_RECEIVER && io_object == IOOBJECT_WAL)
return;

to the top of the pgstat_count_io_op_time() since all IOOBJECT_WAL
calls are done by this function, then we can disable it at
pgstat_tracks_io_bktype().

Yeah, a limitation like that may be acceptable for now. Tracking the
WAL writer and WAL sender activities can be relevant in a lot of cases
even if we don't have the full picture for the WAL receiver yet.
--
Michael

#34

Nazir Bilal Yavuz

byavuz81@gmail.com

about 2 years ago

In reply to: Michael Paquier (#33)

1 attachment(s)

Re: Show WAL write and fsync stats in pg_stat_io

Hi,

On Sun, 31 Dec 2023 at 03:58, Michael Paquier <michael@paquier.xyz> wrote:

On Tue, Dec 26, 2023 at 03:35:52PM +0300, Nazir Bilal Yavuz wrote:

On Tue, 26 Dec 2023 at 13:10, Michael Paquier <michael@paquier.xyz> wrote:

I am not sure while the whole point of the exercise is to have all the
I/O related data in a single view. Something that I've also found a
bit disturbing yesterday while looking at your patch is the fact that
the operation size is guessed from the context and object type when
querying the view because now everything is tied to BLCKSZ. This
patch extends it with two more operation sizes, and there are even
cases where it may be a variable. Could it be a better option to
extend pgstat_count_io_op_time() so as callers can themselves give the
size of the operation?

Do you mean removing the op_bytes column and tracking the number of
bytes in reads, writes, and extends? If so, that makes sense to me but
I don't want to remove the number of operations; I believe that has a
value too. We can extend the pgstat_count_io_op_time() so it can both
track the number of bytes and the number of operations.

Apologies if my previous wording sounded confusing. The idea I had in
mind was to keep op_bytes in pg_stat_io, and extend it so as a value
of NULL (or 0, or -1) is a synonym as "writes", "extends" and "reads"
as a number of bytes.

Oh, I understand it now. Yes, that makes sense.
I thought that removing op_bytes completely from the pg_stat_io view
(since, as you said, "This patch extends it with two more operation
sizes, and there are even cases where it may be a variable") and then
adding something like {read | write | extend}_bytes and
{read | write | extend}_calls could be better, so that we don't lose
any information.

Also, it is not directly related to this patch but vectored IO [1] is
coming soon; so the number of operations could be wrong since vectored
IO could merge a couple of operations.

Hmm. I have not checked this patch series so I cannot say for sure,
but we'd likely just want to track the number of bytes if a single
operation has a non-equal size rather than registering in pg_stat_io N
rows with different op_bytes, no?

Yes, that is correct.

I am looping in Thomas Munro in CC for comments.

Thanks for doing that.

If we want to entirely disable it, we can add

if (MyBackendType == B_WAL_RECEIVER && io_object == IOOBJECT_WAL)
return;

to the top of the pgstat_count_io_op_time() since all IOOBJECT_WAL
calls are done by this function, then we can disable it at
pgstat_tracks_io_bktype().

Yeah, a limitation like that may be acceptable for now. Tracking the
WAL writer and WAL sender activities can be relevant in a lot of cases
even if we don't have the full picture for the WAL receiver yet.

I added that check and disabled the B_WAL_RECEIVER backend, with
comments explaining why. v8 is attached.

--
Regards,
Nazir Bilal Yavuz
Microsoft

Attachments:

v8-0001-Show-WAL-stats-on-pg_stat_io-except-streaming-rep.patch (text/x-diff; charset=US-ASCII)

From dc25cccc8a5b168f15c9771f174924261b151193 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Wed, 3 Jan 2024 15:36:19 +0300
Subject: [PATCH v8] Show WAL stats on pg_stat_io (except streaming
 replication)

This patch shows WAL stats per backend type in the pg_stat_io view.

With this patch, pg_stat_io shows how many WAL operations each backend
type performs, along with their context, type and total timings.

This patch introduces a new IOContext, IOCONTEXT_INIT, for IO operations
done while creating objects. Currently, it is used only together with
IOObject IOOBJECT_WAL.

IOOBJECT_WAL means IO operations related to WAL.
IOOBJECT_WAL / IOCONTEXT_NORMAL means IO operations done on already
created WAL segments.
IOOBJECT_WAL / IOCONTEXT_INIT means IO operations done while creating
the WAL segments.

This patch currently covers:
- Documentation
- IOOBJECT_WAL / IOCONTEXT_NORMAL / read, write and fsync stats on
  pg_stat_io.
- IOOBJECT_WAL / IOCONTEXT_INIT / write and fsync stats on pg_stat_io.

It does not cover:
- Streaming replication WAL IO.
---
 src/include/catalog/pg_proc.dat           |   6 +-
 src/include/pgstat.h                      |   6 +-
 src/backend/access/transam/xlog.c         |  58 +++++------
 src/backend/access/transam/xlogrecovery.c |  10 ++
 src/backend/catalog/system_views.sql      |  15 ++-
 src/backend/utils/activity/pgstat_io.c    | 119 ++++++++++++++++++++--
 src/backend/utils/adt/pgstatfuncs.c       |  24 ++---
 src/test/regress/expected/rules.out       |  27 +++--
 src/test/regress/expected/stats.out       |  53 ++++++++++
 src/test/regress/sql/stats.sql            |  25 +++++
 doc/src/sgml/monitoring.sgml              |  29 ++++--
 11 files changed, 292 insertions(+), 80 deletions(-)

diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 5b67784731a..26197dbb817 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5785,9 +5785,9 @@
 { oid => '1136', descr => 'statistics: information about WAL activity',
   proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
   proparallel => 'r', prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int8,int8,numeric,int8,int8,int8,float8,float8,timestamptz}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o}',
-  proargnames => '{wal_records,wal_fpi,wal_bytes,wal_buffers_full,wal_write,wal_sync,wal_write_time,wal_sync_time,stats_reset}',
+  proallargtypes => '{int8,int8,numeric,int8,int8,int8,timestamptz}',
+  proargmodes => '{o,o,o,o,o,o,o}',
+  proargnames => '{wal_records,wal_fpi,wal_bytes,wal_buffers_full,wal_write,wal_sync,stats_reset}',
   prosrc => 'pg_stat_get_wal' },
 { oid => '6248', descr => 'statistics: information about WAL prefetching',
   proname => 'pg_stat_get_recovery_prefetch', prorows => '1', proretset => 't',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index ab91b3b367d..c23970da49f 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -279,14 +279,16 @@ typedef enum IOObject
 {
 	IOOBJECT_RELATION,
 	IOOBJECT_TEMP_RELATION,
+	IOOBJECT_WAL,
 } IOObject;
 
-#define IOOBJECT_NUM_TYPES (IOOBJECT_TEMP_RELATION + 1)
+#define IOOBJECT_NUM_TYPES (IOOBJECT_WAL + 1)
 
 typedef enum IOContext
 {
 	IOCONTEXT_BULKREAD,
 	IOCONTEXT_BULKWRITE,
+	IOCONTEXT_INIT,
 	IOCONTEXT_NORMAL,
 	IOCONTEXT_VACUUM,
 } IOContext;
@@ -527,6 +529,8 @@ extern void pgstat_count_io_op_time(IOObject io_object, IOContext io_context,
 									IOOp io_op, instr_time start_time, uint32 cnt);
 
 extern PgStat_IO *pgstat_fetch_stat_io(void);
+extern int	pgstat_get_io_op_bytes(IOObject io_object,
+								   IOContext io_context);
 extern const char *pgstat_get_io_context_name(IOContext io_context);
 extern const char *pgstat_get_io_object_name(IOObject io_object);
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1264849883b..84b6ca9d1d8 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2281,38 +2281,22 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
 			Size		nbytes;
 			Size		nleft;
 			ssize_t		written;
-			instr_time	start;
+			instr_time	io_start;
 
 			/* OK to write the page(s) */
 			from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
 			nbytes = npages * (Size) XLOG_BLCKSZ;
 			nleft = nbytes;
+
+			io_start = pgstat_prepare_io_time(track_wal_io_timing);
 			do
 			{
 				errno = 0;
 
-				/* Measure I/O timing to write WAL data */
-				if (track_wal_io_timing)
-					INSTR_TIME_SET_CURRENT(start);
-				else
-					INSTR_TIME_SET_ZERO(start);
-
 				pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
 				written = pg_pwrite(openLogFile, from, nleft, startoffset);
 				pgstat_report_wait_end();
 
-				/*
-				 * Increment the I/O timing and the number of times WAL data
-				 * were written out to disk.
-				 */
-				if (track_wal_io_timing)
-				{
-					instr_time	end;
-
-					INSTR_TIME_SET_CURRENT(end);
-					INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_write_time, end, start);
-				}
-
 				PendingWalStats.wal_write++;
 
 				if (written <= 0)
@@ -2337,6 +2321,9 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
 				startoffset += written;
 			} while (nleft > 0);
 
+			pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL,
+									IOOP_WRITE, io_start, npages);
+
 			npages = 0;
 
 			/*
@@ -3043,6 +3030,7 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	int			fd;
 	int			save_errno;
 	int			open_flags = O_RDWR | O_CREAT | O_EXCL | PG_BINARY;
+	instr_time	io_start;
 
 	Assert(logtli != 0);
 
@@ -3086,6 +3074,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 				(errcode_for_file_access(),
 				 errmsg("could not create file \"%s\": %m", tmppath)));
 
+	/* start timing writes for stats */
+	io_start = pgstat_prepare_io_time(track_wal_io_timing);
+
 	pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_WRITE);
 	save_errno = 0;
 	if (wal_init_zero)
@@ -3121,6 +3112,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	}
 	pgstat_report_wait_end();
 
+	pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_INIT, IOOP_WRITE,
+							io_start, 1);
+
 	if (save_errno)
 	{
 		/*
@@ -3137,6 +3131,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 				 errmsg("could not write to file \"%s\": %m", tmppath)));
 	}
 
+	/* start timing fsyncs for stats */
+	io_start = pgstat_prepare_io_time(track_wal_io_timing);
+
 	pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_SYNC);
 	if (pg_fsync(fd) != 0)
 	{
@@ -3149,6 +3146,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	}
 	pgstat_report_wait_end();
 
+	pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_INIT,
+							IOOP_FSYNC, io_start, 1);
+
 	if (close(fd) != 0)
 		ereport(ERROR,
 				(errcode_for_file_access(),
@@ -8414,7 +8414,7 @@ void
 issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 {
 	char	   *msg = NULL;
-	instr_time	start;
+	instr_time	io_start;
 
 	Assert(tli != 0);
 
@@ -8427,11 +8427,7 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 		wal_sync_method == WAL_SYNC_METHOD_OPEN_DSYNC)
 		return;
 
-	/* Measure I/O timing to sync the WAL file */
-	if (track_wal_io_timing)
-		INSTR_TIME_SET_CURRENT(start);
-	else
-		INSTR_TIME_SET_ZERO(start);
+	io_start = pgstat_prepare_io_time(track_wal_io_timing);
 
 	pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC);
 	switch (wal_sync_method)
@@ -8475,16 +8471,8 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 
 	pgstat_report_wait_end();
 
-	/*
-	 * Increment the I/O timing and the number of times WAL files were synced.
-	 */
-	if (track_wal_io_timing)
-	{
-		instr_time	end;
-
-		INSTR_TIME_SET_CURRENT(end);
-		INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_sync_time, end, start);
-	}
+	pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_FSYNC,
+							io_start, 1);
 
 	PendingWalStats.wal_sync++;
 }
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 6f4f81f9927..bb42263f2da 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -60,6 +60,7 @@
 #include "utils/builtins.h"
 #include "utils/datetime.h"
 #include "utils/guc_hooks.h"
+#include "utils/pgstat_internal.h"
 #include "utils/pg_lsn.h"
 #include "utils/ps_status.h"
 #include "utils/pg_rusage.h"
@@ -1777,6 +1778,9 @@ PerformWalRecovery(void)
 			 */
 			ApplyWalRecord(xlogreader, record, &replayTLI);
 
+			/* Report pending statistics to the cumulative stats system  */
+			pgstat_flush_io(false);
+
 			/* Exit loop if we reached inclusive recovery target */
 			if (recoveryStopsAfter(xlogreader))
 			{
@@ -3252,6 +3256,7 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 	uint32		targetPageOff;
 	XLogSegNo	targetSegNo PG_USED_FOR_ASSERTS_ONLY;
 	int			r;
+	instr_time	io_start;
 
 	XLByteToSeg(targetPagePtr, targetSegNo, wal_segment_size);
 	targetPageOff = XLogSegmentOffset(targetPagePtr, wal_segment_size);
@@ -3344,6 +3349,8 @@ retry:
 	/* Read the requested page */
 	readOff = targetPageOff;
 
+	io_start = pgstat_prepare_io_time(track_wal_io_timing);
+
 	pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
 	r = pg_pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);
 	if (r != XLOG_BLCKSZ)
@@ -3372,6 +3379,9 @@ retry:
 	}
 	pgstat_report_wait_end();
 
+	pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_READ,
+							io_start, 1);
+
 	Assert(targetSegNo == readSegNo);
 	Assert(targetPageOff == readOff);
 	Assert(reqLen <= readLen);
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 058fc47c919..c57b5428ae5 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1172,6 +1172,13 @@ SELECT
 FROM pg_stat_get_io() b;
 
 CREATE VIEW pg_stat_wal AS
+    WITH pgsio_sum_write_fsync_time AS (
+        SELECT
+            sum(write_time) as sum_write_time,
+            sum(fsync_time) as sum_fsync_time
+        FROM pg_stat_io
+        WHERE context = 'normal' and object = 'wal'
+    )
     SELECT
         w.wal_records,
         w.wal_fpi,
@@ -1179,10 +1186,12 @@ CREATE VIEW pg_stat_wal AS
         w.wal_buffers_full,
         w.wal_write,
         w.wal_sync,
-        w.wal_write_time,
-        w.wal_sync_time,
+        p.sum_write_time as write_time,
+        p.sum_fsync_time as fsync_time,
         w.stats_reset
-    FROM pg_stat_get_wal() w;
+    FROM pg_stat_get_wal() w
+        CROSS JOIN pgsio_sum_write_fsync_time p;
+
 
 CREATE VIEW pg_stat_progress_analyze AS
     SELECT
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 0dd9b98b3e4..8953a705615 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -16,6 +16,7 @@
 
 #include "postgres.h"
 
+#include "access/xlog.h"
 #include "executor/instrument.h"
 #include "storage/bufmgr.h"
 #include "utils/pgstat_internal.h"
@@ -115,6 +116,20 @@ pgstat_prepare_io_time(bool track_io_guc)
 	return io_start;
 }
 
+/*
+ * Decide if IO timings need to be tracked.  Timings associated to
+ * IOOBJECT_WAL objects are tracked if track_wal_io_timing is enabled,
+ * else rely on track_io_timing.
+ */
+static bool
+pgstat_should_track_io_time(IOObject io_object)
+{
+	if (io_object == IOOBJECT_WAL)
+		return track_wal_io_timing;
+
+	return track_io_timing;
+}
+
 /*
  * Like pgstat_count_io_op_n() except it also accumulates time.
  */
@@ -122,7 +137,28 @@ void
 pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
 						instr_time start_time, uint32 cnt)
 {
-	if (track_io_timing)
+	/*
+	 * B_WAL_RECEIVER backend does IOOBJECT_WAL IOObject & IOOP_READ IOOp IOs
+	 * but these IOs are not countable for now because IOOP_READ IOs' op_bytes
+	 * (number of bytes per unit of I/O) might not be the same all the time.
+	 * The current implementation requires that the op_bytes must be the same
+	 * for the same IOObject, IOContext and IOOp. To avoid confusion, the
+	 * B_WAL_RECEIVER backend & IOOBJECT_WAL IOObject IOs are disabled for
+	 * now.
+	 */
+	if (MyBackendType == B_WAL_RECEIVER && io_object == IOOBJECT_WAL)
+		return;
+
+	/*
+	 * Accumulate timing data.  pgstat_count_buffer is for pgstat_database. As
+	 * pg_stat_database only counts blk_read_time and blk_write_time, it is
+	 * set for IOOP_READ and IOOP_WRITE.
+	 *
+	 * pgBufferUsage is for EXPLAIN.  pgBufferUsage has write and read stats
+	 * for shared, local and temporary blocks.  Temporary blocks are ignored
+	 * here.
+	 */
+	if (pgstat_should_track_io_time(io_object))
 	{
 		instr_time	io_time;
 
@@ -131,7 +167,9 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
 
 		if (io_op == IOOP_WRITE || io_op == IOOP_EXTEND)
 		{
-			pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
+			if (io_object != IOOBJECT_WAL)
+				pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
+
 			if (io_object == IOOBJECT_RELATION)
 				INSTR_TIME_ADD(pgBufferUsage.shared_blk_write_time, io_time);
 			else if (io_object == IOOBJECT_TEMP_RELATION)
@@ -139,7 +177,9 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
 		}
 		else if (io_op == IOOP_READ)
 		{
-			pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
+			if (io_object != IOOBJECT_WAL)
+				pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
+
 			if (io_object == IOOBJECT_RELATION)
 				INSTR_TIME_ADD(pgBufferUsage.shared_blk_read_time, io_time);
 			else if (io_object == IOOBJECT_TEMP_RELATION)
@@ -226,6 +266,8 @@ pgstat_get_io_context_name(IOContext io_context)
 			return "bulkread";
 		case IOCONTEXT_BULKWRITE:
 			return "bulkwrite";
+		case IOCONTEXT_INIT:
+			return "init";
 		case IOCONTEXT_NORMAL:
 			return "normal";
 		case IOCONTEXT_VACUUM:
@@ -236,6 +278,30 @@ pgstat_get_io_context_name(IOContext io_context)
 	pg_unreachable();
 }
 
+/*
+ * Get the number of bytes associated to an operation depending on
+ * an IOObject and an IOContext.
+ *
+ * Returns BLCKSZ by default because that is what most of the
+ * IOObject/IOContext combinations use.  The number of bytes for an
+ * IOOBJECT_WAL depends on the context.
+ */
+int
+pgstat_get_io_op_bytes(IOObject io_object, IOContext io_context)
+{
+	if (io_object == IOOBJECT_WAL)
+	{
+		if (io_context == IOCONTEXT_NORMAL)
+			return XLOG_BLCKSZ;
+		else if (io_context == IOCONTEXT_INIT)
+			return wal_segment_size;
+
+		Assert(false);			/* not reachable */
+	}
+
+	return BLCKSZ;
+}
+
 const char *
 pgstat_get_io_object_name(IOObject io_object)
 {
@@ -245,6 +311,8 @@ pgstat_get_io_object_name(IOObject io_object)
 			return "relation";
 		case IOOBJECT_TEMP_RELATION:
 			return "temp relation";
+		case IOOBJECT_WAL:
+			return "wal";
 	}
 
 	elog(ERROR, "unrecognized IOObject value: %d", io_object);
@@ -306,8 +374,8 @@ pgstat_io_snapshot_cb(void)
 * - Syslogger because it is not connected to shared memory
 * - Archiver because most relevant archiving IO is delegated to a
 *   specialized command or module
-* - WAL Receiver, WAL Writer, and WAL Summarizer IO are not tracked in
-*   pg_stat_io for now
+* - WAL Receiver and WAL Summarizer IO are not tracked in pg_stat_io for
+*   now.
 *
 * Function returns true if BackendType participates in the cumulative stats
 * subsystem for IO and false if it does not.
@@ -328,7 +396,13 @@ pgstat_tracks_io_bktype(BackendType bktype)
 		case B_ARCHIVER:
 		case B_LOGGER:
 		case B_WAL_RECEIVER:
-		case B_WAL_WRITER:
+
+			/*
+			 * B_WAL_RECEIVER can do IOs but it is disabled for now to avoid
+			 * confusion. See comment at the top of the
+			 * pgstat_count_io_op_time() function.
+			 */
+
 		case B_WAL_SUMMARIZER:
 			return false;
 
@@ -341,6 +415,7 @@ pgstat_tracks_io_bktype(BackendType bktype)
 		case B_STANDALONE_BACKEND:
 		case B_STARTUP:
 		case B_WAL_SENDER:
+		case B_WAL_WRITER:
 			return true;
 	}
 
@@ -366,6 +441,15 @@ pgstat_tracks_io_object(BackendType bktype, IOObject io_object,
 	if (!pgstat_tracks_io_bktype(bktype))
 		return false;
 
+	/*
+	 * Currently, IO on IOOBJECT_WAL objects can only occur in the
+	 * IOCONTEXT_NORMAL and IOCONTEXT_INIT IOContexts.
+	 */
+	if (io_object == IOOBJECT_WAL &&
+		(io_context != IOCONTEXT_NORMAL &&
+		 io_context != IOCONTEXT_INIT))
+		return false;
+
 	/*
 	 * Currently, IO on temporary relations can only occur in the
 	 * IOCONTEXT_NORMAL IOContext.
@@ -440,6 +524,13 @@ pgstat_tracks_io_op(BackendType bktype, IOObject io_object,
 		 bktype == B_CHECKPOINTER) && io_op == IOOP_EXTEND)
 		return false;
 
+	/*
+	 * Most BackendTypes don't do reads with IOOBJECT_WAL.
+	 */
+	if (io_object == IOOBJECT_WAL && io_op == IOOP_READ &&
+		!(bktype == B_STANDALONE_BACKEND || bktype == B_STARTUP))
+		return false;
+
 	/*
 	 * Temporary tables are not logged and thus do not require fsync'ing.
 	 * Writeback is not requested for temporary tables.
@@ -464,6 +555,22 @@ pgstat_tracks_io_op(BackendType bktype, IOObject io_object,
 	if (!strategy_io_context && io_op == IOOP_REUSE)
 		return false;
 
+	/*
+	 * An IOCONTEXT_INIT done for an IOOBJECT_WAL io_object does writes and
+	 * syncs.
+	 */
+	if (io_object == IOOBJECT_WAL && io_context == IOCONTEXT_INIT &&
+		!(io_op == IOOP_WRITE || io_op == IOOP_FSYNC))
+		return false;
+
+	/*
+	 * An IOCONTEXT_NORMAL done for an IOOBJECT_WAL io_object does writes,
+	 * reads and syncs.
+	 */
+	if (io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL &&
+		!(io_op == IOOP_WRITE || io_op == IOOP_READ || io_op == IOOP_FSYNC))
+		return false;
+
 	/*
 	 * IOOP_FSYNC IOOps done by a backend using a BufferAccessStrategy are
 	 * counted in the IOCONTEXT_NORMAL IOContext. See comment in
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index e65cbf41e9f..a77c2ce0622 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1395,6 +1395,7 @@ pg_stat_get_io(PG_FUNCTION_ARGS)
 			for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
 			{
 				const char *context_name = pgstat_get_io_context_name(io_context);
+				int			op_bytes;
 
 				Datum		values[IO_NUM_COLUMNS] = {0};
 				bool		nulls[IO_NUM_COLUMNS] = {0};
@@ -1413,12 +1414,11 @@ pg_stat_get_io(PG_FUNCTION_ARGS)
 				values[IO_COL_RESET_TIME] = TimestampTzGetDatum(reset_time);
 
 				/*
-				 * Hard-code this to the value of BLCKSZ for now. Future
-				 * values could include XLOG_BLCKSZ, once WAL IO is tracked,
-				 * and constant multipliers, once non-block-oriented IO (e.g.
-				 * temporary file IO) is tracked.
+				 * op_bytes can change according to IOObject and IOContext.
+				 * Get the correct op_bytes.
 				 */
-				values[IO_COL_CONVERSION] = Int64GetDatum(BLCKSZ);
+				op_bytes = pgstat_get_io_op_bytes(io_obj, io_context);
+				values[IO_COL_CONVERSION] = Int64GetDatum(op_bytes);
 
 				for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
 				{
@@ -1470,7 +1470,7 @@ pg_stat_get_io(PG_FUNCTION_ARGS)
 Datum
 pg_stat_get_wal(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_COLS	9
+#define PG_STAT_GET_WAL_COLS	7
 	TupleDesc	tupdesc;
 	Datum		values[PG_STAT_GET_WAL_COLS] = {0};
 	bool		nulls[PG_STAT_GET_WAL_COLS] = {0};
@@ -1491,11 +1491,7 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
 					   INT8OID, -1, 0);
 	TupleDescInitEntry(tupdesc, (AttrNumber) 6, "wal_sync",
 					   INT8OID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 7, "wal_write_time",
-					   FLOAT8OID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 8, "wal_sync_time",
-					   FLOAT8OID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 9, "stats_reset",
+	TupleDescInitEntry(tupdesc, (AttrNumber) 7, "stats_reset",
 					   TIMESTAMPTZOID, -1, 0);
 
 	BlessTupleDesc(tupdesc);
@@ -1518,11 +1514,7 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
 	values[4] = Int64GetDatum(wal_stats->wal_write);
 	values[5] = Int64GetDatum(wal_stats->wal_sync);
 
-	/* Convert counters from microsec to millisec for display */
-	values[6] = Float8GetDatum(((double) wal_stats->wal_write_time) / 1000.0);
-	values[7] = Float8GetDatum(((double) wal_stats->wal_sync_time) / 1000.0);
-
-	values[8] = TimestampTzGetDatum(wal_stats->stat_reset_timestamp);
+	values[6] = TimestampTzGetDatum(wal_stats->stat_reset_timestamp);
 
 	/* Returns the record as Datum */
 	PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index f645e8486bf..1b9db4e45af 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2225,16 +2225,23 @@ pg_stat_user_tables| SELECT relid,
     autoanalyze_count
    FROM pg_stat_all_tables
   WHERE ((schemaname <> ALL (ARRAY['pg_catalog'::name, 'information_schema'::name])) AND (schemaname !~ '^pg_toast'::text));
-pg_stat_wal| SELECT wal_records,
-    wal_fpi,
-    wal_bytes,
-    wal_buffers_full,
-    wal_write,
-    wal_sync,
-    wal_write_time,
-    wal_sync_time,
-    stats_reset
-   FROM pg_stat_get_wal() w(wal_records, wal_fpi, wal_bytes, wal_buffers_full, wal_write, wal_sync, wal_write_time, wal_sync_time, stats_reset);
+pg_stat_wal| WITH pgsio_sum_write_fsync_time AS (
+         SELECT sum(pg_stat_io.write_time) AS sum_write_time,
+            sum(pg_stat_io.fsync_time) AS sum_fsync_time
+           FROM pg_stat_io
+          WHERE ((pg_stat_io.context = 'normal'::text) AND (pg_stat_io.object = 'wal'::text))
+        )
+ SELECT w.wal_records,
+    w.wal_fpi,
+    w.wal_bytes,
+    w.wal_buffers_full,
+    w.wal_write,
+    w.wal_sync,
+    p.sum_write_time AS write_time,
+    p.sum_fsync_time AS fsync_time,
+    w.stats_reset
+   FROM (pg_stat_get_wal() w(wal_records, wal_fpi, wal_bytes, wal_buffers_full, wal_write, wal_sync, stats_reset)
+     CROSS JOIN pgsio_sum_write_fsync_time p);
 pg_stat_wal_receiver| SELECT pid,
     status,
     receive_start_lsn,
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index 346e10a3d2b..c566ddbdd42 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -862,6 +862,33 @@ WHERE pg_stat_get_backend_pid(beid) = pg_backend_pid();
  t
 (1 row)
 
+-- Test pg_stat_io for WAL in an init context, that should do writes
+-- and syncs.
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'init' and object = 'wal' \gset io_sum_wal_init_
+SELECT :io_sum_wal_init_writes > 0;
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT current_setting('fsync') = 'off'
+  OR :io_sum_wal_init_fsyncs > 0;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- Test pg_stat_io for WAL in a normal context, that should do reads as well.
+SELECT SUM(reads) > 0
+  FROM pg_stat_io
+  WHERE context = 'normal' and object = 'wal';
+ ?column? 
+----------
+ t
+(1 row)
+
 -----
 -- Test that resetting stats works for reset timestamp
 -----
@@ -1255,6 +1282,7 @@ SELECT pg_stat_get_subscription_stats(NULL);
 -- - extends of relations using shared buffers
 -- - fsyncs done to ensure the durability of data dirtying shared buffers
 -- - shared buffer hits
+-- - WAL writes and fsyncs in IOContext IOCONTEXT_NORMAL
 -- There is no test for blocks evicted from shared buffers, because we cannot
 -- be sure of the state of shared buffers at the point the test is run.
 -- Create a regular table and insert some data to generate IOCONTEXT_NORMAL
@@ -1264,6 +1292,9 @@ SELECT sum(extends) AS io_sum_shared_before_extends
 SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
   FROM pg_stat_io
   WHERE object = 'relation' \gset io_sum_shared_before_
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'normal' and object = 'wal' \gset io_sum_wal_normal_before_
 CREATE TABLE test_io_shared(a int);
 INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
 SELECT pg_stat_force_next_flush();
@@ -1301,6 +1332,28 @@ SELECT current_setting('fsync') = 'off'
  t
 (1 row)
 
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'normal' and object = 'wal' \gset io_sum_wal_normal_after_
+SELECT current_setting('synchronous_commit') = 'on';
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT :io_sum_wal_normal_after_writes > :io_sum_wal_normal_before_writes;
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT current_setting('fsync') = 'off'
+  OR :io_sum_wal_normal_after_fsyncs > :io_sum_wal_normal_before_fsyncs;
+ ?column? 
+----------
+ t
+(1 row)
+
 -- Change the tablespace so that the table is rewritten directly, then SELECT
 -- from it to cause it to be read back into shared buffers.
 SELECT sum(reads) AS io_sum_shared_before_reads
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index e3b4ca96e89..aa6caf7fd7c 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -442,6 +442,20 @@ SELECT (current_schemas(true))[1] = ('pg_temp_' || beid::text) AS match
 FROM pg_stat_get_backend_idset() beid
 WHERE pg_stat_get_backend_pid(beid) = pg_backend_pid();
 
+-- Test pg_stat_io for WAL in an init context, that should do writes
+-- and syncs.
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'init' and object = 'wal' \gset io_sum_wal_init_
+SELECT :io_sum_wal_init_writes > 0;
+SELECT current_setting('fsync') = 'off'
+  OR :io_sum_wal_init_fsyncs > 0;
+
+-- Test pg_stat_io for WAL in a normal context, that should do reads as well.
+SELECT SUM(reads) > 0
+  FROM pg_stat_io
+  WHERE context = 'normal' and object = 'wal';
+
 -----
 -- Test that resetting stats works for reset timestamp
 -----
@@ -601,6 +615,7 @@ SELECT pg_stat_get_subscription_stats(NULL);
 -- - extends of relations using shared buffers
 -- - fsyncs done to ensure the durability of data dirtying shared buffers
 -- - shared buffer hits
+-- - WAL writes and fsyncs in IOContext IOCONTEXT_NORMAL
 
 -- There is no test for blocks evicted from shared buffers, because we cannot
 -- be sure of the state of shared buffers at the point the test is run.
@@ -612,6 +627,9 @@ SELECT sum(extends) AS io_sum_shared_before_extends
 SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
   FROM pg_stat_io
   WHERE object = 'relation' \gset io_sum_shared_before_
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'normal' and object = 'wal' \gset io_sum_wal_normal_before_
 CREATE TABLE test_io_shared(a int);
 INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
 SELECT pg_stat_force_next_flush();
@@ -630,6 +648,13 @@ SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
 SELECT :io_sum_shared_after_writes > :io_sum_shared_before_writes;
 SELECT current_setting('fsync') = 'off'
   OR :io_sum_shared_after_fsyncs > :io_sum_shared_before_fsyncs;
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'normal' and object = 'wal' \gset io_sum_wal_normal_after_
+SELECT current_setting('synchronous_commit') = 'on';
+SELECT :io_sum_wal_normal_after_writes > :io_sum_wal_normal_before_writes;
+SELECT current_setting('fsync') = 'off'
+  OR :io_sum_wal_normal_after_fsyncs > :io_sum_wal_normal_before_fsyncs;
 
 -- Change the tablespace so that the table is rewritten directly, then SELECT
 -- from it to cause it to be read back into shared buffers.
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index b804eb8b5ef..a721f7b3459 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2499,9 +2499,10 @@ description | Waiting for a newly initialized WAL file to reach durable storage
   </para>
 
   <para>
-   Currently, I/O on relations (e.g. tables, indexes) is tracked. However,
-   relation I/O which bypasses shared buffers (e.g. when moving a table from one
-   tablespace to another) is currently not tracked.
+   Currently, I/O on relations (e.g. tables, indexes) and WAL activity are
+   tracked. However, relation I/O which bypasses shared buffers
+   (e.g. when moving a table from one tablespace to another) is currently
+   not tracked.
   </para>
 
   <table id="pg-stat-io-view" xreflabel="pg_stat_io">
@@ -2554,6 +2555,11 @@ description | Waiting for a newly initialized WAL file to reach durable storage
           <literal>temp relation</literal>: Temporary relations.
          </para>
         </listitem>
+        <listitem>
+         <para>
+          <literal>wal</literal>: Write Ahead Logs.
+         </para>
+        </listitem>
        </itemizedlist>
        </para>
       </entry>
@@ -2578,6 +2584,13 @@ description | Waiting for a newly initialized WAL file to reach durable storage
           <literal>normal</literal>.
          </para>
         </listitem>
+        <listitem>
+         <para>
+          <literal>init</literal>: I/O operations performed while creating the
+          WAL segments are tracked in <varname>context</varname>
+          <literal>init</literal>.
+         </para>
+        </listitem>
         <listitem>
          <para>
           <literal>vacuum</literal>: I/O operations performed outside of shared
@@ -4767,7 +4780,9 @@ description | Waiting for a newly initialized WAL file to reach durable storage
         <listitem>
          <para>
           <literal>io</literal>: Reset all the counters shown in the
-          <structname>pg_stat_io</structname> view.
+          <structname>pg_stat_io</structname> view. This resets the timing
+          counter stored in the <structname>pg_stat_wal</structname> view
+          too.
          </para>
         </listitem>
         <listitem>
@@ -4784,8 +4799,10 @@ description | Waiting for a newly initialized WAL file to reach durable storage
         </listitem>
         <listitem>
          <para>
-          <literal>wal</literal>: Reset all the counters shown in the
-          <structname>pg_stat_wal</structname> view.
+          <literal>wal</literal>: Reset all the counters except timings shown
+          in the <structname>pg_stat_wal</structname> view. The timing
+          counters can be reset by calling pg_stat_reset_shared set to
+          <literal>io</literal>.
          </para>
         </listitem>
         <listitem>
-- 
2.43.0

#35

Michael Paquier

michael@paquier.xyz

about 2 years ago

In reply to: Nazir Bilal Yavuz (#34)

Re: Show WAL write and fsync stats in pg_stat_io

On Wed, Jan 03, 2024 at 04:10:58PM +0300, Nazir Bilal Yavuz wrote:

On Sun, 31 Dec 2023 at 03:58, Michael Paquier <michael@paquier.xyz> wrote:

Apologies if my previous wording sounded confusing. The idea I had in
mind was to keep op_bytes in pg_stat_io, and extend it so that a value
of NULL (or 0, or -1) means that "writes", "extends" and "reads" are
counted as a number of bytes.

Oh, I understand it now. Yes, that makes sense.
I thought removing op_bytes completely ( as you said "This patch
extends it with two more operation sizes, and there are even cases
where it may be a variable" ) from pg_stat_io view then adding
something like {read | write | extend}_bytes and {read | write |
extend}_calls could be better, so that we don't lose any information.

But then you'd lose the possibility to analyze correlations between
the size and the number of the operations, which is something that
matters for more complex I/O scenarios. This does not need to be
tackled in this patch, which is useful on its own, though I am really
wondering if this is required for the recent work done by Thomas.
Perhaps Andres, Thomas or Melanie could comment on that?

Yeah, a limitation like that may be acceptable for now. Tracking the
WAL writer and WAL sender activities can be relevant in a lot of cases
even if we don't have the full picture for the WAL receiver yet.

I added that and disabled B_WAL_RECEIVER backend with comments
explaining why. v8 is attached.

I can see that's what you have been adding here, which should be OK:

-    if (track_io_timing)
+    /*
+     * B_WAL_RECEIVER backend does IOOBJECT_WAL IOObject & IOOP_READ IOOp IOs
+     * but these IOs are not countable for now because IOOP_READ IOs' op_bytes
+     * (number of bytes per unit of I/O) might not be the same all the time.
+     * The current implementation requires that the op_bytes must be the same
+     * for the same IOObject, IOContext and IOOp. To avoid confusion, the
+     * B_WAL_RECEIVER backend & IOOBJECT_WAL IOObject IOs are disabled for
+     * now.
+     */
+    if (MyBackendType == B_WAL_RECEIVER && io_object == IOOBJECT_WAL)
+        return;

This could be worded better, but that's one of these nits from me I
usually tweak when committing stuff.

+/*
+ * Decide if IO timings need to be tracked.  Timings associated to
+ * IOOBJECT_WAL objects are tracked if track_wal_io_timing is enabled,
+ * else rely on track_io_timing.
+ */
+static bool
+pgstat_should_track_io_time(IOObject io_object)
+{
+    if (io_object == IOOBJECT_WAL)
+        return track_wal_io_timing;
+
+    return track_io_timing;
+}

One thing I was also considering is whether eliminating this routine
would make pgstat_count_io_op_time() more readable as a result, but I
cannot get to that.

         if (io_op == IOOP_WRITE || io_op == IOOP_EXTEND)
         {
-            pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
+            if (io_object != IOOBJECT_WAL)
+                pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
+
             if (io_object == IOOBJECT_RELATION)
                 INSTR_TIME_ADD(pgBufferUsage.shared_blk_write_time, io_time);
             else if (io_object == IOOBJECT_TEMP_RELATION)
@@ -139,7 +177,9 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
         }
         else if (io_op == IOOP_READ)
         {
-            pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
+            if (io_object != IOOBJECT_WAL)
+                pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
+
             if (io_object == IOOBJECT_RELATION)
                 INSTR_TIME_ADD(pgBufferUsage.shared_blk_read_time, io_time);
             else if (io_object == IOOBJECT_TEMP_RELATION)

A second thing is if this would be better with more switch/cases, say:
switch (io_op)
{
    case IOOP_EXTEND:
    case IOOP_WRITE:
        switch (io_object)
        {
            case WAL:
                /* do nothing */
                break;
            case RELATION:
            case TEMP:
                .. blah ..
        }
        break;
    case IOOP_READ:
        switch (io_object)
        {
            .. blah ..
        }
        break;
}

Or just this one to make it clear that nothing happens for WAL
objects:
switch (io_object)
{
    case WAL:
        /* do nothing */
        break;
    case RELATION:
        switch (io_op)
        {
            case IOOP_EXTEND:
            case IOOP_WRITE:
                .. blah ..
            case IOOP_READ:
                .. blah ..
        }
        break;
    case TEMP:
        /* same switch as RELATION */
        break;
}

This duplicates things a bit, but at least in the second case it's
clear which counters are updated when I/O timings are tracked. It's
OK by me if people don't like this suggestion, but that would avoid
bugs like the one I found upthread.
--
Michael

#36

Nazir Bilal Yavuz

byavuz81@gmail.com

about 2 years ago

In reply to: Michael Paquier (#35)

Re: Show WAL write and fsync stats in pg_stat_io

Hi,

On Wed, 10 Jan 2024 at 08:25, Michael Paquier <michael@paquier.xyz> wrote:

On Wed, Jan 03, 2024 at 04:10:58PM +0300, Nazir Bilal Yavuz wrote:

I thought removing op_bytes completely ( as you said "This patch
extends it with two more operation sizes, and there are even cases
where it may be a variable" ) from pg_stat_io view then adding
something like {read | write | extend}_bytes and {read | write |
extend}_calls could be better, so that we don't lose any information.

But then you'd lose the possibility to analyze correlations between
the size and the number of the operations, which is something that
matters for more complex I/O scenarios. This does not need to be
tackled in this patch, which is useful on its own, though I am really
wondering if this is required for the recent work done by Thomas.
Perhaps Andres, Thomas or Melanie could comment on that?

Yes, you are right.

Yeah, a limitation like that may be acceptable for now. Tracking the
WAL writer and WAL sender activities can be relevant in a lot of cases
even if we don't have the full picture for the WAL receiver yet.

I added that and disabled B_WAL_RECEIVER backend with comments
explaining why. v8 is attached.

I can see that's what you have been adding here, which should be OK:
-    if (track_io_timing)
+    /*
+     * B_WAL_RECEIVER backend does IOOBJECT_WAL IOObject & IOOP_READ IOOp IOs
+     * but these IOs are not countable for now because IOOP_READ IOs' op_bytes
+     * (number of bytes per unit of I/O) might not be the same all the time.
+     * The current implementation requires that the op_bytes must be the same
+     * for the same IOObject, IOContext and IOOp. To avoid confusion, the
+     * B_WAL_RECEIVER backend & IOOBJECT_WAL IOObject IOs are disabled for
+     * now.
+     */
+    if (MyBackendType == B_WAL_RECEIVER && io_object == IOOBJECT_WAL)
+        return;
This could be worded better, but that's one of these nits from me I
usually tweak when committing stuff.

Thanks for doing that! Do you have any specific comments that can help
improve it?

+/*
+ * Decide if IO timings need to be tracked.  Timings associated to
+ * IOOBJECT_WAL objects are tracked if track_wal_io_timing is enabled,
+ * else rely on track_io_timing.
+ */
+static bool
+pgstat_should_track_io_time(IOObject io_object)
+{
+    if (io_object == IOOBJECT_WAL)
+        return track_wal_io_timing;
+
+    return track_io_timing;
+}

One thing I was also considering is whether eliminating this routine
would make pgstat_count_io_op_time() more readable as a result, but I
cannot get to that.

I could not think of a way to eliminate the pgstat_should_track_io_time()
routine without causing performance regressions. What do you think about
moving the body of the 'if (pgstat_should_track_io_time(io_object))' check
into another function and calling that function from
pgstat_count_io_op_time()? This does not change any behavior, but IMO it
improves readability.

         if (io_op == IOOP_WRITE || io_op == IOOP_EXTEND)
         {
-            pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
+            if (io_object != IOOBJECT_WAL)
+                pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
+
             if (io_object == IOOBJECT_RELATION)
                 INSTR_TIME_ADD(pgBufferUsage.shared_blk_write_time, io_time);
             else if (io_object == IOOBJECT_TEMP_RELATION)
@@ -139,7 +177,9 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
         }
         else if (io_op == IOOP_READ)
         {
-            pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
+            if (io_object != IOOBJECT_WAL)
+                pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
+
             if (io_object == IOOBJECT_RELATION)
                 INSTR_TIME_ADD(pgBufferUsage.shared_blk_read_time, io_time);
             else if (io_object == IOOBJECT_TEMP_RELATION)
A second thing is if this would be better with more switch/cases, say:
switch (io_op)
{
    case IOOP_EXTEND:
    case IOOP_WRITE:
        switch (io_object)
        {
            case WAL:
                /* do nothing */
                break;
            case RELATION:
            case TEMP:
                .. blah ..
        }
        break;
    case IOOP_READ:
        switch (io_object)
        {
            .. blah ..
        }
        break;
}

Or just this one to make it clear that nothing happens for WAL
objects:
switch (io_object)
{
    case WAL:
        /* do nothing */
        break;
    case RELATION:
        switch (io_op)
        {
            case IOOP_EXTEND:
            case IOOP_WRITE:
                .. blah ..
            case IOOP_READ:
                .. blah ..
        }
        break;
    case TEMP:
        /* same switch as RELATION */
        break;
}

This duplicates things a bit, but at least in the second case it's
clear which counters are updated when I/O timings are tracked. It's
OK by me if people don't like this suggestion, but that would avoid
bugs like the one I found upthread.

I am more inclined towards the second one because it is more likely
that a new io_object will be introduced rather than a new io_op. So, I
think the second one is a bit more future proof.
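
For reference, the object-first layout preferred here can be written out as
a small self-contained sketch. This is only an illustration under stated
assumptions, not the patch's code: the Demo* enums, the DemoBufferUsage
struct and both function names are hypothetical stand-ins for IOObject,
IOOp and pgBufferUsage.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for PostgreSQL's IOObject and IOOp enums. */
typedef enum
{
    DEMO_IOOBJECT_RELATION,
    DEMO_IOOBJECT_TEMP_RELATION,
    DEMO_IOOBJECT_WAL
} DemoIOObject;

typedef enum
{
    DEMO_IOOP_EXTEND,
    DEMO_IOOP_READ,
    DEMO_IOOP_WRITE,
    DEMO_IOOP_FSYNC
} DemoIOOp;

/* Hypothetical per-backend timing counters, in microseconds. */
typedef struct
{
    uint64_t shared_read_us;
    uint64_t shared_write_us;
    uint64_t local_read_us;
    uint64_t local_write_us;
} DemoBufferUsage;

/* Inner switch shared by the RELATION and TEMP_RELATION cases. */
void
demo_add_rw_time(uint64_t *read_us, uint64_t *write_us,
                 DemoIOOp op, uint64_t us)
{
    switch (op)
    {
        case DEMO_IOOP_EXTEND:
        case DEMO_IOOP_WRITE:
            *write_us += us;
            break;
        case DEMO_IOOP_READ:
            *read_us += us;
            break;
        case DEMO_IOOP_FSYNC:
            /* not part of the buffer read/write timings */
            break;
    }
}

/* Outer switch on the object, so the WAL case is visibly a no-op. */
void
demo_count_io_time(DemoBufferUsage *usage, DemoIOObject obj,
                   DemoIOOp op, uint64_t us)
{
    assert(usage != NULL);

    switch (obj)
    {
        case DEMO_IOOBJECT_WAL:
            /* do nothing */
            break;
        case DEMO_IOOBJECT_RELATION:
            demo_add_rw_time(&usage->shared_read_us,
                             &usage->shared_write_us, op, us);
            break;
        case DEMO_IOOBJECT_TEMP_RELATION:
            demo_add_rw_time(&usage->local_read_us,
                             &usage->local_write_us, op, us);
            break;
    }
}
```

With this shape, adding a new object type means adding one case to the outer
switch, and most compilers can be asked to warn when an enum value is not
handled there.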

--
Regards,
Nazir Bilal Yavuz
Microsoft

#37

Melanie Plageman

melanieplageman@gmail.com

about 2 years ago

In reply to: Nazir Bilal Yavuz (#34)

Re: Show WAL write and fsync stats in pg_stat_io

I have code review feedback as well, but I've saved that for my next email.

On Wed, Jan 3, 2024 at 8:11 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:

On Sun, 31 Dec 2023 at 03:58, Michael Paquier <michael@paquier.xyz> wrote:

On Tue, Dec 26, 2023 at 03:35:52PM +0300, Nazir Bilal Yavuz wrote:

On Tue, 26 Dec 2023 at 13:10, Michael Paquier <michael@paquier.xyz> wrote:

I am not sure while the whole point of the exercise is to have all the
I/O related data in a single view. Something that I've also found a
bit disturbing yesterday while looking at your patch is the fact that
the operation size is guessed from the context and object type when
querying the view because now everything is tied to BLCKSZ. This
patch extends it with two more operation sizes, and there are even
cases where it may be a variable. Could it be a better option to
extend pgstat_count_io_op_time() so as callers can themselves give the
size of the operation?

Do you mean removing the op_bytes column and tracking the number of
bytes in reads, writes, and extends? If so, that makes sense to me but
I don't want to remove the number of operations; I believe that has a
value too. We can extend the pgstat_count_io_op_time() so it can both
track the number of bytes and the number of operations.

Apologies if my previous wording sounded confusing. The idea I had in
mind was to keep op_bytes in pg_stat_io, and extend it so that a value
of NULL (or 0, or -1) means that "writes", "extends" and "reads" are
counted as a number of bytes.

Oh, I understand it now. Yes, that makes sense.
I thought removing op_bytes completely ( as you said "This patch
extends it with two more operation sizes, and there are even cases
where it may be a variable" ) from pg_stat_io view then adding
something like {read | write | extend}_bytes and {read | write |
extend}_calls could be better, so that we don't lose any information.

Forgive me as I catch up on this thread.

Upthread, Michael says:

I find the use of 1 in this context a bit confusing, because when
referring to a counter at N, then it can be understood as doing N
times an operation,

I didn't understand this argument, so I'm not sure if I agree or
disagree with it.

I think these are the three proposals for handling WAL reads:

1) setting op_bytes to 1 and the number of reads is the number of bytes
2) setting op_bytes to XLOG_BLCKSZ and the number of reads is the
number of calls to pg_pread() or similar
3) setting op_bytes to NULL and the number of reads is the number of
calls to pg_pread() or similar

Looking at the patch, I think it is still doing 2.

It would be good to list all our options with pros and cons (if only
because they are a bit spread throughout the thread now).

For an unpopular idea: we could add separate [IOOp]_bytes columns for
all those IOOps for which it would be relevant. It kind of stinks but
it would give us the freedom to document exactly what a single IOOp
means for each combination of BackendType, IOContext, IOObject, and
IOOp (as relevant) and still have an accurate number in the *bytes
columns. Everyone will probably hate us if we do that, though.
Especially because having bytes for the existing IOObjects is an
existing feature.

A separate question: suppose [1] goes in (to read WAL from WAL buffers
directly). Now, WAL reads are not from permanent storage anymore. Are
we only tracking permanent storage I/O in pg_stat_io? I also had this
question for some of the WAL receiver functions. Should we track any
I/O other than permanent storage I/O? Or did I miss this being
addressed upthread?

Also, it is not directly related to this patch but vectored IO [1] is
coming soon; so the number of operations could be wrong since vectored
IO could merge a couple of operations.

Hmm. I have not checked this patch series so I cannot say for sure,
but we'd likely just want to track the number of bytes if a single
operation has a non-equal size rather than registering in pg_stat_io N
rows with different op_bytes, no?

Yes, that is correct.

I do not like the idea of having basically GROUP BY op_bytes in the
view (if that is the suggestion).

In terms of what I/O we should track in a streaming/asynchronous
world, the options would be:

1) track read/write syscalls
2) track blocks of BLCKSZ submitted to the kernel
3) track bytes submitted to the kernel
4) track merged I/Os (after doing any merging in the application)

I think the debate was largely between 2 and 4. There was some
disagreement, but I think we landed on 2 because there is merging that
can happen at many levels in the storage stack (even the storage
controller). Distinguishing between whether or not Postgres submitted
2 32k I/Os or 8 8k I/Os could be useful while you are developing AIO,
but I think it might be confusing for the Postgres user trying to
determine why their query is slow. It probably makes the most sense to
still track in block size.

No matter what solution we pick, you should get a correct number if
you multiply op_bytes by an IOOp (assuming nothing is NULL). Or,
rather, there should be some way of getting an accurate number in
bytes of the amount of a particular kind of I/O that has been done.

- Melanie

#38

Michael Paquier

michael@paquier.xyz

about 2 years ago

In reply to: Melanie Plageman (#37)

Re: Show WAL write and fsync stats in pg_stat_io

On Wed, Jan 10, 2024 at 07:24:50PM -0500, Melanie Plageman wrote:

I have code review feedback as well, but I've saved that for my next email.

Ah, cool.

On Wed, Jan 3, 2024 at 8:11 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:

On Sun, 31 Dec 2023 at 03:58, Michael Paquier <michael@paquier.xyz> wrote:
Oh, I understand it now. Yes, that makes sense.
I thought removing op_bytes completely ( as you said "This patch
extends it with two more operation sizes, and there are even cases
where it may be a variable" ) from pg_stat_io view then adding
something like {read | write | extend}_bytes and {read | write |
extend}_calls could be better, so that we don't lose any information.

Upthread, Michael says:

I find the use of 1 in this context a bit confusing, because when
referring to a counter at N, then it can be understood as doing N
times an operation,

I didn't understand this argument, so I'm not sure if I agree or
disagree with it.

Nazir has mentioned upthread one thing: what should we do for the case
where a combination of (io_object,io_context) does I/O with a
*variable* op_bytes, because that may be the case for the WAL
receiver? For this case, he has mentioned that we should set op_bytes
to 1, but that's something I find confusing because it would mean that
we are doing read, writes or extends 1 byte at a time. My suggestion
would be to use op_bytes = -1 or NULL for the variable case instead,
with reads, writes and extends referring to a number of bytes rather
than a number of operations.

I think these are the three proposals for handling WAL reads:

1) setting op_bytes to 1 and the number of reads is the number of bytes
2) setting op_bytes to XLOG_BLCKSZ and the number of reads is the
number of calls to pg_pread() or similar
3) setting op_bytes to NULL and the number of reads is the number of
calls to pg_pread() or similar

3) could be a number of bytes, actually.

Looking at the patch, I think it is still doing 2.

The patch disables stats for the WAL receiver, while the startup
process reads WAL with XLOG_BLCKSZ, so yeah that's 2) with a trick to
discard the variable case.

For an unpopular idea: we could add separate [IOOp]_bytes columns for
all those IOOps for which it would be relevant. It kind of stinks but
it would give us the freedom to document exactly what a single IOOp
means for each combination of BackendType, IOContext, IOObject, and
IOOp (as relevant) and still have an accurate number in the *bytes
columns. Everyone will probably hate us if we do that, though.
Especially because having bytes for the existing IOObjects is an
existing feature.

An issue I have with this one is that having multiple tuples for
each (object,context) if they have multiple op_bytes leads to
potentially a lot of bloat in the view. That would be up to 8k extra
tuples just for the sake of op_bytes' variability.

A separate question: suppose [1] goes in (to read WAL from WAL buffers
directly). Now, WAL reads are not from permanent storage anymore. Are
we only tracking permanent storage I/O in pg_stat_io? I also had this
question for some of the WAL receiver functions. Should we track any
I/O other than permanent storage I/O? Or did I miss this being
addressed upthread?

That's a good point. I guess that this should just be a different
IOOp? That's not an IOOP_READ. An IOOP_HIT is also different.

In terms of what I/O we should track in a streaming/asynchronous
world, the options would be:

1) track read/write syscalls
2) track blocks of BLCKSZ submitted to the kernel
3) track bytes submitted to the kernel
4) track merged I/Os (after doing any merging in the application)

I think the debate was largely between 2 and 4. There was some
disagreement, but I think we landed on 2 because there is merging that
can happen at many levels in the storage stack (even the storage
controller). Distinguishing between whether or not Postgres submitted
2 32k I/Os or 8 8k I/Os could be useful while you are developing AIO,
but I think it might be confusing for the Postgres user trying to
determine why their query is slow. It probably makes the most sense to
still track in block size.

No matter what solution we pick, you should get a correct number if
you multiply op_bytes by an IOOp (assuming nothing is NULL). Or,
rather, there should be some way of getting an accurate number in
bytes of the amount of a particular kind of I/O that has been done.

Yeah, coming back to op_bytes = -1/NULL as a tweak to mean that reads,
writes or extends are counted as bytes, because we don't have a fixed
operation size for some (object,context) cases.
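
To make the op_bytes = -1 idea concrete, here is a minimal sketch of how a
counter could behave under that convention. It is only an assumption about
one possible shape, not code from the patch: DEMO_OP_BYTES_VARIABLE,
DemoIoCounter, demo_count_io() and demo_total_bytes() are all hypothetical
names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical marker: the operation size is not fixed. */
#define DEMO_OP_BYTES_VARIABLE (-1)

/* Hypothetical counter for one (object, context, op) combination. */
typedef struct
{
    int64_t op_bytes;   /* fixed size per operation, or the marker */
    int64_t count;      /* operations, or bytes when size is variable */
} DemoIoCounter;

/* Record one I/O of nbytes bytes. */
void
demo_count_io(DemoIoCounter *c, int64_t nbytes)
{
    assert(c != NULL && nbytes >= 0);

    if (c->op_bytes == DEMO_OP_BYTES_VARIABLE)
        c->count += nbytes;     /* counter is expressed in bytes */
    else
        c->count += 1;          /* counter is a number of operations */
}

/* Byte total, whichever way the counter was kept. */
int64_t
demo_total_bytes(const DemoIoCounter *c)
{
    assert(c != NULL);

    if (c->op_bytes == DEMO_OP_BYTES_VARIABLE)
        return c->count;

    return c->op_bytes * c->count;
}
```

Either way a byte total can be recovered, but in the variable case the
number of individual operations is no longer available, which is the loss
of information raised earlier in the thread.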
--
Michael

#39

Nazir Bilal Yavuz

byavuz81@gmail.com

about 2 years ago

In reply to: Michael Paquier (#38)

Re: Show WAL write and fsync stats in pg_stat_io

Hi,

On Thu, 11 Jan 2024 at 08:01, Michael Paquier <michael@paquier.xyz> wrote:

On Wed, Jan 10, 2024 at 07:24:50PM -0500, Melanie Plageman wrote:

I have code review feedback as well, but I've saved that for my next email.

Ah, cool.

On Wed, Jan 3, 2024 at 8:11 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:

On Sun, 31 Dec 2023 at 03:58, Michael Paquier <michael@paquier.xyz> wrote:
Oh, I understand it now. Yes, that makes sense.
I thought removing op_bytes completely ( as you said "This patch
extends it with two more operation sizes, and there are even cases
where it may be a variable" ) from pg_stat_io view then adding
something like {read | write | extend}_bytes and {read | write |
extend}_calls could be better, so that we don't lose any information.

Upthread, Michael says:

I find the use of 1 in this context a bit confusing, because when
referring to a counter at N, then it can be understood as doing N
times an operation,

I didn't understand this argument, so I'm not sure if I agree or
disagree with it.

Nazir has mentioned upthread one thing: what should we do for the case
where a combination of (io_object,io_context) does I/O with a
*variable* op_bytes, because that may be the case for the WAL
receiver? For this case, he has mentioned that we should set op_bytes
to 1, but that's something I find confusing because it would mean that
we are doing read, writes or extends 1 byte at a time. My suggestion
would be to use op_bytes = -1 or NULL for the variable case instead,
with reads, writes and extends referring to a number of bytes rather
than a number of operations.

I agree but we can't do this only for the *variable* cases since
B_WAL_RECEIVER and other backends use the same
pgstat_count_io_op_time(IOObject, IOContext, ...) call. What I mean
is, if two backends use the same pgstat_count_io_op_time() function
call in the code; they must count the same thing (number of calls,
bytes, etc.). It could be better to count the number of bytes for all
the IOOBJECT_WAL IOs.

I think these are the three proposals for handling WAL reads:

1) setting op_bytes to 1 and the number of reads is the number of bytes
2) setting op_bytes to XLOG_BLCKSZ and the number of reads is the
number of calls to pg_pread() or similar
3) setting op_bytes to NULL and the number of reads is the number of
calls to pg_pread() or similar

3) could be a number of bytes, actually.

One important point is that we can't change only reads: if we decide
to count the number of bytes for reads, then writes and extends should
be counted as a number of bytes as well.

Looking at the patch, I think it is still doing 2.

The patch disables stats for the WAL receiver, while the startup
process reads WAL with XLOG_BLCKSZ, so yeah that's 2) with a trick to
discard the variable case.

For an unpopular idea: we could add separate [IOOp]_bytes columns for
all those IOOps for which it would be relevant. It kind of stinks but
it would give us the freedom to document exactly what a single IOOp
means for each combination of BackendType, IOContext, IOObject, and
IOOp (as relevant) and still have an accurate number in the *bytes
columns. Everyone will probably hate us if we do that, though.
Especially because having bytes for the existing IOObjects is an
existing feature.

An issue I have with this one is that having multiple tuples for
each (object,context) if they have multiple op_bytes leads to
potentially a lot of bloat in the view. That would be up to 8k extra
tuples just for the sake of op_bytes' variability.

Yes, that doesn't seem applicable to me.

A separate question: suppose [1] goes in (to read WAL from WAL buffers
directly). Now, WAL reads are not from permanent storage anymore. Are
we only tracking permanent storage I/O in pg_stat_io? I also had this
question for some of the WAL receiver functions. Should we track any
I/O other than permanent storage I/O? Or did I miss this being
addressed upthread?

That's a good point. I guess that this should just be a different
IOOp? That's not an IOOP_READ. An IOOP_HIT is also different.

I think a different IOContext, rather than a different IOOp, would suit
this better.

In terms of what I/O we should track in a streaming/asynchronous
world, the options would be:

1) track read/write syscalls
2) track blocks of BLCKSZ submitted to the kernel
3) track bytes submitted to the kernel
4) track merged I/Os (after doing any merging in the application)

I think the debate was largely between 2 and 4. There was some
disagreement, but I think we landed on 2 because there is merging that
can happen at many levels in the storage stack (even the storage
controller). Distinguishing between whether or not Postgres submitted
2 32k I/Os or 8 8k I/Os could be useful while you are developing AIO,
but I think it might be confusing for the Postgres user trying to
determine why their query is slow. It probably makes the most sense to
still track in block size.

No matter what solution we pick, you should get a correct number if
you multiply op_bytes by an IOOp (assuming nothing is NULL). Or,
rather, there should be some way of getting an accurate number in
bytes of the amount of a particular kind of I/O that has been done.

Yeah, coming back to op_bytes = -1/NULL as a tweak to mean that reads,
writes or extends are counted as bytes, because we don't have a fixed
operation size for some (object,context) cases.

Can't we use 2 and 3 together? For example, use 3 for the IOOBJECT_WAL
IOs and 2 for the other IOs.

--
Regards,
Nazir Bilal Yavuz
Microsoft

#40

Melanie Plageman

melanieplageman@gmail.com

about 2 years ago

In reply to: Nazir Bilal Yavuz (#39)

Re: Show WAL write and fsync stats in pg_stat_io

On Thu, Jan 11, 2024 at 6:19 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:

On Thu, 11 Jan 2024 at 08:01, Michael Paquier <michael@paquier.xyz> wrote:

On Wed, Jan 10, 2024 at 07:24:50PM -0500, Melanie Plageman wrote:

On Wed, Jan 3, 2024 at 8:11 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:

On Sun, 31 Dec 2023 at 03:58, Michael Paquier <michael@paquier.xyz> wrote:
Oh, I understand it now. Yes, that makes sense.
I thought removing op_bytes completely ( as you said "This patch
extends it with two more operation sizes, and there are even cases
where it may be a variable" ) from pg_stat_io view then adding
something like {read | write | extend}_bytes and {read | write |
extend}_calls could be better, so that we don't lose any information.

Upthread, Michael says:

I find the use of 1 in this context a bit confusing, because when
referring to a counter at N, then it can be understood as doing N
times an operation,

I didn't understand this argument, so I'm not sure if I agree or
disagree with it.

Nazir has mentioned upthread one thing: what should we do for the case
where a combination of (io_object,io_context) does I/O with a
*variable* op_bytes, because that may be the case for the WAL
receiver? For this case, he has mentioned that we should set op_bytes
to 1, but that's something I find confusing because it would mean that
we are doing read, writes or extends 1 byte at a time. My suggestion
would be to use op_bytes = -1 or NULL for the variable case instead,
with reads, writes and extends referring to a number of bytes rather
than a number of operations.

I agree but we can't do this only for the *variable* cases since
B_WAL_RECEIVER and other backends use the same
pgstat_count_io_op_time(IOObject, IOContext, ...) call. What I mean
is, if two backends use the same pgstat_count_io_op_time() function
call in the code; they must count the same thing (number of calls,
bytes, etc.). It could be better to count the number of bytes for all
the IOOBJECT_WAL IOs.

I'm a bit confused by this. pgstat_count_io_op_time() can check
MyBackendType. In fact, you do this to ban the wal receiver already.
It is true that you would need to count all wal receiver normal
context wal object IOOps in the variable way, but I don't see how
pgstat_count_io_op_time() is the limiting factor as long as the
callsite is always doing either the number of bytes or the number of
calls.
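
As a rough illustration of the point above, a counting helper can pick its unit from the backend type. This is only a sketch: BackendKindSketch and read_increment() are hypothetical names, not part of the patch or of pgstat_count_io_op_time().

```c
#include <stdint.h>

/* Hypothetical stand-ins; the real BackendType enum has many more members. */
typedef enum BackendKindSketch
{
    SKETCH_WAL_RECEIVER,
    SKETCH_OTHER_BACKEND
} BackendKindSketch;

/*
 * Amount to add to the read counter for one read of nbytes: the byte count
 * for the WAL receiver (variable-sized reads), one operation for any other
 * backend (fixed-size reads).
 */
static uint64_t
read_increment(BackendKindSketch kind, uint64_t nbytes)
{
    return (kind == SKETCH_WAL_RECEIVER) ? nbytes : 1;
}
```

The sketch only shows that the unit can be chosen per backend type inside the helper; it says nothing about whether the resulting view is easy to interpret.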

I think these are the three proposals for handling WAL reads:

1) setting op_bytes to 1 and the number of reads is the number of bytes
2) setting op_bytes to XLOG_BLCKSZ and the number of reads is the
number of calls to pg_pread() or similar
3) setting op_bytes to NULL and the number of reads is the number of
calls to pg_pread() or similar

3) could be a number of bytes, actually.

One important point is that we can't change only reads, if we decide
to count the number of bytes for the reads; writes and extends should
be counted as a number of bytes as well.

Yes, that is true.

Looking at the patch, I think it is still doing 2.

The patch disables stats for the WAL receiver, while the startup
process reads WAL with XLOG_BLCKSZ, so yeah that's 2) with a trick to
discard the variable case.

For an unpopular idea: we could add separate [IOOp]_bytes columns for
all those IOOps for which it would be relevant. It kind of stinks but
it would give us the freedom to document exactly what a single IOOp
means for each combination of BackendType, IOContext, IOObject, and
IOOp (as relevant) and still have an accurate number in the *bytes
columns. Everyone will probably hate us if we do that, though.
Especially because having bytes for the existing IOObjects is an
existing feature.

An issue I have with this one is that having multiple tuples for
each (object,context) if they have multiple op_bytes leads to
potentially a lot of bloat in the view. That would be up to 8k extra
tuples just for the sake of op_byte's variability.

Yes, that doesn't seem applicable to me.

My suggestion (again not sure it is a good idea) was actually that we
remove op_bytes and add "write_bytes", "read_bytes", and
"extend_bytes". AFAICT, this would add columns not rows. In this
schema, read bytes for wal receiver could be counted in one way and
writes in another. We could document that, for wal receiver, the reads
are not always done in units of the same size, so the read_bytes /
reads could be thought of as an average size of read.
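
A minimal sketch of that idea, assuming one operation counter and one byte counter per row. IoRowSketch, count_read() and avg_read_size() are hypothetical names for illustration, not the layout used by pgstat:

```c
#include <stdint.h>

/* Hypothetical per-row counters; not the actual pgstat structures. */
typedef struct IoRowSketch
{
    uint64_t reads;         /* number of read operations */
    uint64_t read_bytes;    /* total bytes read by those operations */
} IoRowSketch;

/* Record one read of nbytes, whatever its size. */
static void
count_read(IoRowSketch *row, uint64_t nbytes)
{
    row->reads += 1;
    row->read_bytes += nbytes;
}

/* Average read size in bytes (integer division), or 0 if nothing was read. */
static uint64_t
avg_read_size(const IoRowSketch *row)
{
    return row->reads == 0 ? 0 : row->read_bytes / row->reads;
}
```

With both counters kept, fixed-size and variable-size reads can share a row, and read_bytes / reads degrades gracefully to an average.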

Even if we made a separate view for WAL I/O stats, we would still have
this issue of variable sized I/O vs block sized I/O and would probably
end up solving it with separate columns for the number of bytes and
number of operations.

A separate question: suppose [1] goes in (to read WAL from WAL buffers
directly). Now, WAL reads are not from permanent storage anymore. Are
we only tracking permanent storage I/O in pg_stat_io? I also had this
question for some of the WAL receiver functions. Should we track any
I/O other than permanent storage I/O? Or did I miss this being
addressed upthread?

That's a good point. I guess that this should just be a different
IOOp? That's not an IOOP_READ. An IOOP_HIT is also different.

I think a different IOContext rather than a different IOOp suits this better.

That makes sense to me.

In terms of what I/O we should track in a streaming/asynchronous
world, the options would be:

1) track read/write syscalls
2) track blocks of BLCKSZ submitted to the kernel
3) track bytes submitted to the kernel
4) track merged I/Os (after doing any merging in the application)

I think the debate was largely between 2 and 4. There was some
disagreement, but I think we landed on 2 because there is merging that
can happen at many levels in the storage stack (even the storage
controller). Distinguishing between whether or not Postgres submitted
2 32k I/Os or 8 8k I/Os could be useful while you are developing AIO,
but I think it might be confusing for the Postgres user trying to
determine why their query is slow. It probably makes the most sense to
still track in block size.

No matter what solution we pick, you should get a correct number if
you multiply op_bytes by an IOOp (assuming nothing is NULL). Or,
rather, there should be some way of getting an accurate number in
bytes of the amount of a particular kind of I/O that has been done.

Yeah, coming back to op_bytes = -1/NULL as a tweak to mean that reads,
writes or extends are counted as bytes, because we don't have a fixed
operation size for some (object,context) cases.

Can't we use 2 and 3 together? For example, use 3 for the IOOBJECT_WAL
IOs and 2 for the other IOs.

We can do this. One concern I have is that much of WAL I/O is done in
XLOG_BLCKSZ, so it feels kind of odd for all WAL I/O to appear as if
it is being done in random chunks of bytes. We anticipated other
uniformly non-block-based I/O types where having 1 in op_bytes would
be meaningful. I didn't realize at the time that there would be
variable-sized and block-sized I/O mixed together for the same backend
type, io object, and io context.

- Melanie

#41

Nazir Bilal Yavuz

byavuz81@gmail.com

almost 2 years ago

In reply to: Melanie Plageman (#40)

Re: Show WAL write and fsync stats in pg_stat_io

Hi,

On Thu, 11 Jan 2024 at 17:28, Melanie Plageman
<melanieplageman@gmail.com> wrote:

On Thu, Jan 11, 2024 at 6:19 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:

On Thu, 11 Jan 2024 at 08:01, Michael Paquier <michael@paquier.xyz> wrote:

On Wed, Jan 10, 2024 at 07:24:50PM -0500, Melanie Plageman wrote:

On Wed, Jan 3, 2024 at 8:11 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:

On Sun, 31 Dec 2023 at 03:58, Michael Paquier <michael@paquier.xyz> wrote:
Oh, I understand it now. Yes, that makes sense.
I thought removing op_bytes completely ( as you said "This patch
extends it with two more operation sizes, and there are even cases
where it may be a variable" ) from pg_stat_io view then adding
something like {read | write | extend}_bytes and {read | write |
extend}_calls could be better, so that we don't lose any information.

Upthread, Michael says:

I find the use of 1 in this context a bit confusing, because when
referring to a counter at N, then it can be understood as doing N
times an operation,

I didn't understand this argument, so I'm not sure if I agree or
disagree with it.

Nazir has mentioned upthread one thing: what should we do for the case
where a combination of (io_object,io_context) does I/O with a
*variable* op_bytes, because that may be the case for the WAL
receiver? For this case, he has mentioned that we should set op_bytes
to 1, but that's something I find confusing because it would mean that
we are doing read, writes or extends 1 byte at a time. My suggestion
would be to use op_bytes = -1 or NULL for the variable case instead,
with reads, writes and extends referring to a number of bytes rather
than a number of operations.

I agree but we can't do this only for the *variable* cases since
B_WAL_RECEIVER and other backends use the same
pgstat_count_io_op_time(IOObject, IOContext, ...) call. What I mean
is, if two backends use the same pgstat_count_io_op_time() function
call in the code; they must count the same thing (number of calls,
bytes, etc.). It could be better to count the number of bytes for all
the IOOBJECT_WAL IOs.

I'm a bit confused by this. pgstat_count_io_op_time() can check
MyBackendType. In fact, you do this to ban the wal receiver already.
It is true that you would need to count all wal receiver normal
context wal object IOOps in the variable way, but I don't see how
pgstat_count_io_op_time() is the limiting factor as long as the
callsite is always doing either the number of bytes or the number of
calls.

Apologies for not being clear. Let me try to explain this by giving examples:

Let's assume that there are 3 different pgstat_count_io_op_time()
calls in the code base and they are labeled as 1, 2 and 3.

And let's assume that: WAL receiver uses 1st and 2nd
pgstat_count_io_op_time(), autovacuum uses 2nd and 3rd
pgstat_count_io_op_time() and checkpointer uses 3rd
pgstat_count_io_op_time() to count IOs.

The 1st one is the only pgstat_count_io_op_time() call that must count
the number of bytes because of the variable cases and the others count
the number of calls or blocks.

a) WAL receiver uses both 1st and 2nd => 1st and 2nd
pgstat_count_io_op_time() must count the same thing => 2nd
pgstat_count_io_op_time() must count the number of bytes as well.

b) 2nd pgstat_count_io_op_time() started to count the number of bytes
=> Autovacuum will start to count the number of bytes => 2nd and 3rd
both are used by autovacuum => 3rd pgstat_count_io_op_time() must
count the number of bytes as well.

c) 3rd pgstat_count_io_op_time() started to count the number of bytes
=> Checkpointer will start to count the number of bytes.

The list goes on like this and if we don't have [write | read |
extend]_bytes, this effect will be multiplied.
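
The chain in a), b) and c) can be sketched as a small fixed-point computation. This is only an illustration of the argument: the three call sites, the three backends and the uses[][] table are the made-up example from above, and propagate_bytes() is a hypothetical name.

```c
#include <stdbool.h>

#define N_SITES    3
#define N_BACKENDS 3

/* uses[b][s] is true when backend b reports I/O through call site s. */
static const bool uses[N_BACKENDS][N_SITES] = {
    {true,  true,  false},      /* WAL receiver: 1st and 2nd call sites */
    {false, true,  true},       /* autovacuum:   2nd and 3rd call sites */
    {false, false, true},       /* checkpointer: 3rd call site */
};

/*
 * counts_bytes[s] says whether call site s counts bytes.  If a backend uses
 * any byte-counting site, all of its sites must count bytes too.  Repeat
 * until nothing changes and return the number of byte-counting sites.
 */
static int
propagate_bytes(bool counts_bytes[N_SITES])
{
    bool changed = true;
    int  n = 0;

    while (changed)
    {
        changed = false;
        for (int b = 0; b < N_BACKENDS; b++)
        {
            bool any = false;

            for (int s = 0; s < N_SITES; s++)
                if (uses[b][s] && counts_bytes[s])
                    any = true;
            if (!any)
                continue;
            for (int s = 0; s < N_SITES; s++)
                if (uses[b][s] && !counts_bytes[s])
                {
                    counts_bytes[s] = true;
                    changed = true;
                }
        }
    }
    for (int s = 0; s < N_SITES; s++)
        if (counts_bytes[s])
            n++;
    return n;
}
```

Starting with only the 1st call site counting bytes, the loop ends with all three counting bytes, which is the cascade described above.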

I think these are the three proposals for handling WAL reads:

1) setting op_bytes to 1 and the number of reads is the number of bytes
2) setting op_bytes to XLOG_BLCKSZ and the number of reads is the
number of calls to pg_pread() or similar
3) setting op_bytes to NULL and the number of reads is the number of
calls to pg_pread() or similar

3) could be a number of bytes, actually.

One important point is that we can't change only reads, if we decide
to count the number of bytes for the reads; writes and extends should
be counted as a number of bytes as well.

Yes, that is true.

Looking at the patch, I think it is still doing 2.

The patch disables stats for the WAL receiver, while the startup
process reads WAL with XLOG_BLCKSZ, so yeah that's 2) with a trick to
discard the variable case.

For an unpopular idea: we could add separate [IOOp]_bytes columns for
all those IOOps for which it would be relevant. It kind of stinks but
it would give us the freedom to document exactly what a single IOOp
means for each combination of BackendType, IOContext, IOObject, and
IOOp (as relevant) and still have an accurate number in the *bytes
columns. Everyone will probably hate us if we do that, though.
Especially because having bytes for the existing IOObjects is an
existing feature.

An issue I have with this one is that having multiple tuples for
each (object,context) if they have multiple op_bytes leads to
potentially a lot of bloat in the view. That would be up to 8k extra
tuples just for the sake of op_byte's variability.

Yes, that doesn't seem applicable to me.

My suggestion (again not sure it is a good idea) was actually that we
remove op_bytes and add "write_bytes", "read_bytes", and
"extend_bytes". AFAICT, this would add columns not rows. In this
schema, read bytes for wal receiver could be counted in one way and
writes in another. We could document that, for wal receiver, the reads
are not always done in units of the same size, so the read_bytes /
reads could be thought of as an average size of read.

That looks like one of the best options to me. I suggested something
similar upthread and Michael's answer was:

But then you'd lose the possibility to analyze correlations between
the size and the number of the operations, which is something that
matters for more complex I/O scenarios. This does not need to be
tackled in this patch, which is useful on its own, though I am really
wondering if this is required for the recent work done by Thomas.
Perhaps Andres, Thomas or Melanie could comment on that?

Even if we made a separate view for WAL I/O stats, we would still have
this issue of variable sized I/O vs block sized I/O and would probably
end up solving it with separate columns for the number of bytes and
number of operations.

Yes, I think it is more about flexibility and not changing the already
published pg_stat_io view.

A separate question: suppose [1] goes in (to read WAL from WAL buffers
directly). Now, WAL reads are not from permanent storage anymore. Are
we only tracking permanent storage I/O in pg_stat_io? I also had this
question for some of the WAL receiver functions. Should we track any
I/O other than permanent storage I/O? Or did I miss this being
addressed upthread?

That's a good point. I guess that this should just be a different
IOOp? That's not an IOOP_READ. An IOOP_HIT is also different.

I think a different IOContext rather than a different IOOp suits this better.

That makes sense to me.

In terms of what I/O we should track in a streaming/asynchronous
world, the options would be:

1) track read/write syscalls
2) track blocks of BLCKSZ submitted to the kernel
3) track bytes submitted to the kernel
4) track merged I/Os (after doing any merging in the application)

I think the debate was largely between 2 and 4. There was some
disagreement, but I think we landed on 2 because there is merging that
can happen at many levels in the storage stack (even the storage
controller). Distinguishing between whether or not Postgres submitted
2 32k I/Os or 8 8k I/Os could be useful while you are developing AIO,
but I think it might be confusing for the Postgres user trying to
determine why their query is slow. It probably makes the most sense to
still track in block size.

No matter what solution we pick, you should get a correct number if
you multiply op_bytes by an IOOp (assuming nothing is NULL). Or,
rather, there should be some way of getting an accurate number in
bytes of the amount of a particular kind of I/O that has been done.

Yeah, coming back to op_bytes = -1/NULL as a tweak to mean that reads,
writes or extends are counted as bytes, because we don't have a fixed
operation size for some (object,context) cases.

Can't we use 2 and 3 together? For example, use 3 for the IOOBJECT_WAL
IOs and 2 for the other IOs.

We can do this. One concern I have is that much of WAL I/O is done in
XLOG_BLCKSZ, so it feels kind of odd for all WAL I/O to appear as if
it is being done in random chunks of bytes. We anticipated other
uniformly non-block-based I/O types where having 1 in op_bytes would
be meaningful. I didn't realize at the time that there would be
variable-sized and block-sized I/O mixed together for the same backend
type, io object, and io context.

Correct. What is the lowest level that can use two different options?
I mean, could we use 3 for the WAL receiver, IOOP_READ, IOOBJECT_WAL,
IOCONTEXT_NORMAL and 2 for the rest?

--
Regards,
Nazir Bilal Yavuz
Microsoft

#42

Michael Paquier

michael@paquier.xyz

almost 2 years ago

In reply to: Nazir Bilal Yavuz (#41)

Re: Show WAL write and fsync stats in pg_stat_io

On Fri, Jan 12, 2024 at 04:23:26PM +0300, Nazir Bilal Yavuz wrote:

On Thu, 11 Jan 2024 at 17:28, Melanie Plageman <melanieplageman@gmail.com> wrote:

Even if we made a separate view for WAL I/O stats, we would still have
this issue of variable sized I/O vs block sized I/O and would probably
end up solving it with separate columns for the number of bytes and
number of operations.

Yes, I think it is more about flexibility and not changing the already
published pg_stat_io view.

I don't know. Adding more columns or changing op_bytes with an extra
mode that reflects on what the other columns mean is kind of the same
thing to me: we want pg_stat_io to report more modes so that all I/O can
be evaluated from a single view, but the complication is now that
everything is tied to BLCKSZ.

IMHO, perhaps we'd better put this patch aside until we are absolutely
*sure* of what we want to achieve when it comes to WAL, and I am
afraid that this cannot happen until we're happy with the way we
handle WAL reads *and* writes, including WAL receiver or anything that
has the idea of pulling its own page callback with
XLogReaderAllocate() in the backend. Well, writes should be
relatively "easy" as things happen with XLOG_BLCKSZ, mainly, but
reads are the unknown part.

That also seems furiously related to the work happening with async I/O
or the fact that we may want to have in the view a separate meaning
for cached pages or pages read directly from disk. The worst thing
that we could do is rush something into the tree and then have to deal
with the aftermath in terms of
compatibility, depending on the state of the other I/O related work
when the new view is released. That would not be fun for the users
and any hackers who would have to deal with that (aka mainly me if I
were to commit something), because pg_stat_io could mean something in
version N, then mean something entirely different in version N+1.
--
Michael

#43

Nazir Bilal Yavuz

byavuz81@gmail.com

almost 2 years ago

In reply to: Michael Paquier (#42)

Re: Show WAL write and fsync stats in pg_stat_io

Hi,

On Mon, 15 Jan 2024 at 09:27, Michael Paquier <michael@paquier.xyz> wrote:

On Fri, Jan 12, 2024 at 04:23:26PM +0300, Nazir Bilal Yavuz wrote:

On Thu, 11 Jan 2024 at 17:28, Melanie Plageman <melanieplageman@gmail.com> wrote:

Even if we made a separate view for WAL I/O stats, we would still have
this issue of variable sized I/O vs block sized I/O and would probably
end up solving it with separate columns for the number of bytes and
number of operations.

Yes, I think it is more about flexibility and not changing the already
published pg_stat_io view.

I don't know. Adding more columns or changing op_bytes with an extra
mode that reflects on what the other columns mean is kind of the same
thing to me: we want pg_stat_io to report more modes so that all I/O can
be evaluated from a single view, but the complication is now that
everything is tied to BLCKSZ.

IMHO, perhaps we'd better put this patch aside until we are absolutely
*sure* of what we want to achieve when it comes to WAL, and I am
afraid that this cannot happen until we're happy with the way we
handle WAL reads *and* writes, including WAL receiver or anything that
has the idea of pulling its own page callback with
XLogReaderAllocate() in the backend. Well, writes should be
relatively "easy" as things happen with XLOG_BLCKSZ, mainly, but
reads are the unknown part.

That also seems furiously related to the work happening with async I/O
or the fact that we may want to have in the view a separate meaning
for cached pages or pages read directly from disk. The worst thing
that we could do is rush something into the tree and then have to deal
with the aftermath in terms of
compatibility, depending on the state of the other I/O related work
when the new view is released. That would not be fun for the users
and any hackers who would have to deal with that (aka mainly me if I
were to commit something), because pg_stat_io could mean something in
version N, then mean something entirely different in version N+1.

I agree with your points. While the other I/O related work is
happening, we can discuss what we should do in the variable op_bytes
cases. Also, this is happening only for WAL right now, but if we try to
extend pg_stat_io in the future, that problem may well arise
again. So, it could be good to come up with a general solution, not
only for WAL.

--
Regards,
Nazir Bilal Yavuz
Microsoft

#44

Michael Paquier

michael@paquier.xyz

almost 2 years ago

In reply to: Nazir Bilal Yavuz (#43)

Re: Show WAL write and fsync stats in pg_stat_io

On Wed, Jan 17, 2024 at 03:20:39PM +0300, Nazir Bilal Yavuz wrote:

I agree with your points. While the other I/O related work is
happening we can discuss what we should do in the variable op_byte
cases. Also, this is happening only for WAL right now but if we try to
extend pg_stat_io in the future, that problem possibly will rise
again. So, it could be good to come up with a general solution, not
only for WAL.

Okay, I've marked the patch as RwF for this CF.
--
Michael

#45

Nazir Bilal Yavuz

byavuz81@gmail.com

almost 2 years ago

In reply to: Michael Paquier (#44)

Re: Show WAL write and fsync stats in pg_stat_io

Hi,

On Thu, 18 Jan 2024 at 04:22, Michael Paquier <michael@paquier.xyz> wrote:

On Wed, Jan 17, 2024 at 03:20:39PM +0300, Nazir Bilal Yavuz wrote:

I agree with your points. While the other I/O related work is
happening we can discuss what we should do in the variable op_byte
cases. Also, this is happening only for WAL right now but if we try to
extend pg_stat_io in the future, that problem possibly will rise
again. So, it could be good to come up with a general solution, not
only for WAL.

Okay, I've marked the patch as RwF for this CF.

I wanted to inform you that the 73f0a13266 commit changed all WALRead
calls to read variable bytes; before it, only the WAL receiver was
reading variable bytes.

--
Regards,
Nazir Bilal Yavuz
Microsoft

#46

Nazir Bilal Yavuz

byavuz81@gmail.com

over 1 year ago

In reply to: Nazir Bilal Yavuz (#45)

Re: Show WAL write and fsync stats in pg_stat_io

Hi,

On Mon, 19 Feb 2024 at 10:28, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:

Hi,

On Thu, 18 Jan 2024 at 04:22, Michael Paquier <michael@paquier.xyz> wrote:

On Wed, Jan 17, 2024 at 03:20:39PM +0300, Nazir Bilal Yavuz wrote:

I agree with your points. While the other I/O related work is
happening we can discuss what we should do in the variable op_byte
cases. Also, this is happening only for WAL right now but if we try to
extend pg_stat_io in the future, that problem possibly will rise
again. So, it could be good to come up with a general solution, not
only for WAL.

Okay, I've marked the patch as RwF for this CF.

I wanted to inform you that the 73f0a13266 commit changed all WALRead
calls to read variable bytes, only the WAL receiver was reading
variable bytes before.

I want to start working on this again if possible. I will try to
summarize the current status:

* With the 73f0a13266 commit, the WALRead() function started to read
variable bytes in every case. Before, only the WAL receiver was
reading variable bytes.

* With the 91f2cae7a4 commit, WALReadFromBuffers() is merged. We were
discussing what we have to do when this is merged. It was decided that
WALReadFromBuffers() does not call pgstat_report_wait_start() because
this function does not perform any IO [1]. We may follow the same
logic by not including these in pg_stat_io?

* With the b5a9b18cd0 commit, streaming I/O is merged but AFAIK this
does not block anything related to putting WAL stats in pg_stat_io.

If I am not missing any new changes, the only problem is reading
variable bytes now. We have discussed a couple of solutions:

1- Change op_bytes to something like -1, 0, 1, NULL etc. and document
that this means some variable byte I/O is happening.

I kind of dislike this solution because if *only* the read I/O is
happening in variable bytes, it will look like write and extend I/Os
are happening in variable bytes as well. As a mitigation, it could be
documented that only read I/Os can happen in variable bytes for now.

2- Use op_bytes_[read | write | extend] columns instead of one
op_bytes column, also use the first solution.

This can solve the first solution's weakness but it introduces two
more columns. This is more future proof compared to the first solution
if there is a chance that some variable I/O could happen in write and
extend calls as well in the future.

3- Create a new pg_stat_io_wal view to put WAL I/Os here instead of pg_stat_io.

pg_stat_io could remain untouched and we would have the flexibility to
edit this new view as much as we want. But the original aim of
pg_stat_io is evaluating all I/O from a single view, and adding a new
view breaks this aim.

I hope that I did not miss anything and my explanations are clear.

Any kind of feedback would be appreciated.

[1]: /messages/by-id/CAFiTN-sE7CJn-ZFj+-0Wv6TNytv_fp4n+eCszspxJ3mt77t5ig@mail.gmail.com

--
Regards,
Nazir Bilal Yavuz
Microsoft

#47

Nazir Bilal Yavuz

byavuz81@gmail.com

over 1 year ago

In reply to: Nazir Bilal Yavuz (#46)

Re: Show WAL write and fsync stats in pg_stat_io

Hi,

On Fri, 19 Apr 2024 at 11:01, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:

On Thu, 18 Jan 2024 at 04:22, Michael Paquier <michael@paquier.xyz> wrote:

On Wed, Jan 17, 2024 at 03:20:39PM +0300, Nazir Bilal Yavuz wrote:

I agree with your points. While the other I/O related work is
happening we can discuss what we should do in the variable op_byte
cases. Also, this is happening only for WAL right now but if we try to
extend pg_stat_io in the future, that problem possibly will rise
again. So, it could be good to come up with a general solution, not
only for WAL.

Okay, I've marked the patch as RwF for this CF.

Since the last commitfest entry was returned with feedback, I created
a new commitfest entry: https://commitfest.postgresql.org/48/4950/

--
Regards,
Nazir Bilal Yavuz
Microsoft

#48

Bharath Rupireddy

bharath.rupireddyforpostgres@gmail.com

over 1 year ago

In reply to: Nazir Bilal Yavuz (#46)

Re: Show WAL write and fsync stats in pg_stat_io

On Fri, Apr 19, 2024 at 1:32 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:

I wanted to inform you that the 73f0a13266 commit changed all WALRead
calls to read variable bytes, only the WAL receiver was reading
variable bytes before.

I want to start working on this again if possible. I will try to
summarize the current status:

Thanks for working on this.

* With the 73f0a13266 commit, the WALRead() function started to read
variable bytes in every case. Before, only the WAL receiver was
reading variable bytes.

* With the 91f2cae7a4 commit, WALReadFromBuffers() is merged. We were
discussing what we have to do when this is merged. It is decided that
WALReadFromBuffers() does not call pgstat_report_wait_start() because
this function does not perform any IO [1]. We may follow the same
logic by not including these to pg_stat_io?

Right. WALReadFromBuffers doesn't do any I/O.

Whoever reads WAL from disk (backends, walsenders, recovery process)
using pg_pread (XLogPageRead, WALRead) needs to be tracked in
pg_stat_io or some other view. If it were to be in pg_stat_io,
we may not be able to distinguish WAL read stats at a backend
level (like how many times/bytes a walsender or recovery process or a
backend read WAL from disk), but it can still help understand the
overall impact of WAL read I/O at a cluster level. With this approach, the WAL I/O
stats are divided up - WAL read I/O and write I/O stats are in
pg_stat_io and pg_stat_wal respectively.

This makes me wonder whether we need to add WAL read I/O stats also to
pg_stat_wal. Then, we can also add WALReadFromBuffers stats
hits/misses there. With this approach, pg_stat_wal can be a one-stop
view for all the WAL related stats. If needed, we can join info from
pg_stat_wal to pg_stat_io in system_views.sql so that the I/O stats
are emitted to the end-user via pg_stat_io.

* With the b5a9b18cd0 commit, streaming I/O is merged but AFAIK this
does not block anything related to putting WAL stats in pg_stat_io.

If I am not missing any new changes, the only problem is reading
variable bytes now. We have discussed a couple of solutions:

1- Change op_bytes to something like -1, 0, 1, NULL etc. and document
that this means some variable byte I/O is happening.

I kind of dislike this solution because if the *only* read I/O is
happening in variable bytes, it will look like write and extend I/Os
are happening in variable bytes as well. As a solution, it could be
documented that only read I/Os could happen in variable bytes for now.

Yes, read I/O for relation and WAL can happen in variable bytes. I
think this idea seems reasonable and simple yet useful to know the
cluster-wide read I/O.

However, another point here is how the total number of bytes read is
represented with existing pg_stat_io columns 'reads' and 'op_bytes'.
It is known now with 'reads' * 'op_bytes', but with variable bytes,
how would the read bytes be calculated? Maybe add new columns
read_bytes/write_bytes?
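
To make the gap concrete, here is a small sketch of the 'reads' * 'op_bytes' arithmetic with a sentinel for the variable case. OP_BYTES_VARIABLE, IoViewRowSketch and total_read_bytes() are hypothetical names chosen for illustration; the thread has not settled on -1, NULL or anything else.

```c
#include <stdbool.h>
#include <stdint.h>

#define OP_BYTES_VARIABLE (-1)  /* hypothetical sentinel: no fixed size */

/* Hypothetical row with only the two columns discussed here. */
typedef struct IoViewRowSketch
{
    int64_t op_bytes;           /* fixed operation size, or the sentinel */
    int64_t reads;              /* number of read operations */
} IoViewRowSketch;

/*
 * Store reads * op_bytes in *total and return true for fixed-size rows.
 * Return false and leave *total untouched when op_bytes is the sentinel,
 * because the byte total cannot be derived from these two columns alone.
 */
static bool
total_read_bytes(const IoViewRowSketch *row, int64_t *total)
{
    if (row->op_bytes == OP_BYTES_VARIABLE)
        return false;
    *total = row->reads * row->op_bytes;
    return true;
}
```

For a variable-size row the function has nothing to compute from, which is the case a separate byte counter would cover.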

2- Use op_bytes_[read | write | extend] columns instead of one
op_bytes column, also use the first solution.

This can solve the first solution's weakness but it introduces two
more columns. This is more future proof compared to the first solution
if there is a chance that some variable I/O could happen in write and
extend calls as well in the future.

-1 as more columns impact the readability and usability.

3- Create a new pg_stat_io_wal view to put WAL I/Os here instead of pg_stat_io.

pg_stat_io could remain untouchable and we will have flexibility to
edit this new view as much as we want. But the original aim of the
pg_stat_io is evaluating all I/O from a single view and adding a new
view breaks this aim.

-1 as it defeats the very purpose of one-stop view pg_stat_io for all
kinds of I/O. PS: see my response above about adding both WAL write
I/O and read I/O stats to pg_stat_wal.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#49

Amit Kapila

amit.kapila16@gmail.com

over 1 year ago

In reply to: Bharath Rupireddy (#48)

Re: Show WAL write and fsync stats in pg_stat_io

On Mon, May 13, 2024 at 7:42 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

On Fri, Apr 19, 2024 at 1:32 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:

I wanted to inform you that the 73f0a13266 commit changed all WALRead
calls to read variable bytes, only the WAL receiver was reading
variable bytes before.

I want to start working on this again if possible. I will try to
summarize the current status:

Thanks for working on this.

* With the 73f0a13266 commit, the WALRead() function started to read
variable bytes in every case. Before, only the WAL receiver was
reading variable bytes.

* With the 91f2cae7a4 commit, WALReadFromBuffers() is merged. We were
discussing what we have to do when this is merged. It is decided that
WALReadFromBuffers() does not call pgstat_report_wait_start() because
this function does not perform any IO [1]. We may follow the same
logic by not including these to pg_stat_io?

Right. WALReadFromBuffers doesn't do any I/O.

Whoever reads WAL from disk (backends, walsenders, recovery process)
using pg_pread (XLogPageRead, WALRead) needs to be tracked in
pg_stat_io or some other view. If it were to be in pg_stat_io,
although we may not be able to distinguish WAL read stats at a backend
level (like how many times/bytes a walsender or recovery process or a
backend read WAL from disk), but it can help understand overall impact
of WAL read I/O at a cluster level. With this approach, the WAL I/O
stats are divided up - WAL read I/O and write I/O stats are in
pg_stat_io and pg_stat_wal respectively.

This makes me think if we need to add WAL read I/O stats also to
pg_stat_wal. Then, we can also add WALReadFromBuffers stats
hits/misses there. With this approach, pg_stat_wal can be a one-stop
view for all the WAL related stats.

If possible, let's have all the I/O stats (even for WAL) in
pg_stat_io. Can't we show the WAL data we get from buffers in the hits
column and then have read_bytes or something like that to know the
amount of data read?

--
With Regards,
Amit Kapila.

#50

Nitin Jadhav

nitinjadhavpostgres@gmail.com

over 1 year ago

In reply to: Amit Kapila (#49)

Re: Show WAL write and fsync stats in pg_stat_io

If possible, let's have all the I/O stats (even for WAL) in
pg_stat_io. Can't we show the WAL data we get from buffers in the hits
column and then have read_bytes or something like that to know the
amount of data read?

The ‘hits’ column in ‘pg_stat_io’ is a vital indicator for adjusting a
database. It signifies the count of cache hits, or in other words, the
instances where data was located in the ‘shared_buffers’. As a result,
keeping an eye on the ‘hits’ column in ‘pg_stat_io’ can offer useful
knowledge about the buffer cache’s efficiency and assist users in
making educated choices when fine-tuning their database. However, if
we include the hit count of WAL buffers in this, it may lead to
misleading interpretations for database tuning. If there’s something
I’ve overlooked that’s already been discussed, please feel free to
correct me.

Best Regards,
Nitin Jadhav
Azure Database for PostgreSQL
Microsoft

#51

Nazir Bilal Yavuz

byavuz81@gmail.com

over 1 year ago

In reply to: Bharath Rupireddy (#48)

Re: Show WAL write and fsync stats in pg_stat_io

Hi,

Thank you for looking into this! And, sorry for the late answer.

On Mon, 13 May 2024 at 17:12, Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

On Fri, Apr 19, 2024 at 1:32 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:

I wanted to inform you that the 73f0a13266 commit changed all WALRead
calls to read variable bytes, only the WAL receiver was reading
variable bytes before.

I want to start working on this again if possible. I will try to
summarize the current status:

Thanks for working on this.

* With the 73f0a13266 commit, the WALRead() function started to read
variable bytes in every case. Before, only the WAL receiver was
reading variable bytes.

* With the 91f2cae7a4 commit, WALReadFromBuffers() is merged. We were
discussing what we have to do when this is merged. It is decided that
WALReadFromBuffers() does not call pgstat_report_wait_start() because
this function does not perform any IO [1]. We may follow the same
logic by not including these to pg_stat_io?

Right. WALReadFromBuffers doesn't do any I/O.

Whoever reads WAL from disk (backends, walsenders, recovery process)
using pg_pread (XLogPageRead, WALRead) needs to be tracked in
pg_stat_io or some other view. If it were to be in pg_stat_io,
although we may not be able to distinguish WAL read stats at a backend
level (like how many times/bytes a walsender or recovery process or a
backend read WAL from disk), but it can help understand overall impact
of WAL read I/O at a cluster level. With this approach, the WAL I/O
stats are divided up - WAL read I/O and write I/O stats are in
pg_stat_io and pg_stat_wal respectively.

This makes me think if we need to add WAL read I/O stats also to
pg_stat_wal. Then, we can also add WALReadFromBuffers stats
hits/misses there. With this approach, pg_stat_wal can be a one-stop
view for all the WAL related stats. If needed, we can join info from
pg_stat_wal to pg_stat_io in system_views.sql so that the I/O stats
are emitted to the end-user via pg_stat_io.

I agree that the ultimate goal is seeing WAL I/O stats from one place.
There is a reply to this from Amit:

On Tue, 28 May 2024 at 03:48, Amit Kapila <amit.kapila16@gmail.com> wrote:

If possible, let's have all the I/O stats (even for WAL) in
pg_stat_io. Can't we show the WAL data we get from buffers in the hits
column and then have read_bytes or something like that to know the
amount of data read?

I think it is better to have all the I/O stats in pg_stat_io, as Amit
said. It also makes sense to me to show 'WAL data we get from buffers'
in the hits column since, instead of doing I/O from disk, we get the
data directly from WAL buffers. I think that fits the explanation of
the hits column in pg_stat_io, which is 'The number of times a desired
block was found in a shared buffer.' [1].

* With the b5a9b18cd0 commit, streaming I/O is merged but AFAIK this
does not block anything related to putting WAL stats in pg_stat_io.

If I am not missing any new changes, the only problem is reading
variable bytes now. We have discussed a couple of solutions:

1- Change op_bytes to something like -1, 0, 1, NULL etc. and document
that this means some variable byte I/O is happening.

I kind of dislike this solution because if the *only* read I/O is
happening in variable bytes, it will look like write and extend I/Os
are happening in variable bytes as well. As a solution, it could be
documented that only read I/Os could happen in variable bytes for now.

Yes, read I/O for relation and WAL can happen in variable bytes. I
think this idea seems reasonable and simple yet useful to know the
cluster-wide read I/O.

I agree.

However, another point here is how the total number of bytes read is
represented with existing pg_stat_io columns 'reads' and 'op_bytes'.
It is known now with 'reads' * 'op_bytes', but with variable bytes,
how is read bytes calculated? Maybe add new columns
read_bytes/write_bytes?

2- Use op_bytes_[read | write | extend] columns instead of one
op_bytes column, also use the first solution.

This can solve the first solution's weakness but it introduces two
more columns. This is more future proof compared to the first solution
if there is a chance that some variable I/O could happen in write and
extend calls as well in the future.

-1 as more columns impact the readability and usability.

I did not understand the overall difference between what you suggested
(adding read_bytes/write_bytes columns) and my suggestion (adding
op_bytes_[read | write | extend] columns). They both introduce new
columns. Could you please explain what you suggested in more detail?

3- Create a new pg_stat_io_wal view to put WAL I/Os here instead of pg_stat_io.

pg_stat_io could remain untouchable and we will have flexibility to
edit this new view as much as we want. But the original aim of the
pg_stat_io is evaluating all I/O from a single view and adding a new
view breaks this aim.

-1 as it defeats the very purpose of one-stop view pg_stat_io for all
kinds of I/O. PS: see my response above about adding both WAL write
I/O and read I/O stats to pg_stat_wal.

I agree.

[1]: https://www.postgresql.org/docs/current/monitoring-stats.html#MONITORING-PG-STAT-IO-VIEW

--
Regards,
Nazir Bilal Yavuz
Microsoft

#52

Nazir Bilal Yavuz

byavuz81@gmail.com

over 1 year ago

In reply to: Nitin Jadhav (#50)

Re: Show WAL write and fsync stats in pg_stat_io

Hi,

Thank you for looking into this!

On Sun, 9 Jun 2024 at 18:05, Nitin Jadhav <nitinjadhavpostgres@gmail.com> wrote:

If possible, let's have all the I/O stats (even for WAL) in
pg_stat_io. Can't we show the WAL data we get from buffers in the hits
column and then have read_bytes or something like that to know the
amount of data read?

The ‘hits’ column in ‘pg_stat_io’ is a vital indicator for adjusting a
database. It signifies the count of cache hits, or in other words, the
instances where data was located in the ‘shared_buffers’. As a result,
keeping an eye on the ‘hits’ column in ‘pg_stat_io’ can offer useful
knowledge about the buffer cache’s efficiency and assist users in
making educated choices when fine-tuning their database. However, if
we include the hit count of WAL buffers in this, it may lead to
misleading interpretations for database tuning. If there’s something
I’ve overlooked that’s already been discussed, please feel free to
correct me.

I think counting them as a hit makes sense. We read data from WAL
buffers instead of reading them from disk. And, WAL buffers are stored
in shared memory so I believe they can be counted as hits in the
shared buffers. Could you please explain how this change can 'lead to
misleading interpretations for database tuning' a bit more?

--
Regards,
Nazir Bilal Yavuz
Microsoft

#53

Melanie Plageman

melanieplageman@gmail.com

over 1 year ago

In reply to: Nazir Bilal Yavuz (#52)

Re: Show WAL write and fsync stats in pg_stat_io

On Thu, Jun 13, 2024 at 5:24 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:

On Sun, 9 Jun 2024 at 18:05, Nitin Jadhav <nitinjadhavpostgres@gmail.com> wrote:

If possible, let's have all the I/O stats (even for WAL) in
pg_stat_io. Can't we show the WAL data we get from buffers in the hits
column and then have read_bytes or something like that to know the
amount of data read?

The ‘hits’ column in ‘pg_stat_io’ is a vital indicator for adjusting a
database. It signifies the count of cache hits, or in other words, the
instances where data was located in the ‘shared_buffers’. As a result,
keeping an eye on the ‘hits’ column in ‘pg_stat_io’ can offer useful
knowledge about the buffer cache’s efficiency and assist users in
making educated choices when fine-tuning their database. However, if
we include the hit count of WAL buffers in this, it may lead to
misleading interpretations for database tuning. If there’s something
I’ve overlooked that’s already been discussed, please feel free to
correct me.

I think counting them as a hit makes sense. We read data from WAL
buffers instead of reading them from disk. And, WAL buffers are stored
in shared memory so I believe they can be counted as hits in the
shared buffers. Could you please explain how this change can 'lead to
misleading interpretations for database tuning' a bit more?

Perhaps Nitin was thinking of a scenario in which WAL hits are counted
as hits on the same IOObject as shared buffer hits. Since this thread
has been going on for awhile and we haven't recently had a schema
overview, I could understand if there was some confusion. For clarity,
I will restate that the current proposal is to count WAL buffer hits
for IOObject WAL, which means they will not be mixed in with shared
buffer hits.

And I think it makes sense to count WAL IOObject hits since increasing
wal_buffers can lead to more hits, right?
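
The separation described above can be pictured with a minimal C sketch.
It assumes a simplified counter table indexed by I/O object; the enum,
struct, and function names are hypothetical stand-ins and not the
PostgreSQL definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the I/O object kinds. */
typedef enum DemoIOObject
{
	DEMO_IOOBJECT_RELATION,		/* shared buffer traffic */
	DEMO_IOOBJECT_WAL,			/* WAL buffer traffic */
	DEMO_IOOBJECT_NUM_TYPES
} DemoIOObject;

/* One hit counter per object kind, so the two kinds never mix. */
typedef struct DemoHitStats
{
	uint64_t	hits[DEMO_IOOBJECT_NUM_TYPES];
} DemoHitStats;

/* Count one buffer hit against the given object kind only. */
static void
demo_count_hit(DemoHitStats *stats, DemoIOObject obj)
{
	assert(stats != NULL);
	assert((int) obj >= 0 && (int) obj < (int) DEMO_IOOBJECT_NUM_TYPES);
	stats->hits[obj]++;
}
```

In this sketch, a hit counted for DEMO_IOOBJECT_WAL leaves the
DEMO_IOOBJECT_RELATION counter untouched, so a shared-buffer hit ratio
computed from the relation row would not be skewed by WAL buffer hits.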

- Melanie

#54

Nitin Jadhav

nitinjadhavpostgres@gmail.com

over 1 year ago

In reply to: Melanie Plageman (#53)

Re: Show WAL write and fsync stats in pg_stat_io

Perhaps Nitin was thinking of a scenario in which WAL hits are counted
as hits on the same IOObject as shared buffer hits. Since this thread
has been going on for awhile and we haven't recently had a schema
overview, I could understand if there was some confusion

Yes. I was considering a scenario where WAL hits are counted as hits
on the same IOObject as shared buffer hits.

For clarity,
I will restate that the current proposal is to count WAL buffer hits
for IOObject WAL, which means they will not be mixed in with shared
buffer hits.

And I think it makes sense to count WAL IOObject hits since increasing
wal_buffers can lead to more hits, right?

Thank you for the clarification. I agree with the proposal to count
WAL buffer hits for IOObject WAL separately from shared buffer hits.
This distinction will provide a more accurate representation.

Best Regards,
Nitin Jadhav
Azure Database for PostgreSQL
Microsoft

#55

Nazir Bilal Yavuz

byavuz81@gmail.com

12 months ago

In reply to: Nazir Bilal Yavuz (#46)

1 attachment(s)

Re: Show WAL write and fsync stats in pg_stat_io

Hi,

On Fri, 19 Apr 2024 at 11:01, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:

If I am not missing any new changes, the only problem is reading
variable bytes now. We have discussed a couple of solutions:

With the recent commit [1], pg_stat_io tracks IOs as bytes instead of
blocks. This solves the variable IO size problem.
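
To illustrate why byte-based tracking sidesteps the variable-size
problem, here is a minimal, self-contained C sketch. The names
(DemoIOCounter, demo_count_io, demo_estimate_bytes) are hypothetical and
are not the pgstat API; they only contrast an exact byte total with the
older 'reads * op_bytes' style of estimate.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-operation counters: number of calls plus total bytes. */
typedef struct DemoIOCounter
{
	uint64_t	count;
	uint64_t	bytes;
} DemoIOCounter;

/* Record cnt operations that together transferred nbytes bytes. */
static void
demo_count_io(DemoIOCounter *c, uint32_t cnt, uint64_t nbytes)
{
	assert(c != NULL);
	c->count += cnt;
	c->bytes += nbytes;
}

/* Block-based estimate: assumes every operation moved exactly op_bytes. */
static uint64_t
demo_estimate_bytes(const DemoIOCounter *c, uint64_t op_bytes)
{
	assert(c != NULL);
	return c->count * op_bytes;
}
```

With three reads of 8192, 100, and 4000 bytes, the tracked total in this
sketch is 12292 bytes, while the estimate with op_bytes = 8192 would be
24576 bytes.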

I encountered another problem while rebasing the patch. The problem is
basically we do not expect any pending stats while restoring the stats
at the initdb. However, WAL IOs (WAL read and WAL init IOs for now)
may happen before restoring the stats, so we end up having pending
stats before restoring them and that causes initdb to fail.

I described this problem in another thread [2], but this thread is a
better place to discuss it, so I am restating it here:

This is where we restore stats and do not expect any pending stats at
the Assert:

'''
pgstat_restore_stats() ->
pgstat_read_statsfile() ->
pgstat_reset_after_failure() ->
pgstat_drop_all_entries() ->
pgstat_drop_entry_internal() ->
We have an assertion there which checks if there is a pending stat entry:

/* should already have released local reference */
if (pgStatEntryRefHash)
	Assert(!pgstat_entry_ref_hash_lookup(pgStatEntryRefHash, shent->key));
'''

This is where the WAL read happens before restoring the stats:

'''
BootstrapModeMain() ->
InitPostgres() ->
StartupXLOG() ->
ReadCheckpointRecord() ->
InitWalRecovery() ->
... ->
XLogReadAhead() ->
XLogDecodeNextRecord() ->
ReadPageInternal() ->
state->routine.page_read = XLogPageRead() then WAL read happens
'''

So, this assert fails because we have pending stats for
PGSTAT_KIND_BACKEND. It is only PGSTAT_KIND_BACKEND because all
fixed-numbered stats (which include PGSTAT_KIND_IO) are reset via
'kind_info->reset_all_cb()' in pgstat_reset_after_failure(). It seems
that we do not care about stats collected before the restore, since we
reset all fixed-numbered stats there, so not counting these WAL IOs
during initdb may be one solution.

A simple reproducer patch is attached; it adds three
pgstat_count_io_op() calls (one in xlog.c and two in xlogrecovery.c).
I did not include the rest of the patchset as I thought it would
increase the complexity. To reproduce, just run initdb on an
assert-enabled build after applying the patch. Then you should see:

creating configuration files ... ok
running bootstrap script ... TRAP: failed
Assert("!pgstat_entry_ref_hash_lookup(pgStatEntryRefHash,
shent->key)"), File:
"../../postgres/src/backend/utils/activity/pgstat_shmem.c", Line: 859,
PID: 51001
.../install/bin/postgres(ExceptionalCondition+0xab) [0x55da0959feea]

I would be happy to hear your thoughts.

[1]: f92c854cf
[2]: postgr.es/m/CAN55FZ1uOq%3DFVJObp0bdj-Z8q1ZRNmA-RymPqbMD%2Bp4QaHXP3A%40mail.gmail.com

--
Regards,
Nazir Bilal Yavuz
Microsoft

Attachments:

Break-initdb-with-pending-stats.patch (text/x-patch; charset=US-ASCII)

From 93bbfd611b86654356a84415367fe562eae74b68 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Thu, 16 Jan 2025 10:10:46 +0300
Subject: [PATCH] Break initdb with pending stats

To reproduce, run initdb on asserts enabled build.
---
 src/backend/access/transam/xlog.c         | 2 ++
 src/backend/access/transam/xlogrecovery.c | 5 +++++
 2 files changed, 7 insertions(+)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index bf3dbda901d..857f8379f3e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3322,6 +3322,8 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	}
 	pgstat_report_wait_end();
 
+	pgstat_count_io_op(IOOBJECT_RELATION, IOCONTEXT_NORMAL, IOOP_FSYNC, 1, 0);
+
 	if (close(fd) != 0)
 		ereport(ERROR,
 				(errcode_for_file_access(),
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 0bbe2eea206..cc29b19ca0a 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -3406,6 +3406,9 @@ retry:
 		int			save_errno = errno;
 
 		pgstat_report_wait_end();
+
+		pgstat_count_io_op(IOOBJECT_RELATION, IOCONTEXT_NORMAL, IOOP_READ, 1, r);
+
 		XLogFileName(fname, curFileTLI, readSegNo, wal_segment_size);
 		if (r < 0)
 		{
@@ -3426,6 +3429,8 @@ retry:
 	}
 	pgstat_report_wait_end();
 
+	pgstat_count_io_op(IOOBJECT_RELATION, IOCONTEXT_NORMAL, IOOP_READ, 1, r);
+
 	Assert(targetSegNo == readSegNo);
 	Assert(targetPageOff == readOff);
 	Assert(reqLen <= readLen);
-- 
2.47.1

#56

Michael Paquier

michael@paquier.xyz

12 months ago

In reply to: Nazir Bilal Yavuz (#55)

Re: Show WAL write and fsync stats in pg_stat_io

On Thu, Jan 16, 2025 at 11:40:51AM +0300, Nazir Bilal Yavuz wrote:

I encountered another problem while rebasing the patch. The problem is
basically we do not expect any pending stats while restoring the stats
at the initdb. However, WAL IOs (WAL read and WAL init IOs for now)
may happen before restoring the stats, so we end up having pending
stats before restoring them and that causes initdb to fail.

On top of 4feba03d8b92, I've reused something close to the patch you
have posted previously in case, and the issue with allocations for
pending stats should be gone.

Could it be possible to post a new version of the patch? You should
be able to reuse pgstat_count_backend_io_op[_time]() for your work
with WAL data in pg_stat_io if you need a low-level control of things,
but I suspect that calling pgstat_count_io_op() & the other should be
enough to get the job done with a new IOObject.
--
Michael

#57

Nazir Bilal Yavuz

byavuz81@gmail.com

12 months ago

In reply to: Michael Paquier (#56)

3 attachment(s)

Re: Show WAL write and fsync stats in pg_stat_io

Hi,

On Wed, 22 Jan 2025 at 03:14, Michael Paquier <michael@paquier.xyz> wrote:

On Thu, Jan 16, 2025 at 11:40:51AM +0300, Nazir Bilal Yavuz wrote:

I encountered another problem while rebasing the patch. The problem is
basically we do not expect any pending stats while restoring the stats
at the initdb. However, WAL IOs (WAL read and WAL init IOs for now)
may happen before restoring the stats, so we end up having pending
stats before restoring them and that causes initdb to fail.

On top of 4feba03d8b92, I've reused something close to the patch you
have posted previously in case, and the issue with allocations for
pending stats should be gone.

Yes, they are fixed; thanks!

Could it be possible to post a new version of the patch? You should
be able to reuse pgstat_count_backend_io_op[_time]() for your work
with WAL data in pg_stat_io if you need a low-level control of things,
but I suspect that calling pgstat_count_io_op() & the other should be
enough to get the job done with a new IOObject.

I think there is only one problem remaining now. With this patch,
walsenders have stats to report, and they may shut down after the
checkpointer, which causes the '027_stream_regress.pl' test to fail.
Andres is already working on fixing that issue [1]; the
'027_stream_regress.pl' test passes after applying his proposed fix.

v9 is rebased and attached as three patches. The first one is a
squashed patch for the current version of Andres' proposed fix to pass
the CI, the second one is for adding WAL stats to pg_stat_io and the
third one is for fetching timing columns from pg_stat_io in the
pg_stat_wal view.

There is a change in the main patch (0002). Stats are now flushed
after the main loop in the PerformWalRecovery() function in
xlogrecovery.c. Previously they were flushed inside the main loop, but
I thought that might be costly, so I moved the flush to after the main
loop.
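
The trade-off can be sketched in C as follows. This is only an
illustration under simplified assumptions: DemoRecoveryStats,
demo_flush_stats, and demo_replay are hypothetical names, and the loop
stands in for the replay loop rather than reproducing
PerformWalRecovery().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stats state: local pending counters and a shared total. */
typedef struct DemoRecoveryStats
{
	uint64_t	pending_reads;	/* counted locally, not yet visible */
	uint64_t	shared_reads;	/* visible to readers after a flush */
	uint64_t	flush_calls;	/* how often the flush path ran */
} DemoRecoveryStats;

/* Move pending counts into the shared total. */
static void
demo_flush_stats(DemoRecoveryStats *s)
{
	s->shared_reads += s->pending_reads;
	s->pending_reads = 0;
	s->flush_calls++;
}

/* Replay nrecords records, counting one WAL read per record. */
static void
demo_replay(DemoRecoveryStats *s, uint64_t nrecords, bool flush_in_loop)
{
	assert(s != NULL);
	for (uint64_t i = 0; i < nrecords; i++)
	{
		s->pending_reads++;
		if (flush_in_loop)
			demo_flush_stats(s);
	}
	/* Flush once after the loop so nothing stays pending. */
	demo_flush_stats(s);
}
```

In this sketch both variants end with the same shared total, but the
in-loop variant runs the flush path once per record plus once at the
end, while the after-loop variant runs it a single time. The cost of the
cheaper variant is that the counts stay invisible until the loop
finishes.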

[1]: postgr.es/m/flat/kgng5nrvnlv335evmsuvpnh354rw7qyazl73kdysev2cr2v5zu%40m3cfzxicm5kp

--
Regards,
Nazir Bilal Yavuz
Microsoft

Attachments:

v9-0001-Squash-reorder-shutdown-sequence-patches.patch (text/x-patch; charset=US-ASCII)

From 5ee9f2ca206f201beb7264f460de9e8ad4c08ba0 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 10 Jan 2025 11:11:40 -0500
Subject: [PATCH v9 1/3] Squash reorder shutdown sequence patches

---
 src/include/storage/pmsignal.h                |   3 +-
 src/backend/postmaster/checkpointer.c         | 157 +++++---
 src/backend/postmaster/postmaster.c           | 341 +++++++++++-------
 .../utils/activity/wait_event_names.txt       |   1 +
 4 files changed, 318 insertions(+), 184 deletions(-)

diff --git a/src/include/storage/pmsignal.h b/src/include/storage/pmsignal.h
index 3fbe5bf1136..d84a383047e 100644
--- a/src/include/storage/pmsignal.h
+++ b/src/include/storage/pmsignal.h
@@ -40,9 +40,10 @@ typedef enum
 	PMSIGNAL_BACKGROUND_WORKER_CHANGE,	/* background worker state change */
 	PMSIGNAL_START_WALRECEIVER, /* start a walreceiver */
 	PMSIGNAL_ADVANCE_STATE_MACHINE, /* advance postmaster's state machine */
+	PMSIGNAL_XLOG_IS_SHUTDOWN,	/* ShutdownXLOG() completed */
 } PMSignalReason;
 
-#define NUM_PMSIGNALS (PMSIGNAL_ADVANCE_STATE_MACHINE+1)
+#define NUM_PMSIGNALS (PMSIGNAL_XLOG_IS_SHUTDOWN+1)
 
 /*
  * Reasons why the postmaster would send SIGQUIT to its children.
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 9bfd0fd665c..767bf9f5cf8 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -10,10 +10,13 @@
  * fill WAL segments; the checkpointer itself doesn't watch for the
  * condition.)
  *
- * Normal termination is by SIGUSR2, which instructs the checkpointer to
- * execute a shutdown checkpoint and then exit(0).  (All backends must be
- * stopped before SIGUSR2 is issued!)  Emergency termination is by SIGQUIT;
- * like any backend, the checkpointer will simply abort and exit on SIGQUIT.
+ * The normal termination sequence is that checkpointer is instructed to
+ * execute the shutdown checkpoint by SIGINT.  After that checkpointer waits
+ * to be terminated via SIGUSR2, which instructs the checkpointer to exit(0).
+ * All backends must be stopped before SIGINT or SIGUSR2 is issued!
+ *
+ * Emergency termination is by SIGQUIT; like any backend, the checkpointer
+ * will simply abort and exit on SIGQUIT.
  *
  * If the checkpointer exits unexpectedly, the postmaster treats that the same
  * as a backend crash: shared memory may be corrupted, so remaining backends
@@ -51,6 +54,7 @@
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lwlock.h"
+#include "storage/pmsignal.h"
 #include "storage/proc.h"
 #include "storage/procsignal.h"
 #include "storage/shmem.h"
@@ -141,6 +145,7 @@ double		CheckPointCompletionTarget = 0.9;
  * Private state
  */
 static bool ckpt_active = false;
+static volatile sig_atomic_t ShutdownXLOGPending = false;
 
 /* these values are valid when ckpt_active is true: */
 static pg_time_t ckpt_start_time;
@@ -160,7 +165,7 @@ static bool CompactCheckpointerRequestQueue(void);
 static void UpdateSharedMemoryConfig(void);
 
 /* Signal handlers */
-static void ReqCheckpointHandler(SIGNAL_ARGS);
+static void ReqShutdownXLOG(SIGNAL_ARGS);
 
 
 /*
@@ -191,7 +196,7 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
 	 * tell us it's okay to shut down (via SIGUSR2).
 	 */
 	pqsignal(SIGHUP, SignalHandlerForConfigReload);
-	pqsignal(SIGINT, ReqCheckpointHandler); /* request checkpoint */
+	pqsignal(SIGINT, ReqShutdownXLOG);
 	pqsignal(SIGTERM, SIG_IGN); /* ignore SIGTERM */
 	/* SIGQUIT handler was already set up by InitPostmasterChild */
 	pqsignal(SIGALRM, SIG_IGN);
@@ -214,8 +219,11 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
 	 * process during a normal shutdown, and since checkpointer is shut down
 	 * very late...
 	 *
-	 * Walsenders are shut down after the checkpointer, but currently don't
-	 * report stats. If that changes, we need a more complicated solution.
+	 * While e.g. walsenders are active after the shutdown checkpoint has been
+	 * written (and thus could produce more stats), checkpointer stays around
+	 * after the shutdown checkpoint has been written. postmaster will only
+	 * signal checkpointer to exit after all processes that could emit stats
+	 * have been shut down.
 	 */
 	before_shmem_exit(pgstat_before_server_shutdown, 0);
 
@@ -330,7 +338,7 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
 	ProcGlobal->checkpointerProc = MyProcNumber;
 
 	/*
-	 * Loop forever
+	 * Loop until we've been asked to write shutdown checkpoint or terminate.
 	 */
 	for (;;)
 	{
@@ -349,7 +357,10 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
 		 * Process any requests or signals received recently.
 		 */
 		AbsorbSyncRequests();
+
 		HandleCheckpointerInterrupts();
+		if (ShutdownXLOGPending || ShutdownRequestPending)
+			break;
 
 		/*
 		 * Detect a pending checkpoint request by checking whether the flags
@@ -520,8 +531,13 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
 
 			ckpt_active = false;
 
-			/* We may have received an interrupt during the checkpoint. */
+			/*
+			 * We may have received an interrupt during the checkpoint and the
+			 * latch might have been reset (e.g. in CheckpointWriteDelay).
+			 */
 			HandleCheckpointerInterrupts();
+			if (ShutdownXLOGPending || ShutdownRequestPending)
+				break;
 		}
 
 		/* Check for archive_timeout and switch xlog files if necessary. */
@@ -560,6 +576,56 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
 						 cur_timeout * 1000L /* convert to ms */ ,
 						 WAIT_EVENT_CHECKPOINTER_MAIN);
 	}
+
+	/*
+	 * From here on, elog(ERROR) should end with exit(1), not send control
+	 * back to the sigsetjmp block above.
+	 */
+	ExitOnAnyError = true;
+
+	if (ShutdownXLOGPending)
+	{
+		/*
+		 * Close down the database.
+		 *
+		 * Since ShutdownXLOG() creates restartpoint or checkpoint, and
+		 * updates the statistics, increment the checkpoint request and flush
+		 * out pending statistic.
+		 */
+		PendingCheckpointerStats.num_requested++;
+		ShutdownXLOG(0, 0);
+		pgstat_report_checkpointer();
+		pgstat_report_wal(true);
+
+		/*
+		 * Tell postmaster that we're done.
+		 */
+		SendPostmasterSignal(PMSIGNAL_XLOG_IS_SHUTDOWN);
+	}
+
+	/*
+	 * Wait until we're asked to shut down. By separating the writing of the
+	 * shutdown checkpoint from checkpointer exiting, checkpointer can perform
+	 * some should-be-as-late-as-possible work like writing out stats.
+	 */
+	for (;;)
+	{
+		/* Clear any already-pending wakeups */
+		ResetLatch(MyLatch);
+
+		HandleCheckpointerInterrupts();
+
+		if (ShutdownRequestPending)
+			break;
+
+		(void) WaitLatch(MyLatch,
+						 WL_LATCH_SET | WL_EXIT_ON_PM_DEATH,
+						 0,
+						 WAIT_EVENT_CHECKPOINTER_SHUTDOWN);
+	}
+
+	/* Normal exit from the checkpointer is here */
+	proc_exit(0);				/* done */
 }
 
 /*
@@ -589,29 +655,6 @@ HandleCheckpointerInterrupts(void)
 		 */
 		UpdateSharedMemoryConfig();
 	}
-	if (ShutdownRequestPending)
-	{
-		/*
-		 * From here on, elog(ERROR) should end with exit(1), not send control
-		 * back to the sigsetjmp block above
-		 */
-		ExitOnAnyError = true;
-
-		/*
-		 * Close down the database.
-		 *
-		 * Since ShutdownXLOG() creates restartpoint or checkpoint, and
-		 * updates the statistics, increment the checkpoint request and flush
-		 * out pending statistic.
-		 */
-		PendingCheckpointerStats.num_requested++;
-		ShutdownXLOG(0, 0);
-		pgstat_report_checkpointer();
-		pgstat_report_wal(true);
-
-		/* Normal exit from the checkpointer is here */
-		proc_exit(0);			/* done */
-	}
 
 	/* Perform logging of memory contexts of this process */
 	if (LogMemoryContextPending)
@@ -732,6 +775,7 @@ CheckpointWriteDelay(int flags, double progress)
 	 * in which case we just try to catch up as quickly as possible.
 	 */
 	if (!(flags & CHECKPOINT_IMMEDIATE) &&
+		!ShutdownXLOGPending &&
 		!ShutdownRequestPending &&
 		!ImmediateCheckpointRequested() &&
 		IsCheckpointOnSchedule(progress))
@@ -865,14 +909,11 @@ IsCheckpointOnSchedule(double progress)
  * --------------------------------
  */
 
-/* SIGINT: set flag to run a normal checkpoint right away */
+/* SIGINT: set flag to trigger writing of shutdown checkpoint */
 static void
-ReqCheckpointHandler(SIGNAL_ARGS)
+ReqShutdownXLOG(SIGNAL_ARGS)
 {
-	/*
-	 * The signaling process should have set ckpt_flags nonzero, so all we
-	 * need do is ensure that our main loop gets kicked out of any wait.
-	 */
+	ShutdownXLOGPending = true;
 	SetLatch(MyLatch);
 }
 
@@ -990,38 +1031,36 @@ RequestCheckpoint(int flags)
 	SpinLockRelease(&CheckpointerShmem->ckpt_lck);
 
 	/*
-	 * Send signal to request checkpoint.  It's possible that the checkpointer
-	 * hasn't started yet, or is in process of restarting, so we will retry a
-	 * few times if needed.  (Actually, more than a few times, since on slow
-	 * or overloaded buildfarm machines, it's been observed that the
-	 * checkpointer can take several seconds to start.)  However, if not told
-	 * to wait for the checkpoint to occur, we consider failure to send the
-	 * signal to be nonfatal and merely LOG it.  The checkpointer should see
-	 * the request when it does start, with or without getting a signal.
+	 * Set checkpointer's latch to request checkpoint.  It's possible that the
+	 * checkpointer hasn't started yet, so we will retry a few times if
+	 * needed.  (Actually, more than a few times, since on slow or overloaded
+	 * buildfarm machines, it's been observed that the checkpointer can take
+	 * several seconds to start.)  However, if not told to wait for the
+	 * checkpoint to occur, we consider failure to set the latch to be
+	 * nonfatal and merely LOG it.  The checkpointer should see the request
+	 * when it does start, with or without the SetLatch().
 	 */
 #define MAX_SIGNAL_TRIES 600	/* max wait 60.0 sec */
 	for (ntries = 0;; ntries++)
 	{
-		if (CheckpointerShmem->checkpointer_pid == 0)
+		volatile PROC_HDR *procglobal = ProcGlobal;
+		ProcNumber	checkpointerProc = procglobal->checkpointerProc;
+
+		if (checkpointerProc == INVALID_PROC_NUMBER)
 		{
 			if (ntries >= MAX_SIGNAL_TRIES || !(flags & CHECKPOINT_WAIT))
 			{
 				elog((flags & CHECKPOINT_WAIT) ? ERROR : LOG,
-					 "could not signal for checkpoint: checkpointer is not running");
-				break;
-			}
-		}
-		else if (kill(CheckpointerShmem->checkpointer_pid, SIGINT) != 0)
-		{
-			if (ntries >= MAX_SIGNAL_TRIES || !(flags & CHECKPOINT_WAIT))
-			{
-				elog((flags & CHECKPOINT_WAIT) ? ERROR : LOG,
-					 "could not signal for checkpoint: %m");
+					 "could not notify checkpoint: checkpointer is not running");
 				break;
 			}
 		}
 		else
-			break;				/* signal sent successfully */
+		{
+			SetLatch(&GetPGProcByNumber(checkpointerProc)->procLatch);
+			/* notified successfully */
+			break;
+		}
 
 		CHECK_FOR_INTERRUPTS();
 		pg_usleep(100000L);		/* wait 0.1 sec, then retry */
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 5f615d0f605..115ad3d31d2 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -334,6 +334,7 @@ typedef enum
 								 * ckpt */
 	PM_WAIT_XLOG_ARCHIVAL,		/* waiting for archiver and walsenders to
 								 * finish */
+	PM_WAIT_CHECKPOINTER,		/* waiting for checkpointer to shut down */
 	PM_WAIT_DEAD_END,			/* waiting for dead-end children to exit */
 	PM_NO_CHILDREN,				/* all important children have exited */
 } PMState;
@@ -424,7 +425,6 @@ static int	BackendStartup(ClientSocket *client_sock);
 static void report_fork_failure_to_client(ClientSocket *client_sock, int errnum);
 static CAC_state canAcceptConnections(BackendType backend_type);
 static void signal_child(PMChild *pmchild, int signal);
-static void sigquit_child(PMChild *pmchild);
 static bool SignalChildren(int signal, BackendTypeMask targetMask);
 static void TerminateChildren(int signal);
 static int	CountChildren(BackendTypeMask targetMask);
@@ -2355,35 +2355,19 @@ process_pm_child_exit(void)
 		{
 			ReleasePostmasterChildSlot(CheckpointerPMChild);
 			CheckpointerPMChild = NULL;
-			if (EXIT_STATUS_0(exitstatus) && pmState == PM_WAIT_XLOG_SHUTDOWN)
+			if (EXIT_STATUS_0(exitstatus) && pmState == PM_WAIT_CHECKPOINTER)
 			{
 				/*
 				 * OK, we saw normal exit of the checkpointer after it's been
-				 * told to shut down.  We expect that it wrote a shutdown
-				 * checkpoint.  (If for some reason it didn't, recovery will
-				 * occur on next postmaster start.)
+				 * told to shut down.  We know checkpointer wrote a shutdown
+				 * checkpoint, otherwise we'd still be in
+				 * PM_WAIT_XLOG_SHUTDOWN state.
 				 *
-				 * At this point we should have no normal backend children
-				 * left (else we'd not be in PM_WAIT_XLOG_SHUTDOWN state) but
-				 * we might have dead-end children to wait for.
-				 *
-				 * If we have an archiver subprocess, tell it to do a last
-				 * archive cycle and quit. Likewise, if we have walsender
-				 * processes, tell them to send any remaining WAL and quit.
+				 * At this point only dead-end children should be left.
 				 */
-				Assert(Shutdown > NoShutdown);
-
-				/* Waken archiver for the last time */
-				if (PgArchPMChild != NULL)
-					signal_child(PgArchPMChild, SIGUSR2);
-
-				/*
-				 * Waken walsenders for the last time. No regular backends
-				 * should be around anymore.
-				 */
-				SignalChildren(SIGUSR2, btmask(B_WAL_SENDER));
-
-				UpdatePMState(PM_WAIT_XLOG_ARCHIVAL);
+				UpdatePMState(PM_WAIT_DEAD_END);
+				ConfigurePostmasterWaitSet(false);
+				SignalChildren(SIGTERM, btmask_all_except(B_LOGGER));
 			}
 			else
 			{
@@ -2665,6 +2649,86 @@ CleanupBackend(PMChild *bp,
 		LogChildExit(DEBUG2, procname, bp_pid, exitstatus);
 }
 
+/*
+ * Transition into FatalError state, in response to something bad having
+ * happened. Commonly the caller will have logged the reason for entering
+ * FatalError state.
+ */
+static void
+HandleFatalError(QuitSignalReason reason, bool consider_sigabrt)
+{
+	int			sigtosend;
+
+	SetQuitSignalReason(reason);
+
+	if (consider_sigabrt && send_abort_for_crash)
+		sigtosend = SIGABRT;
+	else
+		sigtosend = SIGQUIT;
+
+	/*
+	 * Signal all other child processes to exit.
+	 *
+	 * We could exclude dead-end children here, but at least when sending
+	 * SIGABRT it seems better to include them.
+	 */
+	TerminateChildren(sigtosend);
+
+	if (Shutdown != ImmediateShutdown)
+		FatalError = true;
+
+	/* We now transit into a state of waiting for children to die */
+	switch (pmState)
+	{
+		case PM_INIT:
+			/* shouldn't have any children */
+			Assert(false);
+			break;
+		case PM_STARTUP:
+			/* should have been handled in process_pm_child_exit */
+			Assert(false);
+			break;
+
+			/* wait for children to die */
+		case PM_RECOVERY:
+		case PM_HOT_STANDBY:
+		case PM_RUN:
+		case PM_STOP_BACKENDS:
+			UpdatePMState(PM_WAIT_BACKENDS);
+			break;
+
+		case PM_WAIT_BACKENDS:
+			/* there might be more backends to wait for */
+			break;
+
+		case PM_WAIT_XLOG_SHUTDOWN:
+		case PM_WAIT_XLOG_ARCHIVAL:
+		case PM_WAIT_CHECKPOINTER:
+
+			/*
+			 * Note that we switch *back* to PM_WAIT_BACKENDS here. This way
+			 * the PM_WAIT_BACKENDS && FatalError code in
+			 * PostmasterStateMachine does not have to be duplicated.
+			 *
+			 * XXX: This seems rather ugly, but it's not obvious if the
+			 * alternative is better.
+			 */
+			UpdatePMState(PM_WAIT_BACKENDS);
+			break;
+
+		case PM_WAIT_DEAD_END:
+		case PM_NO_CHILDREN:
+			break;
+	}
+
+	/*
+	 * .. and if this doesn't happen quickly enough, now the clock is ticking
+	 * for us to kill them without mercy.
+	 */
+	if (AbortStartTime == 0)
+		AbortStartTime = time(NULL);
+}
+
 /*
  * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
  * walwriter, autovacuum, archiver, slot sync worker, or background worker.
@@ -2677,8 +2741,6 @@ CleanupBackend(PMChild *bp,
 static void
 HandleChildCrash(int pid, int exitstatus, const char *procname)
 {
-	bool		take_action;
-
 	/*
 	 * We only log messages and send signals if this is the first process
 	 * crash and we're not doing an immediate shutdown; otherwise, we're only
@@ -2686,63 +2748,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
 	 * signaled children, nonzero exit status is to be expected, so don't
 	 * clutter log.
 	 */
-	take_action = !FatalError && Shutdown != ImmediateShutdown;
+	if (FatalError || Shutdown == ImmediateShutdown)
+		return;
 
-	if (take_action)
-	{
-		LogChildExit(LOG, procname, pid, exitstatus);
-		ereport(LOG,
-				(errmsg("terminating any other active server processes")));
-		SetQuitSignalReason(PMQUIT_FOR_CRASH);
-	}
+	LogChildExit(LOG, procname, pid, exitstatus);
+	ereport(LOG,
+			(errmsg("terminating any other active server processes")));
 
 	/*
-	 * Signal all other child processes to exit.  The crashed process has
-	 * already been removed from ActiveChildList.
+	 * Switch into error state. The crashed process has already been removed
+	 * from ActiveChildList.
 	 */
-	if (take_action)
-	{
-		dlist_iter	iter;
-
-		dlist_foreach(iter, &ActiveChildList)
-		{
-			PMChild    *bp = dlist_container(PMChild, elem, iter.cur);
-
-			/* We do NOT restart the syslogger */
-			if (bp == SysLoggerPMChild)
-				continue;
-
-			if (bp == StartupPMChild)
-				StartupStatus = STARTUP_SIGNALED;
-
-			/*
-			 * This backend is still alive.  Unless we did so already, tell it
-			 * to commit hara-kiri.
-			 *
-			 * We could exclude dead-end children here, but at least when
-			 * sending SIGABRT it seems better to include them.
-			 */
-			sigquit_child(bp);
-		}
-	}
-
-	if (Shutdown != ImmediateShutdown)
-		FatalError = true;
-
-	/* We now transit into a state of waiting for children to die */
-	if (pmState == PM_RECOVERY ||
-		pmState == PM_HOT_STANDBY ||
-		pmState == PM_RUN ||
-		pmState == PM_STOP_BACKENDS ||
-		pmState == PM_WAIT_XLOG_SHUTDOWN)
-		UpdatePMState(PM_WAIT_BACKENDS);
-
-	/*
-	 * .. and if this doesn't happen quickly enough, now the clock is ticking
-	 * for us to kill them without mercy.
-	 */
-	if (AbortStartTime == 0)
-		AbortStartTime = time(NULL);
+	HandleFatalError(PMQUIT_FOR_CRASH, true);
 }
 
 /*
@@ -2840,6 +2857,9 @@ PostmasterStateMachine(void)
 	 * PM_WAIT_BACKENDS, but we signal the processes first, before waiting for
 	 * them.  Treating it as a distinct pmState allows us to share this code
 	 * across multiple shutdown code paths.
+	 *
+	 * Note that HandleFatalError() switches to PM_WAIT_BACKENDS even if we
+	 * were, before the fatal error, in a "more advanced" state.
 	 */
 	if (pmState == PM_STOP_BACKENDS || pmState == PM_WAIT_BACKENDS)
 	{
@@ -2872,16 +2892,20 @@ PostmasterStateMachine(void)
 
 		/*
 		 * If we are doing crash recovery or an immediate shutdown then we
-		 * expect the checkpointer to exit as well, otherwise not.
+		 * expect archiver, checkpointer and walsender to exit as well,
+		 * otherwise not.
 		 */
 		if (FatalError || Shutdown >= ImmediateShutdown)
-			targetMask = btmask_add(targetMask, B_CHECKPOINTER);
+			targetMask = btmask_add(targetMask,
+									B_CHECKPOINTER,
+									B_ARCHIVER,
+									B_WAL_SENDER);
 
 		/*
-		 * Walsenders and archiver will continue running; they will be
-		 * terminated later after writing the checkpoint record.  We also let
-		 * dead-end children to keep running for now.  The syslogger process
-		 * exits last.
+		 * Normally walsenders and archiver will continue running; they will
+		 * be terminated later after writing the checkpoint record.  We also
+		 * let dead-end children keep running for now.  The syslogger
+		 * process exits last.
 		 *
 		 * This assertion checks that we have covered all backend types,
 		 * either by including them in targetMask, or by noting here that they
@@ -2892,13 +2916,17 @@ PostmasterStateMachine(void)
 			BackendTypeMask remainMask = BTYPE_MASK_NONE;
 
 			remainMask = btmask_add(remainMask,
-									B_WAL_SENDER,
-									B_ARCHIVER,
 									B_DEAD_END_BACKEND,
 									B_LOGGER);
 
-			/* checkpointer may or may not be in targetMask already */
-			remainMask = btmask_add(remainMask, B_CHECKPOINTER);
+			/*
+			 * Archiver, checkpointer and walsender may or may not be in
+			 * targetMask already.
+			 */
+			remainMask = btmask_add(remainMask,
+									B_ARCHIVER,
+									B_CHECKPOINTER,
+									B_WAL_SENDER);
 
 			/* these are not real postmaster children */
 			remainMask = btmask_add(remainMask,
@@ -2938,9 +2966,9 @@ PostmasterStateMachine(void)
 				SignalChildren(SIGQUIT, btmask(B_DEAD_END_BACKEND));
 
 				/*
-				 * We already SIGQUIT'd walsenders and the archiver, if any,
-				 * when we started immediate shutdown or entered FatalError
-				 * state.
+				 * We already SIGQUIT'd archiver, checkpointer and walsenders,
+				 * if any, when we started immediate shutdown or entered
+				 * FatalError state.
 				 */
 			}
 			else
@@ -2954,10 +2982,10 @@ PostmasterStateMachine(void)
 				/* Start the checkpointer if not running */
 				if (CheckpointerPMChild == NULL)
 					CheckpointerPMChild = StartChildProcess(B_CHECKPOINTER);
-				/* And tell it to shut down */
+				/* And tell it to write the shutdown checkpoint */
 				if (CheckpointerPMChild != NULL)
 				{
-					signal_child(CheckpointerPMChild, SIGUSR2);
+					signal_child(CheckpointerPMChild, SIGINT);
 					UpdatePMState(PM_WAIT_XLOG_SHUTDOWN);
 				}
 				else
@@ -2971,34 +2999,50 @@ PostmasterStateMachine(void)
 					 * We don't consult send_abort_for_crash here, as it's
 					 * unlikely that dumping cores would illuminate the reason
 					 * for checkpointer fork failure.
+					 *
+					 * XXX: Is it worth inventing a different PMQUIT value
+					 * that signals that the cluster is in a bad state,
+					 * without a process having crashed?
 					 */
-					FatalError = true;
-					UpdatePMState(PM_WAIT_DEAD_END);
-					ConfigurePostmasterWaitSet(false);
-
-					/* Kill the walsenders and archiver too */
-					SignalChildren(SIGQUIT, btmask_all_except(B_LOGGER));
+					HandleFatalError(PMQUIT_FOR_CRASH, false);
 				}
 			}
 		}
 	}
 
+	/*
+	 * The state transition from PM_WAIT_XLOG_SHUTDOWN to
+	 * PM_WAIT_XLOG_ARCHIVAL is in process_pm_pmsignal(), in response to
+	 * PMSIGNAL_XLOG_IS_SHUTDOWN.
+	 */
+
 	if (pmState == PM_WAIT_XLOG_ARCHIVAL)
 	{
 		/*
-		 * PM_WAIT_XLOG_ARCHIVAL state ends when there's no other children
-		 * than dead-end children left. There shouldn't be any regular
-		 * backends left by now anyway; what we're really waiting for is
-		 * walsenders and archiver.
+		 * PM_WAIT_XLOG_ARCHIVAL state ends when there's no children other
+		 * than checkpointer and dead-end children left. There shouldn't be
+		 * any regular backends left by now anyway; what we're really waiting
+		 * for is for walsenders and archiver to exit.
 		 */
-		if (CountChildren(btmask_all_except(B_LOGGER, B_DEAD_END_BACKEND)) == 0)
+		if (CountChildren(btmask_all_except(B_CHECKPOINTER, B_LOGGER, B_DEAD_END_BACKEND)) == 0)
 		{
-			UpdatePMState(PM_WAIT_DEAD_END);
-			ConfigurePostmasterWaitSet(false);
-			SignalChildren(SIGTERM, btmask_all_except(B_LOGGER));
+			UpdatePMState(PM_WAIT_CHECKPOINTER);
+
+			/*
+			 * Now that everyone important is gone, tell checkpointer to shut
+			 * down too. That allows checkpointer to perform some last bits of
+			 * cleanup without other processes interfering.
+			 */
+			if (CheckpointerPMChild != NULL)
+				signal_child(CheckpointerPMChild, SIGUSR2);
 		}
 	}
 
+	/*
+	 * The state transition from PM_WAIT_CHECKPOINTER to PM_WAIT_DEAD_END is
+	 * in process_pm_child_exit().
+	 */
+
 	if (pmState == PM_WAIT_DEAD_END)
 	{
 		/*
@@ -3135,6 +3179,7 @@ pmstate_name(PMState state)
 			PM_TOSTR_CASE(PM_WAIT_XLOG_SHUTDOWN);
 			PM_TOSTR_CASE(PM_WAIT_XLOG_ARCHIVAL);
 			PM_TOSTR_CASE(PM_WAIT_DEAD_END);
+			PM_TOSTR_CASE(PM_WAIT_CHECKPOINTER);
 			PM_TOSTR_CASE(PM_NO_CHILDREN);
 	}
 #undef PM_TOSTR_CASE
@@ -3347,19 +3392,6 @@ signal_child(PMChild *pmchild, int signal)
 #endif
 }
 
-/*
- * Convenience function for killing a child process after a crash of some
- * other child process.  We apply send_abort_for_crash to decide which signal
- * to send.  Normally it's SIGQUIT -- and most other comments in this file are
- * written on the assumption that it is -- but developers might prefer to use
- * SIGABRT to collect per-child core dumps.
- */
-static void
-sigquit_child(PMChild *pmchild)
-{
-	signal_child(pmchild, (send_abort_for_crash ? SIGABRT : SIGQUIT));
-}
-
 /*
  * Send a signal to the targeted children.
  */
@@ -3565,6 +3597,8 @@ ExitPostmaster(int status)
 static void
 process_pm_pmsignal(void)
 {
+	bool		request_state_update = false;
+
 	pending_pm_pmsignal = false;
 
 	ereport(DEBUG2,
@@ -3676,9 +3710,67 @@ process_pm_pmsignal(void)
 		WalReceiverRequested = true;
 	}
 
+	if (CheckPostmasterSignal(PMSIGNAL_XLOG_IS_SHUTDOWN))
+	{
+		/* Checkpointer completed the shutdown checkpoint */
+		if (pmState == PM_WAIT_XLOG_SHUTDOWN)
+		{
+			/*
+			 * If we have an archiver subprocess, tell it to do a last archive
+			 * cycle and quit. Likewise, if we have walsender processes, tell
+			 * them to send any remaining WAL and quit.
+			 */
+			Assert(Shutdown > NoShutdown);
+
+			/* Waken archiver for the last time */
+			if (PgArchPMChild != NULL)
+				signal_child(PgArchPMChild, SIGUSR2);
+
+			/*
+			 * Waken walsenders for the last time. No regular backends should
+			 * be around anymore.
+			 */
+			SignalChildren(SIGUSR2, btmask(B_WAL_SENDER));
+
+			UpdatePMState(PM_WAIT_XLOG_ARCHIVAL);
+		}
+		else if (!FatalError && Shutdown != ImmediateShutdown)
+		{
+			/*
+			 * Checkpointer only ought to perform the shutdown checkpoint
+			 * during shutdown.  If somehow checkpointer did so in another
+			 * situation, we have no choice but to crash-restart.
+			 *
+			 * It's possible however that we get PMSIGNAL_XLOG_IS_SHUTDOWN
+			 * outside of PM_WAIT_XLOG_SHUTDOWN if an orderly shutdown was
+			 * "interrupted" by a crash or an immediate shutdown.
+			 */
+			ereport(LOG,
+					(errmsg("WAL was shut down unexpectedly")));
+
+			/*
+			 * Doesn't seem likely to help to take send_abort_for_crash into
+			 * account here.
+			 */
+			HandleFatalError(PMQUIT_FOR_CRASH, false);
+		}
+
+		/*
+		 * Need to run PostmasterStateMachine() to check if we already can go
+		 * to the next state.
+		 */
+		request_state_update = true;
+	}
+
 	/*
 	 * Try to advance postmaster's state machine, if a child requests it.
-	 *
+	 */
+	if (CheckPostmasterSignal(PMSIGNAL_ADVANCE_STATE_MACHINE))
+	{
+		request_state_update = true;
+	}
+
+	/*
 	 * Be careful about the order of this action relative to this function's
 	 * other actions.  Generally, this should be after other actions, in case
 	 * they have effects PostmasterStateMachine would need to know about.
@@ -3686,7 +3778,7 @@ process_pm_pmsignal(void)
 	 * cannot have any (immediate) effect on the state machine, but does
 	 * depend on what state we're in now.
 	 */
-	if (CheckPostmasterSignal(PMSIGNAL_ADVANCE_STATE_MACHINE))
+	if (request_state_update)
 	{
 		PostmasterStateMachine();
 	}
@@ -3997,6 +4089,7 @@ bgworker_should_start_now(BgWorkerStartTime start_time)
 	switch (pmState)
 	{
 		case PM_NO_CHILDREN:
+		case PM_WAIT_CHECKPOINTER:
 		case PM_WAIT_DEAD_END:
 		case PM_WAIT_XLOG_ARCHIVAL:
 		case PM_WAIT_XLOG_SHUTDOWN:
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 0b53cba807d..e199f071628 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -56,6 +56,7 @@ AUTOVACUUM_MAIN	"Waiting in main loop of autovacuum launcher process."
 BGWRITER_HIBERNATE	"Waiting in background writer process, hibernating."
 BGWRITER_MAIN	"Waiting in main loop of background writer process."
 CHECKPOINTER_MAIN	"Waiting in main loop of checkpointer process."
+CHECKPOINTER_SHUTDOWN	"Waiting for checkpointer process to be terminated."
 LOGICAL_APPLY_MAIN	"Waiting in main loop of logical replication apply process."
 LOGICAL_LAUNCHER_MAIN	"Waiting in main loop of logical replication launcher process."
 LOGICAL_PARALLEL_APPLY_MAIN	"Waiting in main loop of logical replication parallel apply process."
-- 
2.47.2
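
As a reading aid for the patch above, the orderly-shutdown ordering it introduces can be sketched as a toy state machine. This is only a standalone model under stated assumptions, not postmaster code: the state names mirror the PMState values in the patch, while the `Toy*` types, the event names, and both functions are hypothetical and exist only for illustration.

```c
#include <assert.h>

/* State names mirror the PMState values touched by the patch. */
typedef enum
{
	PM_WAIT_XLOG_SHUTDOWN,
	PM_WAIT_XLOG_ARCHIVAL,
	PM_WAIT_CHECKPOINTER,
	PM_WAIT_DEAD_END
} ToyPMState;

/* Hypothetical events standing in for pmsignals and child exits. */
typedef enum
{
	EV_XLOG_IS_SHUTDOWN,			/* shutdown checkpoint has been written */
	EV_WALSENDERS_ARCHIVER_GONE,	/* walsenders and archiver have exited */
	EV_CHECKPOINTER_EXITED			/* checkpointer exited with status 0 */
} ToyEvent;

/* Return the next state; unexpected events leave the state unchanged. */
static ToyPMState
toy_next_state(ToyPMState state, ToyEvent ev)
{
	if (state == PM_WAIT_XLOG_SHUTDOWN && ev == EV_XLOG_IS_SHUTDOWN)
		return PM_WAIT_XLOG_ARCHIVAL;
	if (state == PM_WAIT_XLOG_ARCHIVAL && ev == EV_WALSENDERS_ARCHIVER_GONE)
		return PM_WAIT_CHECKPOINTER;
	if (state == PM_WAIT_CHECKPOINTER && ev == EV_CHECKPOINTER_EXITED)
		return PM_WAIT_DEAD_END;
	return state;
}

/* Walk the orderly shutdown path and return the final state. */
static ToyPMState
toy_orderly_shutdown(void)
{
	ToyPMState	state = PM_WAIT_XLOG_SHUTDOWN;

	state = toy_next_state(state, EV_XLOG_IS_SHUTDOWN);
	assert(state == PM_WAIT_XLOG_ARCHIVAL);
	state = toy_next_state(state, EV_WALSENDERS_ARCHIVER_GONE);
	assert(state == PM_WAIT_CHECKPOINTER);
	state = toy_next_state(state, EV_CHECKPOINTER_EXITED);
	return state;
}
```

The point of the model is that the checkpointer outlives walsenders and archiver, so it is the last process that can still flush statistics. The crash paths handled by HandleFatalError() are deliberately left out.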

v9-0002-Add-WAL-I-O-stats-to-both-pg_stat_io-view-and-per.patchtext/x-patch; charset=US-ASCII; name=v9-0002-Add-WAL-I-O-stats-to-both-pg_stat_io-view-and-per.patchDownload

From 67c5e26dd6da61e212ce45f19a366c109ce456d5 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Thu, 23 Jan 2025 10:45:23 +0300
Subject: [PATCH v9 2/3] Add WAL I/O stats to both pg_stat_io view and per
 backend I/O statistics

This commit adds WAL I/O stats to both pg_stat_io view and per backend
I/O statistics (pg_stat_get_backend_io()).

This commit introduces three new I/O concepts:

- IOObject IOOBJECT_WAL is used for tracking all WAL I/Os.
- IOOBJECT_WAL / IOCONTEXT_NORMAL is used for tracking I/O operations
  done on already created WAL segments.
- IOOBJECT_WAL / IOCONTEXT_INIT is used for tracking I/O operations done
  while creating the WAL segments.

For now, the 027_stream_regress.pl test 'check contents of
pg_stat_statements on regression database' fails, because walsenders may
shut down after the checkpointer. There is an ongoing discussion to fix
that problem:
postgr.es/m/flat/kgng5nrvnlv335evmsuvpnh354rw7qyazl73kdysev2cr2v5zu%40m3cfzxicm5kp
---
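
To make the INIT/NORMAL split described in the commit message concrete, here is a minimal standalone sketch of how the counters would accumulate. It assumes the default 16MB segment size and 8kB XLOG_BLCKSZ; the `Toy*` types and functions are hypothetical stand-ins, not the pgstat API, and the scenario (one segment created, three pages written, one sync) is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-ins; the real enums live in src/include/pgstat.h. */
typedef enum { TOY_CTX_INIT, TOY_CTX_NORMAL, TOY_CTX_COUNT } ToyIOContext;
typedef enum { TOY_OP_READ, TOY_OP_WRITE, TOY_OP_FSYNC, TOY_OP_COUNT } ToyIOOp;

typedef struct
{
	uint64_t	counts[TOY_CTX_COUNT][TOY_OP_COUNT];
	uint64_t	bytes[TOY_CTX_COUNT][TOY_OP_COUNT];
} ToyWalIOStats;

/* Accumulate cnt operations and their byte total for one context/op pair. */
static void
toy_count_wal_io(ToyWalIOStats *stats, ToyIOContext ctx, ToyIOOp op,
				 uint32_t cnt, uint64_t bytes)
{
	assert(ctx < TOY_CTX_COUNT && op < TOY_OP_COUNT);
	stats->counts[ctx][op] += cnt;
	stats->bytes[ctx][op] += bytes;
}

/* Hypothetical workload: create one segment, then write 3 pages and sync. */
static ToyWalIOStats
toy_example(void)
{
	ToyWalIOStats stats = {0};
	const uint64_t segsz = 16 * 1024 * 1024;	/* assumed default */
	const uint64_t blcksz = 8192;				/* assumed default */

	/* Segment creation: one write of the whole segment, one fsync. */
	toy_count_wal_io(&stats, TOY_CTX_INIT, TOY_OP_WRITE, 1, segsz);
	toy_count_wal_io(&stats, TOY_CTX_INIT, TOY_OP_FSYNC, 1, 0);

	/* Regular WAL write: counted per page, as in the XLogWrite() hunk. */
	toy_count_wal_io(&stats, TOY_CTX_NORMAL, TOY_OP_WRITE, 3, 3 * blcksz);
	toy_count_wal_io(&stats, TOY_CTX_NORMAL, TOY_OP_FSYNC, 1, 0);

	return stats;
}
```

Fsyncs are recorded with zero bytes, matching the calls in the patch, so byte totals reflect only reads and writes.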
 src/include/pgstat.h                        |  4 +-
 src/backend/access/transam/xlog.c           | 57 ++++++--------
 src/backend/access/transam/xlogreader.c     |  8 ++
 src/backend/access/transam/xlogrecovery.c   | 14 ++++
 src/backend/utils/activity/pgstat_backend.c |  9 ++-
 src/backend/utils/activity/pgstat_io.c      | 84 ++++++++++++++++++---
 src/test/regress/expected/stats.out         | 53 +++++++++++++
 src/test/regress/sql/stats.sql              | 25 ++++++
 doc/src/sgml/monitoring.sgml                | 19 ++++-
 9 files changed, 221 insertions(+), 52 deletions(-)

diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index d0d45150977..0a456858c9c 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -274,14 +274,16 @@ typedef enum IOObject
 {
 	IOOBJECT_RELATION,
 	IOOBJECT_TEMP_RELATION,
+	IOOBJECT_WAL,
 } IOObject;
 
-#define IOOBJECT_NUM_TYPES (IOOBJECT_TEMP_RELATION + 1)
+#define IOOBJECT_NUM_TYPES (IOOBJECT_WAL + 1)
 
 typedef enum IOContext
 {
 	IOCONTEXT_BULKREAD,
 	IOCONTEXT_BULKWRITE,
+	IOCONTEXT_INIT,
 	IOCONTEXT_NORMAL,
 	IOCONTEXT_VACUUM,
 } IOContext;
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index bf3dbda901d..ee26e9739bd 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2425,38 +2425,23 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
 			Size		nbytes;
 			Size		nleft;
 			ssize_t		written;
-			instr_time	start;
+			instr_time	io_start;
 
 			/* OK to write the page(s) */
 			from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
 			nbytes = npages * (Size) XLOG_BLCKSZ;
 			nleft = nbytes;
+
+			io_start = pgstat_prepare_io_time(track_wal_io_timing);
+
 			do
 			{
 				errno = 0;
 
-				/* Measure I/O timing to write WAL data */
-				if (track_wal_io_timing)
-					INSTR_TIME_SET_CURRENT(start);
-				else
-					INSTR_TIME_SET_ZERO(start);
-
 				pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
 				written = pg_pwrite(openLogFile, from, nleft, startoffset);
 				pgstat_report_wait_end();
 
-				/*
-				 * Increment the I/O timing and the number of times WAL data
-				 * were written out to disk.
-				 */
-				if (track_wal_io_timing)
-				{
-					instr_time	end;
-
-					INSTR_TIME_SET_CURRENT(end);
-					INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_write_time, end, start);
-				}
-
 				PendingWalStats.wal_write++;
 
 				if (written <= 0)
@@ -2481,6 +2466,9 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
 				startoffset += written;
 			} while (nleft > 0);
 
+			pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL,
+									IOOP_WRITE, io_start, npages, npages * XLOG_BLCKSZ);
+
 			npages = 0;
 
 			/*
@@ -3216,6 +3204,7 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	int			fd;
 	int			save_errno;
 	int			open_flags = O_RDWR | O_CREAT | O_EXCL | PG_BINARY;
+	instr_time	io_start;
 
 	Assert(logtli != 0);
 
@@ -3259,6 +3248,8 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 				(errcode_for_file_access(),
 				 errmsg("could not create file \"%s\": %m", tmppath)));
 
+	io_start = pgstat_prepare_io_time(track_wal_io_timing);
+
 	pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_WRITE);
 	save_errno = 0;
 	if (wal_init_zero)
@@ -3294,6 +3285,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	}
 	pgstat_report_wait_end();
 
+	pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_INIT, IOOP_WRITE,
+							io_start, 1, wal_segment_size);
+
 	if (save_errno)
 	{
 		/*
@@ -3310,6 +3304,8 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 				 errmsg("could not write to file \"%s\": %m", tmppath)));
 	}
 
+	io_start = pgstat_prepare_io_time(track_wal_io_timing);
+
 	pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_SYNC);
 	if (pg_fsync(fd) != 0)
 	{
@@ -3322,6 +3318,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	}
 	pgstat_report_wait_end();
 
+	pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_INIT,
+							IOOP_FSYNC, io_start, 1, 0);
+
 	if (close(fd) != 0)
 		ereport(ERROR,
 				(errcode_for_file_access(),
@@ -8683,7 +8682,7 @@ void
 issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 {
 	char	   *msg = NULL;
-	instr_time	start;
+	instr_time	io_start;
 
 	Assert(tli != 0);
 
@@ -8696,11 +8695,7 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 		wal_sync_method == WAL_SYNC_METHOD_OPEN_DSYNC)
 		return;
 
-	/* Measure I/O timing to sync the WAL file */
-	if (track_wal_io_timing)
-		INSTR_TIME_SET_CURRENT(start);
-	else
-		INSTR_TIME_SET_ZERO(start);
+	io_start = pgstat_prepare_io_time(track_wal_io_timing);
 
 	pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC);
 	switch (wal_sync_method)
@@ -8746,16 +8741,8 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 
 	pgstat_report_wait_end();
 
-	/*
-	 * Increment the I/O timing and the number of times WAL files were synced.
-	 */
-	if (track_wal_io_timing)
-	{
-		instr_time	end;
-
-		INSTR_TIME_SET_CURRENT(end);
-		INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_sync_time, end, start);
-	}
+	pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_FSYNC,
+							io_start, 1, 0);
 
 	PendingWalStats.wal_sync++;
 }
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 3596af06172..0b6e3db353e 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1507,6 +1507,9 @@ WALRead(XLogReaderState *state,
 	char	   *p;
 	XLogRecPtr	recptr;
 	Size		nbytes;
+#ifndef FRONTEND
+	instr_time	io_start;
+#endif
 
 	p = buf;
 	recptr = startptr;
@@ -1552,6 +1555,8 @@ WALRead(XLogReaderState *state,
 			segbytes = nbytes;
 
 #ifndef FRONTEND
+		io_start = pgstat_prepare_io_time(track_wal_io_timing);
+
 		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
 #endif
 
@@ -1561,6 +1566,9 @@ WALRead(XLogReaderState *state,
 
 #ifndef FRONTEND
 		pgstat_report_wait_end();
+
+		pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_READ,
+								io_start, 1, readbytes);
 #endif
 
 		if (readbytes <= 0)
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index cf2b007806f..91fa72fadcc 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -60,6 +60,7 @@
 #include "utils/datetime.h"
 #include "utils/fmgrprotos.h"
 #include "utils/guc_hooks.h"
+#include "utils/pgstat_internal.h"
 #include "utils/pg_lsn.h"
 #include "utils/ps_status.h"
 #include "utils/pg_rusage.h"
@@ -1836,6 +1837,9 @@ PerformWalRecovery(void)
 		 * end of main redo apply loop
 		 */
 
+		/* Report pending statistics to the cumulative stats system */
+		pgstat_flush_io(false);
+
 		if (reachedRecoveryTarget)
 		{
 			if (!reachedConsistency)
@@ -3306,6 +3310,7 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 	uint32		targetPageOff;
 	XLogSegNo	targetSegNo PG_USED_FOR_ASSERTS_ONLY;
 	int			r;
+	instr_time	io_start;
 
 	XLByteToSeg(targetPagePtr, targetSegNo, wal_segment_size);
 	targetPageOff = XLogSegmentOffset(targetPagePtr, wal_segment_size);
@@ -3398,6 +3403,8 @@ retry:
 	/* Read the requested page */
 	readOff = targetPageOff;
 
+	io_start = pgstat_prepare_io_time(track_wal_io_timing);
+
 	pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
 	r = pg_pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);
 	if (r != XLOG_BLCKSZ)
@@ -3406,6 +3413,10 @@ retry:
 		int			save_errno = errno;
 
 		pgstat_report_wait_end();
+
+		pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_READ,
+								io_start, 1, r);
+
 		XLogFileName(fname, curFileTLI, readSegNo, wal_segment_size);
 		if (r < 0)
 		{
@@ -3426,6 +3437,9 @@ retry:
 	}
 	pgstat_report_wait_end();
 
+	pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_READ,
+							io_start, 1, r);
+
 	Assert(targetSegNo == readSegNo);
 	Assert(targetPageOff == readOff);
 	Assert(reqLen <= readLen);
diff --git a/src/backend/utils/activity/pgstat_backend.c b/src/backend/utils/activity/pgstat_backend.c
index bcf9e4b1487..338da73a9a9 100644
--- a/src/backend/utils/activity/pgstat_backend.c
+++ b/src/backend/utils/activity/pgstat_backend.c
@@ -24,6 +24,7 @@
 
 #include "postgres.h"
 
+#include "access/xlog.h"
 #include "storage/bufmgr.h"
 #include "utils/memutils.h"
 #include "utils/pgstat_internal.h"
@@ -43,7 +44,7 @@ void
 pgstat_count_backend_io_op_time(IOObject io_object, IOContext io_context,
 								IOOp io_op, instr_time io_time)
 {
-	Assert(track_io_timing);
+	Assert(track_io_timing || track_wal_io_timing);
 
 	if (!pgstat_tracks_backend_bktype(MyBackendType))
 		return;
@@ -236,9 +237,6 @@ pgstat_tracks_backend_bktype(BackendType bktype)
 		case B_DEAD_END_BACKEND:
 		case B_ARCHIVER:
 		case B_LOGGER:
-		case B_WAL_RECEIVER:
-		case B_WAL_WRITER:
-		case B_WAL_SUMMARIZER:
 		case B_BG_WRITER:
 		case B_CHECKPOINTER:
 		case B_STARTUP:
@@ -249,7 +247,10 @@ pgstat_tracks_backend_bktype(BackendType bktype)
 		case B_BG_WORKER:
 		case B_STANDALONE_BACKEND:
 		case B_SLOTSYNC_WORKER:
+		case B_WAL_RECEIVER:
 		case B_WAL_SENDER:
+		case B_WAL_SUMMARIZER:
+		case B_WAL_WRITER:
 			return true;
 	}
 
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 6ff5d9e96a1..17924759e90 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -16,6 +16,7 @@
 
 #include "postgres.h"
 
+#include "access/xlog.h"
 #include "executor/instrument.h"
 #include "storage/bufmgr.h"
 #include "utils/pgstat_internal.h"
@@ -105,6 +106,20 @@ pgstat_prepare_io_time(bool track_io_guc)
 	return io_start;
 }
 
+/*
+ * Decide if IO timings need to be tracked.  Timings associated to
+ * IOOBJECT_WAL objects are tracked if track_wal_io_timing is enabled,
+ * else rely on track_io_timing.
+ */
+static bool
+pgstat_should_track_io_time(IOObject io_object)
+{
+	if (io_object == IOOBJECT_WAL)
+		return track_wal_io_timing;
+
+	return track_io_timing;
+}
+
 /*
  * Like pgstat_count_io_op() except it also accumulates time.
  */
@@ -112,14 +127,24 @@ void
 pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
 						instr_time start_time, uint32 cnt, uint64 bytes)
 {
-	if (track_io_timing)
+	/*
+	 * Accumulate timing data.  pgstat_count_buffer is for pgstat_database. As
+	 * pg_stat_database only counts blk_read_time and blk_write_time, it is
+	 * set for IOOP_READ, IOOP_WRITE and IOOP_EXTEND.
+	 *
+	 * pgBufferUsage is for EXPLAIN.  pgBufferUsage has write and read stats
+	 * for shared, local and temporary blocks.  Temporary blocks are ignored
+	 * here.
+	 */
+	if (pgstat_should_track_io_time(io_object))
 	{
 		instr_time	io_time;
 
 		INSTR_TIME_SET_CURRENT(io_time);
 		INSTR_TIME_SUBTRACT(io_time, start_time);
 
-		if (io_op == IOOP_WRITE || io_op == IOOP_EXTEND)
+		if (io_object != IOOBJECT_WAL &&
+			(io_op == IOOP_WRITE || io_op == IOOP_EXTEND))
 		{
 			pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
 			if (io_object == IOOBJECT_RELATION)
@@ -127,7 +152,7 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
 			else if (io_object == IOOBJECT_TEMP_RELATION)
 				INSTR_TIME_ADD(pgBufferUsage.local_blk_write_time, io_time);
 		}
-		else if (io_op == IOOP_READ)
+		else if (io_object != IOOBJECT_WAL && io_op == IOOP_READ)
 		{
 			pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
 			if (io_object == IOOBJECT_RELATION)
@@ -241,6 +266,8 @@ pgstat_get_io_context_name(IOContext io_context)
 			return "bulkread";
 		case IOCONTEXT_BULKWRITE:
 			return "bulkwrite";
+		case IOCONTEXT_INIT:
+			return "init";
 		case IOCONTEXT_NORMAL:
 			return "normal";
 		case IOCONTEXT_VACUUM:
@@ -260,6 +287,8 @@ pgstat_get_io_object_name(IOObject io_object)
 			return "relation";
 		case IOOBJECT_TEMP_RELATION:
 			return "temp relation";
+		case IOOBJECT_WAL:
+			return "wal";
 	}
 
 	elog(ERROR, "unrecognized IOObject value: %d", io_object);
@@ -332,8 +361,6 @@ pgstat_io_snapshot_cb(void)
 * - Syslogger because it is not connected to shared memory
 * - Archiver because most relevant archiving IO is delegated to a
 *   specialized command or module
-* - WAL Receiver, WAL Writer, and WAL Summarizer IO are not tracked in
-*   pg_stat_io for now
 *
 * Function returns true if BackendType participates in the cumulative stats
 * subsystem for IO and false if it does not.
@@ -354,9 +381,6 @@ pgstat_tracks_io_bktype(BackendType bktype)
 		case B_DEAD_END_BACKEND:
 		case B_ARCHIVER:
 		case B_LOGGER:
-		case B_WAL_RECEIVER:
-		case B_WAL_WRITER:
-		case B_WAL_SUMMARIZER:
 			return false;
 
 		case B_AUTOVAC_LAUNCHER:
@@ -368,7 +392,10 @@ pgstat_tracks_io_bktype(BackendType bktype)
 		case B_SLOTSYNC_WORKER:
 		case B_STANDALONE_BACKEND:
 		case B_STARTUP:
+		case B_WAL_RECEIVER:
 		case B_WAL_SENDER:
+		case B_WAL_SUMMARIZER:
+		case B_WAL_WRITER:
 			return true;
 	}
 
@@ -394,6 +421,15 @@ pgstat_tracks_io_object(BackendType bktype, IOObject io_object,
 	if (!pgstat_tracks_io_bktype(bktype))
 		return false;
 
+	/*
+	 * Currently, IO on IOOBJECT_WAL objects can only occur in the
+	 * IOCONTEXT_NORMAL and IOCONTEXT_INIT IOContexts.
+	 */
+	if (io_object == IOOBJECT_WAL &&
+		(io_context != IOCONTEXT_NORMAL &&
+		 io_context != IOCONTEXT_INIT))
+		return false;
+
 	/*
 	 * Currently, IO on temporary relations can only occur in the
 	 * IOCONTEXT_NORMAL IOContext.
@@ -460,14 +496,28 @@ pgstat_tracks_io_op(BackendType bktype, IOObject io_object,
 	/*
 	 * Some BackendTypes will not do certain IOOps.
 	 */
-	if ((bktype == B_BG_WRITER || bktype == B_CHECKPOINTER) &&
+	if ((bktype == B_BG_WRITER) &&
 		(io_op == IOOP_READ || io_op == IOOP_EVICT || io_op == IOOP_HIT))
 		return false;
 
+	if ((bktype == B_CHECKPOINTER) &&
+		((io_object != IOOBJECT_WAL && io_op == IOOP_READ) ||
+		 (io_op == IOOP_EVICT || io_op == IOOP_HIT)))
+		return false;
+
 	if ((bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER ||
 		 bktype == B_CHECKPOINTER) && io_op == IOOP_EXTEND)
 		return false;
 
+	/*
+	 * Some BackendTypes don't do reads with IOOBJECT_WAL.
+	 */
+	if (io_object == IOOBJECT_WAL && io_op == IOOP_READ &&
+		(bktype == B_WAL_RECEIVER || bktype == B_BG_WRITER ||
+		 bktype == B_AUTOVAC_LAUNCHER || bktype == B_AUTOVAC_WORKER ||
+		 bktype == B_WAL_WRITER))
+		return false;
+
 	/*
 	 * Temporary tables are not logged and thus do not require fsync'ing.
 	 * Writeback is not requested for temporary tables.
@@ -492,6 +542,22 @@ pgstat_tracks_io_op(BackendType bktype, IOObject io_object,
 	if (!strategy_io_context && io_op == IOOP_REUSE)
 		return false;
 
+	/*
+	 * An IOCONTEXT_INIT done for an IOOBJECT_WAL io_object does writes and
+	 * syncs.
+	 */
+	if (io_object == IOOBJECT_WAL && io_context == IOCONTEXT_INIT &&
+		!(io_op == IOOP_WRITE || io_op == IOOP_FSYNC))
+		return false;
+
+	/*
+	 * An IOCONTEXT_NORMAL done for an IOOBJECT_WAL io_object does writes,
+	 * reads and syncs.
+	 */
+	if (io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL &&
+		!(io_op == IOOP_WRITE || io_op == IOOP_READ || io_op == IOOP_FSYNC))
+		return false;
+
 	/*
 	 * IOOP_FSYNC IOOps done by a backend using a BufferAccessStrategy are
 	 * counted in the IOCONTEXT_NORMAL IOContext. See comment in
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index a0317b7208e..7dfcaf03b78 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -862,6 +862,33 @@ WHERE pg_stat_get_backend_pid(beid) = pg_backend_pid();
  t
 (1 row)
 
+-- Test pg_stat_io for WAL in an init context, that should do writes
+-- and syncs.
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'init' and object = 'wal' \gset io_sum_wal_init_
+SELECT :io_sum_wal_init_writes > 0;
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT current_setting('fsync') = 'off'
+  OR :io_sum_wal_init_fsyncs > 0;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- Test pg_stat_io for WAL in a normal context, that should do reads as well.
+SELECT SUM(reads) > 0
+  FROM pg_stat_io
+  WHERE context = 'normal' and object = 'wal';
+ ?column? 
+----------
+ t
+(1 row)
+
 -----
 -- Test that resetting stats works for reset timestamp
 -----
@@ -1256,6 +1283,7 @@ SELECT pg_stat_get_subscription_stats(NULL);
 -- - extends of relations using shared buffers
 -- - fsyncs done to ensure the durability of data dirtying shared buffers
 -- - shared buffer hits
+-- - WAL writes and fsyncs in IOContext IOCONTEXT_NORMAL
 -- There is no test for blocks evicted from shared buffers, because we cannot
 -- be sure of the state of shared buffers at the point the test is run.
 -- Create a regular table and insert some data to generate IOCONTEXT_NORMAL
@@ -1273,6 +1301,9 @@ SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
 SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
   FROM pg_stat_get_backend_io(pg_backend_pid())
   WHERE object = 'relation' \gset my_io_sum_shared_before_
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'normal' and object = 'wal' \gset io_sum_wal_normal_before_
 CREATE TABLE test_io_shared(a int);
 INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
 SELECT pg_stat_force_next_flush();
@@ -1335,6 +1366,28 @@ SELECT current_setting('fsync') = 'off'
  t
 (1 row)
 
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'normal' and object = 'wal' \gset io_sum_wal_normal_after_
+SELECT current_setting('synchronous_commit') = 'on';
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT :io_sum_wal_normal_after_writes > :io_sum_wal_normal_before_writes;
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT current_setting('fsync') = 'off'
+  OR :io_sum_wal_normal_after_fsyncs > :io_sum_wal_normal_before_fsyncs;
+ ?column? 
+----------
+ t
+(1 row)
+
 -- Change the tablespace so that the table is rewritten directly, then SELECT
 -- from it to cause it to be read back into shared buffers.
 SELECT sum(reads) AS io_sum_shared_before_reads
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index 399c72bbcf7..9e368637199 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -442,6 +442,20 @@ SELECT (current_schemas(true))[1] = ('pg_temp_' || beid::text) AS match
 FROM pg_stat_get_backend_idset() beid
 WHERE pg_stat_get_backend_pid(beid) = pg_backend_pid();
 
+-- Test pg_stat_io for WAL in an init context, that should do writes
+-- and syncs.
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'init' and object = 'wal' \gset io_sum_wal_init_
+SELECT :io_sum_wal_init_writes > 0;
+SELECT current_setting('fsync') = 'off'
+  OR :io_sum_wal_init_fsyncs > 0;
+
+-- Test pg_stat_io for WAL in a normal context, that should do reads as well.
+SELECT SUM(reads) > 0
+  FROM pg_stat_io
+  WHERE context = 'normal' and object = 'wal';
+
 -----
 -- Test that resetting stats works for reset timestamp
 -----
@@ -602,6 +616,7 @@ SELECT pg_stat_get_subscription_stats(NULL);
 -- - extends of relations using shared buffers
 -- - fsyncs done to ensure the durability of data dirtying shared buffers
 -- - shared buffer hits
+-- - WAL writes and fsyncs in IOContext IOCONTEXT_NORMAL
 
 -- There is no test for blocks evicted from shared buffers, because we cannot
 -- be sure of the state of shared buffers at the point the test is run.
@@ -621,6 +636,9 @@ SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
 SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
   FROM pg_stat_get_backend_io(pg_backend_pid())
   WHERE object = 'relation' \gset my_io_sum_shared_before_
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'normal' and object = 'wal' \gset io_sum_wal_normal_before_
 CREATE TABLE test_io_shared(a int);
 INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
 SELECT pg_stat_force_next_flush();
@@ -649,6 +667,13 @@ SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
 SELECT :my_io_sum_shared_after_writes >= :my_io_sum_shared_before_writes;
 SELECT current_setting('fsync') = 'off'
   OR :my_io_sum_shared_after_fsyncs >= :my_io_sum_shared_before_fsyncs;
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'normal' and object = 'wal' \gset io_sum_wal_normal_after_
+SELECT current_setting('synchronous_commit') = 'on';
+SELECT :io_sum_wal_normal_after_writes > :io_sum_wal_normal_before_writes;
+SELECT current_setting('fsync') = 'off'
+  OR :io_sum_wal_normal_after_fsyncs > :io_sum_wal_normal_before_fsyncs;
 
 -- Change the tablespace so that the table is rewritten directly, then SELECT
 -- from it to cause it to be read back into shared buffers.
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index e5888fae2b5..c6f2c384512 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2581,9 +2581,10 @@ description | Waiting for a newly initialized WAL file to reach durable storage
   </para>
 
   <para>
-   Currently, I/O on relations (e.g. tables, indexes) is tracked. However,
-   relation I/O which bypasses shared buffers (e.g. when moving a table from one
-   tablespace to another) is currently not tracked.
+   Currently, I/O on relations (e.g. tables, indexes) and WAL activity are
+   tracked. However, relation I/O which bypasses shared buffers
+   (e.g. when moving a table from one tablespace to another) is currently
+   not tracked.
   </para>
 
   <table id="pg-stat-io-view" xreflabel="pg_stat_io">
@@ -2636,6 +2637,11 @@ description | Waiting for a newly initialized WAL file to reach durable storage
           <literal>temp relation</literal>: Temporary relations.
          </para>
         </listitem>
+        <listitem>
+         <para>
+          <literal>wal</literal>: Write Ahead Logs.
+         </para>
+        </listitem>
        </itemizedlist>
        </para>
       </entry>
@@ -2660,6 +2666,13 @@ description | Waiting for a newly initialized WAL file to reach durable storage
           <literal>normal</literal>.
          </para>
         </listitem>
+        <listitem>
+         <para>
+          <literal>init</literal>: I/O operations performed while creating the
+          WAL segments are tracked in <varname>context</varname>
+          <literal>init</literal>.
+         </para>
+        </listitem>
         <listitem>
          <para>
           <literal>vacuum</literal>: I/O operations performed outside of shared
-- 
2.47.2

v9-0003-Fetch-timing-columns-from-pg_stat_io-in-the-pg_st.patch (text/x-patch; charset=US-ASCII)

From af0946112394a478e9424d1870e2200abdb77f7d Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Thu, 23 Jan 2025 13:52:31 +0300
Subject: [PATCH v9 3/3] Fetch timing columns from pg_stat_io in the
 pg_stat_wal view

The prior commit started tracking WAL I/O timing stats in the
pg_stat_io view. To avoid double accounting and to report the same
timing values in both pg_stat_io and pg_stat_wal, fetch the timing
columns from pg_stat_io in the pg_stat_wal view.
---
 src/include/catalog/pg_proc.dat         |  6 +++---
 src/include/pgstat.h                    |  4 ----
 src/backend/catalog/system_views.sql    | 14 ++++++++++---
 src/backend/utils/activity/pgstat_wal.c |  2 --
 src/backend/utils/adt/pgstatfuncs.c     | 14 +++----------
 src/test/regress/expected/rules.out     | 27 ++++++++++++++++---------
 doc/src/sgml/monitoring.sgml            | 10 ++++++---
 7 files changed, 41 insertions(+), 36 deletions(-)

diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 18560755d26..79f31c48d1d 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5925,9 +5925,9 @@
 { oid => '1136', descr => 'statistics: information about WAL activity',
   proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
   proparallel => 'r', prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int8,int8,numeric,int8,int8,int8,float8,float8,timestamptz}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o}',
-  proargnames => '{wal_records,wal_fpi,wal_bytes,wal_buffers_full,wal_write,wal_sync,wal_write_time,wal_sync_time,stats_reset}',
+  proallargtypes => '{int8,int8,numeric,int8,int8,int8,timestamptz}',
+  proargmodes => '{o,o,o,o,o,o,o}',
+  proargnames => '{wal_records,wal_fpi,wal_bytes,wal_buffers_full,wal_write,wal_sync,stats_reset}',
   prosrc => 'pg_stat_get_wal' },
 { oid => '6248', descr => 'statistics: information about WAL prefetching',
   proname => 'pg_stat_get_recovery_prefetch', prorows => '1', proretset => 't',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 0a456858c9c..faac508015e 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -477,8 +477,6 @@ typedef struct PgStat_WalStats
 	PgStat_Counter wal_buffers_full;
 	PgStat_Counter wal_write;
 	PgStat_Counter wal_sync;
-	PgStat_Counter wal_write_time;
-	PgStat_Counter wal_sync_time;
 	TimestampTz stat_reset_timestamp;
 } PgStat_WalStats;
 
@@ -493,8 +491,6 @@ typedef struct PgStat_PendingWalStats
 	PgStat_Counter wal_buffers_full;
 	PgStat_Counter wal_write;
 	PgStat_Counter wal_sync;
-	instr_time	wal_write_time;
-	instr_time	wal_sync_time;
 } PgStat_PendingWalStats;
 
 
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 46868bf7e89..9448e1ec355 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1180,6 +1180,13 @@ SELECT
 FROM pg_stat_get_io() b;
 
 CREATE VIEW pg_stat_wal AS
+    WITH pgsio_sum_write_fsync_time AS (
+        SELECT
+            sum(write_time) as sum_write_time,
+            sum(fsync_time) as sum_fsync_time
+        FROM pg_stat_io
+        WHERE context = 'normal' and object = 'wal'
+    )
     SELECT
         w.wal_records,
         w.wal_fpi,
@@ -1187,10 +1194,11 @@ CREATE VIEW pg_stat_wal AS
         w.wal_buffers_full,
         w.wal_write,
         w.wal_sync,
-        w.wal_write_time,
-        w.wal_sync_time,
+        p.sum_write_time as write_time,
+        p.sum_fsync_time as fsync_time,
         w.stats_reset
-    FROM pg_stat_get_wal() w;
+    FROM pg_stat_get_wal() w
+        CROSS JOIN pgsio_sum_write_fsync_time p;
 
 CREATE VIEW pg_stat_progress_analyze AS
     SELECT
diff --git a/src/backend/utils/activity/pgstat_wal.c b/src/backend/utils/activity/pgstat_wal.c
index 18fa6b2936a..8ee650cdc30 100644
--- a/src/backend/utils/activity/pgstat_wal.c
+++ b/src/backend/utils/activity/pgstat_wal.c
@@ -126,8 +126,6 @@ pgstat_wal_flush_cb(bool nowait)
 	WALSTAT_ACC(wal_buffers_full, PendingWalStats);
 	WALSTAT_ACC(wal_write, PendingWalStats);
 	WALSTAT_ACC(wal_sync, PendingWalStats);
-	WALSTAT_ACC_INSTR_TIME(wal_write_time);
-	WALSTAT_ACC_INSTR_TIME(wal_sync_time);
 #undef WALSTAT_ACC_INSTR_TIME
 #undef WALSTAT_ACC
 
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 0f5e0a9778d..5de31979bc0 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1609,7 +1609,7 @@ pg_stat_get_backend_io(PG_FUNCTION_ARGS)
 Datum
 pg_stat_get_wal(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_COLS	9
+#define PG_STAT_GET_WAL_COLS	7
 	TupleDesc	tupdesc;
 	Datum		values[PG_STAT_GET_WAL_COLS] = {0};
 	bool		nulls[PG_STAT_GET_WAL_COLS] = {0};
@@ -1630,11 +1630,7 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
 					   INT8OID, -1, 0);
 	TupleDescInitEntry(tupdesc, (AttrNumber) 6, "wal_sync",
 					   INT8OID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 7, "wal_write_time",
-					   FLOAT8OID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 8, "wal_sync_time",
-					   FLOAT8OID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 9, "stats_reset",
+	TupleDescInitEntry(tupdesc, (AttrNumber) 7, "stats_reset",
 					   TIMESTAMPTZOID, -1, 0);
 
 	BlessTupleDesc(tupdesc);
@@ -1657,11 +1653,7 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
 	values[4] = Int64GetDatum(wal_stats->wal_write);
 	values[5] = Int64GetDatum(wal_stats->wal_sync);
 
-	/* Convert counters from microsec to millisec for display */
-	values[6] = Float8GetDatum(((double) wal_stats->wal_write_time) / 1000.0);
-	values[7] = Float8GetDatum(((double) wal_stats->wal_sync_time) / 1000.0);
-
-	values[8] = TimestampTzGetDatum(wal_stats->stat_reset_timestamp);
+	values[6] = TimestampTzGetDatum(wal_stats->stat_reset_timestamp);
 
 	/* Returns the record as Datum */
 	PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 856a8349c50..a15e0096d9c 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2241,16 +2241,23 @@ pg_stat_user_tables| SELECT relid,
     autoanalyze_count
    FROM pg_stat_all_tables
   WHERE ((schemaname <> ALL (ARRAY['pg_catalog'::name, 'information_schema'::name])) AND (schemaname !~ '^pg_toast'::text));
-pg_stat_wal| SELECT wal_records,
-    wal_fpi,
-    wal_bytes,
-    wal_buffers_full,
-    wal_write,
-    wal_sync,
-    wal_write_time,
-    wal_sync_time,
-    stats_reset
-   FROM pg_stat_get_wal() w(wal_records, wal_fpi, wal_bytes, wal_buffers_full, wal_write, wal_sync, wal_write_time, wal_sync_time, stats_reset);
+pg_stat_wal| WITH pgsio_sum_write_fsync_time AS (
+         SELECT sum(pg_stat_io.write_time) AS sum_write_time,
+            sum(pg_stat_io.fsync_time) AS sum_fsync_time
+           FROM pg_stat_io
+          WHERE ((pg_stat_io.context = 'normal'::text) AND (pg_stat_io.object = 'wal'::text))
+        )
+ SELECT w.wal_records,
+    w.wal_fpi,
+    w.wal_bytes,
+    w.wal_buffers_full,
+    w.wal_write,
+    w.wal_sync,
+    p.sum_write_time AS write_time,
+    p.sum_fsync_time AS fsync_time,
+    w.stats_reset
+   FROM (pg_stat_get_wal() w(wal_records, wal_fpi, wal_bytes, wal_buffers_full, wal_write, wal_sync, stats_reset)
+     CROSS JOIN pgsio_sum_write_fsync_time p);
 pg_stat_wal_receiver| SELECT pid,
     status,
     receive_start_lsn,
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index c6f2c384512..516c4b96112 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -4965,7 +4965,9 @@ description | Waiting for a newly initialized WAL file to reach durable storage
         <listitem>
          <para>
           <literal>io</literal>: Reset all the counters shown in the
-          <structname>pg_stat_io</structname> view.
+          <structname>pg_stat_io</structname> view. This resets the timing
+          counter stored in the <structname>pg_stat_wal</structname> view
+          too.
          </para>
         </listitem>
         <listitem>
@@ -4982,8 +4984,10 @@ description | Waiting for a newly initialized WAL file to reach durable storage
         </listitem>
         <listitem>
          <para>
-          <literal>wal</literal>: Reset all the counters shown in the
-          <structname>pg_stat_wal</structname> view.
+          <literal>wal</literal>: Reset all the counters except timings shown
+          in the <structname>pg_stat_wal</structname> view. The timing
+          counters can be reset by calling pg_stat_reset_shared set to
+          <literal>io</literal>.
          </para>
         </listitem>
         <listitem>
-- 
2.47.2

#58

Bertrand Drouvot

bertranddrouvot.pg@gmail.com

12 months ago

In reply to: Nazir Bilal Yavuz (#57)

Re: Show WAL write and fsync stats in pg_stat_io

Hi,

On Fri, Jan 24, 2025 at 11:31:02AM +0300, Nazir Bilal Yavuz wrote:

v9 is rebased and attached as three patches. The first one is a
squashed patch for the current version of Andres' proposed fix to pass
the CI, the second one is for adding WAL stats to pg_stat_io and the
third one is for fetching timing columns from pg_stat_io in the
pg_stat_wal view.

Thanks for the patch!

I have not looked at the code yet, but I ran a few tests.
I can see a difference between pg_stat_wal and pg_stat_io, for example:

"
postgres=# select pg_stat_reset_shared();
pg_stat_reset_shared
----------------------

(1 row)

postgres=# insert into bdt select a from generate_series(1,200000) a ;
INSERT 0 200000

postgres=# select wal_bytes,stats_reset from pg_stat_wal;
wal_bytes | stats_reset
-----------+-------------------------------
11800088 | 2025-01-24 14:17:28.507994+00
(1 row)

postgres=# select sum(write_bytes),stats_reset from pg_stat_io where object = 'wal' group by stats_reset;
sum | stats_reset
----------+-------------------------------
12853248 | 2025-01-24 14:17:28.507988+00
(1 row)

Is that expected?

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#59

Nazir Bilal Yavuz

byavuz81@gmail.com

12 months ago

In reply to: Bertrand Drouvot (#58)

Re: Show WAL write and fsync stats in pg_stat_io

Hi,

Thanks for looking into this!

On Fri, 24 Jan 2025 at 17:20, Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

I have not looked at the code yet, but I ran a few tests.
I can see a difference between pg_stat_wal and pg_stat_io, for example:

"
postgres=# select pg_stat_reset_shared();
pg_stat_reset_shared
----------------------

(1 row)

postgres=# insert into bdt select a from generate_series(1,200000) a ;
INSERT 0 200000

postgres=# select wal_bytes,stats_reset from pg_stat_wal;
wal_bytes | stats_reset
-----------+-------------------------------
11800088 | 2025-01-24 14:17:28.507994+00
(1 row)

postgres=# select sum(write_bytes),stats_reset from pg_stat_io where object = 'wal' group by stats_reset;
sum | stats_reset
----------+-------------------------------
12853248 | 2025-01-24 14:17:28.507988+00
(1 row)

Is that expected?

I am not sure, I think they need to be the same. I could not
understand the cause of the difference at first glance. I will look
into this and will come back to you.

Quick note: WAL stats in the pg_stat_io view include WAL initialization
stats under object = 'wal' and context = 'init', so your query may count
these initialization stats too. The correct query is:

'select sum(write_bytes),stats_reset from pg_stat_io where object =
'wal' and context = 'normal' group by stats_reset;'.

That said, this does not solve the problem; there is still a
difference even when WAL initialization stats are omitted from
pg_stat_io.
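As a rough sanity check on the numbers above (not a diagnosis): the patch
accounts WAL writes in whole pages (npages * XLOG_BLCKSZ in XLogWrite()),
while wal_bytes reports the amount of WAL generated. The standalone C sketch
below assumes the default XLOG_BLCKSZ of 8192; the helper names are
hypothetical and not part of PostgreSQL. It checks whether a byte count is a
whole number of WAL pages and what page rounding alone would add.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: default WAL block size; a build may be configured differently. */
#define ASSUMED_XLOG_BLCKSZ 8192

/* Hypothetical helper: true if 'bytes' is an exact multiple of the page size. */
static bool
is_whole_wal_pages(uint64_t bytes, uint64_t blcksz)
{
	assert(blcksz > 0);
	return bytes % blcksz == 0;
}

/* Hypothetical helper: round 'bytes' up to the next whole page boundary. */
static uint64_t
round_up_to_wal_pages(uint64_t bytes, uint64_t blcksz)
{
	assert(blcksz > 0);
	return ((bytes + blcksz - 1) / blcksz) * blcksz;
}
```

With the figures from the test, 12853248 is exactly 1569 pages of 8192
bytes, while 11800088 is not page-aligned. Rounding 11800088 up to whole
pages gives 11804672, so page rounding alone would not account for the
whole gap.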

--
Regards,
Nazir Bilal Yavuz
Microsoft

#60

Michael Paquier

michael@paquier.xyz

12 months ago

In reply to: Nazir Bilal Yavuz (#59)

2 attachment(s)

Re: Show WAL write and fsync stats in pg_stat_io

On Fri, Jan 24, 2025 at 06:29:46PM +0300, Nazir Bilal Yavuz wrote:

That said, this does not solve the problem; there is still a
difference even when WAL initialization stats are omitted from
pg_stat_io.

Reattaching the remaining patches to make the cfbot happy, as 0001 has
been applied as 87a6690cc695.

- if (track_wal_io_timing)
- {
- instr_time end;
-
- INSTR_TIME_SET_CURRENT(end);
- INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_write_time, end, start);
- }

In 0002, you are removing PendingWalStats.wal_write_time, which does
not seem OK to me because we should still aggregate this data for
track_wal_io_timing, no?
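
For reference, the pattern that 0002 switches to (take a start time only
when timing is enabled, then account the operation count, bytes and elapsed
time in one call) can be modelled as below. This is a standalone sketch, not
PostgreSQL code: IoCounter, prepare_io_time() and count_io_op_time() are
hypothetical stand-ins for the pgstat_prepare_io_time() /
pgstat_count_io_op_time() pair, and clock_gettime() replaces the instr_time
macros.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif

#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical per-(object, context, op) accumulator. */
typedef struct IoCounter
{
	uint64_t	count;		/* number of operations */
	uint64_t	bytes;		/* bytes transferred */
	uint64_t	time_ns;	/* elapsed time, only when timing is enabled */
} IoCounter;

/* Return the current time if timing is enabled, else a zeroed timestamp. */
static struct timespec
prepare_io_time(bool track_timing)
{
	struct timespec start = {0, 0};

	if (track_timing)
		clock_gettime(CLOCK_MONOTONIC, &start);
	return start;
}

/* Always count the operation; add elapsed time only when timing is enabled. */
static void
count_io_op_time(IoCounter *counter, bool track_timing,
				 struct timespec start, uint32_t cnt, uint64_t bytes)
{
	assert(counter != NULL);

	counter->count += cnt;
	counter->bytes += bytes;

	if (track_timing)
	{
		struct timespec end;
		int64_t		elapsed_ns;

		clock_gettime(CLOCK_MONOTONIC, &end);
		elapsed_ns = (int64_t) (end.tv_sec - start.tv_sec) * 1000000000LL
			+ (int64_t) (end.tv_nsec - start.tv_nsec);
		counter->time_ns += (uint64_t) elapsed_ns;
	}
}
```

In this model the operation and byte counts are always recorded, and only
the elapsed time depends on the timing switch.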
--
Michael

Attachments:

v10-0002-Add-WAL-I-O-stats-to-both-pg_stat_io-view-and-pe.patch (text/x-diff; charset=us-ascii)

From 63d5d25c562fbe42e9c9d80d624edc0d1c2d92d9 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Thu, 23 Jan 2025 10:45:23 +0300
Subject: [PATCH v10 2/3] Add WAL I/O stats to both pg_stat_io view and per
 backend I/O statistics

This commit adds WAL I/O stats to both pg_stat_io view and per backend
I/O statistics (pg_stat_get_backend_io()).

This commit introduces three new I/O concepts:

- IOObject IOOBJECT_WAL is used for tracking all WAL I/Os.
- IOOBJECT_WAL / IOCONTEXT_NORMAL is used for tracking I/O operations
  done on already created WAL segments.
- IOOBJECT_WAL / IOCONTEXT_INIT is used for tracking I/O operations done
  while creating the WAL segments.

For now, the 027_stream_regress.pl test 'check contents of
pg_stat_statements on regression database' fails because walsenders may
shut down after the checkpointer. There is an ongoing discussion to fix
that problem:
postgr.es/m/flat/kgng5nrvnlv335evmsuvpnh354rw7qyazl73kdysev2cr2v5zu%40m3cfzxicm5kp
---
 src/include/pgstat.h                        |  4 +-
 src/backend/access/transam/xlog.c           | 57 ++++++--------
 src/backend/access/transam/xlogreader.c     |  8 ++
 src/backend/access/transam/xlogrecovery.c   | 14 ++++
 src/backend/utils/activity/pgstat_backend.c |  9 ++-
 src/backend/utils/activity/pgstat_io.c      | 84 ++++++++++++++++++---
 src/test/regress/expected/stats.out         | 53 +++++++++++++
 src/test/regress/sql/stats.sql              | 25 ++++++
 doc/src/sgml/monitoring.sgml                | 19 ++++-
 9 files changed, 221 insertions(+), 52 deletions(-)

diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index d0d45150977..0a456858c9c 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -274,14 +274,16 @@ typedef enum IOObject
 {
 	IOOBJECT_RELATION,
 	IOOBJECT_TEMP_RELATION,
+	IOOBJECT_WAL,
 } IOObject;
 
-#define IOOBJECT_NUM_TYPES (IOOBJECT_TEMP_RELATION + 1)
+#define IOOBJECT_NUM_TYPES (IOOBJECT_WAL + 1)
 
 typedef enum IOContext
 {
 	IOCONTEXT_BULKREAD,
 	IOCONTEXT_BULKWRITE,
+	IOCONTEXT_INIT,
 	IOCONTEXT_NORMAL,
 	IOCONTEXT_VACUUM,
 } IOContext;
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index bf3dbda901d..ee26e9739bd 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2425,38 +2425,23 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
 			Size		nbytes;
 			Size		nleft;
 			ssize_t		written;
-			instr_time	start;
+			instr_time	io_start;
 
 			/* OK to write the page(s) */
 			from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
 			nbytes = npages * (Size) XLOG_BLCKSZ;
 			nleft = nbytes;
+
+			io_start = pgstat_prepare_io_time(track_wal_io_timing);
+
 			do
 			{
 				errno = 0;
 
-				/* Measure I/O timing to write WAL data */
-				if (track_wal_io_timing)
-					INSTR_TIME_SET_CURRENT(start);
-				else
-					INSTR_TIME_SET_ZERO(start);
-
 				pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
 				written = pg_pwrite(openLogFile, from, nleft, startoffset);
 				pgstat_report_wait_end();
 
-				/*
-				 * Increment the I/O timing and the number of times WAL data
-				 * were written out to disk.
-				 */
-				if (track_wal_io_timing)
-				{
-					instr_time	end;
-
-					INSTR_TIME_SET_CURRENT(end);
-					INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_write_time, end, start);
-				}
-
 				PendingWalStats.wal_write++;
 
 				if (written <= 0)
@@ -2481,6 +2466,9 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
 				startoffset += written;
 			} while (nleft > 0);
 
+			pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL,
+									IOOP_WRITE, io_start, npages, npages * XLOG_BLCKSZ);
+
 			npages = 0;
 
 			/*
@@ -3216,6 +3204,7 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	int			fd;
 	int			save_errno;
 	int			open_flags = O_RDWR | O_CREAT | O_EXCL | PG_BINARY;
+	instr_time	io_start;
 
 	Assert(logtli != 0);
 
@@ -3259,6 +3248,8 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 				(errcode_for_file_access(),
 				 errmsg("could not create file \"%s\": %m", tmppath)));
 
+	io_start = pgstat_prepare_io_time(track_wal_io_timing);
+
 	pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_WRITE);
 	save_errno = 0;
 	if (wal_init_zero)
@@ -3294,6 +3285,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	}
 	pgstat_report_wait_end();
 
+	pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_INIT, IOOP_WRITE,
+							io_start, 1, wal_segment_size);
+
 	if (save_errno)
 	{
 		/*
@@ -3310,6 +3304,8 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 				 errmsg("could not write to file \"%s\": %m", tmppath)));
 	}
 
+	io_start = pgstat_prepare_io_time(track_wal_io_timing);
+
 	pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_SYNC);
 	if (pg_fsync(fd) != 0)
 	{
@@ -3322,6 +3318,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	}
 	pgstat_report_wait_end();
 
+	pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_INIT,
+							IOOP_FSYNC, io_start, 1, 0);
+
 	if (close(fd) != 0)
 		ereport(ERROR,
 				(errcode_for_file_access(),
@@ -8683,7 +8682,7 @@ void
 issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 {
 	char	   *msg = NULL;
-	instr_time	start;
+	instr_time	io_start;
 
 	Assert(tli != 0);
 
@@ -8696,11 +8695,7 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 		wal_sync_method == WAL_SYNC_METHOD_OPEN_DSYNC)
 		return;
 
-	/* Measure I/O timing to sync the WAL file */
-	if (track_wal_io_timing)
-		INSTR_TIME_SET_CURRENT(start);
-	else
-		INSTR_TIME_SET_ZERO(start);
+	io_start = pgstat_prepare_io_time(track_wal_io_timing);
 
 	pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC);
 	switch (wal_sync_method)
@@ -8746,16 +8741,8 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 
 	pgstat_report_wait_end();
 
-	/*
-	 * Increment the I/O timing and the number of times WAL files were synced.
-	 */
-	if (track_wal_io_timing)
-	{
-		instr_time	end;
-
-		INSTR_TIME_SET_CURRENT(end);
-		INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_sync_time, end, start);
-	}
+	pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_FSYNC,
+							io_start, 1, 0);
 
 	PendingWalStats.wal_sync++;
 }
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 91b6a91767d..e8d97cf0732 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1507,6 +1507,9 @@ WALRead(XLogReaderState *state,
 	char	   *p;
 	XLogRecPtr	recptr;
 	Size		nbytes;
+#ifndef FRONTEND
+	instr_time	io_start;
+#endif
 
 	p = buf;
 	recptr = startptr;
@@ -1552,6 +1555,8 @@ WALRead(XLogReaderState *state,
 			segbytes = nbytes;
 
 #ifndef FRONTEND
+		io_start = pgstat_prepare_io_time(track_wal_io_timing);
+
 		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
 #endif
 
@@ -1561,6 +1566,9 @@ WALRead(XLogReaderState *state,
 
 #ifndef FRONTEND
 		pgstat_report_wait_end();
+
+		pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_READ,
+								io_start, 1, readbytes);
 #endif
 
 		if (readbytes <= 0)
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index cf2b007806f..91fa72fadcc 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -60,6 +60,7 @@
 #include "utils/datetime.h"
 #include "utils/fmgrprotos.h"
 #include "utils/guc_hooks.h"
+#include "utils/pgstat_internal.h"
 #include "utils/pg_lsn.h"
 #include "utils/ps_status.h"
 #include "utils/pg_rusage.h"
@@ -1836,6 +1837,9 @@ PerformWalRecovery(void)
 		 * end of main redo apply loop
 		 */
 
+		/* Report pending statistics to the cumulative stats system  */
+		pgstat_flush_io(false);
+
 		if (reachedRecoveryTarget)
 		{
 			if (!reachedConsistency)
@@ -3306,6 +3310,7 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 	uint32		targetPageOff;
 	XLogSegNo	targetSegNo PG_USED_FOR_ASSERTS_ONLY;
 	int			r;
+	instr_time	io_start;
 
 	XLByteToSeg(targetPagePtr, targetSegNo, wal_segment_size);
 	targetPageOff = XLogSegmentOffset(targetPagePtr, wal_segment_size);
@@ -3398,6 +3403,8 @@ retry:
 	/* Read the requested page */
 	readOff = targetPageOff;
 
+	io_start = pgstat_prepare_io_time(track_wal_io_timing);
+
 	pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
 	r = pg_pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);
 	if (r != XLOG_BLCKSZ)
@@ -3406,6 +3413,10 @@ retry:
 		int			save_errno = errno;
 
 		pgstat_report_wait_end();
+
+		pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_READ,
+								io_start, 1, r);
+
 		XLogFileName(fname, curFileTLI, readSegNo, wal_segment_size);
 		if (r < 0)
 		{
@@ -3426,6 +3437,9 @@ retry:
 	}
 	pgstat_report_wait_end();
 
+	pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_READ,
+							io_start, 1, r);
+
 	Assert(targetSegNo == readSegNo);
 	Assert(targetPageOff == readOff);
 	Assert(reqLen <= readLen);
diff --git a/src/backend/utils/activity/pgstat_backend.c b/src/backend/utils/activity/pgstat_backend.c
index bcf9e4b1487..338da73a9a9 100644
--- a/src/backend/utils/activity/pgstat_backend.c
+++ b/src/backend/utils/activity/pgstat_backend.c
@@ -24,6 +24,7 @@
 
 #include "postgres.h"
 
+#include "access/xlog.h"
 #include "storage/bufmgr.h"
 #include "utils/memutils.h"
 #include "utils/pgstat_internal.h"
@@ -43,7 +44,7 @@ void
 pgstat_count_backend_io_op_time(IOObject io_object, IOContext io_context,
 								IOOp io_op, instr_time io_time)
 {
-	Assert(track_io_timing);
+	Assert(track_io_timing || track_wal_io_timing);
 
 	if (!pgstat_tracks_backend_bktype(MyBackendType))
 		return;
@@ -236,9 +237,6 @@ pgstat_tracks_backend_bktype(BackendType bktype)
 		case B_DEAD_END_BACKEND:
 		case B_ARCHIVER:
 		case B_LOGGER:
-		case B_WAL_RECEIVER:
-		case B_WAL_WRITER:
-		case B_WAL_SUMMARIZER:
 		case B_BG_WRITER:
 		case B_CHECKPOINTER:
 		case B_STARTUP:
@@ -249,7 +247,10 @@ pgstat_tracks_backend_bktype(BackendType bktype)
 		case B_BG_WORKER:
 		case B_STANDALONE_BACKEND:
 		case B_SLOTSYNC_WORKER:
+		case B_WAL_RECEIVER:
 		case B_WAL_SENDER:
+		case B_WAL_SUMMARIZER:
+		case B_WAL_WRITER:
 			return true;
 	}
 
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 6ff5d9e96a1..17924759e90 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -16,6 +16,7 @@
 
 #include "postgres.h"
 
+#include "access/xlog.h"
 #include "executor/instrument.h"
 #include "storage/bufmgr.h"
 #include "utils/pgstat_internal.h"
@@ -105,6 +106,20 @@ pgstat_prepare_io_time(bool track_io_guc)
 	return io_start;
 }
 
+/*
+ * Decide if IO timings need to be tracked.  Timings associated to
+ * IOOBJECT_WAL objects are tracked if track_wal_io_timing is enabled,
+ * else rely on track_io_timing.
+ */
+static bool
+pgstat_should_track_io_time(IOObject io_object)
+{
+	if (io_object == IOOBJECT_WAL)
+		return track_wal_io_timing;
+
+	return track_io_timing;
+}
+
 /*
  * Like pgstat_count_io_op() except it also accumulates time.
  */
@@ -112,14 +127,24 @@ void
 pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
 						instr_time start_time, uint32 cnt, uint64 bytes)
 {
-	if (track_io_timing)
+	/*
+	 * Accumulate timing data.  pgstat_count_buffer is for pgstat_database. As
+	 * pg_stat_database only counts blk_read_time and blk_write_time, it is
+	 * set for IOOP_READ, IOOP_WRITE and IOOP_EXTEND.
+	 *
+	 * pgBufferUsage is for EXPLAIN.  pgBufferUsage has write and read stats
+	 * for shared, local and temporary blocks.  Temporary blocks are ignored
+	 * here.
+	 */
+	if (pgstat_should_track_io_time(io_object))
 	{
 		instr_time	io_time;
 
 		INSTR_TIME_SET_CURRENT(io_time);
 		INSTR_TIME_SUBTRACT(io_time, start_time);
 
-		if (io_op == IOOP_WRITE || io_op == IOOP_EXTEND)
+		if (io_object != IOOBJECT_WAL &&
+			(io_op == IOOP_WRITE || io_op == IOOP_EXTEND))
 		{
 			pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
 			if (io_object == IOOBJECT_RELATION)
@@ -127,7 +152,7 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
 			else if (io_object == IOOBJECT_TEMP_RELATION)
 				INSTR_TIME_ADD(pgBufferUsage.local_blk_write_time, io_time);
 		}
-		else if (io_op == IOOP_READ)
+		else if (io_object != IOOBJECT_WAL && io_op == IOOP_READ)
 		{
 			pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
 			if (io_object == IOOBJECT_RELATION)
@@ -241,6 +266,8 @@ pgstat_get_io_context_name(IOContext io_context)
 			return "bulkread";
 		case IOCONTEXT_BULKWRITE:
 			return "bulkwrite";
+		case IOCONTEXT_INIT:
+			return "init";
 		case IOCONTEXT_NORMAL:
 			return "normal";
 		case IOCONTEXT_VACUUM:
@@ -260,6 +287,8 @@ pgstat_get_io_object_name(IOObject io_object)
 			return "relation";
 		case IOOBJECT_TEMP_RELATION:
 			return "temp relation";
+		case IOOBJECT_WAL:
+			return "wal";
 	}
 
 	elog(ERROR, "unrecognized IOObject value: %d", io_object);
@@ -332,8 +361,6 @@ pgstat_io_snapshot_cb(void)
 * - Syslogger because it is not connected to shared memory
 * - Archiver because most relevant archiving IO is delegated to a
 *   specialized command or module
-* - WAL Receiver, WAL Writer, and WAL Summarizer IO are not tracked in
-*   pg_stat_io for now
 *
 * Function returns true if BackendType participates in the cumulative stats
 * subsystem for IO and false if it does not.
@@ -354,9 +381,6 @@ pgstat_tracks_io_bktype(BackendType bktype)
 		case B_DEAD_END_BACKEND:
 		case B_ARCHIVER:
 		case B_LOGGER:
-		case B_WAL_RECEIVER:
-		case B_WAL_WRITER:
-		case B_WAL_SUMMARIZER:
 			return false;
 
 		case B_AUTOVAC_LAUNCHER:
@@ -368,7 +392,10 @@ pgstat_tracks_io_bktype(BackendType bktype)
 		case B_SLOTSYNC_WORKER:
 		case B_STANDALONE_BACKEND:
 		case B_STARTUP:
+		case B_WAL_RECEIVER:
 		case B_WAL_SENDER:
+		case B_WAL_SUMMARIZER:
+		case B_WAL_WRITER:
 			return true;
 	}
 
@@ -394,6 +421,15 @@ pgstat_tracks_io_object(BackendType bktype, IOObject io_object,
 	if (!pgstat_tracks_io_bktype(bktype))
 		return false;
 
+	/*
+	 * Currently, IO on IOOBJECT_WAL objects can only occur in the
+	 * IOCONTEXT_NORMAL and IOCONTEXT_INIT IOContexts.
+	 */
+	if (io_object == IOOBJECT_WAL &&
+		(io_context != IOCONTEXT_NORMAL &&
+		 io_context != IOCONTEXT_INIT))
+		return false;
+
 	/*
 	 * Currently, IO on temporary relations can only occur in the
 	 * IOCONTEXT_NORMAL IOContext.
@@ -460,14 +496,28 @@ pgstat_tracks_io_op(BackendType bktype, IOObject io_object,
 	/*
 	 * Some BackendTypes will not do certain IOOps.
 	 */
-	if ((bktype == B_BG_WRITER || bktype == B_CHECKPOINTER) &&
+	if ((bktype == B_BG_WRITER) &&
 		(io_op == IOOP_READ || io_op == IOOP_EVICT || io_op == IOOP_HIT))
 		return false;
 
+	if ((bktype == B_CHECKPOINTER) &&
+		((io_object != IOOBJECT_WAL && io_op == IOOP_READ) ||
+		 (io_op == IOOP_EVICT || io_op == IOOP_HIT)))
+		return false;
+
 	if ((bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER ||
 		 bktype == B_CHECKPOINTER) && io_op == IOOP_EXTEND)
 		return false;
 
+	/*
+	 * Some BackendTypes don't do reads with IOOBJECT_WAL.
+	 */
+	if (io_object == IOOBJECT_WAL && io_op == IOOP_READ &&
+		(bktype == B_WAL_RECEIVER || bktype == B_BG_WRITER ||
+		 bktype == B_AUTOVAC_LAUNCHER || bktype == B_AUTOVAC_WORKER ||
+		 bktype == B_WAL_WRITER))
+		return false;
+
 	/*
 	 * Temporary tables are not logged and thus do not require fsync'ing.
 	 * Writeback is not requested for temporary tables.
@@ -492,6 +542,22 @@ pgstat_tracks_io_op(BackendType bktype, IOObject io_object,
 	if (!strategy_io_context && io_op == IOOP_REUSE)
 		return false;
 
+	/*
+	 * An IOCONTEXT_INIT done for an IOOBJECT_WAL io_object does writes and
+	 * syncs.
+	 */
+	if (io_object == IOOBJECT_WAL && io_context == IOCONTEXT_INIT &&
+		!(io_op == IOOP_WRITE || io_op == IOOP_FSYNC))
+		return false;
+
+	/*
+	 * An IOCONTEXT_NORMAL done for an IOOBJECT_WAL io_object does writes,
+	 * reads and syncs.
+	 */
+	if (io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL &&
+		!(io_op == IOOP_WRITE || io_op == IOOP_READ || io_op == IOOP_FSYNC))
+		return false;
+
 	/*
 	 * IOOP_FSYNC IOOps done by a backend using a BufferAccessStrategy are
 	 * counted in the IOCONTEXT_NORMAL IOContext. See comment in
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index a0317b7208e..7dfcaf03b78 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -862,6 +862,33 @@ WHERE pg_stat_get_backend_pid(beid) = pg_backend_pid();
  t
 (1 row)
 
+-- Test pg_stat_io for WAL in an init context, that should do writes
+-- and syncs.
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'init' and object = 'wal' \gset io_sum_wal_init_
+SELECT :io_sum_wal_init_writes > 0;
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT current_setting('fsync') = 'off'
+  OR :io_sum_wal_init_fsyncs > 0;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- Test pg_stat_io for WAL in a normal context, that should do reads as well.
+SELECT SUM(reads) > 0
+  FROM pg_stat_io
+  WHERE context = 'normal' and object = 'wal';
+ ?column? 
+----------
+ t
+(1 row)
+
 -----
 -- Test that resetting stats works for reset timestamp
 -----
@@ -1256,6 +1283,7 @@ SELECT pg_stat_get_subscription_stats(NULL);
 -- - extends of relations using shared buffers
 -- - fsyncs done to ensure the durability of data dirtying shared buffers
 -- - shared buffer hits
+-- - WAL writes and fsyncs in IOContext IOCONTEXT_NORMAL
 -- There is no test for blocks evicted from shared buffers, because we cannot
 -- be sure of the state of shared buffers at the point the test is run.
 -- Create a regular table and insert some data to generate IOCONTEXT_NORMAL
@@ -1273,6 +1301,9 @@ SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
 SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
   FROM pg_stat_get_backend_io(pg_backend_pid())
   WHERE object = 'relation' \gset my_io_sum_shared_before_
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'normal' and object = 'wal' \gset io_sum_wal_normal_before_
 CREATE TABLE test_io_shared(a int);
 INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
 SELECT pg_stat_force_next_flush();
@@ -1335,6 +1366,28 @@ SELECT current_setting('fsync') = 'off'
  t
 (1 row)
 
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'normal' and object = 'wal' \gset io_sum_wal_normal_after_
+SELECT current_setting('synchronous_commit') = 'on';
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT :io_sum_wal_normal_after_writes > :io_sum_wal_normal_before_writes;
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT current_setting('fsync') = 'off'
+  OR :io_sum_wal_normal_after_fsyncs > :io_sum_wal_normal_before_fsyncs;
+ ?column? 
+----------
+ t
+(1 row)
+
 -- Change the tablespace so that the table is rewritten directly, then SELECT
 -- from it to cause it to be read back into shared buffers.
 SELECT sum(reads) AS io_sum_shared_before_reads
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index 399c72bbcf7..9e368637199 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -442,6 +442,20 @@ SELECT (current_schemas(true))[1] = ('pg_temp_' || beid::text) AS match
 FROM pg_stat_get_backend_idset() beid
 WHERE pg_stat_get_backend_pid(beid) = pg_backend_pid();
 
+-- Test pg_stat_io for WAL in an init context, that should do writes
+-- and syncs.
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'init' and object = 'wal' \gset io_sum_wal_init_
+SELECT :io_sum_wal_init_writes > 0;
+SELECT current_setting('fsync') = 'off'
+  OR :io_sum_wal_init_fsyncs > 0;
+
+-- Test pg_stat_io for WAL in a normal context, that should do reads as well.
+SELECT SUM(reads) > 0
+  FROM pg_stat_io
+  WHERE context = 'normal' and object = 'wal';
+
 -----
 -- Test that resetting stats works for reset timestamp
 -----
@@ -602,6 +616,7 @@ SELECT pg_stat_get_subscription_stats(NULL);
 -- - extends of relations using shared buffers
 -- - fsyncs done to ensure the durability of data dirtying shared buffers
 -- - shared buffer hits
+-- - WAL writes and fsyncs in IOContext IOCONTEXT_NORMAL
 
 -- There is no test for blocks evicted from shared buffers, because we cannot
 -- be sure of the state of shared buffers at the point the test is run.
@@ -621,6 +636,9 @@ SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
 SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
   FROM pg_stat_get_backend_io(pg_backend_pid())
   WHERE object = 'relation' \gset my_io_sum_shared_before_
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'normal' and object = 'wal' \gset io_sum_wal_normal_before_
 CREATE TABLE test_io_shared(a int);
 INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
 SELECT pg_stat_force_next_flush();
@@ -649,6 +667,13 @@ SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
 SELECT :my_io_sum_shared_after_writes >= :my_io_sum_shared_before_writes;
 SELECT current_setting('fsync') = 'off'
   OR :my_io_sum_shared_after_fsyncs >= :my_io_sum_shared_before_fsyncs;
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'normal' and object = 'wal' \gset io_sum_wal_normal_after_
+SELECT current_setting('synchronous_commit') = 'on';
+SELECT :io_sum_wal_normal_after_writes > :io_sum_wal_normal_before_writes;
+SELECT current_setting('fsync') = 'off'
+  OR :io_sum_wal_normal_after_fsyncs > :io_sum_wal_normal_before_fsyncs;
 
 -- Change the tablespace so that the table is rewritten directly, then SELECT
 -- from it to cause it to be read back into shared buffers.
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index e5888fae2b5..c6f2c384512 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2581,9 +2581,10 @@ description | Waiting for a newly initialized WAL file to reach durable storage
   </para>
 
   <para>
-   Currently, I/O on relations (e.g. tables, indexes) is tracked. However,
-   relation I/O which bypasses shared buffers (e.g. when moving a table from one
-   tablespace to another) is currently not tracked.
+   Currently, I/O on relations (e.g. tables, indexes) and WAL activity are
+   tracked. However, relation I/O which bypasses shared buffers
+   (e.g. when moving a table from one tablespace to another) is currently
+   not tracked.
   </para>
 
   <table id="pg-stat-io-view" xreflabel="pg_stat_io">
@@ -2636,6 +2637,11 @@ description | Waiting for a newly initialized WAL file to reach durable storage
           <literal>temp relation</literal>: Temporary relations.
          </para>
         </listitem>
+        <listitem>
+         <para>
+          <literal>wal</literal>: Write Ahead Logs.
+         </para>
+        </listitem>
        </itemizedlist>
        </para>
       </entry>
@@ -2660,6 +2666,13 @@ description | Waiting for a newly initialized WAL file to reach durable storage
           <literal>normal</literal>.
          </para>
         </listitem>
+        <listitem>
+         <para>
+          <literal>init</literal>: I/O operations performed while creating the
+          WAL segments are tracked in <varname>context</varname>
+          <literal>init</literal>.
+         </para>
+        </listitem>
         <listitem>
          <para>
           <literal>vacuum</literal>: I/O operations performed outside of shared
-- 
2.47.2

v10-0003-Fetch-timing-columns-from-pg_stat_io-in-the-pg_s.patch (text/x-diff; charset=us-ascii)

From 0cf23ff24939406f28e021b706f36944c8c189a4 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Thu, 23 Jan 2025 13:52:31 +0300
Subject: [PATCH v10 3/3] Fetch timing columns from pg_stat_io in the
 pg_stat_wal view

The prior commit started tracking WAL I/O timing stats in the pg_stat_io
view. To avoid double accounting and to report the same timing values in
both pg_stat_io and pg_stat_wal, fetch the timing columns of the
pg_stat_wal view from pg_stat_io.
---
 src/include/catalog/pg_proc.dat         |  6 +++---
 src/include/pgstat.h                    |  4 ----
 src/backend/catalog/system_views.sql    | 14 ++++++++++---
 src/backend/utils/activity/pgstat_wal.c |  2 --
 src/backend/utils/adt/pgstatfuncs.c     | 14 +++----------
 src/test/regress/expected/rules.out     | 27 ++++++++++++++++---------
 doc/src/sgml/monitoring.sgml            | 10 ++++++---
 7 files changed, 41 insertions(+), 36 deletions(-)

diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 2aafdbc3e93..aaf2432cc17 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5928,9 +5928,9 @@
 { oid => '1136', descr => 'statistics: information about WAL activity',
   proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
   proparallel => 'r', prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int8,int8,numeric,int8,int8,int8,float8,float8,timestamptz}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o}',
-  proargnames => '{wal_records,wal_fpi,wal_bytes,wal_buffers_full,wal_write,wal_sync,wal_write_time,wal_sync_time,stats_reset}',
+  proallargtypes => '{int8,int8,numeric,int8,int8,int8,timestamptz}',
+  proargmodes => '{o,o,o,o,o,o,o}',
+  proargnames => '{wal_records,wal_fpi,wal_bytes,wal_buffers_full,wal_write,wal_sync,stats_reset}',
   prosrc => 'pg_stat_get_wal' },
 { oid => '6248', descr => 'statistics: information about WAL prefetching',
   proname => 'pg_stat_get_recovery_prefetch', prorows => '1', proretset => 't',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 0a456858c9c..faac508015e 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -477,8 +477,6 @@ typedef struct PgStat_WalStats
 	PgStat_Counter wal_buffers_full;
 	PgStat_Counter wal_write;
 	PgStat_Counter wal_sync;
-	PgStat_Counter wal_write_time;
-	PgStat_Counter wal_sync_time;
 	TimestampTz stat_reset_timestamp;
 } PgStat_WalStats;
 
@@ -493,8 +491,6 @@ typedef struct PgStat_PendingWalStats
 	PgStat_Counter wal_buffers_full;
 	PgStat_Counter wal_write;
 	PgStat_Counter wal_sync;
-	instr_time	wal_write_time;
-	instr_time	wal_sync_time;
 } PgStat_PendingWalStats;
 
 
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 46868bf7e89..9448e1ec355 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1180,6 +1180,13 @@ SELECT
 FROM pg_stat_get_io() b;
 
 CREATE VIEW pg_stat_wal AS
+    WITH pgsio_sum_write_fsync_time AS (
+        SELECT
+            sum(write_time) as sum_write_time,
+            sum(fsync_time) as sum_fsync_time
+        FROM pg_stat_io
+        WHERE context = 'normal' and object = 'wal'
+    )
     SELECT
         w.wal_records,
         w.wal_fpi,
@@ -1187,10 +1194,11 @@ CREATE VIEW pg_stat_wal AS
         w.wal_buffers_full,
         w.wal_write,
         w.wal_sync,
-        w.wal_write_time,
-        w.wal_sync_time,
+        p.sum_write_time as write_time,
+        p.sum_fsync_time as fsync_time,
         w.stats_reset
-    FROM pg_stat_get_wal() w;
+    FROM pg_stat_get_wal() w
+        CROSS JOIN pgsio_sum_write_fsync_time p;
 
 CREATE VIEW pg_stat_progress_analyze AS
     SELECT
diff --git a/src/backend/utils/activity/pgstat_wal.c b/src/backend/utils/activity/pgstat_wal.c
index 18fa6b2936a..8ee650cdc30 100644
--- a/src/backend/utils/activity/pgstat_wal.c
+++ b/src/backend/utils/activity/pgstat_wal.c
@@ -126,8 +126,6 @@ pgstat_wal_flush_cb(bool nowait)
 	WALSTAT_ACC(wal_buffers_full, PendingWalStats);
 	WALSTAT_ACC(wal_write, PendingWalStats);
 	WALSTAT_ACC(wal_sync, PendingWalStats);
-	WALSTAT_ACC_INSTR_TIME(wal_write_time);
-	WALSTAT_ACC_INSTR_TIME(wal_sync_time);
 #undef WALSTAT_ACC_INSTR_TIME
 #undef WALSTAT_ACC
 
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 0f5e0a9778d..5de31979bc0 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1609,7 +1609,7 @@ pg_stat_get_backend_io(PG_FUNCTION_ARGS)
 Datum
 pg_stat_get_wal(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_COLS	9
+#define PG_STAT_GET_WAL_COLS	7
 	TupleDesc	tupdesc;
 	Datum		values[PG_STAT_GET_WAL_COLS] = {0};
 	bool		nulls[PG_STAT_GET_WAL_COLS] = {0};
@@ -1630,11 +1630,7 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
 					   INT8OID, -1, 0);
 	TupleDescInitEntry(tupdesc, (AttrNumber) 6, "wal_sync",
 					   INT8OID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 7, "wal_write_time",
-					   FLOAT8OID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 8, "wal_sync_time",
-					   FLOAT8OID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 9, "stats_reset",
+	TupleDescInitEntry(tupdesc, (AttrNumber) 7, "stats_reset",
 					   TIMESTAMPTZOID, -1, 0);
 
 	BlessTupleDesc(tupdesc);
@@ -1657,11 +1653,7 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
 	values[4] = Int64GetDatum(wal_stats->wal_write);
 	values[5] = Int64GetDatum(wal_stats->wal_sync);
 
-	/* Convert counters from microsec to millisec for display */
-	values[6] = Float8GetDatum(((double) wal_stats->wal_write_time) / 1000.0);
-	values[7] = Float8GetDatum(((double) wal_stats->wal_sync_time) / 1000.0);
-
-	values[8] = TimestampTzGetDatum(wal_stats->stat_reset_timestamp);
+	values[6] = TimestampTzGetDatum(wal_stats->stat_reset_timestamp);
 
 	/* Returns the record as Datum */
 	PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 856a8349c50..a15e0096d9c 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2241,16 +2241,23 @@ pg_stat_user_tables| SELECT relid,
     autoanalyze_count
    FROM pg_stat_all_tables
   WHERE ((schemaname <> ALL (ARRAY['pg_catalog'::name, 'information_schema'::name])) AND (schemaname !~ '^pg_toast'::text));
-pg_stat_wal| SELECT wal_records,
-    wal_fpi,
-    wal_bytes,
-    wal_buffers_full,
-    wal_write,
-    wal_sync,
-    wal_write_time,
-    wal_sync_time,
-    stats_reset
-   FROM pg_stat_get_wal() w(wal_records, wal_fpi, wal_bytes, wal_buffers_full, wal_write, wal_sync, wal_write_time, wal_sync_time, stats_reset);
+pg_stat_wal| WITH pgsio_sum_write_fsync_time AS (
+         SELECT sum(pg_stat_io.write_time) AS sum_write_time,
+            sum(pg_stat_io.fsync_time) AS sum_fsync_time
+           FROM pg_stat_io
+          WHERE ((pg_stat_io.context = 'normal'::text) AND (pg_stat_io.object = 'wal'::text))
+        )
+ SELECT w.wal_records,
+    w.wal_fpi,
+    w.wal_bytes,
+    w.wal_buffers_full,
+    w.wal_write,
+    w.wal_sync,
+    p.sum_write_time AS write_time,
+    p.sum_fsync_time AS fsync_time,
+    w.stats_reset
+   FROM (pg_stat_get_wal() w(wal_records, wal_fpi, wal_bytes, wal_buffers_full, wal_write, wal_sync, stats_reset)
+     CROSS JOIN pgsio_sum_write_fsync_time p);
 pg_stat_wal_receiver| SELECT pid,
     status,
     receive_start_lsn,
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index c6f2c384512..516c4b96112 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -4965,7 +4965,9 @@ description | Waiting for a newly initialized WAL file to reach durable storage
         <listitem>
          <para>
           <literal>io</literal>: Reset all the counters shown in the
-          <structname>pg_stat_io</structname> view.
+          <structname>pg_stat_io</structname> view. This resets the timing
+          counter stored in the <structname>pg_stat_wal</structname> view
+          too.
          </para>
         </listitem>
         <listitem>
@@ -4982,8 +4984,10 @@ description | Waiting for a newly initialized WAL file to reach durable storage
         </listitem>
         <listitem>
          <para>
-          <literal>wal</literal>: Reset all the counters shown in the
-          <structname>pg_stat_wal</structname> view.
+          <literal>wal</literal>: Reset all the counters except timings shown
+          in the <structname>pg_stat_wal</structname> view. The timing
+          counters can be reset by calling pg_stat_reset_shared set to
+          <literal>io</literal>.
          </para>
         </listitem>
         <listitem>
-- 
2.47.2

#61

Bertrand Drouvot

bertranddrouvot.pg@gmail.com

12 months ago

In reply to: Nazir Bilal Yavuz (#59)

Re: Show WAL write and fsync stats in pg_stat_io

Hi,

On Fri, Jan 24, 2025 at 06:29:46PM +0300, Nazir Bilal Yavuz wrote:

Hi,

Thanks for looking into this!

On Fri, 24 Jan 2025 at 17:20, Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

I did not look at the code yet but did a few tests.
I can see a difference between pg_stat_wal and pg_stat_io, for example:

"
postgres=# select pg_stat_reset_shared();
 pg_stat_reset_shared
----------------------

(1 row)

postgres=# insert into bdt select a from generate_series(1,200000) a ;
INSERT 0 200000

postgres=# select wal_bytes,stats_reset from pg_stat_wal;
 wal_bytes |          stats_reset
-----------+-------------------------------
  11800088 | 2025-01-24 14:17:28.507994+00
(1 row)

postgres=# select sum(write_bytes),stats_reset from pg_stat_io where object = 'wal' group by stats_reset;
   sum    |          stats_reset
----------+-------------------------------
 12853248 | 2025-01-24 14:17:28.507988+00
(1 row)

Is that expected?

I am not sure, I think they need to be the same.

Yeah I think so (or at least we should document the reason(s) why (and how)
they could differ).

I could not
understand the cause of the difference at first glance. I will look
into this and will come back to you.

Thanks.

Quick note: WAL stats in the pg_stat_io view include WAL initialization
stats under object = 'wal' and context = 'init', so your query may count
these initialization stats too. The correct query is:

select sum(write_bytes),stats_reset from pg_stat_io where object =
'wal' and context = 'normal' group by stats_reset;

Thanks for mentioning this filtering!

That said, this does not solve the problem; there is still a
difference even if you omit the WAL initialization stats from
pg_stat_io.

Yeah. The "init" data was also empty on my side after the pg_stat_reset_shared()
and after the insertion in the table.
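
One observation that may help narrow this down, offered as a sketch and not as a confirmed diagnosis: the pg_stat_io sum quoted above (12853248) is an exact multiple of 8192, while the pg_stat_wal value (11800088) is not. 8192 is the default WAL block size (XLOG_BLCKSZ); that this build uses the default is an assumption, since the thread does not state it. This would be consistent with pg_stat_io counting bytes passed to write calls in whole WAL pages and pg_stat_wal counting generated record bytes, but the thread has not established that. The helper names below are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed default WAL block size (XLOG_BLCKSZ); not stated in the thread. */
#define SKETCH_WAL_BLCKSZ 8192

/* Hypothetical helper: true if nbytes is a whole number of WAL blocks. */
static inline bool
sketch_is_block_aligned(uint64_t nbytes)
{
	return nbytes % SKETCH_WAL_BLCKSZ == 0;
}

/* Hypothetical helper: number of whole WAL blocks contained in nbytes. */
static inline uint64_t
sketch_whole_blocks(uint64_t nbytes)
{
	return nbytes / SKETCH_WAL_BLCKSZ;
}
```

With the figures quoted above, sketch_is_block_aligned(12853248) holds (1569 whole blocks), whereas sketch_is_block_aligned(11800088) does not.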

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#62

Nazir Bilal Yavuz

byavuz81@gmail.com

12 months ago

In reply to: Michael Paquier (#60)

2 attachment(s)

Re: Show WAL write and fsync stats in pg_stat_io

Hi,

On Mon, 27 Jan 2025 at 03:52, Michael Paquier <michael@paquier.xyz> wrote:

On Fri, Jan 24, 2025 at 06:29:46PM +0300, Nazir Bilal Yavuz wrote:

That said, this does not solve the problem; there is still a
difference even if you omit the WAL initialization stats from
pg_stat_io.

Reattaching the remaining patches to make the cfbot happy, as 0001 has
been applied as 87a6690cc695.

Thanks!

-	if (track_wal_io_timing)
-	{
-		instr_time	end;
-
-		INSTR_TIME_SET_CURRENT(end);
-		INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_write_time, end, start);
-	}

In 0002, you are removing PendingWalStats.wal_write_time, which does
not seem OK to me because we should still aggregate this data for
track_wal_io_timing, no?

We use PendingWalStats.wal_[ write | sync ]_time only to show timings
in the pg_stat_wal view, right? And now the pg_stat_wal.wal_[ write
| sync ]_time data is fetched from the pg_stat_io view when
track_wal_io_timing is on. So, I think it is correct to remove these.

I made a mistake while splitting the patches. The places where
PendingWalStats.wal_[ write | sync ]_time are incremented (the code
piece you shared) are removed in 0002 (0001 now), but they should be
removed in 0003 (0002 now) instead. This is corrected in v11.
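
To make the accounting question concrete, here is a minimal standalone sketch (not the patch's code) of the pattern under discussion: the operation count and bytes are always recorded, while elapsed time is accumulated only when the timing setting is enabled. All names are hypothetical, and timestamps are passed in explicitly instead of being read from a clock so that the behavior is easy to check.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-(object, context, op) accumulator. */
typedef struct SketchIoBucket
{
	uint64_t	count;		/* number of I/O operations */
	uint64_t	bytes;		/* bytes transferred */
	uint64_t	time_us;	/* accumulated time, microseconds */
} SketchIoBucket;

/*
 * Hypothetical stand-in for a "prepare" step: return the start timestamp
 * when timing is enabled, or zero when it is not.
 */
static inline uint64_t
sketch_prepare_io_time(bool track_timing, uint64_t now_us)
{
	return track_timing ? now_us : 0;
}

/*
 * Hypothetical stand-in for a "count with time" step: counts and bytes
 * are always accumulated; elapsed time only when timing is enabled.
 */
static inline void
sketch_count_io_op_time(SketchIoBucket *bucket, bool track_timing,
						uint64_t start_us, uint64_t now_us,
						uint32_t cnt, uint64_t bytes)
{
	bucket->count += cnt;
	bucket->bytes += bytes;

	if (track_timing)
		bucket->time_us += now_us - start_us;
}
```

If a second, legacy counter were also updated from the same start timestamp, the same interval would be reported in two places, which is the double accounting the split between the patches is meant to avoid.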

--
Regards,
Nazir Bilal Yavuz
Microsoft

Attachments:

v11-0001-Add-WAL-I-O-stats-to-both-pg_stat_io-view-and-pe.patch (text/x-patch; charset=US-ASCII)

From c94b3ce6963fd312d74649107d184c45d1f5966c Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Mon, 27 Jan 2025 14:03:54 +0300
Subject: [PATCH v11 1/2] Add WAL I/O stats to both pg_stat_io view and per
 backend I/O statistics

This commit adds WAL I/O stats to both pg_stat_io view and per backend
I/O statistics (pg_stat_get_backend_io()).

This commit introduces three new I/O concepts:

- IOObject IOOBJECT_WAL is used for tracking all WAL I/Os.
- IOOBJECT_WAL / IOCONTEXT_NORMAL is used for tracking I/O operations
  done on already created WAL segments.
- IOOBJECT_WAL / IOCONTEXT_INIT is used for tracking I/O operations done
  while creating the WAL segments.

For now, the 027_stream_regress.pl test 'check contents of
pg_stat_statements on regression database' fails, because walsenders may
shut down after the checkpointer. There is an ongoing discussion about
fixing that problem:
postgr.es/m/flat/kgng5nrvnlv335evmsuvpnh354rw7qyazl73kdysev2cr2v5zu%40m3cfzxicm5kp
---
 src/include/pgstat.h                        |  4 +-
 src/backend/access/transam/xlog.c           | 41 ++++++-----
 src/backend/access/transam/xlogreader.c     |  8 +++
 src/backend/access/transam/xlogrecovery.c   | 14 ++++
 src/backend/utils/activity/pgstat_backend.c |  9 +--
 src/backend/utils/activity/pgstat_io.c      | 79 ++++++++++++++++++---
 src/test/regress/expected/stats.out         | 53 ++++++++++++++
 src/test/regress/sql/stats.sql              | 25 +++++++
 doc/src/sgml/monitoring.sgml                | 19 ++++-
 9 files changed, 219 insertions(+), 33 deletions(-)

diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index d0d45150977..0a456858c9c 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -274,14 +274,16 @@ typedef enum IOObject
 {
 	IOOBJECT_RELATION,
 	IOOBJECT_TEMP_RELATION,
+	IOOBJECT_WAL,
 } IOObject;
 
-#define IOOBJECT_NUM_TYPES (IOOBJECT_TEMP_RELATION + 1)
+#define IOOBJECT_NUM_TYPES (IOOBJECT_WAL + 1)
 
 typedef enum IOContext
 {
 	IOCONTEXT_BULKREAD,
 	IOCONTEXT_BULKWRITE,
+	IOCONTEXT_INIT,
 	IOCONTEXT_NORMAL,
 	IOCONTEXT_VACUUM,
 } IOContext;
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index bf3dbda901d..14c4f62b6bb 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2425,7 +2425,7 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
 			Size		nbytes;
 			Size		nleft;
 			ssize_t		written;
-			instr_time	start;
+			instr_time	io_start;
 
 			/* OK to write the page(s) */
 			from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
@@ -2435,16 +2435,17 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
 			{
 				errno = 0;
 
-				/* Measure I/O timing to write WAL data */
-				if (track_wal_io_timing)
-					INSTR_TIME_SET_CURRENT(start);
-				else
-					INSTR_TIME_SET_ZERO(start);
+				io_start = pgstat_prepare_io_time(track_wal_io_timing);
 
 				pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
 				written = pg_pwrite(openLogFile, from, nleft, startoffset);
 				pgstat_report_wait_end();
 
+				pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL,
+										IOOP_WRITE, io_start, 1, written);
+
+				PendingWalStats.wal_write++;
+
 				/*
 				 * Increment the I/O timing and the number of times WAL data
 				 * were written out to disk.
@@ -2454,11 +2455,9 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
 					instr_time	end;
 
 					INSTR_TIME_SET_CURRENT(end);
-					INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_write_time, end, start);
+					INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_write_time, end, io_start);
 				}
 
-				PendingWalStats.wal_write++;
-
 				if (written <= 0)
 				{
 					char		xlogfname[MAXFNAMELEN];
@@ -3216,6 +3215,7 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	int			fd;
 	int			save_errno;
 	int			open_flags = O_RDWR | O_CREAT | O_EXCL | PG_BINARY;
+	instr_time	io_start;
 
 	Assert(logtli != 0);
 
@@ -3259,6 +3259,8 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 				(errcode_for_file_access(),
 				 errmsg("could not create file \"%s\": %m", tmppath)));
 
+	io_start = pgstat_prepare_io_time(track_wal_io_timing);
+
 	pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_WRITE);
 	save_errno = 0;
 	if (wal_init_zero)
@@ -3294,6 +3296,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	}
 	pgstat_report_wait_end();
 
+	pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_INIT, IOOP_WRITE,
+							io_start, 1, wal_segment_size);
+
 	if (save_errno)
 	{
 		/*
@@ -3310,6 +3315,8 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 				 errmsg("could not write to file \"%s\": %m", tmppath)));
 	}
 
+	io_start = pgstat_prepare_io_time(track_wal_io_timing);
+
 	pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_SYNC);
 	if (pg_fsync(fd) != 0)
 	{
@@ -3322,6 +3329,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	}
 	pgstat_report_wait_end();
 
+	pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_INIT,
+							IOOP_FSYNC, io_start, 1, 0);
+
 	if (close(fd) != 0)
 		ereport(ERROR,
 				(errcode_for_file_access(),
@@ -8683,7 +8693,7 @@ void
 issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 {
 	char	   *msg = NULL;
-	instr_time	start;
+	instr_time	io_start;
 
 	Assert(tli != 0);
 
@@ -8696,11 +8706,7 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 		wal_sync_method == WAL_SYNC_METHOD_OPEN_DSYNC)
 		return;
 
-	/* Measure I/O timing to sync the WAL file */
-	if (track_wal_io_timing)
-		INSTR_TIME_SET_CURRENT(start);
-	else
-		INSTR_TIME_SET_ZERO(start);
+	io_start = pgstat_prepare_io_time(track_wal_io_timing);
 
 	pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC);
 	switch (wal_sync_method)
@@ -8754,9 +8760,12 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 		instr_time	end;
 
 		INSTR_TIME_SET_CURRENT(end);
-		INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_sync_time, end, start);
+		INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_sync_time, end, io_start);
 	}
 
+	pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_FSYNC,
+							io_start, 1, 0);
+
 	PendingWalStats.wal_sync++;
 }
 
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 91b6a91767d..e8d97cf0732 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1507,6 +1507,9 @@ WALRead(XLogReaderState *state,
 	char	   *p;
 	XLogRecPtr	recptr;
 	Size		nbytes;
+#ifndef FRONTEND
+	instr_time	io_start;
+#endif
 
 	p = buf;
 	recptr = startptr;
@@ -1552,6 +1555,8 @@ WALRead(XLogReaderState *state,
 			segbytes = nbytes;
 
 #ifndef FRONTEND
+		io_start = pgstat_prepare_io_time(track_wal_io_timing);
+
 		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
 #endif
 
@@ -1561,6 +1566,9 @@ WALRead(XLogReaderState *state,
 
 #ifndef FRONTEND
 		pgstat_report_wait_end();
+
+		pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_READ,
+								io_start, 1, readbytes);
 #endif
 
 		if (readbytes <= 0)
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index cf2b007806f..91fa72fadcc 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -60,6 +60,7 @@
 #include "utils/datetime.h"
 #include "utils/fmgrprotos.h"
 #include "utils/guc_hooks.h"
+#include "utils/pgstat_internal.h"
 #include "utils/pg_lsn.h"
 #include "utils/ps_status.h"
 #include "utils/pg_rusage.h"
@@ -1836,6 +1837,9 @@ PerformWalRecovery(void)
 		 * end of main redo apply loop
 		 */
 
+		/* Report pending statistics to the cumulative stats system  */
+		pgstat_flush_io(false);
+
 		if (reachedRecoveryTarget)
 		{
 			if (!reachedConsistency)
@@ -3306,6 +3310,7 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 	uint32		targetPageOff;
 	XLogSegNo	targetSegNo PG_USED_FOR_ASSERTS_ONLY;
 	int			r;
+	instr_time	io_start;
 
 	XLByteToSeg(targetPagePtr, targetSegNo, wal_segment_size);
 	targetPageOff = XLogSegmentOffset(targetPagePtr, wal_segment_size);
@@ -3398,6 +3403,8 @@ retry:
 	/* Read the requested page */
 	readOff = targetPageOff;
 
+	io_start = pgstat_prepare_io_time(track_wal_io_timing);
+
 	pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
 	r = pg_pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);
 	if (r != XLOG_BLCKSZ)
@@ -3406,6 +3413,10 @@ retry:
 		int			save_errno = errno;
 
 		pgstat_report_wait_end();
+
+		pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_READ,
+								io_start, 1, r);
+
 		XLogFileName(fname, curFileTLI, readSegNo, wal_segment_size);
 		if (r < 0)
 		{
@@ -3426,6 +3437,9 @@ retry:
 	}
 	pgstat_report_wait_end();
 
+	pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_READ,
+							io_start, 1, r);
+
 	Assert(targetSegNo == readSegNo);
 	Assert(targetPageOff == readOff);
 	Assert(reqLen <= readLen);
diff --git a/src/backend/utils/activity/pgstat_backend.c b/src/backend/utils/activity/pgstat_backend.c
index bcf9e4b1487..338da73a9a9 100644
--- a/src/backend/utils/activity/pgstat_backend.c
+++ b/src/backend/utils/activity/pgstat_backend.c
@@ -24,6 +24,7 @@
 
 #include "postgres.h"
 
+#include "access/xlog.h"
 #include "storage/bufmgr.h"
 #include "utils/memutils.h"
 #include "utils/pgstat_internal.h"
@@ -43,7 +44,7 @@ void
 pgstat_count_backend_io_op_time(IOObject io_object, IOContext io_context,
 								IOOp io_op, instr_time io_time)
 {
-	Assert(track_io_timing);
+	Assert(track_io_timing || track_wal_io_timing);
 
 	if (!pgstat_tracks_backend_bktype(MyBackendType))
 		return;
@@ -236,9 +237,6 @@ pgstat_tracks_backend_bktype(BackendType bktype)
 		case B_DEAD_END_BACKEND:
 		case B_ARCHIVER:
 		case B_LOGGER:
-		case B_WAL_RECEIVER:
-		case B_WAL_WRITER:
-		case B_WAL_SUMMARIZER:
 		case B_BG_WRITER:
 		case B_CHECKPOINTER:
 		case B_STARTUP:
@@ -249,7 +247,10 @@ pgstat_tracks_backend_bktype(BackendType bktype)
 		case B_BG_WORKER:
 		case B_STANDALONE_BACKEND:
 		case B_SLOTSYNC_WORKER:
+		case B_WAL_RECEIVER:
 		case B_WAL_SENDER:
+		case B_WAL_SUMMARIZER:
+		case B_WAL_WRITER:
 			return true;
 	}
 
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 6ff5d9e96a1..f6e5d260c92 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -16,6 +16,7 @@
 
 #include "postgres.h"
 
+#include "access/xlog.h"
 #include "executor/instrument.h"
 #include "storage/bufmgr.h"
 #include "utils/pgstat_internal.h"
@@ -105,6 +106,20 @@ pgstat_prepare_io_time(bool track_io_guc)
 	return io_start;
 }
 
+/*
+ * Decide if IO timings need to be tracked.  Timings associated to
+ * IOOBJECT_WAL objects are tracked if track_wal_io_timing is enabled,
+ * else rely on track_io_timing.
+ */
+static bool
+pgstat_should_track_io_time(IOObject io_object)
+{
+	if (io_object == IOOBJECT_WAL)
+		return track_wal_io_timing;
+
+	return track_io_timing;
+}
+
 /*
  * Like pgstat_count_io_op() except it also accumulates time.
  */
@@ -112,14 +127,24 @@ void
 pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
 						instr_time start_time, uint32 cnt, uint64 bytes)
 {
-	if (track_io_timing)
+	/*
+	 * Accumulate timing data.  pgstat_count_buffer is for pg_stat_database. As
+	 * pg_stat_database only counts blk_read_time and blk_write_time, it is
+	 * set for IOOP_READ, IOOP_WRITE and IOOP_EXTEND.
+	 *
+	 * pgBufferUsage is for EXPLAIN.  pgBufferUsage has write and read stats
+	 * for shared, local and temporary blocks.  Temporary blocks are ignored
+	 * here.
+	 */
+	if (pgstat_should_track_io_time(io_object))
 	{
 		instr_time	io_time;
 
 		INSTR_TIME_SET_CURRENT(io_time);
 		INSTR_TIME_SUBTRACT(io_time, start_time);
 
-		if (io_op == IOOP_WRITE || io_op == IOOP_EXTEND)
+		if (io_object != IOOBJECT_WAL &&
+			(io_op == IOOP_WRITE || io_op == IOOP_EXTEND))
 		{
 			pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
 			if (io_object == IOOBJECT_RELATION)
@@ -127,7 +152,7 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
 			else if (io_object == IOOBJECT_TEMP_RELATION)
 				INSTR_TIME_ADD(pgBufferUsage.local_blk_write_time, io_time);
 		}
-		else if (io_op == IOOP_READ)
+		else if (io_object != IOOBJECT_WAL && io_op == IOOP_READ)
 		{
 			pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
 			if (io_object == IOOBJECT_RELATION)
@@ -241,6 +266,8 @@ pgstat_get_io_context_name(IOContext io_context)
 			return "bulkread";
 		case IOCONTEXT_BULKWRITE:
 			return "bulkwrite";
+		case IOCONTEXT_INIT:
+			return "init";
 		case IOCONTEXT_NORMAL:
 			return "normal";
 		case IOCONTEXT_VACUUM:
@@ -260,6 +287,8 @@ pgstat_get_io_object_name(IOObject io_object)
 			return "relation";
 		case IOOBJECT_TEMP_RELATION:
 			return "temp relation";
+		case IOOBJECT_WAL:
+			return "wal";
 	}
 
 	elog(ERROR, "unrecognized IOObject value: %d", io_object);
@@ -332,8 +361,6 @@ pgstat_io_snapshot_cb(void)
 * - Syslogger because it is not connected to shared memory
 * - Archiver because most relevant archiving IO is delegated to a
 *   specialized command or module
-* - WAL Receiver, WAL Writer, and WAL Summarizer IO are not tracked in
-*   pg_stat_io for now
 *
 * Function returns true if BackendType participates in the cumulative stats
 * subsystem for IO and false if it does not.
@@ -354,9 +381,6 @@ pgstat_tracks_io_bktype(BackendType bktype)
 		case B_DEAD_END_BACKEND:
 		case B_ARCHIVER:
 		case B_LOGGER:
-		case B_WAL_RECEIVER:
-		case B_WAL_WRITER:
-		case B_WAL_SUMMARIZER:
 			return false;
 
 		case B_AUTOVAC_LAUNCHER:
@@ -368,7 +392,10 @@ pgstat_tracks_io_bktype(BackendType bktype)
 		case B_SLOTSYNC_WORKER:
 		case B_STANDALONE_BACKEND:
 		case B_STARTUP:
+		case B_WAL_RECEIVER:
 		case B_WAL_SENDER:
+		case B_WAL_SUMMARIZER:
+		case B_WAL_WRITER:
 			return true;
 	}
 
@@ -394,6 +421,15 @@ pgstat_tracks_io_object(BackendType bktype, IOObject io_object,
 	if (!pgstat_tracks_io_bktype(bktype))
 		return false;
 
+	/*
+	 * Currently, IO on IOOBJECT_WAL objects can only occur in the
+	 * IOCONTEXT_NORMAL and IOCONTEXT_INIT IOContexts.
+	 */
+	if (io_object == IOOBJECT_WAL &&
+		(io_context != IOCONTEXT_NORMAL &&
+		 io_context != IOCONTEXT_INIT))
+		return false;
+
 	/*
 	 * Currently, IO on temporary relations can only occur in the
 	 * IOCONTEXT_NORMAL IOContext.
@@ -460,14 +496,28 @@ pgstat_tracks_io_op(BackendType bktype, IOObject io_object,
 	/*
 	 * Some BackendTypes will not do certain IOOps.
 	 */
-	if ((bktype == B_BG_WRITER || bktype == B_CHECKPOINTER) &&
+	if ((bktype == B_BG_WRITER) &&
 		(io_op == IOOP_READ || io_op == IOOP_EVICT || io_op == IOOP_HIT))
 		return false;
 
+	if ((bktype == B_CHECKPOINTER) &&
+		((io_object != IOOBJECT_WAL && io_op == IOOP_READ) ||
+		 (io_op == IOOP_EVICT || io_op == IOOP_HIT)))
+		return false;
+
 	if ((bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER ||
 		 bktype == B_CHECKPOINTER) && io_op == IOOP_EXTEND)
 		return false;
 
+	/*
+	 * Some BackendTypes don't do reads with IOOBJECT_WAL.
+	 */
+	if (io_object == IOOBJECT_WAL && io_op == IOOP_READ &&
+		(bktype == B_WAL_RECEIVER || bktype == B_BG_WRITER ||
+		 bktype == B_AUTOVAC_LAUNCHER || bktype == B_AUTOVAC_WORKER ||
+		 bktype == B_WAL_WRITER))
+		return false;
+
 	/*
 	 * Temporary tables are not logged and thus do not require fsync'ing.
 	 * Writeback is not requested for temporary tables.
@@ -492,6 +542,17 @@ pgstat_tracks_io_op(BackendType bktype, IOObject io_object,
 	if (!strategy_io_context && io_op == IOOP_REUSE)
 		return false;
 
+	/*
+	 * IOOBJECT_WAL IOObject will not do certain IOOps depending on IOContext.
+	 */
+	if (io_object == IOOBJECT_WAL && io_context == IOCONTEXT_INIT &&
+		!(io_op == IOOP_WRITE || io_op == IOOP_FSYNC))
+		return false;
+
+	if (io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL &&
+		!(io_op == IOOP_WRITE || io_op == IOOP_READ || io_op == IOOP_FSYNC))
+		return false;
+
 	/*
 	 * IOOP_FSYNC IOOps done by a backend using a BufferAccessStrategy are
 	 * counted in the IOCONTEXT_NORMAL IOContext. See comment in
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index a0317b7208e..7dfcaf03b78 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -862,6 +862,33 @@ WHERE pg_stat_get_backend_pid(beid) = pg_backend_pid();
  t
 (1 row)
 
+-- Test pg_stat_io for WAL in an init context, that should do writes
+-- and syncs.
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'init' and object = 'wal' \gset io_sum_wal_init_
+SELECT :io_sum_wal_init_writes > 0;
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT current_setting('fsync') = 'off'
+  OR :io_sum_wal_init_fsyncs > 0;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- Test pg_stat_io for WAL in a normal context, that should do reads as well.
+SELECT SUM(reads) > 0
+  FROM pg_stat_io
+  WHERE context = 'normal' and object = 'wal';
+ ?column? 
+----------
+ t
+(1 row)
+
 -----
 -- Test that resetting stats works for reset timestamp
 -----
@@ -1256,6 +1283,7 @@ SELECT pg_stat_get_subscription_stats(NULL);
 -- - extends of relations using shared buffers
 -- - fsyncs done to ensure the durability of data dirtying shared buffers
 -- - shared buffer hits
+-- - WAL writes and fsyncs in IOContext IOCONTEXT_NORMAL
 -- There is no test for blocks evicted from shared buffers, because we cannot
 -- be sure of the state of shared buffers at the point the test is run.
 -- Create a regular table and insert some data to generate IOCONTEXT_NORMAL
@@ -1273,6 +1301,9 @@ SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
 SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
   FROM pg_stat_get_backend_io(pg_backend_pid())
   WHERE object = 'relation' \gset my_io_sum_shared_before_
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'normal' and object = 'wal' \gset io_sum_wal_normal_before_
 CREATE TABLE test_io_shared(a int);
 INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
 SELECT pg_stat_force_next_flush();
@@ -1335,6 +1366,28 @@ SELECT current_setting('fsync') = 'off'
  t
 (1 row)
 
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'normal' and object = 'wal' \gset io_sum_wal_normal_after_
+SELECT current_setting('synchronous_commit') = 'on';
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT :io_sum_wal_normal_after_writes > :io_sum_wal_normal_before_writes;
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT current_setting('fsync') = 'off'
+  OR :io_sum_wal_normal_after_fsyncs > :io_sum_wal_normal_before_fsyncs;
+ ?column? 
+----------
+ t
+(1 row)
+
 -- Change the tablespace so that the table is rewritten directly, then SELECT
 -- from it to cause it to be read back into shared buffers.
 SELECT sum(reads) AS io_sum_shared_before_reads
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index 399c72bbcf7..9e368637199 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -442,6 +442,20 @@ SELECT (current_schemas(true))[1] = ('pg_temp_' || beid::text) AS match
 FROM pg_stat_get_backend_idset() beid
 WHERE pg_stat_get_backend_pid(beid) = pg_backend_pid();
 
+-- Test pg_stat_io for WAL in an init context, that should do writes
+-- and syncs.
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'init' and object = 'wal' \gset io_sum_wal_init_
+SELECT :io_sum_wal_init_writes > 0;
+SELECT current_setting('fsync') = 'off'
+  OR :io_sum_wal_init_fsyncs > 0;
+
+-- Test pg_stat_io for WAL in a normal context, that should do reads as well.
+SELECT SUM(reads) > 0
+  FROM pg_stat_io
+  WHERE context = 'normal' and object = 'wal';
+
 -----
 -- Test that resetting stats works for reset timestamp
 -----
@@ -602,6 +616,7 @@ SELECT pg_stat_get_subscription_stats(NULL);
 -- - extends of relations using shared buffers
 -- - fsyncs done to ensure the durability of data dirtying shared buffers
 -- - shared buffer hits
+-- - WAL writes and fsyncs in IOContext IOCONTEXT_NORMAL
 
 -- There is no test for blocks evicted from shared buffers, because we cannot
 -- be sure of the state of shared buffers at the point the test is run.
@@ -621,6 +636,9 @@ SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
 SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
   FROM pg_stat_get_backend_io(pg_backend_pid())
   WHERE object = 'relation' \gset my_io_sum_shared_before_
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'normal' and object = 'wal' \gset io_sum_wal_normal_before_
 CREATE TABLE test_io_shared(a int);
 INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
 SELECT pg_stat_force_next_flush();
@@ -649,6 +667,13 @@ SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
 SELECT :my_io_sum_shared_after_writes >= :my_io_sum_shared_before_writes;
 SELECT current_setting('fsync') = 'off'
   OR :my_io_sum_shared_after_fsyncs >= :my_io_sum_shared_before_fsyncs;
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'normal' and object = 'wal' \gset io_sum_wal_normal_after_
+SELECT current_setting('synchronous_commit') = 'on';
+SELECT :io_sum_wal_normal_after_writes > :io_sum_wal_normal_before_writes;
+SELECT current_setting('fsync') = 'off'
+  OR :io_sum_wal_normal_after_fsyncs > :io_sum_wal_normal_before_fsyncs;
 
 -- Change the tablespace so that the table is rewritten directly, then SELECT
 -- from it to cause it to be read back into shared buffers.
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index e5888fae2b5..c6f2c384512 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2581,9 +2581,10 @@ description | Waiting for a newly initialized WAL file to reach durable storage
   </para>
 
   <para>
-   Currently, I/O on relations (e.g. tables, indexes) is tracked. However,
-   relation I/O which bypasses shared buffers (e.g. when moving a table from one
-   tablespace to another) is currently not tracked.
+   Currently, I/O on relations (e.g. tables, indexes) and WAL activity are
+   tracked. However, relation I/O which bypasses shared buffers
+   (e.g. when moving a table from one tablespace to another) is currently
+   not tracked.
   </para>
 
   <table id="pg-stat-io-view" xreflabel="pg_stat_io">
@@ -2636,6 +2637,11 @@ description | Waiting for a newly initialized WAL file to reach durable storage
           <literal>temp relation</literal>: Temporary relations.
          </para>
         </listitem>
+        <listitem>
+         <para>
+          <literal>wal</literal>: Write Ahead Logs.
+         </para>
+        </listitem>
        </itemizedlist>
        </para>
       </entry>
@@ -2660,6 +2666,13 @@ description | Waiting for a newly initialized WAL file to reach durable storage
           <literal>normal</literal>.
          </para>
         </listitem>
+        <listitem>
+         <para>
+          <literal>init</literal>: I/O operations performed while creating the
+          WAL segments are tracked in <varname>context</varname>
+          <literal>init</literal>.
+         </para>
+        </listitem>
         <listitem>
          <para>
           <literal>vacuum</literal>: I/O operations performed outside of shared
-- 
2.47.2
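
For readers skimming the patch above, the WAL-specific rules it adds to pgstat_tracks_io_object() and pgstat_tracks_io_op() can be condensed into one standalone predicate. This is only a sketch: the enum and function names (SketchContext, SketchOp, sketch_wal_tracks) are hypothetical stand-ins, not the identifiers from pgstat.h, and the per-backend-type exclusions in the patch are left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for IOContext and IOOp; not the real pgstat.h enums. */
typedef enum
{
    SK_CTX_BULKREAD,
    SK_CTX_BULKWRITE,
    SK_CTX_INIT,
    SK_CTX_NORMAL,
    SK_CTX_VACUUM
} SketchContext;

typedef enum
{
    SK_OP_EVICT,
    SK_OP_EXTEND,
    SK_OP_FSYNC,
    SK_OP_HIT,
    SK_OP_READ,
    SK_OP_REUSE,
    SK_OP_WRITE,
    SK_OP_WRITEBACK
} SketchOp;

/*
 * Returns true if a (context, op) pair is tracked for the "wal" object,
 * following the rules in the patch: "init" allows only writes and fsyncs,
 * "normal" additionally allows reads, and every other context is rejected.
 */
static bool
sketch_wal_tracks(SketchContext ctx, SketchOp op)
{
    switch (ctx)
    {
        case SK_CTX_INIT:
            return op == SK_OP_WRITE || op == SK_OP_FSYNC;
        case SK_CTX_NORMAL:
            return op == SK_OP_WRITE || op == SK_OP_READ ||
                   op == SK_OP_FSYNC;
        default:
            return false;
    }
}

/* Small self-check mirroring what the regression tests look for. */
static void
sketch_wal_tracks_demo(void)
{
    assert(sketch_wal_tracks(SK_CTX_INIT, SK_OP_WRITE));
    assert(sketch_wal_tracks(SK_CTX_NORMAL, SK_OP_READ));
    assert(!sketch_wal_tracks(SK_CTX_INIT, SK_OP_READ));
}
```

This matches the new regression tests, which expect writes and fsyncs in the init context and reads only in the normal context.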

v11-0002-Fetch-timing-columns-from-pg_stat_io-in-the-pg_s.patch (text/x-patch; charset=US-ASCII)

From 35a3e7190653fc88d61d3e65d773e57ed9217a6a Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Mon, 27 Jan 2025 14:05:51 +0300
Subject: [PATCH v11 2/2] Fetch timing columns from pg_stat_io in the
 pg_stat_wal view

The prior commit started tracking WAL I/O timing stats in the
pg_stat_io view. To avoid double accounting and to report the same
timing values in both pg_stat_io and pg_stat_wal, fetch the timing
columns from pg_stat_io in the pg_stat_wal view.
---
 src/include/catalog/pg_proc.dat         |  6 +++---
 src/include/pgstat.h                    |  4 ----
 src/backend/access/transam/xlog.c       | 23 ---------------------
 src/backend/catalog/system_views.sql    | 14 ++++++++++---
 src/backend/utils/activity/pgstat_wal.c |  2 --
 src/backend/utils/adt/pgstatfuncs.c     | 14 +++----------
 src/test/regress/expected/rules.out     | 27 ++++++++++++++++---------
 doc/src/sgml/monitoring.sgml            | 10 ++++++---
 8 files changed, 41 insertions(+), 59 deletions(-)

diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 2aafdbc3e93..aaf2432cc17 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5928,9 +5928,9 @@
 { oid => '1136', descr => 'statistics: information about WAL activity',
   proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
   proparallel => 'r', prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int8,int8,numeric,int8,int8,int8,float8,float8,timestamptz}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o}',
-  proargnames => '{wal_records,wal_fpi,wal_bytes,wal_buffers_full,wal_write,wal_sync,wal_write_time,wal_sync_time,stats_reset}',
+  proallargtypes => '{int8,int8,numeric,int8,int8,int8,timestamptz}',
+  proargmodes => '{o,o,o,o,o,o,o}',
+  proargnames => '{wal_records,wal_fpi,wal_bytes,wal_buffers_full,wal_write,wal_sync,stats_reset}',
   prosrc => 'pg_stat_get_wal' },
 { oid => '6248', descr => 'statistics: information about WAL prefetching',
   proname => 'pg_stat_get_recovery_prefetch', prorows => '1', proretset => 't',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 0a456858c9c..faac508015e 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -477,8 +477,6 @@ typedef struct PgStat_WalStats
 	PgStat_Counter wal_buffers_full;
 	PgStat_Counter wal_write;
 	PgStat_Counter wal_sync;
-	PgStat_Counter wal_write_time;
-	PgStat_Counter wal_sync_time;
 	TimestampTz stat_reset_timestamp;
 } PgStat_WalStats;
 
@@ -493,8 +491,6 @@ typedef struct PgStat_PendingWalStats
 	PgStat_Counter wal_buffers_full;
 	PgStat_Counter wal_write;
 	PgStat_Counter wal_sync;
-	instr_time	wal_write_time;
-	instr_time	wal_sync_time;
 } PgStat_PendingWalStats;
 
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 14c4f62b6bb..4404e7cbac9 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2446,18 +2446,6 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
 
 				PendingWalStats.wal_write++;
 
-				/*
-				 * Increment the I/O timing and the number of times WAL data
-				 * were written out to disk.
-				 */
-				if (track_wal_io_timing)
-				{
-					instr_time	end;
-
-					INSTR_TIME_SET_CURRENT(end);
-					INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_write_time, end, io_start);
-				}
-
 				if (written <= 0)
 				{
 					char		xlogfname[MAXFNAMELEN];
@@ -8752,17 +8740,6 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 
 	pgstat_report_wait_end();
 
-	/*
-	 * Increment the I/O timing and the number of times WAL files were synced.
-	 */
-	if (track_wal_io_timing)
-	{
-		instr_time	end;
-
-		INSTR_TIME_SET_CURRENT(end);
-		INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_sync_time, end, io_start);
-	}
-
 	pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_FSYNC,
 							io_start, 1, 0);
 
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 46868bf7e89..9448e1ec355 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1180,6 +1180,13 @@ SELECT
 FROM pg_stat_get_io() b;
 
 CREATE VIEW pg_stat_wal AS
+    WITH pgsio_sum_write_fsync_time AS (
+        SELECT
+            sum(write_time) as sum_write_time,
+            sum(fsync_time) as sum_fsync_time
+        FROM pg_stat_io
+        WHERE context = 'normal' and object = 'wal'
+    )
     SELECT
         w.wal_records,
         w.wal_fpi,
@@ -1187,10 +1194,11 @@ CREATE VIEW pg_stat_wal AS
         w.wal_buffers_full,
         w.wal_write,
         w.wal_sync,
-        w.wal_write_time,
-        w.wal_sync_time,
+        p.sum_write_time as write_time,
+        p.sum_fsync_time as fsync_time,
         w.stats_reset
-    FROM pg_stat_get_wal() w;
+    FROM pg_stat_get_wal() w
+        CROSS JOIN pgsio_sum_write_fsync_time p;
 
 CREATE VIEW pg_stat_progress_analyze AS
     SELECT
diff --git a/src/backend/utils/activity/pgstat_wal.c b/src/backend/utils/activity/pgstat_wal.c
index 18fa6b2936a..8ee650cdc30 100644
--- a/src/backend/utils/activity/pgstat_wal.c
+++ b/src/backend/utils/activity/pgstat_wal.c
@@ -126,8 +126,6 @@ pgstat_wal_flush_cb(bool nowait)
 	WALSTAT_ACC(wal_buffers_full, PendingWalStats);
 	WALSTAT_ACC(wal_write, PendingWalStats);
 	WALSTAT_ACC(wal_sync, PendingWalStats);
-	WALSTAT_ACC_INSTR_TIME(wal_write_time);
-	WALSTAT_ACC_INSTR_TIME(wal_sync_time);
 #undef WALSTAT_ACC_INSTR_TIME
 #undef WALSTAT_ACC
 
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 0f5e0a9778d..5de31979bc0 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1609,7 +1609,7 @@ pg_stat_get_backend_io(PG_FUNCTION_ARGS)
 Datum
 pg_stat_get_wal(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_COLS	9
+#define PG_STAT_GET_WAL_COLS	7
 	TupleDesc	tupdesc;
 	Datum		values[PG_STAT_GET_WAL_COLS] = {0};
 	bool		nulls[PG_STAT_GET_WAL_COLS] = {0};
@@ -1630,11 +1630,7 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
 					   INT8OID, -1, 0);
 	TupleDescInitEntry(tupdesc, (AttrNumber) 6, "wal_sync",
 					   INT8OID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 7, "wal_write_time",
-					   FLOAT8OID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 8, "wal_sync_time",
-					   FLOAT8OID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 9, "stats_reset",
+	TupleDescInitEntry(tupdesc, (AttrNumber) 7, "stats_reset",
 					   TIMESTAMPTZOID, -1, 0);
 
 	BlessTupleDesc(tupdesc);
@@ -1657,11 +1653,7 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
 	values[4] = Int64GetDatum(wal_stats->wal_write);
 	values[5] = Int64GetDatum(wal_stats->wal_sync);
 
-	/* Convert counters from microsec to millisec for display */
-	values[6] = Float8GetDatum(((double) wal_stats->wal_write_time) / 1000.0);
-	values[7] = Float8GetDatum(((double) wal_stats->wal_sync_time) / 1000.0);
-
-	values[8] = TimestampTzGetDatum(wal_stats->stat_reset_timestamp);
+	values[6] = TimestampTzGetDatum(wal_stats->stat_reset_timestamp);
 
 	/* Returns the record as Datum */
 	PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 856a8349c50..a15e0096d9c 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2241,16 +2241,23 @@ pg_stat_user_tables| SELECT relid,
     autoanalyze_count
    FROM pg_stat_all_tables
   WHERE ((schemaname <> ALL (ARRAY['pg_catalog'::name, 'information_schema'::name])) AND (schemaname !~ '^pg_toast'::text));
-pg_stat_wal| SELECT wal_records,
-    wal_fpi,
-    wal_bytes,
-    wal_buffers_full,
-    wal_write,
-    wal_sync,
-    wal_write_time,
-    wal_sync_time,
-    stats_reset
-   FROM pg_stat_get_wal() w(wal_records, wal_fpi, wal_bytes, wal_buffers_full, wal_write, wal_sync, wal_write_time, wal_sync_time, stats_reset);
+pg_stat_wal| WITH pgsio_sum_write_fsync_time AS (
+         SELECT sum(pg_stat_io.write_time) AS sum_write_time,
+            sum(pg_stat_io.fsync_time) AS sum_fsync_time
+           FROM pg_stat_io
+          WHERE ((pg_stat_io.context = 'normal'::text) AND (pg_stat_io.object = 'wal'::text))
+        )
+ SELECT w.wal_records,
+    w.wal_fpi,
+    w.wal_bytes,
+    w.wal_buffers_full,
+    w.wal_write,
+    w.wal_sync,
+    p.sum_write_time AS write_time,
+    p.sum_fsync_time AS fsync_time,
+    w.stats_reset
+   FROM (pg_stat_get_wal() w(wal_records, wal_fpi, wal_bytes, wal_buffers_full, wal_write, wal_sync, stats_reset)
+     CROSS JOIN pgsio_sum_write_fsync_time p);
 pg_stat_wal_receiver| SELECT pid,
     status,
     receive_start_lsn,
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index c6f2c384512..516c4b96112 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -4965,7 +4965,9 @@ description | Waiting for a newly initialized WAL file to reach durable storage
         <listitem>
          <para>
           <literal>io</literal>: Reset all the counters shown in the
-          <structname>pg_stat_io</structname> view.
+          <structname>pg_stat_io</structname> view. This resets the timing
+          counter stored in the <structname>pg_stat_wal</structname> view
+          too.
          </para>
         </listitem>
         <listitem>
@@ -4982,8 +4984,10 @@ description | Waiting for a newly initialized WAL file to reach durable storage
         </listitem>
         <listitem>
          <para>
-          <literal>wal</literal>: Reset all the counters shown in the
-          <structname>pg_stat_wal</structname> view.
+          <literal>wal</literal>: Reset all the counters except timings shown
+          in the <structname>pg_stat_wal</structname> view. The timing
+          counters can be reset by calling pg_stat_reset_shared set to
+          <literal>io</literal>.
          </para>
         </listitem>
         <listitem>
-- 
2.47.2

#63

Nazir Bilal Yavuz

byavuz81@gmail.com

12 months ago

In reply to: Nazir Bilal Yavuz (#62)

Re: Show WAL write and fsync stats in pg_stat_io

Hi,

On Mon, 27 Jan 2025 at 16:59, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:

On Mon, 27 Jan 2025 at 03:52, Michael Paquier <michael@paquier.xyz> wrote:

In 0002, you are removing PendingWalStats.wal_write_time, which does
not seem OK to me because we should still aggregate this data for
track_wal_io_timing, no?

We use PendingWalStats.wal_[ write | sync ]_time only to show timings
on the pg_stat_wal view, right? And now these pg_stat_wal.wal_[ write
| sync ]_time values are fetched from the pg_stat_io view when
track_wal_io_timing is on. So, I think it is correct to remove these.

I made a mistake while splitting the patches. The places where
PendingWalStats.wal_[ write | sync ]_time are incremented (the code
piece you shared) are removed in 0002 (0001 now), but they should be
removed in 0003 (0002 now) instead. This is corrected in v11.

Oops, I forgot to add one thing to the previous mail.

If we agree with removing the PendingWalStats.wal_[ write | sync ]_time
variables, then it would make sense to remove the PgStat_PendingWalStats
struct completely. We have that struct because [1] it is cheap to
store PendingWalStats.wal_[ write | sync ]_time as instr_time instead
of PgStat_Counter.

[1]: ca7b3c4c00

--
Regards,
Nazir Bilal Yavuz
Microsoft
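
To illustrate the instr_time point in the message above: one way to read "cheap" is that the pending (backend-local) struct can accumulate raw clock ticks with a single addition per operation, and convert to the reporting unit only once at flush time. The sketch below assumes nanosecond ticks and microsecond counters; all names (SketchPending, SketchShared, sketch_count_write, sketch_flush) are hypothetical and do not come from the PostgreSQL source.

```c
#include <assert.h>
#include <stdint.h>

/* Backend-local accumulator: raw ticks, no unit conversion per operation. */
typedef struct SketchPending
{
    int64_t     write_ticks_ns;
    int64_t     writes;
} SketchPending;

/* Shared counters: time is kept in microseconds, like a plain counter. */
typedef struct SketchShared
{
    int64_t     write_time_us;
    int64_t     writes;
} SketchShared;

/* Hot path: one subtraction and one addition per write. */
static void
sketch_count_write(SketchPending *p, int64_t start_ns, int64_t end_ns)
{
    assert(end_ns >= start_ns);
    p->write_ticks_ns += end_ns - start_ns;
    p->writes++;
}

/* Cold path: convert once, add to the shared counters, reset the pending ones. */
static void
sketch_flush(SketchPending *p, SketchShared *s)
{
    s->write_time_us += p->write_ticks_ns / 1000;
    s->writes += p->writes;
    p->write_ticks_ns = 0;
    p->writes = 0;
}
```

Once the timing fields move to pg_stat_io, nothing in the pending struct needs this two-step representation any more, which is the argument for dropping the separate struct.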

#64

Michael Paquier

michael@paquier.xyz

12 months ago

In reply to: Nazir Bilal Yavuz (#63)

Re: Show WAL write and fsync stats in pg_stat_io

On Mon, Jan 27, 2025 at 05:13:39PM +0300, Nazir Bilal Yavuz wrote:

On Mon, 27 Jan 2025 at 16:59, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:

On Mon, 27 Jan 2025 at 03:52, Michael Paquier <michael@paquier.xyz> wrote:
We use PendingWalStats.wal_[ write | sync ]_time only to show timings
on the pg_stat_wal view, right? And now these pg_stat_wal.wal_[ write
| sync ]_time values are fetched from the pg_stat_io view when
track_wal_io_timing is on. So, I think it is correct to remove these.

As you say, removing the counters in the second patch does not matter
as if you are going to combine them and..

I made a mistake while splitting the patches. The places where
'PendingWalStats.wal_[ write | sync ]_time are incremented (the code
piece you shared)' are removed in 0002 (0001 now), but they should be
removed in 0003 (0002 now) instead. This is corrected in v11.

My issue was in the first patch that should not have removed them. My
apologies for the confusion, I should have pointed out that this was
likely an incorrect rebase and/or patch split.

If we agree with removing PendingWalStats.wal_[ write | sync ]_time
variables, then it would make sense to remove PgStat_PendingWalStats
struct completely. We have that struct because [1] it is cheap to
store PendingWalStats.wal_[ write | sync ]_time as instr_time instead
of PgStat_Counter.

[1] ca7b3c4c00

I agree that some simplification would be nice because it also lets
Bertrand's patch around [1] avoid some special handling with
PgStat_PendingWalStats without us losing monitoring capabilities, I
hope. So maximizing simplifications before integrating more
capabilities should make the whole implementation effort easier.

What you are doing in 0001 is a good first step towards this goal, as this
also plugs in a few things for backend statistics with the calls to
pgstat_count_io_op[_time]().
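
For readers following along, the call-site pattern being plugged in looks roughly like the sketch below: take a start time only if timing is enabled, perform the I/O, then count the operation (and its duration, if tracked) in one call. This is a standalone approximation, not the pgstat API; IoStats, prepare_io_time() and count_io_op_time() are hypothetical stand-ins, and timespec_get() replaces the instr_time macros.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical per-(object, context, op) counters. */
typedef struct IoStats
{
	uint64_t	count;			/* number of operations */
	uint64_t	bytes;			/* bytes transferred */
	int64_t		time_ns;		/* accumulated time, only if tracked */
} IoStats;

/* Return the current time if timing is enabled, else a zeroed value. */
static struct timespec
prepare_io_time(bool track)
{
	struct timespec ts = {0};

	if (track)
		timespec_get(&ts, TIME_UTC);
	return ts;
}

/* Count one batch of I/O; accumulate elapsed time only when tracked. */
static void
count_io_op_time(IoStats *st, bool track, struct timespec start,
				 uint32_t cnt, uint64_t bytes)
{
	if (track)
	{
		struct timespec end;

		timespec_get(&end, TIME_UTC);
		st->time_ns += (int64_t) (end.tv_sec - start.tv_sec) * 1000000000
			+ (int64_t) (end.tv_nsec - start.tv_nsec);
	}

	/* Counts and bytes are recorded whether or not timing is enabled. */
	st->count += cnt;
	st->bytes += bytes;
}
```

A caller would wrap the actual write or fsync between prepare_io_time() and count_io_op_time(), which is the shape the hunks in xlog.c follow.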

+       /* Report pending statistics to the cumulative stats system */
+       pgstat_flush_io(false);

This has been discussed previously under a pgstat_report_wal() call.
Why do you need this specific call? Perhaps this should be documented
as a comment?

+ if (io_object == IOOBJECT_WAL)
+ return track_wal_io_timing

Hmm. Andres has commented in the past that we want pg_stat_io to
serve as a central place for all the I/O statistics. Thinking more
about that, I am not really convinced that we actually need to make
this area of the code in pgstat_io.c rely on two GUCs. How about
simplifying things so that we just rely on track_io_timing for
everything, without creating a strange dependency on the IOObject with
more routines like pgstat_should_track_io_time()? I'd really want
less of these GUCs, not more of them with cross-dependencies depending
on the stats kinds we are dealing with. If we replace the timings
from pg_stat_wal with the ones in pg_stat_io, we should be in a good
position to remove track_wal_io_timing entirely, as well. This has
the merit of making your patch a lot simpler, meaning extra bonus
points.

[1]: https://commitfest.postgresql.org/52/5492/
--
Michael

#65

Nazir Bilal Yavuz

byavuz81@gmail.com

12 months ago

In reply to: Bertrand Drouvot (#61)

Re: Show WAL write and fsync stats in pg_stat_io

Hi,

On Mon, 27 Jan 2025 at 10:45, Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

Hi,

On Fri, Jan 24, 2025 at 06:29:46PM +0300, Nazir Bilal Yavuz wrote:

Hi,

Thanks for looking into this!

On Fri, 24 Jan 2025 at 17:20, Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

I did not look at the code yet but did a few tests.
I can see diff between pg_stat_wal and pg_stat_io, for example:

"
postgres=# select pg_stat_reset_shared();
pg_stat_reset_shared
----------------------

(1 row)

postgres=# insert into bdt select a from generate_series(1,200000) a ;
INSERT 0 200000

postgres=# select wal_bytes,stats_reset from pg_stat_wal;
wal_bytes | stats_reset
-----------+-------------------------------
11800088 | 2025-01-24 14:17:28.507994+00
(1 row)

postgres=# select sum(write_bytes),stats_reset from pg_stat_io where object = 'wal' group by stats_reset;
sum | stats_reset
----------+-------------------------------
12853248 | 2025-01-24 14:17:28.507988+00
(1 row)

Is that expected?

I am not sure, I think they need to be the same.

Yeah I think so (or at least we should document the reason(s) why (and how)
they could differ).

I could not
understand the cause of the difference at first glance. I will look
into this and will come back to you.

Thanks.

Sorry for replying late. It turns out that the difference is expected.
It is because pg_stat_wal.wal_bytes tracks the total amount of WAL
generated, while pg_stat_io tracks the I/Os that are written to disk.

From the docs of pg_stat_io: pg_stat_io view will contain one row for
each combination of backend type, target I/O object, and I/O context,
showing cluster-wide I/O statistics. Combinations which do not make
sense are omitted.

From the docs of pg_stat_wal.wal_bytes: Total amount of WAL generated in bytes.

It is explicitly said that pg_stat_io tracks I/O operations and
pg_stat_wal.wal_bytes tracks total amount of WAL generated in bytes. I
think this is clear enough. Do you think we still need additional
explanation?
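
One way to see how "bytes written" can exceed "bytes generated" is the sketch below. It rests on assumptions that are not stated in this thread: that WAL is written out in whole pages, and that the page size is the default 8192 bytes (SKETCH_BLCKSZ is a made-up name). Under those assumptions a partially filled page is written in full, and written again when more WAL lands on it. It is at least consistent with the numbers above: 12853248 is exactly 1569 * 8192, while 11800088 is not a multiple of 8192.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_BLCKSZ 8192		/* assumed WAL page size */

/*
 * Bytes a page-granular writer issues to flush the WAL byte range
 * [from, upto): every page touched by the range is written in full.
 */
static uint64_t
bytes_written_for_flush(uint64_t from, uint64_t upto)
{
	uint64_t	first_page;
	uint64_t	end_page;

	assert(from <= upto);
	if (from == upto)
		return 0;

	first_page = from / SKETCH_BLCKSZ;
	end_page = (upto + SKETCH_BLCKSZ - 1) / SKETCH_BLCKSZ;

	return (end_page - first_page) * SKETCH_BLCKSZ;
}
```

For example, two small flushes covering bytes 0..100 and 100..200 generate 200 bytes of WAL but would issue two full-page writes to the same page in this model.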

--
Regards,
Nazir Bilal Yavuz
Microsoft

#66

Nazir Bilal Yavuz

byavuz81@gmail.com

12 months ago

In reply to: Michael Paquier (#64)

2 attachment(s)

Re: Show WAL write and fsync stats in pg_stat_io

Hi,

On Tue, 28 Jan 2025 at 07:23, Michael Paquier <michael@paquier.xyz> wrote:

On Mon, Jan 27, 2025 at 05:13:39PM +0300, Nazir Bilal Yavuz wrote:

On Mon, 27 Jan 2025 at 16:59, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:

On Mon, 27 Jan 2025 at 03:52, Michael Paquier <michael@paquier.xyz> wrote:
We use PendingWalStats.wal_[ write | sync ]_time only to show timings
on the pg_stat_wal view, right? And now these pg_stat_wal.wal_[ write
| sync ]_time data are fetched from the pg_stat_io view when the
track_wal_io_timing is on. So, I think it is correct to remove these.

As you say, removing the counters in the second patch does not matter
as if you are going to combine them and..

I made a mistake while splitting the patches. The places where
'PendingWalStats.wal_[ write | sync ]_time are incremented (the code
piece you shared)' are removed in 0002 (0001 now), but they should be
removed in 0003 (0002 now) instead. This is corrected in v11.

My issue was in the first patch that should not have removed them. My
apologies for the confusion, I should have pointed out that this was
likely an incorrect rebase and/or patch split.

Ah, I see. Thanks for clarifying!

If we agree with removing PendingWalStats.wal_[ write | sync ]_time
variables, then it would make sense to remove PgStat_PendingWalStats
struct completely. We have that struct because [1] it is cheap to
store PendingWalStats.wal_[ write | sync ]_time as instr_time instead
of PgStat_Counter.

[1] ca7b3c4c00

I agree that some simplification would be nice because it also lets
Bertrand's patch around [1] avoid some special handling with
PgStat_PendingWalStats without us losing monitoring capabilities, I
hope. So maximizing simplifications before integrating more
capabilities should make the whole implementation effort easier.

What you are doing in 0001 is a good first step towards this goal, as this
also plugs in a few things for backend statistics with the calls to
pgstat_count_io_op[_time]().

I agree. Do you think that we need to do this simplification in this
thread or does it need its own thread?

+       /* Report pending statistics to the cumulative stats system */
+       pgstat_flush_io(false);
This has been discussed previously under a pgstat_report_wal() call.
Why do you need this specific call? Perhaps this should be documented
as a comment?

Without this change, stats in the startup process would only get
reported during shutdown of the startup process, so I added this but I
was wrong. It was already fixed [1], so yes; we do not need it now.
This is removed in v12.

+ if (io_object == IOOBJECT_WAL)
+ return track_wal_io_timing

Hmm. Andres has commented in the past that we want pg_stat_io to
serve as a central place for all the I/O statistics. Thinking more
about that, I am not really convinced that we actually need to make
this area of the code in pgstat_io.c rely on two GUCs. How about
simplifying things so that we just rely on track_io_timing for
everything, without creating a strange dependency on the IOObject with
more routines like pgstat_should_track_io_time()? I'd really want
less of these GUCs, not more of them with cross-dependencies depending
on the stats kinds we are dealing with. If we replace the timings
from pg_stat_wal with the ones in pg_stat_io, we should be in a good
position to remove track_wal_io_timing entirely, as well. This has
the merit of making your patch a lot simpler, meaning extra bonus
points.

I agree with you but it was discussed before in this thread [2]. It
was decided to use both track_wal_io_timing and track_io_timing
because of the overhead that track_wal_io_timing creates but we can
still re-discuss it. Do you think that this discussion needs its own
thread?

If we continue to discuss it in this thread, I am in favor of removing
track_wal_io_timing and using track_io_timing for all types of I/Os.
Like you said, this cross-dependency makes things more complex than
they used to be. The downside of removing track_wal_io_timing is that it
affects people who:

1- Want to track timings of only WAL I/Os.
2- Want to track timings of all IOs except WAL I/Os.

I think the first group is more important than the second because
track_io_timing already creates overhead.

One additional thing is that I think track_io_timing is a general
name. Given that it exists, I would not expect there to be another GUC
like track_wal_io_timing to track WAL I/Os' timings.
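
To make the trade-off concrete, the two designs can be sketched as below. This is a standalone approximation rather than the patch itself: SketchIOObject and both function names are hypothetical, and the GUCs are passed as plain parameters. The first function mirrors the per-object check in v12; the second is the simplification where track_io_timing alone decides.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the IOObject enum. */
typedef enum SketchIOObject
{
	SK_RELATION,
	SK_TEMP_RELATION,
	SK_WAL
} SketchIOObject;

/* v12-style: WAL timings follow their own GUC, everything else the other. */
static bool
should_track_two_gucs(SketchIOObject obj, bool track_io_timing,
					  bool track_wal_io_timing)
{
	if (obj == SK_WAL)
		return track_wal_io_timing;

	return track_io_timing;
}

/* Simplified: a single GUC decides for every kind of I/O object. */
static bool
should_track_one_guc(SketchIOObject obj, bool track_io_timing)
{
	(void) obj;					/* the object no longer matters */
	return track_io_timing;
}
```

With the single-GUC version, neither "WAL timings only" nor "everything except WAL" can be expressed, which is exactly the two groups listed above.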

[1]: e3cb1a586c
[2]: /messages/by-id/ZUmuJ7P8THHz03nx@paquier.xyz

--
Regards,
Nazir Bilal Yavuz
Microsoft

Attachments:

v12-0001-Add-WAL-I-O-stats-to-both-pg_stat_io-view-and-pe.patch (text/x-patch; charset=US-ASCII)

From 2f473008518de14154765193415dfcc7e7910ba5 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Mon, 27 Jan 2025 14:03:54 +0300
Subject: [PATCH v12 1/2] Add WAL I/O stats to both pg_stat_io view and per
 backend I/O statistics

This commit adds WAL I/O stats to both pg_stat_io view and per backend
I/O statistics (pg_stat_get_backend_io()).

This commit introduces three new I/O concepts:

- IOObject IOOBJECT_WAL is used for tracking all WAL I/Os.
- IOOBJECT_WAL / IOCONTEXT_NORMAL is used for tracking I/O operations
  done on already created wal segments.
- IOOBJECT_WAL / IOCONTEXT_INIT is used for tracking I/O operations done
  while creating the WAL segments.

For now 027_stream_regress.pl -> 'check contents of pg_stat_statements on
regression database' test fails. It is because walsenders may shutdown
after the checkpointer. There is ongoing discussion to fix that problem:
postgr.es/m/flat/kgng5nrvnlv335evmsuvpnh354rw7qyazl73kdysev2cr2v5zu%40m3cfzxicm5kp
---
 src/include/pgstat.h                        |  4 +-
 src/backend/access/transam/xlog.c           | 41 ++++++-----
 src/backend/access/transam/xlogreader.c     |  8 +++
 src/backend/access/transam/xlogrecovery.c   | 11 +++
 src/backend/utils/activity/pgstat_backend.c |  9 +--
 src/backend/utils/activity/pgstat_io.c      | 79 ++++++++++++++++++---
 src/test/regress/expected/stats.out         | 53 ++++++++++++++
 src/test/regress/sql/stats.sql              | 25 +++++++
 doc/src/sgml/monitoring.sgml                | 19 ++++-
 9 files changed, 216 insertions(+), 33 deletions(-)

diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 81ec0161c09..a61b488e8d8 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -274,14 +274,16 @@ typedef enum IOObject
 {
 	IOOBJECT_RELATION,
 	IOOBJECT_TEMP_RELATION,
+	IOOBJECT_WAL,
 } IOObject;
 
-#define IOOBJECT_NUM_TYPES (IOOBJECT_TEMP_RELATION + 1)
+#define IOOBJECT_NUM_TYPES (IOOBJECT_WAL + 1)
 
 typedef enum IOContext
 {
 	IOCONTEXT_BULKREAD,
 	IOCONTEXT_BULKWRITE,
+	IOCONTEXT_INIT,
 	IOCONTEXT_NORMAL,
 	IOCONTEXT_VACUUM,
 } IOContext;
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index bf3dbda901d..14c4f62b6bb 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2425,7 +2425,7 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
 			Size		nbytes;
 			Size		nleft;
 			ssize_t		written;
-			instr_time	start;
+			instr_time	io_start;
 
 			/* OK to write the page(s) */
 			from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
@@ -2435,16 +2435,17 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
 			{
 				errno = 0;
 
-				/* Measure I/O timing to write WAL data */
-				if (track_wal_io_timing)
-					INSTR_TIME_SET_CURRENT(start);
-				else
-					INSTR_TIME_SET_ZERO(start);
+				io_start = pgstat_prepare_io_time(track_wal_io_timing);
 
 				pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
 				written = pg_pwrite(openLogFile, from, nleft, startoffset);
 				pgstat_report_wait_end();
 
+				pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL,
+										IOOP_WRITE, io_start, 1, written);
+
+				PendingWalStats.wal_write++;
+
 				/*
 				 * Increment the I/O timing and the number of times WAL data
 				 * were written out to disk.
@@ -2454,11 +2455,9 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
 					instr_time	end;
 
 					INSTR_TIME_SET_CURRENT(end);
-					INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_write_time, end, start);
+					INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_write_time, end, io_start);
 				}
 
-				PendingWalStats.wal_write++;
-
 				if (written <= 0)
 				{
 					char		xlogfname[MAXFNAMELEN];
@@ -3216,6 +3215,7 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	int			fd;
 	int			save_errno;
 	int			open_flags = O_RDWR | O_CREAT | O_EXCL | PG_BINARY;
+	instr_time	io_start;
 
 	Assert(logtli != 0);
 
@@ -3259,6 +3259,8 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 				(errcode_for_file_access(),
 				 errmsg("could not create file \"%s\": %m", tmppath)));
 
+	io_start = pgstat_prepare_io_time(track_wal_io_timing);
+
 	pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_WRITE);
 	save_errno = 0;
 	if (wal_init_zero)
@@ -3294,6 +3296,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	}
 	pgstat_report_wait_end();
 
+	pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_INIT, IOOP_WRITE,
+							io_start, 1, wal_segment_size);
+
 	if (save_errno)
 	{
 		/*
@@ -3310,6 +3315,8 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 				 errmsg("could not write to file \"%s\": %m", tmppath)));
 	}
 
+	io_start = pgstat_prepare_io_time(track_wal_io_timing);
+
 	pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_SYNC);
 	if (pg_fsync(fd) != 0)
 	{
@@ -3322,6 +3329,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	}
 	pgstat_report_wait_end();
 
+	pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_INIT,
+							IOOP_FSYNC, io_start, 1, 0);
+
 	if (close(fd) != 0)
 		ereport(ERROR,
 				(errcode_for_file_access(),
@@ -8683,7 +8693,7 @@ void
 issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 {
 	char	   *msg = NULL;
-	instr_time	start;
+	instr_time	io_start;
 
 	Assert(tli != 0);
 
@@ -8696,11 +8706,7 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 		wal_sync_method == WAL_SYNC_METHOD_OPEN_DSYNC)
 		return;
 
-	/* Measure I/O timing to sync the WAL file */
-	if (track_wal_io_timing)
-		INSTR_TIME_SET_CURRENT(start);
-	else
-		INSTR_TIME_SET_ZERO(start);
+	io_start = pgstat_prepare_io_time(track_wal_io_timing);
 
 	pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC);
 	switch (wal_sync_method)
@@ -8754,9 +8760,12 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 		instr_time	end;
 
 		INSTR_TIME_SET_CURRENT(end);
-		INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_sync_time, end, start);
+		INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_sync_time, end, io_start);
 	}
 
+	pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_FSYNC,
+							io_start, 1, 0);
+
 	PendingWalStats.wal_sync++;
 }
 
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 91b6a91767d..e8d97cf0732 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1507,6 +1507,9 @@ WALRead(XLogReaderState *state,
 	char	   *p;
 	XLogRecPtr	recptr;
 	Size		nbytes;
+#ifndef FRONTEND
+	instr_time	io_start;
+#endif
 
 	p = buf;
 	recptr = startptr;
@@ -1552,6 +1555,8 @@ WALRead(XLogReaderState *state,
 			segbytes = nbytes;
 
 #ifndef FRONTEND
+		io_start = pgstat_prepare_io_time(track_wal_io_timing);
+
 		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
 #endif
 
@@ -1561,6 +1566,9 @@ WALRead(XLogReaderState *state,
 
 #ifndef FRONTEND
 		pgstat_report_wait_end();
+
+		pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_READ,
+								io_start, 1, readbytes);
 #endif
 
 		if (readbytes <= 0)
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index cf2b007806f..c3d7cec1e09 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -60,6 +60,7 @@
 #include "utils/datetime.h"
 #include "utils/fmgrprotos.h"
 #include "utils/guc_hooks.h"
+#include "utils/pgstat_internal.h"
 #include "utils/pg_lsn.h"
 #include "utils/ps_status.h"
 #include "utils/pg_rusage.h"
@@ -3306,6 +3307,7 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 	uint32		targetPageOff;
 	XLogSegNo	targetSegNo PG_USED_FOR_ASSERTS_ONLY;
 	int			r;
+	instr_time	io_start;
 
 	XLByteToSeg(targetPagePtr, targetSegNo, wal_segment_size);
 	targetPageOff = XLogSegmentOffset(targetPagePtr, wal_segment_size);
@@ -3398,6 +3400,8 @@ retry:
 	/* Read the requested page */
 	readOff = targetPageOff;
 
+	io_start = pgstat_prepare_io_time(track_wal_io_timing);
+
 	pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
 	r = pg_pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);
 	if (r != XLOG_BLCKSZ)
@@ -3406,6 +3410,10 @@ retry:
 		int			save_errno = errno;
 
 		pgstat_report_wait_end();
+
+		pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_READ,
+								io_start, 1, r);
+
 		XLogFileName(fname, curFileTLI, readSegNo, wal_segment_size);
 		if (r < 0)
 		{
@@ -3426,6 +3434,9 @@ retry:
 	}
 	pgstat_report_wait_end();
 
+	pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_READ,
+							io_start, 1, r);
+
 	Assert(targetSegNo == readSegNo);
 	Assert(targetPageOff == readOff);
 	Assert(reqLen <= readLen);
diff --git a/src/backend/utils/activity/pgstat_backend.c b/src/backend/utils/activity/pgstat_backend.c
index bcf9e4b1487..338da73a9a9 100644
--- a/src/backend/utils/activity/pgstat_backend.c
+++ b/src/backend/utils/activity/pgstat_backend.c
@@ -24,6 +24,7 @@
 
 #include "postgres.h"
 
+#include "access/xlog.h"
 #include "storage/bufmgr.h"
 #include "utils/memutils.h"
 #include "utils/pgstat_internal.h"
@@ -43,7 +44,7 @@ void
 pgstat_count_backend_io_op_time(IOObject io_object, IOContext io_context,
 								IOOp io_op, instr_time io_time)
 {
-	Assert(track_io_timing);
+	Assert(track_io_timing || track_wal_io_timing);
 
 	if (!pgstat_tracks_backend_bktype(MyBackendType))
 		return;
@@ -236,9 +237,6 @@ pgstat_tracks_backend_bktype(BackendType bktype)
 		case B_DEAD_END_BACKEND:
 		case B_ARCHIVER:
 		case B_LOGGER:
-		case B_WAL_RECEIVER:
-		case B_WAL_WRITER:
-		case B_WAL_SUMMARIZER:
 		case B_BG_WRITER:
 		case B_CHECKPOINTER:
 		case B_STARTUP:
@@ -249,7 +247,10 @@ pgstat_tracks_backend_bktype(BackendType bktype)
 		case B_BG_WORKER:
 		case B_STANDALONE_BACKEND:
 		case B_SLOTSYNC_WORKER:
+		case B_WAL_RECEIVER:
 		case B_WAL_SENDER:
+		case B_WAL_SUMMARIZER:
+		case B_WAL_WRITER:
 			return true;
 	}
 
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 6ff5d9e96a1..f6e5d260c92 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -16,6 +16,7 @@
 
 #include "postgres.h"
 
+#include "access/xlog.h"
 #include "executor/instrument.h"
 #include "storage/bufmgr.h"
 #include "utils/pgstat_internal.h"
@@ -105,6 +106,20 @@ pgstat_prepare_io_time(bool track_io_guc)
 	return io_start;
 }
 
+/*
+ * Decide if IO timings need to be tracked.  Timings associated to
+ * IOOBJECT_WAL objects are tracked if track_wal_io_timing is enabled,
+ * else rely on track_io_timing.
+ */
+static bool
+pgstat_should_track_io_time(IOObject io_object)
+{
+	if (io_object == IOOBJECT_WAL)
+		return track_wal_io_timing;
+
+	return track_io_timing;
+}
+
 /*
  * Like pgstat_count_io_op() except it also accumulates time.
  */
@@ -112,14 +127,24 @@ void
 pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
 						instr_time start_time, uint32 cnt, uint64 bytes)
 {
-	if (track_io_timing)
+	/*
+	 * Accumulate timing data.  pgstat_count_buffer is for pgstat_database. As
+	 * pg_stat_database only counts blk_read_time and blk_write_time, it is
+	 * set for IOOP_READ, IOOP_WRITE and IOOP_EXTEND.
+	 *
+	 * pgBufferUsage is for EXPLAIN.  pgBufferUsage has write and read stats
+	 * for shared, local and temporary blocks.  Temporary blocks are ignored
+	 * here.
+	 */
+	if (pgstat_should_track_io_time(io_object))
 	{
 		instr_time	io_time;
 
 		INSTR_TIME_SET_CURRENT(io_time);
 		INSTR_TIME_SUBTRACT(io_time, start_time);
 
-		if (io_op == IOOP_WRITE || io_op == IOOP_EXTEND)
+		if (io_object != IOOBJECT_WAL &&
+			(io_op == IOOP_WRITE || io_op == IOOP_EXTEND))
 		{
 			pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
 			if (io_object == IOOBJECT_RELATION)
@@ -127,7 +152,7 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
 			else if (io_object == IOOBJECT_TEMP_RELATION)
 				INSTR_TIME_ADD(pgBufferUsage.local_blk_write_time, io_time);
 		}
-		else if (io_op == IOOP_READ)
+		else if (io_object != IOOBJECT_WAL && io_op == IOOP_READ)
 		{
 			pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
 			if (io_object == IOOBJECT_RELATION)
@@ -241,6 +266,8 @@ pgstat_get_io_context_name(IOContext io_context)
 			return "bulkread";
 		case IOCONTEXT_BULKWRITE:
 			return "bulkwrite";
+		case IOCONTEXT_INIT:
+			return "init";
 		case IOCONTEXT_NORMAL:
 			return "normal";
 		case IOCONTEXT_VACUUM:
@@ -260,6 +287,8 @@ pgstat_get_io_object_name(IOObject io_object)
 			return "relation";
 		case IOOBJECT_TEMP_RELATION:
 			return "temp relation";
+		case IOOBJECT_WAL:
+			return "wal";
 	}
 
 	elog(ERROR, "unrecognized IOObject value: %d", io_object);
@@ -332,8 +361,6 @@ pgstat_io_snapshot_cb(void)
 * - Syslogger because it is not connected to shared memory
 * - Archiver because most relevant archiving IO is delegated to a
 *   specialized command or module
-* - WAL Receiver, WAL Writer, and WAL Summarizer IO are not tracked in
-*   pg_stat_io for now
 *
 * Function returns true if BackendType participates in the cumulative stats
 * subsystem for IO and false if it does not.
@@ -354,9 +381,6 @@ pgstat_tracks_io_bktype(BackendType bktype)
 		case B_DEAD_END_BACKEND:
 		case B_ARCHIVER:
 		case B_LOGGER:
-		case B_WAL_RECEIVER:
-		case B_WAL_WRITER:
-		case B_WAL_SUMMARIZER:
 			return false;
 
 		case B_AUTOVAC_LAUNCHER:
@@ -368,7 +392,10 @@ pgstat_tracks_io_bktype(BackendType bktype)
 		case B_SLOTSYNC_WORKER:
 		case B_STANDALONE_BACKEND:
 		case B_STARTUP:
+		case B_WAL_RECEIVER:
 		case B_WAL_SENDER:
+		case B_WAL_SUMMARIZER:
+		case B_WAL_WRITER:
 			return true;
 	}
 
@@ -394,6 +421,15 @@ pgstat_tracks_io_object(BackendType bktype, IOObject io_object,
 	if (!pgstat_tracks_io_bktype(bktype))
 		return false;
 
+	/*
+	 * Currently, IO on IOOBJECT_WAL objects can only occur in the
+	 * IOCONTEXT_NORMAL and IOCONTEXT_INIT IOContexts.
+	 */
+	if (io_object == IOOBJECT_WAL &&
+		(io_context != IOCONTEXT_NORMAL &&
+		 io_context != IOCONTEXT_INIT))
+		return false;
+
 	/*
 	 * Currently, IO on temporary relations can only occur in the
 	 * IOCONTEXT_NORMAL IOContext.
@@ -460,14 +496,28 @@ pgstat_tracks_io_op(BackendType bktype, IOObject io_object,
 	/*
 	 * Some BackendTypes will not do certain IOOps.
 	 */
-	if ((bktype == B_BG_WRITER || bktype == B_CHECKPOINTER) &&
+	if ((bktype == B_BG_WRITER) &&
 		(io_op == IOOP_READ || io_op == IOOP_EVICT || io_op == IOOP_HIT))
 		return false;
 
+	if ((bktype == B_CHECKPOINTER) &&
+		((io_object != IOOBJECT_WAL && io_op == IOOP_READ) ||
+		 (io_op == IOOP_EVICT || io_op == IOOP_HIT)))
+		return false;
+
 	if ((bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER ||
 		 bktype == B_CHECKPOINTER) && io_op == IOOP_EXTEND)
 		return false;
 
+	/*
+	 * Some BackendTypes don't do reads with IOOBJECT_WAL.
+	 */
+	if (io_object == IOOBJECT_WAL && io_op == IOOP_READ &&
+		(bktype == B_WAL_RECEIVER || bktype == B_BG_WRITER ||
+		 bktype == B_AUTOVAC_LAUNCHER || bktype == B_AUTOVAC_WORKER ||
+		 bktype == B_WAL_WRITER))
+		return false;
+
 	/*
 	 * Temporary tables are not logged and thus do not require fsync'ing.
 	 * Writeback is not requested for temporary tables.
@@ -492,6 +542,17 @@ pgstat_tracks_io_op(BackendType bktype, IOObject io_object,
 	if (!strategy_io_context && io_op == IOOP_REUSE)
 		return false;
 
+	/*
+	 * IOOBJECT_WAL IOObject will not do certain IOOps depending on IOContext.
+	 */
+	if (io_object == IOOBJECT_WAL && io_context == IOCONTEXT_INIT &&
+		!(io_op == IOOP_WRITE || io_op == IOOP_FSYNC))
+		return false;
+
+	if (io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL &&
+		!(io_op == IOOP_WRITE || io_op == IOOP_READ || io_op == IOOP_FSYNC))
+		return false;
+
 	/*
 	 * IOOP_FSYNC IOOps done by a backend using a BufferAccessStrategy are
 	 * counted in the IOCONTEXT_NORMAL IOContext. See comment in
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index a0317b7208e..7dfcaf03b78 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -862,6 +862,33 @@ WHERE pg_stat_get_backend_pid(beid) = pg_backend_pid();
  t
 (1 row)
 
+-- Test pg_stat_io for WAL in an init context, that should do writes
+-- and syncs.
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'init' and object = 'wal' \gset io_sum_wal_init_
+SELECT :io_sum_wal_init_writes > 0;
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT current_setting('fsync') = 'off'
+  OR :io_sum_wal_init_fsyncs > 0;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- Test pg_stat_io for WAL in a normal context, that should do reads as well.
+SELECT SUM(reads) > 0
+  FROM pg_stat_io
+  WHERE context = 'normal' and object = 'wal';
+ ?column? 
+----------
+ t
+(1 row)
+
 -----
 -- Test that resetting stats works for reset timestamp
 -----
@@ -1256,6 +1283,7 @@ SELECT pg_stat_get_subscription_stats(NULL);
 -- - extends of relations using shared buffers
 -- - fsyncs done to ensure the durability of data dirtying shared buffers
 -- - shared buffer hits
+-- - WAL writes and fsyncs in IOContext IOCONTEXT_NORMAL
 -- There is no test for blocks evicted from shared buffers, because we cannot
 -- be sure of the state of shared buffers at the point the test is run.
 -- Create a regular table and insert some data to generate IOCONTEXT_NORMAL
@@ -1273,6 +1301,9 @@ SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
 SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
   FROM pg_stat_get_backend_io(pg_backend_pid())
   WHERE object = 'relation' \gset my_io_sum_shared_before_
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'normal' and object = 'wal' \gset io_sum_wal_normal_before_
 CREATE TABLE test_io_shared(a int);
 INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
 SELECT pg_stat_force_next_flush();
@@ -1335,6 +1366,28 @@ SELECT current_setting('fsync') = 'off'
  t
 (1 row)
 
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'normal' and object = 'wal' \gset io_sum_wal_normal_after_
+SELECT current_setting('synchronous_commit') = 'on';
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT :io_sum_wal_normal_after_writes > :io_sum_wal_normal_before_writes;
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT current_setting('fsync') = 'off'
+  OR :io_sum_wal_normal_after_fsyncs > :io_sum_wal_normal_before_fsyncs;
+ ?column? 
+----------
+ t
+(1 row)
+
 -- Change the tablespace so that the table is rewritten directly, then SELECT
 -- from it to cause it to be read back into shared buffers.
 SELECT sum(reads) AS io_sum_shared_before_reads
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index 399c72bbcf7..9e368637199 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -442,6 +442,20 @@ SELECT (current_schemas(true))[1] = ('pg_temp_' || beid::text) AS match
 FROM pg_stat_get_backend_idset() beid
 WHERE pg_stat_get_backend_pid(beid) = pg_backend_pid();
 
+-- Test pg_stat_io for WAL in an init context, that should do writes
+-- and syncs.
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'init' and object = 'wal' \gset io_sum_wal_init_
+SELECT :io_sum_wal_init_writes > 0;
+SELECT current_setting('fsync') = 'off'
+  OR :io_sum_wal_init_fsyncs > 0;
+
+-- Test pg_stat_io for WAL in a normal context, that should do reads as well.
+SELECT SUM(reads) > 0
+  FROM pg_stat_io
+  WHERE context = 'normal' and object = 'wal';
+
 -----
 -- Test that resetting stats works for reset timestamp
 -----
@@ -602,6 +616,7 @@ SELECT pg_stat_get_subscription_stats(NULL);
 -- - extends of relations using shared buffers
 -- - fsyncs done to ensure the durability of data dirtying shared buffers
 -- - shared buffer hits
+-- - WAL writes and fsyncs in IOContext IOCONTEXT_NORMAL
 
 -- There is no test for blocks evicted from shared buffers, because we cannot
 -- be sure of the state of shared buffers at the point the test is run.
@@ -621,6 +636,9 @@ SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
 SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
   FROM pg_stat_get_backend_io(pg_backend_pid())
   WHERE object = 'relation' \gset my_io_sum_shared_before_
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'normal' and object = 'wal' \gset io_sum_wal_normal_before_
 CREATE TABLE test_io_shared(a int);
 INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
 SELECT pg_stat_force_next_flush();
@@ -649,6 +667,13 @@ SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
 SELECT :my_io_sum_shared_after_writes >= :my_io_sum_shared_before_writes;
 SELECT current_setting('fsync') = 'off'
   OR :my_io_sum_shared_after_fsyncs >= :my_io_sum_shared_before_fsyncs;
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'normal' and object = 'wal' \gset io_sum_wal_normal_after_
+SELECT current_setting('synchronous_commit') = 'on';
+SELECT :io_sum_wal_normal_after_writes > :io_sum_wal_normal_before_writes;
+SELECT current_setting('fsync') = 'off'
+  OR :io_sum_wal_normal_after_fsyncs > :io_sum_wal_normal_before_fsyncs;
 
 -- Change the tablespace so that the table is rewritten directly, then SELECT
 -- from it to cause it to be read back into shared buffers.
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 4e917f159aa..edc2470bcf9 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2581,9 +2581,10 @@ description | Waiting for a newly initialized WAL file to reach durable storage
   </para>
 
   <para>
-   Currently, I/O on relations (e.g. tables, indexes) is tracked. However,
-   relation I/O which bypasses shared buffers (e.g. when moving a table from one
-   tablespace to another) is currently not tracked.
+   Currently, I/O on relations (e.g. tables, indexes) and WAL activity are
+   tracked. However, relation I/O which bypasses shared buffers
+   (e.g. when moving a table from one tablespace to another) is currently
+   not tracked.
   </para>
 
   <table id="pg-stat-io-view" xreflabel="pg_stat_io">
@@ -2636,6 +2637,11 @@ description | Waiting for a newly initialized WAL file to reach durable storage
           <literal>temp relation</literal>: Temporary relations.
          </para>
         </listitem>
+        <listitem>
+         <para>
+          <literal>wal</literal>: Write Ahead Logs.
+         </para>
+        </listitem>
        </itemizedlist>
        </para>
       </entry>
@@ -2660,6 +2666,13 @@ description | Waiting for a newly initialized WAL file to reach durable storage
           <literal>normal</literal>.
          </para>
         </listitem>
+        <listitem>
+         <para>
+          <literal>init</literal>: I/O operations performed while creating the
+          WAL segments are tracked in <varname>context</varname>
+          <literal>init</literal>.
+         </para>
+        </listitem>
         <listitem>
          <para>
           <literal>vacuum</literal>: I/O operations performed outside of shared
-- 
2.47.2

v12-0002-Fetch-timing-columns-from-pg_stat_io-in-the-pg_s.patch (text/x-patch; charset=US-ASCII)

From 5ef211b0e2b83f6b3103d7cf948870951f9d2ccd Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Mon, 27 Jan 2025 14:05:51 +0300
Subject: [PATCH v12 2/2] Fetch timing columns from pg_stat_io in the
 pg_stat_wal view

The prior commit started tracking WAL I/O timing stats in the
pg_stat_io view. To avoid double accounting and to keep the timing
values identical in both pg_stat_io and pg_stat_wal, fetch the timing
columns from pg_stat_io in the pg_stat_wal view.
---
 src/include/catalog/pg_proc.dat         |  6 +++---
 src/include/pgstat.h                    |  4 ----
 src/backend/access/transam/xlog.c       | 23 ---------------------
 src/backend/catalog/system_views.sql    | 14 ++++++++++---
 src/backend/utils/activity/pgstat_wal.c |  2 --
 src/backend/utils/adt/pgstatfuncs.c     | 14 +++----------
 src/test/regress/expected/rules.out     | 27 ++++++++++++++++---------
 doc/src/sgml/monitoring.sgml            | 10 ++++++---
 8 files changed, 41 insertions(+), 59 deletions(-)

diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 5b8c2ad2a54..e8409b4667f 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5944,9 +5944,9 @@
 { oid => '1136', descr => 'statistics: information about WAL activity',
   proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
   proparallel => 'r', prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int8,int8,numeric,int8,int8,int8,float8,float8,timestamptz}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o}',
-  proargnames => '{wal_records,wal_fpi,wal_bytes,wal_buffers_full,wal_write,wal_sync,wal_write_time,wal_sync_time,stats_reset}',
+  proallargtypes => '{int8,int8,numeric,int8,int8,int8,timestamptz}',
+  proargmodes => '{o,o,o,o,o,o,o}',
+  proargnames => '{wal_records,wal_fpi,wal_bytes,wal_buffers_full,wal_write,wal_sync,stats_reset}',
   prosrc => 'pg_stat_get_wal' },
 { oid => '6248', descr => 'statistics: information about WAL prefetching',
   proname => 'pg_stat_get_recovery_prefetch', prorows => '1', proretset => 't',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index a61b488e8d8..c2b722268ce 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -482,8 +482,6 @@ typedef struct PgStat_WalStats
 	PgStat_Counter wal_buffers_full;
 	PgStat_Counter wal_write;
 	PgStat_Counter wal_sync;
-	PgStat_Counter wal_write_time;
-	PgStat_Counter wal_sync_time;
 	TimestampTz stat_reset_timestamp;
 } PgStat_WalStats;
 
@@ -498,8 +496,6 @@ typedef struct PgStat_PendingWalStats
 	PgStat_Counter wal_buffers_full;
 	PgStat_Counter wal_write;
 	PgStat_Counter wal_sync;
-	instr_time	wal_write_time;
-	instr_time	wal_sync_time;
 } PgStat_PendingWalStats;
 
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 14c4f62b6bb..4404e7cbac9 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2446,18 +2446,6 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
 
 				PendingWalStats.wal_write++;
 
-				/*
-				 * Increment the I/O timing and the number of times WAL data
-				 * were written out to disk.
-				 */
-				if (track_wal_io_timing)
-				{
-					instr_time	end;
-
-					INSTR_TIME_SET_CURRENT(end);
-					INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_write_time, end, io_start);
-				}
-
 				if (written <= 0)
 				{
 					char		xlogfname[MAXFNAMELEN];
@@ -8752,17 +8740,6 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 
 	pgstat_report_wait_end();
 
-	/*
-	 * Increment the I/O timing and the number of times WAL files were synced.
-	 */
-	if (track_wal_io_timing)
-	{
-		instr_time	end;
-
-		INSTR_TIME_SET_CURRENT(end);
-		INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_sync_time, end, io_start);
-	}
-
 	pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_FSYNC,
 							io_start, 1, 0);
 
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index cddc3ea9b53..f954673bdc9 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1184,6 +1184,13 @@ SELECT
 FROM pg_stat_get_io() b;
 
 CREATE VIEW pg_stat_wal AS
+    WITH pgsio_sum_write_fsync_time AS (
+        SELECT
+            sum(write_time) as sum_write_time,
+            sum(fsync_time) as sum_fsync_time
+        FROM pg_stat_io
+        WHERE context = 'normal' and object = 'wal'
+    )
     SELECT
         w.wal_records,
         w.wal_fpi,
@@ -1191,10 +1198,11 @@ CREATE VIEW pg_stat_wal AS
         w.wal_buffers_full,
         w.wal_write,
         w.wal_sync,
-        w.wal_write_time,
-        w.wal_sync_time,
+        p.sum_write_time as write_time,
+        p.sum_fsync_time as fsync_time,
         w.stats_reset
-    FROM pg_stat_get_wal() w;
+    FROM pg_stat_get_wal() w
+        CROSS JOIN pgsio_sum_write_fsync_time p;
 
 CREATE VIEW pg_stat_progress_analyze AS
     SELECT
diff --git a/src/backend/utils/activity/pgstat_wal.c b/src/backend/utils/activity/pgstat_wal.c
index 18fa6b2936a..8ee650cdc30 100644
--- a/src/backend/utils/activity/pgstat_wal.c
+++ b/src/backend/utils/activity/pgstat_wal.c
@@ -126,8 +126,6 @@ pgstat_wal_flush_cb(bool nowait)
 	WALSTAT_ACC(wal_buffers_full, PendingWalStats);
 	WALSTAT_ACC(wal_write, PendingWalStats);
 	WALSTAT_ACC(wal_sync, PendingWalStats);
-	WALSTAT_ACC_INSTR_TIME(wal_write_time);
-	WALSTAT_ACC_INSTR_TIME(wal_sync_time);
 #undef WALSTAT_ACC_INSTR_TIME
 #undef WALSTAT_ACC
 
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index e9096a88492..cab87c74871 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1637,7 +1637,7 @@ pg_stat_get_backend_io(PG_FUNCTION_ARGS)
 Datum
 pg_stat_get_wal(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_COLS	9
+#define PG_STAT_GET_WAL_COLS	7
 	TupleDesc	tupdesc;
 	Datum		values[PG_STAT_GET_WAL_COLS] = {0};
 	bool		nulls[PG_STAT_GET_WAL_COLS] = {0};
@@ -1658,11 +1658,7 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
 					   INT8OID, -1, 0);
 	TupleDescInitEntry(tupdesc, (AttrNumber) 6, "wal_sync",
 					   INT8OID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 7, "wal_write_time",
-					   FLOAT8OID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 8, "wal_sync_time",
-					   FLOAT8OID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 9, "stats_reset",
+	TupleDescInitEntry(tupdesc, (AttrNumber) 7, "stats_reset",
 					   TIMESTAMPTZOID, -1, 0);
 
 	BlessTupleDesc(tupdesc);
@@ -1685,11 +1681,7 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
 	values[4] = Int64GetDatum(wal_stats->wal_write);
 	values[5] = Int64GetDatum(wal_stats->wal_sync);
 
-	/* Convert counters from microsec to millisec for display */
-	values[6] = Float8GetDatum(((double) wal_stats->wal_write_time) / 1000.0);
-	values[7] = Float8GetDatum(((double) wal_stats->wal_sync_time) / 1000.0);
-
-	values[8] = TimestampTzGetDatum(wal_stats->stat_reset_timestamp);
+	values[6] = TimestampTzGetDatum(wal_stats->stat_reset_timestamp);
 
 	/* Returns the record as Datum */
 	PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 3361f6a69c9..c9a98ee0037 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2253,16 +2253,23 @@ pg_stat_user_tables| SELECT relid,
     total_autoanalyze_time
    FROM pg_stat_all_tables
   WHERE ((schemaname <> ALL (ARRAY['pg_catalog'::name, 'information_schema'::name])) AND (schemaname !~ '^pg_toast'::text));
-pg_stat_wal| SELECT wal_records,
-    wal_fpi,
-    wal_bytes,
-    wal_buffers_full,
-    wal_write,
-    wal_sync,
-    wal_write_time,
-    wal_sync_time,
-    stats_reset
-   FROM pg_stat_get_wal() w(wal_records, wal_fpi, wal_bytes, wal_buffers_full, wal_write, wal_sync, wal_write_time, wal_sync_time, stats_reset);
+pg_stat_wal| WITH pgsio_sum_write_fsync_time AS (
+         SELECT sum(pg_stat_io.write_time) AS sum_write_time,
+            sum(pg_stat_io.fsync_time) AS sum_fsync_time
+           FROM pg_stat_io
+          WHERE ((pg_stat_io.context = 'normal'::text) AND (pg_stat_io.object = 'wal'::text))
+        )
+ SELECT w.wal_records,
+    w.wal_fpi,
+    w.wal_bytes,
+    w.wal_buffers_full,
+    w.wal_write,
+    w.wal_sync,
+    p.sum_write_time AS write_time,
+    p.sum_fsync_time AS fsync_time,
+    w.stats_reset
+   FROM (pg_stat_get_wal() w(wal_records, wal_fpi, wal_bytes, wal_buffers_full, wal_write, wal_sync, stats_reset)
+     CROSS JOIN pgsio_sum_write_fsync_time p);
 pg_stat_wal_receiver| SELECT pid,
     status,
     receive_start_lsn,
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index edc2470bcf9..a98a027b53c 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -5003,7 +5003,9 @@ description | Waiting for a newly initialized WAL file to reach durable storage
         <listitem>
          <para>
           <literal>io</literal>: Reset all the counters shown in the
-          <structname>pg_stat_io</structname> view.
+          <structname>pg_stat_io</structname> view. This resets the timing
+          counter stored in the <structname>pg_stat_wal</structname> view
+          too.
          </para>
         </listitem>
         <listitem>
@@ -5020,8 +5022,10 @@ description | Waiting for a newly initialized WAL file to reach durable storage
         </listitem>
         <listitem>
          <para>
-          <literal>wal</literal>: Reset all the counters shown in the
-          <structname>pg_stat_wal</structname> view.
+          <literal>wal</literal>: Reset all the counters except timings shown
+          in the <structname>pg_stat_wal</structname> view. The timing
+          counters can be reset by calling pg_stat_reset_shared set to
+          <literal>io</literal>.
          </para>
         </listitem>
         <listitem>
-- 
2.47.2

#67

Bertrand Drouvot

bertranddrouvot.pg@gmail.com

12 months ago

In reply to: Nazir Bilal Yavuz (#65)

Re: Show WAL write and fsync stats in pg_stat_io

Hi,

On Wed, Jan 29, 2025 at 01:49:20PM +0300, Nazir Bilal Yavuz wrote:

Sorry for replying late.

No problem at all!

It turns out that the difference is expected.
It is because pg_stat_wal.wal_bytes tracks the total amount of WAL
generated, while pg_stat_io tracks the I/Os that are written to disk.

Yeah, I do agree it's expected: indeed, pg_stat_wal.wal_bytes "focuses" on
the WAL record sizes while pg_stat_io's unit is the wal_block_size. That
makes sense and "only" affects the "bytes" (the remaining common fields look
consistent across the 2 views).

It is explicitly said that pg_stat_io tracks I/O operations and
pg_stat_wal.wal_bytes tracks total amount of WAL generated in bytes. I
think this is clear enough. Do you think we still need additional
explanation?

I think it would not hurt to add a comment mentioning that the "bytes" are
expected to differ given the fact that the IO unit is the wal_block_size. But,
that said, I don't have a strong opinion on it; it's just a nice-to-have IMHO.
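
To illustrate why the byte counts are expected to differ, here is a minimal
sketch. It is not PostgreSQL code: the helper name is hypothetical, and it
assumes a WAL block size of 8192 bytes (the default XLOG_BLCKSZ) and that WAL
reaches disk in whole blocks.

```c
#include <stdint.h>

/* Assumption: default WAL block size (XLOG_BLCKSZ) of 8192 bytes. */
#define SKETCH_WAL_BLOCK_SIZE 8192

/*
 * Hypothetical helper, not a PostgreSQL function: number of bytes that
 * reach disk when record_bytes of WAL are written out in whole blocks.
 */
static uint64_t
sketch_bytes_written(uint64_t record_bytes)
{
	/* Round up to the next whole block. */
	uint64_t	blocks = (record_bytes + SKETCH_WAL_BLOCK_SIZE - 1) /
		SKETCH_WAL_BLOCK_SIZE;

	return blocks * SKETCH_WAL_BLOCK_SIZE;
}
```

Under these assumptions, 100 bytes of WAL records would add 100 to
pg_stat_wal.wal_bytes, while the corresponding write would be accounted as a
full 8192-byte block on the pg_stat_io side.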

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#68

Bertrand Drouvot

bertranddrouvot.pg@gmail.com

12 months ago

In reply to: Nazir Bilal Yavuz (#66)

Re: Show WAL write and fsync stats in pg_stat_io

Hi,

On Wed, Jan 29, 2025 at 02:57:21PM +0300, Nazir Bilal Yavuz wrote:

I agree with you but it was discussed before in this thread [2]. It
was decided to use both track_wal_io_timing and track_io_timing
because of the overhead that track_wal_io_timing creates but we can
still re-discuss it. Do you think that this discussion needs its own
thread?

If we continue to discuss it in this thread, I am in favor of removing
track_wal_io_timing and using track_io_timing for all types of I/Os.
Like you said, this cross-dependency makes things more complex than
they used to be. Downside of removing track_wal_io_timing is affecting
people who:

1- Want to track timings of only WAL I/Os.
2- Want to track timings of all IOs except WAL I/Os.

I think the first group is more important than the second because
track_io_timing already creates overhead.

I'm -1 on removing track_wal_io_timing. I think that this code path is too
performance-sensitive to add extra overhead when it is not explicitly asked for.

I think that's the main reason why ff99918c625 added this new GUC (looking at
the commit message). I'd feel more comfortable if we keep it.

That said, in:

+static bool
+pgstat_should_track_io_time(IOObject io_object)
+{
+       if (io_object == IOOBJECT_WAL)
+               return track_wal_io_timing;
+
+       return track_io_timing;
+}

I think it would make sense to return "track_io_timing && track_wal_io_timing"
for the IOOBJECT_WAL case. That way it maintains track_io_timing as the master
switch for all I/O timing in pg_stat_io.
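
A self-contained sketch of that variant follows. The enum and the two
booleans are local stand-ins for IOObject and the GUCs (the real definitions
live in the PostgreSQL tree), and all names prefixed with sketch_ are
hypothetical, so take it as an illustration of the suggested condition only.

```c
#include <stdbool.h>

/* Stand-in for the IOObject enum from pgstat.h. */
typedef enum SketchIOObject
{
	SKETCH_IOOBJECT_RELATION,
	SKETCH_IOOBJECT_TEMP_RELATION,
	SKETCH_IOOBJECT_WAL,
} SketchIOObject;

/* Stand-ins for the track_io_timing and track_wal_io_timing GUCs. */
static bool sketch_track_io_timing = false;
static bool sketch_track_wal_io_timing = false;

/*
 * WAL timing needs both switches; every other object only needs
 * track_io_timing, which therefore acts as the master switch.
 */
static bool
sketch_should_track_io_time(SketchIOObject io_object)
{
	if (io_object == SKETCH_IOOBJECT_WAL)
		return sketch_track_io_timing && sketch_track_wal_io_timing;

	return sketch_track_io_timing;
}
```

With this condition, enabling only track_wal_io_timing would not produce WAL
timings in pg_stat_io.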

One additional thing is that I think track_io_timing is a general
word. When it exists, I do not expect there to be another GUC like
track_wal_io_timing to track WAL I/Os' timings.

That's true but OTOH track_wal_io_timing has already been there for years
(it's not like we are adding it).

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#69

Michael Paquier

michael@paquier.xyz

12 months ago

In reply to: Nazir Bilal Yavuz (#66)

Re: Show WAL write and fsync stats in pg_stat_io

On Wed, Jan 29, 2025 at 02:57:21PM +0300, Nazir Bilal Yavuz wrote:

On Tue, 28 Jan 2025 at 07:23, Michael Paquier <michael@paquier.xyz> wrote:

What you are doing in 0001 is a good first step towards this goal, as this
also plugs in a few things for backend statistics with the calls to
pgstat_count_io_op[_time]().

I agree. Do you think that we need to do this simplification in this
thread or does it need its own thread?

As far as I understand, the simplifications in PgStat_PendingWalStats
require the changes of this thread first, so keeping them around for
now sounds OK to me.

I agree with you but it was discussed before in this thread [2]. It
was decided to use both track_wal_io_timing and track_io_timing
because of the overhead that track_wal_io_timing creates but we can
still re-discuss it. Do you think that this discussion needs its own
thread?

Let's decide it on this thread. You have done a benchmark with fsync
disabled for something that only stresses WAL. And it is very
dependent on the clock source. Would you really see a difference
under a normal pgbench workload? For example, should we compare HEAD
and the patch with track_io_timing=on but track_wal_io_timing=off
with a modified version of the patch so that IOOBJECT_WAL timing data is
controlled by track_io_timing=on? The previous results could have
been also influenced by the timings of pg_stat_wal because
track_wal_io_timing was on.

If we continue to discuss it in this thread, I am in favor of removing
track_wal_io_timing and using track_io_timing for all types of I/Os.
Like you said, this cross-dependency makes things more complex than
they used to be. Downside of removing track_wal_io_timing is affecting
people who:

1- Want to track timings of only WAL I/Os.
2- Want to track timings of all IOs except WAL I/Os.

I think the first group is more important than the second because
track_io_timing already creates overhead.

One additional thing is that I think track_io_timing is a general
word. When it exists, I do not expect there to be another GUC like
track_wal_io_timing to track WAL I/Os' timings.

Just to be clear here, I'd be okay to entirely remove the GUC
track_wal_io_timing iff pg_stat_wal has no more need for it once we feed
the data of pg_stat_io to pg_stat_wal. Having track_io_timing be
used for all the timing information in pg_stat_io makes the whole
design leaner, IMO. Removing track_wal_io_timing from the patch and
pgstat_io.c simplifies the user story a lot.
--
Michael

#70

Nazir Bilal Yavuz

byavuz81@gmail.com

12 months ago

In reply to: Bertrand Drouvot (#68)

Re: Show WAL write and fsync stats in pg_stat_io

Hi,

On Wed, 29 Jan 2025 at 18:16, Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

Hi,

On Wed, Jan 29, 2025 at 02:57:21PM +0300, Nazir Bilal Yavuz wrote:

I agree with you but it was discussed before in this thread [2]. It
was decided to use both track_wal_io_timing and track_io_timing
because of the overhead that track_wal_io_timing creates but we can
still re-discuss it. Do you think that this discussion needs its own
thread?

If we continue to discuss it in this thread, I am in favor of removing
track_wal_io_timing and using track_io_timing for all types of I/Os.
Like you said, this cross-dependency makes things more complex than
they used to be. Downside of removing track_wal_io_timing is affecting
people who:

1- Want to track timings of only WAL I/Os.
2- Want to track timings of all IOs except WAL I/Os.

I think the first group is more important than the second because
track_io_timing already creates overhead.

I'm -1 on removing track_wal_io_timing. I think that this code path is too
performance-sensitive to add extra overhead when it is not explicitly asked for.

I think that's the main reason why ff99918c625 added this new GUC (looking at
the commit message). I'd feel more comfortable if we keep it.

As Michael suggested, I will run a couple of benchmarks to see the
actual effect of this change. Then let's see if this affects anything.

That said, in:
+static bool
+pgstat_should_track_io_time(IOObject io_object)
+{
+       if (io_object == IOOBJECT_WAL)
+               return track_wal_io_timing;
+
+       return track_io_timing;
+}
I think it would make sense to return "track_io_timing && track_wal_io_timing"
for the IOOBJECT_WAL case. That way it maintains track_io_timing as the master
switch for all I/O timing in pg_stat_io.

I do not think that makes sense if we want to take performance into
account. It means that if we want to track WAL I/O timings, we must track
other I/O timings as well. Or did you perhaps mean not fetching
pg_stat_wal's timings from pg_stat_io, so that track_wal_io_timing
tracks timings in pg_stat_wal but WAL I/O timings are not tracked in
pg_stat_io unless track_io_timing is also enabled?
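
To spell out that second reading, here is a small sketch. The function names
are hypothetical and the GUC values are passed as plain parameters; none of
this exists in the patch, it only restates the two conditions side by side.

```c
#include <stdbool.h>

/*
 * Hypothetical: pg_stat_wal timings depend only on track_wal_io_timing.
 * The first parameter is unused on purpose, to keep both signatures equal.
 */
static bool
sketch_time_wal_for_pg_stat_wal(bool track_io_timing, bool track_wal_io_timing)
{
	(void) track_io_timing;
	return track_wal_io_timing;
}

/*
 * Hypothetical: WAL timings in pg_stat_io additionally require
 * track_io_timing.
 */
static bool
sketch_time_wal_for_pg_stat_io(bool track_io_timing, bool track_wal_io_timing)
{
	return track_io_timing && track_wal_io_timing;
}
```

In that reading, a user who enables only track_wal_io_timing still gets
timings in pg_stat_wal, but the two views can no longer share a single source
for those values.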

One additional thing is that I think track_io_timing is a general
word. When it exists, I do not expect there to be another GUC like
track_wal_io_timing to track WAL I/Os' timings.

That's true but OTOH track_wal_io_timing has already been there for years
(it's not like we are adding it).

Yes, this makes sense.

--
Regards,
Nazir Bilal Yavuz
Microsoft

#71

Michael Paquier

michael@paquier.xyz

11 months ago

In reply to: Nazir Bilal Yavuz (#70)

1 attachment(s)

Re: Show WAL write and fsync stats in pg_stat_io

On Fri, Jan 31, 2025 at 11:29:31AM +0300, Nazir Bilal Yavuz wrote:

On Wed, 29 Jan 2025 at 18:16, Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

I think that's the main reason why ff99918c625 added this new GUC (looking at
the commit message). I'd feel more comfortable if we keep it.

As Michael suggested, I will run a couple of benchmarks to see the
actual effect of this change. Then let's see if this affects anything.

I've looked a bit at all that today, and something like the attached
is what seems like the best streamlined version to me for the main
feature. I am also planning to run some short benchmarks with
track_io_timing=on on HEAD and with the patch, then see the
difference, without any relationship to track_wal_io_timing.

The comment additions in pgstat_count_io_op_time() were worth a patch
of their own. This part has been applied as b998fedab74c, after a few
tweaks of my own.
--
Michael

Attachments:

v13-0001-Add-WAL-I-O-stats-to-both-pg_stat_io-view-and-pe.patch (text/x-diff; charset=us-ascii)

From f3bff6e4a646cb90fd5ba0e178e282d189eda39e Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Mon, 27 Jan 2025 14:03:54 +0300
Subject: [PATCH v13] Add WAL I/O stats to both pg_stat_io view and per backend
 I/O statistics

This commit adds WAL I/O stats to both pg_stat_io view and per backend
I/O statistics (pg_stat_get_backend_io()).

This commit introduces three new I/O concepts:

- IOObject IOOBJECT_WAL is used for tracking all WAL I/Os.
- IOOBJECT_WAL / IOCONTEXT_NORMAL is used for tracking I/O operations
  done on already-created WAL segments.
- IOOBJECT_WAL / IOCONTEXT_INIT is used for tracking I/O operations done
  while creating the WAL segments.

XXX: Bump catalog version.
XXX: Bump pgstats file version.
---
 src/include/pgstat.h                        |  4 +-
 src/backend/access/transam/xlog.c           | 39 +++++++---
 src/backend/access/transam/xlogreader.c     | 10 +++
 src/backend/access/transam/xlogrecovery.c   | 12 ++++
 src/backend/utils/activity/pgstat_backend.c |  9 +--
 src/backend/utils/activity/pgstat_io.c      | 79 +++++++++++++++------
 src/test/regress/expected/stats.out         | 53 ++++++++++++++
 src/test/regress/sql/stats.sql              | 25 +++++++
 doc/src/sgml/monitoring.sgml                | 19 ++++-
 9 files changed, 212 insertions(+), 38 deletions(-)

diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 81ec0161c09..a61b488e8d8 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -274,14 +274,16 @@ typedef enum IOObject
 {
 	IOOBJECT_RELATION,
 	IOOBJECT_TEMP_RELATION,
+	IOOBJECT_WAL,
 } IOObject;
 
-#define IOOBJECT_NUM_TYPES (IOOBJECT_TEMP_RELATION + 1)
+#define IOOBJECT_NUM_TYPES (IOOBJECT_WAL + 1)
 
 typedef enum IOContext
 {
 	IOCONTEXT_BULKREAD,
 	IOCONTEXT_BULKWRITE,
+	IOCONTEXT_INIT,
 	IOCONTEXT_NORMAL,
 	IOCONTEXT_VACUUM,
 } IOContext;
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 90ade4e7d39..eb19cf5690c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2435,16 +2435,19 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
 			{
 				errno = 0;
 
-				/* Measure I/O timing to write WAL data */
-				if (track_wal_io_timing)
-					INSTR_TIME_SET_CURRENT(start);
-				else
-					INSTR_TIME_SET_ZERO(start);
+				/*
+				 * Measure I/O timing to write WAL data, for pg_stat_wal
+				 * and/or pg_stat_io.
+				 */
+				start = pgstat_prepare_io_time(track_wal_io_timing || track_io_timing);
 
 				pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
 				written = pg_pwrite(openLogFile, from, nleft, startoffset);
 				pgstat_report_wait_end();
 
+				pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL,
+										IOOP_WRITE, start, 1, written);
+
 				/*
 				 * Increment the I/O timing and the number of times WAL data
 				 * were written out to disk.
@@ -3216,6 +3219,7 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	int			fd;
 	int			save_errno;
 	int			open_flags = O_RDWR | O_CREAT | O_EXCL | PG_BINARY;
+	instr_time	io_start;
 
 	Assert(logtli != 0);
 
@@ -3259,6 +3263,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 				(errcode_for_file_access(),
 				 errmsg("could not create file \"%s\": %m", tmppath)));
 
+	/* Measure I/O timing when initializing segment */
+	io_start = pgstat_prepare_io_time(track_io_timing);
+
 	pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_WRITE);
 	save_errno = 0;
 	if (wal_init_zero)
@@ -3294,6 +3301,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	}
 	pgstat_report_wait_end();
 
+	pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_INIT, IOOP_WRITE,
+							io_start, 1, wal_segment_size);
+
 	if (save_errno)
 	{
 		/*
@@ -3310,6 +3320,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 				 errmsg("could not write to file \"%s\": %m", tmppath)));
 	}
 
+	/* Measure I/O timing when flushing segment */
+	io_start = pgstat_prepare_io_time(track_io_timing);
+
 	pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_SYNC);
 	if (pg_fsync(fd) != 0)
 	{
@@ -3322,6 +3335,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	}
 	pgstat_report_wait_end();
 
+	pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_INIT,
+							IOOP_FSYNC, io_start, 1, 0);
+
 	if (close(fd) != 0)
 		ereport(ERROR,
 				(errcode_for_file_access(),
@@ -8696,11 +8712,11 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 		wal_sync_method == WAL_SYNC_METHOD_OPEN_DSYNC)
 		return;
 
-	/* Measure I/O timing to sync the WAL file */
-	if (track_wal_io_timing)
-		INSTR_TIME_SET_CURRENT(start);
-	else
-		INSTR_TIME_SET_ZERO(start);
+	/*
+	 * Measure I/O timing to sync the WAL file for pg_stat_wal and/or
+	 * pg_stat_io.
+	 */
+	start = pgstat_prepare_io_time(track_io_timing || track_wal_io_timing);
 
 	pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC);
 	switch (wal_sync_method)
@@ -8757,6 +8773,9 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 		INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_sync_time, end, start);
 	}
 
+	pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_FSYNC,
+							start, 1, 0);
+
 	PendingWalStats.wal_sync++;
 }
 
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 91b6a91767d..6cf5ef90be0 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -35,6 +35,7 @@
 
 #ifndef FRONTEND
 #include "pgstat.h"
+#include "storage/bufmgr.h"
 #else
 #include "common/logging.h"
 #endif
@@ -1507,6 +1508,9 @@ WALRead(XLogReaderState *state,
 	char	   *p;
 	XLogRecPtr	recptr;
 	Size		nbytes;
+#ifndef FRONTEND
+	instr_time	io_start;
+#endif
 
 	p = buf;
 	recptr = startptr;
@@ -1552,6 +1556,9 @@ WALRead(XLogReaderState *state,
 			segbytes = nbytes;
 
 #ifndef FRONTEND
+		/* Measure I/O timing when reading segment */
+		io_start = pgstat_prepare_io_time(track_io_timing);
+
 		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
 #endif
 
@@ -1561,6 +1568,9 @@ WALRead(XLogReaderState *state,
 
 #ifndef FRONTEND
 		pgstat_report_wait_end();
+
+		pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_READ,
+								io_start, 1, readbytes);
 #endif
 
 		if (readbytes <= 0)
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index cf2b007806f..473de6710d7 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -60,6 +60,7 @@
 #include "utils/datetime.h"
 #include "utils/fmgrprotos.h"
 #include "utils/guc_hooks.h"
+#include "utils/pgstat_internal.h"
 #include "utils/pg_lsn.h"
 #include "utils/ps_status.h"
 #include "utils/pg_rusage.h"
@@ -3306,6 +3307,7 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 	uint32		targetPageOff;
 	XLogSegNo	targetSegNo PG_USED_FOR_ASSERTS_ONLY;
 	int			r;
+	instr_time	io_start;
 
 	XLByteToSeg(targetPagePtr, targetSegNo, wal_segment_size);
 	targetPageOff = XLogSegmentOffset(targetPagePtr, wal_segment_size);
@@ -3398,6 +3400,9 @@ retry:
 	/* Read the requested page */
 	readOff = targetPageOff;
 
+	/* Measure I/O timing when reading segment */
+	io_start = pgstat_prepare_io_time(track_io_timing);
+
 	pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
 	r = pg_pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);
 	if (r != XLOG_BLCKSZ)
@@ -3406,6 +3411,10 @@ retry:
 		int			save_errno = errno;
 
 		pgstat_report_wait_end();
+
+		pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_READ,
+								io_start, 1, r);
+
 		XLogFileName(fname, curFileTLI, readSegNo, wal_segment_size);
 		if (r < 0)
 		{
@@ -3426,6 +3435,9 @@ retry:
 	}
 	pgstat_report_wait_end();
 
+	pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_READ,
+							io_start, 1, r);
+
 	Assert(targetSegNo == readSegNo);
 	Assert(targetPageOff == readOff);
 	Assert(reqLen <= readLen);
diff --git a/src/backend/utils/activity/pgstat_backend.c b/src/backend/utils/activity/pgstat_backend.c
index bcf9e4b1487..338da73a9a9 100644
--- a/src/backend/utils/activity/pgstat_backend.c
+++ b/src/backend/utils/activity/pgstat_backend.c
@@ -24,6 +24,7 @@
 
 #include "postgres.h"
 
+#include "access/xlog.h"
 #include "storage/bufmgr.h"
 #include "utils/memutils.h"
 #include "utils/pgstat_internal.h"
@@ -43,7 +44,7 @@ void
 pgstat_count_backend_io_op_time(IOObject io_object, IOContext io_context,
 								IOOp io_op, instr_time io_time)
 {
-	Assert(track_io_timing);
+	Assert(track_io_timing || track_wal_io_timing);
 
 	if (!pgstat_tracks_backend_bktype(MyBackendType))
 		return;
@@ -236,9 +237,6 @@ pgstat_tracks_backend_bktype(BackendType bktype)
 		case B_DEAD_END_BACKEND:
 		case B_ARCHIVER:
 		case B_LOGGER:
-		case B_WAL_RECEIVER:
-		case B_WAL_WRITER:
-		case B_WAL_SUMMARIZER:
 		case B_BG_WRITER:
 		case B_CHECKPOINTER:
 		case B_STARTUP:
@@ -249,7 +247,10 @@ pgstat_tracks_backend_bktype(BackendType bktype)
 		case B_BG_WORKER:
 		case B_STANDALONE_BACKEND:
 		case B_SLOTSYNC_WORKER:
+		case B_WAL_RECEIVER:
 		case B_WAL_SENDER:
+		case B_WAL_SUMMARIZER:
+		case B_WAL_WRITER:
 			return true;
 	}
 
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 7aee6a396f2..28a431084b8 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -127,21 +127,24 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
 		INSTR_TIME_SET_CURRENT(io_time);
 		INSTR_TIME_SUBTRACT(io_time, start_time);
 
-		if (io_op == IOOP_WRITE || io_op == IOOP_EXTEND)
+		if (io_object != IOOBJECT_WAL)
 		{
-			pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
-			if (io_object == IOOBJECT_RELATION)
-				INSTR_TIME_ADD(pgBufferUsage.shared_blk_write_time, io_time);
-			else if (io_object == IOOBJECT_TEMP_RELATION)
-				INSTR_TIME_ADD(pgBufferUsage.local_blk_write_time, io_time);
-		}
-		else if (io_op == IOOP_READ)
-		{
-			pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
-			if (io_object == IOOBJECT_RELATION)
-				INSTR_TIME_ADD(pgBufferUsage.shared_blk_read_time, io_time);
-			else if (io_object == IOOBJECT_TEMP_RELATION)
-				INSTR_TIME_ADD(pgBufferUsage.local_blk_read_time, io_time);
+			if (io_op == IOOP_WRITE || io_op == IOOP_EXTEND)
+			{
+				pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
+				if (io_object == IOOBJECT_RELATION)
+					INSTR_TIME_ADD(pgBufferUsage.shared_blk_write_time, io_time);
+				else if (io_object == IOOBJECT_TEMP_RELATION)
+					INSTR_TIME_ADD(pgBufferUsage.local_blk_write_time, io_time);
+			}
+			else if (io_op == IOOP_READ)
+			{
+				pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
+				if (io_object == IOOBJECT_RELATION)
+					INSTR_TIME_ADD(pgBufferUsage.shared_blk_read_time, io_time);
+				else if (io_object == IOOBJECT_TEMP_RELATION)
+					INSTR_TIME_ADD(pgBufferUsage.local_blk_read_time, io_time);
+			}
 		}
 
 		INSTR_TIME_ADD(PendingIOStats.pending_times[io_object][io_context][io_op],
@@ -249,6 +252,8 @@ pgstat_get_io_context_name(IOContext io_context)
 			return "bulkread";
 		case IOCONTEXT_BULKWRITE:
 			return "bulkwrite";
+		case IOCONTEXT_INIT:
+			return "init";
 		case IOCONTEXT_NORMAL:
 			return "normal";
 		case IOCONTEXT_VACUUM:
@@ -268,6 +273,8 @@ pgstat_get_io_object_name(IOObject io_object)
 			return "relation";
 		case IOOBJECT_TEMP_RELATION:
 			return "temp relation";
+		case IOOBJECT_WAL:
+			return "wal";
 	}
 
 	elog(ERROR, "unrecognized IOObject value: %d", io_object);
@@ -340,8 +347,6 @@ pgstat_io_snapshot_cb(void)
 * - Syslogger because it is not connected to shared memory
 * - Archiver because most relevant archiving IO is delegated to a
 *   specialized command or module
-* - WAL Receiver, WAL Writer, and WAL Summarizer IO are not tracked in
-*   pg_stat_io for now
 *
 * Function returns true if BackendType participates in the cumulative stats
 * subsystem for IO and false if it does not.
@@ -362,9 +367,6 @@ pgstat_tracks_io_bktype(BackendType bktype)
 		case B_DEAD_END_BACKEND:
 		case B_ARCHIVER:
 		case B_LOGGER:
-		case B_WAL_RECEIVER:
-		case B_WAL_WRITER:
-		case B_WAL_SUMMARIZER:
 			return false;
 
 		case B_AUTOVAC_LAUNCHER:
@@ -376,7 +378,10 @@ pgstat_tracks_io_bktype(BackendType bktype)
 		case B_SLOTSYNC_WORKER:
 		case B_STANDALONE_BACKEND:
 		case B_STARTUP:
+		case B_WAL_RECEIVER:
 		case B_WAL_SENDER:
+		case B_WAL_SUMMARIZER:
+		case B_WAL_WRITER:
 			return true;
 	}
 
@@ -402,6 +407,15 @@ pgstat_tracks_io_object(BackendType bktype, IOObject io_object,
 	if (!pgstat_tracks_io_bktype(bktype))
 		return false;
 
+	/*
+	 * Currently, IO on IOOBJECT_WAL objects can only occur in the
+	 * IOCONTEXT_NORMAL and IOCONTEXT_INIT IOContexts.
+	 */
+	if (io_object == IOOBJECT_WAL &&
+		(io_context != IOCONTEXT_NORMAL &&
+		 io_context != IOCONTEXT_INIT))
+		return false;
+
 	/*
 	 * Currently, IO on temporary relations can only occur in the
 	 * IOCONTEXT_NORMAL IOContext.
@@ -468,14 +482,28 @@ pgstat_tracks_io_op(BackendType bktype, IOObject io_object,
 	/*
 	 * Some BackendTypes will not do certain IOOps.
 	 */
-	if ((bktype == B_BG_WRITER || bktype == B_CHECKPOINTER) &&
+	if (bktype == B_BG_WRITER &&
 		(io_op == IOOP_READ || io_op == IOOP_EVICT || io_op == IOOP_HIT))
 		return false;
 
+	if (bktype == B_CHECKPOINTER &&
+		((io_object != IOOBJECT_WAL && io_op == IOOP_READ) ||
+		 (io_op == IOOP_EVICT || io_op == IOOP_HIT)))
+		return false;
+
 	if ((bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER ||
 		 bktype == B_CHECKPOINTER) && io_op == IOOP_EXTEND)
 		return false;
 
+	/*
+	 * Some BackendTypes do not perform reads with IOOBJECT_WAL.
+	 */
+	if (io_object == IOOBJECT_WAL && io_op == IOOP_READ &&
+		(bktype == B_WAL_RECEIVER || bktype == B_BG_WRITER ||
+		 bktype == B_AUTOVAC_LAUNCHER || bktype == B_AUTOVAC_WORKER ||
+		 bktype == B_WAL_WRITER))
+		return false;
+
 	/*
 	 * Temporary tables are not logged and thus do not require fsync'ing.
 	 * Writeback is not requested for temporary tables.
@@ -500,6 +528,17 @@ pgstat_tracks_io_op(BackendType bktype, IOObject io_object,
 	if (!strategy_io_context && io_op == IOOP_REUSE)
 		return false;
 
+	/*
+	 * IOOBJECT_WAL IOObject will not do certain IOOps depending on IOContext.
+	 */
+	if (io_object == IOOBJECT_WAL && io_context == IOCONTEXT_INIT &&
+		!(io_op == IOOP_WRITE || io_op == IOOP_FSYNC))
+		return false;
+
+	if (io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL &&
+		!(io_op == IOOP_WRITE || io_op == IOOP_READ || io_op == IOOP_FSYNC))
+		return false;
+
 	/*
 	 * IOOP_FSYNC IOOps done by a backend using a BufferAccessStrategy are
 	 * counted in the IOCONTEXT_NORMAL IOContext. See comment in
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index a0317b7208e..9a02481ee7e 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -862,6 +862,33 @@ WHERE pg_stat_get_backend_pid(beid) = pg_backend_pid();
  t
 (1 row)
 
+-- Test pg_stat_io for WAL in an init context, that should do writes
+-- and syncs.
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'init' AND object = 'wal' \gset io_sum_wal_init_
+SELECT :io_sum_wal_init_writes > 0;
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT current_setting('fsync') = 'off'
+  OR :io_sum_wal_init_fsyncs > 0;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- Test pg_stat_io for WAL in a normal context, that should do reads as well.
+SELECT SUM(reads) > 0
+  FROM pg_stat_io
+  WHERE context = 'normal' AND object = 'wal';
+ ?column? 
+----------
+ t
+(1 row)
+
 -----
 -- Test that resetting stats works for reset timestamp
 -----
@@ -1256,6 +1283,7 @@ SELECT pg_stat_get_subscription_stats(NULL);
 -- - extends of relations using shared buffers
 -- - fsyncs done to ensure the durability of data dirtying shared buffers
 -- - shared buffer hits
+-- - WAL writes and fsyncs in IOContext IOCONTEXT_NORMAL
 -- There is no test for blocks evicted from shared buffers, because we cannot
 -- be sure of the state of shared buffers at the point the test is run.
 -- Create a regular table and insert some data to generate IOCONTEXT_NORMAL
@@ -1273,6 +1301,9 @@ SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
 SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
   FROM pg_stat_get_backend_io(pg_backend_pid())
   WHERE object = 'relation' \gset my_io_sum_shared_before_
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'normal' AND object = 'wal' \gset io_sum_wal_normal_before_
 CREATE TABLE test_io_shared(a int);
 INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
 SELECT pg_stat_force_next_flush();
@@ -1335,6 +1366,28 @@ SELECT current_setting('fsync') = 'off'
  t
 (1 row)
 
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'normal' AND object = 'wal' \gset io_sum_wal_normal_after_
+SELECT current_setting('synchronous_commit') = 'on';
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT :io_sum_wal_normal_after_writes > :io_sum_wal_normal_before_writes;
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT current_setting('fsync') = 'off'
+  OR :io_sum_wal_normal_after_fsyncs > :io_sum_wal_normal_before_fsyncs;
+ ?column? 
+----------
+ t
+(1 row)
+
 -- Change the tablespace so that the table is rewritten directly, then SELECT
 -- from it to cause it to be read back into shared buffers.
 SELECT sum(reads) AS io_sum_shared_before_reads
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index 399c72bbcf7..901e7bd56e3 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -442,6 +442,20 @@ SELECT (current_schemas(true))[1] = ('pg_temp_' || beid::text) AS match
 FROM pg_stat_get_backend_idset() beid
 WHERE pg_stat_get_backend_pid(beid) = pg_backend_pid();
 
+-- Test pg_stat_io for WAL in an init context, that should do writes
+-- and syncs.
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'init' AND object = 'wal' \gset io_sum_wal_init_
+SELECT :io_sum_wal_init_writes > 0;
+SELECT current_setting('fsync') = 'off'
+  OR :io_sum_wal_init_fsyncs > 0;
+
+-- Test pg_stat_io for WAL in a normal context, that should do reads as well.
+SELECT SUM(reads) > 0
+  FROM pg_stat_io
+  WHERE context = 'normal' AND object = 'wal';
+
 -----
 -- Test that resetting stats works for reset timestamp
 -----
@@ -602,6 +616,7 @@ SELECT pg_stat_get_subscription_stats(NULL);
 -- - extends of relations using shared buffers
 -- - fsyncs done to ensure the durability of data dirtying shared buffers
 -- - shared buffer hits
+-- - WAL writes and fsyncs in IOContext IOCONTEXT_NORMAL
 
 -- There is no test for blocks evicted from shared buffers, because we cannot
 -- be sure of the state of shared buffers at the point the test is run.
@@ -621,6 +636,9 @@ SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
 SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
   FROM pg_stat_get_backend_io(pg_backend_pid())
   WHERE object = 'relation' \gset my_io_sum_shared_before_
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'normal' AND object = 'wal' \gset io_sum_wal_normal_before_
 CREATE TABLE test_io_shared(a int);
 INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
 SELECT pg_stat_force_next_flush();
@@ -649,6 +667,13 @@ SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
 SELECT :my_io_sum_shared_after_writes >= :my_io_sum_shared_before_writes;
 SELECT current_setting('fsync') = 'off'
   OR :my_io_sum_shared_after_fsyncs >= :my_io_sum_shared_before_fsyncs;
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'normal' AND object = 'wal' \gset io_sum_wal_normal_after_
+SELECT current_setting('synchronous_commit') = 'on';
+SELECT :io_sum_wal_normal_after_writes > :io_sum_wal_normal_before_writes;
+SELECT current_setting('fsync') = 'off'
+  OR :io_sum_wal_normal_after_fsyncs > :io_sum_wal_normal_before_fsyncs;
 
 -- Change the tablespace so that the table is rewritten directly, then SELECT
 -- from it to cause it to be read back into shared buffers.
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 4e917f159aa..edc2470bcf9 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2581,9 +2581,10 @@ description | Waiting for a newly initialized WAL file to reach durable storage
   </para>
 
   <para>
-   Currently, I/O on relations (e.g. tables, indexes) is tracked. However,
-   relation I/O which bypasses shared buffers (e.g. when moving a table from one
-   tablespace to another) is currently not tracked.
+   Currently, I/O on relations (e.g. tables, indexes) and WAL activity are
+   tracked. However, relation I/O which bypasses shared buffers
+   (e.g. when moving a table from one tablespace to another) is currently
+   not tracked.
   </para>
 
   <table id="pg-stat-io-view" xreflabel="pg_stat_io">
@@ -2636,6 +2637,11 @@ description | Waiting for a newly initialized WAL file to reach durable storage
           <literal>temp relation</literal>: Temporary relations.
          </para>
         </listitem>
+        <listitem>
+         <para>
+          <literal>wal</literal>: Write Ahead Logs.
+         </para>
+        </listitem>
        </itemizedlist>
        </para>
       </entry>
@@ -2660,6 +2666,13 @@ description | Waiting for a newly initialized WAL file to reach durable storage
           <literal>normal</literal>.
          </para>
         </listitem>
+        <listitem>
+         <para>
+          <literal>init</literal>: I/O operations performed while creating the
+          WAL segments are tracked in <varname>context</varname>
+          <literal>init</literal>.
+         </para>
+        </listitem>
         <listitem>
          <para>
           <literal>vacuum</literal>: I/O operations performed outside of shared
-- 
2.47.2

#72

Bertrand Drouvot

bertranddrouvot.pg@gmail.com

11 months ago

In reply to: Michael Paquier (#71)

Re: Show WAL write and fsync stats in pg_stat_io

Hi,

On Mon, Feb 03, 2025 at 01:07:26PM +0900, Michael Paquier wrote:

On Fri, Jan 31, 2025 at 11:29:31AM +0300, Nazir Bilal Yavuz wrote:

On Wed, 29 Jan 2025 at 18:16, Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

I think that's the main reason why ff99918c625 added this new GUC (looking at
the commit message). I'd feel more comfortable if we keep it.

As Michael suggested, I will run a couple of benchmarks to see the
actual effect of this change. Then let's see if this affects anything.

I've looked a bit at all that today, and something like the attached
is what seems like the best streamlined version to me for the main
feature. I am also planning to run some short benchmarks with
track_io_timing=on on HEAD and with the patch, then see the
difference, without any relationship to track_wal_io_timing.

Thanks!

I've a few comments:

=== 1

+       pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_INIT, IOOP_WRITE,
+                                                       io_start, 1, wal_segment_size);

In case wal_init_zero is false, then we're only seeking to the end and writing a
solitary byte. Then, is reporting "wal_segment_size" correct?

=== 2

+ /*
+  * Measure I/O timing to write WAL data, for pg_stat_wal
+  * and/or pg_stat_io.
+  */
+ start = pgstat_prepare_io_time(track_wal_io_timing || track_io_timing);

I think that makes sense done that way (as track_wal_io_timing does not have
any effect in pgstat_count_io_op_time()). Nit: maybe change the order in the
comment to reflect the code ordering? (I mean to say re-word to "for pg_stat_io
and/or pg_stat_wal"). The order is ok in issue_xlog_fsync() though.

=== 3

What about adding a message in the doc as mentioned in [1]? (I'd not be surprised
if some people wonder why the "bytes" fields differ).

=== 4

pgstat_tracks_io_object() starts to be hard to read. I wonder if it could be
simplified with switch but that could be done after this one goes in.
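
To illustrate the idea, here is a rough standalone sketch of only the WAL-specific rules from the patch (WAL is tracked in the init and normal contexts; init does writes and fsyncs, normal also does reads), written as a switch. The `Sketch*` enums and `sketch_wal_tracks_op()` are made-up stand-ins, not the real definitions from pgstat.h, and the real functions also have to consider the backend type and the other IOObjects:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the IOContext and IOOp enums in pgstat.h. */
typedef enum
{
    SKETCH_IOCONTEXT_BULKREAD,
    SKETCH_IOCONTEXT_BULKWRITE,
    SKETCH_IOCONTEXT_INIT,
    SKETCH_IOCONTEXT_NORMAL,
    SKETCH_IOCONTEXT_VACUUM
} SketchIOContext;

typedef enum
{
    SKETCH_IOOP_EVICT,
    SKETCH_IOOP_EXTEND,
    SKETCH_IOOP_FSYNC,
    SKETCH_IOOP_HIT,
    SKETCH_IOOP_READ,
    SKETCH_IOOP_REUSE,
    SKETCH_IOOP_WRITE,
    SKETCH_IOOP_WRITEBACK
} SketchIOOp;

#define SKETCH_IOOP_NUM_TYPES (SKETCH_IOOP_WRITEBACK + 1)

/*
 * Returns true if, for the WAL object, the given context/op pair is
 * tracked under the rules described in the patch.
 */
bool
sketch_wal_tracks_op(SketchIOContext io_context, SketchIOOp io_op)
{
    assert((int) io_op < SKETCH_IOOP_NUM_TYPES);

    switch (io_context)
    {
        case SKETCH_IOCONTEXT_INIT:
            /* Segment creation only writes and fsyncs. */
            return io_op == SKETCH_IOOP_WRITE ||
                   io_op == SKETCH_IOOP_FSYNC;
        case SKETCH_IOCONTEXT_NORMAL:
            /* Regular WAL activity can also read. */
            return io_op == SKETCH_IOOP_WRITE ||
                   io_op == SKETCH_IOOP_READ ||
                   io_op == SKETCH_IOOP_FSYNC;
        default:
            /* WAL I/O is not tracked in any other context. */
            return false;
    }
}
```

Whether this reads better than the current chain of if statements is a matter of taste; the sketch only shows that the per-context rules can sit side by side.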

=== 5

I think this patch will help simplify the per-backend WAL related patch, that's
nice.

=== 6

I'll also do some benchmark on my side.

[1]: /messages/by-id/Z5o9OQ0nwWD9tKTR@ip-10-97-1-34.eu-west-3.compute.internal

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#73

Bertrand Drouvot

bertranddrouvot.pg@gmail.com

11 months ago

In reply to: Bertrand Drouvot (#72)

Re: Show WAL write and fsync stats in pg_stat_io

On Mon, Feb 03, 2025 at 08:50:15AM +0000, Bertrand Drouvot wrote:

=== 6

I'll also do some benchmark on my side.

So, I did some tests using:

c=1 && psql -c checkpoint -c 'select pg_switch_wal()' &&
pgbench -n -M prepared -c$c -j$c -f <(echo "SELECT pg_logical_emit_message(true, 'test', repeat('0', 8192));";) -P1 -t 20000

With two message sizes: 8192 and 10.

Here are the results (outliers removed and tsc clock source):

+------------+-------------+-------------+
| Version    | Msg Size 10 | Msg Size 8K |
+------------+-------------+-------------+
| With PATCH |    980 TPS  |    910 TPS  |
| On Master  |    980 TPS  |    910 TPS  |
+------------+-------------+-------------+

So the patch does not produce a perf regression according to those tests.

Out of curiosity I also played a bit with the IO tracking (and hpet clock source)
and got:

+---------------------------------+-------------+-------------+----------+
| Test Configuration              |    PATCH    |   MASTER    | % Change |
+---------------------------------+-------------+-------------+----------+
| track_io_timing                 |                                      |
|   Message size 8192             |    805 TPS  |    810 TPS  |   -0.6%  |
|   Message size 10               |    860 TPS  |    860 TPS  |    0.0%  |
+---------------------------------+-------------+-------------+----------+
| track_wal_io_timing             |                                      |
|   Message size 8192             |    810 TPS  |    810 TPS  |    0.0%  |
|   Message size 10               |    860 TPS  |    860 TPS  |    0.0%  |
+---------------------------------+-------------+-------------+----------+
| track_wal_io + track_io         |                                      |
|   Message size 8192             |    800 TPS  |    800 TPS  |    0.0%  |
|   Message size 10               |    855 TPS  |    860 TPS  |   -0.6%  |
+---------------------------------+-------------+-------------+----------+

Based on those results the patch does not show a noticeable impact when IO timing
tracking is enabled.
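
For reference, the "% Change" figures in the table above are consistent with the TPS difference taken relative to master. A small sketch of that arithmetic (the helper name `tps_percent_change()` is made up for illustration and is not part of the patch or of pgbench):

```c
#include <assert.h>

/*
 * Relative TPS change of the patched build versus master, in percent.
 * Negative values mean the patched build was slower.
 */
double
tps_percent_change(double patched_tps, double master_tps)
{
    assert(master_tps > 0.0);

    return (patched_tps - master_tps) / master_tps * 100.0;
}
```

For example, 805 TPS with the patch against 810 TPS on master gives about -0.6%.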

FYI, it’s also worth noting that using hpet as the clock source hurts performance
even if no timing tracking is set. It means that when IO timing tracking is
enabled, the perf regression seen above is not fully related to having it
enabled but also (for a large part) to hpet vs tsc.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#74

Nazir Bilal Yavuz

byavuz81@gmail.com

11 months ago

In reply to: Bertrand Drouvot (#72)

Re: Show WAL write and fsync stats in pg_stat_io

Hi,

On Mon, 3 Feb 2025 at 11:50, Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

Hi,

On Mon, Feb 03, 2025 at 01:07:26PM +0900, Michael Paquier wrote:

On Fri, Jan 31, 2025 at 11:29:31AM +0300, Nazir Bilal Yavuz wrote:

On Wed, 29 Jan 2025 at 18:16, Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

I think that's the main reason why ff99918c625 added this new GUC (looking at
the commit message). I'd feel more comfortable if we keep it.

As Michael suggested, I will run a couple of benchmarks to see the
actual effect of this change. Then let's see if this affects anything.

I've looked a bit at all that today, and something like the attached
is what seems like the best streamlined version to me for the main
feature. I am also planning to run some short benchmarks with
track_io_timing=on on HEAD and with the patch, then see the
difference, without any relationship to track_wal_io_timing.

Thanks!

I've a few comments:

Thank you both for the v13 and the review!

=== 1
+       pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_INIT, IOOP_WRITE,
+                                                       io_start, 1, wal_segment_size);
In case wal_init_zero is false, then we're only seeking to the end and writing a
solitary byte. Then, is reporting "wal_segment_size" correct?

I think you are right. It would make sense to have two
pgstat_count_io_op_time() calls here. One for wal_segment_size and one
for solitary byte.

=== 2
+ /*
+  * Measure I/O timing to write WAL data, for pg_stat_wal
+  * and/or pg_stat_io.
+  */
+ start = pgstat_prepare_io_time(track_wal_io_timing || track_io_timing);
I think that makes sense done that way (as track_wal_io_timing does not have
any effect in pgstat_count_io_op_time()). Nit: maybe change the order in the
comment to reflect the code ordering? (I mean to say re-word to "for pg_stat_io
and/or pg_stat_wal"). The order is ok in issue_xlog_fsync() though.

=== 3

What about adding a message in the doc as mentioned in [1]? (I'd not be surprised
if some people wonder why the "bytes" fields differ).

=== 4

pgstat_tracks_io_object() starts to be hard to read. I wonder if it could be
simplified with switch but that could be done after this one goes in.

=== 5

I think this patch will help simplify the per-backend WAL related patch, that's
nice.

And I agree with the other comments you mentioned.

=== 6

I'll also do some benchmark on my side.

Thanks!

--
Regards,
Nazir Bilal Yavuz
Microsoft

#75

Michael Paquier

michael@paquier.xyz

11 months ago

In reply to: Bertrand Drouvot (#73)

Re: Show WAL write and fsync stats in pg_stat_io

On Mon, Feb 03, 2025 at 10:37:54AM +0000, Bertrand Drouvot wrote:

Based on those results the patch does not show a noticeable impact when IO timing
tracking is enabled.

I have spent a good portion of my day doing benchmarking with your
scenarios as much as Nazir's scenarios posted at [1], with
arch_sys_counter as clock source (I am not much into these settings)
on the host I have used. And well, I am not seeing a difference
between the patch and HEAD when enabling the GUC track_io_timing. I
do seem to see a difference when disabling the GUC in the order of
1~2% for the patch and HEAD.

FYI, it’s also worth noting that using hpet as the clock source hurts performance
even if no timing tracking is set. It means that when IO timing tracking is
enabled, the perf regression seen above is not fully related to having it
enabled but also (for a large part) to hpet vs tsc.

Oh, interesting.

[1]: /messages/by-id/CAN55FZ3rb1SPp_17R5nMq+tqLDa0Odb7CxJDsBtW6d4AZo1MGg@mail.gmail.com
--
Michael

#76

Michael Paquier

michael@paquier.xyz

11 months ago

In reply to: Nazir Bilal Yavuz (#74)

Re: Show WAL write and fsync stats in pg_stat_io

On Mon, Feb 03, 2025 at 02:34:29PM +0300, Nazir Bilal Yavuz wrote:

On Mon, 3 Feb 2025 at 11:50, Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote:
=== 1
+       pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_INIT, IOOP_WRITE,
+                                                       io_start, 1, wal_segment_size);
In case wal_init_zero is false, then we're only seeking to the end and writing a
solitary byte. Then, is reporting "wal_segment_size" correct?
I think you are right. It would make sense to have two
pgstat_count_io_op_time() calls here. One for wal_segment_size and one
for solitary byte.

Ah, right. We can just use one call with the size written set
depending on wal_init_zero, because this is still a IOOP_WRITE for a
IOCONTEXT_INIT in both cases. Only the size changes as we are in
XLogFileInitInternal().
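
Roughly, the size passed to that single call could be picked as in the sketch below. This is only a standalone illustration: `sketch_init_write_bytes()` is a made-up helper, not code from the patch, and it assumes nothing beyond what is described above (the whole segment is written when wal_init_zero is enabled, a single byte otherwise).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: number of bytes to report for the IOOP_WRITE done
 * while initializing a new WAL segment.
 */
uint64_t
sketch_init_write_bytes(bool wal_init_zero, uint64_t wal_segment_size)
{
    assert(wal_segment_size > 0);

    /* Zero-filling writes the whole segment, otherwise only one byte. */
    return wal_init_zero ? wal_segment_size : 1;
}
```

With a 16MB segment size, this reports 16777216 bytes when wal_init_zero is on and 1 byte when it is off.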

+ /*
+  * Measure I/O timing to write WAL data, for pg_stat_wal
+  * and/or pg_stat_io.
+  */
+ start = pgstat_prepare_io_time(track_wal_io_timing || track_io_timing);
I think that makes sense done that way (as track_wal_io_timing does not have
any effect in pgstat_count_io_op_time()). Nit: maybe change the order in the
comment to reflect the code ordering? (I mean to say re-word to "for pg_stat_io
and/or pg_stat_wal"). The order is ok in issue_xlog_fsync() though.

Sure. Fine by me. This makes things a bit more consistent across the
board.

=== 3

What about adding a message in the doc as mentioned in [1]? (I'd not be surprised
if some people wonder why the "bytes" fields differ).

Not sure about that. Perhaps you have something in mind?

=== 4

pgstat_tracks_io_object() starts to be hard to read. I wonder if it could be
simplified with switch but that could be done after this one goes in.

If you have a proposal, feel free. The current style is something I'm
used to, as well, so that does not bother me much..

At the end, we want this patch and this data, and my benchmarking is
not showing much differences even if going through a workload with
many pages, so I've used the version relying entirely on
track_io_timing and applied it.

If we split these timings across more GUCs, one thing to consider
would be a third GUC which is neither track_wal_io_timing nor
track_io_timing to keep things independent, but I am not really
convinced that's necessary.

Now, for the rest..
--
Michael

#77

Nazir Bilal Yavuz

byavuz81@gmail.com

11 months ago

In reply to: Michael Paquier (#76)

Re: Show WAL write and fsync stats in pg_stat_io

Hi,

On Tue, 4 Feb 2025 at 10:55, Michael Paquier <michael@paquier.xyz> wrote:

On Mon, Feb 03, 2025 at 02:34:29PM +0300, Nazir Bilal Yavuz wrote:

On Mon, 3 Feb 2025 at 11:50, Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote:

=== 1

+       pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_INIT, IOOP_WRITE,
+                                                       io_start, 1, wal_segment_size);

In case wal_init_zero is false, then we're only seeking to the end and writing a
solitary byte. Then, is reporting "wal_segment_size" correct?

I think you are right. It would make sense to have two
pgstat_count_io_op_time() calls here. One for wal_segment_size and one
for solitary byte.

Ah, right. We can just use one call with the size written set
depending on wal_init_zero, because this is still a IOOP_WRITE for a
IOCONTEXT_INIT in both cases. Only the size changes as we are in
XLogFileInitInternal().

Yes, that is better. Thanks!

At the end, we want this patch and this data, and my benchmarking is
not showing much differences even if going through a workload with
many pages, so I've used the version relying entirely on
track_io_timing and applied it.

Thanks! My benchmark results are the same [1]. I could not see a
noticeable difference between master and the patched version. I ran the
benchmark a couple of times; sometimes the patched version is better,
sometimes master. I think the differences are within the margin of
error. Sharing the latest benchmark run at the bottom [1].

If we split these timings across more GUCs, one thing to consider
would be a third GUC which is neither track_wal_io_timing nor
track_io_timing to keep things independent, but I am not really
convinced that's necessary.

I agree.

[1]
pgbench -n -c8 -j8 -T60 -f <(echo "SELECT pg_logical_emit_message(true, 'test', repeat('0', ${emit_bytes}));";)
with emit bytes being 10 and 8192 for the 'acpi_pm', 'hpet' and 'tsc' clock sources.

(I hope that table gets rendered correctly in the mailing list.)

Change %: positive means better.

+--------------+------------+-----------------+---------------------+------------+-------------+----------+
| clock source | emit_bytes | track_io_timing | track_wal_io_timing | master TPS | patched TPS | Change % |
+--------------+------------+-----------------+---------------------+------------+-------------+----------+
| acpi_pm      |         10 | off             | off                 |       2164 |        2221 |     2.63 |
| acpi_pm      |         10 | on              | off                 |       2171 |        2246 |     3.45 |
| acpi_pm      |         10 | off             | on                  |       2243 |        2178 |    -2.90 |
| acpi_pm      |         10 | on              | on                  |       2188 |        2148 |    -1.83 |
| acpi_pm      |       8192 | off             | off                 |       2226 |        2172 |    -2.43 |
| acpi_pm      |       8192 | on              | off                 |       2219 |        2203 |    -0.72 |
| acpi_pm      |       8192 | off             | on                  |       2168 |        2223 |     2.54 |
| acpi_pm      |       8192 | on              | on                  |       2229 |        2204 |    -1.12 |
+--------------+------------+-----------------+---------------------+------------+-------------+----------+
| hpet         |         10 | off             | off                 |       2087 |        2156 |     3.31 |
| hpet         |         10 | on              | off                 |       2141 |        2089 |    -2.43 |
| hpet         |         10 | off             | on                  |       2085 |        2071 |    -0.67 |
| hpet         |         10 | on              | on                  |       2024 |        2052 |     1.38 |
| hpet         |       8192 | off             | off                 |       2141 |        2074 |    -3.13 |
| hpet         |       8192 | on              | off                 |       2056 |        2079 |     1.12 |
| hpet         |       8192 | off             | on                  |       2096 |        2087 |    -0.43 |
| hpet         |       8192 | on              | on                  |       2103 |        2086 |    -0.81 |
+--------------+------------+-----------------+---------------------+------------+-------------+----------+
| tsc          |         10 | off             | off                 |       2276 |        2252 |    -1.05 |
| tsc          |         10 | on              | off                 |       2213 |        2124 |    -4.02 |
| tsc          |         10 | off             | on                  |       2183 |        2250 |     3.07 |
| tsc          |         10 | on              | on                  |       2287 |        2250 |    -1.62 |
| tsc          |       8192 | off             | off                 |       2178 |        2179 |     0.05 |
| tsc          |       8192 | on              | off                 |       2203 |        2187 |    -0.73 |
| tsc          |       8192 | off             | on                  |       2220 |        2240 |     0.90 |
| tsc          |       8192 | on              | on                  |       2102 |        2182 |     3.81 |
+--------------+------------+-----------------+---------------------+------------+-------------+----------+

--
Regards,
Nazir Bilal Yavuz
Microsoft

#78

Nazir Bilal Yavuz

byavuz81@gmail.com

11 months ago

In reply to: Nazir Bilal Yavuz (#77)

1 attachment(s)

Re: Show WAL write and fsync stats in pg_stat_io

Hi,

On Tue, 4 Feb 2025 at 17:29, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:

Thanks! My benchmark results are the same [1]. I could not see a
noticeable difference between master and the patched version. I ran the
benchmark a couple of times; sometimes the patched version is better,
sometimes master. I think the differences are within the margin of
error. Sharing the latest benchmark run at the bottom [1].

[1]
pgbench -n -c8 -j8 -T60 -f <(echo "SELECT pg_logical_emit_message(true, 'test', repeat('0', ${emit_bytes}));";)
with emit bytes being 10 and 8192 for the 'acpi_pm', 'hpet' and 'tsc' clock sources.

(I hope that table gets rendered correctly in the mailing list.)

It did not get rendered correctly, so attached as a text file.

--
Regards,
Nazir Bilal Yavuz
Microsoft

#79

Michael Paquier

michael@paquier.xyz

11 months ago

In reply to: Nazir Bilal Yavuz (#78)

Re: Show WAL write and fsync stats in pg_stat_io

On Tue, Feb 04, 2025 at 05:40:06PM +0300, Nazir Bilal Yavuz wrote:

Hi,

On Tue, 4 Feb 2025 at 17:29, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:

Thanks! My benchmark results are the same [1]. I could not see a
noticeable difference between master and the patched version. I ran the
benchmark a couple of times; sometimes the patched version is better,
sometimes master. I think the differences are within the margin of
error. Sharing the latest benchmark run at the bottom [1].

[1]
pgbench -n -c8 -j8 -T60 -f <(echo "SELECT pg_logical_emit_message(true, 'test', repeat('0', ${emit_bytes}));";)
with emit bytes being 10 and 8192 for the 'acpi_pm', 'hpet' and 'tsc' clock sources.

(I hope that table gets rendered correctly in the mailing list.)

It did not get rendered correctly, so attached as a text file.

Thanks for this summary.

track_io_timing=on and track_wal_io_timing=off are the most
interesting lines to me, for HEAD vs the patch. The number for tsc
with 10 bytes is the most pronounced one, but there is a similar
variance even when using "off" for the two GUCs, where I would not
have expected any differences. So yeah, the variance looks like
mostly noise here.

Please note that I have marked the CF entry at [1] as committed as the
main goal of this thread has been achieved. Do you want to pursue the
topic about the potential refactorings in pgstat_tracks_io_op() and
pgstat_tracks_io_object()? There is a second one with the elimination
of the write & sync columns for pg_stat_wal. Both should have their
own thread if worked on, and this thread has been already quite long.

[1]: https://commitfest.postgresql.org/51/4950/
--
Michael

#80

Tom Lane

tgl@sss.pgh.pa.us

11 months ago

In reply to: Michael Paquier (#76)

Re: Show WAL write and fsync stats in pg_stat_io

Michael Paquier <michael@paquier.xyz> writes:

At the end, we want this patch and this data, and my benchmarking is
not showing much differences even if going through a workload with
many pages, so I've used the version relying entirely on
track_io_timing and applied it.

Locally, the test added by this commit fails like so:

diff -U3 /home/postgres/pgsql/src/test/regress/expected/stats.out /home/postgres/pgsql/src/test/regress/results/stats.out
--- /home/postgres/pgsql/src/test/regress/expected/stats.out    2025-02-04 12:33:07.456393545 -0500
+++ /home/postgres/pgsql/src/test/regress/results/stats.out     2025-02-05 13:08:30.605638432 -0500
@@ -886,7 +886,7 @@
   WHERE context = 'normal' AND object = 'wal';
  ?column? 
 ----------
- t
+ f
 (1 row)

-----

This is pretty repeatable (not perfectly so) in a build with
--enable-debug --enable-cassert --enable-tap-tests --with-llvm
but it usually passes without --with-llvm. System is fairly
up-to-date RHEL8 on x86_64. No idea why the buildfarm isn't
unhappy. Any pointers where to look?

regards, tom lane

#81

Nazir Bilal Yavuz

byavuz81@gmail.com

11 months ago

In reply to: Tom Lane (#80)

Re: Show WAL write and fsync stats in pg_stat_io

Hi,

On Wed, 5 Feb 2025 at 21:32, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Michael Paquier <michael@paquier.xyz> writes:

At the end, we want this patch and this data, and my benchmarking is
not showing much difference even if going through a workload with
many pages, so I've used the version relying entirely on
track_io_timing and applied it.

Locally, the test added by this commit fails like so:
diff -U3 /home/postgres/pgsql/src/test/regress/expected/stats.out /home/postgres/pgsql/src/test/regress/results/stats.out
--- /home/postgres/pgsql/src/test/regress/expected/stats.out    2025-02-04 12:33:07.456393545 -0500
+++ /home/postgres/pgsql/src/test/regress/results/stats.out     2025-02-05 13:08:30.605638432 -0500
@@ -886,7 +886,7 @@
   WHERE context = 'normal' AND object = 'wal';
  ?column? 
 ----------
- t
+ f
 (1 row)

-----

This is pretty repeatable (not perfectly so) in a build with
--enable-debug --enable-cassert --enable-tap-tests --with-llvm
but it usually passes without --with-llvm. System is fairly
up-to-date RHEL8 on x86_64. No idea why the buildfarm isn't
unhappy. Any pointers where to look?

Thanks for the report!

My thought when adding this test was that the startup process must do
WAL read I/O while the server is starting, i.e.:

'''
startup process ->
InitWalRecovery ->
ReadCheckpointRecord ->
ReadRecord ->
XLogPrefetcherReadRecord ->
lrq_complete_lsn ->
lrq_prefetch ->
lrq->next = XLogPrefetcherNextBlock ->
XLogReadAhead ->
XLogDecodeNextRecord ->
ReadPageInternal ->
state->routine.page_read = XLogPageRead()
'''

Is there a chance that the function chain above does not get triggered
while running the stats.sql test?

--
Regards,
Nazir Bilal Yavuz
Microsoft

#82

Tom Lane

tgl@sss.pgh.pa.us

11 months ago

In reply to: Nazir Bilal Yavuz (#81)

Re: Show WAL write and fsync stats in pg_stat_io

Nazir Bilal Yavuz <byavuz81@gmail.com> writes:

On Wed, 5 Feb 2025 at 21:32, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Locally, the test added by this commit fails like so:

My thought when adding this test was that the startup process must do
WAL read I/O while the server is starting, i.e.:

Oh! The problem is much simpler than I thought, then, and not
particularly related to configure options. I was misled by
not trying repeated tests, but:

$ ./configure ...
$ make; make install
$ initdb ...; pg_ctl start
$ make installcheck
... all OK ...
$ make installcheck
... fails ...
$ make installcheck
... fails ...
$ make installcheck
... fails ...
$ make installcheck
... fails ...

Basically, this test is broken because it assumes that the server
is freshly started. That's not a scenario the buildfarm runs
the core tests under, unfortunately, but it is one commonly
used in hand testing.

regards, tom lane

#83

Michael Paquier

michael@paquier.xyz

11 months ago

In reply to: Tom Lane (#82)

Re: Show WAL write and fsync stats in pg_stat_io

On Wed, Feb 05, 2025 at 09:06:24PM -0500, Tom Lane wrote:

Basically, this test is broken because it assumes that the server
is freshly started. That's not a scenario the buildfarm runs
the core tests under, unfortunately, but it is one commonly
used in hand testing.

Right. You are pointing at a stability issue related to the reads.
While looking at your case, I am also able to see two additional
failures with io_sum_wal_init_writes and io_sum_wal_init_fsyncs in
the init context.

The queries for the normal context are not going to have this problem
even if we have a pg_stat_reset_shared('io'), but the init context
gets unstable, unfortunately. I don't see a way through here in the
main regression test suite, so how about moving these into
027_stream_regress.pl? It is possible to query the WAL read on the
standby of this test, and the write part for the init context on the
primary. The syncs are not relevant as TAP usually runs with
fsync=off, so better to remove this part entirely.
--
Michael

#84

Tom Lane

tgl@sss.pgh.pa.us

11 months ago

In reply to: Michael Paquier (#83)

Re: Show WAL write and fsync stats in pg_stat_io

Michael Paquier <michael@paquier.xyz> writes:

The queries for the normal context are not going to have this problem
even if we have a pg_stat_reset_shared('io'), but the init context
gets unstable, unfortunately. I don't see a way through here in the
main regression test suite, so how about moving these into
027_stream_regress.pl? It is possible to query the WAL read on the
standby of this test, and the write part for the init context on the
primary. The syncs are not relevant as TAP usually runs with
fsync=off, so better to remove this part entirely.

Yeah, if we want to assume we can see stats counts left over from
initdb, we have to put this in a TAP test, though I dunno if that is
the most appropriate one.

Now that I've looked at the tests a bit, I'm also distressed
by this test pattern:

SELECT stats_reset AS slru_commit_ts_reset_ts FROM pg_stat_slru WHERE name = 'commit_timestamp' \gset
SELECT pg_stat_reset_slru();
SELECT stats_reset > :'slru_commit_ts_reset_ts'::timestamptz FROM pg_stat_slru WHERE name = 'commit_timestamp';

This assumes that the execution time of pg_stat_reset_slru() is more
than the system clock resolution. I won't be surprised to see that
fail in the future. We did discover recently that gettimeofday is
good to the microsecond on most modern platforms [1], but it won't
get any better than that, while our machines keep getting faster.
Just for reference, on my hardly-bleeding-edge-anymore workstation:

regression=# select clock_timestamp(), pg_stat_reset_slru(), clock_timestamp();
        clock_timestamp        | pg_stat_reset_slru |        clock_timestamp
-------------------------------+--------------------+-------------------------------
 2025-02-05 21:47:54.929221-05 |                    | 2025-02-05 21:47:54.929223-05
(1 row)

regards, tom lane

[1]: /messages/by-id/be0339cc-1ae1-4892-9445-8e6d8995a44d@eisentraut.org
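
The hazard described in the message above can be sketched in C. This is only an illustration, not PostgreSQL code: the type and function names (demo_usec, demo_ns_to_usec, demo_reset_is_later, demo_clock_resolution) are hypothetical, and timestamps are modelled as plain microsecond counts. It shows that a strict ">" comparison holds when two readings are 2 microseconds apart, as in the measurement above, but fails once both readings fall into the same microsecond.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a timestamptz: microseconds since an epoch. */
typedef int64_t demo_usec;

/* Truncate a nanosecond reading to the microsecond grid. */
demo_usec demo_ns_to_usec(int64_t ns)
{
    return ns / 1000;
}

/* Same shape as the test's "stats_reset > :'old_reset_ts'" comparison. */
bool demo_reset_is_later(demo_usec old_reset, demo_usec new_reset)
{
    return new_reset > old_reset;
}

/* Walks through one passing and one failing case. */
void demo_clock_resolution(void)
{
    /* 2 us apart, like the measurement above: the check passes. */
    assert(demo_reset_is_later(929221, 929223));

    /* 400 ns apart: both readings truncate to the same microsecond. */
    demo_usec before = demo_ns_to_usec(INT64_C(1000000200));
    demo_usec after = demo_ns_to_usec(INT64_C(1000000600));

    assert(before == after);
    assert(!demo_reset_is_later(before, after));
}
```

The second case is the one the message worries about: once the reset completes within a single clock tick, the old and new stats_reset values compare equal and the strict inequality no longer holds.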

#85

Michael Paquier

michael@paquier.xyz

11 months ago

In reply to: Tom Lane (#84)

1 attachment(s)

Re: Show WAL write and fsync stats in pg_stat_io

On Wed, Feb 05, 2025 at 09:52:14PM -0500, Tom Lane wrote:

Michael Paquier <michael@paquier.xyz> writes:
Yeah, if we want to assume we can see stats counts left over from
initdb, we have to put this in a TAP test, though I dunno if that is
the most appropriate one.

A second option I can think of for the reads is a SQL query in
pg_walinspect. We are sure that we have a xlogreader context there,
forcing reads.

Anyway, I would just stick all that to TAP, like the attached in 027,
where we would rely on the startup process to read data, and the
checkpointer to initialize a segment for the primary. Perhaps not the
best position, but we already have similar queries in this test, and
these two are cheap. Thoughts about the attached?

Now that I've looked at the tests a bit, I'm also distressed
by this test pattern:

SELECT stats_reset AS slru_commit_ts_reset_ts FROM pg_stat_slru WHERE name = 'commit_timestamp' \gset
SELECT pg_stat_reset_slru();
SELECT stats_reset > :'slru_commit_ts_reset_ts'::timestamptz FROM pg_stat_slru WHERE name = 'commit_timestamp';

This assumes that the execution time of pg_stat_reset_slru() is more
than the system clock resolution. I won't be surprised to see that
fail in the future. We did discover recently that gettimeofday is
good to the microsecond on most modern platforms [1], but it won't
get any better than that, while our machines keep getting faster.
Just for reference, on my hardly-bleeding-edge-anymore workstation:

Hmm. Interesting.
--
Michael

Attachments:

pgstat_io-tests.patch (text/x-diff; charset=us-ascii)

diff --git a/src/test/recovery/t/027_stream_regress.pl b/src/test/recovery/t/027_stream_regress.pl
index bab7b28084b..a04ecc4d2db 100644
--- a/src/test/recovery/t/027_stream_regress.pl
+++ b/src/test/recovery/t/027_stream_regress.pl
@@ -63,6 +63,26 @@ $node_standby_1->append_conf('postgresql.conf',
 	'max_standby_streaming_delay = 600s');
 $node_standby_1->start;
 
+# Check some WAL statistics.  The standby should have done WAL reads in
+# the startup process when starting, and the primary WAL some writes with
+# its checkpointer.
+my $result = $node_primary->safe_psql(
+	'postgres',
+	qq{SELECT object, context, writes > 0 AS writes_done
+  FROM pg_stat_io
+  WHERE context = 'init' AND
+    object = 'wal' AND
+    backend_type = 'checkpointer'});
+is($result, qq(wal|init|t), 'check contents of WAL stats on primary');
+$result = $node_standby_1->safe_psql(
+	'postgres',
+	qq{SELECT object, context, reads > 0 AS reads_done
+  FROM pg_stat_io
+  WHERE context = 'normal' AND
+    object = 'wal' AND
+    backend_type = 'startup'});
+is($result, qq(wal|normal|t), 'check contents of WAL stats on standby');
+
 my $dlpath = dirname($ENV{REGRESS_SHLIB});
 my $outputdir = $PostgreSQL::Test::Utils::tmp_check;
 
@@ -163,7 +183,7 @@ $node_primary->safe_psql('postgres', 'CREATE EXTENSION pg_stat_statements');
 # This gathers data based on the first characters for some common query types,
 # checking that reports are generated for SELECT, DMLs, and DDL queries with
 # CREATE.
-my $result = $node_primary->safe_psql(
+$result = $node_primary->safe_psql(
 	'postgres',
 	qq{WITH select_stats AS
   (SELECT upper(substr(query, 1, 6)) AS select_query
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index 9a02481ee7e..7d91f047bb3 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -862,33 +862,6 @@ WHERE pg_stat_get_backend_pid(beid) = pg_backend_pid();
  t
 (1 row)
 
--- Test pg_stat_io for WAL in an init context, that should do writes
--- and syncs.
-SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
-  FROM pg_stat_io
-  WHERE context = 'init' AND object = 'wal' \gset io_sum_wal_init_
-SELECT :io_sum_wal_init_writes > 0;
- ?column? 
-----------
- t
-(1 row)
-
-SELECT current_setting('fsync') = 'off'
-  OR :io_sum_wal_init_fsyncs > 0;
- ?column? 
-----------
- t
-(1 row)
-
--- Test pg_stat_io for WAL in a normal context, that should do reads as well.
-SELECT SUM(reads) > 0
-  FROM pg_stat_io
-  WHERE context = 'normal' AND object = 'wal';
- ?column? 
-----------
- t
-(1 row)
-
 -----
 -- Test that resetting stats works for reset timestamp
 -----
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index 901e7bd56e3..11628ebc8a1 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -442,20 +442,6 @@ SELECT (current_schemas(true))[1] = ('pg_temp_' || beid::text) AS match
 FROM pg_stat_get_backend_idset() beid
 WHERE pg_stat_get_backend_pid(beid) = pg_backend_pid();
 
--- Test pg_stat_io for WAL in an init context, that should do writes
--- and syncs.
-SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
-  FROM pg_stat_io
-  WHERE context = 'init' AND object = 'wal' \gset io_sum_wal_init_
-SELECT :io_sum_wal_init_writes > 0;
-SELECT current_setting('fsync') = 'off'
-  OR :io_sum_wal_init_fsyncs > 0;
-
--- Test pg_stat_io for WAL in a normal context, that should do reads as well.
-SELECT SUM(reads) > 0
-  FROM pg_stat_io
-  WHERE context = 'normal' AND object = 'wal';
-
 -----
 -- Test that resetting stats works for reset timestamp
 -----

#86

Bertrand Drouvot

bertranddrouvot.pg@gmail.com

11 months ago

In reply to: Michael Paquier (#85)

Re: Show WAL write and fsync stats in pg_stat_io

Hi,

On Thu, Feb 06, 2025 at 12:35:28PM +0900, Michael Paquier wrote:

On Wed, Feb 05, 2025 at 09:52:14PM -0500, Tom Lane wrote:

Michael Paquier <michael@paquier.xyz> writes:
Yeah, if we want to assume we can see stats counts left over from
initdb, we have to put this in a TAP test, though I dunno if that is
the most appropriate one.

A second option I can think of for the reads is a SQL query in
pg_walinspect. We are sure that we have a xlogreader context there,
forcing reads.

Anyway, I would just stick all that to TAP, like the attached in 027,
where we would rely on the startup process to read data, and the
checkpointer to initialize a segment for the primary. Perhaps not the
best position, but we already have similar queries in this test, and
these two are cheap. Thoughts about the attached?

That sounds ok, but I wonder if that's the most appropriate place. I could
think of the checkpointer test in 029_stats_restart.pl and the startup/standby
one in a test related to standbys (030_stats_cleanup_replica.pl?). Though that's
probably just a matter of taste.

s/and the primary WAL some writes/and the primary some WAL writes/

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#87

Michael Paquier

michael@paquier.xyz

11 months ago

In reply to: Bertrand Drouvot (#86)

Re: Show WAL write and fsync stats in pg_stat_io

On Thu, Feb 06, 2025 at 07:32:01AM +0000, Bertrand Drouvot wrote:

That sounds ok, but I wonder if that's the most appropriate place. I could
think of the checkpointer test in 029_stats_restart.pl and the startup/standby
one in a test related to standbys (030_stats_cleanup_replica.pl?). Though that's
probably just a matter of taste.

Hmm. Your suggestion of 029_stats_restart.pl is a tempting choice,
indeed, more in line with the fact that we are checking some stats
data. I did not think about that.

Note that I'm OK about using 029 or even 030, as long as both queries
stay together.
--
Michael

#88

Nazir Bilal Yavuz

byavuz81@gmail.com

11 months ago

In reply to: Michael Paquier (#79)

Re: Show WAL write and fsync stats in pg_stat_io

Hi,

On Wed, 5 Feb 2025 at 08:30, Michael Paquier <michael@paquier.xyz> wrote:

Please note that I have marked the CF entry at [1] as committed as the
main goal of this thread has been achieved. Do you want to pursue the
topic about the potential refactorings in pgstat_tracks_io_op() and
pgstat_tracks_io_object()? There is a second one with the elimination
of the write & sync columns for pg_stat_wal. Both should have their
own thread if worked on, and this thread has been already quite long.

Yes, I would like to work on these.

By saying potential refactorings in pgstat_tracks_io_op() and
pgstat_tracks_io_object(), do you mean converting if statements to
switch statements or something else?

--
Regards,
Nazir Bilal Yavuz
Microsoft
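
For readers wondering what the if-to-switch conversion asked about above could look like, here is a minimal sketch in C. It is not the PostgreSQL implementation and the thread does not say which shape the refactoring eventually took. The enum and function names are hypothetical, reduced stand-ins, and the list of backend types is borrowed from the pgstat_tracks_io_op() hunk quoted later in this thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, reduced stand-in for PostgreSQL's BackendType. */
typedef enum DemoBackendType
{
    DEMO_B_BACKEND,
    DEMO_B_STARTUP,
    DEMO_B_AUTOVAC_LAUNCHER,
    DEMO_B_AUTOVAC_WORKER,
    DEMO_B_BG_WRITER,
    DEMO_B_WAL_RECEIVER,
    DEMO_B_WAL_WRITER
} DemoBackendType;

/*
 * Returns false for backend types whose WAL reads are not tracked,
 * true otherwise.  Each backend type gets its own case label.
 */
bool demo_tracks_wal_read(DemoBackendType bktype)
{
    switch (bktype)
    {
        case DEMO_B_WAL_RECEIVER:
        case DEMO_B_BG_WRITER:
        case DEMO_B_AUTOVAC_LAUNCHER:
        case DEMO_B_AUTOVAC_WORKER:
        case DEMO_B_WAL_WRITER:
            return false;
        default:
            return true;
    }
}

/* Small usage example. */
void demo_switch_form(void)
{
    assert(!demo_tracks_wal_read(DEMO_B_AUTOVAC_LAUNCHER));
    assert(!demo_tracks_wal_read(DEMO_B_WAL_WRITER));
    assert(demo_tracks_wal_read(DEMO_B_STARTUP));
    assert(demo_tracks_wal_read(DEMO_B_BACKEND));
}
```

One property of this form: C compilers reject a switch that repeats the same case label, whereas a repeated operand in a chain of "||" comparisons compiles without complaint.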

#89

Nazir Bilal Yavuz

byavuz81@gmail.com

11 months ago

In reply to: Michael Paquier (#87)

Re: Show WAL write and fsync stats in pg_stat_io

Hi,

On Thu, 6 Feb 2025 at 11:39, Michael Paquier <michael@paquier.xyz> wrote:

On Thu, Feb 06, 2025 at 07:32:01AM +0000, Bertrand Drouvot wrote:

That sounds ok, but I wonder if that's the most appropriate place. I could
think of the checkpointer test in 029_stats_restart.pl and the startup/standby
one in a test related to standbys (030_stats_cleanup_replica.pl?). Though that's
probably just a matter of taste.

Hmm. Your suggestion of 029_stats_restart.pl is a tempting choice,
indeed, more in line with the fact that we are checking some stats
data. I did not think about that.

Note that I'm OK about using 029 or even 030, as long as both queries
stay together.

My vote goes to 029. It already has the '## check checkpoint and wal
stats are incremented due to restart' part and what we are adding is
similar to that.

On Thu, 6 Feb 2025 at 10:32, Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

s/and the primary WAL some writes/and the primary some WAL writes/

I am not a native speaker but 'primary writes some WAL with its
checkpointer' sounds better to me.

--
Regards,
Nazir Bilal Yavuz
Microsoft

#90

Michael Paquier

michael@paquier.xyz

11 months ago

In reply to: Nazir Bilal Yavuz (#89)

Re: Show WAL write and fsync stats in pg_stat_io

On Thu, Feb 06, 2025 at 03:39:05PM +0300, Nazir Bilal Yavuz wrote:

My vote goes to 029. It already has the '## check checkpoint and wal
stats are incremented due to restart' part and what we are adding is
similar to that.

Okay, sold, with a routine grabbing some stats from pg_stat_io based
on an object, a context and a backend type to match with the style of
the test. The stats are checked just after initialization, which
should be enough for our purpose.
--
Michael

#91

Ranier Vilela

ranier.vf@gmail.com

10 months ago

In reply to: Michael Paquier (#71)

1 attachment(s)

Re: Show WAL write and fsync stats in pg_stat_io

Hi.

On Mon, Feb 3, 2025 at 01:07, Michael Paquier <michael@paquier.xyz>
wrote:

On Fri, Jan 31, 2025 at 11:29:31AM +0300, Nazir Bilal Yavuz wrote:

On Wed, 29 Jan 2025 at 18:16, Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

I think that's the main reason why ff99918c625 added this new GUC
(looking at the commit message). I'd feel more comfortable if we keep it.

As Michael suggested, I will run a couple of benchmarks to see the
actual effect of this change. Then let's see if this affects anything.

I've looked a bit at all that today, and something like the attached
is what seems like the best streamlined version to me for the main
feature. I am also planning to run some short benchmarks with
track_io_timing=on on HEAD and with the patch, then see the
difference, without any relationship to track_wal_io_timing.

The comment additions in pgstat_count_io_op_time() were worth a patch
of their own. This part has been applied as b998fedab74c, after a few
tweaks of my own.

Sorry, I couldn't find the email in this thread linked to the commit
a051e71 (a051e71e28a12342a4fb39a3c149a197159f9c46).

I think it left an oversight.
Copy and paste thinko?

Attached a trivial patch.

best regards,
Ranier Vilela

Attachments:

fix-possible-copy-and-paste-thinko.patch (application/octet-stream)

diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index eb57502559..28b3678722 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -511,7 +511,7 @@ pgstat_tracks_io_op(BackendType bktype, IOObject io_object,
 	 */
 	if (io_object == IOOBJECT_WAL && io_op == IOOP_READ &&
 		(bktype == B_WAL_RECEIVER || bktype == B_BG_WRITER ||
-		 bktype == B_AUTOVAC_WORKER || bktype == B_AUTOVAC_WORKER ||
+		 bktype == B_AUTOVAC_LAUNCHER || bktype == B_AUTOVAC_WORKER ||
 		 bktype == B_WAL_WRITER))
 		return false;

#92

Nazir Bilal Yavuz

byavuz81@gmail.com

10 months ago

In reply to: Ranier Vilela (#91)

Re: Show WAL write and fsync stats in pg_stat_io

Hi,

On Tue, 18 Mar 2025 at 15:47, Ranier Vilela <ranier.vf@gmail.com> wrote:

Sorry, I couldn't find the email in this thread, linked to the commit:
a051e71

I think it left an oversight.
Copy and paste thinko?

Thanks for the report! Yes, it is an oversight on my part.

Attached a trivial patch.

LGTM.

--
Regards,
Nazir Bilal Yavuz
Microsoft

#93

Michael Paquier

michael@paquier.xyz

10 months ago

In reply to: Nazir Bilal Yavuz (#92)

Re: Show WAL write and fsync stats in pg_stat_io

On Tue, Mar 18, 2025 at 04:13:15PM +0300, Nazir Bilal Yavuz wrote:

Thanks for the report! Yes, it is an oversight on my part.

No worries. It was most likely me while reviewing the whole patch, as
I've reordered these blocks a bit while going through it.

This has little consequence for the end user: reads, read_bytes and
read_time show up as 0 instead of NULL in the pg_stat_io view for the
two WAL rows of the autovacuum launcher. Still, it is confusing to show
zeros, and the fix makes queries of pg_stat_io a bit cheaper. Will fix
in a bit.
--
Michael
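
The effect of the duplicated operand can be reproduced with a small standalone model in C. This is a sketch under assumptions, not the PostgreSQL source: the enum and function names are hypothetical, and only the shape of the "||" chain follows the hunk posted earlier in the thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, reduced stand-in for PostgreSQL's BackendType. */
typedef enum DemoBackendType
{
    DEMO_B_BACKEND,
    DEMO_B_AUTOVAC_LAUNCHER,
    DEMO_B_AUTOVAC_WORKER,
    DEMO_B_BG_WRITER,
    DEMO_B_WAL_RECEIVER,
    DEMO_B_WAL_WRITER
} DemoBackendType;

/* Pre-fix shape: the worker is tested twice, the launcher never. */
bool demo_skips_wal_read_buggy(DemoBackendType bktype)
{
    return bktype == DEMO_B_WAL_RECEIVER || bktype == DEMO_B_BG_WRITER ||
           bktype == DEMO_B_AUTOVAC_WORKER || bktype == DEMO_B_AUTOVAC_WORKER ||
           bktype == DEMO_B_WAL_WRITER;
}

/* Post-fix shape: the launcher takes the place of the duplicate. */
bool demo_skips_wal_read_fixed(DemoBackendType bktype)
{
    return bktype == DEMO_B_WAL_RECEIVER || bktype == DEMO_B_BG_WRITER ||
           bktype == DEMO_B_AUTOVAC_LAUNCHER || bktype == DEMO_B_AUTOVAC_WORKER ||
           bktype == DEMO_B_WAL_WRITER;
}

/* The two versions disagree only for the autovacuum launcher. */
void demo_thinko(void)
{
    assert(!demo_skips_wal_read_buggy(DEMO_B_AUTOVAC_LAUNCHER));
    assert(demo_skips_wal_read_fixed(DEMO_B_AUTOVAC_LAUNCHER));

    assert(demo_skips_wal_read_buggy(DEMO_B_AUTOVAC_WORKER) ==
           demo_skips_wal_read_fixed(DEMO_B_AUTOVAC_WORKER));
    assert(demo_skips_wal_read_buggy(DEMO_B_BACKEND) ==
           demo_skips_wal_read_fixed(DEMO_B_BACKEND));
}
```

In this model the buggy predicate never excludes the launcher, which matches the symptom described above: its WAL read columns are treated as tracked and so display zeros rather than NULL.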

#94

Ranier Vilela

ranier.vf@gmail.com

10 months ago

In reply to: Michael Paquier (#93)

Re: Show WAL write and fsync stats in pg_stat_io

On Tue, Mar 18, 2025 at 20:34, Michael Paquier <michael@paquier.xyz>
wrote:

On Tue, Mar 18, 2025 at 04:13:15PM +0300, Nazir Bilal Yavuz wrote:

Thanks for the report! Yes, it is an oversight on my part.

No worries. It was most likely me while reviewing the whole patch, as
I've reordered these blocks a bit while going through it.

This has little consequence for the end user: reads, read_bytes and
read_time show up as 0 instead of NULL in the pg_stat_io view for the
two WAL rows of the autovacuum launcher. Still, it is confusing to show
zeros, and the fix makes queries of pg_stat_io a bit cheaper. Will fix
in a bit.

Thank you Michael.

best regards,
Ranier Vilela