pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)
Hi,
Currently pg_stat_bgwriter.buffers_backend is pretty useless for
gauging whether backends are doing writes they shouldn't do. That's
because it counts things that are either unavoidable or unlikely to be
doable by other parts of the system (checkpointer, bgwriter).
In particular, extending a file cannot currently be done by any other
type of process, yet is counted. When using a buffer access strategy it
is also very likely that writes have to be done by the 'dirtying'
backend itself, as the buffer will be reused soon after (when not
previously in s_b, that is).
Additionally pg_stat_bgwriter.buffers_backend also counts writes done by
autovacuum et al.
I think it'd make sense to at least split buffers_backend into
buffers_backend_extend,
buffers_backend_write,
buffers_backend_write_strat
but it could also be worthwhile to expand it into
buffers_backend_extend,
buffers_{backend,checkpoint,bgwriter,autovacuum}_write
buffers_{backend,autovacuum}_write_strat
Internally, in contrast to the SQL level, this could possibly just be
counter arrays indexed by backend type.
It's also noteworthy that buffers_backend is accounted in an absurd
manner. One might think that writes are accounted from backend -> shared
memory or such. But instead it works like this:
1) backend flushes buffer in bufmgr.c, accounts for backend *write time*
2) mdwrite() writes the block and registers a sync request, which is forwarded to the checkpointer
3) ForwardSyncRequest(), when not called by bgwriter, increments CheckpointerShmem->num_backend_writes
4) checkpointer, whenever doing AbsorbSyncRequests(), moves
CheckpointerShmem->num_backend_writes to
BgWriterStats.m_buf_written_backend (local memory!)
5) Occasionally it calls pgstat_send_bgwriter(), which sends the data to
pgstat (which bgwriter also does)
6) Which then updates the shared memory used by the display functions
Worthwhile to note that backend buffer read/write *time* is accounted
differently. That's done via pgstat_send_tabstat().
I think there's very little excuse for the indirection via checkpointer;
besides being architecturally weird, it actually requires that we
continue to wake up the checkpointer over and over instead of optimizing
how and when we submit fsync requests.
As far as I can tell we're also simply not accounting at all for writes
done outside of shared buffers. All writes done directly through
smgrwrite()/extend() aren't accounted anywhere as far as I can tell.
I think we also count things as writes that aren't writes: mdtruncate()
is AFAICT counted as one backend write for each segment. Which seems
weird to me.
Lastly, I don't understand what the point is of sending fixed size
stats, like the stuff underlying pg_stat_bgwriter, through pgstats IPC.
While I don't like its architecture, we obviously need something like
pgstat to handle variable amounts of stats (database, table level etc
stats). But that doesn't at all apply to these types of global stats.
Greetings,
Andres Freund
On Fri, Jan 24, 2020 at 8:52 PM Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> Currently pg_stat_bgwriter.buffers_backend is pretty useless for
> gauging whether backends are doing writes they shouldn't do. That's
> because it counts things that are either unavoidable or unlikely to be
> doable by other parts of the system (checkpointer, bgwriter).
>
> In particular, extending a file cannot currently be done by any other
> type of process, yet is counted. When using a buffer access strategy it
> is also very likely that writes have to be done by the 'dirtying'
> backend itself, as the buffer will be reused soon after (when not
> previously in s_b, that is).
Yeah. That's quite annoying.
> Additionally pg_stat_bgwriter.buffers_backend also counts writes done by
> autovacuum et al.
>
> I think it'd make sense to at least split buffers_backend into
> buffers_backend_extend,
> buffers_backend_write,
> buffers_backend_write_strat
>
> but it could also be worthwhile to expand it into
> buffers_backend_extend,
> buffers_{backend,checkpoint,bgwriter,autovacuum}_write
> buffers_{backend,autovacuum}_write_strat
Given that these are individual global counters, I don't really see
any reason not to expand it to the bigger set of counters. It's easy
enough to add them up together later if needed.
> Internally, in contrast to the SQL level, this could possibly just be
> counter arrays indexed by backend type.
>
> It's also noteworthy that buffers_backend is accounted in an absurd
> manner. One might think that writes are accounted from backend -> shared
> memory or such. But instead it works like this:
>
> 1) backend flushes buffer in bufmgr.c, accounts for backend *write time*
> 2) mdwrite() writes the block and registers a sync request, which is
>    forwarded to the checkpointer
> 3) ForwardSyncRequest(), when not called by bgwriter, increments
>    CheckpointerShmem->num_backend_writes
> 4) checkpointer, whenever doing AbsorbSyncRequests(), moves
>    CheckpointerShmem->num_backend_writes to
>    BgWriterStats.m_buf_written_backend (local memory!)
> 5) Occasionally it calls pgstat_send_bgwriter(), which sends the data to
>    pgstat (which bgwriter also does)
> 6) Which then updates the shared memory used by the display functions
>
> Worthwhile to note that backend buffer read/write *time* is accounted
> differently. That's done via pgstat_send_tabstat().
>
> I think there's very little excuse for the indirection via checkpointer;
> besides being architecturally weird, it actually requires that we
> continue to wake up the checkpointer over and over instead of optimizing
> how and when we submit fsync requests.
>
> As far as I can tell we're also simply not accounting at all for writes
> done outside of shared buffers. All writes done directly through
> smgrwrite()/extend() aren't accounted anywhere as far as I can tell.
>
> I think we also count things as writes that aren't writes: mdtruncate()
> is AFAICT counted as one backend write for each segment. Which seems
> weird to me.
It's at least slightly weird :) Might it be worth counting truncate
events separately?
> Lastly, I don't understand what the point is of sending fixed size
> stats, like the stuff underlying pg_stat_bgwriter, through pgstats IPC.
> While I don't like its architecture, we obviously need something like
> pgstat to handle variable amounts of stats (database, table level etc
> stats). But that doesn't at all apply to these types of global stats.
That part has annoyed me as well a few times. +1 for just moving that
into a global shared memory. Given that we don't really care about
things being in sync between those different counters *or* if we lose
a bit of data (which the stats collector is designed to do), we could
even do that without a lock?
--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/
Hi,
On 2020-01-25 15:43:41 +0100, Magnus Hagander wrote:
> On Fri, Jan 24, 2020 at 8:52 PM Andres Freund <andres@anarazel.de> wrote:
> > Additionally pg_stat_bgwriter.buffers_backend also counts writes done by
> > autovacuum et al.
> >
> > I think it'd make sense to at least split buffers_backend into
> > buffers_backend_extend,
> > buffers_backend_write,
> > buffers_backend_write_strat
> >
> > but it could also be worthwhile to expand it into
> > buffers_backend_extend,
> > buffers_{backend,checkpoint,bgwriter,autovacuum}_write
> > buffers_{backend,autovacuum}_write_strat
>
> Given that these are individual global counters, I don't really see
> any reason not to expand it to the bigger set of counters. It's easy
> enough to add them up together later if needed.
Are you agreeing to
buffers_{backend,checkpoint,bgwriter,autovacuum}_write
or are you suggesting further ones?
> > I think we also count things as writes that aren't writes: mdtruncate()
> > is AFAICT counted as one backend write for each segment. Which seems
> > weird to me.
>
> It's at least slightly weird :) Might it be worth counting truncate
> events separately?
Is that really something interesting? Feels like it'd have to be done at
a higher level to be useful. E.g. the truncate done by TRUNCATE (when in
same xact as creation) and VACUUM are quite different. I think it'd be
better to just not include it.
> > Lastly, I don't understand what the point is of sending fixed size
> > stats, like the stuff underlying pg_stat_bgwriter, through pgstats IPC.
> > While I don't like its architecture, we obviously need something like
> > pgstat to handle variable amounts of stats (database, table level etc
> > stats). But that doesn't at all apply to these types of global stats.
>
> That part has annoyed me as well a few times. +1 for just moving that
> into a global shared memory. Given that we don't really care about
> things being in sync between those different counters *or* if we lose
> a bit of data (which the stats collector is designed to do), we could
> even do that without a lock?
I don't think we'd quite want to do it without any (single counter)
synchronization - high concurrency setups would be pretty likely to
lose values that way. I suspect the best would be to have a struct in
shared memory that contains the potential counters for each potential
process. And then sum them up when actually wanting the concrete
value. That way we avoid unnecessary contention, in contrast to having a
single shared memory value for each (which would just pingpong between
different sockets and store buffers). There's a few details like how
exactly to implement resetting the counters, but ...
Thanks,
Andres Freund
On Sun, Jan 26, 2020 at 1:44 AM Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> On 2020-01-25 15:43:41 +0100, Magnus Hagander wrote:
> > On Fri, Jan 24, 2020 at 8:52 PM Andres Freund <andres@anarazel.de> wrote:
> > > Additionally pg_stat_bgwriter.buffers_backend also counts writes done by
> > > autovacuum et al.
> > >
> > > I think it'd make sense to at least split buffers_backend into
> > > buffers_backend_extend,
> > > buffers_backend_write,
> > > buffers_backend_write_strat
> > >
> > > but it could also be worthwhile to expand it into
> > > buffers_backend_extend,
> > > buffers_{backend,checkpoint,bgwriter,autovacuum}_write
> > > buffers_{backend,autovacuum}_write_strat
> >
> > Given that these are individual global counters, I don't really see
> > any reason not to expand it to the bigger set of counters. It's easy
> > enough to add them up together later if needed.
>
> Are you agreeing to
> buffers_{backend,checkpoint,bgwriter,autovacuum}_write
> or are you suggesting further ones?
The former.
> > > I think we also count things as writes that aren't writes: mdtruncate()
> > > is AFAICT counted as one backend write for each segment. Which seems
> > > weird to me.
> >
> > It's at least slightly weird :) Might it be worth counting truncate
> > events separately?
>
> Is that really something interesting? Feels like it'd have to be done at
> a higher level to be useful. E.g. the truncate done by TRUNCATE (when in
> same xact as creation) and VACUUM are quite different. I think it'd be
> better to just not include it.
Yeah, you're probably right. It certainly makes very little sense
where it is now.
> > > Lastly, I don't understand what the point is of sending fixed size
> > > stats, like the stuff underlying pg_stat_bgwriter, through pgstats IPC.
> > > While I don't like its architecture, we obviously need something like
> > > pgstat to handle variable amounts of stats (database, table level etc
> > > stats). But that doesn't at all apply to these types of global stats.
> >
> > That part has annoyed me as well a few times. +1 for just moving that
> > into a global shared memory. Given that we don't really care about
> > things being in sync between those different counters *or* if we lose
> > a bit of data (which the stats collector is designed to do), we could
> > even do that without a lock?
>
> I don't think we'd quite want to do it without any (single counter)
> synchronization - high concurrency setups would be pretty likely to
> lose values that way. I suspect the best would be to have a struct in
> shared memory that contains the potential counters for each potential
> process. And then sum them up when actually wanting the concrete
> value. That way we avoid unnecessary contention, in contrast to having a
> single shared memory value for each (which would just pingpong between
> different sockets and store buffers). There's a few details like how
> exactly to implement resetting the counters, but ...
Right. Each process gets to do their own write, but still in shared
memory. But do you need to lock them when reading them (for the
summary)? That's the part where I figured you could just read and
summarize them, and accept the possible loss.
--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/
Hi,
On 2020-01-26 16:20:03 +0100, Magnus Hagander wrote:
> On Sun, Jan 26, 2020 at 1:44 AM Andres Freund <andres@anarazel.de> wrote:
> > On 2020-01-25 15:43:41 +0100, Magnus Hagander wrote:
> > > On Fri, Jan 24, 2020 at 8:52 PM Andres Freund <andres@anarazel.de> wrote:
> > > > Lastly, I don't understand what the point is of sending fixed size
> > > > stats, like the stuff underlying pg_stat_bgwriter, through pgstats
> > > > IPC. While I don't like its architecture, we obviously need something
> > > > like pgstat to handle variable amounts of stats (database, table
> > > > level etc stats). But that doesn't at all apply to these types of
> > > > global stats.
> > >
> > > That part has annoyed me as well a few times. +1 for just moving that
> > > into a global shared memory. Given that we don't really care about
> > > things being in sync between those different counters *or* if we lose
> > > a bit of data (which the stats collector is designed to do), we could
> > > even do that without a lock?
> >
> > I don't think we'd quite want to do it without any (single counter)
> > synchronization - high concurrency setups would be pretty likely to
> > lose values that way. I suspect the best would be to have a struct in
> > shared memory that contains the potential counters for each potential
> > process. And then sum them up when actually wanting the concrete
> > value. That way we avoid unnecessary contention, in contrast to having a
> > single shared memory value for each (which would just pingpong between
> > different sockets and store buffers). There's a few details like how
> > exactly to implement resetting the counters, but ...
>
> Right. Each process gets to do their own write, but still in shared
> memory. But do you need to lock them when reading them (for the
> summary)? That's the part where I figured you could just read and
> summarize them, and accept the possible loss.
Oh, yea, I'd not lock for that. On nearly all machines aligned 64bit
integers can be read / written without a danger of torn values, and I
don't think we need perfect cross counter accuracy. To deal with the few
platforms without 64bit "single copy atomicity", we can just use
pg_atomic_read/write_u64. These days (e8fdbd58fe) they automatically
fall back to using locked operations for those platforms. So I don't
think there's actually a danger of loss.
Obviously we could also use atomic ops to increment the value, but I'd
rather not add all those atomic operations, even if it's on uncontended
cachelines. It'd allow us to reset the backend values more easily by
just swapping in a 0, which we can't do if the backend increments
non-atomically. But I think we could instead just have one global "bias"
value to implement resets (by subtracting that from the summarized
value, and storing the current sum when resetting). Or use the new
global barrier to trigger a reset. Or something similar.
Greetings,
Andres Freund
Hello.
At Sun, 26 Jan 2020 12:22:03 -0800, Andres Freund <andres@anarazel.de> wrote in
> Hi,

I feel the same about the specific issues raised upthread.
> On 2020-01-26 16:20:03 +0100, Magnus Hagander wrote:
> > On Sun, Jan 26, 2020 at 1:44 AM Andres Freund <andres@anarazel.de> wrote:
> > > On 2020-01-25 15:43:41 +0100, Magnus Hagander wrote:
> > > > On Fri, Jan 24, 2020 at 8:52 PM Andres Freund <andres@anarazel.de> wrote:
> > > > > Lastly, I don't understand what the point is of sending fixed size
> > > > > stats, like the stuff underlying pg_stat_bgwriter, through pgstats
> > > > > IPC. While I don't like its architecture, we obviously need
> > > > > something like pgstat to handle variable amounts of stats
> > > > > (database, table level etc stats). But that doesn't at all apply to
> > > > > these types of global stats.
> > > >
> > > > That part has annoyed me as well a few times. +1 for just moving that
> > > > into a global shared memory. Given that we don't really care about
> > > > things being in sync between those different counters *or* if we lose
> > > > a bit of data (which the stats collector is designed to do), we could
> > > > even do that without a lock?
> > >
> > > I don't think we'd quite want to do it without any (single counter)
> > > synchronization - high concurrency setups would be pretty likely to
> > > lose values that way. I suspect the best would be to have a struct in
> > > shared memory that contains the potential counters for each potential
> > > process. And then sum them up when actually wanting the concrete
> > > value. That way we avoid unnecessary contention, in contrast to having
> > > a single shared memory value for each (which would just pingpong
> > > between different sockets and store buffers). There's a few details
> > > like how exactly to implement resetting the counters, but ...
> >
> > Right. Each process gets to do their own write, but still in shared
> > memory. But do you need to lock them when reading them (for the
> > summary)? That's the part where I figured you could just read and
> > summarize them, and accept the possible loss.
>
> Oh, yea, I'd not lock for that. On nearly all machines aligned 64bit
> integers can be read / written without a danger of torn values, and I
> don't think we need perfect cross counter accuracy. To deal with the few
> platforms without 64bit "single copy atomicity", we can just use
> pg_atomic_read/write_u64. These days (e8fdbd58fe) they automatically
> fall back to using locked operations for those platforms. So I don't
> think there's actually a danger of loss.
>
> Obviously we could also use atomic ops to increment the value, but I'd
> rather not add all those atomic operations, even if it's on uncontended
> cachelines. It'd allow us to reset the backend values more easily by
> just swapping in a 0, which we can't do if the backend increments
> non-atomically. But I think we could instead just have one global "bias"
> value to implement resets (by subtracting that from the summarized
> value, and storing the current sum when resetting). Or use the new
> global barrier to trigger a reset. Or something similar.
Fixed or global stats are a suitable starter for the shared-memory
stats collector. In the case of buffers_*_write, the global stats
entry for each process needs just 8 bytes, plus maybe an extra 8 bytes
for the bias value. I'm not sure how many counters like this there are,
but is that size of footprint acceptable? (Each backend already uses
the same amount of local memory for pgstat use, though.)
Anyway, I will do something like that as a trial, maybe by adding a
member in PgBackendStatus and one global shared value for the bias:
int64 st_progress_param[PGSTAT_NUM_PROGRESS_PARAM];
+ PgBackendStatsCounters counters;
} PgBackendStatus;
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Sun, Jan 26, 2020 at 11:21 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
> At Sun, 26 Jan 2020 12:22:03 -0800, Andres Freund <andres@anarazel.de> wrote in
> > On 2020-01-26 16:20:03 +0100, Magnus Hagander wrote:
> > > On Sun, Jan 26, 2020 at 1:44 AM Andres Freund <andres@anarazel.de> wrote:
> > > > On 2020-01-25 15:43:41 +0100, Magnus Hagander wrote:
> > > > > On Fri, Jan 24, 2020 at 8:52 PM Andres Freund <andres@anarazel.de> wrote:
> > > > > > Lastly, I don't understand what the point is of sending fixed
> > > > > > size stats, like the stuff underlying pg_stat_bgwriter, through
> > > > > > pgstats IPC. While I don't like its architecture, we obviously
> > > > > > need something like pgstat to handle variable amounts of stats
> > > > > > (database, table level etc stats). But that doesn't at all apply
> > > > > > to these types of global stats.
> > > > >
> > > > > That part has annoyed me as well a few times. +1 for just moving
> > > > > that into a global shared memory. Given that we don't really care
> > > > > about things being in sync between those different counters *or*
> > > > > if we lose a bit of data (which the stats collector is designed to
> > > > > do), we could even do that without a lock?
> > > >
> > > > I don't think we'd quite want to do it without any (single counter)
> > > > synchronization - high concurrency setups would be pretty likely to
> > > > lose values that way. I suspect the best would be to have a struct
> > > > in shared memory that contains the potential counters for each
> > > > potential process. And then sum them up when actually wanting the
> > > > concrete value. That way we avoid unnecessary contention, in
> > > > contrast to having a single shared memory value for each (which
> > > > would just pingpong between different sockets and store buffers).
> > > > There's a few details like how exactly to implement resetting the
> > > > counters, but ...
> > >
> > > Right. Each process gets to do their own write, but still in shared
> > > memory. But do you need to lock them when reading them (for the
> > > summary)? That's the part where I figured you could just read and
> > > summarize them, and accept the possible loss.
> >
> > Oh, yea, I'd not lock for that. On nearly all machines aligned 64bit
> > integers can be read / written without a danger of torn values, and I
> > don't think we need perfect cross counter accuracy. To deal with the
> > few platforms without 64bit "single copy atomicity", we can just use
> > pg_atomic_read/write_u64. These days (e8fdbd58fe) they automatically
> > fall back to using locked operations for those platforms. So I don't
> > think there's actually a danger of loss.
> >
> > Obviously we could also use atomic ops to increment the value, but I'd
> > rather not add all those atomic operations, even if it's on uncontended
> > cachelines. It'd allow us to reset the backend values more easily by
> > just swapping in a 0, which we can't do if the backend increments
> > non-atomically. But I think we could instead just have one global
> > "bias" value to implement resets (by subtracting that from the
> > summarized value, and storing the current sum when resetting). Or use
> > the new global barrier to trigger a reset. Or something similar.
>
> Fixed or global stats are a suitable starter for the shared-memory
> stats collector. In the case of buffers_*_write, the global stats
> entry for each process needs just 8 bytes, plus maybe an extra 8 bytes
> for the bias value. I'm not sure how many counters like this there are,
> but is that size of footprint acceptable? (Each backend already uses
> the same amount of local memory for pgstat use, though.)
>
> Anyway, I will do something like that as a trial, maybe by adding a
> member in PgBackendStatus and one global shared value for the bias:
>
>  	int64		st_progress_param[PGSTAT_NUM_PROGRESS_PARAM];
> +	PgBackendStatsCounters counters;
>  } PgBackendStatus;
So, I took a stab at implementing this in PgBackendStatus. The attached
patch is not quite on top of current master, so, alas, don't try and
apply it. I went to rebase today and realized I needed to make some
changes in light of e1025044cd4, however, I wanted to share this WIP so
that I could pose a few questions that I imagine will still be relevant
after I rewrite the patch.
I removed buffers_backend and buffers_backend_fsync from
pg_stat_bgwriter and have created a new view which tracks
- number of shared buffers the checkpointer and bgwriter write out
- number of shared buffers a regular backend is forced to flush
- number of extends done by a regular backend through shared buffers
- number of buffers flushed by a backend or autovacuum using a
BufferAccessStrategy which, were they not to use this strategy,
could perhaps have been avoided if a clean shared buffer was
available
- number of fsyncs done by a backend which could have been done by
checkpointer if sync queue had not been full
This view currently only tracks writes and extends that go through
shared buffers and fsyncs of shared buffers (which, AFAIK, are the only
things fsync'd through the SyncRequest machinery currently).
BufferAlloc() and SyncOneBuffer() are the main points at which the
tracking is done. I can definitely expand this, but, I want to make sure
that we are tracking the right kind of information.
num_backend_writes and num_backend_fsync were intended (though they were
not accurate) to count buffers that backends had to end up writing
themselves and fsyncs that backends had to end up doing themselves which
could have been avoided with a different configuration (or, I suppose, a
different workload/different data, etc). That is, they were meant to
tell you if checkpointer and bgwriter were keeping up and/or if the
size of shared buffers was adequate.
In implementing this counting per backend, it is easy for all types of
backends to keep track of the number of writes, extends, fsyncs, and
strategy writes they are doing. So, as recommended upthread, I have
added columns in the view for the number of writes for checkpointer and
bgwriter and others. Thus, this view becomes more than just stats on
"avoidable I/O done by backends".
So, my question is: does it make sense to track all extends -- those
done to extend the fsm and visibility map and those done when making a
new relation or index? Is that information useful? If so, is it
different from the extends done through shared buffers? Should it be
tracked separately?
Also, if we care about all of the extends, then it seems a bit annoying
to pepper the counting all over the place when it really just needs to
be done in smgrextend() -- even though maybe a stats function doesn't
belong in that API.
Another question I have is, should the number of extends be for every
single block extended or should we try to track the initiation of a set
of extends (all of those added in RelationAddExtraBlocks(), in this
case)?
When it comes to fsync counting, I only count the fsyncs counted by the
previous code — that is fsyncs done by backends themselves when the
checkpointer sync request queue was full.
I did the counting in the same place in the checkpointer code -- in
ForwardSyncRequest() -- partially because there did not seem to be
another good place to do it. register_dirty_segment() returns void (I
thought about having it return a bool indicating whether it fsync'd or
registered the fsync, which seemed alright, but mdextend(), mdwrite(),
etc. also return void), so there is no way to propagate the information
back up to the bufmgr that the process had to do its own fsync; that
would mean mucking with the md.c API. And, since the checkpointer is the
one processing these sync requests anyway, it actually seems okay to do
it in the checkpointer code.
I'm not counting fsyncs that are "unavoidable" in the sense that they
couldn't be avoided by changing settings/workload etc -- like those done
when building an index, creating a table/rewriting a table/copying a
table -- is it useful to count these? It seems like it makes the number
of "avoidable fsyncs by backends" less useful if we count the others.
Also, should we count how many fsyncs checkpointer has done (have to
check if there is already a stat for that)? Is that useful in this
context?
Of course, this view, when grown, will begin to overlap with pg_statio,
which is another consideration. What is its identity? I would find
"avoidable I/O" -- either avoidable entirely or avoidable for that
particular type of process -- to be useful.
Or maybe, it should have a more expansive mandate. Maybe it would be
useful to aggregate some of the info from pg_stat_statements at a higher
level -- like maybe shared_blks_read counted across many statements for
a period of time/context in which we expected the relation in shared
buffers becomes potentially interesting.
As for the way I have recorded strategy writes -- it is quite inelegant,
but, I wanted to make sure that I only counted a strategy write as one
in which the backend wrote out the dirty buffer from its strategy ring
but did not check if there was any clean buffer in shared buffers more
generally (so, it is *potentially* an avoidable write). I'm not sure if
this distinction is useful to anyone. I haven't done enough with
BufferAccessStrategies to know what I'd want to know about them when
developing or using Postgres. However, if I don't need to be so careful,
it will make the code much simpler (though, I'm sure I can improve the
code regardless).
As for the implementation of the counters themselves, I appreciate that
it isn't very nice to have a bunch of random members in PgBackendStatus
to count all of these writes, extends, and fsyncs. I considered whether I could
add params that were used for all command types to st_progress_param but
I haven't looked into it yet. Alternatively, I could create an array
just for these kind of stats in PgBackendStatus. Though, I imagine that
I should take a look at the changes that have been made recently to this
area and at the shared memory stats patch.
Oh, also, there should be a way to reset the stats, especially if we add
more extends and fsyncs that happen at the time of relation/index
creation. I, at least, would find it useful to see these numbers once
the database is at some kind of steady state.
Oh and src/test/regress/sql/stats.sql will fail and, of course, I don't
intend to add that SELECT from the view to regress, it was just for
testing purposes to make sure the view was working.
-- Melanie
Attachments:
v1-0001-Add-system-view-tracking-shared-buffers-written.patch (application/octet-stream)
From 434c1ddf6a37d2ed9f6f93fa6d17c1eb934b0a85 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 16 Mar 2021 11:45:50 -0400
Subject: [PATCH v1] Add system view tracking shared buffers written
Add a system view which tracks
- number of shared buffers the checkpointer and bgwriter write out
- number of shared buffers a regular backend is forced to flush
- number of extends done by a regular backend through shared buffers
- number of buffers flushed by a backend or autovacuum using a
BufferAccessStrategy which, were they not to use this strategy, could
perhaps have been avoided if a clean shared buffer was available
- number of fsyncs done by a backend which could have been done by
checkpointer if sync queue had not been full
All backends, on exit, will update a shared memory array with the
buffers they wrote or extended.
When the view is queried, add all live backend's statuses
to the totals in the shared memory array and return that as the full
total.
TODO:
- Some kind of test?
- Docs change
---
src/backend/catalog/system_views.sql | 14 +-
src/backend/postmaster/checkpointer.c | 17 +-
src/backend/postmaster/pgstat.c | 234 +++++++++++++++++++++++++-
src/backend/storage/buffer/bufmgr.c | 77 +++++++++
src/backend/storage/buffer/freelist.c | 29 +++-
src/backend/storage/ipc/ipci.c | 1 +
src/backend/storage/smgr/smgr.c | 2 +-
src/backend/utils/adt/pgstatfuncs.c | 54 +++++-
src/include/catalog/pg_proc.dat | 18 +-
src/include/miscadmin.h | 24 +++
src/include/pgstat.h | 13 +-
src/include/storage/buf_internals.h | 3 +
src/test/regress/expected/rules.out | 11 +-
src/test/regress/sql/stats.sql | 1 +
14 files changed, 453 insertions(+), 45 deletions(-)
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 5f2541d316..238d0ed7db 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1062,8 +1062,6 @@ CREATE VIEW pg_stat_bgwriter AS
pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint,
pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean,
pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
- pg_stat_get_buf_written_backend() AS buffers_backend,
- pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
@@ -1080,6 +1078,18 @@ CREATE VIEW pg_stat_wal AS
w.stats_reset
FROM pg_stat_get_wal() w;
+CREATE VIEW pg_stat_buffers_written AS
+ SELECT
+ b.buffers_autovacuum_write,
+ b.buffers_autovacuum_write_strat,
+ b.buffers_backend_extend,
+ b.buffers_backend_write,
+ b.buffers_backend_write_strat,
+ b.buffers_backend_fsync,
+ b.buffers_bgwriter_write,
+ b.buffers_checkpointer_write
+FROM pg_stat_get_buffers_written() b;
+
CREATE VIEW pg_stat_progress_analyze AS
SELECT
S.pid AS pid, S.datid AS datid, D.datname AS datname,
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index e7e6a2a459..3bdfc222ba 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -127,9 +127,6 @@ typedef struct
ConditionVariable start_cv; /* signaled when ckpt_started advances */
ConditionVariable done_cv; /* signaled when ckpt_done advances */
- uint32 num_backend_writes; /* counts user backend buffer writes */
- uint32 num_backend_fsync; /* counts user backend fsync calls */
-
int num_requests; /* current # of requests */
int max_requests; /* allocated array size */
CheckpointerRequest requests[FLEXIBLE_ARRAY_MEMBER];
@@ -1092,10 +1089,6 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
- /* Count all backend writes regardless of if they fit in the queue */
- if (!AmBackgroundWriterProcess())
- CheckpointerShmem->num_backend_writes++;
-
/*
* If the checkpointer isn't running or the request queue is full, the
* backend will have to perform its own fsync request. But before forcing
@@ -1109,8 +1102,9 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
* Count the subset of writes where backends have to do their own
* fsync
*/
+ /* TODO: should we count fsyncs for all types of procs? */
if (!AmBackgroundWriterProcess())
- CheckpointerShmem->num_backend_fsync++;
+ pgstat_increment_buffers_written(BA_Fsync);
LWLockRelease(CheckpointerCommLock);
return false;
}
@@ -1267,13 +1261,6 @@ AbsorbSyncRequests(void)
LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
- /* Transfer stats counts into pending pgstats message */
- BgWriterStats.m_buf_written_backend += CheckpointerShmem->num_backend_writes;
- BgWriterStats.m_buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
-
- CheckpointerShmem->num_backend_writes = 0;
- CheckpointerShmem->num_backend_fsync = 0;
-
/*
* We try to avoid holding the lock for a long time by copying the request
* array, and processing the requests after releasing the lock.
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 2f3f378e63..3c506bcc79 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -386,6 +386,10 @@ static void pgstat_recv_connstat(PgStat_MsgConn *msg, int len);
static void pgstat_recv_replslot(PgStat_MsgReplSlot *msg, int len);
static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
+static int
+pgstat_get_index_buffers_written(BufferActionType buffer_action_type,
+ BackendType backend_type);
+
/* ------------------------------------------------------------
* Public functions called from postmaster follow
* ------------------------------------------------------------
@@ -2923,6 +2927,7 @@ static PgBackendGSSStatus *BackendGssStatusBuffer = NULL;
#endif
+static int *BuffersWrittenCountersArray = NULL;
/*
* Report shared-memory space needed by CreateSharedBackendStatus.
*/
@@ -2955,6 +2960,19 @@ BackendStatusShmemSize(void)
return size;
}
+void
+CreateSharedBuffersWrittenCounters(void)
+{
+ bool found;
+ Size size = 0;
+
+ size = mul_size(sizeof(int), BuffersWrittenCountersArrayLength);
+ BuffersWrittenCountersArray = (int *)
+ ShmemInitStruct("Buffers written by various backend types", size, &found);
+ if (!found)
+ MemSet(BuffersWrittenCountersArray, 0, size);
+}
+
/*
* Initialize the shared status array and several string buffers
* during postmaster startup.
@@ -3253,6 +3271,10 @@ pgstat_bestart(void)
lbeentry.st_state = STATE_UNDEFINED;
lbeentry.st_progress_command = PROGRESS_COMMAND_INVALID;
lbeentry.st_progress_command_target = InvalidOid;
+ lbeentry.num_extends = 0;
+ lbeentry.num_writes = 0;
+ lbeentry.num_writes_strat = 0;
+ lbeentry.num_fsyncs = 0;
/*
* we don't zero st_progress_param here to save cycles; nobody should
@@ -3338,6 +3360,16 @@ pgstat_beshutdown_hook(int code, Datum arg)
beentry->st_procpid = 0; /* mark invalid */
PGSTAT_END_WRITE_ACTIVITY(beentry);
+
+ /*
+ * Because the stats tracking shared buffers written and extended do not
+ * go through the stats collector, it didn't make sense to add them to
+ * pgstat_report_stat(). At least the DatabaseId should be valid. Otherwise
+ * we can't be sure that the members were zero-initialized (TODO: is that
+ * true?)
+ */
+ if (OidIsValid(MyDatabaseId))
+ pgstat_record_dead_backend_buffers_written();
}
@@ -6928,8 +6960,6 @@ pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
globalStats.buf_written_checkpoints += msg->m_buf_written_checkpoints;
globalStats.buf_written_clean += msg->m_buf_written_clean;
globalStats.maxwritten_clean += msg->m_maxwritten_clean;
- globalStats.buf_written_backend += msg->m_buf_written_backend;
- globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
globalStats.buf_alloc += msg->m_buf_alloc;
}
@@ -7467,3 +7497,203 @@ pgstat_count_slru_truncate(int slru_idx)
{
slru_entry(slru_idx)->m_truncate += 1;
}
+
+void
+pgstat_increment_buffers_written(BufferActionType ba_type)
+{
+ volatile PgBackendStatus *beentry = MyBEEntry;
+ BackendType bt;
+
+ if (!beentry || !pgstat_track_activities)
+ return;
+ bt = beentry->st_backendType;
+ if (bt != B_CHECKPOINTER && bt != B_AUTOVAC_WORKER && bt != B_BG_WRITER && bt != B_BACKEND)
+ return;
+
+ PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
+ if (ba_type == BA_Write)
+ beentry->num_writes++;
+ else if (ba_type == BA_Extend)
+ beentry->num_extends++;
+ else if (ba_type == BA_Write_Strat)
+ beentry->num_writes_strat++;
+ else if (ba_type == BA_Fsync)
+ beentry->num_fsyncs++;
+ PGSTAT_END_WRITE_ACTIVITY(beentry);
+}
+
+
+/*
+ * Used for a single backend of a BackendType when needing its stats on the
+ * various BufferActionTypes it has done.
+ */
+/* TODO: should this be a size_t of some kind? */
+static int
+pgstat_get_index_buffers_written(BufferActionType buffer_action_type, BackendType backend_type)
+{
+ /*
+ * The index order is: BLACK HOLE - 0, buffers_autovacuum_write - 1,
+ * buffers_autovacuum_write_strat - 2, buffers_backend_extend - 3,
+ * buffers_backend_write - 4, buffers_backend_write_strat - 5,
+ * buffers_backend_fsync - 6, buffers_bgwriter_write - 7,
+ * buffers_checkpointer_write - 8
+ *
+ * This function is responsible for maintaining
+ * BuffersWrittenCountersArray in the following order [ BLACK_HOLE,
+ buffers_autovacuum_write, buffers_autovacuum_write_strat,
+ * buffers_backend_extend,buffers_backend_write,
+ * buffers_backend_write_strat,buffers_backend_fsync,
+ * buffers_bgwriter_write,buffers_checkpointer_write ]
+ *
+ * Note that if a BufferActionType is unimplemented for a particular
+ * BackendType, B_BUFFERS_WRITTEN_BLACK_HOLE is returned
+ */
+ Assert(buffer_action_type < BA_NUM_TYPES && buffer_action_type >= 0);
+
+ /* TODO: silence the compiler on -Wswitch uncovered cases */
+ switch (backend_type)
+ {
+ case B_AUTOVAC_WORKER:
+ switch (buffer_action_type)
+ {
+ case BA_Write:
+ return B_AUTOVAC_WORKER_BA_WRITE;
+ case BA_Write_Strat:
+ return B_AUTOVAC_WORKER_BA_WRITE_STRAT;
+ }
+ break;
+ case B_BACKEND:
+ switch (buffer_action_type)
+ {
+ case BA_Extend:
+ return B_BACKEND_BA_EXTEND;
+ case BA_Write:
+ return B_BACKEND_BA_WRITE;
+ case BA_Write_Strat:
+ return B_BACKEND_BA_WRITE_STRAT;
+ case BA_Fsync:
+ return B_BACKEND_BA_FSYNC;
+ }
+ break;
+ case B_BG_WRITER:
+ switch (buffer_action_type)
+ {
+ case BA_Write:
+ return B_BG_WRITER_BA_WRITE;
+ }
+ break;
+ case B_CHECKPOINTER:
+ switch (buffer_action_type)
+ {
+ case BA_Write:
+ return B_CHECKPOINTER_WRITE;
+ }
+ break;
+ default:
+
+ /*
+ * TODO: is this ERROR even a good idea? is it better to only
+ * return the black hole?
+ */
+ elog(ERROR, "unrecognized backend type %d for buffers written counting", backend_type);
+ }
+ return B_BUFFERS_WRITTEN_BLACK_HOLE;
+}
+
+/*
+ * Called for a single backend at the time of death to persist its I/O stats
+ * For now, only used by pgstat_beshutdown_hook(), however, could be of use
+ * elsewhere, so keep it public.
+ */
+void
+pgstat_record_dead_backend_buffers_written(void)
+{
+ BackendType bt;
+ volatile PgBackendStatus *beentry = MyBEEntry;
+
+ if (beentry->st_procpid != 0)
+ return;
+ bt = beentry->st_backendType;
+ if (bt != B_CHECKPOINTER && bt != B_AUTOVAC_WORKER && bt != B_BG_WRITER && bt != B_BACKEND)
+ return;
+
+ for (;;)
+ {
+ int before_changecount;
+ int after_changecount;
+
+ pgstat_begin_read_activity(beentry, before_changecount);
+
+ /*
+ * It is guaranteed that the index will be within bounds of the array
+ * because pgstat_get_index_buffers_written() only returns indexes
+ * within bounds of BuffersWrittenCountersArray
+ */
+ BuffersWrittenCountersArray[pgstat_get_index_buffers_written(BA_Write_Strat, beentry->st_backendType)] += beentry->num_writes_strat;
+ BuffersWrittenCountersArray[pgstat_get_index_buffers_written(BA_Write, beentry->st_backendType)] += beentry->num_writes;
+ BuffersWrittenCountersArray[pgstat_get_index_buffers_written(BA_Extend, beentry->st_backendType)] += beentry->num_extends;
+ BuffersWrittenCountersArray[pgstat_get_index_buffers_written(BA_Fsync, beentry->st_backendType)] += beentry->num_fsyncs;
+
+ pgstat_end_read_activity(beentry, after_changecount);
+
+ if (pgstat_read_activity_complete(before_changecount, after_changecount))
+ break;
+
+ /* Make sure we can break out of loop if stuck... */
+ CHECK_FOR_INTERRUPTS();
+ }
+}
+
+/*
+ * Input parameter, length, is the length of the values array passed in
+ * Output parameter is values, an array to be filled
+ */
+void
+pgstat_recount_all_backends_buffers_written(Datum * values, int length)
+{
+ int beid;
+ int tot_backends = pgstat_fetch_stat_numbackends();
+
+ Assert(length == BuffersWrittenCountersArrayLength);
+
+ /*
+ * Add stats from all exited backends
+ *
+ * TODO: I thought maybe it is okay to just access this lock-free since it
+ * is only written to when a process dies in
+ * pgstat_record_dead_backend_buffers_written() and is read at the time of
+ * querying the view with the stats. It's okay if we don't have 100%
+ * up-to-date stats. However, I was wondering about torn values and
+ * platforms without 64bit "single copy atomicity"
+ *
+ * Because the values array is datums and
+ * BuffersWrittenCountersArrayLength is int64s, can't do a simple memcpy
+ *
+ */
+ for (int i = 0; i < BuffersWrittenCountersArrayLength; i++)
+ values[i] += BuffersWrittenCountersArray[i];
+
+ /*
+ * Loop through all live backends and count their writes
+ */
+ for (beid = 1; beid <= tot_backends; beid++)
+ {
+ BackendType bt;
+ PgBackendStatus *beentry = pgstat_fetch_stat_beentry(beid);
+
+ if (beentry->st_procpid == 0)
+ continue;
+ bt = beentry->st_backendType;
+ if (bt != B_CHECKPOINTER && bt != B_AUTOVAC_WORKER && bt != B_BG_WRITER && bt != B_BACKEND)
+ continue;
+
+ values[pgstat_get_index_buffers_written(BA_Extend,
+ beentry->st_backendType)] += beentry->num_extends;
+ values[pgstat_get_index_buffers_written(BA_Write,
+ beentry->st_backendType)] += beentry->num_writes;
+ values[pgstat_get_index_buffers_written(BA_Write_Strat,
+ beentry->st_backendType)] += beentry->num_writes_strat;
+ values[pgstat_get_index_buffers_written(BA_Fsync,
+ beentry->st_backendType)] += beentry->num_fsyncs;
+ }
+}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 852138f9c9..f71648fa89 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -891,6 +891,11 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (isExtend)
{
+ /*
+ * Extends counted here are only those that go through shared buffers
+ */
+ pgstat_increment_buffers_written(BA_Extend);
+
/* new buffers are zero-filled */
MemSet((char *) bufBlock, 0, BLCKSZ);
/* don't set checksum for all-zero page */
@@ -1157,11 +1162,65 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (XLogNeedsFlush(lsn) &&
StrategyRejectBuffer(strategy, buf))
{
+ /*
+ * Unset the strat write flag, as we will not be writing
+ * this particular buffer from our ring out and may end
+ * up having to find a buffer from main shared buffers,
+ * which, if it is dirty, we may have to write out, which
+ * could have been prevented by checkpointing and background
+ * writing
+ */
+ StrategyUnChooseBufferFromRing(strategy);
/* Drop lock/pin and loop around for another buffer */
LWLockRelease(BufferDescriptorGetContentLock(buf));
UnpinBuffer(buf, true);
continue;
}
+
+ /*
+ * TODO: there is certainly a better way to write this
+ * logic
+ */
+
+ /*
+ * buffers_backend_write, buffers_backend_write_strat,
+ * buffers_autovacuum_write, or
+ * buffers_autovacuum_write_strat
+ */
+ /* are all incremented in the next 20 or so lines */
+
+ /*
+ * The dirty buffer that will be written out was selected
+ * from the ring and we did not bother checking the
+ * freelist or doing a clock sweep to look for a clean
+ * buffer to use, thus, this write will be counted as a
+ * strategy write -- one that may be unnecessary without a
+ * strategy
+ */
+ if (StrategyIsBufferFromRing(strategy))
+ {
+ pgstat_increment_buffers_written(BA_Write_Strat);
+ }
+
+ /*
+ * If the dirty buffer was one we grabbed from the
+ * freelist or through a clock sweep, it could have been
+ * written out by bgwriter or checkpointer, thus, we will
+ * count it as a regular write
+ */
+ else
+ pgstat_increment_buffers_written(BA_Write);
+ }
+ else
+ {
+ /*
+ * If strategy is NULL, we could only be doing a write.
+ * Extend operations will be counted in smgrextend. That
+ * is separate I/O than any flushing of dirty buffers. If
+ * we add more Backend Access Types, perhaps we will need
+ * additional checks here
+ */
+ pgstat_increment_buffers_written(BA_Write);
}
/* OK, do the I/O */
@@ -2471,6 +2530,10 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
* Pin it, share-lock it, write it. (FlushBuffer will do nothing if the
* buffer is clean by the time we've locked it.)
*/
+ /*
+ * Increment buffers_bgwriter_write and buffers_checkpointer_write
+ */
+ pgstat_increment_buffers_written(BA_Write);
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
@@ -2823,6 +2886,20 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
/*
* bufToWrite is either the shared buffer or a copy, as appropriate.
*/
+
+ /*
+ * TODO: consider that if we did not need to distinguish between a buffer
+ * flushed that was grabbed from the ring buffer and written out as part
+ * of a strategy which was not from main Shared Buffers (and thus
+ * preventable by bgwriter or checkpointer), then we could move all calls
+ * to pgstat_increment_buffers_written() here except for the one for
+ * extends, which would remain in ReadBuffer_common() before smgrextend()
+ * (unless we decide to start counting other extends). That includes the
+ * call to count buffers written by bgwriter and checkpointer which go
+ * through FlushBuffer() but not BufferAlloc(). That would make it
+ * simpler. Perhaps instead we can find somewhere else to indicate that
+ * the buffer is from the ring of buffers to reuse.
+ */
smgrwrite(reln,
buf->tag.forkNum,
buf->tag.blockNum,
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 6be80476db..4fbc7c4619 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -87,6 +87,14 @@ typedef struct BufferAccessStrategyData
*/
bool current_was_in_ring;
+ /*
+ * If we chose a buffer from this ring and end up having to write it
+ * out because it is dirty, when we actually could have found a clean
+ * buffer in either the freelist or through doing a clock sweep of
+ * shared buffers, this flag will indicate that
+ */
+ bool chose_buffer_in_ring;
+
/*
* Array of buffer numbers. InvalidBuffer (that is, zero) indicates we
* have not yet selected a buffer for this ring slot. For allocation
@@ -212,8 +220,10 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
if (strategy != NULL)
{
buf = GetBufferFromRing(strategy, buf_state);
- if (buf != NULL)
+ if (buf != NULL) {
+ StrategyChooseBufferBufferFromRing(strategy);
return buf;
+ }
}
/*
@@ -702,3 +712,20 @@ StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
return true;
}
+void
+StrategyUnChooseBufferFromRing(BufferAccessStrategy strategy)
+{
+ strategy->chose_buffer_in_ring = false;
+}
+
+void
+StrategyChooseBufferBufferFromRing(BufferAccessStrategy strategy)
+{
+ strategy->chose_buffer_in_ring = true;
+}
+
+bool
+StrategyIsBufferFromRing(BufferAccessStrategy strategy)
+{
+ return strategy->chose_buffer_in_ring;
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 3e4ec53a97..e2e86705e0 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -240,6 +240,7 @@ CreateSharedMemoryAndSemaphores(void)
InitProcGlobal();
CreateSharedProcArray();
CreateSharedBackendStatus();
+ CreateSharedBuffersWrittenCounters();
TwoPhaseShmemInit();
BackgroundWorkerShmemInit();
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 4dc24649df..a9a077af80 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -25,7 +25,7 @@
#include "storage/smgr.h"
#include "utils/hsearch.h"
#include "utils/inval.h"
-
+#include "miscadmin.h"
/*
* This struct of function pointers defines the API between smgr.c and
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 9ffbca685c..4c2a2b92b1 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1774,23 +1774,59 @@ pg_stat_get_bgwriter_stat_reset_time(PG_FUNCTION_ARGS)
}
Datum
-pg_stat_get_buf_written_backend(PG_FUNCTION_ARGS)
+pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
{
- PG_RETURN_INT64(pgstat_fetch_global()->buf_written_backend);
+ PG_RETURN_INT64(pgstat_fetch_global()->buf_alloc);
}
Datum
-pg_stat_get_buf_fsync_backend(PG_FUNCTION_ARGS)
+pg_stat_get_buffers_written(PG_FUNCTION_ARGS)
{
- PG_RETURN_INT64(pgstat_fetch_global()->buf_fsync_backend);
-}
+ TupleDesc tupdesc;
+ Datum values[BuffersWrittenCountersArrayLength];
-Datum
-pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_global()->buf_alloc);
+ /*
+ * Values will be filled from BuffersWrittenCountersArray, which has an
+ * extra spot for data that is not needed, a black hole
+ */
+ bool nulls[BuffersWrittenCountersArrayLength];
+
+ /* Initialise values and NULL flags arrays */
+ MemSet(values, 0, sizeof(values));
+ MemSet(nulls, 0, sizeof(nulls));
+
+ /* Initialise attributes information in the tuple descriptor */
+ tupdesc = CreateTemplateTupleDesc(BuffersWrittenCountersArrayLength - 1);
+ TupleDescInitEntry(tupdesc, (AttrNumber) 1, "buffers_autovacuum_write",
+ INT8OID, -1, 0);
+ TupleDescInitEntry(tupdesc, (AttrNumber) 2, "buffers_autovacuum_write_strat",
+ INT8OID, -1, 0);
+ TupleDescInitEntry(tupdesc, (AttrNumber) 3, "buffers_backend_extend",
+ INT8OID, -1, 0);
+ TupleDescInitEntry(tupdesc, (AttrNumber) 4, "buffers_backend_write",
+ INT8OID, -1, 0);
+ TupleDescInitEntry(tupdesc, (AttrNumber) 5, "buffers_backend_write_strat",
+ INT8OID, -1, 0);
+ TupleDescInitEntry(tupdesc, (AttrNumber) 6, "buffers_backend_fsync",
+ INT8OID, -1, 0);
+ TupleDescInitEntry(tupdesc, (AttrNumber) 7, "buffers_bgwriter_write",
+ INT8OID, -1, 0);
+ TupleDescInitEntry(tupdesc, (AttrNumber) 8, "buffers_checkpointer_write",
+ INT8OID, -1, 0);
+
+ BlessTupleDesc(tupdesc);
+
+ /*
+ * Fill values and NULLs. values will be filled with the number of writes
+ * by all live regular backends and relevant auxiliary backends as well as
+ * exited backends
+ */
+ pgstat_recount_all_backends_buffers_written(values, BuffersWrittenCountersArrayLength);
+ /* Returns the record as Datum */
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values + 1, nulls)));
}
+
/*
* Returns statistics of WAL activity
*/
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 69ffd0c3f4..b6c07175dc 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5366,6 +5366,15 @@
proname => 'pg_stat_get_db_numbackends', provolatile => 's',
proparallel => 'r', prorettype => 'int4', proargtypes => 'oid',
prosrc => 'pg_stat_get_db_numbackends' },
+
+{ oid => '8459', descr => 'statistics: counts of buffers written by different types of backends',
+ proname => 'pg_stat_get_buffers_written', provolatile => 's', proisstrict => 'f',
+ proparallel => 'r', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{int8,int8,int8,int8,int8,int8,int8,int8}',
+ proargmodes => '{o,o,o,o,o,o,o,o}',
+ proargnames => '{buffers_autovacuum_write,buffers_autovacuum_write_strat,buffers_backend_extend,buffers_backend_write,buffers_backend_write_strat,buffers_backend_fsync,buffers_bgwriter_write,buffers_checkpointer_write}',
+ prosrc => 'pg_stat_get_buffers_written' },
+
{ oid => '1942', descr => 'statistics: transactions committed',
proname => 'pg_stat_get_db_xact_commit', provolatile => 's',
proparallel => 'r', prorettype => 'int8', proargtypes => 'oid',
@@ -5544,15 +5553,6 @@
proname => 'pg_stat_get_checkpoint_sync_time', provolatile => 's',
proparallel => 'r', prorettype => 'float8', proargtypes => '',
prosrc => 'pg_stat_get_checkpoint_sync_time' },
-{ oid => '2775', descr => 'statistics: number of buffers written by backends',
- proname => 'pg_stat_get_buf_written_backend', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_buf_written_backend' },
-{ oid => '3063',
- descr => 'statistics: number of backend buffer writes that did their own fsync',
- proname => 'pg_stat_get_buf_fsync_backend', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_buf_fsync_backend' },
{ oid => '2859', descr => 'statistics: number of buffer allocations',
proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 013850ac28..b52fe74b72 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -326,6 +326,30 @@ typedef enum BackendType
extern BackendType MyBackendType;
+typedef enum BufferActionType
+{
+ BA_Extend,
+ BA_Write,
+ BA_Write_Strat,
+ BA_Fsync,
+ BA_NUM_TYPES,
+} BufferActionType;
+
+/* TODO: does this belong here? */
+typedef enum BuffersWrittenCountersIndex
+{
+ B_BUFFERS_WRITTEN_BLACK_HOLE = 0,
+ B_AUTOVAC_WORKER_BA_WRITE,
+ B_AUTOVAC_WORKER_BA_WRITE_STRAT,
+ B_BACKEND_BA_EXTEND,
+ B_BACKEND_BA_WRITE,
+ B_BACKEND_BA_WRITE_STRAT,
+ B_BACKEND_BA_FSYNC,
+ B_BG_WRITER_BA_WRITE,
+ B_CHECKPOINTER_WRITE,
+ BuffersWrittenCountersArrayLength,
+} BuffersWrittenCountersIndex;
+
extern const char *GetBackendTypeDesc(BackendType backendType);
extern void SetDatabasePath(const char *path);
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index fe6683cf5c..776520098e 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -456,8 +456,6 @@ typedef struct PgStat_MsgBgWriter
PgStat_Counter m_buf_written_checkpoints;
PgStat_Counter m_buf_written_clean;
PgStat_Counter m_maxwritten_clean;
- PgStat_Counter m_buf_written_backend;
- PgStat_Counter m_buf_fsync_backend;
PgStat_Counter m_buf_alloc;
PgStat_Counter m_checkpoint_write_time; /* times in milliseconds */
PgStat_Counter m_checkpoint_sync_time;
@@ -830,7 +828,6 @@ typedef struct PgStat_GlobalStats
PgStat_Counter buf_written_checkpoints;
PgStat_Counter buf_written_clean;
PgStat_Counter maxwritten_clean;
- PgStat_Counter buf_written_backend;
PgStat_Counter buf_fsync_backend;
PgStat_Counter buf_alloc;
TimestampTz stat_reset_timestamp;
@@ -1263,6 +1260,10 @@ typedef struct PgBackendStatus
*/
ProgressCommandType st_progress_command;
Oid st_progress_command_target;
+ int num_extends;
+ int num_writes;
+ int num_writes_strat;
+ int num_fsyncs;
int64 st_progress_param[PGSTAT_NUM_PROGRESS_PARAM];
} PgBackendStatus;
@@ -1411,7 +1412,7 @@ extern SessionEndType pgStatSessionEndCause;
*/
extern Size BackendStatusShmemSize(void);
extern void CreateSharedBackendStatus(void);
-
+extern void CreateSharedBuffersWrittenCounters(void);
extern void pgstat_init(void);
extern int pgstat_start(void);
extern void pgstat_reset_all(void);
@@ -1634,4 +1635,8 @@ extern void pgstat_count_slru_truncate(int slru_idx);
extern const char *pgstat_slru_name(int slru_idx);
extern int pgstat_slru_index(const char *name);
+extern void pgstat_increment_buffers_written(BufferActionType ba_type);
+extern void pgstat_record_dead_backend_buffers_written(void);
+extern void pgstat_recount_all_backends_buffers_written(Datum *values, int length);
+
#endif /* PGSTAT_H */
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 33fcaf5c9a..2bec2cee45 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -314,6 +314,9 @@ extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
extern void StrategyFreeBuffer(BufferDesc *buf);
extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
BufferDesc *buf);
+extern void StrategyUnChooseBufferFromRing(BufferAccessStrategy strategy);
+extern void StrategyChooseBufferBufferFromRing(BufferAccessStrategy strategy);
+extern bool StrategyIsBufferFromRing(BufferAccessStrategy strategy);
extern int StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
extern void StrategyNotifyBgWriter(int bgwprocno);
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 9b59a7b4a5..a3d5e60e74 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1823,10 +1823,17 @@ pg_stat_bgwriter| SELECT pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints
pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint,
pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean,
pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
- pg_stat_get_buf_written_backend() AS buffers_backend,
- pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+pg_stat_buffers_written| SELECT b.buffers_autovacuum_write,
+ b.buffers_autovacuum_write_strat,
+ b.buffers_backend_extend,
+ b.buffers_backend_write,
+ b.buffers_backend_write_strat,
+ b.buffers_backend_fsync,
+ b.buffers_bgwriter_write,
+ b.buffers_checkpointer_write
+ FROM pg_stat_get_buffers_written() b(buffers_autovacuum_write, buffers_autovacuum_write_strat, buffers_backend_extend, buffers_backend_write, buffers_backend_write_strat, buffers_backend_fsync, buffers_bgwriter_write, buffers_checkpointer_write);
pg_stat_database| SELECT d.oid AS datid,
d.datname,
CASE
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index feaaee6326..737b813c15 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -176,4 +176,5 @@ FROM prevstats AS pr;
DROP TABLE trunc_stats_test, trunc_stats_test1, trunc_stats_test2, trunc_stats_test3, trunc_stats_test4;
DROP TABLE prevstats;
+SELECT * FROM pg_stat_buffers_written;
-- End of Stats Test
--
2.25.0
Hi,
On 2021-04-12 19:49:36 -0700, Melanie Plageman wrote:
So, I took a stab at implementing this in PgBackendStatus.
Cool!
The attached patch is not quite on top of current master, so, alas,
don't try and apply it. I went to rebase today and realized I needed
to make some changes in light of e1025044cd4, however, I wanted to
share this WIP so that I could pose a few questions that I imagine
will still be relevant after I rewrite the patch.
I removed buffers_backend and buffers_backend_fsync from
pg_stat_bgwriter and have created a new view which tracks
- number of shared buffers the checkpointer and bgwriter write out
- number of shared buffers a regular backend is forced to flush
- number of extends done by a regular backend through shared buffers
- number of buffers flushed by a backend or autovacuum using a
BufferAccessStrategy which, were they not to use this strategy,
could perhaps have been avoided if a clean shared buffer was
available
- number of fsyncs done by a backend which could have been done by
checkpointer if the sync queue had not been full
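The per-backend bookkeeping behind these counters is small; roughly like this (a plain-C sketch whose names loosely mirror the patch, not the actual PostgreSQL structures):

```c
#include <assert.h>

/* Mirrors the patch's BufferActionType; names are illustrative. */
typedef enum BufferActionType
{
	BA_Extend,
	BA_Write,
	BA_Write_Strat,
	BA_Fsync,
	BA_NUM_TYPES
} BufferActionType;

/* Stand-in for the counters the patch adds to PgBackendStatus. */
typedef struct BackendCounters
{
	int			num_extends;
	int			num_writes;
	int			num_writes_strat;
	int			num_fsyncs;
} BackendCounters;

/* Roughly what pgstat_increment_buffers_written() does per action. */
static void
count_buffer_action(BackendCounters *c, BufferActionType ba_type)
{
	switch (ba_type)
	{
		case BA_Extend:
			c->num_extends++;
			break;
		case BA_Write:
			c->num_writes++;
			break;
		case BA_Write_Strat:
			c->num_writes_strat++;
			break;
		case BA_Fsync:
			c->num_fsyncs++;
			break;
		default:
			break;
	}
}
```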
I wonder if leaving buffers_alloc in pg_stat_bgwriter makes sense after
this? I'm tempted to move that to pg_stat_buffers or such...
I'm not quite convinced by having separate columns for checkpointer,
bgwriter, etc. That doesn't seem to scale all that well. What if we
instead made it a view that has one row for each BackendType?
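Internally, a per-BackendType row layout could be as simple as a counter matrix indexed by backend type and buffer action. A hypothetical sketch (the enum values and names here are illustrative, not the real BackendType enum or the patch's flattened array):

```c
#include <assert.h>
#include <string.h>

/* Illustrative subsets of the backend types and actions of interest. */
typedef enum
{
	BT_BACKEND,
	BT_AUTOVAC,
	BT_BGWRITER,
	BT_CHECKPOINTER,
	BT_NUM
} BackendTypeIdx;

typedef enum
{
	BA_EXTEND,
	BA_WRITE,
	BA_WRITE_STRAT,
	BA_FSYNC,
	BA_NUM
} ActionIdx;

/* One counter per (backend type, action) pair. */
static long counters[BT_NUM][BA_NUM];

static void
count_action(BackendTypeIdx bt, ActionIdx ba)
{
	counters[bt][ba]++;
}

/* One "row" of the proposed view: all actions for one backend type. */
static const long *
view_row(BackendTypeIdx bt)
{
	return counters[bt];
}
```

Adding a new backend type then only grows the matrix by one row instead of adding a batch of new view columns.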
In implementing this counting per backend, it is easy for all types of
backends to keep track of the number of writes, extends, fsyncs, and
strategy writes they are doing. So, as recommended upthread, I have
added columns in the view for the number of writes for checkpointer and
bgwriter and others. Thus, this view becomes more than just stats on
"avoidable I/O done by backends".
So, my question is, does it make sense to track all extends -- those to
extend the fsm and visimap and when making a new relation or index? Is
that information useful? If so, is it different than the extends done
through shared buffers? Should it be tracked separately?
I don't fully understand what you mean with "extends done through shared
buffers"?
Another question I have is, should the number of extends be for every
single block extended or should we try to track the initiation of a set
of extends (all of those added in RelationAddExtraBlocks(), in this
case)?
I think it should be 8k blocks, i.e. RelationAddExtraBlocks() should be
tracked as many individual extends. It's implemented that way, but more
importantly, it should be in BLCKSZ units. If we later add some actually
batched operations, we can have separate stats for that.
Of course, this view, when grown, will begin to overlap with pg_statio,
which is another consideration. What is its identity? I would find
"avoidable I/O" -- either avoidable entirely, or avoidable for that
particular type of process -- to be useful.
I think it's fine to overlap with pg_statio_* - those are for individual
objects, so it seems to be expected to overlap with coarser stats.
Or maybe, it should have a more expansive mandate. Maybe it would be
useful to aggregate some of the info from pg_stat_statements at a higher
level -- like maybe shared_blks_read counted across many statements for
a period of time/context in which we expected the relation in shared
buffers becomes potentially interesting.
Let's do something more basic first...
Greetings,
Andres Freund
On Thu, Apr 15, 2021 at 7:59 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2021-04-12 19:49:36 -0700, Melanie Plageman wrote:
So, I took a stab at implementing this in PgBackendStatus.
Cool!
Just a note on v2 of the patch -- the diff for the changes I made to
pgstatfuncs.c is pretty atrocious and hard to read. I tried using a
different diff algorithm, to no avail.
The attached patch is not quite on top of current master, so, alas,
don't try and apply it. I went to rebase today and realized I needed
to make some changes in light of e1025044cd4, however, I wanted to
share this WIP so that I could pose a few questions that I imagine
will still be relevant after I rewrite the patch.
Regarding the refactor done in e1025044cd4:
Most of the functions I've added access variables in PgBackendStatus, so
I put most of them in backend_status.h/c. However, technically, these
are stats which are aggregated over time, which e1025044cd4 says should
go in pgstat.c/h. I could move some of it, but I hadn't tried to do so,
as it made a few things inconvenient, and, I wasn't sure if it was the
right thing to do anyway.
I removed buffers_backend and buffers_backend_fsync from
pg_stat_bgwriter and have created a new view which tracks
- number of shared buffers the checkpointer and bgwriter write out
- number of shared buffers a regular backend is forced to flush
- number of extends done by a regular backend through shared buffers
- number of buffers flushed by a backend or autovacuum using a
BufferAccessStrategy which, were they not to use this strategy,
could perhaps have been avoided if a clean shared buffer was
available
- number of fsyncs done by a backend which could have been done by
checkpointer if sync queue had not been full

I wonder if leaving buffers_alloc in pg_stat_bgwriter makes sense after
this? I'm tempted to move that to pg_stat_buffers or such...
I've gone ahead and moved buffers_alloc out of pg_stat_bgwriter and into
pg_stat_buffer_actions (I've renamed it from pg_stat_buffers_written).
I'm not quite convinced by having separate columns for checkpointer,
bgwriter, etc. That doesn't seem to scale all that well. What if we
instead made it a view that has one row for each BackendType?
I've changed the view to have one row for each backend type for which we
would like to report stats and one column for each buffer action type.
To make the code easier to write, I record buffer actions for all
backend types -- even if we don't have any buffer actions we care about
for that backend type. I thought it was okay because when I actually
aggregate the counters across backends, I only do so for the backend
types we care about -- thus there shouldn't be much accessing of shared
memory by multiple different processes.
Also, I copy-pasted most of the code in pg_stat_get_buffer_actions() to
set up the result tuplestore from pg_stat_get_activity() without totally
understanding all the parts of it, so I'm not sure if all of it is
required here.
In implementing this counting per backend, it is easy for all types of
backends to keep track of the number of writes, extends, fsyncs, and
strategy writes they are doing. So, as recommended upthread, I have
added columns in the view for the number of writes for checkpointer and
bgwriter and others. Thus, this view becomes more than just stats on
"avoidable I/O done by backends".So, my question is, does it makes sense to track all extends -- those to
extend the fsm and visimap and when making a new relation or index? Is
that information useful? If so, is it different than the extends done
through shared buffers? Should it be tracked separately?I don't fully understand what you mean with "extends done through shared
buffers"?
By "extends done through shared buffers", I just mean when an extend of
a relation is done and the data that will be written to the new block is
written into a shared buffer (as opposed to a local one or local memory
or a strategy buffer).
Random note:
I added a length member to the BackendType enum (BACKEND_NUM_TYPES),
which led to this compiler warning:
miscinit.c: In function ‘GetBackendTypeDesc’:
miscinit.c:236:2: warning: enumeration value ‘BACKEND_NUM_TYPES’ not
handled in switch [-Wswitch]
236 | switch (backendType)
| ^~~~~~
I tried using pg_attribute_unused() for BACKEND_NUM_TYPES, but, it
didn't seem to have the desired effect. As such, I just threw a case
into GetBackendTypeDesc() which does nothing (as opposed to erroring
out): since backendDesc is already initialized to "unknown process
type", erroring out doesn't seem to be expected.
- Melanie
Attachments:
v2-0001-Add-system-view-tracking-shared-buffer-actions.patch (text/x-patch)
From 8bfacd947641554c6f1e8a24df4950e915c2c264 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Fri, 4 Jun 2021 17:07:48 -0400
Subject: [PATCH v2] Add system view tracking shared buffer actions
Add a system view which tracks
- number of shared buffers the checkpointer and bgwriter write out
- number of shared buffers a regular backend is forced to flush
- number of extends done by a regular backend through shared buffers
- number of buffers flushed by a backend or autovacuum using a
BufferAccessStrategy which, were they not to use this strategy, could
perhaps have been avoided if a clean shared buffer was available
- number of fsyncs done by a backend which could have been done by
checkpointer if sync queue had not been full
- number of buffers allocated by a regular backend or autovacuum worker
for either a new block or an existing block of a relation which is not
currently in a buffer
All of these stats which were in the system view pg_stat_bgwriter have
been removed from that view.
All backends, on exit, will update a shared memory array with the
buffers they wrote or extended.
When the view is queried, add all live backends' statuses
to the totals in the shared memory array and return that as the full
total.
Each row of the view is for a particular backend type and each column is
the number of a particular kind of buffer action taken by the various
backends.
TODO:
- Some kind of test?
- Docs change
---
src/backend/catalog/system_views.sql | 14 +-
src/backend/postmaster/checkpointer.c | 27 +---
src/backend/postmaster/pgstat.c | 3 -
src/backend/storage/buffer/bufmgr.c | 73 +++++++++-
src/backend/storage/buffer/freelist.c | 37 +++--
src/backend/storage/ipc/ipci.c | 1 +
src/backend/utils/activity/backend_status.c | 144 ++++++++++++++++++++
src/backend/utils/adt/pgstatfuncs.c | 53 +++++--
src/backend/utils/init/miscinit.c | 2 +
src/include/catalog/pg_proc.dat | 21 ++-
src/include/miscadmin.h | 12 ++
src/include/pgstat.h | 6 -
src/include/storage/buf_internals.h | 3 +
src/include/utils/backend_status.h | 9 +-
src/test/regress/expected/rules.out | 10 +-
src/test/regress/sql/stats.sql | 1 +
16 files changed, 341 insertions(+), 75 deletions(-)
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 999d984068..e77ed42352 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1067,9 +1067,6 @@ CREATE VIEW pg_stat_bgwriter AS
pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint,
pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean,
pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
- pg_stat_get_buf_written_backend() AS buffers_backend,
- pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
- pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
CREATE VIEW pg_stat_wal AS
@@ -1085,6 +1082,17 @@ CREATE VIEW pg_stat_wal AS
w.stats_reset
FROM pg_stat_get_wal() w;
+CREATE VIEW pg_stat_buffer_actions AS
+SELECT
+ b.backend_type,
+ b.buffers_alloc,
+ b.buffers_extend,
+ b.buffers_fsync,
+ b.buffers_write,
+ b.buffers_write_strat
+FROM pg_stat_get_buffer_actions() b;
+
+
CREATE VIEW pg_stat_progress_analyze AS
SELECT
S.pid AS pid, S.datid AS datid, D.datname AS datname,
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 75a95f3de7..8eecd35965 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -93,17 +93,8 @@
* requesting backends since the last checkpoint start. The flags are
* chosen so that OR'ing is the correct way to combine multiple requests.
*
- * num_backend_writes is used to count the number of buffer writes performed
- * by user backend processes. This counter should be wide enough that it
- * can't overflow during a single processing cycle. num_backend_fsync
- * counts the subset of those writes that also had to do their own fsync,
- * because the checkpointer failed to absorb their request.
- *
* The requests array holds fsync requests sent by backends and not yet
* absorbed by the checkpointer.
- *
- * Unlike the checkpoint fields, num_backend_writes, num_backend_fsync, and
- * the requests fields are protected by CheckpointerCommLock.
*----------
*/
typedef struct
@@ -127,9 +118,6 @@ typedef struct
ConditionVariable start_cv; /* signaled when ckpt_started advances */
ConditionVariable done_cv; /* signaled when ckpt_done advances */
- uint32 num_backend_writes; /* counts user backend buffer writes */
- uint32 num_backend_fsync; /* counts user backend fsync calls */
-
int num_requests; /* current # of requests */
int max_requests; /* allocated array size */
CheckpointerRequest requests[FLEXIBLE_ARRAY_MEMBER];
@@ -1092,10 +1080,6 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
- /* Count all backend writes regardless of if they fit in the queue */
- if (!AmBackgroundWriterProcess())
- CheckpointerShmem->num_backend_writes++;
-
/*
* If the checkpointer isn't running or the request queue is full, the
* backend will have to perform its own fsync request. But before forcing
@@ -1109,8 +1093,10 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
* Count the subset of writes where backends have to do their own
* fsync
*/
+ /* TODO: should we count fsyncs for all types of procs? */
if (!AmBackgroundWriterProcess())
- CheckpointerShmem->num_backend_fsync++;
+ pgstat_increment_buffer_action(BA_Fsync);
+
LWLockRelease(CheckpointerCommLock);
return false;
}
@@ -1267,13 +1253,6 @@ AbsorbSyncRequests(void)
LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
- /* Transfer stats counts into pending pgstats message */
- BgWriterStats.m_buf_written_backend += CheckpointerShmem->num_backend_writes;
- BgWriterStats.m_buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
-
- CheckpointerShmem->num_backend_writes = 0;
- CheckpointerShmem->num_backend_fsync = 0;
-
/*
* We try to avoid holding the lock for a long time by copying the request
* array, and processing the requests after releasing the lock.
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index b0d07c0e0b..7396202da2 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -5352,9 +5352,6 @@ pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
globalStats.buf_written_checkpoints += msg->m_buf_written_checkpoints;
globalStats.buf_written_clean += msg->m_buf_written_clean;
globalStats.maxwritten_clean += msg->m_maxwritten_clean;
- globalStats.buf_written_backend += msg->m_buf_written_backend;
- globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
- globalStats.buf_alloc += msg->m_buf_alloc;
}
/* ----------
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 4b296a22c4..4c753c1e02 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -969,6 +969,11 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (isExtend)
{
+ /*
+ * Extends counted here are only those that go through shared buffers
+ */
+ pgstat_increment_buffer_action(BA_Extend);
+
/* new buffers are zero-filled */
MemSet((char *) bufBlock, 0, BLCKSZ);
/* don't set checksum for all-zero page */
@@ -1235,11 +1240,60 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (XLogNeedsFlush(lsn) &&
StrategyRejectBuffer(strategy, buf))
{
+ /*
+ * Unset the strat write flag, as we will not be writing
+ * this particular buffer from our ring out and may end
+ * up having to find a buffer from main shared buffers,
+ * which, if it is dirty, we may have to write out, which
+ * could have been prevented by checkpointing and background
+ * writing
+ */
+ StrategyUnChooseBufferFromRing(strategy);
+
/* Drop lock/pin and loop around for another buffer */
LWLockRelease(BufferDescriptorGetContentLock(buf));
UnpinBuffer(buf, true);
continue;
}
+
+ /*
+ * TODO: there is certainly a better way to write this
+ * logic
+ */
+
+ /*
+ * The dirty buffer that will be written out was selected
+ * from the ring and we did not bother checking the
+ * freelist or doing a clock sweep to look for a clean
+ * buffer to use, thus, this write will be counted as a
+ * strategy write -- one that may be unnecessary without a
+ * strategy
+ */
+ if (StrategyIsBufferFromRing(strategy))
+ {
+ pgstat_increment_buffer_action(BA_Write_Strat);
+ }
+
+ /*
+ * If the dirty buffer was one we grabbed from the
+ * freelist or through a clock sweep, it could have been
+ * written out by bgwriter or checkpointer, thus, we will
+ * count it as a regular write
+ */
+ else
+ pgstat_increment_buffer_action(BA_Write);
+ }
+ else
+ {
+ /*
+ * If strategy is NULL, we could only be doing a write.
+ * Extend operations will be counted in smgrextend. That
+ * is separate I/O than any flushing of dirty buffers. If
+ * we add more Backend Access Types, perhaps we will need
+ * additional checks here
+ */
+ pgstat_increment_buffer_action(BA_Write);
+
}
/* OK, do the I/O */
@@ -2252,9 +2306,6 @@ BgBufferSync(WritebackContext *wb_context)
*/
strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
- /* Report buffer alloc counts to pgstat */
- BgWriterStats.m_buf_alloc += recent_alloc;
-
/*
* If we're not running the LRU scan, just stop after doing the stats
* stuff. We mark the saved state invalid so that we can recover sanely
@@ -2549,6 +2600,8 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
* Pin it, share-lock it, write it. (FlushBuffer will do nothing if the
* buffer is clean by the time we've locked it.)
*/
+ pgstat_increment_buffer_action(BA_Write);
+
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
@@ -2901,6 +2954,20 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
/*
* bufToWrite is either the shared buffer or a copy, as appropriate.
*/
+
+ /*
+ * TODO: consider that if we did not need to distinguish between a buffer
+ * flushed that was grabbed from the ring buffer and written out as part
+ * of a strategy which was not from main Shared Buffers (and thus
+ * preventable by bgwriter or checkpointer), then we could move all calls
+ * to pgstat_increment_buffer_action() here except for the one for
+ * extends, which would remain in ReadBuffer_common() before smgrextend()
+ * (unless we decide to start counting other extends). That includes the
+ * call to count buffers written by bgwriter and checkpointer which go
+ * through FlushBuffer() but not BufferAlloc(). That would make it
+ * simpler. Perhaps instead we can find somewhere else to indicate that
+ * the buffer is from the ring of buffers to reuse.
+ */
smgrwrite(reln,
buf->tag.forkNum,
buf->tag.blockNum,
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 6be80476db..523b024992 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -19,6 +19,7 @@
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/proc.h"
+#include "utils/backend_status.h"
#define INT_ACCESS_ONCE(var) ((int)(*((volatile int *)&(var))))
@@ -51,7 +52,6 @@ typedef struct
* overflow during a single bgwriter cycle.
*/
uint32 completePasses; /* Complete cycles of the clock sweep */
- pg_atomic_uint32 numBufferAllocs; /* Buffers allocated since last reset */
/*
* Bgworker process to be notified upon activity or -1 if none. See
@@ -86,6 +86,13 @@ typedef struct BufferAccessStrategyData
* ring already.
*/
bool current_was_in_ring;
+ /*
+ * If we chose a buffer from this ring and we end up having to write
+ * it out because it is dirty when we actually could have found a clean
+ * buffer in either the freelist or through doing a clock sweep of shared
+ * buffers, this flag will indicate that
+ */
+ bool chose_buffer_in_ring;
/*
* Array of buffer numbers. InvalidBuffer (that is, zero) indicates we
@@ -213,7 +220,10 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
{
buf = GetBufferFromRing(strategy, buf_state);
if (buf != NULL)
+ {
+ StrategyChooseBufferBufferFromRing(strategy);
return buf;
+ }
}
/*
@@ -247,7 +257,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
* the rate of buffer consumption. Note that buffers recycled by a
* strategy object are intentionally not counted here.
*/
- pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);
+ pgstat_increment_buffer_action(BA_Alloc);
/*
* First check, without acquiring the lock, whether there's buffers in the
@@ -411,11 +421,6 @@ StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
*/
*complete_passes += nextVictimBuffer / NBuffers;
}
-
- if (num_buf_alloc)
- {
- *num_buf_alloc = pg_atomic_exchange_u32(&StrategyControl->numBufferAllocs, 0);
- }
SpinLockRelease(&StrategyControl->buffer_strategy_lock);
return result;
}
@@ -517,7 +522,6 @@ StrategyInitialize(bool init)
/* Clear statistics */
StrategyControl->completePasses = 0;
- pg_atomic_init_u32(&StrategyControl->numBufferAllocs, 0);
/* No pending notification */
StrategyControl->bgwprocno = -1;
@@ -702,3 +706,20 @@ StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
return true;
}
+void
+StrategyUnChooseBufferFromRing(BufferAccessStrategy strategy)
+{
+ strategy->chose_buffer_in_ring = false;
+}
+
+void
+StrategyChooseBufferBufferFromRing(BufferAccessStrategy strategy)
+{
+ strategy->chose_buffer_in_ring = true;
+}
+
+bool
+StrategyIsBufferFromRing(BufferAccessStrategy strategy)
+{
+ return strategy->chose_buffer_in_ring;
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 3e4ec53a97..c662853423 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -240,6 +240,7 @@ CreateSharedMemoryAndSemaphores(void)
InitProcGlobal();
CreateSharedProcArray();
CreateSharedBackendStatus();
+ CreateBufferActionStatsCounters();
TwoPhaseShmemInit();
BackgroundWorkerShmemInit();
diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index 2901f9f5a9..97dac2d41f 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -55,6 +55,7 @@ static char *BackendAppnameBuffer = NULL;
static char *BackendClientHostnameBuffer = NULL;
static char *BackendActivityBuffer = NULL;
static Size BackendActivityBufferSize = 0;
+static int *BufferActionStatsArray = NULL;
#ifdef USE_SSL
static PgBackendSSLStatus *BackendSslStatusBuffer = NULL;
#endif
@@ -75,6 +76,7 @@ static MemoryContext backendStatusSnapContext;
static void pgstat_beshutdown_hook(int code, Datum arg);
static void pgstat_read_current_status(void);
static void pgstat_setup_backend_status_context(void);
+static void pgstat_record_dead_backend_buffer_actions(void);
/*
@@ -236,6 +238,22 @@ CreateSharedBackendStatus(void)
#endif
}
+void
+CreateBufferActionStatsCounters(void)
+{
+ bool found;
+ Size size = 0;
+ int length;
+
+ length = BACKEND_NUM_TYPES * BUFFER_ACTION_NUM_TYPES;
+ size = mul_size(sizeof(int), length);
+ BufferActionStatsArray = (int *)
+ ShmemInitStruct("Buffer actions taken by each backend type", size, &found);
+ if (!found)
+ MemSet(BufferActionStatsArray, 0, size);
+}
+
+
/*
* Initialize pgstats backend activity state, and set up our on-proc-exit
* hook. Called from InitPostgres and AuxiliaryProcessMain. For auxiliary
@@ -399,6 +417,11 @@ pgstat_bestart(void)
lbeentry.st_progress_command = PROGRESS_COMMAND_INVALID;
lbeentry.st_progress_command_target = InvalidOid;
lbeentry.st_query_id = UINT64CONST(0);
+ lbeentry.num_allocs = 0;
+ lbeentry.num_extends = 0;
+ lbeentry.num_fsyncs = 0;
+ lbeentry.num_writes = 0;
+ lbeentry.num_writes_strat = 0;
/*
* we don't zero st_progress_param here to save cycles; nobody should
@@ -469,6 +492,16 @@ pgstat_beshutdown_hook(int code, Datum arg)
beentry->st_procpid = 0; /* mark invalid */
PGSTAT_END_WRITE_ACTIVITY(beentry);
+
+ /*
+ * Because the stats tracking shared buffers written and extended do not
+ * go through the stats collector, it didn't make sense to add them to
+ * pgstat_report_stat(). At least the DatabaseId should be valid. Otherwise
+ * we can't be sure that the members were zero-initialized (TODO: is that
+ * true?)
+ */
+ if (OidIsValid(MyDatabaseId))
+ pgstat_record_dead_backend_buffer_actions();
}
/*
@@ -1041,6 +1074,117 @@ pgstat_get_my_query_id(void)
*/
return MyBEEntry->st_query_id;
}
+void
+pgstat_increment_buffer_action(BufferActionType ba_type)
+{
+ volatile PgBackendStatus *beentry = MyBEEntry;
+
+ if (!beentry || !pgstat_track_activities)
+ return;
+
+ PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
+ if (ba_type == BA_Alloc)
+ beentry->num_allocs++;
+ else if (ba_type == BA_Extend)
+ beentry->num_extends++;
+ else if (ba_type == BA_Fsync)
+ beentry->num_fsyncs++;
+ else if (ba_type == BA_Write)
+ beentry->num_writes++;
+ else if (ba_type == BA_Write_Strat)
+ beentry->num_writes_strat++;
+ PGSTAT_END_WRITE_ACTIVITY(beentry);
+}
+
+/*
+ * Called for a single backend at the time of death to persist its I/O stats
+ */
+void
+pgstat_record_dead_backend_buffer_actions(void)
+{
+ BackendType bt;
+ volatile PgBackendStatus *beentry = MyBEEntry;
+
+ if (beentry->st_procpid != 0)
+ return;
+ bt = beentry->st_backendType;
+
+ for (;;)
+ {
+ int before_changecount;
+ int after_changecount;
+
+ pgstat_begin_read_activity(beentry, before_changecount);
+ BufferActionStatsArray[(bt * BUFFER_ACTION_NUM_TYPES) + BA_Alloc] += beentry->num_allocs;
+ BufferActionStatsArray[(bt * BUFFER_ACTION_NUM_TYPES) + BA_Extend] += beentry->num_extends;
+ BufferActionStatsArray[(bt * BUFFER_ACTION_NUM_TYPES) + BA_Fsync] += beentry->num_fsyncs;
+ BufferActionStatsArray[(bt * BUFFER_ACTION_NUM_TYPES) + BA_Write] += beentry->num_writes;
+ BufferActionStatsArray[(bt * BUFFER_ACTION_NUM_TYPES) + BA_Write_Strat] += beentry->num_writes_strat;
+ pgstat_end_read_activity(beentry, after_changecount);
+
+ if (pgstat_read_activity_complete(before_changecount, after_changecount))
+ break;
+
+ /* Make sure we can break out of loop if stuck... */
+ CHECK_FOR_INTERRUPTS();
+ }
+}
+
+/*
+ * Fill the provided values array with the accumulated counts of buffer actions
+ * taken by all backends of type backend_type (input parameter), both alive and
+ * dead. This is currently only used by pg_stat_get_buffer_actions() to create
+ * the rows in the pg_stat_buffer_actions system view.
+ */
+void
+pgstat_recount_all_buffer_actions(BackendType backend_type, Datum * values)
+{
+ int beid;
+ int tot_backends = pgstat_fetch_stat_numbackends();
+
+ /*
+ * Add stats from all exited backends
+ *
+ * TODO: I thought maybe it is okay to just access this lock-free since it
+ * is only written to when a process dies in
+ * pgstat_record_dead_backend_buffer_actions() and is read at the time of
+ * querying the view with the stats. It's okay if we don't have 100%
+ * up-to-date stats. However, I was wondering about torn values and
+ * platforms without 64bit "single copy atomicity"
+ *
+ * Because the values array is datums and
+ * BufferActionStatsArray is int64s, can't do a simple memcpy
+ *
+ */
+ values[BA_Alloc] = BufferActionStatsArray[(backend_type * BUFFER_ACTION_NUM_TYPES) + BA_Alloc];
+ values[BA_Extend] = BufferActionStatsArray[(backend_type * BUFFER_ACTION_NUM_TYPES) + BA_Extend];
+ values[BA_Fsync] = BufferActionStatsArray[(backend_type * BUFFER_ACTION_NUM_TYPES) + BA_Fsync];
+ values[BA_Write] = BufferActionStatsArray[(backend_type * BUFFER_ACTION_NUM_TYPES) + BA_Write];
+ values[BA_Write_Strat] = BufferActionStatsArray[(backend_type * BUFFER_ACTION_NUM_TYPES) + BA_Write_Strat];
+
+ /*
+ * Loop through all live backends and count their buffer actions
+ */
+ // TODO: is there a more efficient way to do this, since we will potentially loop
+ // through all backends for each backend type
+ for (beid = 1; beid <= tot_backends; beid++)
+ {
+ BackendType bt;
+ PgBackendStatus *beentry = pgstat_fetch_stat_beentry(beid);
+
+ if (beentry->st_procpid == 0)
+ continue;
+ bt = beentry->st_backendType;
+ if (bt != backend_type)
+ continue;
+
+ values[BA_Alloc] += beentry->num_allocs;
+ values[BA_Extend] += beentry->num_extends;
+ values[BA_Fsync] += beentry->num_fsyncs;
+ values[BA_Write] += beentry->num_writes;
+ values[BA_Write_Strat] += beentry->num_writes_strat;
+ }
+}
/* ----------
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 14056f5347..e0dade2c52 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1780,21 +1780,50 @@ pg_stat_get_bgwriter_stat_reset_time(PG_FUNCTION_ARGS)
}
Datum
-pg_stat_get_buf_written_backend(PG_FUNCTION_ARGS)
+pg_stat_get_buffer_actions(PG_FUNCTION_ARGS)
{
- PG_RETURN_INT64(pgstat_fetch_global()->buf_written_backend);
-}
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ TupleDesc tupdesc;
+ Tuplestorestate *tupstore;
+ MemoryContext per_query_ctx;
+ MemoryContext oldcontext;
-Datum
-pg_stat_get_buf_fsync_backend(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_global()->buf_fsync_backend);
-}
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
-Datum
-pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_global()->buf_alloc);
+ per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+ oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+ tupstore = tuplestore_begin_heap(true, false, work_mem);
+ rsinfo->returnMode = SFRM_Materialize;
+ rsinfo->setResult = tupstore;
+ rsinfo->setDesc = tupdesc;
+
+ MemoryContextSwitchTo(oldcontext);
+ for (size_t i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ /*
+ * Currently, the only supported backend types for stats are the following.
+ * If this were to change, pg_proc.dat would need to be changed as well
+ * to reflect the new expected number of rows.
+ */
+ Datum values[BUFFER_ACTION_NUM_TYPES];
+ bool nulls[BUFFER_ACTION_NUM_TYPES];
+ if (!(i == B_BG_WRITER || i == B_CHECKPOINTER || i == B_AUTOVAC_WORKER || i == B_BACKEND))
+ continue;
+
+ MemSet(values, 0, sizeof(values));
+ MemSet(nulls, 0, sizeof(nulls));
+
+ values[0] = CStringGetTextDatum(GetBackendTypeDesc(i));
+ pgstat_recount_all_buffer_actions(i, values);
+ tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+ }
+ /* clean up and return the tuplestore */
+ tuplestore_donestoring(tupstore);
+
+ return (Datum) 0;
}
/*
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 8b73850d0d..d0923407ff 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -277,6 +277,8 @@ GetBackendTypeDesc(BackendType backendType)
case B_LOGGER:
backendDesc = "logger";
break;
+ case BACKEND_NUM_TYPES:
+ break;
}
return backendDesc;
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index acbcae4607..7fc3711d52 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5561,18 +5561,15 @@
proname => 'pg_stat_get_checkpoint_sync_time', provolatile => 's',
proparallel => 'r', prorettype => 'float8', proargtypes => '',
prosrc => 'pg_stat_get_checkpoint_sync_time' },
-{ oid => '2775', descr => 'statistics: number of buffers written by backends',
- proname => 'pg_stat_get_buf_written_backend', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_buf_written_backend' },
-{ oid => '3063',
- descr => 'statistics: number of backend buffer writes that did their own fsync',
- proname => 'pg_stat_get_buf_fsync_backend', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_buf_fsync_backend' },
-{ oid => '2859', descr => 'statistics: number of buffer allocations',
- proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
- prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
+
+ { oid => '8459', descr => 'statistics: counts of buffer actions taken by each backend type',
+ proname => 'pg_stat_get_buffer_actions', provolatile => 's', proisstrict => 'f',
+ prorows => '4', proretset => 't',
+ proparallel => 'r', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,int8,int8,int8,int8,int8}',
+ proargmodes => '{o,o,o,o,o,o}',
+ proargnames => '{backend_type,buffers_alloc,buffers_extend,buffers_fsync,buffers_write,buffers_write_strat}',
+ prosrc => 'pg_stat_get_buffer_actions' },
{ oid => '1136', descr => 'statistics: information about WAL activity',
proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 4dc343cbc5..b31369bd7d 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -336,8 +336,20 @@ typedef enum BackendType
B_ARCHIVER,
B_STATS_COLLECTOR,
B_LOGGER,
+ BACKEND_NUM_TYPES,
} BackendType;
+typedef enum BufferActionType
+{
+ BA_Invalid = 0,
+ BA_Alloc,
+ BA_Extend,
+ BA_Fsync,
+ BA_Write,
+ BA_Write_Strat,
+ BUFFER_ACTION_NUM_TYPES,
+} BufferActionType;
+
extern BackendType MyBackendType;
extern const char *GetBackendTypeDesc(BackendType backendType);
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 9612c0a6c2..9d0c2a5e1f 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -475,9 +475,6 @@ typedef struct PgStat_MsgBgWriter
PgStat_Counter m_buf_written_checkpoints;
PgStat_Counter m_buf_written_clean;
PgStat_Counter m_maxwritten_clean;
- PgStat_Counter m_buf_written_backend;
- PgStat_Counter m_buf_fsync_backend;
- PgStat_Counter m_buf_alloc;
PgStat_Counter m_checkpoint_write_time; /* times in milliseconds */
PgStat_Counter m_checkpoint_sync_time;
} PgStat_MsgBgWriter;
@@ -854,9 +851,6 @@ typedef struct PgStat_GlobalStats
PgStat_Counter buf_written_checkpoints;
PgStat_Counter buf_written_clean;
PgStat_Counter maxwritten_clean;
- PgStat_Counter buf_written_backend;
- PgStat_Counter buf_fsync_backend;
- PgStat_Counter buf_alloc;
TimestampTz stat_reset_timestamp;
} PgStat_GlobalStats;
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 33fcaf5c9a..2bec2cee45 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -314,6 +314,9 @@ extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
extern void StrategyFreeBuffer(BufferDesc *buf);
extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
BufferDesc *buf);
+extern void StrategyUnChooseBufferFromRing(BufferAccessStrategy strategy);
+extern void StrategyChooseBufferBufferFromRing(BufferAccessStrategy strategy);
+extern bool StrategyIsBufferFromRing(BufferAccessStrategy strategy);
extern int StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
extern void StrategyNotifyBgWriter(int bgwprocno);
diff --git a/src/include/utils/backend_status.h b/src/include/utils/backend_status.h
index 8042b817df..4b35b3fe15 100644
--- a/src/include/utils/backend_status.h
+++ b/src/include/utils/backend_status.h
@@ -168,6 +168,11 @@ typedef struct PgBackendStatus
/* query identifier, optionally computed using post_parse_analyze_hook */
uint64 st_query_id;
+ int num_allocs;
+ int num_extends;
+ int num_fsyncs;
+ int num_writes;
+ int num_writes_strat;
} PgBackendStatus;
@@ -282,7 +287,7 @@ extern PGDLLIMPORT PgBackendStatus *MyBEEntry;
*/
extern Size BackendStatusShmemSize(void);
extern void CreateSharedBackendStatus(void);
-
+extern void CreateBufferActionStatsCounters(void);
/* ----------
* Functions called from backends
@@ -305,6 +310,8 @@ extern const char *pgstat_get_backend_current_activity(int pid, bool checkUser);
extern const char *pgstat_get_crashed_backend_activity(int pid, char *buffer,
int buflen);
extern uint64 pgstat_get_my_query_id(void);
+extern void pgstat_increment_buffer_action(BufferActionType ba_type);
+extern void pgstat_recount_all_buffer_actions(BackendType backend_type, Datum *values);
/* ----------
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index e5ab11275d..609ccf3b7b 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1824,10 +1824,14 @@ pg_stat_bgwriter| SELECT pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints
pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint,
pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean,
pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
- pg_stat_get_buf_written_backend() AS buffers_backend,
- pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
- pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+pg_stat_buffer_actions| SELECT b.backend_type,
+ b.buffers_alloc,
+ b.buffers_extend,
+ b.buffers_fsync,
+ b.buffers_write,
+ b.buffers_write_strat
+ FROM pg_stat_get_buffer_actions() b(backend_type, buffers_alloc, buffers_extend, buffers_fsync, buffers_write, buffers_write_strat);
pg_stat_database| SELECT d.oid AS datid,
d.datname,
CASE
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index feaaee6326..fb4b613d4b 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -176,4 +176,5 @@ FROM prevstats AS pr;
DROP TABLE trunc_stats_test, trunc_stats_test1, trunc_stats_test2, trunc_stats_test3, trunc_stats_test4;
DROP TABLE prevstats;
+SELECT * FROM pg_stat_buffer_actions;
-- End of Stats Test
--
2.27.0
On 2021-Apr-12, Melanie Plageman wrote:
As for the way I have recorded strategy writes -- it is quite inelegant,
but, I wanted to make sure that I only counted a strategy write as one
in which the backend wrote out the dirty buffer from its strategy ring
but did not check if there was any clean buffer in shared buffers more
generally (so, it is *potentially* an avoidable write). I'm not sure if
this distinction is useful to anyone. I haven't done enough with
BufferAccessStrategies to know what I'd want to know about them when
developing or using Postgres. However, if I don't need to be so careful,
it will make the code much simpler (though, I'm sure I can improve the
code regardless).
I was bitten last year by REFRESH MATERIALIZED VIEW counting its writes
via buffers_backend, and I was very surprised/confused about it. So it
seems definitely worthwhile to count writes via strategy separately.
For a DBA tuning the server configuration it is very useful.
The main thing is to *not* let these writes end up in regular
buffers_backend (or whatever you call these now). I didn't read your
patch, but the way you have described it seems okay to me.
--
Álvaro Herrera 39°49'30"S 73°17'W
On Fri, Jun 4, 2021 at 5:52 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
On 2021-Apr-12, Melanie Plageman wrote:
As for the way I have recorded strategy writes -- it is quite inelegant,
but, I wanted to make sure that I only counted a strategy write as one
in which the backend wrote out the dirty buffer from its strategy ring
but did not check if there was any clean buffer in shared buffers more
generally (so, it is *potentially* an avoidable write). I'm not sure if
this distinction is useful to anyone. I haven't done enough with
BufferAccessStrategies to know what I'd want to know about them when
developing or using Postgres. However, if I don't need to be so careful,
it will make the code much simpler (though, I'm sure I can improve the
code regardless).
I was bitten last year by REFRESH MATERIALIZED VIEW counting its writes
via buffers_backend, and I was very surprised/confused about it. So it
seems definitely worthwhile to count writes via strategy separately.
For a DBA tuning the server configuration it is very useful.
The main thing is to *not* let these writes end up in regular
buffers_backend (or whatever you call these now). I didn't read your
patch, but the way you have described it seems okay to me.
Thanks for the feedback!
I agree it makes sense to count strategy writes separately.
I thought about this some more, and I don't know if it makes sense to
only count "avoidable" strategy writes.
This would mean that a backend writing out a buffer from the strategy
ring when no clean shared buffers (as well as no clean strategy buffers)
are available would not count that write as a strategy write (even
though it is writing out a buffer from its strategy ring). But, it
obviously doesn't make sense to count it as a regular buffer being
written out. So, I plan to change this code.
On another note, I've updated the patch with more correct concurrency
control mechanisms (it had some data races and other problems
before). Now, I am using atomics for the buffer action counters, though
the code includes several #TODO questions around the correctness of what
I have now too.
I also wrapped the buffer action types in a struct to make them easier
to work with.
The most substantial missing piece of the patch right now is persisting
the data across reboots.
The two places in the code I can see to persist the buffer action stats
data are:
1) using the stats collector code (like in
pgstat_read/write_statsfiles()
2) using a before_shmem_exit() hook which writes the data structure to a
file and then read from it when making the shared memory array initially
It feels a bit weird to me to wedge the buffer action stats into the
stats collector code--since the stats collector isn't receiving and
aggregating the buffer action stats.
Also, I'm unsure how writing the buffer action stats out in
pgstat_write_statsfiles() will work, since I think that backends can
update their buffer action stats after we would have already persisted
the data from the BufferActionStatsArray -- causing us to lose those
updates.
And, I don't think I can use pgstat_read_statsfiles() since the
BufferActionStatsArray should have the data from the file as soon as the
view containing the buffer action stats can be queried. Thus, it seems
like I would need to read the file while initializing the array in
CreateBufferActionStatsCounters().
I am registering the patch for September commitfest but plan to update
the stats persistence before then (and docs, etc).
-- Melanie
Attachments:
v3-0001-Add-system-view-tracking-shared-buffer-actions.patch (text/x-patch)
From 2753bf0dc3ff54a515bc0729b51ef56b6715a703 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 2 Aug 2021 17:56:07 -0400
Subject: [PATCH v3] Add system view tracking shared buffer actions
Add a system view which tracks
- number of shared buffers the checkpointer and bgwriter write out
- number of shared buffers a regular backend is forced to flush
- number of extends done by a regular backend through shared buffers
- number of buffers flushed by a backend or autovacuum using a
BufferAccessStrategy which, were they not to use this strategy, could
perhaps have been avoided if a clean shared buffer was available
- number of fsyncs done by a backend which could have been done by
checkpointer if sync queue had not been full
- number of buffers allocated by a regular backend or autovacuum worker
for either a new block or an existing block of a relation which is not
currently in a buffer
All of these stats which were in the system view pg_stat_bgwriter have
been removed from that view.
All backends, on exit, will update a shared memory array with the
buffers they wrote or extended.
When the view is queried, add all live backends' statuses
to the totals in the shared memory array and return that as the full
total.
Each row of the view is for a particular backend type and each column is
the number of a particular kind of buffer action taken by the various
backends.
TODO:
- Some kind of test?
- Docs change
---
src/backend/catalog/system_views.sql | 14 ++-
src/backend/postmaster/checkpointer.c | 27 +---
src/backend/postmaster/pgstat.c | 3 -
src/backend/storage/buffer/bufmgr.c | 73 ++++++++++-
src/backend/storage/buffer/freelist.c | 37 ++++--
src/backend/storage/ipc/ipci.c | 1 +
src/backend/utils/activity/backend_status.c | 131 ++++++++++++++++++++
src/backend/utils/adt/pgstatfuncs.c | 55 ++++++--
src/backend/utils/init/miscinit.c | 2 +
src/include/catalog/pg_proc.dat | 21 ++--
src/include/miscadmin.h | 12 ++
src/include/pgstat.h | 6 -
src/include/storage/buf_internals.h | 3 +
src/include/utils/backend_status.h | 16 ++-
src/test/regress/expected/rules.out | 10 +-
src/test/regress/sql/stats.sql | 1 +
16 files changed, 337 insertions(+), 75 deletions(-)
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 55f6e3711d..96cac0a74e 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1067,9 +1067,6 @@ CREATE VIEW pg_stat_bgwriter AS
pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint,
pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean,
pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
- pg_stat_get_buf_written_backend() AS buffers_backend,
- pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
- pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
CREATE VIEW pg_stat_wal AS
@@ -1085,6 +1082,17 @@ CREATE VIEW pg_stat_wal AS
w.stats_reset
FROM pg_stat_get_wal() w;
+CREATE VIEW pg_stat_buffer_actions AS
+SELECT
+ b.backend_type,
+ b.buffers_alloc,
+ b.buffers_extend,
+ b.buffers_fsync,
+ b.buffers_write,
+ b.buffers_write_strat
+FROM pg_stat_get_buffer_actions() b;
+
+
CREATE VIEW pg_stat_progress_analyze AS
SELECT
S.pid AS pid, S.datid AS datid, D.datname AS datname,
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index bc9ac7ccfa..cbe4889fb6 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -90,17 +90,8 @@
* requesting backends since the last checkpoint start. The flags are
* chosen so that OR'ing is the correct way to combine multiple requests.
*
- * num_backend_writes is used to count the number of buffer writes performed
- * by user backend processes. This counter should be wide enough that it
- * can't overflow during a single processing cycle. num_backend_fsync
- * counts the subset of those writes that also had to do their own fsync,
- * because the checkpointer failed to absorb their request.
- *
* The requests array holds fsync requests sent by backends and not yet
* absorbed by the checkpointer.
- *
- * Unlike the checkpoint fields, num_backend_writes, num_backend_fsync, and
- * the requests fields are protected by CheckpointerCommLock.
*----------
*/
typedef struct
@@ -124,9 +115,6 @@ typedef struct
ConditionVariable start_cv; /* signaled when ckpt_started advances */
ConditionVariable done_cv; /* signaled when ckpt_done advances */
- uint32 num_backend_writes; /* counts user backend buffer writes */
- uint32 num_backend_fsync; /* counts user backend fsync calls */
-
int num_requests; /* current # of requests */
int max_requests; /* allocated array size */
CheckpointerRequest requests[FLEXIBLE_ARRAY_MEMBER];
@@ -1089,10 +1077,6 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
- /* Count all backend writes regardless of if they fit in the queue */
- if (!AmBackgroundWriterProcess())
- CheckpointerShmem->num_backend_writes++;
-
/*
* If the checkpointer isn't running or the request queue is full, the
* backend will have to perform its own fsync request. But before forcing
@@ -1106,8 +1090,10 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
* Count the subset of writes where backends have to do their own
* fsync
*/
+ /* TODO: should we count fsyncs for all types of procs? */
if (!AmBackgroundWriterProcess())
- CheckpointerShmem->num_backend_fsync++;
+ pgstat_increment_buffer_action(BA_Fsync);
+
LWLockRelease(CheckpointerCommLock);
return false;
}
@@ -1264,13 +1250,6 @@ AbsorbSyncRequests(void)
LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
- /* Transfer stats counts into pending pgstats message */
- BgWriterStats.m_buf_written_backend += CheckpointerShmem->num_backend_writes;
- BgWriterStats.m_buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
-
- CheckpointerShmem->num_backend_writes = 0;
- CheckpointerShmem->num_backend_fsync = 0;
-
/*
* We try to avoid holding the lock for a long time by copying the request
* array, and processing the requests after releasing the lock.
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 11702f2a80..03d8e13c3a 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -5352,9 +5352,6 @@ pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
globalStats.buf_written_checkpoints += msg->m_buf_written_checkpoints;
globalStats.buf_written_clean += msg->m_buf_written_clean;
globalStats.maxwritten_clean += msg->m_maxwritten_clean;
- globalStats.buf_written_backend += msg->m_buf_written_backend;
- globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
- globalStats.buf_alloc += msg->m_buf_alloc;
}
/* ----------
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 33d99f604a..3bfbb48b1f 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -963,6 +963,11 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (isExtend)
{
+ /*
+ * Extends counted here are only those that go through shared buffers
+ */
+ pgstat_increment_buffer_action(BA_Extend);
+
/* new buffers are zero-filled */
MemSet((char *) bufBlock, 0, BLCKSZ);
/* don't set checksum for all-zero page */
@@ -1229,11 +1234,60 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (XLogNeedsFlush(lsn) &&
StrategyRejectBuffer(strategy, buf))
{
+ /*
+ * Unset the strat write flag, as we will not be writing
+ * this particular buffer from our ring out and may end
+ * up having to find a buffer from main shared buffers,
+ * which, if it is dirty, we may have to write out, which
+ * could have been prevented by checkpointing and background
+ * writing
+ */
+ StrategyUnChooseBufferFromRing(strategy);
+
/* Drop lock/pin and loop around for another buffer */
LWLockRelease(BufferDescriptorGetContentLock(buf));
UnpinBuffer(buf, true);
continue;
}
+
+ /*
+ * TODO: there is certainly a better way to write this
+ * logic
+ */
+
+ /*
+ * The dirty buffer that will be written out was selected
+ * from the ring and we did not bother checking the
+ * freelist or doing a clock sweep to look for a clean
+ * buffer to use, thus, this write will be counted as a
+ * strategy write -- one that may be unnecessary without a
+ * strategy
+ */
+ if (StrategyIsBufferFromRing(strategy))
+ {
+ pgstat_increment_buffer_action(BA_Write_Strat);
+ }
+
+ /*
+ * If the dirty buffer was one we grabbed from the
+ * freelist or through a clock sweep, it could have been
+ * written out by bgwriter or checkpointer, thus, we will
+ * count it as a regular write
+ */
+ else
+ pgstat_increment_buffer_action(BA_Write);
+ }
+ else
+ {
+ /*
+ * If strategy is NULL, we could only be doing a write.
+ * Extend operations will be counted in smgrextend. That
+ * is separate I/O than any flushing of dirty buffers. If
+ * we add more Backend Access Types, perhaps we will need
+ * additional checks here
+ */
+ pgstat_increment_buffer_action(BA_Write);
+
}
/* OK, do the I/O */
@@ -2246,9 +2300,6 @@ BgBufferSync(WritebackContext *wb_context)
*/
strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
- /* Report buffer alloc counts to pgstat */
- BgWriterStats.m_buf_alloc += recent_alloc;
-
/*
* If we're not running the LRU scan, just stop after doing the stats
* stuff. We mark the saved state invalid so that we can recover sanely
@@ -2543,6 +2594,8 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
* Pin it, share-lock it, write it. (FlushBuffer will do nothing if the
* buffer is clean by the time we've locked it.)
*/
+ pgstat_increment_buffer_action(BA_Write);
+
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
@@ -2895,6 +2948,20 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
/*
* bufToWrite is either the shared buffer or a copy, as appropriate.
*/
+
+ /*
+ * TODO: consider that if we did not need to distinguish between a buffer
+ * flushed that was grabbed from the ring buffer and written out as part
+ * of a strategy which was not from main Shared Buffers (and thus
+ * preventable by bgwriter or checkpointer), then we could move all calls
+ * to pgstat_increment_buffer_action() here except for the one for
+ * extends, which would remain in ReadBuffer_common() before smgrextend()
+ * (unless we decide to start counting other extends). That includes the
+ * call to count buffers written by bgwriter and checkpointer which go
+ * through FlushBuffer() but not BufferAlloc(). That would make it
+ * simpler. Perhaps instead we can find somewhere else to indicate that
+ * the buffer is from the ring of buffers to reuse.
+ */
smgrwrite(reln,
buf->tag.forkNum,
buf->tag.blockNum,
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 6be80476db..523b024992 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -19,6 +19,7 @@
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/proc.h"
+#include "utils/backend_status.h"
#define INT_ACCESS_ONCE(var) ((int)(*((volatile int *)&(var))))
@@ -51,7 +52,6 @@ typedef struct
* overflow during a single bgwriter cycle.
*/
uint32 completePasses; /* Complete cycles of the clock sweep */
- pg_atomic_uint32 numBufferAllocs; /* Buffers allocated since last reset */
/*
* Bgworker process to be notified upon activity or -1 if none. See
@@ -86,6 +86,13 @@ typedef struct BufferAccessStrategyData
* ring already.
*/
bool current_was_in_ring;
+ /*
+ * If we chose a buffer from this ring and end up having to write
+ * it out because it is dirty when we actually could have found a clean
+ * buffer in either the freelist or through doing a clock sweep of shared
+ * buffers, this flag will indicate that
+ */
+ bool chose_buffer_in_ring;
/*
* Array of buffer numbers. InvalidBuffer (that is, zero) indicates we
@@ -213,7 +220,10 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
{
buf = GetBufferFromRing(strategy, buf_state);
if (buf != NULL)
+ {
+ StrategyChooseBufferBufferFromRing(strategy);
return buf;
+ }
}
/*
@@ -247,7 +257,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
* the rate of buffer consumption. Note that buffers recycled by a
* strategy object are intentionally not counted here.
*/
- pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);
+ pgstat_increment_buffer_action(BA_Alloc);
/*
* First check, without acquiring the lock, whether there's buffers in the
@@ -411,11 +421,6 @@ StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
*/
*complete_passes += nextVictimBuffer / NBuffers;
}
-
- if (num_buf_alloc)
- {
- *num_buf_alloc = pg_atomic_exchange_u32(&StrategyControl->numBufferAllocs, 0);
- }
SpinLockRelease(&StrategyControl->buffer_strategy_lock);
return result;
}
@@ -517,7 +522,6 @@ StrategyInitialize(bool init)
/* Clear statistics */
StrategyControl->completePasses = 0;
- pg_atomic_init_u32(&StrategyControl->numBufferAllocs, 0);
/* No pending notification */
StrategyControl->bgwprocno = -1;
@@ -702,3 +706,20 @@ StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
return true;
}
+void
+StrategyUnChooseBufferFromRing(BufferAccessStrategy strategy)
+{
+ strategy->chose_buffer_in_ring = false;
+}
+
+void
+StrategyChooseBufferBufferFromRing(BufferAccessStrategy strategy)
+{
+ strategy->chose_buffer_in_ring = true;
+}
+
+bool
+StrategyIsBufferFromRing(BufferAccessStrategy strategy)
+{
+ return strategy->chose_buffer_in_ring;
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 3e4ec53a97..c662853423 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -240,6 +240,7 @@ CreateSharedMemoryAndSemaphores(void)
InitProcGlobal();
CreateSharedProcArray();
CreateSharedBackendStatus();
+ CreateBufferActionStatsCounters();
TwoPhaseShmemInit();
BackgroundWorkerShmemInit();
diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index 2901f9f5a9..ec1a7d4c3a 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -55,6 +55,7 @@ static char *BackendAppnameBuffer = NULL;
static char *BackendClientHostnameBuffer = NULL;
static char *BackendActivityBuffer = NULL;
static Size BackendActivityBufferSize = 0;
+static PgBackendBufferActionStats *BufferActionStatsArray = NULL;
#ifdef USE_SSL
static PgBackendSSLStatus *BackendSslStatusBuffer = NULL;
#endif
@@ -75,6 +76,7 @@ static MemoryContext backendStatusSnapContext;
static void pgstat_beshutdown_hook(int code, Datum arg);
static void pgstat_read_current_status(void);
static void pgstat_setup_backend_status_context(void);
+static void pgstat_record_dead_backend_buffer_actions(void);
/*
@@ -236,6 +238,35 @@ CreateSharedBackendStatus(void)
#endif
}
+void
+CreateBufferActionStatsCounters(void)
+{
+ bool found;
+ Size size;
+ int i;
+ PgBackendBufferActionStats *ba_stats;
+
+ size = BACKEND_NUM_TYPES * sizeof(PgBackendBufferActionStats);
+ BufferActionStatsArray = (PgBackendBufferActionStats *)
+ ShmemInitStruct("Buffer actions taken by each backend type", size, &found);
+ if (!found)
+ MemSet(BufferActionStatsArray, 0, size);
+
+ // TODO: do I want a lock on this while initializing the members?
+ ba_stats = BufferActionStatsArray;
+ for (i = 1; i < BACKEND_NUM_TYPES; i++)
+ {
+ pg_atomic_init_u64(&ba_stats->allocs, 0);
+ pg_atomic_init_u64(&ba_stats->extends, 0);
+ pg_atomic_init_u64(&ba_stats->fsyncs, 0);
+ pg_atomic_init_u64(&ba_stats->writes, 0);
+ pg_atomic_init_u64(&ba_stats->writes_strat, 0);
+
+ ba_stats++;
+ }
+}
+
+
/*
* Initialize pgstats backend activity state, and set up our on-proc-exit
* hook. Called from InitPostgres and AuxiliaryProcessMain. For auxiliary
@@ -399,6 +430,11 @@ pgstat_bestart(void)
lbeentry.st_progress_command = PROGRESS_COMMAND_INVALID;
lbeentry.st_progress_command_target = InvalidOid;
lbeentry.st_query_id = UINT64CONST(0);
+ pg_atomic_init_u64(&lbeentry.buffer_action_stats.allocs, 0);
+ pg_atomic_init_u64(&lbeentry.buffer_action_stats.extends, 0);
+ pg_atomic_init_u64(&lbeentry.buffer_action_stats.fsyncs, 0);
+ pg_atomic_init_u64(&lbeentry.buffer_action_stats.writes, 0);
+ pg_atomic_init_u64(&lbeentry.buffer_action_stats.writes_strat, 0);
/*
* we don't zero st_progress_param here to save cycles; nobody should
@@ -469,6 +505,16 @@ pgstat_beshutdown_hook(int code, Datum arg)
beentry->st_procpid = 0; /* mark invalid */
PGSTAT_END_WRITE_ACTIVITY(beentry);
+
+ /*
+ * Because the stats tracking shared buffers written and extended do not
+ * go through the stats collector, it didn't make sense to add them to
+ * pgstat_report_stat(). At least the DatabaseId should be valid. Otherwise
+ * we can't be sure that the members were zero-initialized (TODO: is that
+ * true?)
+ */
+ if (OidIsValid(MyDatabaseId))
+ pgstat_record_dead_backend_buffer_actions();
}
/*
@@ -1041,6 +1087,91 @@ pgstat_get_my_query_id(void)
*/
return MyBEEntry->st_query_id;
}
+void
+pgstat_increment_buffer_action(BufferActionType ba_type)
+{
+ volatile PgBackendStatus *beentry = MyBEEntry;
+
+ if (!beentry || !pgstat_track_activities)
+ return;
+
+ if (ba_type == BA_Alloc)
+ pg_atomic_add_fetch_u64(&beentry->buffer_action_stats.allocs, 1);
+ else if (ba_type == BA_Extend)
+ pg_atomic_add_fetch_u64(&beentry->buffer_action_stats.extends, 1);
+ else if (ba_type == BA_Fsync)
+ pg_atomic_add_fetch_u64(&beentry->buffer_action_stats.fsyncs, 1);
+ else if (ba_type == BA_Write)
+ pg_atomic_add_fetch_u64(&beentry->buffer_action_stats.writes, 1);
+ else if (ba_type == BA_Write_Strat)
+ pg_atomic_add_fetch_u64(&beentry->buffer_action_stats.writes_strat, 1);
+}
+
+/*
+ * Called for a single backend at the time of death to persist its I/O stats
+ */
+void
+pgstat_record_dead_backend_buffer_actions(void)
+{
+ volatile PgBackendBufferActionStats *ba_stats;
+ volatile PgBackendStatus *beentry = MyBEEntry;
+
+ if (beentry->st_procpid != 0)
+ return;
+
+ // TODO: is this correct? could there be a data race? do I need a lock?
+ ba_stats = &BufferActionStatsArray[beentry->st_backendType];
+ pg_atomic_add_fetch_u64(&ba_stats->allocs, pg_atomic_read_u64(&beentry->buffer_action_stats.allocs));
+ pg_atomic_add_fetch_u64(&ba_stats->extends, pg_atomic_read_u64(&beentry->buffer_action_stats.extends));
+ pg_atomic_add_fetch_u64(&ba_stats->fsyncs, pg_atomic_read_u64(&beentry->buffer_action_stats.fsyncs));
+ pg_atomic_add_fetch_u64(&ba_stats->writes, pg_atomic_read_u64(&beentry->buffer_action_stats.writes));
+ pg_atomic_add_fetch_u64(&ba_stats->writes_strat, pg_atomic_read_u64(&beentry->buffer_action_stats.writes_strat));
+}
+
+/*
+ * Fill the provided values array with the accumulated counts of buffer actions
+ * taken by all backends of type backend_type (input parameter), both alive and
+ * dead. This is currently only used by pg_stat_get_buffer_actions() to create
+ * the rows in the pg_stat_buffer_actions system view.
+ */
+void
+pgstat_recount_all_buffer_actions(BackendType backend_type, Datum *values)
+{
+ int i;
+ volatile PgBackendStatus *beentry;
+
+ /*
+ * Add stats from all exited backends
+ */
+ values[BA_Alloc] = pg_atomic_read_u64(&BufferActionStatsArray[backend_type].allocs);
+ values[BA_Extend] = pg_atomic_read_u64(&BufferActionStatsArray[backend_type].extends);
+ values[BA_Fsync] = pg_atomic_read_u64(&BufferActionStatsArray[backend_type].fsyncs);
+ values[BA_Write] = pg_atomic_read_u64(&BufferActionStatsArray[backend_type].writes);
+ values[BA_Write_Strat] = pg_atomic_read_u64(&BufferActionStatsArray[backend_type].writes_strat);
+
+ /*
+ * Loop through all live backends and count their buffer actions
+ */
+ // TODO: see note in pg_stat_get_buffer_actions() about inefficiency of this method
+
+ beentry = BackendStatusArray;
+ for (i = 1; i <= MaxBackends; i++)
+ {
+ /* Don't count dead backends. They should already be counted */
+ if (beentry->st_procpid == 0)
+ continue;
+ if (beentry->st_backendType != backend_type)
+ continue;
+
+ values[BA_Alloc] += pg_atomic_read_u64(&beentry->buffer_action_stats.allocs);
+ values[BA_Extend] += pg_atomic_read_u64(&beentry->buffer_action_stats.extends);
+ values[BA_Fsync] += pg_atomic_read_u64(&beentry->buffer_action_stats.fsyncs);
+ values[BA_Write] += pg_atomic_read_u64(&beentry->buffer_action_stats.writes);
+ values[BA_Write_Strat] += pg_atomic_read_u64(&beentry->buffer_action_stats.writes_strat);
+
+ beentry++;
+ }
+}
/* ----------
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index f0e09eae4d..ce4d97e5a4 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1780,21 +1780,52 @@ pg_stat_get_bgwriter_stat_reset_time(PG_FUNCTION_ARGS)
}
Datum
-pg_stat_get_buf_written_backend(PG_FUNCTION_ARGS)
+pg_stat_get_buffer_actions(PG_FUNCTION_ARGS)
{
- PG_RETURN_INT64(pgstat_fetch_global()->buf_written_backend);
-}
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ TupleDesc tupdesc;
+ Tuplestorestate *tupstore;
+ MemoryContext per_query_ctx;
+ MemoryContext oldcontext;
-Datum
-pg_stat_get_buf_fsync_backend(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_global()->buf_fsync_backend);
-}
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
-Datum
-pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_global()->buf_alloc);
+ per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+ oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+ tupstore = tuplestore_begin_heap(true, false, work_mem);
+ rsinfo->returnMode = SFRM_Materialize;
+ rsinfo->setResult = tupstore;
+ rsinfo->setDesc = tupdesc;
+
+ MemoryContextSwitchTo(oldcontext);
+ // TODO: doing the loop like this means we will loop through all backends up to BACKEND_NUM_TYPES times
+ // could preallocate the values arrays and then loop through the backends once, filling in the appropriate values array
+ for (size_t i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ /*
+ * Currently, the only supported backend types for stats are the following.
+ * If this were to change, pg_proc.dat would need to be changed as well
+ * to reflect the new expected number of rows.
+ */
+ Datum values[BUFFER_ACTION_NUM_TYPES];
+ bool nulls[BUFFER_ACTION_NUM_TYPES];
+ if (!(i == B_BG_WRITER || i == B_CHECKPOINTER || i == B_AUTOVAC_WORKER || i == B_BACKEND))
+ continue;
+
+ MemSet(values, 0, sizeof(values));
+ MemSet(nulls, 0, sizeof(nulls));
+
+ values[0] = CStringGetTextDatum(GetBackendTypeDesc(i));
+ pgstat_recount_all_buffer_actions(i, values);
+ tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+ }
+ /* clean up and return the tuplestore */
+ tuplestore_donestoring(tupstore);
+
+ return (Datum) 0;
}
/*
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 8b73850d0d..d0923407ff 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -277,6 +277,8 @@ GetBackendTypeDesc(BackendType backendType)
case B_LOGGER:
backendDesc = "logger";
break;
+ case BACKEND_NUM_TYPES:
+ break;
}
return backendDesc;
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 8cd0252082..3d3a0eea3f 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5565,18 +5565,15 @@
proname => 'pg_stat_get_checkpoint_sync_time', provolatile => 's',
proparallel => 'r', prorettype => 'float8', proargtypes => '',
prosrc => 'pg_stat_get_checkpoint_sync_time' },
-{ oid => '2775', descr => 'statistics: number of buffers written by backends',
- proname => 'pg_stat_get_buf_written_backend', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_buf_written_backend' },
-{ oid => '3063',
- descr => 'statistics: number of backend buffer writes that did their own fsync',
- proname => 'pg_stat_get_buf_fsync_backend', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_buf_fsync_backend' },
-{ oid => '2859', descr => 'statistics: number of buffer allocations',
- proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
- prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
+
+ { oid => '8459', descr => 'statistics: counts of buffer actions taken by each backend type',
+ proname => 'pg_stat_get_buffer_actions', provolatile => 's', proisstrict => 'f',
+ prorows => '4', proretset => 't',
+ proparallel => 'r', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,int8,int8,int8,int8,int8}',
+ proargmodes => '{o,o,o,o,o,o}',
+ proargnames => '{backend_type,buffers_alloc,buffers_extend,buffers_fsync,buffers_write,buffers_write_strat}',
+ prosrc => 'pg_stat_get_buffer_actions' },
{ oid => '1136', descr => 'statistics: information about WAL activity',
proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 68d840d699..24d2943d9c 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -336,8 +336,20 @@ typedef enum BackendType
B_ARCHIVER,
B_STATS_COLLECTOR,
B_LOGGER,
+ BACKEND_NUM_TYPES,
} BackendType;
+typedef enum BufferActionType
+{
+ BA_Invalid = 0,
+ BA_Alloc,
+ BA_Extend,
+ BA_Fsync,
+ BA_Write,
+ BA_Write_Strat,
+ BUFFER_ACTION_NUM_TYPES,
+} BufferActionType;
+
extern BackendType MyBackendType;
extern const char *GetBackendTypeDesc(BackendType backendType);
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 9612c0a6c2..9d0c2a5e1f 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -475,9 +475,6 @@ typedef struct PgStat_MsgBgWriter
PgStat_Counter m_buf_written_checkpoints;
PgStat_Counter m_buf_written_clean;
PgStat_Counter m_maxwritten_clean;
- PgStat_Counter m_buf_written_backend;
- PgStat_Counter m_buf_fsync_backend;
- PgStat_Counter m_buf_alloc;
PgStat_Counter m_checkpoint_write_time; /* times in milliseconds */
PgStat_Counter m_checkpoint_sync_time;
} PgStat_MsgBgWriter;
@@ -854,9 +851,6 @@ typedef struct PgStat_GlobalStats
PgStat_Counter buf_written_checkpoints;
PgStat_Counter buf_written_clean;
PgStat_Counter maxwritten_clean;
- PgStat_Counter buf_written_backend;
- PgStat_Counter buf_fsync_backend;
- PgStat_Counter buf_alloc;
TimestampTz stat_reset_timestamp;
} PgStat_GlobalStats;
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 33fcaf5c9a..2bec2cee45 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -314,6 +314,9 @@ extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
extern void StrategyFreeBuffer(BufferDesc *buf);
extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
BufferDesc *buf);
+extern void StrategyUnChooseBufferFromRing(BufferAccessStrategy strategy);
+extern void StrategyChooseBufferBufferFromRing(BufferAccessStrategy strategy);
+extern bool StrategyIsBufferFromRing(BufferAccessStrategy strategy);
extern int StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
extern void StrategyNotifyBgWriter(int bgwprocno);
diff --git a/src/include/utils/backend_status.h b/src/include/utils/backend_status.h
index 8042b817df..f5360b5aff 100644
--- a/src/include/utils/backend_status.h
+++ b/src/include/utils/backend_status.h
@@ -13,6 +13,7 @@
#include "datatype/timestamp.h"
#include "libpq/pqcomm.h"
#include "miscadmin.h" /* for BackendType */
+#include "port/atomics.h"
#include "utils/backend_progress.h"
@@ -79,6 +80,15 @@ typedef struct PgBackendGSSStatus
} PgBackendGSSStatus;
+typedef struct PgBackendBufferActionStats
+{
+ pg_atomic_uint64 allocs;
+ pg_atomic_uint64 extends;
+ pg_atomic_uint64 fsyncs;
+ pg_atomic_uint64 writes;
+ pg_atomic_uint64 writes_strat;
+} PgBackendBufferActionStats;
+
/* ----------
* PgBackendStatus
@@ -168,6 +178,8 @@ typedef struct PgBackendStatus
/* query identifier, optionally computed using post_parse_analyze_hook */
uint64 st_query_id;
+ // TODO: do its members need to be atomics when in the PgBackendStatus since only this backend will write to them?
+ PgBackendBufferActionStats buffer_action_stats;
} PgBackendStatus;
@@ -282,7 +294,7 @@ extern PGDLLIMPORT PgBackendStatus *MyBEEntry;
*/
extern Size BackendStatusShmemSize(void);
extern void CreateSharedBackendStatus(void);
-
+extern void CreateBufferActionStatsCounters(void);
/* ----------
* Functions called from backends
@@ -305,6 +317,8 @@ extern const char *pgstat_get_backend_current_activity(int pid, bool checkUser);
extern const char *pgstat_get_crashed_backend_activity(int pid, char *buffer,
int buflen);
extern uint64 pgstat_get_my_query_id(void);
+extern void pgstat_increment_buffer_action(BufferActionType ba_type);
+extern void pgstat_recount_all_buffer_actions(BackendType backend_type, Datum *values);
/* ----------
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index e5ab11275d..609ccf3b7b 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1824,10 +1824,14 @@ pg_stat_bgwriter| SELECT pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints
pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint,
pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean,
pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
- pg_stat_get_buf_written_backend() AS buffers_backend,
- pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
- pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+pg_stat_buffer_actions| SELECT b.backend_type,
+ b.buffers_alloc,
+ b.buffers_extend,
+ b.buffers_fsync,
+ b.buffers_write,
+ b.buffers_write_strat
+ FROM pg_stat_get_buffer_actions() b(backend_type, buffers_alloc, buffers_extend, buffers_fsync, buffers_write, buffers_write_strat);
pg_stat_database| SELECT d.oid AS datid,
d.datname,
CASE
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index feaaee6326..fb4b613d4b 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -176,4 +176,5 @@ FROM prevstats AS pr;
DROP TABLE trunc_stats_test, trunc_stats_test1, trunc_stats_test2, trunc_stats_test3, trunc_stats_test4;
DROP TABLE prevstats;
+SELECT * FROM pg_stat_buffer_actions;
-- End of Stats Test
--
2.27.0
Hi,
On 2021-08-02 18:25:56 -0400, Melanie Plageman wrote:
Thanks for the feedback!
I agree it makes sense to count strategy writes separately.
I thought about this some more, and I don't know if it makes sense to
only count "avoidable" strategy writes.

This would mean that a backend writing out a buffer from the strategy
ring when no clean shared buffers (as well as no clean strategy buffers)
are available would not count that write as a strategy write (even
though it is writing out a buffer from its strategy ring). But, it
obviously doesn't make sense to count it as a regular buffer being
written out. So, I plan to change this code.
What do you mean with "no clean shared buffers ... are available"?
The most substantial missing piece of the patch right now is persisting
the data across reboots.

The two places in the code I can see to persist the buffer action stats
data are:
1) using the stats collector code (like in
pgstat_read/write_statsfiles()
2) using a before_shmem_exit() hook which writes the data structure to a
file and then read from it when making the shared memory array initially
I think it's pretty clear that we should go for 1. Having two mechanisms for
persisting stats data is a bad idea.
Also, I'm unsure how writing the buffer action stats out in
pgstat_write_statsfiles() will work, since I think that backends can
update their buffer action stats after we would have already persisted
the data from the BufferActionStatsArray -- causing us to lose those
updates.
I was thinking it'd work differently. Whenever a connection ends, it reports
its data up to pgstats.c (otherwise we'd lose those stats). By the time
shutdown happens, they all need to have already reported their stats - so
we don't need to do anything to get the data to pgstats.c during shutdown
time.
And, I don't think I can use pgstat_read_statsfiles() since the
BufferActionStatsArray should have the data from the file as soon as the
view containing the buffer action stats can be queried. Thus, it seems
like I would need to read the file while initializing the array in
CreateBufferActionStatsCounters().
Why would backends need to read that data back?
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 55f6e3711d..96cac0a74e 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1067,9 +1067,6 @@ CREATE VIEW pg_stat_bgwriter AS
         pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint,
         pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean,
         pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
-        pg_stat_get_buf_written_backend() AS buffers_backend,
-        pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
-        pg_stat_get_buf_alloc() AS buffers_alloc,
         pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
Material for a separate patch, not this. But if we're going to break
monitoring queries anyway, I think we should consider also renaming
maxwritten_clean (and perhaps a few others), because nobody understands what
that is supposed to mean.
@@ -1089,10 +1077,6 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
-    /* Count all backend writes regardless of if they fit in the queue */
-    if (!AmBackgroundWriterProcess())
-        CheckpointerShmem->num_backend_writes++;
-
     /*
      * If the checkpointer isn't running or the request queue is full, the
      * backend will have to perform its own fsync request.  But before forcing
@@ -1106,8 +1090,10 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
      * Count the subset of writes where backends have to do their own
      * fsync
      */
+    /* TODO: should we count fsyncs for all types of procs? */
     if (!AmBackgroundWriterProcess())
-        CheckpointerShmem->num_backend_fsync++;
+        pgstat_increment_buffer_action(BA_Fsync);
+
Yes, I think that'd make sense. Now that we can disambiguate the different
types of syncs between procs, I don't see a point of having a process-type
filter here. We just lose data...
     /* don't set checksum for all-zero page */
@@ -1229,11 +1234,60 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
             if (XLogNeedsFlush(lsn) && StrategyRejectBuffer(strategy, buf))
             {
+                /*
+                 * Unset the strat write flag, as we will not be writing
+                 * this particular buffer from our ring out and may end
+                 * up having to find a buffer from main shared buffers,
+                 * which, if it is dirty, we may have to write out, which
+                 * could have been prevented by checkpointing and background
+                 * writing
+                 */
+                StrategyUnChooseBufferFromRing(strategy);
+
                 /* Drop lock/pin and loop around for another buffer */
                 LWLockRelease(BufferDescriptorGetContentLock(buf));
                 UnpinBuffer(buf, true);
                 continue;
             }
Could we combine this with StrategyRejectBuffer()? It seems a bit wasteful to
have two function calls into freelist.c when the second happens exactly when
the first returns true?
+
+            /*
+             * TODO: there is certainly a better way to write this
+             * logic
+             */
+
+            /*
+             * The dirty buffer that will be written out was selected
+             * from the ring and we did not bother checking the
+             * freelist or doing a clock sweep to look for a clean
+             * buffer to use, thus, this write will be counted as a
+             * strategy write -- one that may be unnecessary without a
+             * strategy
+             */
+            if (StrategyIsBufferFromRing(strategy))
+            {
+                pgstat_increment_buffer_action(BA_Write_Strat);
+            }
+
+            /*
+             * If the dirty buffer was one we grabbed from the
+             * freelist or through a clock sweep, it could have been
+             * written out by bgwriter or checkpointer, thus, we will
+             * count it as a regular write
+             */
+            else
+                pgstat_increment_buffer_action(BA_Write);
It seems this would be better solved by having a "bool *from_ring" or
GetBufferSource* parameter to StrategyGetBuffer().
@@ -2895,6 +2948,20 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
     /*
      * bufToWrite is either the shared buffer or a copy, as appropriate.
      */
+
+    /*
+     * TODO: consider that if we did not need to distinguish between a buffer
+     * flushed that was grabbed from the ring buffer and written out as part
+     * of a strategy which was not from main Shared Buffers (and thus
+     * preventable by bgwriter or checkpointer), then we could move all calls
+     * to pgstat_increment_buffer_action() here except for the one for
+     * extends, which would remain in ReadBuffer_common() before smgrextend()
+     * (unless we decide to start counting other extends). That includes the
+     * call to count buffers written by bgwriter and checkpointer which go
+     * through FlushBuffer() but not BufferAlloc(). That would make it
+     * simpler. Perhaps instead we can find somewhere else to indicate that
+     * the buffer is from the ring of buffers to reuse.
+     */
     smgrwrite(reln,
               buf->tag.forkNum,
               buf->tag.blockNum,
Can we just add a parameter to FlushBuffer indicating what the source of the
write is?
@@ -247,7 +257,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
      * the rate of buffer consumption.  Note that buffers recycled by a
      * strategy object are intentionally not counted here.
      */
-    pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);
+    pgstat_increment_buffer_action(BA_Alloc);

     /*
      * First check, without acquiring the lock, whether there's buffers in the
@@ -411,11 +421,6 @@ StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
          */
         *complete_passes += nextVictimBuffer / NBuffers;
     }
-
-    if (num_buf_alloc)
-    {
-        *num_buf_alloc = pg_atomic_exchange_u32(&StrategyControl->numBufferAllocs, 0);
-    }
     SpinLockRelease(&StrategyControl->buffer_strategy_lock);
     return result;
}
Hm. Isn't bgwriter using the *num_buf_alloc value to pace its activity? I
suspect this patch shouldn't get rid of numBufferAllocs at the same time as
overhauling the stats stuff. Perhaps we don't need both - but it's not obvious
that that's the case / how we can make that work.
+void
+pgstat_increment_buffer_action(BufferActionType ba_type)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+
+    if (!beentry || !pgstat_track_activities)
+        return;
+
+    if (ba_type == BA_Alloc)
+        pg_atomic_add_fetch_u64(&beentry->buffer_action_stats.allocs, 1);
+    else if (ba_type == BA_Extend)
+        pg_atomic_add_fetch_u64(&beentry->buffer_action_stats.extends, 1);
+    else if (ba_type == BA_Fsync)
+        pg_atomic_add_fetch_u64(&beentry->buffer_action_stats.fsyncs, 1);
+    else if (ba_type == BA_Write)
+        pg_atomic_add_fetch_u64(&beentry->buffer_action_stats.writes, 1);
+    else if (ba_type == BA_Write_Strat)
+        pg_atomic_add_fetch_u64(&beentry->buffer_action_stats.writes_strat, 1);
+}
I don't think we want to use atomic increments here - they're *slow*. And
there only ever can be a single writer to a backend's stats. So just doing
something like
pg_atomic_write_u64(&var, pg_atomic_read_u64(&var) + 1)
should do the trick.
+/*
+ * Called for a single backend at the time of death to persist its I/O stats
+ */
+void
+pgstat_record_dead_backend_buffer_actions(void)
+{
+    volatile PgBackendBufferActionStats *ba_stats;
+    volatile PgBackendStatus *beentry = MyBEEntry;
+
+    if (beentry->st_procpid != 0)
+        return;
+
+    // TODO: is this correct? could there be a data race? do I need a lock?
+    ba_stats = &BufferActionStatsArray[beentry->st_backendType];
+    pg_atomic_add_fetch_u64(&ba_stats->allocs, pg_atomic_read_u64(&beentry->buffer_action_stats.allocs));
+    pg_atomic_add_fetch_u64(&ba_stats->extends, pg_atomic_read_u64(&beentry->buffer_action_stats.extends));
+    pg_atomic_add_fetch_u64(&ba_stats->fsyncs, pg_atomic_read_u64(&beentry->buffer_action_stats.fsyncs));
+    pg_atomic_add_fetch_u64(&ba_stats->writes, pg_atomic_read_u64(&beentry->buffer_action_stats.writes));
+    pg_atomic_add_fetch_u64(&ba_stats->writes_strat, pg_atomic_read_u64(&beentry->buffer_action_stats.writes_strat));
+}
I don't see a race, FWIW.
This is where I propose that we instead report the values up to the stats
collector, instead of having a separate array that we need to persist
+/*
+ * Fill the provided values array with the accumulated counts of buffer actions
+ * taken by all backends of type backend_type (input parameter), both alive and
+ * dead. This is currently only used by pg_stat_get_buffer_actions() to create
+ * the rows in the pg_stat_buffer_actions system view.
+ */
+void
+pgstat_recount_all_buffer_actions(BackendType backend_type, Datum *values)
+{
+    int         i;
+    volatile PgBackendStatus *beentry;
+
+    /*
+     * Add stats from all exited backends
+     */
+    values[BA_Alloc] = pg_atomic_read_u64(&BufferActionStatsArray[backend_type].allocs);
+    values[BA_Extend] = pg_atomic_read_u64(&BufferActionStatsArray[backend_type].extends);
+    values[BA_Fsync] = pg_atomic_read_u64(&BufferActionStatsArray[backend_type].fsyncs);
+    values[BA_Write] = pg_atomic_read_u64(&BufferActionStatsArray[backend_type].writes);
+    values[BA_Write_Strat] = pg_atomic_read_u64(&BufferActionStatsArray[backend_type].writes_strat);
+
+    /*
+     * Loop through all live backends and count their buffer actions
+     */
+    // TODO: see note in pg_stat_get_buffer_actions() about inefficiency of this method
+    beentry = BackendStatusArray;
+    for (i = 1; i <= MaxBackends; i++)
+    {
+        /* Don't count dead backends. They should already be counted */
+        if (beentry->st_procpid == 0)
+            continue;
+        if (beentry->st_backendType != backend_type)
+            continue;
+
+        values[BA_Alloc] += pg_atomic_read_u64(&beentry->buffer_action_stats.allocs);
+        values[BA_Extend] += pg_atomic_read_u64(&beentry->buffer_action_stats.extends);
+        values[BA_Fsync] += pg_atomic_read_u64(&beentry->buffer_action_stats.fsyncs);
+        values[BA_Write] += pg_atomic_read_u64(&beentry->buffer_action_stats.writes);
+        values[BA_Write_Strat] += pg_atomic_read_u64(&beentry->buffer_action_stats.writes_strat);
+
+        beentry++;
+    }
+}
It seems to make a bit more sense to have this sum up the stats for all
backend types at once.
+    /*
+     * Currently, the only supported backend types for stats are the following.
+     * If this were to change, pg_proc.dat would need to be changed as well
+     * to reflect the new expected number of rows.
+     */
+    Datum       values[BUFFER_ACTION_NUM_TYPES];
+    bool        nulls[BUFFER_ACTION_NUM_TYPES];
Ah ;)
Greetings,
Andres Freund
On Tue, Aug 3, 2021 at 2:13 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2021-08-02 18:25:56 -0400, Melanie Plageman wrote:
Thanks for the feedback!
I agree it makes sense to count strategy writes separately.
I thought about this some more, and I don't know if it makes sense to
only count "avoidable" strategy writes.

This would mean that a backend writing out a buffer from the strategy
ring when no clean shared buffers (as well as no clean strategy buffers)
are available would not count that write as a strategy write (even
though it is writing out a buffer from its strategy ring). But, it
obviously doesn't make sense to count it as a regular buffer being
written out. So, I plan to change this code.

What do you mean with "no clean shared buffers ... are available"?
I think I was talking about the scenario in which a backend using a
strategy does not find a clean buffer in the strategy ring and goes to
look in the freelist for a clean shared buffer and doesn't find one.
I was probably talking in circles up there. I think the current
patch counts the right writes in the right way, though.
The most substantial missing piece of the patch right now is persisting
the data across reboots.

The two places in the code I can see to persist the buffer action stats
data are:
1) using the stats collector code (like in
pgstat_read/write_statsfiles()
2) using a before_shmem_exit() hook which writes the data structure to a
file and then read from it when making the shared memory array initially

I think it's pretty clear that we should go for 1. Having two mechanisms for
persisting stats data is a bad idea.
New version uses the stats collector.
Also, I'm unsure how writing the buffer action stats out in
pgstat_write_statsfiles() will work, since I think that backends can
update their buffer action stats after we would have already persisted
the data from the BufferActionStatsArray -- causing us to lose those
updates.

I was thinking it'd work differently. Whenever a connection ends, it reports
its data up to pgstats.c (otherwise we'd lose those stats). By the time
shutdown happens, they all need to have already reported their stats - so
we don't need to do anything to get the data to pgstats.c during shutdown
time.
When you say "whenever a connection ends", what part of the code are you
referring to specifically?
Also, when you say "shutdown", do you mean a backend shutting down or
all backends shutting down (including postmaster) -- like pg_ctl stop?
And, I don't think I can use pgstat_read_statsfiles() since the
BufferActionStatsArray should have the data from the file as soon as the
view containing the buffer action stats can be queried. Thus, it seems
like I would need to read the file while initializing the array in
CreateBufferActionStatsCounters().

Why would backends need to read that data back?
To get totals across restarts, but, doesn't matter now that I am using
stats collector.
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 55f6e3711d..96cac0a74e 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1067,9 +1067,6 @@ CREATE VIEW pg_stat_bgwriter AS
         pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint,
         pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean,
         pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
-        pg_stat_get_buf_written_backend() AS buffers_backend,
-        pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
-        pg_stat_get_buf_alloc() AS buffers_alloc,
         pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;

Material for a separate patch, not this. But if we're going to break
monitoring queries anyway, I think we should consider also renaming
maxwritten_clean (and perhaps a few others), because nobody understands what
that is supposed to mean.
Do you mean I shouldn't remove anything from the pg_stat_bgwriter view?
@@ -1089,10 +1077,6 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
-    /* Count all backend writes regardless of if they fit in the queue */
-    if (!AmBackgroundWriterProcess())
-        CheckpointerShmem->num_backend_writes++;
-
     /*
      * If the checkpointer isn't running or the request queue is full, the
      * backend will have to perform its own fsync request.  But before forcing
@@ -1106,8 +1090,10 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
      * Count the subset of writes where backends have to do their own
      * fsync
      */
+    /* TODO: should we count fsyncs for all types of procs? */
     if (!AmBackgroundWriterProcess())
-        CheckpointerShmem->num_backend_fsync++;
+        pgstat_increment_buffer_action(BA_Fsync);
+

Yes, I think that'd make sense. Now that we can disambiguate the different
types of syncs between procs, I don't see a point of having a process-type
filter here. We just lose data...
Done
     /* don't set checksum for all-zero page */
@@ -1229,11 +1234,60 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
             if (XLogNeedsFlush(lsn) && StrategyRejectBuffer(strategy, buf))
             {
+                /*
+                 * Unset the strat write flag, as we will not be writing
+                 * this particular buffer from our ring out and may end
+                 * up having to find a buffer from main shared buffers,
+                 * which, if it is dirty, we may have to write out, which
+                 * could have been prevented by checkpointing and background
+                 * writing
+                 */
+                StrategyUnChooseBufferFromRing(strategy);
+
                 /* Drop lock/pin and loop around for another buffer */
                 LWLockRelease(BufferDescriptorGetContentLock(buf));
                 UnpinBuffer(buf, true);
                 continue;
             }

Could we combine this with StrategyRejectBuffer()? It seems a bit wasteful to
have two function calls into freelist.c when the second happens exactly when
the first returns true?

+
+            /*
+             * TODO: there is certainly a better way to write this
+             * logic
+             */
+
+            /*
+             * The dirty buffer that will be written out was selected
+             * from the ring and we did not bother checking the
+             * freelist or doing a clock sweep to look for a clean
+             * buffer to use, thus, this write will be counted as a
+             * strategy write -- one that may be unnecessary without a
+             * strategy
+             */
+            if (StrategyIsBufferFromRing(strategy))
+            {
+                pgstat_increment_buffer_action(BA_Write_Strat);
+            }
+
+            /*
+             * If the dirty buffer was one we grabbed from the
+             * freelist or through a clock sweep, it could have been
+             * written out by bgwriter or checkpointer, thus, we will
+             * count it as a regular write
+             */
+            else
+                pgstat_increment_buffer_action(BA_Write);

It seems this would be better solved by having a "bool *from_ring" or
GetBufferSource* parameter to StrategyGetBuffer().
I've addressed both of these in the new version.
@@ -2895,6 +2948,20 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
     /*
      * bufToWrite is either the shared buffer or a copy, as appropriate.
      */
+
+    /*
+     * TODO: consider that if we did not need to distinguish between a buffer
+     * flushed that was grabbed from the ring buffer and written out as part
+     * of a strategy which was not from main Shared Buffers (and thus
+     * preventable by bgwriter or checkpointer), then we could move all calls
+     * to pgstat_increment_buffer_action() here except for the one for
+     * extends, which would remain in ReadBuffer_common() before smgrextend()
+     * (unless we decide to start counting other extends). That includes the
+     * call to count buffers written by bgwriter and checkpointer which go
+     * through FlushBuffer() but not BufferAlloc(). That would make it
+     * simpler. Perhaps instead we can find somewhere else to indicate that
+     * the buffer is from the ring of buffers to reuse.
+     */
     smgrwrite(reln,
               buf->tag.forkNum,
               buf->tag.blockNum,

Can we just add a parameter to FlushBuffer indicating what the source of the
write is?
I just noticed this comment now, so I'll address that in the next
version. I rebased today and noticed merge conflicts, so, it looks like
v5 will be on its way soon anyway.
@@ -247,7 +257,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
      * the rate of buffer consumption.  Note that buffers recycled by a
      * strategy object are intentionally not counted here.
      */
-    pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);
+    pgstat_increment_buffer_action(BA_Alloc);

     /*
      * First check, without acquiring the lock, whether there's buffers in the
@@ -411,11 +421,6 @@ StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
          */
         *complete_passes += nextVictimBuffer / NBuffers;
     }
-
-    if (num_buf_alloc)
-    {
-        *num_buf_alloc = pg_atomic_exchange_u32(&StrategyControl->numBufferAllocs, 0);
-    }
     SpinLockRelease(&StrategyControl->buffer_strategy_lock);
     return result;
}

Hm. Isn't bgwriter using the *num_buf_alloc value to pace its activity? I
suspect this patch shouldn't get rid of numBufferAllocs at the same time as
overhauling the stats stuff. Perhaps we don't need both - but it's not obvious
that that's the case / how we can make that work.
I initially meant to add a function to the patch like
pg_stat_get_buffer_actions() but which took a BufferActionType and
BackendType as parameters and returned a single value which is the
number of buffer action types of that type for that type of backend.
let's say I defined it like this:
uint64
pg_stat_get_backend_buffer_actions_stats(BackendType backend_type,
BufferActionType ba_type)
Then, I intended to use that in StrategySyncStart() to set num_buf_alloc
by subtracting the value of StrategyControl->numBufferAllocs from the
value returned by pg_stat_get_backend_buffer_actions_stats(B_BG_WRITER,
BA_Alloc), val, then adding that value, val, to
StrategyControl->numBufferAllocs.
I think that would have the same behavior as current, though I'm not
sure if the performance would end up being better or worse. It wouldn't
be atomically incrementing StrategyControl->numBufferAllocs, but it
would do a few additional atomic operations in StrategySyncStart() than
before. Also, we would do all the work done by
pg_stat_get_buffer_actions() in StrategySyncStart().
But that is called comparatively infrequently, right?
+void
+pgstat_increment_buffer_action(BufferActionType ba_type)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+
+    if (!beentry || !pgstat_track_activities)
+        return;
+
+    if (ba_type == BA_Alloc)
+        pg_atomic_add_fetch_u64(&beentry->buffer_action_stats.allocs, 1);
+    else if (ba_type == BA_Extend)
+        pg_atomic_add_fetch_u64(&beentry->buffer_action_stats.extends, 1);
+    else if (ba_type == BA_Fsync)
+        pg_atomic_add_fetch_u64(&beentry->buffer_action_stats.fsyncs, 1);
+    else if (ba_type == BA_Write)
+        pg_atomic_add_fetch_u64(&beentry->buffer_action_stats.writes, 1);
+    else if (ba_type == BA_Write_Strat)
+        pg_atomic_add_fetch_u64(&beentry->buffer_action_stats.writes_strat, 1);
+}

I don't think we want to use atomic increments here - they're *slow*. And
there only ever can be a single writer to a backend's stats. So just doing
something like
pg_atomic_write_u64(&var, pg_atomic_read_u64(&var) + 1)
should do the trick.
Done
+/*
+ * Called for a single backend at the time of death to persist its I/O stats
+ */
+void
+pgstat_record_dead_backend_buffer_actions(void)
+{
+    volatile PgBackendBufferActionStats *ba_stats;
+    volatile PgBackendStatus *beentry = MyBEEntry;
+
+    if (beentry->st_procpid != 0)
+        return;
+
+    // TODO: is this correct? could there be a data race? do I need a lock?
+    ba_stats = &BufferActionStatsArray[beentry->st_backendType];
+    pg_atomic_add_fetch_u64(&ba_stats->allocs, pg_atomic_read_u64(&beentry->buffer_action_stats.allocs));
+    pg_atomic_add_fetch_u64(&ba_stats->extends, pg_atomic_read_u64(&beentry->buffer_action_stats.extends));
+    pg_atomic_add_fetch_u64(&ba_stats->fsyncs, pg_atomic_read_u64(&beentry->buffer_action_stats.fsyncs));
+    pg_atomic_add_fetch_u64(&ba_stats->writes, pg_atomic_read_u64(&beentry->buffer_action_stats.writes));
+    pg_atomic_add_fetch_u64(&ba_stats->writes_strat, pg_atomic_read_u64(&beentry->buffer_action_stats.writes_strat));
+}

I don't see a race, FWIW.
This is where I propose that we instead report the values up to the stats
collector, instead of having a separate array that we need to persist
Changed
+/*
+ * Fill the provided values array with the accumulated counts of buffer actions
+ * taken by all backends of type backend_type (input parameter), both alive and
+ * dead. This is currently only used by pg_stat_get_buffer_actions() to create
+ * the rows in the pg_stat_buffer_actions system view.
+ */
+void
+pgstat_recount_all_buffer_actions(BackendType backend_type, Datum *values)
+{
+    int         i;
+    volatile PgBackendStatus *beentry;
+
+    /*
+     * Add stats from all exited backends
+     */
+    values[BA_Alloc] = pg_atomic_read_u64(&BufferActionStatsArray[backend_type].allocs);
+    values[BA_Extend] = pg_atomic_read_u64(&BufferActionStatsArray[backend_type].extends);
+    values[BA_Fsync] = pg_atomic_read_u64(&BufferActionStatsArray[backend_type].fsyncs);
+    values[BA_Write] = pg_atomic_read_u64(&BufferActionStatsArray[backend_type].writes);
+    values[BA_Write_Strat] = pg_atomic_read_u64(&BufferActionStatsArray[backend_type].writes_strat);
+
+    /*
+     * Loop through all live backends and count their buffer actions
+     */
+    // TODO: see note in pg_stat_get_buffer_actions() about inefficiency of this method
+    beentry = BackendStatusArray;
+    for (i = 1; i <= MaxBackends; i++)
+    {
+        /* Don't count dead backends. They should already be counted */
+        if (beentry->st_procpid == 0)
+            continue;
+        if (beentry->st_backendType != backend_type)
+            continue;
+
+        values[BA_Alloc] += pg_atomic_read_u64(&beentry->buffer_action_stats.allocs);
+        values[BA_Extend] += pg_atomic_read_u64(&beentry->buffer_action_stats.extends);
+        values[BA_Fsync] += pg_atomic_read_u64(&beentry->buffer_action_stats.fsyncs);
+        values[BA_Write] += pg_atomic_read_u64(&beentry->buffer_action_stats.writes);
+        values[BA_Write_Strat] += pg_atomic_read_u64(&beentry->buffer_action_stats.writes_strat);
+
+        beentry++;
+    }
+}

It seems to make a bit more sense to have this sum up the stats for all
backend types at once.
Changed.
+    /*
+     * Currently, the only supported backend types for stats are the following.
+     * If this were to change, pg_proc.dat would need to be changed as well
+     * to reflect the new expected number of rows.
+     */
+    Datum       values[BUFFER_ACTION_NUM_TYPES];
+    bool        nulls[BUFFER_ACTION_NUM_TYPES];

Ah ;)
I just went ahead and made a row for each backend type.
- Melanie
Attachments:
v4-0001-Add-system-view-tracking-shared-buffer-actions.patchtext/x-patch; charset=US-ASCII; name=v4-0001-Add-system-view-tracking-shared-buffer-actions.patchDownload
From ab751bdbc96c8c52a341d9ced3f9e1fe929e2010 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 2 Aug 2021 17:56:07 -0400
Subject: [PATCH v4] Add system view tracking shared buffer actions
Add a system view which tracks
- number of shared buffers the checkpointer and bgwriter write out
- number of shared buffers a regular backend is forced to flush
- number of extends done by a regular backend through shared buffers
- number of buffers flushed by a backend or autovacuum using a
BufferAccessStrategy which, were they not to use this strategy, could
perhaps have been avoided if a clean shared buffer was available
- number of fsyncs done by a backend which could have been done by
checkpointer if sync queue had not been full
- number of buffers allocated by a regular backend or autovacuum worker
for either a new block or an existing block of a relation which is not
currently in a buffer
All of these stats which were in the system view pg_stat_bgwriter have
been removed from that view.
All backends, on exit, will update a shared memory array with the
buffers they wrote or extended.
When the view is queried, add all live backends' statuses
to the totals in the shared memory array and return that as the full
total.
Each row of the view is for a particular backend type and each column is
the number of a particular kind of buffer action taken by the various
backends.
TODO:
- Some kind of test?
- Docs change
---
src/backend/catalog/system_views.sql | 14 +++-
src/backend/postmaster/checkpointer.c | 27 +-----
src/backend/postmaster/pgstat.c | 40 ++++++++-
src/backend/storage/buffer/bufmgr.c | 30 +++++--
src/backend/storage/buffer/freelist.c | 16 +++-
src/backend/utils/activity/backend_status.c | 62 ++++++++++++++
src/backend/utils/adt/pgstatfuncs.c | 91 ++++++++++++++++++---
src/backend/utils/init/miscinit.c | 2 +
src/include/catalog/pg_proc.dat | 21 ++---
src/include/miscadmin.h | 12 +++
src/include/pgstat.h | 26 ++++--
src/include/storage/buf_internals.h | 4 +-
src/include/utils/backend_status.h | 15 +++-
src/test/regress/expected/rules.out | 10 ++-
src/test/regress/sql/stats.sql | 1 +
15 files changed, 297 insertions(+), 74 deletions(-)
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 55f6e3711d..96cac0a74e 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1067,9 +1067,6 @@ CREATE VIEW pg_stat_bgwriter AS
pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint,
pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean,
pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
- pg_stat_get_buf_written_backend() AS buffers_backend,
- pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
- pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
CREATE VIEW pg_stat_wal AS
@@ -1085,6 +1082,17 @@ CREATE VIEW pg_stat_wal AS
w.stats_reset
FROM pg_stat_get_wal() w;
+CREATE VIEW pg_stat_buffer_actions AS
+SELECT
+ b.backend_type,
+ b.buffers_alloc,
+ b.buffers_extend,
+ b.buffers_fsync,
+ b.buffers_write,
+ b.buffers_write_strat
+FROM pg_stat_get_buffer_actions() b;
+
+
CREATE VIEW pg_stat_progress_analyze AS
SELECT
S.pid AS pid, S.datid AS datid, D.datname AS datname,
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index bc9ac7ccfa..db1c6c45c2 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -90,17 +90,8 @@
* requesting backends since the last checkpoint start. The flags are
* chosen so that OR'ing is the correct way to combine multiple requests.
*
- * num_backend_writes is used to count the number of buffer writes performed
- * by user backend processes. This counter should be wide enough that it
- * can't overflow during a single processing cycle. num_backend_fsync
- * counts the subset of those writes that also had to do their own fsync,
- * because the checkpointer failed to absorb their request.
- *
* The requests array holds fsync requests sent by backends and not yet
* absorbed by the checkpointer.
- *
- * Unlike the checkpoint fields, num_backend_writes, num_backend_fsync, and
- * the requests fields are protected by CheckpointerCommLock.
*----------
*/
typedef struct
@@ -124,9 +115,6 @@ typedef struct
ConditionVariable start_cv; /* signaled when ckpt_started advances */
ConditionVariable done_cv; /* signaled when ckpt_done advances */
- uint32 num_backend_writes; /* counts user backend buffer writes */
- uint32 num_backend_fsync; /* counts user backend fsync calls */
-
int num_requests; /* current # of requests */
int max_requests; /* allocated array size */
CheckpointerRequest requests[FLEXIBLE_ARRAY_MEMBER];
@@ -1089,10 +1077,6 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
- /* Count all backend writes regardless of if they fit in the queue */
- if (!AmBackgroundWriterProcess())
- CheckpointerShmem->num_backend_writes++;
-
/*
* If the checkpointer isn't running or the request queue is full, the
* backend will have to perform its own fsync request. But before forcing
@@ -1106,8 +1090,8 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
* Count the subset of writes where backends have to do their own
* fsync
*/
- if (!AmBackgroundWriterProcess())
- CheckpointerShmem->num_backend_fsync++;
+ pgstat_increment_buffer_action(BA_Fsync);
+
LWLockRelease(CheckpointerCommLock);
return false;
}
@@ -1264,13 +1248,6 @@ AbsorbSyncRequests(void)
LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
- /* Transfer stats counts into pending pgstats message */
- BgWriterStats.m_buf_written_backend += CheckpointerShmem->num_backend_writes;
- BgWriterStats.m_buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
-
- CheckpointerShmem->num_backend_writes = 0;
- CheckpointerShmem->num_backend_fsync = 0;
-
/*
* We try to avoid holding the lock for a long time by copying the request
* array, and processing the requests after releasing the lock.
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 11702f2a80..0db6cd0587 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -129,6 +129,7 @@ char *pgstat_stat_tmpname = NULL;
* without needing to copy things around. We assume these init to zeroes.
*/
PgStat_MsgBgWriter BgWriterStats;
+PgStat_MsgBufferActions BufferActionsStats;
PgStat_MsgWal WalStats;
/*
@@ -348,6 +349,7 @@ static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
static void pgstat_recv_anl_ancestors(PgStat_MsgAnlAncestors *msg, int len);
static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
+static void pgstat_recv_buffer_actions(PgStat_MsgBufferActions *msg, int len);
static void pgstat_recv_wal(PgStat_MsgWal *msg, int len);
static void pgstat_recv_slru(PgStat_MsgSLRU *msg, int len);
static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
@@ -3040,6 +3042,16 @@ pgstat_send_bgwriter(void)
MemSet(&BgWriterStats, 0, sizeof(BgWriterStats));
}
+void
+pgstat_send_buffer_actions(void)
+{
+ pgstat_setheader(&BufferActionsStats.m_hdr, PGSTAT_MTYPE_BUFFER_ACTIONS);
+ pgstat_send(&BufferActionsStats, sizeof(BufferActionsStats));
+
+ // TODO: not needed because backends only call this when exiting?
+ MemSet(&BufferActionsStats, 0, sizeof(BufferActionsStats));
+}
+
/* ----------
* pgstat_send_wal() -
*
@@ -3382,6 +3394,10 @@ PgstatCollectorMain(int argc, char *argv[])
pgstat_recv_bgwriter(&msg.msg_bgwriter, len);
break;
+ case PGSTAT_MTYPE_BUFFER_ACTIONS:
+ pgstat_recv_buffer_actions(&msg.msg_buffer_actions, len);
+ break;
+
case PGSTAT_MTYPE_WAL:
pgstat_recv_wal(&msg.msg_wal, len);
break;
@@ -4056,6 +4072,8 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
goto done;
}
+
+
/*
* We found an existing collector stats file. Read it and put all the
* hashtable entries into place.
@@ -5352,9 +5370,25 @@ pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
globalStats.buf_written_checkpoints += msg->m_buf_written_checkpoints;
globalStats.buf_written_clean += msg->m_buf_written_clean;
globalStats.maxwritten_clean += msg->m_maxwritten_clean;
- globalStats.buf_written_backend += msg->m_buf_written_backend;
- globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
- globalStats.buf_alloc += msg->m_buf_alloc;
+}
+
+
+static void
+pgstat_recv_buffer_actions(PgStat_MsgBufferActions *msg, int len)
+{
+ globalStats.buffer_actions[msg->backend_type].backend_type = msg->backend_type;
+ globalStats.buffer_actions[msg->backend_type].allocs += msg->allocs;
+ globalStats.buffer_actions[msg->backend_type].extends += msg->extends;
+ globalStats.buffer_actions[msg->backend_type].fsyncs += msg->fsyncs;
+ globalStats.buffer_actions[msg->backend_type].writes += msg->writes;
+ globalStats.buffer_actions[msg->backend_type].writes_strat += msg->writes_strat;
+
+}
+
+PgStat_MsgBufferActions *
+pgstat_get_buffer_action_stats(BackendType backend_type)
+{
+ return &globalStats.buffer_actions[backend_type];
}
/* ----------
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 33d99f604a..8bfdf848a4 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -963,6 +963,11 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (isExtend)
{
+ /*
+ * Extends counted here are only those that go through shared buffers
+ */
+ pgstat_increment_buffer_action(BA_Extend);
+
/* new buffers are zero-filled */
MemSet((char *) bufBlock, 0, BLCKSZ);
/* don't set checksum for all-zero page */
@@ -1163,6 +1168,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
/* Loop here in case we have to try another victim buffer */
for (;;)
{
+ bool from_ring = false;
/*
* Ensure, while the spinlock's not yet held, that there's a free
* refcount entry.
@@ -1173,7 +1179,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* Select a victim buffer. The buffer is returned with its header
* spinlock still held!
*/
- buf = StrategyGetBuffer(strategy, &buf_state);
+ buf = StrategyGetBuffer(strategy, &buf_state, &from_ring);
Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
@@ -1210,6 +1216,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf),
LW_SHARED))
{
+ BufferActionType buffer_action;
/*
* If using a nondefault strategy, and writing the buffer
* would require a WAL flush, let the strategy decide whether
@@ -1227,7 +1234,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
UnlockBufHdr(buf, buf_state);
if (XLogNeedsFlush(lsn) &&
- StrategyRejectBuffer(strategy, buf))
+ StrategyRejectBuffer(strategy, buf, &from_ring))
{
/* Drop lock/pin and loop around for another buffer */
LWLockRelease(BufferDescriptorGetContentLock(buf));
@@ -1236,6 +1243,20 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
}
}
+ /*
+ * When a strategy is in use, if the dirty buffer was selected
+ * from the strategy ring and we did not bother checking the
+ * freelist or doing a clock sweep to look for a clean shared
+ * buffer to use, the write will be counted as a strategy
+ * write. However, if the dirty buffer was obtained from the
+ * freelist or a clock sweep, it is counted as a regular write.
+ * When a strategy is not in use, at this point, the write can
+ * only be a "regular" write of a dirty buffer.
+ */
+
+ buffer_action = from_ring ? BA_Write_Strat : BA_Write;
+ pgstat_increment_buffer_action(buffer_action);
+
/* OK, do the I/O */
TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(forkNum, blockNum,
smgr->smgr_rnode.node.spcNode,
@@ -2246,9 +2267,6 @@ BgBufferSync(WritebackContext *wb_context)
*/
strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
- /* Report buffer alloc counts to pgstat */
- BgWriterStats.m_buf_alloc += recent_alloc;
-
/*
* If we're not running the LRU scan, just stop after doing the stats
* stuff. We mark the saved state invalid so that we can recover sanely
@@ -2543,6 +2561,8 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
* Pin it, share-lock it, write it. (FlushBuffer will do nothing if the
* buffer is clean by the time we've locked it.)
*/
+ pgstat_increment_buffer_action(BA_Write);
+
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 6be80476db..e8a8d9f788 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -19,6 +19,7 @@
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/proc.h"
+#include "utils/backend_status.h"
#define INT_ACCESS_ONCE(var) ((int)(*((volatile int *)&(var))))
@@ -198,7 +199,7 @@ have_free_buffer(void)
* return the buffer with the buffer header spinlock still held.
*/
BufferDesc *
-StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
+StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_ring)
{
BufferDesc *buf;
int bgwprocno;
@@ -213,7 +214,10 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
{
buf = GetBufferFromRing(strategy, buf_state);
if (buf != NULL)
+ {
+ *from_ring = true;
return buf;
+ }
}
/*
@@ -247,6 +251,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
* the rate of buffer consumption. Note that buffers recycled by a
* strategy object are intentionally not counted here.
*/
+ pgstat_increment_buffer_action(BA_Alloc);
pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);
/*
@@ -683,7 +688,7 @@ AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
* if this buffer should be written and re-used.
*/
bool
-StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
+StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool *from_ring)
{
/* We only do this in bulkread mode */
if (strategy->btype != BAS_BULKREAD)
@@ -700,5 +705,12 @@ StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
*/
strategy->buffers[strategy->current] = InvalidBuffer;
+ /*
+ * Since we will not be writing out a dirty buffer from the ring, set
+ * from_ring to false so that we do not count this write as a "strategy
+ * write" and can do proper bookkeeping for pg_stat_buffer_actions.
+ */
+ *from_ring = false;
+
return true;
}
diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index 2901f9f5a9..d720c73e70 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -75,6 +75,7 @@ static MemoryContext backendStatusSnapContext;
static void pgstat_beshutdown_hook(int code, Datum arg);
static void pgstat_read_current_status(void);
static void pgstat_setup_backend_status_context(void);
+static void pgstat_record_dead_backend_buffer_actions(void);
/*
@@ -399,6 +400,11 @@ pgstat_bestart(void)
lbeentry.st_progress_command = PROGRESS_COMMAND_INVALID;
lbeentry.st_progress_command_target = InvalidOid;
lbeentry.st_query_id = UINT64CONST(0);
+ pg_atomic_init_u64(&lbeentry.buffer_action_stats.allocs, 0);
+ pg_atomic_init_u64(&lbeentry.buffer_action_stats.extends, 0);
+ pg_atomic_init_u64(&lbeentry.buffer_action_stats.fsyncs, 0);
+ pg_atomic_init_u64(&lbeentry.buffer_action_stats.writes, 0);
+ pg_atomic_init_u64(&lbeentry.buffer_action_stats.writes_strat, 0);
/*
* we don't zero st_progress_param here to save cycles; nobody should
@@ -469,6 +475,11 @@ pgstat_beshutdown_hook(int code, Datum arg)
beentry->st_procpid = 0; /* mark invalid */
PGSTAT_END_WRITE_ACTIVITY(beentry);
+
+ // TODO: should this go in pgstat_report_stat() instead
+ // TODO: should this check be here? Is it possible that members were zero-initialized if database ID is not valid?
+ if (OidIsValid(MyDatabaseId))
+ pgstat_record_dead_backend_buffer_actions();
}
/*
@@ -1041,6 +1052,57 @@ pgstat_get_my_query_id(void)
*/
return MyBEEntry->st_query_id;
}
+void
+pgstat_increment_buffer_action(BufferActionType ba_type)
+{
+ volatile PgBackendStatus *beentry = MyBEEntry;
+
+ if (!beentry || !pgstat_track_activities)
+ return;
+
+ if (ba_type == BA_Alloc)
+ pg_atomic_write_u64(&beentry->buffer_action_stats.allocs, pg_atomic_read_u64(&beentry->buffer_action_stats.allocs) + 1);
+ else if (ba_type == BA_Extend)
+ pg_atomic_write_u64(&beentry->buffer_action_stats.extends, pg_atomic_read_u64(&beentry->buffer_action_stats.extends) + 1);
+ else if (ba_type == BA_Fsync)
+ pg_atomic_write_u64(&beentry->buffer_action_stats.fsyncs, pg_atomic_read_u64(&beentry->buffer_action_stats.fsyncs) + 1);
+ else if (ba_type == BA_Write)
+ pg_atomic_write_u64(&beentry->buffer_action_stats.writes, pg_atomic_read_u64(&beentry->buffer_action_stats.writes) + 1);
+ else if (ba_type == BA_Write_Strat)
+ pg_atomic_write_u64(&beentry->buffer_action_stats.writes_strat, pg_atomic_read_u64(&beentry->buffer_action_stats.writes_strat) + 1);
+
+}
+
+/*
+ * Called for a single backend at the time of death to persist its I/O stats
+ */
+void
+pgstat_record_dead_backend_buffer_actions(void)
+{
+ volatile PgBackendStatus *beentry = MyBEEntry;
+
+ if (beentry->st_procpid != 0)
+ return;
+
+ // TODO: should I add this or just set it -- seems like it would only happen once -
+ BufferActionsStats.backend_type = beentry->st_backendType;
+ BufferActionsStats.allocs = pg_atomic_read_u64(&beentry->buffer_action_stats.allocs);
+ BufferActionsStats.extends = pg_atomic_read_u64(&beentry->buffer_action_stats.extends);
+ BufferActionsStats.fsyncs = pg_atomic_read_u64(&beentry->buffer_action_stats.fsyncs);
+ BufferActionsStats.writes = pg_atomic_read_u64(&beentry->buffer_action_stats.writes);
+ BufferActionsStats.writes_strat = pg_atomic_read_u64(&beentry->buffer_action_stats.writes_strat);
+ pgstat_send_buffer_actions();
+}
+
+// TODO: this is clearly no good, but I'm not sure if I have to/want to/can use
+// the below pgstat_fetch_stat_beentry and doing the loop that is in
+// pg_stat_get_buffer_actions() into this file will likely mean having to pass a
+// two-dimensional array as a parameter which is unappealing to me
+volatile PgBackendStatus *
+pgstat_access_backend_status_array(void)
+{
+ return BackendStatusArray;
+}
/* ----------
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index f0e09eae4d..163679f60b 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1780,21 +1780,88 @@ pg_stat_get_bgwriter_stat_reset_time(PG_FUNCTION_ARGS)
}
Datum
-pg_stat_get_buf_written_backend(PG_FUNCTION_ARGS)
+pg_stat_get_buffer_actions(PG_FUNCTION_ARGS)
{
- PG_RETURN_INT64(pgstat_fetch_global()->buf_written_backend);
-}
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ TupleDesc tupdesc;
+ Tuplestorestate *tupstore;
+ MemoryContext per_query_ctx;
+ MemoryContext oldcontext;
+ PgStat_MsgBufferActions *buffer_actions;
+ int i;
+ volatile PgBackendStatus *beentry;
+ Datum all_values[BACKEND_NUM_TYPES][BUFFER_ACTION_NUM_TYPES];
+ bool all_nulls[BACKEND_NUM_TYPES][BUFFER_ACTION_NUM_TYPES];
-Datum
-pg_stat_get_buf_fsync_backend(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_global()->buf_fsync_backend);
-}
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
-Datum
-pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_global()->buf_alloc);
+ per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+ oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+ tupstore = tuplestore_begin_heap(true, false, work_mem);
+ rsinfo->returnMode = SFRM_Materialize;
+ rsinfo->setResult = tupstore;
+ rsinfo->setDesc = tupdesc;
+
+ MemoryContextSwitchTo(oldcontext);
+ pgstat_fetch_global();
+ for (i = 1; i < BACKEND_NUM_TYPES; i++)
+ {
+ Datum *values = all_values[i];
+ bool *nulls = all_nulls[i];
+
+ MemSet(values, 0, sizeof(Datum[BUFFER_ACTION_NUM_TYPES]));
+ MemSet(nulls, 0, sizeof(Datum[BUFFER_ACTION_NUM_TYPES]));
+
+ values[0] = CStringGetTextDatum(GetBackendTypeDesc(i));
+ /*
+ * Add stats from all exited backends
+ */
+ buffer_actions = pgstat_get_buffer_action_stats(i);
+
+ values[BA_Alloc] += buffer_actions->allocs;
+ values[BA_Extend] += buffer_actions->extends;
+ values[BA_Fsync] += buffer_actions->fsyncs;
+ values[BA_Write] += buffer_actions->writes;
+ values[BA_Write_Strat] += buffer_actions->writes_strat;
+ }
+
+ /*
+ * Loop through all live backends and count their buffer actions
+ */
+
+ beentry = pgstat_access_backend_status_array();
+ for (i = 0; i <= MaxBackends; i++)
+ {
+ Datum *values;
+ beentry++;
+ /* Don't count dead backends. They should already be counted */
+ if (beentry->st_procpid == 0)
+ continue;
+ values = all_values[beentry->st_backendType];
+
+
+ values[BA_Alloc] += pg_atomic_read_u64(&beentry->buffer_action_stats.allocs);
+ values[BA_Extend] += pg_atomic_read_u64(&beentry->buffer_action_stats.extends);
+ values[BA_Fsync] += pg_atomic_read_u64(&beentry->buffer_action_stats.fsyncs);
+ values[BA_Write] += pg_atomic_read_u64(&beentry->buffer_action_stats.writes);
+ values[BA_Write_Strat] += pg_atomic_read_u64(&beentry->buffer_action_stats.writes_strat);
+
+ }
+
+ for (i = 1; i < BACKEND_NUM_TYPES; i++)
+ {
+ Datum *values = all_values[i];
+ bool *nulls = all_nulls[i];
+ tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+ }
+
+ /* clean up and return the tuplestore */
+ tuplestore_donestoring(tupstore);
+
+ return (Datum) 0;
}
/*
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 8b73850d0d..d0923407ff 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -277,6 +277,8 @@ GetBackendTypeDesc(BackendType backendType)
case B_LOGGER:
backendDesc = "logger";
break;
+ case BACKEND_NUM_TYPES:
+ break;
}
return backendDesc;
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 8cd0252082..32257dcde8 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5565,18 +5565,15 @@
proname => 'pg_stat_get_checkpoint_sync_time', provolatile => 's',
proparallel => 'r', prorettype => 'float8', proargtypes => '',
prosrc => 'pg_stat_get_checkpoint_sync_time' },
-{ oid => '2775', descr => 'statistics: number of buffers written by backends',
- proname => 'pg_stat_get_buf_written_backend', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_buf_written_backend' },
-{ oid => '3063',
- descr => 'statistics: number of backend buffer writes that did their own fsync',
- proname => 'pg_stat_get_buf_fsync_backend', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_buf_fsync_backend' },
-{ oid => '2859', descr => 'statistics: number of buffer allocations',
- proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
- prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
+
+ { oid => '8459', descr => 'statistics: counts of buffer actions taken by each backend type',
+ proname => 'pg_stat_get_buffer_actions', provolatile => 's', proisstrict => 'f',
+ prorows => '13', proretset => 't',
+ proparallel => 'r', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,int8,int8,int8,int8,int8}',
+ proargmodes => '{o,o,o,o,o,o}',
+ proargnames => '{backend_type,buffers_alloc,buffers_extend,buffers_fsync,buffers_write,buffers_write_strat}',
+ prosrc => 'pg_stat_get_buffer_actions' },
{ oid => '1136', descr => 'statistics: information about WAL activity',
proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 68d840d699..74b18dad0f 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -336,8 +336,20 @@ typedef enum BackendType
B_ARCHIVER,
B_STATS_COLLECTOR,
B_LOGGER,
+ BACKEND_NUM_TYPES,
} BackendType;
+typedef enum BufferActionType
+{
+ BA_Invalid = 0,
+ BA_Alloc,
+ BA_Extend,
+ BA_Fsync,
+ BA_Write,
+ BA_Write_Strat,
+ BUFFER_ACTION_NUM_TYPES,
+} BufferActionType;
+
extern BackendType MyBackendType;
extern const char *GetBackendTypeDesc(BackendType backendType);
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 9612c0a6c2..ee545a9d63 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -72,6 +72,7 @@ typedef enum StatMsgType
PGSTAT_MTYPE_ANL_ANCESTORS,
PGSTAT_MTYPE_ARCHIVER,
PGSTAT_MTYPE_BGWRITER,
+ PGSTAT_MTYPE_BUFFER_ACTIONS,
PGSTAT_MTYPE_WAL,
PGSTAT_MTYPE_SLRU,
PGSTAT_MTYPE_FUNCSTAT,
@@ -475,13 +476,22 @@ typedef struct PgStat_MsgBgWriter
PgStat_Counter m_buf_written_checkpoints;
PgStat_Counter m_buf_written_clean;
PgStat_Counter m_maxwritten_clean;
- PgStat_Counter m_buf_written_backend;
- PgStat_Counter m_buf_fsync_backend;
- PgStat_Counter m_buf_alloc;
PgStat_Counter m_checkpoint_write_time; /* times in milliseconds */
PgStat_Counter m_checkpoint_sync_time;
} PgStat_MsgBgWriter;
+typedef struct PgStat_MsgBufferActions
+{
+ PgStat_MsgHdr m_hdr;
+
+ BackendType backend_type;
+ uint64 allocs;
+ uint64 extends;
+ uint64 fsyncs;
+ uint64 writes;
+ uint64 writes_strat;
+} PgStat_MsgBufferActions;
+
/* ----------
* PgStat_MsgWal Sent by backends and background processes to update WAL statistics.
* ----------
@@ -700,6 +710,7 @@ typedef union PgStat_Msg
PgStat_MsgAnlAncestors msg_anl_ancestors;
PgStat_MsgArchiver msg_archiver;
PgStat_MsgBgWriter msg_bgwriter;
+ PgStat_MsgBufferActions msg_buffer_actions;
PgStat_MsgWal msg_wal;
PgStat_MsgSLRU msg_slru;
PgStat_MsgFuncstat msg_funcstat;
@@ -854,9 +865,7 @@ typedef struct PgStat_GlobalStats
PgStat_Counter buf_written_checkpoints;
PgStat_Counter buf_written_clean;
PgStat_Counter maxwritten_clean;
- PgStat_Counter buf_written_backend;
- PgStat_Counter buf_fsync_backend;
- PgStat_Counter buf_alloc;
+ PgStat_MsgBufferActions buffer_actions[BACKEND_NUM_TYPES];
TimestampTz stat_reset_timestamp;
} PgStat_GlobalStats;
@@ -941,6 +950,8 @@ extern char *pgstat_stat_filename;
*/
extern PgStat_MsgBgWriter BgWriterStats;
+extern PgStat_MsgBufferActions BufferActionsStats;
+
/*
* WAL statistics counter is updated by backends and background processes
*/
@@ -1091,6 +1102,9 @@ extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
extern void pgstat_send_archiver(const char *xlog, bool failed);
extern void pgstat_send_bgwriter(void);
+extern void pgstat_send_buffer_actions(void);
+
+extern PgStat_MsgBufferActions * pgstat_get_buffer_action_stats(BackendType backend_type);
extern void pgstat_send_wal(bool force);
/* ----------
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 33fcaf5c9a..7e385135db 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -310,10 +310,10 @@ extern void ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *
/* freelist.c */
extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
- uint32 *buf_state);
+ uint32 *buf_state, bool *from_ring);
extern void StrategyFreeBuffer(BufferDesc *buf);
extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
- BufferDesc *buf);
+ BufferDesc *buf, bool *from_ring);
extern int StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
extern void StrategyNotifyBgWriter(int bgwprocno);
diff --git a/src/include/utils/backend_status.h b/src/include/utils/backend_status.h
index 8042b817df..0aeac79184 100644
--- a/src/include/utils/backend_status.h
+++ b/src/include/utils/backend_status.h
@@ -13,6 +13,7 @@
#include "datatype/timestamp.h"
#include "libpq/pqcomm.h"
#include "miscadmin.h" /* for BackendType */
+#include "port/atomics.h"
#include "utils/backend_progress.h"
@@ -79,6 +80,15 @@ typedef struct PgBackendGSSStatus
} PgBackendGSSStatus;
+typedef struct PgBackendBufferActionStats
+{
+ pg_atomic_uint64 allocs;
+ pg_atomic_uint64 extends;
+ pg_atomic_uint64 fsyncs;
+ pg_atomic_uint64 writes;
+ pg_atomic_uint64 writes_strat;
+} PgBackendBufferActionStats;
+
/* ----------
* PgBackendStatus
@@ -168,6 +178,7 @@ typedef struct PgBackendStatus
/* query identifier, optionally computed using post_parse_analyze_hook */
uint64 st_query_id;
+ PgBackendBufferActionStats buffer_action_stats;
} PgBackendStatus;
@@ -282,7 +293,7 @@ extern PGDLLIMPORT PgBackendStatus *MyBEEntry;
*/
extern Size BackendStatusShmemSize(void);
extern void CreateSharedBackendStatus(void);
-
+extern void CreateBufferActionStatsCounters(void);
/* ----------
* Functions called from backends
@@ -305,7 +316,9 @@ extern const char *pgstat_get_backend_current_activity(int pid, bool checkUser);
extern const char *pgstat_get_crashed_backend_activity(int pid, char *buffer,
int buflen);
extern uint64 pgstat_get_my_query_id(void);
+extern void pgstat_increment_buffer_action(BufferActionType ba_type);
+extern volatile PgBackendStatus *pgstat_access_backend_status_array(void);
/* ----------
* Support functions for the SQL-callable functions to
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index e5ab11275d..609ccf3b7b 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1824,10 +1824,14 @@ pg_stat_bgwriter| SELECT pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints
pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint,
pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean,
pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
- pg_stat_get_buf_written_backend() AS buffers_backend,
- pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
- pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+pg_stat_buffer_actions| SELECT b.backend_type,
+ b.buffers_alloc,
+ b.buffers_extend,
+ b.buffers_fsync,
+ b.buffers_write,
+ b.buffers_write_strat
+ FROM pg_stat_get_buffer_actions() b(backend_type, buffers_alloc, buffers_extend, buffers_fsync, buffers_write, buffers_write_strat);
pg_stat_database| SELECT d.oid AS datid,
d.datname,
CASE
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index feaaee6326..fb4b613d4b 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -176,4 +176,5 @@ FROM prevstats AS pr;
DROP TABLE trunc_stats_test, trunc_stats_test1, trunc_stats_test2, trunc_stats_test3, trunc_stats_test4;
DROP TABLE prevstats;
+SELECT * FROM pg_stat_buffer_actions;
-- End of Stats Test
--
2.27.0
On Wed, Aug 11, 2021 at 4:11 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
On Tue, Aug 3, 2021 at 2:13 PM Andres Freund <andres@anarazel.de> wrote:
@@ -2895,6 +2948,20 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
 	/*
 	 * bufToWrite is either the shared buffer or a copy, as appropriate.
 	 */
+
+	/*
+	 * TODO: consider that if we did not need to distinguish between a buffer
+	 * flushed that was grabbed from the ring buffer and written out as part
+	 * of a strategy which was not from main Shared Buffers (and thus
+	 * preventable by bgwriter or checkpointer), then we could move all calls
+	 * to pgstat_increment_buffer_action() here except for the one for
+	 * extends, which would remain in ReadBuffer_common() before smgrextend()
+	 * (unless we decide to start counting other extends). That includes the
+	 * call to count buffers written by bgwriter and checkpointer which go
+	 * through FlushBuffer() but not BufferAlloc(). That would make it
+	 * simpler. Perhaps instead we can find somewhere else to indicate that
+	 * the buffer is from the ring of buffers to reuse.
+	 */
 	smgrwrite(reln, buf->tag.forkNum, buf->tag.blockNum,

Can we just add a parameter to FlushBuffer indicating what the source of the
write is?

I just noticed this comment now, so I'll address that in the next
version. I rebased today and noticed merge conflicts, so it looks like
v5 will be on its way soon anyway.
Actually, after moving the code around like you suggested, calling
pgstat_increment_buffer_action() before smgrwrite() in FlushBuffer() and
using a parameter to indicate if it is a strategy write or not would
only save us one other call to pgstat_increment_buffer_action() -- the
one in SyncOneBuffer(). We would end up moving the one in BufferAlloc()
to FlushBuffer() and removing the one in SyncOneBuffer().
Do you think it is still worth it?
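For context, the shape being discussed can be sketched roughly like this (toy counters and a made-up function name standing in for FlushBuffer, not the actual signature):

```c
#include <stdint.h>

/*
 * Hypothetical sketch: thread the write's source into the flush path so the
 * counter bump can sit right next to the (stand-in for the) smgrwrite()
 * call, instead of being inferred later in BufferAlloc()/SyncOneBuffer().
 */
typedef enum WriteSource
{
	WS_REGULAR,					/* victim came from the freelist / clock sweep */
	WS_STRATEGY					/* dirty buffer reused from the strategy ring */
} WriteSource;

static uint64_t writes;
static uint64_t writes_strat;

static void
toy_flush_buffer(WriteSource source)
{
	/* account first, mirroring pgstat_increment_buffer_action() */
	if (source == WS_STRATEGY)
		writes_strat++;
	else
		writes++;
	/* ... the actual write (smgrwrite) would happen here ... */
}
```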
Rebased v5 attached.
Attachments:
v5-0001-Add-system-view-tracking-shared-buffer-actions.patch (text/x-patch)
From c5feb44585e1e073927c8d45230aa2eabc178c7e Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 11 Aug 2021 17:49:08 -0400
Subject: [PATCH v5] Add system view tracking shared buffer actions
Add a system view which tracks
- number of shared buffers the checkpointer and bgwriter write out
- number of shared buffers a regular backend is forced to flush
- number of extends done by a regular backend through shared buffers
- number of buffers flushed by a backend or autovacuum using a
BufferAccessStrategy which, were they not to use this strategy, could
perhaps have been avoided if a clean shared buffer was available
- number of fsyncs done by a backend which could have been done by
checkpointer if sync queue had not been full
- number of buffers allocated by a regular backend or autovacuum worker
for either a new block or an existing block of a relation which is not
currently in a buffer
All of these stats which were in the system view pg_stat_bgwriter have
been removed from that view.
All backends, on exit, will update a shared memory array with the
buffers they wrote or extended.
When the view is queried, the stats of all live backends are added to
the totals in the shared memory array and the sum is returned as the
full total.
Each row of the view is for a particular backend type and each column is
the number of a particular kind of buffer action taken by the various
backends.
TODO:
- Some kind of test?
- Docs change
---
src/backend/catalog/system_views.sql | 13 ++-
src/backend/postmaster/checkpointer.c | 27 +------
src/backend/postmaster/pgstat.c | 43 +++++++++-
src/backend/storage/buffer/bufmgr.c | 30 +++++--
src/backend/storage/buffer/freelist.c | 17 +++-
src/backend/utils/activity/backend_status.c | 57 +++++++++++++
src/backend/utils/adt/pgstatfuncs.c | 90 ++++++++++++++++++---
src/backend/utils/init/miscinit.c | 2 +
src/include/catalog/pg_proc.dat | 22 +++--
src/include/miscadmin.h | 13 +++
src/include/pgstat.h | 24 ++++--
src/include/storage/buf_internals.h | 4 +-
src/include/utils/backend_status.h | 14 ++++
src/test/regress/expected/rules.out | 10 ++-
src/test/regress/sql/stats.sql | 1 +
15 files changed, 293 insertions(+), 74 deletions(-)
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 55f6e3711d..f51e7938fc 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1067,11 +1067,18 @@ CREATE VIEW pg_stat_bgwriter AS
pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint,
pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean,
pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
- pg_stat_get_buf_written_backend() AS buffers_backend,
- pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
- pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+CREATE VIEW pg_stat_buffer_actions AS
+SELECT
+ b.backend_type,
+ b.buffers_alloc,
+ b.buffers_extend,
+ b.buffers_fsync,
+ b.buffers_write,
+ b.buffers_write_strat
+FROM pg_stat_get_buffer_actions() b;
+
CREATE VIEW pg_stat_wal AS
SELECT
w.wal_records,
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index be7366379d..fca78fa4ef 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -90,17 +90,9 @@
* requesting backends since the last checkpoint start. The flags are
* chosen so that OR'ing is the correct way to combine multiple requests.
*
- * num_backend_writes is used to count the number of buffer writes performed
- * by user backend processes. This counter should be wide enough that it
- * can't overflow during a single processing cycle. num_backend_fsync
- * counts the subset of those writes that also had to do their own fsync,
- * because the checkpointer failed to absorb their request.
- *
* The requests array holds fsync requests sent by backends and not yet
* absorbed by the checkpointer.
*
- * Unlike the checkpoint fields, num_backend_writes, num_backend_fsync, and
- * the requests fields are protected by CheckpointerCommLock.
*----------
*/
typedef struct
@@ -124,9 +116,6 @@ typedef struct
ConditionVariable start_cv; /* signaled when ckpt_started advances */
ConditionVariable done_cv; /* signaled when ckpt_done advances */
- uint32 num_backend_writes; /* counts user backend buffer writes */
- uint32 num_backend_fsync; /* counts user backend fsync calls */
-
int num_requests; /* current # of requests */
int max_requests; /* allocated array size */
CheckpointerRequest requests[FLEXIBLE_ARRAY_MEMBER];
@@ -1085,10 +1074,6 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
- /* Count all backend writes regardless of if they fit in the queue */
- if (!AmBackgroundWriterProcess())
- CheckpointerShmem->num_backend_writes++;
-
/*
* If the checkpointer isn't running or the request queue is full, the
* backend will have to perform its own fsync request. But before forcing
@@ -1102,8 +1087,7 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
* Count the subset of writes where backends have to do their own
* fsync
*/
- if (!AmBackgroundWriterProcess())
- CheckpointerShmem->num_backend_fsync++;
+ pgstat_increment_buffer_action(BA_Fsync);
LWLockRelease(CheckpointerCommLock);
return false;
}
@@ -1260,15 +1244,6 @@ AbsorbSyncRequests(void)
LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
- /* Transfer stats counts into pending pgstats message */
- PendingCheckpointerStats.m_buf_written_backend
- += CheckpointerShmem->num_backend_writes;
- PendingCheckpointerStats.m_buf_fsync_backend
- += CheckpointerShmem->num_backend_fsync;
-
- CheckpointerShmem->num_backend_writes = 0;
- CheckpointerShmem->num_backend_fsync = 0;
-
/*
* We try to avoid holding the lock for a long time by copying the request
* array, and processing the requests after releasing the lock.
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 1b54ef74eb..04a6fec18e 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -70,6 +70,7 @@
#include "utils/rel.h"
#include "utils/snapmgr.h"
#include "utils/timestamp.h"
+#include "utils/backend_status.h"
/* ----------
* Timer definitions.
@@ -128,6 +129,7 @@ char *pgstat_stat_tmpname = NULL;
* Stored directly in a stats message structure so they can be sent
* without needing to copy things around. We assume these init to zeroes.
*/
+PgStat_MsgBufferActions BufferActionsStats;
PgStat_MsgBgWriter PendingBgWriterStats;
PgStat_MsgCheckpointer PendingCheckpointerStats;
PgStat_MsgWal WalStats;
@@ -360,6 +362,7 @@ static void pgstat_recv_anl_ancestors(PgStat_MsgAnlAncestors *msg, int len);
static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
static void pgstat_recv_checkpointer(PgStat_MsgCheckpointer *msg, int len);
+static void pgstat_recv_buffer_actions(PgStat_MsgBufferActions *msg, int len);
static void pgstat_recv_wal(PgStat_MsgWal *msg, int len);
static void pgstat_recv_slru(PgStat_MsgSLRU *msg, int len);
static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
@@ -866,6 +869,8 @@ pgstat_report_stat(bool disconnect)
pgstat_assert_is_up();
+
+
/*
* Don't expend a clock check if nothing to do.
*
@@ -971,6 +976,10 @@ pgstat_report_stat(bool disconnect)
/* Now, send function statistics */
pgstat_send_funcstats();
+ // TODO: not really sure if this is the right place to do this
+ if (pgstat_record_dying_backend_buffer_actions())
+ pgstat_send_buffer_actions();
+
/* Send WAL statistics */
pgstat_send_wal(true);
@@ -3137,6 +3146,16 @@ pgstat_send_checkpointer(void)
MemSet(&PendingCheckpointerStats, 0, sizeof(PendingCheckpointerStats));
}
+void
+pgstat_send_buffer_actions(void)
+{
+ pgstat_setheader(&BufferActionsStats.m_hdr, PGSTAT_MTYPE_BUFFER_ACTIONS);
+ pgstat_send(&BufferActionsStats, sizeof(BufferActionsStats));
+
+ // TODO: not needed because backends only call this when exiting?
+ MemSet(&BufferActionsStats, 0, sizeof(BufferActionsStats));
+}
+
/* ----------
* pgstat_send_wal() -
*
@@ -3483,6 +3502,10 @@ PgstatCollectorMain(int argc, char *argv[])
pgstat_recv_checkpointer(&msg.msg_checkpointer, len);
break;
+ case PGSTAT_MTYPE_BUFFER_ACTIONS:
+ pgstat_recv_buffer_actions(&msg.msg_buffer_actions, len);
+ break;
+
case PGSTAT_MTYPE_WAL:
pgstat_recv_wal(&msg.msg_wal, len);
break;
@@ -5465,7 +5488,6 @@ pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
{
globalStats.bgwriter.buf_written_clean += msg->m_buf_written_clean;
globalStats.bgwriter.maxwritten_clean += msg->m_maxwritten_clean;
- globalStats.bgwriter.buf_alloc += msg->m_buf_alloc;
}
/* ----------
@@ -5482,8 +5504,23 @@ pgstat_recv_checkpointer(PgStat_MsgCheckpointer *msg, int len)
globalStats.checkpointer.checkpoint_write_time += msg->m_checkpoint_write_time;
globalStats.checkpointer.checkpoint_sync_time += msg->m_checkpoint_sync_time;
globalStats.checkpointer.buf_written_checkpoints += msg->m_buf_written_checkpoints;
- globalStats.checkpointer.buf_written_backend += msg->m_buf_written_backend;
- globalStats.checkpointer.buf_fsync_backend += msg->m_buf_fsync_backend;
+}
+
+static void
+pgstat_recv_buffer_actions(PgStat_MsgBufferActions *msg, int len)
+{
+ globalStats.buffer_actions[msg->backend_type].backend_type = msg->backend_type;
+ globalStats.buffer_actions[msg->backend_type].allocs += msg->allocs;
+ globalStats.buffer_actions[msg->backend_type].extends += msg->extends;
+ globalStats.buffer_actions[msg->backend_type].fsyncs += msg->fsyncs;
+ globalStats.buffer_actions[msg->backend_type].writes += msg->writes;
+ globalStats.buffer_actions[msg->backend_type].writes_strat += msg->writes_strat;
+}
+
+PgStat_MsgBufferActions *
+pgstat_get_buffer_action_stats(BackendType backend_type)
+{
+ return &globalStats.buffer_actions[backend_type];
}
/* ----------
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 3b485de067..74f88a918e 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -963,6 +963,10 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (isExtend)
{
+ /*
+ * Extends counted here are only those that go through shared buffers
+ */
+ pgstat_increment_buffer_action(BA_Extend);
/* new buffers are zero-filled */
MemSet((char *) bufBlock, 0, BLCKSZ);
/* don't set checksum for all-zero page */
@@ -1163,6 +1167,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
/* Loop here in case we have to try another victim buffer */
for (;;)
{
+ bool from_ring = false;
/*
* Ensure, while the spinlock's not yet held, that there's a free
* refcount entry.
@@ -1173,7 +1178,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* Select a victim buffer. The buffer is returned with its header
* spinlock still held!
*/
- buf = StrategyGetBuffer(strategy, &buf_state);
+ buf = StrategyGetBuffer(strategy, &buf_state, &from_ring);
Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
@@ -1210,6 +1215,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf),
LW_SHARED))
{
+ BufferActionType buffer_action;
/*
* If using a nondefault strategy, and writing the buffer
* would require a WAL flush, let the strategy decide whether
@@ -1227,7 +1233,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
UnlockBufHdr(buf, buf_state);
if (XLogNeedsFlush(lsn) &&
- StrategyRejectBuffer(strategy, buf))
+ StrategyRejectBuffer(strategy, buf, &from_ring))
{
/* Drop lock/pin and loop around for another buffer */
LWLockRelease(BufferDescriptorGetContentLock(buf));
@@ -1236,6 +1242,21 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
}
}
+ /*
+ * When a strategy is in use, if the dirty buffer was selected
+ * from the strategy ring and we did not bother checking the
+ * freelist or doing a clock sweep to look for a clean shared
+ * buffer to use, the write will be counted as a strategy
+ * write. However, if the dirty buffer was obtained from the
+ * freelist or a clock sweep, it is counted as a regular write.
+ * When a strategy is not in use, at this point, the write can
+ * only be a "regular" write of a dirty buffer.
+ */
+
+ buffer_action = from_ring ? BA_Write_Strat : BA_Write;
+ pgstat_increment_buffer_action(buffer_action);
+
+
/* OK, do the I/O */
TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(forkNum, blockNum,
smgr->smgr_rnode.node.spcNode,
@@ -2246,9 +2267,6 @@ BgBufferSync(WritebackContext *wb_context)
*/
strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
- /* Report buffer alloc counts to pgstat */
- PendingBgWriterStats.m_buf_alloc += recent_alloc;
-
/*
* If we're not running the LRU scan, just stop after doing the stats
* stuff. We mark the saved state invalid so that we can recover sanely
@@ -2543,6 +2561,8 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
* Pin it, share-lock it, write it. (FlushBuffer will do nothing if the
* buffer is clean by the time we've locked it.)
*/
+
+ pgstat_increment_buffer_action(BA_Write);
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 6be80476db..20bba546fe 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -19,6 +19,7 @@
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/proc.h"
+#include "utils/backend_status.h"
#define INT_ACCESS_ONCE(var) ((int)(*((volatile int *)&(var))))
@@ -198,7 +199,7 @@ have_free_buffer(void)
* return the buffer with the buffer header spinlock still held.
*/
BufferDesc *
-StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
+StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_ring)
{
BufferDesc *buf;
int bgwprocno;
@@ -213,7 +214,10 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
{
buf = GetBufferFromRing(strategy, buf_state);
if (buf != NULL)
+ {
+ *from_ring = true;
return buf;
+ }
}
/*
@@ -247,6 +251,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
* the rate of buffer consumption. Note that buffers recycled by a
* strategy object are intentionally not counted here.
*/
+ pgstat_increment_buffer_action(BA_Alloc);
pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);
/*
@@ -683,7 +688,7 @@ AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
* if this buffer should be written and re-used.
*/
bool
-StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
+StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool *from_ring)
{
/* We only do this in bulkread mode */
if (strategy->btype != BAS_BULKREAD)
@@ -700,5 +705,13 @@ StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
*/
strategy->buffers[strategy->current] = InvalidBuffer;
+ /*
+ * Since we will not be writing out a dirty buffer from the ring, set
+ * from_ring to false so that we do not count this write as a "strategy
+ * write" and can do proper bookkeeping for pg_stat_buffer_actions.
+ */
+ *from_ring = false;
+
+
return true;
}
diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index 2901f9f5a9..b2dbc4dc28 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -399,6 +399,11 @@ pgstat_bestart(void)
lbeentry.st_progress_command = PROGRESS_COMMAND_INVALID;
lbeentry.st_progress_command_target = InvalidOid;
lbeentry.st_query_id = UINT64CONST(0);
+ pg_atomic_init_u64(&lbeentry.buffer_action_stats.allocs, 0);
+ pg_atomic_init_u64(&lbeentry.buffer_action_stats.extends, 0);
+ pg_atomic_init_u64(&lbeentry.buffer_action_stats.fsyncs, 0);
+ pg_atomic_init_u64(&lbeentry.buffer_action_stats.writes, 0);
+ pg_atomic_init_u64(&lbeentry.buffer_action_stats.writes_strat, 0);
/*
* we don't zero st_progress_param here to save cycles; nobody should
@@ -459,6 +464,7 @@ pgstat_beshutdown_hook(int code, Datum arg)
{
volatile PgBackendStatus *beentry = MyBEEntry;
+
/*
* Clear my status entry, following the protocol of bumping st_changecount
* before and after. We use a volatile pointer here to ensure the
@@ -1042,6 +1048,57 @@ pgstat_get_my_query_id(void)
return MyBEEntry->st_query_id;
}
+void
+pgstat_increment_buffer_action(BufferActionType ba_type)
+{
+ volatile PgBackendStatus *beentry = MyBEEntry;
+
+ if (!beentry || !pgstat_track_activities)
+ return;
+
+ if (ba_type == BA_Alloc)
+ pg_atomic_write_u64(&beentry->buffer_action_stats.allocs, pg_atomic_read_u64(&beentry->buffer_action_stats.allocs) + 1);
+ else if (ba_type == BA_Extend)
+ pg_atomic_write_u64(&beentry->buffer_action_stats.extends, pg_atomic_read_u64(&beentry->buffer_action_stats.extends) + 1);
+ else if (ba_type == BA_Fsync)
+ pg_atomic_write_u64(&beentry->buffer_action_stats.fsyncs, pg_atomic_read_u64(&beentry->buffer_action_stats.fsyncs) + 1);
+ else if (ba_type == BA_Write)
+ pg_atomic_write_u64(&beentry->buffer_action_stats.writes, pg_atomic_read_u64(&beentry->buffer_action_stats.writes) + 1);
+ else if (ba_type == BA_Write_Strat)
+ pg_atomic_write_u64(&beentry->buffer_action_stats.writes_strat, pg_atomic_read_u64(&beentry->buffer_action_stats.writes_strat) + 1);
+
+}
+
+/*
+ * Called for a single backend at the time of death to persist its I/O stats
+ */
+bool
+pgstat_record_dying_backend_buffer_actions(void)
+{
+ volatile PgBackendStatus *beentry = MyBEEntry;
+ if (!beentry)
+ return false;
+
+ // TODO: should I add this or just set it -- seems like it would only happen once -
+ BufferActionsStats.backend_type = beentry->st_backendType;
+ BufferActionsStats.allocs = pg_atomic_read_u64(&beentry->buffer_action_stats.allocs);
+ BufferActionsStats.extends = pg_atomic_read_u64(&beentry->buffer_action_stats.extends);
+ BufferActionsStats.fsyncs = pg_atomic_read_u64(&beentry->buffer_action_stats.fsyncs);
+ BufferActionsStats.writes = pg_atomic_read_u64(&beentry->buffer_action_stats.writes);
+ BufferActionsStats.writes_strat = pg_atomic_read_u64(&beentry->buffer_action_stats.writes_strat);
+ return true;
+}
+
+// TODO: this is clearly no good, but I'm not sure if I have to/want to/can use
+// the below pgstat_fetch_stat_beentry and doing the loop that is in
+// pg_stat_get_buffer_actions() into this file will likely mean having to pass a
+// two-dimensional array as a parameter which is unappealing to me
+volatile PgBackendStatus *
+pgstat_access_backend_status_array(void)
+{
+ return BackendStatusArray;
+}
+
/* ----------
* pgstat_fetch_stat_beentry() -
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index ff5aedc99c..17d3fa942d 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1779,21 +1779,87 @@ pg_stat_get_bgwriter_stat_reset_time(PG_FUNCTION_ARGS)
}
Datum
-pg_stat_get_buf_written_backend(PG_FUNCTION_ARGS)
+pg_stat_get_buffer_actions(PG_FUNCTION_ARGS)
{
- PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_written_backend);
-}
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ TupleDesc tupdesc;
+ Tuplestorestate *tupstore;
+ MemoryContext per_query_ctx;
+ MemoryContext oldcontext;
+ PgStat_MsgBufferActions *buffer_actions;
+ int i;
+ volatile PgBackendStatus *beentry;
+ Datum all_values[BACKEND_NUM_TYPES][BUFFER_ACTION_NUM_TYPES];
+ bool all_nulls[BACKEND_NUM_TYPES][BUFFER_ACTION_NUM_TYPES];
-Datum
-pg_stat_get_buf_fsync_backend(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_fsync_backend);
-}
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+ per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+ oldcontext = MemoryContextSwitchTo(per_query_ctx);
-Datum
-pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
+ tupstore = tuplestore_begin_heap(true, false, work_mem);
+ rsinfo->returnMode = SFRM_Materialize;
+ rsinfo->setResult = tupstore;
+ rsinfo->setDesc = tupdesc;
+
+ MemoryContextSwitchTo(oldcontext);
+ pgstat_fetch_global();
+ for (i = 1; i < BACKEND_NUM_TYPES; i++)
+ {
+ Datum *values = all_values[i];
+ bool *nulls = all_nulls[i];
+
+ MemSet(values, 0, sizeof(Datum[BUFFER_ACTION_NUM_TYPES]));
+ MemSet(nulls, 0, sizeof(Datum[BUFFER_ACTION_NUM_TYPES]));
+
+ values[0] = CStringGetTextDatum(GetBackendTypeDesc(i));
+ /*
+ * Add stats from all exited backends
+ */
+ buffer_actions = pgstat_get_buffer_action_stats(i);
+
+ values[BA_Alloc] += buffer_actions->allocs;
+ values[BA_Extend] += buffer_actions->extends;
+ values[BA_Fsync] += buffer_actions->fsyncs;
+ values[BA_Write] += buffer_actions->writes;
+ values[BA_Write_Strat] += buffer_actions->writes_strat;
+ }
+
+ /*
+ * Loop through all live backends and count their buffer actions
+ */
+
+ beentry = pgstat_access_backend_status_array();
+ for (i = 0; i <= MaxBackends; i++)
+ {
+ Datum *values;
+ beentry++;
+ /* Don't count dead backends. They should already be counted */
+ if (beentry->st_procpid == 0)
+ continue;
+ values = all_values[beentry->st_backendType];
+
+
+ values[BA_Alloc] += pg_atomic_read_u64(&beentry->buffer_action_stats.allocs);
+ values[BA_Extend] += pg_atomic_read_u64(&beentry->buffer_action_stats.extends);
+ values[BA_Fsync] += pg_atomic_read_u64(&beentry->buffer_action_stats.fsyncs);
+ values[BA_Write] += pg_atomic_read_u64(&beentry->buffer_action_stats.writes);
+ values[BA_Write_Strat] += pg_atomic_read_u64(&beentry->buffer_action_stats.writes_strat);
+
+ }
+
+ for (i = 1; i < BACKEND_NUM_TYPES; i++)
+ {
+ Datum *values = all_values[i];
+ bool *nulls = all_nulls[i];
+ tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+ }
+
+ /* clean up and return the tuplestore */
+ tuplestore_donestoring(tupstore);
+
+ return (Datum) 0;
}
/*
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 88801374b5..cbeaa9ab94 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -294,6 +294,8 @@ GetBackendTypeDesc(BackendType backendType)
case B_LOGGER:
backendDesc = "logger";
break;
+ case BACKEND_NUM_TYPES:
+ break;
}
return backendDesc;
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index b603700ed9..1f1a97ba48 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5629,18 +5629,16 @@
proname => 'pg_stat_get_checkpoint_sync_time', provolatile => 's',
proparallel => 'r', prorettype => 'float8', proargtypes => '',
prosrc => 'pg_stat_get_checkpoint_sync_time' },
-{ oid => '2775', descr => 'statistics: number of buffers written by backends',
- proname => 'pg_stat_get_buf_written_backend', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_buf_written_backend' },
-{ oid => '3063',
- descr => 'statistics: number of backend buffer writes that did their own fsync',
- proname => 'pg_stat_get_buf_fsync_backend', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_buf_fsync_backend' },
-{ oid => '2859', descr => 'statistics: number of buffer allocations',
- proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
- prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
+
+
+{ oid => '8459', descr => 'statistics: counts of buffer actions taken by each backend type',
+ proname => 'pg_stat_get_buffer_actions', provolatile => 's', proisstrict => 'f',
+ prorows => '13', proretset => 't',
+ proparallel => 'r', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,int8,int8,int8,int8,int8}',
+ proargmodes => '{o,o,o,o,o,o}',
+ proargnames => '{backend_type,buffers_alloc,buffers_extend,buffers_fsync,buffers_write,buffers_write_strat}',
+ prosrc => 'pg_stat_get_buffer_actions' },
{ oid => '1136', descr => 'statistics: information about WAL activity',
proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 2e2e9a364a..03d5e464a9 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -336,8 +336,21 @@ typedef enum BackendType
B_ARCHIVER,
B_STATS_COLLECTOR,
B_LOGGER,
+ BACKEND_NUM_TYPES,
} BackendType;
+typedef enum BufferActionType
+{
+ BA_Invalid = 0,
+ BA_Alloc,
+ BA_Extend,
+ BA_Fsync,
+ BA_Write,
+ BA_Write_Strat,
+ BUFFER_ACTION_NUM_TYPES,
+} BufferActionType;
+
+
extern BackendType MyBackendType;
extern const char *GetBackendTypeDesc(BackendType backendType);
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 2068a68a5f..1948c0bc04 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -73,6 +73,7 @@ typedef enum StatMsgType
PGSTAT_MTYPE_ARCHIVER,
PGSTAT_MTYPE_BGWRITER,
PGSTAT_MTYPE_CHECKPOINTER,
+ PGSTAT_MTYPE_BUFFER_ACTIONS,
PGSTAT_MTYPE_WAL,
PGSTAT_MTYPE_SLRU,
PGSTAT_MTYPE_FUNCSTAT,
@@ -473,7 +474,6 @@ typedef struct PgStat_MsgBgWriter
PgStat_Counter m_buf_written_clean;
PgStat_Counter m_maxwritten_clean;
- PgStat_Counter m_buf_alloc;
} PgStat_MsgBgWriter;
/* ----------
@@ -487,12 +487,22 @@ typedef struct PgStat_MsgCheckpointer
PgStat_Counter m_timed_checkpoints;
PgStat_Counter m_requested_checkpoints;
PgStat_Counter m_buf_written_checkpoints;
- PgStat_Counter m_buf_written_backend;
- PgStat_Counter m_buf_fsync_backend;
PgStat_Counter m_checkpoint_write_time; /* times in milliseconds */
PgStat_Counter m_checkpoint_sync_time;
} PgStat_MsgCheckpointer;
+typedef struct PgStat_MsgBufferActions
+{
+ PgStat_MsgHdr m_hdr;
+
+ BackendType backend_type;
+ uint64 allocs;
+ uint64 extends;
+ uint64 fsyncs;
+ uint64 writes;
+ uint64 writes_strat;
+} PgStat_MsgBufferActions;
+
/* ----------
* PgStat_MsgWal Sent by backends and background processes to update WAL statistics.
* ----------
@@ -712,6 +722,7 @@ typedef union PgStat_Msg
PgStat_MsgArchiver msg_archiver;
PgStat_MsgBgWriter msg_bgwriter;
PgStat_MsgCheckpointer msg_checkpointer;
+ PgStat_MsgBufferActions msg_buffer_actions;
PgStat_MsgWal msg_wal;
PgStat_MsgSLRU msg_slru;
PgStat_MsgFuncstat msg_funcstat;
@@ -860,7 +871,6 @@ typedef struct PgStat_BgWriterStats
{
PgStat_Counter buf_written_clean;
PgStat_Counter maxwritten_clean;
- PgStat_Counter buf_alloc;
TimestampTz stat_reset_timestamp;
} PgStat_BgWriterStats;
@@ -875,8 +885,6 @@ typedef struct PgStat_CheckpointerStats
PgStat_Counter checkpoint_write_time; /* times in milliseconds */
PgStat_Counter checkpoint_sync_time;
PgStat_Counter buf_written_checkpoints;
- PgStat_Counter buf_written_backend;
- PgStat_Counter buf_fsync_backend;
} PgStat_CheckpointerStats;
/*
@@ -888,6 +896,7 @@ typedef struct PgStat_GlobalStats
PgStat_CheckpointerStats checkpointer;
PgStat_BgWriterStats bgwriter;
+ PgStat_MsgBufferActions buffer_actions[BACKEND_NUM_TYPES];
} PgStat_GlobalStats;
/*
@@ -977,6 +986,7 @@ extern PgStat_MsgBgWriter PendingBgWriterStats;
*/
extern PgStat_MsgCheckpointer PendingCheckpointerStats;
+extern PgStat_MsgBufferActions BufferActionsStats;
/*
* WAL statistics counter is updated by backends and background processes
*/
@@ -1128,6 +1138,8 @@ extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
extern void pgstat_send_archiver(const char *xlog, bool failed);
extern void pgstat_send_bgwriter(void);
extern void pgstat_send_checkpointer(void);
+extern void pgstat_send_buffer_actions(void);
+extern PgStat_MsgBufferActions * pgstat_get_buffer_action_stats(BackendType backend_type);
extern void pgstat_send_wal(bool force);
/* ----------
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 33fcaf5c9a..7e385135db 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -310,10 +310,10 @@ extern void ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *
/* freelist.c */
extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
- uint32 *buf_state);
+ uint32 *buf_state, bool *from_ring);
extern void StrategyFreeBuffer(BufferDesc *buf);
extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
- BufferDesc *buf);
+ BufferDesc *buf, bool *from_ring);
extern int StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
extern void StrategyNotifyBgWriter(int bgwprocno);
diff --git a/src/include/utils/backend_status.h b/src/include/utils/backend_status.h
index 8042b817df..ecb1076ffd 100644
--- a/src/include/utils/backend_status.h
+++ b/src/include/utils/backend_status.h
@@ -13,6 +13,7 @@
#include "datatype/timestamp.h"
#include "libpq/pqcomm.h"
#include "miscadmin.h" /* for BackendType */
+#include "port/atomics.h"
#include "utils/backend_progress.h"
@@ -79,6 +80,14 @@ typedef struct PgBackendGSSStatus
} PgBackendGSSStatus;
+typedef struct PgBackendBufferActionStats
+{
+ pg_atomic_uint64 allocs;
+ pg_atomic_uint64 extends;
+ pg_atomic_uint64 fsyncs;
+ pg_atomic_uint64 writes;
+ pg_atomic_uint64 writes_strat;
+} PgBackendBufferActionStats;
/* ----------
* PgBackendStatus
@@ -168,6 +177,7 @@ typedef struct PgBackendStatus
/* query identifier, optionally computed using post_parse_analyze_hook */
uint64 st_query_id;
+ PgBackendBufferActionStats buffer_action_stats;
} PgBackendStatus;
@@ -306,6 +316,10 @@ extern const char *pgstat_get_crashed_backend_activity(int pid, char *buffer,
int buflen);
extern uint64 pgstat_get_my_query_id(void);
+extern void pgstat_increment_buffer_action(BufferActionType ba_type);
+extern volatile PgBackendStatus *pgstat_access_backend_status_array(void);
+extern bool pgstat_record_dying_backend_buffer_actions(void);
+
/* ----------
* Support functions for the SQL-callable functions to
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index e5ab11275d..609ccf3b7b 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1824,10 +1824,14 @@ pg_stat_bgwriter| SELECT pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints
pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint,
pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean,
pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
- pg_stat_get_buf_written_backend() AS buffers_backend,
- pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
- pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+pg_stat_buffer_actions| SELECT b.backend_type,
+ b.buffers_alloc,
+ b.buffers_extend,
+ b.buffers_fsync,
+ b.buffers_write,
+ b.buffers_write_strat
+ FROM pg_stat_get_buffer_actions() b(backend_type, buffers_alloc, buffers_extend, buffers_fsync, buffers_write, buffers_write_strat);
pg_stat_database| SELECT d.oid AS datid,
d.datname,
CASE
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index feaaee6326..fb4b613d4b 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -176,4 +176,5 @@ FROM prevstats AS pr;
DROP TABLE trunc_stats_test, trunc_stats_test1, trunc_stats_test2, trunc_stats_test3, trunc_stats_test4;
DROP TABLE prevstats;
+SELECT * FROM pg_stat_buffer_actions;
-- End of Stats Test
--
2.27.0
Hi,
On 2021-08-11 16:11:34 -0400, Melanie Plageman wrote:
On Tue, Aug 3, 2021 at 2:13 PM Andres Freund <andres@anarazel.de> wrote:
Also, I'm unsure how writing the buffer action stats out in
pgstat_write_statsfiles() will work, since I think that backends can
update their buffer action stats after we would have already persisted
the data from the BufferActionStatsArray -- causing us to lose those
updates.I was thinking it'd work differently. Whenever a connection ends, it reports
its data up to pgstats.c (otherwise we'd loose those stats). By the time
shutdown happens, they all need to have already have reported their stats - so
we don't need to do anything to get the data to pgstats.c during shutdown
time.When you say "whenever a connection ends", what part of the code are you
referring to specifically?
pgstat_beshutdown_hook()
Also, when you say "shutdown", do you mean a backend shutting down or
all backends shutting down (including postmaster) -- like pg_ctl stop?
Admittedly our language is very imprecise around this :(. What I meant
is that backends would report their own stats up to the stats collector
when the connection ends (in pgstat_beshutdown_hook()). That means that
when the whole server (pgstat and then postmaster, potentially via
pg_ctl stop) shuts down, all the per-connection stats have already been
reported up to pgstat.
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 55f6e3711d..96cac0a74e 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1067,9 +1067,6 @@ CREATE VIEW pg_stat_bgwriter AS
     pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint,
     pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean,
     pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
-    pg_stat_get_buf_written_backend() AS buffers_backend,
-    pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
-    pg_stat_get_buf_alloc() AS buffers_alloc,
     pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
Material for a separate patch, not this. But if we're going to break
monitoring queries anyway, I think we should consider also renaming
maxwritten_clean (and perhaps a few others), because nobody understands what
that is supposed to mean.
Do you mean I shouldn't remove anything from the pg_stat_bgwriter view?
No - I just meant that now that we're breaking pg_stat_bgwriter queries,
we should also rename the columns to be easier to understand. But that
it should be a separate patch / commit...
@@ -411,11 +421,6 @@ StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
 		 */
 		*complete_passes += nextVictimBuffer / NBuffers;
 	}
-
-	if (num_buf_alloc)
-	{
-		*num_buf_alloc = pg_atomic_exchange_u32(&StrategyControl->numBufferAllocs, 0);
-	}
 	SpinLockRelease(&StrategyControl->buffer_strategy_lock);
 	return result;
 }
Hm. Isn't bgwriter using the *num_buf_alloc value to pace its activity? I
suspect this patch shouldn't get rid of numBufferAllocs at the same time as
overhauling the stats stuff. Perhaps we don't need both - but it's not obvious
that that's the case / how we can make that work.
I initially meant to add a function to the patch like
pg_stat_get_buffer_actions() but which took a BufferActionType and
BackendType as parameters and returned a single value which is the
number of buffer actions of that type for that type of backend. Let's say
I defined it like this:
uint64
pg_stat_get_backend_buffer_actions_stats(BackendType backend_type,
                                         BufferActionType ba_type)
Then, I intended to use that in StrategySyncStart() to set num_buf_alloc
by subtracting the value of StrategyControl->numBufferAllocs from the
value returned by pg_stat_get_backend_buffer_actions_stats(B_BG_WRITER,
BA_Alloc), val, then adding that value, val, to
StrategyControl->numBufferAllocs.
I don't think you could restrict this to B_BG_WRITER? The whole point of
this logic is that bgwriter uses the stats for *all* backends to get the
"usage rate" for buffers, which it then uses to control how many buffers
to clean.
I think that would have the same behavior as current, though I'm not
sure if the performance would end up being better or worse. It wouldn't
be atomically incrementing StrategyControl->numBufferAllocs, but it
would do a few additional atomic operations in StrategySyncStart() than
before. Also, we would do all the work done by
pg_stat_get_buffer_actions() in StrategySyncStart().
I think it'd be better to separate changing the bgwriter pacing logic
(and thus numBufferAllocs) from changing the stats reporting.
But that is called comparatively infrequently, right?
Depending on the workload not that rarely. I'm afraid this might be a
bit too expensive. It's possible we can work around that however.
Greetings,
Andres Freund
On Fri, Aug 13, 2021 at 3:08 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2021-08-11 16:11:34 -0400, Melanie Plageman wrote:
On Tue, Aug 3, 2021 at 2:13 PM Andres Freund <andres@anarazel.de> wrote:
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 55f6e3711d..96cac0a74e 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1067,9 +1067,6 @@ CREATE VIEW pg_stat_bgwriter AS
     pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint,
     pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean,
     pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
-    pg_stat_get_buf_written_backend() AS buffers_backend,
-    pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
-    pg_stat_get_buf_alloc() AS buffers_alloc,
     pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
Material for a separate patch, not this. But if we're going to break
monitoring queries anyway, I think we should consider also renaming
maxwritten_clean (and perhaps a few others), because nobody understands what
that is supposed to mean.
Do you mean I shouldn't remove anything from the pg_stat_bgwriter view?
No - I just meant that now that we're breaking pg_stat_bgwriter queries,
we should also rename the columns to be easier to understand. But that
it should be a separate patch / commit...
I separated the removal of some redundant stats from pg_stat_bgwriter
into a different commit but haven't removed or clarified any additional
columns in pg_stat_bgwriter.
@@ -411,11 +421,6 @@ StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
 		 */
 		*complete_passes += nextVictimBuffer / NBuffers;
 	}
-
-	if (num_buf_alloc)
-	{
-		*num_buf_alloc = pg_atomic_exchange_u32(&StrategyControl->numBufferAllocs, 0);
-	}
 	SpinLockRelease(&StrategyControl->buffer_strategy_lock);
 	return result;
 }
Hm. Isn't bgwriter using the *num_buf_alloc value to pace its activity? I
suspect this patch shouldn't get rid of numBufferAllocs at the same time as
overhauling the stats stuff. Perhaps we don't need both - but it's not obvious
that that's the case / how we can make that work.
I initially meant to add a function to the patch like
pg_stat_get_buffer_actions() but which took a BufferActionType and
BackendType as parameters and returned a single value which is the
number of buffer actions of that type for that type of backend. Let's say
I defined it like this:
uint64
pg_stat_get_backend_buffer_actions_stats(BackendType backend_type,
                                         BufferActionType ba_type)
Then, I intended to use that in StrategySyncStart() to set num_buf_alloc
by subtracting the value of StrategyControl->numBufferAllocs from the
value returned by pg_stat_get_backend_buffer_actions_stats(B_BG_WRITER,
BA_Alloc), val, then adding that value, val, to
StrategyControl->numBufferAllocs.
I don't think you could restrict this to B_BG_WRITER? The whole point of
this logic is that bgwriter uses the stats for *all* backends to get the
"usage rate" for buffers, which it then uses to control how many buffers
to clean.
I think that would have the same behavior as current, though I'm not
sure if the performance would end up being better or worse. It wouldn't
be atomically incrementing StrategyControl->numBufferAllocs, but it
would do a few additional atomic operations in StrategySyncStart() than
before. Also, we would do all the work done by
pg_stat_get_buffer_actions() in StrategySyncStart().
I think it'd be better to separate changing the bgwriter pacing logic
(and thus numBufferAllocs) from changing the stats reporting.
But that is called comparatively infrequently, right?
Depending on the workload not that rarely. I'm afraid this might be a
bit too expensive. It's possible we can work around that however.
I've restored StrategyControl->numBufferAllocs.
Attached is v6 of the patchset.
I have made several small updates to the patch, including user docs
updates, comment clarifications, various changes related to how
structures are initialized, code simplifications, and small details like
alphabetizing of #includes.
Below are details on the remaining TODOs and open questions for this
patch and why I haven't done them yet:
1) performance testing (initial tests done, but need to do some further
investigation before sharing)
2) stats_reset
Because the pg_stat_buffer_actions fields were added to the globalStats
structure, they get reset along with the rest of it when stats are reset
with the RESET_BGWRITER target.
Depending on whether or not these commits remove columns from the
pg_stat_bgwriter view, I would approach adding stats_reset to
pg_stat_buffer_actions differently. If removing all of pg_stat_bgwriter,
I would just rename the target to apply to pg_stat_buffer_actions. If
not removing all of pg_stat_bgwriter, I would add a new target for
pg_stat_buffer_actions to reset those stats and then either remove them
from globalStats or MemSet() only the relevant parts of the struct in
pgstat_recv_resetsharedcounter().
I haven't done this yet because I want to get input on what should
happen to pg_stat_bgwriter first (all of it goes, all of it stays, some
goes, etc).
3) what to count
Currently, the patch counts allocs, extends, fsyncs and writes of shared
buffers and writes done when using a buffer access strategy. So, it is a
mix of mostly shared buffers and a few non-shared buffers. I am
wondering if it makes sense to also count extends with smgrextend()
other than those using shared buffers--for example when building an
index or when extending the free space map or visibility map. For
fsyncs, the patch does not count checkpointer fsyncs or fsyncs done from
XLogWrite().
On a related note, depending on what the view counts, the name
buffer_actions may or may not be too general.
I also feel like the BackendType B_BACKEND is a bit confusing when we
are tracking buffer actions for different backend types -- this name
makes it seem like other types of backends are not backends.
I'm not sure what the view should track and can see arguments for
excluding certain extends or separating them into another stat. I
haven't made the changes because I am looking for other peoples'
opinions.
4) Adding some sort of protection against regressions when new code
performs additional buffer actions but doesn't count them -- more likely
if we are counting all users of smgrextend() but not doing the counter
incrementing there.
I'm not sure how I would even do this, so, that's why I haven't done it.
5) It seems like the code to create a tuplestore used by various stats
functions like pg_stat_get_progress_info(), pg_stat_get_activity, and
pg_stat_get_slru could be refactored into a helper function since it is
quite redundant (maybe returning a ReturnSetInfo).
I haven't done this because I wasn't sure if it was a good idea, and, if
it is, if I should do it in a separate commit.
6) Cleaning up of commit message, running pgindent, and, eventually,
catalog bump (waiting until the patch is done to do this).
7) Additional testing to ensure all codepaths added are hit (one-off
testing, not added to regression test suite). I am waiting to do this
until all of the types of buffer actions that will be done are
finalized.
- Melanie
Attachments:
v6-0002-Remove-superfluous-bgwriter-stats.patch (application/octet-stream)
From 2887f6949fd4671e75062045f5d3e09621a273cd Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Thu, 2 Sep 2021 11:47:41 -0400
Subject: [PATCH v6 2/2] Remove superfluous bgwriter stats
Remove stats from pg_stat_bgwriter which are now more clearly expressed
in pg_stat_buffer_actions.
---
doc/src/sgml/monitoring.sgml | 29 ---------------------------
src/backend/catalog/system_views.sql | 3 ---
src/backend/postmaster/checkpointer.c | 26 ------------------------
src/backend/postmaster/pgstat.c | 3 ---
src/backend/storage/buffer/bufmgr.c | 3 ---
src/backend/utils/adt/pgstatfuncs.c | 18 -----------------
src/include/catalog/pg_proc.dat | 12 -----------
src/include/pgstat.h | 6 ------
src/test/regress/expected/rules.out | 3 ---
9 files changed, 103 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index edd19368be..087d3993c1 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3444,35 +3444,6 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
</para></entry>
</row>
- <row>
- <entry role="catalog_table_entry"><para role="column_definition">
- <structfield>buffers_backend</structfield> <type>bigint</type>
- </para>
- <para>
- Number of buffers written directly by a backend
- </para></entry>
- </row>
-
- <row>
- <entry role="catalog_table_entry"><para role="column_definition">
- <structfield>buffers_backend_fsync</structfield> <type>bigint</type>
- </para>
- <para>
- Number of times a backend had to execute its own
- <function>fsync</function> call (normally the background writer handles those
- even when the backend does its own write)
- </para></entry>
- </row>
-
- <row>
- <entry role="catalog_table_entry"><para role="column_definition">
- <structfield>buffers_alloc</structfield> <type>bigint</type>
- </para>
- <para>
- Number of buffers allocated
- </para></entry>
- </row>
-
<row>
<entry role="catalog_table_entry"><para role="column_definition">
<structfield>stats_reset</structfield> <type>timestamp with time zone</type>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 7ba54d1119..f51e7938fc 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1067,9 +1067,6 @@ CREATE VIEW pg_stat_bgwriter AS
pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint,
pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean,
pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
- pg_stat_get_buf_written_backend() AS buffers_backend,
- pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
- pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
CREATE VIEW pg_stat_buffer_actions AS
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 23f2ffccd9..fca78fa4ef 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -90,17 +90,9 @@
* requesting backends since the last checkpoint start. The flags are
* chosen so that OR'ing is the correct way to combine multiple requests.
*
- * num_backend_writes is used to count the number of buffer writes performed
- * by user backend processes. This counter should be wide enough that it
- * can't overflow during a single processing cycle. num_backend_fsync
- * counts the subset of those writes that also had to do their own fsync,
- * because the checkpointer failed to absorb their request.
- *
* The requests array holds fsync requests sent by backends and not yet
* absorbed by the checkpointer.
*
- * Unlike the checkpoint fields, num_backend_writes, num_backend_fsync, and
- * the requests fields are protected by CheckpointerCommLock.
*----------
*/
typedef struct
@@ -124,9 +116,6 @@ typedef struct
ConditionVariable start_cv; /* signaled when ckpt_started advances */
ConditionVariable done_cv; /* signaled when ckpt_done advances */
- uint32 num_backend_writes; /* counts user backend buffer writes */
- uint32 num_backend_fsync; /* counts user backend fsync calls */
-
int num_requests; /* current # of requests */
int max_requests; /* allocated array size */
CheckpointerRequest requests[FLEXIBLE_ARRAY_MEMBER];
@@ -1085,10 +1074,6 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
- /* Count all backend writes regardless of if they fit in the queue */
- if (!AmBackgroundWriterProcess())
- CheckpointerShmem->num_backend_writes++;
-
/*
* If the checkpointer isn't running or the request queue is full, the
* backend will have to perform its own fsync request. But before forcing
@@ -1102,8 +1087,6 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
* Count the subset of writes where backends have to do their own
* fsync
*/
- if (!AmBackgroundWriterProcess())
- CheckpointerShmem->num_backend_fsync++;
pgstat_increment_buffer_action(BA_Fsync);
LWLockRelease(CheckpointerCommLock);
return false;
@@ -1261,15 +1244,6 @@ AbsorbSyncRequests(void)
LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
- /* Transfer stats counts into pending pgstats message */
- PendingCheckpointerStats.m_buf_written_backend
- += CheckpointerShmem->num_backend_writes;
- PendingCheckpointerStats.m_buf_fsync_backend
- += CheckpointerShmem->num_backend_fsync;
-
- CheckpointerShmem->num_backend_writes = 0;
- CheckpointerShmem->num_backend_fsync = 0;
-
/*
* We try to avoid holding the lock for a long time by copying the request
* array, and processing the requests after releasing the lock.
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 244e5a7e44..74e1d32766 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -5426,7 +5426,6 @@ pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
{
globalStats.bgwriter.buf_written_clean += msg->m_buf_written_clean;
globalStats.bgwriter.maxwritten_clean += msg->m_maxwritten_clean;
- globalStats.bgwriter.buf_alloc += msg->m_buf_alloc;
}
/* ----------
@@ -5443,8 +5442,6 @@ pgstat_recv_checkpointer(PgStat_MsgCheckpointer *msg, int len)
globalStats.checkpointer.checkpoint_write_time += msg->m_checkpoint_write_time;
globalStats.checkpointer.checkpoint_sync_time += msg->m_checkpoint_sync_time;
globalStats.checkpointer.buf_written_checkpoints += msg->m_buf_written_checkpoints;
- globalStats.checkpointer.buf_written_backend += msg->m_buf_written_backend;
- globalStats.checkpointer.buf_fsync_backend += msg->m_buf_fsync_backend;
}
static void
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index ef83c576b0..b1bd528856 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2267,9 +2267,6 @@ BgBufferSync(WritebackContext *wb_context)
*/
strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
- /* Report buffer alloc counts to pgstat */
- PendingBgWriterStats.m_buf_alloc += recent_alloc;
-
/*
* If we're not running the LRU scan, just stop after doing the stats
* stuff. We mark the saved state invalid so that we can recover sanely
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 278257de80..2d8ab68a50 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1778,24 +1778,6 @@ pg_stat_get_bgwriter_stat_reset_time(PG_FUNCTION_ARGS)
PG_RETURN_TIMESTAMPTZ(pgstat_fetch_stat_bgwriter()->stat_reset_timestamp);
}
-Datum
-pg_stat_get_buf_written_backend(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_written_backend);
-}
-
-Datum
-pg_stat_get_buf_fsync_backend(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_fsync_backend);
-}
-
-Datum
-pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
-}
-
Datum
pg_stat_get_buffer_actions(PG_FUNCTION_ARGS)
{
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index ee3c11db06..26aa700ff8 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5629,18 +5629,6 @@
proname => 'pg_stat_get_checkpoint_sync_time', provolatile => 's',
proparallel => 'r', prorettype => 'float8', proargtypes => '',
prosrc => 'pg_stat_get_checkpoint_sync_time' },
-{ oid => '2775', descr => 'statistics: number of buffers written by backends',
- proname => 'pg_stat_get_buf_written_backend', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_buf_written_backend' },
-{ oid => '3063',
- descr => 'statistics: number of backend buffer writes that did their own fsync',
- proname => 'pg_stat_get_buf_fsync_backend', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_buf_fsync_backend' },
-{ oid => '2859', descr => 'statistics: number of buffer allocations',
- proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
- prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
{ oid => '8459', descr => 'statistics: counts of buffer actions taken by each backend type',
proname => 'pg_stat_get_buffer_actions', provolatile => 's', proisstrict => 'f',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 57c642aeca..22beef81ff 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -454,7 +454,6 @@ typedef struct PgStat_MsgBgWriter
PgStat_Counter m_buf_written_clean;
PgStat_Counter m_maxwritten_clean;
- PgStat_Counter m_buf_alloc;
} PgStat_MsgBgWriter;
/* ----------
@@ -468,8 +467,6 @@ typedef struct PgStat_MsgCheckpointer
PgStat_Counter m_timed_checkpoints;
PgStat_Counter m_requested_checkpoints;
PgStat_Counter m_buf_written_checkpoints;
- PgStat_Counter m_buf_written_backend;
- PgStat_Counter m_buf_fsync_backend;
PgStat_Counter m_checkpoint_write_time; /* times in milliseconds */
PgStat_Counter m_checkpoint_sync_time;
} PgStat_MsgCheckpointer;
@@ -852,7 +849,6 @@ typedef struct PgStat_BgWriterStats
{
PgStat_Counter buf_written_clean;
PgStat_Counter maxwritten_clean;
- PgStat_Counter buf_alloc;
TimestampTz stat_reset_timestamp;
} PgStat_BgWriterStats;
@@ -867,8 +863,6 @@ typedef struct PgStat_CheckpointerStats
PgStat_Counter checkpoint_write_time; /* times in milliseconds */
PgStat_Counter checkpoint_sync_time;
PgStat_Counter buf_written_checkpoints;
- PgStat_Counter buf_written_backend;
- PgStat_Counter buf_fsync_backend;
} PgStat_CheckpointerStats;
/*
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 5c5445bcd7..667f4444b3 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1824,9 +1824,6 @@ pg_stat_bgwriter| SELECT pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints
pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint,
pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean,
pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
- pg_stat_get_buf_written_backend() AS buffers_backend,
- pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
- pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
pg_stat_buffer_actions| SELECT b.backend_type,
b.buffers_alloc,
--
2.32.0
v6-0001-Add-system-view-tracking-shared-buffer-actions.patch (application/octet-stream)
From 3c7ba68d18c43bc0d0c6d0873ab477bec263c1dd Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Thu, 2 Sep 2021 11:33:59 -0400
Subject: [PATCH v6 1/2] Add system view tracking shared buffer actions
Add a system view which tracks
- number of shared buffers the checkpointer and bgwriter write out
- number of shared buffers a regular backend is forced to flush
- number of extends done by a regular backend through shared buffers
- number of buffers flushed by a backend or autovacuum using a
BufferAccessStrategy which, were they not to use this strategy, could
perhaps have been avoided if a clean shared buffer was available
- number of fsyncs done by a backend which could have been done by
checkpointer if sync queue had not been full
- number of buffers allocated by a regular backend or autovacuum worker
for either a new block or an existing block of a relation which is not
currently in a buffer
All backends increment a counter in their PgBackendStatus when
performing one of these buffer actions. On exit, backends send these
stats to the stats collector to be persisted.
When pg_stat_buffer_actions view is queried, add all live backend's
statuses to the saved stats kept by the stats collector (since the last
stats reset) and return that as the total.
Each row of the view is for a particular backend type and each column is
the number of a particular kind of buffer action taken by the various
backends.
TODO:
- Some kind of test to protect against regressions in counting these
(and remove unstable pg_stats test)
- stats reset refactor
- when finished, catalog bump
- pgindent
---
doc/src/sgml/monitoring.sgml | 94 +++++++++++++++++++++
src/backend/catalog/system_views.sql | 10 +++
src/backend/postmaster/checkpointer.c | 1 +
src/backend/postmaster/pgstat.c | 71 +++++++++++++++-
src/backend/storage/buffer/bufmgr.c | 27 +++++-
src/backend/storage/buffer/freelist.c | 22 ++++-
src/backend/utils/activity/backend_status.c | 49 +++++++++++
src/backend/utils/adt/pgstatfuncs.c | 93 ++++++++++++++++++++
src/backend/utils/init/miscinit.c | 2 +
src/include/catalog/pg_proc.dat | 9 ++
src/include/miscadmin.h | 13 +++
src/include/pgstat.h | 18 ++++
src/include/storage/buf_internals.h | 4 +-
src/include/utils/backend_status.h | 13 +++
src/test/regress/expected/rules.out | 7 ++
src/test/regress/sql/stats.sql | 1 +
16 files changed, 425 insertions(+), 9 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 2281ba120f..edd19368be 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -444,6 +444,15 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</entry>
</row>
+ <row>
+ <entry><structname>pg_stat_buffer_actions</structname><indexterm><primary>pg_stat_buffer_actions</primary></indexterm></entry>
+ <entry>One row for each backend type showing statistics about
+ backend buffer activity. See
+ <link linkend="monitoring-pg-stat-buffer-actions-view">
+ <structname>pg_stat_buffer_actions</structname></link> for details.
+ </entry>
+ </row>
+
<row>
<entry><structname>pg_stat_wal</structname><indexterm><primary>pg_stat_wal</primary></indexterm></entry>
<entry>One row only, showing statistics about WAL activity. See
@@ -3478,6 +3487,91 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
</sect2>
+ <sect2 id="monitoring-pg-stat-buffer-actions-view">
+ <title><structname>pg_stat_buffer_actions</structname></title>
+
+ <indexterm>
+ <primary>pg_stat_buffer_actions</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_stat_buffer_actions</structname> view has a row for each
+ backend type, containing global data for the cluster for that backend type.
+ </para>
+
+ <table id="pg-stat-buffer-actions-view" xreflabel="pg_stat_buffer_actions">
+ <title><structname>pg_stat_buffer_actions</structname> View</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>backend_type</structfield> <type>text</type>
+ </para>
+ <para>
+ Type of backend (e.g. background worker, autovacuum worker).
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>buffers_alloc</structfield> <type>integer</type>
+ </para>
+ <para>
+ Number of buffers allocated.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>buffers_extend</structfield> <type>integer</type>
+ </para>
+ <para>
+ Number of buffers extended.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>buffers_fsync</structfield> <type>integer</type>
+ </para>
+ <para>
+ Number of buffers fsynced. TODO: is this only shared buffers?
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>buffers_write</structfield> <type>integer</type>
+ </para>
+ <para>
+ Number of buffers written.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>buffers_write_strat</structfield> <type>integer</type>
+ </para>
+ <para>
+ Number of buffers written as part of a buffer access strategy.
+ </para></entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ </sect2>
+
<sect2 id="monitoring-pg-stat-wal-view">
<title><structname>pg_stat_wal</structname></title>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 55f6e3711d..7ba54d1119 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1072,6 +1072,16 @@ CREATE VIEW pg_stat_bgwriter AS
pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+CREATE VIEW pg_stat_buffer_actions AS
+SELECT
+ b.backend_type,
+ b.buffers_alloc,
+ b.buffers_extend,
+ b.buffers_fsync,
+ b.buffers_write,
+ b.buffers_write_strat
+FROM pg_stat_get_buffer_actions() b;
+
CREATE VIEW pg_stat_wal AS
SELECT
w.wal_records,
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index be7366379d..23f2ffccd9 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -1104,6 +1104,7 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
*/
if (!AmBackgroundWriterProcess())
CheckpointerShmem->num_backend_fsync++;
+ pgstat_increment_buffer_action(BA_Fsync);
LWLockRelease(CheckpointerCommLock);
return false;
}
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 3450a10129..244e5a7e44 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -63,6 +63,7 @@
#include "storage/pg_shmem.h"
#include "storage/proc.h"
#include "storage/procsignal.h"
+#include "utils/backend_status.h"
#include "utils/builtins.h"
#include "utils/guc.h"
#include "utils/memutils.h"
@@ -124,12 +125,16 @@ char *pgstat_stat_filename = NULL;
char *pgstat_stat_tmpname = NULL;
/*
- * BgWriter and WAL global statistics counters.
- * Stored directly in a stats message structure so they can be sent
- * without needing to copy things around. We assume these init to zeroes.
+ * BgWriter, Checkpointer, WAL, and I/O global statistics counters. I/O global
+ * statistics on various buffer actions are tracked in PgBackendStatus while a
+ * backend is alive and then sent to stats collector before a backend exits in
+ * a PgStat_MsgBufferActions.
+ * All others are stored directly in a stats message structure so they can be
+ * sent without needing to copy things around. We assume these init to zeroes.
*/
PgStat_MsgBgWriter PendingBgWriterStats;
PgStat_MsgCheckpointer PendingCheckpointerStats;
+PgStat_MsgBufferActions BufferActionsStats;
PgStat_MsgWal WalStats;
/*
@@ -359,6 +364,7 @@ static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
static void pgstat_recv_checkpointer(PgStat_MsgCheckpointer *msg, int len);
+static void pgstat_recv_buffer_actions(PgStat_MsgBufferActions *msg, int len);
static void pgstat_recv_wal(PgStat_MsgWal *msg, int len);
static void pgstat_recv_slru(PgStat_MsgSLRU *msg, int len);
static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
@@ -970,6 +976,8 @@ pgstat_report_stat(bool disconnect)
/* Now, send function statistics */
pgstat_send_funcstats();
+ pgstat_send_buffer_actions();
+
/* Send WAL statistics */
pgstat_send_wal(true);
@@ -3085,6 +3093,35 @@ pgstat_send_checkpointer(void)
MemSet(&PendingCheckpointerStats, 0, sizeof(PendingCheckpointerStats));
}
+/*
+ * Called for a single backend at the time of death to send its I/O stats to
+ * the stats collector so that they may be persisted.
+ */
+void
+pgstat_send_buffer_actions(void)
+{
+ volatile PgBackendStatus *beentry = MyBEEntry;
+ if (!beentry)
+ return;
+
+ BufferActionsStats = (PgStat_MsgBufferActions) {
+ .backend_type = beentry->st_backendType,
+ .allocs = pg_atomic_read_u64(&beentry->buffer_action_stats.allocs),
+ .extends = pg_atomic_read_u64(&beentry->buffer_action_stats.extends),
+ .fsyncs = pg_atomic_read_u64(&beentry->buffer_action_stats.fsyncs),
+ .writes = pg_atomic_read_u64(&beentry->buffer_action_stats.writes),
+ .writes_strat = pg_atomic_read_u64(&beentry->buffer_action_stats.writes_strat)
+ };
+
+ pgstat_setheader(&BufferActionsStats.m_hdr, PGSTAT_MTYPE_BUFFER_ACTIONS);
+ pgstat_send(&BufferActionsStats, sizeof(BufferActionsStats));
+
+ /*
+ * pgstat_send_buffer_actions() is only called before a backend exits, so
+ * BufferActionsStats should not be reused.
+ */
+}
+
/* ----------
* pgstat_send_wal() -
*
@@ -3427,6 +3464,10 @@ PgstatCollectorMain(int argc, char *argv[])
pgstat_recv_checkpointer(&msg.msg_checkpointer, len);
break;
+ case PGSTAT_MTYPE_BUFFER_ACTIONS:
+ pgstat_recv_buffer_actions(&msg.msg_buffer_actions, len);
+ break;
+
case PGSTAT_MTYPE_WAL:
pgstat_recv_wal(&msg.msg_wal, len);
break;
@@ -5406,6 +5447,30 @@ pgstat_recv_checkpointer(PgStat_MsgCheckpointer *msg, int len)
globalStats.checkpointer.buf_fsync_backend += msg->m_buf_fsync_backend;
}
+static void
+pgstat_recv_buffer_actions(PgStat_MsgBufferActions *msg, int len)
+{
+ /*
+ * Most readers will not need PgStat_MsgBufferActions->backend_type when
+ * accessing it from globalStats, since its index in the
+ * globalStats.buffer_actions array already identifies the backend_type.
+ * However, leaving it undefined seemed like an invitation for future
+ * bugs.
+ */
+ globalStats.buffer_actions[msg->backend_type].backend_type = msg->backend_type;
+ globalStats.buffer_actions[msg->backend_type].allocs += msg->allocs;
+ globalStats.buffer_actions[msg->backend_type].extends += msg->extends;
+ globalStats.buffer_actions[msg->backend_type].fsyncs += msg->fsyncs;
+ globalStats.buffer_actions[msg->backend_type].writes += msg->writes;
+ globalStats.buffer_actions[msg->backend_type].writes_strat += msg->writes_strat;
+}
+
+const PgStat_MsgBufferActions *
+pgstat_get_buffer_action_stats(BackendType backend_type)
+{
+ return &globalStats.buffer_actions[backend_type];
+}
+
/* ----------
* pgstat_recv_wal() -
*
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index bc1753ae91..ef83c576b0 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -963,6 +963,10 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (isExtend)
{
+ /*
+ * Extends counted here are only those that go through shared buffers
+ */
+ pgstat_increment_buffer_action(BA_Extend);
/* new buffers are zero-filled */
MemSet((char *) bufBlock, 0, BLCKSZ);
/* don't set checksum for all-zero page */
@@ -1163,6 +1167,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
/* Loop here in case we have to try another victim buffer */
for (;;)
{
+ bool from_ring;
/*
* Ensure, while the spinlock's not yet held, that there's a free
* refcount entry.
@@ -1173,7 +1178,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* Select a victim buffer. The buffer is returned with its header
* spinlock still held!
*/
- buf = StrategyGetBuffer(strategy, &buf_state);
+ buf = StrategyGetBuffer(strategy, &buf_state, &from_ring);
Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
@@ -1210,6 +1215,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf),
LW_SHARED))
{
+ BufferActionType buffer_action;
/*
* If using a nondefault strategy, and writing the buffer
* would require a WAL flush, let the strategy decide whether
@@ -1227,7 +1233,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
UnlockBufHdr(buf, buf_state);
if (XLogNeedsFlush(lsn) &&
- StrategyRejectBuffer(strategy, buf))
+ StrategyRejectBuffer(strategy, buf, &from_ring))
{
/* Drop lock/pin and loop around for another buffer */
LWLockRelease(BufferDescriptorGetContentLock(buf));
@@ -1236,6 +1242,21 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
}
}
+ /*
+ * When a strategy is in use, if the dirty buffer was selected
+ * from the strategy ring and we did not bother checking the
+ * freelist or doing a clock sweep to look for a clean shared
+ * buffer to use, the write will be counted as a strategy
+ * write. However, if the dirty buffer was obtained from the
+ * freelist or a clock sweep, it is counted as a regular write.
+ * When a strategy is not in use, at this point, the write can
+ * only be a "regular" write of a dirty buffer.
+ */
+
+ buffer_action = from_ring ? BA_Write_Strat : BA_Write;
+ pgstat_increment_buffer_action(buffer_action);
+
/* OK, do the I/O */
TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(forkNum, blockNum,
smgr->smgr_rnode.node.spcNode,
@@ -2543,6 +2564,8 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
* Pin it, share-lock it, write it. (FlushBuffer will do nothing if the
* buffer is clean by the time we've locked it.)
*/
+
+ pgstat_increment_buffer_action(BA_Write);
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 6be80476db..17b76e9c2c 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -19,6 +19,7 @@
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/proc.h"
+#include "utils/backend_status.h"
#define INT_ACCESS_ONCE(var) ((int)(*((volatile int *)&(var))))
@@ -198,7 +199,7 @@ have_free_buffer(void)
* return the buffer with the buffer header spinlock still held.
*/
BufferDesc *
-StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
+StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_ring)
{
BufferDesc *buf;
int bgwprocno;
@@ -212,6 +213,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
if (strategy != NULL)
{
buf = GetBufferFromRing(strategy, buf_state);
+ *from_ring = (buf != NULL);
if (buf != NULL)
return buf;
}
@@ -247,6 +249,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
* the rate of buffer consumption. Note that buffers recycled by a
* strategy object are intentionally not counted here.
*/
+ pgstat_increment_buffer_action(BA_Alloc);
pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);
/*
@@ -683,8 +686,14 @@ AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
* if this buffer should be written and re-used.
*/
bool
-StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
+StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool *from_ring)
{
+ /*
+ * If we decide to use the dirty buffer selected by StrategyGetBuffer, then
+ * ensure that we count it as such in pg_stat_buffer_actions view.
+ */
+ *from_ring = true;
+
/* We only do this in bulkread mode */
if (strategy->btype != BAS_BULKREAD)
return false;
@@ -700,5 +709,14 @@ StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
*/
strategy->buffers[strategy->current] = InvalidBuffer;
+ /*
+ * Since we will not be writing out a dirty buffer from the ring, set
+ * from_ring to false so that the caller does not count this write as a
+ * "strategy write" and can do proper bookkeeping for
+ * pg_stat_buffer_actions.
+ */
+ *from_ring = false;
+
return true;
}
diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index e19c4506ef..6f1e1c30d2 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -399,6 +399,11 @@ pgstat_bestart(void)
lbeentry.st_progress_command = PROGRESS_COMMAND_INVALID;
lbeentry.st_progress_command_target = InvalidOid;
lbeentry.st_query_id = UINT64CONST(0);
+ pg_atomic_init_u64(&lbeentry.buffer_action_stats.allocs, 0);
+ pg_atomic_init_u64(&lbeentry.buffer_action_stats.extends, 0);
+ pg_atomic_init_u64(&lbeentry.buffer_action_stats.fsyncs, 0);
+ pg_atomic_init_u64(&lbeentry.buffer_action_stats.writes, 0);
+ pg_atomic_init_u64(&lbeentry.buffer_action_stats.writes_strat, 0);
/*
* we don't zero st_progress_param here to save cycles; nobody should
@@ -1045,6 +1050,50 @@ pgstat_get_my_query_id(void)
return MyBEEntry->st_query_id;
}
+volatile PgBackendStatus *
+pgstat_access_backend_status_array(void)
+{
+ return BackendStatusArray;
+}
+
+void
+pgstat_increment_buffer_action(BufferActionType ba_type)
+{
+ volatile PgBackendStatus *beentry = MyBEEntry;
+
+ if (!beentry || !pgstat_track_activities)
+ return;
+
+ switch (ba_type)
+ {
+ case BA_Alloc:
+ pg_atomic_write_u64(&beentry->buffer_action_stats.allocs,
+ pg_atomic_read_u64(&beentry->buffer_action_stats.allocs) + 1);
+ break;
+ case BA_Extend:
+ pg_atomic_write_u64(&beentry->buffer_action_stats.extends,
+ pg_atomic_read_u64(&beentry->buffer_action_stats.extends) + 1);
+ break;
+ case BA_Fsync:
+ pg_atomic_write_u64(&beentry->buffer_action_stats.fsyncs,
+ pg_atomic_read_u64(&beentry->buffer_action_stats.fsyncs) + 1);
+ break;
+ case BA_Write:
+ pg_atomic_write_u64(&beentry->buffer_action_stats.writes,
+ pg_atomic_read_u64(&beentry->buffer_action_stats.writes) + 1);
+ break;
+ case BA_Write_Strat:
+ pg_atomic_write_u64(&beentry->buffer_action_stats.writes_strat,
+ pg_atomic_read_u64(&beentry->buffer_action_stats.writes_strat) + 1);
+ break;
+ default:
+ ereport(LOG,
+ (errmsg(
+ "statistics for buffer action type %d are not collected",
+ ba_type)));
+ }
+
+}
/* ----------
* pgstat_fetch_stat_beentry() -
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index ff5aedc99c..278257de80 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1796,6 +1796,99 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
}
+Datum
+pg_stat_get_buffer_actions(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ TupleDesc tupdesc;
+ Tuplestorestate *tupstore;
+ MemoryContext per_query_ctx;
+ MemoryContext oldcontext;
+ const PgStat_MsgBufferActions *buffer_actions;
+ int i;
+ volatile PgBackendStatus *beentry;
+ Datum all_values[BACKEND_NUM_TYPES][BUFFER_ACTION_NUM_TYPES];
+ bool all_nulls[BACKEND_NUM_TYPES][BUFFER_ACTION_NUM_TYPES];
+
+ /* check to see if caller supports us returning a tuplestore */
+ if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("set-valued function called in context that cannot accept a set")));
+ if (!(rsinfo->allowedModes & SFRM_Materialize))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("materialize mode required, but it is not allowed in this context")));
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+ oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+ tupstore = tuplestore_begin_heap(true, false, work_mem);
+ rsinfo->returnMode = SFRM_Materialize;
+ rsinfo->setResult = tupstore;
+ rsinfo->setDesc = tupdesc;
+
+ MemoryContextSwitchTo(oldcontext);
+
+ MemSet(all_values, 0, sizeof(Datum[BACKEND_NUM_TYPES][BUFFER_ACTION_NUM_TYPES]));
+ MemSet(all_nulls, 0, sizeof(bool[BACKEND_NUM_TYPES][BUFFER_ACTION_NUM_TYPES]));
+
+ /* Add stats from all exited backends */
+ pgstat_fetch_global();
+ /* 0 is not a valid BackendType */
+ for (i = 1; i < BACKEND_NUM_TYPES; i++)
+ {
+ Datum *values = all_values[i];
+
+ values[0] = CStringGetTextDatum(GetBackendTypeDesc(i));
+ buffer_actions = pgstat_get_buffer_action_stats(i);
+
+ values[BA_Alloc] += buffer_actions->allocs;
+ values[BA_Extend] += buffer_actions->extends;
+ values[BA_Fsync] += buffer_actions->fsyncs;
+ values[BA_Write] += buffer_actions->writes;
+ values[BA_Write_Strat] += buffer_actions->writes_strat;
+ }
+
+ /*
+ * Loop through all live backends and count their buffer actions
+ */
+ beentry = pgstat_access_backend_status_array();
+ for (i = 0; i <= MaxBackends; i++)
+ {
+ Datum *values;
+ beentry++;
+ /* Don't count dead backends. They should already be counted */
+ if (beentry->st_procpid == 0)
+ continue;
+ values = all_values[beentry->st_backendType];
+
+
+ values[BA_Alloc] += pg_atomic_read_u64(&beentry->buffer_action_stats.allocs);
+ values[BA_Extend] += pg_atomic_read_u64(&beentry->buffer_action_stats.extends);
+ values[BA_Fsync] += pg_atomic_read_u64(&beentry->buffer_action_stats.fsyncs);
+ values[BA_Write] += pg_atomic_read_u64(&beentry->buffer_action_stats.writes);
+ values[BA_Write_Strat] += pg_atomic_read_u64(&beentry->buffer_action_stats.writes_strat);
+
+ }
+
+ for (i = 1; i < BACKEND_NUM_TYPES; i++)
+ {
+ Datum *values = all_values[i];
+ bool *nulls = all_nulls[i];
+ tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+ }
+
+ /* clean up and return the tuplestore */
+ tuplestore_donestoring(tupstore);
+
+ return (Datum) 0;
+}
+
/*
* Returns statistics of WAL activity
*/
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 88801374b5..cbeaa9ab94 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -294,6 +294,8 @@ GetBackendTypeDesc(BackendType backendType)
case B_LOGGER:
backendDesc = "logger";
break;
+ case BACKEND_NUM_TYPES:
+ break;
}
return backendDesc;
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index d068d6532e..ee3c11db06 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5642,6 +5642,15 @@
proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
+{ oid => '8459', descr => 'statistics: counts of buffer actions taken by each backend type',
+ proname => 'pg_stat_get_buffer_actions', provolatile => 's', proisstrict => 'f',
+ prorows => '13', proretset => 't',
+ proparallel => 'r', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,int8,int8,int8,int8,int8}',
+ proargmodes => '{o,o,o,o,o,o}',
+ proargnames => '{backend_type,buffers_alloc,buffers_extend,buffers_fsync,buffers_write,buffers_write_strat}',
+ prosrc => 'pg_stat_get_buffer_actions' },
+
{ oid => '1136', descr => 'statistics: information about WAL activity',
proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
proparallel => 'r', prorettype => 'record', proargtypes => '',
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 2e2e9a364a..03d5e464a9 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -336,8 +336,21 @@ typedef enum BackendType
B_ARCHIVER,
B_STATS_COLLECTOR,
B_LOGGER,
+ BACKEND_NUM_TYPES,
} BackendType;
+typedef enum BufferActionType
+{
+ BA_Invalid = 0,
+ BA_Alloc,
+ BA_Extend,
+ BA_Fsync,
+ BA_Write,
+ BA_Write_Strat,
+ BUFFER_ACTION_NUM_TYPES,
+} BufferActionType;
+
+
extern BackendType MyBackendType;
extern const char *GetBackendTypeDesc(BackendType backendType);
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 509849c7ff..57c642aeca 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -72,6 +72,7 @@ typedef enum StatMsgType
PGSTAT_MTYPE_ARCHIVER,
PGSTAT_MTYPE_BGWRITER,
PGSTAT_MTYPE_CHECKPOINTER,
+ PGSTAT_MTYPE_BUFFER_ACTIONS,
PGSTAT_MTYPE_WAL,
PGSTAT_MTYPE_SLRU,
PGSTAT_MTYPE_FUNCSTAT,
@@ -473,6 +474,18 @@ typedef struct PgStat_MsgCheckpointer
PgStat_Counter m_checkpoint_sync_time;
} PgStat_MsgCheckpointer;
+typedef struct PgStat_MsgBufferActions
+{
+ PgStat_MsgHdr m_hdr;
+
+ BackendType backend_type;
+ uint64 allocs;
+ uint64 extends;
+ uint64 fsyncs;
+ uint64 writes;
+ uint64 writes_strat;
+} PgStat_MsgBufferActions;
+
/* ----------
* PgStat_MsgWal Sent by backends and background processes to update WAL statistics.
* ----------
@@ -691,6 +704,7 @@ typedef union PgStat_Msg
PgStat_MsgArchiver msg_archiver;
PgStat_MsgBgWriter msg_bgwriter;
PgStat_MsgCheckpointer msg_checkpointer;
+ PgStat_MsgBufferActions msg_buffer_actions;
PgStat_MsgWal msg_wal;
PgStat_MsgSLRU msg_slru;
PgStat_MsgFuncstat msg_funcstat;
@@ -866,6 +880,7 @@ typedef struct PgStat_GlobalStats
PgStat_CheckpointerStats checkpointer;
PgStat_BgWriterStats bgwriter;
+ PgStat_MsgBufferActions buffer_actions[BACKEND_NUM_TYPES];
} PgStat_GlobalStats;
/*
@@ -955,6 +970,7 @@ extern PgStat_MsgBgWriter PendingBgWriterStats;
*/
extern PgStat_MsgCheckpointer PendingCheckpointerStats;
+extern PgStat_MsgBufferActions BufferActionsStats;
/*
* WAL statistics counter is updated by backends and background processes
*/
@@ -1105,6 +1121,8 @@ extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
extern void pgstat_send_archiver(const char *xlog, bool failed);
extern void pgstat_send_bgwriter(void);
extern void pgstat_send_checkpointer(void);
+extern void pgstat_send_buffer_actions(void);
+extern const PgStat_MsgBufferActions * pgstat_get_buffer_action_stats(BackendType backend_type);
extern void pgstat_send_wal(bool force);
/* ----------
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 33fcaf5c9a..7e385135db 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -310,10 +310,10 @@ extern void ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *
/* freelist.c */
extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
- uint32 *buf_state);
+ uint32 *buf_state, bool *from_ring);
extern void StrategyFreeBuffer(BufferDesc *buf);
extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
- BufferDesc *buf);
+ BufferDesc *buf, bool *from_ring);
extern int StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
extern void StrategyNotifyBgWriter(int bgwprocno);
diff --git a/src/include/utils/backend_status.h b/src/include/utils/backend_status.h
index 8042b817df..c23b74b4a6 100644
--- a/src/include/utils/backend_status.h
+++ b/src/include/utils/backend_status.h
@@ -13,6 +13,7 @@
#include "datatype/timestamp.h"
#include "libpq/pqcomm.h"
#include "miscadmin.h" /* for BackendType */
+#include "port/atomics.h"
#include "utils/backend_progress.h"
@@ -79,6 +80,14 @@ typedef struct PgBackendGSSStatus
} PgBackendGSSStatus;
+typedef struct PgBackendBufferActionStats
+{
+ pg_atomic_uint64 allocs;
+ pg_atomic_uint64 extends;
+ pg_atomic_uint64 fsyncs;
+ pg_atomic_uint64 writes;
+ pg_atomic_uint64 writes_strat;
+} PgBackendBufferActionStats;
/* ----------
* PgBackendStatus
@@ -168,6 +177,7 @@ typedef struct PgBackendStatus
/* query identifier, optionally computed using post_parse_analyze_hook */
uint64 st_query_id;
+ PgBackendBufferActionStats buffer_action_stats;
} PgBackendStatus;
@@ -306,6 +316,9 @@ extern const char *pgstat_get_crashed_backend_activity(int pid, char *buffer,
int buflen);
extern uint64 pgstat_get_my_query_id(void);
+extern volatile PgBackendStatus *pgstat_access_backend_status_array(void);
+extern void pgstat_increment_buffer_action(BufferActionType ba_type);
+
/* ----------
* Support functions for the SQL-callable functions to
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 2fa00a3c29..5c5445bcd7 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1828,6 +1828,13 @@ pg_stat_bgwriter| SELECT pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints
pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+pg_stat_buffer_actions| SELECT b.backend_type,
+ b.buffers_alloc,
+ b.buffers_extend,
+ b.buffers_fsync,
+ b.buffers_write,
+ b.buffers_write_strat
+ FROM pg_stat_get_buffer_actions() b(backend_type, buffers_alloc, buffers_extend, buffers_fsync, buffers_write, buffers_write_strat);
pg_stat_database| SELECT d.oid AS datid,
d.datname,
CASE
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index feaaee6326..fb4b613d4b 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -176,4 +176,5 @@ FROM prevstats AS pr;
DROP TABLE trunc_stats_test, trunc_stats_test1, trunc_stats_test2, trunc_stats_test3, trunc_stats_test4;
DROP TABLE prevstats;
+SELECT * FROM pg_stat_buffer_actions;
-- End of Stats Test
--
2.32.0
On Fri, Aug 13, 2021 at 3:08 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2021-08-11 16:11:34 -0400, Melanie Plageman wrote:
On Tue, Aug 3, 2021 at 2:13 PM Andres Freund <andres@anarazel.de> wrote:
Also, I'm unsure how writing the buffer action stats out in
pgstat_write_statsfiles() will work, since I think that backends can
update their buffer action stats after we would have already persisted
the data from the BufferActionStatsArray -- causing us to lose those
updates.

I was thinking it'd work differently. Whenever a connection ends, it
reports its data up to pgstats.c (otherwise we'd lose those stats). By
the time shutdown happens, they all need to have already reported their
stats - so we don't need to do anything to get the data to pgstats.c
during shutdown time.

When you say "whenever a connection ends", what part of the code are you
referring to specifically?

pgstat_beshutdown_hook()

Also, when you say "shutdown", do you mean a backend shutting down or
all backends shutting down (including postmaster) -- like pg_ctl stop?

Admittedly our language is very imprecise around this :(. What I meant
is that backends would report their own stats up to the stats collector
when the connection ends (in pgstat_beshutdown_hook()). That means that
when the whole server (pgstat and then postmaster, potentially via
pg_ctl stop) shuts down, all the per-connection stats have already been
reported up to pgstat.
So, I realized that the patch has a problem. I added the code to send
buffer actions stats to the stats collector
(pgstat_send_buffer_actions()) to pgstat_report_stat() and this isn't
getting called when all types of backends exit.
I originally thought to add pgstat_send_buffer_actions() to
pgstat_beshutdown_hook() (as suggested), but, this is called after
pgstat_shutdown_hook(), so, we aren't able to send stats to the stats
collector at that time. (pgstat_shutdown_hook() sets pgstat_is_shutdown
to true and then in pgstat_beshutdown_hook() (called after), if we call
pgstat_send_buffer_actions(), it calls pgstat_send() which calls
pgstat_assert_is_up() which trips when pgstat_is_shutdown is true.)
After calling pgstat_send_buffer_actions() from pgstat_report_stat(), it
seems to miss checkpointer stats entirely. I did find that if I
sprinkled pgstat_send_buffer_actions() around in the various places that
pgstat_send_checkpointer() is called, I could get checkpointer stats
(see attached patch, capture_checkpointer_buffer_actions.patch), but,
that seems a little bit haphazard since pgstat_send_buffer_actions() is
supposed to capture stats for all backend types. Is there somewhere else
I can call it that is exercised by all backend types before
pgstat_shutdown_hook() is called but after they would have finished any
relevant buffer actions?
- Melanie
Attachments:
capture_checkpointer_buffer_actions.patch
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index fca78fa4ef..74e9b9373a 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -488,6 +488,7 @@ CheckpointerMain(void)
/* Send WAL statistics to the stats collector. */
pgstat_send_wal(true);
+ pgstat_send_buffer_actions();
/*
* If any checkpoint flags have been set, redo the loop to handle the
@@ -566,6 +567,7 @@ HandleCheckpointerInterrupts(void)
ShutdownXLOG(0, 0);
pgstat_send_checkpointer();
pgstat_send_wal(true);
+ pgstat_send_buffer_actions();
/* Normal exit from the checkpointer is here */
proc_exit(0); /* done */
@@ -707,6 +709,7 @@ CheckpointWriteDelay(int flags, double progress)
* Report interim activity statistics.
*/
pgstat_send_checkpointer();
+ pgstat_send_buffer_actions();
/*
* This sleep used to be connected to bgwriter_delay, typically 200ms.
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 74e1d32766..f2d29336e1 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3116,10 +3116,7 @@ pgstat_send_buffer_actions(void)
pgstat_setheader(&BufferActionsStats.m_hdr, PGSTAT_MTYPE_BUFFER_ACTIONS);
pgstat_send(&BufferActionsStats, sizeof(BufferActionsStats));
- /*
- * pgstat_send_buffer_actions() is only called before a backend exits, so
- * BufferActionsStats should not be reused.
- */
+ MemSet(&BufferActionsStats, 0, sizeof(BufferActionsStats));
}
/* ----------
diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index 6f1e1c30d2..477e73340d 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -474,6 +474,7 @@ pgstat_beshutdown_hook(int code, Datum arg)
beentry->st_procpid = 0; /* mark invalid */
PGSTAT_END_WRITE_ACTIVITY(beentry);
+ pgstat_send_buffer_actions();
/* so that functions can check if backend_status.c is up via MyBEEntry */
MyBEEntry = NULL;
On Wed, Sep 8, 2021 at 9:28 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
I realized that putting these additional calls in checkpointer code
without clearing out the PgBackendStatus counters for buffer actions
results in a lot of duplicate stats. I do wonder, however, whether
pgstat_send_buffer_actions() is still needed in
HandleCheckpointerInterrupts() before the proc_exit().
It does seem like additional calls to pgstat_send_buffer_actions()
shouldn't be needed since most processes register
pgstat_shutdown_hook(). However, since MyDatabaseId isn't valid for the
auxiliary processes, even though the pgstat_shutdown_hook() is
registered from BaseInit(), pgstat_report_stat() never gets called for
them, so their stats aren't persisted using the current method.
It seems like the best solution to persisting all processes' stats would
be to have all processes register pgstat_shutdown_hook() and to still
call pgstat_report_stat() even if MyDatabaseId is not valid if the
process is not a regular backend (I assume that it is only a problem
that MyDatabaseId is InvalidOid for backends that have had it set to a
valid oid at some point). For the stats that rely on database OID,
perhaps those can be reported based on whether or not MyDatabaseId is
valid from within pgstat_report_stat().
I also realized that I am not collecting stats from live auxiliary
processes in pg_stat_get_buffer_actions(). I need to change the loop to
for (i = 0; i <= MaxBackends + NUM_AUXPROCTYPES; i++) to actually get
stats from live auxiliary processes when querying the view.
On an unrelated note, I am planning to remove buffers_clean and
buffers_checkpoint from the pg_stat_bgwriter view since those are also
redundant. When I was removing them, I noticed that buffers_checkpoint
and buffers_clean count buffers as having been written even when
FlushBuffer() "does nothing" because someone else wrote out the dirty
buffer before the bgwriter or checkpointer had a chance to do it. This
seems like it would result in an incorrect count. Am I missing
something?
- Melanie
Hi,
I've attached the v7 patch set.
Changes from v6:
- removed unnecessary global variable BufferActionsStats
- fixed the loop condition in pg_stat_get_buffer_actions()
- updated some comments
- removed buffers_checkpoint and buffers_clean from pg_stat_bgwriter
view (now pg_stat_bgwriter view is mainly checkpointer statistics,
which isn't great)
- instead of calling pgstat_send_buffer_actions() in
pgstat_report_stat(), I renamed pgstat_send_buffer_actions() to
pgstat_report_buffers() and call it directly from
pgstat_shutdown_hook() for all types of processes (including processes
with invalid MyDatabaseId [like auxiliary processes])
I began changing the code to add the stats reset timestamp to the
pg_stat_buffer_actions view, but, I realized that it will be kind of
distracting to have every row for every backend type have a stats reset
timestamp (since it will be the same timestamp over and over). If,
however, you could reset buffer stats for each backend type
individually, then, I could see having it. Otherwise, we could add a
function like pg_stat_get_stats_reset_time(viewname) where viewname
would be pg_stat_buffer_actions in our case. Though, maybe that is
annoying and not very usable--I'm not sure.
I also think it makes sense to rename the pg_stat_buffer_actions view to
pg_stat_buffers and to name the columns using both the buffer action
type and buffer type -- e.g. shared, strategy, local. This leaves open
the possibility of counting buffer actions done on other non-shared
buffers -- like those done while building indexes or those using local
buffers. The third patch in the set does this (I wanted to see if it
made sense before fixing it up into the first patch in the set).
This naming convention (BufferType_BufferActionType) made me think that
it might make sense to have two enumerations: one being the current
BufferActionType (which could also be called BufferAccessType though
that might get confusing with BufferAccessStrategyType and buffer access
strategies in general) and the other being BufferType (which would be
one of shared, local, index, etc).
I attached a patch with the outline of this idea
(buffer_type_enum_addition.patch). It doesn't work because
pg_stat_get_buffer_actions() uses the BufferActionType as an index into
the values array returned. If I wanted to use a combination of the two
enums as an indexing mechanism (BufferActionType and BufferType), we
would end up with a tuple having every combination of the two
enums--some of which aren't valid. It might not make sense to implement
this. I do think it is useful to think of these stats as a combination
of a buffer action and a type of buffer.
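To make the two-enum idea concrete, here's a rough C sketch (enum members are illustrative, not the patch's) of indexing a single counter array by both a BufferType and a BufferActionType:

```c
#include <stdint.h>

/* Hypothetical enums sketching the BufferType / BufferActionType
 * split discussed above. */
typedef enum BufferType
{
    BT_Shared,
    BT_Local,
    BT_Strategy,
    BT_NUM_TYPES
} BufferType;

typedef enum BufferActionType
{
    BA_Alloc,
    BA_Extend,
    BA_Fsync,
    BA_Write,
    BA_NUM_TYPES
} BufferActionType;

/* One counter per (buffer type, buffer action) combination; invalid
 * combinations simply stay at zero. */
static uint64_t counters[BT_NUM_TYPES][BA_NUM_TYPES];

static void
count_buffer_action(BufferType bt, BufferActionType ba)
{
    counters[bt][ba]++;
}
```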
- Melanie
Attachments:
v7-0001-Add-system-view-tracking-shared-buffer-actions.patch (text/x-patch)
From 17bf27ad0a6ae54a6a898e96c630d36867e9d943 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Thu, 2 Sep 2021 11:33:59 -0400
Subject: [PATCH v7 1/3] Add system view tracking shared buffer actions
Add a system view which tracks
- number of shared buffers the checkpointer and bgwriter write out
- number of shared buffers a regular backend is forced to flush
- number of extends done by a regular backend through shared buffers
- number of buffers flushed by a backend or autovacuum using a
BufferAccessStrategy -- writes that might have been avoided had a
clean shared buffer been available instead
- number of fsyncs done by a backend which could have been done by the
checkpointer if the sync request queue had not been full
- number of buffers allocated by a regular backend or autovacuum worker
for either a new block or an existing block of a relation which is not
currently in a buffer
All backends increment a counter in their PgBackendStatus when
performing one of these buffer actions. On exit, backends send these
stats to the stats collector to be persisted.
When pg_stat_buffer_actions view is queried, add all live backend's
statuses to the saved stats kept by the stats collector (since the last
stats reset) and return that as the total.
Each row of the view is for a particular backend type and each column is
the number of a particular kind of buffer action taken by the various
backends.
TODO:
- Some kind of test to protect against regressions in counting these
(and remove unstable pg_stats test)
- stats reset refactor
- when finished, catalog bump
- pgindent
---
doc/src/sgml/monitoring.sgml | 94 +++++++++++++++++++++
src/backend/catalog/system_views.sql | 10 +++
src/backend/postmaster/checkpointer.c | 1 +
src/backend/postmaster/pgstat.c | 72 +++++++++++++++-
src/backend/storage/buffer/bufmgr.c | 27 +++++-
src/backend/storage/buffer/freelist.c | 22 ++++-
src/backend/utils/activity/backend_status.c | 51 ++++++++++-
src/backend/utils/adt/pgstatfuncs.c | 93 ++++++++++++++++++++
src/backend/utils/init/miscinit.c | 2 +
src/include/catalog/pg_proc.dat | 9 ++
src/include/miscadmin.h | 13 +++
src/include/pgstat.h | 17 ++++
src/include/storage/buf_internals.h | 4 +-
src/include/utils/backend_status.h | 13 +++
src/test/regress/expected/rules.out | 7 ++
src/test/regress/sql/stats.sql | 1 +
16 files changed, 426 insertions(+), 10 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 2281ba120f..edd19368be 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -444,6 +444,15 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</entry>
</row>
+ <row>
+ <entry><structname>pg_stat_buffer_actions</structname><indexterm><primary>pg_stat_buffer_actions</primary></indexterm></entry>
+ <entry>One row for each backend type showing statistics about
+ backend buffer activity. See
+ <link linkend="monitoring-pg-stat-buffer-actions-view">
+ <structname>pg_stat_buffer_actions</structname></link> for details.
+ </entry>
+ </row>
+
<row>
<entry><structname>pg_stat_wal</structname><indexterm><primary>pg_stat_wal</primary></indexterm></entry>
<entry>One row only, showing statistics about WAL activity. See
@@ -3478,6 +3487,91 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
</sect2>
+ <sect2 id="monitoring-pg-stat-buffer-actions-view">
+ <title><structname>pg_stat_buffer_actions</structname></title>
+
+ <indexterm>
+ <primary>pg_stat_buffer_actions</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_stat_buffer_actions</structname> view has a row for each
+ backend type, containing global data for the cluster for that backend type.
+ </para>
+
+ <table id="pg-stat-buffer-actions-view" xreflabel="pg_stat_buffer_actions">
+ <title><structname>pg_stat_buffer_actions</structname> View</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>backend_type</structfield> <type>text</type>
+ </para>
+ <para>
+ Type of backend (e.g. background worker, autovacuum worker).
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>buffers_alloc</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers allocated.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>buffers_extend</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers extended.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>buffers_fsync</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers fsynced. TODO: is this only shared buffers?
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>buffers_write</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers written.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>buffers_write_strat</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers written as part of a buffer access strategy.
+ </para></entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ </sect2>
+
<sect2 id="monitoring-pg-stat-wal-view">
<title><structname>pg_stat_wal</structname></title>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 55f6e3711d..7ba54d1119 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1072,6 +1072,16 @@ CREATE VIEW pg_stat_bgwriter AS
pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+CREATE VIEW pg_stat_buffer_actions AS
+SELECT
+ b.backend_type,
+ b.buffers_alloc,
+ b.buffers_extend,
+ b.buffers_fsync,
+ b.buffers_write,
+ b.buffers_write_strat
+FROM pg_stat_get_buffer_actions() b;
+
CREATE VIEW pg_stat_wal AS
SELECT
w.wal_records,
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index be7366379d..23f2ffccd9 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -1104,6 +1104,7 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
*/
if (!AmBackgroundWriterProcess())
CheckpointerShmem->num_backend_fsync++;
+ pgstat_increment_buffer_action(BA_Fsync);
LWLockRelease(CheckpointerCommLock);
return false;
}
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 3450a10129..75db7a1995 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -63,6 +63,7 @@
#include "storage/pg_shmem.h"
#include "storage/proc.h"
#include "storage/procsignal.h"
+#include "utils/backend_status.h"
#include "utils/builtins.h"
#include "utils/guc.h"
#include "utils/memutils.h"
@@ -124,9 +125,12 @@ char *pgstat_stat_filename = NULL;
char *pgstat_stat_tmpname = NULL;
/*
- * BgWriter and WAL global statistics counters.
- * Stored directly in a stats message structure so they can be sent
- * without needing to copy things around. We assume these init to zeroes.
+ * BgWriter, Checkpointer, WAL, and I/O global statistics counters. I/O
+ * statistics on buffer actions are tracked in PgBackendStatus while a
+ * backend is alive and sent to the stats collector in a
+ * PgStat_MsgBufferActions just before the backend exits.
+ * All others are stored directly in a stats message structure so they can be
+ * sent without needing to copy things around. We assume these init to zeroes.
*/
PgStat_MsgBgWriter PendingBgWriterStats;
PgStat_MsgCheckpointer PendingCheckpointerStats;
@@ -359,6 +363,7 @@ static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
static void pgstat_recv_checkpointer(PgStat_MsgCheckpointer *msg, int len);
+static void pgstat_recv_buffer_actions(PgStat_MsgBufferActions *msg, int len);
static void pgstat_recv_wal(PgStat_MsgWal *msg, int len);
static void pgstat_recv_slru(PgStat_MsgSLRU *msg, int len);
static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
@@ -970,6 +975,7 @@ pgstat_report_stat(bool disconnect)
/* Now, send function statistics */
pgstat_send_funcstats();
+
/* Send WAL statistics */
pgstat_send_wal(true);
@@ -2903,6 +2909,13 @@ static void
pgstat_shutdown_hook(int code, Datum arg)
{
Assert(!pgstat_is_shutdown);
+ /*
+ * Only need to send stats on buffer actions when a process exits, as
+ * pg_stat_get_buffer_actions() will read from live backends'
+ * PgBackendStatus and then sum this with totals from exited backends
+ * persisted by the stats collector.
+ */
+ pgstat_report_buffers();
/*
* If we got as far as discovering our own database ID, we can report what
@@ -3085,6 +3098,31 @@ pgstat_send_checkpointer(void)
MemSet(&PendingCheckpointerStats, 0, sizeof(PendingCheckpointerStats));
}
+/*
+ * Called for a single backend at the time of death to send its I/O stats to
+ * the stats collector so that they may be persisted.
+ */
+void
+pgstat_report_buffers(void)
+{
+ PgStat_MsgBufferActions buffer_actions_stats;
+ volatile PgBackendStatus *beentry = MyBEEntry;
+ if (!beentry)
+ return;
+
+ buffer_actions_stats = (PgStat_MsgBufferActions) {
+ .backend_type = beentry->st_backendType,
+ .allocs = pg_atomic_read_u64(&beentry->buffer_action_stats.allocs),
+ .extends = pg_atomic_read_u64(&beentry->buffer_action_stats.extends),
+ .fsyncs = pg_atomic_read_u64(&beentry->buffer_action_stats.fsyncs),
+ .writes = pg_atomic_read_u64(&beentry->buffer_action_stats.writes),
+ .writes_strat = pg_atomic_read_u64(&beentry->buffer_action_stats.writes_strat)
+ };
+
+ pgstat_setheader(&buffer_actions_stats.m_hdr, PGSTAT_MTYPE_BUFFER_ACTIONS);
+ pgstat_send(&buffer_actions_stats, sizeof(buffer_actions_stats));
+}
+
/* ----------
* pgstat_send_wal() -
*
@@ -3427,6 +3465,10 @@ PgstatCollectorMain(int argc, char *argv[])
pgstat_recv_checkpointer(&msg.msg_checkpointer, len);
break;
+ case PGSTAT_MTYPE_BUFFER_ACTIONS:
+ pgstat_recv_buffer_actions(&msg.msg_buffer_actions, len);
+ break;
+
case PGSTAT_MTYPE_WAL:
pgstat_recv_wal(&msg.msg_wal, len);
break;
@@ -5406,6 +5448,30 @@ pgstat_recv_checkpointer(PgStat_MsgCheckpointer *msg, int len)
globalStats.checkpointer.buf_fsync_backend += msg->m_buf_fsync_backend;
}
+static void
+pgstat_recv_buffer_actions(PgStat_MsgBufferActions *msg, int len)
+{
+ /*
+ * No users will likely need PgStat_MsgBufferActions->backend_type when
+ * accessing it from globalStats since its place in the
+ * globalStats.buffer_actions array indicates backend_type. However,
+ * leaving it undefined seemed like an invitation for unnecessary future
+ * bugs.
+ */
+ globalStats.buffer_actions[msg->backend_type].backend_type = msg->backend_type;
+ globalStats.buffer_actions[msg->backend_type].allocs += msg->allocs;
+ globalStats.buffer_actions[msg->backend_type].extends += msg->extends;
+ globalStats.buffer_actions[msg->backend_type].fsyncs += msg->fsyncs;
+ globalStats.buffer_actions[msg->backend_type].writes += msg->writes;
+ globalStats.buffer_actions[msg->backend_type].writes_strat += msg->writes_strat;
+}
+
+const PgStat_MsgBufferActions *
+pgstat_get_buffer_action_stats(BackendType backend_type)
+{
+ return &globalStats.buffer_actions[backend_type];
+}
+
/* ----------
* pgstat_recv_wal() -
*
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index bc1753ae91..ef83c576b0 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -963,6 +963,10 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (isExtend)
{
+ /*
+ * Extends counted here are only those that go through shared buffers
+ */
+ pgstat_increment_buffer_action(BA_Extend);
/* new buffers are zero-filled */
MemSet((char *) bufBlock, 0, BLCKSZ);
/* don't set checksum for all-zero page */
@@ -1163,6 +1167,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
/* Loop here in case we have to try another victim buffer */
for (;;)
{
+ bool from_ring;
/*
* Ensure, while the spinlock's not yet held, that there's a free
* refcount entry.
@@ -1173,7 +1178,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* Select a victim buffer. The buffer is returned with its header
* spinlock still held!
*/
- buf = StrategyGetBuffer(strategy, &buf_state);
+ buf = StrategyGetBuffer(strategy, &buf_state, &from_ring);
Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
@@ -1210,6 +1215,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf),
LW_SHARED))
{
+ BufferActionType buffer_action;
/*
* If using a nondefault strategy, and writing the buffer
* would require a WAL flush, let the strategy decide whether
@@ -1227,7 +1233,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
UnlockBufHdr(buf, buf_state);
if (XLogNeedsFlush(lsn) &&
- StrategyRejectBuffer(strategy, buf))
+ StrategyRejectBuffer(strategy, buf, &from_ring))
{
/* Drop lock/pin and loop around for another buffer */
LWLockRelease(BufferDescriptorGetContentLock(buf));
@@ -1236,6 +1242,21 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
}
}
+ /*
+ * When a strategy is in use, if the dirty buffer was selected
+ * from the strategy ring and we did not bother checking the
+ * freelist or doing a clock sweep to look for a clean shared
+ * buffer to use, the write will be counted as a strategy
+ * write. However, if the dirty buffer was obtained from the
+ * freelist or a clock sweep, it is counted as a regular write.
+ * When a strategy is not in use, at this point, the write can
+ * only be a "regular" write of a dirty buffer.
+ */
+
+ buffer_action = from_ring ? BA_Write_Strat : BA_Write;
+ pgstat_increment_buffer_action(buffer_action);
+
+
/* OK, do the I/O */
TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(forkNum, blockNum,
smgr->smgr_rnode.node.spcNode,
@@ -2543,6 +2564,8 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
* Pin it, share-lock it, write it. (FlushBuffer will do nothing if the
* buffer is clean by the time we've locked it.)
*/
+
+ pgstat_increment_buffer_action(BA_Write);
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 6be80476db..17b76e9c2c 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -19,6 +19,7 @@
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/proc.h"
+#include "utils/backend_status.h"
#define INT_ACCESS_ONCE(var) ((int)(*((volatile int *)&(var))))
@@ -198,7 +199,7 @@ have_free_buffer(void)
* return the buffer with the buffer header spinlock still held.
*/
BufferDesc *
-StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
+StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_ring)
{
BufferDesc *buf;
int bgwprocno;
@@ -212,6 +213,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
if (strategy != NULL)
{
buf = GetBufferFromRing(strategy, buf_state);
+ *from_ring = (buf != NULL);
if (buf != NULL)
return buf;
}
@@ -247,6 +249,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
* the rate of buffer consumption. Note that buffers recycled by a
* strategy object are intentionally not counted here.
*/
+ pgstat_increment_buffer_action(BA_Alloc);
pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);
/*
@@ -683,8 +686,14 @@ AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
* if this buffer should be written and re-used.
*/
bool
-StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
+StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool *from_ring)
{
+ /*
+ * If we decide to use the dirty buffer selected by StrategyGetBuffer, then
+ * ensure that we count it as such in pg_stat_buffer_actions view.
+ */
+ *from_ring = true;
+
/* We only do this in bulkread mode */
if (strategy->btype != BAS_BULKREAD)
return false;
@@ -700,5 +709,14 @@ StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
*/
strategy->buffers[strategy->current] = InvalidBuffer;
+ /*
+ * Since we will not be writing out a dirty buffer from the ring, set
+ * from_ring to false so that the caller does not count this write as a
+ * "strategy write" and can do proper bookkeeping for
+ * pg_stat_buffer_actions.
+ */
+ *from_ring = false;
+
+
return true;
}
diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index e19c4506ef..f8f914ac7e 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -279,7 +279,7 @@ pgstat_beinit(void)
* pgstat_bestart() -
*
* Initialize this backend's entry in the PgBackendStatus array.
- * Called from InitPostgres.
+ * Called from InitPostgres and AuxiliaryProcessMain
*
* Apart from auxiliary processes, MyBackendId, MyDatabaseId,
* session userid, and application_name must be set for a
@@ -399,6 +399,11 @@ pgstat_bestart(void)
lbeentry.st_progress_command = PROGRESS_COMMAND_INVALID;
lbeentry.st_progress_command_target = InvalidOid;
lbeentry.st_query_id = UINT64CONST(0);
+ pg_atomic_init_u64(&lbeentry.buffer_action_stats.allocs, 0);
+ pg_atomic_init_u64(&lbeentry.buffer_action_stats.extends, 0);
+ pg_atomic_init_u64(&lbeentry.buffer_action_stats.fsyncs, 0);
+ pg_atomic_init_u64(&lbeentry.buffer_action_stats.writes, 0);
+ pg_atomic_init_u64(&lbeentry.buffer_action_stats.writes_strat, 0);
/*
* we don't zero st_progress_param here to save cycles; nobody should
@@ -1045,6 +1050,50 @@ pgstat_get_my_query_id(void)
return MyBEEntry->st_query_id;
}
+volatile PgBackendStatus *
+pgstat_access_backend_status_array(void)
+{
+ return BackendStatusArray;
+}
+
+void
+pgstat_increment_buffer_action(BufferActionType ba_type)
+{
+ volatile PgBackendStatus *beentry = MyBEEntry;
+
+ if (!beentry || !pgstat_track_activities)
+ return;
+
+ switch (ba_type)
+ {
+ case BA_Alloc:
+ pg_atomic_write_u64(&beentry->buffer_action_stats.allocs,
+ pg_atomic_read_u64(&beentry->buffer_action_stats.allocs) + 1);
+ break;
+ case BA_Extend:
+ pg_atomic_write_u64(&beentry->buffer_action_stats.extends,
+ pg_atomic_read_u64(&beentry->buffer_action_stats.extends) + 1);
+ break;
+ case BA_Fsync:
+ pg_atomic_write_u64(&beentry->buffer_action_stats.fsyncs,
+ pg_atomic_read_u64(&beentry->buffer_action_stats.fsyncs) + 1);
+ break;
+ case BA_Write:
+ pg_atomic_write_u64(&beentry->buffer_action_stats.writes,
+ pg_atomic_read_u64(&beentry->buffer_action_stats.writes) + 1);
+ break;
+ case BA_Write_Strat:
+ pg_atomic_write_u64(&beentry->buffer_action_stats.writes_strat,
+ pg_atomic_read_u64(&beentry->buffer_action_stats.writes_strat) + 1);
+ break;
+ default:
+ ereport(LOG,
+ (errmsg("statistics are not collected for buffer action type %d",
+ ba_type)));
+ }
+
+}
/* ----------
* pgstat_fetch_stat_beentry() -
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index ff5aedc99c..4ba492c121 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1796,6 +1796,99 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
}
+Datum
+pg_stat_get_buffer_actions(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ TupleDesc tupdesc;
+ Tuplestorestate *tupstore;
+ MemoryContext per_query_ctx;
+ MemoryContext oldcontext;
+ const PgStat_MsgBufferActions *buffer_actions;
+ int i;
+ volatile PgBackendStatus *beentry;
+ Datum all_values[BACKEND_NUM_TYPES][BUFFER_ACTION_NUM_TYPES];
+ bool all_nulls[BACKEND_NUM_TYPES][BUFFER_ACTION_NUM_TYPES];
+
+ /* check to see if caller supports us returning a tuplestore */
+ if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("set-valued function called in context that cannot accept a set")));
+ if (!(rsinfo->allowedModes & SFRM_Materialize))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("materialize mode required, but it is not allowed in this context")));
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+ oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+ tupstore = tuplestore_begin_heap(true, false, work_mem);
+ rsinfo->returnMode = SFRM_Materialize;
+ rsinfo->setResult = tupstore;
+ rsinfo->setDesc = tupdesc;
+
+ MemoryContextSwitchTo(oldcontext);
+
+ MemSet(all_values, 0, sizeof(Datum[BACKEND_NUM_TYPES][BUFFER_ACTION_NUM_TYPES]));
+ MemSet(all_nulls, 0, sizeof(bool[BACKEND_NUM_TYPES][BUFFER_ACTION_NUM_TYPES]));
+
+ /* Add stats from all exited backends */
+ pgstat_fetch_global();
+ /* 0 is not a valid BackendType */
+ for (i = 1; i < BACKEND_NUM_TYPES; i++)
+ {
+ Datum *values = all_values[i];
+
+ values[0] = CStringGetTextDatum(GetBackendTypeDesc(i));
+ buffer_actions = pgstat_get_buffer_action_stats(i);
+
+ values[BA_Alloc] += buffer_actions->allocs;
+ values[BA_Extend] += buffer_actions->extends;
+ values[BA_Fsync] += buffer_actions->fsyncs;
+ values[BA_Write] += buffer_actions->writes;
+ values[BA_Write_Strat] += buffer_actions->writes_strat;
+ }
+
+ /*
+ * Loop through all live backends and count their buffer actions
+ */
+ beentry = pgstat_access_backend_status_array();
+ for (i = 0; i < MaxBackends + NUM_AUXPROCTYPES; i++, beentry++)
+ {
+ Datum *values;
+ /* Don't count dead backends. They should already be counted */
+ if (beentry->st_procpid == 0)
+ continue;
+ values = all_values[beentry->st_backendType];
+
+
+ values[BA_Alloc] += pg_atomic_read_u64(&beentry->buffer_action_stats.allocs);
+ values[BA_Extend] += pg_atomic_read_u64(&beentry->buffer_action_stats.extends);
+ values[BA_Fsync] += pg_atomic_read_u64(&beentry->buffer_action_stats.fsyncs);
+ values[BA_Write] += pg_atomic_read_u64(&beentry->buffer_action_stats.writes);
+ values[BA_Write_Strat] += pg_atomic_read_u64(&beentry->buffer_action_stats.writes_strat);
+
+ }
+
+ for (i = 1; i < BACKEND_NUM_TYPES; i++)
+ {
+ Datum *values = all_values[i];
+ bool *nulls = all_nulls[i];
+ tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+ }
+
+ /* clean up and return the tuplestore */
+ tuplestore_donestoring(tupstore);
+
+ return (Datum) 0;
+}
+
/*
* Returns statistics of WAL activity
*/
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 88801374b5..cbeaa9ab94 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -294,6 +294,8 @@ GetBackendTypeDesc(BackendType backendType)
case B_LOGGER:
backendDesc = "logger";
break;
+ case BACKEND_NUM_TYPES:
+ break;
}
return backendDesc;
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index d068d6532e..ee3c11db06 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5642,6 +5642,15 @@
proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
+{ oid => '8459', descr => 'statistics: counts of buffer actions taken by each backend type',
+ proname => 'pg_stat_get_buffer_actions', provolatile => 's', proisstrict => 'f',
+ prorows => '13', proretset => 't',
+ proparallel => 'r', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,int8,int8,int8,int8,int8}',
+ proargmodes => '{o,o,o,o,o,o}',
+ proargnames => '{backend_type,buffers_alloc,buffers_extend,buffers_fsync,buffers_write,buffers_write_strat}',
+ prosrc => 'pg_stat_get_buffer_actions' },
+
{ oid => '1136', descr => 'statistics: information about WAL activity',
proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
proparallel => 'r', prorettype => 'record', proargtypes => '',
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 2e2e9a364a..03d5e464a9 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -336,8 +336,21 @@ typedef enum BackendType
B_ARCHIVER,
B_STATS_COLLECTOR,
B_LOGGER,
+ BACKEND_NUM_TYPES,
} BackendType;
+typedef enum BufferActionType
+{
+ BA_Invalid = 0,
+ BA_Alloc,
+ BA_Extend,
+ BA_Fsync,
+ BA_Write,
+ BA_Write_Strat,
+ BUFFER_ACTION_NUM_TYPES,
+} BufferActionType;
+
+
extern BackendType MyBackendType;
extern const char *GetBackendTypeDesc(BackendType backendType);
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 509849c7ff..21f5f24e8c 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -72,6 +72,7 @@ typedef enum StatMsgType
PGSTAT_MTYPE_ARCHIVER,
PGSTAT_MTYPE_BGWRITER,
PGSTAT_MTYPE_CHECKPOINTER,
+ PGSTAT_MTYPE_BUFFER_ACTIONS,
PGSTAT_MTYPE_WAL,
PGSTAT_MTYPE_SLRU,
PGSTAT_MTYPE_FUNCSTAT,
@@ -473,6 +474,18 @@ typedef struct PgStat_MsgCheckpointer
PgStat_Counter m_checkpoint_sync_time;
} PgStat_MsgCheckpointer;
+typedef struct PgStat_MsgBufferActions
+{
+ PgStat_MsgHdr m_hdr;
+
+ BackendType backend_type;
+ uint64 allocs;
+ uint64 extends;
+ uint64 fsyncs;
+ uint64 writes;
+ uint64 writes_strat;
+} PgStat_MsgBufferActions;
+
/* ----------
* PgStat_MsgWal Sent by backends and background processes to update WAL statistics.
* ----------
@@ -691,6 +704,7 @@ typedef union PgStat_Msg
PgStat_MsgArchiver msg_archiver;
PgStat_MsgBgWriter msg_bgwriter;
PgStat_MsgCheckpointer msg_checkpointer;
+ PgStat_MsgBufferActions msg_buffer_actions;
PgStat_MsgWal msg_wal;
PgStat_MsgSLRU msg_slru;
PgStat_MsgFuncstat msg_funcstat;
@@ -866,6 +880,7 @@ typedef struct PgStat_GlobalStats
PgStat_CheckpointerStats checkpointer;
PgStat_BgWriterStats bgwriter;
+ PgStat_MsgBufferActions buffer_actions[BACKEND_NUM_TYPES];
} PgStat_GlobalStats;
/*
@@ -1105,6 +1120,8 @@ extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
extern void pgstat_send_archiver(const char *xlog, bool failed);
extern void pgstat_send_bgwriter(void);
extern void pgstat_send_checkpointer(void);
+extern void pgstat_report_buffers(void);
+extern const PgStat_MsgBufferActions * pgstat_get_buffer_action_stats(BackendType backend_type);
extern void pgstat_send_wal(bool force);
/* ----------
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 33fcaf5c9a..7e385135db 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -310,10 +310,10 @@ extern void ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *
/* freelist.c */
extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
- uint32 *buf_state);
+ uint32 *buf_state, bool *from_ring);
extern void StrategyFreeBuffer(BufferDesc *buf);
extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
- BufferDesc *buf);
+ BufferDesc *buf, bool *from_ring);
extern int StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
extern void StrategyNotifyBgWriter(int bgwprocno);
diff --git a/src/include/utils/backend_status.h b/src/include/utils/backend_status.h
index 8042b817df..c23b74b4a6 100644
--- a/src/include/utils/backend_status.h
+++ b/src/include/utils/backend_status.h
@@ -13,6 +13,7 @@
#include "datatype/timestamp.h"
#include "libpq/pqcomm.h"
#include "miscadmin.h" /* for BackendType */
+#include "port/atomics.h"
#include "utils/backend_progress.h"
@@ -79,6 +80,14 @@ typedef struct PgBackendGSSStatus
} PgBackendGSSStatus;
+typedef struct PgBackendBufferActionStats
+{
+ pg_atomic_uint64 allocs;
+ pg_atomic_uint64 extends;
+ pg_atomic_uint64 fsyncs;
+ pg_atomic_uint64 writes;
+ pg_atomic_uint64 writes_strat;
+} PgBackendBufferActionStats;
/* ----------
* PgBackendStatus
@@ -168,6 +177,7 @@ typedef struct PgBackendStatus
/* query identifier, optionally computed using post_parse_analyze_hook */
uint64 st_query_id;
+ PgBackendBufferActionStats buffer_action_stats;
} PgBackendStatus;
@@ -306,6 +316,9 @@ extern const char *pgstat_get_crashed_backend_activity(int pid, char *buffer,
int buflen);
extern uint64 pgstat_get_my_query_id(void);
+extern volatile PgBackendStatus *pgstat_access_backend_status_array(void);
+extern void pgstat_increment_buffer_action(BufferActionType ba_type);
+
/* ----------
* Support functions for the SQL-callable functions to
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 2fa00a3c29..5c5445bcd7 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1828,6 +1828,13 @@ pg_stat_bgwriter| SELECT pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints
pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+pg_stat_buffer_actions| SELECT b.backend_type,
+ b.buffers_alloc,
+ b.buffers_extend,
+ b.buffers_fsync,
+ b.buffers_write,
+ b.buffers_write_strat
+ FROM pg_stat_get_buffer_actions() b(backend_type, buffers_alloc, buffers_extend, buffers_fsync, buffers_write, buffers_write_strat);
pg_stat_database| SELECT d.oid AS datid,
d.datname,
CASE
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index feaaee6326..fb4b613d4b 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -176,4 +176,5 @@ FROM prevstats AS pr;
DROP TABLE trunc_stats_test, trunc_stats_test1, trunc_stats_test2, trunc_stats_test3, trunc_stats_test4;
DROP TABLE prevstats;
+SELECT * FROM pg_stat_buffer_actions;
-- End of Stats Test
--
2.27.0
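One note on the counter increments in the patch above: pgstat_increment_buffer_action() does an atomic write of (atomic read + 1) rather than a fetch-and-add. That is only safe because each backend is the sole writer of its own PgBackendStatus counters; readers merely need torn-free 64-bit loads. A minimal C11 model of that single-writer pattern (names are illustrative):

```c
#include <stdatomic.h>
#include <stdint.h>

/* One counter, written only by its owning thread/backend. */
static _Atomic uint64_t writes;

/* Owner-only increment: a plain store of (load + 1) cannot lose
 * updates when there is exactly one writer. */
static void
increment_writes(void)
{
    atomic_store(&writes, atomic_load(&writes) + 1);
}

/* Any reader gets a consistent (non-torn) 64-bit value. */
static uint64_t
read_writes(void)
{
    return atomic_load(&writes);
}
```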
v7-0002-Remove-superfluous-bgwriter-stats.patch (text/x-patch)
From f853ecf1a44984158b2a06705c2f5d01bbace47e Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Thu, 2 Sep 2021 11:47:41 -0400
Subject: [PATCH v7 2/3] Remove superfluous bgwriter stats
Remove stats from pg_stat_bgwriter which are now more clearly expressed
in pg_stat_buffer_actions.
---
doc/src/sgml/monitoring.sgml | 47 ---------------------------
src/backend/catalog/system_views.sql | 5 ---
src/backend/postmaster/checkpointer.c | 26 ---------------
src/backend/postmaster/pgstat.c | 5 ---
src/backend/storage/buffer/bufmgr.c | 6 ----
src/backend/utils/adt/pgstatfuncs.c | 30 -----------------
src/include/catalog/pg_proc.dat | 22 -------------
src/include/pgstat.h | 10 ------
src/test/regress/expected/rules.out | 5 ---
9 files changed, 156 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index edd19368be..96258ada18 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3416,24 +3416,6 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
</para></entry>
</row>
- <row>
- <entry role="catalog_table_entry"><para role="column_definition">
- <structfield>buffers_checkpoint</structfield> <type>bigint</type>
- </para>
- <para>
- Number of buffers written during checkpoints
- </para></entry>
- </row>
-
- <row>
- <entry role="catalog_table_entry"><para role="column_definition">
- <structfield>buffers_clean</structfield> <type>bigint</type>
- </para>
- <para>
- Number of buffers written by the background writer
- </para></entry>
- </row>
-
<row>
<entry role="catalog_table_entry"><para role="column_definition">
<structfield>maxwritten_clean</structfield> <type>bigint</type>
@@ -3444,35 +3426,6 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
</para></entry>
</row>
- <row>
- <entry role="catalog_table_entry"><para role="column_definition">
- <structfield>buffers_backend</structfield> <type>bigint</type>
- </para>
- <para>
- Number of buffers written directly by a backend
- </para></entry>
- </row>
-
- <row>
- <entry role="catalog_table_entry"><para role="column_definition">
- <structfield>buffers_backend_fsync</structfield> <type>bigint</type>
- </para>
- <para>
- Number of times a backend had to execute its own
- <function>fsync</function> call (normally the background writer handles those
- even when the backend does its own write)
- </para></entry>
- </row>
-
- <row>
- <entry role="catalog_table_entry"><para role="column_definition">
- <structfield>buffers_alloc</structfield> <type>bigint</type>
- </para>
- <para>
- Number of buffers allocated
- </para></entry>
- </row>
-
<row>
<entry role="catalog_table_entry"><para role="column_definition">
<structfield>stats_reset</structfield> <type>timestamp with time zone</type>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 7ba54d1119..a5d7972687 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1064,12 +1064,7 @@ CREATE VIEW pg_stat_bgwriter AS
pg_stat_get_bgwriter_requested_checkpoints() AS checkpoints_req,
pg_stat_get_checkpoint_write_time() AS checkpoint_write_time,
pg_stat_get_checkpoint_sync_time() AS checkpoint_sync_time,
- pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint,
- pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean,
pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
- pg_stat_get_buf_written_backend() AS buffers_backend,
- pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
- pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
CREATE VIEW pg_stat_buffer_actions AS
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 23f2ffccd9..fca78fa4ef 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -90,17 +90,9 @@
* requesting backends since the last checkpoint start. The flags are
* chosen so that OR'ing is the correct way to combine multiple requests.
*
- * num_backend_writes is used to count the number of buffer writes performed
- * by user backend processes. This counter should be wide enough that it
- * can't overflow during a single processing cycle. num_backend_fsync
- * counts the subset of those writes that also had to do their own fsync,
- * because the checkpointer failed to absorb their request.
- *
* The requests array holds fsync requests sent by backends and not yet
* absorbed by the checkpointer.
*
- * Unlike the checkpoint fields, num_backend_writes, num_backend_fsync, and
- * the requests fields are protected by CheckpointerCommLock.
*----------
*/
typedef struct
@@ -124,9 +116,6 @@ typedef struct
ConditionVariable start_cv; /* signaled when ckpt_started advances */
ConditionVariable done_cv; /* signaled when ckpt_done advances */
- uint32 num_backend_writes; /* counts user backend buffer writes */
- uint32 num_backend_fsync; /* counts user backend fsync calls */
-
int num_requests; /* current # of requests */
int max_requests; /* allocated array size */
CheckpointerRequest requests[FLEXIBLE_ARRAY_MEMBER];
@@ -1085,10 +1074,6 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
- /* Count all backend writes regardless of if they fit in the queue */
- if (!AmBackgroundWriterProcess())
- CheckpointerShmem->num_backend_writes++;
-
/*
* If the checkpointer isn't running or the request queue is full, the
* backend will have to perform its own fsync request. But before forcing
@@ -1102,8 +1087,6 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
* Count the subset of writes where backends have to do their own
* fsync
*/
- if (!AmBackgroundWriterProcess())
- CheckpointerShmem->num_backend_fsync++;
pgstat_increment_buffer_action(BA_Fsync);
LWLockRelease(CheckpointerCommLock);
return false;
@@ -1261,15 +1244,6 @@ AbsorbSyncRequests(void)
LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
- /* Transfer stats counts into pending pgstats message */
- PendingCheckpointerStats.m_buf_written_backend
- += CheckpointerShmem->num_backend_writes;
- PendingCheckpointerStats.m_buf_fsync_backend
- += CheckpointerShmem->num_backend_fsync;
-
- CheckpointerShmem->num_backend_writes = 0;
- CheckpointerShmem->num_backend_fsync = 0;
-
/*
* We try to avoid holding the lock for a long time by copying the request
* array, and processing the requests after releasing the lock.
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 75db7a1995..43e88f488f 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -5425,9 +5425,7 @@ pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
static void
pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
{
- globalStats.bgwriter.buf_written_clean += msg->m_buf_written_clean;
globalStats.bgwriter.maxwritten_clean += msg->m_maxwritten_clean;
- globalStats.bgwriter.buf_alloc += msg->m_buf_alloc;
}
/* ----------
@@ -5443,9 +5441,6 @@ pgstat_recv_checkpointer(PgStat_MsgCheckpointer *msg, int len)
globalStats.checkpointer.requested_checkpoints += msg->m_requested_checkpoints;
globalStats.checkpointer.checkpoint_write_time += msg->m_checkpoint_write_time;
globalStats.checkpointer.checkpoint_sync_time += msg->m_checkpoint_sync_time;
- globalStats.checkpointer.buf_written_checkpoints += msg->m_buf_written_checkpoints;
- globalStats.checkpointer.buf_written_backend += msg->m_buf_written_backend;
- globalStats.checkpointer.buf_fsync_backend += msg->m_buf_fsync_backend;
}
static void
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index ef83c576b0..ff219038e2 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2158,7 +2158,6 @@ BufferSync(int flags)
if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
{
TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
- PendingCheckpointerStats.m_buf_written_checkpoints++;
num_written++;
}
}
@@ -2267,9 +2266,6 @@ BgBufferSync(WritebackContext *wb_context)
*/
strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
- /* Report buffer alloc counts to pgstat */
- PendingBgWriterStats.m_buf_alloc += recent_alloc;
-
/*
* If we're not running the LRU scan, just stop after doing the stats
* stuff. We mark the saved state invalid so that we can recover sanely
@@ -2466,8 +2462,6 @@ BgBufferSync(WritebackContext *wb_context)
reusable_buffers++;
}
- PendingBgWriterStats.m_buf_written_clean += num_written;
-
#ifdef BGW_DEBUG
elog(DEBUG1, "bgwriter: recent_alloc=%u smoothed=%.2f delta=%ld ahead=%d density=%.2f reusable_est=%d upcoming_est=%d scanned=%d wrote=%d reusable=%d",
recent_alloc, smoothed_alloc, strategy_delta, bufs_ahead,
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 4ba492c121..e373c20525 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1738,18 +1738,6 @@ pg_stat_get_bgwriter_requested_checkpoints(PG_FUNCTION_ARGS)
PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->requested_checkpoints);
}
-Datum
-pg_stat_get_bgwriter_buf_written_checkpoints(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_written_checkpoints);
-}
-
-Datum
-pg_stat_get_bgwriter_buf_written_clean(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_written_clean);
-}
-
Datum
pg_stat_get_bgwriter_maxwritten_clean(PG_FUNCTION_ARGS)
{
@@ -1778,24 +1766,6 @@ pg_stat_get_bgwriter_stat_reset_time(PG_FUNCTION_ARGS)
PG_RETURN_TIMESTAMPTZ(pgstat_fetch_stat_bgwriter()->stat_reset_timestamp);
}
-Datum
-pg_stat_get_buf_written_backend(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_written_backend);
-}
-
-Datum
-pg_stat_get_buf_fsync_backend(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_fsync_backend);
-}
-
-Datum
-pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
-}
-
Datum
pg_stat_get_buffer_actions(PG_FUNCTION_ARGS)
{
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index ee3c11db06..afab94ca96 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5600,16 +5600,6 @@
proname => 'pg_stat_get_bgwriter_requested_checkpoints', provolatile => 's',
proparallel => 'r', prorettype => 'int8', proargtypes => '',
prosrc => 'pg_stat_get_bgwriter_requested_checkpoints' },
-{ oid => '2771',
- descr => 'statistics: number of buffers written by the bgwriter during checkpoints',
- proname => 'pg_stat_get_bgwriter_buf_written_checkpoints', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_bgwriter_buf_written_checkpoints' },
-{ oid => '2772',
- descr => 'statistics: number of buffers written by the bgwriter for cleaning dirty buffers',
- proname => 'pg_stat_get_bgwriter_buf_written_clean', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_bgwriter_buf_written_clean' },
{ oid => '2773',
descr => 'statistics: number of times the bgwriter stopped processing when it had written too many buffers while cleaning',
proname => 'pg_stat_get_bgwriter_maxwritten_clean', provolatile => 's',
@@ -5629,18 +5619,6 @@
proname => 'pg_stat_get_checkpoint_sync_time', provolatile => 's',
proparallel => 'r', prorettype => 'float8', proargtypes => '',
prosrc => 'pg_stat_get_checkpoint_sync_time' },
-{ oid => '2775', descr => 'statistics: number of buffers written by backends',
- proname => 'pg_stat_get_buf_written_backend', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_buf_written_backend' },
-{ oid => '3063',
- descr => 'statistics: number of backend buffer writes that did their own fsync',
- proname => 'pg_stat_get_buf_fsync_backend', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_buf_fsync_backend' },
-{ oid => '2859', descr => 'statistics: number of buffer allocations',
- proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
- prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
{ oid => '8459', descr => 'statistics: counts of buffer actions taken by each backend type',
proname => 'pg_stat_get_buffer_actions', provolatile => 's', proisstrict => 'f',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 21f5f24e8c..7c79995a2b 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -452,9 +452,7 @@ typedef struct PgStat_MsgBgWriter
{
PgStat_MsgHdr m_hdr;
- PgStat_Counter m_buf_written_clean;
PgStat_Counter m_maxwritten_clean;
- PgStat_Counter m_buf_alloc;
} PgStat_MsgBgWriter;
/* ----------
@@ -467,9 +465,6 @@ typedef struct PgStat_MsgCheckpointer
PgStat_Counter m_timed_checkpoints;
PgStat_Counter m_requested_checkpoints;
- PgStat_Counter m_buf_written_checkpoints;
- PgStat_Counter m_buf_written_backend;
- PgStat_Counter m_buf_fsync_backend;
PgStat_Counter m_checkpoint_write_time; /* times in milliseconds */
PgStat_Counter m_checkpoint_sync_time;
} PgStat_MsgCheckpointer;
@@ -850,9 +845,7 @@ typedef struct PgStat_ArchiverStats
*/
typedef struct PgStat_BgWriterStats
{
- PgStat_Counter buf_written_clean;
PgStat_Counter maxwritten_clean;
- PgStat_Counter buf_alloc;
TimestampTz stat_reset_timestamp;
} PgStat_BgWriterStats;
@@ -866,9 +859,6 @@ typedef struct PgStat_CheckpointerStats
PgStat_Counter requested_checkpoints;
PgStat_Counter checkpoint_write_time; /* times in milliseconds */
PgStat_Counter checkpoint_sync_time;
- PgStat_Counter buf_written_checkpoints;
- PgStat_Counter buf_written_backend;
- PgStat_Counter buf_fsync_backend;
} PgStat_CheckpointerStats;
/*
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 5c5445bcd7..f88c060370 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1821,12 +1821,7 @@ pg_stat_bgwriter| SELECT pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints
pg_stat_get_bgwriter_requested_checkpoints() AS checkpoints_req,
pg_stat_get_checkpoint_write_time() AS checkpoint_write_time,
pg_stat_get_checkpoint_sync_time() AS checkpoint_sync_time,
- pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint,
- pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean,
pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
- pg_stat_get_buf_written_backend() AS buffers_backend,
- pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
- pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
pg_stat_buffer_actions| SELECT b.backend_type,
b.buffers_alloc,
--
2.27.0
Attachment: v7-0003-Rename-pg_stat_buffer_actions-to-pg_stat_buffers.patch (text/x-patch)
From c60cf6ec100aae062cbaa9cab1639eb077bca6ed Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 13 Sep 2021 16:29:50 -0400
Subject: [PATCH v7 3/3] Rename pg_stat_buffer_actions to pg_stat_buffers
Also, rename all members of view to allow for future expansion of buffer
types covered.
---
doc/src/sgml/monitoring.sgml | 32 +++++++++++++--------------
src/backend/catalog/system_views.sql | 12 +++++-----
src/backend/storage/buffer/freelist.c | 4 ++--
src/include/catalog/pg_proc.dat | 2 +-
src/test/regress/expected/rules.out | 14 ++++++------
src/test/regress/sql/stats.sql | 2 +-
6 files changed, 33 insertions(+), 33 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 96258ada18..1ee46694f0 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -445,11 +445,11 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</row>
<row>
- <entry><structname>pg_stat_buffer_actions</structname><indexterm><primary>pg_stat_buffer_actions</primary></indexterm></entry>
+ <entry><structname>pg_stat_buffers</structname><indexterm><primary>pg_stat_buffers</primary></indexterm></entry>
<entry>One row for each backend type showing statistics about
backend buffer activity. See
<link linkend="monitoring-pg-stat-buffer-actions-view">
- <structname>pg_stat_buffer_actions</structname></link> for details.
+ <structname>pg_stat_buffers</structname></link> for details.
</entry>
</row>
@@ -3441,19 +3441,19 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
</sect2>
<sect2 id="monitoring-pg-stat-buffer-actions-view">
- <title><structname>pg_stat_buffer_actions</structname></title>
+ <title><structname>pg_stat_buffers</structname></title>
<indexterm>
- <primary>pg_stat_buffer_actions</primary>
+ <primary>pg_stat_buffers</primary>
</indexterm>
<para>
- The <structname>pg_stat_buffer_actions</structname> view has a row for each
+ The <structname>pg_stat_buffers</structname> view has a row for each
backend type, containing global data for the cluster for that backend type.
</para>
- <table id="pg-stat-buffer-actions-view" xreflabel="pg_stat_buffer_actions">
- <title><structname>pg_stat_buffer_actions</structname> View</title>
+ <table id="pg-stat-buffer-actions-view" xreflabel="pg_stat_buffers">
+ <title><structname>pg_stat_buffers</structname> View</title>
<tgroup cols="1">
<thead>
<row>
@@ -3477,43 +3477,43 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>buffers_alloc</structfield> <type>integer</type>
+ <structfield>shared_buffers_alloc</structfield> <type>integer</type>
</para>
<para>
- Number of buffers allocated.
+ Number of shared buffers allocated.
</para></entry>
</row>
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>buffers_extend</structfield> <type>integer</type>
+ <structfield>shared_buffers_extend</structfield> <type>integer</type>
</para>
<para>
- Number of buffers extended.
+ Number of shared buffers extended.
</para></entry>
</row>
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>buffers_fsync</structfield> <type>integer</type>
+ <structfield>shared_buffers_fsync</structfield> <type>integer</type>
</para>
<para>
- Number of buffers fsynced. TODO: is this only shared buffers?
+ Number of shared buffers fsynced.
</para></entry>
</row>
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>buffers_write</structfield> <type>integer</type>
+ <structfield>shared_buffers_write</structfield> <type>integer</type>
</para>
<para>
- Number of buffers written.
+ Number of shared buffers written.
</para></entry>
</row>
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>buffers_write_strat</structfield> <type>integer</type>
+ <structfield>strategy_buffers_write</structfield> <type>integer</type>
</para>
<para>
Number of buffers written as part of a buffer access strategy.
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index a5d7972687..ff668fb256 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1067,14 +1067,14 @@ CREATE VIEW pg_stat_bgwriter AS
pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
-CREATE VIEW pg_stat_buffer_actions AS
+CREATE VIEW pg_stat_buffers AS
SELECT
b.backend_type,
- b.buffers_alloc,
- b.buffers_extend,
- b.buffers_fsync,
- b.buffers_write,
- b.buffers_write_strat
+ b.shared_buffers_alloc,
+ b.shared_buffers_extend,
+ b.shared_buffers_fsync,
+ b.shared_buffers_write,
+ b.strategy_buffers_write
FROM pg_stat_get_buffer_actions() b;
CREATE VIEW pg_stat_wal AS
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 17b76e9c2c..c85ec3eec0 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -690,7 +690,7 @@ StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool *from_
{
/*
* If we decide to use the dirty buffer selected by StrategyGetBuffer, then
- * ensure that we count it as such in pg_stat_buffer_actions view.
+ * ensure that we count it as such in pg_stat_buffers view.
*/
*from_ring = true;
@@ -713,7 +713,7 @@ StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool *from_
* Since we will not be writing out a dirty buffer from the ring, set
* from_ring to false so that the caller does not count this write as a
* "strategy write" and can do proper bookkeeping for
- * pg_stat_buffer_actions.
+ * pg_stat_buffers.
*/
*from_ring = false;
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index afab94ca96..02d161aa5c 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5626,7 +5626,7 @@
proparallel => 'r', prorettype => 'record', proargtypes => '',
proallargtypes => '{text,int8,int8,int8,int8,int8}',
proargmodes => '{o,o,o,o,o,o}',
- proargnames => '{backend_type,buffers_alloc,buffers_extend,buffers_fsync,buffers_write,buffers_write_strat}',
+ proargnames => '{backend_type,shared_buffers_alloc,shared_buffers_extend,shared_buffers_fsync,shared_buffers_write,strategy_buffers_write}',
prosrc => 'pg_stat_get_buffer_actions' },
{ oid => '1136', descr => 'statistics: information about WAL activity',
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index f88c060370..8a61ef93e1 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1823,13 +1823,13 @@ pg_stat_bgwriter| SELECT pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints
pg_stat_get_checkpoint_sync_time() AS checkpoint_sync_time,
pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
-pg_stat_buffer_actions| SELECT b.backend_type,
- b.buffers_alloc,
- b.buffers_extend,
- b.buffers_fsync,
- b.buffers_write,
- b.buffers_write_strat
- FROM pg_stat_get_buffer_actions() b(backend_type, buffers_alloc, buffers_extend, buffers_fsync, buffers_write, buffers_write_strat);
+pg_stat_buffers| SELECT b.backend_type,
+ b.shared_buffers_alloc,
+ b.shared_buffers_extend,
+ b.shared_buffers_fsync,
+ b.shared_buffers_write,
+ b.strategy_buffers_write
+ FROM pg_stat_get_buffer_actions() b(backend_type, shared_buffers_alloc, shared_buffers_extend, shared_buffers_fsync, shared_buffers_write, strategy_buffers_write);
pg_stat_database| SELECT d.oid AS datid,
d.datname,
CASE
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index fb4b613d4b..e908ac2591 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -176,5 +176,5 @@ FROM prevstats AS pr;
DROP TABLE trunc_stats_test, trunc_stats_test1, trunc_stats_test2, trunc_stats_test3, trunc_stats_test4;
DROP TABLE prevstats;
-SELECT * FROM pg_stat_buffer_actions;
+SELECT * FROM pg_stat_buffers;
-- End of Stats Test
--
2.27.0
Attachment: buffer_type_enum_addition.patch (text/x-patch)
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index e52599dc75..1b201ce829 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -1090,7 +1090,7 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
* Count the subset of writes where backends have to do their own
* fsync
*/
- pgstat_increment_buffer_action(BA_Fsync);
+ pgstat_increment_buffer_action(BT_Shared, BA_Fsync);
LWLockRelease(CheckpointerCommLock);
return false;
}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index ff219038e2..0aa1ba8830 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -966,7 +966,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
/*
* Extends counted here are only those that go through shared buffers
*/
- pgstat_increment_buffer_action(BA_Extend);
+ pgstat_increment_buffer_action(BT_Shared, BA_Extend);
/* new buffers are zero-filled */
MemSet((char *) bufBlock, 0, BLCKSZ);
/* don't set checksum for all-zero page */
@@ -1215,7 +1215,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf),
LW_SHARED))
{
- BufferActionType buffer_action;
+ BufferType buffer_type;
/*
* If using a nondefault strategy, and writing the buffer
* would require a WAL flush, let the strategy decide whether
@@ -1253,8 +1253,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* only be a "regular" write of a dirty buffer.
*/
- buffer_action = from_ring ? BA_Write_Strat : BA_Write;
- pgstat_increment_buffer_action(buffer_action);
+ buffer_type = from_ring ? BT_Strategy : BT_Shared;
+ pgstat_increment_buffer_action(buffer_type, BA_Write);
/* OK, do the I/O */
@@ -2559,7 +2559,7 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
* buffer is clean by the time we've locked it.)
*/
- pgstat_increment_buffer_action(BA_Write);
+ pgstat_increment_buffer_action(BT_Shared, BA_Write);
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 17b76e9c2c..fd95cba478 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -249,7 +249,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
* the rate of buffer consumption. Note that buffers recycled by a
* strategy object are intentionally not counted here.
*/
- pgstat_increment_buffer_action(BA_Alloc);
+ pgstat_increment_buffer_action(BT_Shared, BA_Alloc);
pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);
/*
diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index f8f914ac7e..a0833a733b 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -1057,14 +1057,14 @@ pgstat_access_backend_status_array(void)
}
void
-pgstat_increment_buffer_action(BufferActionType ba_type)
+pgstat_increment_buffer_action(BufferType buffer_type, BufferActionType buffer_action_type)
{
volatile PgBackendStatus *beentry = MyBEEntry;
if (!beentry || !pgstat_track_activities)
return;
- switch (ba_type)
+ switch (buffer_action_type)
{
case BA_Alloc:
pg_atomic_write_u64(&beentry->buffer_action_stats.allocs,
@@ -1079,18 +1079,18 @@ pgstat_increment_buffer_action(BufferActionType ba_type)
pg_atomic_read_u64(&beentry->buffer_action_stats.fsyncs) + 1);
break;
case BA_Write:
- pg_atomic_write_u64(&beentry->buffer_action_stats.writes,
- pg_atomic_read_u64(&beentry->buffer_action_stats.writes) + 1);
- break;
- case BA_Write_Strat:
- pg_atomic_write_u64(&beentry->buffer_action_stats.writes_strat,
- pg_atomic_read_u64(&beentry->buffer_action_stats.writes_strat) + 1);
+ if (buffer_type == BT_Strategy)
+ pg_atomic_write_u64(&beentry->buffer_action_stats.writes_strat,
+ pg_atomic_read_u64(&beentry->buffer_action_stats.writes_strat) + 1);
+ else if (buffer_type == BT_Shared)
+ pg_atomic_write_u64(&beentry->buffer_action_stats.writes,
+ pg_atomic_read_u64(&beentry->buffer_action_stats.writes) + 1);
break;
default:
ereport(LOG,
(errmsg(
"Statistics on Buffer Action Type, %d, are not currently collected.",
- ba_type)));
+ buffer_action_type)));
}
}
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 03d5e464a9..b6885a5d66 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -339,6 +339,15 @@ typedef enum BackendType
BACKEND_NUM_TYPES,
} BackendType;
+typedef enum BufferType
+{
+ BT_Shared,
+ BT_Strategy,
+ BT_Local,
+ BUFFER_NUM_TYPES,
+} BufferType;
+
+// TODO: should BufferAction be BufferAccess? See BufferAccessStrategyType
typedef enum BufferActionType
{
BA_Invalid = 0,
@@ -346,7 +355,6 @@ typedef enum BufferActionType
BA_Extend,
BA_Fsync,
BA_Write,
- BA_Write_Strat,
BUFFER_ACTION_NUM_TYPES,
} BufferActionType;
diff --git a/src/include/utils/backend_status.h b/src/include/utils/backend_status.h
index c23b74b4a6..fb1334401e 100644
--- a/src/include/utils/backend_status.h
+++ b/src/include/utils/backend_status.h
@@ -317,7 +317,7 @@ extern const char *pgstat_get_crashed_backend_activity(int pid, char *buffer,
extern uint64 pgstat_get_my_query_id(void);
extern volatile PgBackendStatus *pgstat_access_backend_status_array(void);
-extern void pgstat_increment_buffer_action(BufferActionType ba_type);
+extern void pgstat_increment_buffer_action(BufferType buffer_type, BufferActionType buffer_action_type);
/* ----------
Hello Melanie
On 2021-Sep-13, Melanie Plageman wrote:
I also think it makes sense to rename the pg_stat_buffer_actions view to
pg_stat_buffers and to name the columns using both the buffer action
type and buffer type -- e.g. shared, strategy, local. This leaves open
the possibility of counting buffer actions done on other non-shared
buffers -- like those done while building indexes or those using local
buffers. The third patch in the set does this (I wanted to see if it
made sense before fixing it up into the first patch in the set).
What do you think of the idea of having the "shared/strategy/local"
attribute be a column? So you'd have up to three rows per buffer action
type. Users wishing to see an aggregate can just aggregate them, just
like they'd do with pg_buffercache. I think that leads to an easy
decision with regards to this point:
I attached a patch with the outline of this idea
(buffer_type_enum_addition.patch). It doesn't work because
pg_stat_get_buffer_actions() uses the BufferActionType as an index into
the values array returned. If I wanted to use a combination of the two
enums as an indexing mechanism (BufferActionType and BufferType), we
would end up with a tuple having every combination of the two
enums--some of which aren't valid. It might not make sense to implement
this. I do think it is useful to think of these stats as a combination
of a buffer action and a type of buffer.
Does that seem sensible?
(It's weird to have enum values that are there just to indicate what's
the maximum value. I think that sort of thing is better done by having
a "#define LAST_THING" that takes the last valid value from the enum.
That would free you from having to handle the last value in switch
blocks, for example. LAST_OCLASS in dependency.h is a precedent on this.)
--
Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/
"That sort of implies that there are Emacs keystrokes which aren't obscure.
I've been using it daily for 2 years now and have yet to discover any key
sequence which makes any sense." (Paul Thomas)
On Tue, Sep 14, 2021 at 9:30 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
On 2021-Sep-13, Melanie Plageman wrote:
I also think it makes sense to rename the pg_stat_buffer_actions view to
pg_stat_buffers and to name the columns using both the buffer action
type and buffer type -- e.g. shared, strategy, local. This leaves open
the possibility of counting buffer actions done on other non-shared
buffers -- like those done while building indexes or those using local
buffers. The third patch in the set does this (I wanted to see if it
made sense before fixing it up into the first patch in the set).

What do you think of the idea of having the "shared/strategy/local"
attribute be a column? So you'd have up to three rows per buffer action
type. Users wishing to see an aggregate can just aggregate them, just
like they'd do with pg_buffercache. I think that leads to an easy
decision with regards to this point:
I have rewritten the code to implement this.
(It's weird to have enum values that are there just to indicate what's
the maximum value. I think that sort of thing is better done by having
a "#define LAST_THING" that takes the last valid value from the enum.
That would free you from having to handle the last value in switch
blocks, for example. LAST_OCLASS in dependency.h is a precedent on this.)
I have made this change.
The attached v8 patchset is rewritten to add in an additional dimension
-- buffer type. Now, a backend keeps track of how many buffers of a
particular type (e.g. shared, local) it has accessed in a particular way
(e.g. alloc, write). It also changes the naming of various structures
and the view members.
Previously, stats reset did not work since it did not consider live
backends' counters. Now, the reset message includes the current live
backends' counters to be tracked by the stats collector and used when
the view is queried.
The reset message is one of the areas in which I still need to do some
work -- I shoved the array of PgBufferAccesses into the existing reset
message used for checkpointer, bgwriter, etc. Before making a new type
of message, I would like feedback from a reviewer about the approach.
There are various TODOs in the code which are actually questions for the
reviewer. Once I have some feedback, it will be easier to address these
items.
There are a few other items which may be material for other commits that
I would also like to do:
1) write wrapper functions for smgr* functions which count buffer
accesses of the appropriate type. I wasn't sure if these should
literally just take all the parameters that the smgr* functions take +
buffer type. Once these exist, there will be less possibility for
regressions in which new code is added using smgr* functions without
counting this buffer activity. Once I add these, I was going to go
through and replace existing calls to smgr* functions and thereby start
counting currently uncounted buffer type accesses (direct, local, etc).
2) Separate checkpointer and bgwriter into two views and add additional
stats to the bgwriter view.
3) Consider adding a helper function to pgstatfuncs.c to help create the
tuplestore. These functions all have quite a few lines which are exactly
the same, and I thought it might be nice to do something about that:
pg_stat_get_progress_info(PG_FUNCTION_ARGS)
pg_stat_get_activity(PG_FUNCTION_ARGS)
pg_stat_get_buffers_accesses(PG_FUNCTION_ARGS)
pg_stat_get_slru(PG_FUNCTION_ARGS)
I can imagine a function that takes a Datums array, a nulls array, and a
ResultSetInfo and then makes the tuplestore -- though I think that will
use more memory. Perhaps we could make a macro which does the initial
error checking (checking if caller supports returning a tuplestore)? I'm
not sure if there is something meaningful here, but I thought I would
ask.
Finally, I haven't removed the test in pg_stats and haven't done a final
pass for comment clarity, alphabetization, etc on this version.
- Melanie
Attachments:
v8-0002-Remove-superfluous-bgwriter-stats.patch (text/x-patch; charset=US-ASCII)
From 479fdfc53d1a6ee9943bdc580884866e99497673 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Thu, 2 Sep 2021 11:47:41 -0400
Subject: [PATCH v8 2/2] Remove superfluous bgwriter stats
Remove stats from pg_stat_bgwriter which are now more clearly expressed
in pg_stat_buffers.
TODO:
- make pg_stat_checkpointer view and move relevant stats into it
- add additional stats to pg_stat_bgwriter
---
doc/src/sgml/monitoring.sgml | 47 ---------------------------
src/backend/catalog/system_views.sql | 6 +---
src/backend/postmaster/checkpointer.c | 26 ---------------
src/backend/postmaster/pgstat.c | 5 ---
src/backend/storage/buffer/bufmgr.c | 6 ----
src/backend/utils/adt/pgstatfuncs.c | 30 -----------------
src/include/catalog/pg_proc.dat | 22 -------------
src/include/pgstat.h | 10 ------
src/test/regress/expected/rules.out | 5 ---
9 files changed, 1 insertion(+), 156 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 60627c692a..08772652ac 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3416,24 +3416,6 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
</para></entry>
</row>
- <row>
- <entry role="catalog_table_entry"><para role="column_definition">
- <structfield>buffers_checkpoint</structfield> <type>bigint</type>
- </para>
- <para>
- Number of buffers written during checkpoints
- </para></entry>
- </row>
-
- <row>
- <entry role="catalog_table_entry"><para role="column_definition">
- <structfield>buffers_clean</structfield> <type>bigint</type>
- </para>
- <para>
- Number of buffers written by the background writer
- </para></entry>
- </row>
-
<row>
<entry role="catalog_table_entry"><para role="column_definition">
<structfield>maxwritten_clean</structfield> <type>bigint</type>
@@ -3444,35 +3426,6 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
</para></entry>
</row>
- <row>
- <entry role="catalog_table_entry"><para role="column_definition">
- <structfield>buffers_backend</structfield> <type>bigint</type>
- </para>
- <para>
- Number of buffers written directly by a backend
- </para></entry>
- </row>
-
- <row>
- <entry role="catalog_table_entry"><para role="column_definition">
- <structfield>buffers_backend_fsync</structfield> <type>bigint</type>
- </para>
- <para>
- Number of times a backend had to execute its own
- <function>fsync</function> call (normally the background writer handles those
- even when the backend does its own write)
- </para></entry>
- </row>
-
- <row>
- <entry role="catalog_table_entry"><para role="column_definition">
- <structfield>buffers_alloc</structfield> <type>bigint</type>
- </para>
- <para>
- Number of buffers allocated
- </para></entry>
- </row>
-
<row>
<entry role="catalog_table_entry"><para role="column_definition">
<structfield>stats_reset</structfield> <type>timestamp with time zone</type>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 30280d520b..c45c261f4b 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1058,18 +1058,14 @@ CREATE VIEW pg_stat_archiver AS
s.stats_reset
FROM pg_stat_get_archiver() s;
+-- TODO: make separate pg_stat_checkpointer view
CREATE VIEW pg_stat_bgwriter AS
SELECT
pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints_timed,
pg_stat_get_bgwriter_requested_checkpoints() AS checkpoints_req,
pg_stat_get_checkpoint_write_time() AS checkpoint_write_time,
pg_stat_get_checkpoint_sync_time() AS checkpoint_sync_time,
- pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint,
- pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean,
pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
- pg_stat_get_buf_written_backend() AS buffers_backend,
- pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
- pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
CREATE VIEW pg_stat_buffers AS
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index fe6bd15506..6d3a8da948 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -90,17 +90,9 @@
* requesting backends since the last checkpoint start. The flags are
* chosen so that OR'ing is the correct way to combine multiple requests.
*
- * num_backend_writes is used to count the number of buffer writes performed
- * by user backend processes. This counter should be wide enough that it
- * can't overflow during a single processing cycle. num_backend_fsync
- * counts the subset of those writes that also had to do their own fsync,
- * because the checkpointer failed to absorb their request.
- *
* The requests array holds fsync requests sent by backends and not yet
* absorbed by the checkpointer.
*
- * Unlike the checkpoint fields, num_backend_writes, num_backend_fsync, and
- * the requests fields are protected by CheckpointerCommLock.
*----------
*/
typedef struct
@@ -124,9 +116,6 @@ typedef struct
ConditionVariable start_cv; /* signaled when ckpt_started advances */
ConditionVariable done_cv; /* signaled when ckpt_done advances */
- uint32 num_backend_writes; /* counts user backend buffer writes */
- uint32 num_backend_fsync; /* counts user backend fsync calls */
-
int num_requests; /* current # of requests */
int max_requests; /* allocated array size */
CheckpointerRequest requests[FLEXIBLE_ARRAY_MEMBER];
@@ -1087,10 +1076,6 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
- /* Count all backend writes regardless of if they fit in the queue */
- if (!AmBackgroundWriterProcess())
- CheckpointerShmem->num_backend_writes++;
-
/*
* If the checkpointer isn't running or the request queue is full, the
* backend will have to perform its own fsync request. But before forcing
@@ -1104,8 +1089,6 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
* Count the subset of writes where backends have to do their own
* fsync
*/
- if (!AmBackgroundWriterProcess())
- CheckpointerShmem->num_backend_fsync++;
pgstat_increment_buffer_access_type(BA_Fsync, Buf_Shared);
LWLockRelease(CheckpointerCommLock);
return false;
@@ -1263,15 +1246,6 @@ AbsorbSyncRequests(void)
LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
- /* Transfer stats counts into pending pgstats message */
- PendingCheckpointerStats.m_buf_written_backend
- += CheckpointerShmem->num_backend_writes;
- PendingCheckpointerStats.m_buf_fsync_backend
- += CheckpointerShmem->num_backend_fsync;
-
- CheckpointerShmem->num_backend_writes = 0;
- CheckpointerShmem->num_backend_fsync = 0;
-
/*
* We try to avoid holding the lock for a long time by copying the request
* array, and processing the requests after releasing the lock.
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 40f646b7c6..849a9be702 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -5586,9 +5586,7 @@ pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
static void
pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
{
- globalStats.bgwriter.buf_written_clean += msg->m_buf_written_clean;
globalStats.bgwriter.maxwritten_clean += msg->m_maxwritten_clean;
- globalStats.bgwriter.buf_alloc += msg->m_buf_alloc;
}
/* ----------
@@ -5604,9 +5602,6 @@ pgstat_recv_checkpointer(PgStat_MsgCheckpointer *msg, int len)
globalStats.checkpointer.requested_checkpoints += msg->m_requested_checkpoints;
globalStats.checkpointer.checkpoint_write_time += msg->m_checkpoint_write_time;
globalStats.checkpointer.checkpoint_sync_time += msg->m_checkpoint_sync_time;
- globalStats.checkpointer.buf_written_checkpoints += msg->m_buf_written_checkpoints;
- globalStats.checkpointer.buf_written_backend += msg->m_buf_written_backend;
- globalStats.checkpointer.buf_fsync_backend += msg->m_buf_fsync_backend;
}
static void
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 135b2d3925..f3c0fd96b3 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2177,7 +2177,6 @@ BufferSync(int flags)
if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
{
TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
- PendingCheckpointerStats.m_buf_written_checkpoints++;
num_written++;
}
}
@@ -2286,9 +2285,6 @@ BgBufferSync(WritebackContext *wb_context)
*/
strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
- /* Report buffer alloc counts to pgstat */
- PendingBgWriterStats.m_buf_alloc += recent_alloc;
-
/*
* If we're not running the LRU scan, just stop after doing the stats
* stuff. We mark the saved state invalid so that we can recover sanely
@@ -2485,8 +2481,6 @@ BgBufferSync(WritebackContext *wb_context)
reusable_buffers++;
}
- PendingBgWriterStats.m_buf_written_clean += num_written;
-
#ifdef BGW_DEBUG
elog(DEBUG1, "bgwriter: recent_alloc=%u smoothed=%.2f delta=%ld ahead=%d density=%.2f reusable_est=%d upcoming_est=%d scanned=%d wrote=%d reusable=%d",
recent_alloc, smoothed_alloc, strategy_delta, bufs_ahead,
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index d4e0ce4143..94a4500ead 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1738,18 +1738,6 @@ pg_stat_get_bgwriter_requested_checkpoints(PG_FUNCTION_ARGS)
PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->requested_checkpoints);
}
-Datum
-pg_stat_get_bgwriter_buf_written_checkpoints(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_written_checkpoints);
-}
-
-Datum
-pg_stat_get_bgwriter_buf_written_clean(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_written_clean);
-}
-
Datum
pg_stat_get_bgwriter_maxwritten_clean(PG_FUNCTION_ARGS)
{
@@ -1778,24 +1766,6 @@ pg_stat_get_bgwriter_stat_reset_time(PG_FUNCTION_ARGS)
PG_RETURN_TIMESTAMPTZ(pgstat_fetch_stat_bgwriter()->stat_reset_timestamp);
}
-Datum
-pg_stat_get_buf_written_backend(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_written_backend);
-}
-
-Datum
-pg_stat_get_buf_fsync_backend(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_fsync_backend);
-}
-
-Datum
-pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
-}
-
Datum
pg_stat_get_buffers_accesses(PG_FUNCTION_ARGS)
{
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 54661e2b5f..02f624c18c 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5600,16 +5600,6 @@
proname => 'pg_stat_get_bgwriter_requested_checkpoints', provolatile => 's',
proparallel => 'r', prorettype => 'int8', proargtypes => '',
prosrc => 'pg_stat_get_bgwriter_requested_checkpoints' },
-{ oid => '2771',
- descr => 'statistics: number of buffers written by the bgwriter during checkpoints',
- proname => 'pg_stat_get_bgwriter_buf_written_checkpoints', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_bgwriter_buf_written_checkpoints' },
-{ oid => '2772',
- descr => 'statistics: number of buffers written by the bgwriter for cleaning dirty buffers',
- proname => 'pg_stat_get_bgwriter_buf_written_clean', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_bgwriter_buf_written_clean' },
{ oid => '2773',
descr => 'statistics: number of times the bgwriter stopped processing when it had written too many buffers while cleaning',
proname => 'pg_stat_get_bgwriter_maxwritten_clean', provolatile => 's',
@@ -5629,18 +5619,6 @@
proname => 'pg_stat_get_checkpoint_sync_time', provolatile => 's',
proparallel => 'r', prorettype => 'float8', proargtypes => '',
prosrc => 'pg_stat_get_checkpoint_sync_time' },
-{ oid => '2775', descr => 'statistics: number of buffers written by backends',
- proname => 'pg_stat_get_buf_written_backend', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_buf_written_backend' },
-{ oid => '3063',
- descr => 'statistics: number of backend buffer writes that did their own fsync',
- proname => 'pg_stat_get_buf_fsync_backend', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_buf_fsync_backend' },
-{ oid => '2859', descr => 'statistics: number of buffer allocations',
- proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
- prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
{ oid => '8459', descr => 'statistics: counts of various types of accesses of buffers done by each backend type',
proname => 'pg_stat_get_buffers_accesses', provolatile => 's', proisstrict => 'f',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index c97b6897cf..9b4217cbbb 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -489,9 +489,7 @@ typedef struct PgStat_MsgBgWriter
{
PgStat_MsgHdr m_hdr;
- PgStat_Counter m_buf_written_clean;
PgStat_Counter m_maxwritten_clean;
- PgStat_Counter m_buf_alloc;
} PgStat_MsgBgWriter;
/* ----------
@@ -504,9 +502,6 @@ typedef struct PgStat_MsgCheckpointer
PgStat_Counter m_timed_checkpoints;
PgStat_Counter m_requested_checkpoints;
- PgStat_Counter m_buf_written_checkpoints;
- PgStat_Counter m_buf_written_backend;
- PgStat_Counter m_buf_fsync_backend;
PgStat_Counter m_checkpoint_write_time; /* times in milliseconds */
PgStat_Counter m_checkpoint_sync_time;
} PgStat_MsgCheckpointer;
@@ -884,9 +879,7 @@ typedef struct PgStat_ArchiverStats
*/
typedef struct PgStat_BgWriterStats
{
- PgStat_Counter buf_written_clean;
PgStat_Counter maxwritten_clean;
- PgStat_Counter buf_alloc;
TimestampTz stat_reset_timestamp;
} PgStat_BgWriterStats;
@@ -900,9 +893,6 @@ typedef struct PgStat_CheckpointerStats
PgStat_Counter requested_checkpoints;
PgStat_Counter checkpoint_write_time; /* times in milliseconds */
PgStat_Counter checkpoint_sync_time;
- PgStat_Counter buf_written_checkpoints;
- PgStat_Counter buf_written_backend;
- PgStat_Counter buf_fsync_backend;
} PgStat_CheckpointerStats;
/*
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 9172b0fcd2..ac2f7cf61e 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1821,12 +1821,7 @@ pg_stat_bgwriter| SELECT pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints
pg_stat_get_bgwriter_requested_checkpoints() AS checkpoints_req,
pg_stat_get_checkpoint_write_time() AS checkpoint_write_time,
pg_stat_get_checkpoint_sync_time() AS checkpoint_sync_time,
- pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint,
- pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean,
pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
- pg_stat_get_buf_written_backend() AS buffers_backend,
- pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
- pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
pg_stat_buffers| SELECT b.backend_type,
b.buffer_type,
--
2.27.0
v8-0001-Add-system-view-tracking-accesses-to-buffers.patch (text/x-patch; charset=US-ASCII)
From 156f8c96ac9bab619c55a072624e7ee4dfa82c79 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Thu, 2 Sep 2021 11:33:59 -0400
Subject: [PATCH v8 1/2] Add system view tracking accesses to buffers
Add pg_stat_buffers, a system view which tracks the number of buffers of
a particular type (e.g. shared, local) allocated, written, fsync'd, and
extended by each backend type.
Some of these should always be zero. For example, a checkpointer backend
will not use a BufferAccessStrategy (currently), so buffer type
"strategy" for checkpointer will be 0 for all buffer access types
(alloc, write, fsync, and extend).
All backends increment a counter in their PgBackendStatus when
performing a buffer access. On exit, backends send these stats to the
stats collector to be persisted.
When stats are reset, the reset message includes the current values of
all the live backends' buffer access counters. When receiving this
message, the stats collector will 1) save these reset values in an array
of "resets" and 2) zero out the exited backends' saved buffer access
counters. This is required for accurate stats after a reset without
writing to other backends' PgBackendStatus.
When the pg_stat_buffers view is queried, sum live backends' stats with
saved stats from exited backends and subtract saved reset stats,
returning the total.
Each row of the view is for a particular backend type and a particular
buffer type (e.g. shared buffer accesses by checkpointer) and each
column in the view is the total number of buffers of each kind of buffer
access (e.g. written). So a cell in the view would be, for example, the
number of shared buffers written by checkpointer since the last stats
reset.
Note that this commit does not add code to increment buffer accesses for
all types of buffers. It includes all possible combinations in the stats
view but doesn't populate all of them.
TODO:
- TODOs in code (which are questions for the reviewer)
- Wrappers for smgr funcs to protect against regressions and cover other
buffer types
- Consider helper func in pgstatfuncs.c to refactor out some of the
redundant tuplestore creation code from pg_stat_get_progress_info,
pg_stat_get_activity, etc
- Remove pg_stats test I added
- Additional polish for comments, check ordering of function
definitions, datatypes, etc
- When finished, catalog bump
- pgindent
---
doc/src/sgml/monitoring.sgml | 116 +++++++++++++++-
src/backend/catalog/system_views.sql | 11 ++
src/backend/postmaster/checkpointer.c | 3 +
src/backend/postmaster/pgstat.c | 138 +++++++++++++++++++-
src/backend/storage/buffer/bufmgr.c | 37 +++++-
src/backend/storage/buffer/freelist.c | 22 +++-
src/backend/utils/activity/backend_status.c | 53 +++++++-
src/backend/utils/adt/pgstatfuncs.c | 137 +++++++++++++++++++
src/backend/utils/init/miscinit.c | 25 ++++
src/include/access/xlog.h | 1 +
src/include/catalog/pg_proc.dat | 9 ++
src/include/miscadmin.h | 23 ++++
src/include/pgstat.h | 41 +++++-
src/include/storage/buf_internals.h | 4 +-
src/include/utils/backend_status.h | 15 +++
src/test/regress/expected/rules.out | 8 ++
src/test/regress/sql/stats.sql | 6 +
17 files changed, 631 insertions(+), 18 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 2281ba120f..60627c692a 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -444,6 +444,15 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</entry>
</row>
+ <row>
+ <entry><structname>pg_stat_buffers</structname><indexterm><primary>pg_stat_buffers</primary></indexterm></entry>
+ <entry>A row for each buffer type for each backend type showing
+ statistics about backend buffer activity. See
+ <link linkend="monitoring-pg-stat-buffers-view">
+ <structname>pg_stat_buffers</structname></link> for details.
+ </entry>
+ </row>
+
<row>
<entry><structname>pg_stat_wal</structname><indexterm><primary>pg_stat_wal</primary></indexterm></entry>
<entry>One row only, showing statistics about WAL activity. See
@@ -3478,6 +3487,101 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
</sect2>
+ <sect2 id="monitoring-pg-stat-buffers-view">
+ <title><structname>pg_stat_buffers</structname></title>
+
+ <indexterm>
+ <primary>pg_stat_buffers</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_stat_buffers</structname> view has a row for each buffer
+ type for each backend type, containing global data for the cluster for that
+ backend and buffer type.
+ </para>
+
+ <table id="pg-stat-buffer-actions-view" xreflabel="pg_stat_buffers">
+ <title><structname>pg_stat_buffers</structname> View</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>backend_type</structfield> <type>text</type>
+ </para>
+ <para>
+ Type of backend (e.g. background worker, autovacuum worker).
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>buffer_type</structfield> <type>text</type>
+ </para>
+ <para>
+ Type of buffer accessed (e.g. shared).
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>alloc</structfield> <type>integer</type>
+ </para>
+ <para>
+ Number of buffers allocated.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>extend</structfield> <type>integer</type>
+ </para>
+ <para>
+ Number of buffers extended.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>fsync</structfield> <type>integer</type>
+ </para>
+ <para>
+ Number of buffers fsynced.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>write</structfield> <type>integer</type>
+ </para>
+ <para>
+ Number of buffers written.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
+ </para>
+ <para>
+ Time at which these statistics were last reset.
+ </para></entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ </sect2>
+
<sect2 id="monitoring-pg-stat-wal-view">
<title><structname>pg_stat_wal</structname></title>
@@ -5074,12 +5178,14 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
</para>
<para>
Resets some cluster-wide statistics counters to zero, depending on the
- argument. The argument can be <literal>bgwriter</literal> to reset
- all the counters shown in
- the <structname>pg_stat_bgwriter</structname>
+ argument. The argument can be <literal>bgwriter</literal> to reset all
+ the counters shown in the <structname>pg_stat_bgwriter</structname>
view, <literal>archiver</literal> to reset all the counters shown in
- the <structname>pg_stat_archiver</structname> view or <literal>wal</literal>
- to reset all the counters shown in the <structname>pg_stat_wal</structname> view.
+ the <structname>pg_stat_archiver</structname> view,
+ <literal>wal</literal> to reset all the counters shown in the
+ <structname>pg_stat_wal</structname> view, or
+ <literal>buffers</literal> to reset all the counters shown in the
+ <structname>pg_stat_buffers</structname> view.
</para>
<para>
This function is restricted to superusers by default, but other users
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 55f6e3711d..30280d520b 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1072,6 +1072,17 @@ CREATE VIEW pg_stat_bgwriter AS
pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+CREATE VIEW pg_stat_buffers AS
+SELECT
+ b.backend_type,
+ b.buffer_type,
+ b.alloc,
+ b.extend,
+ b.fsync,
+ b.write,
+ b.stats_reset
+FROM pg_stat_get_buffers_accesses() b;
+
CREATE VIEW pg_stat_wal AS
SELECT
w.wal_records,
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index be7366379d..fe6bd15506 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -577,6 +577,8 @@ HandleCheckpointerInterrupts(void)
ShutdownXLOG(0, 0);
pgstat_send_checkpointer();
pgstat_send_wal(true);
+ // TODO: is this needed
+ pgstat_report_buffers();
/* Normal exit from the checkpointer is here */
proc_exit(0); /* done */
@@ -1104,6 +1106,7 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
*/
if (!AmBackgroundWriterProcess())
CheckpointerShmem->num_backend_fsync++;
+ pgstat_increment_buffer_access_type(BA_Fsync, Buf_Shared);
LWLockRelease(CheckpointerCommLock);
return false;
}
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index b7d0fbaefd..40f646b7c6 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -63,6 +63,7 @@
#include "storage/pg_shmem.h"
#include "storage/proc.h"
#include "storage/procsignal.h"
+#include "utils/backend_status.h"
#include "utils/builtins.h"
#include "utils/guc.h"
#include "utils/memutils.h"
@@ -124,9 +125,12 @@ char *pgstat_stat_filename = NULL;
char *pgstat_stat_tmpname = NULL;
/*
- * BgWriter and WAL global statistics counters.
- * Stored directly in a stats message structure so they can be sent
- * without needing to copy things around. We assume these init to zeroes.
+ * BgWriter, Checkpointer, WAL, and I/O global statistics counters. I/O global
+ * statistics on various buffer actions are tracked in PgBackendStatus while a
+ * backend is alive and then sent to stats collector before a backend exits in
+ * a PgStat_MsgBufferActions.
+ * All others are stored directly in a stats message structure so they can be
+ * sent without needing to copy things around. We assume these init to zeroes.
*/
PgStat_MsgBgWriter PendingBgWriterStats;
PgStat_MsgCheckpointer PendingCheckpointerStats;
@@ -362,6 +366,7 @@ static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
static void pgstat_recv_checkpointer(PgStat_MsgCheckpointer *msg, int len);
+static void pgstat_recv_buffer_type_accesses(PgStat_MsgBufferTypeAccesses *msg, int len);
static void pgstat_recv_wal(PgStat_MsgWal *msg, int len);
static void pgstat_recv_slru(PgStat_MsgSLRU *msg, int len);
static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
@@ -974,6 +979,7 @@ pgstat_report_stat(bool disconnect)
/* Now, send function statistics */
pgstat_send_funcstats();
+
/* Send WAL statistics */
pgstat_send_wal(true);
@@ -1452,6 +1458,42 @@ pgstat_reset_shared_counters(const char *target)
msg.m_resettarget = RESET_ARCHIVER;
else if (strcmp(target, "bgwriter") == 0)
msg.m_resettarget = RESET_BGWRITER;
+ else if (strcmp(target, "buffers") == 0) {
+ int i, buffer_type;
+ volatile PgBackendStatus *beentry;
+
+ memset(msg.resets, 0, sizeof(PgStat_MsgBufferTypeAccesses) * BACKEND_NUM_TYPES);
+
+ msg.m_resettarget = RESET_BUFFERS;
+
+ beentry = pgstat_access_backend_status_array();
+ /*
+ * Loop through live backends and capture reset values
+ */
+ for (i = 0; i < MaxBackends + NUM_AUXPROCTYPES; i++)
+ {
+ PgBufferAccesses *live_accesses;
+ PgStatBufferAccesses *reset_accesses;
+ beentry++;
+ /* Don't count dead backends. They should already be counted */
+ if (beentry->st_procpid == 0)
+ continue;
+
+ live_accesses = (PgBufferAccesses *) beentry->buffer_access_stats;
+ reset_accesses = msg.resets[beentry->st_backendType].buffer_type_accesses;
+
+ for (buffer_type = 0; buffer_type < BUFFER_NUM_TYPES; buffer_type++)
+ {
+ uint64 live_allocs = pg_atomic_read_u64(&live_accesses->allocs);
+ reset_accesses->allocs = live_allocs;
+ reset_accesses->extends = pg_atomic_read_u64(&live_accesses->extends);
+ reset_accesses->fsyncs = pg_atomic_read_u64(&live_accesses->fsyncs);
+ reset_accesses->writes = pg_atomic_read_u64(&live_accesses->writes);
+ reset_accesses++;
+ live_accesses++;
+ }
+ }
+ }
else if (strcmp(target, "wal") == 0)
msg.m_resettarget = RESET_WAL;
else
@@ -2998,6 +3040,13 @@ static void
pgstat_shutdown_hook(int code, Datum arg)
{
Assert(!pgstat_is_shutdown);
+ /*
+ * Only need to send stats on buffer accesses when a process exits, as
+ * pg_stat_get_buffers() will read from live backends' PgBackendStatus and
+ * then sum this with totals from exited backends persisted by the stats
+ * collector.
+ */
+ pgstat_report_buffers();
/*
* If we got as far as discovering our own database ID, we can report what
@@ -3180,6 +3229,43 @@ pgstat_send_checkpointer(void)
MemSet(&PendingCheckpointerStats, 0, sizeof(PendingCheckpointerStats));
}
+/*
+ * Called for a single backend at the time of death to send its I/O stats to
+ * the stats collector so that they may be persisted.
+ */
+void
+pgstat_report_buffers(void)
+{
+ PgStat_MsgBufferTypeAccesses msg;
+ PgBufferAccesses *src_accesses;
+ PgStatBufferAccesses *dest_accesses;
+ int buffer_type;
+
+ volatile PgBackendStatus *beentry = MyBEEntry;
+
+ if (!beentry)
+ return;
+
+ MemSet(&msg, 0, sizeof(msg));
+ msg.backend_type = beentry->st_backendType;
+
+ src_accesses = (PgBufferAccesses *) &beentry->buffer_access_stats;
+ dest_accesses = msg.buffer_type_accesses;
+
+ for (buffer_type = 0; buffer_type < BUFFER_NUM_TYPES; buffer_type++)
+ {
+ dest_accesses->allocs += pg_atomic_read_u64(&src_accesses->allocs);
+ dest_accesses->extends += pg_atomic_read_u64(&src_accesses->extends);
+ dest_accesses->fsyncs += pg_atomic_read_u64(&src_accesses->fsyncs);
+ dest_accesses->writes += pg_atomic_read_u64(&src_accesses->writes);
+ dest_accesses++;
+ src_accesses++;
+ }
+
+ pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_BUFFER_ACTIONS);
+ pgstat_send(&msg, sizeof(msg));
+}
+
/* ----------
* pgstat_send_wal() -
*
@@ -3522,6 +3608,10 @@ PgstatCollectorMain(int argc, char *argv[])
pgstat_recv_checkpointer(&msg.msg_checkpointer, len);
break;
+ case PGSTAT_MTYPE_BUFFER_ACTIONS:
+ pgstat_recv_buffer_type_accesses(&msg.msg_buffer_accesses, len);
+ break;
+
case PGSTAT_MTYPE_WAL:
pgstat_recv_wal(&msg.msg_wal, len);
break;
@@ -5222,9 +5312,16 @@ pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
if (msg->m_resettarget == RESET_BGWRITER)
{
/* Reset the global, bgwriter and checkpointer statistics for the cluster. */
- memset(&globalStats, 0, sizeof(globalStats));
+ memset(&globalStats.checkpointer, 0, sizeof(globalStats.checkpointer));
+ memset(&globalStats.bgwriter, 0, sizeof(globalStats.bgwriter));
globalStats.bgwriter.stat_reset_timestamp = GetCurrentTimestamp();
}
+ else if (msg->m_resettarget == RESET_BUFFERS)
+ {
+ memset(&globalStats.buffers.accesses, 0, sizeof(globalStats.buffers.accesses));
+ memcpy(globalStats.buffers.resets, msg->resets, sizeof(msg->resets));
+ globalStats.buffers.stat_reset_timestamp = GetCurrentTimestamp();
+ }
else if (msg->m_resettarget == RESET_ARCHIVER)
{
/* Reset the archiver statistics for the cluster. */
@@ -5512,6 +5609,39 @@ pgstat_recv_checkpointer(PgStat_MsgCheckpointer *msg, int len)
globalStats.checkpointer.buf_fsync_backend += msg->m_buf_fsync_backend;
}
+static void
+pgstat_recv_buffer_type_accesses(PgStat_MsgBufferTypeAccesses *msg, int len)
+{
+ int buffer_type;
+ PgStatBufferAccesses *src_buffer_accesses = msg->buffer_type_accesses;
+ PgStatBufferAccesses *dest_buffer_accesses = globalStats.buffers.accesses[msg->backend_type].buffer_type_accesses;
+
+ /*
+ * No users will likely need PgStat_MsgBufferTypeAccesses->backend_type
+ * when accessing it from globalStats since its place in the
+ * globalStats.buffers.accesses array indicates backend_type. However,
+ * leaving it undefined seemed like an invitation for unnecessary future
+ * bugs.
+ */
+ globalStats.buffers.accesses[msg->backend_type].backend_type = msg->backend_type;
+
+ for (buffer_type = 0; buffer_type < BUFFER_NUM_TYPES; buffer_type++)
+ {
+ PgStatBufferAccesses *src = &src_buffer_accesses[buffer_type];
+ PgStatBufferAccesses *dest = &dest_buffer_accesses[buffer_type];
+ dest->allocs += src->allocs;
+ dest->extends += src->extends;
+ dest->fsyncs += src->fsyncs;
+ dest->writes += src->writes;
+ }
+}
+
+PgStat_BackendAccesses *
+pgstat_get_global_buffers_stats(void)
+{
+ return &globalStats.buffers;
+}
+
/* ----------
* pgstat_recv_wal() -
*
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index e88e4e918b..135b2d3925 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -972,6 +972,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (isExtend)
{
+ pgstat_increment_buffer_access_type(BA_Extend, Buf_Shared);
/* new buffers are zero-filled */
MemSet((char *) bufBlock, 0, BLCKSZ);
/* don't set checksum for all-zero page */
@@ -1172,6 +1173,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
/* Loop here in case we have to try another victim buffer */
for (;;)
{
+ bool from_ring;
/*
* Ensure, while the spinlock's not yet held, that there's a free
* refcount entry.
@@ -1182,7 +1184,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* Select a victim buffer. The buffer is returned with its header
* spinlock still held!
*/
- buf = StrategyGetBuffer(strategy, &buf_state);
+ buf = StrategyGetBuffer(strategy, &buf_state, &from_ring);
Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
@@ -1236,7 +1238,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
UnlockBufHdr(buf, buf_state);
if (XLogNeedsFlush(lsn) &&
- StrategyRejectBuffer(strategy, buf))
+ StrategyRejectBuffer(strategy, buf, &from_ring))
{
/* Drop lock/pin and loop around for another buffer */
LWLockRelease(BufferDescriptorGetContentLock(buf));
@@ -1245,6 +1247,23 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
}
}
+ /*
+ * When a strategy is in use, if the dirty buffer was selected
+ * from the strategy ring and we did not bother checking the
+ * freelist or doing a clock sweep to look for a clean shared
+ * buffer to use, the write will be counted as a strategy
+ * write. However, if the dirty buffer was obtained from the
+ * freelist or a clock sweep, it is counted as a regular write.
+ * When a strategy is not in use, at this point, the write can
+ * only be a "regular" write of a dirty buffer.
+ */
+
+ if (from_ring)
+ pgstat_increment_buffer_access_type(BA_Write, Buf_Strategy);
+ else
+ pgstat_increment_buffer_access_type(BA_Write, Buf_Shared);
+
+
/* OK, do the I/O */
TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(forkNum, blockNum,
smgr->smgr_rnode.node.spcNode,
@@ -2143,6 +2162,18 @@ BufferSync(int flags)
*/
if (pg_atomic_read_u32(&bufHdr->state) & BM_CHECKPOINT_NEEDED)
{
+ // TODO: could remove BUF_WRITTEN and instead check buffers written
+ // for this backend before starting BufferSync() and after
+ // finishing. It seems like BUF_WRITTEN is only used for
+ // CheckpointStats.ckpt_bufs_written -- which seems redundant with
+ // the new PgBufferAccesses -- and for
+ // TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN - which can perhaps be done
+ // another way?
+ // BgBufferSync() has a similar local variable, num_written, which
+ // it uses in a similar way and also seems like it could be
+ // refactored to use PgBufferAccesses.
+ // Then, SyncOneBuffer() could only return a bool representing
+ // whether or not the buffer is reusable.
if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
{
TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
@@ -2552,6 +2583,8 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
* Pin it, share-lock it, write it. (FlushBuffer will do nothing if the
* buffer is clean by the time we've locked it.)
*/
+
+ pgstat_increment_buffer_access_type(BA_Write, Buf_Shared);
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 6be80476db..866cdd3911 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -19,6 +19,7 @@
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/proc.h"
+#include "utils/backend_status.h"
#define INT_ACCESS_ONCE(var) ((int)(*((volatile int *)&(var))))
@@ -198,7 +199,7 @@ have_free_buffer(void)
* return the buffer with the buffer header spinlock still held.
*/
BufferDesc *
-StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
+StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_ring)
{
BufferDesc *buf;
int bgwprocno;
@@ -212,6 +213,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
if (strategy != NULL)
{
buf = GetBufferFromRing(strategy, buf_state);
+ *from_ring = (buf != NULL);
if (buf != NULL)
return buf;
}
@@ -247,6 +249,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
* the rate of buffer consumption. Note that buffers recycled by a
* strategy object are intentionally not counted here.
*/
+ pgstat_increment_buffer_access_type(BA_Alloc, Buf_Shared);
pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);
/*
@@ -683,8 +686,14 @@ AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
* if this buffer should be written and re-used.
*/
bool
-StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
+StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool *from_ring)
{
+ /*
+ * If we decide to use the dirty buffer selected by StrategyGetBuffer, then
+ * ensure that we count it as such in pg_stat_buffers view.
+ */
+ *from_ring = true;
+
/* We only do this in bulkread mode */
if (strategy->btype != BAS_BULKREAD)
return false;
@@ -700,5 +709,14 @@ StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
*/
strategy->buffers[strategy->current] = InvalidBuffer;
+ /*
+ * Since we will not be writing out a dirty buffer from the ring, set
+ * from_ring to false so that the caller does not count this write as a
+ * "strategy write" and can do proper bookkeeping for
+ * pg_stat_buffers.
+ */
+ *from_ring = false;
+
+
return true;
}
diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index 7229598822..50e7081a16 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -279,7 +279,7 @@ pgstat_beinit(void)
* pgstat_bestart() -
*
* Initialize this backend's entry in the PgBackendStatus array.
- * Called from InitPostgres.
+ * Called from InitPostgres and AuxiliaryProcessMain
*
* Apart from auxiliary processes, MyBackendId, MyDatabaseId,
* session userid, and application_name must be set for a
@@ -293,6 +293,7 @@ pgstat_bestart(void)
{
volatile PgBackendStatus *vbeentry = MyBEEntry;
PgBackendStatus lbeentry;
+ int buffer_type;
#ifdef USE_SSL
PgBackendSSLStatus lsslstatus;
#endif
@@ -399,6 +400,17 @@ pgstat_bestart(void)
lbeentry.st_progress_command = PROGRESS_COMMAND_INVALID;
lbeentry.st_progress_command_target = InvalidOid;
lbeentry.st_query_id = UINT64CONST(0);
+ for (buffer_type = 0; buffer_type < BUFFER_NUM_TYPES; buffer_type++)
+ {
+ // TODO: is it okay stylistically that I do this loop differently
+ // elsewhere (e.g. increment the accesses variable vs using buffer_type
+ // as an index into the array)?
+ PgBufferAccesses *accesses = &lbeentry.buffer_access_stats[buffer_type];
+ pg_atomic_init_u64(&accesses->allocs, 0);
+ pg_atomic_init_u64(&accesses->extends, 0);
+ pg_atomic_init_u64(&accesses->fsyncs, 0);
+ pg_atomic_init_u64(&accesses->writes, 0);
+ }
/*
* we don't zero st_progress_param here to save cycles; nobody should
@@ -1045,6 +1057,45 @@ pgstat_get_my_query_id(void)
return MyBEEntry->st_query_id;
}
+PgBackendStatus *
+pgstat_access_backend_status_array(void)
+{
+ return BackendStatusArray;
+}
+
+// TODO: is there a way to inline this in a header file for performance reasons?
+// not sure with the static function calls in it
+void
+pgstat_increment_buffer_access_type(BufferAccessType ba_type, BufferType buf_type)
+{
+ PgBufferAccesses *accesses;
+ PgBackendStatus *beentry = MyBEEntry;
+
+ // TODO: do I need to check pgstat_track_activities?
+ if (!beentry || !pgstat_track_activities)
+ return;
+
+ accesses = &beentry->buffer_access_stats[buf_type];
+ switch (ba_type)
+ {
+ case BA_Alloc:
+ pg_atomic_write_u64(&accesses->allocs,
+ pg_atomic_read_u64(&accesses->allocs) + 1);
+ break;
+ case BA_Extend:
+ pg_atomic_write_u64(&accesses->extends,
+ pg_atomic_read_u64(&accesses->extends) + 1);
+ break;
+ case BA_Fsync:
+ pg_atomic_write_u64(&accesses->fsyncs,
+ pg_atomic_read_u64(&accesses->fsyncs) + 1);
+ break;
+ case BA_Write:
+ pg_atomic_write_u64(&accesses->writes,
+ pg_atomic_read_u64(&accesses->writes) + 1);
+ break;
+ }
+}
/* ----------
* pgstat_fetch_stat_beentry() -
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index ff5aedc99c..d4e0ce4143 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1796,6 +1796,143 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
}
+Datum
+pg_stat_get_buffers_accesses(PG_FUNCTION_ARGS)
+{
+#define NROWS ((BACKEND_NUM_TYPES - 1) * BUFFER_NUM_TYPES)
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ TupleDesc tupdesc;
+ Tuplestorestate *tupstore;
+ MemoryContext per_query_ctx;
+ MemoryContext oldcontext;
+ PgStat_BackendAccesses *backend_accesses;
+ int buffer_type;
+ int backend_type;
+ Datum reset_time;
+ int i;
+ // TODO: does this need to be volatile
+ volatile PgBackendStatus *beentry;
+
+ enum {
+ COLUMN_BACKEND_TYPE,
+ COLUMN_BUFFER_TYPE,
+ COLUMN_ALLOCS,
+ COLUMN_EXTENDS,
+ COLUMN_FSYNCS,
+ COLUMN_WRITES,
+ COLUMN_RESET_TIME,
+ COLUMN_LENGTH,
+ };
+
+ Datum all_values[NROWS][COLUMN_LENGTH];
+ bool all_nulls[NROWS][COLUMN_LENGTH];
+
+ /* check to see if caller supports us returning a tuplestore */
+ if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("set-valued function called in context that cannot accept a set")));
+ if (!(rsinfo->allowedModes & SFRM_Materialize))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("materialize mode required, but it is not allowed in this context")));
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+ oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+ tupstore = tuplestore_begin_heap(true, false, work_mem);
+ rsinfo->returnMode = SFRM_Materialize;
+ rsinfo->setResult = tupstore;
+ rsinfo->setDesc = tupdesc;
+
+ MemoryContextSwitchTo(oldcontext);
+
+ memset(all_values, 0, sizeof(all_values));
+ memset(all_nulls, 0, sizeof(all_nulls));
+
+ /*
+ * Loop through all live backends and count their buffer accesses for each
+ * buffer type
+ */
+ beentry = pgstat_access_backend_status_array();
+ for (i = 0; i < MaxBackends + NUM_AUXPROCTYPES; i++)
+ {
+ PgBufferAccesses *buffer_accesses;
+ beentry++;
+ /* Don't count dead backends. They should already be counted */
+ if (beentry->st_procpid == 0)
+ continue;
+
+ buffer_accesses = (PgBufferAccesses *) beentry->buffer_access_stats;
+
+ for (buffer_type = 0; buffer_type < BUFFER_NUM_TYPES; buffer_type++)
+ {
+ int rownum = (beentry->st_backendType - 1) * BUFFER_NUM_TYPES + buffer_type;
+ Datum *values = all_values[rownum];
+
+ /*
+ * COLUMN_RESET_TIME, COLUMN_BACKEND_TYPE, and COLUMN_BUFFER_TYPE
+ * will all be set when looping through exited backends array
+ */
+ values[COLUMN_ALLOCS] += pg_atomic_read_u64(&buffer_accesses->allocs);
+ values[COLUMN_EXTENDS] += pg_atomic_read_u64(&buffer_accesses->extends);
+ values[COLUMN_FSYNCS] += pg_atomic_read_u64(&buffer_accesses->fsyncs);
+ values[COLUMN_WRITES] += pg_atomic_read_u64(&buffer_accesses->writes);
+ buffer_accesses++;
+ }
+ }
+
+ /* Add stats from all exited backends */
+ pgstat_fetch_global();
+ backend_accesses = pgstat_get_global_buffers_stats();
+
+ reset_time = TimestampTzGetDatum(backend_accesses->stat_reset_timestamp);
+
+ /* 0 is not a valid BackendType */
+ for (backend_type = 1; backend_type < BACKEND_NUM_TYPES; backend_type++)
+ {
+ PgStatBufferAccesses *buffer_accesses = backend_accesses->accesses[backend_type].buffer_type_accesses;
+ PgStatBufferAccesses *resets = backend_accesses->resets[backend_type].buffer_type_accesses;
+
+ Datum backend_type_desc = CStringGetTextDatum(GetBackendTypeDesc(backend_type));
+
+ for (buffer_type = 0; buffer_type < BUFFER_NUM_TYPES; buffer_type++)
+ {
+ /*
+ * Subtract 1 from backend_type to avoid having rows for B_INVALID
+ * BackendType
+ */
+ Datum *values = all_values[(backend_type - 1) * BUFFER_NUM_TYPES + buffer_type];
+
+ values[COLUMN_BACKEND_TYPE] = backend_type_desc;
+ values[COLUMN_BUFFER_TYPE] = CStringGetTextDatum(GetBufferTypeDesc(buffer_type));
+ values[COLUMN_ALLOCS] = values[COLUMN_ALLOCS] + buffer_accesses->allocs - resets->allocs;
+ values[COLUMN_EXTENDS] = values[COLUMN_EXTENDS] + buffer_accesses->extends - resets->extends;
+ values[COLUMN_FSYNCS] = values[COLUMN_FSYNCS] + buffer_accesses->fsyncs - resets->fsyncs;
+ values[COLUMN_WRITES] = values[COLUMN_WRITES] + buffer_accesses->writes - resets->writes;
+ values[COLUMN_RESET_TIME] = reset_time;
+ buffer_accesses++;
+ resets++;
+ }
+ }
+
+ for (i = 0; i < NROWS; i++)
+ {
+ Datum *values = all_values[i];
+ bool *nulls = all_nulls[i];
+ tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+ }
+
+ /* clean up and return the tuplestore */
+ tuplestore_donestoring(tupstore);
+
+ return (Datum) 0;
+}
+
/*
* Returns statistics of WAL activity
*/
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 88801374b5..50b50d00ce 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -299,6 +299,31 @@ GetBackendTypeDesc(BackendType backendType)
return backendDesc;
}
+const char *
+GetBufferTypeDesc(BufferType bufferType)
+{
+ const char *bufferDesc = "unknown buffer type";
+
+ switch (bufferType)
+ {
+ case Buf_Direct:
+ bufferDesc = "direct";
+ break;
+ case Buf_Local:
+ bufferDesc = "local";
+ break;
+ case Buf_Shared:
+ bufferDesc = "shared";
+ break;
+ case Buf_Strategy:
+ bufferDesc = "strategy";
+ break;
+ }
+
+ return bufferDesc;
+}
+
+
/* ----------------------------------------------------------------
* database path / name support stuff
* ----------------------------------------------------------------
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 5e2c94a05f..58575889de 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -224,6 +224,7 @@ typedef struct CheckpointStatsData
TimestampTz ckpt_sync_end_t; /* end of fsyncs */
TimestampTz ckpt_end_t; /* end of checkpoint */
+ // TODO: is this now redundant with checkpointer's backend writes buffers counter?
int ckpt_bufs_written; /* # of buffers written */
int ckpt_segs_added; /* # of new xlog segments created */
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index d068d6532e..54661e2b5f 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5642,6 +5642,15 @@
proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
+{ oid => '8459', descr => 'statistics: counts of various types of accesses of buffers done by each backend type',
+ proname => 'pg_stat_get_buffers_accesses', provolatile => 's', proisstrict => 'f',
+ prorows => '52', proretset => 't',
+ proparallel => 'r', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,text,int8,int8,int8,int8,timestamptz}',
+ proargmodes => '{o,o,o,o,o,o,o}',
+ proargnames => '{backend_type,buffer_type,alloc,extend,fsync,write,stats_reset}',
+ prosrc => 'pg_stat_get_buffers_accesses' },
+
{ oid => '1136', descr => 'statistics: information about WAL activity',
proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
proparallel => 'r', prorettype => 'record', proargtypes => '',
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 90a3016065..266d32835c 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -338,9 +338,32 @@ typedef enum BackendType
B_LOGGER,
} BackendType;
+#define BACKEND_NUM_TYPES (B_LOGGER + 1)
+
+typedef enum BufferAccessType
+{
+ BA_Alloc,
+ BA_Extend,
+ BA_Fsync,
+ BA_Write,
+} BufferAccessType;
+
+#define BUFFER_ACCESS_NUM_TYPES (BA_Write + 1)
+
+typedef enum BufferType
+{
+ Buf_Direct,
+ Buf_Local,
+ Buf_Shared,
+ Buf_Strategy,
+} BufferType;
+
+#define BUFFER_NUM_TYPES (Buf_Strategy + 1)
+
extern BackendType MyBackendType;
extern const char *GetBackendTypeDesc(BackendType backendType);
+extern const char * GetBufferTypeDesc(BufferType bufferType);
extern void SetDatabasePath(const char *path);
extern void checkDataDir(void);
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index bcd3588ea2..c97b6897cf 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -72,6 +72,7 @@ typedef enum StatMsgType
PGSTAT_MTYPE_ARCHIVER,
PGSTAT_MTYPE_BGWRITER,
PGSTAT_MTYPE_CHECKPOINTER,
+ PGSTAT_MTYPE_BUFFER_ACTIONS,
PGSTAT_MTYPE_WAL,
PGSTAT_MTYPE_SLRU,
PGSTAT_MTYPE_FUNCSTAT,
@@ -138,6 +139,7 @@ typedef enum PgStat_Shared_Reset_Target
{
RESET_ARCHIVER,
RESET_BGWRITER,
+ RESET_BUFFERS,
RESET_WAL
} PgStat_Shared_Reset_Target;
@@ -224,7 +226,9 @@ typedef struct PgStat_MsgHdr
* platforms, but we're being conservative here.)
* ----------
*/
-#define PGSTAT_MAX_MSG_SIZE 1000
+// TODO: how sketchy is this? What can I do instead? The array of counters for
+// reset message is 2kB, I think
+#define PGSTAT_MAX_MSG_SIZE 3000
#define PGSTAT_MSG_PAYLOAD (PGSTAT_MAX_MSG_SIZE - sizeof(PgStat_MsgHdr))
@@ -342,6 +346,30 @@ typedef struct PgStat_MsgResetcounter
Oid m_databaseid;
} PgStat_MsgResetcounter;
+// TODO: should these be PgStat_Counters? kind of want them unsigned though...
+typedef struct PgStatBufferAccesses
+{
+ uint64 allocs;
+ uint64 extends;
+ uint64 fsyncs;
+ uint64 writes;
+} PgStatBufferAccesses;
+
+typedef struct PgStat_MsgBufferTypeAccesses
+{
+ PgStat_MsgHdr m_hdr;
+
+ BackendType backend_type;
+ PgStatBufferAccesses buffer_type_accesses[BUFFER_NUM_TYPES];
+} PgStat_MsgBufferTypeAccesses;
+
+typedef struct PgStat_BackendAccesses
+{
+ TimestampTz stat_reset_timestamp;
+ PgStat_MsgBufferTypeAccesses accesses[BACKEND_NUM_TYPES];
+ PgStat_MsgBufferTypeAccesses resets[BACKEND_NUM_TYPES];
+} PgStat_BackendAccesses;
+
/* ----------
* PgStat_MsgResetsharedcounter Sent by the backend to tell the collector
* to reset a shared counter
@@ -351,6 +379,10 @@ typedef struct PgStat_MsgResetsharedcounter
{
PgStat_MsgHdr m_hdr;
PgStat_Shared_Reset_Target m_resettarget;
+ // TODO: Should I make a new type of reset message that is only used for
+ // resetting pg_stat_buffers stats and includes this resets array?
+ // Is it okay that we are mixing query and command messages?
+ PgStat_MsgBufferTypeAccesses resets[BACKEND_NUM_TYPES];
} PgStat_MsgResetsharedcounter;
/* ----------
@@ -479,6 +511,8 @@ typedef struct PgStat_MsgCheckpointer
PgStat_Counter m_checkpoint_sync_time;
} PgStat_MsgCheckpointer;
+
+
/* ----------
* PgStat_MsgWal Sent by backends and background processes to update WAL statistics.
* ----------
@@ -703,6 +737,7 @@ typedef union PgStat_Msg
PgStat_MsgArchiver msg_archiver;
PgStat_MsgBgWriter msg_bgwriter;
PgStat_MsgCheckpointer msg_checkpointer;
+ PgStat_MsgBufferTypeAccesses msg_buffer_accesses;
PgStat_MsgWal msg_wal;
PgStat_MsgSLRU msg_slru;
PgStat_MsgFuncstat msg_funcstat;
@@ -879,6 +914,7 @@ typedef struct PgStat_GlobalStats
PgStat_CheckpointerStats checkpointer;
PgStat_BgWriterStats bgwriter;
+ PgStat_BackendAccesses buffers;
} PgStat_GlobalStats;
/*
@@ -946,7 +982,6 @@ typedef struct PgStat_FunctionCallUsage
instr_time f_start;
} PgStat_FunctionCallUsage;
-
/* ----------
* GUC parameters
* ----------
@@ -1119,6 +1154,8 @@ extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
extern void pgstat_send_archiver(const char *xlog, bool failed);
extern void pgstat_send_bgwriter(void);
extern void pgstat_send_checkpointer(void);
+extern void pgstat_report_buffers(void);
+extern PgStat_BackendAccesses * pgstat_get_global_buffers_stats(void);
extern void pgstat_send_wal(bool force);
/* ----------
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 33fcaf5c9a..7e385135db 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -310,10 +310,10 @@ extern void ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *
/* freelist.c */
extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
- uint32 *buf_state);
+ uint32 *buf_state, bool *from_ring);
extern void StrategyFreeBuffer(BufferDesc *buf);
extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
- BufferDesc *buf);
+ BufferDesc *buf, bool *from_ring);
extern int StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
extern void StrategyNotifyBgWriter(int bgwprocno);
diff --git a/src/include/utils/backend_status.h b/src/include/utils/backend_status.h
index 8042b817df..30b7103831 100644
--- a/src/include/utils/backend_status.h
+++ b/src/include/utils/backend_status.h
@@ -13,6 +13,7 @@
#include "datatype/timestamp.h"
#include "libpq/pqcomm.h"
#include "miscadmin.h" /* for BackendType */
+#include "port/atomics.h"
#include "utils/backend_progress.h"
@@ -79,6 +80,16 @@ typedef struct PgBackendGSSStatus
} PgBackendGSSStatus;
+// TODO: I'd like to move this elsewhere since it doesn't seem related to backend_status, but, I'm not sure where
+// I tried miscadmin.h but couldn't figure out what to include to get the Postgres atomic types and link correctly
+// Should it be elsewhere?
+typedef struct PgBufferAccesses
+{
+ pg_atomic_uint64 allocs;
+ pg_atomic_uint64 extends;
+ pg_atomic_uint64 fsyncs;
+ pg_atomic_uint64 writes;
+} PgBufferAccesses;
/* ----------
* PgBackendStatus
@@ -168,6 +179,7 @@ typedef struct PgBackendStatus
/* query identifier, optionally computed using post_parse_analyze_hook */
uint64 st_query_id;
+ PgBufferAccesses buffer_access_stats[BUFFER_NUM_TYPES];
} PgBackendStatus;
@@ -306,6 +318,9 @@ extern const char *pgstat_get_crashed_backend_activity(int pid, char *buffer,
int buflen);
extern uint64 pgstat_get_my_query_id(void);
+extern PgBackendStatus *pgstat_access_backend_status_array(void);
+extern void pgstat_increment_buffer_access_type(BufferAccessType ba_type, BufferType buf_type);
+
/* ----------
* Support functions for the SQL-callable functions to
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 2fa00a3c29..9172b0fcd2 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1828,6 +1828,14 @@ pg_stat_bgwriter| SELECT pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints
pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+pg_stat_buffers| SELECT b.backend_type,
+ b.buffer_type,
+ b.alloc,
+ b.extend,
+ b.fsync,
+ b.write,
+ b.stats_reset
+ FROM pg_stat_get_buffers_accesses() b(backend_type, buffer_type, alloc, extend, fsync, write, stats_reset);
pg_stat_database| SELECT d.oid AS datid,
d.datname,
CASE
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index feaaee6326..c132717e08 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -176,4 +176,10 @@ FROM prevstats AS pr;
DROP TABLE trunc_stats_test, trunc_stats_test1, trunc_stats_test2, trunc_stats_test3, trunc_stats_test4;
DROP TABLE prevstats;
+SELECT * FROM pg_stat_buffers;
+SELECT pg_stat_reset_shared('buffers');
+-- TODO: Is it okay that it takes a second for the message to be received, so I
+-- can't query the view right away and see it reset?
+SELECT pg_sleep(2);
+SELECT * FROM pg_stat_buffers;
-- End of Stats Test
--
2.27.0
On Thu, Sep 23, 2021 at 5:05 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
The attached v8 patchset is rewritten to add an additional dimension
-- buffer type. Now, a backend keeps track of how many buffers of a
particular type (e.g. shared, local) it has accessed in a particular way
(e.g. alloc, write). It also changes the naming of various structures
and the view members.

Previously, stats reset did not work since it did not consider live
backends' counters. Now, the reset message includes the current live
backends' counters to be tracked by the stats collector and used when
the view is queried.

The reset message is one of the areas in which I still need to do some
work -- I shoved the array of PgBufferAccesses into the existing reset
message used for checkpointer, bgwriter, etc. Before making a new type
of message, I would like feedback from a reviewer about the approach.

There are various TODOs in the code which are actually questions for the
reviewer. Once I have some feedback, it will be easier to address these
items.

There are a few other items which may be material for other commits that
I would also like to do:

1) Write wrapper functions for the smgr* functions which count buffer
accesses of the appropriate type. I wasn't sure if these should
literally just take all the parameters that the smgr* functions take
plus a buffer type. Once these exist, there will be less possibility for
regressions in which new code is added using smgr* functions without
counting this buffer activity. Once I add these, I was going to go
through and replace existing calls to smgr* functions and thereby start
counting currently uncounted buffer type accesses (direct, local, etc).

2) Separate checkpointer and bgwriter into two views and add additional
stats to the bgwriter view.

3) Consider adding a helper function to pgstatfuncs.c to help create the
tuplestore. These functions all have quite a few lines which are exactly
the same, and I thought it might be nice to do something about that:
pg_stat_get_progress_info(PG_FUNCTION_ARGS)
pg_stat_get_activity(PG_FUNCTION_ARGS)
pg_stat_get_buffers_accesses(PG_FUNCTION_ARGS)
pg_stat_get_slru(PG_FUNCTION_ARGS)
I can imagine a function that takes an array of Datums, a nulls array,
and a ReturnSetInfo and then makes the tuplestore -- though I think that
will use more memory. Perhaps we could make a macro which does the
initial error checking (checking whether the caller supports returning a
tuplestore)? I'm not sure if there is something meaningful here, but I
thought I would ask.

Finally, I haven't removed the test in pg_stats and haven't done a final
pass for comment clarity, alphabetization, etc. on this version.
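For illustration, the wrapper idea in (1) could look roughly like the
following self-contained sketch. The names (count_access, counting_write,
the enum members, and the counter array) are hypothetical stand-ins for
the patch's real types, not actual PostgreSQL APIs; the point is only
that routing every low-level write through a counting wrapper makes it
impossible to forget the bookkeeping:

```c
#include <stdint.h>

/* Hypothetical classification mirroring the patch's enums */
typedef enum { BA_ALLOC, BA_EXTEND, BA_FSYNC, BA_WRITE, BA_NUM_TYPES } AccessType;
typedef enum { BUF_DIRECT, BUF_LOCAL, BUF_SHARED, BUF_STRATEGY, BUF_NUM_TYPES } BufType;

/* Per-process counter array indexed by [buffer type][access type] */
static uint64_t access_counts[BUF_NUM_TYPES][BA_NUM_TYPES];

static void
count_access(BufType b, AccessType a)
{
	access_counts[b][a]++;
}

/*
 * Wrapper around a low-level write: callers that use the wrapper can no
 * longer add an uncounted write path.
 */
static void
counting_write(BufType b /* , the smgrwrite() arguments would go here */)
{
	count_access(b, BA_WRITE);
	/* ... perform the actual smgrwrite() here ... */
}

static uint64_t
get_count(BufType b, AccessType a)
{
	return access_counts[b][a];
}
```

A real version would take the same arguments as the smgr* function it
wraps, with the buffer type appended, as described above.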
I have addressed almost all of the issues mentioned above in v9.
The only remaining TODOs are described in the commit message.
The most critical one is that the reset message doesn't work.
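The arithmetic that the view function uses when combining the three
sources of counts -- live backends' counters, exited-backend totals
persisted by the stats collector, and the baseline captured at the last
reset -- can be shown with a tiny self-contained sketch (the function
name is hypothetical, assuming the accounting described above):

```c
#include <stdint.h>

/*
 * Mirrors what pg_stat_get_buffers_accesses() does per cell: the
 * displayed value is the live backends' counter plus the persisted
 * exited-backend total, minus the snapshot taken at the last reset.
 */
static uint64_t
displayed_count(uint64_t live, uint64_t exited_total, uint64_t reset_baseline)
{
	return live + exited_total - reset_baseline;
}
```

Subtracting the reset baseline instead of zeroing live counters is what
lets a reset take effect without writing into other backends' shared
memory entries.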
Attachments:
v9-0003-Remove-superfluous-bgwriter-stats.patch (text/x-patch, US-ASCII)
From 9747484ad0b6f1fe97f98cfb681fa117982dfb2f Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Thu, 2 Sep 2021 11:47:41 -0400
Subject: [PATCH v9 3/3] Remove superfluous bgwriter stats
Remove stats from pg_stat_bgwriter which are now more clearly expressed
in pg_stat_buffers.
TODO:
- make pg_stat_checkpointer view and move relevant stats into it
- add additional stats to pg_stat_bgwriter
---
doc/src/sgml/monitoring.sgml | 47 ---------------------------
src/backend/catalog/system_views.sql | 6 +---
src/backend/postmaster/checkpointer.c | 26 ---------------
src/backend/postmaster/pgstat.c | 5 ---
src/backend/storage/buffer/bufmgr.c | 6 ----
src/backend/utils/adt/pgstatfuncs.c | 30 -----------------
src/include/catalog/pg_proc.dat | 22 -------------
src/include/pgstat.h | 10 ------
src/test/regress/expected/rules.out | 5 ---
9 files changed, 1 insertion(+), 156 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 60627c692a..08772652ac 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3416,24 +3416,6 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
</para></entry>
</row>
- <row>
- <entry role="catalog_table_entry"><para role="column_definition">
- <structfield>buffers_checkpoint</structfield> <type>bigint</type>
- </para>
- <para>
- Number of buffers written during checkpoints
- </para></entry>
- </row>
-
- <row>
- <entry role="catalog_table_entry"><para role="column_definition">
- <structfield>buffers_clean</structfield> <type>bigint</type>
- </para>
- <para>
- Number of buffers written by the background writer
- </para></entry>
- </row>
-
<row>
<entry role="catalog_table_entry"><para role="column_definition">
<structfield>maxwritten_clean</structfield> <type>bigint</type>
@@ -3444,35 +3426,6 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
</para></entry>
</row>
- <row>
- <entry role="catalog_table_entry"><para role="column_definition">
- <structfield>buffers_backend</structfield> <type>bigint</type>
- </para>
- <para>
- Number of buffers written directly by a backend
- </para></entry>
- </row>
-
- <row>
- <entry role="catalog_table_entry"><para role="column_definition">
- <structfield>buffers_backend_fsync</structfield> <type>bigint</type>
- </para>
- <para>
- Number of times a backend had to execute its own
- <function>fsync</function> call (normally the background writer handles those
- even when the backend does its own write)
- </para></entry>
- </row>
-
- <row>
- <entry role="catalog_table_entry"><para role="column_definition">
- <structfield>buffers_alloc</structfield> <type>bigint</type>
- </para>
- <para>
- Number of buffers allocated
- </para></entry>
- </row>
-
<row>
<entry role="catalog_table_entry"><para role="column_definition">
<structfield>stats_reset</structfield> <type>timestamp with time zone</type>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 30280d520b..c45c261f4b 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1058,18 +1058,14 @@ CREATE VIEW pg_stat_archiver AS
s.stats_reset
FROM pg_stat_get_archiver() s;
+-- TODO: make separate pg_stat_checkpointer view
CREATE VIEW pg_stat_bgwriter AS
SELECT
pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints_timed,
pg_stat_get_bgwriter_requested_checkpoints() AS checkpoints_req,
pg_stat_get_checkpoint_write_time() AS checkpoint_write_time,
pg_stat_get_checkpoint_sync_time() AS checkpoint_sync_time,
- pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint,
- pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean,
pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
- pg_stat_get_buf_written_backend() AS buffers_backend,
- pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
- pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
CREATE VIEW pg_stat_buffers AS
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index c0c4122fd5..829f52cc8f 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -90,17 +90,9 @@
* requesting backends since the last checkpoint start. The flags are
* chosen so that OR'ing is the correct way to combine multiple requests.
*
- * num_backend_writes is used to count the number of buffer writes performed
- * by user backend processes. This counter should be wide enough that it
- * can't overflow during a single processing cycle. num_backend_fsync
- * counts the subset of those writes that also had to do their own fsync,
- * because the checkpointer failed to absorb their request.
- *
* The requests array holds fsync requests sent by backends and not yet
* absorbed by the checkpointer.
*
- * Unlike the checkpoint fields, num_backend_writes, num_backend_fsync, and
- * the requests fields are protected by CheckpointerCommLock.
*----------
*/
typedef struct
@@ -124,9 +116,6 @@ typedef struct
ConditionVariable start_cv; /* signaled when ckpt_started advances */
ConditionVariable done_cv; /* signaled when ckpt_done advances */
- uint32 num_backend_writes; /* counts user backend buffer writes */
- uint32 num_backend_fsync; /* counts user backend fsync calls */
-
int num_requests; /* current # of requests */
int max_requests; /* allocated array size */
CheckpointerRequest requests[FLEXIBLE_ARRAY_MEMBER];
@@ -1085,10 +1074,6 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
- /* Count all backend writes regardless of if they fit in the queue */
- if (!AmBackgroundWriterProcess())
- CheckpointerShmem->num_backend_writes++;
-
/*
* If the checkpointer isn't running or the request queue is full, the
* backend will have to perform its own fsync request. But before forcing
@@ -1102,8 +1087,6 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
* Count the subset of writes where backends have to do their own
* fsync
*/
- if (!AmBackgroundWriterProcess())
- CheckpointerShmem->num_backend_fsync++;
pgstat_increment_buffer_access_type(BA_Fsync, Buf_Shared);
LWLockRelease(CheckpointerCommLock);
return false;
@@ -1261,15 +1244,6 @@ AbsorbSyncRequests(void)
LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
- /* Transfer stats counts into pending pgstats message */
- PendingCheckpointerStats.m_buf_written_backend
- += CheckpointerShmem->num_backend_writes;
- PendingCheckpointerStats.m_buf_fsync_backend
- += CheckpointerShmem->num_backend_fsync;
-
- CheckpointerShmem->num_backend_writes = 0;
- CheckpointerShmem->num_backend_fsync = 0;
-
/*
* We try to avoid holding the lock for a long time by copying the request
* array, and processing the requests after releasing the lock.
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 903d4df911..b8c17f8e7f 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -5569,9 +5569,7 @@ pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
static void
pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
{
- globalStats.bgwriter.buf_written_clean += msg->m_buf_written_clean;
globalStats.bgwriter.maxwritten_clean += msg->m_maxwritten_clean;
- globalStats.bgwriter.buf_alloc += msg->m_buf_alloc;
}
/* ----------
@@ -5587,9 +5585,6 @@ pgstat_recv_checkpointer(PgStat_MsgCheckpointer *msg, int len)
globalStats.checkpointer.requested_checkpoints += msg->m_requested_checkpoints;
globalStats.checkpointer.checkpoint_write_time += msg->m_checkpoint_write_time;
globalStats.checkpointer.checkpoint_sync_time += msg->m_checkpoint_sync_time;
- globalStats.checkpointer.buf_written_checkpoints += msg->m_buf_written_checkpoints;
- globalStats.checkpointer.buf_written_backend += msg->m_buf_written_backend;
- globalStats.checkpointer.buf_fsync_backend += msg->m_buf_fsync_backend;
}
static void
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 9832d35b90..997fff9f3f 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2165,7 +2165,6 @@ BufferSync(int flags)
if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
{
TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
- PendingCheckpointerStats.m_buf_written_checkpoints++;
num_written++;
}
}
@@ -2274,9 +2273,6 @@ BgBufferSync(WritebackContext *wb_context)
*/
strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
- /* Report buffer alloc counts to pgstat */
- PendingBgWriterStats.m_buf_alloc += recent_alloc;
-
/*
* If we're not running the LRU scan, just stop after doing the stats
* stuff. We mark the saved state invalid so that we can recover sanely
@@ -2473,8 +2469,6 @@ BgBufferSync(WritebackContext *wb_context)
reusable_buffers++;
}
- PendingBgWriterStats.m_buf_written_clean += num_written;
-
#ifdef BGW_DEBUG
elog(DEBUG1, "bgwriter: recent_alloc=%u smoothed=%.2f delta=%ld ahead=%d density=%.2f reusable_est=%d upcoming_est=%d scanned=%d wrote=%d reusable=%d",
recent_alloc, smoothed_alloc, strategy_delta, bufs_ahead,
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 477caf2536..998625d490 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1738,18 +1738,6 @@ pg_stat_get_bgwriter_requested_checkpoints(PG_FUNCTION_ARGS)
PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->requested_checkpoints);
}
-Datum
-pg_stat_get_bgwriter_buf_written_checkpoints(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_written_checkpoints);
-}
-
-Datum
-pg_stat_get_bgwriter_buf_written_clean(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_written_clean);
-}
-
Datum
pg_stat_get_bgwriter_maxwritten_clean(PG_FUNCTION_ARGS)
{
@@ -1778,24 +1766,6 @@ pg_stat_get_bgwriter_stat_reset_time(PG_FUNCTION_ARGS)
PG_RETURN_TIMESTAMPTZ(pgstat_fetch_stat_bgwriter()->stat_reset_timestamp);
}
-Datum
-pg_stat_get_buf_written_backend(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_written_backend);
-}
-
-Datum
-pg_stat_get_buf_fsync_backend(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_fsync_backend);
-}
-
-Datum
-pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
-}
-
Datum
pg_stat_get_buffers_accesses(PG_FUNCTION_ARGS)
{
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 54661e2b5f..02f624c18c 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5600,16 +5600,6 @@
proname => 'pg_stat_get_bgwriter_requested_checkpoints', provolatile => 's',
proparallel => 'r', prorettype => 'int8', proargtypes => '',
prosrc => 'pg_stat_get_bgwriter_requested_checkpoints' },
-{ oid => '2771',
- descr => 'statistics: number of buffers written by the bgwriter during checkpoints',
- proname => 'pg_stat_get_bgwriter_buf_written_checkpoints', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_bgwriter_buf_written_checkpoints' },
-{ oid => '2772',
- descr => 'statistics: number of buffers written by the bgwriter for cleaning dirty buffers',
- proname => 'pg_stat_get_bgwriter_buf_written_clean', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_bgwriter_buf_written_clean' },
{ oid => '2773',
descr => 'statistics: number of times the bgwriter stopped processing when it had written too many buffers while cleaning',
proname => 'pg_stat_get_bgwriter_maxwritten_clean', provolatile => 's',
@@ -5629,18 +5619,6 @@
proname => 'pg_stat_get_checkpoint_sync_time', provolatile => 's',
proparallel => 'r', prorettype => 'float8', proargtypes => '',
prosrc => 'pg_stat_get_checkpoint_sync_time' },
-{ oid => '2775', descr => 'statistics: number of buffers written by backends',
- proname => 'pg_stat_get_buf_written_backend', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_buf_written_backend' },
-{ oid => '3063',
- descr => 'statistics: number of backend buffer writes that did their own fsync',
- proname => 'pg_stat_get_buf_fsync_backend', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_buf_fsync_backend' },
-{ oid => '2859', descr => 'statistics: number of buffer allocations',
- proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
- prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
{ oid => '8459', descr => 'statistics: counts of various types of accesses of buffers done by each backend type',
proname => 'pg_stat_get_buffers_accesses', provolatile => 's', proisstrict => 'f',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index b73265ab13..9a3ffc9ee4 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -486,9 +486,7 @@ typedef struct PgStat_MsgBgWriter
{
PgStat_MsgHdr m_hdr;
- PgStat_Counter m_buf_written_clean;
PgStat_Counter m_maxwritten_clean;
- PgStat_Counter m_buf_alloc;
} PgStat_MsgBgWriter;
/* ----------
@@ -501,9 +499,6 @@ typedef struct PgStat_MsgCheckpointer
PgStat_Counter m_timed_checkpoints;
PgStat_Counter m_requested_checkpoints;
- PgStat_Counter m_buf_written_checkpoints;
- PgStat_Counter m_buf_written_backend;
- PgStat_Counter m_buf_fsync_backend;
PgStat_Counter m_checkpoint_write_time; /* times in milliseconds */
PgStat_Counter m_checkpoint_sync_time;
} PgStat_MsgCheckpointer;
@@ -879,9 +874,7 @@ typedef struct PgStat_ArchiverStats
*/
typedef struct PgStat_BgWriterStats
{
- PgStat_Counter buf_written_clean;
PgStat_Counter maxwritten_clean;
- PgStat_Counter buf_alloc;
TimestampTz stat_reset_timestamp;
} PgStat_BgWriterStats;
@@ -895,9 +888,6 @@ typedef struct PgStat_CheckpointerStats
PgStat_Counter requested_checkpoints;
PgStat_Counter checkpoint_write_time; /* times in milliseconds */
PgStat_Counter checkpoint_sync_time;
- PgStat_Counter buf_written_checkpoints;
- PgStat_Counter buf_written_backend;
- PgStat_Counter buf_fsync_backend;
} PgStat_CheckpointerStats;
/*
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 9172b0fcd2..ac2f7cf61e 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1821,12 +1821,7 @@ pg_stat_bgwriter| SELECT pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints
pg_stat_get_bgwriter_requested_checkpoints() AS checkpoints_req,
pg_stat_get_checkpoint_write_time() AS checkpoint_write_time,
pg_stat_get_checkpoint_sync_time() AS checkpoint_sync_time,
- pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint,
- pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean,
pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
- pg_stat_get_buf_written_backend() AS buffers_backend,
- pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
- pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
pg_stat_buffers| SELECT b.backend_type,
b.buffer_type,
--
2.27.0
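As an aside on the direction suggested upthread (plain counter arrays indexed by backend type, rather than individually named counters like the ones this patch removes), a minimal sketch of what the internal representation could look like. All names here are hypothetical illustrations, not identifiers from the patches:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical sketch: one write counter per backend type, replacing the
 * scattered num_backend_writes / m_buf_written_* fields.  The enum values
 * are illustrative, not PostgreSQL's real BackendType.
 */
typedef enum SketchBackendType
{
	SKETCH_BACKEND,
	SKETCH_AUTOVACUUM,
	SKETCH_BGWRITER,
	SKETCH_CHECKPOINTER,
	SKETCH_NUM_BACKEND_TYPES
} SketchBackendType;

typedef struct SketchWriteCounters
{
	uint64_t	writes[SKETCH_NUM_BACKEND_TYPES];
	uint64_t	strategy_writes[SKETCH_NUM_BACKEND_TYPES];
	uint64_t	extends[SKETCH_NUM_BACKEND_TYPES];
} SketchWriteCounters;

/* A SQL-visible column can then just index into or sum over the array. */
static uint64_t
sketch_total_writes(const SketchWriteCounters *c)
{
	uint64_t	total = 0;

	for (int i = 0; i < SKETCH_NUM_BACKEND_TYPES; i++)
		total += c->writes[i];
	return total;
}
```

With this shape, exposing per-backend-type columns at the SQL level is a matter of indexing, and adding a new backend type does not require a new named counter.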
Attachment: v9-0001-Allow-bootstrap-process-to-beinit.patch (text/x-patch, US-ASCII)
From b0a24e0cd0115f5bfb15a69693ce205a9dca841e Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Fri, 24 Sep 2021 17:39:12 -0400
Subject: [PATCH v9 1/3] Allow bootstrap process to call pgstat_beinit
---
src/backend/utils/init/postinit.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 78bc64671e..fba5864172 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -670,8 +670,7 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
EnablePortalManager();
/* Initialize status reporting */
- if (!bootstrap)
- pgstat_beinit();
+ pgstat_beinit();
/*
* Load relcache entries for the shared system catalogs. This must create
--
2.27.0
Attachment: v9-0002-Add-system-view-tracking-accesses-to-buffers.patch (text/x-patch, US-ASCII)
From f924873f296fa4691a41f38b4b3509d08ee3b62d Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Thu, 2 Sep 2021 11:33:59 -0400
Subject: [PATCH v9 2/3] Add system view tracking accesses to buffers
Add pg_stat_buffers, a system view which tracks the number of buffers of
a particular type (e.g. shared, local) allocated, written, fsync'd, and
extended by each backend type.
Some of these should always be zero. For example, a checkpointer backend
will not use a BufferAccessStrategy (currently), so buffer type
"strategy" for checkpointer will be 0 for all buffer access types
(alloc, write, fsync, and extend).
All backends increment a counter in their PgBackendStatus when
performing a buffer access. On exit, backends send these stats to the
stats collector to be persisted.
When stats are reset, the reset message includes the current values of
all the live backends' buffer access counters. When receiving this
message, the stats collector will 1) save these reset values in an array
of "resets" and 2) zero out the exited backends' saved buffer access
counters. This is required for accurate stats after a reset without
writing to other backends' PgBackendStatus.
When the pg_stat_buffers view is queried, live backends' stats are summed
with the saved stats from exited backends, and the saved reset stats are
subtracted, to produce the total.
Each row of the view is for a particular backend type and a particular
buffer type (e.g. shared buffer accesses by checkpointer) and each
column in the view is the total number of buffers of each kind of buffer
access (e.g. written). So a cell in the view would be, for example, the
number of shared buffers written by checkpointer since the last stats
reset.
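The summation described above can be sketched as follows. This is a hedged illustration only; sketch_view_cell and its signature are invented for this example, not code from the patch:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical sketch: one cell of pg_stat_buffers is computed as the sum
 * over live backends' counters, plus the totals saved from exited
 * backends, minus the snapshot taken at the last stats reset.
 */
static uint64_t
sketch_view_cell(const uint64_t *live_counters, int num_live,
				 uint64_t exited_total, uint64_t reset_snapshot)
{
	uint64_t	total = exited_total;

	for (int i = 0; i < num_live; i++)
		total += live_counters[i];

	return total - reset_snapshot;
}
```

Subtracting the reset snapshot is what lets a reset take effect without writing into other backends' PgBackendStatus entries.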
Note that this commit does not add code to increment buffer accesses for
all types of buffers. It includes all possible combinations in the stats
view but doesn't populate all of them.
TODO:
- pgstat reset message too large -- needs to be fixed
- Wrappers for smgr funcs to protect against regressions and cover other
buffer types
- Consider helper func in pgstatfuncs.c to refactor out some of the
redundant tuplestore creation code from pg_stat_get_progress_info,
pg_stat_get_activity, etc
- Remove pg_stats test I added
- current code TODOs are mostly about adding comments
- When finished, catalog bump
- pgindent
---
doc/src/sgml/monitoring.sgml | 116 ++++++++++++++++-
src/backend/catalog/system_views.sql | 11 ++
src/backend/postmaster/checkpointer.c | 1 +
src/backend/postmaster/pgstat.c | 115 ++++++++++++++++-
src/backend/storage/buffer/bufmgr.c | 25 +++-
src/backend/storage/buffer/freelist.c | 22 +++-
src/backend/utils/activity/backend_status.c | 49 ++++++-
src/backend/utils/adt/pgstatfuncs.c | 136 ++++++++++++++++++++
src/backend/utils/init/miscinit.c | 25 ++++
src/include/catalog/pg_proc.dat | 9 ++
src/include/miscadmin.h | 23 ++++
src/include/pgstat.h | 35 ++++-
src/include/storage/buf_internals.h | 4 +-
src/include/utils/backend_status.h | 44 +++++++
src/test/regress/expected/rules.out | 8 ++
src/test/regress/sql/stats.sql | 4 +
16 files changed, 610 insertions(+), 17 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 2281ba120f..60627c692a 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -444,6 +444,15 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</entry>
</row>
+ <row>
+ <entry><structname>pg_stat_buffers</structname><indexterm><primary>pg_stat_buffers</primary></indexterm></entry>
+ <entry>A row for each buffer type for each backend type showing
+ statistics about backend buffer activity. See
+ <link linkend="monitoring-pg-stat-buffers-view">
+ <structname>pg_stat_buffers</structname></link> for details.
+ </entry>
+ </row>
+
<row>
<entry><structname>pg_stat_wal</structname><indexterm><primary>pg_stat_wal</primary></indexterm></entry>
<entry>One row only, showing statistics about WAL activity. See
@@ -3478,6 +3487,101 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
</sect2>
+ <sect2 id="monitoring-pg-stat-buffers-view">
+ <title><structname>pg_stat_buffers</structname></title>
+
+ <indexterm>
+ <primary>pg_stat_buffers</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_stat_buffers</structname> view has a row for each buffer
+ type for each backend type, containing global data for the cluster for that
+ backend and buffer type.
+ </para>
+
+ <table id="pg-stat-buffer-actions-view" xreflabel="pg_stat_buffers">
+ <title><structname>pg_stat_buffers</structname> View</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>backend_type</structfield> <type>text</type>
+ </para>
+ <para>
+ Type of backend (e.g. background worker, autovacuum worker).
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>buffer_type</structfield> <type>text</type>
+ </para>
+ <para>
+ Type of buffer accessed (e.g. shared).
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>alloc</structfield> <type>integer</type>
+ </para>
+ <para>
+ Number of buffers allocated.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>extend</structfield> <type>integer</type>
+ </para>
+ <para>
+ Number of buffers extended.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>fsync</structfield> <type>integer</type>
+ </para>
+ <para>
+ Number of buffers fsynced.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>write</structfield> <type>integer</type>
+ </para>
+ <para>
+ Number of buffers written.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
+ </para>
+ <para>
+ Time at which these statistics were last reset.
+ </para></entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ </sect2>
+
<sect2 id="monitoring-pg-stat-wal-view">
<title><structname>pg_stat_wal</structname></title>
@@ -5074,12 +5178,14 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
</para>
<para>
Resets some cluster-wide statistics counters to zero, depending on the
- argument. The argument can be <literal>bgwriter</literal> to reset
- all the counters shown in
- the <structname>pg_stat_bgwriter</structname>
+ argument. The argument can be <literal>bgwriter</literal> to reset all
+ the counters shown in the <structname>pg_stat_bgwriter</structname>
view, <literal>archiver</literal> to reset all the counters shown in
- the <structname>pg_stat_archiver</structname> view or <literal>wal</literal>
- to reset all the counters shown in the <structname>pg_stat_wal</structname> view.
+ the <structname>pg_stat_archiver</structname> view,
+ <literal>wal</literal> to reset all the counters shown in the
+ <structname>pg_stat_wal</structname> view, or
+ <literal>buffers</literal> to reset all the counters shown in the
+ <structname>pg_stat_buffers</structname> view.
</para>
<para>
This function is restricted to superusers by default, but other users
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 55f6e3711d..30280d520b 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1072,6 +1072,17 @@ CREATE VIEW pg_stat_bgwriter AS
pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+CREATE VIEW pg_stat_buffers AS
+SELECT
+ b.backend_type,
+ b.buffer_type,
+ b.alloc,
+ b.extend,
+ b.fsync,
+ b.write,
+ b.stats_reset
+FROM pg_stat_get_buffers_accesses() b;
+
CREATE VIEW pg_stat_wal AS
SELECT
w.wal_records,
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index be7366379d..c0c4122fd5 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -1104,6 +1104,7 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
*/
if (!AmBackgroundWriterProcess())
CheckpointerShmem->num_backend_fsync++;
+ pgstat_increment_buffer_access_type(BA_Fsync, Buf_Shared);
LWLockRelease(CheckpointerCommLock);
return false;
}
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index b7d0fbaefd..903d4df911 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -124,9 +124,12 @@ char *pgstat_stat_filename = NULL;
char *pgstat_stat_tmpname = NULL;
/*
- * BgWriter and WAL global statistics counters.
- * Stored directly in a stats message structure so they can be sent
- * without needing to copy things around. We assume these init to zeroes.
+ * BgWriter, Checkpointer, WAL, and I/O global statistics counters. I/O global
+ * statistics on buffer accesses are tracked in PgBackendStatus while a
+ * backend is alive and then sent to the stats collector in a
+ * PgStat_MsgBufferTypeAccesses before the backend exits.
+ * All others are stored directly in a stats message structure so they can be
+ * sent without needing to copy things around. We assume these init to zeroes.
*/
PgStat_MsgBgWriter PendingBgWriterStats;
PgStat_MsgCheckpointer PendingCheckpointerStats;
@@ -362,6 +365,7 @@ static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
static void pgstat_recv_checkpointer(PgStat_MsgCheckpointer *msg, int len);
+static void pgstat_recv_buffer_type_accesses(PgStat_MsgBufferTypeAccesses *msg, int len);
static void pgstat_recv_wal(PgStat_MsgWal *msg, int len);
static void pgstat_recv_slru(PgStat_MsgSLRU *msg, int len);
static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
@@ -974,6 +978,7 @@ pgstat_report_stat(bool disconnect)
/* Now, send function statistics */
pgstat_send_funcstats();
+
/* Send WAL statistics */
pgstat_send_wal(true);
@@ -1452,6 +1457,13 @@ pgstat_reset_shared_counters(const char *target)
msg.m_resettarget = RESET_ARCHIVER;
else if (strcmp(target, "bgwriter") == 0)
msg.m_resettarget = RESET_BGWRITER;
+ else if (strcmp(target, "buffers") == 0)
+ {
+ memset(msg.resets, 0, sizeof(PgStat_MsgBufferTypeAccesses) * BACKEND_NUM_TYPES);
+
+ msg.m_resettarget = RESET_BUFFERS;
+
+ pgstat_report_live_backend_accesses(msg.resets);
+ }
else if (strcmp(target, "wal") == 0)
msg.m_resettarget = RESET_WAL;
else
@@ -2760,6 +2772,15 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
rec->tuples_inserted + rec->tuples_updated;
}
+/* TODO: add comment? */
+PgStat_BackendAccesses *
+pgstat_fetch_exited_backend_buffers(void)
+{
+ backend_read_statsfile();
+
+ return &globalStats.buffers;
+}
+
/* ----------
* pgstat_fetch_stat_dbentry() -
@@ -2998,6 +3019,13 @@ static void
pgstat_shutdown_hook(int code, Datum arg)
{
Assert(!pgstat_is_shutdown);
+ /*
+ * Buffer access stats only need to be sent when a process exits, as
+ * pg_stat_get_buffers_accesses() will read from live backends'
+ * PgBackendStatus and sum that with the totals from exited backends
+ * persisted by the stats collector.
+ */
+ pgstat_send_buffers();
/*
* If we got as far as discovering our own database ID, we can report what
@@ -3148,6 +3176,47 @@ pgstat_send_bgwriter(void)
MemSet(&PendingBgWriterStats, 0, sizeof(PendingBgWriterStats));
}
+/* ----------
+ * pgstat_send_buffers() -
+ *
+ * Before exiting, a backend sends its buffer access statistics to the
+ * collector so that they may be persisted
+ * ----------
+ */
+void
+pgstat_send_buffers(void)
+{
+ PgStat_MsgBufferTypeAccesses msg;
+ PgBufferAccesses *src_accesses;
+ PgStatBufferAccesses *dest_accesses;
+ int buffer_type;
+
+ PgBackendStatus *beentry = MyBEEntry;
+
+ if (!beentry)
+ return;
+
+ MemSet(&msg, 0, sizeof(msg));
+ msg.backend_type = beentry->st_backendType;
+
+ src_accesses = (PgBufferAccesses *) &beentry->buffer_access_stats;
+ dest_accesses = msg.buffer_type_accesses;
+
+ for (buffer_type = 0; buffer_type < BUFFER_NUM_TYPES; buffer_type++)
+ {
+ dest_accesses->allocs += pg_atomic_read_u64(&src_accesses->allocs);
+ dest_accesses->extends += pg_atomic_read_u64(&src_accesses->extends);
+ dest_accesses->fsyncs += pg_atomic_read_u64(&src_accesses->fsyncs);
+ dest_accesses->writes += pg_atomic_read_u64(&src_accesses->writes);
+ dest_accesses++;
+ src_accesses++;
+ }
+
+ pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_BUFFER_ACTIONS);
+ pgstat_send(&msg, sizeof(msg));
+}
+
+
/* ----------
* pgstat_send_checkpointer() -
*
@@ -3522,6 +3591,10 @@ PgstatCollectorMain(int argc, char *argv[])
pgstat_recv_checkpointer(&msg.msg_checkpointer, len);
break;
+ case PGSTAT_MTYPE_BUFFER_ACTIONS:
+ pgstat_recv_buffer_type_accesses(&msg.msg_buffer_accesses, len);
+ break;
+
case PGSTAT_MTYPE_WAL:
pgstat_recv_wal(&msg.msg_wal, len);
break;
@@ -5222,9 +5295,16 @@ pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
if (msg->m_resettarget == RESET_BGWRITER)
{
/* Reset the global, bgwriter and checkpointer statistics for the cluster. */
- memset(&globalStats, 0, sizeof(globalStats));
+ memset(&globalStats.checkpointer, 0, sizeof(globalStats.checkpointer));
+ memset(&globalStats.bgwriter, 0, sizeof(globalStats.bgwriter));
globalStats.bgwriter.stat_reset_timestamp = GetCurrentTimestamp();
}
+ else if (msg->m_resettarget == RESET_BUFFERS)
+ {
+ memset(&globalStats.buffers.accesses, 0, sizeof(globalStats.buffers.accesses));
+ memcpy(globalStats.buffers.resets, msg->resets, sizeof(msg->resets));
+ globalStats.buffers.stat_reset_timestamp = GetCurrentTimestamp();
+ }
else if (msg->m_resettarget == RESET_ARCHIVER)
{
/* Reset the archiver statistics for the cluster. */
@@ -5512,6 +5592,33 @@ pgstat_recv_checkpointer(PgStat_MsgCheckpointer *msg, int len)
globalStats.checkpointer.buf_fsync_backend += msg->m_buf_fsync_backend;
}
+static void
+pgstat_recv_buffer_type_accesses(PgStat_MsgBufferTypeAccesses *msg, int len)
+{
+ int buffer_type;
+ PgStatBufferAccesses *src_buffer_accesses = msg->buffer_type_accesses;
+ PgStatBufferAccesses *dest_buffer_accesses = globalStats.buffers.accesses[msg->backend_type].buffer_type_accesses;
+
+ /*
+ * No users will likely need PgStat_MsgBufferTypeAccesses->backend_type
+ * when accessing it from globalStats since its place in the
+ * globalStats.buffers.accesses array indicates backend_type. However,
+ * leaving it undefined seemed like an invitation for unnecessary future
+ * bugs.
+ */
+ globalStats.buffers.accesses[msg->backend_type].backend_type = msg->backend_type;
+
+ for (buffer_type = 0; buffer_type < BUFFER_NUM_TYPES; buffer_type++)
+ {
+ PgStatBufferAccesses *src = &src_buffer_accesses[buffer_type];
+ PgStatBufferAccesses *dest = &dest_buffer_accesses[buffer_type];
+ dest->allocs += src->allocs;
+ dest->extends += src->extends;
+ dest->fsyncs += src->fsyncs;
+ dest->writes += src->writes;
+ }
+}
+
/* ----------
* pgstat_recv_wal() -
*
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index e88e4e918b..9832d35b90 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -972,6 +972,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (isExtend)
{
+ pgstat_increment_buffer_access_type(BA_Extend, Buf_Shared);
/* new buffers are zero-filled */
MemSet((char *) bufBlock, 0, BLCKSZ);
/* don't set checksum for all-zero page */
@@ -1172,6 +1173,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
/* Loop here in case we have to try another victim buffer */
for (;;)
{
+ bool from_ring;
/*
* Ensure, while the spinlock's not yet held, that there's a free
* refcount entry.
@@ -1182,7 +1184,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* Select a victim buffer. The buffer is returned with its header
* spinlock still held!
*/
- buf = StrategyGetBuffer(strategy, &buf_state);
+ buf = StrategyGetBuffer(strategy, &buf_state, &from_ring);
Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
@@ -1236,7 +1238,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
UnlockBufHdr(buf, buf_state);
if (XLogNeedsFlush(lsn) &&
- StrategyRejectBuffer(strategy, buf))
+ StrategyRejectBuffer(strategy, buf, &from_ring))
{
/* Drop lock/pin and loop around for another buffer */
LWLockRelease(BufferDescriptorGetContentLock(buf));
@@ -1245,6 +1247,23 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
}
}
+ /*
+ * When a strategy is in use, if the dirty buffer was selected
+ * from the strategy ring and we did not bother checking the
+ * freelist or doing a clock sweep to look for a clean shared
+ * buffer to use, the write will be counted as a strategy
+ * write. However, if the dirty buffer was obtained from the
+ * freelist or a clock sweep, it is counted as a regular write.
+ * When a strategy is not in use, at this point, the write can
+ * only be a "regular" write of a dirty buffer.
+ */
+
+ if (from_ring)
+ pgstat_increment_buffer_access_type(BA_Write, Buf_Strategy);
+ else
+ pgstat_increment_buffer_access_type(BA_Write, Buf_Shared);
+
+
/* OK, do the I/O */
TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(forkNum, blockNum,
smgr->smgr_rnode.node.spcNode,
@@ -2552,6 +2571,8 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
* Pin it, share-lock it, write it. (FlushBuffer will do nothing if the
* buffer is clean by the time we've locked it.)
*/
+
+ pgstat_increment_buffer_access_type(BA_Write, Buf_Shared);
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 6be80476db..866cdd3911 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -19,6 +19,7 @@
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/proc.h"
+#include "utils/backend_status.h"
#define INT_ACCESS_ONCE(var) ((int)(*((volatile int *)&(var))))
@@ -198,7 +199,7 @@ have_free_buffer(void)
* return the buffer with the buffer header spinlock still held.
*/
BufferDesc *
-StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
+StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_ring)
{
BufferDesc *buf;
int bgwprocno;
@@ -212,6 +213,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
if (strategy != NULL)
{
buf = GetBufferFromRing(strategy, buf_state);
+ *from_ring = (buf != NULL);
if (buf != NULL)
return buf;
}
@@ -247,6 +249,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
* the rate of buffer consumption. Note that buffers recycled by a
* strategy object are intentionally not counted here.
*/
+ pgstat_increment_buffer_access_type(BA_Alloc, Buf_Shared);
pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);
/*
@@ -683,8 +686,14 @@ AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
* if this buffer should be written and re-used.
*/
bool
-StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
+StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool *from_ring)
{
+ /*
+ * If we decide to use the dirty buffer selected by StrategyGetBuffer,
+ * ensure that it is counted as a strategy write in the pg_stat_buffers
+ * view.
+ */
+ *from_ring = true;
+
/* We only do this in bulkread mode */
if (strategy->btype != BAS_BULKREAD)
return false;
@@ -700,5 +709,14 @@ StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
*/
strategy->buffers[strategy->current] = InvalidBuffer;
+ /*
+ * Since we will not be writing out a dirty buffer from the ring, set
+ * from_ring to false so that the caller does not count this write as a
+ * "strategy write" and can do proper bookkeeping for
+ * pg_stat_buffers.
+ */
+ *from_ring = false;
+
return true;
}
diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index 7229598822..0581dce8b9 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -279,7 +279,7 @@ pgstat_beinit(void)
* pgstat_bestart() -
*
* Initialize this backend's entry in the PgBackendStatus array.
- * Called from InitPostgres.
+ * Called from InitPostgres and AuxiliaryProcessMain.
*
* Apart from auxiliary processes, MyBackendId, MyDatabaseId,
* session userid, and application_name must be set for a
@@ -293,6 +293,7 @@ pgstat_bestart(void)
{
volatile PgBackendStatus *vbeentry = MyBEEntry;
PgBackendStatus lbeentry;
+ int buffer_type;
#ifdef USE_SSL
PgBackendSSLStatus lsslstatus;
#endif
@@ -399,6 +400,14 @@ pgstat_bestart(void)
lbeentry.st_progress_command = PROGRESS_COMMAND_INVALID;
lbeentry.st_progress_command_target = InvalidOid;
lbeentry.st_query_id = UINT64CONST(0);
+ for (buffer_type = 0; buffer_type < BUFFER_NUM_TYPES; buffer_type++)
+ {
+ PgBufferAccesses *accesses = &lbeentry.buffer_access_stats[buffer_type];
+ pg_atomic_init_u64(&accesses->allocs, 0);
+ pg_atomic_init_u64(&accesses->extends, 0);
+ pg_atomic_init_u64(&accesses->fsyncs, 0);
+ pg_atomic_init_u64(&accesses->writes, 0);
+ }
/*
* we don't zero st_progress_param here to save cycles; nobody should
@@ -621,6 +630,38 @@ pgstat_report_activity(BackendState state, const char *cmd_str)
PGSTAT_END_WRITE_ACTIVITY(beentry);
}
+/*
+ * pgstat_report_live_backend_accesses() -
+ *
+ * Accumulate the current buffer access counters of all live backends
+ * into the per-backend-type totals in *backend_accesses, so that a
+ * reset can establish a baseline for still-running processes.
+ */
+void
+pgstat_report_live_backend_accesses(PgStat_MsgBufferTypeAccesses *backend_accesses)
+{
+ int i, buffer_type;
+ PgBackendStatus *beentry = BackendStatusArray;
+ /*
+ * Loop through live backends and capture reset values
+ */
+ for (i = 0; i < MaxBackends + NUM_AUXPROCTYPES; i++, beentry++)
+ {
+ PgBufferAccesses *live_accesses;
+ PgStatBufferAccesses *buffer_accesses;
+
+ /* Skip dead backends; their counts were captured when they exited */
+ if (beentry->st_procpid == 0)
+ continue;
+
+ live_accesses = (PgBufferAccesses *) beentry->buffer_access_stats;
+ buffer_accesses = backend_accesses[beentry->st_backendType].buffer_type_accesses;
+
+ for (buffer_type = 0; buffer_type < BUFFER_NUM_TYPES; buffer_type++)
+ {
+ /* Accumulate: several live backends may share a backend type */
+ buffer_accesses->allocs += pg_atomic_read_u64(&live_accesses->allocs);
+ buffer_accesses->extends += pg_atomic_read_u64(&live_accesses->extends);
+ buffer_accesses->fsyncs += pg_atomic_read_u64(&live_accesses->fsyncs);
+ buffer_accesses->writes += pg_atomic_read_u64(&live_accesses->writes);
+ buffer_accesses++;
+ live_accesses++;
+ }
+ }
+}
+
/* --------
* pgstat_report_query_id() -
*
@@ -1046,6 +1087,12 @@ pgstat_get_my_query_id(void)
}
+PgBackendStatus *
+pgstat_fetch_backend_statuses(void)
+{
+ return BackendStatusArray;
+}
+
/* ----------
* pgstat_fetch_stat_beentry() -
*
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index ff5aedc99c..477caf2536 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1796,6 +1796,142 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
}
+Datum
+pg_stat_get_buffers_accesses(PG_FUNCTION_ARGS)
+{
+#define NROWS ((BACKEND_NUM_TYPES - 1) * BUFFER_NUM_TYPES)
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ TupleDesc tupdesc;
+ Tuplestorestate *tupstore;
+ MemoryContext per_query_ctx;
+ MemoryContext oldcontext;
+ PgStat_BackendAccesses *backend_accesses;
+ int buffer_type;
+ int backend_type;
+ Datum reset_time;
+ int i;
+ PgBackendStatus *beentry;
+
+ enum {
+ COLUMN_BACKEND_TYPE,
+ COLUMN_BUFFER_TYPE,
+ COLUMN_ALLOCS,
+ COLUMN_EXTENDS,
+ COLUMN_FSYNCS,
+ COLUMN_WRITES,
+ COLUMN_RESET_TIME,
+ COLUMN_LENGTH,
+ };
+
+ Datum all_values[NROWS][COLUMN_LENGTH];
+ bool all_nulls[NROWS][COLUMN_LENGTH];
+
+ /* check to see if caller supports us returning a tuplestore */
+ if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("set-valued function called in context that cannot accept a set")));
+ if (!(rsinfo->allowedModes & SFRM_Materialize))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("materialize mode required, but it is not allowed in this context")));
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+ oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+ tupstore = tuplestore_begin_heap(true, false, work_mem);
+ rsinfo->returnMode = SFRM_Materialize;
+ rsinfo->setResult = tupstore;
+ rsinfo->setDesc = tupdesc;
+
+ MemoryContextSwitchTo(oldcontext);
+
+ memset(all_values, 0, sizeof(all_values));
+ memset(all_nulls, 0, sizeof(all_nulls));
+
+ /*
+ * Loop through all live backends and count their buffer accesses for each
+ * buffer type
+ */
+ beentry = pgstat_fetch_backend_statuses();
+
+ for (i = 0; i < MaxBackends + NUM_AUXPROCTYPES; i++, beentry++)
+ {
+ PgBufferAccesses *buffer_accesses;
+
+ /* Skip dead backends; their counts are in the exited-backend totals */
+ if (beentry->st_procpid == 0)
+ continue;
+
+ buffer_accesses = beentry->buffer_access_stats;
+
+ for (buffer_type = 0; buffer_type < BUFFER_NUM_TYPES; buffer_type++)
+ {
+ int rownum = (beentry->st_backendType - 1) * BUFFER_NUM_TYPES + buffer_type;
+ Datum *values = all_values[rownum];
+
+ /*
+ * COLUMN_RESET_TIME, COLUMN_BACKEND_TYPE, and COLUMN_BUFFER_TYPE
+ * are set later, when looping through the exited-backends array.
+ */
+ values[COLUMN_ALLOCS] += pg_atomic_read_u64(&buffer_accesses->allocs);
+ values[COLUMN_EXTENDS] += pg_atomic_read_u64(&buffer_accesses->extends);
+ values[COLUMN_FSYNCS] += pg_atomic_read_u64(&buffer_accesses->fsyncs);
+ values[COLUMN_WRITES] += pg_atomic_read_u64(&buffer_accesses->writes);
+ buffer_accesses++;
+ }
+ }
+
+ /* Add stats from all exited backends */
+ backend_accesses = pgstat_fetch_exited_backend_buffers();
+
+ reset_time = TimestampTzGetDatum(backend_accesses->stat_reset_timestamp);
+
+ /* 0 is not a valid BackendType */
+ for (backend_type = 1; backend_type < BACKEND_NUM_TYPES; backend_type++)
+ {
+ PgStatBufferAccesses *buffer_accesses = backend_accesses->accesses[backend_type].buffer_type_accesses;
+ PgStatBufferAccesses *resets = backend_accesses->resets[backend_type].buffer_type_accesses;
+
+ Datum backend_type_desc = CStringGetTextDatum(GetBackendTypeDesc(backend_type));
+
+ for (buffer_type = 0; buffer_type < BUFFER_NUM_TYPES; buffer_type++)
+ {
+ /*
+ * Subtract 1 from backend_type to avoid having rows for B_INVALID
+ * BackendType
+ */
+ Datum *values = all_values[(backend_type - 1) * BUFFER_NUM_TYPES + buffer_type];
+
+ values[COLUMN_BACKEND_TYPE] = backend_type_desc;
+ values[COLUMN_BUFFER_TYPE] = CStringGetTextDatum(GetBufferTypeDesc(buffer_type));
+ values[COLUMN_ALLOCS] = values[COLUMN_ALLOCS] + buffer_accesses->allocs - resets->allocs;
+ values[COLUMN_EXTENDS] = values[COLUMN_EXTENDS] + buffer_accesses->extends - resets->extends;
+ values[COLUMN_FSYNCS] = values[COLUMN_FSYNCS] + buffer_accesses->fsyncs - resets->fsyncs;
+ values[COLUMN_WRITES] = values[COLUMN_WRITES] + buffer_accesses->writes - resets->writes;
+ values[COLUMN_RESET_TIME] = reset_time;
+ buffer_accesses++;
+ resets++;
+ }
+ }
+
+ for (i = 0; i < NROWS; i++)
+ {
+ Datum *values = all_values[i];
+ bool *nulls = all_nulls[i];
+ tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+ }
+
+ /* clean up and return the tuplestore */
+ tuplestore_donestoring(tupstore);
+
+ return (Datum) 0;
+}
+
/*
* Returns statistics of WAL activity
*/
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 88801374b5..50b50d00ce 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -299,6 +299,31 @@ GetBackendTypeDesc(BackendType backendType)
return backendDesc;
}
+const char *
+GetBufferTypeDesc(BufferType bufferType)
+{
+ const char *bufferDesc = "unknown buffer type";
+
+ switch (bufferType)
+ {
+ case Buf_Direct:
+ bufferDesc = "direct";
+ break;
+ case Buf_Local:
+ bufferDesc = "local";
+ break;
+ case Buf_Shared:
+ bufferDesc = "shared";
+ break;
+ case Buf_Strategy:
+ bufferDesc = "strategy";
+ break;
+ }
+
+ return bufferDesc;
+}
+
/* ----------------------------------------------------------------
* database path / name support stuff
* ----------------------------------------------------------------
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index d068d6532e..54661e2b5f 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5642,6 +5642,15 @@
proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
+{ oid => '8459', descr => 'statistics: counts of buffer accesses by backend type and buffer type',
+ proname => 'pg_stat_get_buffers_accesses', provolatile => 's', proisstrict => 'f',
+ prorows => '52', proretset => 't',
+ proparallel => 'r', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,text,int8,int8,int8,int8,timestamptz}',
+ proargmodes => '{o,o,o,o,o,o,o}',
+ proargnames => '{backend_type,buffer_type,alloc,extend,fsync,write,stats_reset}',
+ prosrc => 'pg_stat_get_buffers_accesses' },
+
{ oid => '1136', descr => 'statistics: information about WAL activity',
proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
proparallel => 'r', prorettype => 'record', proargtypes => '',
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 90a3016065..266d32835c 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -338,9 +338,32 @@ typedef enum BackendType
B_LOGGER,
} BackendType;
+#define BACKEND_NUM_TYPES (B_LOGGER + 1)
+
+typedef enum BufferAccessType
+{
+ BA_Alloc,
+ BA_Extend,
+ BA_Fsync,
+ BA_Write,
+} BufferAccessType;
+
+#define BUFFER_ACCESS_NUM_TYPES (BA_Write + 1)
+
+typedef enum BufferType
+{
+ Buf_Direct,
+ Buf_Local,
+ Buf_Shared,
+ Buf_Strategy,
+} BufferType;
+
+#define BUFFER_NUM_TYPES (Buf_Strategy + 1)
+
extern BackendType MyBackendType;
extern const char *GetBackendTypeDesc(BackendType backendType);
+extern const char *GetBufferTypeDesc(BufferType bufferType);
extern void SetDatabasePath(const char *path);
extern void checkDataDir(void);
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index bcd3588ea2..b73265ab13 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -72,6 +72,7 @@ typedef enum StatMsgType
PGSTAT_MTYPE_ARCHIVER,
PGSTAT_MTYPE_BGWRITER,
PGSTAT_MTYPE_CHECKPOINTER,
+ PGSTAT_MTYPE_BUFFER_ACTIONS,
PGSTAT_MTYPE_WAL,
PGSTAT_MTYPE_SLRU,
PGSTAT_MTYPE_FUNCSTAT,
@@ -138,6 +139,7 @@ typedef enum PgStat_Shared_Reset_Target
{
RESET_ARCHIVER,
RESET_BGWRITER,
+ RESET_BUFFERS,
RESET_WAL
} PgStat_Shared_Reset_Target;
@@ -224,7 +226,9 @@ typedef struct PgStat_MsgHdr
* platforms, but we're being conservative here.)
* ----------
*/
-#define PGSTAT_MAX_MSG_SIZE 1000
+/*
+ * TODO: how sketchy is this? What can I do instead? The array of
+ * counters in the reset message is about 2kB, I think.
+ */
+#define PGSTAT_MAX_MSG_SIZE 3000
#define PGSTAT_MSG_PAYLOAD (PGSTAT_MAX_MSG_SIZE - sizeof(PgStat_MsgHdr))
@@ -331,6 +335,30 @@ typedef struct PgStat_MsgDropdb
} PgStat_MsgDropdb;
+/*
+ * Counts of each kind of buffer access (allocs, extends, fsyncs,
+ * writes) for a single buffer type.
+ */
+typedef struct PgStatBufferAccesses
+{
+ PgStat_Counter allocs;
+ PgStat_Counter extends;
+ PgStat_Counter fsyncs;
+ PgStat_Counter writes;
+} PgStatBufferAccesses;
+
+typedef struct PgStat_MsgBufferTypeAccesses
+{
+ PgStat_MsgHdr m_hdr;
+
+ BackendType backend_type;
+ PgStatBufferAccesses buffer_type_accesses[BUFFER_NUM_TYPES];
+} PgStat_MsgBufferTypeAccesses;
+
+typedef struct PgStat_BackendAccesses
+{
+ TimestampTz stat_reset_timestamp;
+ PgStat_MsgBufferTypeAccesses accesses[BACKEND_NUM_TYPES];
+ PgStat_MsgBufferTypeAccesses resets[BACKEND_NUM_TYPES];
+} PgStat_BackendAccesses;
+
/* ----------
* PgStat_MsgResetcounter Sent by the backend to tell the collector
* to reset counters
@@ -351,6 +379,7 @@ typedef struct PgStat_MsgResetsharedcounter
{
PgStat_MsgHdr m_hdr;
PgStat_Shared_Reset_Target m_resettarget;
+ PgStat_MsgBufferTypeAccesses resets[BACKEND_NUM_TYPES];
} PgStat_MsgResetsharedcounter;
/* ----------
@@ -703,6 +732,7 @@ typedef union PgStat_Msg
PgStat_MsgArchiver msg_archiver;
PgStat_MsgBgWriter msg_bgwriter;
PgStat_MsgCheckpointer msg_checkpointer;
+ PgStat_MsgBufferTypeAccesses msg_buffer_accesses;
PgStat_MsgWal msg_wal;
PgStat_MsgSLRU msg_slru;
PgStat_MsgFuncstat msg_funcstat;
@@ -879,6 +909,7 @@ typedef struct PgStat_GlobalStats
PgStat_CheckpointerStats checkpointer;
PgStat_BgWriterStats bgwriter;
+ PgStat_BackendAccesses buffers;
} PgStat_GlobalStats;
/*
@@ -1118,6 +1149,7 @@ extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
extern void pgstat_send_archiver(const char *xlog, bool failed);
extern void pgstat_send_bgwriter(void);
+extern void pgstat_send_buffers(void);
extern void pgstat_send_checkpointer(void);
extern void pgstat_send_wal(bool force);
@@ -1126,6 +1158,7 @@ extern void pgstat_send_wal(bool force);
* generate the pgstat* views.
* ----------
*/
+extern PgStat_BackendAccesses * pgstat_fetch_exited_backend_buffers(void);
extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid);
extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 33fcaf5c9a..7e385135db 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -310,10 +310,10 @@ extern void ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *
/* freelist.c */
extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
- uint32 *buf_state);
+ uint32 *buf_state, bool *from_ring);
extern void StrategyFreeBuffer(BufferDesc *buf);
extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
- BufferDesc *buf);
+ BufferDesc *buf, bool *from_ring);
extern int StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
extern void StrategyNotifyBgWriter(int bgwprocno);
diff --git a/src/include/utils/backend_status.h b/src/include/utils/backend_status.h
index 8042b817df..0364779978 100644
--- a/src/include/utils/backend_status.h
+++ b/src/include/utils/backend_status.h
@@ -13,6 +13,7 @@
#include "datatype/timestamp.h"
#include "libpq/pqcomm.h"
#include "miscadmin.h" /* for BackendType */
+#include "port/atomics.h"
#include "utils/backend_progress.h"
@@ -37,6 +38,15 @@ typedef enum BackendState
* ----------
*/
+/*
+ * PgBufferAccesses holds buffer access counters for one buffer type.
+ * Only the owning backend updates these counters; atomics let other
+ * processes read consistent 64-bit values without locking.
+ */
+typedef struct PgBufferAccesses
+{
+ pg_atomic_uint64 allocs;
+ pg_atomic_uint64 extends;
+ pg_atomic_uint64 fsyncs;
+ pg_atomic_uint64 writes;
+} PgBufferAccesses;
+
/*
* PgBackendSSLStatus
*
@@ -168,6 +178,7 @@ typedef struct PgBackendStatus
/* query identifier, optionally computed using post_parse_analyze_hook */
uint64 st_query_id;
+ PgBufferAccesses buffer_access_stats[BUFFER_NUM_TYPES];
} PgBackendStatus;
@@ -296,7 +307,39 @@ extern void pgstat_bestart(void);
extern void pgstat_clear_backend_activity_snapshot(void);
/* Activity reporting functions */
+typedef struct PgStat_MsgBufferTypeAccesses PgStat_MsgBufferTypeAccesses;
+
+static inline void
+pgstat_increment_buffer_access_type(BufferAccessType ba_type, BufferType buf_type)
+{
+ PgBufferAccesses *accesses;
+ PgBackendStatus *beentry = MyBEEntry;
+
+ Assert(beentry);
+
+ /*
+ * Only this backend ever updates its own counters, so a plain
+ * read-then-write on the atomic suffices; no atomic fetch-add is
+ * needed. Readers use pg_atomic_read_u64() to get consistent
+ * 64-bit values.
+ */
+ accesses = &beentry->buffer_access_stats[buf_type];
+ switch (ba_type)
+ {
+ case BA_Alloc:
+ pg_atomic_write_u64(&accesses->allocs,
+ pg_atomic_read_u64(&accesses->allocs) + 1);
+ break;
+ case BA_Extend:
+ pg_atomic_write_u64(&accesses->extends,
+ pg_atomic_read_u64(&accesses->extends) + 1);
+ break;
+ case BA_Fsync:
+ pg_atomic_write_u64(&accesses->fsyncs,
+ pg_atomic_read_u64(&accesses->fsyncs) + 1);
+ break;
+ case BA_Write:
+ pg_atomic_write_u64(&accesses->writes,
+ pg_atomic_read_u64(&accesses->writes) + 1);
+ break;
+ }
+}
+
extern void pgstat_report_activity(BackendState state, const char *cmd_str);
+extern void pgstat_report_live_backend_accesses(PgStat_MsgBufferTypeAccesses *backend_accesses);
extern void pgstat_report_query_id(uint64 query_id, bool force);
extern void pgstat_report_tempfile(size_t filesize);
extern void pgstat_report_appname(const char *appname);
@@ -312,6 +355,7 @@ extern uint64 pgstat_get_my_query_id(void);
* generate the pgstat* views.
* ----------
*/
+extern PgBackendStatus *pgstat_fetch_backend_statuses(void);
extern int pgstat_fetch_stat_numbackends(void);
extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
extern LocalPgBackendStatus *pgstat_fetch_stat_local_beentry(int beid);
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 2fa00a3c29..9172b0fcd2 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1828,6 +1828,14 @@ pg_stat_bgwriter| SELECT pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints
pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+pg_stat_buffers| SELECT b.backend_type,
+ b.buffer_type,
+ b.alloc,
+ b.extend,
+ b.fsync,
+ b.write,
+ b.stats_reset
+ FROM pg_stat_get_buffers_accesses() b(backend_type, buffer_type, alloc, extend, fsync, write, stats_reset);
pg_stat_database| SELECT d.oid AS datid,
d.datname,
CASE
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index feaaee6326..4ad672b35a 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -176,4 +176,8 @@ FROM prevstats AS pr;
DROP TABLE trunc_stats_test, trunc_stats_test1, trunc_stats_test2, trunc_stats_test3, trunc_stats_test4;
DROP TABLE prevstats;
+SELECT * FROM pg_stat_buffers;
+SELECT pg_stat_reset_shared('buffers');
+SELECT pg_sleep(2);
+SELECT * FROM pg_stat_buffers;
-- End of Stats Test
--
2.27.0
On Fri, Sep 24, 2021 at 5:58 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
On Thu, Sep 23, 2021 at 5:05 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
The only remaining TODOs are described in the commit message. The
most critical one is that the reset message doesn't work.
v10 is attached with updated comments and some limited code refactoring.
Attachments:
v10-0003-Remove-superfluous-bgwriter-stats.patch (text/x-patch)
From 566a5e6194acf6e670f9c1836d4296cf43b7ffcc Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Thu, 2 Sep 2021 11:47:41 -0400
Subject: [PATCH v10 3/3] Remove superfluous bgwriter stats
Remove stats from pg_stat_bgwriter which are now more clearly expressed
in pg_stat_buffers.
TODO:
- make pg_stat_checkpointer view and move relevant stats into it
- add additional stats to pg_stat_bgwriter
---
doc/src/sgml/monitoring.sgml | 47 ---------------------------
src/backend/catalog/system_views.sql | 6 +---
src/backend/postmaster/checkpointer.c | 26 ---------------
src/backend/postmaster/pgstat.c | 5 ---
src/backend/storage/buffer/bufmgr.c | 6 ----
src/backend/utils/adt/pgstatfuncs.c | 30 -----------------
src/include/catalog/pg_proc.dat | 22 -------------
src/include/pgstat.h | 10 ------
src/test/regress/expected/rules.out | 5 ---
9 files changed, 1 insertion(+), 156 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 60627c692a..08772652ac 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3416,24 +3416,6 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
</para></entry>
</row>
- <row>
- <entry role="catalog_table_entry"><para role="column_definition">
- <structfield>buffers_checkpoint</structfield> <type>bigint</type>
- </para>
- <para>
- Number of buffers written during checkpoints
- </para></entry>
- </row>
-
- <row>
- <entry role="catalog_table_entry"><para role="column_definition">
- <structfield>buffers_clean</structfield> <type>bigint</type>
- </para>
- <para>
- Number of buffers written by the background writer
- </para></entry>
- </row>
-
<row>
<entry role="catalog_table_entry"><para role="column_definition">
<structfield>maxwritten_clean</structfield> <type>bigint</type>
@@ -3444,35 +3426,6 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
</para></entry>
</row>
- <row>
- <entry role="catalog_table_entry"><para role="column_definition">
- <structfield>buffers_backend</structfield> <type>bigint</type>
- </para>
- <para>
- Number of buffers written directly by a backend
- </para></entry>
- </row>
-
- <row>
- <entry role="catalog_table_entry"><para role="column_definition">
- <structfield>buffers_backend_fsync</structfield> <type>bigint</type>
- </para>
- <para>
- Number of times a backend had to execute its own
- <function>fsync</function> call (normally the background writer handles those
- even when the backend does its own write)
- </para></entry>
- </row>
-
- <row>
- <entry role="catalog_table_entry"><para role="column_definition">
- <structfield>buffers_alloc</structfield> <type>bigint</type>
- </para>
- <para>
- Number of buffers allocated
- </para></entry>
- </row>
-
<row>
<entry role="catalog_table_entry"><para role="column_definition">
<structfield>stats_reset</structfield> <type>timestamp with time zone</type>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 30280d520b..c45c261f4b 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1058,18 +1058,14 @@ CREATE VIEW pg_stat_archiver AS
s.stats_reset
FROM pg_stat_get_archiver() s;
+-- TODO: make separate pg_stat_checkpointer view
CREATE VIEW pg_stat_bgwriter AS
SELECT
pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints_timed,
pg_stat_get_bgwriter_requested_checkpoints() AS checkpoints_req,
pg_stat_get_checkpoint_write_time() AS checkpoint_write_time,
pg_stat_get_checkpoint_sync_time() AS checkpoint_sync_time,
- pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint,
- pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean,
pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
- pg_stat_get_buf_written_backend() AS buffers_backend,
- pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
- pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
CREATE VIEW pg_stat_buffers AS
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index c0c4122fd5..829f52cc8f 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -90,17 +90,9 @@
* requesting backends since the last checkpoint start. The flags are
* chosen so that OR'ing is the correct way to combine multiple requests.
*
- * num_backend_writes is used to count the number of buffer writes performed
- * by user backend processes. This counter should be wide enough that it
- * can't overflow during a single processing cycle. num_backend_fsync
- * counts the subset of those writes that also had to do their own fsync,
- * because the checkpointer failed to absorb their request.
- *
* The requests array holds fsync requests sent by backends and not yet
* absorbed by the checkpointer.
*
- * Unlike the checkpoint fields, num_backend_writes, num_backend_fsync, and
- * the requests fields are protected by CheckpointerCommLock.
*----------
*/
typedef struct
@@ -124,9 +116,6 @@ typedef struct
ConditionVariable start_cv; /* signaled when ckpt_started advances */
ConditionVariable done_cv; /* signaled when ckpt_done advances */
- uint32 num_backend_writes; /* counts user backend buffer writes */
- uint32 num_backend_fsync; /* counts user backend fsync calls */
-
int num_requests; /* current # of requests */
int max_requests; /* allocated array size */
CheckpointerRequest requests[FLEXIBLE_ARRAY_MEMBER];
@@ -1085,10 +1074,6 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
- /* Count all backend writes regardless of if they fit in the queue */
- if (!AmBackgroundWriterProcess())
- CheckpointerShmem->num_backend_writes++;
-
/*
* If the checkpointer isn't running or the request queue is full, the
* backend will have to perform its own fsync request. But before forcing
@@ -1102,8 +1087,6 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
* Count the subset of writes where backends have to do their own
* fsync
*/
- if (!AmBackgroundWriterProcess())
- CheckpointerShmem->num_backend_fsync++;
pgstat_increment_buffer_access_type(BA_Fsync, Buf_Shared);
LWLockRelease(CheckpointerCommLock);
return false;
@@ -1261,15 +1244,6 @@ AbsorbSyncRequests(void)
LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
- /* Transfer stats counts into pending pgstats message */
- PendingCheckpointerStats.m_buf_written_backend
- += CheckpointerShmem->num_backend_writes;
- PendingCheckpointerStats.m_buf_fsync_backend
- += CheckpointerShmem->num_backend_fsync;
-
- CheckpointerShmem->num_backend_writes = 0;
- CheckpointerShmem->num_backend_fsync = 0;
-
/*
* We try to avoid holding the lock for a long time by copying the request
* array, and processing the requests after releasing the lock.
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 0df7c3ceb6..7a1a5d22ed 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -5585,9 +5585,7 @@ pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
static void
pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
{
- globalStats.bgwriter.buf_written_clean += msg->m_buf_written_clean;
globalStats.bgwriter.maxwritten_clean += msg->m_maxwritten_clean;
- globalStats.bgwriter.buf_alloc += msg->m_buf_alloc;
}
/* ----------
@@ -5603,9 +5601,6 @@ pgstat_recv_checkpointer(PgStat_MsgCheckpointer *msg, int len)
globalStats.checkpointer.requested_checkpoints += msg->m_requested_checkpoints;
globalStats.checkpointer.checkpoint_write_time += msg->m_checkpoint_write_time;
globalStats.checkpointer.checkpoint_sync_time += msg->m_checkpoint_sync_time;
- globalStats.checkpointer.buf_written_checkpoints += msg->m_buf_written_checkpoints;
- globalStats.checkpointer.buf_written_backend += msg->m_buf_written_backend;
- globalStats.checkpointer.buf_fsync_backend += msg->m_buf_fsync_backend;
}
static void
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 9832d35b90..997fff9f3f 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2165,7 +2165,6 @@ BufferSync(int flags)
if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
{
TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
- PendingCheckpointerStats.m_buf_written_checkpoints++;
num_written++;
}
}
@@ -2274,9 +2273,6 @@ BgBufferSync(WritebackContext *wb_context)
*/
strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
- /* Report buffer alloc counts to pgstat */
- PendingBgWriterStats.m_buf_alloc += recent_alloc;
-
/*
* If we're not running the LRU scan, just stop after doing the stats
* stuff. We mark the saved state invalid so that we can recover sanely
@@ -2473,8 +2469,6 @@ BgBufferSync(WritebackContext *wb_context)
reusable_buffers++;
}
- PendingBgWriterStats.m_buf_written_clean += num_written;
-
#ifdef BGW_DEBUG
elog(DEBUG1, "bgwriter: recent_alloc=%u smoothed=%.2f delta=%ld ahead=%d density=%.2f reusable_est=%d upcoming_est=%d scanned=%d wrote=%d reusable=%d",
recent_alloc, smoothed_alloc, strategy_delta, bufs_ahead,
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 477caf2536..998625d490 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1738,18 +1738,6 @@ pg_stat_get_bgwriter_requested_checkpoints(PG_FUNCTION_ARGS)
PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->requested_checkpoints);
}
-Datum
-pg_stat_get_bgwriter_buf_written_checkpoints(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_written_checkpoints);
-}
-
-Datum
-pg_stat_get_bgwriter_buf_written_clean(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_written_clean);
-}
-
Datum
pg_stat_get_bgwriter_maxwritten_clean(PG_FUNCTION_ARGS)
{
@@ -1778,24 +1766,6 @@ pg_stat_get_bgwriter_stat_reset_time(PG_FUNCTION_ARGS)
PG_RETURN_TIMESTAMPTZ(pgstat_fetch_stat_bgwriter()->stat_reset_timestamp);
}
-Datum
-pg_stat_get_buf_written_backend(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_written_backend);
-}
-
-Datum
-pg_stat_get_buf_fsync_backend(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_fsync_backend);
-}
-
-Datum
-pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
-}
-
Datum
pg_stat_get_buffers_accesses(PG_FUNCTION_ARGS)
{
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 54661e2b5f..02f624c18c 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5600,16 +5600,6 @@
proname => 'pg_stat_get_bgwriter_requested_checkpoints', provolatile => 's',
proparallel => 'r', prorettype => 'int8', proargtypes => '',
prosrc => 'pg_stat_get_bgwriter_requested_checkpoints' },
-{ oid => '2771',
- descr => 'statistics: number of buffers written by the bgwriter during checkpoints',
- proname => 'pg_stat_get_bgwriter_buf_written_checkpoints', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_bgwriter_buf_written_checkpoints' },
-{ oid => '2772',
- descr => 'statistics: number of buffers written by the bgwriter for cleaning dirty buffers',
- proname => 'pg_stat_get_bgwriter_buf_written_clean', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_bgwriter_buf_written_clean' },
{ oid => '2773',
descr => 'statistics: number of times the bgwriter stopped processing when it had written too many buffers while cleaning',
proname => 'pg_stat_get_bgwriter_maxwritten_clean', provolatile => 's',
@@ -5629,18 +5619,6 @@
proname => 'pg_stat_get_checkpoint_sync_time', provolatile => 's',
proparallel => 'r', prorettype => 'float8', proargtypes => '',
prosrc => 'pg_stat_get_checkpoint_sync_time' },
-{ oid => '2775', descr => 'statistics: number of buffers written by backends',
- proname => 'pg_stat_get_buf_written_backend', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_buf_written_backend' },
-{ oid => '3063',
- descr => 'statistics: number of backend buffer writes that did their own fsync',
- proname => 'pg_stat_get_buf_fsync_backend', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_buf_fsync_backend' },
-{ oid => '2859', descr => 'statistics: number of buffer allocations',
- proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
- prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
{ oid => '8459', descr => 'statistics: counts of various types of accesses of buffers done by each backend type',
proname => 'pg_stat_get_buffers_accesses', provolatile => 's', proisstrict => 'f',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 38b2167380..fb074d7094 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -501,9 +501,7 @@ typedef struct PgStat_MsgBgWriter
{
PgStat_MsgHdr m_hdr;
- PgStat_Counter m_buf_written_clean;
PgStat_Counter m_maxwritten_clean;
- PgStat_Counter m_buf_alloc;
} PgStat_MsgBgWriter;
/* ----------
@@ -516,9 +514,6 @@ typedef struct PgStat_MsgCheckpointer
PgStat_Counter m_timed_checkpoints;
PgStat_Counter m_requested_checkpoints;
- PgStat_Counter m_buf_written_checkpoints;
- PgStat_Counter m_buf_written_backend;
- PgStat_Counter m_buf_fsync_backend;
PgStat_Counter m_checkpoint_write_time; /* times in milliseconds */
PgStat_Counter m_checkpoint_sync_time;
} PgStat_MsgCheckpointer;
@@ -894,9 +889,7 @@ typedef struct PgStat_ArchiverStats
*/
typedef struct PgStat_BgWriterStats
{
- PgStat_Counter buf_written_clean;
PgStat_Counter maxwritten_clean;
- PgStat_Counter buf_alloc;
TimestampTz stat_reset_timestamp;
} PgStat_BgWriterStats;
@@ -910,9 +903,6 @@ typedef struct PgStat_CheckpointerStats
PgStat_Counter requested_checkpoints;
PgStat_Counter checkpoint_write_time; /* times in milliseconds */
PgStat_Counter checkpoint_sync_time;
- PgStat_Counter buf_written_checkpoints;
- PgStat_Counter buf_written_backend;
- PgStat_Counter buf_fsync_backend;
} PgStat_CheckpointerStats;
/*
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 9172b0fcd2..ac2f7cf61e 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1821,12 +1821,7 @@ pg_stat_bgwriter| SELECT pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints
pg_stat_get_bgwriter_requested_checkpoints() AS checkpoints_req,
pg_stat_get_checkpoint_write_time() AS checkpoint_write_time,
pg_stat_get_checkpoint_sync_time() AS checkpoint_sync_time,
- pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint,
- pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean,
pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
- pg_stat_get_buf_written_backend() AS buffers_backend,
- pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
- pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
pg_stat_buffers| SELECT b.backend_type,
b.buffer_type,
--
2.27.0
Attachment: v10-0001-Allow-bootstrap-process-to-beinit.patch (text/x-patch)
From b0a24e0cd0115f5bfb15a69693ce205a9dca841e Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Fri, 24 Sep 2021 17:39:12 -0400
Subject: [PATCH v10 1/3] Allow bootstrap process to beinit
---
src/backend/utils/init/postinit.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 78bc64671e..fba5864172 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -670,8 +670,7 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
EnablePortalManager();
/* Initialize status reporting */
- if (!bootstrap)
- pgstat_beinit();
+ pgstat_beinit();
/*
* Load relcache entries for the shared system catalogs. This must create
--
2.27.0
Attachment: v10-0002-Add-system-view-tracking-accesses-to-buffers.patch (text/x-patch)
From d8fb7a3b83a5ec322201cf8b02b49e36e58fe26d Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Thu, 2 Sep 2021 11:33:59 -0400
Subject: [PATCH v10 2/3] Add system view tracking accesses to buffers
Add pg_stat_buffers, a system view which tracks the number of buffers of
a particular type (e.g. shared, local) allocated, written, fsync'd, and
extended by each backend type.
Some of these should always be zero. For example, a checkpointer backend
will not use a BufferAccessStrategy (currently), so buffer type
"strategy" for checkpointer will be 0 for all buffer access types
(alloc, write, fsync, and extend).
All backends increment a counter in their PgBackendStatus when
performing a buffer access. On exit, backends send these stats to the
stats collector to be persisted.
When stats are reset, the reset message includes the current values of
all the live backends' buffer access counters. When receiving this
message, the stats collector will 1) save these reset values in an array
of "resets" and 2) zero out the exited backends' saved buffer access
counters. This is required for accurate stats after a reset without
writing to other backends' PgBackendStatus.
When the pg_stat_buffers view is queried, sum live backends' stats with
saved stats from exited backends and subtract saved reset stats,
returning the total.
Each row of the view is for a particular backend type and a particular
buffer type (e.g. shared buffer accesses by checkpointer) and each
column in the view is the total number of buffers of each kind of buffer
access (e.g. written). So a cell in the view would be, for example, the
number of shared buffers written by checkpointer since the last stats
reset.
Note that this commit does not add code to increment buffer accesses for
all types of buffers. It includes all possible combinations in the stats
view but doesn't populate all of them.
TODO:
- pgstat reset message too large -- needs to be fixed
- Wrappers for smgr funcs to protect against regressions and cover other
buffer types
- Consider helper func in pgstatfuncs.c to refactor out some of the
redundant tuplestore creation code from pg_stat_get_progress_info,
pg_stat_get_activity, etc
- Remove pg_stats test I added
- When finished, catalog bump
- pgindent
---
doc/src/sgml/monitoring.sgml | 116 ++++++++++++++++-
src/backend/catalog/system_views.sql | 11 ++
src/backend/postmaster/checkpointer.c | 1 +
src/backend/postmaster/pgstat.c | 131 ++++++++++++++++++-
src/backend/storage/buffer/bufmgr.c | 25 +++-
src/backend/storage/buffer/freelist.c | 22 +++-
src/backend/utils/activity/backend_status.c | 43 ++++++-
src/backend/utils/adt/pgstatfuncs.c | 136 ++++++++++++++++++++
src/backend/utils/init/miscinit.c | 25 ++++
src/include/catalog/pg_proc.dat | 9 ++
src/include/miscadmin.h | 23 ++++
src/include/pgstat.h | 52 +++++++-
src/include/storage/buf_internals.h | 4 +-
src/include/utils/backend_status.h | 46 +++++++
src/test/regress/expected/rules.out | 8 ++
src/test/regress/sql/stats.sql | 4 +
16 files changed, 639 insertions(+), 17 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 2281ba120f..60627c692a 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -444,6 +444,15 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</entry>
</row>
+ <row>
+ <entry><structname>pg_stat_buffers</structname><indexterm><primary>pg_stat_buffers</primary></indexterm></entry>
+ <entry>A row for each buffer type for each backend type showing
+ statistics about backend buffer activity. See
+ <link linkend="monitoring-pg-stat-buffers-view">
+ <structname>pg_stat_buffers</structname></link> for details.
+ </entry>
+ </row>
+
<row>
<entry><structname>pg_stat_wal</structname><indexterm><primary>pg_stat_wal</primary></indexterm></entry>
<entry>One row only, showing statistics about WAL activity. See
@@ -3478,6 +3487,101 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
</sect2>
+ <sect2 id="monitoring-pg-stat-buffers-view">
+ <title><structname>pg_stat_buffers</structname></title>
+
+ <indexterm>
+ <primary>pg_stat_buffers</primary>
+ </indexterm>
+
+ <para>
+   The <structname>pg_stat_buffers</structname> view has a row for each buffer
+   type for each backend type, containing cluster-wide statistics for that
+   combination of backend type and buffer type.
+ </para>
+
+  <table id="pg-stat-buffers-view" xreflabel="pg_stat_buffers">
+ <title><structname>pg_stat_buffers</structname> View</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>backend_type</structfield> <type>text</type>
+ </para>
+ <para>
+ Type of backend (e.g. background worker, autovacuum worker).
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>buffer_type</structfield> <type>text</type>
+ </para>
+ <para>
+ Type of buffer accessed (e.g. shared).
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>alloc</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers allocated.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>extend</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers extended.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>fsync</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers fsynced.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>write</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers written.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
+ </para>
+ <para>
+ Time at which these statistics were last reset.
+ </para></entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ </sect2>
+
<sect2 id="monitoring-pg-stat-wal-view">
<title><structname>pg_stat_wal</structname></title>
@@ -5074,12 +5178,14 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
</para>
<para>
Resets some cluster-wide statistics counters to zero, depending on the
- argument. The argument can be <literal>bgwriter</literal> to reset
- all the counters shown in
- the <structname>pg_stat_bgwriter</structname>
+ argument. The argument can be <literal>bgwriter</literal> to reset all
+ the counters shown in the <structname>pg_stat_bgwriter</structname>
view, <literal>archiver</literal> to reset all the counters shown in
- the <structname>pg_stat_archiver</structname> view or <literal>wal</literal>
- to reset all the counters shown in the <structname>pg_stat_wal</structname> view.
+ the <structname>pg_stat_archiver</structname> view,
+ <literal>wal</literal> to reset all the counters shown in the
+ <structname>pg_stat_wal</structname> view, or
+ <literal>buffers</literal> to reset all the counters shown in the
+ <structname>pg_stat_buffers</structname> view.
</para>
<para>
This function is restricted to superusers by default, but other users
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 55f6e3711d..30280d520b 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1072,6 +1072,17 @@ CREATE VIEW pg_stat_bgwriter AS
pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+CREATE VIEW pg_stat_buffers AS
+SELECT
+ b.backend_type,
+ b.buffer_type,
+ b.alloc,
+ b.extend,
+ b.fsync,
+ b.write,
+ b.stats_reset
+FROM pg_stat_get_buffers_accesses() b;
+
CREATE VIEW pg_stat_wal AS
SELECT
w.wal_records,
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index be7366379d..c0c4122fd5 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -1104,6 +1104,7 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
*/
if (!AmBackgroundWriterProcess())
CheckpointerShmem->num_backend_fsync++;
+ pgstat_increment_buffer_access_type(BA_Fsync, Buf_Shared);
LWLockRelease(CheckpointerCommLock);
return false;
}
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index b7d0fbaefd..0df7c3ceb6 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -124,9 +124,12 @@ char *pgstat_stat_filename = NULL;
char *pgstat_stat_tmpname = NULL;
/*
- * BgWriter and WAL global statistics counters.
- * Stored directly in a stats message structure so they can be sent
- * without needing to copy things around. We assume these init to zeroes.
+ * BgWriter, Checkpointer, WAL, and I/O global statistics counters. I/O
+ * statistics on buffer accesses are tracked in PgBackendStatus while a
+ * backend is alive and are sent to the stats collector in a
+ * PgStat_MsgBufferTypeAccesses message before the backend exits.
+ * All others are stored directly in a stats message structure so they can be
+ * sent without needing to copy things around. We assume these init to zeroes.
*/
PgStat_MsgBgWriter PendingBgWriterStats;
PgStat_MsgCheckpointer PendingCheckpointerStats;
@@ -362,6 +365,7 @@ static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
static void pgstat_recv_checkpointer(PgStat_MsgCheckpointer *msg, int len);
+static void pgstat_recv_buffer_type_accesses(PgStat_MsgBufferTypeAccesses *msg, int len);
static void pgstat_recv_wal(PgStat_MsgWal *msg, int len);
static void pgstat_recv_slru(PgStat_MsgSLRU *msg, int len);
static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
@@ -974,6 +978,7 @@ pgstat_report_stat(bool disconnect)
/* Now, send function statistics */
pgstat_send_funcstats();
+
/* Send WAL statistics */
pgstat_send_wal(true);
@@ -1452,6 +1457,13 @@ pgstat_reset_shared_counters(const char *target)
msg.m_resettarget = RESET_ARCHIVER;
else if (strcmp(target, "bgwriter") == 0)
msg.m_resettarget = RESET_BGWRITER;
+ else if (strcmp(target, "buffers") == 0)
+ {
+ memset(msg.resets, 0, sizeof(PgStat_MsgBufferTypeAccesses) * BACKEND_NUM_TYPES);
+ msg.m_resettarget = RESET_BUFFERS;
+ pgstat_report_live_backend_accesses(msg.resets);
+ }
else if (strcmp(target, "wal") == 0)
msg.m_resettarget = RESET_WAL;
else
@@ -2760,6 +2772,20 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
rec->tuples_inserted + rec->tuples_updated;
}
+/*
+ * Support function for SQL-callable pgstat* functions. Returns a pointer to
+ * the PgStat_BackendAccesses structure, which tracks buffer access statistics
+ * for exited backends as well as the saved values used for reset arithmetic.
+ */
+PgStat_BackendAccesses *
+pgstat_fetch_exited_backend_buffers(void)
+{
+ backend_read_statsfile();
+
+ return &globalStats.buffers;
+}
+
/* ----------
* pgstat_fetch_stat_dbentry() -
@@ -2998,6 +3024,13 @@ static void
pgstat_shutdown_hook(int code, Datum arg)
{
Assert(!pgstat_is_shutdown);
+ /*
+ * Buffer access stats only need to be sent when a process exits, since
+ * pg_stat_get_buffers_accesses() reads live backends' PgBackendStatus
+ * directly and sums it with the totals from exited backends persisted by
+ * the stats collector.
+ */
+ pgstat_send_buffers();
/*
* If we got as far as discovering our own database ID, we can report what
@@ -3092,6 +3125,29 @@ pgstat_send(void *msg, int len)
#endif
}
+/*
+ * Add live buffer access stats for all buffer types (e.g. shared, local) to
+ * those in the equivalent stats structure for exited backends. Note that this
+ * adds and doesn't set, so the destination buffer access stats should be
+ * zeroed out by the caller initially. This would commonly be used to transfer
+ * all buffer access stats for all buffer types for a particular backend type
+ * to the pgstats structure.
+ */
+void
+pgstat_add_buffer_type_accesses(PgStatBufferAccesses *dest, PgBufferAccesses *src,
+ int buffer_num_types)
+{
+ int buffer_type;
+
+ for (buffer_type = 0; buffer_type < buffer_num_types; buffer_type++)
+ {
+ dest->allocs += pg_atomic_read_u64(&src->allocs);
+ dest->extends += pg_atomic_read_u64(&src->extends);
+ dest->fsyncs += pg_atomic_read_u64(&src->fsyncs);
+ dest->writes += pg_atomic_read_u64(&src->writes);
+ dest++;
+ src++;
+ }
+}
+
/* ----------
* pgstat_send_archiver() -
*
@@ -3148,6 +3204,35 @@ pgstat_send_bgwriter(void)
MemSet(&PendingBgWriterStats, 0, sizeof(PendingBgWriterStats));
}
+/* ----------
+ * pgstat_send_buffers() -
+ *
+ * Before exiting, a backend sends its buffer access statistics to the
+ * collector so that they may be persisted.
+ * ----------
+ */
+void
+pgstat_send_buffers(void)
+{
+ PgStat_MsgBufferTypeAccesses msg;
+
+ PgBackendStatus *beentry = MyBEEntry;
+
+ if (!beentry)
+ return;
+
+ MemSet(&msg, 0, sizeof(msg));
+ msg.backend_type = beentry->st_backendType;
+
+ pgstat_add_buffer_type_accesses(msg.buffer_type_accesses,
+ (PgBufferAccesses *) &beentry->buffer_access_stats,
+ BUFFER_NUM_TYPES);
+
+ pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_BUFFER_ACTIONS);
+ pgstat_send(&msg, sizeof(msg));
+}
+
/* ----------
* pgstat_send_checkpointer() -
*
@@ -3522,6 +3607,10 @@ PgstatCollectorMain(int argc, char *argv[])
pgstat_recv_checkpointer(&msg.msg_checkpointer, len);
break;
+ case PGSTAT_MTYPE_BUFFER_ACTIONS:
+ pgstat_recv_buffer_type_accesses(&msg.msg_buffer_accesses, len);
+ break;
+
case PGSTAT_MTYPE_WAL:
pgstat_recv_wal(&msg.msg_wal, len);
break;
@@ -5222,9 +5311,16 @@ pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
if (msg->m_resettarget == RESET_BGWRITER)
{
/* Reset the global, bgwriter and checkpointer statistics for the cluster. */
- memset(&globalStats, 0, sizeof(globalStats));
+ memset(&globalStats.checkpointer, 0, sizeof(globalStats.checkpointer));
+ memset(&globalStats.bgwriter, 0, sizeof(globalStats.bgwriter));
globalStats.bgwriter.stat_reset_timestamp = GetCurrentTimestamp();
}
+ else if (msg->m_resettarget == RESET_BUFFERS)
+ {
+ memset(&globalStats.buffers.accesses, 0, sizeof(globalStats.buffers.accesses));
+ memcpy(globalStats.buffers.resets, msg->resets, sizeof(msg->resets));
+ globalStats.buffers.stat_reset_timestamp = GetCurrentTimestamp();
+ }
else if (msg->m_resettarget == RESET_ARCHIVER)
{
/* Reset the archiver statistics for the cluster. */
@@ -5512,6 +5608,33 @@ pgstat_recv_checkpointer(PgStat_MsgCheckpointer *msg, int len)
globalStats.checkpointer.buf_fsync_backend += msg->m_buf_fsync_backend;
}
+static void
+pgstat_recv_buffer_type_accesses(PgStat_MsgBufferTypeAccesses *msg, int len)
+{
+ int buffer_type;
+ PgStatBufferAccesses *src_buffer_accesses = msg->buffer_type_accesses;
+ PgStatBufferAccesses *dest_buffer_accesses = globalStats.buffers.accesses[msg->backend_type].buffer_type_accesses;
+
+ /*
+ * No users will likely need PgStat_MsgBufferTypeAccesses->backend_type
+ * when accessing it from globalStats since its place in the
+ * globalStats.buffers.accesses array indicates backend_type. However,
+ * leaving it undefined seemed like an invitation for unnecessary future
+ * bugs.
+ */
+ globalStats.buffers.accesses[msg->backend_type].backend_type = msg->backend_type;
+
+ for (buffer_type = 0; buffer_type < BUFFER_NUM_TYPES; buffer_type++)
+ {
+ PgStatBufferAccesses *src = &src_buffer_accesses[buffer_type];
+ PgStatBufferAccesses *dest = &dest_buffer_accesses[buffer_type];
+ dest->allocs += src->allocs;
+ dest->extends += src->extends;
+ dest->fsyncs += src->fsyncs;
+ dest->writes += src->writes;
+ }
+}
+
/* ----------
* pgstat_recv_wal() -
*
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index e88e4e918b..9832d35b90 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -972,6 +972,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (isExtend)
{
+ pgstat_increment_buffer_access_type(BA_Extend, Buf_Shared);
/* new buffers are zero-filled */
MemSet((char *) bufBlock, 0, BLCKSZ);
/* don't set checksum for all-zero page */
@@ -1172,6 +1173,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
/* Loop here in case we have to try another victim buffer */
for (;;)
{
+ bool from_ring;
/*
* Ensure, while the spinlock's not yet held, that there's a free
* refcount entry.
@@ -1182,7 +1184,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* Select a victim buffer. The buffer is returned with its header
* spinlock still held!
*/
- buf = StrategyGetBuffer(strategy, &buf_state);
+ buf = StrategyGetBuffer(strategy, &buf_state, &from_ring);
Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
@@ -1236,7 +1238,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
UnlockBufHdr(buf, buf_state);
if (XLogNeedsFlush(lsn) &&
- StrategyRejectBuffer(strategy, buf))
+ StrategyRejectBuffer(strategy, buf, &from_ring))
{
/* Drop lock/pin and loop around for another buffer */
LWLockRelease(BufferDescriptorGetContentLock(buf));
@@ -1245,6 +1247,23 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
}
}
+ /*
+ * When a strategy is in use, if the dirty buffer was selected
+ * from the strategy ring and we did not bother checking the
+ * freelist or doing a clock sweep to look for a clean shared
+ * buffer to use, the write will be counted as a strategy
+ * write. However, if the dirty buffer was obtained from the
+ * freelist or a clock sweep, it is counted as a regular write.
+ * When a strategy is not in use, at this point, the write can
+ * only be a "regular" write of a dirty buffer.
+ */
+
+ if (from_ring)
+ pgstat_increment_buffer_access_type(BA_Write, Buf_Strategy);
+ else
+ pgstat_increment_buffer_access_type(BA_Write, Buf_Shared);
+
/* OK, do the I/O */
TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(forkNum, blockNum,
smgr->smgr_rnode.node.spcNode,
@@ -2552,6 +2571,8 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
* Pin it, share-lock it, write it. (FlushBuffer will do nothing if the
* buffer is clean by the time we've locked it.)
*/
+	pgstat_increment_buffer_access_type(BA_Write, Buf_Shared);
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 6be80476db..866cdd3911 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -19,6 +19,7 @@
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/proc.h"
+#include "utils/backend_status.h"
#define INT_ACCESS_ONCE(var) ((int)(*((volatile int *)&(var))))
@@ -198,7 +199,7 @@ have_free_buffer(void)
* return the buffer with the buffer header spinlock still held.
*/
BufferDesc *
-StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
+StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_ring)
{
BufferDesc *buf;
int bgwprocno;
@@ -212,6 +213,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
if (strategy != NULL)
{
buf = GetBufferFromRing(strategy, buf_state);
+ *from_ring = (buf != NULL);
if (buf != NULL)
return buf;
}
@@ -247,6 +249,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
* the rate of buffer consumption. Note that buffers recycled by a
* strategy object are intentionally not counted here.
*/
+ pgstat_increment_buffer_access_type(BA_Alloc, Buf_Shared);
pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);
/*
@@ -683,8 +686,14 @@ AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
* if this buffer should be written and re-used.
*/
bool
-StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
+StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool *from_ring)
{
+ /*
+ * If we decide to use the dirty buffer selected by StrategyGetBuffer, then
+ * ensure that we count it as such in pg_stat_buffers view.
+ */
+ *from_ring = true;
+
/* We only do this in bulkread mode */
if (strategy->btype != BAS_BULKREAD)
return false;
@@ -700,5 +709,14 @@ StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
*/
strategy->buffers[strategy->current] = InvalidBuffer;
+ /*
+ * Since we will not be writing out a dirty buffer from the ring, set
+ * from_ring to false so that the caller does not count this write as a
+ * "strategy write" and can do proper bookkeeping for
+ * pg_stat_buffers.
+ */
+ *from_ring = false;
+
return true;
}
diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index 7229598822..6f02b624bb 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -279,7 +279,7 @@ pgstat_beinit(void)
* pgstat_bestart() -
*
* Initialize this backend's entry in the PgBackendStatus array.
- * Called from InitPostgres.
+ * Called from InitPostgres and AuxiliaryProcessMain
*
* Apart from auxiliary processes, MyBackendId, MyDatabaseId,
* session userid, and application_name must be set for a
@@ -293,6 +293,7 @@ pgstat_bestart(void)
{
volatile PgBackendStatus *vbeentry = MyBEEntry;
PgBackendStatus lbeentry;
+ int buffer_type;
#ifdef USE_SSL
PgBackendSSLStatus lsslstatus;
#endif
@@ -399,6 +400,14 @@ pgstat_bestart(void)
lbeentry.st_progress_command = PROGRESS_COMMAND_INVALID;
lbeentry.st_progress_command_target = InvalidOid;
lbeentry.st_query_id = UINT64CONST(0);
+ for (buffer_type = 0; buffer_type < BUFFER_NUM_TYPES; buffer_type++)
+ {
+ PgBufferAccesses *accesses = &lbeentry.buffer_access_stats[buffer_type];
+ pg_atomic_init_u64(&accesses->allocs, 0);
+ pg_atomic_init_u64(&accesses->extends, 0);
+ pg_atomic_init_u64(&accesses->fsyncs, 0);
+ pg_atomic_init_u64(&accesses->writes, 0);
+ }
/*
* we don't zero st_progress_param here to save cycles; nobody should
@@ -621,6 +630,32 @@ pgstat_report_activity(BackendState state, const char *cmd_str)
PGSTAT_END_WRITE_ACTIVITY(beentry);
}
+/*
+ * Iterate through BackendStatusArray and capture live backends' buffer access
+ * stats, adding them to that backend type's member of the backend_accesses
+ * structure.
+ */
+void
+pgstat_report_live_backend_accesses(PgStat_MsgBufferTypeAccesses *backend_accesses)
+{
+ int i;
+ PgBackendStatus *beentry = BackendStatusArray;
+ /*
+ * Loop through live backends and capture reset values
+ */
+ for (i = 0; i < MaxBackends + NUM_AUXPROCTYPES; i++, beentry++)
+ {
+ /* Skip unused slots; exited backends are accounted for separately */
+ if (beentry->st_procpid == 0)
+ continue;
+
+ pgstat_add_buffer_type_accesses(backend_accesses[beentry->st_backendType].buffer_type_accesses,
+ (PgBufferAccesses *) beentry->buffer_access_stats,
+ BUFFER_NUM_TYPES);
+ }
+}
+
/* --------
* pgstat_report_query_id() -
*
@@ -1046,6 +1081,12 @@ pgstat_get_my_query_id(void)
}
+PgBackendStatus *
+pgstat_fetch_backend_statuses(void)
+{
+ return BackendStatusArray;
+}
+
/* ----------
* pgstat_fetch_stat_beentry() -
*
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index ff5aedc99c..477caf2536 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1796,6 +1796,142 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
}
+Datum
+pg_stat_get_buffers_accesses(PG_FUNCTION_ARGS)
+{
+#define NROWS ((BACKEND_NUM_TYPES - 1) * BUFFER_NUM_TYPES)
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ TupleDesc tupdesc;
+ Tuplestorestate *tupstore;
+ MemoryContext per_query_ctx;
+ MemoryContext oldcontext;
+ PgStat_BackendAccesses *backend_accesses;
+ int buffer_type;
+ int backend_type;
+ Datum reset_time;
+ int i;
+ PgBackendStatus *beentry;
+
+ enum {
+ COLUMN_BACKEND_TYPE,
+ COLUMN_BUFFER_TYPE,
+ COLUMN_ALLOCS,
+ COLUMN_EXTENDS,
+ COLUMN_FSYNCS,
+ COLUMN_WRITES,
+ COLUMN_RESET_TIME,
+ COLUMN_LENGTH,
+ };
+
+ Datum all_values[NROWS][COLUMN_LENGTH];
+ bool all_nulls[NROWS][COLUMN_LENGTH];
+
+ /* check to see if caller supports us returning a tuplestore */
+ if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("set-valued function called in context that cannot accept a set")));
+ if (!(rsinfo->allowedModes & SFRM_Materialize))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("materialize mode required, but it is not allowed in this context")));
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+ oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+ tupstore = tuplestore_begin_heap(true, false, work_mem);
+ rsinfo->returnMode = SFRM_Materialize;
+ rsinfo->setResult = tupstore;
+ rsinfo->setDesc = tupdesc;
+
+ MemoryContextSwitchTo(oldcontext);
+
+ memset(all_values, 0, sizeof(all_values));
+ memset(all_nulls, 0, sizeof(all_nulls));
+
+ /*
+ * Loop through all live backends and count their buffer accesses for each
+ * buffer type
+ */
+ beentry = pgstat_fetch_backend_statuses();
+
+ for (i = 0; i < MaxBackends + NUM_AUXPROCTYPES; i++, beentry++)
+ {
+ PgBufferAccesses *buffer_accesses;
+
+ /* Skip unused slots; exited backends are accounted for separately */
+ if (beentry->st_procpid == 0)
+ continue;
+
+ buffer_accesses = beentry->buffer_access_stats;
+
+ for (buffer_type = 0; buffer_type < BUFFER_NUM_TYPES; buffer_type++)
+ {
+ int rownum = (beentry->st_backendType - 1) * BUFFER_NUM_TYPES + buffer_type;
+ Datum *values = all_values[rownum];
+
+ /*
+ * COLUMN_RESET_TIME, COLUMN_BACKEND_TYPE, and COLUMN_BUFFER_TYPE
+ * will all be set when looping through exited backends array
+ */
+ values[COLUMN_ALLOCS] += pg_atomic_read_u64(&buffer_accesses->allocs);
+ values[COLUMN_EXTENDS] += pg_atomic_read_u64(&buffer_accesses->extends);
+ values[COLUMN_FSYNCS] += pg_atomic_read_u64(&buffer_accesses->fsyncs);
+ values[COLUMN_WRITES] += pg_atomic_read_u64(&buffer_accesses->writes);
+ buffer_accesses++;
+ }
+ }
+
+ /* Add stats from all exited backends */
+ backend_accesses = pgstat_fetch_exited_backend_buffers();
+
+ reset_time = TimestampTzGetDatum(backend_accesses->stat_reset_timestamp);
+
+ /* 0 is not a valid BackendType */
+ for (backend_type = 1; backend_type < BACKEND_NUM_TYPES; backend_type++)
+ {
+ PgStatBufferAccesses *buffer_accesses = backend_accesses->accesses[backend_type].buffer_type_accesses;
+ PgStatBufferAccesses *resets = backend_accesses->resets[backend_type].buffer_type_accesses;
+
+ Datum backend_type_desc = CStringGetTextDatum(GetBackendTypeDesc(backend_type));
+
+ for (buffer_type = 0; buffer_type < BUFFER_NUM_TYPES; buffer_type++)
+ {
+ /*
+ * Subtract 1 from backend_type to avoid having rows for B_INVALID
+ * BackendType
+ */
+ Datum *values = all_values[(backend_type - 1) * BUFFER_NUM_TYPES + buffer_type];
+
+ values[COLUMN_BACKEND_TYPE] = backend_type_desc;
+ values[COLUMN_BUFFER_TYPE] = CStringGetTextDatum(GetBufferTypeDesc(buffer_type));
+ values[COLUMN_ALLOCS] = values[COLUMN_ALLOCS] + buffer_accesses->allocs - resets->allocs;
+ values[COLUMN_EXTENDS] = values[COLUMN_EXTENDS] + buffer_accesses->extends - resets->extends;
+ values[COLUMN_FSYNCS] = values[COLUMN_FSYNCS] + buffer_accesses->fsyncs - resets->fsyncs;
+ values[COLUMN_WRITES] = values[COLUMN_WRITES] + buffer_accesses->writes - resets->writes;
+ values[COLUMN_RESET_TIME] = reset_time;
+ buffer_accesses++;
+ resets++;
+ }
+ }
+
+ for (i = 0; i < NROWS; i++)
+ {
+ Datum *values = all_values[i];
+ bool *nulls = all_nulls[i];
+ tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+ }
+
+ /* clean up and return the tuplestore */
+ tuplestore_donestoring(tupstore);
+
+ return (Datum) 0;
+}
+
/*
* Returns statistics of WAL activity
*/
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 88801374b5..50b50d00ce 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -299,6 +299,31 @@ GetBackendTypeDesc(BackendType backendType)
return backendDesc;
}
+const char *
+GetBufferTypeDesc(BufferType bufferType)
+{
+ const char *bufferDesc = "unknown buffer type";
+
+ switch (bufferType)
+ {
+ case Buf_Direct:
+ bufferDesc = "direct";
+ break;
+ case Buf_Local:
+ bufferDesc = "local";
+ break;
+ case Buf_Shared:
+ bufferDesc = "shared";
+ break;
+ case Buf_Strategy:
+ bufferDesc = "strategy";
+ break;
+ }
+
+ return bufferDesc;
+}
+
+
/* ----------------------------------------------------------------
* database path / name support stuff
* ----------------------------------------------------------------
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index d068d6532e..54661e2b5f 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5642,6 +5642,15 @@
proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
+{ oid => '8459', descr => 'statistics: counts of various types of accesses of buffers done by each backend type',
+ proname => 'pg_stat_get_buffers_accesses', provolatile => 's', proisstrict => 'f',
+ prorows => '52', proretset => 't',
+ proparallel => 'r', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,text,int8,int8,int8,int8,timestamptz}',
+ proargmodes => '{o,o,o,o,o,o,o}',
+ proargnames => '{backend_type,buffer_type,alloc,extend,fsync,write,stats_reset}',
+ prosrc => 'pg_stat_get_buffers_accesses' },
+
{ oid => '1136', descr => 'statistics: information about WAL activity',
proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
proparallel => 'r', prorettype => 'record', proargtypes => '',
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 90a3016065..266d32835c 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -338,9 +338,32 @@ typedef enum BackendType
B_LOGGER,
} BackendType;
+#define BACKEND_NUM_TYPES (B_LOGGER + 1)
+
+typedef enum BufferAccessType
+{
+ BA_Alloc,
+ BA_Extend,
+ BA_Fsync,
+ BA_Write,
+} BufferAccessType;
+
+#define BUFFER_ACCESS_NUM_TYPES (BA_Write + 1)
+
+typedef enum BufferType
+{
+ Buf_Direct,
+ Buf_Local,
+ Buf_Shared,
+ Buf_Strategy,
+} BufferType;
+
+#define BUFFER_NUM_TYPES (Buf_Strategy + 1)
+
extern BackendType MyBackendType;
extern const char *GetBackendTypeDesc(BackendType backendType);
+extern const char *GetBufferTypeDesc(BufferType bufferType);
extern void SetDatabasePath(const char *path);
extern void checkDataDir(void);
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index bcd3588ea2..38b2167380 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -72,6 +72,7 @@ typedef enum StatMsgType
PGSTAT_MTYPE_ARCHIVER,
PGSTAT_MTYPE_BGWRITER,
PGSTAT_MTYPE_CHECKPOINTER,
+ PGSTAT_MTYPE_BUFFER_ACTIONS,
PGSTAT_MTYPE_WAL,
PGSTAT_MTYPE_SLRU,
PGSTAT_MTYPE_FUNCSTAT,
@@ -138,6 +139,7 @@ typedef enum PgStat_Shared_Reset_Target
{
RESET_ARCHIVER,
RESET_BGWRITER,
+ RESET_BUFFERS,
RESET_WAL
} PgStat_Shared_Reset_Target;
@@ -224,7 +226,9 @@ typedef struct PgStat_MsgHdr
* platforms, but we're being conservative here.)
* ----------
*/
-#define PGSTAT_MAX_MSG_SIZE 1000
+/*
+ * TODO: how sketchy is this? What can I do instead? The array of
+ * counters for the reset message is 2kB, I think.
+ */
+#define PGSTAT_MAX_MSG_SIZE 3000
#define PGSTAT_MSG_PAYLOAD (PGSTAT_MAX_MSG_SIZE - sizeof(PgStat_MsgHdr))
@@ -331,6 +335,45 @@ typedef struct PgStat_MsgDropdb
} PgStat_MsgDropdb;
+/*
+ * Structure for counting all types of buffer accesses in the stats collector
+ * It has no message header, so, it must be used within a
+ * PgStat_MsgBufferTypeAccesses when being sent to the stats collector.
+ */
+typedef struct PgStatBufferAccesses
+{
+ PgStat_Counter allocs;
+ PgStat_Counter extends;
+ PgStat_Counter fsyncs;
+ PgStat_Counter writes;
+} PgStatBufferAccesses;
+
+/*
+ * Sent by a backend to the stats collector to report all buffer accesses of
+ * all types of buffers for a given type of a backend. This will happen when
+ * the backend exits or when stats are reset.
+ */
+typedef struct PgStat_MsgBufferTypeAccesses
+{
+ PgStat_MsgHdr m_hdr;
+
+ BackendType backend_type;
+ PgStatBufferAccesses buffer_type_accesses[BUFFER_NUM_TYPES];
+} PgStat_MsgBufferTypeAccesses;
+
+/*
+ * Structure used by stats collector to keep track of all types of exited
+ * backends' buffer accesses for all types of buffers as well as all stats from
+ * live backends at the time of stats reset. resets is populated using a reset
+ * message sent to the stats collector.
+ */
+typedef struct PgStat_BackendAccesses
+{
+ TimestampTz stat_reset_timestamp;
+ PgStat_MsgBufferTypeAccesses accesses[BACKEND_NUM_TYPES];
+ PgStat_MsgBufferTypeAccesses resets[BACKEND_NUM_TYPES];
+} PgStat_BackendAccesses;
+
/* ----------
* PgStat_MsgResetcounter Sent by the backend to tell the collector
* to reset counters
@@ -351,6 +394,7 @@ typedef struct PgStat_MsgResetsharedcounter
{
PgStat_MsgHdr m_hdr;
PgStat_Shared_Reset_Target m_resettarget;
+ PgStat_MsgBufferTypeAccesses resets[BACKEND_NUM_TYPES];
} PgStat_MsgResetsharedcounter;
/* ----------
@@ -703,6 +747,7 @@ typedef union PgStat_Msg
PgStat_MsgArchiver msg_archiver;
PgStat_MsgBgWriter msg_bgwriter;
PgStat_MsgCheckpointer msg_checkpointer;
+ PgStat_MsgBufferTypeAccesses msg_buffer_accesses;
PgStat_MsgWal msg_wal;
PgStat_MsgSLRU msg_slru;
PgStat_MsgFuncstat msg_funcstat;
@@ -879,6 +924,7 @@ typedef struct PgStat_GlobalStats
PgStat_CheckpointerStats checkpointer;
PgStat_BgWriterStats bgwriter;
+ PgStat_BackendAccesses buffers;
} PgStat_GlobalStats;
/*
@@ -1116,8 +1162,11 @@ extern void pgstat_twophase_postcommit(TransactionId xid, uint16 info,
extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
void *recdata, uint32 len);
+extern void pgstat_add_buffer_type_accesses(PgStatBufferAccesses *dest,
+ PgBufferAccesses *src, int buffer_num_types);
extern void pgstat_send_archiver(const char *xlog, bool failed);
extern void pgstat_send_bgwriter(void);
+extern void pgstat_send_buffers(void);
extern void pgstat_send_checkpointer(void);
extern void pgstat_send_wal(bool force);
@@ -1126,6 +1175,7 @@ extern void pgstat_send_wal(bool force);
* generate the pgstat* views.
* ----------
*/
+extern PgStat_BackendAccesses *pgstat_fetch_exited_backend_buffers(void);
extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid);
extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 33fcaf5c9a..7e385135db 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -310,10 +310,10 @@ extern void ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *
/* freelist.c */
extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
- uint32 *buf_state);
+ uint32 *buf_state, bool *from_ring);
extern void StrategyFreeBuffer(BufferDesc *buf);
extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
- BufferDesc *buf);
+ BufferDesc *buf, bool *from_ring);
extern int StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
extern void StrategyNotifyBgWriter(int bgwprocno);
diff --git a/src/include/utils/backend_status.h b/src/include/utils/backend_status.h
index 8042b817df..c910cb6206 100644
--- a/src/include/utils/backend_status.h
+++ b/src/include/utils/backend_status.h
@@ -13,6 +13,7 @@
#include "datatype/timestamp.h"
#include "libpq/pqcomm.h"
#include "miscadmin.h" /* for BackendType */
+#include "port/atomics.h"
#include "utils/backend_progress.h"
@@ -37,6 +38,17 @@ typedef enum BackendState
* ----------
*/
+/*
+ * Structure for counting all types of buffer accesses for a live backend.
+ */
+typedef struct PgBufferAccesses
+{
+ pg_atomic_uint64 allocs;
+ pg_atomic_uint64 extends;
+ pg_atomic_uint64 fsyncs;
+ pg_atomic_uint64 writes;
+} PgBufferAccesses;
+
/*
* PgBackendSSLStatus
*
@@ -168,6 +180,7 @@ typedef struct PgBackendStatus
/* query identifier, optionally computed using post_parse_analyze_hook */
uint64 st_query_id;
+ PgBufferAccesses buffer_access_stats[BUFFER_NUM_TYPES];
} PgBackendStatus;
@@ -296,7 +309,39 @@ extern void pgstat_bestart(void);
extern void pgstat_clear_backend_activity_snapshot(void);
/* Activity reporting functions */
+typedef struct PgStat_MsgBufferTypeAccesses PgStat_MsgBufferTypeAccesses;
+
+static inline void
+pgstat_increment_buffer_access_type(BufferAccessType ba_type, BufferType buf_type)
+{
+ PgBufferAccesses *accesses;
+ PgBackendStatus *beentry = MyBEEntry;
+
+ Assert(beentry);
+
+ accesses = &beentry->buffer_access_stats[buf_type];
+ switch (ba_type)
+ {
+ case BA_Alloc:
+ pg_atomic_write_u64(&accesses->allocs,
+ pg_atomic_read_u64(&accesses->allocs) + 1);
+ break;
+ case BA_Extend:
+ pg_atomic_write_u64(&accesses->extends,
+ pg_atomic_read_u64(&accesses->extends) + 1);
+ break;
+ case BA_Fsync:
+ pg_atomic_write_u64(&accesses->fsyncs,
+ pg_atomic_read_u64(&accesses->fsyncs) + 1);
+ break;
+ case BA_Write:
+ pg_atomic_write_u64(&accesses->writes,
+ pg_atomic_read_u64(&accesses->writes) + 1);
+ break;
+ }
+}
extern void pgstat_report_activity(BackendState state, const char *cmd_str);
+extern void pgstat_report_live_backend_accesses(PgStat_MsgBufferTypeAccesses *backend_accesses);
extern void pgstat_report_query_id(uint64 query_id, bool force);
extern void pgstat_report_tempfile(size_t filesize);
extern void pgstat_report_appname(const char *appname);
@@ -312,6 +357,7 @@ extern uint64 pgstat_get_my_query_id(void);
* generate the pgstat* views.
* ----------
*/
+extern PgBackendStatus *pgstat_fetch_backend_statuses(void);
extern int pgstat_fetch_stat_numbackends(void);
extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
extern LocalPgBackendStatus *pgstat_fetch_stat_local_beentry(int beid);
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 2fa00a3c29..9172b0fcd2 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1828,6 +1828,14 @@ pg_stat_bgwriter| SELECT pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints
pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+pg_stat_buffers| SELECT b.backend_type,
+ b.buffer_type,
+ b.alloc,
+ b.extend,
+ b.fsync,
+ b.write,
+ b.stats_reset
+ FROM pg_stat_get_buffers_accesses() b(backend_type, buffer_type, alloc, extend, fsync, write, stats_reset);
pg_stat_database| SELECT d.oid AS datid,
d.datname,
CASE
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index feaaee6326..4ad672b35a 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -176,4 +176,8 @@ FROM prevstats AS pr;
DROP TABLE trunc_stats_test, trunc_stats_test1, trunc_stats_test2, trunc_stats_test3, trunc_stats_test4;
DROP TABLE prevstats;
+SELECT * FROM pg_stat_buffers;
+SELECT pg_stat_reset_shared('buffers');
+SELECT pg_sleep(2);
+SELECT * FROM pg_stat_buffers;
-- End of Stats Test
--
2.27.0
On Mon, Sep 27, 2021 at 2:58 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
On Fri, Sep 24, 2021 at 5:58 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
On Thu, Sep 23, 2021 at 5:05 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
The only remaining TODOs are described in the commit message. The
most critical one is that the reset message doesn't work.
v10 is attached with updated comments and some limited code refactoring.
v11 has fixed the oversize message issue by sending a reset message for
each backend type. Now, we will call GetCurrentTimestamp
BACKEND_NUM_TYPES times, so maybe I should add some kind of flag to the
reset message that indicates the first message so that all the "do once"
things can be done at that point.
I've also fixed a few style/cosmetic issues and updated the commit
message with a link to the thread [1] where I proposed smgrwrite() and
smgrextend() wrappers (which is where I propose to call
pgstat_increment_buffer_access_type() for unbuffered writes and
extends).
- Melanie
[1]: /messages/by-id/CAAKRu_aw72w70X1P=ba20K8iGUvSkyz7Yk03wPPh3f9WgmcJ3g@mail.gmail.com
Attachments:
v11-0003-Remove-superfluous-bgwriter-stats.patch
From b2023adb80f2920081a6f35c19e0276d38ae3a15 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 29 Sep 2021 15:44:51 -0400
Subject: [PATCH v11 3/3] Remove superfluous bgwriter stats
Remove stats from pg_stat_bgwriter which are now more clearly expressed
in pg_stat_buffers.
TODO:
- make pg_stat_checkpointer view and move relevant stats into it
- add additional stats to pg_stat_bgwriter
---
doc/src/sgml/monitoring.sgml | 47 ---------------------------
src/backend/catalog/system_views.sql | 6 +---
src/backend/postmaster/checkpointer.c | 26 ---------------
src/backend/postmaster/pgstat.c | 5 ---
src/backend/storage/buffer/bufmgr.c | 6 ----
src/backend/utils/adt/pgstatfuncs.c | 30 -----------------
src/include/catalog/pg_proc.dat | 22 -------------
src/include/pgstat.h | 10 ------
src/test/regress/expected/rules.out | 5 ---
9 files changed, 1 insertion(+), 156 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 75753c3339..5852c45246 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3416,24 +3416,6 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
</para></entry>
</row>
- <row>
- <entry role="catalog_table_entry"><para role="column_definition">
- <structfield>buffers_checkpoint</structfield> <type>bigint</type>
- </para>
- <para>
- Number of buffers written during checkpoints
- </para></entry>
- </row>
-
- <row>
- <entry role="catalog_table_entry"><para role="column_definition">
- <structfield>buffers_clean</structfield> <type>bigint</type>
- </para>
- <para>
- Number of buffers written by the background writer
- </para></entry>
- </row>
-
<row>
<entry role="catalog_table_entry"><para role="column_definition">
<structfield>maxwritten_clean</structfield> <type>bigint</type>
@@ -3444,35 +3426,6 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
</para></entry>
</row>
- <row>
- <entry role="catalog_table_entry"><para role="column_definition">
- <structfield>buffers_backend</structfield> <type>bigint</type>
- </para>
- <para>
- Number of buffers written directly by a backend
- </para></entry>
- </row>
-
- <row>
- <entry role="catalog_table_entry"><para role="column_definition">
- <structfield>buffers_backend_fsync</structfield> <type>bigint</type>
- </para>
- <para>
- Number of times a backend had to execute its own
- <function>fsync</function> call (normally the background writer handles those
- even when the backend does its own write)
- </para></entry>
- </row>
-
- <row>
- <entry role="catalog_table_entry"><para role="column_definition">
- <structfield>buffers_alloc</structfield> <type>bigint</type>
- </para>
- <para>
- Number of buffers allocated
- </para></entry>
- </row>
-
<row>
<entry role="catalog_table_entry"><para role="column_definition">
<structfield>stats_reset</structfield> <type>timestamp with time zone</type>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 30280d520b..c45c261f4b 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1058,18 +1058,14 @@ CREATE VIEW pg_stat_archiver AS
s.stats_reset
FROM pg_stat_get_archiver() s;
+-- TODO: make separate pg_stat_checkpointer view
CREATE VIEW pg_stat_bgwriter AS
SELECT
pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints_timed,
pg_stat_get_bgwriter_requested_checkpoints() AS checkpoints_req,
pg_stat_get_checkpoint_write_time() AS checkpoint_write_time,
pg_stat_get_checkpoint_sync_time() AS checkpoint_sync_time,
- pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint,
- pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean,
pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
- pg_stat_get_buf_written_backend() AS buffers_backend,
- pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
- pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
CREATE VIEW pg_stat_buffers AS
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index c0c4122fd5..829f52cc8f 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -90,17 +90,9 @@
* requesting backends since the last checkpoint start. The flags are
* chosen so that OR'ing is the correct way to combine multiple requests.
*
- * num_backend_writes is used to count the number of buffer writes performed
- * by user backend processes. This counter should be wide enough that it
- * can't overflow during a single processing cycle. num_backend_fsync
- * counts the subset of those writes that also had to do their own fsync,
- * because the checkpointer failed to absorb their request.
- *
* The requests array holds fsync requests sent by backends and not yet
* absorbed by the checkpointer.
*
- * Unlike the checkpoint fields, num_backend_writes, num_backend_fsync, and
- * the requests fields are protected by CheckpointerCommLock.
*----------
*/
typedef struct
@@ -124,9 +116,6 @@ typedef struct
ConditionVariable start_cv; /* signaled when ckpt_started advances */
ConditionVariable done_cv; /* signaled when ckpt_done advances */
- uint32 num_backend_writes; /* counts user backend buffer writes */
- uint32 num_backend_fsync; /* counts user backend fsync calls */
-
int num_requests; /* current # of requests */
int max_requests; /* allocated array size */
CheckpointerRequest requests[FLEXIBLE_ARRAY_MEMBER];
@@ -1085,10 +1074,6 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
- /* Count all backend writes regardless of if they fit in the queue */
- if (!AmBackgroundWriterProcess())
- CheckpointerShmem->num_backend_writes++;
-
/*
* If the checkpointer isn't running or the request queue is full, the
* backend will have to perform its own fsync request. But before forcing
@@ -1102,8 +1087,6 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
* Count the subset of writes where backends have to do their own
* fsync
*/
- if (!AmBackgroundWriterProcess())
- CheckpointerShmem->num_backend_fsync++;
pgstat_increment_buffer_access_type(BA_Fsync, Buf_Shared);
LWLockRelease(CheckpointerCommLock);
return false;
@@ -1261,15 +1244,6 @@ AbsorbSyncRequests(void)
LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
- /* Transfer stats counts into pending pgstats message */
- PendingCheckpointerStats.m_buf_written_backend
- += CheckpointerShmem->num_backend_writes;
- PendingCheckpointerStats.m_buf_fsync_backend
- += CheckpointerShmem->num_backend_fsync;
-
- CheckpointerShmem->num_backend_writes = 0;
- CheckpointerShmem->num_backend_fsync = 0;
-
/*
* We try to avoid holding the lock for a long time by copying the request
* array, and processing the requests after releasing the lock.
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 683be10430..9048154515 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -5601,9 +5601,7 @@ pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
static void
pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
{
- globalStats.bgwriter.buf_written_clean += msg->m_buf_written_clean;
globalStats.bgwriter.maxwritten_clean += msg->m_maxwritten_clean;
- globalStats.bgwriter.buf_alloc += msg->m_buf_alloc;
}
/* ----------
@@ -5619,9 +5617,6 @@ pgstat_recv_checkpointer(PgStat_MsgCheckpointer *msg, int len)
globalStats.checkpointer.requested_checkpoints += msg->m_requested_checkpoints;
globalStats.checkpointer.checkpoint_write_time += msg->m_checkpoint_write_time;
globalStats.checkpointer.checkpoint_sync_time += msg->m_checkpoint_sync_time;
- globalStats.checkpointer.buf_written_checkpoints += msg->m_buf_written_checkpoints;
- globalStats.checkpointer.buf_written_backend += msg->m_buf_written_backend;
- globalStats.checkpointer.buf_fsync_backend += msg->m_buf_fsync_backend;
}
static void
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 0bbb1e2458..72fb5de664 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2164,7 +2164,6 @@ BufferSync(int flags)
if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
{
TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
- PendingCheckpointerStats.m_buf_written_checkpoints++;
num_written++;
}
}
@@ -2273,9 +2272,6 @@ BgBufferSync(WritebackContext *wb_context)
*/
strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
- /* Report buffer alloc counts to pgstat */
- PendingBgWriterStats.m_buf_alloc += recent_alloc;
-
/*
* If we're not running the LRU scan, just stop after doing the stats
* stuff. We mark the saved state invalid so that we can recover sanely
@@ -2472,8 +2468,6 @@ BgBufferSync(WritebackContext *wb_context)
reusable_buffers++;
}
- PendingBgWriterStats.m_buf_written_clean += num_written;
-
#ifdef BGW_DEBUG
elog(DEBUG1, "bgwriter: recent_alloc=%u smoothed=%.2f delta=%ld ahead=%d density=%.2f reusable_est=%d upcoming_est=%d scanned=%d wrote=%d reusable=%d",
recent_alloc, smoothed_alloc, strategy_delta, bufs_ahead,
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 184968d99c..bf98f19db1 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1738,18 +1738,6 @@ pg_stat_get_bgwriter_requested_checkpoints(PG_FUNCTION_ARGS)
PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->requested_checkpoints);
}
-Datum
-pg_stat_get_bgwriter_buf_written_checkpoints(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_written_checkpoints);
-}
-
-Datum
-pg_stat_get_bgwriter_buf_written_clean(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_written_clean);
-}
-
Datum
pg_stat_get_bgwriter_maxwritten_clean(PG_FUNCTION_ARGS)
{
@@ -1778,24 +1766,6 @@ pg_stat_get_bgwriter_stat_reset_time(PG_FUNCTION_ARGS)
PG_RETURN_TIMESTAMPTZ(pgstat_fetch_stat_bgwriter()->stat_reset_timestamp);
}
-Datum
-pg_stat_get_buf_written_backend(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_written_backend);
-}
-
-Datum
-pg_stat_get_buf_fsync_backend(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_fsync_backend);
-}
-
-Datum
-pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
-}
-
Datum
pg_stat_get_buffers_accesses(PG_FUNCTION_ARGS)
{
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 54661e2b5f..02f624c18c 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5600,16 +5600,6 @@
proname => 'pg_stat_get_bgwriter_requested_checkpoints', provolatile => 's',
proparallel => 'r', prorettype => 'int8', proargtypes => '',
prosrc => 'pg_stat_get_bgwriter_requested_checkpoints' },
-{ oid => '2771',
- descr => 'statistics: number of buffers written by the bgwriter during checkpoints',
- proname => 'pg_stat_get_bgwriter_buf_written_checkpoints', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_bgwriter_buf_written_checkpoints' },
-{ oid => '2772',
- descr => 'statistics: number of buffers written by the bgwriter for cleaning dirty buffers',
- proname => 'pg_stat_get_bgwriter_buf_written_clean', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_bgwriter_buf_written_clean' },
{ oid => '2773',
descr => 'statistics: number of times the bgwriter stopped processing when it had written too many buffers while cleaning',
proname => 'pg_stat_get_bgwriter_maxwritten_clean', provolatile => 's',
@@ -5629,18 +5619,6 @@
proname => 'pg_stat_get_checkpoint_sync_time', provolatile => 's',
proparallel => 'r', prorettype => 'float8', proargtypes => '',
prosrc => 'pg_stat_get_checkpoint_sync_time' },
-{ oid => '2775', descr => 'statistics: number of buffers written by backends',
- proname => 'pg_stat_get_buf_written_backend', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_buf_written_backend' },
-{ oid => '3063',
- descr => 'statistics: number of backend buffer writes that did their own fsync',
- proname => 'pg_stat_get_buf_fsync_backend', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_buf_fsync_backend' },
-{ oid => '2859', descr => 'statistics: number of buffer allocations',
- proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
- prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
{ oid => '8459', descr => 'statistics: counts of various types of accesses of buffers done by each backend type',
proname => 'pg_stat_get_buffers_accesses', provolatile => 's', proisstrict => 'f',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index d3f51f1fd8..246ab05f76 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -499,9 +499,7 @@ typedef struct PgStat_MsgBgWriter
{
PgStat_MsgHdr m_hdr;
- PgStat_Counter m_buf_written_clean;
PgStat_Counter m_maxwritten_clean;
- PgStat_Counter m_buf_alloc;
} PgStat_MsgBgWriter;
/* ----------
@@ -514,9 +512,6 @@ typedef struct PgStat_MsgCheckpointer
PgStat_Counter m_timed_checkpoints;
PgStat_Counter m_requested_checkpoints;
- PgStat_Counter m_buf_written_checkpoints;
- PgStat_Counter m_buf_written_backend;
- PgStat_Counter m_buf_fsync_backend;
PgStat_Counter m_checkpoint_write_time; /* times in milliseconds */
PgStat_Counter m_checkpoint_sync_time;
} PgStat_MsgCheckpointer;
@@ -892,9 +887,7 @@ typedef struct PgStat_ArchiverStats
*/
typedef struct PgStat_BgWriterStats
{
- PgStat_Counter buf_written_clean;
PgStat_Counter maxwritten_clean;
- PgStat_Counter buf_alloc;
TimestampTz stat_reset_timestamp;
} PgStat_BgWriterStats;
@@ -908,9 +901,6 @@ typedef struct PgStat_CheckpointerStats
PgStat_Counter requested_checkpoints;
PgStat_Counter checkpoint_write_time; /* times in milliseconds */
PgStat_Counter checkpoint_sync_time;
- PgStat_Counter buf_written_checkpoints;
- PgStat_Counter buf_written_backend;
- PgStat_Counter buf_fsync_backend;
} PgStat_CheckpointerStats;
/*
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 9172b0fcd2..ac2f7cf61e 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1821,12 +1821,7 @@ pg_stat_bgwriter| SELECT pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints
pg_stat_get_bgwriter_requested_checkpoints() AS checkpoints_req,
pg_stat_get_checkpoint_write_time() AS checkpoint_write_time,
pg_stat_get_checkpoint_sync_time() AS checkpoint_sync_time,
- pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint,
- pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean,
pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
- pg_stat_get_buf_written_backend() AS buffers_backend,
- pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
- pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
pg_stat_buffers| SELECT b.backend_type,
b.buffer_type,
--
2.27.0
v11-0002-Add-system-view-tracking-accesses-to-buffers.patch
From 454b63a963a46a6a3541ed5627d985e09c640446 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 29 Sep 2021 15:39:45 -0400
Subject: [PATCH v11 2/3] Add system view tracking accesses to buffers
Add pg_stat_buffers, a system view which tracks the number of buffers of
a particular type (e.g. shared, local) allocated, written, fsync'd, and
extended by each backend type.
Some of these should always be zero. For example, a checkpointer backend
will not use a BufferAccessStrategy (currently), so buffer type
"strategy" for checkpointer will be 0 for all buffer access types
(alloc, write, fsync, and extend).
All backends increment a counter in their PgBackendStatus when
performing a buffer access. On exit, backends send these stats to the
stats collector to be persisted.
When stats are reset, the backend sending the reset message will loop
through and collect all of the live backends' buffer access counters,
sending a reset message for each backend type containing its buffer
access stats. When receiving this message, the stats collector will 1)
save these reset values in an array of "resets" and 2) zero out the
exited backends' saved buffer access counters. This is required for
accurate stats after a reset without writing to other backends'
PgBackendStatus.
When the pg_stat_buffers view is queried, sum live backends' stats with
saved stats from exited backends and subtract saved reset stats,
returning the total.
Each row of the view is for a particular backend type and a particular
buffer type (e.g. shared buffer accesses by checkpointer) and each
column in the view is the total number of buffers of each kind of buffer
access (e.g. written). So a cell in the view would be, for example, the
number of shared buffers written by checkpointer since the last stats
reset.
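In pseudo-C, the arithmetic for one cell reduces to the following sketch (view_cell is a made-up name for illustration, not a function in the patch):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Value reported for one (backend type, buffer type, access type) cell:
 * live backends' current counters, plus the totals saved from exited
 * backends, minus the snapshot saved at the last stats reset.
 */
static uint64_t
view_cell(uint64_t live, uint64_t exited, uint64_t reset_snapshot)
{
    return live + exited - reset_snapshot;
}
```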
Note that this commit does not add code to increment buffer accesses for
all types of buffers. It includes all possible combinations in the stats
view but doesn't populate all of them.
A separate proposed patch [1] which would add wrappers for smgrwrite()
and extend() would provide a good location to call
pgstat_increment_buffer_access_type() for unbuffered IO and avoid
regressions for future users of these functions.
TODO:
- Consider helper func in pgstatfuncs.c to refactor out some of the
redundant tuplestore creation code from pg_stat_get_progress_info,
pg_stat_get_activity, etc
- Remove pg_stats test I added
- When finished, catalog bump
- pgindent
[1] https://www.postgresql.org/message-id/CAAKRu_aw72w70X1P%3Dba20K8iGUvSkyz7Yk03wPPh3f9WgmcJ3g%40mail.gmail.com
---
doc/src/sgml/monitoring.sgml | 116 ++++++++++++++-
src/backend/catalog/system_views.sql | 11 ++
src/backend/postmaster/checkpointer.c | 1 +
src/backend/postmaster/pgstat.c | 149 +++++++++++++++++++-
src/backend/storage/buffer/bufmgr.c | 24 +++-
src/backend/storage/buffer/freelist.c | 24 +++-
src/backend/utils/activity/backend_status.c | 44 +++++-
src/backend/utils/adt/pgstatfuncs.c | 140 ++++++++++++++++++
src/backend/utils/init/miscinit.c | 19 +++
src/include/catalog/pg_proc.dat | 9 ++
src/include/miscadmin.h | 23 +++
src/include/pgstat.h | 48 +++++++
src/include/storage/buf_internals.h | 4 +-
src/include/utils/backend_status.h | 46 ++++++
src/test/regress/expected/rules.out | 8 ++
src/test/regress/sql/stats.sql | 4 +
16 files changed, 652 insertions(+), 18 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 2cd8920645..75753c3339 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -444,6 +444,15 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</entry>
</row>
+ <row>
+ <entry><structname>pg_stat_buffers</structname><indexterm><primary>pg_stat_buffers</primary></indexterm></entry>
+ <entry>A row for each buffer type for each backend type showing
+ statistics about backend buffer activity. See
+ <link linkend="monitoring-pg-stat-buffers-view">
+ <structname>pg_stat_buffers</structname></link> for details.
+ </entry>
+ </row>
+
<row>
<entry><structname>pg_stat_wal</structname><indexterm><primary>pg_stat_wal</primary></indexterm></entry>
<entry>One row only, showing statistics about WAL activity. See
@@ -3478,6 +3487,101 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
</sect2>
+ <sect2 id="monitoring-pg-stat-buffers-view">
+ <title><structname>pg_stat_buffers</structname></title>
+
+ <indexterm>
+ <primary>pg_stat_buffers</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_stat_buffers</structname> view has a row for each buffer
+ type for each backend type, containing global data for the cluster for that
+ backend and buffer type.
+ </para>
+
+ <table id="pg-stat-buffer-actions-view" xreflabel="pg_stat_buffers">
+ <title><structname>pg_stat_buffers</structname> View</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>backend_type</structfield> <type>text</type>
+ </para>
+ <para>
+ Type of backend (e.g. background worker, autovacuum worker).
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>buffer_type</structfield> <type>text</type>
+ </para>
+ <para>
+ Type of buffer accessed (e.g. shared).
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>alloc</structfield> <type>integer</type>
+ </para>
+ <para>
+ Number of buffers allocated.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>extend</structfield> <type>integer</type>
+ </para>
+ <para>
+ Number of buffers extended.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>fsync</structfield> <type>integer</type>
+ </para>
+ <para>
+ Number of buffers fsynced.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>write</structfield> <type>integer</type>
+ </para>
+ <para>
+ Number of buffers written.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
+ </para>
+ <para>
+ Time at which these statistics were last reset.
+ </para></entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ </sect2>
+
<sect2 id="monitoring-pg-stat-wal-view">
<title><structname>pg_stat_wal</structname></title>
@@ -5074,12 +5178,14 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
</para>
<para>
Resets some cluster-wide statistics counters to zero, depending on the
- argument. The argument can be <literal>bgwriter</literal> to reset
- all the counters shown in
- the <structname>pg_stat_bgwriter</structname>
+ argument. The argument can be <literal>bgwriter</literal> to reset all
+ the counters shown in the <structname>pg_stat_bgwriter</structname>
view, <literal>archiver</literal> to reset all the counters shown in
- the <structname>pg_stat_archiver</structname> view or <literal>wal</literal>
- to reset all the counters shown in the <structname>pg_stat_wal</structname> view.
+ the <structname>pg_stat_archiver</structname> view,
+ <literal>wal</literal> to reset all the counters shown in the
+ <structname>pg_stat_wal</structname> view, or
+ <literal>buffers</literal> to reset all the counters shown in the
+ <structname>pg_stat_buffers</structname> view.
</para>
<para>
This function is restricted to superusers by default, but other users
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 55f6e3711d..30280d520b 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1072,6 +1072,17 @@ CREATE VIEW pg_stat_bgwriter AS
pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+CREATE VIEW pg_stat_buffers AS
+SELECT
+ b.backend_type,
+ b.buffer_type,
+ b.alloc,
+ b.extend,
+ b.fsync,
+ b.write,
+ b.stats_reset
+FROM pg_stat_get_buffers_accesses() b;
+
CREATE VIEW pg_stat_wal AS
SELECT
w.wal_records,
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index be7366379d..c0c4122fd5 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -1104,6 +1104,7 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
*/
if (!AmBackgroundWriterProcess())
CheckpointerShmem->num_backend_fsync++;
+ pgstat_increment_buffer_access_type(BA_Fsync, Buf_Shared);
LWLockRelease(CheckpointerCommLock);
return false;
}
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index b7d0fbaefd..683be10430 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -124,9 +124,12 @@ char *pgstat_stat_filename = NULL;
char *pgstat_stat_tmpname = NULL;
/*
- * BgWriter and WAL global statistics counters.
- * Stored directly in a stats message structure so they can be sent
- * without needing to copy things around. We assume these init to zeroes.
+ * BgWriter, Checkpointer, WAL, and I/O global statistics counters. I/O global
+ * statistics on various buffer actions are tracked in PgBackendStatus while a
+ * backend is alive and then sent to stats collector before a backend exits in
+ * a PgStat_MsgBufferActions.
+ * All others are stored directly in a stats message structure so they can be
+ * sent without needing to copy things around. We assume these init to zeroes.
*/
PgStat_MsgBgWriter PendingBgWriterStats;
PgStat_MsgCheckpointer PendingCheckpointerStats;
@@ -362,6 +365,7 @@ static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
static void pgstat_recv_checkpointer(PgStat_MsgCheckpointer *msg, int len);
+static void pgstat_recv_buffer_type_accesses(PgStat_MsgBufferTypeAccesses *msg, int len);
static void pgstat_recv_wal(PgStat_MsgWal *msg, int len);
static void pgstat_recv_slru(PgStat_MsgSLRU *msg, int len);
static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
@@ -974,6 +978,7 @@ pgstat_report_stat(bool disconnect)
/* Now, send function statistics */
pgstat_send_funcstats();
+
/* Send WAL statistics */
pgstat_send_wal(true);
@@ -1452,6 +1457,8 @@ pgstat_reset_shared_counters(const char *target)
msg.m_resettarget = RESET_ARCHIVER;
else if (strcmp(target, "bgwriter") == 0)
msg.m_resettarget = RESET_BGWRITER;
+ else if (strcmp(target, "buffers") == 0)
+ msg.m_resettarget = RESET_BUFFERS;
else if (strcmp(target, "wal") == 0)
msg.m_resettarget = RESET_WAL;
else
@@ -1461,7 +1468,23 @@ pgstat_reset_shared_counters(const char *target)
errhint("Target must be \"archiver\", \"bgwriter\", or \"wal\".")));
pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
- pgstat_send(&msg, sizeof(msg));
+
+ if (msg.m_resettarget == RESET_BUFFERS)
+ {
+ int backend_type;
+ PgStat_MsgBufferTypeAccesses accesses[BACKEND_NUM_TYPES];
+ memset(accesses, 0, sizeof(accesses));
+ pgstat_report_live_backend_accesses(accesses);
+
+ for (backend_type = 1; backend_type < BACKEND_NUM_TYPES; backend_type++)
+ {
+ memcpy(&msg.backend_resets, &accesses[backend_type], sizeof(msg.backend_resets));
+ pgstat_send(&msg, sizeof(msg));
+ }
+ }
+ else
+ pgstat_send(&msg, sizeof(msg));
+
}
/* ----------
@@ -2760,6 +2783,20 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
rec->tuples_inserted + rec->tuples_updated;
}
+/*
+ *
+ * Support function for SQL-callable pgstat* functions. Returns a pointer to
+ * the structure tracking buffer access statistics for exited backends,
+ * along with the saved reset values.
+ */
+PgStat_BackendAccesses *
+pgstat_fetch_exited_backend_buffers(void)
+{
+ backend_read_statsfile();
+
+ return &globalStats.buffers;
+}
+
/* ----------
* pgstat_fetch_stat_dbentry() -
@@ -2998,6 +3035,13 @@ static void
pgstat_shutdown_hook(int code, Datum arg)
{
Assert(!pgstat_is_shutdown);
+ /*
+ * Only need to send stats on buffer accesses when a process exits, as
+ * pg_stat_get_buffers_accesses() will read from live backends' PgBackendStatus and
+ * then sum this with totals from exited backends persisted by the stats
+ * collector.
+ */
+ pgstat_send_buffers();
/*
* If we got as far as discovering our own database ID, we can report what
@@ -3092,6 +3136,29 @@ pgstat_send(void *msg, int len)
#endif
}
+/*
+ * Add live buffer access stats for all buffer types (e.g. shared, local) to
+ * those in the equivalent stats structure for exited backends. Note that this
+ * adds and doesn't set, so the destination buffer access stats should be
+ * zeroed out by the caller initially. This would commonly be used to transfer
+ * all buffer access stats for all buffer types for a particular backend type
+ * to the pgstats structure.
+ */
+void pgstat_add_buffer_type_accesses(PgStatBufferAccesses *dest, PgBufferAccesses *src, int buffer_num_types)
+{
+ int buffer_type;
+ for (buffer_type = 0; buffer_type < buffer_num_types; buffer_type++)
+ {
+ dest->allocs += pg_atomic_read_u64(&src->allocs);
+ dest->extends += pg_atomic_read_u64(&src->extends);
+ dest->fsyncs += pg_atomic_read_u64(&src->fsyncs);
+ dest->writes += pg_atomic_read_u64(&src->writes);
+ dest++;
+ src++;
+ }
+
+}
+
/* ----------
* pgstat_send_archiver() -
*
@@ -3148,6 +3215,32 @@ pgstat_send_bgwriter(void)
MemSet(&PendingBgWriterStats, 0, sizeof(PendingBgWriterStats));
}
+/*
+ * Before exiting, a backend sends its buffer access statistics to the
+ * collector so that they may be persisted
+ */
+void
+pgstat_send_buffers(void)
+{
+ PgStat_MsgBufferTypeAccesses msg;
+
+ PgBackendStatus *beentry = MyBEEntry;
+
+ if (!beentry)
+ return;
+
+ memset(&msg, 0, sizeof(msg));
+ msg.backend_type = beentry->st_backendType;
+
+ pgstat_add_buffer_type_accesses(msg.buffer_type_accesses,
+ (PgBufferAccesses *) &beentry->buffer_access_stats,
+ BUFFER_NUM_TYPES);
+
+ pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_BUFFER_ACTIONS);
+ pgstat_send(&msg, sizeof(msg));
+}
+
+
/* ----------
* pgstat_send_checkpointer() -
*
@@ -3522,6 +3615,10 @@ PgstatCollectorMain(int argc, char *argv[])
pgstat_recv_checkpointer(&msg.msg_checkpointer, len);
break;
+ case PGSTAT_MTYPE_BUFFER_ACTIONS:
+ pgstat_recv_buffer_type_accesses(&msg.msg_buffer_accesses, len);
+ break;
+
case PGSTAT_MTYPE_WAL:
pgstat_recv_wal(&msg.msg_wal, len);
break;
@@ -5222,9 +5319,24 @@ pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
if (msg->m_resettarget == RESET_BGWRITER)
{
/* Reset the global, bgwriter and checkpointer statistics for the cluster. */
- memset(&globalStats, 0, sizeof(globalStats));
+ memset(&globalStats.checkpointer, 0, sizeof(globalStats.checkpointer));
+ memset(&globalStats.bgwriter, 0, sizeof(globalStats.bgwriter));
globalStats.bgwriter.stat_reset_timestamp = GetCurrentTimestamp();
}
+ else if (msg->m_resettarget == RESET_BUFFERS)
+ {
+ BackendType backend_type = msg->backend_resets.backend_type;
+ // TODO: this only needs to be done once, but I am doing it for every backend_type
+ // should I add a way to tell if this is the first of the RESET_BUFFERS messages?
+ memset(&globalStats.buffers.accesses, 0, sizeof(globalStats.buffers.accesses));
+
+ memcpy(&globalStats.buffers.resets[backend_type],
+ &msg->backend_resets, sizeof(msg->backend_resets));
+ // TODO: likewise this should only be done once -- especially because
+ // it isn't cheap. Should I add something to the message to indicate it
+ // is the first message when it is of type RESET_BUFFERS?
+ globalStats.buffers.stat_reset_timestamp = GetCurrentTimestamp();
+ }
else if (msg->m_resettarget == RESET_ARCHIVER)
{
/* Reset the archiver statistics for the cluster. */
@@ -5512,6 +5624,33 @@ pgstat_recv_checkpointer(PgStat_MsgCheckpointer *msg, int len)
globalStats.checkpointer.buf_fsync_backend += msg->m_buf_fsync_backend;
}
+static void
+pgstat_recv_buffer_type_accesses(PgStat_MsgBufferTypeAccesses *msg, int len)
+{
+ int buffer_type;
+ PgStatBufferAccesses *src_buffer_accesses = msg->buffer_type_accesses;
+ PgStatBufferAccesses *dest_buffer_accesses = globalStats.buffers.accesses[msg->backend_type].buffer_type_accesses;
+
+ /*
+ * No users will likely need PgStat_MsgBufferTypeAccesses->backend_type
+ * when accessing it from globalStats since its place in the
+ * globalStats.buffers.accesses array indicates backend_type. However,
+ * leaving it undefined seemed like an invitation for unnecessary future
+ * bugs.
+ */
+ globalStats.buffers.accesses[msg->backend_type].backend_type = msg->backend_type;
+
+ for (buffer_type = 0; buffer_type < BUFFER_NUM_TYPES; buffer_type++)
+ {
+ PgStatBufferAccesses *src = &src_buffer_accesses[buffer_type];
+ PgStatBufferAccesses *dest = &dest_buffer_accesses[buffer_type];
+ dest->allocs += src->allocs;
+ dest->extends += src->extends;
+ dest->fsyncs += src->fsyncs;
+ dest->writes += src->writes;
+ }
+}
+
/* ----------
* pgstat_recv_wal() -
*
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index e88e4e918b..0bbb1e2458 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -972,6 +972,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (isExtend)
{
+ pgstat_increment_buffer_access_type(BA_Extend, Buf_Shared);
/* new buffers are zero-filled */
MemSet((char *) bufBlock, 0, BLCKSZ);
/* don't set checksum for all-zero page */
@@ -1172,6 +1173,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
/* Loop here in case we have to try another victim buffer */
for (;;)
{
+ bool from_ring;
/*
* Ensure, while the spinlock's not yet held, that there's a free
* refcount entry.
@@ -1182,7 +1184,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* Select a victim buffer. The buffer is returned with its header
* spinlock still held!
*/
- buf = StrategyGetBuffer(strategy, &buf_state);
+ buf = StrategyGetBuffer(strategy, &buf_state, &from_ring);
Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
@@ -1219,6 +1221,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf),
LW_SHARED))
{
+ BufferType buftype;
/*
* If using a nondefault strategy, and writing the buffer
* would require a WAL flush, let the strategy decide whether
@@ -1236,7 +1239,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
UnlockBufHdr(buf, buf_state);
if (XLogNeedsFlush(lsn) &&
- StrategyRejectBuffer(strategy, buf))
+ StrategyRejectBuffer(strategy, buf, &from_ring))
{
/* Drop lock/pin and loop around for another buffer */
LWLockRelease(BufferDescriptorGetContentLock(buf));
@@ -1245,6 +1248,21 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
}
}
+ /*
+ * When a strategy is in use, if the dirty buffer was selected
+ * from the strategy ring and we did not bother checking the
+ * freelist or doing a clock sweep to look for a clean shared
+ * buffer to use, the write will be counted as a strategy
+ * write. However, if the dirty buffer was obtained from the
+ * freelist or a clock sweep, it is counted as a regular write.
+ * When a strategy is not in use, at this point, the write can
+ * only be a "regular" write of a dirty buffer.
+ */
+
+ buftype = from_ring ? Buf_Strategy : Buf_Shared;
+ pgstat_increment_buffer_access_type(BA_Write, buftype);
+
+
/* OK, do the I/O */
TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(forkNum, blockNum,
smgr->smgr_rnode.node.spcNode,
@@ -2552,6 +2570,8 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
* Pin it, share-lock it, write it. (FlushBuffer will do nothing if the
* buffer is clean by the time we've locked it.)
*/
+
+ pgstat_increment_buffer_access_type(BA_Write, Buf_Shared);
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 6be80476db..386cad1a36 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -19,6 +19,7 @@
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/proc.h"
+#include "utils/backend_status.h"
#define INT_ACCESS_ONCE(var) ((int)(*((volatile int *)&(var))))
@@ -198,7 +199,7 @@ have_free_buffer(void)
* return the buffer with the buffer header spinlock still held.
*/
BufferDesc *
-StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
+StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_ring)
{
BufferDesc *buf;
int bgwprocno;
@@ -212,7 +213,8 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
if (strategy != NULL)
{
buf = GetBufferFromRing(strategy, buf_state);
- if (buf != NULL)
+ *from_ring = buf != NULL;
+ if (*from_ring)
return buf;
}
@@ -247,6 +249,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
* the rate of buffer consumption. Note that buffers recycled by a
* strategy object are intentionally not counted here.
*/
+ pgstat_increment_buffer_access_type(BA_Alloc, Buf_Shared);
pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);
/*
@@ -683,8 +686,14 @@ AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
* if this buffer should be written and re-used.
*/
bool
-StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
+StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool *from_ring)
{
+ /*
+ * If we decide to use the dirty buffer selected by StrategyGetBuffer, then
+ * ensure that we count it as such in the pg_stat_buffers view.
+ */
+ *from_ring = true;
+
/* We only do this in bulkread mode */
if (strategy->btype != BAS_BULKREAD)
return false;
@@ -700,5 +709,14 @@ StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
*/
strategy->buffers[strategy->current] = InvalidBuffer;
+ /*
+ * Since we will not be writing out a dirty buffer from the ring, set
+ * from_ring to false so that the caller does not count this write as a
+ * "strategy write" and can do proper bookkeeping for
+ * pg_stat_buffers.
+ */
+ *from_ring = false;
+
+
return true;
}
diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index 7229598822..0bfa22264f 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -279,7 +279,7 @@ pgstat_beinit(void)
* pgstat_bestart() -
*
* Initialize this backend's entry in the PgBackendStatus array.
- * Called from InitPostgres.
+ * Called from InitPostgres and AuxiliaryProcessMain
*
* Apart from auxiliary processes, MyBackendId, MyDatabaseId,
* session userid, and application_name must be set for a
@@ -293,6 +293,7 @@ pgstat_bestart(void)
{
volatile PgBackendStatus *vbeentry = MyBEEntry;
PgBackendStatus lbeentry;
+ int buffer_type;
#ifdef USE_SSL
PgBackendSSLStatus lsslstatus;
#endif
@@ -399,6 +400,14 @@ pgstat_bestart(void)
lbeentry.st_progress_command = PROGRESS_COMMAND_INVALID;
lbeentry.st_progress_command_target = InvalidOid;
lbeentry.st_query_id = UINT64CONST(0);
+ for (buffer_type = 0; buffer_type < BUFFER_NUM_TYPES; buffer_type++)
+ {
+ PgBufferAccesses *accesses = &lbeentry.buffer_access_stats[buffer_type];
+ pg_atomic_init_u64(&accesses->allocs, 0);
+ pg_atomic_init_u64(&accesses->extends, 0);
+ pg_atomic_init_u64(&accesses->fsyncs, 0);
+ pg_atomic_init_u64(&accesses->writes, 0);
+ }
/*
* we don't zero st_progress_param here to save cycles; nobody should
@@ -621,6 +630,33 @@ pgstat_report_activity(BackendState state, const char *cmd_str)
PGSTAT_END_WRITE_ACTIVITY(beentry);
}
+/*
+ * Iterate through BackendStatusArray and capture live backends' buffer access
+ * stats, adding them to that backend type's member of the backend_accesses
+ * structure.
+ */
+void pgstat_report_live_backend_accesses(PgStat_MsgBufferTypeAccesses *backend_accesses)
+{
+ int i;
+ PgBackendStatus *beentry = BackendStatusArray;
+ /*
+ * Loop through live backends and capture reset values
+ */
+ for (i = 0; i < MaxBackends + NUM_AUXPROCTYPES; i++)
+ {
+ beentry++;
+ /* Skip exited backends; their stats were already sent to the collector */
+ if (beentry->st_procpid == 0)
+ continue;
+
+ backend_accesses[beentry->st_backendType].backend_type = beentry->st_backendType;
+ pgstat_add_buffer_type_accesses(backend_accesses[beentry->st_backendType].buffer_type_accesses,
+ (PgBufferAccesses *) beentry->buffer_access_stats,
+ BUFFER_NUM_TYPES);
+
+ }
+}
+
/* --------
* pgstat_report_query_id() -
*
@@ -1046,6 +1082,12 @@ pgstat_get_my_query_id(void)
}
+PgBackendStatus *
+pgstat_fetch_backend_statuses(void)
+{
+ return BackendStatusArray;
+}
+
/* ----------
* pgstat_fetch_stat_beentry() -
*
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index ff5aedc99c..184968d99c 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1796,6 +1796,146 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
}
+Datum
+pg_stat_get_buffers_accesses(PG_FUNCTION_ARGS)
+{
+#define NROWS ((BACKEND_NUM_TYPES - 1) * BUFFER_NUM_TYPES)
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ TupleDesc tupdesc;
+ Tuplestorestate *tupstore;
+ MemoryContext per_query_ctx;
+ MemoryContext oldcontext;
+ PgStat_BackendAccesses *backend_accesses;
+ int buffer_type;
+ int backend_type;
+ Datum reset_time;
+ int i;
+ PgBackendStatus *beentry;
+
+ /*
+ * When adding a new column to the pg_stat_buffers view, add a new enum
+ * value here above COLUMN_LENGTH.
+ */
+ enum {
+ COLUMN_BACKEND_TYPE,
+ COLUMN_BUFFER_TYPE,
+ COLUMN_ALLOCS,
+ COLUMN_EXTENDS,
+ COLUMN_FSYNCS,
+ COLUMN_WRITES,
+ COLUMN_RESET_TIME,
+ COLUMN_LENGTH,
+ };
+
+ Datum all_values[NROWS][COLUMN_LENGTH];
+ bool all_nulls[NROWS][COLUMN_LENGTH];
+
+ /* check to see if caller supports us returning a tuplestore */
+ if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("set-valued function called in context that cannot accept a set")));
+ if (!(rsinfo->allowedModes & SFRM_Materialize))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("materialize mode required, but it is not allowed in this context")));
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+ oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+ tupstore = tuplestore_begin_heap(true, false, work_mem);
+ rsinfo->returnMode = SFRM_Materialize;
+ rsinfo->setResult = tupstore;
+ rsinfo->setDesc = tupdesc;
+
+ MemoryContextSwitchTo(oldcontext);
+
+ memset(all_values, 0, sizeof(all_values));
+ memset(all_nulls, 0, sizeof(all_nulls));
+
+ /*
+ * Loop through all live backends and count their buffer accesses for each
+ * buffer type
+ */
+ beentry = pgstat_fetch_backend_statuses();
+
+ for (i = 0; i < MaxBackends + NUM_AUXPROCTYPES; i++)
+ {
+ PgBufferAccesses *buffer_accesses;
+ beentry++;
+ /* Skip exited backends; their stats were already sent to the collector */
+ if (beentry->st_procpid == 0)
+ continue;
+
+ buffer_accesses = beentry->buffer_access_stats;
+
+ for (buffer_type = 0; buffer_type < BUFFER_NUM_TYPES; buffer_type++)
+ {
+ int rownum = (beentry->st_backendType - 1) * BUFFER_NUM_TYPES + buffer_type;
+ Datum *values = all_values[rownum];
+
+ /*
+ * COLUMN_RESET_TIME, COLUMN_BACKEND_TYPE, and COLUMN_BUFFER_TYPE
+ * will all be set when looping through exited backends array
+ */
+ values[COLUMN_ALLOCS] += pg_atomic_read_u64(&buffer_accesses->allocs);
+ values[COLUMN_EXTENDS] += pg_atomic_read_u64(&buffer_accesses->extends);
+ values[COLUMN_FSYNCS] += pg_atomic_read_u64(&buffer_accesses->fsyncs);
+ values[COLUMN_WRITES] += pg_atomic_read_u64(&buffer_accesses->writes);
+ buffer_accesses++;
+ }
+ }
+
+ /* Add stats from all exited backends */
+ backend_accesses = pgstat_fetch_exited_backend_buffers();
+
+ reset_time = TimestampTzGetDatum(backend_accesses->stat_reset_timestamp);
+
+ /* 0 is not a valid BackendType */
+ for (backend_type = 1; backend_type < BACKEND_NUM_TYPES; backend_type++)
+ {
+ PgStatBufferAccesses *buffer_accesses = backend_accesses->accesses[backend_type].buffer_type_accesses;
+ PgStatBufferAccesses *resets = backend_accesses->resets[backend_type].buffer_type_accesses;
+
+ Datum backend_type_desc = CStringGetTextDatum(GetBackendTypeDesc(backend_type));
+
+ for (buffer_type = 0; buffer_type < BUFFER_NUM_TYPES; buffer_type++)
+ {
+ /*
+ * Subtract 1 from backend_type to avoid having rows for B_INVALID
+ * BackendType
+ */
+ Datum *values = all_values[(backend_type - 1) * BUFFER_NUM_TYPES + buffer_type];
+
+ values[COLUMN_BACKEND_TYPE] = backend_type_desc;
+ values[COLUMN_BUFFER_TYPE] = CStringGetTextDatum(GetBufferTypeDesc(buffer_type));
+ values[COLUMN_ALLOCS] = values[COLUMN_ALLOCS] + buffer_accesses->allocs - resets->allocs;
+ values[COLUMN_EXTENDS] = values[COLUMN_EXTENDS] + buffer_accesses->extends - resets->extends;
+ values[COLUMN_FSYNCS] = values[COLUMN_FSYNCS] + buffer_accesses->fsyncs - resets->fsyncs;
+ values[COLUMN_WRITES] = values[COLUMN_WRITES] + buffer_accesses->writes - resets->writes;
+ values[COLUMN_RESET_TIME] = reset_time;
+ buffer_accesses++;
+ resets++;
+ }
+ }
+
+ for (i = 0; i < NROWS; i++)
+ {
+ Datum *values = all_values[i];
+ bool *nulls = all_nulls[i];
+ tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+ }
+
+ /* clean up and return the tuplestore */
+ tuplestore_donestoring(tupstore);
+
+ return (Datum) 0;
+}
+
/*
* Returns statistics of WAL activity
*/
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 88801374b5..0922bc2c81 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -299,6 +299,25 @@ GetBackendTypeDesc(BackendType backendType)
return backendDesc;
}
+const char *
+GetBufferTypeDesc(BufferType bufferType)
+{
+
+ switch (bufferType)
+ {
+ case Buf_Direct:
+ return "direct";
+ case Buf_Local:
+ return "local";
+ case Buf_Shared:
+ return "shared";
+ case Buf_Strategy:
+ return "strategy";
+ }
+ return "unknown buffer type";
+}
+
+
/* ----------------------------------------------------------------
* database path / name support stuff
* ----------------------------------------------------------------
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index d068d6532e..54661e2b5f 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5642,6 +5642,15 @@
proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
+{ oid => '8459', descr => 'statistics: counts of various types of accesses of buffers done by each backend type',
+ proname => 'pg_stat_get_buffers_accesses', provolatile => 's', proisstrict => 'f',
+ prorows => '52', proretset => 't',
+ proparallel => 'r', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,text,int8,int8,int8,int8,timestamptz}',
+ proargmodes => '{o,o,o,o,o,o,o}',
+ proargnames => '{backend_type,buffer_type,alloc,extend,fsync,write,stats_reset}',
+ prosrc => 'pg_stat_get_buffers_accesses' },
+
{ oid => '1136', descr => 'statistics: information about WAL activity',
proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
proparallel => 'r', prorettype => 'record', proargtypes => '',
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 90a3016065..266d32835c 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -338,9 +338,32 @@ typedef enum BackendType
B_LOGGER,
} BackendType;
+#define BACKEND_NUM_TYPES (B_LOGGER + 1)
+
+typedef enum BufferAccessType
+{
+ BA_Alloc,
+ BA_Extend,
+ BA_Fsync,
+ BA_Write,
+} BufferAccessType;
+
+#define BUFFER_ACCESS_NUM_TYPES (BA_Write + 1)
+
+typedef enum BufferType
+{
+ Buf_Direct,
+ Buf_Local,
+ Buf_Shared,
+ Buf_Strategy,
+} BufferType;
+
+#define BUFFER_NUM_TYPES (Buf_Strategy + 1)
+
extern BackendType MyBackendType;
extern const char *GetBackendTypeDesc(BackendType backendType);
+extern const char * GetBufferTypeDesc(BufferType bufferType);
extern void SetDatabasePath(const char *path);
extern void checkDataDir(void);
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index bcd3588ea2..d3f51f1fd8 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -72,6 +72,7 @@ typedef enum StatMsgType
PGSTAT_MTYPE_ARCHIVER,
PGSTAT_MTYPE_BGWRITER,
PGSTAT_MTYPE_CHECKPOINTER,
+ PGSTAT_MTYPE_BUFFER_ACTIONS,
PGSTAT_MTYPE_WAL,
PGSTAT_MTYPE_SLRU,
PGSTAT_MTYPE_FUNCSTAT,
@@ -138,6 +139,7 @@ typedef enum PgStat_Shared_Reset_Target
{
RESET_ARCHIVER,
RESET_BGWRITER,
+ RESET_BUFFERS,
RESET_WAL
} PgStat_Shared_Reset_Target;
@@ -331,6 +333,45 @@ typedef struct PgStat_MsgDropdb
} PgStat_MsgDropdb;
+/*
+ * Structure for counting all types of buffer accesses in the stats collector
+ * It has no message header, so, it must be used within a
+ * PgStat_MsgBufferTypeAccesses when being sent to the stats collector.
+ */
+typedef struct PgStatBufferAccesses
+{
+ PgStat_Counter allocs;
+ PgStat_Counter extends;
+ PgStat_Counter fsyncs;
+ PgStat_Counter writes;
+} PgStatBufferAccesses;
+
+/*
+ * Sent by a backend to the stats collector to report all buffer accesses of
+ * all types of buffers for a given type of a backend. This will happen when
+ * the backend exits or when stats are reset.
+ */
+typedef struct PgStat_MsgBufferTypeAccesses
+{
+ PgStat_MsgHdr m_hdr;
+
+ BackendType backend_type;
+ PgStatBufferAccesses buffer_type_accesses[BUFFER_NUM_TYPES];
+} PgStat_MsgBufferTypeAccesses;
+
+/*
+ * Structure used by stats collector to keep track of all types of exited
+ * backends' buffer accesses for all types of buffers as well as all stats from
+ * live backends at the time of stats reset. resets is populated using a reset
+ * message sent to the stats collector.
+ */
+typedef struct PgStat_BackendAccesses
+{
+ TimestampTz stat_reset_timestamp;
+ PgStat_MsgBufferTypeAccesses accesses[BACKEND_NUM_TYPES];
+ PgStat_MsgBufferTypeAccesses resets[BACKEND_NUM_TYPES];
+} PgStat_BackendAccesses;
+
/* ----------
* PgStat_MsgResetcounter Sent by the backend to tell the collector
* to reset counters
@@ -351,6 +392,7 @@ typedef struct PgStat_MsgResetsharedcounter
{
PgStat_MsgHdr m_hdr;
PgStat_Shared_Reset_Target m_resettarget;
+ PgStat_MsgBufferTypeAccesses backend_resets;
} PgStat_MsgResetsharedcounter;
/* ----------
@@ -703,6 +745,7 @@ typedef union PgStat_Msg
PgStat_MsgArchiver msg_archiver;
PgStat_MsgBgWriter msg_bgwriter;
PgStat_MsgCheckpointer msg_checkpointer;
+ PgStat_MsgBufferTypeAccesses msg_buffer_accesses;
PgStat_MsgWal msg_wal;
PgStat_MsgSLRU msg_slru;
PgStat_MsgFuncstat msg_funcstat;
@@ -879,6 +922,7 @@ typedef struct PgStat_GlobalStats
PgStat_CheckpointerStats checkpointer;
PgStat_BgWriterStats bgwriter;
+ PgStat_BackendAccesses buffers;
} PgStat_GlobalStats;
/*
@@ -1116,8 +1160,11 @@ extern void pgstat_twophase_postcommit(TransactionId xid, uint16 info,
extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
void *recdata, uint32 len);
+extern void pgstat_add_buffer_type_accesses(PgStatBufferAccesses *dest,
+ PgBufferAccesses *src, int buffer_num_types);
extern void pgstat_send_archiver(const char *xlog, bool failed);
extern void pgstat_send_bgwriter(void);
+extern void pgstat_send_buffers(void);
extern void pgstat_send_checkpointer(void);
extern void pgstat_send_wal(bool force);
@@ -1126,6 +1173,7 @@ extern void pgstat_send_wal(bool force);
* generate the pgstat* views.
* ----------
*/
+extern PgStat_BackendAccesses * pgstat_fetch_exited_backend_buffers(void);
extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid);
extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 33fcaf5c9a..7e385135db 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -310,10 +310,10 @@ extern void ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *
/* freelist.c */
extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
- uint32 *buf_state);
+ uint32 *buf_state, bool *from_ring);
extern void StrategyFreeBuffer(BufferDesc *buf);
extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
- BufferDesc *buf);
+ BufferDesc *buf, bool *from_ring);
extern int StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
extern void StrategyNotifyBgWriter(int bgwprocno);
diff --git a/src/include/utils/backend_status.h b/src/include/utils/backend_status.h
index 8042b817df..c910cb6206 100644
--- a/src/include/utils/backend_status.h
+++ b/src/include/utils/backend_status.h
@@ -13,6 +13,7 @@
#include "datatype/timestamp.h"
#include "libpq/pqcomm.h"
#include "miscadmin.h" /* for BackendType */
+#include "port/atomics.h"
#include "utils/backend_progress.h"
@@ -37,6 +38,17 @@ typedef enum BackendState
* ----------
*/
+/*
+ * Structure for counting all types of buffer accesses for a live backend.
+ */
+typedef struct PgBufferAccesses
+{
+ pg_atomic_uint64 allocs;
+ pg_atomic_uint64 extends;
+ pg_atomic_uint64 fsyncs;
+ pg_atomic_uint64 writes;
+} PgBufferAccesses;
+
/*
* PgBackendSSLStatus
*
@@ -168,6 +180,7 @@ typedef struct PgBackendStatus
/* query identifier, optionally computed using post_parse_analyze_hook */
uint64 st_query_id;
+ PgBufferAccesses buffer_access_stats[BUFFER_NUM_TYPES];
} PgBackendStatus;
@@ -296,7 +309,39 @@ extern void pgstat_bestart(void);
extern void pgstat_clear_backend_activity_snapshot(void);
/* Activity reporting functions */
+typedef struct PgStat_MsgBufferTypeAccesses PgStat_MsgBufferTypeAccesses;
+
+static inline void
+pgstat_increment_buffer_access_type(BufferAccessType ba_type, BufferType buf_type)
+{
+ PgBufferAccesses *accesses;
+ PgBackendStatus *beentry = MyBEEntry;
+
+ Assert(beentry);
+
+ accesses = &beentry->buffer_access_stats[buf_type];
+ switch (ba_type)
+ {
+ case BA_Alloc:
+ pg_atomic_write_u64(&accesses->allocs,
+ pg_atomic_read_u64(&accesses->allocs) + 1);
+ break;
+ case BA_Extend:
+ pg_atomic_write_u64(&accesses->extends,
+ pg_atomic_read_u64(&accesses->extends) + 1);
+ break;
+ case BA_Fsync:
+ pg_atomic_write_u64(&accesses->fsyncs,
+ pg_atomic_read_u64(&accesses->fsyncs) + 1);
+ break;
+ case BA_Write:
+ pg_atomic_write_u64(&accesses->writes,
+ pg_atomic_read_u64(&accesses->writes) + 1);
+ break;
+ }
+}
extern void pgstat_report_activity(BackendState state, const char *cmd_str);
+extern void pgstat_report_live_backend_accesses(PgStat_MsgBufferTypeAccesses *backend_accesses);
extern void pgstat_report_query_id(uint64 query_id, bool force);
extern void pgstat_report_tempfile(size_t filesize);
extern void pgstat_report_appname(const char *appname);
@@ -312,6 +357,7 @@ extern uint64 pgstat_get_my_query_id(void);
* generate the pgstat* views.
* ----------
*/
+extern PgBackendStatus *pgstat_fetch_backend_statuses(void);
extern int pgstat_fetch_stat_numbackends(void);
extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
extern LocalPgBackendStatus *pgstat_fetch_stat_local_beentry(int beid);
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 2fa00a3c29..9172b0fcd2 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1828,6 +1828,14 @@ pg_stat_bgwriter| SELECT pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints
pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+pg_stat_buffers| SELECT b.backend_type,
+ b.buffer_type,
+ b.alloc,
+ b.extend,
+ b.fsync,
+ b.write,
+ b.stats_reset
+ FROM pg_stat_get_buffers_accesses() b(backend_type, buffer_type, alloc, extend, fsync, write, stats_reset);
pg_stat_database| SELECT d.oid AS datid,
d.datname,
CASE
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index feaaee6326..4ad672b35a 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -176,4 +176,8 @@ FROM prevstats AS pr;
DROP TABLE trunc_stats_test, trunc_stats_test1, trunc_stats_test2, trunc_stats_test3, trunc_stats_test4;
DROP TABLE prevstats;
+SELECT * FROM pg_stat_buffers;
+SELECT pg_stat_reset_shared('buffers');
+SELECT pg_sleep(2);
+SELECT * FROM pg_stat_buffers;
-- End of Stats Test
--
2.27.0
Attachment: v11-0001-Allow-bootstrap-process-to-beinit.patch (text/x-patch)
From 1b27d12692bce68d5c7acb57d5f3debcf54cbbae Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Fri, 24 Sep 2021 17:39:12 -0400
Subject: [PATCH v11 1/3] Allow bootstrap process to beinit
---
src/backend/utils/init/postinit.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 78bc64671e..fba5864172 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -670,8 +670,7 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
EnablePortalManager();
/* Initialize status reporting */
- if (!bootstrap)
- pgstat_beinit();
+ pgstat_beinit();
/*
* Load relcache entries for the shared system catalogs. This must create
--
2.27.0
On Wed, Sep 29, 2021 at 4:46 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> On Mon, Sep 27, 2021 at 2:58 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
>> On Fri, Sep 24, 2021 at 5:58 PM Melanie Plageman
>> <melanieplageman@gmail.com> wrote:
>>> On Thu, Sep 23, 2021 at 5:05 PM Melanie Plageman
>>> <melanieplageman@gmail.com> wrote:
>>>> The only remaining TODOs are described in the commit message. The
>>>> most critical one is that the reset message doesn't work.
>>>
>>> v10 is attached with updated comments and some limited code
>>> refactoring.
>>
>> v11 has fixed the oversize message issue by sending a reset message for
>> each backend type. Now, we will call GetCurrentTimestamp
>> BACKEND_NUM_TYPES times, so maybe I should add some kind of flag to the
>> reset message that indicates the first message so that all the "do once"
>> things can be done at that point.
>
> I've also fixed a few style/cosmetic issues and updated the commit
> message with a link to the thread [1] where I proposed smgrwrite() and
> smgrextend() wrappers (which is where I propose to call
> pgstat_increment_buffer_access_type() for unbuffered writes and
> extends).
>
> - Melanie
>
> [1] /messages/by-id/CAAKRu_aw72w70X1P=ba20K8iGUvSkyz7Yk03wPPh3f9WgmcJ3g@mail.gmail.com
v12 (attached) has various style and code clarity updates (it is
pgindented as well). I also added a new commit which creates a utility
function in pgstatfuncs.c to build a tuplestore for the views that need one.
Having received some offlist feedback about the names BufferAccessType
and BufferType being confusing, I am planning to rename these variables
and all of the associated functions. I agree that BufferType and
BufferAccessType are confusing for the following reasons:
- They sound similar.
- They aren't very precise.
- One of the "buffer types" does not actually use a Postgres buffer.
So far, the proposed alternative is IO_Op or IOOp for BufferAccessType
and IOPath for BufferType.
- Melanie
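To make the proposed shape concrete, here is a rough Python model of
counters indexed by backend type, IO path, and IO op, using the IOOp/IOPath
naming floated above. The specific enum members and backend-type strings are
illustrative only, not taken from the patch:

```python
from enum import Enum, auto

class IOPath(Enum):          # proposed rename of BufferType
    SHARED = auto()
    LOCAL = auto()
    STRATEGY = auto()

class IOOp(Enum):            # proposed rename of BufferAccessType
    ALLOC = auto()
    EXTEND = auto()
    FSYNC = auto()
    WRITE = auto()

# Illustrative subset of backend types
BACKEND_TYPES = ["client backend", "autovacuum worker",
                 "background writer", "checkpointer"]

# counters[backend_type][io_path][io_op]: plain counter arrays
# indexed by backend type, as suggested upthread
counters = {b: {p: {o: 0 for o in IOOp} for p in IOPath}
            for b in BACKEND_TYPES}

def count_io(backend_type, io_path, io_op):
    """Model of a backend bumping its own counter on an IO."""
    counters[backend_type][io_path][io_op] += 1

count_io("client backend", IOPath.SHARED, IOOp.WRITE)
print(counters["client backend"][IOPath.SHARED][IOOp.WRITE])  # prints 1
```

Rows that can never be nonzero (e.g. checkpointer on the strategy path) simply
stay at zero, matching the "should always be zero" cells the view would expose.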
Attachments:
Attachment: v12-0004-Remove-superfluous-bgwriter-stats.patch (text/x-patch)
From 5c3f382ba4eef310fc82b2b676029097eb99cd70 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 29 Sep 2021 15:44:51 -0400
Subject: [PATCH v12 4/4] Remove superfluous bgwriter stats
Remove stats from pg_stat_bgwriter which are now more clearly expressed
in pg_stat_buffers.
TODO:
- make pg_stat_checkpointer view and move relevant stats into it
- add additional stats to pg_stat_bgwriter
---
doc/src/sgml/monitoring.sgml | 47 ---------------------------
src/backend/catalog/system_views.sql | 6 +---
src/backend/postmaster/checkpointer.c | 26 ---------------
src/backend/postmaster/pgstat.c | 5 ---
src/backend/storage/buffer/bufmgr.c | 6 ----
src/backend/utils/adt/pgstatfuncs.c | 30 -----------------
src/include/catalog/pg_proc.dat | 22 -------------
src/include/pgstat.h | 10 ------
src/test/regress/expected/rules.out | 5 ---
9 files changed, 1 insertion(+), 156 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 75753c3339..5852c45246 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3416,24 +3416,6 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
</para></entry>
</row>
- <row>
- <entry role="catalog_table_entry"><para role="column_definition">
- <structfield>buffers_checkpoint</structfield> <type>bigint</type>
- </para>
- <para>
- Number of buffers written during checkpoints
- </para></entry>
- </row>
-
- <row>
- <entry role="catalog_table_entry"><para role="column_definition">
- <structfield>buffers_clean</structfield> <type>bigint</type>
- </para>
- <para>
- Number of buffers written by the background writer
- </para></entry>
- </row>
-
<row>
<entry role="catalog_table_entry"><para role="column_definition">
<structfield>maxwritten_clean</structfield> <type>bigint</type>
@@ -3444,35 +3426,6 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
</para></entry>
</row>
- <row>
- <entry role="catalog_table_entry"><para role="column_definition">
- <structfield>buffers_backend</structfield> <type>bigint</type>
- </para>
- <para>
- Number of buffers written directly by a backend
- </para></entry>
- </row>
-
- <row>
- <entry role="catalog_table_entry"><para role="column_definition">
- <structfield>buffers_backend_fsync</structfield> <type>bigint</type>
- </para>
- <para>
- Number of times a backend had to execute its own
- <function>fsync</function> call (normally the background writer handles those
- even when the backend does its own write)
- </para></entry>
- </row>
-
- <row>
- <entry role="catalog_table_entry"><para role="column_definition">
- <structfield>buffers_alloc</structfield> <type>bigint</type>
- </para>
- <para>
- Number of buffers allocated
- </para></entry>
- </row>
-
<row>
<entry role="catalog_table_entry"><para role="column_definition">
<structfield>stats_reset</structfield> <type>timestamp with time zone</type>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 30280d520b..c45c261f4b 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1058,18 +1058,14 @@ CREATE VIEW pg_stat_archiver AS
s.stats_reset
FROM pg_stat_get_archiver() s;
+-- TODO: make separate pg_stat_checkpointer view
CREATE VIEW pg_stat_bgwriter AS
SELECT
pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints_timed,
pg_stat_get_bgwriter_requested_checkpoints() AS checkpoints_req,
pg_stat_get_checkpoint_write_time() AS checkpoint_write_time,
pg_stat_get_checkpoint_sync_time() AS checkpoint_sync_time,
- pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint,
- pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean,
pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
- pg_stat_get_buf_written_backend() AS buffers_backend,
- pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
- pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
CREATE VIEW pg_stat_buffers AS
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 931bdcaa59..712c878f1c 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -90,17 +90,9 @@
* requesting backends since the last checkpoint start. The flags are
* chosen so that OR'ing is the correct way to combine multiple requests.
*
- * num_backend_writes is used to count the number of buffer writes performed
- * by user backend processes. This counter should be wide enough that it
- * can't overflow during a single processing cycle. num_backend_fsync
- * counts the subset of those writes that also had to do their own fsync,
- * because the checkpointer failed to absorb their request.
- *
* The requests array holds fsync requests sent by backends and not yet
* absorbed by the checkpointer.
*
- * Unlike the checkpoint fields, num_backend_writes, num_backend_fsync, and
- * the requests fields are protected by CheckpointerCommLock.
*----------
*/
typedef struct
@@ -124,9 +116,6 @@ typedef struct
ConditionVariable start_cv; /* signaled when ckpt_started advances */
ConditionVariable done_cv; /* signaled when ckpt_done advances */
- uint32 num_backend_writes; /* counts user backend buffer writes */
- uint32 num_backend_fsync; /* counts user backend fsync calls */
-
int num_requests; /* current # of requests */
int max_requests; /* allocated array size */
CheckpointerRequest requests[FLEXIBLE_ARRAY_MEMBER];
@@ -1085,10 +1074,6 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
- /* Count all backend writes regardless of if they fit in the queue */
- if (!AmBackgroundWriterProcess())
- CheckpointerShmem->num_backend_writes++;
-
/*
* If the checkpointer isn't running or the request queue is full, the
* backend will have to perform its own fsync request. But before forcing
@@ -1102,8 +1087,6 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
* Count the subset of writes where backends have to do their own
* fsync
*/
- if (!AmBackgroundWriterProcess())
- CheckpointerShmem->num_backend_fsync++;
pgstat_inc_buffer_access_type(BA_Fsync, Buf_Shared);
LWLockRelease(CheckpointerCommLock);
return false;
@@ -1261,15 +1244,6 @@ AbsorbSyncRequests(void)
LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
- /* Transfer stats counts into pending pgstats message */
- PendingCheckpointerStats.m_buf_written_backend
- += CheckpointerShmem->num_backend_writes;
- PendingCheckpointerStats.m_buf_fsync_backend
- += CheckpointerShmem->num_backend_fsync;
-
- CheckpointerShmem->num_backend_writes = 0;
- CheckpointerShmem->num_backend_fsync = 0;
-
/*
* We try to avoid holding the lock for a long time by copying the request
* array, and processing the requests after releasing the lock.
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 3673b34f50..f85e508689 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -5611,9 +5611,7 @@ pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
static void
pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
{
- globalStats.bgwriter.buf_written_clean += msg->m_buf_written_clean;
globalStats.bgwriter.maxwritten_clean += msg->m_maxwritten_clean;
- globalStats.bgwriter.buf_alloc += msg->m_buf_alloc;
}
/* ----------
@@ -5629,9 +5627,6 @@ pgstat_recv_checkpointer(PgStat_MsgCheckpointer *msg, int len)
globalStats.checkpointer.requested_checkpoints += msg->m_requested_checkpoints;
globalStats.checkpointer.checkpoint_write_time += msg->m_checkpoint_write_time;
globalStats.checkpointer.checkpoint_sync_time += msg->m_checkpoint_sync_time;
- globalStats.checkpointer.buf_written_checkpoints += msg->m_buf_written_checkpoints;
- globalStats.checkpointer.buf_written_backend += msg->m_buf_written_backend;
- globalStats.checkpointer.buf_fsync_backend += msg->m_buf_fsync_backend;
}
static void
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 58bf60425b..f69dbe38b8 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2166,7 +2166,6 @@ BufferSync(int flags)
if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
{
TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
- PendingCheckpointerStats.m_buf_written_checkpoints++;
num_written++;
}
}
@@ -2275,9 +2274,6 @@ BgBufferSync(WritebackContext *wb_context)
*/
strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
- /* Report buffer alloc counts to pgstat */
- PendingBgWriterStats.m_buf_alloc += recent_alloc;
-
/*
* If we're not running the LRU scan, just stop after doing the stats
* stuff. We mark the saved state invalid so that we can recover sanely
@@ -2474,8 +2470,6 @@ BgBufferSync(WritebackContext *wb_context)
reusable_buffers++;
}
- PendingBgWriterStats.m_buf_written_clean += num_written;
-
#ifdef BGW_DEBUG
elog(DEBUG1, "bgwriter: recent_alloc=%u smoothed=%.2f delta=%ld ahead=%d density=%.2f reusable_est=%d upcoming_est=%d scanned=%d wrote=%d reusable=%d",
recent_alloc, smoothed_alloc, strategy_delta, bufs_ahead,
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 674a2167ec..cf2998514e 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1723,18 +1723,6 @@ pg_stat_get_bgwriter_requested_checkpoints(PG_FUNCTION_ARGS)
PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->requested_checkpoints);
}
-Datum
-pg_stat_get_bgwriter_buf_written_checkpoints(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_written_checkpoints);
-}
-
-Datum
-pg_stat_get_bgwriter_buf_written_clean(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_written_clean);
-}
-
Datum
pg_stat_get_bgwriter_maxwritten_clean(PG_FUNCTION_ARGS)
{
@@ -1763,24 +1751,6 @@ pg_stat_get_bgwriter_stat_reset_time(PG_FUNCTION_ARGS)
PG_RETURN_TIMESTAMPTZ(pgstat_fetch_stat_bgwriter()->stat_reset_timestamp);
}
-Datum
-pg_stat_get_buf_written_backend(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_written_backend);
-}
-
-Datum
-pg_stat_get_buf_fsync_backend(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_fsync_backend);
-}
-
-Datum
-pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
-}
-
Datum
pg_stat_get_buffers_accesses(PG_FUNCTION_ARGS)
{
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 54661e2b5f..02f624c18c 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5600,16 +5600,6 @@
proname => 'pg_stat_get_bgwriter_requested_checkpoints', provolatile => 's',
proparallel => 'r', prorettype => 'int8', proargtypes => '',
prosrc => 'pg_stat_get_bgwriter_requested_checkpoints' },
-{ oid => '2771',
- descr => 'statistics: number of buffers written by the bgwriter during checkpoints',
- proname => 'pg_stat_get_bgwriter_buf_written_checkpoints', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_bgwriter_buf_written_checkpoints' },
-{ oid => '2772',
- descr => 'statistics: number of buffers written by the bgwriter for cleaning dirty buffers',
- proname => 'pg_stat_get_bgwriter_buf_written_clean', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_bgwriter_buf_written_clean' },
{ oid => '2773',
descr => 'statistics: number of times the bgwriter stopped processing when it had written too many buffers while cleaning',
proname => 'pg_stat_get_bgwriter_maxwritten_clean', provolatile => 's',
@@ -5629,18 +5619,6 @@
proname => 'pg_stat_get_checkpoint_sync_time', provolatile => 's',
proparallel => 'r', prorettype => 'float8', proargtypes => '',
prosrc => 'pg_stat_get_checkpoint_sync_time' },
-{ oid => '2775', descr => 'statistics: number of buffers written by backends',
- proname => 'pg_stat_get_buf_written_backend', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_buf_written_backend' },
-{ oid => '3063',
- descr => 'statistics: number of backend buffer writes that did their own fsync',
- proname => 'pg_stat_get_buf_fsync_backend', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_buf_fsync_backend' },
-{ oid => '2859', descr => 'statistics: number of buffer allocations',
- proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
- prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
{ oid => '8459', descr => 'statistics: counts of various types of accesses of buffers done by each backend type',
proname => 'pg_stat_get_buffers_accesses', provolatile => 's', proisstrict => 'f',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 2e3dfcc01d..02a9fff64e 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -505,9 +505,7 @@ typedef struct PgStat_MsgBgWriter
{
PgStat_MsgHdr m_hdr;
- PgStat_Counter m_buf_written_clean;
PgStat_Counter m_maxwritten_clean;
- PgStat_Counter m_buf_alloc;
} PgStat_MsgBgWriter;
/* ----------
@@ -520,9 +518,6 @@ typedef struct PgStat_MsgCheckpointer
PgStat_Counter m_timed_checkpoints;
PgStat_Counter m_requested_checkpoints;
- PgStat_Counter m_buf_written_checkpoints;
- PgStat_Counter m_buf_written_backend;
- PgStat_Counter m_buf_fsync_backend;
PgStat_Counter m_checkpoint_write_time; /* times in milliseconds */
PgStat_Counter m_checkpoint_sync_time;
} PgStat_MsgCheckpointer;
@@ -898,9 +893,7 @@ typedef struct PgStat_ArchiverStats
*/
typedef struct PgStat_BgWriterStats
{
- PgStat_Counter buf_written_clean;
PgStat_Counter maxwritten_clean;
- PgStat_Counter buf_alloc;
TimestampTz stat_reset_timestamp;
} PgStat_BgWriterStats;
@@ -914,9 +907,6 @@ typedef struct PgStat_CheckpointerStats
PgStat_Counter requested_checkpoints;
PgStat_Counter checkpoint_write_time; /* times in milliseconds */
PgStat_Counter checkpoint_sync_time;
- PgStat_Counter buf_written_checkpoints;
- PgStat_Counter buf_written_backend;
- PgStat_Counter buf_fsync_backend;
} PgStat_CheckpointerStats;
/*
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 9172b0fcd2..ac2f7cf61e 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1821,12 +1821,7 @@ pg_stat_bgwriter| SELECT pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints
pg_stat_get_bgwriter_requested_checkpoints() AS checkpoints_req,
pg_stat_get_checkpoint_write_time() AS checkpoint_write_time,
pg_stat_get_checkpoint_sync_time() AS checkpoint_sync_time,
- pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint,
- pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean,
pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
- pg_stat_get_buf_written_backend() AS buffers_backend,
- pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
- pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
pg_stat_buffers| SELECT b.backend_type,
b.buffer_type,
--
2.27.0
Attachment: v12-0002-Add-utility-to-make-tuplestores-for-pg-stat-view.patch (text/x-patch)
From a709ddb30b2b747beb214f0b13cd1e1816094e6b Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Thu, 30 Sep 2021 16:16:22 -0400
Subject: [PATCH v12 2/4] Add utility to make tuplestores for pg stat views
Most of the steps to make a tuplestore for those pg_stat views requiring
one are the same. Consolidate them into a single helper function for
clarity and to avoid bugs.
---
src/backend/utils/adt/pgstatfuncs.c | 129 ++++++++++------------------
1 file changed, 44 insertions(+), 85 deletions(-)
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index ff5aedc99c..513f5aecf6 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -36,6 +36,42 @@
#define HAS_PGSTAT_PERMISSIONS(role) (is_member_of_role(GetUserId(), ROLE_PG_READ_ALL_STATS) || has_privs_of_role(GetUserId(), role))
+/*
+ * Helper function for views with multiple rows constructed from a tuplestore
+ */
+static Tuplestorestate *
+pg_stat_make_tuplestore(FunctionCallInfo fcinfo, TupleDesc *tupdesc)
+{
+ Tuplestorestate *tupstore;
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ MemoryContext per_query_ctx;
+ MemoryContext oldcontext;
+
+ /* check to see if caller supports us returning a tuplestore */
+ if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("set-valued function called in context that cannot accept a set")));
+ if (!(rsinfo->allowedModes & SFRM_Materialize))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("materialize mode required, but it is not allowed in this context")));
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+ oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+ tupstore = tuplestore_begin_heap(true, false, work_mem);
+ rsinfo->returnMode = SFRM_Materialize;
+ rsinfo->setResult = tupstore;
+ rsinfo->setDesc = *tupdesc;
+ MemoryContextSwitchTo(oldcontext);
+ return tupstore;
+}
+
Datum
pg_stat_get_numscans(PG_FUNCTION_ARGS)
{
@@ -457,29 +493,13 @@ Datum
pg_stat_get_progress_info(PG_FUNCTION_ARGS)
{
#define PG_STAT_GET_PROGRESS_COLS PGSTAT_NUM_PROGRESS_PARAM + 3
- int num_backends = pgstat_fetch_stat_numbackends();
int curr_backend;
- char *cmd = text_to_cstring(PG_GETARG_TEXT_PP(0));
ProgressCommandType cmdtype;
TupleDesc tupdesc;
- Tuplestorestate *tupstore;
- ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
- MemoryContext per_query_ctx;
- MemoryContext oldcontext;
-
- /* check to see if caller supports us returning a tuplestore */
- if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
- ereport(ERROR,
- (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("set-valued function called in context that cannot accept a set")));
- if (!(rsinfo->allowedModes & SFRM_Materialize))
- ereport(ERROR,
- (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("materialize mode required, but it is not allowed in this context")));
- /* Build a tuple descriptor for our result type */
- if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
- elog(ERROR, "return type must be a row type");
+ Tuplestorestate *tupstore = pg_stat_make_tuplestore(fcinfo, &tupdesc);
+ int num_backends = pgstat_fetch_stat_numbackends();
+ char *cmd = text_to_cstring(PG_GETARG_TEXT_PP(0));
/* Translate command name into command type code. */
if (pg_strcasecmp(cmd, "VACUUM") == 0)
@@ -499,15 +519,6 @@ pg_stat_get_progress_info(PG_FUNCTION_ARGS)
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("invalid command name: \"%s\"", cmd)));
- per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
- oldcontext = MemoryContextSwitchTo(per_query_ctx);
-
- tupstore = tuplestore_begin_heap(true, false, work_mem);
- rsinfo->returnMode = SFRM_Materialize;
- rsinfo->setResult = tupstore;
- rsinfo->setDesc = tupdesc;
- MemoryContextSwitchTo(oldcontext);
-
/* 1-based index */
for (curr_backend = 1; curr_backend <= num_backends; curr_backend++)
{
@@ -568,38 +579,12 @@ Datum
pg_stat_get_activity(PG_FUNCTION_ARGS)
{
#define PG_STAT_GET_ACTIVITY_COLS 30
- int num_backends = pgstat_fetch_stat_numbackends();
- int curr_backend;
- int pid = PG_ARGISNULL(0) ? -1 : PG_GETARG_INT32(0);
- ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
TupleDesc tupdesc;
- Tuplestorestate *tupstore;
- MemoryContext per_query_ctx;
- MemoryContext oldcontext;
-
- /* check to see if caller supports us returning a tuplestore */
- if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
- ereport(ERROR,
- (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("set-valued function called in context that cannot accept a set")));
- if (!(rsinfo->allowedModes & SFRM_Materialize))
- ereport(ERROR,
- (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("materialize mode required, but it is not allowed in this context")));
-
- /* Build a tuple descriptor for our result type */
- if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
- elog(ERROR, "return type must be a row type");
-
- per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
- oldcontext = MemoryContextSwitchTo(per_query_ctx);
-
- tupstore = tuplestore_begin_heap(true, false, work_mem);
- rsinfo->returnMode = SFRM_Materialize;
- rsinfo->setResult = tupstore;
- rsinfo->setDesc = tupdesc;
+ int curr_backend;
- MemoryContextSwitchTo(oldcontext);
+ int num_backends = pgstat_fetch_stat_numbackends();
+ int pid = PG_ARGISNULL(0) ? -1 : PG_GETARG_INT32(0);
+ Tuplestorestate *tupstore = pg_stat_make_tuplestore(fcinfo, &tupdesc);
/* 1-based index */
for (curr_backend = 1; curr_backend <= num_backends; curr_backend++)
@@ -1871,37 +1856,11 @@ Datum
pg_stat_get_slru(PG_FUNCTION_ARGS)
{
#define PG_STAT_GET_SLRU_COLS 9
- ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
TupleDesc tupdesc;
- Tuplestorestate *tupstore;
- MemoryContext per_query_ctx;
- MemoryContext oldcontext;
int i;
PgStat_SLRUStats *stats;
- /* check to see if caller supports us returning a tuplestore */
- if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
- ereport(ERROR,
- (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("set-valued function called in context that cannot accept a set")));
- if (!(rsinfo->allowedModes & SFRM_Materialize))
- ereport(ERROR,
- (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("materialize mode required, but it is not allowed in this context")));
-
- /* Build a tuple descriptor for our result type */
- if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
- elog(ERROR, "return type must be a row type");
-
- per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
- oldcontext = MemoryContextSwitchTo(per_query_ctx);
-
- tupstore = tuplestore_begin_heap(true, false, work_mem);
- rsinfo->returnMode = SFRM_Materialize;
- rsinfo->setResult = tupstore;
- rsinfo->setDesc = tupdesc;
-
- MemoryContextSwitchTo(oldcontext);
+ Tuplestorestate *tupstore = pg_stat_make_tuplestore(fcinfo, &tupdesc);
/* request SLRU stats from the stat collector */
stats = pgstat_fetch_slru();
--
2.27.0
Attachment: v12-0001-Allow-bootstrap-process-to-beinit.patch (text/x-patch)
From 40c809ad1127322f3462e85be080c10534485f0d Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Fri, 24 Sep 2021 17:39:12 -0400
Subject: [PATCH v12 1/4] Allow bootstrap process to beinit
---
src/backend/utils/init/postinit.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 78bc64671e..fba5864172 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -670,8 +670,7 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
EnablePortalManager();
/* Initialize status reporting */
- if (!bootstrap)
- pgstat_beinit();
+ pgstat_beinit();
/*
* Load relcache entries for the shared system catalogs. This must create
--
2.27.0
Attachment: v12-0003-Add-system-view-tracking-accesses-to-buffers.patch (text/x-patch; charset=US-ASCII)
From 397777ca5d1512a233d3f0ba8954b0a32421ad4f Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 29 Sep 2021 15:39:45 -0400
Subject: [PATCH v12 3/4] Add system view tracking accesses to buffers
Add pg_stat_buffers, a system view which tracks the number of buffers of
a particular type (e.g. shared, local) allocated, written, fsync'd, and
extended by each backend type.
Some of these counters will always be zero. For example, the checkpointer
does not currently use a BufferAccessStrategy, so the "strategy" buffer type
will be 0 for checkpointer across all buffer access types (alloc, write,
fsync, and extend).
All backends increment a counter in their PgBackendStatus when
performing a buffer access. On exit, backends send these stats to the
stats collector to be persisted.
When stats are reset, the backend sending the reset message will loop
through and collect all of the live backends' buffer access counters,
sending a reset message for each backend type containing its buffer
access stats. When receiving this message, the stats collector will 1)
save these reset values in an array of "resets" and 2) zero out the
exited backends' saved buffer access counters. This is required for
accurate stats after a reset without writing to other backends'
PgBackendStatus.
When the pg_stat_buffers view is queried, live backends' stats are summed
with the saved stats from exited backends, and the saved reset stats are
subtracted, yielding the totals since the last reset.
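In other words, each cell the view reports is computed from three sources. A
minimal C sketch of that arithmetic (the struct and function names here are
illustrative, not the patch's actual types):

```c
#include <stdint.h>

/* Illustrative model of a single counter cell, e.g. shared-buffer writes
 * by client backends. Names are hypothetical, not the patch's structs. */
typedef struct counter_model
{
	uint64_t	live;	/* summed from live backends' PgBackendStatus */
	uint64_t	exited;	/* persisted by the stats collector at backend exit */
	uint64_t	reset;	/* live values snapshotted at the last stats reset */
} counter_model;

/* What a pg_stat_buffers cell shows: activity since the last reset. */
static inline uint64_t
view_total(const counter_model *c)
{
	return c->live + c->exited - c->reset;
}
```

Since a reset zeroes the exited-backends totals in the collector and records
the then-current live values as "resets", this subtraction is what keeps the
view accurate without ever writing to other backends' PgBackendStatus.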
Each row of the view is for a particular backend type and a particular
buffer type (e.g. shared buffer accesses by checkpointer) and each
column in the view is the total number of buffers of each kind of buffer
access (e.g. written). So a cell in the view would be, for example, the
number of shared buffers written by checkpointer since the last stats
reset.
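Because the view is one row per (backend type, buffer type) pair, the result
set can be modeled as a flattened 2-D array. A hedged sketch of the index
arithmetic, with assumed sizes (13 valid backend types, 4 buffer types --
consistent with the prorows => '52' estimate later in the patch, but
placeholders rather than the real enum values):

```c
/* Assumed sizes for illustration only; in the patch these come from the
 * BackendType and BufferType enums (backend type 0, B_INVALID, gets no rows). */
#define N_BACKEND_TYPES	14	/* including the invalid type 0 */
#define N_BUFFER_TYPES	4
#define N_ROWS ((N_BACKEND_TYPES - 1) * N_BUFFER_TYPES)	/* 52 rows */

/* Map a (backend_type, buffer_type) pair to its row, skipping type 0. */
static inline int
view_rownum(int backend_type, int buffer_type)
{
	return (backend_type - 1) * N_BUFFER_TYPES + buffer_type;
}
```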
Note that this commit does not add code to increment buffer accesses for
all types of buffers. It includes all possible combinations in the stats
view but doesn't populate all of them.
A separate proposed patch [1] which would add wrappers for smgrwrite()
and extend() would provide a good location to call
pgstat_increment_buffer_access_type() for unbuffered IO and avoid
regressions for future users of these functions.
TODO:
- Remove pg_stats test I added
- When finished, catalog bump
[1] https://www.postgresql.org/message-id/CAAKRu_aw72w70X1P%3Dba20K8iGUvSkyz7Yk03wPPh3f9WgmcJ3g%40mail.gmail.com
Discussion: https://www.postgresql.org/message-id/flat/20210415235954.qcypb4urtovzkat5%40alap3.anarazel.de#724d5cce4bcb587f9167b80a5824bc5c
---
doc/src/sgml/monitoring.sgml | 116 ++++++++++++++-
src/backend/catalog/system_views.sql | 11 ++
src/backend/postmaster/checkpointer.c | 1 +
src/backend/postmaster/pgstat.c | 153 +++++++++++++++++++-
src/backend/storage/buffer/bufmgr.c | 26 +++-
src/backend/storage/buffer/freelist.c | 23 ++-
src/backend/utils/activity/backend_status.c | 64 +++++++-
src/backend/utils/adt/pgstatfuncs.c | 116 +++++++++++++++
src/include/catalog/pg_proc.dat | 9 ++
src/include/miscadmin.h | 2 +
src/include/pgstat.h | 54 +++++++
src/include/storage/buf_internals.h | 4 +-
src/include/utils/backend_status.h | 75 ++++++++++
src/test/regress/expected/rules.out | 8 +
src/test/regress/sql/stats.sql | 4 +
15 files changed, 647 insertions(+), 19 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 2cd8920645..75753c3339 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -444,6 +444,15 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</entry>
</row>
+ <row>
+ <entry><structname>pg_stat_buffers</structname><indexterm><primary>pg_stat_buffers</primary></indexterm></entry>
+ <entry>A row for each buffer type for each backend type showing
+ statistics about backend buffer activity. See
+ <link linkend="monitoring-pg-stat-buffers-view">
+ <structname>pg_stat_buffers</structname></link> for details.
+ </entry>
+ </row>
+
<row>
<entry><structname>pg_stat_wal</structname><indexterm><primary>pg_stat_wal</primary></indexterm></entry>
<entry>One row only, showing statistics about WAL activity. See
@@ -3478,6 +3487,101 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
</sect2>
+ <sect2 id="monitoring-pg-stat-buffers-view">
+ <title><structname>pg_stat_buffers</structname></title>
+
+ <indexterm>
+ <primary>pg_stat_buffers</primary>
+ </indexterm>
+
+ <para>
+  The <structname>pg_stat_buffers</structname> view has one row for each
+  combination of backend type and buffer type, containing cluster-wide
+  statistics for that combination.
+ </para>
+
+ <table id="pg-stat-buffer-actions-view" xreflabel="pg_stat_buffers">
+ <title><structname>pg_stat_buffers</structname> View</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>backend_type</structfield> <type>text</type>
+ </para>
+ <para>
+ Type of backend (e.g. background worker, autovacuum worker).
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>buffer_type</structfield> <type>text</type>
+ </para>
+ <para>
+ Type of buffer accessed (e.g. shared).
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>alloc</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers allocated.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>extend</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers extended.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>fsync</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers fsynced.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>write</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers written.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
+ </para>
+ <para>
+ Time at which these statistics were last reset.
+ </para></entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ </sect2>
+
<sect2 id="monitoring-pg-stat-wal-view">
<title><structname>pg_stat_wal</structname></title>
@@ -5074,12 +5178,14 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
</para>
<para>
Resets some cluster-wide statistics counters to zero, depending on the
- argument. The argument can be <literal>bgwriter</literal> to reset
- all the counters shown in
- the <structname>pg_stat_bgwriter</structname>
+ argument. The argument can be <literal>bgwriter</literal> to reset all
+ the counters shown in the <structname>pg_stat_bgwriter</structname>
view, <literal>archiver</literal> to reset all the counters shown in
- the <structname>pg_stat_archiver</structname> view or <literal>wal</literal>
- to reset all the counters shown in the <structname>pg_stat_wal</structname> view.
+ the <structname>pg_stat_archiver</structname> view,
+ <literal>wal</literal> to reset all the counters shown in the
+ <structname>pg_stat_wal</structname> view, or
+ <literal>buffers</literal> to reset all the counters shown in the
+ <structname>pg_stat_buffers</structname> view.
</para>
<para>
This function is restricted to superusers by default, but other users
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 55f6e3711d..30280d520b 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1072,6 +1072,17 @@ CREATE VIEW pg_stat_bgwriter AS
pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+CREATE VIEW pg_stat_buffers AS
+SELECT
+ b.backend_type,
+ b.buffer_type,
+ b.alloc,
+ b.extend,
+ b.fsync,
+ b.write,
+ b.stats_reset
+FROM pg_stat_get_buffers_accesses() b;
+
CREATE VIEW pg_stat_wal AS
SELECT
w.wal_records,
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index be7366379d..931bdcaa59 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -1104,6 +1104,7 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
*/
if (!AmBackgroundWriterProcess())
CheckpointerShmem->num_backend_fsync++;
+ pgstat_inc_buffer_access_type(BA_Fsync, Buf_Shared);
LWLockRelease(CheckpointerCommLock);
return false;
}
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index b7d0fbaefd..3673b34f50 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -124,9 +124,12 @@ char *pgstat_stat_filename = NULL;
char *pgstat_stat_tmpname = NULL;
/*
- * BgWriter and WAL global statistics counters.
- * Stored directly in a stats message structure so they can be sent
- * without needing to copy things around. We assume these init to zeroes.
+ * BgWriter, Checkpointer, WAL, and I/O global statistics counters. I/O global
+ * statistics on various buffer actions are tracked in PgBackendStatus while a
+ * backend is alive and then sent to stats collector before a backend exits in
+ * a PgStat_MsgBufferTypeAccesses.
+ * All others are stored directly in a stats message structure so they can be
+ * sent without needing to copy things around. We assume these init to zeroes.
*/
PgStat_MsgBgWriter PendingBgWriterStats;
PgStat_MsgCheckpointer PendingCheckpointerStats;
@@ -362,6 +365,7 @@ static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
static void pgstat_recv_checkpointer(PgStat_MsgCheckpointer *msg, int len);
+static void pgstat_recv_buffer_type_accesses(PgStat_MsgBufferTypeAccesses *msg, int len);
static void pgstat_recv_wal(PgStat_MsgWal *msg, int len);
static void pgstat_recv_slru(PgStat_MsgSLRU *msg, int len);
static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
@@ -974,6 +978,7 @@ pgstat_report_stat(bool disconnect)
/* Now, send function statistics */
pgstat_send_funcstats();
+
/* Send WAL statistics */
pgstat_send_wal(true);
@@ -1452,6 +1457,8 @@ pgstat_reset_shared_counters(const char *target)
msg.m_resettarget = RESET_ARCHIVER;
else if (strcmp(target, "bgwriter") == 0)
msg.m_resettarget = RESET_BGWRITER;
+ else if (strcmp(target, "buffers") == 0)
+ msg.m_resettarget = RESET_BUFFERS;
else if (strcmp(target, "wal") == 0)
msg.m_resettarget = RESET_WAL;
else
@@ -1461,7 +1468,25 @@ pgstat_reset_shared_counters(const char *target)
errhint("Target must be \"archiver\", \"bgwriter\", or \"wal\".")));
pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
- pgstat_send(&msg, sizeof(msg));
+
+ if (msg.m_resettarget == RESET_BUFFERS)
+ {
+ int backend_type;
+ PgStatBufferTypeAccesses accesses[BACKEND_NUM_TYPES];
+
+ memset(accesses, 0, sizeof(accesses));
+ pgstat_report_live_backend_accesses(accesses);
+
+ for (backend_type = 1; backend_type < BACKEND_NUM_TYPES; backend_type++)
+ {
+ msg.m_backend_resets.backend_type = backend_type;
+ memcpy(&msg.m_backend_resets.bta, &accesses[backend_type], sizeof(msg.m_backend_resets.bta));
+ pgstat_send(&msg, sizeof(msg));
+ }
+ }
+ else
+ pgstat_send(&msg, sizeof(msg));
}
/* ----------
@@ -2760,6 +2785,20 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
rec->tuples_inserted + rec->tuples_updated;
}
+/*
+ * pgstat_fetch_exited_backend_buffers() -
+ *
+ * Support function for SQL-callable pgstat* functions. Returns a pointer to
+ * the PgStat_BackendAccesses structure tracking buffer access statistics for
+ * exited backends along with the saved reset values.
+ */
+PgStat_BackendAccesses *
+pgstat_fetch_exited_backend_buffers(void)
+{
+ backend_read_statsfile();
+
+ return &globalStats.buffers;
+}
+
/* ----------
* pgstat_fetch_stat_dbentry() -
@@ -2999,6 +3038,14 @@ pgstat_shutdown_hook(int code, Datum arg)
{
Assert(!pgstat_is_shutdown);
+ /*
+ * Only need to send stats on buffer accesses when a process exits, as
+ * pg_stat_get_buffers_accesses() will read from live backends'
+ * PgBackendStatus and then sum this with totals from exited backends
+ * persisted by the stats collector.
+ */
+ pgstat_send_buffers();
+
/*
* If we got as far as discovering our own database ID, we can report what
* we did to the collector. Otherwise, we'd be sending an invalid
@@ -3092,6 +3139,31 @@ pgstat_send(void *msg, int len)
#endif
}
+/*
+ * Add live buffer access stats for all buffer types (e.g. shared, local) to
+ * those in the equivalent stats structure for exited backends. Note that this
+ * adds and doesn't set, so the destination buffer access stats should be
+ * zeroed out by the caller initially. This would commonly be used to transfer
+ * all buffer access stats for all buffer types for a particular backend type
+ * to the pgstats structure.
+ */
+void
+pgstat_add_buffer_type_accesses(PgStatBufferAccesses *dest, PgBufferAccesses *src, int buffer_num_types)
+{
+ int buffer_type;
+
+ for (buffer_type = 0; buffer_type < buffer_num_types; buffer_type++)
+ {
+ dest->allocs += pg_atomic_read_u64(&src->allocs);
+ dest->extends += pg_atomic_read_u64(&src->extends);
+ dest->fsyncs += pg_atomic_read_u64(&src->fsyncs);
+ dest->writes += pg_atomic_read_u64(&src->writes);
+ dest++;
+ src++;
+ }
+}
+
/* ----------
* pgstat_send_archiver() -
*
@@ -3148,6 +3220,32 @@ pgstat_send_bgwriter(void)
MemSet(&PendingBgWriterStats, 0, sizeof(PendingBgWriterStats));
}
+/*
+ * Before exiting, a backend sends its buffer access statistics to the stats
+ * collector so that they may be persisted.
+ */
+void
+pgstat_send_buffers(void)
+{
+ PgStat_MsgBufferTypeAccesses msg;
+
+ PgBackendStatus *beentry = MyBEEntry;
+
+ if (!beentry)
+ return;
+
+ memset(&msg, 0, sizeof(msg));
+ msg.backend_type = beentry->st_backendType;
+
+ pgstat_add_buffer_type_accesses(msg.bta.bt_accesses,
+ (PgBufferAccesses *) &beentry->buffer_access_stats,
+ BUFFER_NUM_TYPES);
+
+ pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_BUFFER_ACTIONS);
+ pgstat_send(&msg, sizeof(msg));
+}
+
/* ----------
* pgstat_send_checkpointer() -
*
@@ -3522,6 +3620,10 @@ PgstatCollectorMain(int argc, char *argv[])
pgstat_recv_checkpointer(&msg.msg_checkpointer, len);
break;
+ case PGSTAT_MTYPE_BUFFER_ACTIONS:
+ pgstat_recv_buffer_type_accesses(&msg.msg_buffer_accesses, len);
+ break;
+
case PGSTAT_MTYPE_WAL:
pgstat_recv_wal(&msg.msg_wal, len);
break;
@@ -5221,10 +5323,30 @@ pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
{
if (msg->m_resettarget == RESET_BGWRITER)
{
- /* Reset the global, bgwriter and checkpointer statistics for the cluster. */
- memset(&globalStats, 0, sizeof(globalStats));
+ /*
+ * Reset the global, bgwriter and checkpointer statistics for the
+ * cluster.
+ */
+ memset(&globalStats.checkpointer, 0, sizeof(globalStats.checkpointer));
+ memset(&globalStats.bgwriter, 0, sizeof(globalStats.bgwriter));
globalStats.bgwriter.stat_reset_timestamp = GetCurrentTimestamp();
}
+ else if (msg->m_resettarget == RESET_BUFFERS)
+ {
+ BackendType backend_type = msg->m_backend_resets.backend_type;
+
+ /*
+	 * Though globalStats.buffers only needs to be reset once, doing so
+	 * for every message is less brittle, and the extra cost is negligible
+	 * given how infrequently stats are reset.
+ */
+ memset(&globalStats.buffers.accesses, 0, sizeof(globalStats.buffers.accesses));
+ globalStats.buffers.stat_reset_timestamp = GetCurrentTimestamp();
+
+ memcpy(&globalStats.buffers.resets[backend_type],
+ &msg->m_backend_resets.bta.bt_accesses, sizeof(msg->m_backend_resets.bta.bt_accesses));
+	}
else if (msg->m_resettarget == RESET_ARCHIVER)
{
/* Reset the archiver statistics for the cluster. */
@@ -5512,6 +5634,25 @@ pgstat_recv_checkpointer(PgStat_MsgCheckpointer *msg, int len)
globalStats.checkpointer.buf_fsync_backend += msg->m_buf_fsync_backend;
}
+static void
+pgstat_recv_buffer_type_accesses(PgStat_MsgBufferTypeAccesses *msg, int len)
+{
+ int buffer_type;
+ PgStatBufferAccesses *src_buffer_accesses = msg->bta.bt_accesses;
+ PgStatBufferAccesses *dest_buffer_accesses = globalStats.buffers.accesses[msg->backend_type].bt_accesses;
+
+ for (buffer_type = 0; buffer_type < BUFFER_NUM_TYPES; buffer_type++)
+ {
+ PgStatBufferAccesses *src = &src_buffer_accesses[buffer_type];
+ PgStatBufferAccesses *dest = &dest_buffer_accesses[buffer_type];
+
+ dest->allocs += src->allocs;
+ dest->extends += src->extends;
+ dest->fsyncs += src->fsyncs;
+ dest->writes += src->writes;
+ }
+}
+
/* ----------
* pgstat_recv_wal() -
*
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index e88e4e918b..58bf60425b 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -972,6 +972,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (isExtend)
{
+ pgstat_inc_buffer_access_type(BA_Extend, Buf_Shared);
/* new buffers are zero-filled */
MemSet((char *) bufBlock, 0, BLCKSZ);
/* don't set checksum for all-zero page */
@@ -1172,6 +1173,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
/* Loop here in case we have to try another victim buffer */
for (;;)
{
+ bool from_ring;
+
/*
* Ensure, while the spinlock's not yet held, that there's a free
* refcount entry.
@@ -1182,7 +1185,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* Select a victim buffer. The buffer is returned with its header
* spinlock still held!
*/
- buf = StrategyGetBuffer(strategy, &buf_state);
+ buf = StrategyGetBuffer(strategy, &buf_state, &from_ring);
Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
@@ -1219,6 +1222,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf),
LW_SHARED))
{
+ BufferType buftype;
+
/*
* If using a nondefault strategy, and writing the buffer
* would require a WAL flush, let the strategy decide whether
@@ -1236,7 +1241,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
UnlockBufHdr(buf, buf_state);
if (XLogNeedsFlush(lsn) &&
- StrategyRejectBuffer(strategy, buf))
+ StrategyRejectBuffer(strategy, buf, &from_ring))
{
/* Drop lock/pin and loop around for another buffer */
LWLockRelease(BufferDescriptorGetContentLock(buf));
@@ -1245,6 +1250,21 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
}
}
+ /*
+ * When a strategy is in use, if the dirty buffer was selected
+ * from the strategy ring and we did not bother checking the
+ * freelist or doing a clock sweep to look for a clean shared
+ * buffer to use, the write will be counted as a strategy
+ * write. However, if the dirty buffer was obtained from the
+ * freelist or a clock sweep, it is counted as a regular
+ * write. When a strategy is not in use, at this point, the
+ * write can only be a "regular" write of a dirty buffer.
+ */
+
+ buftype = from_ring ? Buf_Strategy : Buf_Shared;
+ pgstat_inc_buffer_access_type(BA_Write, buftype);
+
/* OK, do the I/O */
TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(forkNum, blockNum,
smgr->smgr_rnode.node.spcNode,
@@ -2552,6 +2572,8 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
* Pin it, share-lock it, write it. (FlushBuffer will do nothing if the
* buffer is clean by the time we've locked it.)
*/
+
+ pgstat_inc_buffer_access_type(BA_Write, Buf_Shared);
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 6be80476db..574965212b 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -19,6 +19,7 @@
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/proc.h"
+#include "utils/backend_status.h"
#define INT_ACCESS_ONCE(var) ((int)(*((volatile int *)&(var))))
@@ -198,7 +199,7 @@ have_free_buffer(void)
* return the buffer with the buffer header spinlock still held.
*/
BufferDesc *
-StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
+StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_ring)
{
BufferDesc *buf;
int bgwprocno;
@@ -212,7 +213,8 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
if (strategy != NULL)
{
buf = GetBufferFromRing(strategy, buf_state);
- if (buf != NULL)
+ *from_ring = buf != NULL;
+ if (*from_ring)
return buf;
}
@@ -247,6 +249,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
* the rate of buffer consumption. Note that buffers recycled by a
* strategy object are intentionally not counted here.
*/
+ pgstat_inc_buffer_access_type(BA_Alloc, Buf_Shared);
pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);
/*
@@ -683,8 +686,14 @@ AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
* if this buffer should be written and re-used.
*/
bool
-StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
+StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool *from_ring)
{
+ /*
+ * If we decide to use the dirty buffer selected by StrategyGetBuffer(),
+ * then ensure that we count it as such in pg_stat_buffers view.
+ */
+ *from_ring = true;
+
/* We only do this in bulkread mode */
if (strategy->btype != BAS_BULKREAD)
return false;
@@ -700,5 +709,13 @@ StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
*/
strategy->buffers[strategy->current] = InvalidBuffer;
+ /*
+ * Since we will not be writing out a dirty buffer from the ring, set
+ * from_ring to false so that the caller does not count this write as a
+ * "strategy write" and can do proper bookkeeping for pg_stat_buffers.
+ */
+ *from_ring = false;
+
return true;
}
diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index 7229598822..d02326423a 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -236,6 +236,24 @@ CreateSharedBackendStatus(void)
#endif
}
+const char *
+GetBufferTypeDesc(BufferType bufferType)
+{
+ switch (bufferType)
+ {
+ case Buf_Direct:
+ return "direct";
+ case Buf_Local:
+ return "local";
+ case Buf_Shared:
+ return "shared";
+ case Buf_Strategy:
+ return "strategy";
+ }
+ return "unknown buffer type";
+}
+
/*
* Initialize pgstats backend activity state, and set up our on-proc-exit
* hook. Called from InitPostgres and AuxiliaryProcessMain. For auxiliary
@@ -279,7 +297,7 @@ pgstat_beinit(void)
* pgstat_bestart() -
*
* Initialize this backend's entry in the PgBackendStatus array.
- * Called from InitPostgres.
+ * Called from InitPostgres and AuxiliaryProcessMain.
*
* Apart from auxiliary processes, MyBackendId, MyDatabaseId,
* session userid, and application_name must be set for a
@@ -293,6 +311,7 @@ pgstat_bestart(void)
{
volatile PgBackendStatus *vbeentry = MyBEEntry;
PgBackendStatus lbeentry;
+ int buffer_type;
#ifdef USE_SSL
PgBackendSSLStatus lsslstatus;
#endif
@@ -399,6 +418,15 @@ pgstat_bestart(void)
lbeentry.st_progress_command = PROGRESS_COMMAND_INVALID;
lbeentry.st_progress_command_target = InvalidOid;
lbeentry.st_query_id = UINT64CONST(0);
+ for (buffer_type = 0; buffer_type < BUFFER_NUM_TYPES; buffer_type++)
+ {
+ PgBufferAccesses *accesses = &lbeentry.buffer_access_stats[buffer_type];
+
+ pg_atomic_init_u64(&accesses->allocs, 0);
+ pg_atomic_init_u64(&accesses->extends, 0);
+ pg_atomic_init_u64(&accesses->fsyncs, 0);
+ pg_atomic_init_u64(&accesses->writes, 0);
+ }
/*
* we don't zero st_progress_param here to save cycles; nobody should
@@ -621,6 +649,34 @@ pgstat_report_activity(BackendState state, const char *cmd_str)
PGSTAT_END_WRITE_ACTIVITY(beentry);
}
+/*
+ * Iterate through BackendStatusArray and capture live backends' buffer access
+ * stats, adding them to that backend type's member of the backend_accesses
+ * structure.
+ */
+void
+pgstat_report_live_backend_accesses(PgStatBufferTypeAccesses *backend_accesses)
+{
+ int i;
+ PgBackendStatus *beentry = BackendStatusArray;
+
+ /*
+ * Loop through live backends and capture reset values
+ */
+	for (i = 0; i < MaxBackends + NUM_AUXPROCTYPES; i++, beentry++)
+	{
+		/* Skip dead backends; their stats were persisted when they exited */
+		if (beentry->st_procpid == 0)
+			continue;
+
+ pgstat_add_buffer_type_accesses(backend_accesses[beentry->st_backendType].bt_accesses,
+ (PgBufferAccesses *) beentry->buffer_access_stats,
+ BUFFER_NUM_TYPES);
+	}
+}
+
/* --------
* pgstat_report_query_id() -
*
@@ -1046,6 +1102,12 @@ pgstat_get_my_query_id(void)
}
+PgBackendStatus *
+pgstat_fetch_backend_statuses(void)
+{
+ return BackendStatusArray;
+}
+
/* ----------
* pgstat_fetch_stat_beentry() -
*
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 513f5aecf6..674a2167ec 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1781,6 +1781,122 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
}
+Datum
+pg_stat_get_buffers_accesses(PG_FUNCTION_ARGS)
+{
+#define NROWS ((BACKEND_NUM_TYPES - 1) * BUFFER_NUM_TYPES)
+ PgStat_BackendAccesses *backend_accesses;
+ int i;
+ int buffer_type, backend_type;
+ Datum reset_time;
+ PgBackendStatus *beentry;
+ TupleDesc tupdesc;
+
+ Tuplestorestate *tupstore = pg_stat_make_tuplestore(fcinfo, &tupdesc);
+
+ /*
+ * When adding a new column to the pg_stat_buffers view, add a new enum
+ * value here above COLUMN_LENGTH.
+ */
+ enum
+ {
+ COLUMN_BACKEND_TYPE,
+ COLUMN_BUFFER_TYPE,
+ COLUMN_ALLOCS,
+ COLUMN_EXTENDS,
+ COLUMN_FSYNCS,
+ COLUMN_WRITES,
+ COLUMN_RESET_TIME,
+ COLUMN_LENGTH,
+ };
+
+ Datum all_values[NROWS][COLUMN_LENGTH];
+ bool all_nulls[NROWS][COLUMN_LENGTH];
+
+ memset(all_values, 0, sizeof(all_values));
+ memset(all_nulls, 0, sizeof(all_nulls));
+
+ /*
+ * Loop through all live backends and count their buffer accesses for each
+ * buffer type
+ */
+ beentry = pgstat_fetch_backend_statuses();
+
+	for (i = 0; i < MaxBackends + NUM_AUXPROCTYPES; i++, beentry++)
+	{
+		PgBufferAccesses *buffer_accesses;
+
+		/* Skip dead backends; their stats were persisted when they exited */
+		if (beentry->st_procpid == 0)
+			continue;
+
+ buffer_accesses = beentry->buffer_access_stats;
+
+ for (buffer_type = 0; buffer_type < BUFFER_NUM_TYPES; buffer_type++)
+ {
+ int rownum = (beentry->st_backendType - 1) * BUFFER_NUM_TYPES + buffer_type;
+ Datum *values = all_values[rownum];
+
+ /*
+ * COLUMN_RESET_TIME, COLUMN_BACKEND_TYPE, and COLUMN_BUFFER_TYPE
+ * will all be set when looping through exited backends array
+ */
+ values[COLUMN_ALLOCS] += pg_atomic_read_u64(&buffer_accesses->allocs);
+ values[COLUMN_EXTENDS] += pg_atomic_read_u64(&buffer_accesses->extends);
+ values[COLUMN_FSYNCS] += pg_atomic_read_u64(&buffer_accesses->fsyncs);
+ values[COLUMN_WRITES] += pg_atomic_read_u64(&buffer_accesses->writes);
+ buffer_accesses++;
+ }
+ }
+
+ /* Add stats from all exited backends */
+ backend_accesses = pgstat_fetch_exited_backend_buffers();
+
+ reset_time = TimestampTzGetDatum(backend_accesses->stat_reset_timestamp);
+
+ /* 0 is not a valid BackendType */
+ for (backend_type = 1; backend_type < BACKEND_NUM_TYPES; backend_type++)
+ {
+ PgStatBufferAccesses *buffer_accesses = backend_accesses->accesses[backend_type].bt_accesses;
+ PgStatBufferAccesses *resets = backend_accesses->resets[backend_type].bt_accesses;
+
+ Datum backend_type_desc = CStringGetTextDatum(GetBackendTypeDesc(backend_type));
+
+ for (buffer_type = 0; buffer_type < BUFFER_NUM_TYPES; buffer_type++)
+ {
+ /*
+ * Subtract 1 from backend_type to avoid having rows for B_INVALID
+ * BackendType
+ */
+ Datum *values = all_values[(backend_type - 1) * BUFFER_NUM_TYPES + buffer_type];
+
+ values[COLUMN_BACKEND_TYPE] = backend_type_desc;
+ values[COLUMN_BUFFER_TYPE] = CStringGetTextDatum(GetBufferTypeDesc(buffer_type));
+ values[COLUMN_ALLOCS] = values[COLUMN_ALLOCS] + buffer_accesses->allocs - resets->allocs;
+ values[COLUMN_EXTENDS] = values[COLUMN_EXTENDS] + buffer_accesses->extends - resets->extends;
+ values[COLUMN_FSYNCS] = values[COLUMN_FSYNCS] + buffer_accesses->fsyncs - resets->fsyncs;
+ values[COLUMN_WRITES] = values[COLUMN_WRITES] + buffer_accesses->writes - resets->writes;
+ values[COLUMN_RESET_TIME] = reset_time;
+ buffer_accesses++;
+ resets++;
+ }
+ }
+
+ for (i = 0; i < NROWS; i++)
+ {
+ Datum *values = all_values[i];
+ bool *nulls = all_nulls[i];
+
+ tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+ }
+
+ /* clean up and return the tuplestore */
+ tuplestore_donestoring(tupstore);
+
+ return (Datum) 0;
+}
+
/*
* Returns statistics of WAL activity
*/
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index d068d6532e..54661e2b5f 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5642,6 +5642,15 @@
proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
+{ oid => '8459', descr => 'statistics: counts of various types of accesses of buffers done by each backend type',
+ proname => 'pg_stat_get_buffers_accesses', provolatile => 's', proisstrict => 'f',
+ prorows => '52', proretset => 't',
+ proparallel => 'r', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,text,int8,int8,int8,int8,timestamptz}',
+ proargmodes => '{o,o,o,o,o,o,o}',
+ proargnames => '{backend_type,buffer_type,alloc,extend,fsync,write,stats_reset}',
+ prosrc => 'pg_stat_get_buffers_accesses' },
+
{ oid => '1136', descr => 'statistics: information about WAL activity',
proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
proparallel => 'r', prorettype => 'record', proargtypes => '',
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 90a3016065..6785fb3813 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -338,6 +338,8 @@ typedef enum BackendType
B_LOGGER,
} BackendType;
+#define BACKEND_NUM_TYPES (B_LOGGER + 1)
+
extern BackendType MyBackendType;
extern const char *GetBackendTypeDesc(BackendType backendType);
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index bcd3588ea2..2e3dfcc01d 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -72,6 +72,7 @@ typedef enum StatMsgType
PGSTAT_MTYPE_ARCHIVER,
PGSTAT_MTYPE_BGWRITER,
PGSTAT_MTYPE_CHECKPOINTER,
+ PGSTAT_MTYPE_BUFFER_ACTIONS,
PGSTAT_MTYPE_WAL,
PGSTAT_MTYPE_SLRU,
PGSTAT_MTYPE_FUNCSTAT,
@@ -138,6 +139,7 @@ typedef enum PgStat_Shared_Reset_Target
{
RESET_ARCHIVER,
RESET_BGWRITER,
+ RESET_BUFFERS,
RESET_WAL
} PgStat_Shared_Reset_Target;
@@ -331,6 +333,51 @@ typedef struct PgStat_MsgDropdb
} PgStat_MsgDropdb;
+/*
+ * Structure for counting all types of buffer accesses in the stats collector
+ */
+typedef struct PgStatBufferAccesses
+{
+ PgStat_Counter allocs;
+ PgStat_Counter extends;
+ PgStat_Counter fsyncs;
+ PgStat_Counter writes;
+} PgStatBufferAccesses;
+
+/*
+ * Structure for counting all buffer accesses of all types of buffers.
+ */
+typedef struct PgStatBufferTypeAccesses
+{
+ PgStatBufferAccesses bt_accesses[BUFFER_NUM_TYPES];
+} PgStatBufferTypeAccesses;
+
+/*
+ * Sent by a backend to the stats collector to report all buffer accesses of
+ * all types of buffers for a given type of a backend. This will happen when
+ * the backend exits or when stats are reset.
+ */
+typedef struct PgStat_MsgBufferTypeAccesses
+{
+ PgStat_MsgHdr m_hdr;
+
+ BackendType backend_type;
+ PgStatBufferTypeAccesses bta;
+} PgStat_MsgBufferTypeAccesses;
+
+/*
+ * Structure used by stats collector to keep track of all types of exited
+ * backends' buffer accesses for all types of buffers as well as all stats from
+ * live backends at the time of stats reset. resets is populated using a reset
+ * message sent to the stats collector.
+ */
+typedef struct PgStat_BackendAccesses
+{
+ TimestampTz stat_reset_timestamp;
+ PgStatBufferTypeAccesses accesses[BACKEND_NUM_TYPES];
+ PgStatBufferTypeAccesses resets[BACKEND_NUM_TYPES];
+} PgStat_BackendAccesses;
+
/* ----------
* PgStat_MsgResetcounter Sent by the backend to tell the collector
* to reset counters
@@ -351,6 +398,7 @@ typedef struct PgStat_MsgResetsharedcounter
{
PgStat_MsgHdr m_hdr;
PgStat_Shared_Reset_Target m_resettarget;
+ PgStat_MsgBufferTypeAccesses m_backend_resets;
} PgStat_MsgResetsharedcounter;
/* ----------
@@ -703,6 +751,7 @@ typedef union PgStat_Msg
PgStat_MsgArchiver msg_archiver;
PgStat_MsgBgWriter msg_bgwriter;
PgStat_MsgCheckpointer msg_checkpointer;
+ PgStat_MsgBufferTypeAccesses msg_buffer_accesses;
PgStat_MsgWal msg_wal;
PgStat_MsgSLRU msg_slru;
PgStat_MsgFuncstat msg_funcstat;
@@ -879,6 +928,7 @@ typedef struct PgStat_GlobalStats
PgStat_CheckpointerStats checkpointer;
PgStat_BgWriterStats bgwriter;
+ PgStat_BackendAccesses buffers;
} PgStat_GlobalStats;
/*
@@ -1116,8 +1166,11 @@ extern void pgstat_twophase_postcommit(TransactionId xid, uint16 info,
extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
void *recdata, uint32 len);
+extern void pgstat_add_buffer_type_accesses(PgStatBufferAccesses *dest,
+ PgBufferAccesses *src, int buffer_num_types);
extern void pgstat_send_archiver(const char *xlog, bool failed);
extern void pgstat_send_bgwriter(void);
+extern void pgstat_send_buffers(void);
extern void pgstat_send_checkpointer(void);
extern void pgstat_send_wal(bool force);
@@ -1126,6 +1179,7 @@ extern void pgstat_send_wal(bool force);
* generate the pgstat* views.
* ----------
*/
+extern PgStat_BackendAccesses *pgstat_fetch_exited_backend_buffers(void);
extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid);
extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 33fcaf5c9a..7e385135db 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -310,10 +310,10 @@ extern void ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *
/* freelist.c */
extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
- uint32 *buf_state);
+ uint32 *buf_state, bool *from_ring);
extern void StrategyFreeBuffer(BufferDesc *buf);
extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
- BufferDesc *buf);
+ BufferDesc *buf, bool *from_ring);
extern int StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
extern void StrategyNotifyBgWriter(int bgwprocno);
diff --git a/src/include/utils/backend_status.h b/src/include/utils/backend_status.h
index 8042b817df..eb134d82f1 100644
--- a/src/include/utils/backend_status.h
+++ b/src/include/utils/backend_status.h
@@ -13,6 +13,7 @@
#include "datatype/timestamp.h"
#include "libpq/pqcomm.h"
#include "miscadmin.h" /* for BackendType */
+#include "port/atomics.h"
#include "utils/backend_progress.h"
@@ -31,12 +32,48 @@ typedef enum BackendState
STATE_DISABLED
} BackendState;
+/* ----------
+ * IO Stats reporting utility types
+ * ----------
+ */
+
+typedef enum BufferAccessType
+{
+ BA_Alloc,
+ BA_Extend,
+ BA_Fsync,
+ BA_Write,
+} BufferAccessType;
+
+#define BUFFER_ACCESS_NUM_TYPES (BA_Write + 1)
+
+typedef enum BufferType
+{
+ Buf_Direct,
+ Buf_Local,
+ Buf_Shared,
+ Buf_Strategy,
+} BufferType;
+
+#define BUFFER_NUM_TYPES (Buf_Strategy + 1)
+
/* ----------
* Shared-memory data structures
* ----------
*/
+/*
+ * Structure for counting all types of buffer accesses for a live backend.
+ */
+typedef struct PgBufferAccesses
+{
+ pg_atomic_uint64 allocs;
+ pg_atomic_uint64 extends;
+ pg_atomic_uint64 fsyncs;
+ pg_atomic_uint64 writes;
+} PgBufferAccesses;
+
/*
* PgBackendSSLStatus
*
@@ -168,6 +205,7 @@ typedef struct PgBackendStatus
/* query identifier, optionally computed using post_parse_analyze_hook */
uint64 st_query_id;
+ PgBufferAccesses buffer_access_stats[BUFFER_NUM_TYPES];
} PgBackendStatus;
@@ -289,6 +327,10 @@ extern void CreateSharedBackendStatus(void);
* ----------
*/
+/* Utility functions */
+extern const char *GetBufferTypeDesc(BufferType bufferType);
+
+
/* Initialization functions */
extern void pgstat_beinit(void);
extern void pgstat_bestart(void);
@@ -296,7 +338,39 @@ extern void pgstat_bestart(void);
extern void pgstat_clear_backend_activity_snapshot(void);
/* Activity reporting functions */
+typedef struct PgStatBufferTypeAccesses PgStatBufferTypeAccesses;
+
+static inline void
+pgstat_inc_buffer_access_type(BufferAccessType ba_type, BufferType buf_type)
+{
+ PgBufferAccesses *accesses;
+ PgBackendStatus *beentry = MyBEEntry;
+
+ Assert(beentry);
+
+ accesses = &beentry->buffer_access_stats[buf_type];
+ switch (ba_type)
+ {
+ case BA_Alloc:
+ pg_atomic_write_u64(&accesses->allocs,
+ pg_atomic_read_u64(&accesses->allocs) + 1);
+ break;
+ case BA_Extend:
+ pg_atomic_write_u64(&accesses->extends,
+ pg_atomic_read_u64(&accesses->extends) + 1);
+ break;
+ case BA_Fsync:
+ pg_atomic_write_u64(&accesses->fsyncs,
+ pg_atomic_read_u64(&accesses->fsyncs) + 1);
+ break;
+ case BA_Write:
+ pg_atomic_write_u64(&accesses->writes,
+ pg_atomic_read_u64(&accesses->writes) + 1);
+ break;
+ }
+}
extern void pgstat_report_activity(BackendState state, const char *cmd_str);
+extern void pgstat_report_live_backend_accesses(PgStatBufferTypeAccesses *backend_accesses);
extern void pgstat_report_query_id(uint64 query_id, bool force);
extern void pgstat_report_tempfile(size_t filesize);
extern void pgstat_report_appname(const char *appname);
@@ -312,6 +386,7 @@ extern uint64 pgstat_get_my_query_id(void);
* generate the pgstat* views.
* ----------
*/
+extern PgBackendStatus *pgstat_fetch_backend_statuses(void);
extern int pgstat_fetch_stat_numbackends(void);
extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
extern LocalPgBackendStatus *pgstat_fetch_stat_local_beentry(int beid);
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 2fa00a3c29..9172b0fcd2 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1828,6 +1828,14 @@ pg_stat_bgwriter| SELECT pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints
pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+pg_stat_buffers| SELECT b.backend_type,
+ b.buffer_type,
+ b.alloc,
+ b.extend,
+ b.fsync,
+ b.write,
+ b.stats_reset
+ FROM pg_stat_get_buffers_accesses() b(backend_type, buffer_type, alloc, extend, fsync, write, stats_reset);
pg_stat_database| SELECT d.oid AS datid,
d.datname,
CASE
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index feaaee6326..4ad672b35a 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -176,4 +176,8 @@ FROM prevstats AS pr;
DROP TABLE trunc_stats_test, trunc_stats_test1, trunc_stats_test2, trunc_stats_test3, trunc_stats_test4;
DROP TABLE prevstats;
+SELECT * FROM pg_stat_buffers;
+SELECT pg_stat_reset_shared('buffers');
+SELECT pg_sleep(2);
+SELECT * FROM pg_stat_buffers;
-- End of Stats Test
--
2.27.0
Can you say more about 0001?
--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
"Use it up, wear it out, make it do, or do without"
v13 (attached) contains several cosmetic updates and the full rename
(comments included) of BufferAccessType and BufferType.
On Thu, Sep 30, 2021 at 7:15 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
Can you say more about 0001?
The rationale for this patch is that skipping backend activity state
initialization in the bootstrap process doesn't save much, and by
initializing it unconditionally I don't have to do the if (beentry)
check in pgstat_inc_ioop() -- which happens on most buffer accesses.
Attachments:
v13-0004-Remove-superfluous-bgwriter-stats.patch
From ee11056ad25a095593ba2acc2dc8ff31f4ceb9ab Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 29 Sep 2021 15:44:51 -0400
Subject: [PATCH v13 4/4] Remove superfluous bgwriter stats
Remove stats from pg_stat_bgwriter which are now more clearly expressed
in pg_stat_buffers.
TODO:
- make pg_stat_checkpointer view and move relevant stats into it
- add additional stats to pg_stat_bgwriter
---
doc/src/sgml/monitoring.sgml | 47 ---------------------------
src/backend/catalog/system_views.sql | 6 +---
src/backend/postmaster/checkpointer.c | 26 ---------------
src/backend/postmaster/pgstat.c | 5 ---
src/backend/storage/buffer/bufmgr.c | 6 ----
src/backend/utils/adt/pgstatfuncs.c | 30 -----------------
src/include/catalog/pg_proc.dat | 22 -------------
src/include/pgstat.h | 10 ------
src/test/regress/expected/rules.out | 5 ---
9 files changed, 1 insertion(+), 156 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 6debc53ecc..f59ca8b993 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3416,24 +3416,6 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
</para></entry>
</row>
- <row>
- <entry role="catalog_table_entry"><para role="column_definition">
- <structfield>buffers_checkpoint</structfield> <type>bigint</type>
- </para>
- <para>
- Number of buffers written during checkpoints
- </para></entry>
- </row>
-
- <row>
- <entry role="catalog_table_entry"><para role="column_definition">
- <structfield>buffers_clean</structfield> <type>bigint</type>
- </para>
- <para>
- Number of buffers written by the background writer
- </para></entry>
- </row>
-
<row>
<entry role="catalog_table_entry"><para role="column_definition">
<structfield>maxwritten_clean</structfield> <type>bigint</type>
@@ -3444,35 +3426,6 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
</para></entry>
</row>
- <row>
- <entry role="catalog_table_entry"><para role="column_definition">
- <structfield>buffers_backend</structfield> <type>bigint</type>
- </para>
- <para>
- Number of buffers written directly by a backend
- </para></entry>
- </row>
-
- <row>
- <entry role="catalog_table_entry"><para role="column_definition">
- <structfield>buffers_backend_fsync</structfield> <type>bigint</type>
- </para>
- <para>
- Number of times a backend had to execute its own
- <function>fsync</function> call (normally the background writer handles those
- even when the backend does its own write)
- </para></entry>
- </row>
-
- <row>
- <entry role="catalog_table_entry"><para role="column_definition">
- <structfield>buffers_alloc</structfield> <type>bigint</type>
- </para>
- <para>
- Number of buffers allocated
- </para></entry>
- </row>
-
<row>
<entry role="catalog_table_entry"><para role="column_definition">
<structfield>stats_reset</structfield> <type>timestamp with time zone</type>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 8e92b23edc..a27fe0c80c 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1058,18 +1058,14 @@ CREATE VIEW pg_stat_archiver AS
s.stats_reset
FROM pg_stat_get_archiver() s;
+-- TODO: make separate pg_stat_checkpointer view
CREATE VIEW pg_stat_bgwriter AS
SELECT
pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints_timed,
pg_stat_get_bgwriter_requested_checkpoints() AS checkpoints_req,
pg_stat_get_checkpoint_write_time() AS checkpoint_write_time,
pg_stat_get_checkpoint_sync_time() AS checkpoint_sync_time,
- pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint,
- pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean,
pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
- pg_stat_get_buf_written_backend() AS buffers_backend,
- pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
- pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
CREATE VIEW pg_stat_buffers AS
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 0d18e7f71a..8f2ef63ee5 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -90,17 +90,9 @@
* requesting backends since the last checkpoint start. The flags are
* chosen so that OR'ing is the correct way to combine multiple requests.
*
- * num_backend_writes is used to count the number of buffer writes performed
- * by user backend processes. This counter should be wide enough that it
- * can't overflow during a single processing cycle. num_backend_fsync
- * counts the subset of those writes that also had to do their own fsync,
- * because the checkpointer failed to absorb their request.
- *
* The requests array holds fsync requests sent by backends and not yet
* absorbed by the checkpointer.
*
- * Unlike the checkpoint fields, num_backend_writes, num_backend_fsync, and
- * the requests fields are protected by CheckpointerCommLock.
*----------
*/
typedef struct
@@ -124,9 +116,6 @@ typedef struct
ConditionVariable start_cv; /* signaled when ckpt_started advances */
ConditionVariable done_cv; /* signaled when ckpt_done advances */
- uint32 num_backend_writes; /* counts user backend buffer writes */
- uint32 num_backend_fsync; /* counts user backend fsync calls */
-
int num_requests; /* current # of requests */
int max_requests; /* allocated array size */
CheckpointerRequest requests[FLEXIBLE_ARRAY_MEMBER];
@@ -1085,10 +1074,6 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
- /* Count all backend writes regardless of if they fit in the queue */
- if (!AmBackgroundWriterProcess())
- CheckpointerShmem->num_backend_writes++;
-
/*
* If the checkpointer isn't running or the request queue is full, the
* backend will have to perform its own fsync request. But before forcing
@@ -1102,8 +1087,6 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
* Count the subset of writes where backends have to do their own
* fsync
*/
- if (!AmBackgroundWriterProcess())
- CheckpointerShmem->num_backend_fsync++;
pgstat_inc_ioop(IOOP_FSYNC, IOPATH_SHARED);
LWLockRelease(CheckpointerCommLock);
return false;
@@ -1261,15 +1244,6 @@ AbsorbSyncRequests(void)
LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
- /* Transfer stats counts into pending pgstats message */
- PendingCheckpointerStats.m_buf_written_backend
- += CheckpointerShmem->num_backend_writes;
- PendingCheckpointerStats.m_buf_fsync_backend
- += CheckpointerShmem->num_backend_fsync;
-
- CheckpointerShmem->num_backend_writes = 0;
- CheckpointerShmem->num_backend_fsync = 0;
-
/*
* We try to avoid holding the lock for a long time by copying the request
* array, and processing the requests after releasing the lock.
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 27f4b6ce2f..fbec722a1f 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -5608,9 +5608,7 @@ pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
static void
pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
{
- globalStats.bgwriter.buf_written_clean += msg->m_buf_written_clean;
globalStats.bgwriter.maxwritten_clean += msg->m_maxwritten_clean;
- globalStats.bgwriter.buf_alloc += msg->m_buf_alloc;
}
/* ----------
@@ -5626,9 +5624,6 @@ pgstat_recv_checkpointer(PgStat_MsgCheckpointer *msg, int len)
globalStats.checkpointer.requested_checkpoints += msg->m_requested_checkpoints;
globalStats.checkpointer.checkpoint_write_time += msg->m_checkpoint_write_time;
globalStats.checkpointer.checkpoint_sync_time += msg->m_checkpoint_sync_time;
- globalStats.checkpointer.buf_written_checkpoints += msg->m_buf_written_checkpoints;
- globalStats.checkpointer.buf_written_backend += msg->m_buf_written_backend;
- globalStats.checkpointer.buf_fsync_backend += msg->m_buf_fsync_backend;
}
static void
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 504cf37ff9..b911dd9ce5 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2165,7 +2165,6 @@ BufferSync(int flags)
if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
{
TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
- PendingCheckpointerStats.m_buf_written_checkpoints++;
num_written++;
}
}
@@ -2274,9 +2273,6 @@ BgBufferSync(WritebackContext *wb_context)
*/
strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
- /* Report buffer alloc counts to pgstat */
- PendingBgWriterStats.m_buf_alloc += recent_alloc;
-
/*
* If we're not running the LRU scan, just stop after doing the stats
* stuff. We mark the saved state invalid so that we can recover sanely
@@ -2473,8 +2469,6 @@ BgBufferSync(WritebackContext *wb_context)
reusable_buffers++;
}
- PendingBgWriterStats.m_buf_written_clean += num_written;
-
#ifdef BGW_DEBUG
elog(DEBUG1, "bgwriter: recent_alloc=%u smoothed=%.2f delta=%ld ahead=%d density=%.2f reusable_est=%d upcoming_est=%d scanned=%d wrote=%d reusable=%d",
recent_alloc, smoothed_alloc, strategy_delta, bufs_ahead,
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 5f4b15c9e1..557b2673c0 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1723,18 +1723,6 @@ pg_stat_get_bgwriter_requested_checkpoints(PG_FUNCTION_ARGS)
PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->requested_checkpoints);
}
-Datum
-pg_stat_get_bgwriter_buf_written_checkpoints(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_written_checkpoints);
-}
-
-Datum
-pg_stat_get_bgwriter_buf_written_clean(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_written_clean);
-}
-
Datum
pg_stat_get_bgwriter_maxwritten_clean(PG_FUNCTION_ARGS)
{
@@ -1763,24 +1751,6 @@ pg_stat_get_bgwriter_stat_reset_time(PG_FUNCTION_ARGS)
PG_RETURN_TIMESTAMPTZ(pgstat_fetch_stat_bgwriter()->stat_reset_timestamp);
}
-Datum
-pg_stat_get_buf_written_backend(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_written_backend);
-}
-
-Datum
-pg_stat_get_buf_fsync_backend(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_fsync_backend);
-}
-
-Datum
-pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
-}
-
Datum
pg_stat_get_buffers(PG_FUNCTION_ARGS)
{
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index bbdb07b222..a3cdbe1dbc 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5600,16 +5600,6 @@
proname => 'pg_stat_get_bgwriter_requested_checkpoints', provolatile => 's',
proparallel => 'r', prorettype => 'int8', proargtypes => '',
prosrc => 'pg_stat_get_bgwriter_requested_checkpoints' },
-{ oid => '2771',
- descr => 'statistics: number of buffers written by the bgwriter during checkpoints',
- proname => 'pg_stat_get_bgwriter_buf_written_checkpoints', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_bgwriter_buf_written_checkpoints' },
-{ oid => '2772',
- descr => 'statistics: number of buffers written by the bgwriter for cleaning dirty buffers',
- proname => 'pg_stat_get_bgwriter_buf_written_clean', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_bgwriter_buf_written_clean' },
{ oid => '2773',
descr => 'statistics: number of times the bgwriter stopped processing when it had written too many buffers while cleaning',
proname => 'pg_stat_get_bgwriter_maxwritten_clean', provolatile => 's',
@@ -5629,18 +5619,6 @@
proname => 'pg_stat_get_checkpoint_sync_time', provolatile => 's',
proparallel => 'r', prorettype => 'float8', proargtypes => '',
prosrc => 'pg_stat_get_checkpoint_sync_time' },
-{ oid => '2775', descr => 'statistics: number of buffers written by backends',
- proname => 'pg_stat_get_buf_written_backend', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_buf_written_backend' },
-{ oid => '3063',
- descr => 'statistics: number of backend buffer writes that did their own fsync',
- proname => 'pg_stat_get_buf_fsync_backend', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_buf_fsync_backend' },
-{ oid => '2859', descr => 'statistics: number of buffer allocations',
- proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
- prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
{ oid => '8459', descr => 'statistics: counts of all IO operations done to all IO paths by each type of backend.',
proname => 'pg_stat_get_buffers', provolatile => 's', proisstrict => 'f',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 94eee19b8e..8ff87a3f54 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -505,9 +505,7 @@ typedef struct PgStat_MsgBgWriter
{
PgStat_MsgHdr m_hdr;
- PgStat_Counter m_buf_written_clean;
PgStat_Counter m_maxwritten_clean;
- PgStat_Counter m_buf_alloc;
} PgStat_MsgBgWriter;
/* ----------
@@ -520,9 +518,6 @@ typedef struct PgStat_MsgCheckpointer
PgStat_Counter m_timed_checkpoints;
PgStat_Counter m_requested_checkpoints;
- PgStat_Counter m_buf_written_checkpoints;
- PgStat_Counter m_buf_written_backend;
- PgStat_Counter m_buf_fsync_backend;
PgStat_Counter m_checkpoint_write_time; /* times in milliseconds */
PgStat_Counter m_checkpoint_sync_time;
} PgStat_MsgCheckpointer;
@@ -898,9 +893,7 @@ typedef struct PgStat_ArchiverStats
*/
typedef struct PgStat_BgWriterStats
{
- PgStat_Counter buf_written_clean;
PgStat_Counter maxwritten_clean;
- PgStat_Counter buf_alloc;
TimestampTz stat_reset_timestamp;
} PgStat_BgWriterStats;
@@ -914,9 +907,6 @@ typedef struct PgStat_CheckpointerStats
PgStat_Counter requested_checkpoints;
PgStat_Counter checkpoint_write_time; /* times in milliseconds */
PgStat_Counter checkpoint_sync_time;
- PgStat_Counter buf_written_checkpoints;
- PgStat_Counter buf_written_backend;
- PgStat_Counter buf_fsync_backend;
} PgStat_CheckpointerStats;
/*
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 5e5a0324ee..090a65cdb0 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1821,12 +1821,7 @@ pg_stat_bgwriter| SELECT pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints
pg_stat_get_bgwriter_requested_checkpoints() AS checkpoints_req,
pg_stat_get_checkpoint_write_time() AS checkpoint_write_time,
pg_stat_get_checkpoint_sync_time() AS checkpoint_sync_time,
- pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint,
- pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean,
pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
- pg_stat_get_buf_written_backend() AS buffers_backend,
- pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
- pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
pg_stat_buffers| SELECT b.backend_type,
b.io_path,
--
2.27.0
v13-0001-Allow-bootstrap-process-to-beinit.patch
From 40c809ad1127322f3462e85be080c10534485f0d Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Fri, 24 Sep 2021 17:39:12 -0400
Subject: [PATCH v13 1/4] Allow bootstrap process to beinit
---
src/backend/utils/init/postinit.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 78bc64671e..fba5864172 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -670,8 +670,7 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
EnablePortalManager();
/* Initialize status reporting */
- if (!bootstrap)
- pgstat_beinit();
+ pgstat_beinit();
/*
* Load relcache entries for the shared system catalogs. This must create
--
2.27.0
v13-0002-Add-utility-to-make-tuplestores-for-pg-stat-view.patch
From a709ddb30b2b747beb214f0b13cd1e1816094e6b Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Thu, 30 Sep 2021 16:16:22 -0400
Subject: [PATCH v13 2/4] Add utility to make tuplestores for pg stat views
Most of the steps to make a tuplestore for those pg_stat views requiring
one are the same. Consolidate them into a single helper function for
clarity and to avoid bugs.
---
src/backend/utils/adt/pgstatfuncs.c | 129 ++++++++++------------------
1 file changed, 44 insertions(+), 85 deletions(-)
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index ff5aedc99c..513f5aecf6 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -36,6 +36,42 @@
#define HAS_PGSTAT_PERMISSIONS(role) (is_member_of_role(GetUserId(), ROLE_PG_READ_ALL_STATS) || has_privs_of_role(GetUserId(), role))
+/*
+ * Helper function for views with multiple rows constructed from a tuplestore
+ */
+static Tuplestorestate *
+pg_stat_make_tuplestore(FunctionCallInfo fcinfo, TupleDesc *tupdesc)
+{
+ Tuplestorestate *tupstore;
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ MemoryContext per_query_ctx;
+ MemoryContext oldcontext;
+
+ /* check to see if caller supports us returning a tuplestore */
+ if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("set-valued function called in context that cannot accept a set")));
+ if (!(rsinfo->allowedModes & SFRM_Materialize))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("materialize mode required, but it is not allowed in this context")));
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+ oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+ tupstore = tuplestore_begin_heap(true, false, work_mem);
+ rsinfo->returnMode = SFRM_Materialize;
+ rsinfo->setResult = tupstore;
+ rsinfo->setDesc = *tupdesc;
+ MemoryContextSwitchTo(oldcontext);
+ return tupstore;
+}
+
Datum
pg_stat_get_numscans(PG_FUNCTION_ARGS)
{
@@ -457,29 +493,13 @@ Datum
pg_stat_get_progress_info(PG_FUNCTION_ARGS)
{
#define PG_STAT_GET_PROGRESS_COLS PGSTAT_NUM_PROGRESS_PARAM + 3
- int num_backends = pgstat_fetch_stat_numbackends();
int curr_backend;
- char *cmd = text_to_cstring(PG_GETARG_TEXT_PP(0));
ProgressCommandType cmdtype;
TupleDesc tupdesc;
- Tuplestorestate *tupstore;
- ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
- MemoryContext per_query_ctx;
- MemoryContext oldcontext;
-
- /* check to see if caller supports us returning a tuplestore */
- if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
- ereport(ERROR,
- (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("set-valued function called in context that cannot accept a set")));
- if (!(rsinfo->allowedModes & SFRM_Materialize))
- ereport(ERROR,
- (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("materialize mode required, but it is not allowed in this context")));
- /* Build a tuple descriptor for our result type */
- if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
- elog(ERROR, "return type must be a row type");
+ Tuplestorestate *tupstore = pg_stat_make_tuplestore(fcinfo, &tupdesc);
+ int num_backends = pgstat_fetch_stat_numbackends();
+ char *cmd = text_to_cstring(PG_GETARG_TEXT_PP(0));
/* Translate command name into command type code. */
if (pg_strcasecmp(cmd, "VACUUM") == 0)
@@ -499,15 +519,6 @@ pg_stat_get_progress_info(PG_FUNCTION_ARGS)
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("invalid command name: \"%s\"", cmd)));
- per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
- oldcontext = MemoryContextSwitchTo(per_query_ctx);
-
- tupstore = tuplestore_begin_heap(true, false, work_mem);
- rsinfo->returnMode = SFRM_Materialize;
- rsinfo->setResult = tupstore;
- rsinfo->setDesc = tupdesc;
- MemoryContextSwitchTo(oldcontext);
-
/* 1-based index */
for (curr_backend = 1; curr_backend <= num_backends; curr_backend++)
{
@@ -568,38 +579,12 @@ Datum
pg_stat_get_activity(PG_FUNCTION_ARGS)
{
#define PG_STAT_GET_ACTIVITY_COLS 30
- int num_backends = pgstat_fetch_stat_numbackends();
- int curr_backend;
- int pid = PG_ARGISNULL(0) ? -1 : PG_GETARG_INT32(0);
- ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
TupleDesc tupdesc;
- Tuplestorestate *tupstore;
- MemoryContext per_query_ctx;
- MemoryContext oldcontext;
-
- /* check to see if caller supports us returning a tuplestore */
- if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
- ereport(ERROR,
- (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("set-valued function called in context that cannot accept a set")));
- if (!(rsinfo->allowedModes & SFRM_Materialize))
- ereport(ERROR,
- (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("materialize mode required, but it is not allowed in this context")));
-
- /* Build a tuple descriptor for our result type */
- if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
- elog(ERROR, "return type must be a row type");
-
- per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
- oldcontext = MemoryContextSwitchTo(per_query_ctx);
-
- tupstore = tuplestore_begin_heap(true, false, work_mem);
- rsinfo->returnMode = SFRM_Materialize;
- rsinfo->setResult = tupstore;
- rsinfo->setDesc = tupdesc;
+ int curr_backend;
- MemoryContextSwitchTo(oldcontext);
+ int num_backends = pgstat_fetch_stat_numbackends();
+ int pid = PG_ARGISNULL(0) ? -1 : PG_GETARG_INT32(0);
+ Tuplestorestate *tupstore = pg_stat_make_tuplestore(fcinfo, &tupdesc);
/* 1-based index */
for (curr_backend = 1; curr_backend <= num_backends; curr_backend++)
@@ -1871,37 +1856,11 @@ Datum
pg_stat_get_slru(PG_FUNCTION_ARGS)
{
#define PG_STAT_GET_SLRU_COLS 9
- ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
TupleDesc tupdesc;
- Tuplestorestate *tupstore;
- MemoryContext per_query_ctx;
- MemoryContext oldcontext;
int i;
PgStat_SLRUStats *stats;
- /* check to see if caller supports us returning a tuplestore */
- if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
- ereport(ERROR,
- (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("set-valued function called in context that cannot accept a set")));
- if (!(rsinfo->allowedModes & SFRM_Materialize))
- ereport(ERROR,
- (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("materialize mode required, but it is not allowed in this context")));
-
- /* Build a tuple descriptor for our result type */
- if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
- elog(ERROR, "return type must be a row type");
-
- per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
- oldcontext = MemoryContextSwitchTo(per_query_ctx);
-
- tupstore = tuplestore_begin_heap(true, false, work_mem);
- rsinfo->returnMode = SFRM_Materialize;
- rsinfo->setResult = tupstore;
- rsinfo->setDesc = tupdesc;
-
- MemoryContextSwitchTo(oldcontext);
+ Tuplestorestate *tupstore = pg_stat_make_tuplestore(fcinfo, &tupdesc);
/* request SLRU stats from the stat collector */
stats = pgstat_fetch_slru();
--
2.27.0
Attachment: v13-0003-Add-system-view-tracking-IO-ops-per-backend-type.patch (text/x-patch)
From e9a5d2a021d429fdbb2daa58ab9d75a069f334d4 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 29 Sep 2021 15:39:45 -0400
Subject: [PATCH v13 3/4] Add system view tracking IO ops per backend type
Add pg_stat_buffers, a system view which tracks the number of IO
operations (allocs, writes, fsyncs, and extends) done through each IO
path (e.g. shared buffers, local buffers, unbuffered IO) by each type of
backend.
Some of these should always be zero. For example, the checkpointer does not
currently use a BufferAccessStrategy, so the "strategy" IO path will be 0 for
all of the checkpointer's IO operations (alloc, write, fsync, and extend).
All backends increment a counter in their PgBackendStatus when
performing an IO operation. On exit, backends send these stats to the
stats collector to be persisted.
When stats are reset, the backend sending the reset message will loop
through and collect all of the live backends' IO op stats, sending a
reset message for each backend type containing these stats. When
receiving this message, the stats collector will 1) save these reset
values in an array of "resets" and 2) zero out the exited backends'
saved IO op counters. This is required for accurate stats after a reset
without writing to other backends' PgBackendStatuses.
When the pg_stat_buffers view is queried, one backend will sum live
backends' stats with saved stats from exited backends and subtract saved
reset stats, returning the total.
Each row of the view is stats for a particular backend type for a
particular IO path (e.g. shared buffer accesses by checkpointer) and
each column in the view is the total number of IO operations done (e.g.
writes).
So a cell in the view would be, for example, the number of shared
buffers written by checkpointer since the last stats reset.
Note that this commit does not add code to increment IO ops for all IO
paths. It includes all possible combinations in the stats view but
doesn't populate all of them.
A separate proposed patch [1], which adds wrappers for smgrwrite() and
extend(), would provide a good location to call pgstat_inc_ioop() for
unbuffered IO and avoid regressions for future users of these functions.
TODO:
- catalog bump
[1] https://www.postgresql.org/message-id/CAAKRu_aw72w70X1P%3Dba20K8iGUvSkyz7Yk03wPPh3f9WgmcJ3g%40mail.gmail.com
Discussion: https://www.postgresql.org/message-id/flat/20210415235954.qcypb4urtovzkat5%40alap3.anarazel.de#724d5cce4bcb587f9167b80a5824bc5c
---
doc/src/sgml/monitoring.sgml | 116 ++++++++++++++-
src/backend/catalog/system_views.sql | 11 ++
src/backend/postmaster/checkpointer.c | 1 +
src/backend/postmaster/pgstat.c | 151 +++++++++++++++++++-
src/backend/storage/buffer/bufmgr.c | 25 +++-
src/backend/storage/buffer/freelist.c | 23 ++-
src/backend/utils/activity/backend_status.c | 64 ++++++++-
src/backend/utils/adt/pgstatfuncs.c | 120 ++++++++++++++++
src/include/catalog/pg_proc.dat | 9 ++
src/include/miscadmin.h | 2 +
src/include/pgstat.h | 54 +++++++
src/include/storage/buf_internals.h | 4 +-
src/include/utils/backend_status.h | 84 +++++++++++
src/test/regress/expected/rules.out | 8 ++
14 files changed, 653 insertions(+), 19 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 2cd8920645..6debc53ecc 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -444,6 +444,15 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</entry>
</row>
+ <row>
+ <entry><structname>pg_stat_buffers</structname><indexterm><primary>pg_stat_buffers</primary></indexterm></entry>
+ <entry>A row for each IO path for each backend type showing
+ statistics about backend IO operations. See
+ <link linkend="monitoring-pg-stat-buffers-view">
+ <structname>pg_stat_buffers</structname></link> for details.
+ </entry>
+ </row>
+
<row>
<entry><structname>pg_stat_wal</structname><indexterm><primary>pg_stat_wal</primary></indexterm></entry>
<entry>One row only, showing statistics about WAL activity. See
@@ -3478,6 +3487,101 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
</sect2>
+ <sect2 id="monitoring-pg-stat-buffers-view">
+ <title><structname>pg_stat_buffers</structname></title>
+
+ <indexterm>
+ <primary>pg_stat_buffers</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_stat_buffers</structname> view has a row for each backend
+ type for each possible IO path, containing global data for the cluster for
+ that backend type and IO path.
+ </para>
+
+ <table id="pg-stat-buffers-view" xreflabel="pg_stat_buffers">
+ <title><structname>pg_stat_buffers</structname> View</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>backend_type</structfield> <type>text</type>
+ </para>
+ <para>
+ Type of backend (e.g. background worker, autovacuum worker).
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>io_path</structfield> <type>text</type>
+ </para>
+ <para>
+ IO path taken (e.g. shared buffers, direct).
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>alloc</structfield> <type>integer</type>
+ </para>
+ <para>
+ Number of buffers allocated.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>extend</structfield> <type>integer</type>
+ </para>
+ <para>
+ Number of buffers extended.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>fsync</structfield> <type>integer</type>
+ </para>
+ <para>
+ Number of buffers fsynced.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>write</structfield> <type>integer</type>
+ </para>
+ <para>
+ Number of buffers written.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
+ </para>
+ <para>
+ Time at which these statistics were last reset.
+ </para></entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ </sect2>
+
<sect2 id="monitoring-pg-stat-wal-view">
<title><structname>pg_stat_wal</structname></title>
@@ -5074,12 +5178,14 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
</para>
<para>
Resets some cluster-wide statistics counters to zero, depending on the
- argument. The argument can be <literal>bgwriter</literal> to reset
- all the counters shown in
- the <structname>pg_stat_bgwriter</structname>
+ argument. The argument can be <literal>bgwriter</literal> to reset all
+ the counters shown in the <structname>pg_stat_bgwriter</structname>
view, <literal>archiver</literal> to reset all the counters shown in
- the <structname>pg_stat_archiver</structname> view or <literal>wal</literal>
- to reset all the counters shown in the <structname>pg_stat_wal</structname> view.
+ the <structname>pg_stat_archiver</structname> view,
+ <literal>wal</literal> to reset all the counters shown in the
+ <structname>pg_stat_wal</structname> view, or
+ <literal>buffers</literal> to reset all the counters shown in the
+ <structname>pg_stat_buffers</structname> view.
</para>
<para>
This function is restricted to superusers by default, but other users
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 55f6e3711d..8e92b23edc 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1072,6 +1072,17 @@ CREATE VIEW pg_stat_bgwriter AS
pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+CREATE VIEW pg_stat_buffers AS
+SELECT
+ b.backend_type,
+ b.io_path,
+ b.alloc,
+ b.extend,
+ b.fsync,
+ b.write,
+ b.stats_reset
+FROM pg_stat_get_buffers() b;
+
CREATE VIEW pg_stat_wal AS
SELECT
w.wal_records,
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index be7366379d..0d18e7f71a 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -1104,6 +1104,7 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
*/
if (!AmBackgroundWriterProcess())
CheckpointerShmem->num_backend_fsync++;
+ pgstat_inc_ioop(IOOP_FSYNC, IOPATH_SHARED);
LWLockRelease(CheckpointerCommLock);
return false;
}
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index b7d0fbaefd..27f4b6ce2f 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -124,9 +124,12 @@ char *pgstat_stat_filename = NULL;
char *pgstat_stat_tmpname = NULL;
/*
- * BgWriter and WAL global statistics counters.
- * Stored directly in a stats message structure so they can be sent
- * without needing to copy things around. We assume these init to zeroes.
+ * BgWriter, Checkpointer, WAL, and I/O global statistics counters. I/O global
+ * statistics on various IO ops are tracked in PgBackendStatus while a backend
+ * is alive and then sent to the stats collector before the backend exits in a
+ * PgStat_MsgIOPathOps.
+ * All others are stored directly in a stats message structure so they can be
+ * sent without needing to copy things around. We assume these init to zeroes.
*/
PgStat_MsgBgWriter PendingBgWriterStats;
PgStat_MsgCheckpointer PendingCheckpointerStats;
@@ -362,6 +365,7 @@ static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
static void pgstat_recv_checkpointer(PgStat_MsgCheckpointer *msg, int len);
+static void pgstat_recv_io_path_ops(PgStat_MsgIOPathOps *msg, int len);
static void pgstat_recv_wal(PgStat_MsgWal *msg, int len);
static void pgstat_recv_slru(PgStat_MsgSLRU *msg, int len);
static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
@@ -1452,6 +1456,8 @@ pgstat_reset_shared_counters(const char *target)
msg.m_resettarget = RESET_ARCHIVER;
else if (strcmp(target, "bgwriter") == 0)
msg.m_resettarget = RESET_BGWRITER;
+ else if (strcmp(target, "buffers") == 0)
+ msg.m_resettarget = RESET_BUFFERS;
else if (strcmp(target, "wal") == 0)
msg.m_resettarget = RESET_WAL;
else
@@ -1461,7 +1467,25 @@ pgstat_reset_shared_counters(const char *target)
errhint("Target must be \"archiver\", \"bgwriter\", or \"wal\".")));
pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
- pgstat_send(&msg, sizeof(msg));
+
+ if (msg.m_resettarget == RESET_BUFFERS)
+ {
+ int backend_type;
+ PgStatIOPathOps ops[BACKEND_NUM_TYPES];
+
+ memset(ops, 0, sizeof(ops));
+ pgstat_report_live_backend_io_path_ops(ops);
+
+ for (backend_type = 1; backend_type < BACKEND_NUM_TYPES; backend_type++)
+ {
+ msg.m_backend_resets.backend_type = backend_type;
+ memcpy(&msg.m_backend_resets.iop, &ops[backend_type], sizeof(msg.m_backend_resets.iop));
+ pgstat_send(&msg, sizeof(msg));
+ }
+ }
+ else
+ pgstat_send(&msg, sizeof(msg));
}
/* ----------
@@ -2760,6 +2784,19 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
rec->tuples_inserted + rec->tuples_updated;
}
+/*
+ * Support function for SQL-callable pgstat* functions. Returns a pointer to
+ * the PgStat_BackendIOPathOps structure tracking IO op statistics for both
+ * exited backends and reset arithmetic.
+ */
+PgStat_BackendIOPathOps *
+pgstat_fetch_exited_backend_buffers(void)
+{
+ backend_read_statsfile();
+
+ return &globalStats.buffers;
+}
+
/* ----------
* pgstat_fetch_stat_dbentry() -
@@ -2999,6 +3036,14 @@ pgstat_shutdown_hook(int code, Datum arg)
{
Assert(!pgstat_is_shutdown);
+ /*
+ * Only need to send stats on IO Ops for IO Paths when a process exits, as
+ * pg_stat_get_buffers() will read from live backends' PgBackendStatus and
+ * then sum this with totals from exited backends persisted by the stats
+ * collector.
+ */
+ pgstat_send_buffers();
+
/*
* If we got as far as discovering our own database ID, we can report what
* we did to the collector. Otherwise, we'd be sending an invalid
@@ -3092,6 +3137,30 @@ pgstat_send(void *msg, int len)
#endif
}
+/*
+ * Add live IO Op stats for all IO Paths (e.g. shared, local) to those in the
+ * equivalent stats structure for exited backends. Note that this adds and
+ * doesn't set, so the destination stats structure should be zeroed out by the
+ * caller initially. This would commonly be used to transfer all IO Op stats
+ * for all IO Paths for a particular backend type to the pgstats structure.
+ */
+void
+pgstat_add_io_path_ops(PgStatIOOps *dest, IOOps *src, int io_path_num_types)
+{
+ int io_path;
+
+ for (io_path = 0; io_path < io_path_num_types; io_path++)
+ {
+ dest->allocs += pg_atomic_read_u64(&src->allocs);
+ dest->extends += pg_atomic_read_u64(&src->extends);
+ dest->fsyncs += pg_atomic_read_u64(&src->fsyncs);
+ dest->writes += pg_atomic_read_u64(&src->writes);
+ dest++;
+ src++;
+ }
+}
+
/* ----------
* pgstat_send_archiver() -
*
@@ -3148,6 +3217,32 @@ pgstat_send_bgwriter(void)
MemSet(&PendingBgWriterStats, 0, sizeof(PendingBgWriterStats));
}
+/*
+ * Before exiting, a backend sends its IO op statistics to the collector so
+ * that they may be persisted.
+ */
+void
+pgstat_send_buffers(void)
+{
+ PgStat_MsgIOPathOps msg;
+
+ PgBackendStatus *beentry = MyBEEntry;
+
+ if (!beentry)
+ return;
+
+ memset(&msg, 0, sizeof(msg));
+ msg.backend_type = beentry->st_backendType;
+
+ pgstat_add_io_path_ops(msg.iop.io_path_ops,
+ (IOOps *) &beentry->io_path_stats,
+ IOPATH_NUM_TYPES);
+
+ pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_IO_PATH_OPS);
+ pgstat_send(&msg, sizeof(msg));
+}
+
/* ----------
* pgstat_send_checkpointer() -
*
@@ -3522,6 +3617,10 @@ PgstatCollectorMain(int argc, char *argv[])
pgstat_recv_checkpointer(&msg.msg_checkpointer, len);
break;
+ case PGSTAT_MTYPE_IO_PATH_OPS:
+ pgstat_recv_io_path_ops(&msg.msg_io_path_ops, len);
+ break;
+
case PGSTAT_MTYPE_WAL:
pgstat_recv_wal(&msg.msg_wal, len);
break;
@@ -5221,10 +5320,30 @@ pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
{
if (msg->m_resettarget == RESET_BGWRITER)
{
- /* Reset the global, bgwriter and checkpointer statistics for the cluster. */
- memset(&globalStats, 0, sizeof(globalStats));
+ /*
+ * Reset the global, bgwriter and checkpointer statistics for the
+ * cluster.
+ */
+ memset(&globalStats.checkpointer, 0, sizeof(globalStats.checkpointer));
+ memset(&globalStats.bgwriter, 0, sizeof(globalStats.bgwriter));
globalStats.bgwriter.stat_reset_timestamp = GetCurrentTimestamp();
}
+ else if (msg->m_resettarget == RESET_BUFFERS)
+ {
+ BackendType backend_type = msg->m_backend_resets.backend_type;
+
+ /*
+ * Though globalStats.buffers only needs to be reset once, doing so
+ * for every message is less brittle and the extra cost is irrelevant
+ * given how often stats are reset.
+ */
+ memset(&globalStats.buffers.ops, 0, sizeof(globalStats.buffers.ops));
+ globalStats.buffers.stat_reset_timestamp = GetCurrentTimestamp();
+
+ memcpy(&globalStats.buffers.resets[backend_type],
+ &msg->m_backend_resets.iop.io_path_ops, sizeof(msg->m_backend_resets.iop.io_path_ops));
+ }
else if (msg->m_resettarget == RESET_ARCHIVER)
{
/* Reset the archiver statistics for the cluster. */
@@ -5512,6 +5631,26 @@ pgstat_recv_checkpointer(PgStat_MsgCheckpointer *msg, int len)
globalStats.checkpointer.buf_fsync_backend += msg->m_buf_fsync_backend;
}
+static void
+pgstat_recv_io_path_ops(PgStat_MsgIOPathOps *msg, int len)
+{
+ int io_path;
+ PgStatIOOps *src_io_path_ops = msg->iop.io_path_ops;
+ PgStatIOOps *dest_io_path_ops =
+ globalStats.buffers.ops[msg->backend_type].io_path_ops;
+
+ for (io_path = 0; io_path < IOPATH_NUM_TYPES; io_path++)
+ {
+ PgStatIOOps *src = &src_io_path_ops[io_path];
+ PgStatIOOps *dest = &dest_io_path_ops[io_path];
+
+ dest->allocs += src->allocs;
+ dest->extends += src->extends;
+ dest->fsyncs += src->fsyncs;
+ dest->writes += src->writes;
+ }
+}
+
/* ----------
* pgstat_recv_wal() -
*
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index e88e4e918b..504cf37ff9 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -972,6 +972,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (isExtend)
{
+ pgstat_inc_ioop(IOOP_EXTEND, IOPATH_SHARED);
/* new buffers are zero-filled */
MemSet((char *) bufBlock, 0, BLCKSZ);
/* don't set checksum for all-zero page */
@@ -1172,6 +1173,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
/* Loop here in case we have to try another victim buffer */
for (;;)
{
+ bool from_ring;
+
/*
* Ensure, while the spinlock's not yet held, that there's a free
* refcount entry.
@@ -1182,7 +1185,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* Select a victim buffer. The buffer is returned with its header
* spinlock still held!
*/
- buf = StrategyGetBuffer(strategy, &buf_state);
+ buf = StrategyGetBuffer(strategy, &buf_state, &from_ring);
Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
@@ -1219,6 +1222,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf),
LW_SHARED))
{
+ IOPath iopath;
+
/*
* If using a nondefault strategy, and writing the buffer
* would require a WAL flush, let the strategy decide whether
@@ -1236,7 +1241,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
UnlockBufHdr(buf, buf_state);
if (XLogNeedsFlush(lsn) &&
- StrategyRejectBuffer(strategy, buf))
+ StrategyRejectBuffer(strategy, buf, &from_ring))
{
/* Drop lock/pin and loop around for another buffer */
LWLockRelease(BufferDescriptorGetContentLock(buf));
@@ -1245,6 +1250,20 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
}
}
+ /*
+ * When a strategy is in use, if the dirty buffer was selected
+ * from the strategy ring and we did not bother checking the
+ * freelist or doing a clock sweep to look for a clean shared
+ * buffer to use, the write will be counted as a strategy
+ * write. However, if the dirty buffer was obtained from the
+ * freelist or a clock sweep, it is counted as a regular
+ * write. When a strategy is not in use, at this point, the
+ * write can only be a "regular" write of a dirty buffer.
+ */
+
+ iopath = from_ring ? IOPATH_STRATEGY : IOPATH_SHARED;
+ pgstat_inc_ioop(IOOP_WRITE, iopath);
+
/* OK, do the I/O */
TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(forkNum, blockNum,
smgr->smgr_rnode.node.spcNode,
@@ -2552,6 +2571,8 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
* Pin it, share-lock it, write it. (FlushBuffer will do nothing if the
* buffer is clean by the time we've locked it.)
*/
+
+ pgstat_inc_ioop(IOOP_WRITE, IOPATH_SHARED);
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 6be80476db..e2e1c3bf56 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -19,6 +19,7 @@
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/proc.h"
+#include "utils/backend_status.h"
#define INT_ACCESS_ONCE(var) ((int)(*((volatile int *)&(var))))
@@ -198,7 +199,7 @@ have_free_buffer(void)
* return the buffer with the buffer header spinlock still held.
*/
BufferDesc *
-StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
+StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_ring)
{
BufferDesc *buf;
int bgwprocno;
@@ -212,7 +213,8 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
if (strategy != NULL)
{
buf = GetBufferFromRing(strategy, buf_state);
- if (buf != NULL)
+ *from_ring = buf != NULL;
+ if (*from_ring)
return buf;
}
@@ -247,6 +249,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
* the rate of buffer consumption. Note that buffers recycled by a
* strategy object are intentionally not counted here.
*/
+ pgstat_inc_ioop(IOOP_ALLOC, IOPATH_SHARED);
pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);
/*
@@ -683,8 +686,14 @@ AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
* if this buffer should be written and re-used.
*/
bool
-StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
+StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool *from_ring)
{
+ /*
+ * If we decide to use the dirty buffer selected by StrategyGetBuffer(),
+ * then ensure that we count it as such in the pg_stat_buffers view.
+ */
+ *from_ring = true;
+
/* We only do this in bulkread mode */
if (strategy->btype != BAS_BULKREAD)
return false;
@@ -700,5 +709,13 @@ StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
*/
strategy->buffers[strategy->current] = InvalidBuffer;
+ /*
+ * Since we will not be writing out a dirty buffer from the ring, set
+ * from_ring to false so that the caller does not count this write as a
+ * "strategy write" and can do proper bookkeeping for pg_stat_buffers.
+ */
+ *from_ring = false;
+
return true;
}
diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index 7229598822..f326297517 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -236,6 +236,24 @@ CreateSharedBackendStatus(void)
#endif
}
+const char *
+GetIOPathDesc(IOPath io_path)
+{
+ switch (io_path)
+ {
+ case IOPATH_DIRECT:
+ return "direct";
+ case IOPATH_LOCAL:
+ return "local";
+ case IOPATH_SHARED:
+ return "shared";
+ case IOPATH_STRATEGY:
+ return "strategy";
+ }
+ return "unknown IO path";
+}
+
/*
* Initialize pgstats backend activity state, and set up our on-proc-exit
* hook. Called from InitPostgres and AuxiliaryProcessMain. For auxiliary
@@ -279,7 +297,7 @@ pgstat_beinit(void)
* pgstat_bestart() -
*
* Initialize this backend's entry in the PgBackendStatus array.
- * Called from InitPostgres.
+ * Called from InitPostgres and AuxiliaryProcessMain.
*
* Apart from auxiliary processes, MyBackendId, MyDatabaseId,
* session userid, and application_name must be set for a
@@ -293,6 +311,7 @@ pgstat_bestart(void)
{
volatile PgBackendStatus *vbeentry = MyBEEntry;
PgBackendStatus lbeentry;
+ int io_path;
#ifdef USE_SSL
PgBackendSSLStatus lsslstatus;
#endif
@@ -399,6 +418,15 @@ pgstat_bestart(void)
lbeentry.st_progress_command = PROGRESS_COMMAND_INVALID;
lbeentry.st_progress_command_target = InvalidOid;
lbeentry.st_query_id = UINT64CONST(0);
+ for (io_path = 0; io_path < IOPATH_NUM_TYPES; io_path++)
+ {
+ IOOps *io_ops = &lbeentry.io_path_stats[io_path];
+
+ pg_atomic_init_u64(&io_ops->allocs, 0);
+ pg_atomic_init_u64(&io_ops->extends, 0);
+ pg_atomic_init_u64(&io_ops->fsyncs, 0);
+ pg_atomic_init_u64(&io_ops->writes, 0);
+ }
/*
* we don't zero st_progress_param here to save cycles; nobody should
@@ -621,6 +649,34 @@ pgstat_report_activity(BackendState state, const char *cmd_str)
PGSTAT_END_WRITE_ACTIVITY(beentry);
}
+/*
+ * Iterate through BackendStatusArray and capture live backends' stats on IO
+ * Ops for all IO Paths, adding them to that backend type's member of the
+ * backend_io_path_ops structure.
+ */
+void
+pgstat_report_live_backend_io_path_ops(PgStatIOPathOps *backend_io_path_ops)
+{
+ int i;
+ PgBackendStatus *beentry = BackendStatusArray;
+
+ /*
+ * Loop through live backends and capture reset values
+ */
+ for (i = 0; i < MaxBackends + NUM_AUXPROCTYPES; i++, beentry++)
+ {
+ /* Skip dead backends; their stats were already sent at exit */
+ if (beentry->st_procpid == 0)
+ continue;
+
+ pgstat_add_io_path_ops(backend_io_path_ops[beentry->st_backendType].io_path_ops,
+ (IOOps *) beentry->io_path_stats,
+ IOPATH_NUM_TYPES);
+ }
+}
+
/* --------
* pgstat_report_query_id() -
*
@@ -1046,6 +1102,12 @@ pgstat_get_my_query_id(void)
}
+PgBackendStatus *
+pgstat_fetch_backend_statuses(void)
+{
+ return BackendStatusArray;
+}
+
/* ----------
* pgstat_fetch_stat_beentry() -
*
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 513f5aecf6..5f4b15c9e1 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1781,6 +1781,126 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
}
+Datum
+pg_stat_get_buffers(PG_FUNCTION_ARGS)
+{
+#define NROWS ((BACKEND_NUM_TYPES - 1) * IOPATH_NUM_TYPES)
+ PgStat_BackendIOPathOps *backend_io_path_ops;
+ int i;
+ int io_path,
+ backend_type;
+ Datum reset_time;
+ PgBackendStatus *beentry;
+ TupleDesc tupdesc;
+
+ Tuplestorestate *tupstore = pg_stat_make_tuplestore(fcinfo, &tupdesc);
+
+ /*
+ * When adding a new column to the pg_stat_buffers view, add a new enum
+ * value here above COLUMN_LENGTH.
+ */
+ enum
+ {
+ COLUMN_BACKEND_TYPE,
+ COLUMN_IO_PATH,
+ COLUMN_ALLOCS,
+ COLUMN_EXTENDS,
+ COLUMN_FSYNCS,
+ COLUMN_WRITES,
+ COLUMN_RESET_TIME,
+ COLUMN_LENGTH,
+ };
+
+ Datum all_values[NROWS][COLUMN_LENGTH];
+ bool all_nulls[NROWS][COLUMN_LENGTH];
+
+ memset(all_values, 0, sizeof(all_values));
+ memset(all_nulls, 0, sizeof(all_nulls));
+
+ /*
+ * Loop through all live backends and count their IO Ops for each IO Path
+ */
+ beentry = pgstat_fetch_backend_statuses();
+
+ for (i = 0; i < MaxBackends + NUM_AUXPROCTYPES; i++, beentry++)
+ {
+ IOOps *io_ops;
+
+ /* Skip dead backends; their stats were already sent at exit */
+ if (beentry->st_procpid == 0)
+ continue;
+
+ io_ops = beentry->io_path_stats;
+
+ for (io_path = 0; io_path < IOPATH_NUM_TYPES; io_path++)
+ {
+ /*
+ * Subtract 1 from backend_type to avoid having rows for B_INVALID
+ * BackendType
+ */
+ int rownum = (beentry->st_backendType - 1) * IOPATH_NUM_TYPES + io_path;
+ Datum *values = all_values[rownum];
+
+ /*
+ * COLUMN_RESET_TIME, COLUMN_BACKEND_TYPE, and COLUMN_IO_PATH will
+ * all be set when looping through exited backends array
+ */
+ values[COLUMN_ALLOCS] += pg_atomic_read_u64(&io_ops->allocs);
+ values[COLUMN_EXTENDS] += pg_atomic_read_u64(&io_ops->extends);
+ values[COLUMN_FSYNCS] += pg_atomic_read_u64(&io_ops->fsyncs);
+ values[COLUMN_WRITES] += pg_atomic_read_u64(&io_ops->writes);
+ io_ops++;
+ }
+ }
+
+ /* Add stats from all exited backends */
+ backend_io_path_ops = pgstat_fetch_exited_backend_buffers();
+
+ reset_time = TimestampTzGetDatum(backend_io_path_ops->stat_reset_timestamp);
+
+ /* 0 is not a valid BackendType */
+ for (backend_type = 1; backend_type < BACKEND_NUM_TYPES; backend_type++)
+ {
+ PgStatIOOps *io_ops = backend_io_path_ops->ops[backend_type].io_path_ops;
+ PgStatIOOps *resets = backend_io_path_ops->resets[backend_type].io_path_ops;
+
+ Datum backend_type_desc = CStringGetTextDatum(GetBackendTypeDesc(backend_type));
+
+ for (io_path = 0; io_path < IOPATH_NUM_TYPES; io_path++)
+ {
+ /*
+ * Subtract 1 from backend_type to avoid having rows for B_INVALID
+ * BackendType
+ */
+ Datum *values = all_values[(backend_type - 1) * IOPATH_NUM_TYPES + io_path];
+
+ values[COLUMN_BACKEND_TYPE] = backend_type_desc;
+ values[COLUMN_IO_PATH] = CStringGetTextDatum(GetIOPathDesc(io_path));
+ values[COLUMN_ALLOCS] = values[COLUMN_ALLOCS] + io_ops->allocs - resets->allocs;
+ values[COLUMN_EXTENDS] = values[COLUMN_EXTENDS] + io_ops->extends - resets->extends;
+ values[COLUMN_FSYNCS] = values[COLUMN_FSYNCS] + io_ops->fsyncs - resets->fsyncs;
+ values[COLUMN_WRITES] = values[COLUMN_WRITES] + io_ops->writes - resets->writes;
+ values[COLUMN_RESET_TIME] = reset_time;
+ io_ops++;
+ resets++;
+ }
+ }
+
+ for (i = 0; i < NROWS; i++)
+ {
+ Datum *values = all_values[i];
+ bool *nulls = all_nulls[i];
+
+ tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+ }
+
+ /* clean up and return the tuplestore */
+ tuplestore_donestoring(tupstore);
+
+ return (Datum) 0;
+}
+
/*
* Returns statistics of WAL activity
*/
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index d068d6532e..bbdb07b222 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5642,6 +5642,15 @@
proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
+{ oid => '8459', descr => 'statistics: counts of all IO operations done to all IO paths by each type of backend',
+ proname => 'pg_stat_get_buffers', provolatile => 's', proisstrict => 'f',
+ prorows => '52', proretset => 't',
+ proparallel => 'r', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,text,int8,int8,int8,int8,timestamptz}',
+ proargmodes => '{o,o,o,o,o,o,o}',
+ proargnames => '{backend_type,io_path,alloc,extend,fsync,write,stats_reset}',
+ prosrc => 'pg_stat_get_buffers' },
+
{ oid => '1136', descr => 'statistics: information about WAL activity',
proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
proparallel => 'r', prorettype => 'record', proargtypes => '',
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 90a3016065..6785fb3813 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -338,6 +338,8 @@ typedef enum BackendType
B_LOGGER,
} BackendType;
+#define BACKEND_NUM_TYPES (B_LOGGER + 1)
+
extern BackendType MyBackendType;
extern const char *GetBackendTypeDesc(BackendType backendType);
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index bcd3588ea2..94eee19b8e 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -72,6 +72,7 @@ typedef enum StatMsgType
PGSTAT_MTYPE_ARCHIVER,
PGSTAT_MTYPE_BGWRITER,
PGSTAT_MTYPE_CHECKPOINTER,
+ PGSTAT_MTYPE_IO_PATH_OPS,
PGSTAT_MTYPE_WAL,
PGSTAT_MTYPE_SLRU,
PGSTAT_MTYPE_FUNCSTAT,
@@ -138,6 +139,7 @@ typedef enum PgStat_Shared_Reset_Target
{
RESET_ARCHIVER,
RESET_BGWRITER,
+ RESET_BUFFERS,
RESET_WAL
} PgStat_Shared_Reset_Target;
@@ -331,6 +333,51 @@ typedef struct PgStat_MsgDropdb
} PgStat_MsgDropdb;
+/*
+ * Structure for counting all types of IO ops in the stats collector
+ */
+typedef struct PgStatIOOps
+{
+ PgStat_Counter allocs;
+ PgStat_Counter extends;
+ PgStat_Counter fsyncs;
+ PgStat_Counter writes;
+} PgStatIOOps;
+
+/*
+ * Structure for counting all IO ops on all types of buffers.
+ */
+typedef struct PgStatIOPathOps
+{
+ PgStatIOOps io_path_ops[IOPATH_NUM_TYPES];
+} PgStatIOPathOps;
+
+/*
+ * Sent by a backend to the stats collector to report all IO Ops for all IO
+ * Paths for a given type of a backend. This will happen when the backend exits
+ * or when stats are reset.
+ */
+typedef struct PgStat_MsgIOPathOps
+{
+ PgStat_MsgHdr m_hdr;
+
+ BackendType backend_type;
+ PgStatIOPathOps iop;
+} PgStat_MsgIOPathOps;
+
+/*
+ * Structure used by stats collector to keep track of all types of exited
+ * backends' IO Ops for all IO Paths as well as all stats from live backends at
+ * the time of stats reset. resets is populated using a reset message sent to
+ * the stats collector.
+ */
+typedef struct PgStat_BackendIOPathOps
+{
+ TimestampTz stat_reset_timestamp;
+ PgStatIOPathOps ops[BACKEND_NUM_TYPES];
+ PgStatIOPathOps resets[BACKEND_NUM_TYPES];
+} PgStat_BackendIOPathOps;
+
/* ----------
* PgStat_MsgResetcounter Sent by the backend to tell the collector
* to reset counters
@@ -351,6 +398,7 @@ typedef struct PgStat_MsgResetsharedcounter
{
PgStat_MsgHdr m_hdr;
PgStat_Shared_Reset_Target m_resettarget;
+ PgStat_MsgIOPathOps m_backend_resets;
} PgStat_MsgResetsharedcounter;
/* ----------
@@ -703,6 +751,7 @@ typedef union PgStat_Msg
PgStat_MsgArchiver msg_archiver;
PgStat_MsgBgWriter msg_bgwriter;
PgStat_MsgCheckpointer msg_checkpointer;
+ PgStat_MsgIOPathOps msg_io_path_ops;
PgStat_MsgWal msg_wal;
PgStat_MsgSLRU msg_slru;
PgStat_MsgFuncstat msg_funcstat;
@@ -879,6 +928,7 @@ typedef struct PgStat_GlobalStats
PgStat_CheckpointerStats checkpointer;
PgStat_BgWriterStats bgwriter;
+ PgStat_BackendIOPathOps buffers;
} PgStat_GlobalStats;
/*
@@ -1116,8 +1166,11 @@ extern void pgstat_twophase_postcommit(TransactionId xid, uint16 info,
extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
void *recdata, uint32 len);
+extern void pgstat_add_io_path_ops(PgStatIOOps *dest,
+ IOOps *src, int io_path_num_types);
extern void pgstat_send_archiver(const char *xlog, bool failed);
extern void pgstat_send_bgwriter(void);
+extern void pgstat_send_buffers(void);
extern void pgstat_send_checkpointer(void);
extern void pgstat_send_wal(bool force);
@@ -1126,6 +1179,7 @@ extern void pgstat_send_wal(bool force);
* generate the pgstat* views.
* ----------
*/
+extern PgStat_BackendIOPathOps *pgstat_fetch_exited_backend_buffers(void);
extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid);
extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 33fcaf5c9a..7e385135db 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -310,10 +310,10 @@ extern void ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *
/* freelist.c */
extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
- uint32 *buf_state);
+ uint32 *buf_state, bool *from_ring);
extern void StrategyFreeBuffer(BufferDesc *buf);
extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
- BufferDesc *buf);
+ BufferDesc *buf, bool *from_ring);
extern int StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
extern void StrategyNotifyBgWriter(int bgwprocno);
diff --git a/src/include/utils/backend_status.h b/src/include/utils/backend_status.h
index 8042b817df..419de72591 100644
--- a/src/include/utils/backend_status.h
+++ b/src/include/utils/backend_status.h
@@ -13,6 +13,7 @@
#include "datatype/timestamp.h"
#include "libpq/pqcomm.h"
#include "miscadmin.h" /* for BackendType */
+#include "port/atomics.h"
#include "utils/backend_progress.h"
@@ -31,12 +32,48 @@ typedef enum BackendState
STATE_DISABLED
} BackendState;
+/* ----------
+ * IO Stats reporting utility types
+ * ----------
+ */
+
+typedef enum IOOp
+{
+ IOOP_ALLOC,
+ IOOP_EXTEND,
+ IOOP_FSYNC,
+ IOOP_WRITE,
+} IOOp;
+
+#define IOOP_NUM_TYPES (IOOP_WRITE + 1)
+
+typedef enum IOPath
+{
+ IOPATH_DIRECT,
+ IOPATH_LOCAL,
+ IOPATH_SHARED,
+ IOPATH_STRATEGY,
+} IOPath;
+
+#define IOPATH_NUM_TYPES (IOPATH_STRATEGY + 1)
+
/* ----------
* Shared-memory data structures
* ----------
*/
+/*
+ * Structure for counting all types of IOOps for a live backend.
+ */
+typedef struct IOOps
+{
+ pg_atomic_uint64 allocs;
+ pg_atomic_uint64 extends;
+ pg_atomic_uint64 fsyncs;
+ pg_atomic_uint64 writes;
+} IOOps;
+
/*
* PgBackendSSLStatus
*
@@ -168,6 +205,16 @@ typedef struct PgBackendStatus
/* query identifier, optionally computed using post_parse_analyze_hook */
uint64 st_query_id;
+
+ /*
+ * Stats on all IO Ops for all IO Paths for this backend. When the
+ * pg_stat_buffers view is queried and when stats are reset, one backend
+ * will read io_path_stats from all live backends and combine them with
+ * io_path_stats from exited backends for each backend type. When this
+ * backend exits, it will send io_path_stats to the stats collector to be
+ * persisted.
+ */
+ IOOps io_path_stats[IOPATH_NUM_TYPES];
} PgBackendStatus;
@@ -289,6 +336,10 @@ extern void CreateSharedBackendStatus(void);
* ----------
*/
+/* Utility functions */
+extern const char *GetIOPathDesc(IOPath io_path);
+
+
/* Initialization functions */
extern void pgstat_beinit(void);
extern void pgstat_bestart(void);
@@ -296,7 +347,39 @@ extern void pgstat_bestart(void);
extern void pgstat_clear_backend_activity_snapshot(void);
/* Activity reporting functions */
+typedef struct PgStatIOPathOps PgStatIOPathOps;
+
+static inline void
+pgstat_inc_ioop(IOOp io_op, IOPath io_path)
+{
+ IOOps *io_ops;
+ PgBackendStatus *beentry = MyBEEntry;
+
+ Assert(beentry);
+
+ io_ops = &beentry->io_path_stats[io_path];
+ switch (io_op)
+ {
+ case IOOP_ALLOC:
+ pg_atomic_write_u64(&io_ops->allocs,
+ pg_atomic_read_u64(&io_ops->allocs) + 1);
+ break;
+ case IOOP_EXTEND:
+ pg_atomic_write_u64(&io_ops->extends,
+ pg_atomic_read_u64(&io_ops->extends) + 1);
+ break;
+ case IOOP_FSYNC:
+ pg_atomic_write_u64(&io_ops->fsyncs,
+ pg_atomic_read_u64(&io_ops->fsyncs) + 1);
+ break;
+ case IOOP_WRITE:
+ pg_atomic_write_u64(&io_ops->writes,
+ pg_atomic_read_u64(&io_ops->writes) + 1);
+ break;
+ }
+}
extern void pgstat_report_activity(BackendState state, const char *cmd_str);
+extern void pgstat_report_live_backend_io_path_ops(PgStatIOPathOps *backend_io_path_ops);
extern void pgstat_report_query_id(uint64 query_id, bool force);
extern void pgstat_report_tempfile(size_t filesize);
extern void pgstat_report_appname(const char *appname);
@@ -312,6 +395,7 @@ extern uint64 pgstat_get_my_query_id(void);
* generate the pgstat* views.
* ----------
*/
+extern PgBackendStatus *pgstat_fetch_backend_statuses(void);
extern int pgstat_fetch_stat_numbackends(void);
extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
extern LocalPgBackendStatus *pgstat_fetch_stat_local_beentry(int beid);
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 2fa00a3c29..5e5a0324ee 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1828,6 +1828,14 @@ pg_stat_bgwriter| SELECT pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints
pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+pg_stat_buffers| SELECT b.backend_type,
+ b.io_path,
+ b.alloc,
+ b.extend,
+ b.fsync,
+ b.write,
+ b.stats_reset
+ FROM pg_stat_get_buffers() b(backend_type, io_path, alloc, extend, fsync, write, stats_reset);
pg_stat_database| SELECT d.oid AS datid,
d.datname,
CASE
--
2.27.0
Hi,
On 2021-10-01 16:05:31 -0400, Melanie Plageman wrote:
From 40c809ad1127322f3462e85be080c10534485f0d Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Fri, 24 Sep 2021 17:39:12 -0400
Subject: [PATCH v13 1/4] Allow bootstrap process to beinit
---
src/backend/utils/init/postinit.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 78bc64671e..fba5864172 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -670,8 +670,7 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
 	EnablePortalManager();

 	/* Initialize status reporting */
-	if (!bootstrap)
-		pgstat_beinit();
+	pgstat_beinit();

 	/*
* Load relcache entries for the shared system catalogs. This must create
--
2.27.0
I think it's good to remove more and more of these !bootstrap cases - they
really make it harder to understand the state of the system at various
points. Optimizing for the rarely executed bootstrap mode at the cost of
checks in very common codepaths...
From a709ddb30b2b747beb214f0b13cd1e1816094e6b Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Thu, 30 Sep 2021 16:16:22 -0400
Subject: [PATCH v13 2/4] Add utility to make tuplestores for pg stat views

Most of the steps to make a tuplestore for those pg_stat views requiring
one are the same. Consolidate them into a single helper function for
clarity and to avoid bugs.
---
src/backend/utils/adt/pgstatfuncs.c | 129 ++++++++++------------------
1 file changed, 44 insertions(+), 85 deletions(-)

diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index ff5aedc99c..513f5aecf6 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -36,6 +36,42 @@
 #define HAS_PGSTAT_PERMISSIONS(role)	 (is_member_of_role(GetUserId(), ROLE_PG_READ_ALL_STATS) || has_privs_of_role(GetUserId(), role))

+/*
+ * Helper function for views with multiple rows constructed from a tuplestore
+ */
+static Tuplestorestate *
+pg_stat_make_tuplestore(FunctionCallInfo fcinfo, TupleDesc *tupdesc)
+{
+	Tuplestorestate *tupstore;
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	MemoryContext per_query_ctx;
+	MemoryContext oldcontext;
+
+	/* check to see if caller supports us returning a tuplestore */
+	if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("set-valued function called in context that cannot accept a set")));
+	if (!(rsinfo->allowedModes & SFRM_Materialize))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("materialize mode required, but it is not allowed in this context")));
+
+	/* Build a tuple descriptor for our result type */
+	if (get_call_result_type(fcinfo, NULL, tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
+	per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+	oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+	tupstore = tuplestore_begin_heap(true, false, work_mem);
+	rsinfo->returnMode = SFRM_Materialize;
+	rsinfo->setResult = tupstore;
+	rsinfo->setDesc = *tupdesc;
+	MemoryContextSwitchTo(oldcontext);
+	return tupstore;
+}
Is pgstattuple the best place for this helper? It's not really pgstatfuncs
specific...
It also looks vaguely familiar - I wonder if we have a helper roughly like
this somewhere else already...
From e9a5d2a021d429fdbb2daa58ab9d75a069f334d4 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 29 Sep 2021 15:39:45 -0400
Subject: [PATCH v13 3/4] Add system view tracking IO ops per backend type
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index be7366379d..0d18e7f71a 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -1104,6 +1104,7 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
 		 */
 		if (!AmBackgroundWriterProcess())
 			CheckpointerShmem->num_backend_fsync++;
+		pgstat_inc_ioop(IOOP_FSYNC, IOPATH_SHARED);
 		LWLockRelease(CheckpointerCommLock);
 		return false;
 	}
ISTM this doesn't need to happen while holding CheckpointerCommLock?
@@ -1461,7 +1467,25 @@ pgstat_reset_shared_counters(const char *target)
 				 errhint("Target must be \"archiver\", \"bgwriter\", or \"wal\".")));

 	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
-	pgstat_send(&msg, sizeof(msg));
+
+	if (msg.m_resettarget == RESET_BUFFERS)
+	{
+		int			backend_type;
+		PgStatIOPathOps ops[BACKEND_NUM_TYPES];
+
+		memset(ops, 0, sizeof(ops));
+		pgstat_report_live_backend_io_path_ops(ops);
+
+		for (backend_type = 1; backend_type < BACKEND_NUM_TYPES; backend_type++)
+		{
+			msg.m_backend_resets.backend_type = backend_type;
+			memcpy(&msg.m_backend_resets.iop, &ops[backend_type], sizeof(msg.m_backend_resets.iop));
+			pgstat_send(&msg, sizeof(msg));
+		}
+	}
+	else
+		pgstat_send(&msg, sizeof(msg));
 }
I'd perhaps put this in a small helper function.
/* ----------
* pgstat_fetch_stat_dbentry() -
@@ -2999,6 +3036,14 @@ pgstat_shutdown_hook(int code, Datum arg)
{
 	Assert(!pgstat_is_shutdown);

+	/*
+	 * Only need to send stats on IO Ops for IO Paths when a process exits, as
+	 * pg_stat_get_buffers() will read from live backends' PgBackendStatus and
+	 * then sum this with totals from exited backends persisted by the stats
+	 * collector.
+	 */
+	pgstat_send_buffers();
+
 	/*
 	 * If we got as far as discovering our own database ID, we can report what
 	 * we did to the collector. Otherwise, we'd be sending an invalid
@@ -3092,6 +3137,30 @@ pgstat_send(void *msg, int len)
 #endif
 }
I think it might be nicer to move pgstat_beshutdown_hook() to be a
before_shmem_exit(), and do this in there.
+/*
+ * Add live IO Op stats for all IO Paths (e.g. shared, local) to those in the
+ * equivalent stats structure for exited backends. Note that this adds and
+ * doesn't set, so the destination stats structure should be zeroed out by the
+ * caller initially. This would commonly be used to transfer all IO Op stats
+ * for all IO Paths for a particular backend type to the pgstats structure.
+ */
This seems a bit odd. Why not zero it in here? Perhaps it also should be
called something like _sum_ instead of _add_?
+void
+pgstat_add_io_path_ops(PgStatIOOps *dest, IOOps *src, int io_path_num_types)
+{
Why is io_path_num_types a parameter?
+static void
+pgstat_recv_io_path_ops(PgStat_MsgIOPathOps *msg, int len)
+{
+	int			io_path;
+	PgStatIOOps *src_io_path_ops = msg->iop.io_path_ops;
+	PgStatIOOps *dest_io_path_ops =
+		globalStats.buffers.ops[msg->backend_type].io_path_ops;
+
+	for (io_path = 0; io_path < IOPATH_NUM_TYPES; io_path++)
+	{
+		PgStatIOOps *src = &src_io_path_ops[io_path];
+		PgStatIOOps *dest = &dest_io_path_ops[io_path];
+
+		dest->allocs += src->allocs;
+		dest->extends += src->extends;
+		dest->fsyncs += src->fsyncs;
+		dest->writes += src->writes;
+	}
+}
Could this, with a bit of finessing, use pgstat_add_io_path_ops()?
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
What about writes originating in, say, FlushRelationBuffers()?
 bool
-StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
+StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool *from_ring)
 {
+	/*
+	 * If we decide to use the dirty buffer selected by StrategyGetBuffer(),
+	 * then ensure that we count it as such in pg_stat_buffers view.
+	 */
+	*from_ring = true;
+
Absolutely minor nitpick: Somehow it feels off to talk about the view here.
+PgBackendStatus *
+pgstat_fetch_backend_statuses(void)
+{
+	return BackendStatusArray;
+}
Hm, not sure this adds much?
+	/*
+	 * Subtract 1 from backend_type to avoid having rows for B_INVALID
+	 * BackendType
+	 */
+	int			rownum = (beentry->st_backendType - 1) * IOPATH_NUM_TYPES + io_path;
Perhaps worth wrapping this in a macro or inline function? It's repeated and nontrivial.
+	/* Add stats from all exited backends */
+	backend_io_path_ops = pgstat_fetch_exited_backend_buffers();
It's probably *not* worth it, but I do wonder if we should do the addition on the SQL
level, and actually have two functions, one returning data for exited
backends, and one for currently connected ones.
+static inline void
+pgstat_inc_ioop(IOOp io_op, IOPath io_path)
+{
+	IOOps	   *io_ops;
+	PgBackendStatus *beentry = MyBEEntry;
+
+	Assert(beentry);
+
+	io_ops = &beentry->io_path_stats[io_path];
+	switch (io_op)
+	{
+		case IOOP_ALLOC:
+			pg_atomic_write_u64(&io_ops->allocs,
+								pg_atomic_read_u64(&io_ops->allocs) + 1);
+			break;
+		case IOOP_EXTEND:
+			pg_atomic_write_u64(&io_ops->extends,
+								pg_atomic_read_u64(&io_ops->extends) + 1);
+			break;
+		case IOOP_FSYNC:
+			pg_atomic_write_u64(&io_ops->fsyncs,
+								pg_atomic_read_u64(&io_ops->fsyncs) + 1);
+			break;
+		case IOOP_WRITE:
+			pg_atomic_write_u64(&io_ops->writes,
+								pg_atomic_read_u64(&io_ops->writes) + 1);
+			break;
+	}
+}
IIRC Thomas Munro had a patch adding a nonatomic_add or such
somewhere. Perhaps in the recovery readahead thread? Might be worth using
instead?
Greetings,
Andres Freund
On Fri, Oct 8, 2021 at 1:56 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2021-10-01 16:05:31 -0400, Melanie Plageman wrote:
From 40c809ad1127322f3462e85be080c10534485f0d Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Fri, 24 Sep 2021 17:39:12 -0400
Subject: [PATCH v13 1/4] Allow bootstrap process to beinit
---
src/backend/utils/init/postinit.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 78bc64671e..fba5864172 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -670,8 +670,7 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
 	EnablePortalManager();

 	/* Initialize status reporting */
-	if (!bootstrap)
-		pgstat_beinit();
+	pgstat_beinit();

 	/*
 	 * Load relcache entries for the shared system catalogs. This must create
--
2.27.0

I think it's good to remove more and more of these !bootstrap cases - they
really make it harder to understand the state of the system at various
points. Optimizing for the rarely executed bootstrap mode at the cost of
checks in very common codepaths...
What scope do you suggest for this patch set? A single patch which does
this in more locations (remove !bootstrap) or should I remove this patch
from the patchset?
From a709ddb30b2b747beb214f0b13cd1e1816094e6b Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Thu, 30 Sep 2021 16:16:22 -0400
Subject: [PATCH v13 2/4] Add utility to make tuplestores for pg stat views

Most of the steps to make a tuplestore for those pg_stat views requiring
one are the same. Consolidate them into a single helper function for
clarity and to avoid bugs.
---
src/backend/utils/adt/pgstatfuncs.c | 129 ++++++++++------------------
1 file changed, 44 insertions(+), 85 deletions(-)

diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index ff5aedc99c..513f5aecf6 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -36,6 +36,42 @@
 #define HAS_PGSTAT_PERMISSIONS(role)	 (is_member_of_role(GetUserId(), ROLE_PG_READ_ALL_STATS) || has_privs_of_role(GetUserId(), role))

+/*
+ * Helper function for views with multiple rows constructed from a tuplestore
+ */
+static Tuplestorestate *
+pg_stat_make_tuplestore(FunctionCallInfo fcinfo, TupleDesc *tupdesc)
+{
+	Tuplestorestate *tupstore;
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	MemoryContext per_query_ctx;
+	MemoryContext oldcontext;
+
+	/* check to see if caller supports us returning a tuplestore */
+	if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("set-valued function called in context that cannot accept a set")));
+	if (!(rsinfo->allowedModes & SFRM_Materialize))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("materialize mode required, but it is not allowed in this context")));
+
+	/* Build a tuple descriptor for our result type */
+	if (get_call_result_type(fcinfo, NULL, tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
+	per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+	oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+	tupstore = tuplestore_begin_heap(true, false, work_mem);
+	rsinfo->returnMode = SFRM_Materialize;
+	rsinfo->setResult = tupstore;
+	rsinfo->setDesc = *tupdesc;
+	MemoryContextSwitchTo(oldcontext);
+	return tupstore;
+}

Is pgstattuple the best place for this helper? It's not really pgstatfuncs
specific...

It also looks vaguely familiar - I wonder if we have a helper roughly like
this somewhere else already...
I don't see a function which is specifically a utility to make a
tuplestore. Looking at the callers of tuplestore_begin_heap(), I notice
very similar code to the function I added in pg_tablespace_databases()
in utils/adt/misc.c, pg_stop_backup_v2() in xlogfuncs.c,
pg_event_trigger_dropped_objects() and pg_event_trigger_ddl_commands in
event_tigger.c, pg_available_extensions in extension.c, etc.
Do you think it makes sense to refactor this code out of all of these
places? If so, where would such a utility function belong?
From e9a5d2a021d429fdbb2daa58ab9d75a069f334d4 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 29 Sep 2021 15:39:45 -0400
Subject: [PATCH v13 3/4] Add system view tracking IO ops per backend type

diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index be7366379d..0d18e7f71a 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -1104,6 +1104,7 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
 		 */
 		if (!AmBackgroundWriterProcess())
 			CheckpointerShmem->num_backend_fsync++;
+		pgstat_inc_ioop(IOOP_FSYNC, IOPATH_SHARED);
 		LWLockRelease(CheckpointerCommLock);
 		return false;
 	}

ISTM this doesn't need to happen while holding CheckpointerCommLock?
Fixed in attached updates. I only attached the diff from my previous version.
@@ -1461,7 +1467,25 @@ pgstat_reset_shared_counters(const char *target)
 				 errhint("Target must be \"archiver\", \"bgwriter\", or \"wal\".")));

 	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
-	pgstat_send(&msg, sizeof(msg));
+
+	if (msg.m_resettarget == RESET_BUFFERS)
+	{
+		int			backend_type;
+		PgStatIOPathOps ops[BACKEND_NUM_TYPES];
+
+		memset(ops, 0, sizeof(ops));
+		pgstat_report_live_backend_io_path_ops(ops);
+
+		for (backend_type = 1; backend_type < BACKEND_NUM_TYPES; backend_type++)
+		{
+			msg.m_backend_resets.backend_type = backend_type;
+			memcpy(&msg.m_backend_resets.iop, &ops[backend_type], sizeof(msg.m_backend_resets.iop));
+			pgstat_send(&msg, sizeof(msg));
+		}
+	}
+	else
+		pgstat_send(&msg, sizeof(msg));
 }

I'd perhaps put this in a small helper function.
Done.
/* ----------
* pgstat_fetch_stat_dbentry() -
@@ -2999,6 +3036,14 @@ pgstat_shutdown_hook(int code, Datum arg)
{
 	Assert(!pgstat_is_shutdown);

+	/*
+	 * Only need to send stats on IO Ops for IO Paths when a process exits, as
+	 * pg_stat_get_buffers() will read from live backends' PgBackendStatus and
+	 * then sum this with totals from exited backends persisted by the stats
+	 * collector.
+	 */
+	pgstat_send_buffers();
+
 	/*
 	 * If we got as far as discovering our own database ID, we can report what
 	 * we did to the collector. Otherwise, we'd be sending an invalid
@@ -3092,6 +3137,30 @@ pgstat_send(void *msg, int len)
 #endif
 }

I think it might be nicer to move pgstat_beshutdown_hook() to be a
before_shmem_exit(), and do this in there.
I'm not really sure of the correct way to do this. A cursory attempt to do so
failed because ShutdownXLOG() is also registered as a
before_shmem_exit() and ends up being called after
pgstat_beshutdown_hook(). pgstat_beshutdown_hook() zeroes out
PgBackendStatus, ShutdownXLOG() initiates a checkpoint, and during a
checkpoint, the checkpointer increments IO op counter for writes in its
PgBackendStatus.
+/*
+ * Add live IO Op stats for all IO Paths (e.g. shared, local) to those in the
+ * equivalent stats structure for exited backends. Note that this adds and
+ * doesn't set, so the destination stats structure should be zeroed out by the
+ * caller initially. This would commonly be used to transfer all IO Op stats
+ * for all IO Paths for a particular backend type to the pgstats structure.
+ */

This seems a bit odd. Why not zero it in here? Perhaps it also should be
called something like _sum_ instead of _add_?
I wanted to be able to use the function both when it was setting the
values and when it needed to add to the values (which are the two
current callers). I have changed the name from add -> sum.
+void
+pgstat_add_io_path_ops(PgStatIOOps *dest, IOOps *src, int io_path_num_types)
+{

Why is io_path_num_types a parameter?
I imagined that maybe another caller would want to only add some IO path
types and still use the function, but I think it is more confusing than
anything else so I've changed it.
+static void
+pgstat_recv_io_path_ops(PgStat_MsgIOPathOps *msg, int len)
+{
+	int			io_path;
+	PgStatIOOps *src_io_path_ops = msg->iop.io_path_ops;
+	PgStatIOOps *dest_io_path_ops =
+		globalStats.buffers.ops[msg->backend_type].io_path_ops;
+
+	for (io_path = 0; io_path < IOPATH_NUM_TYPES; io_path++)
+	{
+		PgStatIOOps *src = &src_io_path_ops[io_path];
+		PgStatIOOps *dest = &dest_io_path_ops[io_path];
+
+		dest->allocs += src->allocs;
+		dest->extends += src->extends;
+		dest->fsyncs += src->fsyncs;
+		dest->writes += src->writes;
+	}
+}

Could this, with a bit of finessing, use pgstat_add_io_path_ops()?
I didn't really see a good way to do this -- given that
pgstat_add_io_path_ops() adds IOOps members to PgStatIOOps members --
which requires a pg_atomic_read_u64() and pgstat_recv_io_path_ops adds
PgStatIOOps to PgStatIOOps which doesn't require pg_atomic_read_u64().
Maybe I could pass a flag which, based on the type, either does or
doesn't use pg_atomic_read_u64 to access the value? But that seems worse
to me.
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c

What about writes originating in, say, FlushRelationBuffers()?
Yes, I have made IOPath a parameter to FlushBuffer() so that it can
distinguish between strategy buffer writes and shared buffer writes and
then pushed pgstat_inc_ioop() into FlushBuffer().
 bool
-StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
+StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool *from_ring)
 {
+	/*
+	 * If we decide to use the dirty buffer selected by StrategyGetBuffer(),
+	 * then ensure that we count it as such in pg_stat_buffers view.
+	 */
+	*from_ring = true;
+

Absolutely minor nitpick: Somehow it feels off to talk about the view here.
Fixed.
+PgBackendStatus *
+pgstat_fetch_backend_statuses(void)
+{
+	return BackendStatusArray;
+}

Hm, not sure this adds much?
Is there a better way to access the whole BackendStatusArray from within
pgstatfuncs.c?
+	/*
+	 * Subtract 1 from backend_type to avoid having rows for B_INVALID
+	 * BackendType
+	 */
+	int			rownum = (beentry->st_backendType - 1) * IOPATH_NUM_TYPES + io_path;

Perhaps worth wrapping this in a macro or inline function? It's repeated and nontrivial.
Done.
+	/* Add stats from all exited backends */
+	backend_io_path_ops = pgstat_fetch_exited_backend_buffers();

It's probably *not* worth it, but I do wonder if we should do the addition on the SQL
level, and actually have two functions, one returning data for exited
backends, and one for currently connected ones.
It would be easy enough to implement. I would defer to others on whether
or not this would be useful. My use case for pg_stat_buffers() is to see
what backends' IO during a benchmark or test workload. For that, I reset
the stats before and then query pg_stat_buffers after running the
benchmark. I don't know if I would use exited and live stats
individually. In a real workload, I could see using
pg_stat_buffers live and exited to see if the workload causing lots of
backends to do their own writes is ongoing. Though a given workload may
be composed of lots of different queries, with backends exiting
throughout.
+static inline void
+pgstat_inc_ioop(IOOp io_op, IOPath io_path)
+{
+	IOOps	   *io_ops;
+	PgBackendStatus *beentry = MyBEEntry;
+
+	Assert(beentry);
+
+	io_ops = &beentry->io_path_stats[io_path];
+	switch (io_op)
+	{
+		case IOOP_ALLOC:
+			pg_atomic_write_u64(&io_ops->allocs,
+								pg_atomic_read_u64(&io_ops->allocs) + 1);
+			break;
+		case IOOP_EXTEND:
+			pg_atomic_write_u64(&io_ops->extends,
+								pg_atomic_read_u64(&io_ops->extends) + 1);
+			break;
+		case IOOP_FSYNC:
+			pg_atomic_write_u64(&io_ops->fsyncs,
+								pg_atomic_read_u64(&io_ops->fsyncs) + 1);
+			break;
+		case IOOP_WRITE:
+			pg_atomic_write_u64(&io_ops->writes,
+								pg_atomic_read_u64(&io_ops->writes) + 1);
+			break;
+	}
+}

IIRC Thomas Munro had a patch adding a nonatomic_add or such
somewhere. Perhaps in the recovery readahead thread? Might be worth using
instead?
I've added Thomas' function in a separate commit. I looked for a better
place to add it (I was thinking somewhere in src/backend/utils/misc) but
couldn't find anywhere that made sense.
I also added a call to pgstat_inc_ioop() in ProcessSyncRequests() to capture
when the checkpointer does fsyncs.
I also added pgstat_inc_ioop() calls to callers of smgrwrite() flushing local
buffers. I don't know if that is desirable or not in this patch. They could be
removed if wrappers for smgrwrite() go in and pgstat_inc_ioop() can be called
from within those wrappers.
- Melanie
Attachments:
0002-updates.patchtext/x-patch; charset=US-ASCII; name=0002-updates.patchDownload
From 977cd26ee8b489772b99de46330c9492a1839b6d Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 11 Oct 2021 13:43:50 -0400
Subject: [PATCH 2/2] updates
---
src/backend/postmaster/checkpointer.c | 2 +-
src/backend/postmaster/pgstat.c | 104 +++++++++++---------
src/backend/storage/buffer/bufmgr.c | 25 +++--
src/backend/storage/buffer/freelist.c | 6 +-
src/backend/storage/buffer/localbuf.c | 3 +
src/backend/storage/sync/sync.c | 1 +
src/backend/utils/activity/backend_status.c | 5 +-
src/backend/utils/adt/pgstatfuncs.c | 61 ++++++------
src/include/pgstat.h | 3 +-
src/include/utils/backend_status.h | 12 +--
10 files changed, 119 insertions(+), 103 deletions(-)
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 8f2ef63ee5..dec325e40e 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -1083,12 +1083,12 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
(CheckpointerShmem->num_requests >= CheckpointerShmem->max_requests &&
!CompactCheckpointerRequestQueue()))
{
+ LWLockRelease(CheckpointerCommLock);
/*
* Count the subset of writes where backends have to do their own
* fsync
*/
pgstat_inc_ioop(IOOP_FSYNC, IOPATH_SHARED);
- LWLockRelease(CheckpointerCommLock);
return false;
}
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index fbec722a1f..e0762444af 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1435,6 +1435,28 @@ pgstat_reset_counters(void)
pgstat_send(&msg, sizeof(msg));
}
+/*
+ * Helper function to collect and send live backends' current IO operations
+ * stats counters when a stats reset is initiated so that they may be deducted
+ * from future totals.
+ */
+static void
+pgstat_send_buffers_reset(PgStat_MsgResetsharedcounter *msg)
+{
+ int backend_type;
+ PgStatIOPathOps ops[BACKEND_NUM_TYPES];
+
+ memset(ops, 0, sizeof(ops));
+ pgstat_report_live_backend_io_path_ops(ops);
+
+ for (backend_type = 1; backend_type < BACKEND_NUM_TYPES; backend_type++)
+ {
+ msg->m_backend_resets.backend_type = backend_type;
+ memcpy(&msg->m_backend_resets.iop, &ops[backend_type], sizeof(msg->m_backend_resets.iop));
+ pgstat_send(msg, sizeof(PgStat_MsgResetsharedcounter));
+ }
+}
+
/* ----------
* pgstat_reset_shared_counters() -
*
@@ -1452,12 +1474,17 @@ pgstat_reset_shared_counters(const char *target)
if (pgStatSock == PGINVALID_SOCKET)
return;
- if (strcmp(target, "archiver") == 0)
+ pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
+ if (strcmp(target, "buffers") == 0)
+ {
+ msg.m_resettarget = RESET_BUFFERS;
+ pgstat_send_buffers_reset(&msg);
+ return;
+ }
+ else if (strcmp(target, "archiver") == 0)
msg.m_resettarget = RESET_ARCHIVER;
else if (strcmp(target, "bgwriter") == 0)
msg.m_resettarget = RESET_BGWRITER;
- else if (strcmp(target, "buffers") == 0)
- msg.m_resettarget = RESET_BUFFERS;
else if (strcmp(target, "wal") == 0)
msg.m_resettarget = RESET_WAL;
else
@@ -1466,25 +1493,8 @@ pgstat_reset_shared_counters(const char *target)
errmsg("unrecognized reset target: \"%s\"", target),
errhint("Target must be \"archiver\", \"bgwriter\", or \"wal\".")));
- pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
-
- if (msg.m_resettarget == RESET_BUFFERS)
- {
- int backend_type;
- PgStatIOPathOps ops[BACKEND_NUM_TYPES];
- memset(ops, 0, sizeof(ops));
- pgstat_report_live_backend_io_path_ops(ops);
-
- for (backend_type = 1; backend_type < BACKEND_NUM_TYPES; backend_type++)
- {
- msg.m_backend_resets.backend_type = backend_type;
- memcpy(&msg.m_backend_resets.iop, &ops[backend_type], sizeof(msg.m_backend_resets.iop));
- pgstat_send(&msg, sizeof(msg));
- }
- }
- else
- pgstat_send(&msg, sizeof(msg));
+ pgstat_send(&msg, sizeof(msg));
}
@@ -3137,30 +3147,6 @@ pgstat_send(void *msg, int len)
#endif
}
-/*
- * Add live IO Op stats for all IO Paths (e.g. shared, local) to those in the
- * equivalent stats structure for exited backends. Note that this adds and
- * doesn't set, so the destination stats structure should be zeroed out by the
- * caller initially. This would commonly be used to transfer all IO Op stats
- * for all IO Paths for a particular backend type to the pgstats structure.
- */
-void
-pgstat_add_io_path_ops(PgStatIOOps *dest, IOOps *src, int io_path_num_types)
-{
- int io_path;
-
- for (io_path = 0; io_path < io_path_num_types; io_path++)
- {
- dest->allocs += pg_atomic_read_u64(&src->allocs);
- dest->extends += pg_atomic_read_u64(&src->extends);
- dest->fsyncs += pg_atomic_read_u64(&src->fsyncs);
- dest->writes += pg_atomic_read_u64(&src->writes);
- dest++;
- src++;
- }
-
-}
-
/* ----------
* pgstat_send_archiver() -
*
@@ -3234,9 +3220,8 @@ pgstat_send_buffers(void)
memset(&msg, 0, sizeof(msg));
msg.backend_type = beentry->st_backendType;
- pgstat_add_io_path_ops(msg.iop.io_path_ops,
- (IOOps *) &beentry->io_path_stats,
- IOPATH_NUM_TYPES);
+ pgstat_sum_io_path_ops(msg.iop.io_path_ops,
+ (IOOps *) &beentry->io_path_stats);
pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_IO_PATH_OPS);
pgstat_send(&msg, sizeof(msg));
@@ -3407,6 +3392,29 @@ pgstat_send_slru(void)
}
}
+/*
+ * Helper function to sum all live IO Op stats for all IO Paths (e.g. shared,
+ * local) to those in the equivalent stats structure for exited backends. Note
+ * that this adds and doesn't set, so the destination stats structure should be
+ * zeroed out by the caller initially. This would commonly be used to transfer
+ * all IO Op stats for all IO Paths for a particular backend type to the
+ * pgstats structure.
+ */
+void
+pgstat_sum_io_path_ops(PgStatIOOps *dest, IOOps *src)
+{
+ int io_path;
+ for (io_path = 0; io_path < IOPATH_NUM_TYPES; io_path++)
+ {
+ dest->allocs += pg_atomic_read_u64(&src->allocs);
+ dest->extends += pg_atomic_read_u64(&src->extends);
+ dest->fsyncs += pg_atomic_read_u64(&src->fsyncs);
+ dest->writes += pg_atomic_read_u64(&src->writes);
+ dest++;
+ src++;
+ }
+
+}
/* ----------
* PgstatCollectorMain() -
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index b911dd9ce5..537c8dcadc 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -480,7 +480,7 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
BlockNumber blockNum,
BufferAccessStrategy strategy,
bool *foundPtr);
-static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+static void FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOPath iopath);
static void FindAndDropRelFileNodeBuffers(RelFileNode rnode,
ForkNumber forkNum,
BlockNumber nForkBlock,
@@ -1262,7 +1262,6 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
*/
iopath = from_ring ? IOPATH_STRATEGY : IOPATH_SHARED;
- pgstat_inc_ioop(IOOP_WRITE, iopath);
/* OK, do the I/O */
TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(forkNum, blockNum,
@@ -1270,7 +1269,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
smgr->smgr_rnode.node.dbNode,
smgr->smgr_rnode.node.relNode);
- FlushBuffer(buf, NULL);
+ FlushBuffer(buf, NULL, iopath);
LWLockRelease(BufferDescriptorGetContentLock(buf));
ScheduleBufferTagForWriteback(&BackendWritebackContext,
@@ -2566,11 +2565,10 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
* buffer is clean by the time we've locked it.)
*/
- pgstat_inc_ioop(IOOP_WRITE, IOPATH_SHARED);
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOPATH_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
@@ -2818,9 +2816,12 @@ BufferGetTag(Buffer buffer, RelFileNode *rnode, ForkNumber *forknum,
*
* If the caller has an smgr reference for the buffer's relation, pass it
* as the second parameter. If not, pass NULL.
+ *
+ * IOPath will always be IOPATH_SHARED except when a buffer access strategy is
+ * used and the buffer being flushed is a buffer from the strategy ring.
*/
static void
-FlushBuffer(BufferDesc *buf, SMgrRelation reln)
+FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOPath iopath)
{
XLogRecPtr recptr;
ErrorContextCallback errcallback;
@@ -2912,6 +2913,8 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
bufToWrite,
false);
+ pgstat_inc_ioop(IOOP_WRITE, iopath);
+
if (track_io_timing)
{
INSTR_TIME_SET_CURRENT(io_time);
@@ -3559,6 +3562,8 @@ FlushRelationBuffers(Relation rel)
localpage,
false);
+ pgstat_inc_ioop(IOOP_WRITE, IOPATH_LOCAL);
+
buf_state &= ~(BM_DIRTY | BM_JUST_DIRTIED);
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
@@ -3594,7 +3599,7 @@ FlushRelationBuffers(Relation rel)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, RelationGetSmgr(rel));
+ FlushBuffer(bufHdr, RelationGetSmgr(rel), IOPATH_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
@@ -3690,7 +3695,7 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, srelent->srel);
+ FlushBuffer(bufHdr, srelent->srel, IOPATH_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
@@ -3746,7 +3751,7 @@ FlushDatabaseBuffers(Oid dbid)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOPATH_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
@@ -3773,7 +3778,7 @@ FlushOneBuffer(Buffer buffer)
Assert(LWLockHeldByMe(BufferDescriptorGetContentLock(bufHdr)));
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOPATH_SHARED);
}
/*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index e2e1c3bf56..c7ca8d75aa 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -689,8 +689,8 @@ bool
StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool *from_ring)
{
/*
- * If we decide to use the dirty buffer selected by StrategyGetBuffer(),
- * then ensure that we count it as such in pg_stat_buffers view.
+ * Start by assuming that we will use the dirty buffer selected by
+ * StrategyGetBuffer().
*/
*from_ring = true;
@@ -712,7 +712,7 @@ StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool *from_
/*
* Since we will not be writing out a dirty buffer from the ring, set
* from_ring to false so that the caller does not count this write as a
- * "strategy write" and can do proper bookkeeping for pg_stat_buffers.
+ * "strategy write" and can do proper bookkeeping.
*/
*from_ring = false;
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 04b3558ea3..f396a2b68d 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -20,6 +20,7 @@
#include "executor/instrument.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
+#include "utils/backend_status.h"
#include "utils/guc.h"
#include "utils/memutils.h"
#include "utils/resowner_private.h"
@@ -226,6 +227,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
localpage,
false);
+ pgstat_inc_ioop(IOOP_WRITE, IOPATH_LOCAL);
+
/* Mark not-dirty now in case we error out below */
buf_state &= ~BM_DIRTY;
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 4a2ed414b0..8e5be66998 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -396,6 +396,7 @@ ProcessSyncRequests(void)
total_elapsed += elapsed;
processed++;
+ pgstat_inc_ioop(IOOP_FSYNC, IOPATH_SHARED);
if (log_checkpoints)
elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f ms",
processed,
diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index f326297517..f853ee6c1c 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -670,9 +670,8 @@ pgstat_report_live_backend_io_path_ops(PgStatIOPathOps *backend_io_path_ops)
if (beentry->st_procpid == 0)
continue;
- pgstat_add_io_path_ops(backend_io_path_ops[beentry->st_backendType].io_path_ops,
- (IOOps *) beentry->io_path_stats,
- IOPATH_NUM_TYPES);
+ pgstat_sum_io_path_ops(backend_io_path_ops[beentry->st_backendType].io_path_ops,
+ (IOOps *) beentry->io_path_stats);
}
}
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 557b2673c0..d6ac325d63 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1751,10 +1751,40 @@ pg_stat_get_bgwriter_stat_reset_time(PG_FUNCTION_ARGS)
PG_RETURN_TIMESTAMPTZ(pgstat_fetch_stat_bgwriter()->stat_reset_timestamp);
}
+/*
+ * When adding a new column to the pg_stat_buffers view, add a new enum
+ * value here above COLUMN_LENGTH.
+ */
+enum
+{
+ COLUMN_BACKEND_TYPE,
+ COLUMN_IO_PATH,
+ COLUMN_ALLOCS,
+ COLUMN_EXTENDS,
+ COLUMN_FSYNCS,
+ COLUMN_WRITES,
+ COLUMN_RESET_TIME,
+ COLUMN_LENGTH,
+};
+
+#define NROWS ((BACKEND_NUM_TYPES - 1) * IOPATH_NUM_TYPES)
+/*
+ * Helper function to get the correct row in the pg_stat_buffers view.
+ */
+static inline Datum *
+get_pg_stat_buffers_row(Datum all_values[NROWS][COLUMN_LENGTH], BackendType backend_type, IOPath io_path)
+{
+
+ /*
+ * Subtract 1 from backend_type to avoid having rows for B_INVALID
+ * BackendType
+ */
+ return all_values[(backend_type - 1) * IOPATH_NUM_TYPES + io_path];
+}
+
Datum
pg_stat_get_buffers(PG_FUNCTION_ARGS)
{
-#define NROWS ((BACKEND_NUM_TYPES - 1) * IOPATH_NUM_TYPES)
PgStat_BackendIOPathOps *backend_io_path_ops;
int i;
int io_path,
@@ -1765,22 +1795,6 @@ pg_stat_get_buffers(PG_FUNCTION_ARGS)
Tuplestorestate *tupstore = pg_stat_make_tuplestore(fcinfo, &tupdesc);
- /*
- * When adding a new column to the pg_stat_buffers view, add a new enum
- * value here above COLUMN_LENGTH.
- */
- enum
- {
- COLUMN_BACKEND_TYPE,
- COLUMN_IO_PATH,
- COLUMN_ALLOCS,
- COLUMN_EXTENDS,
- COLUMN_FSYNCS,
- COLUMN_WRITES,
- COLUMN_RESET_TIME,
- COLUMN_LENGTH,
- };
-
Datum all_values[NROWS][COLUMN_LENGTH];
bool all_nulls[NROWS][COLUMN_LENGTH];
@@ -1805,12 +1819,7 @@ pg_stat_get_buffers(PG_FUNCTION_ARGS)
for (io_path = 0; io_path < IOPATH_NUM_TYPES; io_path++)
{
- /*
- * Subtract 1 from backend_type to avoid having rows for B_INVALID
- * BackendType
- */
- int rownum = (beentry->st_backendType - 1) * IOPATH_NUM_TYPES + io_path;
- Datum *values = all_values[rownum];
+ Datum *values = get_pg_stat_buffers_row(all_values, beentry->st_backendType, io_path);
/*
* COLUMN_RESET_TIME, COLUMN_BACKEND_TYPE, and COLUMN_IO_PATH will
@@ -1839,11 +1848,7 @@ pg_stat_get_buffers(PG_FUNCTION_ARGS)
for (io_path = 0; io_path < IOPATH_NUM_TYPES; io_path++)
{
- /*
- * Subtract 1 from backend_type to avoid having rows for B_INVALID
- * BackendType
- */
- Datum *values = all_values[(backend_type - 1) * IOPATH_NUM_TYPES + io_path];
+ Datum *values = get_pg_stat_buffers_row(all_values, backend_type, io_path);
values[COLUMN_BACKEND_TYPE] = backend_type_desc;
values[COLUMN_IO_PATH] = CStringGetTextDatum(GetIOPathDesc(io_path));
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 8ff87a3f54..2d72933e90 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1156,13 +1156,12 @@ extern void pgstat_twophase_postcommit(TransactionId xid, uint16 info,
extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
void *recdata, uint32 len);
-extern void pgstat_add_io_path_ops(PgStatIOOps *dest,
- IOOps *src, int io_path_num_types);
extern void pgstat_send_archiver(const char *xlog, bool failed);
extern void pgstat_send_bgwriter(void);
extern void pgstat_send_buffers(void);
extern void pgstat_send_checkpointer(void);
extern void pgstat_send_wal(bool force);
+extern void pgstat_sum_io_path_ops(PgStatIOOps *dest, IOOps *src);
/* ----------
* Support functions for the SQL-callable functions to
diff --git a/src/include/utils/backend_status.h b/src/include/utils/backend_status.h
index c0149ce0de..f0392a07dc 100644
--- a/src/include/utils/backend_status.h
+++ b/src/include/utils/backend_status.h
@@ -371,20 +371,16 @@ pgstat_inc_ioop(IOOp io_op, IOPath io_path)
switch (io_op)
{
case IOOP_ALLOC:
- pg_atomic_write_u64(&io_ops->allocs,
- pg_atomic_read_u64(&io_ops->allocs) + 1);
+ inc_counter(&io_ops->allocs);
break;
case IOOP_EXTEND:
- pg_atomic_write_u64(&io_ops->extends,
- pg_atomic_read_u64(&io_ops->extends) + 1);
+ inc_counter(&io_ops->extends);
break;
case IOOP_FSYNC:
- pg_atomic_write_u64(&io_ops->fsyncs,
- pg_atomic_read_u64(&io_ops->fsyncs) + 1);
+ inc_counter(&io_ops->fsyncs);
break;
case IOOP_WRITE:
- pg_atomic_write_u64(&io_ops->writes,
- pg_atomic_read_u64(&io_ops->writes) + 1);
+ inc_counter(&io_ops->writes);
break;
}
}
--
2.27.0
Attachment: 0001-Read-only-atomic-s-backend-write-function.patch
From 1f7a8a274aebf62e84cf7cbfdb95097ed55e7c14 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 11 Oct 2021 16:15:06 -0400
Subject: [PATCH 1/2] Read-only atomic's backend write function
For counters in shared memory which can be read by any backend but only
written to by one backend, an atomic is still needed to protect against
torn values; however, pg_atomic_fetch_add_u64() is overkill for
incrementing the counter. inc_counter() is a helper function which can
be used to increment these values safely but without unnecessary
overhead.
Author: Thomas Munro
---
src/include/utils/backend_status.h | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/src/include/utils/backend_status.h b/src/include/utils/backend_status.h
index 419de72591..c0149ce0de 100644
--- a/src/include/utils/backend_status.h
+++ b/src/include/utils/backend_status.h
@@ -349,6 +349,16 @@ extern void pgstat_clear_backend_activity_snapshot(void);
/* Activity reporting functions */
typedef struct PgStatIOPathOps PgStatIOPathOps;
+/*
+ * On modern systems this is really just *counter++. On some older systems
+ * there might be more to it, due to inability to read and write 64 bit values
+ * atomically.
+ */
+static inline void inc_counter(pg_atomic_uint64 *counter)
+{
+ pg_atomic_write_u64(counter, pg_atomic_read_u64(counter) + 1);
+}
+
static inline void
pgstat_inc_ioop(IOOp io_op, IOPath io_path)
{
--
2.27.0
Hi,
On 2021-10-11 16:48:01 -0400, Melanie Plageman wrote:
On Fri, Oct 8, 2021 at 1:56 PM Andres Freund <andres@anarazel.de> wrote:
On 2021-10-01 16:05:31 -0400, Melanie Plageman wrote:
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 78bc64671e..fba5864172 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -670,8 +670,7 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
 	EnablePortalManager();

 	/* Initialize status reporting */
-	if (!bootstrap)
-		pgstat_beinit();
+	pgstat_beinit();

 	/*
 	 * Load relcache entries for the shared system catalogs. This must create
--
2.27.0

I think it's good to remove more and more of these !bootstrap cases - they really make it harder to understand the state of the system at various points. Optimizing for the rarely executed bootstrap mode at the cost of checks in very common codepaths...

What scope do you suggest for this patch set? A single patch which does this in more locations (remove !bootstrap) or should I remove this patch from the patchset?

I think the scope is fine as-is.
Is pgstattuple the best place for this helper? It's not really pgstatfuncs specific...

It also looks vaguely familiar - I wonder if we have a helper roughly like this somewhere else already...

I don't see a function which is specifically a utility to make a tuplestore. Looking at the callers of tuplestore_begin_heap(), I notice very similar code to the function I added in pg_tablespace_databases() in utils/adt/misc.c, pg_stop_backup_v2() in xlogfuncs.c, pg_event_trigger_dropped_objects() and pg_event_trigger_ddl_commands in event_trigger.c, pg_available_extensions in extension.c, etc.

Do you think it makes sense to refactor this code out of all of these places?
Yes, I think it'd make sense. We have about 40 copies of this stuff, which is
fairly ridiculous.
If so, where would such a utility function belong?
Not quite sure. src/backend/utils/fmgr/funcapi.c maybe? I suggest starting a
separate thread about that...
@@ -2999,6 +3036,14 @@ pgstat_shutdown_hook(int code, Datum arg)
 {
 	Assert(!pgstat_is_shutdown);

+	/*
+	 * Only need to send stats on IO Ops for IO Paths when a process exits, as
+	 * pg_stat_get_buffers() will read from live backends' PgBackendStatus and
+	 * then sum this with totals from exited backends persisted by the stats
+	 * collector.
+	 */
+	pgstat_send_buffers();
+
 	/*
 	 * If we got as far as discovering our own database ID, we can report what
 	 * we did to the collector.  Otherwise, we'd be sending an invalid
@@ -3092,6 +3137,30 @@ pgstat_send(void *msg, int len)
 #endif
 }

I think it might be nicer to move pgstat_beshutdown_hook() to be a before_shmem_exit(), and do this in there.

Not really sure the correct way to do this. A cursory attempt to do so
failed because ShutdownXLOG() is also registered as a
before_shmem_exit() and ends up being called after
pgstat_beshutdown_hook(). pgstat_beshutdown_hook() zeroes out
PgBackendStatus, ShutdownXLOG() initiates a checkpoint, and during a
checkpoint, the checkpointer increments IO op counter for writes in its
PgBackendStatus.
I think we'll really need to do a proper redesign of the shutdown callback
mechanism :(.
+static void
+pgstat_recv_io_path_ops(PgStat_MsgIOPathOps *msg, int len)
+{
+	int			io_path;
+	PgStatIOOps *src_io_path_ops = msg->iop.io_path_ops;
+	PgStatIOOps *dest_io_path_ops =
+	globalStats.buffers.ops[msg->backend_type].io_path_ops;
+
+	for (io_path = 0; io_path < IOPATH_NUM_TYPES; io_path++)
+	{
+		PgStatIOOps *src = &src_io_path_ops[io_path];
+		PgStatIOOps *dest = &dest_io_path_ops[io_path];
+
+		dest->allocs += src->allocs;
+		dest->extends += src->extends;
+		dest->fsyncs += src->fsyncs;
+		dest->writes += src->writes;
+	}
+}

Could this, with a bit of finessing, use pgstat_add_io_path_ops()?
I didn't really see a good way to do this -- given that
pgstat_add_io_path_ops() adds IOOps members to PgStatIOOps members --
which requires a pg_atomic_read_u64() and pgstat_recv_io_path_ops adds
PgStatIOOps to PgStatIOOps which doesn't require pg_atomic_read_u64().
Maybe I could pass a flag which, based on the type, either does or
doesn't use pg_atomic_read_u64 to access the value? But that seems worse
to me.
Yea, you're probably right, that's worse.
+PgBackendStatus *
+pgstat_fetch_backend_statuses(void)
+{
+	return BackendStatusArray;
+}

Hm, not sure this adds much?
Is there a better way to access the whole BackendStatusArray from within
pgstatfuncs.c?
Export the variable itself?
IIRC Thomas Munro had a patch adding a nonatomic_add or such
somewhere. Perhaps in the recovery readahead thread? Might be worth using
instead?

I've added Thomas' function in a separate commit. I looked for a better
place to add it (I was thinking somewhere in src/backend/utils/misc) but
couldn't find anywhere that made sense.
I think it should just live in atomics.h?
I also added pgstat_inc_ioop() calls to callers of smgrwrite() flushing local
buffers. I don't know if that is desirable or not in this patch. They could be
removed if wrappers for smgrwrite() go in and pgstat_inc_ioop() can be called
from within those wrappers.
Makes sense to me to have it here.
Greetings,
Andres Freund
v14 attached.
On Tue, Oct 19, 2021 at 3:29 PM Andres Freund <andres@anarazel.de> wrote:
Is pgstattuple the best place for this helper? It's not really pgstatfuncs specific...

It also looks vaguely familiar - I wonder if we have a helper roughly like this somewhere else already...

I don't see a function which is specifically a utility to make a tuplestore. Looking at the callers of tuplestore_begin_heap(), I notice very similar code to the function I added in pg_tablespace_databases() in utils/adt/misc.c, pg_stop_backup_v2() in xlogfuncs.c, pg_event_trigger_dropped_objects() and pg_event_trigger_ddl_commands in event_trigger.c, pg_available_extensions in extension.c, etc.

Do you think it makes sense to refactor this code out of all of these places?

Yes, I think it'd make sense. We have about 40 copies of this stuff, which is fairly ridiculous.

If so, where would such a utility function belong?

Not quite sure. src/backend/utils/fmgr/funcapi.c maybe? I suggest starting a separate thread about that...

done [1]. also, I dropped that commit from this patchset.
@@ -2999,6 +3036,14 @@ pgstat_shutdown_hook(int code, Datum arg)
 {
 	Assert(!pgstat_is_shutdown);

+	/*
+	 * Only need to send stats on IO Ops for IO Paths when a process exits, as
+	 * pg_stat_get_buffers() will read from live backends' PgBackendStatus and
+	 * then sum this with totals from exited backends persisted by the stats
+	 * collector.
+	 */
+	pgstat_send_buffers();
+
 	/*
 	 * If we got as far as discovering our own database ID, we can report what
 	 * we did to the collector.  Otherwise, we'd be sending an invalid
@@ -3092,6 +3137,30 @@ pgstat_send(void *msg, int len)
 #endif
 }

I think it might be nicer to move pgstat_beshutdown_hook() to be a before_shmem_exit(), and do this in there.

Not really sure the correct way to do this. A cursory attempt to do so failed because ShutdownXLOG() is also registered as a before_shmem_exit() and ends up being called after pgstat_beshutdown_hook(). pgstat_beshutdown_hook() zeroes out PgBackendStatus, ShutdownXLOG() initiates a checkpoint, and during a checkpoint, the checkpointer increments IO op counter for writes in its PgBackendStatus.

I think we'll really need to do a proper redesign of the shutdown callback mechanism :(.
I've left what I originally had, then.
+PgBackendStatus *
+pgstat_fetch_backend_statuses(void)
+{
+	return BackendStatusArray;
+}

Hm, not sure this adds much?

Is there a better way to access the whole BackendStatusArray from within pgstatfuncs.c?

Export the variable itself?
done but wasn't sure about PGDLLIMPORT
IIRC Thomas Munro had a patch adding a nonatomic_add or such
somewhere. Perhaps in the recovery readahead thread? Might be worth using
instead?

I've added Thomas' function in a separate commit. I looked for a better place to add it (I was thinking somewhere in src/backend/utils/misc) but couldn't find anywhere that made sense.

I think it should just live in atomics.h?
done
-- melanie
[1]: /messages/by-id/CAAKRu_azyd1Z3W_r7Ou4sorTjRCs+PxeHw1CWJeXKofkE6TuZg@mail.gmail.com
Attachments:
v14-0002-Read-only-atomic-backend-write-function.patch
From 41242606a03aace906e307a38fc67c5cefcaec20 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 11 Oct 2021 16:15:06 -0400
Subject: [PATCH v14 2/4] Read-only atomic backend write function
For counters in shared memory which can be read by any backend but only
written to by one backend, an atomic is still needed to protect against
torn values; however, pg_atomic_fetch_add_u64() is overkill for
incrementing the counter. inc_counter() is a helper function which can
be used to increment these values safely but without unnecessary
overhead.
Author: Thomas Munro
---
src/include/port/atomics.h | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/src/include/port/atomics.h b/src/include/port/atomics.h
index 856338f161..09a2575d6a 100644
--- a/src/include/port/atomics.h
+++ b/src/include/port/atomics.h
@@ -519,6 +519,16 @@ pg_atomic_sub_fetch_u64(volatile pg_atomic_uint64 *ptr, int64 sub_)
return pg_atomic_sub_fetch_u64_impl(ptr, sub_);
}
+/*
+ * On modern systems this is really just *counter++. On some older systems
+ * there might be more to it, due to inability to read and write 64 bit values
+ * atomically.
+ */
+static inline void inc_counter(pg_atomic_uint64 *counter)
+{
+ pg_atomic_write_u64(counter, pg_atomic_read_u64(counter) + 1);
+}
+
#undef INSIDE_ATOMICS_H
#endif /* ATOMICS_H */
--
2.30.2
v14-0004-Remove-superfluous-bgwriter-stats.patch
From cfd8941958877a9bf3f5946e18aa098b1351d36f Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 29 Sep 2021 15:44:51 -0400
Subject: [PATCH v14 4/4] Remove superfluous bgwriter stats
Remove stats from pg_stat_bgwriter which are now more clearly expressed
in pg_stat_buffers.
TODO:
- make pg_stat_checkpointer view and move relevant stats into it
- add additional stats to pg_stat_bgwriter
---
doc/src/sgml/monitoring.sgml | 47 ---------------------------
src/backend/catalog/system_views.sql | 6 +---
src/backend/postmaster/checkpointer.c | 26 ---------------
src/backend/postmaster/pgstat.c | 5 ---
src/backend/storage/buffer/bufmgr.c | 6 ----
src/backend/utils/adt/pgstatfuncs.c | 30 -----------------
src/include/catalog/pg_proc.dat | 22 -------------
src/include/pgstat.h | 10 ------
src/test/regress/expected/rules.out | 5 ---
9 files changed, 1 insertion(+), 156 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 77b044343d..1aa94c5b19 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3400,24 +3400,6 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
</para></entry>
</row>
- <row>
- <entry role="catalog_table_entry"><para role="column_definition">
- <structfield>buffers_checkpoint</structfield> <type>bigint</type>
- </para>
- <para>
- Number of buffers written during checkpoints
- </para></entry>
- </row>
-
- <row>
- <entry role="catalog_table_entry"><para role="column_definition">
- <structfield>buffers_clean</structfield> <type>bigint</type>
- </para>
- <para>
- Number of buffers written by the background writer
- </para></entry>
- </row>
-
<row>
<entry role="catalog_table_entry"><para role="column_definition">
<structfield>maxwritten_clean</structfield> <type>bigint</type>
@@ -3428,35 +3410,6 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
</para></entry>
</row>
- <row>
- <entry role="catalog_table_entry"><para role="column_definition">
- <structfield>buffers_backend</structfield> <type>bigint</type>
- </para>
- <para>
- Number of buffers written directly by a backend
- </para></entry>
- </row>
-
- <row>
- <entry role="catalog_table_entry"><para role="column_definition">
- <structfield>buffers_backend_fsync</structfield> <type>bigint</type>
- </para>
- <para>
- Number of times a backend had to execute its own
- <function>fsync</function> call (normally the background writer handles those
- even when the backend does its own write)
- </para></entry>
- </row>
-
- <row>
- <entry role="catalog_table_entry"><para role="column_definition">
- <structfield>buffers_alloc</structfield> <type>bigint</type>
- </para>
- <para>
- Number of buffers allocated
- </para></entry>
- </row>
-
<row>
<entry role="catalog_table_entry"><para role="column_definition">
<structfield>stats_reset</structfield> <type>timestamp with time zone</type>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 86ca35121b..1e93937198 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1062,18 +1062,14 @@ CREATE VIEW pg_stat_archiver AS
s.stats_reset
FROM pg_stat_get_archiver() s;
+-- TODO: make separate pg_stat_checkpointer view
CREATE VIEW pg_stat_bgwriter AS
SELECT
pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints_timed,
pg_stat_get_bgwriter_requested_checkpoints() AS checkpoints_req,
pg_stat_get_checkpoint_write_time() AS checkpoint_write_time,
pg_stat_get_checkpoint_sync_time() AS checkpoint_sync_time,
- pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint,
- pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean,
pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
- pg_stat_get_buf_written_backend() AS buffers_backend,
- pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
- pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
CREATE VIEW pg_stat_buffers AS
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 1306b5238d..dec325e40e 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -90,17 +90,9 @@
* requesting backends since the last checkpoint start. The flags are
* chosen so that OR'ing is the correct way to combine multiple requests.
*
- * num_backend_writes is used to count the number of buffer writes performed
- * by user backend processes. This counter should be wide enough that it
- * can't overflow during a single processing cycle. num_backend_fsync
- * counts the subset of those writes that also had to do their own fsync,
- * because the checkpointer failed to absorb their request.
- *
* The requests array holds fsync requests sent by backends and not yet
* absorbed by the checkpointer.
*
- * Unlike the checkpoint fields, num_backend_writes, num_backend_fsync, and
- * the requests fields are protected by CheckpointerCommLock.
*----------
*/
typedef struct
@@ -124,9 +116,6 @@ typedef struct
ConditionVariable start_cv; /* signaled when ckpt_started advances */
ConditionVariable done_cv; /* signaled when ckpt_done advances */
- uint32 num_backend_writes; /* counts user backend buffer writes */
- uint32 num_backend_fsync; /* counts user backend fsync calls */
-
int num_requests; /* current # of requests */
int max_requests; /* allocated array size */
CheckpointerRequest requests[FLEXIBLE_ARRAY_MEMBER];
@@ -1085,10 +1074,6 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
- /* Count all backend writes regardless of if they fit in the queue */
- if (!AmBackgroundWriterProcess())
- CheckpointerShmem->num_backend_writes++;
-
/*
* If the checkpointer isn't running or the request queue is full, the
* backend will have to perform its own fsync request. But before forcing
@@ -1103,8 +1088,6 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
* Count the subset of writes where backends have to do their own
* fsync
*/
- if (!AmBackgroundWriterProcess())
- CheckpointerShmem->num_backend_fsync++;
pgstat_inc_ioop(IOOP_FSYNC, IOPATH_SHARED);
return false;
}
@@ -1261,15 +1244,6 @@ AbsorbSyncRequests(void)
LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
- /* Transfer stats counts into pending pgstats message */
- PendingCheckpointerStats.m_buf_written_backend
- += CheckpointerShmem->num_backend_writes;
- PendingCheckpointerStats.m_buf_fsync_backend
- += CheckpointerShmem->num_backend_fsync;
-
- CheckpointerShmem->num_backend_writes = 0;
- CheckpointerShmem->num_backend_fsync = 0;
-
/*
* We try to avoid holding the lock for a long time by copying the request
* array, and processing the requests after releasing the lock.
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 355690d944..e0762444af 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -5616,9 +5616,7 @@ pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
static void
pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
{
- globalStats.bgwriter.buf_written_clean += msg->m_buf_written_clean;
globalStats.bgwriter.maxwritten_clean += msg->m_maxwritten_clean;
- globalStats.bgwriter.buf_alloc += msg->m_buf_alloc;
}
/* ----------
@@ -5634,9 +5632,6 @@ pgstat_recv_checkpointer(PgStat_MsgCheckpointer *msg, int len)
globalStats.checkpointer.requested_checkpoints += msg->m_requested_checkpoints;
globalStats.checkpointer.checkpoint_write_time += msg->m_checkpoint_write_time;
globalStats.checkpointer.checkpoint_sync_time += msg->m_checkpoint_sync_time;
- globalStats.checkpointer.buf_written_checkpoints += msg->m_buf_written_checkpoints;
- globalStats.checkpointer.buf_written_backend += msg->m_buf_written_backend;
- globalStats.checkpointer.buf_fsync_backend += msg->m_buf_fsync_backend;
}
static void
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 6926fc5742..67447f997a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2164,7 +2164,6 @@ BufferSync(int flags)
if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
{
TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
- PendingCheckpointerStats.m_buf_written_checkpoints++;
num_written++;
}
}
@@ -2273,9 +2272,6 @@ BgBufferSync(WritebackContext *wb_context)
*/
strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
- /* Report buffer alloc counts to pgstat */
- PendingBgWriterStats.m_buf_alloc += recent_alloc;
-
/*
* If we're not running the LRU scan, just stop after doing the stats
* stuff. We mark the saved state invalid so that we can recover sanely
@@ -2472,8 +2468,6 @@ BgBufferSync(WritebackContext *wb_context)
reusable_buffers++;
}
- PendingBgWriterStats.m_buf_written_clean += num_written;
-
#ifdef BGW_DEBUG
elog(DEBUG1, "bgwriter: recent_alloc=%u smoothed=%.2f delta=%ld ahead=%d density=%.2f reusable_est=%d upcoming_est=%d scanned=%d wrote=%d reusable=%d",
recent_alloc, smoothed_alloc, strategy_delta, bufs_ahead,
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 734079e233..641a891e34 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1738,18 +1738,6 @@ pg_stat_get_bgwriter_requested_checkpoints(PG_FUNCTION_ARGS)
PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->requested_checkpoints);
}
-Datum
-pg_stat_get_bgwriter_buf_written_checkpoints(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_written_checkpoints);
-}
-
-Datum
-pg_stat_get_bgwriter_buf_written_clean(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_written_clean);
-}
-
Datum
pg_stat_get_bgwriter_maxwritten_clean(PG_FUNCTION_ARGS)
{
@@ -1809,24 +1797,6 @@ get_pg_stat_buffers_row(Datum all_values[NROWS][COLUMN_LENGTH], BackendType back
return all_values[(backend_type - 1) * IOPATH_NUM_TYPES + io_path];
}
-Datum
-pg_stat_get_buf_written_backend(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_written_backend);
-}
-
-Datum
-pg_stat_get_buf_fsync_backend(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_fsync_backend);
-}
-
-Datum
-pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
-}
-
Datum
pg_stat_get_buffers(PG_FUNCTION_ARGS)
{
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index bbdb07b222..a3cdbe1dbc 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5600,16 +5600,6 @@
proname => 'pg_stat_get_bgwriter_requested_checkpoints', provolatile => 's',
proparallel => 'r', prorettype => 'int8', proargtypes => '',
prosrc => 'pg_stat_get_bgwriter_requested_checkpoints' },
-{ oid => '2771',
- descr => 'statistics: number of buffers written by the bgwriter during checkpoints',
- proname => 'pg_stat_get_bgwriter_buf_written_checkpoints', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_bgwriter_buf_written_checkpoints' },
-{ oid => '2772',
- descr => 'statistics: number of buffers written by the bgwriter for cleaning dirty buffers',
- proname => 'pg_stat_get_bgwriter_buf_written_clean', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_bgwriter_buf_written_clean' },
{ oid => '2773',
descr => 'statistics: number of times the bgwriter stopped processing when it had written too many buffers while cleaning',
proname => 'pg_stat_get_bgwriter_maxwritten_clean', provolatile => 's',
@@ -5629,18 +5619,6 @@
proname => 'pg_stat_get_checkpoint_sync_time', provolatile => 's',
proparallel => 'r', prorettype => 'float8', proargtypes => '',
prosrc => 'pg_stat_get_checkpoint_sync_time' },
-{ oid => '2775', descr => 'statistics: number of buffers written by backends',
- proname => 'pg_stat_get_buf_written_backend', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_buf_written_backend' },
-{ oid => '3063',
- descr => 'statistics: number of backend buffer writes that did their own fsync',
- proname => 'pg_stat_get_buf_fsync_backend', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_buf_fsync_backend' },
-{ oid => '2859', descr => 'statistics: number of buffer allocations',
- proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
- prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
{ oid => '8459', descr => 'statistics: counts of all IO operations done to all IO paths by each type of backend.',
proname => 'pg_stat_get_buffers', provolatile => 's', proisstrict => 'f',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index ac8aca2c61..2d72933e90 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -505,9 +505,7 @@ typedef struct PgStat_MsgBgWriter
{
PgStat_MsgHdr m_hdr;
- PgStat_Counter m_buf_written_clean;
PgStat_Counter m_maxwritten_clean;
- PgStat_Counter m_buf_alloc;
} PgStat_MsgBgWriter;
/* ----------
@@ -520,9 +518,6 @@ typedef struct PgStat_MsgCheckpointer
PgStat_Counter m_timed_checkpoints;
PgStat_Counter m_requested_checkpoints;
- PgStat_Counter m_buf_written_checkpoints;
- PgStat_Counter m_buf_written_backend;
- PgStat_Counter m_buf_fsync_backend;
PgStat_Counter m_checkpoint_write_time; /* times in milliseconds */
PgStat_Counter m_checkpoint_sync_time;
} PgStat_MsgCheckpointer;
@@ -898,9 +893,7 @@ typedef struct PgStat_ArchiverStats
*/
typedef struct PgStat_BgWriterStats
{
- PgStat_Counter buf_written_clean;
PgStat_Counter maxwritten_clean;
- PgStat_Counter buf_alloc;
TimestampTz stat_reset_timestamp;
} PgStat_BgWriterStats;
@@ -914,9 +907,6 @@ typedef struct PgStat_CheckpointerStats
PgStat_Counter requested_checkpoints;
PgStat_Counter checkpoint_write_time; /* times in milliseconds */
PgStat_Counter checkpoint_sync_time;
- PgStat_Counter buf_written_checkpoints;
- PgStat_Counter buf_written_backend;
- PgStat_Counter buf_fsync_backend;
} PgStat_CheckpointerStats;
/*
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 5e5a0324ee..090a65cdb0 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1821,12 +1821,7 @@ pg_stat_bgwriter| SELECT pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints
pg_stat_get_bgwriter_requested_checkpoints() AS checkpoints_req,
pg_stat_get_checkpoint_write_time() AS checkpoint_write_time,
pg_stat_get_checkpoint_sync_time() AS checkpoint_sync_time,
- pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint,
- pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean,
pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
- pg_stat_get_buf_written_backend() AS buffers_backend,
- pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
- pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
pg_stat_buffers| SELECT b.backend_type,
b.io_path,
--
2.30.2
Attachment: v14-0001-Allow-bootstrap-process-to-beinit.patch (text/x-patch)
From 45b4d9bd16c37965eb73b74073121c985b230abf Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Fri, 24 Sep 2021 17:39:12 -0400
Subject: [PATCH v14 1/4] Allow bootstrap process to beinit
---
src/backend/utils/init/postinit.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 78bc64671e..fba5864172 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -670,8 +670,7 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
EnablePortalManager();
/* Initialize status reporting */
- if (!bootstrap)
- pgstat_beinit();
+ pgstat_beinit();
/*
* Load relcache entries for the shared system catalogs. This must create
--
2.30.2
Attachment: v14-0003-Add-system-view-tracking-IO-ops-per-backend-type.patch (text/x-patch)
From 2895819b15189495dba26fa5b3b91b8fb07f35ad Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 29 Sep 2021 15:39:45 -0400
Subject: [PATCH v14 3/4] Add system view tracking IO ops per backend type
Add pg_stat_buffers, a system view which tracks the number of IO
operations (allocs, writes, fsyncs, and extends) done through each IO
path (e.g. shared buffers, local buffers, unbuffered IO) by each type of
backend.
Some of these should always be zero. For example, checkpointer does not
use a BufferAccessStrategy (currently), so the "strategy" IO path for
checkpointer will be 0 for all IO operations (alloc, write, fsync, and
extend).
All backends increment a counter in their PgBackendStatus when
performing an IO operation. On exit, backends send these stats to the
stats collector to be persisted.
When stats are reset, the backend sending the reset message will loop
through and collect all of the live backends' IO op stats, sending a
reset message for each backend type containing these stats. When
receiving this message, the stats collector will 1) save these reset
values in an array of "resets" and 2) zero out the exited backends'
saved IO op counters. This is required for accurate stats after a reset
without writing to other backends' PgBackendStatuses.
When the pg_stat_buffers view is queried, one backend will sum live
backends' stats with saved stats from exited backends and subtract saved
reset stats, returning the total.
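As a side note for reviewers, the live-plus-exited-minus-resets arithmetic described above can be sketched in miniature. Everything here (struct and function names, array sizes) is illustrative stand-in code, not the patch's actual types:

```c
#include <assert.h>
#include <string.h>

#define N_BACKEND_TYPES 2
#define N_IO_PATHS 2

/* Hypothetical miniature of the view's bookkeeping: live counters read from
 * backends' status slots, counters persisted from exited backends, and the
 * values saved at the last reset. */
typedef struct
{
	long		live[N_BACKEND_TYPES][N_IO_PATHS];
	long		exited[N_BACKEND_TYPES][N_IO_PATHS];
	long		resets[N_BACKEND_TYPES][N_IO_PATHS];
} MiniStats;

/* A displayed cell is live + exited - resets. */
static long
view_cell(const MiniStats *s, int bt, int path)
{
	return s->live[bt][path] + s->exited[bt][path] - s->resets[bt][path];
}

/* Resetting saves the current live totals (so they are deducted from future
 * reads) and zeroes the exited-backend counters. */
static void
do_reset(MiniStats *s)
{
	memcpy(s->resets, s->live, sizeof(s->resets));
	memset(s->exited, 0, sizeof(s->exited));
}
```

The point of the saved "resets" array is that no other backend's PgBackendStatus needs to be written to at reset time.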
Each row of the view is stats for a particular backend type for a
particular IO path (e.g. shared buffer accesses by checkpointer) and
each column in the view is the total number of IO operations done (e.g.
writes).
So a cell in the view would be, for example, the number of shared
buffers written by checkpointer since the last stats reset.
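The one-row-per-(backend_type, io_path) layout corresponds to the flat indexing used by get_pg_stat_buffers_row() in the patch, where backend types start at 1 (the invalid type 0 is skipped). A small sketch of that index computation, with an illustrative constant:

```c
#include <assert.h>

/* Number of IO paths per backend type; illustrative value. */
#define IOPATH_NUM_TYPES 4

/* Row index for a given backend type and IO path, matching
 * (backend_type - 1) * IOPATH_NUM_TYPES + io_path in the patch. */
static int
buffers_row_index(int backend_type, int io_path)
{
	return (backend_type - 1) * IOPATH_NUM_TYPES + io_path;
}
```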
Note that this commit does not add code to increment IO ops for all IO
paths. It includes all possible combinations in the stats view but
doesn't populate all of them.
A separate proposed patch [1] which would add wrappers for smgrwrite()
and extend() would provide a good location to call pgstat_inc_ioop() for
unbuffered IO and avoid regressions for future users of these functions.
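For context, pgstat_inc_ioop() amounts to bumping an atomic counter in the calling backend's own status slot, indexed by IO path and IO op. The sketch below uses C11 atomics as a stand-in for pg_atomic_uint64, and the enum values and struct are illustrative, not the patch's actual PgBackendStatus layout:

```c
#include <assert.h>
#include <stdatomic.h>

enum { IOOP_ALLOC, IOOP_EXTEND, IOOP_FSYNC, IOOP_WRITE, IOOP_NUM_TYPES };
enum { IOPATH_LOCAL, IOPATH_SHARED, IOPATH_STRATEGY, IOPATH_NUM_TYPES };

/* Stand-in for the per-backend status entry holding IO op counters. */
typedef struct
{
	atomic_ulong counters[IOPATH_NUM_TYPES][IOOP_NUM_TYPES];
} MiniBackendStatus;

static MiniBackendStatus my_status;

/* Each backend only ever writes its own slot; readers (the view, the
 * exit-time flush) just need eventually-consistent totals, so a relaxed
 * atomic increment suffices. */
static void
mini_inc_ioop(int io_op, int io_path)
{
	atomic_fetch_add_explicit(&my_status.counters[io_path][io_op], 1,
							  memory_order_relaxed);
}
```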
TODO:
- catalog bump
[1] https://www.postgresql.org/message-id/CAAKRu_aw72w70X1P%3Dba20K8iGUvSkyz7Yk03wPPh3f9WgmcJ3g%40mail.gmail.com
Discussion: https://www.postgresql.org/message-id/flat/20210415235954.qcypb4urtovzkat5%40alap3.anarazel.de#724d5cce4bcb587f9167b80a5824bc5c
---
doc/src/sgml/monitoring.sgml | 116 +++++++++++++-
src/backend/catalog/system_views.sql | 11 ++
src/backend/postmaster/checkpointer.c | 3 +-
src/backend/postmaster/pgstat.c | 161 +++++++++++++++++++-
src/backend/storage/buffer/bufmgr.c | 46 ++++--
src/backend/storage/buffer/freelist.c | 23 ++-
src/backend/storage/buffer/localbuf.c | 3 +
src/backend/storage/sync/sync.c | 1 +
src/backend/utils/activity/backend_status.c | 60 +++++++-
src/backend/utils/adt/pgstatfuncs.c | 152 ++++++++++++++++++
src/include/catalog/pg_proc.dat | 9 ++
src/include/miscadmin.h | 2 +
src/include/pgstat.h | 53 +++++++
src/include/storage/buf_internals.h | 4 +-
src/include/utils/backend_status.h | 80 ++++++++++
src/test/regress/expected/rules.out | 8 +
16 files changed, 701 insertions(+), 31 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 3173ec2566..77b044343d 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -435,6 +435,15 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</entry>
</row>
+ <row>
+ <entry><structname>pg_stat_buffers</structname><indexterm><primary>pg_stat_buffers</primary></indexterm></entry>
+ <entry>A row for each IO path for each backend type showing
+ statistics about backend IO operations. See
+ <link linkend="monitoring-pg-stat-buffers-view">
+ <structname>pg_stat_buffers</structname></link> for details.
+ </entry>
+ </row>
+
<row>
<entry><structname>pg_stat_wal</structname><indexterm><primary>pg_stat_wal</primary></indexterm></entry>
<entry>One row only, showing statistics about WAL activity. See
@@ -3462,6 +3471,101 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
</sect2>
+ <sect2 id="monitoring-pg-stat-buffers-view">
+ <title><structname>pg_stat_buffers</structname></title>
+
+ <indexterm>
+ <primary>pg_stat_buffers</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_stat_buffers</structname> view has a row for each backend
+ type for each possible IO path, containing global data for the cluster for
+ that backend and IO path.
+ </para>
+
+ <table id="pg-stat-buffers-view" xreflabel="pg_stat_buffers">
+ <title><structname>pg_stat_buffers</structname> View</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>backend_type</structfield> <type>text</type>
+ </para>
+ <para>
+ Type of backend (e.g. background worker, autovacuum worker).
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>io_path</structfield> <type>text</type>
+ </para>
+ <para>
+ IO path taken (e.g. shared buffers, direct).
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>alloc</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers allocated.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>extend</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers extended.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>fsync</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers fsynced.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>write</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers written.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
+ </para>
+ <para>
+ Time at which these statistics were last reset.
+ </para></entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ </sect2>
+
<sect2 id="monitoring-pg-stat-wal-view">
<title><structname>pg_stat_wal</structname></title>
@@ -5058,12 +5162,14 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
</para>
<para>
Resets some cluster-wide statistics counters to zero, depending on the
- argument. The argument can be <literal>bgwriter</literal> to reset
- all the counters shown in
- the <structname>pg_stat_bgwriter</structname>
+ argument. The argument can be <literal>bgwriter</literal> to reset all
+ the counters shown in the <structname>pg_stat_bgwriter</structname>
view, <literal>archiver</literal> to reset all the counters shown in
- the <structname>pg_stat_archiver</structname> view or <literal>wal</literal>
- to reset all the counters shown in the <structname>pg_stat_wal</structname> view.
+ the <structname>pg_stat_archiver</structname> view,
+ <literal>wal</literal> to reset all the counters shown in the
+ <structname>pg_stat_wal</structname> view, or
+ <literal>buffers</literal> to reset all the counters shown in the
+ <structname>pg_stat_buffers</structname> view.
</para>
<para>
This function is restricted to superusers by default, but other users
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index eb560955cd..86ca35121b 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1076,6 +1076,17 @@ CREATE VIEW pg_stat_bgwriter AS
pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+CREATE VIEW pg_stat_buffers AS
+SELECT
+ b.backend_type,
+ b.io_path,
+ b.alloc,
+ b.extend,
+ b.fsync,
+ b.write,
+ b.stats_reset
+FROM pg_stat_get_buffers() b;
+
CREATE VIEW pg_stat_wal AS
SELECT
w.wal_records,
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index be7366379d..1306b5238d 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -1098,13 +1098,14 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
(CheckpointerShmem->num_requests >= CheckpointerShmem->max_requests &&
!CompactCheckpointerRequestQueue()))
{
+ LWLockRelease(CheckpointerCommLock);
/*
* Count the subset of writes where backends have to do their own
* fsync
*/
if (!AmBackgroundWriterProcess())
CheckpointerShmem->num_backend_fsync++;
- LWLockRelease(CheckpointerCommLock);
+ pgstat_inc_ioop(IOOP_FSYNC, IOPATH_SHARED);
return false;
}
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index b7d0fbaefd..355690d944 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -124,9 +124,12 @@ char *pgstat_stat_filename = NULL;
char *pgstat_stat_tmpname = NULL;
/*
- * BgWriter and WAL global statistics counters.
- * Stored directly in a stats message structure so they can be sent
- * without needing to copy things around. We assume these init to zeroes.
+ * BgWriter, Checkpointer, WAL, and I/O global statistics counters. I/O global
+ * statistics on various IO ops are tracked in PgBackendStatus while a backend
+ * is alive and then sent to stats collector before a backend exits in a
+ * PgStat_MsgIOPathOps.
+ * All others are stored directly in a stats message structure so they can be
+ * sent without needing to copy things around. We assume these init to zeroes.
*/
PgStat_MsgBgWriter PendingBgWriterStats;
PgStat_MsgCheckpointer PendingCheckpointerStats;
@@ -362,6 +365,7 @@ static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
static void pgstat_recv_checkpointer(PgStat_MsgCheckpointer *msg, int len);
+static void pgstat_recv_io_path_ops(PgStat_MsgIOPathOps *msg, int len);
static void pgstat_recv_wal(PgStat_MsgWal *msg, int len);
static void pgstat_recv_slru(PgStat_MsgSLRU *msg, int len);
static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
@@ -1431,6 +1435,28 @@ pgstat_reset_counters(void)
pgstat_send(&msg, sizeof(msg));
}
+/*
+ * Helper function to collect and send live backends' current IO operations
+ * stats counters when a stats reset is initiated so that they may be deducted
+ * from future totals.
+ */
+static void
+pgstat_send_buffers_reset(PgStat_MsgResetsharedcounter *msg)
+{
+ int backend_type;
+ PgStatIOPathOps ops[BACKEND_NUM_TYPES];
+
+ memset(ops, 0, sizeof(ops));
+ pgstat_report_live_backend_io_path_ops(ops);
+
+ for (backend_type = 1; backend_type < BACKEND_NUM_TYPES; backend_type++)
+ {
+ msg->m_backend_resets.backend_type = backend_type;
+ memcpy(&msg->m_backend_resets.iop, &ops[backend_type], sizeof(msg->m_backend_resets.iop));
+ pgstat_send(msg, sizeof(PgStat_MsgResetsharedcounter));
+ }
+}
+
/* ----------
* pgstat_reset_shared_counters() -
*
@@ -1448,7 +1474,14 @@ pgstat_reset_shared_counters(const char *target)
if (pgStatSock == PGINVALID_SOCKET)
return;
- if (strcmp(target, "archiver") == 0)
+ pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
+ if (strcmp(target, "buffers") == 0)
+ {
+ msg.m_resettarget = RESET_BUFFERS;
+ pgstat_send_buffers_reset(&msg);
+ return;
+ }
+ else if (strcmp(target, "archiver") == 0)
msg.m_resettarget = RESET_ARCHIVER;
else if (strcmp(target, "bgwriter") == 0)
msg.m_resettarget = RESET_BGWRITER;
@@ -1460,8 +1493,9 @@ pgstat_reset_shared_counters(const char *target)
errmsg("unrecognized reset target: \"%s\"", target),
errhint("Target must be \"archiver\", \"bgwriter\", \"buffers\", or \"wal\".")));
- pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
+
pgstat_send(&msg, sizeof(msg));
+
}
/* ----------
@@ -2760,6 +2794,19 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
rec->tuples_inserted + rec->tuples_updated;
}
+/*
+ * Support function for SQL-callable pgstat* functions. Returns a pointer to
+ * the PgStat_BackendIOPathOps structure tracking IO op statistics for both
+ * exited backends and reset arithmetic.
+ */
+PgStat_BackendIOPathOps *
+pgstat_fetch_exited_backend_buffers(void)
+{
+ backend_read_statsfile();
+
+ return &globalStats.buffers;
+}
+
/* ----------
* pgstat_fetch_stat_dbentry() -
@@ -2999,6 +3046,14 @@ pgstat_shutdown_hook(int code, Datum arg)
{
Assert(!pgstat_is_shutdown);
+ /*
+ * Only need to send stats on IO Ops for IO Paths when a process exits, as
+ * pg_stat_get_buffers() will read from live backends' PgBackendStatus and
+ * then sum this with totals from exited backends persisted by the stats
+ * collector.
+ */
+ pgstat_send_buffers();
+
/*
* If we got as far as discovering our own database ID, we can report what
* we did to the collector. Otherwise, we'd be sending an invalid
@@ -3148,6 +3203,31 @@ pgstat_send_bgwriter(void)
MemSet(&PendingBgWriterStats, 0, sizeof(PendingBgWriterStats));
}
+/*
+ * Before exiting, a backend sends its IO op statistics to the collector so
+ * that they may be persisted.
+ */
+void
+pgstat_send_buffers(void)
+{
+ PgStat_MsgIOPathOps msg;
+
+ PgBackendStatus *beentry = MyBEEntry;
+
+ if (!beentry)
+ return;
+
+ memset(&msg, 0, sizeof(msg));
+ msg.backend_type = beentry->st_backendType;
+
+ pgstat_sum_io_path_ops(msg.iop.io_path_ops,
+ (IOOps *) &beentry->io_path_stats);
+
+ pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_IO_PATH_OPS);
+ pgstat_send(&msg, sizeof(msg));
+}
+
+
/* ----------
* pgstat_send_checkpointer() -
*
@@ -3312,6 +3392,29 @@ pgstat_send_slru(void)
}
}
+/*
+ * Helper function to sum all live IO Op stats for all IO Paths (e.g. shared,
+ * local) to those in the equivalent stats structure for exited backends. Note
+ * that this adds and doesn't set, so the destination stats structure should be
+ * zeroed out by the caller initially. This would commonly be used to transfer
+ * all IO Op stats for all IO Paths for a particular backend type to the
+ * pgstats structure.
+ */
+void
+pgstat_sum_io_path_ops(PgStatIOOps *dest, IOOps *src)
+{
+ int io_path;
+ for (io_path = 0; io_path < IOPATH_NUM_TYPES; io_path++)
+ {
+ dest->allocs += pg_atomic_read_u64(&src->allocs);
+ dest->extends += pg_atomic_read_u64(&src->extends);
+ dest->fsyncs += pg_atomic_read_u64(&src->fsyncs);
+ dest->writes += pg_atomic_read_u64(&src->writes);
+ dest++;
+ src++;
+ }
+
+}
/* ----------
* PgstatCollectorMain() -
@@ -3522,6 +3625,10 @@ PgstatCollectorMain(int argc, char *argv[])
pgstat_recv_checkpointer(&msg.msg_checkpointer, len);
break;
+ case PGSTAT_MTYPE_IO_PATH_OPS:
+ pgstat_recv_io_path_ops(&msg.msg_io_path_ops, len);
+ break;
+
case PGSTAT_MTYPE_WAL:
pgstat_recv_wal(&msg.msg_wal, len);
break;
@@ -5221,10 +5328,30 @@ pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
{
if (msg->m_resettarget == RESET_BGWRITER)
{
- /* Reset the global, bgwriter and checkpointer statistics for the cluster. */
- memset(&globalStats, 0, sizeof(globalStats));
+ /*
+ * Reset the global, bgwriter and checkpointer statistics for the
+ * cluster.
+ */
+ memset(&globalStats.checkpointer, 0, sizeof(globalStats.checkpointer));
+ memset(&globalStats.bgwriter, 0, sizeof(globalStats.bgwriter));
globalStats.bgwriter.stat_reset_timestamp = GetCurrentTimestamp();
}
+ else if (msg->m_resettarget == RESET_BUFFERS)
+ {
+ BackendType backend_type = msg->m_backend_resets.backend_type;
+
+ /*
+ * Though globalStats.buffers only needs to be reset once, doing so
+ * for every message is less brittle and the extra cost is irrelevant
+ * given how infrequently stats are reset.
+ */
+ memset(&globalStats.buffers.ops, 0, sizeof(globalStats.buffers.ops));
+ globalStats.buffers.stat_reset_timestamp = GetCurrentTimestamp();
+
+ memcpy(&globalStats.buffers.resets[backend_type],
+ &msg->m_backend_resets.iop.io_path_ops, sizeof(msg->m_backend_resets.iop.io_path_ops));
+
+ }
else if (msg->m_resettarget == RESET_ARCHIVER)
{
/* Reset the archiver statistics for the cluster. */
@@ -5512,6 +5639,26 @@ pgstat_recv_checkpointer(PgStat_MsgCheckpointer *msg, int len)
globalStats.checkpointer.buf_fsync_backend += msg->m_buf_fsync_backend;
}
+static void
+pgstat_recv_io_path_ops(PgStat_MsgIOPathOps *msg, int len)
+{
+ int io_path;
+ PgStatIOOps *src_io_path_ops = msg->iop.io_path_ops;
+ PgStatIOOps *dest_io_path_ops =
+ globalStats.buffers.ops[msg->backend_type].io_path_ops;
+
+ for (io_path = 0; io_path < IOPATH_NUM_TYPES; io_path++)
+ {
+ PgStatIOOps *src = &src_io_path_ops[io_path];
+ PgStatIOOps *dest = &dest_io_path_ops[io_path];
+
+ dest->allocs += src->allocs;
+ dest->extends += src->extends;
+ dest->fsyncs += src->fsyncs;
+ dest->writes += src->writes;
+ }
+}
+
/* ----------
* pgstat_recv_wal() -
*
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 08ebabfe96..6926fc5742 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -480,7 +480,7 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
BlockNumber blockNum,
BufferAccessStrategy strategy,
bool *foundPtr);
-static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+static void FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOPath iopath);
static void FindAndDropRelFileNodeBuffers(RelFileNode rnode,
ForkNumber forkNum,
BlockNumber nForkBlock,
@@ -972,6 +972,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (isExtend)
{
+ pgstat_inc_ioop(IOOP_EXTEND, IOPATH_SHARED);
/* new buffers are zero-filled */
MemSet((char *) bufBlock, 0, BLCKSZ);
/* don't set checksum for all-zero page */
@@ -1172,6 +1173,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
/* Loop here in case we have to try another victim buffer */
for (;;)
{
+ bool from_ring;
+
/*
* Ensure, while the spinlock's not yet held, that there's a free
* refcount entry.
@@ -1182,7 +1185,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* Select a victim buffer. The buffer is returned with its header
* spinlock still held!
*/
- buf = StrategyGetBuffer(strategy, &buf_state);
+ buf = StrategyGetBuffer(strategy, &buf_state, &from_ring);
Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
@@ -1219,6 +1222,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf),
LW_SHARED))
{
+ IOPath iopath;
+
/*
* If using a nondefault strategy, and writing the buffer
* would require a WAL flush, let the strategy decide whether
@@ -1236,7 +1241,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
UnlockBufHdr(buf, buf_state);
if (XLogNeedsFlush(lsn) &&
- StrategyRejectBuffer(strategy, buf))
+ StrategyRejectBuffer(strategy, buf, &from_ring))
{
/* Drop lock/pin and loop around for another buffer */
LWLockRelease(BufferDescriptorGetContentLock(buf));
@@ -1245,13 +1250,26 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
}
}
+ /*
+ * When a strategy is in use, if the dirty buffer was selected
+ * from the strategy ring and we did not bother checking the
+ * freelist or doing a clock sweep to look for a clean shared
+ * buffer to use, the write will be counted as a strategy
+ * write. However, if the dirty buffer was obtained from the
+ * freelist or a clock sweep, it is counted as a regular
+ * write. When a strategy is not in use, at this point, the
+ * write can only be a "regular" write of a dirty buffer.
+ */
+
+ iopath = from_ring ? IOPATH_STRATEGY : IOPATH_SHARED;
+
/* OK, do the I/O */
TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(forkNum, blockNum,
smgr->smgr_rnode.node.spcNode,
smgr->smgr_rnode.node.dbNode,
smgr->smgr_rnode.node.relNode);
- FlushBuffer(buf, NULL);
+ FlushBuffer(buf, NULL, iopath);
LWLockRelease(BufferDescriptorGetContentLock(buf));
ScheduleBufferTagForWriteback(&BackendWritebackContext,
@@ -2552,10 +2570,11 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
* Pin it, share-lock it, write it. (FlushBuffer will do nothing if the
* buffer is clean by the time we've locked it.)
*/
+
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOPATH_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
@@ -2803,9 +2822,12 @@ BufferGetTag(Buffer buffer, RelFileNode *rnode, ForkNumber *forknum,
*
* If the caller has an smgr reference for the buffer's relation, pass it
* as the second parameter. If not, pass NULL.
+ *
+ * IOPath will always be IOPATH_SHARED except when a buffer access strategy is
+ * used and the buffer being flushed is a buffer from the strategy ring.
*/
static void
-FlushBuffer(BufferDesc *buf, SMgrRelation reln)
+FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOPath iopath)
{
XLogRecPtr recptr;
ErrorContextCallback errcallback;
@@ -2897,6 +2919,8 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
bufToWrite,
false);
+ pgstat_inc_ioop(IOOP_WRITE, iopath);
+
if (track_io_timing)
{
INSTR_TIME_SET_CURRENT(io_time);
@@ -3544,6 +3568,8 @@ FlushRelationBuffers(Relation rel)
localpage,
false);
+ pgstat_inc_ioop(IOOP_WRITE, IOPATH_LOCAL);
+
buf_state &= ~(BM_DIRTY | BM_JUST_DIRTIED);
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
@@ -3579,7 +3605,7 @@ FlushRelationBuffers(Relation rel)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, RelationGetSmgr(rel));
+ FlushBuffer(bufHdr, RelationGetSmgr(rel), IOPATH_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
@@ -3675,7 +3701,7 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, srelent->srel);
+ FlushBuffer(bufHdr, srelent->srel, IOPATH_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
@@ -3731,7 +3757,7 @@ FlushDatabaseBuffers(Oid dbid)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOPATH_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
@@ -3758,7 +3784,7 @@ FlushOneBuffer(Buffer buffer)
Assert(LWLockHeldByMe(BufferDescriptorGetContentLock(bufHdr)));
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOPATH_SHARED);
}
/*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 6be80476db..c7ca8d75aa 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -19,6 +19,7 @@
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/proc.h"
+#include "utils/backend_status.h"
#define INT_ACCESS_ONCE(var) ((int)(*((volatile int *)&(var))))
@@ -198,7 +199,7 @@ have_free_buffer(void)
* return the buffer with the buffer header spinlock still held.
*/
BufferDesc *
-StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
+StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_ring)
{
BufferDesc *buf;
int bgwprocno;
@@ -212,7 +213,8 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
if (strategy != NULL)
{
buf = GetBufferFromRing(strategy, buf_state);
- if (buf != NULL)
+ *from_ring = buf != NULL;
+ if (*from_ring)
return buf;
}
@@ -247,6 +249,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
* the rate of buffer consumption. Note that buffers recycled by a
* strategy object are intentionally not counted here.
*/
+ pgstat_inc_ioop(IOOP_ALLOC, IOPATH_SHARED);
pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);
/*
@@ -683,8 +686,14 @@ AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
* if this buffer should be written and re-used.
*/
bool
-StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
+StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool *from_ring)
{
+ /*
+ * Start by assuming that we will use the dirty buffer selected by
+ * StrategyGetBuffer().
+ */
+ *from_ring = true;
+
/* We only do this in bulkread mode */
if (strategy->btype != BAS_BULKREAD)
return false;
@@ -700,5 +709,13 @@ StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
*/
strategy->buffers[strategy->current] = InvalidBuffer;
+ /*
+ * Since we will not be writing out a dirty buffer from the ring, set
+ * from_ring to false so that the caller does not count this write as a
+ * "strategy write" and can do proper bookkeeping.
+ */
+ *from_ring = false;
+
+
return true;
}
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 04b3558ea3..f396a2b68d 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -20,6 +20,7 @@
#include "executor/instrument.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
+#include "utils/backend_status.h"
#include "utils/guc.h"
#include "utils/memutils.h"
#include "utils/resowner_private.h"
@@ -226,6 +227,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
localpage,
false);
+ pgstat_inc_ioop(IOOP_WRITE, IOPATH_LOCAL);
+
/* Mark not-dirty now in case we error out below */
buf_state &= ~BM_DIRTY;
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index d4083e8a56..3be06d5d5a 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -420,6 +420,7 @@ ProcessSyncRequests(void)
total_elapsed += elapsed;
processed++;
+ pgstat_inc_ioop(IOOP_FSYNC, IOPATH_SHARED);
if (log_checkpoints)
elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f ms",
processed,
diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index 7229598822..4662ea2e24 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -50,7 +50,7 @@ int pgstat_track_activity_query_size = 1024;
PgBackendStatus *MyBEEntry = NULL;
-static PgBackendStatus *BackendStatusArray = NULL;
+PgBackendStatus *BackendStatusArray = NULL;
static char *BackendAppnameBuffer = NULL;
static char *BackendClientHostnameBuffer = NULL;
static char *BackendActivityBuffer = NULL;
@@ -236,6 +236,24 @@ CreateSharedBackendStatus(void)
#endif
}
+const char *
+GetIOPathDesc(IOPath io_path)
+{
+
+ switch (io_path)
+ {
+ case IOPATH_DIRECT:
+ return "direct";
+ case IOPATH_LOCAL:
+ return "local";
+ case IOPATH_SHARED:
+ return "shared";
+ case IOPATH_STRATEGY:
+ return "strategy";
+ }
+ return "unknown IO path";
+}
+
/*
* Initialize pgstats backend activity state, and set up our on-proc-exit
* hook. Called from InitPostgres and AuxiliaryProcessMain. For auxiliary
@@ -279,7 +297,7 @@ pgstat_beinit(void)
* pgstat_bestart() -
*
* Initialize this backend's entry in the PgBackendStatus array.
- * Called from InitPostgres.
+ * Called from InitPostgres and AuxiliaryProcessMain
*
* Apart from auxiliary processes, MyBackendId, MyDatabaseId,
* session userid, and application_name must be set for a
@@ -293,6 +311,7 @@ pgstat_bestart(void)
{
volatile PgBackendStatus *vbeentry = MyBEEntry;
PgBackendStatus lbeentry;
+ int io_path;
#ifdef USE_SSL
PgBackendSSLStatus lsslstatus;
#endif
@@ -399,6 +418,15 @@ pgstat_bestart(void)
lbeentry.st_progress_command = PROGRESS_COMMAND_INVALID;
lbeentry.st_progress_command_target = InvalidOid;
lbeentry.st_query_id = UINT64CONST(0);
+ for (io_path = 0; io_path < IOPATH_NUM_TYPES; io_path++)
+ {
+ IOOps *io_ops = &lbeentry.io_path_stats[io_path];
+
+ pg_atomic_init_u64(&io_ops->allocs, 0);
+ pg_atomic_init_u64(&io_ops->extends, 0);
+ pg_atomic_init_u64(&io_ops->fsyncs, 0);
+ pg_atomic_init_u64(&io_ops->writes, 0);
+ }
/*
* we don't zero st_progress_param here to save cycles; nobody should
@@ -621,6 +649,33 @@ pgstat_report_activity(BackendState state, const char *cmd_str)
PGSTAT_END_WRITE_ACTIVITY(beentry);
}
+/*
+ * Iterate through BackendStatusArray and capture live backends' stats on IO
+ * Ops for all IO Paths, adding them to that backend type's member of the
+ * backend_io_path_ops structure.
+ */
+void
+pgstat_report_live_backend_io_path_ops(PgStatIOPathOps *backend_io_path_ops)
+{
+ int i;
+ PgBackendStatus *beentry = BackendStatusArray;
+
+ /*
+ * Loop through live backends and capture reset values
+ */
+ for (i = 0; i < MaxBackends + NUM_AUXPROCTYPES; i++)
+ {
+ beentry++;
+ /* Don't count dead backends. They should already be counted */
+ if (beentry->st_procpid == 0)
+ continue;
+
+ pgstat_sum_io_path_ops(backend_io_path_ops[beentry->st_backendType].io_path_ops,
+ (IOOps *) beentry->io_path_stats);
+
+ }
+}
+
/* --------
* pgstat_report_query_id() -
*
@@ -1045,7 +1100,6 @@ pgstat_get_my_query_id(void)
return MyBEEntry->st_query_id;
}
-
/* ----------
* pgstat_fetch_stat_beentry() -
*
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index ff5aedc99c..734079e233 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1778,6 +1778,37 @@ pg_stat_get_bgwriter_stat_reset_time(PG_FUNCTION_ARGS)
PG_RETURN_TIMESTAMPTZ(pgstat_fetch_stat_bgwriter()->stat_reset_timestamp);
}
+/*
+* When adding a new column to the pg_stat_buffers view, add a new enum
+* value here above COLUMN_LENGTH.
+*/
+enum
+{
+ COLUMN_BACKEND_TYPE,
+ COLUMN_IO_PATH,
+ COLUMN_ALLOCS,
+ COLUMN_EXTENDS,
+ COLUMN_FSYNCS,
+ COLUMN_WRITES,
+ COLUMN_RESET_TIME,
+ COLUMN_LENGTH,
+};
+
+#define NROWS ((BACKEND_NUM_TYPES - 1) * IOPATH_NUM_TYPES)
+/*
+ * Helper function to get the correct row in the pg_stat_buffers view.
+ */
+static inline Datum *
+get_pg_stat_buffers_row(Datum all_values[NROWS][COLUMN_LENGTH], BackendType backend_type, IOPath io_path)
+{
+
+ /*
+ * Subtract 1 from backend_type to avoid having rows for B_INVALID
+ * BackendType
+ */
+ return all_values[(backend_type - 1) * IOPATH_NUM_TYPES + io_path];
+}
+
Datum
pg_stat_get_buf_written_backend(PG_FUNCTION_ARGS)
{
@@ -1796,6 +1827,127 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
}
+Datum
+pg_stat_get_buffers(PG_FUNCTION_ARGS)
+{
+ PgStat_BackendIOPathOps *backend_io_path_ops;
+ PgBackendStatus *beentry;
+ int backend_type, io_path;
+ int i;
+ Datum reset_time;
+
+ ReturnSetInfo *rsinfo;
+ TupleDesc tupdesc;
+ Tuplestorestate *tupstore;
+ MemoryContext per_query_ctx;
+ MemoryContext oldcontext;
+
+ Datum all_values[NROWS][COLUMN_LENGTH];
+ bool all_nulls[NROWS][COLUMN_LENGTH];
+
+ rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+
+ /* check to see if caller supports us returning a tuplestore */
+ if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("set-valued function called in context that cannot accept a set")));
+ if (!(rsinfo->allowedModes & SFRM_Materialize))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("materialize mode required, but it is not allowed in this context")));
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+ oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+ tupstore = tuplestore_begin_heap(true, false, work_mem);
+ rsinfo->returnMode = SFRM_Materialize;
+ rsinfo->setResult = tupstore;
+ rsinfo->setDesc = tupdesc;
+ MemoryContextSwitchTo(oldcontext);
+
+ memset(all_values, 0, sizeof(all_values));
+ memset(all_nulls, 0, sizeof(all_nulls));
+
+ /*
+ * Loop through all live backends and count their IO Ops for each IO Path
+ */
+ beentry = BackendStatusArray;
+
+ for (i = 0; i < MaxBackends + NUM_AUXPROCTYPES; i++)
+ {
+ IOOps *io_ops;
+
+ beentry++;
+ /* Don't count dead backends. They should already be counted */
+ if (beentry->st_procpid == 0)
+ continue;
+
+ io_ops = beentry->io_path_stats;
+
+ for (io_path = 0; io_path < IOPATH_NUM_TYPES; io_path++)
+ {
+ Datum *values = get_pg_stat_buffers_row(all_values, beentry->st_backendType, io_path);
+
+ /*
+ * COLUMN_RESET_TIME, COLUMN_BACKEND_TYPE, and COLUMN_IO_PATH will
+ * all be set when looping through exited backends array
+ */
+ values[COLUMN_ALLOCS] += pg_atomic_read_u64(&io_ops->allocs);
+ values[COLUMN_EXTENDS] += pg_atomic_read_u64(&io_ops->extends);
+ values[COLUMN_FSYNCS] += pg_atomic_read_u64(&io_ops->fsyncs);
+ values[COLUMN_WRITES] += pg_atomic_read_u64(&io_ops->writes);
+ io_ops++;
+ }
+ }
+
+ /* Add stats from all exited backends */
+ backend_io_path_ops = pgstat_fetch_exited_backend_buffers();
+
+ reset_time = TimestampTzGetDatum(backend_io_path_ops->stat_reset_timestamp);
+
+ /* 0 is not a valid BackendType */
+ for (backend_type = 1; backend_type < BACKEND_NUM_TYPES; backend_type++)
+ {
+ PgStatIOOps *io_ops = backend_io_path_ops->ops[backend_type].io_path_ops;
+ PgStatIOOps *resets = backend_io_path_ops->resets[backend_type].io_path_ops;
+
+ Datum backend_type_desc = CStringGetTextDatum(GetBackendTypeDesc(backend_type));
+
+ for (io_path = 0; io_path < IOPATH_NUM_TYPES; io_path++)
+ {
+ Datum *values = get_pg_stat_buffers_row(all_values, backend_type, io_path);
+
+ values[COLUMN_BACKEND_TYPE] = backend_type_desc;
+ values[COLUMN_IO_PATH] = CStringGetTextDatum(GetIOPathDesc(io_path));
+ values[COLUMN_ALLOCS] = values[COLUMN_ALLOCS] + io_ops->allocs - resets->allocs;
+ values[COLUMN_EXTENDS] = values[COLUMN_EXTENDS] + io_ops->extends - resets->extends;
+ values[COLUMN_FSYNCS] = values[COLUMN_FSYNCS] + io_ops->fsyncs - resets->fsyncs;
+ values[COLUMN_WRITES] = values[COLUMN_WRITES] + io_ops->writes - resets->writes;
+ values[COLUMN_RESET_TIME] = reset_time;
+ io_ops++;
+ resets++;
+ }
+ }
+
+ for (i = 0; i < NROWS; i++)
+ {
+ Datum *values = all_values[i];
+ bool *nulls = all_nulls[i];
+
+ tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+ }
+
+ /* clean up and return the tuplestore */
+ tuplestore_donestoring(tupstore);
+
+ return (Datum) 0;
+}
+
/*
* Returns statistics of WAL activity
*/
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index d068d6532e..bbdb07b222 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5642,6 +5642,15 @@
proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
+{ oid => '8459', descr => 'statistics: counts of all IO operations done to all IO paths by each type of backend.',
+ proname => 'pg_stat_get_buffers', provolatile => 's', proisstrict => 'f',
+ prorows => '52', proretset => 't',
+ proparallel => 'r', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,text,int8,int8,int8,int8,timestamptz}',
+ proargmodes => '{o,o,o,o,o,o,o}',
+ proargnames => '{backend_type,io_path,alloc,extend,fsync,write,stats_reset}',
+ prosrc => 'pg_stat_get_buffers' },
+
{ oid => '1136', descr => 'statistics: information about WAL activity',
proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
proparallel => 'r', prorettype => 'record', proargtypes => '',
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 90a3016065..6785fb3813 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -338,6 +338,8 @@ typedef enum BackendType
B_LOGGER,
} BackendType;
+#define BACKEND_NUM_TYPES (B_LOGGER + 1)
+
extern BackendType MyBackendType;
extern const char *GetBackendTypeDesc(BackendType backendType);
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index bcd3588ea2..ac8aca2c61 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -72,6 +72,7 @@ typedef enum StatMsgType
PGSTAT_MTYPE_ARCHIVER,
PGSTAT_MTYPE_BGWRITER,
PGSTAT_MTYPE_CHECKPOINTER,
+ PGSTAT_MTYPE_IO_PATH_OPS,
PGSTAT_MTYPE_WAL,
PGSTAT_MTYPE_SLRU,
PGSTAT_MTYPE_FUNCSTAT,
@@ -138,6 +139,7 @@ typedef enum PgStat_Shared_Reset_Target
{
RESET_ARCHIVER,
RESET_BGWRITER,
+ RESET_BUFFERS,
RESET_WAL
} PgStat_Shared_Reset_Target;
@@ -331,6 +333,51 @@ typedef struct PgStat_MsgDropdb
} PgStat_MsgDropdb;
+/*
+ * Structure for counting all types of IO ops in the stats collector
+ */
+typedef struct PgStatIOOps
+{
+ PgStat_Counter allocs;
+ PgStat_Counter extends;
+ PgStat_Counter fsyncs;
+ PgStat_Counter writes;
+} PgStatIOOps;
+
+/*
+ * Structure for counting all IO ops on all types of buffers.
+ */
+typedef struct PgStatIOPathOps
+{
+ PgStatIOOps io_path_ops[IOPATH_NUM_TYPES];
+} PgStatIOPathOps;
+
+/*
+ * Sent by a backend to the stats collector to report all IO Ops for all IO
+ * Paths for a given type of a backend. This will happen when the backend exits
+ * or when stats are reset.
+ */
+typedef struct PgStat_MsgIOPathOps
+{
+ PgStat_MsgHdr m_hdr;
+
+ BackendType backend_type;
+ PgStatIOPathOps iop;
+} PgStat_MsgIOPathOps;
+
+/*
+ * Structure used by stats collector to keep track of all types of exited
+ * backends' IO Ops for all IO Paths as well as all stats from live backends at
+ * the time of stats reset. resets is populated using a reset message sent to
+ * the stats collector.
+ */
+typedef struct PgStat_BackendIOPathOps
+{
+ TimestampTz stat_reset_timestamp;
+ PgStatIOPathOps ops[BACKEND_NUM_TYPES];
+ PgStatIOPathOps resets[BACKEND_NUM_TYPES];
+} PgStat_BackendIOPathOps;
+
/* ----------
* PgStat_MsgResetcounter Sent by the backend to tell the collector
* to reset counters
@@ -351,6 +398,7 @@ typedef struct PgStat_MsgResetsharedcounter
{
PgStat_MsgHdr m_hdr;
PgStat_Shared_Reset_Target m_resettarget;
+ PgStat_MsgIOPathOps m_backend_resets;
} PgStat_MsgResetsharedcounter;
/* ----------
@@ -703,6 +751,7 @@ typedef union PgStat_Msg
PgStat_MsgArchiver msg_archiver;
PgStat_MsgBgWriter msg_bgwriter;
PgStat_MsgCheckpointer msg_checkpointer;
+ PgStat_MsgIOPathOps msg_io_path_ops;
PgStat_MsgWal msg_wal;
PgStat_MsgSLRU msg_slru;
PgStat_MsgFuncstat msg_funcstat;
@@ -879,6 +928,7 @@ typedef struct PgStat_GlobalStats
PgStat_CheckpointerStats checkpointer;
PgStat_BgWriterStats bgwriter;
+ PgStat_BackendIOPathOps buffers;
} PgStat_GlobalStats;
/*
@@ -1118,14 +1168,17 @@ extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
extern void pgstat_send_archiver(const char *xlog, bool failed);
extern void pgstat_send_bgwriter(void);
+extern void pgstat_send_buffers(void);
extern void pgstat_send_checkpointer(void);
extern void pgstat_send_wal(bool force);
+extern void pgstat_sum_io_path_ops(PgStatIOOps *dest, IOOps *src);
/* ----------
* Support functions for the SQL-callable functions to
* generate the pgstat* views.
* ----------
*/
+extern PgStat_BackendIOPathOps *pgstat_fetch_exited_backend_buffers(void);
extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid);
extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 33fcaf5c9a..7e385135db 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -310,10 +310,10 @@ extern void ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *
/* freelist.c */
extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
- uint32 *buf_state);
+ uint32 *buf_state, bool *from_ring);
extern void StrategyFreeBuffer(BufferDesc *buf);
extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
- BufferDesc *buf);
+ BufferDesc *buf, bool *from_ring);
extern int StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
extern void StrategyNotifyBgWriter(int bgwprocno);
diff --git a/src/include/utils/backend_status.h b/src/include/utils/backend_status.h
index 8042b817df..2b99195e7b 100644
--- a/src/include/utils/backend_status.h
+++ b/src/include/utils/backend_status.h
@@ -13,6 +13,7 @@
#include "datatype/timestamp.h"
#include "libpq/pqcomm.h"
#include "miscadmin.h" /* for BackendType */
+#include "port/atomics.h"
#include "utils/backend_progress.h"
@@ -31,12 +32,48 @@ typedef enum BackendState
STATE_DISABLED
} BackendState;
+/* ----------
+ * IO Stats reporting utility types
+ * ----------
+ */
+
+typedef enum IOOp
+{
+ IOOP_ALLOC,
+ IOOP_EXTEND,
+ IOOP_FSYNC,
+ IOOP_WRITE,
+} IOOp;
+
+#define IOOP_NUM_TYPES (IOOP_WRITE + 1)
+
+typedef enum IOPath
+{
+ IOPATH_DIRECT,
+ IOPATH_LOCAL,
+ IOPATH_SHARED,
+ IOPATH_STRATEGY,
+} IOPath;
+
+#define IOPATH_NUM_TYPES (IOPATH_STRATEGY + 1)
+
/* ----------
* Shared-memory data structures
* ----------
*/
+/*
+ * Structure for counting all types of IOOps for a live backend.
+ */
+typedef struct IOOps
+{
+ pg_atomic_uint64 allocs;
+ pg_atomic_uint64 extends;
+ pg_atomic_uint64 fsyncs;
+ pg_atomic_uint64 writes;
+} IOOps;
+
/*
* PgBackendSSLStatus
*
@@ -168,6 +205,16 @@ typedef struct PgBackendStatus
/* query identifier, optionally computed using post_parse_analyze_hook */
uint64 st_query_id;
+
+ /*
+ * Stats on all IO Ops for all IO Paths for this backend. When the
+ * pg_stat_buffers view is queried and when stats are reset, one backend
+ * will read io_path_stats from all live backends and combine them with
+ * io_path_stats from exited backends for each backend type. When this
+ * backend exits, it will send io_path_stats to the stats collector to be
+ * persisted.
+ */
+ IOOps io_path_stats[IOPATH_NUM_TYPES];
} PgBackendStatus;
@@ -274,6 +321,7 @@ extern PGDLLIMPORT int pgstat_track_activity_query_size;
* ----------
*/
extern PGDLLIMPORT PgBackendStatus *MyBEEntry;
+extern PGDLLIMPORT PgBackendStatus *BackendStatusArray;
/* ----------
@@ -289,6 +337,10 @@ extern void CreateSharedBackendStatus(void);
* ----------
*/
+/* Utility functions */
+extern const char *GetIOPathDesc(IOPath io_path);
+
+
/* Initialization functions */
extern void pgstat_beinit(void);
extern void pgstat_bestart(void);
@@ -296,7 +348,35 @@ extern void pgstat_bestart(void);
extern void pgstat_clear_backend_activity_snapshot(void);
/* Activity reporting functions */
+typedef struct PgStatIOPathOps PgStatIOPathOps;
+
+static inline void
+pgstat_inc_ioop(IOOp io_op, IOPath io_path)
+{
+ IOOps *io_ops;
+ PgBackendStatus *beentry = MyBEEntry;
+
+ Assert(beentry);
+
+ io_ops = &beentry->io_path_stats[io_path];
+ switch (io_op)
+ {
+ case IOOP_ALLOC:
+ inc_counter(&io_ops->allocs);
+ break;
+ case IOOP_EXTEND:
+ inc_counter(&io_ops->extends);
+ break;
+ case IOOP_FSYNC:
+ inc_counter(&io_ops->fsyncs);
+ break;
+ case IOOP_WRITE:
+ inc_counter(&io_ops->writes);
+ break;
+ }
+}
extern void pgstat_report_activity(BackendState state, const char *cmd_str);
+extern void pgstat_report_live_backend_io_path_ops(PgStatIOPathOps *backend_io_path_ops);
extern void pgstat_report_query_id(uint64 query_id, bool force);
extern void pgstat_report_tempfile(size_t filesize);
extern void pgstat_report_appname(const char *appname);
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 2fa00a3c29..5e5a0324ee 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1828,6 +1828,14 @@ pg_stat_bgwriter| SELECT pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints
pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+pg_stat_buffers| SELECT b.backend_type,
+ b.io_path,
+ b.alloc,
+ b.extend,
+ b.fsync,
+ b.write,
+ b.stats_reset
+ FROM pg_stat_get_buffers() b(backend_type, io_path, alloc, extend, fsync, write, stats_reset);
pg_stat_database| SELECT d.oid AS datid,
d.datname,
CASE
--
2.30.2
Hi,
On 2021-11-02 15:26:52 -0400, Melanie Plageman wrote:
Subject: [PATCH v14 1/4] Allow bootstrap process to beinit
Pushed.
+/*
+ * On modern systems this is really just *counter++. On some older systems
+ * there might be more to it, due to inability to read and write 64 bit values
+ * atomically.
+ */
+static inline void inc_counter(pg_atomic_uint64 *counter)
+{
+	pg_atomic_write_u64(counter, pg_atomic_read_u64(counter) + 1);
+}
+
 #undef INSIDE_ATOMICS_H
Why is this using a completely different naming scheme from the rest of the
file?
doc/src/sgml/monitoring.sgml | 116 +++++++++++++-
src/backend/catalog/system_views.sql | 11 ++
src/backend/postmaster/checkpointer.c | 3 +-
src/backend/postmaster/pgstat.c | 161 +++++++++++++++++++-
src/backend/storage/buffer/bufmgr.c | 46 ++++--
src/backend/storage/buffer/freelist.c | 23 ++-
src/backend/storage/buffer/localbuf.c | 3 +
src/backend/storage/sync/sync.c | 1 +
src/backend/utils/activity/backend_status.c | 60 +++++++-
src/backend/utils/adt/pgstatfuncs.c | 152 ++++++++++++++++++
src/include/catalog/pg_proc.dat | 9 ++
src/include/miscadmin.h | 2 +
src/include/pgstat.h | 53 +++++++
src/include/storage/buf_internals.h | 4 +-
src/include/utils/backend_status.h | 80 ++++++++++
src/test/regress/expected/rules.out | 8 +
16 files changed, 701 insertions(+), 31 deletions(-)
This is a pretty large change, I wonder if there's a way to make it a bit more
granular.
Greetings,
Andres Freund
On Fri, Nov 19, 2021 at 11:49 AM Andres Freund <andres@anarazel.de> wrote:
+/*
+ * On modern systems this is really just *counter++. On some older systems
+ * there might be more to it, due to inability to read and write 64 bit values
+ * atomically.
+ */
+static inline void inc_counter(pg_atomic_uint64 *counter)
+{
+	pg_atomic_write_u64(counter, pg_atomic_read_u64(counter) + 1);
+}
+
 #undef INSIDE_ATOMICS_H

Why is this using a completely different naming scheme from the rest of the file?
It was what Thomas originally named it. Also, I noticed all the other
pg_atomic* in this file were wrappers around the same impl function, so
I thought maybe naming it this way would be confusing. I renamed it to
pg_atomic_inc_counter(), though maybe pg_atomic_readonly_write() would
be better?
doc/src/sgml/monitoring.sgml | 116 +++++++++++++-
src/backend/catalog/system_views.sql | 11 ++
src/backend/postmaster/checkpointer.c | 3 +-
src/backend/postmaster/pgstat.c | 161 +++++++++++++++++++-
src/backend/storage/buffer/bufmgr.c | 46 ++++--
src/backend/storage/buffer/freelist.c | 23 ++-
src/backend/storage/buffer/localbuf.c | 3 +
src/backend/storage/sync/sync.c | 1 +
src/backend/utils/activity/backend_status.c | 60 +++++++-
src/backend/utils/adt/pgstatfuncs.c | 152 ++++++++++++++++++
src/include/catalog/pg_proc.dat | 9 ++
src/include/miscadmin.h | 2 +
src/include/pgstat.h | 53 +++++++
src/include/storage/buf_internals.h | 4 +-
src/include/utils/backend_status.h | 80 ++++++++++
src/test/regress/expected/rules.out | 8 +
16 files changed, 701 insertions(+), 31 deletions(-)

This is a pretty large change, I wonder if there's a way to make it a bit more granular.
I have done this. See v15 patch set attached.
- Melanie
Attachments:
v15-0007-small-comment-correction.patch
From 1420bc569573dc8ae89e1d2f8fcf3652ebcd4d10 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 24 Nov 2021 12:21:08 -0500
Subject: [PATCH v15 7/7] small comment correction
---
src/backend/utils/activity/backend_status.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index c3e5d23f99..1b2d436677 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -297,7 +297,7 @@ pgstat_beinit(void)
* pgstat_bestart() -
*
* Initialize this backend's entry in the PgBackendStatus array.
- * Called from InitPostgres.
+ * Called from InitPostgres and AuxiliaryProcessMain
*
* Apart from auxiliary processes, MyBackendId, MyDatabaseId,
* session userid, and application_name must be set for a
--
2.32.0
v15-0004-Add-buffers-to-pgstat_reset_shared_counters.patch
From 06d3fc9cda850e496170b1d02d321ef0d606a4f7 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 24 Nov 2021 11:39:48 -0500
Subject: [PATCH v15 4/7] Add "buffers" to pgstat_reset_shared_counters
Backends count IO operations for various IO paths in their PgBackendStatus.
Upon exit, they send these counts to the stats collector. Prior to this commit,
these IO Ops stats would have been reset when the target was "bgwriter".
With this commit, target "bgwriter" no longer will cause the IO operations
stats to be reset, and the IO operations stats can be reset with new target,
"buffers".
---
doc/src/sgml/monitoring.sgml | 2 +-
src/backend/postmaster/pgstat.c | 67 +++++++++++++++++++--
src/backend/utils/activity/backend_status.c | 27 +++++++++
src/include/pgstat.h | 12 +++-
src/include/utils/backend_status.h | 2 +
5 files changed, 103 insertions(+), 7 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index af6914872b..d4dd5d3623 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3473,7 +3473,7 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
<structfield>stats_reset</structfield> <type>timestamp with time zone</type>
</para>
<para>
- Time at which these statistics were last reset
+ Time at which these statistics were last reset.
</para></entry>
</row>
</tbody>
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index aa538d175f..0d18a4dc02 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1435,6 +1435,28 @@ pgstat_reset_counters(void)
pgstat_send(&msg, sizeof(msg));
}
+/*
+ * Helper function to collect and send live backends' current IO operations
+ * stats counters when a stats reset is initiated so that they may be deducted
+ * from future totals.
+ */
+static void
+pgstat_send_buffers_reset(PgStat_MsgResetsharedcounter *msg)
+{
+ int backend_type;
+ PgStatIOPathOps ops[BACKEND_NUM_TYPES];
+
+ memset(ops, 0, sizeof(ops));
+ pgstat_report_live_backend_io_path_ops(ops);
+
+ for (backend_type = 1; backend_type < BACKEND_NUM_TYPES; backend_type++)
+ {
+ msg->m_backend_resets.backend_type = backend_type;
+ memcpy(&msg->m_backend_resets.iop, &ops[backend_type], sizeof(msg->m_backend_resets.iop));
+ pgstat_send(msg, sizeof(PgStat_MsgResetsharedcounter));
+ }
+}
+
/* ----------
* pgstat_reset_shared_counters() -
*
@@ -1452,7 +1474,14 @@ pgstat_reset_shared_counters(const char *target)
if (pgStatSock == PGINVALID_SOCKET)
return;
- if (strcmp(target, "archiver") == 0)
+ pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
+ if (strcmp(target, "buffers") == 0)
+ {
+ msg.m_resettarget = RESET_BUFFERS;
+ pgstat_send_buffers_reset(&msg);
+ return;
+ }
+ else if (strcmp(target, "archiver") == 0)
msg.m_resettarget = RESET_ARCHIVER;
else if (strcmp(target, "bgwriter") == 0)
msg.m_resettarget = RESET_BGWRITER;
@@ -1464,8 +1493,9 @@ pgstat_reset_shared_counters(const char *target)
errmsg("unrecognized reset target: \"%s\"", target),
errhint("Target must be \"archiver\", \"bgwriter\", or \"wal\".")));
- pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
+
pgstat_send(&msg, sizeof(msg));
+
}
/* ----------
@@ -5285,10 +5315,39 @@ pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
{
if (msg->m_resettarget == RESET_BGWRITER)
{
- /* Reset the global, bgwriter and checkpointer statistics for the cluster. */
- memset(&globalStats, 0, sizeof(globalStats));
+ /*
+ * Reset the global, bgwriter and checkpointer statistics for the
+ * cluster.
+ */
+ memset(&globalStats.checkpointer, 0, sizeof(globalStats.checkpointer));
+ memset(&globalStats.bgwriter, 0, sizeof(globalStats.bgwriter));
globalStats.bgwriter.stat_reset_timestamp = GetCurrentTimestamp();
}
+ else if (msg->m_resettarget == RESET_BUFFERS)
+ {
+ /*
+ * Because the stats collector cannot write to live backends'
+ * PgBackendStatuses, it maintains an array of "resets". The reset
+ * message contains the current values of these counters for live
+ * backends. The stats collector saves these in its "resets" array,
+ * then zeroes out the exited backends' saved IO op counters. This is
+ * required to calculate an accurate total for each IO op counter post
+ * reset.
+ */
+ BackendType backend_type = msg->m_backend_resets.backend_type;
+
+ /*
+ * Though globalStats.buffers only needs to be reset once, doing so for
+ * every message is less brittle and the extra cost is irrelevant given
+ * how often stats are reset.
+ */
+ memset(&globalStats.buffers.ops, 0, sizeof(globalStats.buffers.ops));
+ globalStats.buffers.stat_reset_timestamp = GetCurrentTimestamp();
+
+ memcpy(&globalStats.buffers.resets[backend_type],
+ &msg->m_backend_resets.iop.io_path_ops, sizeof(msg->m_backend_resets.iop.io_path_ops));
+
+ }
else if (msg->m_resettarget == RESET_ARCHIVER)
{
/* Reset the archiver statistics for the cluster. */
diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index 0683f243dc..1617033e26 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -631,6 +631,33 @@ pgstat_report_activity(BackendState state, const char *cmd_str)
PGSTAT_END_WRITE_ACTIVITY(beentry);
}
+/*
+ * Iterate through BackendStatusArray and capture live backends' stats on IO
+ * Ops for all IO Paths, adding them to that backend type's member of the
+ * backend_io_path_ops structure.
+ */
+void
+pgstat_report_live_backend_io_path_ops(PgStatIOPathOps *backend_io_path_ops)
+{
+ int i;
+ PgBackendStatus *beentry = BackendStatusArray;
+
+ /*
+ * Loop through live backends and capture reset values
+ */
+ for (i = 0; i < MaxBackends + NUM_AUXPROCTYPES; i++)
+ {
+ beentry++;
+ /* Don't count dead backends. They should already be counted */
+ if (beentry->st_procpid == 0)
+ continue;
+
+ pgstat_sum_io_path_ops(backend_io_path_ops[beentry->st_backendType].io_path_ops,
+ (IOOps *) beentry->io_path_stats);
+
+ }
+}
+
/* --------
* pgstat_report_query_id() -
*
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index aeb6f52fdd..8c291f1f0d 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -139,6 +139,7 @@ typedef enum PgStat_Shared_Reset_Target
{
RESET_ARCHIVER,
RESET_BGWRITER,
+ RESET_BUFFERS,
RESET_WAL
} PgStat_Shared_Reset_Target;
@@ -353,7 +354,8 @@ typedef struct PgStatIOPathOps
/*
* Sent by a backend to the stats collector to report all IO Ops for all IO
- * Paths for a given type of a backend. This will happen when the backend exits.
+ * Paths for a given type of a backend. This will happen when the backend exits
+ * or when stats are reset.
*/
typedef struct PgStat_MsgIOPathOps
{
@@ -365,13 +367,18 @@ typedef struct PgStat_MsgIOPathOps
/*
* Structure used by stats collector to keep track of all types of exited
- * backends' IO Ops for all IO Paths.
+ * backends' IO Ops for all IO Paths as well as all stats from live backends at
+ * the time of stats reset. resets is populated using a reset message sent to
+ * the stats collector.
*/
typedef struct PgStat_BackendIOPathOps
{
+ TimestampTz stat_reset_timestamp;
PgStatIOPathOps ops[BACKEND_NUM_TYPES];
+ PgStatIOPathOps resets[BACKEND_NUM_TYPES];
} PgStat_BackendIOPathOps;
+
/* ----------
* PgStat_MsgResetcounter Sent by the backend to tell the collector
* to reset counters
@@ -392,6 +399,7 @@ typedef struct PgStat_MsgResetsharedcounter
{
PgStat_MsgHdr m_hdr;
PgStat_Shared_Reset_Target m_resettarget;
+ PgStat_MsgIOPathOps m_backend_resets;
} PgStat_MsgResetsharedcounter;
/* ----------
diff --git a/src/include/utils/backend_status.h b/src/include/utils/backend_status.h
index 717f58d4cc..9c997cace8 100644
--- a/src/include/utils/backend_status.h
+++ b/src/include/utils/backend_status.h
@@ -343,6 +343,7 @@ extern void pgstat_bestart(void);
extern void pgstat_clear_backend_activity_snapshot(void);
/* Activity reporting functions */
+typedef struct PgStatIOPathOps PgStatIOPathOps;
static inline void
pgstat_inc_ioop(IOOp io_op, IOPath io_path)
@@ -370,6 +371,7 @@ pgstat_inc_ioop(IOOp io_op, IOPath io_path)
}
}
extern void pgstat_report_activity(BackendState state, const char *cmd_str);
+extern void pgstat_report_live_backend_io_path_ops(PgStatIOPathOps *backend_io_path_ops);
extern void pgstat_report_query_id(uint64 query_id, bool force);
extern void pgstat_report_tempfile(size_t filesize);
extern void pgstat_report_appname(const char *appname);
--
2.32.0
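To make the reset arithmetic above concrete, here is a toy standalone C sketch (invented names and array sizes, not the actual PostgreSQL structs) of the scheme the patch describes: counters indexed by backend type and IO path, with a parallel "resets" array saved at reset time, so a view cell is computed as exited-backend totals plus live totals minus the reset snapshot.

```c
#include <stdint.h>

#define BACKEND_NUM_TYPES 3		/* stand-ins for B_BACKEND, B_CHECKPOINTER, ... */
#define IOPATH_NUM_TYPES  2		/* stand-ins for IOPATH_SHARED, IOPATH_STRATEGY */

typedef struct IOOpCounters
{
	uint64_t	allocs;
	uint64_t	extends;
	uint64_t	fsyncs;
	uint64_t	writes;
} IOOpCounters;

/* Collector-side state: exited-backend totals plus reset-time snapshots */
typedef struct BackendIOPathOps
{
	IOOpCounters ops[BACKEND_NUM_TYPES][IOPATH_NUM_TYPES];
	IOOpCounters resets[BACKEND_NUM_TYPES][IOPATH_NUM_TYPES];
} BackendIOPathOps;

/* One cell of a hypothetical view: exited + live - saved-at-reset */
static uint64_t
view_writes(const BackendIOPathOps *saved, uint64_t live_writes,
			int backend_type, int io_path)
{
	return saved->ops[backend_type][io_path].writes + live_writes
		- saved->resets[backend_type][io_path].writes;
}

/* Deterministic check: 40 exited + 12 live - 2 reset = 50 */
uint64_t
demo_view_writes(void)
{
	BackendIOPathOps saved = {0};

	saved.ops[1][0].writes = 40;
	saved.resets[1][0].writes = 2;
	return view_writes(&saved, 12, 1, 0);
}
```

The point of zeroing exited backends' saved counters at reset (as the hunk above does) is exactly so this subtraction stays accurate afterwards.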
Attachment: v15-0006-Remove-superfluous-bgwriter-stats.patch (application/octet-stream)
From 0315863209316e958a27b97910f116f8fb3d3ce6 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 24 Nov 2021 12:20:10 -0500
Subject: [PATCH v15 6/7] Remove superfluous bgwriter stats
Remove stats from pg_stat_bgwriter which are now more clearly expressed
in pg_stat_buffers.
TODO:
- make pg_stat_checkpointer view and move relevant stats into it
- add additional stats to pg_stat_bgwriter
---
doc/src/sgml/monitoring.sgml | 47 ---------------------------
src/backend/catalog/system_views.sql | 5 ---
src/backend/postmaster/checkpointer.c | 28 +---------------
src/backend/postmaster/pgstat.c | 5 ---
src/backend/storage/buffer/bufmgr.c | 6 ----
src/backend/utils/adt/pgstatfuncs.c | 30 -----------------
src/include/catalog/pg_proc.dat | 22 -------------
src/include/pgstat.h | 10 ------
src/test/regress/expected/rules.out | 5 ---
9 files changed, 1 insertion(+), 157 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 56d2fd884f..bd7e582856 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3420,24 +3420,6 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
</para></entry>
</row>
- <row>
- <entry role="catalog_table_entry"><para role="column_definition">
- <structfield>buffers_checkpoint</structfield> <type>bigint</type>
- </para>
- <para>
- Number of buffers written during checkpoints
- </para></entry>
- </row>
-
- <row>
- <entry role="catalog_table_entry"><para role="column_definition">
- <structfield>buffers_clean</structfield> <type>bigint</type>
- </para>
- <para>
- Number of buffers written by the background writer
- </para></entry>
- </row>
-
<row>
<entry role="catalog_table_entry"><para role="column_definition">
<structfield>maxwritten_clean</structfield> <type>bigint</type>
@@ -3448,35 +3430,6 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
</para></entry>
</row>
- <row>
- <entry role="catalog_table_entry"><para role="column_definition">
- <structfield>buffers_backend</structfield> <type>bigint</type>
- </para>
- <para>
- Number of buffers written directly by a backend
- </para></entry>
- </row>
-
- <row>
- <entry role="catalog_table_entry"><para role="column_definition">
- <structfield>buffers_backend_fsync</structfield> <type>bigint</type>
- </para>
- <para>
- Number of times a backend had to execute its own
- <function>fsync</function> call (normally the background writer handles those
- even when the backend does its own write)
- </para></entry>
- </row>
-
- <row>
- <entry role="catalog_table_entry"><para role="column_definition">
- <structfield>buffers_alloc</structfield> <type>bigint</type>
- </para>
- <para>
- Number of buffers allocated
- </para></entry>
- </row>
-
<row>
<entry role="catalog_table_entry"><para role="column_definition">
<structfield>stats_reset</structfield> <type>timestamp with time zone</type>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 86ca35121b..1a35e0336c 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1068,12 +1068,7 @@ CREATE VIEW pg_stat_bgwriter AS
pg_stat_get_bgwriter_requested_checkpoints() AS checkpoints_req,
pg_stat_get_checkpoint_write_time() AS checkpoint_write_time,
pg_stat_get_checkpoint_sync_time() AS checkpoint_sync_time,
- pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint,
- pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean,
pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
- pg_stat_get_buf_written_backend() AS buffers_backend,
- pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
- pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
CREATE VIEW pg_stat_buffers AS
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 9feca9ada2..dec325e40e 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -90,17 +90,9 @@
* requesting backends since the last checkpoint start. The flags are
* chosen so that OR'ing is the correct way to combine multiple requests.
*
- * num_backend_writes is used to count the number of buffer writes performed
- * by user backend processes. This counter should be wide enough that it
- * can't overflow during a single processing cycle. num_backend_fsync
- * counts the subset of those writes that also had to do their own fsync,
- * because the checkpointer failed to absorb their request.
- *
* The requests array holds fsync requests sent by backends and not yet
* absorbed by the checkpointer.
*
- * Unlike the checkpoint fields, num_backend_writes, num_backend_fsync, and
- * the requests fields are protected by CheckpointerCommLock.
*----------
*/
typedef struct
@@ -124,9 +116,6 @@ typedef struct
ConditionVariable start_cv; /* signaled when ckpt_started advances */
ConditionVariable done_cv; /* signaled when ckpt_done advances */
- uint32 num_backend_writes; /* counts user backend buffer writes */
- uint32 num_backend_fsync; /* counts user backend fsync calls */
-
int num_requests; /* current # of requests */
int max_requests; /* allocated array size */
CheckpointerRequest requests[FLEXIBLE_ARRAY_MEMBER];
@@ -1085,10 +1074,6 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
- /* Count all backend writes regardless of if they fit in the queue */
- if (!AmBackgroundWriterProcess())
- CheckpointerShmem->num_backend_writes++;
-
/*
* If the checkpointer isn't running or the request queue is full, the
* backend will have to perform its own fsync request. But before forcing
@@ -1098,13 +1083,11 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
(CheckpointerShmem->num_requests >= CheckpointerShmem->max_requests &&
!CompactCheckpointerRequestQueue()))
{
+ LWLockRelease(CheckpointerCommLock);
/*
* Count the subset of writes where backends have to do their own
* fsync
*/
- if (!AmBackgroundWriterProcess())
- CheckpointerShmem->num_backend_fsync++;
- LWLockRelease(CheckpointerCommLock);
pgstat_inc_ioop(IOOP_FSYNC, IOPATH_SHARED);
return false;
}
@@ -1261,15 +1244,6 @@ AbsorbSyncRequests(void)
LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
- /* Transfer stats counts into pending pgstats message */
- PendingCheckpointerStats.m_buf_written_backend
- += CheckpointerShmem->num_backend_writes;
- PendingCheckpointerStats.m_buf_fsync_backend
- += CheckpointerShmem->num_backend_fsync;
-
- CheckpointerShmem->num_backend_writes = 0;
- CheckpointerShmem->num_backend_fsync = 0;
-
/*
* We try to avoid holding the lock for a long time by copying the request
* array, and processing the requests after releasing the lock.
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 4bcd99bd6f..685d38ba15 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -5625,9 +5625,7 @@ pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
static void
pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
{
- globalStats.bgwriter.buf_written_clean += msg->m_buf_written_clean;
globalStats.bgwriter.maxwritten_clean += msg->m_maxwritten_clean;
- globalStats.bgwriter.buf_alloc += msg->m_buf_alloc;
}
/* ----------
@@ -5643,9 +5641,6 @@ pgstat_recv_checkpointer(PgStat_MsgCheckpointer *msg, int len)
globalStats.checkpointer.requested_checkpoints += msg->m_requested_checkpoints;
globalStats.checkpointer.checkpoint_write_time += msg->m_checkpoint_write_time;
globalStats.checkpointer.checkpoint_sync_time += msg->m_checkpoint_sync_time;
- globalStats.checkpointer.buf_written_checkpoints += msg->m_buf_written_checkpoints;
- globalStats.checkpointer.buf_written_backend += msg->m_buf_written_backend;
- globalStats.checkpointer.buf_fsync_backend += msg->m_buf_fsync_backend;
}
static void
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 6926fc5742..67447f997a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2164,7 +2164,6 @@ BufferSync(int flags)
if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
{
TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
- PendingCheckpointerStats.m_buf_written_checkpoints++;
num_written++;
}
}
@@ -2273,9 +2272,6 @@ BgBufferSync(WritebackContext *wb_context)
*/
strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
- /* Report buffer alloc counts to pgstat */
- PendingBgWriterStats.m_buf_alloc += recent_alloc;
-
/*
* If we're not running the LRU scan, just stop after doing the stats
* stuff. We mark the saved state invalid so that we can recover sanely
@@ -2472,8 +2468,6 @@ BgBufferSync(WritebackContext *wb_context)
reusable_buffers++;
}
- PendingBgWriterStats.m_buf_written_clean += num_written;
-
#ifdef BGW_DEBUG
elog(DEBUG1, "bgwriter: recent_alloc=%u smoothed=%.2f delta=%ld ahead=%d density=%.2f reusable_est=%d upcoming_est=%d scanned=%d wrote=%d reusable=%d",
recent_alloc, smoothed_alloc, strategy_delta, bufs_ahead,
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 0c370fdce2..36813dfe41 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1738,18 +1738,6 @@ pg_stat_get_bgwriter_requested_checkpoints(PG_FUNCTION_ARGS)
PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->requested_checkpoints);
}
-Datum
-pg_stat_get_bgwriter_buf_written_checkpoints(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_written_checkpoints);
-}
-
-Datum
-pg_stat_get_bgwriter_buf_written_clean(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_written_clean);
-}
-
Datum
pg_stat_get_bgwriter_maxwritten_clean(PG_FUNCTION_ARGS)
{
@@ -1778,24 +1766,6 @@ pg_stat_get_bgwriter_stat_reset_time(PG_FUNCTION_ARGS)
PG_RETURN_TIMESTAMPTZ(pgstat_fetch_stat_bgwriter()->stat_reset_timestamp);
}
-Datum
-pg_stat_get_buf_written_backend(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_written_backend);
-}
-
-Datum
-pg_stat_get_buf_fsync_backend(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_fsync_backend);
-}
-
-Datum
-pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
-}
-
/*
* When adding a new column to the pg_stat_buffers view, add a new enum
* value here above COLUMN_LENGTH.
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index fda916f911..6a83078b14 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5604,16 +5604,6 @@
proname => 'pg_stat_get_bgwriter_requested_checkpoints', provolatile => 's',
proparallel => 'r', prorettype => 'int8', proargtypes => '',
prosrc => 'pg_stat_get_bgwriter_requested_checkpoints' },
-{ oid => '2771',
- descr => 'statistics: number of buffers written by the bgwriter during checkpoints',
- proname => 'pg_stat_get_bgwriter_buf_written_checkpoints', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_bgwriter_buf_written_checkpoints' },
-{ oid => '2772',
- descr => 'statistics: number of buffers written by the bgwriter for cleaning dirty buffers',
- proname => 'pg_stat_get_bgwriter_buf_written_clean', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_bgwriter_buf_written_clean' },
{ oid => '2773',
descr => 'statistics: number of times the bgwriter stopped processing when it had written too many buffers while cleaning',
proname => 'pg_stat_get_bgwriter_maxwritten_clean', provolatile => 's',
@@ -5633,18 +5623,6 @@
proname => 'pg_stat_get_checkpoint_sync_time', provolatile => 's',
proparallel => 'r', prorettype => 'float8', proargtypes => '',
prosrc => 'pg_stat_get_checkpoint_sync_time' },
-{ oid => '2775', descr => 'statistics: number of buffers written by backends',
- proname => 'pg_stat_get_buf_written_backend', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_buf_written_backend' },
-{ oid => '3063',
- descr => 'statistics: number of backend buffer writes that did their own fsync',
- proname => 'pg_stat_get_buf_fsync_backend', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_buf_fsync_backend' },
-{ oid => '2859', descr => 'statistics: number of buffer allocations',
- proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
- prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
{ oid => '8459', descr => 'statistics: counts of all IO operations done to all IO paths by each type of backend.',
proname => 'pg_stat_get_buffers', provolatile => 's', proisstrict => 'f',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 303cddb4c0..cbc5cb2829 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -506,9 +506,7 @@ typedef struct PgStat_MsgBgWriter
{
PgStat_MsgHdr m_hdr;
- PgStat_Counter m_buf_written_clean;
PgStat_Counter m_maxwritten_clean;
- PgStat_Counter m_buf_alloc;
} PgStat_MsgBgWriter;
/* ----------
@@ -521,9 +519,6 @@ typedef struct PgStat_MsgCheckpointer
PgStat_Counter m_timed_checkpoints;
PgStat_Counter m_requested_checkpoints;
- PgStat_Counter m_buf_written_checkpoints;
- PgStat_Counter m_buf_written_backend;
- PgStat_Counter m_buf_fsync_backend;
PgStat_Counter m_checkpoint_write_time; /* times in milliseconds */
PgStat_Counter m_checkpoint_sync_time;
} PgStat_MsgCheckpointer;
@@ -899,9 +894,7 @@ typedef struct PgStat_ArchiverStats
*/
typedef struct PgStat_BgWriterStats
{
- PgStat_Counter buf_written_clean;
PgStat_Counter maxwritten_clean;
- PgStat_Counter buf_alloc;
TimestampTz stat_reset_timestamp;
} PgStat_BgWriterStats;
@@ -915,9 +908,6 @@ typedef struct PgStat_CheckpointerStats
PgStat_Counter requested_checkpoints;
PgStat_Counter checkpoint_write_time; /* times in milliseconds */
PgStat_Counter checkpoint_sync_time;
- PgStat_Counter buf_written_checkpoints;
- PgStat_Counter buf_written_backend;
- PgStat_Counter buf_fsync_backend;
} PgStat_CheckpointerStats;
/*
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 5e5a0324ee..090a65cdb0 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1821,12 +1821,7 @@ pg_stat_bgwriter| SELECT pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints
pg_stat_get_bgwriter_requested_checkpoints() AS checkpoints_req,
pg_stat_get_checkpoint_write_time() AS checkpoint_write_time,
pg_stat_get_checkpoint_sync_time() AS checkpoint_sync_time,
- pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint,
- pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean,
pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
- pg_stat_get_buf_written_backend() AS buffers_backend,
- pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
- pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
pg_stat_buffers| SELECT b.backend_type,
b.io_path,
--
2.32.0
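As a standalone illustration of the path that pgstat_inc_ioop(IOOP_FSYNC, IOPATH_SHARED) now counts (toy code with invented names, not the real checkpointer): when the sync-request queue is full, the backend falls back to doing its own fsync and records it in its per-backend counters instead of in checkpointer shared memory.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_REQUESTS 4			/* stand-in for CheckpointerShmem->max_requests */

static int num_requests;		/* pending forwarded sync requests */
static uint64_t my_fsync_count;	/* per-backend IOOP_FSYNC counter */

/* Returns true if the request was queued for the checkpointer. */
static bool
forward_sync_request(void)
{
	if (num_requests >= MAX_REQUESTS)
	{
		/* Queue full: caller must fsync itself; count it locally. */
		my_fsync_count++;
		return false;
	}
	num_requests++;
	return true;
}

/* Deterministic check: 6 requests into a queue of 4 -> 2 local fsyncs */
uint64_t
demo_overflow_fsyncs(void)
{
	num_requests = 0;
	my_fsync_count = 0;
	for (int i = 0; i < 6; i++)
		(void) forward_sync_request();
	return my_fsync_count;
}
```

This is why the patch can drop num_backend_fsync from shared memory: the fallback is attributed directly to the backend that performed it.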
Attachment: v15-0003-Send-IO-operations-to-stats-collector.patch (application/octet-stream)
From 7d9f669b0287713b4a42f794ce48b633fa9cb10a Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 24 Nov 2021 11:16:55 -0500
Subject: [PATCH v15 3/7] Send IO operations to stats collector
On exit, backends send the IO operations they have done on all IO Paths
to the stats collector. The stats collector adds these counts to its
existing counts stored in a global data structure it maintains and
persists.
PgStatIOOps contains the same information as backend_status.h's IOOps,
however IOOps' members must be atomics and the stats collector has no
such requirement.
---
src/backend/postmaster/pgstat.c | 90 +++++++++++++++++++++++++++++++--
src/include/miscadmin.h | 2 +
src/include/pgstat.h | 45 +++++++++++++++++
3 files changed, 134 insertions(+), 3 deletions(-)
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 8c166e5e16..aa538d175f 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -124,9 +124,12 @@ char *pgstat_stat_filename = NULL;
char *pgstat_stat_tmpname = NULL;
/*
- * BgWriter and WAL global statistics counters.
- * Stored directly in a stats message structure so they can be sent
- * without needing to copy things around. We assume these init to zeroes.
+ * BgWriter, Checkpointer, WAL, and I/O global statistics counters. I/O global
+ * statistics on various IO ops are tracked in PgBackendStatus while a backend
+ * is alive and are then sent to the stats collector in a PgStat_MsgIOPathOps
+ * message before the backend exits.
+ * All others are stored directly in a stats message structure so they can be
+ * sent without needing to copy things around. We assume these init to zeroes.
*/
PgStat_MsgBgWriter PendingBgWriterStats;
PgStat_MsgCheckpointer PendingCheckpointerStats;
@@ -362,6 +365,7 @@ static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
static void pgstat_recv_checkpointer(PgStat_MsgCheckpointer *msg, int len);
+static void pgstat_recv_io_path_ops(PgStat_MsgIOPathOps *msg, int len);
static void pgstat_recv_wal(PgStat_MsgWal *msg, int len);
static void pgstat_recv_slru(PgStat_MsgSLRU *msg, int len);
static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
@@ -2999,6 +3003,14 @@ pgstat_shutdown_hook(int code, Datum arg)
{
Assert(!pgstat_is_shutdown);
+ /*
+ * Only need to send stats on IO Ops for IO Paths when a process exits.
+ * Users requiring IO Ops for both live and exited backends can read from
+ * live backends' PgBackendStatus and sum this with totals from exited
+ * backends persisted by the stats collector.
+ */
+ pgstat_send_buffers();
+
/*
* If we got as far as discovering our own database ID, we can report what
* we did to the collector. Otherwise, we'd be sending an invalid
@@ -3148,6 +3160,31 @@ pgstat_send_bgwriter(void)
MemSet(&PendingBgWriterStats, 0, sizeof(PendingBgWriterStats));
}
+/*
+ * Before exiting, a backend sends its IO op statistics to the collector so
+ * that they may be persisted.
+ */
+void
+pgstat_send_buffers(void)
+{
+ PgStat_MsgIOPathOps msg;
+
+ PgBackendStatus *beentry = MyBEEntry;
+
+ if (!beentry)
+ return;
+
+ memset(&msg, 0, sizeof(msg));
+ msg.backend_type = beentry->st_backendType;
+
+ pgstat_sum_io_path_ops(msg.iop.io_path_ops,
+ (IOOps *) &beentry->io_path_stats);
+
+ pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_IO_PATH_OPS);
+ pgstat_send(&msg, sizeof(msg));
+}
+
+
/* ----------
* pgstat_send_checkpointer() -
*
@@ -3312,6 +3349,29 @@ pgstat_send_slru(void)
}
}
+/*
+ * Helper function to add all live IO Op stats for all IO Paths (e.g. shared,
+ * local) into the equivalent stats structure for exited backends. Note that
+ * this adds rather than sets, so the destination stats structure should be
+ * zeroed out by the caller initially. This is commonly used to transfer all
+ * IO Op stats for all IO Paths for a particular backend type into the
+ * pgstats structure.
+ */
+void
+pgstat_sum_io_path_ops(PgStatIOOps *dest, IOOps *src)
+{
+ int io_path;
+ for (io_path = 0; io_path < IOPATH_NUM_TYPES; io_path++)
+ {
+ dest->allocs += pg_atomic_read_u64(&src->allocs);
+ dest->extends += pg_atomic_read_u64(&src->extends);
+ dest->fsyncs += pg_atomic_read_u64(&src->fsyncs);
+ dest->writes += pg_atomic_read_u64(&src->writes);
+ dest++;
+ src++;
+ }
+
+}
/* ----------
* PgstatCollectorMain() -
@@ -3522,6 +3582,10 @@ PgstatCollectorMain(int argc, char *argv[])
pgstat_recv_checkpointer(&msg.msg_checkpointer, len);
break;
+ case PGSTAT_MTYPE_IO_PATH_OPS:
+ pgstat_recv_io_path_ops(&msg.msg_io_path_ops, len);
+ break;
+
case PGSTAT_MTYPE_WAL:
pgstat_recv_wal(&msg.msg_wal, len);
break;
@@ -5512,6 +5576,26 @@ pgstat_recv_checkpointer(PgStat_MsgCheckpointer *msg, int len)
globalStats.checkpointer.buf_fsync_backend += msg->m_buf_fsync_backend;
}
+static void
+pgstat_recv_io_path_ops(PgStat_MsgIOPathOps *msg, int len)
+{
+ int io_path;
+ PgStatIOOps *src_io_path_ops = msg->iop.io_path_ops;
+ PgStatIOOps *dest_io_path_ops =
+ globalStats.buffers.ops[msg->backend_type].io_path_ops;
+
+ for (io_path = 0; io_path < IOPATH_NUM_TYPES; io_path++)
+ {
+ PgStatIOOps *src = &src_io_path_ops[io_path];
+ PgStatIOOps *dest = &dest_io_path_ops[io_path];
+
+ dest->allocs += src->allocs;
+ dest->extends += src->extends;
+ dest->fsyncs += src->fsyncs;
+ dest->writes += src->writes;
+ }
+}
+
/* ----------
* pgstat_recv_wal() -
*
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 90a3016065..6785fb3813 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -338,6 +338,8 @@ typedef enum BackendType
B_LOGGER,
} BackendType;
+#define BACKEND_NUM_TYPES (B_LOGGER + 1)
+
extern BackendType MyBackendType;
extern const char *GetBackendTypeDesc(BackendType backendType);
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index bcd3588ea2..aeb6f52fdd 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -72,6 +72,7 @@ typedef enum StatMsgType
PGSTAT_MTYPE_ARCHIVER,
PGSTAT_MTYPE_BGWRITER,
PGSTAT_MTYPE_CHECKPOINTER,
+ PGSTAT_MTYPE_IO_PATH_OPS,
PGSTAT_MTYPE_WAL,
PGSTAT_MTYPE_SLRU,
PGSTAT_MTYPE_FUNCSTAT,
@@ -331,6 +332,46 @@ typedef struct PgStat_MsgDropdb
} PgStat_MsgDropdb;
+/*
+ * Structure for counting all types of IO ops in the stats collector
+ */
+typedef struct PgStatIOOps
+{
+ PgStat_Counter allocs;
+ PgStat_Counter extends;
+ PgStat_Counter fsyncs;
+ PgStat_Counter writes;
+} PgStatIOOps;
+
+/*
+ * Structure for counting all IO Ops on all types of IO Paths.
+ */
+typedef struct PgStatIOPathOps
+{
+ PgStatIOOps io_path_ops[IOPATH_NUM_TYPES];
+} PgStatIOPathOps;
+
+/*
+ * Sent by a backend to the stats collector to report all IO Ops for all IO
+ * Paths for a given type of a backend. This will happen when the backend exits.
+ */
+typedef struct PgStat_MsgIOPathOps
+{
+ PgStat_MsgHdr m_hdr;
+
+ BackendType backend_type;
+ PgStatIOPathOps iop;
+} PgStat_MsgIOPathOps;
+
+/*
+ * Structure used by stats collector to keep track of all types of exited
+ * backends' IO Ops for all IO Paths.
+ */
+typedef struct PgStat_BackendIOPathOps
+{
+ PgStatIOPathOps ops[BACKEND_NUM_TYPES];
+} PgStat_BackendIOPathOps;
+
/* ----------
* PgStat_MsgResetcounter Sent by the backend to tell the collector
* to reset counters
@@ -703,6 +744,7 @@ typedef union PgStat_Msg
PgStat_MsgArchiver msg_archiver;
PgStat_MsgBgWriter msg_bgwriter;
PgStat_MsgCheckpointer msg_checkpointer;
+ PgStat_MsgIOPathOps msg_io_path_ops;
PgStat_MsgWal msg_wal;
PgStat_MsgSLRU msg_slru;
PgStat_MsgFuncstat msg_funcstat;
@@ -879,6 +921,7 @@ typedef struct PgStat_GlobalStats
PgStat_CheckpointerStats checkpointer;
PgStat_BgWriterStats bgwriter;
+ PgStat_BackendIOPathOps buffers;
} PgStat_GlobalStats;
/*
@@ -1118,8 +1161,10 @@ extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
extern void pgstat_send_archiver(const char *xlog, bool failed);
extern void pgstat_send_bgwriter(void);
+extern void pgstat_send_buffers(void);
extern void pgstat_send_checkpointer(void);
extern void pgstat_send_wal(bool force);
+extern void pgstat_sum_io_path_ops(PgStatIOOps *dest, IOOps *src);
/* ----------
* Support functions for the SQL-callable functions to
--
2.32.0
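The commit message's distinction between IOOps (atomics, updated by the live backend) and PgStatIOOps (plain counters in the collector) can be sketched like this (a toy C11 analogue with invented types, not the real pgstat_sum_io_path_ops): summing reads each live counter with an atomic load and accumulates into ordinary integers.

```c
#include <stdatomic.h>
#include <stdint.h>

#define IOPATH_NUM_TYPES 2		/* stand-ins for IOPATH_SHARED, IOPATH_LOCAL */

/* Live counters: concurrently incremented by the owning backend */
typedef struct LiveIOOps
{
	atomic_uint_least64_t writes;
	atomic_uint_least64_t extends;
} LiveIOOps;

/* Collector-side counters: single-threaded, no atomics needed */
typedef struct SavedIOOps
{
	uint64_t	writes;
	uint64_t	extends;
} SavedIOOps;

/* Add one backend's live stats for every IO path into the saved totals. */
static void
sum_io_path_ops(SavedIOOps *dest, LiveIOOps *src)
{
	for (int p = 0; p < IOPATH_NUM_TYPES; p++, dest++, src++)
	{
		dest->writes += atomic_load(&src->writes);
		dest->extends += atomic_load(&src->extends);
	}
}

/* Deterministic check: (3 + 10) + (4 + 1) = 18 total writes */
uint64_t
demo_total_writes(void)
{
	LiveIOOps	live[IOPATH_NUM_TYPES] = {0};
	SavedIOOps	saved[IOPATH_NUM_TYPES] = {{3, 0}, {4, 0}};

	atomic_store(&live[0].writes, 10);
	atomic_store(&live[1].writes, 1);
	sum_io_path_ops(saved, live);
	return saved[0].writes + saved[1].writes;
}
```

Because the destination is add-into rather than set, callers zero it first, which matches the comment on pgstat_sum_io_path_ops in the patch.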
Attachment: v15-0005-Add-system-view-tracking-IO-ops-per-backend-type.patch (application/octet-stream)
From 61bfc251e3e06c08210b01e1447fe7431b9b7fe5 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 24 Nov 2021 12:07:37 -0500
Subject: [PATCH v15 5/7] Add system view tracking IO ops per backend type
Add pg_stat_buffers, a system view which tracks the number of IO
operations (allocs, writes, fsyncs, and extends) done through each IO
path (e.g. shared buffers, local buffers, unbuffered IO) by each type of
backend.
Some of these should always be zero. For example, checkpointer does not
use a BufferAccessStrategy (currently), so the "strategy" IO path for
checkpointer will be 0 for all IO operations (alloc, write, fsync, and
extend). All possible combinations of IO Path and IO Op are enumerated
in the view but not all are populated or even possible at this point.
All backends increment a counter in their PgBackendStatus when
performing an IO operation. On exit, backends send these stats to the
stats collector to be persisted.
When the pg_stat_buffers view is queried, one backend will sum live
backends' stats with saved stats from exited backends and subtract saved
reset stats, returning the total.
Each row of the view is stats for a particular backend type for a
particular IO path (e.g. shared buffer accesses by checkpointer) and
each column in the view is the total number of IO operations done (e.g.
writes).
So a cell in the view would be, for example, the number of shared
buffers written by checkpointer since the last stats reset.
Discussion: https://www.postgresql.org/message-id/flat/20210415235954.qcypb4urtovzkat5%40alap3.anarazel.de#724d5cce4bcb587f9167b80a5824bc5c
---
doc/src/sgml/monitoring.sgml | 110 +++++++++++++-
src/backend/catalog/system_views.sql | 11 ++
src/backend/postmaster/pgstat.c | 13 ++
src/backend/utils/activity/backend_status.c | 20 ++-
src/backend/utils/adt/pgstatfuncs.c | 151 ++++++++++++++++++++
src/include/catalog/pg_proc.dat | 9 ++
src/include/pgstat.h | 1 +
src/include/utils/backend_status.h | 5 +
src/test/regress/expected/rules.out | 8 ++
9 files changed, 325 insertions(+), 3 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index d4dd5d3623..56d2fd884f 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -435,6 +435,15 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</entry>
</row>
+ <row>
+ <entry><structname>pg_stat_buffers</structname><indexterm><primary>pg_stat_buffers</primary></indexterm></entry>
+ <entry>A row for each IO path for each backend type showing
+ statistics about backend IO operations. See
+ <link linkend="monitoring-pg-stat-buffers-view">
+ <structname>pg_stat_buffers</structname></link> for details.
+ </entry>
+ </row>
+
<row>
<entry><structname>pg_stat_wal</structname><indexterm><primary>pg_stat_wal</primary></indexterm></entry>
<entry>One row only, showing statistics about WAL activity. See
@@ -3482,6 +3491,101 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
</sect2>
+ <sect2 id="monitoring-pg-stat-buffers-view">
+ <title><structname>pg_stat_buffers</structname></title>
+
+ <indexterm>
+ <primary>pg_stat_buffers</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_stat_buffers</structname> view has a row for each backend
+ type for each possible IO path, containing global data for the cluster for
+ that backend and IO path.
+ </para>
+
+ <table id="pg-stat-buffers-view" xreflabel="pg_stat_buffers">
+ <title><structname>pg_stat_buffers</structname> View</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>backend_type</structfield> <type>text</type>
+ </para>
+ <para>
+ Type of backend (e.g. background worker, autovacuum worker).
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>io_path</structfield> <type>text</type>
+ </para>
+ <para>
+ IO path taken (e.g. shared buffers, direct).
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>alloc</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers allocated.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>extend</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers extended.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>fsync</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers fsynced.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>write</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers written.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
+ </para>
+ <para>
+ Time at which these statistics were last reset.
+ </para></entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ </sect2>
+
<sect2 id="monitoring-pg-stat-wal-view">
<title><structname>pg_stat_wal</structname></title>
@@ -5082,8 +5186,10 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
all the counters shown in
the <structname>pg_stat_bgwriter</structname>
view, <literal>archiver</literal> to reset all the counters shown in
- the <structname>pg_stat_archiver</structname> view or <literal>wal</literal>
- to reset all the counters shown in the <structname>pg_stat_wal</structname> view.
+ the <structname>pg_stat_archiver</structname> view, <literal>wal</literal>
+ to reset all the counters shown in the <structname>pg_stat_wal</structname> view,
+ or <literal>buffers</literal> to reset all the counters shown in the
+ <structname>pg_stat_buffers</structname> view.
</para>
<para>
This function is restricted to superusers by default, but other users
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index eb560955cd..86ca35121b 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1076,6 +1076,17 @@ CREATE VIEW pg_stat_bgwriter AS
pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+CREATE VIEW pg_stat_buffers AS
+SELECT
+ b.backend_type,
+ b.io_path,
+ b.alloc,
+ b.extend,
+ b.fsync,
+ b.write,
+ b.stats_reset
+FROM pg_stat_get_buffers() b;
+
CREATE VIEW pg_stat_wal AS
SELECT
w.wal_records,
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 0d18a4dc02..4bcd99bd6f 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -2794,6 +2794,19 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
rec->tuples_inserted + rec->tuples_updated;
}
+/*
+ * Support function for SQL-callable pgstat* functions. Returns a pointer to
+ * the PgStat_BackendIOPathOps structure tracking IO op statistics for both
+ * exited backends and reset arithmetic.
+ */
+PgStat_BackendIOPathOps *
+pgstat_fetch_exited_backend_buffers(void)
+{
+ backend_read_statsfile();
+
+ return &globalStats.buffers;
+}
+
/* ----------
* pgstat_fetch_stat_dbentry() -
diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index 1617033e26..c3e5d23f99 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -50,7 +50,7 @@ int pgstat_track_activity_query_size = 1024;
PgBackendStatus *MyBEEntry = NULL;
-static PgBackendStatus *BackendStatusArray = NULL;
+PgBackendStatus *BackendStatusArray = NULL;
static char *BackendAppnameBuffer = NULL;
static char *BackendClientHostnameBuffer = NULL;
static char *BackendActivityBuffer = NULL;
@@ -236,6 +236,24 @@ CreateSharedBackendStatus(void)
#endif
}
+const char *
+GetIOPathDesc(IOPath io_path)
+{
+
+ switch (io_path)
+ {
+ case IOPATH_DIRECT:
+ return "direct";
+ case IOPATH_LOCAL:
+ return "local";
+ case IOPATH_SHARED:
+ return "shared";
+ case IOPATH_STRATEGY:
+ return "strategy";
+ }
+ return "unknown IO path";
+}
+
/*
* Initialize pgstats backend activity state, and set up our on-proc-exit
* hook. Called from InitPostgres and AuxiliaryProcessMain. For auxiliary
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index e64857e540..0c370fdce2 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1796,6 +1796,157 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
}
+/*
+* When adding a new column to the pg_stat_buffers view, add a new enum
+* value here above COLUMN_LENGTH.
+*/
+enum
+{
+ COLUMN_BACKEND_TYPE,
+ COLUMN_IO_PATH,
+ COLUMN_ALLOCS,
+ COLUMN_EXTENDS,
+ COLUMN_FSYNCS,
+ COLUMN_WRITES,
+ COLUMN_RESET_TIME,
+ COLUMN_LENGTH,
+};
+
+#define NROWS ((BACKEND_NUM_TYPES - 1) * IOPATH_NUM_TYPES)
+/*
+ * Helper function to get the correct row in the pg_stat_buffers view.
+ */
+static inline Datum *
+get_pg_stat_buffers_row(Datum all_values[NROWS][COLUMN_LENGTH], BackendType backend_type, IOPath io_path)
+{
+ /*
+ * Subtract 1 from backend_type to avoid having rows for B_INVALID
+ * BackendType
+ */
+ return all_values[(backend_type - 1) * IOPATH_NUM_TYPES + io_path];
+}
+
+Datum
+pg_stat_get_buffers(PG_FUNCTION_ARGS)
+{
+ PgStat_BackendIOPathOps *backend_io_path_ops;
+ PgBackendStatus *beentry;
+ int backend_type, io_path;
+ int i;
+ Datum reset_time;
+
+ ReturnSetInfo *rsinfo;
+ TupleDesc tupdesc;
+ Tuplestorestate *tupstore;
+ MemoryContext per_query_ctx;
+ MemoryContext oldcontext;
+
+ Datum all_values[NROWS][COLUMN_LENGTH];
+ bool all_nulls[NROWS][COLUMN_LENGTH];
+
+ rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+
+ /* check to see if caller supports us returning a tuplestore */
+ if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("set-valued function called in context that cannot accept a set")));
+ if (!(rsinfo->allowedModes & SFRM_Materialize))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("materialize mode required, but it is not allowed in this context")));
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+ oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+ tupstore = tuplestore_begin_heap(true, false, work_mem);
+ rsinfo->returnMode = SFRM_Materialize;
+ rsinfo->setResult = tupstore;
+ rsinfo->setDesc = tupdesc;
+ MemoryContextSwitchTo(oldcontext);
+
+ memset(all_values, 0, sizeof(all_values));
+ memset(all_nulls, 0, sizeof(all_nulls));
+
+ /*
+ * Loop through all live backends and count their IO Ops for each IO Path
+ */
+ beentry = BackendStatusArray;
+
+ for (i = 0; i < MaxBackends + NUM_AUXPROCTYPES; i++)
+ {
+ IOOps *io_ops;
+
+ beentry++;
+ /* Don't count dead backends. They should already be counted */
+ if (beentry->st_procpid == 0)
+ continue;
+
+ io_ops = beentry->io_path_stats;
+
+ for (io_path = 0; io_path < IOPATH_NUM_TYPES; io_path++)
+ {
+ Datum *values = get_pg_stat_buffers_row(all_values, beentry->st_backendType, io_path);
+
+ /*
+ * COLUMN_RESET_TIME, COLUMN_BACKEND_TYPE, and COLUMN_IO_PATH will
+ * all be set when looping through exited backends array
+ */
+ values[COLUMN_ALLOCS] += pg_atomic_read_u64(&io_ops->allocs);
+ values[COLUMN_EXTENDS] += pg_atomic_read_u64(&io_ops->extends);
+ values[COLUMN_FSYNCS] += pg_atomic_read_u64(&io_ops->fsyncs);
+ values[COLUMN_WRITES] += pg_atomic_read_u64(&io_ops->writes);
+ io_ops++;
+ }
+ }
+
+ /* Add stats from all exited backends */
+ backend_io_path_ops = pgstat_fetch_exited_backend_buffers();
+
+ reset_time = TimestampTzGetDatum(backend_io_path_ops->stat_reset_timestamp);
+
+ /* 0 is not a valid BackendType */
+ for (backend_type = 1; backend_type < BACKEND_NUM_TYPES; backend_type++)
+ {
+ PgStatIOOps *io_ops = backend_io_path_ops->ops[backend_type].io_path_ops;
+ PgStatIOOps *resets = backend_io_path_ops->resets[backend_type].io_path_ops;
+
+ Datum backend_type_desc = CStringGetTextDatum(GetBackendTypeDesc(backend_type));
+
+ for (io_path = 0; io_path < IOPATH_NUM_TYPES; io_path++)
+ {
+ Datum *values = get_pg_stat_buffers_row(all_values, backend_type, io_path);
+
+ values[COLUMN_BACKEND_TYPE] = backend_type_desc;
+ values[COLUMN_IO_PATH] = CStringGetTextDatum(GetIOPathDesc(io_path));
+ values[COLUMN_ALLOCS] = values[COLUMN_ALLOCS] + io_ops->allocs - resets->allocs;
+ values[COLUMN_EXTENDS] = values[COLUMN_EXTENDS] + io_ops->extends - resets->extends;
+ values[COLUMN_FSYNCS] = values[COLUMN_FSYNCS] + io_ops->fsyncs - resets->fsyncs;
+ values[COLUMN_WRITES] = values[COLUMN_WRITES] + io_ops->writes - resets->writes;
+ values[COLUMN_RESET_TIME] = reset_time;
+ io_ops++;
+ resets++;
+ }
+ }
+
+ for (i = 0; i < NROWS; i++)
+ {
+ Datum *values = all_values[i];
+ bool *nulls = all_nulls[i];
+
+ tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+ }
+
+ /* clean up and return the tuplestore */
+ tuplestore_donestoring(tupstore);
+
+ return (Datum) 0;
+}
+
/*
* Returns statistics of WAL activity
*/
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index e934361dc3..fda916f911 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5646,6 +5646,15 @@
proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
+{ oid => '8459', descr => 'statistics: counts of all IO operations done to all IO paths by each type of backend.',
+ proname => 'pg_stat_get_buffers', provolatile => 's', proisstrict => 'f',
+ prorows => '52', proretset => 't',
+ proparallel => 'r', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,text,int8,int8,int8,int8,timestamptz}',
+ proargmodes => '{o,o,o,o,o,o,o}',
+ proargnames => '{backend_type,io_path,alloc,extend,fsync,write,stats_reset}',
+ prosrc => 'pg_stat_get_buffers' },
+
{ oid => '1136', descr => 'statistics: information about WAL activity',
proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
proparallel => 'r', prorettype => 'record', proargtypes => '',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 8c291f1f0d..303cddb4c0 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1179,6 +1179,7 @@ extern void pgstat_sum_io_path_ops(PgStatIOOps *dest, IOOps *src);
* generate the pgstat* views.
* ----------
*/
+extern PgStat_BackendIOPathOps *pgstat_fetch_exited_backend_buffers(void);
extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid);
extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
diff --git a/src/include/utils/backend_status.h b/src/include/utils/backend_status.h
index 9c997cace8..dd983fc949 100644
--- a/src/include/utils/backend_status.h
+++ b/src/include/utils/backend_status.h
@@ -321,6 +321,7 @@ extern PGDLLIMPORT int pgstat_track_activity_query_size;
* ----------
*/
extern PGDLLIMPORT PgBackendStatus *MyBEEntry;
+extern PGDLLIMPORT PgBackendStatus *BackendStatusArray;
/* ----------
@@ -336,6 +337,10 @@ extern void CreateSharedBackendStatus(void);
* ----------
*/
+/* Utility functions */
+extern const char *GetIOPathDesc(IOPath io_path);
+
+
/* Initialization functions */
extern void pgstat_beinit(void);
extern void pgstat_bestart(void);
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 2fa00a3c29..5e5a0324ee 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1828,6 +1828,14 @@ pg_stat_bgwriter| SELECT pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints
pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+pg_stat_buffers| SELECT b.backend_type,
+ b.io_path,
+ b.alloc,
+ b.extend,
+ b.fsync,
+ b.write,
+ b.stats_reset
+ FROM pg_stat_get_buffers() b(backend_type, io_path, alloc, extend, fsync, write, stats_reset);
pg_stat_database| SELECT d.oid AS datid,
d.datname,
CASE
--
2.32.0
Attachment: v15-0002-Add-IO-operation-counters-to-PgBackendStatus.patch
From cd9d572a14aa7f9f968b0b2b41da5673a7f67054 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 24 Nov 2021 10:32:56 -0500
Subject: [PATCH v15 2/7] Add IO operation counters to PgBackendStatus
Add an array of counters in PgBackendStatus which count the buffers
allocated, extended, fsynced, and written by a given backend. Each "IO
Op" (alloc, fsync, extend, write) is counted per "IO Path" (direct,
local, shared, or strategy). "local" and "shared" IO Path counters count
operations on local and shared buffers. The "strategy" IO Path counts
buffers alloc'd/written/read/fsync'd as part of a BufferAccessStrategy.
The "direct" IO Path counts blocks of IO which are read, written, or
fsync'd using smgrwrite/extend/immedsync directly (as opposed to through
[Local]BufferAlloc()).
With this commit, all backends increment a counter in their
PgBackendStatus when performing an IO operation. This is in preparation
for future commits which will persist these stats upon backend exit and
use the counters to provide observability of database IO operations.
Note that this commit does not add code to increment the "direct" path.
A separate proposed patch [1] which would add wrappers for smgrwrite(),
smgrextend(), and smgrimmedsync() would provide a good location to call
pgstat_inc_ioop() for unbuffered IO and avoid regressions for future
users of these functions.
[1] https://www.postgresql.org/message-id/CAAKRu_aw72w70X1P%3Dba20K8iGUvSkyz7Yk03wPPh3f9WgmcJ3g%40mail.gmail.com
---
src/backend/postmaster/checkpointer.c | 1 +
src/backend/storage/buffer/bufmgr.c | 46 ++++++++++---
src/backend/storage/buffer/freelist.c | 23 ++++++-
src/backend/storage/buffer/localbuf.c | 3 +
src/backend/storage/sync/sync.c | 1 +
src/backend/utils/activity/backend_status.c | 10 +++
src/include/storage/buf_internals.h | 4 +-
src/include/utils/backend_status.h | 73 +++++++++++++++++++++
8 files changed, 146 insertions(+), 15 deletions(-)
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index be7366379d..9feca9ada2 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -1105,6 +1105,7 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
if (!AmBackgroundWriterProcess())
CheckpointerShmem->num_backend_fsync++;
LWLockRelease(CheckpointerCommLock);
+ pgstat_inc_ioop(IOOP_FSYNC, IOPATH_SHARED);
return false;
}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 08ebabfe96..6926fc5742 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -480,7 +480,7 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
BlockNumber blockNum,
BufferAccessStrategy strategy,
bool *foundPtr);
-static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+static void FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOPath iopath);
static void FindAndDropRelFileNodeBuffers(RelFileNode rnode,
ForkNumber forkNum,
BlockNumber nForkBlock,
@@ -972,6 +972,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (isExtend)
{
+ pgstat_inc_ioop(IOOP_EXTEND, IOPATH_SHARED);
/* new buffers are zero-filled */
MemSet((char *) bufBlock, 0, BLCKSZ);
/* don't set checksum for all-zero page */
@@ -1172,6 +1173,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
/* Loop here in case we have to try another victim buffer */
for (;;)
{
+ bool from_ring;
+
/*
* Ensure, while the spinlock's not yet held, that there's a free
* refcount entry.
@@ -1182,7 +1185,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* Select a victim buffer. The buffer is returned with its header
* spinlock still held!
*/
- buf = StrategyGetBuffer(strategy, &buf_state);
+ buf = StrategyGetBuffer(strategy, &buf_state, &from_ring);
Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
@@ -1219,6 +1222,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf),
LW_SHARED))
{
+ IOPath iopath;
+
/*
* If using a nondefault strategy, and writing the buffer
* would require a WAL flush, let the strategy decide whether
@@ -1236,7 +1241,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
UnlockBufHdr(buf, buf_state);
if (XLogNeedsFlush(lsn) &&
- StrategyRejectBuffer(strategy, buf))
+ StrategyRejectBuffer(strategy, buf, &from_ring))
{
/* Drop lock/pin and loop around for another buffer */
LWLockRelease(BufferDescriptorGetContentLock(buf));
@@ -1245,13 +1250,26 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
}
}
+ /*
+ * When a strategy is in use, if the dirty buffer was selected
+ * from the strategy ring and we did not bother checking the
+ * freelist or doing a clock sweep to look for a clean shared
+ * buffer to use, the write will be counted as a strategy
+ * write. However, if the dirty buffer was obtained from the
+ * freelist or a clock sweep, it is counted as a regular
+ * write. When a strategy is not in use, at this point, the
+ * write can only be a "regular" write of a dirty buffer.
+ */
+
+ iopath = from_ring ? IOPATH_STRATEGY : IOPATH_SHARED;
+
/* OK, do the I/O */
TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(forkNum, blockNum,
smgr->smgr_rnode.node.spcNode,
smgr->smgr_rnode.node.dbNode,
smgr->smgr_rnode.node.relNode);
- FlushBuffer(buf, NULL);
+ FlushBuffer(buf, NULL, iopath);
LWLockRelease(BufferDescriptorGetContentLock(buf));
ScheduleBufferTagForWriteback(&BackendWritebackContext,
@@ -2552,10 +2570,11 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
* Pin it, share-lock it, write it. (FlushBuffer will do nothing if the
* buffer is clean by the time we've locked it.)
*/
+
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOPATH_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
@@ -2803,9 +2822,12 @@ BufferGetTag(Buffer buffer, RelFileNode *rnode, ForkNumber *forknum,
*
* If the caller has an smgr reference for the buffer's relation, pass it
* as the second parameter. If not, pass NULL.
+ *
+ * IOPath will always be IOPATH_SHARED except when a buffer access strategy is
+ * used and the buffer being flushed is a buffer from the strategy ring.
*/
static void
-FlushBuffer(BufferDesc *buf, SMgrRelation reln)
+FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOPath iopath)
{
XLogRecPtr recptr;
ErrorContextCallback errcallback;
@@ -2897,6 +2919,8 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
bufToWrite,
false);
+ pgstat_inc_ioop(IOOP_WRITE, iopath);
+
if (track_io_timing)
{
INSTR_TIME_SET_CURRENT(io_time);
@@ -3544,6 +3568,8 @@ FlushRelationBuffers(Relation rel)
localpage,
false);
+ pgstat_inc_ioop(IOOP_WRITE, IOPATH_LOCAL);
+
buf_state &= ~(BM_DIRTY | BM_JUST_DIRTIED);
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
@@ -3579,7 +3605,7 @@ FlushRelationBuffers(Relation rel)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, RelationGetSmgr(rel));
+ FlushBuffer(bufHdr, RelationGetSmgr(rel), IOPATH_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
@@ -3675,7 +3701,7 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, srelent->srel);
+ FlushBuffer(bufHdr, srelent->srel, IOPATH_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
@@ -3731,7 +3757,7 @@ FlushDatabaseBuffers(Oid dbid)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOPATH_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
@@ -3758,7 +3784,7 @@ FlushOneBuffer(Buffer buffer)
Assert(LWLockHeldByMe(BufferDescriptorGetContentLock(bufHdr)));
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOPATH_SHARED);
}
/*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 6be80476db..c7ca8d75aa 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -19,6 +19,7 @@
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/proc.h"
+#include "utils/backend_status.h"
#define INT_ACCESS_ONCE(var) ((int)(*((volatile int *)&(var))))
@@ -198,7 +199,7 @@ have_free_buffer(void)
* return the buffer with the buffer header spinlock still held.
*/
BufferDesc *
-StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
+StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_ring)
{
BufferDesc *buf;
int bgwprocno;
@@ -212,7 +213,8 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
if (strategy != NULL)
{
buf = GetBufferFromRing(strategy, buf_state);
- if (buf != NULL)
+ *from_ring = buf != NULL;
+ if (*from_ring)
return buf;
}
@@ -247,6 +249,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
* the rate of buffer consumption. Note that buffers recycled by a
* strategy object are intentionally not counted here.
*/
+ pgstat_inc_ioop(IOOP_ALLOC, IOPATH_SHARED);
pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);
/*
@@ -683,8 +686,14 @@ AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
* if this buffer should be written and re-used.
*/
bool
-StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
+StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool *from_ring)
{
+ /*
+ * Start by assuming that we will use the dirty buffer selected by
+ * StrategyGetBuffer().
+ */
+ *from_ring = true;
+
/* We only do this in bulkread mode */
if (strategy->btype != BAS_BULKREAD)
return false;
@@ -700,5 +709,13 @@ StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
*/
strategy->buffers[strategy->current] = InvalidBuffer;
+ /*
+ * Since we will not be writing out a dirty buffer from the ring, set
+ * from_ring to false so that the caller does not count this write as a
+ * "strategy write" and can do proper bookkeeping.
+ */
+ *from_ring = false;
+
+
return true;
}
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 04b3558ea3..f396a2b68d 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -20,6 +20,7 @@
#include "executor/instrument.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
+#include "utils/backend_status.h"
#include "utils/guc.h"
#include "utils/memutils.h"
#include "utils/resowner_private.h"
@@ -226,6 +227,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
localpage,
false);
+ pgstat_inc_ioop(IOOP_WRITE, IOPATH_LOCAL);
+
/* Mark not-dirty now in case we error out below */
buf_state &= ~BM_DIRTY;
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index d4083e8a56..3be06d5d5a 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -420,6 +420,7 @@ ProcessSyncRequests(void)
total_elapsed += elapsed;
processed++;
+ pgstat_inc_ioop(IOOP_FSYNC, IOPATH_SHARED);
if (log_checkpoints)
elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f ms",
processed,
diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index 7229598822..0683f243dc 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -293,6 +293,7 @@ pgstat_bestart(void)
{
volatile PgBackendStatus *vbeentry = MyBEEntry;
PgBackendStatus lbeentry;
+ int io_path;
#ifdef USE_SSL
PgBackendSSLStatus lsslstatus;
#endif
@@ -399,6 +400,15 @@ pgstat_bestart(void)
lbeentry.st_progress_command = PROGRESS_COMMAND_INVALID;
lbeentry.st_progress_command_target = InvalidOid;
lbeentry.st_query_id = UINT64CONST(0);
+ for (io_path = 0; io_path < IOPATH_NUM_TYPES; io_path++)
+ {
+ IOOps *io_ops = &lbeentry.io_path_stats[io_path];
+
+ pg_atomic_init_u64(&io_ops->allocs, 0);
+ pg_atomic_init_u64(&io_ops->extends, 0);
+ pg_atomic_init_u64(&io_ops->fsyncs, 0);
+ pg_atomic_init_u64(&io_ops->writes, 0);
+ }
/*
* we don't zero st_progress_param here to save cycles; nobody should
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 33fcaf5c9a..7e385135db 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -310,10 +310,10 @@ extern void ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *
/* freelist.c */
extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
- uint32 *buf_state);
+ uint32 *buf_state, bool *from_ring);
extern void StrategyFreeBuffer(BufferDesc *buf);
extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
- BufferDesc *buf);
+ BufferDesc *buf, bool *from_ring);
extern int StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
extern void StrategyNotifyBgWriter(int bgwprocno);
diff --git a/src/include/utils/backend_status.h b/src/include/utils/backend_status.h
index 8042b817df..717f58d4cc 100644
--- a/src/include/utils/backend_status.h
+++ b/src/include/utils/backend_status.h
@@ -13,6 +13,7 @@
#include "datatype/timestamp.h"
#include "libpq/pqcomm.h"
#include "miscadmin.h" /* for BackendType */
+#include "port/atomics.h"
#include "utils/backend_progress.h"
@@ -31,12 +32,48 @@ typedef enum BackendState
STATE_DISABLED
} BackendState;
+/* ----------
+ * IO Stats reporting utility types
+ * ----------
+ */
+
+typedef enum IOOp
+{
+ IOOP_ALLOC,
+ IOOP_EXTEND,
+ IOOP_FSYNC,
+ IOOP_WRITE,
+} IOOp;
+
+#define IOOP_NUM_TYPES (IOOP_WRITE + 1)
+
+typedef enum IOPath
+{
+ IOPATH_DIRECT,
+ IOPATH_LOCAL,
+ IOPATH_SHARED,
+ IOPATH_STRATEGY,
+} IOPath;
+
+#define IOPATH_NUM_TYPES (IOPATH_STRATEGY + 1)
+
/* ----------
* Shared-memory data structures
* ----------
*/
+/*
+ * Structure for counting all types of IOOps for a live backend.
+ */
+typedef struct IOOps
+{
+ pg_atomic_uint64 allocs;
+ pg_atomic_uint64 extends;
+ pg_atomic_uint64 fsyncs;
+ pg_atomic_uint64 writes;
+} IOOps;
+
/*
* PgBackendSSLStatus
*
@@ -168,6 +205,16 @@ typedef struct PgBackendStatus
/* query identifier, optionally computed using post_parse_analyze_hook */
uint64 st_query_id;
+
+ /*
+ * Stats on all IO Ops for all IO Paths for this backend. When the
+ * pg_stat_buffers view is queried and when stats are reset, one backend
+ * will read io_path_stats from all live backends and combine them with
+ * io_path_stats from exited backends for each backend type. When this
+ * backend exits, it will send io_path_stats to the stats collector to be
+ * persisted.
+ */
+ IOOps io_path_stats[IOPATH_NUM_TYPES];
} PgBackendStatus;
@@ -296,6 +343,32 @@ extern void pgstat_bestart(void);
extern void pgstat_clear_backend_activity_snapshot(void);
/* Activity reporting functions */
+
+static inline void
+pgstat_inc_ioop(IOOp io_op, IOPath io_path)
+{
+ IOOps *io_ops;
+ PgBackendStatus *beentry = MyBEEntry;
+
+ Assert(beentry);
+
+ io_ops = &beentry->io_path_stats[io_path];
+ switch (io_op)
+ {
+ case IOOP_ALLOC:
+ pg_atomic_inc_counter(&io_ops->allocs);
+ break;
+ case IOOP_EXTEND:
+ pg_atomic_inc_counter(&io_ops->extends);
+ break;
+ case IOOP_FSYNC:
+ pg_atomic_inc_counter(&io_ops->fsyncs);
+ break;
+ case IOOP_WRITE:
+ pg_atomic_inc_counter(&io_ops->writes);
+ break;
+ }
+}
extern void pgstat_report_activity(BackendState state, const char *cmd_str);
extern void pgstat_report_query_id(uint64 query_id, bool force);
extern void pgstat_report_tempfile(size_t filesize);
--
2.32.0
Attachment: v15-0001-Read-only-atomic-backend-write-function.patch
From c68d9d8867103fcedb874a1c3ded9ecd737e654d Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 11 Oct 2021 16:15:06 -0400
Subject: [PATCH v15 1/7] Read-only atomic backend write function
For counters in shared memory which can be read by any backend but only
written to by one backend, an atomic is still needed to protect against
torn values; however, pg_atomic_fetch_add_u64() is overkill for
incrementing the counter. pg_atomic_inc_counter() is a helper function
which can be used to increment these values safely but without
unnecessary overhead.
Author: Thomas Munro
---
src/include/port/atomics.h | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/src/include/port/atomics.h b/src/include/port/atomics.h
index 856338f161..545d6d37c7 100644
--- a/src/include/port/atomics.h
+++ b/src/include/port/atomics.h
@@ -519,6 +519,16 @@ pg_atomic_sub_fetch_u64(volatile pg_atomic_uint64 *ptr, int64 sub_)
return pg_atomic_sub_fetch_u64_impl(ptr, sub_);
}
+/*
+ * On modern systems this is really just *counter++. On some older systems
+ * there might be more to it, due to inability to read and write 64 bit values
+ * atomically.
+ */
+static inline void pg_atomic_inc_counter(pg_atomic_uint64 *counter)
+{
+ pg_atomic_write_u64(counter, pg_atomic_read_u64(counter) + 1);
+}
+
#undef INSIDE_ATOMICS_H
#endif /* ATOMICS_H */
--
2.32.0
Thanks for working on this. I was just trying to find something like
"pg_stat_checkpointer".
You wrote beentry++ at the start of two loops, but I think that's wrong; it
should be at the end, as in the rest of the file (or as a loop increment).
BackendStatusArray[0] is actually used (even though its backend has
backendId==1, not 0). "MyBEEntry = &BackendStatusArray[MyBackendId - 1];"
You could put *_NUM_TYPES as the last value in these enums, like
NUM_AUXPROCTYPES, NUM_PMSIGNALS, and NUM_PROCSIGNALS:
+#define IOOP_NUM_TYPES (IOOP_WRITE + 1)
+#define IOPATH_NUM_TYPES (IOPATH_STRATEGY + 1)
+#define BACKEND_NUM_TYPES (B_LOGGER + 1)
There's extraneous blank lines in these functions:
+pgstat_sum_io_path_ops
+pgstat_report_live_backend_io_path_ops
+pgstat_recv_resetsharedcounter
+GetIOPathDesc
+StrategyRejectBuffer
This function is doubly-indented:
+pgstat_send_buffers_reset
As support for C99 is now required by postgres, variables can be declared as
part of various loops.
+ int io_path;
+ for (io_path = 0; io_path < IOPATH_NUM_TYPES; io_path++)
Rather than memset(), you could initialize msg like this.
PgStat_MsgIOPathOps msg = {0};
+pgstat_send_buffers(void)
+{
+ PgStat_MsgIOPathOps msg;
+
+ PgBackendStatus *beentry = MyBEEntry;
+
+ if (!beentry)
+ return;
+
+ memset(&msg, 0, sizeof(msg));
--
Justin
On Wed, Nov 24, 2021 at 07:15:59PM -0600, Justin Pryzby wrote:
There's extraneous blank lines in these functions:
+pgstat_sum_io_path_ops
+pgstat_report_live_backend_io_path_ops
+pgstat_recv_resetsharedcounter
+GetIOPathDesc
+StrategyRejectBuffer
+ an extra blank line in pgstat_reset_shared_counters.
In 0005:
monitoring.sgml says that the columns in pg_stat_buffers are integers, but
they're actually bigint.
+ tupstore = tuplestore_begin_heap(true, false, work_mem);
You're passing a constant randomAccess=true to tuplestore_begin_heap ;)
+Datum all_values[NROWS][COLUMN_LENGTH];
If you were to allocate this as an array, I think it could actually be 3-D:
Datum all_values[BACKEND_NUM_TYPES-1][IOPATH_NUM_TYPES][COLUMN_LENGTH];
But I don't know if this is portable across postgres' supported platforms; I
haven't seen any place which allocates a multidimensional array on the stack,
nor passes one to a function:
+static inline Datum *
+get_pg_stat_buffers_row(Datum all_values[NROWS][COLUMN_LENGTH], BackendType backend_type, IOPath io_path)
Maybe the allocation half is okay (I think it's ~3kB), but it seems easier to
palloc the required amount than to research compiler behavior.
That function is only used as a one-line helper, and doesn't use
multidimensional array access anyway:
+ return all_values[(backend_type - 1) * IOPATH_NUM_TYPES + io_path];
I think it'd be better as a macro, like (I think)
#define ROW(backend_type, io_path) all_values[IOPATH_NUM_TYPES*(backend_type-1)+io_path]
Maybe it should take the column type as a 3rd arg.
The enum with COLUMN_LENGTH should be named.
Or maybe it should be removed, and the enum names moved to comments, like:
+ /* backend_type */
+ values[val++] = backend_type_desc;
+ /* io_path */
+ values[val++] = CStringGetTextDatum(GetIOPathDesc(io_path));
+ /* allocs */
+ values[val++] += io_ops->allocs - resets->allocs;
...
*Note the use of += and not =.
Also:
src/include/miscadmin.h:#define BACKEND_NUM_TYPES (B_LOGGER + 1)
I think it's wrong to say NUM_TYPES = B_LOGGER + 1 (which would suggest using
lessthan-or-equal instead of lessthan as you are).
Since the valid backend types start at 1, the "count" of backend types is
currently B_LOGGER (13), not 14. I think you should remove the "+1" here.
Then NROWS (if it continued to exist at all) wouldn't need to subtract one.
--
Justin
Thanks for the review!
On Wed, Nov 24, 2021 at 8:16 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
You wrote beentry++ at the start of two loops, but I think that's wrong; it
should be at the end, as in the rest of the file (or as a loop increment).
BackendStatusArray[0] is actually used (even though its backend has
backendId==1, not 0). "MyBEEntry = &BackendStatusArray[MyBackendId - 1];"
I've fixed this in v16 which I will attach to the next email in the thread.
You could put *_NUM_TYPES as the last value in these enums, like
NUM_AUXPROCTYPES, NUM_PMSIGNALS, and NUM_PROCSIGNALS:
+#define IOOP_NUM_TYPES (IOOP_WRITE + 1)
+#define IOPATH_NUM_TYPES (IOPATH_STRATEGY + 1)
+#define BACKEND_NUM_TYPES (B_LOGGER + 1)
I originally had it as you describe, but based on this feedback upthread
from Álvaro Herrera:
(It's weird to have enum values that are there just to indicate what's
the maximum value. I think that sort of thing is better done by having
a "#define LAST_THING" that takes the last valid value from the enum.
That would free you from having to handle the last value in switch
blocks, for example. LAST_OCLASS in dependency.h is a precedent on this.)
So, I changed it to use macros.
There's extraneous blank lines in these functions:
+pgstat_sum_io_path_ops
Fixed
+pgstat_report_live_backend_io_path_ops
I didn't see one here
+pgstat_recv_resetsharedcounter
I didn't see one here
+GetIOPathDesc
Fixed
+StrategyRejectBuffer
Fixed
This function is doubly-indented:
+pgstat_send_buffers_reset
Fixed. Thanks for catching this.
I also ran pgindent and manually picked a few of the formatting fixes
that were relevant to code I added.
As support for C99 is now required by postgres, variables can be declared as
part of various loops.
+ int io_path;
+ for (io_path = 0; io_path < IOPATH_NUM_TYPES; io_path++)
Fixed this and all other occurrences in my code.
Rather than memset(), you could initialize msg like this:
PgStat_MsgIOPathOps msg = {0};
+pgstat_send_buffers(void)
+{
+ PgStat_MsgIOPathOps msg;
+
+ PgBackendStatus *beentry = MyBEEntry;
+
+ if (!beentry)
+ return;
+
+ memset(&msg, 0, sizeof(msg));
though changing the initialization to universal zero initialization
seems to be the correct way, I do get this compiler warning when I make
the change
pgstat.c:3212:29: warning: suggest braces around initialization of
subobject [-Wmissing-braces]
PgStat_MsgIOPathOps msg = {0};
^
{}
I have seen some comments online that say that this is a spurious
warning present with some versions of both gcc and clang when using
-Wmissing-braces to compile code with universal zero initialization, but
I'm not sure what I should do.
v16 attached in next message
- Melanie
v16 (also rebased) attached
On Fri, Nov 26, 2021 at 4:16 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
On Wed, Nov 24, 2021 at 07:15:59PM -0600, Justin Pryzby wrote:
There's extraneous blank lines in these functions:
+pgstat_sum_io_path_ops
+pgstat_report_live_backend_io_path_ops
+pgstat_recv_resetsharedcounter
+GetIOPathDesc
+StrategyRejectBuffer
and an extra blank line in pgstat_reset_shared_counters.
Fixed
In 0005:
monitoring.sgml says that the columns in pg_stat_buffers are integers, but
they're actually bigint.
Fixed
+ tupstore = tuplestore_begin_heap(true, false, work_mem);
You're passing a constant randomAccess=true to tuplestore_begin_heap ;)
Fixed
+Datum all_values[NROWS][COLUMN_LENGTH];
If you were to allocate this as an array, I think it could actually be 3-D:
Datum all_values[BACKEND_NUM_TYPES-1][IOPATH_NUM_TYPES][COLUMN_LENGTH];
I've changed this to a 3D array as you suggested and removed the NROWS
macro.
But I don't know if this is portable across postgres' supported platforms; I
haven't seen any place which allocates a multidimensional array on the stack,
nor passes one to a function:
+static inline Datum *
+get_pg_stat_buffers_row(Datum all_values[NROWS][COLUMN_LENGTH], BackendType backend_type, IOPath io_path)
Maybe the allocation half is okay (I think it's ~3kB), but it seems easier to
palloc the required amount than to research compiler behavior.
I think passing it to the function is okay. The parameter type would be
adjusted from an array to a pointer.
I am not sure if the allocation on the stack in the body of
pg_stat_get_buffers is too large. (left as is for now)
That function is only used as a one-line helper, and doesn't use
multidimensional array access anyway:
+ return all_values[(backend_type - 1) * IOPATH_NUM_TYPES + io_path];
with your suggested changes to a 3D array, it now does use multidimensional
array access
I think it'd be better as a macro, like (I think)
#define ROW(backend_type, io_path) all_values[NROWS*(backend_type-1)+io_path]
If I am understanding the idea of the macro, it would change the call
site from this:
+Datum *values = get_pg_stat_buffers_row(all_values, beentry->st_backendType, io_path);
+values[COLUMN_ALLOCS] += pg_atomic_read_u64(&io_ops->allocs);
+values[COLUMN_FSYNCS] += pg_atomic_read_u64(&io_ops->fsyncs);
to this:
+Datum *row = ROW(beentry->st_backendType, io_path);
+row[COLUMN_ALLOCS] += pg_atomic_read_u64(&io_ops->allocs);
+row[COLUMN_FSYNCS] += pg_atomic_read_u64(&io_ops->fsyncs);
I usually prefer functions to macros, but I am fine with changing it.
(I did not change it in this version)
I have changed all the local variables from "values" to "row" which
I think is a bit clearer.
Maybe it should take the column type as a 3 arg.
If I am understanding this idea, the call site would look like this now:
+CELL(beentry->st_backendType, io_path, COLUMN_FSYNCS) += pg_atomic_read_u64(&io_ops->fsyncs);
+CELL(beentry->st_backendType, io_path, COLUMN_ALLOCS) += pg_atomic_read_u64(&io_ops->allocs);
I don't like this as much. Since this code is inside of a loop, it kind
of makes sense to me that you get a row at the top of the loop and then
fill in all the cells in the row using that "row" variable.
The enum with COLUMN_LENGTH should be named.
I only use the values in it, so it didn't need a name.
Or maybe it should be removed, and the enum names moved to comments, like:
+ /* backend_type */
+ values[val++] = backend_type_desc;
+ /* io_path */
+ values[val++] = CStringGetTextDatum(GetIOPathDesc(io_path));
+ /* allocs */
+ values[val++] += io_ops->allocs - resets->allocs;
...
I find it easier to understand with it in code instead of as a comment.
*Note the use of += and not =.
Thanks for seeing this. I have changed this (to use +=).
Also:
src/include/miscadmin.h:#define BACKEND_NUM_TYPES (B_LOGGER + 1)
I think it's wrong to say NUM_TYPES = B_LOGGER + 1 (which would suggest using
less-than-or-equal instead of less-than as you are).
Since the valid backend types start at 1, the "count" of backend types is
currently B_LOGGER (13), not 14. I think you should remove the "+1" here.
Then NROWS (if it continued to exist at all) wouldn't need to subtract one.
I think what I currently have is technically correct because I start at
1 when I am using it as a loop condition. I do waste a spot in the
arrays I allocate with BACKEND_NUM_TYPES size.
I was hesitant to make the value of BACKEND_NUM_TYPES == B_LOGGER
because it seems kind of weird to have it have the same value as the
B_LOGGER enum.
I am open to changing it. (I didn't change it in this v16).
- Melanie
Attachments:
v16-0006-Remove-superfluous-bgwriter-stats.patch
From 388ab20a59c65d879eabe045bcf6948c2349555d Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 24 Nov 2021 12:20:10 -0500
Subject: [PATCH v16 6/7] Remove superfluous bgwriter stats
Remove stats from pg_stat_bgwriter which are now more clearly expressed
in pg_stat_buffers.
TODO:
- make pg_stat_checkpointer view and move relevant stats into it
- add additional stats to pg_stat_bgwriter
---
doc/src/sgml/monitoring.sgml | 47 ---------------------------
src/backend/catalog/system_views.sql | 5 ---
src/backend/postmaster/checkpointer.c | 29 ++---------------
src/backend/postmaster/pgstat.c | 5 ---
src/backend/storage/buffer/bufmgr.c | 6 ----
src/backend/utils/adt/pgstatfuncs.c | 30 -----------------
src/include/catalog/pg_proc.dat | 22 -------------
src/include/pgstat.h | 10 ------
src/test/regress/expected/rules.out | 5 ---
9 files changed, 2 insertions(+), 157 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index b16952d439..9e869a7fbf 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3551,24 +3551,6 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
</para></entry>
</row>
- <row>
- <entry role="catalog_table_entry"><para role="column_definition">
- <structfield>buffers_checkpoint</structfield> <type>bigint</type>
- </para>
- <para>
- Number of buffers written during checkpoints
- </para></entry>
- </row>
-
- <row>
- <entry role="catalog_table_entry"><para role="column_definition">
- <structfield>buffers_clean</structfield> <type>bigint</type>
- </para>
- <para>
- Number of buffers written by the background writer
- </para></entry>
- </row>
-
<row>
<entry role="catalog_table_entry"><para role="column_definition">
<structfield>maxwritten_clean</structfield> <type>bigint</type>
@@ -3579,35 +3561,6 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
</para></entry>
</row>
- <row>
- <entry role="catalog_table_entry"><para role="column_definition">
- <structfield>buffers_backend</structfield> <type>bigint</type>
- </para>
- <para>
- Number of buffers written directly by a backend
- </para></entry>
- </row>
-
- <row>
- <entry role="catalog_table_entry"><para role="column_definition">
- <structfield>buffers_backend_fsync</structfield> <type>bigint</type>
- </para>
- <para>
- Number of times a backend had to execute its own
- <function>fsync</function> call (normally the background writer handles those
- even when the backend does its own write)
- </para></entry>
- </row>
-
- <row>
- <entry role="catalog_table_entry"><para role="column_definition">
- <structfield>buffers_alloc</structfield> <type>bigint</type>
- </para>
- <para>
- Number of buffers allocated
- </para></entry>
- </row>
-
<row>
<entry role="catalog_table_entry"><para role="column_definition">
<structfield>stats_reset</structfield> <type>timestamp with time zone</type>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index e214d23056..80d73c40ad 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1068,12 +1068,7 @@ CREATE VIEW pg_stat_bgwriter AS
pg_stat_get_bgwriter_requested_checkpoints() AS checkpoints_req,
pg_stat_get_checkpoint_write_time() AS checkpoint_write_time,
pg_stat_get_checkpoint_sync_time() AS checkpoint_sync_time,
- pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint,
- pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean,
pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
- pg_stat_get_buf_written_backend() AS buffers_backend,
- pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
- pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
CREATE VIEW pg_stat_buffers AS
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 8440b2b802..b9c3745474 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -90,17 +90,9 @@
* requesting backends since the last checkpoint start. The flags are
* chosen so that OR'ing is the correct way to combine multiple requests.
*
- * num_backend_writes is used to count the number of buffer writes performed
- * by user backend processes. This counter should be wide enough that it
- * can't overflow during a single processing cycle. num_backend_fsync
- * counts the subset of those writes that also had to do their own fsync,
- * because the checkpointer failed to absorb their request.
- *
* The requests array holds fsync requests sent by backends and not yet
* absorbed by the checkpointer.
*
- * Unlike the checkpoint fields, num_backend_writes, num_backend_fsync, and
- * the requests fields are protected by CheckpointerCommLock.
*----------
*/
typedef struct
@@ -124,9 +116,6 @@ typedef struct
ConditionVariable start_cv; /* signaled when ckpt_started advances */
ConditionVariable done_cv; /* signaled when ckpt_done advances */
- uint32 num_backend_writes; /* counts user backend buffer writes */
- uint32 num_backend_fsync; /* counts user backend fsync calls */
-
int num_requests; /* current # of requests */
int max_requests; /* allocated array size */
CheckpointerRequest requests[FLEXIBLE_ARRAY_MEMBER];
@@ -1081,10 +1070,6 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
- /* Count all backend writes regardless of if they fit in the queue */
- if (!AmBackgroundWriterProcess())
- CheckpointerShmem->num_backend_writes++;
-
/*
* If the checkpointer isn't running or the request queue is full, the
* backend will have to perform its own fsync request. But before forcing
@@ -1094,13 +1079,12 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
(CheckpointerShmem->num_requests >= CheckpointerShmem->max_requests &&
!CompactCheckpointerRequestQueue()))
{
+ LWLockRelease(CheckpointerCommLock);
+
/*
* Count the subset of writes where backends have to do their own
* fsync
*/
- if (!AmBackgroundWriterProcess())
- CheckpointerShmem->num_backend_fsync++;
- LWLockRelease(CheckpointerCommLock);
pgstat_inc_ioop(IOOP_FSYNC, IOPATH_SHARED);
return false;
}
@@ -1257,15 +1241,6 @@ AbsorbSyncRequests(void)
LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
- /* Transfer stats counts into pending pgstats message */
- PendingCheckpointerStats.m_buf_written_backend
- += CheckpointerShmem->num_backend_writes;
- PendingCheckpointerStats.m_buf_fsync_backend
- += CheckpointerShmem->num_backend_fsync;
-
- CheckpointerShmem->num_backend_writes = 0;
- CheckpointerShmem->num_backend_fsync = 0;
-
/*
* We try to avoid holding the lock for a long time by copying the request
* array, and processing the requests after releasing the lock.
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 4e7426d273..a73c0ceceb 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -5923,9 +5923,7 @@ pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
static void
pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
{
- globalStats.bgwriter.buf_written_clean += msg->m_buf_written_clean;
globalStats.bgwriter.maxwritten_clean += msg->m_maxwritten_clean;
- globalStats.bgwriter.buf_alloc += msg->m_buf_alloc;
}
/* ----------
@@ -5941,9 +5939,6 @@ pgstat_recv_checkpointer(PgStat_MsgCheckpointer *msg, int len)
globalStats.checkpointer.requested_checkpoints += msg->m_requested_checkpoints;
globalStats.checkpointer.checkpoint_write_time += msg->m_checkpoint_write_time;
globalStats.checkpointer.checkpoint_sync_time += msg->m_checkpoint_sync_time;
- globalStats.checkpointer.buf_written_checkpoints += msg->m_buf_written_checkpoints;
- globalStats.checkpointer.buf_written_backend += msg->m_buf_written_backend;
- globalStats.checkpointer.buf_fsync_backend += msg->m_buf_fsync_backend;
}
static void
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 6926fc5742..67447f997a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2164,7 +2164,6 @@ BufferSync(int flags)
if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
{
TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
- PendingCheckpointerStats.m_buf_written_checkpoints++;
num_written++;
}
}
@@ -2273,9 +2272,6 @@ BgBufferSync(WritebackContext *wb_context)
*/
strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
- /* Report buffer alloc counts to pgstat */
- PendingBgWriterStats.m_buf_alloc += recent_alloc;
-
/*
* If we're not running the LRU scan, just stop after doing the stats
* stuff. We mark the saved state invalid so that we can recover sanely
@@ -2472,8 +2468,6 @@ BgBufferSync(WritebackContext *wb_context)
reusable_buffers++;
}
- PendingBgWriterStats.m_buf_written_clean += num_written;
-
#ifdef BGW_DEBUG
elog(DEBUG1, "bgwriter: recent_alloc=%u smoothed=%.2f delta=%ld ahead=%d density=%.2f reusable_est=%d upcoming_est=%d scanned=%d wrote=%d reusable=%d",
recent_alloc, smoothed_alloc, strategy_delta, bufs_ahead,
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 74f0c22170..74d54ae313 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1738,18 +1738,6 @@ pg_stat_get_bgwriter_requested_checkpoints(PG_FUNCTION_ARGS)
PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->requested_checkpoints);
}
-Datum
-pg_stat_get_bgwriter_buf_written_checkpoints(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_written_checkpoints);
-}
-
-Datum
-pg_stat_get_bgwriter_buf_written_clean(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_written_clean);
-}
-
Datum
pg_stat_get_bgwriter_maxwritten_clean(PG_FUNCTION_ARGS)
{
@@ -1778,24 +1766,6 @@ pg_stat_get_bgwriter_stat_reset_time(PG_FUNCTION_ARGS)
PG_RETURN_TIMESTAMPTZ(pgstat_fetch_stat_bgwriter()->stat_reset_timestamp);
}
-Datum
-pg_stat_get_buf_written_backend(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_written_backend);
-}
-
-Datum
-pg_stat_get_buf_fsync_backend(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_fsync_backend);
-}
-
-Datum
-pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
-}
-
/*
* When adding a new column to the pg_stat_buffers view, add a new enum
* value here above COLUMN_LENGTH.
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 32253383ba..e303efa798 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5612,16 +5612,6 @@
proname => 'pg_stat_get_bgwriter_requested_checkpoints', provolatile => 's',
proparallel => 'r', prorettype => 'int8', proargtypes => '',
prosrc => 'pg_stat_get_bgwriter_requested_checkpoints' },
-{ oid => '2771',
- descr => 'statistics: number of buffers written by the bgwriter during checkpoints',
- proname => 'pg_stat_get_bgwriter_buf_written_checkpoints', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_bgwriter_buf_written_checkpoints' },
-{ oid => '2772',
- descr => 'statistics: number of buffers written by the bgwriter for cleaning dirty buffers',
- proname => 'pg_stat_get_bgwriter_buf_written_clean', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_bgwriter_buf_written_clean' },
{ oid => '2773',
descr => 'statistics: number of times the bgwriter stopped processing when it had written too many buffers while cleaning',
proname => 'pg_stat_get_bgwriter_maxwritten_clean', provolatile => 's',
@@ -5641,18 +5631,6 @@
proname => 'pg_stat_get_checkpoint_sync_time', provolatile => 's',
proparallel => 'r', prorettype => 'float8', proargtypes => '',
prosrc => 'pg_stat_get_checkpoint_sync_time' },
-{ oid => '2775', descr => 'statistics: number of buffers written by backends',
- proname => 'pg_stat_get_buf_written_backend', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_buf_written_backend' },
-{ oid => '3063',
- descr => 'statistics: number of backend buffer writes that did their own fsync',
- proname => 'pg_stat_get_buf_fsync_backend', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_buf_fsync_backend' },
-{ oid => '2859', descr => 'statistics: number of buffer allocations',
- proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
- prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
{ oid => '8459', descr => 'statistics: counts of all IO operations done to all IO paths by each type of backend.',
proname => 'pg_stat_get_buffers', provolatile => 's', proisstrict => 'f',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 00975bcac6..5f294f7ef3 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -511,9 +511,7 @@ typedef struct PgStat_MsgBgWriter
{
PgStat_MsgHdr m_hdr;
- PgStat_Counter m_buf_written_clean;
PgStat_Counter m_maxwritten_clean;
- PgStat_Counter m_buf_alloc;
} PgStat_MsgBgWriter;
/* ----------
@@ -526,9 +524,6 @@ typedef struct PgStat_MsgCheckpointer
PgStat_Counter m_timed_checkpoints;
PgStat_Counter m_requested_checkpoints;
- PgStat_Counter m_buf_written_checkpoints;
- PgStat_Counter m_buf_written_backend;
- PgStat_Counter m_buf_fsync_backend;
PgStat_Counter m_checkpoint_write_time; /* times in milliseconds */
PgStat_Counter m_checkpoint_sync_time;
} PgStat_MsgCheckpointer;
@@ -959,9 +954,7 @@ typedef struct PgStat_ArchiverStats
*/
typedef struct PgStat_BgWriterStats
{
- PgStat_Counter buf_written_clean;
PgStat_Counter maxwritten_clean;
- PgStat_Counter buf_alloc;
TimestampTz stat_reset_timestamp;
} PgStat_BgWriterStats;
@@ -975,9 +968,6 @@ typedef struct PgStat_CheckpointerStats
PgStat_Counter requested_checkpoints;
PgStat_Counter checkpoint_write_time; /* times in milliseconds */
PgStat_Counter checkpoint_sync_time;
- PgStat_Counter buf_written_checkpoints;
- PgStat_Counter buf_written_backend;
- PgStat_Counter buf_fsync_backend;
} PgStat_CheckpointerStats;
/*
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 5869ce442f..09f495792d 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1821,12 +1821,7 @@ pg_stat_bgwriter| SELECT pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints
pg_stat_get_bgwriter_requested_checkpoints() AS checkpoints_req,
pg_stat_get_checkpoint_write_time() AS checkpoint_write_time,
pg_stat_get_checkpoint_sync_time() AS checkpoint_sync_time,
- pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint,
- pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean,
pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
- pg_stat_get_buf_written_backend() AS buffers_backend,
- pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
- pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
pg_stat_buffers| SELECT b.backend_type,
b.io_path,
--
2.32.0
v16-0004-Add-buffers-to-pgstat_reset_shared_counters.patch
From 75eeda1b7d5ed68433ce313d368c722f9cd27f2d Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 24 Nov 2021 11:39:48 -0500
Subject: [PATCH v16 4/7] Add "buffers" to pgstat_reset_shared_counters
Backends count IO operations for various IO paths in their PgBackendStatus.
Upon exit, they send these counts to the stats collector. Prior to this commit,
these IO Ops stats would have been reset when the target was "bgwriter".
With this commit, target "bgwriter" no longer will cause the IO operations
stats to be reset, and the IO operations stats can be reset with new target,
"buffers".
---
doc/src/sgml/monitoring.sgml | 2 +-
src/backend/postmaster/pgstat.c | 65 +++++++++++++++++++--
src/backend/utils/activity/backend_status.c | 26 +++++++++
src/include/pgstat.h | 12 +++-
src/include/utils/backend_status.h | 2 +
5 files changed, 100 insertions(+), 7 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 62f2a3332b..bda3eef309 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3604,7 +3604,7 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
<structfield>stats_reset</structfield> <type>timestamp with time zone</type>
</para>
<para>
- Time at which these statistics were last reset
+ Time at which these statistics were last reset.
</para></entry>
</row>
</tbody>
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 4c5f3e9c26..cebbbc13da 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1512,6 +1512,27 @@ pgstat_reset_counters(void)
pgstat_send(&msg, sizeof(msg));
}
+/*
+ * Helper function to collect and send live backends' current IO operations
+ * stats counters when a stats reset is initiated so that they may be deducted
+ * from future totals.
+ */
+static void
+pgstat_send_buffers_reset(PgStat_MsgResetsharedcounter * msg)
+{
+ PgStatIOPathOps ops[BACKEND_NUM_TYPES];
+
+ memset(ops, 0, sizeof(ops));
+ pgstat_report_live_backend_io_path_ops(ops);
+
+ for (int backend_type = 1; backend_type < BACKEND_NUM_TYPES; backend_type++)
+ {
+ msg->m_backend_resets.backend_type = backend_type;
+ memcpy(&msg->m_backend_resets.iop, &ops[backend_type], sizeof(msg->m_backend_resets.iop));
+ pgstat_send(msg, sizeof(PgStat_MsgResetsharedcounter));
+ }
+}
+
/* ----------
* pgstat_reset_shared_counters() -
*
@@ -1529,7 +1550,14 @@ pgstat_reset_shared_counters(const char *target)
if (pgStatSock == PGINVALID_SOCKET)
return;
- if (strcmp(target, "archiver") == 0)
+ pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
+ if (strcmp(target, "buffers") == 0)
+ {
+ msg.m_resettarget = RESET_BUFFERS;
+ pgstat_send_buffers_reset(&msg);
+ return;
+ }
+ else if (strcmp(target, "archiver") == 0)
msg.m_resettarget = RESET_ARCHIVER;
else if (strcmp(target, "bgwriter") == 0)
msg.m_resettarget = RESET_BGWRITER;
@@ -1541,7 +1569,7 @@ pgstat_reset_shared_counters(const char *target)
errmsg("unrecognized reset target: \"%s\"", target),
errhint("Target must be \"archiver\", \"bgwriter\", or \"wal\".")));
- pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
+
pgstat_send(&msg, sizeof(msg));
}
@@ -5577,10 +5605,39 @@ pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
{
if (msg->m_resettarget == RESET_BGWRITER)
{
- /* Reset the global, bgwriter and checkpointer statistics for the cluster. */
- memset(&globalStats, 0, sizeof(globalStats));
+ /*
+ * Reset the global, bgwriter and checkpointer statistics for the
+ * cluster.
+ */
+ memset(&globalStats.checkpointer, 0, sizeof(globalStats.checkpointer));
+ memset(&globalStats.bgwriter, 0, sizeof(globalStats.bgwriter));
globalStats.bgwriter.stat_reset_timestamp = GetCurrentTimestamp();
}
+ else if (msg->m_resettarget == RESET_BUFFERS)
+ {
+ /*
+ * Because the stats collector cannot write to live backends'
+ * PgBackendStatuses, it maintains an array of "resets". The reset
+ * message contains the current values of these counters for live
+ * backends. The stats collector saves these in its "resets" array,
+ * then zeroes out the exited backends' saved IO op counters. This is
+ * required to calculate an accurate total for each IO op counter post
+ * reset.
+ */
+ BackendType backend_type = msg->m_backend_resets.backend_type;
+
+ /*
+ * Though globalStats.buffers only needs to be reset once, doing so
+ * for every message is less brittle and the extra cost is irrelevant
+ * given how often stats are reset.
+ */
+ memset(&globalStats.buffers.ops, 0, sizeof(globalStats.buffers.ops));
+ globalStats.buffers.stat_reset_timestamp = GetCurrentTimestamp();
+
+ memcpy(&globalStats.buffers.resets[backend_type],
+ &msg->m_backend_resets.iop.io_path_ops, sizeof(msg->m_backend_resets.iop.io_path_ops));
+
+ }
else if (msg->m_resettarget == RESET_ARCHIVER)
{
/* Reset the archiver statistics for the cluster. */
diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index 413cc605f8..29f6604488 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -630,6 +630,32 @@ pgstat_report_activity(BackendState state, const char *cmd_str)
PGSTAT_END_WRITE_ACTIVITY(beentry);
}
+/*
+ * Iterate through BackendStatusArray and capture live backends' stats on IO
+ * Ops for all IO Paths, adding them to that backend type's member of the
+ * backend_io_path_ops structure.
+ */
+void
+pgstat_report_live_backend_io_path_ops(PgStatIOPathOps *backend_io_path_ops)
+{
+ PgBackendStatus *beentry = BackendStatusArray;
+
+ /*
+ * Loop through live backends and capture reset values
+ */
+ for (int i = 0; i < MaxBackends + NUM_AUXPROCTYPES; i++)
+ {
+ /* Don't count dead backends. They should already be counted */
+ if (beentry->st_procpid == 0)
+ continue;
+
+ pgstat_sum_io_path_ops(backend_io_path_ops[beentry->st_backendType].io_path_ops,
+ (IOOps *) beentry->io_path_stats);
+
+ beentry++;
+ }
+}
+
/* --------
* pgstat_report_query_id() -
*
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 5d7cb9502d..e8ec0f9b48 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -142,6 +142,7 @@ typedef enum PgStat_Shared_Reset_Target
{
RESET_ARCHIVER,
RESET_BGWRITER,
+ RESET_BUFFERS,
RESET_WAL
} PgStat_Shared_Reset_Target;
@@ -357,7 +358,8 @@ typedef struct PgStatIOPathOps
/*
* Sent by a backend to the stats collector to report all IO Ops for all IO
- * Paths for a given type of a backend. This will happen when the backend exits.
+ * Paths for a given type of a backend. This will happen when the backend exits
+ * or when stats are reset.
*/
typedef struct PgStat_MsgIOPathOps
{
@@ -369,13 +371,18 @@ typedef struct PgStat_MsgIOPathOps
/*
* Structure used by stats collector to keep track of all types of exited
- * backends' IO Ops for all IO Paths.
+ * backends' IO Ops for all IO Paths as well as all stats from live backends at
+ * the time of stats reset. resets is populated using a reset message sent to
+ * the stats collector.
*/
typedef struct PgStat_BackendIOPathOps
{
+ TimestampTz stat_reset_timestamp;
PgStatIOPathOps ops[BACKEND_NUM_TYPES];
+ PgStatIOPathOps resets[BACKEND_NUM_TYPES];
} PgStat_BackendIOPathOps;
+
/* ----------
* PgStat_MsgResetcounter Sent by the backend to tell the collector
* to reset counters
@@ -396,6 +403,7 @@ typedef struct PgStat_MsgResetsharedcounter
{
PgStat_MsgHdr m_hdr;
PgStat_Shared_Reset_Target m_resettarget;
+ PgStat_MsgIOPathOps m_backend_resets;
} PgStat_MsgResetsharedcounter;
/* ----------
diff --git a/src/include/utils/backend_status.h b/src/include/utils/backend_status.h
index 717f58d4cc..9c997cace8 100644
--- a/src/include/utils/backend_status.h
+++ b/src/include/utils/backend_status.h
@@ -343,6 +343,7 @@ extern void pgstat_bestart(void);
extern void pgstat_clear_backend_activity_snapshot(void);
/* Activity reporting functions */
+typedef struct PgStatIOPathOps PgStatIOPathOps;
static inline void
pgstat_inc_ioop(IOOp io_op, IOPath io_path)
@@ -370,6 +371,7 @@ pgstat_inc_ioop(IOOp io_op, IOPath io_path)
}
}
extern void pgstat_report_activity(BackendState state, const char *cmd_str);
+extern void pgstat_report_live_backend_io_path_ops(PgStatIOPathOps *backend_io_path_ops);
extern void pgstat_report_query_id(uint64 query_id, bool force);
extern void pgstat_report_tempfile(size_t filesize);
extern void pgstat_report_appname(const char *appname);
--
2.32.0
v16-0007-small-comment-correction.patch
From 06f3bb3ae902f55f8bc6837c34ba9020afe92faf Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 24 Nov 2021 12:21:08 -0500
Subject: [PATCH v16 7/7] small comment correction
---
src/backend/utils/activity/backend_status.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index 285279b8ed..4054ee453f 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -296,7 +296,7 @@ pgstat_beinit(void)
* pgstat_bestart() -
*
* Initialize this backend's entry in the PgBackendStatus array.
- * Called from InitPostgres.
+ * Called from InitPostgres and AuxiliaryProcessMain
*
* Apart from auxiliary processes, MyBackendId, MyDatabaseId,
* session userid, and application_name must be set for a
--
2.32.0
Attachment: v16-0005-Add-system-view-tracking-IO-ops-per-backend-type.patch (application/octet-stream)
From b7848cac2c6d40193587b6e352b61b6d86025ca2 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 24 Nov 2021 12:07:37 -0500
Subject: [PATCH v16 5/7] Add system view tracking IO ops per backend type
Add pg_stat_buffers, a system view which tracks the number of IO
operations (allocs, writes, fsyncs, and extends) done through each IO
path (e.g. shared buffers, local buffers, unbuffered IO) by each type of
backend.
Some of these should always be zero. For example, checkpointer does not
use a BufferAccessStrategy (currently), so the "strategy" IO path for
checkpointer will be 0 for all IO operations (alloc, write, fsync, and
extend). All possible combinations of IO Path and IO Op are enumerated
in the view but not all are populated or even possible at this point.
All backends increment a counter in their PgBackendStatus when
performing an IO operation. On exit, backends send these stats to the
stats collector to be persisted.
When the pg_stat_buffers view is queried, one backend will sum live
backends' stats with saved stats from exited backends and subtract saved
reset stats, returning the total.
Each row of the view is stats for a particular backend type for a
particular IO path (e.g. shared buffer accesses by checkpointer) and
each column in the view is the total number of IO operations done (e.g.
writes).
So a cell in the view would be, for example, the number of shared
buffers written by checkpointer since the last stats reset.
Discussion: https://www.postgresql.org/message-id/flat/20210415235954.qcypb4urtovzkat5%40alap3.anarazel.de#724d5cce4bcb587f9167b80a5824bc5c
---
doc/src/sgml/monitoring.sgml | 110 +++++++++++++-
src/backend/catalog/system_views.sql | 11 ++
src/backend/postmaster/pgstat.c | 13 ++
src/backend/utils/activity/backend_status.c | 19 ++-
src/backend/utils/adt/pgstatfuncs.c | 153 ++++++++++++++++++++
src/include/catalog/pg_proc.dat | 9 ++
src/include/pgstat.h | 1 +
src/include/utils/backend_status.h | 5 +
src/test/regress/expected/rules.out | 8 +
9 files changed, 326 insertions(+), 3 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index bda3eef309..b16952d439 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -435,6 +435,15 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</entry>
</row>
+ <row>
+ <entry><structname>pg_stat_buffers</structname><indexterm><primary>pg_stat_buffers</primary></indexterm></entry>
+ <entry>A row for each IO path for each backend type showing
+ statistics about backend IO operations. See
+ <link linkend="monitoring-pg-stat-buffers-view">
+ <structname>pg_stat_buffers</structname></link> for details.
+ </entry>
+ </row>
+
<row>
<entry><structname>pg_stat_wal</structname><indexterm><primary>pg_stat_wal</primary></indexterm></entry>
<entry>One row only, showing statistics about WAL activity. See
@@ -3613,6 +3622,101 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
</sect2>
+ <sect2 id="monitoring-pg-stat-buffers-view">
+ <title><structname>pg_stat_buffers</structname></title>
+
+ <indexterm>
+ <primary>pg_stat_buffers</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_stat_buffers</structname> view has a row for each backend
+ type for each possible IO path, containing global data for the cluster for
+ that backend and IO path.
+ </para>
+
+ <table id="pg-stat-buffers-view" xreflabel="pg_stat_buffers">
+ <title><structname>pg_stat_buffers</structname> View</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>backend_type</structfield> <type>text</type>
+ </para>
+ <para>
+ Type of backend (e.g. background worker, autovacuum worker).
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>io_path</structfield> <type>text</type>
+ </para>
+ <para>
+ IO path taken (e.g. shared buffers, direct).
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>alloc</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers allocated.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>extend</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers extended.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>fsync</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers fsynced.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>write</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers written.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
+ </para>
+ <para>
+ Time at which these statistics were last reset.
+ </para></entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ </sect2>
+
<sect2 id="monitoring-pg-stat-wal-view">
<title><structname>pg_stat_wal</structname></title>
@@ -5213,8 +5317,10 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
all the counters shown in
the <structname>pg_stat_bgwriter</structname>
view, <literal>archiver</literal> to reset all the counters shown in
- the <structname>pg_stat_archiver</structname> view or <literal>wal</literal>
- to reset all the counters shown in the <structname>pg_stat_wal</structname> view.
+ the <structname>pg_stat_archiver</structname> view, <literal>wal</literal>
+ to reset all the counters shown in the <structname>pg_stat_wal</structname> view,
+ or <literal>buffers</literal> to reset all the counters shown in the
+ <structname>pg_stat_buffers</structname> view.
</para>
<para>
This function is restricted to superusers by default, but other users
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 61b515cdb8..e214d23056 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1076,6 +1076,17 @@ CREATE VIEW pg_stat_bgwriter AS
pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+CREATE VIEW pg_stat_buffers AS
+SELECT
+ b.backend_type,
+ b.io_path,
+ b.alloc,
+ b.extend,
+ b.fsync,
+ b.write,
+ b.stats_reset
+FROM pg_stat_get_buffers() b;
+
CREATE VIEW pg_stat_wal AS
SELECT
w.wal_records,
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index cebbbc13da..4e7426d273 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -2916,6 +2916,19 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
rec->tuples_inserted + rec->tuples_updated;
}
+/*
+ * Support function for SQL-callable pgstat* functions. Returns a pointer to
+ * the PgStat_BackendIOPathOps structure tracking IO op statistics for both
+ * exited backends and reset arithmetic.
+ */
+PgStat_BackendIOPathOps *
+pgstat_fetch_exited_backend_buffers(void)
+{
+ backend_read_statsfile();
+
+ return &globalStats.buffers;
+}
+
/* ----------
* pgstat_fetch_stat_dbentry() -
diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index 29f6604488..285279b8ed 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -50,7 +50,7 @@ int pgstat_track_activity_query_size = 1024;
PgBackendStatus *MyBEEntry = NULL;
-static PgBackendStatus *BackendStatusArray = NULL;
+PgBackendStatus *BackendStatusArray = NULL;
static char *BackendAppnameBuffer = NULL;
static char *BackendClientHostnameBuffer = NULL;
static char *BackendActivityBuffer = NULL;
@@ -236,6 +236,23 @@ CreateSharedBackendStatus(void)
#endif
}
+const char *
+GetIOPathDesc(IOPath io_path)
+{
+ switch (io_path)
+ {
+ case IOPATH_DIRECT:
+ return "direct";
+ case IOPATH_LOCAL:
+ return "local";
+ case IOPATH_SHARED:
+ return "shared";
+ case IOPATH_STRATEGY:
+ return "strategy";
+ }
+ return "unknown IO path";
+}
+
/*
* Initialize pgstats backend activity state, and set up our on-proc-exit
* hook. Called from InitPostgres and AuxiliaryProcessMain. For auxiliary
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index f529c1561a..74f0c22170 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1796,6 +1796,159 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
}
+/*
+* When adding a new column to the pg_stat_buffers view, add a new enum
+* value here above COLUMN_LENGTH.
+*/
+enum
+{
+ COLUMN_BACKEND_TYPE,
+ COLUMN_IO_PATH,
+ COLUMN_ALLOCS,
+ COLUMN_EXTENDS,
+ COLUMN_FSYNCS,
+ COLUMN_WRITES,
+ COLUMN_RESET_TIME,
+ COLUMN_LENGTH,
+};
+
+/*
+ * Helper function to get the correct row in the pg_stat_buffers view.
+ */
+static inline Datum *
+get_pg_stat_buffers_row(Datum all_values[BACKEND_NUM_TYPES - 1][IOPATH_NUM_TYPES][COLUMN_LENGTH],
+ BackendType backend_type, IOPath io_path)
+{
+ /*
+ * Subtract 1 from backend_type to avoid having rows for B_INVALID
+ * BackendType
+ */
+ return all_values[backend_type - 1][io_path];
+}
+
+Datum
+pg_stat_get_buffers(PG_FUNCTION_ARGS)
+{
+ PgStat_BackendIOPathOps *backend_io_path_ops;
+ PgBackendStatus *beentry;
+ Datum reset_time;
+
+ ReturnSetInfo *rsinfo;
+ TupleDesc tupdesc;
+ Tuplestorestate *tupstore;
+ MemoryContext per_query_ctx;
+ MemoryContext oldcontext;
+
+ Datum all_values[BACKEND_NUM_TYPES - 1][IOPATH_NUM_TYPES][COLUMN_LENGTH];
+ bool all_nulls[BACKEND_NUM_TYPES - 1][IOPATH_NUM_TYPES][COLUMN_LENGTH];
+
+ rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+
+ /* check to see if caller supports us returning a tuplestore */
+ if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("set-valued function called in context that cannot accept a set")));
+ if (!(rsinfo->allowedModes & SFRM_Materialize))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("materialize mode required, but it is not allowed in this context")));
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+ oldcontext = MemoryContextSwitchTo(per_query_ctx);
+ tupstore = tuplestore_begin_heap((bool) (rsinfo->allowedModes & SFRM_Materialize_Random),
+ false, work_mem);
+ rsinfo->returnMode = SFRM_Materialize;
+ rsinfo->setResult = tupstore;
+ rsinfo->setDesc = tupdesc;
+ MemoryContextSwitchTo(oldcontext);
+
+ memset(all_values, 0, sizeof(all_values));
+ memset(all_nulls, 0, sizeof(all_nulls));
+
+ /*
+ * Loop through all live backends and count their IO Ops for each IO Path
+ */
+ beentry = BackendStatusArray;
+
+ for (int i = 0; i < MaxBackends + NUM_AUXPROCTYPES; i++)
+ {
+ IOOps *io_ops;
+
+ /* Don't count dead backends. They should already be counted */
+ if (beentry->st_procpid == 0)
+ continue;
+
+ io_ops = beentry->io_path_stats;
+
+ for (int io_path = 0; io_path < IOPATH_NUM_TYPES; io_path++)
+ {
+ Datum *row = get_pg_stat_buffers_row(all_values, beentry->st_backendType, io_path);
+
+ /*
+ * COLUMN_RESET_TIME, COLUMN_BACKEND_TYPE, and COLUMN_IO_PATH will
+ * all be set when looping through exited backends array
+ */
+ row[COLUMN_ALLOCS] += pg_atomic_read_u64(&io_ops->allocs);
+ row[COLUMN_EXTENDS] += pg_atomic_read_u64(&io_ops->extends);
+ row[COLUMN_FSYNCS] += pg_atomic_read_u64(&io_ops->fsyncs);
+ row[COLUMN_WRITES] += pg_atomic_read_u64(&io_ops->writes);
+ io_ops++;
+ }
+
+ beentry++;
+ }
+
+ /* Add stats from all exited backends */
+ backend_io_path_ops = pgstat_fetch_exited_backend_buffers();
+
+ reset_time = TimestampTzGetDatum(backend_io_path_ops->stat_reset_timestamp);
+
+ /* 0 is not a valid BackendType */
+ for (int backend_type = 1; backend_type < BACKEND_NUM_TYPES; backend_type++)
+ {
+ PgStatIOOps *io_ops = backend_io_path_ops->ops[backend_type].io_path_ops;
+ PgStatIOOps *resets = backend_io_path_ops->resets[backend_type].io_path_ops;
+
+ Datum backend_type_desc = CStringGetTextDatum(GetBackendTypeDesc(backend_type));
+
+ for (int io_path = 0; io_path < IOPATH_NUM_TYPES; io_path++)
+ {
+ Datum *row = get_pg_stat_buffers_row(all_values, backend_type, io_path);
+
+ row[COLUMN_BACKEND_TYPE] = backend_type_desc;
+ row[COLUMN_IO_PATH] = CStringGetTextDatum(GetIOPathDesc(io_path));
+ row[COLUMN_ALLOCS] += io_ops->allocs - resets->allocs;
+ row[COLUMN_EXTENDS] += io_ops->extends - resets->extends;
+ row[COLUMN_FSYNCS] += io_ops->fsyncs - resets->fsyncs;
+ row[COLUMN_WRITES] += io_ops->writes - resets->writes;
+ row[COLUMN_RESET_TIME] = reset_time;
+ io_ops++;
+ resets++;
+ }
+ }
+
+ for (int i = 0; i < BACKEND_NUM_TYPES - 1 ; i++)
+ {
+ for (int j = 0; j < IOPATH_NUM_TYPES; j++)
+ {
+ Datum *values = all_values[i][j];
+ bool *nulls = all_nulls[i][j];
+
+ tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+ }
+ }
+
+ /* clean up and return the tuplestore */
+ tuplestore_donestoring(tupstore);
+
+ return (Datum) 0;
+}
+
/*
* Returns statistics of WAL activity
*/
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 79d787cd26..32253383ba 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5654,6 +5654,15 @@
proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
+{ oid => '8459', descr => 'statistics: counts of all IO operations done to all IO paths by each type of backend.',
+ proname => 'pg_stat_get_buffers', provolatile => 's', proisstrict => 'f',
+ prorows => '52', proretset => 't',
+ proparallel => 'r', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,text,int8,int8,int8,int8,timestamptz}',
+ proargmodes => '{o,o,o,o,o,o,o}',
+ proargnames => '{backend_type,io_path,alloc,extend,fsync,write,stats_reset}',
+ prosrc => 'pg_stat_get_buffers' },
+
{ oid => '1136', descr => 'statistics: information about WAL activity',
proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
proparallel => 'r', prorettype => 'record', proargtypes => '',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index e8ec0f9b48..00975bcac6 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1276,6 +1276,7 @@ extern void pgstat_sum_io_path_ops(PgStatIOOps *dest, IOOps *src);
* generate the pgstat* views.
* ----------
*/
+extern PgStat_BackendIOPathOps *pgstat_fetch_exited_backend_buffers(void);
extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid);
extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
diff --git a/src/include/utils/backend_status.h b/src/include/utils/backend_status.h
index 9c997cace8..dd983fc949 100644
--- a/src/include/utils/backend_status.h
+++ b/src/include/utils/backend_status.h
@@ -321,6 +321,7 @@ extern PGDLLIMPORT int pgstat_track_activity_query_size;
* ----------
*/
extern PGDLLIMPORT PgBackendStatus *MyBEEntry;
+extern PGDLLIMPORT PgBackendStatus *BackendStatusArray;
/* ----------
@@ -336,6 +337,10 @@ extern void CreateSharedBackendStatus(void);
* ----------
*/
+/* Utility functions */
+extern const char *GetIOPathDesc(IOPath io_path);
+
+
/* Initialization functions */
extern void pgstat_beinit(void);
extern void pgstat_bestart(void);
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index b58b062b10..5869ce442f 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1828,6 +1828,14 @@ pg_stat_bgwriter| SELECT pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints
pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+pg_stat_buffers| SELECT b.backend_type,
+ b.io_path,
+ b.alloc,
+ b.extend,
+ b.fsync,
+ b.write,
+ b.stats_reset
+ FROM pg_stat_get_buffers() b(backend_type, io_path, alloc, extend, fsync, write, stats_reset);
pg_stat_database| SELECT d.oid AS datid,
d.datname,
CASE
--
2.32.0
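For illustration, here is the kind of query one could run against the view this patch adds (column and io_path names as defined in the patch; the result shape is assumed, since it depends on the running cluster):

```sql
-- Which backend types have written or fsynced shared buffers
-- since the last stats reset?
SELECT backend_type, io_path, write, fsync, stats_reset
FROM pg_stat_buffers
WHERE io_path = 'shared' AND (write > 0 OR fsync > 0)
ORDER BY write DESC;
```

This directly addresses the original complaint: writes by checkpointer, bgwriter, autovacuum workers, and client backends each land in their own row instead of being lumped into buffers_backend.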
Attachment: v16-0003-Send-IO-operations-to-stats-collector.patch (application/octet-stream)
From d80754bfc5f8e56cd21720fef4b1c88b0a5c9923 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 24 Nov 2021 11:16:55 -0500
Subject: [PATCH v16 3/7] Send IO operations to stats collector
On exit, backends send the IO operations they have done on all IO Paths
to the stats collector. The stats collector adds these counts to its
existing counts stored in a global data structure it maintains and
persists.
PgStatIOOps contains the same information as backend_status.h's IOOps,
however IOOps' members must be atomics and the stats collector has no
such requirement.
---
src/backend/postmaster/pgstat.c | 88 +++++++++++++++++++++++++++++++--
src/include/miscadmin.h | 2 +
src/include/pgstat.h | 45 +++++++++++++++++
3 files changed, 132 insertions(+), 3 deletions(-)
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 7264d2c727..4c5f3e9c26 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -126,9 +126,12 @@ char *pgstat_stat_filename = NULL;
char *pgstat_stat_tmpname = NULL;
/*
- * BgWriter and WAL global statistics counters.
- * Stored directly in a stats message structure so they can be sent
- * without needing to copy things around. We assume these init to zeroes.
+ * BgWriter, Checkpointer, WAL, and I/O global statistics counters. I/O global
+ * statistics on various IO ops are tracked in PgBackendStatus while a backend
+ * is alive and then sent to stats collector before a backend exits in a
+ * PgStat_MsgIOPathOps.
+ * All others are stored directly in a stats message structure so they can be
+ * sent without needing to copy things around. We assume these init to zeroes.
*/
PgStat_MsgBgWriter PendingBgWriterStats;
PgStat_MsgCheckpointer PendingCheckpointerStats;
@@ -369,6 +372,7 @@ static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
static void pgstat_recv_checkpointer(PgStat_MsgCheckpointer *msg, int len);
+static void pgstat_recv_io_path_ops(PgStat_MsgIOPathOps *msg, int len);
static void pgstat_recv_wal(PgStat_MsgWal *msg, int len);
static void pgstat_recv_slru(PgStat_MsgSLRU *msg, int len);
static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
@@ -3152,6 +3156,14 @@ pgstat_shutdown_hook(int code, Datum arg)
{
Assert(!pgstat_is_shutdown);
+ /*
+ * Only need to send stats on IO Ops for IO Paths when a process exits.
+ * Users requiring IO Ops for both live and exited backends can read from
+ * live backends' PgBackendStatus and sum this with totals from exited
+ * backends persisted by the stats collector.
+ */
+ pgstat_send_buffers();
+
/*
* If we got as far as discovering our own database ID, we can report what
* we did to the collector. Otherwise, we'd be sending an invalid
@@ -3301,6 +3313,31 @@ pgstat_send_bgwriter(void)
MemSet(&PendingBgWriterStats, 0, sizeof(PendingBgWriterStats));
}
+/*
+ * Before exiting, a backend sends its IO op statistics to the collector so
+ * that they may be persisted.
+ */
+void
+pgstat_send_buffers(void)
+{
+ PgStat_MsgIOPathOps msg;
+
+ PgBackendStatus *beentry = MyBEEntry;
+
+ if (!beentry)
+ return;
+
+ memset(&msg, 0, sizeof(msg));
+ msg.backend_type = beentry->st_backendType;
+
+ pgstat_sum_io_path_ops(msg.iop.io_path_ops,
+ (IOOps *) &beentry->io_path_stats);
+
+ pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_IO_PATH_OPS);
+ pgstat_send(&msg, sizeof(msg));
+}
+
+
/* ----------
* pgstat_send_checkpointer() -
*
@@ -3483,6 +3520,28 @@ pgstat_send_subscription_purge(PgStat_MsgSubscriptionPurge *msg)
pgstat_send(msg, len);
}
+/*
+ * Helper function to sum all live IO Op stats for all IO Paths (e.g. shared,
+ * local) to those in the equivalent stats structure for exited backends. Note
+ * that this adds and doesn't set, so the destination stats structure should be
+ * zeroed out by the caller initially. This would commonly be used to transfer
+ * all IO Op stats for all IO Paths for a particular backend type to the
+ * pgstats structure.
+ */
+void
+pgstat_sum_io_path_ops(PgStatIOOps *dest, IOOps *src)
+{
+ for (int io_path = 0; io_path < IOPATH_NUM_TYPES; io_path++)
+ {
+ dest->allocs += pg_atomic_read_u64(&src->allocs);
+ dest->extends += pg_atomic_read_u64(&src->extends);
+ dest->fsyncs += pg_atomic_read_u64(&src->fsyncs);
+ dest->writes += pg_atomic_read_u64(&src->writes);
+ dest++;
+ src++;
+ }
+}
+
/* ----------
* PgstatCollectorMain() -
*
@@ -3692,6 +3751,10 @@ PgstatCollectorMain(int argc, char *argv[])
pgstat_recv_checkpointer(&msg.msg_checkpointer, len);
break;
+ case PGSTAT_MTYPE_IO_PATH_OPS:
+ pgstat_recv_io_path_ops(&msg.msg_io_path_ops, len);
+ break;
+
case PGSTAT_MTYPE_WAL:
pgstat_recv_wal(&msg.msg_wal, len);
break;
@@ -5813,6 +5876,25 @@ pgstat_recv_checkpointer(PgStat_MsgCheckpointer *msg, int len)
globalStats.checkpointer.buf_fsync_backend += msg->m_buf_fsync_backend;
}
+static void
+pgstat_recv_io_path_ops(PgStat_MsgIOPathOps *msg, int len)
+{
+ PgStatIOOps *src_io_path_ops = msg->iop.io_path_ops;
+ PgStatIOOps *dest_io_path_ops =
+ globalStats.buffers.ops[msg->backend_type].io_path_ops;
+
+ for (int io_path = 0; io_path < IOPATH_NUM_TYPES; io_path++)
+ {
+ PgStatIOOps *src = &src_io_path_ops[io_path];
+ PgStatIOOps *dest = &dest_io_path_ops[io_path];
+
+ dest->allocs += src->allocs;
+ dest->extends += src->extends;
+ dest->fsyncs += src->fsyncs;
+ dest->writes += src->writes;
+ }
+}
+
/* ----------
* pgstat_recv_wal() -
*
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 90a3016065..6785fb3813 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -338,6 +338,8 @@ typedef enum BackendType
B_LOGGER,
} BackendType;
+#define BACKEND_NUM_TYPES (B_LOGGER + 1)
+
extern BackendType MyBackendType;
extern const char *GetBackendTypeDesc(BackendType backendType);
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 5b51b58e5a..5d7cb9502d 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -73,6 +73,7 @@ typedef enum StatMsgType
PGSTAT_MTYPE_ARCHIVER,
PGSTAT_MTYPE_BGWRITER,
PGSTAT_MTYPE_CHECKPOINTER,
+ PGSTAT_MTYPE_IO_PATH_OPS,
PGSTAT_MTYPE_WAL,
PGSTAT_MTYPE_SLRU,
PGSTAT_MTYPE_FUNCSTAT,
@@ -335,6 +336,46 @@ typedef struct PgStat_MsgDropdb
} PgStat_MsgDropdb;
+/*
+ * Structure for counting all types of IO ops in the stats collector
+ */
+typedef struct PgStatIOOps
+{
+ PgStat_Counter allocs;
+ PgStat_Counter extends;
+ PgStat_Counter fsyncs;
+ PgStat_Counter writes;
+} PgStatIOOps;
+
+/*
+ * Structure for counting all IO Ops on all types of IO Paths.
+ */
+typedef struct PgStatIOPathOps
+{
+ PgStatIOOps io_path_ops[IOPATH_NUM_TYPES];
+} PgStatIOPathOps;
+
+/*
+ * Sent by a backend to the stats collector to report all IO Ops for all IO
+ * Paths for a given type of a backend. This will happen when the backend exits.
+ */
+typedef struct PgStat_MsgIOPathOps
+{
+ PgStat_MsgHdr m_hdr;
+
+ BackendType backend_type;
+ PgStatIOPathOps iop;
+} PgStat_MsgIOPathOps;
+
+/*
+ * Structure used by stats collector to keep track of all types of exited
+ * backends' IO Ops for all IO Paths.
+ */
+typedef struct PgStat_BackendIOPathOps
+{
+ PgStatIOPathOps ops[BACKEND_NUM_TYPES];
+} PgStat_BackendIOPathOps;
+
/* ----------
* PgStat_MsgResetcounter Sent by the backend to tell the collector
* to reset counters
@@ -756,6 +797,7 @@ typedef union PgStat_Msg
PgStat_MsgArchiver msg_archiver;
PgStat_MsgBgWriter msg_bgwriter;
PgStat_MsgCheckpointer msg_checkpointer;
+ PgStat_MsgIOPathOps msg_io_path_ops;
PgStat_MsgWal msg_wal;
PgStat_MsgSLRU msg_slru;
PgStat_MsgFuncstat msg_funcstat;
@@ -939,6 +981,7 @@ typedef struct PgStat_GlobalStats
PgStat_CheckpointerStats checkpointer;
PgStat_BgWriterStats bgwriter;
+ PgStat_BackendIOPathOps buffers;
} PgStat_GlobalStats;
/*
@@ -1215,8 +1258,10 @@ extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
extern void pgstat_send_archiver(const char *xlog, bool failed);
extern void pgstat_send_bgwriter(void);
+extern void pgstat_send_buffers(void);
extern void pgstat_send_checkpointer(void);
extern void pgstat_send_wal(bool force);
+extern void pgstat_sum_io_path_ops(PgStatIOOps *dest, IOOps *src);
/* ----------
* Support functions for the SQL-callable functions to
--
2.32.0
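The transfer step in this patch (a backend's atomic live counters folded into the collector's plain counters on exit) is easy to model standalone. The sketch below mirrors pgstat_sum_io_path_ops() under simplified, illustrative type names; note that, as the patch comment says, the destination is added to rather than assigned, so repeated calls (one per exiting backend) accumulate.

```c
#include <stdatomic.h>
#include <stdint.h>

#define IOPATH_NUM_TYPES 4

/* Live-backend side: atomics, like backend_status.h's IOOps. */
typedef struct IOOps
{
	atomic_uint_least64_t allocs, extends, fsyncs, writes;
} IOOps;

/* Collector side: plain counters, like PgStatIOOps. */
typedef struct PgStatIOOps
{
	uint64_t	allocs, extends, fsyncs, writes;
} PgStatIOOps;

/* Rough analogue of pgstat_sum_io_path_ops(): add src's live counters,
 * one element per IO path, into dest. Callers zero dest first. */
static void
sum_io_path_ops(PgStatIOOps *dest, IOOps *src)
{
	for (int io_path = 0; io_path < IOPATH_NUM_TYPES; io_path++)
	{
		dest->allocs += atomic_load(&src->allocs);
		dest->extends += atomic_load(&src->extends);
		dest->fsyncs += atomic_load(&src->fsyncs);
		dest->writes += atomic_load(&src->writes);
		dest++;
		src++;
	}
}
```

The same add-don't-assign shape appears again in pgstat_recv_io_path_ops(), where the collector merges each exiting backend's message into globalStats.buffers.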
Attachment: v16-0002-Add-IO-operation-counters-to-PgBackendStatus.patch (application/octet-stream)
From 0f7fe1e56c44b6787db863c78e9d92bac32a4a25 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 24 Nov 2021 10:32:56 -0500
Subject: [PATCH v16 2/7] Add IO operation counters to PgBackendStatus
Add an array of counters in PgBackendStatus which count the buffers
allocated, extended, fsynced, and written by a given backend. Each "IO
Op" (alloc, fsync, extend, write) is counted per "IO Path" (direct,
local, shared, or strategy). "local" and "shared" IO Path counters count
operations on local and shared buffers. The "strategy" IO Path counts
buffers alloc'd/written/read/fsync'd as part of a BufferAccessStrategy.
The "direct" IO Path counts blocks of IO which are read, written, or
fsync'd using smgrwrite/extend/immedsync directly (as opposed to through
[Local]BufferAlloc()).
With this commit, all backends increment a counter in their
PgBackendStatus when performing an IO operation. This is in preparation
for future commits which will persist these stats upon backend exit and
use the counters to provide observability of database IO operations.
Note that this commit does not add code to increment the "direct" path.
A separate proposed patch [1] which would add wrappers for smgrwrite(),
smgrextend(), and smgrimmedsync() would provide a good location to call
pgstat_inc_ioop() for unbuffered IO and avoid regressions for future
users of these functions.
[1] https://www.postgresql.org/message-id/CAAKRu_aw72w70X1P%3Dba20K8iGUvSkyz7Yk03wPPh3f9WgmcJ3g%40mail.gmail.com
---
src/backend/postmaster/checkpointer.c | 1 +
src/backend/storage/buffer/bufmgr.c | 46 ++++++++++---
src/backend/storage/buffer/freelist.c | 22 ++++++-
src/backend/storage/buffer/localbuf.c | 3 +
src/backend/storage/sync/sync.c | 1 +
src/backend/utils/activity/backend_status.c | 9 +++
src/include/storage/buf_internals.h | 4 +-
src/include/utils/backend_status.h | 73 +++++++++++++++++++++
8 files changed, 144 insertions(+), 15 deletions(-)
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 25a18b7a14..8440b2b802 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -1101,6 +1101,7 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
if (!AmBackgroundWriterProcess())
CheckpointerShmem->num_backend_fsync++;
LWLockRelease(CheckpointerCommLock);
+ pgstat_inc_ioop(IOOP_FSYNC, IOPATH_SHARED);
return false;
}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 08ebabfe96..6926fc5742 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -480,7 +480,7 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
BlockNumber blockNum,
BufferAccessStrategy strategy,
bool *foundPtr);
-static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+static void FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOPath iopath);
static void FindAndDropRelFileNodeBuffers(RelFileNode rnode,
ForkNumber forkNum,
BlockNumber nForkBlock,
@@ -972,6 +972,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (isExtend)
{
+ pgstat_inc_ioop(IOOP_EXTEND, IOPATH_SHARED);
/* new buffers are zero-filled */
MemSet((char *) bufBlock, 0, BLCKSZ);
/* don't set checksum for all-zero page */
@@ -1172,6 +1173,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
/* Loop here in case we have to try another victim buffer */
for (;;)
{
+ bool from_ring;
+
/*
* Ensure, while the spinlock's not yet held, that there's a free
* refcount entry.
@@ -1182,7 +1185,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* Select a victim buffer. The buffer is returned with its header
* spinlock still held!
*/
- buf = StrategyGetBuffer(strategy, &buf_state);
+ buf = StrategyGetBuffer(strategy, &buf_state, &from_ring);
Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
@@ -1219,6 +1222,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf),
LW_SHARED))
{
+ IOPath iopath;
+
/*
* If using a nondefault strategy, and writing the buffer
* would require a WAL flush, let the strategy decide whether
@@ -1236,7 +1241,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
UnlockBufHdr(buf, buf_state);
if (XLogNeedsFlush(lsn) &&
- StrategyRejectBuffer(strategy, buf))
+ StrategyRejectBuffer(strategy, buf, &from_ring))
{
/* Drop lock/pin and loop around for another buffer */
LWLockRelease(BufferDescriptorGetContentLock(buf));
@@ -1245,13 +1250,26 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
}
}
+ /*
+ * When a strategy is in use, if the dirty buffer was selected
+ * from the strategy ring and we did not bother checking the
+ * freelist or doing a clock sweep to look for a clean shared
+ * buffer to use, the write will be counted as a strategy
+ * write. However, if the dirty buffer was obtained from the
+ * freelist or a clock sweep, it is counted as a regular
+ * write. When a strategy is not in use, at this point, the
+ * write can only be a "regular" write of a dirty buffer.
+ */
+
+ iopath = from_ring ? IOPATH_STRATEGY : IOPATH_SHARED;
+
/* OK, do the I/O */
TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(forkNum, blockNum,
smgr->smgr_rnode.node.spcNode,
smgr->smgr_rnode.node.dbNode,
smgr->smgr_rnode.node.relNode);
- FlushBuffer(buf, NULL);
+ FlushBuffer(buf, NULL, iopath);
LWLockRelease(BufferDescriptorGetContentLock(buf));
ScheduleBufferTagForWriteback(&BackendWritebackContext,
@@ -2552,10 +2570,11 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
* Pin it, share-lock it, write it. (FlushBuffer will do nothing if the
* buffer is clean by the time we've locked it.)
*/
+
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOPATH_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
@@ -2803,9 +2822,12 @@ BufferGetTag(Buffer buffer, RelFileNode *rnode, ForkNumber *forknum,
*
* If the caller has an smgr reference for the buffer's relation, pass it
* as the second parameter. If not, pass NULL.
+ *
+ * IOPath will always be IOPATH_SHARED except when a buffer access strategy is
+ * used and the buffer being flushed is a buffer from the strategy ring.
*/
static void
-FlushBuffer(BufferDesc *buf, SMgrRelation reln)
+FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOPath iopath)
{
XLogRecPtr recptr;
ErrorContextCallback errcallback;
@@ -2897,6 +2919,8 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
bufToWrite,
false);
+ pgstat_inc_ioop(IOOP_WRITE, iopath);
+
if (track_io_timing)
{
INSTR_TIME_SET_CURRENT(io_time);
@@ -3544,6 +3568,8 @@ FlushRelationBuffers(Relation rel)
localpage,
false);
+ pgstat_inc_ioop(IOOP_WRITE, IOPATH_LOCAL);
+
buf_state &= ~(BM_DIRTY | BM_JUST_DIRTIED);
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
@@ -3579,7 +3605,7 @@ FlushRelationBuffers(Relation rel)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, RelationGetSmgr(rel));
+ FlushBuffer(bufHdr, RelationGetSmgr(rel), IOPATH_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
@@ -3675,7 +3701,7 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, srelent->srel);
+ FlushBuffer(bufHdr, srelent->srel, IOPATH_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
@@ -3731,7 +3757,7 @@ FlushDatabaseBuffers(Oid dbid)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOPATH_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
@@ -3758,7 +3784,7 @@ FlushOneBuffer(Buffer buffer)
Assert(LWLockHeldByMe(BufferDescriptorGetContentLock(bufHdr)));
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOPATH_SHARED);
}
/*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 6be80476db..45d73995b2 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -19,6 +19,7 @@
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/proc.h"
+#include "utils/backend_status.h"
#define INT_ACCESS_ONCE(var) ((int)(*((volatile int *)&(var))))
@@ -198,7 +199,7 @@ have_free_buffer(void)
* return the buffer with the buffer header spinlock still held.
*/
BufferDesc *
-StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
+StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_ring)
{
BufferDesc *buf;
int bgwprocno;
@@ -212,7 +213,8 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
if (strategy != NULL)
{
buf = GetBufferFromRing(strategy, buf_state);
- if (buf != NULL)
+ *from_ring = buf != NULL;
+ if (*from_ring)
return buf;
}
@@ -247,6 +249,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
* the rate of buffer consumption. Note that buffers recycled by a
* strategy object are intentionally not counted here.
*/
+ pgstat_inc_ioop(IOOP_ALLOC, IOPATH_SHARED);
pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);
/*
@@ -683,8 +686,14 @@ AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
* if this buffer should be written and re-used.
*/
bool
-StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
+StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool *from_ring)
{
+ /*
+ * Start by assuming that we will use the dirty buffer selected by
+ * StrategyGetBuffer().
+ */
+ *from_ring = true;
+
/* We only do this in bulkread mode */
if (strategy->btype != BAS_BULKREAD)
return false;
@@ -700,5 +709,12 @@ StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
*/
strategy->buffers[strategy->current] = InvalidBuffer;
+ /*
+ * Since we will not be writing out a dirty buffer from the ring, set
+ * from_ring to false so that the caller does not count this write as a
+ * "strategy write" and can do proper bookkeeping.
+ */
+ *from_ring = false;
+
return true;
}
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 04b3558ea3..f396a2b68d 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -20,6 +20,7 @@
#include "executor/instrument.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
+#include "utils/backend_status.h"
#include "utils/guc.h"
#include "utils/memutils.h"
#include "utils/resowner_private.h"
@@ -226,6 +227,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
localpage,
false);
+ pgstat_inc_ioop(IOOP_WRITE, IOPATH_LOCAL);
+
/* Mark not-dirty now in case we error out below */
buf_state &= ~BM_DIRTY;
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index d4083e8a56..3be06d5d5a 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -420,6 +420,7 @@ ProcessSyncRequests(void)
total_elapsed += elapsed;
processed++;
+ pgstat_inc_ioop(IOOP_FSYNC, IOPATH_SHARED);
if (log_checkpoints)
elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f ms",
processed,
diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index 7229598822..413cc605f8 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -399,6 +399,15 @@ pgstat_bestart(void)
lbeentry.st_progress_command = PROGRESS_COMMAND_INVALID;
lbeentry.st_progress_command_target = InvalidOid;
lbeentry.st_query_id = UINT64CONST(0);
+ for (int io_path = 0; io_path < IOPATH_NUM_TYPES; io_path++)
+ {
+ IOOps *io_ops = &lbeentry.io_path_stats[io_path];
+
+ pg_atomic_init_u64(&io_ops->allocs, 0);
+ pg_atomic_init_u64(&io_ops->extends, 0);
+ pg_atomic_init_u64(&io_ops->fsyncs, 0);
+ pg_atomic_init_u64(&io_ops->writes, 0);
+ }
/*
* we don't zero st_progress_param here to save cycles; nobody should
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 33fcaf5c9a..7e385135db 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -310,10 +310,10 @@ extern void ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *
/* freelist.c */
extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
- uint32 *buf_state);
+ uint32 *buf_state, bool *from_ring);
extern void StrategyFreeBuffer(BufferDesc *buf);
extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
- BufferDesc *buf);
+ BufferDesc *buf, bool *from_ring);
extern int StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
extern void StrategyNotifyBgWriter(int bgwprocno);
diff --git a/src/include/utils/backend_status.h b/src/include/utils/backend_status.h
index 8042b817df..717f58d4cc 100644
--- a/src/include/utils/backend_status.h
+++ b/src/include/utils/backend_status.h
@@ -13,6 +13,7 @@
#include "datatype/timestamp.h"
#include "libpq/pqcomm.h"
#include "miscadmin.h" /* for BackendType */
+#include "port/atomics.h"
#include "utils/backend_progress.h"
@@ -31,12 +32,48 @@ typedef enum BackendState
STATE_DISABLED
} BackendState;
+/* ----------
+ * IO Stats reporting utility types
+ * ----------
+ */
+
+typedef enum IOOp
+{
+ IOOP_ALLOC,
+ IOOP_EXTEND,
+ IOOP_FSYNC,
+ IOOP_WRITE,
+} IOOp;
+
+#define IOOP_NUM_TYPES (IOOP_WRITE + 1)
+
+typedef enum IOPath
+{
+ IOPATH_DIRECT,
+ IOPATH_LOCAL,
+ IOPATH_SHARED,
+ IOPATH_STRATEGY,
+} IOPath;
+
+#define IOPATH_NUM_TYPES (IOPATH_STRATEGY + 1)
+
/* ----------
* Shared-memory data structures
* ----------
*/
+/*
+ * Structure for counting all types of IOOps for a live backend.
+ */
+typedef struct IOOps
+{
+ pg_atomic_uint64 allocs;
+ pg_atomic_uint64 extends;
+ pg_atomic_uint64 fsyncs;
+ pg_atomic_uint64 writes;
+} IOOps;
+
/*
* PgBackendSSLStatus
*
@@ -168,6 +205,16 @@ typedef struct PgBackendStatus
/* query identifier, optionally computed using post_parse_analyze_hook */
uint64 st_query_id;
+
+ /*
+ * Stats on all IO Ops for all IO Paths for this backend. When the
+ * pg_stat_buffers view is queried and when stats are reset, one backend
+ * will read io_path_stats from all live backends and combine them with
+ * io_path_stats from exited backends for each backend type. When this
+ * backend exits, it will send io_path_stats to the stats collector to be
+ * persisted.
+ */
+ IOOps io_path_stats[IOPATH_NUM_TYPES];
} PgBackendStatus;
@@ -296,6 +343,32 @@ extern void pgstat_bestart(void);
extern void pgstat_clear_backend_activity_snapshot(void);
/* Activity reporting functions */
+
+static inline void
+pgstat_inc_ioop(IOOp io_op, IOPath io_path)
+{
+ IOOps *io_ops;
+ PgBackendStatus *beentry = MyBEEntry;
+
+ Assert(beentry);
+
+ io_ops = &beentry->io_path_stats[io_path];
+ switch (io_op)
+ {
+ case IOOP_ALLOC:
+ pg_atomic_inc_counter(&io_ops->allocs);
+ break;
+ case IOOP_EXTEND:
+ pg_atomic_inc_counter(&io_ops->extends);
+ break;
+ case IOOP_FSYNC:
+ pg_atomic_inc_counter(&io_ops->fsyncs);
+ break;
+ case IOOP_WRITE:
+ pg_atomic_inc_counter(&io_ops->writes);
+ break;
+ }
+}
extern void pgstat_report_activity(BackendState state, const char *cmd_str);
extern void pgstat_report_query_id(uint64 query_id, bool force);
extern void pgstat_report_tempfile(size_t filesize);
--
2.32.0
Attachment: v16-0001-Read-only-atomic-backend-write-function.patch (application/octet-stream)
From 8295bae890fbcf18fff8ebe0b3a8a51571b7c065 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 11 Oct 2021 16:15:06 -0400
Subject: [PATCH v16 1/7] Read-only atomic backend write function
For counters in shared memory which can be read by any backend but only
written to by one backend, an atomic is still needed to protect against
torn values, however, pg_atomic_fetch_add_u64() is overkill for
incrementing the counter. pg_atomic_inc_counter() is a helper function
which can be used to increment these values safely but without
unnecessary overhead.
Author: Thomas Munro
---
src/include/port/atomics.h | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/src/include/port/atomics.h b/src/include/port/atomics.h
index 856338f161..9bbc0322c9 100644
--- a/src/include/port/atomics.h
+++ b/src/include/port/atomics.h
@@ -519,6 +519,17 @@ pg_atomic_sub_fetch_u64(volatile pg_atomic_uint64 *ptr, int64 sub_)
return pg_atomic_sub_fetch_u64_impl(ptr, sub_);
}
+/*
+ * On modern systems this is really just *counter++. On some older systems
+ * there might be more to it, due to inability to read and write 64 bit values
+ * atomically.
+ */
+static inline void
+pg_atomic_inc_counter(pg_atomic_uint64 * counter)
+{
+ pg_atomic_write_u64(counter, pg_atomic_read_u64(counter) + 1);
+}
+
#undef INSIDE_ATOMICS_H
#endif /* ATOMICS_H */
--
2.32.0
On Wed, Dec 01, 2021 at 05:00:14PM -0500, Melanie Plageman wrote:
Also:
src/include/miscadmin.h:#define BACKEND_NUM_TYPES (B_LOGGER + 1)

I think it's wrong to say NUM_TYPES = B_LOGGER + 1 (which would suggest using
lessthan-or-equal instead of lessthan as you are). Since the valid backend
types start at 1, the "count" of backend types is currently B_LOGGER (13) -
not 14. I think you should remove the "+1" here. Then NROWS (if it continued
to exist at all) wouldn't need to subtract one.

I think what I currently have is technically correct because I start at
1 when I am using it as a loop condition. I do waste a spot in the
arrays I allocate with BACKEND_NUM_TYPES size. I was hesitant to make the
value of BACKEND_NUM_TYPES == B_LOGGER because it seems kind of weird to
have it have the same value as the B_LOGGER enum.
I don't mean to say that the code is misbehaving - I mean "num_x" means "the
number of x's" - how many there are. Since the first, valid backend type is 1,
and they're numbered consecutively and without duplicates, then "the number of
backend types" is the same as the value of the last one (B_LOGGER). It's
confusing if there's a macro called BACKEND_NUM_TYPES which is greater than the
number of backend types.
Most loops say for (int i=0; i<NUM; ++i)
If it's 1-based, they say for (int i=1; i<=NUM; ++i)
You have two different loops like:
+ for (int i = 0; i < BACKEND_NUM_TYPES - 1 ; i++)
+ for (int backend_type = 1; backend_type < BACKEND_NUM_TYPES; backend_type++)
Both of these iterate over the correct number of backend types, but they both
*look* wrong, which isn't desirable.
--
Justin
On Wed, Dec 01, 2021 at 04:59:44PM -0500, Melanie Plageman wrote:
Thanks for the review!
On Wed, Nov 24, 2021 at 8:16 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
You wrote beentry++ at the start of two loops, but I think that's wrong; it
should be at the end, as in the rest of the file (or as a loop increment).
BackendStatusArray[0] is actually used (even though its backend has
backendId==1, not 0). "MyBEEntry = &BackendStatusArray[MyBackendId - 1];"

I've fixed this in v16 which I will attach to the next email in the thread.
You could put *_NUM_TYPES as the last value in these enums, like
NUM_AUXPROCTYPES, NUM_PMSIGNALS, and NUM_PROCSIGNALS:

+#define IOOP_NUM_TYPES (IOOP_WRITE + 1)
+#define IOPATH_NUM_TYPES (IOPATH_STRATEGY + 1)
+#define BACKEND_NUM_TYPES (B_LOGGER + 1)

I originally had it as you describe, but based on this feedback upthread
from Álvaro Herrera:
I saw that after I made my suggestion. Sorry for the noise.
Both ways already exist in postgres and seem to be acceptable.
There's extraneous blank lines in these functions:
+pgstat_recv_resetsharedcounter

I didn't see one here
=> The extra blank line is after the RESET_BUFFERS memset.
+ * Reset the global, bgwriter and checkpointer statistics for the
+ * cluster.
The first comma in this comment was introduced in 1bc8e7b09, and seems to be
extraneous, since bgwriter and checkpointer are both global. With the comma,
it looks like it should be memsetting 3 things.
+ /* Don't count dead backends. They should already be counted */
Maybe this comment should say ".. they'll be added below"
+ row[COLUMN_BACKEND_TYPE] = backend_type_desc;
+ row[COLUMN_IO_PATH] = CStringGetTextDatum(GetIOPathDesc(io_path));
+ row[COLUMN_ALLOCS] += io_ops->allocs - resets->allocs;
+ row[COLUMN_EXTENDS] += io_ops->extends - resets->extends;
+ row[COLUMN_FSYNCS] += io_ops->fsyncs - resets->fsyncs;
+ row[COLUMN_WRITES] += io_ops->writes - resets->writes;
+ row[COLUMN_RESET_TIME] = reset_time;
It'd be clearer if RESET_TIME were set adjacent to BACKEND_TYPE and IO_PATH.
Rather than memset(), you could initialize msg like this.
PgStat_MsgIOPathOps msg = {0};

Though changing the initialization to universal zero initialization
seems to be the correct way, I do get this compiler warning when I make
the change:

pgstat.c:3212:29: warning: suggest braces around initialization of subobject [-Wmissing-braces]
I have seem some comments online that say that this is a spurious
warning present with some versions of both gcc and clang when using
-Wmissing-braces to compile code with universal zero initialization, but
I'm not sure what I should do.
I think gcc is suggesting to write something like {{0}}, and I think the online
commentary you found is saying that the warning is a false positive.
So I think you should ignore my suggestion - it's not worth the bother.
This message needs to be updated:
errhint("Target must be \"archiver\", \"bgwriter\", or \"wal\".")))
When I query the view, I see reset times as: 1999-12-31 18:00:00-06.
I guess it should be initialized like this one:
globalStats.bgwriter.stat_reset_timestamp = ts
The cfbot shows failures now (I thought it was passing with the previous patch,
but I suppose I'm wrong.)
It looks like running recovery during single user mode hits this assertion.
TRAP: FailedAssertion("beentry", File: "../../../../src/include/utils/backend_status.h", Line: 359, PID: 3499)
--
Justin
On Wed, Dec 01, 2021 at 04:59:44PM -0500, Melanie Plageman wrote:
Thanks for the review!
On Wed, Nov 24, 2021 at 8:16 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
You wrote beentry++ at the start of two loops, but I think that's wrong; it
should be at the end, as in the rest of the file (or as a loop increment).
BackendStatusArray[0] is actually used (even though its backend has
backendId==1, not 0). "MyBEEntry = &BackendStatusArray[MyBackendId - 1];"I've fixed this in v16 which I will attach to the next email in the thread.
I just noticed that since beentry++ is now at the end of the loop, it's being
missed when you "continue":
+ if (beentry->st_procpid == 0)
+ continue;
Also, I saw that pgindent messed up and added spaces after pointers in function
declarations, due to new typedefs not in typedefs.list:
-pgstat_send_buffers_reset(PgStat_MsgResetsharedcounter *msg)
+pgstat_send_buffers_reset(PgStat_MsgResetsharedcounter * msg)
-static inline void pg_atomic_inc_counter(pg_atomic_uint64 *counter)
+static inline void
+pg_atomic_inc_counter(pg_atomic_uint64 * counter)
--
Justin
Thanks again! I really appreciate the thorough review.
I have combined responses to all three of your emails below.
Let me know if it is more confusing to do it this way.
On Wed, Dec 1, 2021 at 6:59 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
On Wed, Dec 01, 2021 at 05:00:14PM -0500, Melanie Plageman wrote:
Also:
src/include/miscadmin.h:#define BACKEND_NUM_TYPES (B_LOGGER + 1)

I think it's wrong to say NUM_TYPES = B_LOGGER + 1 (which would suggest using
lessthan-or-equal instead of lessthan as you are). Since the valid backend
types start at 1, the "count" of backend types is currently B_LOGGER (13) -
not 14. I think you should remove the "+1" here. Then NROWS (if it continued
to exist at all) wouldn't need to subtract one.

I think what I currently have is technically correct because I start at
1 when I am using it as a loop condition. I do waste a spot in the
arrays I allocate with BACKEND_NUM_TYPES size. I was hesitant to make the
value of BACKEND_NUM_TYPES == B_LOGGER because it seems kind of weird to
have it have the same value as the B_LOGGER enum.

I don't mean to say that the code is misbehaving - I mean "num_x" means "the
number of x's" - how many there are. Since the first, valid backend type is 1,
and they're numbered consecutively and without duplicates, then "the number of
backend types" is the same as the value of the last one (B_LOGGER). It's
confusing if there's a macro called BACKEND_NUM_TYPES which is greater than the
number of backend types.

Most loops say for (int i=0; i<NUM; ++i)
If it's 1-based, they say for (int i=1; i<=NUM; ++i)

You have two different loops like:

+ for (int i = 0; i < BACKEND_NUM_TYPES - 1 ; i++)
+ for (int backend_type = 1; backend_type < BACKEND_NUM_TYPES; backend_type++)

Both of these iterate over the correct number of backend types, but they both
*look* wrong, which isn't desirable.
I've changed this and added comments wherever I could to make it clear.
Whenever the parameter was of type BackendType, I tried to use the
correct (not adjusted by subtracting 1) number and wherever the type was
int and being used as an index into the array, I used the adjusted value
and added the idx suffix to make it clear that the number does not
reflect the actual BackendType:
On Wed, Dec 1, 2021 at 10:31 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
On Wed, Dec 01, 2021 at 04:59:44PM -0500, Melanie Plageman wrote:
Thanks for the review!
On Wed, Nov 24, 2021 at 8:16 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
There's extraneous blank lines in these functions:
+pgstat_recv_resetsharedcounter

I didn't see one here
=> The extra blank line is after the RESET_BUFFERS memset.
Fixed.
+ * Reset the global, bgwriter and checkpointer statistics for the
+ * cluster.

The first comma in this comment was introduced in 1bc8e7b09, and seems to be
extraneous, since bgwriter and checkpointer are both global. With the comma,
it looks like it should be memsetting 3 things.
Fixed.
+ /* Don't count dead backends. They should already be counted */
Maybe this comment should say ".. they'll be added below"
Fixed.
+ row[COLUMN_BACKEND_TYPE] = backend_type_desc;
+ row[COLUMN_IO_PATH] = CStringGetTextDatum(GetIOPathDesc(io_path));
+ row[COLUMN_ALLOCS] += io_ops->allocs - resets->allocs;
+ row[COLUMN_EXTENDS] += io_ops->extends - resets->extends;
+ row[COLUMN_FSYNCS] += io_ops->fsyncs - resets->fsyncs;
+ row[COLUMN_WRITES] += io_ops->writes - resets->writes;
+ row[COLUMN_RESET_TIME] = reset_time;

It'd be clearer if RESET_TIME were set adjacent to BACKEND_TYPE and IO_PATH.
If you mean just in the order here (not in the column order in the
view), then I have changed it as you recommended.
This message needs to be updated:
errhint("Target must be \"archiver\", \"bgwriter\", or \"wal\".")))
Done.
When I query the view, I see reset times as: 1999-12-31 18:00:00-06.
I guess it should be initialized like this one:
globalStats.bgwriter.stat_reset_timestamp = ts
Done.
The cfbot shows failures now (I thought it was passing with the previous patch,
but I suppose I'm wrong.)

It looks like running recovery during single user mode hits this assertion.
TRAP: FailedAssertion("beentry", File: "../../../../src/include/utils/backend_status.h", Line: 359, PID: 3499)
Yes, thank you for catching this.
I have moved up pgstat_beinit and pgstat_bestart so that single user
mode process will also have PgBackendStatus. I also have to guard
against sending these stats to the collector since there is no room for
B_INVALID backendtype in the array of IO Op values.
With this change `make check-world` passes on my machine.
On Wed, Dec 1, 2021 at 11:06 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
On Wed, Dec 01, 2021 at 04:59:44PM -0500, Melanie Plageman wrote:
Thanks for the review!
On Wed, Nov 24, 2021 at 8:16 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
You wrote beentry++ at the start of two loops, but I think that's wrong; it
should be at the end, as in the rest of the file (or as a loop increment).
BackendStatusArray[0] is actually used (even though its backend has
backendId==1, not 0). "MyBEEntry = &BackendStatusArray[MyBackendId - 1];"

I've fixed this in v16 which I will attach to the next email in the thread.
I just noticed that since beentry++ is now at the end of the loop, it's being
missed when you "continue":

+ if (beentry->st_procpid == 0)
+ continue;
Fixed.
Also, I saw that pgindent messed up and added spaces after pointers in function
declarations, due to new typedefs not in typedefs.list:

-pgstat_send_buffers_reset(PgStat_MsgResetsharedcounter *msg)
+pgstat_send_buffers_reset(PgStat_MsgResetsharedcounter * msg)

-static inline void pg_atomic_inc_counter(pg_atomic_uint64 *counter)
+static inline void
+pg_atomic_inc_counter(pg_atomic_uint64 * counter)
Fixed.
-- Melanie
Attachments:
Attachment: v17-0007-small-comment-correction.patch (application/octet-stream)
From e917d83ca3760a9d2660178af3d3e45392e3e06b Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 24 Nov 2021 12:21:08 -0500
Subject: [PATCH v17 7/7] small comment correction
---
src/backend/utils/activity/backend_status.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index f1997441bc..fe8fc74121 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -296,7 +296,7 @@ pgstat_beinit(void)
* pgstat_bestart() -
*
* Initialize this backend's entry in the PgBackendStatus array.
- * Called from InitPostgres.
+ * Called from InitPostgres and AuxiliaryProcessMain
*
* Apart from auxiliary processes, MyBackendId, MyDatabaseId,
* session userid, and application_name must be set for a
--
2.32.0
Attachment: v17-0006-Remove-superfluous-bgwriter-stats.patch (application/octet-stream)
From 9f22da9041e1e1fbc0ef003f5f78f4e72274d438 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 24 Nov 2021 12:20:10 -0500
Subject: [PATCH v17 6/7] Remove superfluous bgwriter stats
Remove stats from pg_stat_bgwriter which are now more clearly expressed
in pg_stat_buffers.
TODO:
- make pg_stat_checkpointer view and move relevant stats into it
- add additional stats to pg_stat_bgwriter
---
doc/src/sgml/monitoring.sgml | 47 ---------------------------
src/backend/catalog/system_views.sql | 5 ---
src/backend/postmaster/checkpointer.c | 29 ++---------------
src/backend/postmaster/pgstat.c | 5 ---
src/backend/storage/buffer/bufmgr.c | 6 ----
src/backend/utils/adt/pgstatfuncs.c | 30 -----------------
src/include/catalog/pg_proc.dat | 22 -------------
src/include/pgstat.h | 10 ------
src/test/regress/expected/rules.out | 5 ---
9 files changed, 2 insertions(+), 157 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index b16952d439..9e869a7fbf 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3551,24 +3551,6 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
</para></entry>
</row>
- <row>
- <entry role="catalog_table_entry"><para role="column_definition">
- <structfield>buffers_checkpoint</structfield> <type>bigint</type>
- </para>
- <para>
- Number of buffers written during checkpoints
- </para></entry>
- </row>
-
- <row>
- <entry role="catalog_table_entry"><para role="column_definition">
- <structfield>buffers_clean</structfield> <type>bigint</type>
- </para>
- <para>
- Number of buffers written by the background writer
- </para></entry>
- </row>
-
<row>
<entry role="catalog_table_entry"><para role="column_definition">
<structfield>maxwritten_clean</structfield> <type>bigint</type>
@@ -3579,35 +3561,6 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
</para></entry>
</row>
- <row>
- <entry role="catalog_table_entry"><para role="column_definition">
- <structfield>buffers_backend</structfield> <type>bigint</type>
- </para>
- <para>
- Number of buffers written directly by a backend
- </para></entry>
- </row>
-
- <row>
- <entry role="catalog_table_entry"><para role="column_definition">
- <structfield>buffers_backend_fsync</structfield> <type>bigint</type>
- </para>
- <para>
- Number of times a backend had to execute its own
- <function>fsync</function> call (normally the background writer handles those
- even when the backend does its own write)
- </para></entry>
- </row>
-
- <row>
- <entry role="catalog_table_entry"><para role="column_definition">
- <structfield>buffers_alloc</structfield> <type>bigint</type>
- </para>
- <para>
- Number of buffers allocated
- </para></entry>
- </row>
-
<row>
<entry role="catalog_table_entry"><para role="column_definition">
<structfield>stats_reset</structfield> <type>timestamp with time zone</type>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index e214d23056..80d73c40ad 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1068,12 +1068,7 @@ CREATE VIEW pg_stat_bgwriter AS
pg_stat_get_bgwriter_requested_checkpoints() AS checkpoints_req,
pg_stat_get_checkpoint_write_time() AS checkpoint_write_time,
pg_stat_get_checkpoint_sync_time() AS checkpoint_sync_time,
- pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint,
- pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean,
pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
- pg_stat_get_buf_written_backend() AS buffers_backend,
- pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
- pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
CREATE VIEW pg_stat_buffers AS
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 8440b2b802..b9c3745474 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -90,17 +90,9 @@
* requesting backends since the last checkpoint start. The flags are
* chosen so that OR'ing is the correct way to combine multiple requests.
*
- * num_backend_writes is used to count the number of buffer writes performed
- * by user backend processes. This counter should be wide enough that it
- * can't overflow during a single processing cycle. num_backend_fsync
- * counts the subset of those writes that also had to do their own fsync,
- * because the checkpointer failed to absorb their request.
- *
* The requests array holds fsync requests sent by backends and not yet
* absorbed by the checkpointer.
*
- * Unlike the checkpoint fields, num_backend_writes, num_backend_fsync, and
- * the requests fields are protected by CheckpointerCommLock.
*----------
*/
typedef struct
@@ -124,9 +116,6 @@ typedef struct
ConditionVariable start_cv; /* signaled when ckpt_started advances */
ConditionVariable done_cv; /* signaled when ckpt_done advances */
- uint32 num_backend_writes; /* counts user backend buffer writes */
- uint32 num_backend_fsync; /* counts user backend fsync calls */
-
int num_requests; /* current # of requests */
int max_requests; /* allocated array size */
CheckpointerRequest requests[FLEXIBLE_ARRAY_MEMBER];
@@ -1081,10 +1070,6 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
- /* Count all backend writes regardless of if they fit in the queue */
- if (!AmBackgroundWriterProcess())
- CheckpointerShmem->num_backend_writes++;
-
/*
* If the checkpointer isn't running or the request queue is full, the
* backend will have to perform its own fsync request. But before forcing
@@ -1094,13 +1079,12 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
(CheckpointerShmem->num_requests >= CheckpointerShmem->max_requests &&
!CompactCheckpointerRequestQueue()))
{
+ LWLockRelease(CheckpointerCommLock);
+
/*
* Count the subset of writes where backends have to do their own
* fsync
*/
- if (!AmBackgroundWriterProcess())
- CheckpointerShmem->num_backend_fsync++;
- LWLockRelease(CheckpointerCommLock);
pgstat_inc_ioop(IOOP_FSYNC, IOPATH_SHARED);
return false;
}
@@ -1257,15 +1241,6 @@ AbsorbSyncRequests(void)
LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
- /* Transfer stats counts into pending pgstats message */
- PendingCheckpointerStats.m_buf_written_backend
- += CheckpointerShmem->num_backend_writes;
- PendingCheckpointerStats.m_buf_fsync_backend
- += CheckpointerShmem->num_backend_fsync;
-
- CheckpointerShmem->num_backend_writes = 0;
- CheckpointerShmem->num_backend_fsync = 0;
-
/*
* We try to avoid holding the lock for a long time by copying the request
* array, and processing the requests after releasing the lock.
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 9b041cb1b9..b1ea10a77a 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -5945,9 +5945,7 @@ pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
static void
pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
{
- globalStats.bgwriter.buf_written_clean += msg->m_buf_written_clean;
globalStats.bgwriter.maxwritten_clean += msg->m_maxwritten_clean;
- globalStats.bgwriter.buf_alloc += msg->m_buf_alloc;
}
/* ----------
@@ -5963,9 +5961,6 @@ pgstat_recv_checkpointer(PgStat_MsgCheckpointer *msg, int len)
globalStats.checkpointer.requested_checkpoints += msg->m_requested_checkpoints;
globalStats.checkpointer.checkpoint_write_time += msg->m_checkpoint_write_time;
globalStats.checkpointer.checkpoint_sync_time += msg->m_checkpoint_sync_time;
- globalStats.checkpointer.buf_written_checkpoints += msg->m_buf_written_checkpoints;
- globalStats.checkpointer.buf_written_backend += msg->m_buf_written_backend;
- globalStats.checkpointer.buf_fsync_backend += msg->m_buf_fsync_backend;
}
static void
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 6926fc5742..67447f997a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2164,7 +2164,6 @@ BufferSync(int flags)
if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
{
TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
- PendingCheckpointerStats.m_buf_written_checkpoints++;
num_written++;
}
}
@@ -2273,9 +2272,6 @@ BgBufferSync(WritebackContext *wb_context)
*/
strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
- /* Report buffer alloc counts to pgstat */
- PendingBgWriterStats.m_buf_alloc += recent_alloc;
-
/*
* If we're not running the LRU scan, just stop after doing the stats
* stuff. We mark the saved state invalid so that we can recover sanely
@@ -2472,8 +2468,6 @@ BgBufferSync(WritebackContext *wb_context)
reusable_buffers++;
}
- PendingBgWriterStats.m_buf_written_clean += num_written;
-
#ifdef BGW_DEBUG
elog(DEBUG1, "bgwriter: recent_alloc=%u smoothed=%.2f delta=%ld ahead=%d density=%.2f reusable_est=%d upcoming_est=%d scanned=%d wrote=%d reusable=%d",
recent_alloc, smoothed_alloc, strategy_delta, bufs_ahead,
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 9fd7a8cdb9..29d3dc6a79 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1738,18 +1738,6 @@ pg_stat_get_bgwriter_requested_checkpoints(PG_FUNCTION_ARGS)
PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->requested_checkpoints);
}
-Datum
-pg_stat_get_bgwriter_buf_written_checkpoints(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_written_checkpoints);
-}
-
-Datum
-pg_stat_get_bgwriter_buf_written_clean(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_written_clean);
-}
-
Datum
pg_stat_get_bgwriter_maxwritten_clean(PG_FUNCTION_ARGS)
{
@@ -1778,24 +1766,6 @@ pg_stat_get_bgwriter_stat_reset_time(PG_FUNCTION_ARGS)
PG_RETURN_TIMESTAMPTZ(pgstat_fetch_stat_bgwriter()->stat_reset_timestamp);
}
-Datum
-pg_stat_get_buf_written_backend(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_written_backend);
-}
-
-Datum
-pg_stat_get_buf_fsync_backend(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_fsync_backend);
-}
-
-Datum
-pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
-}
-
/*
* When adding a new column to the pg_stat_buffers view, add a new enum
* value here above COLUMN_LENGTH.
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 32253383ba..e303efa798 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5612,16 +5612,6 @@
proname => 'pg_stat_get_bgwriter_requested_checkpoints', provolatile => 's',
proparallel => 'r', prorettype => 'int8', proargtypes => '',
prosrc => 'pg_stat_get_bgwriter_requested_checkpoints' },
-{ oid => '2771',
- descr => 'statistics: number of buffers written by the bgwriter during checkpoints',
- proname => 'pg_stat_get_bgwriter_buf_written_checkpoints', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_bgwriter_buf_written_checkpoints' },
-{ oid => '2772',
- descr => 'statistics: number of buffers written by the bgwriter for cleaning dirty buffers',
- proname => 'pg_stat_get_bgwriter_buf_written_clean', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_bgwriter_buf_written_clean' },
{ oid => '2773',
descr => 'statistics: number of times the bgwriter stopped processing when it had written too many buffers while cleaning',
proname => 'pg_stat_get_bgwriter_maxwritten_clean', provolatile => 's',
@@ -5641,18 +5631,6 @@
proname => 'pg_stat_get_checkpoint_sync_time', provolatile => 's',
proparallel => 'r', prorettype => 'float8', proargtypes => '',
prosrc => 'pg_stat_get_checkpoint_sync_time' },
-{ oid => '2775', descr => 'statistics: number of buffers written by backends',
- proname => 'pg_stat_get_buf_written_backend', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_buf_written_backend' },
-{ oid => '3063',
- descr => 'statistics: number of backend buffer writes that did their own fsync',
- proname => 'pg_stat_get_buf_fsync_backend', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_buf_fsync_backend' },
-{ oid => '2859', descr => 'statistics: number of buffer allocations',
- proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
- prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
{ oid => '8459', descr => 'statistics: counts of all IO operations done to all IO paths by each type of backend.',
proname => 'pg_stat_get_buffers', provolatile => 's', proisstrict => 'f',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 9e1c635108..865dd6d201 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -513,9 +513,7 @@ typedef struct PgStat_MsgBgWriter
{
PgStat_MsgHdr m_hdr;
- PgStat_Counter m_buf_written_clean;
PgStat_Counter m_maxwritten_clean;
- PgStat_Counter m_buf_alloc;
} PgStat_MsgBgWriter;
/* ----------
@@ -528,9 +526,6 @@ typedef struct PgStat_MsgCheckpointer
PgStat_Counter m_timed_checkpoints;
PgStat_Counter m_requested_checkpoints;
- PgStat_Counter m_buf_written_checkpoints;
- PgStat_Counter m_buf_written_backend;
- PgStat_Counter m_buf_fsync_backend;
PgStat_Counter m_checkpoint_write_time; /* times in milliseconds */
PgStat_Counter m_checkpoint_sync_time;
} PgStat_MsgCheckpointer;
@@ -961,9 +956,7 @@ typedef struct PgStat_ArchiverStats
*/
typedef struct PgStat_BgWriterStats
{
- PgStat_Counter buf_written_clean;
PgStat_Counter maxwritten_clean;
- PgStat_Counter buf_alloc;
TimestampTz stat_reset_timestamp;
} PgStat_BgWriterStats;
@@ -977,9 +970,6 @@ typedef struct PgStat_CheckpointerStats
PgStat_Counter requested_checkpoints;
PgStat_Counter checkpoint_write_time; /* times in milliseconds */
PgStat_Counter checkpoint_sync_time;
- PgStat_Counter buf_written_checkpoints;
- PgStat_Counter buf_written_backend;
- PgStat_Counter buf_fsync_backend;
} PgStat_CheckpointerStats;
/*
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 5869ce442f..09f495792d 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1821,12 +1821,7 @@ pg_stat_bgwriter| SELECT pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints
pg_stat_get_bgwriter_requested_checkpoints() AS checkpoints_req,
pg_stat_get_checkpoint_write_time() AS checkpoint_write_time,
pg_stat_get_checkpoint_sync_time() AS checkpoint_sync_time,
- pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint,
- pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean,
pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
- pg_stat_get_buf_written_backend() AS buffers_backend,
- pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
- pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
pg_stat_buffers| SELECT b.backend_type,
b.io_path,
--
2.32.0
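For orientation, the rules.out hunk above shows which columns the series removes from pg_stat_bgwriter. A query against the trimmed view would only see the remaining columns (a sketch based on that hunk, not output from a running server):

```sql
-- Columns remaining in pg_stat_bgwriter after the patch above
-- (per the rules.out hunk): checkpoint counters, maxwritten_clean,
-- and stats_reset. The removed buffer-write counters are superseded
-- by the pg_stat_buffers view added later in the series.
SELECT checkpoints_timed, checkpoints_req,
       checkpoint_write_time, checkpoint_sync_time,
       maxwritten_clean, stats_reset
FROM pg_stat_bgwriter;
```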
v17-0005-Add-system-view-tracking-IO-ops-per-backend-type.patch
From 6d84255845e9f945f96a30e787befe93fb162695 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 24 Nov 2021 12:07:37 -0500
Subject: [PATCH v17 5/7] Add system view tracking IO ops per backend type
Add pg_stat_buffers, a system view which tracks the number of IO
operations (allocs, writes, fsyncs, and extends) done through each IO
path (e.g. shared buffers, local buffers, unbuffered IO) by each type of
backend.
Some of these should always be zero. For example, checkpointer does not
use a BufferAccessStrategy (currently), so the "strategy" IO path for
checkpointer will be 0 for all IO operations (alloc, write, fsync, and
extend). All possible combinations of IO Path and IO Op are enumerated
in the view, but not all are populated or even possible at this point.
All backends increment a counter in their PgBackendStatus when
performing an IO operation. On exit, backends send these stats to the
stats collector to be persisted.
When the pg_stat_buffers view is queried, one backend will sum live
backends' stats with saved stats from exited backends and subtract saved
reset stats, returning the total.
Each row of the view is stats for a particular backend type for a
particular IO path (e.g. shared buffer accesses by checkpointer) and
each column in the view is the total number of IO operations done (e.g.
writes).
So a cell in the view would be, for example, the number of shared
buffers written by checkpointer since the last stats reset.
Discussion: https://www.postgresql.org/message-id/flat/20210415235954.qcypb4urtovzkat5%40alap3.anarazel.de#724d5cce4bcb587f9167b80a5824bc5c
---
doc/src/sgml/monitoring.sgml | 110 +++++++++++++-
src/backend/catalog/system_views.sql | 11 ++
src/backend/postmaster/pgstat.c | 13 ++
src/backend/utils/activity/backend_status.c | 19 ++-
src/backend/utils/adt/pgstatfuncs.c | 154 ++++++++++++++++++++
src/include/catalog/pg_proc.dat | 9 ++
src/include/pgstat.h | 1 +
src/include/utils/backend_status.h | 5 +
src/test/regress/expected/rules.out | 8 +
9 files changed, 327 insertions(+), 3 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index bda3eef309..b16952d439 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -435,6 +435,15 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</entry>
</row>
+ <row>
+ <entry><structname>pg_stat_buffers</structname><indexterm><primary>pg_stat_buffers</primary></indexterm></entry>
+ <entry>A row for each IO path for each backend type showing
+ statistics about backend IO operations. See
+ <link linkend="monitoring-pg-stat-buffers-view">
+ <structname>pg_stat_buffers</structname></link> for details.
+ </entry>
+ </row>
+
<row>
<entry><structname>pg_stat_wal</structname><indexterm><primary>pg_stat_wal</primary></indexterm></entry>
<entry>One row only, showing statistics about WAL activity. See
@@ -3613,6 +3622,101 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
</sect2>
+ <sect2 id="monitoring-pg-stat-buffers-view">
+ <title><structname>pg_stat_buffers</structname></title>
+
+ <indexterm>
+ <primary>pg_stat_buffers</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_stat_buffers</structname> view has a row for each backend
+ type for each possible IO path, containing global data for the cluster for
+ that backend and IO path.
+ </para>
+
+ <table id="pg-stat-buffers-view" xreflabel="pg_stat_buffers">
+ <title><structname>pg_stat_buffers</structname> View</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>backend_type</structfield> <type>text</type>
+ </para>
+ <para>
+ Type of backend (e.g. background worker, autovacuum worker).
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>io_path</structfield> <type>text</type>
+ </para>
+ <para>
+ IO path taken (e.g. shared buffers, direct).
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>alloc</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers allocated.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>extend</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers extended.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>fsync</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers fsynced.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>write</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers written.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
+ </para>
+ <para>
+ Time at which these statistics were last reset.
+ </para></entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ </sect2>
+
<sect2 id="monitoring-pg-stat-wal-view">
<title><structname>pg_stat_wal</structname></title>
@@ -5213,8 +5317,10 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
all the counters shown in
the <structname>pg_stat_bgwriter</structname>
view, <literal>archiver</literal> to reset all the counters shown in
- the <structname>pg_stat_archiver</structname> view or <literal>wal</literal>
- to reset all the counters shown in the <structname>pg_stat_wal</structname> view.
+ the <structname>pg_stat_archiver</structname> view, <literal>wal</literal>
+ to reset all the counters shown in the <structname>pg_stat_wal</structname> view,
+ or <literal>buffers</literal> to reset all the counters shown in the
+ <structname>pg_stat_buffers</structname> view.
</para>
<para>
This function is restricted to superusers by default, but other users
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 61b515cdb8..e214d23056 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1076,6 +1076,17 @@ CREATE VIEW pg_stat_bgwriter AS
pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+CREATE VIEW pg_stat_buffers AS
+SELECT
+ b.backend_type,
+ b.io_path,
+ b.alloc,
+ b.extend,
+ b.fsync,
+ b.write,
+ b.stats_reset
+FROM pg_stat_get_buffers() b;
+
CREATE VIEW pg_stat_wal AS
SELECT
w.wal_records,
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index c40b375b9a..9b041cb1b9 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -2925,6 +2925,19 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
rec->tuples_inserted + rec->tuples_updated;
}
+/*
+ * Support function for SQL-callable pgstat* functions. Returns a pointer to
+ * the PgStat_BackendIOPathOps structure tracking IO op statistics for both
+ * exited backends and reset arithmetic.
+ */
+PgStat_BackendIOPathOps *
+pgstat_fetch_exited_backend_buffers(void)
+{
+ backend_read_statsfile();
+
+ return &globalStats.buffers;
+}
+
/* ----------
* pgstat_fetch_stat_dbentry() -
diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index 9e9ca3e5a6..f1997441bc 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -50,7 +50,7 @@ int pgstat_track_activity_query_size = 1024;
PgBackendStatus *MyBEEntry = NULL;
-static PgBackendStatus *BackendStatusArray = NULL;
+PgBackendStatus *BackendStatusArray = NULL;
static char *BackendAppnameBuffer = NULL;
static char *BackendClientHostnameBuffer = NULL;
static char *BackendActivityBuffer = NULL;
@@ -236,6 +236,23 @@ CreateSharedBackendStatus(void)
#endif
}
+const char *
+GetIOPathDesc(IOPath io_path)
+{
+ switch (io_path)
+ {
+ case IOPATH_DIRECT:
+ return "direct";
+ case IOPATH_LOCAL:
+ return "local";
+ case IOPATH_SHARED:
+ return "shared";
+ case IOPATH_STRATEGY:
+ return "strategy";
+ }
+ return "unknown IO path";
+}
+
/*
* Initialize pgstats backend activity state, and set up our on-proc-exit
* hook. Called from InitPostgres and AuxiliaryProcessMain. For auxiliary
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index f529c1561a..9fd7a8cdb9 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1796,6 +1796,160 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
}
+/*
+* When adding a new column to the pg_stat_buffers view, add a new enum
+* value here above COLUMN_LENGTH.
+*/
+enum
+{
+ COLUMN_BACKEND_TYPE,
+ COLUMN_IO_PATH,
+ COLUMN_ALLOCS,
+ COLUMN_EXTENDS,
+ COLUMN_FSYNCS,
+ COLUMN_WRITES,
+ COLUMN_RESET_TIME,
+ COLUMN_LENGTH,
+};
+
+/*
+ * Helper function to get the correct row in the pg_stat_buffers view.
+ */
+static inline Datum *
+get_pg_stat_buffers_row(Datum all_values[BACKEND_NUM_TYPES][IOPATH_NUM_TYPES][COLUMN_LENGTH],
+ BackendType backend_type, IOPath io_path)
+{
+ /*
+ * Caller must pass in a valid BackendType, as there are no rows for
+ * B_INVALID. Subtract 1 to arrive at the correct row.
+ */
+ Assert(backend_type > B_INVALID);
+ return all_values[backend_type - 1][io_path];
+}
+
+Datum
+pg_stat_get_buffers(PG_FUNCTION_ARGS)
+{
+ PgStat_BackendIOPathOps *backend_io_path_ops;
+ PgBackendStatus *beentry;
+ Datum reset_time;
+
+ ReturnSetInfo *rsinfo;
+ TupleDesc tupdesc;
+ Tuplestorestate *tupstore;
+ MemoryContext per_query_ctx;
+ MemoryContext oldcontext;
+
+ Datum all_values[BACKEND_NUM_TYPES][IOPATH_NUM_TYPES][COLUMN_LENGTH];
+ bool all_nulls[BACKEND_NUM_TYPES][IOPATH_NUM_TYPES][COLUMN_LENGTH];
+
+ rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+
+ /* check to see if caller supports us returning a tuplestore */
+ if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("set-valued function called in context that cannot accept a set")));
+ if (!(rsinfo->allowedModes & SFRM_Materialize))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("materialize mode required, but it is not allowed in this context")));
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+ oldcontext = MemoryContextSwitchTo(per_query_ctx);
+ tupstore = tuplestore_begin_heap((bool) (rsinfo->allowedModes & SFRM_Materialize_Random),
+ false, work_mem);
+ rsinfo->returnMode = SFRM_Materialize;
+ rsinfo->setResult = tupstore;
+ rsinfo->setDesc = tupdesc;
+ MemoryContextSwitchTo(oldcontext);
+
+ memset(all_values, 0, sizeof(all_values));
+ memset(all_nulls, 0, sizeof(all_nulls));
+
+ /* Loop through all live backends and count their IO Ops for each IO Path */
+ beentry = BackendStatusArray;
+
+ for (int i = 0; i < MaxBackends + NUM_AUXPROCTYPES; i++, beentry++)
+ {
+ IOOps *io_ops;
+
+ /* Don't count dead backends. They will be added below */
+ Assert(beentry->st_backendType >= B_INVALID);
+ if (beentry->st_procpid == 0 || beentry->st_backendType == B_INVALID)
+ continue;
+
+ io_ops = beentry->io_path_stats;
+
+ for (int io_path = 0; io_path < IOPATH_NUM_TYPES; io_path++)
+ {
+ Datum *row = get_pg_stat_buffers_row(all_values, beentry->st_backendType, io_path);
+
+ /*
+ * COLUMN_RESET_TIME, COLUMN_BACKEND_TYPE, and COLUMN_IO_PATH will
+ * all be set when looping through exited backends array
+ */
+ row[COLUMN_ALLOCS] += pg_atomic_read_u64(&io_ops->allocs);
+ row[COLUMN_EXTENDS] += pg_atomic_read_u64(&io_ops->extends);
+ row[COLUMN_FSYNCS] += pg_atomic_read_u64(&io_ops->fsyncs);
+ row[COLUMN_WRITES] += pg_atomic_read_u64(&io_ops->writes);
+ io_ops++;
+ }
+ }
+
+ /* Add stats from all exited backends */
+ backend_io_path_ops = pgstat_fetch_exited_backend_buffers();
+
+ reset_time = TimestampTzGetDatum(backend_io_path_ops->stat_reset_timestamp);
+
+ for (int backend_type_idx = 0; backend_type_idx < BACKEND_NUM_TYPES; backend_type_idx++)
+ {
+ PgStatIOOps *io_ops = backend_io_path_ops->ops[backend_type_idx].io_path_ops;
+ PgStatIOOps *resets = backend_io_path_ops->resets[backend_type_idx].io_path_ops;
+
+ Datum backend_type_desc =
+ CStringGetTextDatum(GetBackendTypeDesc(backend_type_idx + 1));
+
+ for (int io_path = 0; io_path < IOPATH_NUM_TYPES; io_path++)
+ {
+ /*
+ * all_values does not include rows for the B_INVALID BackendType
+ * and get_pg_stat_buffers_row() expects a valid BackendType, so
+ * add 1 to the backend_type_idx to ensure the correct rows are
+ * returned for the desired BackendType.
+ */
+ Datum *row = get_pg_stat_buffers_row(all_values, backend_type_idx + 1, io_path);
+
+ row[COLUMN_BACKEND_TYPE] = backend_type_desc;
+ row[COLUMN_IO_PATH] = CStringGetTextDatum(GetIOPathDesc(io_path));
+ row[COLUMN_RESET_TIME] = reset_time;
+ row[COLUMN_ALLOCS] += io_ops->allocs - resets->allocs;
+ row[COLUMN_EXTENDS] += io_ops->extends - resets->extends;
+ row[COLUMN_FSYNCS] += io_ops->fsyncs - resets->fsyncs;
+ row[COLUMN_WRITES] += io_ops->writes - resets->writes;
+ io_ops++;
+ resets++;
+ }
+ }
+
+ for (int backend_type_idx = 0; backend_type_idx < BACKEND_NUM_TYPES; backend_type_idx++)
+ {
+ for (int io_path = 0; io_path < IOPATH_NUM_TYPES; io_path++)
+ {
+ Datum *values = all_values[backend_type_idx][io_path];
+ bool *nulls = all_nulls[backend_type_idx][io_path];
+
+ tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+ }
+ }
+
+ return (Datum) 0;
+}
+
/*
* Returns statistics of WAL activity
*/
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 79d787cd26..32253383ba 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5654,6 +5654,15 @@
proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
+{ oid => '8459', descr => 'statistics: counts of all IO operations done to all IO paths by each type of backend.',
+ proname => 'pg_stat_get_buffers', provolatile => 's', proisstrict => 'f',
+ prorows => '52', proretset => 't',
+ proparallel => 'r', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,text,int8,int8,int8,int8,timestamptz}',
+ proargmodes => '{o,o,o,o,o,o,o}',
+ proargnames => '{backend_type,io_path,alloc,extend,fsync,write,stats_reset}',
+ prosrc => 'pg_stat_get_buffers' },
+
{ oid => '1136', descr => 'statistics: information about WAL activity',
proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
proparallel => 'r', prorettype => 'record', proargtypes => '',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 2496d7e071..9e1c635108 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1278,6 +1278,7 @@ extern void pgstat_sum_io_path_ops(PgStatIOOps *dest, IOOps *src);
* generate the pgstat* views.
* ----------
*/
+extern PgStat_BackendIOPathOps *pgstat_fetch_exited_backend_buffers(void);
extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid);
extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
diff --git a/src/include/utils/backend_status.h b/src/include/utils/backend_status.h
index acb4a85eef..4dbf32d48e 100644
--- a/src/include/utils/backend_status.h
+++ b/src/include/utils/backend_status.h
@@ -317,6 +317,7 @@ extern PGDLLIMPORT int pgstat_track_activity_query_size;
* ----------
*/
extern PGDLLIMPORT PgBackendStatus *MyBEEntry;
+extern PGDLLIMPORT PgBackendStatus *BackendStatusArray;
/* ----------
@@ -332,6 +333,10 @@ extern void CreateSharedBackendStatus(void);
* ----------
*/
+/* Utility functions */
+extern const char *GetIOPathDesc(IOPath io_path);
+
+
/* Initialization functions */
extern void pgstat_beinit(void);
extern void pgstat_bestart(void);
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index b58b062b10..5869ce442f 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1828,6 +1828,14 @@ pg_stat_bgwriter| SELECT pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints
pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+pg_stat_buffers| SELECT b.backend_type,
+ b.io_path,
+ b.alloc,
+ b.extend,
+ b.fsync,
+ b.write,
+ b.stats_reset
+ FROM pg_stat_get_buffers() b(backend_type, io_path, alloc, extend, fsync, write, stats_reset);
pg_stat_database| SELECT d.oid AS datid,
d.datname,
CASE
--
2.32.0
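Per the commit message of the patch above, each cell of pg_stat_buffers is one IO-op count for one backend type and IO path. Reading the example cell it mentions, "shared buffers written by checkpointer since the last stats reset", could look like this (a sketch against the view as defined in the patch; requires a server with the patch applied):

```sql
-- Hypothetical query against the pg_stat_buffers view added above:
-- number of shared buffers written by the checkpointer since the
-- last stats reset, plus the reset timestamp for context.
SELECT backend_type, io_path, write, stats_reset
FROM pg_stat_buffers
WHERE backend_type = 'checkpointer'
  AND io_path = 'shared';
```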
v17-0003-Send-IO-operations-to-stats-collector.patch
From b13037ff82a9871b37296eaccc39c7574d38d20f Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 24 Nov 2021 11:16:55 -0500
Subject: [PATCH v17 3/7] Send IO operations to stats collector
On exit, backends send the IO operations they have done on all IO Paths
to the stats collector. The stats collector adds these counts to its
existing counts stored in a global data structure it maintains and
persists.
PgStatIOOps contains the same information as backend_status.h's IOOps;
however, IOOps' members must be atomics, whereas the stats collector has
no such requirement.
---
src/backend/postmaster/pgstat.c | 103 +++++++++++++++++++++++++++++++-
src/include/miscadmin.h | 2 +
src/include/pgstat.h | 49 +++++++++++++++
3 files changed, 151 insertions(+), 3 deletions(-)
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 7264d2c727..05097fc7bd 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -126,9 +126,12 @@ char *pgstat_stat_filename = NULL;
char *pgstat_stat_tmpname = NULL;
/*
- * BgWriter and WAL global statistics counters.
- * Stored directly in a stats message structure so they can be sent
- * without needing to copy things around. We assume these init to zeroes.
+ * BgWriter, Checkpointer, WAL, and I/O global statistics counters. I/O global
+ * statistics on various IO ops are tracked in PgBackendStatus while a backend
+ * is alive and then sent to stats collector before a backend exits in a
+ * PgStat_MsgIOPathOps.
+ * All others are stored directly in a stats message structure so they can be
+ * sent without needing to copy things around. We assume these init to zeroes.
*/
PgStat_MsgBgWriter PendingBgWriterStats;
PgStat_MsgCheckpointer PendingCheckpointerStats;
@@ -369,6 +372,7 @@ static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
static void pgstat_recv_checkpointer(PgStat_MsgCheckpointer *msg, int len);
+static void pgstat_recv_io_path_ops(PgStat_MsgIOPathOps *msg, int len);
static void pgstat_recv_wal(PgStat_MsgWal *msg, int len);
static void pgstat_recv_slru(PgStat_MsgSLRU *msg, int len);
static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
@@ -3152,6 +3156,14 @@ pgstat_shutdown_hook(int code, Datum arg)
{
Assert(!pgstat_is_shutdown);
+ /*
+ * Only need to send stats on IO Ops for IO Paths when a process exits.
+ * Users requiring IO Ops for both live and exited backends can read from
+ * live backends' PgBackendStatus and sum this with totals from exited
+ * backends persisted by the stats collector.
+ */
+ pgstat_send_buffers();
+
/*
* If we got as far as discovering our own database ID, we can report what
* we did to the collector. Otherwise, we'd be sending an invalid
@@ -3301,6 +3313,37 @@ pgstat_send_bgwriter(void)
MemSet(&PendingBgWriterStats, 0, sizeof(PendingBgWriterStats));
}
+/*
+ * Before exiting, a backend sends its IO op statistics to the collector so
+ * that they may be persisted.
+ */
+void
+pgstat_send_buffers(void)
+{
+ PgStat_MsgIOPathOps msg;
+
+ PgBackendStatus *beentry = MyBEEntry;
+
+ /*
+ * Though some backends with type B_INVALID (such as the single-user mode
+ * process) do initialize and increment IO operations stats, there is no
+ * spot in the array of IO operations for backends of type B_INVALID. As
+ * such, do not send these to the stats collector.
+ */
+ if (!beentry || beentry->st_backendType == B_INVALID)
+ return;
+
+ memset(&msg, 0, sizeof(msg));
+ msg.backend_type = beentry->st_backendType;
+
+ pgstat_sum_io_path_ops(msg.iop.io_path_ops,
+ (IOOps *) &beentry->io_path_stats);
+
+ pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_IO_PATH_OPS);
+ pgstat_send(&msg, sizeof(msg));
+}
+
+
/* ----------
* pgstat_send_checkpointer() -
*
@@ -3483,6 +3526,28 @@ pgstat_send_subscription_purge(PgStat_MsgSubscriptionPurge *msg)
pgstat_send(msg, len);
}
+/*
+ * Helper function to sum all live IO Op stats for all IO Paths (e.g. shared,
+ * local) to those in the equivalent stats structure for exited backends. Note
+ * that this adds and doesn't set, so the destination stats structure should be
+ * zeroed out by the caller initially. This would commonly be used to transfer
+ * all IO Op stats for all IO Paths for a particular backend type to the
+ * pgstats structure.
+ */
+void
+pgstat_sum_io_path_ops(PgStatIOOps *dest, IOOps *src)
+{
+ for (int io_path = 0; io_path < IOPATH_NUM_TYPES; io_path++)
+ {
+ dest->allocs += pg_atomic_read_u64(&src->allocs);
+ dest->extends += pg_atomic_read_u64(&src->extends);
+ dest->fsyncs += pg_atomic_read_u64(&src->fsyncs);
+ dest->writes += pg_atomic_read_u64(&src->writes);
+ dest++;
+ src++;
+ }
+}
+
/* ----------
* PgstatCollectorMain() -
*
@@ -3692,6 +3757,10 @@ PgstatCollectorMain(int argc, char *argv[])
pgstat_recv_checkpointer(&msg.msg_checkpointer, len);
break;
+ case PGSTAT_MTYPE_IO_PATH_OPS:
+ pgstat_recv_io_path_ops(&msg.msg_io_path_ops, len);
+ break;
+
case PGSTAT_MTYPE_WAL:
pgstat_recv_wal(&msg.msg_wal, len);
break;
@@ -5813,6 +5882,34 @@ pgstat_recv_checkpointer(PgStat_MsgCheckpointer *msg, int len)
globalStats.checkpointer.buf_fsync_backend += msg->m_buf_fsync_backend;
}
+static void
+pgstat_recv_io_path_ops(PgStat_MsgIOPathOps *msg, int len)
+{
+ PgStatIOOps *src_io_path_ops;
+ PgStatIOOps *dest_io_path_ops;
+
+ /*
+ * Subtract 1 from message's BackendType to get a valid index into the
+ * array of IO Ops which does not include an entry for B_INVALID
+ * BackendType.
+ */
+ Assert(msg->backend_type > B_INVALID);
+
+ src_io_path_ops = msg->iop.io_path_ops;
+ dest_io_path_ops = globalStats.buffers.ops[msg->backend_type - 1].io_path_ops;
+
+ for (int io_path = 0; io_path < IOPATH_NUM_TYPES; io_path++)
+ {
+ PgStatIOOps *src = &src_io_path_ops[io_path];
+ PgStatIOOps *dest = &dest_io_path_ops[io_path];
+
+ dest->allocs += src->allocs;
+ dest->extends += src->extends;
+ dest->fsyncs += src->fsyncs;
+ dest->writes += src->writes;
+ }
+}
+
/* ----------
* pgstat_recv_wal() -
*
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 90a3016065..662170c72e 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -338,6 +338,8 @@ typedef enum BackendType
B_LOGGER,
} BackendType;
+#define BACKEND_NUM_TYPES B_LOGGER
+
extern BackendType MyBackendType;
extern const char *GetBackendTypeDesc(BackendType backendType);
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 5b51b58e5a..f99be84db6 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -73,6 +73,7 @@ typedef enum StatMsgType
PGSTAT_MTYPE_ARCHIVER,
PGSTAT_MTYPE_BGWRITER,
PGSTAT_MTYPE_CHECKPOINTER,
+ PGSTAT_MTYPE_IO_PATH_OPS,
PGSTAT_MTYPE_WAL,
PGSTAT_MTYPE_SLRU,
PGSTAT_MTYPE_FUNCSTAT,
@@ -335,6 +336,50 @@ typedef struct PgStat_MsgDropdb
} PgStat_MsgDropdb;
+/*
+ * Structure for counting all types of IO ops in the stats collector
+ */
+typedef struct PgStatIOOps
+{
+ PgStat_Counter allocs;
+ PgStat_Counter extends;
+ PgStat_Counter fsyncs;
+ PgStat_Counter writes;
+} PgStatIOOps;
+
+/*
+ * Structure for counting all IO Ops on all types of IO Paths.
+ */
+typedef struct PgStatIOPathOps
+{
+ PgStatIOOps io_path_ops[IOPATH_NUM_TYPES];
+} PgStatIOPathOps;
+
+/*
+ * Sent by a backend to the stats collector to report all IO Ops for all IO
+ * Paths for a given type of a backend. This will happen when the backend exits.
+ */
+typedef struct PgStat_MsgIOPathOps
+{
+ PgStat_MsgHdr m_hdr;
+
+ BackendType backend_type;
+ PgStatIOPathOps iop;
+} PgStat_MsgIOPathOps;
+
+/*
+ * Structure used by stats collector to keep track of all types of exited
+ * backends' IO Ops for all IO Paths as well as all stats from live backends at
+ * the time of stats reset. resets is populated using a reset message sent to
+ * the stats collector. Be sure to subtract 1 from BackendType when accessing
+ * the array "ops" or "resets", as they do not contain entries for B_INVALID
+ * BackendType.
+ */
+typedef struct PgStat_BackendIOPathOps
+{
+ PgStatIOPathOps ops[BACKEND_NUM_TYPES];
+} PgStat_BackendIOPathOps;
+
/* ----------
* PgStat_MsgResetcounter Sent by the backend to tell the collector
* to reset counters
@@ -756,6 +801,7 @@ typedef union PgStat_Msg
PgStat_MsgArchiver msg_archiver;
PgStat_MsgBgWriter msg_bgwriter;
PgStat_MsgCheckpointer msg_checkpointer;
+ PgStat_MsgIOPathOps msg_io_path_ops;
PgStat_MsgWal msg_wal;
PgStat_MsgSLRU msg_slru;
PgStat_MsgFuncstat msg_funcstat;
@@ -939,6 +985,7 @@ typedef struct PgStat_GlobalStats
PgStat_CheckpointerStats checkpointer;
PgStat_BgWriterStats bgwriter;
+ PgStat_BackendIOPathOps buffers;
} PgStat_GlobalStats;
/*
@@ -1215,8 +1262,10 @@ extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
extern void pgstat_send_archiver(const char *xlog, bool failed);
extern void pgstat_send_bgwriter(void);
+extern void pgstat_send_buffers(void);
extern void pgstat_send_checkpointer(void);
extern void pgstat_send_wal(bool force);
+extern void pgstat_sum_io_path_ops(PgStatIOOps *dest, IOOps *src);
/* ----------
* Support functions for the SQL-callable functions to
--
2.32.0
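To make the collector-side bookkeeping above concrete, here is a minimal, self-contained sketch of what `pgstat_recv_*` does with an incoming message: fold one exiting backend's per-IO-Path op counts into the collector's running totals. The type and function names (`IOOpsCounts`, `sum_io_path_ops`, `N_PATHS`) are made up for illustration; the real code indexes `globalStats.buffers.ops` by `backend_type - 1`, since the array has no slot for `B_INVALID`.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for PgStatIOOps: one counter per IO Op. */
typedef struct
{
	uint64_t	allocs;
	uint64_t	extends;
	uint64_t	fsyncs;
	uint64_t	writes;
} IOOpsCounts;

/* Stand-in for IOPATH_NUM_TYPES (direct, local, shared, strategy). */
#define N_PATHS 4

/*
 * Analogue of the message-receipt loop above: add one backend's per-path
 * counts into the collector's totals.  The caller is assumed to have
 * already selected the destination row for this backend type, i.e. to
 * have done the backend_type - 1 mapping.
 */
static void
sum_io_path_ops(IOOpsCounts dest[N_PATHS], const IOOpsCounts src[N_PATHS])
{
	for (int path = 0; path < N_PATHS; path++)
	{
		dest[path].allocs += src[path].allocs;
		dest[path].extends += src[path].extends;
		dest[path].fsyncs += src[path].fsyncs;
		dest[path].writes += src[path].writes;
	}
}
```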
Attachment: v17-0004-Add-buffers-to-pgstat_reset_shared_counters.patch (application/octet-stream)
From f972ea87270feaed464a74fb6541ac04b4fc7d98 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 24 Nov 2021 11:39:48 -0500
Subject: [PATCH v17 4/7] Add "buffers" to pgstat_reset_shared_counters
Backends count IO operations for various IO paths in their PgBackendStatus.
Upon exit, they send these counts to the stats collector. Prior to this commit,
these IO Ops stats would have been reset when the target was "bgwriter".
With this commit, target "bgwriter" will no longer reset the IO operations
stats; they can instead be reset with the new target, "buffers".
---
doc/src/sgml/monitoring.sgml | 2 +-
src/backend/postmaster/pgstat.c | 83 +++++++++++++++++++--
src/backend/utils/activity/backend_status.c | 29 +++++++
src/include/pgstat.h | 8 +-
src/include/utils/backend_status.h | 2 +
5 files changed, 117 insertions(+), 7 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 62f2a3332b..bda3eef309 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3604,7 +3604,7 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
<structfield>stats_reset</structfield> <type>timestamp with time zone</type>
</para>
<para>
- Time at which these statistics were last reset
+ Time at which these statistics were last reset.
</para></entry>
</row>
</tbody>
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 05097fc7bd..c40b375b9a 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1512,6 +1512,35 @@ pgstat_reset_counters(void)
pgstat_send(&msg, sizeof(msg));
}
+/*
+ * Helper function to collect and send live backends' current IO operations
+ * stats counters when a stats reset is initiated so that they may be deducted
+ * from future totals.
+ */
+static void
+pgstat_send_buffers_reset(PgStat_MsgResetsharedcounter *msg)
+{
+ PgStatIOPathOps ops[BACKEND_NUM_TYPES];
+
+ memset(ops, 0, sizeof(ops));
+ pgstat_report_live_backend_io_path_ops(ops);
+
+ /*
+ * Iterate through the array of IO Ops for all IO Paths for each
+ * BackendType. Because the array does not include a spot for BackendType
+ * B_INVALID, add 1 to the index when setting backend_type so that there is
+ * no confusion as to the BackendType with which this reset message
+ * corresponds.
+ */
+ for (int backend_type_idx = 0; backend_type_idx < BACKEND_NUM_TYPES; backend_type_idx++)
+ {
+ msg->m_backend_resets.backend_type = backend_type_idx + 1;
+ memcpy(&msg->m_backend_resets.iop, &ops[backend_type_idx],
+ sizeof(msg->m_backend_resets.iop));
+ pgstat_send(msg, sizeof(PgStat_MsgResetsharedcounter));
+ }
+}
+
/* ----------
* pgstat_reset_shared_counters() -
*
@@ -1529,7 +1558,14 @@ pgstat_reset_shared_counters(const char *target)
if (pgStatSock == PGINVALID_SOCKET)
return;
- if (strcmp(target, "archiver") == 0)
+ pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
+ if (strcmp(target, "buffers") == 0)
+ {
+ msg.m_resettarget = RESET_BUFFERS;
+ pgstat_send_buffers_reset(&msg);
+ return;
+ }
+ else if (strcmp(target, "archiver") == 0)
msg.m_resettarget = RESET_ARCHIVER;
else if (strcmp(target, "bgwriter") == 0)
msg.m_resettarget = RESET_BGWRITER;
@@ -1539,9 +1575,10 @@ pgstat_reset_shared_counters(const char *target)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("unrecognized reset target: \"%s\"", target),
- errhint("Target must be \"archiver\", \"bgwriter\", or \"wal\".")));
+ errhint(
+ "Target must be \"archiver\", \"bgwriter\", \"buffers\", or \"wal\".")));
+
- pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
pgstat_send(&msg, sizeof(msg));
}
@@ -4418,6 +4455,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
*/
ts = GetCurrentTimestamp();
globalStats.bgwriter.stat_reset_timestamp = ts;
+ globalStats.buffers.stat_reset_timestamp = ts;
archiverStats.stat_reset_timestamp = ts;
walStats.stat_reset_timestamp = ts;
@@ -5583,10 +5621,45 @@ pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
{
if (msg->m_resettarget == RESET_BGWRITER)
{
- /* Reset the global, bgwriter and checkpointer statistics for the cluster. */
- memset(&globalStats, 0, sizeof(globalStats));
+ /*
+ * Reset the global bgwriter and checkpointer statistics for the
+ * cluster.
+ */
+ memset(&globalStats.checkpointer, 0, sizeof(globalStats.checkpointer));
+ memset(&globalStats.bgwriter, 0, sizeof(globalStats.bgwriter));
globalStats.bgwriter.stat_reset_timestamp = GetCurrentTimestamp();
}
+ else if (msg->m_resettarget == RESET_BUFFERS)
+ {
+ /*
+ * Because the stats collector cannot write to live backends'
+ * PgBackendStatuses, it maintains an array of "resets". The reset
+ * message contains the current values of these counters for live
+ * backends. The stats collector saves these in its "resets" array,
+ * then zeroes out the exited backends' saved IO op counters. This is
+ * required to calculate an accurate total for each IO op counter post
+ * reset.
+ */
+ BackendType backend_type = msg->m_backend_resets.backend_type;
+
+ /*
+ * Though globalStats.buffers only needs to be reset once, doing so
+ * for every message is less brittle and the extra cost is irrelevant
+ * given how often stats are reset.
+ */
+ memset(&globalStats.buffers.ops, 0, sizeof(globalStats.buffers.ops));
+ globalStats.buffers.stat_reset_timestamp = GetCurrentTimestamp();
+
+ /*
+ * Subtract 1 from backend_type as the sender sent a valid BackendType
+ * but the resets array does not contain an entry for B_INVALID
+ * BackendType.
+ */
+ Assert(backend_type > B_INVALID);
+ memcpy(&globalStats.buffers.resets[backend_type - 1],
+ &msg->m_backend_resets.iop.io_path_ops,
+ sizeof(msg->m_backend_resets.iop.io_path_ops));
+ }
else if (msg->m_resettarget == RESET_ARCHIVER)
{
/* Reset the archiver statistics for the cluster. */
diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index 413cc605f8..9e9ca3e5a6 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -630,6 +630,35 @@ pgstat_report_activity(BackendState state, const char *cmd_str)
PGSTAT_END_WRITE_ACTIVITY(beentry);
}
+/*
+ * Iterate through BackendStatusArray and capture live backends' stats on IO
+ * Ops for all IO Paths, adding them to that backend type's member of the
+ * backend_io_path_ops structure.
+ */
+void
+pgstat_report_live_backend_io_path_ops(PgStatIOPathOps *backend_io_path_ops)
+{
+ PgBackendStatus *beentry = BackendStatusArray;
+
+ /*
+ * Loop through live backends and capture reset values
+ */
+ for (int i = 0; i < MaxBackends + NUM_AUXPROCTYPES; i++, beentry++)
+ {
+ /* Don't count dead backends or those with type B_INVALID. */
+ Assert(beentry->st_backendType >= B_INVALID);
+ if (beentry->st_procpid == 0 || beentry->st_backendType == B_INVALID)
+ continue;
+
+ /*
+ * Subtract 1 from the BackendType to arrive at a valid index in the
+ * array, as it does not contain a spot for B_INVALID BackendType.
+ */
+ pgstat_sum_io_path_ops(backend_io_path_ops[beentry->st_backendType - 1].io_path_ops,
+ (IOOps *) beentry->io_path_stats);
+ }
+}
+
/* --------
* pgstat_report_query_id() -
*
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index f99be84db6..2496d7e071 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -142,6 +142,7 @@ typedef enum PgStat_Shared_Reset_Target
{
RESET_ARCHIVER,
RESET_BGWRITER,
+ RESET_BUFFERS,
RESET_WAL
} PgStat_Shared_Reset_Target;
@@ -357,7 +358,8 @@ typedef struct PgStatIOPathOps
/*
* Sent by a backend to the stats collector to report all IO Ops for all IO
- * Paths for a given type of a backend. This will happen when the backend exits.
+ * Paths for a given type of a backend. This will happen when the backend exits
+ * or when stats are reset.
*/
typedef struct PgStat_MsgIOPathOps
{
@@ -377,9 +379,12 @@ typedef struct PgStat_MsgIOPathOps
*/
typedef struct PgStat_BackendIOPathOps
{
+ TimestampTz stat_reset_timestamp;
PgStatIOPathOps ops[BACKEND_NUM_TYPES];
+ PgStatIOPathOps resets[BACKEND_NUM_TYPES];
} PgStat_BackendIOPathOps;
+
/* ----------
* PgStat_MsgResetcounter Sent by the backend to tell the collector
* to reset counters
@@ -400,6 +405,7 @@ typedef struct PgStat_MsgResetsharedcounter
{
PgStat_MsgHdr m_hdr;
PgStat_Shared_Reset_Target m_resettarget;
+ PgStat_MsgIOPathOps m_backend_resets;
} PgStat_MsgResetsharedcounter;
/* ----------
diff --git a/src/include/utils/backend_status.h b/src/include/utils/backend_status.h
index 2e5e949453..acb4a85eef 100644
--- a/src/include/utils/backend_status.h
+++ b/src/include/utils/backend_status.h
@@ -339,6 +339,7 @@ extern void pgstat_bestart(void);
extern void pgstat_clear_backend_activity_snapshot(void);
/* Activity reporting functions */
+typedef struct PgStatIOPathOps PgStatIOPathOps;
static inline void
pgstat_inc_ioop(IOOp io_op, IOPath io_path)
@@ -366,6 +367,7 @@ pgstat_inc_ioop(IOOp io_op, IOPath io_path)
}
}
extern void pgstat_report_activity(BackendState state, const char *cmd_str);
+extern void pgstat_report_live_backend_io_path_ops(PgStatIOPathOps *backend_io_path_ops);
extern void pgstat_report_query_id(uint64 query_id, bool force);
extern void pgstat_report_tempfile(size_t filesize);
extern void pgstat_report_appname(const char *appname);
--
2.32.0
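The reset bookkeeping in the patch above can be reduced to a one-counter model: the collector cannot zero live backends' `PgBackendStatus` counters, so at reset time it records a snapshot of what the live backends have already counted and deducts it from future displayed totals. This is a simplified sketch with invented names (`CollectorCounter`, `counter_reset`, `counter_display`), not the real collector code, which keeps one such pair per BackendType and IO Path.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Model of the collector's state for a single counter: the sum reported
 * by already-exited backends, plus the snapshot of live backends' counts
 * taken at the last reset (the "resets" array in the patch).
 */
typedef struct
{
	uint64_t	exited_total;	/* ops reported by exited backends */
	uint64_t	reset_snapshot; /* live backends' count at last reset */
} CollectorCounter;

/*
 * On reset: zero the exited-backend total and remember what the live
 * backends had counted so far, so it can be deducted later.
 */
static void
counter_reset(CollectorCounter *c, uint64_t live_now)
{
	c->exited_total = 0;
	c->reset_snapshot = live_now;
}

/* What a stats view would display: everything since the last reset. */
static uint64_t
counter_display(const CollectorCounter *c, uint64_t live_now)
{
	return c->exited_total + live_now - c->reset_snapshot;
}
```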
Attachment: v17-0002-Add-IO-operation-counters-to-PgBackendStatus.patch (application/octet-stream)
From b0e193cfa08f0b8cf1be929f26fe38f06a39aeae Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 24 Nov 2021 10:32:56 -0500
Subject: [PATCH v17 2/7] Add IO operation counters to PgBackendStatus
Add an array of counters in PgBackendStatus which count the buffers
allocated, extended, fsynced, and written by a given backend. Each "IO
Op" (alloc, fsync, extend, write) is counted per "IO Path" (direct,
local, shared, or strategy). "local" and "shared" IO Path counters count
operations on local and shared buffers. The "strategy" IO Path counts
buffers alloc'd/written/read/fsync'd as part of a BufferAccessStrategy.
The "direct" IO Path counts blocks of IO which are read, written, or
fsync'd using smgrwrite/extend/immedsync directly (as opposed to through
[Local]BufferAlloc()).
With this commit, all backends increment a counter in their
PgBackendStatus when performing an IO operation. This is in preparation
for future commits which will persist these stats upon backend exit and
use the counters to provide observability of database IO operations.
Note that this commit does not add code to increment the "direct" path.
A separate proposed patch [1] which would add wrappers for smgrwrite(),
smgrextend(), and smgrimmedsync() would provide a good location to call
pgstat_inc_ioop() for unbuffered IO and avoid regressions for future
users of these functions.
[1] https://www.postgresql.org/message-id/CAAKRu_aw72w70X1P%3Dba20K8iGUvSkyz7Yk03wPPh3f9WgmcJ3g%40mail.gmail.com
---
src/backend/postmaster/checkpointer.c | 1 +
src/backend/storage/buffer/bufmgr.c | 46 +++++++++++---
src/backend/storage/buffer/freelist.c | 22 ++++++-
src/backend/storage/buffer/localbuf.c | 3 +
src/backend/storage/sync/sync.c | 1 +
src/backend/utils/activity/backend_status.c | 9 +++
src/backend/utils/init/postinit.c | 6 +-
src/include/storage/buf_internals.h | 4 +-
src/include/utils/backend_status.h | 69 +++++++++++++++++++++
9 files changed, 142 insertions(+), 19 deletions(-)
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 25a18b7a14..8440b2b802 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -1101,6 +1101,7 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
if (!AmBackgroundWriterProcess())
CheckpointerShmem->num_backend_fsync++;
LWLockRelease(CheckpointerCommLock);
+ pgstat_inc_ioop(IOOP_FSYNC, IOPATH_SHARED);
return false;
}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 08ebabfe96..6926fc5742 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -480,7 +480,7 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
BlockNumber blockNum,
BufferAccessStrategy strategy,
bool *foundPtr);
-static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+static void FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOPath iopath);
static void FindAndDropRelFileNodeBuffers(RelFileNode rnode,
ForkNumber forkNum,
BlockNumber nForkBlock,
@@ -972,6 +972,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (isExtend)
{
+ pgstat_inc_ioop(IOOP_EXTEND, IOPATH_SHARED);
/* new buffers are zero-filled */
MemSet((char *) bufBlock, 0, BLCKSZ);
/* don't set checksum for all-zero page */
@@ -1172,6 +1173,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
/* Loop here in case we have to try another victim buffer */
for (;;)
{
+ bool from_ring;
+
/*
* Ensure, while the spinlock's not yet held, that there's a free
* refcount entry.
@@ -1182,7 +1185,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* Select a victim buffer. The buffer is returned with its header
* spinlock still held!
*/
- buf = StrategyGetBuffer(strategy, &buf_state);
+ buf = StrategyGetBuffer(strategy, &buf_state, &from_ring);
Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
@@ -1219,6 +1222,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf),
LW_SHARED))
{
+ IOPath iopath;
+
/*
* If using a nondefault strategy, and writing the buffer
* would require a WAL flush, let the strategy decide whether
@@ -1236,7 +1241,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
UnlockBufHdr(buf, buf_state);
if (XLogNeedsFlush(lsn) &&
- StrategyRejectBuffer(strategy, buf))
+ StrategyRejectBuffer(strategy, buf, &from_ring))
{
/* Drop lock/pin and loop around for another buffer */
LWLockRelease(BufferDescriptorGetContentLock(buf));
@@ -1245,13 +1250,26 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
}
}
+ /*
+ * When a strategy is in use, if the dirty buffer was selected
+ * from the strategy ring and we did not bother checking the
+ * freelist or doing a clock sweep to look for a clean shared
+ * buffer to use, the write will be counted as a strategy
+ * write. However, if the dirty buffer was obtained from the
+ * freelist or a clock sweep, it is counted as a regular
+ * write. When a strategy is not in use, at this point, the
+ * write can only be a "regular" write of a dirty buffer.
+ */
+
+ iopath = from_ring ? IOPATH_STRATEGY : IOPATH_SHARED;
+
/* OK, do the I/O */
TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(forkNum, blockNum,
smgr->smgr_rnode.node.spcNode,
smgr->smgr_rnode.node.dbNode,
smgr->smgr_rnode.node.relNode);
- FlushBuffer(buf, NULL);
+ FlushBuffer(buf, NULL, iopath);
LWLockRelease(BufferDescriptorGetContentLock(buf));
ScheduleBufferTagForWriteback(&BackendWritebackContext,
@@ -2552,10 +2570,11 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
* Pin it, share-lock it, write it. (FlushBuffer will do nothing if the
* buffer is clean by the time we've locked it.)
*/
+
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOPATH_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
@@ -2803,9 +2822,12 @@ BufferGetTag(Buffer buffer, RelFileNode *rnode, ForkNumber *forknum,
*
* If the caller has an smgr reference for the buffer's relation, pass it
* as the second parameter. If not, pass NULL.
+ *
+ * IOPath will always be IOPATH_SHARED except when a buffer access strategy is
+ * used and the buffer being flushed is a buffer from the strategy ring.
*/
static void
-FlushBuffer(BufferDesc *buf, SMgrRelation reln)
+FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOPath iopath)
{
XLogRecPtr recptr;
ErrorContextCallback errcallback;
@@ -2897,6 +2919,8 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
bufToWrite,
false);
+ pgstat_inc_ioop(IOOP_WRITE, iopath);
+
if (track_io_timing)
{
INSTR_TIME_SET_CURRENT(io_time);
@@ -3544,6 +3568,8 @@ FlushRelationBuffers(Relation rel)
localpage,
false);
+ pgstat_inc_ioop(IOOP_WRITE, IOPATH_LOCAL);
+
buf_state &= ~(BM_DIRTY | BM_JUST_DIRTIED);
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
@@ -3579,7 +3605,7 @@ FlushRelationBuffers(Relation rel)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, RelationGetSmgr(rel));
+ FlushBuffer(bufHdr, RelationGetSmgr(rel), IOPATH_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
@@ -3675,7 +3701,7 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, srelent->srel);
+ FlushBuffer(bufHdr, srelent->srel, IOPATH_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
@@ -3731,7 +3757,7 @@ FlushDatabaseBuffers(Oid dbid)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOPATH_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
@@ -3758,7 +3784,7 @@ FlushOneBuffer(Buffer buffer)
Assert(LWLockHeldByMe(BufferDescriptorGetContentLock(bufHdr)));
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOPATH_SHARED);
}
/*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 6be80476db..45d73995b2 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -19,6 +19,7 @@
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/proc.h"
+#include "utils/backend_status.h"
#define INT_ACCESS_ONCE(var) ((int)(*((volatile int *)&(var))))
@@ -198,7 +199,7 @@ have_free_buffer(void)
* return the buffer with the buffer header spinlock still held.
*/
BufferDesc *
-StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
+StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_ring)
{
BufferDesc *buf;
int bgwprocno;
@@ -212,7 +213,8 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
if (strategy != NULL)
{
buf = GetBufferFromRing(strategy, buf_state);
- if (buf != NULL)
+ *from_ring = buf != NULL;
+ if (*from_ring)
return buf;
}
@@ -247,6 +249,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
* the rate of buffer consumption. Note that buffers recycled by a
* strategy object are intentionally not counted here.
*/
+ pgstat_inc_ioop(IOOP_ALLOC, IOPATH_SHARED);
pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);
/*
@@ -683,8 +686,14 @@ AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
* if this buffer should be written and re-used.
*/
bool
-StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
+StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool *from_ring)
{
+ /*
+ * Start by assuming that we will use the dirty buffer selected by
+ * StrategyGetBuffer().
+ */
+ *from_ring = true;
+
/* We only do this in bulkread mode */
if (strategy->btype != BAS_BULKREAD)
return false;
@@ -700,5 +709,12 @@ StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
*/
strategy->buffers[strategy->current] = InvalidBuffer;
+ /*
+ * Since we will not be writing out a dirty buffer from the ring, set
+ * from_ring to false so that the caller does not count this write as a
+ * "strategy write" and can do proper bookkeeping.
+ */
+ *from_ring = false;
+
return true;
}
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 04b3558ea3..f396a2b68d 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -20,6 +20,7 @@
#include "executor/instrument.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
+#include "utils/backend_status.h"
#include "utils/guc.h"
#include "utils/memutils.h"
#include "utils/resowner_private.h"
@@ -226,6 +227,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
localpage,
false);
+ pgstat_inc_ioop(IOOP_WRITE, IOPATH_LOCAL);
+
/* Mark not-dirty now in case we error out below */
buf_state &= ~BM_DIRTY;
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index d4083e8a56..3be06d5d5a 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -420,6 +420,7 @@ ProcessSyncRequests(void)
total_elapsed += elapsed;
processed++;
+ pgstat_inc_ioop(IOOP_FSYNC, IOPATH_SHARED);
if (log_checkpoints)
elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f ms",
processed,
diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index 7229598822..413cc605f8 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -399,6 +399,15 @@ pgstat_bestart(void)
lbeentry.st_progress_command = PROGRESS_COMMAND_INVALID;
lbeentry.st_progress_command_target = InvalidOid;
lbeentry.st_query_id = UINT64CONST(0);
+ for (int io_path = 0; io_path < IOPATH_NUM_TYPES; io_path++)
+ {
+ IOOps *io_ops = &lbeentry.io_path_stats[io_path];
+
+ pg_atomic_init_u64(&io_ops->allocs, 0);
+ pg_atomic_init_u64(&io_ops->extends, 0);
+ pg_atomic_init_u64(&io_ops->fsyncs, 0);
+ pg_atomic_init_u64(&io_ops->writes, 0);
+ }
/*
* we don't zero st_progress_param here to save cycles; nobody should
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 646126edee..93f1b4bcfc 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -623,6 +623,7 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
RegisterTimeout(CLIENT_CONNECTION_CHECK_TIMEOUT, ClientCheckTimeoutHandler);
}
+ pgstat_beinit();
/*
* Initialize local process's access to XLOG.
*/
@@ -649,6 +650,7 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
*/
CreateAuxProcessResourceOwner();
+ pgstat_bestart();
StartupXLOG();
/* Release (and warn about) any buffer pins leaked in StartupXLOG */
ReleaseAuxProcessResources(true);
@@ -676,7 +678,6 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
EnablePortalManager();
/* Initialize status reporting */
- pgstat_beinit();
/*
* Load relcache entries for the shared system catalogs. This must create
@@ -914,10 +915,7 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
* transaction we started before returning.
*/
if (!bootstrap)
- {
- pgstat_bestart();
CommitTransactionCommand();
- }
return;
}
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 33fcaf5c9a..7e385135db 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -310,10 +310,10 @@ extern void ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *
/* freelist.c */
extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
- uint32 *buf_state);
+ uint32 *buf_state, bool *from_ring);
extern void StrategyFreeBuffer(BufferDesc *buf);
extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
- BufferDesc *buf);
+ BufferDesc *buf, bool *from_ring);
extern int StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
extern void StrategyNotifyBgWriter(int bgwprocno);
diff --git a/src/include/utils/backend_status.h b/src/include/utils/backend_status.h
index 8042b817df..2e5e949453 100644
--- a/src/include/utils/backend_status.h
+++ b/src/include/utils/backend_status.h
@@ -13,6 +13,7 @@
#include "datatype/timestamp.h"
#include "libpq/pqcomm.h"
#include "miscadmin.h" /* for BackendType */
+#include "port/atomics.h"
#include "utils/backend_progress.h"
@@ -31,12 +32,48 @@ typedef enum BackendState
STATE_DISABLED
} BackendState;
+/* ----------
+ * IO Stats reporting utility types
+ * ----------
+ */
+
+typedef enum IOOp
+{
+ IOOP_ALLOC,
+ IOOP_EXTEND,
+ IOOP_FSYNC,
+ IOOP_WRITE,
+} IOOp;
+
+#define IOOP_NUM_TYPES (IOOP_WRITE + 1)
+
+typedef enum IOPath
+{
+ IOPATH_DIRECT,
+ IOPATH_LOCAL,
+ IOPATH_SHARED,
+ IOPATH_STRATEGY,
+} IOPath;
+
+#define IOPATH_NUM_TYPES (IOPATH_STRATEGY + 1)
+
/* ----------
* Shared-memory data structures
* ----------
*/
+/*
+ * Structure for counting all types of IOOps for a live backend.
+ */
+typedef struct IOOps
+{
+ pg_atomic_uint64 allocs;
+ pg_atomic_uint64 extends;
+ pg_atomic_uint64 fsyncs;
+ pg_atomic_uint64 writes;
+} IOOps;
+
/*
* PgBackendSSLStatus
*
@@ -168,6 +205,12 @@ typedef struct PgBackendStatus
/* query identifier, optionally computed using post_parse_analyze_hook */
uint64 st_query_id;
+
+ /*
+ * Stats on all IO Ops for all IO Paths for this backend. These should be
+ * incremented whenever an IO Operation is performed.
+ */
+ IOOps io_path_stats[IOPATH_NUM_TYPES];
} PgBackendStatus;
@@ -296,6 +339,32 @@ extern void pgstat_bestart(void);
extern void pgstat_clear_backend_activity_snapshot(void);
/* Activity reporting functions */
+
+static inline void
+pgstat_inc_ioop(IOOp io_op, IOPath io_path)
+{
+ IOOps *io_ops;
+ PgBackendStatus *beentry = MyBEEntry;
+
+ Assert(beentry);
+
+ io_ops = &beentry->io_path_stats[io_path];
+ switch (io_op)
+ {
+ case IOOP_ALLOC:
+ pg_atomic_inc_counter(&io_ops->allocs);
+ break;
+ case IOOP_EXTEND:
+ pg_atomic_inc_counter(&io_ops->extends);
+ break;
+ case IOOP_FSYNC:
+ pg_atomic_inc_counter(&io_ops->fsyncs);
+ break;
+ case IOOP_WRITE:
+ pg_atomic_inc_counter(&io_ops->writes);
+ break;
+ }
+}
extern void pgstat_report_activity(BackendState state, const char *cmd_str);
extern void pgstat_report_query_id(uint64 query_id, bool force);
extern void pgstat_report_tempfile(size_t filesize);
--
2.32.0
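The counting scheme in the patch above is essentially a small matrix: one row per IO Path, one cell per IO Op, plus the `from_ring` decision that classifies a dirty-buffer write as a strategy write or a regular shared write. A stripped-down sketch (all names here are illustrative stand-ins for the patch's `pgstat_inc_ioop()` and the `BufferAlloc()` logic):

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for the patch's IOOp and IOPath enums. */
enum {IOOP_ALLOC, IOOP_EXTEND, IOOP_FSYNC, IOOP_WRITE, IOOP_NUM_TYPES};
enum {IOPATH_DIRECT, IOPATH_LOCAL, IOPATH_SHARED, IOPATH_STRATEGY,
	  IOPATH_NUM_TYPES};

/* One backend's counter matrix (io_path_stats in PgBackendStatus). */
static uint64_t io_path_stats[IOPATH_NUM_TYPES][IOOP_NUM_TYPES];

/* Analogue of pgstat_inc_ioop(): bump one cell of the matrix. */
static void
inc_ioop(int io_op, int io_path)
{
	io_path_stats[io_path][io_op]++;
}

/*
 * Mirrors the classification in BufferAlloc(): a dirty victim taken from
 * the strategy ring counts as a strategy write; one obtained from the
 * freelist or a clock sweep counts as a regular shared write.
 */
static int
write_iopath(int from_ring)
{
	return from_ring ? IOPATH_STRATEGY : IOPATH_SHARED;
}
```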
Attachment: v17-0001-Read-only-atomic-backend-write-function.patch (application/octet-stream)
From e0f7f3dfd60a68fa01f3c023bcdb69305ade3738 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 11 Oct 2021 16:15:06 -0400
Subject: [PATCH v17 1/7] Read-only atomic backend write function
For counters in shared memory which can be read by any backend but only
written to by one backend, an atomic is still needed to protect against
torn values, however, pg_atomic_fetch_add_u64() is overkill for
incrementing the counter. pg_atomic_inc_counter() is a helper function
which can be used to increment these values safely but without
unnecessary overhead.
Author: Thomas Munro
---
src/include/port/atomics.h | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/src/include/port/atomics.h b/src/include/port/atomics.h
index 856338f161..39ffff24dd 100644
--- a/src/include/port/atomics.h
+++ b/src/include/port/atomics.h
@@ -519,6 +519,17 @@ pg_atomic_sub_fetch_u64(volatile pg_atomic_uint64 *ptr, int64 sub_)
return pg_atomic_sub_fetch_u64_impl(ptr, sub_);
}
+/*
+ * On modern systems this is really just *counter++. On some older systems
+ * there might be more to it, due to inability to read and write 64 bit values
+ * atomically.
+ */
+static inline void
+pg_atomic_inc_counter(pg_atomic_uint64 *counter)
+{
+ pg_atomic_write_u64(counter, pg_atomic_read_u64(counter) + 1);
+}
+
#undef INSIDE_ATOMICS_H
#endif /* ATOMICS_H */
--
2.32.0
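For readers outside the Postgres tree, the idea behind `pg_atomic_inc_counter()` can be sketched with C11 atomics: when exactly one writer updates a 64-bit counter, a plain load + store on an atomic is enough to keep readers from ever seeing a torn value, without paying for a full atomic read-modify-write. This is a standalone illustration, not the actual `pg_atomic_*` implementation:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Stand-in for pg_atomic_uint64: used only for untorn 64-bit access. */
typedef _Atomic uint64_t counter64;

/*
 * Single-writer increment: only one backend ever writes this counter, so
 * load-add-store cannot race with another writer.  Readers may observe a
 * slightly stale value, but never a torn one.
 */
static inline void
counter_inc_unlocked(counter64 *c)
{
	atomic_store_explicit(c,
						  atomic_load_explicit(c, memory_order_relaxed) + 1,
						  memory_order_relaxed);
}

static inline uint64_t
counter_read(counter64 *c)
{
	return atomic_load_explicit(c, memory_order_relaxed);
}
```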
Hi,
On 2021-12-03 15:02:24 -0500, Melanie Plageman wrote:
From e0f7f3dfd60a68fa01f3c023bcdb69305ade3738 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 11 Oct 2021 16:15:06 -0400
Subject: [PATCH v17 1/7] Read-only atomic backend write function

For counters in shared memory which can be read by any backend but only
written to by one backend, an atomic is still needed to protect against
torn values, however, pg_atomic_fetch_add_u64() is overkill for
incrementing the counter. pg_atomic_inc_counter() is a helper function
which can be used to increment these values safely but without
unnecessary overhead.

Author: Thomas Munro
---
src/include/port/atomics.h | 11 +++++++++++
1 file changed, 11 insertions(+)diff --git a/src/include/port/atomics.h b/src/include/port/atomics.h index 856338f161..39ffff24dd 100644 --- a/src/include/port/atomics.h +++ b/src/include/port/atomics.h @@ -519,6 +519,17 @@ pg_atomic_sub_fetch_u64(volatile pg_atomic_uint64 *ptr, int64 sub_) return pg_atomic_sub_fetch_u64_impl(ptr, sub_); }+/* + * On modern systems this is really just *counter++. On some older systems + * there might be more to it, due to inability to read and write 64 bit values + * atomically. + */ +static inline void +pg_atomic_inc_counter(pg_atomic_uint64 *counter) +{ + pg_atomic_write_u64(counter, pg_atomic_read_u64(counter) + 1); +}
I wonder if it's worth putting something in the name indicating that this is
not actual atomic RMW operation. Perhaps adding _unlocked?
From b0e193cfa08f0b8cf1be929f26fe38f06a39aeae Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 24 Nov 2021 10:32:56 -0500
Subject: [PATCH v17 2/7] Add IO operation counters to PgBackendStatus

Add an array of counters in PgBackendStatus which count the buffers
allocated, extended, fsynced, and written by a given backend. Each "IO
Op" (alloc, fsync, extend, write) is counted per "IO Path" (direct,
local, shared, or strategy). "local" and "shared" IO Path counters count
operations on local and shared buffers. The "strategy" IO Path counts
buffers alloc'd/written/read/fsync'd as part of a BufferAccessStrategy.
The "direct" IO Path counts blocks of IO which are read, written, or
fsync'd using smgrwrite/extend/immedsync directly (as opposed to through
[Local]BufferAlloc()).

With this commit, all backends increment a counter in their
PgBackendStatus when performing an IO operation. This is in preparation
for future commits which will persist these stats upon backend exit and
use the counters to provide observability of database IO operations.

Note that this commit does not add code to increment the "direct" path.
A separate proposed patch [1] which would add wrappers for smgrwrite(),
smgrextend(), and smgrimmedsync() would provide a good location to call
pgstat_inc_ioop() for unbuffered IO and avoid regressions for future
users of these functions.[1] /messages/by-id/CAAKRu_aw72w70X1P=ba20K8iGUvSkyz7Yk03wPPh3f9WgmcJ3g@mail.gmail.com
On longer threads it's nice for committers to already have Reviewed-By: in the
commit message.
diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index 7229598822..413cc605f8 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -399,6 +399,15 @@ pgstat_bestart(void)
 	lbeentry.st_progress_command = PROGRESS_COMMAND_INVALID;
 	lbeentry.st_progress_command_target = InvalidOid;
 	lbeentry.st_query_id = UINT64CONST(0);
+	for (int io_path = 0; io_path < IOPATH_NUM_TYPES; io_path++)
+	{
+		IOOps	   *io_ops = &lbeentry.io_path_stats[io_path];
+
+		pg_atomic_init_u64(&io_ops->allocs, 0);
+		pg_atomic_init_u64(&io_ops->extends, 0);
+		pg_atomic_init_u64(&io_ops->fsyncs, 0);
+		pg_atomic_init_u64(&io_ops->writes, 0);
+	}

 	/*
 	 * we don't zero st_progress_param here to save cycles; nobody should
nit: I think we nearly always have a blank line before loops
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 646126edee..93f1b4bcfc 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -623,6 +623,7 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
 		RegisterTimeout(CLIENT_CONNECTION_CHECK_TIMEOUT, ClientCheckTimeoutHandler);
 	}

+	pgstat_beinit();
 	/*
 	 * Initialize local process's access to XLOG.
 	 */
nit: same with multi-line comments.
@@ -649,6 +650,7 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
 	 */
 	CreateAuxProcessResourceOwner();

+	pgstat_bestart();
 	StartupXLOG();
 	/* Release (and warn about) any buffer pins leaked in StartupXLOG */
 	ReleaseAuxProcessResources(true);
@@ -676,7 +678,6 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
 	EnablePortalManager();

 	/* Initialize status reporting */
-	pgstat_beinit();
I'd like to see changes like moving this kind of thing around broken out
and committed separately. It's much easier to pinpoint breakage if the CF
breaks after moving just pgstat_beinit() around, rather than when committing
this considerably larger patch. And reordering subsystem initialization has
the habit of causing problems...
+/* ----------
+ * IO Stats reporting utility types
+ * ----------
+ */
+
+typedef enum IOOp
+{
+	IOOP_ALLOC,
+	IOOP_EXTEND,
+	IOOP_FSYNC,
+	IOOP_WRITE,
+} IOOp;

[...]

+/*
+ * Structure for counting all types of IOOps for a live backend.
+ */
+typedef struct IOOps
+{
+	pg_atomic_uint64	allocs;
+	pg_atomic_uint64	extends;
+	pg_atomic_uint64	fsyncs;
+	pg_atomic_uint64	writes;
+} IOOps;

To me IOOp and IOOps sound too much alike - even though they're really kind of
separate things. s/IOOps/IOOpCounters/ maybe?
@@ -3152,6 +3156,14 @@ pgstat_shutdown_hook(int code, Datum arg)
 {
 	Assert(!pgstat_is_shutdown);

+	/*
+	 * Only need to send stats on IO Ops for IO Paths when a process exits.
+	 * Users requiring IO Ops for both live and exited backends can read from
+	 * live backends' PgBackendStatus and sum this with totals from exited
+	 * backends persisted by the stats collector.
+	 */
+	pgstat_send_buffers();
Perhaps something like this comment belongs somewhere at the top of the file,
or in the header, or ...? It's a fairly central design piece, and it's not
obvious one would need to look in the shutdown hook for it?
+/*
+ * Before exiting, a backend sends its IO op statistics to the collector so
+ * that they may be persisted.
+ */
+void
+pgstat_send_buffers(void)
+{
+	PgStat_MsgIOPathOps msg;
+
+	PgBackendStatus *beentry = MyBEEntry;
+
+	/*
+	 * Though some backends with type B_INVALID (such as the single-user mode
+	 * process) do initialize and increment IO operations stats, there is no
+	 * spot in the array of IO operations for backends of type B_INVALID. As
+	 * such, do not send these to the stats collector.
+	 */
+	if (!beentry || beentry->st_backendType == B_INVALID)
+		return;
Why does single user mode use B_INVALID? That doesn't seem quite right.
+	memset(&msg, 0, sizeof(msg));
+	msg.backend_type = beentry->st_backendType;
+
+	pgstat_sum_io_path_ops(msg.iop.io_path_ops,
+						   (IOOps *) &beentry->io_path_stats);
+
+	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_IO_PATH_OPS);
+	pgstat_send(&msg, sizeof(msg));
+}
It seems worth having a path skipping sending the message if there was no IO?
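A check along those lines could look like the following stand-alone sketch. All names here are illustrative stand-ins, not from the patch, and the struct uses plain uint64 counters rather than the patch's atomics, since the summed copy is local memory:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define IOPATH_NUM_TYPES 4		/* direct, local, shared, strategy */

/* plain-counter stand-in for the summed per-IO-path stats */
typedef struct PgStatIOOps
{
	uint64_t	allocs;
	uint64_t	extends;
	uint64_t	fsyncs;
	uint64_t	writes;
} PgStatIOOps;

/*
 * Return true if any IO was counted on any IO path, i.e. whether a
 * message is worth sending to the collector at all.
 */
static bool
io_ops_any_nonzero(const PgStatIOOps ops[IOPATH_NUM_TYPES])
{
	for (int i = 0; i < IOPATH_NUM_TYPES; i++)
	{
		if (ops[i].allocs != 0 || ops[i].extends != 0 ||
			ops[i].fsyncs != 0 || ops[i].writes != 0)
			return true;
	}
	return false;
}
```

The sender would then return early when `io_ops_any_nonzero()` is false, after summing but before building the message.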
+/*
+ * Helper function to sum all live IO Op stats for all IO Paths (e.g. shared,
+ * local) to those in the equivalent stats structure for exited backends. Note
+ * that this adds and doesn't set, so the destination stats structure should be
+ * zeroed out by the caller initially. This would commonly be used to transfer
+ * all IO Op stats for all IO Paths for a particular backend type to the
+ * pgstats structure.
+ */
+void
+pgstat_sum_io_path_ops(PgStatIOOps *dest, IOOps *src)
+{
+	for (int io_path = 0; io_path < IOPATH_NUM_TYPES; io_path++)
+	{
Sacrilegious, but I find io_path a harder to understand variable name for the
counter than i (or io_path_off or ...) ;)
+static void
+pgstat_recv_io_path_ops(PgStat_MsgIOPathOps *msg, int len)
+{
+	PgStatIOOps *src_io_path_ops;
+	PgStatIOOps *dest_io_path_ops;
+
+	/*
+	 * Subtract 1 from message's BackendType to get a valid index into the
+	 * array of IO Ops which does not include an entry for B_INVALID
+	 * BackendType.
+	 */
+	Assert(msg->backend_type > B_INVALID);
Probably worth also asserting the upper boundary?
From f972ea87270feaed464a74fb6541ac04b4fc7d98 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 24 Nov 2021 11:39:48 -0500
Subject: [PATCH v17 4/7] Add "buffers" to pgstat_reset_shared_counters

Backends count IO operations for various IO paths in their PgBackendStatus.
Upon exit, they send these counts to the stats collector. Prior to this commit,
these IO Ops stats would have been reset when the target was "bgwriter".

With this commit, target "bgwriter" no longer will cause the IO operations
stats to be reset, and the IO operations stats can be reset with new target,
"buffers".
---
doc/src/sgml/monitoring.sgml | 2 +-
src/backend/postmaster/pgstat.c | 83 +++++++++++++++++++--
src/backend/utils/activity/backend_status.c | 29 +++++++
src/include/pgstat.h | 8 +-
src/include/utils/backend_status.h | 2 +
 5 files changed, 117 insertions(+), 7 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 62f2a3332b..bda3eef309 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3604,7 +3604,7 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
      </para>
      <para>
-      Time at which these statistics were last reset
+      Time at which these statistics were last reset.
      </para></entry>
    </row>
   </tbody>
Hm?
Shouldn't this new reset target be documented?
+/*
+ * Helper function to collect and send live backends' current IO operations
+ * stats counters when a stats reset is initiated so that they may be deducted
+ * from future totals.
+ */
+static void
+pgstat_send_buffers_reset(PgStat_MsgResetsharedcounter *msg)
+{
+	PgStatIOPathOps ops[BACKEND_NUM_TYPES];
+
+	memset(ops, 0, sizeof(ops));
+	pgstat_report_live_backend_io_path_ops(ops);
+
+	/*
+	 * Iterate through the array of IO Ops for all IO Paths for each
+	 * BackendType. Because the array does not include a spot for BackendType
+	 * B_INVALID, add 1 to the index when setting backend_type so that there is
+	 * no confusion as to the BackendType with which this reset message
+	 * corresponds.
+	 */
+	for (int backend_type_idx = 0; backend_type_idx < BACKEND_NUM_TYPES; backend_type_idx++)
+	{
+		msg->m_backend_resets.backend_type = backend_type_idx + 1;
+		memcpy(&msg->m_backend_resets.iop, &ops[backend_type_idx],
+			   sizeof(msg->m_backend_resets.iop));
+		pgstat_send(msg, sizeof(PgStat_MsgResetsharedcounter));
+	}
+}
Probably worth explaining why multiple messages are sent?
@@ -5583,10 +5621,45 @@ pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
 {
 	if (msg->m_resettarget == RESET_BGWRITER)
 	{
-		/* Reset the global, bgwriter and checkpointer statistics for the cluster. */
-		memset(&globalStats, 0, sizeof(globalStats));
+		/*
+		 * Reset the global bgwriter and checkpointer statistics for the
+		 * cluster.
+		 */
+		memset(&globalStats.checkpointer, 0, sizeof(globalStats.checkpointer));
+		memset(&globalStats.bgwriter, 0, sizeof(globalStats.bgwriter));
 		globalStats.bgwriter.stat_reset_timestamp = GetCurrentTimestamp();
 	}
Oh, is this a live bug?
+	/*
+	 * Subtract 1 from the BackendType to arrive at a valid index in the
+	 * array, as it does not contain a spot for B_INVALID BackendType.
+	 */
Instead of repeating a comment about +- 1 in a bunch of places, would it look
better to have two helper inline functions for this purpose?
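One possible shape for those helpers — a stand-alone sketch in which the helper names, the abbreviated enum, and the BACKEND_NUM_TYPES definition are all illustrative assumptions, not taken from the patch:

```c
#include <assert.h>

/* abbreviated stand-in for PostgreSQL's BackendType enum */
typedef enum BackendType
{
	B_INVALID = 0,
	B_AUTOVAC_LAUNCHER,
	B_AUTOVAC_WORKER,
	B_BACKEND,
	B_CHECKPOINTER,
} BackendType;

/* number of valid backend types, excluding B_INVALID */
#define BACKEND_NUM_TYPES ((int) B_CHECKPOINTER)

/*
 * The stats array has no slot for B_INVALID, so each valid BackendType
 * maps to an index shifted down by one. Centralizing the +/- 1 in two
 * helpers keeps the invariant (and its assertions) in one place.
 */
static inline int
backend_type_get_idx(BackendType type)
{
	assert(type > B_INVALID && (int) type <= BACKEND_NUM_TYPES);
	return (int) type - 1;
}

static inline BackendType
idx_get_backend_type(int idx)
{
	assert(idx >= 0 && idx < BACKEND_NUM_TYPES);
	return (BackendType) (idx + 1);
}
```

Every call site then says what it means (a type-to-index conversion) instead of repeating the off-by-one comment.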
+/*
+ * When adding a new column to the pg_stat_buffers view, add a new enum
+ * value here above COLUMN_LENGTH.
+ */
+enum
+{
+	COLUMN_BACKEND_TYPE,
+	COLUMN_IO_PATH,
+	COLUMN_ALLOCS,
+	COLUMN_EXTENDS,
+	COLUMN_FSYNCS,
+	COLUMN_WRITES,
+	COLUMN_RESET_TIME,
+	COLUMN_LENGTH,
+};
COLUMN_LENGTH seems like a fairly generic name...
From 9f22da9041e1e1fbc0ef003f5f78f4e72274d438 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 24 Nov 2021 12:20:10 -0500
Subject: [PATCH v17 6/7] Remove superfluous bgwriter stats

Remove stats from pg_stat_bgwriter which are now more clearly expressed
in pg_stat_buffers.

TODO:
- make pg_stat_checkpointer view and move relevant stats into it
- add additional stats to pg_stat_bgwriter
When do you think it makes sense to tackle these wrt committing some of the
patches?
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 6926fc5742..67447f997a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2164,7 +2164,6 @@ BufferSync(int flags)
 		if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
 		{
 			TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
-			PendingCheckpointerStats.m_buf_written_checkpoints++;
 			num_written++;
 		}
 	}
@@ -2273,9 +2272,6 @@ BgBufferSync(WritebackContext *wb_context)
 	 */
 	strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);

-	/* Report buffer alloc counts to pgstat */
-	PendingBgWriterStats.m_buf_alloc += recent_alloc;
-
 	/*
 	 * If we're not running the LRU scan, just stop after doing the stats
 	 * stuff. We mark the saved state invalid so that we can recover sanely
@@ -2472,8 +2468,6 @@ BgBufferSync(WritebackContext *wb_context)
 		reusable_buffers++;
 	}

-	PendingBgWriterStats.m_buf_written_clean += num_written;
-
Isn't num_written unused now, unless tracepoints are enabled? I'd expect some
compilers to warn... Perhaps we should just remove information from the
tracepoint?
Greetings,
Andres Freund
v18 attached.
On Thu, Dec 9, 2021 at 2:17 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2021-12-03 15:02:24 -0500, Melanie Plageman wrote:
From e0f7f3dfd60a68fa01f3c023bcdb69305ade3738 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 11 Oct 2021 16:15:06 -0400
Subject: [PATCH v17 1/7] Read-only atomic backend write function

For counters in shared memory which can be read by any backend but only
written to by one backend, an atomic is still needed to protect against
torn values, however, pg_atomic_fetch_add_u64() is overkill for
incrementing the counter. pg_atomic_inc_counter() is a helper function
which can be used to increment these values safely but without
unnecessary overhead.

Author: Thomas Munro
---
src/include/port/atomics.h | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/src/include/port/atomics.h b/src/include/port/atomics.h
index 856338f161..39ffff24dd 100644
--- a/src/include/port/atomics.h
+++ b/src/include/port/atomics.h
@@ -519,6 +519,17 @@ pg_atomic_sub_fetch_u64(volatile pg_atomic_uint64 *ptr, int64 sub_)
 	return pg_atomic_sub_fetch_u64_impl(ptr, sub_);
 }
+
+/*
+ * On modern systems this is really just *counter++. On some older systems
+ * there might be more to it, due to inability to read and write 64 bit values
+ * atomically.
+ */
+static inline void
+pg_atomic_inc_counter(pg_atomic_uint64 *counter)
+{
+	pg_atomic_write_u64(counter, pg_atomic_read_u64(counter) + 1);
+}

I wonder if it's worth putting something in the name indicating that this is
not an actual atomic RMW operation. Perhaps adding _unlocked?
Done.
From b0e193cfa08f0b8cf1be929f26fe38f06a39aeae Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 24 Nov 2021 10:32:56 -0500
Subject: [PATCH v17 2/7] Add IO operation counters to PgBackendStatus

Add an array of counters in PgBackendStatus which count the buffers
allocated, extended, fsynced, and written by a given backend. Each "IO
Op" (alloc, fsync, extend, write) is counted per "IO Path" (direct,
local, shared, or strategy). "local" and "shared" IO Path counters count
operations on local and shared buffers. The "strategy" IO Path counts
buffers alloc'd/written/read/fsync'd as part of a BufferAccessStrategy.
The "direct" IO Path counts blocks of IO which are read, written, or
fsync'd using smgrwrite/extend/immedsync directly (as opposed to through
[Local]BufferAlloc()).

With this commit, all backends increment a counter in their
PgBackendStatus when performing an IO operation. This is in preparation
for future commits which will persist these stats upon backend exit and
use the counters to provide observability of database IO operations.

Note that this commit does not add code to increment the "direct" path.
A separate proposed patch [1] which would add wrappers for smgrwrite(),
smgrextend(), and smgrimmedsync() would provide a good location to call
pgstat_inc_ioop() for unbuffered IO and avoid regressions for future
users of these functions.

[1] /messages/by-id/CAAKRu_aw72w70X1P=ba20K8iGUvSkyz7Yk03wPPh3f9WgmcJ3g@mail.gmail.com
On longer threads it's nice for committers to already have Reviewed-By: in the
commit message.
Done.
diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index 7229598822..413cc605f8 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -399,6 +399,15 @@ pgstat_bestart(void)
 	lbeentry.st_progress_command = PROGRESS_COMMAND_INVALID;
 	lbeentry.st_progress_command_target = InvalidOid;
 	lbeentry.st_query_id = UINT64CONST(0);
+	for (int io_path = 0; io_path < IOPATH_NUM_TYPES; io_path++)
+	{
+		IOOps	   *io_ops = &lbeentry.io_path_stats[io_path];
+
+		pg_atomic_init_u64(&io_ops->allocs, 0);
+		pg_atomic_init_u64(&io_ops->extends, 0);
+		pg_atomic_init_u64(&io_ops->fsyncs, 0);
+		pg_atomic_init_u64(&io_ops->writes, 0);
+	}

 	/*
 	 * we don't zero st_progress_param here to save cycles; nobody should

nit: I think we nearly always have a blank line before loops
Done.
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 646126edee..93f1b4bcfc 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -623,6 +623,7 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
 		RegisterTimeout(CLIENT_CONNECTION_CHECK_TIMEOUT, ClientCheckTimeoutHandler);
 	}

+	pgstat_beinit();
 	/*
 	 * Initialize local process's access to XLOG.
 	 */

nit: same with multi-line comments.
Done.
@@ -649,6 +650,7 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
*/
 	CreateAuxProcessResourceOwner();

+	pgstat_bestart();
 	StartupXLOG();
 	/* Release (and warn about) any buffer pins leaked in StartupXLOG */
 	ReleaseAuxProcessResources(true);
@@ -676,7 +678,6 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
 	EnablePortalManager();

 	/* Initialize status reporting */
-	pgstat_beinit();

I'd like to see changes like moving this kind of thing around broken out
and committed separately. It's much easier to pinpoint breakage if the CF
breaks after moving just pgstat_beinit() around, rather than when committing
this considerably larger patch. And reordering subsystem initialization has
the habit of causing problems...
Done
+/* ----------
+ * IO Stats reporting utility types
+ * ----------
+ */
+
+typedef enum IOOp
+{
+	IOOP_ALLOC,
+	IOOP_EXTEND,
+	IOOP_FSYNC,
+	IOOP_WRITE,
+} IOOp;

[...]

+/*
+ * Structure for counting all types of IOOps for a live backend.
+ */
+typedef struct IOOps
+{
+	pg_atomic_uint64	allocs;
+	pg_atomic_uint64	extends;
+	pg_atomic_uint64	fsyncs;
+	pg_atomic_uint64	writes;
+} IOOps;

To me IOOp and IOOps sound too much alike - even though they're really kind of
separate things. s/IOOps/IOOpCounters/ maybe?
Done.
@@ -3152,6 +3156,14 @@ pgstat_shutdown_hook(int code, Datum arg)
{
 	Assert(!pgstat_is_shutdown);

+	/*
+	 * Only need to send stats on IO Ops for IO Paths when a process exits.
+	 * Users requiring IO Ops for both live and exited backends can read from
+	 * live backends' PgBackendStatus and sum this with totals from exited
+	 * backends persisted by the stats collector.
+	 */
+	pgstat_send_buffers();

Perhaps something like this comment belongs somewhere at the top of the file,
or in the header, or ...? It's a fairly central design piece, and it's not
obvious one would need to look in the shutdown hook for it?
now in pgstat.h above the declaration of pgstat_send_buffers()
+/*
+ * Before exiting, a backend sends its IO op statistics to the collector so
+ * that they may be persisted.
+ */
+void
+pgstat_send_buffers(void)
+{
+	PgStat_MsgIOPathOps msg;
+
+	PgBackendStatus *beentry = MyBEEntry;
+
+	/*
+	 * Though some backends with type B_INVALID (such as the single-user mode
+	 * process) do initialize and increment IO operations stats, there is no
+	 * spot in the array of IO operations for backends of type B_INVALID. As
+	 * such, do not send these to the stats collector.
+	 */
+	if (!beentry || beentry->st_backendType == B_INVALID)
+		return;

Why does single user mode use B_INVALID? That doesn't seem quite right.
I think PgBackendStatus->st_backendType is set from MyBackendType which
isn't set for the single user mode process. What BackendType would you
expect to see?
+	memset(&msg, 0, sizeof(msg));
+	msg.backend_type = beentry->st_backendType;
+
+	pgstat_sum_io_path_ops(msg.iop.io_path_ops,
+						   (IOOps *) &beentry->io_path_stats);
+
+	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_IO_PATH_OPS);
+	pgstat_send(&msg, sizeof(msg));
+}

It seems worth having a path skipping sending the message if there was no IO?
Makes sense. I've updated pgstat_send_buffers() to do a loop after calling
pgstat_sum_io_path_ops() and check if it should skip sending.
I also thought about having pgstat_sum_io_path_ops() return a value to
indicate if everything was 0 -- which could be useful to future callers
potentially?
I didn't do this because I am not sure what the return value would be.
It could be a bool which is true if any IO was done and false if none was
done -- but that doesn't really make sense given the function's name, since
it would be called like

if (!pgstat_sum_io_path_ops())
    return;

which I'm not sure is very clear.
+/*
+ * Helper function to sum all live IO Op stats for all IO Paths (e.g. shared,
+ * local) to those in the equivalent stats structure for exited backends. Note
+ * that this adds and doesn't set, so the destination stats structure should be
+ * zeroed out by the caller initially. This would commonly be used to transfer
+ * all IO Op stats for all IO Paths for a particular backend type to the
+ * pgstats structure.
+ */
+void
+pgstat_sum_io_path_ops(PgStatIOOps *dest, IOOps *src)
+{
+	for (int io_path = 0; io_path < IOPATH_NUM_TYPES; io_path++)
+	{

Sacrilegious, but I find io_path a harder to understand variable name for the
counter than i (or io_path_off or ...) ;)
I've updated almost all my non-standard loop index variable names.
+static void
+pgstat_recv_io_path_ops(PgStat_MsgIOPathOps *msg, int len)
+{
+	PgStatIOOps *src_io_path_ops;
+	PgStatIOOps *dest_io_path_ops;
+
+	/*
+	 * Subtract 1 from message's BackendType to get a valid index into the
+	 * array of IO Ops which does not include an entry for B_INVALID
+	 * BackendType.
+	 */
+	Assert(msg->backend_type > B_INVALID);

Probably worth also asserting the upper boundary?
Done.
From f972ea87270feaed464a74fb6541ac04b4fc7d98 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 24 Nov 2021 11:39:48 -0500
Subject: [PATCH v17 4/7] Add "buffers" to pgstat_reset_shared_counters

Backends count IO operations for various IO paths in their PgBackendStatus.
Upon exit, they send these counts to the stats collector. Prior to this commit,
these IO Ops stats would have been reset when the target was "bgwriter".

With this commit, target "bgwriter" no longer will cause the IO operations
stats to be reset, and the IO operations stats can be reset with new target,
"buffers".
---
doc/src/sgml/monitoring.sgml | 2 +-
src/backend/postmaster/pgstat.c | 83 +++++++++++++++++++--
src/backend/utils/activity/backend_status.c | 29 +++++++
src/include/pgstat.h | 8 +-
src/include/utils/backend_status.h | 2 +
 5 files changed, 117 insertions(+), 7 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 62f2a3332b..bda3eef309 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3604,7 +3604,7 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
      </para>
      <para>
-      Time at which these statistics were last reset
+      Time at which these statistics were last reset.
      </para></entry>
    </row>
   </tbody>

Hm?
Shouldn't this new reset target be documented?
It is in the commit adding the view. I didn't include it in this commit
because the pg_stat_buffers view doesn't exist yet, as of this commit,
and I thought it would be odd to mention it in the docs (in this
commit).
As an aside, I shouldn't have left this correction in this commit. I
moved it now to the other one.
+/*
+ * Helper function to collect and send live backends' current IO operations
+ * stats counters when a stats reset is initiated so that they may be deducted
+ * from future totals.
+ */
+static void
+pgstat_send_buffers_reset(PgStat_MsgResetsharedcounter *msg)
+{
+	PgStatIOPathOps ops[BACKEND_NUM_TYPES];
+
+	memset(ops, 0, sizeof(ops));
+	pgstat_report_live_backend_io_path_ops(ops);
+
+	/*
+	 * Iterate through the array of IO Ops for all IO Paths for each
+	 * BackendType. Because the array does not include a spot for BackendType
+	 * B_INVALID, add 1 to the index when setting backend_type so that there is
+	 * no confusion as to the BackendType with which this reset message
+	 * corresponds.
+	 */
+	for (int backend_type_idx = 0; backend_type_idx < BACKEND_NUM_TYPES; backend_type_idx++)
+	{
+		msg->m_backend_resets.backend_type = backend_type_idx + 1;
+		memcpy(&msg->m_backend_resets.iop, &ops[backend_type_idx],
+			   sizeof(msg->m_backend_resets.iop));
+		pgstat_send(msg, sizeof(PgStat_MsgResetsharedcounter));
+	}
+}

Probably worth explaining why multiple messages are sent?
Done.
@@ -5583,10 +5621,45 @@ pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
 {
 	if (msg->m_resettarget == RESET_BGWRITER)
 	{
-		/* Reset the global, bgwriter and checkpointer statistics for the cluster. */
-		memset(&globalStats, 0, sizeof(globalStats));
+		/*
+		 * Reset the global bgwriter and checkpointer statistics for the
+		 * cluster.
+		 */
+		memset(&globalStats.checkpointer, 0, sizeof(globalStats.checkpointer));
+		memset(&globalStats.bgwriter, 0, sizeof(globalStats.bgwriter));
 		globalStats.bgwriter.stat_reset_timestamp = GetCurrentTimestamp();
 	}

Oh, is this a live bug?
I don't think it is a bug. globalStats only contained bgwriter and
checkpointer stats and those were all only displayed in
pg_stat_bgwriter(), so memsetting the whole thing seems fine.
+	/*
+	 * Subtract 1 from the BackendType to arrive at a valid index in the
+	 * array, as it does not contain a spot for B_INVALID BackendType.
+	 */

Instead of repeating a comment about +- 1 in a bunch of places, would it look
better to have two helper inline functions for this purpose?
Done.
+/*
+ * When adding a new column to the pg_stat_buffers view, add a new enum
+ * value here above COLUMN_LENGTH.
+ */
+enum
+{
+	COLUMN_BACKEND_TYPE,
+	COLUMN_IO_PATH,
+	COLUMN_ALLOCS,
+	COLUMN_EXTENDS,
+	COLUMN_FSYNCS,
+	COLUMN_WRITES,
+	COLUMN_RESET_TIME,
+	COLUMN_LENGTH,
+};

COLUMN_LENGTH seems like a fairly generic name...
Changed.
From 9f22da9041e1e1fbc0ef003f5f78f4e72274d438 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 24 Nov 2021 12:20:10 -0500
Subject: [PATCH v17 6/7] Remove superfluous bgwriter stats

Remove stats from pg_stat_bgwriter which are now more clearly expressed
in pg_stat_buffers.

TODO:
- make pg_stat_checkpointer view and move relevant stats into it
- add additional stats to pg_stat_bgwriter

When do you think it makes sense to tackle these wrt committing some of the
patches?
Well, the new stats are a superset of the old stats (no stats have been
removed that are not represented in the new or old views). So, I don't
see that as a blocker for committing these patches.
Since it is weird that pg_stat_bgwriter had mostly checkpointer stats,
I've edited this commit to rename that view to pg_stat_checkpointer.
I have not made a separate view just for maxwritten_clean (presumably
called pg_stat_bgwriter), but I would not be opposed to doing this if
you thought having a view with a single column isn't a problem (in the
event that we don't get around to adding more bgwriter stats right
away).
I noticed after changing the docs on the "bgwriter" target for
pg_stat_reset_shared to say "checkpointer", that it still said "bgwriter" in
src/backend/po/ko.po
src/backend/po/it.po
...
I presume these are automatically updated with some incantation, but I wasn't
sure what it was nor could I find documentation on this.
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 6926fc5742..67447f997a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2164,7 +2164,6 @@ BufferSync(int flags)
 		if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
 		{
 			TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
-			PendingCheckpointerStats.m_buf_written_checkpoints++;
 			num_written++;
 		}
 	}
@@ -2273,9 +2272,6 @@ BgBufferSync(WritebackContext *wb_context)
 	 */
 	strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);

-	/* Report buffer alloc counts to pgstat */
-	PendingBgWriterStats.m_buf_alloc += recent_alloc;
-
 	/*
 	 * If we're not running the LRU scan, just stop after doing the stats
 	 * stuff. We mark the saved state invalid so that we can recover sanely
@@ -2472,8 +2468,6 @@ BgBufferSync(WritebackContext *wb_context)
 		reusable_buffers++;
 	}

-	PendingBgWriterStats.m_buf_written_clean += num_written;
-

Isn't num_written unused now, unless tracepoints are enabled? I'd expect some
compilers to warn... Perhaps we should just remove information from the
tracepoint?
The local variable num_written is used in BgBufferSync() to determine
whether or not to increment maxwritten_clean which is still represented
in the view pg_stat_checkpointer (formerly pg_stat_bgwriter).
A local variable num_written is used in BufferSync() to increment
CheckpointStats.ckpt_bufs_written which is logged in LogCheckpointEnd(),
so I'm not sure that can be removed.
- Melanie
Attachments:
v18-0001-Read-only-atomic-backend-write-function.patch
From 0a35d1bf2d70131b298adf0aa35349f13d8ad8aa Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 11 Oct 2021 16:15:06 -0400
Subject: [PATCH v18 1/8] Read-only atomic backend write function
For counters in shared memory which can be read by any backend but only
written to by one backend, an atomic is still needed to protect against
torn values, however, pg_atomic_fetch_add_u64() is overkill for
incrementing the counter. pg_atomic_inc_counter() is a helper function
which can be used to increment these values safely but without
unnecessary overhead.
Author: Thomas Munro <tmunro@postgresql.org>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://www.postgresql.org/message-id/CA%2BhUKGJ06d3h5JeOtAv4h52n0vG1jOPZxqMCn5FySJQUVZA32w%40mail.gmail.com
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
src/include/port/atomics.h | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/src/include/port/atomics.h b/src/include/port/atomics.h
index 856338f161..dac9767b1c 100644
--- a/src/include/port/atomics.h
+++ b/src/include/port/atomics.h
@@ -519,6 +519,17 @@ pg_atomic_sub_fetch_u64(volatile pg_atomic_uint64 *ptr, int64 sub_)
return pg_atomic_sub_fetch_u64_impl(ptr, sub_);
}
+/*
+ * On modern systems this is really just *counter++. On some older systems
+ * there might be more to it, due to inability to read and write 64 bit values
+ * atomically.
+ */
+static inline void
+pg_atomic_unlocked_inc_counter(pg_atomic_uint64 *counter)
+{
+ pg_atomic_write_u64(counter, pg_atomic_read_u64(counter) + 1);
+}
+
#undef INSIDE_ATOMICS_H
#endif /* ATOMICS_H */
--
2.30.2
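For readers outside the PostgreSQL tree, the single-writer pattern this patch relies on can be illustrated with standard C11 atomics instead of the pg_atomic API — this is an analogy sketch, not the patch's code: with exactly one writer, a plain load followed by a store cannot lose updates, while the atomic type still guarantees concurrent readers never observe a torn 64-bit value, so no atomic RMW instruction is needed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/*
 * Unlocked increment: safe only under the single-writer assumption.
 * Relaxed ordering suffices because the counter is a statistic, not a
 * synchronization variable.
 */
static inline void
unlocked_inc_counter(_Atomic uint64_t *counter)
{
	uint64_t	cur = atomic_load_explicit(counter, memory_order_relaxed);

	atomic_store_explicit(counter, cur + 1, memory_order_relaxed);
}
```

If two processes could increment the same counter, this would race and drop updates; in that case an atomic fetch-add (the analogue of pg_atomic_fetch_add_u64()) is required.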
v18-0002-Move-backend-pgstat-initialization-earlier.patch
From 316b073738e08d26db3da743d6da327b5bd1e390 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 14 Dec 2021 12:26:56 -0500
Subject: [PATCH v18 2/8] Move backend pgstat initialization earlier
Initialize pgstats subsystem earlier during process initialization so
that more process types have a backend activity state.
Conditionally initializing backend activity state in some types of
processes and not in others necessitates surprising special cases later.
This particular commit was motivated by single user mode missing a
backend activity state.
Author: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://www.postgresql.org/message-id/CAAKRu_aaq33UnG4TXq3S-OSXGWj1QGf0sU%2BECH4tNwGFNERkZA%40mail.gmail.com
---
src/backend/utils/init/postinit.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 7292e51f7d..11f1fec17e 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -623,6 +623,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
RegisterTimeout(CLIENT_CONNECTION_CHECK_TIMEOUT, ClientCheckTimeoutHandler);
}
+ pgstat_beinit();
+
/*
* If this is either a bootstrap process nor a standalone backend, start
* up the XLOG machinery, and register to have it closed down at exit.
@@ -638,6 +640,7 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
*/
CreateAuxProcessResourceOwner();
+ pgstat_bestart();
StartupXLOG();
/* Release (and warn about) any buffer pins leaked in StartupXLOG */
ReleaseAuxProcessResources(true);
@@ -665,7 +668,6 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
EnablePortalManager();
/* Initialize status reporting */
- pgstat_beinit();
/*
* Load relcache entries for the shared system catalogs. This must create
@@ -903,10 +905,7 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
* transaction we started before returning.
*/
if (!bootstrap)
- {
- pgstat_bestart();
CommitTransactionCommand();
- }
return;
}
--
2.30.2
v18-0008-small-comment-correction.patch
From 2d8ad99743e50c0dc021d116782fe8d63329fc31 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 24 Nov 2021 12:21:08 -0500
Subject: [PATCH v18 8/8] small comment correction
---
src/backend/utils/activity/backend_status.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index 08e3f8f167..9666c962b4 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -296,7 +296,7 @@ pgstat_beinit(void)
* pgstat_bestart() -
*
* Initialize this backend's entry in the PgBackendStatus array.
- * Called from InitPostgres.
+ * Called from InitPostgres and AuxiliaryProcessMain
*
* Apart from auxiliary processes, MyBackendId, MyDatabaseId,
* session userid, and application_name must be set for a
--
2.30.2
Attachment: v18-0005-Add-buffers-to-pgstat_reset_shared_counters.patch (text/x-patch)
From ebd21154bc83b02a31ce185bbfd2c51ad9c7d565 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 24 Nov 2021 11:39:48 -0500
Subject: [PATCH v18 5/8] Add "buffers" to pgstat_reset_shared_counters
Backends count IO operations for various IO paths in their
PgBackendStatus. Upon exit, they send these counts to the stats
collector. Prior to this commit, these IO operation stats were reset
when the target was "bgwriter".
With this commit, the "bgwriter" target no longer resets the IO
operation stats; instead, they can be reset with the new target,
"buffers".
Author: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
src/backend/postmaster/pgstat.c | 77 +++++++++++++++++++--
src/backend/utils/activity/backend_status.c | 27 ++++++++
src/include/pgstat.h | 30 ++++++--
src/include/utils/backend_status.h | 2 +
4 files changed, 125 insertions(+), 11 deletions(-)
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 2102c0be98..8b8e3ccfcb 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1545,6 +1545,36 @@ BackendType idx_get_backend_type(int idx)
return backend_type;
}
+/*
+ * Helper function to collect and send live backends' current IO operations
+ * stats counters when a stats reset is initiated so that they may be deducted
+ * from future totals.
+ */
+static void
+pgstat_send_buffers_reset(PgStat_MsgResetsharedcounter *msg)
+{
+ PgStatIOPathOps ops[BACKEND_NUM_TYPES];
+
+ memset(ops, 0, sizeof(ops));
+ pgstat_report_live_backend_io_path_ops(ops);
+
+ /*
+ * Iterate through the array of all IOOps for all IOPaths for each
+ * BackendType.
+ *
+ * An individual message is sent for each backend type because sending all
+ * IO operations in one message would exceed the PGSTAT_MAX_MSG_SIZE of
+ * 1000.
+ */
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ msg->m_backend_resets.backend_type = idx_get_backend_type(i);
+ memcpy(&msg->m_backend_resets.iop, &ops[i],
+ sizeof(msg->m_backend_resets.iop));
+ pgstat_send(msg, sizeof(PgStat_MsgResetsharedcounter));
+ }
+}
+
/* ----------
* pgstat_reset_shared_counters() -
*
@@ -1562,7 +1592,14 @@ pgstat_reset_shared_counters(const char *target)
if (pgStatSock == PGINVALID_SOCKET)
return;
- if (strcmp(target, "archiver") == 0)
+ pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
+ if (strcmp(target, "buffers") == 0)
+ {
+ msg.m_resettarget = RESET_BUFFERS;
+ pgstat_send_buffers_reset(&msg);
+ return;
+ }
+ else if (strcmp(target, "archiver") == 0)
msg.m_resettarget = RESET_ARCHIVER;
else if (strcmp(target, "bgwriter") == 0)
msg.m_resettarget = RESET_BGWRITER;
@@ -1572,9 +1609,10 @@ pgstat_reset_shared_counters(const char *target)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("unrecognized reset target: \"%s\"", target),
- errhint("Target must be \"archiver\", \"bgwriter\", or \"wal\".")));
+ errhint(
+ "Target must be \"archiver\", \"bgwriter\", \"buffers\", or \"wal\".")));
+
- pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
pgstat_send(&msg, sizeof(msg));
}
@@ -4464,6 +4502,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
*/
ts = GetCurrentTimestamp();
globalStats.bgwriter.stat_reset_timestamp = ts;
+ globalStats.buffers.stat_reset_timestamp = ts;
archiverStats.stat_reset_timestamp = ts;
walStats.stat_reset_timestamp = ts;
@@ -5629,10 +5668,38 @@ pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
{
if (msg->m_resettarget == RESET_BGWRITER)
{
- /* Reset the global, bgwriter and checkpointer statistics for the cluster. */
- memset(&globalStats, 0, sizeof(globalStats));
+ /*
+ * Reset the global bgwriter and checkpointer statistics for the
+ * cluster.
+ */
+ memset(&globalStats.checkpointer, 0, sizeof(globalStats.checkpointer));
+ memset(&globalStats.bgwriter, 0, sizeof(globalStats.bgwriter));
globalStats.bgwriter.stat_reset_timestamp = GetCurrentTimestamp();
}
+ else if (msg->m_resettarget == RESET_BUFFERS)
+ {
+ /*
+ * Because the stats collector cannot write to live backends'
+ * PgBackendStatuses, it maintains an array of "resets". The reset
+ * message contains the current values of these counters for live
+ * backends. The stats collector saves these in its "resets" array,
+ * then zeroes out the exited backends' saved IO operations counters.
+ * This is required to calculate an accurate total for each IO
+ * operations counter post reset.
+ */
+ BackendType backend_type = msg->m_backend_resets.backend_type;
+
+ /*
+ * Though globalStats.buffers only needs to be reset once, doing so for
+ * every message is less brittle and the extra cost is irrelevant given
+ * how often stats are reset.
+ */
+ memset(&globalStats.buffers.ops, 0, sizeof(globalStats.buffers.ops));
+ globalStats.buffers.stat_reset_timestamp = GetCurrentTimestamp();
+ memcpy(&globalStats.buffers.resets[backend_type_get_idx(backend_type)],
+ &msg->m_backend_resets.iop.io_path_ops,
+ sizeof(msg->m_backend_resets.iop.io_path_ops));
+ }
else if (msg->m_resettarget == RESET_ARCHIVER)
{
/* Reset the archiver statistics for the cluster. */
diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index 44c9a6e1a6..9fb888f2ca 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -631,6 +631,33 @@ pgstat_report_activity(BackendState state, const char *cmd_str)
PGSTAT_END_WRITE_ACTIVITY(beentry);
}
+/*
+ * Iterate through BackendStatusArray and capture live backends' stats on IOOps
+ * for all IOPaths, adding them to that backend type's member of the
+ * backend_io_path_ops structure.
+ */
+void
+pgstat_report_live_backend_io_path_ops(PgStatIOPathOps *backend_io_path_ops)
+{
+ PgBackendStatus *beentry = BackendStatusArray;
+
+ /*
+ * Loop through live backends and capture reset values
+ */
+ for (int i = 0; i < MaxBackends + NUM_AUXPROCTYPES; i++, beentry++)
+ {
+ int idx;
+
+ /* Don't count dead backends or those with type B_INVALID. */
+ if (beentry->st_procpid == 0 || beentry->st_backendType == B_INVALID)
+ continue;
+
+ idx = backend_type_get_idx(beentry->st_backendType);
+ pgstat_sum_io_path_ops(backend_io_path_ops[idx].io_path_ops,
+ (IOOpCounters *) beentry->io_path_stats);
+ }
+}
+
/* --------
* pgstat_report_query_id() -
*
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 5c6ec5b9ad..c7c304c4d8 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -142,6 +142,7 @@ typedef enum PgStat_Shared_Reset_Target
{
RESET_ARCHIVER,
RESET_BGWRITER,
+ RESET_BUFFERS,
RESET_WAL
} PgStat_Shared_Reset_Target;
@@ -357,7 +358,8 @@ typedef struct PgStatIOPathOps
/*
* Sent by a backend to the stats collector to report all IOOps for all IOPaths
- * for a given type of a backend. This will happen when the backend exits.
+ * for a given type of a backend. This will happen when the backend exits or
+ * when stats are reset.
*/
typedef struct PgStat_MsgIOPathOps
{
@@ -369,15 +371,18 @@ typedef struct PgStat_MsgIOPathOps
/*
* Structure used by stats collector to keep track of all types of exited
- * backends' IO Ops for all IO Paths as well as all stats from live backends at
+ * backends' IOOps for all IOPaths as well as all stats from live backends at
* the time of stats reset. resets is populated using a reset message sent to
* the stats collector.
*/
typedef struct PgStat_BackendIOPathOps
{
+ TimestampTz stat_reset_timestamp;
PgStatIOPathOps ops[BACKEND_NUM_TYPES];
+ PgStatIOPathOps resets[BACKEND_NUM_TYPES];
} PgStat_BackendIOPathOps;
+
/* ----------
* PgStat_MsgResetcounter Sent by the backend to tell the collector
* to reset counters
@@ -389,15 +394,28 @@ typedef struct PgStat_MsgResetcounter
Oid m_databaseid;
} PgStat_MsgResetcounter;
-/* ----------
- * PgStat_MsgResetsharedcounter Sent by the backend to tell the collector
- * to reset a shared counter
- * ----------
+/*
+ * Sent by the backend to tell the collector to reset a shared counter.
+ *
+ * In addition to the message header and reset target, the message also
+ * contains an array with all of the IO operations for all IO paths done by a
+ * particular backend type.
+ *
+ * This is needed because the IO operation stats for live backends cannot be
+ * safely modified by other processes. Therefore, to correctly calculate the
+ * total IO operations for a particular backend type after a reset, the balance
+ * of IO operations for live backends at the time of prior resets must be
+ * subtracted from the total IO operations.
+ *
+ * To satisfy this requirement the process initiating the reset will read the
+ * IO operations from live backends and send them to the stats collector which
+ * maintains an array of reset values.
*/
typedef struct PgStat_MsgResetsharedcounter
{
PgStat_MsgHdr m_hdr;
PgStat_Shared_Reset_Target m_resettarget;
+ PgStat_MsgIOPathOps m_backend_resets;
} PgStat_MsgResetsharedcounter;
/* ----------
diff --git a/src/include/utils/backend_status.h b/src/include/utils/backend_status.h
index 0ee450cda1..55e30022f6 100644
--- a/src/include/utils/backend_status.h
+++ b/src/include/utils/backend_status.h
@@ -342,6 +342,7 @@ extern void pgstat_bestart(void);
extern void pgstat_clear_backend_activity_snapshot(void);
/* Activity reporting functions */
+typedef struct PgStatIOPathOps PgStatIOPathOps;
static inline void
pgstat_inc_ioop(IOOp io_op, IOPath io_path)
@@ -369,6 +370,7 @@ pgstat_inc_ioop(IOOp io_op, IOPath io_path)
}
}
extern void pgstat_report_activity(BackendState state, const char *cmd_str);
+extern void pgstat_report_live_backend_io_path_ops(PgStatIOPathOps *backend_io_path_ops);
extern void pgstat_report_query_id(uint64 query_id, bool force);
extern void pgstat_report_tempfile(size_t filesize);
extern void pgstat_report_appname(const char *appname);
--
2.30.2
Attachment: v18-0007-Remove-superfluous-bgwriter-stats.patch (text/x-patch)
From baae117915069f3807c67c98b649dbeab71fc47c Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 24 Nov 2021 12:20:10 -0500
Subject: [PATCH v18 7/8] Remove superfluous bgwriter stats
Remove stats from pg_stat_bgwriter which are now more clearly expressed
in pg_stat_buffers. Rename pg_stat_bgwriter to pg_stat_checkpointer,
since most of the remaining stats concern the checkpointer.
TODO:
- make pg_stat_bgwriter view again and move maxwritten_clean into it
- add additional stats to pg_stat_bgwriter
---
doc/src/sgml/monitoring.sgml | 69 +++++----------------------
src/backend/catalog/system_views.sql | 11 ++---
src/backend/postmaster/checkpointer.c | 29 +----------
src/backend/postmaster/pgstat.c | 13 ++---
src/backend/storage/buffer/bufmgr.c | 6 ---
src/backend/utils/adt/pgstatfuncs.c | 34 +------------
src/include/catalog/pg_proc.dat | 34 +++----------
src/include/pgstat.h | 12 +----
src/test/regress/expected/rules.out | 17 +++----
9 files changed, 35 insertions(+), 190 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index b16952d439..412f0d2502 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -427,11 +427,11 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</row>
<row>
- <entry><structname>pg_stat_bgwriter</structname><indexterm><primary>pg_stat_bgwriter</primary></indexterm></entry>
+ <entry><structname>pg_stat_checkpointer</structname><indexterm><primary>pg_stat_checkpointer</primary></indexterm></entry>
<entry>One row only, showing statistics about the
background writer process's activity. See
- <link linkend="monitoring-pg-stat-bgwriter-view">
- <structname>pg_stat_bgwriter</structname></link> for details.
+ <link linkend="monitoring-pg-stat-checkpointer-view">
+ <structname>pg_stat_checkpointer</structname></link> for details.
</entry>
</row>
@@ -3485,20 +3485,20 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
</sect2>
- <sect2 id="monitoring-pg-stat-bgwriter-view">
- <title><structname>pg_stat_bgwriter</structname></title>
+ <sect2 id="monitoring-pg-stat-checkpointer-view">
+ <title><structname>pg_stat_checkpointer</structname></title>
<indexterm>
- <primary>pg_stat_bgwriter</primary>
+ <primary>pg_stat_checkpointer</primary>
</indexterm>
<para>
- The <structname>pg_stat_bgwriter</structname> view will always have a
+ The <structname>pg_stat_checkpointer</structname> view will always have a
single row, containing global data for the cluster.
</para>
- <table id="pg-stat-bgwriter-view" xreflabel="pg_stat_bgwriter">
- <title><structname>pg_stat_bgwriter</structname> View</title>
+ <table id="pg-stat-checkpointer-view" xreflabel="pg_stat_checkpointer">
+ <title><structname>pg_stat_checkpointer</structname> View</title>
<tgroup cols="1">
<thead>
<row>
@@ -3551,24 +3551,6 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
</para></entry>
</row>
- <row>
- <entry role="catalog_table_entry"><para role="column_definition">
- <structfield>buffers_checkpoint</structfield> <type>bigint</type>
- </para>
- <para>
- Number of buffers written during checkpoints
- </para></entry>
- </row>
-
- <row>
- <entry role="catalog_table_entry"><para role="column_definition">
- <structfield>buffers_clean</structfield> <type>bigint</type>
- </para>
- <para>
- Number of buffers written by the background writer
- </para></entry>
- </row>
-
<row>
<entry role="catalog_table_entry"><para role="column_definition">
<structfield>maxwritten_clean</structfield> <type>bigint</type>
@@ -3579,35 +3561,6 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
</para></entry>
</row>
- <row>
- <entry role="catalog_table_entry"><para role="column_definition">
- <structfield>buffers_backend</structfield> <type>bigint</type>
- </para>
- <para>
- Number of buffers written directly by a backend
- </para></entry>
- </row>
-
- <row>
- <entry role="catalog_table_entry"><para role="column_definition">
- <structfield>buffers_backend_fsync</structfield> <type>bigint</type>
- </para>
- <para>
- Number of times a backend had to execute its own
- <function>fsync</function> call (normally the background writer handles those
- even when the backend does its own write)
- </para></entry>
- </row>
-
- <row>
- <entry role="catalog_table_entry"><para role="column_definition">
- <structfield>buffers_alloc</structfield> <type>bigint</type>
- </para>
- <para>
- Number of buffers allocated
- </para></entry>
- </row>
-
<row>
<entry role="catalog_table_entry"><para role="column_definition">
<structfield>stats_reset</structfield> <type>timestamp with time zone</type>
@@ -5313,9 +5266,9 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
</para>
<para>
Resets some cluster-wide statistics counters to zero, depending on the
- argument. The argument can be <literal>bgwriter</literal> to reset
+ argument. The argument can be <literal>checkpointer</literal> to reset
all the counters shown in
- the <structname>pg_stat_bgwriter</structname>
+ the <structname>pg_stat_checkpointer</structname>
view, <literal>archiver</literal> to reset all the counters shown in
the <structname>pg_stat_archiver</structname> view, <literal>wal</literal>
to reset all the counters shown in the <structname>pg_stat_wal</structname> view,
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index e214d23056..6258c2770c 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1062,18 +1062,13 @@ CREATE VIEW pg_stat_archiver AS
s.stats_reset
FROM pg_stat_get_archiver() s;
-CREATE VIEW pg_stat_bgwriter AS
+CREATE VIEW pg_stat_checkpointer AS
SELECT
- pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints_timed,
- pg_stat_get_bgwriter_requested_checkpoints() AS checkpoints_req,
+ pg_stat_get_timed_checkpoints() AS checkpoints_timed,
+ pg_stat_get_requested_checkpoints() AS checkpoints_req,
pg_stat_get_checkpoint_write_time() AS checkpoint_write_time,
pg_stat_get_checkpoint_sync_time() AS checkpoint_sync_time,
- pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint,
- pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean,
pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
- pg_stat_get_buf_written_backend() AS buffers_backend,
- pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
- pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
CREATE VIEW pg_stat_buffers AS
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 8440b2b802..b9c3745474 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -90,17 +90,9 @@
* requesting backends since the last checkpoint start. The flags are
* chosen so that OR'ing is the correct way to combine multiple requests.
*
- * num_backend_writes is used to count the number of buffer writes performed
- * by user backend processes. This counter should be wide enough that it
- * can't overflow during a single processing cycle. num_backend_fsync
- * counts the subset of those writes that also had to do their own fsync,
- * because the checkpointer failed to absorb their request.
- *
* The requests array holds fsync requests sent by backends and not yet
* absorbed by the checkpointer.
*
- * Unlike the checkpoint fields, num_backend_writes, num_backend_fsync, and
- * the requests fields are protected by CheckpointerCommLock.
*----------
*/
typedef struct
@@ -124,9 +116,6 @@ typedef struct
ConditionVariable start_cv; /* signaled when ckpt_started advances */
ConditionVariable done_cv; /* signaled when ckpt_done advances */
- uint32 num_backend_writes; /* counts user backend buffer writes */
- uint32 num_backend_fsync; /* counts user backend fsync calls */
-
int num_requests; /* current # of requests */
int max_requests; /* allocated array size */
CheckpointerRequest requests[FLEXIBLE_ARRAY_MEMBER];
@@ -1081,10 +1070,6 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
- /* Count all backend writes regardless of if they fit in the queue */
- if (!AmBackgroundWriterProcess())
- CheckpointerShmem->num_backend_writes++;
-
/*
* If the checkpointer isn't running or the request queue is full, the
* backend will have to perform its own fsync request. But before forcing
@@ -1094,13 +1079,12 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
(CheckpointerShmem->num_requests >= CheckpointerShmem->max_requests &&
!CompactCheckpointerRequestQueue()))
{
+ LWLockRelease(CheckpointerCommLock);
+
/*
* Count the subset of writes where backends have to do their own
* fsync
*/
- if (!AmBackgroundWriterProcess())
- CheckpointerShmem->num_backend_fsync++;
- LWLockRelease(CheckpointerCommLock);
pgstat_inc_ioop(IOOP_FSYNC, IOPATH_SHARED);
return false;
}
@@ -1257,15 +1241,6 @@ AbsorbSyncRequests(void)
LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
- /* Transfer stats counts into pending pgstats message */
- PendingCheckpointerStats.m_buf_written_backend
- += CheckpointerShmem->num_backend_writes;
- PendingCheckpointerStats.m_buf_fsync_backend
- += CheckpointerShmem->num_backend_fsync;
-
- CheckpointerShmem->num_backend_writes = 0;
- CheckpointerShmem->num_backend_fsync = 0;
-
/*
* We try to avoid holding the lock for a long time by copying the request
* array, and processing the requests after releasing the lock.
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 3545e197ce..13e46f497f 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1601,8 +1601,8 @@ pgstat_reset_shared_counters(const char *target)
}
else if (strcmp(target, "archiver") == 0)
msg.m_resettarget = RESET_ARCHIVER;
- else if (strcmp(target, "bgwriter") == 0)
- msg.m_resettarget = RESET_BGWRITER;
+ else if (strcmp(target, "checkpointer") == 0)
+ msg.m_resettarget = RESET_CHECKPOINTER;
else if (strcmp(target, "wal") == 0)
msg.m_resettarget = RESET_WAL;
else
@@ -1610,7 +1610,7 @@ pgstat_reset_shared_counters(const char *target)
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("unrecognized reset target: \"%s\"", target),
errhint(
- "Target must be \"archiver\", \"bgwriter\", \"buffers\", or \"wal\".")));
+ "Target must be \"archiver\", \"checkpointer\", \"buffers\", or \"wal\".")));
pgstat_send(&msg, sizeof(msg));
@@ -5679,7 +5679,7 @@ pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len)
static void
pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
{
- if (msg->m_resettarget == RESET_BGWRITER)
+ if (msg->m_resettarget == RESET_CHECKPOINTER)
{
/*
* Reset the global bgwriter and checkpointer statistics for the
@@ -5985,9 +5985,7 @@ pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
static void
pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
{
- globalStats.bgwriter.buf_written_clean += msg->m_buf_written_clean;
globalStats.bgwriter.maxwritten_clean += msg->m_maxwritten_clean;
- globalStats.bgwriter.buf_alloc += msg->m_buf_alloc;
}
/* ----------
@@ -6003,9 +6001,6 @@ pgstat_recv_checkpointer(PgStat_MsgCheckpointer *msg, int len)
globalStats.checkpointer.requested_checkpoints += msg->m_requested_checkpoints;
globalStats.checkpointer.checkpoint_write_time += msg->m_checkpoint_write_time;
globalStats.checkpointer.checkpoint_sync_time += msg->m_checkpoint_sync_time;
- globalStats.checkpointer.buf_written_checkpoints += msg->m_buf_written_checkpoints;
- globalStats.checkpointer.buf_written_backend += msg->m_buf_written_backend;
- globalStats.checkpointer.buf_fsync_backend += msg->m_buf_fsync_backend;
}
static void
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 949f1088b6..8120567de1 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2164,7 +2164,6 @@ BufferSync(int flags)
if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
{
TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
- PendingCheckpointerStats.m_buf_written_checkpoints++;
num_written++;
}
}
@@ -2273,9 +2272,6 @@ BgBufferSync(WritebackContext *wb_context)
*/
strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
- /* Report buffer alloc counts to pgstat */
- PendingBgWriterStats.m_buf_alloc += recent_alloc;
-
/*
* If we're not running the LRU scan, just stop after doing the stats
* stuff. We mark the saved state invalid so that we can recover sanely
@@ -2472,8 +2468,6 @@ BgBufferSync(WritebackContext *wb_context)
reusable_buffers++;
}
- PendingBgWriterStats.m_buf_written_clean += num_written;
-
#ifdef BGW_DEBUG
elog(DEBUG1, "bgwriter: recent_alloc=%u smoothed=%.2f delta=%ld ahead=%d density=%.2f reusable_est=%d upcoming_est=%d scanned=%d wrote=%d reusable=%d",
recent_alloc, smoothed_alloc, strategy_delta, bufs_ahead,
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 26730230b6..52500c8af4 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1727,29 +1727,17 @@ pg_stat_get_db_sessions_killed(PG_FUNCTION_ARGS)
}
Datum
-pg_stat_get_bgwriter_timed_checkpoints(PG_FUNCTION_ARGS)
+pg_stat_get_timed_checkpoints(PG_FUNCTION_ARGS)
{
PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->timed_checkpoints);
}
Datum
-pg_stat_get_bgwriter_requested_checkpoints(PG_FUNCTION_ARGS)
+pg_stat_get_requested_checkpoints(PG_FUNCTION_ARGS)
{
PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->requested_checkpoints);
}
-Datum
-pg_stat_get_bgwriter_buf_written_checkpoints(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_written_checkpoints);
-}
-
-Datum
-pg_stat_get_bgwriter_buf_written_clean(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_written_clean);
-}
-
Datum
pg_stat_get_bgwriter_maxwritten_clean(PG_FUNCTION_ARGS)
{
@@ -1778,24 +1766,6 @@ pg_stat_get_bgwriter_stat_reset_time(PG_FUNCTION_ARGS)
PG_RETURN_TIMESTAMPTZ(pgstat_fetch_stat_bgwriter()->stat_reset_timestamp);
}
-Datum
-pg_stat_get_buf_written_backend(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_written_backend);
-}
-
-Datum
-pg_stat_get_buf_fsync_backend(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_fsync_backend);
-}
-
-Datum
-pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
-}
-
/*
* When adding a new column to the pg_stat_buffers view, add a new enum
* value here above BUFFERS_NUM_COLUMNS.
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 148208d242..293df23040 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5585,25 +5585,15 @@
proargnames => '{archived_count,last_archived_wal,last_archived_time,failed_count,last_failed_wal,last_failed_time,stats_reset}',
prosrc => 'pg_stat_get_archiver' },
{ oid => '2769',
- descr => 'statistics: number of timed checkpoints started by the bgwriter',
- proname => 'pg_stat_get_bgwriter_timed_checkpoints', provolatile => 's',
+ descr => 'statistics: number of scheduled checkpoints performed',
+ proname => 'pg_stat_get_timed_checkpoints', provolatile => 's',
proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_bgwriter_timed_checkpoints' },
+ prosrc => 'pg_stat_get_timed_checkpoints' },
{ oid => '2770',
- descr => 'statistics: number of backend requested checkpoints started by the bgwriter',
- proname => 'pg_stat_get_bgwriter_requested_checkpoints', provolatile => 's',
+ descr => 'statistics: number of backend requested checkpoints performed',
+ proname => 'pg_stat_get_requested_checkpoints', provolatile => 's',
proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_bgwriter_requested_checkpoints' },
-{ oid => '2771',
- descr => 'statistics: number of buffers written by the bgwriter during checkpoints',
- proname => 'pg_stat_get_bgwriter_buf_written_checkpoints', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_bgwriter_buf_written_checkpoints' },
-{ oid => '2772',
- descr => 'statistics: number of buffers written by the bgwriter for cleaning dirty buffers',
- proname => 'pg_stat_get_bgwriter_buf_written_clean', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_bgwriter_buf_written_clean' },
+ prosrc => 'pg_stat_get_requested_checkpoints' },
{ oid => '2773',
descr => 'statistics: number of times the bgwriter stopped processing when it had written too many buffers while cleaning',
proname => 'pg_stat_get_bgwriter_maxwritten_clean', provolatile => 's',
@@ -5623,18 +5613,6 @@
proname => 'pg_stat_get_checkpoint_sync_time', provolatile => 's',
proparallel => 'r', prorettype => 'float8', proargtypes => '',
prosrc => 'pg_stat_get_checkpoint_sync_time' },
-{ oid => '2775', descr => 'statistics: number of buffers written by backends',
- proname => 'pg_stat_get_buf_written_backend', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_buf_written_backend' },
-{ oid => '3063',
- descr => 'statistics: number of backend buffer writes that did their own fsync',
- proname => 'pg_stat_get_buf_fsync_backend', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_buf_fsync_backend' },
-{ oid => '2859', descr => 'statistics: number of buffer allocations',
- proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
- prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
{ oid => '8459', descr => 'statistics: counts of all IO operations done to all IO paths by each type of backend.',
proname => 'pg_stat_get_buffers', provolatile => 's', proisstrict => 'f',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 8e07584767..020cd4ca87 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -141,7 +141,7 @@ typedef struct PgStat_TableCounts
typedef enum PgStat_Shared_Reset_Target
{
RESET_ARCHIVER,
- RESET_BGWRITER,
+ RESET_CHECKPOINTER,
RESET_BUFFERS,
RESET_WAL
} PgStat_Shared_Reset_Target;
@@ -523,9 +523,7 @@ typedef struct PgStat_MsgBgWriter
{
PgStat_MsgHdr m_hdr;
- PgStat_Counter m_buf_written_clean;
PgStat_Counter m_maxwritten_clean;
- PgStat_Counter m_buf_alloc;
} PgStat_MsgBgWriter;
/* ----------
@@ -538,9 +536,6 @@ typedef struct PgStat_MsgCheckpointer
PgStat_Counter m_timed_checkpoints;
PgStat_Counter m_requested_checkpoints;
- PgStat_Counter m_buf_written_checkpoints;
- PgStat_Counter m_buf_written_backend;
- PgStat_Counter m_buf_fsync_backend;
PgStat_Counter m_checkpoint_write_time; /* times in milliseconds */
PgStat_Counter m_checkpoint_sync_time;
} PgStat_MsgCheckpointer;
@@ -971,9 +966,7 @@ typedef struct PgStat_ArchiverStats
*/
typedef struct PgStat_BgWriterStats
{
- PgStat_Counter buf_written_clean;
PgStat_Counter maxwritten_clean;
- PgStat_Counter buf_alloc;
TimestampTz stat_reset_timestamp;
} PgStat_BgWriterStats;
@@ -987,9 +980,6 @@ typedef struct PgStat_CheckpointerStats
PgStat_Counter requested_checkpoints;
PgStat_Counter checkpoint_write_time; /* times in milliseconds */
PgStat_Counter checkpoint_sync_time;
- PgStat_Counter buf_written_checkpoints;
- PgStat_Counter buf_written_backend;
- PgStat_Counter buf_fsync_backend;
} PgStat_CheckpointerStats;
/*
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 5869ce442f..13f48722c3 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1817,17 +1817,6 @@ pg_stat_archiver| SELECT s.archived_count,
s.last_failed_time,
s.stats_reset
FROM pg_stat_get_archiver() s(archived_count, last_archived_wal, last_archived_time, failed_count, last_failed_wal, last_failed_time, stats_reset);
-pg_stat_bgwriter| SELECT pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints_timed,
- pg_stat_get_bgwriter_requested_checkpoints() AS checkpoints_req,
- pg_stat_get_checkpoint_write_time() AS checkpoint_write_time,
- pg_stat_get_checkpoint_sync_time() AS checkpoint_sync_time,
- pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint,
- pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean,
- pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
- pg_stat_get_buf_written_backend() AS buffers_backend,
- pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
- pg_stat_get_buf_alloc() AS buffers_alloc,
- pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
pg_stat_buffers| SELECT b.backend_type,
b.io_path,
b.alloc,
@@ -1836,6 +1825,12 @@ pg_stat_buffers| SELECT b.backend_type,
b.write,
b.stats_reset
FROM pg_stat_get_buffers() b(backend_type, io_path, alloc, extend, fsync, write, stats_reset);
+pg_stat_checkpointer| SELECT pg_stat_get_timed_checkpoints() AS checkpoints_timed,
+ pg_stat_get_requested_checkpoints() AS checkpoints_req,
+ pg_stat_get_checkpoint_write_time() AS checkpoint_write_time,
+ pg_stat_get_checkpoint_sync_time() AS checkpoint_sync_time,
+ pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
+ pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
pg_stat_database| SELECT d.oid AS datid,
d.datname,
CASE
--
2.30.2
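The split of checkpointer counters out of `PgStat_BgWriterStats` above follows the collector's usual pattern: counters accumulate in a process-local "pending" struct, get shipped as a message, and the collector adds them into its global totals. A toy model of that flow (illustrative types only, not PostgreSQL code) might look like:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy model of the pending-stats -> message -> global-accumulate pattern
 * used by pgstat_send_checkpointer() and pgstat_recv_checkpointer().
 * Struct and function names here are illustrative, not the real ones. */
typedef struct ToyCheckpointerStats
{
	uint64_t	timed_checkpoints;
	uint64_t	requested_checkpoints;
} ToyCheckpointerStats;

static ToyCheckpointerStats pending;	/* local to the checkpointer */
static ToyCheckpointerStats global;		/* kept by the stats collector */

/* analogous to pgstat_send_checkpointer(): ship pending stats, then reset */
static void
toy_send_checkpointer(void)
{
	ToyCheckpointerStats msg = pending; /* "send" the message */

	/* collector side: always add, never overwrite */
	global.timed_checkpoints += msg.timed_checkpoints;
	global.requested_checkpoints += msg.requested_checkpoints;

	memset(&pending, 0, sizeof(pending));	/* start a fresh interval */
}
```

Because the collector only ever adds, a message lost in transit loses one interval's worth of counts but never corrupts the running totals.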
v18-0004-Send-IO-operations-to-stats-collector.patch
From e26f274748e3857515e476d43370df4376ca770c Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 24 Nov 2021 11:16:55 -0500
Subject: [PATCH v18 4/8] Send IO operations to stats collector
On exit, backends send the IO operations they have done on all IO Paths
to the stats collector. The stats collector adds these counts to its
existing counts stored in a global data structure it maintains and
persists.
PgStatIOOpCounters contains the same information as backend_status.h's
IOOpCounters, however IOOpCounters' members must be atomics and the
stats collector has no such requirement.
Suggested by Andres Freund
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
src/backend/postmaster/pgstat.c | 143 ++++++++++++++++++++++++++++-
src/include/miscadmin.h | 2 +
src/include/pgstat.h | 56 +++++++++++
src/include/utils/backend_status.h | 4 +
4 files changed, 202 insertions(+), 3 deletions(-)
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 7264d2c727..2102c0be98 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -126,9 +126,12 @@ char *pgstat_stat_filename = NULL;
char *pgstat_stat_tmpname = NULL;
/*
- * BgWriter and WAL global statistics counters.
- * Stored directly in a stats message structure so they can be sent
- * without needing to copy things around. We assume these init to zeroes.
+ * BgWriter, Checkpointer, WAL, and IO global statistics counters. IO global
+ * statistics on various IO operations are tracked in PgBackendStatus while a
+ * backend is alive and then sent to stats collector before a backend exits in
+ * a PgStat_MsgIOPathOps.
+ * All others are stored directly in a stats message structure so they can be
+ * sent without needing to copy things around. We assume these init to zeroes.
*/
PgStat_MsgBgWriter PendingBgWriterStats;
PgStat_MsgCheckpointer PendingCheckpointerStats;
@@ -369,6 +372,7 @@ static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
static void pgstat_recv_checkpointer(PgStat_MsgCheckpointer *msg, int len);
+static void pgstat_recv_io_path_ops(PgStat_MsgIOPathOps *msg, int len);
static void pgstat_recv_wal(PgStat_MsgWal *msg, int len);
static void pgstat_recv_slru(PgStat_MsgSLRU *msg, int len);
static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
@@ -1508,6 +1512,39 @@ pgstat_reset_counters(void)
pgstat_send(&msg, sizeof(msg));
}
+/*
+ * When maintaining an array of information about all valid BackendTypes, in
+ * order to avoid wasting the 0th spot, use this helper to convert a valid
+ * BackendType to a valid location in the array (given that no spot is
+ * maintained for B_INVALID BackendType).
+ */
+int backend_type_get_idx(BackendType backend_type)
+{
+ /*
+ * backend_type must be one of the valid backend types. If caller is
+ * maintaining backend information in an array that includes B_INVALID,
+ * this function is unnecessary.
+ */
+ Assert(backend_type > B_INVALID && backend_type <= BACKEND_NUM_TYPES);
+ return backend_type - 1;
+}
+
+/*
+ * When using a value from an array of information about all valid
+ * BackendTypes, add 1 to the index before using it as a BackendType to adjust
+ * for not maintaining a spot for B_INVALID BackendType.
+ */
+BackendType idx_get_backend_type(int idx)
+{
+ int backend_type = idx + 1;
+ /*
+ * If the array includes a spot for B_INVALID BackendType this function is
+ * not required.
+ */
+ Assert(backend_type > B_INVALID && backend_type <= BACKEND_NUM_TYPES);
+ return backend_type;
+}
+
/* ----------
* pgstat_reset_shared_counters() -
*
@@ -3152,6 +3189,14 @@ pgstat_shutdown_hook(int code, Datum arg)
{
Assert(!pgstat_is_shutdown);
+ /*
+ /*
+ * Only need to send stats on IOOps for IOPaths when a process exits. Users
+ * requiring IOOps for both live and exited backends can read from live
+ * backends' PgBackendStatus entries and sum those with the totals from
+ * exited backends persisted by the stats collector.
+ */
+ pgstat_send_buffers();
+
/*
* If we got as far as discovering our own database ID, we can report what
* we did to the collector. Otherwise, we'd be sending an invalid
@@ -3301,6 +3346,49 @@ pgstat_send_bgwriter(void)
MemSet(&PendingBgWriterStats, 0, sizeof(PendingBgWriterStats));
}
+/*
+ * Before exiting, a backend sends its IO operations statistics to the
+ * collector so that they may be persisted.
+ */
+void
+pgstat_send_buffers(void)
+{
+ PgStatIOOpCounters *io_path_ops;
+ PgStat_MsgIOPathOps msg;
+
+ PgBackendStatus *beentry = MyBEEntry;
+ PgStat_Counter sum = 0;
+
+ if (!beentry || beentry->st_backendType == B_INVALID)
+ return;
+
+ memset(&msg, 0, sizeof(msg));
+ msg.backend_type = beentry->st_backendType;
+
+ io_path_ops = msg.iop.io_path_ops;
+ pgstat_sum_io_path_ops(io_path_ops, (IOOpCounters *)
+ &beentry->io_path_stats);
+
+ /*
+ * Check if no IO was done. If so, don't bother sending anything to the
+ * stats collector.
+ */
+ for (int i = 0; i < IOPATH_NUM_TYPES; i++)
+ {
+ sum += io_path_ops[i].allocs;
+ sum += io_path_ops[i].extends;
+ sum += io_path_ops[i].fsyncs;
+ sum += io_path_ops[i].writes;
+ }
+
+ if (sum == 0)
+ return;
+
+ pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_IO_PATH_OPS);
+ pgstat_send(&msg, sizeof(msg));
+}
+
+
/* ----------
* pgstat_send_checkpointer() -
*
@@ -3483,6 +3571,29 @@ pgstat_send_subscription_purge(PgStat_MsgSubscriptionPurge *msg)
pgstat_send(msg, len);
}
+/*
+ * Helper function to sum IO operation stats for all IOPaths (e.g. shared,
+ * local) from a live backend's counters into the equivalent stats structure
+ * used for exited backends.
+ * Note that this adds and doesn't set, so the destination stats structure
+ * should be zeroed out by the caller initially.
+ * This would commonly be used to transfer all IOOp stats for all IOPaths for a
+ * particular backend type to the pgstats structure.
+ */
+void
+pgstat_sum_io_path_ops(PgStatIOOpCounters *dest, IOOpCounters *src)
+{
+ for (int i = 0; i < IOPATH_NUM_TYPES; i++)
+ {
+ dest->allocs += pg_atomic_read_u64(&src->allocs);
+ dest->extends += pg_atomic_read_u64(&src->extends);
+ dest->fsyncs += pg_atomic_read_u64(&src->fsyncs);
+ dest->writes += pg_atomic_read_u64(&src->writes);
+ dest++;
+ src++;
+ }
+}
+
/* ----------
* PgstatCollectorMain() -
*
@@ -3692,6 +3803,10 @@ PgstatCollectorMain(int argc, char *argv[])
pgstat_recv_checkpointer(&msg.msg_checkpointer, len);
break;
+ case PGSTAT_MTYPE_IO_PATH_OPS:
+ pgstat_recv_io_path_ops(&msg.msg_io_path_ops, len);
+ break;
+
case PGSTAT_MTYPE_WAL:
pgstat_recv_wal(&msg.msg_wal, len);
break;
@@ -5813,6 +5928,28 @@ pgstat_recv_checkpointer(PgStat_MsgCheckpointer *msg, int len)
globalStats.checkpointer.buf_fsync_backend += msg->m_buf_fsync_backend;
}
+static void
+pgstat_recv_io_path_ops(PgStat_MsgIOPathOps *msg, int len)
+{
+ PgStatIOOpCounters *src_io_path_ops;
+ PgStatIOOpCounters *dest_io_path_ops;
+
+ src_io_path_ops = msg->iop.io_path_ops;
+ dest_io_path_ops =
+ globalStats.buffers.ops[backend_type_get_idx(msg->backend_type)].io_path_ops;
+
+ for (int i = 0; i < IOPATH_NUM_TYPES; i++)
+ {
+ PgStatIOOpCounters *src = &src_io_path_ops[i];
+ PgStatIOOpCounters *dest = &dest_io_path_ops[i];
+
+ dest->allocs += src->allocs;
+ dest->extends += src->extends;
+ dest->fsyncs += src->fsyncs;
+ dest->writes += src->writes;
+ }
+}
+
/* ----------
* pgstat_recv_wal() -
*
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 90a3016065..662170c72e 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -338,6 +338,8 @@ typedef enum BackendType
B_LOGGER,
} BackendType;
+#define BACKEND_NUM_TYPES B_LOGGER
+
extern BackendType MyBackendType;
extern const char *GetBackendTypeDesc(BackendType backendType);
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 5b51b58e5a..5c6ec5b9ad 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -73,6 +73,7 @@ typedef enum StatMsgType
PGSTAT_MTYPE_ARCHIVER,
PGSTAT_MTYPE_BGWRITER,
PGSTAT_MTYPE_CHECKPOINTER,
+ PGSTAT_MTYPE_IO_PATH_OPS,
PGSTAT_MTYPE_WAL,
PGSTAT_MTYPE_SLRU,
PGSTAT_MTYPE_FUNCSTAT,
@@ -335,6 +336,48 @@ typedef struct PgStat_MsgDropdb
} PgStat_MsgDropdb;
+/*
+ * Structure for counting all types of IOOps in the stats collector
+ */
+typedef struct PgStatIOOpCounters
+{
+ PgStat_Counter allocs;
+ PgStat_Counter extends;
+ PgStat_Counter fsyncs;
+ PgStat_Counter writes;
+} PgStatIOOpCounters;
+
+/*
+ * Structure for counting all IOOps on all types of IOPaths.
+ */
+typedef struct PgStatIOPathOps
+{
+ PgStatIOOpCounters io_path_ops[IOPATH_NUM_TYPES];
+} PgStatIOPathOps;
+
+/*
+ * Sent by a backend to the stats collector to report all IOOps for all IOPaths
+ * for a given backend type. This happens when the backend exits.
+ */
+typedef struct PgStat_MsgIOPathOps
+{
+ PgStat_MsgHdr m_hdr;
+
+ BackendType backend_type;
+ PgStatIOPathOps iop;
+} PgStat_MsgIOPathOps;
+
+/*
+ * Structure used by the stats collector to keep track of IO Ops on all IO
+ * Paths for all types of exited backends, as well as stats from live backends
+ * captured at the time of a stats reset. resets is populated using a reset
+ * message sent to the stats collector.
+ */
+typedef struct PgStat_BackendIOPathOps
+{
+ PgStatIOPathOps ops[BACKEND_NUM_TYPES];
+} PgStat_BackendIOPathOps;
+
/* ----------
* PgStat_MsgResetcounter Sent by the backend to tell the collector
* to reset counters
@@ -756,6 +799,7 @@ typedef union PgStat_Msg
PgStat_MsgArchiver msg_archiver;
PgStat_MsgBgWriter msg_bgwriter;
PgStat_MsgCheckpointer msg_checkpointer;
+ PgStat_MsgIOPathOps msg_io_path_ops;
PgStat_MsgWal msg_wal;
PgStat_MsgSLRU msg_slru;
PgStat_MsgFuncstat msg_funcstat;
@@ -939,6 +983,7 @@ typedef struct PgStat_GlobalStats
PgStat_CheckpointerStats checkpointer;
PgStat_BgWriterStats bgwriter;
+ PgStat_BackendIOPathOps buffers;
} PgStat_GlobalStats;
/*
@@ -1215,8 +1260,19 @@ extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
extern void pgstat_send_archiver(const char *xlog, bool failed);
extern void pgstat_send_bgwriter(void);
+/*
+ * While some processes send some types of statistics to the collector at
+ * regular intervals (e.g. CheckpointerMain() calling
+ * pgstat_send_checkpointer()), IO operations stats are only sent by
+ * pgstat_send_buffers() when a process exits (in pgstat_shutdown_hook()). IO
+ * operations stats from live backends can be read from their PgBackendStatuses
+ * and, if desired, summed with totals from exited backends persisted by the
+ * stats collector.
+ */
+extern void pgstat_send_buffers(void);
extern void pgstat_send_checkpointer(void);
extern void pgstat_send_wal(bool force);
+extern void pgstat_sum_io_path_ops(PgStatIOOpCounters *dest, IOOpCounters *src);
/* ----------
* Support functions for the SQL-callable functions to
diff --git a/src/include/utils/backend_status.h b/src/include/utils/backend_status.h
index 56a0f25296..0ee450cda1 100644
--- a/src/include/utils/backend_status.h
+++ b/src/include/utils/backend_status.h
@@ -331,6 +331,10 @@ extern void CreateSharedBackendStatus(void);
* ----------
*/
+/* Utility functions */
+extern int backend_type_get_idx(BackendType backend_type);
+extern BackendType idx_get_backend_type(int idx);
+
/* Initialization functions */
extern void pgstat_beinit(void);
extern void pgstat_bestart(void);
--
2.30.2
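The `B_INVALID`-skipping index mapping that `backend_type_get_idx()` and `idx_get_backend_type()` implement in the patch above can be sketched in isolation like this (the enum is a stand-in; the real `BackendType` in miscadmin.h has more members):

```c
#include <assert.h>

/* Standalone sketch of the index mapping from patch 0004. The enum
 * values here are illustrative stand-ins, not the real BackendType. */
typedef enum ToyBackendType
{
	TOY_B_INVALID = 0,
	TOY_B_BACKEND,
	TOY_B_AUTOVAC_WORKER,
	TOY_B_CHECKPOINTER,			/* last valid type */
} ToyBackendType;

/* mirrors #define BACKEND_NUM_TYPES B_LOGGER: count of valid types,
 * because B_INVALID occupies value 0 and gets no array slot */
#define TOY_BACKEND_NUM_TYPES TOY_B_CHECKPOINTER

/* mirror of backend_type_get_idx(): valid types map to 0..NUM_TYPES-1 */
static int
toy_backend_type_get_idx(ToyBackendType t)
{
	assert(t > TOY_B_INVALID && t <= TOY_BACKEND_NUM_TYPES);
	return (int) t - 1;
}

/* mirror of idx_get_backend_type(): the inverse mapping */
static ToyBackendType
toy_idx_get_backend_type(int idx)
{
	ToyBackendType t = (ToyBackendType) (idx + 1);

	assert(t > TOY_B_INVALID && t <= TOY_BACKEND_NUM_TYPES);
	return t;
}
```

The round trip `idx -> type -> idx` is the identity for every valid slot, which is what lets arrays like `globalStats.buffers.ops[]` be sized `BACKEND_NUM_TYPES` instead of `BACKEND_NUM_TYPES + 1`.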
v18-0006-Add-system-view-tracking-IO-ops-per-backend-type.patch
From 015b4e3b25aea17c9f8143fce7bb8488cfa12511 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 24 Nov 2021 12:07:37 -0500
Subject: [PATCH v18 6/8] Add system view tracking IO ops per backend type
Add pg_stat_buffers, a system view which tracks the number of IO
operations (allocs, writes, fsyncs, and extends) done through each IO
path (e.g. shared buffers, local buffers, unbuffered IO) by each type of
backend.
Some of these should always be zero. For example, checkpointer does not
use a BufferAccessStrategy (currently), so the "strategy" IO path for
checkpointer will be 0 for all IO operations (alloc, write, fsync, and
extend). All possible combinations of IO Path and IO Op are enumerated
in the view but not all are populated or even possible at this point.
All backends increment a counter in their PgBackendStatus when
performing an IO operation. On exit, backends send these stats to the
stats collector to be persisted.
When the pg_stat_buffers view is queried, one backend will sum live
backends' stats with saved stats from exited backends and subtract saved
reset stats, returning the total.
Each row of the view is stats for a particular backend type for a
particular IO path (e.g. shared buffer accesses by checkpointer) and
each column in the view is the total number of IO operations done (e.g.
writes).
So a cell in the view would be, for example, the number of shared
buffers written by checkpointer since the last stats reset.
Suggested by Andres Freund
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
doc/src/sgml/monitoring.sgml | 112 ++++++++++++++-
src/backend/catalog/system_views.sql | 11 ++
src/backend/postmaster/pgstat.c | 13 ++
src/backend/utils/activity/backend_status.c | 19 ++-
src/backend/utils/adt/pgstatfuncs.c | 150 ++++++++++++++++++++
src/include/catalog/pg_proc.dat | 9 ++
src/include/pgstat.h | 1 +
src/include/utils/backend_status.h | 2 +
src/test/regress/expected/rules.out | 8 ++
9 files changed, 321 insertions(+), 4 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 62f2a3332b..b16952d439 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -435,6 +435,15 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</entry>
</row>
+ <row>
+ <entry><structname>pg_stat_buffers</structname><indexterm><primary>pg_stat_buffers</primary></indexterm></entry>
+ <entry>A row for each IO path for each backend type showing
+ statistics about backend IO operations. See
+ <link linkend="monitoring-pg-stat-buffers-view">
+ <structname>pg_stat_buffers</structname></link> for details.
+ </entry>
+ </row>
+
<row>
<entry><structname>pg_stat_wal</structname><indexterm><primary>pg_stat_wal</primary></indexterm></entry>
<entry>One row only, showing statistics about WAL activity. See
@@ -3604,7 +3613,102 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
<structfield>stats_reset</structfield> <type>timestamp with time zone</type>
</para>
<para>
- Time at which these statistics were last reset
+ Time at which these statistics were last reset.
+ </para></entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ </sect2>
+
+ <sect2 id="monitoring-pg-stat-buffers-view">
+ <title><structname>pg_stat_buffers</structname></title>
+
+ <indexterm>
+ <primary>pg_stat_buffers</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_stat_buffers</structname> view has a row for each backend
+ type for each possible IO path, containing cluster-wide data for that
+ backend type and IO path.
+ </para>
+
+ <table id="pg-stat-buffers-view" xreflabel="pg_stat_buffers">
+ <title><structname>pg_stat_buffers</structname> View</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>backend_type</structfield> <type>text</type>
+ </para>
+ <para>
+ Type of backend (e.g. background worker, autovacuum worker).
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>io_path</structfield> <type>text</type>
+ </para>
+ <para>
+ IO path taken (e.g. shared buffers, direct).
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>alloc</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers allocated.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>extend</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers extended.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>fsync</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers fsynced.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>write</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers written.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
+ </para>
+ <para>
+ Time at which these statistics were last reset.
</para></entry>
</row>
</tbody>
@@ -5213,8 +5317,10 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
all the counters shown in
the <structname>pg_stat_bgwriter</structname>
view, <literal>archiver</literal> to reset all the counters shown in
- the <structname>pg_stat_archiver</structname> view or <literal>wal</literal>
- to reset all the counters shown in the <structname>pg_stat_wal</structname> view.
+ the <structname>pg_stat_archiver</structname> view, <literal>wal</literal>
+ to reset all the counters shown in the <structname>pg_stat_wal</structname> view,
+ or <literal>buffers</literal> to reset all the counters shown in the
+ <structname>pg_stat_buffers</structname> view.
</para>
<para>
This function is restricted to superusers by default, but other users
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 61b515cdb8..e214d23056 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1076,6 +1076,17 @@ CREATE VIEW pg_stat_bgwriter AS
pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+CREATE VIEW pg_stat_buffers AS
+SELECT
+ b.backend_type,
+ b.io_path,
+ b.alloc,
+ b.extend,
+ b.fsync,
+ b.write,
+ b.stats_reset
+FROM pg_stat_get_buffers() b;
+
CREATE VIEW pg_stat_wal AS
SELECT
w.wal_records,
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 8b8e3ccfcb..3545e197ce 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -2959,6 +2959,19 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
rec->tuples_inserted + rec->tuples_updated;
}
+/*
+ * Support function for SQL-callable pgstat* functions. Returns a pointer to
+ * the PgStat_BackendIOPathOps structure tracking IO operation statistics for
+ * exited backends as well as the snapshots used for reset arithmetic.
+ */
+PgStat_BackendIOPathOps *
+pgstat_fetch_exited_backend_buffers(void)
+{
+ backend_read_statsfile();
+
+ return &globalStats.buffers;
+}
+
/* ----------
* pgstat_fetch_stat_dbentry() -
diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index 9fb888f2ca..08e3f8f167 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -50,7 +50,7 @@ int pgstat_track_activity_query_size = 1024;
PgBackendStatus *MyBEEntry = NULL;
-static PgBackendStatus *BackendStatusArray = NULL;
+PgBackendStatus *BackendStatusArray = NULL;
static char *BackendAppnameBuffer = NULL;
static char *BackendClientHostnameBuffer = NULL;
static char *BackendActivityBuffer = NULL;
@@ -236,6 +236,23 @@ CreateSharedBackendStatus(void)
#endif
}
+const char *
+GetIOPathDesc(IOPath io_path)
+{
+ switch (io_path)
+ {
+ case IOPATH_DIRECT:
+ return "direct";
+ case IOPATH_LOCAL:
+ return "local";
+ case IOPATH_SHARED:
+ return "shared";
+ case IOPATH_STRATEGY:
+ return "strategy";
+ }
+ return "unknown IO path";
+}
+
/*
* Initialize pgstats backend activity state, and set up our on-proc-exit
* hook. Called from InitPostgres and AuxiliaryProcessMain. For auxiliary
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index f529c1561a..26730230b6 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1796,6 +1796,156 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
}
+/*
+ * When adding a new column to the pg_stat_buffers view, add a new enum
+ * value here, above BUFFERS_NUM_COLUMNS.
+ */
+enum
+{
+ BUFFERS_COLUMN_BACKEND_TYPE,
+ BUFFERS_COLUMN_IO_PATH,
+ BUFFERS_COLUMN_ALLOCS,
+ BUFFERS_COLUMN_EXTENDS,
+ BUFFERS_COLUMN_FSYNCS,
+ BUFFERS_COLUMN_WRITES,
+ BUFFERS_COLUMN_RESET_TIME,
+ BUFFERS_NUM_COLUMNS,
+};
+
+/*
+ * Helper function to get the correct row in the pg_stat_buffers view.
+ */
+static inline Datum *
+get_pg_stat_buffers_row(Datum all_values[BACKEND_NUM_TYPES][IOPATH_NUM_TYPES][BUFFERS_NUM_COLUMNS],
+ BackendType backend_type, IOPath io_path)
+{
+ return all_values[backend_type_get_idx(backend_type)][io_path];
+}
+
+Datum
+pg_stat_get_buffers(PG_FUNCTION_ARGS)
+{
+ PgStat_BackendIOPathOps *backend_io_path_ops;
+ PgBackendStatus *beentry;
+ Datum reset_time;
+
+ ReturnSetInfo *rsinfo;
+ TupleDesc tupdesc;
+ Tuplestorestate *tupstore;
+ MemoryContext per_query_ctx;
+ MemoryContext oldcontext;
+
+ Datum all_values[BACKEND_NUM_TYPES][IOPATH_NUM_TYPES][BUFFERS_NUM_COLUMNS];
+ bool all_nulls[BACKEND_NUM_TYPES][IOPATH_NUM_TYPES][BUFFERS_NUM_COLUMNS];
+
+ rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+
+ /* check to see if caller supports us returning a tuplestore */
+ if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("set-valued function called in context that cannot accept a set")));
+ if (!(rsinfo->allowedModes & SFRM_Materialize))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("materialize mode required, but it is not allowed in this context")));
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+ oldcontext = MemoryContextSwitchTo(per_query_ctx);
+ tupstore = tuplestore_begin_heap((bool) (rsinfo->allowedModes & SFRM_Materialize_Random),
+ false, work_mem);
+ rsinfo->returnMode = SFRM_Materialize;
+ rsinfo->setResult = tupstore;
+ rsinfo->setDesc = tupdesc;
+ MemoryContextSwitchTo(oldcontext);
+
+ memset(all_values, 0, sizeof(all_values));
+ memset(all_nulls, 0, sizeof(all_nulls));
+
+ /* Loop through all live backends and count their IO Ops for each IO Path */
+ beentry = BackendStatusArray;
+
+ for (int i = 0; i < MaxBackends + NUM_AUXPROCTYPES; i++, beentry++)
+ {
+ IOOpCounters *io_ops;
+
+ /*
+ * Don't count dead backends. They will be added below. There are no
+ * rows in the view for BackendType B_INVALID, so skip those as well.
+ */
+ if (beentry->st_procpid == 0 || beentry->st_backendType == B_INVALID)
+ continue;
+
+ io_ops = beentry->io_path_stats;
+
+ for (int j = 0; j < IOPATH_NUM_TYPES; j++)
+ {
+ Datum *row = get_pg_stat_buffers_row(all_values, beentry->st_backendType, j);
+
+ /*
+ * BUFFERS_COLUMN_RESET_TIME, BUFFERS_COLUMN_BACKEND_TYPE, and
+ * BUFFERS_COLUMN_IO_PATH will all be set when looping through
+ * exited backends array
+ */
+ row[BUFFERS_COLUMN_ALLOCS] += pg_atomic_read_u64(&io_ops->allocs);
+ row[BUFFERS_COLUMN_EXTENDS] += pg_atomic_read_u64(&io_ops->extends);
+ row[BUFFERS_COLUMN_FSYNCS] += pg_atomic_read_u64(&io_ops->fsyncs);
+ row[BUFFERS_COLUMN_WRITES] += pg_atomic_read_u64(&io_ops->writes);
+ io_ops++;
+ }
+ }
+
+ /* Add stats from all exited backends */
+ backend_io_path_ops = pgstat_fetch_exited_backend_buffers();
+
+ reset_time = TimestampTzGetDatum(backend_io_path_ops->stat_reset_timestamp);
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ BackendType backend_type = idx_get_backend_type(i);
+
+ PgStatIOOpCounters *io_ops =
+ backend_io_path_ops->ops[i].io_path_ops;
+ PgStatIOOpCounters *resets =
+ backend_io_path_ops->resets[i].io_path_ops;
+
+ Datum backend_type_desc =
+ CStringGetTextDatum(GetBackendTypeDesc(backend_type));
+
+ for (int j = 0; j < IOPATH_NUM_TYPES; j++)
+ {
+ Datum *row = get_pg_stat_buffers_row(all_values, backend_type, j);
+
+ row[BUFFERS_COLUMN_BACKEND_TYPE] = backend_type_desc;
+ row[BUFFERS_COLUMN_IO_PATH] = CStringGetTextDatum(GetIOPathDesc(j));
+ row[BUFFERS_COLUMN_RESET_TIME] = reset_time;
+ row[BUFFERS_COLUMN_ALLOCS] += io_ops->allocs - resets->allocs;
+ row[BUFFERS_COLUMN_EXTENDS] += io_ops->extends - resets->extends;
+ row[BUFFERS_COLUMN_FSYNCS] += io_ops->fsyncs - resets->fsyncs;
+ row[BUFFERS_COLUMN_WRITES] += io_ops->writes - resets->writes;
+ io_ops++;
+ resets++;
+ }
+ }
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ for (int j = 0; j < IOPATH_NUM_TYPES; j++)
+ {
+ Datum *values = all_values[i][j];
+ bool *nulls = all_nulls[i][j];
+
+ tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+ }
+ }
+
+ return (Datum) 0;
+}
+
/*
* Returns statistics of WAL activity
*/
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 4d992dc224..148208d242 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5636,6 +5636,15 @@
proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
+{ oid => '8459', descr => 'statistics: counts of all IO operations done on all IO paths by each type of backend',
+ proname => 'pg_stat_get_buffers', provolatile => 's', proisstrict => 'f',
+ prorows => '52', proretset => 't',
+ proparallel => 'r', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,text,int8,int8,int8,int8,timestamptz}',
+ proargmodes => '{o,o,o,o,o,o,o}',
+ proargnames => '{backend_type,io_path,alloc,extend,fsync,write,stats_reset}',
+ prosrc => 'pg_stat_get_buffers' },
+
{ oid => '1136', descr => 'statistics: information about WAL activity',
proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
proparallel => 'r', prorettype => 'record', proargtypes => '',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index c7c304c4d8..8e07584767 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1297,6 +1297,7 @@ extern void pgstat_sum_io_path_ops(PgStatIOOpCounters *dest, IOOpCounters *src);
* generate the pgstat* views.
* ----------
*/
+extern PgStat_BackendIOPathOps *pgstat_fetch_exited_backend_buffers(void);
extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid);
extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
diff --git a/src/include/utils/backend_status.h b/src/include/utils/backend_status.h
index 55e30022f6..a3e8a9f44e 100644
--- a/src/include/utils/backend_status.h
+++ b/src/include/utils/backend_status.h
@@ -316,6 +316,7 @@ extern PGDLLIMPORT int pgstat_track_activity_query_size;
* ----------
*/
extern PGDLLIMPORT PgBackendStatus *MyBEEntry;
+extern PGDLLIMPORT PgBackendStatus *BackendStatusArray;
/* ----------
@@ -334,6 +335,7 @@ extern void CreateSharedBackendStatus(void);
/* Utility functions */
extern int backend_type_get_idx(BackendType backend_type);
extern BackendType idx_get_backend_type(int idx);
+extern const char *GetIOPathDesc(IOPath io_path);
/* Initialization functions */
extern void pgstat_beinit(void);
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index b58b062b10..5869ce442f 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1828,6 +1828,14 @@ pg_stat_bgwriter| SELECT pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints
pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+pg_stat_buffers| SELECT b.backend_type,
+ b.io_path,
+ b.alloc,
+ b.extend,
+ b.fsync,
+ b.write,
+ b.stats_reset
+ FROM pg_stat_get_buffers() b(backend_type, io_path, alloc, extend, fsync, write, stats_reset);
pg_stat_database| SELECT d.oid AS datid,
d.datname,
CASE
--
2.30.2
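The arithmetic `pg_stat_get_buffers()` performs for each cell of the view (sum live backends' counters, add the collector's totals for exited backends, subtract the snapshot taken at the last reset) can be sketched with plain integers instead of `pg_atomic_uint64` (function name here is hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the per-cell arithmetic in patch 0006's pg_stat_get_buffers():
 * cell = sum(live backends) + exited totals - snapshot at last reset.
 * Plain uint64_t stands in for pg_atomic_uint64 / PgStat_Counter. */
static uint64_t
toy_view_cell(const uint64_t *live, int nlive,
			  uint64_t exited_total, uint64_t reset_snapshot)
{
	uint64_t	cell = exited_total - reset_snapshot;

	for (int i = 0; i < nlive; i++)
		cell += live[i];
	return cell;
}
```

Subtracting a reset snapshot rather than zeroing the collector's totals is what lets a reset take effect without a round trip to every live backend's `PgBackendStatus`.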
v18-0003-Add-IO-operation-counters-to-PgBackendStatus.patch
From d6521c85854e7c8254930ebf5911e50204b9d3b5 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 24 Nov 2021 10:32:56 -0500
Subject: [PATCH v18 3/8] Add IO operation counters to PgBackendStatus
Add an array of counters in PgBackendStatus which count the buffers
allocated, extended, fsynced, and written by a given backend. Each "IO
Op" (alloc, fsync, extend, write) is counted per "IO Path" (direct,
local, shared, or strategy). "local" and "shared" IO Path counters count
operations on local and shared buffers. The "strategy" IO Path counts
buffers alloc'd/extended/fsync'd/written as part of a BufferAccessStrategy.
The "direct" IO Path counts blocks of IO which are written, extended, or
fsync'd using smgrwrite/extend/immedsync directly (as opposed to through
[Local]BufferAlloc()).
With this commit, all backends increment a counter in their
PgBackendStatus when performing an IO operation. This is in preparation
for future commits which will persist these stats upon backend exit and
use the counters to provide observability of database IO operations.
Note that this commit does not add code to increment the "direct" path.
A future patch adding wrappers for smgrwrite(), smgrextend(), and
smgrimmedsync() would provide a good location to call pgstat_inc_ioop()
for unbuffered IO and avoid regressions for future users of these
functions.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
src/backend/postmaster/checkpointer.c | 1 +
src/backend/storage/buffer/bufmgr.c | 46 +++++++++++---
src/backend/storage/buffer/freelist.c | 22 ++++++-
src/backend/storage/buffer/localbuf.c | 3 +
src/backend/storage/sync/sync.c | 1 +
src/backend/utils/activity/backend_status.c | 10 +++
src/include/storage/buf_internals.h | 4 +-
src/include/utils/backend_status.h | 68 +++++++++++++++++++++
8 files changed, 140 insertions(+), 15 deletions(-)
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 25a18b7a14..8440b2b802 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -1101,6 +1101,7 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
if (!AmBackgroundWriterProcess())
CheckpointerShmem->num_backend_fsync++;
LWLockRelease(CheckpointerCommLock);
+ pgstat_inc_ioop(IOOP_FSYNC, IOPATH_SHARED);
return false;
}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 16de918e2e..949f1088b6 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -480,7 +480,7 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
BlockNumber blockNum,
BufferAccessStrategy strategy,
bool *foundPtr);
-static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+static void FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOPath iopath);
static void FindAndDropRelFileNodeBuffers(RelFileNode rnode,
ForkNumber forkNum,
BlockNumber nForkBlock,
@@ -972,6 +972,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (isExtend)
{
+ pgstat_inc_ioop(IOOP_EXTEND, IOPATH_SHARED);
/* new buffers are zero-filled */
MemSet((char *) bufBlock, 0, BLCKSZ);
/* don't set checksum for all-zero page */
@@ -1172,6 +1173,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
/* Loop here in case we have to try another victim buffer */
for (;;)
{
+ bool from_ring;
+
/*
* Ensure, while the spinlock's not yet held, that there's a free
* refcount entry.
@@ -1182,7 +1185,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* Select a victim buffer. The buffer is returned with its header
* spinlock still held!
*/
- buf = StrategyGetBuffer(strategy, &buf_state);
+ buf = StrategyGetBuffer(strategy, &buf_state, &from_ring);
Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
@@ -1219,6 +1222,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf),
LW_SHARED))
{
+ IOPath iopath;
+
/*
* If using a nondefault strategy, and writing the buffer
* would require a WAL flush, let the strategy decide whether
@@ -1236,7 +1241,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
UnlockBufHdr(buf, buf_state);
if (XLogNeedsFlush(lsn) &&
- StrategyRejectBuffer(strategy, buf))
+ StrategyRejectBuffer(strategy, buf, &from_ring))
{
/* Drop lock/pin and loop around for another buffer */
LWLockRelease(BufferDescriptorGetContentLock(buf));
@@ -1245,13 +1250,26 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
}
}
+ /*
+ * When a strategy is in use, if the dirty buffer was selected
+ * from the strategy ring and we did not bother checking the
+ * freelist or doing a clock sweep to look for a clean shared
+ * buffer to use, the write will be counted as a strategy
+ * write. However, if the dirty buffer was obtained from the
+ * freelist or a clock sweep, it is counted as a regular
+ * write. When a strategy is not in use, at this point, the
+ * write can only be a "regular" write of a dirty buffer.
+ */
+
+ iopath = from_ring ? IOPATH_STRATEGY : IOPATH_SHARED;
+
/* OK, do the I/O */
TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(forkNum, blockNum,
smgr->smgr_rnode.node.spcNode,
smgr->smgr_rnode.node.dbNode,
smgr->smgr_rnode.node.relNode);
- FlushBuffer(buf, NULL);
+ FlushBuffer(buf, NULL, iopath);
LWLockRelease(BufferDescriptorGetContentLock(buf));
ScheduleBufferTagForWriteback(&BackendWritebackContext,
@@ -2552,10 +2570,11 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
* Pin it, share-lock it, write it. (FlushBuffer will do nothing if the
* buffer is clean by the time we've locked it.)
*/
+
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOPATH_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
@@ -2803,9 +2822,12 @@ BufferGetTag(Buffer buffer, RelFileNode *rnode, ForkNumber *forknum,
*
* If the caller has an smgr reference for the buffer's relation, pass it
* as the second parameter. If not, pass NULL.
+ *
+ * IOPath will always be IOPATH_SHARED except when a buffer access strategy is
+ * used and the buffer being flushed is a buffer from the strategy ring.
*/
static void
-FlushBuffer(BufferDesc *buf, SMgrRelation reln)
+FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOPath iopath)
{
XLogRecPtr recptr;
ErrorContextCallback errcallback;
@@ -2897,6 +2919,8 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
bufToWrite,
false);
+ pgstat_inc_ioop(IOOP_WRITE, iopath);
+
if (track_io_timing)
{
INSTR_TIME_SET_CURRENT(io_time);
@@ -3533,6 +3557,8 @@ FlushRelationBuffers(Relation rel)
localpage,
false);
+ pgstat_inc_ioop(IOOP_WRITE, IOPATH_LOCAL);
+
buf_state &= ~(BM_DIRTY | BM_JUST_DIRTIED);
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
@@ -3568,7 +3594,7 @@ FlushRelationBuffers(Relation rel)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, RelationGetSmgr(rel));
+ FlushBuffer(bufHdr, RelationGetSmgr(rel), IOPATH_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
@@ -3664,7 +3690,7 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, srelent->srel);
+ FlushBuffer(bufHdr, srelent->srel, IOPATH_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
@@ -3720,7 +3746,7 @@ FlushDatabaseBuffers(Oid dbid)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOPATH_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
@@ -3747,7 +3773,7 @@ FlushOneBuffer(Buffer buffer)
Assert(LWLockHeldByMe(BufferDescriptorGetContentLock(bufHdr)));
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOPATH_SHARED);
}
/*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 6be80476db..45d73995b2 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -19,6 +19,7 @@
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/proc.h"
+#include "utils/backend_status.h"
#define INT_ACCESS_ONCE(var) ((int)(*((volatile int *)&(var))))
@@ -198,7 +199,7 @@ have_free_buffer(void)
* return the buffer with the buffer header spinlock still held.
*/
BufferDesc *
-StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
+StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_ring)
{
BufferDesc *buf;
int bgwprocno;
@@ -212,7 +213,8 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
if (strategy != NULL)
{
buf = GetBufferFromRing(strategy, buf_state);
- if (buf != NULL)
+ *from_ring = buf != NULL;
+ if (*from_ring)
return buf;
}
@@ -247,6 +249,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
* the rate of buffer consumption. Note that buffers recycled by a
* strategy object are intentionally not counted here.
*/
+ pgstat_inc_ioop(IOOP_ALLOC, IOPATH_SHARED);
pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);
/*
@@ -683,8 +686,14 @@ AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
* if this buffer should be written and re-used.
*/
bool
-StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
+StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool *from_ring)
{
+ /*
+ * Start by assuming that we will use the dirty buffer selected by
+ * StrategyGetBuffer().
+ */
+ *from_ring = true;
+
/* We only do this in bulkread mode */
if (strategy->btype != BAS_BULKREAD)
return false;
@@ -700,5 +709,12 @@ StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
*/
strategy->buffers[strategy->current] = InvalidBuffer;
+ /*
+ * Since we will not be writing out a dirty buffer from the ring, set
+ * from_ring to false so that the caller does not count this write as a
+ * "strategy write" and can do proper bookkeeping.
+ */
+ *from_ring = false;
+
return true;
}
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 04b3558ea3..f396a2b68d 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -20,6 +20,7 @@
#include "executor/instrument.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
+#include "utils/backend_status.h"
#include "utils/guc.h"
#include "utils/memutils.h"
#include "utils/resowner_private.h"
@@ -226,6 +227,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
localpage,
false);
+ pgstat_inc_ioop(IOOP_WRITE, IOPATH_LOCAL);
+
/* Mark not-dirty now in case we error out below */
buf_state &= ~BM_DIRTY;
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index d4083e8a56..3be06d5d5a 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -420,6 +420,7 @@ ProcessSyncRequests(void)
total_elapsed += elapsed;
processed++;
+ pgstat_inc_ioop(IOOP_FSYNC, IOPATH_SHARED);
if (log_checkpoints)
elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f ms",
processed,
diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index 7229598822..44c9a6e1a6 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -400,6 +400,16 @@ pgstat_bestart(void)
lbeentry.st_progress_command_target = InvalidOid;
lbeentry.st_query_id = UINT64CONST(0);
+ for (int i = 0; i < IOPATH_NUM_TYPES; i++)
+ {
+ IOOpCounters *io_ops = &lbeentry.io_path_stats[i];
+
+ pg_atomic_init_u64(&io_ops->allocs, 0);
+ pg_atomic_init_u64(&io_ops->extends, 0);
+ pg_atomic_init_u64(&io_ops->fsyncs, 0);
+ pg_atomic_init_u64(&io_ops->writes, 0);
+ }
+
/*
* we don't zero st_progress_param here to save cycles; nobody should
* examine it until st_progress_command has been set to something other
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 33fcaf5c9a..7e385135db 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -310,10 +310,10 @@ extern void ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *
/* freelist.c */
extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
- uint32 *buf_state);
+ uint32 *buf_state, bool *from_ring);
extern void StrategyFreeBuffer(BufferDesc *buf);
extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
- BufferDesc *buf);
+ BufferDesc *buf, bool *from_ring);
extern int StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
extern void StrategyNotifyBgWriter(int bgwprocno);
diff --git a/src/include/utils/backend_status.h b/src/include/utils/backend_status.h
index 8042b817df..56a0f25296 100644
--- a/src/include/utils/backend_status.h
+++ b/src/include/utils/backend_status.h
@@ -13,6 +13,7 @@
#include "datatype/timestamp.h"
#include "libpq/pqcomm.h"
#include "miscadmin.h" /* for BackendType */
+#include "port/atomics.h"
#include "utils/backend_progress.h"
@@ -31,12 +32,47 @@ typedef enum BackendState
STATE_DISABLED
} BackendState;
+/* ----------
+ * IO Stats reporting utility types
+ * ----------
+ */
+
+typedef enum IOOp
+{
+ IOOP_ALLOC,
+ IOOP_EXTEND,
+ IOOP_FSYNC,
+ IOOP_WRITE,
+} IOOp;
+
+#define IOOP_NUM_TYPES (IOOP_WRITE + 1)
+
+typedef enum IOPath
+{
+ IOPATH_DIRECT,
+ IOPATH_LOCAL,
+ IOPATH_SHARED,
+ IOPATH_STRATEGY,
+} IOPath;
+
+#define IOPATH_NUM_TYPES (IOPATH_STRATEGY + 1)
/* ----------
* Shared-memory data structures
* ----------
*/
+/*
+ * Structure for counting all types of IOOp for a live backend.
+ */
+typedef struct IOOpCounters
+{
+ pg_atomic_uint64 allocs;
+ pg_atomic_uint64 extends;
+ pg_atomic_uint64 fsyncs;
+ pg_atomic_uint64 writes;
+} IOOpCounters;
+
/*
* PgBackendSSLStatus
*
@@ -168,6 +204,12 @@ typedef struct PgBackendStatus
/* query identifier, optionally computed using post_parse_analyze_hook */
uint64 st_query_id;
+
+ /*
+ * Stats on all IOOps for all IOPaths for this backend. These should be
+ * incremented whenever an IO Operation is performed.
+ */
+ IOOpCounters io_path_stats[IOPATH_NUM_TYPES];
} PgBackendStatus;
@@ -296,6 +338,32 @@ extern void pgstat_bestart(void);
extern void pgstat_clear_backend_activity_snapshot(void);
/* Activity reporting functions */
+
+static inline void
+pgstat_inc_ioop(IOOp io_op, IOPath io_path)
+{
+ IOOpCounters *io_ops;
+ PgBackendStatus *beentry = MyBEEntry;
+
+ Assert(beentry);
+
+ io_ops = &beentry->io_path_stats[io_path];
+ switch (io_op)
+ {
+ case IOOP_ALLOC:
+ pg_atomic_unlocked_inc_counter(&io_ops->allocs);
+ break;
+ case IOOP_EXTEND:
+ pg_atomic_unlocked_inc_counter(&io_ops->extends);
+ break;
+ case IOOP_FSYNC:
+ pg_atomic_unlocked_inc_counter(&io_ops->fsyncs);
+ break;
+ case IOOP_WRITE:
+ pg_atomic_unlocked_inc_counter(&io_ops->writes);
+ break;
+ }
+}
extern void pgstat_report_activity(BackendState state, const char *cmd_str);
extern void pgstat_report_query_id(uint64 query_id, bool force);
extern void pgstat_report_tempfile(size_t filesize);
--
2.30.2
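As a rough illustration of the accounting scheme the patch sets up -- each backend keeps a small matrix of counters indexed by IO path and IO operation -- here is a minimal Python sketch. The enum members mirror the patch's IOOp/IOPath enums and pgstat_inc_ioop(); the sum_io_path_ops() aggregation step is an assumption about how view-level code (e.g. something backing pg_stat_buffers) might combine per-backend counters, not code from the patch itself:

```python
from enum import IntEnum


class IOOp(IntEnum):
    ALLOC = 0
    EXTEND = 1
    FSYNC = 2
    WRITE = 3


class IOPath(IntEnum):
    DIRECT = 0
    LOCAL = 1
    SHARED = 2
    STRATEGY = 3


class BackendStatus:
    """Per-backend counters, analogous to PgBackendStatus.io_path_stats."""

    def __init__(self):
        # One row of op counters per IO path.
        self.io_path_stats = [[0] * len(IOOp) for _ in IOPath]

    def inc_ioop(self, io_op: IOOp, io_path: IOPath):
        # In the patch this is a lock-free atomic increment on
        # pg_atomic_uint64; a plain integer add suffices for the sketch.
        self.io_path_stats[io_path][io_op] += 1


def sum_io_path_ops(backends):
    """Aggregate all backends' counters into one path-by-op matrix."""
    total = [[0] * len(IOOp) for _ in IOPath]
    for be in backends:
        for path in IOPath:
            for op in IOOp:
                total[path][op] += be.io_path_stats[path][op]
    return total


# A backend flushing a dirty shared buffer, then one taken from a
# strategy ring (the from_ring distinction made in BufferAlloc()):
be = BackendStatus()
be.inc_ioop(IOOp.WRITE, IOPath.SHARED)
be.inc_ioop(IOOp.WRITE, IOPath.STRATEGY)
totals = sum_io_path_ops([be])
```

This makes the key property of the design visible: strategy-ring writes and regular shared-buffer writes land in separate counters, so the "backends doing writes they shouldn't" question from the start of the thread becomes answerable.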
On Fri, Dec 03, 2021 at 03:02:24PM -0500, Melanie Plageman wrote:
Thanks again! I really appreciate the thorough review.
I have combined responses to all three of your emails below.
Let me know if it is more confusing to do it this way.
One email is better than three - I'm just not a model citizen ;)
Thanks for updating the patch. I checked that all my previous review comments
were addressed (except for the part about passing the 3D array to a function -
I know that technically the pointer is being passed).
+int backend_type_get_idx(BackendType backend_type)
+BackendType idx_get_backend_type(int idx)
=> I think it'd be desirable for these to be either static functions (which
won't work for your needs) or macros, or inline functions in the header.
- if (strcmp(target, "archiver") == 0)
+ pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
+ if (strcmp(target, "buffers") == 0)
=> This should be added in alphabetical order. Which is unimportant, but it
will also make the patch 2 lines shorter. The doc patch should also be in
order.
+ * Don't count dead backends. They will be added below There are no
=> Missing a period.
--
Justin
Hi,
On 2021-12-15 16:40:27 -0500, Melanie Plageman wrote:
+/*
+ * Before exiting, a backend sends its IO op statistics to the collector so
+ * that they may be persisted.
+ */
+void
+pgstat_send_buffers(void)
+{
+	PgStat_MsgIOPathOps msg;
+
+	PgBackendStatus *beentry = MyBEEntry;
+
+	/*
+	 * Though some backends with type B_INVALID (such as the single-user mode
+	 * process) do initialize and increment IO operations stats, there is no
+	 * spot in the array of IO operations for backends of type B_INVALID. As
+	 * such, do not send these to the stats collector.
+	 */
+	if (!beentry || beentry->st_backendType == B_INVALID)
+		return;

Why does single user mode use B_INVALID? That doesn't seem quite right.
I think PgBackendStatus->st_backendType is set from MyBackendType which
isn't set for the single user mode process. What BackendType would you
expect to see?
Either B_BACKEND or something new like B_SINGLE_USER_BACKEND?
I also thought about having pgstat_sum_io_path_ops() return a value to
indicate if everything was 0 -- which could be useful to future callers
potentially?

I didn't do this because I am not sure what the return value would be.
It could be a bool and be true if any IO was done and false if none was
done -- but that doesn't really make sense given the function's name. It
would be called like
if (!pgstat_sum_io_path_ops())
return
which I'm not sure is very clear
Yea, I think it's ok to not do something fancier here for now.
From 9f22da9041e1e1fbc0ef003f5f78f4e72274d438 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 24 Nov 2021 12:20:10 -0500
Subject: [PATCH v17 6/7] Remove superfluous bgwriter stats

Remove stats from pg_stat_bgwriter which are now more clearly expressed
in pg_stat_buffers.

TODO:
- make pg_stat_checkpointer view and move relevant stats into it
- add additional stats to pg_stat_bgwriter

When do you think it makes sense to tackle these wrt committing some of the
patches?

Well, the new stats are a superset of the old stats (no stats have been
removed that are not represented in the new or old views). So, I don't
see that as a blocker for committing these patches.
Since it is weird that pg_stat_bgwriter had mostly checkpointer stats,
I've edited this commit to rename that view to pg_stat_checkpointer.
I have not made a separate view just for maxwritten_clean (presumably
called pg_stat_bgwriter), but I would not be opposed to doing this if
you thought having a view with a single column isn't a problem (in the
event that we don't get around to adding more bgwriter stats right
away).
How about keeping old bgwriter values in place in the view, but generated
from the new stats stuff?
I noticed after changing the docs on the "bgwriter" target for
pg_stat_reset_shared to say "checkpointer", that it still said "bgwriter" in
src/backend/po/ko.po
src/backend/po/it.po
...
I presume these are automatically updated with some incantation, but I wasn't
sure what it was nor could I find documentation on this.
Yes, they are - and often some languages lag updating things. There's a bit
of docs at https://www.postgresql.org/docs/devel/nls.html
Greetings,
Andres Freund
On 2021-Dec-15, Melanie Plageman wrote:
I noticed after changing the docs on the "bgwriter" target for
pg_stat_reset_shared to say "checkpointer", that it still said "bgwriter" in
src/backend/po/ko.po
src/backend/po/it.po
...
I presume these are automatically updated with some incantation, but I wasn't
sure what it was nor could I find documentation on this.
Yes, feel free to ignore those files completely. They are updated using
an external workflow that you don't need to concern yourself with.
--
Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/
"World domination is proceeding according to plan" (Andrew Morton)
Combined responses to both Justin and Andres here.
v19 attached.
On Wed, Dec 15, 2021 at 5:38 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
+int backend_type_get_idx(BackendType backend_type)
+BackendType idx_get_backend_type(int idx)

=> I think it'd be desirable for these to be either static functions (which
won't work for your needs) or macros, or inline functions in the header.

Done

- if (strcmp(target, "archiver") == 0)
+ pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
+ if (strcmp(target, "buffers") == 0)
=> This should be added in alphabetical order. Which is unimportant, but it
will also make the patch 2 lines shorter. The doc patch should also be in
order.
Thanks for catching this.
I've corrected the order in most locations. The exception is in
pgstat_reset_shared_counters():
if (strcmp(target, "buffers") == 0)
{
msg.m_resettarget = RESET_BUFFERS;
pgstat_send_buffers_reset(&msg);
return;
}
Because "buffers" is a special case
which uses a different send function, I prefer to have it first.
+ * Don't count dead backends. They will be added below There are no
=> Missing a period.
Fixed.
On Thu, Dec 16, 2021 at 3:18 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2021-12-15 16:40:27 -0500, Melanie Plageman wrote:
+/*
+ * Before exiting, a backend sends its IO op statistics to the collector so
+ * that they may be persisted.
+ */
+void
+pgstat_send_buffers(void)
+{
+	PgStat_MsgIOPathOps msg;
+
+	PgBackendStatus *beentry = MyBEEntry;
+
+	/*
+	 * Though some backends with type B_INVALID (such as the single-user mode
+	 * process) do initialize and increment IO operations stats, there is no
+	 * spot in the array of IO operations for backends of type B_INVALID. As
+	 * such, do not send these to the stats collector.
+	 */
+	if (!beentry || beentry->st_backendType == B_INVALID)
+		return;

Why does single user mode use B_INVALID? That doesn't seem quite right.
I think PgBackendStatus->st_backendType is set from MyBackendType which
isn't set for the single user mode process. What BackendType would you
expect to see?

Either B_BACKEND or something new like B_SINGLE_USER_BACKEND?
I added B_STANDALONE_BACKEND and set it in InitStandaloneBackend() (as
opposed to in PostgresSingleUserMain()) so that the bootstrap process
could also use it.
From 9f22da9041e1e1fbc0ef003f5f78f4e72274d438 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 24 Nov 2021 12:20:10 -0500
Subject: [PATCH v17 6/7] Remove superfluous bgwriter stats

Remove stats from pg_stat_bgwriter which are now more clearly expressed
in pg_stat_buffers.

TODO:
- make pg_stat_checkpointer view and move relevant stats into it
- add additional stats to pg_stat_bgwriter

When do you think it makes sense to tackle these wrt committing some of the
patches?

Well, the new stats are a superset of the old stats (no stats have been
removed that are not represented in the new or old views). So, I don't
see that as a blocker for committing these patches.

Since it is weird that pg_stat_bgwriter had mostly checkpointer stats,
I've edited this commit to rename that view to pg_stat_checkpointer.

I have not made a separate view just for maxwritten_clean (presumably
called pg_stat_bgwriter), but I would not be opposed to doing this if
you thought having a view with a single column isn't a problem (in the
event that we don't get around to adding more bgwriter stats right
away).

How about keeping old bgwriter values in place in the view, but generated
from the new stats stuff?
I tried this, but I actually don't think it is the right way to go. In
order to maintain the old view with the new source code, I had to add
new code to maintain a separate resets array just for the bgwriter view.
It adds some fiddly code that will be annoying to maintain (the reset
logic is confusing enough as is).
And, besides the implementation complexity, if a user resets
pg_stat_bgwriter and not pg_stat_buffers (or vice versa), they will
see totally different numbers for "buffers_backend" in pg_stat_bgwriter
than shared buffers written by B_BACKEND in pg_stat_buffers. I would
find that confusing.
Instead, what I did was create the separate pg_stat_checkpointer view
and move most of the old pg_stat_bgwriter stats over there.
Because that left us with a pg_stat_bgwriter view with one column, I
added a few stats to it which could later be expanded.
In pg_stat_bgwriter, I renamed "maxwritten_clean" to "rounds_hit_limit".
I added "rounds_cleaned_estimate" and "rounds_lapped_clock" which are
the other two exit conditions from the LRU scan loop in BgBufferSync().
There are other stats related to bgwriter that might be more interesting
(e.g. number of times bgwriter was woken up to clean, % of time bgwriter
spends in hibernation vs cleaning, etc); however the stats I ended up
adding were available in the same scope as maxwritten_clean and seemed
like a non-intrusive way to start building out pg_stat_bgwriter.
BgBufferSync() has a *lot* of local variables that are all getting
incremented and reset in a complicated way, so I'm not 100% sure that
the new stats I added are actually correct.
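For intuition, the three exit conditions being counted can be sketched like this. The counter names are the ones proposed above; the condition logic is an illustrative assumption about the shape of the LRU scan loop, not the actual BgBufferSync() code:

```python
def classify_scan_exit(buffers_written, max_writes, scanned, nbuffers):
    """Classify why a hypothetical bgwriter cleaning round ended.

    buffers_written/max_writes model the bgwriter_lru_maxpages cap,
    scanned/nbuffers model one full lap of the clock sweep.
    """
    if buffers_written >= max_writes:
        # Stopped because the per-round write budget was exhausted
        # (the old maxwritten_clean, renamed rounds_hit_limit).
        return "rounds_hit_limit"
    if scanned >= nbuffers:
        # Swept every buffer without satisfying the estimate.
        return "rounds_lapped_clock"
    # Found enough clean/reusable buffers to meet the estimate.
    return "rounds_cleaned_estimate"
```

Seeing the conditions side by side also shows why correctness is hard to eyeball in the real function: the budget check and the lap check interact with several local counters that BgBufferSync() resets mid-loop.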
I noticed after changing the docs on the "bgwriter" target for
pg_stat_reset_shared to say "checkpointer", that it still said "bgwriter" in
src/backend/po/ko.po
src/backend/po/it.po
...
I presume these are automatically updated with some incantation, but I wasn't
sure what it was nor could I find documentation on this.

Yes, they are - and often some languages lag updating things. There's a bit
of docs at https://www.postgresql.org/docs/devel/nls.html
I noticed that the po files for pgstat.c are not updated (the msgid I am
concerned with has the old line number in pgstat.c and the old message).
So, I tried running `make update-po`, but it didn't have the documented
effect:
"to be called if the messages in the program source have changed, in
order to merge the changes into the existing .po files"
No po.new files were created.
I can look into it more, though if it is part of an external workflow, as
Álvaro suggested, perhaps I shouldn't?
- Melanie
Attachments:
v19-0007-Remove-superfluous-bgwriter-stats.patch (text/x-patch)
From 1900d7456f958d45c695ca2e848ba08d01df57de Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 24 Nov 2021 12:20:10 -0500
Subject: [PATCH v19 7/9] Remove superfluous bgwriter stats
- Remove stats from pg_stat_bgwriter which are now more clearly
expressed in pg_stat_buffers.
- Move checkpointer related stats to new view, pg_stat_checkpointer.
---
doc/src/sgml/monitoring.sgml | 146 ++++++++++++++------------
src/backend/catalog/system_views.sql | 18 ++--
src/backend/postmaster/checkpointer.c | 29 +----
src/backend/postmaster/pgstat.c | 32 +++---
src/backend/storage/buffer/bufmgr.c | 6 --
src/backend/utils/adt/pgstatfuncs.c | 44 ++------
src/include/catalog/pg_proc.dat | 47 +++------
src/include/pgstat.h | 12 +--
src/test/regress/expected/rules.out | 16 ++-
9 files changed, 136 insertions(+), 214 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 40884dbc27..84c37151b8 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -444,6 +444,15 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</entry>
</row>
+ <row>
+ <entry><structname>pg_stat_checkpointer</structname><indexterm><primary>pg_stat_checkpointer</primary></indexterm></entry>
+ <entry>One row only, showing statistics about the
+ checkpointer process's activity. See
+ <link linkend="monitoring-pg-stat-checkpointer-view">
+ <structname>pg_stat_checkpointer</structname></link> for details.
+ </entry>
+ </row>
+
<row>
<entry><structname>pg_stat_wal</structname><indexterm><primary>pg_stat_wal</primary></indexterm></entry>
<entry>One row only, showing statistics about WAL activity. See
@@ -3514,97 +3523,106 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
<tbody>
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>checkpoints_timed</structfield> <type>bigint</type>
+ <structfield>maxwritten_clean</structfield> <type>bigint</type>
</para>
<para>
- Number of scheduled checkpoints that have been performed
+ Number of times the background writer stopped a cleaning
+ scan because it had written too many buffers
</para></entry>
</row>
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>checkpoints_req</structfield> <type>bigint</type>
+ <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
</para>
<para>
- Number of requested checkpoints that have been performed
+ Time at which these statistics were last reset.
</para></entry>
</row>
+ </tbody>
+ </tgroup>
+ </table>
- <row>
- <entry role="catalog_table_entry"><para role="column_definition">
- <structfield>checkpoint_write_time</structfield> <type>double precision</type>
- </para>
- <para>
- Total amount of time that has been spent in the portion of
- checkpoint processing where files are written to disk, in milliseconds
- </para></entry>
- </row>
+ </sect2>
+
+ <sect2 id="monitoring-pg-stat-buffers-view">
+ <title><structname>pg_stat_buffers</structname></title>
+
+ <indexterm>
+ <primary>pg_stat_buffers</primary>
+ </indexterm>
+ <para>
+ The <structname>pg_stat_buffers</structname> view has a row for each backend
+ type for each possible IO path containing global data for the cluster for
+ that backend and IO path.
+ </para>
+
+ <table id="pg-stat-buffers-view" xreflabel="pg_stat_buffers">
+ <title><structname>pg_stat_buffers</structname> View</title>
+ <tgroup cols="1">
+ <thead>
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>checkpoint_sync_time</structfield> <type>double precision</type>
+ Column Type
</para>
<para>
- Total amount of time that has been spent in the portion of
- checkpoint processing where files are synchronized to disk, in
- milliseconds
+ Description
</para></entry>
</row>
-
+ </thead>
+ <tbody>
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>buffers_checkpoint</structfield> <type>bigint</type>
+ <structfield>backend_type</structfield> <type>text</type>
</para>
<para>
- Number of buffers written during checkpoints
+ Type of backend (e.g. background worker, autovacuum worker).
</para></entry>
</row>
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>buffers_clean</structfield> <type>bigint</type>
+ <structfield>io_path</structfield> <type>text</type>
</para>
<para>
- Number of buffers written by the background writer
+ IO path taken (e.g. shared buffers, direct).
</para></entry>
</row>
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>maxwritten_clean</structfield> <type>bigint</type>
+ <structfield>alloc</structfield> <type>bigint</type>
</para>
<para>
- Number of times the background writer stopped a cleaning
- scan because it had written too many buffers
+ Number of buffers allocated.
</para></entry>
</row>
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>buffers_backend</structfield> <type>bigint</type>
+ <structfield>extend</structfield> <type>bigint</type>
</para>
<para>
- Number of buffers written directly by a backend
+ Number of buffers extended.
</para></entry>
</row>
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>buffers_backend_fsync</structfield> <type>bigint</type>
+ <structfield>fsync</structfield> <type>bigint</type>
</para>
<para>
- Number of times a backend had to execute its own
- <function>fsync</function> call (normally the background writer handles those
- even when the backend does its own write)
+ Number of buffers fsynced.
</para></entry>
</row>
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>buffers_alloc</structfield> <type>bigint</type>
+ <structfield>write</structfield> <type>bigint</type>
</para>
<para>
- Number of buffers allocated
+ Number of buffers written.
</para></entry>
</row>
@@ -3622,21 +3640,20 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
</sect2>
- <sect2 id="monitoring-pg-stat-buffers-view">
- <title><structname>pg_stat_buffers</structname></title>
+ <sect2 id="monitoring-pg-stat-checkpointer-view">
+ <title><structname>pg_stat_checkpointer</structname></title>
<indexterm>
- <primary>pg_stat_buffers</primary>
+ <primary>pg_stat_checkpointer</primary>
</indexterm>
<para>
- The <structname>pg_stat_buffers</structname> view has a row for each backend
- type for each possible IO path containing global data for the cluster for
- that backend and IO path.
+ The <structname>pg_stat_checkpointer</structname> view will always have a
+ single row, containing global data for the cluster.
</para>
- <table id="pg-stat-buffers-view" xreflabel="pg_stat_buffers">
- <title><structname>pg_stat_buffers</structname> View</title>
+ <table id="pg-stat-checkpointer-view" xreflabel="pg_stat_checkpointer">
+ <title><structname>pg_stat_checkpointer</structname> View</title>
<tgroup cols="1">
<thead>
<row>
@@ -3648,58 +3665,44 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
</para></entry>
</row>
</thead>
- <tbody>
- <row>
- <entry role="catalog_table_entry"><para role="column_definition">
- <structfield>backend_type</structfield> <type>text</type>
- </para>
- <para>
- Type of backend (e.g. background worker, autovacuum worker).
- </para></entry>
- </row>
-
- <row>
- <entry role="catalog_table_entry"><para role="column_definition">
- <structfield>io_path</structfield> <type>text</type>
- </para>
- <para>
- IO path taken (e.g. shared buffers, direct).
- </para></entry>
- </row>
+ <tbody>
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>alloc</structfield> <type>bigint</type>
+ <structfield>checkpoints_timed</structfield> <type>bigint</type>
</para>
<para>
- Number of buffers allocated.
+ Number of scheduled checkpoints that have been performed
</para></entry>
</row>
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>extend</structfield> <type>bigint</type>
+ <structfield>checkpoints_req</structfield> <type>bigint</type>
</para>
<para>
- Number of buffers extended.
+ Number of requested checkpoints that have been performed
</para></entry>
</row>
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>fsync</structfield> <type>bigint</type>
+ <structfield>checkpoint_write_time</structfield> <type>double precision</type>
</para>
<para>
- Number of buffers fsynced.
+ Total amount of time that has been spent in the portion of
+ checkpoint processing where files are written to disk, in milliseconds
</para></entry>
</row>
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>write</structfield> <type>bigint</type>
+ <structfield>checkpoint_sync_time</structfield> <type>double precision</type>
</para>
<para>
- Number of buffers written.
+ Total amount of time that has been spent in the portion of
+ checkpoint processing where files are synchronized to disk, in
+ milliseconds
</para></entry>
</row>
@@ -5315,10 +5318,13 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
Resets some cluster-wide statistics counters to zero, depending on the
argument. The argument can be <literal>archiver</literal> to reset all
the counters shown in the <structname>pg_stat_archiver</structname>
- view, <literal>bgwriter</literal> to reset all the counters shown in
- the <structname>pg_stat_bgwriter</structname> view,
+ view,
+ <literal>bgwriter</literal> to reset all the counters shown in the
+ <structname>pg_stat_bgwriter</structname> view,
<literal>buffers</literal> to reset all the counters shown in the
- <structname>pg_stat_buffers</structname> view, or
+ <structname>pg_stat_buffers</structname> view,
+ <literal>checkpointer</literal> to reset all the counters shown in the
+ <structname>pg_stat_checkpointer</structname> view, or
<literal>wal</literal> to reset all the counters shown in the
<structname>pg_stat_wal</structname> view.
</para>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index e214d23056..0caf50421c 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1064,16 +1064,7 @@ CREATE VIEW pg_stat_archiver AS
CREATE VIEW pg_stat_bgwriter AS
SELECT
- pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints_timed,
- pg_stat_get_bgwriter_requested_checkpoints() AS checkpoints_req,
- pg_stat_get_checkpoint_write_time() AS checkpoint_write_time,
- pg_stat_get_checkpoint_sync_time() AS checkpoint_sync_time,
- pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint,
- pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean,
pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
- pg_stat_get_buf_written_backend() AS buffers_backend,
- pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
- pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
CREATE VIEW pg_stat_buffers AS
@@ -1087,6 +1078,15 @@ SELECT
b.stats_reset
FROM pg_stat_get_buffers() b;
+CREATE VIEW pg_stat_checkpointer AS
+ SELECT
+ pg_stat_get_timed_checkpoints() AS checkpoints_timed,
+ pg_stat_get_requested_checkpoints() AS checkpoints_req,
+ pg_stat_get_checkpoint_write_time() AS checkpoint_write_time,
+ pg_stat_get_checkpoint_sync_time() AS checkpoint_sync_time,
+ pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+
+
CREATE VIEW pg_stat_wal AS
SELECT
w.wal_records,
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 8440b2b802..b9c3745474 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -90,17 +90,9 @@
* requesting backends since the last checkpoint start. The flags are
* chosen so that OR'ing is the correct way to combine multiple requests.
*
- * num_backend_writes is used to count the number of buffer writes performed
- * by user backend processes. This counter should be wide enough that it
- * can't overflow during a single processing cycle. num_backend_fsync
- * counts the subset of those writes that also had to do their own fsync,
- * because the checkpointer failed to absorb their request.
- *
* The requests array holds fsync requests sent by backends and not yet
* absorbed by the checkpointer.
*
- * Unlike the checkpoint fields, num_backend_writes, num_backend_fsync, and
- * the requests fields are protected by CheckpointerCommLock.
*----------
*/
typedef struct
@@ -124,9 +116,6 @@ typedef struct
ConditionVariable start_cv; /* signaled when ckpt_started advances */
ConditionVariable done_cv; /* signaled when ckpt_done advances */
- uint32 num_backend_writes; /* counts user backend buffer writes */
- uint32 num_backend_fsync; /* counts user backend fsync calls */
-
int num_requests; /* current # of requests */
int max_requests; /* allocated array size */
CheckpointerRequest requests[FLEXIBLE_ARRAY_MEMBER];
@@ -1081,10 +1070,6 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
- /* Count all backend writes regardless of if they fit in the queue */
- if (!AmBackgroundWriterProcess())
- CheckpointerShmem->num_backend_writes++;
-
/*
* If the checkpointer isn't running or the request queue is full, the
* backend will have to perform its own fsync request. But before forcing
@@ -1094,13 +1079,12 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
(CheckpointerShmem->num_requests >= CheckpointerShmem->max_requests &&
!CompactCheckpointerRequestQueue()))
{
+ LWLockRelease(CheckpointerCommLock);
+
/*
* Count the subset of writes where backends have to do their own
* fsync
*/
- if (!AmBackgroundWriterProcess())
- CheckpointerShmem->num_backend_fsync++;
- LWLockRelease(CheckpointerCommLock);
pgstat_inc_ioop(IOOP_FSYNC, IOPATH_SHARED);
return false;
}
@@ -1257,15 +1241,6 @@ AbsorbSyncRequests(void)
LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
- /* Transfer stats counts into pending pgstats message */
- PendingCheckpointerStats.m_buf_written_backend
- += CheckpointerShmem->num_backend_writes;
- PendingCheckpointerStats.m_buf_fsync_backend
- += CheckpointerShmem->num_backend_fsync;
-
- CheckpointerShmem->num_backend_writes = 0;
- CheckpointerShmem->num_backend_fsync = 0;
-
/*
* We try to avoid holding the lock for a long time by copying the request
* array, and processing the requests after releasing the lock.
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 8ea024059e..a080ec78b4 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1568,6 +1568,8 @@ pgstat_reset_shared_counters(const char *target)
msg.m_resettarget = RESET_ARCHIVER;
else if (strcmp(target, "bgwriter") == 0)
msg.m_resettarget = RESET_BGWRITER;
+ else if (strcmp(target, "checkpointer") == 0)
+ msg.m_resettarget = RESET_CHECKPOINTER;
else if (strcmp(target, "wal") == 0)
msg.m_resettarget = RESET_WAL;
else
@@ -1575,7 +1577,8 @@ pgstat_reset_shared_counters(const char *target)
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("unrecognized reset target: \"%s\"", target),
errhint(
- "Target must be \"archiver\", \"bgwriter\", \"buffers\", or \"wal\".")));
+ "Target must be \"archiver\", \"bgwriter\", \"buffers\", "
+ "\"checkpointer\", or \"wal\".")));
pgstat_send(&msg, sizeof(msg));
}
@@ -5640,13 +5643,15 @@ pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len)
static void
pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
{
- if (msg->m_resettarget == RESET_BGWRITER)
+ if (msg->m_resettarget == RESET_ARCHIVER)
{
- /*
- * Reset the global bgwriter and checkpointer statistics for the
- * cluster.
- */
- memset(&globalStats.checkpointer, 0, sizeof(globalStats.checkpointer));
+ /* Reset the archiver statistics for the cluster. */
+ memset(&archiverStats, 0, sizeof(archiverStats));
+ archiverStats.stat_reset_timestamp = GetCurrentTimestamp();
+ }
+ else if (msg->m_resettarget == RESET_BGWRITER)
+ {
+ /* Reset the global bgwriter statistics for the cluster */
memset(&globalStats.bgwriter, 0, sizeof(globalStats.bgwriter));
globalStats.bgwriter.stat_reset_timestamp = GetCurrentTimestamp();
}
@@ -5674,11 +5679,11 @@ pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
&msg->m_backend_resets.iop.io_path_ops,
sizeof(msg->m_backend_resets.iop.io_path_ops));
}
- else if (msg->m_resettarget == RESET_ARCHIVER)
+ else if (msg->m_resettarget == RESET_CHECKPOINTER)
{
- /* Reset the archiver statistics for the cluster. */
- memset(&archiverStats, 0, sizeof(archiverStats));
- archiverStats.stat_reset_timestamp = GetCurrentTimestamp();
+ /* Reset the global checkpointer statistics for the cluster */
+ memset(&globalStats.checkpointer, 0, sizeof(globalStats.checkpointer));
+ globalStats.checkpointer.stat_reset_timestamp = GetCurrentTimestamp();
}
else if (msg->m_resettarget == RESET_WAL)
{
@@ -5946,9 +5951,7 @@ pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
static void
pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
{
- globalStats.bgwriter.buf_written_clean += msg->m_buf_written_clean;
globalStats.bgwriter.maxwritten_clean += msg->m_maxwritten_clean;
- globalStats.bgwriter.buf_alloc += msg->m_buf_alloc;
}
/* ----------
@@ -5964,9 +5967,6 @@ pgstat_recv_checkpointer(PgStat_MsgCheckpointer *msg, int len)
globalStats.checkpointer.requested_checkpoints += msg->m_requested_checkpoints;
globalStats.checkpointer.checkpoint_write_time += msg->m_checkpoint_write_time;
globalStats.checkpointer.checkpoint_sync_time += msg->m_checkpoint_sync_time;
- globalStats.checkpointer.buf_written_checkpoints += msg->m_buf_written_checkpoints;
- globalStats.checkpointer.buf_written_backend += msg->m_buf_written_backend;
- globalStats.checkpointer.buf_fsync_backend += msg->m_buf_fsync_backend;
}
static void
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 528995e7fb..6ec15ea00e 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2165,7 +2165,6 @@ BufferSync(int flags)
if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
{
TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
- PendingCheckpointerStats.m_buf_written_checkpoints++;
num_written++;
}
}
@@ -2274,9 +2273,6 @@ BgBufferSync(WritebackContext *wb_context)
*/
strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
- /* Report buffer alloc counts to pgstat */
- PendingBgWriterStats.m_buf_alloc += recent_alloc;
-
/*
* If we're not running the LRU scan, just stop after doing the stats
* stuff. We mark the saved state invalid so that we can recover sanely
@@ -2473,8 +2469,6 @@ BgBufferSync(WritebackContext *wb_context)
reusable_buffers++;
}
- PendingBgWriterStats.m_buf_written_clean += num_written;
-
#ifdef BGW_DEBUG
elog(DEBUG1, "bgwriter: recent_alloc=%u smoothed=%.2f delta=%ld ahead=%d density=%.2f reusable_est=%d upcoming_est=%d scanned=%d wrote=%d reusable=%d",
recent_alloc, smoothed_alloc, strategy_delta, bufs_ahead,
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index c16dc3ba9a..6ee9705ea3 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1727,33 +1727,27 @@ pg_stat_get_db_sessions_killed(PG_FUNCTION_ARGS)
}
Datum
-pg_stat_get_bgwriter_timed_checkpoints(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->timed_checkpoints);
-}
-
-Datum
-pg_stat_get_bgwriter_requested_checkpoints(PG_FUNCTION_ARGS)
+pg_stat_get_bgwriter_maxwritten_clean(PG_FUNCTION_ARGS)
{
- PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->requested_checkpoints);
+ PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->maxwritten_clean);
}
Datum
-pg_stat_get_bgwriter_buf_written_checkpoints(PG_FUNCTION_ARGS)
+pg_stat_get_bgwriter_stat_reset_time(PG_FUNCTION_ARGS)
{
- PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_written_checkpoints);
+ PG_RETURN_TIMESTAMPTZ(pgstat_fetch_stat_bgwriter()->stat_reset_timestamp);
}
Datum
-pg_stat_get_bgwriter_buf_written_clean(PG_FUNCTION_ARGS)
+pg_stat_get_timed_checkpoints(PG_FUNCTION_ARGS)
{
- PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_written_clean);
+ PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->timed_checkpoints);
}
Datum
-pg_stat_get_bgwriter_maxwritten_clean(PG_FUNCTION_ARGS)
+pg_stat_get_requested_checkpoints(PG_FUNCTION_ARGS)
{
- PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->maxwritten_clean);
+ PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->requested_checkpoints);
}
Datum
@@ -1773,27 +1767,9 @@ pg_stat_get_checkpoint_sync_time(PG_FUNCTION_ARGS)
}
Datum
-pg_stat_get_bgwriter_stat_reset_time(PG_FUNCTION_ARGS)
-{
- PG_RETURN_TIMESTAMPTZ(pgstat_fetch_stat_bgwriter()->stat_reset_timestamp);
-}
-
-Datum
-pg_stat_get_buf_written_backend(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_written_backend);
-}
-
-Datum
-pg_stat_get_buf_fsync_backend(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_fsync_backend);
-}
-
-Datum
-pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
+pg_stat_get_checkpointer_stat_reset_time(PG_FUNCTION_ARGS)
{
- PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
+ PG_RETURN_TIMESTAMPTZ(pgstat_fetch_stat_checkpointer()->stat_reset_timestamp);
}
/*
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 24d1fc29a2..d4911ada45 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5584,26 +5584,7 @@
proargmodes => '{o,o,o,o,o,o,o}',
proargnames => '{archived_count,last_archived_wal,last_archived_time,failed_count,last_failed_wal,last_failed_time,stats_reset}',
prosrc => 'pg_stat_get_archiver' },
-{ oid => '2769',
- descr => 'statistics: number of timed checkpoints started by the bgwriter',
- proname => 'pg_stat_get_bgwriter_timed_checkpoints', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_bgwriter_timed_checkpoints' },
-{ oid => '2770',
- descr => 'statistics: number of backend requested checkpoints started by the bgwriter',
- proname => 'pg_stat_get_bgwriter_requested_checkpoints', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_bgwriter_requested_checkpoints' },
-{ oid => '2771',
- descr => 'statistics: number of buffers written by the bgwriter during checkpoints',
- proname => 'pg_stat_get_bgwriter_buf_written_checkpoints', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_bgwriter_buf_written_checkpoints' },
-{ oid => '2772',
- descr => 'statistics: number of buffers written by the bgwriter for cleaning dirty buffers',
- proname => 'pg_stat_get_bgwriter_buf_written_clean', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_bgwriter_buf_written_clean' },
+
{ oid => '2773',
descr => 'statistics: number of times the bgwriter stopped processing when it had written too many buffers while cleaning',
proname => 'pg_stat_get_bgwriter_maxwritten_clean', provolatile => 's',
@@ -5613,6 +5594,16 @@
proname => 'pg_stat_get_bgwriter_stat_reset_time', provolatile => 's',
proparallel => 'r', prorettype => 'timestamptz', proargtypes => '',
prosrc => 'pg_stat_get_bgwriter_stat_reset_time' },
+{ oid => '2769',
+ descr => 'statistics: number of scheduled checkpoints performed',
+ proname => 'pg_stat_get_timed_checkpoints', provolatile => 's',
+ proparallel => 'r', prorettype => 'int8', proargtypes => '',
+ prosrc => 'pg_stat_get_timed_checkpoints' },
+{ oid => '2770',
+ descr => 'statistics: number of backend requested checkpoints performed',
+ proname => 'pg_stat_get_requested_checkpoints', provolatile => 's',
+ proparallel => 'r', prorettype => 'int8', proargtypes => '',
+ prosrc => 'pg_stat_get_requested_checkpoints' },
{ oid => '3160',
descr => 'statistics: checkpoint time spent writing buffers to disk, in milliseconds',
proname => 'pg_stat_get_checkpoint_write_time', provolatile => 's',
@@ -5623,18 +5614,10 @@
proname => 'pg_stat_get_checkpoint_sync_time', provolatile => 's',
proparallel => 'r', prorettype => 'float8', proargtypes => '',
prosrc => 'pg_stat_get_checkpoint_sync_time' },
-{ oid => '2775', descr => 'statistics: number of buffers written by backends',
- proname => 'pg_stat_get_buf_written_backend', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_buf_written_backend' },
-{ oid => '3063',
- descr => 'statistics: number of backend buffer writes that did their own fsync',
- proname => 'pg_stat_get_buf_fsync_backend', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_buf_fsync_backend' },
-{ oid => '2859', descr => 'statistics: number of buffer allocations',
- proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
- prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
+{ oid => '8460', descr => 'statistics: last reset for the checkpointer',
+ proname => 'pg_stat_get_checkpointer_stat_reset_time', provolatile => 's',
+ proparallel => 'r', prorettype => 'timestamptz', proargtypes => '',
+ prosrc => 'pg_stat_get_checkpointer_stat_reset_time' },
{ oid => '8459', descr => 'statistics: counts of all IO operations done through all IO paths by each type of backend.',
proname => 'pg_stat_get_buffers', provolatile => 's', proisstrict => 'f',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 9369e4a408..36d981c9ea 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -143,6 +143,7 @@ typedef enum PgStat_Shared_Reset_Target
RESET_ARCHIVER,
RESET_BGWRITER,
RESET_BUFFERS,
+ RESET_CHECKPOINTER,
RESET_WAL
} PgStat_Shared_Reset_Target;
@@ -523,9 +524,7 @@ typedef struct PgStat_MsgBgWriter
{
PgStat_MsgHdr m_hdr;
- PgStat_Counter m_buf_written_clean;
PgStat_Counter m_maxwritten_clean;
- PgStat_Counter m_buf_alloc;
} PgStat_MsgBgWriter;
/* ----------
@@ -538,9 +537,6 @@ typedef struct PgStat_MsgCheckpointer
PgStat_Counter m_timed_checkpoints;
PgStat_Counter m_requested_checkpoints;
- PgStat_Counter m_buf_written_checkpoints;
- PgStat_Counter m_buf_written_backend;
- PgStat_Counter m_buf_fsync_backend;
PgStat_Counter m_checkpoint_write_time; /* times in milliseconds */
PgStat_Counter m_checkpoint_sync_time;
} PgStat_MsgCheckpointer;
@@ -971,9 +967,7 @@ typedef struct PgStat_ArchiverStats
*/
typedef struct PgStat_BgWriterStats
{
- PgStat_Counter buf_written_clean;
PgStat_Counter maxwritten_clean;
- PgStat_Counter buf_alloc;
TimestampTz stat_reset_timestamp;
} PgStat_BgWriterStats;
@@ -987,9 +981,7 @@ typedef struct PgStat_CheckpointerStats
PgStat_Counter requested_checkpoints;
PgStat_Counter checkpoint_write_time; /* times in milliseconds */
PgStat_Counter checkpoint_sync_time;
- PgStat_Counter buf_written_checkpoints;
- PgStat_Counter buf_written_backend;
- PgStat_Counter buf_fsync_backend;
+ TimestampTz stat_reset_timestamp;
} PgStat_CheckpointerStats;
/*
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 5869ce442f..97af066c6b 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1817,16 +1817,7 @@ pg_stat_archiver| SELECT s.archived_count,
s.last_failed_time,
s.stats_reset
FROM pg_stat_get_archiver() s(archived_count, last_archived_wal, last_archived_time, failed_count, last_failed_wal, last_failed_time, stats_reset);
-pg_stat_bgwriter| SELECT pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints_timed,
- pg_stat_get_bgwriter_requested_checkpoints() AS checkpoints_req,
- pg_stat_get_checkpoint_write_time() AS checkpoint_write_time,
- pg_stat_get_checkpoint_sync_time() AS checkpoint_sync_time,
- pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint,
- pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean,
- pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
- pg_stat_get_buf_written_backend() AS buffers_backend,
- pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
- pg_stat_get_buf_alloc() AS buffers_alloc,
+pg_stat_bgwriter| SELECT pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
pg_stat_buffers| SELECT b.backend_type,
b.io_path,
@@ -1836,6 +1827,11 @@ pg_stat_buffers| SELECT b.backend_type,
b.write,
b.stats_reset
FROM pg_stat_get_buffers() b(backend_type, io_path, alloc, extend, fsync, write, stats_reset);
+pg_stat_checkpointer| SELECT pg_stat_get_timed_checkpoints() AS checkpoints_timed,
+ pg_stat_get_requested_checkpoints() AS checkpoints_req,
+ pg_stat_get_checkpoint_write_time() AS checkpoint_write_time,
+ pg_stat_get_checkpoint_sync_time() AS checkpoint_sync_time,
+ pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
pg_stat_database| SELECT d.oid AS datid,
d.datname,
CASE
--
2.30.2
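As a quick sanity check after applying the patch above, the relocated checkpoint counters can be queried from the new view. This is a sketch only — it assumes a server built with this patch applied, and the column set is exactly the one defined in the system_views.sql hunk:

```sql
-- Hypothetical session against a patched server: the checkpoint
-- counters now live in pg_stat_checkpointer rather than
-- pg_stat_bgwriter.
SELECT checkpoints_timed,
       checkpoints_req,
       checkpoint_write_time,
       checkpoint_sync_time,
       stats_reset
FROM pg_stat_checkpointer;

-- The checkpointer counters can now be reset independently of the
-- bgwriter counters, using the reset target added in this patch:
SELECT pg_stat_reset_shared('checkpointer');
```

Note that in this version of the patch, the view's stats_reset column is still backed by pg_stat_get_bgwriter_stat_reset_time() (see the system_views.sql and rules.out hunks), even though pg_stat_get_checkpointer_stat_reset_time() is added.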
Attachment: v19-0009-Add-stats-to-pg_stat_bgwriter.patch (text/x-patch)
From 88cf6d9237f98b55d75947700d541afaeb7ee4ca Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 20 Dec 2021 16:48:06 -0500
Subject: [PATCH v19 9/9] Add stats to pg_stat_bgwriter
Rename pg_stat_bgwriter stat "maxwritten_clean" to "rounds_hit_limit".
This stat tracks the number of rounds in which the bgwriter hit the
limit imposed by bgwriter_lru_maxpages on the number of pages that may
be cleaned in a single round.
Also, add two new stats to this view. The first,
"rounds_cleaned_estimate", is the number of rounds in which the bgwriter
was able to clean the number of buffers it estimated would be allocated
in the next cycle.
The second, "rounds_lapped_clock", is the number of rounds in which the
bgwriter lapped the freelist clock sweep scan.
Both of these are conditions to exit the LRU cleaning scan.
---
doc/src/sgml/monitoring.sgml | 22 +++++++++++++++++++++-
src/backend/catalog/system_views.sql | 4 +++-
src/backend/postmaster/pgstat.c | 5 ++++-
src/backend/storage/buffer/bufmgr.c | 21 +++++++++++++++------
src/backend/utils/adt/pgstatfuncs.c | 16 ++++++++++++++--
src/include/catalog/pg_proc.dat | 15 +++++++++++++--
src/include/pgstat.h | 8 ++++++--
src/test/regress/expected/rules.out | 4 +++-
8 files changed, 79 insertions(+), 16 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 84c37151b8..e029a596c9 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3523,7 +3523,17 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
<tbody>
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>maxwritten_clean</structfield> <type>bigint</type>
+ <structfield>rounds_cleaned_estimate</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of rounds in which the bgwriter cleaned the number of buffers it
+ estimated would be allocated in the next cycle.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>rounds_hit_limit</structfield> <type>bigint</type>
</para>
<para>
Number of times the background writer stopped a cleaning
@@ -3531,6 +3541,16 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
</para></entry>
</row>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>rounds_lapped_clock</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of rounds in which the bgwriter lapped the clock sweep scan while
+ cleaning buffers.
+ </para></entry>
+ </row>
+
<row>
<entry role="catalog_table_entry"><para role="column_definition">
<structfield>stats_reset</structfield> <type>timestamp with time zone</type>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 0caf50421c..8ac1988d21 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1064,7 +1064,9 @@ CREATE VIEW pg_stat_archiver AS
CREATE VIEW pg_stat_bgwriter AS
SELECT
- pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
+ pg_stat_get_bgwriter_rounds_cleaned_estimate() AS rounds_cleaned_estimate,
+ pg_stat_get_bgwriter_rounds_hit_limit() AS rounds_hit_limit,
+ pg_stat_get_bgwriter_rounds_lapped_clock() AS rounds_lapped_clock,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
CREATE VIEW pg_stat_buffers AS
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index a080ec78b4..a441216d25 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -5951,7 +5951,10 @@ pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
static void
pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
{
- globalStats.bgwriter.maxwritten_clean += msg->m_maxwritten_clean;
+ globalStats.bgwriter.rounds_cleaned_estimate +=
+ msg->m_rounds_cleaned_estimate;
+ globalStats.bgwriter.rounds_hit_limit += msg->m_rounds_hit_limit;
+ globalStats.bgwriter.rounds_lapped_clock += msg->m_rounds_lapped_clock;
}
/* ----------
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 6ec15ea00e..5830b971d6 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2454,20 +2454,29 @@ BgBufferSync(WritebackContext *wb_context)
next_to_clean = 0;
next_passes++;
}
- num_to_scan--;
- if (sync_state & BUF_WRITTEN)
+ if (--num_to_scan == 0)
+ PendingBgWriterStats.m_rounds_lapped_clock++;
+
+ if (sync_state & BUF_WRITTEN || sync_state & BUF_REUSABLE)
{
reusable_buffers++;
- if (++num_written >= bgwriter_lru_maxpages)
+ if (reusable_buffers >= upcoming_alloc_est)
+ PendingBgWriterStats.m_rounds_cleaned_estimate++;
+ }
+
+ if (sync_state & BUF_WRITTEN)
+ {
+ num_written++;
+ if (num_written >= bgwriter_lru_maxpages)
{
- PendingBgWriterStats.m_maxwritten_clean++;
+ PendingBgWriterStats.m_rounds_hit_limit++;
break;
}
}
- else if (sync_state & BUF_REUSABLE)
- reusable_buffers++;
+
}
+
#ifdef BGW_DEBUG
elog(DEBUG1, "bgwriter: recent_alloc=%u smoothed=%.2f delta=%ld ahead=%d density=%.2f reusable_est=%d upcoming_est=%d scanned=%d wrote=%d reusable=%d",
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 6ee9705ea3..eafcaf4abf 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1727,9 +1727,21 @@ pg_stat_get_db_sessions_killed(PG_FUNCTION_ARGS)
}
Datum
-pg_stat_get_bgwriter_maxwritten_clean(PG_FUNCTION_ARGS)
+pg_stat_get_bgwriter_rounds_cleaned_estimate(PG_FUNCTION_ARGS)
{
- PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->maxwritten_clean);
+ PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->rounds_cleaned_estimate);
+}
+
+Datum
+pg_stat_get_bgwriter_rounds_hit_limit(PG_FUNCTION_ARGS)
+{
+ PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->rounds_hit_limit);
+}
+
+Datum
+pg_stat_get_bgwriter_rounds_lapped_clock(PG_FUNCTION_ARGS)
+{
+ PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->rounds_lapped_clock);
}
Datum
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index d4911ada45..64efbeaa7e 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5585,11 +5585,22 @@
proargnames => '{archived_count,last_archived_wal,last_archived_time,failed_count,last_failed_wal,last_failed_time,stats_reset}',
prosrc => 'pg_stat_get_archiver' },
+{ oid => '8461',
+ descr => 'statistics: number of times the bgwriter cleaned the number of buffers in a round estimated to be allocated in the next cycle.',
+ proname => 'pg_stat_get_bgwriter_rounds_cleaned_estimate', provolatile => 's',
+ proparallel => 'r', prorettype => 'int8', proargtypes => '',
+ prosrc => 'pg_stat_get_bgwriter_rounds_cleaned_estimate' },
{ oid => '2773',
descr => 'statistics: number of times the bgwriter stopped processing when it had written too many buffers while cleaning',
- proname => 'pg_stat_get_bgwriter_maxwritten_clean', provolatile => 's',
+ proname => 'pg_stat_get_bgwriter_rounds_hit_limit', provolatile => 's',
+ proparallel => 'r', prorettype => 'int8', proargtypes => '',
+ prosrc => 'pg_stat_get_bgwriter_rounds_hit_limit' },
+{ oid => '8462',
+ descr => 'statistics: number of times the bgwriter lapped the clock sweep scan while cleaning buffers during a round',
+ proname => 'pg_stat_get_bgwriter_rounds_lapped_clock', provolatile => 's',
proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_bgwriter_maxwritten_clean' },
+ prosrc => 'pg_stat_get_bgwriter_rounds_lapped_clock' },
+
{ oid => '3075', descr => 'statistics: last reset for the bgwriter',
proname => 'pg_stat_get_bgwriter_stat_reset_time', provolatile => 's',
proparallel => 'r', prorettype => 'timestamptz', proargtypes => '',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 36d981c9ea..89bf8c6b81 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -524,7 +524,9 @@ typedef struct PgStat_MsgBgWriter
{
PgStat_MsgHdr m_hdr;
- PgStat_Counter m_maxwritten_clean;
+ PgStat_Counter m_rounds_cleaned_estimate;
+ PgStat_Counter m_rounds_hit_limit;
+ PgStat_Counter m_rounds_lapped_clock;
} PgStat_MsgBgWriter;
/* ----------
@@ -967,7 +969,9 @@ typedef struct PgStat_ArchiverStats
*/
typedef struct PgStat_BgWriterStats
{
- PgStat_Counter maxwritten_clean;
+ PgStat_Counter rounds_cleaned_estimate;
+ PgStat_Counter rounds_hit_limit;
+ PgStat_Counter rounds_lapped_clock;
TimestampTz stat_reset_timestamp;
} PgStat_BgWriterStats;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 97af066c6b..5373fed143 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1817,7 +1817,9 @@ pg_stat_archiver| SELECT s.archived_count,
s.last_failed_time,
s.stats_reset
FROM pg_stat_get_archiver() s(archived_count, last_archived_wal, last_archived_time, failed_count, last_failed_wal, last_failed_time, stats_reset);
-pg_stat_bgwriter| SELECT pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
+pg_stat_bgwriter| SELECT pg_stat_get_bgwriter_rounds_cleaned_estimate() AS rounds_cleaned_estimate,
+ pg_stat_get_bgwriter_rounds_hit_limit() AS rounds_hit_limit,
+ pg_stat_get_bgwriter_rounds_lapped_clock() AS rounds_lapped_clock,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
pg_stat_buffers| SELECT b.backend_type,
b.io_path,
--
2.30.2
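The renamed and newly added round counters from this patch would then surface in the view like so. Again a sketch, assuming a server built with this patch; the column names are the ones introduced in the system_views.sql hunk above, and no output is shown since the values depend entirely on workload:

```sql
-- Hypothetical session against a patched server: maxwritten_clean is
-- gone, replaced by three per-round exit-condition counters for the
-- bgwriter's LRU cleaning scan.
SELECT rounds_cleaned_estimate,  -- met its allocation estimate
       rounds_hit_limit,         -- stopped at bgwriter_lru_maxpages
       rounds_lapped_clock,      -- lapped the clock sweep scan
       stats_reset
FROM pg_stat_bgwriter;
```

Comparing the relative magnitudes of these three counters indicates which condition usually terminates a cleaning round, which is more actionable for tuning bgwriter_lru_maxpages than the old single maxwritten_clean counter.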
Attachment: v19-0008-small-comment-correction.patch (text/x-patch)
From 26de96df4e4bad1a2fde28a42795c596b93dc57d Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 24 Nov 2021 12:21:08 -0500
Subject: [PATCH v19 8/9] small comment correction
Naming callers in a function comment is brittle and unnecessary.
---
src/backend/utils/activity/backend_status.c | 11 +++++------
1 file changed, 5 insertions(+), 6 deletions(-)
diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index 08e3f8f167..d508b819b1 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -254,11 +254,11 @@ GetIOPathDesc(IOPath io_path)
}
/*
- * Initialize pgstats backend activity state, and set up our on-proc-exit
- * hook. Called from InitPostgres and AuxiliaryProcessMain. For auxiliary
- * process, MyBackendId is invalid. Otherwise, MyBackendId must be set, but we
- * must not have started any transaction yet (since the exit hook must run
- * after the last transaction exit).
+ * Initialize pgstats backend activity state, and set up our on-proc-exit hook.
+ *
+ * For auxiliary process, MyBackendId is invalid. Otherwise, MyBackendId must
+ * be set, but we must not have started any transaction yet (since the exit
+ * hook must run after the last transaction exit).
*
* NOTE: MyDatabaseId isn't set yet; so the shutdown hook has to be careful.
*/
@@ -296,7 +296,6 @@ pgstat_beinit(void)
* pgstat_bestart() -
*
* Initialize this backend's entry in the PgBackendStatus array.
- * Called from InitPostgres.
*
* Apart from auxiliary processes, MyBackendId, MyDatabaseId,
* session userid, and application_name must be set for a
--
2.30.2
Attachment: v19-0005-Add-buffers-to-pgstat_reset_shared_counters.patch (text/x-patch)
From 23ed7f8a4dfed1638daeb2e9b7a60ac4929fd99f Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 24 Nov 2021 11:39:48 -0500
Subject: [PATCH v19 5/9] Add "buffers" to pgstat_reset_shared_counters
Backends count IO operations for various IO paths in their
PgBackendStatus. Upon exit, they send these counts to the stats
collector. Prior to this commit, these IO operation stats would have
been reset when the target was "bgwriter".
With this commit, the "bgwriter" target will no longer reset the IO
operation stats; instead, they can be reset with the new target,
"buffers".
Author: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
src/backend/postmaster/pgstat.c | 75 +++++++++++++++++++--
src/backend/utils/activity/backend_status.c | 27 ++++++++
src/include/pgstat.h | 28 ++++++--
src/include/utils/backend_status.h | 2 +
4 files changed, 123 insertions(+), 9 deletions(-)
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 11db91f62b..2328b17bdf 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1509,6 +1509,36 @@ pgstat_reset_counters(void)
pgstat_send(&msg, sizeof(msg));
}
+/*
+ * Helper function to collect and send live backends' current IO operations
+ * stats counters when a stats reset is initiated so that they may be deducted
+ * from future totals.
+ */
+static void
+pgstat_send_buffers_reset(PgStat_MsgResetsharedcounter *msg)
+{
+ PgStatIOPathOps ops[BACKEND_NUM_TYPES];
+
+ memset(ops, 0, sizeof(ops));
+ pgstat_report_live_backend_io_path_ops(ops);
+
+ /*
+ * Iterate through the array of all IOOps for all IOPaths for each
+ * BackendType.
+ *
+ * An individual message is sent for each backend type because sending all
+ * IO operations in one message would exceed PGSTAT_MAX_MSG_SIZE (1000
+ * bytes).
+ */
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ msg->m_backend_resets.backend_type = idx_get_backend_type(i);
+ memcpy(&msg->m_backend_resets.iop, &ops[i],
+ sizeof(msg->m_backend_resets.iop));
+ pgstat_send(msg, sizeof(PgStat_MsgResetsharedcounter));
+ }
+}
+
/* ----------
* pgstat_reset_shared_counters() -
*
@@ -1526,6 +1556,14 @@ pgstat_reset_shared_counters(const char *target)
if (pgStatSock == PGINVALID_SOCKET)
return;
+ pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
+ if (strcmp(target, "buffers") == 0)
+ {
+ msg.m_resettarget = RESET_BUFFERS;
+ pgstat_send_buffers_reset(&msg);
+ return;
+ }
+
if (strcmp(target, "archiver") == 0)
msg.m_resettarget = RESET_ARCHIVER;
else if (strcmp(target, "bgwriter") == 0)
@@ -1536,9 +1574,9 @@ pgstat_reset_shared_counters(const char *target)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("unrecognized reset target: \"%s\"", target),
- errhint("Target must be \"archiver\", \"bgwriter\", or \"wal\".")));
+ errhint(
+ "Target must be \"archiver\", \"bgwriter\", \"buffers\", or \"wal\".")));
- pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
pgstat_send(&msg, sizeof(msg));
}
@@ -4425,6 +4463,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
*/
ts = GetCurrentTimestamp();
globalStats.bgwriter.stat_reset_timestamp = ts;
+ globalStats.buffers.stat_reset_timestamp = ts;
archiverStats.stat_reset_timestamp = ts;
walStats.stat_reset_timestamp = ts;
@@ -5590,10 +5629,38 @@ pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
{
if (msg->m_resettarget == RESET_BGWRITER)
{
- /* Reset the global, bgwriter and checkpointer statistics for the cluster. */
- memset(&globalStats, 0, sizeof(globalStats));
+ /*
+ * Reset the global bgwriter and checkpointer statistics for the
+ * cluster.
+ */
+ memset(&globalStats.checkpointer, 0, sizeof(globalStats.checkpointer));
+ memset(&globalStats.bgwriter, 0, sizeof(globalStats.bgwriter));
globalStats.bgwriter.stat_reset_timestamp = GetCurrentTimestamp();
}
+ else if (msg->m_resettarget == RESET_BUFFERS)
+ {
+ /*
+ * Because the stats collector cannot write to live backends'
+ * PgBackendStatuses, it maintains an array of "resets". The reset
+ * message contains the current values of these counters for live
+ * backends. The stats collector saves these in its "resets" array,
+ * then zeroes out the exited backends' saved IO operations counters.
+ * This is required to calculate an accurate total for each IO
+ * operations counter post reset.
+ */
+ BackendType backend_type = msg->m_backend_resets.backend_type;
+
+ /*
+ * Though globalStats.buffers only needs to be reset once, doing so for
+ * every message is less brittle, and the extra cost is irrelevant given
+ * how infrequently stats are reset.
+ */
+ memset(&globalStats.buffers.ops, 0, sizeof(globalStats.buffers.ops));
+ globalStats.buffers.stat_reset_timestamp = GetCurrentTimestamp();
+ memcpy(&globalStats.buffers.resets[backend_type_get_idx(backend_type)],
+ &msg->m_backend_resets.iop.io_path_ops,
+ sizeof(msg->m_backend_resets.iop.io_path_ops));
+ }
else if (msg->m_resettarget == RESET_ARCHIVER)
{
/* Reset the archiver statistics for the cluster. */
diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index 44c9a6e1a6..9fb888f2ca 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -631,6 +631,33 @@ pgstat_report_activity(BackendState state, const char *cmd_str)
PGSTAT_END_WRITE_ACTIVITY(beentry);
}
+/*
+ * Iterate through BackendStatusArray and capture live backends' stats on IOOps
+ * for all IOPaths, adding them to that backend type's member of the
+ * backend_io_path_ops structure.
+ */
+void
+pgstat_report_live_backend_io_path_ops(PgStatIOPathOps *backend_io_path_ops)
+{
+ PgBackendStatus *beentry = BackendStatusArray;
+
+ /*
+ * Loop through live backends and capture reset values
+ */
+ for (int i = 0; i < MaxBackends + NUM_AUXPROCTYPES; i++, beentry++)
+ {
+ int idx;
+
+ /* Don't count dead backends or those with type B_INVALID. */
+ if (beentry->st_procpid == 0 || beentry->st_backendType == B_INVALID)
+ continue;
+
+ idx = backend_type_get_idx(beentry->st_backendType);
+ pgstat_sum_io_path_ops(backend_io_path_ops[idx].io_path_ops,
+ (IOOpCounters *) beentry->io_path_stats);
+ }
+}
+
/* --------
* pgstat_report_query_id() -
*
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 23496afffc..08dea71537 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -142,6 +142,7 @@ typedef enum PgStat_Shared_Reset_Target
{
RESET_ARCHIVER,
RESET_BGWRITER,
+ RESET_BUFFERS,
RESET_WAL
} PgStat_Shared_Reset_Target;
@@ -357,7 +358,8 @@ typedef struct PgStatIOPathOps
/*
* Sent by a backend to the stats collector to report all IOOps for all IOPaths
- * for a given type of a backend. This will happen when the backend exits.
+ * for a given type of a backend. This will happen when the backend exits or
+ * when stats are reset.
*/
typedef struct PgStat_MsgIOPathOps
{
@@ -375,9 +377,12 @@ typedef struct PgStat_MsgIOPathOps
*/
typedef struct PgStat_BackendIOPathOps
{
+ TimestampTz stat_reset_timestamp;
PgStatIOPathOps ops[BACKEND_NUM_TYPES];
+ PgStatIOPathOps resets[BACKEND_NUM_TYPES];
} PgStat_BackendIOPathOps;
+
/* ----------
* PgStat_MsgResetcounter Sent by the backend to tell the collector
* to reset counters
@@ -389,15 +394,28 @@ typedef struct PgStat_MsgResetcounter
Oid m_databaseid;
} PgStat_MsgResetcounter;
-/* ----------
- * PgStat_MsgResetsharedcounter Sent by the backend to tell the collector
- * to reset a shared counter
- * ----------
+/*
+ * Sent by the backend to tell the collector to reset a shared counter.
+ *
+ * In addition to the message header and reset target, the message also
+ * contains an array with all of the IO operations for all IO paths done by a
+ * particular backend type.
+ *
+ * This is needed because the IO operation stats for live backends cannot be
+ * safely modified by other processes. Therefore, to correctly calculate the
+ * total IO operations for a particular backend type after a reset, the balance
+ * of IO operations for live backends at the time of prior resets must be
+ * subtracted from the total IO operations.
+ *
+ * To satisfy this requirement, the process initiating the reset will read the
+ * IO operations counters from live backends and send them to the stats
+ * collector which maintains an array of reset values.
*/
typedef struct PgStat_MsgResetsharedcounter
{
PgStat_MsgHdr m_hdr;
PgStat_Shared_Reset_Target m_resettarget;
+ PgStat_MsgIOPathOps m_backend_resets;
} PgStat_MsgResetsharedcounter;
/* ----------
diff --git a/src/include/utils/backend_status.h b/src/include/utils/backend_status.h
index ae4597c5fe..92f00e1fce 100644
--- a/src/include/utils/backend_status.h
+++ b/src/include/utils/backend_status.h
@@ -375,6 +375,7 @@ extern void pgstat_bestart(void);
extern void pgstat_clear_backend_activity_snapshot(void);
/* Activity reporting functions */
+typedef struct PgStatIOPathOps PgStatIOPathOps;
static inline void
pgstat_inc_ioop(IOOp io_op, IOPath io_path)
@@ -402,6 +403,7 @@ pgstat_inc_ioop(IOOp io_op, IOPath io_path)
}
}
extern void pgstat_report_activity(BackendState state, const char *cmd_str);
+extern void pgstat_report_live_backend_io_path_ops(PgStatIOPathOps *backend_io_path_ops);
extern void pgstat_report_query_id(uint64 query_id, bool force);
extern void pgstat_report_tempfile(size_t filesize);
extern void pgstat_report_appname(const char *appname);
--
2.30.2
Attachment: v19-0006-Add-system-view-tracking-IO-ops-per-backend-type.patch (text/x-patch)
From b85c02757fab93c841aaa58ebc08aaac4ad828bb Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 24 Nov 2021 12:07:37 -0500
Subject: [PATCH v19 6/9] Add system view tracking IO ops per backend type
Add pg_stat_buffers, a system view which tracks the number of IO
operations (allocs, writes, fsyncs, and extends) done through each IO
path (e.g. shared buffers, local buffers, unbuffered IO) by each type of
backend.
Some of these should always be zero. For example, checkpointer does not
use a BufferAccessStrategy (currently), so the "strategy" IO Path for
checkpointer will be 0 for all IO operations (alloc, write, fsync, and
extend). All possible combinations of IOPath and IOOp are enumerated in
the view but not all are populated or even possible at this point.
All backends increment a counter in an array of IO stat counters in
their PgBackendStatus when performing an IO operation. On exit, backends
send these stats to the stats collector to be persisted.
When the pg_stat_buffers view is queried, one backend will sum live
backends' stats with saved stats from exited backends and subtract saved
reset stats, returning the total.
Each row of the view is stats for a particular backend type for a
particular IO Path (e.g. shared buffer accesses by checkpointer) and
each column in the view is the total number of IO operations done (e.g.
writes).
So a cell in the view would be, for example, the number of shared
buffers written by checkpointer since the last stats reset.
Suggested by Andres Freund
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
doc/src/sgml/monitoring.sgml | 120 +++++++++++++++-
src/backend/catalog/system_views.sql | 11 ++
src/backend/postmaster/pgstat.c | 13 ++
src/backend/utils/activity/backend_status.c | 19 ++-
src/backend/utils/adt/pgstatfuncs.c | 150 ++++++++++++++++++++
src/include/catalog/pg_proc.dat | 9 ++
src/include/pgstat.h | 1 +
src/include/utils/backend_status.h | 1 +
src/test/regress/expected/rules.out | 8 ++
9 files changed, 324 insertions(+), 8 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 62f2a3332b..40884dbc27 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -435,6 +435,15 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</entry>
</row>
+ <row>
+ <entry><structname>pg_stat_buffers</structname><indexterm><primary>pg_stat_buffers</primary></indexterm></entry>
+ <entry>A row for each IO path for each backend type showing
+ statistics about backend IO operations. See
+ <link linkend="monitoring-pg-stat-buffers-view">
+ <structname>pg_stat_buffers</structname></link> for details.
+ </entry>
+ </row>
+
<row>
<entry><structname>pg_stat_wal</structname><indexterm><primary>pg_stat_wal</primary></indexterm></entry>
<entry>One row only, showing statistics about WAL activity. See
@@ -3604,7 +3613,102 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
<structfield>stats_reset</structfield> <type>timestamp with time zone</type>
</para>
<para>
- Time at which these statistics were last reset
+ Time at which these statistics were last reset.
+ </para></entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ </sect2>
+
+ <sect2 id="monitoring-pg-stat-buffers-view">
+ <title><structname>pg_stat_buffers</structname></title>
+
+ <indexterm>
+ <primary>pg_stat_buffers</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_stat_buffers</structname> view has a row for each backend
+ type for each possible IO path, containing cluster-wide statistics for that
+ combination of backend type and IO path.
+ </para>
+
+ <table id="pg-stat-buffers-view" xreflabel="pg_stat_buffers">
+ <title><structname>pg_stat_buffers</structname> View</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>backend_type</structfield> <type>text</type>
+ </para>
+ <para>
+ Type of backend (e.g. background worker, autovacuum worker).
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>io_path</structfield> <type>text</type>
+ </para>
+ <para>
+ IO path taken (e.g. shared buffers, direct).
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>alloc</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers allocated.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>extend</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers extended.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>fsync</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers fsynced.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>write</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers written.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
+ </para>
+ <para>
+ Time at which these statistics were last reset.
</para></entry>
</row>
</tbody>
@@ -5209,12 +5313,14 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
</para>
<para>
Resets some cluster-wide statistics counters to zero, depending on the
- argument. The argument can be <literal>bgwriter</literal> to reset
- all the counters shown in
- the <structname>pg_stat_bgwriter</structname>
- view, <literal>archiver</literal> to reset all the counters shown in
- the <structname>pg_stat_archiver</structname> view or <literal>wal</literal>
- to reset all the counters shown in the <structname>pg_stat_wal</structname> view.
+ argument. The argument can be <literal>archiver</literal> to reset all
+ the counters shown in the <structname>pg_stat_archiver</structname>
+ view, <literal>bgwriter</literal> to reset all the counters shown in
+ the <structname>pg_stat_bgwriter</structname> view,
+ <literal>buffers</literal> to reset all the counters shown in the
+ <structname>pg_stat_buffers</structname> view, or
+ <literal>wal</literal> to reset all the counters shown in the
+ <structname>pg_stat_wal</structname> view.
</para>
<para>
This function is restricted to superusers by default, but other users
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 61b515cdb8..e214d23056 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1076,6 +1076,17 @@ CREATE VIEW pg_stat_bgwriter AS
pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+CREATE VIEW pg_stat_buffers AS
+SELECT
+ b.backend_type,
+ b.io_path,
+ b.alloc,
+ b.extend,
+ b.fsync,
+ b.write,
+ b.stats_reset
+FROM pg_stat_get_buffers() b;
+
CREATE VIEW pg_stat_wal AS
SELECT
w.wal_records,
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 2328b17bdf..8ea024059e 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -2923,6 +2923,19 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
rec->tuples_inserted + rec->tuples_updated;
}
+/*
+ * Support function for SQL-callable pgstat* functions. Returns a pointer to
+ * the PgStat_BackendIOPathOps structure tracking IO operations statistics for
+ * both exited backends and reset arithmetic.
+ */
+PgStat_BackendIOPathOps *
+pgstat_fetch_exited_backend_buffers(void)
+{
+ backend_read_statsfile();
+
+ return &globalStats.buffers;
+}
+
/* ----------
* pgstat_fetch_stat_dbentry() -
diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index 9fb888f2ca..08e3f8f167 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -50,7 +50,7 @@ int pgstat_track_activity_query_size = 1024;
PgBackendStatus *MyBEEntry = NULL;
-static PgBackendStatus *BackendStatusArray = NULL;
+PgBackendStatus *BackendStatusArray = NULL;
static char *BackendAppnameBuffer = NULL;
static char *BackendClientHostnameBuffer = NULL;
static char *BackendActivityBuffer = NULL;
@@ -236,6 +236,23 @@ CreateSharedBackendStatus(void)
#endif
}
+const char *
+GetIOPathDesc(IOPath io_path)
+{
+ switch (io_path)
+ {
+ case IOPATH_DIRECT:
+ return "direct";
+ case IOPATH_LOCAL:
+ return "local";
+ case IOPATH_SHARED:
+ return "shared";
+ case IOPATH_STRATEGY:
+ return "strategy";
+ }
+ return "unknown IO path";
+}
+
/*
* Initialize pgstats backend activity state, and set up our on-proc-exit
* hook. Called from InitPostgres and AuxiliaryProcessMain. For auxiliary
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index f529c1561a..c16dc3ba9a 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1796,6 +1796,156 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
}
+/*
+ * When adding a new column to the pg_stat_buffers view, add a new enum
+ * value above BUFFERS_NUM_COLUMNS.
+ */
+enum
+{
+ BUFFERS_COLUMN_BACKEND_TYPE,
+ BUFFERS_COLUMN_IO_PATH,
+ BUFFERS_COLUMN_ALLOCS,
+ BUFFERS_COLUMN_EXTENDS,
+ BUFFERS_COLUMN_FSYNCS,
+ BUFFERS_COLUMN_WRITES,
+ BUFFERS_COLUMN_RESET_TIME,
+ BUFFERS_NUM_COLUMNS,
+};
+
+/*
+ * Helper function to get the correct row in the pg_stat_buffers view.
+ */
+static inline Datum *
+get_pg_stat_buffers_row(Datum all_values[BACKEND_NUM_TYPES][IOPATH_NUM_TYPES][BUFFERS_NUM_COLUMNS],
+ BackendType backend_type, IOPath io_path)
+{
+ return all_values[backend_type_get_idx(backend_type)][io_path];
+}
+
+Datum
+pg_stat_get_buffers(PG_FUNCTION_ARGS)
+{
+ PgStat_BackendIOPathOps *backend_io_path_ops;
+ PgBackendStatus *beentry;
+ Datum reset_time;
+
+ ReturnSetInfo *rsinfo;
+ TupleDesc tupdesc;
+ Tuplestorestate *tupstore;
+ MemoryContext per_query_ctx;
+ MemoryContext oldcontext;
+
+ Datum all_values[BACKEND_NUM_TYPES][IOPATH_NUM_TYPES][BUFFERS_NUM_COLUMNS];
+ bool all_nulls[BACKEND_NUM_TYPES][IOPATH_NUM_TYPES][BUFFERS_NUM_COLUMNS];
+
+ rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+
+ /* check to see if caller supports us returning a tuplestore */
+ if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("set-valued function called in context that cannot accept a set")));
+ if (!(rsinfo->allowedModes & SFRM_Materialize))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("materialize mode required, but it is not allowed in this context")));
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+ oldcontext = MemoryContextSwitchTo(per_query_ctx);
+ tupstore = tuplestore_begin_heap((bool) (rsinfo->allowedModes & SFRM_Materialize_Random),
+ false, work_mem);
+ rsinfo->returnMode = SFRM_Materialize;
+ rsinfo->setResult = tupstore;
+ rsinfo->setDesc = tupdesc;
+ MemoryContextSwitchTo(oldcontext);
+
+ memset(all_values, 0, sizeof(all_values));
+ memset(all_nulls, 0, sizeof(all_nulls));
+
+ /* Loop through all live backends and count their IO Ops for each IO Path */
+ beentry = BackendStatusArray;
+
+ for (int i = 0; i < MaxBackends + NUM_AUXPROCTYPES; i++, beentry++)
+ {
+ IOOpCounters *io_ops;
+
+ /*
+ * Don't count dead backends. They will be added below. There are no
+ * rows in the view for BackendType B_INVALID, so skip those as well.
+ */
+ if (beentry->st_procpid == 0 || beentry->st_backendType == B_INVALID)
+ continue;
+
+ io_ops = beentry->io_path_stats;
+
+ for (int io_path = 0; io_path < IOPATH_NUM_TYPES; io_path++)
+ {
+ Datum *row = get_pg_stat_buffers_row(all_values, beentry->st_backendType, io_path);
+
+ /*
+ * BUFFERS_COLUMN_RESET_TIME, BUFFERS_COLUMN_BACKEND_TYPE, and
+ * BUFFERS_COLUMN_IO_PATH will all be set when looping through the
+ * exited backends array below.
+ */
+ row[BUFFERS_COLUMN_ALLOCS] += pg_atomic_read_u64(&io_ops->allocs);
+ row[BUFFERS_COLUMN_EXTENDS] += pg_atomic_read_u64(&io_ops->extends);
+ row[BUFFERS_COLUMN_FSYNCS] += pg_atomic_read_u64(&io_ops->fsyncs);
+ row[BUFFERS_COLUMN_WRITES] += pg_atomic_read_u64(&io_ops->writes);
+ io_ops++;
+ }
+ }
+
+ /* Add stats from all exited backends */
+ backend_io_path_ops = pgstat_fetch_exited_backend_buffers();
+
+ reset_time = TimestampTzGetDatum(backend_io_path_ops->stat_reset_timestamp);
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ BackendType backend_type = idx_get_backend_type(i);
+
+ PgStatIOOpCounters *io_ops =
+ backend_io_path_ops->ops[i].io_path_ops;
+ PgStatIOOpCounters *resets =
+ backend_io_path_ops->resets[i].io_path_ops;
+
+ Datum backend_type_desc =
+ CStringGetTextDatum(GetBackendTypeDesc(backend_type));
+
+ for (int j = 0; j < IOPATH_NUM_TYPES; j++)
+ {
+ Datum *row = get_pg_stat_buffers_row(all_values, backend_type, j);
+
+ row[BUFFERS_COLUMN_BACKEND_TYPE] = backend_type_desc;
+ row[BUFFERS_COLUMN_IO_PATH] = CStringGetTextDatum(GetIOPathDesc(j));
+ row[BUFFERS_COLUMN_RESET_TIME] = reset_time;
+ row[BUFFERS_COLUMN_ALLOCS] += io_ops->allocs - resets->allocs;
+ row[BUFFERS_COLUMN_EXTENDS] += io_ops->extends - resets->extends;
+ row[BUFFERS_COLUMN_FSYNCS] += io_ops->fsyncs - resets->fsyncs;
+ row[BUFFERS_COLUMN_WRITES] += io_ops->writes - resets->writes;
+ io_ops++;
+ resets++;
+ }
+ }
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ for (int j = 0; j < IOPATH_NUM_TYPES; j++)
+ {
+ Datum *values = all_values[i][j];
+ bool *nulls = all_nulls[i][j];
+
+ tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+ }
+ }
+
+ return (Datum) 0;
+}
+
/*
* Returns statistics of WAL activity
*/
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 4d992dc224..24d1fc29a2 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5636,6 +5636,15 @@
proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
+{ oid => '8459', descr => 'statistics: counts of all IO operations done through all IO paths by each type of backend.',
+ proname => 'pg_stat_get_buffers', provolatile => 's', proisstrict => 'f',
+ prorows => '52', proretset => 't',
+ proparallel => 'r', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,text,int8,int8,int8,int8,timestamptz}',
+ proargmodes => '{o,o,o,o,o,o,o}',
+ proargnames => '{backend_type,io_path,alloc,extend,fsync,write,stats_reset}',
+ prosrc => 'pg_stat_get_buffers' },
+
{ oid => '1136', descr => 'statistics: information about WAL activity',
proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
proparallel => 'r', prorettype => 'record', proargtypes => '',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 08dea71537..9369e4a408 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1297,6 +1297,7 @@ extern void pgstat_sum_io_path_ops(PgStatIOOpCounters *dest, IOOpCounters *src);
* generate the pgstat* views.
* ----------
*/
+extern PgStat_BackendIOPathOps *pgstat_fetch_exited_backend_buffers(void);
extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid);
extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
diff --git a/src/include/utils/backend_status.h b/src/include/utils/backend_status.h
index 92f00e1fce..ecd0668161 100644
--- a/src/include/utils/backend_status.h
+++ b/src/include/utils/backend_status.h
@@ -316,6 +316,7 @@ extern PGDLLIMPORT int pgstat_track_activity_query_size;
* ----------
*/
extern PGDLLIMPORT PgBackendStatus *MyBEEntry;
+extern PGDLLIMPORT PgBackendStatus *BackendStatusArray;
/* ----------
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index b58b062b10..5869ce442f 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1828,6 +1828,14 @@ pg_stat_bgwriter| SELECT pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints
pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+pg_stat_buffers| SELECT b.backend_type,
+ b.io_path,
+ b.alloc,
+ b.extend,
+ b.fsync,
+ b.write,
+ b.stats_reset
+ FROM pg_stat_get_buffers() b(backend_type, io_path, alloc, extend, fsync, write, stats_reset);
pg_stat_database| SELECT d.oid AS datid,
d.datname,
CASE
--
2.30.2
Attachment: v19-0001-Read-only-atomic-backend-write-function.patch (text/x-patch)
From b1cd3358991553b50a087e6c0b865d8733d82951 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 11 Oct 2021 16:15:06 -0400
Subject: [PATCH v19 1/9] Read-only atomic backend write function
For counters in shared memory which can be read by any backend but only
written to by one backend, an atomic is still needed to protect against
torn values; however, pg_atomic_fetch_add_u64() is overkill for
incrementing such counters. pg_atomic_unlocked_inc_counter() is a helper
function which can be used to increment these values safely without
unnecessary overhead.
Author: Thomas Munro <tmunro@postgresql.org>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://www.postgresql.org/message-id/CA%2BhUKGJ06d3h5JeOtAv4h52n0vG1jOPZxqMCn5FySJQUVZA32w%40mail.gmail.com
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
src/include/port/atomics.h | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/src/include/port/atomics.h b/src/include/port/atomics.h
index 856338f161..af30da32e5 100644
--- a/src/include/port/atomics.h
+++ b/src/include/port/atomics.h
@@ -519,6 +519,17 @@ pg_atomic_sub_fetch_u64(volatile pg_atomic_uint64 *ptr, int64 sub_)
return pg_atomic_sub_fetch_u64_impl(ptr, sub_);
}
+/*
+ * On modern systems this is really just *counter++. On some older systems
+ * there might be more to it (due to an inability to read and write 64-bit
+ * values atomically).
+ */
+static inline void
+pg_atomic_unlocked_inc_counter(pg_atomic_uint64 *counter)
+{
+ pg_atomic_write_u64(counter, pg_atomic_read_u64(counter) + 1);
+}
+
#undef INSIDE_ATOMICS_H
#endif /* ATOMICS_H */
--
2.30.2
Attachment: v19-0002-Move-backend-pgstat-initialization-earlier.patch (text/x-patch)
From f11345e1bed8955ac335e3bc510d248590659dc1 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 14 Dec 2021 12:26:56 -0500
Subject: [PATCH v19 2/9] Move backend pgstat initialization earlier
Initialize the pgstats subsystem earlier during process initialization
so that more process types have a backend activity state
(PgBackendStatus).
Conditionally initializing backend activity state in some types of
processes and not in others necessitates surprising special cases in the
code.
This particular commit was motivated by single user mode missing a
backend activity state.
This commit also adds a new BackendType for standalone backends,
B_STANDALONE_BACKEND (and alphabetizes the BackendTypes). Both the
bootstrap backend and single user mode backends will have BackendType
B_STANDALONE_BACKEND.
Author: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://www.postgresql.org/message-id/CAAKRu_aaq33UnG4TXq3S-OSXGWj1QGf0sU%2BECH4tNwGFNERkZA%40mail.gmail.com
---
src/backend/utils/init/miscinit.c | 23 ++++++++++++++---------
src/backend/utils/init/postinit.c | 7 +++----
src/include/miscadmin.h | 7 ++++---
3 files changed, 21 insertions(+), 16 deletions(-)
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 88801374b5..41d0b023cd 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -176,6 +176,8 @@ InitStandaloneProcess(const char *argv0)
{
Assert(!IsPostmasterEnvironment);
+ MyBackendType = B_STANDALONE_BACKEND;
+
/*
* Start our win32 signal implementation
*/
@@ -255,6 +257,9 @@ GetBackendTypeDesc(BackendType backendType)
case B_INVALID:
backendDesc = "not initialized";
break;
+ case B_ARCHIVER:
+ backendDesc = "archiver";
+ break;
case B_AUTOVAC_LAUNCHER:
backendDesc = "autovacuum launcher";
break;
@@ -273,9 +278,18 @@ GetBackendTypeDesc(BackendType backendType)
case B_CHECKPOINTER:
backendDesc = "checkpointer";
break;
+ case B_LOGGER:
+ backendDesc = "logger";
+ break;
+ case B_STANDALONE_BACKEND:
+ backendDesc = "standalone backend";
+ break;
case B_STARTUP:
backendDesc = "startup";
break;
+ case B_STATS_COLLECTOR:
+ backendDesc = "stats collector";
+ break;
case B_WAL_RECEIVER:
backendDesc = "walreceiver";
break;
@@ -285,15 +299,6 @@ GetBackendTypeDesc(BackendType backendType)
case B_WAL_WRITER:
backendDesc = "walwriter";
break;
- case B_ARCHIVER:
- backendDesc = "archiver";
- break;
- case B_STATS_COLLECTOR:
- backendDesc = "stats collector";
- break;
- case B_LOGGER:
- backendDesc = "logger";
- break;
}
return backendDesc;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 7292e51f7d..11f1fec17e 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -623,6 +623,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
RegisterTimeout(CLIENT_CONNECTION_CHECK_TIMEOUT, ClientCheckTimeoutHandler);
}
+ pgstat_beinit();
+
/*
* If this is either a bootstrap process nor a standalone backend, start
* up the XLOG machinery, and register to have it closed down at exit.
@@ -638,6 +640,7 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
*/
CreateAuxProcessResourceOwner();
+ pgstat_bestart();
StartupXLOG();
/* Release (and warn about) any buffer pins leaked in StartupXLOG */
ReleaseAuxProcessResources(true);
@@ -665,7 +668,6 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
EnablePortalManager();
/* Initialize status reporting */
- pgstat_beinit();
/*
* Load relcache entries for the shared system catalogs. This must create
@@ -903,10 +905,7 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
* transaction we started before returning.
*/
if (!bootstrap)
- {
- pgstat_bestart();
CommitTransactionCommand();
- }
return;
}
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 90a3016065..1d688f9e51 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -323,19 +323,20 @@ extern void SwitchBackToLocalLatch(void);
typedef enum BackendType
{
B_INVALID = 0,
+ B_ARCHIVER,
B_AUTOVAC_LAUNCHER,
B_AUTOVAC_WORKER,
B_BACKEND,
B_BG_WORKER,
B_BG_WRITER,
B_CHECKPOINTER,
+ B_LOGGER,
+ B_STANDALONE_BACKEND,
B_STARTUP,
+ B_STATS_COLLECTOR,
B_WAL_RECEIVER,
B_WAL_SENDER,
B_WAL_WRITER,
- B_ARCHIVER,
- B_STATS_COLLECTOR,
- B_LOGGER,
} BackendType;
extern BackendType MyBackendType;
--
2.30.2
v19-0003-Add-IO-operation-counters-to-PgBackendStatus.patch
From 8588e4afc3e3869dc2478d53bfb970ea9a6dbd9a Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 24 Nov 2021 10:32:56 -0500
Subject: [PATCH v19 3/9] Add IO operation counters to PgBackendStatus
Add an array of counters to PgBackendStatus which count the buffers
allocated, extended, fsynced, and written by a given backend.
Each "IO Op" (alloc, fsync, extend, write) is counted per "IO Path"
(direct, local, shared, or strategy).
"local" and "shared" IO Path counters count operations on local and
shared buffers.
The "strategy" IO Path counts buffers alloc'd/written/read/fsync'd as
part of a BufferAccessStrategy.
The "direct" IO Path counts blocks of IO which are read, written, or
fsync'd using smgrwrite/extend/immedsync directly (as opposed to through
[Local]BufferAlloc()).
With this commit, backends increment a counter in the array in their
PgBackendStatus when performing an IO operation.
Future patches will persist the IO stat counters from a backend's
PgBackendStatus upon backend exit and use the counters to provide
observability of database IO operations.
Note that this commit does not add code to increment the "direct" path.
A future patch adding wrappers for smgrwrite(), smgrextend(), and
smgrimmedsync() would provide a good location to call pgstat_inc_ioop()
for unbuffered IO and avoid regressions for future users of these
functions.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
src/backend/postmaster/checkpointer.c | 1 +
src/backend/storage/buffer/bufmgr.c | 47 +++++++++++---
src/backend/storage/buffer/freelist.c | 22 ++++++-
src/backend/storage/buffer/localbuf.c | 3 +
src/backend/storage/sync/sync.c | 1 +
src/backend/utils/activity/backend_status.c | 10 +++
src/include/storage/buf_internals.h | 4 +-
src/include/utils/backend_status.h | 68 +++++++++++++++++++++
8 files changed, 141 insertions(+), 15 deletions(-)
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 25a18b7a14..8440b2b802 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -1101,6 +1101,7 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
if (!AmBackgroundWriterProcess())
CheckpointerShmem->num_backend_fsync++;
LWLockRelease(CheckpointerCommLock);
+ pgstat_inc_ioop(IOOP_FSYNC, IOPATH_SHARED);
return false;
}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index b4532948d3..528995e7fb 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -480,7 +480,7 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
BlockNumber blockNum,
BufferAccessStrategy strategy,
bool *foundPtr);
-static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+static void FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOPath iopath);
static void FindAndDropRelFileNodeBuffers(RelFileNode rnode,
ForkNumber forkNum,
BlockNumber nForkBlock,
@@ -972,6 +972,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (isExtend)
{
+ pgstat_inc_ioop(IOOP_EXTEND, IOPATH_SHARED);
/* new buffers are zero-filled */
MemSet((char *) bufBlock, 0, BLCKSZ);
/* don't set checksum for all-zero page */
@@ -1172,6 +1173,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
/* Loop here in case we have to try another victim buffer */
for (;;)
{
+ bool from_ring;
+
/*
* Ensure, while the spinlock's not yet held, that there's a free
* refcount entry.
@@ -1182,7 +1185,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* Select a victim buffer. The buffer is returned with its header
* spinlock still held!
*/
- buf = StrategyGetBuffer(strategy, &buf_state);
+ buf = StrategyGetBuffer(strategy, &buf_state, &from_ring);
Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
@@ -1219,6 +1222,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf),
LW_SHARED))
{
+ IOPath iopath;
+
/*
* If using a nondefault strategy, and writing the buffer
* would require a WAL flush, let the strategy decide whether
@@ -1236,7 +1241,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
UnlockBufHdr(buf, buf_state);
if (XLogNeedsFlush(lsn) &&
- StrategyRejectBuffer(strategy, buf))
+ StrategyRejectBuffer(strategy, buf, &from_ring))
{
/* Drop lock/pin and loop around for another buffer */
LWLockRelease(BufferDescriptorGetContentLock(buf));
@@ -1245,13 +1250,27 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
}
}
+ /*
+ * When a strategy is in use, if the dirty buffer was selected
+ * from the strategy ring and we did not bother checking the
+ * freelist or doing a clock sweep to look for a clean shared
+ * buffer to use, the write will be counted as a strategy
+ * write. However, if the dirty buffer was obtained from the
+ * freelist or a clock sweep, it is counted as a regular write.
+ *
+ * When a strategy is not in use, at this point the write can
+ * only be a "regular" write of a dirty buffer.
+ */
+
+ iopath = from_ring ? IOPATH_STRATEGY : IOPATH_SHARED;
+
/* OK, do the I/O */
TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(forkNum, blockNum,
smgr->smgr_rnode.node.spcNode,
smgr->smgr_rnode.node.dbNode,
smgr->smgr_rnode.node.relNode);
- FlushBuffer(buf, NULL);
+ FlushBuffer(buf, NULL, iopath);
LWLockRelease(BufferDescriptorGetContentLock(buf));
ScheduleBufferTagForWriteback(&BackendWritebackContext,
@@ -2552,10 +2571,11 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
* Pin it, share-lock it, write it. (FlushBuffer will do nothing if the
* buffer is clean by the time we've locked it.)
*/
+
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOPATH_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
@@ -2803,9 +2823,12 @@ BufferGetTag(Buffer buffer, RelFileNode *rnode, ForkNumber *forknum,
*
* If the caller has an smgr reference for the buffer's relation, pass it
* as the second parameter. If not, pass NULL.
+ *
+ * IOPath will always be IOPATH_SHARED except when a buffer access strategy is
+ * used and the buffer being flushed is a buffer from the strategy ring.
*/
static void
-FlushBuffer(BufferDesc *buf, SMgrRelation reln)
+FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOPath iopath)
{
XLogRecPtr recptr;
ErrorContextCallback errcallback;
@@ -2897,6 +2920,8 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
bufToWrite,
false);
+ pgstat_inc_ioop(IOOP_WRITE, iopath);
+
if (track_io_timing)
{
INSTR_TIME_SET_CURRENT(io_time);
@@ -3533,6 +3558,8 @@ FlushRelationBuffers(Relation rel)
localpage,
false);
+ pgstat_inc_ioop(IOOP_WRITE, IOPATH_LOCAL);
+
buf_state &= ~(BM_DIRTY | BM_JUST_DIRTIED);
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
@@ -3568,7 +3595,7 @@ FlushRelationBuffers(Relation rel)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, RelationGetSmgr(rel));
+ FlushBuffer(bufHdr, RelationGetSmgr(rel), IOPATH_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
@@ -3664,7 +3691,7 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, srelent->srel);
+ FlushBuffer(bufHdr, srelent->srel, IOPATH_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
@@ -3720,7 +3747,7 @@ FlushDatabaseBuffers(Oid dbid)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOPATH_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
@@ -3747,7 +3774,7 @@ FlushOneBuffer(Buffer buffer)
Assert(LWLockHeldByMe(BufferDescriptorGetContentLock(bufHdr)));
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOPATH_SHARED);
}
/*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 6be80476db..45d73995b2 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -19,6 +19,7 @@
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/proc.h"
+#include "utils/backend_status.h"
#define INT_ACCESS_ONCE(var) ((int)(*((volatile int *)&(var))))
@@ -198,7 +199,7 @@ have_free_buffer(void)
* return the buffer with the buffer header spinlock still held.
*/
BufferDesc *
-StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
+StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_ring)
{
BufferDesc *buf;
int bgwprocno;
@@ -212,7 +213,8 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
if (strategy != NULL)
{
buf = GetBufferFromRing(strategy, buf_state);
- if (buf != NULL)
+ *from_ring = buf != NULL;
+ if (*from_ring)
return buf;
}
@@ -247,6 +249,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
* the rate of buffer consumption. Note that buffers recycled by a
* strategy object are intentionally not counted here.
*/
+ pgstat_inc_ioop(IOOP_ALLOC, IOPATH_SHARED);
pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);
/*
@@ -683,8 +686,14 @@ AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
* if this buffer should be written and re-used.
*/
bool
-StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
+StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool *from_ring)
{
+ /*
+ * Start by assuming that we will use the dirty buffer selected by
+ * StrategyGetBuffer().
+ */
+ *from_ring = true;
+
/* We only do this in bulkread mode */
if (strategy->btype != BAS_BULKREAD)
return false;
@@ -700,5 +709,12 @@ StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
*/
strategy->buffers[strategy->current] = InvalidBuffer;
+ /*
+ * Since we will not be writing out a dirty buffer from the ring, set
+ * from_ring to false so that the caller does not count this write as a
+ * "strategy write" and can do proper bookkeeping.
+ */
+ *from_ring = false;
+
return true;
}
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 04b3558ea3..f396a2b68d 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -20,6 +20,7 @@
#include "executor/instrument.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
+#include "utils/backend_status.h"
#include "utils/guc.h"
#include "utils/memutils.h"
#include "utils/resowner_private.h"
@@ -226,6 +227,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
localpage,
false);
+ pgstat_inc_ioop(IOOP_WRITE, IOPATH_LOCAL);
+
/* Mark not-dirty now in case we error out below */
buf_state &= ~BM_DIRTY;
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index d4083e8a56..3be06d5d5a 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -420,6 +420,7 @@ ProcessSyncRequests(void)
total_elapsed += elapsed;
processed++;
+ pgstat_inc_ioop(IOOP_FSYNC, IOPATH_SHARED);
if (log_checkpoints)
elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f ms",
processed,
diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index 7229598822..44c9a6e1a6 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -400,6 +400,16 @@ pgstat_bestart(void)
lbeentry.st_progress_command_target = InvalidOid;
lbeentry.st_query_id = UINT64CONST(0);
+ for (int i = 0; i < IOPATH_NUM_TYPES; i++)
+ {
+ IOOpCounters *io_ops = &lbeentry.io_path_stats[i];
+
+ pg_atomic_init_u64(&io_ops->allocs, 0);
+ pg_atomic_init_u64(&io_ops->extends, 0);
+ pg_atomic_init_u64(&io_ops->fsyncs, 0);
+ pg_atomic_init_u64(&io_ops->writes, 0);
+ }
+
/*
* we don't zero st_progress_param here to save cycles; nobody should
* examine it until st_progress_command has been set to something other
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 7c6653311a..15f5724fa1 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -310,10 +310,10 @@ extern void ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *
/* freelist.c */
extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
- uint32 *buf_state);
+ uint32 *buf_state, bool *from_ring);
extern void StrategyFreeBuffer(BufferDesc *buf);
extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
- BufferDesc *buf);
+ BufferDesc *buf, bool *from_ring);
extern int StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
extern void StrategyNotifyBgWriter(int bgwprocno);
diff --git a/src/include/utils/backend_status.h b/src/include/utils/backend_status.h
index 8042b817df..56a0f25296 100644
--- a/src/include/utils/backend_status.h
+++ b/src/include/utils/backend_status.h
@@ -13,6 +13,7 @@
#include "datatype/timestamp.h"
#include "libpq/pqcomm.h"
#include "miscadmin.h" /* for BackendType */
+#include "port/atomics.h"
#include "utils/backend_progress.h"
@@ -31,12 +32,47 @@ typedef enum BackendState
STATE_DISABLED
} BackendState;
+/* ----------
+ * IO Stats reporting utility types
+ * ----------
+ */
+
+typedef enum IOOp
+{
+ IOOP_ALLOC,
+ IOOP_EXTEND,
+ IOOP_FSYNC,
+ IOOP_WRITE,
+} IOOp;
+
+#define IOOP_NUM_TYPES (IOOP_WRITE + 1)
+
+typedef enum IOPath
+{
+ IOPATH_DIRECT,
+ IOPATH_LOCAL,
+ IOPATH_SHARED,
+ IOPATH_STRATEGY,
+} IOPath;
+
+#define IOPATH_NUM_TYPES (IOPATH_STRATEGY + 1)
/* ----------
* Shared-memory data structures
* ----------
*/
+/*
+ * Structure for counting all types of IOOp for a live backend.
+ */
+typedef struct IOOpCounters
+{
+ pg_atomic_uint64 allocs;
+ pg_atomic_uint64 extends;
+ pg_atomic_uint64 fsyncs;
+ pg_atomic_uint64 writes;
+} IOOpCounters;
+
/*
* PgBackendSSLStatus
*
@@ -168,6 +204,12 @@ typedef struct PgBackendStatus
/* query identifier, optionally computed using post_parse_analyze_hook */
uint64 st_query_id;
+
+ /*
+ * Stats on all IOOps for all IOPaths for this backend. These should be
+ * incremented whenever an IO Operation is performed.
+ */
+ IOOpCounters io_path_stats[IOPATH_NUM_TYPES];
} PgBackendStatus;
@@ -296,6 +338,32 @@ extern void pgstat_bestart(void);
extern void pgstat_clear_backend_activity_snapshot(void);
/* Activity reporting functions */
+
+static inline void
+pgstat_inc_ioop(IOOp io_op, IOPath io_path)
+{
+ IOOpCounters *io_ops;
+ PgBackendStatus *beentry = MyBEEntry;
+
+ Assert(beentry);
+
+ io_ops = &beentry->io_path_stats[io_path];
+ switch (io_op)
+ {
+ case IOOP_ALLOC:
+ pg_atomic_unlocked_inc_counter(&io_ops->allocs);
+ break;
+ case IOOP_EXTEND:
+ pg_atomic_unlocked_inc_counter(&io_ops->extends);
+ break;
+ case IOOP_FSYNC:
+ pg_atomic_unlocked_inc_counter(&io_ops->fsyncs);
+ break;
+ case IOOP_WRITE:
+ pg_atomic_unlocked_inc_counter(&io_ops->writes);
+ break;
+ }
+}
extern void pgstat_report_activity(BackendState state, const char *cmd_str);
extern void pgstat_report_query_id(uint64 query_id, bool force);
extern void pgstat_report_tempfile(size_t filesize);
--
2.30.2
v19-0004-Send-IO-operations-to-stats-collector.patch
From 7d02db46a4bd7f2bff5384abdd224bd749792a2d Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 24 Nov 2021 11:16:55 -0500
Subject: [PATCH v19 4/9] Send IO operations to stats collector
On exit, backends send the IO operations they have done on all IO Paths
to the stats collector. The stats collector adds these counts to its
existing counts stored in a global data structure it maintains and
persists.
PgStatIOOpCounters contains the same information as backend_status.h's
IOOpCounters, however IOOpCounters' members must be atomics and the
stats collector has no such requirement.
Suggested by Andres Freund
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
src/backend/postmaster/pgstat.c | 100 ++++++++++++++++++++++++++++-
src/include/miscadmin.h | 2 +
src/include/pgstat.h | 56 ++++++++++++++++
src/include/utils/backend_status.h | 37 +++++++++++
4 files changed, 194 insertions(+), 1 deletion(-)
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 7264d2c727..11db91f62b 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -126,7 +126,7 @@ char *pgstat_stat_filename = NULL;
char *pgstat_stat_tmpname = NULL;
/*
- * BgWriter and WAL global statistics counters.
+ * BgWriter, Checkpointer, and WAL global statistics counters.
* Stored directly in a stats message structure so they can be sent
* without needing to copy things around. We assume these init to zeroes.
*/
@@ -369,6 +369,7 @@ static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
static void pgstat_recv_checkpointer(PgStat_MsgCheckpointer *msg, int len);
+static void pgstat_recv_io_path_ops(PgStat_MsgIOPathOps *msg, int len);
static void pgstat_recv_wal(PgStat_MsgWal *msg, int len);
static void pgstat_recv_slru(PgStat_MsgSLRU *msg, int len);
static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
@@ -3152,6 +3153,14 @@ pgstat_shutdown_hook(int code, Datum arg)
{
Assert(!pgstat_is_shutdown);
+ /*
+ * Only need to send stats on IOOps for IOPaths when a process exits. Users
+ * requiring IOOps for both live and exited backends can read from live
+ * backends' PgBackendStatuses and sum this with totals from exited
+ * backends persisted by the stats collector.
+ */
+ pgstat_send_buffers();
+
/*
* If we got as far as discovering our own database ID, we can report what
* we did to the collector. Otherwise, we'd be sending an invalid
@@ -3301,6 +3310,46 @@ pgstat_send_bgwriter(void)
MemSet(&PendingBgWriterStats, 0, sizeof(PendingBgWriterStats));
}
+/*
+ * Before exiting, a backend sends its IO operations statistics to the
+ * collector so that they may be persisted.
+ */
+void
+pgstat_send_buffers(void)
+{
+ PgStatIOOpCounters *io_path_ops;
+ PgStat_MsgIOPathOps msg;
+
+ PgBackendStatus *beentry = MyBEEntry;
+ PgStat_Counter sum = 0;
+
+ if (!beentry || beentry->st_backendType == B_INVALID)
+ return;
+
+ memset(&msg, 0, sizeof(msg));
+ msg.backend_type = beentry->st_backendType;
+
+ io_path_ops = msg.iop.io_path_ops;
+ pgstat_sum_io_path_ops(io_path_ops, (IOOpCounters *)
+ &beentry->io_path_stats);
+
+ /* If no IO was done, don't bother sending anything to the stats collector. */
+ for (int i = 0; i < IOPATH_NUM_TYPES; i++)
+ {
+ sum += io_path_ops[i].allocs;
+ sum += io_path_ops[i].extends;
+ sum += io_path_ops[i].fsyncs;
+ sum += io_path_ops[i].writes;
+ }
+
+ if (sum == 0)
+ return;
+
+ pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_IO_PATH_OPS);
+ pgstat_send(&msg, sizeof(msg));
+}
+
+
/* ----------
* pgstat_send_checkpointer() -
*
@@ -3483,6 +3532,29 @@ pgstat_send_subscription_purge(PgStat_MsgSubscriptionPurge *msg)
pgstat_send(msg, len);
}
+/*
+ * Helper function to sum all IO operations stats for all IOPaths (e.g. shared,
+ * local) from live backends with those in the equivalent stats structure for
+ * exited backends.
+ * Note that this adds and doesn't set, so the destination stats structure
+ * should be zeroed out by the caller initially.
+ * This would commonly be used to transfer all IOOp stats for all IOPaths for a
+ * particular backend type to the pgstats structure.
+ */
+void
+pgstat_sum_io_path_ops(PgStatIOOpCounters *dest, IOOpCounters *src)
+{
+ for (int i = 0; i < IOPATH_NUM_TYPES; i++)
+ {
+ dest->allocs += pg_atomic_read_u64(&src->allocs);
+ dest->extends += pg_atomic_read_u64(&src->extends);
+ dest->fsyncs += pg_atomic_read_u64(&src->fsyncs);
+ dest->writes += pg_atomic_read_u64(&src->writes);
+ dest++;
+ src++;
+ }
+}
+
/* ----------
* PgstatCollectorMain() -
*
@@ -3692,6 +3764,10 @@ PgstatCollectorMain(int argc, char *argv[])
pgstat_recv_checkpointer(&msg.msg_checkpointer, len);
break;
+ case PGSTAT_MTYPE_IO_PATH_OPS:
+ pgstat_recv_io_path_ops(&msg.msg_io_path_ops, len);
+ break;
+
case PGSTAT_MTYPE_WAL:
pgstat_recv_wal(&msg.msg_wal, len);
break;
@@ -5813,6 +5889,28 @@ pgstat_recv_checkpointer(PgStat_MsgCheckpointer *msg, int len)
globalStats.checkpointer.buf_fsync_backend += msg->m_buf_fsync_backend;
}
+static void
+pgstat_recv_io_path_ops(PgStat_MsgIOPathOps *msg, int len)
+{
+ PgStatIOOpCounters *src_io_path_ops;
+ PgStatIOOpCounters *dest_io_path_ops;
+
+ src_io_path_ops = msg->iop.io_path_ops;
+ dest_io_path_ops =
+ globalStats.buffers.ops[backend_type_get_idx(msg->backend_type)].io_path_ops;
+
+ for (int i = 0; i < IOPATH_NUM_TYPES; i++)
+ {
+ PgStatIOOpCounters *src = &src_io_path_ops[i];
+ PgStatIOOpCounters *dest = &dest_io_path_ops[i];
+
+ dest->allocs += src->allocs;
+ dest->extends += src->extends;
+ dest->fsyncs += src->fsyncs;
+ dest->writes += src->writes;
+ }
+}
+
/* ----------
* pgstat_recv_wal() -
*
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 1d688f9e51..85fe522780 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -339,6 +339,8 @@ typedef enum BackendType
B_WAL_WRITER,
} BackendType;
+#define BACKEND_NUM_TYPES B_WAL_WRITER
+
extern BackendType MyBackendType;
extern const char *GetBackendTypeDesc(BackendType backendType);
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 5b51b58e5a..23496afffc 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -73,6 +73,7 @@ typedef enum StatMsgType
PGSTAT_MTYPE_ARCHIVER,
PGSTAT_MTYPE_BGWRITER,
PGSTAT_MTYPE_CHECKPOINTER,
+ PGSTAT_MTYPE_IO_PATH_OPS,
PGSTAT_MTYPE_WAL,
PGSTAT_MTYPE_SLRU,
PGSTAT_MTYPE_FUNCSTAT,
@@ -335,6 +336,48 @@ typedef struct PgStat_MsgDropdb
} PgStat_MsgDropdb;
+/*
+ * Structure for counting all types of IOOps in the stats collector
+ */
+typedef struct PgStatIOOpCounters
+{
+ PgStat_Counter allocs;
+ PgStat_Counter extends;
+ PgStat_Counter fsyncs;
+ PgStat_Counter writes;
+} PgStatIOOpCounters;
+
+/*
+ * Structure for counting all IOOps on all types of IOPaths.
+ */
+typedef struct PgStatIOPathOps
+{
+ PgStatIOOpCounters io_path_ops[IOPATH_NUM_TYPES];
+} PgStatIOPathOps;
+
+/*
+ * Sent by a backend to the stats collector to report all IOOps for all IOPaths
+ * for a given type of a backend. This will happen when the backend exits.
+ */
+typedef struct PgStat_MsgIOPathOps
+{
+ PgStat_MsgHdr m_hdr;
+
+ BackendType backend_type;
+ PgStatIOPathOps iop;
+} PgStat_MsgIOPathOps;
+
+/*
+ * Structure used by stats collector to keep track of all types of exited
+ * backends' IOOps for all IOPaths as well as all stats from live backends at
+ * the time of stats reset. resets is populated using a reset message sent to
+ * the stats collector.
+ */
+typedef struct PgStat_BackendIOPathOps
+{
+ PgStatIOPathOps ops[BACKEND_NUM_TYPES];
+} PgStat_BackendIOPathOps;
+
/* ----------
* PgStat_MsgResetcounter Sent by the backend to tell the collector
* to reset counters
@@ -756,6 +799,7 @@ typedef union PgStat_Msg
PgStat_MsgArchiver msg_archiver;
PgStat_MsgBgWriter msg_bgwriter;
PgStat_MsgCheckpointer msg_checkpointer;
+ PgStat_MsgIOPathOps msg_io_path_ops;
PgStat_MsgWal msg_wal;
PgStat_MsgSLRU msg_slru;
PgStat_MsgFuncstat msg_funcstat;
@@ -939,6 +983,7 @@ typedef struct PgStat_GlobalStats
PgStat_CheckpointerStats checkpointer;
PgStat_BgWriterStats bgwriter;
+ PgStat_BackendIOPathOps buffers;
} PgStat_GlobalStats;
/*
@@ -1215,8 +1260,19 @@ extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
extern void pgstat_send_archiver(const char *xlog, bool failed);
extern void pgstat_send_bgwriter(void);
+/*
+ * While some processes send some types of statistics to the collector at
+ * regular intervals (e.g. CheckpointerMain() calling
+ * pgstat_send_checkpointer()), IO operations stats are only sent by
+ * pgstat_send_buffers() when a process exits (in pgstat_shutdown_hook()). IO
+ * operations stats from live backends can be read from their PgBackendStatuses
+ * and, if desired, summed with totals from exited backends persisted by the
+ * stats collector.
+ */
+extern void pgstat_send_buffers(void);
extern void pgstat_send_checkpointer(void);
extern void pgstat_send_wal(bool force);
+extern void pgstat_sum_io_path_ops(PgStatIOOpCounters *dest, IOOpCounters *src);
/* ----------
* Support functions for the SQL-callable functions to
diff --git a/src/include/utils/backend_status.h b/src/include/utils/backend_status.h
index 56a0f25296..ae4597c5fe 100644
--- a/src/include/utils/backend_status.h
+++ b/src/include/utils/backend_status.h
@@ -331,6 +331,43 @@ extern void CreateSharedBackendStatus(void);
* ----------
*/
+/* Utility functions */
+
+/*
+ * When maintaining an array of information about all valid BackendTypes, in
+ * order to avoid wasting the 0th spot, use this helper to convert a valid
+ * BackendType to a valid location in the array (given that no spot is
+ * maintained for B_INVALID BackendType).
+ */
+static inline int backend_type_get_idx(BackendType backend_type)
+{
+ /*
+ * backend_type must be one of the valid backend types. If caller is
+ * maintaining backend information in an array that includes B_INVALID,
+ * this function is unnecessary.
+ */
+ Assert(backend_type > B_INVALID && backend_type <= BACKEND_NUM_TYPES);
+ return backend_type - 1;
+}
+
+/*
+ * When using a value from an array of information about all valid
+ * BackendTypes, add 1 to the index before using it as a BackendType to adjust
+ * for not maintaining a spot for B_INVALID BackendType.
+ */
+static inline BackendType idx_get_backend_type(int idx)
+{
+ int backend_type = idx + 1;
+ /*
+ * If the array includes a spot for B_INVALID BackendType this function is
+ * not required.
+ */
+ Assert(backend_type > B_INVALID && backend_type <= BACKEND_NUM_TYPES);
+ return backend_type;
+}
+
+extern const char *GetIOPathDesc(IOPath io_path);
+
/* Initialization functions */
extern void pgstat_beinit(void);
extern void pgstat_bestart(void);
--
2.30.2
On Tue, Dec 21, 2021 at 8:32 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> On Thu, Dec 16, 2021 at 3:18 PM Andres Freund <andres@anarazel.de> wrote:
> > > From 9f22da9041e1e1fbc0ef003f5f78f4e72274d438 Mon Sep 17 00:00:00 2001
> > > From: Melanie Plageman <melanieplageman@gmail.com>
> > > Date: Wed, 24 Nov 2021 12:20:10 -0500
> > > Subject: [PATCH v17 6/7] Remove superfluous bgwriter stats
> > >
> > > Remove stats from pg_stat_bgwriter which are now more clearly expressed
> > > in pg_stat_buffers.
> > >
> > > TODO:
> > > - make pg_stat_checkpointer view and move relevant stats into it
> > > - add additional stats to pg_stat_bgwriter
> >
> > When do you think it makes sense to tackle these wrt committing some of the
> > patches?
>
> Well, the new stats are a superset of the old stats (no stats have been
> removed that are not represented in the new or old views). So, I don't
> see that as a blocker for committing these patches.
>
> Since it is weird that pg_stat_bgwriter had mostly checkpointer stats,
> I've edited this commit to rename that view to pg_stat_checkpointer.
>
> I have not made a separate view just for maxwritten_clean (presumably
> called pg_stat_bgwriter), but I would not be opposed to doing this if
> you thought having a view with a single column isn't a problem (in the
> event that we don't get around to adding more bgwriter stats right
> away).
>
> > How about keeping old bgwriter values in place in the view, but generated
> > from the new stats stuff?
>
> I tried this, but I actually don't think it is the right way to go. In
> order to maintain the old view with the new source code, I had to add
> new code to maintain a separate resets array just for the bgwriter view.
> It adds some fiddly code that will be annoying to maintain (the reset
> logic is confusing enough as is).
>
> And, besides the implementation complexity, if a user resets
> pg_stat_bgwriter and not pg_stat_buffers (or vice versa), they will
> see totally different numbers for "buffers_backend" in pg_stat_bgwriter
> than shared buffers written by B_BACKEND in pg_stat_buffers. I would
> find that confusing.

In a quick chat off-list, Andres suggested it might be okay to have a
single reset target for both the pg_stat_buffers view and legacy
pg_stat_bgwriter view. So, I am planning to share a new patchset which
has only the new "buffers" target which will also reset the legacy
pg_stat_bgwriter view.

I'll also remove the bgwriter stats I proposed and the
pg_stat_checkpointer view to keep things simple for now.

- Melanie
On Thu, Dec 30, 2021 at 3:30 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
On Tue, Dec 21, 2021 at 8:32 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:On Thu, Dec 16, 2021 at 3:18 PM Andres Freund <andres@anarazel.de> wrote:
From 9f22da9041e1e1fbc0ef003f5f78f4e72274d438 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 24 Nov 2021 12:20:10 -0500
Subject: [PATCH v17 6/7] Remove superfluous bgwriter stats

Remove stats from pg_stat_bgwriter which are now more clearly expressed
in pg_stat_buffers.

TODO:
- make pg_stat_checkpointer view and move relevant stats into it
- add additional stats to pg_stat_bgwriter

> When do you think it makes sense to tackle these wrt committing some of the
> patches?

Well, the new stats are a superset of the old stats (no stats have been
removed that are not represented in the new or old views). So, I don't
see that as a blocker for committing these patches.

Since it is weird that pg_stat_bgwriter had mostly checkpointer stats,
I've edited this commit to rename that view to pg_stat_checkpointer.

I have not made a separate view just for maxwritten_clean (presumably
called pg_stat_bgwriter), but I would not be opposed to doing this if
you thought having a view with a single column isn't a problem (in the
event that we don't get around to adding more bgwriter stats right
away).

> How about keeping old bgwriter values in place in the view, but generated
> from the new stats stuff?

I tried this, but I actually don't think it is the right way to go. In
order to maintain the old view with the new source code, I had to add
new code to maintain a separate resets array just for the bgwriter view.
It adds some fiddly code that will be annoying to maintain (the reset
logic is confusing enough as is).
And, besides the implementation complexity, if a user resets
pg_stat_bgwriter and not pg_stat_buffers (or vice versa), they will
see totally different numbers for "buffers_backend" in pg_stat_bgwriter
than shared buffers written by B_BACKEND in pg_stat_buffers. I would
find that confusing.

In a quick chat off-list, Andres suggested it might be okay to have a
single reset target for both the pg_stat_buffers view and legacy
pg_stat_bgwriter view. So, I am planning to share a new patchset which
has only the new "buffers" target which will also reset the legacy
pg_stat_bgwriter view.

I'll also remove the bgwriter stats I proposed and the
pg_stat_checkpointer view to keep things simple for now.
I've done the above in v20, attached.
- Melanie
Attachments:
v20-0006-Add-system-view-tracking-IO-ops-per-backend-type.patch (text/x-patch)
From 14e20657d6a6674cc286b5ce1a29560889cf7833 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 24 Nov 2021 12:07:37 -0500
Subject: [PATCH v20 6/8] Add system view tracking IO ops per backend type
Add pg_stat_buffers, a system view which tracks the number of IO
operations (allocs, writes, fsyncs, and extends) done through each IO
path (e.g. shared buffers, local buffers, unbuffered IO) by each type of
backend.
Some of these should always be zero. For example, checkpointer does not
use a BufferAccessStrategy (currently), so the "strategy" IO Path for
checkpointer will be 0 for all IO operations (alloc, write, fsync, and
extend). All possible combinations of IOPath and IOOp are enumerated in
the view but not all are populated or even possible at this point.
All backends increment a counter in an array of IO stat counters in
their PgBackendStatus when performing an IO operation. On exit, backends
send these stats to the stats collector to be persisted.
When the pg_stat_buffers view is queried, one backend will sum live
backends' stats with saved stats from exited backends and subtract saved
reset stats, returning the total.
Each row of the view is stats for a particular backend type for a
particular IO Path (e.g. shared buffer accesses by checkpointer) and
each column in the view is the total number of IO operations done (e.g.
writes).
So a cell in the view would be, for example, the number of shared
buffers written by checkpointer since the last stats reset.
Suggested by Andres Freund
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
doc/src/sgml/monitoring.sgml | 119 +++++++++++++++-
src/backend/catalog/system_views.sql | 11 ++
src/backend/postmaster/pgstat.c | 13 ++
src/backend/utils/activity/backend_status.c | 19 ++-
src/backend/utils/adt/pgstatfuncs.c | 150 ++++++++++++++++++++
src/include/catalog/pg_proc.dat | 9 ++
src/include/pgstat.h | 1 +
src/include/utils/backend_status.h | 1 +
src/test/regress/expected/rules.out | 8 ++
9 files changed, 323 insertions(+), 8 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index c94eb57a59..30df3d473e 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -435,6 +435,15 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</entry>
</row>
+ <row>
+ <entry><structname>pg_stat_buffers</structname><indexterm><primary>pg_stat_buffers</primary></indexterm></entry>
+ <entry>A row for each IO path for each backend type showing
+ statistics about backend IO operations. See
+ <link linkend="monitoring-pg-stat-buffers-view">
+ <structname>pg_stat_buffers</structname></link> for details.
+ </entry>
+ </row>
+
<row>
<entry><structname>pg_stat_wal</structname><indexterm><primary>pg_stat_wal</primary></indexterm></entry>
<entry>One row only, showing statistics about WAL activity. See
@@ -3604,7 +3613,102 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
<structfield>stats_reset</structfield> <type>timestamp with time zone</type>
</para>
<para>
- Time at which these statistics were last reset
+ Time at which these statistics were last reset.
+ </para></entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ </sect2>
+
+ <sect2 id="monitoring-pg-stat-buffers-view">
+ <title><structname>pg_stat_buffers</structname></title>
+
+ <indexterm>
+ <primary>pg_stat_buffers</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_stat_buffers</structname> view has a row for each backend
+ type for each possible IO path containing global data for the cluster for
+ that backend and IO path.
+ </para>
+
+ <table id="pg-stat-buffers-view" xreflabel="pg_stat_buffers">
+ <title><structname>pg_stat_buffers</structname> View</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>backend_type</structfield> <type>text</type>
+ </para>
+ <para>
+ Type of backend (e.g. background worker, autovacuum worker).
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>io_path</structfield> <type>text</type>
+ </para>
+ <para>
+ IO path taken (e.g. shared buffers, direct).
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>alloc</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers allocated.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>extend</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers extended.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>fsync</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers fsynced.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>write</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers written.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
+ </para>
+ <para>
+ Time at which these statistics were last reset.
</para></entry>
</row>
</tbody>
@@ -5209,12 +5313,13 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
</para>
<para>
Resets some cluster-wide statistics counters to zero, depending on the
- argument. The argument can be <literal>buffers</literal> to reset
- all the counters shown in
- the <structname>pg_stat_bgwriter</structname>
- view, <literal>archiver</literal> to reset all the counters shown in
- the <structname>pg_stat_archiver</structname> view or <literal>wal</literal>
- to reset all the counters shown in the <structname>pg_stat_wal</structname> view.
+ argument. The argument can be <literal>archiver</literal> to reset all
+ the counters shown in the <structname>pg_stat_archiver</structname>
+ view, <literal>buffers</literal> to reset all the counters shown in
+ both the <structname>pg_stat_bgwriter</structname> view and
+ <structname>pg_stat_buffers</structname> view, or
+ <literal>wal</literal> to reset all the counters shown in the
+ <structname>pg_stat_wal</structname> view.
</para>
<para>
This function is restricted to superusers by default, but other users
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 61b515cdb8..e214d23056 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1076,6 +1076,17 @@ CREATE VIEW pg_stat_bgwriter AS
pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+CREATE VIEW pg_stat_buffers AS
+SELECT
+ b.backend_type,
+ b.io_path,
+ b.alloc,
+ b.extend,
+ b.fsync,
+ b.write,
+ b.stats_reset
+FROM pg_stat_get_buffers() b;
+
CREATE VIEW pg_stat_wal AS
SELECT
w.wal_records,
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 71db0b7b14..f4bff26630 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -2921,6 +2921,19 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
rec->tuples_inserted + rec->tuples_updated;
}
+/*
+ * Support function for SQL-callable pgstat* functions. Returns a pointer to
+ * the PgStat_BackendIOPathOps structure tracking IO operations statistics for
+ * both exited backends and reset arithmetic.
+ */
+PgStat_BackendIOPathOps *
+pgstat_fetch_exited_backend_buffers(void)
+{
+ backend_read_statsfile();
+
+ return &globalStats.buffers;
+}
+
/* ----------
* pgstat_fetch_stat_dbentry() -
diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index 9fb888f2ca..08e3f8f167 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -50,7 +50,7 @@ int pgstat_track_activity_query_size = 1024;
PgBackendStatus *MyBEEntry = NULL;
-static PgBackendStatus *BackendStatusArray = NULL;
+PgBackendStatus *BackendStatusArray = NULL;
static char *BackendAppnameBuffer = NULL;
static char *BackendClientHostnameBuffer = NULL;
static char *BackendActivityBuffer = NULL;
@@ -236,6 +236,23 @@ CreateSharedBackendStatus(void)
#endif
}
+const char *
+GetIOPathDesc(IOPath io_path)
+{
+ switch (io_path)
+ {
+ case IOPATH_DIRECT:
+ return "direct";
+ case IOPATH_LOCAL:
+ return "local";
+ case IOPATH_SHARED:
+ return "shared";
+ case IOPATH_STRATEGY:
+ return "strategy";
+ }
+ return "unknown IO path";
+}
+
/*
* Initialize pgstats backend activity state, and set up our on-proc-exit
* hook. Called from InitPostgres and AuxiliaryProcessMain. For auxiliary
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index f529c1561a..c16dc3ba9a 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1796,6 +1796,156 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
}
+/*
+* When adding a new column to the pg_stat_buffers view, add a new enum
+* value here above BUFFERS_NUM_COLUMNS.
+*/
+enum
+{
+ BUFFERS_COLUMN_BACKEND_TYPE,
+ BUFFERS_COLUMN_IO_PATH,
+ BUFFERS_COLUMN_ALLOCS,
+ BUFFERS_COLUMN_EXTENDS,
+ BUFFERS_COLUMN_FSYNCS,
+ BUFFERS_COLUMN_WRITES,
+ BUFFERS_COLUMN_RESET_TIME,
+ BUFFERS_NUM_COLUMNS,
+};
+
+/*
+ * Helper function to get the correct row in the pg_stat_buffers view.
+ */
+static inline Datum *
+get_pg_stat_buffers_row(Datum all_values[BACKEND_NUM_TYPES][IOPATH_NUM_TYPES][BUFFERS_NUM_COLUMNS],
+ BackendType backend_type, IOPath io_path)
+{
+ return all_values[backend_type_get_idx(backend_type)][io_path];
+}
+
+Datum
+pg_stat_get_buffers(PG_FUNCTION_ARGS)
+{
+ PgStat_BackendIOPathOps *backend_io_path_ops;
+ PgBackendStatus *beentry;
+ Datum reset_time;
+
+ ReturnSetInfo *rsinfo;
+ TupleDesc tupdesc;
+ Tuplestorestate *tupstore;
+ MemoryContext per_query_ctx;
+ MemoryContext oldcontext;
+
+ Datum all_values[BACKEND_NUM_TYPES][IOPATH_NUM_TYPES][BUFFERS_NUM_COLUMNS];
+ bool all_nulls[BACKEND_NUM_TYPES][IOPATH_NUM_TYPES][BUFFERS_NUM_COLUMNS];
+
+ rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+
+ /* check to see if caller supports us returning a tuplestore */
+ if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("set-valued function called in context that cannot accept a set")));
+ if (!(rsinfo->allowedModes & SFRM_Materialize))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("materialize mode required, but it is not allowed in this context")));
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+ oldcontext = MemoryContextSwitchTo(per_query_ctx);
+ tupstore = tuplestore_begin_heap((bool) (rsinfo->allowedModes & SFRM_Materialize_Random),
+ false, work_mem);
+ rsinfo->returnMode = SFRM_Materialize;
+ rsinfo->setResult = tupstore;
+ rsinfo->setDesc = tupdesc;
+ MemoryContextSwitchTo(oldcontext);
+
+ memset(all_values, 0, sizeof(all_values));
+ memset(all_nulls, 0, sizeof(all_nulls));
+
+ /* Loop through all live backends and count their IO Ops for each IO Path */
+ beentry = BackendStatusArray;
+
+ for (int i = 0; i < MaxBackends + NUM_AUXPROCTYPES; i++, beentry++)
+ {
+ IOOpCounters *io_ops;
+
+ /*
+ * Don't count dead backends. They will be added below. There are no
+ * rows in the view for BackendType B_INVALID, so skip those as well.
+ */
+ if (beentry->st_procpid == 0 || beentry->st_backendType == B_INVALID)
+ continue;
+
+ io_ops = beentry->io_path_stats;
+
+ for (int i = 0; i < IOPATH_NUM_TYPES; i++)
+ {
+ Datum *row = get_pg_stat_buffers_row(all_values, beentry->st_backendType, i);
+
+ /*
+ * BUFFERS_COLUMN_RESET_TIME, BUFFERS_COLUMN_BACKEND_TYPE, and
+ * BUFFERS_COLUMN_IO_PATH will all be set when looping through
+ * exited backends array
+ */
+ row[BUFFERS_COLUMN_ALLOCS] += pg_atomic_read_u64(&io_ops->allocs);
+ row[BUFFERS_COLUMN_EXTENDS] += pg_atomic_read_u64(&io_ops->extends);
+ row[BUFFERS_COLUMN_FSYNCS] += pg_atomic_read_u64(&io_ops->fsyncs);
+ row[BUFFERS_COLUMN_WRITES] += pg_atomic_read_u64(&io_ops->writes);
+ io_ops++;
+ }
+ }
+
+ /* Add stats from all exited backends */
+ backend_io_path_ops = pgstat_fetch_exited_backend_buffers();
+
+ reset_time = TimestampTzGetDatum(backend_io_path_ops->stat_reset_timestamp);
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ BackendType backend_type = idx_get_backend_type(i);
+
+ PgStatIOOpCounters *io_ops =
+ backend_io_path_ops->ops[i].io_path_ops;
+ PgStatIOOpCounters *resets =
+ backend_io_path_ops->resets[i].io_path_ops;
+
+ Datum backend_type_desc =
+ CStringGetTextDatum(GetBackendTypeDesc(backend_type));
+
+ for (int j = 0; j < IOPATH_NUM_TYPES; j++)
+ {
+ Datum *row = get_pg_stat_buffers_row(all_values, backend_type, j);
+
+ row[BUFFERS_COLUMN_BACKEND_TYPE] = backend_type_desc;
+ row[BUFFERS_COLUMN_IO_PATH] = CStringGetTextDatum(GetIOPathDesc(j));
+ row[BUFFERS_COLUMN_RESET_TIME] = reset_time;
+ row[BUFFERS_COLUMN_ALLOCS] += io_ops->allocs - resets->allocs;
+ row[BUFFERS_COLUMN_EXTENDS] += io_ops->extends - resets->extends;
+ row[BUFFERS_COLUMN_FSYNCS] += io_ops->fsyncs - resets->fsyncs;
+ row[BUFFERS_COLUMN_WRITES] += io_ops->writes - resets->writes;
+ io_ops++;
+ resets++;
+ }
+ }
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ for (int j = 0; j < IOPATH_NUM_TYPES; j++)
+ {
+ Datum *values = all_values[i][j];
+ bool *nulls = all_nulls[i][j];
+
+ tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+ }
+ }
+
+ return (Datum) 0;
+}
+
/*
* Returns statistics of WAL activity
*/
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 4d992dc224..24d1fc29a2 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5636,6 +5636,15 @@
proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
+{ oid => '8459', descr => 'statistics: counts of all IO operations done through all IO paths by each type of backend.',
+ proname => 'pg_stat_get_buffers', provolatile => 's', proisstrict => 'f',
+ prorows => '52', proretset => 't',
+ proparallel => 'r', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,text,int8,int8,int8,int8,timestamptz}',
+ proargmodes => '{o,o,o,o,o,o,o}',
+ proargnames => '{backend_type,io_path,alloc,extend,fsync,write,stats_reset}',
+ prosrc => 'pg_stat_get_buffers' },
+
{ oid => '1136', descr => 'statistics: information about WAL activity',
proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
proparallel => 'r', prorettype => 'record', proargtypes => '',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 4505825c87..d0700e6efe 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1296,6 +1296,7 @@ extern void pgstat_sum_io_path_ops(PgStatIOOpCounters *dest, IOOpCounters *src);
* generate the pgstat* views.
* ----------
*/
+extern PgStat_BackendIOPathOps *pgstat_fetch_exited_backend_buffers(void);
extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid);
extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
diff --git a/src/include/utils/backend_status.h b/src/include/utils/backend_status.h
index 92f00e1fce..ecd0668161 100644
--- a/src/include/utils/backend_status.h
+++ b/src/include/utils/backend_status.h
@@ -316,6 +316,7 @@ extern PGDLLIMPORT int pgstat_track_activity_query_size;
* ----------
*/
extern PGDLLIMPORT PgBackendStatus *MyBEEntry;
+extern PGDLLIMPORT PgBackendStatus *BackendStatusArray;
/* ----------
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index b58b062b10..5869ce442f 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1828,6 +1828,14 @@ pg_stat_bgwriter| SELECT pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints
pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+pg_stat_buffers| SELECT b.backend_type,
+ b.io_path,
+ b.alloc,
+ b.extend,
+ b.fsync,
+ b.write,
+ b.stats_reset
+ FROM pg_stat_get_buffers() b(backend_type, io_path, alloc, extend, fsync, write, stats_reset);
pg_stat_database| SELECT d.oid AS datid,
d.datname,
CASE
--
2.30.2
v20-0005-Add-buffers-to-pgstat_reset_shared_counters.patch (text/x-patch)
From 85234ca76f31103fe7a2106a6b99a7da2d5a991e Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 3 Jan 2022 18:26:45 -0500
Subject: [PATCH v20 5/8] Add "buffers" to pgstat_reset_shared_counters
Backends count IO operations for various IO paths in their
PgBackendStatus. Upon exit, they send these counts to the stats
collector.
Prior to this commit, the IO operations stats from exited backends
persisted by the stats collector would have been reset when
pgstat_reset_shared_counters() was invoked with target "bgwriter".
However the IO operations stats in each live backend's PgBackendStatus
would remain the same. Thus the totals calculated from both live and
exited backends would be incorrect after a reset.
Backends' PgBackendStatuses cannot be written to by another backend;
therefore, in order to calculate correct totals after a reset has
occurred, the backend sending the reset message to the stats collector
now reads the IO operation stats totals from live backends and sends
them to the stats collector to be persisted in an array of "resets"
which can be used to calculate the correct totals after a reset.
Because the IO operations statistics are broader in scope than those in
pg_stat_bgwriter, rename the reset target to "buffers". The "buffers"
target will reset all IO operations statistics and all statistics for
the pg_stat_bgwriter view maintained by the stats collector.
Author: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
doc/src/sgml/monitoring.sgml | 2 +-
src/backend/postmaster/pgstat.c | 87 ++++++++++++++++++---
src/backend/utils/activity/backend_status.c | 27 +++++++
src/include/pgstat.h | 29 +++++--
src/include/utils/backend_status.h | 2 +
5 files changed, 129 insertions(+), 18 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 62f2a3332b..c94eb57a59 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -5209,7 +5209,7 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
</para>
<para>
Resets some cluster-wide statistics counters to zero, depending on the
- argument. The argument can be <literal>bgwriter</literal> to reset
+ argument. The argument can be <literal>buffers</literal> to reset
all the counters shown in
the <structname>pg_stat_bgwriter</structname>
view, <literal>archiver</literal> to reset all the counters shown in
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 11db91f62b..71db0b7b14 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1509,6 +1509,36 @@ pgstat_reset_counters(void)
pgstat_send(&msg, sizeof(msg));
}
+/*
+ * Helper function to collect and send live backends' current IO operations
+ * stats counters when a stats reset is initiated so that they may be deducted
+ * from future totals.
+ */
+static void
+pgstat_send_buffers_reset(PgStat_MsgResetsharedcounter *msg)
+{
+ PgStatIOPathOps ops[BACKEND_NUM_TYPES];
+
+ memset(ops, 0, sizeof(ops));
+ pgstat_report_live_backend_io_path_ops(ops);
+
+ /*
+ * Iterate through the array of all IOOps for all IOPaths for each
+ * BackendType.
+ *
+ * An individual message is sent for each backend type because sending all
+ * IO operations in one message would exceed the PGSTAT_MAX_MSG_SIZE of
+ * 1000.
+ */
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ msg->m_backend_resets.backend_type = idx_get_backend_type(i);
+ memcpy(&msg->m_backend_resets.iop, &ops[i],
+ sizeof(msg->m_backend_resets.iop));
+ pgstat_send(msg, sizeof(PgStat_MsgResetsharedcounter));
+ }
+}
+
/* ----------
* pgstat_reset_shared_counters() -
*
@@ -1526,19 +1556,25 @@ pgstat_reset_shared_counters(const char *target)
if (pgStatSock == PGINVALID_SOCKET)
return;
+ pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
+ if (strcmp(target, "buffers") == 0)
+ {
+ msg.m_resettarget = RESET_BUFFERS;
+ pgstat_send_buffers_reset(&msg);
+ return;
+ }
+
if (strcmp(target, "archiver") == 0)
msg.m_resettarget = RESET_ARCHIVER;
- else if (strcmp(target, "bgwriter") == 0)
- msg.m_resettarget = RESET_BGWRITER;
else if (strcmp(target, "wal") == 0)
msg.m_resettarget = RESET_WAL;
else
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("unrecognized reset target: \"%s\"", target),
- errhint("Target must be \"archiver\", \"bgwriter\", or \"wal\".")));
+ errhint(
+ "Target must be \"archiver\", \"buffers\", or \"wal\".")));
- pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
pgstat_send(&msg, sizeof(msg));
}
@@ -4425,6 +4461,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
*/
ts = GetCurrentTimestamp();
globalStats.bgwriter.stat_reset_timestamp = ts;
+ globalStats.buffers.stat_reset_timestamp = ts;
archiverStats.stat_reset_timestamp = ts;
walStats.stat_reset_timestamp = ts;
@@ -5588,18 +5625,46 @@ pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len)
static void
pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
{
- if (msg->m_resettarget == RESET_BGWRITER)
- {
- /* Reset the global, bgwriter and checkpointer statistics for the cluster. */
- memset(&globalStats, 0, sizeof(globalStats));
- globalStats.bgwriter.stat_reset_timestamp = GetCurrentTimestamp();
- }
- else if (msg->m_resettarget == RESET_ARCHIVER)
+ if (msg->m_resettarget == RESET_ARCHIVER)
{
/* Reset the archiver statistics for the cluster. */
memset(&archiverStats, 0, sizeof(archiverStats));
archiverStats.stat_reset_timestamp = GetCurrentTimestamp();
}
+ else if (msg->m_resettarget == RESET_BUFFERS)
+ {
+ /*
+ * Reset global stats for bgwriter, buffers, and checkpointer.
+ *
+ * Because the stats collector cannot write to live backends'
+ * PgBackendStatuses, it maintains an array of "resets". The reset
+ * message contains the current values of these counters for live
+ * backends. The stats collector saves these in its "resets" array,
+ * then zeroes out the exited backends' saved IO operations counters.
+ * This is required to calculate an accurate total for each IO
+ * operations counter post reset.
+ */
+ BackendType backend_type = msg->m_backend_resets.backend_type;
+
+ /*
+ * We reset each member individually (as opposed to resetting the
+ * entire globalStats struct) because we need to preserve the resets
+ * array (globalStats.buffers.resets).
+ *
+ * Though globalStats.buffers.ops, globalStats.bgwriter, and
+ * globalStats.checkpointer only need to be reset once, doing so for
+ * every message is less brittle and the extra cost is irrelevant given
+ * how often stats are reset.
+ */
+ memset(&globalStats.bgwriter, 0, sizeof(globalStats.bgwriter));
+ memset(&globalStats.checkpointer, 0, sizeof(globalStats.checkpointer));
+ memset(&globalStats.buffers.ops, 0, sizeof(globalStats.buffers.ops));
+ globalStats.bgwriter.stat_reset_timestamp = GetCurrentTimestamp();
+ globalStats.buffers.stat_reset_timestamp = GetCurrentTimestamp();
+ memcpy(&globalStats.buffers.resets[backend_type_get_idx(backend_type)],
+ &msg->m_backend_resets.iop.io_path_ops,
+ sizeof(msg->m_backend_resets.iop.io_path_ops));
+ }
else if (msg->m_resettarget == RESET_WAL)
{
/* Reset the WAL statistics for the cluster. */
diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index 44c9a6e1a6..9fb888f2ca 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -631,6 +631,33 @@ pgstat_report_activity(BackendState state, const char *cmd_str)
PGSTAT_END_WRITE_ACTIVITY(beentry);
}
+/*
+ * Iterate through BackendStatusArray and capture live backends' stats on IOOps
+ * for all IOPaths, adding them to that backend type's member of the
+ * backend_io_path_ops structure.
+ */
+void
+pgstat_report_live_backend_io_path_ops(PgStatIOPathOps *backend_io_path_ops)
+{
+ PgBackendStatus *beentry = BackendStatusArray;
+
+ /*
+ * Loop through live backends and capture reset values
+ */
+ for (int i = 0; i < MaxBackends + NUM_AUXPROCTYPES; i++, beentry++)
+ {
+ int idx;
+
+ /* Don't count dead backends or those with type B_INVALID. */
+ if (beentry->st_procpid == 0 || beentry->st_backendType == B_INVALID)
+ continue;
+
+ idx = backend_type_get_idx(beentry->st_backendType);
+ pgstat_sum_io_path_ops(backend_io_path_ops[idx].io_path_ops,
+ (IOOpCounters *) beentry->io_path_stats);
+ }
+}
+
/* --------
* pgstat_report_query_id() -
*
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 23496afffc..4505825c87 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -141,7 +141,7 @@ typedef struct PgStat_TableCounts
typedef enum PgStat_Shared_Reset_Target
{
RESET_ARCHIVER,
- RESET_BGWRITER,
+ RESET_BUFFERS,
RESET_WAL
} PgStat_Shared_Reset_Target;
@@ -357,7 +357,8 @@ typedef struct PgStatIOPathOps
/*
* Sent by a backend to the stats collector to report all IOOps for all IOPaths
- * for a given type of a backend. This will happen when the backend exits.
+ * for a given type of a backend. This will happen when the backend exits or
+ * when stats are reset.
*/
typedef struct PgStat_MsgIOPathOps
{
@@ -375,9 +376,12 @@ typedef struct PgStat_MsgIOPathOps
*/
typedef struct PgStat_BackendIOPathOps
{
+ TimestampTz stat_reset_timestamp;
PgStatIOPathOps ops[BACKEND_NUM_TYPES];
+ PgStatIOPathOps resets[BACKEND_NUM_TYPES];
} PgStat_BackendIOPathOps;
+
/* ----------
* PgStat_MsgResetcounter Sent by the backend to tell the collector
* to reset counters
@@ -389,15 +393,28 @@ typedef struct PgStat_MsgResetcounter
Oid m_databaseid;
} PgStat_MsgResetcounter;
-/* ----------
- * PgStat_MsgResetsharedcounter Sent by the backend to tell the collector
- * to reset a shared counter
- * ----------
+/*
+ * Sent by the backend to tell the collector to reset a shared counter.
+ *
+ * In addition to the message header and reset target, the message also
+ * contains an array with all of the IO operations for all IO paths done by a
+ * particular backend type.
+ *
+ * This is needed because the IO operation stats for live backends cannot be
+ * safely modified by other processes. Therefore, to correctly calculate the
+ * total IO operations for a particular backend type after a reset, the balance
+ * of IO operations for live backends at the time of prior resets must be
+ * subtracted from the total IO operations.
+ *
+ * To satisfy this requirement, the process initiating the reset will read the
+ * IO operations counters from live backends and send them to the stats
+ * collector which maintains an array of reset values.
*/
typedef struct PgStat_MsgResetsharedcounter
{
PgStat_MsgHdr m_hdr;
PgStat_Shared_Reset_Target m_resettarget;
+ PgStat_MsgIOPathOps m_backend_resets;
} PgStat_MsgResetsharedcounter;
/* ----------
diff --git a/src/include/utils/backend_status.h b/src/include/utils/backend_status.h
index ae4597c5fe..92f00e1fce 100644
--- a/src/include/utils/backend_status.h
+++ b/src/include/utils/backend_status.h
@@ -375,6 +375,7 @@ extern void pgstat_bestart(void);
extern void pgstat_clear_backend_activity_snapshot(void);
/* Activity reporting functions */
+typedef struct PgStatIOPathOps PgStatIOPathOps;
static inline void
pgstat_inc_ioop(IOOp io_op, IOPath io_path)
@@ -402,6 +403,7 @@ pgstat_inc_ioop(IOOp io_op, IOPath io_path)
}
}
extern void pgstat_report_activity(BackendState state, const char *cmd_str);
+extern void pgstat_report_live_backend_io_path_ops(PgStatIOPathOps *backend_io_path_ops);
extern void pgstat_report_query_id(uint64 query_id, bool force);
extern void pgstat_report_tempfile(size_t filesize);
extern void pgstat_report_appname(const char *appname);
--
2.30.2
v20-0008-small-comment-correction.patch (text/x-patch)
From 7c367848f63d099f8f95ecdab516e911b4e69b3a Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 24 Nov 2021 12:21:08 -0500
Subject: [PATCH v20 8/8] small comment correction
Naming callers in function comment is brittle and unnecessary.
---
src/backend/utils/activity/backend_status.c | 11 +++++------
1 file changed, 5 insertions(+), 6 deletions(-)
diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index 08e3f8f167..d508b819b1 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -254,11 +254,11 @@ GetIOPathDesc(IOPath io_path)
}
/*
- * Initialize pgstats backend activity state, and set up our on-proc-exit
- * hook. Called from InitPostgres and AuxiliaryProcessMain. For auxiliary
- * process, MyBackendId is invalid. Otherwise, MyBackendId must be set, but we
- * must not have started any transaction yet (since the exit hook must run
- * after the last transaction exit).
+ * Initialize pgstats backend activity state, and set up our on-proc-exit hook.
+ *
+ * For auxiliary process, MyBackendId is invalid. Otherwise, MyBackendId must
+ * be set, but we must not have started any transaction yet (since the exit
+ * hook must run after the last transaction exit).
*
* NOTE: MyDatabaseId isn't set yet; so the shutdown hook has to be careful.
*/
@@ -296,7 +296,6 @@ pgstat_beinit(void)
* pgstat_bestart() -
*
* Initialize this backend's entry in the PgBackendStatus array.
- * Called from InitPostgres.
*
* Apart from auxiliary processes, MyBackendId, MyDatabaseId,
* session userid, and application_name must be set for a
--
2.30.2
Attachment: v20-0004-Send-IO-operations-to-stats-collector.patch (text/x-patch)

From 51c5118ae8b189204c7f594c57d816a428f70da3 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 24 Nov 2021 11:16:55 -0500
Subject: [PATCH v20 4/8] Send IO operations to stats collector
On exit, backends send the IO operations they have done on all IO Paths
to the stats collector. The stats collector adds these counts to its
existing counts stored in a global data structure it maintains and
persists.
PgStatIOOpCounters contains the same information as backend_status.h's
IOOpCounters; however, IOOpCounters' members must be atomics, a
requirement the stats collector does not have.
Suggested by Andres Freund
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
src/backend/postmaster/pgstat.c | 100 ++++++++++++++++++++++++++++-
src/include/miscadmin.h | 2 +
src/include/pgstat.h | 56 ++++++++++++++++
src/include/utils/backend_status.h | 37 +++++++++++
4 files changed, 194 insertions(+), 1 deletion(-)
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 7264d2c727..11db91f62b 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -126,7 +126,7 @@ char *pgstat_stat_filename = NULL;
char *pgstat_stat_tmpname = NULL;
/*
- * BgWriter and WAL global statistics counters.
+ * BgWriter, Checkpointer, and WAL global statistics counters.
* Stored directly in a stats message structure so they can be sent
* without needing to copy things around. We assume these init to zeroes.
*/
@@ -369,6 +369,7 @@ static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
static void pgstat_recv_checkpointer(PgStat_MsgCheckpointer *msg, int len);
+static void pgstat_recv_io_path_ops(PgStat_MsgIOPathOps *msg, int len);
static void pgstat_recv_wal(PgStat_MsgWal *msg, int len);
static void pgstat_recv_slru(PgStat_MsgSLRU *msg, int len);
static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
@@ -3152,6 +3153,14 @@ pgstat_shutdown_hook(int code, Datum arg)
{
Assert(!pgstat_is_shutdown);
+ /*
+ * Only need to send stats on IOOps for IOPaths when a process exits. Users
+ * requiring IOOps for both live and exited backends can read from live
+ * backends' PgBackendStatuses and sum this with totals from exited
+ * backends persisted by the stats collector.
+ */
+ pgstat_send_buffers();
+
/*
* If we got as far as discovering our own database ID, we can report what
* we did to the collector. Otherwise, we'd be sending an invalid
@@ -3301,6 +3310,46 @@ pgstat_send_bgwriter(void)
MemSet(&PendingBgWriterStats, 0, sizeof(PendingBgWriterStats));
}
+/*
+ * Before exiting, a backend sends its IO operations statistics to the
+ * collector so that they may be persisted.
+ */
+void
+pgstat_send_buffers(void)
+{
+ PgStatIOOpCounters *io_path_ops;
+ PgStat_MsgIOPathOps msg;
+
+ PgBackendStatus *beentry = MyBEEntry;
+ PgStat_Counter sum = 0;
+
+ if (!beentry || beentry->st_backendType == B_INVALID)
+ return;
+
+ memset(&msg, 0, sizeof(msg));
+ msg.backend_type = beentry->st_backendType;
+
+ io_path_ops = msg.iop.io_path_ops;
+ pgstat_sum_io_path_ops(io_path_ops, (IOOpCounters *)
+ &beentry->io_path_stats);
+
+ /* If no IO was done, don't bother sending anything to the stats collector. */
+ for (int i = 0; i < IOPATH_NUM_TYPES; i++)
+ {
+ sum += io_path_ops[i].allocs;
+ sum += io_path_ops[i].extends;
+ sum += io_path_ops[i].fsyncs;
+ sum += io_path_ops[i].writes;
+ }
+
+ if (sum == 0)
+ return;
+
+ pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_IO_PATH_OPS);
+ pgstat_send(&msg, sizeof(msg));
+}
+
+
/* ----------
* pgstat_send_checkpointer() -
*
@@ -3483,6 +3532,29 @@ pgstat_send_subscription_purge(PgStat_MsgSubscriptionPurge *msg)
pgstat_send(msg, len);
}
+/*
+ * Helper function to sum all IO operations stats for all IOPaths (e.g. shared,
+ * local) from live backends with those in the equivalent stats structure for
+ * exited backends.
+ * Note that this adds and doesn't set, so the destination stats structure
+ * should be zeroed out by the caller initially.
+ * This would commonly be used to transfer all IOOp stats for all IOPaths for a
+ * particular backend type to the pgstats structure.
+ */
+void
+pgstat_sum_io_path_ops(PgStatIOOpCounters *dest, IOOpCounters *src)
+{
+ for (int i = 0; i < IOPATH_NUM_TYPES; i++)
+ {
+ dest->allocs += pg_atomic_read_u64(&src->allocs);
+ dest->extends += pg_atomic_read_u64(&src->extends);
+ dest->fsyncs += pg_atomic_read_u64(&src->fsyncs);
+ dest->writes += pg_atomic_read_u64(&src->writes);
+ dest++;
+ src++;
+ }
+}
+
/* ----------
* PgstatCollectorMain() -
*
@@ -3692,6 +3764,10 @@ PgstatCollectorMain(int argc, char *argv[])
pgstat_recv_checkpointer(&msg.msg_checkpointer, len);
break;
+ case PGSTAT_MTYPE_IO_PATH_OPS:
+ pgstat_recv_io_path_ops(&msg.msg_io_path_ops, len);
+ break;
+
case PGSTAT_MTYPE_WAL:
pgstat_recv_wal(&msg.msg_wal, len);
break;
@@ -5813,6 +5889,28 @@ pgstat_recv_checkpointer(PgStat_MsgCheckpointer *msg, int len)
globalStats.checkpointer.buf_fsync_backend += msg->m_buf_fsync_backend;
}
+static void
+pgstat_recv_io_path_ops(PgStat_MsgIOPathOps *msg, int len)
+{
+ PgStatIOOpCounters *src_io_path_ops;
+ PgStatIOOpCounters *dest_io_path_ops;
+
+ src_io_path_ops = msg->iop.io_path_ops;
+ dest_io_path_ops =
+ globalStats.buffers.ops[backend_type_get_idx(msg->backend_type)].io_path_ops;
+
+ for (int i = 0; i < IOPATH_NUM_TYPES; i++)
+ {
+ PgStatIOOpCounters *src = &src_io_path_ops[i];
+ PgStatIOOpCounters *dest = &dest_io_path_ops[i];
+
+ dest->allocs += src->allocs;
+ dest->extends += src->extends;
+ dest->fsyncs += src->fsyncs;
+ dest->writes += src->writes;
+ }
+}
+
/* ----------
* pgstat_recv_wal() -
*
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 1d688f9e51..85fe522780 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -339,6 +339,8 @@ typedef enum BackendType
B_WAL_WRITER,
} BackendType;
+#define BACKEND_NUM_TYPES B_WAL_WRITER
+
extern BackendType MyBackendType;
extern const char *GetBackendTypeDesc(BackendType backendType);
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 5b51b58e5a..23496afffc 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -73,6 +73,7 @@ typedef enum StatMsgType
PGSTAT_MTYPE_ARCHIVER,
PGSTAT_MTYPE_BGWRITER,
PGSTAT_MTYPE_CHECKPOINTER,
+ PGSTAT_MTYPE_IO_PATH_OPS,
PGSTAT_MTYPE_WAL,
PGSTAT_MTYPE_SLRU,
PGSTAT_MTYPE_FUNCSTAT,
@@ -335,6 +336,48 @@ typedef struct PgStat_MsgDropdb
} PgStat_MsgDropdb;
+/*
+ * Structure for counting all types of IOOps in the stats collector
+ */
+typedef struct PgStatIOOpCounters
+{
+ PgStat_Counter allocs;
+ PgStat_Counter extends;
+ PgStat_Counter fsyncs;
+ PgStat_Counter writes;
+} PgStatIOOpCounters;
+
+/*
+ * Structure for counting all IOOps on all types of IOPaths.
+ */
+typedef struct PgStatIOPathOps
+{
+ PgStatIOOpCounters io_path_ops[IOPATH_NUM_TYPES];
+} PgStatIOPathOps;
+
+/*
+ * Sent by a backend to the stats collector to report all IOOps for all IOPaths
+ * for a given type of a backend. This will happen when the backend exits.
+ */
+typedef struct PgStat_MsgIOPathOps
+{
+ PgStat_MsgHdr m_hdr;
+
+ BackendType backend_type;
+ PgStatIOPathOps iop;
+} PgStat_MsgIOPathOps;
+
+/*
+ * Structure used by stats collector to keep track of all types of exited
+ * backends' IOOps for all IOPaths as well as all stats from live backends at
+ * the time of stats reset. resets is populated using a reset message sent to
+ * the stats collector.
+ */
+typedef struct PgStat_BackendIOPathOps
+{
+ PgStatIOPathOps ops[BACKEND_NUM_TYPES];
+} PgStat_BackendIOPathOps;
+
/* ----------
* PgStat_MsgResetcounter Sent by the backend to tell the collector
* to reset counters
@@ -756,6 +799,7 @@ typedef union PgStat_Msg
PgStat_MsgArchiver msg_archiver;
PgStat_MsgBgWriter msg_bgwriter;
PgStat_MsgCheckpointer msg_checkpointer;
+ PgStat_MsgIOPathOps msg_io_path_ops;
PgStat_MsgWal msg_wal;
PgStat_MsgSLRU msg_slru;
PgStat_MsgFuncstat msg_funcstat;
@@ -939,6 +983,7 @@ typedef struct PgStat_GlobalStats
PgStat_CheckpointerStats checkpointer;
PgStat_BgWriterStats bgwriter;
+ PgStat_BackendIOPathOps buffers;
} PgStat_GlobalStats;
/*
@@ -1215,8 +1260,19 @@ extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
extern void pgstat_send_archiver(const char *xlog, bool failed);
extern void pgstat_send_bgwriter(void);
+/*
+ * While some processes send some types of statistics to the collector at
+ * regular intervals (e.g. CheckpointerMain() calling
+ * pgstat_send_checkpointer()), IO operations stats are only sent by
+ * pgstat_send_buffers() when a process exits (in pgstat_shutdown_hook()). IO
+ * operations stats from live backends can be read from their PgBackendStatuses
+ * and, if desired, summed with totals from exited backends persisted by the
+ * stats collector.
+ */
+extern void pgstat_send_buffers(void);
extern void pgstat_send_checkpointer(void);
extern void pgstat_send_wal(bool force);
+extern void pgstat_sum_io_path_ops(PgStatIOOpCounters *dest, IOOpCounters *src);
/* ----------
* Support functions for the SQL-callable functions to
diff --git a/src/include/utils/backend_status.h b/src/include/utils/backend_status.h
index 56a0f25296..ae4597c5fe 100644
--- a/src/include/utils/backend_status.h
+++ b/src/include/utils/backend_status.h
@@ -331,6 +331,43 @@ extern void CreateSharedBackendStatus(void);
* ----------
*/
+/* Utility functions */
+
+/*
+ * When maintaining an array of information about all valid BackendTypes, in
+ * order to avoid wasting the 0th spot, use this helper to convert a valid
+ * BackendType to a valid location in the array (given that no spot is
+ * maintained for B_INVALID BackendType).
+ */
+static inline int backend_type_get_idx(BackendType backend_type)
+{
+ /*
+ * backend_type must be one of the valid backend types. If caller is
+ * maintaining backend information in an array that includes B_INVALID,
+ * this function is unnecessary.
+ */
+ Assert(backend_type > B_INVALID && backend_type <= BACKEND_NUM_TYPES);
+ return backend_type - 1;
+}
+
+/*
+ * When using a value from an array of information about all valid
+ * BackendTypes, add 1 to the index before using it as a BackendType to adjust
+ * for not maintaining a spot for B_INVALID BackendType.
+ */
+static inline BackendType idx_get_backend_type(int idx)
+{
+ int backend_type = idx + 1;
+ /*
+ * If the array includes a spot for B_INVALID BackendType this function is
+ * not required.
+ */
+ Assert(backend_type > B_INVALID && backend_type <= BACKEND_NUM_TYPES);
+ return backend_type;
+}
+
+extern const char *GetIOPathDesc(IOPath io_path);
+
/* Initialization functions */
extern void pgstat_beinit(void);
extern void pgstat_bestart(void);
--
2.30.2
Attachment: v20-0007-Remove-superfluous-bgwriter-stats-code.patch (text/x-patch)
From 147d4f64a8488ac03b98b4ae7a8a4ded3072b2ed Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 3 Jan 2022 15:01:10 -0500
Subject: [PATCH v20 7/8] Remove superfluous bgwriter stats code
After adding io_path_stats to PgBackendStatus, all backends keep track
of all IO done on all types of IO paths. When backends exit, they send
their IO operations stats to the stats collector to be persisted.
These statistics are available in the pg_stat_buffers view, making the
buffers_checkpoint, buffers_clean, buffers_backend,
buffers_backend_fsync, and buffers_alloc columns in the pg_stat_bgwriter
view redundant.
In order to maintain backward compatibility, these columns in
pg_stat_bgwriter remain and are derived from the pg_stat_buffers view.
The structs used to track the statistics for these columns in the
pg_stat_bgwriter view and the functions querying them have been removed.
Additionally, since the "buffers" stats reset target resets both the IO
operations stats structs and the bgwriter stats structs, the
stat_reset_timestamp member of the bgwriter stats struct is no longer
needed. The stats_reset column in the pg_stat_bgwriter view is instead
derived from pg_stat_buffers as well.
---
src/backend/catalog/system_views.sql | 28 ++++++++++-----------
src/backend/postmaster/checkpointer.c | 29 ++-------------------
src/backend/postmaster/pgstat.c | 7 ------
src/backend/storage/buffer/bufmgr.c | 6 -----
src/backend/utils/adt/pgstatfuncs.c | 36 ---------------------------
src/include/catalog/pg_proc.dat | 26 -------------------
src/include/pgstat.h | 11 --------
src/test/regress/expected/rules.out | 24 +++++++++++++-----
8 files changed, 34 insertions(+), 133 deletions(-)
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index e214d23056..b992d9300a 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1062,20 +1062,6 @@ CREATE VIEW pg_stat_archiver AS
s.stats_reset
FROM pg_stat_get_archiver() s;
-CREATE VIEW pg_stat_bgwriter AS
- SELECT
- pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints_timed,
- pg_stat_get_bgwriter_requested_checkpoints() AS checkpoints_req,
- pg_stat_get_checkpoint_write_time() AS checkpoint_write_time,
- pg_stat_get_checkpoint_sync_time() AS checkpoint_sync_time,
- pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint,
- pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean,
- pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
- pg_stat_get_buf_written_backend() AS buffers_backend,
- pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
- pg_stat_get_buf_alloc() AS buffers_alloc,
- pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
-
CREATE VIEW pg_stat_buffers AS
SELECT
b.backend_type,
@@ -1087,6 +1073,20 @@ SELECT
b.stats_reset
FROM pg_stat_get_buffers() b;
+CREATE VIEW pg_stat_bgwriter AS
+ SELECT
+ pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints_timed,
+ pg_stat_get_bgwriter_requested_checkpoints() AS checkpoints_req,
+ pg_stat_get_checkpoint_write_time() AS checkpoint_write_time,
+ pg_stat_get_checkpoint_sync_time() AS checkpoint_sync_time,
+ (SELECT write FROM pg_stat_buffers WHERE backend_type = 'checkpointer' AND io_path = 'shared') AS buffers_checkpoint,
+ (SELECT write FROM pg_stat_buffers WHERE backend_type = 'background writer' AND io_path = 'shared') AS buffers_clean,
+ pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
+ (SELECT write FROM pg_stat_buffers WHERE backend_type = 'client backend' AND io_path = 'shared') AS buffers_backend,
+ (SELECT fsync FROM pg_stat_buffers WHERE backend_type = 'client backend' AND io_path = 'shared') AS buffers_backend_fsync,
+ (SELECT alloc FROM pg_stat_buffers WHERE backend_type = 'client backend' AND io_path = 'shared') AS buffers_alloc,
+ (SELECT stats_reset FROM pg_stat_buffers LIMIT 1) AS stats_reset;
+
CREATE VIEW pg_stat_wal AS
SELECT
w.wal_records,
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 8440b2b802..b9c3745474 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -90,17 +90,9 @@
* requesting backends since the last checkpoint start. The flags are
* chosen so that OR'ing is the correct way to combine multiple requests.
*
- * num_backend_writes is used to count the number of buffer writes performed
- * by user backend processes. This counter should be wide enough that it
- * can't overflow during a single processing cycle. num_backend_fsync
- * counts the subset of those writes that also had to do their own fsync,
- * because the checkpointer failed to absorb their request.
- *
* The requests array holds fsync requests sent by backends and not yet
* absorbed by the checkpointer.
*
- * Unlike the checkpoint fields, num_backend_writes, num_backend_fsync, and
- * the requests fields are protected by CheckpointerCommLock.
*----------
*/
typedef struct
@@ -124,9 +116,6 @@ typedef struct
ConditionVariable start_cv; /* signaled when ckpt_started advances */
ConditionVariable done_cv; /* signaled when ckpt_done advances */
- uint32 num_backend_writes; /* counts user backend buffer writes */
- uint32 num_backend_fsync; /* counts user backend fsync calls */
-
int num_requests; /* current # of requests */
int max_requests; /* allocated array size */
CheckpointerRequest requests[FLEXIBLE_ARRAY_MEMBER];
@@ -1081,10 +1070,6 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
- /* Count all backend writes regardless of if they fit in the queue */
- if (!AmBackgroundWriterProcess())
- CheckpointerShmem->num_backend_writes++;
-
/*
* If the checkpointer isn't running or the request queue is full, the
* backend will have to perform its own fsync request. But before forcing
@@ -1094,13 +1079,12 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
(CheckpointerShmem->num_requests >= CheckpointerShmem->max_requests &&
!CompactCheckpointerRequestQueue()))
{
+ LWLockRelease(CheckpointerCommLock);
+
/*
* Count the subset of writes where backends have to do their own
* fsync
*/
- if (!AmBackgroundWriterProcess())
- CheckpointerShmem->num_backend_fsync++;
- LWLockRelease(CheckpointerCommLock);
pgstat_inc_ioop(IOOP_FSYNC, IOPATH_SHARED);
return false;
}
@@ -1257,15 +1241,6 @@ AbsorbSyncRequests(void)
LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
- /* Transfer stats counts into pending pgstats message */
- PendingCheckpointerStats.m_buf_written_backend
- += CheckpointerShmem->num_backend_writes;
- PendingCheckpointerStats.m_buf_fsync_backend
- += CheckpointerShmem->num_backend_fsync;
-
- CheckpointerShmem->num_backend_writes = 0;
- CheckpointerShmem->num_backend_fsync = 0;
-
/*
* We try to avoid holding the lock for a long time by copying the request
* array, and processing the requests after releasing the lock.
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index f4bff26630..0b52533140 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4473,7 +4473,6 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
* existing statsfile).
*/
ts = GetCurrentTimestamp();
- globalStats.bgwriter.stat_reset_timestamp = ts;
globalStats.buffers.stat_reset_timestamp = ts;
archiverStats.stat_reset_timestamp = ts;
walStats.stat_reset_timestamp = ts;
@@ -5672,7 +5671,6 @@ pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
memset(&globalStats.bgwriter, 0, sizeof(globalStats.bgwriter));
memset(&globalStats.checkpointer, 0, sizeof(globalStats.checkpointer));
memset(&globalStats.buffers.ops, 0, sizeof(globalStats.buffers.ops));
- globalStats.bgwriter.stat_reset_timestamp = GetCurrentTimestamp();
globalStats.buffers.stat_reset_timestamp = GetCurrentTimestamp();
memcpy(&globalStats.buffers.resets[backend_type_get_idx(backend_type)],
&msg->m_backend_resets.iop.io_path_ops,
@@ -5944,9 +5942,7 @@ pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
static void
pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
{
- globalStats.bgwriter.buf_written_clean += msg->m_buf_written_clean;
globalStats.bgwriter.maxwritten_clean += msg->m_maxwritten_clean;
- globalStats.bgwriter.buf_alloc += msg->m_buf_alloc;
}
/* ----------
@@ -5962,9 +5958,6 @@ pgstat_recv_checkpointer(PgStat_MsgCheckpointer *msg, int len)
globalStats.checkpointer.requested_checkpoints += msg->m_requested_checkpoints;
globalStats.checkpointer.checkpoint_write_time += msg->m_checkpoint_write_time;
globalStats.checkpointer.checkpoint_sync_time += msg->m_checkpoint_sync_time;
- globalStats.checkpointer.buf_written_checkpoints += msg->m_buf_written_checkpoints;
- globalStats.checkpointer.buf_written_backend += msg->m_buf_written_backend;
- globalStats.checkpointer.buf_fsync_backend += msg->m_buf_fsync_backend;
}
static void
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 528995e7fb..6ec15ea00e 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2165,7 +2165,6 @@ BufferSync(int flags)
if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
{
TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
- PendingCheckpointerStats.m_buf_written_checkpoints++;
num_written++;
}
}
@@ -2274,9 +2273,6 @@ BgBufferSync(WritebackContext *wb_context)
*/
strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
- /* Report buffer alloc counts to pgstat */
- PendingBgWriterStats.m_buf_alloc += recent_alloc;
-
/*
* If we're not running the LRU scan, just stop after doing the stats
* stuff. We mark the saved state invalid so that we can recover sanely
@@ -2473,8 +2469,6 @@ BgBufferSync(WritebackContext *wb_context)
reusable_buffers++;
}
- PendingBgWriterStats.m_buf_written_clean += num_written;
-
#ifdef BGW_DEBUG
elog(DEBUG1, "bgwriter: recent_alloc=%u smoothed=%.2f delta=%ld ahead=%d density=%.2f reusable_est=%d upcoming_est=%d scanned=%d wrote=%d reusable=%d",
recent_alloc, smoothed_alloc, strategy_delta, bufs_ahead,
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index c16dc3ba9a..b25a688e17 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1738,18 +1738,6 @@ pg_stat_get_bgwriter_requested_checkpoints(PG_FUNCTION_ARGS)
PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->requested_checkpoints);
}
-Datum
-pg_stat_get_bgwriter_buf_written_checkpoints(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_written_checkpoints);
-}
-
-Datum
-pg_stat_get_bgwriter_buf_written_clean(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_written_clean);
-}
-
Datum
pg_stat_get_bgwriter_maxwritten_clean(PG_FUNCTION_ARGS)
{
@@ -1772,30 +1760,6 @@ pg_stat_get_checkpoint_sync_time(PG_FUNCTION_ARGS)
pgstat_fetch_stat_checkpointer()->checkpoint_sync_time);
}
-Datum
-pg_stat_get_bgwriter_stat_reset_time(PG_FUNCTION_ARGS)
-{
- PG_RETURN_TIMESTAMPTZ(pgstat_fetch_stat_bgwriter()->stat_reset_timestamp);
-}
-
-Datum
-pg_stat_get_buf_written_backend(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_written_backend);
-}
-
-Datum
-pg_stat_get_buf_fsync_backend(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_fsync_backend);
-}
-
-Datum
-pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
-}
-
/*
* When adding a new column to the pg_stat_buffers view, add a new enum
* value here above BUFFERS_NUM_COLUMNS.
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 24d1fc29a2..bbc613eddf 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5594,25 +5594,11 @@
proname => 'pg_stat_get_bgwriter_requested_checkpoints', provolatile => 's',
proparallel => 'r', prorettype => 'int8', proargtypes => '',
prosrc => 'pg_stat_get_bgwriter_requested_checkpoints' },
-{ oid => '2771',
- descr => 'statistics: number of buffers written by the bgwriter during checkpoints',
- proname => 'pg_stat_get_bgwriter_buf_written_checkpoints', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_bgwriter_buf_written_checkpoints' },
-{ oid => '2772',
- descr => 'statistics: number of buffers written by the bgwriter for cleaning dirty buffers',
- proname => 'pg_stat_get_bgwriter_buf_written_clean', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_bgwriter_buf_written_clean' },
{ oid => '2773',
descr => 'statistics: number of times the bgwriter stopped processing when it had written too many buffers while cleaning',
proname => 'pg_stat_get_bgwriter_maxwritten_clean', provolatile => 's',
proparallel => 'r', prorettype => 'int8', proargtypes => '',
prosrc => 'pg_stat_get_bgwriter_maxwritten_clean' },
-{ oid => '3075', descr => 'statistics: last reset for the bgwriter',
- proname => 'pg_stat_get_bgwriter_stat_reset_time', provolatile => 's',
- proparallel => 'r', prorettype => 'timestamptz', proargtypes => '',
- prosrc => 'pg_stat_get_bgwriter_stat_reset_time' },
{ oid => '3160',
descr => 'statistics: checkpoint time spent writing buffers to disk, in milliseconds',
proname => 'pg_stat_get_checkpoint_write_time', provolatile => 's',
@@ -5623,18 +5609,6 @@
proname => 'pg_stat_get_checkpoint_sync_time', provolatile => 's',
proparallel => 'r', prorettype => 'float8', proargtypes => '',
prosrc => 'pg_stat_get_checkpoint_sync_time' },
-{ oid => '2775', descr => 'statistics: number of buffers written by backends',
- proname => 'pg_stat_get_buf_written_backend', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_buf_written_backend' },
-{ oid => '3063',
- descr => 'statistics: number of backend buffer writes that did their own fsync',
- proname => 'pg_stat_get_buf_fsync_backend', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_buf_fsync_backend' },
-{ oid => '2859', descr => 'statistics: number of buffer allocations',
- proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
- prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
{ oid => '8459', descr => 'statistics: counts of all IO operations done through all IO paths by each type of backend.',
proname => 'pg_stat_get_buffers', provolatile => 's', proisstrict => 'f',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index d0700e6efe..b17edd8679 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -522,9 +522,7 @@ typedef struct PgStat_MsgBgWriter
{
PgStat_MsgHdr m_hdr;
- PgStat_Counter m_buf_written_clean;
PgStat_Counter m_maxwritten_clean;
- PgStat_Counter m_buf_alloc;
} PgStat_MsgBgWriter;
/* ----------
@@ -537,9 +535,6 @@ typedef struct PgStat_MsgCheckpointer
PgStat_Counter m_timed_checkpoints;
PgStat_Counter m_requested_checkpoints;
- PgStat_Counter m_buf_written_checkpoints;
- PgStat_Counter m_buf_written_backend;
- PgStat_Counter m_buf_fsync_backend;
PgStat_Counter m_checkpoint_write_time; /* times in milliseconds */
PgStat_Counter m_checkpoint_sync_time;
} PgStat_MsgCheckpointer;
@@ -970,10 +965,7 @@ typedef struct PgStat_ArchiverStats
*/
typedef struct PgStat_BgWriterStats
{
- PgStat_Counter buf_written_clean;
PgStat_Counter maxwritten_clean;
- PgStat_Counter buf_alloc;
- TimestampTz stat_reset_timestamp;
} PgStat_BgWriterStats;
/*
@@ -986,9 +978,6 @@ typedef struct PgStat_CheckpointerStats
PgStat_Counter requested_checkpoints;
PgStat_Counter checkpoint_write_time; /* times in milliseconds */
PgStat_Counter checkpoint_sync_time;
- PgStat_Counter buf_written_checkpoints;
- PgStat_Counter buf_written_backend;
- PgStat_Counter buf_fsync_backend;
} PgStat_CheckpointerStats;
/*
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 5869ce442f..9736c1fe37 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1821,13 +1821,25 @@ pg_stat_bgwriter| SELECT pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints
pg_stat_get_bgwriter_requested_checkpoints() AS checkpoints_req,
pg_stat_get_checkpoint_write_time() AS checkpoint_write_time,
pg_stat_get_checkpoint_sync_time() AS checkpoint_sync_time,
- pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint,
- pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean,
+ ( SELECT pg_stat_buffers.write
+ FROM pg_stat_buffers
+ WHERE ((pg_stat_buffers.backend_type = 'checkpointer'::text) AND (pg_stat_buffers.io_path = 'shared'::text))) AS buffers_checkpoint,
+ ( SELECT pg_stat_buffers.write
+ FROM pg_stat_buffers
+ WHERE ((pg_stat_buffers.backend_type = 'background writer'::text) AND (pg_stat_buffers.io_path = 'shared'::text))) AS buffers_clean,
pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
- pg_stat_get_buf_written_backend() AS buffers_backend,
- pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
- pg_stat_get_buf_alloc() AS buffers_alloc,
- pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+ ( SELECT pg_stat_buffers.write
+ FROM pg_stat_buffers
+ WHERE ((pg_stat_buffers.backend_type = 'client backend'::text) AND (pg_stat_buffers.io_path = 'shared'::text))) AS buffers_backend,
+ ( SELECT pg_stat_buffers.fsync
+ FROM pg_stat_buffers
+ WHERE ((pg_stat_buffers.backend_type = 'client backend'::text) AND (pg_stat_buffers.io_path = 'shared'::text))) AS buffers_backend_fsync,
+ ( SELECT pg_stat_buffers.alloc
+ FROM pg_stat_buffers
+ WHERE ((pg_stat_buffers.backend_type = 'client backend'::text) AND (pg_stat_buffers.io_path = 'shared'::text))) AS buffers_alloc,
+ ( SELECT pg_stat_buffers.stats_reset
+ FROM pg_stat_buffers
+ LIMIT 1) AS stats_reset;
pg_stat_buffers| SELECT b.backend_type,
b.io_path,
b.alloc,
--
2.30.2
Attachment: v20-0003-Add-IO-operation-counters-to-PgBackendStatus.patch (text/x-patch)
From 37fa6f0503e6d959e70b70e8a936dee7a7fc4c0f Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 24 Nov 2021 10:32:56 -0500
Subject: [PATCH v20 3/8] Add IO operation counters to PgBackendStatus
Add an array of counters to PgBackendStatus which count the buffers
allocated, extended, fsynced, and written by a given backend.
Each "IO Op" (alloc, fsync, extend, write) is counted per "IO Path"
(direct, local, shared, or strategy).
"local" and "shared" IO Path counters count operations on local and
shared buffers.
The "strategy" IO Path counts buffers alloc'd/written/read/fsync'd as
part of a BufferAccessStrategy.
The "direct" IO Path counts blocks of IO which are read, written, or
fsync'd using smgrwrite/extend/immedsync directly (as opposed to through
[Local]BufferAlloc()).
With this commit, backends increment a counter in the array in their
PgBackendStatus when performing an IO operation.
Future patches will persist the IO stat counters from a backend's
PgBackendStatus upon backend exit and use the counters to provide
observability of database IO operations.
Note that this commit does not add code to increment the "direct" path.
A future patch adding wrappers for smgrwrite(), smgrextend(), and
smgrimmedsync() would provide a good location to call pgstat_inc_ioop()
for unbuffered IO and avoid regressions for future users of these
functions.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
src/backend/postmaster/checkpointer.c | 1 +
src/backend/storage/buffer/bufmgr.c | 47 +++++++++++---
src/backend/storage/buffer/freelist.c | 22 ++++++-
src/backend/storage/buffer/localbuf.c | 3 +
src/backend/storage/sync/sync.c | 1 +
src/backend/utils/activity/backend_status.c | 10 +++
src/include/storage/buf_internals.h | 4 +-
src/include/utils/backend_status.h | 68 +++++++++++++++++++++
8 files changed, 141 insertions(+), 15 deletions(-)
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 25a18b7a14..8440b2b802 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -1101,6 +1101,7 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
if (!AmBackgroundWriterProcess())
CheckpointerShmem->num_backend_fsync++;
LWLockRelease(CheckpointerCommLock);
+ pgstat_inc_ioop(IOOP_FSYNC, IOPATH_SHARED);
return false;
}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index b4532948d3..528995e7fb 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -480,7 +480,7 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
BlockNumber blockNum,
BufferAccessStrategy strategy,
bool *foundPtr);
-static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+static void FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOPath iopath);
static void FindAndDropRelFileNodeBuffers(RelFileNode rnode,
ForkNumber forkNum,
BlockNumber nForkBlock,
@@ -972,6 +972,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (isExtend)
{
+ pgstat_inc_ioop(IOOP_EXTEND, IOPATH_SHARED);
/* new buffers are zero-filled */
MemSet((char *) bufBlock, 0, BLCKSZ);
/* don't set checksum for all-zero page */
@@ -1172,6 +1173,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
/* Loop here in case we have to try another victim buffer */
for (;;)
{
+ bool from_ring;
+
/*
* Ensure, while the spinlock's not yet held, that there's a free
* refcount entry.
@@ -1182,7 +1185,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* Select a victim buffer. The buffer is returned with its header
* spinlock still held!
*/
- buf = StrategyGetBuffer(strategy, &buf_state);
+ buf = StrategyGetBuffer(strategy, &buf_state, &from_ring);
Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
@@ -1219,6 +1222,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf),
LW_SHARED))
{
+ IOPath iopath;
+
/*
* If using a nondefault strategy, and writing the buffer
* would require a WAL flush, let the strategy decide whether
@@ -1236,7 +1241,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
UnlockBufHdr(buf, buf_state);
if (XLogNeedsFlush(lsn) &&
- StrategyRejectBuffer(strategy, buf))
+ StrategyRejectBuffer(strategy, buf, &from_ring))
{
/* Drop lock/pin and loop around for another buffer */
LWLockRelease(BufferDescriptorGetContentLock(buf));
@@ -1245,13 +1250,27 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
}
}
+ /*
+ * When a strategy is in use, if the dirty buffer was selected
+ * from the strategy ring and we did not bother checking the
+ * freelist or doing a clock sweep to look for a clean shared
+ * buffer to use, the write will be counted as a strategy
+ * write. However, if the dirty buffer was obtained from the
+ * freelist or a clock sweep, it is counted as a regular write.
+ *
+ * When a strategy is not in use, at this point the write can
+ * only be a "regular" write of a dirty buffer.
+ */
+
+ iopath = from_ring ? IOPATH_STRATEGY : IOPATH_SHARED;
+
/* OK, do the I/O */
TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(forkNum, blockNum,
smgr->smgr_rnode.node.spcNode,
smgr->smgr_rnode.node.dbNode,
smgr->smgr_rnode.node.relNode);
- FlushBuffer(buf, NULL);
+ FlushBuffer(buf, NULL, iopath);
LWLockRelease(BufferDescriptorGetContentLock(buf));
ScheduleBufferTagForWriteback(&BackendWritebackContext,
@@ -2552,10 +2571,11 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
* Pin it, share-lock it, write it. (FlushBuffer will do nothing if the
* buffer is clean by the time we've locked it.)
*/
+
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOPATH_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
@@ -2803,9 +2823,12 @@ BufferGetTag(Buffer buffer, RelFileNode *rnode, ForkNumber *forknum,
*
* If the caller has an smgr reference for the buffer's relation, pass it
* as the second parameter. If not, pass NULL.
+ *
+ * IOPath will always be IOPATH_SHARED except when a buffer access strategy is
+ * used and the buffer being flushed is a buffer from the strategy ring.
*/
static void
-FlushBuffer(BufferDesc *buf, SMgrRelation reln)
+FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOPath iopath)
{
XLogRecPtr recptr;
ErrorContextCallback errcallback;
@@ -2897,6 +2920,8 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
bufToWrite,
false);
+ pgstat_inc_ioop(IOOP_WRITE, iopath);
+
if (track_io_timing)
{
INSTR_TIME_SET_CURRENT(io_time);
@@ -3533,6 +3558,8 @@ FlushRelationBuffers(Relation rel)
localpage,
false);
+ pgstat_inc_ioop(IOOP_WRITE, IOPATH_LOCAL);
+
buf_state &= ~(BM_DIRTY | BM_JUST_DIRTIED);
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
@@ -3568,7 +3595,7 @@ FlushRelationBuffers(Relation rel)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, RelationGetSmgr(rel));
+ FlushBuffer(bufHdr, RelationGetSmgr(rel), IOPATH_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
@@ -3664,7 +3691,7 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, srelent->srel);
+ FlushBuffer(bufHdr, srelent->srel, IOPATH_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
@@ -3720,7 +3747,7 @@ FlushDatabaseBuffers(Oid dbid)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOPATH_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
@@ -3747,7 +3774,7 @@ FlushOneBuffer(Buffer buffer)
Assert(LWLockHeldByMe(BufferDescriptorGetContentLock(bufHdr)));
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOPATH_SHARED);
}
/*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 6be80476db..45d73995b2 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -19,6 +19,7 @@
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/proc.h"
+#include "utils/backend_status.h"
#define INT_ACCESS_ONCE(var) ((int)(*((volatile int *)&(var))))
@@ -198,7 +199,7 @@ have_free_buffer(void)
* return the buffer with the buffer header spinlock still held.
*/
BufferDesc *
-StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
+StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_ring)
{
BufferDesc *buf;
int bgwprocno;
@@ -212,7 +213,8 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
if (strategy != NULL)
{
buf = GetBufferFromRing(strategy, buf_state);
- if (buf != NULL)
+ *from_ring = buf != NULL;
+ if (*from_ring)
return buf;
}
@@ -247,6 +249,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
* the rate of buffer consumption. Note that buffers recycled by a
* strategy object are intentionally not counted here.
*/
+ pgstat_inc_ioop(IOOP_ALLOC, IOPATH_SHARED);
pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);
/*
@@ -683,8 +686,14 @@ AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
* if this buffer should be written and re-used.
*/
bool
-StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
+StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool *from_ring)
{
+ /*
+ * Start by assuming that we will use the dirty buffer selected by
+ * StrategyGetBuffer().
+ */
+ *from_ring = true;
+
/* We only do this in bulkread mode */
if (strategy->btype != BAS_BULKREAD)
return false;
@@ -700,5 +709,12 @@ StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
*/
strategy->buffers[strategy->current] = InvalidBuffer;
+ /*
+ * Since we will not be writing out a dirty buffer from the ring, set
+ * from_ring to false so that the caller does not count this write as a
+ * "strategy write" and can do proper bookkeeping.
+ */
+ *from_ring = false;
+
return true;
}
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 04b3558ea3..f396a2b68d 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -20,6 +20,7 @@
#include "executor/instrument.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
+#include "utils/backend_status.h"
#include "utils/guc.h"
#include "utils/memutils.h"
#include "utils/resowner_private.h"
@@ -226,6 +227,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
localpage,
false);
+ pgstat_inc_ioop(IOOP_WRITE, IOPATH_LOCAL);
+
/* Mark not-dirty now in case we error out below */
buf_state &= ~BM_DIRTY;
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index d4083e8a56..3be06d5d5a 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -420,6 +420,7 @@ ProcessSyncRequests(void)
total_elapsed += elapsed;
processed++;
+ pgstat_inc_ioop(IOOP_FSYNC, IOPATH_SHARED);
if (log_checkpoints)
elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f ms",
processed,
diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index 7229598822..44c9a6e1a6 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -400,6 +400,16 @@ pgstat_bestart(void)
lbeentry.st_progress_command_target = InvalidOid;
lbeentry.st_query_id = UINT64CONST(0);
+ for (int i = 0; i < IOPATH_NUM_TYPES; i++)
+ {
+ IOOpCounters *io_ops = &lbeentry.io_path_stats[i];
+
+ pg_atomic_init_u64(&io_ops->allocs, 0);
+ pg_atomic_init_u64(&io_ops->extends, 0);
+ pg_atomic_init_u64(&io_ops->fsyncs, 0);
+ pg_atomic_init_u64(&io_ops->writes, 0);
+ }
+
/*
* we don't zero st_progress_param here to save cycles; nobody should
* examine it until st_progress_command has been set to something other
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 7c6653311a..15f5724fa1 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -310,10 +310,10 @@ extern void ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *
/* freelist.c */
extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
- uint32 *buf_state);
+ uint32 *buf_state, bool *from_ring);
extern void StrategyFreeBuffer(BufferDesc *buf);
extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
- BufferDesc *buf);
+ BufferDesc *buf, bool *from_ring);
extern int StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
extern void StrategyNotifyBgWriter(int bgwprocno);
diff --git a/src/include/utils/backend_status.h b/src/include/utils/backend_status.h
index 8042b817df..56a0f25296 100644
--- a/src/include/utils/backend_status.h
+++ b/src/include/utils/backend_status.h
@@ -13,6 +13,7 @@
#include "datatype/timestamp.h"
#include "libpq/pqcomm.h"
#include "miscadmin.h" /* for BackendType */
+#include "port/atomics.h"
#include "utils/backend_progress.h"
@@ -31,12 +32,47 @@ typedef enum BackendState
STATE_DISABLED
} BackendState;
+/* ----------
+ * IO Stats reporting utility types
+ * ----------
+ */
+
+typedef enum IOOp
+{
+ IOOP_ALLOC,
+ IOOP_EXTEND,
+ IOOP_FSYNC,
+ IOOP_WRITE,
+} IOOp;
+
+#define IOOP_NUM_TYPES (IOOP_WRITE + 1)
+
+typedef enum IOPath
+{
+ IOPATH_DIRECT,
+ IOPATH_LOCAL,
+ IOPATH_SHARED,
+ IOPATH_STRATEGY,
+} IOPath;
+
+#define IOPATH_NUM_TYPES (IOPATH_STRATEGY + 1)
/* ----------
* Shared-memory data structures
* ----------
*/
+/*
+ * Structure for counting all types of IOOp for a live backend.
+ */
+typedef struct IOOpCounters
+{
+ pg_atomic_uint64 allocs;
+ pg_atomic_uint64 extends;
+ pg_atomic_uint64 fsyncs;
+ pg_atomic_uint64 writes;
+} IOOpCounters;
+
/*
* PgBackendSSLStatus
*
@@ -168,6 +204,12 @@ typedef struct PgBackendStatus
/* query identifier, optionally computed using post_parse_analyze_hook */
uint64 st_query_id;
+
+ /*
+ * Stats on all IOOps for all IOPaths for this backend. These should be
+ * incremented whenever an IO Operation is performed.
+ */
+ IOOpCounters io_path_stats[IOPATH_NUM_TYPES];
} PgBackendStatus;
@@ -296,6 +338,32 @@ extern void pgstat_bestart(void);
extern void pgstat_clear_backend_activity_snapshot(void);
/* Activity reporting functions */
+
+static inline void
+pgstat_inc_ioop(IOOp io_op, IOPath io_path)
+{
+ IOOpCounters *io_ops;
+ PgBackendStatus *beentry = MyBEEntry;
+
+ Assert(beentry);
+
+ io_ops = &beentry->io_path_stats[io_path];
+ switch (io_op)
+ {
+ case IOOP_ALLOC:
+ pg_atomic_unlocked_inc_counter(&io_ops->allocs);
+ break;
+ case IOOP_EXTEND:
+ pg_atomic_unlocked_inc_counter(&io_ops->extends);
+ break;
+ case IOOP_FSYNC:
+ pg_atomic_unlocked_inc_counter(&io_ops->fsyncs);
+ break;
+ case IOOP_WRITE:
+ pg_atomic_unlocked_inc_counter(&io_ops->writes);
+ break;
+ }
+}
extern void pgstat_report_activity(BackendState state, const char *cmd_str);
extern void pgstat_report_query_id(uint64 query_id, bool force);
extern void pgstat_report_tempfile(size_t filesize);
--
2.30.2
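To make the bookkeeping in the patch above concrete: it amounts to a per-backend matrix of counters indexed by IO path and IO op. Below is a simplified, single-threaded sketch of that scheme. The enum member names mirror the patch, but the flat array layout and the function names (`inc_ioop`, `get_ioop`) are illustrative assumptions only; the real code keeps one `IOOpCounters` struct of `pg_atomic_uint64` fields per path inside `PgBackendStatus` and increments them with `pg_atomic_unlocked_inc_counter()`.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified analogues of the patch's IOOp and IOPath enums. */
typedef enum
{
	IOOP_ALLOC,
	IOOP_EXTEND,
	IOOP_FSYNC,
	IOOP_WRITE,
	IOOP_NUM_TYPES
} IOOp;

typedef enum
{
	IOPATH_DIRECT,
	IOPATH_LOCAL,
	IOPATH_SHARED,
	IOPATH_STRATEGY,
	IOPATH_NUM_TYPES
} IOPath;

/*
 * One counter per (path, op) pair. In the patch these live in the
 * backend's PgBackendStatus entry and are atomics so that other
 * backends can read them without tearing; here a plain array is
 * enough to show the shape of the accounting.
 */
static uint64_t io_stats[IOPATH_NUM_TYPES][IOOP_NUM_TYPES];

static void
inc_ioop(IOOp op, IOPath path)
{
	io_stats[path][op]++;
}

static uint64_t
get_ioop(IOOp op, IOPath path)
{
	return io_stats[path][op];
}
```

A call such as `inc_ioop(IOOP_WRITE, IOPATH_STRATEGY)` then corresponds to the `pgstat_inc_ioop(IOOP_WRITE, iopath)` call sites the patch adds in bufmgr.c and freelist.c.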
Attachment: v20-0001-Read-only-atomic-backend-write-function.patch
From b6ada2c3033b03dac6cbdddf0e29454f9d26d986 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 11 Oct 2021 16:15:06 -0400
Subject: [PATCH v20 1/8] Read-only atomic backend write function
For counters in shared memory which can be read by any backend but only
written to by one backend, an atomic is still needed to protect against
torn values; however, pg_atomic_fetch_add_u64() is overkill for
incrementing such counters. pg_atomic_unlocked_inc_counter() is a helper
function which can be used to increment these values safely without
unnecessary overhead.
Author: Thomas Munro <tmunro@postgresql.org>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://www.postgresql.org/message-id/CA%2BhUKGJ06d3h5JeOtAv4h52n0vG1jOPZxqMCn5FySJQUVZA32w%40mail.gmail.com
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
src/include/port/atomics.h | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/src/include/port/atomics.h b/src/include/port/atomics.h
index 856338f161..af30da32e5 100644
--- a/src/include/port/atomics.h
+++ b/src/include/port/atomics.h
@@ -519,6 +519,17 @@ pg_atomic_sub_fetch_u64(volatile pg_atomic_uint64 *ptr, int64 sub_)
return pg_atomic_sub_fetch_u64_impl(ptr, sub_);
}
+/*
+ * On modern systems this is really just *counter++. On some older systems
+ * there might be more to it (due to an inability to read and write 64-bit
+ * values atomically).
+ */
+static inline void
+pg_atomic_unlocked_inc_counter(pg_atomic_uint64 *counter)
+{
+ pg_atomic_write_u64(counter, pg_atomic_read_u64(counter) + 1);
+}
+
#undef INSIDE_ATOMICS_H
#endif /* ATOMICS_H */
--
2.30.2
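The helper in patch 0001 depends on each counter having exactly one writer; the atomic type is only there to prevent torn 64-bit loads and stores on platforms without native 64-bit atomicity, so no locked read-modify-write is needed. A standalone C11 sketch of the same idea (illustrative only; the actual helper is built on pg_atomic_read_u64()/pg_atomic_write_u64(), not <stdatomic.h>):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/*
 * Single-writer counter: only the owning process increments it, but
 * any process may read it. A relaxed load followed by a relaxed store
 * is safe here precisely because there is no concurrent writer; on
 * modern hardware this compiles to a plain increment.
 */
static _Atomic uint64_t counter;

static void
unlocked_inc(_Atomic uint64_t *c)
{
	atomic_store_explicit(c,
						  atomic_load_explicit(c, memory_order_relaxed) + 1,
						  memory_order_relaxed);
}

static uint64_t
read_counter(_Atomic uint64_t *c)
{
	return atomic_load_explicit(c, memory_order_relaxed);
}
```

This is the cheap path the patch comment alludes to with "on modern systems this is really just *counter++"; the atomic wrapper only matters on older systems that would otherwise split the 64-bit access.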
Attachment: v20-0002-Move-backend-pgstat-initialization-earlier.patch
From 569bfe5bd2493d8873d8763c1767ccf34fa15505 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 14 Dec 2021 12:26:56 -0500
Subject: [PATCH v20 2/8] Move backend pgstat initialization earlier
Initialize the pgstats subsystem earlier during process initialization
so that more process types have a backend activity state
(PgBackendStatus).
Conditionally initializing backend activity state in some types of
processes and not in others necessitates surprising special cases in the
code.
This particular commit was motivated by single user mode missing a
backend activity state.
This commit also adds a new BackendType for standalone backends,
B_STANDALONE_BACKEND (and alphabetizes the BackendTypes). Both the
bootstrap backend and single user mode backends will have BackendType
B_STANDALONE_BACKEND.
Author: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://www.postgresql.org/message-id/CAAKRu_aaq33UnG4TXq3S-OSXGWj1QGf0sU%2BECH4tNwGFNERkZA%40mail.gmail.com
---
src/backend/utils/init/miscinit.c | 23 ++++++++++++++---------
src/backend/utils/init/postinit.c | 7 +++----
src/include/miscadmin.h | 7 ++++---
3 files changed, 21 insertions(+), 16 deletions(-)
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 88801374b5..41d0b023cd 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -176,6 +176,8 @@ InitStandaloneProcess(const char *argv0)
{
Assert(!IsPostmasterEnvironment);
+ MyBackendType = B_STANDALONE_BACKEND;
+
/*
* Start our win32 signal implementation
*/
@@ -255,6 +257,9 @@ GetBackendTypeDesc(BackendType backendType)
case B_INVALID:
backendDesc = "not initialized";
break;
+ case B_ARCHIVER:
+ backendDesc = "archiver";
+ break;
case B_AUTOVAC_LAUNCHER:
backendDesc = "autovacuum launcher";
break;
@@ -273,9 +278,18 @@ GetBackendTypeDesc(BackendType backendType)
case B_CHECKPOINTER:
backendDesc = "checkpointer";
break;
+ case B_LOGGER:
+ backendDesc = "logger";
+ break;
+ case B_STANDALONE_BACKEND:
+ backendDesc = "standalone backend";
+ break;
case B_STARTUP:
backendDesc = "startup";
break;
+ case B_STATS_COLLECTOR:
+ backendDesc = "stats collector";
+ break;
case B_WAL_RECEIVER:
backendDesc = "walreceiver";
break;
@@ -285,15 +299,6 @@ GetBackendTypeDesc(BackendType backendType)
case B_WAL_WRITER:
backendDesc = "walwriter";
break;
- case B_ARCHIVER:
- backendDesc = "archiver";
- break;
- case B_STATS_COLLECTOR:
- backendDesc = "stats collector";
- break;
- case B_LOGGER:
- backendDesc = "logger";
- break;
}
return backendDesc;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 7292e51f7d..11f1fec17e 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -623,6 +623,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
RegisterTimeout(CLIENT_CONNECTION_CHECK_TIMEOUT, ClientCheckTimeoutHandler);
}
+ pgstat_beinit();
+
/*
* If this is either a bootstrap process nor a standalone backend, start
* up the XLOG machinery, and register to have it closed down at exit.
@@ -638,6 +640,7 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
*/
CreateAuxProcessResourceOwner();
+ pgstat_bestart();
StartupXLOG();
/* Release (and warn about) any buffer pins leaked in StartupXLOG */
ReleaseAuxProcessResources(true);
@@ -665,7 +668,6 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
EnablePortalManager();
/* Initialize status reporting */
- pgstat_beinit();
/*
* Load relcache entries for the shared system catalogs. This must create
@@ -903,10 +905,7 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
* transaction we started before returning.
*/
if (!bootstrap)
- {
- pgstat_bestart();
CommitTransactionCommand();
- }
return;
}
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 90a3016065..1d688f9e51 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -323,19 +323,20 @@ extern void SwitchBackToLocalLatch(void);
typedef enum BackendType
{
B_INVALID = 0,
+ B_ARCHIVER,
B_AUTOVAC_LAUNCHER,
B_AUTOVAC_WORKER,
B_BACKEND,
B_BG_WORKER,
B_BG_WRITER,
B_CHECKPOINTER,
+ B_LOGGER,
+ B_STANDALONE_BACKEND,
B_STARTUP,
+ B_STATS_COLLECTOR,
B_WAL_RECEIVER,
B_WAL_SENDER,
B_WAL_WRITER,
- B_ARCHIVER,
- B_STATS_COLLECTOR,
- B_LOGGER,
} BackendType;
extern BackendType MyBackendType;
--
2.30.2
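A side note on why alphabetizing the BackendType enum and keeping GetBackendTypeDesc() in the same order is worth the churn: the switch there has no default branch, so compilers with -Wswitch warn whenever a new enum member (like B_STANDALONE_BACKEND) is added but not handled. A toy sketch of the pattern (the enum subset and function name here are illustrative, not the patch's code):

```c
#include <assert.h>
#include <string.h>

/* Illustrative subset of the alphabetized BackendType enum. */
typedef enum
{
	B_INVALID = 0,
	B_ARCHIVER,
	B_CHECKPOINTER,
	B_STANDALONE_BACKEND,
} DemoBackendType;

static const char *
demo_backend_desc(DemoBackendType t)
{
	/*
	 * No default branch on purpose: adding an enum member without a
	 * matching case triggers a -Wswitch warning, so the descriptions
	 * cannot silently fall out of sync with the enum.
	 */
	switch (t)
	{
		case B_INVALID:
			return "not initialized";
		case B_ARCHIVER:
			return "archiver";
		case B_CHECKPOINTER:
			return "checkpointer";
		case B_STANDALONE_BACKEND:
			return "standalone backend";
	}
	return "unknown";		/* unreachable for valid enum values */
}
```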
v21 rebased with compile errors fixed is attached.
Attachments:
Attachment: v21-0008-small-comment-correction.patch
From 681792783b8ce795f1510b6fd2f26139196d6b25 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 24 Nov 2021 12:21:08 -0500
Subject: [PATCH v21 8/8] small comment correction
Naming callers in function comment is brittle and unnecessary.
---
src/backend/utils/activity/backend_status.c | 11 +++++------
1 file changed, 5 insertions(+), 6 deletions(-)
diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index c579014ec2..9861ae24ba 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -259,11 +259,11 @@ GetIOPathDesc(IOPath io_path)
}
/*
- * Initialize pgstats backend activity state, and set up our on-proc-exit
- * hook. Called from InitPostgres and AuxiliaryProcessMain. For auxiliary
- * process, MyBackendId is invalid. Otherwise, MyBackendId must be set, but we
- * must not have started any transaction yet (since the exit hook must run
- * after the last transaction exit).
+ * Initialize pgstats backend activity state, and set up our on-proc-exit hook.
+ *
+ * For auxiliary process, MyBackendId is invalid. Otherwise, MyBackendId must
+ * be set, but we must not have started any transaction yet (since the exit
+ * hook must run after the last transaction exit).
*
* NOTE: MyDatabaseId isn't set yet; so the shutdown hook has to be careful.
*/
@@ -301,7 +301,6 @@ pgstat_beinit(void)
* pgstat_bestart() -
*
* Initialize this backend's entry in the PgBackendStatus array.
- * Called from InitPostgres.
*
* Apart from auxiliary processes, MyBackendId, MyDatabaseId,
* session userid, and application_name must be set for a
--
2.32.0
Attachment: v21-0007-Remove-superfluous-bgwriter-stats-code.patch
From ad8f98a7f6b711e03471e49cb8b4ff919d27cdae Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 3 Jan 2022 15:01:10 -0500
Subject: [PATCH v21 7/8] Remove superfluous bgwriter stats code
After adding io_path_stats to PgBackendStatus, all backends keep track
of all IO done on all types of IO paths. When backends exit, they send
their IO operations stats to the stats collector to be persisted.
These statistics are available in the pg_stat_buffers view, making the
buffers_checkpoint, buffers_clean, buffers_backend,
buffers_backend_fsync, and buffers_alloc columns in the pg_stat_bgwriter
view redundant.
In order to maintain backward compatibility, these columns in
pg_stat_bgwriter remain and are derived from the pg_stat_buffers view.
The structs used to track the statistics for these columns in the
pg_stat_bgwriter view and the functions querying them have been removed.
Additionally, since the "buffers" stats reset target resets both the IO
operations stats structs and the bgwriter stats structs, this member of
the bgwriter stats structs is no longer needed. Instead derive the
stats_reset column in the pg_stat_bgwriter view from pg_stat_buffers as
well.
---
src/backend/catalog/system_views.sql | 28 ++++++++++-----------
src/backend/postmaster/checkpointer.c | 29 ++-------------------
src/backend/postmaster/pgstat.c | 7 ------
src/backend/storage/buffer/bufmgr.c | 6 -----
src/backend/utils/adt/pgstatfuncs.c | 36 ---------------------------
src/include/catalog/pg_proc.dat | 26 -------------------
src/include/pgstat.h | 11 --------
src/test/regress/expected/rules.out | 24 +++++++++++++-----
8 files changed, 34 insertions(+), 133 deletions(-)
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 446a817905..d0a0c54b74 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1064,20 +1064,6 @@ CREATE VIEW pg_stat_archiver AS
s.stats_reset
FROM pg_stat_get_archiver() s;
-CREATE VIEW pg_stat_bgwriter AS
- SELECT
- pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints_timed,
- pg_stat_get_bgwriter_requested_checkpoints() AS checkpoints_req,
- pg_stat_get_checkpoint_write_time() AS checkpoint_write_time,
- pg_stat_get_checkpoint_sync_time() AS checkpoint_sync_time,
- pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint,
- pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean,
- pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
- pg_stat_get_buf_written_backend() AS buffers_backend,
- pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
- pg_stat_get_buf_alloc() AS buffers_alloc,
- pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
-
CREATE VIEW pg_stat_buffers AS
SELECT
b.backend_type,
@@ -1089,6 +1075,20 @@ SELECT
b.stats_reset
FROM pg_stat_get_buffers() b;
+CREATE VIEW pg_stat_bgwriter AS
+ SELECT
+ pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints_timed,
+ pg_stat_get_bgwriter_requested_checkpoints() AS checkpoints_req,
+ pg_stat_get_checkpoint_write_time() AS checkpoint_write_time,
+ pg_stat_get_checkpoint_sync_time() AS checkpoint_sync_time,
+ (SELECT write FROM pg_stat_buffers WHERE backend_type = 'checkpointer' AND io_path = 'shared') AS buffers_checkpoint,
+ (SELECT write FROM pg_stat_buffers WHERE backend_type = 'background writer' AND io_path = 'shared') AS buffers_clean,
+ pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
+ (SELECT write FROM pg_stat_buffers WHERE backend_type = 'client backend' AND io_path = 'shared') AS buffers_backend,
+ (SELECT fsync FROM pg_stat_buffers WHERE backend_type = 'client backend' AND io_path = 'shared') AS buffers_backend_fsync,
+ (SELECT alloc FROM pg_stat_buffers WHERE backend_type = 'client backend' AND io_path = 'shared') AS buffers_alloc,
+ (SELECT stats_reset FROM pg_stat_buffers LIMIT 1) AS stats_reset;
+
CREATE VIEW pg_stat_wal AS
SELECT
w.wal_records,
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 4e88327425..ffe8292d8f 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -91,17 +91,9 @@
* requesting backends since the last checkpoint start. The flags are
* chosen so that OR'ing is the correct way to combine multiple requests.
*
- * num_backend_writes is used to count the number of buffer writes performed
- * by user backend processes. This counter should be wide enough that it
- * can't overflow during a single processing cycle. num_backend_fsync
- * counts the subset of those writes that also had to do their own fsync,
- * because the checkpointer failed to absorb their request.
- *
* The requests array holds fsync requests sent by backends and not yet
* absorbed by the checkpointer.
*
- * Unlike the checkpoint fields, num_backend_writes, num_backend_fsync, and
- * the requests fields are protected by CheckpointerCommLock.
*----------
*/
typedef struct
@@ -125,9 +117,6 @@ typedef struct
ConditionVariable start_cv; /* signaled when ckpt_started advances */
ConditionVariable done_cv; /* signaled when ckpt_done advances */
- uint32 num_backend_writes; /* counts user backend buffer writes */
- uint32 num_backend_fsync; /* counts user backend fsync calls */
-
int num_requests; /* current # of requests */
int max_requests; /* allocated array size */
CheckpointerRequest requests[FLEXIBLE_ARRAY_MEMBER];
@@ -1086,10 +1075,6 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
- /* Count all backend writes regardless of if they fit in the queue */
- if (!AmBackgroundWriterProcess())
- CheckpointerShmem->num_backend_writes++;
-
/*
* If the checkpointer isn't running or the request queue is full, the
* backend will have to perform its own fsync request. But before forcing
@@ -1099,13 +1084,12 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
(CheckpointerShmem->num_requests >= CheckpointerShmem->max_requests &&
!CompactCheckpointerRequestQueue()))
{
+ LWLockRelease(CheckpointerCommLock);
+
/*
* Count the subset of writes where backends have to do their own
* fsync
*/
- if (!AmBackgroundWriterProcess())
- CheckpointerShmem->num_backend_fsync++;
- LWLockRelease(CheckpointerCommLock);
pgstat_inc_ioop(IOOP_FSYNC, IOPATH_SHARED);
return false;
}
@@ -1262,15 +1246,6 @@ AbsorbSyncRequests(void)
LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
- /* Transfer stats counts into pending pgstats message */
- PendingCheckpointerStats.m_buf_written_backend
- += CheckpointerShmem->num_backend_writes;
- PendingCheckpointerStats.m_buf_fsync_backend
- += CheckpointerShmem->num_backend_fsync;
-
- CheckpointerShmem->num_backend_writes = 0;
- CheckpointerShmem->num_backend_fsync = 0;
-
/*
* We try to avoid holding the lock for a long time by copying the request
* array, and processing the requests after releasing the lock.
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index b1a5b15410..93656ced72 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4473,7 +4473,6 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
* existing statsfile).
*/
ts = GetCurrentTimestamp();
- globalStats.bgwriter.stat_reset_timestamp = ts;
globalStats.buffers.stat_reset_timestamp = ts;
archiverStats.stat_reset_timestamp = ts;
walStats.stat_reset_timestamp = ts;
@@ -5672,7 +5671,6 @@ pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
memset(&globalStats.bgwriter, 0, sizeof(globalStats.bgwriter));
memset(&globalStats.checkpointer, 0, sizeof(globalStats.checkpointer));
memset(&globalStats.buffers.ops, 0, sizeof(globalStats.buffers.ops));
- globalStats.bgwriter.stat_reset_timestamp = GetCurrentTimestamp();
globalStats.buffers.stat_reset_timestamp = GetCurrentTimestamp();
memcpy(&globalStats.buffers.resets[backend_type_get_idx(backend_type)],
&msg->m_backend_resets.iop.io_path_ops,
@@ -5944,9 +5942,7 @@ pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
static void
pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
{
- globalStats.bgwriter.buf_written_clean += msg->m_buf_written_clean;
globalStats.bgwriter.maxwritten_clean += msg->m_maxwritten_clean;
- globalStats.bgwriter.buf_alloc += msg->m_buf_alloc;
}
/* ----------
@@ -5962,9 +5958,6 @@ pgstat_recv_checkpointer(PgStat_MsgCheckpointer *msg, int len)
globalStats.checkpointer.requested_checkpoints += msg->m_requested_checkpoints;
globalStats.checkpointer.checkpoint_write_time += msg->m_checkpoint_write_time;
globalStats.checkpointer.checkpoint_sync_time += msg->m_checkpoint_sync_time;
- globalStats.checkpointer.buf_written_checkpoints += msg->m_buf_written_checkpoints;
- globalStats.checkpointer.buf_written_backend += msg->m_buf_written_backend;
- globalStats.checkpointer.buf_fsync_backend += msg->m_buf_fsync_backend;
}
static void
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index a6e446f29a..c4ee2894a0 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2166,7 +2166,6 @@ BufferSync(int flags)
if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
{
TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
- PendingCheckpointerStats.m_buf_written_checkpoints++;
num_written++;
}
}
@@ -2275,9 +2274,6 @@ BgBufferSync(WritebackContext *wb_context)
*/
strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
- /* Report buffer alloc counts to pgstat */
- PendingBgWriterStats.m_buf_alloc += recent_alloc;
-
/*
* If we're not running the LRU scan, just stop after doing the stats
* stuff. We mark the saved state invalid so that we can recover sanely
@@ -2474,8 +2470,6 @@ BgBufferSync(WritebackContext *wb_context)
reusable_buffers++;
}
- PendingBgWriterStats.m_buf_written_clean += num_written;
-
#ifdef BGW_DEBUG
elog(DEBUG1, "bgwriter: recent_alloc=%u smoothed=%.2f delta=%ld ahead=%d density=%.2f reusable_est=%d upcoming_est=%d scanned=%d wrote=%d reusable=%d",
recent_alloc, smoothed_alloc, strategy_delta, bufs_ahead,
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index b0e66f89cf..15ca2d0a73 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1732,18 +1732,6 @@ pg_stat_get_bgwriter_requested_checkpoints(PG_FUNCTION_ARGS)
PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->requested_checkpoints);
}
-Datum
-pg_stat_get_bgwriter_buf_written_checkpoints(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_written_checkpoints);
-}
-
-Datum
-pg_stat_get_bgwriter_buf_written_clean(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_written_clean);
-}
-
Datum
pg_stat_get_bgwriter_maxwritten_clean(PG_FUNCTION_ARGS)
{
@@ -1766,30 +1754,6 @@ pg_stat_get_checkpoint_sync_time(PG_FUNCTION_ARGS)
pgstat_fetch_stat_checkpointer()->checkpoint_sync_time);
}
-Datum
-pg_stat_get_bgwriter_stat_reset_time(PG_FUNCTION_ARGS)
-{
- PG_RETURN_TIMESTAMPTZ(pgstat_fetch_stat_bgwriter()->stat_reset_timestamp);
-}
-
-Datum
-pg_stat_get_buf_written_backend(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_written_backend);
-}
-
-Datum
-pg_stat_get_buf_fsync_backend(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_fsync_backend);
-}
-
-Datum
-pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
-}
-
/*
* When adding a new column to the pg_stat_buffers view, add a new enum
* value here above BUFFERS_NUM_COLUMNS.
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 9d3ab6d0a3..c028ca58f4 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5600,25 +5600,11 @@
proname => 'pg_stat_get_bgwriter_requested_checkpoints', provolatile => 's',
proparallel => 'r', prorettype => 'int8', proargtypes => '',
prosrc => 'pg_stat_get_bgwriter_requested_checkpoints' },
-{ oid => '2771',
- descr => 'statistics: number of buffers written by the bgwriter during checkpoints',
- proname => 'pg_stat_get_bgwriter_buf_written_checkpoints', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_bgwriter_buf_written_checkpoints' },
-{ oid => '2772',
- descr => 'statistics: number of buffers written by the bgwriter for cleaning dirty buffers',
- proname => 'pg_stat_get_bgwriter_buf_written_clean', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_bgwriter_buf_written_clean' },
{ oid => '2773',
descr => 'statistics: number of times the bgwriter stopped processing when it had written too many buffers while cleaning',
proname => 'pg_stat_get_bgwriter_maxwritten_clean', provolatile => 's',
proparallel => 'r', prorettype => 'int8', proargtypes => '',
prosrc => 'pg_stat_get_bgwriter_maxwritten_clean' },
-{ oid => '3075', descr => 'statistics: last reset for the bgwriter',
- proname => 'pg_stat_get_bgwriter_stat_reset_time', provolatile => 's',
- proparallel => 'r', prorettype => 'timestamptz', proargtypes => '',
- prosrc => 'pg_stat_get_bgwriter_stat_reset_time' },
{ oid => '3160',
descr => 'statistics: checkpoint time spent writing buffers to disk, in milliseconds',
proname => 'pg_stat_get_checkpoint_write_time', provolatile => 's',
@@ -5629,18 +5615,6 @@
proname => 'pg_stat_get_checkpoint_sync_time', provolatile => 's',
proparallel => 'r', prorettype => 'float8', proargtypes => '',
prosrc => 'pg_stat_get_checkpoint_sync_time' },
-{ oid => '2775', descr => 'statistics: number of buffers written by backends',
- proname => 'pg_stat_get_buf_written_backend', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_buf_written_backend' },
-{ oid => '3063',
- descr => 'statistics: number of backend buffer writes that did their own fsync',
- proname => 'pg_stat_get_buf_fsync_backend', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_buf_fsync_backend' },
-{ oid => '2859', descr => 'statistics: number of buffer allocations',
- proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
- prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
{ oid => '8459', descr => 'statistics: counts of all IO operations done through all IO paths by each type of backend.',
proname => 'pg_stat_get_buffers', provolatile => 's', proisstrict => 'f',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index caf5ef5678..1abdb4e7ba 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -522,9 +522,7 @@ typedef struct PgStat_MsgBgWriter
{
PgStat_MsgHdr m_hdr;
- PgStat_Counter m_buf_written_clean;
PgStat_Counter m_maxwritten_clean;
- PgStat_Counter m_buf_alloc;
} PgStat_MsgBgWriter;
/* ----------
@@ -537,9 +535,6 @@ typedef struct PgStat_MsgCheckpointer
PgStat_Counter m_timed_checkpoints;
PgStat_Counter m_requested_checkpoints;
- PgStat_Counter m_buf_written_checkpoints;
- PgStat_Counter m_buf_written_backend;
- PgStat_Counter m_buf_fsync_backend;
PgStat_Counter m_checkpoint_write_time; /* times in milliseconds */
PgStat_Counter m_checkpoint_sync_time;
} PgStat_MsgCheckpointer;
@@ -970,10 +965,7 @@ typedef struct PgStat_ArchiverStats
*/
typedef struct PgStat_BgWriterStats
{
- PgStat_Counter buf_written_clean;
PgStat_Counter maxwritten_clean;
- PgStat_Counter buf_alloc;
- TimestampTz stat_reset_timestamp;
} PgStat_BgWriterStats;
/*
@@ -986,9 +978,6 @@ typedef struct PgStat_CheckpointerStats
PgStat_Counter requested_checkpoints;
PgStat_Counter checkpoint_write_time; /* times in milliseconds */
PgStat_Counter checkpoint_sync_time;
- PgStat_Counter buf_written_checkpoints;
- PgStat_Counter buf_written_backend;
- PgStat_Counter buf_fsync_backend;
} PgStat_CheckpointerStats;
/*
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index fb17ed7f93..65634eef6d 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1799,13 +1799,25 @@ pg_stat_bgwriter| SELECT pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints
pg_stat_get_bgwriter_requested_checkpoints() AS checkpoints_req,
pg_stat_get_checkpoint_write_time() AS checkpoint_write_time,
pg_stat_get_checkpoint_sync_time() AS checkpoint_sync_time,
- pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint,
- pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean,
+ ( SELECT pg_stat_buffers.write
+ FROM pg_stat_buffers
+ WHERE ((pg_stat_buffers.backend_type = 'checkpointer'::text) AND (pg_stat_buffers.io_path = 'shared'::text))) AS buffers_checkpoint,
+ ( SELECT pg_stat_buffers.write
+ FROM pg_stat_buffers
+ WHERE ((pg_stat_buffers.backend_type = 'background writer'::text) AND (pg_stat_buffers.io_path = 'shared'::text))) AS buffers_clean,
pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
- pg_stat_get_buf_written_backend() AS buffers_backend,
- pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
- pg_stat_get_buf_alloc() AS buffers_alloc,
- pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+ ( SELECT pg_stat_buffers.write
+ FROM pg_stat_buffers
+ WHERE ((pg_stat_buffers.backend_type = 'client backend'::text) AND (pg_stat_buffers.io_path = 'shared'::text))) AS buffers_backend,
+ ( SELECT pg_stat_buffers.fsync
+ FROM pg_stat_buffers
+ WHERE ((pg_stat_buffers.backend_type = 'client backend'::text) AND (pg_stat_buffers.io_path = 'shared'::text))) AS buffers_backend_fsync,
+ ( SELECT pg_stat_buffers.alloc
+ FROM pg_stat_buffers
+ WHERE ((pg_stat_buffers.backend_type = 'client backend'::text) AND (pg_stat_buffers.io_path = 'shared'::text))) AS buffers_alloc,
+ ( SELECT pg_stat_buffers.stats_reset
+ FROM pg_stat_buffers
+ LIMIT 1) AS stats_reset;
pg_stat_buffers| SELECT b.backend_type,
b.io_path,
b.alloc,
--
2.32.0
Attachment: v21-0004-Send-IO-operations-to-stats-collector.patch
From 939f7f01239b56865e212c24cd63437a46a344b1 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 24 Nov 2021 11:16:55 -0500
Subject: [PATCH v21 4/8] Send IO operations to stats collector
On exit, backends send the IO operations they have done on all IO Paths
to the stats collector. The stats collector adds these counts to its
existing counts stored in a global data structure it maintains and
persists.
PgStatIOOpCounters contains the same information as backend_status.h's
IOOpCounters, however IOOpCounters' members must be atomics and the
stats collector has no such requirement.
Suggested by Andres Freund
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
src/backend/postmaster/pgstat.c | 100 ++++++++++++++++++++++++++++-
src/include/miscadmin.h | 2 +
src/include/pgstat.h | 56 ++++++++++++++++
src/include/utils/backend_status.h | 37 +++++++++++
4 files changed, 194 insertions(+), 1 deletion(-)
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 0646f53098..5eaf8b6ee7 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -126,7 +126,7 @@ char *pgstat_stat_filename = NULL;
char *pgstat_stat_tmpname = NULL;
/*
- * BgWriter and WAL global statistics counters.
+ * BgWriter, Checkpointer, and WAL global statistics counters.
* Stored directly in a stats message structure so they can be sent
* without needing to copy things around. We assume these init to zeroes.
*/
@@ -369,6 +369,7 @@ static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
static void pgstat_recv_checkpointer(PgStat_MsgCheckpointer *msg, int len);
+static void pgstat_recv_io_path_ops(PgStat_MsgIOPathOps *msg, int len);
static void pgstat_recv_wal(PgStat_MsgWal *msg, int len);
static void pgstat_recv_slru(PgStat_MsgSLRU *msg, int len);
static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
@@ -3152,6 +3153,14 @@ pgstat_shutdown_hook(int code, Datum arg)
{
Assert(!pgstat_is_shutdown);
+ /*
+ * Only need to send stats on IOOps for IOPaths when a process exits. Users
+ * requiring IOOps for both live and exited backends can read from live
+ * backends' PgBackendStatuses and sum this with totals from exited
+ * backends persisted by the stats collector.
+ */
+ pgstat_send_buffers();
+
/*
* If we got as far as discovering our own database ID, we can report what
* we did to the collector. Otherwise, we'd be sending an invalid
@@ -3301,6 +3310,46 @@ pgstat_send_bgwriter(void)
MemSet(&PendingBgWriterStats, 0, sizeof(PendingBgWriterStats));
}
+/*
+ * Before exiting, a backend sends its IO operations statistics to the
+ * collector so that they may be persisted.
+ */
+void
+pgstat_send_buffers(void)
+{
+ PgStatIOOpCounters *io_path_ops;
+ PgStat_MsgIOPathOps msg;
+
+ PgBackendStatus *beentry = MyBEEntry;
+ PgStat_Counter sum = 0;
+
+ if (!beentry || beentry->st_backendType == B_INVALID)
+ return;
+
+ memset(&msg, 0, sizeof(msg));
+ msg.backend_type = beentry->st_backendType;
+
+ io_path_ops = msg.iop.io_path_ops;
+ pgstat_sum_io_path_ops(io_path_ops, (IOOpCounters *)
+ &beentry->io_path_stats);
+
+ /* If no IO was done, don't bother sending anything to the stats collector. */
+ for (int i = 0; i < IOPATH_NUM_TYPES; i++)
+ {
+ sum += io_path_ops[i].allocs;
+ sum += io_path_ops[i].extends;
+ sum += io_path_ops[i].fsyncs;
+ sum += io_path_ops[i].writes;
+ }
+
+ if (sum == 0)
+ return;
+
+ pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_IO_PATH_OPS);
+ pgstat_send(&msg, sizeof(msg));
+}
+
+
/* ----------
* pgstat_send_checkpointer() -
*
@@ -3483,6 +3532,29 @@ pgstat_send_subscription_purge(PgStat_MsgSubscriptionPurge *msg)
pgstat_send(msg, len);
}
+/*
+ * Helper function to sum all IO operations stats for all IOPaths (e.g. shared,
+ * local) from live backends with those in the equivalent stats structure for
+ * exited backends.
+ * Note that this adds and doesn't set, so the destination stats structure
+ * should be zeroed out by the caller initially.
+ * This would commonly be used to transfer all IOOp stats for all IOPaths for a
+ * particular backend type to the pgstats structure.
+ */
+void
+pgstat_sum_io_path_ops(PgStatIOOpCounters *dest, IOOpCounters *src)
+{
+ for (int i = 0; i < IOPATH_NUM_TYPES; i++)
+ {
+ dest->allocs += pg_atomic_read_u64(&src->allocs);
+ dest->extends += pg_atomic_read_u64(&src->extends);
+ dest->fsyncs += pg_atomic_read_u64(&src->fsyncs);
+ dest->writes += pg_atomic_read_u64(&src->writes);
+ dest++;
+ src++;
+ }
+}
+
/* ----------
* PgstatCollectorMain() -
*
@@ -3692,6 +3764,10 @@ PgstatCollectorMain(int argc, char *argv[])
pgstat_recv_checkpointer(&msg.msg_checkpointer, len);
break;
+ case PGSTAT_MTYPE_IO_PATH_OPS:
+ pgstat_recv_io_path_ops(&msg.msg_io_path_ops, len);
+ break;
+
case PGSTAT_MTYPE_WAL:
pgstat_recv_wal(&msg.msg_wal, len);
break;
@@ -5813,6 +5889,28 @@ pgstat_recv_checkpointer(PgStat_MsgCheckpointer *msg, int len)
globalStats.checkpointer.buf_fsync_backend += msg->m_buf_fsync_backend;
}
+static void
+pgstat_recv_io_path_ops(PgStat_MsgIOPathOps *msg, int len)
+{
+ PgStatIOOpCounters *src_io_path_ops;
+ PgStatIOOpCounters *dest_io_path_ops;
+
+ src_io_path_ops = msg->iop.io_path_ops;
+ dest_io_path_ops =
+ globalStats.buffers.ops[backend_type_get_idx(msg->backend_type)].io_path_ops;
+
+ for (int i = 0; i < IOPATH_NUM_TYPES; i++)
+ {
+ PgStatIOOpCounters *src = &src_io_path_ops[i];
+ PgStatIOOpCounters *dest = &dest_io_path_ops[i];
+
+ dest->allocs += src->allocs;
+ dest->extends += src->extends;
+ dest->fsyncs += src->fsyncs;
+ dest->writes += src->writes;
+ }
+}
+
/* ----------
* pgstat_recv_wal() -
*
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 94c6135e93..77c89134c2 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -338,6 +338,8 @@ typedef enum BackendType
B_WAL_WRITER,
} BackendType;
+#define BACKEND_NUM_TYPES B_WAL_WRITER
+
extern BackendType MyBackendType;
extern const char *GetBackendTypeDesc(BackendType backendType);
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index e10d20222a..431f273d23 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -73,6 +73,7 @@ typedef enum StatMsgType
PGSTAT_MTYPE_ARCHIVER,
PGSTAT_MTYPE_BGWRITER,
PGSTAT_MTYPE_CHECKPOINTER,
+ PGSTAT_MTYPE_IO_PATH_OPS,
PGSTAT_MTYPE_WAL,
PGSTAT_MTYPE_SLRU,
PGSTAT_MTYPE_FUNCSTAT,
@@ -335,6 +336,48 @@ typedef struct PgStat_MsgDropdb
} PgStat_MsgDropdb;
+/*
+ * Structure for counting all types of IOOps in the stats collector
+ */
+typedef struct PgStatIOOpCounters
+{
+ PgStat_Counter allocs;
+ PgStat_Counter extends;
+ PgStat_Counter fsyncs;
+ PgStat_Counter writes;
+} PgStatIOOpCounters;
+
+/*
+ * Structure for counting all IOOps on all types of IOPaths.
+ */
+typedef struct PgStatIOPathOps
+{
+ PgStatIOOpCounters io_path_ops[IOPATH_NUM_TYPES];
+} PgStatIOPathOps;
+
+/*
+ * Sent by a backend to the stats collector to report all IOOps for all IOPaths
+ * for a given type of a backend. This will happen when the backend exits.
+ */
+typedef struct PgStat_MsgIOPathOps
+{
+ PgStat_MsgHdr m_hdr;
+
+ BackendType backend_type;
+ PgStatIOPathOps iop;
+} PgStat_MsgIOPathOps;
+
+/*
+ * Structure used by stats collector to keep track of all types of exited
+ * backends' IOOps for all IOPaths as well as all stats from live backends at
+ * the time of stats reset. resets is populated using a reset message sent to
+ * the stats collector.
+ */
+typedef struct PgStat_BackendIOPathOps
+{
+ PgStatIOPathOps ops[BACKEND_NUM_TYPES];
+} PgStat_BackendIOPathOps;
+
/* ----------
* PgStat_MsgResetcounter Sent by the backend to tell the collector
* to reset counters
@@ -756,6 +799,7 @@ typedef union PgStat_Msg
PgStat_MsgArchiver msg_archiver;
PgStat_MsgBgWriter msg_bgwriter;
PgStat_MsgCheckpointer msg_checkpointer;
+ PgStat_MsgIOPathOps msg_io_path_ops;
PgStat_MsgWal msg_wal;
PgStat_MsgSLRU msg_slru;
PgStat_MsgFuncstat msg_funcstat;
@@ -939,6 +983,7 @@ typedef struct PgStat_GlobalStats
PgStat_CheckpointerStats checkpointer;
PgStat_BgWriterStats bgwriter;
+ PgStat_BackendIOPathOps buffers;
} PgStat_GlobalStats;
/*
@@ -1215,8 +1260,19 @@ extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
extern void pgstat_send_archiver(const char *xlog, bool failed);
extern void pgstat_send_bgwriter(void);
+/*
+ * While some processes send some types of statistics to the collector at
+ * regular intervals (e.g. CheckpointerMain() calling
+ * pgstat_send_checkpointer()), IO operations stats are only sent by
+ * pgstat_send_buffers() when a process exits (in pgstat_shutdown_hook()). IO
+ * operations stats from live backends can be read from their PgBackendStatuses
+ * and, if desired, summed with totals from exited backends persisted by the
+ * stats collector.
+ */
+extern void pgstat_send_buffers(void);
extern void pgstat_send_checkpointer(void);
extern void pgstat_send_wal(bool force);
+extern void pgstat_sum_io_path_ops(PgStatIOOpCounters *dest, IOOpCounters *src);
/* ----------
* Support functions for the SQL-callable functions to
diff --git a/src/include/utils/backend_status.h b/src/include/utils/backend_status.h
index 950b7396a5..3de1e7c8d3 100644
--- a/src/include/utils/backend_status.h
+++ b/src/include/utils/backend_status.h
@@ -331,6 +331,43 @@ extern void CreateSharedBackendStatus(void);
* ----------
*/
+/* Utility functions */
+
+/*
+ * When maintaining an array of information about all valid BackendTypes, in
+ * order to avoid wasting the 0th spot, use this helper to convert a valid
+ * BackendType to a valid location in the array (given that no spot is
+ * maintained for B_INVALID BackendType).
+ */
+static inline int backend_type_get_idx(BackendType backend_type)
+{
+ /*
+ * backend_type must be one of the valid backend types. If caller is
+ * maintaining backend information in an array that includes B_INVALID,
+ * this function is unnecessary.
+ */
+ Assert(backend_type > B_INVALID && backend_type <= BACKEND_NUM_TYPES);
+ return backend_type - 1;
+}
+
+/*
+ * When using a value from an array of information about all valid
+ * BackendTypes, add 1 to the index before using it as a BackendType to adjust
+ * for not maintaining a spot for B_INVALID BackendType.
+ */
+static inline BackendType idx_get_backend_type(int idx)
+{
+ int backend_type = idx + 1;
+ /*
+ * If the array includes a spot for B_INVALID BackendType this function is
+ * not required.
+ */
+ Assert(backend_type > B_INVALID && backend_type <= BACKEND_NUM_TYPES);
+ return backend_type;
+}
+
+extern const char *GetIOPathDesc(IOPath io_path);
+
/* Initialization functions */
extern void pgstat_beinit(void);
extern void pgstat_bestart(void);
--
2.32.0
Attachment: v21-0005-Add-buffers-to-pgstat_reset_shared_counters.patch
From 99b5fcbe6a7e097fd36bdee730e98eca41f5426b Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 3 Jan 2022 18:26:45 -0500
Subject: [PATCH v21 5/8] Add "buffers" to pgstat_reset_shared_counters
Backends count IO operations for various IO paths in their
PgBackendStatus. Upon exit, they send these counts to the stats
collector.
Prior to this commit, the IO operations stats from exited backends
persisted by the stats collector would have been been reset when
pgstat_reset_shared_counters() was invoked with target "bgwriter".
However the IO operations stats in each live backend's PgBackendStatus
would remain the same. Thus the totals calculated from both live and
exited backends would be incorrect after a reset.
Backends' PgBackendStatuses cannot be written to by another backend;
therefore, in order to calculate correct totals after a reset has
occurred, the backend sending the reset message to the stats collector
now reads the IO operation stats totals from live backends and sends
them to the stats collector to be persisted in an array of "resets"
which can be used to calculate the correct totals after a reset.
Because the IO operations statistics are broader in scope than those in
pg_stat_bgwriter, rename the reset target to "buffers". The "buffers"
target will reset all IO operations statistics and all statistics for
the pg_stat_bgwriter view maintained by the stats collector.
Author: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
doc/src/sgml/monitoring.sgml | 2 +-
src/backend/postmaster/pgstat.c | 87 ++++++++++++++++++---
src/backend/utils/activity/backend_status.c | 27 +++++++
src/include/pgstat.h | 29 +++++--
src/include/utils/backend_status.h | 2 +
5 files changed, 129 insertions(+), 18 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index bf7625d988..caa45cb5f5 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -5218,7 +5218,7 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
</para>
<para>
Resets some cluster-wide statistics counters to zero, depending on the
- argument. The argument can be <literal>bgwriter</literal> to reset
+ argument. The argument can be <literal>buffers</literal> to reset
all the counters shown in
the <structname>pg_stat_bgwriter</structname>
view, <literal>archiver</literal> to reset all the counters shown in
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 5eaf8b6ee7..a5b7cfa45d 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1509,6 +1509,36 @@ pgstat_reset_counters(void)
pgstat_send(&msg, sizeof(msg));
}
+/*
+ * Helper function to collect and send live backends' current IO operations
+ * stats counters when a stats reset is initiated so that they may be deducted
+ * from future totals.
+ */
+static void
+pgstat_send_buffers_reset(PgStat_MsgResetsharedcounter *msg)
+{
+ PgStatIOPathOps ops[BACKEND_NUM_TYPES];
+
+ memset(ops, 0, sizeof(ops));
+ pgstat_report_live_backend_io_path_ops(ops);
+
+ /*
+ * Iterate through the array of all IOOps for all IOPaths for each
+ * BackendType.
+ *
+ * An individual message is sent for each backend type because sending all
+ * IO operations in one message would exceed the PGSTAT_MAX_MSG_SIZE of
+ * 1000.
+ */
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ msg->m_backend_resets.backend_type = idx_get_backend_type(i);
+ memcpy(&msg->m_backend_resets.iop, &ops[i],
+ sizeof(msg->m_backend_resets.iop));
+ pgstat_send(msg, sizeof(PgStat_MsgResetsharedcounter));
+ }
+}
+
/* ----------
* pgstat_reset_shared_counters() -
*
@@ -1526,19 +1556,25 @@ pgstat_reset_shared_counters(const char *target)
if (pgStatSock == PGINVALID_SOCKET)
return;
+ pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
+ if (strcmp(target, "buffers") == 0)
+ {
+ msg.m_resettarget = RESET_BUFFERS;
+ pgstat_send_buffers_reset(&msg);
+ return;
+ }
+
if (strcmp(target, "archiver") == 0)
msg.m_resettarget = RESET_ARCHIVER;
- else if (strcmp(target, "bgwriter") == 0)
- msg.m_resettarget = RESET_BGWRITER;
else if (strcmp(target, "wal") == 0)
msg.m_resettarget = RESET_WAL;
else
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("unrecognized reset target: \"%s\"", target),
- errhint("Target must be \"archiver\", \"bgwriter\", or \"wal\".")));
+ errhint(
+ "Target must be \"archiver\", \"buffers\", or \"wal\".")));
- pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
pgstat_send(&msg, sizeof(msg));
}
@@ -4425,6 +4461,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
*/
ts = GetCurrentTimestamp();
globalStats.bgwriter.stat_reset_timestamp = ts;
+ globalStats.buffers.stat_reset_timestamp = ts;
archiverStats.stat_reset_timestamp = ts;
walStats.stat_reset_timestamp = ts;
@@ -5588,18 +5625,46 @@ pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len)
static void
pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
{
- if (msg->m_resettarget == RESET_BGWRITER)
- {
- /* Reset the global, bgwriter and checkpointer statistics for the cluster. */
- memset(&globalStats, 0, sizeof(globalStats));
- globalStats.bgwriter.stat_reset_timestamp = GetCurrentTimestamp();
- }
- else if (msg->m_resettarget == RESET_ARCHIVER)
+ if (msg->m_resettarget == RESET_ARCHIVER)
{
/* Reset the archiver statistics for the cluster. */
memset(&archiverStats, 0, sizeof(archiverStats));
archiverStats.stat_reset_timestamp = GetCurrentTimestamp();
}
+ else if (msg->m_resettarget == RESET_BUFFERS)
+ {
+ /*
+ * Reset global stats for bgwriter, buffers, and checkpointer.
+ *
+ * Because the stats collector cannot write to live backends'
+ * PgBackendStatuses, it maintains an array of "resets". The reset
+ * message contains the current values of these counters for live
+ * backends. The stats collector saves these in its "resets" array,
+ * then zeroes out the exited backends' saved IO operations counters.
+ * This is required to calculate an accurate total for each IO
+ * operations counter post reset.
+ */
+ BackendType backend_type = msg->m_backend_resets.backend_type;
+
+ /*
+ * We reset each member individually (as opposed to resetting the
+ * entire globalStats struct) because we need to preserve the resets
+ * array (globalStats.buffers.resets).
+ *
+ * Though globalStats.buffers.ops, globalStats.bgwriter, and
+ * globalStats.checkpointer only need to be reset once, doing so for
+ * every message is less brittle and the extra cost is irrelevant given
+ * how often stats are reset.
+ */
+ memset(&globalStats.bgwriter, 0, sizeof(globalStats.bgwriter));
+ memset(&globalStats.checkpointer, 0, sizeof(globalStats.checkpointer));
+ memset(&globalStats.buffers.ops, 0, sizeof(globalStats.buffers.ops));
+ globalStats.bgwriter.stat_reset_timestamp = GetCurrentTimestamp();
+ globalStats.buffers.stat_reset_timestamp = GetCurrentTimestamp();
+ memcpy(&globalStats.buffers.resets[backend_type_get_idx(backend_type)],
+ &msg->m_backend_resets.iop.io_path_ops,
+ sizeof(msg->m_backend_resets.iop.io_path_ops));
+ }
else if (msg->m_resettarget == RESET_WAL)
{
/* Reset the WAL statistics for the cluster. */
diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index 79410e0b2c..87b9d0fc0d 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -636,6 +636,33 @@ pgstat_report_activity(BackendState state, const char *cmd_str)
PGSTAT_END_WRITE_ACTIVITY(beentry);
}
+/*
+ * Iterate through BackendStatusArray and capture live backends' stats on IOOps
+ * for all IOPaths, adding them to that backend type's member of the
+ * backend_io_path_ops structure.
+ */
+void
+pgstat_report_live_backend_io_path_ops(PgStatIOPathOps *backend_io_path_ops)
+{
+ PgBackendStatus *beentry = BackendStatusArray;
+
+ /*
+ * Loop through live backends and capture reset values
+ */
+ for (int i = 0; i < GetMaxBackends() + NUM_AUXPROCTYPES; i++, beentry++)
+ {
+ int idx;
+
+ /* Don't count dead backends or those with type B_INVALID. */
+ if (beentry->st_procpid == 0 || beentry->st_backendType == B_INVALID)
+ continue;
+
+ idx = backend_type_get_idx(beentry->st_backendType);
+ pgstat_sum_io_path_ops(backend_io_path_ops[idx].io_path_ops,
+ (IOOpCounters *) beentry->io_path_stats);
+ }
+}
+
/* --------
* pgstat_report_query_id() -
*
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 431f273d23..e818a26780 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -141,7 +141,7 @@ typedef struct PgStat_TableCounts
typedef enum PgStat_Shared_Reset_Target
{
RESET_ARCHIVER,
- RESET_BGWRITER,
+ RESET_BUFFERS,
RESET_WAL
} PgStat_Shared_Reset_Target;
@@ -357,7 +357,8 @@ typedef struct PgStatIOPathOps
/*
* Sent by a backend to the stats collector to report all IOOps for all IOPaths
- * for a given type of a backend. This will happen when the backend exits.
+ * for a given type of a backend. This will happen when the backend exits or
+ * when stats are reset.
*/
typedef struct PgStat_MsgIOPathOps
{
@@ -375,9 +376,12 @@ typedef struct PgStat_MsgIOPathOps
*/
typedef struct PgStat_BackendIOPathOps
{
+ TimestampTz stat_reset_timestamp;
PgStatIOPathOps ops[BACKEND_NUM_TYPES];
+ PgStatIOPathOps resets[BACKEND_NUM_TYPES];
} PgStat_BackendIOPathOps;
+
/* ----------
* PgStat_MsgResetcounter Sent by the backend to tell the collector
* to reset counters
@@ -389,15 +393,28 @@ typedef struct PgStat_MsgResetcounter
Oid m_databaseid;
} PgStat_MsgResetcounter;
-/* ----------
- * PgStat_MsgResetsharedcounter Sent by the backend to tell the collector
- * to reset a shared counter
- * ----------
+/*
+ * Sent by the backend to tell the collector to reset a shared counter.
+ *
+ * In addition to the message header and reset target, the message also
+ * contains an array with all of the IO operations for all IO paths done by a
+ * particular backend type.
+ *
+ * This is needed because the IO operation stats for live backends cannot be
+ * safely modified by other processes. Therefore, to correctly calculate the
+ * total IO operations for a particular backend type after a reset, the balance
+ * of IO operations for live backends at the time of prior resets must be
+ * subtracted from the total IO operations.
+ *
+ * To satisfy this requirement, the process initiating the reset will read the
+ * IO operations counters from live backends and send them to the stats
+ * collector which maintains an array of reset values.
*/
typedef struct PgStat_MsgResetsharedcounter
{
PgStat_MsgHdr m_hdr;
PgStat_Shared_Reset_Target m_resettarget;
+ PgStat_MsgIOPathOps m_backend_resets;
} PgStat_MsgResetsharedcounter;
/* ----------
diff --git a/src/include/utils/backend_status.h b/src/include/utils/backend_status.h
index 3de1e7c8d3..7e59c063b9 100644
--- a/src/include/utils/backend_status.h
+++ b/src/include/utils/backend_status.h
@@ -375,6 +375,7 @@ extern void pgstat_bestart(void);
extern void pgstat_clear_backend_activity_snapshot(void);
/* Activity reporting functions */
+typedef struct PgStatIOPathOps PgStatIOPathOps;
static inline void
pgstat_inc_ioop(IOOp io_op, IOPath io_path)
@@ -402,6 +403,7 @@ pgstat_inc_ioop(IOOp io_op, IOPath io_path)
}
}
extern void pgstat_report_activity(BackendState state, const char *cmd_str);
+extern void pgstat_report_live_backend_io_path_ops(PgStatIOPathOps *backend_io_path_ops);
extern void pgstat_report_query_id(uint64 query_id, bool force);
extern void pgstat_report_tempfile(size_t filesize);
extern void pgstat_report_appname(const char *appname);
--
2.32.0
Attachment: v21-0006-Add-system-view-tracking-IO-ops-per-backend-type.patch
From 87c8dbac82fcd0ded35ec12f79804ef8e46a56e0 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 24 Nov 2021 12:07:37 -0500
Subject: [PATCH v21 6/8] Add system view tracking IO ops per backend type
Add pg_stat_buffers, a system view which tracks the number of IO
operations (allocs, writes, fsyncs, and extends) done through each IO
path (e.g. shared buffers, local buffers, unbuffered IO) by each type of
backend.
Some of these should always be zero. For example, checkpointer does not
use a BufferAccessStrategy (currently), so the "strategy" IO Path for
checkpointer will be 0 for all IO operations (alloc, write, fsync, and
extend). All possible combinations of IOPath and IOOp are enumerated in
the view but not all are populated or even possible at this point.
All backends increment a counter in an array of IO stat counters in
their PgBackendStatus when performing an IO operation. On exit, backends
send these stats to the stats collector to be persisted.
When the pg_stat_buffers view is queried, one backend will sum live
backends' stats with saved stats from exited backends and subtract saved
reset stats, returning the total.
Each row of the view is stats for a particular backend type for a
particular IO Path (e.g. shared buffer accesses by checkpointer) and
each column in the view is the total number of IO operations done (e.g.
writes).
So a cell in the view would be, for example, the number of shared
buffers written by checkpointer since the last stats reset.
Suggested by Andres Freund
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
doc/src/sgml/monitoring.sgml | 119 +++++++++++++++-
src/backend/catalog/system_views.sql | 11 ++
src/backend/postmaster/pgstat.c | 13 ++
src/backend/utils/activity/backend_status.c | 19 ++-
src/backend/utils/adt/pgstatfuncs.c | 150 ++++++++++++++++++++
src/include/catalog/pg_proc.dat | 9 ++
src/include/pgstat.h | 1 +
src/include/utils/backend_status.h | 1 +
src/test/regress/expected/rules.out | 8 ++
9 files changed, 323 insertions(+), 8 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index caa45cb5f5..82d6f9a7de 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -435,6 +435,15 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</entry>
</row>
+ <row>
+ <entry><structname>pg_stat_buffers</structname><indexterm><primary>pg_stat_buffers</primary></indexterm></entry>
+ <entry>One row per combination of backend type and IO path, showing
+ statistics about backend IO operations. See
+ <link linkend="monitoring-pg-stat-buffers-view">
+ <structname>pg_stat_buffers</structname></link> for details.
+ </entry>
+ </row>
+
<row>
<entry><structname>pg_stat_wal</structname><indexterm><primary>pg_stat_wal</primary></indexterm></entry>
<entry>One row only, showing statistics about WAL activity. See
@@ -3613,7 +3622,102 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
<structfield>stats_reset</structfield> <type>timestamp with time zone</type>
</para>
<para>
- Time at which these statistics were last reset
+ Time at which these statistics were last reset.
+ </para></entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ </sect2>
+
+ <sect2 id="monitoring-pg-stat-buffers-view">
+ <title><structname>pg_stat_buffers</structname></title>
+
+ <indexterm>
+ <primary>pg_stat_buffers</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_stat_buffers</structname> view contains one row per
+ combination of backend type and IO path, showing cluster-wide statistics
+ on IO operations for that combination.
+ </para>
+
+ <table id="pg-stat-buffers-view" xreflabel="pg_stat_buffers">
+ <title><structname>pg_stat_buffers</structname> View</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>backend_type</structfield> <type>text</type>
+ </para>
+ <para>
+ Type of backend (e.g. background worker, autovacuum worker).
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>io_path</structfield> <type>text</type>
+ </para>
+ <para>
+ IO path taken (e.g. shared buffers, direct).
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>alloc</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers allocated.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>extend</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers extended.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>fsync</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers fsynced.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>write</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers written.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
+ </para>
+ <para>
+ Time at which these statistics were last reset.
</para></entry>
</row>
</tbody>
@@ -5218,12 +5322,13 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
</para>
<para>
Resets some cluster-wide statistics counters to zero, depending on the
- argument. The argument can be <literal>buffers</literal> to reset
- all the counters shown in
- the <structname>pg_stat_bgwriter</structname>
- view, <literal>archiver</literal> to reset all the counters shown in
- the <structname>pg_stat_archiver</structname> view or <literal>wal</literal>
- to reset all the counters shown in the <structname>pg_stat_wal</structname> view.
+ argument. The argument can be <literal>archiver</literal> to reset all
+ the counters shown in the <structname>pg_stat_archiver</structname>
+ view, <literal>buffers</literal> to reset all the counters shown in
+ both the <structname>pg_stat_bgwriter</structname> view and
+ <structname>pg_stat_buffers</structname> view, or
+ <literal>wal</literal> to reset all the counters shown in the
+ <structname>pg_stat_wal</structname> view.
</para>
<para>
This function is restricted to superusers by default, but other users
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 3cb69b1f87..446a817905 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1078,6 +1078,17 @@ CREATE VIEW pg_stat_bgwriter AS
pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+CREATE VIEW pg_stat_buffers AS
+SELECT
+ b.backend_type,
+ b.io_path,
+ b.alloc,
+ b.extend,
+ b.fsync,
+ b.write,
+ b.stats_reset
+FROM pg_stat_get_buffers() b;
+
CREATE VIEW pg_stat_wal AS
SELECT
w.wal_records,
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index a5b7cfa45d..b1a5b15410 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -2921,6 +2921,19 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
rec->tuples_inserted + rec->tuples_updated;
}
+/*
+ * Support function for SQL-callable pgstat* functions. Returns a pointer to
+ * the PgStat_BackendIOPathOps structure tracking IO operation statistics
+ * for exited backends as well as the reset values used for reset arithmetic.
+ */
+PgStat_BackendIOPathOps *
+pgstat_fetch_exited_backend_buffers(void)
+{
+ backend_read_statsfile();
+
+ return &globalStats.buffers;
+}
+
/* ----------
* pgstat_fetch_stat_dbentry() -
diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index 87b9d0fc0d..c579014ec2 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -38,7 +38,7 @@ int pgstat_track_activity_query_size = 1024;
PgBackendStatus *MyBEEntry = NULL;
-static PgBackendStatus *BackendStatusArray = NULL;
+PgBackendStatus *BackendStatusArray = NULL;
static char *BackendAppnameBuffer = NULL;
static char *BackendClientHostnameBuffer = NULL;
static char *BackendActivityBuffer = NULL;
@@ -241,6 +241,23 @@ CreateSharedBackendStatus(void)
#endif
}
+const char *
+GetIOPathDesc(IOPath io_path)
+{
+ switch (io_path)
+ {
+ case IOPATH_DIRECT:
+ return "direct";
+ case IOPATH_LOCAL:
+ return "local";
+ case IOPATH_SHARED:
+ return "shared";
+ case IOPATH_STRATEGY:
+ return "strategy";
+ }
+ return "unknown IO path";
+}
+
/*
* Initialize pgstats backend activity state, and set up our on-proc-exit
* hook. Called from InitPostgres and AuxiliaryProcessMain. For auxiliary
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 30e8dfa7c1..b0e66f89cf 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1790,6 +1790,156 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
}
+/*
+* When adding a new column to the pg_stat_buffers view, add a new enum
+* value here above BUFFERS_NUM_COLUMNS.
+*/
+enum
+{
+ BUFFERS_COLUMN_BACKEND_TYPE,
+ BUFFERS_COLUMN_IO_PATH,
+ BUFFERS_COLUMN_ALLOCS,
+ BUFFERS_COLUMN_EXTENDS,
+ BUFFERS_COLUMN_FSYNCS,
+ BUFFERS_COLUMN_WRITES,
+ BUFFERS_COLUMN_RESET_TIME,
+ BUFFERS_NUM_COLUMNS,
+};
+
+/*
+ * Helper function to get the correct row in the pg_stat_buffers view.
+ */
+static inline Datum *
+get_pg_stat_buffers_row(Datum all_values[BACKEND_NUM_TYPES][IOPATH_NUM_TYPES][BUFFERS_NUM_COLUMNS],
+ BackendType backend_type, IOPath io_path)
+{
+ return all_values[backend_type_get_idx(backend_type)][io_path];
+}
+
+Datum
+pg_stat_get_buffers(PG_FUNCTION_ARGS)
+{
+ PgStat_BackendIOPathOps *backend_io_path_ops;
+ PgBackendStatus *beentry;
+ Datum reset_time;
+
+ ReturnSetInfo *rsinfo;
+ TupleDesc tupdesc;
+ Tuplestorestate *tupstore;
+ MemoryContext per_query_ctx;
+ MemoryContext oldcontext;
+
+ Datum all_values[BACKEND_NUM_TYPES][IOPATH_NUM_TYPES][BUFFERS_NUM_COLUMNS];
+ bool all_nulls[BACKEND_NUM_TYPES][IOPATH_NUM_TYPES][BUFFERS_NUM_COLUMNS];
+
+ rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+
+ /* check to see if caller supports us returning a tuplestore */
+ if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("set-valued function called in context that cannot accept a set")));
+ if (!(rsinfo->allowedModes & SFRM_Materialize))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("materialize mode required, but it is not allowed in this context")));
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+ oldcontext = MemoryContextSwitchTo(per_query_ctx);
+ tupstore = tuplestore_begin_heap((bool) (rsinfo->allowedModes & SFRM_Materialize_Random),
+ false, work_mem);
+ rsinfo->returnMode = SFRM_Materialize;
+ rsinfo->setResult = tupstore;
+ rsinfo->setDesc = tupdesc;
+ MemoryContextSwitchTo(oldcontext);
+
+ memset(all_values, 0, sizeof(all_values));
+ memset(all_nulls, 0, sizeof(all_nulls));
+
+ /* Loop through all live backends and count their IO Ops for each IO Path */
+ beentry = BackendStatusArray;
+
+ for (int i = 0; i < GetMaxBackends() + NUM_AUXPROCTYPES; i++, beentry++)
+ {
+ IOOpCounters *io_ops;
+
+ /*
+ * Don't count dead backends. They will be added below. There are no
+ * rows in the view for BackendType B_INVALID, so skip those as well.
+ */
+ if (beentry->st_procpid == 0 || beentry->st_backendType == B_INVALID)
+ continue;
+
+ io_ops = beentry->io_path_stats;
+
+ for (int io_path = 0; io_path < IOPATH_NUM_TYPES; io_path++)
+ {
+ Datum *row = get_pg_stat_buffers_row(all_values, beentry->st_backendType, io_path);
+
+ /*
+ * BUFFERS_COLUMN_RESET_TIME, BUFFERS_COLUMN_BACKEND_TYPE, and
+ * BUFFERS_COLUMN_IO_PATH will all be set when looping through the
+ * exited backends array below.
+ */
+ row[BUFFERS_COLUMN_ALLOCS] += pg_atomic_read_u64(&io_ops->allocs);
+ row[BUFFERS_COLUMN_EXTENDS] += pg_atomic_read_u64(&io_ops->extends);
+ row[BUFFERS_COLUMN_FSYNCS] += pg_atomic_read_u64(&io_ops->fsyncs);
+ row[BUFFERS_COLUMN_WRITES] += pg_atomic_read_u64(&io_ops->writes);
+ io_ops++;
+ }
+ }
+
+ /* Add stats from all exited backends */
+ backend_io_path_ops = pgstat_fetch_exited_backend_buffers();
+
+ reset_time = TimestampTzGetDatum(backend_io_path_ops->stat_reset_timestamp);
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ BackendType backend_type = idx_get_backend_type(i);
+
+ PgStatIOOpCounters *io_ops =
+ backend_io_path_ops->ops[i].io_path_ops;
+ PgStatIOOpCounters *resets =
+ backend_io_path_ops->resets[i].io_path_ops;
+
+ Datum backend_type_desc =
+ CStringGetTextDatum(GetBackendTypeDesc(backend_type));
+
+ for (int j = 0; j < IOPATH_NUM_TYPES; j++)
+ {
+ Datum *row = get_pg_stat_buffers_row(all_values, backend_type, j);
+
+ row[BUFFERS_COLUMN_BACKEND_TYPE] = backend_type_desc;
+ row[BUFFERS_COLUMN_IO_PATH] = CStringGetTextDatum(GetIOPathDesc(j));
+ row[BUFFERS_COLUMN_RESET_TIME] = reset_time;
+ row[BUFFERS_COLUMN_ALLOCS] += io_ops->allocs - resets->allocs;
+ row[BUFFERS_COLUMN_EXTENDS] += io_ops->extends - resets->extends;
+ row[BUFFERS_COLUMN_FSYNCS] += io_ops->fsyncs - resets->fsyncs;
+ row[BUFFERS_COLUMN_WRITES] += io_ops->writes - resets->writes;
+ io_ops++;
+ resets++;
+ }
+ }
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ for (int j = 0; j < IOPATH_NUM_TYPES; j++)
+ {
+ Datum *values = all_values[i][j];
+ bool *nulls = all_nulls[i][j];
+
+ tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+ }
+ }
+
+ return (Datum) 0;
+}
+
/*
* Returns statistics of WAL activity
*/
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 7f1ee97f55..9d3ab6d0a3 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5642,6 +5642,15 @@
proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
+{ oid => '8459', descr => 'statistics: counts of all IO operations done through all IO paths by each type of backend',
+ proname => 'pg_stat_get_buffers', provolatile => 's', proisstrict => 'f',
+ prorows => '52', proretset => 't',
+ proparallel => 'r', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,text,int8,int8,int8,int8,timestamptz}',
+ proargmodes => '{o,o,o,o,o,o,o}',
+ proargnames => '{backend_type,io_path,alloc,extend,fsync,write,stats_reset}',
+ prosrc => 'pg_stat_get_buffers' },
+
{ oid => '1136', descr => 'statistics: information about WAL activity',
proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
proparallel => 'r', prorettype => 'record', proargtypes => '',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index e818a26780..caf5ef5678 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1296,6 +1296,7 @@ extern void pgstat_sum_io_path_ops(PgStatIOOpCounters *dest, IOOpCounters *src);
* generate the pgstat* views.
* ----------
*/
+extern PgStat_BackendIOPathOps *pgstat_fetch_exited_backend_buffers(void);
extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid);
extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
diff --git a/src/include/utils/backend_status.h b/src/include/utils/backend_status.h
index 7e59c063b9..6d623ff746 100644
--- a/src/include/utils/backend_status.h
+++ b/src/include/utils/backend_status.h
@@ -316,6 +316,7 @@ extern PGDLLIMPORT int pgstat_track_activity_query_size;
* ----------
*/
extern PGDLLIMPORT PgBackendStatus *MyBEEntry;
+extern PGDLLIMPORT PgBackendStatus *BackendStatusArray;
/* ----------
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 1420288d67..fb17ed7f93 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1806,6 +1806,14 @@ pg_stat_bgwriter| SELECT pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints
pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+pg_stat_buffers| SELECT b.backend_type,
+ b.io_path,
+ b.alloc,
+ b.extend,
+ b.fsync,
+ b.write,
+ b.stats_reset
+ FROM pg_stat_get_buffers() b(backend_type, io_path, alloc, extend, fsync, write, stats_reset);
pg_stat_database| SELECT d.oid AS datid,
d.datname,
CASE
--
2.32.0
Attachment: v21-0002-Move-backend-pgstat-initialization-earlier.patch
From 931557730cbd0d76c7ab20c3f49bdee24ea59db0 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 14 Dec 2021 12:26:56 -0500
Subject: [PATCH v21 2/8] Move backend pgstat initialization earlier
Initialize the pgstats subsystem earlier during process initialization
so that more process types have a backend activity state
(PgBackendStatus).
Conditionally initializing backend activity state in some types of
processes and not in others necessitates surprising special cases in the
code.
This particular commit was motivated by single user mode missing a
backend activity state.
This commit also adds a new BackendType for standalone backends,
B_STANDALONE_BACKEND (and alphabetizes the BackendTypes). Both the
bootstrap backend and single user mode backends will have BackendType
B_STANDALONE_BACKEND.
Author: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://www.postgresql.org/message-id/CAAKRu_aaq33UnG4TXq3S-OSXGWj1QGf0sU%2BECH4tNwGFNERkZA%40mail.gmail.com
---
src/backend/utils/init/miscinit.c | 23 ++++++++++++++---------
src/backend/utils/init/postinit.c | 7 +++----
src/include/miscadmin.h | 7 ++++---
3 files changed, 21 insertions(+), 16 deletions(-)
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index bdc77af719..cf6eca4bb4 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -176,6 +176,8 @@ InitStandaloneProcess(const char *argv0)
{
Assert(!IsPostmasterEnvironment);
+ MyBackendType = B_STANDALONE_BACKEND;
+
/*
* Start our win32 signal implementation
*/
@@ -255,6 +257,9 @@ GetBackendTypeDesc(BackendType backendType)
case B_INVALID:
backendDesc = "not initialized";
break;
+ case B_ARCHIVER:
+ backendDesc = "archiver";
+ break;
case B_AUTOVAC_LAUNCHER:
backendDesc = "autovacuum launcher";
break;
@@ -273,9 +278,18 @@ GetBackendTypeDesc(BackendType backendType)
case B_CHECKPOINTER:
backendDesc = "checkpointer";
break;
+ case B_LOGGER:
+ backendDesc = "logger";
+ break;
+ case B_STANDALONE_BACKEND:
+ backendDesc = "standalone backend";
+ break;
case B_STARTUP:
backendDesc = "startup";
break;
+ case B_STATS_COLLECTOR:
+ backendDesc = "stats collector";
+ break;
case B_WAL_RECEIVER:
backendDesc = "walreceiver";
break;
@@ -285,15 +299,6 @@ GetBackendTypeDesc(BackendType backendType)
case B_WAL_WRITER:
backendDesc = "walwriter";
break;
- case B_ARCHIVER:
- backendDesc = "archiver";
- break;
- case B_STATS_COLLECTOR:
- backendDesc = "stats collector";
- break;
- case B_LOGGER:
- backendDesc = "logger";
- break;
}
return backendDesc;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index e2208151e4..c856fbe286 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -709,6 +709,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
RegisterTimeout(CLIENT_CONNECTION_CHECK_TIMEOUT, ClientCheckTimeoutHandler);
}
+ pgstat_beinit();
+
/*
* If this is either a bootstrap process or a standalone backend, start
* up the XLOG machinery, and register to have it closed down at exit.
@@ -724,6 +726,7 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
*/
CreateAuxProcessResourceOwner();
+ pgstat_bestart();
StartupXLOG();
/* Release (and warn about) any buffer pins leaked in StartupXLOG */
ReleaseAuxProcessResources(true);
@@ -751,7 +754,6 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
EnablePortalManager();
/* Initialize status reporting */
- pgstat_beinit();
/*
* Load relcache entries for the shared system catalogs. This must create
@@ -989,10 +991,7 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
* transaction we started before returning.
*/
if (!bootstrap)
- {
- pgstat_bestart();
CommitTransactionCommand();
- }
return;
}
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 0abc3ad540..94c6135e93 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -322,19 +322,20 @@ extern void SwitchBackToLocalLatch(void);
typedef enum BackendType
{
B_INVALID = 0,
+ B_ARCHIVER,
B_AUTOVAC_LAUNCHER,
B_AUTOVAC_WORKER,
B_BACKEND,
B_BG_WORKER,
B_BG_WRITER,
B_CHECKPOINTER,
+ B_LOGGER,
+ B_STANDALONE_BACKEND,
B_STARTUP,
+ B_STATS_COLLECTOR,
B_WAL_RECEIVER,
B_WAL_SENDER,
B_WAL_WRITER,
- B_ARCHIVER,
- B_STATS_COLLECTOR,
- B_LOGGER,
} BackendType;
extern BackendType MyBackendType;
--
2.32.0
Attachment: v21-0003-Add-IO-operation-counters-to-PgBackendStatus.patch
From c1c3b700679a5376362120488cd68b0d0c3033e5 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 24 Nov 2021 10:32:56 -0500
Subject: [PATCH v21 3/8] Add IO operation counters to PgBackendStatus
Add an array of counters to PgBackendStatus which count the buffers
allocated, extended, fsynced, and written by a given backend.
Each "IO Op" (alloc, fsync, extend, write) is counted per "IO Path"
(direct, local, shared, or strategy).
"local" and "shared" IO Path counters count operations on local and
shared buffers.
The "strategy" IO Path counts buffers alloc'd/written/read/fsync'd as
part of a BufferAccessStrategy.
The "direct" IO Path counts blocks of IO which are read, written, or
fsync'd using smgrwrite/extend/immedsync directly (as opposed to through
[Local]BufferAlloc()).
With this commit, backends increment a counter in the array in their
PgBackendStatus when performing an IO operation.
Future patches will persist the IO stat counters from a backend's
PgBackendStatus upon backend exit and use the counters to provide
observability of database IO operations.
Note that this commit does not add code to increment the "direct" path.
A future patch adding wrappers for smgrwrite(), smgrextend(), and
smgrimmedsync() would provide a good location to call pgstat_inc_ioop()
for unbuffered IO and avoid regressions for future users of these
functions.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
src/backend/postmaster/checkpointer.c | 1 +
src/backend/storage/buffer/bufmgr.c | 47 +++++++++++---
src/backend/storage/buffer/freelist.c | 22 ++++++-
src/backend/storage/buffer/localbuf.c | 3 +
src/backend/storage/sync/sync.c | 1 +
src/backend/utils/activity/backend_status.c | 10 +++
src/include/storage/buf_internals.h | 4 +-
src/include/utils/backend_status.h | 68 +++++++++++++++++++++
8 files changed, 141 insertions(+), 15 deletions(-)
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 4488e3a443..4e88327425 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -1106,6 +1106,7 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
if (!AmBackgroundWriterProcess())
CheckpointerShmem->num_backend_fsync++;
LWLockRelease(CheckpointerCommLock);
+ pgstat_inc_ioop(IOOP_FSYNC, IOPATH_SHARED);
return false;
}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f5459c68f8..a6e446f29a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -481,7 +481,7 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
BlockNumber blockNum,
BufferAccessStrategy strategy,
bool *foundPtr);
-static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+static void FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOPath iopath);
static void FindAndDropRelFileNodeBuffers(RelFileNode rnode,
ForkNumber forkNum,
BlockNumber nForkBlock,
@@ -973,6 +973,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (isExtend)
{
+ pgstat_inc_ioop(IOOP_EXTEND, IOPATH_SHARED);
/* new buffers are zero-filled */
MemSet((char *) bufBlock, 0, BLCKSZ);
/* don't set checksum for all-zero page */
@@ -1173,6 +1174,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
/* Loop here in case we have to try another victim buffer */
for (;;)
{
+ bool from_ring;
+
/*
* Ensure, while the spinlock's not yet held, that there's a free
* refcount entry.
@@ -1183,7 +1186,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* Select a victim buffer. The buffer is returned with its header
* spinlock still held!
*/
- buf = StrategyGetBuffer(strategy, &buf_state);
+ buf = StrategyGetBuffer(strategy, &buf_state, &from_ring);
Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
@@ -1220,6 +1223,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf),
LW_SHARED))
{
+ IOPath iopath;
+
/*
* If using a nondefault strategy, and writing the buffer
* would require a WAL flush, let the strategy decide whether
@@ -1237,7 +1242,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
UnlockBufHdr(buf, buf_state);
if (XLogNeedsFlush(lsn) &&
- StrategyRejectBuffer(strategy, buf))
+ StrategyRejectBuffer(strategy, buf, &from_ring))
{
/* Drop lock/pin and loop around for another buffer */
LWLockRelease(BufferDescriptorGetContentLock(buf));
@@ -1246,13 +1251,27 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
}
}
+ /*
+ * When a strategy is in use, if the dirty buffer was selected
+ * from the strategy ring and we did not bother checking the
+ * freelist or doing a clock sweep to look for a clean shared
+ * buffer to use, the write will be counted as a strategy
+ * write. However, if the dirty buffer was obtained from the
+ * freelist or a clock sweep, it is counted as a regular write.
+ *
+ * When a strategy is not in use, at this point the write can
+ * only be a "regular" write of a dirty buffer.
+ */
+
+ iopath = from_ring ? IOPATH_STRATEGY : IOPATH_SHARED;
+
/* OK, do the I/O */
TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(forkNum, blockNum,
smgr->smgr_rnode.node.spcNode,
smgr->smgr_rnode.node.dbNode,
smgr->smgr_rnode.node.relNode);
- FlushBuffer(buf, NULL);
+ FlushBuffer(buf, NULL, iopath);
LWLockRelease(BufferDescriptorGetContentLock(buf));
ScheduleBufferTagForWriteback(&BackendWritebackContext,
@@ -2553,10 +2572,11 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
* Pin it, share-lock it, write it. (FlushBuffer will do nothing if the
* buffer is clean by the time we've locked it.)
*/
+
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOPATH_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
@@ -2804,9 +2824,12 @@ BufferGetTag(Buffer buffer, RelFileNode *rnode, ForkNumber *forknum,
*
* If the caller has an smgr reference for the buffer's relation, pass it
* as the second parameter. If not, pass NULL.
+ *
+ * IOPath will always be IOPATH_SHARED except when a buffer access strategy is
+ * used and the buffer being flushed is a buffer from the strategy ring.
*/
static void
-FlushBuffer(BufferDesc *buf, SMgrRelation reln)
+FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOPath iopath)
{
XLogRecPtr recptr;
ErrorContextCallback errcallback;
@@ -2898,6 +2921,8 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
bufToWrite,
false);
+ pgstat_inc_ioop(IOOP_WRITE, iopath);
+
if (track_io_timing)
{
INSTR_TIME_SET_CURRENT(io_time);
@@ -3534,6 +3559,8 @@ FlushRelationBuffers(Relation rel)
localpage,
false);
+ pgstat_inc_ioop(IOOP_WRITE, IOPATH_LOCAL);
+
buf_state &= ~(BM_DIRTY | BM_JUST_DIRTIED);
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
@@ -3569,7 +3596,7 @@ FlushRelationBuffers(Relation rel)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, RelationGetSmgr(rel));
+ FlushBuffer(bufHdr, RelationGetSmgr(rel), IOPATH_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
@@ -3665,7 +3692,7 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, srelent->srel);
+ FlushBuffer(bufHdr, srelent->srel, IOPATH_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
@@ -3721,7 +3748,7 @@ FlushDatabaseBuffers(Oid dbid)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOPATH_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
@@ -3748,7 +3775,7 @@ FlushOneBuffer(Buffer buffer)
Assert(LWLockHeldByMe(BufferDescriptorGetContentLock(bufHdr)));
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOPATH_SHARED);
}
/*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 3b98e68d50..15a9de4c9d 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -19,6 +19,7 @@
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/proc.h"
+#include "utils/backend_status.h"
#define INT_ACCESS_ONCE(var) ((int)(*((volatile int *)&(var))))
@@ -198,7 +199,7 @@ have_free_buffer(void)
* return the buffer with the buffer header spinlock still held.
*/
BufferDesc *
-StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
+StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_ring)
{
BufferDesc *buf;
int bgwprocno;
@@ -212,7 +213,8 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
if (strategy != NULL)
{
buf = GetBufferFromRing(strategy, buf_state);
- if (buf != NULL)
+ *from_ring = buf != NULL;
+ if (*from_ring)
return buf;
}
@@ -247,6 +249,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
* the rate of buffer consumption. Note that buffers recycled by a
* strategy object are intentionally not counted here.
*/
+ pgstat_inc_ioop(IOOP_ALLOC, IOPATH_SHARED);
pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);
/*
@@ -683,8 +686,14 @@ AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
* if this buffer should be written and re-used.
*/
bool
-StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
+StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool *from_ring)
{
+ /*
+ * Start by assuming that we will use the dirty buffer selected by
+ * StrategyGetBuffer().
+ */
+ *from_ring = true;
+
/* We only do this in bulkread mode */
if (strategy->btype != BAS_BULKREAD)
return false;
@@ -700,5 +709,12 @@ StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
*/
strategy->buffers[strategy->current] = InvalidBuffer;
+ /*
+ * Since we will not be writing out a dirty buffer from the ring, set
+ * from_ring to false so that the caller does not count this write as a
+ * "strategy write" and can do proper bookkeeping.
+ */
+ *from_ring = false;
+
return true;
}
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index e71f95ac1f..864b4db20e 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -20,6 +20,7 @@
#include "executor/instrument.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
+#include "utils/backend_status.h"
#include "utils/guc.h"
#include "utils/memutils.h"
#include "utils/resowner_private.h"
@@ -226,6 +227,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
localpage,
false);
+ pgstat_inc_ioop(IOOP_WRITE, IOPATH_LOCAL);
+
/* Mark not-dirty now in case we error out below */
buf_state &= ~BM_DIRTY;
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index e161d57761..266fb9ca0a 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -420,6 +420,7 @@ ProcessSyncRequests(void)
total_elapsed += elapsed;
processed++;
+ pgstat_inc_ioop(IOOP_FSYNC, IOPATH_SHARED);
if (log_checkpoints)
elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f ms",
processed,
diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index 079321599d..79410e0b2c 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -405,6 +405,16 @@ pgstat_bestart(void)
lbeentry.st_progress_command_target = InvalidOid;
lbeentry.st_query_id = UINT64CONST(0);
+ for (int i = 0; i < IOPATH_NUM_TYPES; i++)
+ {
+ IOOpCounters *io_ops = &lbeentry.io_path_stats[i];
+
+ pg_atomic_init_u64(&io_ops->allocs, 0);
+ pg_atomic_init_u64(&io_ops->extends, 0);
+ pg_atomic_init_u64(&io_ops->fsyncs, 0);
+ pg_atomic_init_u64(&io_ops->writes, 0);
+ }
+
/*
* we don't zero st_progress_param here to save cycles; nobody should
* examine it until st_progress_command has been set to something other
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index b903d2bcaf..64f4fc3b96 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -310,10 +310,10 @@ extern void ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *
/* freelist.c */
extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
- uint32 *buf_state);
+ uint32 *buf_state, bool *from_ring);
extern void StrategyFreeBuffer(BufferDesc *buf);
extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
- BufferDesc *buf);
+ BufferDesc *buf, bool *from_ring);
extern int StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
extern void StrategyNotifyBgWriter(int bgwprocno);
diff --git a/src/include/utils/backend_status.h b/src/include/utils/backend_status.h
index 8217d0cb6b..950b7396a5 100644
--- a/src/include/utils/backend_status.h
+++ b/src/include/utils/backend_status.h
@@ -13,6 +13,7 @@
#include "datatype/timestamp.h"
#include "libpq/pqcomm.h"
#include "miscadmin.h" /* for BackendType */
+#include "port/atomics.h"
#include "utils/backend_progress.h"
@@ -31,12 +32,47 @@ typedef enum BackendState
STATE_DISABLED
} BackendState;
+/* ----------
+ * IO Stats reporting utility types
+ * ----------
+ */
+
+typedef enum IOOp
+{
+ IOOP_ALLOC,
+ IOOP_EXTEND,
+ IOOP_FSYNC,
+ IOOP_WRITE,
+} IOOp;
+
+#define IOOP_NUM_TYPES (IOOP_WRITE + 1)
+
+typedef enum IOPath
+{
+ IOPATH_DIRECT,
+ IOPATH_LOCAL,
+ IOPATH_SHARED,
+ IOPATH_STRATEGY,
+} IOPath;
+
+#define IOPATH_NUM_TYPES (IOPATH_STRATEGY + 1)
/* ----------
* Shared-memory data structures
* ----------
*/
+/*
+ * Structure for counting all types of IOOp for a live backend.
+ */
+typedef struct IOOpCounters
+{
+ pg_atomic_uint64 allocs;
+ pg_atomic_uint64 extends;
+ pg_atomic_uint64 fsyncs;
+ pg_atomic_uint64 writes;
+} IOOpCounters;
+
/*
* PgBackendSSLStatus
*
@@ -168,6 +204,12 @@ typedef struct PgBackendStatus
/* query identifier, optionally computed using post_parse_analyze_hook */
uint64 st_query_id;
+
+ /*
+ * Stats on all IOOps for all IOPaths for this backend. These should be
+ * incremented whenever an IO Operation is performed.
+ */
+ IOOpCounters io_path_stats[IOPATH_NUM_TYPES];
} PgBackendStatus;
@@ -296,6 +338,32 @@ extern void pgstat_bestart(void);
extern void pgstat_clear_backend_activity_snapshot(void);
/* Activity reporting functions */
+
+static inline void
+pgstat_inc_ioop(IOOp io_op, IOPath io_path)
+{
+ IOOpCounters *io_ops;
+ PgBackendStatus *beentry = MyBEEntry;
+
+ Assert(beentry);
+
+ io_ops = &beentry->io_path_stats[io_path];
+ switch (io_op)
+ {
+ case IOOP_ALLOC:
+ pg_atomic_unlocked_inc_counter(&io_ops->allocs);
+ break;
+ case IOOP_EXTEND:
+ pg_atomic_unlocked_inc_counter(&io_ops->extends);
+ break;
+ case IOOP_FSYNC:
+ pg_atomic_unlocked_inc_counter(&io_ops->fsyncs);
+ break;
+ case IOOP_WRITE:
+ pg_atomic_unlocked_inc_counter(&io_ops->writes);
+ break;
+ }
+}
extern void pgstat_report_activity(BackendState state, const char *cmd_str);
extern void pgstat_report_query_id(uint64 query_id, bool force);
extern void pgstat_report_tempfile(size_t filesize);
--
2.32.0
Attachment: v21-0001-Read-only-atomic-backend-write-function.patch
From f9a15549a8a08f784284bc463df274767291334f Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 11 Oct 2021 16:15:06 -0400
Subject: [PATCH v21 1/8] Read-only atomic backend write function
For counters in shared memory which can be read by any backend but only
written to by one backend, an atomic is still needed to protect against
torn values; however, pg_atomic_fetch_add_u64() is overkill for
incrementing such counters. pg_atomic_unlocked_inc_counter() is a helper
function which can be used to increment these values safely without
unnecessary overhead.
Author: Thomas Munro <tmunro@postgresql.org>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://www.postgresql.org/message-id/CA%2BhUKGJ06d3h5JeOtAv4h52n0vG1jOPZxqMCn5FySJQUVZA32w%40mail.gmail.com
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
src/include/port/atomics.h | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/src/include/port/atomics.h b/src/include/port/atomics.h
index 9550e04aaa..3d1fdc6475 100644
--- a/src/include/port/atomics.h
+++ b/src/include/port/atomics.h
@@ -519,6 +519,17 @@ pg_atomic_sub_fetch_u64(volatile pg_atomic_uint64 *ptr, int64 sub_)
return pg_atomic_sub_fetch_u64_impl(ptr, sub_);
}
+/*
+ * On modern systems this is really just *counter++. On some older systems
+ * there might be more to it (due to an inability to read and write 64-bit
+ * values atomically).
+ */
+static inline void
+pg_atomic_unlocked_inc_counter(pg_atomic_uint64 *counter)
+{
+ pg_atomic_write_u64(counter, pg_atomic_read_u64(counter) + 1);
+}
+
#undef INSIDE_ATOMICS_H
#endif /* ATOMICS_H */
--
2.32.0
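The single-writer counter the patch adds can be sketched outside the tree like this. This is a hypothetical stand-in, not the PostgreSQL implementation: C11 `_Atomic` load/store with relaxed ordering plays the role of `pg_atomic_read_u64()`/`pg_atomic_write_u64()`, and `counter64` stands in for `pg_atomic_uint64`. The point it illustrates is the same as the commit message: when exactly one process ever writes the counter, a plain load-increment-store is enough, because readers only need protection against torn 64-bit values, not against lost updates.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Stand-in for pg_atomic_uint64 using C11 atomics. */
typedef struct
{
    _Atomic uint64_t value;
} counter64;

/*
 * Mirrors pg_atomic_unlocked_inc_counter(): one relaxed load plus one
 * relaxed store, no locked read-modify-write cycle.  Safe ONLY while a
 * single writer increments the counter; concurrent readers still see
 * untorn 64-bit values.
 */
static inline void
counter64_inc(counter64 *c)
{
    uint64_t v = atomic_load_explicit(&c->value, memory_order_relaxed);

    atomic_store_explicit(&c->value, v + 1, memory_order_relaxed);
}

static inline uint64_t
counter64_read(counter64 *c)
{
    return atomic_load_explicit(&c->value, memory_order_relaxed);
}
```

With two or more writers this scheme loses increments (two writers can load the same value), which is why the real helper is documented as usable only for per-backend counters.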
Hi,
On 2022-02-19 11:06:18 -0500, Melanie Plageman wrote:
v21 rebased with compile errors fixed is attached.
This currently doesn't apply (mea culpa likely): http://cfbot.cputube.org/patch_37_3272.log
Could you rebase? Marked as waiting-on-author for now.
- Andres
I already rebased this in a local branch, so here it is.
I don't expect it to survive the day.
This should be updated to use the tuplestore helper.
Attachments:
0001-Read-only-atomic-backend-write-function.patch
From bc4afef0bf0cb34d90fb6c029ab4c5ff1a6d033d Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 11 Oct 2021 16:15:06 -0400
Subject: [PATCH 1/8] Read-only atomic backend write function
For counters in shared memory which can be read by any backend but only
written to by one backend, an atomic is still needed to protect against
torn values; however, pg_atomic_fetch_add_u64() is overkill for
incrementing such counters. pg_atomic_unlocked_inc_counter() is a helper
function which can be used to increment these values safely without
unnecessary overhead.
Author: Thomas Munro <tmunro@postgresql.org>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://www.postgresql.org/message-id/CA%2BhUKGJ06d3h5JeOtAv4h52n0vG1jOPZxqMCn5FySJQUVZA32w%40mail.gmail.com
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
src/include/port/atomics.h | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/src/include/port/atomics.h b/src/include/port/atomics.h
index 9550e04aaa5..3d1fdc64752 100644
--- a/src/include/port/atomics.h
+++ b/src/include/port/atomics.h
@@ -519,6 +519,17 @@ pg_atomic_sub_fetch_u64(volatile pg_atomic_uint64 *ptr, int64 sub_)
return pg_atomic_sub_fetch_u64_impl(ptr, sub_);
}
+/*
+ * On modern systems this is really just *counter++. On some older systems
+ * there might be more to it (due to an inability to read and write 64-bit
+ * values atomically).
+ */
+static inline void
+pg_atomic_unlocked_inc_counter(pg_atomic_uint64 *counter)
+{
+ pg_atomic_write_u64(counter, pg_atomic_read_u64(counter) + 1);
+}
+
#undef INSIDE_ATOMICS_H
#endif /* ATOMICS_H */
--
2.17.1
0002-Move-backend-pgstat-initialization-earlier.patch
From 5cb1fa7d4be390f51e19bf43d01526ea0a09b329 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 14 Dec 2021 12:26:56 -0500
Subject: [PATCH 2/8] Move backend pgstat initialization earlier
Initialize the pgstats subsystem earlier during process initialization
so that more process types have a backend activity state
(PgBackendStatus).
Conditionally initializing backend activity state in some types of
processes and not in others necessitates surprising special cases in the
code.
This particular commit was motivated by single user mode missing a
backend activity state.
This commit also adds a new BackendType for standalone backends,
B_STANDALONE_BACKEND (and alphabetizes the BackendTypes). Both the
bootstrap backend and single user mode backends will have BackendType
B_STANDALONE_BACKEND.
Author: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://www.postgresql.org/message-id/CAAKRu_aaq33UnG4TXq3S-OSXGWj1QGf0sU%2BECH4tNwGFNERkZA%40mail.gmail.com
---
src/backend/utils/init/miscinit.c | 23 ++++++++++++++---------
src/backend/utils/init/postinit.c | 7 +++----
src/include/miscadmin.h | 7 ++++---
3 files changed, 21 insertions(+), 16 deletions(-)
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index bdc77af7194..cf6eca4bb4e 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -176,6 +176,8 @@ InitStandaloneProcess(const char *argv0)
{
Assert(!IsPostmasterEnvironment);
+ MyBackendType = B_STANDALONE_BACKEND;
+
/*
* Start our win32 signal implementation
*/
@@ -255,6 +257,9 @@ GetBackendTypeDesc(BackendType backendType)
case B_INVALID:
backendDesc = "not initialized";
break;
+ case B_ARCHIVER:
+ backendDesc = "archiver";
+ break;
case B_AUTOVAC_LAUNCHER:
backendDesc = "autovacuum launcher";
break;
@@ -273,9 +278,18 @@ GetBackendTypeDesc(BackendType backendType)
case B_CHECKPOINTER:
backendDesc = "checkpointer";
break;
+ case B_LOGGER:
+ backendDesc = "logger";
+ break;
+ case B_STANDALONE_BACKEND:
+ backendDesc = "standalone backend";
+ break;
case B_STARTUP:
backendDesc = "startup";
break;
+ case B_STATS_COLLECTOR:
+ backendDesc = "stats collector";
+ break;
case B_WAL_RECEIVER:
backendDesc = "walreceiver";
break;
@@ -285,15 +299,6 @@ GetBackendTypeDesc(BackendType backendType)
case B_WAL_WRITER:
backendDesc = "walwriter";
break;
- case B_ARCHIVER:
- backendDesc = "archiver";
- break;
- case B_STATS_COLLECTOR:
- backendDesc = "stats collector";
- break;
- case B_LOGGER:
- backendDesc = "logger";
- break;
}
return backendDesc;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 6452b42dbff..31c83855be6 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -727,6 +727,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
RegisterTimeout(CLIENT_CONNECTION_CHECK_TIMEOUT, ClientCheckTimeoutHandler);
}
+ pgstat_beinit();
+
/*
* If this is either a bootstrap process or a standalone backend, start
* up the XLOG machinery, and register to have it closed down at exit.
@@ -742,6 +744,7 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
*/
CreateAuxProcessResourceOwner();
+ pgstat_bestart();
StartupXLOG();
/* Release (and warn about) any buffer pins leaked in StartupXLOG */
ReleaseAuxProcessResources(true);
@@ -769,7 +772,6 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
EnablePortalManager();
/* Initialize status reporting */
- pgstat_beinit();
/*
* Load relcache entries for the shared system catalogs. This must create
@@ -1007,10 +1009,7 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
* transaction we started before returning.
*/
if (!bootstrap)
- {
- pgstat_bestart();
CommitTransactionCommand();
- }
return;
}
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 0abc3ad5405..94c6135e930 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -322,19 +322,20 @@ extern void SwitchBackToLocalLatch(void);
typedef enum BackendType
{
B_INVALID = 0,
+ B_ARCHIVER,
B_AUTOVAC_LAUNCHER,
B_AUTOVAC_WORKER,
B_BACKEND,
B_BG_WORKER,
B_BG_WRITER,
B_CHECKPOINTER,
+ B_LOGGER,
+ B_STANDALONE_BACKEND,
B_STARTUP,
+ B_STATS_COLLECTOR,
B_WAL_RECEIVER,
B_WAL_SENDER,
B_WAL_WRITER,
- B_ARCHIVER,
- B_STATS_COLLECTOR,
- B_LOGGER,
} BackendType;
extern BackendType MyBackendType;
--
2.17.1
0003-Add-IO-operation-counters-to-PgBackendStatus.patch
From 35d7112fd8ff9ce756bbe9a52ad7dc6e92f4ed0b Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 24 Nov 2021 10:32:56 -0500
Subject: [PATCH 3/8] Add IO operation counters to PgBackendStatus
Add an array of counters to PgBackendStatus which count the buffers
allocated, extended, fsynced, and written by a given backend.
Each "IO Op" (alloc, fsync, extend, write) is counted per "IO Path"
(direct, local, shared, or strategy).
"local" and "shared" IO Path counters count operations on local and
shared buffers.
The "strategy" IO Path counts buffers alloc'd/written/read/fsync'd as
part of a BufferAccessStrategy.
The "direct" IO Path counts blocks of IO which are read, written, or
fsync'd using smgrwrite/extend/immedsync directly (as opposed to through
[Local]BufferAlloc()).
With this commit, backends increment a counter in the array in their
PgBackendStatus when performing an IO operation.
Future patches will persist the IO stat counters from a backend's
PgBackendStatus upon backend exit and use the counters to provide
observability of database IO operations.
Note that this commit does not add code to increment the "direct" path.
A future patch adding wrappers for smgrwrite(), smgrextend(), and
smgrimmedsync() would provide a good location to call pgstat_inc_ioop()
for unbuffered IO and avoid regressions for future users of these
functions.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
src/backend/postmaster/checkpointer.c | 1 +
src/backend/storage/buffer/bufmgr.c | 47 +++++++++++---
src/backend/storage/buffer/freelist.c | 22 ++++++-
src/backend/storage/buffer/localbuf.c | 3 +
src/backend/storage/sync/sync.c | 1 +
src/backend/utils/activity/backend_status.c | 10 +++
src/include/storage/buf_internals.h | 4 +-
src/include/utils/backend_status.h | 68 +++++++++++++++++++++
8 files changed, 141 insertions(+), 15 deletions(-)
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index a59c3cf0201..061775de4ed 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -1112,6 +1112,7 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
if (!AmBackgroundWriterProcess())
CheckpointerShmem->num_backend_fsync++;
LWLockRelease(CheckpointerCommLock);
+ pgstat_inc_ioop(IOOP_FSYNC, IOPATH_SHARED);
return false;
}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index d73a40c1bc6..a3136a589fa 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -482,7 +482,7 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
BlockNumber blockNum,
BufferAccessStrategy strategy,
bool *foundPtr);
-static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+static void FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOPath iopath);
static void FindAndDropRelFileNodeBuffers(RelFileNode rnode,
ForkNumber forkNum,
BlockNumber nForkBlock,
@@ -977,6 +977,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (isExtend)
{
+ pgstat_inc_ioop(IOOP_EXTEND, IOPATH_SHARED);
/* new buffers are zero-filled */
MemSet((char *) bufBlock, 0, BLCKSZ);
/* don't set checksum for all-zero page */
@@ -1177,6 +1178,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
/* Loop here in case we have to try another victim buffer */
for (;;)
{
+ bool from_ring;
+
/*
* Ensure, while the spinlock's not yet held, that there's a free
* refcount entry.
@@ -1187,7 +1190,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* Select a victim buffer. The buffer is returned with its header
* spinlock still held!
*/
- buf = StrategyGetBuffer(strategy, &buf_state);
+ buf = StrategyGetBuffer(strategy, &buf_state, &from_ring);
Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
@@ -1224,6 +1227,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf),
LW_SHARED))
{
+ IOPath iopath;
+
/*
* If using a nondefault strategy, and writing the buffer
* would require a WAL flush, let the strategy decide whether
@@ -1241,7 +1246,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
UnlockBufHdr(buf, buf_state);
if (XLogNeedsFlush(lsn) &&
- StrategyRejectBuffer(strategy, buf))
+ StrategyRejectBuffer(strategy, buf, &from_ring))
{
/* Drop lock/pin and loop around for another buffer */
LWLockRelease(BufferDescriptorGetContentLock(buf));
@@ -1250,13 +1255,27 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
}
}
+ /*
+ * When a strategy is in use, if the dirty buffer was selected
+ * from the strategy ring and we did not bother checking the
+ * freelist or doing a clock sweep to look for a clean shared
+ * buffer to use, the write will be counted as a strategy
+ * write. However, if the dirty buffer was obtained from the
+ * freelist or a clock sweep, it is counted as a regular write.
+ *
+ * When a strategy is not in use, at this point the write can
+ * only be a "regular" write of a dirty buffer.
+ */
+
+ iopath = from_ring ? IOPATH_STRATEGY : IOPATH_SHARED;
+
/* OK, do the I/O */
TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(forkNum, blockNum,
smgr->smgr_rnode.node.spcNode,
smgr->smgr_rnode.node.dbNode,
smgr->smgr_rnode.node.relNode);
- FlushBuffer(buf, NULL);
+ FlushBuffer(buf, NULL, iopath);
LWLockRelease(BufferDescriptorGetContentLock(buf));
ScheduleBufferTagForWriteback(&BackendWritebackContext,
@@ -2557,10 +2576,11 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
* Pin it, share-lock it, write it. (FlushBuffer will do nothing if the
* buffer is clean by the time we've locked it.)
*/
+
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOPATH_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
@@ -2808,9 +2828,12 @@ BufferGetTag(Buffer buffer, RelFileNode *rnode, ForkNumber *forknum,
*
* If the caller has an smgr reference for the buffer's relation, pass it
* as the second parameter. If not, pass NULL.
+ *
+ * IOPath will always be IOPATH_SHARED except when a buffer access strategy is
+ * used and the buffer being flushed is a buffer from the strategy ring.
*/
static void
-FlushBuffer(BufferDesc *buf, SMgrRelation reln)
+FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOPath iopath)
{
XLogRecPtr recptr;
ErrorContextCallback errcallback;
@@ -2902,6 +2925,8 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
bufToWrite,
false);
+ pgstat_inc_ioop(IOOP_WRITE, iopath);
+
if (track_io_timing)
{
INSTR_TIME_SET_CURRENT(io_time);
@@ -3538,6 +3563,8 @@ FlushRelationBuffers(Relation rel)
localpage,
false);
+ pgstat_inc_ioop(IOOP_WRITE, IOPATH_LOCAL);
+
buf_state &= ~(BM_DIRTY | BM_JUST_DIRTIED);
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
@@ -3573,7 +3600,7 @@ FlushRelationBuffers(Relation rel)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, RelationGetSmgr(rel));
+ FlushBuffer(bufHdr, RelationGetSmgr(rel), IOPATH_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
@@ -3669,7 +3696,7 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, srelent->srel);
+ FlushBuffer(bufHdr, srelent->srel, IOPATH_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
@@ -3877,7 +3904,7 @@ FlushDatabaseBuffers(Oid dbid)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOPATH_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
@@ -3904,7 +3931,7 @@ FlushOneBuffer(Buffer buffer)
Assert(LWLockHeldByMe(BufferDescriptorGetContentLock(bufHdr)));
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOPATH_SHARED);
}
/*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 3b98e68d50f..15a9de4c9d0 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -19,6 +19,7 @@
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/proc.h"
+#include "utils/backend_status.h"
#define INT_ACCESS_ONCE(var) ((int)(*((volatile int *)&(var))))
@@ -198,7 +199,7 @@ have_free_buffer(void)
* return the buffer with the buffer header spinlock still held.
*/
BufferDesc *
-StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
+StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_ring)
{
BufferDesc *buf;
int bgwprocno;
@@ -212,7 +213,8 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
if (strategy != NULL)
{
buf = GetBufferFromRing(strategy, buf_state);
- if (buf != NULL)
+ *from_ring = buf != NULL;
+ if (*from_ring)
return buf;
}
@@ -247,6 +249,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
* the rate of buffer consumption. Note that buffers recycled by a
* strategy object are intentionally not counted here.
*/
+ pgstat_inc_ioop(IOOP_ALLOC, IOPATH_SHARED);
pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);
/*
@@ -683,8 +686,14 @@ AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
* if this buffer should be written and re-used.
*/
bool
-StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
+StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool *from_ring)
{
+ /*
+ * Start by assuming that we will use the dirty buffer selected by
+ * StrategyGetBuffer().
+ */
+ *from_ring = true;
+
/* We only do this in bulkread mode */
if (strategy->btype != BAS_BULKREAD)
return false;
@@ -700,5 +709,12 @@ StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
*/
strategy->buffers[strategy->current] = InvalidBuffer;
+ /*
+ * Since we will not be writing out a dirty buffer from the ring, set
+ * from_ring to false so that the caller does not count this write as a
+ * "strategy write" and can do proper bookkeeping.
+ */
+ *from_ring = false;
+
return true;
}
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index e71f95ac1ff..864b4db20e7 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -20,6 +20,7 @@
#include "executor/instrument.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
+#include "utils/backend_status.h"
#include "utils/guc.h"
#include "utils/memutils.h"
#include "utils/resowner_private.h"
@@ -226,6 +227,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
localpage,
false);
+ pgstat_inc_ioop(IOOP_WRITE, IOPATH_LOCAL);
+
/* Mark not-dirty now in case we error out below */
buf_state &= ~BM_DIRTY;
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index c695d816fc6..cad94065211 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -433,6 +433,7 @@ ProcessSyncRequests(void)
total_elapsed += elapsed;
processed++;
+ pgstat_inc_ioop(IOOP_FSYNC, IOPATH_SHARED);
if (log_checkpoints)
elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f ms",
processed,
diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index 079321599d6..79410e0b2c2 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -405,6 +405,16 @@ pgstat_bestart(void)
lbeentry.st_progress_command_target = InvalidOid;
lbeentry.st_query_id = UINT64CONST(0);
+ for (int i = 0; i < IOPATH_NUM_TYPES; i++)
+ {
+ IOOpCounters *io_ops = &lbeentry.io_path_stats[i];
+
+ pg_atomic_init_u64(&io_ops->allocs, 0);
+ pg_atomic_init_u64(&io_ops->extends, 0);
+ pg_atomic_init_u64(&io_ops->fsyncs, 0);
+ pg_atomic_init_u64(&io_ops->writes, 0);
+ }
+
/*
* we don't zero st_progress_param here to save cycles; nobody should
* examine it until st_progress_command has been set to something other
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index b903d2bcaf0..64f4fc3b961 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -310,10 +310,10 @@ extern void ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *
/* freelist.c */
extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
- uint32 *buf_state);
+ uint32 *buf_state, bool *from_ring);
extern void StrategyFreeBuffer(BufferDesc *buf);
extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
- BufferDesc *buf);
+ BufferDesc *buf, bool *from_ring);
extern int StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
extern void StrategyNotifyBgWriter(int bgwprocno);
diff --git a/src/include/utils/backend_status.h b/src/include/utils/backend_status.h
index 8217d0cb6b7..950b7396a59 100644
--- a/src/include/utils/backend_status.h
+++ b/src/include/utils/backend_status.h
@@ -13,6 +13,7 @@
#include "datatype/timestamp.h"
#include "libpq/pqcomm.h"
#include "miscadmin.h" /* for BackendType */
+#include "port/atomics.h"
#include "utils/backend_progress.h"
@@ -31,12 +32,47 @@ typedef enum BackendState
STATE_DISABLED
} BackendState;
+/* ----------
+ * IO Stats reporting utility types
+ * ----------
+ */
+
+typedef enum IOOp
+{
+ IOOP_ALLOC,
+ IOOP_EXTEND,
+ IOOP_FSYNC,
+ IOOP_WRITE,
+} IOOp;
+
+#define IOOP_NUM_TYPES (IOOP_WRITE + 1)
+
+typedef enum IOPath
+{
+ IOPATH_DIRECT,
+ IOPATH_LOCAL,
+ IOPATH_SHARED,
+ IOPATH_STRATEGY,
+} IOPath;
+
+#define IOPATH_NUM_TYPES (IOPATH_STRATEGY + 1)
/* ----------
* Shared-memory data structures
* ----------
*/
+/*
+ * Structure for counting all types of IOOp for a live backend.
+ */
+typedef struct IOOpCounters
+{
+ pg_atomic_uint64 allocs;
+ pg_atomic_uint64 extends;
+ pg_atomic_uint64 fsyncs;
+ pg_atomic_uint64 writes;
+} IOOpCounters;
+
/*
* PgBackendSSLStatus
*
@@ -168,6 +204,12 @@ typedef struct PgBackendStatus
/* query identifier, optionally computed using post_parse_analyze_hook */
uint64 st_query_id;
+
+ /*
+ * Stats on all IOOps for all IOPaths for this backend. These should be
+ * incremented whenever an IO Operation is performed.
+ */
+ IOOpCounters io_path_stats[IOPATH_NUM_TYPES];
} PgBackendStatus;
@@ -296,6 +338,32 @@ extern void pgstat_bestart(void);
extern void pgstat_clear_backend_activity_snapshot(void);
/* Activity reporting functions */
+
+static inline void
+pgstat_inc_ioop(IOOp io_op, IOPath io_path)
+{
+ IOOpCounters *io_ops;
+ PgBackendStatus *beentry = MyBEEntry;
+
+ Assert(beentry);
+
+ io_ops = &beentry->io_path_stats[io_path];
+ switch (io_op)
+ {
+ case IOOP_ALLOC:
+ pg_atomic_unlocked_inc_counter(&io_ops->allocs);
+ break;
+ case IOOP_EXTEND:
+ pg_atomic_unlocked_inc_counter(&io_ops->extends);
+ break;
+ case IOOP_FSYNC:
+ pg_atomic_unlocked_inc_counter(&io_ops->fsyncs);
+ break;
+ case IOOP_WRITE:
+ pg_atomic_unlocked_inc_counter(&io_ops->writes);
+ break;
+ }
+}
extern void pgstat_report_activity(BackendState state, const char *cmd_str);
extern void pgstat_report_query_id(uint64 query_id, bool force);
extern void pgstat_report_tempfile(size_t filesize);
--
2.17.1
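The counter layout this patch introduces — one counter per (IOPath, IOOp) pair, held per backend — can be sketched as a plain two-dimensional array. This is an illustration only, with assumptions: the enums mirror the patch's `IOOp`/`IOPath`, but plain `uint64_t` replaces `pg_atomic_uint64` (so this sketch ignores the torn-read issue the real `IOOpCounters` struct exists to solve), and `inc_ioop` collapses the patch's per-field `switch` in `pgstat_inc_ioop()` into an index.

```c
#include <stdint.h>

/* Hypothetical mirror of the patch's enums. */
typedef enum { IOOP_ALLOC, IOOP_EXTEND, IOOP_FSYNC, IOOP_WRITE } IOOp;
#define IOOP_NUM_TYPES (IOOP_WRITE + 1)

typedef enum { IOPATH_DIRECT, IOPATH_LOCAL, IOPATH_SHARED, IOPATH_STRATEGY } IOPath;
#define IOPATH_NUM_TYPES (IOPATH_STRATEGY + 1)

/* Per-backend counters: one slot per (path, op) combination. */
static uint64_t io_path_stats[IOPATH_NUM_TYPES][IOOP_NUM_TYPES];

/* Equivalent of pgstat_inc_ioop(), with the switch replaced by indexing. */
static void
inc_ioop(IOOp op, IOPath path)
{
    io_path_stats[path][op]++;
}
```

Laying the counters out as an array like this is also what makes the later "sum all paths for one backend type" step (`pgstat_sum_io_path_ops()` in patch 4) a simple nested loop.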
0004-Send-IO-operations-to-stats-collector.patch
From 380903150cdeb75a7703d6918df7c77f3211c3df Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 24 Nov 2021 11:16:55 -0500
Subject: [PATCH 4/8] Send IO operations to stats collector
On exit, backends send the IO operations they have done on all IO Paths
to the stats collector. The stats collector adds these counts to its
existing counts stored in a global data structure it maintains and
persists.
PgStatIOOpCounters contains the same information as backend_status.h's
IOOpCounters, however IOOpCounters' members must be atomics and the
stats collector has no such requirement.
Suggested by Andres Freund
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
src/backend/postmaster/pgstat.c | 98 +++++++++++++++++++++++++++++-
src/include/miscadmin.h | 2 +
src/include/pgstat.h | 60 ++++++++++++++++++
src/include/utils/backend_status.h | 37 +++++++++++
4 files changed, 196 insertions(+), 1 deletion(-)
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index ef1cba61a6f..f4c0fd3e8dc 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -146,6 +146,7 @@ static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
static void pgstat_recv_checkpointer(PgStat_MsgCheckpointer *msg, int len);
+static void pgstat_recv_io_path_ops(PgStat_MsgIOPathOps *msg, int len);
static void pgstat_recv_wal(PgStat_MsgWal *msg, int len);
static void pgstat_recv_slru(PgStat_MsgSLRU *msg, int len);
static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
@@ -178,7 +179,6 @@ char *pgstat_stat_directory = NULL;
char *pgstat_stat_filename = NULL;
char *pgstat_stat_tmpname = NULL;
-
/* ----------
* state shared with pgstat_*.c
* ----------
@@ -704,6 +704,14 @@ pgstat_shutdown_hook(int code, Datum arg)
{
Assert(!pgstat_is_shutdown);
+ /*
+ * Only need to send stats on IOOps for IOPaths when a process exits. Users
+ * requiring IOOps for both live and exited backends can read from live
+ * backends' PgBackendStatuses and sum this with totals from exited
+ * backends persisted by the stats collector.
+ */
+ pgstat_send_buffers();
+
/*
* If we got as far as discovering our own database ID, we can report what
* we did to the collector. Otherwise, we'd be sending an invalid
@@ -1559,6 +1567,45 @@ pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype)
hdr->m_type = mtype;
}
+/*
+ * Before exiting, a backend sends its IO operations statistics to the
+ * collector so that they may be persisted.
+ */
+void
+pgstat_send_buffers(void)
+{
+ PgStatIOOpCounters *io_path_ops;
+ PgStat_MsgIOPathOps msg;
+
+ PgBackendStatus *beentry = MyBEEntry;
+ PgStat_Counter sum = 0;
+
+ if (!beentry || beentry->st_backendType == B_INVALID)
+ return;
+
+ memset(&msg, 0, sizeof(msg));
+ msg.backend_type = beentry->st_backendType;
+
+ io_path_ops = msg.iop.io_path_ops;
+ pgstat_sum_io_path_ops(io_path_ops, (IOOpCounters *)
+ &beentry->io_path_stats);
+
+ /* If no IO was done, don't bother sending anything to the stats collector. */
+ for (int i = 0; i < IOPATH_NUM_TYPES; i++)
+ {
+ sum += io_path_ops[i].allocs;
+ sum += io_path_ops[i].extends;
+ sum += io_path_ops[i].fsyncs;
+ sum += io_path_ops[i].writes;
+ }
+
+ if (sum == 0)
+ return;
+
+ pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_IO_PATH_OPS);
+ pgstat_send(&msg, sizeof(msg));
+}
+
/*
* Send out one statistics message to the collector
@@ -1588,6 +1635,29 @@ pgstat_send(void *msg, int len)
#endif
}
+/*
+ * Helper function to sum all IO operations stats for all IOPaths (e.g. shared,
+ * local) from live backends with those in the equivalent stats structure for
+ * exited backends.
+ * Note that this adds and doesn't set, so the destination stats structure
+ * should be zeroed out by the caller initially.
+ * This would commonly be used to transfer all IOOp stats for all IOPaths for a
+ * particular backend type to the pgstats structure.
+ */
+void
+pgstat_sum_io_path_ops(PgStatIOOpCounters *dest, IOOpCounters *src)
+{
+ for (int i = 0; i < IOPATH_NUM_TYPES; i++)
+ {
+ dest->allocs += pg_atomic_read_u64(&src->allocs);
+ dest->extends += pg_atomic_read_u64(&src->extends);
+ dest->fsyncs += pg_atomic_read_u64(&src->fsyncs);
+ dest->writes += pg_atomic_read_u64(&src->writes);
+ dest++;
+ src++;
+ }
+}
+
/*
* Start up the statistics collector process. This is the body of the
* postmaster child process.
@@ -1798,6 +1868,10 @@ PgstatCollectorMain(int argc, char *argv[])
pgstat_recv_checkpointer(&msg.msg_checkpointer, len);
break;
+ case PGSTAT_MTYPE_IO_PATH_OPS:
+ pgstat_recv_io_path_ops(&msg.msg_io_path_ops, len);
+ break;
+
case PGSTAT_MTYPE_WAL:
pgstat_recv_wal(&msg.msg_wal, len);
break;
@@ -3961,6 +4035,28 @@ pgstat_recv_checkpointer(PgStat_MsgCheckpointer *msg, int len)
globalStats.checkpointer.buf_fsync_backend += msg->m_buf_fsync_backend;
}
+static void
+pgstat_recv_io_path_ops(PgStat_MsgIOPathOps *msg, int len)
+{
+ PgStatIOOpCounters *src_io_path_ops;
+ PgStatIOOpCounters *dest_io_path_ops;
+
+ src_io_path_ops = msg->iop.io_path_ops;
+ dest_io_path_ops =
+ globalStats.buffers.ops[backend_type_get_idx(msg->backend_type)].io_path_ops;
+
+ for (int i = 0; i < IOPATH_NUM_TYPES; i++)
+ {
+ PgStatIOOpCounters *src = &src_io_path_ops[i];
+ PgStatIOOpCounters *dest = &dest_io_path_ops[i];
+
+ dest->allocs += src->allocs;
+ dest->extends += src->extends;
+ dest->fsyncs += src->fsyncs;
+ dest->writes += src->writes;
+ }
+}
+
/*
* Process a WAL message.
*/
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 94c6135e930..77c89134c21 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -338,6 +338,8 @@ typedef enum BackendType
B_WAL_WRITER,
} BackendType;
+#define BACKEND_NUM_TYPES B_WAL_WRITER
+
extern BackendType MyBackendType;
extern const char *GetBackendTypeDesc(BackendType backendType);
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 3584078f6ea..cdb2ce60c46 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -239,6 +239,7 @@ typedef enum StatMsgType
PGSTAT_MTYPE_ARCHIVER,
PGSTAT_MTYPE_BGWRITER,
PGSTAT_MTYPE_CHECKPOINTER,
+ PGSTAT_MTYPE_IO_PATH_OPS,
PGSTAT_MTYPE_WAL,
PGSTAT_MTYPE_SLRU,
PGSTAT_MTYPE_FUNCSTAT,
@@ -372,6 +373,49 @@ typedef struct PgStat_MsgDropdb
Oid m_databaseid;
} PgStat_MsgDropdb;
+
+/*
+ * Structure for counting all types of IOOps in the stats collector
+ */
+typedef struct PgStatIOOpCounters
+{
+ PgStat_Counter allocs;
+ PgStat_Counter extends;
+ PgStat_Counter fsyncs;
+ PgStat_Counter writes;
+} PgStatIOOpCounters;
+
+/*
+ * Structure for counting all IOOps on all types of IOPaths.
+ */
+typedef struct PgStatIOPathOps
+{
+ PgStatIOOpCounters io_path_ops[IOPATH_NUM_TYPES];
+} PgStatIOPathOps;
+
+/*
+ * Sent by a backend to the stats collector to report all IOOps for all IOPaths
+ * for a given type of a backend. This will happen when the backend exits.
+ */
+typedef struct PgStat_MsgIOPathOps
+{
+ PgStat_MsgHdr m_hdr;
+
+ BackendType backend_type;
+ PgStatIOPathOps iop;
+} PgStat_MsgIOPathOps;
+
+/*
+ * Structure used by stats collector to keep track of all types of exited
+ * backends' IOOps for all IOPaths as well as all stats from live backends at
+ * the time of stats reset. resets is populated using a reset message sent to
+ * the stats collector.
+ */
+typedef struct PgStat_BackendIOPathOps
+{
+ PgStatIOPathOps ops[BACKEND_NUM_TYPES];
+} PgStat_BackendIOPathOps;
+
/* ----------
* PgStat_MsgResetcounter Sent by the backend to tell the collector
* to reset counters
@@ -750,6 +794,7 @@ typedef union PgStat_Msg
PgStat_MsgArchiver msg_archiver;
PgStat_MsgBgWriter msg_bgwriter;
PgStat_MsgCheckpointer msg_checkpointer;
+ PgStat_MsgIOPathOps msg_io_path_ops;
PgStat_MsgWal msg_wal;
PgStat_MsgSLRU msg_slru;
PgStat_MsgFuncstat msg_funcstat;
@@ -869,6 +914,7 @@ typedef struct PgStat_GlobalStats
PgStat_CheckpointerStats checkpointer;
PgStat_BgWriterStats bgwriter;
+ PgStat_BackendIOPathOps buffers;
} PgStat_GlobalStats;
typedef struct PgStat_StatReplSlotEntry
@@ -1120,6 +1166,20 @@ extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
void *recdata, uint32 len);
extern PgStat_TableStatus *find_tabstat_entry(Oid rel_id);
+extern void pgstat_send_archiver(const char *xlog, bool failed);
+extern void pgstat_send_bgwriter(void);
+
+/*
+ * While some processes send some types of statistics to the collector at
+ * regular intervals (e.g. CheckpointerMain() calling
+ * pgstat_send_checkpointer()), IO operations stats are only sent by
+ * pgstat_send_buffers() when a process exits (in pgstat_shutdown_hook()). IO
+ * operations stats from live backends can be read from their PgBackendStatuses
+ * and, if desired, summed with totals from exited backends persisted by the
+ * stats collector.
+ */
+extern void pgstat_send_buffers(void);
+extern void pgstat_sum_io_path_ops(PgStatIOOpCounters *dest, IOOpCounters *src);
/*
diff --git a/src/include/utils/backend_status.h b/src/include/utils/backend_status.h
index 950b7396a59..3de1e7c8d37 100644
--- a/src/include/utils/backend_status.h
+++ b/src/include/utils/backend_status.h
@@ -331,6 +331,43 @@ extern void CreateSharedBackendStatus(void);
* ----------
*/
+/* Utility functions */
+
+/*
+ * When maintaining an array of information about all valid BackendTypes, in
+ * order to avoid wasting the 0th spot, use this helper to convert a valid
+ * BackendType to a valid location in the array (given that no spot is
+ * maintained for B_INVALID BackendType).
+ */
+static inline int backend_type_get_idx(BackendType backend_type)
+{
+ /*
+ * backend_type must be one of the valid backend types. If caller is
+ * maintaining backend information in an array that includes B_INVALID,
+ * this function is unnecessary.
+ */
+ Assert(backend_type > B_INVALID && backend_type <= BACKEND_NUM_TYPES);
+ return backend_type - 1;
+}
+
+/*
+ * When using a value from an array of information about all valid
+ * BackendTypes, add 1 to the index before using it as a BackendType to adjust
+ * for not maintaining a spot for B_INVALID BackendType.
+ */
+static inline BackendType idx_get_backend_type(int idx)
+{
+ int backend_type = idx + 1;
+ /*
+ * If the array includes a spot for B_INVALID BackendType this function is
+ * not required.
+ */
+ Assert(backend_type > B_INVALID && backend_type <= BACKEND_NUM_TYPES);
+ return backend_type;
+}
+
+extern const char *GetIOPathDesc(IOPath io_path);
+
/* Initialization functions */
extern void pgstat_beinit(void);
extern void pgstat_bestart(void);
--
2.17.1
Attachment: 0005-Add-buffers-to-pgstat_reset_shared_counters.patch (text/x-diff)
From bbd2271de28702fa83b47896ed2762a38701b416 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 3 Jan 2022 18:26:45 -0500
Subject: [PATCH 5/8] Add "buffers" to pgstat_reset_shared_counters
Backends count IO operations for various IO paths in their
PgBackendStatus. Upon exit, they send these counts to the stats
collector.
Prior to this commit, the IO operations stats from exited backends
persisted by the stats collector would have been reset when
pgstat_reset_shared_counters() was invoked with target "bgwriter".
However, the IO operations stats in each live backend's PgBackendStatus
would remain the same. Thus the totals calculated from both live and
exited backends would be incorrect after a reset.
Backends' PgBackendStatuses cannot be written to by another backend;
therefore, in order to calculate correct totals after a reset has
occurred, the backend sending the reset message to the stats collector
now reads the IO operation stats totals from live backends and sends
them to the stats collector to be persisted in an array of "resets"
which can be used to calculate the correct totals after a reset.
Because the IO operations statistics are broader in scope than those in
pg_stat_bgwriter, rename the reset target to "buffers". The "buffers"
target will reset all IO operations statistics and all statistics for
the pg_stat_bgwriter view maintained by the stats collector.
Author: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
doc/src/sgml/monitoring.sgml | 2 +-
src/backend/postmaster/pgstat.c | 90 +++++++++++++++++----
src/backend/utils/activity/backend_status.c | 27 +++++++
src/include/pgstat.h | 34 ++++++--
src/include/utils/backend_status.h | 2 +
5 files changed, 134 insertions(+), 21 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 3b9172f65bd..dc07cc15f53 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -5186,7 +5186,7 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
</para>
<para>
Resets some cluster-wide statistics counters to zero, depending on the
- argument. The argument can be <literal>bgwriter</literal> to reset
+ argument. The argument can be <literal>buffers</literal> to reset
all the counters shown in
the <structname>pg_stat_bgwriter</structname>
view, <literal>archiver</literal> to reset all the counters shown in
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index f4c0fd3e8dc..57a7f0fa7e9 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1267,6 +1267,36 @@ pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
pgstat_send(&msg, sizeof(msg));
}
+/*
+ * Helper function to collect and send live backends' current IO operations
+ * stats counters when a stats reset is initiated so that they may be deducted
+ * from future totals.
+ */
+static void
+pgstat_send_buffers_reset(PgStat_MsgResetsharedcounter *msg)
+{
+ PgStatIOPathOps ops[BACKEND_NUM_TYPES];
+
+ memset(ops, 0, sizeof(ops));
+ pgstat_report_live_backend_io_path_ops(ops);
+
+ /*
+ * Iterate through the array of all IOOps for all IOPaths for each
+ * BackendType.
+ *
+ * An individual message is sent for each backend type because sending all
+ * IO operations in one message would exceed the PGSTAT_MAX_MSG_SIZE of
+ * 1000.
+ */
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ msg->m_backend_resets.backend_type = idx_get_backend_type(i);
+ memcpy(&msg->m_backend_resets.iop, &ops[i],
+ sizeof(msg->m_backend_resets.iop));
+ pgstat_send(msg, sizeof(PgStat_MsgResetsharedcounter));
+ }
+}
+
/*
* Tell the statistics collector to reset cluster-wide shared counters.
*
@@ -1281,19 +1311,25 @@ pgstat_reset_shared_counters(const char *target)
if (pgStatSock == PGINVALID_SOCKET)
return;
+ pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
+ if (strcmp(target, "buffers") == 0)
+ {
+ msg.m_resettarget = RESET_BUFFERS;
+ pgstat_send_buffers_reset(&msg);
+ return;
+ }
+
if (strcmp(target, "archiver") == 0)
msg.m_resettarget = RESET_ARCHIVER;
- else if (strcmp(target, "bgwriter") == 0)
- msg.m_resettarget = RESET_BGWRITER;
else if (strcmp(target, "wal") == 0)
msg.m_resettarget = RESET_WAL;
else
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("unrecognized reset target: \"%s\"", target),
- errhint("Target must be \"archiver\", \"bgwriter\", or \"wal\".")));
+ errhint(
+ "Target must be \"archiver\", \"buffers\", or \"wal\".")));
- pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
pgstat_send(&msg, sizeof(msg));
}
@@ -2606,6 +2642,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
*/
ts = GetCurrentTimestamp();
globalStats.bgwriter.stat_reset_timestamp = ts;
+ globalStats.buffers.stat_reset_timestamp = ts;
archiverStats.stat_reset_timestamp = ts;
walStats.stat_reset_timestamp = ts;
@@ -3726,21 +3763,46 @@ pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len)
static void
pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
{
- if (msg->m_resettarget == RESET_BGWRITER)
- {
- /*
- * Reset the global, bgwriter and checkpointer statistics for the
- * cluster.
- */
- memset(&globalStats, 0, sizeof(globalStats));
- globalStats.bgwriter.stat_reset_timestamp = GetCurrentTimestamp();
- }
- else if (msg->m_resettarget == RESET_ARCHIVER)
+ if (msg->m_resettarget == RESET_ARCHIVER)
{
/* Reset the archiver statistics for the cluster. */
memset(&archiverStats, 0, sizeof(archiverStats));
archiverStats.stat_reset_timestamp = GetCurrentTimestamp();
}
+ else if (msg->m_resettarget == RESET_BUFFERS)
+ {
+ /*
+ * Reset global stats for bgwriter, buffers, and checkpointer.
+ *
+ * Because the stats collector cannot write to live backends'
+ * PgBackendStatuses, it maintains an array of "resets". The reset
+ * message contains the current values of these counters for live
+ * backends. The stats collector saves these in its "resets" array,
+ * then zeroes out the exited backends' saved IO operations counters.
+ * This is required to calculate an accurate total for each IO
+ * operations counter post reset.
+ */
+ BackendType backend_type = msg->m_backend_resets.backend_type;
+
+ /*
+ * We reset each member individually (as opposed to resetting the
+ * entire globalStats struct) because we need to preserve the resets
+ * array (globalStats.buffers.resets).
+ *
+ * Though globalStats.buffers.ops, globalStats.bgwriter, and
+ * globalStats.checkpointer only need to be reset once, doing so for
+ * every message is less brittle and the extra cost is irrelevant given
+ * how often stats are reset.
+ */
+ memset(&globalStats.bgwriter, 0, sizeof(globalStats.bgwriter));
+ memset(&globalStats.checkpointer, 0, sizeof(globalStats.checkpointer));
+ memset(&globalStats.buffers.ops, 0, sizeof(globalStats.buffers.ops));
+ globalStats.bgwriter.stat_reset_timestamp = GetCurrentTimestamp();
+ globalStats.buffers.stat_reset_timestamp = GetCurrentTimestamp();
+ memcpy(&globalStats.buffers.resets[backend_type_get_idx(backend_type)],
+ &msg->m_backend_resets.iop.io_path_ops,
+ sizeof(msg->m_backend_resets.iop.io_path_ops));
+ }
else if (msg->m_resettarget == RESET_WAL)
{
/* Reset the WAL statistics for the cluster. */
diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index 79410e0b2c2..87b9d0fc0d8 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -636,6 +636,33 @@ pgstat_report_activity(BackendState state, const char *cmd_str)
PGSTAT_END_WRITE_ACTIVITY(beentry);
}
+/*
+ * Iterate through BackendStatusArray and capture live backends' stats on IOOps
+ * for all IOPaths, adding them to that backend type's member of the
+ * backend_io_path_ops structure.
+ */
+void
+pgstat_report_live_backend_io_path_ops(PgStatIOPathOps *backend_io_path_ops)
+{
+ PgBackendStatus *beentry = BackendStatusArray;
+
+ /*
+ * Loop through live backends and capture reset values
+ */
+ for (int i = 0; i < GetMaxBackends() + NUM_AUXPROCTYPES; i++, beentry++)
+ {
+ int idx;
+
+ /* Don't count dead backends or those with type B_INVALID. */
+ if (beentry->st_procpid == 0 || beentry->st_backendType == B_INVALID)
+ continue;
+
+ idx = backend_type_get_idx(beentry->st_backendType);
+ pgstat_sum_io_path_ops(backend_io_path_ops[idx].io_path_ops,
+ (IOOpCounters *) beentry->io_path_stats);
+ }
+}
+
/* --------
* pgstat_report_query_id() -
*
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index cdb2ce60c46..9e93580f680 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -62,7 +62,7 @@ typedef int64 PgStat_Counter;
typedef enum PgStat_Shared_Reset_Target
{
RESET_ARCHIVER,
- RESET_BGWRITER,
+ RESET_BUFFERS,
RESET_WAL
} PgStat_Shared_Reset_Target;
@@ -164,6 +164,11 @@ typedef struct PgStat_TableCounts
PgStat_Counter t_blocks_hit;
} PgStat_TableCounts;
+/* ------------------------------------------------------------
+ * Structures kept in backend local memory while accumulating counts
+ * ------------------------------------------------------------
+ */
+
/* ----------
* PgStat_TableStatus Per-table status within a backend
*
@@ -395,7 +400,8 @@ typedef struct PgStatIOPathOps
/*
* Sent by a backend to the stats collector to report all IOOps for all IOPaths
- * for a given type of a backend. This will happen when the backend exits.
+ * for a given type of a backend. This will happen when the backend exits or
+ * when stats are reset.
*/
typedef struct PgStat_MsgIOPathOps
{
@@ -413,9 +419,12 @@ typedef struct PgStat_MsgIOPathOps
*/
typedef struct PgStat_BackendIOPathOps
{
+ TimestampTz stat_reset_timestamp;
PgStatIOPathOps ops[BACKEND_NUM_TYPES];
+ PgStatIOPathOps resets[BACKEND_NUM_TYPES];
} PgStat_BackendIOPathOps;
+
/* ----------
* PgStat_MsgResetcounter Sent by the backend to tell the collector
* to reset counters
@@ -427,15 +436,28 @@ typedef struct PgStat_MsgResetcounter
Oid m_databaseid;
} PgStat_MsgResetcounter;
-/* ----------
- * PgStat_MsgResetsharedcounter Sent by the backend to tell the collector
- * to reset a shared counter
- * ----------
+/*
+ * Sent by the backend to tell the collector to reset a shared counter.
+ *
+ * In addition to the message header and reset target, the message also
+ * contains an array with all of the IO operations for all IO paths done by a
+ * particular backend type.
+ *
+ * This is needed because the IO operation stats for live backends cannot be
+ * safely modified by other processes. Therefore, to correctly calculate the
+ * total IO operations for a particular backend type after a reset, the balance
+ * of IO operations for live backends at the time of prior resets must be
+ * subtracted from the total IO operations.
+ *
+ * To satisfy this requirement, the process initiating the reset will read the
+ * IO operations counters from live backends and send them to the stats
+ * collector which maintains an array of reset values.
*/
typedef struct PgStat_MsgResetsharedcounter
{
PgStat_MsgHdr m_hdr;
PgStat_Shared_Reset_Target m_resettarget;
+ PgStat_MsgIOPathOps m_backend_resets;
} PgStat_MsgResetsharedcounter;
/* ----------
diff --git a/src/include/utils/backend_status.h b/src/include/utils/backend_status.h
index 3de1e7c8d37..7e59c063b94 100644
--- a/src/include/utils/backend_status.h
+++ b/src/include/utils/backend_status.h
@@ -375,6 +375,7 @@ extern void pgstat_bestart(void);
extern void pgstat_clear_backend_activity_snapshot(void);
/* Activity reporting functions */
+typedef struct PgStatIOPathOps PgStatIOPathOps;
static inline void
pgstat_inc_ioop(IOOp io_op, IOPath io_path)
@@ -402,6 +403,7 @@ pgstat_inc_ioop(IOOp io_op, IOPath io_path)
}
}
extern void pgstat_report_activity(BackendState state, const char *cmd_str);
+extern void pgstat_report_live_backend_io_path_ops(PgStatIOPathOps *backend_io_path_ops);
extern void pgstat_report_query_id(uint64 query_id, bool force);
extern void pgstat_report_tempfile(size_t filesize);
extern void pgstat_report_appname(const char *appname);
--
2.17.1
Attachment: 0006-Add-system-view-tracking-IO-ops-per-backend-type.patch (text/x-diff)
From 47a36e4a17817e56c6bebd0e2e53c338b35d267b Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 24 Nov 2021 12:07:37 -0500
Subject: [PATCH 6/8] Add system view tracking IO ops per backend type
Add pg_stat_buffers, a system view which tracks the number of IO
operations (allocs, writes, fsyncs, and extends) done through each IO
path (e.g. shared buffers, local buffers, unbuffered IO) by each type of
backend.
Some of these should always be zero. For example, checkpointer does not
use a BufferAccessStrategy (currently), so the "strategy" IO Path for
checkpointer will be 0 for all IO operations (alloc, write, fsync, and
extend). All possible combinations of IOPath and IOOp are enumerated in
the view but not all are populated or even possible at this point.
All backends increment a counter in an array of IO stat counters in
their PgBackendStatus when performing an IO operation. On exit, backends
send these stats to the stats collector to be persisted.
When the pg_stat_buffers view is queried, one backend will sum live
backends' stats with saved stats from exited backends and subtract saved
reset stats, returning the total.
Each row of the view is stats for a particular backend type for a
particular IO Path (e.g. shared buffer accesses by checkpointer) and
each column in the view is the total number of IO operations done (e.g.
writes).
So a cell in the view would be, for example, the number of shared
buffers written by checkpointer since the last stats reset.
Suggested by Andres Freund
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
doc/src/sgml/monitoring.sgml | 119 +++++++++++++++-
src/backend/catalog/system_views.sql | 11 ++
src/backend/postmaster/pgstat.c | 12 ++
src/backend/utils/activity/backend_status.c | 19 ++-
src/backend/utils/adt/pgstatfuncs.c | 150 ++++++++++++++++++++
src/include/catalog/pg_proc.dat | 9 ++
src/include/pgstat.h | 11 ++
src/include/utils/backend_status.h | 1 +
src/test/regress/expected/rules.out | 8 ++
9 files changed, 332 insertions(+), 8 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index dc07cc15f53..e73783fe116 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -435,6 +435,15 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</entry>
</row>
+ <row>
+ <entry><structname>pg_stat_buffers</structname><indexterm><primary>pg_stat_buffers</primary></indexterm></entry>
+ <entry>A row for each IO path for each backend type showing
+ statistics about backend IO operations. See
+ <link linkend="monitoring-pg-stat-buffers-view">
+ <structname>pg_stat_buffers</structname></link> for details.
+ </entry>
+ </row>
+
<row>
<entry><structname>pg_stat_wal</structname><indexterm><primary>pg_stat_wal</primary></indexterm></entry>
<entry>One row only, showing statistics about WAL activity. See
@@ -3581,7 +3590,102 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
<structfield>stats_reset</structfield> <type>timestamp with time zone</type>
</para>
<para>
- Time at which these statistics were last reset
+ Time at which these statistics were last reset.
+ </para></entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ </sect2>
+
+ <sect2 id="monitoring-pg-stat-buffers-view">
+ <title><structname>pg_stat_buffers</structname></title>
+
+ <indexterm>
+ <primary>pg_stat_buffers</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_stat_buffers</structname> view has a row for each backend
+ type for each possible IO path containing global data for the cluster for
+ that backend and IO path.
+ </para>
+
+ <table id="pg-stat-buffers-view" xreflabel="pg_stat_buffers">
+ <title><structname>pg_stat_buffers</structname> View</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>backend_type</structfield> <type>text</type>
+ </para>
+ <para>
+ Type of backend (e.g. background worker, autovacuum worker).
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>io_path</structfield> <type>text</type>
+ </para>
+ <para>
+ IO path taken (e.g. shared buffers, direct).
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>alloc</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers allocated.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>extend</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers extended.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>fsync</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers fsynced.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>write</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers written.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
+ </para>
+ <para>
+ Time at which these statistics were last reset.
</para></entry>
</row>
</tbody>
@@ -5186,12 +5290,13 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
</para>
<para>
Resets some cluster-wide statistics counters to zero, depending on the
- argument. The argument can be <literal>buffers</literal> to reset
- all the counters shown in
- the <structname>pg_stat_bgwriter</structname>
- view, <literal>archiver</literal> to reset all the counters shown in
- the <structname>pg_stat_archiver</structname> view or <literal>wal</literal>
- to reset all the counters shown in the <structname>pg_stat_wal</structname> view.
+ argument. The argument can be <literal>archiver</literal> to reset all
+ the counters shown in the <structname>pg_stat_archiver</structname>
+ view, <literal>buffers</literal> to reset all the counters shown in
+ both the <structname>pg_stat_bgwriter</structname> view and
+ <structname>pg_stat_buffers</structname> view, or
+ <literal>wal</literal> to reset all the counters shown in the
+ <structname>pg_stat_wal</structname> view.
</para>
<para>
This function is restricted to superusers by default, but other users
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 9eaa51df290..b6cfe3d3f93 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1103,6 +1103,17 @@ CREATE VIEW pg_stat_bgwriter AS
pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+CREATE VIEW pg_stat_buffers AS
+SELECT
+ b.backend_type,
+ b.io_path,
+ b.alloc,
+ b.extend,
+ b.fsync,
+ b.write,
+ b.stats_reset
+FROM pg_stat_get_buffers() b;
+
CREATE VIEW pg_stat_wal AS
SELECT
w.wal_records,
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 57a7f0fa7e9..a48998fe944 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1363,6 +1363,18 @@ pgstat_send_inquiry(TimestampTz clock_time, TimestampTz cutoff_time, Oid databas
pgstat_send(&msg, sizeof(msg));
}
+/*
+ * Support function for SQL-callable pgstat* functions. Returns a pointer to
+ * the PgStat_BackendIOPathOps structure tracking IO operations statistics for
+ * both exited backends and reset arithmetic.
+ */
+PgStat_BackendIOPathOps *
+pgstat_fetch_exited_backend_buffers(void)
+{
+ backend_read_statsfile();
+ return &globalStats.buffers;
+}
+
/*
* Support function for the SQL-callable pgstat* functions. Returns
* the collected statistics for one database or NULL. NULL doesn't mean
diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index 87b9d0fc0d8..c579014ec25 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -38,7 +38,7 @@ int pgstat_track_activity_query_size = 1024;
PgBackendStatus *MyBEEntry = NULL;
-static PgBackendStatus *BackendStatusArray = NULL;
+PgBackendStatus *BackendStatusArray = NULL;
static char *BackendAppnameBuffer = NULL;
static char *BackendClientHostnameBuffer = NULL;
static char *BackendActivityBuffer = NULL;
@@ -241,6 +241,23 @@ CreateSharedBackendStatus(void)
#endif
}
+const char *
+GetIOPathDesc(IOPath io_path)
+{
+ switch (io_path)
+ {
+ case IOPATH_DIRECT:
+ return "direct";
+ case IOPATH_LOCAL:
+ return "local";
+ case IOPATH_SHARED:
+ return "shared";
+ case IOPATH_STRATEGY:
+ return "strategy";
+ }
+ return "unknown IO path";
+}
+
/*
* Initialize pgstats backend activity state, and set up our on-proc-exit
* hook. Called from InitPostgres and AuxiliaryProcessMain. For auxiliary
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index ce84525d402..e1213a9ad03 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1739,6 +1739,156 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
}
+/*
+* When adding a new column to the pg_stat_buffers view, add a new enum
+* value here above BUFFERS_NUM_COLUMNS.
+*/
+enum
+{
+ BUFFERS_COLUMN_BACKEND_TYPE,
+ BUFFERS_COLUMN_IO_PATH,
+ BUFFERS_COLUMN_ALLOCS,
+ BUFFERS_COLUMN_EXTENDS,
+ BUFFERS_COLUMN_FSYNCS,
+ BUFFERS_COLUMN_WRITES,
+ BUFFERS_COLUMN_RESET_TIME,
+ BUFFERS_NUM_COLUMNS,
+};
+
+/*
+ * Helper function to get the correct row in the pg_stat_buffers view.
+ */
+static inline Datum *
+get_pg_stat_buffers_row(Datum all_values[BACKEND_NUM_TYPES][IOPATH_NUM_TYPES][BUFFERS_NUM_COLUMNS],
+ BackendType backend_type, IOPath io_path)
+{
+ return all_values[backend_type_get_idx(backend_type)][io_path];
+}
+
+Datum
+pg_stat_get_buffers(PG_FUNCTION_ARGS)
+{
+ PgStat_BackendIOPathOps *backend_io_path_ops;
+ PgBackendStatus *beentry;
+ Datum reset_time;
+
+ ReturnSetInfo *rsinfo;
+ TupleDesc tupdesc;
+ Tuplestorestate *tupstore;
+ MemoryContext per_query_ctx;
+ MemoryContext oldcontext;
+
+ Datum all_values[BACKEND_NUM_TYPES][IOPATH_NUM_TYPES][BUFFERS_NUM_COLUMNS];
+ bool all_nulls[BACKEND_NUM_TYPES][IOPATH_NUM_TYPES][BUFFERS_NUM_COLUMNS];
+
+ rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+
+ /* check to see if caller supports us returning a tuplestore */
+ if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("set-valued function called in context that cannot accept a set")));
+ if (!(rsinfo->allowedModes & SFRM_Materialize))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("materialize mode required, but it is not allowed in this context")));
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+ oldcontext = MemoryContextSwitchTo(per_query_ctx);
+ tupstore = tuplestore_begin_heap((bool) (rsinfo->allowedModes & SFRM_Materialize_Random),
+ false, work_mem);
+ rsinfo->returnMode = SFRM_Materialize;
+ rsinfo->setResult = tupstore;
+ rsinfo->setDesc = tupdesc;
+ MemoryContextSwitchTo(oldcontext);
+
+ memset(all_values, 0, sizeof(all_values));
+ memset(all_nulls, 0, sizeof(all_nulls));
+
+ /* Loop through all live backends and count their IO Ops for each IO Path */
+ beentry = BackendStatusArray;
+
+ for (int i = 0; i < GetMaxBackends() + NUM_AUXPROCTYPES; i++, beentry++)
+ {
+ IOOpCounters *io_ops;
+
+ /*
+ * Don't count dead backends. They will be added below. There are no
+ * rows in the view for BackendType B_INVALID, so skip those as well.
+ */
+ if (beentry->st_procpid == 0 || beentry->st_backendType == B_INVALID)
+ continue;
+
+ io_ops = beentry->io_path_stats;
+
+ for (int j = 0; j < IOPATH_NUM_TYPES; j++)
+ {
+ Datum *row = get_pg_stat_buffers_row(all_values, beentry->st_backendType, j);
+
+ /*
+ * BUFFERS_COLUMN_RESET_TIME, BUFFERS_COLUMN_BACKEND_TYPE, and
+ * BUFFERS_COLUMN_IO_PATH will all be set when looping through the
+ * exited backends array.
+ */
+ row[BUFFERS_COLUMN_ALLOCS] += pg_atomic_read_u64(&io_ops->allocs);
+ row[BUFFERS_COLUMN_EXTENDS] += pg_atomic_read_u64(&io_ops->extends);
+ row[BUFFERS_COLUMN_FSYNCS] += pg_atomic_read_u64(&io_ops->fsyncs);
+ row[BUFFERS_COLUMN_WRITES] += pg_atomic_read_u64(&io_ops->writes);
+ io_ops++;
+ }
+ }
+
+ /* Add stats from all exited backends */
+ backend_io_path_ops = pgstat_fetch_exited_backend_buffers();
+
+ reset_time = TimestampTzGetDatum(backend_io_path_ops->stat_reset_timestamp);
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ BackendType backend_type = idx_get_backend_type(i);
+
+ PgStatIOOpCounters *io_ops =
+ backend_io_path_ops->ops[i].io_path_ops;
+ PgStatIOOpCounters *resets =
+ backend_io_path_ops->resets[i].io_path_ops;
+
+ Datum backend_type_desc =
+ CStringGetTextDatum(GetBackendTypeDesc(backend_type));
+
+ for (int j = 0; j < IOPATH_NUM_TYPES; j++)
+ {
+ Datum *row = get_pg_stat_buffers_row(all_values, backend_type, j);
+
+ row[BUFFERS_COLUMN_BACKEND_TYPE] = backend_type_desc;
+ row[BUFFERS_COLUMN_IO_PATH] = CStringGetTextDatum(GetIOPathDesc(j));
+ row[BUFFERS_COLUMN_RESET_TIME] = reset_time;
+ row[BUFFERS_COLUMN_ALLOCS] += io_ops->allocs - resets->allocs;
+ row[BUFFERS_COLUMN_EXTENDS] += io_ops->extends - resets->extends;
+ row[BUFFERS_COLUMN_FSYNCS] += io_ops->fsyncs - resets->fsyncs;
+ row[BUFFERS_COLUMN_WRITES] += io_ops->writes - resets->writes;
+ io_ops++;
+ resets++;
+ }
+ }
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ for (int j = 0; j < IOPATH_NUM_TYPES; j++)
+ {
+ Datum *values = all_values[i][j];
+ bool *nulls = all_nulls[i][j];
+
+ tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+ }
+ }
+
+ return (Datum) 0;
+}
+
/*
* Returns statistics of WAL activity
*/
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 25304430f44..2a3f11c26e9 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5641,6 +5641,15 @@
proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
+{ oid => '8459', descr => 'statistics: counts of all IO operations done through all IO paths by each type of backend.',
+ proname => 'pg_stat_get_buffers', provolatile => 's', proisstrict => 'f',
+ prorows => '52', proretset => 't',
+ proparallel => 'r', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,text,int8,int8,int8,int8,timestamptz}',
+ proargmodes => '{o,o,o,o,o,o,o}',
+ proargnames => '{backend_type,io_path,alloc,extend,fsync,write,stats_reset}',
+ prosrc => 'pg_stat_get_buffers' },
+
{ oid => '1136', descr => 'statistics: information about WAL activity',
proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
proparallel => 'r', prorettype => 'record', proargtypes => '',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 9e93580f680..ca96f7c3e4c 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1207,6 +1207,17 @@ extern void pgstat_sum_io_path_ops(PgStatIOOpCounters *dest, IOOpCounters *src);
/*
* Functions in pgstat_replslot.c
*/
+extern PgStat_BackendIOPathOps *pgstat_fetch_exited_backend_buffers(void);
+extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid);
+extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
+extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
+extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
+extern PgStat_BgWriterStats *pgstat_fetch_stat_bgwriter(void);
+extern PgStat_CheckpointerStats *pgstat_fetch_stat_checkpointer(void);
+extern PgStat_GlobalStats *pgstat_fetch_global(void);
+extern PgStat_WalStats *pgstat_fetch_stat_wal(void);
+extern PgStat_SLRUStats *pgstat_fetch_slru(void);
+extern PgStat_StatReplSlotEntry *pgstat_fetch_replslot(NameData slotname);
extern void pgstat_reset_replslot_counter(const char *name);
extern void pgstat_report_replslot(const PgStat_StatReplSlotEntry *repSlotStat);
diff --git a/src/include/utils/backend_status.h b/src/include/utils/backend_status.h
index 7e59c063b94..6d623ff7469 100644
--- a/src/include/utils/backend_status.h
+++ b/src/include/utils/backend_status.h
@@ -316,6 +316,7 @@ extern PGDLLIMPORT int pgstat_track_activity_query_size;
* ----------
*/
extern PGDLLIMPORT PgBackendStatus *MyBEEntry;
+extern PGDLLIMPORT PgBackendStatus *BackendStatusArray;
/* ----------
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 423b9b99fb6..39d0ed45642 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1820,6 +1820,14 @@ pg_stat_bgwriter| SELECT pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints
pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+pg_stat_buffers| SELECT b.backend_type,
+ b.io_path,
+ b.alloc,
+ b.extend,
+ b.fsync,
+ b.write,
+ b.stats_reset
+ FROM pg_stat_get_buffers() b(backend_type, io_path, alloc, extend, fsync, write, stats_reset);
pg_stat_database| SELECT d.oid AS datid,
d.datname,
CASE
--
2.17.1
Attachment: 0007-Remove-superfluous-bgwriter-stats-code.patch (text/x-diff)
From aaeabb2896c5c12f37845ddd0179f8c6a62a8095 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 3 Jan 2022 15:01:10 -0500
Subject: [PATCH 7/8] Remove superfluous bgwriter stats code
After adding io_path_stats to PgBackendStatus, all backends keep track
of all IO done on all types of IO paths. When backends exit, they send
their IO operations stats to the stats collector to be persisted.
These statistics are available in the pg_stat_buffers view, making the
buffers_checkpoint, buffers_clean, buffers_backend,
buffers_backend_fsync, and buffers_alloc columns in the pg_stat_bgwriter
view redundant.
In order to maintain backward compatibility, these columns in
pg_stat_bgwriter remain and are derived from the pg_stat_buffers view.
The structs used to track the statistics for these columns in the
pg_stat_bgwriter view and the functions querying them have been removed.
Additionally, since the "buffers" stats reset target resets both the IO
operations stats structs and the bgwriter stats structs, this member of
the bgwriter stats structs is no longer needed. Instead derive the
stats_reset column in the pg_stat_bgwriter view from pg_stat_buffers as
well.
---
src/backend/catalog/system_views.sql | 28 ++++++++++-----------
src/backend/postmaster/checkpointer.c | 29 ++-------------------
src/backend/postmaster/pgstat.c | 7 ------
src/backend/storage/buffer/bufmgr.c | 6 -----
src/backend/utils/adt/pgstatfuncs.c | 36 ---------------------------
src/include/catalog/pg_proc.dat | 26 -------------------
src/include/pgstat.h | 8 ------
src/test/regress/expected/rules.out | 24 +++++++++++++-----
8 files changed, 34 insertions(+), 130 deletions(-)
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index b6cfe3d3f93..c6f609d9a8b 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1089,20 +1089,6 @@ CREATE VIEW pg_stat_archiver AS
s.stats_reset
FROM pg_stat_get_archiver() s;
-CREATE VIEW pg_stat_bgwriter AS
- SELECT
- pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints_timed,
- pg_stat_get_bgwriter_requested_checkpoints() AS checkpoints_req,
- pg_stat_get_checkpoint_write_time() AS checkpoint_write_time,
- pg_stat_get_checkpoint_sync_time() AS checkpoint_sync_time,
- pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint,
- pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean,
- pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
- pg_stat_get_buf_written_backend() AS buffers_backend,
- pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
- pg_stat_get_buf_alloc() AS buffers_alloc,
- pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
-
CREATE VIEW pg_stat_buffers AS
SELECT
b.backend_type,
@@ -1114,6 +1100,20 @@ SELECT
b.stats_reset
FROM pg_stat_get_buffers() b;
+CREATE VIEW pg_stat_bgwriter AS
+ SELECT
+ pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints_timed,
+ pg_stat_get_bgwriter_requested_checkpoints() AS checkpoints_req,
+ pg_stat_get_checkpoint_write_time() AS checkpoint_write_time,
+ pg_stat_get_checkpoint_sync_time() AS checkpoint_sync_time,
+ (SELECT write FROM pg_stat_buffers WHERE backend_type = 'checkpointer' AND io_path = 'shared') AS buffers_checkpoint,
+ (SELECT write FROM pg_stat_buffers WHERE backend_type = 'background writer' AND io_path = 'shared') AS buffers_clean,
+ pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
+ (SELECT write FROM pg_stat_buffers WHERE backend_type = 'client backend' AND io_path = 'shared') AS buffers_backend,
+ (SELECT fsync FROM pg_stat_buffers WHERE backend_type = 'client backend' AND io_path = 'shared') AS buffers_backend_fsync,
+ (SELECT alloc FROM pg_stat_buffers WHERE backend_type = 'client backend' AND io_path = 'shared') AS buffers_alloc,
+ (SELECT stats_reset FROM pg_stat_buffers LIMIT 1) AS stats_reset;
+
CREATE VIEW pg_stat_wal AS
SELECT
w.wal_records,
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 061775de4ed..90020cc899e 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -91,17 +91,9 @@
* requesting backends since the last checkpoint start. The flags are
* chosen so that OR'ing is the correct way to combine multiple requests.
*
- * num_backend_writes is used to count the number of buffer writes performed
- * by user backend processes. This counter should be wide enough that it
- * can't overflow during a single processing cycle. num_backend_fsync
- * counts the subset of those writes that also had to do their own fsync,
- * because the checkpointer failed to absorb their request.
- *
* The requests array holds fsync requests sent by backends and not yet
* absorbed by the checkpointer.
*
- * Unlike the checkpoint fields, num_backend_writes, num_backend_fsync, and
- * the requests fields are protected by CheckpointerCommLock.
*----------
*/
typedef struct
@@ -125,9 +117,6 @@ typedef struct
ConditionVariable start_cv; /* signaled when ckpt_started advances */
ConditionVariable done_cv; /* signaled when ckpt_done advances */
- uint32 num_backend_writes; /* counts user backend buffer writes */
- uint32 num_backend_fsync; /* counts user backend fsync calls */
-
int num_requests; /* current # of requests */
int max_requests; /* allocated array size */
CheckpointerRequest requests[FLEXIBLE_ARRAY_MEMBER];
@@ -1092,10 +1081,6 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
- /* Count all backend writes regardless of if they fit in the queue */
- if (!AmBackgroundWriterProcess())
- CheckpointerShmem->num_backend_writes++;
-
/*
* If the checkpointer isn't running or the request queue is full, the
* backend will have to perform its own fsync request. But before forcing
@@ -1105,13 +1090,12 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
(CheckpointerShmem->num_requests >= CheckpointerShmem->max_requests &&
!CompactCheckpointerRequestQueue()))
{
+ LWLockRelease(CheckpointerCommLock);
+
/*
* Count the subset of writes where backends have to do their own
* fsync
*/
- if (!AmBackgroundWriterProcess())
- CheckpointerShmem->num_backend_fsync++;
- LWLockRelease(CheckpointerCommLock);
pgstat_inc_ioop(IOOP_FSYNC, IOPATH_SHARED);
return false;
}
@@ -1268,15 +1252,6 @@ AbsorbSyncRequests(void)
LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
- /* Transfer stats counts into pending pgstats message */
- PendingCheckpointerStats.m_buf_written_backend
- += CheckpointerShmem->num_backend_writes;
- PendingCheckpointerStats.m_buf_fsync_backend
- += CheckpointerShmem->num_backend_fsync;
-
- CheckpointerShmem->num_backend_writes = 0;
- CheckpointerShmem->num_backend_fsync = 0;
-
/*
* We try to avoid holding the lock for a long time by copying the request
* array, and processing the requests after releasing the lock.
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index a48998fe944..6c2388dcf7e 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -2653,7 +2653,6 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
* existing statsfile).
*/
ts = GetCurrentTimestamp();
- globalStats.bgwriter.stat_reset_timestamp = ts;
globalStats.buffers.stat_reset_timestamp = ts;
archiverStats.stat_reset_timestamp = ts;
walStats.stat_reset_timestamp = ts;
@@ -3809,7 +3808,6 @@ pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
memset(&globalStats.bgwriter, 0, sizeof(globalStats.bgwriter));
memset(&globalStats.checkpointer, 0, sizeof(globalStats.checkpointer));
memset(&globalStats.buffers.ops, 0, sizeof(globalStats.buffers.ops));
- globalStats.bgwriter.stat_reset_timestamp = GetCurrentTimestamp();
globalStats.buffers.stat_reset_timestamp = GetCurrentTimestamp();
memcpy(&globalStats.buffers.resets[backend_type_get_idx(backend_type)],
&msg->m_backend_resets.iop.io_path_ops,
@@ -4089,9 +4087,7 @@ pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
static void
pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
{
- globalStats.bgwriter.buf_written_clean += msg->m_buf_written_clean;
globalStats.bgwriter.maxwritten_clean += msg->m_maxwritten_clean;
- globalStats.bgwriter.buf_alloc += msg->m_buf_alloc;
}
/*
@@ -4104,9 +4100,6 @@ pgstat_recv_checkpointer(PgStat_MsgCheckpointer *msg, int len)
globalStats.checkpointer.requested_checkpoints += msg->m_requested_checkpoints;
globalStats.checkpointer.checkpoint_write_time += msg->m_checkpoint_write_time;
globalStats.checkpointer.checkpoint_sync_time += msg->m_checkpoint_sync_time;
- globalStats.checkpointer.buf_written_checkpoints += msg->m_buf_written_checkpoints;
- globalStats.checkpointer.buf_written_backend += msg->m_buf_written_backend;
- globalStats.checkpointer.buf_fsync_backend += msg->m_buf_fsync_backend;
}
static void
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index a3136a589fa..13f0c0aac10 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2170,7 +2170,6 @@ BufferSync(int flags)
if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
{
TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
- PendingCheckpointerStats.m_buf_written_checkpoints++;
num_written++;
}
}
@@ -2279,9 +2278,6 @@ BgBufferSync(WritebackContext *wb_context)
*/
strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
- /* Report buffer alloc counts to pgstat */
- PendingBgWriterStats.m_buf_alloc += recent_alloc;
-
/*
* If we're not running the LRU scan, just stop after doing the stats
* stuff. We mark the saved state invalid so that we can recover sanely
@@ -2478,8 +2474,6 @@ BgBufferSync(WritebackContext *wb_context)
reusable_buffers++;
}
- PendingBgWriterStats.m_buf_written_clean += num_written;
-
#ifdef BGW_DEBUG
elog(DEBUG1, "bgwriter: recent_alloc=%u smoothed=%.2f delta=%ld ahead=%d density=%.2f reusable_est=%d upcoming_est=%d scanned=%d wrote=%d reusable=%d",
recent_alloc, smoothed_alloc, strategy_delta, bufs_ahead,
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index e1213a9ad03..0a448ba4ed9 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1681,18 +1681,6 @@ pg_stat_get_bgwriter_requested_checkpoints(PG_FUNCTION_ARGS)
PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->requested_checkpoints);
}
-Datum
-pg_stat_get_bgwriter_buf_written_checkpoints(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_written_checkpoints);
-}
-
-Datum
-pg_stat_get_bgwriter_buf_written_clean(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_written_clean);
-}
-
Datum
pg_stat_get_bgwriter_maxwritten_clean(PG_FUNCTION_ARGS)
{
@@ -1715,30 +1703,6 @@ pg_stat_get_checkpoint_sync_time(PG_FUNCTION_ARGS)
pgstat_fetch_stat_checkpointer()->checkpoint_sync_time);
}
-Datum
-pg_stat_get_bgwriter_stat_reset_time(PG_FUNCTION_ARGS)
-{
- PG_RETURN_TIMESTAMPTZ(pgstat_fetch_stat_bgwriter()->stat_reset_timestamp);
-}
-
-Datum
-pg_stat_get_buf_written_backend(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_written_backend);
-}
-
-Datum
-pg_stat_get_buf_fsync_backend(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_fsync_backend);
-}
-
-Datum
-pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
-{
- PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
-}
-
/*
* When adding a new column to the pg_stat_buffers view, add a new enum
* value here above BUFFERS_NUM_COLUMNS.
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 2a3f11c26e9..0af63b50d15 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5599,25 +5599,11 @@
proname => 'pg_stat_get_bgwriter_requested_checkpoints', provolatile => 's',
proparallel => 'r', prorettype => 'int8', proargtypes => '',
prosrc => 'pg_stat_get_bgwriter_requested_checkpoints' },
-{ oid => '2771',
- descr => 'statistics: number of buffers written by the bgwriter during checkpoints',
- proname => 'pg_stat_get_bgwriter_buf_written_checkpoints', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_bgwriter_buf_written_checkpoints' },
-{ oid => '2772',
- descr => 'statistics: number of buffers written by the bgwriter for cleaning dirty buffers',
- proname => 'pg_stat_get_bgwriter_buf_written_clean', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_bgwriter_buf_written_clean' },
{ oid => '2773',
descr => 'statistics: number of times the bgwriter stopped processing when it had written too many buffers while cleaning',
proname => 'pg_stat_get_bgwriter_maxwritten_clean', provolatile => 's',
proparallel => 'r', prorettype => 'int8', proargtypes => '',
prosrc => 'pg_stat_get_bgwriter_maxwritten_clean' },
-{ oid => '3075', descr => 'statistics: last reset for the bgwriter',
- proname => 'pg_stat_get_bgwriter_stat_reset_time', provolatile => 's',
- proparallel => 'r', prorettype => 'timestamptz', proargtypes => '',
- prosrc => 'pg_stat_get_bgwriter_stat_reset_time' },
{ oid => '3160',
descr => 'statistics: checkpoint time spent writing buffers to disk, in milliseconds',
proname => 'pg_stat_get_checkpoint_write_time', provolatile => 's',
@@ -5628,18 +5614,6 @@
proname => 'pg_stat_get_checkpoint_sync_time', provolatile => 's',
proparallel => 'r', prorettype => 'float8', proargtypes => '',
prosrc => 'pg_stat_get_checkpoint_sync_time' },
-{ oid => '2775', descr => 'statistics: number of buffers written by backends',
- proname => 'pg_stat_get_buf_written_backend', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_buf_written_backend' },
-{ oid => '3063',
- descr => 'statistics: number of backend buffer writes that did their own fsync',
- proname => 'pg_stat_get_buf_fsync_backend', provolatile => 's',
- proparallel => 'r', prorettype => 'int8', proargtypes => '',
- prosrc => 'pg_stat_get_buf_fsync_backend' },
-{ oid => '2859', descr => 'statistics: number of buffer allocations',
- proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
- prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
{ oid => '8459', descr => 'statistics: counts of all IO operations done through all IO paths by each type of backend.',
proname => 'pg_stat_get_buffers', provolatile => 's', proisstrict => 'f',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index ca96f7c3e4c..63075a6b358 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -573,9 +573,7 @@ typedef struct PgStat_MsgBgWriter
{
PgStat_MsgHdr m_hdr;
- PgStat_Counter m_buf_written_clean;
PgStat_Counter m_maxwritten_clean;
- PgStat_Counter m_buf_alloc;
} PgStat_MsgBgWriter;
/* ----------
@@ -588,9 +586,6 @@ typedef struct PgStat_MsgCheckpointer
PgStat_Counter m_timed_checkpoints;
PgStat_Counter m_requested_checkpoints;
- PgStat_Counter m_buf_written_checkpoints;
- PgStat_Counter m_buf_written_backend;
- PgStat_Counter m_buf_fsync_backend;
PgStat_Counter m_checkpoint_write_time; /* times in milliseconds */
PgStat_Counter m_checkpoint_sync_time;
} PgStat_MsgCheckpointer;
@@ -858,10 +853,7 @@ typedef struct PgStat_ArchiverStats
typedef struct PgStat_BgWriterStats
{
- PgStat_Counter buf_written_clean;
PgStat_Counter maxwritten_clean;
- PgStat_Counter buf_alloc;
- TimestampTz stat_reset_timestamp;
} PgStat_BgWriterStats;
typedef struct PgStat_CheckpointerStats
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 39d0ed45642..47f54d4a929 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1813,13 +1813,25 @@ pg_stat_bgwriter| SELECT pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints
pg_stat_get_bgwriter_requested_checkpoints() AS checkpoints_req,
pg_stat_get_checkpoint_write_time() AS checkpoint_write_time,
pg_stat_get_checkpoint_sync_time() AS checkpoint_sync_time,
- pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint,
- pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean,
+ ( SELECT pg_stat_buffers.write
+ FROM pg_stat_buffers
+ WHERE ((pg_stat_buffers.backend_type = 'checkpointer'::text) AND (pg_stat_buffers.io_path = 'shared'::text))) AS buffers_checkpoint,
+ ( SELECT pg_stat_buffers.write
+ FROM pg_stat_buffers
+ WHERE ((pg_stat_buffers.backend_type = 'background writer'::text) AND (pg_stat_buffers.io_path = 'shared'::text))) AS buffers_clean,
pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
- pg_stat_get_buf_written_backend() AS buffers_backend,
- pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
- pg_stat_get_buf_alloc() AS buffers_alloc,
- pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+ ( SELECT pg_stat_buffers.write
+ FROM pg_stat_buffers
+ WHERE ((pg_stat_buffers.backend_type = 'client backend'::text) AND (pg_stat_buffers.io_path = 'shared'::text))) AS buffers_backend,
+ ( SELECT pg_stat_buffers.fsync
+ FROM pg_stat_buffers
+ WHERE ((pg_stat_buffers.backend_type = 'client backend'::text) AND (pg_stat_buffers.io_path = 'shared'::text))) AS buffers_backend_fsync,
+ ( SELECT pg_stat_buffers.alloc
+ FROM pg_stat_buffers
+ WHERE ((pg_stat_buffers.backend_type = 'client backend'::text) AND (pg_stat_buffers.io_path = 'shared'::text))) AS buffers_alloc,
+ ( SELECT pg_stat_buffers.stats_reset
+ FROM pg_stat_buffers
+ LIMIT 1) AS stats_reset;
pg_stat_buffers| SELECT b.backend_type,
b.io_path,
b.alloc,
--
2.17.1
Attachment: 0008-small-comment-correction.patch (text/x-diff)
From 85751f055a7e9732123938215a0ab9b3c9e165fb Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 24 Nov 2021 12:21:08 -0500
Subject: [PATCH 8/8] small comment correction
Naming callers in a function comment is brittle and unnecessary.
---
src/backend/utils/activity/backend_status.c | 11 +++++------
1 file changed, 5 insertions(+), 6 deletions(-)
diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index c579014ec25..9861ae24ba1 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -259,11 +259,11 @@ GetIOPathDesc(IOPath io_path)
}
/*
- * Initialize pgstats backend activity state, and set up our on-proc-exit
- * hook. Called from InitPostgres and AuxiliaryProcessMain. For auxiliary
- * process, MyBackendId is invalid. Otherwise, MyBackendId must be set, but we
- * must not have started any transaction yet (since the exit hook must run
- * after the last transaction exit).
+ * Initialize pgstats backend activity state, and set up our on-proc-exit hook.
+ *
+ * For auxiliary process, MyBackendId is invalid. Otherwise, MyBackendId must
+ * be set, but we must not have started any transaction yet (since the exit
+ * hook must run after the last transaction exit).
*
* NOTE: MyDatabaseId isn't set yet; so the shutdown hook has to be careful.
*/
@@ -301,7 +301,6 @@ pgstat_beinit(void)
* pgstat_bestart() -
*
* Initialize this backend's entry in the PgBackendStatus array.
- * Called from InitPostgres.
*
* Apart from auxiliary processes, MyBackendId, MyDatabaseId,
* session userid, and application_name must be set for a
--
2.17.1
On Mon, Mar 21, 2022 at 8:15 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2022-02-19 11:06:18 -0500, Melanie Plageman wrote:
v21 rebased with compile errors fixed is attached.
This currently doesn't apply (mea culpa likely):
http://cfbot.cputube.org/patch_37_3272.log
Could you rebase? Marked as waiting-on-author for now.
Attached is the rebased/rewritten version of the pg_stat_buffers patch
which uses the cumulative stats system instead of stats collector.
I've moved to the model of backend-local pending stats which get
accumulated into shared memory by pgstat_report_stat().
It is worth noting that, with this method, other backends will no longer
have access to each other's individual IO operation statistics. An
argument could be made for keeping the statistics in each backend's
PgBackendStatus before accumulating them into the cumulative stats system
so that they remain accessible at a per-backend level of detail.
There are two TODOs related to when pgstat_report_io_ops() should be
called. pgstat_report_io_ops() is meant for backends that will not
commonly call pgstat_report_stat(). I was unsure whether it made sense
for BootstrapModeMain() to call pgstat_report_io_ops() explicitly, and
whether the autovacuum worker should call it explicitly and, if so,
whether after do_autovacuum() is the right place to call it.
Archiver and syslogger do not increment or report IO operations.
I did not change pg_stat_bgwriter fields to derive from the IO
operations statistics structures since the reset targets differ.
Also, I added one test, but I'm not sure if it will be flaky. It tests
that the "writes" for checkpointer are tracked when data is inserted
into a table and a CHECKPOINT is explicitly invoked directly afterward. I
don't know if this will be a problem if the checkpointer is busy and
the backend which dirtied the buffer is somehow forced to write out its
own buffer, causing the test to potentially fail (even if the
checkpointer is doing other writes [causing it to be busy], it may not
do them in between the INSERT and the SELECT from pg_stat_buffers).
I am wondering how to add a non-flaky test. For regular backends, I
couldn't think of a way to suspend the checkpointer to make them do their
own writes and fsyncs in the context of a regression or isolation test.
In fact for many of the dirty buffers it seems like it will be difficult
to keep bgwriter, checkpointer, and regular backends from competing and
sometimes causing test failures.
- Melanie
Attachments:
Attachment: v22-0002-Track-IO-operation-statistics.patch (application/octet-stream)
From 6a701ba5e069f03289f01ebf1bd41c73a0196aea Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 29 Jun 2022 18:37:42 -0400
Subject: [PATCH v22 2/3] Track IO operation statistics
Introduce IOOp, an IO operation done by a backend, and IOPath, the
location or type of IO done by a backend. For example, the checkpointer
may write a shared buffer out. This would be counted as an IOOp write on
an IOPath IOPATH_SHARED.
Stats on IOOps for all IOPaths for a backend are initially accumulated
locally.
Later they are flushed to shared memory and accumulated with those from
all other backends, exited and live.
Some BackendTypes will not execute pgstat_report_stat() and thus must
explicitly call pgstat_report_io_ops() in order to flush their backend
local IO operation statistics to shared memory.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
doc/src/sgml/monitoring.sgml | 2 +
src/backend/bootstrap/bootstrap.c | 3 +
src/backend/postmaster/autovacuum.c | 3 +
src/backend/postmaster/bgwriter.c | 1 +
src/backend/postmaster/checkpointer.c | 4 +
src/backend/postmaster/startup.c | 2 +
src/backend/postmaster/walwriter.c | 2 +
src/backend/storage/buffer/bufmgr.c | 44 ++-
src/backend/storage/buffer/freelist.c | 23 +-
src/backend/storage/buffer/localbuf.c | 3 +
src/backend/storage/sync/sync.c | 2 +
src/backend/utils/activity/Makefile | 1 +
src/backend/utils/activity/pgstat.c | 24 ++
src/backend/utils/activity/pgstat_io_ops.c | 296 +++++++++++++++++++++
src/backend/utils/adt/pgstatfuncs.c | 4 +-
src/include/miscadmin.h | 2 +
src/include/pgstat.h | 51 ++++
src/include/storage/buf_internals.h | 4 +-
src/include/utils/backend_status.h | 34 +++
src/include/utils/pgstat_internal.h | 20 ++
20 files changed, 509 insertions(+), 16 deletions(-)
create mode 100644 src/backend/utils/activity/pgstat_io_ops.c
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 4549c2560e..602f0a827a 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -5355,6 +5355,8 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
the <structname>pg_stat_bgwriter</structname>
view, <literal>archiver</literal> to reset all the counters shown in
the <structname>pg_stat_archiver</structname> view,
+ <literal>buffers</literal> to reset all the counters shown in the
+ <structname>pg_stat_buffers</structname> view,
<literal>wal</literal> to reset all the counters shown in the
<structname>pg_stat_wal</structname> view or
<literal>recovery_prefetch</literal> to reset all the counters shown
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 088556ab54..87298ede77 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -33,6 +33,7 @@
#include "miscadmin.h"
#include "nodes/makefuncs.h"
#include "pg_getopt.h"
+#include "pgstat.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
#include "storage/condition_variable.h"
@@ -375,6 +376,8 @@ BootstrapModeMain(int argc, char *argv[], bool check_only)
* out the initial relation mapping files.
*/
RelationMapFinishBootstrap();
+ // TODO: should this be done for bootstrap?
+ pgstat_report_io_ops();
/* Clean up and exit */
cleanup();
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 2e146aac93..e6dbb1c4bb 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -1712,6 +1712,9 @@ AutoVacWorkerMain(int argc, char *argv[])
recentXid = ReadNextTransactionId();
recentMulti = ReadNextMultiXactId();
do_autovacuum();
+
+ // TODO: should this be done more often somewhere in do_autovacuum()?
+ pgstat_report_io_ops();
}
/*
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 91e6f6ea18..87e4b9e9bd 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -242,6 +242,7 @@ BackgroundWriterMain(void)
/* Report pending statistics to the cumulative stats system */
pgstat_report_bgwriter();
+ pgstat_report_io_ops();
if (FirstCallSinceLastCheckpoint())
{
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index c937c39f50..1f72789739 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -504,6 +504,7 @@ CheckpointerMain(void)
/* Report pending statistics to the cumulative stats system */
pgstat_report_checkpointer();
+ pgstat_report_io_ops();
pgstat_report_wal(true);
/*
@@ -582,6 +583,7 @@ HandleCheckpointerInterrupts(void)
PendingCheckpointerStats.requested_checkpoints++;
ShutdownXLOG(0, 0);
pgstat_report_checkpointer();
+ pgstat_report_io_ops();
pgstat_report_wal(true);
/* Normal exit from the checkpointer is here */
@@ -726,6 +728,7 @@ CheckpointWriteDelay(int flags, double progress)
/* Report interim statistics to the cumulative stats system */
pgstat_report_checkpointer();
+ pgstat_report_io_ops();
/*
* This sleep used to be connected to bgwriter_delay, typically 200ms.
@@ -1116,6 +1119,7 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
if (!AmBackgroundWriterProcess())
CheckpointerShmem->num_backend_fsync++;
LWLockRelease(CheckpointerCommLock);
+ pgstat_count_io_op(IOOP_FSYNC, IOPATH_SHARED);
return false;
}
diff --git a/src/backend/postmaster/startup.c b/src/backend/postmaster/startup.c
index f99186eab7..678663798d 100644
--- a/src/backend/postmaster/startup.c
+++ b/src/backend/postmaster/startup.c
@@ -266,6 +266,8 @@ StartupProcessMain(void)
*/
StartupXLOG();
+ pgstat_report_io_ops();
+
/*
* Exit normally. Exit code 0 tells postmaster that we completed recovery
* successfully.
diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c
index e926f8c27c..f649282443 100644
--- a/src/backend/postmaster/walwriter.c
+++ b/src/backend/postmaster/walwriter.c
@@ -259,6 +259,7 @@ WalWriterMain(void)
/* report pending statistics to the cumulative stats system */
pgstat_report_wal(false);
+ pgstat_report_io_ops();
/*
* Sleep until we are signaled or WalWriterDelay has elapsed. If we
@@ -302,6 +303,7 @@ HandleWalWriterInterrupts(void)
* exist unreported stats counters for the WAL writer.
*/
pgstat_report_wal(true);
+ pgstat_report_io_ops();
proc_exit(0);
}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index ae13011d27..f1ddd696b1 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -482,7 +482,7 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
BlockNumber blockNum,
BufferAccessStrategy strategy,
bool *foundPtr);
-static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+static void FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOPath iopath);
static void FindAndDropRelFileNodeBuffers(RelFileNode rnode,
ForkNumber forkNum,
BlockNumber nForkBlock,
@@ -980,6 +980,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (isExtend)
{
+ pgstat_count_io_op(IOOP_EXTEND, IOPATH_SHARED);
/* new buffers are zero-filled */
MemSet((char *) bufBlock, 0, BLCKSZ);
/* don't set checksum for all-zero page */
@@ -1180,6 +1181,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
/* Loop here in case we have to try another victim buffer */
for (;;)
{
+ bool from_ring;
/*
* Ensure, while the spinlock's not yet held, that there's a free
* refcount entry.
@@ -1190,7 +1192,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* Select a victim buffer. The buffer is returned with its header
* spinlock still held!
*/
- buf = StrategyGetBuffer(strategy, &buf_state);
+ buf = StrategyGetBuffer(strategy, &buf_state, &from_ring);
Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
@@ -1227,6 +1229,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf),
LW_SHARED))
{
+ IOPath iopath;
/*
* If using a nondefault strategy, and writing the buffer
* would require a WAL flush, let the strategy decide whether
@@ -1244,7 +1247,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
UnlockBufHdr(buf, buf_state);
if (XLogNeedsFlush(lsn) &&
- StrategyRejectBuffer(strategy, buf))
+ StrategyRejectBuffer(strategy, buf, &from_ring))
{
/* Drop lock/pin and loop around for another buffer */
LWLockRelease(BufferDescriptorGetContentLock(buf));
@@ -1253,13 +1256,27 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
}
}
+ /*
+  * When a strategy is in use, a write of the dirty buffer is counted as
+  * a strategy write if the buffer was selected from the strategy ring,
+  * i.e. we did not consult the freelist or do a clock sweep to find a
+  * clean shared buffer. If the dirty buffer was instead obtained from
+  * the freelist or a clock sweep, it is counted as a regular write.
+  *
+  * When a strategy is not in use, at this point the write can only be a
+  * "regular" write of a dirty buffer.
+  */
+
+ iopath = from_ring ? IOPATH_STRATEGY : IOPATH_SHARED;
+
/* OK, do the I/O */
TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(forkNum, blockNum,
smgr->smgr_rnode.node.spcNode,
smgr->smgr_rnode.node.dbNode,
smgr->smgr_rnode.node.relNode);
- FlushBuffer(buf, NULL);
+ FlushBuffer(buf, NULL, iopath);
LWLockRelease(BufferDescriptorGetContentLock(buf));
ScheduleBufferTagForWriteback(&BackendWritebackContext,
@@ -2563,7 +2580,7 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOPATH_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
@@ -2810,9 +2827,12 @@ BufferGetTag(Buffer buffer, RelFileNode *rnode, ForkNumber *forknum,
*
* If the caller has an smgr reference for the buffer's relation, pass it
* as the second parameter. If not, pass NULL.
+ *
+ * IOPath will always be IOPATH_SHARED except when a buffer access strategy is
+ * used and the buffer being flushed is a buffer from the strategy ring.
*/
static void
-FlushBuffer(BufferDesc *buf, SMgrRelation reln)
+FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOPath iopath)
{
XLogRecPtr recptr;
ErrorContextCallback errcallback;
@@ -2892,6 +2912,8 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
*/
bufToWrite = PageSetChecksumCopy((Page) bufBlock, buf->tag.blockNum);
+ pgstat_count_io_op(IOOP_WRITE, iopath);
+
if (track_io_timing)
INSTR_TIME_SET_CURRENT(io_start);
@@ -3540,6 +3562,8 @@ FlushRelationBuffers(Relation rel)
localpage,
false);
+ pgstat_count_io_op(IOOP_WRITE, IOPATH_LOCAL);
+
buf_state &= ~(BM_DIRTY | BM_JUST_DIRTIED);
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
@@ -3575,7 +3599,7 @@ FlushRelationBuffers(Relation rel)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, RelationGetSmgr(rel));
+ FlushBuffer(bufHdr, RelationGetSmgr(rel), IOPATH_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
@@ -3670,7 +3694,7 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, srelent->srel);
+ FlushBuffer(bufHdr, srelent->srel, IOPATH_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
@@ -3878,7 +3902,7 @@ FlushDatabaseBuffers(Oid dbid)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOPATH_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
@@ -3905,7 +3929,7 @@ FlushOneBuffer(Buffer buffer)
Assert(LWLockHeldByMe(BufferDescriptorGetContentLock(bufHdr)));
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOPATH_SHARED);
}
/*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 990e081aae..afc22a4203 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -15,6 +15,7 @@
*/
#include "postgres.h"
+#include "pgstat.h"
#include "port/atomics.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
@@ -198,7 +199,7 @@ have_free_buffer(void)
* return the buffer with the buffer header spinlock still held.
*/
BufferDesc *
-StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
+StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_ring)
{
BufferDesc *buf;
int bgwprocno;
@@ -212,7 +213,8 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
if (strategy != NULL)
{
buf = GetBufferFromRing(strategy, buf_state);
- if (buf != NULL)
+ *from_ring = buf != NULL;
+ if (*from_ring)
return buf;
}
@@ -247,6 +249,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
* the rate of buffer consumption. Note that buffers recycled by a
* strategy object are intentionally not counted here.
*/
+ pgstat_count_io_op(IOOP_ALLOC, IOPATH_SHARED);
pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);
/*
@@ -682,8 +685,15 @@ AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
* if this buffer should be written and re-used.
*/
bool
-StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
+StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool *from_ring)
{
+
+ /*
+ * Start by assuming that we will use the dirty buffer selected by
+ * StrategyGetBuffer().
+ */
+ *from_ring = true;
+
/* We only do this in bulkread mode */
if (strategy->btype != BAS_BULKREAD)
return false;
@@ -699,5 +709,12 @@ StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
*/
strategy->buffers[strategy->current] = InvalidBuffer;
+ /*
+ * Since we will not be writing out a dirty buffer from the ring, set
+ * from_ring to false so that the caller does not count this write as a
+ * "strategy write" and can do proper bookkeeping.
+ */
+ *from_ring = false;
+
return true;
}
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index e71f95ac1f..cb7b1720a4 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -15,6 +15,7 @@
*/
#include "postgres.h"
+#include "pgstat.h"
#include "access/parallel.h"
#include "catalog/catalog.h"
#include "executor/instrument.h"
@@ -226,6 +227,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
localpage,
false);
+ pgstat_count_io_op(IOOP_WRITE, IOPATH_LOCAL);
+
/* Mark not-dirty now in case we error out below */
buf_state &= ~BM_DIRTY;
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index e1fb631003..20e259edef 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -432,6 +432,8 @@ ProcessSyncRequests(void)
total_elapsed += elapsed;
processed++;
+ pgstat_count_io_op(IOOP_FSYNC, IOPATH_SHARED);
+
if (log_checkpoints)
elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f ms",
processed,
diff --git a/src/backend/utils/activity/Makefile b/src/backend/utils/activity/Makefile
index a2e8507fd6..0098785089 100644
--- a/src/backend/utils/activity/Makefile
+++ b/src/backend/utils/activity/Makefile
@@ -22,6 +22,7 @@ OBJS = \
pgstat_checkpointer.o \
pgstat_database.o \
pgstat_function.o \
+ pgstat_io_ops.o \
pgstat_relation.o \
pgstat_replslot.o \
pgstat_shmem.o \
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 0d9d09c492..27507dfbb6 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -359,6 +359,15 @@ static const PgStat_KindInfo pgstat_kind_infos[PGSTAT_NUM_KINDS] = {
.snapshot_cb = pgstat_checkpointer_snapshot_cb,
},
+ [PGSTAT_KIND_IOOPS] = {
+ .name = "io_ops",
+
+ .fixed_amount = true,
+
+ .reset_all_cb = pgstat_io_ops_reset_all_cb,
+ .snapshot_cb = pgstat_io_ops_snapshot_cb,
+ },
+
[PGSTAT_KIND_SLRU] = {
.name = "slru",
@@ -628,6 +637,9 @@ pgstat_report_stat(bool force)
/* flush database / relation / function / ... stats */
partial_flush |= pgstat_flush_pending_entries(nowait);
+ /* flush IO Operations stats */
+ partial_flush |= pgstat_flush_io_ops(nowait);
+
/* flush wal stats */
partial_flush |= pgstat_flush_wal(nowait);
@@ -1312,6 +1324,12 @@ pgstat_write_statsfile(void)
pgstat_build_snapshot_fixed(PGSTAT_KIND_CHECKPOINTER);
write_chunk_s(fpout, &pgStatLocal.snapshot.checkpointer);
+ /*
+ * Write IO Operations stats struct
+ */
+ pgstat_build_snapshot_fixed(PGSTAT_KIND_IOOPS);
+ write_chunk_s(fpout, &pgStatLocal.snapshot.io_path_ops);
+
/*
* Write SLRU stats struct
*/
@@ -1486,6 +1504,12 @@ pgstat_read_statsfile(void)
if (!read_chunk_s(fpin, &shmem->checkpointer.stats))
goto error;
+ /*
+ * Read IO Operations stats struct
+ */
+ if (!read_chunk_s(fpin, &shmem->io_ops.stats))
+ goto error;
+
/*
* Read SLRU stats struct
*/
diff --git a/src/backend/utils/activity/pgstat_io_ops.c b/src/backend/utils/activity/pgstat_io_ops.c
new file mode 100644
index 0000000000..e184b0b484
--- /dev/null
+++ b/src/backend/utils/activity/pgstat_io_ops.c
@@ -0,0 +1,296 @@
+/* -------------------------------------------------------------------------
+ *
+ * pgstat_io_ops.c
+ * Implementation of IO operation statistics.
+ *
+ * This file contains the implementation of IO operation statistics. It is kept
+ * separate from pgstat.c to enforce the line between the statistics access /
+ * storage implementation and the details about individual types of
+ * statistics.
+ *
+ * Copyright (c) 2001-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/utils/activity/pgstat_io_ops.c
+ * -------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "utils/pgstat_internal.h"
+
+static PgStat_IOPathOps pending_IOOpStats;
+static PgStat_IOPathOps cumulative_IOOpStats;
+bool have_ioopstats = false;
+
+
+/*
+ * Flush out locally pending IO Operation statistics entries
+ *
+ * If nowait is true and the lock cannot be immediately acquired, this
+ * function returns true without flushing; otherwise it flushes the pending
+ * entries and returns false.
+ *
+ * Writer processes are mutually excluded using the LWLock, while readers are
+ * expected to use the change-count protocol to avoid interference with
+ * writers.
+ *
+ */
+bool
+pgstat_flush_io_ops(bool nowait)
+{
+ PgStat_IOPathOps *dest_io_path_ops;
+ PgStatShared_BackendIOPathOps *stats_shmem;
+
+ PgBackendStatus *beentry = MyBEEntry;
+
+ if (!have_ioopstats)
+ return false;
+
+ if (!beentry || beentry->st_backendType == B_INVALID)
+ return false;
+
+ stats_shmem = &pgStatLocal.shmem->io_ops;
+
+ if (!nowait)
+ LWLockAcquire(&stats_shmem->lock, LW_EXCLUSIVE);
+ else if (!LWLockConditionalAcquire(&stats_shmem->lock, LW_EXCLUSIVE))
+ return true;
+
+ dest_io_path_ops =
+ &stats_shmem->stats[backend_type_get_idx(beentry->st_backendType)];
+
+ for (int i = 0; i < IOPATH_NUM_TYPES; i++)
+ {
+ PgStat_IOOpCounters *sharedent = &dest_io_path_ops->data[i];
+ PgStat_IOOpCounters *pendingent = &pending_IOOpStats.data[i];
+
+#define IO_OP_ACC(fld) sharedent->fld += pendingent->fld
+ IO_OP_ACC(allocs);
+ IO_OP_ACC(extends);
+ IO_OP_ACC(fsyncs);
+ IO_OP_ACC(writes);
+#undef IO_OP_ACC
+ }
+
+ LWLockRelease(&stats_shmem->lock);
+
+ MemSet(&pending_IOOpStats, 0, sizeof(pending_IOOpStats));
+
+ have_ioopstats = false;
+
+ return false;
+}
+
+void
+pgstat_io_ops_snapshot_cb(void)
+{
+ PgStatShared_BackendIOPathOps *stats_shmem = &pgStatLocal.shmem->io_ops;
+ PgStat_IOPathOps *snapshot_ops = pgStatLocal.snapshot.io_path_ops;
+ PgStat_IOPathOps *reset_ops;
+
+ PgStat_IOPathOps *reset_offset = stats_shmem->reset_offset;
+ PgStat_IOPathOps reset[BACKEND_NUM_TYPES];
+
+ pgstat_copy_changecounted_stats(snapshot_ops,
+ &stats_shmem->stats, sizeof(stats_shmem->stats),
+ &stats_shmem->changecount);
+
+ LWLockAcquire(&stats_shmem->lock, LW_SHARED);
+ memcpy(&reset, reset_offset, sizeof(stats_shmem->stats));
+ LWLockRelease(&stats_shmem->lock);
+
+ reset_ops = reset;
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ PgStat_IOOpCounters *counters = snapshot_ops->data;
+ PgStat_IOOpCounters *reset_counters = reset_ops->data;
+
+ for (int j = 0; j < IOPATH_NUM_TYPES; j++)
+ {
+ counters->allocs -= reset_counters->allocs;
+ counters->extends -= reset_counters->extends;
+ counters->fsyncs -= reset_counters->fsyncs;
+ counters->writes -= reset_counters->writes;
+
+ counters++;
+ reset_counters++;
+ }
+ snapshot_ops++;
+ reset_ops++;
+ }
+}
+
+void
+pgstat_io_ops_reset_all_cb(TimestampTz ts)
+{
+ PgStatShared_BackendIOPathOps *stats_shmem = &pgStatLocal.shmem->io_ops;
+
+ LWLockAcquire(&stats_shmem->lock, LW_EXCLUSIVE);
+ pgstat_copy_changecounted_stats(&stats_shmem->reset_offset,
+ &stats_shmem->stats, sizeof(stats_shmem->stats),
+ &stats_shmem->changecount);
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ stats_shmem->stats[i].stat_reset_timestamp = ts;
+ LWLockRelease(&stats_shmem->lock);
+}
+
+void
+pgstat_count_io_op(IOOp io_op, IOPath io_path)
+{
+ PgStat_IOOpCounters *pending_counters = &pending_IOOpStats.data[io_path];
+ PgStat_IOOpCounters *cumulative_counters =
+ &cumulative_IOOpStats.data[io_path];
+
+ switch (io_op)
+ {
+ case IOOP_ALLOC:
+ pending_counters->allocs++;
+ cumulative_counters->allocs++;
+ break;
+ case IOOP_EXTEND:
+ pending_counters->extends++;
+ cumulative_counters->extends++;
+ break;
+ case IOOP_FSYNC:
+ pending_counters->fsyncs++;
+ cumulative_counters->fsyncs++;
+ break;
+ case IOOP_WRITE:
+ pending_counters->writes++;
+ cumulative_counters->writes++;
+ break;
+ }
+
+ have_ioopstats = true;
+}
+
+/*
+ * Report IO operation statistics
+ *
+ * This works in much the same way as pgstat_flush_io_ops() but is meant for
+ * backend types, such as bgwriter, that do not call pgstat_report_stat()
+ * frequently enough to keep the shared memory stats fresh; such processes
+ * can invoke pgstat_report_io_ops() explicitly instead.
+ */
+void
+pgstat_report_io_ops(void)
+{
+ PgStat_IOPathOps *dest_io_path_ops;
+ PgStatShared_BackendIOPathOps *stats_shmem;
+
+ PgBackendStatus *beentry = MyBEEntry;
+
+ Assert(!pgStatLocal.shmem->is_shutdown);
+ pgstat_assert_is_up();
+
+ if (!have_ioopstats)
+ return;
+
+ if (!beentry || beentry->st_backendType == B_INVALID)
+ return;
+
+ stats_shmem = &pgStatLocal.shmem->io_ops;
+
+ dest_io_path_ops =
+ &stats_shmem->stats[backend_type_get_idx(beentry->st_backendType)];
+
+ pgstat_begin_changecount_write(&stats_shmem->changecount);
+
+ for (int i = 0; i < IOPATH_NUM_TYPES; i++)
+ {
+ PgStat_IOOpCounters *sharedent = &dest_io_path_ops->data[i];
+ PgStat_IOOpCounters *pendingent = &pending_IOOpStats.data[i];
+
+#define IO_OP_ACC(fld) sharedent->fld += pendingent->fld
+ IO_OP_ACC(allocs);
+ IO_OP_ACC(extends);
+ IO_OP_ACC(fsyncs);
+ IO_OP_ACC(writes);
+#undef IO_OP_ACC
+ }
+
+ pgstat_end_changecount_write(&stats_shmem->changecount);
+
+ MemSet(&pending_IOOpStats, 0, sizeof(pending_IOOpStats));
+
+ have_ioopstats = false;
+}
+
+PgStat_IOPathOps *
+pgstat_fetch_backend_io_path_ops(void)
+{
+ pgstat_snapshot_fixed(PGSTAT_KIND_IOOPS);
+ return pgStatLocal.snapshot.io_path_ops;
+}
+
+PgStat_Counter
+pgstat_fetch_cumulative_io_ops(IOPath io_path, IOOp io_op)
+{
+ PgStat_IOOpCounters *counters = &cumulative_IOOpStats.data[io_path];
+
+ switch (io_op)
+ {
+ case IOOP_ALLOC:
+ return counters->allocs;
+ case IOOP_EXTEND:
+ return counters->extends;
+ case IOOP_FSYNC:
+ return counters->fsyncs;
+ case IOOP_WRITE:
+ return counters->writes;
+ default:
+ elog(ERROR, "IO Operation %s for IO Path %s is undefined.",
+ pgstat_io_op_desc(io_op), pgstat_io_path_desc(io_path));
+ }
+}
+
+const char *
+pgstat_io_path_desc(IOPath io_path)
+{
+ const char *io_path_desc = "Unknown IO Path";
+
+ switch (io_path)
+ {
+ case IOPATH_DIRECT:
+ io_path_desc = "Direct";
+ break;
+ case IOPATH_LOCAL:
+ io_path_desc = "Local";
+ break;
+ case IOPATH_SHARED:
+ io_path_desc = "Shared";
+ break;
+ case IOPATH_STRATEGY:
+ io_path_desc = "Strategy";
+ break;
+ }
+
+ return io_path_desc;
+}
+
+const char *
+pgstat_io_op_desc(IOOp io_op)
+{
+ const char *io_op_desc = "Unknown IO Op";
+
+ switch (io_op)
+ {
+ case IOOP_ALLOC:
+ io_op_desc = "Alloc";
+ break;
+ case IOOP_EXTEND:
+ io_op_desc = "Extend";
+ break;
+ case IOOP_FSYNC:
+ io_op_desc = "Fsync";
+ break;
+ case IOOP_WRITE:
+ io_op_desc = "Write";
+ break;
+ }
+
+ return io_op_desc;
+}
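As an aside for readers skimming the patch, the pending-then-flush accumulation implemented by pgstat_count_io_op() and pgstat_flush_io_ops() above can be sketched in miniature as standalone C (illustrative names only; locking and the per-IOPath indexing are elided):

```c
#include <string.h>

/*
 * Miniature, standalone sketch (not PostgreSQL code) of the accumulation
 * pattern used by pgstat_count_io_op() and pgstat_flush_io_ops(): IO
 * operations are counted in cheap backend-local "pending" counters and
 * periodically folded into a shared accumulator.
 */
enum { IOOP_ALLOC, IOOP_EXTEND, IOOP_FSYNC, IOOP_WRITE, IOOP_NUM_TYPES };

typedef struct { long ops[IOOP_NUM_TYPES]; } Counters;

static Counters pending; /* backend-local; bumped on every IO operation */
static Counters shared;  /* stands in for the shared-memory stats */

static void count_io_op(int op)
{
    pending.ops[op]++;
}

static void flush_io_ops(void)
{
    /* fold local pending counts into the shared totals, then reset */
    for (int i = 0; i < IOOP_NUM_TYPES; i++)
        shared.ops[i] += pending.ops[i];
    memset(&pending, 0, sizeof(pending));
}
```

Keeping the hot path (count_io_op) free of locks is the point of the split; in the real patch only the flush step takes the LWLock or bumps the changecount.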
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 893690dad5..47ceba78e6 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -2104,6 +2104,8 @@ pg_stat_reset_shared(PG_FUNCTION_ARGS)
pgstat_reset_of_kind(PGSTAT_KIND_BGWRITER);
pgstat_reset_of_kind(PGSTAT_KIND_CHECKPOINTER);
}
+ else if (strcmp(target, "buffers") == 0)
+ pgstat_reset_of_kind(PGSTAT_KIND_IOOPS);
else if (strcmp(target, "recovery_prefetch") == 0)
XLogPrefetchResetStats();
else if (strcmp(target, "wal") == 0)
@@ -2112,7 +2114,7 @@ pg_stat_reset_shared(PG_FUNCTION_ARGS)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("unrecognized reset target: \"%s\"", target),
- errhint("Target must be \"archiver\", \"bgwriter\", \"recovery_prefetch\", or \"wal\".")));
+ errhint("Target must be \"archiver\", \"buffers\", \"bgwriter\", \"recovery_prefetch\", or \"wal\".")));
PG_RETURN_VOID();
}
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 78097e714b..70e1f66370 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -339,6 +339,8 @@ typedef enum BackendType
B_WAL_WRITER,
} BackendType;
+#define BACKEND_NUM_TYPES B_WAL_WRITER
+
extern PGDLLIMPORT BackendType MyBackendType;
extern const char *GetBackendTypeDesc(BackendType backendType);
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index ac28f813b4..71f75806f6 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -48,6 +48,7 @@ typedef enum PgStat_Kind
PGSTAT_KIND_ARCHIVER,
PGSTAT_KIND_BGWRITER,
PGSTAT_KIND_CHECKPOINTER,
+ PGSTAT_KIND_IOOPS,
PGSTAT_KIND_SLRU,
PGSTAT_KIND_WAL,
} PgStat_Kind;
@@ -276,6 +277,44 @@ typedef struct PgStat_CheckpointerStats
PgStat_Counter buf_fsync_backend;
} PgStat_CheckpointerStats;
+/*
+ * Types related to counting IO Operations for various IO Paths
+ */
+
+typedef enum IOOp
+{
+ IOOP_ALLOC,
+ IOOP_EXTEND,
+ IOOP_FSYNC,
+ IOOP_WRITE,
+} IOOp;
+
+#define IOOP_NUM_TYPES (IOOP_WRITE + 1)
+
+typedef enum IOPath
+{
+ IOPATH_DIRECT,
+ IOPATH_LOCAL,
+ IOPATH_SHARED,
+ IOPATH_STRATEGY,
+} IOPath;
+
+#define IOPATH_NUM_TYPES (IOPATH_STRATEGY + 1)
+
+typedef struct PgStat_IOOpCounters
+{
+ PgStat_Counter allocs;
+ PgStat_Counter extends;
+ PgStat_Counter fsyncs;
+ PgStat_Counter writes;
+} PgStat_IOOpCounters;
+
+typedef struct PgStat_IOPathOps
+{
+ PgStat_IOOpCounters data[IOPATH_NUM_TYPES];
+ TimestampTz stat_reset_timestamp;
+} PgStat_IOPathOps;
+
typedef struct PgStat_StatDBEntry
{
PgStat_Counter n_xact_commit;
@@ -453,6 +492,18 @@ extern void pgstat_report_checkpointer(void);
extern PgStat_CheckpointerStats *pgstat_fetch_stat_checkpointer(void);
+/*
+ * Functions in pgstat_io_ops.c
+ */
+
+extern void pgstat_count_io_op(IOOp io_op, IOPath io_path);
+extern void pgstat_report_io_ops(void);
+extern PgStat_IOPathOps *pgstat_fetch_backend_io_path_ops(void);
+extern PgStat_Counter pgstat_fetch_cumulative_io_ops(IOPath io_path, IOOp io_op);
+extern const char *pgstat_io_op_desc(IOOp io_op);
+extern const char *pgstat_io_path_desc(IOPath io_path);
+
+
/*
* Functions in pgstat_database.c
*/
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index a17e7b28a5..c8546be22b 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -310,10 +310,10 @@ extern void ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *
/* freelist.c */
extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
- uint32 *buf_state);
+ uint32 *buf_state, bool *from_ring);
extern void StrategyFreeBuffer(BufferDesc *buf);
extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
- BufferDesc *buf);
+ BufferDesc *buf, bool *from_ring);
extern int StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
extern void StrategyNotifyBgWriter(int bgwprocno);
diff --git a/src/include/utils/backend_status.h b/src/include/utils/backend_status.h
index 7403bca25e..49d062b1af 100644
--- a/src/include/utils/backend_status.h
+++ b/src/include/utils/backend_status.h
@@ -306,6 +306,40 @@ extern const char *pgstat_get_crashed_backend_activity(int pid, char *buffer,
int buflen);
extern uint64 pgstat_get_my_query_id(void);
+/* Utility functions */
+
+/*
+ * When maintaining an array of information about all valid BackendTypes, in
+ * order to avoid wasting the 0th spot, use this helper to convert a valid
+ * BackendType to a valid location in the array (given that no spot is
+ * maintained for B_INVALID BackendType).
+ */
+static inline int backend_type_get_idx(BackendType backend_type)
+{
+ /*
+ * backend_type must be one of the valid backend types. If caller is
+ * maintaining backend information in an array that includes B_INVALID,
+ * this function is unnecessary.
+ */
+ Assert(backend_type > B_INVALID && backend_type <= BACKEND_NUM_TYPES);
+ return backend_type - 1;
+}
+
+/*
+ * When using a value from an array of information about all valid
+ * BackendTypes, add 1 to the index before using it as a BackendType to adjust
+ * for not maintaining a spot for B_INVALID BackendType.
+ */
+static inline BackendType idx_get_backend_type(int idx)
+{
+ int backend_type = idx + 1;
+ /*
+ * If the array includes a spot for B_INVALID BackendType this function is
+ * not required.
+ */
+ Assert(backend_type > B_INVALID && backend_type <= BACKEND_NUM_TYPES);
+ return backend_type;
+}
/* ----------
* Support functions for the SQL-callable functions to
diff --git a/src/include/utils/pgstat_internal.h b/src/include/utils/pgstat_internal.h
index 9303d05427..2dfa750df9 100644
--- a/src/include/utils/pgstat_internal.h
+++ b/src/include/utils/pgstat_internal.h
@@ -329,6 +329,14 @@ typedef struct PgStatShared_Checkpointer
PgStat_CheckpointerStats reset_offset;
} PgStatShared_Checkpointer;
+typedef struct PgStatShared_BackendIOPathOps
+{
+ LWLock lock;
+ uint32 changecount;
+ PgStat_IOPathOps stats[BACKEND_NUM_TYPES];
+ PgStat_IOPathOps reset_offset[BACKEND_NUM_TYPES];
+} PgStatShared_BackendIOPathOps;
+
typedef struct PgStatShared_SLRU
{
/* lock protects ->stats */
@@ -419,6 +427,7 @@ typedef struct PgStat_ShmemControl
PgStatShared_Archiver archiver;
PgStatShared_BgWriter bgwriter;
PgStatShared_Checkpointer checkpointer;
+ PgStatShared_BackendIOPathOps io_ops;
PgStatShared_SLRU slru;
PgStatShared_Wal wal;
} PgStat_ShmemControl;
@@ -442,6 +451,8 @@ typedef struct PgStat_Snapshot
PgStat_CheckpointerStats checkpointer;
+ PgStat_IOPathOps io_path_ops[BACKEND_NUM_TYPES];
+
PgStat_SLRUStats slru[SLRU_NUM_ELEMENTS];
PgStat_WalStats wal;
@@ -549,6 +560,15 @@ extern void pgstat_database_reset_timestamp_cb(PgStatShared_Common *header, Time
extern bool pgstat_function_flush_cb(PgStat_EntryRef *entry_ref, bool nowait);
+/*
+ * Functions in pgstat_io_ops.c
+ */
+
+extern bool pgstat_flush_io_ops(bool nowait);
+extern void pgstat_io_ops_snapshot_cb(void);
+extern void pgstat_io_ops_reset_all_cb(TimestampTz ts);
+
+
/*
* Functions in pgstat_relation.c
*/
--
2.37.0
[Attachment: v22-0003-Add-system-view-tracking-IO-ops-per-backend-type.patch (application/octet-stream)]
From f2b5b75f5063702cbc3c64efdc1e7ef3cf1acdb4 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 4 Jul 2022 15:44:17 -0400
Subject: [PATCH v22 3/3] Add system view tracking IO ops per backend type
Add pg_stat_buffers, a system view which tracks the number of IO
operations (allocs, writes, fsyncs, and extends) done through each IO
path (e.g. shared buffers, local buffers, unbuffered IO) by each type of
backend.
Some of these should always be zero. For example, checkpointer does not
use a BufferAccessStrategy (currently), so the "strategy" IO Path for
checkpointer will be 0 for all IO operations (alloc, write, fsync, and
extend). All possible combinations of IOPath and IOOp are enumerated in
the view but not all are populated or even possible at this point.
View stats are fetched from statistics incremented when a backend
performs an IO operation and maintained by the cumulative statistics
subsystem.
Each row of the view is stats for a particular backend type for a
particular IO Path (e.g. shared buffer accesses by checkpointer) and
each column in the view is the total number of IO operations done (e.g.
writes).
So a cell in the view would be, for example, the number of shared
buffers written by checkpointer since the last stats reset.
Note that some of the cells in the view are redundant with fields in
pg_stat_bgwriter (e.g. buffers_backend); however, these have been kept in
pg_stat_bgwriter for backwards compatibility. Deriving the redundant
pg_stat_bgwriter stats from the IO operations stats structures was also
problematic due to the separate reset targets for 'bgwriter' and
'buffers'.
Suggested by Andres Freund
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
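
For illustration, using the view and IO path names defined in this patch (output is hypothetical), one might ask which backend types are writing out shared buffers:

```sql
-- shared-buffer IO by backend type, busiest writers first
SELECT backend_type, alloc, extend, write, fsync
FROM pg_stat_buffers
WHERE io_path = 'Shared'
ORDER BY write DESC;
```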
---
doc/src/sgml/monitoring.sgml | 106 ++++++++++++++++++++++++++-
src/backend/catalog/system_views.sql | 11 +++
src/backend/utils/adt/pgstatfuncs.c | 66 +++++++++++++++++
src/include/catalog/pg_proc.dat | 9 +++
src/test/regress/expected/rules.out | 8 ++
src/test/regress/expected/stats.out | 52 +++++++++++++
src/test/regress/sql/stats.sql | 14 ++++
7 files changed, 265 insertions(+), 1 deletion(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 602f0a827a..ee276b44ca 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -448,6 +448,15 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</entry>
</row>
+ <row>
+ <entry><structname>pg_stat_buffers</structname><indexterm><primary>pg_stat_buffers</primary></indexterm></entry>
+ <entry>A row for each IO path for each backend type showing
+ statistics about backend IO operations. See
+ <link linkend="monitoring-pg-stat-buffers-view">
+ <structname>pg_stat_buffers</structname></link> for details.
+ </entry>
+ </row>
+
<row>
<entry><structname>pg_stat_wal</structname><indexterm><primary>pg_stat_wal</primary></indexterm></entry>
<entry>One row only, showing statistics about WAL activity. See
@@ -3595,7 +3604,102 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
<structfield>stats_reset</structfield> <type>timestamp with time zone</type>
</para>
<para>
- Time at which these statistics were last reset
+ Time at which these statistics were last reset.
+ </para></entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ </sect2>
+
+ <sect2 id="monitoring-pg-stat-buffers-view">
+ <title><structname>pg_stat_buffers</structname></title>
+
+ <indexterm>
+ <primary>pg_stat_buffers</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_stat_buffers</structname> view contains one row for each
+ combination of backend type and IO path, showing cluster-wide IO operation
+ statistics for that combination.
+ </para>
+
+ <table id="pg-stat-buffers-view" xreflabel="pg_stat_buffers">
+ <title><structname>pg_stat_buffers</structname> View</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>backend_type</structfield> <type>text</type>
+ </para>
+ <para>
+ Type of backend (e.g. background worker, autovacuum worker).
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>io_path</structfield> <type>text</type>
+ </para>
+ <para>
+ IO path taken (e.g. shared buffers, direct).
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>alloc</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers allocated.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>extend</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers extended.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>fsync</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers fsynced.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>write</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers written.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
+ </para>
+ <para>
+ Time at which these statistics were last reset.
</para></entry>
</row>
</tbody>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index fedaed533b..b756480eab 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1115,6 +1115,17 @@ CREATE VIEW pg_stat_bgwriter AS
pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+CREATE VIEW pg_stat_buffers AS
+SELECT
+ b.backend_type,
+ b.io_path,
+ b.alloc,
+ b.extend,
+ b.fsync,
+ b.write,
+ b.stats_reset
+FROM pg_stat_get_buffers() b;
+
CREATE VIEW pg_stat_wal AS
SELECT
w.wal_records,
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 47ceba78e6..c94d98de62 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1739,6 +1739,72 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
}
+/*
+* When adding a new column to the pg_stat_buffers view, add a new enum
+* value here above BUFFERS_NUM_COLUMNS.
+*/
+enum
+{
+ BUFFERS_COLUMN_BACKEND_TYPE,
+ BUFFERS_COLUMN_IO_PATH,
+ BUFFERS_COLUMN_ALLOCS,
+ BUFFERS_COLUMN_EXTENDS,
+ BUFFERS_COLUMN_FSYNCS,
+ BUFFERS_COLUMN_WRITES,
+ BUFFERS_COLUMN_RESET_TIME,
+ BUFFERS_NUM_COLUMNS,
+};
+
+Datum
+pg_stat_get_buffers(PG_FUNCTION_ARGS)
+{
+ PgStat_IOPathOps *io_path_ops;
+ ReturnSetInfo *rsinfo;
+ Datum reset_time;
+
+ SetSingleFuncCall(fcinfo, 0);
+ io_path_ops = pgstat_fetch_backend_io_path_ops();
+ rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+
+ /*
+ * Currently it is not permitted to reset IO operation stats for individual
+ * IO Paths or individual BackendTypes. All IO Operation statistics are
+ * reset together. As such, it is easiest to reuse the first reset timestamp
+ * available.
+ */
+ reset_time = TimestampTzGetDatum(io_path_ops->stat_reset_timestamp);
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ PgStat_IOOpCounters *counters = io_path_ops->data;
+ Datum backend_type_desc =
+ CStringGetTextDatum(GetBackendTypeDesc(idx_get_backend_type(i)));
+ /* const char *log_name = GetBackendTypeDesc(idx_get_backend_type(i)); */
+
+ for (int j = 0; j < IOPATH_NUM_TYPES; j++)
+ {
+ Datum values[BUFFERS_NUM_COLUMNS];
+ bool nulls[BUFFERS_NUM_COLUMNS];
+ memset(values, 0, sizeof(values));
+ memset(nulls, 0, sizeof(nulls));
+
+ values[BUFFERS_COLUMN_BACKEND_TYPE] = backend_type_desc;
+ values[BUFFERS_COLUMN_IO_PATH] = CStringGetTextDatum(pgstat_io_path_desc(j));
+ values[BUFFERS_COLUMN_RESET_TIME] = TimestampTzGetDatum(reset_time);
+ values[BUFFERS_COLUMN_ALLOCS] = Int64GetDatum(counters->allocs);
+ values[BUFFERS_COLUMN_EXTENDS] = Int64GetDatum(counters->extends);
+ values[BUFFERS_COLUMN_FSYNCS] = Int64GetDatum(counters->fsyncs);
+ values[BUFFERS_COLUMN_WRITES] = Int64GetDatum(counters->writes);
+
+ tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
+ counters++;
+ }
+ io_path_ops++;
+ }
+
+ return (Datum) 0;
+}
+
/*
* Returns statistics of WAL activity
*/
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 2e41f4d9e8..462bcbbe15 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5646,6 +5646,15 @@
proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
+{ oid => '8459', descr => 'statistics: counts of all IO operations done through all IO paths by each type of backend.',
+ proname => 'pg_stat_get_buffers', provolatile => 's', proisstrict => 'f',
+ prorows => '52', proretset => 't',
+ proparallel => 'r', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,text,int8,int8,int8,int8,timestamptz}',
+ proargmodes => '{o,o,o,o,o,o,o}',
+ proargnames => '{backend_type,io_path,alloc,extend,fsync,write,stats_reset}',
+ prosrc => 'pg_stat_get_buffers' },
+
{ oid => '1136', descr => 'statistics: information about WAL activity',
proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
proparallel => 'r', prorettype => 'record', proargtypes => '',
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 7ec3d2688f..4f266fea8e 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1822,6 +1822,14 @@ pg_stat_bgwriter| SELECT pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints
pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+pg_stat_buffers| SELECT b.backend_type,
+ b.io_path,
+ b.alloc,
+ b.extend,
+ b.fsync,
+ b.write,
+ b.stats_reset
+ FROM pg_stat_get_buffers() b(backend_type, io_path, alloc, extend, fsync, write, stats_reset);
pg_stat_database| SELECT d.oid AS datid,
d.datname,
CASE
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index 5b0ebf090f..350171f2be 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -554,4 +554,56 @@ SELECT pg_stat_get_live_tuples(:drop_stats_test_subxact_oid);
DROP TABLE trunc_stats_test, trunc_stats_test1, trunc_stats_test2, trunc_stats_test3, trunc_stats_test4;
DROP TABLE prevstats;
+SELECT pg_stat_reset_shared('buffers');
+ pg_stat_reset_shared
+----------------------
+
+(1 row)
+
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT write = 0 FROM pg_stat_buffers WHERE io_path = 'Shared' and backend_type = 'checkpointer';
+ ?column?
+----------
+ t
+(1 row)
+
+CREATE TABLE test_buffer_stats(a int, b int);
+INSERT INTO test_buffer_stats SELECT i, i FROM generate_series(1,1000)i;
+CHECKPOINT;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT write != 0 FROM pg_stat_buffers WHERE io_path = 'Shared' and backend_type = 'checkpointer';
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT pg_stat_reset_shared('buffers');
+ pg_stat_reset_shared
+----------------------
+
+(1 row)
+
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT write = 0 FROM pg_stat_buffers WHERE io_path = 'Shared' and backend_type = 'checkpointer';
+ ?column?
+----------
+ t
+(1 row)
+
+DROP TABLE test_buffer_stats;
-- End of Stats Test
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index 3f3cf8fb56..11c69ca423 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -285,4 +285,18 @@ SELECT pg_stat_get_live_tuples(:drop_stats_test_subxact_oid);
DROP TABLE trunc_stats_test, trunc_stats_test1, trunc_stats_test2, trunc_stats_test3, trunc_stats_test4;
DROP TABLE prevstats;
+
+SELECT pg_stat_reset_shared('buffers');
+SELECT pg_stat_force_next_flush();
+SELECT write = 0 FROM pg_stat_buffers WHERE io_path = 'Shared' and backend_type = 'checkpointer';
+CREATE TABLE test_buffer_stats(a int, b int);
+INSERT INTO test_buffer_stats SELECT i, i FROM generate_series(1,1000)i;
+CHECKPOINT;
+SELECT pg_stat_force_next_flush();
+SELECT write != 0 FROM pg_stat_buffers WHERE io_path = 'Shared' and backend_type = 'checkpointer';
+SELECT pg_stat_reset_shared('buffers');
+SELECT pg_stat_force_next_flush();
+SELECT write = 0 FROM pg_stat_buffers WHERE io_path = 'Shared' and backend_type = 'checkpointer';
+DROP TABLE test_buffer_stats;
+
-- End of Stats Test
--
2.37.0
v22-0001-Add-BackendType-for-standalone-backends.patch
From 2d089e26236c55d1be5b93833baa0cf7667ba38d Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 28 Jun 2022 11:33:04 -0400
Subject: [PATCH v22 1/3] Add BackendType for standalone backends
All backends should have a BackendType to enable statistics reporting
per BackendType.
Add a new BackendType for standalone backends, B_STANDALONE_BACKEND (and
alphabetize the BackendTypes). Both the bootstrap backend and single
user mode backends will have BackendType B_STANDALONE_BACKEND.
Author: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://www.postgresql.org/message-id/CAAKRu_aaq33UnG4TXq3S-OSXGWj1QGf0sU%2BECH4tNwGFNERkZA%40mail.gmail.com
---
src/backend/utils/init/miscinit.c | 17 +++++++++++------
src/include/miscadmin.h | 5 +++--
2 files changed, 14 insertions(+), 8 deletions(-)
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index eb43b2c5e5..07e6db1a1c 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -176,6 +176,8 @@ InitStandaloneProcess(const char *argv0)
{
Assert(!IsPostmasterEnvironment);
+ MyBackendType = B_STANDALONE_BACKEND;
+
/*
* Start our win32 signal implementation
*/
@@ -255,6 +257,9 @@ GetBackendTypeDesc(BackendType backendType)
case B_INVALID:
backendDesc = "not initialized";
break;
+ case B_ARCHIVER:
+ backendDesc = "archiver";
+ break;
case B_AUTOVAC_LAUNCHER:
backendDesc = "autovacuum launcher";
break;
@@ -273,6 +278,12 @@ GetBackendTypeDesc(BackendType backendType)
case B_CHECKPOINTER:
backendDesc = "checkpointer";
break;
+ case B_LOGGER:
+ backendDesc = "logger";
+ break;
+ case B_STANDALONE_BACKEND:
+ backendDesc = "standalone backend";
+ break;
case B_STARTUP:
backendDesc = "startup";
break;
@@ -285,12 +296,6 @@ GetBackendTypeDesc(BackendType backendType)
case B_WAL_WRITER:
backendDesc = "walwriter";
break;
- case B_ARCHIVER:
- backendDesc = "archiver";
- break;
- case B_LOGGER:
- backendDesc = "logger";
- break;
}
return backendDesc;
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 0af130fbc5..78097e714b 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -324,18 +324,19 @@ extern void SwitchBackToLocalLatch(void);
typedef enum BackendType
{
B_INVALID = 0,
+ B_ARCHIVER,
B_AUTOVAC_LAUNCHER,
B_AUTOVAC_WORKER,
B_BACKEND,
B_BG_WORKER,
B_BG_WRITER,
B_CHECKPOINTER,
+ B_LOGGER,
+ B_STANDALONE_BACKEND,
B_STARTUP,
B_WAL_RECEIVER,
B_WAL_SENDER,
B_WAL_WRITER,
- B_ARCHIVER,
- B_LOGGER,
} BackendType;
extern PGDLLIMPORT BackendType MyBackendType;
--
2.37.0
Hi,
On 2022-07-05 13:24:55 -0400, Melanie Plageman wrote:
From 2d089e26236c55d1be5b93833baa0cf7667ba38d Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 28 Jun 2022 11:33:04 -0400
Subject: [PATCH v22 1/3] Add BackendType for standalone backends

All backends should have a BackendType to enable statistics reporting
per BackendType.

Add a new BackendType for standalone backends, B_STANDALONE_BACKEND (and
alphabetize the BackendTypes). Both the bootstrap backend and single
user mode backends will have BackendType B_STANDALONE_BACKEND.

Author: Melanie Plageman <melanieplageman@gmail.com>
Discussion: /messages/by-id/CAAKRu_aaq33UnG4TXq3S-OSXGWj1QGf0sU+ECH4tNwGFNERkZA@mail.gmail.com
---
src/backend/utils/init/miscinit.c | 17 +++++++++++------
src/include/miscadmin.h | 5 +++--
2 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index eb43b2c5e5..07e6db1a1c 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -176,6 +176,8 @@ InitStandaloneProcess(const char *argv0)
 {
 	Assert(!IsPostmasterEnvironment);

+	MyBackendType = B_STANDALONE_BACKEND;
Hm. This is used for singleuser mode as well as bootstrap. Should we
split those? It's not like bootstrap mode really matters for stats, so
I'm inclined not to.
@@ -375,6 +376,8 @@ BootstrapModeMain(int argc, char *argv[], bool check_only)
 	 * out the initial relation mapping files.
 	 */
 	RelationMapFinishBootstrap();
+
+	// TODO: should this be done for bootstrap?
+	pgstat_report_io_ops();
Hm. Not particularly useful, but also not harmful. But we don't need an
explicit call, because it'll be done at process exit too. At least I
think, it could be that it's different for bootstrap.
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 2e146aac93..e6dbb1c4bb 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -1712,6 +1712,9 @@ AutoVacWorkerMain(int argc, char *argv[])
 	recentXid = ReadNextTransactionId();
 	recentMulti = ReadNextMultiXactId();
 	do_autovacuum();
+
+	// TODO: should this be done more often somewhere in do_autovacuum()?
+	pgstat_report_io_ops();
 }
Don't think you need all these calls before process exit - it'll happen
via pgstat_shutdown_hook().
IMO it'd be a good idea to add pgstat_report_io_ops() to
pgstat_report_vacuum()/analyze(), so that the stats for a longrunning
autovac worker get updated more regularly.
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 91e6f6ea18..87e4b9e9bd 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -242,6 +242,7 @@ BackgroundWriterMain(void)

 		/* Report pending statistics to the cumulative stats system */
 		pgstat_report_bgwriter();
+		pgstat_report_io_ops();

 		if (FirstCallSinceLastCheckpoint())
 		{
How about moving the pgstat_report_io_ops() into
pgstat_report_bgwriter(), pgstat_report_autovacuum() etc? Seems
unnecessary to have multiple pgstat_* calls in these places.
+/*
+ * Flush out locally pending IO Operation statistics entries
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true. Writer processes are mutually excluded
+ * using LWLock, but readers are expected to use change-count protocol to avoid
+ * interference with writers.
+ *
+ * If nowait is true, this function returns true if the lock could not be
+ * acquired. Otherwise return false.
+ */
+bool
+pgstat_flush_io_ops(bool nowait)
+{
+	PgStat_IOPathOps *dest_io_path_ops;
+	PgStatShared_BackendIOPathOps *stats_shmem;
+
+	PgBackendStatus *beentry = MyBEEntry;
+
+	if (!have_ioopstats)
+		return false;
+
+	if (!beentry || beentry->st_backendType == B_INVALID)
+		return false;
+
+	stats_shmem = &pgStatLocal.shmem->io_ops;
+
+	if (!nowait)
+		LWLockAcquire(&stats_shmem->lock, LW_EXCLUSIVE);
+	else if (!LWLockConditionalAcquire(&stats_shmem->lock, LW_EXCLUSIVE))
+		return true;
Wonder if it's worth making the lock specific to the backend type?
+	dest_io_path_ops =
+		&stats_shmem->stats[backend_type_get_idx(beentry->st_backendType)];
+
This could be done before acquiring the lock, right?
+void
+pgstat_io_ops_snapshot_cb(void)
+{
+	PgStatShared_BackendIOPathOps *stats_shmem = &pgStatLocal.shmem->io_ops;
+	PgStat_IOPathOps *snapshot_ops = pgStatLocal.snapshot.io_path_ops;
+	PgStat_IOPathOps *reset_ops;
+
+	PgStat_IOPathOps *reset_offset = stats_shmem->reset_offset;
+	PgStat_IOPathOps reset[BACKEND_NUM_TYPES];
+
+	pgstat_copy_changecounted_stats(snapshot_ops,
+									&stats_shmem->stats, sizeof(stats_shmem->stats),
+									&stats_shmem->changecount);
This doesn't make sense - with multiple writers you can't use the
changecount approach (and you don't in the flush part above).
+	LWLockAcquire(&stats_shmem->lock, LW_SHARED);
+	memcpy(&reset, reset_offset, sizeof(stats_shmem->stats));
+	LWLockRelease(&stats_shmem->lock);
Which then also means that you don't need the reset offset stuff. It's
only there because with the changecount approach we can't take a lock to
reset the stats (since there is no lock). With a lock you can just reset
the shared state.
+void
+pgstat_count_io_op(IOOp io_op, IOPath io_path)
+{
+	PgStat_IOOpCounters *pending_counters = &pending_IOOpStats.data[io_path];
+	PgStat_IOOpCounters *cumulative_counters =
+		&cumulative_IOOpStats.data[io_path];
the pending_/cumulative_ prefix before an uppercase-first camelcase name
seems ugly...
+	switch (io_op)
+	{
+		case IOOP_ALLOC:
+			pending_counters->allocs++;
+			cumulative_counters->allocs++;
+			break;
+		case IOOP_EXTEND:
+			pending_counters->extends++;
+			cumulative_counters->extends++;
+			break;
+		case IOOP_FSYNC:
+			pending_counters->fsyncs++;
+			cumulative_counters->fsyncs++;
+			break;
+		case IOOP_WRITE:
+			pending_counters->writes++;
+			cumulative_counters->writes++;
+			break;
+	}
+
+	have_ioopstats = true;
+}
Doing two math ops / memory accesses every time seems off. Seems better
to maintain cumulative_counters whenever reporting stats, just before
zeroing pending_counters?
+/*
+ * Report IO operation statistics
+ *
+ * This works in much the same way as pgstat_flush_io_ops() but is meant for
+ * BackendTypes like bgwriter for whom pgstat_report_stat() will not be called
+ * frequently enough to keep shared memory stats fresh.
+ * Backends not typically calling pgstat_report_stat() can invoke
+ * pgstat_report_io_ops() explicitly.
+ */
+void
+pgstat_report_io_ops(void)
+{
This shouldn't be needed - the flush function above can be used.
+	PgStat_IOPathOps *dest_io_path_ops;
+	PgStatShared_BackendIOPathOps *stats_shmem;
+
+	PgBackendStatus *beentry = MyBEEntry;
+
+	Assert(!pgStatLocal.shmem->is_shutdown);
+	pgstat_assert_is_up();
+
+	if (!have_ioopstats)
+		return;
+
+	if (!beentry || beentry->st_backendType == B_INVALID)
+		return;
Is there a case where this may be called where we have no beentry?
Why not just use MyBackendType?
+	stats_shmem = &pgStatLocal.shmem->io_ops;
+
+	dest_io_path_ops =
+		&stats_shmem->stats[backend_type_get_idx(beentry->st_backendType)];
+
+	pgstat_begin_changecount_write(&stats_shmem->changecount);
As mentioned before, the changecount stuff doesn't apply here. You need a
lock.
+PgStat_IOPathOps *
+pgstat_fetch_backend_io_path_ops(void)
+{
+	pgstat_snapshot_fixed(PGSTAT_KIND_IOOPS);
+	return pgStatLocal.snapshot.io_path_ops;
+}
+
+PgStat_Counter
+pgstat_fetch_cumulative_io_ops(IOPath io_path, IOOp io_op)
+{
+	PgStat_IOOpCounters *counters = &cumulative_IOOpStats.data[io_path];
+
+	switch (io_op)
+	{
+		case IOOP_ALLOC:
+			return counters->allocs;
+		case IOOP_EXTEND:
+			return counters->extends;
+		case IOOP_FSYNC:
+			return counters->fsyncs;
+		case IOOP_WRITE:
+			return counters->writes;
+		default:
+			elog(ERROR, "IO Operation %s for IO Path %s is undefined.",
+				 pgstat_io_op_desc(io_op), pgstat_io_path_desc(io_path));
+	}
+}
There's currently no user for this, right? Maybe let's just defer the
cumulative stuff until we need it?
+const char *
+pgstat_io_path_desc(IOPath io_path)
+{
+	const char *io_path_desc = "Unknown IO Path";
+
This should be unreachable, right?
From f2b5b75f5063702cbc3c64efdc1e7ef3cf1acdb4 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 4 Jul 2022 15:44:17 -0400
Subject: [PATCH v22 3/3] Add system view tracking IO ops per backend type
Add pg_stat_buffers, a system view which tracks the number of IO
operations (allocs, writes, fsyncs, and extends) done through each IO
path (e.g. shared buffers, local buffers, unbuffered IO) by each type of
backend.
I think I like pg_stat_io a bit better? Nearly everything in here seems
to fit better in that.
I guess we could split out buffers allocated, but that's actually
interesting in the context of the kind of IO too.
 <row>
  <entry><structname>pg_stat_wal</structname><indexterm><primary>pg_stat_wal</primary></indexterm></entry>
  <entry>One row only, showing statistics about WAL activity. See
@@ -3595,7 +3604,102 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
      </para>
      <para>
-      Time at which these statistics were last reset
+      Time at which these statistics were last reset.
      </para></entry>
Grammar critique time :)
+CREATE VIEW pg_stat_buffers AS
+SELECT
+       b.backend_type,
+       b.io_path,
+       b.alloc,
+       b.extend,
+       b.fsync,
+       b.write,
+       b.stats_reset
+FROM pg_stat_get_buffers() b;
Do we want to expose all data to all users? I guess pg_stat_bgwriter
does? But this does split things out a lot more...
+	for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+	{
+		PgStat_IOOpCounters *counters = io_path_ops->data;
+		Datum backend_type_desc =
+			CStringGetTextDatum(GetBackendTypeDesc(idx_get_backend_type(i)));
+		/* const char *log_name = GetBackendTypeDesc(idx_get_backend_type(i)); */
+
+		for (int j = 0; j < IOPATH_NUM_TYPES; j++)
+		{
+			Datum values[BUFFERS_NUM_COLUMNS];
+			bool nulls[BUFFERS_NUM_COLUMNS];
+			memset(values, 0, sizeof(values));
+			memset(nulls, 0, sizeof(nulls));
+
+			values[BUFFERS_COLUMN_BACKEND_TYPE] = backend_type_desc;
+			values[BUFFERS_COLUMN_IO_PATH] = CStringGetTextDatum(pgstat_io_path_desc(j));
Random musing: I wonder if we should start to use SQL level enums for
this kind of thing.
DROP TABLE trunc_stats_test, trunc_stats_test1, trunc_stats_test2, trunc_stats_test3, trunc_stats_test4;
DROP TABLE prevstats;
+SELECT pg_stat_reset_shared('buffers');
+ pg_stat_reset_shared
+----------------------
+
+(1 row)
+
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT write = 0 FROM pg_stat_buffers WHERE io_path = 'Shared' and backend_type = 'checkpointer';
+ ?column?
+----------
+ t
+(1 row)
Don't think you can rely on that. The lookup of the view, functions
might have needed to load catalog data, which might have needed to evict
buffers. I think you can do something more reliable by checking that
there's more written buffers after a checkpoint than before, or such.
Would be nice to have something testing that the ringbuffer stats stuff
does something sensible - that feels not entirely trivial.
Greetings,
Andres Freund
Hi,
In the attached patch set, I've added in missing IO operations for
certain IO Paths as well as enumerating in the commit message which IO
Paths and IO Operations are not currently counted and or not possible.
There is a TODO in HandleWalWriterInterrupts() about removing
pgstat_report_wal(), since it is immediately before a proc_exit().
I was wondering whether LocalBufferAlloc() should increment the counter,
or whether that should wait until GetLocalBufferStorage().
I also realized that I am not differentiating between IOPATH_SHARED and
IOPATH_STRATEGY for IOOP_FSYNC. But, given that we don't know what type
of buffer we are fsync'ing by the time we call register_dirty_segment(),
I'm not sure how we would fix this.
On Wed, Jul 6, 2022 at 3:20 PM Andres Freund <andres@anarazel.de> wrote:
On 2022-07-05 13:24:55 -0400, Melanie Plageman wrote:
From 2d089e26236c55d1be5b93833baa0cf7667ba38d Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 28 Jun 2022 11:33:04 -0400
Subject: [PATCH v22 1/3] Add BackendType for standalone backends

All backends should have a BackendType to enable statistics reporting
per BackendType.

Add a new BackendType for standalone backends, B_STANDALONE_BACKEND (and
alphabetize the BackendTypes). Both the bootstrap backend and single
user mode backends will have BackendType B_STANDALONE_BACKEND.

Author: Melanie Plageman <melanieplageman@gmail.com>
Discussion: /messages/by-id/CAAKRu_aaq33UnG4TXq3S-OSXGWj1QGf0sU+ECH4tNwGFNERkZA@mail.gmail.com
---
src/backend/utils/init/miscinit.c | 17 +++++++++++------
src/include/miscadmin.h | 5 +++--
2 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index eb43b2c5e5..07e6db1a1c 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -176,6 +176,8 @@ InitStandaloneProcess(const char *argv0)
 {
 	Assert(!IsPostmasterEnvironment);

+	MyBackendType = B_STANDALONE_BACKEND;
Hm. This is used for singleuser mode as well as bootstrap. Should we
split those? It's not like bootstrap mode really matters for stats, so
I'm inclined not to.
I have no opinion currently.
It depends on how commonly you think developers might want separate
bootstrap and single user mode IO stats.
@@ -375,6 +376,8 @@ BootstrapModeMain(int argc, char *argv[], bool check_only)
 	 * out the initial relation mapping files.
 	 */
 	RelationMapFinishBootstrap();
+
+	// TODO: should this be done for bootstrap?
+	pgstat_report_io_ops();

Hm. Not particularly useful, but also not harmful. But we don't need an
explicit call, because it'll be done at process exit too. At least I
think, it could be that it's different for bootstrap.
I've removed this and other occurrences which were before proc_exit()
(and thus redundant). (Though I did not explicitly check if it was
different for bootstrap.)
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 2e146aac93..e6dbb1c4bb 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -1712,6 +1712,9 @@ AutoVacWorkerMain(int argc, char *argv[])
 	recentXid = ReadNextTransactionId();
 	recentMulti = ReadNextMultiXactId();
 	do_autovacuum();
+
+	// TODO: should this be done more often somewhere in do_autovacuum()?
+	pgstat_report_io_ops();
 }

Don't think you need all these calls before process exit - it'll happen
via pgstat_shutdown_hook().

IMO it'd be a good idea to add pgstat_report_io_ops() to
pgstat_report_vacuum()/analyze(), so that the stats for a longrunning
autovac worker get updated more regularly.
noted and fixed.
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 91e6f6ea18..87e4b9e9bd 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -242,6 +242,7 @@ BackgroundWriterMain(void)

 		/* Report pending statistics to the cumulative stats system */
 		pgstat_report_bgwriter();
+		pgstat_report_io_ops();

 		if (FirstCallSinceLastCheckpoint())
 		{

How about moving the pgstat_report_io_ops() into
pgstat_report_bgwriter(), pgstat_report_autovacuum() etc? Seems
unnecessary to have multiple pgstat_* calls in these places.
noted and fixed.
+/*
+ * Flush out locally pending IO Operation statistics entries
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true. Writer processes are mutually excluded
+ * using LWLock, but readers are expected to use change-count protocol to avoid
+ * interference with writers.
+ *
+ * If nowait is true, this function returns true if the lock could not be
+ * acquired. Otherwise return false.
+ */
+bool
+pgstat_flush_io_ops(bool nowait)
+{
+	PgStat_IOPathOps *dest_io_path_ops;
+	PgStatShared_BackendIOPathOps *stats_shmem;
+
+	PgBackendStatus *beentry = MyBEEntry;
+
+	if (!have_ioopstats)
+		return false;
+
+	if (!beentry || beentry->st_backendType == B_INVALID)
+		return false;
+
+	stats_shmem = &pgStatLocal.shmem->io_ops;
+
+	if (!nowait)
+		LWLockAcquire(&stats_shmem->lock, LW_EXCLUSIVE);
+	else if (!LWLockConditionalAcquire(&stats_shmem->lock, LW_EXCLUSIVE))
+		return true;

Wonder if it's worth making the lock specific to the backend type?
I've added another Lock into PgStat_IOPathOps so that each BackendType
can be locked separately. But, I've also kept the lock in
PgStatShared_BackendIOPathOps so that reset_all and snapshot could be
done easily.
+	dest_io_path_ops =
+		&stats_shmem->stats[backend_type_get_idx(beentry->st_backendType)];
+

This could be done before acquiring the lock, right?
+void
+pgstat_io_ops_snapshot_cb(void)
+{
+	PgStatShared_BackendIOPathOps *stats_shmem = &pgStatLocal.shmem->io_ops;
+	PgStat_IOPathOps *snapshot_ops = pgStatLocal.snapshot.io_path_ops;
+	PgStat_IOPathOps *reset_ops;
+
+	PgStat_IOPathOps *reset_offset = stats_shmem->reset_offset;
+	PgStat_IOPathOps reset[BACKEND_NUM_TYPES];
+
+	pgstat_copy_changecounted_stats(snapshot_ops,
+									&stats_shmem->stats, sizeof(stats_shmem->stats),
+									&stats_shmem->changecount);

This doesn't make sense - with multiple writers you can't use the
changecount approach (and you don't in the flush part above).

+	LWLockAcquire(&stats_shmem->lock, LW_SHARED);
+	memcpy(&reset, reset_offset, sizeof(stats_shmem->stats));
+	LWLockRelease(&stats_shmem->lock);

Which then also means that you don't need the reset offset stuff. It's
only there because with the changecount approach we can't take a lock to
reset the stats (since there is no lock). With a lock you can just reset
the shared state.
Yes, I believe I have cleaned up all of this embarrassing mess. I use the
lock in PgStatShared_BackendIOPathOps for reset all and snapshot and the
locks in PgStat_IOPathOps for flush.
+void
+pgstat_count_io_op(IOOp io_op, IOPath io_path)
+{
+	PgStat_IOOpCounters *pending_counters = &pending_IOOpStats.data[io_path];
+	PgStat_IOOpCounters *cumulative_counters =
+		&cumulative_IOOpStats.data[io_path];

the pending_/cumulative_ prefix before an uppercase-first camelcase name
seems ugly...

+	switch (io_op)
+	{
+		case IOOP_ALLOC:
+			pending_counters->allocs++;
+			cumulative_counters->allocs++;
+			break;
+		case IOOP_EXTEND:
+			pending_counters->extends++;
+			cumulative_counters->extends++;
+			break;
+		case IOOP_FSYNC:
+			pending_counters->fsyncs++;
+			cumulative_counters->fsyncs++;
+			break;
+		case IOOP_WRITE:
+			pending_counters->writes++;
+			cumulative_counters->writes++;
+			break;
+	}
+
+	have_ioopstats = true;
+}

Doing two math ops / memory accesses every time seems off. Seems better
to maintain cumulative_counters whenever reporting stats, just before
zeroing pending_counters?
I've gone ahead and cut the cumulative counters concept.
+/*
+ * Report IO operation statistics
+ *
+ * This works in much the same way as pgstat_flush_io_ops() but is meant for
+ * BackendTypes like bgwriter for whom pgstat_report_stat() will not be called
+ * frequently enough to keep shared memory stats fresh.
+ * Backends not typically calling pgstat_report_stat() can invoke
+ * pgstat_report_io_ops() explicitly.
+ */
+void
+pgstat_report_io_ops(void)
+{

This shouldn't be needed - the flush function above can be used.
Fixed.
+	PgStat_IOPathOps *dest_io_path_ops;
+	PgStatShared_BackendIOPathOps *stats_shmem;
+
+	PgBackendStatus *beentry = MyBEEntry;
+
+	Assert(!pgStatLocal.shmem->is_shutdown);
+	pgstat_assert_is_up();
+
+	if (!have_ioopstats)
+		return;
+
+	if (!beentry || beentry->st_backendType == B_INVALID)
+		return;

Is there a case where this may be called where we have no beentry?
Why not just use MyBackendType?
Fixed.
+	stats_shmem = &pgStatLocal.shmem->io_ops;
+
+	dest_io_path_ops =
+		&stats_shmem->stats[backend_type_get_idx(beentry->st_backendType)];
+
+	pgstat_begin_changecount_write(&stats_shmem->changecount);

As mentioned before, the changecount stuff doesn't apply here. You need a
lock.
Fixed.
+PgStat_IOPathOps *
+pgstat_fetch_backend_io_path_ops(void)
+{
+	pgstat_snapshot_fixed(PGSTAT_KIND_IOOPS);
+	return pgStatLocal.snapshot.io_path_ops;
+}
+
+PgStat_Counter
+pgstat_fetch_cumulative_io_ops(IOPath io_path, IOOp io_op)
+{
+	PgStat_IOOpCounters *counters = &cumulative_IOOpStats.data[io_path];
+
+	switch (io_op)
+	{
+		case IOOP_ALLOC:
+			return counters->allocs;
+		case IOOP_EXTEND:
+			return counters->extends;
+		case IOOP_FSYNC:
+			return counters->fsyncs;
+		case IOOP_WRITE:
+			return counters->writes;
+		default:
+			elog(ERROR, "IO Operation %s for IO Path %s is undefined.",
+				 pgstat_io_op_desc(io_op), pgstat_io_path_desc(io_path));
+	}
+}

There's currently no user for this, right? Maybe let's just defer the
cumulative stuff until we need it?
Removed.
+const char *
+pgstat_io_path_desc(IOPath io_path)
+{
+	const char *io_path_desc = "Unknown IO Path";
+

This should be unreachable, right?

Changed it to an error.
From f2b5b75f5063702cbc3c64efdc1e7ef3cf1acdb4 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 4 Jul 2022 15:44:17 -0400
Subject: [PATCH v22 3/3] Add system view tracking IO ops per backend type

Add pg_stat_buffers, a system view which tracks the number of IO
operations (allocs, writes, fsyncs, and extends) done through each IO
path (e.g. shared buffers, local buffers, unbuffered IO) by each type of
backend.

I think I like pg_stat_io a bit better? Nearly everything in here seems
to fit better in that.

I guess we could split out buffers allocated, but that's actually
interesting in the context of the kind of IO too.

Changed it to pg_stat_io.
+CREATE VIEW pg_stat_buffers AS
+SELECT
+       b.backend_type,
+       b.io_path,
+       b.alloc,
+       b.extend,
+       b.fsync,
+       b.write,
+       b.stats_reset
+FROM pg_stat_get_buffers() b;

Do we want to expose all data to all users? I guess pg_stat_bgwriter
does? But this does split things out a lot more...

I didn't see another similar example limiting access.
DROP TABLE trunc_stats_test, trunc_stats_test1, trunc_stats_test2,
trunc_stats_test3, trunc_stats_test4;
DROP TABLE prevstats;
+SELECT pg_stat_reset_shared('buffers');
+ pg_stat_reset_shared
+----------------------
+
+(1 row)
+
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT write = 0 FROM pg_stat_buffers WHERE io_path = 'Shared' and backend_type = 'checkpointer';
+ ?column?
+----------
+ t
+(1 row)

Don't think you can rely on that. The lookup of the view, functions
might have needed to load catalog data, which might have needed to evict
buffers. I think you can do something more reliable by checking that
there's more written buffers after a checkpoint than before, or such.
Yes, per an off list suggestion by you, I have changed the tests to use a
sum of writes. I've also added a test for IOPATH_LOCAL and fixed some of
the missing calls to count IO Operations for IOPATH_LOCAL and
IOPATH_STRATEGY.
I struggled to come up with a way to test writes for a particular
type of backend are counted correctly since a dirty buffer could be
written out by another type of backend before the target BackendType has
a chance to write it out.
I also struggled to come up with a way to test IO operations for
background workers. I'm not sure of a way to deterministically have a
background worker do a particular kind of IO in a test scenario.
I'm not sure how to cause a strategy "extend" for testing.
Would be nice to have something testing that the ringbuffer stats stuff
does something sensible - that feels not entirely trivial.
I've added a test to test that reused strategy buffers are counted as
allocs. I would like to add a test which checks that if a buffer in the
ring is pinned and thus not reused, that it is not counted as a strategy
alloc, but I found it challenging without a way to pause vacuuming, pin
a buffer, then resume vacuuming.
Thanks,
Melanie
Attachments:
v23-0001-Add-BackendType-for-standalone-backends.patch (text/x-patch; charset=US-ASCII)
From 9d8fdbcf8dde109e84b680c8160c0174574a2c05 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 28 Jun 2022 11:33:04 -0400
Subject: [PATCH v23 1/3] Add BackendType for standalone backends
All backends should have a BackendType to enable statistics reporting
per BackendType.
Add a new BackendType for standalone backends, B_STANDALONE_BACKEND (and
alphabetize the BackendTypes). Both the bootstrap backend and single
user mode backends will have BackendType B_STANDALONE_BACKEND.
Author: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://www.postgresql.org/message-id/CAAKRu_aaq33UnG4TXq3S-OSXGWj1QGf0sU%2BECH4tNwGFNERkZA%40mail.gmail.com
---
src/backend/utils/init/miscinit.c | 17 +++++++++++------
src/include/miscadmin.h | 5 +++--
2 files changed, 14 insertions(+), 8 deletions(-)
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index eb43b2c5e5..07e6db1a1c 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -176,6 +176,8 @@ InitStandaloneProcess(const char *argv0)
{
Assert(!IsPostmasterEnvironment);
+ MyBackendType = B_STANDALONE_BACKEND;
+
/*
* Start our win32 signal implementation
*/
@@ -255,6 +257,9 @@ GetBackendTypeDesc(BackendType backendType)
case B_INVALID:
backendDesc = "not initialized";
break;
+ case B_ARCHIVER:
+ backendDesc = "archiver";
+ break;
case B_AUTOVAC_LAUNCHER:
backendDesc = "autovacuum launcher";
break;
@@ -273,6 +278,12 @@ GetBackendTypeDesc(BackendType backendType)
case B_CHECKPOINTER:
backendDesc = "checkpointer";
break;
+ case B_LOGGER:
+ backendDesc = "logger";
+ break;
+ case B_STANDALONE_BACKEND:
+ backendDesc = "standalone backend";
+ break;
case B_STARTUP:
backendDesc = "startup";
break;
@@ -285,12 +296,6 @@ GetBackendTypeDesc(BackendType backendType)
case B_WAL_WRITER:
backendDesc = "walwriter";
break;
- case B_ARCHIVER:
- backendDesc = "archiver";
- break;
- case B_LOGGER:
- backendDesc = "logger";
- break;
}
return backendDesc;
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index ea9a56d395..5276bf25a1 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -316,18 +316,19 @@ extern void SwitchBackToLocalLatch(void);
typedef enum BackendType
{
B_INVALID = 0,
+ B_ARCHIVER,
B_AUTOVAC_LAUNCHER,
B_AUTOVAC_WORKER,
B_BACKEND,
B_BG_WORKER,
B_BG_WRITER,
B_CHECKPOINTER,
+ B_LOGGER,
+ B_STANDALONE_BACKEND,
B_STARTUP,
B_WAL_RECEIVER,
B_WAL_SENDER,
B_WAL_WRITER,
- B_ARCHIVER,
- B_LOGGER,
} BackendType;
extern PGDLLIMPORT BackendType MyBackendType;
--
2.34.1
v23-0002-Track-IO-operation-statistics.patch (text/x-patch; charset=US-ASCII)
From b20d5fcc16492b1934d9a8cc8144508505db5d6b Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 29 Jun 2022 18:37:42 -0400
Subject: [PATCH v23 2/3] Track IO operation statistics
Introduce "IOOp", an IO operation done by a backend, and "IOPath", the
location or type of IO done by a backend. For example, the checkpointer
may write a shared buffer out. This would be counted as an IOOp write on
an IOPath IOPATH_SHARED by BackendType "checkpointer".
Each IOOp (alloc, fsync, extend, write) is counted per IOPath
(direct, local, shared, or strategy) through a call to
pgstat_count_io_op().
The primary concern of these statistics is IO operations on data blocks
during the course of normal database operations. IO done by, for
example, the archiver or syslogger is not counted in these statistics.
IOPATH_LOCAL and IOPATH_SHARED IOPaths concern operations on local
and shared buffers.
The IOPATH_STRATEGY IOPath concerns buffers alloc'd/written/read/fsync'd
as part of a BufferAccessStrategy.
The IOPATH_DIRECT IOPath concerns blocks of IO which are read, written,
or fsync'd using smgrwrite/extend/immedsync directly (as opposed to
through [Local]BufferAlloc()).
Note that this commit does not add code to increment IOPATH_DIRECT. A
future patch adding wrappers for smgrwrite(), smgrextend(), and
smgrimmedsync() would provide a good location to call
pgstat_count_io_op() for unbuffered IO and avoid regressions for future
users of these functions.
IOOP_ALLOC is counted for IOPATH_SHARED and IOPATH_LOCAL whenever a
buffer is acquired through [Local]BufferAlloc(). IOOP_ALLOC is invalid
for IOPATH_DIRECT. IOOP_ALLOC for IOPATH_STRATEGY is counted whenever a
buffer already in the strategy ring is reused.
Stats on IOOps for all IOPaths for a backend are initially accumulated
locally.
Later they are flushed to shared memory and accumulated with those from
all other backends, exited and live.
Some BackendTypes will not execute pgstat_report_stat() and thus must
explicitly call pgstat_report_io_ops() in order to flush their backend
local IO operation statistics to shared memory.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
src/backend/bootstrap/bootstrap.c | 1 +
src/backend/postmaster/checkpointer.c | 1 +
src/backend/postmaster/walwriter.c | 1 +
src/backend/storage/buffer/bufmgr.c | 53 ++++--
src/backend/storage/buffer/freelist.c | 33 +++-
src/backend/storage/buffer/localbuf.c | 5 +
src/backend/storage/sync/sync.c | 2 +
src/backend/utils/activity/Makefile | 1 +
src/backend/utils/activity/pgstat.c | 24 +++
src/backend/utils/activity/pgstat_bgwriter.c | 5 +
.../utils/activity/pgstat_checkpointer.c | 5 +
src/backend/utils/activity/pgstat_database.c | 5 +
src/backend/utils/activity/pgstat_io_ops.c | 168 ++++++++++++++++++
src/backend/utils/activity/pgstat_relation.c | 10 ++
src/backend/utils/activity/pgstat_wal.c | 5 +
src/backend/utils/adt/pgstatfuncs.c | 4 +-
src/include/miscadmin.h | 2 +
src/include/pgstat.h | 53 ++++++
src/include/storage/buf_internals.h | 4 +-
src/include/utils/backend_status.h | 34 ++++
src/include/utils/pgstat_internal.h | 17 ++
21 files changed, 417 insertions(+), 16 deletions(-)
create mode 100644 src/backend/utils/activity/pgstat_io_ops.c
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 088556ab54..963b05321e 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -33,6 +33,7 @@
#include "miscadmin.h"
#include "nodes/makefuncs.h"
#include "pg_getopt.h"
+#include "pgstat.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
#include "storage/condition_variable.h"
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 5fc076fc14..a06331e1eb 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -1116,6 +1116,7 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
if (!AmBackgroundWriterProcess())
CheckpointerShmem->num_backend_fsync++;
LWLockRelease(CheckpointerCommLock);
+ pgstat_count_io_op(IOOP_FSYNC, IOPATH_SHARED);
return false;
}
diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c
index e926f8c27c..64e58f17f6 100644
--- a/src/backend/postmaster/walwriter.c
+++ b/src/backend/postmaster/walwriter.c
@@ -301,6 +301,7 @@ HandleWalWriterInterrupts(void)
* loop to avoid overloading the cumulative stats system, there may
* exist unreported stats counters for the WAL writer.
*/
+ // TODO: This may not be needed also
pgstat_report_wal(true);
proc_exit(0);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index e257ae23e4..be5fb1e5bf 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -482,7 +482,7 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
BlockNumber blockNum,
BufferAccessStrategy strategy,
bool *foundPtr);
-static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+static void FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOPath iopath);
static void FindAndDropRelFileLocatorBuffers(RelFileLocator rlocator,
ForkNumber forkNum,
BlockNumber nForkBlock,
@@ -980,6 +980,16 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (isExtend)
{
+ IOPath io_path;
+
+ if (isLocalBuf)
+ io_path = IOPATH_LOCAL;
+ else if (strategy != NULL)
+ io_path = IOPATH_STRATEGY;
+ else
+ io_path = IOPATH_SHARED;
+
+ pgstat_count_io_op(IOOP_EXTEND, io_path);
/* new buffers are zero-filled */
MemSet((char *) bufBlock, 0, BLCKSZ);
/* don't set checksum for all-zero page */
@@ -1180,6 +1190,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
/* Loop here in case we have to try another victim buffer */
for (;;)
{
+ bool from_ring;
/*
* Ensure, while the spinlock's not yet held, that there's a free
* refcount entry.
@@ -1190,7 +1201,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* Select a victim buffer. The buffer is returned with its header
* spinlock still held!
*/
- buf = StrategyGetBuffer(strategy, &buf_state);
+ buf = StrategyGetBuffer(strategy, &buf_state, &from_ring);
Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
@@ -1227,6 +1238,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf),
LW_SHARED))
{
+ IOPath iopath;
/*
* If using a nondefault strategy, and writing the buffer
* would require a WAL flush, let the strategy decide whether
@@ -1244,7 +1256,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
UnlockBufHdr(buf, buf_state);
if (XLogNeedsFlush(lsn) &&
- StrategyRejectBuffer(strategy, buf))
+ StrategyRejectBuffer(strategy, buf, &from_ring))
{
/* Drop lock/pin and loop around for another buffer */
LWLockRelease(BufferDescriptorGetContentLock(buf));
@@ -1253,13 +1265,27 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
}
}
+ /*
+ * When a strategy is in use, if the dirty buffer was selected
+ * from the strategy ring and we did not bother checking the
+ * freelist or doing a clock sweep to look for a clean shared
+ * buffer to use, the write will be counted as a strategy
+ * write. However, if the dirty buffer was obtained from the
+ * freelist or a clock sweep, it is counted as a regular write.
+ *
+ * When a strategy is not in use, at this point the write can
+ * only be a "regular" write of a dirty buffer.
+ */
+
+ iopath = from_ring ? IOPATH_STRATEGY : IOPATH_SHARED;
+
/* OK, do the I/O */
TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(forkNum, blockNum,
smgr->smgr_rlocator.locator.spcOid,
smgr->smgr_rlocator.locator.dbOid,
smgr->smgr_rlocator.locator.relNumber);
- FlushBuffer(buf, NULL);
+ FlushBuffer(buf, NULL, iopath);
LWLockRelease(BufferDescriptorGetContentLock(buf));
ScheduleBufferTagForWriteback(&BackendWritebackContext,
@@ -2563,7 +2589,7 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOPATH_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
@@ -2810,9 +2836,12 @@ BufferGetTag(Buffer buffer, RelFileLocator *rlocator, ForkNumber *forknum,
*
* If the caller has an smgr reference for the buffer's relation, pass it
* as the second parameter. If not, pass NULL.
+ *
+ * IOPath will always be IOPATH_SHARED except when a buffer access strategy is
+ * used and the buffer being flushed is a buffer from the strategy ring.
*/
static void
-FlushBuffer(BufferDesc *buf, SMgrRelation reln)
+FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOPath iopath)
{
XLogRecPtr recptr;
ErrorContextCallback errcallback;
@@ -2892,6 +2921,8 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
*/
bufToWrite = PageSetChecksumCopy((Page) bufBlock, buf->tag.blockNum);
+ pgstat_count_io_op(IOOP_WRITE, iopath);
+
if (track_io_timing)
INSTR_TIME_SET_CURRENT(io_start);
@@ -3540,6 +3571,8 @@ FlushRelationBuffers(Relation rel)
localpage,
false);
+ pgstat_count_io_op(IOOP_WRITE, IOPATH_LOCAL);
+
buf_state &= ~(BM_DIRTY | BM_JUST_DIRTIED);
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
@@ -3575,7 +3608,7 @@ FlushRelationBuffers(Relation rel)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, RelationGetSmgr(rel));
+ FlushBuffer(bufHdr, RelationGetSmgr(rel), IOPATH_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
@@ -3670,7 +3703,7 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, srelent->srel);
+ FlushBuffer(bufHdr, srelent->srel, IOPATH_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
@@ -3878,7 +3911,7 @@ FlushDatabaseBuffers(Oid dbid)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOPATH_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
@@ -3905,7 +3938,7 @@ FlushOneBuffer(Buffer buffer)
Assert(LWLockHeldByMe(BufferDescriptorGetContentLock(bufHdr)));
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOPATH_SHARED);
}
/*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 990e081aae..e042612c4a 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -15,6 +15,7 @@
*/
#include "postgres.h"
+#include "pgstat.h"
#include "port/atomics.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
@@ -198,7 +199,7 @@ have_free_buffer(void)
* return the buffer with the buffer header spinlock still held.
*/
BufferDesc *
-StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
+StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_ring)
{
BufferDesc *buf;
int bgwprocno;
@@ -212,8 +213,19 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
if (strategy != NULL)
{
buf = GetBufferFromRing(strategy, buf_state);
- if (buf != NULL)
+ *from_ring = buf != NULL;
+ if (*from_ring)
+ {
+ /*
+ * When a strategy is in use, reused buffers from the strategy ring
+ * will be counted as allocations for the purposes of IO Operation
+ * statistics tracking. However, even when a strategy is in use, if
+ * a new buffer must be allocated from shared buffers and added to
+ * the ring, this is counted as a IOPATH_SHARED allocation.
+ */
+ pgstat_count_io_op(IOOP_ALLOC, IOPATH_STRATEGY);
return buf;
+ }
}
/*
@@ -247,6 +259,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
* the rate of buffer consumption. Note that buffers recycled by a
* strategy object are intentionally not counted here.
*/
+ pgstat_count_io_op(IOOP_ALLOC, IOPATH_SHARED);
pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);
/*
@@ -682,8 +695,15 @@ AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
* if this buffer should be written and re-used.
*/
bool
-StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
+StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool *from_ring)
{
+
+ /*
+ * Start by assuming that we will use the dirty buffer selected by
+ * StrategyGetBuffer().
+ */
+ *from_ring = true;
+
/* We only do this in bulkread mode */
if (strategy->btype != BAS_BULKREAD)
return false;
@@ -699,5 +719,12 @@ StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
*/
strategy->buffers[strategy->current] = InvalidBuffer;
+ /*
+ * Since we will not be writing out a dirty buffer from the ring, set
+ * from_ring to false so that the caller does not count this write as a
+ * "strategy write" and can do proper bookkeeping.
+ */
+ *from_ring = false;
+
return true;
}
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 41a08076b3..e99e1f53ef 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -15,6 +15,7 @@
*/
#include "postgres.h"
+#include "pgstat.h"
#include "access/parallel.h"
#include "catalog/catalog.h"
#include "executor/instrument.h"
@@ -123,6 +124,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
if (LocalBufHash == NULL)
InitLocalBuffers();
+ pgstat_count_io_op(IOOP_ALLOC, IOPATH_LOCAL);
+
/* See if the desired buffer already exists */
hresult = (LocalBufferLookupEnt *)
hash_search(LocalBufHash, (void *) &newTag, HASH_FIND, NULL);
@@ -226,6 +229,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
localpage,
false);
+ pgstat_count_io_op(IOOP_WRITE, IOPATH_LOCAL);
+
/* Mark not-dirty now in case we error out below */
buf_state &= ~BM_DIRTY;
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index e1fb631003..20e259edef 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -432,6 +432,8 @@ ProcessSyncRequests(void)
total_elapsed += elapsed;
processed++;
+ pgstat_count_io_op(IOOP_FSYNC, IOPATH_SHARED);
+
if (log_checkpoints)
elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f ms",
processed,
diff --git a/src/backend/utils/activity/Makefile b/src/backend/utils/activity/Makefile
index a2e8507fd6..0098785089 100644
--- a/src/backend/utils/activity/Makefile
+++ b/src/backend/utils/activity/Makefile
@@ -22,6 +22,7 @@ OBJS = \
pgstat_checkpointer.o \
pgstat_database.o \
pgstat_function.o \
+ pgstat_io_ops.o \
pgstat_relation.o \
pgstat_replslot.o \
pgstat_shmem.o \
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 88e5dd1b2b..52924e64dd 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -359,6 +359,15 @@ static const PgStat_KindInfo pgstat_kind_infos[PGSTAT_NUM_KINDS] = {
.snapshot_cb = pgstat_checkpointer_snapshot_cb,
},
+ [PGSTAT_KIND_IOOPS] = {
+ .name = "io_ops",
+
+ .fixed_amount = true,
+
+ .reset_all_cb = pgstat_io_ops_reset_all_cb,
+ .snapshot_cb = pgstat_io_ops_snapshot_cb,
+ },
+
[PGSTAT_KIND_SLRU] = {
.name = "slru",
@@ -628,6 +637,9 @@ pgstat_report_stat(bool force)
/* flush database / relation / function / ... stats */
partial_flush |= pgstat_flush_pending_entries(nowait);
+ /* flush IO Operations stats */
+ partial_flush |= pgstat_flush_io_ops(nowait);
+
/* flush wal stats */
partial_flush |= pgstat_flush_wal(nowait);
@@ -1312,6 +1324,12 @@ pgstat_write_statsfile(void)
pgstat_build_snapshot_fixed(PGSTAT_KIND_CHECKPOINTER);
write_chunk_s(fpout, &pgStatLocal.snapshot.checkpointer);
+ /*
+ * Write IO Operations stats struct
+ */
+ pgstat_build_snapshot_fixed(PGSTAT_KIND_IOOPS);
+ write_chunk_s(fpout, &pgStatLocal.snapshot.io_path_ops);
+
/*
* Write SLRU stats struct
*/
@@ -1486,6 +1504,12 @@ pgstat_read_statsfile(void)
if (!read_chunk_s(fpin, &shmem->checkpointer.stats))
goto error;
+ /*
+ * Read IO Operations stats struct
+ */
+ if (!read_chunk_s(fpin, &shmem->io_ops.stats))
+ goto error;
+
/*
* Read SLRU stats struct
*/
diff --git a/src/backend/utils/activity/pgstat_bgwriter.c b/src/backend/utils/activity/pgstat_bgwriter.c
index fbb1edc527..d83df169db 100644
--- a/src/backend/utils/activity/pgstat_bgwriter.c
+++ b/src/backend/utils/activity/pgstat_bgwriter.c
@@ -56,6 +56,11 @@ pgstat_report_bgwriter(void)
* Clear out the statistics buffer, so it can be re-used.
*/
MemSet(&PendingBgWriterStats, 0, sizeof(PendingBgWriterStats));
+
+ /*
+ * Also report IO Operations statistics
+ */
+ pgstat_flush_io_ops(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_checkpointer.c b/src/backend/utils/activity/pgstat_checkpointer.c
index af8d513e7b..668abecf90 100644
--- a/src/backend/utils/activity/pgstat_checkpointer.c
+++ b/src/backend/utils/activity/pgstat_checkpointer.c
@@ -62,6 +62,11 @@ pgstat_report_checkpointer(void)
* Clear out the statistics buffer, so it can be re-used.
*/
MemSet(&PendingCheckpointerStats, 0, sizeof(PendingCheckpointerStats));
+
+ /*
+ * Also report IO Operation statistics
+ */
+ pgstat_flush_io_ops(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_database.c b/src/backend/utils/activity/pgstat_database.c
index d9275611f0..5fac75c8c6 100644
--- a/src/backend/utils/activity/pgstat_database.c
+++ b/src/backend/utils/activity/pgstat_database.c
@@ -72,6 +72,11 @@ pgstat_report_autovac(Oid dboid)
dbentry->stats.last_autovac_time = GetCurrentTimestamp();
pgstat_unlock_entry(entry_ref);
+
+ /*
+ * Also report IO Operation statistics
+ */
+ pgstat_flush_io_ops(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_io_ops.c b/src/backend/utils/activity/pgstat_io_ops.c
new file mode 100644
index 0000000000..abe288efc4
--- /dev/null
+++ b/src/backend/utils/activity/pgstat_io_ops.c
@@ -0,0 +1,168 @@
+/* -------------------------------------------------------------------------
+ *
+ * pgstat_io_ops.c
+ * Implementation of IO operation statistics.
+ *
+ * This file contains the implementation of IO operation statistics. It is kept
+ * separate from pgstat.c to enforce the line between the statistics access /
+ * storage implementation and the details about individual types of
+ * statistics.
+ *
+ * Copyright (c) 2001-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/utils/activity/pgstat_io_ops.c
+ * -------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "utils/pgstat_internal.h"
+
+static PgStat_IOPathOps pending_IOOpStats;
+bool have_ioopstats = false;
+
+
+/*
+ * Flush out locally pending IO Operation statistics entries
+ *
+ * If nowait is true, this function returns true if the lock could not be
+ * acquired. Otherwise this function returns false.
+ */
+bool
+pgstat_flush_io_ops(bool nowait)
+{
+ PgStat_IOPathOps *stats_shmem;
+
+ if (!have_ioopstats)
+ return false;
+
+ stats_shmem =
+ &pgStatLocal.shmem->io_ops.stats[backend_type_get_idx(MyBackendType)];
+
+ if (!nowait)
+ LWLockAcquire(&stats_shmem->lock, LW_EXCLUSIVE);
+ else if (!LWLockConditionalAcquire(&stats_shmem->lock, LW_EXCLUSIVE))
+ return true;
+
+
+ for (int i = 0; i < IOPATH_NUM_TYPES; i++)
+ {
+ PgStat_IOOpCounters *sharedent = &stats_shmem->data[i];
+ PgStat_IOOpCounters *pendingent = &pending_IOOpStats.data[i];
+
+#define IO_OP_ACC(fld) sharedent->fld += pendingent->fld
+ IO_OP_ACC(allocs);
+ IO_OP_ACC(extends);
+ IO_OP_ACC(fsyncs);
+ IO_OP_ACC(writes);
+#undef IO_OP_ACC
+ }
+
+ LWLockRelease(&stats_shmem->lock);
+
+ MemSet(&pending_IOOpStats, 0, sizeof(pending_IOOpStats));
+
+ have_ioopstats = false;
+
+ return false;
+}
+
+void
+pgstat_io_ops_snapshot_cb(void)
+{
+ PgStatShared_BackendIOPathOps *stats_shmem = &pgStatLocal.shmem->io_ops;
+
+ LWLockAcquire(&stats_shmem->lock, LW_SHARED);
+
+ memcpy(pgStatLocal.snapshot.io_path_ops, &stats_shmem->stats,
+ sizeof(stats_shmem->stats));
+
+ LWLockRelease(&stats_shmem->lock);
+}
+
+void
+pgstat_io_ops_reset_all_cb(TimestampTz ts)
+{
+ PgStatShared_BackendIOPathOps *stats_shmem = &pgStatLocal.shmem->io_ops;
+
+ LWLockAcquire(&stats_shmem->lock, LW_EXCLUSIVE);
+
+ memset(&stats_shmem->stats, 0, sizeof(stats_shmem->stats));
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ stats_shmem->stats[i].stat_reset_timestamp = ts;
+
+ LWLockRelease(&stats_shmem->lock);
+}
+
+void
+pgstat_count_io_op(IOOp io_op, IOPath io_path)
+{
+ PgStat_IOOpCounters *pending_counters = &pending_IOOpStats.data[io_path];
+
+ switch (io_op)
+ {
+ case IOOP_ALLOC:
+ pending_counters->allocs++;
+ break;
+ case IOOP_EXTEND:
+ pending_counters->extends++;
+ break;
+ case IOOP_FSYNC:
+ pending_counters->fsyncs++;
+ break;
+ case IOOP_WRITE:
+ pending_counters->writes++;
+ break;
+ }
+
+ have_ioopstats = true;
+}
+
+PgStat_IOPathOps *
+pgstat_fetch_backend_io_path_ops(void)
+{
+ pgstat_snapshot_fixed(PGSTAT_KIND_IOOPS);
+
+ return pgStatLocal.snapshot.io_path_ops;
+}
+
+const char *
+pgstat_io_path_desc(IOPath io_path)
+{
+ switch (io_path)
+ {
+ case IOPATH_DIRECT:
+ return "Direct";
+ case IOPATH_LOCAL:
+ return "Local";
+ case IOPATH_SHARED:
+ return "Shared";
+ case IOPATH_STRATEGY:
+ return "Strategy";
+ }
+
+ elog(ERROR, "Attempt to describe an unknown IOPath");
+}
+
+const char *
+pgstat_io_op_desc(IOOp io_op)
+{
+ switch (io_op)
+ {
+ case IOOP_ALLOC:
+ return "Alloc";
+ case IOOP_EXTEND:
+ return "Extend";
+ case IOOP_FSYNC:
+ return "Fsync";
+ case IOOP_WRITE:
+ return "Write";
+ }
+
+ elog(ERROR, "Attempt to describe an unknown IOOperation");
+}
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index a846d9ffb6..01ea45adf4 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -257,6 +257,11 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
}
pgstat_unlock_entry(entry_ref);
+
+ /*
+ * Also report IO Operations statistics
+ */
+ pgstat_flush_io_ops(false);
}
/*
@@ -340,6 +345,11 @@ pgstat_report_analyze(Relation rel,
}
pgstat_unlock_entry(entry_ref);
+
+ /*
+ * Also report IO Operations statistics
+ */
+ pgstat_flush_io_ops(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_wal.c b/src/backend/utils/activity/pgstat_wal.c
index 5a878bd115..29b2e63e16 100644
--- a/src/backend/utils/activity/pgstat_wal.c
+++ b/src/backend/utils/activity/pgstat_wal.c
@@ -43,6 +43,11 @@ void
pgstat_report_wal(bool force)
{
pgstat_flush_wal(force);
+
+ /*
+ * Also report IO Operation statistics
+ */
+ pgstat_flush_io_ops(false);
}
/*
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 893690dad5..6259cc4f4c 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -2104,6 +2104,8 @@ pg_stat_reset_shared(PG_FUNCTION_ARGS)
pgstat_reset_of_kind(PGSTAT_KIND_BGWRITER);
pgstat_reset_of_kind(PGSTAT_KIND_CHECKPOINTER);
}
+ else if (strcmp(target, "io") == 0)
+ pgstat_reset_of_kind(PGSTAT_KIND_IOOPS);
else if (strcmp(target, "recovery_prefetch") == 0)
XLogPrefetchResetStats();
else if (strcmp(target, "wal") == 0)
@@ -2112,7 +2114,7 @@ pg_stat_reset_shared(PG_FUNCTION_ARGS)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("unrecognized reset target: \"%s\"", target),
- errhint("Target must be \"archiver\", \"bgwriter\", \"recovery_prefetch\", or \"wal\".")));
+ errhint("Target must be \"archiver\", \"bgwriter\", \"io\", \"recovery_prefetch\", or \"wal\".")));
PG_RETURN_VOID();
}
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 5276bf25a1..61e95135f2 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -331,6 +331,8 @@ typedef enum BackendType
B_WAL_WRITER,
} BackendType;
+#define BACKEND_NUM_TYPES B_WAL_WRITER
+
extern PGDLLIMPORT BackendType MyBackendType;
extern const char *GetBackendTypeDesc(BackendType backendType);
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index ac28f813b4..36a4b89a58 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -14,6 +14,7 @@
#include "datatype/timestamp.h"
#include "portability/instr_time.h"
#include "postmaster/pgarch.h" /* for MAX_XFN_CHARS */
+#include "storage/lwlock.h"
#include "utils/backend_progress.h" /* for backward compatibility */
#include "utils/backend_status.h" /* for backward compatibility */
#include "utils/relcache.h"
@@ -48,6 +49,7 @@ typedef enum PgStat_Kind
PGSTAT_KIND_ARCHIVER,
PGSTAT_KIND_BGWRITER,
PGSTAT_KIND_CHECKPOINTER,
+ PGSTAT_KIND_IOOPS,
PGSTAT_KIND_SLRU,
PGSTAT_KIND_WAL,
} PgStat_Kind;
@@ -276,6 +278,45 @@ typedef struct PgStat_CheckpointerStats
PgStat_Counter buf_fsync_backend;
} PgStat_CheckpointerStats;
+/*
+ * Types related to counting IO Operations for various IO Paths
+ */
+
+typedef enum IOOp
+{
+ IOOP_ALLOC,
+ IOOP_EXTEND,
+ IOOP_FSYNC,
+ IOOP_WRITE,
+} IOOp;
+
+#define IOOP_NUM_TYPES (IOOP_WRITE + 1)
+
+typedef enum IOPath
+{
+ IOPATH_DIRECT,
+ IOPATH_LOCAL,
+ IOPATH_SHARED,
+ IOPATH_STRATEGY,
+} IOPath;
+
+#define IOPATH_NUM_TYPES (IOPATH_STRATEGY + 1)
+
+typedef struct PgStat_IOOpCounters
+{
+ PgStat_Counter allocs;
+ PgStat_Counter extends;
+ PgStat_Counter fsyncs;
+ PgStat_Counter writes;
+} PgStat_IOOpCounters;
+
+typedef struct PgStat_IOPathOps
+{
+ LWLock lock;
+ PgStat_IOOpCounters data[IOPATH_NUM_TYPES];
+ TimestampTz stat_reset_timestamp;
+} PgStat_IOPathOps;
+
typedef struct PgStat_StatDBEntry
{
PgStat_Counter n_xact_commit;
@@ -453,6 +494,18 @@ extern void pgstat_report_checkpointer(void);
extern PgStat_CheckpointerStats *pgstat_fetch_stat_checkpointer(void);
+/*
+ * Functions in pgstat_io_ops.c
+ */
+
+extern void pgstat_count_io_op(IOOp io_op, IOPath io_path);
+extern bool pgstat_flush_io_ops(bool nowait);
+extern PgStat_IOPathOps *pgstat_fetch_backend_io_path_ops(void);
+extern PgStat_Counter pgstat_fetch_cumulative_io_ops(IOPath io_path, IOOp io_op);
+extern const char *pgstat_io_op_desc(IOOp io_op);
+extern const char *pgstat_io_path_desc(IOPath io_path);
+
+
/*
* Functions in pgstat_database.c
*/
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index aded5e8f7e..e35d82f050 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -310,10 +310,10 @@ extern void ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *
/* freelist.c */
extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
- uint32 *buf_state);
+ uint32 *buf_state, bool *from_ring);
extern void StrategyFreeBuffer(BufferDesc *buf);
extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
- BufferDesc *buf);
+ BufferDesc *buf, bool *from_ring);
extern int StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
extern void StrategyNotifyBgWriter(int bgwprocno);
diff --git a/src/include/utils/backend_status.h b/src/include/utils/backend_status.h
index 7403bca25e..49d062b1af 100644
--- a/src/include/utils/backend_status.h
+++ b/src/include/utils/backend_status.h
@@ -306,6 +306,40 @@ extern const char *pgstat_get_crashed_backend_activity(int pid, char *buffer,
int buflen);
extern uint64 pgstat_get_my_query_id(void);
+/* Utility functions */
+
+/*
+ * When maintaining an array of information about all valid BackendTypes, in
+ * order to avoid wasting the 0th spot, use this helper to convert a valid
+ * BackendType to a valid location in the array (given that no spot is
+ * maintained for B_INVALID BackendType).
+ */
+static inline int backend_type_get_idx(BackendType backend_type)
+{
+ /*
+ * backend_type must be one of the valid backend types. If caller is
+ * maintaining backend information in an array that includes B_INVALID,
+ * this function is unnecessary.
+ */
+ Assert(backend_type > B_INVALID && backend_type <= BACKEND_NUM_TYPES);
+ return backend_type - 1;
+}
+
+/*
+ * When using a value from an array of information about all valid
+ * BackendTypes, add 1 to the index before using it as a BackendType to adjust
+ * for not maintaining a spot for B_INVALID BackendType.
+ */
+static inline BackendType idx_get_backend_type(int idx)
+{
+ int backend_type = idx + 1;
+ /*
+ * If the array includes a spot for B_INVALID BackendType this function is
+ * not required.
+ */
+ Assert(backend_type > B_INVALID && backend_type <= BACKEND_NUM_TYPES);
+ return backend_type;
+}
/* ----------
* Support functions for the SQL-callable functions to
diff --git a/src/include/utils/pgstat_internal.h b/src/include/utils/pgstat_internal.h
index 9303d05427..adffdd147d 100644
--- a/src/include/utils/pgstat_internal.h
+++ b/src/include/utils/pgstat_internal.h
@@ -329,6 +329,12 @@ typedef struct PgStatShared_Checkpointer
PgStat_CheckpointerStats reset_offset;
} PgStatShared_Checkpointer;
+typedef struct PgStatShared_BackendIOPathOps
+{
+ LWLock lock;
+ PgStat_IOPathOps stats[BACKEND_NUM_TYPES];
+} PgStatShared_BackendIOPathOps;
+
typedef struct PgStatShared_SLRU
{
/* lock protects ->stats */
@@ -419,6 +425,7 @@ typedef struct PgStat_ShmemControl
PgStatShared_Archiver archiver;
PgStatShared_BgWriter bgwriter;
PgStatShared_Checkpointer checkpointer;
+ PgStatShared_BackendIOPathOps io_ops;
PgStatShared_SLRU slru;
PgStatShared_Wal wal;
} PgStat_ShmemControl;
@@ -442,6 +449,8 @@ typedef struct PgStat_Snapshot
PgStat_CheckpointerStats checkpointer;
+ PgStat_IOPathOps io_path_ops[BACKEND_NUM_TYPES];
+
PgStat_SLRUStats slru[SLRU_NUM_ELEMENTS];
PgStat_WalStats wal;
@@ -549,6 +558,14 @@ extern void pgstat_database_reset_timestamp_cb(PgStatShared_Common *header, Time
extern bool pgstat_function_flush_cb(PgStat_EntryRef *entry_ref, bool nowait);
+/*
+ * Functions in pgstat_io_ops.c
+ */
+
+extern void pgstat_io_ops_snapshot_cb(void);
+extern void pgstat_io_ops_reset_all_cb(TimestampTz ts);
+
+
/*
* Functions in pgstat_relation.c
*/
--
2.34.1
Attachment: v23-0003-Add-system-view-tracking-IO-ops-per-backend-type.patch (text/x-patch)
From 714fab745590b4ed6c1b9e220fb75c36ad5ab85d Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 4 Jul 2022 15:44:17 -0400
Subject: [PATCH v23 3/3] Add system view tracking IO ops per backend type
Add pg_stat_io, a system view which tracks the number of IOOp (allocs,
writes, fsyncs, and extends) done through each IOPath (e.g. shared
buffers, local buffers, unbuffered IO) by each type of backend.
Some of these should always be zero. For example, checkpointer does not
use a BufferAccessStrategy (currently), so the "strategy" IOPath for
checkpointer will be 0 for all IOOps. All possible combinations of
IOPath and IOOp are enumerated in the view but not all are populated or
even possible at this point.
View stats are fetched from statistics incremented when a backend
performs an IO Operation and maintained by the cumulative statistics
subsystem.
Each row of the view is stats for a particular BackendType for a
particular IOPath (e.g. shared buffer accesses by checkpointer) and
each column in the view is the total number of IO Operations done (e.g.
writes).
So a cell in the view would be, for example, the number of shared
buffers written by checkpointer since the last stats reset.
Note that some of the cells in the view are redundant with fields in
pg_stat_bgwriter (e.g. buffers_backend); however, these have been kept in
pg_stat_bgwriter for backwards compatibility. Deriving the redundant
pg_stat_bgwriter stats from the IO operations stats structures was also
problematic due to the separate reset targets for 'bgwriter' and
'io'.
Suggested by Andres Freund
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
doc/src/sgml/monitoring.sgml | 108 ++++++++++++++++++++++++++-
src/backend/catalog/system_views.sql | 11 +++
src/backend/utils/adt/pgstatfuncs.c | 66 ++++++++++++++++
src/include/catalog/pg_proc.dat | 9 +++
src/test/regress/expected/rules.out | 8 ++
src/test/regress/expected/stats.out | 59 +++++++++++++++
src/test/regress/sql/stats.sql | 34 +++++++++
7 files changed, 294 insertions(+), 1 deletion(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 4549c2560e..775ecf2f21 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -448,6 +448,15 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</entry>
</row>
+ <row>
+ <entry><structname>pg_stat_io</structname><indexterm><primary>pg_stat_io</primary></indexterm></entry>
+ <entry>A row for each IO path for each backend type showing
+ statistics about backend IO operations. See
+ <link linkend="monitoring-pg-stat-io-view">
+ <structname>pg_stat_io</structname></link> for details.
+ </entry>
+ </row>
+
<row>
<entry><structname>pg_stat_wal</structname><indexterm><primary>pg_stat_wal</primary></indexterm></entry>
<entry>One row only, showing statistics about WAL activity. See
@@ -3595,7 +3604,102 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
<structfield>stats_reset</structfield> <type>timestamp with time zone</type>
</para>
<para>
- Time at which these statistics were last reset
+ Time at which these statistics were last reset.
+ </para></entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ </sect2>
+
+ <sect2 id="monitoring-pg-stat-io-view">
+ <title><structname>pg_stat_io</structname></title>
+
+ <indexterm>
+ <primary>pg_stat_io</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_stat_io</structname> view has a row for each backend
+ type for each possible IO path containing global data for the cluster for
+ that backend type and IO path.
+ </para>
+
+ <table id="pg-stat-io-view" xreflabel="pg_stat_io">
+ <title><structname>pg_stat_io</structname> View</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>backend_type</structfield> <type>text</type>
+ </para>
+ <para>
+ Type of backend (e.g. background worker, autovacuum worker).
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>io_path</structfield> <type>text</type>
+ </para>
+ <para>
+ IO path taken (e.g. shared buffers, direct).
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>alloc</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers allocated.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>extend</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of blocks extended.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>fsync</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of blocks fsynced.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>write</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of blocks written.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
+ </para>
+ <para>
+ Time at which these statistics were last reset.
</para></entry>
</row>
</tbody>
@@ -5355,6 +5459,8 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
the <structname>pg_stat_bgwriter</structname>
view, <literal>archiver</literal> to reset all the counters shown in
the <structname>pg_stat_archiver</structname> view,
+ <literal>io</literal> to reset all the counters shown in the
+ <structname>pg_stat_io</structname> view,
<literal>wal</literal> to reset all the counters shown in the
<structname>pg_stat_wal</structname> view or
<literal>recovery_prefetch</literal> to reset all the counters shown
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index fedaed533b..b0b2d39e28 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1115,6 +1115,17 @@ CREATE VIEW pg_stat_bgwriter AS
pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+CREATE VIEW pg_stat_io AS
+SELECT
+ b.backend_type,
+ b.io_path,
+ b.alloc,
+ b.extend,
+ b.fsync,
+ b.write,
+ b.stats_reset
+FROM pg_stat_get_io() b;
+
CREATE VIEW pg_stat_wal AS
SELECT
w.wal_records,
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 6259cc4f4c..30aff64860 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1739,6 +1739,72 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
}
+/*
+* When adding a new column to the pg_stat_io view, add a new enum
+* value here above IO_NUM_COLUMNS.
+*/
+enum
+{
+ IO_COLUMN_BACKEND_TYPE,
+ IO_COLUMN_IO_PATH,
+ IO_COLUMN_ALLOCS,
+ IO_COLUMN_EXTENDS,
+ IO_COLUMN_FSYNCS,
+ IO_COLUMN_WRITES,
+ IO_COLUMN_RESET_TIME,
+ IO_NUM_COLUMNS,
+};
+
+Datum
+pg_stat_get_io(PG_FUNCTION_ARGS)
+{
+ PgStat_IOPathOps *io_path_ops;
+ ReturnSetInfo *rsinfo;
+ Datum reset_time;
+
+ SetSingleFuncCall(fcinfo, 0);
+ io_path_ops = pgstat_fetch_backend_io_path_ops();
+ rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+
+ /*
+ * Currently it is not permitted to reset IO operation stats for individual
+ * IO Paths or individual BackendTypes. All IO Operation statistics are
+ * reset together. As such, it is easiest to reuse the first reset timestamp
+ * available.
+ */
+ reset_time = TimestampTzGetDatum(io_path_ops->stat_reset_timestamp);
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ PgStat_IOOpCounters *counters = io_path_ops->data;
+ Datum backend_type_desc =
+ CStringGetTextDatum(GetBackendTypeDesc(idx_get_backend_type(i)));
+ /* const char *log_name = GetBackendTypeDesc(idx_get_backend_type(i)); */
+
+ for (int j = 0; j < IOPATH_NUM_TYPES; j++)
+ {
+ Datum values[IO_NUM_COLUMNS];
+ bool nulls[IO_NUM_COLUMNS];
+ memset(values, 0, sizeof(values));
+ memset(nulls, 0, sizeof(nulls));
+
+ values[IO_COLUMN_BACKEND_TYPE] = backend_type_desc;
+ values[IO_COLUMN_IO_PATH] = CStringGetTextDatum(pgstat_io_path_desc(j));
+ values[IO_COLUMN_RESET_TIME] = reset_time;
+ values[IO_COLUMN_ALLOCS] = Int64GetDatum(counters->allocs);
+ values[IO_COLUMN_EXTENDS] = Int64GetDatum(counters->extends);
+ values[IO_COLUMN_FSYNCS] = Int64GetDatum(counters->fsyncs);
+ values[IO_COLUMN_WRITES] = Int64GetDatum(counters->writes);
+
+ tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
+ counters++;
+ }
+ io_path_ops++;
+ }
+
+ return (Datum) 0;
+}
+
/*
* Returns statistics of WAL activity
*/
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 2e41f4d9e8..e9662fdc04 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5646,6 +5646,15 @@
proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
+{ oid => '8459', descr => 'statistics: counts of all IO operations done through all IO paths by each type of backend.',
+ proname => 'pg_stat_get_io', provolatile => 's', proisstrict => 'f',
+ prorows => '52', proretset => 't',
+ proparallel => 'r', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,text,int8,int8,int8,int8,timestamptz}',
+ proargmodes => '{o,o,o,o,o,o,o}',
+ proargnames => '{backend_type,io_path,alloc,extend,fsync,write,stats_reset}',
+ prosrc => 'pg_stat_get_io' },
+
{ oid => '1136', descr => 'statistics: information about WAL activity',
proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
proparallel => 'r', prorettype => 'record', proargtypes => '',
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 7ec3d2688f..3b05af9ac8 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1873,6 +1873,14 @@ pg_stat_gssapi| SELECT s.pid,
s.gss_enc AS encrypted
FROM pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid, query_id)
WHERE (s.client_port IS NOT NULL);
+pg_stat_io| SELECT b.backend_type,
+ b.io_path,
+ b.alloc,
+ b.extend,
+ b.fsync,
+ b.write,
+ b.stats_reset
+ FROM pg_stat_get_io() b(backend_type, io_path, alloc, extend, fsync, write, stats_reset);
pg_stat_progress_analyze| SELECT s.pid,
s.datid,
d.datname,
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index 5b0ebf090f..6dade03b65 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -554,4 +554,63 @@ SELECT pg_stat_get_live_tuples(:drop_stats_test_subxact_oid);
DROP TABLE trunc_stats_test, trunc_stats_test1, trunc_stats_test2, trunc_stats_test3, trunc_stats_test4;
DROP TABLE prevstats;
+-- Test that writes to Shared Buffers are tracked in pg_stat_io
+SELECT sum(write) AS io_sum_shared_writes_before FROM pg_stat_io WHERE io_path = 'Shared' \gset
+CREATE TABLE test_io_shared_writes(a int);
+INSERT INTO test_io_shared_writes SELECT i FROM generate_series(1,100)i;
+CHECKPOINT;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(write) AS io_sum_shared_writes_after FROM pg_stat_io WHERE io_path = 'Shared' \gset
+SELECT :io_sum_shared_writes_after > :io_sum_shared_writes_before;
+ ?column?
+----------
+ t
+(1 row)
+
+DROP TABLE test_io_shared_writes;
+-- Test that extends of temporary tables are tracked in pg_stat_io
+CREATE TEMPORARY TABLE test_io_local_extends(a int);
+SELECT sum(extend) AS io_sum_local_extends_before FROM pg_stat_io WHERE io_path = 'Local' \gset
+INSERT INTO test_io_local_extends VALUES(1);
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(extend) AS io_sum_local_extends_after FROM pg_stat_io WHERE io_path = 'Local' \gset
+SELECT :io_sum_local_extends_after > :io_sum_local_extends_before;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Test that, when using a Strategy, reusing buffers from the Strategy ring
+-- counts as "Strategy" allocs.
+CREATE TABLE test_io_strategy_stats(a INT, b INT);
+ALTER TABLE test_io_strategy_stats SET (autovacuum_enabled = 'false');
+INSERT INTO test_io_strategy_stats SELECT i, i from generate_series(1,8000)i;
+-- Ensure that the next VACUUM will need to perform IO
+VACUUM (FULL) test_io_strategy_stats;
+SELECT sum(alloc) AS io_sum_strategy_allocs_before FROM pg_stat_io WHERE io_path = 'Strategy' \gset
+VACUUM (PARALLEL 0) test_io_strategy_stats;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(alloc) AS io_sum_strategy_allocs_after FROM pg_stat_io WHERE io_path = 'Strategy' \gset
+SELECT :io_sum_strategy_allocs_after > :io_sum_strategy_allocs_before;
+ ?column?
+----------
+ t
+(1 row)
+
+DROP TABLE test_io_strategy_stats;
-- End of Stats Test
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index 3f3cf8fb56..fbd3977605 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -285,4 +285,38 @@ SELECT pg_stat_get_live_tuples(:drop_stats_test_subxact_oid);
DROP TABLE trunc_stats_test, trunc_stats_test1, trunc_stats_test2, trunc_stats_test3, trunc_stats_test4;
DROP TABLE prevstats;
+
+-- Test that writes to Shared Buffers are tracked in pg_stat_io
+SELECT sum(write) AS io_sum_shared_writes_before FROM pg_stat_io WHERE io_path = 'Shared' \gset
+CREATE TABLE test_io_shared_writes(a int);
+INSERT INTO test_io_shared_writes SELECT i FROM generate_series(1,100)i;
+CHECKPOINT;
+SELECT pg_stat_force_next_flush();
+SELECT sum(write) AS io_sum_shared_writes_after FROM pg_stat_io WHERE io_path = 'Shared' \gset
+SELECT :io_sum_shared_writes_after > :io_sum_shared_writes_before;
+DROP TABLE test_io_shared_writes;
+
+-- Test that extends of temporary tables are tracked in pg_stat_io
+CREATE TEMPORARY TABLE test_io_local_extends(a int);
+SELECT sum(extend) AS io_sum_local_extends_before FROM pg_stat_io WHERE io_path = 'Local' \gset
+INSERT INTO test_io_local_extends VALUES(1);
+SELECT pg_stat_force_next_flush();
+SELECT sum(extend) AS io_sum_local_extends_after FROM pg_stat_io WHERE io_path = 'Local' \gset
+SELECT :io_sum_local_extends_after > :io_sum_local_extends_before;
+
+-- Test that, when using a Strategy, reusing buffers from the Strategy ring
+-- counts as "Strategy" allocs.
+CREATE TABLE test_io_strategy_stats(a INT, b INT);
+ALTER TABLE test_io_strategy_stats SET (autovacuum_enabled = 'false');
+INSERT INTO test_io_strategy_stats SELECT i, i from generate_series(1,8000)i;
+-- Ensure that the next VACUUM will need to perform IO
+VACUUM (FULL) test_io_strategy_stats;
+SELECT sum(alloc) AS io_sum_strategy_allocs_before FROM pg_stat_io WHERE io_path = 'Strategy' \gset
+VACUUM (PARALLEL 0) test_io_strategy_stats;
+SELECT pg_stat_force_next_flush();
+SELECT sum(alloc) AS io_sum_strategy_allocs_after FROM pg_stat_io WHERE io_path = 'Strategy' \gset
+SELECT :io_sum_strategy_allocs_after > :io_sum_strategy_allocs_before;
+DROP TABLE test_io_strategy_stats;
+
+
-- End of Stats Test
--
2.34.1
At Mon, 11 Jul 2022 22:22:28 -0400, Melanie Plageman <melanieplageman@gmail.com> wrote in
Hi,
In the attached patch set, I've added in missing IO operations for
certain IO Paths as well as enumerating in the commit message which IO
Paths and IO Operations are not currently counted and/or not possible.

There is a TODO in HandleWalWriterInterrupts() about removing
pgstat_report_wal() since it is immediately before a proc_exit()
Right. walwriter does that without needing the explicit call.
I was wondering if LocalBufferAlloc() should increment the counter or if
I should wait until GetLocalBufferStorage() to increment the counter.
Depends on what "allocate" means. Different from shared buffers, local
buffers are taken from OS then allocated to page. OS-allocated pages
are restricted by num_temp_buffers so I think what we're interested in
is the count incremented by LocalBufferAlloc(). (And it is the parallel
of alloc for shared-buffers)
I also realized that I am not differentiating between IOPATH_SHARED and
IOPATH_STRATEGY for IOOP_FSYNC. But, given that we don't know what type
of buffer we are fsync'ing by the time we call register_dirty_segment(),
I'm not sure how we would fix this.
I think flushes rarely happen for strategy-loaded buffers. If
that is sensible, IOOP_FSYNC would not make much sense for
IOPATH_STRATEGY.
On Wed, Jul 6, 2022 at 3:20 PM Andres Freund <andres@anarazel.de> wrote:
On 2022-07-05 13:24:55 -0400, Melanie Plageman wrote:
@@ -176,6 +176,8 @@ InitStandaloneProcess(const char *argv0)
{
Assert(!IsPostmasterEnvironment);
+ MyBackendType = B_STANDALONE_BACKEND;
Hm. This is used for singleuser mode as well as bootstrap. Should we
split those? It's not like bootstrap mode really matters for stats, so
I'm inclined not to.

I have no opinion currently.
It depends on how commonly you think developers might want separate
bootstrap and single user mode IO stats.
Regarding to stats, I don't think separating them makes much sense.
@@ -375,6 +376,8 @@ BootstrapModeMain(int argc, char *argv[], bool
check_only)
* out the initial relation mapping files. */
  RelationMapFinishBootstrap();
+ // TODO: should this be done for bootstrap?
+ pgstat_report_io_ops();

Hm. Not particularly useful, but also not harmful. But we don't need an
explicit call, because it'll be done at process exit too. At least I
think, it could be that it's different for bootstrap.

I've removed this and other occurrences which were before proc_exit()
(and thus redundant). (Though I did not explicitly check if it was
different for bootstrap.)
pgstat_report_stat(true) is supposed to be called as needed via a
before_shmem_exit hook so I think that's the right thing.
IMO it'd be a good idea to add pgstat_report_io_ops() to
pgstat_report_vacuum()/analyze(), so that the stats for a long-running
autovac worker get updated more regularly.

noted and fixed.
How about moving the pgstat_report_io_ops() into
pgstat_report_bgwriter(), pgstat_report_autovacuum() etc? Seems
unnecessary to have multiple pgstat_* calls in these places.

noted and fixed.
+ * Also report IO Operations statistics
I think that the function comment also should mention this.
Wonder if it's worth making the lock specific to the backend type?
I've added another Lock into PgStat_IOPathOps so that each BackendType
can be locked separately. But, I've also kept the lock in
PgStatShared_BackendIOPathOps so that reset_all and snapshot could be
done easily.
Looks fine about the lock separation.
By the way, in the following line:
+ &pgStatLocal.shmem->io_ops.stats[backend_type_get_idx(MyBackendType)];
backend_type_get_idx(x) is actually (x - 1) plus assertion on the
value range. And the only use-case is here. There's a reverse
function and also used only at one place.
+ Datum backend_type_desc =
+ CStringGetTextDatum(GetBackendTypeDesc(idx_get_backend_type(i)));
In this usage GetBackendTypeDesc() gracefully treats out-of-domain
values but idx_get_backend_type keenly kills the process for the
same. This is inconsistent.
My humble opinion on this is we don't define the two functions and
replace the calls to them with (x +/- 1). In addition to that, I think
we should not abort() by invalid backend types. In that sense, I
wonder if we could use B_INVALIDth element for this purpose.
+ LWLockAcquire(&stats_shmem->lock, LW_SHARED);
+ memcpy(&reset, reset_offset, sizeof(stats_shmem->stats));
+ LWLockRelease(&stats_shmem->lock);

Which then also means that you don't need the reset offset stuff. It's
only there because with the changecount approach we can't take a lock to
reset the stats (since there is no lock). With a lock you can just reset
the shared state.

Yes, I believe I have cleaned up all of this embarrassing mess. I use the
lock in PgStatShared_BackendIOPathOps for reset all and snapshot and the
locks in PgStat_IOPathOps for flush.
Looks fine, but I think pgstat_flush_io_ops() needs more comments like
other pgstat_flush_* functions.
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ stats_shmem->stats[i].stat_reset_timestamp = ts;
I'm not sure we need a separate reset timestamp for each backend type
but SLRU counter does the same thing..
+pgstat_report_io_ops(void)
+{

This shouldn't be needed - the flush function above can be used.
Fixed.
The commit message of 0002 contains that name:p
+const char *
+pgstat_io_path_desc(IOPath io_path)
+{
+	const char *io_path_desc = "Unknown IO Path";
+
This should be unreachable, right?
Changed it to an error.
+ elog(ERROR, "Attempt to describe an unknown IOPath");
I think we usually spell it as ("unrecognized IOPath value: %d", io_path).
From f2b5b75f5063702cbc3c64efdc1e7ef3cf1acdb4 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 4 Jul 2022 15:44:17 -0400
Subject: [PATCH v22 3/3] Add system view tracking IO ops per backend type

Add pg_stat_buffers, a system view which tracks the number of IO
operations (allocs, writes, fsyncs, and extends) done through each IO
path (e.g. shared buffers, local buffers, unbuffered IO) by each type of
backend.

I think I like pg_stat_io a bit better? Nearly everything in here seems
to fit better in that.

I guess we could split out buffers allocated, but that's actually
interesting in the context of the kind of IO too.

changed it to pg_stat_io
A bit different thing, but I felt a little uneasy about some uses of
"pgstat_io_ops". IOOp looks like a neighbouring word of IOPath. On the
other hand, actually iopath is used as an attribute of io_ops in many
places. Couldn't we be more consistent about the relationship between
the names?
IOOp -> PgStat_IOOpType
IOPath -> PgStat_IOPath
PgStat_IOOpCounters -> PgStat_IOCounters
PgStat_IOPathOps -> PgStat_IO
pgstat_count_io_op -> pgstat_count_io
...
(Better wordings are welcome.)
+CREATE VIEW pg_stat_buffers AS
+SELECT
+    b.backend_type,
+    b.io_path,
+    b.alloc,
+    b.extend,
+    b.fsync,
+    b.write,
+    b.stats_reset
+FROM pg_stat_get_buffers() b;

Do we want to expose all data to all users? I guess pg_stat_bgwriter
does? But this does split things out a lot more...

I didn't see another similar example limiting access.
(The doc told me that) pg_buffercache view is restricted to
pg_monitor. But other activity-stats(aka stats collector:)-related
pg_stat_* views are not restricted to pg_monitor.
doc> pg_monitor Read/execute various monitoring views and functions.
Hmm....
Don't think you can rely on that. The lookup of the view, functions
might have needed to load catalog data, which might have needed to evict
buffers. I think you can do something more reliable by checking that
there's more written buffers after a checkpoint than before, or such.

Yes, per an off-list suggestion by you, I have changed the tests to use a
sum of writes. I've also added a test for IOPATH_LOCAL and fixed some of
the missing calls to count IO Operations for IOPATH_LOCAL and
IOPATH_STRATEGY.

I struggled to come up with a way to test that writes for a particular
type of backend are counted correctly since a dirty buffer could be
written out by another type of backend before the target BackendType has
a chance to write it out.

I also struggled to come up with a way to test IO operations for
background workers. I'm not sure of a way to deterministically have a
background worker do a particular kind of IO in a test scenario.

I'm not sure how to cause a strategy "extend" for testing.
I'm not sure what you are expecting, but for example, "create table t
as select generate_series(0, 99999)" increments Strategy-extend by
about 400. (I'm surprised that autovac worker-shared-extend has
non-zero number)
Would be nice to have something testing that the ringbuffer stats stuff
does something sensible - that feels not entirely trivial.

I've added a test to test that reused strategy buffers are counted as
allocs. I would like to add a test which checks that if a buffer in the
ring is pinned and thus not reused, that it is not counted as a strategy
alloc, but I found it challenging without a way to pause vacuuming, pin
a buffer, then resume vacuuming.
===
If I'm not missing something, in BufferAlloc, when strategy is not
used and the victim is dirty, iopath is determined based on the
uninitialized from_ring. It seems to me from_ring is equivalent to
strategy_current_was_in_ring. And if StrategyGetBuffer has set
from_ring to false, StrategyRejectBuffer may set it to true, which
is wrong. The logic around there seems to need a rethink.
What can we read from the values separated into Shared and Strategy?
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Thanks for the review!
On Tue, Jul 12, 2022 at 4:06 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com>
wrote:
At Mon, 11 Jul 2022 22:22:28 -0400, Melanie Plageman <melanieplageman@gmail.com> wrote in

Hi,
In the attached patch set, I've added in missing IO operations for
certain IO Paths as well as enumerating in the commit message which IO
Paths and IO Operations are not currently counted and/or not possible.

There is a TODO in HandleWalWriterInterrupts() about removing
pgstat_report_wal() since it is immediately before a proc_exit()

Right. walwriter does that without needing the explicit call.
I have deleted it.
I was wondering if LocalBufferAlloc() should increment the counter or if
I should wait until GetLocalBufferStorage() to increment the counter.

Depends on what "allocate" means. Different from shared buffers, local
buffers are taken from OS then allocated to page. OS-allocated pages
are restricted by num_temp_buffers so I think what we're interested in
is the count incremented by LocalBufferAlloc(). (And it is the parallel
of alloc for shared-buffers)
I've left it in LocalBufferAlloc().
I also realized that I am not differentiating between IOPATH_SHARED and
IOPATH_STRATEGY for IOOP_FSYNC. But, given that we don't know what type
of buffer we are fsync'ing by the time we call register_dirty_segment(),
I'm not sure how we would fix this.

I think flushes rarely happen for strategy-loaded buffers. If
that is sensible, IOOP_FSYNC would not make much sense for
IOPATH_STRATEGY.
Why would it be less likely for a backend to do its own fsync when
flushing a dirty strategy buffer than a regular dirty shared buffer?
IMO it'd be a good idea to add pgstat_report_io_ops() to
pgstat_report_vacuum()/analyze(), so that the stats for a long-running
autovac worker get updated more regularly.

noted and fixed.
How about moving the pgstat_report_io_ops() into
pgstat_report_bgwriter(), pgstat_report_autovacuum() etc? Seems
unnecessary to have multiple pgstat_* calls in these places.

noted and fixed.
+ * Also report IO Operations statistics
I think that the function comment also should mention this.
I've added comments at the top of all these functions.
Wonder if it's worth making the lock specific to the backend type?
I've added another Lock into PgStat_IOPathOps so that each BackendType
can be locked separately. But, I've also kept the lock in
PgStatShared_BackendIOPathOps so that reset_all and snapshot could be
done easily.

The lock separation looks fine.

Actually, I think it is not safe to use both of these locks. So, if we
pick one method, it is probably better to go with the locks in
PgStat_IOPathOps: that will be more efficient for flushing (though not
for fetching and resetting), so that is probably the way to go, right?
By the way, in the following line:

+		&pgStatLocal.shmem->io_ops.stats[backend_type_get_idx(MyBackendType)];

backend_type_get_idx(x) is actually (x - 1) plus an assertion on the
value range, and this is its only use-case. There's a reverse function
that is likewise used in only one place:

+	Datum		backend_type_desc =
+		CStringGetTextDatum(GetBackendTypeDesc(idx_get_backend_type(i)));

In this usage, GetBackendTypeDesc() gracefully treats out-of-domain
values but idx_get_backend_type() keenly kills the process for the
same. This is inconsistent.

My humble opinion on this is that we don't define the two functions and
replace the calls to them with (x +/- 1). In addition to that, I think
we should not abort() on invalid backend types. In that sense, I wonder
if we could use the B_INVALIDth element for this purpose.
I think that GetBackendTypeDesc() should probably also error out for an
unknown value.
I would be open to not using the helper functions. I thought it would be
less error-prone, but since it is limited to the code in
pgstat_io_ops.c, it is probably okay. Let me think a bit more.
Could you explain more about what you mean about using B_INVALID
BackendType?
+	LWLockAcquire(&stats_shmem->lock, LW_SHARED);
+	memcpy(&reset, reset_offset, sizeof(stats_shmem->stats));
+	LWLockRelease(&stats_shmem->lock);

Which then also means that you don't need the reset offset stuff. It's
only there because with the changecount approach we can't take a lock to
reset the stats (since there is no lock). With a lock you can just reset
the shared state.
Yes, I believe I have cleaned up all of this embarrassing mess. I use the
lock in PgStatShared_BackendIOPathOps for reset all and snapshot and the
locks in PgStat_IOPathOps for flush.

Looks fine, but I think pgstat_flush_io_ops() needs more comments like
the other pgstat_flush_* functions.

+	for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+		stats_shmem->stats[i].stat_reset_timestamp = ts;

I'm not sure we need a separate reset timestamp for each backend type,
but the SLRU counter does the same thing..
Yes, I think for SLRU stats it is because you can reset individual SLRU
stats. Also there is no wrapper data structure to put it in. I could
keep it in PgStatShared_BackendIOPathOps since you have to reset all IO
operation stats at once, but I am thinking of getting rid of
PgStatShared_BackendIOPathOps since it is not needed if I only keep the
locks in PgStat_IOPathOps and make the global shared value an array of
PgStat_IOPathOps.
+pgstat_report_io_ops(void)
+{

This shouldn't be needed - the flush function above can be used.

Fixed.
The commit message of 0002 contains that name:p
Thanks! Fixed.
+const char *
+pgstat_io_path_desc(IOPath io_path)
+{
+	const char *io_path_desc = "Unknown IO Path";

This should be unreachable, right?
Changed it to an error.
+ elog(ERROR, "Attempt to describe an unknown IOPath");
I think we usually spell it as ("unrecognized IOPath value: %d", io_path).
I have changed to this.
From f2b5b75f5063702cbc3c64efdc1e7ef3cf1acdb4 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 4 Jul 2022 15:44:17 -0400
Subject: [PATCH v22 3/3] Add system view tracking IO ops per backend type
Add pg_stat_buffers, a system view which tracks the number of IO
operations (allocs, writes, fsyncs, and extends) done through each IO
path (e.g. shared buffers, local buffers, unbuffered IO) by each type of
backend.
I think I like pg_stat_io a bit better? Nearly everything in here seems
to fit better in that. I guess we could split out buffers allocated, but
that's actually interesting in the context of the kind of IO too.

Changed it to pg_stat_io.
A bit different thing, but I felt a little uneasy about some uses of
"pgstat_io_ops". IOOp looks like a neighbouring word of IOPath. On the
other hand, iopath is actually used as an attribute of io_ops in many
places. Couldn't we be more consistent about the relationship between
the names?

IOOp -> PgStat_IOOpType
IOPath -> PgStat_IOPath
PgStat_IOOpCounters -> PgStat_IOCounters
PgStat_IOPathOps -> PgStat_IO
pgstat_count_io_op -> pgstat_count_io
...(Better wordings are welcome.)
Let me think about naming and make changes in the next version.
Would be nice to have something testing that the ringbuffer stats stuff
does something sensible - that feels not entirely trivial.

I've added a test to check that reused strategy buffers are counted as
allocs. I would like to add a test which checks that if a buffer in the
ring is pinned and thus not reused, it is not counted as a strategy
alloc, but I found that challenging without a way to pause vacuuming,
pin a buffer, then resume vacuuming.
If I'm not missing something, in BufferAlloc, when a strategy is not
used and the victim is dirty, iopath is determined based on the
uninitialized from_ring. It seems to me from_ring is equivalent to
strategy_current_was_in_ring. And if StrategyGetBuffer has set
from_ring to false, StrategyRejectBuffer may set it to true, which is
wrong. The logic around there seems to need a rethink.

What can we read from the values separated into Shared and Strategy?
I have changed this local variable to only be used for communicating if
the buffer which was not rejected by StrategyRejectBuffer() was from the
ring or not for the purposes of counting strategy writes. I could add an
accessor for this member (strategy->current_was_in_ring) if that makes
more sense? For strategy allocs, I just use
strategy->current_was_in_ring inside of StrategyGetBuffer() since this
has access to that member of the struct.
Currently, strategy allocs count only reuses of a strategy buffer (not
initial shared buffers which are added to the ring).
strategy writes count only the writing out of dirty buffers which are
already in the ring and are being reused.
Alternatively, we could also count as strategy allocs all those buffers
which are added to the ring and count as strategy writes all those
shared buffers which are dirty when initially added to the ring.
- Melanie
Attachments:
Attachment: v24-0001-Add-BackendType-for-standalone-backends.patch (text/x-patch)
From fdf6486b6a1546bb393aae7aa9e97604abc31d86 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 28 Jun 2022 11:33:04 -0400
Subject: [PATCH v24 1/3] Add BackendType for standalone backends
All backends should have a BackendType to enable statistics reporting
per BackendType.
Add a new BackendType for standalone backends, B_STANDALONE_BACKEND (and
alphabetize the BackendTypes). Both the bootstrap backend and single
user mode backends will have BackendType B_STANDALONE_BACKEND.
Author: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://www.postgresql.org/message-id/CAAKRu_aaq33UnG4TXq3S-OSXGWj1QGf0sU%2BECH4tNwGFNERkZA%40mail.gmail.com
---
src/backend/utils/init/miscinit.c | 17 +++++++++++------
src/include/miscadmin.h | 5 +++--
2 files changed, 14 insertions(+), 8 deletions(-)
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index eb43b2c5e5..07e6db1a1c 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -176,6 +176,8 @@ InitStandaloneProcess(const char *argv0)
{
Assert(!IsPostmasterEnvironment);
+ MyBackendType = B_STANDALONE_BACKEND;
+
/*
* Start our win32 signal implementation
*/
@@ -255,6 +257,9 @@ GetBackendTypeDesc(BackendType backendType)
case B_INVALID:
backendDesc = "not initialized";
break;
+ case B_ARCHIVER:
+ backendDesc = "archiver";
+ break;
case B_AUTOVAC_LAUNCHER:
backendDesc = "autovacuum launcher";
break;
@@ -273,6 +278,12 @@ GetBackendTypeDesc(BackendType backendType)
case B_CHECKPOINTER:
backendDesc = "checkpointer";
break;
+ case B_LOGGER:
+ backendDesc = "logger";
+ break;
+ case B_STANDALONE_BACKEND:
+ backendDesc = "standalone backend";
+ break;
case B_STARTUP:
backendDesc = "startup";
break;
@@ -285,12 +296,6 @@ GetBackendTypeDesc(BackendType backendType)
case B_WAL_WRITER:
backendDesc = "walwriter";
break;
- case B_ARCHIVER:
- backendDesc = "archiver";
- break;
- case B_LOGGER:
- backendDesc = "logger";
- break;
}
return backendDesc;
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index ea9a56d395..5276bf25a1 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -316,18 +316,19 @@ extern void SwitchBackToLocalLatch(void);
typedef enum BackendType
{
B_INVALID = 0,
+ B_ARCHIVER,
B_AUTOVAC_LAUNCHER,
B_AUTOVAC_WORKER,
B_BACKEND,
B_BG_WORKER,
B_BG_WRITER,
B_CHECKPOINTER,
+ B_LOGGER,
+ B_STANDALONE_BACKEND,
B_STARTUP,
B_WAL_RECEIVER,
B_WAL_SENDER,
B_WAL_WRITER,
- B_ARCHIVER,
- B_LOGGER,
} BackendType;
extern PGDLLIMPORT BackendType MyBackendType;
--
2.34.1
Attachment: v24-0003-Add-system-view-tracking-IO-ops-per-backend-type.patch (text/x-patch)
From 546b9a06e33e1bf6ab7c8de2d5db93e77f77c39c Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 4 Jul 2022 15:44:17 -0400
Subject: [PATCH v24 3/3] Add system view tracking IO ops per backend type
Add pg_stat_io, a system view which tracks the number of IOOp (allocs,
writes, fsyncs, and extends) done through each IOPath (e.g. shared
buffers, local buffers, unbuffered IO) by each type of backend.
Some of these should always be zero. For example, checkpointer does not
use a BufferAccessStrategy (currently), so the "strategy" IOPath for
checkpointer will be 0 for all IOOps. All possible combinations of
IOPath and IOOp are enumerated in the view but not all are populated or
even possible at this point.
View stats are fetched from statistics incremented when a backend
performs an IO Operation and maintained by the cumulative statistics
subsystem.
Each row of the view is stats for a particular BackendType for a
particular IOPath (e.g. shared buffer accesses by checkpointer) and
each column in the view is the total number of IO Operations done (e.g.
writes).
So a cell in the view would be, for example, the number of shared
buffers written by checkpointer since the last stats reset.
Note that some of the cells in the view are redundant with fields in
pg_stat_bgwriter (e.g. buffers_backend), however these have been kept in
pg_stat_bgwriter for backwards compatibility. Deriving the redundant
pg_stat_bgwriter stats from the IO operations stats structures was also
problematic due to the separate reset targets for 'bgwriter' and
'io'.
Suggested by Andres Freund
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>, Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
doc/src/sgml/monitoring.sgml | 108 ++++++++++++++++++++++++++-
src/backend/catalog/system_views.sql | 11 +++
src/backend/utils/adt/pgstatfuncs.c | 66 ++++++++++++++++
src/include/catalog/pg_proc.dat | 9 +++
src/test/regress/expected/rules.out | 8 ++
src/test/regress/expected/stats.out | 59 +++++++++++++++
src/test/regress/sql/stats.sql | 34 +++++++++
7 files changed, 294 insertions(+), 1 deletion(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 4549c2560e..775ecf2f21 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -448,6 +448,15 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</entry>
</row>
+ <row>
+ <entry><structname>pg_stat_io</structname><indexterm><primary>pg_stat_io</primary></indexterm></entry>
+ <entry>A row for each IO path for each backend type showing
+ statistics about backend IO operations. See
+ <link linkend="monitoring-pg-stat-io-view">
+ <structname>pg_stat_io</structname></link> for details.
+ </entry>
+ </row>
+
<row>
<entry><structname>pg_stat_wal</structname><indexterm><primary>pg_stat_wal</primary></indexterm></entry>
<entry>One row only, showing statistics about WAL activity. See
@@ -3595,7 +3604,102 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
<structfield>stats_reset</structfield> <type>timestamp with time zone</type>
</para>
<para>
- Time at which these statistics were last reset
+ Time at which these statistics were last reset.
+ </para></entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ </sect2>
+
+ <sect2 id="monitoring-pg-stat-io-view">
+ <title><structname>pg_stat_io</structname></title>
+
+ <indexterm>
+ <primary>pg_stat_io</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_stat_io</structname> view has a row for each backend
+ type for each possible IO path containing global data for the cluster for
+ that backend and IO path.
+ </para>
+
+ <table id="pg-stat-io-view" xreflabel="pg_stat_io">
+ <title><structname>pg_stat_io</structname> View</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>backend_type</structfield> <type>text</type>
+ </para>
+ <para>
+ Type of backend (e.g. background worker, autovacuum worker).
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>io_path</structfield> <type>text</type>
+ </para>
+ <para>
+ IO path taken (e.g. shared buffers, direct).
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>alloc</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers allocated.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>extend</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of blocks extended.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>fsync</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of blocks fsynced.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>write</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of blocks written.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
+ </para>
+ <para>
+ Time at which these statistics were last reset.
</para></entry>
</row>
</tbody>
@@ -5355,6 +5459,8 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
the <structname>pg_stat_bgwriter</structname>
view, <literal>archiver</literal> to reset all the counters shown in
the <structname>pg_stat_archiver</structname> view,
+ <literal>io</literal> to reset all the counters shown in the
+ <structname>pg_stat_io</structname> view,
<literal>wal</literal> to reset all the counters shown in the
<structname>pg_stat_wal</structname> view or
<literal>recovery_prefetch</literal> to reset all the counters shown
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index fedaed533b..b0b2d39e28 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1115,6 +1115,17 @@ CREATE VIEW pg_stat_bgwriter AS
pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+CREATE VIEW pg_stat_io AS
+SELECT
+ b.backend_type,
+ b.io_path,
+ b.alloc,
+ b.extend,
+ b.fsync,
+ b.write,
+ b.stats_reset
+FROM pg_stat_get_io() b;
+
CREATE VIEW pg_stat_wal AS
SELECT
w.wal_records,
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 6259cc4f4c..30aff64860 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1739,6 +1739,72 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
}
+/*
+* When adding a new column to the pg_stat_io view, add a new enum
+* value here above IO_NUM_COLUMNS.
+*/
+enum
+{
+ IO_COLUMN_BACKEND_TYPE,
+ IO_COLUMN_IO_PATH,
+ IO_COLUMN_ALLOCS,
+ IO_COLUMN_EXTENDS,
+ IO_COLUMN_FSYNCS,
+ IO_COLUMN_WRITES,
+ IO_COLUMN_RESET_TIME,
+ IO_NUM_COLUMNS,
+};
+
+Datum
+pg_stat_get_io(PG_FUNCTION_ARGS)
+{
+ PgStat_IOPathOps *io_path_ops;
+ ReturnSetInfo *rsinfo;
+ Datum reset_time;
+
+ SetSingleFuncCall(fcinfo, 0);
+ io_path_ops = pgstat_fetch_backend_io_path_ops();
+ rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+
+ /*
+ * Currently it is not permitted to reset IO operation stats for individual
+ * IO Paths or individual BackendTypes. All IO Operation statistics are
+ * reset together. As such, it is easiest to reuse the first reset timestamp
+ * available.
+ */
+ reset_time = TimestampTzGetDatum(io_path_ops->stat_reset_timestamp);
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ PgStat_IOOpCounters *counters = io_path_ops->data;
+ Datum backend_type_desc =
+ CStringGetTextDatum(GetBackendTypeDesc(idx_get_backend_type(i)));
+ /* const char *log_name = GetBackendTypeDesc(idx_get_backend_type(i)); */
+
+ for (int j = 0; j < IOPATH_NUM_TYPES; j++)
+ {
+ Datum values[IO_NUM_COLUMNS];
+ bool nulls[IO_NUM_COLUMNS];
+ memset(values, 0, sizeof(values));
+ memset(nulls, 0, sizeof(nulls));
+
+ values[IO_COLUMN_BACKEND_TYPE] = backend_type_desc;
+ values[IO_COLUMN_IO_PATH] = CStringGetTextDatum(pgstat_io_path_desc(j));
+ values[IO_COLUMN_RESET_TIME] = TimestampTzGetDatum(reset_time);
+ values[IO_COLUMN_ALLOCS] = Int64GetDatum(counters->allocs);
+ values[IO_COLUMN_EXTENDS] = Int64GetDatum(counters->extends);
+ values[IO_COLUMN_FSYNCS] = Int64GetDatum(counters->fsyncs);
+ values[IO_COLUMN_WRITES] = Int64GetDatum(counters->writes);
+
+ tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
+ counters++;
+ }
+ io_path_ops++;
+ }
+
+ return (Datum) 0;
+}
+
/*
* Returns statistics of WAL activity
*/
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 2e41f4d9e8..e9662fdc04 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5646,6 +5646,15 @@
proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
+{ oid => '8459', descr => 'statistics: counts of all IO operations done through all IO paths by each type of backend.',
+ proname => 'pg_stat_get_io', provolatile => 's', proisstrict => 'f',
+ prorows => '52', proretset => 't',
+ proparallel => 'r', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,text,int8,int8,int8,int8,timestamptz}',
+ proargmodes => '{o,o,o,o,o,o,o}',
+ proargnames => '{backend_type,io_path,alloc,extend,fsync,write,stats_reset}',
+ prosrc => 'pg_stat_get_io' },
+
{ oid => '1136', descr => 'statistics: information about WAL activity',
proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
proparallel => 'r', prorettype => 'record', proargtypes => '',
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 7ec3d2688f..3b05af9ac8 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1873,6 +1873,14 @@ pg_stat_gssapi| SELECT s.pid,
s.gss_enc AS encrypted
FROM pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid, query_id)
WHERE (s.client_port IS NOT NULL);
+pg_stat_io| SELECT b.backend_type,
+ b.io_path,
+ b.alloc,
+ b.extend,
+ b.fsync,
+ b.write,
+ b.stats_reset
+ FROM pg_stat_get_io() b(backend_type, io_path, alloc, extend, fsync, write, stats_reset);
pg_stat_progress_analyze| SELECT s.pid,
s.datid,
d.datname,
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index 5b0ebf090f..6dade03b65 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -554,4 +554,63 @@ SELECT pg_stat_get_live_tuples(:drop_stats_test_subxact_oid);
DROP TABLE trunc_stats_test, trunc_stats_test1, trunc_stats_test2, trunc_stats_test3, trunc_stats_test4;
DROP TABLE prevstats;
+-- Test that writes to Shared Buffers are tracked in pg_stat_io
+SELECT sum(write) AS io_sum_shared_writes_before FROM pg_stat_io WHERE io_path = 'Shared' \gset
+CREATE TABLE test_io_shared_writes(a int);
+INSERT INTO test_io_shared_writes SELECT i FROM generate_series(1,100)i;
+CHECKPOINT;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(write) AS io_sum_shared_writes_after FROM pg_stat_io WHERE io_path = 'Shared' \gset
+SELECT :io_sum_shared_writes_after > :io_sum_shared_writes_before;
+ ?column?
+----------
+ t
+(1 row)
+
+DROP TABLE test_io_shared_writes;
+-- Test that extends of temporary tables are tracked in pg_stat_io
+CREATE TEMPORARY TABLE test_io_local_extends(a int);
+SELECT sum(extend) AS io_sum_local_extends_before FROM pg_stat_io WHERE io_path = 'Local' \gset
+INSERT INTO test_io_local_extends VALUES(1);
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(extend) AS io_sum_local_extends_after FROM pg_stat_io WHERE io_path = 'Local' \gset
+SELECT :io_sum_local_extends_after > :io_sum_local_extends_before;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Test that, when using a Strategy, reusing buffers from the Strategy ring
+-- count as "Strategy" allocs.
+CREATE TABLE test_io_strategy_stats(a INT, b INT);
+ALTER TABLE test_io_strategy_stats SET (autovacuum_enabled = 'false');
+INSERT INTO test_io_strategy_stats SELECT i, i from generate_series(1,8000)i;
+-- Ensure that the next VACUUM will need to perform IO
+VACUUM (FULL) test_io_strategy_stats;
+SELECT sum(alloc) AS io_sum_strategy_allocs_before FROM pg_stat_io WHERE io_path = 'Strategy' \gset
+VACUUM (PARALLEL 0) test_io_strategy_stats;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(alloc) AS io_sum_strategy_allocs_after FROM pg_stat_io WHERE io_path = 'Strategy' \gset
+SELECT :io_sum_strategy_allocs_after > :io_sum_strategy_allocs_before;
+ ?column?
+----------
+ t
+(1 row)
+
+DROP TABLE test_io_strategy_stats;
-- End of Stats Test
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index 3f3cf8fb56..fbd3977605 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -285,4 +285,38 @@ SELECT pg_stat_get_live_tuples(:drop_stats_test_subxact_oid);
DROP TABLE trunc_stats_test, trunc_stats_test1, trunc_stats_test2, trunc_stats_test3, trunc_stats_test4;
DROP TABLE prevstats;
+
+-- Test that writes to Shared Buffers are tracked in pg_stat_io
+SELECT sum(write) AS io_sum_shared_writes_before FROM pg_stat_io WHERE io_path = 'Shared' \gset
+CREATE TABLE test_io_shared_writes(a int);
+INSERT INTO test_io_shared_writes SELECT i FROM generate_series(1,100)i;
+CHECKPOINT;
+SELECT pg_stat_force_next_flush();
+SELECT sum(write) AS io_sum_shared_writes_after FROM pg_stat_io WHERE io_path = 'Shared' \gset
+SELECT :io_sum_shared_writes_after > :io_sum_shared_writes_before;
+DROP TABLE test_io_shared_writes;
+
+-- Test that extends of temporary tables are tracked in pg_stat_io
+CREATE TEMPORARY TABLE test_io_local_extends(a int);
+SELECT sum(extend) AS io_sum_local_extends_before FROM pg_stat_io WHERE io_path = 'Local' \gset
+INSERT INTO test_io_local_extends VALUES(1);
+SELECT pg_stat_force_next_flush();
+SELECT sum(extend) AS io_sum_local_extends_after FROM pg_stat_io WHERE io_path = 'Local' \gset
+SELECT :io_sum_local_extends_after > :io_sum_local_extends_before;
+
+-- Test that, when using a Strategy, reusing buffers from the Strategy ring
+-- count as "Strategy" allocs.
+CREATE TABLE test_io_strategy_stats(a INT, b INT);
+ALTER TABLE test_io_strategy_stats SET (autovacuum_enabled = 'false');
+INSERT INTO test_io_strategy_stats SELECT i, i from generate_series(1,8000)i;
+-- Ensure that the next VACUUM will need to perform IO
+VACUUM (FULL) test_io_strategy_stats;
+SELECT sum(alloc) AS io_sum_strategy_allocs_before FROM pg_stat_io WHERE io_path = 'Strategy' \gset
+VACUUM (PARALLEL 0) test_io_strategy_stats;
+SELECT pg_stat_force_next_flush();
+SELECT sum(alloc) AS io_sum_strategy_allocs_after FROM pg_stat_io WHERE io_path = 'Strategy' \gset
+SELECT :io_sum_strategy_allocs_after > :io_sum_strategy_allocs_before;
+DROP TABLE test_io_strategy_stats;
+
+
-- End of Stats Test
--
2.34.1
Attachment: v24-0002-Track-IO-operation-statistics.patch (text/x-patch)
From 16016c4d1989ca154b4894912281d290e093639e Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 29 Jun 2022 18:37:42 -0400
Subject: [PATCH v24 2/3] Track IO operation statistics
Introduce "IOOp", an IO operation done by a backend, and "IOPath", the
location or type of IO done by a backend. For example, the checkpointer
may write a shared buffer out. This would be counted as an IOOp write on
an IOPath IOPATH_SHARED by BackendType "checkpointer".
Each IOOp (alloc, fsync, extend, write) is counted per IOPath
(direct, local, shared, or strategy) through a call to
pgstat_count_io_op().
The primary concern of these statistics is IO operations on data blocks
during the course of normal database operations. IO done by, for
example, the archiver or syslogger is not counted in these statistics.
IOPATH_LOCAL and IOPATH_SHARED IOPaths concern operations on local
and shared buffers.
The IOPATH_STRATEGY IOPath concerns buffers alloc'd/written/read/fsync'd
as part of a BufferAccessStrategy.
The IOPATH_DIRECT IOPath concerns blocks of IO which are read, written,
or fsync'd using smgrwrite/extend/immedsync directly (as opposed to
through [Local]BufferAlloc()).
Note that this commit does not add code to increment IOPATH_DIRECT. A
future patch adding wrappers for smgrwrite(), smgrextend(), and
smgrimmedsync() would provide a good location to call
pgstat_count_io_op() for unbuffered IO and avoid regressions for future
users of these functions.
IOOP_ALLOC is counted for IOPATH_SHARED and IOPATH_LOCAL whenever a
buffer is acquired through [Local]BufferAlloc(). IOOP_ALLOC is invalid
for IOPATH_DIRECT. IOOP_ALLOC for IOPATH_STRATEGY is counted whenever a
buffer already in the strategy ring is reused.
Stats on IOOps for all IOPaths for a backend are initially accumulated
locally.
Later they are flushed to shared memory and accumulated with those from
all other backends, exited and live.
Some BackendTypes will not execute pgstat_report_stat() and thus must
explicitly call pgstat_flush_io_ops() in order to flush their backend
local IO operation statistics to shared memory.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>, Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
src/backend/bootstrap/bootstrap.c | 1 +
src/backend/postmaster/checkpointer.c | 1 +
src/backend/postmaster/walwriter.c | 11 --
src/backend/storage/buffer/bufmgr.c | 51 +++++-
src/backend/storage/buffer/freelist.c | 48 ++++-
src/backend/storage/buffer/localbuf.c | 5 +
src/backend/storage/sync/sync.c | 2 +
src/backend/utils/activity/Makefile | 1 +
src/backend/utils/activity/pgstat.c | 24 +++
src/backend/utils/activity/pgstat_bgwriter.c | 7 +-
.../utils/activity/pgstat_checkpointer.c | 7 +-
src/backend/utils/activity/pgstat_database.c | 8 +-
src/backend/utils/activity/pgstat_io_ops.c | 168 ++++++++++++++++++
src/backend/utils/activity/pgstat_relation.c | 14 +-
src/backend/utils/activity/pgstat_wal.c | 4 +-
src/backend/utils/adt/pgstatfuncs.c | 4 +-
src/include/miscadmin.h | 2 +
src/include/pgstat.h | 53 ++++++
src/include/storage/buf_internals.h | 2 +-
src/include/utils/backend_status.h | 34 ++++
src/include/utils/pgstat_internal.h | 17 ++
21 files changed, 432 insertions(+), 32 deletions(-)
create mode 100644 src/backend/utils/activity/pgstat_io_ops.c
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 088556ab54..963b05321e 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -33,6 +33,7 @@
#include "miscadmin.h"
#include "nodes/makefuncs.h"
#include "pg_getopt.h"
+#include "pgstat.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
#include "storage/condition_variable.h"
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 5fc076fc14..a06331e1eb 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -1116,6 +1116,7 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
if (!AmBackgroundWriterProcess())
CheckpointerShmem->num_backend_fsync++;
LWLockRelease(CheckpointerCommLock);
+ pgstat_count_io_op(IOOP_FSYNC, IOPATH_SHARED);
return false;
}
diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c
index e926f8c27c..beb46dcb55 100644
--- a/src/backend/postmaster/walwriter.c
+++ b/src/backend/postmaster/walwriter.c
@@ -293,18 +293,7 @@ HandleWalWriterInterrupts(void)
}
if (ShutdownRequestPending)
- {
- /*
- * Force reporting remaining WAL statistics at process exit.
- *
- * Since pgstat_report_wal is invoked with 'force' is false in main
- * loop to avoid overloading the cumulative stats system, there may
- * exist unreported stats counters for the WAL writer.
- */
- pgstat_report_wal(true);
-
proc_exit(0);
- }
/* Perform logging of memory contexts of this process */
if (LogMemoryContextPending)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index c7d7abcd73..846297f273 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -482,7 +482,7 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
BlockNumber blockNum,
BufferAccessStrategy strategy,
bool *foundPtr);
-static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+static void FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOPath iopath);
static void FindAndDropRelationBuffers(RelFileLocator rlocator,
ForkNumber forkNum,
BlockNumber nForkBlock,
@@ -980,6 +980,16 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (isExtend)
{
+ IOPath io_path;
+
+ if (isLocalBuf)
+ io_path = IOPATH_LOCAL;
+ else if (strategy != NULL)
+ io_path = IOPATH_STRATEGY;
+ else
+ io_path = IOPATH_SHARED;
+
+ pgstat_count_io_op(IOOP_EXTEND, io_path);
/* new buffers are zero-filled */
MemSet((char *) bufBlock, 0, BLCKSZ);
/* don't set checksum for all-zero page */
@@ -1180,6 +1190,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
/* Loop here in case we have to try another victim buffer */
for (;;)
{
+ bool write_from_ring = false;
/*
* Ensure, while the spinlock's not yet held, that there's a free
* refcount entry.
@@ -1227,6 +1238,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf),
LW_SHARED))
{
+ IOPath iopath;
/*
* If using a nondefault strategy, and writing the buffer
* would require a WAL flush, let the strategy decide whether
@@ -1244,7 +1256,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
UnlockBufHdr(buf, buf_state);
if (XLogNeedsFlush(lsn) &&
- StrategyRejectBuffer(strategy, buf))
+ StrategyRejectBuffer(strategy, buf, &write_from_ring))
{
/* Drop lock/pin and loop around for another buffer */
LWLockRelease(BufferDescriptorGetContentLock(buf));
@@ -1253,13 +1265,27 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
}
}
+ /*
+ * When a strategy is in use, if the target dirty buffer is an existing
+ * strategy buffer being reused, count this as a strategy write for the
+ * purposes of IO Operations statistics tracking.
+ *
+ * All dirty shared buffers upon first being added to the ring will be
+ * counted as shared buffer writes.
+ *
+	 * When a strategy is not in use, the write can only be a
+ * "regular" write of a dirty shared buffer.
+ */
+
+ iopath = write_from_ring ? IOPATH_STRATEGY : IOPATH_SHARED;
+
/* OK, do the I/O */
TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(forkNum, blockNum,
smgr->smgr_rlocator.locator.spcOid,
smgr->smgr_rlocator.locator.dbOid,
smgr->smgr_rlocator.locator.relNumber);
- FlushBuffer(buf, NULL);
+ FlushBuffer(buf, NULL, iopath);
LWLockRelease(BufferDescriptorGetContentLock(buf));
ScheduleBufferTagForWriteback(&BackendWritebackContext,
@@ -2563,7 +2589,7 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOPATH_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
@@ -2810,9 +2836,12 @@ BufferGetTag(Buffer buffer, RelFileLocator *rlocator, ForkNumber *forknum,
*
* If the caller has an smgr reference for the buffer's relation, pass it
* as the second parameter. If not, pass NULL.
+ *
+ * IOPath will always be IOPATH_SHARED except when a buffer access strategy is
+ * used and the buffer being flushed is a buffer from the strategy ring.
*/
static void
-FlushBuffer(BufferDesc *buf, SMgrRelation reln)
+FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOPath iopath)
{
XLogRecPtr recptr;
ErrorContextCallback errcallback;
@@ -2892,6 +2921,8 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
*/
bufToWrite = PageSetChecksumCopy((Page) bufBlock, buf->tag.blockNum);
+ pgstat_count_io_op(IOOP_WRITE, iopath);
+
if (track_io_timing)
INSTR_TIME_SET_CURRENT(io_start);
@@ -3539,6 +3570,8 @@ FlushRelationBuffers(Relation rel)
localpage,
false);
+ pgstat_count_io_op(IOOP_WRITE, IOPATH_LOCAL);
+
buf_state &= ~(BM_DIRTY | BM_JUST_DIRTIED);
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
@@ -3574,7 +3607,7 @@ FlushRelationBuffers(Relation rel)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, RelationGetSmgr(rel));
+ FlushBuffer(bufHdr, RelationGetSmgr(rel), IOPATH_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
@@ -3669,7 +3702,7 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, srelent->srel);
+ FlushBuffer(bufHdr, srelent->srel, IOPATH_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
@@ -3877,7 +3910,7 @@ FlushDatabaseBuffers(Oid dbid)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOPATH_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
@@ -3904,7 +3937,7 @@ FlushOneBuffer(Buffer buffer)
Assert(LWLockHeldByMe(BufferDescriptorGetContentLock(bufHdr)));
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOPATH_SHARED);
}
/*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 990e081aae..8ab88341ce 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -15,6 +15,7 @@
*/
#include "postgres.h"
+#include "pgstat.h"
#include "port/atomics.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
@@ -212,8 +213,18 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
if (strategy != NULL)
{
buf = GetBufferFromRing(strategy, buf_state);
- if (buf != NULL)
+ if (strategy->current_was_in_ring)
+ {
+ /*
+ * When a strategy is in use, reused buffers from the strategy ring
+ * will be counted as allocations for the purposes of IO Operation
+ * statistics tracking. However, even when a strategy is in use, if
+ * a new buffer must be allocated from shared buffers and added to
+	 * the ring, this is counted as an IOPATH_SHARED allocation.
+ */
+ pgstat_count_io_op(IOOP_ALLOC, IOPATH_STRATEGY);
return buf;
+ }
}
/*
@@ -247,6 +258,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
* the rate of buffer consumption. Note that buffers recycled by a
* strategy object are intentionally not counted here.
*/
+ pgstat_count_io_op(IOOP_ALLOC, IOPATH_SHARED);
pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);
/*
@@ -682,16 +694,37 @@ AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
* if this buffer should be written and re-used.
*/
bool
-StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
+StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool *write_from_ring)
{
- /* We only do this in bulkread mode */
+
+ /*
+	 * We only reject reusing and writing out the strategy buffer in
+ * bulkread mode.
+ */
if (strategy->btype != BAS_BULKREAD)
+ {
+ /*
+ * If the buffer was from the ring and we are not rejecting it, consider it
+ * a write of a strategy buffer.
+ */
+ if (strategy->current_was_in_ring)
+ *write_from_ring = true;
return false;
+ }
- /* Don't muck with behavior of normal buffer-replacement strategy */
+ /*
+ * Don't muck with behavior of normal buffer-replacement strategy.
+ * Though we are not rejecting this buffer, write_from_ring will remain false
+ * because shared buffers that are added to the ring, either initially or as
+ * part of an expansion, are not considered strategy writes for the purposes
+ * of IO Operation statistics.
+ */
if (!strategy->current_was_in_ring ||
strategy->buffers[strategy->current] != BufferDescriptorGetBuffer(buf))
+ {
+ *write_from_ring = false;
return false;
+ }
/*
* Remove the dirty buffer from the ring; necessary to prevent infinite
@@ -699,5 +732,12 @@ StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
*/
strategy->buffers[strategy->current] = InvalidBuffer;
+ /*
+	 * Though the caller should not use this flag since the buffer is being rejected
+ * (and it should have been initialized to false anyway), set it here for
+ * clarity.
+ */
+ *write_from_ring = false;
+
return true;
}
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 9c038851d7..58738d6684 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -15,6 +15,7 @@
*/
#include "postgres.h"
+#include "pgstat.h"
#include "access/parallel.h"
#include "catalog/catalog.h"
#include "executor/instrument.h"
@@ -123,6 +124,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
if (LocalBufHash == NULL)
InitLocalBuffers();
+ pgstat_count_io_op(IOOP_ALLOC, IOPATH_LOCAL);
+
/* See if the desired buffer already exists */
hresult = (LocalBufferLookupEnt *)
hash_search(LocalBufHash, (void *) &newTag, HASH_FIND, NULL);
@@ -226,6 +229,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
localpage,
false);
+ pgstat_count_io_op(IOOP_WRITE, IOPATH_LOCAL);
+
/* Mark not-dirty now in case we error out below */
buf_state &= ~BM_DIRTY;
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index e1fb631003..20e259edef 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -432,6 +432,8 @@ ProcessSyncRequests(void)
total_elapsed += elapsed;
processed++;
+ pgstat_count_io_op(IOOP_FSYNC, IOPATH_SHARED);
+
if (log_checkpoints)
elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f ms",
processed,
diff --git a/src/backend/utils/activity/Makefile b/src/backend/utils/activity/Makefile
index a2e8507fd6..0098785089 100644
--- a/src/backend/utils/activity/Makefile
+++ b/src/backend/utils/activity/Makefile
@@ -22,6 +22,7 @@ OBJS = \
pgstat_checkpointer.o \
pgstat_database.o \
pgstat_function.o \
+ pgstat_io_ops.o \
pgstat_relation.o \
pgstat_replslot.o \
pgstat_shmem.o \
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 88e5dd1b2b..52924e64dd 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -359,6 +359,15 @@ static const PgStat_KindInfo pgstat_kind_infos[PGSTAT_NUM_KINDS] = {
.snapshot_cb = pgstat_checkpointer_snapshot_cb,
},
+ [PGSTAT_KIND_IOOPS] = {
+ .name = "io_ops",
+
+ .fixed_amount = true,
+
+ .reset_all_cb = pgstat_io_ops_reset_all_cb,
+ .snapshot_cb = pgstat_io_ops_snapshot_cb,
+ },
+
[PGSTAT_KIND_SLRU] = {
.name = "slru",
@@ -628,6 +637,9 @@ pgstat_report_stat(bool force)
/* flush database / relation / function / ... stats */
partial_flush |= pgstat_flush_pending_entries(nowait);
+ /* flush IO Operations stats */
+ partial_flush |= pgstat_flush_io_ops(nowait);
+
/* flush wal stats */
partial_flush |= pgstat_flush_wal(nowait);
@@ -1312,6 +1324,12 @@ pgstat_write_statsfile(void)
pgstat_build_snapshot_fixed(PGSTAT_KIND_CHECKPOINTER);
write_chunk_s(fpout, &pgStatLocal.snapshot.checkpointer);
+ /*
+ * Write IO Operations stats struct
+ */
+ pgstat_build_snapshot_fixed(PGSTAT_KIND_IOOPS);
+ write_chunk_s(fpout, &pgStatLocal.snapshot.io_path_ops);
+
/*
* Write SLRU stats struct
*/
@@ -1486,6 +1504,12 @@ pgstat_read_statsfile(void)
if (!read_chunk_s(fpin, &shmem->checkpointer.stats))
goto error;
+ /*
+ * Read IO Operations stats struct
+ */
+ if (!read_chunk_s(fpin, &shmem->io_ops.stats))
+ goto error;
+
/*
* Read SLRU stats struct
*/
diff --git a/src/backend/utils/activity/pgstat_bgwriter.c b/src/backend/utils/activity/pgstat_bgwriter.c
index fbb1edc527..3d7f90a1b7 100644
--- a/src/backend/utils/activity/pgstat_bgwriter.c
+++ b/src/backend/utils/activity/pgstat_bgwriter.c
@@ -24,7 +24,7 @@ PgStat_BgWriterStats PendingBgWriterStats = {0};
/*
- * Report bgwriter statistics
+ * Report bgwriter and IO Operation statistics
*/
void
pgstat_report_bgwriter(void)
@@ -56,6 +56,11 @@ pgstat_report_bgwriter(void)
* Clear out the statistics buffer, so it can be re-used.
*/
MemSet(&PendingBgWriterStats, 0, sizeof(PendingBgWriterStats));
+
+ /*
+ * Report IO Operations statistics
+ */
+ pgstat_flush_io_ops(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_checkpointer.c b/src/backend/utils/activity/pgstat_checkpointer.c
index af8d513e7b..cfcf127210 100644
--- a/src/backend/utils/activity/pgstat_checkpointer.c
+++ b/src/backend/utils/activity/pgstat_checkpointer.c
@@ -24,7 +24,7 @@ PgStat_CheckpointerStats PendingCheckpointerStats = {0};
/*
- * Report checkpointer statistics
+ * Report checkpointer and IO Operation statistics
*/
void
pgstat_report_checkpointer(void)
@@ -62,6 +62,11 @@ pgstat_report_checkpointer(void)
* Clear out the statistics buffer, so it can be re-used.
*/
MemSet(&PendingCheckpointerStats, 0, sizeof(PendingCheckpointerStats));
+
+ /*
+ * Report IO Operation statistics
+ */
+ pgstat_flush_io_ops(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_database.c b/src/backend/utils/activity/pgstat_database.c
index d9275611f0..d3963f59d0 100644
--- a/src/backend/utils/activity/pgstat_database.c
+++ b/src/backend/utils/activity/pgstat_database.c
@@ -47,7 +47,8 @@ pgstat_drop_database(Oid databaseid)
}
/*
- * Called from autovacuum.c to report startup of an autovacuum process.
+ * Called from autovacuum.c to report startup of an autovacuum process and
+ * flush IO Operation statistics.
* We are called before InitPostgres is done, so can't rely on MyDatabaseId;
* the db OID must be passed in, instead.
*/
@@ -72,6 +73,11 @@ pgstat_report_autovac(Oid dboid)
dbentry->stats.last_autovac_time = GetCurrentTimestamp();
pgstat_unlock_entry(entry_ref);
+
+ /*
+ * Report IO Operation statistics
+ */
+ pgstat_flush_io_ops(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_io_ops.c b/src/backend/utils/activity/pgstat_io_ops.c
new file mode 100644
index 0000000000..f037d2c7c5
--- /dev/null
+++ b/src/backend/utils/activity/pgstat_io_ops.c
@@ -0,0 +1,168 @@
+/* -------------------------------------------------------------------------
+ *
+ * pgstat_io_ops.c
+ * Implementation of IO operation statistics.
+ *
+ * This file contains the implementation of IO operation statistics. It is kept
+ * separate from pgstat.c to enforce the line between the statistics access /
+ * storage implementation and the details about individual types of
+ * statistics.
+ *
+ * Copyright (c) 2001-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/utils/activity/pgstat_io_ops.c
+ * -------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "utils/pgstat_internal.h"
+
+static PgStat_IOPathOps pending_IOOpStats;
+bool have_ioopstats = false;
+
+
+/*
+ * Flush out locally pending IO Operation statistics entries
+ *
+ * If nowait is true and the lock could not be acquired, this function
+ * returns true without flushing the pending entries.
+ *
+ * Otherwise the pending entries are flushed to shared memory, the local
+ * counters are reset, and false is returned.
+ */
+bool
+pgstat_flush_io_ops(bool nowait)
+{
+ PgStat_IOPathOps *stats_shmem;
+
+ if (!have_ioopstats)
+ return false;
+
+ stats_shmem =
+ &pgStatLocal.shmem->io_ops.stats[backend_type_get_idx(MyBackendType)];
+
+ if (!nowait)
+ LWLockAcquire(&stats_shmem->lock, LW_EXCLUSIVE);
+ else if (!LWLockConditionalAcquire(&stats_shmem->lock, LW_EXCLUSIVE))
+ return true;
+
+
+ for (int i = 0; i < IOPATH_NUM_TYPES; i++)
+ {
+ PgStat_IOOpCounters *sharedent = &stats_shmem->data[i];
+ PgStat_IOOpCounters *pendingent = &pending_IOOpStats.data[i];
+
+#define IO_OP_ACC(fld) sharedent->fld += pendingent->fld
+ IO_OP_ACC(allocs);
+ IO_OP_ACC(extends);
+ IO_OP_ACC(fsyncs);
+ IO_OP_ACC(writes);
+#undef IO_OP_ACC
+ }
+
+ LWLockRelease(&stats_shmem->lock);
+
+ memset(&pending_IOOpStats, 0, sizeof(pending_IOOpStats));
+
+ have_ioopstats = false;
+
+ return false;
+}
+
+void
+pgstat_io_ops_snapshot_cb(void)
+{
+ PgStatShared_BackendIOPathOps *stats_shmem = &pgStatLocal.shmem->io_ops;
+
+ LWLockAcquire(&stats_shmem->lock, LW_SHARED);
+
+ memcpy(pgStatLocal.snapshot.io_path_ops, &stats_shmem->stats,
+ sizeof(stats_shmem->stats));
+
+ LWLockRelease(&stats_shmem->lock);
+}
+
+void
+pgstat_io_ops_reset_all_cb(TimestampTz ts)
+{
+ PgStatShared_BackendIOPathOps *stats_shmem = &pgStatLocal.shmem->io_ops;
+
+ LWLockAcquire(&stats_shmem->lock, LW_EXCLUSIVE);
+
+ memset(&stats_shmem->stats, 0, sizeof(stats_shmem->stats));
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ stats_shmem->stats[i].stat_reset_timestamp = ts;
+
+ LWLockRelease(&stats_shmem->lock);
+}
+
+void
+pgstat_count_io_op(IOOp io_op, IOPath io_path)
+{
+ PgStat_IOOpCounters *pending_counters = &pending_IOOpStats.data[io_path];
+
+ switch (io_op)
+ {
+ case IOOP_ALLOC:
+ pending_counters->allocs++;
+ break;
+ case IOOP_EXTEND:
+ pending_counters->extends++;
+ break;
+ case IOOP_FSYNC:
+ pending_counters->fsyncs++;
+ break;
+ case IOOP_WRITE:
+ pending_counters->writes++;
+ break;
+ }
+
+ have_ioopstats = true;
+}
+
+PgStat_IOPathOps *
+pgstat_fetch_backend_io_path_ops(void)
+{
+ pgstat_snapshot_fixed(PGSTAT_KIND_IOOPS);
+
+ return pgStatLocal.snapshot.io_path_ops;
+}
+
+const char *
+pgstat_io_path_desc(IOPath io_path)
+{
+ switch (io_path)
+ {
+ case IOPATH_DIRECT:
+ return "Direct";
+ case IOPATH_LOCAL:
+ return "Local";
+ case IOPATH_SHARED:
+ return "Shared";
+ case IOPATH_STRATEGY:
+ return "Strategy";
+ }
+
+ elog(ERROR, "unrecognized IOPath value: %d", io_path);
+}
+
+const char *
+pgstat_io_op_desc(IOOp io_op)
+{
+ switch (io_op)
+ {
+ case IOOP_ALLOC:
+ return "Alloc";
+ case IOOP_EXTEND:
+ return "Extend";
+ case IOOP_FSYNC:
+ return "Fsync";
+ case IOOP_WRITE:
+ return "Write";
+ }
+
+ elog(ERROR, "unrecognized IOOp value: %d", io_op);
+}
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index a846d9ffb6..0f8048eaa9 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -205,7 +205,7 @@ pgstat_drop_relation(Relation rel)
}
/*
- * Report that the table was just vacuumed.
+ * Report that the table was just vacuumed and flush IO Operation statistics.
*/
void
pgstat_report_vacuum(Oid tableoid, bool shared,
@@ -257,10 +257,15 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
}
pgstat_unlock_entry(entry_ref);
+
+ /*
+ * Report IO Operations statistics
+ */
+ pgstat_flush_io_ops(false);
}
/*
- * Report that the table was just analyzed.
+ * Report that the table was just analyzed and flush IO Operation statistics.
*
* Caller must provide new live- and dead-tuples estimates, as well as a
* flag indicating whether to reset the changes_since_analyze counter.
@@ -340,6 +345,11 @@ pgstat_report_analyze(Relation rel,
}
pgstat_unlock_entry(entry_ref);
+
+ /*
+ * Report IO Operations statistics
+ */
+ pgstat_flush_io_ops(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_wal.c b/src/backend/utils/activity/pgstat_wal.c
index 5a878bd115..9cac407b42 100644
--- a/src/backend/utils/activity/pgstat_wal.c
+++ b/src/backend/utils/activity/pgstat_wal.c
@@ -34,7 +34,7 @@ static WalUsage prevWalUsage;
/*
* Calculate how much WAL usage counters have increased and update
- * shared statistics.
+ * shared WAL and IO Operation statistics.
*
* Must be called by processes that generate WAL, that do not call
* pgstat_report_stat(), like walwriter.
@@ -43,6 +43,8 @@ void
pgstat_report_wal(bool force)
{
pgstat_flush_wal(force);
+
+ pgstat_flush_io_ops(force);
}
/*
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 893690dad5..6259cc4f4c 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -2104,6 +2104,8 @@ pg_stat_reset_shared(PG_FUNCTION_ARGS)
pgstat_reset_of_kind(PGSTAT_KIND_BGWRITER);
pgstat_reset_of_kind(PGSTAT_KIND_CHECKPOINTER);
}
+ else if (strcmp(target, "io") == 0)
+ pgstat_reset_of_kind(PGSTAT_KIND_IOOPS);
else if (strcmp(target, "recovery_prefetch") == 0)
XLogPrefetchResetStats();
else if (strcmp(target, "wal") == 0)
@@ -2112,7 +2114,7 @@ pg_stat_reset_shared(PG_FUNCTION_ARGS)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("unrecognized reset target: \"%s\"", target),
- errhint("Target must be \"archiver\", \"bgwriter\", \"recovery_prefetch\", or \"wal\".")));
+ errhint("Target must be \"archiver\", \"bgwriter\", \"io\", \"recovery_prefetch\", or \"wal\".")));
PG_RETURN_VOID();
}
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 5276bf25a1..61e95135f2 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -331,6 +331,8 @@ typedef enum BackendType
B_WAL_WRITER,
} BackendType;
+#define BACKEND_NUM_TYPES B_WAL_WRITER
+
extern PGDLLIMPORT BackendType MyBackendType;
extern const char *GetBackendTypeDesc(BackendType backendType);
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index ac28f813b4..36a4b89a58 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -14,6 +14,7 @@
#include "datatype/timestamp.h"
#include "portability/instr_time.h"
#include "postmaster/pgarch.h" /* for MAX_XFN_CHARS */
+#include "storage/lwlock.h"
#include "utils/backend_progress.h" /* for backward compatibility */
#include "utils/backend_status.h" /* for backward compatibility */
#include "utils/relcache.h"
@@ -48,6 +49,7 @@ typedef enum PgStat_Kind
PGSTAT_KIND_ARCHIVER,
PGSTAT_KIND_BGWRITER,
PGSTAT_KIND_CHECKPOINTER,
+ PGSTAT_KIND_IOOPS,
PGSTAT_KIND_SLRU,
PGSTAT_KIND_WAL,
} PgStat_Kind;
@@ -276,6 +278,45 @@ typedef struct PgStat_CheckpointerStats
PgStat_Counter buf_fsync_backend;
} PgStat_CheckpointerStats;
+/*
+ * Types related to counting IO Operations for various IO Paths
+ */
+
+typedef enum IOOp
+{
+ IOOP_ALLOC,
+ IOOP_EXTEND,
+ IOOP_FSYNC,
+ IOOP_WRITE,
+} IOOp;
+
+#define IOOP_NUM_TYPES (IOOP_WRITE + 1)
+
+typedef enum IOPath
+{
+ IOPATH_DIRECT,
+ IOPATH_LOCAL,
+ IOPATH_SHARED,
+ IOPATH_STRATEGY,
+} IOPath;
+
+#define IOPATH_NUM_TYPES (IOPATH_STRATEGY + 1)
+
+typedef struct PgStat_IOOpCounters
+{
+ PgStat_Counter allocs;
+ PgStat_Counter extends;
+ PgStat_Counter fsyncs;
+ PgStat_Counter writes;
+} PgStat_IOOpCounters;
+
+typedef struct PgStat_IOPathOps
+{
+ LWLock lock;
+ PgStat_IOOpCounters data[IOPATH_NUM_TYPES];
+ TimestampTz stat_reset_timestamp;
+} PgStat_IOPathOps;
+
typedef struct PgStat_StatDBEntry
{
PgStat_Counter n_xact_commit;
@@ -453,6 +494,18 @@ extern void pgstat_report_checkpointer(void);
extern PgStat_CheckpointerStats *pgstat_fetch_stat_checkpointer(void);
+/*
+ * Functions in pgstat_io_ops.c
+ */
+
+extern void pgstat_count_io_op(IOOp io_op, IOPath io_path);
+extern bool pgstat_flush_io_ops(bool nowait);
+extern PgStat_IOPathOps *pgstat_fetch_backend_io_path_ops(void);
+extern PgStat_Counter pgstat_fetch_cumulative_io_ops(IOPath io_path, IOOp io_op);
+extern const char *pgstat_io_op_desc(IOOp io_op);
+extern const char *pgstat_io_path_desc(IOPath io_path);
+
+
/*
* Functions in pgstat_database.c
*/
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 69e45900ba..b69c5f7e3c 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -313,7 +313,7 @@ extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
uint32 *buf_state);
extern void StrategyFreeBuffer(BufferDesc *buf);
extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
- BufferDesc *buf);
+ BufferDesc *buf, bool *write_from_ring);
extern int StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
extern void StrategyNotifyBgWriter(int bgwprocno);
diff --git a/src/include/utils/backend_status.h b/src/include/utils/backend_status.h
index 7403bca25e..49d062b1af 100644
--- a/src/include/utils/backend_status.h
+++ b/src/include/utils/backend_status.h
@@ -306,6 +306,40 @@ extern const char *pgstat_get_crashed_backend_activity(int pid, char *buffer,
int buflen);
extern uint64 pgstat_get_my_query_id(void);
+/* Utility functions */
+
+/*
+ * When maintaining an array of information about all valid BackendTypes, in
+ * order to avoid wasting the 0th spot, use this helper to convert a valid
+ * BackendType to a valid location in the array (given that no spot is
+ * maintained for B_INVALID BackendType).
+ */
+static inline int backend_type_get_idx(BackendType backend_type)
+{
+ /*
+ * backend_type must be one of the valid backend types. If caller is
+ * maintaining backend information in an array that includes B_INVALID,
+ * this function is unnecessary.
+ */
+ Assert(backend_type > B_INVALID && backend_type <= BACKEND_NUM_TYPES);
+ return backend_type - 1;
+}
+
+/*
+ * When using a value from an array of information about all valid
+ * BackendTypes, add 1 to the index before using it as a BackendType to adjust
+ * for not maintaining a spot for B_INVALID BackendType.
+ */
+static inline BackendType idx_get_backend_type(int idx)
+{
+ int backend_type = idx + 1;
+ /*
+ * If the array includes a spot for B_INVALID BackendType this function is
+ * not required.
+ */
+ Assert(backend_type > B_INVALID && backend_type <= BACKEND_NUM_TYPES);
+ return backend_type;
+}
/* ----------
* Support functions for the SQL-callable functions to
diff --git a/src/include/utils/pgstat_internal.h b/src/include/utils/pgstat_internal.h
index 9303d05427..adffdd147d 100644
--- a/src/include/utils/pgstat_internal.h
+++ b/src/include/utils/pgstat_internal.h
@@ -329,6 +329,12 @@ typedef struct PgStatShared_Checkpointer
PgStat_CheckpointerStats reset_offset;
} PgStatShared_Checkpointer;
+typedef struct PgStatShared_BackendIOPathOps
+{
+ LWLock lock;
+ PgStat_IOPathOps stats[BACKEND_NUM_TYPES];
+} PgStatShared_BackendIOPathOps;
+
typedef struct PgStatShared_SLRU
{
/* lock protects ->stats */
@@ -419,6 +425,7 @@ typedef struct PgStat_ShmemControl
PgStatShared_Archiver archiver;
PgStatShared_BgWriter bgwriter;
PgStatShared_Checkpointer checkpointer;
+ PgStatShared_BackendIOPathOps io_ops;
PgStatShared_SLRU slru;
PgStatShared_Wal wal;
} PgStat_ShmemControl;
@@ -442,6 +449,8 @@ typedef struct PgStat_Snapshot
PgStat_CheckpointerStats checkpointer;
+ PgStat_IOPathOps io_path_ops[BACKEND_NUM_TYPES];
+
PgStat_SLRUStats slru[SLRU_NUM_ELEMENTS];
PgStat_WalStats wal;
@@ -549,6 +558,14 @@ extern void pgstat_database_reset_timestamp_cb(PgStatShared_Common *header, Time
extern bool pgstat_function_flush_cb(PgStat_EntryRef *entry_ref, bool nowait);
+/*
+ * Functions in pgstat_io_ops.c
+ */
+
+extern void pgstat_io_ops_snapshot_cb(void);
+extern void pgstat_io_ops_reset_all_cb(TimestampTz ts);
+
+
/*
* Functions in pgstat_relation.c
*/
--
2.34.1
Hi,
On 2022-07-12 12:19:06 -0400, Melanie Plageman wrote:
I also realized that I am not differentiating between IOPATH_SHARED and
IOPATH_STRATEGY for IOOP_FSYNC. But, given that we don't know what type
of buffer we are fsync'ing by the time we call register_dirty_segment(),
I'm not sure how we would fix this.

I think there scarcely happens flush for strategy-loaded buffers. If
that is sensible, IOOP_FSYNC would not make much sense for
IOPATH_STRATEGY.

Why would it be less likely for a backend to do its own fsync when
flushing a dirty strategy buffer than a regular dirty shared buffer?
We really just don't expect a backend to do many segment fsyncs at
all. Otherwise there's something wrong with the forwarding mechanism.
It'd be different if we tracked WAL fsyncs more granularly - which would be
quite interesting - but that's something for another day^Wpatch.
Wonder if it's worth making the lock specific to the backend type?
I've added another Lock into PgStat_IOPathOps so that each BackendType
can be locked separately. But, I've also kept the lock in
PgStatShared_BackendIOPathOps so that reset_all and snapshot could be
done easily.

Looks fine about the lock separation.
Actually, I think it is not safe to use both of these locks. So for
picking one method, it is probably better to go with the locks in
PgStat_IOPathOps, it will be more efficient for flush (and not for
fetching and resetting), so that is probably the way to go, right?
I think it's good to just use one kind of lock, and efficiency of snapshotting
/ resetting is nearly irrelevant. But I don't see why it's not safe to use
both kinds of locks?
Looks fine, but I think pgstat_flush_io_ops() needs more comments like
other pgstat_flush_* functions.

+	for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+		stats_shmem->stats[i].stat_reset_timestamp = ts;

I'm not sure we need a separate reset timestamp for each backend type,
but the SLRU counter does the same thing.

Yes, I think for SLRU stats it is because you can reset individual SLRU
stats. Also there is no wrapper data structure to put it in. I could
keep it in PgStatShared_BackendIOPathOps since you have to reset all IO
operation stats at once, but I am thinking of getting rid of
PgStatShared_BackendIOPathOps since it is not needed if I only keep the
locks in PgStat_IOPathOps and make the global shared value an array of
PgStat_IOPathOps.
I'm strongly against introducing super granular reset timestamps. I think that
was a mistake for SLRU stats, but we can't fix that as easily.
Currently, strategy allocs count only reuses of a strategy buffer (not
initial shared buffers which are added to the ring).
strategy writes count only the writing out of dirty buffers which are
already in the ring and are being reused.
That seems right to me.
Alternatively, we could also count as strategy allocs all those buffers
which are added to the ring and count as strategy writes all those
shared buffers which are dirty when initially added to the ring.
I don't think that'd provide valuable information. The whole reason that
strategy writes are interesting is that they can lead to writing out data a
lot sooner than they would be written out without a strategy being used.
Subject: [PATCH v24 2/3] Track IO operation statistics
Introduce "IOOp", an IO operation done by a backend, and "IOPath", the
location or type of IO done by a backend. For example, the checkpointer
may write a shared buffer out. This would be counted as an IOOp write on
an IOPath IOPATH_SHARED by BackendType "checkpointer".
I'm still not 100% happy with IOPath - seems a bit too easy to confuse with
the file path. What about 'origin'?
Each IOOp (alloc, fsync, extend, write) is counted per IOPath
(direct, local, shared, or strategy) through a call to
pgstat_count_io_op().
It seems we should track reads too - it's quite interesting to know whether
reads happened because of a strategy, for example. You do reference reads in a
later part of the commit message even :)
The primary concern of these statistics is IO operations on data blocks
during the course of normal database operations. IO done by, for
example, the archiver or syslogger is not counted in these statistics.
We could extend this at a later stage, if we really want to. But I'm not sure
it's interesting or fully possible. E.g. the archiver's write are largely not
done by the archiver itself, but by a command (or module these days) it shells
out to.
Note that this commit does not add code to increment IOPATH_DIRECT. A
future patch adding wrappers for smgrwrite(), smgrextend(), and
smgrimmedsync() would provide a good location to call
pgstat_count_io_op() for unbuffered IO and avoid regressions for future
users of these functions.
Hm. Perhaps we should defer introducing IOPATH_DIRECT for now then?
Stats on IOOps for all IOPaths for a backend are initially accumulated
locally.Later they are flushed to shared memory and accumulated with those from
all other backends, exited and live.
Perhaps mention here that this later could be extended to make per-connection
stats visible?
Some BackendTypes will not execute pgstat_report_stat() and thus must
explicitly call pgstat_flush_io_ops() in order to flush their backend
local IO operation statistics to shared memory.
Maybe add "flush ... during ongoing operation" or such? Because they'd all
flush at commit, IIRC.
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 088556ab54..963b05321e 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -33,6 +33,7 @@
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
 #include "pg_getopt.h"
+#include "pgstat.h"
 #include "storage/bufmgr.h"
 #include "storage/bufpage.h"
 #include "storage/condition_variable.h"
Hm?
diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c
index e926f8c27c..beb46dcb55 100644
--- a/src/backend/postmaster/walwriter.c
+++ b/src/backend/postmaster/walwriter.c
@@ -293,18 +293,7 @@ HandleWalWriterInterrupts(void)
 	}

 	if (ShutdownRequestPending)
- {
- /*
- * Force reporting remaining WAL statistics at process exit.
- *
- * Since pgstat_report_wal is invoked with 'force' is false in main
- * loop to avoid overloading the cumulative stats system, there may
- * exist unreported stats counters for the WAL writer.
- */
- pgstat_report_wal(true);
-
proc_exit(0);
-	}

 	/* Perform logging of memory contexts of this process */
if (LogMemoryContextPending)
Let's do this in a separate commit and get it out of the way...
@@ -682,16 +694,37 @@ AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
  * if this buffer should be written and re-used.
  */
 bool
-StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
+StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool *write_from_ring)
 {
-	/* We only do this in bulkread mode */
+	/*
+	 * We only reject reusing and writing out the strategy buffer in
+	 * bulkread mode.
+	 */
 	if (strategy->btype != BAS_BULKREAD)
+	{
+		/*
+		 * If the buffer was from the ring and we are not rejecting it,
+		 * consider it a write of a strategy buffer.
+		 */
+		if (strategy->current_was_in_ring)
+			*write_from_ring = true;
Hm. This is set even if the buffer wasn't dirty? I guess we don't expect
StrategyRejectBuffer() to be called for clean buffers...
 /*

diff --git a/src/backend/utils/activity/pgstat_database.c b/src/backend/utils/activity/pgstat_database.c
index d9275611f0..d3963f59d0 100644
--- a/src/backend/utils/activity/pgstat_database.c
+++ b/src/backend/utils/activity/pgstat_database.c
@@ -47,7 +47,8 @@ pgstat_drop_database(Oid databaseid)
 }

 /*
- * Called from autovacuum.c to report startup of an autovacuum process.
+ * Called from autovacuum.c to report startup of an autovacuum process and
+ * flush IO Operation statistics.
  * We are called before InitPostgres is done, so can't rely on MyDatabaseId;
  * the db OID must be passed in, instead.
  */
@@ -72,6 +73,11 @@ pgstat_report_autovac(Oid dboid)
 	dbentry->stats.last_autovac_time = GetCurrentTimestamp();

 	pgstat_unlock_entry(entry_ref);
+
+	/*
+	 * Report IO Operation statistics
+	 */
+	pgstat_flush_io_ops(false);
 }
Hm. I suspect this will always be zero - at this point we haven't connected to
a database, so there really can't have been much, if any, IO. I think I
suggested doing something here, but on a second look it really doesn't make
much sense.
Note that that's different from doing something in
pgstat_report_(vacuum|analyze) - clearly we've done something at that point.
 /*
- * Report that the table was just vacuumed.
+ * Report that the table was just vacuumed and flush IO Operation statistics.
  */
 void
 pgstat_report_vacuum(Oid tableoid, bool shared,
@@ -257,10 +257,15 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
 	}

 	pgstat_unlock_entry(entry_ref);
+
+	/*
+	 * Report IO Operations statistics
+	 */
+	pgstat_flush_io_ops(false);
 }

 /*
- * Report that the table was just analyzed.
+ * Report that the table was just analyzed and flush IO Operation statistics.
  *
  * Caller must provide new live- and dead-tuples estimates, as well as a
  * flag indicating whether to reset the changes_since_analyze counter.
@@ -340,6 +345,11 @@ pgstat_report_analyze(Relation rel,
 	}

 	pgstat_unlock_entry(entry_ref);
+
+	/*
+	 * Report IO Operations statistics
+	 */
+	pgstat_flush_io_ops(false);
 }
Think it'd be good to amend these comments to say that otherwise stats would
only get flushed after a multi-relation autovacuum cycle is done / a
VACUUM/ANALYZE command processed all tables. Perhaps add the comment to one
of the two functions, and just reference it in the other place?
--- a/src/include/utils/backend_status.h
+++ b/src/include/utils/backend_status.h
@@ -306,6 +306,40 @@ extern const char *pgstat_get_crashed_backend_activity(int pid, char *buffer,
 											   int buflen);
 extern uint64 pgstat_get_my_query_id(void);

+/* Utility functions */
+
+/*
+ * When maintaining an array of information about all valid BackendTypes, in
+ * order to avoid wasting the 0th spot, use this helper to convert a valid
+ * BackendType to a valid location in the array (given that no spot is
+ * maintained for B_INVALID BackendType).
+ */
+static inline int backend_type_get_idx(BackendType backend_type)
+{
+	/*
+	 * backend_type must be one of the valid backend types. If caller is
+	 * maintaining backend information in an array that includes B_INVALID,
+	 * this function is unnecessary.
+	 */
+	Assert(backend_type > B_INVALID && backend_type <= BACKEND_NUM_TYPES);
+	return backend_type - 1;
+}
In function definitions (vs declarations) we put the 'static inline int' in a
separate line from the rest of the function signature.
+/*
+ * When using a value from an array of information about all valid
+ * BackendTypes, add 1 to the index before using it as a BackendType to adjust
+ * for not maintaining a spot for B_INVALID BackendType.
+ */
+static inline BackendType idx_get_backend_type(int idx)
+{
+	int			backend_type = idx + 1;
+
+	/*
+	 * If the array includes a spot for B_INVALID BackendType this function is
+	 * not required.
The comments around this seem a bit over the top, but I also don't mind them
much.
Add pg_stat_io, a system view which tracks the number of IOOp (allocs,
writes, fsyncs, and extends) done through each IOPath (e.g. shared
buffers, local buffers, unbuffered IO) by each type of backend.
Annoying question: pg_stat_io vs pg_statio? I'd not think of suggesting the
latter, except that we already have a bunch of views with that prefix.
Some of these should always be zero. For example, checkpointer does not
use a BufferAccessStrategy (currently), so the "strategy" IOPath for
checkpointer will be 0 for all IOOps.
What do you think about returning NULL for the values that we expect to never
be non-zero? Perhaps with an assert against non-zero values? Seems like it
might be helpful for understanding the view.
+/*
+ * When adding a new column to the pg_stat_io view, add a new enum
+ * value here above IO_NUM_COLUMNS.
+ */
+enum
+{
+	IO_COLUMN_BACKEND_TYPE,
+	IO_COLUMN_IO_PATH,
+	IO_COLUMN_ALLOCS,
+	IO_COLUMN_EXTENDS,
+	IO_COLUMN_FSYNCS,
+	IO_COLUMN_WRITES,
+	IO_COLUMN_RESET_TIME,
+	IO_NUM_COLUMNS,
+};
We typedef pretty much every enum so the enum can be referenced without the
'enum' prefix. I'd do that here, even if we don't need it.
Greetings,
Andres Freund
Hi,
On 2022-07-11 22:22:28 -0400, Melanie Plageman wrote:
Yes, per an off list suggestion by you, I have changed the tests to use a
sum of writes. I've also added a test for IOPATH_LOCAL and fixed some of
the missing calls to count IO Operations for IOPATH_LOCAL and
IOPATH_STRATEGY.

I struggled to come up with a way to test that writes for a particular
type of backend are counted correctly since a dirty buffer could be
written out by another type of backend before the target BackendType has
a chance to write it out.
I guess temp file writes would be reliably done by one backend... Don't have a
good idea otherwise.
I also struggled to come up with a way to test IO operations for
background workers. I'm not sure of a way to deterministically have a
background worker do a particular kind of IO in a test scenario.
I think it's perfectly fine to not test that - for it to be broken we'd have
to somehow screw up setting the backend type. Everything else is the same as
other types of backends anyway.
If you *do* want to test it, you probably could use
SET parallel_leader_participation = false;
SET force_parallel_mode = 'regress';
SELECT something_triggering_io;
I'm not sure how to cause a strategy "extend" for testing.
COPY into a table should work. But might be unattractive due to the size of
the COPY ringbuffer.
Would be nice to have something testing that the ringbuffer stats stuff
does something sensible - that feels not entirely trivial.

I've added a test checking that reused strategy buffers are counted as
allocs. I would like to add a test which checks that if a buffer in the
ring is pinned and thus not reused, that it is not counted as a strategy
alloc, but I found it challenging without a way to pause vacuuming, pin
a buffer, then resume vacuuming.
Yea, that's probably too hard to make reliable to be worth it.
Greetings,
Andres Freund
At Tue, 12 Jul 2022 12:19:06 -0400, Melanie Plageman <melanieplageman@gmail.com> wrote in

+
+		&pgStatLocal.shmem->io_ops.stats[backend_type_get_idx(MyBackendType)];

backend_type_get_idx(x) is actually (x - 1) plus an assertion on the
value range, and this is its only use-case. There is a reverse
function, likewise used in only one place.

+		Datum		backend_type_desc =
+			CStringGetTextDatum(GetBackendTypeDesc(idx_get_backend_type(i)));

In this usage GetBackendTypeDesc() gracefully treats out-of-domain
values, but idx_get_backend_type keenly kills the process for the
same. This is inconsistent.

My humble opinion on this is that we don't define the two functions and
instead replace the calls to them with (x +/- 1). In addition, I think
we should not abort() on invalid backend types. In that sense, I
wonder if we could use the B_INVALIDth element for this purpose.

I think that GetBackendTypeDesc() should probably also error out for an
unknown value.

I would be open to not using the helper functions. I thought it would be
less error-prone, but since it is limited to the code in
pgstat_io_ops.c, it is probably okay. Let me think a bit more.

Could you explain more about what you mean about using B_INVALID
BackendType?
I imagined using B_INVALID as a kind of "default" partition, which
accepts all unknown backend types. We could just ignore those values, but
then we lose the clue to a malfunction of the stats machinery. I thought
of that backend-type as a sentinel for malfunctions. Thus we could
emit logs instead.

I feel that the stats machinery shouldn't stop the server if at all
possible; aborting for invalid values that can easily be coped with
seems like an overreaction.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hi,
On 2022-07-13 11:00:07 +0900, Kyotaro Horiguchi wrote:
I imagined using B_INVALID as a kind of "default" partition, which
accepts all unknown backend types.
There shouldn't be any unknown backend types. Something has gone wrong if we
get far without a backend type set.
We can just ignore those values, but then we lose the clue to a
malfunction of the stats machinery. I thought of that backend-type as a
sentinel for malfunctions. Thus we could emit logs instead.

I feel that the stats machinery shouldn't stop the server if at all
possible; aborting for invalid values that can easily be coped with
seems like an overreaction.
I strongly disagree. That just ends up with hard to find bugs.
Greetings,
Andres Freund
At Tue, 12 Jul 2022 19:18:22 -0700, Andres Freund <andres@anarazel.de> wrote in
Hi,
On 2022-07-13 11:00:07 +0900, Kyotaro Horiguchi wrote:
I imagined using B_INVALID as a kind of "default" partition, which
accepts all unknown backend types.

There shouldn't be any unknown backend types. Something has gone wrong if we
get far without a backend type set.

We can just ignore those values, but then we lose the clue to a malfunction
of the stats machinery. I thought of that backend-type as a sentinel for
malfunctions. Thus we could emit logs instead.

I feel that the stats machinery shouldn't stop the server if at all possible;
aborting for invalid values that can easily be coped with seems like an
overreaction.

I strongly disagree. That just ends up with hard to find bugs.
I was not sure about the policy on that since, as Melanie (and I)
mentioned, GetBackendTypeDesc() is gracefully treating invalid values.
Since both of you are agreeing on this point, I'm fine with
Assert()ing assuming that GetBackendTypeDesc() (or other places
backend-type is handled) is modified to behave the same way.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attached patch set is substantially different enough from previous
versions that I kept it as a new patch set.
Note that local buffer allocations are now correctly tracked.
On Tue, Jul 12, 2022 at 1:01 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2022-07-12 12:19:06 -0400, Melanie Plageman wrote:
I also realized that I am not differentiating between IOPATH_SHARED and
IOPATH_STRATEGY for IOOP_FSYNC. But, given that we don't know what type
of buffer we are fsync'ing by the time we call register_dirty_segment(),
I'm not sure how we would fix this.

I think a flush scarcely happens for strategy-loaded buffers. If that is
sensible, IOOP_FSYNC would not make much sense for IOPATH_STRATEGY.

Why would it be less likely for a backend to do its own fsync when
flushing a dirty strategy buffer than a regular dirty shared buffer?

We really just don't expect a backend to do many segment fsyncs at
all. Otherwise there's something wrong with the forwarding mechanism.
When a dirty strategy buffer is written out, if the pendingOps sync queue is
full and the backend has to fsync the segment itself instead of relying
on the checkpointer, this will show in the statistics as an IOOP_FSYNC
for IOPATH_SHARED not IOPATH_STRATEGY.
IOPATH_STRATEGY + IOOP_FSYNC will always be 0 for all BackendTypes.
Does this seem right?
It'd be different if we tracked WAL fsyncs more granularly - which would be
quite interesting - but that's something for another day^Wpatch.
I do have a question about this.
So, if we were to start tracking WAL IO would it fit within this
paradigm to have a new IOPATH_WAL for WAL or would it add a separate
dimension?
I was thinking that we might want to consider calling this view
pg_stat_io_data because we might want to have a separate view,
pg_stat_io_wal and then, maybe eventually, convert pg_stat_slru to
pg_stat_io_slru (or a subset of what is in pg_stat_slru).
And maybe then later add pg_stat_io_[archiver/other]
Is pg_stat_io_data a good name that gives us flexibility to
introduce views which expose per-backend IO operation stats (maybe that
goes in pg_stat_activity, though [or maybe not because it wouldn't
include exited backends?]) and per query IO operation stats?
I would like to add roughly the same additional columns to all of
these during AIO development (basically the columns from iostat):
- average block size (will usually be 8kB for pg_stat_io_data but won't
necessarily for the others)
- IOPS/BW
- avg read/write wait time
- demand rate/completion rate
- merges
- maybe queue depth
And I would like to be able to see all of these per query, per backend,
per relation, per BackendType, per IOPath, per SLRU type, etc.
Basically, what I'm asking is
1) what can we name the view to enable these future stats to exist with
the least confusing/wordy view names?
2) will the current view layout and column titles work with minimal
changes for future stats extensions like what I mention above?
Wonder if it's worth making the lock specific to the backend type?

I've added another Lock into PgStat_IOPathOps so that each BackendType
can be locked separately. But, I've also kept the lock in
PgStatShared_BackendIOPathOps so that reset_all and snapshot could be
done easily.

Looks fine about the lock separation.

Actually, I think it is not safe to use both of these locks. So for
picking one method, it is probably better to go with the locks in
PgStat_IOPathOps; it will be more efficient for flush (and not for
fetching and resetting), so that is probably the way to go, right?

I think it's good to just use one kind of lock, and efficiency of
snapshotting / resetting is nearly irrelevant. But I don't see why it's
not safe to use both kinds of locks?
The way I implemented it was not safe because I didn't use both locks
when resetting the stats.
In this new version of the patch, I've done the following: In shared
memory I've put the lock in PgStatShared_IOPathOps -- the data structure
which contains an array of PgStat_IOOpCounters for all IOOp types for
all IOPaths. Thus, different BackendType + IOPath combinations can be
updated concurrently without contending for the same lock.
To make this work, I made two versions of the PgStat_IOPathOps -- one
that has the lock, PgStatShared_IOPathOps, and one without,
PgStat_IOPathOps, so that I can persist it to the stats file without
writing and reading the LWLock and can have a local and snapshot version
of the data structure without the lock.
This also necessitated two versions of the data structure wrapping
PgStat_IOPathOps, PgStat_BackendIOPathOps, which contains an array with
a PgStat_IOPathOps for each BackendType, and
PgStatShared_BackendIOPathOps, containing an array of
PgStatShared_IOPathOps.
Looks fine, but I think pgstat_flush_io_ops() needs more comments like
other pgstat_flush_* functions.

+	for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+		stats_shmem->stats[i].stat_reset_timestamp = ts;

I'm not sure we need a separate reset timestamp for each backend type,
but the SLRU counter does the same thing..

Yes, I think for SLRU stats it is because you can reset individual SLRU
stats. Also there is no wrapper data structure to put it in. I could
keep it in PgStatShared_BackendIOPathOps since you have to reset all IO
operation stats at once, but I am thinking of getting rid of
PgStatShared_BackendIOPathOps since it is not needed if I only keep the
locks in PgStat_IOPathOps and make the global shared value an array of
PgStat_IOPathOps.

I'm strongly against introducing super granular reset timestamps. I think
that was a mistake for SLRU stats, but we can't fix that as easily.
Since all stats in pg_stat_io must be reset at the same time, I've put
the reset timestamp in the PgStat[Shared]_BackendIOPathOps and
removed it from each PgStat[Shared]_IOPathOps.
Currently, strategy allocs count only reuses of a strategy buffer (not
initial shared buffers which are added to the ring).

Strategy writes count only the writing out of dirty buffers which are
already in the ring and are being reused.

That seems right to me.

Alternatively, we could also count as strategy allocs all those buffers
which are added to the ring and count as strategy writes all those
shared buffers which are dirty when initially added to the ring.

I don't think that'd provide valuable information. The whole reason that
strategy writes are interesting is that they can lead to writing out data a
lot sooner than they would be written out without a strategy being used.
Then I agree that strategy writes should only count strategy buffers
that are written out in order to reuse the buffer (which is in lieu of
getting a new, potentially clean, shared buffer). This patch implements
that behavior.
However, for strategy allocs, it seems like we would want to count all
demand for buffers as part of a BufferAccessStrategy. So, that would
include allocating buffers to initially fill the ring, allocations of
new shared buffers after the ring was already full that are added to the
ring because all existing buffers in the ring are pinned, and buffers
already in the ring which are being reused.
This version of the patch only counts the third scenario as a strategy
allocation, but I think it would make more sense to count all three as
strategy allocs.
The downside of this behavior is that strategy allocs count different
scenarios than strategy writes, reads, and extends. But, I think that
this is okay.
I'll clarify it in the docs once there is a decision.
Also, note that, as stated above, there will never be any strategy
fsyncs (that is, IOPATH_STRATEGY + IOOP_FSYNC will always be 0) because
the code path starting with register_dirty_segment() which ends with a
regular backend doing its own fsync when pendingOps is full does not
know what the current IOPATH is and checkpointer does not use a
BufferAccessStrategy.
Subject: [PATCH v24 2/3] Track IO operation statistics
Introduce "IOOp", an IO operation done by a backend, and "IOPath", the
location or type of IO done by a backend. For example, the checkpointer
may write a shared buffer out. This would be counted as an IOOp write on
an IOPath IOPATH_SHARED by BackendType "checkpointer".

I'm still not 100% happy with IOPath - seems a bit too easy to confuse with
the file path. What about 'origin'?
Enough has changed in this version of the patch that I decided to defer
renaming until some of the other issues are resolved.
Each IOOp (alloc, fsync, extend, write) is counted per IOPath
(direct, local, shared, or strategy) through a call to
pgstat_count_io_op().

It seems we should track reads too - it's quite interesting to know whether
reads happened because of a strategy, for example. You do reference reads in a
later part of the commit message even :)
I've added reads to what is counted.
The primary concern of these statistics is IO operations on data blocks
during the course of normal database operations. IO done by, for
example, the archiver or syslogger is not counted in these statistics.

We could extend this at a later stage, if we really want to. But I'm not sure
it's interesting or fully possible. E.g. the archiver's writes are largely not
done by the archiver itself, but by a command (or module these days) it shells
out to.
I've added note of this to some of the comments and the commit message.
I also omit rows for these BackendTypes from the view. See my later
comment in this email for more detail on that.
Note that this commit does not add code to increment IOPATH_DIRECT. A
future patch adding wrappers for smgrwrite(), smgrextend(), and
smgrimmedsync() would provide a good location to call
pgstat_count_io_op() for unbuffered IO and avoid regressions for future
users of these functions.

Hm. Perhaps we should defer introducing IOPATH_DIRECT for now then?
It's gone.
Stats on IOOps for all IOPaths for a backend are initially accumulated
locally. Later they are flushed to shared memory and accumulated with
those from all other backends, exited and live.

Perhaps mention here that this later could be extended to make
per-connection stats visible?
Mentioned.
Some BackendTypes will not execute pgstat_report_stat() and thus must
explicitly call pgstat_flush_io_ops() in order to flush their backend
local IO operation statistics to shared memory.

Maybe add "flush ... during ongoing operation" or such? Because they'd all
flush at commit, IIRC.
Added.
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 088556ab54..963b05321e 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -33,6 +33,7 @@
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
 #include "pg_getopt.h"
+#include "pgstat.h"
 #include "storage/bufmgr.h"
 #include "storage/bufpage.h"
 #include "storage/condition_variable.h"

Hm?
Removed
diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c
index e926f8c27c..beb46dcb55 100644
--- a/src/backend/postmaster/walwriter.c
+++ b/src/backend/postmaster/walwriter.c
@@ -293,18 +293,7 @@ HandleWalWriterInterrupts(void)
 	}

 	if (ShutdownRequestPending)
-	{
-		/*
-		 * Force reporting remaining WAL statistics at process exit.
-		 *
-		 * Since pgstat_report_wal is invoked with 'force' is false in main
-		 * loop to avoid overloading the cumulative stats system, there may
-		 * exist unreported stats counters for the WAL writer.
-		 */
-		pgstat_report_wal(true);
-
 		proc_exit(0);
-	}

 	/* Perform logging of memory contexts of this process */
 	if (LogMemoryContextPending)

Let's do this in a separate commit and get it out of the way...
I've put it in a separate commit.
@@ -682,16 +694,37 @@ AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
  * if this buffer should be written and re-used.
  */
 bool
-StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
+StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool *write_from_ring)
 {
-	/* We only do this in bulkread mode */
+	/*
+	 * We only reject reusing and writing out the strategy buffer in
+	 * bulkread mode.
+	 */
 	if (strategy->btype != BAS_BULKREAD)
+	{
+		/*
+		 * If the buffer was from the ring and we are not rejecting it,
+		 * consider it a write of a strategy buffer.
+		 */
+		if (strategy->current_was_in_ring)
+			*write_from_ring = true;

Hm. This is set even if the buffer wasn't dirty? I guess we don't expect
StrategyRejectBuffer() to be called for clean buffers...
Yes, we do not expect it to be called for clean buffers.
I've added a comment about this assumption.
 /*

diff --git a/src/backend/utils/activity/pgstat_database.c b/src/backend/utils/activity/pgstat_database.c
index d9275611f0..d3963f59d0 100644
--- a/src/backend/utils/activity/pgstat_database.c
+++ b/src/backend/utils/activity/pgstat_database.c
@@ -47,7 +47,8 @@ pgstat_drop_database(Oid databaseid)
 }

 /*
- * Called from autovacuum.c to report startup of an autovacuum process.
+ * Called from autovacuum.c to report startup of an autovacuum process and
+ * flush IO Operation statistics.
  * We are called before InitPostgres is done, so can't rely on MyDatabaseId;
  * the db OID must be passed in, instead.
  */
@@ -72,6 +73,11 @@ pgstat_report_autovac(Oid dboid)
 	dbentry->stats.last_autovac_time = GetCurrentTimestamp();

 	pgstat_unlock_entry(entry_ref);
+
+	/*
+	 * Report IO Operation statistics
+	 */
+	pgstat_flush_io_ops(false);
 }

Hm. I suspect this will always be zero - at this point we haven't connected to
a database, so there really can't have been much, if any, IO. I think I
suggested doing something here, but on a second look it really doesn't make
much sense.

Note that that's different from doing something in
pgstat_report_(vacuum|analyze) - clearly we've done something at that point.
I've removed this.
 /*
- * Report that the table was just vacuumed.
+ * Report that the table was just vacuumed and flush IO Operation statistics.
  */
 void
 pgstat_report_vacuum(Oid tableoid, bool shared,
@@ -257,10 +257,15 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
 	}

 	pgstat_unlock_entry(entry_ref);
+
+	/*
+	 * Report IO Operations statistics
+	 */
+	pgstat_flush_io_ops(false);
 }

 /*
- * Report that the table was just analyzed.
+ * Report that the table was just analyzed and flush IO Operation statistics.
  *
  * Caller must provide new live- and dead-tuples estimates, as well as a
  * flag indicating whether to reset the changes_since_analyze counter.
@@ -340,6 +345,11 @@ pgstat_report_analyze(Relation rel,
 	}

 	pgstat_unlock_entry(entry_ref);
+
+	/*
+	 * Report IO Operations statistics
+	 */
+	pgstat_flush_io_ops(false);
 }

Think it'd be good to amend these comments to say that otherwise stats would
only get flushed after a multi-relation autovacuum cycle is done / a
VACUUM/ANALYZE command processed all tables. Perhaps add the comment to one
of the two functions, and just reference it in the other place?
Done
--- a/src/include/utils/backend_status.h
+++ b/src/include/utils/backend_status.h
@@ -306,6 +306,40 @@ extern const char *pgstat_get_crashed_backend_activity(int pid, char *buffer,
 											   int buflen);
 extern uint64 pgstat_get_my_query_id(void);

+/* Utility functions */
+
+/*
+ * When maintaining an array of information about all valid BackendTypes, in
+ * order to avoid wasting the 0th spot, use this helper to convert a valid
+ * BackendType to a valid location in the array (given that no spot is
+ * maintained for B_INVALID BackendType).
+ */
+static inline int backend_type_get_idx(BackendType backend_type)
+{
+	/*
+	 * backend_type must be one of the valid backend types. If caller is
+	 * maintaining backend information in an array that includes B_INVALID,
+	 * this function is unnecessary.
+	 */
+	Assert(backend_type > B_INVALID && backend_type <= BACKEND_NUM_TYPES);
+	return backend_type - 1;
+}

In function definitions (vs declarations) we put the 'static inline int' in a
separate line from the rest of the function signature.
Fixed.
+/*
+ * When using a value from an array of information about all valid
+ * BackendTypes, add 1 to the index before using it as a BackendType to adjust
+ * for not maintaining a spot for B_INVALID BackendType.
+ */
+static inline BackendType idx_get_backend_type(int idx)
+{
+	int			backend_type = idx + 1;
+
+	/*
+	 * If the array includes a spot for B_INVALID BackendType this function is
+	 * not required.

The comments around this seem a bit over the top, but I also don't mind them
much.
Feel free to change them to something shorter. I couldn't think of
something I liked.
Add pg_stat_io, a system view which tracks the number of IOOp (allocs,
writes, fsyncs, and extends) done through each IOPath (e.g. shared
buffers, local buffers, unbuffered IO) by each type of backend.

Annoying question: pg_stat_io vs pg_statio? I'd not think of suggesting the
latter, except that we already have a bunch of views with that prefix.
I have thoughts on this but thought it best deferred until after the _data
decision.
Some of these should always be zero. For example, checkpointer does not
use a BufferAccessStrategy (currently), so the "strategy" IOPath for
checkpointer will be 0 for all IOOps.

What do you think about returning NULL for the values that we expect to
never be non-zero? Perhaps with an assert against non-zero values? Seems
like it might be helpful for understanding the view.
Yes, I like this idea.
Beyond just setting individual cells to NULL, if an entire row would be
NULL, I have now dropped it from the view.
So far, I have omitted from the view all rows for BackendTypes
B_ARCHIVER, B_LOGGER, and B_STARTUP.
Should I also omit rows for B_WAL_RECEIVER and B_WAL_WRITER for now?
I have also omitted rows for IOPATH_STRATEGY for all BackendTypes
*except* B_AUTOVAC_WORKER, B_BACKEND, B_STANDALONE_BACKEND, and
B_BG_WORKER.
Do these seem correct?
I think there are some BackendTypes which will never do IO Operations on
IOPATH_LOCAL but I am not sure which. Do you know which?
As for individual cells which should be NULL, so far what I have is:
- IOPATH_LOCAL + IOOP_FSYNC
I am sure there are others as well. Can you think of any?
+/*
+ * When adding a new column to the pg_stat_io view, add a new enum
+ * value here above IO_NUM_COLUMNS.
+ */
+enum
+{
+	IO_COLUMN_BACKEND_TYPE,
+	IO_COLUMN_IO_PATH,
+	IO_COLUMN_ALLOCS,
+	IO_COLUMN_EXTENDS,
+	IO_COLUMN_FSYNCS,
+	IO_COLUMN_WRITES,
+	IO_COLUMN_RESET_TIME,
+	IO_NUM_COLUMNS,
+};

We typedef pretty much every enum so the enum can be referenced without the
'enum' prefix. I'd do that here, even if we don't need it.
So, I left it anonymous because I didn't want it being used as a type
or referenced anywhere else.
I am interested to hear more about your SQL enums idea from upthread.
- Melanie
Attachments:
v25-0001-Add-BackendType-for-standalone-backends.patch
From 5d3e3e702cd95e52cb015a23c0bbeccc5debd46d Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 28 Jun 2022 11:33:04 -0400
Subject: [PATCH v25 1/4] Add BackendType for standalone backends
All backends should have a BackendType to enable statistics reporting
per BackendType.
Add a new BackendType for standalone backends, B_STANDALONE_BACKEND (and
alphabetize the BackendTypes). Both the bootstrap backend and single
user mode backends will have BackendType B_STANDALONE_BACKEND.
Author: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://www.postgresql.org/message-id/CAAKRu_aaq33UnG4TXq3S-OSXGWj1QGf0sU%2BECH4tNwGFNERkZA%40mail.gmail.com
---
src/backend/utils/init/miscinit.c | 17 +++++++++++------
src/include/miscadmin.h | 5 +++--
2 files changed, 14 insertions(+), 8 deletions(-)
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index eb43b2c5e5..07e6db1a1c 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -176,6 +176,8 @@ InitStandaloneProcess(const char *argv0)
{
Assert(!IsPostmasterEnvironment);
+ MyBackendType = B_STANDALONE_BACKEND;
+
/*
* Start our win32 signal implementation
*/
@@ -255,6 +257,9 @@ GetBackendTypeDesc(BackendType backendType)
case B_INVALID:
backendDesc = "not initialized";
break;
+ case B_ARCHIVER:
+ backendDesc = "archiver";
+ break;
case B_AUTOVAC_LAUNCHER:
backendDesc = "autovacuum launcher";
break;
@@ -273,6 +278,12 @@ GetBackendTypeDesc(BackendType backendType)
case B_CHECKPOINTER:
backendDesc = "checkpointer";
break;
+ case B_LOGGER:
+ backendDesc = "logger";
+ break;
+ case B_STANDALONE_BACKEND:
+ backendDesc = "standalone backend";
+ break;
case B_STARTUP:
backendDesc = "startup";
break;
@@ -285,12 +296,6 @@ GetBackendTypeDesc(BackendType backendType)
case B_WAL_WRITER:
backendDesc = "walwriter";
break;
- case B_ARCHIVER:
- backendDesc = "archiver";
- break;
- case B_LOGGER:
- backendDesc = "logger";
- break;
}
return backendDesc;
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index ea9a56d395..5276bf25a1 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -316,18 +316,19 @@ extern void SwitchBackToLocalLatch(void);
typedef enum BackendType
{
B_INVALID = 0,
+ B_ARCHIVER,
B_AUTOVAC_LAUNCHER,
B_AUTOVAC_WORKER,
B_BACKEND,
B_BG_WORKER,
B_BG_WRITER,
B_CHECKPOINTER,
+ B_LOGGER,
+ B_STANDALONE_BACKEND,
B_STARTUP,
B_WAL_RECEIVER,
B_WAL_SENDER,
B_WAL_WRITER,
- B_ARCHIVER,
- B_LOGGER,
} BackendType;
extern PGDLLIMPORT BackendType MyBackendType;
--
2.34.1
From 965923536cfe72819b2877e9f1ad4a7e6373b0e8 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 12 Jul 2022 19:53:23 -0400
Subject: [PATCH v25 2/4] Remove unneeded call to pgstat_report_wal()
pgstat_report_stat() will already be called before shutdown, so an explicit
call to pgstat_report_wal() is unnecessary.
---
src/backend/postmaster/walwriter.c | 11 -----------
1 file changed, 11 deletions(-)
diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c
index e926f8c27c..beb46dcb55 100644
--- a/src/backend/postmaster/walwriter.c
+++ b/src/backend/postmaster/walwriter.c
@@ -293,18 +293,7 @@ HandleWalWriterInterrupts(void)
}
if (ShutdownRequestPending)
- {
- /*
- * Force reporting remaining WAL statistics at process exit.
- *
- * Since pgstat_report_wal is invoked with 'force' is false in main
- * loop to avoid overloading the cumulative stats system, there may
- * exist unreported stats counters for the WAL writer.
- */
- pgstat_report_wal(true);
-
proc_exit(0);
- }
/* Perform logging of memory contexts of this process */
if (LogMemoryContextPending)
--
2.34.1
From 7ba696105c6a45d7b9c7c08fc178d8af4f60c910 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 29 Jun 2022 18:37:42 -0400
Subject: [PATCH v25 3/4] Track IO operation statistics
Introduce "IOOp", an IO operation done by a backend, and "IOPath", the
location or type of IO done by a backend. For example, the checkpointer
may write a shared buffer out. This would be counted as an IOOp "write"
on an IOPath IOPATH_SHARED by BackendType "checkpointer".
Each IOOp (alloc, extend, fsync, read, write) is counted per IOPath
(local, shared, or strategy) through a call to pgstat_count_io_op().
The primary concern of these statistics is IO operations on data blocks
during the course of normal database operations. IO done by, for
example, the archiver or syslogger is not counted in these statistics.
IOPATH_LOCAL and IOPATH_SHARED IOPaths concern operations on local
and shared buffers.
The IOPATH_STRATEGY IOPath concerns buffers
alloc'd/extended/fsync'd/read/written as part of a BufferAccessStrategy.
IOOP_ALLOC is counted for IOPATH_SHARED and IOPATH_LOCAL whenever a
buffer is acquired through [Local]BufferAlloc(). IOOP_ALLOC for
IOPATH_STRATEGY is counted whenever a buffer already in the strategy
ring is reused. And IOOP_WRITE for IOPATH_STRATEGY is counted whenever
the reused dirty buffer is written out.
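The allocation-counting rules above can be condensed into one decision function. This is an illustrative sketch, not code from the patch:

```c
#include <stdbool.h>

typedef enum IOPath
{
	IOPATH_LOCAL,
	IOPATH_SHARED,
	IOPATH_STRATEGY,
} IOPath;

/*
 * Which IOPath an IOOP_ALLOC is charged to: local buffers are always
 * IOPATH_LOCAL; a buffer reused from the strategy ring is IOPATH_STRATEGY;
 * a buffer freshly taken from shared buffers is IOPATH_SHARED even when a
 * strategy is in use.
 */
static IOPath
alloc_io_path(bool is_local_buf, bool strategy_in_use, bool reused_from_ring)
{
	if (is_local_buf)
		return IOPATH_LOCAL;
	if (strategy_in_use && reused_from_ring)
		return IOPATH_STRATEGY;
	return IOPATH_SHARED;
}
```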
Stats on IOOps for all IOPaths for a backend are initially accumulated
locally.
Later they are flushed to shared memory and accumulated with those from
all other backends, exited and live. The accumulated stats in shared
memory could be extended in the future with per-backend stats -- useful
for per connection IO statistics and monitoring.
Some BackendTypes will not flush their pending statistics at regular
intervals and explicitly call pgstat_flush_io_ops() during the course of
normal operations to flush their backend-local IO Operation statistics
to shared memory in a timely manner.
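Stripped of LWLocks and the real shared-memory plumbing, the accumulate-locally-then-flush shape looks roughly like this (names are illustrative; the real flush also handles the nowait/conditional-acquire case):

```c
#include <string.h>
#include <stdbool.h>

enum {IOPATH_LOCAL, IOPATH_SHARED, IOPATH_STRATEGY, IOPATH_NUM_TYPES};

typedef struct IOOpCounters
{
	long	allocs, extends, fsyncs, reads, writes;
} IOOpCounters;

static IOOpCounters pending[IOPATH_NUM_TYPES];		/* backend-local */
static IOOpCounters shared_stats[IOPATH_NUM_TYPES];	/* stands in for shmem */
static bool have_pending = false;

/* Count one write locally; cheap, no locking. */
static void
count_io_write(int io_path)
{
	pending[io_path].writes++;
	have_pending = true;
}

/*
 * Flush pending counters into the shared totals and reset them. The real
 * pgstat_flush_io_ops() takes this backend type's LWLock around the loop.
 */
static void
flush_io_ops(void)
{
	if (!have_pending)
		return;
	for (int i = 0; i < IOPATH_NUM_TYPES; i++)
	{
		shared_stats[i].allocs += pending[i].allocs;
		shared_stats[i].extends += pending[i].extends;
		shared_stats[i].fsyncs += pending[i].fsyncs;
		shared_stats[i].reads += pending[i].reads;
		shared_stats[i].writes += pending[i].writes;
	}
	memset(pending, 0, sizeof(pending));
	have_pending = false;
}
```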
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>, Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
src/backend/postmaster/checkpointer.c | 1 +
src/backend/storage/buffer/bufmgr.c | 53 ++++-
src/backend/storage/buffer/freelist.c | 51 ++++-
src/backend/storage/buffer/localbuf.c | 6 +
src/backend/storage/sync/sync.c | 2 +
src/backend/utils/activity/Makefile | 1 +
src/backend/utils/activity/pgstat.c | 36 ++++
src/backend/utils/activity/pgstat_bgwriter.c | 7 +-
.../utils/activity/pgstat_checkpointer.c | 7 +-
src/backend/utils/activity/pgstat_io_ops.c | 192 ++++++++++++++++++
src/backend/utils/activity/pgstat_relation.c | 19 +-
src/backend/utils/activity/pgstat_wal.c | 4 +-
src/backend/utils/adt/pgstatfuncs.c | 4 +-
src/include/miscadmin.h | 2 +
src/include/pgstat.h | 58 ++++++
src/include/storage/buf_internals.h | 2 +-
src/include/utils/backend_status.h | 36 ++++
src/include/utils/pgstat_internal.h | 24 +++
18 files changed, 485 insertions(+), 20 deletions(-)
create mode 100644 src/backend/utils/activity/pgstat_io_ops.c
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 5fc076fc14..a06331e1eb 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -1116,6 +1116,7 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
if (!AmBackgroundWriterProcess())
CheckpointerShmem->num_backend_fsync++;
LWLockRelease(CheckpointerCommLock);
+ pgstat_count_io_op(IOOP_FSYNC, IOPATH_SHARED);
return false;
}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index c7d7abcd73..e872d7edc6 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -482,7 +482,7 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
BlockNumber blockNum,
BufferAccessStrategy strategy,
bool *foundPtr);
-static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+static void FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOPath iopath);
static void FindAndDropRelationBuffers(RelFileLocator rlocator,
ForkNumber forkNum,
BlockNumber nForkBlock,
@@ -813,6 +813,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
BufferDesc *bufHdr;
Block bufBlock;
bool found;
+ IOPath io_path;
bool isExtend;
bool isLocalBuf = SmgrIsTemp(smgr);
@@ -978,8 +979,17 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
+ if (isLocalBuf)
+ io_path = IOPATH_LOCAL;
+ else if (strategy != NULL)
+ io_path = IOPATH_STRATEGY;
+ else
+ io_path = IOPATH_SHARED;
+
if (isExtend)
{
+
+ pgstat_count_io_op(IOOP_EXTEND, io_path);
/* new buffers are zero-filled */
MemSet((char *) bufBlock, 0, BLCKSZ);
/* don't set checksum for all-zero page */
@@ -1010,6 +1020,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
smgrread(smgr, forkNum, blockNum, (char *) bufBlock);
+ pgstat_count_io_op(IOOP_READ, io_path);
+
if (track_io_timing)
{
INSTR_TIME_SET_CURRENT(io_time);
@@ -1180,6 +1192,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
/* Loop here in case we have to try another victim buffer */
for (;;)
{
+ bool write_from_ring = false;
/*
* Ensure, while the spinlock's not yet held, that there's a free
* refcount entry.
@@ -1227,6 +1240,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf),
LW_SHARED))
{
+ IOPath iopath;
/*
* If using a nondefault strategy, and writing the buffer
* would require a WAL flush, let the strategy decide whether
@@ -1244,7 +1258,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
UnlockBufHdr(buf, buf_state);
if (XLogNeedsFlush(lsn) &&
- StrategyRejectBuffer(strategy, buf))
+ StrategyRejectBuffer(strategy, buf, &write_from_ring))
{
/* Drop lock/pin and loop around for another buffer */
LWLockRelease(BufferDescriptorGetContentLock(buf));
@@ -1253,13 +1267,27 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
}
}
+ /*
+ * When a strategy is in use, if the target dirty buffer is an existing
+ * strategy buffer being reused, count this as a strategy write for the
+ * purposes of IO Operations statistics tracking.
+ *
+ * All dirty shared buffers upon first being added to the ring will be
+ * counted as shared buffer writes.
+ *
+ * When a strategy is not in use, the write can only be a "regular" write
+ * of a dirty shared buffer.
+ */
+
+ iopath = write_from_ring ? IOPATH_STRATEGY : IOPATH_SHARED;
+
/* OK, do the I/O */
TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(forkNum, blockNum,
smgr->smgr_rlocator.locator.spcOid,
smgr->smgr_rlocator.locator.dbOid,
smgr->smgr_rlocator.locator.relNumber);
- FlushBuffer(buf, NULL);
+ FlushBuffer(buf, NULL, iopath);
LWLockRelease(BufferDescriptorGetContentLock(buf));
ScheduleBufferTagForWriteback(&BackendWritebackContext,
@@ -2563,7 +2591,7 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOPATH_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
@@ -2810,9 +2838,12 @@ BufferGetTag(Buffer buffer, RelFileLocator *rlocator, ForkNumber *forknum,
*
* If the caller has an smgr reference for the buffer's relation, pass it
* as the second parameter. If not, pass NULL.
+ *
+ * IOPath will always be IOPATH_SHARED except when a buffer access strategy is
+ * used and the buffer being flushed is a buffer from the strategy ring.
*/
static void
-FlushBuffer(BufferDesc *buf, SMgrRelation reln)
+FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOPath iopath)
{
XLogRecPtr recptr;
ErrorContextCallback errcallback;
@@ -2892,6 +2923,8 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
*/
bufToWrite = PageSetChecksumCopy((Page) bufBlock, buf->tag.blockNum);
+ pgstat_count_io_op(IOOP_WRITE, iopath);
+
if (track_io_timing)
INSTR_TIME_SET_CURRENT(io_start);
@@ -3539,6 +3572,8 @@ FlushRelationBuffers(Relation rel)
localpage,
false);
+ pgstat_count_io_op(IOOP_WRITE, IOPATH_LOCAL);
+
buf_state &= ~(BM_DIRTY | BM_JUST_DIRTIED);
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
@@ -3574,7 +3609,7 @@ FlushRelationBuffers(Relation rel)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, RelationGetSmgr(rel));
+ FlushBuffer(bufHdr, RelationGetSmgr(rel), IOPATH_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
@@ -3669,7 +3704,7 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, srelent->srel);
+ FlushBuffer(bufHdr, srelent->srel, IOPATH_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
@@ -3877,7 +3912,7 @@ FlushDatabaseBuffers(Oid dbid)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOPATH_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
@@ -3904,7 +3939,7 @@ FlushOneBuffer(Buffer buffer)
Assert(LWLockHeldByMe(BufferDescriptorGetContentLock(bufHdr)));
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOPATH_SHARED);
}
/*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 990e081aae..29f5cbeab6 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -15,6 +15,7 @@
*/
#include "postgres.h"
+#include "pgstat.h"
#include "port/atomics.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
@@ -212,8 +213,20 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
if (strategy != NULL)
{
buf = GetBufferFromRing(strategy, buf_state);
- if (buf != NULL)
+ if (strategy->current_was_in_ring)
+ {
+ /*
+ * When a strategy is in use, reused buffers from the strategy ring will
+ * be counted as allocations for the purposes of IO Operation statistics
+ * tracking.
+ *
+ * However, even when a strategy is in use, if a new buffer must be
+ * allocated from shared buffers and added to the ring, this is counted
+ * as an IOPATH_SHARED allocation.
+ */
+ pgstat_count_io_op(IOOP_ALLOC, IOPATH_STRATEGY);
return buf;
+ }
}
/*
@@ -247,6 +260,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
* the rate of buffer consumption. Note that buffers recycled by a
* strategy object are intentionally not counted here.
*/
+ pgstat_count_io_op(IOOP_ALLOC, IOPATH_SHARED);
pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);
/*
@@ -682,16 +696,38 @@ AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
* if this buffer should be written and re-used.
*/
bool
-StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
+StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool *write_from_ring)
{
- /* We only do this in bulkread mode */
+
+ /*
+ * We only reject reusing and writing out the strategy buffer in bulkread
+ * mode.
+ */
if (strategy->btype != BAS_BULKREAD)
+ {
+ /*
+ * If the buffer was from the ring and we are not rejecting it, consider it
+ * a write of a strategy buffer. Note that this assumes that the buffer is
+ * dirty.
+ */
+ if (strategy->current_was_in_ring)
+ *write_from_ring = true;
return false;
+ }
- /* Don't muck with behavior of normal buffer-replacement strategy */
+ /*
+ * Don't muck with behavior of normal buffer-replacement strategy. Though we
+ * are not rejecting this buffer, write_from_ring is false because shared
+ * buffers that are added to the ring, either initially or when reuse is not
+ * possible because all existing strategy buffers are pinned, are not
+ * considered strategy writes for the purposes of IO Operation statistics.
+ */
if (!strategy->current_was_in_ring ||
strategy->buffers[strategy->current] != BufferDescriptorGetBuffer(buf))
+ {
+ *write_from_ring = false;
return false;
+ }
/*
* Remove the dirty buffer from the ring; necessary to prevent infinite
@@ -699,5 +735,12 @@ StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
*/
strategy->buffers[strategy->current] = InvalidBuffer;
+ /*
+ * The buffer is being rejected and will not be written out, so the caller
+ * should not use this flag (it should already have been initialized to
+ * false anyway). Set it here regardless, for clarity.
+ */
+ *write_from_ring = false;
+
return true;
}
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 9c038851d7..edd3296dd7 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -15,6 +15,7 @@
*/
#include "postgres.h"
+#include "pgstat.h"
#include "access/parallel.h"
#include "catalog/catalog.h"
#include "executor/instrument.h"
@@ -123,6 +124,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
if (LocalBufHash == NULL)
InitLocalBuffers();
+
/* See if the desired buffer already exists */
hresult = (LocalBufferLookupEnt *)
hash_search(LocalBufHash, (void *) &newTag, HASH_FIND, NULL);
@@ -196,6 +198,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
LocalRefCount[b]++;
ResourceOwnerRememberBuffer(CurrentResourceOwner,
BufferDescriptorGetBuffer(bufHdr));
+
+ pgstat_count_io_op(IOOP_ALLOC, IOPATH_LOCAL);
break;
}
}
@@ -226,6 +230,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
localpage,
false);
+ pgstat_count_io_op(IOOP_WRITE, IOPATH_LOCAL);
+
/* Mark not-dirty now in case we error out below */
buf_state &= ~BM_DIRTY;
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index e1fb631003..20e259edef 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -432,6 +432,8 @@ ProcessSyncRequests(void)
total_elapsed += elapsed;
processed++;
+ pgstat_count_io_op(IOOP_FSYNC, IOPATH_SHARED);
+
if (log_checkpoints)
elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f ms",
processed,
diff --git a/src/backend/utils/activity/Makefile b/src/backend/utils/activity/Makefile
index a2e8507fd6..0098785089 100644
--- a/src/backend/utils/activity/Makefile
+++ b/src/backend/utils/activity/Makefile
@@ -22,6 +22,7 @@ OBJS = \
pgstat_checkpointer.o \
pgstat_database.o \
pgstat_function.o \
+ pgstat_io_ops.o \
pgstat_relation.o \
pgstat_replslot.o \
pgstat_shmem.o \
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 88e5dd1b2b..3238d9ba85 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -359,6 +359,15 @@ static const PgStat_KindInfo pgstat_kind_infos[PGSTAT_NUM_KINDS] = {
.snapshot_cb = pgstat_checkpointer_snapshot_cb,
},
+ [PGSTAT_KIND_IOOPS] = {
+ .name = "io_ops",
+
+ .fixed_amount = true,
+
+ .reset_all_cb = pgstat_io_ops_reset_all_cb,
+ .snapshot_cb = pgstat_io_ops_snapshot_cb,
+ },
+
[PGSTAT_KIND_SLRU] = {
.name = "slru",
@@ -628,6 +637,9 @@ pgstat_report_stat(bool force)
/* flush database / relation / function / ... stats */
partial_flush |= pgstat_flush_pending_entries(nowait);
+ /* flush IO Operations stats */
+ partial_flush |= pgstat_flush_io_ops(nowait);
+
/* flush wal stats */
partial_flush |= pgstat_flush_wal(nowait);
@@ -1312,6 +1324,12 @@ pgstat_write_statsfile(void)
pgstat_build_snapshot_fixed(PGSTAT_KIND_CHECKPOINTER);
write_chunk_s(fpout, &pgStatLocal.snapshot.checkpointer);
+ /*
+ * Write IO Operations stats struct
+ */
+ pgstat_build_snapshot_fixed(PGSTAT_KIND_IOOPS);
+ write_chunk_s(fpout, &pgStatLocal.snapshot.io_ops);
+
/*
* Write SLRU stats struct
*/
@@ -1427,8 +1445,10 @@ pgstat_read_statsfile(void)
FILE *fpin;
int32 format_id;
bool found;
+ PgStat_BackendIOPathOps io_stats;
const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
PgStat_ShmemControl *shmem = pgStatLocal.shmem;
+ PgStatShared_BackendIOPathOps *io_stats_shmem = &shmem->io_ops;
/* shouldn't be called from postmaster */
Assert(IsUnderPostmaster || !IsPostmasterEnvironment);
@@ -1486,6 +1506,22 @@ pgstat_read_statsfile(void)
if (!read_chunk_s(fpin, &shmem->checkpointer.stats))
goto error;
+ /*
+ * Read IO Operations stats struct
+ */
+ if (!read_chunk_s(fpin, &io_stats))
+ goto error;
+
+ io_stats_shmem->stat_reset_timestamp = io_stats.stat_reset_timestamp;
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ PgStat_IOPathOps *stats = &io_stats.stats[i];
+ PgStatShared_IOPathOps *stats_shmem = &io_stats_shmem->stats[i];
+
+ memcpy(stats_shmem->data, stats->data, sizeof(stats->data));
+ }
+
/*
* Read SLRU stats struct
*/
diff --git a/src/backend/utils/activity/pgstat_bgwriter.c b/src/backend/utils/activity/pgstat_bgwriter.c
index fbb1edc527..3d7f90a1b7 100644
--- a/src/backend/utils/activity/pgstat_bgwriter.c
+++ b/src/backend/utils/activity/pgstat_bgwriter.c
@@ -24,7 +24,7 @@ PgStat_BgWriterStats PendingBgWriterStats = {0};
/*
- * Report bgwriter statistics
+ * Report bgwriter and IO Operation statistics
*/
void
pgstat_report_bgwriter(void)
@@ -56,6 +56,11 @@ pgstat_report_bgwriter(void)
* Clear out the statistics buffer, so it can be re-used.
*/
MemSet(&PendingBgWriterStats, 0, sizeof(PendingBgWriterStats));
+
+ /*
+ * Report IO Operations statistics
+ */
+ pgstat_flush_io_ops(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_checkpointer.c b/src/backend/utils/activity/pgstat_checkpointer.c
index af8d513e7b..cfcf127210 100644
--- a/src/backend/utils/activity/pgstat_checkpointer.c
+++ b/src/backend/utils/activity/pgstat_checkpointer.c
@@ -24,7 +24,7 @@ PgStat_CheckpointerStats PendingCheckpointerStats = {0};
/*
- * Report checkpointer statistics
+ * Report checkpointer and IO Operation statistics
*/
void
pgstat_report_checkpointer(void)
@@ -62,6 +62,11 @@ pgstat_report_checkpointer(void)
* Clear out the statistics buffer, so it can be re-used.
*/
MemSet(&PendingCheckpointerStats, 0, sizeof(PendingCheckpointerStats));
+
+ /*
+ * Report IO Operation statistics
+ */
+ pgstat_flush_io_ops(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_io_ops.c b/src/backend/utils/activity/pgstat_io_ops.c
new file mode 100644
index 0000000000..6e7351660f
--- /dev/null
+++ b/src/backend/utils/activity/pgstat_io_ops.c
@@ -0,0 +1,192 @@
+/* -------------------------------------------------------------------------
+ *
+ * pgstat_io_ops.c
+ * Implementation of IO operation statistics.
+ *
+ * This file contains the implementation of IO operation statistics. It is kept
+ * separate from pgstat.c to enforce the line between the statistics access /
+ * storage implementation and the details about individual types of
+ * statistics.
+ *
+ * Copyright (c) 2001-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/utils/activity/pgstat_io_ops.c
+ * -------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "utils/pgstat_internal.h"
+
+static PgStat_IOPathOps pending_IOOpStats;
+bool have_ioopstats = false;
+
+
+/*
+ * Flush out locally pending IO Operation statistics entries
+ *
+ * If nowait is true and the lock could not be acquired, returns true
+ * without flushing anything; otherwise flushes the pending entries and
+ * returns false.
+ */
+bool
+pgstat_flush_io_ops(bool nowait)
+{
+ PgStatShared_IOPathOps *stats_shmem;
+
+ if (!have_ioopstats)
+ return false;
+
+ stats_shmem =
+ &pgStatLocal.shmem->io_ops.stats[backend_type_get_idx(MyBackendType)];
+
+ if (!nowait)
+ LWLockAcquire(&stats_shmem->lock, LW_EXCLUSIVE);
+ else if (!LWLockConditionalAcquire(&stats_shmem->lock, LW_EXCLUSIVE))
+ return true;
+
+
+ for (int i = 0; i < IOPATH_NUM_TYPES; i++)
+ {
+ PgStat_IOOpCounters *sharedent = &stats_shmem->data[i];
+ PgStat_IOOpCounters *pendingent = &pending_IOOpStats.data[i];
+
+#define IO_OP_ACC(fld) sharedent->fld += pendingent->fld
+ IO_OP_ACC(allocs);
+ IO_OP_ACC(extends);
+ IO_OP_ACC(fsyncs);
+ IO_OP_ACC(reads);
+ IO_OP_ACC(writes);
+#undef IO_OP_ACC
+ }
+
+ LWLockRelease(&stats_shmem->lock);
+
+ memset(&pending_IOOpStats, 0, sizeof(pending_IOOpStats));
+
+ have_ioopstats = false;
+
+ return false;
+}
+
+void
+pgstat_io_ops_snapshot_cb(void)
+{
+ PgStatShared_BackendIOPathOps *all_backend_stats_shmem = &pgStatLocal.shmem->io_ops;
+ PgStat_BackendIOPathOps *all_backend_stats_snap = &pgStatLocal.snapshot.io_ops;
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ PgStatShared_IOPathOps *stats_shmem = &all_backend_stats_shmem->stats[i];
+ PgStat_IOPathOps *stats_snap = &all_backend_stats_snap->stats[i];
+
+ LWLockAcquire(&stats_shmem->lock, LW_EXCLUSIVE);
+ /*
+ * Use the lock in the first BackendType's PgStat_IOPathOps to protect the
+ * reset timestamp as well.
+ */
+ if (i == 0)
+ all_backend_stats_snap->stat_reset_timestamp = all_backend_stats_shmem->stat_reset_timestamp;
+
+ memcpy(stats_snap->data, stats_shmem->data, sizeof(stats_shmem->data));
+ LWLockRelease(&stats_shmem->lock);
+ }
+
+}
+
+void
+pgstat_io_ops_reset_all_cb(TimestampTz ts)
+{
+ PgStatShared_BackendIOPathOps *all_backend_stats_shmem = &pgStatLocal.shmem->io_ops;
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ PgStatShared_IOPathOps *stats_shmem = &all_backend_stats_shmem->stats[i];
+
+ LWLockAcquire(&stats_shmem->lock, LW_EXCLUSIVE);
+
+ /*
+ * Use the lock in the first BackendType's PgStat_IOPathOps to protect the
+ * reset timestamp as well.
+ */
+ if (i == 0)
+ all_backend_stats_shmem->stat_reset_timestamp = ts;
+
+ memset(stats_shmem->data, 0, sizeof(stats_shmem->data));
+ LWLockRelease(&stats_shmem->lock);
+ }
+}
+
+void
+pgstat_count_io_op(IOOp io_op, IOPath io_path)
+{
+ PgStat_IOOpCounters *pending_counters = &pending_IOOpStats.data[io_path];
+
+ switch (io_op)
+ {
+ case IOOP_ALLOC:
+ pending_counters->allocs++;
+ break;
+ case IOOP_EXTEND:
+ pending_counters->extends++;
+ break;
+ case IOOP_FSYNC:
+ pending_counters->fsyncs++;
+ break;
+ case IOOP_READ:
+ pending_counters->reads++;
+ break;
+ case IOOP_WRITE:
+ pending_counters->writes++;
+ break;
+ }
+
+ have_ioopstats = true;
+}
+
+PgStat_BackendIOPathOps*
+pgstat_fetch_backend_io_path_ops(void)
+{
+ pgstat_snapshot_fixed(PGSTAT_KIND_IOOPS);
+
+ return &pgStatLocal.snapshot.io_ops;
+}
+
+const char *
+pgstat_io_path_desc(IOPath io_path)
+{
+ switch (io_path)
+ {
+ case IOPATH_LOCAL:
+ return "Local";
+ case IOPATH_SHARED:
+ return "Shared";
+ case IOPATH_STRATEGY:
+ return "Strategy";
+ }
+
+ elog(ERROR, "unrecognized IOPath value: %d", io_path);
+}
+
+const char *
+pgstat_io_op_desc(IOOp io_op)
+{
+ switch (io_op)
+ {
+ case IOOP_ALLOC:
+ return "Alloc";
+ case IOOP_EXTEND:
+ return "Extend";
+ case IOOP_FSYNC:
+ return "Fsync";
+ case IOOP_READ:
+ return "Read";
+ case IOOP_WRITE:
+ return "Write";
+ }
+
+ elog(ERROR, "unrecognized IOOp value: %d", io_op);
+}
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index a846d9ffb6..a17b3336db 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -205,7 +205,7 @@ pgstat_drop_relation(Relation rel)
}
/*
- * Report that the table was just vacuumed.
+ * Report that the table was just vacuumed and flush IO Operation statistics.
*/
void
pgstat_report_vacuum(Oid tableoid, bool shared,
@@ -257,10 +257,18 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
}
pgstat_unlock_entry(entry_ref);
+
+ /*
+ * Flush IO Operations statistics now. pgstat_report_stat() will flush IO
+ * Operation stats, however this will not be called after an entire
+ * autovacuum cycle is done -- which will likely vacuum many relations -- or
+ * until the VACUUM command has processed all tables and committed.
+ */
+ pgstat_flush_io_ops(false);
}
/*
- * Report that the table was just analyzed.
+ * Report that the table was just analyzed and flush IO Operation statistics.
*
* Caller must provide new live- and dead-tuples estimates, as well as a
* flag indicating whether to reset the changes_since_analyze counter.
@@ -340,6 +348,13 @@ pgstat_report_analyze(Relation rel,
}
pgstat_unlock_entry(entry_ref);
+
+ /*
+ * Flush IO Operations statistics explicitly for the same reason as in
+ * pgstat_report_vacuum(). We don't want to wait for an entire ANALYZE
+ * command to complete before updating stats.
+ */
+ pgstat_flush_io_ops(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_wal.c b/src/backend/utils/activity/pgstat_wal.c
index 5a878bd115..9cac407b42 100644
--- a/src/backend/utils/activity/pgstat_wal.c
+++ b/src/backend/utils/activity/pgstat_wal.c
@@ -34,7 +34,7 @@ static WalUsage prevWalUsage;
/*
* Calculate how much WAL usage counters have increased and update
- * shared statistics.
+ * shared WAL and IO Operation statistics.
*
* Must be called by processes that generate WAL, that do not call
* pgstat_report_stat(), like walwriter.
@@ -43,6 +43,8 @@ void
pgstat_report_wal(bool force)
{
pgstat_flush_wal(force);
+
+ pgstat_flush_io_ops(force);
}
/*
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 893690dad5..6259cc4f4c 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -2104,6 +2104,8 @@ pg_stat_reset_shared(PG_FUNCTION_ARGS)
pgstat_reset_of_kind(PGSTAT_KIND_BGWRITER);
pgstat_reset_of_kind(PGSTAT_KIND_CHECKPOINTER);
}
+ else if (strcmp(target, "io") == 0)
+ pgstat_reset_of_kind(PGSTAT_KIND_IOOPS);
else if (strcmp(target, "recovery_prefetch") == 0)
XLogPrefetchResetStats();
else if (strcmp(target, "wal") == 0)
@@ -2112,7 +2114,7 @@ pg_stat_reset_shared(PG_FUNCTION_ARGS)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("unrecognized reset target: \"%s\"", target),
- errhint("Target must be \"archiver\", \"bgwriter\", \"recovery_prefetch\", or \"wal\".")));
+ errhint("Target must be \"archiver\", \"bgwriter\", \"io\", \"recovery_prefetch\", or \"wal\".")));
PG_RETURN_VOID();
}
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 5276bf25a1..61e95135f2 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -331,6 +331,8 @@ typedef enum BackendType
B_WAL_WRITER,
} BackendType;
+#define BACKEND_NUM_TYPES B_WAL_WRITER
+
extern PGDLLIMPORT BackendType MyBackendType;
extern const char *GetBackendTypeDesc(BackendType backendType);
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index ac28f813b4..d6ed6ec864 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -14,6 +14,7 @@
#include "datatype/timestamp.h"
#include "portability/instr_time.h"
#include "postmaster/pgarch.h" /* for MAX_XFN_CHARS */
+#include "storage/lwlock.h"
#include "utils/backend_progress.h" /* for backward compatibility */
#include "utils/backend_status.h" /* for backward compatibility */
#include "utils/relcache.h"
@@ -48,6 +49,7 @@ typedef enum PgStat_Kind
PGSTAT_KIND_ARCHIVER,
PGSTAT_KIND_BGWRITER,
PGSTAT_KIND_CHECKPOINTER,
+ PGSTAT_KIND_IOOPS,
PGSTAT_KIND_SLRU,
PGSTAT_KIND_WAL,
} PgStat_Kind;
@@ -276,6 +278,50 @@ typedef struct PgStat_CheckpointerStats
PgStat_Counter buf_fsync_backend;
} PgStat_CheckpointerStats;
+/*
+ * Types related to counting IO Operations for various IO Paths
+ */
+
+typedef enum IOOp
+{
+ IOOP_ALLOC,
+ IOOP_EXTEND,
+ IOOP_FSYNC,
+ IOOP_READ,
+ IOOP_WRITE,
+} IOOp;
+
+#define IOOP_NUM_TYPES (IOOP_WRITE + 1)
+
+typedef enum IOPath
+{
+ IOPATH_LOCAL,
+ IOPATH_SHARED,
+ IOPATH_STRATEGY,
+} IOPath;
+
+#define IOPATH_NUM_TYPES (IOPATH_STRATEGY + 1)
+
+typedef struct PgStat_IOOpCounters
+{
+ PgStat_Counter allocs;
+ PgStat_Counter extends;
+ PgStat_Counter fsyncs;
+ PgStat_Counter reads;
+ PgStat_Counter writes;
+} PgStat_IOOpCounters;
+
+typedef struct PgStat_IOPathOps
+{
+ PgStat_IOOpCounters data[IOPATH_NUM_TYPES];
+} PgStat_IOPathOps;
+
+typedef struct PgStat_BackendIOPathOps
+{
+ TimestampTz stat_reset_timestamp;
+ PgStat_IOPathOps stats[BACKEND_NUM_TYPES];
+} PgStat_BackendIOPathOps;
+
typedef struct PgStat_StatDBEntry
{
PgStat_Counter n_xact_commit;
@@ -453,6 +499,18 @@ extern void pgstat_report_checkpointer(void);
extern PgStat_CheckpointerStats *pgstat_fetch_stat_checkpointer(void);
+/*
+ * Functions in pgstat_io_ops.c
+ */
+
+extern void pgstat_count_io_op(IOOp io_op, IOPath io_path);
+extern bool pgstat_flush_io_ops(bool nowait);
+extern PgStat_BackendIOPathOps *pgstat_fetch_backend_io_path_ops(void);
+extern PgStat_Counter pgstat_fetch_cumulative_io_ops(IOPath io_path, IOOp io_op);
+extern const char *pgstat_io_op_desc(IOOp io_op);
+extern const char *pgstat_io_path_desc(IOPath io_path);
+
+
/*
* Functions in pgstat_database.c
*/
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 69e45900ba..b69c5f7e3c 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -313,7 +313,7 @@ extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
uint32 *buf_state);
extern void StrategyFreeBuffer(BufferDesc *buf);
extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
- BufferDesc *buf);
+ BufferDesc *buf, bool *write_from_ring);
extern int StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
extern void StrategyNotifyBgWriter(int bgwprocno);
diff --git a/src/include/utils/backend_status.h b/src/include/utils/backend_status.h
index 7403bca25e..d9b6d12acc 100644
--- a/src/include/utils/backend_status.h
+++ b/src/include/utils/backend_status.h
@@ -306,6 +306,42 @@ extern const char *pgstat_get_crashed_backend_activity(int pid, char *buffer,
int buflen);
extern uint64 pgstat_get_my_query_id(void);
+/* Utility functions */
+
+/*
+ * When maintaining an array of information about all valid BackendTypes, in
+ * order to avoid wasting the 0th spot, use this helper to convert a valid
+ * BackendType to a valid location in the array (given that no spot is
+ * maintained for B_INVALID BackendType).
+ */
+static inline int
+backend_type_get_idx(BackendType backend_type)
+{
+ /*
+ * backend_type must be one of the valid backend types. If caller is
+ * maintaining backend information in an array that includes B_INVALID,
+ * this function is unnecessary.
+ */
+ Assert(backend_type > B_INVALID && backend_type <= BACKEND_NUM_TYPES);
+ return backend_type - 1;
+}
+
+/*
+ * When using a value from an array of information about all valid
+ * BackendTypes, add 1 to the index before using it as a BackendType to adjust
+ * for not maintaining a spot for B_INVALID BackendType.
+ */
+static inline BackendType
+idx_get_backend_type(int idx)
+{
+ int backend_type = idx + 1;
+ /*
+ * If the array includes a spot for B_INVALID BackendType this function is
+ * not required.
+ */
+ Assert(backend_type > B_INVALID && backend_type <= BACKEND_NUM_TYPES);
+ return backend_type;
+}
/* ----------
* Support functions for the SQL-callable functions to
diff --git a/src/include/utils/pgstat_internal.h b/src/include/utils/pgstat_internal.h
index 9303d05427..3151c43dfe 100644
--- a/src/include/utils/pgstat_internal.h
+++ b/src/include/utils/pgstat_internal.h
@@ -329,6 +329,19 @@ typedef struct PgStatShared_Checkpointer
PgStat_CheckpointerStats reset_offset;
} PgStatShared_Checkpointer;
+typedef struct PgStatShared_IOPathOps
+{
+ LWLock lock;
+ PgStat_IOOpCounters data[IOPATH_NUM_TYPES];
+} PgStatShared_IOPathOps;
+
+typedef struct PgStatShared_BackendIOPathOps
+{
+ TimestampTz stat_reset_timestamp;
+ PgStatShared_IOPathOps stats[BACKEND_NUM_TYPES];
+} PgStatShared_BackendIOPathOps;
+
+
typedef struct PgStatShared_SLRU
{
/* lock protects ->stats */
@@ -419,6 +432,7 @@ typedef struct PgStat_ShmemControl
PgStatShared_Archiver archiver;
PgStatShared_BgWriter bgwriter;
PgStatShared_Checkpointer checkpointer;
+ PgStatShared_BackendIOPathOps io_ops;
PgStatShared_SLRU slru;
PgStatShared_Wal wal;
} PgStat_ShmemControl;
@@ -442,6 +456,8 @@ typedef struct PgStat_Snapshot
PgStat_CheckpointerStats checkpointer;
+ PgStat_BackendIOPathOps io_ops;
+
PgStat_SLRUStats slru[SLRU_NUM_ELEMENTS];
PgStat_WalStats wal;
@@ -549,6 +565,14 @@ extern void pgstat_database_reset_timestamp_cb(PgStatShared_Common *header, Time
extern bool pgstat_function_flush_cb(PgStat_EntryRef *entry_ref, bool nowait);
+/*
+ * Functions in pgstat_io_ops.c
+ */
+
+extern void pgstat_io_ops_snapshot_cb(void);
+extern void pgstat_io_ops_reset_all_cb(TimestampTz ts);
+
+
/*
* Functions in pgstat_relation.c
*/
--
2.34.1
v25-0004-Add-system-view-tracking-IO-ops-per-backend-type.patch (text/x-patch; charset=US-ASCII)
From f1dd9c1ccddce6ee4cad4df70f3475ac2a83bca3 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 4 Jul 2022 15:44:17 -0400
Subject: [PATCH v25 4/4] Add system view tracking IO ops per backend type
Add pg_stat_io, a system view which tracks the number of each IOOp
(allocs, extends, fsyncs, reads, and writes) done through each IOPath
(shared
buffers, local buffers, strategy buffers) by each type of backend (e.g.
client backend, checkpointer).
Some IOPaths are not used by some BackendTypes and will not be in the
view. For example, checkpointer does not use a BufferAccessStrategy
(currently), so there will be no row for the "strategy" IOPath for
checkpointer.
Some IOOps are invalid in combination with certain IOPaths. Those cells
will be NULL in the view. For example, local buffers are not fsync'd so
cells for all BackendTypes for IOPATH_LOCAL and IOOP_FSYNC will be
NULL.
View stats are fetched from statistics incremented when a backend
performs an IO Operation and maintained by the cumulative statistics
subsystem.
Each row of the view is stats for a particular BackendType for a
particular IOPath (e.g. shared buffer accesses by checkpointer) and
each column in the view is the total number of IO Operations done (e.g.
writes).
So a cell in the view would be, for example, the number of shared
buffers written by checkpointer since the last stats reset.
Note that some of the cells in the view are redundant with fields in
pg_stat_bgwriter (e.g. buffers_backend), however these have been kept in
pg_stat_bgwriter for backwards compatibility. Deriving the redundant
pg_stat_bgwriter stats from the IO operations stats structures was also
problematic due to the separate reset targets for 'bgwriter' and
'io'.
Suggested by Andres Freund
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>, Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
doc/src/sgml/monitoring.sgml | 117 ++++++++++++++++++++++++++-
src/backend/catalog/system_views.sql | 12 +++
src/backend/utils/adt/pgstatfuncs.c | 106 ++++++++++++++++++++++++
src/include/catalog/pg_proc.dat | 9 +++
src/test/regress/expected/rules.out | 9 +++
src/test/regress/expected/stats.out | 59 ++++++++++++++
src/test/regress/sql/stats.sql | 34 ++++++++
7 files changed, 345 insertions(+), 1 deletion(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 4549c2560e..2b0ee495ee 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -448,6 +448,15 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</entry>
</row>
+ <row>
+ <entry><structname>pg_stat_io</structname><indexterm><primary>pg_stat_io</primary></indexterm></entry>
+ <entry>A row for each IO path for each backend type showing
+ statistics about backend IO operations. See
+ <link linkend="monitoring-pg-stat-io-view">
+ <structname>pg_stat_io</structname></link> for details.
+ </entry>
+ </row>
+
<row>
<entry><structname>pg_stat_wal</structname><indexterm><primary>pg_stat_wal</primary></indexterm></entry>
<entry>One row only, showing statistics about WAL activity. See
@@ -3595,7 +3604,111 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
<structfield>stats_reset</structfield> <type>timestamp with time zone</type>
</para>
<para>
- Time at which these statistics were last reset
+ Time at which these statistics were last reset.
+ </para></entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ </sect2>
+
+ <sect2 id="monitoring-pg-stat-io-view">
+ <title><structname>pg_stat_io</structname></title>
+
+ <indexterm>
+ <primary>pg_stat_io</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_stat_io</structname> view has one row for each
+ combination of backend type and IO path, containing cluster-wide data
+ for that backend type and IO path.
+ </para>
+
+ <table id="pg-stat-io-view" xreflabel="pg_stat_io">
+ <title><structname>pg_stat_io</structname> View</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>backend_type</structfield> <type>text</type>
+ </para>
+ <para>
+ Type of backend (e.g. background worker, autovacuum worker).
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>io_path</structfield> <type>text</type>
+ </para>
+ <para>
+ IO path taken (e.g. shared buffers, local buffers).
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>alloc</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers allocated.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>extend</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of blocks extended.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>fsync</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of blocks fsynced.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>read</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of blocks read.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>write</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of blocks written.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
+ </para>
+ <para>
+ Time at which these statistics were last reset.
</para></entry>
</row>
</tbody>
@@ -5355,6 +5468,8 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
the <structname>pg_stat_bgwriter</structname>
view, <literal>archiver</literal> to reset all the counters shown in
the <structname>pg_stat_archiver</structname> view,
+ <literal>io</literal> to reset all the counters shown in the
+ <structname>pg_stat_io</structname> view,
<literal>wal</literal> to reset all the counters shown in the
<structname>pg_stat_wal</structname> view or
<literal>recovery_prefetch</literal> to reset all the counters shown
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index fedaed533b..1fe3b07daa 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1115,6 +1115,18 @@ CREATE VIEW pg_stat_bgwriter AS
pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+CREATE VIEW pg_stat_io AS
+SELECT
+ b.backend_type,
+ b.io_path,
+ b.alloc,
+ b.extend,
+ b.fsync,
+ b.read,
+ b.write,
+ b.stats_reset
+FROM pg_stat_get_io() b;
+
CREATE VIEW pg_stat_wal AS
SELECT
w.wal_records,
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 6259cc4f4c..21d54ec9b1 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1739,6 +1739,112 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
}
+/*
+ * When adding a new column to the pg_stat_io view, add a new enum
+ * value here above IO_NUM_COLUMNS.
+ */
+enum
+{
+ IO_COLUMN_BACKEND_TYPE,
+ IO_COLUMN_IO_PATH,
+ IO_COLUMN_ALLOCS,
+ IO_COLUMN_EXTENDS,
+ IO_COLUMN_FSYNCS,
+ IO_COLUMN_READS,
+ IO_COLUMN_WRITES,
+ IO_COLUMN_RESET_TIME,
+ IO_NUM_COLUMNS,
+};
+
+Datum
+pg_stat_get_io(PG_FUNCTION_ARGS)
+{
+ PgStat_BackendIOPathOps *io_stats;
+ PgStat_IOPathOps *io_path_ops;
+ ReturnSetInfo *rsinfo;
+ Datum reset_time;
+
+ SetSingleFuncCall(fcinfo, 0);
+ io_stats = pgstat_fetch_backend_io_path_ops();
+ rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+
+ /*
+ * Currently it is not permitted to reset IO operation stats for individual
+ * IO Paths or individual BackendTypes. All IO Operation statistics are
+ * reset together. As such, it is easiest to reuse the first reset timestamp
+ * available.
+ */
+ reset_time = TimestampTzGetDatum(io_stats->stat_reset_timestamp);
+
+ io_path_ops = io_stats->stats;
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ bool can_use_strategy;
+ PgStat_IOOpCounters *counters = io_path_ops->data;
+ BackendType backend_type = idx_get_backend_type(i);
+
+ /*
+ * IO Operation statistics are not collected for all BackendTypes.
+ * For those BackendTypes without IO Operation stats, skip representing them
+ * in the view altogether.
+ *
+ * The following BackendTypes do not participate in the cumulative stats
+ * subsystem or do not do IO operations worth reporting statistics on:
+ * - Startup process because it does not have relation OIDs
+ * - Syslogger because it is not connected to shared memory
+ * - Archiver because most relevant archiving IO is delegated to a
+ * specialized command or module
+ */
+ if (backend_type == B_ARCHIVER || backend_type == B_LOGGER || backend_type
+ == B_STARTUP)
+ continue;
+
+ /*
+ * Not all BackendTypes will use a BufferAccessStrategy. Omit those rows
+ * from the view.
+ */
+ can_use_strategy = backend_type == B_AUTOVAC_WORKER || backend_type ==
+ B_BACKEND || backend_type == B_STANDALONE_BACKEND || backend_type ==
+ B_BG_WORKER;
+
+ for (int j = 0; j < IOPATH_NUM_TYPES; j++)
+ {
+ Datum values[IO_NUM_COLUMNS];
+ bool nulls[IO_NUM_COLUMNS];
+
+ if (j == IOPATH_STRATEGY && !can_use_strategy)
+ continue;
+
+ memset(values, 0, sizeof(values));
+ memset(nulls, 0, sizeof(nulls));
+
+ values[IO_COLUMN_BACKEND_TYPE] = CStringGetTextDatum(GetBackendTypeDesc(backend_type));
+ values[IO_COLUMN_IO_PATH] = CStringGetTextDatum(pgstat_io_path_desc(j));
+ values[IO_COLUMN_RESET_TIME] = TimestampTzGetDatum(reset_time);
+ values[IO_COLUMN_ALLOCS] = Int64GetDatum(counters->allocs);
+ values[IO_COLUMN_EXTENDS] = Int64GetDatum(counters->extends);
+ values[IO_COLUMN_FSYNCS] = Int64GetDatum(counters->fsyncs);
+ values[IO_COLUMN_READS] = Int64GetDatum(counters->reads);
+ values[IO_COLUMN_WRITES] = Int64GetDatum(counters->writes);
+
+ /*
+ * Temporary tables using local buffers are not WAL-logged and thus do not
+ * require fsync'ing. Set this cell to NULL to differentiate between an
+ * invalid combination and 0 observed IO Operations.
+ */
+ if (j == IOPATH_LOCAL)
+ nulls[IO_COLUMN_FSYNCS] = true;
+
+ tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
+ counters++;
+ }
+
+ io_path_ops++;
+ }
+
+ return (Datum) 0;
+}
+
/*
* Returns statistics of WAL activity
*/
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 2e41f4d9e8..bec3c93991 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5646,6 +5646,15 @@
proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
+{ oid => '8459', descr => 'statistics: counts of all IO operations done through all IO paths by each type of backend.',
+ proname => 'pg_stat_get_io', provolatile => 's', proisstrict => 'f',
+ prorows => '52', proretset => 't',
+ proparallel => 'r', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,text,int8,int8,int8,int8,int8,timestamptz}',
+ proargmodes => '{o,o,o,o,o,o,o,o}',
+ proargnames => '{backend_type,io_path,alloc,extend,fsync,read,write,stats_reset}',
+ prosrc => 'pg_stat_get_io' },
+
{ oid => '1136', descr => 'statistics: information about WAL activity',
proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
proparallel => 'r', prorettype => 'record', proargtypes => '',
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 7ec3d2688f..2b269e005e 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1873,6 +1873,15 @@ pg_stat_gssapi| SELECT s.pid,
s.gss_enc AS encrypted
FROM pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid, query_id)
WHERE (s.client_port IS NOT NULL);
+pg_stat_io| SELECT b.backend_type,
+ b.io_path,
+ b.alloc,
+ b.extend,
+ b.fsync,
+ b.read,
+ b.write,
+ b.stats_reset
+ FROM pg_stat_get_io() b(backend_type, io_path, alloc, extend, fsync, read, write, stats_reset);
pg_stat_progress_analyze| SELECT s.pid,
s.datid,
d.datname,
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index 5b0ebf090f..6dade03b65 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -554,4 +554,63 @@ SELECT pg_stat_get_live_tuples(:drop_stats_test_subxact_oid);
DROP TABLE trunc_stats_test, trunc_stats_test1, trunc_stats_test2, trunc_stats_test3, trunc_stats_test4;
DROP TABLE prevstats;
+-- Test that writes to Shared Buffers are tracked in pg_stat_io
+SELECT sum(write) AS io_sum_shared_writes_before FROM pg_stat_io WHERE io_path = 'Shared' \gset
+CREATE TABLE test_io_shared_writes(a int);
+INSERT INTO test_io_shared_writes SELECT i FROM generate_series(1,100)i;
+CHECKPOINT;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(write) AS io_sum_shared_writes_after FROM pg_stat_io WHERE io_path = 'Shared' \gset
+SELECT :io_sum_shared_writes_after > :io_sum_shared_writes_before;
+ ?column?
+----------
+ t
+(1 row)
+
+DROP TABLE test_io_shared_writes;
+-- Test that extends of temporary tables are tracked in pg_stat_io
+CREATE TEMPORARY TABLE test_io_local_extends(a int);
+SELECT sum(extend) AS io_sum_local_extends_before FROM pg_stat_io WHERE io_path = 'Local' \gset
+INSERT INTO test_io_local_extends VALUES(1);
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(extend) AS io_sum_local_extends_after FROM pg_stat_io WHERE io_path = 'Local' \gset
+SELECT :io_sum_local_extends_after > :io_sum_local_extends_before;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Test that, when using a Strategy, reused buffers from the Strategy ring
+-- are counted as "Strategy" allocs.
+CREATE TABLE test_io_strategy_stats(a INT, b INT);
+ALTER TABLE test_io_strategy_stats SET (autovacuum_enabled = 'false');
+INSERT INTO test_io_strategy_stats SELECT i, i from generate_series(1,8000)i;
+-- Ensure that the next VACUUM will need to perform IO
+VACUUM (FULL) test_io_strategy_stats;
+SELECT sum(alloc) AS io_sum_strategy_allocs_before FROM pg_stat_io WHERE io_path = 'Strategy' \gset
+VACUUM (PARALLEL 0) test_io_strategy_stats;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(alloc) AS io_sum_strategy_allocs_after FROM pg_stat_io WHERE io_path = 'Strategy' \gset
+SELECT :io_sum_strategy_allocs_after > :io_sum_strategy_allocs_before;
+ ?column?
+----------
+ t
+(1 row)
+
+DROP TABLE test_io_strategy_stats;
-- End of Stats Test
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index 3f3cf8fb56..fbd3977605 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -285,4 +285,38 @@ SELECT pg_stat_get_live_tuples(:drop_stats_test_subxact_oid);
DROP TABLE trunc_stats_test, trunc_stats_test1, trunc_stats_test2, trunc_stats_test3, trunc_stats_test4;
DROP TABLE prevstats;
+
+-- Test that writes to Shared Buffers are tracked in pg_stat_io
+SELECT sum(write) AS io_sum_shared_writes_before FROM pg_stat_io WHERE io_path = 'Shared' \gset
+CREATE TABLE test_io_shared_writes(a int);
+INSERT INTO test_io_shared_writes SELECT i FROM generate_series(1,100)i;
+CHECKPOINT;
+SELECT pg_stat_force_next_flush();
+SELECT sum(write) AS io_sum_shared_writes_after FROM pg_stat_io WHERE io_path = 'Shared' \gset
+SELECT :io_sum_shared_writes_after > :io_sum_shared_writes_before;
+DROP TABLE test_io_shared_writes;
+
+-- Test that extends of temporary tables are tracked in pg_stat_io
+CREATE TEMPORARY TABLE test_io_local_extends(a int);
+SELECT sum(extend) AS io_sum_local_extends_before FROM pg_stat_io WHERE io_path = 'Local' \gset
+INSERT INTO test_io_local_extends VALUES(1);
+SELECT pg_stat_force_next_flush();
+SELECT sum(extend) AS io_sum_local_extends_after FROM pg_stat_io WHERE io_path = 'Local' \gset
+SELECT :io_sum_local_extends_after > :io_sum_local_extends_before;
+
+-- Test that, when using a Strategy, reused buffers from the Strategy ring
+-- are counted as "Strategy" allocs.
+CREATE TABLE test_io_strategy_stats(a INT, b INT);
+ALTER TABLE test_io_strategy_stats SET (autovacuum_enabled = 'false');
+INSERT INTO test_io_strategy_stats SELECT i, i from generate_series(1,8000)i;
+-- Ensure that the next VACUUM will need to perform IO
+VACUUM (FULL) test_io_strategy_stats;
+SELECT sum(alloc) AS io_sum_strategy_allocs_before FROM pg_stat_io WHERE io_path = 'Strategy' \gset
+VACUUM (PARALLEL 0) test_io_strategy_stats;
+SELECT pg_stat_force_next_flush();
+SELECT sum(alloc) AS io_sum_strategy_allocs_after FROM pg_stat_io WHERE io_path = 'Strategy' \gset
+SELECT :io_sum_strategy_allocs_after > :io_sum_strategy_allocs_before;
+DROP TABLE test_io_strategy_stats;
+
+
-- End of Stats Test
--
2.34.1
In addition to adding several new tests, the attached version 26 fixes a
major bug in constructing the view.
The only valid IOPATH/IOOP combination that is not tested now is
IOPATH_STRATEGY + IOOP_WRITE. When I ran this in regress, the
checkpointer usually wrote out the dirty strategy buffer before VACUUM
got around to reusing and writing it out.
I've also changed the BACKEND_NUM_TYPES definition. Now arrays will have
that dead spot for B_INVALID, but I feel like it is much easier to
understand without trying to skip that spot and use those special helper
functions.
I also started skipping rows in the view for WAL_RECEIVER and
WAL_WRITER, and, for IOPATH_LOCAL, for all BackendTypes except B_BACKEND
and WAL_SENDER.
On Tue, Jul 12, 2022 at 1:18 PM Andres Freund <andres@anarazel.de> wrote:
> On 2022-07-11 22:22:28 -0400, Melanie Plageman wrote:
> > Yes, per an off list suggestion by you, I have changed the tests to
> > use a sum of writes. I've also added a test for IOPATH_LOCAL and
> > fixed some of the missing calls to count IO Operations for
> > IOPATH_LOCAL and IOPATH_STRATEGY.
> >
> > I struggled to come up with a way to test that writes for a
> > particular type of backend are counted correctly since a dirty buffer
> > could be written out by another type of backend before the target
> > BackendType has a chance to write it out.
>
> I guess temp file writes would be reliably done by one backend... Don't
> have a good idea otherwise.

This was mainly an issue for IOPATH_STRATEGY writes as I mentioned. I
still have not solved this.

> > I'm not sure how to cause a strategy "extend" for testing.
>
> COPY into a table should work. But might be unattractive due to the
> size of the COPY ringbuffer.

Did it with a CTAS as Horiguchi-san suggested.

> > > Would be nice to have something testing that the ringbuffer stats
> > > stuff does something sensible - that feels not entirely trivial.
> >
> > I've added a test to test that reused strategy buffers are counted as
> > allocs. I would like to add a test which checks that if a buffer in
> > the ring is pinned and thus not reused, it is not counted as a
> > strategy alloc, but I found it challenging without a way to pause
> > vacuuming, pin a buffer, then resume vacuuming.
>
> Yea, that's probably too hard to make reliable to be worth it.

Yes, I have skipped this.
- Melanie
Attachments:
v26-0003-Track-IO-operation-statistics.patch (text/x-patch; charset=US-ASCII)
From f7772e4d19821e0aeb19e906ba6f5e4bb046cfdb Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 29 Jun 2022 18:37:42 -0400
Subject: [PATCH v26 3/4] Track IO operation statistics
Introduce "IOOp", an IO operation done by a backend, and "IOPath", the
location or type of IO done by a backend. For example, the checkpointer
may write a shared buffer out. This would be counted as an IOOp "write"
on an IOPath IOPATH_SHARED by BackendType "checkpointer".
Each IOOp (alloc, extend, fsync, read, write) is counted per IOPath
(local, shared, or strategy) through a call to pgstat_count_io_op().
The primary concern of these statistics is IO operations on data blocks
during the course of normal database operations. IO done by, for
example, the archiver or syslogger is not counted in these statistics.
IOPATH_LOCAL and IOPATH_SHARED IOPaths concern operations on local
and shared buffers.
The IOPATH_STRATEGY IOPath concerns buffers
alloc'd/extended/fsync'd/read/written as part of a BufferAccessStrategy.
IOOP_ALLOC is counted for IOPATH_SHARED and IOPATH_LOCAL whenever a
buffer is acquired through [Local]BufferAlloc(). IOOP_ALLOC for
IOPATH_STRATEGY is counted whenever a buffer already in the strategy
ring is reused. And IOOP_WRITE for IOPATH_STRATEGY is counted whenever
the reused dirty buffer is written out.
Stats on IOOps for all IOPaths for a backend are initially accumulated
locally.
Later they are flushed to shared memory and accumulated with those from
all other backends, exited and live. The accumulated stats in shared
memory could be extended in the future with per-backend stats -- useful
for per connection IO statistics and monitoring.
Some BackendTypes do not flush their pending statistics at regular
intervals, so they explicitly call pgstat_flush_io_ops() during the
course of normal operations to flush their backend-local IO Operation
statistics to shared memory in a timely manner.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>, Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
src/backend/postmaster/checkpointer.c | 1 +
src/backend/storage/buffer/bufmgr.c | 53 ++++-
src/backend/storage/buffer/freelist.c | 25 ++-
src/backend/storage/buffer/localbuf.c | 5 +
src/backend/storage/sync/sync.c | 2 +
src/backend/utils/activity/Makefile | 1 +
src/backend/utils/activity/pgstat.c | 36 ++++
src/backend/utils/activity/pgstat_bgwriter.c | 7 +-
.../utils/activity/pgstat_checkpointer.c | 7 +-
src/backend/utils/activity/pgstat_io_ops.c | 193 ++++++++++++++++++
src/backend/utils/activity/pgstat_relation.c | 19 +-
src/backend/utils/activity/pgstat_wal.c | 4 +-
src/backend/utils/adt/pgstatfuncs.c | 4 +-
src/include/miscadmin.h | 2 +
src/include/pgstat.h | 58 ++++++
src/include/storage/buf_internals.h | 2 +-
src/include/utils/backend_status.h | 1 -
src/include/utils/pgstat_internal.h | 24 +++
18 files changed, 424 insertions(+), 20 deletions(-)
create mode 100644 src/backend/utils/activity/pgstat_io_ops.c
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 5fc076fc14..a06331e1eb 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -1116,6 +1116,7 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
if (!AmBackgroundWriterProcess())
CheckpointerShmem->num_backend_fsync++;
LWLockRelease(CheckpointerCommLock);
+ pgstat_count_io_op(IOOP_FSYNC, IOPATH_SHARED);
return false;
}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index c7d7abcd73..536d422df2 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -482,7 +482,7 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
BlockNumber blockNum,
BufferAccessStrategy strategy,
bool *foundPtr);
-static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+static void FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOPath iopath);
static void FindAndDropRelationBuffers(RelFileLocator rlocator,
ForkNumber forkNum,
BlockNumber nForkBlock,
@@ -813,6 +813,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
BufferDesc *bufHdr;
Block bufBlock;
bool found;
+ IOPath io_path;
bool isExtend;
bool isLocalBuf = SmgrIsTemp(smgr);
@@ -978,8 +979,17 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
+ if (isLocalBuf)
+ io_path = IOPATH_LOCAL;
+ else if (strategy != NULL)
+ io_path = IOPATH_STRATEGY;
+ else
+ io_path = IOPATH_SHARED;
+
if (isExtend)
{
+
+ pgstat_count_io_op(IOOP_EXTEND, io_path);
/* new buffers are zero-filled */
MemSet((char *) bufBlock, 0, BLCKSZ);
/* don't set checksum for all-zero page */
@@ -1010,6 +1020,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
smgrread(smgr, forkNum, blockNum, (char *) bufBlock);
+ pgstat_count_io_op(IOOP_READ, io_path);
+
if (track_io_timing)
{
INSTR_TIME_SET_CURRENT(io_time);
@@ -1180,6 +1192,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
/* Loop here in case we have to try another victim buffer */
for (;;)
{
+ bool from_ring;
/*
* Ensure, while the spinlock's not yet held, that there's a free
* refcount entry.
@@ -1190,7 +1203,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* Select a victim buffer. The buffer is returned with its header
* spinlock still held!
*/
- buf = StrategyGetBuffer(strategy, &buf_state);
+ buf = StrategyGetBuffer(strategy, &buf_state, &from_ring);
Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
@@ -1227,6 +1240,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf),
LW_SHARED))
{
+ IOPath iopath;
/*
* If using a nondefault strategy, and writing the buffer
* would require a WAL flush, let the strategy decide whether
@@ -1253,13 +1267,27 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
}
}
+ /*
+ * When a strategy is in use, if the target dirty buffer is an existing
+ * strategy buffer being reused, count this as a strategy write for the
+ * purposes of IO Operations statistics tracking.
+ *
+ * All dirty shared buffers upon first being added to the ring will be
+ * counted as shared buffer writes.
+ *
+ * When a strategy is not in use, the write can only be a "regular"
+ * write of a dirty shared buffer.
+ */
+
+ iopath = from_ring ? IOPATH_STRATEGY : IOPATH_SHARED;
+
/* OK, do the I/O */
TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(forkNum, blockNum,
smgr->smgr_rlocator.locator.spcOid,
smgr->smgr_rlocator.locator.dbOid,
smgr->smgr_rlocator.locator.relNumber);
- FlushBuffer(buf, NULL);
+ FlushBuffer(buf, NULL, iopath);
LWLockRelease(BufferDescriptorGetContentLock(buf));
ScheduleBufferTagForWriteback(&BackendWritebackContext,
@@ -2563,7 +2591,7 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOPATH_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
@@ -2810,9 +2838,12 @@ BufferGetTag(Buffer buffer, RelFileLocator *rlocator, ForkNumber *forknum,
*
* If the caller has an smgr reference for the buffer's relation, pass it
* as the second parameter. If not, pass NULL.
+ *
+ * IOPath will always be IOPATH_SHARED except when a buffer access strategy is
+ * used and the buffer being flushed is a buffer from the strategy ring.
*/
static void
-FlushBuffer(BufferDesc *buf, SMgrRelation reln)
+FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOPath iopath)
{
XLogRecPtr recptr;
ErrorContextCallback errcallback;
@@ -2892,6 +2923,8 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
*/
bufToWrite = PageSetChecksumCopy((Page) bufBlock, buf->tag.blockNum);
+ pgstat_count_io_op(IOOP_WRITE, iopath);
+
if (track_io_timing)
INSTR_TIME_SET_CURRENT(io_start);
@@ -3539,6 +3572,8 @@ FlushRelationBuffers(Relation rel)
localpage,
false);
+ pgstat_count_io_op(IOOP_WRITE, IOPATH_LOCAL);
+
buf_state &= ~(BM_DIRTY | BM_JUST_DIRTIED);
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
@@ -3574,7 +3609,7 @@ FlushRelationBuffers(Relation rel)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, RelationGetSmgr(rel));
+ FlushBuffer(bufHdr, RelationGetSmgr(rel), IOPATH_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
@@ -3669,7 +3704,7 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, srelent->srel);
+ FlushBuffer(bufHdr, srelent->srel, IOPATH_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
@@ -3877,7 +3912,7 @@ FlushDatabaseBuffers(Oid dbid)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOPATH_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
@@ -3904,7 +3939,7 @@ FlushOneBuffer(Buffer buffer)
Assert(LWLockHeldByMe(BufferDescriptorGetContentLock(bufHdr)));
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOPATH_SHARED);
}
/*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 990e081aae..62ec4518c8 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -15,6 +15,7 @@
*/
#include "postgres.h"
+#include "pgstat.h"
#include "port/atomics.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
@@ -198,13 +199,15 @@ have_free_buffer(void)
* return the buffer with the buffer header spinlock still held.
*/
BufferDesc *
-StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
+StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_ring)
{
BufferDesc *buf;
int bgwprocno;
int trycounter;
uint32 local_buf_state; /* to avoid repeated (de-)referencing */
+ *from_ring = false;
+
/*
* If given a strategy object, see whether it can select a buffer. We
* assume strategy objects don't need buffer_strategy_lock.
@@ -212,8 +215,21 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
if (strategy != NULL)
{
buf = GetBufferFromRing(strategy, buf_state);
- if (buf != NULL)
+ *from_ring = buf != NULL;
+ if (*from_ring)
+ {
+ /*
+ * When a strategy is in use, reused buffers from the strategy ring will
+ * be counted as allocations for the purposes of IO Operation statistics
+ * tracking.
+ *
+ * However, even when a strategy is in use, if a new buffer must be
+ * allocated from shared buffers and added to the ring, this is counted
+ * as an IOPATH_SHARED allocation.
+ */
+ pgstat_count_io_op(IOOP_ALLOC, IOPATH_STRATEGY);
return buf;
+ }
}
/*
@@ -247,6 +263,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
* the rate of buffer consumption. Note that buffers recycled by a
* strategy object are intentionally not counted here.
*/
+ pgstat_count_io_op(IOOP_ALLOC, IOPATH_SHARED);
pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);
/*
@@ -684,11 +701,13 @@ AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
bool
StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
{
/* We only do this in bulkread mode */
if (strategy->btype != BAS_BULKREAD)
return false;
/* Don't muck with behavior of normal buffer-replacement strategy */
if (!strategy->current_was_in_ring ||
strategy->buffers[strategy->current] != BufferDescriptorGetBuffer(buf))
return false;
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 9c038851d7..2d231daef0 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -15,6 +15,7 @@
*/
#include "postgres.h"
+#include "pgstat.h"
#include "access/parallel.h"
#include "catalog/catalog.h"
#include "executor/instrument.h"
@@ -196,6 +197,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
LocalRefCount[b]++;
ResourceOwnerRememberBuffer(CurrentResourceOwner,
BufferDescriptorGetBuffer(bufHdr));
+
+ pgstat_count_io_op(IOOP_ALLOC, IOPATH_LOCAL);
break;
}
}
@@ -226,6 +229,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
localpage,
false);
+ pgstat_count_io_op(IOOP_WRITE, IOPATH_LOCAL);
+
/* Mark not-dirty now in case we error out below */
buf_state &= ~BM_DIRTY;
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 9d6a9e9109..65b69c4cbd 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -432,6 +432,8 @@ ProcessSyncRequests(void)
total_elapsed += elapsed;
processed++;
+ pgstat_count_io_op(IOOP_FSYNC, IOPATH_SHARED);
+
if (log_checkpoints)
elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f ms",
processed,
diff --git a/src/backend/utils/activity/Makefile b/src/backend/utils/activity/Makefile
index a2e8507fd6..0098785089 100644
--- a/src/backend/utils/activity/Makefile
+++ b/src/backend/utils/activity/Makefile
@@ -22,6 +22,7 @@ OBJS = \
pgstat_checkpointer.o \
pgstat_database.o \
pgstat_function.o \
+ pgstat_io_ops.o \
pgstat_relation.o \
pgstat_replslot.o \
pgstat_shmem.o \
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 88e5dd1b2b..3238d9ba85 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -359,6 +359,15 @@ static const PgStat_KindInfo pgstat_kind_infos[PGSTAT_NUM_KINDS] = {
.snapshot_cb = pgstat_checkpointer_snapshot_cb,
},
+ [PGSTAT_KIND_IOOPS] = {
+ .name = "io_ops",
+
+ .fixed_amount = true,
+
+ .reset_all_cb = pgstat_io_ops_reset_all_cb,
+ .snapshot_cb = pgstat_io_ops_snapshot_cb,
+ },
+
[PGSTAT_KIND_SLRU] = {
.name = "slru",
@@ -628,6 +637,9 @@ pgstat_report_stat(bool force)
/* flush database / relation / function / ... stats */
partial_flush |= pgstat_flush_pending_entries(nowait);
+ /* flush IO Operations stats */
+ partial_flush |= pgstat_flush_io_ops(nowait);
+
/* flush wal stats */
partial_flush |= pgstat_flush_wal(nowait);
@@ -1312,6 +1324,12 @@ pgstat_write_statsfile(void)
pgstat_build_snapshot_fixed(PGSTAT_KIND_CHECKPOINTER);
write_chunk_s(fpout, &pgStatLocal.snapshot.checkpointer);
+ /*
+ * Write IO Operations stats struct
+ */
+ pgstat_build_snapshot_fixed(PGSTAT_KIND_IOOPS);
+ write_chunk_s(fpout, &pgStatLocal.snapshot.io_ops);
+
/*
* Write SLRU stats struct
*/
@@ -1427,8 +1445,10 @@ pgstat_read_statsfile(void)
FILE *fpin;
int32 format_id;
bool found;
+ PgStat_BackendIOPathOps io_stats;
const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
PgStat_ShmemControl *shmem = pgStatLocal.shmem;
+ PgStatShared_BackendIOPathOps *io_stats_shmem = &shmem->io_ops;
/* shouldn't be called from postmaster */
Assert(IsUnderPostmaster || !IsPostmasterEnvironment);
@@ -1486,6 +1506,22 @@ pgstat_read_statsfile(void)
if (!read_chunk_s(fpin, &shmem->checkpointer.stats))
goto error;
+ /*
+ * Read IO Operations stats struct
+ */
+ if (!read_chunk_s(fpin, &io_stats))
+ goto error;
+
+ io_stats_shmem->stat_reset_timestamp = io_stats.stat_reset_timestamp;
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ PgStat_IOPathOps *stats = &io_stats.stats[i];
+ PgStatShared_IOPathOps *stats_shmem = &io_stats_shmem->stats[i];
+
+ memcpy(stats_shmem->data, stats->data, sizeof(stats->data));
+ }
+
/*
* Read SLRU stats struct
*/
diff --git a/src/backend/utils/activity/pgstat_bgwriter.c b/src/backend/utils/activity/pgstat_bgwriter.c
index fbb1edc527..3d7f90a1b7 100644
--- a/src/backend/utils/activity/pgstat_bgwriter.c
+++ b/src/backend/utils/activity/pgstat_bgwriter.c
@@ -24,7 +24,7 @@ PgStat_BgWriterStats PendingBgWriterStats = {0};
/*
- * Report bgwriter statistics
+ * Report bgwriter and IO Operation statistics
*/
void
pgstat_report_bgwriter(void)
@@ -56,6 +56,11 @@ pgstat_report_bgwriter(void)
* Clear out the statistics buffer, so it can be re-used.
*/
MemSet(&PendingBgWriterStats, 0, sizeof(PendingBgWriterStats));
+
+ /*
+ * Report IO Operations statistics
+ */
+ pgstat_flush_io_ops(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_checkpointer.c b/src/backend/utils/activity/pgstat_checkpointer.c
index af8d513e7b..cfcf127210 100644
--- a/src/backend/utils/activity/pgstat_checkpointer.c
+++ b/src/backend/utils/activity/pgstat_checkpointer.c
@@ -24,7 +24,7 @@ PgStat_CheckpointerStats PendingCheckpointerStats = {0};
/*
- * Report checkpointer statistics
+ * Report checkpointer and IO Operation statistics
*/
void
pgstat_report_checkpointer(void)
@@ -62,6 +62,11 @@ pgstat_report_checkpointer(void)
* Clear out the statistics buffer, so it can be re-used.
*/
MemSet(&PendingCheckpointerStats, 0, sizeof(PendingCheckpointerStats));
+
+ /*
+ * Report IO Operation statistics
+ */
+ pgstat_flush_io_ops(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_io_ops.c b/src/backend/utils/activity/pgstat_io_ops.c
new file mode 100644
index 0000000000..ec2919cca6
--- /dev/null
+++ b/src/backend/utils/activity/pgstat_io_ops.c
@@ -0,0 +1,193 @@
+/* -------------------------------------------------------------------------
+ *
+ * pgstat_io_ops.c
+ * Implementation of IO operation statistics.
+ *
+ * This file contains the implementation of IO operation statistics. It is kept
+ * separate from pgstat.c to enforce the line between the statistics access /
+ * storage implementation and the details about individual types of
+ * statistics.
+ *
+ * Copyright (c) 2001-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/utils/activity/pgstat_io_ops.c
+ * -------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "utils/pgstat_internal.h"
+
+static PgStat_IOPathOps pending_IOOpStats;
+static bool have_ioopstats = false;
+
+
+/*
+ * Flush out locally pending IO Operation statistics entries
+ *
+ * If nowait is true and the lock could not be acquired, this function
+ * returns true without flushing anything; otherwise it flushes the pending
+ * entries and returns false.
+ */
+bool
+pgstat_flush_io_ops(bool nowait)
+{
+ PgStatShared_IOPathOps *stats_shmem;
+
+ if (!have_ioopstats)
+ return false;
+
+ stats_shmem = &pgStatLocal.shmem->io_ops.stats[MyBackendType];
+
+ if (!nowait)
+ LWLockAcquire(&stats_shmem->lock, LW_EXCLUSIVE);
+ else if (!LWLockConditionalAcquire(&stats_shmem->lock, LW_EXCLUSIVE))
+ return true;
+
+ for (int i = 0; i < IOPATH_NUM_TYPES; i++)
+ {
+ PgStat_IOOpCounters *sharedent = &stats_shmem->data[i];
+ PgStat_IOOpCounters *pendingent = &pending_IOOpStats.data[i];
+
+#define IO_OP_ACC(fld) sharedent->fld += pendingent->fld
+ IO_OP_ACC(allocs);
+ IO_OP_ACC(extends);
+ IO_OP_ACC(fsyncs);
+ IO_OP_ACC(reads);
+ IO_OP_ACC(writes);
+#undef IO_OP_ACC
+ }
+
+ LWLockRelease(&stats_shmem->lock);
+
+ memset(&pending_IOOpStats, 0, sizeof(pending_IOOpStats));
+
+ have_ioopstats = false;
+
+ return false;
+}
+
+void
+pgstat_io_ops_snapshot_cb(void)
+{
+ PgStatShared_BackendIOPathOps *all_backend_stats_shmem = &pgStatLocal.shmem->io_ops;
+ PgStat_BackendIOPathOps *all_backend_stats_snap = &pgStatLocal.snapshot.io_ops;
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ PgStatShared_IOPathOps *stats_shmem = &all_backend_stats_shmem->stats[i];
+ PgStat_IOPathOps *stats_snap = &all_backend_stats_snap->stats[i];
+
+ LWLockAcquire(&stats_shmem->lock, LW_EXCLUSIVE);
+ /*
+ * Use the lock in the first BackendType's PgStat_IOPathOps to protect the
+ * reset timestamp as well.
+ */
+ if (i == 0)
+ all_backend_stats_snap->stat_reset_timestamp = all_backend_stats_shmem->stat_reset_timestamp;
+
+ memcpy(stats_snap->data, stats_shmem->data, sizeof(stats_shmem->data));
+ LWLockRelease(&stats_shmem->lock);
+ }
+}
+
+void
+pgstat_io_ops_reset_all_cb(TimestampTz ts)
+{
+ PgStatShared_BackendIOPathOps *all_backend_stats_shmem = &pgStatLocal.shmem->io_ops;
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ PgStatShared_IOPathOps *stats_shmem = &all_backend_stats_shmem->stats[i];
+
+ LWLockAcquire(&stats_shmem->lock, LW_EXCLUSIVE);
+
+ /*
+ * Use the lock in the first BackendType's PgStat_IOPathOps to protect the
+ * reset timestamp as well.
+ */
+ if (i == 0)
+ all_backend_stats_shmem->stat_reset_timestamp = ts;
+
+ memset(stats_shmem->data, 0, sizeof(stats_shmem->data));
+ LWLockRelease(&stats_shmem->lock);
+ }
+}
+
+void
+pgstat_count_io_op(IOOp io_op, IOPath io_path)
+{
+ PgStat_IOOpCounters *pending_counters = &pending_IOOpStats.data[io_path];
+
+ switch (io_op)
+ {
+ case IOOP_ALLOC:
+ pending_counters->allocs++;
+ break;
+ case IOOP_EXTEND:
+ pending_counters->extends++;
+ break;
+ case IOOP_FSYNC:
+ pending_counters->fsyncs++;
+ break;
+ case IOOP_READ:
+ pending_counters->reads++;
+ break;
+ case IOOP_WRITE:
+ pending_counters->writes++;
+ break;
+ }
+
+ have_ioopstats = true;
+}
+
+PgStat_BackendIOPathOps *
+pgstat_fetch_backend_io_path_ops(void)
+{
+ pgstat_snapshot_fixed(PGSTAT_KIND_IOOPS);
+
+ return &pgStatLocal.snapshot.io_ops;
+}
+
+const char *
+pgstat_io_path_desc(IOPath io_path)
+{
+ switch (io_path)
+ {
+ case IOPATH_LOCAL:
+ return "Local";
+ case IOPATH_SHARED:
+ return "Shared";
+ case IOPATH_STRATEGY:
+ return "Strategy";
+ }
+
+ elog(ERROR, "unrecognized IOPath value: %d", io_path);
+}
+
+const char *
+pgstat_io_op_desc(IOOp io_op)
+{
+ switch (io_op)
+ {
+ case IOOP_ALLOC:
+ return "Alloc";
+ case IOOP_EXTEND:
+ return "Extend";
+ case IOOP_FSYNC:
+ return "Fsync";
+ case IOOP_READ:
+ return "Read";
+ case IOOP_WRITE:
+ return "Write";
+ }
+
+ elog(ERROR, "unrecognized IOOp value: %d", io_op);
+}
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index a846d9ffb6..a17b3336db 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -205,7 +205,7 @@ pgstat_drop_relation(Relation rel)
}
/*
- * Report that the table was just vacuumed.
+ * Report that the table was just vacuumed and flush IO Operation statistics.
*/
void
pgstat_report_vacuum(Oid tableoid, bool shared,
@@ -257,10 +257,18 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
}
pgstat_unlock_entry(entry_ref);
+
+ /*
+ * Flush IO Operations statistics now. pgstat_report_stat() will flush IO
+ * Operation stats, however this will not be called after an entire
+ * autovacuum cycle is done -- which will likely vacuum many relations -- or
+ * until the VACUUM command has processed all tables and committed.
+ */
+ pgstat_flush_io_ops(false);
}
/*
- * Report that the table was just analyzed.
+ * Report that the table was just analyzed and flush IO Operation statistics.
*
* Caller must provide new live- and dead-tuples estimates, as well as a
* flag indicating whether to reset the changes_since_analyze counter.
@@ -340,6 +348,13 @@ pgstat_report_analyze(Relation rel,
}
pgstat_unlock_entry(entry_ref);
+
+ /*
+ * Flush IO Operations statistics explicitly for the same reason as in
+ * pgstat_report_vacuum(). We don't want to wait for an entire ANALYZE
+ * command to complete before updating stats.
+ */
+ pgstat_flush_io_ops(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_wal.c b/src/backend/utils/activity/pgstat_wal.c
index 5a878bd115..9cac407b42 100644
--- a/src/backend/utils/activity/pgstat_wal.c
+++ b/src/backend/utils/activity/pgstat_wal.c
@@ -34,7 +34,7 @@ static WalUsage prevWalUsage;
/*
* Calculate how much WAL usage counters have increased and update
- * shared statistics.
+ * shared WAL and IO Operation statistics.
*
* Must be called by processes that generate WAL, that do not call
* pgstat_report_stat(), like walwriter.
@@ -43,6 +43,8 @@ void
pgstat_report_wal(bool force)
{
pgstat_flush_wal(force);
+
+ pgstat_flush_io_ops(force);
}
/*
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 893690dad5..6259cc4f4c 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -2104,6 +2104,8 @@ pg_stat_reset_shared(PG_FUNCTION_ARGS)
pgstat_reset_of_kind(PGSTAT_KIND_BGWRITER);
pgstat_reset_of_kind(PGSTAT_KIND_CHECKPOINTER);
}
+ else if (strcmp(target, "io") == 0)
+ pgstat_reset_of_kind(PGSTAT_KIND_IOOPS);
else if (strcmp(target, "recovery_prefetch") == 0)
XLogPrefetchResetStats();
else if (strcmp(target, "wal") == 0)
@@ -2112,7 +2114,7 @@ pg_stat_reset_shared(PG_FUNCTION_ARGS)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("unrecognized reset target: \"%s\"", target),
- errhint("Target must be \"archiver\", \"bgwriter\", \"recovery_prefetch\", or \"wal\".")));
+ errhint("Target must be \"archiver\", \"bgwriter\", \"io\", \"recovery_prefetch\", or \"wal\".")));
PG_RETURN_VOID();
}
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 5276bf25a1..e0b25c6815 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -331,6 +331,8 @@ typedef enum BackendType
B_WAL_WRITER,
} BackendType;
+#define BACKEND_NUM_TYPES (B_WAL_WRITER + 1)
+
extern PGDLLIMPORT BackendType MyBackendType;
extern const char *GetBackendTypeDesc(BackendType backendType);
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index ac28f813b4..d6ed6ec864 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -14,6 +14,7 @@
#include "datatype/timestamp.h"
#include "portability/instr_time.h"
#include "postmaster/pgarch.h" /* for MAX_XFN_CHARS */
+#include "storage/lwlock.h"
#include "utils/backend_progress.h" /* for backward compatibility */
#include "utils/backend_status.h" /* for backward compatibility */
#include "utils/relcache.h"
@@ -48,6 +49,7 @@ typedef enum PgStat_Kind
PGSTAT_KIND_ARCHIVER,
PGSTAT_KIND_BGWRITER,
PGSTAT_KIND_CHECKPOINTER,
+ PGSTAT_KIND_IOOPS,
PGSTAT_KIND_SLRU,
PGSTAT_KIND_WAL,
} PgStat_Kind;
@@ -276,6 +278,50 @@ typedef struct PgStat_CheckpointerStats
PgStat_Counter buf_fsync_backend;
} PgStat_CheckpointerStats;
+/*
+ * Types related to counting IO Operations for various IO Paths
+ */
+
+typedef enum IOOp
+{
+ IOOP_ALLOC,
+ IOOP_EXTEND,
+ IOOP_FSYNC,
+ IOOP_READ,
+ IOOP_WRITE,
+} IOOp;
+
+#define IOOP_NUM_TYPES (IOOP_WRITE + 1)
+
+typedef enum IOPath
+{
+ IOPATH_LOCAL,
+ IOPATH_SHARED,
+ IOPATH_STRATEGY,
+} IOPath;
+
+#define IOPATH_NUM_TYPES (IOPATH_STRATEGY + 1)
+
+typedef struct PgStat_IOOpCounters
+{
+ PgStat_Counter allocs;
+ PgStat_Counter extends;
+ PgStat_Counter fsyncs;
+ PgStat_Counter reads;
+ PgStat_Counter writes;
+} PgStat_IOOpCounters;
+
+typedef struct PgStat_IOPathOps
+{
+ PgStat_IOOpCounters data[IOPATH_NUM_TYPES];
+} PgStat_IOPathOps;
+
+typedef struct PgStat_BackendIOPathOps
+{
+ TimestampTz stat_reset_timestamp;
+ PgStat_IOPathOps stats[BACKEND_NUM_TYPES];
+} PgStat_BackendIOPathOps;
+
typedef struct PgStat_StatDBEntry
{
PgStat_Counter n_xact_commit;
@@ -453,6 +499,18 @@ extern void pgstat_report_checkpointer(void);
extern PgStat_CheckpointerStats *pgstat_fetch_stat_checkpointer(void);
+/*
+ * Functions in pgstat_io_ops.c
+ */
+
+extern void pgstat_count_io_op(IOOp io_op, IOPath io_path);
+extern bool pgstat_flush_io_ops(bool nowait);
+extern PgStat_BackendIOPathOps *pgstat_fetch_backend_io_path_ops(void);
+extern PgStat_Counter pgstat_fetch_cumulative_io_ops(IOPath io_path, IOOp io_op);
+extern const char *pgstat_io_op_desc(IOOp io_op);
+extern const char *pgstat_io_path_desc(IOPath io_path);
+
+
/*
* Functions in pgstat_database.c
*/
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 69e45900ba..da18999f59 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -310,7 +310,7 @@ extern void ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *
/* freelist.c */
extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
- uint32 *buf_state);
+ uint32 *buf_state, bool *from_ring);
extern void StrategyFreeBuffer(BufferDesc *buf);
extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
BufferDesc *buf);
diff --git a/src/include/utils/backend_status.h b/src/include/utils/backend_status.h
index 7403bca25e..b401c7ade2 100644
--- a/src/include/utils/backend_status.h
+++ b/src/include/utils/backend_status.h
@@ -306,7 +306,6 @@ extern const char *pgstat_get_crashed_backend_activity(int pid, char *buffer,
int buflen);
extern uint64 pgstat_get_my_query_id(void);
-
/* ----------
* Support functions for the SQL-callable functions to
* generate the pgstat* views.
diff --git a/src/include/utils/pgstat_internal.h b/src/include/utils/pgstat_internal.h
index 9303d05427..3151c43dfe 100644
--- a/src/include/utils/pgstat_internal.h
+++ b/src/include/utils/pgstat_internal.h
@@ -329,6 +329,19 @@ typedef struct PgStatShared_Checkpointer
PgStat_CheckpointerStats reset_offset;
} PgStatShared_Checkpointer;
+typedef struct PgStatShared_IOPathOps
+{
+ LWLock lock;
+ PgStat_IOOpCounters data[IOPATH_NUM_TYPES];
+} PgStatShared_IOPathOps;
+
+typedef struct PgStatShared_BackendIOPathOps
+{
+ TimestampTz stat_reset_timestamp;
+ PgStatShared_IOPathOps stats[BACKEND_NUM_TYPES];
+} PgStatShared_BackendIOPathOps;
+
+
typedef struct PgStatShared_SLRU
{
/* lock protects ->stats */
@@ -419,6 +432,7 @@ typedef struct PgStat_ShmemControl
PgStatShared_Archiver archiver;
PgStatShared_BgWriter bgwriter;
PgStatShared_Checkpointer checkpointer;
+ PgStatShared_BackendIOPathOps io_ops;
PgStatShared_SLRU slru;
PgStatShared_Wal wal;
} PgStat_ShmemControl;
@@ -442,6 +456,8 @@ typedef struct PgStat_Snapshot
PgStat_CheckpointerStats checkpointer;
+ PgStat_BackendIOPathOps io_ops;
+
PgStat_SLRUStats slru[SLRU_NUM_ELEMENTS];
PgStat_WalStats wal;
@@ -549,6 +565,14 @@ extern void pgstat_database_reset_timestamp_cb(PgStatShared_Common *header, Time
extern bool pgstat_function_flush_cb(PgStat_EntryRef *entry_ref, bool nowait);
+/*
+ * Functions in pgstat_io_ops.c
+ */
+
+extern void pgstat_io_ops_snapshot_cb(void);
+extern void pgstat_io_ops_reset_all_cb(TimestampTz ts);
+
+
/*
* Functions in pgstat_relation.c
*/
--
2.34.1
v26-0004-Add-system-view-tracking-IO-ops-per-backend-type.patch
From 6d28dbdd174df4da115735c67f2dc3f5ff51555b Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 4 Jul 2022 15:44:17 -0400
Subject: [PATCH v26 4/4] Add system view tracking IO ops per backend type
Add pg_stat_io, a system view which tracks the number of IOOp (allocs,
extends, fsyncs, reads, and writes) done through each IOPath (shared
buffers, local buffers, strategy buffers) by each type of backend (e.g.
client backend, checkpointer).
Some IOPaths are not used by some BackendTypes and will not be in the
view. For example, checkpointer does not use a BufferAccessStrategy
(currently), so there will be no row for the "strategy" IOPath for
checkpointer.
Some IOOps are invalid in combination with certain IOPaths. Those cells
will be NULL in the view. For example, local buffers are not fsync'd, so
cells for all BackendTypes for IOPATH_LOCAL and IOOP_FSYNC will be NULL.
View stats are fetched from statistics incremented when a backend
performs an IO Operation and maintained by the cumulative statistics
subsystem.
Each row of the view is stats for a particular BackendType for a
particular IOPath (e.g. shared buffer accesses by checkpointer) and
each column in the view is the total number of IO Operations done (e.g.
writes).
So a cell in the view would be, for example, the number of shared
buffers written by checkpointer since the last stats reset.
Note that some of the cells in the view are redundant with fields in
pg_stat_bgwriter (e.g. buffers_backend), however these have been kept in
pg_stat_bgwriter for backwards compatibility. Deriving the redundant
pg_stat_bgwriter stats from the IO operations stats structures was also
problematic due to the separate reset targets for 'bgwriter' and
'io'.
Suggested by Andres Freund
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>, Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
doc/src/sgml/monitoring.sgml | 117 ++++++++++++++++-
src/backend/catalog/system_views.sql | 12 ++
src/backend/utils/adt/pgstatfuncs.c | 110 ++++++++++++++++
src/include/catalog/pg_proc.dat | 9 ++
src/test/regress/expected/rules.out | 9 ++
src/test/regress/expected/stats.out | 190 +++++++++++++++++++++++++++
src/test/regress/sql/stats.sql | 91 +++++++++++++
7 files changed, 537 insertions(+), 1 deletion(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index f2066e5f0f..105e86d678 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -448,6 +448,15 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</entry>
</row>
+ <row>
+ <entry><structname>pg_stat_io</structname><indexterm><primary>pg_stat_io</primary></indexterm></entry>
+ <entry>A row for each IO path for each backend type showing
+ statistics about backend IO operations. See
+ <link linkend="monitoring-pg-stat-io-view">
+ <structname>pg_stat_io</structname></link> for details.
+ </entry>
+ </row>
+
<row>
<entry><structname>pg_stat_wal</structname><indexterm><primary>pg_stat_wal</primary></indexterm></entry>
<entry>One row only, showing statistics about WAL activity. See
@@ -3600,7 +3609,111 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
<structfield>stats_reset</structfield> <type>timestamp with time zone</type>
</para>
<para>
- Time at which these statistics were last reset
+ Time at which these statistics were last reset.
+ </para></entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ </sect2>
+
+ <sect2 id="monitoring-pg-stat-io-view">
+ <title><structname>pg_stat_io</structname></title>
+
+ <indexterm>
+ <primary>pg_stat_io</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_stat_io</structname> view has one row for each
+ combination of backend type and IO path, showing cluster-wide IO
+ operation statistics for that combination.
+ </para>
+
+ <table id="pg-stat-io-view" xreflabel="pg_stat_io">
+ <title><structname>pg_stat_io</structname> View</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>backend_type</structfield> <type>text</type>
+ </para>
+ <para>
+ Type of backend (e.g. background worker, autovacuum worker).
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>io_path</structfield> <type>text</type>
+ </para>
+ <para>
+ IO path used (e.g. shared buffers, local buffers, strategy buffers).
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>alloc</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers allocated.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>extend</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of blocks extended.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>fsync</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of fsync calls.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>read</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of blocks read.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>write</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of blocks written.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
+ </para>
+ <para>
+ Time at which these statistics were last reset.
</para></entry>
</row>
</tbody>
@@ -5360,6 +5473,8 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
the <structname>pg_stat_bgwriter</structname>
view, <literal>archiver</literal> to reset all the counters shown in
the <structname>pg_stat_archiver</structname> view,
+ <literal>io</literal> to reset all the counters shown in the
+ <structname>pg_stat_io</structname> view,
<literal>wal</literal> to reset all the counters shown in the
<structname>pg_stat_wal</structname> view or
<literal>recovery_prefetch</literal> to reset all the counters shown
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index fedaed533b..1fe3b07daa 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1115,6 +1115,18 @@ CREATE VIEW pg_stat_bgwriter AS
pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+CREATE VIEW pg_stat_io AS
+SELECT
+ b.backend_type,
+ b.io_path,
+ b.alloc,
+ b.extend,
+ b.fsync,
+ b.read,
+ b.write,
+ b.stats_reset
+FROM pg_stat_get_io() b;
+
CREATE VIEW pg_stat_wal AS
SELECT
w.wal_records,
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 6259cc4f4c..1c905d5413 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1739,6 +1739,116 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
}
+/*
+ * When adding a new column to the pg_stat_io view, add a new enum
+ * value here above IO_NUM_COLUMNS.
+ */
+enum
+{
+ IO_COLUMN_BACKEND_TYPE,
+ IO_COLUMN_IO_PATH,
+ IO_COLUMN_ALLOCS,
+ IO_COLUMN_EXTENDS,
+ IO_COLUMN_FSYNCS,
+ IO_COLUMN_READS,
+ IO_COLUMN_WRITES,
+ IO_COLUMN_RESET_TIME,
+ IO_NUM_COLUMNS,
+};
+
+Datum
+pg_stat_get_io(PG_FUNCTION_ARGS)
+{
+ PgStat_BackendIOPathOps *io_stats;
+ ReturnSetInfo *rsinfo;
+ Datum reset_time;
+
+ SetSingleFuncCall(fcinfo, 0);
+ rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+
+ io_stats = pgstat_fetch_backend_io_path_ops();
+
+ /*
+ * Currently it is not permitted to reset IO operation stats for individual
+ * IO Paths or individual BackendTypes. All IO Operation statistics are
+ * reset together. As such, it is easiest to reuse the first reset timestamp
+ * available.
+ */
+ reset_time = TimestampTzGetDatum(io_stats->stat_reset_timestamp);
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ bool uses_local;
+ bool uses_strategy;
+ Datum backend_type_desc = CStringGetTextDatum(GetBackendTypeDesc(i));
+ PgStat_IOPathOps *io_path_ops = &io_stats->stats[i];
+
+ /*
+ * IO Operation statistics are not collected for all BackendTypes.
+ * For those BackendTypes without IO Operation stats, skip representing them
+ * in the view altogether.
+ *
+ * The following BackendTypes do not participate in the cumulative stats
+ * subsystem or do not do IO operations worth reporting statistics on:
+ * - Startup process because it does not have relation OIDs
+ * - Syslogger because it is not connected to shared memory
+ * - Archiver because most relevant archiving IO is delegated to a
+ * specialized command or module
+ * - WAL Receiver and WAL Writer IO is not tracked in pg_stat_io for now
+ */
+ if (i == B_INVALID || i == B_ARCHIVER || i == B_LOGGER || i == B_STARTUP ||
+ i == B_WAL_RECEIVER || i == B_WAL_WRITER)
+ continue;
+
+ /*
+ * Not all BackendTypes will use a BufferAccessStrategy. Omit those rows
+ * from the view.
+ */
+ uses_strategy = (i == B_AUTOVAC_WORKER || i == B_BACKEND ||
+ i == B_STANDALONE_BACKEND || i == B_BG_WORKER);
+
+ uses_local = (i == B_BACKEND || i == B_WAL_SENDER);
+
+ for (int j = 0; j < IOPATH_NUM_TYPES; j++)
+ {
+ PgStat_IOOpCounters *counters = &io_path_ops->data[j];
+ Datum values[IO_NUM_COLUMNS];
+ bool nulls[IO_NUM_COLUMNS];
+
+ if (j == IOPATH_STRATEGY && !uses_strategy)
+ continue;
+
+ if (j == IOPATH_LOCAL && !uses_local)
+ continue;
+
+ memset(values, 0, sizeof(values));
+ memset(nulls, 0, sizeof(nulls));
+
+ values[IO_COLUMN_BACKEND_TYPE] = backend_type_desc;
+ values[IO_COLUMN_IO_PATH] = CStringGetTextDatum(pgstat_io_path_desc(j));
+ values[IO_COLUMN_RESET_TIME] = reset_time;
+ values[IO_COLUMN_ALLOCS] = Int64GetDatum(counters->allocs);
+ values[IO_COLUMN_EXTENDS] = Int64GetDatum(counters->extends);
+ values[IO_COLUMN_FSYNCS] = Int64GetDatum(counters->fsyncs);
+ values[IO_COLUMN_READS] = Int64GetDatum(counters->reads);
+ values[IO_COLUMN_WRITES] = Int64GetDatum(counters->writes);
+
+ /*
+ * Temporary tables using local buffers are not logged and thus do not
+ * require fsync'ing. Set this cell to NULL to differentiate between an
+ * invalid combination and 0 observed IO Operations.
+ */
+ if (j == IOPATH_LOCAL)
+ nulls[IO_COLUMN_FSYNCS] = true;
+
+ tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
+ }
+ }
+
+ return (Datum) 0;
+}
+
/*
* Returns statistics of WAL activity
*/
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 2e41f4d9e8..bec3c93991 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5646,6 +5646,15 @@
proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
+{ oid => '8459', descr => 'statistics: counts of all IO operations done through all IO paths by each type of backend',
+ proname => 'pg_stat_get_io', provolatile => 's', proisstrict => 'f',
+ prorows => '52', proretset => 't',
+ proparallel => 'r', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,text,int8,int8,int8,int8,int8,timestamptz}',
+ proargmodes => '{o,o,o,o,o,o,o,o}',
+ proargnames => '{backend_type,io_path,alloc,extend,fsync,read,write,stats_reset}',
+ prosrc => 'pg_stat_get_io' },
+
{ oid => '1136', descr => 'statistics: information about WAL activity',
proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
proparallel => 'r', prorettype => 'record', proargtypes => '',
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 7ec3d2688f..2b269e005e 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1873,6 +1873,15 @@ pg_stat_gssapi| SELECT s.pid,
s.gss_enc AS encrypted
FROM pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid, query_id)
WHERE (s.client_port IS NOT NULL);
+pg_stat_io| SELECT b.backend_type,
+ b.io_path,
+ b.alloc,
+ b.extend,
+ b.fsync,
+ b.read,
+ b.write,
+ b.stats_reset
+ FROM pg_stat_get_io() b(backend_type, io_path, alloc, extend, fsync, read, write, stats_reset);
pg_stat_progress_analyze| SELECT s.pid,
s.datid,
d.datname,
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index 5b0ebf090f..7cdcadacd6 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -554,4 +554,194 @@ SELECT pg_stat_get_live_tuples(:drop_stats_test_subxact_oid);
DROP TABLE trunc_stats_test, trunc_stats_test1, trunc_stats_test2, trunc_stats_test3, trunc_stats_test4;
DROP TABLE prevstats;
+-- Test that allocs, extends, reads, and writes to Shared Buffers and fsyncs
+-- done to ensure durability of Shared Buffers are tracked in pg_stat_io.
+SELECT sum(alloc) AS io_sum_shared_allocs_before FROM pg_stat_io WHERE io_path = 'Shared' \gset
+SELECT sum(extend) AS io_sum_shared_extends_before FROM pg_stat_io WHERE io_path = 'Shared' \gset
+SELECT sum(fsync) AS io_sum_shared_fsyncs_before FROM pg_stat_io WHERE io_path = 'Shared' \gset
+SELECT sum(read) AS io_sum_shared_reads_before FROM pg_stat_io WHERE io_path = 'Shared' \gset
+SELECT sum(write) AS io_sum_shared_writes_before FROM pg_stat_io WHERE io_path = 'Shared' \gset
+-- Create a regular table and insert some data to generate IOPATH_SHARED allocs and extends.
+CREATE TABLE test_io_shared(a int);
+INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+-- After a checkpoint, there should be some additional IOPATH_SHARED writes and fsyncs.
+CHECKPOINT;
+SELECT sum(alloc) AS io_sum_shared_allocs_after FROM pg_stat_io WHERE io_path = 'Shared' \gset
+SELECT sum(extend) AS io_sum_shared_extends_after FROM pg_stat_io WHERE io_path = 'Shared' \gset
+SELECT sum(write) AS io_sum_shared_writes_after FROM pg_stat_io WHERE io_path = 'Shared' \gset
+SELECT sum(fsync) AS io_sum_shared_fsyncs_after FROM pg_stat_io WHERE io_path = 'Shared' \gset
+SELECT :io_sum_shared_allocs_after > :io_sum_shared_allocs_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_shared_extends_after > :io_sum_shared_extends_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT current_setting('fsync') = 'off' OR :io_sum_shared_fsyncs_after > :io_sum_shared_fsyncs_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_shared_writes_after > :io_sum_shared_writes_before;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Change the tablespace so that the table is rewritten directly, then SELECT
+-- from it to cause it to be read back into Shared Buffers.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE test_io_shared_stats_tblspc LOCATION '';
+ALTER TABLE test_io_shared SET TABLESPACE test_io_shared_stats_tblspc;
+SELECT COUNT(*) FROM test_io_shared;
+ count
+-------
+ 100
+(1 row)
+
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(read) AS io_sum_shared_reads_after FROM pg_stat_io WHERE io_path = 'Shared' \gset
+SELECT :io_sum_shared_reads_after > :io_sum_shared_reads_before;
+ ?column?
+----------
+ t
+(1 row)
+
+DROP TABLE test_io_shared;
+DROP TABLESPACE test_io_shared_stats_tblspc;
+-- Test that allocs, extends, reads, and writes of temporary tables are tracked
+-- in pg_stat_io.
+CREATE TEMPORARY TABLE test_io_local(a int, b TEXT);
+SELECT sum(alloc) AS io_sum_local_allocs_before FROM pg_stat_io WHERE io_path = 'Local' \gset
+SELECT sum(extend) AS io_sum_local_extends_before FROM pg_stat_io WHERE io_path = 'Local' \gset
+SELECT sum(read) AS io_sum_local_reads_before FROM pg_stat_io WHERE io_path = 'Local' \gset
+SELECT sum(write) AS io_sum_local_writes_before FROM pg_stat_io WHERE io_path = 'Local' \gset
+-- Insert enough values that we need to reuse and write out dirty local
+-- buffers.
+INSERT INTO test_io_local SELECT generate_series(1, 80000) as id,
+'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+-- Read in evicted buffers.
+SELECT COUNT(*) FROM test_io_local;
+ count
+-------
+ 80000
+(1 row)
+
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(alloc) AS io_sum_local_allocs_after FROM pg_stat_io WHERE io_path = 'Local' \gset
+SELECT sum(extend) AS io_sum_local_extends_after FROM pg_stat_io WHERE io_path = 'Local' \gset
+SELECT sum(read) AS io_sum_local_reads_after FROM pg_stat_io WHERE io_path = 'Local' \gset
+SELECT sum(write) AS io_sum_local_writes_after FROM pg_stat_io WHERE io_path = 'Local' \gset
+SELECT :io_sum_local_allocs_after > :io_sum_local_allocs_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_local_extends_after > :io_sum_local_extends_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_local_reads_after > :io_sum_local_reads_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_local_writes_after > :io_sum_local_writes_before;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Test that, when using a Strategy, reusing buffers from the Strategy ring
+-- count as "Strategy" allocs in pg_stat_io. Also test that Strategy reads are
+-- counted as such.
+SELECT sum(alloc) AS io_sum_strategy_allocs_before FROM pg_stat_io WHERE io_path = 'Strategy' \gset
+SELECT sum(read) AS io_sum_strategy_reads_before FROM pg_stat_io WHERE io_path = 'Strategy' \gset
+CREATE TABLE test_io_strategy(a INT, b INT);
+ALTER TABLE test_io_strategy SET (autovacuum_enabled = 'false');
+INSERT INTO test_io_strategy SELECT i, i from generate_series(1, 8000)i;
+-- Ensure that the next VACUUM will need to perform IO by rewriting the table
+-- first with VACUUM (FULL).
+VACUUM (FULL) test_io_strategy;
+VACUUM (PARALLEL 0) test_io_strategy;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(alloc) AS io_sum_strategy_allocs_after FROM pg_stat_io WHERE io_path = 'Strategy' \gset
+SELECT sum(read) AS io_sum_strategy_reads_after FROM pg_stat_io WHERE io_path = 'Strategy' \gset
+SELECT :io_sum_strategy_allocs_after > :io_sum_strategy_allocs_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_strategy_reads_after > :io_sum_strategy_reads_before;
+ ?column?
+----------
+ t
+(1 row)
+
+DROP TABLE test_io_strategy;
+-- Test that, when using a Strategy, if creating a relation, Strategy extends
+-- are counted in pg_stat_io.
+-- A CTAS uses a Bulkwrite strategy.
+SELECT sum(extend) AS io_sum_strategy_extends_before FROM pg_stat_io WHERE io_path = 'Strategy' \gset
+CREATE TABLE test_io_strategy_extend AS SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(extend) AS io_sum_strategy_extends_after FROM pg_stat_io WHERE io_path = 'Strategy' \gset
+SELECT :io_sum_strategy_extends_after > :io_sum_strategy_extends_before;
+ ?column?
+----------
+ t
+(1 row)
+
+DROP TABLE test_io_strategy_extend;
+SELECT sum(alloc) + sum(extend) + sum(fsync) + sum(read) + sum(write) AS io_stats_pre_reset FROM pg_stat_io \gset
+SELECT pg_stat_reset_shared('io');
+ pg_stat_reset_shared
+----------------------
+
+(1 row)
+
+SELECT sum(alloc) + sum(extend) + sum(fsync) + sum(read) + sum(write) AS io_stats_post_reset FROM pg_stat_io \gset
+SELECT :io_stats_post_reset < :io_stats_pre_reset;
+ ?column?
+----------
+ t
+(1 row)
+
-- End of Stats Test
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index 3f3cf8fb56..4d8932feba 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -285,4 +285,95 @@ SELECT pg_stat_get_live_tuples(:drop_stats_test_subxact_oid);
DROP TABLE trunc_stats_test, trunc_stats_test1, trunc_stats_test2, trunc_stats_test3, trunc_stats_test4;
DROP TABLE prevstats;
+
+-- Test that allocs, extends, reads, and writes to Shared Buffers and fsyncs
+-- done to ensure durability of Shared Buffers are tracked in pg_stat_io.
+SELECT sum(alloc) AS io_sum_shared_allocs_before FROM pg_stat_io WHERE io_path = 'Shared' \gset
+SELECT sum(extend) AS io_sum_shared_extends_before FROM pg_stat_io WHERE io_path = 'Shared' \gset
+SELECT sum(fsync) AS io_sum_shared_fsyncs_before FROM pg_stat_io WHERE io_path = 'Shared' \gset
+SELECT sum(read) AS io_sum_shared_reads_before FROM pg_stat_io WHERE io_path = 'Shared' \gset
+SELECT sum(write) AS io_sum_shared_writes_before FROM pg_stat_io WHERE io_path = 'Shared' \gset
+-- Create a regular table and insert some data to generate IOPATH_SHARED allocs and extends.
+CREATE TABLE test_io_shared(a int);
+INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+-- After a checkpoint, there should be some additional IOPATH_SHARED writes and fsyncs.
+CHECKPOINT;
+SELECT sum(alloc) AS io_sum_shared_allocs_after FROM pg_stat_io WHERE io_path = 'Shared' \gset
+SELECT sum(extend) AS io_sum_shared_extends_after FROM pg_stat_io WHERE io_path = 'Shared' \gset
+SELECT sum(write) AS io_sum_shared_writes_after FROM pg_stat_io WHERE io_path = 'Shared' \gset
+SELECT sum(fsync) AS io_sum_shared_fsyncs_after FROM pg_stat_io WHERE io_path = 'Shared' \gset
+SELECT :io_sum_shared_allocs_after > :io_sum_shared_allocs_before;
+SELECT :io_sum_shared_extends_after > :io_sum_shared_extends_before;
+SELECT current_setting('fsync') = 'off' OR :io_sum_shared_fsyncs_after > :io_sum_shared_fsyncs_before;
+SELECT :io_sum_shared_writes_after > :io_sum_shared_writes_before;
+-- Change the tablespace so that the table is rewritten directly, then SELECT
+-- from it to cause it to be read back into Shared Buffers.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE test_io_shared_stats_tblspc LOCATION '';
+ALTER TABLE test_io_shared SET TABLESPACE test_io_shared_stats_tblspc;
+SELECT COUNT(*) FROM test_io_shared;
+SELECT pg_stat_force_next_flush();
+SELECT sum(read) AS io_sum_shared_reads_after FROM pg_stat_io WHERE io_path = 'Shared' \gset
+SELECT :io_sum_shared_reads_after > :io_sum_shared_reads_before;
+DROP TABLE test_io_shared;
+DROP TABLESPACE test_io_shared_stats_tblspc;
+
+-- Test that allocs, extends, reads, and writes of temporary tables are tracked
+-- in pg_stat_io.
+CREATE TEMPORARY TABLE test_io_local(a int, b TEXT);
+SELECT sum(alloc) AS io_sum_local_allocs_before FROM pg_stat_io WHERE io_path = 'Local' \gset
+SELECT sum(extend) AS io_sum_local_extends_before FROM pg_stat_io WHERE io_path = 'Local' \gset
+SELECT sum(read) AS io_sum_local_reads_before FROM pg_stat_io WHERE io_path = 'Local' \gset
+SELECT sum(write) AS io_sum_local_writes_before FROM pg_stat_io WHERE io_path = 'Local' \gset
+-- Insert enough values that we need to reuse and write out dirty local
+-- buffers.
+INSERT INTO test_io_local SELECT generate_series(1, 80000) as id,
+'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+-- Read in evicted buffers.
+SELECT COUNT(*) FROM test_io_local;
+SELECT pg_stat_force_next_flush();
+SELECT sum(alloc) AS io_sum_local_allocs_after FROM pg_stat_io WHERE io_path = 'Local' \gset
+SELECT sum(extend) AS io_sum_local_extends_after FROM pg_stat_io WHERE io_path = 'Local' \gset
+SELECT sum(read) AS io_sum_local_reads_after FROM pg_stat_io WHERE io_path = 'Local' \gset
+SELECT sum(write) AS io_sum_local_writes_after FROM pg_stat_io WHERE io_path = 'Local' \gset
+SELECT :io_sum_local_allocs_after > :io_sum_local_allocs_before;
+SELECT :io_sum_local_extends_after > :io_sum_local_extends_before;
+SELECT :io_sum_local_reads_after > :io_sum_local_reads_before;
+SELECT :io_sum_local_writes_after > :io_sum_local_writes_before;
+
+-- Test that, when using a Strategy, reusing buffers from the Strategy ring
+-- count as "Strategy" allocs in pg_stat_io. Also test that Strategy reads are
+-- counted as such.
+SELECT sum(alloc) AS io_sum_strategy_allocs_before FROM pg_stat_io WHERE io_path = 'Strategy' \gset
+SELECT sum(read) AS io_sum_strategy_reads_before FROM pg_stat_io WHERE io_path = 'Strategy' \gset
+CREATE TABLE test_io_strategy(a INT, b INT);
+ALTER TABLE test_io_strategy SET (autovacuum_enabled = 'false');
+INSERT INTO test_io_strategy SELECT i, i from generate_series(1, 8000)i;
+-- Ensure that the next VACUUM will need to perform IO by rewriting the table
+-- first with VACUUM (FULL).
+VACUUM (FULL) test_io_strategy;
+VACUUM (PARALLEL 0) test_io_strategy;
+SELECT pg_stat_force_next_flush();
+SELECT sum(alloc) AS io_sum_strategy_allocs_after FROM pg_stat_io WHERE io_path = 'Strategy' \gset
+SELECT sum(read) AS io_sum_strategy_reads_after FROM pg_stat_io WHERE io_path = 'Strategy' \gset
+SELECT :io_sum_strategy_allocs_after > :io_sum_strategy_allocs_before;
+SELECT :io_sum_strategy_reads_after > :io_sum_strategy_reads_before;
+DROP TABLE test_io_strategy;
+
+-- Test that, when using a Strategy, if creating a relation, Strategy extends
+-- are counted in pg_stat_io.
+-- A CTAS uses a Bulkwrite strategy.
+SELECT sum(extend) AS io_sum_strategy_extends_before FROM pg_stat_io WHERE io_path = 'Strategy' \gset
+CREATE TABLE test_io_strategy_extend AS SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+SELECT sum(extend) AS io_sum_strategy_extends_after FROM pg_stat_io WHERE io_path = 'Strategy' \gset
+SELECT :io_sum_strategy_extends_after > :io_sum_strategy_extends_before;
+DROP TABLE test_io_strategy_extend;
+
+SELECT sum(alloc) + sum(extend) + sum(fsync) + sum(read) + sum(write) AS io_stats_pre_reset FROM pg_stat_io \gset
+SELECT pg_stat_reset_shared('io');
+SELECT sum(alloc) + sum(extend) + sum(fsync) + sum(read) + sum(write) AS io_stats_post_reset FROM pg_stat_io \gset
+SELECT :io_stats_post_reset < :io_stats_pre_reset;
+
-- End of Stats Test
--
2.34.1
Attachment: v26-0002-Remove-unneeded-call-to-pgstat_report_wal.patch (text/x-patch)
From 2c869d7c48ddcedf52d61b2c18173c19e588c48b Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 12 Jul 2022 19:53:23 -0400
Subject: [PATCH v26 2/4] Remove unneeded call to pgstat_report_wal()
pgstat_report_stat() will be called before shutdown so an explicit call
to pgstat_report_wal() is wasted.
---
src/backend/postmaster/walwriter.c | 11 -----------
1 file changed, 11 deletions(-)
diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c
index e926f8c27c..beb46dcb55 100644
--- a/src/backend/postmaster/walwriter.c
+++ b/src/backend/postmaster/walwriter.c
@@ -293,18 +293,7 @@ HandleWalWriterInterrupts(void)
}
if (ShutdownRequestPending)
- {
- /*
- * Force reporting remaining WAL statistics at process exit.
- *
- * Since pgstat_report_wal is invoked with 'force' is false in main
- * loop to avoid overloading the cumulative stats system, there may
- * exist unreported stats counters for the WAL writer.
- */
- pgstat_report_wal(true);
-
proc_exit(0);
- }
/* Perform logging of memory contexts of this process */
if (LogMemoryContextPending)
--
2.34.1
Attachment: v26-0001-Add-BackendType-for-standalone-backends.patch (text/x-patch)
From 49251edea9c1d02420ae358db5d78cf0ef36504b Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 28 Jun 2022 11:33:04 -0400
Subject: [PATCH v26 1/4] Add BackendType for standalone backends
All backends should have a BackendType to enable statistics reporting
per BackendType.
Add a new BackendType for standalone backends, B_STANDALONE_BACKEND (and
alphabetize the BackendTypes). Both the bootstrap backend and single
user mode backends will have BackendType B_STANDALONE_BACKEND.
Author: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://www.postgresql.org/message-id/CAAKRu_aaq33UnG4TXq3S-OSXGWj1QGf0sU%2BECH4tNwGFNERkZA%40mail.gmail.com
---
src/backend/utils/init/miscinit.c | 17 +++++++++++------
src/include/miscadmin.h | 5 +++--
2 files changed, 14 insertions(+), 8 deletions(-)
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index eb43b2c5e5..07e6db1a1c 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -176,6 +176,8 @@ InitStandaloneProcess(const char *argv0)
{
Assert(!IsPostmasterEnvironment);
+ MyBackendType = B_STANDALONE_BACKEND;
+
/*
* Start our win32 signal implementation
*/
@@ -255,6 +257,9 @@ GetBackendTypeDesc(BackendType backendType)
case B_INVALID:
backendDesc = "not initialized";
break;
+ case B_ARCHIVER:
+ backendDesc = "archiver";
+ break;
case B_AUTOVAC_LAUNCHER:
backendDesc = "autovacuum launcher";
break;
@@ -273,6 +278,12 @@ GetBackendTypeDesc(BackendType backendType)
case B_CHECKPOINTER:
backendDesc = "checkpointer";
break;
+ case B_LOGGER:
+ backendDesc = "logger";
+ break;
+ case B_STANDALONE_BACKEND:
+ backendDesc = "standalone backend";
+ break;
case B_STARTUP:
backendDesc = "startup";
break;
@@ -285,12 +296,6 @@ GetBackendTypeDesc(BackendType backendType)
case B_WAL_WRITER:
backendDesc = "walwriter";
break;
- case B_ARCHIVER:
- backendDesc = "archiver";
- break;
- case B_LOGGER:
- backendDesc = "logger";
- break;
}
return backendDesc;
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index ea9a56d395..5276bf25a1 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -316,18 +316,19 @@ extern void SwitchBackToLocalLatch(void);
typedef enum BackendType
{
B_INVALID = 0,
+ B_ARCHIVER,
B_AUTOVAC_LAUNCHER,
B_AUTOVAC_WORKER,
B_BACKEND,
B_BG_WORKER,
B_BG_WRITER,
B_CHECKPOINTER,
+ B_LOGGER,
+ B_STANDALONE_BACKEND,
B_STARTUP,
B_WAL_RECEIVER,
B_WAL_SENDER,
B_WAL_WRITER,
- B_ARCHIVER,
- B_LOGGER,
} BackendType;
extern PGDLLIMPORT BackendType MyBackendType;
--
2.34.1
I am consolidating the various naming points from this thread into one
email:
From Horiguchi-san:
A bit different thing, but I felt a little uneasy about some uses of
"pgstat_io_ops". IOOp looks like a neighbouring word of IOPath. On the
other hand, actually iopath is used as an attribute of io_ops in many
places. Couldn't we be more consistent about the relationship between
the names?

IOOp -> PgStat_IOOpType
IOPath -> PgStat_IOPath
PgStat_IOOpCounters -> PgStat_IOCounters
PgStat_IOPathOps -> PgStat_IO
pgstat_count_io_op -> pgstat_count_io
So, because of the way the data structures contain arrays of each other
the naming was meant to specify all the information contained in the
data structure:
PgStat_IOOpCounters are all IOOp (I could see removing the word
"counters" from the name for more consistency)
PgStat_IOPathOps are all IOOp for all IOPath
PgStat_BackendIOPathOps are all IOOp for all IOPath for all BackendType
The downside of this naming is that, when choosing a local variable name
for all of the IOOp for all IOPath for a single BackendType,
"backend_io_path_ops" seems accurate but is actually confusing if the
type name for all IOOp for all IOPath for all BackendType is
PgStat_BackendIOPathOps.
I would be open to changing PgStat_BackendIOPathOps to PgStat_IO, but I
don't see how I could omit Path or Op from PgStat_IOPathOps without
making its meaning unclear.
I'm not sure about the idea of prefixing the IOOp and IOPath enums with
Pg_Stat. I could see them being used outside of statistics (though they
are defined in pgstat.h) and could see myself using them in, for
example, calculations for the prefetcher.
From Andres:
Quoting me (Melanie):
Introduce "IOOp", an IO operation done by a backend, and "IOPath", the
location or type of IO done by a backend. For example, the checkpointer
may write a shared buffer out. This would be counted as an IOOp write on
an IOPath IOPATH_SHARED by BackendType "checkpointer".
I'm still not 100% happy with IOPath - seems a bit too easy to confuse
with the file path. What about 'origin'?
I can see the point about IOPATH.
I'm not wild about origin mostly because of the number of O's given that
IO Operation already has two O's. It gets kind of hard to read when
using Pascal Case: IOOrigin and IOOp.
Also, it doesn't totally make sense for alloc. I could be convinced,
though.
IOSOURCE doesn't have the O problem but does still not make sense for
alloc. I also thought of IOSITE and IOVENUE.
Annoying question: pg_stat_io vs pg_statio? I'd not think of suggesting
the latter, except that we already have a bunch of views with that prefix.
As far as pg_stat_io vs pg_statio, they are the only stats views which
don't have an underscore between stat and the rest of the view name, so
perhaps we should move away from statio to stat_io going forward anyway.
I am imagining adding to them with other iostat type metrics once direct
IO is introduced, so they may well be changing soon anyway.
- Melanie
Hi,
On 2022-07-15 11:59:41 -0400, Melanie Plageman wrote:
I'm not sure about the idea of prefixing the IOOp and IOPath enums with
Pg_Stat. I could see them being used outside of statistics (though they
are defined in pgstat.h)
+1
From Andres:
Quoting me (Melanie):
Introduce "IOOp", an IO operation done by a backend, and "IOPath", the
location or type of IO done by a backend. For example, the checkpointer
may write a shared buffer out. This would be counted as an IOOp write on
an IOPath IOPATH_SHARED by BackendType "checkpointer".

I'm still not 100% happy with IOPath - seems a bit too easy to confuse
with the file path. What about 'origin'?
I can see the point about IOPATH.
I'm not wild about origin mostly because of the number of O's given that
IO Operation already has two O's. It gets kind of hard to read when
using Pascal Case: IOOrigin and IOOp.
Also, it doesn't totally make sense for alloc. I could be convinced,
though.

IOSOURCE doesn't have the O problem but does still not make sense for
alloc. I also thought of IOSITE and IOVENUE.
I like "source" - not too bothered by the alloc aspect. I can also see
"context" working.
Annoying question: pg_stat_io vs pg_statio? I'd not think of suggesting
the latter, except that we already have a bunch of views with that prefix.
As far as pg_stat_io vs pg_statio, they are the only stats views which
don't have an underscore between stat and the rest of the view name, so
perhaps we should move away from statio to stat_io going forward anyway.
I am imagining adding to them with other iostat type metrics once direct
IO is introduced, so they may well be changing soon anyway.
I don't think I have strong opinions on this one. I can see arguments for
either naming.
Greetings,
Andres Freund
Hi,
On 2022-07-14 18:44:48 -0400, Melanie Plageman wrote:
Subject: [PATCH v26 1/4] Add BackendType for standalone backends
Subject: [PATCH v26 2/4] Remove unneeded call to pgstat_report_wal()
LGTM.
Subject: [PATCH v26 3/4] Track IO operation statistics
@@ -978,8 +979,17 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
+	if (isLocalBuf)
+		io_path = IOPATH_LOCAL;
+	else if (strategy != NULL)
+		io_path = IOPATH_STRATEGY;
+	else
+		io_path = IOPATH_SHARED;
Seems a bit ugly to have an if (isLocalBuf) just after an isLocalBuf ?.
+	/*
+	 * When a strategy is in use, reused buffers from the strategy ring will
+	 * be counted as allocations for the purposes of IO Operation statistics
+	 * tracking.
+	 *
+	 * However, even when a strategy is in use, if a new buffer must be
+	 * allocated from shared buffers and added to the ring, this is counted
+	 * as a IOPATH_SHARED allocation.
+	 */
There's a bit too much duplication between the paragraphs...
@@ -628,6 +637,9 @@ pgstat_report_stat(bool force)
/* flush database / relation / function / ... stats */
 	partial_flush |= pgstat_flush_pending_entries(nowait);

+	/* flush IO Operations stats */
+	partial_flush |= pgstat_flush_io_ops(nowait);

Could you either add a note to the commit message that the stats file
version needs to be increased, or just include that in the patch.
@@ -1427,8 +1445,10 @@ pgstat_read_statsfile(void)
 	FILE	   *fpin;
 	int32		format_id;
 	bool		found;
+	PgStat_BackendIOPathOps io_stats;
 	const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
 	PgStat_ShmemControl *shmem = pgStatLocal.shmem;
+	PgStatShared_BackendIOPathOps *io_stats_shmem = &shmem->io_ops;

 	/* shouldn't be called from postmaster */
Assert(IsUnderPostmaster || !IsPostmasterEnvironment);
@@ -1486,6 +1506,22 @@ pgstat_read_statsfile(void)
if (!read_chunk_s(fpin, &shmem->checkpointer.stats))
 		goto error;

+	/*
+	 * Read IO Operations stats struct
+	 */
+	if (!read_chunk_s(fpin, &io_stats))
+		goto error;
+
+	io_stats_shmem->stat_reset_timestamp = io_stats.stat_reset_timestamp;
+
+	for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+	{
+		PgStat_IOPathOps *stats = &io_stats.stats[i];
+		PgStatShared_IOPathOps *stats_shmem = &io_stats_shmem->stats[i];
+
+		memcpy(stats_shmem->data, stats->data, sizeof(stats->data));
+	}
Why can't the data be read directly into shared memory?
+void
+pgstat_io_ops_snapshot_cb(void)
+{
+	PgStatShared_BackendIOPathOps *all_backend_stats_shmem = &pgStatLocal.shmem->io_ops;
+	PgStat_BackendIOPathOps *all_backend_stats_snap = &pgStatLocal.snapshot.io_ops;
+
+	for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+	{
+		PgStatShared_IOPathOps *stats_shmem = &all_backend_stats_shmem->stats[i];
+		PgStat_IOPathOps *stats_snap = &all_backend_stats_snap->stats[i];
+
+		LWLockAcquire(&stats_shmem->lock, LW_EXCLUSIVE);
Why acquire the same lock repeatedly for each type, rather than once for
the whole?
+	/*
+	 * Use the lock in the first BackendType's PgStat_IOPathOps to protect the
+	 * reset timestamp as well.
+	 */
+	if (i == 0)
+		all_backend_stats_snap->stat_reset_timestamp = all_backend_stats_shmem->stat_reset_timestamp;
Which also would make this look a bit less awkward.
Starting to look pretty good...
- Andres
On Wed, Jul 20, 2022 at 12:50 PM Andres Freund <andres@anarazel.de> wrote:
On 2022-07-14 18:44:48 -0400, Melanie Plageman wrote:
@@ -1427,8 +1445,10 @@ pgstat_read_statsfile(void)
 	FILE	   *fpin;
 	int32		format_id;
 	bool		found;
+	PgStat_BackendIOPathOps io_stats;
 	const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
 	PgStat_ShmemControl *shmem = pgStatLocal.shmem;
+	PgStatShared_BackendIOPathOps *io_stats_shmem = &shmem->io_ops;

 	/* shouldn't be called from postmaster */
Assert(IsUnderPostmaster || !IsPostmasterEnvironment);
@@ -1486,6 +1506,22 @@ pgstat_read_statsfile(void)
if (!read_chunk_s(fpin, &shmem->checkpointer.stats))
 		goto error;

+	/*
+	 * Read IO Operations stats struct
+	 */
+	if (!read_chunk_s(fpin, &io_stats))
+		goto error;
+
+	io_stats_shmem->stat_reset_timestamp = io_stats.stat_reset_timestamp;
+
+	for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+	{
+		PgStat_IOPathOps *stats = &io_stats.stats[i];
+		PgStatShared_IOPathOps *stats_shmem = &io_stats_shmem->stats[i];
+
+		memcpy(stats_shmem->data, stats->data, sizeof(stats->data));
+	}
Why can't the data be read directly into shared memory?
It is not the same lock. Each PgStatShared_IOPathOps has a lock so that
they can be accessed individually (per BackendType in
PgStatShared_BackendIOPathOps). It is optimized for the more common
operation of flushing at the expense of the snapshot operation (which
should be less common) and reset operation.
+void
+pgstat_io_ops_snapshot_cb(void)
+{
+	PgStatShared_BackendIOPathOps *all_backend_stats_shmem = &pgStatLocal.shmem->io_ops;
+	PgStat_BackendIOPathOps *all_backend_stats_snap = &pgStatLocal.snapshot.io_ops;
+
+	for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+	{
+		PgStatShared_IOPathOps *stats_shmem = &all_backend_stats_shmem->stats[i];
+		PgStat_IOPathOps *stats_snap = &all_backend_stats_snap->stats[i];
+
+		LWLockAcquire(&stats_shmem->lock, LW_EXCLUSIVE);

Why acquire the same lock repeatedly for each type, rather than once for
the whole?
This is also because of having a LWLock in each PgStatShared_IOPathOps.
Because I don't want a lock in the backend local stats, I have two data
structures PgStatShared_IOPathOps and PgStat_IOPathOps. I thought it was
odd to write out the lock to the file, so when persisting the stats, I
write out the relevant data only and when reading it back in to shared
memory, I read in the data member of PgStatShared_IOPathOps.
I've attached v27 of the patch.
I've renamed IOPATH to IOCONTEXT. I also have added assertions to
confirm that unexpected statistics are not being accumulated.
There are also assorted other cleanups and changes.
It would be good to confirm that the rows being skipped and cells that
are NULL in the view are the correct ones.
The startup process will never use a BufferAccessStrategy, right?
On Wed, Jul 20, 2022 at 12:50 PM Andres Freund <andres@anarazel.de> wrote:
Subject: [PATCH v26 3/4] Track IO operation statistics
@@ -978,8 +979,17 @@ ReadBuffer_common(SMgrRelation smgr, char
relpersistence, ForkNumber forkNum,
bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) :
BufHdrGetBlock(bufHdr);
+	if (isLocalBuf)
+		io_path = IOPATH_LOCAL;
+	else if (strategy != NULL)
+		io_path = IOPATH_STRATEGY;
+	else
+		io_path = IOPATH_SHARED;

Seems a bit ugly to have an if (isLocalBuf) just after an isLocalBuf ?.
Changed this.
+	/*
+	 * When a strategy is in use, reused buffers from the strategy ring will
+	 * be counted as allocations for the purposes of IO Operation statistics
+	 * tracking.
+	 *
+	 * However, even when a strategy is in use, if a new buffer must be
+	 * allocated from shared buffers and added to the ring, this is counted
+	 * as a IOPATH_SHARED allocation.
+	 */

There's a bit too much duplication between the paragraphs...
I actually think the two paragraphs are making separate points. I've
edited this, so see if you like it better now.
> @@ -628,6 +637,9 @@ pgstat_report_stat(bool force)
> 	/* flush database / relation / function / ... stats */
> 	partial_flush |= pgstat_flush_pending_entries(nowait);
>
> + /* flush IO Operations stats */
> + partial_flush |= pgstat_flush_io_ops(nowait);

> Could you either add a note to the commit message that the stats file
> version needs to be increased, or just include that in the patch.

Bumped the stats file version in the attached patchset.
- Melanie
Attachments:
v27-0003-Track-IO-operation-statistics.patch (text/x-patch; charset=US-ASCII)
From b382e216b4a3f1dae91b043c5c8d647ea17821b7 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Thu, 11 Aug 2022 18:28:46 -0400
Subject: [PATCH v27 3/4] Track IO operation statistics
Introduce "IOOp", an IO operation done by a backend, and "IOContext",
the location or type of IO done by a backend. For example, the
checkpointer may write a shared buffer out. This would be counted as an
IOOp "write" in the IOContext IOCONTEXT_SHARED by BackendType
"checkpointer".
Each IOOp (alloc, extend, fsync, read, write) is counted per IOContext
(local, shared, or strategy) through a call to pgstat_count_io_op().
The primary concern of these statistics is IO operations on data blocks
during the course of normal database operations. IO done by, for
example, the archiver or syslogger is not counted in these statistics.
IOCONTEXT_LOCAL and IOCONTEXT_SHARED IOContexts concern operations on
local and shared buffers.
The IOCONTEXT_STRATEGY IOContext concerns buffers
alloc'd/extended/fsync'd/read/written as part of a BufferAccessStrategy.
IOOP_ALLOC is counted for IOCONTEXT_SHARED and IOCONTEXT_LOCAL whenever
a buffer is acquired through [Local]BufferAlloc(). IOOP_ALLOC for
IOCONTEXT_STRATEGY is counted whenever a buffer already in the strategy
ring is reused. And IOOP_WRITE for IOCONTEXT_STRATEGY is counted
whenever the reused dirty buffer is written out.
Stats on IOOps for all IOContexts for a backend are initially
accumulated locally.
Later they are flushed to shared memory and accumulated with those from
all other backends, exited and live. The accumulated stats in shared
memory could be extended in the future with per-backend stats -- useful
for per connection IO statistics and monitoring.
Some BackendTypes do not flush their pending statistics at regular
intervals, so they explicitly call pgstat_flush_io_ops() during the
course of normal operations to flush their backend-local IO Operation
statistics to shared memory in a timely manner.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>, Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
doc/src/sgml/monitoring.sgml | 2 +
src/backend/postmaster/checkpointer.c | 12 +
src/backend/storage/buffer/bufmgr.c | 64 +++-
src/backend/storage/buffer/freelist.c | 23 +-
src/backend/storage/buffer/localbuf.c | 5 +
src/backend/storage/sync/sync.c | 9 +
src/backend/utils/activity/Makefile | 1 +
src/backend/utils/activity/pgstat.c | 31 ++
src/backend/utils/activity/pgstat_bgwriter.c | 7 +-
.../utils/activity/pgstat_checkpointer.c | 7 +-
src/backend/utils/activity/pgstat_io_ops.c | 297 ++++++++++++++++++
src/backend/utils/activity/pgstat_relation.c | 15 +-
src/backend/utils/activity/pgstat_shmem.c | 4 +
src/backend/utils/activity/pgstat_wal.c | 4 +-
src/backend/utils/adt/pgstatfuncs.c | 4 +-
src/include/miscadmin.h | 2 +
src/include/pgstat.h | 103 +++++-
src/include/storage/buf_internals.h | 2 +-
src/include/utils/pgstat_internal.h | 29 ++
19 files changed, 601 insertions(+), 20 deletions(-)
create mode 100644 src/backend/utils/activity/pgstat_io_ops.c
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index a6e7e3b69d..14d97ec92c 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -5360,6 +5360,8 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
the <structname>pg_stat_bgwriter</structname>
view, <literal>archiver</literal> to reset all the counters shown in
the <structname>pg_stat_archiver</structname> view,
+ <literal>io</literal> to reset all the counters shown in the
+ <structname>pg_stat_io</structname> view,
<literal>wal</literal> to reset all the counters shown in the
<structname>pg_stat_wal</structname> view or
<literal>recovery_prefetch</literal> to reset all the counters shown
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 5fc076fc14..bd2e1de7c2 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -1116,6 +1116,18 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
if (!AmBackgroundWriterProcess())
CheckpointerShmem->num_backend_fsync++;
LWLockRelease(CheckpointerCommLock);
+
+ /*
+ * We have no way of knowing if the current IOContext is
+ * IOCONTEXT_SHARED or IOCONTEXT_STRATEGY at this point, so count the
+ * fsync as being in the IOCONTEXT_SHARED IOContext. This is probably
+ * okay, because the number of backend fsyncs doesn't say anything
+ * about the efficacy of the BufferAccessStrategy. And counting both
+ * fsyncs done in IOCONTEXT_SHARED and IOCONTEXT_STRATEGY under
+ * IOCONTEXT_SHARED is likely clearer when investigating the number of
+ * backend fsyncs.
+ */
+ pgstat_count_io_op(IOOP_FSYNC, IOCONTEXT_SHARED);
return false;
}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 8ef0436c52..d4c9bf7c4f 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -482,7 +482,7 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
BlockNumber blockNum,
BufferAccessStrategy strategy,
bool *foundPtr);
-static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+static void FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOContext io_context);
static void FindAndDropRelationBuffers(RelFileLocator rlocator,
ForkNumber forkNum,
BlockNumber nForkBlock,
@@ -823,6 +823,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
BufferDesc *bufHdr;
Block bufBlock;
bool found;
+ IOContext io_context;
bool isExtend;
bool isLocalBuf = SmgrIsTemp(smgr);
@@ -986,10 +987,25 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
*/
Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID)); /* spinlock not needed */
- bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
+ if (isLocalBuf)
+ {
+ bufBlock = LocalBufHdrGetBlock(bufHdr);
+ io_context = IOCONTEXT_LOCAL;
+ }
+ else
+ {
+ bufBlock = BufHdrGetBlock(bufHdr);
+
+ if (strategy != NULL)
+ io_context = IOCONTEXT_STRATEGY;
+ else
+ io_context = IOCONTEXT_SHARED;
+ }
if (isExtend)
{
+
+ pgstat_count_io_op(IOOP_EXTEND, io_context);
/* new buffers are zero-filled */
MemSet((char *) bufBlock, 0, BLCKSZ);
/* don't set checksum for all-zero page */
@@ -1020,6 +1036,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
smgrread(smgr, forkNum, blockNum, (char *) bufBlock);
+ pgstat_count_io_op(IOOP_READ, io_context);
+
if (track_io_timing)
{
INSTR_TIME_SET_CURRENT(io_time);
@@ -1190,6 +1208,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
/* Loop here in case we have to try another victim buffer */
for (;;)
{
+ bool from_ring;
+
/*
* Ensure, while the spinlock's not yet held, that there's a free
* refcount entry.
@@ -1200,7 +1220,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* Select a victim buffer. The buffer is returned with its header
* spinlock still held!
*/
- buf = StrategyGetBuffer(strategy, &buf_state);
+ buf = StrategyGetBuffer(strategy, &buf_state, &from_ring);
Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
@@ -1237,6 +1257,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf),
LW_SHARED))
{
+ IOContext io_context;
+
/*
* If using a nondefault strategy, and writing the buffer
* would require a WAL flush, let the strategy decide whether
@@ -1263,13 +1285,28 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
}
}
+ /*
+ * When a strategy is in use, if the target dirty buffer is an
+ * existing strategy buffer being reused, count this as a
+ * strategy write for the purposes of IO Operations statistics
+ * tracking.
+ *
+ * All dirty shared buffers upon first being added to the ring
+ * will be counted as shared buffer writes.
+ *
+ * When a strategy is not in use, the write can only be a
+ * "regular" write of a dirty shared buffer.
+ */
+
+ io_context = from_ring ? IOCONTEXT_STRATEGY : IOCONTEXT_SHARED;
+
/* OK, do the I/O */
TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(forkNum, blockNum,
smgr->smgr_rlocator.locator.spcOid,
smgr->smgr_rlocator.locator.dbOid,
smgr->smgr_rlocator.locator.relNumber);
- FlushBuffer(buf, NULL);
+ FlushBuffer(buf, NULL, io_context);
LWLockRelease(BufferDescriptorGetContentLock(buf));
ScheduleBufferTagForWriteback(&BackendWritebackContext,
@@ -2573,7 +2610,7 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOCONTEXT_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
@@ -2820,9 +2857,12 @@ BufferGetTag(Buffer buffer, RelFileLocator *rlocator, ForkNumber *forknum,
*
* If the caller has an smgr reference for the buffer's relation, pass it
* as the second parameter. If not, pass NULL.
+ *
+ * IOContext will always be IOCONTEXT_SHARED except when a buffer access strategy is
+ * used and the buffer being flushed is a buffer from the strategy ring.
*/
static void
-FlushBuffer(BufferDesc *buf, SMgrRelation reln)
+FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOContext io_context)
{
XLogRecPtr recptr;
ErrorContextCallback errcallback;
@@ -2902,6 +2942,8 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
*/
bufToWrite = PageSetChecksumCopy((Page) bufBlock, buf->tag.blockNum);
+ pgstat_count_io_op(IOOP_WRITE, io_context);
+
if (track_io_timing)
INSTR_TIME_SET_CURRENT(io_start);
@@ -3549,6 +3591,8 @@ FlushRelationBuffers(Relation rel)
localpage,
false);
+ pgstat_count_io_op(IOOP_WRITE, IOCONTEXT_LOCAL);
+
buf_state &= ~(BM_DIRTY | BM_JUST_DIRTIED);
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
@@ -3584,7 +3628,7 @@ FlushRelationBuffers(Relation rel)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, RelationGetSmgr(rel));
+ FlushBuffer(bufHdr, RelationGetSmgr(rel), IOCONTEXT_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
@@ -3679,7 +3723,7 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, srelent->srel);
+ FlushBuffer(bufHdr, srelent->srel, IOCONTEXT_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
@@ -3883,7 +3927,7 @@ FlushDatabaseBuffers(Oid dbid)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOCONTEXT_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
@@ -3910,7 +3954,7 @@ FlushOneBuffer(Buffer buffer)
Assert(LWLockHeldByMe(BufferDescriptorGetContentLock(bufHdr)));
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOCONTEXT_SHARED);
}
/*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 990e081aae..237a48e8d8 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -15,6 +15,7 @@
*/
#include "postgres.h"
+#include "pgstat.h"
#include "port/atomics.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
@@ -198,13 +199,15 @@ have_free_buffer(void)
* return the buffer with the buffer header spinlock still held.
*/
BufferDesc *
-StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
+StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_ring)
{
BufferDesc *buf;
int bgwprocno;
int trycounter;
uint32 local_buf_state; /* to avoid repeated (de-)referencing */
+ *from_ring = false;
+
/*
* If given a strategy object, see whether it can select a buffer. We
* assume strategy objects don't need buffer_strategy_lock.
@@ -212,8 +215,23 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
if (strategy != NULL)
{
buf = GetBufferFromRing(strategy, buf_state);
- if (buf != NULL)
+ *from_ring = buf != NULL;
+ if (*from_ring)
+ {
+ /*
+ * When a strategy is in use, reused buffers from the strategy
+ * ring will be counted as IOCONTEXT_STRATEGY allocations for the
+ * purposes of IO Operation statistics tracking.
+ *
+ * However, even when a strategy is in use, if a new buffer must
+ * be allocated from shared buffers and added to the ring, this is
+ * counted instead as an IOCONTEXT_SHARED allocation. So, only
+ * reused buffers are counted as being in the IOCONTEXT_STRATEGY
+ * IOContext.
+ */
+ pgstat_count_io_op(IOOP_ALLOC, IOCONTEXT_STRATEGY);
return buf;
+ }
}
/*
@@ -247,6 +265,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
* the rate of buffer consumption. Note that buffers recycled by a
* strategy object are intentionally not counted here.
*/
+ pgstat_count_io_op(IOOP_ALLOC, IOCONTEXT_SHARED);
pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);
/*
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 014f644bf9..a3d76599bf 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -15,6 +15,7 @@
*/
#include "postgres.h"
+#include "pgstat.h"
#include "access/parallel.h"
#include "catalog/catalog.h"
#include "executor/instrument.h"
@@ -196,6 +197,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
LocalRefCount[b]++;
ResourceOwnerRememberBuffer(CurrentResourceOwner,
BufferDescriptorGetBuffer(bufHdr));
+
+ pgstat_count_io_op(IOOP_ALLOC, IOCONTEXT_LOCAL);
break;
}
}
@@ -226,6 +229,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
localpage,
false);
+ pgstat_count_io_op(IOOP_WRITE, IOCONTEXT_LOCAL);
+
/* Mark not-dirty now in case we error out below */
buf_state &= ~BM_DIRTY;
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 9d6a9e9109..f310b7a435 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -432,6 +432,15 @@ ProcessSyncRequests(void)
total_elapsed += elapsed;
processed++;
+ /*
+ * Note that if a backend using a BufferAccessStrategy is
+ * forced to do its own fsync (as opposed to the
+ * checkpointer doing it), it will not be counted as an
+ * IOCONTEXT_STRATEGY IOOP_FSYNC and instead will be
+ * counted as an IOCONTEXT_SHARED IOOP_FSYNC.
+ */
+ pgstat_count_io_op(IOOP_FSYNC, IOCONTEXT_SHARED);
+
if (log_checkpoints)
elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f ms",
processed,
diff --git a/src/backend/utils/activity/Makefile b/src/backend/utils/activity/Makefile
index a2e8507fd6..0098785089 100644
--- a/src/backend/utils/activity/Makefile
+++ b/src/backend/utils/activity/Makefile
@@ -22,6 +22,7 @@ OBJS = \
pgstat_checkpointer.o \
pgstat_database.o \
pgstat_function.o \
+ pgstat_io_ops.o \
pgstat_relation.o \
pgstat_replslot.o \
pgstat_shmem.o \
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 88e5dd1b2b..c30954d90a 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -359,6 +359,15 @@ static const PgStat_KindInfo pgstat_kind_infos[PGSTAT_NUM_KINDS] = {
.snapshot_cb = pgstat_checkpointer_snapshot_cb,
},
+ [PGSTAT_KIND_IOOPS] = {
+ .name = "io_ops",
+
+ .fixed_amount = true,
+
+ .reset_all_cb = pgstat_io_ops_reset_all_cb,
+ .snapshot_cb = pgstat_io_ops_snapshot_cb,
+ },
+
[PGSTAT_KIND_SLRU] = {
.name = "slru",
@@ -628,6 +637,9 @@ pgstat_report_stat(bool force)
/* flush database / relation / function / ... stats */
partial_flush |= pgstat_flush_pending_entries(nowait);
+ /* flush IO Operations stats */
+ partial_flush |= pgstat_flush_io_ops(nowait);
+
/* flush wal stats */
partial_flush |= pgstat_flush_wal(nowait);
@@ -1312,6 +1324,14 @@ pgstat_write_statsfile(void)
pgstat_build_snapshot_fixed(PGSTAT_KIND_CHECKPOINTER);
write_chunk_s(fpout, &pgStatLocal.snapshot.checkpointer);
+ /*
+ * Write IO Operations stats struct
+ */
+ pgstat_build_snapshot_fixed(PGSTAT_KIND_IOOPS);
+ write_chunk_s(fpout, &pgStatLocal.snapshot.io_ops.stat_reset_timestamp);
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ write_chunk_s(fpout, &pgStatLocal.snapshot.io_ops.stats[i]);
+
/*
* Write SLRU stats struct
*/
@@ -1486,6 +1506,17 @@ pgstat_read_statsfile(void)
if (!read_chunk_s(fpin, &shmem->checkpointer.stats))
goto error;
+ /*
+ * Read IO Operations stats struct
+ */
+ if (!read_chunk_s(fpin, &shmem->io_ops.stat_reset_timestamp))
+ goto error;
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ if (!read_chunk_s(fpin, &shmem->io_ops.stats[i].data))
+ goto error;
+ }
+
/*
* Read SLRU stats struct
*/
diff --git a/src/backend/utils/activity/pgstat_bgwriter.c b/src/backend/utils/activity/pgstat_bgwriter.c
index fbb1edc527..3d7f90a1b7 100644
--- a/src/backend/utils/activity/pgstat_bgwriter.c
+++ b/src/backend/utils/activity/pgstat_bgwriter.c
@@ -24,7 +24,7 @@ PgStat_BgWriterStats PendingBgWriterStats = {0};
/*
- * Report bgwriter statistics
+ * Report bgwriter and IO Operation statistics
*/
void
pgstat_report_bgwriter(void)
@@ -56,6 +56,11 @@ pgstat_report_bgwriter(void)
* Clear out the statistics buffer, so it can be re-used.
*/
MemSet(&PendingBgWriterStats, 0, sizeof(PendingBgWriterStats));
+
+ /*
+ * Report IO Operations statistics
+ */
+ pgstat_flush_io_ops(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_checkpointer.c b/src/backend/utils/activity/pgstat_checkpointer.c
index af8d513e7b..cfcf127210 100644
--- a/src/backend/utils/activity/pgstat_checkpointer.c
+++ b/src/backend/utils/activity/pgstat_checkpointer.c
@@ -24,7 +24,7 @@ PgStat_CheckpointerStats PendingCheckpointerStats = {0};
/*
- * Report checkpointer statistics
+ * Report checkpointer and IO Operation statistics
*/
void
pgstat_report_checkpointer(void)
@@ -62,6 +62,11 @@ pgstat_report_checkpointer(void)
* Clear out the statistics buffer, so it can be re-used.
*/
MemSet(&PendingCheckpointerStats, 0, sizeof(PendingCheckpointerStats));
+
+ /*
+ * Report IO Operation statistics
+ */
+ pgstat_flush_io_ops(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_io_ops.c b/src/backend/utils/activity/pgstat_io_ops.c
new file mode 100644
index 0000000000..72d02f4dda
--- /dev/null
+++ b/src/backend/utils/activity/pgstat_io_ops.c
@@ -0,0 +1,297 @@
+/* -------------------------------------------------------------------------
+ *
+ * pgstat_io_ops.c
+ * Implementation of IO operation statistics.
+ *
+ * This file contains the implementation of IO operation statistics. It is kept
+ * separate from pgstat.c to enforce the line between the statistics access /
+ * storage implementation and the details about individual types of
+ * statistics.
+ *
+ * Copyright (c) 2001-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/utils/activity/pgstat_io_ops.c
+ * -------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "utils/pgstat_internal.h"
+
+static PgStat_IOContextOps pending_IOOpStats;
+bool have_ioopstats = false;
+
+void
+pgstat_count_io_op(IOOp io_op, IOContext io_context)
+{
+ PgStat_IOOpCounters *pending_counters = &pending_IOOpStats.data[io_context];
+
+ Assert(pgstat_expect_io_op(MyBackendType, io_context, io_op));
+
+ switch (io_op)
+ {
+ case IOOP_ALLOC:
+ pending_counters->allocs++;
+ break;
+ case IOOP_EXTEND:
+ pending_counters->extends++;
+ break;
+ case IOOP_FSYNC:
+ pending_counters->fsyncs++;
+ break;
+ case IOOP_READ:
+ pending_counters->reads++;
+ break;
+ case IOOP_WRITE:
+ pending_counters->writes++;
+ break;
+ }
+
+ have_ioopstats = true;
+}
+
+PgStat_BackendIOContextOps *
+pgstat_fetch_backend_io_context_ops(void)
+{
+ pgstat_snapshot_fixed(PGSTAT_KIND_IOOPS);
+
+ return &pgStatLocal.snapshot.io_ops;
+}
+
+/*
+ * Flush out locally pending IO Operation statistics entries
+ *
+ * If no stats have been recorded, this function returns false.
+ *
+ * If nowait is true, this function returns true if the lock could not be
+ * acquired. Otherwise return false.
+ */
+bool
+pgstat_flush_io_ops(bool nowait)
+{
+ PgStatShared_IOContextOps *type_shstats;
+
+ if (!have_ioopstats)
+ return false;
+
+ type_shstats =
+ &pgStatLocal.shmem->io_ops.stats[MyBackendType];
+
+ if (!nowait)
+ LWLockAcquire(&type_shstats->lock, LW_EXCLUSIVE);
+ else if (!LWLockConditionalAcquire(&type_shstats->lock, LW_EXCLUSIVE))
+ return true;
+
+ for (int i = 0; i < IOCONTEXT_NUM_TYPES; i++)
+ {
+ PgStat_IOOpCounters *sharedent = &type_shstats->data[i];
+ PgStat_IOOpCounters *pendingent = &pending_IOOpStats.data[i];
+
+#define IO_OP_ACC(fld) sharedent->fld += pendingent->fld
+ IO_OP_ACC(allocs);
+ IO_OP_ACC(extends);
+ IO_OP_ACC(fsyncs);
+ IO_OP_ACC(reads);
+ IO_OP_ACC(writes);
+#undef IO_OP_ACC
+ }
+
+ LWLockRelease(&type_shstats->lock);
+
+ memset(&pending_IOOpStats, 0, sizeof(pending_IOOpStats));
+
+ have_ioopstats = false;
+
+ return false;
+}
+
+const char *
+pgstat_io_context_desc(IOContext io_context)
+{
+ switch (io_context)
+ {
+ case IOCONTEXT_LOCAL:
+ return "Local";
+ case IOCONTEXT_SHARED:
+ return "Shared";
+ case IOCONTEXT_STRATEGY:
+ return "Strategy";
+ }
+
+ elog(ERROR, "unrecognized IOContext value: %d", io_context);
+}
+
+const char *
+pgstat_io_op_desc(IOOp io_op)
+{
+ switch (io_op)
+ {
+ case IOOP_ALLOC:
+ return "Alloc";
+ case IOOP_EXTEND:
+ return "Extend";
+ case IOOP_FSYNC:
+ return "Fsync";
+ case IOOP_READ:
+ return "Read";
+ case IOOP_WRITE:
+ return "Write";
+ }
+
+ elog(ERROR, "unrecognized IOOp value: %d", io_op);
+}
+
+void
+pgstat_io_ops_reset_all_cb(TimestampTz ts)
+{
+ PgStatShared_BackendIOContextOps *backends_stats_shmem = &pgStatLocal.shmem->io_ops;
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ PgStatShared_IOContextOps *stats_shmem = &backends_stats_shmem->stats[i];
+
+ LWLockAcquire(&stats_shmem->lock, LW_EXCLUSIVE);
+
+ /*
+ * Use the lock in the first BackendType's PgStat_IOContextOps to
+ * protect the reset timestamp as well.
+ */
+ if (i == 0)
+ backends_stats_shmem->stat_reset_timestamp = ts;
+
+ memset(stats_shmem->data, 0, sizeof(stats_shmem->data));
+ LWLockRelease(&stats_shmem->lock);
+ }
+}
+
+void
+pgstat_io_ops_snapshot_cb(void)
+{
+ PgStatShared_BackendIOContextOps *backends_stats_shmem = &pgStatLocal.shmem->io_ops;
+ PgStat_BackendIOContextOps *backends_stats_snap = &pgStatLocal.snapshot.io_ops;
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ PgStatShared_IOContextOps *stats_shmem = &backends_stats_shmem->stats[i];
+ PgStat_IOContextOps *stats_snap = &backends_stats_snap->stats[i];
+
+ LWLockAcquire(&stats_shmem->lock, LW_SHARED);
+
+ /*
+ * Use the lock in the first BackendType's PgStat_IOContextOps to
+ * protect the reset timestamp as well.
+ */
+ if (i == 0)
+ backends_stats_snap->stat_reset_timestamp = backends_stats_shmem->stat_reset_timestamp;
+
+ memcpy(stats_snap->data, stats_shmem->data, sizeof(stats_shmem->data));
+ LWLockRelease(&stats_shmem->lock);
+ }
+
+}
+
+/*
+* IO Operation statistics are not collected for all BackendTypes.
+*
+* The following BackendTypes do not participate in the cumulative stats
+* subsystem or do not do IO operations worth reporting statistics on:
+* - Syslogger because it is not connected to shared memory
+* - Archiver because most relevant archiving IO is delegated to a
+* specialized command or module
+* - WAL Receiver and WAL Writer IO is not tracked in pg_stat_io for now
+*
+* Function returns true if BackendType participates in the cumulative stats
+* subsystem for IO Operations and false if it does not.
+*/
+bool
+pgstat_io_op_stats_collected(BackendType bktype)
+{
+ return bktype != B_INVALID && bktype != B_ARCHIVER && bktype != B_LOGGER &&
+ bktype != B_WAL_RECEIVER && bktype != B_WAL_WRITER;
+}
+
+bool
+pgstat_bktype_io_context_valid(BackendType bktype, IOContext io_context)
+{
+ bool no_strategy;
+ bool no_local;
+
+ /*
+ * Not all BackendTypes will use a BufferAccessStrategy.
+ */
+ no_strategy = bktype == B_AUTOVAC_LAUNCHER || bktype ==
+ B_BG_WRITER || bktype == B_CHECKPOINTER;
+
+ /*
+ * Only regular backends and WAL Sender processes executing queries should
+ * use local buffers.
+ */
+ no_local = bktype == B_AUTOVAC_LAUNCHER || bktype ==
+ B_BG_WRITER || bktype == B_CHECKPOINTER || bktype ==
+ B_AUTOVAC_WORKER || bktype == B_BG_WORKER || bktype ==
+ B_STANDALONE_BACKEND || bktype == B_STARTUP;
+
+ if (io_context == IOCONTEXT_STRATEGY && no_strategy)
+ return false;
+
+ if (io_context == IOCONTEXT_LOCAL && no_local)
+ return false;
+
+ return true;
+}
+
+bool
+pgstat_bktype_io_op_valid(BackendType bktype, IOOp io_op)
+{
+ if ((bktype == B_BG_WRITER || bktype == B_CHECKPOINTER) && io_op ==
+ IOOP_READ)
+ return false;
+
+ if ((bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER || bktype ==
+ B_CHECKPOINTER) && io_op == IOOP_EXTEND)
+ return false;
+
+ return true;
+}
+
+bool
+pgstat_io_context_io_op_valid(IOContext io_context, IOOp io_op)
+{
+ /*
+ * Temporary tables using local buffers are not logged and thus do not
+ * require fsync'ing. Set this cell to NULL to differentiate between an
+ * invalid combination and 0 observed IO Operations.
+ *
+ * IOOP_FSYNC IOOps done by a backend using a BufferAccessStrategy are
+ * counted in the IOCONTEXT_SHARED IOContext. See comment in
+ * ForwardSyncRequest() for more details.
+ */
+ if ((io_context == IOCONTEXT_LOCAL || io_context == IOCONTEXT_STRATEGY) &&
+ io_op == IOOP_FSYNC)
+ return false;
+
+ return true;
+}
+
+bool
+pgstat_expect_io_op(BackendType bktype, IOContext io_context, IOOp io_op)
+{
+ if (!pgstat_io_op_stats_collected(bktype))
+ return false;
+
+ if (!pgstat_bktype_io_context_valid(bktype, io_context))
+ return false;
+
+ if (!pgstat_bktype_io_op_valid(bktype, io_op))
+ return false;
+
+ if (!pgstat_io_context_io_op_valid(io_context, io_op))
+ return false;
+
+ /*
+ * There are currently no cases of a BackendType, IOContext, IOOp
+ * combination that are specifically invalid.
+ */
+ return true;
+}
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index a846d9ffb6..7a2fd1ccf9 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -205,7 +205,7 @@ pgstat_drop_relation(Relation rel)
}
/*
- * Report that the table was just vacuumed.
+ * Report that the table was just vacuumed and flush IO Operation statistics.
*/
void
pgstat_report_vacuum(Oid tableoid, bool shared,
@@ -257,10 +257,18 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
}
pgstat_unlock_entry(entry_ref);
+
+ /*
+ * Flush IO Operations statistics now. pgstat_report_stat() will flush IO
+ * Operation stats, however this will not be called after an entire
+ * autovacuum cycle is done -- which will likely vacuum many relations --
+ * or until the VACUUM command has processed all tables and committed.
+ */
+ pgstat_flush_io_ops(false);
}
/*
- * Report that the table was just analyzed.
+ * Report that the table was just analyzed and flush IO Operation statistics.
*
* Caller must provide new live- and dead-tuples estimates, as well as a
* flag indicating whether to reset the changes_since_analyze counter.
@@ -340,6 +348,9 @@ pgstat_report_analyze(Relation rel,
}
pgstat_unlock_entry(entry_ref);
+
+ /* see pgstat_report_vacuum() */
+ pgstat_flush_io_ops(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_shmem.c b/src/backend/utils/activity/pgstat_shmem.c
index 89060ef29a..2acfeb3192 100644
--- a/src/backend/utils/activity/pgstat_shmem.c
+++ b/src/backend/utils/activity/pgstat_shmem.c
@@ -202,6 +202,10 @@ StatsShmemInit(void)
LWLockInitialize(&ctl->checkpointer.lock, LWTRANCHE_PGSTATS_DATA);
LWLockInitialize(&ctl->slru.lock, LWTRANCHE_PGSTATS_DATA);
LWLockInitialize(&ctl->wal.lock, LWTRANCHE_PGSTATS_DATA);
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ LWLockInitialize(&ctl->io_ops.stats[i].lock,
+ LWTRANCHE_PGSTATS_DATA);
}
else
{
diff --git a/src/backend/utils/activity/pgstat_wal.c b/src/backend/utils/activity/pgstat_wal.c
index 5a878bd115..9cac407b42 100644
--- a/src/backend/utils/activity/pgstat_wal.c
+++ b/src/backend/utils/activity/pgstat_wal.c
@@ -34,7 +34,7 @@ static WalUsage prevWalUsage;
/*
* Calculate how much WAL usage counters have increased and update
- * shared statistics.
+ * shared WAL and IO Operation statistics.
*
* Must be called by processes that generate WAL, that do not call
* pgstat_report_stat(), like walwriter.
@@ -43,6 +43,8 @@ void
pgstat_report_wal(bool force)
{
pgstat_flush_wal(force);
+
+ pgstat_flush_io_ops(force);
}
/*
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index d9e2a79382..cda4447e53 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -2092,6 +2092,8 @@ pg_stat_reset_shared(PG_FUNCTION_ARGS)
pgstat_reset_of_kind(PGSTAT_KIND_BGWRITER);
pgstat_reset_of_kind(PGSTAT_KIND_CHECKPOINTER);
}
+ else if (strcmp(target, "io") == 0)
+ pgstat_reset_of_kind(PGSTAT_KIND_IOOPS);
else if (strcmp(target, "recovery_prefetch") == 0)
XLogPrefetchResetStats();
else if (strcmp(target, "wal") == 0)
@@ -2100,7 +2102,7 @@ pg_stat_reset_shared(PG_FUNCTION_ARGS)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("unrecognized reset target: \"%s\"", target),
- errhint("Target must be \"archiver\", \"bgwriter\", \"recovery_prefetch\", or \"wal\".")));
+ errhint("Target must be \"archiver\", \"io\", \"bgwriter\", \"recovery_prefetch\", or \"wal\".")));
PG_RETURN_VOID();
}
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 7c41b27994..f65e9635a3 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -331,6 +331,8 @@ typedef enum BackendType
B_WAL_WRITER,
} BackendType;
+#define BACKEND_NUM_TYPES (B_WAL_WRITER + 1)
+
extern PGDLLIMPORT BackendType MyBackendType;
extern const char *GetBackendTypeDesc(BackendType backendType);
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index ac28f813b4..83b416c59d 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -48,6 +48,7 @@ typedef enum PgStat_Kind
PGSTAT_KIND_ARCHIVER,
PGSTAT_KIND_BGWRITER,
PGSTAT_KIND_CHECKPOINTER,
+ PGSTAT_KIND_IOOPS,
PGSTAT_KIND_SLRU,
PGSTAT_KIND_WAL,
} PgStat_Kind;
@@ -242,7 +243,7 @@ typedef struct PgStat_TableXactStatus
* ------------------------------------------------------------
*/
-#define PGSTAT_FILE_FORMAT_ID 0x01A5BCA7
+#define PGSTAT_FILE_FORMAT_ID 0x01A5BCA8
typedef struct PgStat_ArchiverStats
{
@@ -276,6 +277,50 @@ typedef struct PgStat_CheckpointerStats
PgStat_Counter buf_fsync_backend;
} PgStat_CheckpointerStats;
+/*
+ * Types related to counting IO Operations for various IO Contexts
+ */
+
+typedef enum IOOp
+{
+ IOOP_ALLOC,
+ IOOP_EXTEND,
+ IOOP_FSYNC,
+ IOOP_READ,
+ IOOP_WRITE,
+} IOOp;
+
+#define IOOP_NUM_TYPES (IOOP_WRITE + 1)
+
+typedef enum IOContext
+{
+ IOCONTEXT_LOCAL,
+ IOCONTEXT_SHARED,
+ IOCONTEXT_STRATEGY,
+} IOContext;
+
+#define IOCONTEXT_NUM_TYPES (IOCONTEXT_STRATEGY + 1)
+
+typedef struct PgStat_IOOpCounters
+{
+ PgStat_Counter allocs;
+ PgStat_Counter extends;
+ PgStat_Counter fsyncs;
+ PgStat_Counter reads;
+ PgStat_Counter writes;
+} PgStat_IOOpCounters;
+
+typedef struct PgStat_IOContextOps
+{
+ PgStat_IOOpCounters data[IOCONTEXT_NUM_TYPES];
+} PgStat_IOContextOps;
+
+typedef struct PgStat_BackendIOContextOps
+{
+ TimestampTz stat_reset_timestamp;
+ PgStat_IOContextOps stats[BACKEND_NUM_TYPES];
+} PgStat_BackendIOContextOps;
+
typedef struct PgStat_StatDBEntry
{
PgStat_Counter n_xact_commit;
@@ -453,6 +498,62 @@ extern void pgstat_report_checkpointer(void);
extern PgStat_CheckpointerStats *pgstat_fetch_stat_checkpointer(void);
+/*
+ * Functions in pgstat_io_ops.c
+ */
+
+extern void pgstat_count_io_op(IOOp io_op, IOContext io_context);
+extern PgStat_BackendIOContextOps *pgstat_fetch_backend_io_context_ops(void);
+extern bool pgstat_flush_io_ops(bool nowait);
+extern const char *pgstat_io_context_desc(IOContext io_context);
+extern const char *pgstat_io_op_desc(IOOp io_op);
+
+/* Validation functions in pgstat_io_ops.c */
+extern bool pgstat_io_op_stats_collected(BackendType bktype);
+extern bool pgstat_bktype_io_context_valid(BackendType bktype, IOContext io_context);
+extern bool pgstat_bktype_io_op_valid(BackendType bktype, IOOp io_op);
+extern bool pgstat_io_context_io_op_valid(IOContext io_context, IOOp io_op);
+extern bool pgstat_expect_io_op(BackendType bktype, IOContext io_context, IOOp io_op);
+
+/*
+ * Functions to assert that invalid IO Operation counters are zero. Used with
+ * the validation functions in pgstat_io_ops.c
+ */
+static inline void
+pgstat_io_context_ops_assert_zero(PgStat_IOOpCounters *counters)
+{
+ Assert(counters->allocs == 0 && counters->extends == 0 &&
+ counters->fsyncs == 0 && counters->reads == 0 &&
+ counters->writes == 0);
+}
+
+static inline void
+pgstat_io_op_assert_zero(PgStat_IOOpCounters *counters, IOOp io_op)
+{
+ switch (io_op)
+ {
+ case IOOP_ALLOC:
+ Assert(counters->allocs == 0);
+ return;
+ case IOOP_EXTEND:
+ Assert(counters->extends == 0);
+ return;
+ case IOOP_FSYNC:
+ Assert(counters->fsyncs == 0);
+ return;
+ case IOOP_READ:
+ Assert(counters->reads == 0);
+ return;
+ case IOOP_WRITE:
+ Assert(counters->writes == 0);
+ return;
+ }
+
+ elog(ERROR, "unrecognized IOOp value: %d", io_op);
+}
+
/*
* Functions in pgstat_database.c
*/
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 72466551d7..aa064173ee 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -346,7 +346,7 @@ extern void ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *
/* freelist.c */
extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
- uint32 *buf_state);
+ uint32 *buf_state, bool *from_ring);
extern void StrategyFreeBuffer(BufferDesc *buf);
extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
BufferDesc *buf);
diff --git a/src/include/utils/pgstat_internal.h b/src/include/utils/pgstat_internal.h
index 9303d05427..26e8ec2331 100644
--- a/src/include/utils/pgstat_internal.h
+++ b/src/include/utils/pgstat_internal.h
@@ -329,6 +329,24 @@ typedef struct PgStatShared_Checkpointer
PgStat_CheckpointerStats reset_offset;
} PgStatShared_Checkpointer;
+typedef struct PgStatShared_IOContextOps
+{
+ /*
+ * lock protects ->data; the lock in ->stats[0] additionally protects
+ * PgStatShared_BackendIOContextOps->stat_reset_timestamp.
+ */
+ LWLock lock;
+ PgStat_IOOpCounters data[IOCONTEXT_NUM_TYPES];
+} PgStatShared_IOContextOps;
+
+typedef struct PgStatShared_BackendIOContextOps
+{
+ /* ->stat_reset_timestamp is protected by ->stats[0].lock */
+ TimestampTz stat_reset_timestamp;
+ PgStatShared_IOContextOps stats[BACKEND_NUM_TYPES];
+} PgStatShared_BackendIOContextOps;
+
+
typedef struct PgStatShared_SLRU
{
/* lock protects ->stats */
@@ -419,6 +437,7 @@ typedef struct PgStat_ShmemControl
PgStatShared_Archiver archiver;
PgStatShared_BgWriter bgwriter;
PgStatShared_Checkpointer checkpointer;
+ PgStatShared_BackendIOContextOps io_ops;
PgStatShared_SLRU slru;
PgStatShared_Wal wal;
} PgStat_ShmemControl;
@@ -442,6 +461,8 @@ typedef struct PgStat_Snapshot
PgStat_CheckpointerStats checkpointer;
+ PgStat_BackendIOContextOps io_ops;
+
PgStat_SLRUStats slru[SLRU_NUM_ELEMENTS];
PgStat_WalStats wal;
@@ -549,6 +570,14 @@ extern void pgstat_database_reset_timestamp_cb(PgStatShared_Common *header, Time
extern bool pgstat_function_flush_cb(PgStat_EntryRef *entry_ref, bool nowait);
+/*
+ * Functions in pgstat_io_ops.c
+ */
+
+extern void pgstat_io_ops_reset_all_cb(TimestampTz ts);
+extern void pgstat_io_ops_snapshot_cb(void);
+
+
/*
* Functions in pgstat_relation.c
*/
--
2.34.1
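The counter layout the pgstat.h hunk above introduces (per-IOOp counters, grouped per IOContext, with one such group per BackendType) can be exercised as a tiny standalone C sketch. Everything below is illustrative: the backend-type list is cut down, and where the patch uses named counter fields (allocs, extends, ...) this sketch flattens them into an array indexed by IOOp, along the lines of the "counter arrays indexed by backend types" idea raised at the start of the thread.

```c
/*
 * Standalone sketch of the IO-op counter matrix from the patch above.
 * The BackendType list here is a hypothetical cut-down version, and the
 * per-op array replaces the patch's named fields (allocs, extends, ...).
 */
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef int64_t PgStat_Counter;

typedef enum IOOp { IOOP_ALLOC, IOOP_EXTEND, IOOP_FSYNC, IOOP_READ, IOOP_WRITE } IOOp;
#define IOOP_NUM_TYPES (IOOP_WRITE + 1)

typedef enum IOContext { IOCONTEXT_LOCAL, IOCONTEXT_SHARED, IOCONTEXT_STRATEGY } IOContext;
#define IOCONTEXT_NUM_TYPES (IOCONTEXT_STRATEGY + 1)

/* Cut-down backend type list, just enough for the sketch */
typedef enum BackendType { B_BACKEND, B_BG_WRITER, B_CHECKPOINTER } BackendType;
#define BACKEND_NUM_TYPES (B_CHECKPOINTER + 1)

typedef struct PgStat_IOOpCounters
{
	PgStat_Counter counts[IOOP_NUM_TYPES];	/* one slot per IOOp */
} PgStat_IOOpCounters;

typedef struct PgStat_IOContextOps
{
	PgStat_IOOpCounters data[IOCONTEXT_NUM_TYPES];
} PgStat_IOContextOps;

/* In the real patch this lives in backend-local pending stats */
static PgStat_IOContextOps pending_stats[BACKEND_NUM_TYPES];

/* Analogous to pgstat_count_io_op(): bump one cell of the matrix */
static void
count_io_op(BackendType bktype, IOContext io_context, IOOp io_op)
{
	pending_stats[bktype].data[io_context].counts[io_op]++;
}

static PgStat_Counter
get_io_op(BackendType bktype, IOContext io_context, IOOp io_op)
{
	return pending_stats[bktype].data[io_context].counts[io_op];
}
```

The point of the matrix shape is that flushing to shared memory and rendering the view both become simple nested loops over the three enums, with no per-counter special cases.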
Attachment: v27-0004-Add-system-view-tracking-IO-ops-per-backend-type.patch
From 17036e78a92a75da83ea7811fd738d490fc6c65e Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Thu, 11 Aug 2022 18:28:50 -0400
Subject: [PATCH v27 4/4] Add system view tracking IO ops per backend type
Add pg_stat_io, a system view which tracks the number of IOOps (allocs,
extends, fsyncs, reads, and writes) done through each IOContext (shared
buffers, local buffers, strategy buffers) by each type of backend (e.g.
client backend, checkpointer).
Some BackendTypes do not accumulate IO operations statistics and will
not be included in the view.
Some IOContexts are not used by some BackendTypes and will not be in the
view. For example, checkpointer does not use a BufferAccessStrategy
(currently), so there will be no row for the "strategy" IOContext for
checkpointer.
Some IOOps are invalid in combination with certain IOContexts. Those
cells will be NULL in the view to distinguish between 0 observed IOOps
of that type and an invalid combination. For example, local buffers are
not fsync'd, so cells for all BackendTypes for IOCONTEXT_LOCAL and
IOOP_FSYNC will be NULL.
Some BackendTypes never perform certain IOOps. Those cells will also be
NULL in the view. For example, bgwriter should not perform reads.
View stats are fetched from statistics incremented when a backend
performs an IO Operation and maintained by the cumulative statistics
subsystem.
Each row of the view is stats for a particular BackendType for a
particular IOContext (e.g. shared buffer accesses by checkpointer) and
each column in the view is the total number of IO Operations done (e.g.
writes).
So a cell in the view would be, for example, the number of shared
buffers written by checkpointer since the last stats reset.
Note that some of the cells in the view are redundant with fields in
pg_stat_bgwriter (e.g. buffers_backend), however these have been kept in
pg_stat_bgwriter for backwards compatibility. Deriving the redundant
pg_stat_bgwriter stats from the IO operations stats structures was also
problematic due to the separate reset targets for 'bgwriter' and
'io'.
Suggested by Andres Freund
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>, Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
doc/src/sgml/monitoring.sgml | 115 ++++++++++++++-
src/backend/catalog/system_views.sql | 12 ++
src/backend/utils/adt/pgstatfuncs.c | 102 ++++++++++++++
src/include/catalog/pg_proc.dat | 9 ++
src/test/regress/expected/rules.out | 9 ++
src/test/regress/expected/stats.out | 201 +++++++++++++++++++++++++++
src/test/regress/sql/stats.sql | 103 ++++++++++++++
7 files changed, 550 insertions(+), 1 deletion(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 14d97ec92c..98750121c5 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -448,6 +448,15 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</entry>
</row>
+ <row>
+ <entry><structname>pg_stat_io</structname><indexterm><primary>pg_stat_io</primary></indexterm></entry>
+ <entry>A row for each IO Context for each backend type showing
+ statistics about backend IO operations. See
+ <link linkend="monitoring-pg-stat-io-view">
+ <structname>pg_stat_io</structname></link> for details.
+ </entry>
+ </row>
+
<row>
<entry><structname>pg_stat_wal</structname><indexterm><primary>pg_stat_wal</primary></indexterm></entry>
<entry>One row only, showing statistics about WAL activity. See
@@ -3600,7 +3609,111 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
<structfield>stats_reset</structfield> <type>timestamp with time zone</type>
</para>
<para>
- Time at which these statistics were last reset
+ Time at which these statistics were last reset.
+ </para></entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ </sect2>
+
+ <sect2 id="monitoring-pg-stat-io-view">
+ <title><structname>pg_stat_io</structname></title>
+
+ <indexterm>
+ <primary>pg_stat_io</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_stat_io</structname> view has one row for each
+ combination of backend type and IO Context, containing cluster-wide
+ statistics for IO operations done by that backend type in that context.
+ </para>
+
+ <table id="pg-stat-io-view" xreflabel="pg_stat_io">
+ <title><structname>pg_stat_io</structname> View</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>backend_type</structfield> <type>text</type>
+ </para>
+ <para>
+ Type of backend (e.g. background worker, autovacuum worker).
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>io_context</structfield> <type>text</type>
+ </para>
+ <para>
+ IO Context used (e.g. shared buffers, local buffers, strategy).
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>alloc</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers allocated.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>extend</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of blocks extended.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>fsync</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of fsync calls issued.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>read</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of blocks read.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>write</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of blocks written.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
+ </para>
+ <para>
+ Time at which these statistics were last reset.
</para></entry>
</row>
</tbody>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index f369b1fc14..5fab964219 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1115,6 +1115,18 @@ CREATE VIEW pg_stat_bgwriter AS
pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+CREATE VIEW pg_stat_io AS
+SELECT
+ b.backend_type,
+ b.io_context,
+ b.alloc,
+ b.extend,
+ b.fsync,
+ b.read,
+ b.write,
+ b.stats_reset
+FROM pg_stat_get_io() b;
+
CREATE VIEW pg_stat_wal AS
SELECT
w.wal_records,
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index cda4447e53..821216d01e 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1733,6 +1733,108 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
}
+Datum
+pg_stat_get_io(PG_FUNCTION_ARGS)
+{
+ PgStat_BackendIOContextOps *backends_io_stats;
+ ReturnSetInfo *rsinfo;
+ Datum reset_time;
+
+ /*
+ * When adding a new column to the pg_stat_io view, add a new enum value
+ * here above IO_NUM_COLUMNS.
+ */
+ enum
+ {
+ IO_COLUMN_BACKEND_TYPE,
+ IO_COLUMN_IO_CONTEXT,
+ IO_COLUMN_ALLOCS,
+ IO_COLUMN_EXTENDS,
+ IO_COLUMN_FSYNCS,
+ IO_COLUMN_READS,
+ IO_COLUMN_WRITES,
+ IO_COLUMN_RESET_TIME,
+ IO_NUM_COLUMNS,
+ };
+
+#define IO_COLUMN_IOOP_OFFSET (IO_COLUMN_IO_CONTEXT + 1)
+
+ SetSingleFuncCall(fcinfo, 0);
+ rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+
+ backends_io_stats = pgstat_fetch_backend_io_context_ops();
+
+ reset_time = TimestampTzGetDatum(backends_io_stats->stat_reset_timestamp);
+
+ for (int bktype = 0; bktype < BACKEND_NUM_TYPES; bktype++)
+ {
+ Datum bktype_desc = CStringGetTextDatum(GetBackendTypeDesc(bktype));
+ PgStat_IOContextOps *io_context_ops = &backends_io_stats->stats[bktype];
+
+ /*
+ * For those BackendTypes without IO Operation stats, skip
+ * representing them in the view altogether.
+ */
+ if (!pgstat_io_op_stats_collected(bktype))
+ {
+ for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ pgstat_io_context_ops_assert_zero(&io_context_ops->data[io_context]);
+ continue;
+ }
+
+ for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ {
+ PgStat_IOOpCounters *counters = &io_context_ops->data[io_context];
+ Datum values[IO_NUM_COLUMNS];
+ bool nulls[IO_NUM_COLUMNS];
+
+ /*
+ * Some combinations of IOCONTEXT and BackendType are not valid
+ * for any type of IO Operation. In such cases, omit the entire
+ * row from the view.
+ */
+ if (!pgstat_bktype_io_context_valid(bktype, io_context))
+ {
+ pgstat_io_context_ops_assert_zero(counters);
+ continue;
+ }
+
+ memset(values, 0, sizeof(values));
+ memset(nulls, 0, sizeof(nulls));
+
+ values[IO_COLUMN_BACKEND_TYPE] = bktype_desc;
+ values[IO_COLUMN_IO_CONTEXT] = CStringGetTextDatum(
+ pgstat_io_context_desc(io_context));
+ values[IO_COLUMN_ALLOCS] = Int64GetDatum(counters->allocs);
+ values[IO_COLUMN_EXTENDS] = Int64GetDatum(counters->extends);
+ values[IO_COLUMN_FSYNCS] = Int64GetDatum(counters->fsyncs);
+ values[IO_COLUMN_READS] = Int64GetDatum(counters->reads);
+ values[IO_COLUMN_WRITES] = Int64GetDatum(counters->writes);
+ values[IO_COLUMN_RESET_TIME] = TimestampTzGetDatum(reset_time);
+
+ /*
+ * Some combinations of BackendType and IOOp and of IOContext and
+ * IOOp are not valid. Set these cells in the view NULL and assert
+ * that these stats are zero as expected.
+ */
+ for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
+ {
+ if (!pgstat_bktype_io_op_valid(bktype, io_op) ||
+ !pgstat_io_context_io_op_valid(io_context, io_op))
+ {
+ pgstat_io_op_assert_zero(counters, io_op);
+ nulls[io_op + IO_COLUMN_IOOP_OFFSET] = true;
+ }
+ }
+
+ tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
+ }
+ }
+
+ return (Datum) 0;
+}
+
/*
* Returns statistics of WAL activity
*/
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index be47583122..4aefebc7f8 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5646,6 +5646,15 @@
proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
+{ oid => '8459', descr => 'statistics: per backend type IO statistics',
+ proname => 'pg_stat_get_io', provolatile => 'v',
+ prorows => '14', proretset => 't',
+ proparallel => 'r', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,text,int8,int8,int8,int8,int8,timestamptz}',
+ proargmodes => '{o,o,o,o,o,o,o,o}',
+ proargnames => '{backend_type,io_context,alloc,extend,fsync,read,write,stats_reset}',
+ prosrc => 'pg_stat_get_io' },
+
{ oid => '1136', descr => 'statistics: information about WAL activity',
proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
proparallel => 'r', prorettype => 'record', proargtypes => '',
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 7ec3d2688f..d122c36556 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1873,6 +1873,15 @@ pg_stat_gssapi| SELECT s.pid,
s.gss_enc AS encrypted
FROM pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid, query_id)
WHERE (s.client_port IS NOT NULL);
+pg_stat_io| SELECT b.backend_type,
+ b.io_context,
+ b.alloc,
+ b.extend,
+ b.fsync,
+ b.read,
+ b.write,
+ b.stats_reset
+ FROM pg_stat_get_io() b(backend_type, io_context, alloc, extend, fsync, read, write, stats_reset);
pg_stat_progress_analyze| SELECT s.pid,
s.datid,
d.datname,
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index 6b233ff4c0..a75fc91c57 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -796,4 +796,205 @@ SELECT pg_stat_get_subscription_stats(NULL);
(1 row)
+-- Test that allocs, extends, reads, and writes to Shared Buffers and fsyncs
+-- done to ensure durability of Shared Buffers are tracked in pg_stat_io.
+SELECT sum(alloc) AS io_sum_shared_allocs_before FROM pg_stat_io WHERE io_context = 'Shared' \gset
+SELECT sum(extend) AS io_sum_shared_extends_before FROM pg_stat_io WHERE io_context = 'Shared' \gset
+SELECT sum(fsync) AS io_sum_shared_fsyncs_before FROM pg_stat_io WHERE io_context = 'Shared' \gset
+SELECT sum(read) AS io_sum_shared_reads_before FROM pg_stat_io WHERE io_context = 'Shared' \gset
+SELECT sum(write) AS io_sum_shared_writes_before FROM pg_stat_io WHERE io_context = 'Shared' \gset
+-- Create a regular table and insert some data to generate IOCONTEXT_SHARED allocs and extends.
+CREATE TABLE test_io_shared(a int);
+INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+-- After a checkpoint, there should be some additional IOCONTEXT_SHARED writes and fsyncs.
+CHECKPOINT;
+SELECT sum(alloc) AS io_sum_shared_allocs_after FROM pg_stat_io WHERE io_context = 'Shared' \gset
+SELECT sum(extend) AS io_sum_shared_extends_after FROM pg_stat_io WHERE io_context = 'Shared' \gset
+SELECT sum(write) AS io_sum_shared_writes_after FROM pg_stat_io WHERE io_context = 'Shared' \gset
+SELECT sum(fsync) AS io_sum_shared_fsyncs_after FROM pg_stat_io WHERE io_context = 'Shared' \gset
+SELECT :io_sum_shared_allocs_after > :io_sum_shared_allocs_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_shared_extends_after > :io_sum_shared_extends_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT current_setting('fsync') = 'off' OR :io_sum_shared_fsyncs_after > :io_sum_shared_fsyncs_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_shared_writes_after > :io_sum_shared_writes_before;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Change the tablespace so that the table is rewritten directly, then SELECT
+-- from it to cause it to be read back into Shared Buffers.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE test_io_shared_stats_tblspc LOCATION '';
+ALTER TABLE test_io_shared SET TABLESPACE test_io_shared_stats_tblspc;
+SELECT COUNT(*) FROM test_io_shared;
+ count
+-------
+ 100
+(1 row)
+
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(read) AS io_sum_shared_reads_after FROM pg_stat_io WHERE io_context = 'Shared' \gset
+SELECT :io_sum_shared_reads_after > :io_sum_shared_reads_before;
+ ?column?
+----------
+ t
+(1 row)
+
+DROP TABLE test_io_shared;
+DROP TABLESPACE test_io_shared_stats_tblspc;
+-- Test that allocs, extends, reads, and writes of temporary tables are tracked
+-- in pg_stat_io.
+CREATE TEMPORARY TABLE test_io_local(a int, b TEXT);
+SELECT sum(alloc) AS io_sum_local_allocs_before FROM pg_stat_io WHERE io_context = 'Local' \gset
+SELECT sum(extend) AS io_sum_local_extends_before FROM pg_stat_io WHERE io_context = 'Local' \gset
+SELECT sum(read) AS io_sum_local_reads_before FROM pg_stat_io WHERE io_context = 'Local' \gset
+SELECT sum(write) AS io_sum_local_writes_before FROM pg_stat_io WHERE io_context = 'Local' \gset
+-- Insert enough values that we need to reuse and write out dirty local
+-- buffers.
+INSERT INTO test_io_local SELECT generate_series(1, 80000) as id,
+'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+-- Read in evicted buffers.
+SELECT COUNT(*) FROM test_io_local;
+ count
+-------
+ 80000
+(1 row)
+
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(alloc) AS io_sum_local_allocs_after FROM pg_stat_io WHERE io_context = 'Local' \gset
+SELECT sum(extend) AS io_sum_local_extends_after FROM pg_stat_io WHERE io_context = 'Local' \gset
+SELECT sum(read) AS io_sum_local_reads_after FROM pg_stat_io WHERE io_context = 'Local' \gset
+SELECT sum(write) AS io_sum_local_writes_after FROM pg_stat_io WHERE io_context = 'Local' \gset
+SELECT :io_sum_local_allocs_after > :io_sum_local_allocs_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_local_extends_after > :io_sum_local_extends_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_local_reads_after > :io_sum_local_reads_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_local_writes_after > :io_sum_local_writes_before;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Test that, when using a Strategy, reusing buffers from the Strategy ring
+-- counts as "Strategy" allocs in pg_stat_io. Also test that Strategy reads
+-- are counted as such.
+-- Set wal_skip_threshold smaller than the expected size of test_io_strategy so
+-- that, even if wal_level is minimal, VACUUM FULL will fsync the newly
+-- rewritten test_io_strategy instead of writing it to WAL. Writing it to WAL
+-- will result in the newly written relation pages being in shared buffers --
+-- preventing us from testing BufferAccessStrategy allocs and reads.
+SET wal_skip_threshold = '1 kB';
+SELECT sum(alloc) AS io_sum_strategy_allocs_before FROM pg_stat_io WHERE io_context = 'Strategy' \gset
+SELECT sum(read) AS io_sum_strategy_reads_before FROM pg_stat_io WHERE io_context = 'Strategy' \gset
+CREATE TABLE test_io_strategy(a INT, b INT);
+ALTER TABLE test_io_strategy SET (autovacuum_enabled = 'false');
+INSERT INTO test_io_strategy SELECT i, i from generate_series(1, 8000)i;
+-- Ensure that the next VACUUM will need to perform IO by rewriting the table
+-- first with VACUUM (FULL).
+VACUUM (FULL) test_io_strategy;
+VACUUM (PARALLEL 0) test_io_strategy;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(alloc) AS io_sum_strategy_allocs_after FROM pg_stat_io WHERE io_context = 'Strategy' \gset
+SELECT sum(read) AS io_sum_strategy_reads_after FROM pg_stat_io WHERE io_context = 'Strategy' \gset
+SELECT :io_sum_strategy_allocs_after > :io_sum_strategy_allocs_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_strategy_reads_after > :io_sum_strategy_reads_before;
+ ?column?
+----------
+ t
+(1 row)
+
+DROP TABLE test_io_strategy;
+-- This assumes the previous value of wal_skip_threshold was the default.
+-- We can't use BEGIN ... SET LOCAL since VACUUM can't be run inside a
+-- transaction block.
+RESET wal_skip_threshold;
+-- Test that Strategy extends done while creating a relation are counted in
+-- pg_stat_io.
+-- A CTAS uses a Bulkwrite strategy.
+SELECT sum(extend) AS io_sum_strategy_extends_before FROM pg_stat_io WHERE io_context = 'Strategy' \gset
+CREATE TABLE test_io_strategy_extend AS SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(extend) AS io_sum_strategy_extends_after FROM pg_stat_io WHERE io_context = 'Strategy' \gset
+SELECT :io_sum_strategy_extends_after > :io_sum_strategy_extends_before;
+ ?column?
+----------
+ t
+(1 row)
+
+DROP TABLE test_io_strategy_extend;
+-- Test stats reset
+SELECT sum(alloc) + sum(extend) + sum(fsync) + sum(read) + sum(write) AS io_stats_pre_reset FROM pg_stat_io \gset
+SELECT pg_stat_reset_shared('io');
+ pg_stat_reset_shared
+----------------------
+
+(1 row)
+
+SELECT sum(alloc) + sum(extend) + sum(fsync) + sum(read) + sum(write) AS io_stats_post_reset FROM pg_stat_io \gset
+SELECT :io_stats_post_reset < :io_stats_pre_reset;
+ ?column?
+----------
+ t
+(1 row)
+
-- End of Stats Test
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index 096f00ce8b..090cc67296 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -396,4 +396,107 @@ SELECT pg_stat_get_replication_slot(NULL);
SELECT pg_stat_get_subscription_stats(NULL);
+
+-- Test that allocs, extends, reads, and writes to Shared Buffers and fsyncs
+-- done to ensure durability of Shared Buffers are tracked in pg_stat_io.
+SELECT sum(alloc) AS io_sum_shared_allocs_before FROM pg_stat_io WHERE io_context = 'Shared' \gset
+SELECT sum(extend) AS io_sum_shared_extends_before FROM pg_stat_io WHERE io_context = 'Shared' \gset
+SELECT sum(fsync) AS io_sum_shared_fsyncs_before FROM pg_stat_io WHERE io_context = 'Shared' \gset
+SELECT sum(read) AS io_sum_shared_reads_before FROM pg_stat_io WHERE io_context = 'Shared' \gset
+SELECT sum(write) AS io_sum_shared_writes_before FROM pg_stat_io WHERE io_context = 'Shared' \gset
+-- Create a regular table and insert some data to generate IOCONTEXT_SHARED allocs and extends.
+CREATE TABLE test_io_shared(a int);
+INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+-- After a checkpoint, there should be some additional IOCONTEXT_SHARED writes and fsyncs.
+CHECKPOINT;
+SELECT sum(alloc) AS io_sum_shared_allocs_after FROM pg_stat_io WHERE io_context = 'Shared' \gset
+SELECT sum(extend) AS io_sum_shared_extends_after FROM pg_stat_io WHERE io_context = 'Shared' \gset
+SELECT sum(write) AS io_sum_shared_writes_after FROM pg_stat_io WHERE io_context = 'Shared' \gset
+SELECT sum(fsync) AS io_sum_shared_fsyncs_after FROM pg_stat_io WHERE io_context = 'Shared' \gset
+SELECT :io_sum_shared_allocs_after > :io_sum_shared_allocs_before;
+SELECT :io_sum_shared_extends_after > :io_sum_shared_extends_before;
+SELECT current_setting('fsync') = 'off' OR :io_sum_shared_fsyncs_after > :io_sum_shared_fsyncs_before;
+SELECT :io_sum_shared_writes_after > :io_sum_shared_writes_before;
+-- Change the tablespace so that the table is rewritten directly, then SELECT
+-- from it to cause it to be read back into Shared Buffers.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE test_io_shared_stats_tblspc LOCATION '';
+ALTER TABLE test_io_shared SET TABLESPACE test_io_shared_stats_tblspc;
+SELECT COUNT(*) FROM test_io_shared;
+SELECT pg_stat_force_next_flush();
+SELECT sum(read) AS io_sum_shared_reads_after FROM pg_stat_io WHERE io_context = 'Shared' \gset
+SELECT :io_sum_shared_reads_after > :io_sum_shared_reads_before;
+DROP TABLE test_io_shared;
+DROP TABLESPACE test_io_shared_stats_tblspc;
+
+-- Test that allocs, extends, reads, and writes of temporary tables are tracked
+-- in pg_stat_io.
+CREATE TEMPORARY TABLE test_io_local(a int, b TEXT);
+SELECT sum(alloc) AS io_sum_local_allocs_before FROM pg_stat_io WHERE io_context = 'Local' \gset
+SELECT sum(extend) AS io_sum_local_extends_before FROM pg_stat_io WHERE io_context = 'Local' \gset
+SELECT sum(read) AS io_sum_local_reads_before FROM pg_stat_io WHERE io_context = 'Local' \gset
+SELECT sum(write) AS io_sum_local_writes_before FROM pg_stat_io WHERE io_context = 'Local' \gset
+-- Insert enough values that we need to reuse and write out dirty local
+-- buffers.
+INSERT INTO test_io_local SELECT generate_series(1, 80000) as id,
+'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+-- Read in evicted buffers.
+SELECT COUNT(*) FROM test_io_local;
+SELECT pg_stat_force_next_flush();
+SELECT sum(alloc) AS io_sum_local_allocs_after FROM pg_stat_io WHERE io_context = 'Local' \gset
+SELECT sum(extend) AS io_sum_local_extends_after FROM pg_stat_io WHERE io_context = 'Local' \gset
+SELECT sum(read) AS io_sum_local_reads_after FROM pg_stat_io WHERE io_context = 'Local' \gset
+SELECT sum(write) AS io_sum_local_writes_after FROM pg_stat_io WHERE io_context = 'Local' \gset
+SELECT :io_sum_local_allocs_after > :io_sum_local_allocs_before;
+SELECT :io_sum_local_extends_after > :io_sum_local_extends_before;
+SELECT :io_sum_local_reads_after > :io_sum_local_reads_before;
+SELECT :io_sum_local_writes_after > :io_sum_local_writes_before;
+
+-- Test that, when using a Strategy, reusing buffers from the Strategy ring
+-- counts as "Strategy" allocs in pg_stat_io. Also test that Strategy reads
+-- are counted as such.
+
+-- Set wal_skip_threshold smaller than the expected size of test_io_strategy so
+-- that, even if wal_level is minimal, VACUUM FULL will fsync the newly
+-- rewritten test_io_strategy instead of writing it to WAL. Writing it to WAL
+-- will result in the newly written relation pages being in shared buffers --
+-- preventing us from testing BufferAccessStrategy allocs and reads.
+SET wal_skip_threshold = '1 kB';
+SELECT sum(alloc) AS io_sum_strategy_allocs_before FROM pg_stat_io WHERE io_context = 'Strategy' \gset
+SELECT sum(read) AS io_sum_strategy_reads_before FROM pg_stat_io WHERE io_context = 'Strategy' \gset
+CREATE TABLE test_io_strategy(a INT, b INT);
+ALTER TABLE test_io_strategy SET (autovacuum_enabled = 'false');
+INSERT INTO test_io_strategy SELECT i, i from generate_series(1, 8000)i;
+-- Ensure that the next VACUUM will need to perform IO by rewriting the table
+-- first with VACUUM (FULL).
+VACUUM (FULL) test_io_strategy;
+VACUUM (PARALLEL 0) test_io_strategy;
+SELECT pg_stat_force_next_flush();
+SELECT sum(alloc) AS io_sum_strategy_allocs_after FROM pg_stat_io WHERE io_context = 'Strategy' \gset
+SELECT sum(read) AS io_sum_strategy_reads_after FROM pg_stat_io WHERE io_context = 'Strategy' \gset
+SELECT :io_sum_strategy_allocs_after > :io_sum_strategy_allocs_before;
+SELECT :io_sum_strategy_reads_after > :io_sum_strategy_reads_before;
+DROP TABLE test_io_strategy;
+-- This assumes the previous value of wal_skip_threshold was the default.
+-- We can't use BEGIN ... SET LOCAL since VACUUM can't be run inside a
+-- transaction block.
+RESET wal_skip_threshold;
+
+-- Test that Strategy extends done while creating a relation are counted in
+-- pg_stat_io.
+-- A CTAS uses a Bulkwrite strategy.
+SELECT sum(extend) AS io_sum_strategy_extends_before FROM pg_stat_io WHERE io_context = 'Strategy' \gset
+CREATE TABLE test_io_strategy_extend AS SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+SELECT sum(extend) AS io_sum_strategy_extends_after FROM pg_stat_io WHERE io_context = 'Strategy' \gset
+SELECT :io_sum_strategy_extends_after > :io_sum_strategy_extends_before;
+DROP TABLE test_io_strategy_extend;
+
+-- Test stats reset
+SELECT sum(alloc) + sum(extend) + sum(fsync) + sum(read) + sum(write) AS io_stats_pre_reset FROM pg_stat_io \gset
+SELECT pg_stat_reset_shared('io');
+SELECT sum(alloc) + sum(extend) + sum(fsync) + sum(read) + sum(write) AS io_stats_post_reset FROM pg_stat_io \gset
+SELECT :io_stats_post_reset < :io_stats_pre_reset;
+
-- End of Stats Test
--
2.34.1
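The NULL-cell handling in pg_stat_get_io() above (invalid (backend type, IO context, IO op) combinations become NULL rather than 0) can be sketched as a small standalone C program. The specific rules below, e.g. that only shared buffers are fsync'd and that bgwriter never reads or extends, are simplified stand-ins for the patch's pgstat_bktype_io_op_valid()/pgstat_io_context_io_op_valid() checks, not its exact rule set.

```c
/*
 * Sketch of the NULL-cell logic from pg_stat_get_io(): a view cell is NULL
 * when the (backend type, IO context, IO op) combination can never occur.
 * The validity rules here are illustrative only.
 */
#include <assert.h>
#include <stdbool.h>

typedef enum IOOp { IOOP_ALLOC, IOOP_EXTEND, IOOP_FSYNC, IOOP_READ, IOOP_WRITE } IOOp;
typedef enum IOContext { IOCONTEXT_LOCAL, IOCONTEXT_SHARED, IOCONTEXT_STRATEGY } IOContext;
typedef enum BackendType { B_BACKEND, B_BG_WRITER, B_CHECKPOINTER } BackendType;

/* e.g. only shared buffers are ever fsync'd in this sketch */
static bool
io_context_io_op_valid(IOContext io_context, IOOp io_op)
{
	if (io_op == IOOP_FSYNC && io_context != IOCONTEXT_SHARED)
		return false;
	return true;
}

/* e.g. bgwriter only writes out victim buffers; it never reads or extends */
static bool
bktype_io_op_valid(BackendType bktype, IOOp io_op)
{
	if (bktype == B_BG_WRITER &&
		(io_op == IOOP_READ || io_op == IOOP_EXTEND))
		return false;
	return true;
}

/* Mirrors the inner loop of pg_stat_get_io(): true => view cell is NULL */
static bool
cell_is_null(BackendType bktype, IOContext io_context, IOOp io_op)
{
	return !bktype_io_op_valid(bktype, io_op) ||
		!io_context_io_op_valid(io_context, io_op);
}
```

Separating the validity predicates from the view-building loop is what lets the patch both NULL out the cells and assert that the corresponding counters really are zero.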
Attachment: v27-0001-Add-BackendType-for-standalone-backends.patch
From 2f67aadbf36ddca626d03474b962346ffeab89fb Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Thu, 11 Aug 2022 18:28:24 -0400
Subject: [PATCH v27 1/4] Add BackendType for standalone backends
All backends should have a BackendType to enable statistics reporting
per BackendType.
Add a new BackendType for standalone backends, B_STANDALONE_BACKEND (and
alphabetize the BackendTypes). Both the bootstrap backend and single
user mode backends will have BackendType B_STANDALONE_BACKEND.
Author: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://www.postgresql.org/message-id/CAAKRu_aaq33UnG4TXq3S-OSXGWj1QGf0sU%2BECH4tNwGFNERkZA%40mail.gmail.com
---
src/backend/utils/init/miscinit.c | 17 +++++++++++------
src/include/miscadmin.h | 5 +++--
2 files changed, 14 insertions(+), 8 deletions(-)
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index bd973ba613..bf3871a774 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -176,6 +176,8 @@ InitStandaloneProcess(const char *argv0)
{
Assert(!IsPostmasterEnvironment);
+ MyBackendType = B_STANDALONE_BACKEND;
+
/*
* Start our win32 signal implementation
*/
@@ -255,6 +257,9 @@ GetBackendTypeDesc(BackendType backendType)
case B_INVALID:
backendDesc = "not initialized";
break;
+ case B_ARCHIVER:
+ backendDesc = "archiver";
+ break;
case B_AUTOVAC_LAUNCHER:
backendDesc = "autovacuum launcher";
break;
@@ -273,6 +278,12 @@ GetBackendTypeDesc(BackendType backendType)
case B_CHECKPOINTER:
backendDesc = "checkpointer";
break;
+ case B_LOGGER:
+ backendDesc = "logger";
+ break;
+ case B_STANDALONE_BACKEND:
+ backendDesc = "standalone backend";
+ break;
case B_STARTUP:
backendDesc = "startup";
break;
@@ -285,12 +296,6 @@ GetBackendTypeDesc(BackendType backendType)
case B_WAL_WRITER:
backendDesc = "walwriter";
break;
- case B_ARCHIVER:
- backendDesc = "archiver";
- break;
- case B_LOGGER:
- backendDesc = "logger";
- break;
}
return backendDesc;
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 067b729d5a..7c41b27994 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -316,18 +316,19 @@ extern void SwitchBackToLocalLatch(void);
typedef enum BackendType
{
B_INVALID = 0,
+ B_ARCHIVER,
B_AUTOVAC_LAUNCHER,
B_AUTOVAC_WORKER,
B_BACKEND,
B_BG_WORKER,
B_BG_WRITER,
B_CHECKPOINTER,
+ B_LOGGER,
+ B_STANDALONE_BACKEND,
B_STARTUP,
B_WAL_RECEIVER,
B_WAL_SENDER,
B_WAL_WRITER,
- B_ARCHIVER,
- B_LOGGER,
} BackendType;
extern PGDLLIMPORT BackendType MyBackendType;
--
2.34.1
Attachment: v27-0002-Remove-unneeded-call-to-pgstat_report_wal.patch (text/x-patch)
From 8035d5ca54b670e366af4a8688b942dad299eb66 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Thu, 11 Aug 2022 18:28:40 -0400
Subject: [PATCH v27 2/4] Remove unneeded call to pgstat_report_wal()
pgstat_report_stat() is already called before shutdown, so an explicit
call to pgstat_report_wal() here is redundant.
---
src/backend/postmaster/walwriter.c | 11 -----------
1 file changed, 11 deletions(-)
diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c
index e926f8c27c..beb46dcb55 100644
--- a/src/backend/postmaster/walwriter.c
+++ b/src/backend/postmaster/walwriter.c
@@ -293,18 +293,7 @@ HandleWalWriterInterrupts(void)
}
if (ShutdownRequestPending)
- {
- /*
- * Force reporting remaining WAL statistics at process exit.
- *
- * Since pgstat_report_wal is invoked with 'force' is false in main
- * loop to avoid overloading the cumulative stats system, there may
- * exist unreported stats counters for the WAL writer.
- */
- pgstat_report_wal(true);
-
proc_exit(0);
- }
/* Perform logging of memory contexts of this process */
if (LogMemoryContextPending)
--
2.34.1
v28 attached.
I've added the new structs to typedefs.list.
I've split the commit which adds all of the logic to track
IO operation statistics into two commits -- one which includes all of
the code to count IOOps for IOContexts locally in a backend and a second
which includes all of the code to accumulate and manage these with the
cumulative stats system.
A few notes about the commit which adds local IO Operation stats:
- There is a comment above pgstat_io_op_stats_collected() which mentions
the cumulative stats system even though this commit doesn't engage the
cumulative stats system. I wasn't sure if it was more or less
confusing to have two different versions of this comment.
- should pgstat_count_io_op() take BackendType as a parameter instead of
using MyBackendType internally?
- pgstat_count_io_op() Assert()s that the passed-in IOOp and IOContext
are valid for this BackendType, but it doesn't check that all of the
pending stats which should be zero are zero. I thought this was okay:
if I did add that zero-check, it would go in pgstat_count_io_op() as
well, and we already Assert() there that we can count the op. Thus,
checking that the stats are zero wouldn't add any additional regression
protection.
- I've kept pgstat_io_context_desc() and pgstat_io_op_desc() in the
commit which adds those types (the local stats commit), however they
are not used in that commit. I wasn't sure if I should keep them in
that commit or move them to the first commit using them (the commit
adding the new view).
Notes on the commit which accumulates IO Operation stats in shared
memory:
- I've extended the Assert()s which check that IO Operation stats that
should be zero actually are zero. Previously we only checked the stats'
validity when querying the view. Now we also check it when flushing
pending stats and when reading the stats file into shared memory.
Note that the three locations with these validity checks (when
flushing pending stats, when reading stats file into shared memory,
and when querying the view) have similar looking code to loop through
and validate the stats. However, the actual action they perform if the
stats are valid is different for each site (adding counters together,
doing a read, setting nulls in a tuple column to true). Also, some of
these instances have other code interspersed in the loops which would
require additional looping if separated from this logic. So it was
difficult to see a way of combining these into a single helper
function.
- I've left pgstat_fetch_backend_io_context_ops() in the shared stats
commit, however it is not used until the commit which adds the view in
pg_stat_get_io(). I wasn't sure which way seemed better.
- Melanie
Attachments:
Attachment: v28-0003-Track-IO-operation-statistics-locally.patch (text/x-patch)
From 19f89b5ba164272758f17a35973eae69475f7400 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 22 Aug 2022 11:08:23 -0400
Subject: [PATCH v28 3/5] Track IO operation statistics locally
Introduce "IOOp", an IO operation done by a backend, and "IOContext",
the location or type of IO done by a backend. For example, the
checkpointer may write out a shared buffer. This would be counted as an
IOOp "write" in the IOCONTEXT_SHARED IOContext by BackendType
"checkpointer".
Each IOOp (alloc, extend, fsync, read, write) is counted per IOContext
(local, shared, or strategy) through a call to pgstat_count_io_op().
The primary concern of these statistics is IO operations on data blocks
during the course of normal database operations. IO done by, for
example, the archiver or syslogger is not counted in these statistics.
IOCONTEXT_LOCAL and IOCONTEXT_SHARED IOContexts concern operations on
local and shared buffers.
The IOCONTEXT_STRATEGY IOContext concerns buffers
alloc'd/extended/fsync'd/read/written as part of a BufferAccessStrategy.
IOOP_ALLOC is counted for IOCONTEXT_SHARED and IOCONTEXT_LOCAL whenever
a buffer is acquired through [Local]BufferAlloc(). IOOP_ALLOC for
IOCONTEXT_STRATEGY is counted whenever a buffer already in the strategy
ring is reused. And IOOP_WRITE for IOCONTEXT_STRATEGY is counted
whenever the reused dirty buffer is written out.
Stats on IOOps for all IOContexts for a backend are counted in a
backend's local memory. This commit does not expose any functions for
aggregating or viewing these stats.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>, Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
src/backend/postmaster/checkpointer.c | 12 ++
src/backend/storage/buffer/bufmgr.c | 64 +++++--
src/backend/storage/buffer/freelist.c | 23 ++-
src/backend/storage/buffer/localbuf.c | 5 +
src/backend/storage/sync/sync.c | 9 +
src/backend/utils/activity/Makefile | 1 +
src/backend/utils/activity/pgstat_io_ops.c | 191 +++++++++++++++++++++
src/include/pgstat.h | 54 ++++++
src/include/storage/buf_internals.h | 2 +-
src/tools/pgindent/typedefs.list | 4 +
10 files changed, 352 insertions(+), 13 deletions(-)
create mode 100644 src/backend/utils/activity/pgstat_io_ops.c
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 5fc076fc14..bd2e1de7c2 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -1116,6 +1116,18 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
if (!AmBackgroundWriterProcess())
CheckpointerShmem->num_backend_fsync++;
LWLockRelease(CheckpointerCommLock);
+
+ /*
+ * We have no way of knowing if the current IOContext is
+ * IOCONTEXT_SHARED or IOCONTEXT_STRATEGY at this point, so count the
+ * fsync as being in the IOCONTEXT_SHARED IOContext. This is probably
+ * okay, because the number of backend fsyncs doesn't say anything
+ * about the efficacy of the BufferAccessStrategy. And counting both
+ * fsyncs done in IOCONTEXT_SHARED and IOCONTEXT_STRATEGY under
+ * IOCONTEXT_SHARED is likely clearer when investigating the number of
+ * backend fsyncs.
+ */
+ pgstat_count_io_op(IOOP_FSYNC, IOCONTEXT_SHARED);
return false;
}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 7a1202c609..fbe8f69b7b 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -482,7 +482,7 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
BlockNumber blockNum,
BufferAccessStrategy strategy,
bool *foundPtr);
-static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+static void FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOContext io_context);
static void FindAndDropRelationBuffers(RelFileLocator rlocator,
ForkNumber forkNum,
BlockNumber nForkBlock,
@@ -823,6 +823,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
BufferDesc *bufHdr;
Block bufBlock;
bool found;
+ IOContext io_context;
bool isExtend;
bool isLocalBuf = SmgrIsTemp(smgr);
@@ -986,10 +987,25 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
*/
Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID)); /* spinlock not needed */
- bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
+ if (isLocalBuf)
+ {
+ bufBlock = LocalBufHdrGetBlock(bufHdr);
+ io_context = IOCONTEXT_LOCAL;
+ }
+ else
+ {
+ bufBlock = BufHdrGetBlock(bufHdr);
+
+ if (strategy != NULL)
+ io_context = IOCONTEXT_STRATEGY;
+ else
+ io_context = IOCONTEXT_SHARED;
+ }
if (isExtend)
{
+
+ pgstat_count_io_op(IOOP_EXTEND, io_context);
/* new buffers are zero-filled */
MemSet((char *) bufBlock, 0, BLCKSZ);
/* don't set checksum for all-zero page */
@@ -1020,6 +1036,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
smgrread(smgr, forkNum, blockNum, (char *) bufBlock);
+ pgstat_count_io_op(IOOP_READ, io_context);
+
if (track_io_timing)
{
INSTR_TIME_SET_CURRENT(io_time);
@@ -1190,6 +1208,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
/* Loop here in case we have to try another victim buffer */
for (;;)
{
+ bool from_ring;
+
/*
* Ensure, while the spinlock's not yet held, that there's a free
* refcount entry.
@@ -1200,7 +1220,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* Select a victim buffer. The buffer is returned with its header
* spinlock still held!
*/
- buf = StrategyGetBuffer(strategy, &buf_state);
+ buf = StrategyGetBuffer(strategy, &buf_state, &from_ring);
Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
@@ -1237,6 +1257,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf),
LW_SHARED))
{
+ IOContext io_context;
+
/*
* If using a nondefault strategy, and writing the buffer
* would require a WAL flush, let the strategy decide whether
@@ -1263,13 +1285,28 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
}
}
+ /*
+ * When a strategy is in use, if the target dirty buffer is an
+ * existing strategy buffer being reused, count this as a
+ * strategy write for the purposes of IO Operations statistics
+ * tracking.
+ *
+ * All dirty shared buffers upon first being added to the ring
+ * will be counted as shared buffer writes.
+ *
+ * When a strategy is not in use, the write can only be a
+ * "regular" write of a dirty shared buffer.
+ */
+
+ io_context = from_ring ? IOCONTEXT_STRATEGY : IOCONTEXT_SHARED;
+
/* OK, do the I/O */
TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(forkNum, blockNum,
smgr->smgr_rlocator.locator.spcOid,
smgr->smgr_rlocator.locator.dbOid,
smgr->smgr_rlocator.locator.relNumber);
- FlushBuffer(buf, NULL);
+ FlushBuffer(buf, NULL, io_context);
LWLockRelease(BufferDescriptorGetContentLock(buf));
ScheduleBufferTagForWriteback(&BackendWritebackContext,
@@ -2573,7 +2610,7 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOCONTEXT_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
@@ -2820,9 +2857,12 @@ BufferGetTag(Buffer buffer, RelFileLocator *rlocator, ForkNumber *forknum,
*
* If the caller has an smgr reference for the buffer's relation, pass it
* as the second parameter. If not, pass NULL.
+ *
+ * IOContext will always be IOCONTEXT_SHARED except when a buffer access strategy is
+ * used and the buffer being flushed is a buffer from the strategy ring.
*/
static void
-FlushBuffer(BufferDesc *buf, SMgrRelation reln)
+FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOContext io_context)
{
XLogRecPtr recptr;
ErrorContextCallback errcallback;
@@ -2902,6 +2942,8 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
*/
bufToWrite = PageSetChecksumCopy((Page) bufBlock, buf->tag.blockNum);
+ pgstat_count_io_op(IOOP_WRITE, io_context);
+
if (track_io_timing)
INSTR_TIME_SET_CURRENT(io_start);
@@ -3549,6 +3591,8 @@ FlushRelationBuffers(Relation rel)
localpage,
false);
+ pgstat_count_io_op(IOOP_WRITE, IOCONTEXT_LOCAL);
+
buf_state &= ~(BM_DIRTY | BM_JUST_DIRTIED);
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
@@ -3584,7 +3628,7 @@ FlushRelationBuffers(Relation rel)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, RelationGetSmgr(rel));
+ FlushBuffer(bufHdr, RelationGetSmgr(rel), IOCONTEXT_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
@@ -3679,7 +3723,7 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, srelent->srel);
+ FlushBuffer(bufHdr, srelent->srel, IOCONTEXT_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
@@ -3889,7 +3933,7 @@ FlushDatabaseBuffers(Oid dbid)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOCONTEXT_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
@@ -3916,7 +3960,7 @@ FlushOneBuffer(Buffer buffer)
Assert(LWLockHeldByMe(BufferDescriptorGetContentLock(bufHdr)));
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOCONTEXT_SHARED);
}
/*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 990e081aae..237a48e8d8 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -15,6 +15,7 @@
*/
#include "postgres.h"
+#include "pgstat.h"
#include "port/atomics.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
@@ -198,13 +199,15 @@ have_free_buffer(void)
* return the buffer with the buffer header spinlock still held.
*/
BufferDesc *
-StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
+StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_ring)
{
BufferDesc *buf;
int bgwprocno;
int trycounter;
uint32 local_buf_state; /* to avoid repeated (de-)referencing */
+ *from_ring = false;
+
/*
* If given a strategy object, see whether it can select a buffer. We
* assume strategy objects don't need buffer_strategy_lock.
@@ -212,8 +215,23 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
if (strategy != NULL)
{
buf = GetBufferFromRing(strategy, buf_state);
- if (buf != NULL)
+ *from_ring = buf != NULL;
+ if (*from_ring)
+ {
+ /*
+ * When a strategy is in use, reused buffers from the strategy
+ * ring will be counted as IOCONTEXT_STRATEGY allocations for the
+ * purposes of IO Operation statistics tracking.
+ *
+ * However, even when a strategy is in use, if a new buffer must
+ * be allocated from shared buffers and added to the ring, this is
+ * counted instead as an IOCONTEXT_SHARED allocation. So, only
+ * reused buffers are counted as being in the IOCONTEXT_STRATEGY
+ * IOContext.
+ */
+ pgstat_count_io_op(IOOP_ALLOC, IOCONTEXT_STRATEGY);
return buf;
+ }
}
/*
@@ -247,6 +265,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
* the rate of buffer consumption. Note that buffers recycled by a
* strategy object are intentionally not counted here.
*/
+ pgstat_count_io_op(IOOP_ALLOC, IOCONTEXT_SHARED);
pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);
/*
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 014f644bf9..a3d76599bf 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -15,6 +15,7 @@
*/
#include "postgres.h"
+#include "pgstat.h"
#include "access/parallel.h"
#include "catalog/catalog.h"
#include "executor/instrument.h"
@@ -196,6 +197,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
LocalRefCount[b]++;
ResourceOwnerRememberBuffer(CurrentResourceOwner,
BufferDescriptorGetBuffer(bufHdr));
+
+ pgstat_count_io_op(IOOP_ALLOC, IOCONTEXT_LOCAL);
break;
}
}
@@ -226,6 +229,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
localpage,
false);
+ pgstat_count_io_op(IOOP_WRITE, IOCONTEXT_LOCAL);
+
/* Mark not-dirty now in case we error out below */
buf_state &= ~BM_DIRTY;
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 9d6a9e9109..f310b7a435 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -432,6 +432,15 @@ ProcessSyncRequests(void)
total_elapsed += elapsed;
processed++;
+ /*
+ * Note that if a backend using a BufferAccessStrategy is
+ * forced to do its own fsync (as opposed to the
+ * checkpointer doing it), it will not be counted as an
+ * IOCONTEXT_STRATEGY IOOP_FSYNC and instead will be
+ * counted as an IOCONTEXT_SHARED IOOP_FSYNC.
+ */
+ pgstat_count_io_op(IOOP_FSYNC, IOCONTEXT_SHARED);
+
if (log_checkpoints)
elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f ms",
processed,
diff --git a/src/backend/utils/activity/Makefile b/src/backend/utils/activity/Makefile
index a2e8507fd6..0098785089 100644
--- a/src/backend/utils/activity/Makefile
+++ b/src/backend/utils/activity/Makefile
@@ -22,6 +22,7 @@ OBJS = \
pgstat_checkpointer.o \
pgstat_database.o \
pgstat_function.o \
+ pgstat_io_ops.o \
pgstat_relation.o \
pgstat_replslot.o \
pgstat_shmem.o \
diff --git a/src/backend/utils/activity/pgstat_io_ops.c b/src/backend/utils/activity/pgstat_io_ops.c
new file mode 100644
index 0000000000..9c70a4a6dd
--- /dev/null
+++ b/src/backend/utils/activity/pgstat_io_ops.c
@@ -0,0 +1,191 @@
+/* -------------------------------------------------------------------------
+ *
+ * pgstat_io_ops.c
+ * Implementation of IO operation statistics.
+ *
+ * This file contains the implementation of IO operation statistics. It is kept
+ * separate from pgstat.c to enforce the line between the statistics access /
+ * storage implementation and the details about individual types of
+ * statistics.
+ *
+ * Copyright (c) 2001-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/utils/activity/pgstat_io_ops.c
+ * -------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "utils/pgstat_internal.h"
+
+static PgStat_IOContextOps pending_IOOpStats;
+
+void
+pgstat_count_io_op(IOOp io_op, IOContext io_context)
+{
+ PgStat_IOOpCounters *pending_counters = &pending_IOOpStats.data[io_context];
+
+ Assert(pgstat_expect_io_op(MyBackendType, io_context, io_op));
+
+ switch (io_op)
+ {
+ case IOOP_ALLOC:
+ pending_counters->allocs++;
+ break;
+ case IOOP_EXTEND:
+ pending_counters->extends++;
+ break;
+ case IOOP_FSYNC:
+ pending_counters->fsyncs++;
+ break;
+ case IOOP_READ:
+ pending_counters->reads++;
+ break;
+ case IOOP_WRITE:
+ pending_counters->writes++;
+ break;
+ }
+
+}
+
+const char *
+pgstat_io_context_desc(IOContext io_context)
+{
+ switch (io_context)
+ {
+ case IOCONTEXT_LOCAL:
+ return "Local";
+ case IOCONTEXT_SHARED:
+ return "Shared";
+ case IOCONTEXT_STRATEGY:
+ return "Strategy";
+ }
+
+ elog(ERROR, "unrecognized IOContext value: %d", io_context);
+}
+
+const char *
+pgstat_io_op_desc(IOOp io_op)
+{
+ switch (io_op)
+ {
+ case IOOP_ALLOC:
+ return "Alloc";
+ case IOOP_EXTEND:
+ return "Extend";
+ case IOOP_FSYNC:
+ return "Fsync";
+ case IOOP_READ:
+ return "Read";
+ case IOOP_WRITE:
+ return "Write";
+ }
+
+ elog(ERROR, "unrecognized IOOp value: %d", io_op);
+}
+
+/*
+* IO Operation statistics are not collected for all BackendTypes.
+*
+* The following BackendTypes do not participate in the cumulative stats
+* subsystem or do not do IO operations worth reporting statistics on:
+* - Syslogger because it is not connected to shared memory
+* - Archiver because most relevant archiving IO is delegated to a
+* specialized command or module
+* - WAL Receiver and WAL Writer IO is not tracked in pg_stat_io for now
+*
+* Function returns true if BackendType participates in the cumulative stats
+* subsystem for IO Operations and false if it does not.
+*/
+bool
+pgstat_io_op_stats_collected(BackendType bktype)
+{
+ return bktype != B_INVALID && bktype != B_ARCHIVER && bktype != B_LOGGER &&
+ bktype != B_WAL_RECEIVER && bktype != B_WAL_WRITER;
+}
+
+bool
+pgstat_bktype_io_context_valid(BackendType bktype, IOContext io_context)
+{
+ bool no_strategy;
+ bool no_local;
+
+ /*
+ * Not all BackendTypes will use a BufferAccessStrategy.
+ */
+ no_strategy = bktype == B_AUTOVAC_LAUNCHER || bktype ==
+ B_BG_WRITER || bktype == B_CHECKPOINTER;
+
+ /*
+ * Only regular backends and WAL Sender processes executing queries should
+ * use local buffers.
+ */
+ no_local = bktype == B_AUTOVAC_LAUNCHER || bktype ==
+ B_BG_WRITER || bktype == B_CHECKPOINTER || bktype ==
+ B_AUTOVAC_WORKER || bktype == B_BG_WORKER || bktype ==
+ B_STANDALONE_BACKEND || bktype == B_STARTUP;
+
+ if (io_context == IOCONTEXT_STRATEGY && no_strategy)
+ return false;
+
+ if (io_context == IOCONTEXT_LOCAL && no_local)
+ return false;
+
+ return true;
+}
+
+bool
+pgstat_bktype_io_op_valid(BackendType bktype, IOOp io_op)
+{
+ if ((bktype == B_BG_WRITER || bktype == B_CHECKPOINTER) && io_op ==
+ IOOP_READ)
+ return false;
+
+ if ((bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER || bktype ==
+ B_CHECKPOINTER) && io_op == IOOP_EXTEND)
+ return false;
+
+ return true;
+}
+
+bool
+pgstat_io_context_io_op_valid(IOContext io_context, IOOp io_op)
+{
+ /*
+ * Temporary tables using local buffers are not logged and thus do not
+ * require fsync'ing. Set this cell to NULL to differentiate between an
+ * invalid combination and 0 observed IO Operations.
+ *
+ * IOOP_FSYNC IOOps done by a backend using a BufferAccessStrategy are
+ * counted in the IOCONTEXT_SHARED IOContext. See comment in
+ * ForwardSyncRequest() for more details.
+ */
+ if ((io_context == IOCONTEXT_LOCAL || io_context == IOCONTEXT_STRATEGY) &&
+ io_op == IOOP_FSYNC)
+ return false;
+
+ return true;
+}
+
+bool
+pgstat_expect_io_op(BackendType bktype, IOContext io_context, IOOp io_op)
+{
+ if (!pgstat_io_op_stats_collected(bktype))
+ return false;
+
+ if (!pgstat_bktype_io_context_valid(bktype, io_context))
+ return false;
+
+ if (!pgstat_bktype_io_op_valid(bktype, io_op))
+ return false;
+
+ if (!pgstat_io_context_io_op_valid(io_context, io_op))
+ return false;
+
+ /*
+ * There are currently no cases of a BackendType, IOContext, IOOp
+ * combination that are specifically invalid.
+ */
+ return true;
+}
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index ac28f813b4..32d1aec540 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -276,6 +276,44 @@ typedef struct PgStat_CheckpointerStats
PgStat_Counter buf_fsync_backend;
} PgStat_CheckpointerStats;
+/*
+ * Types related to counting IO Operations for various IO Contexts
+ */
+
+typedef enum IOOp
+{
+ IOOP_ALLOC,
+ IOOP_EXTEND,
+ IOOP_FSYNC,
+ IOOP_READ,
+ IOOP_WRITE,
+} IOOp;
+
+#define IOOP_NUM_TYPES (IOOP_WRITE + 1)
+
+typedef enum IOContext
+{
+ IOCONTEXT_LOCAL,
+ IOCONTEXT_SHARED,
+ IOCONTEXT_STRATEGY,
+} IOContext;
+
+#define IOCONTEXT_NUM_TYPES (IOCONTEXT_STRATEGY + 1)
+
+typedef struct PgStat_IOOpCounters
+{
+ PgStat_Counter allocs;
+ PgStat_Counter extends;
+ PgStat_Counter fsyncs;
+ PgStat_Counter reads;
+ PgStat_Counter writes;
+} PgStat_IOOpCounters;
+
+typedef struct PgStat_IOContextOps
+{
+ PgStat_IOOpCounters data[IOCONTEXT_NUM_TYPES];
+} PgStat_IOContextOps;
+
typedef struct PgStat_StatDBEntry
{
PgStat_Counter n_xact_commit;
@@ -453,6 +491,22 @@ extern void pgstat_report_checkpointer(void);
extern PgStat_CheckpointerStats *pgstat_fetch_stat_checkpointer(void);
+/*
+ * Functions in pgstat_io_ops.c
+ */
+
+extern void pgstat_count_io_op(IOOp io_op, IOContext io_context);
+extern const char *pgstat_io_context_desc(IOContext io_context);
+extern const char *pgstat_io_op_desc(IOOp io_op);
+
+/* Validation functions in pgstat_io_ops.c */
+extern bool pgstat_io_op_stats_collected(BackendType bktype);
+extern bool pgstat_bktype_io_context_valid(BackendType bktype, IOContext io_context);
+extern bool pgstat_bktype_io_op_valid(BackendType bktype, IOOp io_op);
+extern bool pgstat_io_context_io_op_valid(IOContext io_context, IOOp io_op);
+extern bool pgstat_expect_io_op(BackendType bktype, IOContext io_context, IOOp io_op);
+
+
/*
* Functions in pgstat_database.c
*/
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 72466551d7..aa064173ee 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -346,7 +346,7 @@ extern void ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *
/* freelist.c */
extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
- uint32 *buf_state);
+ uint32 *buf_state, bool *from_ring);
extern void StrategyFreeBuffer(BufferDesc *buf);
extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
BufferDesc *buf);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 35c9f1efce..72378b2148 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1105,7 +1105,9 @@ ID
INFIX
INT128
INTERFACE_INFO
+IOContext
IOFuncSelector
+IOOp
IPCompareMethod
ITEM
IV
@@ -2037,6 +2039,8 @@ PgStat_FetchConsistency
PgStat_FunctionCallUsage
PgStat_FunctionCounts
PgStat_HashKey
+PgStat_IOContextOps
+PgStat_IOOpCounters
PgStat_Kind
PgStat_KindInfo
PgStat_LocalState
--
2.34.1
Attachment: v28-0005-Add-system-view-tracking-IO-ops-per-backend-type.patch (text/x-patch)
From e2dbdaf6cb71a1494df28d4fd7b5b13ed09c348e Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Thu, 11 Aug 2022 18:28:50 -0400
Subject: [PATCH v28 5/5] Add system view tracking IO ops per backend type
Add pg_stat_io, a system view which tracks the number of IOOp (allocs,
extends, fsyncs, reads, and writes) done through each IOContext (shared
buffers, local buffers, strategy buffers) by each type of backend (e.g.
client backend, checkpointer).
Some BackendTypes do not accumulate IO operations statistics and will
not be included in the view.
Some IOContexts are not used by some BackendTypes and will not be in the
view. For example, checkpointer does not use a BufferAccessStrategy
(currently), so there will be no row for the "strategy" IOContext for
checkpointer.
Some IOOps are invalid in combination with certain IOContexts. Those
cells will be NULL in the view to distinguish between 0 observed IOOps
of that type and an invalid combination. For example, local buffers are
not fsync'd so cells for all BackendTypes for IOCONTEXT_LOCAL and
IOOP_FSYNC will be NULL.
Some BackendTypes never perform certain IOOps. Those cells will also be
NULL in the view. For example, bgwriter should not perform reads.
View stats are fetched from statistics incremented when a backend
performs an IO Operation and maintained by the cumulative statistics
subsystem.
Each row of the view is stats for a particular BackendType for a
particular IOContext (e.g. shared buffer accesses by checkpointer) and
each column in the view is the total number of IO Operations done (e.g.
writes).
So a cell in the view would be, for example, the number of shared
buffers written by checkpointer since the last stats reset.
Note that some of the cells in the view are redundant with fields in
pg_stat_bgwriter (e.g. buffers_backend), however these have been kept in
pg_stat_bgwriter for backwards compatibility. Deriving the redundant
pg_stat_bgwriter stats from the IO operations stats structures was also
problematic due to the separate reset targets for 'bgwriter' and
'io'.
Suggested by Andres Freund
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>, Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
doc/src/sgml/monitoring.sgml | 115 ++++++++++++++-
src/backend/catalog/system_views.sql | 12 ++
src/backend/utils/adt/pgstatfuncs.c | 100 +++++++++++++
src/include/catalog/pg_proc.dat | 9 ++
src/test/regress/expected/rules.out | 9 ++
src/test/regress/expected/stats.out | 201 +++++++++++++++++++++++++++
src/test/regress/sql/stats.sql | 103 ++++++++++++++
7 files changed, 548 insertions(+), 1 deletion(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 9440b41770..9949011ba3 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -448,6 +448,15 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</entry>
</row>
+ <row>
+ <entry><structname>pg_stat_io</structname><indexterm><primary>pg_stat_io</primary></indexterm></entry>
+ <entry>A row for each IO Context for each backend type showing
+ statistics about backend IO operations. See
+ <link linkend="monitoring-pg-stat-io-view">
+ <structname>pg_stat_io</structname></link> for details.
+ </entry>
+ </row>
+
<row>
<entry><structname>pg_stat_wal</structname><indexterm><primary>pg_stat_wal</primary></indexterm></entry>
<entry>One row only, showing statistics about WAL activity. See
@@ -3600,7 +3609,111 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
<structfield>stats_reset</structfield> <type>timestamp with time zone</type>
</para>
<para>
- Time at which these statistics were last reset
+ Time at which these statistics were last reset.
+ </para></entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ </sect2>
+
+ <sect2 id="monitoring-pg-stat-io-view">
+ <title><structname>pg_stat_io</structname></title>
+
+ <indexterm>
+ <primary>pg_stat_io</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_stat_io</structname> view has one row for each
+ combination of backend type and IO Context, containing cluster-wide
+ statistics for that backend type and IO Context.
+ </para>
+
+ <table id="pg-stat-io-view" xreflabel="pg_stat_io">
+ <title><structname>pg_stat_io</structname> View</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>backend_type</structfield> <type>text</type>
+ </para>
+ <para>
+ Type of backend (e.g. background worker, autovacuum worker).
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>io_context</structfield> <type>text</type>
+ </para>
+ <para>
+ IO Context used (e.g. Shared, Local, Strategy).
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>alloc</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers allocated.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>extend</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of blocks extended.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>fsync</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of blocks fsynced.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>read</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of blocks read.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>write</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of blocks written.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
+ </para>
+ <para>
+ Time at which these statistics were last reset.
</para></entry>
</row>
</tbody>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 5a844b63a1..3d52a664b0 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1115,6 +1115,18 @@ CREATE VIEW pg_stat_bgwriter AS
pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+CREATE VIEW pg_stat_io AS
+SELECT
+ b.backend_type,
+ b.io_context,
+ b.alloc,
+ b.extend,
+ b.fsync,
+ b.read,
+ b.write,
+ b.stats_reset
+FROM pg_stat_get_io() b;
+
CREATE VIEW pg_stat_wal AS
SELECT
w.wal_records,
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index cda4447e53..262a5ac6c7 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1733,6 +1733,106 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
}
+Datum
+pg_stat_get_io(PG_FUNCTION_ARGS)
+{
+ PgStat_BackendIOContextOps *backends_io_stats;
+ ReturnSetInfo *rsinfo;
+ Datum reset_time;
+
+ /*
+ * When adding a new column to the pg_stat_io view, add a new enum value
+ * here above IO_NUM_COLUMNS.
+ */
+ enum
+ {
+ IO_COLUMN_BACKEND_TYPE,
+ IO_COLUMN_IO_CONTEXT,
+ IO_COLUMN_ALLOCS,
+ IO_COLUMN_EXTENDS,
+ IO_COLUMN_FSYNCS,
+ IO_COLUMN_READS,
+ IO_COLUMN_WRITES,
+ IO_COLUMN_RESET_TIME,
+ IO_NUM_COLUMNS,
+ };
+
+#define IO_COLUMN_IOOP_OFFSET (IO_COLUMN_IO_CONTEXT + 1)
+
+ SetSingleFuncCall(fcinfo, 0);
+ rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+
+ backends_io_stats = pgstat_fetch_backend_io_context_ops();
+
+ reset_time = TimestampTzGetDatum(backends_io_stats->stat_reset_timestamp);
+
+ for (int bktype = 0; bktype < BACKEND_NUM_TYPES; bktype++)
+ {
+ Datum bktype_desc = CStringGetTextDatum(GetBackendTypeDesc(bktype));
+ bool expect_backend_stats = true;
+ PgStat_IOContextOps *io_context_ops = &backends_io_stats->stats[bktype];
+
+ /*
+ * For those BackendTypes without IO Operation stats, skip
+ * representing them in the view altogether.
+ */
+ if (!pgstat_io_op_stats_collected(bktype))
+ expect_backend_stats = false;
+
+ for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ {
+ PgStat_IOOpCounters *counters = &io_context_ops->data[io_context];
+ Datum values[IO_NUM_COLUMNS];
+ bool nulls[IO_NUM_COLUMNS];
+
+ /*
+ * Some combinations of IOCONTEXT and BackendType are not valid
+ * for any type of IO Operation. In such cases, omit the entire
+ * row from the view.
+ */
+ if (!expect_backend_stats ||
+ !pgstat_bktype_io_context_valid(bktype, io_context))
+ {
+ pgstat_io_context_ops_assert_zero(counters);
+ continue;
+ }
+
+ memset(values, 0, sizeof(values));
+ memset(nulls, 0, sizeof(nulls));
+
+ values[IO_COLUMN_BACKEND_TYPE] = bktype_desc;
+ values[IO_COLUMN_IO_CONTEXT] = CStringGetTextDatum(
+ pgstat_io_context_desc(io_context));
+ values[IO_COLUMN_ALLOCS] = Int64GetDatum(counters->allocs);
+ values[IO_COLUMN_EXTENDS] = Int64GetDatum(counters->extends);
+ values[IO_COLUMN_FSYNCS] = Int64GetDatum(counters->fsyncs);
+ values[IO_COLUMN_READS] = Int64GetDatum(counters->reads);
+ values[IO_COLUMN_WRITES] = Int64GetDatum(counters->writes);
+ values[IO_COLUMN_RESET_TIME] = TimestampTzGetDatum(reset_time);
+
+
+ /*
+ * Some combinations of BackendType and IOOp and of IOContext and
+ * IOOp are not valid. Set these cells in the view to NULL and assert
+ * that these stats are zero as expected.
+ */
+ for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
+ {
+ if (!pgstat_bktype_io_op_valid(bktype, io_op) ||
+ !pgstat_io_context_io_op_valid(io_context, io_op))
+ {
+ pgstat_io_op_assert_zero(counters, io_op);
+ nulls[io_op + IO_COLUMN_IOOP_OFFSET] = true;
+ }
+ }
+
+ tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
+ }
+ }
+
+ return (Datum) 0;
+}
+
/*
* Returns statistics of WAL activity
*/
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index be47583122..4aefebc7f8 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5646,6 +5646,15 @@
proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
+{ oid => '8459', descr => 'statistics: per backend type IO statistics',
+ proname => 'pg_stat_get_io', provolatile => 'v',
+ prorows => '14', proretset => 't',
+ proparallel => 'r', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,text,int8,int8,int8,int8,int8,timestamptz}',
+ proargmodes => '{o,o,o,o,o,o,o,o}',
+ proargnames => '{backend_type,io_context,alloc,extend,fsync,read,write,stats_reset}',
+ prosrc => 'pg_stat_get_io' },
+
{ oid => '1136', descr => 'statistics: information about WAL activity',
proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
proparallel => 'r', prorettype => 'record', proargtypes => '',
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 7ec3d2688f..d122c36556 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1873,6 +1873,15 @@ pg_stat_gssapi| SELECT s.pid,
s.gss_enc AS encrypted
FROM pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid, query_id)
WHERE (s.client_port IS NOT NULL);
+pg_stat_io| SELECT b.backend_type,
+ b.io_context,
+ b.alloc,
+ b.extend,
+ b.fsync,
+ b.read,
+ b.write,
+ b.stats_reset
+ FROM pg_stat_get_io() b(backend_type, io_context, alloc, extend, fsync, read, write, stats_reset);
pg_stat_progress_analyze| SELECT s.pid,
s.datid,
d.datname,
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index 6b233ff4c0..a75fc91c57 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -796,4 +796,205 @@ SELECT pg_stat_get_subscription_stats(NULL);
(1 row)
+-- Test that allocs, extends, reads, and writes to Shared Buffers and fsyncs
+-- done to ensure durability of Shared Buffers are tracked in pg_stat_io.
+SELECT sum(alloc) AS io_sum_shared_allocs_before FROM pg_stat_io WHERE io_context = 'Shared' \gset
+SELECT sum(extend) AS io_sum_shared_extends_before FROM pg_stat_io WHERE io_context = 'Shared' \gset
+SELECT sum(fsync) AS io_sum_shared_fsyncs_before FROM pg_stat_io WHERE io_context = 'Shared' \gset
+SELECT sum(read) AS io_sum_shared_reads_before FROM pg_stat_io WHERE io_context = 'Shared' \gset
+SELECT sum(write) AS io_sum_shared_writes_before FROM pg_stat_io WHERE io_context = 'Shared' \gset
+-- Create a regular table and insert some data to generate IOCONTEXT_SHARED allocs and extends.
+CREATE TABLE test_io_shared(a int);
+INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+-- After a checkpoint, there should be some additional IOCONTEXT_SHARED writes and fsyncs.
+CHECKPOINT;
+SELECT sum(alloc) AS io_sum_shared_allocs_after FROM pg_stat_io WHERE io_context = 'Shared' \gset
+SELECT sum(extend) AS io_sum_shared_extends_after FROM pg_stat_io WHERE io_context = 'Shared' \gset
+SELECT sum(write) AS io_sum_shared_writes_after FROM pg_stat_io WHERE io_context = 'Shared' \gset
+SELECT sum(fsync) AS io_sum_shared_fsyncs_after FROM pg_stat_io WHERE io_context = 'Shared' \gset
+SELECT :io_sum_shared_allocs_after > :io_sum_shared_allocs_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_shared_extends_after > :io_sum_shared_extends_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT current_setting('fsync') = 'off' OR :io_sum_shared_fsyncs_after > :io_sum_shared_fsyncs_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_shared_writes_after > :io_sum_shared_writes_before;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Change the tablespace so that the table is rewritten directly, then SELECT
+-- from it to cause it to be read back into Shared Buffers.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE test_io_shared_stats_tblspc LOCATION '';
+ALTER TABLE test_io_shared SET TABLESPACE test_io_shared_stats_tblspc;
+SELECT COUNT(*) FROM test_io_shared;
+ count
+-------
+ 100
+(1 row)
+
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(read) AS io_sum_shared_reads_after FROM pg_stat_io WHERE io_context = 'Shared' \gset
+SELECT :io_sum_shared_reads_after > :io_sum_shared_reads_before;
+ ?column?
+----------
+ t
+(1 row)
+
+DROP TABLE test_io_shared;
+DROP TABLESPACE test_io_shared_stats_tblspc;
+-- Test that allocs, extends, reads, and writes of temporary tables are tracked
+-- in pg_stat_io.
+CREATE TEMPORARY TABLE test_io_local(a int, b TEXT);
+SELECT sum(alloc) AS io_sum_local_allocs_before FROM pg_stat_io WHERE io_context = 'Local' \gset
+SELECT sum(extend) AS io_sum_local_extends_before FROM pg_stat_io WHERE io_context = 'Local' \gset
+SELECT sum(read) AS io_sum_local_reads_before FROM pg_stat_io WHERE io_context = 'Local' \gset
+SELECT sum(write) AS io_sum_local_writes_before FROM pg_stat_io WHERE io_context = 'Local' \gset
+-- Insert enough values that we need to reuse and write out dirty local
+-- buffers.
+INSERT INTO test_io_local SELECT generate_series(1, 80000) as id,
+'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+-- Read in evicted buffers.
+SELECT COUNT(*) FROM test_io_local;
+ count
+-------
+ 80000
+(1 row)
+
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(alloc) AS io_sum_local_allocs_after FROM pg_stat_io WHERE io_context = 'Local' \gset
+SELECT sum(extend) AS io_sum_local_extends_after FROM pg_stat_io WHERE io_context = 'Local' \gset
+SELECT sum(read) AS io_sum_local_reads_after FROM pg_stat_io WHERE io_context = 'Local' \gset
+SELECT sum(write) AS io_sum_local_writes_after FROM pg_stat_io WHERE io_context = 'Local' \gset
+SELECT :io_sum_local_allocs_after > :io_sum_local_allocs_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_local_extends_after > :io_sum_local_extends_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_local_reads_after > :io_sum_local_reads_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_local_writes_after > :io_sum_local_writes_before;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Test that, when using a Strategy, reusing buffers from the Strategy ring
+-- counts as "Strategy" allocs in pg_stat_io. Also test that Strategy reads are
+-- counted as such.
+-- Set wal_skip_threshold smaller than the expected size of test_io_strategy so
+-- that, even if wal_level is minimal, VACUUM FULL will fsync the newly
+-- rewritten test_io_strategy instead of writing it to WAL. Writing it to WAL
+-- will result in the newly written relation pages being in shared buffers --
+-- preventing us from testing BufferAccessStrategy allocs and reads.
+SET wal_skip_threshold = '1 kB';
+SELECT sum(alloc) AS io_sum_strategy_allocs_before FROM pg_stat_io WHERE io_context = 'Strategy' \gset
+SELECT sum(read) AS io_sum_strategy_reads_before FROM pg_stat_io WHERE io_context = 'Strategy' \gset
+CREATE TABLE test_io_strategy(a INT, b INT);
+ALTER TABLE test_io_strategy SET (autovacuum_enabled = 'false');
+INSERT INTO test_io_strategy SELECT i, i from generate_series(1, 8000)i;
+-- Ensure that the next VACUUM will need to perform IO by rewriting the table
+-- first with VACUUM (FULL).
+VACUUM (FULL) test_io_strategy;
+VACUUM (PARALLEL 0) test_io_strategy;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(alloc) AS io_sum_strategy_allocs_after FROM pg_stat_io WHERE io_context = 'Strategy' \gset
+SELECT sum(read) AS io_sum_strategy_reads_after FROM pg_stat_io WHERE io_context = 'Strategy' \gset
+SELECT :io_sum_strategy_allocs_after > :io_sum_strategy_allocs_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_strategy_reads_after > :io_sum_strategy_reads_before;
+ ?column?
+----------
+ t
+(1 row)
+
+DROP TABLE test_io_strategy;
+-- Hope that the previous value of wal_skip_threshold was the default. We
+-- can't use BEGIN...SET LOCAL since VACUUM can't be run inside a transaction
+-- block.
+RESET wal_skip_threshold;
+-- Test that Strategy extends performed when creating a relation are
+-- counted in pg_stat_io.
+-- A CTAS uses a Bulkwrite strategy.
+SELECT sum(extend) AS io_sum_strategy_extends_before FROM pg_stat_io WHERE io_context = 'Strategy' \gset
+CREATE TABLE test_io_strategy_extend AS SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(extend) AS io_sum_strategy_extends_after FROM pg_stat_io WHERE io_context = 'Strategy' \gset
+SELECT :io_sum_strategy_extends_after > :io_sum_strategy_extends_before;
+ ?column?
+----------
+ t
+(1 row)
+
+DROP TABLE test_io_strategy_extend;
+-- Test stats reset
+SELECT sum(alloc) + sum(extend) + sum(fsync) + sum(read) + sum(write) AS io_stats_pre_reset FROM pg_stat_io \gset
+SELECT pg_stat_reset_shared('io');
+ pg_stat_reset_shared
+----------------------
+
+(1 row)
+
+SELECT sum(alloc) + sum(extend) + sum(fsync) + sum(read) + sum(write) AS io_stats_post_reset FROM pg_stat_io \gset
+SELECT :io_stats_post_reset < :io_stats_pre_reset;
+ ?column?
+----------
+ t
+(1 row)
+
-- End of Stats Test
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index 096f00ce8b..090cc67296 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -396,4 +396,107 @@ SELECT pg_stat_get_replication_slot(NULL);
SELECT pg_stat_get_subscription_stats(NULL);
+
+-- Test that allocs, extends, reads, and writes to Shared Buffers and fsyncs
+-- done to ensure durability of Shared Buffers are tracked in pg_stat_io.
+SELECT sum(alloc) AS io_sum_shared_allocs_before FROM pg_stat_io WHERE io_context = 'Shared' \gset
+SELECT sum(extend) AS io_sum_shared_extends_before FROM pg_stat_io WHERE io_context = 'Shared' \gset
+SELECT sum(fsync) AS io_sum_shared_fsyncs_before FROM pg_stat_io WHERE io_context = 'Shared' \gset
+SELECT sum(read) AS io_sum_shared_reads_before FROM pg_stat_io WHERE io_context = 'Shared' \gset
+SELECT sum(write) AS io_sum_shared_writes_before FROM pg_stat_io WHERE io_context = 'Shared' \gset
+-- Create a regular table and insert some data to generate IOCONTEXT_SHARED allocs and extends.
+CREATE TABLE test_io_shared(a int);
+INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+-- After a checkpoint, there should be some additional IOCONTEXT_SHARED writes and fsyncs.
+CHECKPOINT;
+SELECT sum(alloc) AS io_sum_shared_allocs_after FROM pg_stat_io WHERE io_context = 'Shared' \gset
+SELECT sum(extend) AS io_sum_shared_extends_after FROM pg_stat_io WHERE io_context = 'Shared' \gset
+SELECT sum(write) AS io_sum_shared_writes_after FROM pg_stat_io WHERE io_context = 'Shared' \gset
+SELECT sum(fsync) AS io_sum_shared_fsyncs_after FROM pg_stat_io WHERE io_context = 'Shared' \gset
+SELECT :io_sum_shared_allocs_after > :io_sum_shared_allocs_before;
+SELECT :io_sum_shared_extends_after > :io_sum_shared_extends_before;
+SELECT current_setting('fsync') = 'off' OR :io_sum_shared_fsyncs_after > :io_sum_shared_fsyncs_before;
+SELECT :io_sum_shared_writes_after > :io_sum_shared_writes_before;
+-- Change the tablespace so that the table is rewritten directly, then SELECT
+-- from it to cause it to be read back into Shared Buffers.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE test_io_shared_stats_tblspc LOCATION '';
+ALTER TABLE test_io_shared SET TABLESPACE test_io_shared_stats_tblspc;
+SELECT COUNT(*) FROM test_io_shared;
+SELECT pg_stat_force_next_flush();
+SELECT sum(read) AS io_sum_shared_reads_after FROM pg_stat_io WHERE io_context = 'Shared' \gset
+SELECT :io_sum_shared_reads_after > :io_sum_shared_reads_before;
+DROP TABLE test_io_shared;
+DROP TABLESPACE test_io_shared_stats_tblspc;
+
+-- Test that allocs, extends, reads, and writes of temporary tables are tracked
+-- in pg_stat_io.
+CREATE TEMPORARY TABLE test_io_local(a int, b TEXT);
+SELECT sum(alloc) AS io_sum_local_allocs_before FROM pg_stat_io WHERE io_context = 'Local' \gset
+SELECT sum(extend) AS io_sum_local_extends_before FROM pg_stat_io WHERE io_context = 'Local' \gset
+SELECT sum(read) AS io_sum_local_reads_before FROM pg_stat_io WHERE io_context = 'Local' \gset
+SELECT sum(write) AS io_sum_local_writes_before FROM pg_stat_io WHERE io_context = 'Local' \gset
+-- Insert enough values that we need to reuse and write out dirty local
+-- buffers.
+INSERT INTO test_io_local SELECT generate_series(1, 80000) as id,
+'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+-- Read in evicted buffers.
+SELECT COUNT(*) FROM test_io_local;
+SELECT pg_stat_force_next_flush();
+SELECT sum(alloc) AS io_sum_local_allocs_after FROM pg_stat_io WHERE io_context = 'Local' \gset
+SELECT sum(extend) AS io_sum_local_extends_after FROM pg_stat_io WHERE io_context = 'Local' \gset
+SELECT sum(read) AS io_sum_local_reads_after FROM pg_stat_io WHERE io_context = 'Local' \gset
+SELECT sum(write) AS io_sum_local_writes_after FROM pg_stat_io WHERE io_context = 'Local' \gset
+SELECT :io_sum_local_allocs_after > :io_sum_local_allocs_before;
+SELECT :io_sum_local_extends_after > :io_sum_local_extends_before;
+SELECT :io_sum_local_reads_after > :io_sum_local_reads_before;
+SELECT :io_sum_local_writes_after > :io_sum_local_writes_before;
+
+-- Test that, when using a Strategy, reusing buffers from the Strategy ring
+-- counts as "Strategy" allocs in pg_stat_io. Also test that Strategy reads are
+-- counted as such.
+
+-- Set wal_skip_threshold smaller than the expected size of test_io_strategy so
+-- that, even if wal_level is minimal, VACUUM FULL will fsync the newly
+-- rewritten test_io_strategy instead of writing it to WAL. Writing it to WAL
+-- will result in the newly written relation pages being in shared buffers --
+-- preventing us from testing BufferAccessStrategy allocs and reads.
+SET wal_skip_threshold = '1 kB';
+SELECT sum(alloc) AS io_sum_strategy_allocs_before FROM pg_stat_io WHERE io_context = 'Strategy' \gset
+SELECT sum(read) AS io_sum_strategy_reads_before FROM pg_stat_io WHERE io_context = 'Strategy' \gset
+CREATE TABLE test_io_strategy(a INT, b INT);
+ALTER TABLE test_io_strategy SET (autovacuum_enabled = 'false');
+INSERT INTO test_io_strategy SELECT i, i from generate_series(1, 8000)i;
+-- Ensure that the next VACUUM will need to perform IO by rewriting the table
+-- first with VACUUM (FULL).
+VACUUM (FULL) test_io_strategy;
+VACUUM (PARALLEL 0) test_io_strategy;
+SELECT pg_stat_force_next_flush();
+SELECT sum(alloc) AS io_sum_strategy_allocs_after FROM pg_stat_io WHERE io_context = 'Strategy' \gset
+SELECT sum(read) AS io_sum_strategy_reads_after FROM pg_stat_io WHERE io_context = 'Strategy' \gset
+SELECT :io_sum_strategy_allocs_after > :io_sum_strategy_allocs_before;
+SELECT :io_sum_strategy_reads_after > :io_sum_strategy_reads_before;
+DROP TABLE test_io_strategy;
+-- Hope that the previous value of wal_skip_threshold was the default. We
+-- can't use BEGIN...SET LOCAL since VACUUM can't be run inside a transaction
+-- block.
+RESET wal_skip_threshold;
+
+-- Test that Strategy extends performed when creating a relation are
+-- counted in pg_stat_io.
+-- A CTAS uses a Bulkwrite strategy.
+SELECT sum(extend) AS io_sum_strategy_extends_before FROM pg_stat_io WHERE io_context = 'Strategy' \gset
+CREATE TABLE test_io_strategy_extend AS SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+SELECT sum(extend) AS io_sum_strategy_extends_after FROM pg_stat_io WHERE io_context = 'Strategy' \gset
+SELECT :io_sum_strategy_extends_after > :io_sum_strategy_extends_before;
+DROP TABLE test_io_strategy_extend;
+
+-- Test stats reset
+SELECT sum(alloc) + sum(extend) + sum(fsync) + sum(read) + sum(write) AS io_stats_pre_reset FROM pg_stat_io \gset
+SELECT pg_stat_reset_shared('io');
+SELECT sum(alloc) + sum(extend) + sum(fsync) + sum(read) + sum(write) AS io_stats_post_reset FROM pg_stat_io \gset
+SELECT :io_stats_post_reset < :io_stats_pre_reset;
+
-- End of Stats Test
--
2.34.1
Attachment: v28-0002-Remove-unneeded-call-to-pgstat_report_wal.patch (text/x-patch)
From b203ed8e9900e861e2f99b2703461a065781ab17 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Thu, 11 Aug 2022 18:28:40 -0400
Subject: [PATCH v28 2/5] Remove unneeded call to pgstat_report_wal()
pgstat_report_stat() will be called before shutdown, so an explicit call
to pgstat_report_wal() is redundant.
---
src/backend/postmaster/walwriter.c | 11 -----------
1 file changed, 11 deletions(-)
diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c
index e926f8c27c..beb46dcb55 100644
--- a/src/backend/postmaster/walwriter.c
+++ b/src/backend/postmaster/walwriter.c
@@ -293,18 +293,7 @@ HandleWalWriterInterrupts(void)
}
if (ShutdownRequestPending)
- {
- /*
- * Force reporting remaining WAL statistics at process exit.
- *
- * Since pgstat_report_wal is invoked with 'force' is false in main
- * loop to avoid overloading the cumulative stats system, there may
- * exist unreported stats counters for the WAL writer.
- */
- pgstat_report_wal(true);
-
proc_exit(0);
- }
/* Perform logging of memory contexts of this process */
if (LogMemoryContextPending)
--
2.34.1
Attachment: v28-0001-Add-BackendType-for-standalone-backends.patch (text/x-patch)
From 478d4747270df3b811723f7828353abe056c1631 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Thu, 11 Aug 2022 18:28:24 -0400
Subject: [PATCH v28 1/5] Add BackendType for standalone backends
All backends should have a BackendType to enable statistics reporting
per BackendType.
Add a new BackendType for standalone backends, B_STANDALONE_BACKEND (and
alphabetize the BackendTypes). Both the bootstrap backend and single
user mode backends will have BackendType B_STANDALONE_BACKEND.
Author: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://www.postgresql.org/message-id/CAAKRu_aaq33UnG4TXq3S-OSXGWj1QGf0sU%2BECH4tNwGFNERkZA%40mail.gmail.com
---
src/backend/utils/init/miscinit.c | 17 +++++++++++------
src/include/miscadmin.h | 5 +++--
2 files changed, 14 insertions(+), 8 deletions(-)
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index bd973ba613..bf3871a774 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -176,6 +176,8 @@ InitStandaloneProcess(const char *argv0)
{
Assert(!IsPostmasterEnvironment);
+ MyBackendType = B_STANDALONE_BACKEND;
+
/*
* Start our win32 signal implementation
*/
@@ -255,6 +257,9 @@ GetBackendTypeDesc(BackendType backendType)
case B_INVALID:
backendDesc = "not initialized";
break;
+ case B_ARCHIVER:
+ backendDesc = "archiver";
+ break;
case B_AUTOVAC_LAUNCHER:
backendDesc = "autovacuum launcher";
break;
@@ -273,6 +278,12 @@ GetBackendTypeDesc(BackendType backendType)
case B_CHECKPOINTER:
backendDesc = "checkpointer";
break;
+ case B_LOGGER:
+ backendDesc = "logger";
+ break;
+ case B_STANDALONE_BACKEND:
+ backendDesc = "standalone backend";
+ break;
case B_STARTUP:
backendDesc = "startup";
break;
@@ -285,12 +296,6 @@ GetBackendTypeDesc(BackendType backendType)
case B_WAL_WRITER:
backendDesc = "walwriter";
break;
- case B_ARCHIVER:
- backendDesc = "archiver";
- break;
- case B_LOGGER:
- backendDesc = "logger";
- break;
}
return backendDesc;
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 067b729d5a..7c41b27994 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -316,18 +316,19 @@ extern void SwitchBackToLocalLatch(void);
typedef enum BackendType
{
B_INVALID = 0,
+ B_ARCHIVER,
B_AUTOVAC_LAUNCHER,
B_AUTOVAC_WORKER,
B_BACKEND,
B_BG_WORKER,
B_BG_WRITER,
B_CHECKPOINTER,
+ B_LOGGER,
+ B_STANDALONE_BACKEND,
B_STARTUP,
B_WAL_RECEIVER,
B_WAL_SENDER,
B_WAL_WRITER,
- B_ARCHIVER,
- B_LOGGER,
} BackendType;
extern PGDLLIMPORT BackendType MyBackendType;
--
2.34.1
Attachment: v28-0004-Aggregate-IO-operation-stats-per-BackendType.patch (text/x-patch)
From 0f141fa7f97a57b8628b1b6fd6029bd3782f16a1 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 22 Aug 2022 11:35:20 -0400
Subject: [PATCH v28 4/5] Aggregate IO operation stats per BackendType
Stats on IOOps for all IOContexts for a backend are tracked locally. Add
functionality for backends to flush these stats to shared memory and
accumulate them with those from all other backends, exited and live.
Also add reset and snapshot functions used by cumulative stats system
for management of these statistics.
The aggregated stats in shared memory could be extended in the future
with per-backend stats -- useful for per-connection IO statistics and
monitoring.
Some BackendTypes do not flush their pending statistics at regular
intervals, so they explicitly call pgstat_flush_io_ops() during the
course of normal operation to flush their backend-local IO Operation
statistics to shared memory in a timely manner.
Because not all BackendType, IOOp, IOContext combinations are valid, the
validity of the stats is checked before flushing pending stats and
before reading in the existing stats file to shared memory.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>, Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
doc/src/sgml/monitoring.sgml | 2 +
src/backend/utils/activity/pgstat.c | 57 +++++++
src/backend/utils/activity/pgstat_bgwriter.c | 7 +-
.../utils/activity/pgstat_checkpointer.c | 7 +-
src/backend/utils/activity/pgstat_io_ops.c | 155 ++++++++++++++++++
src/backend/utils/activity/pgstat_relation.c | 15 +-
src/backend/utils/activity/pgstat_shmem.c | 4 +
src/backend/utils/activity/pgstat_wal.c | 4 +-
src/backend/utils/adt/pgstatfuncs.c | 4 +-
src/include/miscadmin.h | 2 +
src/include/pgstat.h | 48 +++++-
src/include/utils/pgstat_internal.h | 34 ++++
src/tools/pgindent/typedefs.list | 3 +
13 files changed, 335 insertions(+), 7 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 1d9509a2f6..9440b41770 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -5360,6 +5360,8 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
the <structname>pg_stat_bgwriter</structname>
view, <literal>archiver</literal> to reset all the counters shown in
the <structname>pg_stat_archiver</structname> view,
+ <literal>io</literal> to reset all the counters shown in the
+ <structname>pg_stat_io</structname> view,
<literal>wal</literal> to reset all the counters shown in the
<structname>pg_stat_wal</structname> view or
<literal>recovery_prefetch</literal> to reset all the counters shown
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 88e5dd1b2b..ae33d090c0 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -359,6 +359,15 @@ static const PgStat_KindInfo pgstat_kind_infos[PGSTAT_NUM_KINDS] = {
.snapshot_cb = pgstat_checkpointer_snapshot_cb,
},
+ [PGSTAT_KIND_IOOPS] = {
+ .name = "io_ops",
+
+ .fixed_amount = true,
+
+ .reset_all_cb = pgstat_io_ops_reset_all_cb,
+ .snapshot_cb = pgstat_io_ops_snapshot_cb,
+ },
+
[PGSTAT_KIND_SLRU] = {
.name = "slru",
@@ -582,6 +591,7 @@ pgstat_report_stat(bool force)
/* Don't expend a clock check if nothing to do */
if (dlist_is_empty(&pgStatPending) &&
+ !have_ioopstats &&
!have_slrustats &&
!pgstat_have_pending_wal())
{
@@ -628,6 +638,9 @@ pgstat_report_stat(bool force)
/* flush database / relation / function / ... stats */
partial_flush |= pgstat_flush_pending_entries(nowait);
+ /* flush IO Operations stats */
+ partial_flush |= pgstat_flush_io_ops(nowait);
+
/* flush wal stats */
partial_flush |= pgstat_flush_wal(nowait);
@@ -1312,6 +1325,14 @@ pgstat_write_statsfile(void)
pgstat_build_snapshot_fixed(PGSTAT_KIND_CHECKPOINTER);
write_chunk_s(fpout, &pgStatLocal.snapshot.checkpointer);
+ /*
+ * Write IO Operations stats struct
+ */
+ pgstat_build_snapshot_fixed(PGSTAT_KIND_IOOPS);
+ write_chunk_s(fpout, &pgStatLocal.snapshot.io_ops.stat_reset_timestamp);
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ write_chunk_s(fpout, &pgStatLocal.snapshot.io_ops.stats[i]);
+
/*
* Write SLRU stats struct
*/
@@ -1486,6 +1507,42 @@ pgstat_read_statsfile(void)
if (!read_chunk_s(fpin, &shmem->checkpointer.stats))
goto error;
+ /*
+ * Read IO Operations stats struct
+ */
+ if (!read_chunk_s(fpin, &shmem->io_ops.stat_reset_timestamp))
+ goto error;
+
+ for (int backend_type = 0; backend_type < BACKEND_NUM_TYPES; backend_type++)
+ {
+ PgStatShared_IOContextOps *backend_io_context_ops = &shmem->io_ops.stats[backend_type];
+ bool expect_backend_stats = true;
+
+ if (!pgstat_io_op_stats_collected(backend_type))
+ expect_backend_stats = false;
+
+ for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ {
+ if (!expect_backend_stats ||
+ !pgstat_bktype_io_context_valid(backend_type, io_context))
+ {
+ pgstat_io_context_ops_assert_zero(&backend_io_context_ops->data[io_context]);
+ continue;
+ }
+
+ for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
+ {
+ if (!pgstat_bktype_io_op_valid(backend_type, io_op) ||
+ !pgstat_io_context_io_op_valid(io_context, io_op))
+ pgstat_io_op_assert_zero(&backend_io_context_ops->data[io_context],
+ io_op);
+ }
+ }
+
+ if (!read_chunk_s(fpin, &backend_io_context_ops->data))
+ goto error;
+ }
+
/*
* Read SLRU stats struct
*/
diff --git a/src/backend/utils/activity/pgstat_bgwriter.c b/src/backend/utils/activity/pgstat_bgwriter.c
index fbb1edc527..3d7f90a1b7 100644
--- a/src/backend/utils/activity/pgstat_bgwriter.c
+++ b/src/backend/utils/activity/pgstat_bgwriter.c
@@ -24,7 +24,7 @@ PgStat_BgWriterStats PendingBgWriterStats = {0};
/*
- * Report bgwriter statistics
+ * Report bgwriter and IO Operation statistics
*/
void
pgstat_report_bgwriter(void)
@@ -56,6 +56,11 @@ pgstat_report_bgwriter(void)
* Clear out the statistics buffer, so it can be re-used.
*/
MemSet(&PendingBgWriterStats, 0, sizeof(PendingBgWriterStats));
+
+ /*
+ * Report IO Operations statistics
+ */
+ pgstat_flush_io_ops(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_checkpointer.c b/src/backend/utils/activity/pgstat_checkpointer.c
index af8d513e7b..cfcf127210 100644
--- a/src/backend/utils/activity/pgstat_checkpointer.c
+++ b/src/backend/utils/activity/pgstat_checkpointer.c
@@ -24,7 +24,7 @@ PgStat_CheckpointerStats PendingCheckpointerStats = {0};
/*
- * Report checkpointer statistics
+ * Report checkpointer and IO Operation statistics
*/
void
pgstat_report_checkpointer(void)
@@ -62,6 +62,11 @@ pgstat_report_checkpointer(void)
* Clear out the statistics buffer, so it can be re-used.
*/
MemSet(&PendingCheckpointerStats, 0, sizeof(PendingCheckpointerStats));
+
+ /*
+ * Report IO Operation statistics
+ */
+ pgstat_flush_io_ops(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_io_ops.c b/src/backend/utils/activity/pgstat_io_ops.c
index 9c70a4a6dd..5d5728e8bc 100644
--- a/src/backend/utils/activity/pgstat_io_ops.c
+++ b/src/backend/utils/activity/pgstat_io_ops.c
@@ -20,6 +20,39 @@
#include "utils/pgstat_internal.h"
static PgStat_IOContextOps pending_IOOpStats;
+bool have_ioopstats = false;
+
+/*
+ * Helper function to accumulate PgStat_IOOpCounters. If either of the
+ * passed-in PgStat_IOOpCounters are members of PgStatShared_IOContextOps, the
+ * caller is responsible for ensuring that the appropriate lock is held. This
+ * is not asserted because this function could plausibly be used to accumulate
+ * two local/pending PgStat_IOOpCounters.
+ */
+static void
+pgstat_accum_io_op(PgStat_IOOpCounters *shared, PgStat_IOOpCounters *local, IOOp io_op)
+{
+ switch (io_op)
+ {
+ case IOOP_ALLOC:
+ shared->allocs += local->allocs;
+ return;
+ case IOOP_EXTEND:
+ shared->extends += local->extends;
+ return;
+ case IOOP_FSYNC:
+ shared->fsyncs += local->fsyncs;
+ return;
+ case IOOP_READ:
+ shared->reads += local->reads;
+ return;
+ case IOOP_WRITE:
+ shared->writes += local->writes;
+ return;
+ }
+
+ elog(ERROR, "unrecognized IOOp value: %d", io_op);
+}
void
pgstat_count_io_op(IOOp io_op, IOContext io_context)
@@ -47,6 +80,79 @@ pgstat_count_io_op(IOOp io_op, IOContext io_context)
break;
}
+ have_ioopstats = true;
+}
+
+PgStat_BackendIOContextOps *
+pgstat_fetch_backend_io_context_ops(void)
+{
+ pgstat_snapshot_fixed(PGSTAT_KIND_IOOPS);
+
+ return &pgStatLocal.snapshot.io_ops;
+}
+
+/*
+ * Flush out locally pending IO Operation statistics entries
+ *
+ * If no stats have been recorded, this function returns false.
+ *
+ * If nowait is true, this function returns true if the lock could not be
+ * acquired. Otherwise return false.
+ */
+bool
+pgstat_flush_io_ops(bool nowait)
+{
+ PgStatShared_IOContextOps *type_shstats;
+ bool expect_backend_stats = true;
+
+ if (!have_ioopstats)
+ return false;
+
+ type_shstats =
+ &pgStatLocal.shmem->io_ops.stats[MyBackendType];
+
+ if (!nowait)
+ LWLockAcquire(&type_shstats->lock, LW_EXCLUSIVE);
+ else if (!LWLockConditionalAcquire(&type_shstats->lock, LW_EXCLUSIVE))
+ return true;
+
+ if (!pgstat_io_op_stats_collected(MyBackendType))
+ expect_backend_stats = false;
+
+ for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ {
+ PgStat_IOOpCounters *sharedent = &type_shstats->data[io_context];
+ PgStat_IOOpCounters *pendingent = &pending_IOOpStats.data[io_context];
+
+ if (!expect_backend_stats ||
+ !pgstat_bktype_io_context_valid(MyBackendType, io_context))
+ {
+ pgstat_io_context_ops_assert_zero(sharedent);
+ pgstat_io_context_ops_assert_zero(pendingent);
+ continue;
+ }
+
+ for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
+ {
+ if (!pgstat_bktype_io_op_valid(MyBackendType, io_op) ||
+ !pgstat_io_context_io_op_valid(io_context, io_op))
+ {
+ pgstat_io_op_assert_zero(sharedent, io_op);
+ pgstat_io_op_assert_zero(pendingent, io_op);
+ continue;
+ }
+
+ pgstat_accum_io_op(sharedent, pendingent, io_op);
+ }
+ }
+
+ LWLockRelease(&type_shstats->lock);
+
+ memset(&pending_IOOpStats, 0, sizeof(pending_IOOpStats));
+
+ have_ioopstats = false;
+
+ return false;
}
const char *
@@ -85,6 +191,55 @@ pgstat_io_op_desc(IOOp io_op)
elog(ERROR, "unrecognized IOOp value: %d", io_op);
}
+void
+pgstat_io_ops_reset_all_cb(TimestampTz ts)
+{
+ PgStatShared_BackendIOContextOps *backends_stats_shmem = &pgStatLocal.shmem->io_ops;
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ PgStatShared_IOContextOps *stats_shmem = &backends_stats_shmem->stats[i];
+
+ LWLockAcquire(&stats_shmem->lock, LW_EXCLUSIVE);
+
+ /*
+ * Use the lock in the first BackendType's PgStat_IOContextOps to
+ * protect the reset timestamp as well.
+ */
+ if (i == 0)
+ backends_stats_shmem->stat_reset_timestamp = ts;
+
+ memset(stats_shmem->data, 0, sizeof(stats_shmem->data));
+ LWLockRelease(&stats_shmem->lock);
+ }
+}
+
+void
+pgstat_io_ops_snapshot_cb(void)
+{
+ PgStatShared_BackendIOContextOps *backends_stats_shmem = &pgStatLocal.shmem->io_ops;
+ PgStat_BackendIOContextOps *backends_stats_snap = &pgStatLocal.snapshot.io_ops;
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ PgStatShared_IOContextOps *stats_shmem = &backends_stats_shmem->stats[i];
+ PgStat_IOContextOps *stats_snap = &backends_stats_snap->stats[i];
+
+ LWLockAcquire(&stats_shmem->lock, LW_SHARED);
+
+ /*
+ * Use the lock in the first BackendType's PgStat_IOContextOps to
+ * protect the reset timestamp as well.
+ */
+ if (i == 0)
+ backends_stats_snap->stat_reset_timestamp = backends_stats_shmem->stat_reset_timestamp;
+
+ memcpy(stats_snap->data, stats_shmem->data, sizeof(stats_shmem->data));
+ LWLockRelease(&stats_shmem->lock);
+ }
+
+}
+
/*
* IO Operation statistics are not collected for all BackendTypes.
*
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index a846d9ffb6..7a2fd1ccf9 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -205,7 +205,7 @@ pgstat_drop_relation(Relation rel)
}
/*
- * Report that the table was just vacuumed.
+ * Report that the table was just vacuumed and flush IO Operation statistics.
*/
void
pgstat_report_vacuum(Oid tableoid, bool shared,
@@ -257,10 +257,18 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
}
pgstat_unlock_entry(entry_ref);
+
+ /*
+ * Flush IO Operations statistics now. pgstat_report_stat() will flush IO
+ * Operation stats, however this will not be called after an entire
+ * autovacuum cycle is done -- which will likely vacuum many relations --
+ * or until the VACUUM command has processed all tables and committed.
+ */
+ pgstat_flush_io_ops(false);
}
/*
- * Report that the table was just analyzed.
+ * Report that the table was just analyzed and flush IO Operation statistics.
*
* Caller must provide new live- and dead-tuples estimates, as well as a
* flag indicating whether to reset the changes_since_analyze counter.
@@ -340,6 +348,9 @@ pgstat_report_analyze(Relation rel,
}
pgstat_unlock_entry(entry_ref);
+
+ /* see pgstat_report_vacuum() */
+ pgstat_flush_io_ops(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_shmem.c b/src/backend/utils/activity/pgstat_shmem.c
index 89060ef29a..2acfeb3192 100644
--- a/src/backend/utils/activity/pgstat_shmem.c
+++ b/src/backend/utils/activity/pgstat_shmem.c
@@ -202,6 +202,10 @@ StatsShmemInit(void)
LWLockInitialize(&ctl->checkpointer.lock, LWTRANCHE_PGSTATS_DATA);
LWLockInitialize(&ctl->slru.lock, LWTRANCHE_PGSTATS_DATA);
LWLockInitialize(&ctl->wal.lock, LWTRANCHE_PGSTATS_DATA);
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ LWLockInitialize(&ctl->io_ops.stats[i].lock,
+ LWTRANCHE_PGSTATS_DATA);
}
else
{
diff --git a/src/backend/utils/activity/pgstat_wal.c b/src/backend/utils/activity/pgstat_wal.c
index 5a878bd115..9cac407b42 100644
--- a/src/backend/utils/activity/pgstat_wal.c
+++ b/src/backend/utils/activity/pgstat_wal.c
@@ -34,7 +34,7 @@ static WalUsage prevWalUsage;
/*
* Calculate how much WAL usage counters have increased and update
- * shared statistics.
+ * shared WAL and IO Operation statistics.
*
* Must be called by processes that generate WAL, that do not call
* pgstat_report_stat(), like walwriter.
@@ -43,6 +43,8 @@ void
pgstat_report_wal(bool force)
{
pgstat_flush_wal(force);
+
+ pgstat_flush_io_ops(force);
}
/*
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index d9e2a79382..cda4447e53 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -2092,6 +2092,8 @@ pg_stat_reset_shared(PG_FUNCTION_ARGS)
pgstat_reset_of_kind(PGSTAT_KIND_BGWRITER);
pgstat_reset_of_kind(PGSTAT_KIND_CHECKPOINTER);
}
+ else if (strcmp(target, "io") == 0)
+ pgstat_reset_of_kind(PGSTAT_KIND_IOOPS);
else if (strcmp(target, "recovery_prefetch") == 0)
XLogPrefetchResetStats();
else if (strcmp(target, "wal") == 0)
@@ -2100,7 +2102,7 @@ pg_stat_reset_shared(PG_FUNCTION_ARGS)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("unrecognized reset target: \"%s\"", target),
- errhint("Target must be \"archiver\", \"bgwriter\", \"recovery_prefetch\", or \"wal\".")));
+ errhint("Target must be \"archiver\", \"io\", \"bgwriter\", \"recovery_prefetch\", or \"wal\".")));
PG_RETURN_VOID();
}
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 7c41b27994..f65e9635a3 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -331,6 +331,8 @@ typedef enum BackendType
B_WAL_WRITER,
} BackendType;
+#define BACKEND_NUM_TYPES (B_WAL_WRITER + 1)
+
extern PGDLLIMPORT BackendType MyBackendType;
extern const char *GetBackendTypeDesc(BackendType backendType);
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 32d1aec540..24a056fae6 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -48,6 +48,7 @@ typedef enum PgStat_Kind
PGSTAT_KIND_ARCHIVER,
PGSTAT_KIND_BGWRITER,
PGSTAT_KIND_CHECKPOINTER,
+ PGSTAT_KIND_IOOPS,
PGSTAT_KIND_SLRU,
PGSTAT_KIND_WAL,
} PgStat_Kind;
@@ -242,7 +243,7 @@ typedef struct PgStat_TableXactStatus
* ------------------------------------------------------------
*/
-#define PGSTAT_FILE_FORMAT_ID 0x01A5BCA7
+#define PGSTAT_FILE_FORMAT_ID 0x01A5BCA8
typedef struct PgStat_ArchiverStats
{
@@ -314,6 +315,12 @@ typedef struct PgStat_IOContextOps
PgStat_IOOpCounters data[IOCONTEXT_NUM_TYPES];
} PgStat_IOContextOps;
+typedef struct PgStat_BackendIOContextOps
+{
+ TimestampTz stat_reset_timestamp;
+ PgStat_IOContextOps stats[BACKEND_NUM_TYPES];
+} PgStat_BackendIOContextOps;
+
typedef struct PgStat_StatDBEntry
{
PgStat_Counter n_xact_commit;
@@ -496,6 +503,8 @@ extern PgStat_CheckpointerStats *pgstat_fetch_stat_checkpointer(void);
*/
extern void pgstat_count_io_op(IOOp io_op, IOContext io_context);
+extern PgStat_BackendIOContextOps *pgstat_fetch_backend_io_context_ops(void);
+extern bool pgstat_flush_io_ops(bool nowait);
extern const char *pgstat_io_context_desc(IOContext io_context);
extern const char *pgstat_io_op_desc(IOOp io_op);
@@ -506,6 +515,43 @@ extern bool pgstat_bktype_io_op_valid(BackendType bktype, IOOp io_op);
extern bool pgstat_io_context_io_op_valid(IOContext io_context, IOOp io_op);
extern bool pgstat_expect_io_op(BackendType bktype, IOContext io_context, IOOp io_op);
+/*
+ * Functions to assert that invalid IO Operation counters are zero. Used with
+ * the validation functions in pgstat_io_ops.c
+ */
+static inline void
+pgstat_io_context_ops_assert_zero(PgStat_IOOpCounters *counters)
+{
+ Assert(counters->allocs == 0 && counters->extends == 0 &&
+ counters->fsyncs == 0 && counters->reads == 0 &&
+ counters->writes == 0);
+}
+
+static inline void
+pgstat_io_op_assert_zero(PgStat_IOOpCounters *counters, IOOp io_op)
+{
+ switch (io_op)
+ {
+ case IOOP_ALLOC:
+ Assert(counters->allocs == 0);
+ return;
+ case IOOP_EXTEND:
+ Assert(counters->extends == 0);
+ return;
+ case IOOP_FSYNC:
+ Assert(counters->fsyncs == 0);
+ return;
+ case IOOP_READ:
+ Assert(counters->reads == 0);
+ return;
+ case IOOP_WRITE:
+ Assert(counters->writes == 0);
+ return;
+ }
+
+ elog(ERROR, "unrecognized IOOp value: %d", io_op);
+}
+
/*
* Functions in pgstat_database.c
diff --git a/src/include/utils/pgstat_internal.h b/src/include/utils/pgstat_internal.h
index 9303d05427..14c28ee787 100644
--- a/src/include/utils/pgstat_internal.h
+++ b/src/include/utils/pgstat_internal.h
@@ -329,6 +329,24 @@ typedef struct PgStatShared_Checkpointer
PgStat_CheckpointerStats reset_offset;
} PgStatShared_Checkpointer;
+typedef struct PgStatShared_IOContextOps
+{
+ /*
+ * lock protects ->data, PgStatShared_BackendIOContextOps->stats[0] also
+ * protects PgStatShared_BackendIOContextOps->stat_reset_timestamp.
+ */
+ LWLock lock;
+ PgStat_IOOpCounters data[IOCONTEXT_NUM_TYPES];
+} PgStatShared_IOContextOps;
+
+typedef struct PgStatShared_BackendIOContextOps
+{
+ /* ->stats_reset_timestamp is protected by ->stats[0].lock */
+ TimestampTz stat_reset_timestamp;
+ PgStatShared_IOContextOps stats[BACKEND_NUM_TYPES];
+} PgStatShared_BackendIOContextOps;
+
+
typedef struct PgStatShared_SLRU
{
/* lock protects ->stats */
@@ -419,6 +437,7 @@ typedef struct PgStat_ShmemControl
PgStatShared_Archiver archiver;
PgStatShared_BgWriter bgwriter;
PgStatShared_Checkpointer checkpointer;
+ PgStatShared_BackendIOContextOps io_ops;
PgStatShared_SLRU slru;
PgStatShared_Wal wal;
} PgStat_ShmemControl;
@@ -442,6 +461,8 @@ typedef struct PgStat_Snapshot
PgStat_CheckpointerStats checkpointer;
+ PgStat_BackendIOContextOps io_ops;
+
PgStat_SLRUStats slru[SLRU_NUM_ELEMENTS];
PgStat_WalStats wal;
@@ -549,6 +570,14 @@ extern void pgstat_database_reset_timestamp_cb(PgStatShared_Common *header, Time
extern bool pgstat_function_flush_cb(PgStat_EntryRef *entry_ref, bool nowait);
+/*
+ * Functions in pgstat_io_ops.c
+ */
+
+extern void pgstat_io_ops_reset_all_cb(TimestampTz ts);
+extern void pgstat_io_ops_snapshot_cb(void);
+
+
/*
* Functions in pgstat_relation.c
*/
@@ -640,6 +669,11 @@ extern void pgstat_create_transactional(PgStat_Kind kind, Oid dboid, Oid objoid)
extern PGDLLIMPORT PgStat_LocalState pgStatLocal;
+/*
+ * Variables in pgstat_io_ops.c
+ */
+
+extern PGDLLIMPORT bool have_ioopstats;
/*
* Variables in pgstat_slru.c
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 72378b2148..7f8029059c 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2016,12 +2016,14 @@ PgFdwRelationInfo
PgFdwScanState
PgIfAddrCallback
PgStatShared_Archiver
+PgStatShared_BackendIOContextOps
PgStatShared_BgWriter
PgStatShared_Checkpointer
PgStatShared_Common
PgStatShared_Database
PgStatShared_Function
PgStatShared_HashEntry
+PgStatShared_IOContextOps
PgStatShared_Relation
PgStatShared_ReplSlot
PgStatShared_SLRU
@@ -2029,6 +2031,7 @@ PgStatShared_Subscription
PgStatShared_Wal
PgStat_ArchiverStats
PgStat_BackendFunctionEntry
+PgStat_BackendIOContextOps
PgStat_BackendSubEntry
PgStat_BgWriterStats
PgStat_CheckpointerStats
--
2.34.1
Hi,
On 2022-08-22 13:15:18 -0400, Melanie Plageman wrote:
v28 attached.
Pushed 0001, 0002. Thanks!
- Andres
Hi,
On 2022-08-22 13:15:18 -0400, Melanie Plageman wrote:
v28 attached.
I've added the new structs I added to typedefs.list.
I've split the commit which adds all of the logic to track
IO operation statistics into two commits -- one which includes all of
the code to count IOOps for IOContexts locally in a backend and a second
which includes all of the code to accumulate and manage these with the
cumulative stats system.
Thanks!
A few notes about the commit which adds local IO Operation stats:
- There is a comment above pgstat_io_op_stats_collected() which mentions
the cumulative stats system even though this commit doesn't engage the
cumulative stats system. I wasn't sure if it was more or less
confusing to have two different versions of this comment.
Not worth being worried about...
- should pgstat_count_io_op() take BackendType as a parameter instead of
using MyBackendType internally?
I don't foresee a case where a different value would be passed in.
- pgstat_count_io_op() Assert()s that the passed-in IOOp and IOContext
are valid for this BackendType, but it doesn't check that all of the
pending stats which should be zero are zero. I thought this was okay
because if I did add that zero-check, it would be added to
pgstat_count_ioop() as well, and we already Assert() there that we can
count the op. Thus, it doesn't seem like checking that the stats are
zero would add any additional regression protection.
It's probably ok.
- I've kept pgstat_io_context_desc() and pgstat_io_op_desc() in the
commit which adds those types (the local stats commit), however they
are not used in that commit. I wasn't sure if I should keep them in
that commit or move them to the first commit using them (the commit
adding the new view).
- I've left pgstat_fetch_backend_io_context_ops() in the shared stats
commit, however it is not used until the commit which adds the view in
pg_stat_get_io(). I wasn't sure which way seemed better.
Think that's fine.
Notes on the commit which accumulates IO Operation stats in shared
memory:
- I've extended the usage of the Assert()s that IO Operation stats that
should be zero are. Previously we only checked the stats validity when
querying the view. Now we check it when flushing pending stats and
when reading the stats file into shared memory.
Note that the three locations with these validity checks (when
flushing pending stats, when reading stats file into shared memory,
and when querying the view) have similar looking code to loop through
and validate the stats. However, the actual action they perform if the
stats are valid is different for each site (adding counters together,
doing a read, setting nulls in a tuple column to true). Also, some of
these instances have other code interspersed in the loops which would
require additional looping if separated from this logic. So it was
difficult to see a way of combining these into a single helper
function.
All of them seem to repeat something like
+			if (!pgstat_bktype_io_op_valid(bktype, io_op) ||
+				!pgstat_io_context_io_op_valid(io_context, io_op))
perhaps those could be combined? Afaics nothing uses pgstat_bktype_io_op_valid
separately.
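Since the two checks always appear together, a combined wrapper might look like this (a standalone sketch; the enums and the exact validity rules here are simplified stand-ins for the real pgstat definitions, not the patch's actual code):

```c
#include <stdbool.h>

/* Simplified stand-ins for the real pgstat enums */
typedef enum { BKTYPE_BGWRITER, BKTYPE_CHECKPOINTER, BKTYPE_BACKEND } BackendType;
typedef enum { IOCONTEXT_LOCAL, IOCONTEXT_SHARED, IOCONTEXT_STRATEGY } IOContext;
typedef enum { IOOP_ALLOC, IOOP_EXTEND, IOOP_FSYNC, IOOP_READ, IOOP_WRITE } IOOp;

static bool
bktype_io_op_valid(BackendType bktype, IOOp io_op)
{
	/* bgwriter and checkpointer never read in buffers themselves */
	if ((bktype == BKTYPE_BGWRITER || bktype == BKTYPE_CHECKPOINTER) &&
		io_op == IOOP_READ)
		return false;
	return true;
}

static bool
io_context_io_op_valid(IOContext io_context, IOOp io_op)
{
	/* local buffers back temp tables, which are never fsync'd */
	if (io_context == IOCONTEXT_LOCAL && io_op == IOOP_FSYNC)
		return false;
	return true;
}

/* One combined entry point, since the two checks are always used together */
static bool
io_op_valid(BackendType bktype, IOContext io_context, IOOp io_op)
{
	return bktype_io_op_valid(bktype, io_op) &&
		io_context_io_op_valid(io_context, io_op);
}
```

Callers would then only ever see the combined predicate, and the two underlying helpers could become static.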
Subject: [PATCH v28 3/5] Track IO operation statistics locally
Introduce "IOOp", an IO operation done by a backend, and "IOContext",
the IO location source or target or IO type done by a backend. For
example, the checkpointer may write a shared buffer out. This would be
counted as an IOOp "write" on an IOContext IOCONTEXT_SHARED by
BackendType "checkpointer".

Each IOOp (alloc, extend, fsync, read, write) is counted per IOContext
(local, shared, or strategy) through a call to pgstat_count_io_op().

The primary concern of these statistics is IO operations on data blocks
during the course of normal database operations. IO done by, for
example, the archiver or syslogger is not counted in these statistics.
s/is/are/?
Stats on IOOps for all IOContexts for a backend are counted in a
backend's local memory. This commit does not expose any functions for
aggregating or viewing these stats.
s/This commit does not/A subsequent commit will expose/...
@@ -823,6 +823,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
BufferDesc *bufHdr;
Block bufBlock;
bool found;
+ IOContext io_context;
bool isExtend;
 	bool		isLocalBuf = SmgrIsTemp(smgr);

@@ -986,10 +987,25 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	 */
 	Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID));	/* spinlock not needed */

-	bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
+	if (isLocalBuf)
+	{
+		bufBlock = LocalBufHdrGetBlock(bufHdr);
+		io_context = IOCONTEXT_LOCAL;
+	}
+	else
+	{
+		bufBlock = BufHdrGetBlock(bufHdr);
+
+		if (strategy != NULL)
+			io_context = IOCONTEXT_STRATEGY;
+		else
+			io_context = IOCONTEXT_SHARED;
+	}
There's a isLocalBuf block earlier on, couldn't we just determine the context
there? I guess there's a branch here already, so it's probably fine as is.
 	if (isExtend)
 	{
+
+		pgstat_count_io_op(IOOP_EXTEND, io_context);
Spurious newline.
@@ -2820,9 +2857,12 @@ BufferGetTag(Buffer buffer, RelFileLocator *rlocator, ForkNumber *forknum,
  *
  * If the caller has an smgr reference for the buffer's relation, pass it
  * as the second parameter. If not, pass NULL.
+ *
+ * IOContext will always be IOCONTEXT_SHARED except when a buffer access strategy is
+ * used and the buffer being flushed is a buffer from the strategy ring.
  */
 static void
-FlushBuffer(BufferDesc *buf, SMgrRelation reln)
+FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOContext io_context)
Too long line?
But also, why document the possible values here? Seems likely to get out of
date at some point, and it doesn't seem important to know?
@@ -3549,6 +3591,8 @@ FlushRelationBuffers(Relation rel)
localpage,
 							  false);

+			pgstat_count_io_op(IOOP_WRITE, IOCONTEXT_LOCAL);
+
 			buf_state &= ~(BM_DIRTY | BM_JUST_DIRTIED);
 			pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
Probably not worth doing, but these made me wonder whether there should be a
function for counting N operations at once.
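Such a bulk-counting variant might look like the following (a sketch with stand-in types; the real function would update the backend's pending_IOOpStats and set have_ioopstats, and the name pgstat_count_io_op_n is hypothetical):

```c
/* Stand-ins for the real pgstat enum and counter struct */
typedef enum { IOOP_ALLOC, IOOP_EXTEND, IOOP_FSYNC, IOOP_READ, IOOP_WRITE } IOOp;

typedef struct IOOpCounters
{
	long		allocs;
	long		extends;
	long		fsyncs;
	long		reads;
	long		writes;
} IOOpCounters;

/* Count n operations of the same kind in a single call */
static void
count_io_op_n(IOOpCounters *counters, IOOp io_op, long n)
{
	switch (io_op)
	{
		case IOOP_ALLOC:
			counters->allocs += n;
			return;
		case IOOP_EXTEND:
			counters->extends += n;
			return;
		case IOOP_FSYNC:
			counters->fsyncs += n;
			return;
		case IOOP_READ:
			counters->reads += n;
			return;
		case IOOP_WRITE:
			counters->writes += n;
			return;
	}
}
```

The existing single-op counter would then just be a call with n = 1, so call sites that write a batch of buffers avoid N function calls.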
@@ -212,8 +215,23 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
 	if (strategy != NULL)
 	{
 		buf = GetBufferFromRing(strategy, buf_state);
-		if (buf != NULL)
+		*from_ring = buf != NULL;
+		if (*from_ring)
+		{
Don't really like the if (*from_ring) - why not keep it as buf != NULL? Seems
a bit confusing this way, making it less obvious what's being changed.
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 014f644bf9..a3d76599bf 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -15,6 +15,7 @@
  */
 #include "postgres.h"

+#include "pgstat.h"
#include "access/parallel.h"
#include "catalog/catalog.h"
#include "executor/instrument.h"
Do most other places not put pgstat.h in the alphabetical order of headers?
@@ -432,6 +432,15 @@ ProcessSyncRequests(void)
total_elapsed += elapsed;
 				processed++;

+				/*
+				 * Note that if a backend using a BufferAccessStrategy is
+				 * forced to do its own fsync (as opposed to the
+				 * checkpointer doing it), it will not be counted as an
+				 * IOCONTEXT_STRATEGY IOOP_FSYNC and instead will be
+				 * counted as an IOCONTEXT_SHARED IOOP_FSYNC.
+				 */
+				pgstat_count_io_op(IOOP_FSYNC, IOCONTEXT_SHARED);
Why is this noted here? Perhaps just point to the place where that happens
instead? I think it's also documented in ForwardSyncRequest()? Or just only
mention it there...
@@ -0,0 +1,191 @@
+/* -------------------------------------------------------------------------
+ *
+ * pgstat_io_ops.c
+ *	  Implementation of IO operation statistics.
+ *
+ * This file contains the implementation of IO operation statistics. It is kept
+ * separate from pgstat.c to enforce the line between the statistics access /
+ * storage implementation and the details about individual types of
+ * statistics.
+ *
+ * Copyright (c) 2001-2022, PostgreSQL Global Development Group
Arguably this would just be 2021-2022
+void
+pgstat_count_io_op(IOOp io_op, IOContext io_context)
+{
+	PgStat_IOOpCounters *pending_counters = &pending_IOOpStats.data[io_context];
+
+	Assert(pgstat_expect_io_op(MyBackendType, io_context, io_op));
+
+	switch (io_op)
+	{
+		case IOOP_ALLOC:
+			pending_counters->allocs++;
+			break;
+		case IOOP_EXTEND:
+			pending_counters->extends++;
+			break;
+		case IOOP_FSYNC:
+			pending_counters->fsyncs++;
+			break;
+		case IOOP_READ:
+			pending_counters->reads++;
+			break;
+		case IOOP_WRITE:
+			pending_counters->writes++;
+			break;
+	}
+
+}
How about replacing the breaks with a return and then erroring out if we reach
the end of the function? You did that below, and I think it makes sense.
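Concretely, the suggested shape is the following (a standalone sketch with stand-in types; fprintf/abort stands in for the backend's elog(ERROR, ...)):

```c
#include <stdio.h>
#include <stdlib.h>

/* Stand-ins for the real pgstat enum and counter struct */
typedef enum { IOOP_ALLOC, IOOP_EXTEND, IOOP_FSYNC, IOOP_READ, IOOP_WRITE } IOOp;

typedef struct IOOpCounters
{
	long		allocs;
	long		extends;
	long		fsyncs;
	long		reads;
	long		writes;
} IOOpCounters;

static void
count_io_op(IOOpCounters *counters, IOOp io_op)
{
	switch (io_op)
	{
		case IOOP_ALLOC:
			counters->allocs++;
			return;				/* return, not break */
		case IOOP_EXTEND:
			counters->extends++;
			return;
		case IOOP_FSYNC:
			counters->fsyncs++;
			return;
		case IOOP_READ:
			counters->reads++;
			return;
		case IOOP_WRITE:
			counters->writes++;
			return;
	}

	/* only reachable if a new IOOp is added without updating the switch */
	fprintf(stderr, "unrecognized IOOp value: %d\n", (int) io_op);
	abort();
}
```

Using return per case lets the compiler's switch-exhaustiveness warning plus the trailing error catch any newly added IOOp value that the switch forgot to handle.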
+bool
+pgstat_bktype_io_context_valid(BackendType bktype, IOContext io_context)
+{
Maybe add a tiny comment about what 'valid' means here? Something like
'return whether the backend type counts io in io_context'.
+	/*
+	 * Only regular backends and WAL Sender processes executing queries should
+	 * use local buffers.
+	 */
+	no_local = bktype == B_AUTOVAC_LAUNCHER || bktype ==
+		B_BG_WRITER || bktype == B_CHECKPOINTER || bktype ==
+		B_AUTOVAC_WORKER || bktype == B_BG_WORKER || bktype ==
+		B_STANDALONE_BACKEND || bktype == B_STARTUP;
I think BG_WORKERS could end up using local buffers, extensions can do just
about everything in them.
+bool
+pgstat_bktype_io_op_valid(BackendType bktype, IOOp io_op)
+{
+	if ((bktype == B_BG_WRITER || bktype == B_CHECKPOINTER) && io_op ==
+		IOOP_READ)
+		return false;
Perhaps we should add an assertion about the backend type making sense here?
I.e. that it's not archiver, walwriter etc?
+bool
+pgstat_io_context_io_op_valid(IOContext io_context, IOOp io_op)
+{
+	/*
+	 * Temporary tables using local buffers are not logged and thus do not
+	 * require fsync'ing. Set this cell to NULL to differentiate between an
+	 * invalid combination and 0 observed IO Operations.
This comment feels a bit out of place?
+bool
+pgstat_expect_io_op(BackendType bktype, IOContext io_context, IOOp io_op)
+{
+	if (!pgstat_io_op_stats_collected(bktype))
+		return false;
+
+	if (!pgstat_bktype_io_context_valid(bktype, io_context))
+		return false;
+
+	if (!pgstat_bktype_io_op_valid(bktype, io_op))
+		return false;
+
+	if (!pgstat_io_context_io_op_valid(io_context, io_op))
+		return false;
+
+	/*
+	 * There are currently no cases of a BackendType, IOContext, IOOp
+	 * combination that are specifically invalid.
+	 */
"specifically"?
From 0f141fa7f97a57b8628b1b6fd6029bd3782f16a1 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 22 Aug 2022 11:35:20 -0400
Subject: [PATCH v28 4/5] Aggregate IO operation stats per BackendType

Stats on IOOps for all IOContexts for a backend are tracked locally. Add
functionality for backends to flush these stats to shared memory and
accumulate them with those from all other backends, exited and live.
Also add reset and snapshot functions used by cumulative stats system
for management of these statistics.

The aggregated stats in shared memory could be extended in the future
with per-backend stats -- useful for per connection IO statistics and
monitoring.

Some BackendTypes will not flush their pending statistics at regular
intervals and explicitly call pgstat_flush_io_ops() during the course of
normal operations to flush their backend-local IO Operation statistics
to shared memory in a timely manner.
Because not all BackendType, IOOp, IOContext combinations are valid, the
validity of the stats are checked before flushing pending stats and
before reading in the existing stats file to shared memory.
s/are checked/is checked/?
@@ -1486,6 +1507,42 @@ pgstat_read_statsfile(void)
if (!read_chunk_s(fpin, &shmem->checkpointer.stats))
 		goto error;

+	/*
+	 * Read IO Operations stats struct
+	 */
+	if (!read_chunk_s(fpin, &shmem->io_ops.stat_reset_timestamp))
+		goto error;
+
+	for (int backend_type = 0; backend_type < BACKEND_NUM_TYPES; backend_type++)
+	{
+		PgStatShared_IOContextOps *backend_io_context_ops = &shmem->io_ops.stats[backend_type];
+		bool		expect_backend_stats = true;
+
+		if (!pgstat_io_op_stats_collected(backend_type))
+			expect_backend_stats = false;
+
+		for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
+		{
+			if (!expect_backend_stats ||
+				!pgstat_bktype_io_context_valid(backend_type, io_context))
+			{
+				pgstat_io_context_ops_assert_zero(&backend_io_context_ops->data[io_context]);
+				continue;
+			}
+
+			for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
+			{
+				if (!pgstat_bktype_io_op_valid(backend_type, io_op) ||
+					!pgstat_io_context_io_op_valid(io_context, io_op))
+					pgstat_io_op_assert_zero(&backend_io_context_ops->data[io_context],
+											 io_op);
+			}
+		}
+
+		if (!read_chunk_s(fpin, &backend_io_context_ops->data))
+			goto error;
+	}
Could we put the validation out of line? That's a lot of io stats specific
code to be in pgstat_read_statsfile().
+/*
+ * Helper function to accumulate PgStat_IOOpCounters. If either of the
+ * passed-in PgStat_IOOpCounters are members of PgStatShared_IOContextOps, the
+ * caller is responsible for ensuring that the appropriate lock is held. This
+ * is not asserted because this function could plausibly be used to accumulate
+ * two local/pending PgStat_IOOpCounters.
What's "this" here?
+ */
+static void
+pgstat_accum_io_op(PgStat_IOOpCounters *shared, PgStat_IOOpCounters *local, IOOp io_op)
Given that the comment above says both of them may be local, it's a bit odd to
call it 'shared' here...
+PgStat_BackendIOContextOps *
+pgstat_fetch_backend_io_context_ops(void)
+{
+	pgstat_snapshot_fixed(PGSTAT_KIND_IOOPS);
+
+	return &pgStatLocal.snapshot.io_ops;
+}
Not for this patch series, but we really should replace this set of functions
with storing the relevant offset in the kind_info.
@@ -496,6 +503,8 @@ extern PgStat_CheckpointerStats *pgstat_fetch_stat_checkpointer(void);
 */

 extern void pgstat_count_io_op(IOOp io_op, IOContext io_context);
+extern PgStat_BackendIOContextOps *pgstat_fetch_backend_io_context_ops(void);
+extern bool pgstat_flush_io_ops(bool nowait);
 extern const char *pgstat_io_context_desc(IOContext io_context);
 extern const char *pgstat_io_op_desc(IOOp io_op);
Is there any call to pgstat_flush_io_ops() from outside pgstat*.c? So possibly
it could be in pgstat_internal.h? Not that it's particularly important...
@@ -506,6 +515,43 @@ extern bool pgstat_bktype_io_op_valid(BackendType bktype, IOOp io_op);
extern bool pgstat_io_context_io_op_valid(IOContext io_context, IOOp io_op);
 extern bool pgstat_expect_io_op(BackendType bktype, IOContext io_context, IOOp io_op);

+/*
+ * Functions to assert that invalid IO Operation counters are zero. Used with
+ * the validation functions in pgstat_io_ops.c
+ */
+static inline void
+pgstat_io_context_ops_assert_zero(PgStat_IOOpCounters *counters)
+{
+	Assert(counters->allocs == 0 && counters->extends == 0 &&
+		   counters->fsyncs == 0 && counters->reads == 0 &&
+		   counters->writes == 0);
+}
+
+static inline void
+pgstat_io_op_assert_zero(PgStat_IOOpCounters *counters, IOOp io_op)
+{
+	switch (io_op)
+	{
+		case IOOP_ALLOC:
+			Assert(counters->allocs == 0);
+			return;
+		case IOOP_EXTEND:
+			Assert(counters->extends == 0);
+			return;
+		case IOOP_FSYNC:
+			Assert(counters->fsyncs == 0);
+			return;
+		case IOOP_READ:
+			Assert(counters->reads == 0);
+			return;
+		case IOOP_WRITE:
+			Assert(counters->writes == 0);
+			return;
+	}
+
+	elog(ERROR, "unrecognized IOOp value: %d", io_op);
Hm. This means it'll emit code even in non-assertion builds - this should
probably just be an Assert(false) or pg_unreachable().
Subject: [PATCH v28 5/5] Add system view tracking IO ops per backend type
View stats are fetched from statistics incremented when a backend
performs an IO Operation and maintained by the cumulative statistics
subsystem.
"fetched from statistics incremented"?
Each row of the view is stats for a particular BackendType for a
particular IOContext (e.g. shared buffer accesses by checkpointer) and
each column in the view is the total number of IO Operations done (e.g.
writes).
s/is/shows/?
s/for a particular BackendType for a particular IOContext/for a particular
BackendType and IOContext/? Somehow the repetition is weird.
Note that some of the cells in the view are redundant with fields in
pg_stat_bgwriter (e.g. buffers_backend), however these have been kept in
pg_stat_bgwriter for backwards compatibility. Deriving the redundant
pg_stat_bgwriter stats from the IO operations stats structures was also
problematic due to the separate reset targets for 'bgwriter' and
'io'.
I suspect we should still consider doing that in the future, perhaps by
documenting that the relevant fields in pg_stat_bgwriter aren't reset by the
'bgwriter' target anymore? And noting that reliance on those fields is
"deprecated" and that pg_stat_io should be used instead?
Suggested by Andres Freund
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>, Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Discussion: /messages/by-id/20200124195226.lth52iydq2n2uilq@alap3.anarazel.de
---
doc/src/sgml/monitoring.sgml | 115 ++++++++++++++-
src/backend/catalog/system_views.sql | 12 ++
src/backend/utils/adt/pgstatfuncs.c | 100 +++++++++++++
src/include/catalog/pg_proc.dat | 9 ++
src/test/regress/expected/rules.out | 9 ++
src/test/regress/expected/stats.out | 201 +++++++++++++++++++++++++++
src/test/regress/sql/stats.sql | 103 ++++++++++++++
 7 files changed, 548 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 9440b41770..9949011ba3 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -448,6 +448,15 @@ postgres  27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       </entry>
      </row>
+
+     <row>
+      <entry><structname>pg_stat_io</structname><indexterm><primary>pg_stat_io</primary></indexterm></entry>
+      <entry>A row for each IO Context for each backend type showing
+      statistics about backend IO operations. See
+       <link linkend="monitoring-pg-stat-io-view">
+       <structname>pg_stat_io</structname></link> for details.
+      </entry>
+     </row>
The "for each for each" thing again :)
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>io_context</structfield> <type>text</type>
+      </para>
+      <para>
+       IO Context used (e.g. shared buffers, direct).
+      </para></entry>
+     </row>
Wrong list of contexts.
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>alloc</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of buffers allocated.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>extend</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of blocks extended.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>fsync</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of blocks fsynced.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>read</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of blocks read.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>write</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of blocks written.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
+      </para>
+      <para>
+       Time at which these statistics were last reset.
+      </para></entry>
+     </row>
    </tbody>
Part of me thinks it'd be nicer if it were "allocated, read, written, extended,
fsynced, stats_reset", instead of alphabetical order. The order already isn't
alphabetical.
+	/*
+	 * When adding a new column to the pg_stat_io view, add a new enum value
+	 * here above IO_NUM_COLUMNS.
+	 */
+	enum
+	{
+		IO_COLUMN_BACKEND_TYPE,
+		IO_COLUMN_IO_CONTEXT,
+		IO_COLUMN_ALLOCS,
+		IO_COLUMN_EXTENDS,
+		IO_COLUMN_FSYNCS,
+		IO_COLUMN_READS,
+		IO_COLUMN_WRITES,
+		IO_COLUMN_RESET_TIME,
+		IO_NUM_COLUMNS,
+	};
Given it's local and some of the lines are long, maybe just use COL?
+#define IO_COLUMN_IOOP_OFFSET (IO_COLUMN_IO_CONTEXT + 1)
Undef'ing it probably worth doing.
+	SetSingleFuncCall(fcinfo, 0);
+	rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+
+	backends_io_stats = pgstat_fetch_backend_io_context_ops();
+
+	reset_time = TimestampTzGetDatum(backends_io_stats->stat_reset_timestamp);
+
+	for (int bktype = 0; bktype < BACKEND_NUM_TYPES; bktype++)
+	{
+		Datum		bktype_desc = CStringGetTextDatum(GetBackendTypeDesc(bktype));
+		bool		expect_backend_stats = true;
+		PgStat_IOContextOps *io_context_ops = &backends_io_stats->stats[bktype];
+
+		/*
+		 * For those BackendTypes without IO Operation stats, skip
+		 * representing them in the view altogether.
+		 */
+		if (!pgstat_io_op_stats_collected(bktype))
+			expect_backend_stats = false;
Why not just expect_backend_stats = pgstat_io_op_stats_collected()?
+		for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
+		{
+			PgStat_IOOpCounters *counters = &io_context_ops->data[io_context];
+			Datum		values[IO_NUM_COLUMNS];
+			bool		nulls[IO_NUM_COLUMNS];
+
+			/*
+			 * Some combinations of IOCONTEXT and BackendType are not valid
+			 * for any type of IO Operation. In such cases, omit the entire
+			 * row from the view.
+			 */
+			if (!expect_backend_stats ||
+				!pgstat_bktype_io_context_valid(bktype, io_context))
+			{
+				pgstat_io_context_ops_assert_zero(counters);
+				continue;
+			}
+
+			memset(values, 0, sizeof(values));
+			memset(nulls, 0, sizeof(nulls));
I'd replace the memset with values[...] = {0} etc.
+			values[IO_COLUMN_BACKEND_TYPE] = bktype_desc;
+			values[IO_COLUMN_IO_CONTEXT] = CStringGetTextDatum(
+															   pgstat_io_context_desc(io_context));
Pgindent, I hate you.
Perhaps put the context desc in a local var, so it doesn't look quite this
ugly?
+			values[IO_COLUMN_ALLOCS] = Int64GetDatum(counters->allocs);
+			values[IO_COLUMN_EXTENDS] = Int64GetDatum(counters->extends);
+			values[IO_COLUMN_FSYNCS] = Int64GetDatum(counters->fsyncs);
+			values[IO_COLUMN_READS] = Int64GetDatum(counters->reads);
+			values[IO_COLUMN_WRITES] = Int64GetDatum(counters->writes);
+			values[IO_COLUMN_RESET_TIME] = TimestampTzGetDatum(reset_time);
+
+
+			/*
+			 * Some combinations of BackendType and IOOp and of IOContext and
+			 * IOOp are not valid. Set these cells in the view NULL and assert
+			 * that these stats are zero as expected.
+			 */
+			for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
+			{
+				if (!pgstat_bktype_io_op_valid(bktype, io_op) ||
+					!pgstat_io_context_io_op_valid(io_context, io_op))
+				{
+					pgstat_io_op_assert_zero(counters, io_op);
+					nulls[io_op + IO_COLUMN_IOOP_OFFSET] = true;
+				}
+			}
A bit weird that we first assign a value and then set nulls separately. But
it's not obvious how to make it look nice otherwise.
+-- Test that allocs, extends, reads, and writes to Shared Buffers and fsyncs
+-- done to ensure durability of Shared Buffers are tracked in pg_stat_io.
+SELECT sum(alloc) AS io_sum_shared_allocs_before FROM pg_stat_io WHERE io_context = 'Shared' \gset
+SELECT sum(extend) AS io_sum_shared_extends_before FROM pg_stat_io WHERE io_context = 'Shared' \gset
+SELECT sum(fsync) AS io_sum_shared_fsyncs_before FROM pg_stat_io WHERE io_context = 'Shared' \gset
+SELECT sum(read) AS io_sum_shared_reads_before FROM pg_stat_io WHERE io_context = 'Shared' \gset
+SELECT sum(write) AS io_sum_shared_writes_before FROM pg_stat_io WHERE io_context = 'Shared' \gset
+-- Create a regular table and insert some data to generate IOCONTEXT_SHARED allocs and extends.
+CREATE TABLE test_io_shared(a int);
+INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+-- After a checkpoint, there should be some additional IOCONTEXT_SHARED writes and fsyncs.
+CHECKPOINT;
Does that work reliably? A checkpoint could have started just before the
CREATE TABLE, I think? Then it'd not have flushed those writes yet. I think
doing two checkpoints would protect against that.
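If that suggestion were taken, the guard might look like this sketch (hypothetical, not the committed test):

```sql
-- Hypothetical guard: issue CHECKPOINT twice. Even if a checkpoint was
-- already in progress when the table was created (and so skipped the new
-- dirty buffers), the second explicit checkpoint must flush them.
CHECKPOINT;
CHECKPOINT;
SELECT sum(write) AS io_sum_shared_writes_after
  FROM pg_stat_io WHERE io_context = 'Shared' \gset
```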
+DROP TABLE test_io_shared;
+DROP TABLESPACE test_io_shared_stats_tblspc;
Tablespace creation is somewhat expensive, do we really need that? There
should be one set up in setup.sql or such.
+-- Test that allocs, extends, reads, and writes of temporary tables are tracked
+-- in pg_stat_io.
+CREATE TEMPORARY TABLE test_io_local(a int, b TEXT);
+SELECT sum(alloc) AS io_sum_local_allocs_before FROM pg_stat_io WHERE io_context = 'Local' \gset
+SELECT sum(extend) AS io_sum_local_extends_before FROM pg_stat_io WHERE io_context = 'Local' \gset
+SELECT sum(read) AS io_sum_local_reads_before FROM pg_stat_io WHERE io_context = 'Local' \gset
+SELECT sum(write) AS io_sum_local_writes_before FROM pg_stat_io WHERE io_context = 'Local' \gset
+-- Insert enough values that we need to reuse and write out dirty local
+-- buffers.
+INSERT INTO test_io_local SELECT generate_series(1, 80000) as id,
+'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
Could be abbreviated with repeat('a', some-number) :P
Can the table be smaller than this? That might show up on a slow machine.
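The suggested abbreviation would read something like the following (the 80-character width matches the literal above; the right row count for slow machines is the open question):

```sql
-- repeat('a', 80) builds the same filler string without the long literal.
INSERT INTO test_io_local
SELECT generate_series(1, 80000) AS id, repeat('a', 80);
```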
+SELECT sum(alloc) AS io_sum_local_allocs_after FROM pg_stat_io WHERE io_context = 'Local' \gset
+SELECT sum(extend) AS io_sum_local_extends_after FROM pg_stat_io WHERE io_context = 'Local' \gset
+SELECT sum(read) AS io_sum_local_reads_after FROM pg_stat_io WHERE io_context = 'Local' \gset
+SELECT sum(write) AS io_sum_local_writes_after FROM pg_stat_io WHERE io_context = 'Local' \gset
+SELECT :io_sum_local_allocs_after > :io_sum_local_allocs_before;
Random q: Why are we uppercasing the first letter of the context?
+CREATE TABLE test_io_strategy(a INT, b INT);
+ALTER TABLE test_io_strategy SET (autovacuum_enabled = 'false');
I think you can specify that as part of the CREATE TABLE. Not sure if
otherwise there's not a race where autovac could start before you do the ALTER.
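Folding the storage parameter into CREATE TABLE, as suggested, would close that window (a sketch):

```sql
-- The reloption is set atomically at creation time, so autovacuum can
-- never observe the table without autovacuum_enabled = off.
CREATE TABLE test_io_strategy(a int, b int)
  WITH (autovacuum_enabled = off);
```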
+INSERT INTO test_io_strategy SELECT i, i from generate_series(1, 8000)i;
+-- Ensure that the next VACUUM will need to perform IO by rewriting the table
+-- first with VACUUM (FULL).
... because VACUUM FULL currently doesn't set all-visible etc on the pages,
which the subsequent vacuum will then do.
+-- Hope that the previous value of wal_skip_threshold was the default. We
+-- can't use BEGIN...SET LOCAL since VACUUM can't be run inside a transaction
+-- block.
+RESET wal_skip_threshold;
Nothing in this file set it before, so that's a pretty sure-to-be-fulfilled
hope.
+-- Test that, when using a Strategy, if creating a relation, Strategy extends
s/if/when/?
Looks good!
Greetings,
Andres Freund
v29 attached
On Thu, Aug 25, 2022 at 3:15 PM Andres Freund <andres@anarazel.de> wrote:
On 2022-08-22 13:15:18 -0400, Melanie Plageman wrote:
Notes on the commit which accumulates IO Operation stats in shared
memory:

- I've extended the usage of the Assert()s that IO Operation stats that
  should be zero are. Previously we only checked the stats validity when
  querying the view. Now we check it when flushing pending stats and
  when reading the stats file into shared memory.

Note that the three locations with these validity checks (when
flushing pending stats, when reading stats file into shared memory,
and when querying the view) have similar looking code to loop through
and validate the stats. However, the actual action they perform if the
stats are valid is different for each site (adding counters together,
doing a read, setting nulls in a tuple column to true). Also, some of
these instances have other code interspersed in the loops which would
require additional looping if separated from this logic. So it was
difficult to see a way of combining these into a single helper
function.

All of them seem to repeat something like

+				if (!pgstat_bktype_io_op_valid(bktype, io_op) ||
+					!pgstat_io_context_io_op_valid(io_context, io_op))

perhaps those could be combined? Afaics nothing uses pgstat_bktype_io_op_valid
separately.
I've combined these into pgstat_io_op_valid().
Subject: [PATCH v28 3/5] Track IO operation statistics locally
Introduce "IOOp", an IO operation done by a backend, and "IOContext",
the IO location source or target or IO type done by a backend. For
example, the checkpointer may write a shared buffer out. This would be
counted as an IOOp "write" on an IOContext IOCONTEXT_SHARED by
BackendType "checkpointer".

Each IOOp (alloc, extend, fsync, read, write) is counted per IOContext
(local, shared, or strategy) through a call to pgstat_count_io_op().

The primary concern of these statistics is IO operations on data blocks
during the course of normal database operations. IO done by, for
example, the archiver or syslogger is not counted in these statistics.

s/is/are/?
changed
Stats on IOOps for all IOContexts for a backend are counted in a
backend's local memory. This commit does not expose any functions for
aggregating or viewing these stats.

s/This commit does not/A subsequent commit will expose/...
changed
@@ -823,6 +823,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	BufferDesc *bufHdr;
 	Block		bufBlock;
 	bool		found;
+	IOContext	io_context;
 	bool		isExtend;
 	bool		isLocalBuf = SmgrIsTemp(smgr);

@@ -986,10 +987,25 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	 */
 	Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID));	/* spinlock not needed */
-	bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
+	if (isLocalBuf)
+	{
+		bufBlock = LocalBufHdrGetBlock(bufHdr);
+		io_context = IOCONTEXT_LOCAL;
+	}
+	else
+	{
+		bufBlock = BufHdrGetBlock(bufHdr);
+
+		if (strategy != NULL)
+			io_context = IOCONTEXT_STRATEGY;
+		else
+			io_context = IOCONTEXT_SHARED;
+	}

There's an isLocalBuf block earlier on, couldn't we just determine the context
there? I guess there's a branch here already, so it's probably fine as is.
I've added this as close as possible to the code where we use the
io_context. If I were to move it, it would make sense to move it all the
way to the top of ReadBuffer_common() where we first define isLocalBuf.
I've left it as is.
 	if (isExtend)
 	{
+
+		pgstat_count_io_op(IOOP_EXTEND, io_context);

Spurious newline.
fixed
@@ -2820,9 +2857,12 @@ BufferGetTag(Buffer buffer, RelFileLocator *rlocator, ForkNumber *forknum,
  *
  * If the caller has an smgr reference for the buffer's relation, pass it
  * as the second parameter.  If not, pass NULL.
+ *
+ * IOContext will always be IOCONTEXT_SHARED except when a buffer access strategy is
+ * used and the buffer being flushed is a buffer from the strategy ring.
  */
 static void
-FlushBuffer(BufferDesc *buf, SMgrRelation reln)
+FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOContext io_context)

Too long line?
But also, why document the possible values here? Seems likely to get out of
date at some point, and it doesn't seem important to know?
Deleted.
@@ -3549,6 +3591,8 @@ FlushRelationBuffers(Relation rel)
							  localpage,
							  false);

+			pgstat_count_io_op(IOOP_WRITE, IOCONTEXT_LOCAL);
+
 			buf_state &= ~(BM_DIRTY | BM_JUST_DIRTIED);
 			pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
Probably not worth doing, but these made me wonder whether there should be
a
function for counting N operations at once.
Would it be worth it here? We would need a local variable to track how
many local buffers we end up writing. Do you think that
pgstat_count_io_op() will not be inlined and thus we will end up with
lots of extra function calls if we do a pgstat_count_io_op() on every
iteration? And that it will matter in FlushRelationBuffers()?
The other times that pgstat_count_io_op() is used in a loop, it is
part of the branch that will exit the loop and only be called once-ish.
Or are you thinking that just generally it might be nice to have?
@@ -212,8 +215,23 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
 	if (strategy != NULL)
 	{
 		buf = GetBufferFromRing(strategy, buf_state);
-		if (buf != NULL)
+		*from_ring = buf != NULL;
+		if (*from_ring)
+		{

Don't really like the if (*from_ring) - why not keep it as buf != NULL? Seems
a bit confusing this way, making it less obvious what's being changed.
Changed
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 014f644bf9..a3d76599bf 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -15,6 +15,7 @@
  */
 #include "postgres.h"

+#include "pgstat.h"
 #include "access/parallel.h"
 #include "catalog/catalog.h"
 #include "executor/instrument.h"

Do most other places not put pgstat.h in the alphabetical order of headers?
Fixed
@@ -432,6 +432,15 @@ ProcessSyncRequests(void)
total_elapsed += elapsed;
 					processed++;

+					/*
+					 * Note that if a backend using a BufferAccessStrategy is
+					 * forced to do its own fsync (as opposed to the
+					 * checkpointer doing it), it will not be counted as an
+					 * IOCONTEXT_STRATEGY IOOP_FSYNC and instead will be
+					 * counted as an IOCONTEXT_SHARED IOOP_FSYNC.
+					 */
+					pgstat_count_io_op(IOOP_FSYNC, IOCONTEXT_SHARED);
Why is this noted here? Perhaps just point to the place where that happens
instead? I think it's also documented in ForwardSyncRequest()? Or just only
mention it there...
Removed
@@ -0,0 +1,191 @@
+/*-------------------------------------------------------------------------
+ *
+ * pgstat_io_ops.c
+ *	  Implementation of IO operation statistics.
+ *
+ * This file contains the implementation of IO operation statistics. It is kept
+ * separate from pgstat.c to enforce the line between the statistics access /
+ * storage implementation and the details about individual types of
+ * statistics.
+ *
+ * Copyright (c) 2001-2022, PostgreSQL Global Development Group

Arguably this would just be 2021-2022
Changed
+void
+pgstat_count_io_op(IOOp io_op, IOContext io_context)
+{
+	PgStat_IOOpCounters *pending_counters = &pending_IOOpStats.data[io_context];
+
+	Assert(pgstat_expect_io_op(MyBackendType, io_context, io_op));
+
+	switch (io_op)
+	{
+		case IOOP_ALLOC:
+			pending_counters->allocs++;
+			break;
+		case IOOP_EXTEND:
+			pending_counters->extends++;
+			break;
+		case IOOP_FSYNC:
+			pending_counters->fsyncs++;
+			break;
+		case IOOP_READ:
+			pending_counters->reads++;
+			break;
+		case IOOP_WRITE:
+			pending_counters->writes++;
+			break;
+	}
+
+}

How about replacing the breaks with a return and then erroring out if we reach
the end of the function? You did that below, and I think it makes sense.
I used breaks because in the subsequent commit I introduce the variable
"have_ioopstats", and I set have_ioopstats to false in
pgstat_count_io_op() after counting.
It is probably safe to set have_ioopstats to true before incrementing it
since this backend is the only one that can see have_ioopstats and it
shouldn't fail while incrementing the counter but it seems less clear
than doing it after.
Instead of erroring out for an unknown IOOp, I decided to add Asserts
about the IOContext and IOOp being valid and that the combination of
MyBackendType, IOContext, and IOOp are valid. I think it will be good to
assert that the IOContext is valid before using it as an array index for
lookup in pending stats.
+bool
+pgstat_bktype_io_context_valid(BackendType bktype, IOContext io_context)
+{

Maybe add a tiny comment about what 'valid' means here? Something like
'return whether the backend type counts io in io_context'.
Changed
+	/*
+	 * Only regular backends and WAL Sender processes executing queries should
+	 * use local buffers.
+	 */
+	no_local = bktype == B_AUTOVAC_LAUNCHER || bktype ==
+		B_BG_WRITER || bktype == B_CHECKPOINTER || bktype ==
+		B_AUTOVAC_WORKER || bktype == B_BG_WORKER || bktype ==
+		B_STANDALONE_BACKEND || bktype == B_STARTUP;

I think BG_WORKERS could end up using local buffers, extensions can do just
about everything in them.
Fixed and added comment.
+bool
+pgstat_bktype_io_op_valid(BackendType bktype, IOOp io_op)
+{
+	if ((bktype == B_BG_WRITER || bktype == B_CHECKPOINTER) && io_op ==
+		IOOP_READ)
+		return false;

Perhaps we should add an assertion about the backend type making sense here?
I.e. that it's not archiver, walwriter etc?
Done
+bool
+pgstat_io_context_io_op_valid(IOContext io_context, IOOp io_op)
+{
+	/*
+	 * Temporary tables using local buffers are not logged and thus do not
+	 * require fsync'ing. Set this cell to NULL to differentiate between an
+	 * invalid combination and 0 observed IO Operations.
This comment feels a bit out of place?
Deleted
+bool
+pgstat_expect_io_op(BackendType bktype, IOContext io_context, IOOp io_op)
+{
+	if (!pgstat_io_op_stats_collected(bktype))
+		return false;
+
+	if (!pgstat_bktype_io_context_valid(bktype, io_context))
+		return false;
+
+	if (!pgstat_bktype_io_op_valid(bktype, io_op))
+		return false;
+
+	if (!pgstat_io_context_io_op_valid(io_context, io_op))
+		return false;
+
+	/*
+	 * There are currently no cases of a BackendType, IOContext, IOOp
+	 * combination that are specifically invalid.
+	 */

"specifically"?
I removed this and mentioned it (rephrased) above pgstat_io_op_valid()
From 0f141fa7f97a57b8628b1b6fd6029bd3782f16a1 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 22 Aug 2022 11:35:20 -0400
Subject: [PATCH v28 4/5] Aggregate IO operation stats per BackendType

Stats on IOOps for all IOContexts for a backend are tracked locally. Add
functionality for backends to flush these stats to shared memory and
accumulate them with those from all other backends, exited and live.
Also add reset and snapshot functions used by cumulative stats system
for management of these statistics.

The aggregated stats in shared memory could be extended in the future
with per-backend stats -- useful for per connection IO statistics and
monitoring.

Some BackendTypes will not flush their pending statistics at regular
intervals and explicitly call pgstat_flush_io_ops() during the course of
normal operations to flush their backend-local IO Operation statistics
to shared memory in a timely manner.

Because not all BackendType, IOOp, IOContext combinations are valid, the
validity of the stats are checked before flushing pending stats and
before reading in the existing stats file to shared memory.

s/are checked/is checked/?
Fixed
@@ -1486,6 +1507,42 @@ pgstat_read_statsfile(void)
if (!read_chunk_s(fpin, &shmem->checkpointer.stats))
 		goto error;

+	/*
+	 * Read IO Operations stats struct
+	 */
+	if (!read_chunk_s(fpin, &shmem->io_ops.stat_reset_timestamp))
+		goto error;
+
+	for (int backend_type = 0; backend_type < BACKEND_NUM_TYPES; backend_type++)
+	{
+		PgStatShared_IOContextOps *backend_io_context_ops = &shmem->io_ops.stats[backend_type];
+		bool		expect_backend_stats = true;
+
+		if (!pgstat_io_op_stats_collected(backend_type))
+			expect_backend_stats = false;
+
+		for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
+		{
+			if (!expect_backend_stats ||
+				!pgstat_bktype_io_context_valid(backend_type, io_context))
+			{
+				pgstat_io_context_ops_assert_zero(&backend_io_context_ops->data[io_context]);
+				continue;
+			}
+
+			for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
+			{
+				if (!pgstat_bktype_io_op_valid(backend_type, io_op) ||
+					!pgstat_io_context_io_op_valid(io_context, io_op))
+					pgstat_io_op_assert_zero(&backend_io_context_ops->data[io_context],
+											 io_op);
+			}
+		}
+
+		if (!read_chunk_s(fpin, &backend_io_context_ops->data))
+			goto error;
+	}

Could we put the validation out of line? That's a lot of io stats specific
code to be in pgstat_read_statsfile().
Done.
+/*
+ * Helper function to accumulate PgStat_IOOpCounters. If either of the
+ * passed-in PgStat_IOOpCounters are members of PgStatShared_IOContextOps, the
+ * caller is responsible for ensuring that the appropriate lock is held. This
+ * is not asserted because this function could plausibly be used to accumulate
+ * two local/pending PgStat_IOOpCounters.
What's "this" here?
I rephrased it.
@@ -496,6 +503,8 @@ extern PgStat_CheckpointerStats
*pgstat_fetch_stat_checkpointer(void);
*/
extern void pgstat_count_io_op(IOOp io_op, IOContext io_context);
+extern PgStat_BackendIOContextOps *pgstat_fetch_backend_io_context_ops(void);
+extern bool pgstat_flush_io_ops(bool nowait);
extern const char *pgstat_io_context_desc(IOContext io_context);
extern const char *pgstat_io_op_desc(IOOp io_op);

Is there any call to pgstat_flush_io_ops() from outside pgstat*.c? So possibly
it could be in pgstat_internal.h? Not that it's particularly important...
Moved it.
@@ -506,6 +515,43 @@ extern bool pgstat_bktype_io_op_valid(BackendType bktype, IOOp io_op);
 extern bool pgstat_io_context_io_op_valid(IOContext io_context, IOOp io_op);
 extern bool pgstat_expect_io_op(BackendType bktype, IOContext io_context, IOOp io_op);
+/*
+ * Functions to assert that invalid IO Operation counters are zero. Used with
+ * the validation functions in pgstat_io_ops.c
+ */
+static inline void
+pgstat_io_context_ops_assert_zero(PgStat_IOOpCounters *counters)
+{
+	Assert(counters->allocs == 0 && counters->extends == 0 &&
+		   counters->fsyncs == 0 && counters->reads == 0 &&
+		   counters->writes == 0);
+}
+
+static inline void
+pgstat_io_op_assert_zero(PgStat_IOOpCounters *counters, IOOp io_op)
+{
+	switch (io_op)
+	{
+		case IOOP_ALLOC:
+			Assert(counters->allocs == 0);
+			return;
+		case IOOP_EXTEND:
+			Assert(counters->extends == 0);
+			return;
+		case IOOP_FSYNC:
+			Assert(counters->fsyncs == 0);
+			return;
+		case IOOP_READ:
+			Assert(counters->reads == 0);
+			return;
+		case IOOP_WRITE:
+			Assert(counters->writes == 0);
+			return;
+	}
+
+	elog(ERROR, "unrecognized IOOp value: %d", io_op);

Hm. This means it'll emit code even in non-assertion builds - this should
probably just be an Assert(false) or pg_unreachable().
Fixed.
Subject: [PATCH v28 5/5] Add system view tracking IO ops per backend type
View stats are fetched from statistics incremented when a backend
performs an IO Operation and maintained by the cumulative statistics
subsystem.

"fetched from statistics incremented"?
Rephrased it.
Each row of the view is stats for a particular BackendType for a
particular IOContext (e.g. shared buffer accesses by checkpointer) and
each column in the view is the total number of IO Operations done (e.g.
writes).

s/is/shows/?
s/for a particular BackendType for a particular IOContext/for a particular
BackendType and IOContext/? Somehow the repetition is weird.
Both of the above wordings are now changed.
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 9440b41770..9949011ba3 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -448,6 +448,15 @@ postgres  27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       </entry>
      </row>
+
+     <row>
+      <entry><structname>pg_stat_io</structname><indexterm><primary>pg_stat_io</primary></indexterm></entry>
+      <entry>A row for each IO Context for each backend type showing
+      statistics about backend IO operations. See
+       <link linkend="monitoring-pg-stat-io-view">
+       <structname>pg_stat_io</structname></link> for details.
+      </entry>
+     </row>

The "for each for each" thing again :)
Changed it.
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>io_context</structfield> <type>text</type>
+      </para>
+      <para>
+       IO Context used (e.g. shared buffers, direct).
+      </para></entry>
+     </row>

Wrong list of contexts.
Fixed it.
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>alloc</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of buffers allocated.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>extend</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of blocks extended.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>fsync</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of blocks fsynced.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>read</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of blocks read.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>write</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of blocks written.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
+      </para>
+      <para>
+       Time at which these statistics were last reset.
+      </para></entry>
+     </row>
    </tbody>

Part of me thinks it'd be nicer if it were "allocated, read, written, extended,
fsynced, stats_reset", instead of alphabetical order. The order already isn't
alphabetical.
I've updated the order in the view and docs.
+	/*
+	 * When adding a new column to the pg_stat_io view, add a new enum value
+	 * here above IO_NUM_COLUMNS.
+	 */
+	enum
+	{
+		IO_COLUMN_BACKEND_TYPE,
+		IO_COLUMN_IO_CONTEXT,
+		IO_COLUMN_ALLOCS,
+		IO_COLUMN_EXTENDS,
+		IO_COLUMN_FSYNCS,
+		IO_COLUMN_READS,
+		IO_COLUMN_WRITES,
+		IO_COLUMN_RESET_TIME,
+		IO_NUM_COLUMNS,
+	};

Given it's local and some of the lines are long, maybe just use COL?
I've shortened COLUMN to COL. However, I've also moved this enum outside
of the function and typedef'd it. I did this because, upon changing the
order of the columns in the view, I could no longer use
IO_COLUMN_IOOP_OFFSET and the IOOp value in the loop at the bottom of
pg_stat_get_io() to set the correct column to NULL. So, I created a
helper function which translates IOOp to io_stat_col.
+#define IO_COLUMN_IOOP_OFFSET (IO_COLUMN_IO_CONTEXT + 1)
Undef'ing it probably worth doing.
It's gone now anyway.
+	SetSingleFuncCall(fcinfo, 0);
+	rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+
+	backends_io_stats = pgstat_fetch_backend_io_context_ops();
+
+	reset_time = TimestampTzGetDatum(backends_io_stats->stat_reset_timestamp);
+
+	for (int bktype = 0; bktype < BACKEND_NUM_TYPES; bktype++)
+	{
+		Datum		bktype_desc = CStringGetTextDatum(GetBackendTypeDesc(bktype));
+		bool		expect_backend_stats = true;
+		PgStat_IOContextOps *io_context_ops = &backends_io_stats->stats[bktype];
+
+		/*
+		 * For those BackendTypes without IO Operation stats, skip
+		 * representing them in the view altogether.
+		 */
+		if (!pgstat_io_op_stats_collected(bktype))
+			expect_backend_stats = false;

Why not just expect_backend_stats = pgstat_io_op_stats_collected()?
Updated this everywhere it occurred.
+		for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
+		{
+			PgStat_IOOpCounters *counters = &io_context_ops->data[io_context];
+			Datum		values[IO_NUM_COLUMNS];
+			bool		nulls[IO_NUM_COLUMNS];
+
+			/*
+			 * Some combinations of IOCONTEXT and BackendType are not valid
+			 * for any type of IO Operation. In such cases, omit the entire
+			 * row from the view.
+			 */
+			if (!expect_backend_stats ||
+				!pgstat_bktype_io_context_valid(bktype, io_context))
+			{
+				pgstat_io_context_ops_assert_zero(counters);
+				continue;
+			}
+
+			memset(values, 0, sizeof(values));
+			memset(nulls, 0, sizeof(nulls));

I'd replace the memset with values[...] = {0} etc.
Done.
+			values[IO_COLUMN_BACKEND_TYPE] = bktype_desc;
+			values[IO_COLUMN_IO_CONTEXT] = CStringGetTextDatum(
+															   pgstat_io_context_desc(io_context));
Pgindent, I hate you.
Perhaps put the context desc in a local var, so it doesn't look quite this
ugly?
Did this.
+-- Test that allocs, extends, reads, and writes to Shared Buffers and fsyncs
+-- done to ensure durability of Shared Buffers are tracked in pg_stat_io.
+SELECT sum(alloc) AS io_sum_shared_allocs_before FROM pg_stat_io WHERE io_context = 'Shared' \gset
+SELECT sum(extend) AS io_sum_shared_extends_before FROM pg_stat_io WHERE io_context = 'Shared' \gset
+SELECT sum(fsync) AS io_sum_shared_fsyncs_before FROM pg_stat_io WHERE io_context = 'Shared' \gset
+SELECT sum(read) AS io_sum_shared_reads_before FROM pg_stat_io WHERE io_context = 'Shared' \gset
+SELECT sum(write) AS io_sum_shared_writes_before FROM pg_stat_io WHERE io_context = 'Shared' \gset
+-- Create a regular table and insert some data to generate IOCONTEXT_SHARED allocs and extends.
+CREATE TABLE test_io_shared(a int);
+INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+-- After a checkpoint, there should be some additional IOCONTEXT_SHARED writes and fsyncs.
+CHECKPOINT;
Does that work reliably? A checkpoint could have started just before the
CREATE TABLE, I think? Then it'd not have flushed those writes yet. I think
doing two checkpoints would protect against that.
If the first checkpoint starts just before creating the table and those
buffers are dirtied during that checkpoint and thus not written out by
checkpointer during that checkpoint, then the test's (single) explicit
checkpoint would end up picking up those dirty buffers and writing them
out, right?
+DROP TABLE test_io_shared;
+DROP TABLESPACE test_io_shared_stats_tblspc;

Tablespace creation is somewhat expensive, do we really need that? There
should be one set up in setup.sql or such.
The only ones I see in regress are for tablespace.sql which drops them
in the same test and is testing dropping tablespaces.
+-- Test that allocs, extends, reads, and writes of temporary tables are
+-- tracked in pg_stat_io.
+CREATE TEMPORARY TABLE test_io_local(a int, b TEXT);
+SELECT sum(alloc) AS io_sum_local_allocs_before FROM pg_stat_io WHERE io_context = 'Local' \gset
+SELECT sum(extend) AS io_sum_local_extends_before FROM pg_stat_io WHERE io_context = 'Local' \gset
+SELECT sum(read) AS io_sum_local_reads_before FROM pg_stat_io WHERE io_context = 'Local' \gset
+SELECT sum(write) AS io_sum_local_writes_before FROM pg_stat_io WHERE io_context = 'Local' \gset
+-- Insert enough values that we need to reuse and write out dirty local
+-- buffers.
+INSERT INTO test_io_local SELECT generate_series(1, 80000) as id,
+'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
Could be abbreviated with repeat('a', some-number) :P
Done.
Can the table be smaller than this? That might show up on a slow machine.
Setting temp_buffers to 1MB, 7500 tuples of this width seem like enough.
I inserted 8000 to be safe -- seems like an order of magnitude less
should be good.
+SELECT sum(alloc) AS io_sum_local_allocs_after FROM pg_stat_io WHERE io_context = 'Local' \gset
+SELECT sum(extend) AS io_sum_local_extends_after FROM pg_stat_io WHERE io_context = 'Local' \gset
+SELECT sum(read) AS io_sum_local_reads_after FROM pg_stat_io WHERE io_context = 'Local' \gset
+SELECT sum(write) AS io_sum_local_writes_after FROM pg_stat_io WHERE io_context = 'Local' \gset
+SELECT :io_sum_local_allocs_after > :io_sum_local_allocs_before;
Random q: Why are we uppercasing the first letter of the context?
hmm. dunno. I changed it to be lowercase now.
+CREATE TABLE test_io_strategy(a INT, b INT);
+ALTER TABLE test_io_strategy SET (autovacuum_enabled = 'false');

I think you can specify that as part of the CREATE TABLE. Not sure if
otherwise there's not a race where autovac could start before you do the
ALTER.
Done.
+INSERT INTO test_io_strategy SELECT i, i from generate_series(1, 8000)i;
+-- Ensure that the next VACUUM will need to perform IO by rewriting the
+-- table first with VACUUM (FULL).
... because VACUUM FULL currently doesn't set all-visible etc on the pages,
which the subsequent vacuum will then do.
It is true that the second VACUUM will set all-visible while VACUUM FULL
will not. However, I didn't think that that writing was what allowed us
to test strategy reads and allocs. It would theoretically allow us to
test strategy writes, however, in practice, checkpointer or background
writer often wrote out these dirty pages with all-visible set before
this backend had a chance to reuse them and write them out itself.
Unless you are saying that the subsequent VACUUM would be a no-op were
VACUUM FULL to set all-visible on the rewritten pages?
+-- Hope that the previous value of wal_skip_threshold was the default. We
+-- can't use BEGIN...SET LOCAL since VACUUM can't be run inside a
+-- transaction block.
+RESET wal_skip_threshold;

Nothing in this file set it before, so that's a pretty sure-to-be-fulfilled
hope.
I've removed the comment.
+-- Test that, when using a Strategy, if creating a relation, Strategy extends
s/if/when/?
Changed this.
Thanks for the detailed review!
- Melanie
Attachments:
v29-0003-Add-system-view-tracking-IO-ops-per-backend-type.patch (text/x-patch)
From b5deb62dc5c50434191cc2d4c12de46b6ea22ce2 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Thu, 11 Aug 2022 18:28:50 -0400
Subject: [PATCH v29 3/3] Add system view tracking IO ops per backend type
Add pg_stat_io, a system view which tracks the number of IOOps (allocs,
extends, fsyncs, reads, and writes) done through each IOContext (shared
buffers, local buffers, strategy buffers) by each type of backend (e.g.
client backend, checkpointer).
Some BackendTypes do not accumulate IO operations statistics and will
not be included in the view.
Some IOContexts are not used by some BackendTypes and will not be in the
view. For example, checkpointer does not use a BufferAccessStrategy
(currently), so there will be no row for the "strategy" IOContext for
checkpointer.
Some IOOps are invalid in combination with certain IOContexts. Those
cells will be NULL in the view to distinguish between 0 observed IOOps
of that type and an invalid combination. For example, local buffers are
not fsync'd so cells for all BackendTypes for IOCONTEXT_LOCAL and
IOOP_FSYNC will be NULL.
Some BackendTypes never perform certain IOOps. Those cells will also be
NULL in the view. For example, bgwriter should not perform reads.
View stats are populated with statistics incremented when a backend
performs an IO Operation and maintained by the cumulative statistics
subsystem.
Each row of the view shows stats for a particular BackendType and
IOContext combination (e.g. shared buffer accesses by checkpointer) and
each column in the view is the total number of IO Operations done (e.g.
writes).
So a cell in the view would be, for example, the number of shared
buffers written by checkpointer since the last stats reset.
Note that some of the cells in the view are redundant with fields in
pg_stat_bgwriter (e.g. buffers_backend), however these have been kept in
pg_stat_bgwriter for backwards compatibility. Deriving the redundant
pg_stat_bgwriter stats from the IO operations stats structures was also
problematic due to the separate reset targets for 'bgwriter' and 'io'.
Suggested by Andres Freund
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
doc/src/sgml/monitoring.sgml | 115 ++++++++++++++-
src/backend/catalog/system_views.sql | 12 ++
src/backend/utils/adt/pgstatfuncs.c | 117 ++++++++++++++++
src/include/catalog/pg_proc.dat | 9 ++
src/test/regress/expected/rules.out | 9 ++
src/test/regress/expected/stats.out | 202 +++++++++++++++++++++++++++
src/test/regress/sql/stats.sql | 104 ++++++++++++++
7 files changed, 567 insertions(+), 1 deletion(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 9440b41770..c7ca078bc8 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -448,6 +448,15 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</entry>
</row>
+ <row>
+ <entry><structname>pg_stat_io</structname><indexterm><primary>pg_stat_io</primary></indexterm></entry>
+ <entry>A row for each IO Context for each backend type showing
+ statistics about backend IO operations. See
+ <link linkend="monitoring-pg-stat-io-view">
+ <structname>pg_stat_io</structname></link> for details.
+ </entry>
+ </row>
+
<row>
<entry><structname>pg_stat_wal</structname><indexterm><primary>pg_stat_wal</primary></indexterm></entry>
<entry>One row only, showing statistics about WAL activity. See
@@ -3600,7 +3609,111 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
<structfield>stats_reset</structfield> <type>timestamp with time zone</type>
</para>
<para>
- Time at which these statistics were last reset
+ Time at which these statistics were last reset.
+ </para></entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ </sect2>
+
+ <sect2 id="monitoring-pg-stat-io-view">
+ <title><structname>pg_stat_io</structname></title>
+
+ <indexterm>
+ <primary>pg_stat_io</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_stat_io</structname> view has a row for each backend type
+ and IO Context containing global data for the cluster on IO Operations done
+ by that backend type in that IO Context.
+ </para>
+
+ <table id="pg-stat-io-view" xreflabel="pg_stat_io">
+ <title><structname>pg_stat_io</structname> View</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>backend_type</structfield> <type>text</type>
+ </para>
+ <para>
+ Type of backend (e.g. background worker, autovacuum worker).
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>io_context</structfield> <type>text</type>
+ </para>
+ <para>
+ IO Context used (e.g. shared buffers, local buffers).
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>alloc</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers allocated.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>read</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of blocks read.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>write</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of blocks written.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>extend</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of blocks extended.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>fsync</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of blocks fsynced.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
+ </para>
+ <para>
+ Time at which these statistics were last reset.
</para></entry>
</row>
</tbody>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 5a844b63a1..fa5cac7759 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1115,6 +1115,18 @@ CREATE VIEW pg_stat_bgwriter AS
pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+CREATE VIEW pg_stat_io AS
+SELECT
+ b.backend_type,
+ b.io_context,
+ b.alloc,
+ b.read,
+ b.write,
+ b.extend,
+ b.fsync,
+ b.stats_reset
+FROM pg_stat_get_io() b;
+
CREATE VIEW pg_stat_wal AS
SELECT
w.wal_records,
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index dc3a1a26a4..b7f5818028 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1726,6 +1726,123 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
}
+/*
+ * When adding a new column to the pg_stat_io view, add a new enum value
+ * here above IO_NUM_COLUMNS.
+ */
+typedef enum io_stat_col
+{
+ IO_COL_BACKEND_TYPE,
+ IO_COL_IO_CONTEXT,
+ IO_COL_ALLOCS,
+ IO_COL_READS,
+ IO_COL_WRITES,
+ IO_COL_EXTENDS,
+ IO_COL_FSYNCS,
+ IO_COL_RESET_TIME,
+ IO_NUM_COLUMNS,
+} io_stat_col;
+
+/*
+ * When adding a new IOOp, add a new io_stat_col and add a case to this
+ * function returning the corresponding io_stat_col.
+ */
+static io_stat_col
+pgstat_io_op_get_index(IOOp io_op)
+{
+ switch (io_op)
+ {
+ case IOOP_ALLOC:
+ return IO_COL_ALLOCS;
+ case IOOP_READ:
+ return IO_COL_READS;
+ case IOOP_WRITE:
+ return IO_COL_WRITES;
+ case IOOP_EXTEND:
+ return IO_COL_EXTENDS;
+ case IOOP_FSYNC:
+ return IO_COL_FSYNCS;
+ }
+
+ elog(ERROR, "unrecognized IOOp value: %d", io_op);
+}
+
+Datum
+pg_stat_get_io(PG_FUNCTION_ARGS)
+{
+ PgStat_BackendIOContextOps *backends_io_stats;
+ ReturnSetInfo *rsinfo;
+ Datum reset_time;
+
+ SetSingleFuncCall(fcinfo, 0);
+ rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+
+ backends_io_stats = pgstat_fetch_backend_io_context_ops();
+
+ reset_time = TimestampTzGetDatum(backends_io_stats->stat_reset_timestamp);
+
+ for (int bktype = 0; bktype < BACKEND_NUM_TYPES; bktype++)
+ {
+ Datum bktype_desc = CStringGetTextDatum(GetBackendTypeDesc(bktype));
+ bool expect_backend_stats = true;
+ PgStat_IOContextOps *io_context_ops = &backends_io_stats->stats[bktype];
+
+ /*
+ * For those BackendTypes without IO Operation stats, skip
+ * representing them in the view altogether.
+ */
+ expect_backend_stats = pgstat_io_op_stats_collected(bktype);
+
+ for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ {
+ PgStat_IOOpCounters *counters = &io_context_ops->data[io_context];
+ const char *io_context_str = pgstat_io_context_desc(io_context);
+
+ Datum values[IO_NUM_COLUMNS] = {0};
+ bool nulls[IO_NUM_COLUMNS] = {0};
+
+ /*
+ * Some combinations of IOCONTEXT and BackendType are not valid
+ * for any type of IO Operation. In such cases, omit the entire
+ * row from the view.
+ */
+ if (!expect_backend_stats ||
+ !pgstat_bktype_io_context_valid(bktype, io_context))
+ {
+ pgstat_io_context_ops_assert_zero(counters);
+ continue;
+ }
+
+ values[IO_COL_BACKEND_TYPE] = bktype_desc;
+ values[IO_COL_IO_CONTEXT] = CStringGetTextDatum(io_context_str);
+ values[IO_COL_ALLOCS] = Int64GetDatum(counters->allocs);
+ values[IO_COL_READS] = Int64GetDatum(counters->reads);
+ values[IO_COL_WRITES] = Int64GetDatum(counters->writes);
+ values[IO_COL_EXTENDS] = Int64GetDatum(counters->extends);
+ values[IO_COL_FSYNCS] = Int64GetDatum(counters->fsyncs);
+ values[IO_COL_RESET_TIME] = TimestampTzGetDatum(reset_time);
+
+ /*
+ * Some combinations of BackendType and IOOp and of IOContext and
+ * IOOp are not valid. Set these cells in the view NULL and assert
+ * that these stats are zero as expected.
+ */
+ for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
+ {
+ if (!(pgstat_io_op_valid(bktype, io_context, io_op)))
+ {
+ pgstat_io_op_assert_zero(counters, io_op);
+ nulls[pgstat_io_op_get_index(io_op)] = true;
+ }
+ }
+
+ tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
+ }
+ }
+
+ return (Datum) 0;
+}
+
/*
* Returns statistics of WAL activity
*/
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index be47583122..760055689f 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5646,6 +5646,15 @@
proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
+{ oid => '8459', descr => 'statistics: per backend type IO statistics',
+ proname => 'pg_stat_get_io', provolatile => 'v',
+ prorows => '14', proretset => 't',
+ proparallel => 'r', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,text,int8,int8,int8,int8,int8,timestamptz}',
+ proargmodes => '{o,o,o,o,o,o,o,o}',
+ proargnames => '{backend_type,io_context,alloc,read,write,extend,fsync,stats_reset}',
+ prosrc => 'pg_stat_get_io' },
+
{ oid => '1136', descr => 'statistics: information about WAL activity',
proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
proparallel => 'r', prorettype => 'record', proargtypes => '',
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 7ec3d2688f..d79a484df9 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1873,6 +1873,15 @@ pg_stat_gssapi| SELECT s.pid,
s.gss_enc AS encrypted
FROM pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid, query_id)
WHERE (s.client_port IS NOT NULL);
+pg_stat_io| SELECT b.backend_type,
+ b.io_context,
+ b.alloc,
+ b.read,
+ b.write,
+ b.extend,
+ b.fsync,
+ b.stats_reset
+ FROM pg_stat_get_io() b(backend_type, io_context, alloc, read, write, extend, fsync, stats_reset);
pg_stat_progress_analyze| SELECT s.pid,
s.datid,
d.datname,
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index 6b233ff4c0..8aace2d81c 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -796,4 +796,206 @@ SELECT pg_stat_get_subscription_stats(NULL);
(1 row)
+-- Test that allocs, reads, writes, and extends to shared buffers and fsyncs
+-- done to ensure durability of shared buffers are tracked in pg_stat_io.
+SELECT sum(alloc) AS io_sum_shared_allocs_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(read) AS io_sum_shared_reads_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(write) AS io_sum_shared_writes_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(extend) AS io_sum_shared_extends_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(fsync) AS io_sum_shared_fsyncs_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+-- Create a regular table and insert some data to generate IOCONTEXT_SHARED
+-- allocs and extends.
+CREATE TABLE test_io_shared(a int);
+INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+-- After a checkpoint, there should be some additional IOCONTEXT_SHARED writes
+-- and fsyncs.
+CHECKPOINT;
+SELECT sum(alloc) AS io_sum_shared_allocs_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(write) AS io_sum_shared_writes_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(extend) AS io_sum_shared_extends_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(fsync) AS io_sum_shared_fsyncs_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT :io_sum_shared_allocs_after > :io_sum_shared_allocs_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_shared_writes_after > :io_sum_shared_writes_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_shared_extends_after > :io_sum_shared_extends_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT current_setting('fsync') = 'off' OR :io_sum_shared_fsyncs_after > :io_sum_shared_fsyncs_before;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Change the tablespace so that the table is rewritten directly, then SELECT
+-- from it to cause it to be read back into shared buffers.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE test_io_shared_stats_tblspc LOCATION '';
+ALTER TABLE test_io_shared SET TABLESPACE test_io_shared_stats_tblspc;
+SELECT COUNT(*) FROM test_io_shared;
+ count
+-------
+ 100
+(1 row)
+
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(read) AS io_sum_shared_reads_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT :io_sum_shared_reads_after > :io_sum_shared_reads_before;
+ ?column?
+----------
+ t
+(1 row)
+
+DROP TABLE test_io_shared;
+DROP TABLESPACE test_io_shared_stats_tblspc;
+-- Test that allocs, extends, reads, and writes of temporary tables are tracked
+-- in pg_stat_io.
+-- Set temp_buffers to a low value so that we can trigger writes with fewer
+-- inserted tuples.
+SET temp_buffers TO '1MB';
+CREATE TEMPORARY TABLE test_io_local(a int, b TEXT);
+SELECT sum(alloc) AS io_sum_local_allocs_before FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(read) AS io_sum_local_reads_before FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(write) AS io_sum_local_writes_before FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(extend) AS io_sum_local_extends_before FROM pg_stat_io WHERE io_context = 'local' \gset
+-- Insert enough values that we need to reuse and write out dirty local
+-- buffers.
+INSERT INTO test_io_local SELECT generate_series(1, 8000) as id, repeat('a', 100);
+-- Read in evicted buffers.
+SELECT COUNT(*) FROM test_io_local;
+ count
+-------
+ 8000
+(1 row)
+
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(alloc) AS io_sum_local_allocs_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(read) AS io_sum_local_reads_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(write) AS io_sum_local_writes_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(extend) AS io_sum_local_extends_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT :io_sum_local_allocs_after > :io_sum_local_allocs_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_local_reads_after > :io_sum_local_reads_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_local_writes_after > :io_sum_local_writes_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_local_extends_after > :io_sum_local_extends_before;
+ ?column?
+----------
+ t
+(1 row)
+
+RESET temp_buffers;
+-- Test that, when using a BufferAccessStrategy, reusing buffers from the
+-- Strategy ring count as "Strategy" allocs in pg_stat_io. Also test that
+-- Strategy reads are counted as such.
+-- Set wal_skip_threshold smaller than the expected size of test_io_strategy so
+-- that, even if wal_level is minimal, VACUUM FULL will fsync the newly
+-- rewritten test_io_strategy instead of writing it to WAL. Writing it to WAL
+-- will result in the newly written relation pages being in shared buffers --
+-- preventing us from testing BufferAccessStrategy allocs and reads.
+SET wal_skip_threshold = '1 kB';
+SELECT sum(alloc) AS io_sum_strategy_allocs_before FROM pg_stat_io WHERE io_context = 'strategy' \gset
+SELECT sum(read) AS io_sum_strategy_reads_before FROM pg_stat_io WHERE io_context = 'strategy' \gset
+CREATE TABLE test_io_strategy(a int, b int) WITH (autovacuum_enabled = 'false');
+INSERT INTO test_io_strategy SELECT i, i from generate_series(1, 8000)i;
+-- Ensure that the next VACUUM will need to perform IO by rewriting the table
+-- first with VACUUM (FULL).
+VACUUM (FULL) test_io_strategy;
+VACUUM (PARALLEL 0) test_io_strategy;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(alloc) AS io_sum_strategy_allocs_after FROM pg_stat_io WHERE io_context = 'strategy' \gset
+SELECT sum(read) AS io_sum_strategy_reads_after FROM pg_stat_io WHERE io_context = 'strategy' \gset
+SELECT :io_sum_strategy_allocs_after > :io_sum_strategy_allocs_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_strategy_reads_after > :io_sum_strategy_reads_before;
+ ?column?
+----------
+ t
+(1 row)
+
+DROP TABLE test_io_strategy;
+RESET wal_skip_threshold;
+-- Test that, when using a BufferAccessStrategy and creating a relation,
+-- Strategy extends are counted in pg_stat_io.
+-- A CTAS uses a Bulk Write strategy.
+SELECT sum(extend) AS io_sum_strategy_extends_before FROM pg_stat_io WHERE io_context = 'strategy' \gset
+CREATE TABLE test_io_strategy_extend AS SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(extend) AS io_sum_strategy_extends_after FROM pg_stat_io WHERE io_context = 'strategy' \gset
+SELECT :io_sum_strategy_extends_after > :io_sum_strategy_extends_before;
+ ?column?
+----------
+ t
+(1 row)
+
+DROP TABLE test_io_strategy_extend;
+-- Test stats reset
+SELECT sum(alloc) + sum(extend) + sum(fsync) + sum(read) + sum(write) AS io_stats_pre_reset FROM pg_stat_io \gset
+SELECT pg_stat_reset_shared('io');
+ pg_stat_reset_shared
+----------------------
+
+(1 row)
+
+SELECT sum(alloc) + sum(extend) + sum(fsync) + sum(read) + sum(write) AS io_stats_post_reset FROM pg_stat_io \gset
+SELECT :io_stats_post_reset < :io_stats_pre_reset;
+ ?column?
+----------
+ t
+(1 row)
+
-- End of Stats Test
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index 096f00ce8b..3c61dd9c4f 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -396,4 +396,108 @@ SELECT pg_stat_get_replication_slot(NULL);
SELECT pg_stat_get_subscription_stats(NULL);
+-- Test that allocs, reads, writes, and extends to shared buffers and fsyncs
+-- done to ensure durability of shared buffers are tracked in pg_stat_io.
+SELECT sum(alloc) AS io_sum_shared_allocs_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(read) AS io_sum_shared_reads_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(write) AS io_sum_shared_writes_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(extend) AS io_sum_shared_extends_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(fsync) AS io_sum_shared_fsyncs_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+-- Create a regular table and insert some data to generate IOCONTEXT_SHARED
+-- allocs and extends.
+CREATE TABLE test_io_shared(a int);
+INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+-- After a checkpoint, there should be some additional IOCONTEXT_SHARED writes
+-- and fsyncs.
+CHECKPOINT;
+SELECT sum(alloc) AS io_sum_shared_allocs_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(write) AS io_sum_shared_writes_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(extend) AS io_sum_shared_extends_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(fsync) AS io_sum_shared_fsyncs_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT :io_sum_shared_allocs_after > :io_sum_shared_allocs_before;
+SELECT :io_sum_shared_writes_after > :io_sum_shared_writes_before;
+SELECT :io_sum_shared_extends_after > :io_sum_shared_extends_before;
+SELECT current_setting('fsync') = 'off' OR :io_sum_shared_fsyncs_after > :io_sum_shared_fsyncs_before;
+-- Change the tablespace so that the table is rewritten directly, then SELECT
+-- from it to cause it to be read back into shared buffers.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE test_io_shared_stats_tblspc LOCATION '';
+ALTER TABLE test_io_shared SET TABLESPACE test_io_shared_stats_tblspc;
+SELECT COUNT(*) FROM test_io_shared;
+SELECT pg_stat_force_next_flush();
+SELECT sum(read) AS io_sum_shared_reads_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT :io_sum_shared_reads_after > :io_sum_shared_reads_before;
+DROP TABLE test_io_shared;
+DROP TABLESPACE test_io_shared_stats_tblspc;
+
+-- Test that allocs, extends, reads, and writes of temporary tables are tracked
+-- in pg_stat_io.
+
+-- Set temp_buffers to a low value so that we can trigger writes with fewer
+-- inserted tuples.
+SET temp_buffers TO '1MB';
+CREATE TEMPORARY TABLE test_io_local(a int, b TEXT);
+SELECT sum(alloc) AS io_sum_local_allocs_before FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(read) AS io_sum_local_reads_before FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(write) AS io_sum_local_writes_before FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(extend) AS io_sum_local_extends_before FROM pg_stat_io WHERE io_context = 'local' \gset
+-- Insert enough values that we need to reuse and write out dirty local
+-- buffers.
+INSERT INTO test_io_local SELECT generate_series(1, 8000) as id, repeat('a', 100);
+-- Read in evicted buffers.
+SELECT COUNT(*) FROM test_io_local;
+SELECT pg_stat_force_next_flush();
+SELECT sum(alloc) AS io_sum_local_allocs_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(read) AS io_sum_local_reads_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(write) AS io_sum_local_writes_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(extend) AS io_sum_local_extends_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT :io_sum_local_allocs_after > :io_sum_local_allocs_before;
+SELECT :io_sum_local_reads_after > :io_sum_local_reads_before;
+SELECT :io_sum_local_writes_after > :io_sum_local_writes_before;
+SELECT :io_sum_local_extends_after > :io_sum_local_extends_before;
+RESET temp_buffers;
+
+-- Test that, when using a BufferAccessStrategy, reusing buffers from the
+-- Strategy ring count as "Strategy" allocs in pg_stat_io. Also test that
+-- Strategy reads are counted as such.
+
+-- Set wal_skip_threshold smaller than the expected size of test_io_strategy so
+-- that, even if wal_level is minimal, VACUUM FULL will fsync the newly
+-- rewritten test_io_strategy instead of writing it to WAL. Writing it to WAL
+-- will result in the newly written relation pages being in shared buffers --
+-- preventing us from testing BufferAccessStrategy allocs and reads.
+SET wal_skip_threshold = '1 kB';
+SELECT sum(alloc) AS io_sum_strategy_allocs_before FROM pg_stat_io WHERE io_context = 'strategy' \gset
+SELECT sum(read) AS io_sum_strategy_reads_before FROM pg_stat_io WHERE io_context = 'strategy' \gset
+CREATE TABLE test_io_strategy(a int, b int) WITH (autovacuum_enabled = 'false');
+INSERT INTO test_io_strategy SELECT i, i from generate_series(1, 8000)i;
+-- Ensure that the next VACUUM will need to perform IO by rewriting the table
+-- first with VACUUM (FULL).
+VACUUM (FULL) test_io_strategy;
+VACUUM (PARALLEL 0) test_io_strategy;
+SELECT pg_stat_force_next_flush();
+SELECT sum(alloc) AS io_sum_strategy_allocs_after FROM pg_stat_io WHERE io_context = 'strategy' \gset
+SELECT sum(read) AS io_sum_strategy_reads_after FROM pg_stat_io WHERE io_context = 'strategy' \gset
+SELECT :io_sum_strategy_allocs_after > :io_sum_strategy_allocs_before;
+SELECT :io_sum_strategy_reads_after > :io_sum_strategy_reads_before;
+DROP TABLE test_io_strategy;
+RESET wal_skip_threshold;
+
+-- Test that, when using a BufferAccessStrategy and creating a relation,
+-- Strategy extends are counted in pg_stat_io.
+-- A CTAS uses a Bulk Write strategy.
+SELECT sum(extend) AS io_sum_strategy_extends_before FROM pg_stat_io WHERE io_context = 'strategy' \gset
+CREATE TABLE test_io_strategy_extend AS SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+SELECT sum(extend) AS io_sum_strategy_extends_after FROM pg_stat_io WHERE io_context = 'strategy' \gset
+SELECT :io_sum_strategy_extends_after > :io_sum_strategy_extends_before;
+DROP TABLE test_io_strategy_extend;
+
+-- Test stats reset
+SELECT sum(alloc) + sum(extend) + sum(fsync) + sum(read) + sum(write) AS io_stats_pre_reset FROM pg_stat_io \gset
+SELECT pg_stat_reset_shared('io');
+SELECT sum(alloc) + sum(extend) + sum(fsync) + sum(read) + sum(write) AS io_stats_post_reset FROM pg_stat_io \gset
+SELECT :io_stats_post_reset < :io_stats_pre_reset;
+
-- End of Stats Test
--
2.34.1
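The tests above exercise the pg_stat_io view added later in this series. For orientation, a hypothetical interactive query against it (view and column names as used by the tests, not part of any released PostgreSQL) might look like:

```sql
-- Roll up IO operations per IOContext across all backend types
SELECT io_context,
       sum(alloc)  AS allocs,
       sum(read)   AS reads,
       sum(write)  AS writes,
       sum(extend) AS extends,
       sum(fsync)  AS fsyncs
  FROM pg_stat_io
 GROUP BY io_context
 ORDER BY io_context;
```

The regression tests above use the same aggregates, captured before and after each workload with \gset, and only assert that the deltas are positive.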
Attachment: v29-0001-Track-IO-operation-statistics-locally.patch (text/x-patch; charset=US-ASCII)
From fbc1c53f9595cc4724539e199b2622844e746c03 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 22 Aug 2022 11:08:23 -0400
Subject: [PATCH v29 1/3] Track IO operation statistics locally
Introduce "IOOp", an IO operation done by a backend, and "IOContext",
the source, target, or type of that IO. For example, the checkpointer
may write a shared buffer out. This would be counted as an IOOp "write"
in the IOCONTEXT_SHARED IOContext by BackendType "checkpointer".
Each IOOp (alloc, read, write, extend, fsync) is counted per IOContext
(local, shared, or strategy) through a call to pgstat_count_io_op().
The primary concern of these statistics is IO operations on data blocks
during the course of normal database operations. IO operations done by,
for example, the archiver or syslogger are not counted in these
statistics.
IOCONTEXT_LOCAL and IOCONTEXT_SHARED IOContexts concern operations on
local and shared buffers.
The IOCONTEXT_STRATEGY IOContext concerns IO operations on buffers as
part of a BufferAccessStrategy.
IOOP_ALLOC IOOps are counted in IOCONTEXT_SHARED and IOCONTEXT_LOCAL
IOContexts whenever a buffer is acquired through [Local]BufferAlloc().
IOOP_ALLOC IOOps are counted in the IOCONTEXT_STRATEGY IOContext
whenever a buffer already in the strategy ring is reused. IOOP_WRITE
IOOps are counted in the IOCONTEXT_STRATEGY IOContext whenever the
reused dirty buffer is written out.
Stats on IOOps in all IOContexts for a given backend are counted in a
backend's local memory. A subsequent commit will expose functions for
aggregating and viewing these stats.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
src/backend/postmaster/checkpointer.c | 12 ++
src/backend/storage/buffer/bufmgr.c | 60 +++++--
src/backend/storage/buffer/freelist.c | 21 ++-
src/backend/storage/buffer/localbuf.c | 5 +
src/backend/storage/sync/sync.c | 2 +
src/backend/utils/activity/Makefile | 1 +
src/backend/utils/activity/pgstat_io_ops.c | 199 +++++++++++++++++++++
src/include/pgstat.h | 53 ++++++
src/include/storage/buf_internals.h | 2 +-
src/tools/pgindent/typedefs.list | 4 +
10 files changed, 347 insertions(+), 12 deletions(-)
create mode 100644 src/backend/utils/activity/pgstat_io_ops.c
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 5fc076fc14..bd2e1de7c2 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -1116,6 +1116,18 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
if (!AmBackgroundWriterProcess())
CheckpointerShmem->num_backend_fsync++;
LWLockRelease(CheckpointerCommLock);
+
+ /*
+ * We have no way of knowing if the current IOContext is
+ * IOCONTEXT_SHARED or IOCONTEXT_STRATEGY at this point, so count the
+ * fsync as being in the IOCONTEXT_SHARED IOContext. This is probably
+ * okay, because the number of backend fsyncs doesn't say anything
+ * about the efficacy of the BufferAccessStrategy. And counting both
+ * fsyncs done in IOCONTEXT_SHARED and IOCONTEXT_STRATEGY under
+ * IOCONTEXT_SHARED is likely clearer when investigating the number of
+ * backend fsyncs.
+ */
+ pgstat_count_io_op(IOOP_FSYNC, IOCONTEXT_SHARED);
return false;
}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index e898ffad7b..a49223ee96 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -482,7 +482,7 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
BlockNumber blockNum,
BufferAccessStrategy strategy,
bool *foundPtr);
-static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+static void FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOContext io_context);
static void FindAndDropRelationBuffers(RelFileLocator rlocator,
ForkNumber forkNum,
BlockNumber nForkBlock,
@@ -823,6 +823,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
BufferDesc *bufHdr;
Block bufBlock;
bool found;
+ IOContext io_context;
bool isExtend;
bool isLocalBuf = SmgrIsTemp(smgr);
@@ -986,10 +987,24 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
*/
Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID)); /* spinlock not needed */
- bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
+ if (isLocalBuf)
+ {
+ bufBlock = LocalBufHdrGetBlock(bufHdr);
+ io_context = IOCONTEXT_LOCAL;
+ }
+ else
+ {
+ bufBlock = BufHdrGetBlock(bufHdr);
+
+ if (strategy != NULL)
+ io_context = IOCONTEXT_STRATEGY;
+ else
+ io_context = IOCONTEXT_SHARED;
+ }
if (isExtend)
{
+ pgstat_count_io_op(IOOP_EXTEND, io_context);
/* new buffers are zero-filled */
MemSet((char *) bufBlock, 0, BLCKSZ);
/* don't set checksum for all-zero page */
@@ -1020,6 +1035,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
smgrread(smgr, forkNum, blockNum, (char *) bufBlock);
+ pgstat_count_io_op(IOOP_READ, io_context);
+
if (track_io_timing)
{
INSTR_TIME_SET_CURRENT(io_time);
@@ -1190,6 +1207,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
/* Loop here in case we have to try another victim buffer */
for (;;)
{
+ bool from_ring;
+
/*
* Ensure, while the spinlock's not yet held, that there's a free
* refcount entry.
@@ -1200,7 +1219,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* Select a victim buffer. The buffer is returned with its header
* spinlock still held!
*/
- buf = StrategyGetBuffer(strategy, &buf_state);
+ buf = StrategyGetBuffer(strategy, &buf_state, &from_ring);
Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
@@ -1237,6 +1256,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf),
LW_SHARED))
{
+ IOContext io_context;
+
/*
* If using a nondefault strategy, and writing the buffer
* would require a WAL flush, let the strategy decide whether
@@ -1263,13 +1284,28 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
}
}
+ /*
+ * When a strategy is in use, if the target dirty buffer is an
+ * existing strategy buffer being reused, count this as a
+ * strategy write for the purposes of IO Operations statistics
+ * tracking.
+ *
+ * All dirty shared buffers upon first being added to the ring
+ * will be counted as shared buffer writes.
+ *
+ * When a strategy is not in use, the write can only be a
+ * "regular" write of a dirty shared buffer.
+ */
+
+ io_context = from_ring ? IOCONTEXT_STRATEGY : IOCONTEXT_SHARED;
+
/* OK, do the I/O */
TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(forkNum, blockNum,
smgr->smgr_rlocator.locator.spcOid,
smgr->smgr_rlocator.locator.dbOid,
smgr->smgr_rlocator.locator.relNumber);
- FlushBuffer(buf, NULL);
+ FlushBuffer(buf, NULL, io_context);
LWLockRelease(BufferDescriptorGetContentLock(buf));
ScheduleBufferTagForWriteback(&BackendWritebackContext,
@@ -2573,7 +2609,7 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOCONTEXT_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
@@ -2823,7 +2859,7 @@ BufferGetTag(Buffer buffer, RelFileLocator *rlocator, ForkNumber *forknum,
* as the second parameter. If not, pass NULL.
*/
static void
-FlushBuffer(BufferDesc *buf, SMgrRelation reln)
+FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOContext io_context)
{
XLogRecPtr recptr;
ErrorContextCallback errcallback;
@@ -2903,6 +2939,8 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
*/
bufToWrite = PageSetChecksumCopy((Page) bufBlock, buf->tag.blockNum);
+ pgstat_count_io_op(IOOP_WRITE, io_context);
+
if (track_io_timing)
INSTR_TIME_SET_CURRENT(io_start);
@@ -3554,6 +3592,8 @@ FlushRelationBuffers(Relation rel)
localpage,
false);
+ pgstat_count_io_op(IOOP_WRITE, IOCONTEXT_LOCAL);
+
buf_state &= ~(BM_DIRTY | BM_JUST_DIRTIED);
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
@@ -3589,7 +3629,7 @@ FlushRelationBuffers(Relation rel)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, RelationGetSmgr(rel));
+ FlushBuffer(bufHdr, RelationGetSmgr(rel), IOCONTEXT_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
@@ -3687,7 +3727,7 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, srelent->srel);
+ FlushBuffer(bufHdr, srelent->srel, IOCONTEXT_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
@@ -3897,7 +3937,7 @@ FlushDatabaseBuffers(Oid dbid)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOCONTEXT_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
@@ -3924,7 +3964,7 @@ FlushOneBuffer(Buffer buffer)
Assert(LWLockHeldByMe(BufferDescriptorGetContentLock(bufHdr)));
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOCONTEXT_SHARED);
}
/*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 990e081aae..50d04b2b6d 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -15,6 +15,7 @@
*/
#include "postgres.h"
+#include "pgstat.h"
#include "port/atomics.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
@@ -198,13 +199,15 @@ have_free_buffer(void)
* return the buffer with the buffer header spinlock still held.
*/
BufferDesc *
-StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
+StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_ring)
{
BufferDesc *buf;
int bgwprocno;
int trycounter;
uint32 local_buf_state; /* to avoid repeated (de-)referencing */
+ *from_ring = false;
+
/*
* If given a strategy object, see whether it can select a buffer. We
* assume strategy objects don't need buffer_strategy_lock.
@@ -213,7 +216,22 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
{
buf = GetBufferFromRing(strategy, buf_state);
if (buf != NULL)
+ {
+ /*
+ * When a strategy is in use, reused buffers from the strategy
+ * ring will be counted as IOCONTEXT_STRATEGY allocations for the
+ * purposes of IO Operation statistics tracking.
+ *
+ * However, even when a strategy is in use, if a new buffer must
+ * be allocated from shared buffers and added to the ring, this is
+ * counted instead as an IOCONTEXT_SHARED allocation. So, only
+ * reused buffers are counted as having been allocated in the
+ * IOCONTEXT_STRATEGY IOContext.
+ */
+ *from_ring = true;
+ pgstat_count_io_op(IOOP_ALLOC, IOCONTEXT_STRATEGY);
return buf;
+ }
}
/*
@@ -247,6 +265,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
* the rate of buffer consumption. Note that buffers recycled by a
* strategy object are intentionally not counted here.
*/
+ pgstat_count_io_op(IOOP_ALLOC, IOCONTEXT_SHARED);
pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);
/*
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 98530078a6..5d8f0c98eb 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -18,6 +18,7 @@
#include "access/parallel.h"
#include "catalog/catalog.h"
#include "executor/instrument.h"
+#include "pgstat.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "utils/guc.h"
@@ -196,6 +197,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
LocalRefCount[b]++;
ResourceOwnerRememberBuffer(CurrentResourceOwner,
BufferDescriptorGetBuffer(bufHdr));
+
+ pgstat_count_io_op(IOOP_ALLOC, IOCONTEXT_LOCAL);
break;
}
}
@@ -226,6 +229,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
localpage,
false);
+ pgstat_count_io_op(IOOP_WRITE, IOCONTEXT_LOCAL);
+
/* Mark not-dirty now in case we error out below */
buf_state &= ~BM_DIRTY;
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 9d6a9e9109..5718b52fb5 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -432,6 +432,8 @@ ProcessSyncRequests(void)
total_elapsed += elapsed;
processed++;
+ pgstat_count_io_op(IOOP_FSYNC, IOCONTEXT_SHARED);
+
if (log_checkpoints)
elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f ms",
processed,
diff --git a/src/backend/utils/activity/Makefile b/src/backend/utils/activity/Makefile
index a2e8507fd6..0098785089 100644
--- a/src/backend/utils/activity/Makefile
+++ b/src/backend/utils/activity/Makefile
@@ -22,6 +22,7 @@ OBJS = \
pgstat_checkpointer.o \
pgstat_database.o \
pgstat_function.o \
+ pgstat_io_ops.o \
pgstat_relation.o \
pgstat_replslot.o \
pgstat_shmem.o \
diff --git a/src/backend/utils/activity/pgstat_io_ops.c b/src/backend/utils/activity/pgstat_io_ops.c
new file mode 100644
index 0000000000..5cfd40068d
--- /dev/null
+++ b/src/backend/utils/activity/pgstat_io_ops.c
@@ -0,0 +1,199 @@
+/* -------------------------------------------------------------------------
+ *
+ * pgstat_io_ops.c
+ * Implementation of IO operation statistics.
+ *
+ * This file contains the implementation of IO operation statistics. It is kept
+ * separate from pgstat.c to enforce the line between the statistics access /
+ * storage implementation and the details about individual types of
+ * statistics.
+ *
+ * Copyright (c) 2021-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/utils/activity/pgstat_io_ops.c
+ * -------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "utils/pgstat_internal.h"
+
+static PgStat_IOContextOps pending_IOOpStats;
+
+void
+pgstat_count_io_op(IOOp io_op, IOContext io_context)
+{
+ PgStat_IOOpCounters *pending_counters;
+
+ Assert(io_context < IOCONTEXT_NUM_TYPES);
+ Assert(io_op < IOOP_NUM_TYPES);
+ Assert(pgstat_expect_io_op(MyBackendType, io_context, io_op));
+
+ pending_counters = &pending_IOOpStats.data[io_context];
+
+ switch (io_op)
+ {
+ case IOOP_ALLOC:
+ pending_counters->allocs++;
+ break;
+ case IOOP_EXTEND:
+ pending_counters->extends++;
+ break;
+ case IOOP_FSYNC:
+ pending_counters->fsyncs++;
+ break;
+ case IOOP_READ:
+ pending_counters->reads++;
+ break;
+ case IOOP_WRITE:
+ pending_counters->writes++;
+ break;
+ }
+
+}
+
+const char *
+pgstat_io_context_desc(IOContext io_context)
+{
+ switch (io_context)
+ {
+ case IOCONTEXT_LOCAL:
+ return "local";
+ case IOCONTEXT_SHARED:
+ return "shared";
+ case IOCONTEXT_STRATEGY:
+ return "strategy";
+ }
+
+ elog(ERROR, "unrecognized IOContext value: %d", io_context);
+}
+
+const char *
+pgstat_io_op_desc(IOOp io_op)
+{
+ switch (io_op)
+ {
+ case IOOP_ALLOC:
+ return "alloc";
+ case IOOP_EXTEND:
+ return "extend";
+ case IOOP_FSYNC:
+ return "fsync";
+ case IOOP_READ:
+ return "read";
+ case IOOP_WRITE:
+ return "write";
+ }
+
+ elog(ERROR, "unrecognized IOOp value: %d", io_op);
+}
+
+/*
+ * IO Operation statistics are not collected for all BackendTypes.
+ *
+ * The following BackendTypes either do not participate in the cumulative
+ * stats subsystem or do not do IO operations worth reporting statistics on:
+ * - Syslogger, because it is not connected to shared memory
+ * - Archiver, because most relevant archiving IO is delegated to a
+ *   specialized command or module
+ * - WAL Receiver and WAL Writer, whose IO is not tracked in pg_stat_io for now
+ *
+ * Returns true if the given BackendType participates in the cumulative stats
+ * subsystem for IO Operations and false if it does not.
+ */
+bool
+pgstat_io_op_stats_collected(BackendType bktype)
+{
+ return bktype != B_INVALID && bktype != B_ARCHIVER && bktype != B_LOGGER &&
+ bktype != B_WAL_RECEIVER && bktype != B_WAL_WRITER;
+}
+
+/*
+ * Some BackendTypes will never do IO in certain IOContexts. Check that the
+ * given BackendType is expected to do IO in the given IOContext.
+ */
+bool
+pgstat_bktype_io_context_valid(BackendType bktype, IOContext io_context)
+{
+ bool no_strategy;
+ bool no_local;
+
+ /*
+ * Not all BackendTypes will use a BufferAccessStrategy.
+ */
+ no_strategy = bktype == B_AUTOVAC_LAUNCHER || bktype ==
+ B_BG_WRITER || bktype == B_CHECKPOINTER;
+
+ /*
+ * In core Postgres, only regular backends and WAL Sender processes
+ * executing queries should use local buffers. Parallel workers will not
+ * use local buffers (see InitLocalBuffers()); however, extensions
+ * leveraging background workers have no such limitation, so track IO
+ * Operations in IOCONTEXT_LOCAL for BackendType B_BG_WORKER.
+ */
+ no_local = bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER || bktype
+ == B_CHECKPOINTER || bktype == B_AUTOVAC_WORKER || bktype ==
+ B_STANDALONE_BACKEND || bktype == B_STARTUP;
+
+ if (io_context == IOCONTEXT_STRATEGY && no_strategy)
+ return false;
+
+ if (io_context == IOCONTEXT_LOCAL && no_local)
+ return false;
+
+ return true;
+}
+
+/*
+ * Some BackendTypes will never do certain IOOps and some IOOps should not
+ * occur in certain IOContexts. Check that the given IO Operation is valid for
+ * the given BackendType in the given IOContext. Note that there are currently
+ * no cases of an IO Operation being invalid for a particular BackendType only
+ * within a certain IOContext.
+ */
+bool
+pgstat_io_op_valid(BackendType bktype, IOContext io_context, IOOp io_op)
+{
+ /*
+ * Some BackendTypes should never track IO Operation statistics.
+ */
+ Assert(pgstat_io_op_stats_collected(bktype));
+
+ if ((bktype == B_BG_WRITER || bktype == B_CHECKPOINTER) && io_op ==
+ IOOP_READ)
+ return false;
+
+ if ((bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER || bktype ==
+ B_CHECKPOINTER) && io_op == IOOP_EXTEND)
+ return false;
+
+ /*
+ * Temporary tables using local buffers are not logged and thus do not
+ * require fsync'ing.
+ *
+ * IOOP_FSYNC IOOps done by a backend using a BufferAccessStrategy are
+ * counted in the IOCONTEXT_SHARED IOContext. See comment in
+ * ForwardSyncRequest() for more details.
+ */
+ if ((io_context == IOCONTEXT_LOCAL || io_context == IOCONTEXT_STRATEGY) &&
+ io_op == IOOP_FSYNC)
+ return false;
+
+ return true;
+}
+
+bool
+pgstat_expect_io_op(BackendType bktype, IOContext io_context, IOOp io_op)
+{
+ if (!pgstat_io_op_stats_collected(bktype))
+ return false;
+
+ if (!pgstat_bktype_io_context_valid(bktype, io_context))
+ return false;
+
+ if (!(pgstat_io_op_valid(bktype, io_context, io_op)))
+ return false;
+
+ return true;
+}
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index ac28f813b4..d91bff3be7 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -276,6 +276,44 @@ typedef struct PgStat_CheckpointerStats
PgStat_Counter buf_fsync_backend;
} PgStat_CheckpointerStats;
+/*
+ * Types related to counting IO Operations for various IO Contexts
+ */
+
+typedef enum IOOp
+{
+ IOOP_ALLOC,
+ IOOP_EXTEND,
+ IOOP_FSYNC,
+ IOOP_READ,
+ IOOP_WRITE,
+} IOOp;
+
+#define IOOP_NUM_TYPES (IOOP_WRITE + 1)
+
+typedef enum IOContext
+{
+ IOCONTEXT_LOCAL,
+ IOCONTEXT_SHARED,
+ IOCONTEXT_STRATEGY,
+} IOContext;
+
+#define IOCONTEXT_NUM_TYPES (IOCONTEXT_STRATEGY + 1)
+
+typedef struct PgStat_IOOpCounters
+{
+ PgStat_Counter allocs;
+ PgStat_Counter extends;
+ PgStat_Counter fsyncs;
+ PgStat_Counter reads;
+ PgStat_Counter writes;
+} PgStat_IOOpCounters;
+
+typedef struct PgStat_IOContextOps
+{
+ PgStat_IOOpCounters data[IOCONTEXT_NUM_TYPES];
+} PgStat_IOContextOps;
+
typedef struct PgStat_StatDBEntry
{
PgStat_Counter n_xact_commit;
@@ -453,6 +491,21 @@ extern void pgstat_report_checkpointer(void);
extern PgStat_CheckpointerStats *pgstat_fetch_stat_checkpointer(void);
+/*
+ * Functions in pgstat_io_ops.c
+ */
+
+extern void pgstat_count_io_op(IOOp io_op, IOContext io_context);
+extern const char *pgstat_io_context_desc(IOContext io_context);
+extern const char *pgstat_io_op_desc(IOOp io_op);
+
+/* Validation functions in pgstat_io_ops.c */
+extern bool pgstat_io_op_stats_collected(BackendType bktype);
+extern bool pgstat_bktype_io_context_valid(BackendType bktype, IOContext io_context);
+extern bool pgstat_io_op_valid(BackendType bktype, IOContext io_context, IOOp io_op);
+extern bool pgstat_expect_io_op(BackendType bktype, IOContext io_context, IOOp io_op);
+
+
/*
* Functions in pgstat_database.c
*/
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 406db6be78..50d7e586e9 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -392,7 +392,7 @@ extern void ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *
/* freelist.c */
extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
- uint32 *buf_state);
+ uint32 *buf_state, bool *from_ring);
extern void StrategyFreeBuffer(BufferDesc *buf);
extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
BufferDesc *buf);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index a4a4e356e5..5db1caa112 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1106,7 +1106,9 @@ ID
INFIX
INT128
INTERFACE_INFO
+IOContext
IOFuncSelector
+IOOp
IPCompareMethod
ITEM
IV
@@ -2038,6 +2040,8 @@ PgStat_FetchConsistency
PgStat_FunctionCallUsage
PgStat_FunctionCounts
PgStat_HashKey
+PgStat_IOContextOps
+PgStat_IOOpCounters
PgStat_Kind
PgStat_KindInfo
PgStat_LocalState
--
2.34.1
Attachment: v29-0002-Aggregate-IO-operation-stats-per-BackendType.patch (text/x-patch; charset=US-ASCII)
From 8d43b0f8975640c8945702f0abf90e653af98868 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 22 Aug 2022 11:35:20 -0400
Subject: [PATCH v29 2/3] Aggregate IO operation stats per BackendType
Stats on IOOps for all IOContexts for a backend are tracked locally. Add
functionality for backends to flush these stats to shared memory and
accumulate them with those from all other backends, exited and live.
Also add reset and snapshot functions used by cumulative stats system
for management of these statistics.
The aggregated stats in shared memory could be extended in the future
with per-backend stats -- useful for per-connection IO statistics and
monitoring.
Some BackendTypes will not flush their pending statistics at regular
intervals; they instead explicitly call pgstat_flush_io_ops() during the
course of normal operations to flush their backend-local IO Operation
statistics to shared memory in a timely manner.
Because not all BackendType, IOOp, IOContext combinations are valid, the
validity of the stats is checked before flushing pending stats and
before reading in the existing stats file to shared memory.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
doc/src/sgml/monitoring.sgml | 2 +
src/backend/utils/activity/pgstat.c | 35 ++++
src/backend/utils/activity/pgstat_bgwriter.c | 7 +-
.../utils/activity/pgstat_checkpointer.c | 7 +-
src/backend/utils/activity/pgstat_io_ops.c | 153 ++++++++++++++++++
src/backend/utils/activity/pgstat_relation.c | 15 +-
src/backend/utils/activity/pgstat_shmem.c | 4 +
src/backend/utils/activity/pgstat_wal.c | 4 +-
src/backend/utils/adt/pgstatfuncs.c | 4 +-
src/include/miscadmin.h | 2 +
src/include/pgstat.h | 81 +++++++++-
src/include/utils/pgstat_internal.h | 37 +++++
src/tools/pgindent/typedefs.list | 3 +
13 files changed, 347 insertions(+), 7 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 1d9509a2f6..9440b41770 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -5360,6 +5360,8 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
the <structname>pg_stat_bgwriter</structname>
view, <literal>archiver</literal> to reset all the counters shown in
the <structname>pg_stat_archiver</structname> view,
+ <literal>io</literal> to reset all the counters shown in the
+ <structname>pg_stat_io</structname> view,
<literal>wal</literal> to reset all the counters shown in the
<structname>pg_stat_wal</structname> view or
<literal>recovery_prefetch</literal> to reset all the counters shown
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 6224c498c2..6c95a9d5de 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -359,6 +359,15 @@ static const PgStat_KindInfo pgstat_kind_infos[PGSTAT_NUM_KINDS] = {
.snapshot_cb = pgstat_checkpointer_snapshot_cb,
},
+ [PGSTAT_KIND_IOOPS] = {
+ .name = "io_ops",
+
+ .fixed_amount = true,
+
+ .reset_all_cb = pgstat_io_ops_reset_all_cb,
+ .snapshot_cb = pgstat_io_ops_snapshot_cb,
+ },
+
[PGSTAT_KIND_SLRU] = {
.name = "slru",
@@ -582,6 +591,7 @@ pgstat_report_stat(bool force)
/* Don't expend a clock check if nothing to do */
if (dlist_is_empty(&pgStatPending) &&
+ !have_ioopstats &&
!have_slrustats &&
!pgstat_have_pending_wal())
{
@@ -628,6 +638,9 @@ pgstat_report_stat(bool force)
/* flush database / relation / function / ... stats */
partial_flush |= pgstat_flush_pending_entries(nowait);
+ /* flush IO Operations stats */
+ partial_flush |= pgstat_flush_io_ops(nowait);
+
/* flush wal stats */
partial_flush |= pgstat_flush_wal(nowait);
@@ -1321,6 +1334,14 @@ pgstat_write_statsfile(void)
pgstat_build_snapshot_fixed(PGSTAT_KIND_CHECKPOINTER);
write_chunk_s(fpout, &pgStatLocal.snapshot.checkpointer);
+ /*
+ * Write IO Operations stats struct
+ */
+ pgstat_build_snapshot_fixed(PGSTAT_KIND_IOOPS);
+ write_chunk_s(fpout, &pgStatLocal.snapshot.io_ops.stat_reset_timestamp);
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ write_chunk_s(fpout, &pgStatLocal.snapshot.io_ops.stats[i]);
+
/*
* Write SLRU stats struct
*/
@@ -1495,6 +1516,20 @@ pgstat_read_statsfile(void)
if (!read_chunk_s(fpin, &shmem->checkpointer.stats))
goto error;
+ /*
+ * Read IO Operations stats struct
+ */
+ if (!read_chunk_s(fpin, &shmem->io_ops.stat_reset_timestamp))
+ goto error;
+
+ for (int bktype = 0; bktype < BACKEND_NUM_TYPES; bktype++)
+ {
+ pgstat_backend_io_stats_assert_well_formed(shmem->io_ops.stats[bktype].data,
+ bktype);
+ if (!read_chunk_s(fpin, &shmem->io_ops.stats[bktype].data))
+ goto error;
+ }
+
/*
* Read SLRU stats struct
*/
diff --git a/src/backend/utils/activity/pgstat_bgwriter.c b/src/backend/utils/activity/pgstat_bgwriter.c
index fbb1edc527..3d7f90a1b7 100644
--- a/src/backend/utils/activity/pgstat_bgwriter.c
+++ b/src/backend/utils/activity/pgstat_bgwriter.c
@@ -24,7 +24,7 @@ PgStat_BgWriterStats PendingBgWriterStats = {0};
/*
- * Report bgwriter statistics
+ * Report bgwriter and IO Operation statistics
*/
void
pgstat_report_bgwriter(void)
@@ -56,6 +56,11 @@ pgstat_report_bgwriter(void)
* Clear out the statistics buffer, so it can be re-used.
*/
MemSet(&PendingBgWriterStats, 0, sizeof(PendingBgWriterStats));
+
+ /*
+ * Report IO Operations statistics
+ */
+ pgstat_flush_io_ops(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_checkpointer.c b/src/backend/utils/activity/pgstat_checkpointer.c
index af8d513e7b..cfcf127210 100644
--- a/src/backend/utils/activity/pgstat_checkpointer.c
+++ b/src/backend/utils/activity/pgstat_checkpointer.c
@@ -24,7 +24,7 @@ PgStat_CheckpointerStats PendingCheckpointerStats = {0};
/*
- * Report checkpointer statistics
+ * Report checkpointer and IO Operation statistics
*/
void
pgstat_report_checkpointer(void)
@@ -62,6 +62,11 @@ pgstat_report_checkpointer(void)
* Clear out the statistics buffer, so it can be re-used.
*/
MemSet(&PendingCheckpointerStats, 0, sizeof(PendingCheckpointerStats));
+
+ /*
+ * Report IO Operation statistics
+ */
+ pgstat_flush_io_ops(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_io_ops.c b/src/backend/utils/activity/pgstat_io_ops.c
index 5cfd40068d..8cd1239622 100644
--- a/src/backend/utils/activity/pgstat_io_ops.c
+++ b/src/backend/utils/activity/pgstat_io_ops.c
@@ -20,6 +20,39 @@
#include "utils/pgstat_internal.h"
static PgStat_IOContextOps pending_IOOpStats;
+bool have_ioopstats = false;
+
+
+/*
+ * Helper function to accumulate source PgStat_IOOpCounters into target
+ * PgStat_IOOpCounters. If either of the passed-in PgStat_IOOpCounters are
+ * members of PgStatShared_IOContextOps, the caller is responsible for ensuring
+ * that the appropriate lock is held.
+ */
+static void
+pgstat_accum_io_op(PgStat_IOOpCounters *target, PgStat_IOOpCounters *source, IOOp io_op)
+{
+ switch (io_op)
+ {
+ case IOOP_ALLOC:
+ target->allocs += source->allocs;
+ return;
+ case IOOP_EXTEND:
+ target->extends += source->extends;
+ return;
+ case IOOP_FSYNC:
+ target->fsyncs += source->fsyncs;
+ return;
+ case IOOP_READ:
+ target->reads += source->reads;
+ return;
+ case IOOP_WRITE:
+ target->writes += source->writes;
+ return;
+ }
+
+ elog(ERROR, "unrecognized IOOp value: %d", io_op);
+}
void
pgstat_count_io_op(IOOp io_op, IOContext io_context)
@@ -51,6 +84,77 @@ pgstat_count_io_op(IOOp io_op, IOContext io_context)
break;
}
+ have_ioopstats = true;
+}
+
+PgStat_BackendIOContextOps *
+pgstat_fetch_backend_io_context_ops(void)
+{
+ pgstat_snapshot_fixed(PGSTAT_KIND_IOOPS);
+
+ return &pgStatLocal.snapshot.io_ops;
+}
+
+/*
+ * Flush out locally pending IO Operation statistics entries
+ *
+ * If no stats have been recorded, this function returns false.
+ *
+ * If nowait is true and the lock could not be acquired, this function
+ * returns true. Otherwise, it returns false.
+ */
+bool
+pgstat_flush_io_ops(bool nowait)
+{
+ PgStatShared_IOContextOps *type_shstats;
+ bool expect_backend_stats = true;
+
+ if (!have_ioopstats)
+ return false;
+
+ type_shstats =
+ &pgStatLocal.shmem->io_ops.stats[MyBackendType];
+
+ if (!nowait)
+ LWLockAcquire(&type_shstats->lock, LW_EXCLUSIVE);
+ else if (!LWLockConditionalAcquire(&type_shstats->lock, LW_EXCLUSIVE))
+ return true;
+
+ expect_backend_stats = pgstat_io_op_stats_collected(MyBackendType);
+
+ for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ {
+ PgStat_IOOpCounters *sharedent = &type_shstats->data[io_context];
+ PgStat_IOOpCounters *pendingent = &pending_IOOpStats.data[io_context];
+
+ if (!expect_backend_stats ||
+ !pgstat_bktype_io_context_valid(MyBackendType, io_context))
+ {
+ pgstat_io_context_ops_assert_zero(sharedent);
+ pgstat_io_context_ops_assert_zero(pendingent);
+ continue;
+ }
+
+ for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
+ {
+ if (!(pgstat_io_op_valid(MyBackendType, io_context, io_op)))
+ {
+ pgstat_io_op_assert_zero(sharedent, io_op);
+ pgstat_io_op_assert_zero(pendingent, io_op);
+ continue;
+ }
+
+ pgstat_accum_io_op(sharedent, pendingent, io_op);
+ }
+ }
+
+ LWLockRelease(&type_shstats->lock);
+
+ memset(&pending_IOOpStats, 0, sizeof(pending_IOOpStats));
+
+ have_ioopstats = false;
+
+ return false;
}
const char *
@@ -89,6 +193,55 @@ pgstat_io_op_desc(IOOp io_op)
elog(ERROR, "unrecognized IOOp value: %d", io_op);
}
+void
+pgstat_io_ops_reset_all_cb(TimestampTz ts)
+{
+ PgStatShared_BackendIOContextOps *backends_stats_shmem = &pgStatLocal.shmem->io_ops;
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ PgStatShared_IOContextOps *stats_shmem = &backends_stats_shmem->stats[i];
+
+ LWLockAcquire(&stats_shmem->lock, LW_EXCLUSIVE);
+
+ /*
+ * Use the lock in the first BackendType's PgStat_IOContextOps to
+ * protect the reset timestamp as well.
+ */
+ if (i == 0)
+ backends_stats_shmem->stat_reset_timestamp = ts;
+
+ memset(stats_shmem->data, 0, sizeof(stats_shmem->data));
+ LWLockRelease(&stats_shmem->lock);
+ }
+}
+
+void
+pgstat_io_ops_snapshot_cb(void)
+{
+ PgStatShared_BackendIOContextOps *backends_stats_shmem = &pgStatLocal.shmem->io_ops;
+ PgStat_BackendIOContextOps *backends_stats_snap = &pgStatLocal.snapshot.io_ops;
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ PgStatShared_IOContextOps *stats_shmem = &backends_stats_shmem->stats[i];
+ PgStat_IOContextOps *stats_snap = &backends_stats_snap->stats[i];
+
+ LWLockAcquire(&stats_shmem->lock, LW_SHARED);
+
+ /*
+ * Use the lock in the first BackendType's PgStat_IOContextOps to
+ * protect the reset timestamp as well.
+ */
+ if (i == 0)
+ backends_stats_snap->stat_reset_timestamp = backends_stats_shmem->stat_reset_timestamp;
+
+ memcpy(stats_snap->data, stats_shmem->data, sizeof(stats_shmem->data));
+ LWLockRelease(&stats_shmem->lock);
+ }
+}
+
/*
* IO Operation statistics are not collected for all BackendTypes.
*
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index a846d9ffb6..7a2fd1ccf9 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -205,7 +205,7 @@ pgstat_drop_relation(Relation rel)
}
/*
- * Report that the table was just vacuumed.
+ * Report that the table was just vacuumed and flush IO Operation statistics.
*/
void
pgstat_report_vacuum(Oid tableoid, bool shared,
@@ -257,10 +257,18 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
}
pgstat_unlock_entry(entry_ref);
+
+ /*
+ * Flush IO Operations statistics now. pgstat_report_stat() will flush IO
+ * Operation stats, however this will not be called until after an entire
+ * autovacuum cycle is done -- which will likely vacuum many relations --
+ * or until the VACUUM command has processed all tables and committed.
+ */
+ pgstat_flush_io_ops(false);
}
/*
- * Report that the table was just analyzed.
+ * Report that the table was just analyzed and flush IO Operation statistics.
*
* Caller must provide new live- and dead-tuples estimates, as well as a
* flag indicating whether to reset the changes_since_analyze counter.
@@ -340,6 +348,9 @@ pgstat_report_analyze(Relation rel,
}
pgstat_unlock_entry(entry_ref);
+
+ /* see pgstat_report_vacuum() */
+ pgstat_flush_io_ops(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_shmem.c b/src/backend/utils/activity/pgstat_shmem.c
index ac98918688..bfbe1c7942 100644
--- a/src/backend/utils/activity/pgstat_shmem.c
+++ b/src/backend/utils/activity/pgstat_shmem.c
@@ -202,6 +202,10 @@ StatsShmemInit(void)
LWLockInitialize(&ctl->checkpointer.lock, LWTRANCHE_PGSTATS_DATA);
LWLockInitialize(&ctl->slru.lock, LWTRANCHE_PGSTATS_DATA);
LWLockInitialize(&ctl->wal.lock, LWTRANCHE_PGSTATS_DATA);
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ LWLockInitialize(&ctl->io_ops.stats[i].lock,
+ LWTRANCHE_PGSTATS_DATA);
}
else
{
diff --git a/src/backend/utils/activity/pgstat_wal.c b/src/backend/utils/activity/pgstat_wal.c
index 5a878bd115..9cac407b42 100644
--- a/src/backend/utils/activity/pgstat_wal.c
+++ b/src/backend/utils/activity/pgstat_wal.c
@@ -34,7 +34,7 @@ static WalUsage prevWalUsage;
/*
* Calculate how much WAL usage counters have increased and update
- * shared statistics.
+ * shared WAL and IO Operation statistics.
*
* Must be called by processes that generate WAL, that do not call
* pgstat_report_stat(), like walwriter.
@@ -43,6 +43,8 @@ void
pgstat_report_wal(bool force)
{
pgstat_flush_wal(force);
+
+ pgstat_flush_io_ops(force);
}
/*
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 4cca30aae7..dc3a1a26a4 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -2085,6 +2085,8 @@ pg_stat_reset_shared(PG_FUNCTION_ARGS)
pgstat_reset_of_kind(PGSTAT_KIND_BGWRITER);
pgstat_reset_of_kind(PGSTAT_KIND_CHECKPOINTER);
}
+ else if (strcmp(target, "io") == 0)
+ pgstat_reset_of_kind(PGSTAT_KIND_IOOPS);
else if (strcmp(target, "recovery_prefetch") == 0)
XLogPrefetchResetStats();
else if (strcmp(target, "wal") == 0)
@@ -2093,7 +2095,7 @@ pg_stat_reset_shared(PG_FUNCTION_ARGS)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("unrecognized reset target: \"%s\"", target),
- errhint("Target must be \"archiver\", \"bgwriter\", \"recovery_prefetch\", or \"wal\".")));
+ errhint("Target must be \"archiver\", \"bgwriter\", \"io\", \"recovery_prefetch\", or \"wal\".")));
PG_RETURN_VOID();
}
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 65cf4ba50f..194bae52c8 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -331,6 +331,8 @@ typedef enum BackendType
B_WAL_WRITER,
} BackendType;
+#define BACKEND_NUM_TYPES (B_WAL_WRITER + 1)
+
extern PGDLLIMPORT BackendType MyBackendType;
extern const char *GetBackendTypeDesc(BackendType backendType);
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index d91bff3be7..b9604bb2ad 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -48,6 +48,7 @@ typedef enum PgStat_Kind
PGSTAT_KIND_ARCHIVER,
PGSTAT_KIND_BGWRITER,
PGSTAT_KIND_CHECKPOINTER,
+ PGSTAT_KIND_IOOPS,
PGSTAT_KIND_SLRU,
PGSTAT_KIND_WAL,
} PgStat_Kind;
@@ -242,7 +243,7 @@ typedef struct PgStat_TableXactStatus
* ------------------------------------------------------------
*/
-#define PGSTAT_FILE_FORMAT_ID 0x01A5BCA7
+#define PGSTAT_FILE_FORMAT_ID 0x01A5BCA8
typedef struct PgStat_ArchiverStats
{
@@ -314,6 +315,12 @@ typedef struct PgStat_IOContextOps
PgStat_IOOpCounters data[IOCONTEXT_NUM_TYPES];
} PgStat_IOContextOps;
+typedef struct PgStat_BackendIOContextOps
+{
+ TimestampTz stat_reset_timestamp;
+ PgStat_IOContextOps stats[BACKEND_NUM_TYPES];
+} PgStat_BackendIOContextOps;
+
typedef struct PgStat_StatDBEntry
{
PgStat_Counter n_xact_commit;
@@ -496,6 +503,7 @@ extern PgStat_CheckpointerStats *pgstat_fetch_stat_checkpointer(void);
*/
extern void pgstat_count_io_op(IOOp io_op, IOContext io_context);
+extern PgStat_BackendIOContextOps *pgstat_fetch_backend_io_context_ops(void);
extern const char *pgstat_io_context_desc(IOContext io_context);
extern const char *pgstat_io_op_desc(IOOp io_op);
@@ -505,6 +513,77 @@ extern bool pgstat_bktype_io_context_valid(BackendType bktype, IOContext io_cont
extern bool pgstat_io_op_valid(BackendType bktype, IOContext io_context, IOOp io_op);
extern bool pgstat_expect_io_op(BackendType bktype, IOContext io_context, IOOp io_op);
+/*
+ * Functions to assert that invalid IO Operation counters are zero.
+ */
+static inline void
+pgstat_io_context_ops_assert_zero(PgStat_IOOpCounters *counters)
+{
+ Assert(counters->allocs == 0 && counters->extends == 0 &&
+ counters->fsyncs == 0 && counters->reads == 0 &&
+ counters->writes == 0);
+}
+
+static inline void
+pgstat_io_op_assert_zero(PgStat_IOOpCounters *counters, IOOp io_op)
+{
+ switch (io_op)
+ {
+ case IOOP_ALLOC:
+ Assert(counters->allocs == 0);
+ return;
+ case IOOP_EXTEND:
+ Assert(counters->extends == 0);
+ return;
+ case IOOP_FSYNC:
+ Assert(counters->fsyncs == 0);
+ return;
+ case IOOP_READ:
+ Assert(counters->reads == 0);
+ return;
+ case IOOP_WRITE:
+ Assert(counters->writes == 0);
+ return;
+ }
+
+ /* Should not reach here */
+ Assert(false);
+}
+
+/*
+ * Assert that stats have not been counted for any combination of IOContext and
+ * IOOp which are not valid for the passed-in BackendType. The passed-in array
+ * of PgStat_IOOpCounters must contain stats from the BackendType specified by
+ * the second parameter. Caller is responsible for any locking if the passed-in
+ * array of PgStat_IOOpCounters is a member of PgStatShared_IOContextOps.
+ */
+static inline void
+pgstat_backend_io_stats_assert_well_formed(PgStat_IOOpCounters
+ backend_io_context_ops[IOCONTEXT_NUM_TYPES], BackendType bktype)
+{
+ bool expect_backend_stats = true;
+
+ if (!pgstat_io_op_stats_collected(bktype))
+ expect_backend_stats = false;
+
+ for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ {
+ if (!expect_backend_stats ||
+ !pgstat_bktype_io_context_valid(bktype, io_context))
+ {
+ pgstat_io_context_ops_assert_zero(&backend_io_context_ops[io_context]);
+ continue;
+ }
+
+ for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
+ {
+ if (!pgstat_io_op_valid(bktype, io_context, io_op))
+ pgstat_io_op_assert_zero(&backend_io_context_ops[io_context],
+ io_op);
+ }
+ }
+}
+
/*
* Functions in pgstat_database.c
diff --git a/src/include/utils/pgstat_internal.h b/src/include/utils/pgstat_internal.h
index 901d2041d6..289610249a 100644
--- a/src/include/utils/pgstat_internal.h
+++ b/src/include/utils/pgstat_internal.h
@@ -329,6 +329,26 @@ typedef struct PgStatShared_Checkpointer
PgStat_CheckpointerStats reset_offset;
} PgStatShared_Checkpointer;
+typedef struct PgStatShared_IOContextOps
+{
+ /*
+ * lock protects ->data. If this PgStatShared_IOContextOps is
+ * PgStatShared_BackendIOContextOps->stats[0], lock also protects
+ * PgStatShared_BackendIOContextOps->stat_reset_timestamp.
+ */
+ LWLock lock;
+ PgStat_IOOpCounters data[IOCONTEXT_NUM_TYPES];
+} PgStatShared_IOContextOps;
+
+typedef struct PgStatShared_BackendIOContextOps
+{
+ /* ->stat_reset_timestamp is protected by ->stats[0].lock */
+ TimestampTz stat_reset_timestamp;
+ PgStatShared_IOContextOps stats[BACKEND_NUM_TYPES];
+} PgStatShared_BackendIOContextOps;
+
+
typedef struct PgStatShared_SLRU
{
/* lock protects ->stats */
@@ -419,6 +439,7 @@ typedef struct PgStat_ShmemControl
PgStatShared_Archiver archiver;
PgStatShared_BgWriter bgwriter;
PgStatShared_Checkpointer checkpointer;
+ PgStatShared_BackendIOContextOps io_ops;
PgStatShared_SLRU slru;
PgStatShared_Wal wal;
} PgStat_ShmemControl;
@@ -442,6 +463,8 @@ typedef struct PgStat_Snapshot
PgStat_CheckpointerStats checkpointer;
+ PgStat_BackendIOContextOps io_ops;
+
PgStat_SLRUStats slru[SLRU_NUM_ELEMENTS];
PgStat_WalStats wal;
@@ -549,6 +572,15 @@ extern void pgstat_database_reset_timestamp_cb(PgStatShared_Common *header, Time
extern bool pgstat_function_flush_cb(PgStat_EntryRef *entry_ref, bool nowait);
+/*
+ * Functions in pgstat_io_ops.c
+ */
+
+extern void pgstat_io_ops_reset_all_cb(TimestampTz ts);
+extern void pgstat_io_ops_snapshot_cb(void);
+extern bool pgstat_flush_io_ops(bool nowait);
+
+
/*
* Functions in pgstat_relation.c
*/
@@ -641,6 +673,11 @@ extern void pgstat_create_transactional(PgStat_Kind kind, Oid dboid, Oid objoid)
extern PGDLLIMPORT PgStat_LocalState pgStatLocal;
+/*
+ * Variables in pgstat_io_ops.c
+ */
+
+extern PGDLLIMPORT bool have_ioopstats;
/*
* Variables in pgstat_slru.c
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 5db1caa112..71e5f62e40 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2017,12 +2017,14 @@ PgFdwRelationInfo
PgFdwScanState
PgIfAddrCallback
PgStatShared_Archiver
+PgStatShared_BackendIOContextOps
PgStatShared_BgWriter
PgStatShared_Checkpointer
PgStatShared_Common
PgStatShared_Database
PgStatShared_Function
PgStatShared_HashEntry
+PgStatShared_IOContextOps
PgStatShared_Relation
PgStatShared_ReplSlot
PgStatShared_SLRU
@@ -2030,6 +2032,7 @@ PgStatShared_Subscription
PgStatShared_Wal
PgStat_ArchiverStats
PgStat_BackendFunctionEntry
+PgStat_BackendIOContextOps
PgStat_BackendSubEntry
PgStat_BgWriterStats
PgStat_CheckpointerStats
--
2.34.1
v30 attached.
Rebased, and pgstat_io_ops.c now builds with meson.
Also, I tested with pgstat_report_stat() only flushing when forced, and the
tests still pass.
Attachments:
v30-0003-Add-system-view-tracking-IO-ops-per-backend-type.patch (text/x-patch)
From 153b06200b36891c9e07df526a86dbd913e36e3e Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Thu, 11 Aug 2022 18:28:50 -0400
Subject: [PATCH v30 3/3] Add system view tracking IO ops per backend type
Add pg_stat_io, a system view which tracks the number of IOOps (allocs,
extends, fsyncs, reads, and writes) done through each IOContext (shared
buffers, local buffers, strategy buffers) by each type of backend (e.g.
client backend, checkpointer).
Some BackendTypes do not accumulate IO operations statistics and will
not be included in the view.
Some IOContexts are not used by some BackendTypes and will not be in the
view. For example, checkpointer does not use a BufferAccessStrategy
(currently), so there will be no row for the "strategy" IOContext for
checkpointer.
Some IOOps are invalid in combination with certain IOContexts. Those
cells will be NULL in the view to distinguish between 0 observed IOOps
of that type and an invalid combination. For example, local buffers are
not fsync'd, so cells for all BackendTypes for IOCONTEXT_LOCAL and
IOOP_FSYNC will be NULL.
Some BackendTypes never perform certain IOOps. Those cells will also be
NULL in the view. For example, bgwriter should not perform reads.
View stats are populated with statistics incremented when a backend
performs an IO Operation and maintained by the cumulative statistics
subsystem.
Each row of the view shows stats for a particular BackendType and
IOContext combination (e.g. shared buffer accesses by checkpointer) and
each column in the view is the total number of IO Operations done (e.g.
writes).
So a cell in the view would be, for example, the number of shared
buffers written by checkpointer since the last stats reset.
Note that some of the cells in the view are redundant with fields in
pg_stat_bgwriter (e.g. buffers_backend), however these have been kept in
pg_stat_bgwriter for backwards compatibility. Deriving the redundant
pg_stat_bgwriter stats from the IO operations stats structures was also
problematic due to the separate reset targets for 'bgwriter' and 'io'.
Suggested by Andres Freund
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
doc/src/sgml/monitoring.sgml | 115 ++++++++++++++-
src/backend/catalog/system_views.sql | 12 ++
src/backend/utils/adt/pgstatfuncs.c | 117 ++++++++++++++++
src/include/catalog/pg_proc.dat | 9 ++
src/test/regress/expected/rules.out | 9 ++
src/test/regress/expected/stats.out | 202 +++++++++++++++++++++++++++
src/test/regress/sql/stats.sql | 104 ++++++++++++++
7 files changed, 567 insertions(+), 1 deletion(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 9440b41770..c7ca078bc8 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -448,6 +448,15 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</entry>
</row>
+ <row>
+ <entry><structname>pg_stat_io</structname><indexterm><primary>pg_stat_io</primary></indexterm></entry>
+ <entry>A row for each IO Context for each backend type showing
+ statistics about backend IO operations. See
+ <link linkend="monitoring-pg-stat-io-view">
+ <structname>pg_stat_io</structname></link> for details.
+ </entry>
+ </row>
+
<row>
<entry><structname>pg_stat_wal</structname><indexterm><primary>pg_stat_wal</primary></indexterm></entry>
<entry>One row only, showing statistics about WAL activity. See
@@ -3600,7 +3609,111 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
<structfield>stats_reset</structfield> <type>timestamp with time zone</type>
</para>
<para>
- Time at which these statistics were last reset
+ Time at which these statistics were last reset.
+ </para></entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ </sect2>
+
+ <sect2 id="monitoring-pg-stat-io-view">
+ <title><structname>pg_stat_io</structname></title>
+
+ <indexterm>
+ <primary>pg_stat_io</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_stat_io</structname> view has a row for each backend
+ type and IO Context, containing cluster-wide statistics on IO Operations
+ done by that backend type in that IO Context.
+ </para>
+
+ <table id="pg-stat-io-view" xreflabel="pg_stat_io">
+ <title><structname>pg_stat_io</structname> View</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>backend_type</structfield> <type>text</type>
+ </para>
+ <para>
+ Type of backend (e.g. background worker, autovacuum worker).
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>io_context</structfield> <type>text</type>
+ </para>
+ <para>
+ IO Context used (e.g. shared buffers, local buffers).
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>alloc</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of buffers allocated.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>read</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of blocks read.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>write</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of blocks written.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>extend</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of blocks extended.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>fsync</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of fsync calls.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
+ </para>
+ <para>
+ Time at which these statistics were last reset.
</para></entry>
</row>
</tbody>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 55f7ec79e0..f3c10ae711 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1114,6 +1114,18 @@ CREATE VIEW pg_stat_bgwriter AS
pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+CREATE VIEW pg_stat_io AS
+SELECT
+ b.backend_type,
+ b.io_context,
+ b.alloc,
+ b.read,
+ b.write,
+ b.extend,
+ b.fsync,
+ b.stats_reset
+FROM pg_stat_get_io() b;
+
CREATE VIEW pg_stat_wal AS
SELECT
w.wal_records,
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 88c737f4f9..7815639350 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1726,6 +1726,123 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
}
+/*
+ * When adding a new column to the pg_stat_io view, add a new enum value
+ * here above IO_NUM_COLUMNS.
+ */
+typedef enum io_stat_col
+{
+ IO_COL_BACKEND_TYPE,
+ IO_COL_IO_CONTEXT,
+ IO_COL_ALLOCS,
+ IO_COL_READS,
+ IO_COL_WRITES,
+ IO_COL_EXTENDS,
+ IO_COL_FSYNCS,
+ IO_COL_RESET_TIME,
+ IO_NUM_COLUMNS,
+} io_stat_col;
+
+/*
+ * When adding a new IOOp, add a new io_stat_col and add a case to this
+ * function returning the corresponding io_stat_col.
+ */
+static io_stat_col
+pgstat_io_op_get_index(IOOp io_op)
+{
+ switch (io_op)
+ {
+ case IOOP_ALLOC:
+ return IO_COL_ALLOCS;
+ case IOOP_READ:
+ return IO_COL_READS;
+ case IOOP_WRITE:
+ return IO_COL_WRITES;
+ case IOOP_EXTEND:
+ return IO_COL_EXTENDS;
+ case IOOP_FSYNC:
+ return IO_COL_FSYNCS;
+ }
+
+ elog(ERROR, "unrecognized IOOp value: %d", io_op);
+}
+
+Datum
+pg_stat_get_io(PG_FUNCTION_ARGS)
+{
+ PgStat_BackendIOContextOps *backends_io_stats;
+ ReturnSetInfo *rsinfo;
+ Datum reset_time;
+
+ SetSingleFuncCall(fcinfo, 0);
+ rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+
+ backends_io_stats = pgstat_fetch_backend_io_context_ops();
+
+ reset_time = TimestampTzGetDatum(backends_io_stats->stat_reset_timestamp);
+
+ for (int bktype = 0; bktype < BACKEND_NUM_TYPES; bktype++)
+ {
+ Datum bktype_desc = CStringGetTextDatum(GetBackendTypeDesc(bktype));
+ bool expect_backend_stats = true;
+ PgStat_IOContextOps *io_context_ops = &backends_io_stats->stats[bktype];
+
+ /*
+ * For those BackendTypes without IO Operation stats, skip
+ * representing them in the view altogether.
+ */
+ expect_backend_stats = pgstat_io_op_stats_collected(bktype);
+
+ for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ {
+ PgStat_IOOpCounters *counters = &io_context_ops->data[io_context];
+ const char *io_context_str = pgstat_io_context_desc(io_context);
+
+ Datum values[IO_NUM_COLUMNS] = {0};
+ bool nulls[IO_NUM_COLUMNS] = {0};
+
+ /*
+ * Some combinations of IOCONTEXT and BackendType are not valid
+ * for any type of IO Operation. In such cases, omit the entire
+ * row from the view.
+ */
+ if (!expect_backend_stats ||
+ !pgstat_bktype_io_context_valid(bktype, io_context))
+ {
+ pgstat_io_context_ops_assert_zero(counters);
+ continue;
+ }
+
+ values[IO_COL_BACKEND_TYPE] = bktype_desc;
+ values[IO_COL_IO_CONTEXT] = CStringGetTextDatum(io_context_str);
+ values[IO_COL_ALLOCS] = Int64GetDatum(counters->allocs);
+ values[IO_COL_READS] = Int64GetDatum(counters->reads);
+ values[IO_COL_WRITES] = Int64GetDatum(counters->writes);
+ values[IO_COL_EXTENDS] = Int64GetDatum(counters->extends);
+ values[IO_COL_FSYNCS] = Int64GetDatum(counters->fsyncs);
+ values[IO_COL_RESET_TIME] = TimestampTzGetDatum(reset_time);
+
+ /*
+ * Some combinations of BackendType and IOOp and of IOContext and
+ * IOOp are not valid. Set these cells in the view NULL and assert
+ * that these stats are zero as expected.
+ */
+ for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
+ {
+ if (!(pgstat_io_op_valid(bktype, io_context, io_op)))
+ {
+ pgstat_io_op_assert_zero(counters, io_op);
+ nulls[pgstat_io_op_get_index(io_op)] = true;
+ }
+ }
+
+ tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
+ }
+ }
+
+ return (Datum) 0;
+}
+
/*
* Returns statistics of WAL activity
*/
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index a07e737a33..1c263abec8 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5646,6 +5646,15 @@
proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
+{ oid => '8459', descr => 'statistics: per backend type IO statistics',
+ proname => 'pg_stat_get_io', provolatile => 'v',
+ prorows => '14', proretset => 't',
+ proparallel => 'r', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,text,int8,int8,int8,int8,int8,timestamptz}',
+ proargmodes => '{o,o,o,o,o,o,o,o}',
+ proargnames => '{backend_type,io_context,alloc,read,write,extend,fsync,stats_reset}',
+ prosrc => 'pg_stat_get_io' },
+
{ oid => '1136', descr => 'statistics: information about WAL activity',
proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
proparallel => 'r', prorettype => 'record', proargtypes => '',
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 9dd137415e..d313420d67 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1868,6 +1868,15 @@ pg_stat_gssapi| SELECT s.pid,
s.gss_enc AS encrypted
FROM pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid, query_id)
WHERE (s.client_port IS NOT NULL);
+pg_stat_io| SELECT b.backend_type,
+ b.io_context,
+ b.alloc,
+ b.read,
+ b.write,
+ b.extend,
+ b.fsync,
+ b.stats_reset
+ FROM pg_stat_get_io() b(backend_type, io_context, alloc, read, write, extend, fsync, stats_reset);
pg_stat_progress_analyze| SELECT s.pid,
s.datid,
d.datname,
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index 6a10dc462b..d1a6502c54 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -907,4 +907,206 @@ SELECT pg_stat_get_subscription_stats(NULL);
(1 row)
+-- Test that allocs, reads, writes, and extends to shared buffers and fsyncs
+-- done to ensure durability of shared buffers are tracked in pg_stat_io.
+SELECT sum(alloc) AS io_sum_shared_allocs_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(read) AS io_sum_shared_reads_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(write) AS io_sum_shared_writes_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(extend) AS io_sum_shared_extends_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(fsync) AS io_sum_shared_fsyncs_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+-- Create a regular table and insert some data to generate IOCONTEXT_SHARED
+-- allocs and extends.
+CREATE TABLE test_io_shared(a int);
+INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+-- After a checkpoint, there should be some additional IOCONTEXT_SHARED writes
+-- and fsyncs.
+CHECKPOINT;
+SELECT sum(alloc) AS io_sum_shared_allocs_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(write) AS io_sum_shared_writes_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(extend) AS io_sum_shared_extends_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(fsync) AS io_sum_shared_fsyncs_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT :io_sum_shared_allocs_after > :io_sum_shared_allocs_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_shared_writes_after > :io_sum_shared_writes_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_shared_extends_after > :io_sum_shared_extends_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT current_setting('fsync') = 'off' OR :io_sum_shared_fsyncs_after > :io_sum_shared_fsyncs_before;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Change the tablespace so that the table is rewritten directly, then SELECT
+-- from it to cause it to be read back into shared buffers.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE test_io_shared_stats_tblspc LOCATION '';
+ALTER TABLE test_io_shared SET TABLESPACE test_io_shared_stats_tblspc;
+SELECT COUNT(*) FROM test_io_shared;
+ count
+-------
+ 100
+(1 row)
+
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(read) AS io_sum_shared_reads_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT :io_sum_shared_reads_after > :io_sum_shared_reads_before;
+ ?column?
+----------
+ t
+(1 row)
+
+DROP TABLE test_io_shared;
+DROP TABLESPACE test_io_shared_stats_tblspc;
+-- Test that allocs, extends, reads, and writes of temporary tables are tracked
+-- in pg_stat_io.
+-- Set temp_buffers to a low value so that we can trigger writes with fewer
+-- inserted tuples.
+SET temp_buffers TO '1MB';
+CREATE TEMPORARY TABLE test_io_local(a int, b TEXT);
+SELECT sum(alloc) AS io_sum_local_allocs_before FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(read) AS io_sum_local_reads_before FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(write) AS io_sum_local_writes_before FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(extend) AS io_sum_local_extends_before FROM pg_stat_io WHERE io_context = 'local' \gset
+-- Insert enough values that we need to reuse and write out dirty local
+-- buffers.
+INSERT INTO test_io_local SELECT generate_series(1, 8000) as id, repeat('a', 100);
+-- Read in evicted buffers.
+SELECT COUNT(*) FROM test_io_local;
+ count
+-------
+ 8000
+(1 row)
+
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(alloc) AS io_sum_local_allocs_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(read) AS io_sum_local_reads_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(write) AS io_sum_local_writes_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(extend) AS io_sum_local_extends_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT :io_sum_local_allocs_after > :io_sum_local_allocs_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_local_reads_after > :io_sum_local_reads_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_local_writes_after > :io_sum_local_writes_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_local_extends_after > :io_sum_local_extends_before;
+ ?column?
+----------
+ t
+(1 row)
+
+RESET temp_buffers;
+-- Test that, when using a BufferAccessStrategy, reusing buffers from the
+-- Strategy ring count as "Strategy" allocs in pg_stat_io. Also test that
+-- Strategy reads are counted as such.
+-- Set wal_skip_threshold smaller than the expected size of test_io_strategy so
+-- that, even if wal_level is minimal, VACUUM FULL will fsync the newly
+-- rewritten test_io_strategy instead of writing it to WAL. Writing it to WAL
+-- will result in the newly written relation pages being in shared buffers --
+-- preventing us from testing BufferAccessStrategy allocs and reads.
+SET wal_skip_threshold = '1 kB';
+SELECT sum(alloc) AS io_sum_strategy_allocs_before FROM pg_stat_io WHERE io_context = 'strategy' \gset
+SELECT sum(read) AS io_sum_strategy_reads_before FROM pg_stat_io WHERE io_context = 'strategy' \gset
+CREATE TABLE test_io_strategy(a int, b int) WITH (autovacuum_enabled = 'false');
+INSERT INTO test_io_strategy SELECT i, i from generate_series(1, 8000)i;
+-- Ensure that the next VACUUM will need to perform IO by rewriting the table
+-- first with VACUUM (FULL).
+VACUUM (FULL) test_io_strategy;
+VACUUM (PARALLEL 0) test_io_strategy;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(alloc) AS io_sum_strategy_allocs_after FROM pg_stat_io WHERE io_context = 'strategy' \gset
+SELECT sum(read) AS io_sum_strategy_reads_after FROM pg_stat_io WHERE io_context = 'strategy' \gset
+SELECT :io_sum_strategy_allocs_after > :io_sum_strategy_allocs_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_strategy_reads_after > :io_sum_strategy_reads_before;
+ ?column?
+----------
+ t
+(1 row)
+
+DROP TABLE test_io_strategy;
+RESET wal_skip_threshold;
+-- Test that, when using a BufferAccessStrategy and creating a relation,
+-- Strategy extends are counted in pg_stat_io.
+-- A CTAS uses a Bulk Write strategy.
+SELECT sum(extend) AS io_sum_strategy_extends_before FROM pg_stat_io WHERE io_context = 'strategy' \gset
+CREATE TABLE test_io_strategy_extend AS SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(extend) AS io_sum_strategy_extends_after FROM pg_stat_io WHERE io_context = 'strategy' \gset
+SELECT :io_sum_strategy_extends_after > :io_sum_strategy_extends_before;
+ ?column?
+----------
+ t
+(1 row)
+
+DROP TABLE test_io_strategy_extend;
+-- Test stats reset
+SELECT sum(alloc) + sum(extend) + sum(fsync) + sum(read) + sum(write) AS io_stats_pre_reset FROM pg_stat_io \gset
+SELECT pg_stat_reset_shared('io');
+ pg_stat_reset_shared
+----------------------
+
+(1 row)
+
+SELECT sum(alloc) + sum(extend) + sum(fsync) + sum(read) + sum(write) AS io_stats_post_reset FROM pg_stat_io \gset
+SELECT :io_stats_post_reset < :io_stats_pre_reset;
+ ?column?
+----------
+ t
+(1 row)
+
-- End of Stats Test
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index a6b0e9e042..bc8fa404a3 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -443,4 +443,108 @@ SELECT pg_stat_get_replication_slot(NULL);
SELECT pg_stat_get_subscription_stats(NULL);
+-- Test that allocs, reads, writes, and extends to shared buffers and fsyncs
+-- done to ensure durability of shared buffers are tracked in pg_stat_io.
+SELECT sum(alloc) AS io_sum_shared_allocs_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(read) AS io_sum_shared_reads_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(write) AS io_sum_shared_writes_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(extend) AS io_sum_shared_extends_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(fsync) AS io_sum_shared_fsyncs_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+-- Create a regular table and insert some data to generate IOCONTEXT_SHARED
+-- allocs and extends.
+CREATE TABLE test_io_shared(a int);
+INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+-- After a checkpoint, there should be some additional IOCONTEXT_SHARED writes
+-- and fsyncs.
+CHECKPOINT;
+SELECT sum(alloc) AS io_sum_shared_allocs_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(write) AS io_sum_shared_writes_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(extend) AS io_sum_shared_extends_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(fsync) AS io_sum_shared_fsyncs_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT :io_sum_shared_allocs_after > :io_sum_shared_allocs_before;
+SELECT :io_sum_shared_writes_after > :io_sum_shared_writes_before;
+SELECT :io_sum_shared_extends_after > :io_sum_shared_extends_before;
+SELECT current_setting('fsync') = 'off' OR :io_sum_shared_fsyncs_after > :io_sum_shared_fsyncs_before;
+-- Change the tablespace so that the table is rewritten directly, then SELECT
+-- from it to cause it to be read back into shared buffers.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE test_io_shared_stats_tblspc LOCATION '';
+ALTER TABLE test_io_shared SET TABLESPACE test_io_shared_stats_tblspc;
+SELECT COUNT(*) FROM test_io_shared;
+SELECT pg_stat_force_next_flush();
+SELECT sum(read) AS io_sum_shared_reads_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT :io_sum_shared_reads_after > :io_sum_shared_reads_before;
+DROP TABLE test_io_shared;
+DROP TABLESPACE test_io_shared_stats_tblspc;
+
+-- Test that allocs, extends, reads, and writes of temporary tables are tracked
+-- in pg_stat_io.
+
+-- Set temp_buffers to a low value so that we can trigger writes with fewer
+-- inserted tuples.
+SET temp_buffers TO '1MB';
+CREATE TEMPORARY TABLE test_io_local(a int, b TEXT);
+SELECT sum(alloc) AS io_sum_local_allocs_before FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(read) AS io_sum_local_reads_before FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(write) AS io_sum_local_writes_before FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(extend) AS io_sum_local_extends_before FROM pg_stat_io WHERE io_context = 'local' \gset
+-- Insert enough values that we need to reuse and write out dirty local
+-- buffers.
+INSERT INTO test_io_local SELECT generate_series(1, 8000) as id, repeat('a', 100);
+-- Read in evicted buffers.
+SELECT COUNT(*) FROM test_io_local;
+SELECT pg_stat_force_next_flush();
+SELECT sum(alloc) AS io_sum_local_allocs_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(read) AS io_sum_local_reads_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(write) AS io_sum_local_writes_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(extend) AS io_sum_local_extends_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT :io_sum_local_allocs_after > :io_sum_local_allocs_before;
+SELECT :io_sum_local_reads_after > :io_sum_local_reads_before;
+SELECT :io_sum_local_writes_after > :io_sum_local_writes_before;
+SELECT :io_sum_local_extends_after > :io_sum_local_extends_before;
+RESET temp_buffers;
+
+-- Test that, when using a BufferAccessStrategy, reusing buffers from the
+-- Strategy ring counts as "Strategy" allocs in pg_stat_io. Also test that
+-- Strategy reads are counted as such.
+
+-- Set wal_skip_threshold smaller than the expected size of test_io_strategy so
+-- that, even if wal_level is minimal, VACUUM FULL will fsync the newly
+-- rewritten test_io_strategy instead of writing it to WAL. Writing it to WAL
+-- will result in the newly written relation pages being in shared buffers --
+-- preventing us from testing BufferAccessStrategy allocs and reads.
+SET wal_skip_threshold = '1 kB';
+SELECT sum(alloc) AS io_sum_strategy_allocs_before FROM pg_stat_io WHERE io_context = 'strategy' \gset
+SELECT sum(read) AS io_sum_strategy_reads_before FROM pg_stat_io WHERE io_context = 'strategy' \gset
+CREATE TABLE test_io_strategy(a int, b int) WITH (autovacuum_enabled = 'false');
+INSERT INTO test_io_strategy SELECT i, i FROM generate_series(1, 8000)i;
+-- Ensure that the next VACUUM will need to perform IO by rewriting the table
+-- first with VACUUM (FULL).
+VACUUM (FULL) test_io_strategy;
+VACUUM (PARALLEL 0) test_io_strategy;
+SELECT pg_stat_force_next_flush();
+SELECT sum(alloc) AS io_sum_strategy_allocs_after FROM pg_stat_io WHERE io_context = 'strategy' \gset
+SELECT sum(read) AS io_sum_strategy_reads_after FROM pg_stat_io WHERE io_context = 'strategy' \gset
+SELECT :io_sum_strategy_allocs_after > :io_sum_strategy_allocs_before;
+SELECT :io_sum_strategy_reads_after > :io_sum_strategy_reads_before;
+DROP TABLE test_io_strategy;
+RESET wal_skip_threshold;
+
+-- Test that, when using a BufferAccessStrategy and creating a relation,
+-- Strategy extends are counted in pg_stat_io.
+-- A CTAS uses a Bulk Write strategy.
+SELECT sum(extend) AS io_sum_strategy_extends_before FROM pg_stat_io WHERE io_context = 'strategy' \gset
+CREATE TABLE test_io_strategy_extend AS SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+SELECT sum(extend) AS io_sum_strategy_extends_after FROM pg_stat_io WHERE io_context = 'strategy' \gset
+SELECT :io_sum_strategy_extends_after > :io_sum_strategy_extends_before;
+DROP TABLE test_io_strategy_extend;
+
+-- Test stats reset
+SELECT sum(alloc) + sum(extend) + sum(fsync) + sum(read) + sum(write) AS io_stats_pre_reset FROM pg_stat_io \gset
+SELECT pg_stat_reset_shared('io');
+SELECT sum(alloc) + sum(extend) + sum(fsync) + sum(read) + sum(write) AS io_stats_post_reset FROM pg_stat_io \gset
+SELECT :io_stats_post_reset < :io_stats_pre_reset;
+
-- End of Stats Test
--
2.34.1
Attachment: v30-0002-Aggregate-IO-operation-stats-per-BackendType.patch (text/x-patch)
From e0df4be80d5b47a2837ad3c993798de6b4e9b618 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 22 Aug 2022 11:35:20 -0400
Subject: [PATCH v30 2/3] Aggregate IO operation stats per BackendType
Stats on IOOps for all IOContexts for a backend are tracked locally. Add
functionality for backends to flush these stats to shared memory and
accumulate them with those from all other backends, exited and live.
Also add reset and snapshot functions used by cumulative stats system
for management of these statistics.
The aggregated stats in shared memory could be extended in the future
with per-backend stats -- useful for per-connection IO statistics and
monitoring.
Some BackendTypes do not flush their pending statistics at regular
intervals, so they explicitly call pgstat_flush_io_ops() during the
course of normal operations to flush their backend-local IO Operation
statistics to shared memory in a timely manner.
Because not all BackendType, IOOp, IOContext combinations are valid, the
validity of the stats is checked before flushing pending stats and
before reading in the existing stats file to shared memory.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
doc/src/sgml/monitoring.sgml | 2 +
src/backend/utils/activity/pgstat.c | 35 ++++
src/backend/utils/activity/pgstat_bgwriter.c | 7 +-
.../utils/activity/pgstat_checkpointer.c | 7 +-
src/backend/utils/activity/pgstat_io_ops.c | 153 ++++++++++++++++++
src/backend/utils/activity/pgstat_relation.c | 15 +-
src/backend/utils/activity/pgstat_shmem.c | 4 +
src/backend/utils/activity/pgstat_wal.c | 4 +-
src/backend/utils/adt/pgstatfuncs.c | 4 +-
src/include/miscadmin.h | 2 +
src/include/pgstat.h | 81 +++++++++-
src/include/utils/pgstat_internal.h | 37 +++++
src/tools/pgindent/typedefs.list | 3 +
13 files changed, 347 insertions(+), 7 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 1d9509a2f6..9440b41770 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -5360,6 +5360,8 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
the <structname>pg_stat_bgwriter</structname>
view, <literal>archiver</literal> to reset all the counters shown in
the <structname>pg_stat_archiver</structname> view,
+ <literal>io</literal> to reset all the counters shown in the
+ <structname>pg_stat_io</structname> view,
<literal>wal</literal> to reset all the counters shown in the
<structname>pg_stat_wal</structname> view or
<literal>recovery_prefetch</literal> to reset all the counters shown
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 609f0b1ad8..e66dde0ea5 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -359,6 +359,15 @@ static const PgStat_KindInfo pgstat_kind_infos[PGSTAT_NUM_KINDS] = {
.snapshot_cb = pgstat_checkpointer_snapshot_cb,
},
+ [PGSTAT_KIND_IOOPS] = {
+ .name = "io_ops",
+
+ .fixed_amount = true,
+
+ .reset_all_cb = pgstat_io_ops_reset_all_cb,
+ .snapshot_cb = pgstat_io_ops_snapshot_cb,
+ },
+
[PGSTAT_KIND_SLRU] = {
.name = "slru",
@@ -582,6 +591,7 @@ pgstat_report_stat(bool force)
/* Don't expend a clock check if nothing to do */
if (dlist_is_empty(&pgStatPending) &&
+ !have_ioopstats &&
!have_slrustats &&
!pgstat_have_pending_wal())
{
@@ -628,6 +638,9 @@ pgstat_report_stat(bool force)
/* flush database / relation / function / ... stats */
partial_flush |= pgstat_flush_pending_entries(nowait);
+ /* flush IO Operations stats */
+ partial_flush |= pgstat_flush_io_ops(nowait);
+
/* flush wal stats */
partial_flush |= pgstat_flush_wal(nowait);
@@ -1321,6 +1334,14 @@ pgstat_write_statsfile(void)
pgstat_build_snapshot_fixed(PGSTAT_KIND_CHECKPOINTER);
write_chunk_s(fpout, &pgStatLocal.snapshot.checkpointer);
+ /*
+ * Write IO Operations stats struct
+ */
+ pgstat_build_snapshot_fixed(PGSTAT_KIND_IOOPS);
+ write_chunk_s(fpout, &pgStatLocal.snapshot.io_ops.stat_reset_timestamp);
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ write_chunk_s(fpout, &pgStatLocal.snapshot.io_ops.stats[i]);
+
/*
* Write SLRU stats struct
*/
@@ -1495,6 +1516,20 @@ pgstat_read_statsfile(void)
if (!read_chunk_s(fpin, &shmem->checkpointer.stats))
goto error;
+ /*
+ * Read IO Operations stats struct
+ */
+ if (!read_chunk_s(fpin, &shmem->io_ops.stat_reset_timestamp))
+ goto error;
+
+ for (int bktype = 0; bktype < BACKEND_NUM_TYPES; bktype++)
+ {
+ pgstat_backend_io_stats_assert_well_formed(shmem->io_ops.stats[bktype].data,
+ bktype);
+ if (!read_chunk_s(fpin, &shmem->io_ops.stats[bktype].data))
+ goto error;
+ }
+
/*
* Read SLRU stats struct
*/
diff --git a/src/backend/utils/activity/pgstat_bgwriter.c b/src/backend/utils/activity/pgstat_bgwriter.c
index fbb1edc527..3d7f90a1b7 100644
--- a/src/backend/utils/activity/pgstat_bgwriter.c
+++ b/src/backend/utils/activity/pgstat_bgwriter.c
@@ -24,7 +24,7 @@ PgStat_BgWriterStats PendingBgWriterStats = {0};
/*
- * Report bgwriter statistics
+ * Report bgwriter and IO Operation statistics
*/
void
pgstat_report_bgwriter(void)
@@ -56,6 +56,11 @@ pgstat_report_bgwriter(void)
* Clear out the statistics buffer, so it can be re-used.
*/
MemSet(&PendingBgWriterStats, 0, sizeof(PendingBgWriterStats));
+
+ /*
+ * Report IO Operations statistics
+ */
+ pgstat_flush_io_ops(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_checkpointer.c b/src/backend/utils/activity/pgstat_checkpointer.c
index af8d513e7b..cfcf127210 100644
--- a/src/backend/utils/activity/pgstat_checkpointer.c
+++ b/src/backend/utils/activity/pgstat_checkpointer.c
@@ -24,7 +24,7 @@ PgStat_CheckpointerStats PendingCheckpointerStats = {0};
/*
- * Report checkpointer statistics
+ * Report checkpointer and IO Operation statistics
*/
void
pgstat_report_checkpointer(void)
@@ -62,6 +62,11 @@ pgstat_report_checkpointer(void)
* Clear out the statistics buffer, so it can be re-used.
*/
MemSet(&PendingCheckpointerStats, 0, sizeof(PendingCheckpointerStats));
+
+ /*
+ * Report IO Operation statistics
+ */
+ pgstat_flush_io_ops(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_io_ops.c b/src/backend/utils/activity/pgstat_io_ops.c
index 5cfd40068d..8cd1239622 100644
--- a/src/backend/utils/activity/pgstat_io_ops.c
+++ b/src/backend/utils/activity/pgstat_io_ops.c
@@ -20,6 +20,39 @@
#include "utils/pgstat_internal.h"
static PgStat_IOContextOps pending_IOOpStats;
+bool have_ioopstats = false;
+
+
+/*
+ * Helper function to accumulate source PgStat_IOOpCounters into target
+ * PgStat_IOOpCounters. If either of the passed-in PgStat_IOOpCounters is a
+ * member of PgStatShared_IOContextOps, the caller is responsible for ensuring
+ * that the appropriate lock is held.
+ */
+static void
+pgstat_accum_io_op(PgStat_IOOpCounters *target, PgStat_IOOpCounters *source, IOOp io_op)
+{
+ switch (io_op)
+ {
+ case IOOP_ALLOC:
+ target->allocs += source->allocs;
+ return;
+ case IOOP_EXTEND:
+ target->extends += source->extends;
+ return;
+ case IOOP_FSYNC:
+ target->fsyncs += source->fsyncs;
+ return;
+ case IOOP_READ:
+ target->reads += source->reads;
+ return;
+ case IOOP_WRITE:
+ target->writes += source->writes;
+ return;
+ }
+
+ elog(ERROR, "unrecognized IOOp value: %d", io_op);
+}
void
pgstat_count_io_op(IOOp io_op, IOContext io_context)
@@ -51,6 +84,77 @@ pgstat_count_io_op(IOOp io_op, IOContext io_context)
break;
}
+ have_ioopstats = true;
+}
+
+PgStat_BackendIOContextOps *
+pgstat_fetch_backend_io_context_ops(void)
+{
+ pgstat_snapshot_fixed(PGSTAT_KIND_IOOPS);
+
+ return &pgStatLocal.snapshot.io_ops;
+}
+
+/*
+ * Flush out locally pending IO Operation statistics entries
+ *
+ * If no stats have been recorded, this function returns false.
+ *
+ * If nowait is true and the lock could not be acquired, this function
+ * returns true without flushing; otherwise it returns false.
+ */
+bool
+pgstat_flush_io_ops(bool nowait)
+{
+ PgStatShared_IOContextOps *type_shstats;
+ bool expect_backend_stats = true;
+
+ if (!have_ioopstats)
+ return false;
+
+ type_shstats =
+ &pgStatLocal.shmem->io_ops.stats[MyBackendType];
+
+ if (!nowait)
+ LWLockAcquire(&type_shstats->lock, LW_EXCLUSIVE);
+ else if (!LWLockConditionalAcquire(&type_shstats->lock, LW_EXCLUSIVE))
+ return true;
+
+ expect_backend_stats = pgstat_io_op_stats_collected(MyBackendType);
+
+ for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ {
+ PgStat_IOOpCounters *sharedent = &type_shstats->data[io_context];
+ PgStat_IOOpCounters *pendingent = &pending_IOOpStats.data[io_context];
+
+ if (!expect_backend_stats ||
+ !pgstat_bktype_io_context_valid(MyBackendType, io_context))
+ {
+ pgstat_io_context_ops_assert_zero(sharedent);
+ pgstat_io_context_ops_assert_zero(pendingent);
+ continue;
+ }
+
+ for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
+ {
+ if (!(pgstat_io_op_valid(MyBackendType, io_context, io_op)))
+ {
+ pgstat_io_op_assert_zero(sharedent, io_op);
+ pgstat_io_op_assert_zero(pendingent, io_op);
+ continue;
+ }
+
+ pgstat_accum_io_op(sharedent, pendingent, io_op);
+ }
+ }
+
+ LWLockRelease(&type_shstats->lock);
+
+ memset(&pending_IOOpStats, 0, sizeof(pending_IOOpStats));
+
+ have_ioopstats = false;
+
+ return false;
}
const char *
@@ -89,6 +193,55 @@ pgstat_io_op_desc(IOOp io_op)
elog(ERROR, "unrecognized IOOp value: %d", io_op);
}
+void
+pgstat_io_ops_reset_all_cb(TimestampTz ts)
+{
+ PgStatShared_BackendIOContextOps *backends_stats_shmem = &pgStatLocal.shmem->io_ops;
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ PgStatShared_IOContextOps *stats_shmem = &backends_stats_shmem->stats[i];
+
+ LWLockAcquire(&stats_shmem->lock, LW_EXCLUSIVE);
+
+ /*
+ * Use the lock in the first BackendType's PgStatShared_IOContextOps to
+ * protect the reset timestamp as well.
+ */
+ if (i == 0)
+ backends_stats_shmem->stat_reset_timestamp = ts;
+
+ memset(stats_shmem->data, 0, sizeof(stats_shmem->data));
+ LWLockRelease(&stats_shmem->lock);
+ }
+}
+
+void
+pgstat_io_ops_snapshot_cb(void)
+{
+ PgStatShared_BackendIOContextOps *backends_stats_shmem = &pgStatLocal.shmem->io_ops;
+ PgStat_BackendIOContextOps *backends_stats_snap = &pgStatLocal.snapshot.io_ops;
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ PgStatShared_IOContextOps *stats_shmem = &backends_stats_shmem->stats[i];
+ PgStat_IOContextOps *stats_snap = &backends_stats_snap->stats[i];
+
+ LWLockAcquire(&stats_shmem->lock, LW_SHARED);
+
+ /*
+ * Use the lock in the first BackendType's PgStatShared_IOContextOps to
+ * protect the reset timestamp as well.
+ */
+ if (i == 0)
+ backends_stats_snap->stat_reset_timestamp = backends_stats_shmem->stat_reset_timestamp;
+
+ memcpy(stats_snap->data, stats_shmem->data, sizeof(stats_shmem->data));
+ LWLockRelease(&stats_shmem->lock);
+ }
+
+}
+
/*
* IO Operation statistics are not collected for all BackendTypes.
*
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index a846d9ffb6..7a2fd1ccf9 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -205,7 +205,7 @@ pgstat_drop_relation(Relation rel)
}
/*
- * Report that the table was just vacuumed.
+ * Report that the table was just vacuumed and flush IO Operation statistics.
*/
void
pgstat_report_vacuum(Oid tableoid, bool shared,
@@ -257,10 +257,18 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
}
pgstat_unlock_entry(entry_ref);
+
+ /*
+ * Flush IO Operations statistics now. pgstat_report_stat() will also flush
+ * IO Operation stats, however it will not be called until an entire
+ * autovacuum cycle is done -- which will likely vacuum many relations --
+ * or until the VACUUM command has processed all tables and committed.
+ */
+ pgstat_flush_io_ops(false);
}
/*
- * Report that the table was just analyzed.
+ * Report that the table was just analyzed and flush IO Operation statistics.
*
* Caller must provide new live- and dead-tuples estimates, as well as a
* flag indicating whether to reset the changes_since_analyze counter.
@@ -340,6 +348,9 @@ pgstat_report_analyze(Relation rel,
}
pgstat_unlock_entry(entry_ref);
+
+ /* see pgstat_report_vacuum() */
+ pgstat_flush_io_ops(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_shmem.c b/src/backend/utils/activity/pgstat_shmem.c
index 9a4f037959..275a7be166 100644
--- a/src/backend/utils/activity/pgstat_shmem.c
+++ b/src/backend/utils/activity/pgstat_shmem.c
@@ -202,6 +202,10 @@ StatsShmemInit(void)
LWLockInitialize(&ctl->checkpointer.lock, LWTRANCHE_PGSTATS_DATA);
LWLockInitialize(&ctl->slru.lock, LWTRANCHE_PGSTATS_DATA);
LWLockInitialize(&ctl->wal.lock, LWTRANCHE_PGSTATS_DATA);
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ LWLockInitialize(&ctl->io_ops.stats[i].lock,
+ LWTRANCHE_PGSTATS_DATA);
}
else
{
diff --git a/src/backend/utils/activity/pgstat_wal.c b/src/backend/utils/activity/pgstat_wal.c
index 5a878bd115..9cac407b42 100644
--- a/src/backend/utils/activity/pgstat_wal.c
+++ b/src/backend/utils/activity/pgstat_wal.c
@@ -34,7 +34,7 @@ static WalUsage prevWalUsage;
/*
* Calculate how much WAL usage counters have increased and update
- * shared statistics.
+ * shared WAL and IO Operation statistics.
*
* Must be called by processes that generate WAL, that do not call
* pgstat_report_stat(), like walwriter.
@@ -43,6 +43,8 @@ void
pgstat_report_wal(bool force)
{
pgstat_flush_wal(force);
+
+ pgstat_flush_io_ops(force);
}
/*
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index be15b4b2e5..88c737f4f9 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -2085,6 +2085,8 @@ pg_stat_reset_shared(PG_FUNCTION_ARGS)
pgstat_reset_of_kind(PGSTAT_KIND_BGWRITER);
pgstat_reset_of_kind(PGSTAT_KIND_CHECKPOINTER);
}
+ else if (strcmp(target, "io") == 0)
+ pgstat_reset_of_kind(PGSTAT_KIND_IOOPS);
else if (strcmp(target, "recovery_prefetch") == 0)
XLogPrefetchResetStats();
else if (strcmp(target, "wal") == 0)
@@ -2093,7 +2095,7 @@ pg_stat_reset_shared(PG_FUNCTION_ARGS)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("unrecognized reset target: \"%s\"", target),
- errhint("Target must be \"archiver\", \"bgwriter\", \"recovery_prefetch\", or \"wal\".")));
+ errhint("Target must be \"archiver\", \"bgwriter\", \"io\", \"recovery_prefetch\", or \"wal\".")));
PG_RETURN_VOID();
}
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index ee48e392ed..2b73deda2f 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -331,6 +331,8 @@ typedef enum BackendType
B_WAL_WRITER,
} BackendType;
+#define BACKEND_NUM_TYPES (B_WAL_WRITER + 1)
+
extern PGDLLIMPORT BackendType MyBackendType;
extern const char *GetBackendTypeDesc(BackendType backendType);
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index ccc836a37f..d35072968b 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -48,6 +48,7 @@ typedef enum PgStat_Kind
PGSTAT_KIND_ARCHIVER,
PGSTAT_KIND_BGWRITER,
PGSTAT_KIND_CHECKPOINTER,
+ PGSTAT_KIND_IOOPS,
PGSTAT_KIND_SLRU,
PGSTAT_KIND_WAL,
} PgStat_Kind;
@@ -242,7 +243,7 @@ typedef struct PgStat_TableXactStatus
* ------------------------------------------------------------
*/
-#define PGSTAT_FILE_FORMAT_ID 0x01A5BCA7
+#define PGSTAT_FILE_FORMAT_ID 0x01A5BCA8
typedef struct PgStat_ArchiverStats
{
@@ -314,6 +315,12 @@ typedef struct PgStat_IOContextOps
PgStat_IOOpCounters data[IOCONTEXT_NUM_TYPES];
} PgStat_IOContextOps;
+typedef struct PgStat_BackendIOContextOps
+{
+ TimestampTz stat_reset_timestamp;
+ PgStat_IOContextOps stats[BACKEND_NUM_TYPES];
+} PgStat_BackendIOContextOps;
+
typedef struct PgStat_StatDBEntry
{
PgStat_Counter n_xact_commit;
@@ -496,6 +503,7 @@ extern PgStat_CheckpointerStats *pgstat_fetch_stat_checkpointer(void);
*/
extern void pgstat_count_io_op(IOOp io_op, IOContext io_context);
+extern PgStat_BackendIOContextOps *pgstat_fetch_backend_io_context_ops(void);
extern const char *pgstat_io_context_desc(IOContext io_context);
extern const char *pgstat_io_op_desc(IOOp io_op);
@@ -505,6 +513,77 @@ extern bool pgstat_bktype_io_context_valid(BackendType bktype, IOContext io_cont
extern bool pgstat_io_op_valid(BackendType bktype, IOContext io_context, IOOp io_op);
extern bool pgstat_expect_io_op(BackendType bktype, IOContext io_context, IOOp io_op);
+/*
+ * Functions to assert that invalid IO Operation counters are zero.
+ */
+static inline void
+pgstat_io_context_ops_assert_zero(PgStat_IOOpCounters *counters)
+{
+ Assert(counters->allocs == 0 && counters->extends == 0 &&
+ counters->fsyncs == 0 && counters->reads == 0 &&
+ counters->writes == 0);
+}
+
+static inline void
+pgstat_io_op_assert_zero(PgStat_IOOpCounters *counters, IOOp io_op)
+{
+ switch (io_op)
+ {
+ case IOOP_ALLOC:
+ Assert(counters->allocs == 0);
+ return;
+ case IOOP_EXTEND:
+ Assert(counters->extends == 0);
+ return;
+ case IOOP_FSYNC:
+ Assert(counters->fsyncs == 0);
+ return;
+ case IOOP_READ:
+ Assert(counters->reads == 0);
+ return;
+ case IOOP_WRITE:
+ Assert(counters->writes == 0);
+ return;
+ }
+
+ /* Should not reach here */
+ Assert(false);
+}
+
+/*
+ * Assert that stats have not been counted for any combination of IOContext and
+ * IOOp which are not valid for the passed-in BackendType. The passed-in array
+ * of PgStat_IOOpCounters must contain stats from the BackendType specified by
+ * the second parameter. Caller is responsible for any locking if the passed-in
+ * array of PgStat_IOOpCounters is a member of PgStatShared_IOContextOps.
+ */
+static inline void
+pgstat_backend_io_stats_assert_well_formed(PgStat_IOOpCounters
+ backend_io_context_ops[IOCONTEXT_NUM_TYPES], BackendType bktype)
+{
+ bool expect_backend_stats = true;
+
+ if (!pgstat_io_op_stats_collected(bktype))
+ expect_backend_stats = false;
+
+ for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ {
+ if (!expect_backend_stats ||
+ !pgstat_bktype_io_context_valid(bktype, io_context))
+ {
+ pgstat_io_context_ops_assert_zero(&backend_io_context_ops[io_context]);
+ continue;
+ }
+
+ for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
+ {
+ if (!pgstat_io_op_valid(bktype, io_context, io_op))
+ pgstat_io_op_assert_zero(&backend_io_context_ops[io_context],
+ io_op);
+ }
+ }
+}
+
/*
* Functions in pgstat_database.c
diff --git a/src/include/utils/pgstat_internal.h b/src/include/utils/pgstat_internal.h
index 40a3602855..95f19cae99 100644
--- a/src/include/utils/pgstat_internal.h
+++ b/src/include/utils/pgstat_internal.h
@@ -329,6 +329,26 @@ typedef struct PgStatShared_Checkpointer
PgStat_CheckpointerStats reset_offset;
} PgStatShared_Checkpointer;
+typedef struct PgStatShared_IOContextOps
+{
+ /*
+ * lock protects ->data
+ * If this PgStatShared_IOContextOps is
+ * PgStatShared_BackendIOContextOps->stats[0], lock also protects
+ * PgStatShared_BackendIOContextOps->stat_reset_timestamp.
+ */
+ LWLock lock;
+ PgStat_IOOpCounters data[IOCONTEXT_NUM_TYPES];
+} PgStatShared_IOContextOps;
+
+typedef struct PgStatShared_BackendIOContextOps
+{
+ /* ->stat_reset_timestamp is protected by ->stats[0].lock */
+ TimestampTz stat_reset_timestamp;
+ PgStatShared_IOContextOps stats[BACKEND_NUM_TYPES];
+} PgStatShared_BackendIOContextOps;
+
+
typedef struct PgStatShared_SLRU
{
/* lock protects ->stats */
@@ -419,6 +439,7 @@ typedef struct PgStat_ShmemControl
PgStatShared_Archiver archiver;
PgStatShared_BgWriter bgwriter;
PgStatShared_Checkpointer checkpointer;
+ PgStatShared_BackendIOContextOps io_ops;
PgStatShared_SLRU slru;
PgStatShared_Wal wal;
} PgStat_ShmemControl;
@@ -442,6 +463,8 @@ typedef struct PgStat_Snapshot
PgStat_CheckpointerStats checkpointer;
+ PgStat_BackendIOContextOps io_ops;
+
PgStat_SLRUStats slru[SLRU_NUM_ELEMENTS];
PgStat_WalStats wal;
@@ -549,6 +572,15 @@ extern void pgstat_database_reset_timestamp_cb(PgStatShared_Common *header, Time
extern bool pgstat_function_flush_cb(PgStat_EntryRef *entry_ref, bool nowait);
+/*
+ * Functions in pgstat_io_ops.c
+ */
+
+extern void pgstat_io_ops_reset_all_cb(TimestampTz ts);
+extern void pgstat_io_ops_snapshot_cb(void);
+extern bool pgstat_flush_io_ops(bool nowait);
+
+
/*
* Functions in pgstat_relation.c
*/
@@ -641,6 +673,11 @@ extern void pgstat_create_transactional(PgStat_Kind kind, Oid dboid, Oid objoid)
extern PGDLLIMPORT PgStat_LocalState pgStatLocal;
+/*
+ * Variables in pgstat_io_ops.c
+ */
+
+extern PGDLLIMPORT bool have_ioopstats;
/*
* Variables in pgstat_slru.c
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 67218ec6f2..33c9362257 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2005,12 +2005,14 @@ PgFdwRelationInfo
PgFdwScanState
PgIfAddrCallback
PgStatShared_Archiver
+PgStatShared_BackendIOContextOps
PgStatShared_BgWriter
PgStatShared_Checkpointer
PgStatShared_Common
PgStatShared_Database
PgStatShared_Function
PgStatShared_HashEntry
+PgStatShared_IOContextOps
PgStatShared_Relation
PgStatShared_ReplSlot
PgStatShared_SLRU
@@ -2018,6 +2020,7 @@ PgStatShared_Subscription
PgStatShared_Wal
PgStat_ArchiverStats
PgStat_BackendFunctionEntry
+PgStat_BackendIOContextOps
PgStat_BackendSubEntry
PgStat_BgWriterStats
PgStat_CheckpointerStats
--
2.34.1
Attachment: v30-0001-Track-IO-operation-statistics-locally.patch (text/x-patch)
From 55564aca63fea479681456ff333fdbb5868589c4 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 22 Aug 2022 11:08:23 -0400
Subject: [PATCH v30 1/3] Track IO operation statistics locally
Introduce "IOOp", an IO operation done by a backend, and "IOContext",
the source, target, or type of that IO. For example, the checkpointer
may write out a shared buffer. This would be counted as an IOOp "write"
in the IOContext IOCONTEXT_SHARED by BackendType "checkpointer".
Each IOOp (alloc, read, write, extend, fsync) is counted per IOContext
(local, shared, or strategy) through a call to pgstat_count_io_op().
The primary concern of these statistics is IO operations on data blocks
during the course of normal database operations. IO operations done by,
for example, the archiver or syslogger are not counted in these
statistics.
IOCONTEXT_LOCAL and IOCONTEXT_SHARED IOContexts concern operations on
local and shared buffers.
The IOCONTEXT_STRATEGY IOContext concerns IO operations on buffers as
part of a BufferAccessStrategy.
IOOP_ALLOC IOOps are counted in IOCONTEXT_SHARED and IOCONTEXT_LOCAL
IOContexts whenever a buffer is acquired through [Local]BufferAlloc().
IOOP_ALLOC IOOps are counted in the IOCONTEXT_STRATEGY IOContext
whenever a buffer already in the strategy ring is reused. IOOP_WRITE
IOOps are counted in the IOCONTEXT_STRATEGY IOContext whenever the
reused dirty buffer is written out.
Stats on IOOps in all IOContexts for a given backend are counted in a
backend's local memory. A subsequent commit will expose functions for
aggregating and viewing these stats.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
src/backend/postmaster/checkpointer.c | 12 ++
src/backend/storage/buffer/bufmgr.c | 60 +++++--
src/backend/storage/buffer/freelist.c | 21 ++-
src/backend/storage/buffer/localbuf.c | 5 +
src/backend/storage/sync/sync.c | 2 +
src/backend/utils/activity/Makefile | 1 +
src/backend/utils/activity/meson.build | 1 +
src/backend/utils/activity/pgstat_io_ops.c | 199 +++++++++++++++++++++
src/include/pgstat.h | 53 ++++++
src/include/storage/buf_internals.h | 2 +-
src/tools/pgindent/typedefs.list | 4 +
11 files changed, 348 insertions(+), 12 deletions(-)
create mode 100644 src/backend/utils/activity/pgstat_io_ops.c
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 5fc076fc14..bd2e1de7c2 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -1116,6 +1116,18 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
if (!AmBackgroundWriterProcess())
CheckpointerShmem->num_backend_fsync++;
LWLockRelease(CheckpointerCommLock);
+
+ /*
+ * We have no way of knowing if the current IOContext is
+ * IOCONTEXT_SHARED or IOCONTEXT_STRATEGY at this point, so count the
+ * fsync as being in the IOCONTEXT_SHARED IOContext. This is probably
+ * okay, because the number of backend fsyncs doesn't say anything
+ * about the efficacy of the BufferAccessStrategy. And counting both
+ * fsyncs done in IOCONTEXT_SHARED and IOCONTEXT_STRATEGY under
+ * IOCONTEXT_SHARED is likely clearer when investigating the number of
+ * backend fsyncs.
+ */
+ pgstat_count_io_op(IOOP_FSYNC, IOCONTEXT_SHARED);
return false;
}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 5b0e531f97..3573047235 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -482,7 +482,7 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
BlockNumber blockNum,
BufferAccessStrategy strategy,
bool *foundPtr);
-static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+static void FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOContext io_context);
static void FindAndDropRelationBuffers(RelFileLocator rlocator,
ForkNumber forkNum,
BlockNumber nForkBlock,
@@ -823,6 +823,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
BufferDesc *bufHdr;
Block bufBlock;
bool found;
+ IOContext io_context;
bool isExtend;
bool isLocalBuf = SmgrIsTemp(smgr);
@@ -986,10 +987,24 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
*/
Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID)); /* spinlock not needed */
- bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
+ if (isLocalBuf)
+ {
+ bufBlock = LocalBufHdrGetBlock(bufHdr);
+ io_context = IOCONTEXT_LOCAL;
+ }
+ else
+ {
+ bufBlock = BufHdrGetBlock(bufHdr);
+
+ if (strategy != NULL)
+ io_context = IOCONTEXT_STRATEGY;
+ else
+ io_context = IOCONTEXT_SHARED;
+ }
if (isExtend)
{
+ pgstat_count_io_op(IOOP_EXTEND, io_context);
/* new buffers are zero-filled */
MemSet((char *) bufBlock, 0, BLCKSZ);
/* don't set checksum for all-zero page */
@@ -1020,6 +1035,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
smgrread(smgr, forkNum, blockNum, (char *) bufBlock);
+ pgstat_count_io_op(IOOP_READ, io_context);
+
if (track_io_timing)
{
INSTR_TIME_SET_CURRENT(io_time);
@@ -1190,6 +1207,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
/* Loop here in case we have to try another victim buffer */
for (;;)
{
+ bool from_ring;
+
/*
* Ensure, while the spinlock's not yet held, that there's a free
* refcount entry.
@@ -1200,7 +1219,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* Select a victim buffer. The buffer is returned with its header
* spinlock still held!
*/
- buf = StrategyGetBuffer(strategy, &buf_state);
+ buf = StrategyGetBuffer(strategy, &buf_state, &from_ring);
Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
@@ -1237,6 +1256,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf),
LW_SHARED))
{
+ IOContext io_context;
+
/*
* If using a nondefault strategy, and writing the buffer
* would require a WAL flush, let the strategy decide whether
@@ -1263,13 +1284,28 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
}
}
+ /*
+ * When a strategy is in use, if the target dirty buffer is an
+ * existing strategy buffer being reused, count this as a
+ * strategy write for the purposes of IO Operations statistics
+ * tracking.
+ *
+ * All dirty shared buffers upon first being added to the ring
+ * will be counted as shared buffer writes.
+ *
+ * When a strategy is not in use, the write can only be a
+ * "regular" write of a dirty shared buffer.
+ */
+
+ io_context = from_ring ? IOCONTEXT_STRATEGY : IOCONTEXT_SHARED;
+
/* OK, do the I/O */
TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(forkNum, blockNum,
smgr->smgr_rlocator.locator.spcOid,
smgr->smgr_rlocator.locator.dbOid,
smgr->smgr_rlocator.locator.relNumber);
- FlushBuffer(buf, NULL);
+ FlushBuffer(buf, NULL, io_context);
LWLockRelease(BufferDescriptorGetContentLock(buf));
ScheduleBufferTagForWriteback(&BackendWritebackContext,
@@ -2573,7 +2609,7 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOCONTEXT_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
@@ -2823,7 +2859,7 @@ BufferGetTag(Buffer buffer, RelFileLocator *rlocator, ForkNumber *forknum,
* as the second parameter. If not, pass NULL.
*/
static void
-FlushBuffer(BufferDesc *buf, SMgrRelation reln)
+FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOContext io_context)
{
XLogRecPtr recptr;
ErrorContextCallback errcallback;
@@ -2903,6 +2939,8 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
*/
bufToWrite = PageSetChecksumCopy((Page) bufBlock, buf->tag.blockNum);
+ pgstat_count_io_op(IOOP_WRITE, io_context);
+
if (track_io_timing)
INSTR_TIME_SET_CURRENT(io_start);
@@ -3554,6 +3592,8 @@ FlushRelationBuffers(Relation rel)
localpage,
false);
+ pgstat_count_io_op(IOOP_WRITE, IOCONTEXT_LOCAL);
+
buf_state &= ~(BM_DIRTY | BM_JUST_DIRTIED);
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
@@ -3589,7 +3629,7 @@ FlushRelationBuffers(Relation rel)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, RelationGetSmgr(rel));
+ FlushBuffer(bufHdr, RelationGetSmgr(rel), IOCONTEXT_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
@@ -3687,7 +3727,7 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, srelent->srel);
+ FlushBuffer(bufHdr, srelent->srel, IOCONTEXT_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
@@ -3897,7 +3937,7 @@ FlushDatabaseBuffers(Oid dbid)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOCONTEXT_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
@@ -3924,7 +3964,7 @@ FlushOneBuffer(Buffer buffer)
Assert(LWLockHeldByMe(BufferDescriptorGetContentLock(bufHdr)));
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOCONTEXT_SHARED);
}
/*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 990e081aae..50d04b2b6d 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -15,6 +15,7 @@
*/
#include "postgres.h"
+#include "pgstat.h"
#include "port/atomics.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
@@ -198,13 +199,15 @@ have_free_buffer(void)
* return the buffer with the buffer header spinlock still held.
*/
BufferDesc *
-StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
+StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_ring)
{
BufferDesc *buf;
int bgwprocno;
int trycounter;
uint32 local_buf_state; /* to avoid repeated (de-)referencing */
+ *from_ring = false;
+
/*
* If given a strategy object, see whether it can select a buffer. We
* assume strategy objects don't need buffer_strategy_lock.
@@ -213,7 +216,22 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
{
buf = GetBufferFromRing(strategy, buf_state);
if (buf != NULL)
+ {
+ /*
+ * When a strategy is in use, reused buffers from the strategy
+ * ring will be counted as IOCONTEXT_STRATEGY allocations for the
+ * purposes of IO Operation statistics tracking.
+ *
+ * However, even when a strategy is in use, if a new buffer must
+ * be allocated from shared buffers and added to the ring, this is
+ * counted instead as an IOCONTEXT_SHARED allocation. So, only
+ * reused buffers are counted as having been allocated in the
+ * IOCONTEXT_STRATEGY IOContext.
+ */
+ *from_ring = true;
+ pgstat_count_io_op(IOOP_ALLOC, IOCONTEXT_STRATEGY);
return buf;
+ }
}
/*
@@ -247,6 +265,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
* the rate of buffer consumption. Note that buffers recycled by a
* strategy object are intentionally not counted here.
*/
+ pgstat_count_io_op(IOOP_ALLOC, IOCONTEXT_SHARED);
pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);
/*
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 30d67d1c40..6b2f529fe7 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -18,6 +18,7 @@
#include "access/parallel.h"
#include "catalog/catalog.h"
#include "executor/instrument.h"
+#include "pgstat.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "utils/guc_hooks.h"
@@ -196,6 +197,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
LocalRefCount[b]++;
ResourceOwnerRememberBuffer(CurrentResourceOwner,
BufferDescriptorGetBuffer(bufHdr));
+
+ pgstat_count_io_op(IOOP_ALLOC, IOCONTEXT_LOCAL);
break;
}
}
@@ -226,6 +229,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
localpage,
false);
+ pgstat_count_io_op(IOOP_WRITE, IOCONTEXT_LOCAL);
+
/* Mark not-dirty now in case we error out below */
buf_state &= ~BM_DIRTY;
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 9d6a9e9109..5718b52fb5 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -432,6 +432,8 @@ ProcessSyncRequests(void)
total_elapsed += elapsed;
processed++;
+ pgstat_count_io_op(IOOP_FSYNC, IOCONTEXT_SHARED);
+
if (log_checkpoints)
elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f ms",
processed,
diff --git a/src/backend/utils/activity/Makefile b/src/backend/utils/activity/Makefile
index a2e8507fd6..0098785089 100644
--- a/src/backend/utils/activity/Makefile
+++ b/src/backend/utils/activity/Makefile
@@ -22,6 +22,7 @@ OBJS = \
pgstat_checkpointer.o \
pgstat_database.o \
pgstat_function.o \
+ pgstat_io_ops.o \
pgstat_relation.o \
pgstat_replslot.o \
pgstat_shmem.o \
diff --git a/src/backend/utils/activity/meson.build b/src/backend/utils/activity/meson.build
index 5b3b558a67..1038324c32 100644
--- a/src/backend/utils/activity/meson.build
+++ b/src/backend/utils/activity/meson.build
@@ -7,6 +7,7 @@ backend_sources += files(
'pgstat_checkpointer.c',
'pgstat_database.c',
'pgstat_function.c',
+ 'pgstat_io_ops.c',
'pgstat_relation.c',
'pgstat_replslot.c',
'pgstat_shmem.c',
diff --git a/src/backend/utils/activity/pgstat_io_ops.c b/src/backend/utils/activity/pgstat_io_ops.c
new file mode 100644
index 0000000000..5cfd40068d
--- /dev/null
+++ b/src/backend/utils/activity/pgstat_io_ops.c
@@ -0,0 +1,199 @@
+/* -------------------------------------------------------------------------
+ *
+ * pgstat_io_ops.c
+ * Implementation of IO operation statistics.
+ *
+ * This file contains the implementation of IO operation statistics. It is kept
+ * separate from pgstat.c to enforce the line between the statistics access /
+ * storage implementation and the details about individual types of
+ * statistics.
+ *
+ * Copyright (c) 2021-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/utils/activity/pgstat_io_ops.c
+ * -------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "utils/pgstat_internal.h"
+
+static PgStat_IOContextOps pending_IOOpStats;
+
+void
+pgstat_count_io_op(IOOp io_op, IOContext io_context)
+{
+ PgStat_IOOpCounters *pending_counters;
+
+ Assert(io_context < IOCONTEXT_NUM_TYPES);
+ Assert(io_op < IOOP_NUM_TYPES);
+ Assert(pgstat_expect_io_op(MyBackendType, io_context, io_op));
+
+ pending_counters = &pending_IOOpStats.data[io_context];
+
+ switch (io_op)
+ {
+ case IOOP_ALLOC:
+ pending_counters->allocs++;
+ break;
+ case IOOP_EXTEND:
+ pending_counters->extends++;
+ break;
+ case IOOP_FSYNC:
+ pending_counters->fsyncs++;
+ break;
+ case IOOP_READ:
+ pending_counters->reads++;
+ break;
+ case IOOP_WRITE:
+ pending_counters->writes++;
+ break;
+ }
+
+}
+
+const char *
+pgstat_io_context_desc(IOContext io_context)
+{
+ switch (io_context)
+ {
+ case IOCONTEXT_LOCAL:
+ return "local";
+ case IOCONTEXT_SHARED:
+ return "shared";
+ case IOCONTEXT_STRATEGY:
+ return "strategy";
+ }
+
+ elog(ERROR, "unrecognized IOContext value: %d", io_context);
+}
+
+const char *
+pgstat_io_op_desc(IOOp io_op)
+{
+ switch (io_op)
+ {
+ case IOOP_ALLOC:
+ return "alloc";
+ case IOOP_EXTEND:
+ return "extend";
+ case IOOP_FSYNC:
+ return "fsync";
+ case IOOP_READ:
+ return "read";
+ case IOOP_WRITE:
+ return "write";
+ }
+
+ elog(ERROR, "unrecognized IOOp value: %d", io_op);
+}
+
+/*
+* IO Operation statistics are not collected for all BackendTypes.
+*
+* The following BackendTypes do not participate in the cumulative stats
+* subsystem or do not do IO operations worth reporting statistics on:
+* - Syslogger because it is not connected to shared memory
+* - Archiver because most relevant archiving IO is delegated to a
+* specialized command or module
+* - WAL Receiver and WAL Writer IO is not tracked in pg_stat_io for now
+*
+* Function returns true if BackendType participates in the cumulative stats
+* subsystem for IO Operations and false if it does not.
+*/
+bool
+pgstat_io_op_stats_collected(BackendType bktype)
+{
+ return bktype != B_INVALID && bktype != B_ARCHIVER && bktype != B_LOGGER &&
+ bktype != B_WAL_RECEIVER && bktype != B_WAL_WRITER;
+}
+
+/*
+ * Some BackendTypes will never do IO in certain IOContexts. Check that the
+ * given BackendType is expected to do IO in the given IOContext.
+ */
+bool
+pgstat_bktype_io_context_valid(BackendType bktype, IOContext io_context)
+{
+ bool no_strategy;
+ bool no_local;
+
+ /*
+ * Not all BackendTypes will use a BufferAccessStrategy.
+ */
+ no_strategy = bktype == B_AUTOVAC_LAUNCHER || bktype ==
+ B_BG_WRITER || bktype == B_CHECKPOINTER;
+
+ /*
+ * In core Postgres, only regular backends and WAL Sender processes
+ * executing queries should use local buffers. Parallel workers will not
+ * use local buffers (see InitLocalBuffers()); however, extensions
+ * leveraging background workers have no such limitation, so track IO
+ * Operations in IOCONTEXT_LOCAL for BackendType B_BG_WORKER.
+ */
+ no_local = bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER || bktype
+ == B_CHECKPOINTER || bktype == B_AUTOVAC_WORKER || bktype ==
+ B_STANDALONE_BACKEND || bktype == B_STARTUP;
+
+ if (io_context == IOCONTEXT_STRATEGY && no_strategy)
+ return false;
+
+ if (io_context == IOCONTEXT_LOCAL && no_local)
+ return false;
+
+ return true;
+}
+
+/*
+ * Some BackendTypes will never do certain IOOps and some IOOps should not
+ * occur in certain IOContexts. Check that the given IO Operation is valid for
+ * the given BackendType in the given IOContext. Note that there are currently
+ * no cases of an IO Operation being invalid for a particular BackendType only
+ * within a certain IOContext.
+ */
+bool
+pgstat_io_op_valid(BackendType bktype, IOContext io_context, IOOp io_op)
+{
+ /*
+ * Some BackendTypes should never track IO Operation statistics.
+ */
+ Assert(pgstat_io_op_stats_collected(bktype));
+
+ if ((bktype == B_BG_WRITER || bktype == B_CHECKPOINTER) && io_op ==
+ IOOP_READ)
+ return false;
+
+ if ((bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER || bktype ==
+ B_CHECKPOINTER) && io_op == IOOP_EXTEND)
+ return false;
+
+ /*
+ * Temporary tables using local buffers are not logged and thus do not
+ * require fsync'ing.
+ *
+ * IOOP_FSYNC IOOps done by a backend using a BufferAccessStrategy are
+ * counted in the IOCONTEXT_SHARED IOContext. See comment in
+ * ForwardSyncRequest() for more details.
+ */
+ if ((io_context == IOCONTEXT_LOCAL || io_context == IOCONTEXT_STRATEGY) &&
+ io_op == IOOP_FSYNC)
+ return false;
+
+ return true;
+}
+
+bool
+pgstat_expect_io_op(BackendType bktype, IOContext io_context, IOOp io_op)
+{
+ if (!pgstat_io_op_stats_collected(bktype))
+ return false;
+
+ if (!pgstat_bktype_io_context_valid(bktype, io_context))
+ return false;
+
+ if (!(pgstat_io_op_valid(bktype, io_context, io_op)))
+ return false;
+
+ return true;
+}
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index ad7334a0d2..ccc836a37f 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -276,6 +276,44 @@ typedef struct PgStat_CheckpointerStats
PgStat_Counter buf_fsync_backend;
} PgStat_CheckpointerStats;
+/*
+ * Types related to counting IO Operations for various IO Contexts
+ */
+
+typedef enum IOOp
+{
+ IOOP_ALLOC,
+ IOOP_EXTEND,
+ IOOP_FSYNC,
+ IOOP_READ,
+ IOOP_WRITE,
+} IOOp;
+
+#define IOOP_NUM_TYPES (IOOP_WRITE + 1)
+
+typedef enum IOContext
+{
+ IOCONTEXT_LOCAL,
+ IOCONTEXT_SHARED,
+ IOCONTEXT_STRATEGY,
+} IOContext;
+
+#define IOCONTEXT_NUM_TYPES (IOCONTEXT_STRATEGY + 1)
+
+typedef struct PgStat_IOOpCounters
+{
+ PgStat_Counter allocs;
+ PgStat_Counter extends;
+ PgStat_Counter fsyncs;
+ PgStat_Counter reads;
+ PgStat_Counter writes;
+} PgStat_IOOpCounters;
+
+typedef struct PgStat_IOContextOps
+{
+ PgStat_IOOpCounters data[IOCONTEXT_NUM_TYPES];
+} PgStat_IOContextOps;
+
typedef struct PgStat_StatDBEntry
{
PgStat_Counter n_xact_commit;
@@ -453,6 +491,21 @@ extern void pgstat_report_checkpointer(void);
extern PgStat_CheckpointerStats *pgstat_fetch_stat_checkpointer(void);
+/*
+ * Functions in pgstat_io_ops.c
+ */
+
+extern void pgstat_count_io_op(IOOp io_op, IOContext io_context);
+extern const char *pgstat_io_context_desc(IOContext io_context);
+extern const char *pgstat_io_op_desc(IOOp io_op);
+
+/* Validation functions in pgstat_io_ops.c */
+extern bool pgstat_io_op_stats_collected(BackendType bktype);
+extern bool pgstat_bktype_io_context_valid(BackendType bktype, IOContext io_context);
+extern bool pgstat_io_op_valid(BackendType bktype, IOContext io_context, IOOp io_op);
+extern bool pgstat_expect_io_op(BackendType bktype, IOContext io_context, IOOp io_op);
+
+
/*
* Functions in pgstat_database.c
*/
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 406db6be78..50d7e586e9 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -392,7 +392,7 @@ extern void ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *
/* freelist.c */
extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
- uint32 *buf_state);
+ uint32 *buf_state, bool *from_ring);
extern void StrategyFreeBuffer(BufferDesc *buf);
extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
BufferDesc *buf);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 97c9bc1861..67218ec6f2 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1106,7 +1106,9 @@ ID
INFIX
INT128
INTERFACE_INFO
+IOContext
IOFuncSelector
+IOOp
IPCompareMethod
ITEM
IV
@@ -2026,6 +2028,8 @@ PgStat_FetchConsistency
PgStat_FunctionCallUsage
PgStat_FunctionCounts
PgStat_HashKey
+PgStat_IOContextOps
+PgStat_IOOpCounters
PgStat_Kind
PgStat_KindInfo
PgStat_LocalState
--
2.34.1
On Tue, Sep 27, 2022 at 11:20 AM Melanie Plageman <melanieplageman@gmail.com>
wrote:
v30 attached
rebased and pgstat_io_ops.c builds with meson now
also, I tested with pgstat_report_stat() only flushing when forced and
tests still pass
First of all, I'm excited about this patch, and I think it will be a big
help to understand better which part of Postgres is producing I/O (and why).
I've paired up with Maciek (CCed) on a review of this patch and had a few
comments, focused on the user experience:
The term "strategy" as an "io_context" is hard to understand, as it's not a
concept an end-user / DBA would be familiar with. Since this comes from
BufferAccessStrategyType (i.e. anything not NULL/BAS_NORMAL is treated as
"strategy"), maybe we could instead split this out into the individual
strategy types? i.e. making "strategy" three different I/O contexts
instead: "shared_bulkread", "shared_bulkwrite" and "shared_vacuum",
retaining "shared" to mean NULL / BAS_NORMAL.
Separately, could we also track buffer hits without incurring extra
overhead? (not just allocs and reads) -- Whilst we already have shared read
and hit counters in a few other places, this would help make the common
"What's my cache hit ratio" question more accurate to answer in the
presence of different shared buffer access strategies. Tracking hits could
also help for local buffers (e.g. to tune temp_buffers based on seeing a
low cache hit ratio).
Additionally, some minor notes:
- Since the stats are counting blocks, it would make sense to prefix the
view columns with "blks_", and word them in the past tense (to match
current style), i.e. "blks_written", "blks_read", "blks_extended",
"blks_fsynced" (realistically one would combine this new view with other
data e.g. from pg_stat_database or pg_stat_statements, which all use the
"blks_" prefix, and stop using pg_stat_bgwriter for this which does not use
such a prefix)
- "alloc" as a name doesn't seem intuitive (and it may be confused with
memory allocations) - whilst this is already named this way in
pg_stat_bgwriter, it feels like this is an opportunity to eventually
deprecate the column there and make this easier to understand -
specifically, maybe we can clarify that this means buffer *acquisitions*?
(either by renaming the field to "blks_acquired", or clarifying in the
documentation)
- Assuming we think this view could realistically cover all I/O produced by
Postgres in the future (thus warranting the name "pg_stat_io"), it may be
best to have an explicit list of things that are not currently tracked in
the documentation, to reduce user confusion (i.e. WAL writes are not
tracked, temporary files are not tracked, and some forms of direct writes
are not tracked, e.g. when a table moves to a different tablespace)
- In the view documentation, it would be good to explain the different
values for "io_strategy" (and what they mean)
- Overall it would be helpful if we had a dedicated documentation page on
I/O statistics that's linked from the pg_stat_io view description, and
explains how the I/O statistics tie into the various concepts of shared
buffers / buffer access strategies / etc (and what is not tracked today)
Thanks,
Lukas
--
Lukas Fittl
Hi,
On 2022-09-27 14:20:44 -0400, Melanie Plageman wrote:
v30 attached
rebased and pgstat_io_ops.c builds with meson now
also, I tested with pgstat_report_stat() only flushing when forced and
tests still pass
Unfortunately tests fail in CI / cfbot. E.g.,
https://cirrus-ci.com/task/5816109319323648
https://api.cirrus-ci.com/v1/artifact/task/5816109319323648/testrun/build/testrun/main/regress/regression.diffs
diff -U3 /tmp/cirrus-ci-build/src/test/regress/expected/stats.out /tmp/cirrus-ci-build/build/testrun/main/regress/results/stats.out
--- /tmp/cirrus-ci-build/src/test/regress/expected/stats.out 2022-10-01 12:07:47.779183501 +0000
+++ /tmp/cirrus-ci-build/build/testrun/main/regress/results/stats.out 2022-10-01 12:11:38.686433303 +0000
@@ -997,6 +997,8 @@
-- Set temp_buffers to a low value so that we can trigger writes with fewer
-- inserted tuples.
SET temp_buffers TO '1MB';
+ERROR: invalid value for parameter "temp_buffers": 128
+DETAIL: "temp_buffers" cannot be changed after any temporary tables have been accessed in the session.
CREATE TEMPORARY TABLE test_io_local(a int, b TEXT);
SELECT sum(alloc) AS io_sum_local_allocs_before FROM pg_stat_io WHERE io_context = 'local' \gset
SELECT sum(read) AS io_sum_local_reads_before FROM pg_stat_io WHERE io_context = 'local' \gset
@@ -1037,7 +1039,7 @@
SELECT :io_sum_local_writes_after > :io_sum_local_writes_before;
?column?
----------
- t
+ f
(1 row)
SELECT :io_sum_local_extends_after > :io_sum_local_extends_before;
So the problem is just that something else accesses temp buffers earlier in
the same test.
That's likely because, since you sent your email,
commit d7e39d72ca1c6f188b400d7d58813ff5b5b79064
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date: 2022-09-29 12:14:39 -0400
Use actual backend IDs in pg_stat_get_backend_idset() and friends.
was applied, which adds a temp table earlier in the same session.
I think the easiest way to make this robust would be to just add a reconnect
before the place you need to set temp_buffers, that way additional temp tables
won't cause a problem.
Setting the patch to waiting-for-author for now.
Greetings,
Andres Freund
v31 attached
I've also addressed the failing test mentioned by Andres in [1].
On Fri, Sep 30, 2022 at 7:18 PM Lukas Fittl <lukas@fittl.com> wrote:
On Tue, Sep 27, 2022 at 11:20 AM Melanie Plageman <melanieplageman@gmail.com> wrote:
First of all, I'm excited about this patch, and I think it will be a big help to understand better which part of Postgres is producing I/O (and why).
Thanks! I'm happy to hear that.
I've paired up with Maciek (CCed) on a review of this patch and had a few comments, focused on the user experience:
Thanks for taking the time to review!
The term "strategy" as an "io_context" is hard to understand, as its not a concept an end-user / DBA would be familiar with. Since this comes from BufferAccessStrategyType (i.e. anything not NULL/BAS_NORMAL is treated as "strategy"), maybe we could instead split this out into the individual strategy types? i.e. making "strategy" three different I/O contexts instead: "shared_bulkread", "shared_bulkwrite" and "shared_vacuum", retaining "shared" to mean NULL / BAS_NORMAL.
I have split strategy out into "vacuum", "bulkread", and "bulkwrite". I
thought it was less clear with shared as a prefix. If we were to have
BufferAccessStrategies in the future which acquire local buffers (for
example), we could start prefixing the columns to differentiate.
This opened up some new questions about which BufferAccessStrategies
will be employed by which BackendTypes and which IOOps will be valid in
a given BufferAccessStrategy.
I've excluded IOCONTEXT_BULKREAD and IOCONTEXT_BULKWRITE for autovacuum
worker -- though those may not be inherently invalid, they seem not to
be done now and added extra rows to the view.
I've also disallowed IOOP_EXTEND for IOCONTEXT_BULKREAD.
Separately, could we also track buffer hits without incurring extra overhead? (not just allocs and reads) -- Whilst we already have shared read and hit counters in a few other places, this would help make the common "What's my cache hit ratio" question more accurate to answer in the presence of different shared buffer access strategies. Tracking hits could also help for local buffers (e.g. to tune temp_buffers based on seeing a low cache hit ratio).
I've started tracking hits and added "hit" to the view.
I added IOOP_HIT and IOOP_ACQUIRE to those IOOps disallowed for
checkpointer and bgwriter.
I have added tests for hit, but I'm not sure I can keep them. It seems
like they might fail if the blocks are evicted between the first and
second time I try to read them.
Additionally, some minor notes:
- Since the stats are counting blocks, it would make sense to prefix the view columns with "blks_", and word them in the past tense (to match current style), i.e. "blks_written", "blks_read", "blks_extended", "blks_fsynced" (realistically one would combine this new view with other data e.g. from pg_stat_database or pg_stat_statements, which all use the "blks_" prefix, and stop using pg_stat_bgwriter for this which does not use such a prefix)
I have changed the column names to be in the past tense.
There are no columns equivalent to "dirty" or "misses" from the other
views containing information on buffer hits/block reads/writes/etc. I'm
not sure whether or not those make sense in this context.
Because we want to add non-block-oriented IO in the future (like
temporary file IO) to this view and want to use the same "read",
"written", "extended" columns, I would prefer not to prefix the columns
with "blks_". I have added a column "unit" which would contain the unit
in which read, written, and extended are in. Unfortunately, fsyncs are
not per block, so "unit" doesn't really work for this. I documented
this.
The most correct thing to do to accommodate block-oriented and
non-block-oriented IO would be to specify all the values in bytes.
However, I would like this view to be usable visually (as opposed to
just in scripts and by tools). The only current value of unit is
"block_size" which could potentially be combined with the value of the
GUC to get bytes.
I've hard-coded the string "block_size" into the view generation
function pg_stat_get_io(), so, if this idea makes sense, perhaps I
should do something better there.
- "alloc" as a name doesn't seem intuitive (and it may be confused with memory allocations) - whilst this is already named this way in pg_stat_bgwriter, it feels like this is an opportunity to eventually deprecate the column there and make this easier to understand - specifically, maybe we can clarify that this means buffer *acquisitions*? (either by renaming the field to "blks_acquired", or clarifying in the documentation)
I have renamed it to acquired. It doesn't overlap completely with
buffers_alloc in pg_stat_bgwriter, so I didn't mention that in docs.
- Assuming we think this view could realistically cover all I/O produced by Postgres in the future (thus warranting the name "pg_stat_io"), it may be best to have an explicit list of things that are not currently tracked in the documentation, to reduce user confusion (i.e. WAL writes are not tracked, temporary files are not tracked, and some forms of direct writes are not tracked, e.g. when a table moves to a different tablespace)
I have added this to the docs. The list is not exhaustive, so I would
love to get feedback on if there are other specific examples of IO which
is using smgr* directly that users will wonder about and I should call
out.
- In the view documentation, it would be good to explain the different values for "io_strategy" (and what they mean)
I have added this and would love feedback on my docs additions.
- Overall it would be helpful if we had a dedicated documentation page on I/O statistics that's linked from the pg_stat_io view description, and explains how the I/O statistics tie into the various concepts of shared buffers / buffer access strategies / etc (and what is not tracked today)
I haven't done this yet. How specific were you thinking -- like
interpretations of all the combinations and what to do with what you
see? Like you should run pg_prewarm if you see X? Specific checkpointer
or bgwriter GUCs to change? Or just links to other docs pages on
recommended tunings?
Were you imagining the other IO statistics views (like
pg_statio_all_tables and pg_stat_database) also being included in this
page? Like would it be a comprehensive guide to IO statistics and what
their significance/purposes are?
- Melanie
[1]: /messages/by-id/20221002172404.xyzhftbedh4zpio2@awork3.anarazel.de
Attachments:
v31-0002-Aggregate-IO-operation-stats-per-BackendType.patch (text/x-patch)
From 0ee5cda16066edbbe0b992cae7272c9e2671d1a7 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Thu, 6 Oct 2022 12:23:38 -0400
Subject: [PATCH v31 2/3] Aggregate IO operation stats per BackendType
Stats on IOOps for all IOContexts for a backend are tracked locally. Add
functionality for backends to flush these stats to shared memory and
accumulate them with those from all other backends, exited and live.
Also add reset and snapshot functions used by cumulative stats system
for management of these statistics.
The aggregated stats in shared memory could be extended in the future
with per-backend stats -- useful for per-connection IO statistics and
monitoring.
Some BackendTypes do not flush their pending statistics at regular
intervals, so they explicitly call pgstat_flush_io_ops() during the
course of normal operations to flush their backend-local IO operation
statistics to shared memory in a timely manner.
Because not all BackendType, IOOp, IOContext combinations are valid, the
validity of the stats is checked before flushing pending stats and before
reading the existing stats file into shared memory.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Maciek Sakrejda <m.sakrejda@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
doc/src/sgml/monitoring.sgml | 2 +
src/backend/utils/activity/pgstat.c | 35 ++++
src/backend/utils/activity/pgstat_bgwriter.c | 7 +-
.../utils/activity/pgstat_checkpointer.c | 7 +-
src/backend/utils/activity/pgstat_io_ops.c | 157 ++++++++++++++++++
src/backend/utils/activity/pgstat_relation.c | 15 +-
src/backend/utils/activity/pgstat_shmem.c | 4 +
src/backend/utils/activity/pgstat_wal.c | 4 +-
src/backend/utils/adt/pgstatfuncs.c | 4 +-
src/include/miscadmin.h | 2 +
src/include/pgstat.h | 83 ++++++++-
src/include/utils/pgstat_internal.h | 37 +++++
src/tools/pgindent/typedefs.list | 3 +
13 files changed, 353 insertions(+), 7 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 342b20ebeb..14dfd650f8 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -5360,6 +5360,8 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
the <structname>pg_stat_bgwriter</structname>
view, <literal>archiver</literal> to reset all the counters shown in
the <structname>pg_stat_archiver</structname> view,
+ <literal>io</literal> to reset all the counters shown in the
+ <structname>pg_stat_io</structname> view,
<literal>wal</literal> to reset all the counters shown in the
<structname>pg_stat_wal</structname> view or
<literal>recovery_prefetch</literal> to reset all the counters shown
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 609f0b1ad8..e66dde0ea5 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -359,6 +359,15 @@ static const PgStat_KindInfo pgstat_kind_infos[PGSTAT_NUM_KINDS] = {
.snapshot_cb = pgstat_checkpointer_snapshot_cb,
},
+ [PGSTAT_KIND_IOOPS] = {
+ .name = "io_ops",
+
+ .fixed_amount = true,
+
+ .reset_all_cb = pgstat_io_ops_reset_all_cb,
+ .snapshot_cb = pgstat_io_ops_snapshot_cb,
+ },
+
[PGSTAT_KIND_SLRU] = {
.name = "slru",
@@ -582,6 +591,7 @@ pgstat_report_stat(bool force)
/* Don't expend a clock check if nothing to do */
if (dlist_is_empty(&pgStatPending) &&
+ !have_ioopstats &&
!have_slrustats &&
!pgstat_have_pending_wal())
{
@@ -628,6 +638,9 @@ pgstat_report_stat(bool force)
/* flush database / relation / function / ... stats */
partial_flush |= pgstat_flush_pending_entries(nowait);
+ /* flush IO Operations stats */
+ partial_flush |= pgstat_flush_io_ops(nowait);
+
/* flush wal stats */
partial_flush |= pgstat_flush_wal(nowait);
@@ -1321,6 +1334,14 @@ pgstat_write_statsfile(void)
pgstat_build_snapshot_fixed(PGSTAT_KIND_CHECKPOINTER);
write_chunk_s(fpout, &pgStatLocal.snapshot.checkpointer);
+ /*
+ * Write IO Operations stats struct
+ */
+ pgstat_build_snapshot_fixed(PGSTAT_KIND_IOOPS);
+ write_chunk_s(fpout, &pgStatLocal.snapshot.io_ops.stat_reset_timestamp);
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ write_chunk_s(fpout, &pgStatLocal.snapshot.io_ops.stats[i]);
+
/*
* Write SLRU stats struct
*/
@@ -1495,6 +1516,20 @@ pgstat_read_statsfile(void)
if (!read_chunk_s(fpin, &shmem->checkpointer.stats))
goto error;
+ /*
+ * Read IO Operations stats struct
+ */
+ if (!read_chunk_s(fpin, &shmem->io_ops.stat_reset_timestamp))
+ goto error;
+
+ for (int bktype = 0; bktype < BACKEND_NUM_TYPES; bktype++)
+ {
+ pgstat_backend_io_stats_assert_well_formed(shmem->io_ops.stats[bktype].data,
+ bktype);
+ if (!read_chunk_s(fpin, &shmem->io_ops.stats[bktype].data))
+ goto error;
+ }
+
/*
* Read SLRU stats struct
*/
diff --git a/src/backend/utils/activity/pgstat_bgwriter.c b/src/backend/utils/activity/pgstat_bgwriter.c
index fbb1edc527..3d7f90a1b7 100644
--- a/src/backend/utils/activity/pgstat_bgwriter.c
+++ b/src/backend/utils/activity/pgstat_bgwriter.c
@@ -24,7 +24,7 @@ PgStat_BgWriterStats PendingBgWriterStats = {0};
/*
- * Report bgwriter statistics
+ * Report bgwriter and IO Operation statistics
*/
void
pgstat_report_bgwriter(void)
@@ -56,6 +56,11 @@ pgstat_report_bgwriter(void)
* Clear out the statistics buffer, so it can be re-used.
*/
MemSet(&PendingBgWriterStats, 0, sizeof(PendingBgWriterStats));
+
+ /*
+ * Report IO Operations statistics
+ */
+ pgstat_flush_io_ops(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_checkpointer.c b/src/backend/utils/activity/pgstat_checkpointer.c
index af8d513e7b..cfcf127210 100644
--- a/src/backend/utils/activity/pgstat_checkpointer.c
+++ b/src/backend/utils/activity/pgstat_checkpointer.c
@@ -24,7 +24,7 @@ PgStat_CheckpointerStats PendingCheckpointerStats = {0};
/*
- * Report checkpointer statistics
+ * Report checkpointer and IO Operation statistics
*/
void
pgstat_report_checkpointer(void)
@@ -62,6 +62,11 @@ pgstat_report_checkpointer(void)
* Clear out the statistics buffer, so it can be re-used.
*/
MemSet(&PendingCheckpointerStats, 0, sizeof(PendingCheckpointerStats));
+
+ /*
+ * Report IO Operation statistics
+ */
+ pgstat_flush_io_ops(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_io_ops.c b/src/backend/utils/activity/pgstat_io_ops.c
index cea46322a7..86d403fe55 100644
--- a/src/backend/utils/activity/pgstat_io_ops.c
+++ b/src/backend/utils/activity/pgstat_io_ops.c
@@ -20,6 +20,42 @@
#include "utils/pgstat_internal.h"
static PgStat_IOContextOps pending_IOOpStats;
+bool have_ioopstats = false;
+
+
+/*
+ * Helper function to accumulate source PgStat_IOOpCounters into target
+ * PgStat_IOOpCounters. If either of the passed-in PgStat_IOOpCounters are
+ * members of PgStatShared_IOContextOps, the caller is responsible for ensuring
+ * that the appropriate lock is held.
+ */
+static void
+pgstat_accum_io_op(PgStat_IOOpCounters *target, PgStat_IOOpCounters *source, IOOp io_op)
+{
+ switch (io_op)
+ {
+ case IOOP_ACQUIRE:
+ target->acquires += source->acquires;
+ return;
+ case IOOP_EXTEND:
+ target->extends += source->extends;
+ return;
+ case IOOP_FSYNC:
+ target->fsyncs += source->fsyncs;
+ return;
+ case IOOP_HIT:
+ target->hits += source->hits;
+ return;
+ case IOOP_READ:
+ target->reads += source->reads;
+ return;
+ case IOOP_WRITE:
+ target->writes += source->writes;
+ return;
+ }
+
+ elog(ERROR, "unrecognized IOOp value: %d", io_op);
+}
void
pgstat_count_io_op(IOOp io_op, IOContext io_context)
@@ -54,6 +90,77 @@ pgstat_count_io_op(IOOp io_op, IOContext io_context)
break;
}
+ have_ioopstats = true;
+}
+
+PgStat_BackendIOContextOps *
+pgstat_fetch_backend_io_context_ops(void)
+{
+ pgstat_snapshot_fixed(PGSTAT_KIND_IOOPS);
+
+ return &pgStatLocal.snapshot.io_ops;
+}
+
+/*
+ * Flush out locally pending IO Operation statistics entries
+ *
+ * If no stats have been recorded, this function returns false.
+ *
+ * If nowait is true and the lock cannot be acquired, this function returns
+ * true without flushing. Otherwise, it flushes the stats and returns false.
+ */
+bool
+pgstat_flush_io_ops(bool nowait)
+{
+ PgStatShared_IOContextOps *type_shstats;
+ bool expect_backend_stats = true;
+
+ if (!have_ioopstats)
+ return false;
+
+ type_shstats =
+ &pgStatLocal.shmem->io_ops.stats[MyBackendType];
+
+ if (!nowait)
+ LWLockAcquire(&type_shstats->lock, LW_EXCLUSIVE);
+ else if (!LWLockConditionalAcquire(&type_shstats->lock, LW_EXCLUSIVE))
+ return true;
+
+ expect_backend_stats = pgstat_io_op_stats_collected(MyBackendType);
+
+ for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ {
+ PgStat_IOOpCounters *sharedent = &type_shstats->data[io_context];
+ PgStat_IOOpCounters *pendingent = &pending_IOOpStats.data[io_context];
+
+ if (!expect_backend_stats ||
+ !pgstat_bktype_io_context_valid(MyBackendType, io_context))
+ {
+ pgstat_io_context_ops_assert_zero(sharedent);
+ pgstat_io_context_ops_assert_zero(pendingent);
+ continue;
+ }
+
+ for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
+ {
+ if (!(pgstat_io_op_valid(MyBackendType, io_context, io_op)))
+ {
+ pgstat_io_op_assert_zero(sharedent, io_op);
+ pgstat_io_op_assert_zero(pendingent, io_op);
+ continue;
+ }
+
+ pgstat_accum_io_op(sharedent, pendingent, io_op);
+ }
+ }
+
+ LWLockRelease(&type_shstats->lock);
+
+ memset(&pending_IOOpStats, 0, sizeof(pending_IOOpStats));
+
+ have_ioopstats = false;
+
+ return false;
}
const char *
@@ -98,6 +205,56 @@ pgstat_io_op_desc(IOOp io_op)
elog(ERROR, "unrecognized IOOp value: %d", io_op);
}
+void
+pgstat_io_ops_reset_all_cb(TimestampTz ts)
+{
+ PgStatShared_BackendIOContextOps *backends_stats_shmem = &pgStatLocal.shmem->io_ops;
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ PgStatShared_IOContextOps *stats_shmem = &backends_stats_shmem->stats[i];
+
+ LWLockAcquire(&stats_shmem->lock, LW_EXCLUSIVE);
+
+ /*
+ * Use the lock in the first BackendType's PgStat_IOContextOps to
+ * protect the reset timestamp as well.
+ */
+ if (i == 0)
+ backends_stats_shmem->stat_reset_timestamp = ts;
+
+ memset(stats_shmem->data, 0, sizeof(stats_shmem->data));
+ LWLockRelease(&stats_shmem->lock);
+ }
+}
+
+void
+pgstat_io_ops_snapshot_cb(void)
+{
+ PgStatShared_BackendIOContextOps *backends_stats_shmem = &pgStatLocal.shmem->io_ops;
+ PgStat_BackendIOContextOps *backends_stats_snap = &pgStatLocal.snapshot.io_ops;
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ PgStatShared_IOContextOps *stats_shmem = &backends_stats_shmem->stats[i];
+ PgStat_IOContextOps *stats_snap = &backends_stats_snap->stats[i];
+
+ LWLockAcquire(&stats_shmem->lock, LW_SHARED);
+
+ /*
+ * Use the lock in the first BackendType's PgStat_IOContextOps to
+ * protect the reset timestamp as well.
+ */
+ if (i == 0)
+ backends_stats_snap->stat_reset_timestamp =
+ backends_stats_shmem->stat_reset_timestamp;
+
+ memcpy(stats_snap->data, stats_shmem->data, sizeof(stats_shmem->data));
+ LWLockRelease(&stats_shmem->lock);
+ }
+
+}
+
/*
* IO Operation statistics are not collected for all BackendTypes.
*
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index a846d9ffb6..7a2fd1ccf9 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -205,7 +205,7 @@ pgstat_drop_relation(Relation rel)
}
/*
- * Report that the table was just vacuumed.
+ * Report that the table was just vacuumed and flush IO Operation statistics.
*/
void
pgstat_report_vacuum(Oid tableoid, bool shared,
@@ -257,10 +257,18 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
}
pgstat_unlock_entry(entry_ref);
+
+ /*
+ * Flush IO Operations statistics now. pgstat_report_stat() will flush IO
+ * Operation stats, however this will not be called after an entire
+ * autovacuum cycle is done -- which will likely vacuum many relations --
+ * or until the VACUUM command has processed all tables and committed.
+ */
+ pgstat_flush_io_ops(false);
}
/*
- * Report that the table was just analyzed.
+ * Report that the table was just analyzed and flush IO Operation statistics.
*
* Caller must provide new live- and dead-tuples estimates, as well as a
* flag indicating whether to reset the changes_since_analyze counter.
@@ -340,6 +348,9 @@ pgstat_report_analyze(Relation rel,
}
pgstat_unlock_entry(entry_ref);
+
+ /* see pgstat_report_vacuum() */
+ pgstat_flush_io_ops(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_shmem.c b/src/backend/utils/activity/pgstat_shmem.c
index 9a4f037959..275a7be166 100644
--- a/src/backend/utils/activity/pgstat_shmem.c
+++ b/src/backend/utils/activity/pgstat_shmem.c
@@ -202,6 +202,10 @@ StatsShmemInit(void)
LWLockInitialize(&ctl->checkpointer.lock, LWTRANCHE_PGSTATS_DATA);
LWLockInitialize(&ctl->slru.lock, LWTRANCHE_PGSTATS_DATA);
LWLockInitialize(&ctl->wal.lock, LWTRANCHE_PGSTATS_DATA);
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ LWLockInitialize(&ctl->io_ops.stats[i].lock,
+ LWTRANCHE_PGSTATS_DATA);
}
else
{
diff --git a/src/backend/utils/activity/pgstat_wal.c b/src/backend/utils/activity/pgstat_wal.c
index 5a878bd115..9cac407b42 100644
--- a/src/backend/utils/activity/pgstat_wal.c
+++ b/src/backend/utils/activity/pgstat_wal.c
@@ -34,7 +34,7 @@ static WalUsage prevWalUsage;
/*
* Calculate how much WAL usage counters have increased and update
- * shared statistics.
+ * shared WAL and IO Operation statistics.
*
* Must be called by processes that generate WAL, that do not call
* pgstat_report_stat(), like walwriter.
@@ -43,6 +43,8 @@ void
pgstat_report_wal(bool force)
{
pgstat_flush_wal(force);
+
+ pgstat_flush_io_ops(force);
}
/*
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index eadd8464ff..edd73e5c25 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -2071,6 +2071,8 @@ pg_stat_reset_shared(PG_FUNCTION_ARGS)
pgstat_reset_of_kind(PGSTAT_KIND_BGWRITER);
pgstat_reset_of_kind(PGSTAT_KIND_CHECKPOINTER);
}
+ else if (strcmp(target, "io") == 0)
+ pgstat_reset_of_kind(PGSTAT_KIND_IOOPS);
else if (strcmp(target, "recovery_prefetch") == 0)
XLogPrefetchResetStats();
else if (strcmp(target, "wal") == 0)
@@ -2079,7 +2081,7 @@ pg_stat_reset_shared(PG_FUNCTION_ARGS)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("unrecognized reset target: \"%s\"", target),
- errhint("Target must be \"archiver\", \"bgwriter\", \"recovery_prefetch\", or \"wal\".")));
+ errhint("Target must be \"archiver\", \"bgwriter\", \"io\", \"recovery_prefetch\", or \"wal\".")));
PG_RETURN_VOID();
}
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index e7ebea4ff4..bf97162e83 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -331,6 +331,8 @@ typedef enum BackendType
B_WAL_WRITER,
} BackendType;
+#define BACKEND_NUM_TYPES (B_WAL_WRITER + 1)
+
extern PGDLLIMPORT BackendType MyBackendType;
extern const char *GetBackendTypeDesc(BackendType backendType);
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 472e0def97..0d3d5e28a1 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -49,6 +49,7 @@ typedef enum PgStat_Kind
PGSTAT_KIND_ARCHIVER,
PGSTAT_KIND_BGWRITER,
PGSTAT_KIND_CHECKPOINTER,
+ PGSTAT_KIND_IOOPS,
PGSTAT_KIND_SLRU,
PGSTAT_KIND_WAL,
} PgStat_Kind;
@@ -243,7 +244,7 @@ typedef struct PgStat_TableXactStatus
* ------------------------------------------------------------
*/
-#define PGSTAT_FILE_FORMAT_ID 0x01A5BCA7
+#define PGSTAT_FILE_FORMAT_ID 0x01A5BCA8
typedef struct PgStat_ArchiverStats
{
@@ -319,6 +320,12 @@ typedef struct PgStat_IOContextOps
PgStat_IOOpCounters data[IOCONTEXT_NUM_TYPES];
} PgStat_IOContextOps;
+typedef struct PgStat_BackendIOContextOps
+{
+ TimestampTz stat_reset_timestamp;
+ PgStat_IOContextOps stats[BACKEND_NUM_TYPES];
+} PgStat_BackendIOContextOps;
+
typedef struct PgStat_StatDBEntry
{
PgStat_Counter n_xact_commit;
@@ -501,6 +508,7 @@ extern PgStat_CheckpointerStats *pgstat_fetch_stat_checkpointer(void);
*/
extern void pgstat_count_io_op(IOOp io_op, IOContext io_context);
+extern PgStat_BackendIOContextOps *pgstat_fetch_backend_io_context_ops(void);
extern const char *pgstat_io_context_desc(IOContext io_context);
extern const char *pgstat_io_op_desc(IOOp io_op);
@@ -512,6 +520,79 @@ extern bool pgstat_expect_io_op(BackendType bktype, IOContext io_context, IOOp i
/* IO stats translation function in freelist.c */
extern IOContext IOContextForStrategy(BufferAccessStrategy bas);
+/*
+ * Functions to assert that invalid IO Operation counters are zero.
+ */
+static inline void
+pgstat_io_context_ops_assert_zero(PgStat_IOOpCounters *counters)
+{
+ Assert(counters->acquires == 0 && counters->extends == 0 &&
+ counters->fsyncs == 0 && counters->hits == 0 &&
+ counters->reads == 0 && counters->writes == 0);
+}
+
+static inline void
+pgstat_io_op_assert_zero(PgStat_IOOpCounters *counters, IOOp io_op)
+{
+ switch (io_op)
+ {
+ case IOOP_ACQUIRE:
+ Assert(counters->acquires == 0);
+ return;
+ case IOOP_EXTEND:
+ Assert(counters->extends == 0);
+ return;
+ case IOOP_FSYNC:
+ Assert(counters->fsyncs == 0);
+ return;
+ case IOOP_HIT:
+ Assert(counters->hits == 0);
+ return;
+ case IOOP_READ:
+ Assert(counters->reads == 0);
+ return;
+ case IOOP_WRITE:
+ Assert(counters->writes == 0);
+ return;
+ }
+
+ /* Should not reach here */
+ Assert(false);
+}
+
+/*
+ * Assert that stats have not been counted for any combination of IOContext and
+ * IOOp which are not valid for the passed-in BackendType. The passed-in array
+ * of PgStat_IOOpCounters must contain stats from the BackendType specified by
+ * the second parameter. Caller is responsible for any locking if the passed-in
+ * array of PgStat_IOOpCounters is a member of PgStatShared_IOContextOps.
+ */
+static inline void
+pgstat_backend_io_stats_assert_well_formed(PgStat_IOOpCounters
+ backend_io_context_ops[IOCONTEXT_NUM_TYPES], BackendType bktype)
+{
+ bool expect_backend_stats = true;
+
+ if (!pgstat_io_op_stats_collected(bktype))
+ expect_backend_stats = false;
+
+ for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ {
+ if (!expect_backend_stats ||
+ !pgstat_bktype_io_context_valid(bktype, io_context))
+ {
+ pgstat_io_context_ops_assert_zero(&backend_io_context_ops[io_context]);
+ continue;
+ }
+
+ for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
+ {
+ if (!pgstat_io_op_valid(bktype, io_context, io_op))
+ pgstat_io_op_assert_zero(&backend_io_context_ops[io_context],
+ io_op);
+ }
+ }
+}
/*
diff --git a/src/include/utils/pgstat_internal.h b/src/include/utils/pgstat_internal.h
index 40a3602855..95f19cae99 100644
--- a/src/include/utils/pgstat_internal.h
+++ b/src/include/utils/pgstat_internal.h
@@ -329,6 +329,26 @@ typedef struct PgStatShared_Checkpointer
PgStat_CheckpointerStats reset_offset;
} PgStatShared_Checkpointer;
+typedef struct PgStatShared_IOContextOps
+{
+ /*
+ * lock protects ->data.
+ * If this PgStatShared_IOContextOps is
+ * PgStatShared_BackendIOContextOps->stats[0], lock also protects
+ * PgStatShared_BackendIOContextOps->stat_reset_timestamp.
+ */
+ LWLock lock;
+ PgStat_IOOpCounters data[IOCONTEXT_NUM_TYPES];
+} PgStatShared_IOContextOps;
+
+typedef struct PgStatShared_BackendIOContextOps
+{
+ /* ->stat_reset_timestamp is protected by ->stats[0].lock */
+ TimestampTz stat_reset_timestamp;
+ PgStatShared_IOContextOps stats[BACKEND_NUM_TYPES];
+} PgStatShared_BackendIOContextOps;
+
+
typedef struct PgStatShared_SLRU
{
/* lock protects ->stats */
@@ -419,6 +439,7 @@ typedef struct PgStat_ShmemControl
PgStatShared_Archiver archiver;
PgStatShared_BgWriter bgwriter;
PgStatShared_Checkpointer checkpointer;
+ PgStatShared_BackendIOContextOps io_ops;
PgStatShared_SLRU slru;
PgStatShared_Wal wal;
} PgStat_ShmemControl;
@@ -442,6 +463,8 @@ typedef struct PgStat_Snapshot
PgStat_CheckpointerStats checkpointer;
+ PgStat_BackendIOContextOps io_ops;
+
PgStat_SLRUStats slru[SLRU_NUM_ELEMENTS];
PgStat_WalStats wal;
@@ -549,6 +572,15 @@ extern void pgstat_database_reset_timestamp_cb(PgStatShared_Common *header, Time
extern bool pgstat_function_flush_cb(PgStat_EntryRef *entry_ref, bool nowait);
+/*
+ * Functions in pgstat_io_ops.c
+ */
+
+extern void pgstat_io_ops_reset_all_cb(TimestampTz ts);
+extern void pgstat_io_ops_snapshot_cb(void);
+extern bool pgstat_flush_io_ops(bool nowait);
+
+
/*
* Functions in pgstat_relation.c
*/
@@ -641,6 +673,11 @@ extern void pgstat_create_transactional(PgStat_Kind kind, Oid dboid, Oid objoid)
extern PGDLLIMPORT PgStat_LocalState pgStatLocal;
+/*
+ * Variables in pgstat_io_ops.c
+ */
+
+extern PGDLLIMPORT bool have_ioopstats;
/*
* Variables in pgstat_slru.c
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 67218ec6f2..33c9362257 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2005,12 +2005,14 @@ PgFdwRelationInfo
PgFdwScanState
PgIfAddrCallback
PgStatShared_Archiver
+PgStatShared_BackendIOContextOps
PgStatShared_BgWriter
PgStatShared_Checkpointer
PgStatShared_Common
PgStatShared_Database
PgStatShared_Function
PgStatShared_HashEntry
+PgStatShared_IOContextOps
PgStatShared_Relation
PgStatShared_ReplSlot
PgStatShared_SLRU
@@ -2018,6 +2020,7 @@ PgStatShared_Subscription
PgStatShared_Wal
PgStat_ArchiverStats
PgStat_BackendFunctionEntry
+PgStat_BackendIOContextOps
PgStat_BackendSubEntry
PgStat_BgWriterStats
PgStat_CheckpointerStats
--
2.34.1
v31-0003-Add-system-view-tracking-IO-ops-per-backend-type.patch (text/x-patch)
From 275222925d4d207749c81982467dfc742a5c1d3e Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Thu, 6 Oct 2022 12:24:42 -0400
Subject: [PATCH v31 3/3] Add system view tracking IO ops per backend type
Add pg_stat_io, a system view which tracks the number of IOOps
(acquires, extends, fsyncs, hits, reads, and writes) done through each
IOContext (shared buffers, local buffers, and buffers reserved by a
BufferAccessStrategy) by each type of backend (e.g. client backend,
checkpointer).
Some BackendTypes do not accumulate IO operations statistics and will
not be included in the view.
Some IOContexts are not used by some BackendTypes and will not be in the
view. For example, checkpointer does not use a BufferAccessStrategy
(currently), so there will be no row for the relevant
BufferAccessStrategy IOContext for checkpointer.
Some IOOps are invalid in combination with certain IOContexts. Those
cells will be NULL in the view to distinguish between 0 observed IOOps
of that type and an invalid combination. For example, local buffers are
not fsync'd so cells for all BackendTypes for IOCONTEXT_LOCAL and
IOOP_FSYNC will be NULL.
Some BackendTypes never perform certain IOOps. Those cells will also be
NULL in the view. For example, bgwriter should not perform reads.
View stats are populated with statistics incremented when a backend
performs an IO Operation and maintained by the cumulative statistics
subsystem.
Each row of the view shows stats for a particular BackendType and
IOContext combination (e.g. shared buffer accesses by checkpointer) and
each column in the view is the total number of IO Operations done (e.g.
writes).
So a cell in the view would be, for example, the number of shared
buffers written by checkpointer since the last stats reset.
In anticipation of tracking WAL IO and non-block-oriented IO (such as
temporary file IO), the "unit" column specifies the unit of the
"acquired", "read", "written", and "extended" columns for a given row.
Note that some of the cells in the view are redundant with fields in
pg_stat_bgwriter (e.g. buffers_backend), however these have been kept in
pg_stat_bgwriter for backwards compatibility. Deriving the redundant
pg_stat_bgwriter stats from the IO operations stats structures was also
problematic due to the separate reset targets for 'bgwriter' and 'io'.
Suggested by Andres Freund
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Maciek Sakrejda <m.sakrejda@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
doc/src/sgml/monitoring.sgml | 200 ++++++++++++++++++++-
src/backend/catalog/system_views.sql | 14 ++
src/backend/utils/adt/pgstatfuncs.c | 123 +++++++++++++
src/include/catalog/pg_proc.dat | 9 +
src/test/regress/expected/rules.out | 11 ++
src/test/regress/expected/stats.out | 256 +++++++++++++++++++++++++++
src/test/regress/sql/stats.sql | 134 ++++++++++++++
7 files changed, 745 insertions(+), 2 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 14dfd650f8..9e17d1e1ec 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -448,6 +448,15 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</entry>
</row>
+ <row>
+ <entry><structname>pg_stat_io</structname><indexterm><primary>pg_stat_io</primary></indexterm></entry>
+ <entry>A row for each IO Context for each backend type showing
+ statistics about backend IO operations. See
+ <link linkend="monitoring-pg-stat-io-view">
+ <structname>pg_stat_io</structname></link> for details.
+ </entry>
+ </row>
+
<row>
<entry><structname>pg_stat_wal</structname><indexterm><primary>pg_stat_wal</primary></indexterm></entry>
<entry>One row only, showing statistics about WAL activity. See
@@ -3600,13 +3609,12 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
<structfield>stats_reset</structfield> <type>timestamp with time zone</type>
</para>
<para>
- Time at which these statistics were last reset
+ Time at which these statistics were last reset.
</para></entry>
</row>
</tbody>
</tgroup>
</table>
-
<para>
Normally, WAL files are archived in order, oldest to newest, but that is
not guaranteed, and does not hold under special circumstances like when
@@ -3615,7 +3623,195 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
<structfield>last_archived_wal</structfield> have also been successfully
archived.
</para>
+ </sect2>
+
+ <sect2 id="monitoring-pg-stat-io-view">
+ <title><structname>pg_stat_io</structname></title>
+
+ <indexterm>
+ <primary>pg_stat_io</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_stat_io</structname> view has a row for each backend type
+ and IO context, showing cluster-wide totals for IO operations done
+ by that backend type in that IO context. Currently, only a subset of IO
+ operations are tracked here. WAL IO, IO on temporary files, and some forms
+ of IO outside of shared buffers (such as when building indexes or moving a
+ table from one tablespace to another) could be added in the future.
+ </para>
+
+ <table id="pg-stat-io-view" xreflabel="pg_stat_io">
+ <title><structname>pg_stat_io</structname> View</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>backend_type</structfield> <type>text</type>
+ </para>
+ <para>
+ Type of backend (e.g. background worker, autovacuum worker).
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>io_context</structfield> <type>text</type>
+ </para>
+ <para>
+ IO Context used. This refers to the context or location of an IO
+ operation.
+ <literal>shared</literal> refers to shared buffers, the primary
+ buffer pool for relation data.
+ <literal>local</literal> refers to
+ process-local memory used for temporary tables.
+ <literal>vacuum</literal> refers to memory reserved for use during
+ vacuuming and analyzing.
+ <literal>bulkread</literal>
+ refers to memory reserved for use during bulk read operations.
+ <literal>bulkwrite</literal>
+ refers to memory reserved for use during bulk write operations.
+ The autovacuum daemon, explicit <command>VACUUM</command>, explicit
+ <command>ANALYZE</command>, many bulk reads, and many bulk writes use a
+ fixed amount of memory, acquiring the equivalent number of shared
+ buffers and reusing them circularly to avoid occupying an undue portion
+ of the main shared buffer pool.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>acquired</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Acquisitions of memory by this <varname>backend_type</varname> for
+ performing IO operations in this <varname>io_context</varname>. For
+ block-oriented IO, <varname>acquired</varname> is the number of buffers
+ acquired or reused as part of a buffer access strategy.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>hit</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Relevant only for block-based IO of data accessed in the course of
+ satisfying queries, <varname>hit</varname> is the number of
+ accesses of blocks already located in a
+ <productname>PostgreSQL</productname> buffer in this specified
+ <varname>io_context</varname> by this <varname>backend_type</varname>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>read</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Reads by this <varname>backend_type</varname> into
+ memory or buffers in this <varname>io_context</varname>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>written</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Writes of data in this <varname>io_context</varname> written out by this
+ <varname>backend_type</varname>.
+ Note that the values of <varname>written</varname> for
+ <varname>backend_type</varname> <literal>background writer</literal> and
+ <varname>backend_type</varname> <literal>checkpointer</literal> are
+ equivalent to the values of <varname>buffers_clean</varname> and
+ <varname>buffers_checkpoint</varname>, respectively, in <link
+ linkend="monitoring-pg-stat-bgwriter-view">
+ <structname>pg_stat_bgwriter</structname></link>.
+ Also, the sum of <varname>written</varname> and
+ <varname>extended</varname> in this view for
+ <varname>backend_type</varname>s <literal>client backend</literal>,
+ <literal>autovacuum worker</literal>, <literal>background
+ worker</literal>, and <literal>walsender</literal> in
+ <varname>io_context</varname>s <literal>shared</literal>,
+ <literal>bulkread</literal>, <literal>bulkwrite</literal>, and
+ <literal>vacuum</literal> is equivalent to
+ <varname>buffers_backend</varname> in
+ <structname>pg_stat_bgwriter</structname>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>extended</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Extends of relations done by this <varname>backend_type</varname> in
+ order to write data in this <varname>io_context</varname>.
+ </para></entry>
+ </row>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>unit</structfield> <type>text</type>
+ </para>
+ <para>
+ The unit in which the acquired, read, written, and extended columns can
+ be interpreted. Currently <varname>block_size</varname> is the only
+ possible value. Reads, writes, and extends of relation data are done in
+ <varname>block_size</varname> units. Future values could include
+ <varname>wal_block_size</varname>, once WAL IO is tracked in this view,
+ and <quote>bytes</quote>, once non-block-oriented IO such as temp files
+ is tracked here.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>fsynced</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of fsyncs performed by this <varname>backend_type</varname> for
+ the purpose of persisting data present in this
+ <varname>io_context</varname>. <literal>fsyncs</literal> are done at
+ segment boundaries, so <varname>unit</varname> does not apply to the
+ <varname>fsynced</varname> column. <literal>fsyncs</literal> done by
+ backends in order to persist data written in
+ <varname>io_context</varname> <literal>vacuum</literal>,
+ <varname>io_context</varname> <literal>bulkread</literal>, or
+ <varname>io_context</varname> <literal>bulkwrite</literal> are counted
+ as an <varname>io_context</varname> <literal>shared</literal>
+ <literal>fsync</literal>.
+ Note that the sum of <varname>fsynced</varname> in
+ <varname>io_context</varname> <literal>shared</literal> across all
+ <varname>backend_type</varname>s except <literal>checkpointer</literal>
+ is equivalent to <varname>buffers_backend_fsync</varname> in <link
+ linkend="monitoring-pg-stat-bgwriter-view">
+ <structname>pg_stat_bgwriter</structname></link>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
+ </para>
+ <para>
+ Time at which these statistics were last reset.
+ </para></entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
</sect2>
<sect2 id="monitoring-pg-stat-bgwriter-view">
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 55f7ec79e0..6d6c5c6260 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1114,6 +1114,20 @@ CREATE VIEW pg_stat_bgwriter AS
pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+CREATE VIEW pg_stat_io AS
+SELECT
+ b.backend_type,
+ b.io_context,
+ b.acquired,
+ b.hit,
+ b.read,
+ b.written,
+ b.extended,
+ b.unit,
+ b.fsynced,
+ b.stats_reset
+FROM pg_stat_get_io() b;
+
CREATE VIEW pg_stat_wal AS
SELECT
w.wal_records,
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index edd73e5c25..6044cc35d1 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1712,6 +1712,129 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
}
+/*
+ * When adding a new column to the pg_stat_io view, add a new enum value
+ * here above IO_NUM_COLUMNS.
+ */
+typedef enum io_stat_col
+{
+ IO_COL_BACKEND_TYPE,
+ IO_COL_IO_CONTEXT,
+ IO_COL_ALLOCS,
+ IO_COL_HITS,
+ IO_COL_READS,
+ IO_COL_WRITES,
+ IO_COL_EXTENDS,
+ IO_COL_UNIT,
+ IO_COL_FSYNCS,
+ IO_COL_RESET_TIME,
+ IO_NUM_COLUMNS,
+} io_stat_col;
+
+/*
+ * When adding a new IOOp, add a new io_stat_col and add a case to this
+ * function returning the corresponding io_stat_col.
+ */
+static io_stat_col
+pgstat_io_op_get_index(IOOp io_op)
+{
+ switch (io_op)
+ {
+ case IOOP_ACQUIRE:
+ return IO_COL_ALLOCS;
+ case IOOP_HIT:
+ return IO_COL_HITS;
+ case IOOP_READ:
+ return IO_COL_READS;
+ case IOOP_WRITE:
+ return IO_COL_WRITES;
+ case IOOP_EXTEND:
+ return IO_COL_EXTENDS;
+ case IOOP_FSYNC:
+ return IO_COL_FSYNCS;
+ }
+
+ elog(ERROR, "unrecognized IOOp value: %d", (int) io_op);
+}
+
+Datum
+pg_stat_get_io(PG_FUNCTION_ARGS)
+{
+ PgStat_BackendIOContextOps *backends_io_stats;
+ ReturnSetInfo *rsinfo;
+ Datum reset_time;
+
+ SetSingleFuncCall(fcinfo, 0);
+ rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+
+ backends_io_stats = pgstat_fetch_backend_io_context_ops();
+
+ reset_time = TimestampTzGetDatum(backends_io_stats->stat_reset_timestamp);
+
+ for (int bktype = 0; bktype < BACKEND_NUM_TYPES; bktype++)
+ {
+ Datum bktype_desc = CStringGetTextDatum(GetBackendTypeDesc(bktype));
+ bool expect_backend_stats;
+ PgStat_IOContextOps *io_context_ops = &backends_io_stats->stats[bktype];
+
+ /*
+ * For those BackendTypes without IO Operation stats, skip
+ * representing them in the view altogether.
+ */
+ expect_backend_stats = pgstat_io_op_stats_collected(bktype);
+
+ for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ {
+ PgStat_IOOpCounters *counters = &io_context_ops->data[io_context];
+ const char *io_context_str = pgstat_io_context_desc(io_context);
+
+ Datum values[IO_NUM_COLUMNS] = {0};
+ bool nulls[IO_NUM_COLUMNS] = {0};
+
+ /*
+ * Some combinations of IOContext and BackendType are not valid
+ * for any type of IOOp. In such cases, omit the entire row from
+ * the view.
+ */
+ if (!expect_backend_stats ||
+ !pgstat_bktype_io_context_valid(bktype, io_context))
+ {
+ pgstat_io_context_ops_assert_zero(counters);
+ continue;
+ }
+
+ values[IO_COL_BACKEND_TYPE] = bktype_desc;
+ values[IO_COL_IO_CONTEXT] = CStringGetTextDatum(io_context_str);
+ values[IO_COL_ALLOCS] = Int64GetDatum(counters->acquires);
+ values[IO_COL_HITS] = Int64GetDatum(counters->hits);
+ values[IO_COL_READS] = Int64GetDatum(counters->reads);
+ values[IO_COL_WRITES] = Int64GetDatum(counters->writes);
+ values[IO_COL_EXTENDS] = Int64GetDatum(counters->extends);
+ values[IO_COL_FSYNCS] = Int64GetDatum(counters->fsyncs);
+ values[IO_COL_UNIT] = CStringGetTextDatum("block_size");
+ values[IO_COL_RESET_TIME] = TimestampTzGetDatum(reset_time);
+
+ /*
+ * Some combinations of BackendType and IOOp and of IOContext and
+ * IOOp are not valid. Set these cells in the view NULL and assert
+ * that these stats are zero as expected.
+ */
+ for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
+ {
+ if (!(pgstat_io_op_valid(bktype, io_context, io_op)))
+ {
+ pgstat_io_op_assert_zero(counters, io_op);
+ nulls[pgstat_io_op_get_index(io_op)] = true;
+ }
+ }
+
+ tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
+ }
+ }
+
+ return (Datum) 0;
+}
+
/*
* Returns statistics of WAL activity
*/
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 68bb032d3e..f7341c51ca 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5649,6 +5649,15 @@
proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
+{ oid => '8459', descr => 'statistics: per backend type IO statistics',
+ proname => 'pg_stat_get_io', provolatile => 'v',
+ prorows => '14', proretset => 't',
+ proparallel => 'r', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,text,int8,int8,int8,int8,int8,text,int8,timestamptz}',
+ proargmodes => '{o,o,o,o,o,o,o,o,o,o}',
+ proargnames => '{backend_type,io_context,acquired,hit,read,written,extended,unit,fsynced,stats_reset}',
+ prosrc => 'pg_stat_get_io' },
+
{ oid => '1136', descr => 'statistics: information about WAL activity',
proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
proparallel => 'r', prorettype => 'record', proargtypes => '',
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 9dd137415e..8c00214958 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1868,6 +1868,17 @@ pg_stat_gssapi| SELECT s.pid,
s.gss_enc AS encrypted
FROM pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid, query_id)
WHERE (s.client_port IS NOT NULL);
+pg_stat_io| SELECT b.backend_type,
+ b.io_context,
+ b.acquired,
+ b.hit,
+ b.read,
+ b.written,
+ b.extended,
+ b.unit,
+ b.fsynced,
+ b.stats_reset
+ FROM pg_stat_get_io() b(backend_type, io_context, acquired, hit, read, written, extended, unit, fsynced, stats_reset);
pg_stat_progress_analyze| SELECT s.pid,
s.datid,
d.datname,
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index f701da2069..779fb4d398 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -918,4 +918,260 @@ SELECT pg_stat_get_subscription_stats(NULL);
(1 row)
+-- Test that the following operations are tracked in pg_stat_io:
+-- - acquisitions of shared buffers for IO operations
+-- - reads of target blocks into shared buffers
+-- - shared buffer cache hits when target blocks reside in shared buffers
+-- - writes of shared buffers
+-- - extends of relations using shared buffers
+-- - fsyncs done to ensure the durability of data dirtying shared buffers
+SELECT sum(acquired) AS io_sum_shared_acquisitions_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(hit) AS io_sum_shared_hits_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(read) AS io_sum_shared_reads_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(written) AS io_sum_shared_writes_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(extended) AS io_sum_shared_extends_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(fsynced) AS io_sum_shared_fsyncs_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+-- Create a regular table and insert some data to generate IOCONTEXT_SHARED
+-- acquisitions and extends.
+CREATE TABLE test_io_shared(a int);
+INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+-- After a checkpoint, there should be some additional IOCONTEXT_SHARED writes
+-- and fsyncs.
+-- The second checkpoint ensures that stats from the first checkpoint have been
+-- reported and protects against any potential races amongst the table
+-- creation, a possible timing-triggered checkpoint, and the explicit
+-- checkpoint in the test.
+CHECKPOINT;
+CHECKPOINT;
+SELECT sum(acquired) AS io_sum_shared_acquisitions_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(written) AS io_sum_shared_writes_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(extended) AS io_sum_shared_extends_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(fsynced) AS io_sum_shared_fsyncs_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT :io_sum_shared_acquisitions_after > :io_sum_shared_acquisitions_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_shared_writes_after > :io_sum_shared_writes_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_shared_extends_after > :io_sum_shared_extends_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT current_setting('fsync') = 'off' OR :io_sum_shared_fsyncs_after > :io_sum_shared_fsyncs_before;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Change the tablespace so that the table is rewritten directly, then SELECT
+-- from it to cause it to be read back into shared buffers.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE test_io_shared_stats_tblspc LOCATION '';
+ALTER TABLE test_io_shared SET TABLESPACE test_io_shared_stats_tblspc;
+SELECT COUNT(*) FROM test_io_shared;
+ count
+-------
+ 100
+(1 row)
+
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(read) AS io_sum_shared_reads_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT :io_sum_shared_reads_after > :io_sum_shared_reads_before;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Select from the table again once it is in shared buffers. There should be
+-- some hits recorded in pg_stat_io.
+SELECT sum(hit) AS io_sum_shared_hits_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT :io_sum_shared_hits_after > :io_sum_shared_hits_before;
+ ?column?
+----------
+ t
+(1 row)
+
+DROP TABLE test_io_shared;
+DROP TABLESPACE test_io_shared_stats_tblspc;
+-- Test that acquisitions of local buffers, reads of temporary table blocks
+-- into local buffers, temporary table block cache hits in local buffers,
+-- writes of local buffers, and extends of temporary tables are tracked in
+-- pg_stat_io.
+-- Set temp_buffers to a low value so that we can trigger writes with fewer
+-- inserted tuples. Do so in a new session in case temporary tables have been
+-- accessed by previous tests in this session.
+\c
+SET temp_buffers TO '1MB';
+CREATE TEMPORARY TABLE test_io_local(a int, b TEXT);
+SELECT sum(acquired) AS io_sum_local_acquisitions_before FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(hit) AS io_sum_local_hits_before FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(read) AS io_sum_local_reads_before FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(written) AS io_sum_local_writes_before FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(extended) AS io_sum_local_extends_before FROM pg_stat_io WHERE io_context = 'local' \gset
+-- Insert enough values that we need to reuse and write out dirty local
+-- buffers.
+INSERT INTO test_io_local SELECT generate_series(1, 8000) as id, repeat('a', 100);
+-- Read in evicted buffers.
+SELECT COUNT(*) FROM test_io_local;
+ count
+-------
+ 8000
+(1 row)
+
+-- Query tuples in local buffers to ensure new local buffer cache hits.
+SELECT COUNT(*) FROM test_io_local;
+ count
+-------
+ 8000
+(1 row)
+
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(acquired) AS io_sum_local_acquisitions_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(hit) AS io_sum_local_hits_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(read) AS io_sum_local_reads_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(written) AS io_sum_local_writes_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(extended) AS io_sum_local_extends_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT :io_sum_local_acquisitions_after > :io_sum_local_acquisitions_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_local_reads_after > :io_sum_local_reads_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_local_hits_after > :io_sum_local_hits_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_local_writes_after > :io_sum_local_writes_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_local_extends_after > :io_sum_local_extends_before;
+ ?column?
+----------
+ t
+(1 row)
+
+RESET temp_buffers;
+-- Test that reuse of strategy buffers and reads of blocks into these reused
+-- buffers while VACUUMing are tracked in pg_stat_io.
+-- Set wal_skip_threshold smaller than the expected size of
+-- test_io_vac_strategy so that, even if wal_level is minimal, VACUUM FULL will
+-- fsync the newly rewritten test_io_vac_strategy instead of writing it to WAL.
+-- Writing it to WAL will result in the newly written relation pages being in
+-- shared buffers -- preventing us from testing BAS_VACUUM BufferAccessStrategy
+-- acquisitions and reads.
+SET wal_skip_threshold = '1 kB';
+SELECT sum(acquired) AS io_sum_vac_strategy_acquisitions_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(read) AS io_sum_vac_strategy_reads_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+CREATE TABLE test_io_vac_strategy(a int, b int) WITH (autovacuum_enabled = 'false');
+INSERT INTO test_io_vac_strategy SELECT i, i from generate_series(1, 8000)i;
+-- Ensure that the next VACUUM will need to perform IO by rewriting the table
+-- first with VACUUM (FULL).
+VACUUM (FULL) test_io_vac_strategy;
+VACUUM (PARALLEL 0) test_io_vac_strategy;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(acquired) AS io_sum_vac_strategy_acquisitions_after FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(read) AS io_sum_vac_strategy_reads_after FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT :io_sum_vac_strategy_acquisitions_after > :io_sum_vac_strategy_acquisitions_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_vac_strategy_reads_after > :io_sum_vac_strategy_reads_before;
+ ?column?
+----------
+ t
+(1 row)
+
+RESET wal_skip_threshold;
+-- Test that extends done by a CTAS, which uses a BAS_BULKWRITE
+-- BufferAccessStrategy, are tracked in pg_stat_io.
+SELECT sum(extended) AS io_sum_bulkwrite_strategy_extends_before FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+CREATE TABLE test_io_bulkwrite_strategy AS SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(extended) AS io_sum_bulkwrite_strategy_extends_after FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+SELECT :io_sum_bulkwrite_strategy_extends_after > :io_sum_bulkwrite_strategy_extends_before;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Test that reads of blocks into reused strategy buffers during database
+-- creation, which uses a BAS_BULKREAD BufferAccessStrategy, are tracked in
+-- pg_stat_io.
+SELECT sum(read) AS io_sum_bulkread_strategy_reads_before FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+CREATE DATABASE test_io_bulkread_strategy_db;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(read) AS io_sum_bulkread_strategy_reads_after FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+SELECT :io_sum_bulkread_strategy_reads_after > :io_sum_bulkread_strategy_reads_before;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Test IO stats reset
+SELECT sum(acquired) + sum(extended) + sum(fsynced) + sum(read) + sum(written) AS io_stats_pre_reset FROM pg_stat_io \gset
+SELECT pg_stat_reset_shared('io');
+ pg_stat_reset_shared
+----------------------
+
+(1 row)
+
+SELECT sum(acquired) + sum(extended) + sum(fsynced) + sum(read) + sum(written) AS io_stats_post_reset FROM pg_stat_io \gset
+SELECT :io_stats_post_reset < :io_stats_pre_reset;
+ ?column?
+----------
+ t
+(1 row)
+
-- End of Stats Test
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index eb081f65a4..b4f4359035 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -449,4 +449,138 @@ SELECT pg_stat_get_replication_slot(NULL);
SELECT pg_stat_get_subscription_stats(NULL);
+-- Test that the following operations are tracked in pg_stat_io:
+-- - acquisitions of shared buffers for IO operations
+-- - reads of target blocks into shared buffers
+-- - shared buffer cache hits when target blocks reside in shared buffers
+-- - writes of shared buffers
+-- - extends of relations using shared buffers
+-- - fsyncs done to ensure the durability of data dirtying shared buffers
+SELECT sum(acquired) AS io_sum_shared_acquisitions_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(hit) AS io_sum_shared_hits_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(read) AS io_sum_shared_reads_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(written) AS io_sum_shared_writes_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(extended) AS io_sum_shared_extends_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(fsynced) AS io_sum_shared_fsyncs_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+-- Create a regular table and insert some data to generate IOCONTEXT_SHARED
+-- acquisitions and extends.
+CREATE TABLE test_io_shared(a int);
+INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+-- After a checkpoint, there should be some additional IOCONTEXT_SHARED writes
+-- and fsyncs.
+-- The second checkpoint ensures that stats from the first checkpoint have been
+-- reported and protects against any potential races amongst the table
+-- creation, a possible timing-triggered checkpoint, and the explicit
+-- checkpoint in the test.
+CHECKPOINT;
+CHECKPOINT;
+SELECT sum(acquired) AS io_sum_shared_acquisitions_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(written) AS io_sum_shared_writes_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(extended) AS io_sum_shared_extends_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(fsynced) AS io_sum_shared_fsyncs_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT :io_sum_shared_acquisitions_after > :io_sum_shared_acquisitions_before;
+SELECT :io_sum_shared_writes_after > :io_sum_shared_writes_before;
+SELECT :io_sum_shared_extends_after > :io_sum_shared_extends_before;
+SELECT current_setting('fsync') = 'off' OR :io_sum_shared_fsyncs_after > :io_sum_shared_fsyncs_before;
+-- Change the tablespace so that the table is rewritten directly, then SELECT
+-- from it to cause it to be read back into shared buffers.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE test_io_shared_stats_tblspc LOCATION '';
+ALTER TABLE test_io_shared SET TABLESPACE test_io_shared_stats_tblspc;
+SELECT COUNT(*) FROM test_io_shared;
+SELECT pg_stat_force_next_flush();
+SELECT sum(read) AS io_sum_shared_reads_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT :io_sum_shared_reads_after > :io_sum_shared_reads_before;
+-- Select from the table again once it is in shared buffers. There should be
+-- some hits recorded in pg_stat_io.
+SELECT sum(hit) AS io_sum_shared_hits_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT :io_sum_shared_hits_after > :io_sum_shared_hits_before;
+DROP TABLE test_io_shared;
+DROP TABLESPACE test_io_shared_stats_tblspc;
+
+-- Test that acquisitions of local buffers, reads of temporary table blocks
+-- into local buffers, temporary table block cache hits in local buffers,
+-- writes of local buffers, and extends of temporary tables are tracked in
+-- pg_stat_io.
+
+-- Set temp_buffers to a low value so that we can trigger writes with fewer
+-- inserted tuples. Do so in a new session in case temporary tables have been
+-- accessed by previous tests in this session.
+\c
+SET temp_buffers TO '1MB';
+CREATE TEMPORARY TABLE test_io_local(a int, b TEXT);
+SELECT sum(acquired) AS io_sum_local_acquisitions_before FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(hit) AS io_sum_local_hits_before FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(read) AS io_sum_local_reads_before FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(written) AS io_sum_local_writes_before FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(extended) AS io_sum_local_extends_before FROM pg_stat_io WHERE io_context = 'local' \gset
+-- Insert enough values that we need to reuse and write out dirty local
+-- buffers.
+INSERT INTO test_io_local SELECT generate_series(1, 8000) as id, repeat('a', 100);
+-- Read in evicted buffers.
+SELECT COUNT(*) FROM test_io_local;
+-- Query tuples in local buffers to ensure new local buffer cache hits.
+SELECT COUNT(*) FROM test_io_local;
+SELECT pg_stat_force_next_flush();
+SELECT sum(acquired) AS io_sum_local_acquisitions_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(hit) AS io_sum_local_hits_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(read) AS io_sum_local_reads_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(written) AS io_sum_local_writes_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(extended) AS io_sum_local_extends_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT :io_sum_local_acquisitions_after > :io_sum_local_acquisitions_before;
+SELECT :io_sum_local_reads_after > :io_sum_local_reads_before;
+SELECT :io_sum_local_hits_after > :io_sum_local_hits_before;
+SELECT :io_sum_local_writes_after > :io_sum_local_writes_before;
+SELECT :io_sum_local_extends_after > :io_sum_local_extends_before;
+RESET temp_buffers;
+
+-- Test that reuse of strategy buffers and reads of blocks into these reused
+-- buffers while VACUUMing are tracked in pg_stat_io.
+
+-- Set wal_skip_threshold smaller than the expected size of
+-- test_io_vac_strategy so that, even if wal_level is minimal, VACUUM FULL will
+-- fsync the newly rewritten test_io_vac_strategy instead of writing it to WAL.
+-- Writing it to WAL will result in the newly written relation pages being in
+-- shared buffers -- preventing us from testing BAS_VACUUM BufferAccessStrategy
+-- acquisitions and reads.
+SET wal_skip_threshold = '1 kB';
+SELECT sum(acquired) AS io_sum_vac_strategy_acquisitions_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(read) AS io_sum_vac_strategy_reads_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+CREATE TABLE test_io_vac_strategy(a int, b int) WITH (autovacuum_enabled = 'false');
+INSERT INTO test_io_vac_strategy SELECT i, i from generate_series(1, 8000)i;
+-- Ensure that the next VACUUM will need to perform IO by rewriting the table
+-- first with VACUUM (FULL).
+VACUUM (FULL) test_io_vac_strategy;
+VACUUM (PARALLEL 0) test_io_vac_strategy;
+SELECT pg_stat_force_next_flush();
+SELECT sum(acquired) AS io_sum_vac_strategy_acquisitions_after FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(read) AS io_sum_vac_strategy_reads_after FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT :io_sum_vac_strategy_acquisitions_after > :io_sum_vac_strategy_acquisitions_before;
+SELECT :io_sum_vac_strategy_reads_after > :io_sum_vac_strategy_reads_before;
+RESET wal_skip_threshold;
+
+-- Test that extends done by a CTAS, which uses a BAS_BULKWRITE
+-- BufferAccessStrategy, are tracked in pg_stat_io.
+SELECT sum(extended) AS io_sum_bulkwrite_strategy_extends_before FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+CREATE TABLE test_io_bulkwrite_strategy AS SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+SELECT sum(extended) AS io_sum_bulkwrite_strategy_extends_after FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+SELECT :io_sum_bulkwrite_strategy_extends_after > :io_sum_bulkwrite_strategy_extends_before;
+
+-- Test that reads of blocks into reused strategy buffers during database
+-- creation, which uses a BAS_BULKREAD BufferAccessStrategy, are tracked in
+-- pg_stat_io.
+SELECT sum(read) AS io_sum_bulkread_strategy_reads_before FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+CREATE DATABASE test_io_bulkread_strategy_db;
+SELECT pg_stat_force_next_flush();
+SELECT sum(read) AS io_sum_bulkread_strategy_reads_after FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+SELECT :io_sum_bulkread_strategy_reads_after > :io_sum_bulkread_strategy_reads_before;
+
+-- Test IO stats reset
+SELECT sum(acquired) + sum(extended) + sum(fsynced) + sum(read) + sum(written) AS io_stats_pre_reset FROM pg_stat_io \gset
+SELECT pg_stat_reset_shared('io');
+SELECT sum(acquired) + sum(extended) + sum(fsynced) + sum(read) + sum(written) AS io_stats_post_reset FROM pg_stat_io \gset
+SELECT :io_stats_post_reset < :io_stats_pre_reset;
+
-- End of Stats Test
--
2.34.1
[Attachment: v31-0001-Track-IO-operation-statistics-locally.patch (text/x-patch)]
From 6891009af39bd9fa66824a23fc34e37e5505d862 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Thu, 6 Oct 2022 12:23:25 -0400
Subject: [PATCH v31 1/3] Track IO operation statistics locally
Introduce "IOOp", an IO operation done by a backend, and "IOContext",
the source, target, or type of that IO. For example, the checkpointer
may write out a shared buffer. This would be counted as an IOOp "write"
in IOContext IOCONTEXT_SHARED by BackendType "checkpointer".
Each IOOp (acquire, hit, read, write, extend, fsync) is counted per
IOContext (bulkread, bulkwrite, local, shared, or vacuum) through a call
to pgstat_count_io_op().
The primary concern of these statistics is IO operations on data blocks
during the course of normal database operations. IO operations done by,
for example, the archiver or syslogger are not counted in these
statistics. WAL IO, temporary file IO, and IO done directly through smgr*
functions (such as when building an index) are not yet counted but would
be useful future additions.
IOCONTEXT_LOCAL and IOCONTEXT_SHARED IOContexts concern operations on
local and shared buffers.
The IOCONTEXT_BULKREAD, IOCONTEXT_BULKWRITE, and IOCONTEXT_VACUUM
IOContexts concern IO operations on buffers as part of a
BufferAccessStrategy.
IOOP_ACQUIRE IOOps are counted in IOCONTEXT_SHARED and IOCONTEXT_LOCAL
IOContexts whenever a buffer is acquired through [Local]BufferAlloc().
IOOP_ACQUIRE IOOps are counted in the BufferAccessStrategy IOContexts
whenever a buffer already in the strategy ring is reused. IOOP_WRITE
IOOps are counted in the BufferAccessStrategy IOContexts whenever the
reused dirty buffer is written out.
Stats on IOOps in all IOContexts for a given backend are counted in a
backend's local memory. A subsequent commit will expose functions for
aggregating and viewing these stats.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Maciek Sakrejda <m.sakrejda@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
src/backend/postmaster/checkpointer.c | 13 ++
src/backend/storage/buffer/bufmgr.c | 59 +++++-
src/backend/storage/buffer/freelist.c | 52 ++++-
src/backend/storage/buffer/localbuf.c | 5 +
src/backend/storage/sync/sync.c | 2 +
src/backend/utils/activity/Makefile | 1 +
src/backend/utils/activity/meson.build | 1 +
src/backend/utils/activity/pgstat_io_ops.c | 234 +++++++++++++++++++++
src/include/pgstat.h | 61 ++++++
src/include/storage/buf_internals.h | 2 +-
src/include/storage/bufmgr.h | 7 +-
src/tools/pgindent/typedefs.list | 4 +
12 files changed, 428 insertions(+), 13 deletions(-)
create mode 100644 src/backend/utils/activity/pgstat_io_ops.c
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 5fc076fc14..4ea4e6a298 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -1116,6 +1116,19 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
if (!AmBackgroundWriterProcess())
CheckpointerShmem->num_backend_fsync++;
LWLockRelease(CheckpointerCommLock);
+
+ /*
+ * We have no way of knowing if the current IOContext is
+ * IOCONTEXT_SHARED or IOCONTEXT_[BULKREAD, BULKWRITE, VACUUM] at this
+ * point, so count the fsync as being in the IOCONTEXT_SHARED
+ * IOContext. This is probably okay, because the number of backend
+ * fsyncs doesn't say anything about the efficacy of the
+ * BufferAccessStrategy. And counting both fsyncs done in
+ * IOCONTEXT_SHARED and IOCONTEXT_[BULKREAD, BULKWRITE, VACUUM] under
+ * IOCONTEXT_SHARED is likely clearer when investigating the number of
+ * backend fsyncs.
+ */
+ pgstat_count_io_op(IOOP_FSYNC, IOCONTEXT_SHARED);
return false;
}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 6b95381481..1c14e305c1 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -482,7 +482,7 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
BlockNumber blockNum,
BufferAccessStrategy strategy,
bool *foundPtr);
-static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+static void FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOContext io_context);
static void FindAndDropRelationBuffers(RelFileLocator rlocator,
ForkNumber forkNum,
BlockNumber nForkBlock,
@@ -823,6 +823,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
BufferDesc *bufHdr;
Block bufBlock;
bool found;
+ IOContext io_context;
bool isExtend;
bool isLocalBuf = SmgrIsTemp(smgr);
@@ -833,6 +834,13 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
isExtend = (blockNum == P_NEW);
+ if (strategy)
+ io_context = IOContextForStrategy(strategy);
+ else if (isLocalBuf)
+ io_context = IOCONTEXT_LOCAL;
+ else
+ io_context = IOCONTEXT_SHARED;
+
TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
smgr->smgr_rlocator.locator.spcOid,
smgr->smgr_rlocator.locator.dbOid,
@@ -886,6 +894,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
/* if it was already in the buffer pool, we're done */
if (found)
{
+ pgstat_count_io_op(IOOP_HIT, io_context);
+
if (!isExtend)
{
/* Just need to update stats before we exit */
@@ -986,10 +996,14 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
*/
Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID)); /* spinlock not needed */
- bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
+ if (isLocalBuf)
+ bufBlock = LocalBufHdrGetBlock(bufHdr);
+ else
+ bufBlock = BufHdrGetBlock(bufHdr);
if (isExtend)
{
+ pgstat_count_io_op(IOOP_EXTEND, io_context);
/* new buffers are zero-filled */
MemSet((char *) bufBlock, 0, BLCKSZ);
/* don't set checksum for all-zero page */
@@ -1020,6 +1034,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
smgrread(smgr, forkNum, blockNum, (char *) bufBlock);
+ pgstat_count_io_op(IOOP_READ, io_context);
+
if (track_io_timing)
{
INSTR_TIME_SET_CURRENT(io_time);
@@ -1190,6 +1206,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
/* Loop here in case we have to try another victim buffer */
for (;;)
{
+ bool from_ring;
+
/*
* Ensure, while the spinlock's not yet held, that there's a free
* refcount entry.
@@ -1200,7 +1218,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* Select a victim buffer. The buffer is returned with its header
* spinlock still held!
*/
- buf = StrategyGetBuffer(strategy, &buf_state);
+ buf = StrategyGetBuffer(strategy, &buf_state, &from_ring);
Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
@@ -1237,6 +1255,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf),
LW_SHARED))
{
+ IOContext io_context;
+
/*
* If using a nondefault strategy, and writing the buffer
* would require a WAL flush, let the strategy decide whether
@@ -1263,13 +1283,28 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
}
}
+ /*
+ * When a strategy is in use, if the target dirty buffer is an
+ * existing strategy buffer being reused, count this as a
+ * strategy write for the purposes of IO Operations statistics
+ * tracking.
+ *
+ * All dirty shared buffers upon first being added to the ring
+ * will be counted as shared buffer writes.
+ *
+ * When a strategy is not in use, the write can only be a
+ * "regular" write of a dirty shared buffer.
+ */
+
+ io_context = from_ring ? IOContextForStrategy(strategy) : IOCONTEXT_SHARED;
+
/* OK, do the I/O */
TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(forkNum, blockNum,
smgr->smgr_rlocator.locator.spcOid,
smgr->smgr_rlocator.locator.dbOid,
smgr->smgr_rlocator.locator.relNumber);
- FlushBuffer(buf, NULL);
+ FlushBuffer(buf, NULL, io_context);
LWLockRelease(BufferDescriptorGetContentLock(buf));
ScheduleBufferTagForWriteback(&BackendWritebackContext,
@@ -2570,7 +2605,7 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOCONTEXT_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
@@ -2820,7 +2855,7 @@ BufferGetTag(Buffer buffer, RelFileLocator *rlocator, ForkNumber *forknum,
* as the second parameter. If not, pass NULL.
*/
static void
-FlushBuffer(BufferDesc *buf, SMgrRelation reln)
+FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOContext io_context)
{
XLogRecPtr recptr;
ErrorContextCallback errcallback;
@@ -2900,6 +2935,8 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
*/
bufToWrite = PageSetChecksumCopy((Page) bufBlock, buf->tag.blockNum);
+ pgstat_count_io_op(IOOP_WRITE, io_context);
+
if (track_io_timing)
INSTR_TIME_SET_CURRENT(io_start);
@@ -3551,6 +3588,8 @@ FlushRelationBuffers(Relation rel)
localpage,
false);
+ pgstat_count_io_op(IOOP_WRITE, IOCONTEXT_LOCAL);
+
buf_state &= ~(BM_DIRTY | BM_JUST_DIRTIED);
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
@@ -3586,7 +3625,7 @@ FlushRelationBuffers(Relation rel)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, RelationGetSmgr(rel));
+ FlushBuffer(bufHdr, RelationGetSmgr(rel), IOCONTEXT_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr);
}
@@ -3684,7 +3723,7 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, srelent->srel);
+ FlushBuffer(bufHdr, srelent->srel, IOCONTEXT_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr);
}
@@ -3894,7 +3933,7 @@ FlushDatabaseBuffers(Oid dbid)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOCONTEXT_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr);
}
@@ -3921,7 +3960,7 @@ FlushOneBuffer(Buffer buffer)
Assert(LWLockHeldByMe(BufferDescriptorGetContentLock(bufHdr)));
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOCONTEXT_SHARED);
}
/*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 990e081aae..5fd65c17d1 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -15,6 +15,7 @@
*/
#include "postgres.h"
+#include "pgstat.h"
#include "port/atomics.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
@@ -198,13 +199,15 @@ have_free_buffer(void)
* return the buffer with the buffer header spinlock still held.
*/
BufferDesc *
-StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
+StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_ring)
{
BufferDesc *buf;
int bgwprocno;
int trycounter;
uint32 local_buf_state; /* to avoid repeated (de-)referencing */
+ *from_ring = false;
+
/*
* If given a strategy object, see whether it can select a buffer. We
* assume strategy objects don't need buffer_strategy_lock.
@@ -213,7 +216,23 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
{
buf = GetBufferFromRing(strategy, buf_state);
if (buf != NULL)
+ {
+ /*
+ * When a BufferAccessStrategy is in use, reused buffers from the
+ * strategy ring will be counted as IOCONTEXT_BULKREAD,
+ * IOCONTEXT_BULKWRITE, or IOCONTEXT_VACUUM acquisitions for the
+ * purposes of IO Operation statistics tracking.
+ *
+ * However, even when a strategy is in use, if a new buffer must
+ * be acquired from shared buffers and added to the ring, this is
+ * counted instead as an IOCONTEXT_SHARED acquisition. So, only
+ * reused buffers are counted as having been acquired in a
+ * BufferAccessStrategy IOContext.
+ */
+ *from_ring = true;
+ pgstat_count_io_op(IOOP_ACQUIRE, IOContextForStrategy(strategy));
return buf;
+ }
}
/*
@@ -247,6 +266,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
* the rate of buffer consumption. Note that buffers recycled by a
* strategy object are intentionally not counted here.
*/
+ pgstat_count_io_op(IOOP_ACQUIRE, IOCONTEXT_SHARED);
pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);
/*
@@ -670,6 +690,36 @@ AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
strategy->buffers[strategy->current] = BufferDescriptorGetBuffer(buf);
}
+/*
+ * Utility function returning the IOContext of a given BufferAccessStrategy's
+ * strategy ring.
+ */
+IOContext
+IOContextForStrategy(BufferAccessStrategy strategy)
+{
+ Assert(strategy);
+
+ switch (strategy->btype)
+ {
+ case BAS_NORMAL:
+
+ /*
+ * Currently, GetAccessStrategy() returns NULL for
+ * BufferAccessStrategyType BAS_NORMAL, so this case is unlikely
+ * to be hit.
+ */
+ return IOCONTEXT_SHARED;
+ case BAS_BULKREAD:
+ return IOCONTEXT_BULKREAD;
+ case BAS_BULKWRITE:
+ return IOCONTEXT_BULKWRITE;
+ case BAS_VACUUM:
+ return IOCONTEXT_VACUUM;
+ }
+
+ elog(ERROR, "unrecognized BufferAccessStrategyType: %d", strategy->btype);
+}
+
/*
* StrategyRejectBuffer -- consider rejecting a dirty buffer
*
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 30d67d1c40..c2548f2b0b 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -18,6 +18,7 @@
#include "access/parallel.h"
#include "catalog/catalog.h"
#include "executor/instrument.h"
+#include "pgstat.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "utils/guc_hooks.h"
@@ -196,6 +197,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
LocalRefCount[b]++;
ResourceOwnerRememberBuffer(CurrentResourceOwner,
BufferDescriptorGetBuffer(bufHdr));
+
+ pgstat_count_io_op(IOOP_ACQUIRE, IOCONTEXT_LOCAL);
break;
}
}
@@ -226,6 +229,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
localpage,
false);
+ pgstat_count_io_op(IOOP_WRITE, IOCONTEXT_LOCAL);
+
/* Mark not-dirty now in case we error out below */
buf_state &= ~BM_DIRTY;
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 9d6a9e9109..5718b52fb5 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -432,6 +432,8 @@ ProcessSyncRequests(void)
total_elapsed += elapsed;
processed++;
+ pgstat_count_io_op(IOOP_FSYNC, IOCONTEXT_SHARED);
+
if (log_checkpoints)
elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f ms",
processed,
diff --git a/src/backend/utils/activity/Makefile b/src/backend/utils/activity/Makefile
index a2e8507fd6..0098785089 100644
--- a/src/backend/utils/activity/Makefile
+++ b/src/backend/utils/activity/Makefile
@@ -22,6 +22,7 @@ OBJS = \
pgstat_checkpointer.o \
pgstat_database.o \
pgstat_function.o \
+ pgstat_io_ops.o \
pgstat_relation.o \
pgstat_replslot.o \
pgstat_shmem.o \
diff --git a/src/backend/utils/activity/meson.build b/src/backend/utils/activity/meson.build
index 5b3b558a67..1038324c32 100644
--- a/src/backend/utils/activity/meson.build
+++ b/src/backend/utils/activity/meson.build
@@ -7,6 +7,7 @@ backend_sources += files(
'pgstat_checkpointer.c',
'pgstat_database.c',
'pgstat_function.c',
+ 'pgstat_io_ops.c',
'pgstat_relation.c',
'pgstat_replslot.c',
'pgstat_shmem.c',
diff --git a/src/backend/utils/activity/pgstat_io_ops.c b/src/backend/utils/activity/pgstat_io_ops.c
new file mode 100644
index 0000000000..cea46322a7
--- /dev/null
+++ b/src/backend/utils/activity/pgstat_io_ops.c
@@ -0,0 +1,234 @@
+/* -------------------------------------------------------------------------
+ *
+ * pgstat_io_ops.c
+ * Implementation of IO operation statistics.
+ *
+ * This file contains the implementation of IO operation statistics. It is kept
+ * separate from pgstat.c to enforce the line between the statistics access /
+ * storage implementation and the details about individual types of
+ * statistics.
+ *
+ * Copyright (c) 2021-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/utils/activity/pgstat_io_ops.c
+ * -------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "utils/pgstat_internal.h"
+
+static PgStat_IOContextOps pending_IOOpStats;
+
+void
+pgstat_count_io_op(IOOp io_op, IOContext io_context)
+{
+ PgStat_IOOpCounters *pending_counters;
+
+ Assert(io_context < IOCONTEXT_NUM_TYPES);
+ Assert(io_op < IOOP_NUM_TYPES);
+ Assert(pgstat_expect_io_op(MyBackendType, io_context, io_op));
+
+ pending_counters = &pending_IOOpStats.data[io_context];
+
+ switch (io_op)
+ {
+ case IOOP_ACQUIRE:
+ pending_counters->acquires++;
+ break;
+ case IOOP_EXTEND:
+ pending_counters->extends++;
+ break;
+ case IOOP_FSYNC:
+ pending_counters->fsyncs++;
+ break;
+ case IOOP_HIT:
+ pending_counters->hits++;
+ break;
+ case IOOP_READ:
+ pending_counters->reads++;
+ break;
+ case IOOP_WRITE:
+ pending_counters->writes++;
+ break;
+ }
+
+}
+
+const char *
+pgstat_io_context_desc(IOContext io_context)
+{
+ switch (io_context)
+ {
+ case IOCONTEXT_BULKREAD:
+ return "bulkread";
+ case IOCONTEXT_BULKWRITE:
+ return "bulkwrite";
+ case IOCONTEXT_LOCAL:
+ return "local";
+ case IOCONTEXT_SHARED:
+ return "shared";
+ case IOCONTEXT_VACUUM:
+ return "vacuum";
+ }
+
+ elog(ERROR, "unrecognized IOContext value: %d", io_context);
+}
+
+const char *
+pgstat_io_op_desc(IOOp io_op)
+{
+ switch (io_op)
+ {
+ case IOOP_ACQUIRE:
+ return "acquire";
+ case IOOP_EXTEND:
+ return "extend";
+ case IOOP_FSYNC:
+ return "fsync";
+ case IOOP_HIT:
+ return "hit";
+ case IOOP_READ:
+ return "read";
+ case IOOP_WRITE:
+ return "write";
+ }
+
+ elog(ERROR, "unrecognized IOOp value: %d", io_op);
+}
+
+/*
+* IO Operation statistics are not collected for all BackendTypes.
+*
+* The following BackendTypes do not participate in the cumulative stats
+* subsystem or do not do IO operations worth reporting statistics on:
+* - Syslogger because it is not connected to shared memory
+* - Archiver because most relevant archiving IO is delegated to a
+* specialized command or module
+* - WAL Receiver and WAL Writer IO is not tracked in pg_stat_io for now
+*
+* Function returns true if BackendType participates in the cumulative stats
+* subsystem for IO Operations and false if it does not.
+*/
+bool
+pgstat_io_op_stats_collected(BackendType bktype)
+{
+ return bktype != B_INVALID && bktype != B_ARCHIVER && bktype != B_LOGGER &&
+ bktype != B_WAL_RECEIVER && bktype != B_WAL_WRITER;
+}
+
+/*
+ * Some BackendTypes will never do IO in certain IOContexts. Check that the
+ * given BackendType is expected to do IO in the given IOContext.
+ */
+bool
+pgstat_bktype_io_context_valid(BackendType bktype, IOContext io_context)
+{
+ bool no_strategy;
+ bool no_local;
+
+ /*
+ * In core Postgres, only regular backends and WAL Sender processes
+ * executing queries should use local buffers. Parallel workers will not
+ * use local buffers (see InitLocalBuffers()); however, extensions
+ * leveraging background workers have no such limitation, so track IO
+ * Operations in IOCONTEXT_LOCAL for BackendType B_BG_WORKER.
+ */
+ no_local = bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER || bktype
+ == B_CHECKPOINTER || bktype == B_AUTOVAC_WORKER || bktype ==
+ B_STANDALONE_BACKEND || bktype == B_STARTUP;
+
+ if (io_context == IOCONTEXT_LOCAL && no_local)
+ return false;
+
+ /*
+ * Some BackendTypes never use any BufferAccessStrategy.
+ */
+ no_strategy = bktype == B_AUTOVAC_LAUNCHER || bktype ==
+ B_BG_WRITER || bktype == B_CHECKPOINTER;
+
+ if ((io_context == IOCONTEXT_BULKREAD || io_context == IOCONTEXT_BULKWRITE
+ || io_context == IOCONTEXT_VACUUM) && no_strategy)
+ return false;
+
+ /*
+ * There is not an explicit reason why an autovacuum worker could not use
+ * a BAS_BULKREAD or BAS_BULKWRITE BufferAccessStrategy, however, they do
+ * not currently do so and leaving these combinations out leans out the
+ * view. If autovacuum workers use these strategies in the future, this
+ * restriction can be removed.
+ */
+ if ((io_context == IOCONTEXT_BULKWRITE || io_context == IOCONTEXT_BULKREAD)
+ && bktype == B_AUTOVAC_WORKER)
+ return false;
+
+ return true;
+}
+
+/*
+ * Some BackendTypes will never do certain IOOps and some IOOps should not
+ * occur in certain IOContexts. Check that the given IOOp is valid for the
+ * given BackendType in the given IOContext. Note that there are currently no
+ * cases of an IOOp being invalid for a particular BackendType only within a
+ * certain IOContext.
+ */
+bool
+pgstat_io_op_valid(BackendType bktype, IOContext io_context, IOOp io_op)
+{
+ bool strategy_io_context;
+
+ /*
+ * Some BackendTypes should never track IO Operation statistics.
+ */
+ Assert(pgstat_io_op_stats_collected(bktype));
+
+ /*
+ * Some BackendTypes will not do certain IOOps.
+ */
+ if ((bktype == B_BG_WRITER || bktype == B_CHECKPOINTER) &&
+ (io_op == IOOP_READ || io_op == IOOP_ACQUIRE || io_op == IOOP_HIT))
+ return false;
+
+ if ((bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER || bktype ==
+ B_CHECKPOINTER) && io_op == IOOP_EXTEND)
+ return false;
+
+ /*
+ * Some IOOps are not valid in certain IOContexts
+ */
+ if (io_op == IOOP_EXTEND && io_context == IOCONTEXT_BULKREAD)
+ return false;
+
+ /*
+ * Temporary tables using local buffers are not logged and thus do not
+ * require fsync'ing.
+ *
+ * IOOP_FSYNC IOOps done by a backend using a BufferAccessStrategy are
+ * counted in the IOCONTEXT_SHARED IOContext. See comment in
+ * ForwardSyncRequest() for more details.
+ */
+ strategy_io_context = io_context == IOCONTEXT_BULKREAD || io_context ==
+ IOCONTEXT_BULKWRITE || io_context == IOCONTEXT_VACUUM;
+
+ if ((io_context == IOCONTEXT_LOCAL || strategy_io_context) &&
+ io_op == IOOP_FSYNC)
+ return false;
+
+ return true;
+}
+
+bool
+pgstat_expect_io_op(BackendType bktype, IOContext io_context, IOOp io_op)
+{
+ if (!pgstat_io_op_stats_collected(bktype))
+ return false;
+
+ if (!pgstat_bktype_io_context_valid(bktype, io_context))
+ return false;
+
+ if (!(pgstat_io_op_valid(bktype, io_context, io_op)))
+ return false;
+
+ return true;
+}
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index ad7334a0d2..472e0def97 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -14,6 +14,7 @@
#include "datatype/timestamp.h"
#include "portability/instr_time.h"
#include "postmaster/pgarch.h" /* for MAX_XFN_CHARS */
+#include "storage/buf.h"
#include "utils/backend_progress.h" /* for backward compatibility */
#include "utils/backend_status.h" /* for backward compatibility */
#include "utils/relcache.h"
@@ -276,6 +277,48 @@ typedef struct PgStat_CheckpointerStats
PgStat_Counter buf_fsync_backend;
} PgStat_CheckpointerStats;
+/*
+ * Types related to counting IO Operations for various IO Contexts
+ */
+
+typedef enum IOOp
+{
+ IOOP_ACQUIRE,
+ IOOP_EXTEND,
+ IOOP_FSYNC,
+ IOOP_HIT,
+ IOOP_READ,
+ IOOP_WRITE,
+} IOOp;
+
+#define IOOP_NUM_TYPES (IOOP_WRITE + 1)
+
+typedef enum IOContext
+{
+ IOCONTEXT_BULKREAD,
+ IOCONTEXT_BULKWRITE,
+ IOCONTEXT_LOCAL,
+ IOCONTEXT_SHARED,
+ IOCONTEXT_VACUUM,
+} IOContext;
+
+#define IOCONTEXT_NUM_TYPES (IOCONTEXT_VACUUM + 1)
+
+typedef struct PgStat_IOOpCounters
+{
+ PgStat_Counter acquires;
+ PgStat_Counter extends;
+ PgStat_Counter fsyncs;
+ PgStat_Counter hits;
+ PgStat_Counter reads;
+ PgStat_Counter writes;
+} PgStat_IOOpCounters;
+
+typedef struct PgStat_IOContextOps
+{
+ PgStat_IOOpCounters data[IOCONTEXT_NUM_TYPES];
+} PgStat_IOContextOps;
+
typedef struct PgStat_StatDBEntry
{
PgStat_Counter n_xact_commit;
@@ -453,6 +496,24 @@ extern void pgstat_report_checkpointer(void);
extern PgStat_CheckpointerStats *pgstat_fetch_stat_checkpointer(void);
+/*
+ * Functions in pgstat_io_ops.c
+ */
+
+extern void pgstat_count_io_op(IOOp io_op, IOContext io_context);
+extern const char *pgstat_io_context_desc(IOContext io_context);
+extern const char *pgstat_io_op_desc(IOOp io_op);
+
+/* Validation functions in pgstat_io_ops.c */
+extern bool pgstat_io_op_stats_collected(BackendType bktype);
+extern bool pgstat_bktype_io_context_valid(BackendType bktype, IOContext io_context);
+extern bool pgstat_io_op_valid(BackendType bktype, IOContext io_context, IOOp io_op);
+extern bool pgstat_expect_io_op(BackendType bktype, IOContext io_context, IOOp io_op);
+
+/* IO stats translation function in freelist.c */
+extern IOContext IOContextForStrategy(BufferAccessStrategy bas);
+
+
/*
* Functions in pgstat_database.c
*/
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 406db6be78..50d7e586e9 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -392,7 +392,7 @@ extern void ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *
/* freelist.c */
extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
- uint32 *buf_state);
+ uint32 *buf_state, bool *from_ring);
extern void StrategyFreeBuffer(BufferDesc *buf);
extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
BufferDesc *buf);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 6f4dfa0960..d0eed71f63 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -23,7 +23,12 @@
typedef void *Block;
-/* Possible arguments for GetAccessStrategy() */
+/*
+ * Possible arguments for GetAccessStrategy().
+ *
+ * If adding a new BufferAccessStrategyType, also add a new IOContext so
+ * statistics on IO operations using this strategy are tracked.
+ */
typedef enum BufferAccessStrategyType
{
BAS_NORMAL, /* Normal random access */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 97c9bc1861..67218ec6f2 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1106,7 +1106,9 @@ ID
INFIX
INT128
INTERFACE_INFO
+IOContext
IOFuncSelector
+IOOp
IPCompareMethod
ITEM
IV
@@ -2026,6 +2028,8 @@ PgStat_FetchConsistency
PgStat_FunctionCallUsage
PgStat_FunctionCounts
PgStat_HashKey
+PgStat_IOContextOps
+PgStat_IOOpCounters
PgStat_Kind
PgStat_KindInfo
PgStat_LocalState
--
2.34.1
v31 failed in CI, so
I've attached v32 which has a few issues fixed:
- addressed some compiler warnings I hadn't noticed locally
- autovac launcher and worker do indeed use the bulkread strategy if they
start before critical indexes have been loaded and must do a
sequential scan of some catalog tables, so I have changed the
restrictions on which BackendTypes are allowed to track IO Operations in
IOCONTEXT_BULKREAD
- changed the name of the column "fsynced" to "files_synced" to make it
clearer what unit it is in (and that its unit differs from the one
reported in the "unit" column)
In an off-list discussion with Andres, he mentioned that he thought
buffers reused by a BufferAccessStrategy should be split from buffers
"acquired" and that "acquired" should be renamed "clocksweeps".
I have started doing this, but for BufferAccessStrategy IO there are a
few choices about how we want to count the clocksweeps:
Currently the following situations are counted under the following
IOContexts and IOOps:
IOCONTEXT_[VACUUM,BULKREAD,BULKWRITE], IOOP_ACQUIRE
- reuse a buffer from the ring
IOCONTEXT_SHARED, IOOP_ACQUIRE
- add a buffer to the strategy ring initially
- add a new shared buffer to the ring when all the existing buffers in
the ring are pinned
And in the new paradigm, I think these are two good options:
1)
IOCONTEXT_[VACUUM,BULKREAD,BULKWRITE], IOOP_CLOCKSWEEP
- add a buffer to the strategy ring initially
- add a new shared buffer to the ring when all the existing buffers in
the ring are pinned
IOCONTEXT_[VACUUM,BULKREAD,BULKWRITE], IOOP_REUSE
- reuse a buffer from the ring
2)
IOCONTEXT_[VACUUM,BULKREAD,BULKWRITE], IOOP_CLOCKSWEEP
- add a buffer to the strategy ring initially
IOCONTEXT_[VACUUM,BULKREAD,BULKWRITE], IOOP_REUSE
- reuse a buffer from the ring
IOCONTEXT SHARED, IOOP_CLOCKSWEEP
- add a new shared buffer to the ring when all the existing buffers in
the ring are pinned
However, if we want to differentiate between buffers initially added to
the ring and buffers taken from shared buffers and added to the ring
because all strategy ring buffers are pinned or have a usage count above
one, then we would need to either do so inside GetBufferFromRing() or
propagate this distinction out somehow (easy enough if we care to do
it).
There are other combinations that I could come up with a justification
for as well, but I wanted to know what other people thought made sense
(and would make sense to users).
- Melanie
Attachments:
v32-0003-Add-system-view-tracking-IO-ops-per-backend-type.patch (text/x-patch)
From 8e9da4e2ecaae4bfb5a637b8e09eadeebaae4ee0 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Thu, 6 Oct 2022 12:24:42 -0400
Subject: [PATCH v32 3/3] Add system view tracking IO ops per backend type
Add pg_stat_io, a system view which tracks the number of IOOps
(acquires, extends, fsyncs, hits, reads, and writes) done through each
IOContext (shared buffers, local buffers, and buffers reserved by a
BufferAccessStrategy) by each type of backend (e.g. client backend,
checkpointer).
Some BackendTypes do not accumulate IO operations statistics and will
not be included in the view.
Some IOContexts are not used by some BackendTypes and will not be in the
view. For example, checkpointer does not use a BufferAccessStrategy
(currently), so there will be no row for the relevant
BufferAccessStrategy IOContext for checkpointer.
Some IOOps are invalid in combination with certain IOContexts. Those
cells will be NULL in the view to distinguish between 0 observed IOOps
of that type and an invalid combination. For example, local buffers are
not fsynced so cells for all BackendTypes for IOCONTEXT_LOCAL and
IOOP_FSYNC will be NULL.
Some BackendTypes never perform certain IOOps. Those cells will also be
NULL in the view. For example, bgwriter should not perform reads.
View stats are populated with statistics incremented when a backend
performs an IO Operation and maintained by the cumulative statistics
subsystem.
Each row of the view shows stats for a particular BackendType and
IOContext combination (e.g. shared buffer accesses by checkpointer) and
each column in the view is the total number of IO Operations done (e.g.
writes).
So a cell in the view would be, for example, the number of shared
buffers written by checkpointer since the last stats reset.
In anticipation of tracking WAL IO and non-block-oriented IO (such as
temporary file IO), the "unit" column specifies the unit of the
"acquired", "read", "written", and "extended" columns for a given row.
Note that some of the cells in the view are redundant with fields in
pg_stat_bgwriter (e.g. buffers_backend), however these have been kept in
pg_stat_bgwriter for backwards compatibility. Deriving the redundant
pg_stat_bgwriter stats from the IO operations stats structures was also
problematic due to the separate reset targets for 'bgwriter' and 'io'.
Suggested by Andres Freund
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Maciek Sakrejda <m.sakrejda@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
doc/src/sgml/monitoring.sgml | 200 ++++++++++++++++++++-
src/backend/catalog/system_views.sql | 14 ++
src/backend/utils/adt/pgstatfuncs.c | 126 +++++++++++++
src/include/catalog/pg_proc.dat | 9 +
src/test/regress/expected/rules.out | 11 ++
src/test/regress/expected/stats.out | 256 +++++++++++++++++++++++++++
src/test/regress/sql/stats.sql | 134 ++++++++++++++
7 files changed, 748 insertions(+), 2 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 14dfd650f8..104c4b0f1a 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -448,6 +448,15 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</entry>
</row>
+ <row>
+ <entry><structname>pg_stat_io</structname><indexterm><primary>pg_stat_io</primary></indexterm></entry>
+ <entry>A row for each IO Context for each backend type showing
+ statistics about backend IO operations. See
+ <link linkend="monitoring-pg-stat-io-view">
+ <structname>pg_stat_io</structname></link> for details.
+ </entry>
+ </row>
+
<row>
<entry><structname>pg_stat_wal</structname><indexterm><primary>pg_stat_wal</primary></indexterm></entry>
<entry>One row only, showing statistics about WAL activity. See
@@ -3600,13 +3609,12 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
<structfield>stats_reset</structfield> <type>timestamp with time zone</type>
</para>
<para>
- Time at which these statistics were last reset
+ Time at which these statistics were last reset.
</para></entry>
</row>
</tbody>
</tgroup>
</table>
-
<para>
Normally, WAL files are archived in order, oldest to newest, but that is
not guaranteed, and does not hold under special circumstances like when
@@ -3615,7 +3623,195 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
<structfield>last_archived_wal</structfield> have also been successfully
archived.
</para>
+ </sect2>
+
+ <sect2 id="monitoring-pg-stat-io-view">
+ <title><structname>pg_stat_io</structname></title>
+
+ <indexterm>
+ <primary>pg_stat_io</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_stat_io</structname> view has a row for each backend type
+ and IO context containing global data for the cluster on IO operations done
+ by that backend type in that IO context. Currently, only a subset of IO
+ operations are tracked here. WAL IO, IO on temporary files, and some forms
+ of IO outside of shared buffers (such as when building indexes or moving a
+ table from one tablespace to another) could be added in the future.
+ </para>
+
+ <table id="pg-stat-io-view" xreflabel="pg_stat_io">
+ <title><structname>pg_stat_io</structname> View</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>backend_type</structfield> <type>text</type>
+ </para>
+ <para>
+ Type of backend (e.g. background worker, autovacuum worker).
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>io_context</structfield> <type>text</type>
+ </para>
+ <para>
+ IO Context used. This refers to the context or location of an IO
+ operation.
+ <literal>shared</literal> refers to shared buffers, the primary
+ buffer pool for relation data.
+ <literal>local</literal> refers to
+ process-local memory used for temporary tables.
+ <literal>vacuum</literal> refers to memory reserved for use during
+ vacuuming and analyzing.
+ <literal>bulkread</literal>
+ refers to memory reserved for use during bulk read operations.
+ <literal>bulkwrite</literal>
+ refers to memory reserved for use during bulk write operations.
+ The autovacuum daemon, explicit <command>VACUUM</command>, explicit
+ <command>ANALYZE</command>, many bulk reads, and many bulk writes use a
+ fixed amount of memory, acquiring the equivalent number of shared
+ buffers and reusing them circularly to avoid occupying an undue portion
+ of the main shared buffer pool.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>acquired</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Acquisitions of memory by this <varname>backend_type</varname> for
+ performing IO operations in this <varname>io_context</varname>. For
+ block-oriented IO, <varname>acquired</varname> is the number of buffers
+ acquired or reused as part of a buffer access strategy.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>hit</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Relevant only for block-based IO of data accessed in the course of
+ satisfying queries, <varname>hit</varname> is the number of
+ accesses of blocks already located in a
+ <productname>PostgreSQL</productname> buffer in this specified
+ <varname>io_context</varname> by this <varname>backend_type</varname>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>read</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Reads by this <varname>backend_type</varname> into
+ memory or buffers in this <varname>io_context</varname>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>written</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Writes of data in this <varname>io_context</varname> by this
+ <varname>backend_type</varname>.
+ Note that the values of <varname>written</varname> for
+ <varname>backend_type</varname> <literal>background writer</literal> and
+ <varname>backend_type</varname> <literal>checkpointer</literal> are
+ equivalent to the values of <varname>buffers_clean</varname> and
+ <varname>buffers_checkpoint</varname>, respectively, in <link
+ linkend="monitoring-pg-stat-bgwriter-view">
+ <structname>pg_stat_bgwriter</structname></link>.
+ Also, the sum of <varname>written</varname> and
+ <varname>extended</varname> in this view for
+ <varname>backend_type</varname>s <literal>client backend</literal>,
+ <literal>autovacuum worker</literal>, <literal>background
+ worker</literal>, and <literal>walsender</literal> in
+ <varname>io_context</varname>s <literal>shared</literal>,
+ <literal>bulkread</literal>, <literal>bulkwrite</literal>, and
+ <literal>vacuum</literal> is equivalent to
+ <varname>buffers_backend</varname> in
+ <structname>pg_stat_bgwriter</structname>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>extended</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Extends of relations done by this <varname>backend_type</varname> in
+ order to write data in this <varname>io_context</varname>.
+ </para></entry>
+ </row>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>unit</structfield> <type>text</type>
+ </para>
+ <para>
+ The unit in which the <varname>acquired</varname>, <varname>read</varname>,
+ <varname>written</varname>, and <varname>extended</varname> columns are
+ expressed. Currently <varname>block_size</varname> is the only
+ possible value. Reads, writes, and extends of relation data are done in
+ <varname>block_size</varname> units. Future values could include
+ <varname>wal_block_size</varname>, once WAL IO is tracked in this view,
+ and <quote>bytes</quote>, once non-block-oriented IO such as temp files
+ is tracked here.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>files_synced</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of files fsynced by this <varname>backend_type</varname> for
+ the purpose of persisting data dirtied in this
+ <varname>io_context</varname>. <literal>fsyncs</literal> are done at
+ segment boundaries so <varname>unit</varname> does not apply to the
+ <varname>files_synced</varname> column. <literal>fsyncs</literal> done by
+ backends in order to persist data written in
+ <varname>io_context</varname> <literal>vacuum</literal>,
+ <varname>io_context</varname> <literal>bulkread</literal>, or
+ <varname>io_context</varname> <literal>bulkwrite</literal> are counted
+ as an <varname>io_context</varname> <literal>shared</literal>
+ <literal>fsync</literal>.
+ Note that the sum of <varname>files_synced</varname> for all
+ <varname>io_context</varname> <literal>shared</literal> for all
+ <varname>backend_type</varname>s except <literal>checkpointer</literal>
+ is equivalent to <varname>buffers_backend_fsync</varname> in <link
+ linkend="monitoring-pg-stat-bgwriter-view">
+ <structname>pg_stat_bgwriter</structname></link>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
+ </para>
+ <para>
+ Time at which these statistics were last reset.
+ </para></entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
</sect2>
<sect2 id="monitoring-pg-stat-bgwriter-view">
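As a usage sketch (not part of the patch), the view documented above could be queried to see which backend types perform their own writes and fsyncs of shared buffers, rather than leaving that work to the checkpointer or bgwriter; the column names are those defined in the table above:

```sql
-- Hypothetical usage of the pg_stat_io view added by this patch:
-- which backend types are writing out and fsyncing shared buffers
-- themselves?
SELECT backend_type, written, extended, files_synced
FROM pg_stat_io
WHERE io_context = 'shared'
  AND (written > 0 OR files_synced > 0)
ORDER BY written DESC;
```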
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 55f7ec79e0..4467a0df82 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1114,6 +1114,20 @@ CREATE VIEW pg_stat_bgwriter AS
pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+CREATE VIEW pg_stat_io AS
+SELECT
+ b.backend_type,
+ b.io_context,
+ b.acquired,
+ b.hit,
+ b.read,
+ b.written,
+ b.extended,
+ b.unit,
+ b.files_synced,
+ b.stats_reset
+FROM pg_stat_get_io() b;
+
CREATE VIEW pg_stat_wal AS
SELECT
w.wal_records,
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index edd73e5c25..ee8712ace4 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1712,6 +1712,132 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
}
+/*
+ * When adding a new column to the pg_stat_io view, add a new enum value
+ * above IO_NUM_COLUMNS.
+ */
+typedef enum io_stat_col
+{
+ IO_COL_BACKEND_TYPE,
+ IO_COL_IO_CONTEXT,
+ IO_COL_ALLOCS,
+ IO_COL_HITS,
+ IO_COL_READS,
+ IO_COL_WRITES,
+ IO_COL_EXTENDS,
+ IO_COL_UNIT,
+ IO_COL_FSYNCS,
+ IO_COL_RESET_TIME,
+ IO_NUM_COLUMNS,
+} io_stat_col;
+
+/*
+ * When adding a new IOOp, add a new io_stat_col and add a case to this
+ * function returning the corresponding io_stat_col.
+ */
+static io_stat_col
+pgstat_io_op_get_index(IOOp io_op)
+{
+ switch (io_op)
+ {
+ case IOOP_ACQUIRE:
+ return IO_COL_ALLOCS;
+ case IOOP_HIT:
+ return IO_COL_HITS;
+ case IOOP_READ:
+ return IO_COL_READS;
+ case IOOP_WRITE:
+ return IO_COL_WRITES;
+ case IOOP_EXTEND:
+ return IO_COL_EXTENDS;
+ case IOOP_FSYNC:
+ return IO_COL_FSYNCS;
+ }
+
+ elog(ERROR, "unrecognized IOOp value: %d", (int) io_op);
+}
+
+Datum
+pg_stat_get_io(PG_FUNCTION_ARGS)
+{
+ PgStat_BackendIOContextOps *backends_io_stats;
+ ReturnSetInfo *rsinfo;
+ Datum reset_time;
+
+ SetSingleFuncCall(fcinfo, 0);
+ rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+
+ backends_io_stats = pgstat_fetch_backend_io_context_ops();
+
+ reset_time = TimestampTzGetDatum(backends_io_stats->stat_reset_timestamp);
+
+ for (int bktype = 0; bktype < BACKEND_NUM_TYPES; bktype++)
+ {
+ Datum bktype_desc = CStringGetTextDatum(GetBackendTypeDesc((BackendType) bktype));
+ bool expect_backend_stats;
+ PgStat_IOContextOps *io_context_ops = &backends_io_stats->stats[bktype];
+
+ /*
+ * For those BackendTypes without IO Operation stats, skip
+ * representing them in the view altogether.
+ */
+ expect_backend_stats = pgstat_io_op_stats_collected((BackendType) bktype);
+
+ for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ {
+ PgStat_IOOpCounters *counters = &io_context_ops->data[io_context];
+ const char *io_context_str = pgstat_io_context_desc(io_context);
+
+ Datum values[IO_NUM_COLUMNS] = {0};
+ bool nulls[IO_NUM_COLUMNS] = {0};
+
+ /*
+ * Some combinations of IOContext and BackendType are not valid
+ * for any type of IOOp. In such cases, omit the entire row from
+ * the view.
+ */
+ if (!expect_backend_stats ||
+ !pgstat_bktype_io_context_valid((BackendType) bktype,
+ (IOContext) io_context))
+ {
+ pgstat_io_context_ops_assert_zero(counters);
+ continue;
+ }
+
+ values[IO_COL_BACKEND_TYPE] = bktype_desc;
+ values[IO_COL_IO_CONTEXT] = CStringGetTextDatum(io_context_str);
+ values[IO_COL_ALLOCS] = Int64GetDatum(counters->acquires);
+ values[IO_COL_HITS] = Int64GetDatum(counters->hits);
+ values[IO_COL_READS] = Int64GetDatum(counters->reads);
+ values[IO_COL_WRITES] = Int64GetDatum(counters->writes);
+ values[IO_COL_EXTENDS] = Int64GetDatum(counters->extends);
+ values[IO_COL_FSYNCS] = Int64GetDatum(counters->fsyncs);
+ values[IO_COL_UNIT] = CStringGetTextDatum("block_size");
+ values[IO_COL_RESET_TIME] = TimestampTzGetDatum(reset_time);
+
+ /*
+ * Some combinations of BackendType and IOOp and of IOContext and
+ * IOOp are not valid. Set these cells in the view NULL and assert
+ * that these stats are zero as expected.
+ */
+ for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
+ {
+ if (!(pgstat_io_op_valid((BackendType) bktype, (IOContext)
+ io_context, (IOOp) io_op)))
+ {
+ pgstat_io_op_assert_zero(counters, (IOOp) io_op);
+ nulls[pgstat_io_op_get_index((IOOp) io_op)] = true;
+ }
+ }
+
+ tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
+ }
+ }
+
+ return (Datum) 0;
+}
+
/*
* Returns statistics of WAL activity
*/
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 68bb032d3e..250a1f17dc 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5649,6 +5649,15 @@
proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
+{ oid => '8459', descr => 'statistics: per backend type IO statistics',
+ proname => 'pg_stat_get_io', provolatile => 'v',
+ prorows => '14', proretset => 't',
+ proparallel => 'r', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,text,int8,int8,int8,int8,int8,text,int8,timestamptz}',
+ proargmodes => '{o,o,o,o,o,o,o,o,o,o}',
+ proargnames => '{backend_type,io_context,acquired,hit,read,written,extended,unit,files_synced,stats_reset}',
+ prosrc => 'pg_stat_get_io' },
+
{ oid => '1136', descr => 'statistics: information about WAL activity',
proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
proparallel => 'r', prorettype => 'record', proargtypes => '',
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 9dd137415e..24187acb61 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1868,6 +1868,17 @@ pg_stat_gssapi| SELECT s.pid,
s.gss_enc AS encrypted
FROM pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid, query_id)
WHERE (s.client_port IS NOT NULL);
+pg_stat_io| SELECT b.backend_type,
+ b.io_context,
+ b.acquired,
+ b.hit,
+ b.read,
+ b.written,
+ b.extended,
+ b.unit,
+ b.files_synced,
+ b.stats_reset
+ FROM pg_stat_get_io() b(backend_type, io_context, acquired, hit, read, written, extended, unit, files_synced, stats_reset);
pg_stat_progress_analyze| SELECT s.pid,
s.datid,
d.datname,
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index f701da2069..a9919a3e62 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -918,4 +918,260 @@ SELECT pg_stat_get_subscription_stats(NULL);
(1 row)
+-- Test that the following operations are tracked in pg_stat_io:
+-- - acquisitions of shared buffers for IO operations
+-- - reads of target blocks into shared buffers
+-- - shared buffer cache hits when target blocks reside in shared buffers
+-- - writes of shared buffers
+-- - extends of relations using shared buffers
+-- - fsyncs done to ensure the durability of data dirtied in shared buffers
+SELECT sum(acquired) AS io_sum_shared_acquisitions_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(hit) AS io_sum_shared_hits_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(read) AS io_sum_shared_reads_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(written) AS io_sum_shared_writes_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(extended) AS io_sum_shared_extends_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(files_synced) AS io_sum_shared_fsyncs_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+-- Create a regular table and insert some data to generate IOCONTEXT_SHARED
+-- acquisitions and extends.
+CREATE TABLE test_io_shared(a int);
+INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+-- After a checkpoint, there should be some additional IOCONTEXT_SHARED writes
+-- and fsyncs.
+-- The second checkpoint ensures that stats from the first checkpoint have been
+-- reported and protects against any potential races amongst the table
+-- creation, a possible timing-triggered checkpoint, and the explicit
+-- checkpoint in the test.
+CHECKPOINT;
+CHECKPOINT;
+SELECT sum(acquired) AS io_sum_shared_acquisitions_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(written) AS io_sum_shared_writes_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(extended) AS io_sum_shared_extends_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(files_synced) AS io_sum_shared_fsyncs_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT :io_sum_shared_acquisitions_after > :io_sum_shared_acquisitions_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_shared_writes_after > :io_sum_shared_writes_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_shared_extends_after > :io_sum_shared_extends_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT current_setting('fsync') = 'off' OR :io_sum_shared_fsyncs_after > :io_sum_shared_fsyncs_before;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Change the tablespace so that the table is rewritten directly, then SELECT
+-- from it to cause it to be read back into shared buffers.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE test_io_shared_stats_tblspc LOCATION '';
+ALTER TABLE test_io_shared SET TABLESPACE test_io_shared_stats_tblspc;
+SELECT COUNT(*) FROM test_io_shared;
+ count
+-------
+ 100
+(1 row)
+
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(read) AS io_sum_shared_reads_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT :io_sum_shared_reads_after > :io_sum_shared_reads_before;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Select from the table again once it is in shared buffers. There should be
+-- some hits recorded in pg_stat_io.
+SELECT sum(hit) AS io_sum_shared_hits_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT :io_sum_shared_hits_after > :io_sum_shared_hits_before;
+ ?column?
+----------
+ t
+(1 row)
+
+DROP TABLE test_io_shared;
+DROP TABLESPACE test_io_shared_stats_tblspc;
+-- Test that acquisitions of local buffers, reads of temporary table blocks
+-- into local buffers, temporary table block cache hits in local buffers,
+-- writes of local buffers, and extends of temporary tables are tracked in
+-- pg_stat_io.
+-- Set temp_buffers to a low value so that we can trigger writes with fewer
+-- inserted tuples. Do so in a new session in case temporary tables have been
+-- accessed by previous tests in this session.
+\c
+SET temp_buffers TO '1MB';
+CREATE TEMPORARY TABLE test_io_local(a int, b TEXT);
+SELECT sum(acquired) AS io_sum_local_acquisitions_before FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(hit) AS io_sum_local_hits_before FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(read) AS io_sum_local_reads_before FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(written) AS io_sum_local_writes_before FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(extended) AS io_sum_local_extends_before FROM pg_stat_io WHERE io_context = 'local' \gset
+-- Insert enough values that we need to reuse and write out dirty local
+-- buffers.
+INSERT INTO test_io_local SELECT generate_series(1, 8000) as id, repeat('a', 100);
+-- Read in evicted buffers.
+SELECT COUNT(*) FROM test_io_local;
+ count
+-------
+ 8000
+(1 row)
+
+-- Query tuples in local buffers to ensure new local buffer cache hits.
+SELECT COUNT(*) FROM test_io_local;
+ count
+-------
+ 8000
+(1 row)
+
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(acquired) AS io_sum_local_acquisitions_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(hit) AS io_sum_local_hits_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(read) AS io_sum_local_reads_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(written) AS io_sum_local_writes_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(extended) AS io_sum_local_extends_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT :io_sum_local_acquisitions_after > :io_sum_local_acquisitions_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_local_reads_after > :io_sum_local_reads_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_local_hits_after > :io_sum_local_hits_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_local_writes_after > :io_sum_local_writes_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_local_extends_after > :io_sum_local_extends_before;
+ ?column?
+----------
+ t
+(1 row)
+
+RESET temp_buffers;
+-- Test that reuse of strategy buffers and reads of blocks into these reused
+-- buffers while VACUUMing are tracked in pg_stat_io.
+-- Set wal_skip_threshold smaller than the expected size of
+-- test_io_vac_strategy so that, even if wal_level is minimal, VACUUM FULL will
+-- fsync the newly rewritten test_io_vac_strategy instead of writing it to WAL.
+-- Writing it to WAL will result in the newly written relation pages being in
+-- shared buffers -- preventing us from testing BAS_VACUUM BufferAccessStrategy
+-- acquisitions and reads.
+SET wal_skip_threshold = '1 kB';
+SELECT sum(acquired) AS io_sum_vac_strategy_acquisitions_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(read) AS io_sum_vac_strategy_reads_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+CREATE TABLE test_io_vac_strategy(a int, b int) WITH (autovacuum_enabled = 'false');
+INSERT INTO test_io_vac_strategy SELECT i, i from generate_series(1, 8000)i;
+-- Ensure that the next VACUUM will need to perform IO by rewriting the table
+-- first with VACUUM (FULL).
+VACUUM (FULL) test_io_vac_strategy;
+VACUUM (PARALLEL 0) test_io_vac_strategy;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(acquired) AS io_sum_vac_strategy_acquisitions_after FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(read) AS io_sum_vac_strategy_reads_after FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT :io_sum_vac_strategy_acquisitions_after > :io_sum_vac_strategy_acquisitions_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_vac_strategy_reads_after > :io_sum_vac_strategy_reads_before;
+ ?column?
+----------
+ t
+(1 row)
+
+RESET wal_skip_threshold;
+-- Test that extends done by a CTAS, which uses a BAS_BULKWRITE
+-- BufferAccessStrategy, are tracked in pg_stat_io.
+SELECT sum(extended) AS io_sum_bulkwrite_strategy_extends_before FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+CREATE TABLE test_io_bulkwrite_strategy AS SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(extended) AS io_sum_bulkwrite_strategy_extends_after FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+SELECT :io_sum_bulkwrite_strategy_extends_after > :io_sum_bulkwrite_strategy_extends_before;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Test that reads of blocks into reused strategy buffers during database
+-- creation, which uses a BAS_BULKREAD BufferAccessStrategy, are tracked in
+-- pg_stat_io.
+SELECT sum(read) AS io_sum_bulkread_strategy_reads_before FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+CREATE DATABASE test_io_bulkread_strategy_db;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(read) AS io_sum_bulkread_strategy_reads_after FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+SELECT :io_sum_bulkread_strategy_reads_after > :io_sum_bulkread_strategy_reads_before;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Test IO stats reset
+SELECT sum(acquired) + sum(extended) + sum(files_synced) + sum(read) + sum(written) AS io_stats_pre_reset FROM pg_stat_io \gset
+SELECT pg_stat_reset_shared('io');
+ pg_stat_reset_shared
+----------------------
+
+(1 row)
+
+SELECT sum(acquired) + sum(extended) + sum(files_synced) + sum(read) + sum(written) AS io_stats_post_reset FROM pg_stat_io \gset
+SELECT :io_stats_post_reset < :io_stats_pre_reset;
+ ?column?
+----------
+ t
+(1 row)
+
-- End of Stats Test
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index eb081f65a4..78798bc626 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -449,4 +449,138 @@ SELECT pg_stat_get_replication_slot(NULL);
SELECT pg_stat_get_subscription_stats(NULL);
+-- Test that the following operations are tracked in pg_stat_io:
+-- - acquisitions of shared buffers for IO operations
+-- - reads of target blocks into shared buffers
+-- - shared buffer cache hits when target blocks reside in shared buffers
+-- - writes of shared buffers
+-- - extends of relations using shared buffers
+-- - fsyncs done to ensure the durability of data dirtied in shared buffers
+SELECT sum(acquired) AS io_sum_shared_acquisitions_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(hit) AS io_sum_shared_hits_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(read) AS io_sum_shared_reads_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(written) AS io_sum_shared_writes_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(extended) AS io_sum_shared_extends_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(files_synced) AS io_sum_shared_fsyncs_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+-- Create a regular table and insert some data to generate IOCONTEXT_SHARED
+-- acquisitions and extends.
+CREATE TABLE test_io_shared(a int);
+INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+-- After a checkpoint, there should be some additional IOCONTEXT_SHARED writes
+-- and fsyncs.
+-- The second checkpoint ensures that stats from the first checkpoint have been
+-- reported and protects against any potential races amongst the table
+-- creation, a possible timing-triggered checkpoint, and the explicit
+-- checkpoint in the test.
+CHECKPOINT;
+CHECKPOINT;
+SELECT sum(acquired) AS io_sum_shared_acquisitions_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(written) AS io_sum_shared_writes_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(extended) AS io_sum_shared_extends_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(files_synced) AS io_sum_shared_fsyncs_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT :io_sum_shared_acquisitions_after > :io_sum_shared_acquisitions_before;
+SELECT :io_sum_shared_writes_after > :io_sum_shared_writes_before;
+SELECT :io_sum_shared_extends_after > :io_sum_shared_extends_before;
+SELECT current_setting('fsync') = 'off' OR :io_sum_shared_fsyncs_after > :io_sum_shared_fsyncs_before;
+-- Change the tablespace so that the table is rewritten directly, then SELECT
+-- from it to cause it to be read back into shared buffers.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE test_io_shared_stats_tblspc LOCATION '';
+ALTER TABLE test_io_shared SET TABLESPACE test_io_shared_stats_tblspc;
+SELECT COUNT(*) FROM test_io_shared;
+SELECT pg_stat_force_next_flush();
+SELECT sum(read) AS io_sum_shared_reads_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT :io_sum_shared_reads_after > :io_sum_shared_reads_before;
+-- Select from the table again once it is in shared buffers. There should be
+-- some hits recorded in pg_stat_io.
+SELECT sum(hit) AS io_sum_shared_hits_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT :io_sum_shared_hits_after > :io_sum_shared_hits_before;
+DROP TABLE test_io_shared;
+DROP TABLESPACE test_io_shared_stats_tblspc;
+
+-- Test that acquisitions of local buffers, reads of temporary table blocks
+-- into local buffers, temporary table block cache hits in local buffers,
+-- writes of local buffers, and extends of temporary tables are tracked in
+-- pg_stat_io.
+
+-- Set temp_buffers to a low value so that we can trigger writes with fewer
+-- inserted tuples. Do so in a new session in case temporary tables have been
+-- accessed by previous tests in this session.
+\c
+SET temp_buffers TO '1MB';
+CREATE TEMPORARY TABLE test_io_local(a int, b TEXT);
+SELECT sum(acquired) AS io_sum_local_acquisitions_before FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(hit) AS io_sum_local_hits_before FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(read) AS io_sum_local_reads_before FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(written) AS io_sum_local_writes_before FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(extended) AS io_sum_local_extends_before FROM pg_stat_io WHERE io_context = 'local' \gset
+-- Insert enough values that we need to reuse and write out dirty local
+-- buffers.
+INSERT INTO test_io_local SELECT generate_series(1, 8000) as id, repeat('a', 100);
+-- Read in evicted buffers.
+SELECT COUNT(*) FROM test_io_local;
+-- Query tuples in local buffers to ensure new local buffer cache hits.
+SELECT COUNT(*) FROM test_io_local;
+SELECT pg_stat_force_next_flush();
+SELECT sum(acquired) AS io_sum_local_acquisitions_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(hit) AS io_sum_local_hits_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(read) AS io_sum_local_reads_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(written) AS io_sum_local_writes_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(extended) AS io_sum_local_extends_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT :io_sum_local_acquisitions_after > :io_sum_local_acquisitions_before;
+SELECT :io_sum_local_reads_after > :io_sum_local_reads_before;
+SELECT :io_sum_local_hits_after > :io_sum_local_hits_before;
+SELECT :io_sum_local_writes_after > :io_sum_local_writes_before;
+SELECT :io_sum_local_extends_after > :io_sum_local_extends_before;
+RESET temp_buffers;
+
+-- Test that reuse of strategy buffers and reads of blocks into these reused
+-- buffers while VACUUMing are tracked in pg_stat_io.
+
+-- Set wal_skip_threshold smaller than the expected size of
+-- test_io_vac_strategy so that, even if wal_level is minimal, VACUUM FULL will
+-- fsync the newly rewritten test_io_vac_strategy instead of writing it to WAL.
+-- Writing it to WAL will result in the newly written relation pages being in
+-- shared buffers -- preventing us from testing BAS_VACUUM BufferAccessStrategy
+-- acquisitions and reads.
+SET wal_skip_threshold = '1 kB';
+SELECT sum(acquired) AS io_sum_vac_strategy_acquisitions_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(read) AS io_sum_vac_strategy_reads_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+CREATE TABLE test_io_vac_strategy(a int, b int) WITH (autovacuum_enabled = 'false');
+INSERT INTO test_io_vac_strategy SELECT i, i from generate_series(1, 8000)i;
+-- Ensure that the next VACUUM will need to perform IO by rewriting the table
+-- first with VACUUM (FULL).
+VACUUM (FULL) test_io_vac_strategy;
+VACUUM (PARALLEL 0) test_io_vac_strategy;
+SELECT pg_stat_force_next_flush();
+SELECT sum(acquired) AS io_sum_vac_strategy_acquisitions_after FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(read) AS io_sum_vac_strategy_reads_after FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT :io_sum_vac_strategy_acquisitions_after > :io_sum_vac_strategy_acquisitions_before;
+SELECT :io_sum_vac_strategy_reads_after > :io_sum_vac_strategy_reads_before;
+RESET wal_skip_threshold;
+
+-- Test that extends done by a CTAS, which uses a BAS_BULKWRITE
+-- BufferAccessStrategy, are tracked in pg_stat_io.
+SELECT sum(extended) AS io_sum_bulkwrite_strategy_extends_before FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+CREATE TABLE test_io_bulkwrite_strategy AS SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+SELECT sum(extended) AS io_sum_bulkwrite_strategy_extends_after FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+SELECT :io_sum_bulkwrite_strategy_extends_after > :io_sum_bulkwrite_strategy_extends_before;
+
+-- Test that reads of blocks into reused strategy buffers during database
+-- creation, which uses a BAS_BULKREAD BufferAccessStrategy, are tracked in
+-- pg_stat_io.
+SELECT sum(read) AS io_sum_bulkread_strategy_reads_before FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+CREATE DATABASE test_io_bulkread_strategy_db;
+SELECT pg_stat_force_next_flush();
+SELECT sum(read) AS io_sum_bulkread_strategy_reads_after FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+SELECT :io_sum_bulkread_strategy_reads_after > :io_sum_bulkread_strategy_reads_before;
+
+-- Test IO stats reset
+SELECT sum(acquired) + sum(extended) + sum(files_synced) + sum(read) + sum(written) AS io_stats_pre_reset FROM pg_stat_io \gset
+SELECT pg_stat_reset_shared('io');
+SELECT sum(acquired) + sum(extended) + sum(files_synced) + sum(read) + sum(written) AS io_stats_post_reset FROM pg_stat_io \gset
+SELECT :io_stats_post_reset < :io_stats_pre_reset;
+
-- End of Stats Test
--
2.34.1
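As a sanity check of the backward-compatibility claims in the documentation above (a sketch, not part of the patch), the documented equivalence between the new view and pg_stat_bgwriter.buffers_backend could be verified with:

```sql
-- Both columns below should match if the documented equivalence holds.
SELECT (SELECT sum(written) + sum(extended)
        FROM pg_stat_io
        WHERE backend_type IN ('client backend', 'autovacuum worker',
                               'background worker', 'walsender')
          AND io_context IN ('shared', 'bulkread', 'bulkwrite', 'vacuum'))
           AS io_view_total,
       (SELECT buffers_backend FROM pg_stat_bgwriter) AS bgwriter_total;
```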
Attachment: v32-0002-Aggregate-IO-operation-stats-per-BackendType.patch (text/x-patch)
From d5859f08316a185f8fb889da8552422d9b609dd1 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Thu, 6 Oct 2022 12:23:38 -0400
Subject: [PATCH v32 2/3] Aggregate IO operation stats per BackendType
Stats on IOOps for all IOContexts for a backend are tracked locally. Add
functionality for backends to flush these stats to shared memory and
accumulate them with those from all other backends, exited and live.
Also add reset and snapshot functions used by cumulative stats system
for management of these statistics.
The aggregated stats in shared memory could be extended in the future
with per-backend stats -- useful for per connection IO statistics and
monitoring.
Some BackendTypes do not flush their pending statistics at regular
intervals; they instead call pgstat_flush_io_ops() explicitly during the
course of normal operations to flush their backend-local IO operation
statistics to shared memory in a timely manner.
Because not all BackendType, IOOp, IOContext combinations are valid, the
validity of the stats is checked before flushing pending stats and
before reading in the existing stats file to shared memory.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Maciek Sakrejda <m.sakrejda@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
doc/src/sgml/monitoring.sgml | 2 +
src/backend/utils/activity/pgstat.c | 35 ++++
src/backend/utils/activity/pgstat_bgwriter.c | 7 +-
.../utils/activity/pgstat_checkpointer.c | 7 +-
src/backend/utils/activity/pgstat_io_ops.c | 158 ++++++++++++++++++
src/backend/utils/activity/pgstat_relation.c | 15 +-
src/backend/utils/activity/pgstat_shmem.c | 4 +
src/backend/utils/activity/pgstat_wal.c | 4 +-
src/backend/utils/adt/pgstatfuncs.c | 4 +-
src/include/miscadmin.h | 2 +
src/include/pgstat.h | 83 ++++++++-
src/include/utils/pgstat_internal.h | 36 ++++
src/tools/pgindent/typedefs.list | 3 +
13 files changed, 353 insertions(+), 7 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 342b20ebeb..14dfd650f8 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -5360,6 +5360,8 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
the <structname>pg_stat_bgwriter</structname>
view, <literal>archiver</literal> to reset all the counters shown in
the <structname>pg_stat_archiver</structname> view,
+ <literal>io</literal> to reset all the counters shown in the
+ <structname>pg_stat_io</structname> view,
<literal>wal</literal> to reset all the counters shown in the
<structname>pg_stat_wal</structname> view or
<literal>recovery_prefetch</literal> to reset all the counters shown
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 609f0b1ad8..07c02ac3d7 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -359,6 +359,15 @@ static const PgStat_KindInfo pgstat_kind_infos[PGSTAT_NUM_KINDS] = {
.snapshot_cb = pgstat_checkpointer_snapshot_cb,
},
+ [PGSTAT_KIND_IOOPS] = {
+ .name = "io_ops",
+
+ .fixed_amount = true,
+
+ .reset_all_cb = pgstat_io_ops_reset_all_cb,
+ .snapshot_cb = pgstat_io_ops_snapshot_cb,
+ },
+
[PGSTAT_KIND_SLRU] = {
.name = "slru",
@@ -582,6 +591,7 @@ pgstat_report_stat(bool force)
/* Don't expend a clock check if nothing to do */
if (dlist_is_empty(&pgStatPending) &&
+ !have_ioopstats &&
!have_slrustats &&
!pgstat_have_pending_wal())
{
@@ -628,6 +638,9 @@ pgstat_report_stat(bool force)
/* flush database / relation / function / ... stats */
partial_flush |= pgstat_flush_pending_entries(nowait);
+ /* flush IO Operations stats */
+ partial_flush |= pgstat_flush_io_ops(nowait);
+
/* flush wal stats */
partial_flush |= pgstat_flush_wal(nowait);
@@ -1321,6 +1334,14 @@ pgstat_write_statsfile(void)
pgstat_build_snapshot_fixed(PGSTAT_KIND_CHECKPOINTER);
write_chunk_s(fpout, &pgStatLocal.snapshot.checkpointer);
+ /*
+ * Write IO Operations stats struct
+ */
+ pgstat_build_snapshot_fixed(PGSTAT_KIND_IOOPS);
+ write_chunk_s(fpout, &pgStatLocal.snapshot.io_ops.stat_reset_timestamp);
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ write_chunk_s(fpout, &pgStatLocal.snapshot.io_ops.stats[i]);
+
/*
* Write SLRU stats struct
*/
@@ -1495,6 +1516,20 @@ pgstat_read_statsfile(void)
if (!read_chunk_s(fpin, &shmem->checkpointer.stats))
goto error;
+ /*
+ * Read IO Operations stats struct
+ */
+ if (!read_chunk_s(fpin, &shmem->io_ops.stat_reset_timestamp))
+ goto error;
+
+ for (int bktype = 0; bktype < BACKEND_NUM_TYPES; bktype++)
+ {
+ if (!read_chunk_s(fpin, &shmem->io_ops.stats[bktype].data))
+ goto error;
+ pgstat_backend_io_stats_assert_well_formed(shmem->io_ops.stats[bktype].data,
+ (BackendType) bktype);
+ }
+
/*
* Read SLRU stats struct
*/
diff --git a/src/backend/utils/activity/pgstat_bgwriter.c b/src/backend/utils/activity/pgstat_bgwriter.c
index fbb1edc527..3d7f90a1b7 100644
--- a/src/backend/utils/activity/pgstat_bgwriter.c
+++ b/src/backend/utils/activity/pgstat_bgwriter.c
@@ -24,7 +24,7 @@ PgStat_BgWriterStats PendingBgWriterStats = {0};
/*
- * Report bgwriter statistics
+ * Report bgwriter and IO Operation statistics
*/
void
pgstat_report_bgwriter(void)
@@ -56,6 +56,11 @@ pgstat_report_bgwriter(void)
* Clear out the statistics buffer, so it can be re-used.
*/
MemSet(&PendingBgWriterStats, 0, sizeof(PendingBgWriterStats));
+
+ /*
+ * Report IO Operations statistics
+ */
+ pgstat_flush_io_ops(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_checkpointer.c b/src/backend/utils/activity/pgstat_checkpointer.c
index af8d513e7b..cfcf127210 100644
--- a/src/backend/utils/activity/pgstat_checkpointer.c
+++ b/src/backend/utils/activity/pgstat_checkpointer.c
@@ -24,7 +24,7 @@ PgStat_CheckpointerStats PendingCheckpointerStats = {0};
/*
- * Report checkpointer statistics
+ * Report checkpointer and IO Operation statistics
*/
void
pgstat_report_checkpointer(void)
@@ -62,6 +62,11 @@ pgstat_report_checkpointer(void)
* Clear out the statistics buffer, so it can be re-used.
*/
MemSet(&PendingCheckpointerStats, 0, sizeof(PendingCheckpointerStats));
+
+ /*
+ * Report IO Operation statistics
+ */
+ pgstat_flush_io_ops(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_io_ops.c b/src/backend/utils/activity/pgstat_io_ops.c
index e1750b965f..237b7da6d2 100644
--- a/src/backend/utils/activity/pgstat_io_ops.c
+++ b/src/backend/utils/activity/pgstat_io_ops.c
@@ -20,6 +20,42 @@
#include "utils/pgstat_internal.h"
static PgStat_IOContextOps pending_IOOpStats;
+bool have_ioopstats = false;
+
+
+/*
+ * Helper function to accumulate source PgStat_IOOpCounters into target
+ * PgStat_IOOpCounters. If either of the passed-in PgStat_IOOpCounters is a
+ * member of PgStatShared_IOContextOps, the caller is responsible for ensuring
+ * that the appropriate lock is held.
+ */
+static void
+pgstat_accum_io_op(PgStat_IOOpCounters *target, PgStat_IOOpCounters *source, IOOp io_op)
+{
+ switch (io_op)
+ {
+ case IOOP_ACQUIRE:
+ target->acquires += source->acquires;
+ return;
+ case IOOP_EXTEND:
+ target->extends += source->extends;
+ return;
+ case IOOP_FSYNC:
+ target->fsyncs += source->fsyncs;
+ return;
+ case IOOP_HIT:
+ target->hits += source->hits;
+ return;
+ case IOOP_READ:
+ target->reads += source->reads;
+ return;
+ case IOOP_WRITE:
+ target->writes += source->writes;
+ return;
+ }
+
+ elog(ERROR, "unrecognized IOOp value: %d", io_op);
+}
void
pgstat_count_io_op(IOOp io_op, IOContext io_context)
@@ -54,6 +90,78 @@ pgstat_count_io_op(IOOp io_op, IOContext io_context)
break;
}
+ have_ioopstats = true;
+}
+
+PgStat_BackendIOContextOps *
+pgstat_fetch_backend_io_context_ops(void)
+{
+ pgstat_snapshot_fixed(PGSTAT_KIND_IOOPS);
+
+ return &pgStatLocal.snapshot.io_ops;
+}
+
+/*
+ * Flush out locally pending IO Operation statistics entries
+ *
+ * If no stats have been recorded, this function returns false.
+ *
+ * If nowait is true and the lock cannot be acquired, this function returns
+ * true without flushing. Otherwise, it returns false.
+ */
+bool
+pgstat_flush_io_ops(bool nowait)
+{
+ PgStatShared_IOContextOps *type_shstats;
+ bool expect_backend_stats = true;
+
+ if (!have_ioopstats)
+ return false;
+
+ type_shstats =
+ &pgStatLocal.shmem->io_ops.stats[MyBackendType];
+
+ if (!nowait)
+ LWLockAcquire(&type_shstats->lock, LW_EXCLUSIVE);
+ else if (!LWLockConditionalAcquire(&type_shstats->lock, LW_EXCLUSIVE))
+ return true;
+
+ expect_backend_stats = pgstat_io_op_stats_collected(MyBackendType);
+
+ for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ {
+ PgStat_IOOpCounters *sharedent = &type_shstats->data[io_context];
+ PgStat_IOOpCounters *pendingent = &pending_IOOpStats.data[io_context];
+
+ if (!expect_backend_stats ||
+ !pgstat_bktype_io_context_valid(MyBackendType, (IOContext) io_context))
+ {
+ pgstat_io_context_ops_assert_zero(sharedent);
+ pgstat_io_context_ops_assert_zero(pendingent);
+ continue;
+ }
+
+ for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
+ {
+ if (!(pgstat_io_op_valid(MyBackendType, (IOContext) io_context,
+ (IOOp) io_op)))
+ {
+ pgstat_io_op_assert_zero(sharedent, (IOOp) io_op);
+ pgstat_io_op_assert_zero(pendingent, (IOOp) io_op);
+ continue;
+ }
+
+ pgstat_accum_io_op(sharedent, pendingent, (IOOp) io_op);
+ }
+ }
+
+ LWLockRelease(&type_shstats->lock);
+
+ memset(&pending_IOOpStats, 0, sizeof(pending_IOOpStats));
+
+ have_ioopstats = false;
+
+ return false;
}
const char *
@@ -98,6 +206,56 @@ pgstat_io_op_desc(IOOp io_op)
elog(ERROR, "unrecognized IOOp value: %d", io_op);
}
+void
+pgstat_io_ops_reset_all_cb(TimestampTz ts)
+{
+ PgStatShared_BackendIOContextOps *backends_stats_shmem = &pgStatLocal.shmem->io_ops;
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ PgStatShared_IOContextOps *stats_shmem = &backends_stats_shmem->stats[i];
+
+ LWLockAcquire(&stats_shmem->lock, LW_EXCLUSIVE);
+
+ /*
+ * Use the lock in the first BackendType's PgStatShared_IOContextOps to
+ * protect the reset timestamp as well.
+ */
+ if (i == 0)
+ backends_stats_shmem->stat_reset_timestamp = ts;
+
+ memset(stats_shmem->data, 0, sizeof(stats_shmem->data));
+ LWLockRelease(&stats_shmem->lock);
+ }
+}
+
+void
+pgstat_io_ops_snapshot_cb(void)
+{
+ PgStatShared_BackendIOContextOps *backends_stats_shmem = &pgStatLocal.shmem->io_ops;
+ PgStat_BackendIOContextOps *backends_stats_snap = &pgStatLocal.snapshot.io_ops;
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ PgStatShared_IOContextOps *stats_shmem = &backends_stats_shmem->stats[i];
+ PgStat_IOContextOps *stats_snap = &backends_stats_snap->stats[i];
+
+ LWLockAcquire(&stats_shmem->lock, LW_SHARED);
+
+ /*
+ * Use the lock in the first BackendType's PgStatShared_IOContextOps to
+ * protect the reset timestamp as well.
+ */
+ if (i == 0)
+ backends_stats_snap->stat_reset_timestamp =
+ backends_stats_shmem->stat_reset_timestamp;
+
+ memcpy(stats_snap->data, stats_shmem->data, sizeof(stats_shmem->data));
+ LWLockRelease(&stats_shmem->lock);
+ }
+}
+
/*
* IO Operation statistics are not collected for all BackendTypes.
*
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index a846d9ffb6..7a2fd1ccf9 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -205,7 +205,7 @@ pgstat_drop_relation(Relation rel)
}
/*
- * Report that the table was just vacuumed.
+ * Report that the table was just vacuumed and flush IO Operation statistics.
*/
void
pgstat_report_vacuum(Oid tableoid, bool shared,
@@ -257,10 +257,18 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
}
pgstat_unlock_entry(entry_ref);
+
+ /*
+ * Flush IO Operations statistics now. pgstat_report_stat() will also
+ * flush IO Operation stats, but it will not be called until an entire
+ * autovacuum cycle is done -- which will likely vacuum many relations --
+ * or until the VACUUM command has processed all tables and committed.
+ */
+ pgstat_flush_io_ops(false);
}
/*
- * Report that the table was just analyzed.
+ * Report that the table was just analyzed and flush IO Operation statistics.
*
* Caller must provide new live- and dead-tuples estimates, as well as a
* flag indicating whether to reset the changes_since_analyze counter.
@@ -340,6 +348,9 @@ pgstat_report_analyze(Relation rel,
}
pgstat_unlock_entry(entry_ref);
+
+ /* see pgstat_report_vacuum() */
+ pgstat_flush_io_ops(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_shmem.c b/src/backend/utils/activity/pgstat_shmem.c
index 9a4f037959..275a7be166 100644
--- a/src/backend/utils/activity/pgstat_shmem.c
+++ b/src/backend/utils/activity/pgstat_shmem.c
@@ -202,6 +202,10 @@ StatsShmemInit(void)
LWLockInitialize(&ctl->checkpointer.lock, LWTRANCHE_PGSTATS_DATA);
LWLockInitialize(&ctl->slru.lock, LWTRANCHE_PGSTATS_DATA);
LWLockInitialize(&ctl->wal.lock, LWTRANCHE_PGSTATS_DATA);
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ LWLockInitialize(&ctl->io_ops.stats[i].lock,
+ LWTRANCHE_PGSTATS_DATA);
}
else
{
diff --git a/src/backend/utils/activity/pgstat_wal.c b/src/backend/utils/activity/pgstat_wal.c
index 5a878bd115..9cac407b42 100644
--- a/src/backend/utils/activity/pgstat_wal.c
+++ b/src/backend/utils/activity/pgstat_wal.c
@@ -34,7 +34,7 @@ static WalUsage prevWalUsage;
/*
* Calculate how much WAL usage counters have increased and update
- * shared statistics.
+ * shared WAL and IO Operation statistics.
*
* Must be called by processes that generate WAL, that do not call
* pgstat_report_stat(), like walwriter.
@@ -43,6 +43,8 @@ void
pgstat_report_wal(bool force)
{
pgstat_flush_wal(force);
+
+ pgstat_flush_io_ops(force);
}
/*
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index eadd8464ff..edd73e5c25 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -2071,6 +2071,8 @@ pg_stat_reset_shared(PG_FUNCTION_ARGS)
pgstat_reset_of_kind(PGSTAT_KIND_BGWRITER);
pgstat_reset_of_kind(PGSTAT_KIND_CHECKPOINTER);
}
+ else if (strcmp(target, "io") == 0)
+ pgstat_reset_of_kind(PGSTAT_KIND_IOOPS);
else if (strcmp(target, "recovery_prefetch") == 0)
XLogPrefetchResetStats();
else if (strcmp(target, "wal") == 0)
@@ -2079,7 +2081,7 @@ pg_stat_reset_shared(PG_FUNCTION_ARGS)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("unrecognized reset target: \"%s\"", target),
- errhint("Target must be \"archiver\", \"bgwriter\", \"recovery_prefetch\", or \"wal\".")));
+ errhint("Target must be \"archiver\", \"bgwriter\", \"io\", \"recovery_prefetch\", or \"wal\".")));
PG_RETURN_VOID();
}
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index e7ebea4ff4..bf97162e83 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -331,6 +331,8 @@ typedef enum BackendType
B_WAL_WRITER,
} BackendType;
+#define BACKEND_NUM_TYPES (B_WAL_WRITER + 1)
+
extern PGDLLIMPORT BackendType MyBackendType;
extern const char *GetBackendTypeDesc(BackendType backendType);
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 155b0b2d48..b4f5d75949 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -49,6 +49,7 @@ typedef enum PgStat_Kind
PGSTAT_KIND_ARCHIVER,
PGSTAT_KIND_BGWRITER,
PGSTAT_KIND_CHECKPOINTER,
+ PGSTAT_KIND_IOOPS,
PGSTAT_KIND_SLRU,
PGSTAT_KIND_WAL,
} PgStat_Kind;
@@ -243,7 +244,7 @@ typedef struct PgStat_TableXactStatus
* ------------------------------------------------------------
*/
-#define PGSTAT_FILE_FORMAT_ID 0x01A5BCA7
+#define PGSTAT_FILE_FORMAT_ID 0x01A5BCA8
typedef struct PgStat_ArchiverStats
{
@@ -319,6 +320,12 @@ typedef struct PgStat_IOContextOps
PgStat_IOOpCounters data[IOCONTEXT_NUM_TYPES];
} PgStat_IOContextOps;
+typedef struct PgStat_BackendIOContextOps
+{
+ TimestampTz stat_reset_timestamp;
+ PgStat_IOContextOps stats[BACKEND_NUM_TYPES];
+} PgStat_BackendIOContextOps;
+
typedef struct PgStat_StatDBEntry
{
PgStat_Counter n_xact_commit;
@@ -501,6 +508,7 @@ extern PgStat_CheckpointerStats *pgstat_fetch_stat_checkpointer(void);
*/
extern void pgstat_count_io_op(IOOp io_op, IOContext io_context);
+extern PgStat_BackendIOContextOps *pgstat_fetch_backend_io_context_ops(void);
extern const char *pgstat_io_context_desc(IOContext io_context);
extern const char *pgstat_io_op_desc(IOOp io_op);
@@ -512,6 +520,79 @@ extern bool pgstat_expect_io_op(BackendType bktype, IOContext io_context, IOOp i
/* IO stats translation function in freelist.c */
extern IOContext IOContextForStrategy(BufferAccessStrategy bas);
+/*
+ * Functions to assert that invalid IO Operation counters are zero.
+ */
+static inline void
+pgstat_io_context_ops_assert_zero(PgStat_IOOpCounters *counters)
+{
+ Assert(counters->acquires == 0 && counters->extends == 0 &&
+ counters->fsyncs == 0 && counters->hits == 0 &&
+ counters->reads == 0 && counters->writes == 0);
+}
+
+static inline void
+pgstat_io_op_assert_zero(PgStat_IOOpCounters *counters, IOOp io_op)
+{
+ switch (io_op)
+ {
+ case IOOP_ACQUIRE:
+ Assert(counters->acquires == 0);
+ return;
+ case IOOP_EXTEND:
+ Assert(counters->extends == 0);
+ return;
+ case IOOP_FSYNC:
+ Assert(counters->fsyncs == 0);
+ return;
+ case IOOP_HIT:
+ Assert(counters->hits == 0);
+ return;
+ case IOOP_READ:
+ Assert(counters->reads == 0);
+ return;
+ case IOOP_WRITE:
+ Assert(counters->writes == 0);
+ return;
+ }
+
+ /* Should not reach here */
+ Assert(false);
+}
+
+/*
+ * Assert that stats have not been counted for any combination of IOContext and
+ * IOOp which are not valid for the passed-in BackendType. The passed-in array
+ * of PgStat_IOOpCounters must contain stats from the BackendType specified by
+ * the second parameter. Caller is responsible for any locking if the passed-in
+ * array of PgStat_IOOpCounters is a member of PgStatShared_IOContextOps.
+ */
+static inline void
+pgstat_backend_io_stats_assert_well_formed(PgStat_IOOpCounters
+ backend_io_context_ops[IOCONTEXT_NUM_TYPES], BackendType bktype)
+{
+ bool expect_backend_stats = true;
+
+ if (!pgstat_io_op_stats_collected(bktype))
+ expect_backend_stats = false;
+
+ for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ {
+ if (!expect_backend_stats ||
+ !pgstat_bktype_io_context_valid(bktype, (IOContext) io_context))
+ {
+ pgstat_io_context_ops_assert_zero(&backend_io_context_ops[io_context]);
+ continue;
+ }
+
+ for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
+ {
+ if (!pgstat_io_op_valid(bktype, (IOContext) io_context, (IOOp) io_op))
+ pgstat_io_op_assert_zero(&backend_io_context_ops[io_context],
+ (IOOp) io_op);
+ }
+ }
+}
/*
diff --git a/src/include/utils/pgstat_internal.h b/src/include/utils/pgstat_internal.h
index 40a3602855..3421c8a5c0 100644
--- a/src/include/utils/pgstat_internal.h
+++ b/src/include/utils/pgstat_internal.h
@@ -329,6 +329,25 @@ typedef struct PgStatShared_Checkpointer
PgStat_CheckpointerStats reset_offset;
} PgStatShared_Checkpointer;
+typedef struct PgStatShared_IOContextOps
+{
+ /*
+ * lock protects ->data. If this PgStatShared_IOContextOps is
+ * PgStatShared_BackendIOContextOps->stats[0], lock also protects
+ * PgStatShared_BackendIOContextOps->stat_reset_timestamp.
+ */
+ LWLock lock;
+ PgStat_IOOpCounters data[IOCONTEXT_NUM_TYPES];
+} PgStatShared_IOContextOps;
+
+typedef struct PgStatShared_BackendIOContextOps
+{
+ /* ->stat_reset_timestamp is protected by ->stats[0].lock */
+ TimestampTz stat_reset_timestamp;
+ PgStatShared_IOContextOps stats[BACKEND_NUM_TYPES];
+} PgStatShared_BackendIOContextOps;
+
+
typedef struct PgStatShared_SLRU
{
/* lock protects ->stats */
@@ -419,6 +438,7 @@ typedef struct PgStat_ShmemControl
PgStatShared_Archiver archiver;
PgStatShared_BgWriter bgwriter;
PgStatShared_Checkpointer checkpointer;
+ PgStatShared_BackendIOContextOps io_ops;
PgStatShared_SLRU slru;
PgStatShared_Wal wal;
} PgStat_ShmemControl;
@@ -442,6 +462,8 @@ typedef struct PgStat_Snapshot
PgStat_CheckpointerStats checkpointer;
+ PgStat_BackendIOContextOps io_ops;
+
PgStat_SLRUStats slru[SLRU_NUM_ELEMENTS];
PgStat_WalStats wal;
@@ -549,6 +571,15 @@ extern void pgstat_database_reset_timestamp_cb(PgStatShared_Common *header, Time
extern bool pgstat_function_flush_cb(PgStat_EntryRef *entry_ref, bool nowait);
+/*
+ * Functions in pgstat_io_ops.c
+ */
+
+extern void pgstat_io_ops_reset_all_cb(TimestampTz ts);
+extern void pgstat_io_ops_snapshot_cb(void);
+extern bool pgstat_flush_io_ops(bool nowait);
+
+
/*
* Functions in pgstat_relation.c
*/
@@ -641,6 +672,11 @@ extern void pgstat_create_transactional(PgStat_Kind kind, Oid dboid, Oid objoid)
extern PGDLLIMPORT PgStat_LocalState pgStatLocal;
+/*
+ * Variables in pgstat_io_ops.c
+ */
+
+extern PGDLLIMPORT bool have_ioopstats;
/*
* Variables in pgstat_slru.c
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 67218ec6f2..33c9362257 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2005,12 +2005,14 @@ PgFdwRelationInfo
PgFdwScanState
PgIfAddrCallback
PgStatShared_Archiver
+PgStatShared_BackendIOContextOps
PgStatShared_BgWriter
PgStatShared_Checkpointer
PgStatShared_Common
PgStatShared_Database
PgStatShared_Function
PgStatShared_HashEntry
+PgStatShared_IOContextOps
PgStatShared_Relation
PgStatShared_ReplSlot
PgStatShared_SLRU
@@ -2018,6 +2020,7 @@ PgStatShared_Subscription
PgStatShared_Wal
PgStat_ArchiverStats
PgStat_BackendFunctionEntry
+PgStat_BackendIOContextOps
PgStat_BackendSubEntry
PgStat_BgWriterStats
PgStat_CheckpointerStats
--
2.34.1
Attachment: v32-0001-Track-IO-operation-statistics-locally.patch (text/x-patch)
From caed7a11517799676a50570bad4b4d3bb412e42a Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Thu, 6 Oct 2022 12:23:25 -0400
Subject: [PATCH v32 1/3] Track IO operation statistics locally
Introduce "IOOp", an IO operation done by a backend, and "IOContext",
the source, target, or type of the IO. For example, the checkpointer may
write a shared buffer out. This would be counted as an IOOp "write" in
the IOContext IOCONTEXT_SHARED by BackendType "checkpointer".
Each IOOp (acquire, hit, read, write, extend, fsync) is counted per
IOContext (bulkread, bulkwrite, local, shared, or vacuum) through a call
to pgstat_count_io_op().
The primary concern of these statistics is IO operations on data blocks
during the course of normal database operations. IO operations done by,
for example, the archiver or syslogger are not counted in these
statistics. WAL IO, temporary file IO, and IO done directly through smgr*
functions (such as when building an index) are not yet counted but would
be useful future additions.
IOCONTEXT_LOCAL and IOCONTEXT_SHARED IOContexts concern operations on
local and shared buffers.
The IOCONTEXT_BULKREAD, IOCONTEXT_BULKWRITE, and IOCONTEXT_VACUUM
IOContexts concern IO operations on buffers as part of a
BufferAccessStrategy.
IOOP_ACQUIRE IOOps are counted in IOCONTEXT_SHARED and IOCONTEXT_LOCAL
IOContexts whenever a buffer is acquired through [Local]BufferAlloc().
IOOP_ACQUIRE IOOps are counted in the BufferAccessStrategy IOContexts
whenever a buffer already in the strategy ring is reused. IOOP_WRITE
IOOps are counted in the BufferAccessStrategy IOContexts whenever the
reused dirty buffer is written out.
Stats on IOOps in all IOContexts for a given backend are counted in a
backend's local memory. A subsequent commit will expose functions for
aggregating and viewing these stats.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Maciek Sakrejda <m.sakrejda@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
src/backend/postmaster/checkpointer.c | 13 ++
src/backend/storage/buffer/bufmgr.c | 59 +++++-
src/backend/storage/buffer/freelist.c | 52 ++++-
src/backend/storage/buffer/localbuf.c | 5 +
src/backend/storage/sync/sync.c | 2 +
src/backend/utils/activity/Makefile | 1 +
src/backend/utils/activity/meson.build | 1 +
src/backend/utils/activity/pgstat_io_ops.c | 229 +++++++++++++++++++++
src/include/pgstat.h | 61 ++++++
src/include/storage/buf_internals.h | 2 +-
src/include/storage/bufmgr.h | 7 +-
src/tools/pgindent/typedefs.list | 4 +
12 files changed, 423 insertions(+), 13 deletions(-)
create mode 100644 src/backend/utils/activity/pgstat_io_ops.c
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 5fc076fc14..4ea4e6a298 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -1116,6 +1116,19 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
if (!AmBackgroundWriterProcess())
CheckpointerShmem->num_backend_fsync++;
LWLockRelease(CheckpointerCommLock);
+
+ /*
+ * We have no way of knowing if the current IOContext is
+ * IOCONTEXT_SHARED or IOCONTEXT_[BULKREAD, BULKWRITE, VACUUM] at this
+ * point, so count the fsync as being in the IOCONTEXT_SHARED
+ * IOContext. This is probably okay, because the number of backend
+ * fsyncs doesn't say anything about the efficacy of the
+ * BufferAccessStrategy. And counting both fsyncs done in
+ * IOCONTEXT_SHARED and IOCONTEXT_[BULKREAD, BULKWRITE, VACUUM] under
+ * IOCONTEXT_SHARED is likely clearer when investigating the number of
+ * backend fsyncs.
+ */
+ pgstat_count_io_op(IOOP_FSYNC, IOCONTEXT_SHARED);
return false;
}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 6b95381481..1c14e305c1 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -482,7 +482,7 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
BlockNumber blockNum,
BufferAccessStrategy strategy,
bool *foundPtr);
-static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+static void FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOContext io_context);
static void FindAndDropRelationBuffers(RelFileLocator rlocator,
ForkNumber forkNum,
BlockNumber nForkBlock,
@@ -823,6 +823,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
BufferDesc *bufHdr;
Block bufBlock;
bool found;
+ IOContext io_context;
bool isExtend;
bool isLocalBuf = SmgrIsTemp(smgr);
@@ -833,6 +834,13 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
isExtend = (blockNum == P_NEW);
+ if (strategy)
+ io_context = IOContextForStrategy(strategy);
+ else if (isLocalBuf)
+ io_context = IOCONTEXT_LOCAL;
+ else
+ io_context = IOCONTEXT_SHARED;
+
TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
smgr->smgr_rlocator.locator.spcOid,
smgr->smgr_rlocator.locator.dbOid,
@@ -886,6 +894,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
/* if it was already in the buffer pool, we're done */
if (found)
{
+ pgstat_count_io_op(IOOP_HIT, io_context);
+
if (!isExtend)
{
/* Just need to update stats before we exit */
@@ -986,10 +996,14 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
*/
Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID)); /* spinlock not needed */
- bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
+ if (isLocalBuf)
+ bufBlock = LocalBufHdrGetBlock(bufHdr);
+ else
+ bufBlock = BufHdrGetBlock(bufHdr);
if (isExtend)
{
+ pgstat_count_io_op(IOOP_EXTEND, io_context);
/* new buffers are zero-filled */
MemSet((char *) bufBlock, 0, BLCKSZ);
/* don't set checksum for all-zero page */
@@ -1020,6 +1034,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
smgrread(smgr, forkNum, blockNum, (char *) bufBlock);
+ pgstat_count_io_op(IOOP_READ, io_context);
+
if (track_io_timing)
{
INSTR_TIME_SET_CURRENT(io_time);
@@ -1190,6 +1206,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
/* Loop here in case we have to try another victim buffer */
for (;;)
{
+ bool from_ring;
+
/*
* Ensure, while the spinlock's not yet held, that there's a free
* refcount entry.
@@ -1200,7 +1218,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* Select a victim buffer. The buffer is returned with its header
* spinlock still held!
*/
- buf = StrategyGetBuffer(strategy, &buf_state);
+ buf = StrategyGetBuffer(strategy, &buf_state, &from_ring);
Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
@@ -1237,6 +1255,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf),
LW_SHARED))
{
+ IOContext io_context;
+
/*
* If using a nondefault strategy, and writing the buffer
* would require a WAL flush, let the strategy decide whether
@@ -1263,13 +1283,28 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
}
}
+ /*
+ * When a strategy is in use, if the target dirty buffer is an
+ * existing strategy buffer being reused, count this as a
+ * strategy write for the purposes of IO Operations statistics
+ * tracking.
+ *
+ * All dirty shared buffers upon first being added to the ring
+ * will be counted as shared buffer writes.
+ *
+ * When a strategy is not in use, the write can only be a
+ * "regular" write of a dirty shared buffer.
+ */
+
+ io_context = from_ring ? IOContextForStrategy(strategy) : IOCONTEXT_SHARED;
+
/* OK, do the I/O */
TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(forkNum, blockNum,
smgr->smgr_rlocator.locator.spcOid,
smgr->smgr_rlocator.locator.dbOid,
smgr->smgr_rlocator.locator.relNumber);
- FlushBuffer(buf, NULL);
+ FlushBuffer(buf, NULL, io_context);
LWLockRelease(BufferDescriptorGetContentLock(buf));
ScheduleBufferTagForWriteback(&BackendWritebackContext,
@@ -2570,7 +2605,7 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOCONTEXT_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
@@ -2820,7 +2855,7 @@ BufferGetTag(Buffer buffer, RelFileLocator *rlocator, ForkNumber *forknum,
* as the second parameter. If not, pass NULL.
*/
static void
-FlushBuffer(BufferDesc *buf, SMgrRelation reln)
+FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOContext io_context)
{
XLogRecPtr recptr;
ErrorContextCallback errcallback;
@@ -2900,6 +2935,8 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
*/
bufToWrite = PageSetChecksumCopy((Page) bufBlock, buf->tag.blockNum);
+ pgstat_count_io_op(IOOP_WRITE, io_context);
+
if (track_io_timing)
INSTR_TIME_SET_CURRENT(io_start);
@@ -3551,6 +3588,8 @@ FlushRelationBuffers(Relation rel)
localpage,
false);
+ pgstat_count_io_op(IOOP_WRITE, IOCONTEXT_LOCAL);
+
buf_state &= ~(BM_DIRTY | BM_JUST_DIRTIED);
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
@@ -3586,7 +3625,7 @@ FlushRelationBuffers(Relation rel)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, RelationGetSmgr(rel));
+ FlushBuffer(bufHdr, RelationGetSmgr(rel), IOCONTEXT_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr);
}
@@ -3684,7 +3723,7 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, srelent->srel);
+ FlushBuffer(bufHdr, srelent->srel, IOCONTEXT_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr);
}
@@ -3894,7 +3933,7 @@ FlushDatabaseBuffers(Oid dbid)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOCONTEXT_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr);
}
@@ -3921,7 +3960,7 @@ FlushOneBuffer(Buffer buffer)
Assert(LWLockHeldByMe(BufferDescriptorGetContentLock(bufHdr)));
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOCONTEXT_SHARED);
}
/*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 990e081aae..5fd65c17d1 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -15,6 +15,7 @@
*/
#include "postgres.h"
+#include "pgstat.h"
#include "port/atomics.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
@@ -198,13 +199,15 @@ have_free_buffer(void)
* return the buffer with the buffer header spinlock still held.
*/
BufferDesc *
-StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
+StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_ring)
{
BufferDesc *buf;
int bgwprocno;
int trycounter;
uint32 local_buf_state; /* to avoid repeated (de-)referencing */
+ *from_ring = false;
+
/*
* If given a strategy object, see whether it can select a buffer. We
* assume strategy objects don't need buffer_strategy_lock.
@@ -213,7 +216,23 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
{
buf = GetBufferFromRing(strategy, buf_state);
if (buf != NULL)
+ {
+ /*
+ * When a BufferAccessStrategy is in use, reused buffers from the
+ * strategy ring will be counted as IOCONTEXT_BULKREAD,
+ * IOCONTEXT_BULKWRITE, or IOCONTEXT_VACUUM acquisitions for the
+ * purposes of IO Operation statistics tracking.
+ *
+ * However, even when a strategy is in use, if a new buffer must
+ * be acquired from shared buffers and added to the ring, this is
+ * counted instead as an IOCONTEXT_SHARED acquisition. So, only
+ * reused buffers are counted as having been acquired in a
+ * BufferAccessStrategy IOContext.
+ */
+ *from_ring = true;
+ pgstat_count_io_op(IOOP_ACQUIRE, IOContextForStrategy(strategy));
return buf;
+ }
}
/*
@@ -247,6 +266,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
* the rate of buffer consumption. Note that buffers recycled by a
* strategy object are intentionally not counted here.
*/
+ pgstat_count_io_op(IOOP_ACQUIRE, IOCONTEXT_SHARED);
pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);
/*
@@ -670,6 +690,36 @@ AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
strategy->buffers[strategy->current] = BufferDescriptorGetBuffer(buf);
}
+/*
+ * Utility function returning the IOContext of a given BufferAccessStrategy's
+ * strategy ring.
+ */
+IOContext
+IOContextForStrategy(BufferAccessStrategy strategy)
+{
+ Assert(strategy);
+
+ switch (strategy->btype)
+ {
+ case BAS_NORMAL:
+
+ /*
+ * Currently, GetAccessStrategy() returns NULL for
+ * BufferAccessStrategyType BAS_NORMAL, so this case is unlikely
+ * to be hit.
+ */
+ return IOCONTEXT_SHARED;
+ case BAS_BULKREAD:
+ return IOCONTEXT_BULKREAD;
+ case BAS_BULKWRITE:
+ return IOCONTEXT_BULKWRITE;
+ case BAS_VACUUM:
+ return IOCONTEXT_VACUUM;
+ }
+
+ elog(ERROR, "unrecognized BufferAccessStrategyType: %d", strategy->btype);
+}
+
/*
* StrategyRejectBuffer -- consider rejecting a dirty buffer
*
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 30d67d1c40..c2548f2b0b 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -18,6 +18,7 @@
#include "access/parallel.h"
#include "catalog/catalog.h"
#include "executor/instrument.h"
+#include "pgstat.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "utils/guc_hooks.h"
@@ -196,6 +197,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
LocalRefCount[b]++;
ResourceOwnerRememberBuffer(CurrentResourceOwner,
BufferDescriptorGetBuffer(bufHdr));
+
+ pgstat_count_io_op(IOOP_ACQUIRE, IOCONTEXT_LOCAL);
break;
}
}
@@ -226,6 +229,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
localpage,
false);
+ pgstat_count_io_op(IOOP_WRITE, IOCONTEXT_LOCAL);
+
/* Mark not-dirty now in case we error out below */
buf_state &= ~BM_DIRTY;
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 9d6a9e9109..5718b52fb5 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -432,6 +432,8 @@ ProcessSyncRequests(void)
total_elapsed += elapsed;
processed++;
+ pgstat_count_io_op(IOOP_FSYNC, IOCONTEXT_SHARED);
+
if (log_checkpoints)
elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f ms",
processed,
diff --git a/src/backend/utils/activity/Makefile b/src/backend/utils/activity/Makefile
index a2e8507fd6..0098785089 100644
--- a/src/backend/utils/activity/Makefile
+++ b/src/backend/utils/activity/Makefile
@@ -22,6 +22,7 @@ OBJS = \
pgstat_checkpointer.o \
pgstat_database.o \
pgstat_function.o \
+ pgstat_io_ops.o \
pgstat_relation.o \
pgstat_replslot.o \
pgstat_shmem.o \
diff --git a/src/backend/utils/activity/meson.build b/src/backend/utils/activity/meson.build
index 5b3b558a67..1038324c32 100644
--- a/src/backend/utils/activity/meson.build
+++ b/src/backend/utils/activity/meson.build
@@ -7,6 +7,7 @@ backend_sources += files(
'pgstat_checkpointer.c',
'pgstat_database.c',
'pgstat_function.c',
+ 'pgstat_io_ops.c',
'pgstat_relation.c',
'pgstat_replslot.c',
'pgstat_shmem.c',
diff --git a/src/backend/utils/activity/pgstat_io_ops.c b/src/backend/utils/activity/pgstat_io_ops.c
new file mode 100644
index 0000000000..e1750b965f
--- /dev/null
+++ b/src/backend/utils/activity/pgstat_io_ops.c
@@ -0,0 +1,229 @@
+/* -------------------------------------------------------------------------
+ *
+ * pgstat_io_ops.c
+ * Implementation of IO operation statistics.
+ *
+ * This file contains the implementation of IO operation statistics. It is kept
+ * separate from pgstat.c to enforce the line between the statistics access /
+ * storage implementation and the details about individual types of
+ * statistics.
+ *
+ * Copyright (c) 2021-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/utils/activity/pgstat_io_ops.c
+ * -------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "utils/pgstat_internal.h"
+
+static PgStat_IOContextOps pending_IOOpStats;
+
+void
+pgstat_count_io_op(IOOp io_op, IOContext io_context)
+{
+ PgStat_IOOpCounters *pending_counters;
+
+ Assert(io_context < IOCONTEXT_NUM_TYPES);
+ Assert(io_op < IOOP_NUM_TYPES);
+ Assert(pgstat_expect_io_op(MyBackendType, io_context, io_op));
+
+ pending_counters = &pending_IOOpStats.data[io_context];
+
+ switch (io_op)
+ {
+ case IOOP_ACQUIRE:
+ pending_counters->acquires++;
+ break;
+ case IOOP_EXTEND:
+ pending_counters->extends++;
+ break;
+ case IOOP_FSYNC:
+ pending_counters->fsyncs++;
+ break;
+ case IOOP_HIT:
+ pending_counters->hits++;
+ break;
+ case IOOP_READ:
+ pending_counters->reads++;
+ break;
+ case IOOP_WRITE:
+ pending_counters->writes++;
+ break;
+ }
+
+}
+
+const char *
+pgstat_io_context_desc(IOContext io_context)
+{
+ switch (io_context)
+ {
+ case IOCONTEXT_BULKREAD:
+ return "bulkread";
+ case IOCONTEXT_BULKWRITE:
+ return "bulkwrite";
+ case IOCONTEXT_LOCAL:
+ return "local";
+ case IOCONTEXT_SHARED:
+ return "shared";
+ case IOCONTEXT_VACUUM:
+ return "vacuum";
+ }
+
+ elog(ERROR, "unrecognized IOContext value: %d", io_context);
+}
+
+const char *
+pgstat_io_op_desc(IOOp io_op)
+{
+ switch (io_op)
+ {
+ case IOOP_ACQUIRE:
+ return "acquire";
+ case IOOP_EXTEND:
+ return "extend";
+ case IOOP_FSYNC:
+ return "fsync";
+ case IOOP_HIT:
+ return "hit";
+ case IOOP_READ:
+ return "read";
+ case IOOP_WRITE:
+ return "write";
+ }
+
+ elog(ERROR, "unrecognized IOOp value: %d", io_op);
+}
+
+/*
+* IO Operation statistics are not collected for all BackendTypes.
+*
+* The following BackendTypes do not participate in the cumulative stats
+* subsystem or do not do IO operations worth reporting statistics on:
+* - Syslogger because it is not connected to shared memory
+* - Archiver because most relevant archiving IO is delegated to a
+* specialized command or module
+* - WAL Receiver and WAL Writer IO is not tracked in pg_stat_io for now
+*
+* Function returns true if BackendType participates in the cumulative stats
+* subsystem for IO Operations and false if it does not.
+*/
+bool
+pgstat_io_op_stats_collected(BackendType bktype)
+{
+ return bktype != B_INVALID && bktype != B_ARCHIVER && bktype != B_LOGGER &&
+ bktype != B_WAL_RECEIVER && bktype != B_WAL_WRITER;
+}
+
+/*
+ * Some BackendTypes do not perform IO operations in certain IOContexts. Check
+ * that the given BackendType is expected to do IO in the given IOContext.
+ */
+bool
+pgstat_bktype_io_context_valid(BackendType bktype, IOContext io_context)
+{
+ bool no_local;
+
+ /*
+ * In core Postgres, only regular backends and WAL Sender processes
+ * executing queries should use local buffers. Parallel workers will not
+ * use local buffers (see InitLocalBuffers()); however, extensions
+ * leveraging background workers have no such limitation, so track IO
+ * Operations in IOCONTEXT_LOCAL for BackendType B_BG_WORKER.
+ */
+ no_local = bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER || bktype
+ == B_CHECKPOINTER || bktype == B_AUTOVAC_WORKER || bktype ==
+ B_STANDALONE_BACKEND || bktype == B_STARTUP;
+
+ if (io_context == IOCONTEXT_LOCAL && no_local)
+ return false;
+
+ /*
+ * Some BackendTypes do not currently perform any IO operations in certain
+ * IOContexts, and, while it may not be inherently incorrect for them to
+ * do so, excluding those rows from the view makes the view easier to use.
+ */
+ if ((io_context == IOCONTEXT_BULKREAD || io_context == IOCONTEXT_BULKWRITE
+ || io_context == IOCONTEXT_VACUUM) && (bktype == B_CHECKPOINTER
+ || bktype == B_BG_WRITER))
+ return false;
+
+ if (io_context == IOCONTEXT_VACUUM && bktype == B_AUTOVAC_LAUNCHER)
+ return false;
+
+ if (io_context == IOCONTEXT_BULKWRITE && (bktype == B_AUTOVAC_WORKER ||
+ bktype == B_AUTOVAC_LAUNCHER))
+ return false;
+
+ return true;
+}
+
+/*
+ * Some BackendTypes will never do certain IOOps and some IOOps should not
+ * occur in certain IOContexts. Check that the given IOOp is valid for the
+ * given BackendType in the given IOContext. Note that there are currently no
+ * cases of an IOOp being invalid for a particular BackendType only within a
+ * certain IOContext.
+ */
+bool
+pgstat_io_op_valid(BackendType bktype, IOContext io_context, IOOp io_op)
+{
+ bool strategy_io_context;
+
+ /*
+ * Some BackendTypes should never track IO Operation statistics.
+ */
+ Assert(pgstat_io_op_stats_collected(bktype));
+
+ /*
+ * Some BackendTypes will not do certain IOOps.
+ */
+ if ((bktype == B_BG_WRITER || bktype == B_CHECKPOINTER) &&
+ (io_op == IOOP_READ || io_op == IOOP_ACQUIRE || io_op == IOOP_HIT))
+ return false;
+
+ if ((bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER || bktype ==
+ B_CHECKPOINTER) && io_op == IOOP_EXTEND)
+ return false;
+
+ /*
+ * Some IOOps are not valid in certain IOContexts
+ */
+ if (io_op == IOOP_EXTEND && io_context == IOCONTEXT_BULKREAD)
+ return false;
+
+ /*
+ * Temporary tables using local buffers are not logged and thus do not
+ * require fsync'ing.
+ *
+ * IOOP_FSYNC IOOps done by a backend using a BufferAccessStrategy are
+ * counted in the IOCONTEXT_SHARED IOContext. See comment in
+ * ForwardSyncRequest() for more details.
+ */
+ strategy_io_context = io_context == IOCONTEXT_BULKREAD || io_context ==
+ IOCONTEXT_BULKWRITE || io_context == IOCONTEXT_VACUUM;
+
+ if ((io_context == IOCONTEXT_LOCAL || strategy_io_context) &&
+ io_op == IOOP_FSYNC)
+ return false;
+
+ return true;
+}
+
+bool
+pgstat_expect_io_op(BackendType bktype, IOContext io_context, IOOp io_op)
+{
+ if (!pgstat_io_op_stats_collected(bktype))
+ return false;
+
+ if (!pgstat_bktype_io_context_valid(bktype, io_context))
+ return false;
+
+ if (!(pgstat_io_op_valid(bktype, io_context, io_op)))
+ return false;
+
+ return true;
+}
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index ad7334a0d2..155b0b2d48 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -14,6 +14,7 @@
#include "datatype/timestamp.h"
#include "portability/instr_time.h"
#include "postmaster/pgarch.h" /* for MAX_XFN_CHARS */
+#include "storage/buf.h"
#include "utils/backend_progress.h" /* for backward compatibility */
#include "utils/backend_status.h" /* for backward compatibility */
#include "utils/relcache.h"
@@ -276,6 +277,48 @@ typedef struct PgStat_CheckpointerStats
PgStat_Counter buf_fsync_backend;
} PgStat_CheckpointerStats;
+/*
+ * Types related to counting IO Operations for various IO Contexts
+ */
+
+typedef enum IOOp
+{
+ IOOP_ACQUIRE = 0,
+ IOOP_EXTEND,
+ IOOP_FSYNC,
+ IOOP_HIT,
+ IOOP_READ,
+ IOOP_WRITE,
+} IOOp;
+
+#define IOOP_NUM_TYPES (IOOP_WRITE + 1)
+
+typedef enum IOContext
+{
+ IOCONTEXT_BULKREAD = 0,
+ IOCONTEXT_BULKWRITE,
+ IOCONTEXT_LOCAL,
+ IOCONTEXT_SHARED,
+ IOCONTEXT_VACUUM,
+} IOContext;
+
+#define IOCONTEXT_NUM_TYPES (IOCONTEXT_VACUUM + 1)
+
+typedef struct PgStat_IOOpCounters
+{
+ PgStat_Counter acquires;
+ PgStat_Counter extends;
+ PgStat_Counter fsyncs;
+ PgStat_Counter hits;
+ PgStat_Counter reads;
+ PgStat_Counter writes;
+} PgStat_IOOpCounters;
+
+typedef struct PgStat_IOContextOps
+{
+ PgStat_IOOpCounters data[IOCONTEXT_NUM_TYPES];
+} PgStat_IOContextOps;
+
typedef struct PgStat_StatDBEntry
{
PgStat_Counter n_xact_commit;
@@ -453,6 +496,24 @@ extern void pgstat_report_checkpointer(void);
extern PgStat_CheckpointerStats *pgstat_fetch_stat_checkpointer(void);
+/*
+ * Functions in pgstat_io_ops.c
+ */
+
+extern void pgstat_count_io_op(IOOp io_op, IOContext io_context);
+extern const char *pgstat_io_context_desc(IOContext io_context);
+extern const char *pgstat_io_op_desc(IOOp io_op);
+
+/* Validation functions in pgstat_io_ops.c */
+extern bool pgstat_io_op_stats_collected(BackendType bktype);
+extern bool pgstat_bktype_io_context_valid(BackendType bktype, IOContext io_context);
+extern bool pgstat_io_op_valid(BackendType bktype, IOContext io_context, IOOp io_op);
+extern bool pgstat_expect_io_op(BackendType bktype, IOContext io_context, IOOp io_op);
+
+/* IO stats translation function in freelist.c */
+extern IOContext IOContextForStrategy(BufferAccessStrategy bas);
+
+
/*
* Functions in pgstat_database.c
*/
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 406db6be78..50d7e586e9 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -392,7 +392,7 @@ extern void ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *
/* freelist.c */
extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
- uint32 *buf_state);
+ uint32 *buf_state, bool *from_ring);
extern void StrategyFreeBuffer(BufferDesc *buf);
extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
BufferDesc *buf);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 6f4dfa0960..d0eed71f63 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -23,7 +23,12 @@
typedef void *Block;
-/* Possible arguments for GetAccessStrategy() */
+/*
+ * Possible arguments for GetAccessStrategy().
+ *
+ * If adding a new BufferAccessStrategyType, also add a new IOContext so
+ * statistics on IO operations using this strategy are tracked.
+ */
typedef enum BufferAccessStrategyType
{
BAS_NORMAL, /* Normal random access */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 97c9bc1861..67218ec6f2 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1106,7 +1106,9 @@ ID
INFIX
INT128
INTERFACE_INFO
+IOContext
IOFuncSelector
+IOOp
IPCompareMethod
ITEM
IV
@@ -2026,6 +2028,8 @@ PgStat_FetchConsistency
PgStat_FunctionCallUsage
PgStat_FunctionCounts
PgStat_HashKey
+PgStat_IOContextOps
+PgStat_IOOpCounters
PgStat_Kind
PgStat_KindInfo
PgStat_LocalState
--
2.34.1
I've gone ahead and implemented option 1 (commented below).
On Thu, Oct 6, 2022 at 6:23 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
v31 failed in CI, so
I've attached v32 which has a few issues fixed:
- addressed some compiler warnings I hadn't noticed locally
- autovac launcher and worker do indeed use bulkread strategy if they
end up starting before critical indexes have loaded and end up doing a
sequential scan of some catalog tables, so I have changed the
restrictions on BackendTypes allowed to track IO Operations in
IOCONTEXT_BULKREAD
- changed the name of the column "fsynced" to "files_synced" to make it
more clear what unit it is in (and that the unit differs from that of
the "unit" column)

In an off-list discussion with Andres, he mentioned that he thought
buffers reused by a BufferAccessStrategy should be split from buffers
"acquired" and that "acquired" should be renamed "clocksweeps".

I have started doing this, but for BufferAccessStrategy IO there are a
few choices about how we want to count the clocksweeps:

Currently the following situations are counted under the following
IOContexts and IOOps:

IOCONTEXT_[VACUUM,BULKREAD,BULKWRITE], IOOP_ACQUIRE
- reuse a buffer from the ring

IOCONTEXT_SHARED, IOOP_ACQUIRE
- add a buffer to the strategy ring initially
- add a new shared buffer to the ring when all the existing buffers in
the ring are pinned

And in the new paradigm, I think these are two good options:
1)
IOCONTEXT_[VACUUM,BULKREAD,BULKWRITE], IOOP_CLOCKSWEEP
- add a buffer to the strategy ring initially
- add a new shared buffer to the ring when all the existing buffers in
the ring are pinned

IOCONTEXT_[VACUUM,BULKREAD,BULKWRITE], IOOP_REUSE
- reuse a buffer from the ring
I've implemented this option in attached v33.
2)
IOCONTEXT_[VACUUM,BULKREAD,BULKWRITE], IOOP_CLOCKSWEEP
- add a buffer to the strategy ring initially

IOCONTEXT_[VACUUM,BULKREAD,BULKWRITE], IOOP_REUSE
- reuse a buffer from the ring

IOCONTEXT_SHARED, IOOP_CLOCKSWEEP
- add a new shared buffer to the ring when all the existing buffers in
the ring are pinned
- Melanie
Attachments:
v33-0002-Aggregate-IO-operation-stats-per-BackendType.patch (text/x-patch)
From 6a83f0028a69a56243fa5b036299185766b80629 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Thu, 6 Oct 2022 12:23:38 -0400
Subject: [PATCH v33 2/4] Aggregate IO operation stats per BackendType
Stats on IOOps for all IOContexts for a backend are tracked locally. Add
functionality for backends to flush these stats to shared memory and
accumulate them with those from all other backends, exited and live.
Also add reset and snapshot functions used by cumulative stats system
for management of these statistics.
The aggregated stats in shared memory could be extended in the future
with per-backend stats -- useful for per connection IO statistics and
monitoring.
Some BackendTypes will not flush their pending statistics at regular
intervals and explicitly call pgstat_flush_io_ops() during the course of
normal operations to flush their backend-local IO operation statistics
to shared memory in a timely manner.
Because not all BackendType, IOOp, IOContext combinations are valid, the
validity of the stats is checked before flushing pending stats and
before reading in the existing stats file to shared memory.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Maciek Sakrejda <m.sakrejda@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
doc/src/sgml/monitoring.sgml | 2 +
src/backend/utils/activity/pgstat.c | 35 ++++
src/backend/utils/activity/pgstat_bgwriter.c | 7 +-
.../utils/activity/pgstat_checkpointer.c | 7 +-
src/backend/utils/activity/pgstat_io_ops.c | 161 ++++++++++++++++++
src/backend/utils/activity/pgstat_relation.c | 15 +-
src/backend/utils/activity/pgstat_shmem.c | 4 +
src/backend/utils/activity/pgstat_wal.c | 4 +-
src/backend/utils/adt/pgstatfuncs.c | 4 +-
src/include/miscadmin.h | 2 +
src/include/pgstat.h | 84 +++++++++
src/include/utils/pgstat_internal.h | 36 ++++
src/tools/pgindent/typedefs.list | 3 +
13 files changed, 358 insertions(+), 6 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 342b20ebeb..14dfd650f8 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -5360,6 +5360,8 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
the <structname>pg_stat_bgwriter</structname>
view, <literal>archiver</literal> to reset all the counters shown in
the <structname>pg_stat_archiver</structname> view,
+ <literal>io</literal> to reset all the counters shown in the
+ <structname>pg_stat_io</structname> view,
<literal>wal</literal> to reset all the counters shown in the
<structname>pg_stat_wal</structname> view or
<literal>recovery_prefetch</literal> to reset all the counters shown
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 1b97597f17..4becee9a6c 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -359,6 +359,15 @@ static const PgStat_KindInfo pgstat_kind_infos[PGSTAT_NUM_KINDS] = {
.snapshot_cb = pgstat_checkpointer_snapshot_cb,
},
+ [PGSTAT_KIND_IOOPS] = {
+ .name = "io_ops",
+
+ .fixed_amount = true,
+
+ .reset_all_cb = pgstat_io_ops_reset_all_cb,
+ .snapshot_cb = pgstat_io_ops_snapshot_cb,
+ },
+
[PGSTAT_KIND_SLRU] = {
.name = "slru",
@@ -582,6 +591,7 @@ pgstat_report_stat(bool force)
/* Don't expend a clock check if nothing to do */
if (dlist_is_empty(&pgStatPending) &&
+ !have_ioopstats &&
!have_slrustats &&
!pgstat_have_pending_wal())
{
@@ -628,6 +638,9 @@ pgstat_report_stat(bool force)
/* flush database / relation / function / ... stats */
partial_flush |= pgstat_flush_pending_entries(nowait);
+ /* flush IO Operations stats */
+ partial_flush |= pgstat_flush_io_ops(nowait);
+
/* flush wal stats */
partial_flush |= pgstat_flush_wal(nowait);
@@ -1321,6 +1334,14 @@ pgstat_write_statsfile(void)
pgstat_build_snapshot_fixed(PGSTAT_KIND_CHECKPOINTER);
write_chunk_s(fpout, &pgStatLocal.snapshot.checkpointer);
+ /*
+ * Write IO Operations stats struct
+ */
+ pgstat_build_snapshot_fixed(PGSTAT_KIND_IOOPS);
+ write_chunk_s(fpout, &pgStatLocal.snapshot.io_ops.stat_reset_timestamp);
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ write_chunk_s(fpout, &pgStatLocal.snapshot.io_ops.stats[i]);
+
/*
* Write SLRU stats struct
*/
@@ -1495,6 +1516,20 @@ pgstat_read_statsfile(void)
if (!read_chunk_s(fpin, &shmem->checkpointer.stats))
goto error;
+ /*
+ * Read IO Operations stats struct
+ */
+ if (!read_chunk_s(fpin, &shmem->io_ops.stat_reset_timestamp))
+ goto error;
+
+ for (int bktype = 0; bktype < BACKEND_NUM_TYPES; bktype++)
+ {
+ pgstat_backend_io_stats_assert_well_formed(shmem->io_ops.stats[bktype].data,
+ (BackendType) bktype);
+ if (!read_chunk_s(fpin, &shmem->io_ops.stats[bktype].data))
+ goto error;
+ }
+
/*
* Read SLRU stats struct
*/
diff --git a/src/backend/utils/activity/pgstat_bgwriter.c b/src/backend/utils/activity/pgstat_bgwriter.c
index fbb1edc527..3d7f90a1b7 100644
--- a/src/backend/utils/activity/pgstat_bgwriter.c
+++ b/src/backend/utils/activity/pgstat_bgwriter.c
@@ -24,7 +24,7 @@ PgStat_BgWriterStats PendingBgWriterStats = {0};
/*
- * Report bgwriter statistics
+ * Report bgwriter and IO Operation statistics
*/
void
pgstat_report_bgwriter(void)
@@ -56,6 +56,11 @@ pgstat_report_bgwriter(void)
* Clear out the statistics buffer, so it can be re-used.
*/
MemSet(&PendingBgWriterStats, 0, sizeof(PendingBgWriterStats));
+
+ /*
+ * Report IO Operations statistics
+ */
+ pgstat_flush_io_ops(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_checkpointer.c b/src/backend/utils/activity/pgstat_checkpointer.c
index af8d513e7b..cfcf127210 100644
--- a/src/backend/utils/activity/pgstat_checkpointer.c
+++ b/src/backend/utils/activity/pgstat_checkpointer.c
@@ -24,7 +24,7 @@ PgStat_CheckpointerStats PendingCheckpointerStats = {0};
/*
- * Report checkpointer statistics
+ * Report checkpointer and IO Operation statistics
*/
void
pgstat_report_checkpointer(void)
@@ -62,6 +62,11 @@ pgstat_report_checkpointer(void)
* Clear out the statistics buffer, so it can be re-used.
*/
MemSet(&PendingCheckpointerStats, 0, sizeof(PendingCheckpointerStats));
+
+ /*
+ * Report IO Operation statistics
+ */
+ pgstat_flush_io_ops(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_io_ops.c b/src/backend/utils/activity/pgstat_io_ops.c
index a992882ac3..369aafa9f3 100644
--- a/src/backend/utils/activity/pgstat_io_ops.c
+++ b/src/backend/utils/activity/pgstat_io_ops.c
@@ -20,6 +20,45 @@
#include "utils/pgstat_internal.h"
static PgStat_IOContextOps pending_IOOpStats;
+bool have_ioopstats = false;
+
+
+/*
+ * Helper function to accumulate source PgStat_IOOpCounters into target
+ * PgStat_IOOpCounters. If either of the passed-in PgStat_IOOpCounters are
+ * members of PgStatShared_IOContextOps, the caller is responsible for ensuring
+ * that the appropriate lock is held.
+ */
+static void
+pgstat_accum_io_op(PgStat_IOOpCounters *target, PgStat_IOOpCounters *source, IOOp io_op)
+{
+ switch (io_op)
+ {
+ case IOOP_CLOCKSWEEP:
+ target->clocksweeps += source->clocksweeps;
+ return;
+ case IOOP_EXTEND:
+ target->extends += source->extends;
+ return;
+ case IOOP_FSYNC:
+ target->fsyncs += source->fsyncs;
+ return;
+ case IOOP_HIT:
+ target->hits += source->hits;
+ return;
+ case IOOP_READ:
+ target->reads += source->reads;
+ return;
+ case IOOP_REUSE:
+ target->reuses += source->reuses;
+ return;
+ case IOOP_WRITE:
+ target->writes += source->writes;
+ return;
+ }
+
+ elog(ERROR, "unrecognized IOOp value: %d", io_op);
+}
void
pgstat_count_io_op(IOOp io_op, IOContext io_context)
@@ -57,6 +96,78 @@ pgstat_count_io_op(IOOp io_op, IOContext io_context)
break;
}
+ have_ioopstats = true;
+}
+
+PgStat_BackendIOContextOps *
+pgstat_fetch_backend_io_context_ops(void)
+{
+ pgstat_snapshot_fixed(PGSTAT_KIND_IOOPS);
+
+ return &pgStatLocal.snapshot.io_ops;
+}
+
+/*
+ * Flush out locally pending IO Operation statistics entries
+ *
+ * If no stats have been recorded, this function returns false.
+ *
+ * If nowait is true, this function returns true if the lock could not be
+ * acquired. Otherwise, return false.
+ */
+bool
+pgstat_flush_io_ops(bool nowait)
+{
+ PgStatShared_IOContextOps *type_shstats;
+ bool expect_backend_stats = true;
+
+ if (!have_ioopstats)
+ return false;
+
+ type_shstats =
+ &pgStatLocal.shmem->io_ops.stats[MyBackendType];
+
+ if (!nowait)
+ LWLockAcquire(&type_shstats->lock, LW_EXCLUSIVE);
+ else if (!LWLockConditionalAcquire(&type_shstats->lock, LW_EXCLUSIVE))
+ return true;
+
+ expect_backend_stats = pgstat_io_op_stats_collected(MyBackendType);
+
+ for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ {
+ PgStat_IOOpCounters *sharedent = &type_shstats->data[io_context];
+ PgStat_IOOpCounters *pendingent = &pending_IOOpStats.data[io_context];
+
+ if (!expect_backend_stats ||
+ !pgstat_bktype_io_context_valid(MyBackendType, (IOContext) io_context))
+ {
+ pgstat_io_context_ops_assert_zero(sharedent);
+ pgstat_io_context_ops_assert_zero(pendingent);
+ continue;
+ }
+
+ for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
+ {
+ if (!(pgstat_io_op_valid(MyBackendType, (IOContext) io_context,
+ (IOOp) io_op)))
+ {
+ pgstat_io_op_assert_zero(sharedent, (IOOp) io_op);
+ pgstat_io_op_assert_zero(pendingent, (IOOp) io_op);
+ continue;
+ }
+
+ pgstat_accum_io_op(sharedent, pendingent, (IOOp) io_op);
+ }
+ }
+
+ LWLockRelease(&type_shstats->lock);
+
+ memset(&pending_IOOpStats, 0, sizeof(pending_IOOpStats));
+
+ have_ioopstats = false;
+
+ return false;
}
const char *
@@ -103,6 +214,56 @@ pgstat_io_op_desc(IOOp io_op)
elog(ERROR, "unrecognized IOOp value: %d", io_op);
}
+void
+pgstat_io_ops_reset_all_cb(TimestampTz ts)
+{
+ PgStatShared_BackendIOContextOps *backends_stats_shmem = &pgStatLocal.shmem->io_ops;
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ PgStatShared_IOContextOps *stats_shmem = &backends_stats_shmem->stats[i];
+
+ LWLockAcquire(&stats_shmem->lock, LW_EXCLUSIVE);
+
+ /*
+ * Use the lock in the first BackendType's PgStat_IOContextOps to
+ * protect the reset timestamp as well.
+ */
+ if (i == 0)
+ backends_stats_shmem->stat_reset_timestamp = ts;
+
+ memset(stats_shmem->data, 0, sizeof(stats_shmem->data));
+ LWLockRelease(&stats_shmem->lock);
+ }
+}
+
+void
+pgstat_io_ops_snapshot_cb(void)
+{
+ PgStatShared_BackendIOContextOps *backends_stats_shmem = &pgStatLocal.shmem->io_ops;
+ PgStat_BackendIOContextOps *backends_stats_snap = &pgStatLocal.snapshot.io_ops;
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ PgStatShared_IOContextOps *stats_shmem = &backends_stats_shmem->stats[i];
+ PgStat_IOContextOps *stats_snap = &backends_stats_snap->stats[i];
+
+ LWLockAcquire(&stats_shmem->lock, LW_SHARED);
+
+ /*
+ * Use the lock in the first BackendType's PgStat_IOContextOps to
+ * protect the reset timestamp as well.
+ */
+ if (i == 0)
+ backends_stats_snap->stat_reset_timestamp =
+ backends_stats_shmem->stat_reset_timestamp;
+
+ memcpy(stats_snap->data, stats_shmem->data, sizeof(stats_shmem->data));
+ LWLockRelease(&stats_shmem->lock);
+ }
+
+}
+
/*
* IO Operation statistics are not collected for all BackendTypes.
*
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index a846d9ffb6..7a2fd1ccf9 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -205,7 +205,7 @@ pgstat_drop_relation(Relation rel)
}
/*
- * Report that the table was just vacuumed.
+ * Report that the table was just vacuumed and flush IO Operation statistics.
*/
void
pgstat_report_vacuum(Oid tableoid, bool shared,
@@ -257,10 +257,18 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
}
pgstat_unlock_entry(entry_ref);
+
+ /*
+ * Flush IO Operations statistics now. pgstat_report_stat() will flush IO
+ * Operation stats, however this will not be called after an entire
+ * autovacuum cycle is done -- which will likely vacuum many relations --
+ * or until the VACUUM command has processed all tables and committed.
+ */
+ pgstat_flush_io_ops(false);
}
/*
- * Report that the table was just analyzed.
+ * Report that the table was just analyzed and flush IO Operation statistics.
*
* Caller must provide new live- and dead-tuples estimates, as well as a
* flag indicating whether to reset the changes_since_analyze counter.
@@ -340,6 +348,9 @@ pgstat_report_analyze(Relation rel,
}
pgstat_unlock_entry(entry_ref);
+
+ /* see pgstat_report_vacuum() */
+ pgstat_flush_io_ops(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_shmem.c b/src/backend/utils/activity/pgstat_shmem.c
index 9a4f037959..275a7be166 100644
--- a/src/backend/utils/activity/pgstat_shmem.c
+++ b/src/backend/utils/activity/pgstat_shmem.c
@@ -202,6 +202,10 @@ StatsShmemInit(void)
LWLockInitialize(&ctl->checkpointer.lock, LWTRANCHE_PGSTATS_DATA);
LWLockInitialize(&ctl->slru.lock, LWTRANCHE_PGSTATS_DATA);
LWLockInitialize(&ctl->wal.lock, LWTRANCHE_PGSTATS_DATA);
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ LWLockInitialize(&ctl->io_ops.stats[i].lock,
+ LWTRANCHE_PGSTATS_DATA);
}
else
{
diff --git a/src/backend/utils/activity/pgstat_wal.c b/src/backend/utils/activity/pgstat_wal.c
index 5a878bd115..9cac407b42 100644
--- a/src/backend/utils/activity/pgstat_wal.c
+++ b/src/backend/utils/activity/pgstat_wal.c
@@ -34,7 +34,7 @@ static WalUsage prevWalUsage;
/*
* Calculate how much WAL usage counters have increased and update
- * shared statistics.
+ * shared WAL and IO Operation statistics.
*
* Must be called by processes that generate WAL, that do not call
* pgstat_report_stat(), like walwriter.
@@ -43,6 +43,8 @@ void
pgstat_report_wal(bool force)
{
pgstat_flush_wal(force);
+
+ pgstat_flush_io_ops(force);
}
/*
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index eadd8464ff..edd73e5c25 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -2071,6 +2071,8 @@ pg_stat_reset_shared(PG_FUNCTION_ARGS)
pgstat_reset_of_kind(PGSTAT_KIND_BGWRITER);
pgstat_reset_of_kind(PGSTAT_KIND_CHECKPOINTER);
}
+ else if (strcmp(target, "io") == 0)
+ pgstat_reset_of_kind(PGSTAT_KIND_IOOPS);
else if (strcmp(target, "recovery_prefetch") == 0)
XLogPrefetchResetStats();
else if (strcmp(target, "wal") == 0)
@@ -2079,7 +2081,7 @@ pg_stat_reset_shared(PG_FUNCTION_ARGS)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("unrecognized reset target: \"%s\"", target),
- errhint("Target must be \"archiver\", \"bgwriter\", \"recovery_prefetch\", or \"wal\".")));
+ errhint("Target must be \"archiver\", \"io\", \"bgwriter\", \"recovery_prefetch\", or \"wal\".")));
PG_RETURN_VOID();
}
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index e7ebea4ff4..bf97162e83 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -331,6 +331,8 @@ typedef enum BackendType
B_WAL_WRITER,
} BackendType;
+#define BACKEND_NUM_TYPES (B_WAL_WRITER + 1)
+
extern PGDLLIMPORT BackendType MyBackendType;
extern const char *GetBackendTypeDesc(BackendType backendType);
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 015f17cd06..c2c127d846 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -49,6 +49,7 @@ typedef enum PgStat_Kind
PGSTAT_KIND_ARCHIVER,
PGSTAT_KIND_BGWRITER,
PGSTAT_KIND_CHECKPOINTER,
+ PGSTAT_KIND_IOOPS,
PGSTAT_KIND_SLRU,
PGSTAT_KIND_WAL,
} PgStat_Kind;
@@ -321,6 +322,12 @@ typedef struct PgStat_IOContextOps
PgStat_IOOpCounters data[IOCONTEXT_NUM_TYPES];
} PgStat_IOContextOps;
+typedef struct PgStat_BackendIOContextOps
+{
+ TimestampTz stat_reset_timestamp;
+ PgStat_IOContextOps stats[BACKEND_NUM_TYPES];
+} PgStat_BackendIOContextOps;
+
typedef struct PgStat_StatDBEntry
{
PgStat_Counter n_xact_commit;
@@ -502,6 +509,7 @@ extern PgStat_CheckpointerStats *pgstat_fetch_stat_checkpointer(void);
*/
extern void pgstat_count_io_op(IOOp io_op, IOContext io_context);
+extern PgStat_BackendIOContextOps *pgstat_fetch_backend_io_context_ops(void);
extern const char *pgstat_io_context_desc(IOContext io_context);
extern const char *pgstat_io_op_desc(IOOp io_op);
@@ -513,6 +521,82 @@ extern bool pgstat_expect_io_op(BackendType bktype, IOContext io_context, IOOp i
/* IO stats translation function in freelist.c */
extern IOContext IOContextForStrategy(BufferAccessStrategy bas);
+/*
+ * Functions to assert that invalid IO Operation counters are zero.
+ */
+static inline void
+pgstat_io_context_ops_assert_zero(PgStat_IOOpCounters *counters)
+{
+ Assert(counters->clocksweeps == 0 && counters->extends == 0 &&
+ counters->fsyncs == 0 && counters->hits == 0 &&
+ counters->reads == 0 && counters->reuses == 0 &&
+ counters->writes == 0);
+}
+
+static inline void
+pgstat_io_op_assert_zero(PgStat_IOOpCounters *counters, IOOp io_op)
+{
+ switch (io_op)
+ {
+ case IOOP_CLOCKSWEEP:
+ Assert(counters->clocksweeps == 0);
+ return;
+ case IOOP_EXTEND:
+ Assert(counters->extends == 0);
+ return;
+ case IOOP_FSYNC:
+ Assert(counters->fsyncs == 0);
+ return;
+ case IOOP_HIT:
+ Assert(counters->hits == 0);
+ return;
+ case IOOP_READ:
+ Assert(counters->reads == 0);
+ return;
+ case IOOP_REUSE:
+ Assert(counters->reuses == 0);
+ return;
+ case IOOP_WRITE:
+ Assert(counters->writes == 0);
+ return;
+ }
+
+ /* Should not reach here */
+ Assert(false);
+}
+
+/*
+ * Assert that stats have not been counted for any combination of IOContext and
+ * IOOp which are not valid for the passed-in BackendType. The passed-in array
+ * of PgStat_IOOpCounters must contain stats from the BackendType specified by
+ * the second parameter. Caller is responsible for any locking if the passed-in
+ * array of PgStat_IOOpCounters is a member of PgStatShared_IOContextOps.
+ */
+static inline void
+pgstat_backend_io_stats_assert_well_formed(PgStat_IOOpCounters
+ backend_io_context_ops[IOCONTEXT_NUM_TYPES], BackendType bktype)
+{
+ bool expect_backend_stats = true;
+
+ if (!pgstat_io_op_stats_collected(bktype))
+ expect_backend_stats = false;
+
+ for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ {
+ if (!expect_backend_stats ||
+ !pgstat_bktype_io_context_valid(bktype, (IOContext) io_context))
+ {
+ pgstat_io_context_ops_assert_zero(&backend_io_context_ops[io_context]);
+ continue;
+ }
+
+ for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
+ {
+ if (!pgstat_io_op_valid(bktype, (IOContext) io_context, (IOOp) io_op))
+ pgstat_io_op_assert_zero(&backend_io_context_ops[io_context],
+ (IOOp) io_op);
+ }
+ }
+}
/*
diff --git a/src/include/utils/pgstat_internal.h b/src/include/utils/pgstat_internal.h
index 627c1389e4..9066fed660 100644
--- a/src/include/utils/pgstat_internal.h
+++ b/src/include/utils/pgstat_internal.h
@@ -330,6 +330,25 @@ typedef struct PgStatShared_Checkpointer
PgStat_CheckpointerStats reset_offset;
} PgStatShared_Checkpointer;
+typedef struct PgStatShared_IOContextOps
+{
+ /*
+ * lock protects ->data. If this PgStatShared_IOContextOps is
+ * PgStatShared_BackendIOContextOps->stats[0], lock also protects
+ * PgStatShared_BackendIOContextOps->stat_reset_timestamp.
+ */
+ LWLock lock;
+ PgStat_IOOpCounters data[IOCONTEXT_NUM_TYPES];
+} PgStatShared_IOContextOps;
+
+typedef struct PgStatShared_BackendIOContextOps
+{
+ /* ->stat_reset_timestamp is protected by ->stats[0].lock */
+ TimestampTz stat_reset_timestamp;
+ PgStatShared_IOContextOps stats[BACKEND_NUM_TYPES];
+} PgStatShared_BackendIOContextOps;
+
+
typedef struct PgStatShared_SLRU
{
/* lock protects ->stats */
@@ -420,6 +439,7 @@ typedef struct PgStat_ShmemControl
PgStatShared_Archiver archiver;
PgStatShared_BgWriter bgwriter;
PgStatShared_Checkpointer checkpointer;
+ PgStatShared_BackendIOContextOps io_ops;
PgStatShared_SLRU slru;
PgStatShared_Wal wal;
} PgStat_ShmemControl;
@@ -443,6 +463,8 @@ typedef struct PgStat_Snapshot
PgStat_CheckpointerStats checkpointer;
+ PgStat_BackendIOContextOps io_ops;
+
PgStat_SLRUStats slru[SLRU_NUM_ELEMENTS];
PgStat_WalStats wal;
@@ -550,6 +572,15 @@ extern void pgstat_database_reset_timestamp_cb(PgStatShared_Common *header, Time
extern bool pgstat_function_flush_cb(PgStat_EntryRef *entry_ref, bool nowait);
+/*
+ * Functions in pgstat_io_ops.c
+ */
+
+extern void pgstat_io_ops_reset_all_cb(TimestampTz ts);
+extern void pgstat_io_ops_snapshot_cb(void);
+extern bool pgstat_flush_io_ops(bool nowait);
+
+
/*
* Functions in pgstat_relation.c
*/
@@ -642,6 +673,11 @@ extern void pgstat_create_transactional(PgStat_Kind kind, Oid dboid, Oid objoid)
extern PGDLLIMPORT PgStat_LocalState pgStatLocal;
+/*
+ * Variables in pgstat_io_ops.c
+ */
+
+extern PGDLLIMPORT bool have_ioopstats;
/*
* Variables in pgstat_slru.c
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 67218ec6f2..33c9362257 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2005,12 +2005,14 @@ PgFdwRelationInfo
PgFdwScanState
PgIfAddrCallback
PgStatShared_Archiver
+PgStatShared_BackendIOContextOps
PgStatShared_BgWriter
PgStatShared_Checkpointer
PgStatShared_Common
PgStatShared_Database
PgStatShared_Function
PgStatShared_HashEntry
+PgStatShared_IOContextOps
PgStatShared_Relation
PgStatShared_ReplSlot
PgStatShared_SLRU
@@ -2018,6 +2020,7 @@ PgStatShared_Subscription
PgStatShared_Wal
PgStat_ArchiverStats
PgStat_BackendFunctionEntry
+PgStat_BackendIOContextOps
PgStat_BackendSubEntry
PgStat_BgWriterStats
PgStat_CheckpointerStats
--
2.34.1
Attachment: v33-0001-Track-IO-operation-statistics-locally.patch (text/x-patch; charset=US-ASCII)
From 677a8cec3dadfcaf9476e27d1d9e9328a14753c9 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Thu, 6 Oct 2022 12:23:25 -0400
Subject: [PATCH v33 1/4] Track IO operation statistics locally
Introduce "IOOp", an IO operation done by a backend, and "IOContext",
the source, target, or type of the IO done by a backend. For example,
the checkpointer may write a shared buffer out. This would be counted
as an IOOp "write" in the IOCONTEXT_SHARED IOContext by BackendType
"checkpointer".
Each IOOp (clocksweep, hit, reuse, read, write, extend, fsync) is
counted per IOContext (bulkread, bulkwrite, local, shared, or vacuum)
through a call to pgstat_count_io_op().
The primary concern of these statistics is IO operations on data blocks
during the course of normal database operations. IO operations done by,
for example, the archiver or syslogger are not counted in these
statistics. WAL IO, temporary file IO, and IO done directly though smgr*
functions (such as when building an index) are not yet counted but would
be useful future additions.
IOCONTEXT_LOCAL and IOCONTEXT_SHARED IOContexts concern operations on
local and shared buffers.
The IOCONTEXT_BULKREAD, IOCONTEXT_BULKWRITE, and IOCONTEXT_VACUUM
IOContexts concern IO operations on buffers as part of a
BufferAccessStrategy.
IOOP_CLOCKSWEEP IOOps are counted in IOCONTEXT_SHARED and
IOCONTEXT_LOCAL IOContexts when a buffer is acquired through
[Local]BufferAlloc() and no BufferAccessStrategy is in use.
When a BufferAccessStrategy is in use, buffers added to the strategy
ring are counted as IOOP_CLOCKSWEEP IOOps in the
IOCONTEXT_[BULKREAD|BULKWRITE|VACUUM] IOContext. When one of these
buffers is reused, it is counted as an IOOP_REUSE IOOp in the
corresponding strategy IOContext.
IOOP_WRITE IOOps are counted in the BufferAccessStrategy IOContexts
whenever the reused dirty buffer is written out.
Stats on IOOps in all IOContexts for a given backend are counted in a
backend's local memory. A subsequent commit will expose functions for
aggregating and viewing these stats.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Maciek Sakrejda <m.sakrejda@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
src/backend/postmaster/checkpointer.c | 13 ++
src/backend/storage/buffer/bufmgr.c | 54 ++++-
src/backend/storage/buffer/freelist.c | 60 +++++-
src/backend/storage/buffer/localbuf.c | 5 +
src/backend/storage/sync/sync.c | 2 +
src/backend/utils/activity/Makefile | 1 +
src/backend/utils/activity/meson.build | 1 +
src/backend/utils/activity/pgstat_io_ops.c | 238 +++++++++++++++++++++
src/include/pgstat.h | 63 ++++++
src/include/storage/buf_internals.h | 2 +-
src/include/storage/bufmgr.h | 7 +-
src/tools/pgindent/typedefs.list | 4 +
12 files changed, 438 insertions(+), 12 deletions(-)
create mode 100644 src/backend/utils/activity/pgstat_io_ops.c
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 5fc076fc14..4ea4e6a298 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -1116,6 +1116,19 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
if (!AmBackgroundWriterProcess())
CheckpointerShmem->num_backend_fsync++;
LWLockRelease(CheckpointerCommLock);
+
+ /*
+ * We have no way of knowing if the current IOContext is
+ * IOCONTEXT_SHARED or IOCONTEXT_[BULKREAD, BULKWRITE, VACUUM] at this
+ * point, so count the fsync as being in the IOCONTEXT_SHARED
+ * IOContext. This is probably okay, because the number of backend
+ * fsyncs doesn't say anything about the efficacy of the
+ * BufferAccessStrategy. And counting both fsyncs done in
+ * IOCONTEXT_SHARED and IOCONTEXT_[BULKREAD, BULKWRITE, VACUUM] under
+ * IOCONTEXT_SHARED is likely clearer when investigating the number of
+ * backend fsyncs.
+ */
+ pgstat_count_io_op(IOOP_FSYNC, IOCONTEXT_SHARED);
return false;
}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 6b95381481..fb539ed9e6 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -482,7 +482,7 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
BlockNumber blockNum,
BufferAccessStrategy strategy,
bool *foundPtr);
-static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+static void FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOContext io_context);
static void FindAndDropRelationBuffers(RelFileLocator rlocator,
ForkNumber forkNum,
BlockNumber nForkBlock,
@@ -823,6 +823,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
BufferDesc *bufHdr;
Block bufBlock;
bool found;
+ IOContext io_context;
bool isExtend;
bool isLocalBuf = SmgrIsTemp(smgr);
@@ -833,6 +834,13 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
isExtend = (blockNum == P_NEW);
+ if (strategy)
+ io_context = IOContextForStrategy(strategy);
+ else if (isLocalBuf)
+ io_context = IOCONTEXT_LOCAL;
+ else
+ io_context = IOCONTEXT_SHARED;
+
TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
smgr->smgr_rlocator.locator.spcOid,
smgr->smgr_rlocator.locator.dbOid,
@@ -886,6 +894,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
/* if it was already in the buffer pool, we're done */
if (found)
{
+ pgstat_count_io_op(IOOP_HIT, io_context);
+
if (!isExtend)
{
/* Just need to update stats before we exit */
@@ -990,6 +1000,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (isExtend)
{
+ pgstat_count_io_op(IOOP_EXTEND, io_context);
/* new buffers are zero-filled */
MemSet((char *) bufBlock, 0, BLCKSZ);
/* don't set checksum for all-zero page */
@@ -1020,6 +1031,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
smgrread(smgr, forkNum, blockNum, (char *) bufBlock);
+ pgstat_count_io_op(IOOP_READ, io_context);
+
if (track_io_timing)
{
INSTR_TIME_SET_CURRENT(io_time);
@@ -1190,6 +1203,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
/* Loop here in case we have to try another victim buffer */
for (;;)
{
+ bool from_ring;
+
/*
* Ensure, while the spinlock's not yet held, that there's a free
* refcount entry.
@@ -1200,7 +1215,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* Select a victim buffer. The buffer is returned with its header
* spinlock still held!
*/
- buf = StrategyGetBuffer(strategy, &buf_state);
+ buf = StrategyGetBuffer(strategy, &buf_state, &from_ring);
Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
@@ -1237,6 +1252,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf),
LW_SHARED))
{
+ IOContext io_context;
+
/*
* If using a nondefault strategy, and writing the buffer
* would require a WAL flush, let the strategy decide whether
@@ -1263,13 +1280,28 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
}
}
+ /*
+ * When a strategy is in use, if the target dirty buffer is an
+ * existing strategy buffer being reused, count this as a
+ * strategy write for the purposes of IO Operations statistics
+ * tracking.
+ *
+ * All dirty shared buffers upon first being added to the ring
+ * will be counted as shared buffer writes.
+ *
+ * When a strategy is not in use, the write can only be a
+ * "regular" write of a dirty shared buffer.
+ */
+
+ io_context = from_ring ? IOContextForStrategy(strategy) : IOCONTEXT_SHARED;
+
/* OK, do the I/O */
TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(forkNum, blockNum,
smgr->smgr_rlocator.locator.spcOid,
smgr->smgr_rlocator.locator.dbOid,
smgr->smgr_rlocator.locator.relNumber);
- FlushBuffer(buf, NULL);
+ FlushBuffer(buf, NULL, io_context);
LWLockRelease(BufferDescriptorGetContentLock(buf));
ScheduleBufferTagForWriteback(&BackendWritebackContext,
@@ -2570,7 +2602,7 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOCONTEXT_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
@@ -2820,7 +2852,7 @@ BufferGetTag(Buffer buffer, RelFileLocator *rlocator, ForkNumber *forknum,
* as the second parameter. If not, pass NULL.
*/
static void
-FlushBuffer(BufferDesc *buf, SMgrRelation reln)
+FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOContext io_context)
{
XLogRecPtr recptr;
ErrorContextCallback errcallback;
@@ -2900,6 +2932,8 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
*/
bufToWrite = PageSetChecksumCopy((Page) bufBlock, buf->tag.blockNum);
+ pgstat_count_io_op(IOOP_WRITE, io_context);
+
if (track_io_timing)
INSTR_TIME_SET_CURRENT(io_start);
@@ -3551,6 +3585,8 @@ FlushRelationBuffers(Relation rel)
localpage,
false);
+ pgstat_count_io_op(IOOP_WRITE, IOCONTEXT_LOCAL);
+
buf_state &= ~(BM_DIRTY | BM_JUST_DIRTIED);
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
@@ -3586,7 +3622,7 @@ FlushRelationBuffers(Relation rel)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, RelationGetSmgr(rel));
+ FlushBuffer(bufHdr, RelationGetSmgr(rel), IOCONTEXT_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr);
}
@@ -3684,7 +3720,7 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, srelent->srel);
+ FlushBuffer(bufHdr, srelent->srel, IOCONTEXT_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr);
}
@@ -3894,7 +3930,7 @@ FlushDatabaseBuffers(Oid dbid)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOCONTEXT_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr);
}
@@ -3921,7 +3957,7 @@ FlushOneBuffer(Buffer buffer)
Assert(LWLockHeldByMe(BufferDescriptorGetContentLock(bufHdr)));
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOCONTEXT_SHARED);
}
/*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 990e081aae..15cd8bbf88 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -15,6 +15,7 @@
*/
#include "postgres.h"
+#include "pgstat.h"
#include "port/atomics.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
@@ -198,23 +199,40 @@ have_free_buffer(void)
* return the buffer with the buffer header spinlock still held.
*/
BufferDesc *
-StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
+StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_ring)
{
BufferDesc *buf;
int bgwprocno;
+ IOContext io_context;
int trycounter;
uint32 local_buf_state; /* to avoid repeated (de-)referencing */
+ *from_ring = false;
+
/*
* If given a strategy object, see whether it can select a buffer. We
* assume strategy objects don't need buffer_strategy_lock.
*/
if (strategy != NULL)
{
+ io_context = IOContextForStrategy(strategy);
+
buf = GetBufferFromRing(strategy, buf_state);
if (buf != NULL)
+ {
+ /*
+ * When a BufferAccessStrategy is in use, reused buffers from the
+ * strategy ring will be counted as IOCONTEXT_BULKREAD,
+ * IOCONTEXT_BULKWRITE, or IOCONTEXT_VACUUM reuses for the
+ * purposes of IO Operation statistics tracking.
+ */
+ *from_ring = true;
+ pgstat_count_io_op(IOOP_REUSE, io_context);
return buf;
+ }
}
+ else
+ io_context = IOCONTEXT_SHARED;
/*
* If asked, we need to waken the bgwriter. Since we don't want to rely on
@@ -249,6 +267,16 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
*/
pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);
+ /*
+ * When a BufferAccessStrategy is in use, clocksweeps adding a shared
+ * buffer to the strategy ring are counted in the corresponding strategy's
+ * context. This includes the clocksweeps done to add buffers to the ring
+ * initially as well as those done to add a new shared buffer to the ring
+ * when all existing buffers in the ring are pinned or have a usage count
+ * above one.
+ */
+ pgstat_count_io_op(IOOP_CLOCKSWEEP, io_context);
+
/*
* First check, without acquiring the lock, whether there's buffers in the
* freelist. Since we otherwise don't require the spinlock in every
@@ -670,6 +698,36 @@ AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
strategy->buffers[strategy->current] = BufferDescriptorGetBuffer(buf);
}
+/*
+ * Utility function returning the IOContext of a given BufferAccessStrategy's
+ * strategy ring.
+ */
+IOContext
+IOContextForStrategy(BufferAccessStrategy strategy)
+{
+ Assert(strategy);
+
+ switch (strategy->btype)
+ {
+ case BAS_NORMAL:
+
+ /*
+ * Currently, GetAccessStrategy() returns NULL for
+ * BufferAccessStrategyType BAS_NORMAL, so this case is unlikely
+ * to be hit.
+ */
+ return IOCONTEXT_SHARED;
+ case BAS_BULKREAD:
+ return IOCONTEXT_BULKREAD;
+ case BAS_BULKWRITE:
+ return IOCONTEXT_BULKWRITE;
+ case BAS_VACUUM:
+ return IOCONTEXT_VACUUM;
+ }
+
+ elog(ERROR, "unrecognized BufferAccessStrategyType: %d", strategy->btype);
+}
+
/*
* StrategyRejectBuffer -- consider rejecting a dirty buffer
*
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 30d67d1c40..6fe7459401 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -18,6 +18,7 @@
#include "access/parallel.h"
#include "catalog/catalog.h"
#include "executor/instrument.h"
+#include "pgstat.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "utils/guc_hooks.h"
@@ -196,6 +197,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
LocalRefCount[b]++;
ResourceOwnerRememberBuffer(CurrentResourceOwner,
BufferDescriptorGetBuffer(bufHdr));
+
+ pgstat_count_io_op(IOOP_CLOCKSWEEP, IOCONTEXT_LOCAL);
break;
}
}
@@ -226,6 +229,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
localpage,
false);
+ pgstat_count_io_op(IOOP_WRITE, IOCONTEXT_LOCAL);
+
/* Mark not-dirty now in case we error out below */
buf_state &= ~BM_DIRTY;
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 9d6a9e9109..5718b52fb5 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -432,6 +432,8 @@ ProcessSyncRequests(void)
total_elapsed += elapsed;
processed++;
+ pgstat_count_io_op(IOOP_FSYNC, IOCONTEXT_SHARED);
+
if (log_checkpoints)
elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f ms",
processed,
diff --git a/src/backend/utils/activity/Makefile b/src/backend/utils/activity/Makefile
index a2e8507fd6..0098785089 100644
--- a/src/backend/utils/activity/Makefile
+++ b/src/backend/utils/activity/Makefile
@@ -22,6 +22,7 @@ OBJS = \
pgstat_checkpointer.o \
pgstat_database.o \
pgstat_function.o \
+ pgstat_io_ops.o \
pgstat_relation.o \
pgstat_replslot.o \
pgstat_shmem.o \
diff --git a/src/backend/utils/activity/meson.build b/src/backend/utils/activity/meson.build
index 5b3b558a67..1038324c32 100644
--- a/src/backend/utils/activity/meson.build
+++ b/src/backend/utils/activity/meson.build
@@ -7,6 +7,7 @@ backend_sources += files(
'pgstat_checkpointer.c',
'pgstat_database.c',
'pgstat_function.c',
+ 'pgstat_io_ops.c',
'pgstat_relation.c',
'pgstat_replslot.c',
'pgstat_shmem.c',
diff --git a/src/backend/utils/activity/pgstat_io_ops.c b/src/backend/utils/activity/pgstat_io_ops.c
new file mode 100644
index 0000000000..a992882ac3
--- /dev/null
+++ b/src/backend/utils/activity/pgstat_io_ops.c
@@ -0,0 +1,238 @@
+/* -------------------------------------------------------------------------
+ *
+ * pgstat_io_ops.c
+ * Implementation of IO operation statistics.
+ *
+ * This file contains the implementation of IO operation statistics. It is kept
+ * separate from pgstat.c to enforce the line between the statistics access /
+ * storage implementation and the details about individual types of
+ * statistics.
+ *
+ * Copyright (c) 2021-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/utils/activity/pgstat_io_ops.c
+ * -------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "utils/pgstat_internal.h"
+
+static PgStat_IOContextOps pending_IOOpStats;
+
+void
+pgstat_count_io_op(IOOp io_op, IOContext io_context)
+{
+ PgStat_IOOpCounters *pending_counters;
+
+ Assert(io_context < IOCONTEXT_NUM_TYPES);
+ Assert(io_op < IOOP_NUM_TYPES);
+ Assert(pgstat_expect_io_op(MyBackendType, io_context, io_op));
+
+ pending_counters = &pending_IOOpStats.data[io_context];
+
+ switch (io_op)
+ {
+ case IOOP_CLOCKSWEEP:
+ pending_counters->clocksweeps++;
+ break;
+ case IOOP_EXTEND:
+ pending_counters->extends++;
+ break;
+ case IOOP_FSYNC:
+ pending_counters->fsyncs++;
+ break;
+ case IOOP_HIT:
+ pending_counters->hits++;
+ break;
+ case IOOP_READ:
+ pending_counters->reads++;
+ break;
+ case IOOP_REUSE:
+ pending_counters->reuses++;
+ break;
+ case IOOP_WRITE:
+ pending_counters->writes++;
+ break;
+ }
+
+}
+
+const char *
+pgstat_io_context_desc(IOContext io_context)
+{
+ switch (io_context)
+ {
+ case IOCONTEXT_BULKREAD:
+ return "bulkread";
+ case IOCONTEXT_BULKWRITE:
+ return "bulkwrite";
+ case IOCONTEXT_LOCAL:
+ return "local";
+ case IOCONTEXT_SHARED:
+ return "shared";
+ case IOCONTEXT_VACUUM:
+ return "vacuum";
+ }
+
+ elog(ERROR, "unrecognized IOContext value: %d", io_context);
+}
+
+const char *
+pgstat_io_op_desc(IOOp io_op)
+{
+ switch (io_op)
+ {
+ case IOOP_CLOCKSWEEP:
+ return "clocksweep";
+ case IOOP_EXTEND:
+ return "extend";
+ case IOOP_FSYNC:
+ return "fsync";
+ case IOOP_HIT:
+ return "hit";
+ case IOOP_READ:
+ return "read";
+ case IOOP_REUSE:
+ return "reuse";
+ case IOOP_WRITE:
+ return "write";
+ }
+
+ elog(ERROR, "unrecognized IOOp value: %d", io_op);
+}
+
+/*
+ * IO Operation statistics are not collected for all BackendTypes.
+ *
+ * The following BackendTypes do not participate in the cumulative stats
+ * subsystem or do not do IO operations worth reporting statistics on:
+ * - Syslogger because it is not connected to shared memory
+ * - Archiver because most relevant archiving IO is delegated to a
+ *   specialized command or module
+ * - WAL Receiver and WAL Writer because their IO is not tracked in
+ *   pg_stat_io for now
+ *
+ * Returns true if the given BackendType participates in the cumulative
+ * stats subsystem for IO Operations and false otherwise.
+ */
+bool
+pgstat_io_op_stats_collected(BackendType bktype)
+{
+ return bktype != B_INVALID && bktype != B_ARCHIVER && bktype != B_LOGGER &&
+ bktype != B_WAL_RECEIVER && bktype != B_WAL_WRITER;
+}
+
+/*
+ * Some BackendTypes do not perform IO operations in certain IOContexts. Check
+ * that the given BackendType is expected to do IO in the given IOContext.
+ */
+bool
+pgstat_bktype_io_context_valid(BackendType bktype, IOContext io_context)
+{
+ bool no_local;
+
+ /*
+ * In core Postgres, only regular backends and WAL Sender processes
+ * executing queries should use local buffers. Parallel workers will not
+ * use local buffers (see InitLocalBuffers()); however, extensions
+ * leveraging background workers have no such limitation, so track IO
+ * Operations in IOCONTEXT_LOCAL for BackendType B_BG_WORKER.
+ */
+ no_local = bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER || bktype
+ == B_CHECKPOINTER || bktype == B_AUTOVAC_WORKER || bktype ==
+ B_STANDALONE_BACKEND || bktype == B_STARTUP;
+
+ if (io_context == IOCONTEXT_LOCAL && no_local)
+ return false;
+
+ /*
+ * Some BackendTypes do not currently perform any IO operations in certain
+ * IOContexts, and, while it may not be inherently incorrect for them to
+ * do so, excluding those rows from the view makes the view easier to use.
+ */
+ if ((io_context == IOCONTEXT_BULKREAD || io_context == IOCONTEXT_BULKWRITE
+ || io_context == IOCONTEXT_VACUUM) && (bktype == B_CHECKPOINTER
+ || bktype == B_BG_WRITER))
+ return false;
+
+ if (io_context == IOCONTEXT_VACUUM && bktype == B_AUTOVAC_LAUNCHER)
+ return false;
+
+ if (io_context == IOCONTEXT_BULKWRITE && (bktype == B_AUTOVAC_WORKER ||
+ bktype == B_AUTOVAC_LAUNCHER))
+ return false;
+
+ return true;
+}
+
+/*
+ * Some BackendTypes will never do certain IOOps and some IOOps should not
+ * occur in certain IOContexts. Check that the given IOOp is valid for the
+ * given BackendType in the given IOContext. Note that there are currently no
+ * cases of an IOOp being invalid for a particular BackendType only within a
+ * certain IOContext.
+ */
+bool
+pgstat_io_op_valid(BackendType bktype, IOContext io_context, IOOp io_op)
+{
+ bool strategy_io_context;
+
+ /*
+ * Some BackendTypes should never track IO Operation statistics.
+ */
+ Assert(pgstat_io_op_stats_collected(bktype));
+
+ /*
+ * Some BackendTypes will not do certain IOOps.
+ */
+ if ((bktype == B_BG_WRITER || bktype == B_CHECKPOINTER) &&
+ (io_op == IOOP_READ || io_op == IOOP_CLOCKSWEEP || io_op == IOOP_HIT))
+ return false;
+
+ if ((bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER || bktype ==
+ B_CHECKPOINTER) && io_op == IOOP_EXTEND)
+ return false;
+
+ /*
+ * Some IOOps are not valid in certain IOContexts
+ */
+ if (io_op == IOOP_EXTEND && io_context == IOCONTEXT_BULKREAD)
+ return false;
+
+ if (io_op == IOOP_REUSE &&
+ (io_context == IOCONTEXT_SHARED || io_context == IOCONTEXT_LOCAL))
+ return false;
+
+ /*
+ * Temporary tables using local buffers are not logged and thus do not
+ * require fsync'ing.
+ *
+ * IOOP_FSYNC IOOps done by a backend using a BufferAccessStrategy are
+ * counted in the IOCONTEXT_SHARED IOContext. See comment in
+ * ForwardSyncRequest() for more details.
+ */
+ strategy_io_context = io_context == IOCONTEXT_BULKREAD || io_context ==
+ IOCONTEXT_BULKWRITE || io_context == IOCONTEXT_VACUUM;
+
+ if ((io_context == IOCONTEXT_LOCAL || strategy_io_context) &&
+ io_op == IOOP_FSYNC)
+ return false;
+
+ return true;
+}
+
+bool
+pgstat_expect_io_op(BackendType bktype, IOContext io_context, IOOp io_op)
+{
+ if (!pgstat_io_op_stats_collected(bktype))
+ return false;
+
+ if (!pgstat_bktype_io_context_valid(bktype, io_context))
+ return false;
+
+ if (!(pgstat_io_op_valid(bktype, io_context, io_op)))
+ return false;
+
+ return true;
+}
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index cc1d1dcb7d..015f17cd06 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -14,6 +14,7 @@
#include "datatype/timestamp.h"
#include "portability/instr_time.h"
#include "postmaster/pgarch.h" /* for MAX_XFN_CHARS */
+#include "storage/buf.h"
#include "utils/backend_progress.h" /* for backward compatibility */
#include "utils/backend_status.h" /* for backward compatibility */
#include "utils/relcache.h"
@@ -276,6 +277,50 @@ typedef struct PgStat_CheckpointerStats
PgStat_Counter buf_fsync_backend;
} PgStat_CheckpointerStats;
+/*
+ * Types related to counting IO Operations for various IO Contexts
+ */
+
+typedef enum IOOp
+{
+ IOOP_CLOCKSWEEP = 0,
+ IOOP_EXTEND,
+ IOOP_FSYNC,
+ IOOP_HIT,
+ IOOP_READ,
+ IOOP_REUSE,
+ IOOP_WRITE,
+} IOOp;
+
+#define IOOP_NUM_TYPES (IOOP_WRITE + 1)
+
+typedef enum IOContext
+{
+ IOCONTEXT_BULKREAD = 0,
+ IOCONTEXT_BULKWRITE,
+ IOCONTEXT_LOCAL,
+ IOCONTEXT_SHARED,
+ IOCONTEXT_VACUUM,
+} IOContext;
+
+#define IOCONTEXT_NUM_TYPES (IOCONTEXT_VACUUM + 1)
+
+typedef struct PgStat_IOOpCounters
+{
+ PgStat_Counter clocksweeps;
+ PgStat_Counter extends;
+ PgStat_Counter fsyncs;
+ PgStat_Counter hits;
+ PgStat_Counter reads;
+ PgStat_Counter reuses;
+ PgStat_Counter writes;
+} PgStat_IOOpCounters;
+
+typedef struct PgStat_IOContextOps
+{
+ PgStat_IOOpCounters data[IOCONTEXT_NUM_TYPES];
+} PgStat_IOContextOps;
+
typedef struct PgStat_StatDBEntry
{
PgStat_Counter n_xact_commit;
@@ -452,6 +497,24 @@ extern void pgstat_report_checkpointer(void);
extern PgStat_CheckpointerStats *pgstat_fetch_stat_checkpointer(void);
+/*
+ * Functions in pgstat_io_ops.c
+ */
+
+extern void pgstat_count_io_op(IOOp io_op, IOContext io_context);
+extern const char *pgstat_io_context_desc(IOContext io_context);
+extern const char *pgstat_io_op_desc(IOOp io_op);
+
+/* Validation functions in pgstat_io_ops.c */
+extern bool pgstat_io_op_stats_collected(BackendType bktype);
+extern bool pgstat_bktype_io_context_valid(BackendType bktype, IOContext io_context);
+extern bool pgstat_io_op_valid(BackendType bktype, IOContext io_context, IOOp io_op);
+extern bool pgstat_expect_io_op(BackendType bktype, IOContext io_context, IOOp io_op);
+
+/* IO stats translation function in freelist.c */
+extern IOContext IOContextForStrategy(BufferAccessStrategy bas);
+
+
/*
* Functions in pgstat_database.c
*/
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 406db6be78..50d7e586e9 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -392,7 +392,7 @@ extern void ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *
/* freelist.c */
extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
- uint32 *buf_state);
+ uint32 *buf_state, bool *from_ring);
extern void StrategyFreeBuffer(BufferDesc *buf);
extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
BufferDesc *buf);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 6f4dfa0960..d0eed71f63 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -23,7 +23,12 @@
typedef void *Block;
-/* Possible arguments for GetAccessStrategy() */
+/*
+ * Possible arguments for GetAccessStrategy().
+ *
+ * If adding a new BufferAccessStrategyType, also add a new IOContext so
+ * statistics on IO operations using this strategy are tracked.
+ */
typedef enum BufferAccessStrategyType
{
BAS_NORMAL, /* Normal random access */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 97c9bc1861..67218ec6f2 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1106,7 +1106,9 @@ ID
INFIX
INT128
INTERFACE_INFO
+IOContext
IOFuncSelector
+IOOp
IPCompareMethod
ITEM
IV
@@ -2026,6 +2028,8 @@ PgStat_FetchConsistency
PgStat_FunctionCallUsage
PgStat_FunctionCounts
PgStat_HashKey
+PgStat_IOContextOps
+PgStat_IOOpCounters
PgStat_Kind
PgStat_KindInfo
PgStat_LocalState
--
2.34.1
Attachment: v33-0003-Add-system-view-tracking-IO-ops-per-backend-type.patch (text/x-patch; charset=US-ASCII)
From 2f63439e189eff432bba4abfc12762e7c3acd8ea Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Thu, 6 Oct 2022 12:24:42 -0400
Subject: [PATCH v33 3/4] Add system view tracking IO ops per backend type
Add pg_stat_io, a system view which tracks the number of IOOps
(clocksweeps, reuses, hits, reads, writes, extends, and fsyncs) done
through each IOContext (shared buffers, local buffers, and buffers
reserved by a BufferAccessStrategy) by each type of backend (e.g. client
backend, checkpointer).
Some BackendTypes do not accumulate IO operation statistics and will not
be included in the view.
Some IOContexts are not used by some BackendTypes and will not be in the
view. For example, checkpointer does not use a BufferAccessStrategy
(currently), so there will be no rows for BufferAccessStrategy
IOContexts for checkpointer.
Some IOOps are invalid in combination with certain IOContexts. Those
cells will be NULL in the view to distinguish between 0 observed IOOps
of that type and an invalid combination. For example, local buffers are
not fsynced so cells for all BackendTypes for IOCONTEXT_LOCAL and
IOOP_FSYNC will be NULL.
Some BackendTypes never perform certain IOOps. Those cells will also be
NULL in the view. For example, bgwriter should not perform reads.
View stats are populated with statistics incremented when a backend
performs an IO Operation and maintained by the cumulative statistics
subsystem.
Each row of the view shows stats for a particular BackendType and
IOContext combination (e.g. shared buffer accesses by checkpointer) and
each column in the view is the total number of IO Operations done (e.g.
writes).
So a cell in the view would be, for example, the number of shared
buffers written by checkpointer since the last stats reset.
In anticipation of tracking WAL IO and non-block-oriented IO (such as
temporary file IO), the "unit" column specifies the unit of the "read",
"written", and "extended" columns for a given row.
Note that some of the cells in the view are redundant with fields in
pg_stat_bgwriter (e.g. buffers_backend), however these have been kept in
pg_stat_bgwriter for backwards compatibility. Deriving the redundant
pg_stat_bgwriter stats from the IO operations stats structures was also
problematic due to the separate reset targets for 'bgwriter' and 'io'.
Suggested by Andres Freund
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Maciek Sakrejda <m.sakrejda@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
doc/src/sgml/monitoring.sgml | 215 +++++++++++++++++++++-
src/backend/catalog/system_views.sql | 15 ++
src/backend/utils/adt/pgstatfuncs.c | 130 +++++++++++++
src/include/catalog/pg_proc.dat | 9 +
src/test/regress/expected/rules.out | 12 ++
src/test/regress/expected/stats.out | 264 +++++++++++++++++++++++++++
src/test/regress/sql/stats.sql | 137 ++++++++++++++
7 files changed, 780 insertions(+), 2 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 14dfd650f8..926eb40f75 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -448,6 +448,15 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</entry>
</row>
+ <row>
+ <entry><structname>pg_stat_io</structname><indexterm><primary>pg_stat_io</primary></indexterm></entry>
+ <entry>One row per combination of backend type and IO context, showing
+ statistics about IO operations. See
+ <link linkend="monitoring-pg-stat-io-view">
+ <structname>pg_stat_io</structname></link> for details.
+ </entry>
+ </row>
+
<row>
<entry><structname>pg_stat_wal</structname><indexterm><primary>pg_stat_wal</primary></indexterm></entry>
<entry>One row only, showing statistics about WAL activity. See
@@ -3600,13 +3609,12 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
<structfield>stats_reset</structfield> <type>timestamp with time zone</type>
</para>
<para>
- Time at which these statistics were last reset
+ Time at which these statistics were last reset.
</para></entry>
</row>
</tbody>
</tgroup>
</table>
-
<para>
Normally, WAL files are archived in order, oldest to newest, but that is
not guaranteed, and does not hold under special circumstances like when
@@ -3615,7 +3623,210 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
<structfield>last_archived_wal</structfield> have also been successfully
archived.
</para>
+ </sect2>
+
+ <sect2 id="monitoring-pg-stat-io-view">
+ <title><structname>pg_stat_io</structname></title>
+
+ <indexterm>
+ <primary>pg_stat_io</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_stat_io</structname> view has a row for each backend type
+ and IO context containing global data for the cluster on IO operations done
+ by that backend type in that IO context. Currently, only a subset of IO
+ operations are tracked here. WAL IO, IO on temporary files, and some forms
+ of IO outside of shared buffers (such as when building indexes or moving a
+ table from one tablespace to another) could be added in the future.
+ </para>
+
+ <table id="pg-stat-io-view" xreflabel="pg_stat_io">
+ <title><structname>pg_stat_io</structname> View</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>backend_type</structfield> <type>text</type>
+ </para>
+ <para>
+ Type of backend (e.g. background worker, autovacuum worker).
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>io_context</structfield> <type>text</type>
+ </para>
+ <para>
+ IO Context used. This refers to the context or location of an IO
+ operation.
+ <literal>shared</literal> refers to shared buffers, the primary
+ buffer pool for relation data.
+ <literal>local</literal> refers to
+ process-local memory used for temporary tables.
+ <literal>vacuum</literal> refers to memory reserved for use during
+ vacuuming and analyzing.
+ <literal>bulkread</literal>
+ refers to memory reserved for use during bulk read operations.
+ <literal>bulkwrite</literal>
+ refers to memory reserved for use during bulk write operations.
+ The autovacuum daemon, explicit <command>VACUUM</command>, explicit
+ <command>ANALYZE</command>, many bulk reads, and many bulk writes use a
+ fixed amount of memory, acquiring the equivalent number of shared
+ buffers and reusing them circularly to avoid occupying an undue portion
+ of the main shared buffer pool.
+ </para></entry>
+ </row>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>read</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Reads by this <varname>backend_type</varname> into
+ memory or buffers in this <varname>io_context</varname>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>written</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Writes of data in this <varname>io_context</varname> written out by this
+ <varname>backend_type</varname>.
+ Note that the values of <varname>written</varname> for
+ <varname>backend_type</varname> <literal>background writer</literal> and
+ <varname>backend_type</varname> <literal>checkpointer</literal> are
+ equivalent to the values of <varname>buffers_clean</varname> and
+ <varname>buffers_checkpoint</varname>, respectively, in <link
+ linkend="monitoring-pg-stat-bgwriter-view">
+ <structname>pg_stat_bgwriter</structname></link>.
+ Also, the sum of <varname>written</varname> and
+ <varname>extended</varname> in this view for
+ <varname>backend_type</varname>s <literal>client backend</literal>,
+ <literal>autovacuum worker</literal>, <literal>background
+ worker</literal>, and <literal>walsender</literal> in
+ <varname>io_context</varname>s <literal>shared</literal>,
+ <literal>bulkread</literal>, <literal>bulkwrite</literal>, and
+ <literal>vacuum</literal> is equivalent to
+ <varname>buffers_backend</varname> in
+ <structname>pg_stat_bgwriter</structname>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>extended</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Extends of relations done by this <varname>backend_type</varname> in
+ order to write data in this <varname>io_context</varname>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>unit</structfield> <type>text</type>
+ </para>
+ <para>
+ The unit of IO read, written, or extended. Currently
+ <varname>block_size</varname> is the only possible value. Reads, writes,
+ and extends of relation data are done in <varname>block_size</varname>
+ units. Future values could include <varname>wal_block_size</varname>,
+ once WAL IO is tracked in this view, and <quote>bytes</quote>, once
+ non-block-oriented IO such as temporary file IO is tracked here.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>clocksweeps</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of clocksweeps done by this <varname>backend_type</varname> in
+ order to acquire a buffer in this <varname>io_context</varname>. A
+ <literal>clocksweep</literal> in the <literal>bulkread</literal>,
+ <literal>bulkwrite</literal>, or <literal>vacuum</literal>
+ <varname>io_context</varname>s is counted when a backend adds a shared
+ buffer to the fixed-size ring used to avoid consuming excessive shared
+ buffers. If the backend has pinned all of the buffers in the ring, it
+ may add a replacement shared buffer to the ring. This will also be
+ counted as a <literal>clocksweep</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>reused</structfield> <type>bigint</type>
+ </para>
+ <para>
+ The number of times a buffer was reused as part of an operation in the
+ <literal>bulkread</literal>, <literal>bulkwrite</literal>, and
+ <literal>vacuum</literal> <varname>io_context</varname>s.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>hit</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Relevant only for block-based IO of data accessed in the course of
+ satisfying queries, <varname>hit</varname> is the number of
+ accesses of blocks already located in a
+ <productname>PostgreSQL</productname> buffer in this specified
+ <varname>io_context</varname> by this <varname>backend_type</varname>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>files_synced</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of files fsynced by this <varname>backend_type</varname> for the
+ purpose of persisting data dirtied in this
+ <varname>io_context</varname>. <literal>fsyncs</literal> are done at
+ segment boundaries so <varname>unit</varname> does not apply to the
+ <varname>files_synced</varname> column. <literal>fsyncs</literal> done
+ by backends in order to persist data written in
+ <varname>io_context</varname> <literal>vacuum</literal>,
+ <varname>io_context</varname> <literal>bulkread</literal>, or
+ <varname>io_context</varname> <literal>bulkwrite</literal> are counted
+ as an <varname>io_context</varname> <literal>shared</literal>
+ <literal>fsync</literal>. Note that the sum of
+ <varname>files_synced</varname> for all <varname>io_context</varname>
+ <literal>shared</literal> for all <varname>backend_type</varname>s
+ except <literal>checkpointer</literal> is equivalent to
+ <varname>buffers_backend_fsync</varname> in <link
+ linkend="monitoring-pg-stat-bgwriter-view">
+ <structname>pg_stat_bgwriter</structname></link>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
+ </para>
+ <para>
+ Time at which these statistics were last reset.
+ </para></entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
</sect2>
<sect2 id="monitoring-pg-stat-bgwriter-view">
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 55f7ec79e0..25e0cef114 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1114,6 +1114,21 @@ CREATE VIEW pg_stat_bgwriter AS
pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+CREATE VIEW pg_stat_io AS
+SELECT
+ b.backend_type,
+ b.io_context,
+ b.read,
+ b.written,
+ b.extended,
+ b.unit,
+ b.clocksweeps,
+ b.reused,
+ b.hit,
+ b.files_synced,
+ b.stats_reset
+FROM pg_stat_get_io() b;
+
CREATE VIEW pg_stat_wal AS
SELECT
w.wal_records,
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index edd73e5c25..62fbf7e53a 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1712,6 +1712,136 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
}
+/*
+ * When adding a new column to the pg_stat_io view, add a new enum value
+ * here above IO_NUM_COLUMNS.
+ */
+typedef enum io_stat_col
+{
+ IO_COL_BACKEND_TYPE,
+ IO_COL_IO_CONTEXT,
+ IO_COL_READS,
+ IO_COL_WRITES,
+ IO_COL_EXTENDS,
+ IO_COL_UNIT,
+ IO_COL_CLOCKSWEEPS,
+ IO_COL_REUSES,
+ IO_COL_HITS,
+ IO_COL_FSYNCS,
+ IO_COL_RESET_TIME,
+ IO_NUM_COLUMNS,
+} io_stat_col;
+
+/*
+ * When adding a new IOOp, add a new io_stat_col and add a case to this
+ * function returning the corresponding io_stat_col.
+ */
+static io_stat_col
+pgstat_io_op_get_index(IOOp io_op)
+{
+ switch (io_op)
+ {
+ case IOOP_CLOCKSWEEP:
+ return IO_COL_CLOCKSWEEPS;
+ case IOOP_HIT:
+ return IO_COL_HITS;
+ case IOOP_READ:
+ return IO_COL_READS;
+ case IOOP_REUSE:
+ return IO_COL_REUSES;
+ case IOOP_WRITE:
+ return IO_COL_WRITES;
+ case IOOP_EXTEND:
+ return IO_COL_EXTENDS;
+ case IOOP_FSYNC:
+ return IO_COL_FSYNCS;
+ }
+
+ elog(ERROR, "unrecognized IOOp value: %d", io_op);
+}
+
+Datum
+pg_stat_get_io(PG_FUNCTION_ARGS)
+{
+ PgStat_BackendIOContextOps *backends_io_stats;
+ ReturnSetInfo *rsinfo;
+ Datum reset_time;
+
+ SetSingleFuncCall(fcinfo, 0);
+ rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+
+ backends_io_stats = pgstat_fetch_backend_io_context_ops();
+
+ reset_time = TimestampTzGetDatum(backends_io_stats->stat_reset_timestamp);
+
+ for (int bktype = 0; bktype < BACKEND_NUM_TYPES; bktype++)
+ {
+ Datum bktype_desc = CStringGetTextDatum(GetBackendTypeDesc((BackendType) bktype));
+ bool expect_backend_stats = true;
+ PgStat_IOContextOps *io_context_ops = &backends_io_stats->stats[bktype];
+
+ /*
+ * For those BackendTypes without IO Operation stats, skip
+ * representing them in the view altogether.
+ */
+ expect_backend_stats = pgstat_io_op_stats_collected((BackendType)
+ bktype);
+
+ for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ {
+ PgStat_IOOpCounters *counters = &io_context_ops->data[io_context];
+ const char *io_context_str = pgstat_io_context_desc(io_context);
+
+ Datum values[IO_NUM_COLUMNS] = {0};
+ bool nulls[IO_NUM_COLUMNS] = {0};
+
+ /*
+ * Some combinations of IOContext and BackendType are not valid
+ * for any type of IOOp. In such cases, omit the entire row from
+ * the view.
+ */
+ if (!expect_backend_stats ||
+ !pgstat_bktype_io_context_valid((BackendType) bktype,
+ (IOContext) io_context))
+ {
+ pgstat_io_context_ops_assert_zero(counters);
+ continue;
+ }
+
+ values[IO_COL_BACKEND_TYPE] = bktype_desc;
+ values[IO_COL_IO_CONTEXT] = CStringGetTextDatum(io_context_str);
+ values[IO_COL_READS] = Int64GetDatum(counters->reads);
+ values[IO_COL_WRITES] = Int64GetDatum(counters->writes);
+ values[IO_COL_EXTENDS] = Int64GetDatum(counters->extends);
+ values[IO_COL_UNIT] = CStringGetTextDatum("block_size");
+ values[IO_COL_CLOCKSWEEPS] = Int64GetDatum(counters->clocksweeps);
+ values[IO_COL_REUSES] = Int64GetDatum(counters->reuses);
+ values[IO_COL_HITS] = Int64GetDatum(counters->hits);
+ values[IO_COL_FSYNCS] = Int64GetDatum(counters->fsyncs);
+ values[IO_COL_RESET_TIME] = TimestampTzGetDatum(reset_time);
+
+ /*
+ * Some combinations of BackendType and IOOp and of IOContext and
+ * IOOp are not valid. Set these cells in the view NULL and assert
+ * that these stats are zero as expected.
+ */
+ for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
+ {
+ if (!(pgstat_io_op_valid((BackendType) bktype, (IOContext)
+ io_context, (IOOp) io_op)))
+ {
+ pgstat_io_op_assert_zero(counters, (IOOp) io_op);
+ nulls[pgstat_io_op_get_index((IOOp) io_op)] = true;
+ }
+ }
+
+ tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
+ }
+ }
+
+ return (Datum) 0;
+}
+
/*
* Returns statistics of WAL activity
*/
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 68bb032d3e..77edbe9517 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5649,6 +5649,15 @@
proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
+{ oid => '8459', descr => 'statistics: per backend type IO statistics',
+ proname => 'pg_stat_get_io', provolatile => 'v',
+ prorows => '14', proretset => 't',
+ proparallel => 'r', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,text,int8,int8,int8,text,int8,int8,int8,int8,timestamptz}',
+ proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}',
+ proargnames => '{backend_type,io_context,read,written,extended,unit,clocksweeps,reused,hit,files_synced,stats_reset}',
+ prosrc => 'pg_stat_get_io' },
+
{ oid => '1136', descr => 'statistics: information about WAL activity',
proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
proparallel => 'r', prorettype => 'record', proargtypes => '',
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 9dd137415e..7cedd530f5 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1868,6 +1868,18 @@ pg_stat_gssapi| SELECT s.pid,
s.gss_enc AS encrypted
FROM pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid, query_id)
WHERE (s.client_port IS NOT NULL);
+pg_stat_io| SELECT b.backend_type,
+ b.io_context,
+ b.read,
+ b.written,
+ b.extended,
+ b.unit,
+ b.clocksweeps,
+ b.reused,
+ b.hit,
+ b.files_synced,
+ b.stats_reset
+ FROM pg_stat_get_io() b(backend_type, io_context, read, written, extended, unit, clocksweeps, reused, hit, files_synced, stats_reset);
pg_stat_progress_analyze| SELECT s.pid,
s.datid,
d.datname,
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index f701da2069..0172a6c95e 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -918,4 +918,268 @@ SELECT pg_stat_get_subscription_stats(NULL);
(1 row)
+-- Test that the following operations are tracked in pg_stat_io:
+-- - clocksweeps of shared buffers
+-- - reads of target blocks into shared buffers
+-- - shared buffer cache hits when target blocks reside in shared buffers
+-- - writes of shared buffers
+-- - extends of relations using shared buffers
+-- - fsyncs done to ensure the durability of data dirtying shared buffers
+SELECT sum(clocksweeps) AS io_sum_shared_clocksweeps_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(hit) AS io_sum_shared_hits_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(read) AS io_sum_shared_reads_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(written) AS io_sum_shared_writes_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(extended) AS io_sum_shared_extends_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(files_synced) AS io_sum_shared_fsyncs_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+-- Create a regular table and insert some data to generate IOCONTEXT_SHARED
+-- clocksweeps and extends.
+CREATE TABLE test_io_shared(a int);
+INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+-- After a checkpoint, there should be some additional IOCONTEXT_SHARED writes
+-- and fsyncs.
+-- The second checkpoint ensures that stats from the first checkpoint have been
+-- reported and protects against any potential races amongst the table
+-- creation, a possible timing-triggered checkpoint, and the explicit
+-- checkpoint in the test.
+CHECKPOINT;
+CHECKPOINT;
+SELECT sum(clocksweeps) AS io_sum_shared_clocksweeps_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(written) AS io_sum_shared_writes_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(extended) AS io_sum_shared_extends_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(files_synced) AS io_sum_shared_fsyncs_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT :io_sum_shared_clocksweeps_after > :io_sum_shared_clocksweeps_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_shared_writes_after > :io_sum_shared_writes_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_shared_extends_after > :io_sum_shared_extends_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT current_setting('fsync') = 'off' OR :io_sum_shared_fsyncs_after > :io_sum_shared_fsyncs_before;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Change the tablespace so that the table is rewritten directly, then SELECT
+-- from it to cause it to be read back into shared buffers.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE test_io_shared_stats_tblspc LOCATION '';
+ALTER TABLE test_io_shared SET TABLESPACE test_io_shared_stats_tblspc;
+SELECT COUNT(*) FROM test_io_shared;
+ count
+-------
+ 100
+(1 row)
+
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(read) AS io_sum_shared_reads_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT :io_sum_shared_reads_after > :io_sum_shared_reads_before;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Select from the table again once it is in shared buffers. There should be
+-- some hits recorded in pg_stat_io.
+SELECT sum(hit) AS io_sum_shared_hits_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT :io_sum_shared_hits_after > :io_sum_shared_hits_before;
+ ?column?
+----------
+ t
+(1 row)
+
+DROP TABLE test_io_shared;
+DROP TABLESPACE test_io_shared_stats_tblspc;
+-- Test that clocksweeps of local buffers, reads of temporary table blocks
+-- into local buffers, temporary table block cache hits in local buffers,
+-- writes of local buffers, and extends of temporary tables are tracked in
+-- pg_stat_io.
+-- Set temp_buffers to a low value so that we can trigger writes with fewer
+-- inserted tuples. Do so in a new session in case temporary tables have been
+-- accessed by previous tests in this session.
+\c
+SET temp_buffers TO '1MB';
+CREATE TEMPORARY TABLE test_io_local(a int, b TEXT);
+SELECT sum(clocksweeps) AS io_sum_local_clocksweeps_before FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(hit) AS io_sum_local_hits_before FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(read) AS io_sum_local_reads_before FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(written) AS io_sum_local_writes_before FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(extended) AS io_sum_local_extends_before FROM pg_stat_io WHERE io_context = 'local' \gset
+-- Insert enough values that we need to reuse and write out dirty local
+-- buffers.
+INSERT INTO test_io_local SELECT generate_series(1, 8000) as id, repeat('a', 100);
+-- Read in evicted buffers.
+SELECT COUNT(*) FROM test_io_local;
+ count
+-------
+ 8000
+(1 row)
+
+-- Query tuples in local buffers to ensure new local buffer cache hits.
+SELECT COUNT(*) FROM test_io_local;
+ count
+-------
+ 8000
+(1 row)
+
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(clocksweeps) AS io_sum_local_clocksweeps_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(hit) AS io_sum_local_hits_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(read) AS io_sum_local_reads_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(written) AS io_sum_local_writes_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(extended) AS io_sum_local_extends_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT :io_sum_local_clocksweeps_after > :io_sum_local_clocksweeps_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_local_reads_after > :io_sum_local_reads_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_local_hits_after > :io_sum_local_hits_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_local_writes_after > :io_sum_local_writes_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_local_extends_after > :io_sum_local_extends_before;
+ ?column?
+----------
+ t
+(1 row)
+
+RESET temp_buffers;
+-- Test that reuse of strategy buffers and reads of blocks into these reused
+-- buffers while VACUUMing are tracked in pg_stat_io.
+-- Set wal_skip_threshold smaller than the expected size of
+-- test_io_vac_strategy so that, even if wal_level is minimal, VACUUM FULL will
+-- fsync the newly rewritten test_io_vac_strategy instead of writing it to WAL.
+-- Writing it to WAL will result in the newly written relation pages being in
+-- shared buffers -- preventing us from testing BAS_VACUUM BufferAccessStrategy
+-- clocksweeps and reads.
+SET wal_skip_threshold = '1 kB';
+SELECT sum(clocksweeps) AS io_sum_vac_strategy_clocksweeps_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(reused) AS io_sum_vac_strategy_reuses_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(read) AS io_sum_vac_strategy_reads_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+CREATE TABLE test_io_vac_strategy(a int, b int) WITH (autovacuum_enabled = 'false');
+INSERT INTO test_io_vac_strategy SELECT i, i from generate_series(1, 8000)i;
+-- Ensure that the next VACUUM will need to perform IO by rewriting the table
+-- first with VACUUM (FULL).
+VACUUM (FULL) test_io_vac_strategy;
+VACUUM (PARALLEL 0) test_io_vac_strategy;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(clocksweeps) AS io_sum_vac_strategy_clocksweeps_after FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(reused) AS io_sum_vac_strategy_reuses_after FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(read) AS io_sum_vac_strategy_reads_after FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT :io_sum_vac_strategy_clocksweeps_after > :io_sum_vac_strategy_clocksweeps_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_vac_strategy_reads_after > :io_sum_vac_strategy_reads_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_vac_strategy_reuses_after > :io_sum_vac_strategy_reuses_before;
+ ?column?
+----------
+ t
+(1 row)
+
+RESET wal_skip_threshold;
+-- Test that extends done by a CTAS, which uses a BAS_BULKWRITE
+-- BufferAccessStrategy, are tracked in pg_stat_io.
+SELECT sum(extended) AS io_sum_bulkwrite_strategy_extends_before FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+CREATE TABLE test_io_bulkwrite_strategy AS SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(extended) AS io_sum_bulkwrite_strategy_extends_after FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+SELECT :io_sum_bulkwrite_strategy_extends_after > :io_sum_bulkwrite_strategy_extends_before;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Test that reads of blocks into reused strategy buffers during database
+-- creation, which uses a BAS_BULKREAD BufferAccessStrategy, are tracked in
+-- pg_stat_io.
+SELECT sum(read) AS io_sum_bulkread_strategy_reads_before FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+CREATE DATABASE test_io_bulkread_strategy_db;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(read) AS io_sum_bulkread_strategy_reads_after FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+SELECT :io_sum_bulkread_strategy_reads_after > :io_sum_bulkread_strategy_reads_before;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Test IO stats reset
+SELECT sum(clocksweeps) + sum(reused) + sum(extended) + sum(files_synced) + sum(read) + sum(written) AS io_stats_pre_reset FROM pg_stat_io \gset
+SELECT pg_stat_reset_shared('io');
+ pg_stat_reset_shared
+----------------------
+
+(1 row)
+
+SELECT sum(clocksweeps) + sum(reused) + sum(extended) + sum(files_synced) + sum(read) + sum(written) AS io_stats_post_reset FROM pg_stat_io \gset
+SELECT :io_stats_post_reset < :io_stats_pre_reset;
+ ?column?
+----------
+ t
+(1 row)
+
-- End of Stats Test
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index eb081f65a4..d3860fd9df 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -449,4 +449,141 @@ SELECT pg_stat_get_replication_slot(NULL);
SELECT pg_stat_get_subscription_stats(NULL);
+-- Test that the following operations are tracked in pg_stat_io:
+-- - clocksweeps of shared buffers
+-- - reads of target blocks into shared buffers
+-- - shared buffer cache hits when target blocks reside in shared buffers
+-- - writes of shared buffers
+-- - extends of relations using shared buffers
+-- - fsyncs done to ensure the durability of data dirtying shared buffers
+SELECT sum(clocksweeps) AS io_sum_shared_clocksweeps_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(hit) AS io_sum_shared_hits_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(read) AS io_sum_shared_reads_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(written) AS io_sum_shared_writes_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(extended) AS io_sum_shared_extends_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(files_synced) AS io_sum_shared_fsyncs_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+-- Create a regular table and insert some data to generate IOCONTEXT_SHARED
+-- clocksweeps and extends.
+CREATE TABLE test_io_shared(a int);
+INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+-- After a checkpoint, there should be some additional IOCONTEXT_SHARED writes
+-- and fsyncs.
+-- The second checkpoint ensures that stats from the first checkpoint have been
+-- reported and protects against any potential races amongst the table
+-- creation, a possible timing-triggered checkpoint, and the explicit
+-- checkpoint in the test.
+CHECKPOINT;
+CHECKPOINT;
+SELECT sum(clocksweeps) AS io_sum_shared_clocksweeps_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(written) AS io_sum_shared_writes_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(extended) AS io_sum_shared_extends_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(files_synced) AS io_sum_shared_fsyncs_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT :io_sum_shared_clocksweeps_after > :io_sum_shared_clocksweeps_before;
+SELECT :io_sum_shared_writes_after > :io_sum_shared_writes_before;
+SELECT :io_sum_shared_extends_after > :io_sum_shared_extends_before;
+SELECT current_setting('fsync') = 'off' OR :io_sum_shared_fsyncs_after > :io_sum_shared_fsyncs_before;
+-- Change the tablespace so that the table is rewritten directly, then SELECT
+-- from it to cause it to be read back into shared buffers.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE test_io_shared_stats_tblspc LOCATION '';
+ALTER TABLE test_io_shared SET TABLESPACE test_io_shared_stats_tblspc;
+SELECT COUNT(*) FROM test_io_shared;
+SELECT pg_stat_force_next_flush();
+SELECT sum(read) AS io_sum_shared_reads_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT :io_sum_shared_reads_after > :io_sum_shared_reads_before;
+-- Select from the table again once it is in shared buffers. There should be
+-- some hits recorded in pg_stat_io.
+SELECT sum(hit) AS io_sum_shared_hits_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT :io_sum_shared_hits_after > :io_sum_shared_hits_before;
+DROP TABLE test_io_shared;
+DROP TABLESPACE test_io_shared_stats_tblspc;
+
+-- Test that clocksweeps of local buffers, reads of temporary table blocks
+-- into local buffers, temporary table block cache hits in local buffers,
+-- writes of local buffers, and extends of temporary tables are tracked in
+-- pg_stat_io.
+
+-- Set temp_buffers to a low value so that we can trigger writes with fewer
+-- inserted tuples. Do so in a new session in case temporary tables have been
+-- accessed by previous tests in this session.
+\c
+SET temp_buffers TO '1MB';
+CREATE TEMPORARY TABLE test_io_local(a int, b TEXT);
+SELECT sum(clocksweeps) AS io_sum_local_clocksweeps_before FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(hit) AS io_sum_local_hits_before FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(read) AS io_sum_local_reads_before FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(written) AS io_sum_local_writes_before FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(extended) AS io_sum_local_extends_before FROM pg_stat_io WHERE io_context = 'local' \gset
+-- Insert enough values that we need to reuse and write out dirty local
+-- buffers.
+INSERT INTO test_io_local SELECT generate_series(1, 8000) as id, repeat('a', 100);
+-- Read in evicted buffers.
+SELECT COUNT(*) FROM test_io_local;
+-- Query tuples in local buffers to ensure new local buffer cache hits.
+SELECT COUNT(*) FROM test_io_local;
+SELECT pg_stat_force_next_flush();
+SELECT sum(clocksweeps) AS io_sum_local_clocksweeps_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(hit) AS io_sum_local_hits_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(read) AS io_sum_local_reads_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(written) AS io_sum_local_writes_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(extended) AS io_sum_local_extends_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT :io_sum_local_clocksweeps_after > :io_sum_local_clocksweeps_before;
+SELECT :io_sum_local_reads_after > :io_sum_local_reads_before;
+SELECT :io_sum_local_hits_after > :io_sum_local_hits_before;
+SELECT :io_sum_local_writes_after > :io_sum_local_writes_before;
+SELECT :io_sum_local_extends_after > :io_sum_local_extends_before;
+RESET temp_buffers;
+
+-- Test that reuse of strategy buffers and reads of blocks into these reused
+-- buffers while VACUUMing are tracked in pg_stat_io.
+
+-- Set wal_skip_threshold smaller than the expected size of
+-- test_io_vac_strategy so that, even if wal_level is minimal, VACUUM FULL will
+-- fsync the newly rewritten test_io_vac_strategy instead of writing it to WAL.
+-- Writing it to WAL will result in the newly written relation pages being in
+-- shared buffers -- preventing us from testing BAS_VACUUM BufferAccessStrategy
+-- clocksweeps and reads.
+SET wal_skip_threshold = '1 kB';
+SELECT sum(clocksweeps) AS io_sum_vac_strategy_clocksweeps_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(reused) AS io_sum_vac_strategy_reuses_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(read) AS io_sum_vac_strategy_reads_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+CREATE TABLE test_io_vac_strategy(a int, b int) WITH (autovacuum_enabled = 'false');
+INSERT INTO test_io_vac_strategy SELECT i, i from generate_series(1, 8000)i;
+-- Ensure that the next VACUUM will need to perform IO by rewriting the table
+-- first with VACUUM (FULL).
+VACUUM (FULL) test_io_vac_strategy;
+VACUUM (PARALLEL 0) test_io_vac_strategy;
+SELECT pg_stat_force_next_flush();
+SELECT sum(clocksweeps) AS io_sum_vac_strategy_clocksweeps_after FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(reused) AS io_sum_vac_strategy_reuses_after FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(read) AS io_sum_vac_strategy_reads_after FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT :io_sum_vac_strategy_clocksweeps_after > :io_sum_vac_strategy_clocksweeps_before;
+SELECT :io_sum_vac_strategy_reads_after > :io_sum_vac_strategy_reads_before;
+SELECT :io_sum_vac_strategy_reuses_after > :io_sum_vac_strategy_reuses_before;
+RESET wal_skip_threshold;
+
+-- Test that extends done by a CTAS, which uses a BAS_BULKWRITE
+-- BufferAccessStrategy, are tracked in pg_stat_io.
+SELECT sum(extended) AS io_sum_bulkwrite_strategy_extends_before FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+CREATE TABLE test_io_bulkwrite_strategy AS SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+SELECT sum(extended) AS io_sum_bulkwrite_strategy_extends_after FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+SELECT :io_sum_bulkwrite_strategy_extends_after > :io_sum_bulkwrite_strategy_extends_before;
+
+-- Test that reads of blocks into reused strategy buffers during database
+-- creation, which uses a BAS_BULKREAD BufferAccessStrategy, are tracked in
+-- pg_stat_io.
+SELECT sum(read) AS io_sum_bulkread_strategy_reads_before FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+CREATE DATABASE test_io_bulkread_strategy_db;
+SELECT pg_stat_force_next_flush();
+SELECT sum(read) AS io_sum_bulkread_strategy_reads_after FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+SELECT :io_sum_bulkread_strategy_reads_after > :io_sum_bulkread_strategy_reads_before;
+
+-- Test IO stats reset
+SELECT sum(clocksweeps) + sum(reused) + sum(extended) + sum(files_synced) + sum(read) + sum(written) AS io_stats_pre_reset FROM pg_stat_io \gset
+SELECT pg_stat_reset_shared('io');
+SELECT sum(clocksweeps) + sum(reused) + sum(extended) + sum(files_synced) + sum(read) + sum(written) AS io_stats_post_reset FROM pg_stat_io \gset
+SELECT :io_stats_post_reset < :io_stats_pre_reset;
+
-- End of Stats Test
--
2.34.1
Thanks for working on this! Like Lukas, I'm excited to see more
visibility into important parts of the system like this.
On Mon, Oct 10, 2022 at 11:49 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:
I've gone ahead and implemented option 1 (commented below).
No strong opinion on 1 versus 2, but I guess at least partly because I
don't understand the implications (I do understand the difference,
just not when it might be important in terms of stats). Can we think
of a situation where combining stats about initial additions with
pinned additions hides some behavior that might be good to understand
and hard to pinpoint otherwise?
I took a look at the latest docs (as someone mostly familiar with
internals at only a pretty high level, so probably somewhat close to
the target audience) and have some feedback.
+ <row>
+ <entry role="catalog_table_entry"><para
role="column_definition">
+ <structfield>backend_type</structfield> <type>text</type>
+ </para>
+ <para>
+ Type of backend (e.g. background worker, autovacuum worker).
+ </para></entry>
+ </row>
Not critical, but is there a list of backend types we could
cross-reference elsewhere in the docs?
From the io_context column description:
+ The autovacuum daemon, explicit <command>VACUUM</command>,
explicit
+ <command>ANALYZE</command>, many bulk reads, and many bulk
writes use a
+ fixed amount of memory, acquiring the equivalent number of
shared
+ buffers and reusing them circularly to avoid occupying an
undue portion
+ of the main shared buffer pool.
+ </para></entry>
I don't understand how this is relevant to the io_context column.
Could you expand on that, or am I just missing something obvious?
+ <row>
+ <entry role="catalog_table_entry"><para
role="column_definition">
+ <structfield>extended</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Extends of relations done by this
<varname>backend_type</varname> in
+ order to write data in this <varname>io_context</varname>.
+ </para></entry>
+ </row>
I understand what this is, but not why this is something I might want
to know about.
And from your earlier e-mail:
On Thu, Oct 6, 2022 at 10:42 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:
Because we want to add non-block-oriented IO in the future (like
temporary file IO) to this view and want to use the same "read",
"written", "extended" columns, I would prefer not to prefix the columns
with "blks_". I have added a column "unit" which would contain the unit
in which read, written, and extended are in. Unfortunately, fsyncs are
not per block, so "unit" doesn't really work for this. I documented
this.

The most correct thing to do to accommodate block-oriented and
non-block-oriented IO would be to specify all the values in bytes.
However, I would like this view to be usable visually (as opposed to
just in scripts and by tools). The only current value of unit is
"block_size" which could potentially be combined with the value of the
GUC to get bytes.

I've hard-coded the string "block_size" into the view generation
function pg_stat_get_io(), so, if this idea makes sense, perhaps I
should do something better there.
That seems broadly reasonable, but pg_settings also has a 'unit'
field, and in that view, unit is '8kB' on my system--i.e., it
(presumably) reflects the block size. Is that something we should try
to be consistent with (not sure if that's a good idea, but thought it
was worth asking)?
On Fri, Sep 30, 2022 at 7:18 PM Lukas Fittl <lukas@fittl.com> wrote:
- Overall it would be helpful if we had a dedicated documentation page on I/O statistics that's linked from the pg_stat_io view description, and explains how the I/O statistics tie into the various concepts of shared buffers / buffer access strategies / etc (and what is not tracked today)
I haven't done this yet. How specific were you thinking -- like
interpretations of all the combinations and what to do with what you
see? Like you should run pg_prewarm if you see X? Specific checkpointer
or bgwriter GUCs to change? Or just links to other docs pages on
recommended tunings?

Were you imagining the other IO statistics views (like
pg_statio_all_tables and pg_stat_database) also being included in this
page? Like would it be a comprehensive guide to IO statistics and what
their significance/purposes are?
I can't speak for Lukas here, but I encouraged him to suggest more
thorough documentation in general, so I can speak to my concerns: in
general, these stats should be usable for someone who does not know
much about Postgres internals. It's pretty low-level information,
sure, so I think you need some understanding of how the system broadly
works to make sense of it. But ideally you should be able to find what
you need to understand the concepts involved within the docs.
I think your updated docs are much clearer (with the caveats of my
specific comments above). It would still probably be helpful to have a
dedicated page on I/O stats (and yeah, something with a broad scope,
along the lines of a comprehensive guide), but I think that can wait
until a future patch.
Thanks,
Maciek
On Mon, Oct 10, 2022 at 7:43 PM Maciek Sakrejda <m.sakrejda@gmail.com> wrote:
Thanks for working on this! Like Lukas, I'm excited to see more
visibility into important parts of the system like this.
Thanks for taking another look!
On Mon, Oct 10, 2022 at 11:49 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:

I've gone ahead and implemented option 1 (commented below).
No strong opinion on 1 versus 2, but I guess at least partly because I
don't understand the implications (I do understand the difference,
just not when it might be important in terms of stats). Can we think
of a situation where combining stats about initial additions with
pinned additions hides some behavior that might be good to understand
and hard to pinpoint otherwise?
I think that it makes sense to count both the initial buffers added to
the ring and subsequent shared buffers added to the ring (either when
the current strategy buffer is pinned or in use or when a bulkread
rejects dirty strategy buffers in favor of new shared buffers) as
strategy clocksweeps because of how the statistic would be used.
Clocksweeps give you an idea of how much of your working set is cached
(setting aside initially reading data into shared buffers when you are
warming up the db). You may use clocksweeps to determine if you need to
make shared buffers larger.
Distinguishing strategy buffer clocksweeps from shared buffer
clocksweeps allows us to avoid enlarging shared buffers if most of the
clocksweeps are to bring in blocks for the strategy operation.
However, I could see an argument that discounting strategy clocksweeps
done because the current strategy buffer is pinned makes the number of
shared buffer clocksweeps artificially low since those other queries
using the buffer would have suffered a cache miss were it not for the
strategy. And, in this case, you would take strategy clocksweeps
together with shared clocksweeps to make your decision. And if we
include buffers initially added to the strategy ring in the strategy
clocksweep statistic, this number may be off because those blocks may
not be needed in the main shared working set. But you won't know that
until you try to reuse the buffer and it is pinned. So, I think we don't
have a better option than counting initial buffers added to the ring as
strategy clocksweeps (as opposed to as reuses).
So, in answer to your question, no, I cannot think of a scenario like
that.
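
To make the intended use concrete, a query against the proposed view
might look something like this (just a sketch, assuming the pg_stat_io
view and the clocksweeps/io_context column names from this version of
the patch):

```sql
-- Sketch: compare clocksweeps by IO context (names as proposed in this
-- patch version). A high share of 'shared' clocksweeps relative to the
-- strategy contexts suggests enlarging shared_buffers could help.
SELECT io_context, sum(clocksweeps) AS clocksweeps
FROM pg_stat_io
GROUP BY io_context
ORDER BY clocksweeps DESC;
```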
Sitting down and thinking about that for a long time did, however, help
me realize that some of my code comments were misleading (and some
incorrect). I will update these in the next version once we agree on
updated docs.
It also made me remember that I am incorrectly counting rejected buffers
as reused. I'm not sure if it is a good idea to subtract from reuses
when a buffer is rejected. Waiting until after it is rejected to count
the reuse will take some other code changes. Perhaps we could also count
rejections in the stats?
I took a look at the latest docs (as someone mostly familiar with
internals at only a pretty high level, so probably somewhat close to
the target audience) and have some feedback.

+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>backend_type</structfield> <type>text</type>
+ </para>
+ <para>
+ Type of backend (e.g. background worker, autovacuum worker).
+ </para></entry>
+ </row>

Not critical, but is there a list of backend types we could
cross-reference elsewhere in the docs?
The most I could find was this longer explanation (with exhaustive list
of types) in pg_stat_activity docs [1]. I could duplicate what it says
or I could link to the view and say "see pg_stat_activity for a
description of backend_type" or something like that (to keep them from
getting out of sync as new backend_types are added). I suppose I could
also add docs on backend_types, but I'm not sure where something like
that would go.
From the io_context column description:
+ The autovacuum daemon, explicit <command>VACUUM</command>, explicit
+ <command>ANALYZE</command>, many bulk reads, and many bulk writes use a
+ fixed amount of memory, acquiring the equivalent number of shared
+ buffers and reusing them circularly to avoid occupying an undue portion
+ of the main shared buffer pool.
+ </para></entry>

I don't understand how this is relevant to the io_context column.
Could you expand on that, or am I just missing something obvious?
I'm trying to explain why those other IO Contexts exist (bulkread,
bulkwrite, vacuum) and why they are separate from shared buffers.
Should I cut it altogether or preface it with something like: these are
counted separate from shared buffers because...?
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>extended</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Extends of relations done by this <varname>backend_type</varname> in
+ order to write data in this <varname>io_context</varname>.
+ </para></entry>
+ </row>

I understand what this is, but not why this is something I might want
to know about.
Unlike writes, backends largely have to do their own extends, so
separating this from writes lets us determine whether or not we need to
change checkpointer/bgwriter to be more aggressive using the writes
without the distraction of the extends. Should I mention this in the
docs? The other stats views don't seem to editorialize at all, and I
wasn't sure if this was an objective enough point to include in docs.
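
For example, something like the following sketch (assuming the column
names in the current patch version) would let you look at write
pressure without the extends mixed in:

```sql
-- Sketch: writes vs extends per backend type in the shared context.
-- Extends largely cannot be offloaded to checkpointer/bgwriter, so
-- only the writes column indicates tunable background-writing behavior.
SELECT backend_type, sum(written) AS writes, sum(extended) AS extends
FROM pg_stat_io
WHERE io_context = 'shared'
GROUP BY backend_type;
```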
And from your earlier e-mail:
On Thu, Oct 6, 2022 at 10:42 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:

Because we want to add non-block-oriented IO in the future (like
temporary file IO) to this view and want to use the same "read",
"written", "extended" columns, I would prefer not to prefix the columns
with "blks_". I have added a column "unit" which would contain the unit
in which read, written, and extended are in. Unfortunately, fsyncs are
not per block, so "unit" doesn't really work for this. I documented
this.

The most correct thing to do to accommodate block-oriented and
non-block-oriented IO would be to specify all the values in bytes.
However, I would like this view to be usable visually (as opposed to
just in scripts and by tools). The only current value of unit is
"block_size" which could potentially be combined with the value of the
GUC to get bytes.

I've hard-coded the string "block_size" into the view generation
function pg_stat_get_io(), so, if this idea makes sense, perhaps I
should do something better there.

That seems broadly reasonable, but pg_settings also has a 'unit'
field, and in that view, unit is '8kB' on my system--i.e., it
(presumably) reflects the block size. Is that something we should try
to be consistent with (not sure if that's a good idea, but thought it
was worth asking)?
I think this idea is a good option. I am wondering if it would be clear
when mixed with non-block-oriented IO. Block-oriented IO would say 8kB
(or whatever the build-time value of a block was) and non-block-oriented
IO would say B or kB. The math would work out.
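
For block-oriented rows, a reader could convert back to bytes along
these lines (a sketch; assumes the proposed read and unit columns):

```sql
-- Sketch: derive bytes for rows whose unit is 'block_size'.
SELECT backend_type, io_context,
       read * current_setting('block_size')::bigint AS read_bytes
FROM pg_stat_io
WHERE unit = 'block_size';
```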
Looking at pg_settings now though, I am confused about
how the units for wal_buffers is 8kB but then the value of wal_buffers
when I show it in psql is "16MB"...
Though the units for the pg_stat_io view for block-oriented IO would be
the build-time values for block size, so it wouldn't line up exactly
with pg_settings. However, I do like the idea of having a unit column
that reflects the value and not the name of the GUC/setting which
determined the unit. I can update this in the next version.
- Melanie
[1]: https://www.postgresql.org/docs/15/monitoring-stats.html#MONITORING-PG-STAT-ACTIVITY-VIEW
On Thu, Oct 13, 2022 at 10:29 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:
I think that it makes sense to count both the initial buffers added to
the ring and subsequent shared buffers added to the ring (either when
the current strategy buffer is pinned or in use or when a bulkread
rejects dirty strategy buffers in favor of new shared buffers) as
strategy clocksweeps because of how the statistic would be used.

Clocksweeps give you an idea of how much of your working set is cached
(setting aside initially reading data into shared buffers when you are
warming up the db). You may use clocksweeps to determine if you need to
make shared buffers larger.

Distinguishing strategy buffer clocksweeps from shared buffer
clocksweeps allows us to avoid enlarging shared buffers if most of the
clocksweeps are to bring in blocks for the strategy operation.

However, I could see an argument that discounting strategy clocksweeps
done because the current strategy buffer is pinned makes the number of
shared buffer clocksweeps artificially low since those other queries
using the buffer would have suffered a cache miss were it not for the
strategy. And, in this case, you would take strategy clocksweeps
together with shared clocksweeps to make your decision. And if we
include buffers initially added to the strategy ring in the strategy
clocksweep statistic, this number may be off because those blocks may
not be needed in the main shared working set. But you won't know that
until you try to reuse the buffer and it is pinned. So, I think we don't
have a better option than counting initial buffers added to the ring as
strategy clocksweeps (as opposed to as reuses).

So, in answer to your question, no, I cannot think of a scenario like
that.
That analysis makes sense to me; thanks.
It also made me remember that I am incorrectly counting rejected buffers
as reused. I'm not sure if it is a good idea to subtract from reuses
when a buffer is rejected. Waiting until after it is rejected to count
the reuse will take some other code changes. Perhaps we could also count
rejections in the stats?
I'm not sure what makes sense here.
Not critical, but is there a list of backend types we could
cross-reference elsewhere in the docs?

The most I could find was this longer explanation (with exhaustive list
of types) in pg_stat_activity docs [1]. I could duplicate what it says
or I could link to the view and say "see pg_stat_activity for a
description of backend_type" or something like that (to keep them from
getting out of sync as new backend_types are added). I suppose I could
also add docs on backend_types, but I'm not sure where something like
that would go.
I think linking pg_stat_activity is reasonable for now. A separate
section for this might be nice at some point, but that seems out of
scope.
From the io_context column description:
+ The autovacuum daemon, explicit <command>VACUUM</command>, explicit
+ <command>ANALYZE</command>, many bulk reads, and many bulk writes use a
+ fixed amount of memory, acquiring the equivalent number of shared
+ buffers and reusing them circularly to avoid occupying an undue portion
+ of the main shared buffer pool.
+ </para></entry>

I don't understand how this is relevant to the io_context column.
Could you expand on that, or am I just missing something obvious?

I'm trying to explain why those other IO Contexts exist (bulkread,
bulkwrite, vacuum) and why they are separate from shared buffers.
Should I cut it altogether or preface it with something like: these are
counted separate from shared buffers because...?
Oh I see. That makes sense; it just wasn't obvious to me this was
talking about the last three values of io_context. I think a brief
preface like that would be helpful (maybe explicitly with "these last
three values", and I think "counted separately").
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>extended</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Extends of relations done by this <varname>backend_type</varname> in
+ order to write data in this <varname>io_context</varname>.
+ </para></entry>
+ </row>

I understand what this is, but not why this is something I might want
to know about.

Unlike writes, backends largely have to do their own extends, so
separating this from writes lets us determine whether or not we need to
change checkpointer/bgwriter to be more aggressive using the writes
without the distraction of the extends. Should I mention this in the
docs? The other stats views don't seem to editorialize at all, and I
wasn't sure if this was an objective enough point to include in docs.
Thanks for the clarification. Just to make sure I understand, you mean
that if I see a high extended count, that may be interesting in terms
of write activity, but I can't fix that by tuning--it's just the
nature of my workload?
I think you're right that this is not objective enough. It's
unfortunate that there's not a good place in the docs for info like
that, since stats like this are hard to interpret without that
context, but I admit that it's not really this patch's job to solve
that larger issue.
That seems broadly reasonable, but pg_settings also has a 'unit'
field, and in that view, unit is '8kB' on my system--i.e., it
(presumably) reflects the block size. Is that something we should try
to be consistent with (not sure if that's a good idea, but thought it
was worth asking)?

I think this idea is a good option. I am wondering if it would be clear
when mixed with non-block-oriented IO. Block-oriented IO would say 8kB
(or whatever the build-time value of a block was) and non-block-oriented
IO would say B or kB. The math would work out.
Right, yeah. Although maybe that's a little confusing? When you
originally added "unit", you had said:
The most correct thing to do to accommodate block-oriented and
non-block-oriented IO would be to specify all the values in bytes.
However, I would like this view to be usable visually (as opposed to
just in scripts and by tools). The only current value of unit is
"block_size" which could potentially be combined with the value of the
GUC to get bytes.
Is this still usable visually if you have to compare values across
units? I don't really have any great ideas here (and maybe this is
still the best option), just pointing it out.
Looking at pg_settings now though, I am confused about
how the units for wal_buffers is 8kB but then the value of wal_buffers
when I show it in psql is "16MB"...
You mean the difference between
maciek=# select setting, unit from pg_settings where name = 'wal_buffers';
setting | unit
---------+------
512 | 8kB
(1 row)
and
maciek=# show wal_buffers;
wal_buffers
-------------
4MB
(1 row)
?
Poking around, I think it looks like that's due to
convert_int_from_base_unit (indirectly called from SHOW /
current_setting):
/*
 * Convert an integer value in some base unit to a human-friendly unit.
 *
 * The output unit is chosen so that it's the greatest unit that can
 * represent the value without loss.  For example, if the base unit is
 * GUC_UNIT_KB, 1024 is converted to 1 MB, but 1025 is represented as
 * 1025 kB.
 */
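
So the two displays are consistent; the SHOW value is just the setting
multiplied out and then promoted to the largest lossless unit:

```sql
-- 512 blocks * 8kB = 4096kB, which is exactly 4MB, so SHOW picks MB.
SELECT 512 * 8 AS total_kb, (512 * 8) / 1024 AS total_mb;
-- total_kb = 4096, total_mb = 4
```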
Though the units for the pg_stat_io view for block-oriented IO would be
the build-time values for block size, so it wouldn't line up exactly
with pg_settings.
I don't follow--what would be the discrepancy?
v34 is attached.
I think the column names need discussion. Also, the docs need more work
(I added a lot of new content there). I could use feedback on the column
names and definitions and review/rephrasing ideas for the docs
additions.
On Mon, Oct 17, 2022 at 1:28 AM Maciek Sakrejda <m.sakrejda@gmail.com> wrote:
On Thu, Oct 13, 2022 at 10:29 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:

I think that it makes sense to count both the initial buffers added to
the ring and subsequent shared buffers added to the ring (either when
the current strategy buffer is pinned or in use or when a bulkread
rejects dirty strategy buffers in favor of new shared buffers) as
strategy clocksweeps because of how the statistic would be used.

Clocksweeps give you an idea of how much of your working set is cached
(setting aside initially reading data into shared buffers when you are
warming up the db). You may use clocksweeps to determine if you need to
make shared buffers larger.

Distinguishing strategy buffer clocksweeps from shared buffer
clocksweeps allows us to avoid enlarging shared buffers if most of the
clocksweeps are to bring in blocks for the strategy operation.

However, I could see an argument that discounting strategy clocksweeps
done because the current strategy buffer is pinned makes the number of
shared buffer clocksweeps artificially low since those other queries
using the buffer would have suffered a cache miss were it not for the
strategy. And, in this case, you would take strategy clocksweeps
together with shared clocksweeps to make your decision. And if we
include buffers initially added to the strategy ring in the strategy
clocksweep statistic, this number may be off because those blocks may
not be needed in the main shared working set. But you won't know that
until you try to reuse the buffer and it is pinned. So, I think we don't
have a better option than counting initial buffers added to the ring as
strategy clocksweeps (as opposed to as reuses).

So, in answer to your question, no, I cannot think of a scenario like
that.

That analysis makes sense to me; thanks.
I have made some major changes in this area to make the columns more
useful. I have renamed and split "clocksweeps". It is now "evicted" and
"freelist acquired". This makes it clear when a block must be evicted
from a shared buffer and may help to identify misconfiguration of
shared buffers.
There is some nuance here that I tried to make clear in the docs.
"freelist acquired" in a shared context is straightforward.
"freelist acquired" in a strategy context is counted when a shared
buffer is added to the strategy ring (not when it is reused).
"freelist acquired" in the local buffer context is actually the initial
allocation of a local buffer (in contrast with reuse).
"evicted" in the shared IOContext is a block being evicted from a shared
buffer in order to reuse that buffer when not using a strategy.
"evicted" in a strategy IOContext is a block being evicted from
a shared buffer in order to add that shared buffer to the strategy ring.
This is in contrast with "reused" in a strategy IOContext which is when
an existing buffer in the strategy ring has a block evicted in order to
reuse that buffer in a strategy context.
"evicted" in a local IOContext is when an existing local buffer has a
block evicted in order to reuse that local buffer.
"freelist_acquired" is confusing for local buffers but I wanted to
distinguish between reuse/eviction of local buffers and initial
allocation. "freelist_acquired" seemed more fitting because there is a
clocksweep to find a local buffer and if it hasn't been allocated yet it
is allocated in a place similar to where shared buffers acquire a buffer
from the freelist. If I didn't count it here, I would need to make a new
column only for local buffers called "allocated" or something like that.
I chose not to call "evicted" "sb_evicted"
because then we would need a separate "local_evicted". I could instead
make "local_evicted", "sb_evicted", and rename "reused" to
"strat_evicted". If I did that we would end up with separate columns for
every IO Context describing behavior when a buffer is initially acquired
vs when it is reused.
It would look something like this:
shared buffers:
initial: freelist_acquired
reused: sb_evicted
local buffers:
initial: allocated
reused: local_evicted
strategy buffers:
initial: sb_evicted | freelist_acquired
reused: strat_evicted
replaced: sb_evicted | freelist_acquired
This seems not too bad at first, but if you consider that later we will
add other kinds of IO -- eg WAL IO or temporary file IO, we won't be
able to use these existing columns and will need to add even more
columns describing the exact behavior in those cases.
I wanted to devise a paradigm which allowed for reuse of columns across
IOContexts even if with slightly different meanings.
I have also added the columns "repossessed" and "rejected". "rejected"
is when a bulkread rejects a strategy buffer because it is dirty and
requires flush. Seeing a lot of rejections could indicate you need to
vacuum. "repossessed" is the number of times a strategy buffer was
pinned or in use by another backend and had to be removed from the
strategy ring and replaced with a new shared buffer. This gives you some
indication that there is contention on blocks recently used by a
strategy.
I've also added some descriptions to the docs of how these columns might
be used or what a large value in one of them may mean.
I haven't added tests for repossessed or rejected yet. I can add tests
for repossessed if we decide to keep it. Rejected is hard to write a
test for because we can't guarantee checkpointer won't clean up the
buffer before we can reject it.
It also made me remember that I am incorrectly counting rejected buffers
as reused. I'm not sure if it is a good idea to subtract from reuses
when a buffer is rejected. Waiting until after it is rejected to count
the reuse will take some other code changes. Perhaps we could also count
rejections in the stats?
I'm not sure what makes sense here.
I have fixed the counting of rejected and have made a new column
dedicated to rejected.
From the io_context column description:
+       The autovacuum daemon, explicit <command>VACUUM</command>, explicit
+       <command>ANALYZE</command>, many bulk reads, and many bulk writes use a
+       fixed amount of memory, acquiring the equivalent number of shared
+       buffers and reusing them circularly to avoid occupying an undue portion
+       of the main shared buffer pool.
+      </para></entry>

I don't understand how this is relevant to the io_context column.
Could you expand on that, or am I just missing something obvious?

I'm trying to explain why those other IO Contexts exist (bulkread,
bulkwrite, vacuum) and why they are separate from shared buffers.
Should I cut it altogether or preface it with something like: these are
counted separate from shared buffers because...?

Oh I see. That makes sense; it just wasn't obvious to me this was
talking about the last three values of io_context. I think a brief
preface like that would be helpful (maybe explicitly with "these last
three values", and I think "counted separately").
I've done this. Thanks for the suggested wording.
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>extended</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Extends of relations done by this <varname>backend_type</varname> in
+       order to write data in this <varname>io_context</varname>.
+      </para></entry>
+     </row>

I understand what this is, but not why this is something I might want
to know about.

Unlike writes, backends largely have to do their own extends, so
separating this from writes lets us determine whether or not we need to
change checkpointer/bgwriter to be more aggressive using the writes
without the distraction of the extends. Should I mention this in the
docs? The other stats views don't seem to editorialize at all, and I
wasn't sure if this was an objective enough point to include in docs.

Thanks for the clarification. Just to make sure I understand, you mean
that if I see a high extended count, that may be interesting in terms
of write activity, but I can't fix that by tuning--it's just the
nature of my workload?
That is correct.
That seems broadly reasonable, but pg_settings also has a 'unit'
field, and in that view, unit is '8kB' on my system--i.e., it
(presumably) reflects the block size. Is that something we should try
to be consistent with (not sure if that's a good idea, but thought it
was worth asking)?

I think this idea is a good option. I am wondering if it would be clear
when mixed with non-block-oriented IO. Block-oriented IO would say 8kB
(or whatever the build-time value of a block was) and non-block-oriented
IO would say B or kB. The math would work out.

Right, yeah. Although maybe that's a little confusing? When you
originally added "unit", you had said:

The most correct thing to do to accommodate block-oriented and
non-block-oriented IO would be to specify all the values in bytes.
However, I would like this view to be usable visually (as opposed to
just in scripts and by tools). The only current value of unit is
"block_size" which could potentially be combined with the value of the
GUC to get bytes.

Is this still usable visually if you have to compare values across
units? I don't really have any great ideas here (and maybe this is
still the best option), just pointing it out.

Looking at pg_settings now though, I am confused about
how the units for wal_buffers is 8kB but then the value of wal_buffers
when I show it in psql is "16MB"...

You mean the difference between
maciek=# select setting, unit from pg_settings where name = 'wal_buffers';
setting | unit
---------+------
512 | 8kB
(1 row)

and
maciek=# show wal_buffers;
wal_buffers
-------------
4MB
(1 row)
?
Poking around, I think it looks like that's due to
convert_int_from_base_unit (indirectly called from SHOW /
current_setting):

/*
 * Convert an integer value in some base unit to a human-friendly unit.
 *
 * The output unit is chosen so that it's the greatest unit that can
 * represent the value without loss. For example, if the base unit is
 * GUC_UNIT_KB, 1024 is converted to 1 MB, but 1025 is represented as
 * 1025 kB.
 */
I've implemented a change using the same function pg_settings uses to
turn the build-time parameter BLCKSZ into 8kB (get_config_unit_name())
using the flag GUC_UNIT_BLOCKS. I am unsure if this is better or worse
than "block_size". I am feeling very conflicted about this column.
Though the units for the pg_stat_io view for block-oriented IO would be
the build-time values for block size, so it wouldn't line up exactly
with pg_settings.

I don't follow--what would be the discrepancy?
I got confused.
You are right -- pg_settings does seem to use the build-time value of
BLCKSZ to derive this. I was confused because the description of
pg_settings says:
"The view pg_settings provides access to run-time parameters of the server."
- Melanie
Attachments:
v34-0001-Remove-BufferAccessStrategyData-current_was_in_r.patch
From da4e97d2c7ec2997f07df53b7d7c734d2ee9f9ab Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Thu, 13 Oct 2022 11:03:05 -0700
Subject: [PATCH v34 1/5] Remove BufferAccessStrategyData->current_was_in_ring
It is a duplication of StrategyGetBuffer->from_ring.
---
src/backend/storage/buffer/bufmgr.c | 2 +-
src/backend/storage/buffer/freelist.c | 15 ++-------------
src/include/storage/buf_internals.h | 2 +-
3 files changed, 4 insertions(+), 15 deletions(-)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 6b95381481..4e7b0b31bb 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1254,7 +1254,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
UnlockBufHdr(buf, buf_state);
if (XLogNeedsFlush(lsn) &&
- StrategyRejectBuffer(strategy, buf))
+ StrategyRejectBuffer(strategy, buf, from_ring))
{
/* Drop lock/pin and loop around for another buffer */
LWLockRelease(BufferDescriptorGetContentLock(buf));
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 990e081aae..64728bd7ce 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -81,12 +81,6 @@ typedef struct BufferAccessStrategyData
*/
int current;
- /*
- * True if the buffer just returned by StrategyGetBuffer had been in the
- * ring already.
- */
- bool current_was_in_ring;
-
/*
* Array of buffer numbers. InvalidBuffer (that is, zero) indicates we
* have not yet selected a buffer for this ring slot. For allocation
@@ -625,10 +619,7 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
*/
bufnum = strategy->buffers[strategy->current];
if (bufnum == InvalidBuffer)
- {
- strategy->current_was_in_ring = false;
return NULL;
- }
/*
* If the buffer is pinned we cannot use it under any circumstances.
@@ -644,7 +635,6 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
if (BUF_STATE_GET_REFCOUNT(local_buf_state) == 0
&& BUF_STATE_GET_USAGECOUNT(local_buf_state) <= 1)
{
- strategy->current_was_in_ring = true;
*buf_state = local_buf_state;
return buf;
}
@@ -654,7 +644,6 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
* Tell caller to allocate a new buffer with the normal allocation
* strategy. He'll then replace this ring element via AddBufferToRing.
*/
- strategy->current_was_in_ring = false;
return NULL;
}
@@ -682,14 +671,14 @@ AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
* if this buffer should be written and re-used.
*/
bool
-StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
+StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool from_ring)
{
/* We only do this in bulkread mode */
if (strategy->btype != BAS_BULKREAD)
return false;
/* Don't muck with behavior of normal buffer-replacement strategy */
- if (!strategy->current_was_in_ring ||
+ if (!from_ring ||
strategy->buffers[strategy->current] != BufferDescriptorGetBuffer(buf))
return false;
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 406db6be78..b75481450d 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -395,7 +395,7 @@ extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
uint32 *buf_state);
extern void StrategyFreeBuffer(BufferDesc *buf);
extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
- BufferDesc *buf);
+ BufferDesc *buf, bool from_ring);
extern int StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
extern void StrategyNotifyBgWriter(int bgwprocno);
--
2.34.1
v34-0003-Aggregate-IO-operation-stats-per-BackendType.patch
From 96e1968827218aaa50e252840cdc224e8a92e8ee Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Thu, 6 Oct 2022 12:23:38 -0400
Subject: [PATCH v34 3/5] Aggregate IO operation stats per BackendType
Stats on IOOps for all IOContexts for a backend are tracked locally. Add
functionality for backends to flush these stats to shared memory and
accumulate them with those from all other backends, exited and live.
Also add reset and snapshot functions used by cumulative stats system
for management of these statistics.
The aggregated stats in shared memory could be extended in the future
with per-backend stats -- useful for per connection IO statistics and
monitoring.
Some BackendTypes will not flush their pending statistics at regular
intervals and explicitly call pgstat_flush_io_ops() during the course of
normal operations to flush their backend-local IO operation statistics
to shared memory in a timely manner.
Because not all BackendType, IOOp, IOContext combinations are valid, the
validity of the stats is checked before flushing pending stats and
before reading in the existing stats file to shared memory.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Maciek Sakrejda <m.sakrejda@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
doc/src/sgml/monitoring.sgml | 2 +
src/backend/utils/activity/pgstat.c | 35 ++++
src/backend/utils/activity/pgstat_bgwriter.c | 7 +-
.../utils/activity/pgstat_checkpointer.c | 7 +-
src/backend/utils/activity/pgstat_io_ops.c | 167 ++++++++++++++++++
src/backend/utils/activity/pgstat_relation.c | 15 +-
src/backend/utils/activity/pgstat_shmem.c | 4 +
src/backend/utils/activity/pgstat_wal.c | 4 +-
src/backend/utils/adt/pgstatfuncs.c | 4 +-
src/include/miscadmin.h | 2 +
src/include/pgstat.h | 92 ++++++++++
src/include/utils/pgstat_internal.h | 36 ++++
src/tools/pgindent/typedefs.list | 3 +
13 files changed, 372 insertions(+), 6 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index e5d622d514..698f274341 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -5390,6 +5390,8 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
the <structname>pg_stat_bgwriter</structname>
view, <literal>archiver</literal> to reset all the counters shown in
the <structname>pg_stat_archiver</structname> view,
+ <literal>io</literal> to reset all the counters shown in the
+ <structname>pg_stat_io</structname> view,
<literal>wal</literal> to reset all the counters shown in the
<structname>pg_stat_wal</structname> view or
<literal>recovery_prefetch</literal> to reset all the counters shown
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 1b97597f17..4becee9a6c 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -359,6 +359,15 @@ static const PgStat_KindInfo pgstat_kind_infos[PGSTAT_NUM_KINDS] = {
.snapshot_cb = pgstat_checkpointer_snapshot_cb,
},
+ [PGSTAT_KIND_IOOPS] = {
+ .name = "io_ops",
+
+ .fixed_amount = true,
+
+ .reset_all_cb = pgstat_io_ops_reset_all_cb,
+ .snapshot_cb = pgstat_io_ops_snapshot_cb,
+ },
+
[PGSTAT_KIND_SLRU] = {
.name = "slru",
@@ -582,6 +591,7 @@ pgstat_report_stat(bool force)
/* Don't expend a clock check if nothing to do */
if (dlist_is_empty(&pgStatPending) &&
+ !have_ioopstats &&
!have_slrustats &&
!pgstat_have_pending_wal())
{
@@ -628,6 +638,9 @@ pgstat_report_stat(bool force)
/* flush database / relation / function / ... stats */
partial_flush |= pgstat_flush_pending_entries(nowait);
+ /* flush IO Operations stats */
+ partial_flush |= pgstat_flush_io_ops(nowait);
+
/* flush wal stats */
partial_flush |= pgstat_flush_wal(nowait);
@@ -1321,6 +1334,14 @@ pgstat_write_statsfile(void)
pgstat_build_snapshot_fixed(PGSTAT_KIND_CHECKPOINTER);
write_chunk_s(fpout, &pgStatLocal.snapshot.checkpointer);
+ /*
+ * Write IO Operations stats struct
+ */
+ pgstat_build_snapshot_fixed(PGSTAT_KIND_IOOPS);
+ write_chunk_s(fpout, &pgStatLocal.snapshot.io_ops.stat_reset_timestamp);
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ write_chunk_s(fpout, &pgStatLocal.snapshot.io_ops.stats[i]);
+
/*
* Write SLRU stats struct
*/
@@ -1495,6 +1516,20 @@ pgstat_read_statsfile(void)
if (!read_chunk_s(fpin, &shmem->checkpointer.stats))
goto error;
+ /*
+ * Read IO Operations stats struct
+ */
+ if (!read_chunk_s(fpin, &shmem->io_ops.stat_reset_timestamp))
+ goto error;
+
+ for (int bktype = 0; bktype < BACKEND_NUM_TYPES; bktype++)
+ {
+ pgstat_backend_io_stats_assert_well_formed(shmem->io_ops.stats[bktype].data,
+ (BackendType) bktype);
+ if (!read_chunk_s(fpin, &shmem->io_ops.stats[bktype].data))
+ goto error;
+ }
+
/*
* Read SLRU stats struct
*/
diff --git a/src/backend/utils/activity/pgstat_bgwriter.c b/src/backend/utils/activity/pgstat_bgwriter.c
index fbb1edc527..3d7f90a1b7 100644
--- a/src/backend/utils/activity/pgstat_bgwriter.c
+++ b/src/backend/utils/activity/pgstat_bgwriter.c
@@ -24,7 +24,7 @@ PgStat_BgWriterStats PendingBgWriterStats = {0};
/*
- * Report bgwriter statistics
+ * Report bgwriter and IO Operation statistics
*/
void
pgstat_report_bgwriter(void)
@@ -56,6 +56,11 @@ pgstat_report_bgwriter(void)
* Clear out the statistics buffer, so it can be re-used.
*/
MemSet(&PendingBgWriterStats, 0, sizeof(PendingBgWriterStats));
+
+ /*
+ * Report IO Operations statistics
+ */
+ pgstat_flush_io_ops(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_checkpointer.c b/src/backend/utils/activity/pgstat_checkpointer.c
index af8d513e7b..cfcf127210 100644
--- a/src/backend/utils/activity/pgstat_checkpointer.c
+++ b/src/backend/utils/activity/pgstat_checkpointer.c
@@ -24,7 +24,7 @@ PgStat_CheckpointerStats PendingCheckpointerStats = {0};
/*
- * Report checkpointer statistics
+ * Report checkpointer and IO Operation statistics
*/
void
pgstat_report_checkpointer(void)
@@ -62,6 +62,11 @@ pgstat_report_checkpointer(void)
* Clear out the statistics buffer, so it can be re-used.
*/
MemSet(&PendingCheckpointerStats, 0, sizeof(PendingCheckpointerStats));
+
+ /*
+ * Report IO Operation statistics
+ */
+ pgstat_flush_io_ops(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_io_ops.c b/src/backend/utils/activity/pgstat_io_ops.c
index 00abca4e94..8280eee731 100644
--- a/src/backend/utils/activity/pgstat_io_ops.c
+++ b/src/backend/utils/activity/pgstat_io_ops.c
@@ -20,6 +20,51 @@
#include "utils/pgstat_internal.h"
static PgStat_IOContextOps pending_IOOpStats;
+bool have_ioopstats = false;
+
+
+/*
+ * Helper function to accumulate source PgStat_IOOpCounters into target
+ * PgStat_IOOpCounters. If either of the passed-in PgStat_IOOpCounters are
+ * members of PgStatShared_IOContextOps, the caller is responsible for ensuring
+ * that the appropriate lock is held.
+ */
+static void
+pgstat_accum_io_op(PgStat_IOOpCounters *target, PgStat_IOOpCounters *source, IOOp io_op)
+{
+ switch (io_op)
+ {
+ case IOOP_EVICT:
+ target->evictions += source->evictions;
+ return;
+ case IOOP_EXTEND:
+ target->extends += source->extends;
+ return;
+ case IOOP_FREELIST_ACQUIRE:
+ target->freelist_acquisitions += source->freelist_acquisitions;
+ return;
+ case IOOP_FSYNC:
+ target->fsyncs += source->fsyncs;
+ return;
+ case IOOP_READ:
+ target->reads += source->reads;
+ return;
+ case IOOP_REPOSSESS:
+ target->repossessions += source->repossessions;
+ return;
+ case IOOP_REJECT:
+ target->rejections += source->rejections;
+ return;
+ case IOOP_REUSE:
+ target->reuses += source->reuses;
+ return;
+ case IOOP_WRITE:
+ target->writes += source->writes;
+ return;
+ }
+
+ elog(ERROR, "unrecognized IOOp value: %d", io_op);
+}
void
pgstat_count_io_op(IOOp io_op, IOContext io_context)
@@ -63,6 +108,78 @@ pgstat_count_io_op(IOOp io_op, IOContext io_context)
break;
}
+ have_ioopstats = true;
+}
+
+PgStat_BackendIOContextOps *
+pgstat_fetch_backend_io_context_ops(void)
+{
+ pgstat_snapshot_fixed(PGSTAT_KIND_IOOPS);
+
+ return &pgStatLocal.snapshot.io_ops;
+}
+
+/*
+ * Flush out locally pending IO Operation statistics entries
+ *
+ * If no stats have been recorded, this function returns false.
+ *
+ * If nowait is true, this function returns true if the lock could not be
+ * acquired. Otherwise, return false.
+ */
+bool
+pgstat_flush_io_ops(bool nowait)
+{
+ PgStatShared_IOContextOps *type_shstats;
+ bool expect_backend_stats = true;
+
+ if (!have_ioopstats)
+ return false;
+
+ type_shstats =
+ &pgStatLocal.shmem->io_ops.stats[MyBackendType];
+
+ if (!nowait)
+ LWLockAcquire(&type_shstats->lock, LW_EXCLUSIVE);
+ else if (!LWLockConditionalAcquire(&type_shstats->lock, LW_EXCLUSIVE))
+ return true;
+
+ expect_backend_stats = pgstat_io_op_stats_collected(MyBackendType);
+
+ for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ {
+ PgStat_IOOpCounters *sharedent = &type_shstats->data[io_context];
+ PgStat_IOOpCounters *pendingent = &pending_IOOpStats.data[io_context];
+
+ if (!expect_backend_stats ||
+ !pgstat_bktype_io_context_valid(MyBackendType, (IOContext) io_context))
+ {
+ pgstat_io_context_ops_assert_zero(sharedent);
+ pgstat_io_context_ops_assert_zero(pendingent);
+ continue;
+ }
+
+ for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
+ {
+ if (!(pgstat_io_op_valid(MyBackendType, (IOContext) io_context,
+ (IOOp) io_op)))
+ {
+ pgstat_io_op_assert_zero(sharedent, (IOOp) io_op);
+ pgstat_io_op_assert_zero(pendingent, (IOOp) io_op);
+ continue;
+ }
+
+ pgstat_accum_io_op(sharedent, pendingent, (IOOp) io_op);
+ }
+ }
+
+ LWLockRelease(&type_shstats->lock);
+
+ memset(&pending_IOOpStats, 0, sizeof(pending_IOOpStats));
+
+ have_ioopstats = false;
+
+ return false;
}
const char *
@@ -113,6 +230,56 @@ pgstat_io_op_desc(IOOp io_op)
elog(ERROR, "unrecognized IOOp value: %d", io_op);
}
+void
+pgstat_io_ops_reset_all_cb(TimestampTz ts)
+{
+ PgStatShared_BackendIOContextOps *backends_stats_shmem = &pgStatLocal.shmem->io_ops;
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ PgStatShared_IOContextOps *stats_shmem = &backends_stats_shmem->stats[i];
+
+ LWLockAcquire(&stats_shmem->lock, LW_EXCLUSIVE);
+
+ /*
+ * Use the lock in the first BackendType's PgStat_IOContextOps to
+ * protect the reset timestamp as well.
+ */
+ if (i == 0)
+ backends_stats_shmem->stat_reset_timestamp = ts;
+
+ memset(stats_shmem->data, 0, sizeof(stats_shmem->data));
+ LWLockRelease(&stats_shmem->lock);
+ }
+}
+
+void
+pgstat_io_ops_snapshot_cb(void)
+{
+ PgStatShared_BackendIOContextOps *backends_stats_shmem = &pgStatLocal.shmem->io_ops;
+ PgStat_BackendIOContextOps *backends_stats_snap = &pgStatLocal.snapshot.io_ops;
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ PgStatShared_IOContextOps *stats_shmem = &backends_stats_shmem->stats[i];
+ PgStat_IOContextOps *stats_snap = &backends_stats_snap->stats[i];
+
+ LWLockAcquire(&stats_shmem->lock, LW_SHARED);
+
+ /*
+ * Use the lock in the first BackendType's PgStat_IOContextOps to
+ * protect the reset timestamp as well.
+ */
+ if (i == 0)
+ backends_stats_snap->stat_reset_timestamp =
+ backends_stats_shmem->stat_reset_timestamp;
+
+ memcpy(stats_snap->data, stats_shmem->data, sizeof(stats_shmem->data));
+ LWLockRelease(&stats_shmem->lock);
+ }
+
+}
+
/*
* IO Operation statistics are not collected for all BackendTypes.
*
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index 55a355f583..a23a90b133 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -205,7 +205,7 @@ pgstat_drop_relation(Relation rel)
}
/*
- * Report that the table was just vacuumed.
+ * Report that the table was just vacuumed and flush IO Operation statistics.
*/
void
pgstat_report_vacuum(Oid tableoid, bool shared,
@@ -257,10 +257,18 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
}
pgstat_unlock_entry(entry_ref);
+
+ /*
+ * Flush IO Operations statistics now. pgstat_report_stat() will flush IO
+ * Operation stats, however this will not be called after an entire
+ * autovacuum cycle is done -- which will likely vacuum many relations --
+ * or until the VACUUM command has processed all tables and committed.
+ */
+ pgstat_flush_io_ops(false);
}
/*
- * Report that the table was just analyzed.
+ * Report that the table was just analyzed and flush IO Operation statistics.
*
* Caller must provide new live- and dead-tuples estimates, as well as a
* flag indicating whether to reset the changes_since_analyze counter.
@@ -340,6 +348,9 @@ pgstat_report_analyze(Relation rel,
}
pgstat_unlock_entry(entry_ref);
+
+ /* see pgstat_report_vacuum() */
+ pgstat_flush_io_ops(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_shmem.c b/src/backend/utils/activity/pgstat_shmem.c
index 9a4f037959..275a7be166 100644
--- a/src/backend/utils/activity/pgstat_shmem.c
+++ b/src/backend/utils/activity/pgstat_shmem.c
@@ -202,6 +202,10 @@ StatsShmemInit(void)
LWLockInitialize(&ctl->checkpointer.lock, LWTRANCHE_PGSTATS_DATA);
LWLockInitialize(&ctl->slru.lock, LWTRANCHE_PGSTATS_DATA);
LWLockInitialize(&ctl->wal.lock, LWTRANCHE_PGSTATS_DATA);
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ LWLockInitialize(&ctl->io_ops.stats[i].lock,
+ LWTRANCHE_PGSTATS_DATA);
}
else
{
diff --git a/src/backend/utils/activity/pgstat_wal.c b/src/backend/utils/activity/pgstat_wal.c
index 5a878bd115..9cac407b42 100644
--- a/src/backend/utils/activity/pgstat_wal.c
+++ b/src/backend/utils/activity/pgstat_wal.c
@@ -34,7 +34,7 @@ static WalUsage prevWalUsage;
/*
* Calculate how much WAL usage counters have increased and update
- * shared statistics.
+ * shared WAL and IO Operation statistics.
*
* Must be called by processes that generate WAL, that do not call
* pgstat_report_stat(), like walwriter.
@@ -43,6 +43,8 @@ void
pgstat_report_wal(bool force)
{
pgstat_flush_wal(force);
+
+ pgstat_flush_io_ops(force);
}
/*
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 96bffc0f2a..b783af130c 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -2084,6 +2084,8 @@ pg_stat_reset_shared(PG_FUNCTION_ARGS)
pgstat_reset_of_kind(PGSTAT_KIND_BGWRITER);
pgstat_reset_of_kind(PGSTAT_KIND_CHECKPOINTER);
}
+ else if (strcmp(target, "io") == 0)
+ pgstat_reset_of_kind(PGSTAT_KIND_IOOPS);
else if (strcmp(target, "recovery_prefetch") == 0)
XLogPrefetchResetStats();
else if (strcmp(target, "wal") == 0)
@@ -2092,7 +2094,7 @@ pg_stat_reset_shared(PG_FUNCTION_ARGS)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("unrecognized reset target: \"%s\"", target),
- errhint("Target must be \"archiver\", \"bgwriter\", \"recovery_prefetch\", or \"wal\".")));
+ errhint("Target must be \"archiver\", \"io\", \"bgwriter\", \"recovery_prefetch\", or \"wal\".")));
PG_RETURN_VOID();
}
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index e7ebea4ff4..bf97162e83 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -331,6 +331,8 @@ typedef enum BackendType
B_WAL_WRITER,
} BackendType;
+#define BACKEND_NUM_TYPES B_WAL_WRITER + 1
+
extern PGDLLIMPORT BackendType MyBackendType;
extern const char *GetBackendTypeDesc(BackendType backendType);
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 2844d72f86..a1c345a270 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -49,6 +49,7 @@ typedef enum PgStat_Kind
PGSTAT_KIND_ARCHIVER,
PGSTAT_KIND_BGWRITER,
PGSTAT_KIND_CHECKPOINTER,
+ PGSTAT_KIND_IOOPS,
PGSTAT_KIND_SLRU,
PGSTAT_KIND_WAL,
} PgStat_Kind;
@@ -328,6 +329,12 @@ typedef struct PgStat_IOContextOps
PgStat_IOOpCounters data[IOCONTEXT_NUM_TYPES];
} PgStat_IOContextOps;
+typedef struct PgStat_BackendIOContextOps
+{
+ TimestampTz stat_reset_timestamp;
+ PgStat_IOContextOps stats[BACKEND_NUM_TYPES];
+} PgStat_BackendIOContextOps;
+
typedef struct PgStat_StatDBEntry
{
PgStat_Counter n_xact_commit;
@@ -510,6 +517,7 @@ extern PgStat_CheckpointerStats *pgstat_fetch_stat_checkpointer(void);
*/
extern void pgstat_count_io_op(IOOp io_op, IOContext io_context);
+extern PgStat_BackendIOContextOps *pgstat_fetch_backend_io_context_ops(void);
extern const char *pgstat_io_context_desc(IOContext io_context);
extern const char *pgstat_io_op_desc(IOOp io_op);
@@ -521,6 +529,90 @@ extern bool pgstat_expect_io_op(BackendType bktype, IOContext io_context, IOOp i
/* IO stats translation function in freelist.c */
extern IOContext IOContextForStrategy(BufferAccessStrategy bas);
+/*
+ * Functions to assert that invalid IO Operation counters are zero.
+ */
+static inline void
+pgstat_io_context_ops_assert_zero(PgStat_IOOpCounters *counters)
+{
+ Assert(counters->evictions == 0 && counters->extends == 0 &&
+ counters->freelist_acquisitions == 0 && counters->fsyncs == 0 &&
+ counters->reads == 0 && counters->rejections == 0 &&
+ counters->repossessions == 0 && counters->reuses == 0 &&
+ counters->writes == 0);
+}
+
+static inline void
+pgstat_io_op_assert_zero(PgStat_IOOpCounters *counters, IOOp io_op)
+{
+ switch (io_op)
+ {
+ case IOOP_EVICT:
+ Assert(counters->evictions == 0);
+ return;
+ case IOOP_EXTEND:
+ Assert(counters->extends == 0);
+ return;
+ case IOOP_FREELIST_ACQUIRE:
+ Assert(counters->freelist_acquisitions == 0);
+ return;
+ case IOOP_FSYNC:
+ Assert(counters->fsyncs == 0);
+ return;
+ case IOOP_READ:
+ Assert(counters->reads == 0);
+ return;
+ case IOOP_REJECT:
+ Assert(counters->rejections == 0);
+ return;
+ case IOOP_REPOSSESS:
+ Assert(counters->repossessions == 0);
+ return;
+ case IOOP_REUSE:
+ Assert(counters->reuses == 0);
+ return;
+ case IOOP_WRITE:
+ Assert(counters->writes == 0);
+ return;
+ }
+
+ /* Should not reach here */
+ Assert(false);
+}
+
+/*
+ * Assert that stats have not been counted for any combination of IOContext and
+ * IOOp which are not valid for the passed-in BackendType. The passed-in array
+ * of PgStat_IOOpCounters must contain stats from the BackendType specified by
+ * the second parameter. Caller is responsible for any locking if the passed-in
+ * array of PgStat_IOOpCounters is a member of PgStatShared_IOContextOps.
+ */
+static inline void
+pgstat_backend_io_stats_assert_well_formed(PgStat_IOOpCounters
+ backend_io_context_ops[IOCONTEXT_NUM_TYPES], BackendType bktype)
+{
+ bool expect_backend_stats = true;
+
+ if (!pgstat_io_op_stats_collected(bktype))
+ expect_backend_stats = false;
+
+ for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ {
+ if (!expect_backend_stats ||
+ !pgstat_bktype_io_context_valid(bktype, (IOContext) io_context))
+ {
+ pgstat_io_context_ops_assert_zero(&backend_io_context_ops[io_context]);
+ continue;
+ }
+
+ for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
+ {
+ if (!pgstat_io_op_valid(bktype, (IOContext) io_context, (IOOp) io_op))
+ pgstat_io_op_assert_zero(&backend_io_context_ops[io_context],
+ (IOOp) io_op);
+ }
+ }
+}
/*
diff --git a/src/include/utils/pgstat_internal.h b/src/include/utils/pgstat_internal.h
index 627c1389e4..9066fed660 100644
--- a/src/include/utils/pgstat_internal.h
+++ b/src/include/utils/pgstat_internal.h
@@ -330,6 +330,25 @@ typedef struct PgStatShared_Checkpointer
PgStat_CheckpointerStats reset_offset;
} PgStatShared_Checkpointer;
+typedef struct PgStatShared_IOContextOps
+{
+ /*
+ * lock protects ->data If this PgStatShared_IOContextOps is
+ * PgStatShared_BackendIOContextOps->stats[0], lock also protects
+ * PgStatShared_BackendIOContextOps->stat_reset_timestamp.
+ */
+ LWLock lock;
+ PgStat_IOOpCounters data[IOCONTEXT_NUM_TYPES];
+} PgStatShared_IOContextOps;
+
+typedef struct PgStatShared_BackendIOContextOps
+{
+ /* ->stats_reset_timestamp is protected by ->stats[0].lock */
+ TimestampTz stat_reset_timestamp;
+ PgStatShared_IOContextOps stats[BACKEND_NUM_TYPES];
+} PgStatShared_BackendIOContextOps;
+
+
typedef struct PgStatShared_SLRU
{
/* lock protects ->stats */
@@ -420,6 +439,7 @@ typedef struct PgStat_ShmemControl
PgStatShared_Archiver archiver;
PgStatShared_BgWriter bgwriter;
PgStatShared_Checkpointer checkpointer;
+ PgStatShared_BackendIOContextOps io_ops;
PgStatShared_SLRU slru;
PgStatShared_Wal wal;
} PgStat_ShmemControl;
@@ -443,6 +463,8 @@ typedef struct PgStat_Snapshot
PgStat_CheckpointerStats checkpointer;
+ PgStat_BackendIOContextOps io_ops;
+
PgStat_SLRUStats slru[SLRU_NUM_ELEMENTS];
PgStat_WalStats wal;
@@ -550,6 +572,15 @@ extern void pgstat_database_reset_timestamp_cb(PgStatShared_Common *header, Time
extern bool pgstat_function_flush_cb(PgStat_EntryRef *entry_ref, bool nowait);
+/*
+ * Functions in pgstat_io_ops.c
+ */
+
+extern void pgstat_io_ops_reset_all_cb(TimestampTz ts);
+extern void pgstat_io_ops_snapshot_cb(void);
+extern bool pgstat_flush_io_ops(bool nowait);
+
+
/*
* Functions in pgstat_relation.c
*/
@@ -642,6 +673,11 @@ extern void pgstat_create_transactional(PgStat_Kind kind, Oid dboid, Oid objoid)
extern PGDLLIMPORT PgStat_LocalState pgStatLocal;
+/*
+ * Variables in pgstat_io_ops.c
+ */
+
+extern PGDLLIMPORT bool have_ioopstats;
/*
* Variables in pgstat_slru.c
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 600c0fe855..10781480cd 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2005,12 +2005,14 @@ PgFdwRelationInfo
PgFdwScanState
PgIfAddrCallback
PgStatShared_Archiver
+PgStatShared_BackendIOContextOps
PgStatShared_BgWriter
PgStatShared_Checkpointer
PgStatShared_Common
PgStatShared_Database
PgStatShared_Function
PgStatShared_HashEntry
+PgStatShared_IOContextOps
PgStatShared_Relation
PgStatShared_ReplSlot
PgStatShared_SLRU
@@ -2018,6 +2020,7 @@ PgStatShared_Subscription
PgStatShared_Wal
PgStat_ArchiverStats
PgStat_BackendFunctionEntry
+PgStat_BackendIOContextOps
PgStat_BackendSubEntry
PgStat_BgWriterStats
PgStat_CheckpointerStats
--
2.34.1
Attachment: v34-0002-Track-IO-operation-statistics-locally.patch (text/x-patch)
From c540da75759239dd78b096e5b6711f79270f4672 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Thu, 6 Oct 2022 12:23:25 -0400
Subject: [PATCH v34 2/5] Track IO operation statistics locally
Introduce "IOOp", an IO operation done by a backend, and "IOContext",
the source, target, or type of the IO. For example, the
checkpointer may write a shared buffer out. This would be counted as an
IOOp "written" on an IOContext IOCONTEXT_SHARED by BackendType
"checkpointer".
Each IOOp (evict, freelist acquisition, reject, repossess, reuse, read,
write, extend, and fsync) is counted per IOContext (bulkread, bulkwrite,
local, shared, or vacuum) through a call to pgstat_count_io_op().
The primary concern of these statistics is IO operations on data blocks
during the course of normal database operations. IO operations done by,
for example, the archiver or syslogger are not counted in these
statistics. WAL IO, temporary file IO, and IO done directly through smgr*
functions (such as when building an index) are not yet counted but would
be useful future additions.
IOCONTEXT_LOCAL and IOCONTEXT_SHARED IOContexts concern operations on
local and shared buffers.
The IOCONTEXT_BULKREAD, IOCONTEXT_BULKWRITE, and IOCONTEXT_VACUUM
IOContexts concern IO operations on buffers as part of a
BufferAccessStrategy.
IOOP_FREELIST_ACQUIRE and IOOP_EVICT IOOps are counted in
IOCONTEXT_SHARED and IOCONTEXT_LOCAL IOContexts when a buffer is
acquired or allocated through [Local]BufferAlloc() and no
BufferAccessStrategy is in use.
When a BufferAccessStrategy is in use, shared buffers added to the
strategy ring are counted as IOOP_FREELIST_ACQUIRE or IOOP_EVICT IOOps
in the IOCONTEXT_[BULKREAD|BULKWRITE|VACUUM] IOContext. When one of
these buffers is reused, it is counted as an IOOP_REUSE IOOp in the
corresponding strategy IOContext.
IOOP_WRITE IOOps are counted in the BufferAccessStrategy IOContexts
whenever the reused dirty buffer is written out.
Stats on IOOps in all IOContexts for a given backend are counted in a
backend's local memory. A subsequent commit will expose functions for
aggregating and viewing these stats.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Maciek Sakrejda <m.sakrejda@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
src/backend/postmaster/checkpointer.c | 13 ++
src/backend/storage/buffer/bufmgr.c | 73 +++++-
src/backend/storage/buffer/freelist.c | 71 +++++-
src/backend/storage/buffer/localbuf.c | 19 ++
src/backend/storage/sync/sync.c | 2 +
src/backend/utils/activity/Makefile | 1 +
src/backend/utils/activity/meson.build | 1 +
src/backend/utils/activity/pgstat_io_ops.c | 260 +++++++++++++++++++++
src/include/pgstat.h | 70 ++++++
src/include/storage/buf_internals.h | 2 +-
src/include/storage/bufmgr.h | 7 +-
src/tools/pgindent/typedefs.list | 4 +
12 files changed, 510 insertions(+), 13 deletions(-)
create mode 100644 src/backend/utils/activity/pgstat_io_ops.c
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 5fc076fc14..4ea4e6a298 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -1116,6 +1116,19 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
if (!AmBackgroundWriterProcess())
CheckpointerShmem->num_backend_fsync++;
LWLockRelease(CheckpointerCommLock);
+
+ /*
+ * We have no way of knowing if the current IOContext is
+ * IOCONTEXT_SHARED or IOCONTEXT_[BULKREAD, BULKWRITE, VACUUM] at this
+ * point, so count the fsync as being in the IOCONTEXT_SHARED
+ * IOContext. This is probably okay, because the number of backend
+ * fsyncs doesn't say anything about the efficacy of the
+ * BufferAccessStrategy. And counting both fsyncs done in
+ * IOCONTEXT_SHARED and IOCONTEXT_[BULKREAD, BULKWRITE, VACUUM] under
+ * IOCONTEXT_SHARED is likely clearer when investigating the number of
+ * backend fsyncs.
+ */
+ pgstat_count_io_op(IOOP_FSYNC, IOCONTEXT_SHARED);
return false;
}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 4e7b0b31bb..f064b7faf3 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -482,7 +482,7 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
BlockNumber blockNum,
BufferAccessStrategy strategy,
bool *foundPtr);
-static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+static void FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOContext io_context);
static void FindAndDropRelationBuffers(RelFileLocator rlocator,
ForkNumber forkNum,
BlockNumber nForkBlock,
@@ -823,6 +823,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
BufferDesc *bufHdr;
Block bufBlock;
bool found;
+ IOContext io_context;
bool isExtend;
bool isLocalBuf = SmgrIsTemp(smgr);
@@ -833,6 +834,13 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
isExtend = (blockNum == P_NEW);
+ if (strategy)
+ io_context = IOContextForStrategy(strategy);
+ else if (isLocalBuf)
+ io_context = IOCONTEXT_LOCAL;
+ else
+ io_context = IOCONTEXT_SHARED;
+
TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
smgr->smgr_rlocator.locator.spcOid,
smgr->smgr_rlocator.locator.dbOid,
@@ -990,6 +998,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (isExtend)
{
+ pgstat_count_io_op(IOOP_EXTEND, io_context);
/* new buffers are zero-filled */
MemSet((char *) bufBlock, 0, BLCKSZ);
/* don't set checksum for all-zero page */
@@ -1020,6 +1029,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
smgrread(smgr, forkNum, blockNum, (char *) bufBlock);
+ pgstat_count_io_op(IOOP_READ, io_context);
+
if (track_io_timing)
{
INSTR_TIME_SET_CURRENT(io_time);
@@ -1190,6 +1201,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
/* Loop here in case we have to try another victim buffer */
for (;;)
{
+ bool from_ring;
+
/*
* Ensure, while the spinlock's not yet held, that there's a free
* refcount entry.
@@ -1200,7 +1213,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* Select a victim buffer. The buffer is returned with its header
* spinlock still held!
*/
- buf = StrategyGetBuffer(strategy, &buf_state);
+ buf = StrategyGetBuffer(strategy, &buf_state, &from_ring);
Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
@@ -1237,6 +1250,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf),
LW_SHARED))
{
+ IOContext io_context;
+
/*
* If using a nondefault strategy, and writing the buffer
* would require a WAL flush, let the strategy decide whether
@@ -1263,13 +1278,36 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
}
}
+ /*
+ * When a strategy is in use, only flushes of dirty buffers
+ * already in the strategy ring are counted as strategy writes
+ * (IOCONTEXT [BULKREAD|BULKWRITE|VACUUM] IOOP_WRITE) for the
+ * purpose of IO operation statistics tracking.
+ *
+ * If a shared buffer initially added to the ring must be
+ * flushed before being used, this is counted as an
+ * IOCONTEXT_SHARED IOOP_WRITE.
+ *
+ * If a shared buffer is added to the ring later (because the
+ * current strategy buffer is pinned or in use, or because all
+ * strategy buffers were dirty and rejected, which happens only
+ * for BAS_BULKREAD operations) and requires flushing, this is
+ * counted as an IOCONTEXT_SHARED IOOP_WRITE (from_ring will be false).
+ *
+ * When a strategy is not in use, the write can only be a
+ * "regular" write of a dirty shared buffer (IOCONTEXT_SHARED
+ * IOOP_WRITE).
+ */
+
+ io_context = IOContextForStrategy(strategy);
+
/* OK, do the I/O */
TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(forkNum, blockNum,
smgr->smgr_rlocator.locator.spcOid,
smgr->smgr_rlocator.locator.dbOid,
smgr->smgr_rlocator.locator.relNumber);
- FlushBuffer(buf, NULL);
+ FlushBuffer(buf, NULL, io_context);
LWLockRelease(BufferDescriptorGetContentLock(buf));
ScheduleBufferTagForWriteback(&BackendWritebackContext,
@@ -1291,6 +1329,19 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
}
}
+ /*
+ * When a BufferAccessStrategy is in use, reused buffers from the
+ * strategy ring will be counted as IOCONTEXT_BULKREAD,
+ * IOCONTEXT_BULKWRITE, or IOCONTEXT_VACUUM reuses for the
+ * purposes of IO Operation statistics tracking.
+ *
+ * We wait until this point to count reuses to avoid incorrectly
+ * counting a buffer as reused when it was rejected or concurrently
+ * pinned.
+ */
+ if (from_ring)
+ pgstat_count_io_op(IOOP_REUSE, IOContextForStrategy(strategy));
+
/*
* To change the association of a valid buffer, we'll need to have
* exclusive lock on both the old and new mapping partitions.
@@ -2570,7 +2621,7 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOCONTEXT_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
@@ -2820,7 +2871,7 @@ BufferGetTag(Buffer buffer, RelFileLocator *rlocator, ForkNumber *forknum,
* as the second parameter. If not, pass NULL.
*/
static void
-FlushBuffer(BufferDesc *buf, SMgrRelation reln)
+FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOContext io_context)
{
XLogRecPtr recptr;
ErrorContextCallback errcallback;
@@ -2900,6 +2951,8 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
*/
bufToWrite = PageSetChecksumCopy((Page) bufBlock, buf->tag.blockNum);
+ pgstat_count_io_op(IOOP_WRITE, io_context);
+
if (track_io_timing)
INSTR_TIME_SET_CURRENT(io_start);
@@ -3551,6 +3604,8 @@ FlushRelationBuffers(Relation rel)
localpage,
false);
+ pgstat_count_io_op(IOOP_WRITE, IOCONTEXT_LOCAL);
+
buf_state &= ~(BM_DIRTY | BM_JUST_DIRTIED);
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
@@ -3586,7 +3641,7 @@ FlushRelationBuffers(Relation rel)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, RelationGetSmgr(rel));
+ FlushBuffer(bufHdr, RelationGetSmgr(rel), IOCONTEXT_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr);
}
@@ -3684,7 +3739,7 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, srelent->srel);
+ FlushBuffer(bufHdr, srelent->srel, IOCONTEXT_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr);
}
@@ -3894,7 +3949,7 @@ FlushDatabaseBuffers(Oid dbid)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOCONTEXT_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr);
}
@@ -3921,7 +3976,7 @@ FlushOneBuffer(Buffer buffer)
Assert(LWLockHeldByMe(BufferDescriptorGetContentLock(bufHdr)));
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOCONTEXT_SHARED);
}
/*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 64728bd7ce..72028372be 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -15,6 +15,7 @@
*/
#include "postgres.h"
+#include "pgstat.h"
#include "port/atomics.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
@@ -192,13 +193,15 @@ have_free_buffer(void)
* return the buffer with the buffer header spinlock still held.
*/
BufferDesc *
-StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
+StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_ring)
{
BufferDesc *buf;
int bgwprocno;
int trycounter;
uint32 local_buf_state; /* to avoid repeated (de-)referencing */
+ *from_ring = false;
+
/*
* If given a strategy object, see whether it can select a buffer. We
* assume strategy objects don't need buffer_strategy_lock.
@@ -207,7 +210,10 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
{
buf = GetBufferFromRing(strategy, buf_state);
if (buf != NULL)
+ {
+ *from_ring = true;
return buf;
+ }
}
/*
@@ -299,6 +305,15 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
if (strategy != NULL)
AddBufferToRing(strategy, buf);
*buf_state = local_buf_state;
+
+ /*
+ * When a strategy is in use, IO operation statistics count
+ * buffers acquired from the freelist in the corresponding
+ * strategy IOContext, even though the buffer acquired from the
+ * freelist is a shared buffer.
+ */
+ pgstat_count_io_op(IOOP_FREELIST_ACQUIRE,
+ IOContextForStrategy(strategy));
return buf;
}
UnlockBufHdr(buf, local_buf_state);
@@ -331,6 +346,19 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
if (strategy != NULL)
AddBufferToRing(strategy, buf);
*buf_state = local_buf_state;
+ /*
+ * When a BufferAccessStrategy is in use, evictions adding a
+ * shared buffer to the strategy ring are counted in the
+ * corresponding strategy's context. This includes the evictions
+ * done to add buffers to the ring initially as well as those
+ * done to add a new shared buffer to the ring when the current
+ * buffer is pinned or otherwise in use.
+ *
+ * We wait until this point to count evictions to avoid
+ * incorrectly counting cases in which we error out.
+ */
+ pgstat_count_io_op(IOOP_EVICT, IOContextForStrategy(strategy));
+
return buf;
}
}
@@ -596,7 +624,7 @@ FreeAccessStrategy(BufferAccessStrategy strategy)
/*
* GetBufferFromRing -- returns a buffer from the ring, or NULL if the
- * ring is empty.
+ * ring is empty / not usable.
*
* The bufhdr spin lock is held on the returned buffer.
*/
@@ -643,7 +671,13 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
/*
* Tell caller to allocate a new buffer with the normal allocation
* strategy. He'll then replace this ring element via AddBufferToRing.
+ *
+ * This counts as a "repossession" for the purposes of IO operation
+ * statistics tracking, since the reason that we no longer consider the
+ * current buffer to be part of the ring is that the block in it is in use
+ * outside of the ring, preventing us from reusing the buffer.
*/
+ pgstat_count_io_op(IOOP_REPOSSESS, IOContextForStrategy(strategy));
return NULL;
}
@@ -659,6 +693,37 @@ AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
strategy->buffers[strategy->current] = BufferDescriptorGetBuffer(buf);
}
+/*
+ * Utility function returning the IOContext of a given BufferAccessStrategy's
+ * strategy ring.
+ */
+IOContext
+IOContextForStrategy(BufferAccessStrategy strategy)
+{
+ if (!strategy)
+ return IOCONTEXT_SHARED;
+
+ switch (strategy->btype)
+ {
+ case BAS_NORMAL:
+ /*
+ * Currently, GetAccessStrategy() returns NULL for
+ * BufferAccessStrategyType BAS_NORMAL, so this case is
+ * unreachable.
+ */
+ pg_unreachable();
+ return IOCONTEXT_SHARED;
+ case BAS_BULKREAD:
+ return IOCONTEXT_BULKREAD;
+ case BAS_BULKWRITE:
+ return IOCONTEXT_BULKWRITE;
+ case BAS_VACUUM:
+ return IOCONTEXT_VACUUM;
+ }
+
+ elog(ERROR, "unrecognized BufferAccessStrategyType: %d", strategy->btype);
+}
+
/*
* StrategyRejectBuffer -- consider rejecting a dirty buffer
*
@@ -688,5 +753,7 @@ StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool from_r
*/
strategy->buffers[strategy->current] = InvalidBuffer;
+ pgstat_count_io_op(IOOP_REJECT, IOContextForStrategy(strategy));
+
return true;
}
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 30d67d1c40..bfcd35e4e2 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -18,6 +18,7 @@
#include "access/parallel.h"
#include "catalog/catalog.h"
#include "executor/instrument.h"
+#include "pgstat.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "utils/guc_hooks.h"
@@ -117,6 +118,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
bool found;
uint32 buf_state;
+ IOOp io_op = IOOP_EVICT;
+
InitBufferTag(&newTag, &smgr->smgr_rlocator.locator, forkNum, blockNum);
/* Initialize local buffers if first request in this session */
@@ -196,6 +199,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
LocalRefCount[b]++;
ResourceOwnerRememberBuffer(CurrentResourceOwner,
BufferDescriptorGetBuffer(bufHdr));
+
break;
}
}
@@ -226,6 +230,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
localpage,
false);
+ pgstat_count_io_op(IOOP_WRITE, IOCONTEXT_LOCAL);
+
/* Mark not-dirty now in case we error out below */
buf_state &= ~BM_DIRTY;
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
@@ -238,6 +244,14 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
*/
if (LocalBufHdrGetBlock(bufHdr) == NULL)
{
+ /*
+ * If this is the first use of the buffer, count it as a "freelist
+ * acquire". This isn't a perfect description of this allocation, since
+ * we do not maintain a freelist for local buffers; however, it allows
+ * us to distinguish between initial use and evictions of local
+ * buffers.
+ */
+ io_op = IOOP_FREELIST_ACQUIRE;
/* Set pointer for use by BufferGetBlock() macro */
LocalBufHdrGetBlock(bufHdr) = GetLocalBufferStorage();
}
@@ -275,6 +289,11 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
*foundPtr = false;
+
+ /*
+ * Count the IOOp here after we've ensured we were successful.
+ */
+ pgstat_count_io_op(io_op, IOCONTEXT_LOCAL);
return bufHdr;
}
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 9d6a9e9109..5718b52fb5 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -432,6 +432,8 @@ ProcessSyncRequests(void)
total_elapsed += elapsed;
processed++;
+ pgstat_count_io_op(IOOP_FSYNC, IOCONTEXT_SHARED);
+
if (log_checkpoints)
elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f ms",
processed,
diff --git a/src/backend/utils/activity/Makefile b/src/backend/utils/activity/Makefile
index a2e8507fd6..0098785089 100644
--- a/src/backend/utils/activity/Makefile
+++ b/src/backend/utils/activity/Makefile
@@ -22,6 +22,7 @@ OBJS = \
pgstat_checkpointer.o \
pgstat_database.o \
pgstat_function.o \
+ pgstat_io_ops.o \
pgstat_relation.o \
pgstat_replslot.o \
pgstat_shmem.o \
diff --git a/src/backend/utils/activity/meson.build b/src/backend/utils/activity/meson.build
index 5b3b558a67..1038324c32 100644
--- a/src/backend/utils/activity/meson.build
+++ b/src/backend/utils/activity/meson.build
@@ -7,6 +7,7 @@ backend_sources += files(
'pgstat_checkpointer.c',
'pgstat_database.c',
'pgstat_function.c',
+ 'pgstat_io_ops.c',
'pgstat_relation.c',
'pgstat_replslot.c',
'pgstat_shmem.c',
diff --git a/src/backend/utils/activity/pgstat_io_ops.c b/src/backend/utils/activity/pgstat_io_ops.c
new file mode 100644
index 0000000000..00abca4e94
--- /dev/null
+++ b/src/backend/utils/activity/pgstat_io_ops.c
@@ -0,0 +1,260 @@
+/* -------------------------------------------------------------------------
+ *
+ * pgstat_io_ops.c
+ * Implementation of IO operation statistics.
+ *
+ * This file contains the implementation of IO operation statistics. It is kept
+ * separate from pgstat.c to enforce the line between the statistics access /
+ * storage implementation and the details about individual types of
+ * statistics.
+ *
+ * Copyright (c) 2021-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/utils/activity/pgstat_io_ops.c
+ * -------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "utils/pgstat_internal.h"
+
+static PgStat_IOContextOps pending_IOOpStats;
+
+void
+pgstat_count_io_op(IOOp io_op, IOContext io_context)
+{
+ PgStat_IOOpCounters *pending_counters;
+
+ Assert(io_context < IOCONTEXT_NUM_TYPES);
+ Assert(io_op < IOOP_NUM_TYPES);
+ Assert(pgstat_expect_io_op(MyBackendType, io_context, io_op));
+
+ pending_counters = &pending_IOOpStats.data[io_context];
+
+ switch (io_op)
+ {
+ case IOOP_EVICT:
+ pending_counters->evictions++;
+ break;
+ case IOOP_EXTEND:
+ pending_counters->extends++;
+ break;
+ case IOOP_FREELIST_ACQUIRE:
+ pending_counters->freelist_acquisitions++;
+ break;
+ case IOOP_FSYNC:
+ pending_counters->fsyncs++;
+ break;
+ case IOOP_READ:
+ pending_counters->reads++;
+ break;
+ case IOOP_REPOSSESS:
+ pending_counters->repossessions++;
+ break;
+ case IOOP_REJECT:
+ pending_counters->rejections++;
+ break;
+ case IOOP_REUSE:
+ pending_counters->reuses++;
+ break;
+ case IOOP_WRITE:
+ pending_counters->writes++;
+ break;
+ }
+
+}
+
+const char *
+pgstat_io_context_desc(IOContext io_context)
+{
+ switch (io_context)
+ {
+ case IOCONTEXT_BULKREAD:
+ return "bulkread";
+ case IOCONTEXT_BULKWRITE:
+ return "bulkwrite";
+ case IOCONTEXT_LOCAL:
+ return "local";
+ case IOCONTEXT_SHARED:
+ return "shared";
+ case IOCONTEXT_VACUUM:
+ return "vacuum";
+ }
+
+ elog(ERROR, "unrecognized IOContext value: %d", io_context);
+}
+
+const char *
+pgstat_io_op_desc(IOOp io_op)
+{
+ switch (io_op)
+ {
+ case IOOP_EVICT:
+ return "evicted";
+ case IOOP_EXTEND:
+ return "extended";
+ case IOOP_FREELIST_ACQUIRE:
+ return "freelist acquired";
+ case IOOP_FSYNC:
+ return "files synced";
+ case IOOP_READ:
+ return "read";
+ case IOOP_REPOSSESS:
+ return "repossessed";
+ case IOOP_REJECT:
+ return "rejected";
+ case IOOP_REUSE:
+ return "reused";
+ case IOOP_WRITE:
+ return "written";
+ }
+
+ elog(ERROR, "unrecognized IOOp value: %d", io_op);
+}
+
+/*
+ * IO operation statistics are not collected for all BackendTypes.
+ *
+ * The following BackendTypes do not participate in the cumulative stats
+ * subsystem or do not do IO operations worth reporting statistics on:
+ * - Syslogger, because it is not connected to shared memory
+ * - Archiver, because most relevant archiving IO is delegated to a
+ *   specialized command or module
+ * - WAL Receiver and WAL Writer, whose IO is not tracked in pg_stat_io for now
+ *
+ * Returns true if the given BackendType participates in the cumulative stats
+ * subsystem for IO operations and false if it does not.
+ */
+bool
+pgstat_io_op_stats_collected(BackendType bktype)
+{
+ return bktype != B_INVALID && bktype != B_ARCHIVER && bktype != B_LOGGER &&
+ bktype != B_WAL_RECEIVER && bktype != B_WAL_WRITER;
+}
+
+/*
+ * Some BackendTypes do not perform IO operations in certain IOContexts. Check
+ * that the given BackendType is expected to do IO in the given IOContext.
+ */
+bool
+pgstat_bktype_io_context_valid(BackendType bktype, IOContext io_context)
+{
+ bool no_local;
+
+ /*
+ * In core Postgres, only regular backends and WAL Sender processes
+ * executing queries should use local buffers. Parallel workers will not
+ * use local buffers (see InitLocalBuffers()); however, extensions
+ * leveraging background workers have no such limitation, so track IO
+ * Operations in IOCONTEXT_LOCAL for BackendType B_BG_WORKER.
+ */
+ no_local = bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER || bktype
+ == B_CHECKPOINTER || bktype == B_AUTOVAC_WORKER || bktype ==
+ B_STANDALONE_BACKEND || bktype == B_STARTUP;
+
+ if (io_context == IOCONTEXT_LOCAL && no_local)
+ return false;
+
+ /*
+ * Some BackendTypes do not currently perform any IO operations in certain
+ * IOContexts, and, while it may not be inherently incorrect for them to
+ * do so, excluding those rows from the view makes the view easier to use.
+ */
+ if ((io_context == IOCONTEXT_BULKREAD || io_context == IOCONTEXT_BULKWRITE
+ || io_context == IOCONTEXT_VACUUM) && (bktype == B_CHECKPOINTER
+ || bktype == B_BG_WRITER))
+ return false;
+
+ if (io_context == IOCONTEXT_VACUUM && bktype == B_AUTOVAC_LAUNCHER)
+ return false;
+
+ if (io_context == IOCONTEXT_BULKWRITE && (bktype == B_AUTOVAC_WORKER ||
+ bktype == B_AUTOVAC_LAUNCHER))
+ return false;
+
+ return true;
+}
+
+/*
+ * Some BackendTypes will never do certain IOOps and some IOOps should not
+ * occur in certain IOContexts. Check that the given IOOp is valid for the
+ * given BackendType in the given IOContext. Note that there are currently no
+ * cases of an IOOp being invalid for a particular BackendType only within a
+ * certain IOContext.
+ */
+bool
+pgstat_io_op_valid(BackendType bktype, IOContext io_context, IOOp io_op)
+{
+ bool strategy_io_context;
+
+ /*
+ * Some BackendTypes should never track IO Operation statistics.
+ */
+ Assert(pgstat_io_op_stats_collected(bktype));
+
+ /*
+ * Some BackendTypes will not do certain IOOps.
+ */
+ if ((bktype == B_BG_WRITER || bktype == B_CHECKPOINTER) &&
+ (io_op == IOOP_READ || io_op == IOOP_EVICT || io_op == IOOP_FREELIST_ACQUIRE))
+ return false;
+
+ if ((bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER || bktype ==
+ B_CHECKPOINTER) && io_op == IOOP_EXTEND)
+ return false;
+
+ /*
+ * Some IOOps are not valid in certain IOContexts and some IOOps are only
+ * valid in certain contexts.
+ */
+ if (io_context == IOCONTEXT_BULKREAD && io_op == IOOP_EXTEND)
+ return false;
+
+ /*
+ * Only BAS_BULKREAD will reject strategy buffers
+ */
+ if (io_context != IOCONTEXT_BULKREAD && io_op == IOOP_REJECT)
+ return false;
+
+
+ strategy_io_context = io_context == IOCONTEXT_BULKREAD || io_context ==
+ IOCONTEXT_BULKWRITE || io_context == IOCONTEXT_VACUUM;
+
+ /*
+ * IOOP_REPOSSESS and IOOP_REUSE are only relevant when a
+ * BufferAccessStrategy is in use.
+ */
+ if (!strategy_io_context && (io_op == IOOP_REJECT || io_op ==
+ IOOP_REPOSSESS || io_op == IOOP_REUSE))
+ return false;
+
+ /*
+ * Temporary tables using local buffers are not logged and thus do not
+ * require fsync'ing.
+ *
+ * IOOP_FSYNC IOOps done by a backend using a BufferAccessStrategy are
+ * counted in the IOCONTEXT_SHARED IOContext. See comment in
+ * ForwardSyncRequest() for more details.
+ */
+ if ((io_context == IOCONTEXT_LOCAL || strategy_io_context) &&
+ io_op == IOOP_FSYNC)
+ return false;
+
+ return true;
+}
+
+bool
+pgstat_expect_io_op(BackendType bktype, IOContext io_context, IOOp io_op)
+{
+ if (!pgstat_io_op_stats_collected(bktype))
+ return false;
+
+ if (!pgstat_bktype_io_context_valid(bktype, io_context))
+ return false;
+
+ if (!(pgstat_io_op_valid(bktype, io_context, io_op)))
+ return false;
+
+ return true;
+}
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 9e2ce6f011..2844d72f86 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -14,6 +14,7 @@
#include "datatype/timestamp.h"
#include "portability/instr_time.h"
#include "postmaster/pgarch.h" /* for MAX_XFN_CHARS */
+#include "storage/buf.h"
#include "utils/backend_progress.h" /* for backward compatibility */
#include "utils/backend_status.h" /* for backward compatibility */
#include "utils/relcache.h"
@@ -276,6 +277,57 @@ typedef struct PgStat_CheckpointerStats
PgStat_Counter buf_fsync_backend;
} PgStat_CheckpointerStats;
+/*
+ * Types related to counting IO operations in various IOContexts.
+ * When adding a new value, ensure that the proper assertions are added to
+ * pgstat_io_context_ops_assert_zero() and pgstat_io_op_assert_zero() (though
+ * the compiler will remind you about the latter).
+ */
+
+typedef enum IOOp
+{
+ IOOP_EVICT = 0,
+ IOOP_EXTEND,
+ IOOP_FREELIST_ACQUIRE,
+ IOOP_FSYNC,
+ IOOP_READ,
+ IOOP_REJECT,
+ IOOP_REPOSSESS,
+ IOOP_REUSE,
+ IOOP_WRITE,
+} IOOp;
+
+#define IOOP_NUM_TYPES (IOOP_WRITE + 1)
+
+typedef enum IOContext
+{
+ IOCONTEXT_BULKREAD = 0,
+ IOCONTEXT_BULKWRITE,
+ IOCONTEXT_LOCAL,
+ IOCONTEXT_SHARED,
+ IOCONTEXT_VACUUM,
+} IOContext;
+
+#define IOCONTEXT_NUM_TYPES (IOCONTEXT_VACUUM + 1)
+
+typedef struct PgStat_IOOpCounters
+{
+ PgStat_Counter evictions;
+ PgStat_Counter extends;
+ PgStat_Counter freelist_acquisitions;
+ PgStat_Counter fsyncs;
+ PgStat_Counter reads;
+ PgStat_Counter rejections;
+ PgStat_Counter reuses;
+ PgStat_Counter repossessions;
+ PgStat_Counter writes;
+} PgStat_IOOpCounters;
+
+typedef struct PgStat_IOContextOps
+{
+ PgStat_IOOpCounters data[IOCONTEXT_NUM_TYPES];
+} PgStat_IOContextOps;
+
typedef struct PgStat_StatDBEntry
{
PgStat_Counter n_xact_commit;
@@ -453,6 +505,24 @@ extern void pgstat_report_checkpointer(void);
extern PgStat_CheckpointerStats *pgstat_fetch_stat_checkpointer(void);
+/*
+ * Functions in pgstat_io_ops.c
+ */
+
+extern void pgstat_count_io_op(IOOp io_op, IOContext io_context);
+extern const char *pgstat_io_context_desc(IOContext io_context);
+extern const char *pgstat_io_op_desc(IOOp io_op);
+
+/* Validation functions in pgstat_io_ops.c */
+extern bool pgstat_io_op_stats_collected(BackendType bktype);
+extern bool pgstat_bktype_io_context_valid(BackendType bktype, IOContext io_context);
+extern bool pgstat_io_op_valid(BackendType bktype, IOContext io_context, IOOp io_op);
+extern bool pgstat_expect_io_op(BackendType bktype, IOContext io_context, IOOp io_op);
+
+/* IO stats translation function in freelist.c */
+extern IOContext IOContextForStrategy(BufferAccessStrategy bas);
+
+
/*
* Functions in pgstat_database.c
*/
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index b75481450d..7b67250747 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -392,7 +392,7 @@ extern void ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *
/* freelist.c */
extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
- uint32 *buf_state);
+ uint32 *buf_state, bool *from_ring);
extern void StrategyFreeBuffer(BufferDesc *buf);
extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
BufferDesc *buf, bool from_ring);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 6f4dfa0960..d0eed71f63 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -23,7 +23,12 @@
typedef void *Block;
-/* Possible arguments for GetAccessStrategy() */
+/*
+ * Possible arguments for GetAccessStrategy().
+ *
+ * If adding a new BufferAccessStrategyType, also add a new IOContext so
+ * statistics on IO operations using this strategy are tracked.
+ */
typedef enum BufferAccessStrategyType
{
BAS_NORMAL, /* Normal random access */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index d9b839c979..600c0fe855 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1106,7 +1106,9 @@ ID
INFIX
INT128
INTERFACE_INFO
+IOContext
IOFuncSelector
+IOOp
IPCompareMethod
ITEM
IV
@@ -2026,6 +2028,8 @@ PgStat_FetchConsistency
PgStat_FunctionCallUsage
PgStat_FunctionCounts
PgStat_HashKey
+PgStat_IOContextOps
+PgStat_IOOpCounters
PgStat_Kind
PgStat_KindInfo
PgStat_LocalState
--
2.34.1
Attachment: v34-0004-Add-system-view-tracking-IO-ops-per-backend-type.patch (text/x-patch)
From 1de196bad4aba06beb32ba6487145d765794a7c5 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Thu, 6 Oct 2022 12:24:42 -0400
Subject: [PATCH v34 4/5] Add system view tracking IO ops per backend type
Add pg_stat_io, a system view which tracks the number of IOOps (freelist
acquisitions, evictions, reuses, rejections, repossessions, reads,
writes, extends, and fsyncs) done through each IOContext (shared
buffers, local buffers, and buffers reserved by a BufferAccessStrategy)
by each type of backend (e.g. client backend, checkpointer).
Some BackendTypes do not accumulate IO operation statistics and will
not be included in the view.
Some IOContexts are not used by some BackendTypes and will not be in the
view. For example, checkpointer does not use a BufferAccessStrategy
(currently), so there will be no rows for BufferAccessStrategy
IOContexts for checkpointer.
Some IOOps are invalid in combination with certain IOContexts. Those
cells will be NULL in the view to distinguish between 0 observed IOOps
of that type and an invalid combination. For example, local buffers are
not fsynced so cells for all BackendTypes for IOCONTEXT_LOCAL and
IOOP_FSYNC will be NULL.
Some BackendTypes never perform certain IOOps. Those cells will also be
NULL in the view. For example, bgwriter should not perform reads.
The view's stats come from counters incremented whenever a backend
performs an IO operation; they are maintained by the cumulative
statistics subsystem.
Each row of the view shows stats for a particular BackendType and
IOContext combination (e.g. shared buffer accesses by checkpointer) and
each column in the view is the total number of IO Operations done (e.g.
writes).
So a cell in the view would be, for example, the number of shared
buffers written by checkpointer since the last stats reset.
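As a sketch of how such a cell could be read (using the column and label names this patch defines; exact values depend on the running cluster):

```sql
-- Shared buffers written out by the checkpointer since the last reset
-- of the 'io' stats.
SELECT written
FROM pg_stat_io
WHERE backend_type = 'checkpointer'
  AND io_context = 'shared';
```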
In anticipation of tracking WAL IO and non-block-oriented IO (such as
temporary file IO), the "unit" column specifies the unit of the "read",
"written", and "extended" columns for a given row.
Note that some of the cells in the view are redundant with fields in
pg_stat_bgwriter (e.g. buffers_backend), however these have been kept in
pg_stat_bgwriter for backwards compatibility. Deriving the redundant
pg_stat_bgwriter stats from the IO operations stats structures was also
problematic due to the separate reset targets for 'bgwriter' and 'io'.
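For illustration, the overlap described above could be cross-checked with a query along these lines (valid only while neither the 'bgwriter' nor the 'io' stats have been reset independently of each other):

```sql
-- Hypothetical sanity check: the sum of written + extended for the
-- listed backend types and IO contexts should match buffers_backend.
SELECT (SELECT sum(written + extended)
        FROM pg_stat_io
        WHERE backend_type IN ('client backend', 'autovacuum worker',
                               'background worker', 'walsender')
          AND io_context IN ('shared', 'bulkread', 'bulkwrite', 'vacuum'))
       = (SELECT buffers_backend FROM pg_stat_bgwriter) AS counters_match;
```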
Suggested by Andres Freund
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Maciek Sakrejda <m.sakrejda@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
doc/src/sgml/monitoring.sgml | 330 ++++++++++++++++++++++++++-
src/backend/catalog/system_views.sql | 17 ++
src/backend/utils/adt/pgstatfuncs.c | 153 +++++++++++++
src/include/catalog/pg_proc.dat | 9 +
src/test/regress/expected/rules.out | 14 ++
src/test/regress/expected/stats.out | 242 ++++++++++++++++++++
src/test/regress/sql/stats.sql | 131 +++++++++++
7 files changed, 894 insertions(+), 2 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 698f274341..e144bb0c35 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -448,6 +448,15 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</entry>
</row>
+ <row>
+ <entry><structname>pg_stat_io</structname><indexterm><primary>pg_stat_io</primary></indexterm></entry>
+ <entry>A row for each IO Context for each backend type showing
+ statistics about backend IO operations. See
+ <link linkend="monitoring-pg-stat-io-view">
+ <structname>pg_stat_io</structname></link> for details.
+ </entry>
+ </row>
+
<row>
<entry><structname>pg_stat_wal</structname><indexterm><primary>pg_stat_wal</primary></indexterm></entry>
<entry>One row only, showing statistics about WAL activity. See
@@ -3600,13 +3609,12 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
<structfield>stats_reset</structfield> <type>timestamp with time zone</type>
</para>
<para>
- Time at which these statistics were last reset
+ Time at which these statistics were last reset.
</para></entry>
</row>
</tbody>
</tgroup>
</table>
-
<para>
Normally, WAL files are archived in order, oldest to newest, but that is
not guaranteed, and does not hold under special circumstances like when
@@ -3615,7 +3623,325 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
<structfield>last_archived_wal</structfield> have also been successfully
archived.
</para>
+ </sect2>
+
+ <sect2 id="monitoring-pg-stat-io-view">
+ <title><structname>pg_stat_io</structname></title>
+
+ <indexterm>
+ <primary>pg_stat_io</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_stat_io</structname> view has a row for each backend type
+ and IO context containing global data for the cluster on IO operations done
+ by that backend type in that IO context. Currently only a subset of IO
+ operations are tracked here. WAL IO, IO on temporary files, and some forms
+ of IO outside of shared buffers (such as when building indexes or moving a
+ table from one tablespace to another) could be added in the future.
+ </para>
+
+ <table id="pg-stat-io-view" xreflabel="pg_stat_io">
+ <title><structname>pg_stat_io</structname> View</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>backend_type</structfield> <type>text</type>
+ </para>
+ <para>
+ Type of backend (e.g. background worker, autovacuum worker).
+ See <link linkend="monitoring-pg-stat-activity-view">
+ <structname>pg_stat_activity</structname></link> for more information on
+ <varname>backend_type</varname>s.
+ </para></entry>
+ </row>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>io_context</structfield> <type>text</type>
+ </para>
+ <para>
+ IO Context used. This refers to the context or location of an IO
+ operation. <literal>shared</literal> refers to shared buffers, the
+ primary buffer pool for relation data. <literal>local</literal> refers
+ to process-local memory used for temporary tables.
+ <literal>vacuum</literal> refers to memory reserved for use during
+ vacuuming and analyzing. <literal>bulkread</literal> refers to memory
+ reserved for use during bulk read operations.
+ <literal>bulkwrite</literal> refers to memory reserved for use during
+ bulk write operations. These last three <varname>io_context</varname>s
+ are counted separately because the autovacuum daemon, explicit
+ <command>VACUUM</command>, explicit <command>ANALYZE</command>, many
+ bulk reads, and many bulk writes use a fixed amount of memory, acquiring
+ the equivalent number of shared buffers and reusing them circularly to
+ avoid occupying an undue portion of the main shared buffer pool.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>read</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Reads by this <varname>backend_type</varname> into buffers in this
+ <varname>io_context</varname>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>written</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Writes of data in this <varname>io_context</varname> written out by this
+ <varname>backend_type</varname>. Note that the values of
+ <varname>written</varname> for <varname>backend_type</varname>
+ <literal>background writer</literal> and <varname>backend_type</varname>
+ <literal>checkpointer</literal> are equivalent to the values of
+ <varname>buffers_clean</varname> and
+ <varname>buffers_checkpoint</varname>, respectively, in <link
+ linkend="monitoring-pg-stat-bgwriter-view">
+ <structname>pg_stat_bgwriter</structname></link>. Also, the sum of
+ <varname>written</varname> and <varname>extended</varname> in this view
+ for <varname>backend_type</varname>s <literal>client backend</literal>,
+ <literal>autovacuum worker</literal>, <literal>background
+ worker</literal>, and <literal>walsender</literal> in
+ <varname>io_context</varname>s <literal>shared</literal>,
+ <literal>bulkread</literal>, <literal>bulkwrite</literal>, and
+ <literal>vacuum</literal> is equivalent to
+ <varname>buffers_backend</varname> in
+ <structname>pg_stat_bgwriter</structname>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>extended</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Extends of relations done by this <varname>backend_type</varname> in
+ order to write data in this <varname>io_context</varname>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>unit</structfield> <type>text</type>
+ </para>
+ <para>
+ The unit of IO read, written, or extended. For block-oriented IO of
+ relation data, reads, writes, and extends are done in
+ <varname>block_size</varname> units. Thus, the <varname>unit</varname>
+ column will be the value of the build-time parameter
+ <symbol>BLCKSZ</symbol> represented in kilobytes. Future values could
+ include those derived from <symbol>XLOG_BLCKSZ</symbol>, once WAL IO is
+ tracked in this view, and <quote>bytes</quote>, once non-block-oriented IO
+ (e.g. temporary file IO) is tracked here.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>freelist_acquired</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of unoccupied shared or local buffers this
+ <varname>backend_type</varname> has acquired for use as a buffer
+ in this <varname>io_context</varname>.
+
+ A freelist is maintained in shared memory, containing shared buffers
+ that have never been used or whose contents have been invalidated and
+ not yet replaced.
+
+ Acquisition of a buffer from the freelist constitutes a <quote>cache
+ miss</quote>.
+
+ <varname>freelist_acquired</varname> in <varname>io_context</varname>
+ <literal>shared</literal> counts the acquisitions of an unoccupied
+ shared buffer from the freelist.
+
+ Given a working set close to or larger than the size of shared buffers,
+ once the database has initially warmed up, fewer
+ <varname>freelist_acquired</varname> operations are expected. Instead,
+ data resident in shared buffers will need to be evicted in order to read
+ in non-resident data. This will manifest as an increased
+ <varname>evicted</varname> count in <varname>io_context</varname>
+ <literal>shared</literal>.
+
+ <varname>freelist_acquired</varname> in <varname>io_context</varname>
+ <literal>vacuum</literal>, <literal>bulkread</literal>, and
+ <literal>bulkwrite</literal> counts the number of times unoccupied
+ shared buffers were acquired from the freelist and added to the
+ fixed-size strategy ring buffer. Shared buffers are added to the
+ strategy ring lazily. If the current buffer in the ring is pinned or in
+ use by another backend, it may be replaced with a new shared buffer. In
+ <varname>io_context</varname> <literal>bulkread</literal>, existing
+ dirty buffers in the ring requiring flush are
+ <varname>rejected</varname>. If all of the buffers in the strategy ring
+ have been <varname>rejected</varname>, a new shared buffer will be added
+ to the ring and will be counted as <varname>freelist_acquired</varname>
+ if the replacement buffer is procured from the freelist.
+
+ <varname>freelist_acquired</varname> in <varname>io_context</varname>
+ <literal>local</literal> counts the number of local buffers allocated.
+ Local buffers are allocated lazily.
+ </para></entry>
+ </row>
+
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>evicted</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of times a <varname>backend_type</varname> has evicted a block
+ from a shared or local buffer in order to reuse the buffer in this
+ <varname>io_context</varname>.
+
+ <varname>evicted</varname> and <varname>freelist_acquired</varname>
+ together constitute the number of cache misses for local or shared
+ buffers. A large number of <varname>evicted</varname> buffers after the
+ database has initially warmed up could indicate that shared buffers is
+ too small and should be set to a larger value.
+
+ <varname>evicted</varname> in <varname>io_context</varname>
+ <literal>shared</literal> counts the eviction of a block from a shared
+ buffer so that it can be replaced with another block, also in shared
+ buffers.
+
+ If shared buffers are available on the freelist, then no eviction is
+ necessary and the acquisition of the required buffer is counted as a
+ <literal>freelist_acquired</literal> operation in the
+ <varname>io_context</varname> <literal>shared</literal>.
+
+ <varname>evicted</varname> in <varname>io_context</varname>
+ <literal>vacuum</literal>, <literal>bulkread</literal>, and
+ <literal>bulkwrite</literal> counts the number of times occupied shared
+ buffers were added to the fixed-size strategy ring buffer, causing the
+ buffer contents to be evicted. If the current buffer in the ring is
+ pinned or in use by another backend, it may be replaced by a new shared
+ buffer. If this shared buffer contains valid data, that block must be
+ evicted and will count as <varname>evicted</varname>.
+
+ In <varname>io_context</varname> <literal>bulkread</literal>, existing
+ dirty buffers in the ring requiring flush are
+ <varname>rejected</varname>. If all of the buffers in the strategy ring
+ have been <varname>rejected</varname>, a new shared buffer will be added
+ to the ring. If the new shared buffer is occupied, its contents will
+ need to be evicted.
+
+ Seeing a large number of <varname>evicted</varname> in strategy
+ <varname>io_context</varname>s can provide insight into primary working
+ set cache misses.
+
+ <varname>evicted</varname> in <varname>io_context</varname>
+ <literal>local</literal> counts the eviction of a block of data from an
+ existing local buffer in order to replace it with another block, also in
+ local buffers.
+
+ This is in contrast with <varname>freelist_acquired</varname> operations
+ in <varname>io_context</varname> <literal>local</literal> which are
+ counted the first time that the memory for a local buffer is allocated.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>reused</structfield> <type>bigint</type>
+ </para>
+ <para>
+ The number of times an existing buffer in the strategy ring was reused
+ as part of an operation in the <literal>bulkread</literal>,
+ <literal>bulkwrite</literal>, and <literal>vacuum</literal>
+ <varname>io_context</varname>s. This is analogous to
+ <varname>evicted</varname> in <varname>io_context</varname>
+ <literal>shared</literal>, as the contents of the buffer are likewise
+ <quote>evicted</quote>, but it covers the case in which the eviction
+ target is already a member of the strategy ring.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>rejected</structfield> <type>bigint</type>
+ </para>
+ <para>
+ The number of times a <literal>bulkread</literal> found the current
+ buffer in the fixed-size strategy ring dirty and requiring flush.
+ <quote>Rejecting</quote> the buffer effectively removes it from the
+ strategy ring, allowing the slot in the ring to be replaced in the
+ future with a new shared buffer. A high number of
+ <literal>bulkread</literal> rejections can indicate a need for more
+ frequent vacuuming or more aggressive autovacuum settings, as buffers are
+ dirtied during a bulkread operation when updating hint bits or when
+ performing on-access pruning.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>repossessed</structfield> <type>bigint</type>
+ </para>
+ <para>
+ The number of times a buffer in the fixed-size ring buffer used by
+ operations in the <literal>bulkread</literal>,
+ <literal>bulkwrite</literal>, and <literal>vacuum</literal>
+ <varname>io_context</varname>s was removed from that ring buffer because
+ it was pinned or in use by another backend and thus could not have its
+ contents evicted so that it could be reused. Once removed from the
+ strategy ring, this buffer is a <quote>normal</quote> shared buffer
+ again. A high number of repossessions is a sign of contention for the
+ blocks operated on by the strategy operation.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>files_synced</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of files fsynced by this <varname>backend_type</varname> for the
+ purpose of persisting data dirtied in this
+ <varname>io_context</varname>. <literal>fsyncs</literal> are done at
+ segment boundaries so <varname>unit</varname> does not apply to the
+ <varname>files_synced</varname> column. <literal>fsyncs</literal> done
+ by backends in order to persist data written in
+ <varname>io_context</varname> <literal>vacuum</literal>,
+ <varname>io_context</varname> <literal>bulkread</literal>, or
+ <varname>io_context</varname> <literal>bulkwrite</literal> are counted
+ as an <varname>io_context</varname> <literal>shared</literal>
+ <literal>fsync</literal>. Note that the sum of
+ <varname>files_synced</varname> for all <varname>io_context</varname>
+ <literal>shared</literal> for all <varname>backend_type</varname>s
+ except <literal>checkpointer</literal> is equivalent to
+ <varname>buffers_backend_fsync</varname> in <link
+ linkend="monitoring-pg-stat-bgwriter-view">
+ <structname>pg_stat_bgwriter</structname></link>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
+ </para>
+ <para>
+ Time at which these statistics were last reset.
+ </para></entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
</sect2>
<sect2 id="monitoring-pg-stat-bgwriter-view">
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 2d8104b090..f851cc9aac 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1117,6 +1117,23 @@ CREATE VIEW pg_stat_bgwriter AS
pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+CREATE VIEW pg_stat_io AS
+SELECT
+ b.backend_type,
+ b.io_context,
+ b.read,
+ b.written,
+ b.extended,
+ b.unit,
+ b.freelist_acquired,
+ b.evicted,
+ b.reused,
+ b.rejected,
+ b.repossessed,
+ b.files_synced,
+ b.stats_reset
+FROM pg_stat_get_io() b;
+
CREATE VIEW pg_stat_wal AS
SELECT
w.wal_records,
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index b783af130c..74ee9f61d0 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -29,6 +29,7 @@
#include "storage/procarray.h"
#include "utils/acl.h"
#include "utils/builtins.h"
+#include "utils/guc.h"
#include "utils/inet.h"
#include "utils/timestamp.h"
@@ -1725,6 +1726,158 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
}
+/*
+ * When adding a new column to the pg_stat_io view, add a new enum value
+ * here above IO_NUM_COLUMNS.
+ */
+typedef enum io_stat_col
+{
+ IO_COL_BACKEND_TYPE,
+ IO_COL_IO_CONTEXT,
+ IO_COL_READS,
+ IO_COL_WRITES,
+ IO_COL_EXTENDS,
+ IO_COL_UNIT,
+ IO_COL_FREELIST_ACQUISITIONS,
+ IO_COL_EVICTIONS,
+ IO_COL_REUSES,
+ IO_COL_REJECTIONS,
+ IO_COL_REPOSSESSIONS,
+ IO_COL_FSYNCS,
+ IO_COL_RESET_TIME,
+ IO_NUM_COLUMNS,
+} io_stat_col;
+
+/*
+ * When adding a new IOOp, add a new io_stat_col and add a case to this
+ * function returning the corresponding io_stat_col.
+ */
+static io_stat_col
+pgstat_io_op_get_index(IOOp io_op)
+{
+ switch (io_op)
+ {
+ case IOOP_EVICT:
+ return IO_COL_EVICTIONS;
+ case IOOP_FREELIST_ACQUIRE:
+ return IO_COL_FREELIST_ACQUISITIONS;
+ case IOOP_READ:
+ return IO_COL_READS;
+ case IOOP_REUSE:
+ return IO_COL_REUSES;
+ case IOOP_REJECT:
+ return IO_COL_REJECTIONS;
+ case IOOP_REPOSSESS:
+ return IO_COL_REPOSSESSIONS;
+ case IOOP_WRITE:
+ return IO_COL_WRITES;
+ case IOOP_EXTEND:
+ return IO_COL_EXTENDS;
+ case IOOP_FSYNC:
+ return IO_COL_FSYNCS;
+ }
+
+ elog(ERROR, "unrecognized IOOp value: %d", io_op);
+}
+
+Datum
+pg_stat_get_io(PG_FUNCTION_ARGS)
+{
+ PgStat_BackendIOContextOps *backends_io_stats;
+ ReturnSetInfo *rsinfo;
+ Datum reset_time;
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+
+ backends_io_stats = pgstat_fetch_backend_io_context_ops();
+
+ reset_time = TimestampTzGetDatum(backends_io_stats->stat_reset_timestamp);
+
+ for (int bktype = 0; bktype < BACKEND_NUM_TYPES; bktype++)
+ {
+ Datum bktype_desc = CStringGetTextDatum(GetBackendTypeDesc((BackendType) bktype));
+ bool expect_backend_stats = true;
+ PgStat_IOContextOps *io_context_ops = &backends_io_stats->stats[bktype];
+
+ /*
+ * For those BackendTypes without IO Operation stats, skip
+ * representing them in the view altogether.
+ */
+ expect_backend_stats = pgstat_io_op_stats_collected((BackendType)
+ bktype);
+
+ for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ {
+ PgStat_IOOpCounters *counters = &io_context_ops->data[io_context];
+ const char *io_context_str = pgstat_io_context_desc(io_context);
+
+ /*
+ * Hard-code this to blocks until we have non-block-oriented IO
+ * represented in the view as well
+ */
+ int unit = GUC_UNIT_BLOCKS;
+ const char *unit_name = get_config_unit_name(unit);
+
+
+ Datum values[IO_NUM_COLUMNS] = {0};
+ bool nulls[IO_NUM_COLUMNS] = {0};
+
+ /*
+ * Given that unit is hard-coded to GUC_UNIT_BLOCKS, unit_name
+ * should not be NULL.
+ */
+ Assert(unit_name);
+
+ /*
+ * Some combinations of IOContext and BackendType are not valid
+ * for any type of IOOp. In such cases, omit the entire row from
+ * the view.
+ */
+ if (!expect_backend_stats ||
+ !pgstat_bktype_io_context_valid((BackendType) bktype,
+ (IOContext) io_context))
+ {
+ pgstat_io_context_ops_assert_zero(counters);
+ continue;
+ }
+
+ values[IO_COL_BACKEND_TYPE] = bktype_desc;
+ values[IO_COL_IO_CONTEXT] = CStringGetTextDatum(io_context_str);
+ values[IO_COL_READS] = Int64GetDatum(counters->reads);
+ values[IO_COL_WRITES] = Int64GetDatum(counters->writes);
+ values[IO_COL_EXTENDS] = Int64GetDatum(counters->extends);
+ values[IO_COL_UNIT] = CStringGetTextDatum(unit_name);
+ values[IO_COL_FREELIST_ACQUISITIONS] = Int64GetDatum(counters->freelist_acquisitions);
+ values[IO_COL_EVICTIONS] = Int64GetDatum(counters->evictions);
+ values[IO_COL_REUSES] = Int64GetDatum(counters->reuses);
+ values[IO_COL_REJECTIONS] = Int64GetDatum(counters->rejections);
+ values[IO_COL_REPOSSESSIONS] = Int64GetDatum(counters->repossessions);
+ values[IO_COL_FSYNCS] = Int64GetDatum(counters->fsyncs);
+ values[IO_COL_RESET_TIME] = TimestampTzGetDatum(reset_time);
+
+ /*
+ * Some combinations of BackendType and IOOp and of IOContext and
+ * IOOp are not valid. Set these cells in the view NULL and assert
+ * that these stats are zero as expected.
+ */
+ for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
+ {
+ if (!(pgstat_io_op_valid((BackendType) bktype, (IOContext)
+ io_context, (IOOp) io_op)))
+ {
+ pgstat_io_op_assert_zero(counters, (IOOp) io_op);
+ nulls[pgstat_io_op_get_index((IOOp) io_op)] = true;
+ }
+ }
+
+ tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
+ }
+ }
+
+ return (Datum) 0;
+}
+
/*
* Returns statistics of WAL activity
*/
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 62a5b8e655..2d07595677 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5653,6 +5653,15 @@
proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
+{ oid => '8459', descr => 'statistics: per backend type IO statistics',
+ proname => 'pg_stat_get_io', provolatile => 'v',
+ prorows => '30', proretset => 't',
+ proparallel => 'r', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,text,int8,int8,int8,text,int8,int8,int8,int8,int8,int8,timestamptz}',
+ proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o}',
+ proargnames => '{backend_type,io_context,read,written,extended,unit,freelist_acquired,evicted,reused,rejected,repossessed,files_synced,stats_reset}',
+ prosrc => 'pg_stat_get_io' },
+
{ oid => '1136', descr => 'statistics: information about WAL activity',
proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
proparallel => 'r', prorettype => 'record', proargtypes => '',
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index bfcd8ac9a0..c8cf3d3fa8 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1871,6 +1871,20 @@ pg_stat_gssapi| SELECT s.pid,
s.gss_enc AS encrypted
FROM pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid, query_id)
WHERE (s.client_port IS NOT NULL);
+pg_stat_io| SELECT b.backend_type,
+ b.io_context,
+ b.read,
+ b.written,
+ b.extended,
+ b.unit,
+ b.freelist_acquired,
+ b.evicted,
+ b.reused,
+ b.rejected,
+ b.repossessed,
+ b.files_synced,
+ b.stats_reset
+ FROM pg_stat_get_io() b(backend_type, io_context, read, written, extended, unit, freelist_acquired, evicted, reused, rejected, repossessed, files_synced, stats_reset);
pg_stat_progress_analyze| SELECT s.pid,
s.datid,
d.datname,
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index 257a6a9da9..8a95002d61 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -1120,4 +1120,246 @@ SELECT pg_stat_get_subscription_stats(NULL);
(1 row)
+-- Test that the following operations are tracked in pg_stat_io:
+-- - acquisitions of shared buffers from the freelist or by evicting another block
+-- - reads of target blocks into shared buffers
+-- - writes of shared buffers
+-- - extends of relations using shared buffers
+-- - fsyncs done to ensure the durability of data dirtying shared buffers
+-- Consider both buffers acquired from the freelist as well as those acquired
+-- by evicting another block, because we cannot be sure of the state of shared
+-- buffers at the time the test is run.
+SELECT sum(evicted) + sum(freelist_acquired) AS io_sum_shared_acq_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(read) AS io_sum_shared_reads_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(written) AS io_sum_shared_writes_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(extended) AS io_sum_shared_extends_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(files_synced) AS io_sum_shared_fsyncs_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+-- Create a regular table and insert some data to generate IOCONTEXT_SHARED
+-- evicted or freelist acquisitions and extends.
+CREATE TABLE test_io_shared(a int);
+INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+-- After a checkpoint, there should be some additional IOCONTEXT_SHARED writes
+-- and fsyncs.
+-- The second checkpoint ensures that stats from the first checkpoint have been
+-- reported and protects against any potential races amongst the table
+-- creation, a possible timing-triggered checkpoint, and the explicit
+-- checkpoint in the test.
+CHECKPOINT;
+CHECKPOINT;
+SELECT sum(evicted) + sum(freelist_acquired) AS io_sum_shared_acq_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(written) AS io_sum_shared_writes_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(extended) AS io_sum_shared_extends_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(files_synced) AS io_sum_shared_fsyncs_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT :io_sum_shared_acq_after > :io_sum_shared_acq_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_shared_writes_after > :io_sum_shared_writes_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_shared_extends_after > :io_sum_shared_extends_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT current_setting('fsync') = 'off' OR :io_sum_shared_fsyncs_after > :io_sum_shared_fsyncs_before;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Change the tablespace so that the table is rewritten directly, then SELECT
+-- from it to cause it to be read back into shared buffers.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE test_io_shared_stats_tblspc LOCATION '';
+ALTER TABLE test_io_shared SET TABLESPACE test_io_shared_stats_tblspc;
+SELECT COUNT(*) FROM test_io_shared;
+ count
+-------
+ 100
+(1 row)
+
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(read) AS io_sum_shared_reads_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT :io_sum_shared_reads_after > :io_sum_shared_reads_before;
+ ?column?
+----------
+ t
+(1 row)
+
+DROP TABLE test_io_shared;
+DROP TABLESPACE test_io_shared_stats_tblspc;
+-- Test that the following IOCONTEXT_LOCAL IOOps are tracked in pg_stat_io:
+-- - initial allocation and subsequent usage of local buffers
+-- - reads of temporary table blocks into local buffers
+-- - writes of local buffers
+-- - extends of temporary tables
+-- Set temp_buffers to a low value so that we can trigger writes with fewer
+-- inserted tuples. Do so in a new session in case temporary tables have been
+-- accessed by previous tests in this session.
+\c
+SET temp_buffers TO '1MB';
+CREATE TEMPORARY TABLE test_io_local(a int, b TEXT);
+SELECT sum(evicted) + sum(freelist_acquired) AS io_sum_local_acq_before FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(read) AS io_sum_local_reads_before FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(written) AS io_sum_local_writes_before FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(extended) AS io_sum_local_extends_before FROM pg_stat_io WHERE io_context = 'local' \gset
+-- Insert enough values that we need to reuse and write out dirty local
+-- buffers.
+INSERT INTO test_io_local SELECT generate_series(1, 8000) as id, repeat('a', 100);
+-- Read in evicted buffers.
+SELECT COUNT(*) FROM test_io_local;
+ count
+-------
+ 8000
+(1 row)
+
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(evicted) + sum(freelist_acquired) AS io_sum_local_acq_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(read) AS io_sum_local_reads_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(written) AS io_sum_local_writes_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(extended) AS io_sum_local_extends_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT :io_sum_local_acq_after > :io_sum_local_acq_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_local_reads_after > :io_sum_local_reads_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_local_writes_after > :io_sum_local_writes_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_local_extends_after > :io_sum_local_extends_before;
+ ?column?
+----------
+ t
+(1 row)
+
+RESET temp_buffers;
+-- Test that reuse of strategy buffers and reads of blocks into these reused
+-- buffers while VACUUMing are tracked in pg_stat_io.
+-- Set wal_skip_threshold smaller than the expected size of
+-- test_io_vac_strategy so that, even if wal_level is minimal, VACUUM FULL will
+-- fsync the newly rewritten test_io_vac_strategy instead of writing it to WAL.
+-- Writing it to WAL will result in the newly written relation pages being in
+-- shared buffers -- preventing us from testing BAS_VACUUM BufferAccessStrategy
+-- evictions/freelist acquisitions and reads.
+SET wal_skip_threshold = '1 kB';
+SELECT sum(evicted) + sum(freelist_acquired) AS io_sum_vac_strategy_acq_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(reused) AS io_sum_vac_strategy_reuses_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(read) AS io_sum_vac_strategy_reads_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+CREATE TABLE test_io_vac_strategy(a int, b int) WITH (autovacuum_enabled = 'false');
+INSERT INTO test_io_vac_strategy SELECT i, i from generate_series(1, 8000)i;
+-- Ensure that the next VACUUM will need to perform IO by rewriting the table
+-- first with VACUUM (FULL).
+VACUUM (FULL) test_io_vac_strategy;
+VACUUM (PARALLEL 0) test_io_vac_strategy;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(evicted) + sum(freelist_acquired) AS io_sum_vac_strategy_acq_after FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(reused) AS io_sum_vac_strategy_reuses_after FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(read) AS io_sum_vac_strategy_reads_after FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT :io_sum_vac_strategy_acq_after > :io_sum_vac_strategy_acq_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_vac_strategy_reads_after > :io_sum_vac_strategy_reads_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_vac_strategy_reuses_after > :io_sum_vac_strategy_reuses_before;
+ ?column?
+----------
+ t
+(1 row)
+
+RESET wal_skip_threshold;
+-- Test that extends done by a CTAS, which uses a BAS_BULKWRITE
+-- BufferAccessStrategy, are tracked in pg_stat_io.
+SELECT sum(extended) AS io_sum_bulkwrite_strategy_extends_before FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+CREATE TABLE test_io_bulkwrite_strategy AS SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(extended) AS io_sum_bulkwrite_strategy_extends_after FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+SELECT :io_sum_bulkwrite_strategy_extends_after > :io_sum_bulkwrite_strategy_extends_before;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Test that reads of blocks into reused strategy buffers during database
+-- creation, which uses a BAS_BULKREAD BufferAccessStrategy, are tracked in
+-- pg_stat_io.
+SELECT sum(read) AS io_sum_bulkread_strategy_reads_before FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+CREATE DATABASE test_io_bulkread_strategy_db;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(read) AS io_sum_bulkread_strategy_reads_after FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+SELECT :io_sum_bulkread_strategy_reads_after > :io_sum_bulkread_strategy_reads_before;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Test IO stats reset
+SELECT sum(evicted) + sum(freelist_acquired) + sum(reused) + sum(extended) + sum(files_synced) + sum(read) + sum(written) AS io_stats_pre_reset FROM pg_stat_io \gset
+SELECT pg_stat_reset_shared('io');
+ pg_stat_reset_shared
+----------------------
+
+(1 row)
+
+SELECT sum(evicted) + sum(freelist_acquired) + sum(reused) + sum(extended) + sum(files_synced) + sum(read) + sum(written) AS io_stats_post_reset FROM pg_stat_io \gset
+SELECT :io_stats_post_reset < :io_stats_pre_reset;
+ ?column?
+----------
+ t
+(1 row)
+
-- End of Stats Test
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index f6270f7bad..2b8ca91088 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -535,4 +535,135 @@ SELECT pg_stat_get_replication_slot(NULL);
SELECT pg_stat_get_subscription_stats(NULL);
+-- Test that the following operations are tracked in pg_stat_io:
+-- - acquisitions of shared buffers from the freelist or by evicting another block
+-- - reads of target blocks into shared buffers
+-- - writes of shared buffers
+-- - extends of relations using shared buffers
+-- - fsyncs done to ensure the durability of data dirtying shared buffers
+
+-- Consider both buffers acquired from the freelist as well as those acquired
+-- by evicting another block because we cannot be sure the state of shared
+-- buffers at the point the test is run.
+SELECT sum(evicted) + sum(freelist_acquired) AS io_sum_shared_acq_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(read) AS io_sum_shared_reads_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(written) AS io_sum_shared_writes_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(extended) AS io_sum_shared_extends_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(files_synced) AS io_sum_shared_fsyncs_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+-- Create a regular table and insert some data to generate IOCONTEXT_SHARED
+-- evicted or freelist acquisitions and extends.
+CREATE TABLE test_io_shared(a int);
+INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+-- After a checkpoint, there should be some additional IOCONTEXT_SHARED writes
+-- and fsyncs.
+-- The second checkpoint ensures that stats from the first checkpoint have been
+-- reported and protects against any potential races amongst the table
+-- creation, a possible timing-triggered checkpoint, and the explicit
+-- checkpoint in the test.
+CHECKPOINT;
+CHECKPOINT;
+SELECT sum(evicted) + sum(freelist_acquired) AS io_sum_shared_acq_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(written) AS io_sum_shared_writes_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(extended) AS io_sum_shared_extends_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(files_synced) AS io_sum_shared_fsyncs_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT :io_sum_shared_acq_after > :io_sum_shared_acq_before;
+SELECT :io_sum_shared_writes_after > :io_sum_shared_writes_before;
+SELECT :io_sum_shared_extends_after > :io_sum_shared_extends_before;
+SELECT current_setting('fsync') = 'off' OR :io_sum_shared_fsyncs_after > :io_sum_shared_fsyncs_before;
+-- Change the tablespace so that the table is rewritten directly, then SELECT
+-- from it to cause it to be read back into shared buffers.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE test_io_shared_stats_tblspc LOCATION '';
+ALTER TABLE test_io_shared SET TABLESPACE test_io_shared_stats_tblspc;
+SELECT COUNT(*) FROM test_io_shared;
+SELECT pg_stat_force_next_flush();
+SELECT sum(read) AS io_sum_shared_reads_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT :io_sum_shared_reads_after > :io_sum_shared_reads_before;
+DROP TABLE test_io_shared;
+DROP TABLESPACE test_io_shared_stats_tblspc;
+
+-- Test that the following IOCONTEXT_LOCAL IOOps are tracked in pg_stat_io:
+-- - initial allocation and subsequent usage of local buffers
+-- - reads of temporary table blocks into local buffers
+-- - writes of local buffers
+-- - extends of temporary tables
+
+-- Set temp_buffers to a low value so that we can trigger writes with fewer
+-- inserted tuples. Do so in a new session in case temporary tables have been
+-- accessed by previous tests in this session.
+\c
+SET temp_buffers TO '1MB';
+CREATE TEMPORARY TABLE test_io_local(a int, b TEXT);
+SELECT sum(evicted) + sum(freelist_acquired) AS io_sum_local_acq_before FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(read) AS io_sum_local_reads_before FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(written) AS io_sum_local_writes_before FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(extended) AS io_sum_local_extends_before FROM pg_stat_io WHERE io_context = 'local' \gset
+-- Insert enough values that we need to reuse and write out dirty local
+-- buffers.
+INSERT INTO test_io_local SELECT generate_series(1, 8000) as id, repeat('a', 100);
+-- Read in evicted buffers.
+SELECT COUNT(*) FROM test_io_local;
+SELECT pg_stat_force_next_flush();
+SELECT sum(evicted) + sum(freelist_acquired) AS io_sum_local_acq_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(read) AS io_sum_local_reads_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(written) AS io_sum_local_writes_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(extended) AS io_sum_local_extends_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT :io_sum_local_acq_after > :io_sum_local_acq_before;
+SELECT :io_sum_local_reads_after > :io_sum_local_reads_before;
+SELECT :io_sum_local_writes_after > :io_sum_local_writes_before;
+SELECT :io_sum_local_extends_after > :io_sum_local_extends_before;
+RESET temp_buffers;
+
+-- Test that reuse of strategy buffers and reads of blocks into these reused
+-- buffers while VACUUMing are tracked in pg_stat_io.
+
+-- Set wal_skip_threshold smaller than the expected size of
+-- test_io_vac_strategy so that, even if wal_level is minimal, VACUUM FULL will
+-- fsync the newly rewritten test_io_vac_strategy instead of writing it to WAL.
+-- Writing it to WAL will result in the newly written relation pages being in
+-- shared buffers -- preventing us from testing BAS_VACUUM BufferAccessStrategy
+-- evictions/freelist acquisitions and reads.
+SET wal_skip_threshold = '1 kB';
+SELECT sum(evicted) + sum(freelist_acquired) AS io_sum_vac_strategy_acq_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(reused) AS io_sum_vac_strategy_reuses_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(read) AS io_sum_vac_strategy_reads_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+CREATE TABLE test_io_vac_strategy(a int, b int) WITH (autovacuum_enabled = 'false');
+INSERT INTO test_io_vac_strategy SELECT i, i from generate_series(1, 8000)i;
+-- Ensure that the next VACUUM will need to perform IO by rewriting the table
+-- first with VACUUM (FULL).
+VACUUM (FULL) test_io_vac_strategy;
+VACUUM (PARALLEL 0) test_io_vac_strategy;
+SELECT pg_stat_force_next_flush();
+SELECT sum(evicted) + sum(freelist_acquired) AS io_sum_vac_strategy_acq_after FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(reused) AS io_sum_vac_strategy_reuses_after FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(read) AS io_sum_vac_strategy_reads_after FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT :io_sum_vac_strategy_acq_after > :io_sum_vac_strategy_acq_before;
+SELECT :io_sum_vac_strategy_reads_after > :io_sum_vac_strategy_reads_before;
+SELECT :io_sum_vac_strategy_reuses_after > :io_sum_vac_strategy_reuses_before;
+RESET wal_skip_threshold;
+
+-- Test that extends done by a CTAS, which uses a BAS_BULKWRITE
+-- BufferAccessStrategy, are tracked in pg_stat_io.
+SELECT sum(extended) AS io_sum_bulkwrite_strategy_extends_before FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+CREATE TABLE test_io_bulkwrite_strategy AS SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+SELECT sum(extended) AS io_sum_bulkwrite_strategy_extends_after FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+SELECT :io_sum_bulkwrite_strategy_extends_after > :io_sum_bulkwrite_strategy_extends_before;
+
+-- Test that reads of blocks into reused strategy buffers during database
+-- creation, which uses a BAS_BULKREAD BufferAccessStrategy, are tracked in
+-- pg_stat_io.
+SELECT sum(read) AS io_sum_bulkread_strategy_reads_before FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+CREATE DATABASE test_io_bulkread_strategy_db;
+SELECT pg_stat_force_next_flush();
+SELECT sum(read) AS io_sum_bulkread_strategy_reads_after FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+SELECT :io_sum_bulkread_strategy_reads_after > :io_sum_bulkread_strategy_reads_before;
+
+-- Test IO stats reset
+SELECT sum(evicted) + sum(freelist_acquired) + sum(reused) + sum(extended) + sum(files_synced) + sum(read) + sum(written) AS io_stats_pre_reset FROM pg_stat_io \gset
+SELECT pg_stat_reset_shared('io');
+SELECT sum(evicted) + sum(freelist_acquired) + sum(reused) + sum(extended) + sum(files_synced) + sum(read) + sum(written) AS io_stats_post_reset FROM pg_stat_io \gset
+SELECT :io_stats_post_reset < :io_stats_pre_reset;
+
-- End of Stats Test
--
2.34.1
Hi,
- we shouldn't do pgstat_count_io_op() while the buffer header lock is held,
if possible.
I wonder if we should add a "source" output argument to
StrategyGetBuffer(). Then nearly all the counting can happen in
BufferAlloc().
- "repossession" is a very unintuitive name for me. If we want something like
it, can't we just name it reuse_failed or such?
- Wonder if the column names should be reads, writes, extends, etc instead of
the current naming pattern
- Is it actually correct to count evictions in StrategyGetBuffer()? What if we
then decide to not use that buffer in BufferAlloc()? Yes, that'll be counted
via rejected, but that still leaves the eviction count to be "misleading"?
On 2022-10-19 15:26:51 -0400, Melanie Plageman wrote:
I have made some major changes in this area to make the columns more
useful. I have renamed and split "clocksweeps". It is now "evicted" and
"freelist acquired". This makes it clear when a block must be evicted
from a shared buffer and may help to identify misconfiguration
of shared buffers.
I'm not sure freelist acquired is really that useful? If we don't add it, we
should however definitely not count buffers from the freelist as evictions.
There is some nuance here that I tried to make clear in the docs.
"freelist acquired" in a shared context is straightforward.
"freelist acquired" in a strategy context is counted when a shared
buffer is added to the strategy ring (not when it is reused).
Not sure what the second half here means - why would a buffer that's not from
the freelist ever be counted as being from the freelist?
"freelist_acquired" is confusing for local buffers but I wanted to
distinguish between reuse/eviction of local buffers and initial
allocation. "freelist_acquired" seemed more fitting because there is a
clocksweep to find a local buffer and if it hasn't been allocated yet it
is allocated in a place similar to where shared buffers acquire a buffer
from the freelist. If I didn't count it here, I would need to make a new
column only for local buffers called "allocated" or something like that.
I think you're making this too granular. We need to have more detail than
today. But we don't necessarily need to catch every nuance.
I chose not to call "evicted" "sb_evicted"
because then we would need a separate "local_evicted". I could instead
make "local_evicted", "sb_evicted", and rename "reused" to
"strat_evicted". If I did that we would end up with separate columns for
every IO Context describing behavior when a buffer is initially acquired
vs when it is reused. It would look something like this:
shared buffers:
initial: freelist_acquired
reused: sb_evicted
local buffers:
initial: allocated
reused: local_evicted
strategy buffers:
initial: sb_evicted | freelist_acquired
reused: strat_evicted
replaced: sb_evicted | freelist_acquired
This seems not too bad at first, but if you consider that later we will
add other kinds of IO -- eg WAL IO or temporary file IO, we won't be
able to use these existing columns and will need to add even more
columns describing the exact behavior in those cases.
I think it's clearly not the right direction.
I have also added the columns "repossessed" and "rejected". "rejected"
is when a bulkread rejects a strategy buffer because it is dirty and
requires flush. Seeing a lot of rejections could indicate you need to
vacuum. "repossessed" is the number of times a strategy buffer was
pinned or in use by another backend and had to be removed from the
strategy ring and replaced with a new shared buffer. This gives you some
indication that there is contention on blocks recently used by a
strategy.
I don't immediately see a real use case for repossessed. Why isn't it
sufficient to count it as part of rejected?
Greetings,
Andres Freund
On Wed, Oct 19, 2022 at 12:27 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
v34 is attached.
I think the column names need discussion. Also, the docs need more work
(I added a lot of new content there). I could use feedback on the column
names and definitions and review/rephrasing ideas for the docs
additions.
Nice! I think the expanded docs are great, and make this information
much easier to interpret.
+ <varname>io_context</varname> <literal>bulkread</literal>, existing
+ dirty buffers in the ring requirng flush are
"requiring"
+ shared buffers were acquired from the freelist and added to the
+ fixed-size strategy ring buffer. Shared buffers are added to the
+ strategy ring lazily. If the current buffer in the ring is pinned or in
This is the first mention of the term "strategy" in these docs. It's
not totally opaque, since there's some context, but maybe we should
either try to avoid that term or define it more explicitly?
+ <varname>io_context</varname>s. This is equivalent to
+ <varname>evicted</varname> for shared buffers in
+ <varname>io_context</varname> <literal>shared</literal>, as the contents
+ of the buffer are <quote>evicted</quote> but refers to the case when the
I don't quite follow this: does this mean that I should expect
'reused' and 'evicted' to be equal in the 'shared' context, because
they represent the same thing? Or will 'reused' just be null because
it's not distinct from 'evicted'? It looks like it's null right now,
but I find the wording here confusing.
+ future with a new shared buffer. A high number of
+ <literal>bulkread</literal> rejections can indicate a need for more
+ frequent vacuuming or more aggressive autovacuum settings, as buffers are
+ dirtied during a bulkread operation when updating the hint bit or when
+ performing on-access pruning.
This is great. Just wanted to re-iterate that notes like this are
really helpful to understanding this view.
I've implemented a change using the same function pg_settings uses to
turn the build-time parameter BLCKSZ into 8kB (get_config_unit_name())
using the flag GUC_UNIT_BLOCKS. I am unsure if this is better or worse
than "block_size". I am feeling very conflicted about this column.
Yeah, I guess it feels less natural here than in pg_settings, but it
still kind of feels like one way of doing this is better than two...
On Thu, Oct 20, 2022 at 10:31 AM Andres Freund <andres@anarazel.de> wrote:
- "repossession" is a very unintuitive name for me. If we want something like
it, can't we just name it reuse_failed or such?
+1, I think "repossessed" is awkward. I think "reuse_failed" works,
but no strong opinions on an alternate name.
- Wonder if the column names should be reads, writes, extends, etc instead of
the current naming pattern
Why? Lukas suggested alignment with existing views like
pg_stat_database and pg_stat_statements. It doesn't make sense to use
the blks_ prefix since it's not all blocks, but otherwise it seems
like we should be consistent, no?
"freelist_acquired" is confusing for local buffers but I wanted to
distinguish between reuse/eviction of local buffers and initial
allocation. "freelist_acquired" seemed more fitting because there is a
clocksweep to find a local buffer and if it hasn't been allocated yet it
is allocated in a place similar to where shared buffers acquire a buffer
from the freelist. If I didn't count it here, I would need to make a new
column only for local buffers called "allocated" or something like that.
I think you're making this too granular. We need to have more detail than
today. But we don't necessarily need to catch every nuance.
In general I agree that coarser granularity here may be easier to use.
I do think the current docs explain what's going on pretty well,
though, and I worry if merging too many concepts will make that harder
to follow. But if a less detailed breakdown still communicates
potential problems, +1.
This seems not too bad at first, but if you consider that later we will
add other kinds of IO -- eg WAL IO or temporary file IO, we won't be
able to use these existing columns and will need to add even more
columns describing the exact behavior in those cases.I think it's clearly not the right direction.
+1, I think the existing approach makes more sense.
On Thu, Oct 20, 2022 at 1:31 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
- we shouldn't do pgstat_count_io_op() while the buffer header lock is held,
if possible.
I've changed this locally. It will be fixed in the next version I share.
I wonder if we should add a "source" output argument to
StrategyGetBuffer(). Then nearly all the counting can happen in
BufferAlloc().
I think we can just check for BM_VALID being set before invalidating it
in order to claim the buffer at the end of BufferAlloc(). Then we can
count it as an eviction or reuse.
- "repossession" is a very unintuitive name for me. If we want something like
it, can't we just name it reuse_failed or such?
Repossession could be called eviction_failed or reuse_failed.
Do we think we will ever want to use it to count buffers we released
in other IOContexts (thus making the name eviction_failed better than
reuse_failed)?
- Is it actually correct to count evictions in StrategyGetBuffer()? What if we
then decide to not use that buffer in BufferAlloc()? Yes, that'll be counted
via rejected, but that still leaves the eviction count to be "misleading"?
I agree that counting evictions in StrategyGetBuffer() is incorrect.
Checking BM_VALID at bottom of BufferAlloc() should be better.
On 2022-10-19 15:26:51 -0400, Melanie Plageman wrote:
I have made some major changes in this area to make the columns more
useful. I have renamed and split "clocksweeps". It is now "evicted" and
"freelist acquired". This makes it clear when a block must be evicted
from a shared buffer and may help to identify misconfiguration
of shared buffers.
I'm not sure freelist acquired is really that useful? If we don't add it, we
should however definitely not count buffers from the freelist as evictions.
There is some nuance here that I tried to make clear in the docs.
"freelist acquired" in a shared context is straightforward.
"freelist acquired" in a strategy context is counted when a shared
buffer is added to the strategy ring (not when it is reused).
Not sure what the second half here means - why would a buffer that's not from
the freelist ever be counted as being from the freelist?
"freelist_acquired" is confusing for local buffers but I wanted to
distinguish between reuse/eviction of local buffers and initial
allocation. "freelist_acquired" seemed more fitting because there is a
clocksweep to find a local buffer and if it hasn't been allocated yet it
is allocated in a place similar to where shared buffers acquire a buffer
from the freelist. If I didn't count it here, I would need to make a new
column only for local buffers called "allocated" or something like that.
I think you're making this too granular. We need to have more detail than
today. But we don't necessarily need to catch every nuance.
I am fine with cutting freelist_acquired. The same actionable
information that it could provide could be provided by "read", right?
Also, removing it means I can remove the complicated explanation of how
freelist_acquired should be interpreted in IOCONTEXT_LOCAL.
Speaking of IOCONTEXT_LOCAL, I was wondering if it is confusing to call
it IOCONTEXT_LOCAL since it refers to IO done for temporary tables. What
if, in the future, we want to track other IO done using data in local
memory? Also, what if we want to track other IO done using data from
shared memory that is not in shared buffers? Would IOCONTEXT_SB and
IOCONTEXT_TEMP be better? Should IOContext literally describe the
context of the IO being done and there be a separate column which
indicates the source of the data for the IO?
Like wal_buffer, local_buffer, shared_buffer? Then if it is not
block-oriented, it could be shared_mem, local_mem, or bypass?
If we had another dimension to the matrix "data_src" which, with
block-oriented IO is equivalent to "buffer type", this could help with
some of the clarity problems.
We could remove the "reused" column and that becomes:
IOCONTEXT | DATA_SRC | IOOP
----------------------------------------
strategy | strategy_buffer | EVICT
Having data_src and iocontext simplifies the meaning of all io
operations involving a strategy. Some operations are done on shared
buffers and some on existing strategy buffers and this would be more
clear without the addition of special columns for strategies.
I have also added the columns "repossessed" and "rejected". "rejected"
is when a bulkread rejects a strategy buffer because it is dirty and
requires flush. Seeing a lot of rejections could indicate you need to
vacuum. "repossessed" is the number of times a strategy buffer was
pinned or in use by another backend and had to be removed from the
strategy ring and replaced with a new shared buffer. This gives you some
indication that there is contention on blocks recently used by a
strategy.
I don't immediately see a real use case for repossessed. Why isn't it
sufficient to count it as part of rejected?
I'm still on the fence about combining rejection and reuse_failed. A
buffer rejected by a bulkread for being dirty may indicate the need to
vacuum but doesn't say anything about contention.
Whereas, failed reuses indicate contention for the blocks operated on by
the strategy. You would react to them differently. And you could have a
bulkread racking up both failed reuses and rejections.
If this seems like an unlikely or niche case, I would be okay with
combining rejections with reuse_failed. But it would be nice if we could
help with interpreting the column. I wonder if there is a rule of thumb
for determining which scenario you have. For example, how likely is it
that if you see a high number of reuse_rejected in a bulkread IOContext
that you would see any reused if the rejections are due to the bulkread
dirtying its own buffers? I suppose it would depend on your workload and
how random your updates/deletes were? If there is some way to use
reuse_rejected in combination with another column to determine the cause
of the rejections, it would be easier to combine them.
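If both columns were kept, a rough diagnostic along these lines could help tell the two scenarios apart (purely illustrative; "reused" and "rejected" are the column names proposed in this thread and may still change):

```sql
-- Hypothetical: many rejections with few reuses suggest the bulkread is
-- dirtying (and then rejecting) its own ring buffers, while rejections
-- alongside many reuses point more toward concurrent activity on the
-- blocks in the ring.
SELECT backend_type, reused, rejected,
       round(rejected::numeric / nullif(reused + rejected, 0), 2)
           AS rejected_fraction
  FROM pg_stat_io
 WHERE io_context = 'bulkread';
```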
- Melanie
On Sun, Oct 23, 2022 at 6:35 PM Maciek Sakrejda <m.sakrejda@gmail.com> wrote:
On Wed, Oct 19, 2022 at 12:27 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
v34 is attached.
I think the column names need discussion. Also, the docs need more work
(I added a lot of new content there). I could use feedback on the column
names and definitions and review/rephrasing ideas for the docs
additions.
Nice! I think the expanded docs are great, and make this information
much easier to interpret.
+ <varname>io_context</varname> <literal>bulkread</literal>, existing
+ dirty buffers in the ring requirng flush are
"requiring"
Thanks!
+ shared buffers were acquired from the freelist and added to the
+ fixed-size strategy ring buffer. Shared buffers are added to the
+ strategy ring lazily. If the current buffer in the ring is pinned or in
This is the first mention of the term "strategy" in these docs. It's
not totally opaque, since there's some context, but maybe we should
either try to avoid that term or define it more explicitly?
I am thinking it might be good to define the term strategy for use in
this view documentation.
In the IOContext column documentation, I've added this
...
avoid occupying an undue portion of the main shared buffer pool. This
pattern is called a Buffer Access Strategy and the fixed-size ring
buffer can be referred to as a <quote>strategy ring buffer</quote>.
</para></entry>
I was thinking this would allow me to refer to the strategy ring buffer
more easily. I fear simply referring to "the" ring buffer throughout
this view documentation will be confusing.
+ <varname>io_context</varname>s. This is equivalent to
+ <varname>evicted</varname> for shared buffers in
+ <varname>io_context</varname> <literal>shared</literal>, as the contents
+ of the buffer are <quote>evicted</quote> but refers to the case when the
I don't quite follow this: does this mean that I should expect
'reused' and 'evicted' to be equal in the 'shared' context, because
they represent the same thing? Or will 'reused' just be null because
it's not distinct from 'evicted'? It looks like it's null right now,
but I find the wording here confusing.
You should only see evictions when the strategy evicts shared buffers
and reuses when the strategy evicts existing strategy buffers.
How about this instead in the docs?
the number of times an existing buffer in the strategy ring was reused
as part of an operation in the <literal>bulkread</literal>,
<literal>bulkwrite</literal>, or <literal>vacuum</literal>
<varname>io_context</varname>s. When a buffer access strategy
<quote>reuses</quote> a buffer in the strategy ring, it must evict its
contents, incrementing <varname>reused</varname>. When a buffer access
strategy adds a new shared buffer to the strategy ring and this shared
buffer is occupied, the buffer access strategy must evict the contents
of the shared buffer, incrementing <varname>evicted</varname>.
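Reading the two counters side by side may make the distinction easier to see; a sketch using the column names from this patch version (which are still under discussion):

```sql
-- "evicted" counts shared buffers whose contents were evicted (including
-- when adding them to a strategy ring); "reused" counts existing ring
-- buffers recycled in place, so it only applies to strategy contexts.
SELECT backend_type, io_context, evicted, reused
  FROM pg_stat_io
 WHERE io_context IN ('bulkread', 'bulkwrite', 'vacuum')
 ORDER BY backend_type, io_context;
```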
I've implemented a change using the same function pg_settings uses to
turn the build-time parameter BLCKSZ into 8kB (get_config_unit_name())
using the flag GUC_UNIT_BLOCKS. I am unsure if this is better or worse
than "block_size". I am feeling very conflicted about this column.
Yeah, I guess it feels less natural here than in pg_settings, but it
still kind of feels like one way of doing this is better than two...
So, Andres pointed out that it would be nice to be able to multiply the
unit column by the operation column (e.g. select unit * reused from
pg_stat_io...) and get a number of bytes. Then you can use
pg_size_pretty to convert it to something more human readable.
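For example, a sketch only; "unit" here stands in for whatever the conversion column is eventually named:

```sql
-- Hypothetical: multiply the per-row conversion factor by an operation
-- count to get bytes, then render the result human-readably.
SELECT backend_type, io_context,
       pg_size_pretty(unit * reused) AS reused_bytes
  FROM pg_stat_io
 WHERE reused > 0;
```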
It probably shouldn't be called unit, then, since that would be the same
name as pg_settings but a different meaning. I thought of
"bytes_conversion". Then, non-block-oriented IO also wouldn't have to be
in bytes. They could put 1000 or 10000 for bytes_conversion.
What do you think?
- Melanie
v35 is attached
On Mon, Oct 24, 2022 at 2:38 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
On Thu, Oct 20, 2022 at 1:31 PM Andres Freund <andres@anarazel.de> wrote:
I wonder if we should add a "source" output argument to
StrategyGetBuffer(). Then nearly all the counting can happen in
BufferAlloc().
I think we can just check for BM_VALID being set before invalidating it
in order to claim the buffer at the end of BufferAlloc(). Then we can
count it as an eviction or reuse.
Done this in attached version
On 2022-10-19 15:26:51 -0400, Melanie Plageman wrote:
I have made some major changes in this area to make the columns more
useful. I have renamed and split "clocksweeps". It is now "evicted" and
"freelist acquired". This makes it clear when a block must be evicted
from a shared buffer and may help to identify misconfiguration
of shared buffers.
I'm not sure freelist acquired is really that useful? If we don't add it, we
should however definitely not count buffers from the freelist as evictions.
There is some nuance here that I tried to make clear in the docs.
"freelist acquired" in a shared context is straightforward.
"freelist acquired" in a strategy context is counted when a shared
buffer is added to the strategy ring (not when it is reused).
Not sure what the second half here means - why would a buffer that's not from
the freelist ever be counted as being from the freelist?
"freelist_acquired" is confusing for local buffers but I wanted to
distinguish between reuse/eviction of local buffers and initial
allocation. "freelist_acquired" seemed more fitting because there is a
clocksweep to find a local buffer and if it hasn't been allocated yet it
is allocated in a place similar to where shared buffers acquire a buffer
from the freelist. If I didn't count it here, I would need to make a new
column only for local buffers called "allocated" or something like that.
I think you're making this too granular. We need to have more detail than
today. But we don't necessarily need to catch every nuance.
I cut freelist_acquired in attached version.
I am fine with cutting freelist_acquired. The actionable information it
could have provided is largely available from "read", right?
Also, removing it means I can remove the complicated explanation of how
freelist_acquired should be interpreted in IOCONTEXT_LOCAL.
Speaking of IOCONTEXT_LOCAL, I was wondering if it is confusing to call
it IOCONTEXT_LOCAL since it refers to IO done for temporary tables. What
if, in the future, we want to track other IO done using data in local
memory? Also, what if we want to track other IO done using data from
shared memory that is not in shared buffers? Would IOCONTEXT_SB and
IOCONTEXT_TEMP be better? Should IOContext literally describe the
context of the IO being done and there be a separate column which
indicates the source of the data for the IO?
Like wal_buffer, local_buffer, shared_buffer? Then if it is not
block-oriented, it could be shared_mem, local_mem, or bypass?
pg_stat_statements uses local_blks_read and temp_blks_read for local
buffers for temp tables and temp file IO respectively -- so perhaps we
should stick to that
Other updates in this version:
I've also renamed the unit column to bytes_conversion.
I've made quite a few updates to the docs including more information
on overlaps between pg_stat_database, pg_statio_*, and
pg_stat_statements.
Let me know if there are other configuration tip resources from the
existing docs that I could link in the documentation for the
"files_synced" column.
I still need to look at the docs with fresh eyes and do another round of
cleanup (probably).
- Melanie
Attachments:
v35-0001-Remove-BufferAccessStrategyData-current_was_in_r.patch
From b19456745d4431c2271fa9fbd57148d4733a7f66 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Thu, 13 Oct 2022 11:03:05 -0700
Subject: [PATCH v35 1/5] Remove BufferAccessStrategyData->current_was_in_ring
It is a duplication of StrategyGetBuffer->from_ring.
---
src/backend/storage/buffer/bufmgr.c | 2 +-
src/backend/storage/buffer/freelist.c | 15 ++-------------
src/include/storage/buf_internals.h | 2 +-
3 files changed, 4 insertions(+), 15 deletions(-)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 6b95381481..4e7b0b31bb 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1254,7 +1254,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
UnlockBufHdr(buf, buf_state);
if (XLogNeedsFlush(lsn) &&
- StrategyRejectBuffer(strategy, buf))
+ StrategyRejectBuffer(strategy, buf, from_ring))
{
/* Drop lock/pin and loop around for another buffer */
LWLockRelease(BufferDescriptorGetContentLock(buf));
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 990e081aae..64728bd7ce 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -81,12 +81,6 @@ typedef struct BufferAccessStrategyData
*/
int current;
- /*
- * True if the buffer just returned by StrategyGetBuffer had been in the
- * ring already.
- */
- bool current_was_in_ring;
-
/*
* Array of buffer numbers. InvalidBuffer (that is, zero) indicates we
* have not yet selected a buffer for this ring slot. For allocation
@@ -625,10 +619,7 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
*/
bufnum = strategy->buffers[strategy->current];
if (bufnum == InvalidBuffer)
- {
- strategy->current_was_in_ring = false;
return NULL;
- }
/*
* If the buffer is pinned we cannot use it under any circumstances.
@@ -644,7 +635,6 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
if (BUF_STATE_GET_REFCOUNT(local_buf_state) == 0
&& BUF_STATE_GET_USAGECOUNT(local_buf_state) <= 1)
{
- strategy->current_was_in_ring = true;
*buf_state = local_buf_state;
return buf;
}
@@ -654,7 +644,6 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
* Tell caller to allocate a new buffer with the normal allocation
* strategy. He'll then replace this ring element via AddBufferToRing.
*/
- strategy->current_was_in_ring = false;
return NULL;
}
@@ -682,14 +671,14 @@ AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
* if this buffer should be written and re-used.
*/
bool
-StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
+StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool from_ring)
{
/* We only do this in bulkread mode */
if (strategy->btype != BAS_BULKREAD)
return false;
/* Don't muck with behavior of normal buffer-replacement strategy */
- if (!strategy->current_was_in_ring ||
+ if (!from_ring ||
strategy->buffers[strategy->current] != BufferDescriptorGetBuffer(buf))
return false;
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 406db6be78..b75481450d 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -395,7 +395,7 @@ extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
uint32 *buf_state);
extern void StrategyFreeBuffer(BufferDesc *buf);
extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
- BufferDesc *buf);
+ BufferDesc *buf, bool from_ring);
extern int StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
extern void StrategyNotifyBgWriter(int bgwprocno);
--
2.34.1
v35-0002-Track-IO-operation-statistics-locally.patch
From 629683d8fa63064ec55da2d65794d5e5af251407 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Thu, 6 Oct 2022 12:23:25 -0400
Subject: [PATCH v35 2/5] Track IO operation statistics locally
Introduce "IOOp", an IO operation done by a backend, and "IOContext",
the IO source, target, or type done by a backend. For example, the
checkpointer may write a shared buffer out. This would be counted as an
IOOp "written" on an IOContext IOCONTEXT_SHARED by BackendType
"checkpointer".
Each IOOp (evict, extend, fsync, read, reject, repossess, reuse, and
write) is counted per IOContext (bulkread, bulkwrite, local, shared, or
vacuum) through a call to pgstat_count_io_op().
The primary concern of these statistics is IO operations on data blocks
during the course of normal database operations. IO operations done by,
for example, the archiver or syslogger are not counted in these
statistics. WAL IO, temporary file IO, and IO done directly though smgr*
functions (such as when building an index) are not yet counted but would
be useful future additions.
IOCONTEXT_LOCAL and IOCONTEXT_SHARED IOContexts concern operations on
local and shared buffers.
The IOCONTEXT_BULKREAD, IOCONTEXT_BULKWRITE, and IOCONTEXT_VACUUM
IOContexts concern IO operations on buffers as part of a
BufferAccessStrategy.
IOOP_EVICT IOOps are counted in IOCONTEXT_SHARED and IOCONTEXT_LOCAL
IOContexts when a buffer is acquired or allocated through
[Local]BufferAlloc() and no BufferAccessStrategy is in use.
When a BufferAccessStrategy is in use, shared buffers added to the
strategy ring are counted as IOOP_EVICT IOOps in the
IOCONTEXT_[BULKREAD|BULKWRITE|VACUUM] IOContext. When one of
these buffers is reused, it is counted as an IOOP_REUSE IOOp in the
corresponding strategy IOContext.
IOOP_WRITE IOOps are counted in the BufferAccessStrategy IOContexts
whenever the reused dirty buffer is written out.
Stats on IOOps in all IOContexts for a given backend are counted in a
backend's local memory. A subsequent commit will expose functions for
aggregating and viewing these stats.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Maciek Sakrejda <m.sakrejda@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
src/backend/postmaster/checkpointer.c | 13 ++
src/backend/storage/buffer/bufmgr.c | 84 ++++++-
src/backend/storage/buffer/freelist.c | 51 ++++-
src/backend/storage/buffer/localbuf.c | 6 +
src/backend/storage/sync/sync.c | 2 +
src/backend/utils/activity/Makefile | 1 +
src/backend/utils/activity/meson.build | 1 +
src/backend/utils/activity/pgstat_io_ops.c | 255 +++++++++++++++++++++
src/include/pgstat.h | 68 ++++++
src/include/storage/buf_internals.h | 2 +-
src/include/storage/bufmgr.h | 7 +-
src/tools/pgindent/typedefs.list | 4 +
12 files changed, 481 insertions(+), 13 deletions(-)
create mode 100644 src/backend/utils/activity/pgstat_io_ops.c
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 5fc076fc14..4ea4e6a298 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -1116,6 +1116,19 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
if (!AmBackgroundWriterProcess())
CheckpointerShmem->num_backend_fsync++;
LWLockRelease(CheckpointerCommLock);
+
+ /*
+ * We have no way of knowing if the current IOContext is
+ * IOCONTEXT_SHARED or IOCONTEXT_[BULKREAD, BULKWRITE, VACUUM] at this
+ * point, so count the fsync as being in the IOCONTEXT_SHARED
+ * IOContext. This is probably okay, because the number of backend
+ * fsyncs doesn't say anything about the efficacy of the
+ * BufferAccessStrategy. And counting both fsyncs done in
+ * IOCONTEXT_SHARED and IOCONTEXT_[BULKREAD, BULKWRITE, VACUUM] under
+ * IOCONTEXT_SHARED is likely clearer when investigating the number of
+ * backend fsyncs.
+ */
+ pgstat_count_io_op(IOOP_FSYNC, IOCONTEXT_SHARED);
return false;
}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 4e7b0b31bb..9f25c5ce32 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -482,7 +482,7 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
BlockNumber blockNum,
BufferAccessStrategy strategy,
bool *foundPtr);
-static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+static void FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOContext io_context);
static void FindAndDropRelationBuffers(RelFileLocator rlocator,
ForkNumber forkNum,
BlockNumber nForkBlock,
@@ -823,6 +823,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
BufferDesc *bufHdr;
Block bufBlock;
bool found;
+ IOContext io_context;
bool isExtend;
bool isLocalBuf = SmgrIsTemp(smgr);
@@ -833,6 +834,11 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
isExtend = (blockNum == P_NEW);
+ if (isLocalBuf)
+ io_context = IOCONTEXT_LOCAL;
+ else
+ io_context = IOContextForStrategy(strategy);
+
TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
smgr->smgr_rlocator.locator.spcOid,
smgr->smgr_rlocator.locator.dbOid,
@@ -990,6 +996,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (isExtend)
{
+ pgstat_count_io_op(IOOP_EXTEND, io_context);
/* new buffers are zero-filled */
MemSet((char *) bufBlock, 0, BLCKSZ);
/* don't set checksum for all-zero page */
@@ -1015,6 +1022,9 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
instr_time io_start,
io_time;
+ pgstat_count_io_op(IOOP_READ, io_context);
+
+
if (track_io_timing)
INSTR_TIME_SET_CURRENT(io_start);
@@ -1121,6 +1131,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
BufferAccessStrategy strategy,
bool *foundPtr)
{
+ bool from_ring;
BufferTag newTag; /* identity of requested block */
uint32 newHash; /* hash value for newTag */
LWLock *newPartitionLock; /* buffer partition lock for it */
@@ -1190,6 +1201,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
/* Loop here in case we have to try another victim buffer */
for (;;)
{
+
/*
* Ensure, while the spinlock's not yet held, that there's a free
* refcount entry.
@@ -1200,7 +1212,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* Select a victim buffer. The buffer is returned with its header
* spinlock still held!
*/
- buf = StrategyGetBuffer(strategy, &buf_state);
+ buf = StrategyGetBuffer(strategy, &buf_state, &from_ring);
Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
@@ -1237,6 +1249,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf),
LW_SHARED))
{
+ IOContext io_context;
+
/*
* If using a nondefault strategy, and writing the buffer
* would require a WAL flush, let the strategy decide whether
@@ -1263,13 +1277,36 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
}
}
+ /*
+ * When a strategy is in use, only flushes of dirty buffers
+ * already in the strategy ring are counted as strategy writes
+ * (IOCONTEXT [BULKREAD|BULKWRITE|VACUUM] IOOP_WRITE) for the
+ * purpose of IO operation statistics tracking.
+ *
+ * If a shared buffer initially added to the ring must be
+ * flushed before being used, this is counted as an
+ * IOCONTEXT_SHARED IOOP_WRITE.
+ *
+ * If a shared buffer added to the ring later because the
+ * current strategy buffer is pinned or in use or because all
+ * strategy buffers were dirty and rejected (for BAS_BULKREAD
+ * operations only) requires flushing, this is counted as an
+ * IOCONTEXT_SHARED IOOP_WRITE (from_ring will be false).
+ *
+ * When a strategy is not in use, the write can only be a
+ * "regular" write of a dirty shared buffer (IOCONTEXT_SHARED
+ * IOOP_WRITE).
+ */
+
+ io_context = IOContextForStrategy(strategy);
+
/* OK, do the I/O */
TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(forkNum, blockNum,
smgr->smgr_rlocator.locator.spcOid,
smgr->smgr_rlocator.locator.dbOid,
smgr->smgr_rlocator.locator.relNumber);
- FlushBuffer(buf, NULL);
+ FlushBuffer(buf, NULL, io_context);
LWLockRelease(BufferDescriptorGetContentLock(buf));
ScheduleBufferTagForWriteback(&BackendWritebackContext,
@@ -1441,6 +1478,31 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
UnlockBufHdr(buf, buf_state);
+ if (oldFlags & BM_VALID)
+ {
+ /*
+ * When a BufferAccessStrategy is in use, evictions adding a
+ * shared buffer to the strategy ring are counted in the
+ * corresponding strategy's context. This includes the evictions
+ * done to add buffers to the ring initially as well as those
+ * done to add a new shared buffer to the ring when current
+ * buffer is pinned or otherwise in use.
+ *
+ * Blocks evicted from buffers already in the strategy ring are counted
+ * as IOCONTEXT_BULKREAD, IOCONTEXT_BULKWRITE, or IOCONTEXT_VACUUM
+ * reuses.
+ *
+ * We wait until this point to count reuses and evictions in order to
+ * avoid incorrectly counting a buffer as reused or evicted when it was
+ * released because it was concurrently pinned or in use or counting it
+ * as reused when it was rejected or when we errored out.
+ */
+ if (from_ring)
+ pgstat_count_io_op(IOOP_REUSE, IOContextForStrategy(strategy));
+ else
+ pgstat_count_io_op(IOOP_EVICT, IOCONTEXT_SHARED);
+ }
+
if (oldPartitionLock != NULL)
{
BufTableDelete(&oldTag, oldHash);
@@ -2570,7 +2632,7 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOCONTEXT_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
@@ -2820,7 +2882,7 @@ BufferGetTag(Buffer buffer, RelFileLocator *rlocator, ForkNumber *forknum,
* as the second parameter. If not, pass NULL.
*/
static void
-FlushBuffer(BufferDesc *buf, SMgrRelation reln)
+FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOContext io_context)
{
XLogRecPtr recptr;
ErrorContextCallback errcallback;
@@ -2900,6 +2962,8 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
*/
bufToWrite = PageSetChecksumCopy((Page) bufBlock, buf->tag.blockNum);
+ pgstat_count_io_op(IOOP_WRITE, io_context);
+
if (track_io_timing)
INSTR_TIME_SET_CURRENT(io_start);
@@ -3551,6 +3615,8 @@ FlushRelationBuffers(Relation rel)
localpage,
false);
+ pgstat_count_io_op(IOOP_WRITE, IOCONTEXT_LOCAL);
+
buf_state &= ~(BM_DIRTY | BM_JUST_DIRTIED);
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
@@ -3586,7 +3652,7 @@ FlushRelationBuffers(Relation rel)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, RelationGetSmgr(rel));
+ FlushBuffer(bufHdr, RelationGetSmgr(rel), IOCONTEXT_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr);
}
@@ -3684,7 +3750,7 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, srelent->srel);
+ FlushBuffer(bufHdr, srelent->srel, IOCONTEXT_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr);
}
@@ -3894,7 +3960,7 @@ FlushDatabaseBuffers(Oid dbid)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOCONTEXT_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr);
}
@@ -3921,7 +3987,7 @@ FlushOneBuffer(Buffer buffer)
Assert(LWLockHeldByMe(BufferDescriptorGetContentLock(bufHdr)));
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOCONTEXT_SHARED);
}
/*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 64728bd7ce..6eb2e00ae2 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -15,6 +15,7 @@
*/
#include "postgres.h"
+#include "pgstat.h"
#include "port/atomics.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
@@ -192,13 +193,15 @@ have_free_buffer(void)
* return the buffer with the buffer header spinlock still held.
*/
BufferDesc *
-StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
+StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_ring)
{
BufferDesc *buf;
int bgwprocno;
int trycounter;
uint32 local_buf_state; /* to avoid repeated (de-)referencing */
+ *from_ring = false;
+
/*
* If given a strategy object, see whether it can select a buffer. We
* assume strategy objects don't need buffer_strategy_lock.
@@ -207,7 +210,10 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
{
buf = GetBufferFromRing(strategy, buf_state);
if (buf != NULL)
+ {
+ *from_ring = true;
return buf;
+ }
}
/*
@@ -299,6 +305,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
if (strategy != NULL)
AddBufferToRing(strategy, buf);
*buf_state = local_buf_state;
+
return buf;
}
UnlockBufHdr(buf, local_buf_state);
@@ -331,6 +338,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
if (strategy != NULL)
AddBufferToRing(strategy, buf);
*buf_state = local_buf_state;
+
return buf;
}
}
@@ -596,7 +604,7 @@ FreeAccessStrategy(BufferAccessStrategy strategy)
/*
* GetBufferFromRing -- returns a buffer from the ring, or NULL if the
- * ring is empty.
+ * ring is empty / not usable.
*
* The bufhdr spin lock is held on the returned buffer.
*/
@@ -643,7 +651,13 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
/*
* Tell caller to allocate a new buffer with the normal allocation
* strategy. He'll then replace this ring element via AddBufferToRing.
+ *
+ * This counts as a "repossession" for the purposes of IO operation
+ * statistic tracking, since the reason that we no longer consider the
+ * current buffer to be part of the ring is that the block in it is in use
+ * outside of the ring, preventing us from reusing the buffer.
*/
+ pgstat_count_io_op(IOOP_REPOSSESS, IOContextForStrategy(strategy));
return NULL;
}
@@ -659,6 +673,37 @@ AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
strategy->buffers[strategy->current] = BufferDescriptorGetBuffer(buf);
}
+/*
+ * Utility function returning the IOContext of a given BufferAccessStrategy's
+ * strategy ring.
+ */
+IOContext
+IOContextForStrategy(BufferAccessStrategy strategy)
+{
+ if (!strategy)
+ return IOCONTEXT_SHARED;
+
+ switch (strategy->btype)
+ {
+ case BAS_NORMAL:
+ /*
+ * Currently, GetAccessStrategy() returns NULL for
+ * BufferAccessStrategyType BAS_NORMAL, so this case is
+ * unreachable.
+ */
+ pg_unreachable();
+ return IOCONTEXT_SHARED;
+ case BAS_BULKREAD:
+ return IOCONTEXT_BULKREAD;
+ case BAS_BULKWRITE:
+ return IOCONTEXT_BULKWRITE;
+ case BAS_VACUUM:
+ return IOCONTEXT_VACUUM;
+ }
+
+ elog(ERROR, "unrecognized BufferAccessStrategyType: %d", strategy->btype);
+}
+
/*
* StrategyRejectBuffer -- consider rejecting a dirty buffer
*
@@ -688,5 +733,7 @@ StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool from_r
*/
strategy->buffers[strategy->current] = InvalidBuffer;
+ pgstat_count_io_op(IOOP_REJECT, IOContextForStrategy(strategy));
+
return true;
}
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 30d67d1c40..cb9685564f 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -18,6 +18,7 @@
#include "access/parallel.h"
#include "catalog/catalog.h"
#include "executor/instrument.h"
+#include "pgstat.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "utils/guc_hooks.h"
@@ -196,6 +197,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
LocalRefCount[b]++;
ResourceOwnerRememberBuffer(CurrentResourceOwner,
BufferDescriptorGetBuffer(bufHdr));
+
break;
}
}
@@ -226,6 +228,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
localpage,
false);
+ pgstat_count_io_op(IOOP_WRITE, IOCONTEXT_LOCAL);
+
/* Mark not-dirty now in case we error out below */
buf_state &= ~BM_DIRTY;
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
@@ -256,6 +260,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
ClearBufferTag(&bufHdr->tag);
buf_state &= ~(BM_VALID | BM_TAG_VALID);
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+ pgstat_count_io_op(IOOP_EVICT, IOCONTEXT_LOCAL);
}
hresult = (LocalBufferLookupEnt *)
@@ -275,6 +280,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
*foundPtr = false;
+
return bufHdr;
}
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 9d6a9e9109..5718b52fb5 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -432,6 +432,8 @@ ProcessSyncRequests(void)
total_elapsed += elapsed;
processed++;
+ pgstat_count_io_op(IOOP_FSYNC, IOCONTEXT_SHARED);
+
if (log_checkpoints)
elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f ms",
processed,
diff --git a/src/backend/utils/activity/Makefile b/src/backend/utils/activity/Makefile
index a2e8507fd6..0098785089 100644
--- a/src/backend/utils/activity/Makefile
+++ b/src/backend/utils/activity/Makefile
@@ -22,6 +22,7 @@ OBJS = \
pgstat_checkpointer.o \
pgstat_database.o \
pgstat_function.o \
+ pgstat_io_ops.o \
pgstat_relation.o \
pgstat_replslot.o \
pgstat_shmem.o \
diff --git a/src/backend/utils/activity/meson.build b/src/backend/utils/activity/meson.build
index 5b3b558a67..1038324c32 100644
--- a/src/backend/utils/activity/meson.build
+++ b/src/backend/utils/activity/meson.build
@@ -7,6 +7,7 @@ backend_sources += files(
'pgstat_checkpointer.c',
'pgstat_database.c',
'pgstat_function.c',
+ 'pgstat_io_ops.c',
'pgstat_relation.c',
'pgstat_replslot.c',
'pgstat_shmem.c',
diff --git a/src/backend/utils/activity/pgstat_io_ops.c b/src/backend/utils/activity/pgstat_io_ops.c
new file mode 100644
index 0000000000..6f9c250907
--- /dev/null
+++ b/src/backend/utils/activity/pgstat_io_ops.c
@@ -0,0 +1,255 @@
+/* -------------------------------------------------------------------------
+ *
+ * pgstat_io_ops.c
+ * Implementation of IO operation statistics.
+ *
+ * This file contains the implementation of IO operation statistics. It is kept
+ * separate from pgstat.c to enforce the line between the statistics access /
+ * storage implementation and the details about individual types of
+ * statistics.
+ *
+ * Copyright (c) 2021-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/utils/activity/pgstat_io_ops.c
+ * -------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "utils/pgstat_internal.h"
+
+static PgStat_IOContextOps pending_IOOpStats;
+
+void
+pgstat_count_io_op(IOOp io_op, IOContext io_context)
+{
+ PgStat_IOOpCounters *pending_counters;
+
+ Assert(io_context < IOCONTEXT_NUM_TYPES);
+ Assert(io_op < IOOP_NUM_TYPES);
+ Assert(pgstat_expect_io_op(MyBackendType, io_context, io_op));
+
+ pending_counters = &pending_IOOpStats.data[io_context];
+
+ switch (io_op)
+ {
+ case IOOP_EVICT:
+ pending_counters->evictions++;
+ break;
+ case IOOP_EXTEND:
+ pending_counters->extends++;
+ break;
+ case IOOP_FSYNC:
+ pending_counters->fsyncs++;
+ break;
+ case IOOP_READ:
+ pending_counters->reads++;
+ break;
+ case IOOP_REPOSSESS:
+ pending_counters->repossessions++;
+ break;
+ case IOOP_REJECT:
+ pending_counters->rejections++;
+ break;
+ case IOOP_REUSE:
+ pending_counters->reuses++;
+ break;
+ case IOOP_WRITE:
+ pending_counters->writes++;
+ break;
+ }
+
+}
+
+const char *
+pgstat_io_context_desc(IOContext io_context)
+{
+ switch (io_context)
+ {
+ case IOCONTEXT_BULKREAD:
+ return "bulkread";
+ case IOCONTEXT_BULKWRITE:
+ return "bulkwrite";
+ case IOCONTEXT_LOCAL:
+ return "local";
+ case IOCONTEXT_SHARED:
+ return "shared";
+ case IOCONTEXT_VACUUM:
+ return "vacuum";
+ }
+
+ elog(ERROR, "unrecognized IOContext value: %d", io_context);
+}
+
+const char *
+pgstat_io_op_desc(IOOp io_op)
+{
+ switch (io_op)
+ {
+ case IOOP_EVICT:
+ return "evicted";
+ case IOOP_EXTEND:
+ return "extended";
+ case IOOP_FSYNC:
+ return "files synced";
+ case IOOP_READ:
+ return "read";
+ case IOOP_REPOSSESS:
+ return "repossessed";
+ case IOOP_REJECT:
+ return "rejected";
+ case IOOP_REUSE:
+ return "reused";
+ case IOOP_WRITE:
+ return "written";
+ }
+
+ elog(ERROR, "unrecognized IOOp value: %d", io_op);
+}
+
+/*
+* IO Operation statistics are not collected for all BackendTypes.
+*
+* The following BackendTypes do not participate in the cumulative stats
+* subsystem or do not do IO operations worth reporting statistics on:
+* - Syslogger because it is not connected to shared memory
+* - Archiver because most relevant archiving IO is delegated to a
+* specialized command or module
+* - WAL Receiver and WAL Writer IO is not tracked in pg_stat_io for now
+*
+* Function returns true if BackendType participates in the cumulative stats
+* subsystem for IO Operations and false if it does not.
+*/
+bool
+pgstat_io_op_stats_collected(BackendType bktype)
+{
+ return bktype != B_INVALID && bktype != B_ARCHIVER && bktype != B_LOGGER &&
+ bktype != B_WAL_RECEIVER && bktype != B_WAL_WRITER;
+}
+
+/*
+ * Some BackendTypes do not perform IO operations in certain IOContexts. Check
+ * that the given BackendType is expected to do IO in the given IOContext.
+ */
+bool
+pgstat_bktype_io_context_valid(BackendType bktype, IOContext io_context)
+{
+ bool no_local;
+
+ /*
+ * In core Postgres, only regular backends and WAL Sender processes
+ * executing queries should use local buffers. Parallel workers will not
+ * use local buffers (see InitLocalBuffers()); however, extensions
+ * leveraging background workers have no such limitation, so track IO
+ * Operations in IOCONTEXT_LOCAL for BackendType B_BG_WORKER.
+ */
+ no_local = bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER || bktype
+ == B_CHECKPOINTER || bktype == B_AUTOVAC_WORKER || bktype ==
+ B_STANDALONE_BACKEND || bktype == B_STARTUP;
+
+ if (io_context == IOCONTEXT_LOCAL && no_local)
+ return false;
+
+ /*
+ * Some BackendTypes do not currently perform any IO operations in certain
+ * IOContexts, and, while it may not be inherently incorrect for them to
+ * do so, excluding those rows from the view makes the view easier to use.
+ */
+ if ((io_context == IOCONTEXT_BULKREAD || io_context == IOCONTEXT_BULKWRITE
+ || io_context == IOCONTEXT_VACUUM) && (bktype == B_CHECKPOINTER
+ || bktype == B_BG_WRITER))
+ return false;
+
+ if (io_context == IOCONTEXT_VACUUM && bktype == B_AUTOVAC_LAUNCHER)
+ return false;
+
+ if (io_context == IOCONTEXT_BULKWRITE && (bktype == B_AUTOVAC_WORKER ||
+ bktype == B_AUTOVAC_LAUNCHER))
+ return false;
+
+ return true;
+}
+
+/*
+ * Some BackendTypes will never do certain IOOps and some IOOps should not
+ * occur in certain IOContexts. Check that the given IOOp is valid for the
+ * given BackendType in the given IOContext. Note that there are currently no
+ * cases of an IOOp being invalid for a particular BackendType only within a
+ * certain IOContext.
+ */
+bool
+pgstat_io_op_valid(BackendType bktype, IOContext io_context, IOOp io_op)
+{
+ bool strategy_io_context;
+
+ /*
+ * Some BackendTypes should never track IO Operation statistics.
+ */
+ Assert(pgstat_io_op_stats_collected(bktype));
+
+ /*
+ * Some BackendTypes will not do certain IOOps.
+ */
+ if ((bktype == B_BG_WRITER || bktype == B_CHECKPOINTER) &&
+ (io_op == IOOP_READ || io_op == IOOP_EVICT))
+ return false;
+
+ if ((bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER || bktype ==
+ B_CHECKPOINTER) && io_op == IOOP_EXTEND)
+ return false;
+
+ /*
+ * Some IOOps are not valid in certain IOContexts and some IOOps are only
+ * valid in certain contexts.
+ */
+ if (io_context == IOCONTEXT_BULKREAD && io_op == IOOP_EXTEND)
+ return false;
+
+ /*
+ * Only BAS_BULKREAD will reject strategy buffers
+ */
+ if (io_context != IOCONTEXT_BULKREAD && io_op == IOOP_REJECT)
+ return false;
+
+
+ strategy_io_context = io_context == IOCONTEXT_BULKREAD || io_context ==
+ IOCONTEXT_BULKWRITE || io_context == IOCONTEXT_VACUUM;
+
+ /*
+ * IOOP_REPOSSESS and IOOP_REUSE are only relevant when a
+ * BufferAccessStrategy is in use.
+ */
+ if (!strategy_io_context && (io_op == IOOP_REJECT || io_op ==
+ IOOP_REPOSSESS || io_op == IOOP_REUSE))
+ return false;
+
+ /*
+ * Temporary tables using local buffers are not logged and thus do not
+ * require fsync'ing.
+ *
+ * IOOP_FSYNC IOOps done by a backend using a BufferAccessStrategy are
+ * counted in the IOCONTEXT_SHARED IOContext. See comment in
+ * ForwardSyncRequest() for more details.
+ */
+ if ((io_context == IOCONTEXT_LOCAL || strategy_io_context) &&
+ io_op == IOOP_FSYNC)
+ return false;
+
+ return true;
+}
+
+bool
+pgstat_expect_io_op(BackendType bktype, IOContext io_context, IOOp io_op)
+{
+ if (!pgstat_io_op_stats_collected(bktype))
+ return false;
+
+ if (!pgstat_bktype_io_context_valid(bktype, io_context))
+ return false;
+
+ if (!(pgstat_io_op_valid(bktype, io_context, io_op)))
+ return false;
+
+ return true;
+}
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 9e2ce6f011..5883aafe9c 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -14,6 +14,7 @@
#include "datatype/timestamp.h"
#include "portability/instr_time.h"
#include "postmaster/pgarch.h" /* for MAX_XFN_CHARS */
+#include "storage/buf.h"
#include "utils/backend_progress.h" /* for backward compatibility */
#include "utils/backend_status.h" /* for backward compatibility */
#include "utils/relcache.h"
@@ -276,6 +277,55 @@ typedef struct PgStat_CheckpointerStats
PgStat_Counter buf_fsync_backend;
} PgStat_CheckpointerStats;
+/*
+ * Types related to counting IO Operations for various IO Contexts
+ * When adding a new value, ensure that the proper assertions are added to
+ * pgstat_io_context_ops_assert_zero() and pgstat_io_op_assert_zero() (though
+ * the compiler will remind you about the latter)
+ */
+
+typedef enum IOOp
+{
+ IOOP_EVICT = 0,
+ IOOP_EXTEND,
+ IOOP_FSYNC,
+ IOOP_READ,
+ IOOP_REJECT,
+ IOOP_REPOSSESS,
+ IOOP_REUSE,
+ IOOP_WRITE,
+} IOOp;
+
+#define IOOP_NUM_TYPES (IOOP_WRITE + 1)
+
+typedef enum IOContext
+{
+ IOCONTEXT_BULKREAD = 0,
+ IOCONTEXT_BULKWRITE,
+ IOCONTEXT_LOCAL,
+ IOCONTEXT_SHARED,
+ IOCONTEXT_VACUUM,
+} IOContext;
+
+#define IOCONTEXT_NUM_TYPES (IOCONTEXT_VACUUM + 1)
+
+typedef struct PgStat_IOOpCounters
+{
+ PgStat_Counter evictions;
+ PgStat_Counter extends;
+ PgStat_Counter fsyncs;
+ PgStat_Counter reads;
+ PgStat_Counter rejections;
+ PgStat_Counter reuses;
+ PgStat_Counter repossessions;
+ PgStat_Counter writes;
+} PgStat_IOOpCounters;
+
+typedef struct PgStat_IOContextOps
+{
+ PgStat_IOOpCounters data[IOCONTEXT_NUM_TYPES];
+} PgStat_IOContextOps;
+
typedef struct PgStat_StatDBEntry
{
PgStat_Counter n_xact_commit;
@@ -453,6 +503,24 @@ extern void pgstat_report_checkpointer(void);
extern PgStat_CheckpointerStats *pgstat_fetch_stat_checkpointer(void);
+/*
+ * Functions in pgstat_io_ops.c
+ */
+
+extern void pgstat_count_io_op(IOOp io_op, IOContext io_context);
+extern const char *pgstat_io_context_desc(IOContext io_context);
+extern const char *pgstat_io_op_desc(IOOp io_op);
+
+/* Validation functions in pgstat_io_ops.c */
+extern bool pgstat_io_op_stats_collected(BackendType bktype);
+extern bool pgstat_bktype_io_context_valid(BackendType bktype, IOContext io_context);
+extern bool pgstat_io_op_valid(BackendType bktype, IOContext io_context, IOOp io_op);
+extern bool pgstat_expect_io_op(BackendType bktype, IOContext io_context, IOOp io_op);
+
+/* IO stats translation function in freelist.c */
+extern IOContext IOContextForStrategy(BufferAccessStrategy bas);
+
+
/*
* Functions in pgstat_database.c
*/
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index b75481450d..7b67250747 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -392,7 +392,7 @@ extern void ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *
/* freelist.c */
extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
- uint32 *buf_state);
+ uint32 *buf_state, bool *from_ring);
extern void StrategyFreeBuffer(BufferDesc *buf);
extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
BufferDesc *buf, bool from_ring);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 6f4dfa0960..d0eed71f63 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -23,7 +23,12 @@
typedef void *Block;
-/* Possible arguments for GetAccessStrategy() */
+/*
+ * Possible arguments for GetAccessStrategy().
+ *
+ * If adding a new BufferAccessStrategyType, also add a new IOContext so
+ * statistics on IO operations using this strategy are tracked.
+ */
typedef enum BufferAccessStrategyType
{
BAS_NORMAL, /* Normal random access */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 2f02cc8f42..b080367073 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1106,7 +1106,9 @@ ID
INFIX
INT128
INTERFACE_INFO
+IOContext
IOFuncSelector
+IOOp
IPCompareMethod
ITEM
IV
@@ -2026,6 +2028,8 @@ PgStat_FetchConsistency
PgStat_FunctionCallUsage
PgStat_FunctionCounts
PgStat_HashKey
+PgStat_IOContextOps
+PgStat_IOOpCounters
PgStat_Kind
PgStat_KindInfo
PgStat_LocalState
--
2.34.1

v35-0004-Add-system-view-tracking-IO-ops-per-backend-type.patch
From 4562099222f41866099835ccfca40f22abfd9e8f Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Thu, 6 Oct 2022 12:24:42 -0400
Subject: [PATCH v35 4/5] Add system view tracking IO ops per backend type
Add pg_stat_io, a system view which tracks the number of IOOps
(evictions, reuses, rejections, repossessions, reads, writes, extends,
and fsyncs) done through each IOContext (shared buffers, local buffers,
and buffers reserved by a BufferAccessStrategy) by each type of backend
(e.g. client backend, checkpointer).
Some BackendTypes do not accumulate IO operations statistics and will
not be included in the view.
Some IOContexts are not used by some BackendTypes and will not be in the
view. For example, checkpointer does not use a BufferAccessStrategy
(currently), so there will be no rows for BufferAccessStrategy
IOContexts for checkpointer.
Some IOOps are invalid in combination with certain IOContexts. Those
cells will be NULL in the view to distinguish between 0 observed IOOps
of that type and an invalid combination. For example, local buffers are
not fsynced so cells for all BackendTypes for IOCONTEXT_LOCAL and
IOOP_FSYNC will be NULL.
Some BackendTypes never perform certain IOOps. Those cells will also be
NULL in the view. For example, bgwriter should not perform reads.
View stats are populated with statistics incremented when a backend
performs an IO Operation and maintained by the cumulative statistics
subsystem.
Each row of the view shows stats for a particular BackendType and
IOContext combination (e.g. shared buffer accesses by checkpointer) and
each column in the view is the total number of IO Operations done (e.g.
writes).
So a cell in the view would be, for example, the number of shared
buffers written by checkpointer since the last stats reset.
In anticipation of tracking WAL IO and non-block-oriented IO (such as
temporary file IO), the "unit" column specifies the unit of the "read",
"written", and "extended" columns for a given row.
Note that some of the cells in the view are redundant with fields in
pg_stat_bgwriter (e.g. buffers_backend), however these have been kept in
pg_stat_bgwriter for backwards compatibility. Deriving the redundant
pg_stat_bgwriter stats from the IO operations stats structures was also
problematic due to the separate reset targets for 'bgwriter' and 'io'.
Suggested by Andres Freund
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Maciek Sakrejda <m.sakrejda@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
doc/src/sgml/monitoring.sgml | 384 +++++++++++++++++++++++++--
src/backend/catalog/system_views.sql | 16 ++
src/backend/utils/adt/pgstatfuncs.c | 139 ++++++++++
src/include/catalog/pg_proc.dat | 9 +
src/test/regress/expected/rules.out | 13 +
src/test/regress/expected/stats.out | 224 ++++++++++++++++
src/test/regress/sql/stats.sql | 123 +++++++++
7 files changed, 892 insertions(+), 16 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 698f274341..de0850337b 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -448,6 +448,15 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</entry>
</row>
+ <row>
+ <entry><structname>pg_stat_io</structname><indexterm><primary>pg_stat_io</primary></indexterm></entry>
+ <entry>A row for each IO Context for each backend type showing
+ statistics about backend IO operations. See
+ <link linkend="monitoring-pg-stat-io-view">
+ <structname>pg_stat_io</structname></link> for details.
+ </entry>
+ </row>
+
<row>
<entry><structname>pg_stat_wal</structname><indexterm><primary>pg_stat_wal</primary></indexterm></entry>
<entry>One row only, showing statistics about WAL activity. See
@@ -658,20 +667,20 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</para>
<para>
- The <structname>pg_statio_</structname> views are primarily useful to
- determine the effectiveness of the buffer cache. When the number
- of actual disk reads is much smaller than the number of buffer
- hits, then the cache is satisfying most read requests without
- invoking a kernel call. However, these statistics do not give the
- entire story: due to the way in which <productname>PostgreSQL</productname>
- handles disk I/O, data that is not in the
- <productname>PostgreSQL</productname> buffer cache might still reside in the
- kernel's I/O cache, and might therefore still be fetched without
- requiring a physical read. Users interested in obtaining more
- detailed information on <productname>PostgreSQL</productname> I/O behavior are
- advised to use the <productname>PostgreSQL</productname> statistics views
- in combination with operating system utilities that allow insight
- into the kernel's handling of I/O.
+ The <structname>pg_statio_</structname> and
+ <structname>pg_stat_io</structname> views are primarily useful to determine
+ the effectiveness of the buffer cache. When the number of actual disk reads
+ is much smaller than the number of buffer hits, then the cache is satisfying
+ most read requests without invoking a kernel call. However, these statistics
+ do not give the entire story: due to the way in which
+ <productname>PostgreSQL</productname> handles disk I/O, data that is not in
+ the <productname>PostgreSQL</productname> buffer cache might still reside in
+ the kernel's I/O cache, and might therefore still be fetched without
+ requiring a physical read. Users interested in obtaining more detailed
+ information on <productname>PostgreSQL</productname> I/O behavior are
+ advised to use the <productname>PostgreSQL</productname> statistics views in
+ combination with operating system utilities that allow insight into the
+ kernel's handling of I/O.
</para>
</sect2>
@@ -3600,13 +3609,12 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
<structfield>stats_reset</structfield> <type>timestamp with time zone</type>
</para>
<para>
- Time at which these statistics were last reset
+ Time at which these statistics were last reset.
</para></entry>
</row>
</tbody>
</tgroup>
</table>
-
<para>
Normally, WAL files are archived in order, oldest to newest, but that is
not guaranteed, and does not hold under special circumstances like when
@@ -3615,7 +3623,351 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
<structfield>last_archived_wal</structfield> have also been successfully
archived.
</para>
+ </sect2>
+
+ <sect2 id="monitoring-pg-stat-io-view">
+ <title><structname>pg_stat_io</structname></title>
+
+ <indexterm>
+ <primary>pg_stat_io</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_stat_io</structname> view has a row for each backend type
+ and IO context containing global data for the cluster on IO operations done
+ by that backend type in that IO context. Currently only a subset of IO
+ operations are tracked here. WAL IO, IO on temporary files, and some forms
+ of IO outside of shared buffers (such as when building indexes or moving a
+ table from one tablespace to another) could be added in the future.
+ </para>
+
+ <table id="pg-stat-io-view" xreflabel="pg_stat_io">
+ <title><structname>pg_stat_io</structname> View</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>backend_type</structfield> <type>text</type>
+ </para>
+ <para>
+ Type of backend (e.g. background worker, autovacuum worker).
+ See <link linkend="monitoring-pg-stat-activity-view">
+ <structname>pg_stat_activity</structname></link> for more information on
+ <varname>backend_type</varname>s.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>io_context</structfield> <type>text</type>
+ </para>
+ <para>
+ The context or location of an IO operation.
+ <varname>io_context</varname> <literal>shared</literal> refers to IO
+ operations of data in shared buffers, the primary buffer pool for
+ relation data. <varname>io_context</varname> <literal>local</literal>
+ refers to IO operations on process-local memory used for temporary
+ tables. <varname>io_context</varname> <literal>vacuum</literal> refers
+ to the IO operations incurred while vacuuming and analyzing.
+ <varname>io_context</varname> <literal>bulkread</literal> refers to IO
+ operations specially designated as <literal>bulk reads</literal>, such
+ as the sequential scan of a large table. <varname>io_context</varname>
+ <literal>bulkwrite</literal> refers to IO operations specially
+ designated as <literal>bulk writes</literal>, such as
+ <command>COPY</command>.
+ </para>
+
+ <para>
+ These last three <varname>io_context</varname>s are counted separately
+ because the autovacuum daemon, explicit <command>VACUUM</command>,
+ explicit <command>ANALYZE</command>, many bulk reads, and many bulk
+ writes use a fixed amount of memory, acquiring the equivalent number of
+ shared buffers and reusing them circularly to avoid occupying an undue
+ portion of the main shared buffer pool. This pattern is called a
+ <quote>Buffer Access Strategy</quote> in the
+ <productname>PostgreSQL</productname> source code and the fixed-size
+ ring buffer is referred to as a <quote>strategy ring buffer</quote> for
+ the purposes of this view's documentation.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>read</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Reads by this <varname>backend_type</varname> into buffers in this
+ <varname>io_context</varname>.
+ </para>
+ <para>
+ <varname>read</varname> and <varname>extended</varname> for
+ <varname>backend_type</varname>s <literal>autovacuum launcher</literal>,
+ <literal>autovacuum worker</literal>, <literal>client backend</literal>,
+ <literal>standalone backend</literal>, <literal>background
+ worker</literal>, and <literal>walsender</literal> for all
+ <varname>io_context</varname>s is similar to the sum of
+ <varname>heap_blks_read</varname>, <varname>idx_blks_read</varname>,
+ <varname>tidx_blks_read</varname>, and
+ <varname>toast_blks_read</varname> in <link
+ linkend="monitoring-pg-statio-all-tables-view">
+ <structname>pg_statio_all_tables</structname></link> and
+ <varname>blks_read</varname> from <link
+ linkend="monitoring-pg-stat-database-view">
+ <structname>pg_stat_database</structname></link>. The difference is that
+ reads done as part of <command>CREATE DATABASE</command> are not counted
+ in <structname>pg_statio_all_tables</structname> and
+ <structname>pg_stat_database</structname>.
+ </para>
+ <para>If using the <productname>PostgreSQL</productname> extension,
+ <xref linkend="pgstatstatements"/>,
+ <varname>read</varname> for
+ <varname>backend_type</varname>s <literal>autovacuum launcher</literal>,
+ <literal>autovacuum worker</literal>, <literal>client backend</literal>,
+ <literal>standalone backend</literal>, <literal>background
+ worker</literal>, and <literal>walsender</literal> for all
+ <varname>io_context</varname>s is equivalent to
+ <varname>shared_blks_read</varname> together with
+ <varname>local_blks_read</varname>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>written</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Writes of data in this <varname>io_context</varname> written out by this
+ <varname>backend_type</varname>.
+ </para>
+
+ <para>
+ Normal client backends should be able to rely on maintenance processes
+ like the checkpointer and background writer to write out dirty data as
+ much as possible. Large numbers of writes by
+ <varname>backend_type</varname> <literal>client backend</literal> in
+ <varname>io_context</varname> <literal>shared</literal> could indicate a
+ misconfiguration of shared buffers or of checkpointer. More information
+ on checkpointer configuration can be found in <xref
+ linkend="wal-configuration"/>.
+ </para>
+
+ <para>Note that the values of <varname>written</varname> for
+ <varname>backend_type</varname> <literal>background writer</literal> and
+ <varname>backend_type</varname> <literal>checkpointer</literal> are
+ equivalent to the values of <varname>buffers_clean</varname> and
+ <varname>buffers_checkpoint</varname>, respectively, in <link
+ linkend="monitoring-pg-stat-bgwriter-view">
+ <structname>pg_stat_bgwriter</structname></link>. Also, the sum of
+ <varname>written</varname> and <varname>extended</varname> in this view
+ for <varname>backend_type</varname>s <literal>client backend</literal>,
+ <literal>autovacuum worker</literal>, <literal>background
+ worker</literal>, and <literal>walsender</literal> in
+ <varname>io_context</varname>s <literal>shared</literal>,
+ <literal>bulkread</literal>, <literal>bulkwrite</literal>, and
+ <literal>vacuum</literal> is equivalent to
+ <varname>buffers_backend</varname> in
+ <structname>pg_stat_bgwriter</structname>.
+ </para>
+
+ <para>If using the <productname>PostgreSQL</productname> extension,
+ <xref linkend="pgstatstatements"/>, <varname>written</varname> and
+ <varname>extended</varname> for <varname>backend_type</varname>s
+ <literal>autovacuum launcher</literal>, <literal>autovacuum
+ worker</literal>, <literal>client backend</literal>, <literal>standalone
+ backend</literal>, <literal>background worker</literal>, and
+ <literal>walsender</literal> for all <varname>io_context</varname>s is
+ equivalent to <varname>shared_blks_written</varname> together with
+ <varname>local_blks_written</varname>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>extended</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Extends of relations done by this <varname>backend_type</varname> in
+ order to write data in this <varname>io_context</varname>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>bytes_conversion</structfield> <type>bigint</type>
+ </para>
+ <para>
+ The number of bytes per unit of IO read, written, or extended. For
+ block-oriented IO of relation data, reads, writes, and extends are done
+ in <varname>block_size</varname> units, derived from the build-time
+ parameter <symbol>BLCKSZ</symbol>, which is <literal>8192</literal> by
+ default. Future values could include those derived from
+ <symbol>XLOG_BLCKSZ</symbol>, once WAL IO is tracked in this view, and
+ constant multipliers once non-block-oriented IO (e.g. temporary file IO)
+ is tracked here.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>evicted</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of times a <varname>backend_type</varname> has evicted a block
+ from a shared or local buffer in order to reuse the buffer in this
+ <varname>io_context</varname>. Blocks are only evicted when there are no
+ unoccupied buffers.
+ </para>
+
+ <para>
+ <varname>evicted</varname> in <varname>io_context</varname>
+ <literal>shared</literal> counts the number of times a block from a
+ shared buffer was evicted so that it can be replaced with another block,
+ also in shared buffers.
+
+ A high <varname>evicted</varname> count in <varname>io_context</varname>
+ <literal>shared</literal> could indicate that shared buffers is too
+ small and should be set to a larger value.
+ </para>
+
+ <para>
+ <varname>evicted</varname> in <varname>io_context</varname>
+ <literal>vacuum</literal>, <literal>bulkread</literal>, and
+ <literal>bulkwrite</literal> counts the number of times occupied shared
+ buffers were added to the fixed-size strategy ring buffer, causing the
+ buffer contents to be evicted. If the current buffer in the ring is
+ pinned or in use by another backend, it may be replaced by a new shared
+ buffer. If this shared buffer contains valid data, that block must be
+ evicted and will count as <varname>evicted</varname>.
+
+ In <varname>io_context</varname> <literal>bulkread</literal>, existing
+ dirty buffers in the ring requiring flush are
+ <varname>rejected</varname>. If all of the buffers in the strategy ring
+ have been <varname>rejected</varname>, a new shared buffer will be added
+ to the ring. If the new shared buffer is occupied, its contents will
+ need to be evicted.
+
+ Seeing a large number of <varname>evicted</varname> in strategy
+ <varname>io_context</varname>s can provide insight into primary working
+ set cache misses.
+ </para>
+
+ <para>
+ <varname>evicted</varname> in <varname>io_context</varname>
+ <literal>local</literal> counts the number of times a block of data from
+ an existing local buffer was evicted in order to replace it with another
+ block, also in local buffers.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>reused</structfield> <type>bigint</type>
+ </para>
+ <para>
+ The number of times an existing buffer in the strategy ring was reused
+ as part of an operation in the <literal>bulkread</literal>,
+ <literal>bulkwrite</literal>, or <literal>vacuum</literal>
+ <varname>io_context</varname>s. When a <quote>Buffer Access
+ Strategy</quote> reuses a buffer in the strategy ring, it must evict its
+ contents, incrementing <varname>reused</varname>. When a <quote>Buffer
+ Access Strategy</quote> adds a new shared buffer to the strategy ring
+ and this shared buffer is occupied, the <quote>Buffer Access
+ Strategy</quote> must evict the contents of the shared buffer,
+ incrementing <varname>evicted</varname>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>rejected</structfield> <type>bigint</type>
+ </para>
+ <para>
+ The number of times a <literal>bulkread</literal> found the current
+ buffer in the fixed-size strategy ring dirty and requiring flush.
+ <quote>Rejecting</quote> the buffer effectively removes it from the
+ strategy ring buffer allowing the slot in the ring to be replaced in the
+ future with a new shared buffer. A high number of
+ <literal>bulkread</literal> rejections can indicate a need for more
+ frequent vacuuming or more aggressive autovacuum settings, as buffers are
+ dirtied during a bulkread operation when setting hint bits or when
+ performing on-access pruning.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>repossessed</structfield> <type>bigint</type>
+ </para>
+ <para>
+ The number of times a buffer in the fixed-size ring buffer used by
+ operations in the <literal>bulkread</literal>,
+ <literal>bulkwrite</literal>, and <literal>vacuum</literal>
+ <varname>io_context</varname>s was removed from that ring buffer because
+ it was pinned or in use by another backend, and thus the block it
+ contained could not be evicted for reuse. Once removed from the
+ strategy ring, this buffer is a <quote>normal</quote> shared buffer
+ again. A high number of repossessions is a sign of contention for the
+ blocks operated on by the strategy operation.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>files_synced</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of files fsynced by this <varname>backend_type</varname> for the
+ purpose of persisting data dirtied in this
+ <varname>io_context</varname>. <literal>fsyncs</literal> are done at
+ segment boundaries so <varname>bytes_conversion</varname> does not apply to the
+ <varname>files_synced</varname> column. <literal>fsyncs</literal> done
+ by backends in order to persist data written in
+ <varname>io_context</varname> <literal>vacuum</literal>,
+ <varname>io_context</varname> <literal>bulkread</literal>, or
+ <varname>io_context</varname> <literal>bulkwrite</literal> are counted
+ as an <varname>io_context</varname> <literal>shared</literal>
+ <literal>fsync</literal>.
+ </para>
+ <para>
+ Normal client backends should be able to rely on the checkpointer to
+ ensure data is persisted to permanent storage. Large numbers of
+ <varname>files_synced</varname> by <varname>backend_type</varname>
+ <literal>client backend</literal> could indicate a misconfiguration of
+ shared buffers or of checkpointer. More information on checkpointer
+ configuration can be found in <xref linkend="wal-configuration"/>.
+ </para>
+
+ <para>
+ Note that the sum of <varname>files_synced</varname> for all
+ <varname>io_context</varname> <literal>shared</literal> for all
+ <varname>backend_type</varname>s except <literal>checkpointer</literal>
+ is equivalent to <varname>buffers_backend_fsync</varname> in
+ <link linkend="monitoring-pg-stat-bgwriter-view"> <structname>pg_stat_bgwriter</structname></link>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
+ </para>
+ <para>
+ Time at which these statistics were last reset.
+ </para></entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
</sect2>
<sect2 id="monitoring-pg-stat-bgwriter-view">
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 2d8104b090..571c422f73 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1117,6 +1117,22 @@ CREATE VIEW pg_stat_bgwriter AS
pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+CREATE VIEW pg_stat_io AS
+SELECT
+ b.backend_type,
+ b.io_context,
+ b.read,
+ b.written,
+ b.extended,
+ b.bytes_conversion,
+ b.evicted,
+ b.reused,
+ b.rejected,
+ b.repossessed,
+ b.files_synced,
+ b.stats_reset
+FROM pg_stat_get_io() b;
+
CREATE VIEW pg_stat_wal AS
SELECT
w.wal_records,
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index b783af130c..5bd39733b6 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -29,6 +29,7 @@
#include "storage/procarray.h"
#include "utils/acl.h"
#include "utils/builtins.h"
+#include "utils/guc.h"
#include "utils/inet.h"
#include "utils/timestamp.h"
@@ -1725,6 +1726,144 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
}
+/*
+ * When adding a new column to the pg_stat_io view, add a new enum value
+ * here above IO_NUM_COLUMNS.
+ */
+typedef enum io_stat_col
+{
+ IO_COL_BACKEND_TYPE,
+ IO_COL_IO_CONTEXT,
+ IO_COL_READS,
+ IO_COL_WRITES,
+ IO_COL_EXTENDS,
+ IO_COL_CONVERSION,
+ IO_COL_EVICTIONS,
+ IO_COL_REUSES,
+ IO_COL_REJECTIONS,
+ IO_COL_REPOSSESSIONS,
+ IO_COL_FSYNCS,
+ IO_COL_RESET_TIME,
+ IO_NUM_COLUMNS,
+} io_stat_col;
+
+/*
+ * When adding a new IOOp, add a new io_stat_col and add a case to this
+ * function returning the corresponding io_stat_col.
+ */
+static io_stat_col
+pgstat_io_op_get_index(IOOp io_op)
+{
+ switch (io_op)
+ {
+ case IOOP_EVICT:
+ return IO_COL_EVICTIONS;
+ case IOOP_READ:
+ return IO_COL_READS;
+ case IOOP_REUSE:
+ return IO_COL_REUSES;
+ case IOOP_REJECT:
+ return IO_COL_REJECTIONS;
+ case IOOP_REPOSSESS:
+ return IO_COL_REPOSSESSIONS;
+ case IOOP_WRITE:
+ return IO_COL_WRITES;
+ case IOOP_EXTEND:
+ return IO_COL_EXTENDS;
+ case IOOP_FSYNC:
+ return IO_COL_FSYNCS;
+ }
+
+ elog(ERROR, "unrecognized IOOp value: %d", io_op);
+}
+
+Datum
+pg_stat_get_io(PG_FUNCTION_ARGS)
+{
+ PgStat_BackendIOContextOps *backends_io_stats;
+ ReturnSetInfo *rsinfo;
+ Datum reset_time;
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+
+ backends_io_stats = pgstat_fetch_backend_io_context_ops();
+
+ reset_time = TimestampTzGetDatum(backends_io_stats->stat_reset_timestamp);
+
+ for (int bktype = 0; bktype < BACKEND_NUM_TYPES; bktype++)
+ {
+ Datum bktype_desc = CStringGetTextDatum(GetBackendTypeDesc((BackendType) bktype));
+ bool expect_backend_stats = true;
+ PgStat_IOContextOps *io_context_ops = &backends_io_stats->stats[bktype];
+
+ /*
+ * For those BackendTypes without IO Operation stats, skip
+ * representing them in the view altogether.
+ */
+ expect_backend_stats = pgstat_io_op_stats_collected((BackendType)
+ bktype);
+
+ for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ {
+ PgStat_IOOpCounters *counters = &io_context_ops->data[io_context];
+ const char *io_context_str = pgstat_io_context_desc(io_context);
+
+ Datum values[IO_NUM_COLUMNS] = {0};
+ bool nulls[IO_NUM_COLUMNS] = {0};
+
+ /*
+ * Some combinations of IOContext and BackendType are not valid
+ * for any type of IOOp. In such cases, omit the entire row from
+ * the view.
+ */
+ if (!expect_backend_stats ||
+ !pgstat_bktype_io_context_valid((BackendType) bktype,
+ (IOContext) io_context))
+ {
+ pgstat_io_context_ops_assert_zero(counters);
+ continue;
+ }
+
+ values[IO_COL_BACKEND_TYPE] = bktype_desc;
+ values[IO_COL_IO_CONTEXT] = CStringGetTextDatum(io_context_str);
+ values[IO_COL_READS] = Int64GetDatum(counters->reads);
+ values[IO_COL_WRITES] = Int64GetDatum(counters->writes);
+ values[IO_COL_EXTENDS] = Int64GetDatum(counters->extends);
+ /*
+ * Hard-code this to blocks until we have non-block-oriented IO
+ * represented in the view as well
+ */
+ values[IO_COL_CONVERSION] = Int64GetDatum(BLCKSZ);
+ values[IO_COL_EVICTIONS] = Int64GetDatum(counters->evictions);
+ values[IO_COL_REUSES] = Int64GetDatum(counters->reuses);
+ values[IO_COL_REJECTIONS] = Int64GetDatum(counters->rejections);
+ values[IO_COL_REPOSSESSIONS] = Int64GetDatum(counters->repossessions);
+ values[IO_COL_FSYNCS] = Int64GetDatum(counters->fsyncs);
+ values[IO_COL_RESET_TIME] = TimestampTzGetDatum(reset_time);
+
+ /*
+ * Some combinations of BackendType and IOOp and of IOContext and
+ * IOOp are not valid. Set these cells in the view NULL and assert
+ * that these stats are zero as expected.
+ */
+ for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
+ {
+ if (!(pgstat_io_op_valid((BackendType) bktype, (IOContext)
+ io_context, (IOOp) io_op)))
+ {
+ pgstat_io_op_assert_zero(counters, (IOOp) io_op);
+ nulls[pgstat_io_op_get_index((IOOp) io_op)] = true;
+ }
+ }
+
+ tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
+ }
+ }
+
+ return (Datum) 0;
+}
+
/*
* Returns statistics of WAL activity
*/
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 62a5b8e655..fdb9c2f4a1 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5653,6 +5653,15 @@
proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
+{ oid => '8459', descr => 'statistics: per backend type IO statistics',
+ proname => 'pg_stat_get_io', provolatile => 'v',
+ prorows => '30', proretset => 't',
+ proparallel => 'r', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,text,int8,int8,int8,int8,int8,int8,int8,int8,int8,timestamptz}',
+ proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o}',
+ proargnames => '{backend_type,io_context,read,written,extended,bytes_conversion,evicted,reused,rejected,repossessed,files_synced,stats_reset}',
+ prosrc => 'pg_stat_get_io' },
+
{ oid => '1136', descr => 'statistics: information about WAL activity',
proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
proparallel => 'r', prorettype => 'record', proargtypes => '',
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index bfcd8ac9a0..2ca46656d0 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1871,6 +1871,19 @@ pg_stat_gssapi| SELECT s.pid,
s.gss_enc AS encrypted
FROM pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid, query_id)
WHERE (s.client_port IS NOT NULL);
+pg_stat_io| SELECT b.backend_type,
+ b.io_context,
+ b.read,
+ b.written,
+ b.extended,
+ b.bytes_conversion,
+ b.evicted,
+ b.reused,
+ b.rejected,
+ b.repossessed,
+ b.files_synced,
+ b.stats_reset
+ FROM pg_stat_get_io() b(backend_type, io_context, read, written, extended, bytes_conversion, evicted, reused, rejected, repossessed, files_synced, stats_reset);
pg_stat_progress_analyze| SELECT s.pid,
s.datid,
d.datname,
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index 257a6a9da9..28ef9171de 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -1120,4 +1120,228 @@ SELECT pg_stat_get_subscription_stats(NULL);
(1 row)
+-- Test that the following operations are tracked in pg_stat_io:
+-- - reads of target blocks into shared buffers
+-- - writes of shared buffers to permanent storage
+-- - extends of relations using shared buffers
+-- - fsyncs done to ensure the durability of data dirtying shared buffers
+-- There is no test for blocks evicted from shared buffers, because we cannot
+-- be sure of the state of shared buffers at the point the test is run.
+SELECT sum(read) AS io_sum_shared_reads_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(written) AS io_sum_shared_writes_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(extended) AS io_sum_shared_extends_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(files_synced) AS io_sum_shared_fsyncs_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+-- Create a regular table and insert some data to generate IOCONTEXT_SHARED
+-- extends.
+CREATE TABLE test_io_shared(a int);
+INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+-- After a checkpoint, there should be some additional IOCONTEXT_SHARED writes
+-- and fsyncs.
+-- The second checkpoint ensures that stats from the first checkpoint have been
+-- reported and protects against any potential races amongst the table
+-- creation, a possible timing-triggered checkpoint, and the explicit
+-- checkpoint in the test.
+CHECKPOINT;
+CHECKPOINT;
+SELECT sum(written) AS io_sum_shared_writes_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(extended) AS io_sum_shared_extends_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(files_synced) AS io_sum_shared_fsyncs_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT :io_sum_shared_writes_after > :io_sum_shared_writes_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_shared_extends_after > :io_sum_shared_extends_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT current_setting('fsync') = 'off' OR :io_sum_shared_fsyncs_after > :io_sum_shared_fsyncs_before;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Change the tablespace so that the table is rewritten directly, then SELECT
+-- from it to cause it to be read back into shared buffers.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE test_io_shared_stats_tblspc LOCATION '';
+ALTER TABLE test_io_shared SET TABLESPACE test_io_shared_stats_tblspc;
+SELECT COUNT(*) FROM test_io_shared;
+ count
+-------
+ 100
+(1 row)
+
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(read) AS io_sum_shared_reads_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT :io_sum_shared_reads_after > :io_sum_shared_reads_before;
+ ?column?
+----------
+ t
+(1 row)
+
+DROP TABLE test_io_shared;
+DROP TABLESPACE test_io_shared_stats_tblspc;
+-- Test that the following IOCONTEXT_LOCAL IOOps are tracked in pg_stat_io:
+-- - eviction of local buffers in order to reuse them
+-- - reads of temporary table blocks into local buffers
+-- - writes of local buffers to permanent storage
+-- - extends of temporary tables
+-- Set temp_buffers to a low value so that we can trigger writes with fewer
+-- inserted tuples. Do so in a new session in case temporary tables have been
+-- accessed by previous tests in this session.
+\c
+SET temp_buffers TO '1MB';
+CREATE TEMPORARY TABLE test_io_local(a int, b TEXT);
+SELECT sum(evicted) AS io_sum_local_evictions_before FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(read) AS io_sum_local_reads_before FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(written) AS io_sum_local_writes_before FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(extended) AS io_sum_local_extends_before FROM pg_stat_io WHERE io_context = 'local' \gset
+-- Insert enough values that we need to reuse and write out dirty local
+-- buffers.
+INSERT INTO test_io_local SELECT generate_series(1, 8000) as id, repeat('a', 100);
+-- Read in evicted buffers.
+SELECT COUNT(*) FROM test_io_local;
+ count
+-------
+ 8000
+(1 row)
+
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(evicted) AS io_sum_local_evictions_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(read) AS io_sum_local_reads_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(written) AS io_sum_local_writes_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(extended) AS io_sum_local_extends_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT :io_sum_local_evictions_after > :io_sum_local_evictions_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_local_reads_after > :io_sum_local_reads_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_local_writes_after > :io_sum_local_writes_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_local_extends_after > :io_sum_local_extends_before;
+ ?column?
+----------
+ t
+(1 row)
+
+RESET temp_buffers;
+-- Test that reuse of strategy buffers and reads of blocks into these reused
+-- buffers while VACUUMing are tracked in pg_stat_io.
+-- Set wal_skip_threshold smaller than the expected size of
+-- test_io_vac_strategy so that, even if wal_level is minimal, VACUUM FULL will
+-- fsync the newly rewritten test_io_vac_strategy instead of writing it to WAL.
+-- Writing it to WAL will result in the newly written relation pages being in
+-- shared buffers -- preventing us from testing BAS_VACUUM BufferAccessStrategy
+-- reads.
+SET wal_skip_threshold = '1 kB';
+SELECT sum(reused) AS io_sum_vac_strategy_reuses_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(read) AS io_sum_vac_strategy_reads_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+CREATE TABLE test_io_vac_strategy(a int, b int) WITH (autovacuum_enabled = 'false');
+INSERT INTO test_io_vac_strategy SELECT i, i from generate_series(1, 8000)i;
+-- Ensure that the next VACUUM will need to perform IO by rewriting the table
+-- first with VACUUM (FULL).
+VACUUM (FULL) test_io_vac_strategy;
+VACUUM (PARALLEL 0) test_io_vac_strategy;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(reused) AS io_sum_vac_strategy_reuses_after FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(read) AS io_sum_vac_strategy_reads_after FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT :io_sum_vac_strategy_reads_after > :io_sum_vac_strategy_reads_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_vac_strategy_reuses_after > :io_sum_vac_strategy_reuses_before;
+ ?column?
+----------
+ t
+(1 row)
+
+RESET wal_skip_threshold;
+-- Test that extends done by a CTAS, which uses a BAS_BULKWRITE
+-- BufferAccessStrategy, are tracked in pg_stat_io.
+SELECT sum(extended) AS io_sum_bulkwrite_strategy_extends_before FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+CREATE TABLE test_io_bulkwrite_strategy AS SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(extended) AS io_sum_bulkwrite_strategy_extends_after FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+SELECT :io_sum_bulkwrite_strategy_extends_after > :io_sum_bulkwrite_strategy_extends_before;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Test that reads of blocks into reused strategy buffers during database
+-- creation, which uses a BAS_BULKREAD BufferAccessStrategy, are tracked in
+-- pg_stat_io.
+SELECT sum(read) AS io_sum_bulkread_strategy_reads_before FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+CREATE DATABASE test_io_bulkread_strategy_db;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(read) AS io_sum_bulkread_strategy_reads_after FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+SELECT :io_sum_bulkread_strategy_reads_after > :io_sum_bulkread_strategy_reads_before;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Test IO stats reset
+SELECT sum(evicted) + sum(reused) + sum(extended) + sum(files_synced) + sum(read) + sum(written) AS io_stats_pre_reset FROM pg_stat_io \gset
+SELECT pg_stat_reset_shared('io');
+ pg_stat_reset_shared
+----------------------
+
+(1 row)
+
+SELECT sum(evicted) + sum(reused) + sum(extended) + sum(files_synced) + sum(read) + sum(written) AS io_stats_post_reset FROM pg_stat_io \gset
+SELECT :io_stats_post_reset < :io_stats_pre_reset;
+ ?column?
+----------
+ t
+(1 row)
+
-- End of Stats Test
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index f6270f7bad..75c2f6c4c0 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -535,4 +535,127 @@ SELECT pg_stat_get_replication_slot(NULL);
SELECT pg_stat_get_subscription_stats(NULL);
+-- Test that the following operations are tracked in pg_stat_io:
+-- - reads of target blocks into shared buffers
+-- - writes of shared buffers to permanent storage
+-- - extends of relations using shared buffers
+-- - fsyncs done to ensure the durability of data dirtying shared buffers
+
+-- There is no test for blocks evicted from shared buffers, because we cannot
+-- be sure of the state of shared buffers at the point the test is run.
+SELECT sum(read) AS io_sum_shared_reads_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(written) AS io_sum_shared_writes_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(extended) AS io_sum_shared_extends_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(files_synced) AS io_sum_shared_fsyncs_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+-- Create a regular table and insert some data to generate IOCONTEXT_SHARED
+-- extends.
+CREATE TABLE test_io_shared(a int);
+INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+-- After a checkpoint, there should be some additional IOCONTEXT_SHARED writes
+-- and fsyncs.
+-- The second checkpoint ensures that stats from the first checkpoint have been
+-- reported and protects against any potential races amongst the table
+-- creation, a possible timing-triggered checkpoint, and the explicit
+-- checkpoint in the test.
+CHECKPOINT;
+CHECKPOINT;
+SELECT sum(written) AS io_sum_shared_writes_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(extended) AS io_sum_shared_extends_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(files_synced) AS io_sum_shared_fsyncs_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT :io_sum_shared_writes_after > :io_sum_shared_writes_before;
+SELECT :io_sum_shared_extends_after > :io_sum_shared_extends_before;
+SELECT current_setting('fsync') = 'off' OR :io_sum_shared_fsyncs_after > :io_sum_shared_fsyncs_before;
+-- Change the tablespace so that the table is rewritten directly, then SELECT
+-- from it to cause it to be read back into shared buffers.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE test_io_shared_stats_tblspc LOCATION '';
+ALTER TABLE test_io_shared SET TABLESPACE test_io_shared_stats_tblspc;
+SELECT COUNT(*) FROM test_io_shared;
+SELECT pg_stat_force_next_flush();
+SELECT sum(read) AS io_sum_shared_reads_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT :io_sum_shared_reads_after > :io_sum_shared_reads_before;
+DROP TABLE test_io_shared;
+DROP TABLESPACE test_io_shared_stats_tblspc;
+
+-- Test that the following IOCONTEXT_LOCAL IOOps are tracked in pg_stat_io:
+-- - eviction of local buffers in order to reuse them
+-- - reads of temporary table blocks into local buffers
+-- - writes of local buffers to permanent storage
+-- - extends of temporary tables
+
+-- Set temp_buffers to a low value so that we can trigger writes with fewer
+-- inserted tuples. Do so in a new session in case temporary tables have been
+-- accessed by previous tests in this session.
+\c
+SET temp_buffers TO '1MB';
+CREATE TEMPORARY TABLE test_io_local(a int, b TEXT);
+SELECT sum(evicted) AS io_sum_local_evictions_before FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(read) AS io_sum_local_reads_before FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(written) AS io_sum_local_writes_before FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(extended) AS io_sum_local_extends_before FROM pg_stat_io WHERE io_context = 'local' \gset
+-- Insert enough values that we need to reuse and write out dirty local
+-- buffers.
+INSERT INTO test_io_local SELECT generate_series(1, 8000) as id, repeat('a', 100);
+-- Read in evicted buffers.
+SELECT COUNT(*) FROM test_io_local;
+SELECT pg_stat_force_next_flush();
+SELECT sum(evicted) AS io_sum_local_evictions_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(read) AS io_sum_local_reads_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(written) AS io_sum_local_writes_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(extended) AS io_sum_local_extends_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT :io_sum_local_evictions_after > :io_sum_local_evictions_before;
+SELECT :io_sum_local_reads_after > :io_sum_local_reads_before;
+SELECT :io_sum_local_writes_after > :io_sum_local_writes_before;
+SELECT :io_sum_local_extends_after > :io_sum_local_extends_before;
+RESET temp_buffers;
+
+-- Test that reuse of strategy buffers and reads of blocks into these reused
+-- buffers while VACUUMing are tracked in pg_stat_io.
+
+-- Set wal_skip_threshold smaller than the expected size of
+-- test_io_vac_strategy so that, even if wal_level is minimal, VACUUM FULL will
+-- fsync the newly rewritten test_io_vac_strategy instead of writing it to WAL.
+-- Writing it to WAL will result in the newly written relation pages being in
+-- shared buffers -- preventing us from testing BAS_VACUUM BufferAccessStrategy
+-- reads.
+SET wal_skip_threshold = '1 kB';
+SELECT sum(reused) AS io_sum_vac_strategy_reuses_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(read) AS io_sum_vac_strategy_reads_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+CREATE TABLE test_io_vac_strategy(a int, b int) WITH (autovacuum_enabled = 'false');
+INSERT INTO test_io_vac_strategy SELECT i, i from generate_series(1, 8000)i;
+-- Ensure that the next VACUUM will need to perform IO by rewriting the table
+-- first with VACUUM (FULL).
+VACUUM (FULL) test_io_vac_strategy;
+VACUUM (PARALLEL 0) test_io_vac_strategy;
+SELECT pg_stat_force_next_flush();
+SELECT sum(reused) AS io_sum_vac_strategy_reuses_after FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(read) AS io_sum_vac_strategy_reads_after FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT :io_sum_vac_strategy_reads_after > :io_sum_vac_strategy_reads_before;
+SELECT :io_sum_vac_strategy_reuses_after > :io_sum_vac_strategy_reuses_before;
+RESET wal_skip_threshold;
+
+-- Test that extends done by a CTAS, which uses a BAS_BULKWRITE
+-- BufferAccessStrategy, are tracked in pg_stat_io.
+SELECT sum(extended) AS io_sum_bulkwrite_strategy_extends_before FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+CREATE TABLE test_io_bulkwrite_strategy AS SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+SELECT sum(extended) AS io_sum_bulkwrite_strategy_extends_after FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+SELECT :io_sum_bulkwrite_strategy_extends_after > :io_sum_bulkwrite_strategy_extends_before;
+
+-- Test that reads of blocks into reused strategy buffers during database
+-- creation, which uses a BAS_BULKREAD BufferAccessStrategy, are tracked in
+-- pg_stat_io.
+SELECT sum(read) AS io_sum_bulkread_strategy_reads_before FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+CREATE DATABASE test_io_bulkread_strategy_db;
+SELECT pg_stat_force_next_flush();
+SELECT sum(read) AS io_sum_bulkread_strategy_reads_after FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+SELECT :io_sum_bulkread_strategy_reads_after > :io_sum_bulkread_strategy_reads_before;
+
+-- Test IO stats reset
+SELECT sum(evicted) + sum(reused) + sum(extended) + sum(files_synced) + sum(read) + sum(written) AS io_stats_pre_reset FROM pg_stat_io \gset
+SELECT pg_stat_reset_shared('io');
+SELECT sum(evicted) + sum(reused) + sum(extended) + sum(files_synced) + sum(read) + sum(written) AS io_stats_post_reset FROM pg_stat_io \gset
+SELECT :io_stats_post_reset < :io_stats_pre_reset;
+
-- End of Stats Test
--
2.34.1
v35-0003-Aggregate-IO-operation-stats-per-BackendType.patch
From d8479c25f234d05c864e9fd659c9b2fe2ce5ade0 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Thu, 6 Oct 2022 12:23:38 -0400
Subject: [PATCH v35 3/5] Aggregate IO operation stats per BackendType
Stats on IOOps for all IOContexts for a backend are tracked locally. Add
functionality for backends to flush these stats to shared memory and
accumulate them with those from all other backends, exited and live.
Also add reset and snapshot functions used by cumulative stats system
for management of these statistics.
The aggregated stats in shared memory could be extended in the future
with per-backend stats -- useful for per connection IO statistics and
monitoring.
Some BackendTypes do not flush their pending statistics at regular
intervals, so they explicitly call pgstat_flush_io_ops() during the
course of normal operations to flush their backend-local IO operation
statistics to shared memory in a timely manner.
Because not all BackendType, IOOp, IOContext combinations are valid, the
validity of the stats is checked before flushing pending stats and
before reading in the existing stats file to shared memory.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Maciek Sakrejda <m.sakrejda@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
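The aggregation scheme the commit message describes -- backend-local pending counters flushed into shared totals indexed by backend type -- can be sketched as standalone C. Array sizes and names here are simplified stand-ins for the patch's real structs, and locking is omitted:

```c
#include <string.h>

#define BACKEND_NUM_TYPES 3     /* illustration only; the real value is larger */
#define IOCONTEXT_NUM_TYPES 4
#define IOOP_NUM_TYPES 2

/* One counter per IOOp, kept per IOContext. */
typedef struct IOOpCounters
{
    long ops[IOOP_NUM_TYPES];
} IOOpCounters;

/* Shared totals: indexed by backend type, then IO context. */
static IOOpCounters shared_stats[BACKEND_NUM_TYPES][IOCONTEXT_NUM_TYPES];

/* Backend-local pending counters, flushed periodically. */
static IOOpCounters pending_stats[IOCONTEXT_NUM_TYPES];

/* Bump a local counter; cheap, no shared memory touched. */
static void
count_io_op(int io_context, int io_op)
{
    pending_stats[io_context].ops[io_op]++;
}

/* Accumulate local pending counters into the shared totals, then clear them. */
static void
flush_io_ops(int my_backend_type)
{
    for (int ctx = 0; ctx < IOCONTEXT_NUM_TYPES; ctx++)
        for (int op = 0; op < IOOP_NUM_TYPES; op++)
            shared_stats[my_backend_type][ctx].ops[op] +=
                pending_stats[ctx].ops[op];
    memset(pending_stats, 0, sizeof(pending_stats));
}
```

This also illustrates the idea from the thread's opening mail of "internally ... having just counter arrays indexed by backend types".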
---
doc/src/sgml/monitoring.sgml | 2 +
src/backend/utils/activity/pgstat.c | 35 ++++
src/backend/utils/activity/pgstat_bgwriter.c | 7 +-
.../utils/activity/pgstat_checkpointer.c | 7 +-
src/backend/utils/activity/pgstat_io_ops.c | 164 ++++++++++++++++++
src/backend/utils/activity/pgstat_relation.c | 15 +-
src/backend/utils/activity/pgstat_shmem.c | 4 +
src/backend/utils/activity/pgstat_wal.c | 4 +-
src/backend/utils/adt/pgstatfuncs.c | 4 +-
src/include/miscadmin.h | 2 +
src/include/pgstat.h | 88 ++++++++++
src/include/utils/pgstat_internal.h | 36 ++++
src/tools/pgindent/typedefs.list | 3 +
13 files changed, 365 insertions(+), 6 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index e5d622d514..698f274341 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -5390,6 +5390,8 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
the <structname>pg_stat_bgwriter</structname>
view, <literal>archiver</literal> to reset all the counters shown in
the <structname>pg_stat_archiver</structname> view,
+ <literal>io</literal> to reset all the counters shown in the
+ <structname>pg_stat_io</structname> view,
<literal>wal</literal> to reset all the counters shown in the
<structname>pg_stat_wal</structname> view or
<literal>recovery_prefetch</literal> to reset all the counters shown
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 1b97597f17..4becee9a6c 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -359,6 +359,15 @@ static const PgStat_KindInfo pgstat_kind_infos[PGSTAT_NUM_KINDS] = {
.snapshot_cb = pgstat_checkpointer_snapshot_cb,
},
+ [PGSTAT_KIND_IOOPS] = {
+ .name = "io_ops",
+
+ .fixed_amount = true,
+
+ .reset_all_cb = pgstat_io_ops_reset_all_cb,
+ .snapshot_cb = pgstat_io_ops_snapshot_cb,
+ },
+
[PGSTAT_KIND_SLRU] = {
.name = "slru",
@@ -582,6 +591,7 @@ pgstat_report_stat(bool force)
/* Don't expend a clock check if nothing to do */
if (dlist_is_empty(&pgStatPending) &&
+ !have_ioopstats &&
!have_slrustats &&
!pgstat_have_pending_wal())
{
@@ -628,6 +638,9 @@ pgstat_report_stat(bool force)
/* flush database / relation / function / ... stats */
partial_flush |= pgstat_flush_pending_entries(nowait);
+ /* flush IO Operations stats */
+ partial_flush |= pgstat_flush_io_ops(nowait);
+
/* flush wal stats */
partial_flush |= pgstat_flush_wal(nowait);
@@ -1321,6 +1334,14 @@ pgstat_write_statsfile(void)
pgstat_build_snapshot_fixed(PGSTAT_KIND_CHECKPOINTER);
write_chunk_s(fpout, &pgStatLocal.snapshot.checkpointer);
+ /*
+ * Write IO Operations stats struct
+ */
+ pgstat_build_snapshot_fixed(PGSTAT_KIND_IOOPS);
+ write_chunk_s(fpout, &pgStatLocal.snapshot.io_ops.stat_reset_timestamp);
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ write_chunk_s(fpout, &pgStatLocal.snapshot.io_ops.stats[i]);
+
/*
* Write SLRU stats struct
*/
@@ -1495,6 +1516,20 @@ pgstat_read_statsfile(void)
if (!read_chunk_s(fpin, &shmem->checkpointer.stats))
goto error;
+ /*
+ * Read IO Operations stats struct
+ */
+ if (!read_chunk_s(fpin, &shmem->io_ops.stat_reset_timestamp))
+ goto error;
+
+ for (int bktype = 0; bktype < BACKEND_NUM_TYPES; bktype++)
+ {
+ pgstat_backend_io_stats_assert_well_formed(shmem->io_ops.stats[bktype].data,
+ (BackendType) bktype);
+ if (!read_chunk_s(fpin, &shmem->io_ops.stats[bktype].data))
+ goto error;
+ }
+
/*
* Read SLRU stats struct
*/
diff --git a/src/backend/utils/activity/pgstat_bgwriter.c b/src/backend/utils/activity/pgstat_bgwriter.c
index fbb1edc527..3d7f90a1b7 100644
--- a/src/backend/utils/activity/pgstat_bgwriter.c
+++ b/src/backend/utils/activity/pgstat_bgwriter.c
@@ -24,7 +24,7 @@ PgStat_BgWriterStats PendingBgWriterStats = {0};
/*
- * Report bgwriter statistics
+ * Report bgwriter and IO Operation statistics
*/
void
pgstat_report_bgwriter(void)
@@ -56,6 +56,11 @@ pgstat_report_bgwriter(void)
* Clear out the statistics buffer, so it can be re-used.
*/
MemSet(&PendingBgWriterStats, 0, sizeof(PendingBgWriterStats));
+
+ /*
+ * Report IO Operations statistics
+ */
+ pgstat_flush_io_ops(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_checkpointer.c b/src/backend/utils/activity/pgstat_checkpointer.c
index af8d513e7b..cfcf127210 100644
--- a/src/backend/utils/activity/pgstat_checkpointer.c
+++ b/src/backend/utils/activity/pgstat_checkpointer.c
@@ -24,7 +24,7 @@ PgStat_CheckpointerStats PendingCheckpointerStats = {0};
/*
- * Report checkpointer statistics
+ * Report checkpointer and IO Operation statistics
*/
void
pgstat_report_checkpointer(void)
@@ -62,6 +62,11 @@ pgstat_report_checkpointer(void)
* Clear out the statistics buffer, so it can be re-used.
*/
MemSet(&PendingCheckpointerStats, 0, sizeof(PendingCheckpointerStats));
+
+ /*
+ * Report IO Operation statistics
+ */
+ pgstat_flush_io_ops(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_io_ops.c b/src/backend/utils/activity/pgstat_io_ops.c
index 6f9c250907..9f0f27da1f 100644
--- a/src/backend/utils/activity/pgstat_io_ops.c
+++ b/src/backend/utils/activity/pgstat_io_ops.c
@@ -20,6 +20,48 @@
#include "utils/pgstat_internal.h"
static PgStat_IOContextOps pending_IOOpStats;
+bool have_ioopstats = false;
+
+
+/*
+ * Helper function to accumulate source PgStat_IOOpCounters into target
+ * PgStat_IOOpCounters. If either of the passed-in PgStat_IOOpCounters is a
+ * member of PgStatShared_IOContextOps, the caller is responsible for ensuring
+ * that the appropriate lock is held.
+ */
+static void
+pgstat_accum_io_op(PgStat_IOOpCounters *target, PgStat_IOOpCounters *source, IOOp io_op)
+{
+ switch (io_op)
+ {
+ case IOOP_EVICT:
+ target->evictions += source->evictions;
+ return;
+ case IOOP_EXTEND:
+ target->extends += source->extends;
+ return;
+ case IOOP_FSYNC:
+ target->fsyncs += source->fsyncs;
+ return;
+ case IOOP_READ:
+ target->reads += source->reads;
+ return;
+ case IOOP_REPOSSESS:
+ target->repossessions += source->repossessions;
+ return;
+ case IOOP_REJECT:
+ target->rejections += source->rejections;
+ return;
+ case IOOP_REUSE:
+ target->reuses += source->reuses;
+ return;
+ case IOOP_WRITE:
+ target->writes += source->writes;
+ return;
+ }
+
+ elog(ERROR, "unrecognized IOOp value: %d", io_op);
+}
void
pgstat_count_io_op(IOOp io_op, IOContext io_context)
@@ -60,6 +102,78 @@ pgstat_count_io_op(IOOp io_op, IOContext io_context)
break;
}
+ have_ioopstats = true;
+}
+
+PgStat_BackendIOContextOps *
+pgstat_fetch_backend_io_context_ops(void)
+{
+ pgstat_snapshot_fixed(PGSTAT_KIND_IOOPS);
+
+ return &pgStatLocal.snapshot.io_ops;
+}
+
+/*
+ * Flush out locally pending IO Operation statistics entries
+ *
+ * If no stats have been recorded, this function returns false.
+ *
+ * If nowait is true and the lock could not be acquired, this function
+ * returns true without flushing. Otherwise, it returns false.
+ */
+bool
+pgstat_flush_io_ops(bool nowait)
+{
+ PgStatShared_IOContextOps *type_shstats;
+ bool expect_backend_stats = true;
+
+ if (!have_ioopstats)
+ return false;
+
+ type_shstats =
+ &pgStatLocal.shmem->io_ops.stats[MyBackendType];
+
+ if (!nowait)
+ LWLockAcquire(&type_shstats->lock, LW_EXCLUSIVE);
+ else if (!LWLockConditionalAcquire(&type_shstats->lock, LW_EXCLUSIVE))
+ return true;
+
+ expect_backend_stats = pgstat_io_op_stats_collected(MyBackendType);
+
+ for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ {
+ PgStat_IOOpCounters *sharedent = &type_shstats->data[io_context];
+ PgStat_IOOpCounters *pendingent = &pending_IOOpStats.data[io_context];
+
+ if (!expect_backend_stats ||
+ !pgstat_bktype_io_context_valid(MyBackendType, (IOContext) io_context))
+ {
+ pgstat_io_context_ops_assert_zero(sharedent);
+ pgstat_io_context_ops_assert_zero(pendingent);
+ continue;
+ }
+
+ for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
+ {
+ if (!(pgstat_io_op_valid(MyBackendType, (IOContext) io_context,
+ (IOOp) io_op)))
+ {
+ pgstat_io_op_assert_zero(sharedent, (IOOp) io_op);
+ pgstat_io_op_assert_zero(pendingent, (IOOp) io_op);
+ continue;
+ }
+
+ pgstat_accum_io_op(sharedent, pendingent, (IOOp) io_op);
+ }
+ }
+
+ LWLockRelease(&type_shstats->lock);
+
+ memset(&pending_IOOpStats, 0, sizeof(pending_IOOpStats));
+
+ have_ioopstats = false;
+
+ return false;
}
const char *
@@ -108,6 +222,56 @@ pgstat_io_op_desc(IOOp io_op)
elog(ERROR, "unrecognized IOOp value: %d", io_op);
}
+void
+pgstat_io_ops_reset_all_cb(TimestampTz ts)
+{
+ PgStatShared_BackendIOContextOps *backends_stats_shmem = &pgStatLocal.shmem->io_ops;
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ PgStatShared_IOContextOps *stats_shmem = &backends_stats_shmem->stats[i];
+
+ LWLockAcquire(&stats_shmem->lock, LW_EXCLUSIVE);
+
+ /*
+ * Use the lock in the first BackendType's PgStat_IOContextOps to
+ * protect the reset timestamp as well.
+ */
+ if (i == 0)
+ backends_stats_shmem->stat_reset_timestamp = ts;
+
+ memset(stats_shmem->data, 0, sizeof(stats_shmem->data));
+ LWLockRelease(&stats_shmem->lock);
+ }
+}
+
+void
+pgstat_io_ops_snapshot_cb(void)
+{
+ PgStatShared_BackendIOContextOps *backends_stats_shmem = &pgStatLocal.shmem->io_ops;
+ PgStat_BackendIOContextOps *backends_stats_snap = &pgStatLocal.snapshot.io_ops;
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ PgStatShared_IOContextOps *stats_shmem = &backends_stats_shmem->stats[i];
+ PgStat_IOContextOps *stats_snap = &backends_stats_snap->stats[i];
+
+ LWLockAcquire(&stats_shmem->lock, LW_SHARED);
+
+ /*
+ * Use the lock in the first BackendType's PgStat_IOContextOps to
+ * protect the reset timestamp as well.
+ */
+ if (i == 0)
+ backends_stats_snap->stat_reset_timestamp =
+ backends_stats_shmem->stat_reset_timestamp;
+
+ memcpy(stats_snap->data, stats_shmem->data, sizeof(stats_shmem->data));
+ LWLockRelease(&stats_shmem->lock);
+ }
+
+}
+
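The reset and snapshot callbacks above share a convention worth spelling out: since there is one lock per backend type but only one reset timestamp, the timestamp is read or written while holding the first type's lock. A standalone C sketch with the locks reduced to comments, not the patch's real structs:

```c
#include <string.h>

#define BACKEND_NUM_TYPES 3     /* simplified; the real constant is larger */
#define NCOUNTERS 4

typedef struct TypeStats
{
    /* the real struct carries an LWLock taken around each block below */
    long data[NCOUNTERS];
} TypeStats;

static TypeStats shared[BACKEND_NUM_TYPES];
static long shared_reset_ts;

/*
 * Zero every backend type's counters. The reset timestamp is written
 * while "holding" the first type's lock, so readers taking that same
 * lock see a timestamp consistent with the first slot's data.
 */
static void
reset_all(long ts)
{
    for (int i = 0; i < BACKEND_NUM_TYPES; i++)
    {
        /* LWLockAcquire(&shared[i].lock, LW_EXCLUSIVE) in the real code */
        if (i == 0)
            shared_reset_ts = ts;
        memset(shared[i].data, 0, sizeof(shared[i].data));
        /* LWLockRelease() */
    }
}

/* Copy the shared counters into a local snapshot, same lock convention. */
static void
snapshot_all(TypeStats snap[BACKEND_NUM_TYPES], long *snap_ts)
{
    for (int i = 0; i < BACKEND_NUM_TYPES; i++)
    {
        /* LWLockAcquire(&shared[i].lock, LW_SHARED) in the real code */
        if (i == 0)
            *snap_ts = shared_reset_ts;
        memcpy(snap[i].data, shared[i].data, sizeof(shared[i].data));
        /* LWLockRelease() */
    }
}
```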
/*
* IO Operation statistics are not collected for all BackendTypes.
*
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index 55a355f583..a23a90b133 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -205,7 +205,7 @@ pgstat_drop_relation(Relation rel)
}
/*
- * Report that the table was just vacuumed.
+ * Report that the table was just vacuumed and flush IO Operation statistics.
*/
void
pgstat_report_vacuum(Oid tableoid, bool shared,
@@ -257,10 +257,18 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
}
pgstat_unlock_entry(entry_ref);
+
+ /*
+ * Flush IO Operations statistics now. pgstat_report_stat() will flush IO
+ * Operation stats, however this will not be called after an entire
+ * autovacuum cycle is done -- which will likely vacuum many relations --
+ * or until the VACUUM command has processed all tables and committed.
+ */
+ pgstat_flush_io_ops(false);
}
/*
- * Report that the table was just analyzed.
+ * Report that the table was just analyzed and flush IO Operation statistics.
*
* Caller must provide new live- and dead-tuples estimates, as well as a
* flag indicating whether to reset the changes_since_analyze counter.
@@ -340,6 +348,9 @@ pgstat_report_analyze(Relation rel,
}
pgstat_unlock_entry(entry_ref);
+
+ /* see pgstat_report_vacuum() */
+ pgstat_flush_io_ops(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_shmem.c b/src/backend/utils/activity/pgstat_shmem.c
index 9a4f037959..275a7be166 100644
--- a/src/backend/utils/activity/pgstat_shmem.c
+++ b/src/backend/utils/activity/pgstat_shmem.c
@@ -202,6 +202,10 @@ StatsShmemInit(void)
LWLockInitialize(&ctl->checkpointer.lock, LWTRANCHE_PGSTATS_DATA);
LWLockInitialize(&ctl->slru.lock, LWTRANCHE_PGSTATS_DATA);
LWLockInitialize(&ctl->wal.lock, LWTRANCHE_PGSTATS_DATA);
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ LWLockInitialize(&ctl->io_ops.stats[i].lock,
+ LWTRANCHE_PGSTATS_DATA);
}
else
{
diff --git a/src/backend/utils/activity/pgstat_wal.c b/src/backend/utils/activity/pgstat_wal.c
index 5a878bd115..9cac407b42 100644
--- a/src/backend/utils/activity/pgstat_wal.c
+++ b/src/backend/utils/activity/pgstat_wal.c
@@ -34,7 +34,7 @@ static WalUsage prevWalUsage;
/*
* Calculate how much WAL usage counters have increased and update
- * shared statistics.
+ * shared WAL and IO Operation statistics.
*
* Must be called by processes that generate WAL, that do not call
* pgstat_report_stat(), like walwriter.
@@ -43,6 +43,8 @@ void
pgstat_report_wal(bool force)
{
pgstat_flush_wal(force);
+
+ pgstat_flush_io_ops(force);
}
/*
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 96bffc0f2a..b783af130c 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -2084,6 +2084,8 @@ pg_stat_reset_shared(PG_FUNCTION_ARGS)
pgstat_reset_of_kind(PGSTAT_KIND_BGWRITER);
pgstat_reset_of_kind(PGSTAT_KIND_CHECKPOINTER);
}
+ else if (strcmp(target, "io") == 0)
+ pgstat_reset_of_kind(PGSTAT_KIND_IOOPS);
else if (strcmp(target, "recovery_prefetch") == 0)
XLogPrefetchResetStats();
else if (strcmp(target, "wal") == 0)
@@ -2092,7 +2094,7 @@ pg_stat_reset_shared(PG_FUNCTION_ARGS)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("unrecognized reset target: \"%s\"", target),
- errhint("Target must be \"archiver\", \"bgwriter\", \"recovery_prefetch\", or \"wal\".")));
+ errhint("Target must be \"archiver\", \"bgwriter\", \"io\", \"recovery_prefetch\", or \"wal\".")));
PG_RETURN_VOID();
}
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index e7ebea4ff4..bf97162e83 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -331,6 +331,8 @@ typedef enum BackendType
B_WAL_WRITER,
} BackendType;
+#define BACKEND_NUM_TYPES (B_WAL_WRITER + 1)
+
extern PGDLLIMPORT BackendType MyBackendType;
extern const char *GetBackendTypeDesc(BackendType backendType);
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 5883aafe9c..010dc7267b 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -49,6 +49,7 @@ typedef enum PgStat_Kind
PGSTAT_KIND_ARCHIVER,
PGSTAT_KIND_BGWRITER,
PGSTAT_KIND_CHECKPOINTER,
+ PGSTAT_KIND_IOOPS,
PGSTAT_KIND_SLRU,
PGSTAT_KIND_WAL,
} PgStat_Kind;
@@ -326,6 +327,12 @@ typedef struct PgStat_IOContextOps
PgStat_IOOpCounters data[IOCONTEXT_NUM_TYPES];
} PgStat_IOContextOps;
+typedef struct PgStat_BackendIOContextOps
+{
+ TimestampTz stat_reset_timestamp;
+ PgStat_IOContextOps stats[BACKEND_NUM_TYPES];
+} PgStat_BackendIOContextOps;
+
typedef struct PgStat_StatDBEntry
{
PgStat_Counter n_xact_commit;
@@ -508,6 +515,7 @@ extern PgStat_CheckpointerStats *pgstat_fetch_stat_checkpointer(void);
*/
extern void pgstat_count_io_op(IOOp io_op, IOContext io_context);
+extern PgStat_BackendIOContextOps *pgstat_fetch_backend_io_context_ops(void);
extern const char *pgstat_io_context_desc(IOContext io_context);
extern const char *pgstat_io_op_desc(IOOp io_op);
@@ -519,6 +527,86 @@ extern bool pgstat_expect_io_op(BackendType bktype, IOContext io_context, IOOp i
/* IO stats translation function in freelist.c */
extern IOContext IOContextForStrategy(BufferAccessStrategy bas);
+/*
+ * Functions to assert that invalid IO Operation counters are zero.
+ */
+static inline void
+pgstat_io_context_ops_assert_zero(PgStat_IOOpCounters *counters)
+{
+ Assert(counters->evictions == 0 && counters->extends == 0 &&
+ counters->fsyncs == 0 && counters->reads == 0 &&
+ counters->rejections == 0 && counters->repossessions == 0 &&
+ counters->reuses == 0 && counters->writes == 0);
+}
+
+static inline void
+pgstat_io_op_assert_zero(PgStat_IOOpCounters *counters, IOOp io_op)
+{
+ switch (io_op)
+ {
+ case IOOP_EVICT:
+ Assert(counters->evictions == 0);
+ return;
+ case IOOP_EXTEND:
+ Assert(counters->extends == 0);
+ return;
+ case IOOP_FSYNC:
+ Assert(counters->fsyncs == 0);
+ return;
+ case IOOP_READ:
+ Assert(counters->reads == 0);
+ return;
+ case IOOP_REJECT:
+ Assert(counters->rejections == 0);
+ return;
+ case IOOP_REPOSSESS:
+ Assert(counters->repossessions == 0);
+ return;
+ case IOOP_REUSE:
+ Assert(counters->reuses == 0);
+ return;
+ case IOOP_WRITE:
+ Assert(counters->writes == 0);
+ return;
+ }
+
+ /* Should not reach here */
+ Assert(false);
+}
+
+/*
+ * Assert that stats have not been counted for any combination of IOContext and
+ * IOOp which are not valid for the passed-in BackendType. The passed-in array
+ * of PgStat_IOOpCounters must contain stats from the BackendType specified by
+ * the second parameter. Caller is responsible for any locking if the passed-in
+ * array of PgStat_IOOpCounters is a member of PgStatShared_IOContextOps.
+ */
+static inline void
+pgstat_backend_io_stats_assert_well_formed(PgStat_IOOpCounters
+ backend_io_context_ops[IOCONTEXT_NUM_TYPES], BackendType bktype)
+{
+ bool expect_backend_stats = true;
+
+ if (!pgstat_io_op_stats_collected(bktype))
+ expect_backend_stats = false;
+
+ for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ {
+ if (!expect_backend_stats ||
+ !pgstat_bktype_io_context_valid(bktype, (IOContext) io_context))
+ {
+ pgstat_io_context_ops_assert_zero(&backend_io_context_ops[io_context]);
+ continue;
+ }
+
+ for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
+ {
+ if (!pgstat_io_op_valid(bktype, (IOContext) io_context, (IOOp) io_op))
+ pgstat_io_op_assert_zero(&backend_io_context_ops[io_context],
+ (IOOp) io_op);
+ }
+ }
+}
/*
diff --git a/src/include/utils/pgstat_internal.h b/src/include/utils/pgstat_internal.h
index 627c1389e4..9066fed660 100644
--- a/src/include/utils/pgstat_internal.h
+++ b/src/include/utils/pgstat_internal.h
@@ -330,6 +330,25 @@ typedef struct PgStatShared_Checkpointer
PgStat_CheckpointerStats reset_offset;
} PgStatShared_Checkpointer;
+typedef struct PgStatShared_IOContextOps
+{
+ /*
+ * lock protects ->data. If this PgStatShared_IOContextOps is
+ * PgStatShared_BackendIOContextOps->stats[0], lock also protects
+ * PgStatShared_BackendIOContextOps->stat_reset_timestamp.
+ */
+ LWLock lock;
+ PgStat_IOOpCounters data[IOCONTEXT_NUM_TYPES];
+} PgStatShared_IOContextOps;
+
+typedef struct PgStatShared_BackendIOContextOps
+{
+ /* ->stat_reset_timestamp is protected by ->stats[0].lock */
+ TimestampTz stat_reset_timestamp;
+ PgStatShared_IOContextOps stats[BACKEND_NUM_TYPES];
+} PgStatShared_BackendIOContextOps;
+
+
typedef struct PgStatShared_SLRU
{
/* lock protects ->stats */
@@ -420,6 +439,7 @@ typedef struct PgStat_ShmemControl
PgStatShared_Archiver archiver;
PgStatShared_BgWriter bgwriter;
PgStatShared_Checkpointer checkpointer;
+ PgStatShared_BackendIOContextOps io_ops;
PgStatShared_SLRU slru;
PgStatShared_Wal wal;
} PgStat_ShmemControl;
@@ -443,6 +463,8 @@ typedef struct PgStat_Snapshot
PgStat_CheckpointerStats checkpointer;
+ PgStat_BackendIOContextOps io_ops;
+
PgStat_SLRUStats slru[SLRU_NUM_ELEMENTS];
PgStat_WalStats wal;
@@ -550,6 +572,15 @@ extern void pgstat_database_reset_timestamp_cb(PgStatShared_Common *header, Time
extern bool pgstat_function_flush_cb(PgStat_EntryRef *entry_ref, bool nowait);
+/*
+ * Functions in pgstat_io_ops.c
+ */
+
+extern void pgstat_io_ops_reset_all_cb(TimestampTz ts);
+extern void pgstat_io_ops_snapshot_cb(void);
+extern bool pgstat_flush_io_ops(bool nowait);
+
+
/*
* Functions in pgstat_relation.c
*/
@@ -642,6 +673,11 @@ extern void pgstat_create_transactional(PgStat_Kind kind, Oid dboid, Oid objoid)
extern PGDLLIMPORT PgStat_LocalState pgStatLocal;
+/*
+ * Variables in pgstat_io_ops.c
+ */
+
+extern PGDLLIMPORT bool have_ioopstats;
/*
* Variables in pgstat_slru.c
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index b080367073..6d33b2c9bb 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2005,12 +2005,14 @@ PgFdwRelationInfo
PgFdwScanState
PgIfAddrCallback
PgStatShared_Archiver
+PgStatShared_BackendIOContextOps
PgStatShared_BgWriter
PgStatShared_Checkpointer
PgStatShared_Common
PgStatShared_Database
PgStatShared_Function
PgStatShared_HashEntry
+PgStatShared_IOContextOps
PgStatShared_Relation
PgStatShared_ReplSlot
PgStatShared_SLRU
@@ -2018,6 +2020,7 @@ PgStatShared_Subscription
PgStatShared_Wal
PgStat_ArchiverStats
PgStat_BackendFunctionEntry
+PgStat_BackendIOContextOps
PgStat_BackendSubEntry
PgStat_BgWriterStats
PgStat_CheckpointerStats
--
2.34.1
Okay, so I realized v35 had an issue where I wasn't counting strategy
evictions correctly. Fixed in attached v36. This made me wonder if there
is actually a way to add a test for evictions (in strategy and shared
contexts) that is not flaky.
On Sun, Oct 23, 2022 at 6:48 PM Maciek Sakrejda <m.sakrejda@gmail.com> wrote:
On Thu, Oct 20, 2022 at 10:31 AM Andres Freund <andres@anarazel.de> wrote:
- "repossession" is a very unintuitive name for me. If we want something like
it, can't we just name it reuse_failed or such?

+1, I think "repossessed" is awkward. I think "reuse_failed" works,
but no strong opinions on an alternate name.
Also, re: repossessed, I can change it to reuse_failed but I do think it
is important to give users a way to distinguish between bulkread
rejections of dirty buffers and strategies failing to reuse buffers due
to concurrent pinning (since the reaction to these two scenarios would
likely be different).
If we added another column called something like "claim_failed" which
counts buffers which we failed to reuse because of concurrent pinning or
usage, we could recommend use of this column together with
"reuse_failed" to determine the cause of the failed reuses for a
bulkread. We could also use "claim_failed" in IOContext shared to
provide information on shared buffer contention.
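To make the proposed distinction concrete, the two counters would diverge along lines like the following (a hypothetical sketch of the decision logic, not code from the patch; column and function names are made up):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcome classification for a failed strategy-buffer reuse,
 * per the discussion above: "claim_failed" = the buffer could not be taken
 * because of a concurrent pin or recent usage; "reuse_failed" = a bulkread
 * declined to write out a dirty buffer and gave up on reusing it. */
typedef struct { long reused; long reuse_failed; long claim_failed; } ReuseStats;

static void toy_try_reuse(ReuseStats *s, bool pinned_or_used, bool dirty,
                          bool is_bulkread)
{
    if (pinned_or_used)
        s->claim_failed++;      /* lost the buffer to concurrent activity */
    else if (dirty && is_bulkread)
        s->reuse_failed++;      /* bulkread rejects dirty buffers */
    else
        s->reused++;            /* buffer successfully reused */
}
```

With both columns, a high reuse_failed alongside a low claim_failed would point at dirty-buffer rejections, while the reverse would point at pin contention.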
- Melanie
Attachments:
v36-0003-Aggregate-IO-operation-stats-per-BackendType.patch (text/x-patch)
From 0d5fc7da60f6b02259b8dd1d2eab25967cb9a95a Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Thu, 6 Oct 2022 12:23:38 -0400
Subject: [PATCH v36 3/5] Aggregate IO operation stats per BackendType
Stats on IOOps for all IOContexts for a backend are tracked locally. Add
functionality for backends to flush these stats to shared memory and
accumulate them with those from all other backends, exited and live.
Also add reset and snapshot functions used by cumulative stats system
for management of these statistics.
The aggregated stats in shared memory could be extended in the future
with per-backend stats -- useful for per connection IO statistics and
monitoring.
Some BackendTypes do not flush their pending statistics at regular
intervals, so they explicitly call pgstat_flush_io_ops() during the course
of normal operations to flush their backend-local IO operation statistics
to shared memory in a timely manner.
Because not all BackendType, IOOp, IOContext combinations are valid, the
validity of the stats is checked before flushing pending stats and
before reading in the existing stats file to shared memory.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Maciek Sakrejda <m.sakrejda@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
doc/src/sgml/monitoring.sgml | 2 +
src/backend/utils/activity/pgstat.c | 35 ++++
src/backend/utils/activity/pgstat_bgwriter.c | 7 +-
.../utils/activity/pgstat_checkpointer.c | 7 +-
src/backend/utils/activity/pgstat_io_ops.c | 164 ++++++++++++++++++
src/backend/utils/activity/pgstat_relation.c | 15 +-
src/backend/utils/activity/pgstat_shmem.c | 4 +
src/backend/utils/activity/pgstat_wal.c | 4 +-
src/backend/utils/adt/pgstatfuncs.c | 4 +-
src/include/miscadmin.h | 2 +
src/include/pgstat.h | 88 ++++++++++
src/include/utils/pgstat_internal.h | 36 ++++
src/tools/pgindent/typedefs.list | 3 +
13 files changed, 365 insertions(+), 6 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index e5d622d514..698f274341 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -5390,6 +5390,8 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
the <structname>pg_stat_bgwriter</structname>
view, <literal>archiver</literal> to reset all the counters shown in
the <structname>pg_stat_archiver</structname> view,
+ <literal>io</literal> to reset all the counters shown in the
+ <structname>pg_stat_io</structname> view,
<literal>wal</literal> to reset all the counters shown in the
<structname>pg_stat_wal</structname> view or
<literal>recovery_prefetch</literal> to reset all the counters shown
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 1b97597f17..4becee9a6c 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -359,6 +359,15 @@ static const PgStat_KindInfo pgstat_kind_infos[PGSTAT_NUM_KINDS] = {
.snapshot_cb = pgstat_checkpointer_snapshot_cb,
},
+ [PGSTAT_KIND_IOOPS] = {
+ .name = "io_ops",
+
+ .fixed_amount = true,
+
+ .reset_all_cb = pgstat_io_ops_reset_all_cb,
+ .snapshot_cb = pgstat_io_ops_snapshot_cb,
+ },
+
[PGSTAT_KIND_SLRU] = {
.name = "slru",
@@ -582,6 +591,7 @@ pgstat_report_stat(bool force)
/* Don't expend a clock check if nothing to do */
if (dlist_is_empty(&pgStatPending) &&
+ !have_ioopstats &&
!have_slrustats &&
!pgstat_have_pending_wal())
{
@@ -628,6 +638,9 @@ pgstat_report_stat(bool force)
/* flush database / relation / function / ... stats */
partial_flush |= pgstat_flush_pending_entries(nowait);
+ /* flush IO Operations stats */
+ partial_flush |= pgstat_flush_io_ops(nowait);
+
/* flush wal stats */
partial_flush |= pgstat_flush_wal(nowait);
@@ -1321,6 +1334,14 @@ pgstat_write_statsfile(void)
pgstat_build_snapshot_fixed(PGSTAT_KIND_CHECKPOINTER);
write_chunk_s(fpout, &pgStatLocal.snapshot.checkpointer);
+ /*
+ * Write IO Operations stats struct
+ */
+ pgstat_build_snapshot_fixed(PGSTAT_KIND_IOOPS);
+ write_chunk_s(fpout, &pgStatLocal.snapshot.io_ops.stat_reset_timestamp);
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ write_chunk_s(fpout, &pgStatLocal.snapshot.io_ops.stats[i]);
+
/*
* Write SLRU stats struct
*/
@@ -1495,6 +1516,20 @@ pgstat_read_statsfile(void)
if (!read_chunk_s(fpin, &shmem->checkpointer.stats))
goto error;
+ /*
+ * Read IO Operations stats struct
+ */
+ if (!read_chunk_s(fpin, &shmem->io_ops.stat_reset_timestamp))
+ goto error;
+
+ for (int bktype = 0; bktype < BACKEND_NUM_TYPES; bktype++)
+ {
+ pgstat_backend_io_stats_assert_well_formed(shmem->io_ops.stats[bktype].data,
+ (BackendType) bktype);
+ if (!read_chunk_s(fpin, &shmem->io_ops.stats[bktype].data))
+ goto error;
+ }
+
/*
* Read SLRU stats struct
*/
diff --git a/src/backend/utils/activity/pgstat_bgwriter.c b/src/backend/utils/activity/pgstat_bgwriter.c
index fbb1edc527..3d7f90a1b7 100644
--- a/src/backend/utils/activity/pgstat_bgwriter.c
+++ b/src/backend/utils/activity/pgstat_bgwriter.c
@@ -24,7 +24,7 @@ PgStat_BgWriterStats PendingBgWriterStats = {0};
/*
- * Report bgwriter statistics
+ * Report bgwriter and IO Operation statistics
*/
void
pgstat_report_bgwriter(void)
@@ -56,6 +56,11 @@ pgstat_report_bgwriter(void)
* Clear out the statistics buffer, so it can be re-used.
*/
MemSet(&PendingBgWriterStats, 0, sizeof(PendingBgWriterStats));
+
+ /*
+ * Report IO Operations statistics
+ */
+ pgstat_flush_io_ops(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_checkpointer.c b/src/backend/utils/activity/pgstat_checkpointer.c
index af8d513e7b..cfcf127210 100644
--- a/src/backend/utils/activity/pgstat_checkpointer.c
+++ b/src/backend/utils/activity/pgstat_checkpointer.c
@@ -24,7 +24,7 @@ PgStat_CheckpointerStats PendingCheckpointerStats = {0};
/*
- * Report checkpointer statistics
+ * Report checkpointer and IO Operation statistics
*/
void
pgstat_report_checkpointer(void)
@@ -62,6 +62,11 @@ pgstat_report_checkpointer(void)
* Clear out the statistics buffer, so it can be re-used.
*/
MemSet(&PendingCheckpointerStats, 0, sizeof(PendingCheckpointerStats));
+
+ /*
+ * Report IO Operation statistics
+ */
+ pgstat_flush_io_ops(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_io_ops.c b/src/backend/utils/activity/pgstat_io_ops.c
index 6f9c250907..9f0f27da1f 100644
--- a/src/backend/utils/activity/pgstat_io_ops.c
+++ b/src/backend/utils/activity/pgstat_io_ops.c
@@ -20,6 +20,48 @@
#include "utils/pgstat_internal.h"
static PgStat_IOContextOps pending_IOOpStats;
+bool have_ioopstats = false;
+
+
+/*
+ * Helper function to accumulate source PgStat_IOOpCounters into target
+ * PgStat_IOOpCounters. If either of the passed-in PgStat_IOOpCounters are
+ * members of PgStatShared_IOContextOps, the caller is responsible for ensuring
+ * that the appropriate lock is held.
+ */
+static void
+pgstat_accum_io_op(PgStat_IOOpCounters *target, PgStat_IOOpCounters *source, IOOp io_op)
+{
+ switch (io_op)
+ {
+ case IOOP_EVICT:
+ target->evictions += source->evictions;
+ return;
+ case IOOP_EXTEND:
+ target->extends += source->extends;
+ return;
+ case IOOP_FSYNC:
+ target->fsyncs += source->fsyncs;
+ return;
+ case IOOP_READ:
+ target->reads += source->reads;
+ return;
+ case IOOP_REPOSSESS:
+ target->repossessions += source->repossessions;
+ return;
+ case IOOP_REJECT:
+ target->rejections += source->rejections;
+ return;
+ case IOOP_REUSE:
+ target->reuses += source->reuses;
+ return;
+ case IOOP_WRITE:
+ target->writes += source->writes;
+ return;
+ }
+
+ elog(ERROR, "unrecognized IOOp value: %d", io_op);
+}
void
pgstat_count_io_op(IOOp io_op, IOContext io_context)
@@ -60,6 +102,78 @@ pgstat_count_io_op(IOOp io_op, IOContext io_context)
break;
}
+ have_ioopstats = true;
+}
+
+PgStat_BackendIOContextOps *
+pgstat_fetch_backend_io_context_ops(void)
+{
+ pgstat_snapshot_fixed(PGSTAT_KIND_IOOPS);
+
+ return &pgStatLocal.snapshot.io_ops;
+}
+
+/*
+ * Flush out locally pending IO Operation statistics entries
+ *
+ * If no stats have been recorded, this function returns false.
+ *
+ * If nowait is true, this function returns true if the lock could not be
+ * acquired. Otherwise, it returns false.
+ */
+bool
+pgstat_flush_io_ops(bool nowait)
+{
+ PgStatShared_IOContextOps *type_shstats;
+ bool expect_backend_stats = true;
+
+ if (!have_ioopstats)
+ return false;
+
+ type_shstats =
+ &pgStatLocal.shmem->io_ops.stats[MyBackendType];
+
+ if (!nowait)
+ LWLockAcquire(&type_shstats->lock, LW_EXCLUSIVE);
+ else if (!LWLockConditionalAcquire(&type_shstats->lock, LW_EXCLUSIVE))
+ return true;
+
+ expect_backend_stats = pgstat_io_op_stats_collected(MyBackendType);
+
+ for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ {
+ PgStat_IOOpCounters *sharedent = &type_shstats->data[io_context];
+ PgStat_IOOpCounters *pendingent = &pending_IOOpStats.data[io_context];
+
+ if (!expect_backend_stats ||
+ !pgstat_bktype_io_context_valid(MyBackendType, (IOContext) io_context))
+ {
+ pgstat_io_context_ops_assert_zero(sharedent);
+ pgstat_io_context_ops_assert_zero(pendingent);
+ continue;
+ }
+
+ for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
+ {
+ if (!(pgstat_io_op_valid(MyBackendType, (IOContext) io_context,
+ (IOOp) io_op)))
+ {
+ pgstat_io_op_assert_zero(sharedent, (IOOp) io_op);
+ pgstat_io_op_assert_zero(pendingent, (IOOp) io_op);
+ continue;
+ }
+
+ pgstat_accum_io_op(sharedent, pendingent, (IOOp) io_op);
+ }
+ }
+
+ LWLockRelease(&type_shstats->lock);
+
+ memset(&pending_IOOpStats, 0, sizeof(pending_IOOpStats));
+
+ have_ioopstats = false;
+
+ return false;
}
const char *
@@ -108,6 +222,56 @@ pgstat_io_op_desc(IOOp io_op)
elog(ERROR, "unrecognized IOOp value: %d", io_op);
}
+void
+pgstat_io_ops_reset_all_cb(TimestampTz ts)
+{
+ PgStatShared_BackendIOContextOps *backends_stats_shmem = &pgStatLocal.shmem->io_ops;
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ PgStatShared_IOContextOps *stats_shmem = &backends_stats_shmem->stats[i];
+
+ LWLockAcquire(&stats_shmem->lock, LW_EXCLUSIVE);
+
+ /*
+ * Use the lock in the first BackendType's PgStat_IOContextOps to
+ * protect the reset timestamp as well.
+ */
+ if (i == 0)
+ backends_stats_shmem->stat_reset_timestamp = ts;
+
+ memset(stats_shmem->data, 0, sizeof(stats_shmem->data));
+ LWLockRelease(&stats_shmem->lock);
+ }
+}
+
+void
+pgstat_io_ops_snapshot_cb(void)
+{
+ PgStatShared_BackendIOContextOps *backends_stats_shmem = &pgStatLocal.shmem->io_ops;
+ PgStat_BackendIOContextOps *backends_stats_snap = &pgStatLocal.snapshot.io_ops;
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ PgStatShared_IOContextOps *stats_shmem = &backends_stats_shmem->stats[i];
+ PgStat_IOContextOps *stats_snap = &backends_stats_snap->stats[i];
+
+ LWLockAcquire(&stats_shmem->lock, LW_SHARED);
+
+ /*
+ * Use the lock in the first BackendType's PgStat_IOContextOps to
+ * protect the reset timestamp as well.
+ */
+ if (i == 0)
+ backends_stats_snap->stat_reset_timestamp =
+ backends_stats_shmem->stat_reset_timestamp;
+
+ memcpy(stats_snap->data, stats_shmem->data, sizeof(stats_shmem->data));
+ LWLockRelease(&stats_shmem->lock);
+ }
+
+}
+
/*
* IO Operation statistics are not collected for all BackendTypes.
*
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index 55a355f583..a23a90b133 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -205,7 +205,7 @@ pgstat_drop_relation(Relation rel)
}
/*
- * Report that the table was just vacuumed.
+ * Report that the table was just vacuumed and flush IO Operation statistics.
*/
void
pgstat_report_vacuum(Oid tableoid, bool shared,
@@ -257,10 +257,18 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
}
pgstat_unlock_entry(entry_ref);
+
+ /*
+ * Flush IO Operations statistics now. pgstat_report_stat() will also flush
+ * IO Operation stats, but it is not called until an entire autovacuum cycle
+ * is done -- which will likely vacuum many relations -- or until the VACUUM
+ * command has processed all tables and committed.
+ */
+ pgstat_flush_io_ops(false);
}
/*
- * Report that the table was just analyzed.
+ * Report that the table was just analyzed and flush IO Operation statistics.
*
* Caller must provide new live- and dead-tuples estimates, as well as a
* flag indicating whether to reset the changes_since_analyze counter.
@@ -340,6 +348,9 @@ pgstat_report_analyze(Relation rel,
}
pgstat_unlock_entry(entry_ref);
+
+ /* see pgstat_report_vacuum() */
+ pgstat_flush_io_ops(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_shmem.c b/src/backend/utils/activity/pgstat_shmem.c
index 9a4f037959..275a7be166 100644
--- a/src/backend/utils/activity/pgstat_shmem.c
+++ b/src/backend/utils/activity/pgstat_shmem.c
@@ -202,6 +202,10 @@ StatsShmemInit(void)
LWLockInitialize(&ctl->checkpointer.lock, LWTRANCHE_PGSTATS_DATA);
LWLockInitialize(&ctl->slru.lock, LWTRANCHE_PGSTATS_DATA);
LWLockInitialize(&ctl->wal.lock, LWTRANCHE_PGSTATS_DATA);
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ LWLockInitialize(&ctl->io_ops.stats[i].lock,
+ LWTRANCHE_PGSTATS_DATA);
}
else
{
diff --git a/src/backend/utils/activity/pgstat_wal.c b/src/backend/utils/activity/pgstat_wal.c
index 5a878bd115..9cac407b42 100644
--- a/src/backend/utils/activity/pgstat_wal.c
+++ b/src/backend/utils/activity/pgstat_wal.c
@@ -34,7 +34,7 @@ static WalUsage prevWalUsage;
/*
* Calculate how much WAL usage counters have increased and update
- * shared statistics.
+ * shared WAL and IO Operation statistics.
*
* Must be called by processes that generate WAL, that do not call
* pgstat_report_stat(), like walwriter.
@@ -43,6 +43,8 @@ void
pgstat_report_wal(bool force)
{
pgstat_flush_wal(force);
+
+ pgstat_flush_io_ops(force);
}
/*
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 96bffc0f2a..b783af130c 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -2084,6 +2084,8 @@ pg_stat_reset_shared(PG_FUNCTION_ARGS)
pgstat_reset_of_kind(PGSTAT_KIND_BGWRITER);
pgstat_reset_of_kind(PGSTAT_KIND_CHECKPOINTER);
}
+ else if (strcmp(target, "io") == 0)
+ pgstat_reset_of_kind(PGSTAT_KIND_IOOPS);
else if (strcmp(target, "recovery_prefetch") == 0)
XLogPrefetchResetStats();
else if (strcmp(target, "wal") == 0)
@@ -2092,7 +2094,7 @@ pg_stat_reset_shared(PG_FUNCTION_ARGS)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("unrecognized reset target: \"%s\"", target),
- errhint("Target must be \"archiver\", \"bgwriter\", \"recovery_prefetch\", or \"wal\".")));
+ errhint("Target must be \"archiver\", \"bgwriter\", \"io\", \"recovery_prefetch\", or \"wal\".")));
PG_RETURN_VOID();
}
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index e7ebea4ff4..bf97162e83 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -331,6 +331,8 @@ typedef enum BackendType
B_WAL_WRITER,
} BackendType;
+#define BACKEND_NUM_TYPES (B_WAL_WRITER + 1)
+
extern PGDLLIMPORT BackendType MyBackendType;
extern const char *GetBackendTypeDesc(BackendType backendType);
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 5883aafe9c..010dc7267b 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -49,6 +49,7 @@ typedef enum PgStat_Kind
PGSTAT_KIND_ARCHIVER,
PGSTAT_KIND_BGWRITER,
PGSTAT_KIND_CHECKPOINTER,
+ PGSTAT_KIND_IOOPS,
PGSTAT_KIND_SLRU,
PGSTAT_KIND_WAL,
} PgStat_Kind;
@@ -326,6 +327,12 @@ typedef struct PgStat_IOContextOps
PgStat_IOOpCounters data[IOCONTEXT_NUM_TYPES];
} PgStat_IOContextOps;
+typedef struct PgStat_BackendIOContextOps
+{
+ TimestampTz stat_reset_timestamp;
+ PgStat_IOContextOps stats[BACKEND_NUM_TYPES];
+} PgStat_BackendIOContextOps;
+
typedef struct PgStat_StatDBEntry
{
PgStat_Counter n_xact_commit;
@@ -508,6 +515,7 @@ extern PgStat_CheckpointerStats *pgstat_fetch_stat_checkpointer(void);
*/
extern void pgstat_count_io_op(IOOp io_op, IOContext io_context);
+extern PgStat_BackendIOContextOps *pgstat_fetch_backend_io_context_ops(void);
extern const char *pgstat_io_context_desc(IOContext io_context);
extern const char *pgstat_io_op_desc(IOOp io_op);
@@ -519,6 +527,86 @@ extern bool pgstat_expect_io_op(BackendType bktype, IOContext io_context, IOOp i
/* IO stats translation function in freelist.c */
extern IOContext IOContextForStrategy(BufferAccessStrategy bas);
+/*
+ * Functions to assert that invalid IO Operation counters are zero.
+ */
+static inline void
+pgstat_io_context_ops_assert_zero(PgStat_IOOpCounters *counters)
+{
+ Assert(counters->evictions == 0 && counters->extends == 0 &&
+ counters->fsyncs == 0 && counters->reads == 0 &&
+ counters->rejections == 0 && counters->repossessions == 0 &&
+ counters->reuses == 0 && counters->writes == 0);
+}
+
+static inline void
+pgstat_io_op_assert_zero(PgStat_IOOpCounters *counters, IOOp io_op)
+{
+ switch (io_op)
+ {
+ case IOOP_EVICT:
+ Assert(counters->evictions == 0);
+ return;
+ case IOOP_EXTEND:
+ Assert(counters->extends == 0);
+ return;
+ case IOOP_FSYNC:
+ Assert(counters->fsyncs == 0);
+ return;
+ case IOOP_READ:
+ Assert(counters->reads == 0);
+ return;
+ case IOOP_REJECT:
+ Assert(counters->rejections == 0);
+ return;
+ case IOOP_REPOSSESS:
+ Assert(counters->repossessions == 0);
+ return;
+ case IOOP_REUSE:
+ Assert(counters->reuses == 0);
+ return;
+ case IOOP_WRITE:
+ Assert(counters->writes == 0);
+ return;
+ }
+
+ /* Should not reach here */
+ Assert(false);
+}
+
+/*
+ * Assert that stats have not been counted for any combination of IOContext and
+ * IOOp which are not valid for the passed-in BackendType. The passed-in array
+ * of PgStat_IOOpCounters must contain stats from the BackendType specified by
+ * the second parameter. Caller is responsible for any locking if the passed-in
+ * array of PgStat_IOOpCounters is a member of PgStatShared_IOContextOps.
+ */
+static inline void
+pgstat_backend_io_stats_assert_well_formed(PgStat_IOOpCounters
+ backend_io_context_ops[IOCONTEXT_NUM_TYPES], BackendType bktype)
+{
+ bool expect_backend_stats = true;
+
+ if (!pgstat_io_op_stats_collected(bktype))
+ expect_backend_stats = false;
+
+ for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ {
+ if (!expect_backend_stats ||
+ !pgstat_bktype_io_context_valid(bktype, (IOContext) io_context))
+ {
+ pgstat_io_context_ops_assert_zero(&backend_io_context_ops[io_context]);
+ continue;
+ }
+
+ for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
+ {
+ if (!pgstat_io_op_valid(bktype, (IOContext) io_context, (IOOp) io_op))
+ pgstat_io_op_assert_zero(&backend_io_context_ops[io_context],
+ (IOOp) io_op);
+ }
+ }
+}
/*
diff --git a/src/include/utils/pgstat_internal.h b/src/include/utils/pgstat_internal.h
index 627c1389e4..9066fed660 100644
--- a/src/include/utils/pgstat_internal.h
+++ b/src/include/utils/pgstat_internal.h
@@ -330,6 +330,25 @@ typedef struct PgStatShared_Checkpointer
PgStat_CheckpointerStats reset_offset;
} PgStatShared_Checkpointer;
+typedef struct PgStatShared_IOContextOps
+{
+ /*
+ * lock protects ->data. If this PgStatShared_IOContextOps is
+ * PgStatShared_BackendIOContextOps->stats[0], lock also protects
+ * PgStatShared_BackendIOContextOps->stat_reset_timestamp.
+ */
+ LWLock lock;
+ PgStat_IOOpCounters data[IOCONTEXT_NUM_TYPES];
+} PgStatShared_IOContextOps;
+
+typedef struct PgStatShared_BackendIOContextOps
+{
+ /* ->stat_reset_timestamp is protected by ->stats[0].lock */
+ TimestampTz stat_reset_timestamp;
+ PgStatShared_IOContextOps stats[BACKEND_NUM_TYPES];
+} PgStatShared_BackendIOContextOps;
+
+
typedef struct PgStatShared_SLRU
{
/* lock protects ->stats */
@@ -420,6 +439,7 @@ typedef struct PgStat_ShmemControl
PgStatShared_Archiver archiver;
PgStatShared_BgWriter bgwriter;
PgStatShared_Checkpointer checkpointer;
+ PgStatShared_BackendIOContextOps io_ops;
PgStatShared_SLRU slru;
PgStatShared_Wal wal;
} PgStat_ShmemControl;
@@ -443,6 +463,8 @@ typedef struct PgStat_Snapshot
PgStat_CheckpointerStats checkpointer;
+ PgStat_BackendIOContextOps io_ops;
+
PgStat_SLRUStats slru[SLRU_NUM_ELEMENTS];
PgStat_WalStats wal;
@@ -550,6 +572,15 @@ extern void pgstat_database_reset_timestamp_cb(PgStatShared_Common *header, Time
extern bool pgstat_function_flush_cb(PgStat_EntryRef *entry_ref, bool nowait);
+/*
+ * Functions in pgstat_io_ops.c
+ */
+
+extern void pgstat_io_ops_reset_all_cb(TimestampTz ts);
+extern void pgstat_io_ops_snapshot_cb(void);
+extern bool pgstat_flush_io_ops(bool nowait);
+
+
/*
* Functions in pgstat_relation.c
*/
@@ -642,6 +673,11 @@ extern void pgstat_create_transactional(PgStat_Kind kind, Oid dboid, Oid objoid)
extern PGDLLIMPORT PgStat_LocalState pgStatLocal;
+/*
+ * Variables in pgstat_io_ops.c
+ */
+
+extern PGDLLIMPORT bool have_ioopstats;
/*
* Variables in pgstat_slru.c
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index b080367073..6d33b2c9bb 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2005,12 +2005,14 @@ PgFdwRelationInfo
PgFdwScanState
PgIfAddrCallback
PgStatShared_Archiver
+PgStatShared_BackendIOContextOps
PgStatShared_BgWriter
PgStatShared_Checkpointer
PgStatShared_Common
PgStatShared_Database
PgStatShared_Function
PgStatShared_HashEntry
+PgStatShared_IOContextOps
PgStatShared_Relation
PgStatShared_ReplSlot
PgStatShared_SLRU
@@ -2018,6 +2020,7 @@ PgStatShared_Subscription
PgStatShared_Wal
PgStat_ArchiverStats
PgStat_BackendFunctionEntry
+PgStat_BackendIOContextOps
PgStat_BackendSubEntry
PgStat_BgWriterStats
PgStat_CheckpointerStats
--
2.34.1
Attachment: v36-0002-Track-IO-operation-statistics-locally.patch (text/x-patch)
From ac9fa4fe501fd948cfc4c4af983555813bfb20de Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Thu, 6 Oct 2022 12:23:25 -0400
Subject: [PATCH v36 2/5] Track IO operation statistics locally
Introduce "IOOp", an IO operation done by a backend, and "IOContext",
the IO source, target, or type done by a backend. For example, the
checkpointer may write a shared buffer out. This would be counted as an
IOOp "written" on an IOContext IOCONTEXT_SHARED by BackendType
"checkpointer".
Each IOOp (evict, reject, repossess, reuse, read, write, extend, and
fsync) is counted per IOContext (bulkread, bulkwrite, local, shared, or
vacuum) through a call to pgstat_count_io_op().
The primary concern of these statistics is IO operations on data blocks
during the course of normal database operations. IO operations done by,
for example, the archiver or syslogger are not counted in these
statistics. WAL IO, temporary file IO, and IO done directly though smgr*
functions (such as when building an index) are not yet counted but would
be useful future additions.
IOCONTEXT_LOCAL and IOCONTEXT_SHARED IOContexts concern operations on
local and shared buffers.
The IOCONTEXT_BULKREAD, IOCONTEXT_BULKWRITE, and IOCONTEXT_VACUUM
IOContexts concern IO operations on buffers as part of a
BufferAccessStrategy.
IOOP_EVICT IOOps are counted in the IOCONTEXT_SHARED and IOCONTEXT_LOCAL
IOContexts when a valid buffer is evicted while acquiring a buffer through
[Local]BufferAlloc() and no BufferAccessStrategy is in use.
When a BufferAccessStrategy is in use, evictions done to add shared buffers
to the strategy ring are counted as IOOP_EVICT IOOps in the
IOCONTEXT_[BULKREAD|BULKWRITE|VACUUM] IOContext. When one of these buffers
is reused, it is counted as an IOOP_REUSE IOOp in the corresponding
strategy IOContext.
IOOP_WRITE IOOps are counted in the BufferAccessStrategy IOContexts
whenever the reused dirty buffer is written out.
Stats on IOOps in all IOContexts for a given backend are counted in a
backend's local memory. A subsequent commit will expose functions for
aggregating and viewing these stats.
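The backend-local counting scheme described above can be sketched as a small standalone C fragment. This is a simplified illustration, not the patch code itself: the names (IOOp, IOContext, the count function) mirror the patch, but the real implementation stores PgStat_IOOpCounters structs per IOContext and flushes them to shared memory via the cumulative stats subsystem.

```c
#include <assert.h>
#include <stdint.h>

/* IOOp and IOContext values, mirroring the enums added in pgstat.h */
typedef enum
{
	IOOP_EVICT, IOOP_EXTEND, IOOP_FSYNC, IOOP_READ,
	IOOP_REJECT, IOOP_REPOSSESS, IOOP_REUSE, IOOP_WRITE,
	IOOP_NUM_TYPES
} IOOp;

typedef enum
{
	IOCONTEXT_BULKREAD, IOCONTEXT_BULKWRITE, IOCONTEXT_LOCAL,
	IOCONTEXT_SHARED, IOCONTEXT_VACUUM,
	IOCONTEXT_NUM_TYPES
} IOContext;

/*
 * One backend-local pending counter per (IOContext, IOOp) pair.  In the
 * patch this matrix lives in pending_IOOpStats and is aggregated later;
 * here it is just a flat array of uint64 counters.
 */
static uint64_t pending[IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];

/* Simplified stand-in for pgstat_count_io_op(): bump one cell */
static void
count_io_op(IOOp io_op, IOContext io_context)
{
	pending[io_context][io_op]++;
}
```

A backend would call this at each IO site, e.g. `count_io_op(IOOP_WRITE, IOCONTEXT_SHARED)` when flushing a dirty shared buffer.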
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Maciek Sakrejda <m.sakrejda@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
src/backend/postmaster/checkpointer.c | 13 ++
src/backend/storage/buffer/bufmgr.c | 82 ++++++-
src/backend/storage/buffer/freelist.c | 51 ++++-
src/backend/storage/buffer/localbuf.c | 6 +
src/backend/storage/sync/sync.c | 2 +
src/backend/utils/activity/Makefile | 1 +
src/backend/utils/activity/meson.build | 1 +
src/backend/utils/activity/pgstat_io_ops.c | 255 +++++++++++++++++++++
src/include/pgstat.h | 68 ++++++
src/include/storage/buf_internals.h | 2 +-
src/include/storage/bufmgr.h | 7 +-
src/tools/pgindent/typedefs.list | 4 +
12 files changed, 479 insertions(+), 13 deletions(-)
create mode 100644 src/backend/utils/activity/pgstat_io_ops.c
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 5fc076fc14..4ea4e6a298 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -1116,6 +1116,19 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
if (!AmBackgroundWriterProcess())
CheckpointerShmem->num_backend_fsync++;
LWLockRelease(CheckpointerCommLock);
+
+ /*
+ * We have no way of knowing if the current IOContext is
+ * IOCONTEXT_SHARED or IOCONTEXT_[BULKREAD, BULKWRITE, VACUUM] at this
+ * point, so count the fsync as being in the IOCONTEXT_SHARED
+ * IOContext. This is probably okay, because the number of backend
+ * fsyncs doesn't say anything about the efficacy of the
+ * BufferAccessStrategy. And counting both fsyncs done in
+ * IOCONTEXT_SHARED and IOCONTEXT_[BULKREAD, BULKWRITE, VACUUM] under
+ * IOCONTEXT_SHARED is likely clearer when investigating the number of
+ * backend fsyncs.
+ */
+ pgstat_count_io_op(IOOP_FSYNC, IOCONTEXT_SHARED);
return false;
}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 4e7b0b31bb..1cc108004f 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -482,7 +482,7 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
BlockNumber blockNum,
BufferAccessStrategy strategy,
bool *foundPtr);
-static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+static void FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOContext io_context);
static void FindAndDropRelationBuffers(RelFileLocator rlocator,
ForkNumber forkNum,
BlockNumber nForkBlock,
@@ -823,6 +823,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
BufferDesc *bufHdr;
Block bufBlock;
bool found;
+ IOContext io_context;
bool isExtend;
bool isLocalBuf = SmgrIsTemp(smgr);
@@ -833,6 +834,11 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
isExtend = (blockNum == P_NEW);
+ if (isLocalBuf)
+ io_context = IOCONTEXT_LOCAL;
+ else
+ io_context = IOContextForStrategy(strategy);
+
TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
smgr->smgr_rlocator.locator.spcOid,
smgr->smgr_rlocator.locator.dbOid,
@@ -990,6 +996,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (isExtend)
{
+ pgstat_count_io_op(IOOP_EXTEND, io_context);
/* new buffers are zero-filled */
MemSet((char *) bufBlock, 0, BLCKSZ);
/* don't set checksum for all-zero page */
@@ -1015,6 +1022,9 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
instr_time io_start,
io_time;
+ pgstat_count_io_op(IOOP_READ, io_context);
+
+
if (track_io_timing)
INSTR_TIME_SET_CURRENT(io_start);
@@ -1121,6 +1131,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
BufferAccessStrategy strategy,
bool *foundPtr)
{
+ bool from_ring;
+ IOContext io_context;
BufferTag newTag; /* identity of requested block */
uint32 newHash; /* hash value for newTag */
LWLock *newPartitionLock; /* buffer partition lock for it */
@@ -1187,9 +1199,12 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
*/
LWLockRelease(newPartitionLock);
+ io_context = IOContextForStrategy(strategy);
+
/* Loop here in case we have to try another victim buffer */
for (;;)
{
+
/*
* Ensure, while the spinlock's not yet held, that there's a free
* refcount entry.
@@ -1200,7 +1215,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* Select a victim buffer. The buffer is returned with its header
* spinlock still held!
*/
- buf = StrategyGetBuffer(strategy, &buf_state);
+ buf = StrategyGetBuffer(strategy, &buf_state, &from_ring);
Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
@@ -1263,13 +1278,34 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
}
}
+ /*
+ * When a strategy is in use, only flushes of dirty buffers
+ * already in the strategy ring are counted as strategy writes
+ * (IOCONTEXT [BULKREAD|BULKWRITE|VACUUM] IOOP_WRITE) for the
+ * purpose of IO operation statistics tracking.
+ *
+ * If a shared buffer initially added to the ring must be
+ * flushed before being used, this is counted as an
+ * IOCONTEXT_SHARED IOOP_WRITE.
+ *
+ * If a shared buffer added to the ring later because the
+ * current strategy buffer is pinned or in use or because all
+ * strategy buffers were dirty and rejected (for BAS_BULKREAD
+ * operations only) requires flushing, this is counted as an
+ * IOCONTEXT_SHARED IOOP_WRITE (from_ring will be false).
+ *
+ * When a strategy is not in use, the write can only be a
+ * "regular" write of a dirty shared buffer (IOCONTEXT_SHARED
+ * IOOP_WRITE).
+ */
+
/* OK, do the I/O */
TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(forkNum, blockNum,
smgr->smgr_rlocator.locator.spcOid,
smgr->smgr_rlocator.locator.dbOid,
smgr->smgr_rlocator.locator.relNumber);
- FlushBuffer(buf, NULL);
+ FlushBuffer(buf, NULL, io_context);
LWLockRelease(BufferDescriptorGetContentLock(buf));
ScheduleBufferTagForWriteback(&BackendWritebackContext,
@@ -1441,6 +1477,30 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
UnlockBufHdr(buf, buf_state);
+ if (oldFlags & BM_VALID)
+ {
+ /*
+ * When a BufferAccessStrategy is in use, evictions adding a
+ * shared buffer to the strategy ring are counted in the
+ * corresponding strategy's context. This includes the evictions
+ * done to add buffers to the ring initially as well as those
+ * done to add a new shared buffer to the ring when current
+ * buffer is pinned or otherwise in use.
+ *
+ * Blocks evicted from buffers already in the strategy ring are counted
+ * as IOCONTEXT_BULKREAD, IOCONTEXT_BULKWRITE, or IOCONTEXT_VACUUM
+ * reuses.
+ *
+ * We wait until this point to count reuses and evictions in order to
+ * avoid incorrectly counting a buffer as reused or evicted when it was
+ * released because it was concurrently pinned or in use or counting it
+ * as reused when it was rejected or when we errored out.
+ */
+ IOOp io_op = from_ring ? IOOP_REUSE : IOOP_EVICT;
+
+ pgstat_count_io_op(io_op, io_context);
+ }
+
if (oldPartitionLock != NULL)
{
BufTableDelete(&oldTag, oldHash);
@@ -2570,7 +2630,7 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOCONTEXT_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
@@ -2820,7 +2880,7 @@ BufferGetTag(Buffer buffer, RelFileLocator *rlocator, ForkNumber *forknum,
* as the second parameter. If not, pass NULL.
*/
static void
-FlushBuffer(BufferDesc *buf, SMgrRelation reln)
+FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOContext io_context)
{
XLogRecPtr recptr;
ErrorContextCallback errcallback;
@@ -2900,6 +2960,8 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
*/
bufToWrite = PageSetChecksumCopy((Page) bufBlock, buf->tag.blockNum);
+ pgstat_count_io_op(IOOP_WRITE, io_context);
+
if (track_io_timing)
INSTR_TIME_SET_CURRENT(io_start);
@@ -3551,6 +3613,8 @@ FlushRelationBuffers(Relation rel)
localpage,
false);
+ pgstat_count_io_op(IOOP_WRITE, IOCONTEXT_LOCAL);
+
buf_state &= ~(BM_DIRTY | BM_JUST_DIRTIED);
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
@@ -3586,7 +3650,7 @@ FlushRelationBuffers(Relation rel)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, RelationGetSmgr(rel));
+ FlushBuffer(bufHdr, RelationGetSmgr(rel), IOCONTEXT_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr);
}
@@ -3684,7 +3748,7 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, srelent->srel);
+ FlushBuffer(bufHdr, srelent->srel, IOCONTEXT_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr);
}
@@ -3894,7 +3958,7 @@ FlushDatabaseBuffers(Oid dbid)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOCONTEXT_SHARED);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr);
}
@@ -3921,7 +3985,7 @@ FlushOneBuffer(Buffer buffer)
Assert(LWLockHeldByMe(BufferDescriptorGetContentLock(bufHdr)));
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOCONTEXT_SHARED);
}
/*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 64728bd7ce..6eb2e00ae2 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -15,6 +15,7 @@
*/
#include "postgres.h"
+#include "pgstat.h"
#include "port/atomics.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
@@ -192,13 +193,15 @@ have_free_buffer(void)
* return the buffer with the buffer header spinlock still held.
*/
BufferDesc *
-StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
+StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_ring)
{
BufferDesc *buf;
int bgwprocno;
int trycounter;
uint32 local_buf_state; /* to avoid repeated (de-)referencing */
+ *from_ring = false;
+
/*
* If given a strategy object, see whether it can select a buffer. We
* assume strategy objects don't need buffer_strategy_lock.
@@ -207,7 +210,10 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
{
buf = GetBufferFromRing(strategy, buf_state);
if (buf != NULL)
+ {
+ *from_ring = true;
return buf;
+ }
}
/*
@@ -299,6 +305,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
if (strategy != NULL)
AddBufferToRing(strategy, buf);
*buf_state = local_buf_state;
+
return buf;
}
UnlockBufHdr(buf, local_buf_state);
@@ -331,6 +338,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
if (strategy != NULL)
AddBufferToRing(strategy, buf);
*buf_state = local_buf_state;
+
return buf;
}
}
@@ -596,7 +604,7 @@ FreeAccessStrategy(BufferAccessStrategy strategy)
/*
* GetBufferFromRing -- returns a buffer from the ring, or NULL if the
- * ring is empty.
+ * ring is empty / not usable.
*
* The bufhdr spin lock is held on the returned buffer.
*/
@@ -643,7 +651,13 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
/*
* Tell caller to allocate a new buffer with the normal allocation
* strategy. He'll then replace this ring element via AddBufferToRing.
+ *
+ * This counts as a "repossession" for the purposes of IO operation
+ * statistics tracking, since the reason that we no longer consider the
+ * current buffer to be part of the ring is that the block in it is in use
+ * outside of the ring, preventing us from reusing the buffer.
*/
+ pgstat_count_io_op(IOOP_REPOSSESS, IOContextForStrategy(strategy));
return NULL;
}
@@ -659,6 +673,37 @@ AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
strategy->buffers[strategy->current] = BufferDescriptorGetBuffer(buf);
}
+/*
+ * Utility function returning the IOContext of a given BufferAccessStrategy's
+ * strategy ring.
+ */
+IOContext
+IOContextForStrategy(BufferAccessStrategy strategy)
+{
+ if (!strategy)
+ return IOCONTEXT_SHARED;
+
+ switch (strategy->btype)
+ {
+ case BAS_NORMAL:
+ /*
+ * Currently, GetAccessStrategy() returns NULL for
+ * BufferAccessStrategyType BAS_NORMAL, so this case is
+ * unreachable.
+ */
+ pg_unreachable();
+ return IOCONTEXT_SHARED;
+ case BAS_BULKREAD:
+ return IOCONTEXT_BULKREAD;
+ case BAS_BULKWRITE:
+ return IOCONTEXT_BULKWRITE;
+ case BAS_VACUUM:
+ return IOCONTEXT_VACUUM;
+ }
+
+ elog(ERROR, "unrecognized BufferAccessStrategyType: %d", strategy->btype);
+}
+
/*
* StrategyRejectBuffer -- consider rejecting a dirty buffer
*
@@ -688,5 +733,7 @@ StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool from_r
*/
strategy->buffers[strategy->current] = InvalidBuffer;
+ pgstat_count_io_op(IOOP_REJECT, IOContextForStrategy(strategy));
+
return true;
}
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 30d67d1c40..cb9685564f 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -18,6 +18,7 @@
#include "access/parallel.h"
#include "catalog/catalog.h"
#include "executor/instrument.h"
+#include "pgstat.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "utils/guc_hooks.h"
@@ -196,6 +197,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
LocalRefCount[b]++;
ResourceOwnerRememberBuffer(CurrentResourceOwner,
BufferDescriptorGetBuffer(bufHdr));
+
break;
}
}
@@ -226,6 +228,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
localpage,
false);
+ pgstat_count_io_op(IOOP_WRITE, IOCONTEXT_LOCAL);
+
/* Mark not-dirty now in case we error out below */
buf_state &= ~BM_DIRTY;
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
@@ -256,6 +260,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
ClearBufferTag(&bufHdr->tag);
buf_state &= ~(BM_VALID | BM_TAG_VALID);
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+ pgstat_count_io_op(IOOP_EVICT, IOCONTEXT_LOCAL);
}
hresult = (LocalBufferLookupEnt *)
@@ -275,6 +280,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
*foundPtr = false;
+
return bufHdr;
}
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 9d6a9e9109..5718b52fb5 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -432,6 +432,8 @@ ProcessSyncRequests(void)
total_elapsed += elapsed;
processed++;
+ pgstat_count_io_op(IOOP_FSYNC, IOCONTEXT_SHARED);
+
if (log_checkpoints)
elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f ms",
processed,
diff --git a/src/backend/utils/activity/Makefile b/src/backend/utils/activity/Makefile
index a2e8507fd6..0098785089 100644
--- a/src/backend/utils/activity/Makefile
+++ b/src/backend/utils/activity/Makefile
@@ -22,6 +22,7 @@ OBJS = \
pgstat_checkpointer.o \
pgstat_database.o \
pgstat_function.o \
+ pgstat_io_ops.o \
pgstat_relation.o \
pgstat_replslot.o \
pgstat_shmem.o \
diff --git a/src/backend/utils/activity/meson.build b/src/backend/utils/activity/meson.build
index 5b3b558a67..1038324c32 100644
--- a/src/backend/utils/activity/meson.build
+++ b/src/backend/utils/activity/meson.build
@@ -7,6 +7,7 @@ backend_sources += files(
'pgstat_checkpointer.c',
'pgstat_database.c',
'pgstat_function.c',
+ 'pgstat_io_ops.c',
'pgstat_relation.c',
'pgstat_replslot.c',
'pgstat_shmem.c',
diff --git a/src/backend/utils/activity/pgstat_io_ops.c b/src/backend/utils/activity/pgstat_io_ops.c
new file mode 100644
index 0000000000..6f9c250907
--- /dev/null
+++ b/src/backend/utils/activity/pgstat_io_ops.c
@@ -0,0 +1,255 @@
+/* -------------------------------------------------------------------------
+ *
+ * pgstat_io_ops.c
+ * Implementation of IO operation statistics.
+ *
+ * This file contains the implementation of IO operation statistics. It is kept
+ * separate from pgstat.c to enforce the line between the statistics access /
+ * storage implementation and the details about individual types of
+ * statistics.
+ *
+ * Copyright (c) 2021-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/utils/activity/pgstat_io_ops.c
+ * -------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "utils/pgstat_internal.h"
+
+static PgStat_IOContextOps pending_IOOpStats;
+
+void
+pgstat_count_io_op(IOOp io_op, IOContext io_context)
+{
+ PgStat_IOOpCounters *pending_counters;
+
+ Assert(io_context < IOCONTEXT_NUM_TYPES);
+ Assert(io_op < IOOP_NUM_TYPES);
+ Assert(pgstat_expect_io_op(MyBackendType, io_context, io_op));
+
+ pending_counters = &pending_IOOpStats.data[io_context];
+
+ switch (io_op)
+ {
+ case IOOP_EVICT:
+ pending_counters->evictions++;
+ break;
+ case IOOP_EXTEND:
+ pending_counters->extends++;
+ break;
+ case IOOP_FSYNC:
+ pending_counters->fsyncs++;
+ break;
+ case IOOP_READ:
+ pending_counters->reads++;
+ break;
+ case IOOP_REPOSSESS:
+ pending_counters->repossessions++;
+ break;
+ case IOOP_REJECT:
+ pending_counters->rejections++;
+ break;
+ case IOOP_REUSE:
+ pending_counters->reuses++;
+ break;
+ case IOOP_WRITE:
+ pending_counters->writes++;
+ break;
+ }
+
+}
+
+const char *
+pgstat_io_context_desc(IOContext io_context)
+{
+ switch (io_context)
+ {
+ case IOCONTEXT_BULKREAD:
+ return "bulkread";
+ case IOCONTEXT_BULKWRITE:
+ return "bulkwrite";
+ case IOCONTEXT_LOCAL:
+ return "local";
+ case IOCONTEXT_SHARED:
+ return "shared";
+ case IOCONTEXT_VACUUM:
+ return "vacuum";
+ }
+
+ elog(ERROR, "unrecognized IOContext value: %d", io_context);
+}
+
+const char *
+pgstat_io_op_desc(IOOp io_op)
+{
+ switch (io_op)
+ {
+ case IOOP_EVICT:
+ return "evicted";
+ case IOOP_EXTEND:
+ return "extended";
+ case IOOP_FSYNC:
+ return "files synced";
+ case IOOP_READ:
+ return "read";
+ case IOOP_REPOSSESS:
+ return "repossessed";
+ case IOOP_REJECT:
+ return "rejected";
+ case IOOP_REUSE:
+ return "reused";
+ case IOOP_WRITE:
+ return "written";
+ }
+
+ elog(ERROR, "unrecognized IOOp value: %d", io_op);
+}
+
+/*
+ * IO operation statistics are not collected for all BackendTypes.
+ *
+ * The following BackendTypes either do not participate in the cumulative
+ * stats subsystem or do not do IO operations worth reporting statistics on:
+ * - Syslogger, because it is not connected to shared memory
+ * - Archiver, because most relevant archiving IO is delegated to a
+ *   specialized command or module
+ * - WAL Receiver and WAL Writer, whose IO is not tracked for now
+ *
+ * Returns true if the given BackendType participates in the cumulative stats
+ * subsystem for IO operations, and false otherwise.
+*/
+bool
+pgstat_io_op_stats_collected(BackendType bktype)
+{
+ return bktype != B_INVALID && bktype != B_ARCHIVER && bktype != B_LOGGER &&
+ bktype != B_WAL_RECEIVER && bktype != B_WAL_WRITER;
+}
+
+/*
+ * Some BackendTypes do not perform IO operations in certain IOContexts. Check
+ * that the given BackendType is expected to do IO in the given IOContext.
+ */
+bool
+pgstat_bktype_io_context_valid(BackendType bktype, IOContext io_context)
+{
+ bool no_local;
+
+ /*
+ * In core Postgres, only regular backends and WAL Sender processes
+ * executing queries should use local buffers. Parallel workers will not
+ * use local buffers (see InitLocalBuffers()); however, extensions
+ * leveraging background workers have no such limitation, so track IO
+ * Operations in IOCONTEXT_LOCAL for BackendType B_BG_WORKER.
+ */
+ no_local = bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER || bktype
+ == B_CHECKPOINTER || bktype == B_AUTOVAC_WORKER || bktype ==
+ B_STANDALONE_BACKEND || bktype == B_STARTUP;
+
+ if (io_context == IOCONTEXT_LOCAL && no_local)
+ return false;
+
+ /*
+ * Some BackendTypes do not currently perform any IO operations in certain
+ * IOContexts, and, while it may not be inherently incorrect for them to
+ * do so, excluding those rows from the view makes the view easier to use.
+ */
+ if ((io_context == IOCONTEXT_BULKREAD || io_context == IOCONTEXT_BULKWRITE
+ || io_context == IOCONTEXT_VACUUM) && (bktype == B_CHECKPOINTER
+ || bktype == B_BG_WRITER))
+ return false;
+
+ if (io_context == IOCONTEXT_VACUUM && bktype == B_AUTOVAC_LAUNCHER)
+ return false;
+
+ if (io_context == IOCONTEXT_BULKWRITE && (bktype == B_AUTOVAC_WORKER ||
+ bktype == B_AUTOVAC_LAUNCHER))
+ return false;
+
+ return true;
+}
+
+/*
+ * Some BackendTypes will never do certain IOOps and some IOOps should not
+ * occur in certain IOContexts. Check that the given IOOp is valid for the
+ * given BackendType in the given IOContext. Note that there are currently no
+ * cases of an IOOp being invalid for a particular BackendType only within a
+ * certain IOContext.
+ */
+bool
+pgstat_io_op_valid(BackendType bktype, IOContext io_context, IOOp io_op)
+{
+ bool strategy_io_context;
+
+ /*
+ * Some BackendTypes should never track IO Operation statistics.
+ */
+ Assert(pgstat_io_op_stats_collected(bktype));
+
+ /*
+ * Some BackendTypes will not do certain IOOps.
+ */
+ if ((bktype == B_BG_WRITER || bktype == B_CHECKPOINTER) &&
+ (io_op == IOOP_READ || io_op == IOOP_EVICT))
+ return false;
+
+ if ((bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER || bktype ==
+ B_CHECKPOINTER) && io_op == IOOP_EXTEND)
+ return false;
+
+ /*
+ * Some IOOps are not valid in certain IOContexts and some IOOps are only
+ * valid in certain contexts.
+ */
+ if (io_context == IOCONTEXT_BULKREAD && io_op == IOOP_EXTEND)
+ return false;
+
+ /*
+ * Only BAS_BULKREAD will reject strategy buffers
+ */
+ if (io_context != IOCONTEXT_BULKREAD && io_op == IOOP_REJECT)
+ return false;
+
+
+ strategy_io_context = io_context == IOCONTEXT_BULKREAD || io_context ==
+ IOCONTEXT_BULKWRITE || io_context == IOCONTEXT_VACUUM;
+
+ /*
+ * IOOP_REPOSSESS and IOOP_REUSE are only relevant when a
+ * BufferAccessStrategy is in use.
+ */
+ if (!strategy_io_context && (io_op == IOOP_REJECT || io_op ==
+ IOOP_REPOSSESS || io_op == IOOP_REUSE))
+ return false;
+
+ /*
+ * Temporary tables using local buffers are not logged and thus do not
+ * require fsync'ing.
+ *
+ * IOOP_FSYNC IOOps done by a backend using a BufferAccessStrategy are
+ * counted in the IOCONTEXT_SHARED IOContext. See comment in
+ * ForwardSyncRequest() for more details.
+ */
+ if ((io_context == IOCONTEXT_LOCAL || strategy_io_context) &&
+ io_op == IOOP_FSYNC)
+ return false;
+
+ return true;
+}
+
+bool
+pgstat_expect_io_op(BackendType bktype, IOContext io_context, IOOp io_op)
+{
+ if (!pgstat_io_op_stats_collected(bktype))
+ return false;
+
+ if (!pgstat_bktype_io_context_valid(bktype, io_context))
+ return false;
+
+ if (!(pgstat_io_op_valid(bktype, io_context, io_op)))
+ return false;
+
+ return true;
+}
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 9e2ce6f011..5883aafe9c 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -14,6 +14,7 @@
#include "datatype/timestamp.h"
#include "portability/instr_time.h"
#include "postmaster/pgarch.h" /* for MAX_XFN_CHARS */
+#include "storage/buf.h"
#include "utils/backend_progress.h" /* for backward compatibility */
#include "utils/backend_status.h" /* for backward compatibility */
#include "utils/relcache.h"
@@ -276,6 +277,55 @@ typedef struct PgStat_CheckpointerStats
PgStat_Counter buf_fsync_backend;
} PgStat_CheckpointerStats;
+/*
+ * Types related to counting IO Operations for various IO Contexts
+ * When adding a new value, ensure that the proper assertions are added to
+ * pgstat_io_context_ops_assert_zero() and pgstat_io_op_assert_zero() (though
+ * the compiler will remind you about the latter)
+ */
+
+typedef enum IOOp
+{
+ IOOP_EVICT = 0,
+ IOOP_EXTEND,
+ IOOP_FSYNC,
+ IOOP_READ,
+ IOOP_REJECT,
+ IOOP_REPOSSESS,
+ IOOP_REUSE,
+ IOOP_WRITE,
+} IOOp;
+
+#define IOOP_NUM_TYPES (IOOP_WRITE + 1)
+
+typedef enum IOContext
+{
+ IOCONTEXT_BULKREAD = 0,
+ IOCONTEXT_BULKWRITE,
+ IOCONTEXT_LOCAL,
+ IOCONTEXT_SHARED,
+ IOCONTEXT_VACUUM,
+} IOContext;
+
+#define IOCONTEXT_NUM_TYPES (IOCONTEXT_VACUUM + 1)
+
+typedef struct PgStat_IOOpCounters
+{
+ PgStat_Counter evictions;
+ PgStat_Counter extends;
+ PgStat_Counter fsyncs;
+ PgStat_Counter reads;
+ PgStat_Counter rejections;
+ PgStat_Counter reuses;
+ PgStat_Counter repossessions;
+ PgStat_Counter writes;
+} PgStat_IOOpCounters;
+
+typedef struct PgStat_IOContextOps
+{
+ PgStat_IOOpCounters data[IOCONTEXT_NUM_TYPES];
+} PgStat_IOContextOps;
+
typedef struct PgStat_StatDBEntry
{
PgStat_Counter n_xact_commit;
@@ -453,6 +503,24 @@ extern void pgstat_report_checkpointer(void);
extern PgStat_CheckpointerStats *pgstat_fetch_stat_checkpointer(void);
+/*
+ * Functions in pgstat_io_ops.c
+ */
+
+extern void pgstat_count_io_op(IOOp io_op, IOContext io_context);
+extern const char *pgstat_io_context_desc(IOContext io_context);
+extern const char *pgstat_io_op_desc(IOOp io_op);
+
+/* Validation functions in pgstat_io_ops.c */
+extern bool pgstat_io_op_stats_collected(BackendType bktype);
+extern bool pgstat_bktype_io_context_valid(BackendType bktype, IOContext io_context);
+extern bool pgstat_io_op_valid(BackendType bktype, IOContext io_context, IOOp io_op);
+extern bool pgstat_expect_io_op(BackendType bktype, IOContext io_context, IOOp io_op);
+
+/* IO stats translation function in freelist.c */
+extern IOContext IOContextForStrategy(BufferAccessStrategy bas);
+
+
/*
* Functions in pgstat_database.c
*/
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index b75481450d..7b67250747 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -392,7 +392,7 @@ extern void ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *
/* freelist.c */
extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
- uint32 *buf_state);
+ uint32 *buf_state, bool *from_ring);
extern void StrategyFreeBuffer(BufferDesc *buf);
extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
BufferDesc *buf, bool from_ring);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 6f4dfa0960..d0eed71f63 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -23,7 +23,12 @@
typedef void *Block;
-/* Possible arguments for GetAccessStrategy() */
+/*
+ * Possible arguments for GetAccessStrategy().
+ *
+ * If adding a new BufferAccessStrategyType, also add a new IOContext so
+ * statistics on IO operations using this strategy are tracked.
+ */
typedef enum BufferAccessStrategyType
{
BAS_NORMAL, /* Normal random access */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 2f02cc8f42..b080367073 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1106,7 +1106,9 @@ ID
INFIX
INT128
INTERFACE_INFO
+IOContext
IOFuncSelector
+IOOp
IPCompareMethod
ITEM
IV
@@ -2026,6 +2028,8 @@ PgStat_FetchConsistency
PgStat_FunctionCallUsage
PgStat_FunctionCounts
PgStat_HashKey
+PgStat_IOContextOps
+PgStat_IOOpCounters
PgStat_Kind
PgStat_KindInfo
PgStat_LocalState
--
2.34.1
Attachment: v36-0004-Add-system-view-tracking-IO-ops-per-backend-type.patch (text/x-patch)
From 2bb6195640ec5f04dad43e276b4f2801bd5b76ab Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Thu, 6 Oct 2022 12:24:42 -0400
Subject: [PATCH v36 4/5] Add system view tracking IO ops per backend type
Add pg_stat_io, a system view which tracks the number of IOOps (
evictions, reuses, rejections, repossessions, reads, writes, extends,
and fsyncs) done through each IOContext (shared buffers, local buffers,
and buffers reserved by a BufferAccessStrategy) by each type of backend
(e.g. client backend, checkpointer).
Some BackendTypes do not accumulate IO operations statistics and will
not be included in the view.
Some IOContexts are not used by some BackendTypes and will not be in the
view. For example, checkpointer does not use a BufferAccessStrategy
(currently), so there will be no rows for BufferAccessStrategy
IOContexts for checkpointer.
Some IOOps are invalid in combination with certain IOContexts. Those
cells will be NULL in the view to distinguish between 0 observed IOOps
of that type and an invalid combination. For example, local buffers are
not fsynced so cells for all BackendTypes for IOCONTEXT_LOCAL and
IOOP_FSYNC will be NULL.
Some BackendTypes never perform certain IOOps. Those cells will also be
NULL in the view. For example, bgwriter should not perform reads.
View stats are populated with statistics incremented when a backend
performs an IO Operation and maintained by the cumulative statistics
subsystem.
Each row of the view shows stats for a particular BackendType and
IOContext combination (e.g. shared buffer accesses by checkpointer) and
each column in the view is the total number of IO Operations done (e.g.
writes).
So a cell in the view would be, for example, the number of shared
buffers written by checkpointer since the last stats reset.
In anticipation of tracking WAL IO and non-block-oriented IO (such as
temporary file IO), the "unit" column specifies the unit of the "read",
"written", and "extended" columns for a given row.
Note that some of the cells in the view are redundant with fields in
pg_stat_bgwriter (e.g. buffers_backend), however these have been kept in
pg_stat_bgwriter for backwards compatibility. Deriving the redundant
pg_stat_bgwriter stats from the IO operations stats structures was also
problematic due to the separate reset targets for 'bgwriter' and 'io'.
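The documented overlap with pg_stat_bgwriter.buffers_backend could be
checked with a query like this (a sketch; the totals only line up if the
'bgwriter' and 'io' stats were last reset at the same time):

```sql
-- Sum of backend-type writes/extends that buffers_backend also counts
SELECT sum(written + extended) AS io_view_total,
       (SELECT buffers_backend FROM pg_stat_bgwriter) AS bgwriter_total
  FROM pg_stat_io
 WHERE backend_type IN ('client backend', 'autovacuum worker',
                        'background worker', 'walsender')
   AND io_context IN ('shared', 'bulkread', 'bulkwrite', 'vacuum');
```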
Suggested by Andres Freund
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Maciek Sakrejda <m.sakrejda@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
doc/src/sgml/monitoring.sgml | 384 +++++++++++++++++++++++++--
src/backend/catalog/system_views.sql | 16 ++
src/backend/utils/adt/pgstatfuncs.c | 139 ++++++++++
src/include/catalog/pg_proc.dat | 9 +
src/test/regress/expected/rules.out | 13 +
src/test/regress/expected/stats.out | 224 ++++++++++++++++
src/test/regress/sql/stats.sql | 123 +++++++++
7 files changed, 892 insertions(+), 16 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 698f274341..de0850337b 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -448,6 +448,15 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</entry>
</row>
+ <row>
+ <entry><structname>pg_stat_io</structname><indexterm><primary>pg_stat_io</primary></indexterm></entry>
+ <entry>A row for each IO context for each backend type, showing
+ statistics about backend IO operations. See
+ <link linkend="monitoring-pg-stat-io-view">
+ <structname>pg_stat_io</structname></link> for details.
+ </entry>
+ </row>
+
<row>
<entry><structname>pg_stat_wal</structname><indexterm><primary>pg_stat_wal</primary></indexterm></entry>
<entry>One row only, showing statistics about WAL activity. See
@@ -658,20 +667,20 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</para>
<para>
- The <structname>pg_statio_</structname> views are primarily useful to
- determine the effectiveness of the buffer cache. When the number
- of actual disk reads is much smaller than the number of buffer
- hits, then the cache is satisfying most read requests without
- invoking a kernel call. However, these statistics do not give the
- entire story: due to the way in which <productname>PostgreSQL</productname>
- handles disk I/O, data that is not in the
- <productname>PostgreSQL</productname> buffer cache might still reside in the
- kernel's I/O cache, and might therefore still be fetched without
- requiring a physical read. Users interested in obtaining more
- detailed information on <productname>PostgreSQL</productname> I/O behavior are
- advised to use the <productname>PostgreSQL</productname> statistics views
- in combination with operating system utilities that allow insight
- into the kernel's handling of I/O.
+ The <structname>pg_statio_</structname> and
+ <structname>pg_stat_io</structname> views are primarily useful to determine
+ the effectiveness of the buffer cache. When the number of actual disk reads
+ is much smaller than the number of buffer hits, then the cache is satisfying
+ most read requests without invoking a kernel call. However, these statistics
+ do not give the entire story: due to the way in which
+ <productname>PostgreSQL</productname> handles disk I/O, data that is not in
+ the <productname>PostgreSQL</productname> buffer cache might still reside in
+ the kernel's I/O cache, and might therefore still be fetched without
+ requiring a physical read. Users interested in obtaining more detailed
+ information on <productname>PostgreSQL</productname> I/O behavior are
+ advised to use the <productname>PostgreSQL</productname> statistics views in
+ combination with operating system utilities that allow insight into the
+ kernel's handling of I/O.
</para>
</sect2>
@@ -3600,13 +3609,12 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
<structfield>stats_reset</structfield> <type>timestamp with time zone</type>
</para>
<para>
- Time at which these statistics were last reset
+ Time at which these statistics were last reset.
</para></entry>
</row>
</tbody>
</tgroup>
</table>
-
<para>
Normally, WAL files are archived in order, oldest to newest, but that is
not guaranteed, and does not hold under special circumstances like when
@@ -3615,7 +3623,351 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
<structfield>last_archived_wal</structfield> have also been successfully
archived.
</para>
+ </sect2>
+
+ <sect2 id="monitoring-pg-stat-io-view">
+ <title><structname>pg_stat_io</structname></title>
+
+ <indexterm>
+ <primary>pg_stat_io</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_stat_io</structname> view has a row for each backend type
+ and IO context containing global data for the cluster on IO operations done
+ by that backend type in that IO context. Currently, only a subset of IO
+ operations is tracked here. WAL IO, IO on temporary files, and some forms
+ of IO outside of shared buffers (such as when building indexes or moving a
+ table from one tablespace to another) could be added in the future.
+ </para>
+
+ <table id="pg-stat-io-view" xreflabel="pg_stat_io">
+ <title><structname>pg_stat_io</structname> View</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>backend_type</structfield> <type>text</type>
+ </para>
+ <para>
+ Type of backend (e.g. background worker, autovacuum worker).
+ See <link linkend="monitoring-pg-stat-activity-view">
+ <structname>pg_stat_activity</structname></link> for more information on
+ <varname>backend_type</varname>s.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>io_context</structfield> <type>text</type>
+ </para>
+ <para>
+ The context or location of an IO operation.
+ <varname>io_context</varname> <literal>shared</literal> refers to IO
+ operations on data in shared buffers, the primary buffer pool for
+ relation data. <varname>io_context</varname> <literal>local</literal>
+ refers to IO operations on process-local memory used for temporary
+ tables. <varname>io_context</varname> <literal>vacuum</literal> refers
+ to the IO operations incurred while vacuuming and analyzing.
+ <varname>io_context</varname> <literal>bulkread</literal> refers to IO
+ operations specially designated as <literal>bulk reads</literal>, such
+ as the sequential scan of a large table. <varname>io_context</varname>
+ <literal>bulkwrite</literal> refers to IO operations specially
+ designated as <literal>bulk writes</literal>, such as
+ <command>COPY</command>.
+ </para>
+
+ <para>
+ These last three <varname>io_context</varname>s are counted separately
+ because the autovacuum daemon, explicit <command>VACUUM</command>,
+ explicit <command>ANALYZE</command>, many bulk reads, and many bulk
+ writes use a fixed amount of memory, acquiring the equivalent number of
+ shared buffers and reusing them circularly to avoid occupying an undue
+ portion of the main shared buffer pool. This pattern is called a
+ <quote>Buffer Access Strategy</quote> in the
+ <productname>PostgreSQL</productname> source code and the fixed-size
+ ring buffer is referred to as a <quote>strategy ring buffer</quote> for
+ the purposes of this view's documentation.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>read</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Reads by this <varname>backend_type</varname> into buffers in this
+ <varname>io_context</varname>.
+ </para>
+ <para>
+ <varname>read</varname> and <varname>extended</varname> for
+ <varname>backend_type</varname>s <literal>autovacuum launcher</literal>,
+ <literal>autovacuum worker</literal>, <literal>client backend</literal>,
+ <literal>standalone backend</literal>, <literal>background
+ worker</literal>, and <literal>walsender</literal> for all
+ <varname>io_context</varname>s is similar to the sum of
+ <varname>heap_blks_read</varname>, <varname>idx_blks_read</varname>,
+ <varname>tidx_blks_read</varname>, and
+ <varname>toast_blks_read</varname> in <link
+ linkend="monitoring-pg-statio-all-tables-view">
+ <structname>pg_statio_all_tables</structname></link> and
+ <varname>blks_read</varname> from <link
+ linkend="monitoring-pg-stat-database-view">
+ <structname>pg_stat_database</structname></link>. The difference is that
+ reads done as part of <command>CREATE DATABASE</command> are not counted
+ in <structname>pg_statio_all_tables</structname> and
+ <structname>pg_stat_database</structname>.
+ </para>
+ <para>If using the <productname>PostgreSQL</productname> extension,
+ <xref linkend="pgstatstatements"/>,
+ <varname>read</varname> for
+ <varname>backend_type</varname>s <literal>autovacuum launcher</literal>,
+ <literal>autovacuum worker</literal>, <literal>client backend</literal>,
+ <literal>standalone backend</literal>, <literal>background
+ worker</literal>, and <literal>walsender</literal> for all
+ <varname>io_context</varname>s is equivalent to
+ <varname>shared_blks_read</varname> together with
+ <varname>local_blks_read</varname>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>written</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Writes of data in this <varname>io_context</varname> by this
+ <varname>backend_type</varname>.
+ </para>
+
+ <para>
+ Normal client backends should be able to rely on maintenance processes
+ like the checkpointer and background writer to write out dirty data as
+ much as possible. Large numbers of writes by
+ <varname>backend_type</varname> <literal>client backend</literal> in
+ <varname>io_context</varname> <literal>shared</literal> could indicate a
+ misconfiguration of shared buffers or of checkpointer. More information
+ on checkpointer configuration can be found in <xref
+ linkend="wal-configuration"/>.
+ </para>
+
+ <para>Note that the values of <varname>written</varname> for
+ <varname>backend_type</varname> <literal>background writer</literal> and
+ <varname>backend_type</varname> <literal>checkpointer</literal> are
+ equivalent to the values of <varname>buffers_clean</varname> and
+ <varname>buffers_checkpoint</varname>, respectively, in <link
+ linkend="monitoring-pg-stat-bgwriter-view">
+ <structname>pg_stat_bgwriter</structname></link>. Also, the sum of
+ <varname>written</varname> and <varname>extended</varname> in this view
+ for <varname>backend_type</varname>s <literal>client backend</literal>,
+ <literal>autovacuum worker</literal>, <literal>background
+ worker</literal>, and <literal>walsender</literal> in
+ <varname>io_context</varname>s <literal>shared</literal>,
+ <literal>bulkread</literal>, <literal>bulkwrite</literal>, and
+ <literal>vacuum</literal> is equivalent to
+ <varname>buffers_backend</varname> in
+ <structname>pg_stat_bgwriter</structname>.
+ </para>
+
+ <para>If using the <productname>PostgreSQL</productname> extension,
+ <xref linkend="pgstatstatements"/>, <varname>written</varname> and
+ <varname>extended</varname> for <varname>backend_type</varname>s
+ <literal>autovacuum launcher</literal>, <literal>autovacuum
+ worker</literal>, <literal>client backend</literal>, <literal>standalone
+ backend</literal>, <literal>background worker</literal>, and
+ <literal>walsender</literal> for all <varname>io_context</varname>s is
+ equivalent to <varname>shared_blks_written</varname> together with
+ <varname>local_blks_written</varname>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>extended</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Extends of relations done by this <varname>backend_type</varname> in
+ order to write data in this <varname>io_context</varname>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>bytes_conversion</structfield> <type>bigint</type>
+ </para>
+ <para>
+ The number of bytes per unit of IO read, written, or extended. For
+ block-oriented IO of relation data, reads, writes, and extends are done
+ in <varname>block_size</varname> units, derived from the build-time
+ parameter <symbol>BLCKSZ</symbol>, which is <literal>8192</literal> by
+ default. Future values could include those derived from
+ <symbol>XLOG_BLCKSZ</symbol>, once WAL IO is tracked in this view, and
+ constant multipliers once non-block-oriented IO (e.g. temporary file IO)
+ is tracked here.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>evicted</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of times a <varname>backend_type</varname> has evicted a block
+ from a shared or local buffer in order to reuse the buffer in this
+ <varname>io_context</varname>. Blocks are only evicted when there are no
+ unoccupied buffers.
+ </para>
+
+ <para>
+ <varname>evicted</varname> in <varname>io_context</varname>
+ <literal>shared</literal> counts the number of times a block from a
+ shared buffer was evicted so that it can be replaced with another block,
+ also in shared buffers.
+
+ A high <varname>evicted</varname> count in <varname>io_context</varname>
+ <literal>shared</literal> could indicate that shared buffers is too
+ small and should be set to a larger value.
+ </para>
+
+ <para>
+ <varname>evicted</varname> in <varname>io_context</varname>
+ <literal>vacuum</literal>, <literal>bulkread</literal>, and
+ <literal>bulkwrite</literal> counts the number of times occupied shared
+ buffers were added to the fixed-size strategy ring buffer, causing the
+ buffer contents to be evicted. If the current buffer in the ring is
+ pinned or in use by another backend, it may be replaced by a new shared
+ buffer. If this shared buffer contains valid data, that block must be
+ evicted and will count as <varname>evicted</varname>.
+
+ In <varname>io_context</varname> <literal>bulkread</literal>, existing
+ dirty buffers in the ring requiring flush are
+ <varname>rejected</varname>. If all of the buffers in the strategy ring
+ have been <varname>rejected</varname>, a new shared buffer will be added
+ to the ring. If the new shared buffer is occupied, its contents will
+ need to be evicted.
+
+ Seeing a large number of <varname>evicted</varname> in strategy
+ <varname>io_context</varname>s can provide insight into primary working
+ set cache misses.
+ </para>
+
+ <para>
+ <varname>evicted</varname> in <varname>io_context</varname>
+ <literal>local</literal> counts the number of times a block of data from
+ an existing local buffer was evicted in order to replace it with another
+ block, also in local buffers.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>reused</structfield> <type>bigint</type>
+ </para>
+ <para>
+ The number of times an existing buffer in the strategy ring was reused
+ as part of an operation in the <literal>bulkread</literal>,
+ <literal>bulkwrite</literal>, or <literal>vacuum</literal>
+ <varname>io_context</varname>s. When a <quote>Buffer Access
+ Strategy</quote> reuses a buffer in the strategy ring, it must evict its
+ contents, incrementing <varname>reused</varname>. When a <quote>Buffer
+ Access Strategy</quote> adds a new shared buffer to the strategy ring
+ and this shared buffer is occupied, the <quote>Buffer Access
+ Strategy</quote> must evict the contents of the shared buffer,
+ incrementing <varname>evicted</varname>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>rejected</structfield> <type>bigint</type>
+ </para>
+ <para>
+ The number of times a <literal>bulkread</literal> found the current
+ buffer in the fixed-size strategy ring dirty and requiring flush.
+ <quote>Rejecting</quote> the buffer effectively removes it from the
+ strategy ring buffer allowing the slot in the ring to be replaced in the
+ future with a new shared buffer. A high number of
+ <literal>bulkread</literal> rejections can indicate a need for more
+ frequent vacuuming or more aggressive autovacuum settings, as buffers are
+ dirtied during a bulkread operation when setting hint bits or when
+ performing on-access pruning.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>repossessed</structfield> <type>bigint</type>
+ </para>
+ <para>
+ The number of times a buffer in the fixed-size ring buffer used by
+ operations in the <literal>bulkread</literal>,
+ <literal>bulkwrite</literal>, and <literal>vacuum</literal>
+ <varname>io_context</varname>s was removed from that ring buffer because
+ it was pinned or in use by another backend and thus could not have its
+ tenant block evicted so it could be reused. Once removed from the
+ strategy ring, this buffer is a <quote>normal</quote> shared buffer
+ again. A high number of repossessions is a sign of contention for the
+ blocks operated on by the strategy operation.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>files_synced</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of files fsynced by this <varname>backend_type</varname> for the
+ purpose of persisting data dirtied in this
+ <varname>io_context</varname>. <literal>fsyncs</literal> are done at
+ segment boundaries, so <varname>bytes_conversion</varname> does not apply to the
+ <varname>files_synced</varname> column. <literal>fsyncs</literal> done
+ by backends in order to persist data written in
+ <varname>io_context</varname> <literal>vacuum</literal>,
+ <varname>io_context</varname> <literal>bulkread</literal>, or
+ <varname>io_context</varname> <literal>bulkwrite</literal> are counted
+ as an <varname>io_context</varname> <literal>shared</literal>
+ <literal>fsync</literal>.
+ </para>
+ <para>
+ Normal client backends should be able to rely on the checkpointer to
+ ensure data is persisted to permanent storage. Large numbers of
+ <varname>files_synced</varname> by <varname>backend_type</varname>
+ <literal>client backend</literal> could indicate a misconfiguration of
+ shared buffers or of checkpointer. More information on checkpointer
+ configuration can be found in <xref linkend="wal-configuration"/>.
+ </para>
+
+ <para>
+ Note that the sum of <varname>files_synced</varname> in
+ <varname>io_context</varname> <literal>shared</literal> across all
+ <varname>backend_type</varname>s except <literal>checkpointer</literal>
+ is equivalent to <varname>buffers_backend_fsync</varname> in
+ <link linkend="monitoring-pg-stat-bgwriter-view"> <structname>pg_stat_bgwriter</structname></link>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
+ </para>
+ <para>
+ Time at which these statistics were last reset.
+ </para></entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
</sect2>
<sect2 id="monitoring-pg-stat-bgwriter-view">
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 2d8104b090..571c422f73 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1117,6 +1117,22 @@ CREATE VIEW pg_stat_bgwriter AS
pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+CREATE VIEW pg_stat_io AS
+SELECT
+ b.backend_type,
+ b.io_context,
+ b.read,
+ b.written,
+ b.extended,
+ b.bytes_conversion,
+ b.evicted,
+ b.reused,
+ b.rejected,
+ b.repossessed,
+ b.files_synced,
+ b.stats_reset
+FROM pg_stat_get_io() b;
+
CREATE VIEW pg_stat_wal AS
SELECT
w.wal_records,
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index b783af130c..5bd39733b6 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -29,6 +29,7 @@
#include "storage/procarray.h"
#include "utils/acl.h"
#include "utils/builtins.h"
+#include "utils/guc.h"
#include "utils/inet.h"
#include "utils/timestamp.h"
@@ -1725,6 +1726,144 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
}
+/*
+ * When adding a new column to the pg_stat_io view, add a new enum value
+ * here above IO_NUM_COLUMNS.
+ */
+typedef enum io_stat_col
+{
+ IO_COL_BACKEND_TYPE,
+ IO_COL_IO_CONTEXT,
+ IO_COL_READS,
+ IO_COL_WRITES,
+ IO_COL_EXTENDS,
+ IO_COL_CONVERSION,
+ IO_COL_EVICTIONS,
+ IO_COL_REUSES,
+ IO_COL_REJECTIONS,
+ IO_COL_REPOSSESSIONS,
+ IO_COL_FSYNCS,
+ IO_COL_RESET_TIME,
+ IO_NUM_COLUMNS,
+} io_stat_col;
+
+/*
+ * When adding a new IOOp, add a new io_stat_col and add a case to this
+ * function returning the corresponding io_stat_col.
+ */
+static io_stat_col
+pgstat_io_op_get_index(IOOp io_op)
+{
+ switch (io_op)
+ {
+ case IOOP_EVICT:
+ return IO_COL_EVICTIONS;
+ case IOOP_READ:
+ return IO_COL_READS;
+ case IOOP_REUSE:
+ return IO_COL_REUSES;
+ case IOOP_REJECT:
+ return IO_COL_REJECTIONS;
+ case IOOP_REPOSSESS:
+ return IO_COL_REPOSSESSIONS;
+ case IOOP_WRITE:
+ return IO_COL_WRITES;
+ case IOOP_EXTEND:
+ return IO_COL_EXTENDS;
+ case IOOP_FSYNC:
+ return IO_COL_FSYNCS;
+ }
+
+ elog(ERROR, "unrecognized IOOp value: %d", io_op);
+}
+
+Datum
+pg_stat_get_io(PG_FUNCTION_ARGS)
+{
+ PgStat_BackendIOContextOps *backends_io_stats;
+ ReturnSetInfo *rsinfo;
+ Datum reset_time;
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+
+ backends_io_stats = pgstat_fetch_backend_io_context_ops();
+
+ reset_time = TimestampTzGetDatum(backends_io_stats->stat_reset_timestamp);
+
+ for (int bktype = 0; bktype < BACKEND_NUM_TYPES; bktype++)
+ {
+ Datum bktype_desc = CStringGetTextDatum(GetBackendTypeDesc((BackendType) bktype));
+ bool expect_backend_stats = true;
+ PgStat_IOContextOps *io_context_ops = &backends_io_stats->stats[bktype];
+
+ /*
+ * For those BackendTypes without IO Operation stats, skip
+ * representing them in the view altogether.
+ */
+ expect_backend_stats = pgstat_io_op_stats_collected((BackendType)
+ bktype);
+
+ for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ {
+ PgStat_IOOpCounters *counters = &io_context_ops->data[io_context];
+ const char *io_context_str = pgstat_io_context_desc(io_context);
+
+ Datum values[IO_NUM_COLUMNS] = {0};
+ bool nulls[IO_NUM_COLUMNS] = {0};
+
+ /*
+ * Some combinations of IOContext and BackendType are not valid
+ * for any type of IOOp. In such cases, omit the entire row from
+ * the view.
+ */
+ if (!expect_backend_stats ||
+ !pgstat_bktype_io_context_valid((BackendType) bktype,
+ (IOContext) io_context))
+ {
+ pgstat_io_context_ops_assert_zero(counters);
+ continue;
+ }
+
+ values[IO_COL_BACKEND_TYPE] = bktype_desc;
+ values[IO_COL_IO_CONTEXT] = CStringGetTextDatum(io_context_str);
+ values[IO_COL_READS] = Int64GetDatum(counters->reads);
+ values[IO_COL_WRITES] = Int64GetDatum(counters->writes);
+ values[IO_COL_EXTENDS] = Int64GetDatum(counters->extends);
+ /*
+ * Hard-code this to blocks until we have non-block-oriented IO
+ * represented in the view as well
+ */
+ values[IO_COL_CONVERSION] = Int64GetDatum(BLCKSZ);
+ values[IO_COL_EVICTIONS] = Int64GetDatum(counters->evictions);
+ values[IO_COL_REUSES] = Int64GetDatum(counters->reuses);
+ values[IO_COL_REJECTIONS] = Int64GetDatum(counters->rejections);
+ values[IO_COL_REPOSSESSIONS] = Int64GetDatum(counters->repossessions);
+ values[IO_COL_FSYNCS] = Int64GetDatum(counters->fsyncs);
+ values[IO_COL_RESET_TIME] = TimestampTzGetDatum(reset_time);
+
+ /*
+ * Some combinations of BackendType and IOOp and of IOContext and
+ * IOOp are not valid. Set these cells in the view NULL and assert
+ * that these stats are zero as expected.
+ */
+ for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
+ {
+ if (!(pgstat_io_op_valid((BackendType) bktype, (IOContext)
+ io_context, (IOOp) io_op)))
+ {
+ pgstat_io_op_assert_zero(counters, (IOOp) io_op);
+ nulls[pgstat_io_op_get_index((IOOp) io_op)] = true;
+ }
+ }
+
+ tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
+ }
+ }
+
+ return (Datum) 0;
+}
+
/*
* Returns statistics of WAL activity
*/
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 20f5aa56ea..aae96db37a 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5653,6 +5653,15 @@
proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
+{ oid => '8459', descr => 'statistics: per backend type IO statistics',
+ proname => 'pg_stat_get_io', provolatile => 'v',
+ prorows => '30', proretset => 't',
+ proparallel => 'r', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,text,int8,int8,int8,int8,int8,int8,int8,int8,int8,timestamptz}',
+ proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o}',
+ proargnames => '{backend_type,io_context,read,written,extended,bytes_conversion,evicted,reused,rejected,repossessed,files_synced,stats_reset}',
+ prosrc => 'pg_stat_get_io' },
+
{ oid => '1136', descr => 'statistics: information about WAL activity',
proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
proparallel => 'r', prorettype => 'record', proargtypes => '',
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 624d0e5aae..c46babade3 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1873,6 +1873,19 @@ pg_stat_gssapi| SELECT s.pid,
s.gss_enc AS encrypted
FROM pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid, query_id)
WHERE (s.client_port IS NOT NULL);
+pg_stat_io| SELECT b.backend_type,
+ b.io_context,
+ b.read,
+ b.written,
+ b.extended,
+ b.bytes_conversion,
+ b.evicted,
+ b.reused,
+ b.rejected,
+ b.repossessed,
+ b.files_synced,
+ b.stats_reset
+ FROM pg_stat_get_io() b(backend_type, io_context, read, written, extended, bytes_conversion, evicted, reused, rejected, repossessed, files_synced, stats_reset);
pg_stat_progress_analyze| SELECT s.pid,
s.datid,
d.datname,
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index 257a6a9da9..28ef9171de 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -1120,4 +1120,228 @@ SELECT pg_stat_get_subscription_stats(NULL);
(1 row)
+-- Test that the following operations are tracked in pg_stat_io:
+-- - reads of target blocks into shared buffers
+-- - writes of shared buffers to permanent storage
+-- - extends of relations using shared buffers
+-- - fsyncs done to ensure the durability of data dirtying shared buffers
+-- There is no test for blocks evicted from shared buffers, because we cannot
+-- be sure of the state of shared buffers at the point the test is run.
+SELECT sum(read) AS io_sum_shared_reads_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(written) AS io_sum_shared_writes_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(extended) AS io_sum_shared_extends_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(files_synced) AS io_sum_shared_fsyncs_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+-- Create a regular table and insert some data to generate IOCONTEXT_SHARED
+-- extends.
+CREATE TABLE test_io_shared(a int);
+INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+-- After a checkpoint, there should be some additional IOCONTEXT_SHARED writes
+-- and fsyncs.
+-- The second checkpoint ensures that stats from the first checkpoint have been
+-- reported and protects against any potential races amongst the table
+-- creation, a possible timing-triggered checkpoint, and the explicit
+-- checkpoint in the test.
+CHECKPOINT;
+CHECKPOINT;
+SELECT sum(written) AS io_sum_shared_writes_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(extended) AS io_sum_shared_extends_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(files_synced) AS io_sum_shared_fsyncs_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT :io_sum_shared_writes_after > :io_sum_shared_writes_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_shared_extends_after > :io_sum_shared_extends_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT current_setting('fsync') = 'off' OR :io_sum_shared_fsyncs_after > :io_sum_shared_fsyncs_before;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Change the tablespace so that the table is rewritten directly, then SELECT
+-- from it to cause it to be read back into shared buffers.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE test_io_shared_stats_tblspc LOCATION '';
+ALTER TABLE test_io_shared SET TABLESPACE test_io_shared_stats_tblspc;
+SELECT COUNT(*) FROM test_io_shared;
+ count
+-------
+ 100
+(1 row)
+
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(read) AS io_sum_shared_reads_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT :io_sum_shared_reads_after > :io_sum_shared_reads_before;
+ ?column?
+----------
+ t
+(1 row)
+
+DROP TABLE test_io_shared;
+DROP TABLESPACE test_io_shared_stats_tblspc;
+-- Test that the following IOCONTEXT_LOCAL IOOps are tracked in pg_stat_io:
+-- - eviction of local buffers in order to reuse them
+-- - reads of temporary table blocks into local buffers
+-- - writes of local buffers to permanent storage
+-- - extends of temporary tables
+-- Set temp_buffers to a low value so that we can trigger writes with fewer
+-- inserted tuples. Do so in a new session in case temporary tables have been
+-- accessed by previous tests in this session.
+\c
+SET temp_buffers TO '1MB';
+CREATE TEMPORARY TABLE test_io_local(a int, b TEXT);
+SELECT sum(evicted) AS io_sum_local_evictions_before FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(read) AS io_sum_local_reads_before FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(written) AS io_sum_local_writes_before FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(extended) AS io_sum_local_extends_before FROM pg_stat_io WHERE io_context = 'local' \gset
+-- Insert enough values that we need to reuse and write out dirty local
+-- buffers.
+INSERT INTO test_io_local SELECT generate_series(1, 8000) as id, repeat('a', 100);
+-- Read in evicted buffers.
+SELECT COUNT(*) FROM test_io_local;
+ count
+-------
+ 8000
+(1 row)
+
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(evicted) AS io_sum_local_evictions_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(read) AS io_sum_local_reads_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(written) AS io_sum_local_writes_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(extended) AS io_sum_local_extends_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT :io_sum_local_evictions_after > :io_sum_local_evictions_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_local_reads_after > :io_sum_local_reads_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_local_writes_after > :io_sum_local_writes_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_local_extends_after > :io_sum_local_extends_before;
+ ?column?
+----------
+ t
+(1 row)
+
+RESET temp_buffers;
+-- Test that reuse of strategy buffers and reads of blocks into these reused
+-- buffers while VACUUMing are tracked in pg_stat_io.
+-- Set wal_skip_threshold smaller than the expected size of
+-- test_io_vac_strategy so that, even if wal_level is minimal, VACUUM FULL will
+-- fsync the newly rewritten test_io_vac_strategy instead of writing it to WAL.
+-- Writing it to WAL will result in the newly written relation pages being in
+-- shared buffers -- preventing us from testing BAS_VACUUM BufferAccessStrategy
+-- reads.
+SET wal_skip_threshold = '1 kB';
+SELECT sum(reused) AS io_sum_vac_strategy_reuses_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(read) AS io_sum_vac_strategy_reads_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+CREATE TABLE test_io_vac_strategy(a int, b int) WITH (autovacuum_enabled = 'false');
+INSERT INTO test_io_vac_strategy SELECT i, i from generate_series(1, 8000)i;
+-- Ensure that the next VACUUM will need to perform IO by rewriting the table
+-- first with VACUUM (FULL).
+VACUUM (FULL) test_io_vac_strategy;
+VACUUM (PARALLEL 0) test_io_vac_strategy;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(reused) AS io_sum_vac_strategy_reuses_after FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(read) AS io_sum_vac_strategy_reads_after FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT :io_sum_vac_strategy_reads_after > :io_sum_vac_strategy_reads_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_vac_strategy_reuses_after > :io_sum_vac_strategy_reuses_before;
+ ?column?
+----------
+ t
+(1 row)
+
+RESET wal_skip_threshold;
+-- Test that extends done by a CTAS, which uses a BAS_BULKWRITE
+-- BufferAccessStrategy, are tracked in pg_stat_io.
+SELECT sum(extended) AS io_sum_bulkwrite_strategy_extends_before FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+CREATE TABLE test_io_bulkwrite_strategy AS SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(extended) AS io_sum_bulkwrite_strategy_extends_after FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+SELECT :io_sum_bulkwrite_strategy_extends_after > :io_sum_bulkwrite_strategy_extends_before;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Test that reads of blocks into reused strategy buffers during database
+-- creation, which uses a BAS_BULKREAD BufferAccessStrategy, are tracked in
+-- pg_stat_io.
+SELECT sum(read) AS io_sum_bulkread_strategy_reads_before FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+CREATE DATABASE test_io_bulkread_strategy_db;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(read) AS io_sum_bulkread_strategy_reads_after FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+SELECT :io_sum_bulkread_strategy_reads_after > :io_sum_bulkread_strategy_reads_before;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Test IO stats reset
+SELECT sum(evicted) + sum(reused) + sum(extended) + sum(files_synced) + sum(read) + sum(written) AS io_stats_pre_reset FROM pg_stat_io \gset
+SELECT pg_stat_reset_shared('io');
+ pg_stat_reset_shared
+----------------------
+
+(1 row)
+
+SELECT sum(evicted) + sum(reused) + sum(extended) + sum(files_synced) + sum(read) + sum(written) AS io_stats_post_reset FROM pg_stat_io \gset
+SELECT :io_stats_post_reset < :io_stats_pre_reset;
+ ?column?
+----------
+ t
+(1 row)
+
-- End of Stats Test
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index f6270f7bad..75c2f6c4c0 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -535,4 +535,127 @@ SELECT pg_stat_get_replication_slot(NULL);
SELECT pg_stat_get_subscription_stats(NULL);
+-- Test that the following operations are tracked in pg_stat_io:
+-- - reads of target blocks into shared buffers
+-- - writes of shared buffers to permanent storage
+-- - extends of relations using shared buffers
+-- - fsyncs done to ensure the durability of data dirtying shared buffers
+
+-- There is no test for blocks evicted from shared buffers, because we cannot
+-- be sure of the state of shared buffers at the point the test is run.
+SELECT sum(read) AS io_sum_shared_reads_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(written) AS io_sum_shared_writes_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(extended) AS io_sum_shared_extends_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(files_synced) AS io_sum_shared_fsyncs_before FROM pg_stat_io WHERE io_context = 'shared' \gset
+-- Create a regular table and insert some data to generate IOCONTEXT_SHARED
+-- extends.
+CREATE TABLE test_io_shared(a int);
+INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+-- After a checkpoint, there should be some additional IOCONTEXT_SHARED writes
+-- and fsyncs.
+-- The second checkpoint ensures that stats from the first checkpoint have been
+-- reported and protects against any potential races amongst the table
+-- creation, a possible timing-triggered checkpoint, and the explicit
+-- checkpoint in the test.
+CHECKPOINT;
+CHECKPOINT;
+SELECT sum(written) AS io_sum_shared_writes_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(extended) AS io_sum_shared_extends_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT sum(files_synced) AS io_sum_shared_fsyncs_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT :io_sum_shared_writes_after > :io_sum_shared_writes_before;
+SELECT :io_sum_shared_extends_after > :io_sum_shared_extends_before;
+SELECT current_setting('fsync') = 'off' OR :io_sum_shared_fsyncs_after > :io_sum_shared_fsyncs_before;
+-- Change the tablespace so that the table is rewritten directly, then SELECT
+-- from it to cause it to be read back into shared buffers.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE test_io_shared_stats_tblspc LOCATION '';
+ALTER TABLE test_io_shared SET TABLESPACE test_io_shared_stats_tblspc;
+SELECT COUNT(*) FROM test_io_shared;
+SELECT pg_stat_force_next_flush();
+SELECT sum(read) AS io_sum_shared_reads_after FROM pg_stat_io WHERE io_context = 'shared' \gset
+SELECT :io_sum_shared_reads_after > :io_sum_shared_reads_before;
+DROP TABLE test_io_shared;
+DROP TABLESPACE test_io_shared_stats_tblspc;
+
+-- Test that the following IOCONTEXT_LOCAL IOOps are tracked in pg_stat_io:
+-- - eviction of local buffers in order to reuse them
+-- - reads of temporary table blocks into local buffers
+-- - writes of local buffers to permanent storage
+-- - extends of temporary tables
+
+-- Set temp_buffers to a low value so that we can trigger writes with fewer
+-- inserted tuples. Do so in a new session in case temporary tables have been
+-- accessed by previous tests in this session.
+\c
+SET temp_buffers TO '1MB';
+CREATE TEMPORARY TABLE test_io_local(a int, b TEXT);
+SELECT sum(evicted) AS io_sum_local_evictions_before FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(read) AS io_sum_local_reads_before FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(written) AS io_sum_local_writes_before FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(extended) AS io_sum_local_extends_before FROM pg_stat_io WHERE io_context = 'local' \gset
+-- Insert enough values that we need to reuse and write out dirty local
+-- buffers.
+INSERT INTO test_io_local SELECT generate_series(1, 8000) as id, repeat('a', 100);
+-- Read in evicted buffers.
+SELECT COUNT(*) FROM test_io_local;
+SELECT pg_stat_force_next_flush();
+SELECT sum(evicted) AS io_sum_local_evictions_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(read) AS io_sum_local_reads_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(written) AS io_sum_local_writes_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT sum(extended) AS io_sum_local_extends_after FROM pg_stat_io WHERE io_context = 'local' \gset
+SELECT :io_sum_local_evictions_after > :io_sum_local_evictions_before;
+SELECT :io_sum_local_reads_after > :io_sum_local_reads_before;
+SELECT :io_sum_local_writes_after > :io_sum_local_writes_before;
+SELECT :io_sum_local_extends_after > :io_sum_local_extends_before;
+RESET temp_buffers;
+
+-- Test that reuse of strategy buffers and reads of blocks into these reused
+-- buffers while VACUUMing are tracked in pg_stat_io.
+
+-- Set wal_skip_threshold smaller than the expected size of
+-- test_io_vac_strategy so that, even if wal_level is minimal, VACUUM FULL will
+-- fsync the newly rewritten test_io_vac_strategy instead of writing it to WAL.
+-- Writing it to WAL will result in the newly written relation pages being in
+-- shared buffers -- preventing us from testing BAS_VACUUM BufferAccessStrategy
+-- reads.
+SET wal_skip_threshold = '1 kB';
+SELECT sum(reused) AS io_sum_vac_strategy_reuses_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(read) AS io_sum_vac_strategy_reads_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+CREATE TABLE test_io_vac_strategy(a int, b int) WITH (autovacuum_enabled = 'false');
+INSERT INTO test_io_vac_strategy SELECT i, i from generate_series(1, 8000)i;
+-- Ensure that the next VACUUM will need to perform IO by rewriting the table
+-- first with VACUUM (FULL).
+VACUUM (FULL) test_io_vac_strategy;
+VACUUM (PARALLEL 0) test_io_vac_strategy;
+SELECT pg_stat_force_next_flush();
+SELECT sum(reused) AS io_sum_vac_strategy_reuses_after FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(read) AS io_sum_vac_strategy_reads_after FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT :io_sum_vac_strategy_reads_after > :io_sum_vac_strategy_reads_before;
+SELECT :io_sum_vac_strategy_reuses_after > :io_sum_vac_strategy_reuses_before;
+RESET wal_skip_threshold;
+
+-- Test that extends done by a CTAS, which uses a BAS_BULKWRITE
+-- BufferAccessStrategy, are tracked in pg_stat_io.
+SELECT sum(extended) AS io_sum_bulkwrite_strategy_extends_before FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+CREATE TABLE test_io_bulkwrite_strategy AS SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+SELECT sum(extended) AS io_sum_bulkwrite_strategy_extends_after FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+SELECT :io_sum_bulkwrite_strategy_extends_after > :io_sum_bulkwrite_strategy_extends_before;
+
+-- Test that reads of blocks into reused strategy buffers during database
+-- creation, which uses a BAS_BULKREAD BufferAccessStrategy, are tracked in
+-- pg_stat_io.
+SELECT sum(read) AS io_sum_bulkread_strategy_reads_before FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+CREATE DATABASE test_io_bulkread_strategy_db;
+SELECT pg_stat_force_next_flush();
+SELECT sum(read) AS io_sum_bulkread_strategy_reads_after FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+SELECT :io_sum_bulkread_strategy_reads_after > :io_sum_bulkread_strategy_reads_before;
+
+-- Test IO stats reset
+SELECT sum(evicted) + sum(reused) + sum(extended) + sum(files_synced) + sum(read) + sum(written) AS io_stats_pre_reset FROM pg_stat_io \gset
+SELECT pg_stat_reset_shared('io');
+SELECT sum(evicted) + sum(reused) + sum(extended) + sum(files_synced) + sum(read) + sum(written) AS io_stats_post_reset FROM pg_stat_io \gset
+SELECT :io_stats_post_reset < :io_stats_pre_reset;
+
-- End of Stats Test
--
2.34.1
Attachment: v36-0001-Remove-BufferAccessStrategyData-current_was_in_r.patch (text/x-patch)
From 4746ef5834de99836f81be8ffd322d139c940a25 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Thu, 13 Oct 2022 11:03:05 -0700
Subject: [PATCH v36 1/5] Remove BufferAccessStrategyData->current_was_in_ring
It is a duplication of StrategyGetBuffer->from_ring.
---
src/backend/storage/buffer/bufmgr.c | 2 +-
src/backend/storage/buffer/freelist.c | 15 ++-------------
src/include/storage/buf_internals.h | 2 +-
3 files changed, 4 insertions(+), 15 deletions(-)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 6b95381481..4e7b0b31bb 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1254,7 +1254,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
UnlockBufHdr(buf, buf_state);
if (XLogNeedsFlush(lsn) &&
- StrategyRejectBuffer(strategy, buf))
+ StrategyRejectBuffer(strategy, buf, from_ring))
{
/* Drop lock/pin and loop around for another buffer */
LWLockRelease(BufferDescriptorGetContentLock(buf));
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 990e081aae..64728bd7ce 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -81,12 +81,6 @@ typedef struct BufferAccessStrategyData
*/
int current;
- /*
- * True if the buffer just returned by StrategyGetBuffer had been in the
- * ring already.
- */
- bool current_was_in_ring;
-
/*
* Array of buffer numbers. InvalidBuffer (that is, zero) indicates we
* have not yet selected a buffer for this ring slot. For allocation
@@ -625,10 +619,7 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
*/
bufnum = strategy->buffers[strategy->current];
if (bufnum == InvalidBuffer)
- {
- strategy->current_was_in_ring = false;
return NULL;
- }
/*
* If the buffer is pinned we cannot use it under any circumstances.
@@ -644,7 +635,6 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
if (BUF_STATE_GET_REFCOUNT(local_buf_state) == 0
&& BUF_STATE_GET_USAGECOUNT(local_buf_state) <= 1)
{
- strategy->current_was_in_ring = true;
*buf_state = local_buf_state;
return buf;
}
@@ -654,7 +644,6 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
* Tell caller to allocate a new buffer with the normal allocation
* strategy. He'll then replace this ring element via AddBufferToRing.
*/
- strategy->current_was_in_ring = false;
return NULL;
}
@@ -682,14 +671,14 @@ AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
* if this buffer should be written and re-used.
*/
bool
-StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
+StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool from_ring)
{
/* We only do this in bulkread mode */
if (strategy->btype != BAS_BULKREAD)
return false;
/* Don't muck with behavior of normal buffer-replacement strategy */
- if (!strategy->current_was_in_ring ||
+ if (!from_ring ||
strategy->buffers[strategy->current] != BufferDescriptorGetBuffer(buf))
return false;
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 406db6be78..b75481450d 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -395,7 +395,7 @@ extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
uint32 *buf_state);
extern void StrategyFreeBuffer(BufferDesc *buf);
extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
- BufferDesc *buf);
+ BufferDesc *buf, bool from_ring);
extern int StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
extern void StrategyNotifyBgWriter(int bgwprocno);
--
2.34.1
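For readers skimming the diff, the net behavioral content is small: StrategyRejectBuffer now receives from_ring from its caller instead of consulting the removed current_was_in_ring field. A toy Python rendering of the resulting control flow (simplified, hypothetical data layout; only the from_ring parameter change is meant to track the patch):

```python
BAS_BULKREAD = "bulkread"

def strategy_reject_buffer(strategy, buf, from_ring):
    """Return True if a dirty buffer needing a WAL flush should be dropped
    from a bulkread ring rather than written out.

    from_ring replaces the removed current_was_in_ring field: the caller
    (BufferAlloc in the real code) already knows whether the buffer it is
    considering came from the strategy ring.
    """
    # We only do this in bulkread mode.
    if strategy["btype"] != BAS_BULKREAD:
        return False
    # Don't muck with the behavior of the normal buffer-replacement strategy.
    if not from_ring or strategy["buffers"][strategy["current"]] != buf:
        return False
    # Forget the buffer so the ring doesn't keep returning it.
    strategy["buffers"][strategy["current"]] = None
    return True

# Example: a two-slot bulkread ring whose current slot holds buffer "b1".
strategy = {"btype": BAS_BULKREAD, "buffers": ["b0", "b1"], "current": 1}
```

Since BufferAlloc already holds the from_ring flag, threading it through avoids keeping duplicate state inside the strategy object.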
Hi,
On 2022-10-24 14:38:52 -0400, Melanie Plageman wrote:
- "repossession" is a very unintuitive name for me. If we want something like
it, can't we just name it reuse_failed or such?

Repossession could be called eviction_failed or reuse_failed.
Do we think we will ever want to use it to count buffers we released
in other IOContexts (thus making the name eviction_failed better than
reuse_failed)?
I've a somewhat radical proposal: Let's just not count any of this in the
initial version. I think we want something, but clearly it's one of the harder
aspects of this patch. Let's get the rest in, and then work on this in
isolation.
Speaking of IOCONTEXT_LOCAL, I was wondering if it is confusing to call
it IOCONTEXT_LOCAL since it refers to IO done for temporary tables. What
if, in the future, we want to track other IO done using data in local
memory?
Fair point. However, I think 'tmp' or 'temp' would be worse, because there's
other sources of temporary files that would be worth counting, consider
e.g. tuplestore temporary files. 'temptable' isn't good because it's not just
tables. 'temprel'? On balance I think local is better, but not sure.
Also, what if we want to track other IO done using data from shared memory
that is not in shared buffers? Would IOCONTEXT_SB and IOCONTEXT_TEMP be
better? Should IOContext literally describe the context of the IO being done
and there be a separate column which indicates the source of the data for
the IO? Like wal_buffer, local_buffer, shared_buffer? Then if it is not
block-oriented, it could be shared_mem, local_mem, or bypass?
Hm. I don't think we'd need _buffer for WAL or such, because there's nothing
else.
If we had another dimension to the matrix "data_src" which, with
block-oriented IO is equivalent to "buffer type", this could help with
some of the clarity problems.

We could remove the "reused" column and that becomes:
IOCONTEXT | DATA_SRC | IOOP
----------------------------------------
strategy | strategy_buffer | EVICT
Having data_src and iocontext simplifies the meaning of all io
operations involving a strategy. Some operations are done on shared
buffers and some on existing strategy buffers and this would be more
clear without the addition of special columns for strategies.
-1, I think this just blows up the complexity further, without providing much
benefit. But:
Perhaps a somewhat similar idea could be used to address the concerns in the
preceding paragraphs. How about the following set of columns:
backend_type:
object: relation, temp_relation[, WAL, tempfiles, ...]
iocontext: buffer_pool, bulkread, bulkwrite, vacuum[, bypass]
read:
written:
extended:
bytes_conversion:
evicted:
reused:
files_synced:
stats_reset:
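Internally, as suggested earlier in the thread, such a scheme could be backed by plain counter arrays indexed by backend type. A toy Python sketch of that shape (names and dimensions are illustrative, not the actual enums or structs in the patch):

```python
from collections import defaultdict

# Hypothetical dimensions for the proposed view; the real patch uses C enums
# and fixed-size arrays rather than a dict.
BACKEND_TYPES = ["client backend", "checkpointer", "autovacuum worker"]
IO_OBJECTS = ["relation", "temp_relation"]
IO_CONTEXTS = ["buffer_pool", "bulkread", "bulkwrite", "vacuum"]
IO_OPS = ["read", "written", "extended", "evicted", "reused", "files_synced"]

# One counter per (backend_type, object, io_context, op) cell.
io_stats = defaultdict(int)

def count_io_op(backend_type, io_object, io_context, io_op):
    """Bump the counter for one IO operation in one cell of the matrix."""
    assert backend_type in BACKEND_TYPES
    assert io_object in IO_OBJECTS
    assert io_context in IO_CONTEXTS
    assert io_op in IO_OPS
    io_stats[(backend_type, io_object, io_context, io_op)] += 1

# e.g. a client backend reading a relation block through a vacuum strategy ring:
count_io_op("client backend", "relation", "vacuum", "read")
```

Each SQL row of the view would then correspond to one (backend_type, object, io_context) slice of this matrix, with columns that cannot occur for a given combination left NULL.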
Greetings,
Andres Freund
On Wed, Oct 26, 2022 at 10:55 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:
+ The <structname>pg_statio_</structname> and
+ <structname>pg_stat_io</structname> views are primarily useful to determine
+ the effectiveness of the buffer cache. When the number of actual disk reads
Totally nitpicking, but this reads a little funny to me. Previously
the trailing underscore suggested this is a group, and now with
pg_stat_io itself added (stupid question: should this be
"pg_statio"?), it sounds like we're talking about two views:
pg_stat_io and "pg_statio_". Maybe something like "The pg_stat_io view
and the pg_statio_ set of views are primarily..."?
+ by that backend type in that IO context. Currently only a subset of IO
+ operations are tracked here. WAL IO, IO on temporary files, and some forms
+ of IO outside of shared buffers (such as when building indexes or moving a
+ table from one tablespace to another) could be added in the future.
Again nitpicking, but should this be "may be added"? I think "could"
suggests the possibility of implementation, whereas "may" feels more
like a hint as to how the feature could evolve.
+ portion of the main shared buffer pool. This pattern is called a
+ <quote>Buffer Access Strategy</quote> in the
+ <productname>PostgreSQL</productname> source code and the fixed-size
+ ring buffer is referred to as a <quote>strategy ring buffer</quote> for
+ the purposes of this view's documentation.
+ </para></entry>
Nice, I think this explanation is very helpful. You also use the term
"strategy context" and "strategy operation" below. I think it's fairly
obvious what those mean, but pointing it out in case we want to note
that here, too.
+ <varname>read</varname> and <varname>extended</varname> for
Maybe "plus" instead of "and" here for clarity (I'm assuming that's
what the "and" means)?
+ <varname>backend_type</varname>s <literal>autovacuum launcher</literal>,
+ <literal>autovacuum worker</literal>, <literal>client backend</literal>,
+ <literal>standalone backend</literal>, <literal>background
+ worker</literal>, and <literal>walsender</literal> for all
+ <varname>io_context</varname>s is similar to the sum of
I'm reviewing the rendered docs now, and I noticed sentences like this
are a bit hard to scan: they force the reader to parse a big list of
backend types before even getting to the meat of what this is talking
about. Should we maybe reword this so that the backend list comes at
the end of the sentence? Or maybe even use a list (e.g., like in the
"state" column description in pg_stat_activity)?
+ <varname>heap_blks_read</varname>, <varname>idx_blks_read</varname>,
+ <varname>tidx_blks_read</varname>, and
+ <varname>toast_blks_read</varname> in <link
+ linkend="monitoring-pg-statio-all-tables-view">
+ <structname>pg_statio_all_tables</structname></link>. and
+ <varname>blks_read</varname> from <link
I think that's a stray period before the "and."
+ <para>If using the <productname>PostgreSQL</productname> extension,
+ <xref linkend="pgstatstatements"/>,
+ <varname>read</varname> for
+ <varname>backend_type</varname>s <literal>autovacuum launcher</literal>,
+ <literal>autovacuum worker</literal>, <literal>client backend</literal>,
+ <literal>standalone backend</literal>, <literal>background
+ worker</literal>, and <literal>walsender</literal> for all
+ <varname>io_context</varname>s is equivalent to
Same comment as above re: the lengthy list.
+ Normal client backends should be able to rely on maintenance processes
+ like the checkpointer and background writer to write out dirty data as
Nice--it's great to see this mentioned. But I think these are
generally referred to as "auxiliary" not "maintenance" processes, no?
+ <para>If using the <productname>PostgreSQL</productname> extension,
+ <xref linkend="pgstatstatements"/>, <varname>written</varname> and
+ <varname>extended</varname> for <varname>backend_type</varname>s
Again, should this be "plus" instead of "and"?
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>bytes_conversion</structfield> <type>bigint</type>
+ </para>
I think this general approach works (instead of unit). I'm not wild
about the name, but I don't really have a better suggestion. Maybe
"op_bytes" (since each cell is counting the number of I/O operations)?
But I think bytes_conversion is okay.
Also, is this (in the middle of the table) the right place for this
column? I would have expected to see it before or after all the actual
I/O op cells.
+ <varname>io_context</varname>s. When a <quote>Buffer Access
+ Strategy</quote> reuses a buffer in the strategy ring, it must evict its
+ contents, incrementing <varname>reused</varname>. When a <quote>Buffer
+ Access Strategy</quote> adds a new shared buffer to the strategy ring
+ and this shared buffer is occupied, the <quote>Buffer Access
+ Strategy</quote> must evict the contents of the shared buffer,
+ incrementing <varname>evicted</varname>.
I think the parallel phrasing here makes this a little hard to follow.
Specifically, I think "must evict its contents" for the strategy case
sounds like a bad thing, but in fact this is a totally normal thing
that happens as part of strategy access, no? The idea is you probably
won't need that buffer again, so it's fine to evict it. I'm not sure
how to reword, but I think the current phrasing is misleading.
+ The number of times a <literal>bulkread</literal> found the current
+ buffer in the fixed-size strategy ring dirty and requiring flush.
Maybe "...found ... to be dirty..."?
+ frequent vacuuming or more aggressive autovacuum settings, as buffers are
+ dirtied during a bulkread operation when updating the hint bit or when
+ performing on-access pruning.
Are there docs to cross-reference here, especially for pruning? I
couldn't find much except a few un-explained mentions in the page
layout docs [2], and most of the search results refer to partition
pruning. Searching for hint bits at least gives some info in blog
posts and the wiki.
+ again. A high number of repossessions is a sign of contention for the
+ blocks operated on by the strategy operation.
This (and in general the repossession description) makes sense, but
I'm not sure what to do with the information. Maybe Andres is right
that we could skip this in the first version?
On Mon, Oct 24, 2022 at 12:39 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
I don't quite follow this: does this mean that I should expect
'reused' and 'evicted' to be equal in the 'shared' context, because
they represent the same thing? Or will 'reused' just be null because
it's not distinct from 'evicted'? It looks like it's null right now,
but I find the wording here confusing.

You should only see evictions when the strategy evicts shared buffers
and reuses when the strategy evicts existing strategy buffers.

How about this instead in the docs?
the number of times an existing buffer in the strategy ring was reused
as part of an operation in the <literal>bulkread</literal>,
<literal>bulkwrite</literal>, or <literal>vacuum</literal>
<varname>io_context</varname>s. when a buffer access strategy
<quote>reuses</quote> a buffer in the strategy ring, it must evict its
contents, incrementing <varname>reused</varname>. when a buffer access
strategy adds a new shared buffer to the strategy ring and this shared
buffer is occupied, the buffer access strategy must evict the contents
of the shared buffer, incrementing <varname>evicted</varname>.
It looks like you ended up with different wording in the patch, but
both this explanation and what's in the patch now make sense to me.
Thanks for clarifying.
Also, I noticed that the commit message explains missing rows for some
backend_type / io_context combinations and NULL (versus 0) in some
cells, but the docs don't really talk about that. Do you think that
should be in there as well?
Thanks,
Maciek
[1]: https://www.postgresql.org/docs/15/glossary.html#GLOSSARY-AUXILIARY-PROC
[2]: https://www.postgresql.org/docs/15/storage-page-layout.html
v37 attached
On Sun, Oct 30, 2022 at 9:09 PM Maciek Sakrejda <m.sakrejda@gmail.com> wrote:
On Wed, Oct 26, 2022 at 10:55 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:

+ The <structname>pg_statio_</structname> and
+ <structname>pg_stat_io</structname> views are primarily useful to determine
+ the effectiveness of the buffer cache. When the number of actual disk reads

Totally nitpicking, but this reads a little funny to me. Previously
the trailing underscore suggested this is a group, and now with
pg_stat_io itself added (stupid question: should this be
"pg_statio"?), it sounds like we're talking about two views:
pg_stat_io and "pg_statio_". Maybe something like "The pg_stat_io view
and the pg_statio_ set of views are primarily..."?
I decided not to call it pg_statio because all of the other stats views
have an underscore after stat and I thought it was an opportunity to be
consistent with them.
+ by that backend type in that IO context. Currently only a subset of IO
+ operations are tracked here. WAL IO, IO on temporary files, and some forms
+ of IO outside of shared buffers (such as when building indexes or moving a
+ table from one tablespace to another) could be added in the future.

Again nitpicking, but should this be "may be added"? I think "could"
suggests the possibility of implementation, whereas "may" feels more
like a hint as to how the feature could evolve.
I've adopted the wording you suggested.
+ portion of the main shared buffer pool. This pattern is called a
+ <quote>Buffer Access Strategy</quote> in the
+ <productname>PostgreSQL</productname> source code and the fixed-size
+ ring buffer is referred to as a <quote>strategy ring buffer</quote> for
+ the purposes of this view's documentation.
+ </para></entry>

Nice, I think this explanation is very helpful. You also use the term
"strategy context" and "strategy operation" below. I think it's fairly
obvious what those mean, but pointing it out in case we want to note
that here, too.
Thanks! I've added definitions of those as well.
+ <varname>read</varname> and <varname>extended</varname> for
Maybe "plus" instead of "and" here for clarity (I'm assuming that's
what the "and" means)?
Modified this -- in some cases by adding the lists mentioned below
+ <varname>backend_type</varname>s <literal>autovacuum launcher</literal>,
+ <literal>autovacuum worker</literal>, <literal>client backend</literal>,
+ <literal>standalone backend</literal>, <literal>background
+ worker</literal>, and <literal>walsender</literal> for all
+ <varname>io_context</varname>s is similar to the sum of

I'm reviewing the rendered docs now, and I noticed sentences like this
are a bit hard to scan: they force the reader to parse a big list of
backend types before even getting to the meat of what this is talking
about. Should we maybe reword this so that the backend list comes at
the end of the sentence? Or maybe even use a list (e.g., like in the
"state" column description in pg_stat_activity)?
Good idea with the bullet points.
For the lengthy lists, I've added bullet point lists to the docs for
several of the columns. It is quite long now but, hopefully, clearer?
Let me know if you think it improves the readability.
+ <varname>heap_blks_read</varname>, <varname>idx_blks_read</varname>,
+ <varname>tidx_blks_read</varname>, and
+ <varname>toast_blks_read</varname> in <link
+ linkend="monitoring-pg-statio-all-tables-view">
+ <structname>pg_statio_all_tables</structname></link>. and
+ <varname>blks_read</varname> from <link

I think that's a stray period before the "and."
Fixed!
+ Normal client backends should be able to rely on maintenance processes
+ like the checkpointer and background writer to write out dirty data as

Nice--it's great to see this mentioned. But I think these are
generally referred to as "auxiliary" not "maintenance" processes, no?
Thanks! Fixed.
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>bytes_conversion</structfield> <type>bigint</type>
+ </para>

I think this general approach works (instead of unit). I'm not wild
about the name, but I don't really have a better suggestion. Maybe
"op_bytes" (since each cell is counting the number of I/O operations)?
But I think bytes_conversion is okay.
I really like op_bytes and have changed it to this. Thanks for the
suggestion!
Also, is this (in the middle of the table) the right place for this
column? I would have expected to see it before or after all the actual
I/O op cells.
I put it after read, write, and extend columns because it applies to
them. It doesn't apply to files_synced. For reused and evicted, I didn't
think bytes reused and evicted made sense. Also, when we add non-block
oriented IO, reused and evicted won't be used but op_bytes will be. So I
thought it made more sense to place it after the operations it applies
to.
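To illustrate the intended use of op_bytes: a consumer multiplies it into the block-oriented op counts only, leaving reused, evicted, and files_synced alone (a sketch; column names follow the discussion above, and the 8192-byte value assumes the default BLCKSZ):

```python
BLCKSZ = 8192  # default PostgreSQL block size, assumed here

def io_bytes(row):
    """Convert read/written/extended op counts into bytes using op_bytes.

    reused, evicted, and files_synced are deliberately excluded: op_bytes
    does not apply to them, per the column placement rationale above.
    """
    op_bytes = row["op_bytes"]
    return {col: row[col] * op_bytes for col in ("read", "written", "extended")}

# A hypothetical pg_stat_io row for one backend_type/io_context combination:
row = {"op_bytes": BLCKSZ, "read": 10, "written": 4, "extended": 2}
```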
+ <varname>io_context</varname>s. When a <quote>Buffer Access
+ Strategy</quote> reuses a buffer in the strategy ring, it must evict its
+ contents, incrementing <varname>reused</varname>. When a <quote>Buffer
+ Access Strategy</quote> adds a new shared buffer to the strategy ring
+ and this shared buffer is occupied, the <quote>Buffer Access
+ Strategy</quote> must evict the contents of the shared buffer,
+ incrementing <varname>evicted</varname>.

I think the parallel phrasing here makes this a little hard to follow.
Specifically, I think "must evict its contents" for the strategy case
sounds like a bad thing, but in fact this is a totally normal thing
that happens as part of strategy access, no? The idea is you probably
won't need that buffer again, so it's fine to evict it. I'm not sure
how to reword, but I think the current phrasing is misleading.
I had trouble rephrasing this. I changed a few words. I see what you
mean. It is worth noting that reusing strategy buffers when there are
buffers on the freelist may not be the best behavior, so I wouldn't
necessarily consider "reused" a good thing. However, I'm not sure how
much the user could really do about this. I would at least like this
phrasing to be clear (evicted is for shared buffers, reused is for
strategy buffers), so, perhaps this section requires more work.
+ The number of times a <literal>bulkread</literal> found the current + buffer in the fixed-size strategy ring dirty and requiring flush.

Maybe "...found ... to be dirty..."?
Changed to this wording.
+ frequent vacuuming or more aggressive autovacuum settings, as buffers are + dirtied during a bulkread operation when updating the hint bit or when + performing on-access pruning.

Are there docs to cross-reference here, especially for pruning? I
couldn't find much except a few un-explained mentions in the page
layout docs [2], and most of the search results refer to partition
pruning. Searching for hint bits at least gives some info in blog
posts and the wiki.
Yes, I don't see anything explaining this either -- below the page
layout it discusses tuple layout, but that doesn't mention hint bits.
+ again. A high number of repossessions is a sign of contention for the + blocks operated on by the strategy operation.

This (and in general the repossession description) makes sense, but
I'm not sure what to do with the information. Maybe Andres is right
that we could skip this in the first version?
I've removed repossessed and rejected in attached v37. I am a bit sad
about this because I don't see a good way forward and I think those
could be useful for users.
I have added the new column Andres recommended in [1] ("io_object") to
clarify temp and local buffers and pave the way for bypass IO (IO not
done through a buffer pool), which can be done on temp or permanent
files for temp or permanent relations, and spill file IO which is done
on temporary files but isn't related to temporary tables.
IOObject has increased the memory footprint and complexity of the code
around tracking and accumulating the statistics, though it has not
increased the number of rows in the view.
One question I still have about this additional dimension is how much
enumeration we need of the various combinations of IO operations, IO
objects, IO contexts, and backend types which are allowed and not allowed.
Currently, because IOCONTEXT_BUFFER_POOL is the only IOContext in which
both IOOBJECT_RELATION and IOOBJECT_TEMP_RELATION are valid, the changes
to the various functions asserting and validating which combinations of
ops, objects, contexts, and backend types are "allowed" aren't much
different than they were without IOObject. However, once we begin
adding other objects and contexts, we will need to make this logic more
comprehensive. I'm not sure whether or not I should do that
preemptively.
On Mon, Oct 24, 2022 at 12:39 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

I don't quite follow this: does this mean that I should expect
'reused' and 'evicted' to be equal in the 'shared' context, because
they represent the same thing? Or will 'reused' just be null because
it's not distinct from 'evicted'? It looks like it's null right now,
but I find the wording here confusing.

You should only see evictions when the strategy evicts shared buffers
and reuses when the strategy evicts existing strategy buffers.

How about this instead in the docs?
the number of times an existing buffer in the strategy ring was reused
as part of an operation in the <literal>bulkread</literal>,
<literal>bulkwrite</literal>, or <literal>vacuum</literal>
<varname>io_context</varname>s. when a buffer access strategy
<quote>reuses</quote> a buffer in the strategy ring, it must evict its
contents, incrementing <varname>reused</varname>. when a buffer access
strategy adds a new shared buffer to the strategy ring and this shared
buffer is occupied, the buffer access strategy must evict the contents
of the shared buffer, incrementing <varname>evicted</varname>.

It looks like you ended up with different wording in the patch, but
both this explanation and what's in the patch now make sense to me.
Thanks for clarifying.
Yes, I tried to rework it and your suggestion and feedback was very
helpful.
Also, I noticed that the commit message explains missing rows for some
backend_type / io_context combinations and NULL (versus 0) in some
cells, but the docs don't really talk about that. Do you think that
should be in there as well?
Thanks for pointing this out. I have added notes about this to the
relevant columns in the docs.
- Melanie
[1]: /messages/by-id/20221026185808.4qnxowtn35x43u7u@awork3.anarazel.de
Attachments:
v37-0002-Track-IO-operation-statistics-locally.patch (application/octet-stream)
From cb632a7feca9162b486d8a7a90581fd45db8865c Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Thu, 3 Nov 2022 12:10:07 -0400
Subject: [PATCH v37 2/5] Track IO operation statistics locally
Introduce "IOOp", an IO operation done by a backend, "IOObject", the
target object of the IO, and "IOContext", the context or location of the
IO operations on that object. For example, the checkpointer may write a
shared buffer out. This would be counted as an IOOp "written" on an
IOObject IOOBJECT_RELATION in IOContext IOCONTEXT_BUFFER_POOL by
BackendType "checkpointer".
Each IOOp (evict, reuse, read, write, extend, and fsync) is counted per
IOObject (relation, temp relation) per IOContext (bulkread, bulkwrite,
buffer pool, or vacuum) through a call to pgstat_count_io_op().
The primary concern of these statistics is IO operations on data blocks
during the course of normal database operations. IO operations done by,
for example, the archiver or syslogger are not counted in these
statistics. WAL IO, temporary file IO, and IO done directly through smgr*
functions (such as when building an index) are not yet counted but would
be useful future additions.
IOContext IOCONTEXT_BUFFER_POOL concerns operations on local and shared
buffers.
The IOCONTEXT_BULKREAD, IOCONTEXT_BULKWRITE, and IOCONTEXT_VACUUM
IOContexts concern IO operations on buffers as part of a
BufferAccessStrategy.
IOOP_EVICT IOOps are counted in IOCONTEXT_BUFFER_POOL when a buffer is
acquired or allocated through [Local]BufferAlloc() and no
BufferAccessStrategy is in use.
When a BufferAccessStrategy is in use, shared buffers added to the
strategy ring are counted as IOOP_EVICT IOOps in the
IOCONTEXT_[BULKREAD|BULKWRITE|VACUUM] IOContext. When one of these
buffers is reused, it is counted as an IOOP_REUSE IOOp in the
corresponding strategy IOContext.
IOOP_WRITE IOOps are counted in the BufferAccessStrategy IOContexts
whenever the reused dirty buffer is written out.
Stats on IOOps in all IOContexts for a given backend are counted in a
backend's local memory. A subsequent commit will expose functions for
aggregating and viewing these stats.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Maciek Sakrejda <m.sakrejda@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
src/backend/postmaster/checkpointer.c | 13 +
src/backend/storage/buffer/bufmgr.c | 95 +++++++-
src/backend/storage/buffer/freelist.c | 43 +++-
src/backend/storage/buffer/localbuf.c | 6 +
src/backend/storage/sync/sync.c | 2 +
src/backend/utils/activity/Makefile | 1 +
src/backend/utils/activity/meson.build | 1 +
src/backend/utils/activity/pgstat_io_ops.c | 265 +++++++++++++++++++++
src/include/pgstat.h | 80 +++++++
src/include/storage/buf_internals.h | 2 +-
src/include/storage/bufmgr.h | 7 +-
src/tools/pgindent/typedefs.list | 6 +
12 files changed, 508 insertions(+), 13 deletions(-)
create mode 100644 src/backend/utils/activity/pgstat_io_ops.c
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 5fc076fc14..04a8f89637 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -1116,6 +1116,19 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
if (!AmBackgroundWriterProcess())
CheckpointerShmem->num_backend_fsync++;
LWLockRelease(CheckpointerCommLock);
+
+ /*
+ * We have no way of knowing if the current IOContext is
+ * IOCONTEXT_BUFFER_POOL or IOCONTEXT_[BULKREAD, BULKWRITE, VACUUM] at this
+ * point, so count the fsync as being in the IOCONTEXT_BUFFER_POOL
+ * IOContext. This is probably okay, because the number of backend
+ * fsyncs doesn't say anything about the efficacy of the
+ * BufferAccessStrategy. And counting both fsyncs done in
+ * IOCONTEXT_BUFFER_POOL and IOCONTEXT_[BULKREAD, BULKWRITE, VACUUM] under
+ * IOCONTEXT_BUFFER_POOL is likely clearer when investigating the number of
+ * backend fsyncs.
+ */
+ pgstat_count_io_op(IOOP_FSYNC, IOOBJECT_RELATION, IOCONTEXT_BUFFER_POOL);
return false;
}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 82cdec0eb1..a494d7148e 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -482,7 +482,8 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
BlockNumber blockNum,
BufferAccessStrategy strategy,
bool *foundPtr);
-static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+static void FlushBuffer(BufferDesc *buf, SMgrRelation reln,
+ IOContext io_context, IOObject io_object);
static void FindAndDropRelationBuffers(RelFileLocator rlocator,
ForkNumber forkNum,
BlockNumber nForkBlock,
@@ -823,6 +824,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
BufferDesc *bufHdr;
Block bufBlock;
bool found;
+ IOContext io_context;
+ IOObject io_object;
bool isExtend;
bool isLocalBuf = SmgrIsTemp(smgr);
@@ -833,6 +836,22 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
isExtend = (blockNum == P_NEW);
+ if (isLocalBuf)
+ {
+ /*
+ * Though a strategy object may be passed in, no strategy is employed
+ * when using local buffers. This could happen when doing, for example,
+ * CREATE TEMPORARY TABLE AS ...
+ */
+ io_context = IOCONTEXT_BUFFER_POOL;
+ io_object = IOOBJECT_TEMP_RELATION;
+ }
+ else
+ {
+ io_context = IOContextForStrategy(strategy);
+ io_object = IOOBJECT_RELATION;
+ }
+
TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
smgr->smgr_rlocator.locator.spcOid,
smgr->smgr_rlocator.locator.dbOid,
@@ -990,6 +1009,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (isExtend)
{
+ pgstat_count_io_op(IOOP_EXTEND, io_object, io_context);
/* new buffers are zero-filled */
MemSet((char *) bufBlock, 0, BLCKSZ);
/* don't set checksum for all-zero page */
@@ -1015,6 +1035,9 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
instr_time io_start,
io_time;
+ pgstat_count_io_op(IOOP_READ, io_object, io_context);
+
+
if (track_io_timing)
INSTR_TIME_SET_CURRENT(io_start);
@@ -1121,6 +1144,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
BufferAccessStrategy strategy,
bool *foundPtr)
{
+ bool from_ring;
+ IOContext io_context;
BufferTag newTag; /* identity of requested block */
uint32 newHash; /* hash value for newTag */
LWLock *newPartitionLock; /* buffer partition lock for it */
@@ -1187,9 +1212,12 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
*/
LWLockRelease(newPartitionLock);
+ io_context = IOContextForStrategy(strategy);
+
/* Loop here in case we have to try another victim buffer */
for (;;)
{
+
/*
* Ensure, while the spinlock's not yet held, that there's a free
* refcount entry.
@@ -1200,7 +1228,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* Select a victim buffer. The buffer is returned with its header
* spinlock still held!
*/
- buf = StrategyGetBuffer(strategy, &buf_state);
+ buf = StrategyGetBuffer(strategy, &buf_state, &from_ring);
Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
@@ -1263,13 +1291,34 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
}
}
+ /*
+ * When a strategy is in use, only flushes of dirty buffers
+ * already in the strategy ring are counted as strategy writes
+ * (IOCONTEXT [BULKREAD|BULKWRITE|VACUUM] IOOP_WRITE) for the
+ * purpose of IO operation statistics tracking.
+ *
+ * If a shared buffer initially added to the ring must be
+ * flushed before being used, this is counted as an
+ * IOCONTEXT_BUFFER_POOL IOOP_WRITE.
+ *
+ * If a shared buffer added to the ring later because the
+ * current strategy buffer is pinned or in use or because all
+ * strategy buffers were dirty and rejected (for BAS_BULKREAD
+ * operations only) requires flushing, this is counted as an
+ * IOCONTEXT_BUFFER_POOL IOOP_WRITE (from_ring will be false).
+ *
+ * When a strategy is not in use, the write can only be a
+ * "regular" write of a dirty shared buffer (IOCONTEXT_BUFFER_POOL
+ * IOOP_WRITE).
+ */
+
/* OK, do the I/O */
TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(forkNum, blockNum,
smgr->smgr_rlocator.locator.spcOid,
smgr->smgr_rlocator.locator.dbOid,
smgr->smgr_rlocator.locator.relNumber);
- FlushBuffer(buf, NULL);
+ FlushBuffer(buf, NULL, io_context, IOOBJECT_RELATION);
LWLockRelease(BufferDescriptorGetContentLock(buf));
ScheduleBufferTagForWriteback(&BackendWritebackContext,
@@ -1441,6 +1490,30 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
UnlockBufHdr(buf, buf_state);
+ if (oldFlags & BM_VALID)
+ {
+ /*
+ * When a BufferAccessStrategy is in use, evictions adding a
+ * shared buffer to the strategy ring are counted in the
+ * corresponding strategy's context. This includes the evictions
+ * done to add buffers to the ring initially as well as those
+ * done to add a new shared buffer to the ring when current
+ * buffer is pinned or otherwise in use.
+ *
+ * Blocks evicted from buffers already in the strategy ring are counted
+ * as IOCONTEXT_BULKREAD, IOCONTEXT_BULKWRITE, or IOCONTEXT_VACUUM
+ * reuses.
+ *
+ * We wait until this point to count reuses and evictions in order to
+ * avoid incorrectly counting a buffer as reused or evicted when it was
+ * released because it was concurrently pinned or in use or counting it
+ * as reused when it was rejected or when we errored out.
+ */
+ IOOp io_op = from_ring ? IOOP_REUSE : IOOP_EVICT;
+
+ pgstat_count_io_op(io_op, IOOBJECT_RELATION, io_context);
+ }
+
if (oldPartitionLock != NULL)
{
BufTableDelete(&oldTag, oldHash);
@@ -2570,7 +2643,7 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOCONTEXT_BUFFER_POOL, IOOBJECT_RELATION);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
@@ -2820,7 +2893,7 @@ BufferGetTag(Buffer buffer, RelFileLocator *rlocator, ForkNumber *forknum,
* as the second parameter. If not, pass NULL.
*/
static void
-FlushBuffer(BufferDesc *buf, SMgrRelation reln)
+FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOContext io_context, IOObject io_object)
{
XLogRecPtr recptr;
ErrorContextCallback errcallback;
@@ -2900,6 +2973,8 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
*/
bufToWrite = PageSetChecksumCopy((Page) bufBlock, buf->tag.blockNum);
+ pgstat_count_io_op(IOOP_WRITE, IOOBJECT_RELATION, io_context);
+
if (track_io_timing)
INSTR_TIME_SET_CURRENT(io_start);
@@ -3551,6 +3626,8 @@ FlushRelationBuffers(Relation rel)
localpage,
false);
+ pgstat_count_io_op(IOOP_WRITE, IOOBJECT_TEMP_RELATION, IOCONTEXT_BUFFER_POOL);
+
buf_state &= ~(BM_DIRTY | BM_JUST_DIRTIED);
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
@@ -3586,7 +3663,7 @@ FlushRelationBuffers(Relation rel)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, RelationGetSmgr(rel));
+ FlushBuffer(bufHdr, RelationGetSmgr(rel), IOCONTEXT_BUFFER_POOL, IOOBJECT_RELATION);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr);
}
@@ -3684,7 +3761,7 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, srelent->srel);
+ FlushBuffer(bufHdr, srelent->srel, IOCONTEXT_BUFFER_POOL, IOOBJECT_RELATION);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr);
}
@@ -3894,7 +3971,7 @@ FlushDatabaseBuffers(Oid dbid)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOCONTEXT_BUFFER_POOL, IOOBJECT_RELATION);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr);
}
@@ -3921,7 +3998,7 @@ FlushOneBuffer(Buffer buffer)
Assert(LWLockHeldByMe(BufferDescriptorGetContentLock(bufHdr)));
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOCONTEXT_BUFFER_POOL, IOOBJECT_RELATION);
}
/*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 64728bd7ce..937c674a7a 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -15,6 +15,7 @@
*/
#include "postgres.h"
+#include "pgstat.h"
#include "port/atomics.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
@@ -192,13 +193,15 @@ have_free_buffer(void)
* return the buffer with the buffer header spinlock still held.
*/
BufferDesc *
-StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
+StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_ring)
{
BufferDesc *buf;
int bgwprocno;
int trycounter;
uint32 local_buf_state; /* to avoid repeated (de-)referencing */
+ *from_ring = false;
+
/*
* If given a strategy object, see whether it can select a buffer. We
* assume strategy objects don't need buffer_strategy_lock.
@@ -207,7 +210,10 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
{
buf = GetBufferFromRing(strategy, buf_state);
if (buf != NULL)
+ {
+ *from_ring = true;
return buf;
+ }
}
/*
@@ -299,6 +305,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
if (strategy != NULL)
AddBufferToRing(strategy, buf);
*buf_state = local_buf_state;
+
return buf;
}
UnlockBufHdr(buf, local_buf_state);
@@ -331,6 +338,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
if (strategy != NULL)
AddBufferToRing(strategy, buf);
*buf_state = local_buf_state;
+
return buf;
}
}
@@ -596,7 +604,7 @@ FreeAccessStrategy(BufferAccessStrategy strategy)
/*
* GetBufferFromRing -- returns a buffer from the ring, or NULL if the
- * ring is empty.
+ * ring is empty / not usable.
*
* The bufhdr spin lock is held on the returned buffer.
*/
@@ -659,6 +667,37 @@ AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
strategy->buffers[strategy->current] = BufferDescriptorGetBuffer(buf);
}
+/*
+ * Utility function returning the IOContext of a given BufferAccessStrategy's
+ * strategy ring.
+ */
+IOContext
+IOContextForStrategy(BufferAccessStrategy strategy)
+{
+ if (!strategy)
+ return IOCONTEXT_BUFFER_POOL;
+
+ switch (strategy->btype)
+ {
+ case BAS_NORMAL:
+ /*
+ * Currently, GetAccessStrategy() returns NULL for
+ * BufferAccessStrategyType BAS_NORMAL, so this case is
+ * unreachable.
+ */
+ pg_unreachable();
+ return IOCONTEXT_BUFFER_POOL;
+ case BAS_BULKREAD:
+ return IOCONTEXT_BULKREAD;
+ case BAS_BULKWRITE:
+ return IOCONTEXT_BULKWRITE;
+ case BAS_VACUUM:
+ return IOCONTEXT_VACUUM;
+ }
+
+ elog(ERROR, "unrecognized BufferAccessStrategyType: %d", strategy->btype);
+}
+
/*
* StrategyRejectBuffer -- consider rejecting a dirty buffer
*
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 30d67d1c40..6361041f7a 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -18,6 +18,7 @@
#include "access/parallel.h"
#include "catalog/catalog.h"
#include "executor/instrument.h"
+#include "pgstat.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "utils/guc_hooks.h"
@@ -196,6 +197,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
LocalRefCount[b]++;
ResourceOwnerRememberBuffer(CurrentResourceOwner,
BufferDescriptorGetBuffer(bufHdr));
+
break;
}
}
@@ -226,6 +228,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
localpage,
false);
+ pgstat_count_io_op(IOOP_WRITE, IOOBJECT_TEMP_RELATION, IOCONTEXT_BUFFER_POOL);
+
/* Mark not-dirty now in case we error out below */
buf_state &= ~BM_DIRTY;
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
@@ -256,6 +260,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
ClearBufferTag(&bufHdr->tag);
buf_state &= ~(BM_VALID | BM_TAG_VALID);
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+ pgstat_count_io_op(IOOP_EVICT, IOOBJECT_TEMP_RELATION, IOCONTEXT_BUFFER_POOL);
}
hresult = (LocalBufferLookupEnt *)
@@ -275,6 +280,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
*foundPtr = false;
+
return bufHdr;
}
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 9d6a9e9109..a1bb1cef54 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -432,6 +432,8 @@ ProcessSyncRequests(void)
total_elapsed += elapsed;
processed++;
+ pgstat_count_io_op(IOOP_FSYNC, IOOBJECT_RELATION, IOCONTEXT_BUFFER_POOL);
+
if (log_checkpoints)
elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f ms",
processed,
diff --git a/src/backend/utils/activity/Makefile b/src/backend/utils/activity/Makefile
index a2e8507fd6..0098785089 100644
--- a/src/backend/utils/activity/Makefile
+++ b/src/backend/utils/activity/Makefile
@@ -22,6 +22,7 @@ OBJS = \
pgstat_checkpointer.o \
pgstat_database.o \
pgstat_function.o \
+ pgstat_io_ops.o \
pgstat_relation.o \
pgstat_replslot.o \
pgstat_shmem.o \
diff --git a/src/backend/utils/activity/meson.build b/src/backend/utils/activity/meson.build
index 5b3b558a67..1038324c32 100644
--- a/src/backend/utils/activity/meson.build
+++ b/src/backend/utils/activity/meson.build
@@ -7,6 +7,7 @@ backend_sources += files(
'pgstat_checkpointer.c',
'pgstat_database.c',
'pgstat_function.c',
+ 'pgstat_io_ops.c',
'pgstat_relation.c',
'pgstat_replslot.c',
'pgstat_shmem.c',
diff --git a/src/backend/utils/activity/pgstat_io_ops.c b/src/backend/utils/activity/pgstat_io_ops.c
new file mode 100644
index 0000000000..9e192f404a
--- /dev/null
+++ b/src/backend/utils/activity/pgstat_io_ops.c
@@ -0,0 +1,265 @@
+/* -------------------------------------------------------------------------
+ *
+ * pgstat_io_ops.c
+ * Implementation of IO operation statistics.
+ *
+ * This file contains the implementation of IO operation statistics. It is kept
+ * separate from pgstat.c to enforce the line between the statistics access /
+ * storage implementation and the details about individual types of
+ * statistics.
+ *
+ * Copyright (c) 2021-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/utils/activity/pgstat_io_ops.c
+ * -------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "utils/pgstat_internal.h"
+
+static PgStat_IOContextOps pending_IOOpStats;
+
+void
+pgstat_count_io_op(IOOp io_op, IOObject io_object, IOContext io_context)
+{
+ PgStat_IOOpCounters *pending_counters;
+
+ Assert(io_context < IOCONTEXT_NUM_TYPES);
+ Assert(io_op < IOOP_NUM_TYPES);
+ Assert(pgstat_expect_io_op(MyBackendType, io_context, io_object, io_op));
+
+ pending_counters = &pending_IOOpStats.data[io_context].data[io_object];
+
+ switch (io_op)
+ {
+ case IOOP_EVICT:
+ pending_counters->evictions++;
+ break;
+ case IOOP_EXTEND:
+ pending_counters->extends++;
+ break;
+ case IOOP_FSYNC:
+ pending_counters->fsyncs++;
+ break;
+ case IOOP_READ:
+ pending_counters->reads++;
+ break;
+ case IOOP_REUSE:
+ pending_counters->reuses++;
+ break;
+ case IOOP_WRITE:
+ pending_counters->writes++;
+ break;
+ }
+
+}
+
+const char *
+pgstat_io_context_desc(IOContext io_context)
+{
+ switch (io_context)
+ {
+ case IOCONTEXT_BULKREAD:
+ return "bulkread";
+ case IOCONTEXT_BULKWRITE:
+ return "bulkwrite";
+ case IOCONTEXT_BUFFER_POOL:
+ return "buffer pool";
+ case IOCONTEXT_VACUUM:
+ return "vacuum";
+ }
+
+ elog(ERROR, "unrecognized IOContext value: %d", io_context);
+}
+
+const char *
+pgstat_io_object_desc(IOObject io_object)
+{
+ switch(io_object)
+ {
+ case IOOBJECT_RELATION:
+ return "relation";
+ case IOOBJECT_TEMP_RELATION:
+ return "temp relation";
+ }
+
+ elog(ERROR, "unrecognized IOObject value: %d", io_object);
+}
+
+const char *
+pgstat_io_op_desc(IOOp io_op)
+{
+ switch (io_op)
+ {
+ case IOOP_EVICT:
+ return "evicted";
+ case IOOP_EXTEND:
+ return "extended";
+ case IOOP_FSYNC:
+ return "files synced";
+ case IOOP_READ:
+ return "read";
+ case IOOP_REUSE:
+ return "reused";
+ case IOOP_WRITE:
+ return "written";
+ }
+
+ elog(ERROR, "unrecognized IOOp value: %d", io_op);
+}
+
+/*
+* IO Operation statistics are not collected for all BackendTypes.
+*
+* The following BackendTypes do not participate in the cumulative stats
+* subsystem or do not do IO operations worth reporting statistics on:
+* - Syslogger because it is not connected to shared memory
+* - Archiver because most relevant archiving IO is delegated to a
+* specialized command or module
+* - WAL Receiver and WAL Writer IO is not tracked in pg_stat_io for now
+*
+* Function returns true if BackendType participates in the cumulative stats
+* subsystem for IO Operations and false if it does not.
+*/
+bool
+pgstat_io_op_stats_collected(BackendType bktype)
+{
+ return bktype != B_INVALID && bktype != B_ARCHIVER && bktype != B_LOGGER &&
+ bktype != B_WAL_RECEIVER && bktype != B_WAL_WRITER;
+}
+
+
+/*
+ * Some BackendTypes do not perform IO operations in certain IOContexts. Some
+ * IOObjects are never operated on in some IOContexts. Check that the given
+ * BackendType is expected to do IO in the given IOContext and that the given
+ * IOObject is expected to be operated on in the given IOContext.
+ */
+bool
+pgstat_bktype_io_context_io_object_valid(BackendType bktype,
+ IOContext io_context, IOObject io_object)
+{
+ bool no_temp_rel;
+
+ /*
+ * Currently, IO operations on temporary relations can only occur in the
+ * IOCONTEXT_BUFFER_POOL IOContext.
+ */
+ if (io_context != IOCONTEXT_BUFFER_POOL &&
+ io_object == IOOBJECT_TEMP_RELATION)
+ return false;
+
+ /*
+ * In core Postgres, only regular backends and WAL Sender processes
+ * executing queries will use local buffers and operate on temporary
+ * relations. Parallel workers will not use local buffers (see
+ * InitLocalBuffers()); however, extensions leveraging background workers
+ * have no such limitation, so track IO Operations on
+ * IOOBJECT_TEMP_RELATION for BackendType B_BG_WORKER.
+ */
+ no_temp_rel = bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER || bktype
+ == B_CHECKPOINTER || bktype == B_AUTOVAC_WORKER || bktype ==
+ B_STANDALONE_BACKEND || bktype == B_STARTUP;
+
+ if (no_temp_rel && io_context == IOCONTEXT_BUFFER_POOL && io_object ==
+ IOOBJECT_TEMP_RELATION)
+ return false;
+
+ /*
+ * Some BackendTypes do not currently perform any IO operations in certain
+ * IOContexts, and, while it may not be inherently incorrect for them to
+ * do so, excluding those rows from the view makes the view easier to use.
+ */
+ if ((bktype == B_CHECKPOINTER || bktype == B_BG_WRITER) &&
+ (io_context == IOCONTEXT_BULKREAD || io_context ==
+ IOCONTEXT_BULKWRITE || io_context == IOCONTEXT_VACUUM))
+ return false;
+
+ if (bktype == B_AUTOVAC_LAUNCHER && io_context == IOCONTEXT_VACUUM)
+ return false;
+
+ if ((bktype == B_AUTOVAC_WORKER || bktype == B_AUTOVAC_LAUNCHER) &&
+ io_context == IOCONTEXT_BULKWRITE)
+ return false;
+
+ return true;
+}
+
+/*
+ * Some BackendTypes will never do certain IOOps and some IOOps should not
+ * occur in certain IOContexts. Check that the given IOOp is valid for the
+ * given BackendType in the given IOContext. Note that there are currently no
+ * cases of an IOOp being invalid for a particular BackendType only within a
+ * certain IOContext.
+ */
+bool
+pgstat_io_op_valid(BackendType bktype, IOContext io_context, IOObject io_object, IOOp io_op)
+{
+ bool strategy_io_context;
+
+ /*
+ * Some BackendTypes should never track IO Operation statistics.
+ */
+ Assert(pgstat_io_op_stats_collected(bktype));
+
+ /*
+ * Some BackendTypes will not do certain IOOps.
+ */
+ if ((bktype == B_BG_WRITER || bktype == B_CHECKPOINTER) &&
+ (io_op == IOOP_READ || io_op == IOOP_EVICT))
+ return false;
+
+ if ((bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER || bktype ==
+ B_CHECKPOINTER) && io_op == IOOP_EXTEND)
+ return false;
+
+ /*
+ * Some IOOps are not valid in certain IOContexts and some IOOps are only
+ * valid in certain contexts.
+ */
+ if (io_context == IOCONTEXT_BULKREAD && io_op == IOOP_EXTEND)
+ return false;
+
+ strategy_io_context = io_context == IOCONTEXT_BULKREAD || io_context ==
+ IOCONTEXT_BULKWRITE || io_context == IOCONTEXT_VACUUM;
+
+ /*
+ * IOOP_REUSE is only relevant when a BufferAccessStrategy is in use.
+ */
+ if (!strategy_io_context && io_op == IOOP_REUSE)
+ return false;
+
+ /*
+ * IOOP_FSYNC IOOps done by a backend using a BufferAccessStrategy are
+ * counted in the IOCONTEXT_BUFFER_POOL IOContext. See comment in
+ * ForwardSyncRequest() for more details.
+ */
+ if (strategy_io_context && io_op == IOOP_FSYNC)
+ return false;
+
+ /*
+ * Temporary tables are not logged and thus do not require fsync'ing.
+ */
+ if (io_context == IOCONTEXT_BUFFER_POOL && io_object ==
+ IOOBJECT_TEMP_RELATION && io_op == IOOP_FSYNC)
+ return false;
+
+ return true;
+}
+
+bool
+pgstat_expect_io_op(BackendType bktype, IOContext io_context, IOObject io_object, IOOp io_op)
+{
+ if (!pgstat_io_op_stats_collected(bktype))
+ return false;
+
+ if (!pgstat_bktype_io_context_io_object_valid(bktype, io_context, io_object))
+ return false;
+
+ if (!(pgstat_io_op_valid(bktype, io_context, io_object, io_op)))
+ return false;
+
+ return true;
+}
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 9e2ce6f011..e2beafb9b2 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -14,6 +14,7 @@
#include "datatype/timestamp.h"
#include "portability/instr_time.h"
#include "postmaster/pgarch.h" /* for MAX_XFN_CHARS */
+#include "storage/buf.h"
#include "utils/backend_progress.h" /* for backward compatibility */
#include "utils/backend_status.h" /* for backward compatibility */
#include "utils/relcache.h"
@@ -276,6 +277,63 @@ typedef struct PgStat_CheckpointerStats
PgStat_Counter buf_fsync_backend;
} PgStat_CheckpointerStats;
+/*
+ * Types related to counting IO Operations for various IO Contexts
+ * When adding a new value, ensure that the proper assertions are added to
+ * pgstat_io_context_ops_assert_zero() and pgstat_io_op_assert_zero() (though
+ * the compiler will remind you about the latter)
+ */
+
+typedef enum IOOp
+{
+ IOOP_EVICT,
+ IOOP_EXTEND,
+ IOOP_FSYNC,
+ IOOP_READ,
+ IOOP_REUSE,
+ IOOP_WRITE,
+} IOOp;
+
+#define IOOP_NUM_TYPES (IOOP_WRITE + 1)
+
+typedef enum IOObject
+{
+ IOOBJECT_RELATION,
+ IOOBJECT_TEMP_RELATION,
+} IOObject;
+
+#define IOOBJECT_NUM_TYPES (IOOBJECT_TEMP_RELATION + 1)
+
+typedef enum IOContext
+{
+ IOCONTEXT_BULKREAD,
+ IOCONTEXT_BULKWRITE,
+ IOCONTEXT_BUFFER_POOL,
+ IOCONTEXT_VACUUM,
+} IOContext;
+
+#define IOCONTEXT_NUM_TYPES (IOCONTEXT_VACUUM + 1)
+
+typedef struct PgStat_IOOpCounters
+{
+ PgStat_Counter evictions;
+ PgStat_Counter extends;
+ PgStat_Counter fsyncs;
+ PgStat_Counter reads;
+ PgStat_Counter reuses;
+ PgStat_Counter writes;
+} PgStat_IOOpCounters;
+
+typedef struct PgStat_IOObjectOps
+{
+ PgStat_IOOpCounters data[IOOBJECT_NUM_TYPES];
+} PgStat_IOObjectOps;
+
+typedef struct PgStat_IOContextOps
+{
+ PgStat_IOObjectOps data[IOCONTEXT_NUM_TYPES];
+} PgStat_IOContextOps;
+
typedef struct PgStat_StatDBEntry
{
PgStat_Counter n_xact_commit;
@@ -453,6 +511,28 @@ extern void pgstat_report_checkpointer(void);
extern PgStat_CheckpointerStats *pgstat_fetch_stat_checkpointer(void);
+/*
+ * Functions in pgstat_io_ops.c
+ */
+
+extern void pgstat_count_io_op(IOOp io_op, IOObject io_object, IOContext io_context);
+extern const char *pgstat_io_context_desc(IOContext io_context);
+extern const char *pgstat_io_object_desc(IOObject io_object);
+extern const char *pgstat_io_op_desc(IOOp io_op);
+
+/* Validation functions in pgstat_io_ops.c */
+extern bool pgstat_io_op_stats_collected(BackendType bktype);
+extern bool pgstat_bktype_io_context_io_object_valid(BackendType bktype,
+ IOContext io_context, IOObject io_object);
+extern bool pgstat_io_op_valid(BackendType bktype, IOContext io_context,
+ IOObject io_object, IOOp io_op);
+extern bool pgstat_expect_io_op(BackendType bktype,
+ IOContext io_context, IOObject io_object, IOOp io_op);
+
+/* IO stats translation function in freelist.c */
+extern IOContext IOContextForStrategy(BufferAccessStrategy bas);
+
+
/*
* Functions in pgstat_database.c
*/
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index b75481450d..7b67250747 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -392,7 +392,7 @@ extern void ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *
/* freelist.c */
extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
- uint32 *buf_state);
+ uint32 *buf_state, bool *from_ring);
extern void StrategyFreeBuffer(BufferDesc *buf);
extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
BufferDesc *buf, bool from_ring);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index e1bd22441b..206f4c0b3e 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -23,7 +23,12 @@
typedef void *Block;
-/* Possible arguments for GetAccessStrategy() */
+/*
+ * Possible arguments for GetAccessStrategy().
+ *
+ * If adding a new BufferAccessStrategyType, also add a new IOContext so
+ * statistics on IO operations using this strategy are tracked.
+ */
typedef enum BufferAccessStrategyType
{
BAS_NORMAL, /* Normal random access */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 9683b0a88e..6088c44842 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1106,7 +1106,10 @@ ID
INFIX
INT128
INTERFACE_INFO
+IOContext
IOFuncSelector
+IOObject
+IOOp
IPCompareMethod
ITEM
IV
@@ -2026,6 +2029,9 @@ PgStat_FetchConsistency
PgStat_FunctionCallUsage
PgStat_FunctionCounts
PgStat_HashKey
+PgStat_IOContextOps
+PgStat_IOObjectOps
+PgStat_IOOpCounters
PgStat_Kind
PgStat_KindInfo
PgStat_LocalState
--
2.38.1
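To make the validity cascade in the hunk above easier to follow, here is a standalone sketch of how the two fsync rules compose. The enum values are copied from the patch's pgstat.h hunk, but `io_op_valid` is a simplified stand-in, not the patch's actual `pgstat_io_op_valid` (which also takes a BackendType):

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins for the enums the patch adds to pgstat.h. */
typedef enum { IOOP_EVICT, IOOP_EXTEND, IOOP_FSYNC, IOOP_READ, IOOP_REUSE, IOOP_WRITE } IOOp;
typedef enum { IOOBJECT_RELATION, IOOBJECT_TEMP_RELATION } IOObject;
typedef enum { IOCONTEXT_BULKREAD, IOCONTEXT_BULKWRITE, IOCONTEXT_BUFFER_POOL, IOCONTEXT_VACUUM } IOContext;

/*
 * Mirrors the two rules visible in the hunk: fsyncs of writes done in a
 * strategy IOContext are attributed to IOCONTEXT_BUFFER_POOL instead, and
 * temporary relations are never fsynced at all.
 */
static bool
io_op_valid(IOContext io_context, IOObject io_object, IOOp io_op)
{
	bool		strategy_io_context = (io_context == IOCONTEXT_BULKREAD ||
									   io_context == IOCONTEXT_BULKWRITE ||
									   io_context == IOCONTEXT_VACUUM);

	/* fsyncs of strategy writes are counted under the buffer pool context */
	if (strategy_io_context && io_op == IOOP_FSYNC)
		return false;

	/* temp relations are not WAL-logged and never fsynced */
	if (io_context == IOCONTEXT_BUFFER_POOL &&
		io_object == IOOBJECT_TEMP_RELATION && io_op == IOOP_FSYNC)
		return false;

	return true;
}
```

This is why the view described later in the thread shows NULL for `files_synced` in the strategy contexts and for temp relations: those cells are invalid combinations, not zero counts.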
Attachment: v37-0001-Remove-BufferAccessStrategyData-current_was_in_r.patch (application/octet-stream)
From 887a168a3e9123830d35c9a8fe2afb7cec171b46 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Thu, 13 Oct 2022 11:03:05 -0700
Subject: [PATCH v37 1/5] Remove BufferAccessStrategyData->current_was_in_ring
It is a duplication of StrategyGetBuffer->from_ring.
---
src/backend/storage/buffer/bufmgr.c | 2 +-
src/backend/storage/buffer/freelist.c | 15 ++-------------
src/include/storage/buf_internals.h | 2 +-
3 files changed, 4 insertions(+), 15 deletions(-)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 73d30bf619..82cdec0eb1 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1254,7 +1254,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
UnlockBufHdr(buf, buf_state);
if (XLogNeedsFlush(lsn) &&
- StrategyRejectBuffer(strategy, buf))
+ StrategyRejectBuffer(strategy, buf, from_ring))
{
/* Drop lock/pin and loop around for another buffer */
LWLockRelease(BufferDescriptorGetContentLock(buf));
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 990e081aae..64728bd7ce 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -81,12 +81,6 @@ typedef struct BufferAccessStrategyData
*/
int current;
- /*
- * True if the buffer just returned by StrategyGetBuffer had been in the
- * ring already.
- */
- bool current_was_in_ring;
-
/*
* Array of buffer numbers. InvalidBuffer (that is, zero) indicates we
* have not yet selected a buffer for this ring slot. For allocation
@@ -625,10 +619,7 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
*/
bufnum = strategy->buffers[strategy->current];
if (bufnum == InvalidBuffer)
- {
- strategy->current_was_in_ring = false;
return NULL;
- }
/*
* If the buffer is pinned we cannot use it under any circumstances.
@@ -644,7 +635,6 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
if (BUF_STATE_GET_REFCOUNT(local_buf_state) == 0
&& BUF_STATE_GET_USAGECOUNT(local_buf_state) <= 1)
{
- strategy->current_was_in_ring = true;
*buf_state = local_buf_state;
return buf;
}
@@ -654,7 +644,6 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
* Tell caller to allocate a new buffer with the normal allocation
* strategy. He'll then replace this ring element via AddBufferToRing.
*/
- strategy->current_was_in_ring = false;
return NULL;
}
@@ -682,14 +671,14 @@ AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
* if this buffer should be written and re-used.
*/
bool
-StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
+StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool from_ring)
{
/* We only do this in bulkread mode */
if (strategy->btype != BAS_BULKREAD)
return false;
/* Don't muck with behavior of normal buffer-replacement strategy */
- if (!strategy->current_was_in_ring ||
+ if (!from_ring ||
strategy->buffers[strategy->current] != BufferDescriptorGetBuffer(buf))
return false;
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 406db6be78..b75481450d 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -395,7 +395,7 @@ extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
uint32 *buf_state);
extern void StrategyFreeBuffer(BufferDesc *buf);
extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
- BufferDesc *buf);
+ BufferDesc *buf, bool from_ring);
extern int StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
extern void StrategyNotifyBgWriter(int bgwprocno);
--
2.38.1
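The refactor in patch 0001 replaces state stashed in the strategy object (`current_was_in_ring`) with a `from_ring` flag passed along by the caller. A toy model of the resulting `StrategyRejectBuffer` decision, with the bufmgr.c details reduced to booleans (this is an illustration, not the real function):

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { BAS_NORMAL, BAS_BULKREAD, BAS_VACUUM, BAS_BULKWRITE } BufferAccessStrategyType;

/*
 * Toy model of StrategyRejectBuffer after the patch: the caller passes the
 * from_ring flag it already received from StrategyGetBuffer(), instead of the
 * strategy remembering whether its last buffer came from the ring.
 */
static bool
reject_dirty_buffer(BufferAccessStrategyType btype, bool from_ring,
					bool needs_wal_flush)
{
	/* We only do this in bulkread mode */
	if (btype != BAS_BULKREAD)
		return false;

	/* Don't muck with behavior of normal buffer-replacement strategy */
	if (!from_ring)
		return false;

	/* Reject (leave dirty) rather than pay a WAL flush for a bulk read */
	return needs_wal_flush;
}
```

Passing `from_ring` through keeps the information in exactly one place, which is what lets patch 0001 delete the duplicated field and its three assignment sites in freelist.c.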
Attachment: v37-0004-Add-system-view-tracking-IO-ops-per-backend-type.patch (application/octet-stream)
From c73f17fd7b2e84c3339b9dbd0a16b05cbb25d3bd Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Thu, 3 Nov 2022 12:19:48 -0400
Subject: [PATCH v37 4/5] Add system view tracking IO ops per backend type
Add pg_stat_io, a system view which tracks the number of IOOps
(evictions, reuses, reads, writes, extends, and fsyncs) done through
each IOContext (shared buffers, local buffers, and buffers reserved by a
BufferAccessStrategy) by each type of backend (e.g. client backend,
checkpointer).
Some BackendTypes do not accumulate IO operations statistics and will
not be included in the view.
Some IOContexts are not used by some BackendTypes and will not be in the
view. For example, checkpointer does not use a BufferAccessStrategy
(currently), so there will be no rows for BufferAccessStrategy
IOContexts for checkpointer.
Some IOOps are invalid in combination with certain IOContexts. Those
cells will be NULL in the view to distinguish between 0 observed IOOps
of that type and an invalid combination. For example, local buffers are
not fsynced, so cells for all BackendTypes for IOOBJECT_TEMP_RELATION and
IOOP_FSYNC will be NULL.
Some BackendTypes never perform certain IOOps. Those cells will also be
NULL in the view. For example, bgwriter should not perform reads.
View stats are populated with statistics incremented when a backend
performs an IO Operation and maintained by the cumulative statistics
subsystem.
Each row of the view shows stats for a particular BackendType and
IOContext combination (e.g. shared buffer accesses by checkpointer) and
each column in the view is the total number of IO Operations done (e.g.
writes).
So a cell in the view would be, for example, the number of shared
buffers written by checkpointer since the last stats reset.
In anticipation of tracking WAL IO and non-block-oriented IO (such as
temporary file IO), the "unit" column specifies the unit of the "read",
"written", and "extended" columns for a given row.
Note that some of the cells in the view are redundant with fields in
pg_stat_bgwriter (e.g. buffers_backend), however these have been kept in
pg_stat_bgwriter for backwards compatibility. Deriving the redundant
pg_stat_bgwriter stats from the IO operations stats structures was also
problematic due to the separate reset targets for 'bgwriter' and 'io'.
Suggested by Andres Freund
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Maciek Sakrejda <m.sakrejda@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
doc/src/sgml/monitoring.sgml | 621 ++++++++++++++++++++++++++-
src/backend/catalog/system_views.sql | 15 +
src/backend/utils/adt/pgstatfuncs.c | 139 ++++++
src/include/catalog/pg_proc.dat | 9 +
src/test/regress/expected/rules.out | 12 +
src/test/regress/expected/stats.out | 240 +++++++++++
src/test/regress/sql/stats.sql | 139 ++++++
7 files changed, 1159 insertions(+), 16 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 698f274341..bab010c1ce 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -448,6 +448,15 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</entry>
</row>
+ <row>
+ <entry><structname>pg_stat_io</structname><indexterm><primary>pg_stat_io</primary></indexterm></entry>
+ <entry>A row for each IO Context for each backend type showing
+ statistics about backend IO operations. See
+ <link linkend="monitoring-pg-stat-io-view">
+ <structname>pg_stat_io</structname></link> for details.
+ </entry>
+ </row>
+
<row>
<entry><structname>pg_stat_wal</structname><indexterm><primary>pg_stat_wal</primary></indexterm></entry>
<entry>One row only, showing statistics about WAL activity. See
@@ -658,20 +667,20 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</para>
<para>
- The <structname>pg_statio_</structname> views are primarily useful to
- determine the effectiveness of the buffer cache. When the number
- of actual disk reads is much smaller than the number of buffer
- hits, then the cache is satisfying most read requests without
- invoking a kernel call. However, these statistics do not give the
- entire story: due to the way in which <productname>PostgreSQL</productname>
- handles disk I/O, data that is not in the
- <productname>PostgreSQL</productname> buffer cache might still reside in the
- kernel's I/O cache, and might therefore still be fetched without
- requiring a physical read. Users interested in obtaining more
- detailed information on <productname>PostgreSQL</productname> I/O behavior are
- advised to use the <productname>PostgreSQL</productname> statistics views
- in combination with operating system utilities that allow insight
- into the kernel's handling of I/O.
+ The <structname>pg_stat_io</structname> and
+ <structname>pg_statio_</structname> set of views are primarily useful to
+ determine the effectiveness of the buffer cache. When the number of actual
+ disk reads is much smaller than the number of buffer hits, then the cache is
+ satisfying most read requests without invoking a kernel call. However, these
+ statistics do not give the entire story: due to the way in which
+ <productname>PostgreSQL</productname> handles disk I/O, data that is not in
+ the <productname>PostgreSQL</productname> buffer cache might still reside in
+ the kernel's I/O cache, and might therefore still be fetched without
+ requiring a physical read. Users interested in obtaining more detailed
+ information on <productname>PostgreSQL</productname> I/O behavior are
+ advised to use the <productname>PostgreSQL</productname> statistics views in
+ combination with operating system utilities that allow insight into the
+ kernel's handling of I/O.
</para>
</sect2>
@@ -3600,13 +3609,12 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
<structfield>stats_reset</structfield> <type>timestamp with time zone</type>
</para>
<para>
- Time at which these statistics were last reset
+ Time at which these statistics were last reset.
</para></entry>
</row>
</tbody>
</tgroup>
</table>
-
<para>
Normally, WAL files are archived in order, oldest to newest, but that is
not guaranteed, and does not hold under special circumstances like when
@@ -3615,7 +3623,588 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
<structfield>last_archived_wal</structfield> have also been successfully
archived.
</para>
+ </sect2>
+
+ <sect2 id="monitoring-pg-stat-io-view">
+ <title><structname>pg_stat_io</structname></title>
+
+ <indexterm>
+ <primary>pg_stat_io</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_stat_io</structname> view has a row for each backend type
+ and IO context containing global data for the cluster on IO operations done
+ by that backend type in that IO context. Currently only a subset of IO
+ operations are tracked here. WAL IO, IO on temporary files, and some forms
+ of IO outside of shared buffers (such as when building indexes or moving a
+ table from one tablespace to another) may be added in the future.
+ </para>
+
+ <table id="pg-stat-io-view" xreflabel="pg_stat_io">
+ <title><structname>pg_stat_io</structname> View</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>backend_type</structfield> <type>text</type>
+ </para>
+ <para>
+ Type of backend (e.g. background worker, autovacuum worker).
+ See <link linkend="monitoring-pg-stat-activity-view">
+ <structname>pg_stat_activity</structname></link> for more information on
+ <varname>backend_type</varname>s. Some <varname>backend_type</varname>s
+ do not accumulate IO operation statistics and will not be included in
+ the view.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>io_context</structfield> <type>text</type>
+ </para>
+ <para>
+ The context or location of an IO operation.
+ </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <varname>io_context</varname> <literal>buffer pool</literal> refers to
+ IO operations on data in both the shared buffer pool and process-local
+ buffer pools used for temporary relation data.
+ </para>
+ <para>
+ Operations on temporary relations are tracked in
+ <varname>io_context</varname> <literal>buffer pool</literal> and
+ <varname>io_object</varname> <literal>temp relation</literal>.
+ </para>
+ <para>
+ Operations on permanent relations are tracked in
+ <varname>io_context</varname> <literal>buffer pool</literal> and
+ <varname>io_object</varname> <literal>relation</literal>.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <varname>io_context</varname> <literal>vacuum</literal> refers to the IO
+ operations incurred while vacuuming and analyzing.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <varname>io_context</varname> <literal>bulkread</literal> refers to IO
+ operations specially designated as <literal>bulk reads</literal>, such
+ as the sequential scan of a large table.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <varname>io_context</varname> <literal>bulkwrite</literal> refers to IO
+ operations specially designated as <literal>bulk writes</literal>, such
+ as <command>COPY</command>.
+ </para>
+ </listitem>
+ </itemizedlist>
+
+ <para>
+ These last three <varname>io_context</varname>s are counted separately
+ because the autovacuum daemon, explicit <command>VACUUM</command>,
+ explicit <command>ANALYZE</command>, many bulk reads, and many bulk
+ writes use a fixed amount of memory, acquiring the equivalent number of
+ shared buffers and reusing them circularly to avoid occupying an undue
+ portion of the main shared buffer pool. This pattern is called a
+ <quote>Buffer Access Strategy</quote> in the
+ <productname>PostgreSQL</productname> source code and the fixed-size
+ ring buffer is referred to as a <quote>strategy ring buffer</quote> for
+ the purposes of this view's documentation. These
+ <varname>io_context</varname>s are referred to as <quote>strategy
+ contexts</quote> and IO operations on strategy contexts are referred to
+ as <quote>strategy operations</quote>.
+ </para>
+ <para>
+ Some <varname>io_context</varname>s are not used by some
+ <varname>backend_type</varname>s and will not be in the view. For
+ example, the checkpointer does not use a Buffer Access Strategy
+ (currently), so there will be no rows for <varname>backend_type</varname>
+ <literal>checkpointer</literal> and any of the strategy
+ <varname>io_context</varname>s.
+ </para>
+ <para>
+ Some IO operations are invalid in combination with certain
+ <varname>io_context</varname>s and <varname>io_object</varname>s. Those
+ cells will be NULL to distinguish between 0 observed IO operations of
+ that type and an invalid combination. For example, temporary tables are
+ not fsynced, so cells for all <varname>backend_type</varname>s for
+ <varname>io_object</varname> <literal>temp relation</literal> in
+ <varname>io_context</varname> <literal>buffer pool</literal> for
+ <varname>files_synced</varname> will be NULL. Some
+ <varname>backend_type</varname>s never perform certain IO operations.
+ Those cells will also be NULL in the view. For example <varname>backend
+ type</varname> <literal>background writer</literal> should not perform
+ reads.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>io_object</structfield> <type>text</type>
+ </para>
+ <para>
+ Object operated on in a given <varname>io_context</varname> by a given
+ <varname>backend_type</varname>.
+ </para>
+
+ <para> Some <varname>backend_type</varname>s will never do IO operations
+ on some <varname>io_object</varname>s, either at all or in certain
+ <varname>io_context</varname>s. These rows are omitted from the
+ view.</para>
+
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>read</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Reads by this <varname>backend_type</varname> into buffers in this
+ <varname>io_context</varname>.
+ <varname>read</varname> plus <varname>extended</varname> for
+ <varname>backend_type</varname>s
+
+ <itemizedlist>
+
+ <listitem>
+ <para>
+ <literal>autovacuum launcher</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>autovacuum worker</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>client backend</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>standalone backend</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>background worker</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>walsender</literal>
+ </para>
+ </listitem>
+
+ </itemizedlist>
+
+ for all
+ <varname>io_context</varname>s is similar to the sum of
+ <itemizedlist>
+
+ <listitem>
+ <para>
+ <varname>heap_blks_read</varname>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <varname>idx_blks_read</varname>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <varname>tidx_blks_read</varname>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <varname>toast_blks_read</varname>
+ </para>
+ </listitem>
+
+ </itemizedlist>
+
+ in <link linkend="monitoring-pg-statio-all-tables-view">
+ <structname>pg_statio_all_tables</structname></link> and
+ <varname>blks_read</varname> from <link
+ linkend="monitoring-pg-stat-database-view">
+ <structname>pg_stat_database</structname></link>.
+
+ The difference is that reads done as part of <command>CREATE
+ DATABASE</command> are not counted in
+ <structname>pg_statio_all_tables</structname> and
+ <structname>pg_stat_database</structname>
+ </para>
+
+ <para>If using the <productname>PostgreSQL</productname> extension,
+ <xref linkend="pgstatstatements"/>,
+ <varname>read</varname> for
+ <varname>backend_type</varname>s
+
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>autovacuum launcher</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>autovacuum worker</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>client backend</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>standalone backend</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>background worker</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>walsender</literal>
+ </para>
+ </listitem>
+ </itemizedlist>
+
+ for all
+ <varname>io_context</varname>s is equivalent to
+ <varname>shared_blks_read</varname> plus
+ <varname>local_blks_read</varname> in <varname>pg_stat_statements</varname>
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>written</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Writes of data in this <varname>io_context</varname> written out by this
+ <varname>backend_type</varname>.
+ </para>
+
+ <para>
+ Normal client backends should be able to rely on auxiliary processes
+ like the checkpointer and background writer to write out dirty data as
+ much as possible. Large numbers of writes by
+ <varname>backend_type</varname> <literal>client backend</literal> in
+ <varname>io_context</varname> <literal>buffer pool</literal> and
+ <varname>io_object</varname> <literal>relation</literal> could indicate
+ a misconfiguration of shared buffers or of checkpointer. More
+ information on checkpointer configuration can be found in <xref
+ linkend="wal-configuration"/>.
+ </para>
+
+ <para>Note that the values of <varname>written</varname> for
+ <varname>backend_type</varname> <literal>background writer</literal> and
+ <varname>backend_type</varname> <literal>checkpointer</literal> are
+ equivalent to the values of <varname>buffers_clean</varname> and
+ <varname>buffers_checkpoint</varname>, respectively, in <link
+ linkend="monitoring-pg-stat-bgwriter-view">
+ <structname>pg_stat_bgwriter</structname></link>.
+ </para>
+
+ <para>Also, the sum of
+ <varname>written</varname> plus <varname>extended</varname> in this view
+ for <varname>backend_type</varname>s
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>client backend</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>autovacuum worker</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>background worker</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>walsender</literal>
+ </para>
+ </listitem>
+ </itemizedlist>
+ on
+
+ <itemizedlist>
+ <listitem>
+ <para>
+ <varname>io_object</varname> <literal>relation</literal> in
+ <varname>io_context</varname>s <literal>buffer pool</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <varname>io_object</varname> <literal>relation</literal> in
+ <varname>io_context</varname> <literal>bulkread</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <varname>io_object</varname> <literal>relation</literal> in
+ <varname>io_context</varname> <literal>bulkwrite</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <varname>io_object</varname> <literal>relation</literal> in
+ <varname>io_context</varname> <literal>vacuum</literal>
+ </para>
+ </listitem>
+ </itemizedlist>
+
+ is equivalent to <varname>buffers_backend</varname> in
+ <structname>pg_stat_bgwriter</structname>.
+ </para>
+
+ <para>If using the <productname>PostgreSQL</productname> extension,
+ <xref linkend="pgstatstatements"/>, <varname>written</varname> plus
+ <varname>extended</varname> for
+ <varname>backend_type</varname>s
+
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>autovacuum launcher</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>autovacuum worker</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>client backend</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>standalone backend</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>background worker</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>walsender</literal>
+ </para>
+ </listitem>
+
+ </itemizedlist>
+
+ for all <varname>io_context</varname>s is equivalent to
+ <varname>shared_blks_written</varname> plus
+ <varname>local_blks_written</varname> in
+ <varname>pg_stat_statements</varname>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>extended</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Extends of relations done by this <varname>backend_type</varname> in
+ order to write data in this <varname>io_context</varname>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>op_bytes</structfield> <type>bigint</type>
+ </para>
+ <para>
+ The number of bytes per unit of IO read, written, or extended. For
+ block-oriented IO of relation data, reads, writes, and extends are done
+ in <varname>block_size</varname> units, derived from the build-time
+ parameter <symbol>BLCKSZ</symbol>, which is <literal>8192</literal> by
+ default. Future values could include those derived from
+ <symbol>XLOG_BLCKSZ</symbol>, once WAL IO is tracked in this view, and
+ constant multipliers once non-block-oriented IO (e.g. temporary file IO)
+ is tracked here.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>evicted</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of times a <varname>backend_type</varname> has evicted a block
+ from a shared or local buffer in order to reuse the buffer in this
+ <varname>io_context</varname>. Blocks are only evicted when there are no
+ unoccupied buffers.
+ </para>
+
+ <para>
+ <varname>evicted</varname> in <varname>io_context</varname>
+ <literal>buffer pool</literal> and <varname>io_object</varname>
+ <literal>relation</literal> counts the number of times a block from a
+ shared buffer was evicted so that it can be replaced with another block,
+ also in shared buffers.
+
+ A high <varname>evicted</varname> count in <varname>io_context</varname>
+ <literal>buffer pool</literal> and <varname>io_object</varname>
+ <literal>relation</literal> could indicate that shared buffers is too
+ small and should be set to a larger value.
+ </para>
+ <para>
+ <varname>evicted</varname> in <varname>io_context</varname>
+ <literal>vacuum</literal>, <literal>bulkread</literal>, and
+ <literal>bulkwrite</literal> counts the number of times occupied shared
+ buffers were added to the fixed-size strategy ring buffer, causing the
+ buffer contents to be evicted. If the current buffer in the ring is
+ pinned or in use by another backend, it may be replaced by a new shared
+ buffer. If this shared buffer contains valid data, that block must be
+ evicted and will count as <varname>evicted</varname>.
+
+ Seeing a large number of <varname>evicted</varname> in strategy
+ <varname>io_context</varname>s can provide insight into primary working
+ set cache misses.
+ </para>
+
+ <para>
+ <varname>evicted</varname> in <varname>io_context</varname>
+ <literal>buffer pool</literal> and <varname>io_object</varname>
+ <literal>temp relation</literal> counts the number of times a block of
+ data from an existing local buffer was evicted in order to replace it
+ with another block, also in local buffers.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>reused</structfield> <type>bigint</type>
+ </para>
+ <para>
+ The number of times an existing buffer in the strategy ring was reused
+ as part of an operation in the <literal>bulkread</literal>,
+ <literal>bulkwrite</literal>, or <literal>vacuum</literal>
+ <varname>io_context</varname>s. When a <quote>Buffer Access
+ Strategy</quote> reuses a buffer in the strategy ring, it evicts the
+ buffer contents, incrementing <varname>reused</varname>. When a
+ <quote>Buffer Access Strategy</quote> adds a new shared buffer to the
+ strategy ring and this shared buffer is occupied, the <quote>Buffer
+ Access Strategy</quote> must evict the contents of the shared buffer,
+ incrementing <varname>evicted</varname>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>files_synced</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of files <literal>fsync</literal>ed by this
+ <varname>backend_type</varname> for the purpose of persisting data
+ dirtied in this <varname>io_context</varname>. <literal>fsync</literal>s
+ are done at segment boundaries so <varname>op_bytes</varname>
+ does not apply to the <varname>files_synced</varname> column.
+ <literal>fsync</literal>s done by backends in order to persist data
+ written in <varname>io_context</varname> <literal>vacuum</literal>,
+ <varname>io_context</varname> <literal>bulkread</literal>, or
+ <varname>io_context</varname> <literal>bulkwrite</literal> are counted
+ as <varname>io_context</varname> <literal>buffer pool</literal>
+ <varname>io_object</varname> <literal>relation</literal>
+ <varname>files_synced</varname>.
+ </para>
+
+ <para>
+ Normal client backends should be able to rely on the checkpointer to
+ ensure data is persisted to permanent storage. Large numbers of
+ <varname>files_synced</varname> by <varname>backend_type</varname>
+ <literal>client backend</literal> could indicate a misconfiguration of
+ shared buffers or of checkpointer. More information on checkpointer
+ configuration can be found in <xref linkend="wal-configuration"/>.
+ </para>
+
+ <para>
+ Note that the sum of <varname>files_synced</varname> for all
+ <varname>io_context</varname> <literal>buffer pool</literal>
+ <varname>io_object</varname> <literal>relation</literal> for all
+ <varname>backend_type</varname>s except <literal>checkpointer</literal>
+ is equivalent to <varname>buffers_backend_fsync</varname> in <link
+ linkend="monitoring-pg-stat-bgwriter-view">
+ <structname>pg_stat_bgwriter</structname></link>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
+ </para>
+ <para>
+ Time at which these statistics were last reset.
+ </para></entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
</sect2>
<sect2 id="monitoring-pg-stat-bgwriter-view">
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 2d8104b090..296b3acf6e 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1117,6 +1117,21 @@ CREATE VIEW pg_stat_bgwriter AS
pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+CREATE VIEW pg_stat_io AS
+SELECT
+ b.backend_type,
+ b.io_context,
+ b.io_object,
+ b.read,
+ b.written,
+ b.extended,
+ b.op_bytes,
+ b.evicted,
+ b.reused,
+ b.files_synced,
+ b.stats_reset
+FROM pg_stat_get_io() b;
+
CREATE VIEW pg_stat_wal AS
SELECT
w.wal_records,
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index b783af130c..e7bc123e05 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1725,6 +1725,145 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
}
+/*
+ * When adding a new column to the pg_stat_io view, add a new enum value
+ * here above IO_NUM_COLUMNS.
+ */
+typedef enum io_stat_col
+{
+ IO_COL_BACKEND_TYPE,
+ IO_COL_IO_CONTEXT,
+ IO_COL_IO_OBJECT,
+ IO_COL_READS,
+ IO_COL_WRITES,
+ IO_COL_EXTENDS,
+ IO_COL_CONVERSION,
+ IO_COL_EVICTIONS,
+ IO_COL_REUSES,
+ IO_COL_FSYNCS,
+ IO_COL_RESET_TIME,
+ IO_NUM_COLUMNS,
+} io_stat_col;
+
+/*
+ * When adding a new IOOp, add a new io_stat_col and add a case to this
+ * function returning the corresponding io_stat_col.
+ */
+static io_stat_col
+pgstat_io_op_get_index(IOOp io_op)
+{
+ switch (io_op)
+ {
+ case IOOP_EVICT:
+ return IO_COL_EVICTIONS;
+ case IOOP_READ:
+ return IO_COL_READS;
+ case IOOP_REUSE:
+ return IO_COL_REUSES;
+ case IOOP_WRITE:
+ return IO_COL_WRITES;
+ case IOOP_EXTEND:
+ return IO_COL_EXTENDS;
+ case IOOP_FSYNC:
+ return IO_COL_FSYNCS;
+ }
+
+ elog(ERROR, "unrecognized IOOp value: %d", io_op);
+}
+
+Datum
+pg_stat_get_io(PG_FUNCTION_ARGS)
+{
+ PgStat_BackendIOContextOps *backends_io_stats;
+ ReturnSetInfo *rsinfo;
+ Datum reset_time;
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+
+ backends_io_stats = pgstat_fetch_backend_io_context_ops();
+
+ reset_time = TimestampTzGetDatum(backends_io_stats->stat_reset_timestamp);
+
+ for (int bktype = 0; bktype < BACKEND_NUM_TYPES; bktype++)
+ {
+ Datum bktype_desc = CStringGetTextDatum(GetBackendTypeDesc((BackendType) bktype));
+ bool expect_backend_stats;
+ PgStat_IOContextOps *io_context_ops = &backends_io_stats->stats[bktype];
+
+ /*
+ * For those BackendTypes without IO Operation stats, skip
+ * representing them in the view altogether.
+ */
+ expect_backend_stats = pgstat_io_op_stats_collected((BackendType) bktype);
+
+ for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ {
+ const char *io_context_str = pgstat_io_context_desc(io_context);
+ PgStat_IOObjectOps *io_objs = &io_context_ops->data[io_context];
+
+ for (int io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++)
+ {
+ PgStat_IOOpCounters *counters = &io_objs->data[io_object];
+ const char *io_obj_str = pgstat_io_object_desc(io_object);
+
+ Datum values[IO_NUM_COLUMNS] = {0};
+ bool nulls[IO_NUM_COLUMNS] = {0};
+
+ /*
+ * Some combinations of IOContext, IOObject, and BackendType are
+ * not valid for any type of IOOp. In such cases, omit the
+ * entire row from the view.
+ */
+ if (!expect_backend_stats ||
+ !pgstat_bktype_io_context_io_object_valid((BackendType) bktype,
+ (IOContext) io_context, (IOObject) io_object))
+ {
+ pgstat_io_context_ops_assert_zero(counters);
+ continue;
+ }
+
+ values[IO_COL_BACKEND_TYPE] = bktype_desc;
+ values[IO_COL_IO_CONTEXT] = CStringGetTextDatum(io_context_str);
+ values[IO_COL_IO_OBJECT] = CStringGetTextDatum(io_obj_str);
+ values[IO_COL_READS] = Int64GetDatum(counters->reads);
+ values[IO_COL_WRITES] = Int64GetDatum(counters->writes);
+ values[IO_COL_EXTENDS] = Int64GetDatum(counters->extends);
+ /*
+ * Hard-code this to blocks until we have non-block-oriented IO
+ * represented in the view as well.
+ */
+ values[IO_COL_CONVERSION] = Int64GetDatum(BLCKSZ);
+ values[IO_COL_EVICTIONS] = Int64GetDatum(counters->evictions);
+ values[IO_COL_REUSES] = Int64GetDatum(counters->reuses);
+ values[IO_COL_FSYNCS] = Int64GetDatum(counters->fsyncs);
+ values[IO_COL_RESET_TIME] = TimestampTzGetDatum(reset_time);
+
+ /*
+ * Some combinations of BackendType and IOOp, of IOContext and
+ * IOOp, and of IOObject and IOOp are not valid. Set these cells
+ * in the view to NULL and assert that these stats are zero as
+ * expected.
+ */
+ for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
+ {
+ if (!(pgstat_io_op_valid((BackendType) bktype, (IOContext)
+ io_context, (IOObject) io_object, (IOOp) io_op)))
+ {
+ pgstat_io_op_assert_zero(counters, (IOOp) io_op);
+ nulls[pgstat_io_op_get_index((IOOp) io_op)] = true;
+ }
+ }
+
+ tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
+ }
+ }
+ }
+
+ return (Datum) 0;
+}
+
/*
* Returns statistics of WAL activity
*/
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 20f5aa56ea..0e36eedb2c 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5653,6 +5653,15 @@
proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
+{ oid => '8459', descr => 'statistics: per backend type IO statistics',
+ proname => 'pg_stat_get_io', provolatile => 'v',
+ prorows => '30', proretset => 't',
+ proparallel => 'r', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,text,text,int8,int8,int8,int8,int8,int8,int8,timestamptz}',
+ proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}',
+ proargnames => '{backend_type,io_context,io_object,read,written,extended,op_bytes,evicted,reused,files_synced,stats_reset}',
+ prosrc => 'pg_stat_get_io' },
+
{ oid => '1136', descr => 'statistics: information about WAL activity',
proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
proparallel => 'r', prorettype => 'record', proargtypes => '',
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 624d0e5aae..ffa800a661 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1873,6 +1873,18 @@ pg_stat_gssapi| SELECT s.pid,
s.gss_enc AS encrypted
FROM pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid, query_id)
WHERE (s.client_port IS NOT NULL);
+pg_stat_io| SELECT b.backend_type,
+ b.io_context,
+ b.io_object,
+ b.read,
+ b.written,
+ b.extended,
+ b.op_bytes,
+ b.evicted,
+ b.reused,
+ b.files_synced,
+ b.stats_reset
+ FROM pg_stat_get_io() b(backend_type, io_context, io_object, read, written, extended, op_bytes, evicted, reused, files_synced, stats_reset);
pg_stat_progress_analyze| SELECT s.pid,
s.datid,
d.datname,
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index 257a6a9da9..6bbd447114 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -1120,4 +1120,244 @@ SELECT pg_stat_get_subscription_stats(NULL);
(1 row)
+-- Test that the following operations are tracked in pg_stat_io:
+-- - reads of target blocks into shared buffers
+-- - writes of shared buffers to permanent storage
+-- - extends of relations using shared buffers
+-- - fsyncs done to ensure the durability of data dirtying shared buffers
+-- There is no test for blocks evicted from shared buffers, because we cannot
+-- be sure of the state of shared buffers at the point the test is run.
+SELECT sum(read) AS io_sum_shared_reads_before
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'relation' \gset
+SELECT sum(written) AS io_sum_shared_writes_before
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'relation' \gset
+SELECT sum(extended) AS io_sum_shared_extends_before
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'relation' \gset
+SELECT sum(files_synced) AS io_sum_shared_fsyncs_before
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'relation' \gset
+-- Create a regular table and insert some data to generate IOCONTEXT_BUFFER_POOL
+-- extends.
+CREATE TABLE test_io_shared(a int);
+INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+-- After a checkpoint, there should be some additional IOCONTEXT_BUFFER_POOL writes
+-- and fsyncs.
+-- The second checkpoint ensures that stats from the first checkpoint have been
+-- reported and protects against any potential races amongst the table
+-- creation, a possible timing-triggered checkpoint, and the explicit
+-- checkpoint in the test.
+CHECKPOINT;
+CHECKPOINT;
+SELECT sum(written) AS io_sum_shared_writes_after
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'relation' \gset
+SELECT sum(extended) AS io_sum_shared_extends_after
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'relation' \gset
+SELECT sum(files_synced) AS io_sum_shared_fsyncs_after
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'relation' \gset
+SELECT :io_sum_shared_writes_after > :io_sum_shared_writes_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_shared_extends_after > :io_sum_shared_extends_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT current_setting('fsync') = 'off' OR :io_sum_shared_fsyncs_after > :io_sum_shared_fsyncs_before;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Change the tablespace so that the table is rewritten directly, then SELECT
+-- from it to cause it to be read back into shared buffers.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE test_io_shared_stats_tblspc LOCATION '';
+ALTER TABLE test_io_shared SET TABLESPACE test_io_shared_stats_tblspc;
+SELECT COUNT(*) FROM test_io_shared;
+ count
+-------
+ 100
+(1 row)
+
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(read) AS io_sum_shared_reads_after
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'relation' \gset
+SELECT :io_sum_shared_reads_after > :io_sum_shared_reads_before;
+ ?column?
+----------
+ t
+(1 row)
+
+DROP TABLE test_io_shared;
+DROP TABLESPACE test_io_shared_stats_tblspc;
+-- Test that the following IOCONTEXT_LOCAL IOOps are tracked in pg_stat_io:
+-- - eviction of local buffers in order to reuse them
+-- - reads of temporary table blocks into local buffers
+-- - writes of local buffers to permanent storage
+-- - extends of temporary tables
+-- Set temp_buffers to a low value so that we can trigger writes with fewer
+-- inserted tuples. Do so in a new session in case temporary tables have been
+-- accessed by previous tests in this session.
+\c
+SET temp_buffers TO '1MB';
+CREATE TEMPORARY TABLE test_io_local(a int, b TEXT);
+SELECT sum(evicted) AS io_sum_local_evictions_before
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'temp relation' \gset
+SELECT sum(read) AS io_sum_local_reads_before
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'temp relation' \gset
+SELECT sum(written) AS io_sum_local_writes_before
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'temp relation' \gset
+SELECT sum(extended) AS io_sum_local_extends_before
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'temp relation' \gset
+-- Insert enough values that we need to reuse and write out dirty local
+-- buffers.
+INSERT INTO test_io_local SELECT generate_series(1, 8000) as id, repeat('a', 100);
+-- Read in evicted buffers.
+SELECT COUNT(*) FROM test_io_local;
+ count
+-------
+ 8000
+(1 row)
+
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(evicted) AS io_sum_local_evictions_after
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'temp relation' \gset
+SELECT sum(read) AS io_sum_local_reads_after
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'temp relation' \gset
+SELECT sum(written) AS io_sum_local_writes_after
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'temp relation' \gset
+SELECT sum(extended) AS io_sum_local_extends_after
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'temp relation' \gset
+SELECT :io_sum_local_evictions_after > :io_sum_local_evictions_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_local_reads_after > :io_sum_local_reads_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_local_writes_after > :io_sum_local_writes_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_local_extends_after > :io_sum_local_extends_before;
+ ?column?
+----------
+ t
+(1 row)
+
+RESET temp_buffers;
+-- Test that reuse of strategy buffers and reads of blocks into these reused
+-- buffers while VACUUMing are tracked in pg_stat_io.
+-- Set wal_skip_threshold smaller than the expected size of
+-- test_io_vac_strategy so that, even if wal_level is minimal, VACUUM FULL will
+-- fsync the newly rewritten test_io_vac_strategy instead of writing it to WAL.
+-- Writing it to WAL would result in the newly written relation pages being in
+-- shared buffers, which would prevent us from testing BAS_VACUUM
+-- BufferAccessStrategy reads.
+SET wal_skip_threshold = '1 kB';
+SELECT sum(reused) AS io_sum_vac_strategy_reuses_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(read) AS io_sum_vac_strategy_reads_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+CREATE TABLE test_io_vac_strategy(a int, b int) WITH (autovacuum_enabled = 'false');
+INSERT INTO test_io_vac_strategy SELECT i, i from generate_series(1, 8000)i;
+-- Ensure that the next VACUUM will need to perform IO by rewriting the table
+-- first with VACUUM (FULL).
+VACUUM (FULL) test_io_vac_strategy;
+VACUUM (PARALLEL 0) test_io_vac_strategy;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(reused) AS io_sum_vac_strategy_reuses_after FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(read) AS io_sum_vac_strategy_reads_after FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT :io_sum_vac_strategy_reads_after > :io_sum_vac_strategy_reads_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_vac_strategy_reuses_after > :io_sum_vac_strategy_reuses_before;
+ ?column?
+----------
+ t
+(1 row)
+
+RESET wal_skip_threshold;
+-- Test that extends done by a CTAS, which uses a BAS_BULKWRITE
+-- BufferAccessStrategy, are tracked in pg_stat_io.
+SELECT sum(extended) AS io_sum_bulkwrite_strategy_extends_before FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+CREATE TABLE test_io_bulkwrite_strategy AS SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(extended) AS io_sum_bulkwrite_strategy_extends_after FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+SELECT :io_sum_bulkwrite_strategy_extends_after > :io_sum_bulkwrite_strategy_extends_before;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Test that reads of blocks into reused strategy buffers during database
+-- creation, which uses a BAS_BULKREAD BufferAccessStrategy, are tracked in
+-- pg_stat_io.
+SELECT sum(read) AS io_sum_bulkread_strategy_reads_before FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+CREATE DATABASE test_io_bulkread_strategy_db;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(read) AS io_sum_bulkread_strategy_reads_after FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+SELECT :io_sum_bulkread_strategy_reads_after > :io_sum_bulkread_strategy_reads_before;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Test IO stats reset
+SELECT sum(evicted) + sum(reused) + sum(extended) + sum(files_synced) + sum(read) + sum(written) AS io_stats_pre_reset FROM pg_stat_io \gset
+SELECT pg_stat_reset_shared('io');
+ pg_stat_reset_shared
+----------------------
+
+(1 row)
+
+SELECT sum(evicted) + sum(reused) + sum(extended) + sum(files_synced) + sum(read) + sum(written) AS io_stats_post_reset FROM pg_stat_io \gset
+SELECT :io_stats_post_reset < :io_stats_pre_reset;
+ ?column?
+----------
+ t
+(1 row)
+
-- End of Stats Test
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index f6270f7bad..a7bb0cf0a4 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -535,4 +535,143 @@ SELECT pg_stat_get_replication_slot(NULL);
SELECT pg_stat_get_subscription_stats(NULL);
+-- Test that the following operations are tracked in pg_stat_io:
+-- - reads of target blocks into shared buffers
+-- - writes of shared buffers to permanent storage
+-- - extends of relations using shared buffers
+-- - fsyncs done to ensure the durability of data dirtying shared buffers
+
+-- There is no test for blocks evicted from shared buffers, because we cannot
+-- be sure of the state of shared buffers at the point the test is run.
+SELECT sum(read) AS io_sum_shared_reads_before
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'relation' \gset
+SELECT sum(written) AS io_sum_shared_writes_before
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'relation' \gset
+SELECT sum(extended) AS io_sum_shared_extends_before
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'relation' \gset
+SELECT sum(files_synced) AS io_sum_shared_fsyncs_before
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'relation' \gset
+-- Create a regular table and insert some data to generate IOCONTEXT_BUFFER_POOL
+-- extends.
+CREATE TABLE test_io_shared(a int);
+INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+-- After a checkpoint, there should be some additional IOCONTEXT_BUFFER_POOL writes
+-- and fsyncs.
+-- The second checkpoint ensures that stats from the first checkpoint have been
+-- reported and protects against any potential races amongst the table
+-- creation, a possible timing-triggered checkpoint, and the explicit
+-- checkpoint in the test.
+CHECKPOINT;
+CHECKPOINT;
+SELECT sum(written) AS io_sum_shared_writes_after
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'relation' \gset
+SELECT sum(extended) AS io_sum_shared_extends_after
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'relation' \gset
+SELECT sum(files_synced) AS io_sum_shared_fsyncs_after
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'relation' \gset
+SELECT :io_sum_shared_writes_after > :io_sum_shared_writes_before;
+SELECT :io_sum_shared_extends_after > :io_sum_shared_extends_before;
+SELECT current_setting('fsync') = 'off' OR :io_sum_shared_fsyncs_after > :io_sum_shared_fsyncs_before;
+-- Change the tablespace so that the table is rewritten directly, then SELECT
+-- from it to cause it to be read back into shared buffers.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE test_io_shared_stats_tblspc LOCATION '';
+ALTER TABLE test_io_shared SET TABLESPACE test_io_shared_stats_tblspc;
+SELECT COUNT(*) FROM test_io_shared;
+SELECT pg_stat_force_next_flush();
+SELECT sum(read) AS io_sum_shared_reads_after
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'relation' \gset
+SELECT :io_sum_shared_reads_after > :io_sum_shared_reads_before;
+DROP TABLE test_io_shared;
+DROP TABLESPACE test_io_shared_stats_tblspc;
+
+-- Test that the following IOCONTEXT_LOCAL IOOps are tracked in pg_stat_io:
+-- - eviction of local buffers in order to reuse them
+-- - reads of temporary table blocks into local buffers
+-- - writes of local buffers to permanent storage
+-- - extends of temporary tables
+
+-- Set temp_buffers to a low value so that we can trigger writes with fewer
+-- inserted tuples. Do so in a new session in case temporary tables have been
+-- accessed by previous tests in this session.
+\c
+SET temp_buffers TO '1MB';
+CREATE TEMPORARY TABLE test_io_local(a int, b TEXT);
+SELECT sum(evicted) AS io_sum_local_evictions_before
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'temp relation' \gset
+SELECT sum(read) AS io_sum_local_reads_before
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'temp relation' \gset
+SELECT sum(written) AS io_sum_local_writes_before
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'temp relation' \gset
+SELECT sum(extended) AS io_sum_local_extends_before
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'temp relation' \gset
+-- Insert enough values that we need to reuse and write out dirty local
+-- buffers.
+INSERT INTO test_io_local SELECT generate_series(1, 8000) as id, repeat('a', 100);
+-- Read in evicted buffers.
+SELECT COUNT(*) FROM test_io_local;
+SELECT pg_stat_force_next_flush();
+SELECT sum(evicted) AS io_sum_local_evictions_after
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'temp relation' \gset
+SELECT sum(read) AS io_sum_local_reads_after
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'temp relation' \gset
+SELECT sum(written) AS io_sum_local_writes_after
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'temp relation' \gset
+SELECT sum(extended) AS io_sum_local_extends_after
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'temp relation' \gset
+SELECT :io_sum_local_evictions_after > :io_sum_local_evictions_before;
+SELECT :io_sum_local_reads_after > :io_sum_local_reads_before;
+SELECT :io_sum_local_writes_after > :io_sum_local_writes_before;
+SELECT :io_sum_local_extends_after > :io_sum_local_extends_before;
+RESET temp_buffers;
+
+-- Test that reuse of strategy buffers and reads of blocks into these reused
+-- buffers while VACUUMing are tracked in pg_stat_io.
+
+-- Set wal_skip_threshold smaller than the expected size of
+-- test_io_vac_strategy so that, even if wal_level is minimal, VACUUM FULL will
+-- fsync the newly rewritten test_io_vac_strategy instead of writing it to WAL.
+-- Writing it to WAL would result in the newly written relation pages being in
+-- shared buffers, which would prevent us from testing BAS_VACUUM
+-- BufferAccessStrategy reads.
+SET wal_skip_threshold = '1 kB';
+SELECT sum(reused) AS io_sum_vac_strategy_reuses_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(read) AS io_sum_vac_strategy_reads_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+CREATE TABLE test_io_vac_strategy(a int, b int) WITH (autovacuum_enabled = 'false');
+INSERT INTO test_io_vac_strategy SELECT i, i from generate_series(1, 8000)i;
+-- Ensure that the next VACUUM will need to perform IO by rewriting the table
+-- first with VACUUM (FULL).
+VACUUM (FULL) test_io_vac_strategy;
+VACUUM (PARALLEL 0) test_io_vac_strategy;
+SELECT pg_stat_force_next_flush();
+SELECT sum(reused) AS io_sum_vac_strategy_reuses_after FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(read) AS io_sum_vac_strategy_reads_after FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT :io_sum_vac_strategy_reads_after > :io_sum_vac_strategy_reads_before;
+SELECT :io_sum_vac_strategy_reuses_after > :io_sum_vac_strategy_reuses_before;
+RESET wal_skip_threshold;
+
+-- Test that extends done by a CTAS, which uses a BAS_BULKWRITE
+-- BufferAccessStrategy, are tracked in pg_stat_io.
+SELECT sum(extended) AS io_sum_bulkwrite_strategy_extends_before FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+CREATE TABLE test_io_bulkwrite_strategy AS SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+SELECT sum(extended) AS io_sum_bulkwrite_strategy_extends_after FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+SELECT :io_sum_bulkwrite_strategy_extends_after > :io_sum_bulkwrite_strategy_extends_before;
+
+-- Test that reads of blocks into reused strategy buffers during database
+-- creation, which uses a BAS_BULKREAD BufferAccessStrategy, are tracked in
+-- pg_stat_io.
+SELECT sum(read) AS io_sum_bulkread_strategy_reads_before FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+CREATE DATABASE test_io_bulkread_strategy_db;
+SELECT pg_stat_force_next_flush();
+SELECT sum(read) AS io_sum_bulkread_strategy_reads_after FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+SELECT :io_sum_bulkread_strategy_reads_after > :io_sum_bulkread_strategy_reads_before;
+
+-- Test IO stats reset
+SELECT sum(evicted) + sum(reused) + sum(extended) + sum(files_synced) + sum(read) + sum(written) AS io_stats_pre_reset FROM pg_stat_io \gset
+SELECT pg_stat_reset_shared('io');
+SELECT sum(evicted) + sum(reused) + sum(extended) + sum(files_synced) + sum(read) + sum(written) AS io_stats_post_reset FROM pg_stat_io \gset
+SELECT :io_stats_post_reset < :io_stats_pre_reset;
+
-- End of Stats Test
--
2.38.1
v37-0003-Aggregate-IO-operation-stats-per-BackendType.patch
From 43647d87a5bcc5e4fcfdf3faa7e3e2a7884f66c6 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Thu, 3 Nov 2022 12:17:40 -0400
Subject: [PATCH v37 3/5] Aggregate IO operation stats per BackendType
Stats on IOOps for all IOContexts for a backend are tracked locally. Add
functionality for backends to flush these stats to shared memory and
accumulate them with those from all other backends, exited and live.
Also add reset and snapshot functions used by cumulative stats system
for management of these statistics.
The aggregated stats in shared memory could be extended in the future
with per-backend stats, useful for per-connection IO statistics and
monitoring.
Some BackendTypes do not flush their pending statistics at regular
intervals, so they explicitly call pgstat_flush_io_ops() during the course of
normal operations to flush their backend-local IO operation statistics
to shared memory in a timely manner.
Because not all BackendType, IOOp, IOObject, IOContext combinations are
valid, the validity of the stats is checked before flushing pending
stats and before reading in the existing stats file to shared memory.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Maciek Sakrejda <m.sakrejda@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
doc/src/sgml/monitoring.sgml | 2 +
src/backend/utils/activity/pgstat.c | 35 ++++
src/backend/utils/activity/pgstat_bgwriter.c | 7 +-
.../utils/activity/pgstat_checkpointer.c | 7 +-
src/backend/utils/activity/pgstat_io_ops.c | 165 ++++++++++++++++++
src/backend/utils/activity/pgstat_relation.c | 15 +-
src/backend/utils/activity/pgstat_shmem.c | 4 +
src/backend/utils/activity/pgstat_wal.c | 4 +-
src/backend/utils/adt/pgstatfuncs.c | 4 +-
src/include/miscadmin.h | 2 +
src/include/pgstat.h | 47 +++++
src/include/utils/pgstat_internal.h | 84 +++++++++
src/tools/pgindent/typedefs.list | 4 +
13 files changed, 374 insertions(+), 6 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index e5d622d514..698f274341 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -5390,6 +5390,8 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
the <structname>pg_stat_bgwriter</structname>
view, <literal>archiver</literal> to reset all the counters shown in
the <structname>pg_stat_archiver</structname> view,
+ <literal>io</literal> to reset all the counters shown in the
+ <structname>pg_stat_io</structname> view,
<literal>wal</literal> to reset all the counters shown in the
<structname>pg_stat_wal</structname> view or
<literal>recovery_prefetch</literal> to reset all the counters shown
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 1ebe3bbf29..d2ba5fd9f3 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -359,6 +359,15 @@ static const PgStat_KindInfo pgstat_kind_infos[PGSTAT_NUM_KINDS] = {
.snapshot_cb = pgstat_checkpointer_snapshot_cb,
},
+ [PGSTAT_KIND_IOOPS] = {
+ .name = "io_ops",
+
+ .fixed_amount = true,
+
+ .reset_all_cb = pgstat_io_ops_reset_all_cb,
+ .snapshot_cb = pgstat_io_ops_snapshot_cb,
+ },
+
[PGSTAT_KIND_SLRU] = {
.name = "slru",
@@ -582,6 +591,7 @@ pgstat_report_stat(bool force)
/* Don't expend a clock check if nothing to do */
if (dlist_is_empty(&pgStatPending) &&
+ !have_ioopstats &&
!have_slrustats &&
!pgstat_have_pending_wal())
{
@@ -628,6 +638,9 @@ pgstat_report_stat(bool force)
/* flush database / relation / function / ... stats */
partial_flush |= pgstat_flush_pending_entries(nowait);
+ /* flush IO Operations stats */
+ partial_flush |= pgstat_flush_io_ops(nowait);
+
/* flush wal stats */
partial_flush |= pgstat_flush_wal(nowait);
@@ -1321,6 +1334,14 @@ pgstat_write_statsfile(void)
pgstat_build_snapshot_fixed(PGSTAT_KIND_CHECKPOINTER);
write_chunk_s(fpout, &pgStatLocal.snapshot.checkpointer);
+ /*
+ * Write IO Operations stats struct
+ */
+ pgstat_build_snapshot_fixed(PGSTAT_KIND_IOOPS);
+ write_chunk_s(fpout, &pgStatLocal.snapshot.io_ops.stat_reset_timestamp);
+ for (int bktype = 0; bktype < BACKEND_NUM_TYPES; bktype++)
+ write_chunk_s(fpout, &pgStatLocal.snapshot.io_ops.stats[bktype]);
+
/*
* Write SLRU stats struct
*/
@@ -1495,6 +1516,20 @@ pgstat_read_statsfile(void)
if (!read_chunk_s(fpin, &shmem->checkpointer.stats))
goto error;
+ /*
+ * Read IO Operations stats struct
+ */
+ if (!read_chunk_s(fpin, &shmem->io_ops.stat_reset_timestamp))
+ goto error;
+
+ for (int bktype = 0; bktype < BACKEND_NUM_TYPES; bktype++)
+ {
+ pgstat_backend_io_stats_assert_well_formed(&shmem->io_ops.stats[bktype],
+ (BackendType) bktype);
+ if (!read_chunk_s(fpin, &shmem->io_ops.stats[bktype].data))
+ goto error;
+ }
+
/*
* Read SLRU stats struct
*/
diff --git a/src/backend/utils/activity/pgstat_bgwriter.c b/src/backend/utils/activity/pgstat_bgwriter.c
index fbb1edc527..3d7f90a1b7 100644
--- a/src/backend/utils/activity/pgstat_bgwriter.c
+++ b/src/backend/utils/activity/pgstat_bgwriter.c
@@ -24,7 +24,7 @@ PgStat_BgWriterStats PendingBgWriterStats = {0};
/*
- * Report bgwriter statistics
+ * Report bgwriter and IO Operation statistics
*/
void
pgstat_report_bgwriter(void)
@@ -56,6 +56,11 @@ pgstat_report_bgwriter(void)
* Clear out the statistics buffer, so it can be re-used.
*/
MemSet(&PendingBgWriterStats, 0, sizeof(PendingBgWriterStats));
+
+ /*
+ * Report IO Operations statistics
+ */
+ pgstat_flush_io_ops(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_checkpointer.c b/src/backend/utils/activity/pgstat_checkpointer.c
index af8d513e7b..cfcf127210 100644
--- a/src/backend/utils/activity/pgstat_checkpointer.c
+++ b/src/backend/utils/activity/pgstat_checkpointer.c
@@ -24,7 +24,7 @@ PgStat_CheckpointerStats PendingCheckpointerStats = {0};
/*
- * Report checkpointer statistics
+ * Report checkpointer and IO Operation statistics
*/
void
pgstat_report_checkpointer(void)
@@ -62,6 +62,11 @@ pgstat_report_checkpointer(void)
* Clear out the statistics buffer, so it can be re-used.
*/
MemSet(&PendingCheckpointerStats, 0, sizeof(PendingCheckpointerStats));
+
+ /*
+ * Report IO Operation statistics
+ */
+ pgstat_flush_io_ops(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_io_ops.c b/src/backend/utils/activity/pgstat_io_ops.c
index 9e192f404a..2324d89040 100644
--- a/src/backend/utils/activity/pgstat_io_ops.c
+++ b/src/backend/utils/activity/pgstat_io_ops.c
@@ -20,6 +20,42 @@
#include "utils/pgstat_internal.h"
static PgStat_IOContextOps pending_IOOpStats;
+bool have_ioopstats = false;
+
+
+/*
+ * Helper function to accumulate source PgStat_IOOpCounters into target
+ * PgStat_IOOpCounters. If either of the passed-in PgStat_IOOpCounters are
+ * members of PgStatShared_IOContextOps, the caller is responsible for ensuring
+ * that the appropriate lock is held.
+ */
+static void
+pgstat_accum_io_op(PgStat_IOOpCounters *target, PgStat_IOOpCounters *source, IOOp io_op)
+{
+ switch (io_op)
+ {
+ case IOOP_EVICT:
+ target->evictions += source->evictions;
+ return;
+ case IOOP_EXTEND:
+ target->extends += source->extends;
+ return;
+ case IOOP_FSYNC:
+ target->fsyncs += source->fsyncs;
+ return;
+ case IOOP_READ:
+ target->reads += source->reads;
+ return;
+ case IOOP_REUSE:
+ target->reuses += source->reuses;
+ return;
+ case IOOP_WRITE:
+ target->writes += source->writes;
+ return;
+ }
+
+ elog(ERROR, "unrecognized IOOp value: %d", io_op);
+}
void
pgstat_count_io_op(IOOp io_op, IOObject io_object, IOContext io_context)
@@ -54,6 +90,85 @@ pgstat_count_io_op(IOOp io_op, IOObject io_object, IOContext io_context)
break;
}
+ have_ioopstats = true;
+}
+
+PgStat_BackendIOContextOps *
+pgstat_fetch_backend_io_context_ops(void)
+{
+ pgstat_snapshot_fixed(PGSTAT_KIND_IOOPS);
+
+ return &pgStatLocal.snapshot.io_ops;
+}
+
+/*
+ * Flush out locally pending IO Operation statistics entries
+ *
+ * If no stats have been recorded, this function returns false.
+ *
+ * If nowait is true and the lock could not be acquired, this function
+ * returns true without flushing. Otherwise it flushes the pending stats
+ * and returns false.
+ */
+bool
+pgstat_flush_io_ops(bool nowait)
+{
+ PgStatShared_IOContextOps *type_shstats;
+ bool expect_backend_stats = true;
+
+ if (!have_ioopstats)
+ return false;
+
+ type_shstats =
+ &pgStatLocal.shmem->io_ops.stats[MyBackendType];
+
+ if (!nowait)
+ LWLockAcquire(&type_shstats->lock, LW_EXCLUSIVE);
+ else if (!LWLockConditionalAcquire(&type_shstats->lock, LW_EXCLUSIVE))
+ return true;
+
+ expect_backend_stats = pgstat_io_op_stats_collected(MyBackendType);
+
+ for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ {
+ PgStatShared_IOObjectOps *shared_objs = &type_shstats->data[io_context];
+ PgStat_IOObjectOps *pending_objs = &pending_IOOpStats.data[io_context];
+
+ for (int io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++)
+ {
+ PgStat_IOOpCounters *sharedent = &shared_objs->data[io_object];
+ PgStat_IOOpCounters *pendingent = &pending_objs->data[io_object];
+
+ if (!expect_backend_stats ||
+ !pgstat_bktype_io_context_io_object_valid(MyBackendType,
+ (IOContext) io_context, (IOObject) io_object))
+ {
+ pgstat_io_context_ops_assert_zero(sharedent);
+ pgstat_io_context_ops_assert_zero(pendingent);
+ continue;
+ }
+
+ for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
+ {
+ if (!(pgstat_io_op_valid(MyBackendType, (IOContext) io_context,
+ (IOObject) io_object, (IOOp) io_op)))
+ {
+ pgstat_io_op_assert_zero(sharedent, (IOOp) io_op);
+ pgstat_io_op_assert_zero(pendingent, (IOOp) io_op);
+ continue;
+ }
+
+ pgstat_accum_io_op(sharedent, pendingent, (IOOp) io_op);
+ }
+ }
+ }
+
+ LWLockRelease(&type_shstats->lock);
+
+ memset(&pending_IOOpStats, 0, sizeof(pending_IOOpStats));
+
+ have_ioopstats = false;
+
+ return false;
}
const char *
@@ -110,6 +225,56 @@ pgstat_io_op_desc(IOOp io_op)
elog(ERROR, "unrecognized IOOp value: %d", io_op);
}
+void
+pgstat_io_ops_reset_all_cb(TimestampTz ts)
+{
+ PgStatShared_BackendIOContextOps *backends_stats_shmem = &pgStatLocal.shmem->io_ops;
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ PgStatShared_IOContextOps *stats_shmem = &backends_stats_shmem->stats[i];
+
+ LWLockAcquire(&stats_shmem->lock, LW_EXCLUSIVE);
+
+ /*
+ * Use the lock in the first BackendType's PgStat_IOContextOps to
+ * protect the reset timestamp as well.
+ */
+ if (i == 0)
+ backends_stats_shmem->stat_reset_timestamp = ts;
+
+ memset(stats_shmem->data, 0, sizeof(stats_shmem->data));
+ LWLockRelease(&stats_shmem->lock);
+ }
+}
+
+void
+pgstat_io_ops_snapshot_cb(void)
+{
+ PgStatShared_BackendIOContextOps *backends_stats_shmem = &pgStatLocal.shmem->io_ops;
+ PgStat_BackendIOContextOps *backends_stats_snap = &pgStatLocal.snapshot.io_ops;
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ PgStatShared_IOContextOps *stats_shmem = &backends_stats_shmem->stats[i];
+ PgStat_IOContextOps *stats_snap = &backends_stats_snap->stats[i];
+
+ LWLockAcquire(&stats_shmem->lock, LW_SHARED);
+
+ /*
+ * Use the lock in the first BackendType's PgStat_IOContextOps to
+ * protect the reset timestamp as well.
+ */
+ if (i == 0)
+ backends_stats_snap->stat_reset_timestamp =
+ backends_stats_shmem->stat_reset_timestamp;
+
+ memcpy(stats_snap->data, stats_shmem->data, sizeof(stats_shmem->data));
+ LWLockRelease(&stats_shmem->lock);
+ }
+}
+
/*
* IO Operation statistics are not collected for all BackendTypes.
*
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index 55a355f583..a23a90b133 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -205,7 +205,7 @@ pgstat_drop_relation(Relation rel)
}
/*
- * Report that the table was just vacuumed.
+ * Report that the table was just vacuumed and flush IO Operation statistics.
*/
void
pgstat_report_vacuum(Oid tableoid, bool shared,
@@ -257,10 +257,18 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
}
pgstat_unlock_entry(entry_ref);
+
+ /*
+	 * Flush IO Operations statistics now. pgstat_report_stat() will also
+	 * flush IO Operation stats, but it will not be called until an entire
+	 * autovacuum cycle is done -- which will likely vacuum many relations --
+	 * or until the VACUUM command has processed all tables and committed.
+ */
+ pgstat_flush_io_ops(false);
}
/*
- * Report that the table was just analyzed.
+ * Report that the table was just analyzed and flush IO Operation statistics.
*
* Caller must provide new live- and dead-tuples estimates, as well as a
* flag indicating whether to reset the changes_since_analyze counter.
@@ -340,6 +348,9 @@ pgstat_report_analyze(Relation rel,
}
pgstat_unlock_entry(entry_ref);
+
+ /* see pgstat_report_vacuum() */
+ pgstat_flush_io_ops(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_shmem.c b/src/backend/utils/activity/pgstat_shmem.c
index 706692862c..4251079ae1 100644
--- a/src/backend/utils/activity/pgstat_shmem.c
+++ b/src/backend/utils/activity/pgstat_shmem.c
@@ -202,6 +202,10 @@ StatsShmemInit(void)
LWLockInitialize(&ctl->checkpointer.lock, LWTRANCHE_PGSTATS_DATA);
LWLockInitialize(&ctl->slru.lock, LWTRANCHE_PGSTATS_DATA);
LWLockInitialize(&ctl->wal.lock, LWTRANCHE_PGSTATS_DATA);
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ LWLockInitialize(&ctl->io_ops.stats[i].lock,
+ LWTRANCHE_PGSTATS_DATA);
}
else
{
diff --git a/src/backend/utils/activity/pgstat_wal.c b/src/backend/utils/activity/pgstat_wal.c
index 5a878bd115..9cac407b42 100644
--- a/src/backend/utils/activity/pgstat_wal.c
+++ b/src/backend/utils/activity/pgstat_wal.c
@@ -34,7 +34,7 @@ static WalUsage prevWalUsage;
/*
* Calculate how much WAL usage counters have increased and update
- * shared statistics.
+ * shared WAL and IO Operation statistics.
*
* Must be called by processes that generate WAL, that do not call
* pgstat_report_stat(), like walwriter.
@@ -43,6 +43,8 @@ void
pgstat_report_wal(bool force)
{
pgstat_flush_wal(force);
+
+ pgstat_flush_io_ops(force);
}
/*
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 96bffc0f2a..b783af130c 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -2084,6 +2084,8 @@ pg_stat_reset_shared(PG_FUNCTION_ARGS)
pgstat_reset_of_kind(PGSTAT_KIND_BGWRITER);
pgstat_reset_of_kind(PGSTAT_KIND_CHECKPOINTER);
}
+ else if (strcmp(target, "io") == 0)
+ pgstat_reset_of_kind(PGSTAT_KIND_IOOPS);
else if (strcmp(target, "recovery_prefetch") == 0)
XLogPrefetchResetStats();
else if (strcmp(target, "wal") == 0)
@@ -2092,7 +2094,7 @@ pg_stat_reset_shared(PG_FUNCTION_ARGS)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("unrecognized reset target: \"%s\"", target),
- errhint("Target must be \"archiver\", \"bgwriter\", \"recovery_prefetch\", or \"wal\".")));
+ errhint("Target must be \"archiver\", \"bgwriter\", \"io\", \"recovery_prefetch\", or \"wal\".")));
PG_RETURN_VOID();
}
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 795182fa51..d1a43a662e 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -331,6 +331,8 @@ typedef enum BackendType
B_WAL_WRITER,
} BackendType;
+#define BACKEND_NUM_TYPES (B_WAL_WRITER + 1)
+
extern PGDLLIMPORT BackendType MyBackendType;
extern const char *GetBackendTypeDesc(BackendType backendType);
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index e2beafb9b2..9d86cf35f3 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -49,6 +49,7 @@ typedef enum PgStat_Kind
PGSTAT_KIND_ARCHIVER,
PGSTAT_KIND_BGWRITER,
PGSTAT_KIND_CHECKPOINTER,
+ PGSTAT_KIND_IOOPS,
PGSTAT_KIND_SLRU,
PGSTAT_KIND_WAL,
} PgStat_Kind;
@@ -334,6 +335,12 @@ typedef struct PgStat_IOContextOps
PgStat_IOObjectOps data[IOCONTEXT_NUM_TYPES];
} PgStat_IOContextOps;
+typedef struct PgStat_BackendIOContextOps
+{
+ TimestampTz stat_reset_timestamp;
+ PgStat_IOContextOps stats[BACKEND_NUM_TYPES];
+} PgStat_BackendIOContextOps;
+
typedef struct PgStat_StatDBEntry
{
PgStat_Counter n_xact_commit;
@@ -516,6 +523,7 @@ extern PgStat_CheckpointerStats *pgstat_fetch_stat_checkpointer(void);
*/
extern void pgstat_count_io_op(IOOp io_op, IOObject io_object, IOContext io_context);
+extern PgStat_BackendIOContextOps *pgstat_fetch_backend_io_context_ops(void);
extern const char *pgstat_io_context_desc(IOContext io_context);
extern const char *pgstat_io_object_desc(IOObject io_object);
extern const char *pgstat_io_op_desc(IOOp io_op);
@@ -531,6 +539,45 @@ extern bool pgstat_expect_io_op(BackendType bktype,
/* IO stats translation function in freelist.c */
extern IOContext IOContextForStrategy(BufferAccessStrategy bas);
+/*
+ * Functions to assert that invalid IO Operation counters are zero.
+ */
+static inline void
+pgstat_io_context_ops_assert_zero(PgStat_IOOpCounters *counters)
+{
+	Assert(counters->evictions == 0 && counters->extends == 0 &&
+		   counters->fsyncs == 0 && counters->reads == 0 &&
+		   counters->reuses == 0 && counters->writes == 0);
+}
+
+static inline void
+pgstat_io_op_assert_zero(PgStat_IOOpCounters *counters, IOOp io_op)
+{
+ switch (io_op)
+ {
+ case IOOP_EVICT:
+ Assert(counters->evictions == 0);
+ return;
+ case IOOP_EXTEND:
+ Assert(counters->extends == 0);
+ return;
+ case IOOP_FSYNC:
+ Assert(counters->fsyncs == 0);
+ return;
+ case IOOP_READ:
+ Assert(counters->reads == 0);
+ return;
+ case IOOP_REUSE:
+ Assert(counters->reuses == 0);
+ return;
+ case IOOP_WRITE:
+ Assert(counters->writes == 0);
+ return;
+ }
+
+ /* Should not reach here */
+ Assert(false);
+}
/*
diff --git a/src/include/utils/pgstat_internal.h b/src/include/utils/pgstat_internal.h
index e2c7b59324..5f8f4ba053 100644
--- a/src/include/utils/pgstat_internal.h
+++ b/src/include/utils/pgstat_internal.h
@@ -329,6 +329,31 @@ typedef struct PgStatShared_Checkpointer
PgStat_CheckpointerStats reset_offset;
} PgStatShared_Checkpointer;
+
+typedef struct PgStatShared_IOObjectOps
+{
+ PgStat_IOOpCounters data[IOOBJECT_NUM_TYPES];
+} PgStatShared_IOObjectOps;
+
+typedef struct PgStatShared_IOContextOps
+{
+ /*
+	 * lock protects ->data. If this PgStatShared_IOContextOps is
+ * PgStatShared_BackendIOContextOps->stats[0], lock also protects
+ * PgStatShared_BackendIOContextOps->stat_reset_timestamp.
+ */
+ LWLock lock;
+ PgStatShared_IOObjectOps data[IOCONTEXT_NUM_TYPES];
+} PgStatShared_IOContextOps;
+
+typedef struct PgStatShared_BackendIOContextOps
+{
+ /* ->stats_reset_timestamp is protected by ->stats[0].lock */
+ TimestampTz stat_reset_timestamp;
+ PgStatShared_IOContextOps stats[BACKEND_NUM_TYPES];
+} PgStatShared_BackendIOContextOps;
+
+
typedef struct PgStatShared_SLRU
{
/* lock protects ->stats */
@@ -419,6 +444,7 @@ typedef struct PgStat_ShmemControl
PgStatShared_Archiver archiver;
PgStatShared_BgWriter bgwriter;
PgStatShared_Checkpointer checkpointer;
+ PgStatShared_BackendIOContextOps io_ops;
PgStatShared_SLRU slru;
PgStatShared_Wal wal;
} PgStat_ShmemControl;
@@ -442,6 +468,8 @@ typedef struct PgStat_Snapshot
PgStat_CheckpointerStats checkpointer;
+ PgStat_BackendIOContextOps io_ops;
+
PgStat_SLRUStats slru[SLRU_NUM_ELEMENTS];
PgStat_WalStats wal;
@@ -549,6 +577,57 @@ extern void pgstat_database_reset_timestamp_cb(PgStatShared_Common *header, Time
extern bool pgstat_function_flush_cb(PgStat_EntryRef *entry_ref, bool nowait);
+/*
+ * Functions in pgstat_io_ops.c
+ */
+
+extern void pgstat_io_ops_reset_all_cb(TimestampTz ts);
+extern void pgstat_io_ops_snapshot_cb(void);
+extern bool pgstat_flush_io_ops(bool nowait);
+
+/*
+ * Assert that stats have not been counted for any combination of IOContext,
+ * IOObject, and IOOp which is not valid for the passed-in BackendType. The
+ * passed-in array of PgStat_IOOpCounters must contain stats from the
+ * BackendType specified by the second parameter. Caller is responsible for
+ * locking of the passed-in PgStatShared_IOContextOps, if needed.
+ */
+static inline void
+pgstat_backend_io_stats_assert_well_formed(PgStatShared_IOContextOps *backend_io_context_ops,
+ BackendType bktype)
+{
+ bool expect_backend_stats = true;
+
+ if (!pgstat_io_op_stats_collected(bktype))
+ expect_backend_stats = false;
+
+ for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ {
+ PgStatShared_IOObjectOps *context = &backend_io_context_ops->data[io_context];
+
+ for (int io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++)
+ {
+ PgStat_IOOpCounters *object = &context->data[io_object];
+
+ if (!expect_backend_stats ||
+ !pgstat_bktype_io_context_io_object_valid(bktype,
+ (IOContext) io_context, (IOObject) io_object))
+ {
+ pgstat_io_context_ops_assert_zero(object);
+ continue;
+ }
+
+ for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
+ {
+ if (!pgstat_io_op_valid(bktype, (IOContext) io_context,
+ (IOObject) io_object, (IOOp) io_op))
+ pgstat_io_op_assert_zero(object, (IOOp) io_op);
+ }
+ }
+ }
+}
+
+
/*
* Functions in pgstat_relation.c
*/
@@ -641,6 +720,11 @@ extern void pgstat_create_transactional(PgStat_Kind kind, Oid dboid, Oid objoid)
extern PGDLLIMPORT PgStat_LocalState pgStatLocal;
+/*
+ * Variables in pgstat_io_ops.c
+ */
+
+extern PGDLLIMPORT bool have_ioopstats;
/*
* Variables in pgstat_slru.c
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 6088c44842..ca4047ea93 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2006,12 +2006,15 @@ PgFdwRelationInfo
PgFdwScanState
PgIfAddrCallback
PgStatShared_Archiver
+PgStatShared_BackendIOContextOps
PgStatShared_BgWriter
PgStatShared_Checkpointer
PgStatShared_Common
PgStatShared_Database
PgStatShared_Function
PgStatShared_HashEntry
+PgStatShared_IOContextOps
+PgStatShared_IOObjectOps
PgStatShared_Relation
PgStatShared_ReplSlot
PgStatShared_SLRU
@@ -2019,6 +2022,7 @@ PgStatShared_Subscription
PgStatShared_Wal
PgStat_ArchiverStats
PgStat_BackendFunctionEntry
+PgStat_BackendIOContextOps
PgStat_BackendSubEntry
PgStat_BgWriterStats
PgStat_CheckpointerStats
--
2.38.1
On Thu, Nov 3, 2022 at 10:00 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:
I decided not to call it pg_statio because all of the other stats views
have an underscore after stat and I thought it was an opportunity to be
consistent with them.
Oh, got it. Makes sense.
I'm reviewing the rendered docs now, and I noticed sentences like this
are a bit hard to scan: they force the reader to parse a big list of
backend types before even getting to the meat of what this is talking
about. Should we maybe reword this so that the backend list comes at
the end of the sentence? Or maybe even use a list (e.g., like in the
"state" column description in pg_stat_activity)?

Good idea with the bullet points.
For the lengthy lists, I've added bullet point lists to the docs for
several of the columns. It is quite long now but, hopefully, clearer?
Let me know if you think it improves the readability.
Hmm, I should have tried this before suggesting it. I think the lists
break up the flow of the column description too much. What do you
think about the attached (on top of your patches--attaching it as a
.diff to hopefully not confuse cfbot)? I kept the lists for backend
types but inlined the others as a middle ground. I also added a few
omitted periods and reworded "read plus extended" to avoid starting
the sentence with a (lowercase) varname (I think in general it's fine
to do that, but the more complicated sentence structure here makes it
easier to follow if the sentence starts with a capital).
Alternately, what do you think about pulling equivalencies to existing
views out of the main column descriptions, and adding them after the
main table as a sort of footnote? Most view docs don't have anything
like that, but pg_stat_replication does and it might be a good pattern
to follow.
Thoughts?
Also, is this (in the middle of the table) the right place for this
column? I would have expected to see it before or after all the actual
I/O op cells.

I put it after read, write, and extend columns because it applies to
them. It doesn't apply to files_synced. For reused and evicted, I didn't
think bytes reused and evicted made sense. Also, when we add non-block
oriented IO, reused and evicted won't be used but op_bytes will be. So I
thought it made more sense to place it after the operations it applies
to.
Got it, makes sense.
+       <varname>io_context</varname>s. When a <quote>Buffer Access
+       Strategy</quote> reuses a buffer in the strategy ring, it must evict its
+       contents, incrementing <varname>reused</varname>. When a <quote>Buffer
+       Access Strategy</quote> adds a new shared buffer to the strategy ring
+       and this shared buffer is occupied, the <quote>Buffer Access
+       Strategy</quote> must evict the contents of the shared buffer,
+       incrementing <varname>evicted</varname>.

I think the parallel phrasing here makes this a little hard to follow.
Specifically, I think "must evict its contents" for the strategy case
sounds like a bad thing, but in fact this is a totally normal thing
that happens as part of strategy access, no? The idea is you probably
won't need that buffer again, so it's fine to evict it. I'm not sure
how to reword, but I think the current phrasing is misleading.

I had trouble rephrasing this. I changed a few words. I see what you
mean. It is worth noting that reusing strategy buffers when there are
buffers on the freelist may not be the best behavior, so I wouldn't
necessarily consider "reused" a good thing. However, I'm not sure how
much the user could really do about this. I would at least like this
phrasing to be clear (evicted is for shared buffers, reused is for
strategy buffers), so, perhaps this section requires more work.
Oh, I see. I think the updated wording works better. Although I think
we can drop the quotes around "Buffer Access Strategy" here. They're
useful when defining the term originally, but after that I think it's
clearer to use the term unquoted.
Just to understand this better myself, though: can you clarify when
"reused" is not a normal, expected part of the strategy execution? I
was under the impression that a ring buffer is used because each page
is needed only "once" (i.e., for one set of operations) for the
command using the strategy ring buffer. Naively, in that situation, it
seems better to reuse a no-longer-needed buffer than to claim another
buffer from the freelist (where other commands may eventually make
better use of it).
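To make the reuse-versus-evict distinction above concrete, here is a small illustrative sketch in plain Python -- all names here are hypothetical stand-ins, not PostgreSQL source -- of how a strategy ring ends up counting one or the other:

```python
class SharedPool:
    """Hypothetical stand-in for shared buffers. take_victim() hands out
    a buffer id plus a flag saying whether it already held valid data
    (in which case claiming it displaces, i.e. evicts, that data)."""

    def __init__(self, victims):
        self.victims = list(victims)  # (buffer_id, was_occupied) pairs
        self.pins = set()

    def pinned(self, buf):
        return buf in self.pins

    def take_victim(self):
        return self.victims.pop(0)


class StrategyRing:
    """Fixed-size ring of shared buffers, as used by a Buffer Access
    Strategy (bulkread/bulkwrite/vacuum)."""

    def __init__(self, size):
        self.slots = [None] * size
        self.current = 0

    def get_buffer(self, pool):
        """Return (buffer, how_counted) for the next strategy request."""
        self.current = (self.current + 1) % len(self.slots)
        buf = self.slots[self.current]
        if buf is not None and not pool.pinned(buf):
            # The slot already holds a shared buffer we put there earlier:
            # displacing its contents to use it again counts as "reused".
            return buf, "reused"
        # Otherwise add a new shared buffer to the ring; if that buffer
        # was occupied, displacing its contents counts as "evicted".
        buf, occupied = pool.take_victim()
        self.slots[self.current] = buf
        return buf, "evicted" if occupied else "miss"


pool = SharedPool([(10, True), (11, False)])
ring = StrategyRing(2)
print(ring.get_buffer(pool))  # (10, 'evicted') - first fill, buffer occupied
print(ring.get_buffer(pool))  # (11, 'miss')    - second fill, buffer was free
print(ring.get_buffer(pool))  # (10, 'reused')  - wraps around to slot 1
```

On this reading, "reused" is the normal steady-state path once the ring is warm, while "evicted" happens mostly while filling the ring (or when a pinned slot forces taking a fresh shared buffer).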
+       again. A high number of repossessions is a sign of contention for the
+       blocks operated on by the strategy operation.

This (and in general the repossession description) makes sense, but
I'm not sure what to do with the information. Maybe Andres is right
that we could skip this in the first version?

I've removed repossessed and rejected in attached v37. I am a bit sad
about this because I don't see a good way forward and I think those
could be useful for users.
I can see that, but I think as long as we're not doing anything to
preclude adding this in the future, it's better to get something out
there and expand it later. For what it's worth, I don't feel it needs
to be excluded, just that it's not worth getting hung up on.
I have added the new column Andres recommended in [1] ("io_object") to
clarify temp and local buffers and pave the way for bypass IO (IO not
done through a buffer pool), which can be done on temp or permanent
files for temp or permanent relations, and spill file IO which is done
on temporary files but isn't related to temporary tables.

IOObject has increased the memory footprint and complexity of the code
around tracking and accumulating the statistics, though it has not
increased the number of rows in the view.

One question I still have about this additional dimension is how much
enumeration we need of the various combinations of IO operations, IO
objects, IO ops, and backend types which are allowed and not allowed.
Currently because it is only valid to operate on both IOOBJECT_RELATION
and IOOBJECT_TEMP_RELATION in IOCONTEXT_BUFFER_POOL, the changes to the
various functions asserting and validating what is "allowed" in terms of
combinations of ops, objects, contexts, and backend types aren't much
different than they were without IO Object. However, once we begin
adding other objects and contexts, we will need to make this logic more
comprehensive. I'm not sure whether or not I should do that
preemptively.
It's definitely something to consider, but I have no useful input here.
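As a rough illustration of the shape being discussed -- fixed counter arrays indexed by backend type, IO context, IO object, and IO op, with invalid combinations kept at zero -- here is a Python sketch; the enum values, the array size, and the validity rule are simplified placeholders, not the patch's actual definitions:

```python
from enum import IntEnum

class IOContext(IntEnum):
    BUFFER_POOL = 0
    BULKREAD = 1
    BULKWRITE = 2
    VACUUM = 3

class IOObject(IntEnum):
    RELATION = 0
    TEMP_RELATION = 1

class IOOp(IntEnum):
    EVICT = 0
    EXTEND = 1
    FSYNC = 2
    READ = 3
    REUSE = 4
    WRITE = 5

BACKEND_NUM_TYPES = 3  # placeholder; the real value derives from BackendType

def combination_valid(ctx, obj):
    # Simplified version of the rule discussed above: temp relations are
    # only operated on through the buffer pool context.
    return obj is not IOObject.TEMP_RELATION or ctx is IOContext.BUFFER_POOL

# One counter per (backend type, io_context, io_object, io_op) cell.
counters = [[[[0] * len(IOOp)
              for _ in IOObject]
             for _ in IOContext]
            for _ in range(BACKEND_NUM_TYPES)]

def count_io_op(bktype, ctx, obj, op):
    assert combination_valid(ctx, obj), "invalid (io_context, io_object)"
    counters[bktype][ctx][obj][op] += 1

count_io_op(0, IOContext.BUFFER_POOL, IOObject.TEMP_RELATION, IOOp.WRITE)
count_io_op(0, IOContext.VACUUM, IOObject.RELATION, IOOp.REUSE)
```

The point of the fixed-size layout is that validity is a static property of the cell's indices, which is what lets the patch assert that invalid cells stay zero at flush and snapshot time.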
Some more notes on the docs patch:
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>io_context</structfield> <type>text</type>
+ </para>
+ <para>
+ The context or location of an IO operation.
+ </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <varname>io_context</varname> <literal>buffer pool</literal> refers to
+ IO operations on data in both the shared buffer pool and process-local
+ buffer pools used for temporary relation data.
+ </para>
+ <para>
+ Operations on temporary relations are tracked in
+ <varname>io_context</varname> <literal>buffer pool</literal> and
+ <varname>io_object</varname> <literal>temp relation</literal>.
+ </para>
+ <para>
+ Operations on permanent relations are tracked in
+ <varname>io_context</varname> <literal>buffer pool</literal> and
+ <varname>io_object</varname> <literal>relation</literal>.
+ </para>
+ </listitem>
For this column, you repeat "io_context" in the list describing the
possible values of the column. Enum-style columns in other tables
don't do that (e.g., the pg_stat_activty "state" column). I think it
might read better to omit "io_context" from the list.
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>io_object</structfield> <type>text</type>
+ </para>
+ <para>
+ Object operated on in a given <varname>io_context</varname> by a given
+ <varname>backend_type</varname>.
+ </para>
Is this a fixed set of objects we should list, like for io_context?
Thanks,
Maciek
Attachments:
v37-pg_stat_io-delta.diff (application/x-patch)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index bab010c1ce..ff9c9eb339 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3782,8 +3782,8 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
</para>
<para>
Reads by this <varname>backend_type</varname> into buffers in this
- <varname>io_context</varname>.
- <varname>read</varname> plus <varname>extended</varname> for
+ <varname>io_context</varname>. The sum of
+ <varname>read</varname> and <varname>extended</varname> for
<varname>backend_type</varname>s
<itemizedlist>
@@ -3828,44 +3828,22 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
for all
<varname>io_context</varname>s is similar to the sum of
- <itemizedlist>
-
- <listitem>
- <para>
- <varname>heap_blks_read</varname>
- </para>
- </listitem>
-
- <listitem>
- <para>
- <varname>idx_blks_read</varname>
- </para>
- </listitem>
-
- <listitem>
- <para>
- <varname>tidx_blks_read</varname>
- </para>
- </listitem>
+ <varname>heap_blks_read</varname>,
+ <varname>idx_blks_read</varname>,
+ <varname>tidx_blks_read</varname>, and
- <listitem>
- <para>
<varname>toast_blks_read</varname>
- </para>
- </listitem>
-
- </itemizedlist>
in <link linkend="monitoring-pg-statio-all-tables-view">
<structname>pg_statio_all_tables</structname></link> and
<varname>blks_read</varname> from <link
linkend="monitoring-pg-stat-database-view">
- <structname>pg_stat_database</structname></link>
+ <structname>pg_stat_database</structname></link>.
The difference is that reads done as part of <command>CREATE
DATABASE</command> are not counted in
<structname>pg_statio_all_tables</structname> and
- <structname>pg_stat_database</structname>
+ <structname>pg_stat_database</structname>.
</para>
<para>If using the <productname>PostgreSQL</productname> extension,
@@ -3945,7 +3923,7 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
equivalent to the values of <varname>buffers_clean</varname> and
<varname>buffers_checkpoint</varname>, respectively, in <link
linkend="monitoring-pg-stat-bgwriter-view">
- <structname>pg_stat_bgwriter</structname></link>
+ <structname>pg_stat_bgwriter</structname></link>.
</para>
<para>Also, the sum of
@@ -3978,38 +3956,13 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
</itemizedlist>
on
- <itemizedlist>
- <listitem>
- <para>
- <varname>io_object</varname> <literal>relation</literal> in
- <varname>io_context</varname>s <literal>buffer pool</literal>
- </para>
- </listitem>
-
- <listitem>
- <para>
<varname>io_object</varname> <literal>relation</literal> in
- <varname>io_context</varname> <literal>bulkread</literal>
- </para>
- </listitem>
-
- <listitem>
- <para>
- <varname>io_object</varname> <literal>relation</literal> in
- <varname>io_context</varname> <literal>bulkwrite</literal>
- </para>
- </listitem>
-
- <listitem>
- <para>
- <varname>io_object</varname> <literal>relation</literal> in
- <varname>io_context</varname> <literal>vacuum</literal>
- </para>
- </listitem>
- </itemizedlist>
+ <varname>io_context</varname>s <literal>buffer pool</literal>,
+ <literal>bulkread</literal>, <literal>bulkwrite</literal>,
+ <literal>vacuum</literal>
is equivalent to <varname>buffers_backend</varname> in
- <structname>pg_stat_bgwriter</structname>
+ <structname>pg_stat_bgwriter</structname>.
</para>
<para>If using the <productname>PostgreSQL</productname> extension,
Hi,
One good follow up patch will be to rip out the accounting for
pg_stat_bgwriter's buffers_backend, buffers_backend_fsync and perhaps
buffers_alloc and replace it with a subselect getting the equivalent data from
pg_stat_io. It might not be quite worth doing for buffers_alloc because of
the way that's tied into bgwriter pacing.
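For what it's worth, the equivalence the docs delta spells out -- buffers_backend roughly equals writes plus extends on io_object "relation" across the buffer pool and strategy contexts -- can be sketched over hypothetical view rows (the column names follow the view under discussion, but the data is made up):

```python
# Hypothetical pg_stat_io-style rows for a single backend_type.
rows = [
    {"io_context": "buffer pool", "io_object": "relation",
     "write": 40, "extend": 10},
    {"io_context": "vacuum", "io_object": "relation",
     "write": 5, "extend": 0},
    {"io_context": "buffer pool", "io_object": "temp relation",
     "write": 7, "extend": 3},
]

POOL_AND_STRATEGY = {"buffer pool", "bulkread", "bulkwrite", "vacuum"}

# Aggregate the way the docs delta describes the buffers_backend overlap.
buffers_backend = sum(
    r["write"] + r["extend"]
    for r in rows
    if r["io_object"] == "relation" and r["io_context"] in POOL_AND_STRATEGY
)
print(buffers_backend)  # 55: temp relation IO is excluded
```

A follow-up patch could compute this as an SQL subselect over the view, which is presumably what ripping out the old accounting would look like.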
On 2022-11-03 13:00:24 -0400, Melanie Plageman wrote:
+       again. A high number of repossessions is a sign of contention for the
+       blocks operated on by the strategy operation.

This (and in general the repossession description) makes sense, but
I'm not sure what to do with the information. Maybe Andres is right
that we could skip this in the first version?

I've removed repossessed and rejected in attached v37. I am a bit sad
about this because I don't see a good way forward and I think those
could be useful for users.
Let's get the basic patch in and then check whether we can find a way to have
something providing at least some more information like repossessed and
rejected. I think it'll be easier to analyze in isolation.
I have added the new column Andres recommended in [1] ("io_object") to
clarify temp and local buffers and pave the way for bypass IO (IO not
done through a buffer pool), which can be done on temp or permanent
files for temp or permanent relations, and spill file IO which is done
on temporary files but isn't related to temporary tables.
IOObject has increased the memory footprint and complexity of the code
around tracking and accumulating the statistics, though it has not
increased the number of rows in the view.
It doesn't look too bad from here. Is there a specific portion of the code
where it concerns you the most?
One question I still have about this additional dimension is how much
enumeration we need of the various combinations of IO operations, IO
objects, IO ops, and backend types which are allowed and not allowed.

Currently because it is only valid to operate on both IOOBJECT_RELATION
and IOOBJECT_TEMP_RELATION in IOCONTEXT_BUFFER_POOL, the changes to the
various functions asserting and validating what is "allowed" in terms of
combinations of ops, objects, contexts, and backend types aren't much
different than they were without IO Object. However, once we begin
adding other objects and contexts, we will need to make this logic more
comprehensive. I'm not sure whether or not I should do that
preemptively.
I'd not do it preemptively.
@@ -833,6 +836,22 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
isExtend = (blockNum == P_NEW);
+	if (isLocalBuf)
+	{
+		/*
+		 * Though a strategy object may be passed in, no strategy is employed
+		 * when using local buffers. This could happen when doing, for example,
+		 * CREATE TEMPORARY TABLE AS ...
+		 */
+		io_context = IOCONTEXT_BUFFER_POOL;
+		io_object = IOOBJECT_TEMP_RELATION;
+	}
+	else
+	{
+		io_context = IOContextForStrategy(strategy);
+		io_object = IOOBJECT_RELATION;
+	}
I think given how frequently ReadBuffer_common() is called in some workloads,
it'd be good to make IOContextForStrategy inlinable. But I guess that's not
easily doable, because struct BufferAccessStrategyData is only defined in
freelist.c.
Could we defer this until later, given that we don't currently need this in
case of buffer hits afaict?
@@ -1121,6 +1144,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
BufferAccessStrategy strategy,
bool *foundPtr)
{
+ bool from_ring;
+ IOContext io_context;
BufferTag newTag; /* identity of requested block */
uint32 newHash; /* hash value for newTag */
LWLock *newPartitionLock; /* buffer partition lock for it */
@@ -1187,9 +1212,12 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
*/
	LWLockRelease(newPartitionLock);

+	io_context = IOContextForStrategy(strategy);
Hm - doesn't this mean we do IOContextForStrategy() twice? Once in
ReadBuffer_common() and then again here?
 	/* Loop here in case we have to try another victim buffer */
 	for (;;)
 	{
+
 		/*
 		 * Ensure, while the spinlock's not yet held, that there's a free
 		 * refcount entry.
@@ -1200,7 +1228,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		 * Select a victim buffer. The buffer is returned with its header
 		 * spinlock still held!
 		 */
-		buf = StrategyGetBuffer(strategy, &buf_state);
+		buf = StrategyGetBuffer(strategy, &buf_state, &from_ring);

 		Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
I think patch 0001 relies on this change already having been made, if I am not misunderstanding?
@@ -1263,13 +1291,34 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
}
}

+				/*
+				 * When a strategy is in use, only flushes of dirty buffers
+				 * already in the strategy ring are counted as strategy writes
+				 * (IOCONTEXT [BULKREAD|BULKWRITE|VACUUM] IOOP_WRITE) for the
+				 * purpose of IO operation statistics tracking.
+				 *
+				 * If a shared buffer initially added to the ring must be
+				 * flushed before being used, this is counted as an
+				 * IOCONTEXT_BUFFER_POOL IOOP_WRITE.
+				 *
+				 * If a shared buffer added to the ring later because the
Missing word?
+				 * current strategy buffer is pinned or in use or because all
+				 * strategy buffers were dirty and rejected (for BAS_BULKREAD
+				 * operations only) requires flushing, this is counted as an
+				 * IOCONTEXT_BUFFER_POOL IOOP_WRITE (from_ring will be false).
I think this makes sense for now, but it'd be good if somebody else could
chime in on this...
+				 *
+				 * When a strategy is not in use, the write can only be a
+				 * "regular" write of a dirty shared buffer (IOCONTEXT_BUFFER_POOL
+				 * IOOP_WRITE).
+				 */
+
 			/* OK, do the I/O */
 			TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(forkNum, blockNum,
 													  smgr->smgr_rlocator.locator.spcOid,
 													  smgr->smgr_rlocator.locator.dbOid,
 													  smgr->smgr_rlocator.locator.relNumber);

-			FlushBuffer(buf, NULL);
+			FlushBuffer(buf, NULL, io_context, IOOBJECT_RELATION);

 			LWLockRelease(BufferDescriptorGetContentLock(buf));

 			ScheduleBufferTagForWriteback(&BackendWritebackContext,
+		if (oldFlags & BM_VALID)
+		{
+			/*
+			 * When a BufferAccessStrategy is in use, evictions adding a
+			 * shared buffer to the strategy ring are counted in the
+			 * corresponding strategy's context.
Perhaps "adding a shared buffer to the ring are counted in the corresponding
context"? "strategy's context" sounds off to me.
+			 * This includes the evictions
+			 * done to add buffers to the ring initially as well as those
+			 * done to add a new shared buffer to the ring when current
+			 * buffer is pinned or otherwise in use.
I think this sentence could use a few commas, but not sure.
s/current/the current/?
+	 * We wait until this point to count reuses and evictions in order to
+	 * avoid incorrectly counting a buffer as reused or evicted when it was
+	 * released because it was concurrently pinned or in use or counting it
+	 * as reused when it was rejected or when we errored out.
+	 */
I can't quite parse this sentence.
+		IOOp		io_op = from_ring ? IOOP_REUSE : IOOP_EVICT;
+
+		pgstat_count_io_op(io_op, IOOBJECT_RELATION, io_context);
+	}
I'd just inline the variable, but ...
@@ -196,6 +197,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
LocalRefCount[b]++;
ResourceOwnerRememberBuffer(CurrentResourceOwner,
BufferDescriptorGetBuffer(bufHdr));
+
break;
}
}
Spurious change.
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
*foundPtr = false;
+
return bufHdr;
}
Ditto.
+/*
+* IO Operation statistics are not collected for all BackendTypes.
+*
+* The following BackendTypes do not participate in the cumulative stats
+* subsystem or do not do IO operations worth reporting statistics on:
s/worth reporting/we currently report/?
+	/*
+	 * In core Postgres, only regular backends and WAL Sender processes
+	 * executing queries will use local buffers and operate on temporary
+	 * relations. Parallel workers will not use local buffers (see
+	 * InitLocalBuffers()); however, extensions leveraging background workers
+	 * have no such limitation, so track IO Operations on
+	 * IOOBJECT_TEMP_RELATION for BackendType B_BG_WORKER.
+	 */
+	no_temp_rel = bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER || bktype
+		== B_CHECKPOINTER || bktype == B_AUTOVAC_WORKER || bktype ==
+		B_STANDALONE_BACKEND || bktype == B_STARTUP;
+
+	if (no_temp_rel && io_context == IOCONTEXT_BUFFER_POOL && io_object ==
+		IOOBJECT_TEMP_RELATION)
+		return false;
Personally I don't like line breaks on the == and would rather break earlier
on the && or ||.
+	for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
+	{
+		PgStatShared_IOObjectOps *shared_objs = &type_shstats->data[io_context];
+		PgStat_IOObjectOps *pending_objs = &pending_IOOpStats.data[io_context];
+
+		for (int io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++)
+		{
Is there any compiler that'd complain if you used IOContext/IOObject/IOOp as the
type in the for loop? I don't think so? Then you'd not need the casts in other
places, which I think would make the code easier to read.
+			PgStat_IOOpCounters *sharedent = &shared_objs->data[io_object];
+			PgStat_IOOpCounters *pendingent = &pending_objs->data[io_object];
+
+			if (!expect_backend_stats ||
+				!pgstat_bktype_io_context_io_object_valid(MyBackendType,
+					(IOContext) io_context, (IOObject) io_object))
+			{
+				pgstat_io_context_ops_assert_zero(sharedent);
+				pgstat_io_context_ops_assert_zero(pendingent);
+				continue;
+			}
+
+			for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
+			{
+				if (!(pgstat_io_op_valid(MyBackendType, (IOContext) io_context,
+										 (IOObject) io_object, (IOOp) io_op)))
Superfluous parens after the !, I think?
void
pgstat_report_vacuum(Oid tableoid, bool shared,
@@ -257,10 +257,18 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
 }

 	pgstat_unlock_entry(entry_ref);
+
+	/*
+	 * Flush IO Operations statistics now. pgstat_report_stat() will flush IO
+	 * Operation stats, however this will not be called after an entire
Missing "until"?
+static inline void
+pgstat_io_op_assert_zero(PgStat_IOOpCounters *counters, IOOp io_op)
+{
Does this need to be in pgstat.h? Perhaps pgstat_internal.h would suffice,
afaict it's not used outside of pgstat code?
+
+/*
+ * Assert that stats have not been counted for any combination of IOContext,
+ * IOObject, and IOOp which is not valid for the passed-in BackendType. The
+ * passed-in array of PgStat_IOOpCounters must contain stats from the
+ * BackendType specified by the second parameter. Caller is responsible for
+ * locking of the passed-in PgStatShared_IOContextOps, if needed.
+ */
+static inline void
+pgstat_backend_io_stats_assert_well_formed(PgStatShared_IOContextOps *backend_io_context_ops,
+										   BackendType bktype)
+{
This doesn't look like it should be an inline function - it's quite long.
I think it's also too complicated for the compiler to optimize out if
assertions are disabled. So you'd need to handle this with an explicit #ifdef
USE_ASSERT_CHECKING.
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>io_context</structfield> <type>text</type>
+      </para>
+      <para>
+       The context or location of an IO operation.
+      </para>
+      <itemizedlist>
+       <listitem>
+        <para>
+         <varname>io_context</varname> <literal>buffer pool</literal> refers to
+         IO operations on data in both the shared buffer pool and process-local
+         buffer pools used for temporary relation data.
+        </para>
+        <para>
The indentation in the sgml part of the patch seems to be a bit wonky.
+        <para>
+         These last three <varname>io_context</varname>s are counted separately
+         because the autovacuum daemon, explicit <command>VACUUM</command>,
+         explicit <command>ANALYZE</command>, many bulk reads, and many bulk
+         writes use a fixed amount of memory, acquiring the equivalent number of
s/memory/buffers/? The amount of memory isn't really fixed.
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>read</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Reads by this <varname>backend_type</varname> into buffers in this
+       <varname>io_context</varname>.
+       <varname>read</varname> plus <varname>extended</varname> for
+       <varname>backend_type</varname>s
+
+       <itemizedlist>
+
+        <listitem>
+         <para>
+          <literal>autovacuum launcher</literal>
+         </para>
+        </listitem>
Hm. ISTM that we should not document the set of valid backend types as part of
this view. Couldn't we share it with pg_stat_activity.backend_type?
+       The difference is that reads done as part of <command>CREATE
+       DATABASE</command> are not counted in
+       <structname>pg_statio_all_tables</structname> and
+       <structname>pg_stat_database</structname>
+      </para>
Hm, this seems a bit far into the weeds?
+Datum
+pg_stat_get_io(PG_FUNCTION_ARGS)
+{
+	PgStat_BackendIOContextOps *backends_io_stats;
+	ReturnSetInfo *rsinfo;
+	Datum		reset_time;
+
+	InitMaterializedSRF(fcinfo, 0);
+	rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+
+	backends_io_stats = pgstat_fetch_backend_io_context_ops();
+
+	reset_time = TimestampTzGetDatum(backends_io_stats->stat_reset_timestamp);
+
+	for (int bktype = 0; bktype < BACKEND_NUM_TYPES; bktype++)
+	{
+		Datum		bktype_desc = CStringGetTextDatum(GetBackendTypeDesc((BackendType) bktype));
+		bool		expect_backend_stats = true;
+		PgStat_IOContextOps *io_context_ops = &backends_io_stats->stats[bktype];
+
+		/*
+		 * For those BackendTypes without IO Operation stats, skip
+		 * representing them in the view altogether.
+		 */
+		expect_backend_stats = pgstat_io_op_stats_collected((BackendType)
+															bktype);
+
+		for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
+		{
+			const char *io_context_str = pgstat_io_context_desc(io_context);
+			PgStat_IOObjectOps *io_objs = &io_context_ops->data[io_context];
+
+			for (int io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++)
+			{
+				PgStat_IOOpCounters *counters = &io_objs->data[io_object];
+				const char *io_obj_str = pgstat_io_object_desc(io_object);
+
+				Datum		values[IO_NUM_COLUMNS] = {0};
+				bool		nulls[IO_NUM_COLUMNS] = {0};
+
+				/*
+				 * Some combinations of IOContext, IOObject, and BackendType are
+				 * not valid for any type of IOOp. In such cases, omit the
+				 * entire row from the view.
+				 */
+				if (!expect_backend_stats ||
+					!pgstat_bktype_io_context_io_object_valid((BackendType) bktype,
+						(IOContext) io_context, (IOObject) io_object))
+				{
+					pgstat_io_context_ops_assert_zero(counters);
+					continue;
+				}
Perhaps mention in a comment two loops up that we don't skip the nested loops
despite !expect_backend_stats because we want to assert here?
Greetings,
Andres Freund
Note that 001 fails to compile without 002:
../src/backend/storage/buffer/bufmgr.c:1257:43: error: ‘from_ring’ undeclared (first use in this function)
1257 | StrategyRejectBuffer(strategy, buf, from_ring))
My "warnings" script informed me about these gripes from MSVC:
[03:42:30.607] c:\cirrus>call sh -c 'if grep ": warning " build.txt; then exit 1; fi; exit 0'
[03:42:30.749] c:\cirrus\src\backend\storage\buffer\freelist.c(699) : warning C4715: 'IOContextForStrategy': not all control paths return a value
[03:42:30.749] c:\cirrus\src\backend\utils\activity\pgstat_io_ops.c(190) : warning C4715: 'pgstat_io_context_desc': not all control paths return a value
[03:42:30.749] c:\cirrus\src\backend\utils\activity\pgstat_io_ops.c(204) : warning C4715: 'pgstat_io_object_desc': not all control paths return a value
[03:42:30.749] c:\cirrus\src\backend\utils\activity\pgstat_io_ops.c(226) : warning C4715: 'pgstat_io_op_desc': not all control paths return a value
[03:42:30.749] c:\cirrus\src\backend\utils\adt\pgstatfuncs.c(1816) : warning C4715: 'pgstat_io_op_get_index': not all control paths return a value
In the docs table, you say things like:
| io_context vacuum refers to the IO operations incurred while vacuuming and analyzing.
..but it's a bit unclear (maybe due to the way the docs are rendered).
I think it may be more clear to say "when <io_context> is
<vacuum>, ..."
| acquiring the equivalent number of shared buffers
I don't think "equivalent" fits here, since it's actually acquiring a
different number of buffers.
There's a missing period before " The difference is"
The sentence beginning "read plus extended for backend_types" is difficult to
parse due to having a bulleted list in its middle.
There aren't many references to "IOOps", which is good, because I
started to read it as "I oops".
+ * Flush IO Operations statistics now. pgstat_report_stat() will flush IO
+ * Operation stats, however this will not be called after an entire
=> I think that's intended to say *until* after ?
+ * Functions to assert that invalid IO Operation counters are zero.
=> There's a missing newline above this comment.
+ Assert(counters->evictions == 0 && counters->extends == 0 &&
+ counters->fsyncs == 0 && counters->reads == 0 && counters->reuses
+ == 0 && counters->writes == 0);
=> It'd be more readable and also maybe help debugging if these were separate
assertions. I wondered in the past if that should be a general policy
for all assertions.
+pgstat_io_op_stats_collected(BackendType bktype)
+{
+ return bktype != B_INVALID && bktype != B_ARCHIVER && bktype != B_LOGGER &&
+ bktype != B_WAL_RECEIVER && bktype != B_WAL_WRITER;
Similar: I'd prefer to see this as 5 "ifs" or a "switch" to return
false, else return true. But YMMV.
+ * CREATE TEMPORRARY TABLE AS ...
=> typo: temporary
+ if (strategy_io_context && io_op == IOOP_FSYNC)
=> Extra space.
pgstat_count_io_op() has a superfluous newline before "}".
I think there may be a problem/deficiency with hint bits:
|postgres=# DROP TABLE u2; CREATE TABLE u2 AS SELECT generate_series(1,999999)a; SELECT pg_stat_reset_shared('io'); explain (analyze,buffers) SELECT * FROM u2;
|...
| Seq Scan on u2 (cost=0.00..15708.75 rows=1128375 width=4) (actual time=0.111..458.239 rows=999999 loops=1)
| Buffers: shared hit=2048 read=2377 dirtied=2377 written=2345
|postgres=# SELECT COUNT(1), relname, COUNT(1) FILTER(WHERE isdirty) FROM pg_buffercache b LEFT JOIN pg_class c ON pg_relation_filenode(c.oid)=b.relfilenode GROUP BY 2 ORDER BY 1 DESC LIMIT 11;
| count | relname | count
|-------+---------------------------------+-------
| 13619 | | 0
| 2080 | u2 | 2080
| 104 | pg_attribute | 4
| 71 | pg_statistic | 1
| 51 | pg_class | 1
It says that SELECT caused 2377 buffers to be dirtied, of which 2080 are
associated with the new table in pg_buffercache.
|postgres=# SELECT * FROM pg_stat_io WHERE backend_type!~'autovac|archiver|logger|standalone|startup|^wal|background worker' or true ORDER BY 2;
| backend_type | io_context | io_object | read | written | extended | op_bytes | evicted | reused | files_synced | stats_reset
|...
| client backend | bulkread | relation | 2377 | 2345 | | 8192 | 0 | 2345 | | 2022-11-22 22:32:33.044552-06
I think it's a known behavior that hint bits do not use the strategy
ring buffer. For BAS_BULKREAD, ring_size = 256kB (32, 8kB pages), but
there's 2080 dirty pages in the buffercache (~16MB).
But the IO view says that 2345 of the pages were "reused", which seems
misleading to me. Maybe that just follows from the behavior and the view is
fine. If the view is fine, maybe this case should still be specifically
mentioned in the docs.
--
Justin
Hi,
On 2022-11-22 23:43:29 -0600, Justin Pryzby wrote:
I think there may be a problem/deficiency with hint bits:
|postgres=# DROP TABLE u2; CREATE TABLE u2 AS SELECT generate_series(1,999999)a; SELECT pg_stat_reset_shared('io'); explain (analyze,buffers) SELECT * FROM u2;
|...
| Seq Scan on u2 (cost=0.00..15708.75 rows=1128375 width=4) (actual time=0.111..458.239 rows=999999 loops=1)
| Buffers: shared hit=2048 read=2377 dirtied=2377 written=2345

|postgres=# SELECT COUNT(1), relname, COUNT(1) FILTER(WHERE isdirty) FROM pg_buffercache b LEFT JOIN pg_class c ON pg_relation_filenode(c.oid)=b.relfilenode GROUP BY 2 ORDER BY 1 DESC LIMIT 11;
| count | relname | count
|-------+---------------------------------+-------
| 13619 | | 0
| 2080 | u2 | 2080
| 104 | pg_attribute | 4
| 71 | pg_statistic | 1
| 51 | pg_class | 1

It says that SELECT caused 2377 buffers to be dirtied, of which 2080 are
associated with the new table in pg_buffercache.
Note that there's 2048 dirty buffers for u2 in shared_buffers before the
SELECT, despite the relation being 4425 blocks long, due to the CTAS using
BAS_BULKWRITE.
|postgres=# SELECT * FROM pg_stat_io WHERE backend_type!~'autovac|archiver|logger|standalone|startup|^wal|background worker' or true ORDER BY 2;
| backend_type | io_context | io_object | read | written | extended | op_bytes | evicted | reused | files_synced | stats_reset
|...
| client backend | bulkread | relation | 2377 | 2345 | | 8192 | 0 | 2345 | | 2022-11-22 22:32:33.044552-06

I think it's a known behavior that hint bits do not use the strategy
ring buffer. For BAS_BULKREAD, ring_size = 256kB (32, 8kB pages), but
there's 2080 dirty pages in the buffercache (~16MB).
I don't think there's any "circumvention" of the ringbuffer here. There's 2048
buffers for u2 in s_b before, all dirty, there's 2080 after, also all
dirty. So the ringbuffer restricted the increase in shared buffers used for u2
to 2080-2048=32 additional buffers.
The reason hint bits don't prevent pages from being written out here is that a
BAS_BULKREAD strategy doesn't cause all buffer writes to be rejected, it just
causes buffer writes to be rejected when the page LSN would require a WAL
flush. And that's not typically the case when you just set a hint bit, unless
you use wal_log_hint_bits = true.
If I turn on wal_log_hints=true and add a CHECKPOINT after the CTAS I see 0
reuses (and 4425 dirty buffers), which is what I'd expect.
But the IO view says that 2345 of the pages were "reused", which seems
misleading to me. Maybe that just follows from the behavior and the view is
fine. If the view is fine, maybe this case should still be specifically
mentioned in the docs.
I think that's just confusing due to the reset. 2048 + 2345 = 4393, but we
only have 2080 buffers for u2 in s_b.
Greetings,
Andres Freund
v38 attached.
On Sun, Nov 20, 2022 at 7:38 PM Andres Freund <andres@anarazel.de> wrote:
One good follow up patch will be to rip out the accounting for
pg_stat_bgwriter's buffers_backend, buffers_backend_fsync and perhaps
buffers_alloc and replace it with a subselect getting the equivalent data from
pg_stat_io. It might not be quite worth doing for buffers_alloc because of
the way that's tied into bgwriter pacing.
I don't see how it will make sense to have buffers_backend and
buffers_backend_fsync respond to a different reset target than the rest
of the fields in pg_stat_bgwriter.
On 2022-11-03 13:00:24 -0400, Melanie Plageman wrote:
@@ -833,6 +836,22 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
isExtend = (blockNum == P_NEW);
+	if (isLocalBuf)
+	{
+		/*
+		 * Though a strategy object may be passed in, no strategy is employed
+		 * when using local buffers. This could happen when doing, for example,
+		 * CREATE TEMPORRARY TABLE AS ...
+		 */
+		io_context = IOCONTEXT_BUFFER_POOL;
+		io_object = IOOBJECT_TEMP_RELATION;
+	}
+	else
+	{
+		io_context = IOContextForStrategy(strategy);
+		io_object = IOOBJECT_RELATION;
+	}

I think given how frequently ReadBuffer_common() is called in some workloads,
it'd be good to make IOContextForStrategy inlinable. But I guess that's not
easily doable, because struct BufferAccessStrategyData is only defined in
freelist.c.
Correct
Could we defer this until later, given that we don't currently need this in
case of buffer hits afaict?
Yes, you are right. In ReadBuffer_common(), we can easily move the
IOContextForStrategy() call to directly before using io_context. I've
done that in the attached version.
@@ -1121,6 +1144,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
BufferAccessStrategy strategy,
bool *foundPtr)
{
+ bool from_ring;
+ IOContext io_context;
BufferTag newTag; /* identity of requested block */
uint32 newHash; /* hash value for newTag */
LWLock *newPartitionLock; /* buffer partition lock for it */
@@ -1187,9 +1212,12 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
*/
	LWLockRelease(newPartitionLock);

+	io_context = IOContextForStrategy(strategy);
Hm - doesn't this mean we do IOContextForStrategy() twice? Once in
ReadBuffer_common() and then again here?
Yes. So, there are a few options for addressing this.
- if the goal is to call IOContextForStrategy() exactly once in a
given codepath, BufferAlloc() can set IOContext
(passed by reference as an output parameter). I don't like this much
because it doesn't make sense to me that BufferAlloc() would set the
"io_context" parameter -- especially given that strategy is already
passed as a parameter and is obviously available to the caller.
I also don't see a good way of waiting until BufferAlloc() returns to count
the IO operations counted in FlushBuffer() and BufferAlloc() itself.
- if the goal is to avoid calling IOContextForStrategy() in more common
codepaths or to call it as close to its use as possible, then we can
push down its call in BufferAlloc() to the two locations where it is
used -- when a dirty buffer must be flushed and when a block was
evicted or reused. This will avoid calling it when we are not evicting
a block from a valid buffer.
However, if we do that, I don't know how to avoid calling it twice in
that codepath. Even though we can assume io_context was set in the
first location by the time we get to the second location, we would
need to initialize the variable with something if we only plan to set
it in some branches and there is no "invalid" or "default" value of
the IOContext enum.
Given the above, I've left the call in BufferAlloc() as is in the
attached version.
	/* Loop here in case we have to try another victim buffer */
	for (;;)
	{
+
		/*
		 * Ensure, while the spinlock's not yet held, that there's a free
		 * refcount entry.
@@ -1200,7 +1228,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
		 * Select a victim buffer.  The buffer is returned with its header
		 * spinlock still held!
		 */
-		buf = StrategyGetBuffer(strategy, &buf_state);
+		buf = StrategyGetBuffer(strategy, &buf_state, &from_ring);

		Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
I think patch 0001 relies on this change already having been made, if I am not misunderstanding?
Fixed.
@@ -1263,13 +1291,34 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
}
 }

+				/*
+				 * When a strategy is in use, only flushes of dirty buffers
+				 * already in the strategy ring are counted as strategy writes
+				 * (IOCONTEXT [BULKREAD|BULKWRITE|VACUUM] IOOP_WRITE) for the
+				 * purpose of IO operation statistics tracking.
+				 *
+				 * If a shared buffer initially added to the ring must be
+				 * flushed before being used, this is counted as an
+				 * IOCONTEXT_BUFFER_POOL IOOP_WRITE.
+				 *
+				 * If a shared buffer added to the ring later because the

Missing word?
Fixed.
+				 * current strategy buffer is pinned or in use or because all
+				 * strategy buffers were dirty and rejected (for BAS_BULKREAD
+				 * operations only) requires flushing, this is counted as an
+				 * IOCONTEXT_BUFFER_POOL IOOP_WRITE (from_ring will be false).

I think this makes sense for now, but it'd be good if somebody else could
chime in on this...

+				 *
+				 * When a strategy is not in use, the write can only be a
+				 * "regular" write of a dirty shared buffer (IOCONTEXT_BUFFER_POOL
+				 * IOOP_WRITE).
+				 */
+
 			/* OK, do the I/O */
 			TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(forkNum, blockNum,
 													  smgr->smgr_rlocator.locator.spcOid,
 													  smgr->smgr_rlocator.locator.dbOid,
 													  smgr->smgr_rlocator.locator.relNumber);

-			FlushBuffer(buf, NULL);
+			FlushBuffer(buf, NULL, io_context, IOOBJECT_RELATION);

 			LWLockRelease(BufferDescriptorGetContentLock(buf));

 			ScheduleBufferTagForWriteback(&BackendWritebackContext,

+		if (oldFlags & BM_VALID)
+		{
+			/*
+			 * When a BufferAccessStrategy is in use, evictions adding a
+			 * shared buffer to the strategy ring are counted in the
+			 * corresponding strategy's context.

Perhaps "adding a shared buffer to the ring are counted in the corresponding
context"? "strategy's context" sounds off to me.
Fixed.
+			 * This includes the evictions
+			 * done to add buffers to the ring initially as well as those
+			 * done to add a new shared buffer to the ring when current
+			 * buffer is pinned or otherwise in use.

I think this sentence could use a few commas, but not sure.
s/current/the current/?
Reworded.
+	 * We wait until this point to count reuses and evictions in order to
+	 * avoid incorrectly counting a buffer as reused or evicted when it was
+	 * released because it was concurrently pinned or in use or counting it
+	 * as reused when it was rejected or when we errored out.
+	 */

I can't quite parse this sentence.
I've reworded the whole comment.
I think it is clearer now.
+		IOOp		io_op = from_ring ? IOOP_REUSE : IOOP_EVICT;
+
+		pgstat_count_io_op(io_op, IOOBJECT_RELATION, io_context);
+	}

I'd just inline the variable, but ...
Done.
@@ -196,6 +197,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
LocalRefCount[b]++;
ResourceOwnerRememberBuffer(CurrentResourceOwner,
BufferDescriptorGetBuffer(bufHdr));
+
break;
}
}

Spurious change.
Removed.
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
*foundPtr = false;
+
return bufHdr;
}

Ditto.
Removed.
+/*
+* IO Operation statistics are not collected for all BackendTypes.
+*
+* The following BackendTypes do not participate in the cumulative stats
+* subsystem or do not do IO operations worth reporting statistics on:

s/worth reporting/we currently report/?
Updated
+	/*
+	 * In core Postgres, only regular backends and WAL Sender processes
+	 * executing queries will use local buffers and operate on temporary
+	 * relations. Parallel workers will not use local buffers (see
+	 * InitLocalBuffers()); however, extensions leveraging background workers
+	 * have no such limitation, so track IO Operations on
+	 * IOOBJECT_TEMP_RELATION for BackendType B_BG_WORKER.
+	 */
+	no_temp_rel = bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER || bktype
+		== B_CHECKPOINTER || bktype == B_AUTOVAC_WORKER || bktype ==
+		B_STANDALONE_BACKEND || bktype == B_STARTUP;
+
+	if (no_temp_rel && io_context == IOCONTEXT_BUFFER_POOL && io_object ==
+		IOOBJECT_TEMP_RELATION)
+		return false;

Personally I don't like line breaks on the == and would rather break earlier
on the && or ||.
I've gone through and fixed all of these that I could find.
+	for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
+	{
+		PgStatShared_IOObjectOps *shared_objs = &type_shstats->data[io_context];
+		PgStat_IOObjectOps *pending_objs = &pending_IOOpStats.data[io_context];
+
+		for (int io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++)
+		{

Is there any compiler that'd complain if you used IOContext/IOObject/IOOp as the
type in the for loop? I don't think so? Then you'd not need the casts in other
places, which I think would make the code easier to read.
I changed the type and currently get no compiler warnings, however, on
a previous CI run,
with the type changed to an enum I got the following warning:
/tmp/cirrus-ci-build/src/include/utils/pgstat_internal.h:605:48:
error: no ‘operator++(int)’ declared for postfix ‘++’ [-fpermissive]
605 | io_context < IOCONTEXT_NUM_TYPES; io_context++)
I'm not sure why I am no longer getting it.
+			PgStat_IOOpCounters *sharedent = &shared_objs->data[io_object];
+			PgStat_IOOpCounters *pendingent = &pending_objs->data[io_object];
+
+			if (!expect_backend_stats ||
+				!pgstat_bktype_io_context_io_object_valid(MyBackendType,
+					(IOContext) io_context, (IOObject) io_object))
+			{
+				pgstat_io_context_ops_assert_zero(sharedent);
+				pgstat_io_context_ops_assert_zero(pendingent);
+				continue;
+			}
+
+			for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
+			{
+				if (!(pgstat_io_op_valid(MyBackendType, (IOContext) io_context,
+										 (IOObject) io_object, (IOOp) io_op)))

Superfluous parens after the !, I think?
Thanks! I've looked for other occurrences as well and fixed them.
void
pgstat_report_vacuum(Oid tableoid, bool shared,
@@ -257,10 +257,18 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
 }

 	pgstat_unlock_entry(entry_ref);
+
+	/*
+	 * Flush IO Operations statistics now. pgstat_report_stat() will flush IO
+	 * Operation stats, however this will not be called after an entire

Missing "until"?
Fixed.
+static inline void
+pgstat_io_op_assert_zero(PgStat_IOOpCounters *counters, IOOp io_op)
+{

Does this need to be in pgstat.h? Perhaps pgstat_internal.h would suffice,
afaict it's not used outside of pgstat code?
It is used in pgstatfuncs.c during the view creation.
+
+/*
+ * Assert that stats have not been counted for any combination of IOContext,
+ * IOObject, and IOOp which is not valid for the passed-in BackendType. The
+ * passed-in array of PgStat_IOOpCounters must contain stats from the
+ * BackendType specified by the second parameter. Caller is responsible for
+ * locking of the passed-in PgStatShared_IOContextOps, if needed.
+ */
+static inline void
+pgstat_backend_io_stats_assert_well_formed(PgStatShared_IOContextOps *backend_io_context_ops,
+										   BackendType bktype)
+{

This doesn't look like it should be an inline function - it's quite long.
I think it's also too complicated for the compiler to optimize out if
assertions are disabled. So you'd need to handle this with an explicit #ifdef
USE_ASSERT_CHECKING.
I've made it a static helper function in pgstat.c.
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>io_context</structfield> <type>text</type>
+      </para>
+      <para>
+       The context or location of an IO operation.
+      </para>
+      <itemizedlist>
+       <listitem>
+        <para>
+         <varname>io_context</varname> <literal>buffer pool</literal> refers to
+         IO operations on data in both the shared buffer pool and process-local
+         buffer pools used for temporary relation data.
+        </para>
+        <para>

The indentation in the sgml part of the patch seems to be a bit wonky.
I'll address this and the other docs feedback in a separate patchset and email.
+Datum
+pg_stat_get_io(PG_FUNCTION_ARGS)
+{
+	PgStat_BackendIOContextOps *backends_io_stats;
+	ReturnSetInfo *rsinfo;
+	Datum		reset_time;
+
+	InitMaterializedSRF(fcinfo, 0);
+	rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+
+	backends_io_stats = pgstat_fetch_backend_io_context_ops();
+
+	reset_time = TimestampTzGetDatum(backends_io_stats->stat_reset_timestamp);
+
+	for (int bktype = 0; bktype < BACKEND_NUM_TYPES; bktype++)
+	{
+		Datum		bktype_desc = CStringGetTextDatum(GetBackendTypeDesc((BackendType) bktype));
+		bool		expect_backend_stats = true;
+		PgStat_IOContextOps *io_context_ops = &backends_io_stats->stats[bktype];
+
+		/*
+		 * For those BackendTypes without IO Operation stats, skip
+		 * representing them in the view altogether.
+		 */
+		expect_backend_stats = pgstat_io_op_stats_collected((BackendType)
+															bktype);
+
+		for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
+		{
+			const char *io_context_str = pgstat_io_context_desc(io_context);
+			PgStat_IOObjectOps *io_objs = &io_context_ops->data[io_context];
+
+			for (int io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++)
+			{
+				PgStat_IOOpCounters *counters = &io_objs->data[io_object];
+				const char *io_obj_str = pgstat_io_object_desc(io_object);
+
+				Datum		values[IO_NUM_COLUMNS] = {0};
+				bool		nulls[IO_NUM_COLUMNS] = {0};
+
+				/*
+				 * Some combinations of IOContext, IOObject, and BackendType are
+				 * not valid for any type of IOOp. In such cases, omit the
+				 * entire row from the view.
+				 */
+				if (!expect_backend_stats ||
+					!pgstat_bktype_io_context_io_object_valid((BackendType) bktype,
+						(IOContext) io_context, (IOObject) io_object))
+				{
+					pgstat_io_context_ops_assert_zero(counters);
+					continue;
+				}

Perhaps mention in a comment two loops up that we don't skip the nested loops
despite !expect_backend_stats because we want to assert here?
Done.
I've also removed the test for bulkread reads from regress because
CREATE DATABASE is expensive and added it to the verify_heapam test
since verify_heapam is one of the only callers that unconditionally
uses a BULKREAD strategy.
Thanks,
Melanie
Attachments:
v38-0001-Remove-BufferAccessStrategyData-current_was_in_r.patch
From b69dec7f6ddab6447ecdb5b38bdafa2bf073756a Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Thu, 13 Oct 2022 11:03:05 -0700
Subject: [PATCH v38 1/4] Remove BufferAccessStrategyData->current_was_in_ring
It is a duplication of StrategyGetBuffer->from_ring.
---
src/backend/storage/buffer/bufmgr.c | 5 +++--
src/backend/storage/buffer/freelist.c | 22 ++++++++--------------
src/include/storage/buf_internals.h | 4 ++--
3 files changed, 13 insertions(+), 18 deletions(-)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 73d30bf619..fa32f24e19 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1121,6 +1121,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
BufferAccessStrategy strategy,
bool *foundPtr)
{
+ bool from_ring;
BufferTag newTag; /* identity of requested block */
uint32 newHash; /* hash value for newTag */
LWLock *newPartitionLock; /* buffer partition lock for it */
@@ -1200,7 +1201,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* Select a victim buffer. The buffer is returned with its header
* spinlock still held!
*/
- buf = StrategyGetBuffer(strategy, &buf_state);
+ buf = StrategyGetBuffer(strategy, &buf_state, &from_ring);
Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
@@ -1254,7 +1255,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
UnlockBufHdr(buf, buf_state);
if (XLogNeedsFlush(lsn) &&
- StrategyRejectBuffer(strategy, buf))
+ StrategyRejectBuffer(strategy, buf, from_ring))
{
/* Drop lock/pin and loop around for another buffer */
LWLockRelease(BufferDescriptorGetContentLock(buf));
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 990e081aae..5299bb8711 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -81,12 +81,6 @@ typedef struct BufferAccessStrategyData
*/
int current;
- /*
- * True if the buffer just returned by StrategyGetBuffer had been in the
- * ring already.
- */
- bool current_was_in_ring;
-
/*
* Array of buffer numbers. InvalidBuffer (that is, zero) indicates we
* have not yet selected a buffer for this ring slot. For allocation
@@ -198,13 +192,15 @@ have_free_buffer(void)
* return the buffer with the buffer header spinlock still held.
*/
BufferDesc *
-StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
+StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_ring)
{
BufferDesc *buf;
int bgwprocno;
int trycounter;
uint32 local_buf_state; /* to avoid repeated (de-)referencing */
+ *from_ring = false;
+
/*
* If given a strategy object, see whether it can select a buffer. We
* assume strategy objects don't need buffer_strategy_lock.
@@ -213,7 +209,10 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
{
buf = GetBufferFromRing(strategy, buf_state);
if (buf != NULL)
+ {
+ *from_ring = true;
return buf;
+ }
}
/*
@@ -625,10 +624,7 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
*/
bufnum = strategy->buffers[strategy->current];
if (bufnum == InvalidBuffer)
- {
- strategy->current_was_in_ring = false;
return NULL;
- }
/*
* If the buffer is pinned we cannot use it under any circumstances.
@@ -644,7 +640,6 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
if (BUF_STATE_GET_REFCOUNT(local_buf_state) == 0
&& BUF_STATE_GET_USAGECOUNT(local_buf_state) <= 1)
{
- strategy->current_was_in_ring = true;
*buf_state = local_buf_state;
return buf;
}
@@ -654,7 +649,6 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
* Tell caller to allocate a new buffer with the normal allocation
* strategy. He'll then replace this ring element via AddBufferToRing.
*/
- strategy->current_was_in_ring = false;
return NULL;
}
@@ -682,14 +676,14 @@ AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
* if this buffer should be written and re-used.
*/
bool
-StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
+StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool from_ring)
{
/* We only do this in bulkread mode */
if (strategy->btype != BAS_BULKREAD)
return false;
/* Don't muck with behavior of normal buffer-replacement strategy */
- if (!strategy->current_was_in_ring ||
+ if (!from_ring ||
strategy->buffers[strategy->current] != BufferDescriptorGetBuffer(buf))
return false;
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 406db6be78..7b67250747 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -392,10 +392,10 @@ extern void ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *
/* freelist.c */
extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
- uint32 *buf_state);
+ uint32 *buf_state, bool *from_ring);
extern void StrategyFreeBuffer(BufferDesc *buf);
extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
- BufferDesc *buf);
+ BufferDesc *buf, bool from_ring);
extern int StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
extern void StrategyNotifyBgWriter(int bgwprocno);
--
2.38.1
v38-0004-Add-system-view-tracking-IO-ops-per-backend-type.patch
From 98aa4e8109fe48dfb3ad3adde1bbba44e9a16485 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Thu, 3 Nov 2022 12:19:48 -0400
Subject: [PATCH v38 4/4] Add system view tracking IO ops per backend type
Add pg_stat_io, a system view which tracks the number of IOOps (
evictions, reuses, reads, writes, extends,
and fsyncs) done through each IOContext (shared buffers, local buffers,
and buffers reserved by a BufferAccessStrategy) by each type of backend
(e.g. client backend, checkpointer).
Some BackendTypes do not accumulate IO operation statistics and will
not be included in the view.
Some IOContexts are not used by some BackendTypes and will not be in the
view. For example, checkpointer does not use a BufferAccessStrategy
(currently), so there will be no rows for BufferAccessStrategy
IOContexts for checkpointer.
Some IOOps are invalid in combination with certain IOContexts. Those
cells will be NULL in the view to distinguish between 0 observed IOOps
of that type and an invalid combination. For example, local buffers are
not fsynced so cells for all BackendTypes for IOCONTEXT_LOCAL and
IOOP_FSYNC will be NULL.
Some BackendTypes never perform certain IOOps. Those cells will also be
NULL in the view. For example, bgwriter should not perform reads.
The view's statistics are incremented when a backend performs an IO
operation and are maintained by the cumulative statistics subsystem.
Each row of the view shows stats for a particular BackendType and
IOContext combination (e.g. shared buffer accesses by checkpointer) and
each column in the view is the total number of IO Operations done (e.g.
writes).
So a cell in the view would be, for example, the number of shared
buffers written by checkpointer since the last stats reset.
In anticipation of tracking WAL IO and non-block-oriented IO (such as
temporary file IO), the "unit" column specifies the unit of the "read",
"written", and "extended" columns for a given row.
Note that some of the cells in the view are redundant with fields in
pg_stat_bgwriter (e.g. buffers_backend), however these have been kept in
pg_stat_bgwriter for backwards compatibility. Deriving the redundant
pg_stat_bgwriter stats from the IO operations stats structures was also
problematic due to the separate reset targets for 'bgwriter' and 'io'.
Suggested by Andres Freund
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Maciek Sakrejda <m.sakrejda@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
contrib/amcheck/t/001_verify_heapam.pl | 21 +
doc/src/sgml/monitoring.sgml | 621 ++++++++++++++++++++++++-
src/backend/catalog/system_views.sql | 15 +
src/backend/utils/adt/pgstatfuncs.c | 143 ++++++
src/include/catalog/pg_proc.dat | 9 +
src/test/regress/expected/rules.out | 12 +
src/test/regress/expected/stats.out | 225 +++++++++
src/test/regress/sql/stats.sql | 138 ++++++
8 files changed, 1168 insertions(+), 16 deletions(-)
diff --git a/contrib/amcheck/t/001_verify_heapam.pl b/contrib/amcheck/t/001_verify_heapam.pl
index 019eed33a0..616ed4ed98 100644
--- a/contrib/amcheck/t/001_verify_heapam.pl
+++ b/contrib/amcheck/t/001_verify_heapam.pl
@@ -49,6 +49,15 @@ detects_heap_corruption(
# Check a corrupt table with all-frozen data
#
fresh_test_table('test');
+
+# verify_heapam always uses a BAS_BULKREAD BufferAccessStrategy. This allows us
+# to reliably test that pg_stat_io BULKREAD reads are being captured without
+# relying on the size of shared buffers or on an expensive operation like
+# CREATE DATABASE.
+my $stats_reads_before = $node->safe_psql('postgres',
+ qq(SELECT sum(read) FROM pg_stat_io WHERE io_context = 'bulkread')
+);
+
$node->safe_psql('postgres', q(VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) test));
detects_no_corruption("verify_heapam('test')",
"all-frozen not corrupted table");
@@ -75,6 +84,18 @@ check_all_options_uncorrupted('test_seq', 'plain');
reset_test_sequence('test_seq');
check_all_options_uncorrupted('test_seq', 'plain');
+$node->safe_psql('postgres', qq(SELECT pg_stat_force_next_flush()));
+is(
+ $node->safe_psql('postgres',
+ qq(
+ SELECT sum(read) > '$stats_reads_before'
+ FROM pg_stat_io WHERE io_context = 'bulkread'
+ )),
+ qq(t),
+ qq(Confirm that bulkread BufferAccessStrategy reads were captured in pg_stat_io)
+);
+
+
# Returns the filesystem path for the named relation.
sub relation_filepath
{
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 89fca710db..a74409bfd3 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -448,6 +448,15 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</entry>
</row>
+ <row>
+ <entry><structname>pg_stat_io</structname><indexterm><primary>pg_stat_io</primary></indexterm></entry>
+ <entry>A row for each IO Context for each backend type showing
+ statistics about backend IO operations. See
+ <link linkend="monitoring-pg-stat-io-view">
+ <structname>pg_stat_io</structname></link> for details.
+ </entry>
+ </row>
+
<row>
<entry><structname>pg_stat_wal</structname><indexterm><primary>pg_stat_wal</primary></indexterm></entry>
<entry>One row only, showing statistics about WAL activity. See
@@ -658,20 +667,20 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</para>
<para>
- The <structname>pg_statio_</structname> views are primarily useful to
- determine the effectiveness of the buffer cache. When the number
- of actual disk reads is much smaller than the number of buffer
- hits, then the cache is satisfying most read requests without
- invoking a kernel call. However, these statistics do not give the
- entire story: due to the way in which <productname>PostgreSQL</productname>
- handles disk I/O, data that is not in the
- <productname>PostgreSQL</productname> buffer cache might still reside in the
- kernel's I/O cache, and might therefore still be fetched without
- requiring a physical read. Users interested in obtaining more
- detailed information on <productname>PostgreSQL</productname> I/O behavior are
- advised to use the <productname>PostgreSQL</productname> statistics views
- in combination with operating system utilities that allow insight
- into the kernel's handling of I/O.
+ The <structname>pg_stat_io</structname> and
+ <structname>pg_statio_</structname> set of views are primarily useful to
+ determine the effectiveness of the buffer cache. When the number of actual
+ disk reads is much smaller than the number of buffer hits, then the cache is
+ satisfying most read requests without invoking a kernel call. However, these
+ statistics do not give the entire story: due to the way in which
+ <productname>PostgreSQL</productname> handles disk I/O, data that is not in
+ the <productname>PostgreSQL</productname> buffer cache might still reside in
+ the kernel's I/O cache, and might therefore still be fetched without
+ requiring a physical read. Users interested in obtaining more detailed
+ information on <productname>PostgreSQL</productname> I/O behavior are
+ advised to use the <productname>PostgreSQL</productname> statistics views in
+ combination with operating system utilities that allow insight into the
+ kernel's handling of I/O.
</para>
</sect2>
@@ -3604,13 +3613,12 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
<structfield>stats_reset</structfield> <type>timestamp with time zone</type>
</para>
<para>
- Time at which these statistics were last reset
+ Time at which these statistics were last reset.
</para></entry>
</row>
</tbody>
</tgroup>
</table>
-
<para>
Normally, WAL files are archived in order, oldest to newest, but that is
not guaranteed, and does not hold under special circumstances like when
@@ -3619,7 +3627,588 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
<structfield>last_archived_wal</structfield> have also been successfully
archived.
</para>
+ </sect2>
+
+ <sect2 id="monitoring-pg-stat-io-view">
+ <title><structname>pg_stat_io</structname></title>
+
+ <indexterm>
+ <primary>pg_stat_io</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_stat_io</structname> view has a row for each backend type
+ and IO context containing global data for the cluster on IO operations done
+ by that backend type in that IO context. Currently only a subset of IO
+ operations are tracked here. WAL IO, IO on temporary files, and some forms
+ of IO outside of shared buffers (such as when building indexes or moving a
+ table from one tablespace to another) may be added in the future.
+ </para>
+
+ <table id="pg-stat-io-view" xreflabel="pg_stat_io">
+ <title><structname>pg_stat_io</structname> View</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>backend_type</structfield> <type>text</type>
+ </para>
+ <para>
+ Type of backend (e.g. background worker, autovacuum worker).
+ See <link linkend="monitoring-pg-stat-activity-view">
+ <structname>pg_stat_activity</structname></link> for more information on
+ <varname>backend_type</varname>s. Some <varname>backend_type</varname>s
+ do not accumulate IO operation statistics and will not be included in
+ the view.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>io_context</structfield> <type>text</type>
+ </para>
+ <para>
+ The context or location of an IO operation.
+ </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <varname>io_context</varname> <literal>buffer pool</literal> refers to
+ IO operations on data in both the shared buffer pool and process-local
+ buffer pools used for temporary relation data.
+ </para>
+ <para>
+ Operations on temporary relations are tracked in
+ <varname>io_context</varname> <literal>buffer pool</literal> and
+ <varname>io_object</varname> <literal>temp relation</literal>.
+ </para>
+ <para>
+ Operations on permanent relations are tracked in
+ <varname>io_context</varname> <literal>buffer pool</literal> and
+ <varname>io_object</varname> <literal>relation</literal>.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <varname>io_context</varname> <literal>vacuum</literal> refers to the IO
+ operations incurred while vacuuming and analyzing.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <varname>io_context</varname> <literal>bulkread</literal> refers to IO
+ operations specially designated as <literal>bulk reads</literal>, such
+ as the sequential scan of a large table.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <varname>io_context</varname> <literal>bulkwrite</literal> refers to IO
+ operations specially designated as <literal>bulk writes</literal>, such
+ as <command>COPY</command>.
+ </para>
+ </listitem>
+ </itemizedlist>
+
+ <para>
+ These last three <varname>io_context</varname>s are counted separately
+ because the autovacuum daemon, explicit <command>VACUUM</command>,
+ explicit <command>ANALYZE</command>, many bulk reads, and many bulk
+ writes use a fixed amount of memory, acquiring the equivalent number of
+ shared buffers and reusing them circularly to avoid occupying an undue
+ portion of the main shared buffer pool. This pattern is called a
+ <quote>Buffer Access Strategy</quote> in the
+ <productname>PostgreSQL</productname> source code and the fixed-size
+ ring buffer is referred to as a <quote>strategy ring buffer</quote> for
+ the purposes of this view's documentation. These
+ <varname>io_context</varname>s are referred to as <quote>strategy
+ contexts</quote> and IO operations on strategy contexts are referred to
+ as <quote>strategy operations</quote>.
+ </para>
+ <para>
+ Some <varname>io_context</varname>s are not used by some
+ <varname>backend_type</varname>s and will not be in the view. For
+ example, the checkpointer does not use a Buffer Access Strategy
+ (currently), so there will be no rows for <varname>backend_type</varname>
+ <literal>checkpointer</literal> and any of the strategy
+ <varname>io_context</varname>s.
+ </para>
+ <para>
+ Some IO operations are invalid in combination with certain
+ <varname>io_context</varname>s and <varname>io_object</varname>s. Those
+ cells will be NULL to distinguish between 0 observed IO operations of
+ that type and an invalid combination. For example, temporary tables are
+ not fsynced, so cells for all <varname>backend_type</varname>s for
+ <varname>io_object</varname> <literal>temp relation</literal> in
+ <varname>io_context</varname> <literal>buffer pool</literal> for
+ <varname>files_synced</varname> will be NULL. Some
+ <varname>backend_type</varname>s never perform certain IO operations.
+ Those cells will also be NULL in the view. For example,
+ <varname>backend_type</varname> <literal>background writer</literal>
+ should not perform reads.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>io_object</structfield> <type>text</type>
+ </para>
+ <para>
+ Object operated on in a given <varname>io_context</varname> by a given
+ <varname>backend_type</varname>.
+ </para>
+
+ <para> Some <varname>backend_type</varname>s will never do IO operations
+ on some <varname>io_object</varname>s, either at all or in certain
+ <varname>io_context</varname>s. These rows are omitted from the
+ view.</para>
+
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>read</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Reads by this <varname>backend_type</varname> into buffers in this
+ <varname>io_context</varname>.
+ <varname>read</varname> plus <varname>extended</varname> for
+ <varname>backend_type</varname>s
+
+ <itemizedlist>
+
+ <listitem>
+ <para>
+ <literal>autovacuum launcher</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>autovacuum worker</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>client backend</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>standalone backend</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>background worker</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>walsender</literal>
+ </para>
+ </listitem>
+
+ </itemizedlist>
+
+ for all
+ <varname>io_context</varname>s is similar to the sum of
+ <itemizedlist>
+
+ <listitem>
+ <para>
+ <varname>heap_blks_read</varname>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <varname>idx_blks_read</varname>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <varname>tidx_blks_read</varname>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <varname>toast_blks_read</varname>
+ </para>
+ </listitem>
+
+ </itemizedlist>
+
+ in <link linkend="monitoring-pg-statio-all-tables-view">
+ <structname>pg_statio_all_tables</structname></link> and
+ <varname>blks_read</varname> from <link
+ linkend="monitoring-pg-stat-database-view">
+ <structname>pg_stat_database</structname></link>.
+
+ The difference is that reads done as part of <command>CREATE
+ DATABASE</command> are not counted in
+ <structname>pg_statio_all_tables</structname> and
+ <structname>pg_stat_database</structname>.
+ </para>
+
+ <para>If using the <productname>PostgreSQL</productname> extension,
+ <xref linkend="pgstatstatements"/>,
+ <varname>read</varname> for
+ <varname>backend_type</varname>s
+
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>autovacuum launcher</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>autovacuum worker</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>client backend</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>standalone backend</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>background worker</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>walsender</literal>
+ </para>
+ </listitem>
+ </itemizedlist>
+
+ for all
+ <varname>io_context</varname>s is equivalent to
+ <varname>shared_blks_read</varname> plus
+ <varname>local_blks_read</varname> in <varname>pg_stat_statements</varname>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>written</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Writes of data in this <varname>io_context</varname> written out by this
+ <varname>backend_type</varname>.
+ </para>
+
+ <para>
+ Normal client backends should be able to rely on auxiliary processes
+ like the checkpointer and background writer to write out dirty data as
+ much as possible. Large numbers of writes by
+ <varname>backend_type</varname> <literal>client backend</literal> in
+ <varname>io_context</varname> <literal>buffer pool</literal> and
+ <varname>io_object</varname> <literal>relation</literal> could indicate
+ a misconfiguration of shared buffers or of checkpointer. More
+ information on checkpointer configuration can be found in <xref
+ linkend="wal-configuration"/>.
+ </para>
+
+ <para>Note that the values of <varname>written</varname> for
+ <varname>backend_type</varname> <literal>background writer</literal> and
+ <varname>backend_type</varname> <literal>checkpointer</literal> are
+ equivalent to the values of <varname>buffers_clean</varname> and
+ <varname>buffers_checkpoint</varname>, respectively, in <link
+ linkend="monitoring-pg-stat-bgwriter-view">
+ <structname>pg_stat_bgwriter</structname></link>.
+ </para>
+
+ <para>Also, the sum of
+ <varname>written</varname> plus <varname>extended</varname> in this view
+ for <varname>backend_type</varname>s
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>client backend</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>autovacuum worker</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>background worker</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>walsender</literal>
+ </para>
+ </listitem>
+ </itemizedlist>
+ on
+
+ <itemizedlist>
+ <listitem>
+ <para>
+ <varname>io_object</varname> <literal>relation</literal> in
+ <varname>io_context</varname>s <literal>buffer pool</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <varname>io_object</varname> <literal>relation</literal> in
+ <varname>io_context</varname> <literal>bulkread</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <varname>io_object</varname> <literal>relation</literal> in
+ <varname>io_context</varname> <literal>bulkwrite</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <varname>io_object</varname> <literal>relation</literal> in
+ <varname>io_context</varname> <literal>vacuum</literal>
+ </para>
+ </listitem>
+ </itemizedlist>
+
+ is equivalent to <varname>buffers_backend</varname> in
+ <structname>pg_stat_bgwriter</structname>.
+ </para>
+
+ <para>If using the <productname>PostgreSQL</productname> extension,
+ <xref linkend="pgstatstatements"/>, <varname>written</varname> plus
+ <varname>extended</varname> for
+ <varname>backend_type</varname>s
+
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>autovacuum launcher</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>autovacuum worker</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>client backend</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>standalone backend</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>background worker</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>walsender</literal>
+ </para>
+ </listitem>
+
+ </itemizedlist>
+
+ for all <varname>io_context</varname>s is equivalent to
+ <varname>shared_blks_written</varname> plus
+ <varname>local_blks_written</varname> in
+ <varname>pg_stat_statements</varname>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>extended</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Extends of relations done by this <varname>backend_type</varname> in
+ order to write data in this <varname>io_context</varname>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>op_bytes</structfield> <type>bigint</type>
+ </para>
+ <para>
+ The number of bytes per unit of IO read, written, or extended. For
+ block-oriented IO of relation data, reads, writes, and extends are done
+ in <varname>block_size</varname> units, derived from the build-time
+ parameter <symbol>BLCKSZ</symbol>, which is <literal>8192</literal> by
+ default. Future values could include those derived from
+ <symbol>XLOG_BLCKSZ</symbol>, once WAL IO is tracked in this view, and
+ constant multipliers once non-block-oriented IO (e.g. temporary file IO)
+ is tracked here.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>evicted</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of times a <varname>backend_type</varname> has evicted a block
+ from a shared or local buffer in order to reuse the buffer in this
+ <varname>io_context</varname>. Blocks are only evicted when there are no
+ unoccupied buffers.
+ </para>
+
+ <para>
+ <varname>evicted</varname> in <varname>io_context</varname>
+ <literal>buffer pool</literal> and <varname>io_object</varname>
+ <literal>relation</literal> counts the number of times a block from a
+ shared buffer was evicted so that it can be replaced with another block,
+ also in shared buffers.
+
+ A high <varname>evicted</varname> count in <varname>io_context</varname>
+ <literal>buffer pool</literal> and <varname>io_object</varname>
+ <literal>relation</literal> could indicate that shared buffers is too
+ small and should be set to a larger value.
+ </para>
+ <para>
+ <varname>evicted</varname> in <varname>io_context</varname>
+ <literal>vacuum</literal>, <literal>bulkread</literal>, and
+ <literal>bulkwrite</literal> counts the number of times occupied shared
+ buffers were added to the fixed-size strategy ring buffer, causing the
+ buffer contents to be evicted. If the current buffer in the ring is
+ pinned or in use by another backend, it may be replaced by a new shared
+ buffer. If this shared buffer contains valid data, that block must be
+ evicted and will count as <varname>evicted</varname>.
+
+ Seeing a large number of <varname>evicted</varname> in strategy
+ <varname>io_context</varname>s can provide insight into primary working
+ set cache misses.
+ </para>
+
+ <para>
+ <varname>evicted</varname> in <varname>io_context</varname>
+ <literal>buffer pool</literal> and <varname>io_object</varname>
+ <literal>temp relation</literal> counts the number of times a block of
+ data from an existing local buffer was evicted in order to replace it
+ with another block, also in local buffers.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>reused</structfield> <type>bigint</type>
+ </para>
+ <para>
+ The number of times an existing buffer in the strategy ring was reused
+ as part of an operation in the <literal>bulkread</literal>,
+ <literal>bulkwrite</literal>, or <literal>vacuum</literal>
+ <varname>io_context</varname>s. When a <quote>Buffer Access
+ Strategy</quote> reuses a buffer in the strategy ring, it evicts the
+ buffer contents, incrementing <varname>reused</varname>. When a
+ <quote>Buffer Access Strategy</quote> adds a new shared buffer to the
+ strategy ring and this shared buffer is occupied, the <quote>Buffer
+ Access Strategy</quote> must evict the contents of the shared buffer,
+ incrementing <varname>evicted</varname>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>files_synced</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of files <literal>fsync</literal>ed by this
+ <varname>backend_type</varname> for the purpose of persisting data
+ dirtied in this <varname>io_context</varname>. <literal>fsync</literal>s
+ are done at segment boundaries so <varname>op_bytes</varname>
+ does not apply to the <varname>files_synced</varname> column.
+ <literal>fsync</literal>s done by backends in order to persist data
+ written in <varname>io_context</varname> <literal>vacuum</literal>,
+ <varname>io_context</varname> <literal>bulkread</literal>, or
+ <varname>io_context</varname> <literal>bulkwrite</literal> are counted
+ as <varname>io_context</varname> <literal>buffer pool</literal>
+ <varname>io_object</varname> <literal>relation</literal>
+ <varname>files_synced</varname>.
+ </para>
+
+ <para>
+ Normal client backends should be able to rely on the checkpointer to
+ ensure data is persisted to permanent storage. Large numbers of
+ <varname>files_synced</varname> by <varname>backend_type</varname>
+ <literal>client backend</literal> could indicate a misconfiguration of
+ shared buffers or of checkpointer. More information on checkpointer
+ configuration can be found in <xref linkend="wal-configuration"/>.
+ </para>
+
+ <para>
+ Note that the sum of <varname>files_synced</varname> for all
+ <varname>io_context</varname> <literal>buffer pool</literal>
+ <varname>io_object</varname> <literal>relation</literal> for all
+ <varname>backend_type</varname>s except <literal>checkpointer</literal>
+ is equivalent to <varname>buffers_backend_fsync</varname> in <link
+ linkend="monitoring-pg-stat-bgwriter-view">
+ <structname>pg_stat_bgwriter</structname></link>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
+ </para>
+ <para>
+ Time at which these statistics were last reset.
+ </para></entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
</sect2>
<sect2 id="monitoring-pg-stat-bgwriter-view">
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 2d8104b090..296b3acf6e 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1117,6 +1117,21 @@ CREATE VIEW pg_stat_bgwriter AS
pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+CREATE VIEW pg_stat_io AS
+SELECT
+ b.backend_type,
+ b.io_context,
+ b.io_object,
+ b.read,
+ b.written,
+ b.extended,
+ b.op_bytes,
+ b.evicted,
+ b.reused,
+ b.files_synced,
+ b.stats_reset
+FROM pg_stat_get_io() b;
+
CREATE VIEW pg_stat_wal AS
SELECT
w.wal_records,
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index a135cad0ce..db05b3bf6c 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1731,6 +1731,149 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
}
+/*
+ * When adding a new column to the pg_stat_io view, add a new enum value
+ * here above IO_NUM_COLUMNS.
+ */
+typedef enum io_stat_col
+{
+ IO_COL_BACKEND_TYPE,
+ IO_COL_IO_CONTEXT,
+ IO_COL_IO_OBJECT,
+ IO_COL_READS,
+ IO_COL_WRITES,
+ IO_COL_EXTENDS,
+ IO_COL_CONVERSION,
+ IO_COL_EVICTIONS,
+ IO_COL_REUSES,
+ IO_COL_FSYNCS,
+ IO_COL_RESET_TIME,
+ IO_NUM_COLUMNS,
+} io_stat_col;
+
+/*
+ * When adding a new IOOp, add a new io_stat_col and add a case to this
+ * function returning the corresponding io_stat_col.
+ */
+static io_stat_col
+pgstat_io_op_get_index(IOOp io_op)
+{
+ switch (io_op)
+ {
+ case IOOP_EVICT:
+ return IO_COL_EVICTIONS;
+ case IOOP_READ:
+ return IO_COL_READS;
+ case IOOP_REUSE:
+ return IO_COL_REUSES;
+ case IOOP_WRITE:
+ return IO_COL_WRITES;
+ case IOOP_EXTEND:
+ return IO_COL_EXTENDS;
+ case IOOP_FSYNC:
+ return IO_COL_FSYNCS;
+ }
+
+ elog(ERROR, "unrecognized IOOp value: %d", io_op);
+
+ pg_unreachable();
+}
+
+Datum
+pg_stat_get_io(PG_FUNCTION_ARGS)
+{
+ PgStat_BackendIOContextOps *backends_io_stats;
+ ReturnSetInfo *rsinfo;
+ Datum reset_time;
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+
+ backends_io_stats = pgstat_fetch_backend_io_context_ops();
+
+ reset_time = TimestampTzGetDatum(backends_io_stats->stat_reset_timestamp);
+
+ for (BackendType bktype = B_INVALID; bktype < BACKEND_NUM_TYPES; bktype++)
+ {
+ Datum bktype_desc = CStringGetTextDatum(GetBackendTypeDesc(bktype));
+ bool expect_backend_stats = true;
+ PgStat_IOContextOps *io_context_ops = &backends_io_stats->stats[bktype];
+
+ /*
+ * For those BackendTypes without IO Operation stats, skip representing
+ * them in the view altogether. We still loop through their counters so
+ * that we can assert that all values are zero.
+ */
+ expect_backend_stats = pgstat_io_op_stats_collected(bktype);
+
+ for (IOContext io_context = IOCONTEXT_BULKREAD;
+ io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ {
+ const char *io_context_str = pgstat_io_context_desc(io_context);
+ PgStat_IOObjectOps *io_objs = &io_context_ops->data[io_context];
+
+ for (IOObject io_object = IOOBJECT_RELATION;
+ io_object < IOOBJECT_NUM_TYPES; io_object++)
+ {
+ PgStat_IOOpCounters *counters = &io_objs->data[io_object];
+ const char *io_obj_str = pgstat_io_object_desc(io_object);
+
+ Datum values[IO_NUM_COLUMNS] = {0};
+ bool nulls[IO_NUM_COLUMNS] = {0};
+
+ /*
+ * Some combinations of IOContext, IOObject, and BackendType are
+ * not valid for any type of IOOp. In such cases, omit the
+ * entire row from the view.
+ */
+ if (!expect_backend_stats ||
+ !pgstat_bktype_io_context_io_object_valid(bktype,
+ io_context, io_object))
+ {
+ pgstat_io_context_ops_assert_zero(counters);
+ continue;
+ }
+
+ values[IO_COL_BACKEND_TYPE] = bktype_desc;
+ values[IO_COL_IO_CONTEXT] = CStringGetTextDatum(io_context_str);
+ values[IO_COL_IO_OBJECT] = CStringGetTextDatum(io_obj_str);
+ values[IO_COL_READS] = Int64GetDatum(counters->reads);
+ values[IO_COL_WRITES] = Int64GetDatum(counters->writes);
+ values[IO_COL_EXTENDS] = Int64GetDatum(counters->extends);
+ /*
+ * Hard-code this to blocks until we have non-block-oriented IO
+ * represented in the view as well
+ */
+ values[IO_COL_CONVERSION] = Int64GetDatum(BLCKSZ);
+ values[IO_COL_EVICTIONS] = Int64GetDatum(counters->evictions);
+ values[IO_COL_REUSES] = Int64GetDatum(counters->reuses);
+ values[IO_COL_FSYNCS] = Int64GetDatum(counters->fsyncs);
+ values[IO_COL_RESET_TIME] = TimestampTzGetDatum(reset_time);
+
+ /*
+ * Some combinations of BackendType and IOOp, of IOContext and
+ * IOOp, and of IOObject and IOOp are not valid. Set these cells
+ * in the view NULL and assert that these stats are zero as
+ * expected.
+ */
+ for (IOOp io_op = IOOP_EVICT; io_op < IOOP_NUM_TYPES; io_op++)
+ {
+ if (!pgstat_io_op_valid(bktype, io_context, io_object,
+ io_op))
+ {
+ pgstat_io_op_assert_zero(counters, io_op);
+ nulls[pgstat_io_op_get_index(io_op)] = true;
+ }
+ }
+
+ tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
+ }
+ }
+ }
+
+ return (Datum) 0;
+}
+
/*
* Returns statistics of WAL activity
*/
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index f9301b2627..1416fa27d3 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5679,6 +5679,15 @@
proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
+{ oid => '8459', descr => 'statistics: per backend type IO statistics',
+ proname => 'pg_stat_get_io', provolatile => 'v',
+ prorows => '30', proretset => 't',
+ proparallel => 'r', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,text,text,int8,int8,int8,int8,int8,int8,int8,timestamptz}',
+ proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}',
+ proargnames => '{backend_type,io_context,io_object,read,written,extended,op_bytes,evicted,reused,files_synced,stats_reset}',
+ prosrc => 'pg_stat_get_io' },
+
{ oid => '1136', descr => 'statistics: information about WAL activity',
proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
proparallel => 'r', prorettype => 'record', proargtypes => '',
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 37c1c86473..5960d289a0 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1876,6 +1876,18 @@ pg_stat_gssapi| SELECT s.pid,
s.gss_enc AS encrypted
FROM pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid, query_id)
WHERE (s.client_port IS NOT NULL);
+pg_stat_io| SELECT b.backend_type,
+ b.io_context,
+ b.io_object,
+ b.read,
+ b.written,
+ b.extended,
+ b.op_bytes,
+ b.evicted,
+ b.reused,
+ b.files_synced,
+ b.stats_reset
+ FROM pg_stat_get_io() b(backend_type, io_context, io_object, read, written, extended, op_bytes, evicted, reused, files_synced, stats_reset);
pg_stat_progress_analyze| SELECT s.pid,
s.datid,
d.datname,
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index 1d84407a03..eed0017518 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -1126,4 +1126,229 @@ SELECT pg_stat_get_subscription_stats(NULL);
(1 row)
+-- Test that the following operations are tracked in pg_stat_io:
+-- - reads of target blocks into shared buffers
+-- - writes of shared buffers to permanent storage
+-- - extends of relations using shared buffers
+-- - fsyncs done to ensure the durability of data dirtying shared buffers
+-- There is no test for blocks evicted from shared buffers, because we cannot
+-- be sure of the state of shared buffers at the point the test is run.
+-- Create a regular table and insert some data to generate IOCONTEXT_BUFFER_POOL
+-- extends.
+SELECT sum(extended) AS io_sum_shared_extends_before
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'relation' \gset
+CREATE TABLE test_io_shared(a int);
+INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(extended) AS io_sum_shared_extends_after
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'relation' \gset
+SELECT :io_sum_shared_extends_after > :io_sum_shared_extends_before;
+ ?column?
+----------
+ t
+(1 row)
+
+-- After a checkpoint, there should be some additional IOCONTEXT_BUFFER_POOL writes
+-- and fsyncs.
+-- The second checkpoint ensures that stats from the first checkpoint have been
+-- reported and protects against any potential races amongst the table
+-- creation, a possible timing-triggered checkpoint, and the explicit
+-- checkpoint in the test.
+SELECT sum(written) AS io_sum_shared_writes_before
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'relation' \gset
+SELECT sum(files_synced) AS io_sum_shared_fsyncs_before
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'relation' \gset
+CHECKPOINT;
+CHECKPOINT;
+SELECT sum(written) AS io_sum_shared_writes_after
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'relation' \gset
+SELECT sum(files_synced) AS io_sum_shared_fsyncs_after
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'relation' \gset
+SELECT :io_sum_shared_writes_after > :io_sum_shared_writes_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT current_setting('fsync') = 'off' OR :io_sum_shared_fsyncs_after > :io_sum_shared_fsyncs_before;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Change the tablespace so that the table is rewritten directly, then SELECT
+-- from it to cause it to be read back into shared buffers.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE test_io_shared_stats_tblspc LOCATION '';
+SELECT sum(read) AS io_sum_shared_reads_before
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'relation' \gset
+ALTER TABLE test_io_shared SET TABLESPACE test_io_shared_stats_tblspc;
+-- SELECT from the table so that it is read into shared buffers and io_context
+-- 'buffer pool', io_object 'relation' reads are counted.
+SELECT COUNT(*) FROM test_io_shared;
+ count
+-------
+ 100
+(1 row)
+
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(read) AS io_sum_shared_reads_after
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'relation' \gset
+SELECT :io_sum_shared_reads_after > :io_sum_shared_reads_before;
+ ?column?
+----------
+ t
+(1 row)
+
+DROP TABLE test_io_shared;
+DROP TABLESPACE test_io_shared_stats_tblspc;
+-- Test that the following IOCONTEXT_LOCAL IOOps are tracked in pg_stat_io:
+-- - eviction of local buffers in order to reuse them
+-- - reads of temporary table blocks into local buffers
+-- - writes of local buffers to permanent storage
+-- - extends of temporary tables
+-- Set temp_buffers to a low value so that we can trigger writes with fewer
+-- inserted tuples. Do so in a new session in case temporary tables have been
+-- accessed by previous tests in this session.
+\c
+SET temp_buffers TO '1MB';
+CREATE TEMPORARY TABLE test_io_local(a int, b TEXT);
+SELECT sum(extended) AS io_sum_local_extends_before
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'temp relation' \gset
+SELECT sum(evicted) AS io_sum_local_evictions_before
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'temp relation' \gset
+SELECT sum(written) AS io_sum_local_writes_before
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'temp relation' \gset
+-- Insert tuples into the temporary table, generating extends in the stats.
+-- Insert enough values that we need to reuse and write out dirty local
+-- buffers, generating evictions and writes.
+INSERT INTO test_io_local SELECT generate_series(1, 8000) as id, repeat('a', 100);
+SELECT sum(read) AS io_sum_local_reads_before
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'temp relation' \gset
+-- Read in evicted buffers, generating reads.
+SELECT COUNT(*) FROM test_io_local;
+ count
+-------
+ 8000
+(1 row)
+
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(evicted) AS io_sum_local_evictions_after
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'temp relation' \gset
+SELECT sum(read) AS io_sum_local_reads_after
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'temp relation' \gset
+SELECT sum(written) AS io_sum_local_writes_after
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'temp relation' \gset
+SELECT sum(extended) AS io_sum_local_extends_after
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'temp relation' \gset
+SELECT :io_sum_local_evictions_after > :io_sum_local_evictions_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_local_reads_after > :io_sum_local_reads_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_local_writes_after > :io_sum_local_writes_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_local_extends_after > :io_sum_local_extends_before;
+ ?column?
+----------
+ t
+(1 row)
+
+RESET temp_buffers;
+-- Test that reuse of strategy buffers and reads of blocks into these reused
+-- buffers while VACUUMing are tracked in pg_stat_io.
+-- Set wal_skip_threshold smaller than the expected size of
+-- test_io_vac_strategy so that, even if wal_level is minimal, VACUUM FULL will
+-- fsync the newly rewritten test_io_vac_strategy instead of writing it to WAL.
+-- Writing it to WAL will result in the newly written relation pages being in
+-- shared buffers -- preventing us from testing BAS_VACUUM BufferAccessStrategy
+-- reads.
+SET wal_skip_threshold = '1 kB';
+SELECT sum(reused) AS io_sum_vac_strategy_reuses_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(read) AS io_sum_vac_strategy_reads_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+CREATE TABLE test_io_vac_strategy(a int, b int) WITH (autovacuum_enabled = 'false');
+INSERT INTO test_io_vac_strategy SELECT i, i from generate_series(1, 8000)i;
+-- Ensure that the next VACUUM will need to perform IO by rewriting the table
+-- first with VACUUM (FULL).
+VACUUM (FULL) test_io_vac_strategy;
+VACUUM (PARALLEL 0) test_io_vac_strategy;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(reused) AS io_sum_vac_strategy_reuses_after FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(read) AS io_sum_vac_strategy_reads_after FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT :io_sum_vac_strategy_reads_after > :io_sum_vac_strategy_reads_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_vac_strategy_reuses_after > :io_sum_vac_strategy_reuses_before;
+ ?column?
+----------
+ t
+(1 row)
+
+RESET wal_skip_threshold;
+-- Test that extends done by a CTAS, which uses a BAS_BULKWRITE
+-- BufferAccessStrategy, are tracked in pg_stat_io.
+SELECT sum(extended) AS io_sum_bulkwrite_strategy_extends_before FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+CREATE TABLE test_io_bulkwrite_strategy AS SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(extended) AS io_sum_bulkwrite_strategy_extends_after FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+SELECT :io_sum_bulkwrite_strategy_extends_after > :io_sum_bulkwrite_strategy_extends_before;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Test IO stats reset
+SELECT sum(evicted) + sum(reused) + sum(extended) + sum(files_synced) + sum(read) + sum(written) AS io_stats_pre_reset FROM pg_stat_io \gset
+SELECT pg_stat_reset_shared('io');
+ pg_stat_reset_shared
+----------------------
+
+(1 row)
+
+SELECT sum(evicted) + sum(reused) + sum(extended) + sum(files_synced) + sum(read) + sum(written) AS io_stats_post_reset FROM pg_stat_io \gset
+SELECT :io_stats_post_reset < :io_stats_pre_reset;
+ ?column?
+----------
+ t
+(1 row)
+
-- End of Stats Test
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index b4d6753c71..7e0437d928 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -536,4 +536,142 @@ SELECT pg_stat_get_replication_slot(NULL);
SELECT pg_stat_get_subscription_stats(NULL);
+-- Test that the following operations are tracked in pg_stat_io:
+-- - reads of target blocks into shared buffers
+-- - writes of shared buffers to permanent storage
+-- - extends of relations using shared buffers
+-- - fsyncs done to ensure the durability of data dirtying shared buffers
+
+-- There is no test for blocks evicted from shared buffers, because we cannot
+-- be sure of the state of shared buffers at the point the test is run.
+
+-- Create a regular table and insert some data to generate IOCONTEXT_BUFFER_POOL
+-- extends.
+SELECT sum(extended) AS io_sum_shared_extends_before
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'relation' \gset
+CREATE TABLE test_io_shared(a int);
+INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+SELECT sum(extended) AS io_sum_shared_extends_after
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'relation' \gset
+SELECT :io_sum_shared_extends_after > :io_sum_shared_extends_before;
+
+-- After a checkpoint, there should be some additional IOCONTEXT_BUFFER_POOL writes
+-- and fsyncs.
+-- The second checkpoint ensures that stats from the first checkpoint have been
+-- reported and protects against any potential races amongst the table
+-- creation, a possible timing-triggered checkpoint, and the explicit
+-- checkpoint in the test.
+SELECT sum(written) AS io_sum_shared_writes_before
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'relation' \gset
+SELECT sum(files_synced) AS io_sum_shared_fsyncs_before
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'relation' \gset
+CHECKPOINT;
+CHECKPOINT;
+SELECT sum(written) AS io_sum_shared_writes_after
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'relation' \gset
+SELECT sum(files_synced) AS io_sum_shared_fsyncs_after
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'relation' \gset
+
+SELECT :io_sum_shared_writes_after > :io_sum_shared_writes_before;
+SELECT current_setting('fsync') = 'off' OR :io_sum_shared_fsyncs_after > :io_sum_shared_fsyncs_before;
+
+-- Change the tablespace so that the table is rewritten directly, then SELECT
+-- from it to cause it to be read back into shared buffers.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE test_io_shared_stats_tblspc LOCATION '';
+SELECT sum(read) AS io_sum_shared_reads_before
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'relation' \gset
+ALTER TABLE test_io_shared SET TABLESPACE test_io_shared_stats_tblspc;
+-- SELECT from the table so that it is read into shared buffers and io_context
+-- 'buffer pool', io_object 'relation' reads are counted.
+SELECT COUNT(*) FROM test_io_shared;
+SELECT pg_stat_force_next_flush();
+SELECT sum(read) AS io_sum_shared_reads_after
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'relation' \gset
+SELECT :io_sum_shared_reads_after > :io_sum_shared_reads_before;
+DROP TABLE test_io_shared;
+DROP TABLESPACE test_io_shared_stats_tblspc;
+
+-- Test that the following IOCONTEXT_LOCAL IOOps are tracked in pg_stat_io:
+-- - eviction of local buffers in order to reuse them
+-- - reads of temporary table blocks into local buffers
+-- - writes of local buffers to permanent storage
+-- - extends of temporary tables
+
+-- Set temp_buffers to a low value so that we can trigger writes with fewer
+-- inserted tuples. Do so in a new session in case temporary tables have been
+-- accessed by previous tests in this session.
+\c
+SET temp_buffers TO '1MB';
+CREATE TEMPORARY TABLE test_io_local(a int, b TEXT);
+SELECT sum(extended) AS io_sum_local_extends_before
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'temp relation' \gset
+SELECT sum(evicted) AS io_sum_local_evictions_before
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'temp relation' \gset
+SELECT sum(written) AS io_sum_local_writes_before
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'temp relation' \gset
+-- Insert tuples into the temporary table, generating extends in the stats.
+-- Insert enough values that we need to reuse and write out dirty local
+-- buffers, generating evictions and writes.
+INSERT INTO test_io_local SELECT generate_series(1, 8000) as id, repeat('a', 100);
+
+SELECT sum(read) AS io_sum_local_reads_before
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'temp relation' \gset
+-- Read in evicted buffers, generating reads.
+SELECT COUNT(*) FROM test_io_local;
+SELECT pg_stat_force_next_flush();
+SELECT sum(evicted) AS io_sum_local_evictions_after
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'temp relation' \gset
+SELECT sum(read) AS io_sum_local_reads_after
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'temp relation' \gset
+SELECT sum(written) AS io_sum_local_writes_after
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'temp relation' \gset
+SELECT sum(extended) AS io_sum_local_extends_after
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'temp relation' \gset
+SELECT :io_sum_local_evictions_after > :io_sum_local_evictions_before;
+SELECT :io_sum_local_reads_after > :io_sum_local_reads_before;
+SELECT :io_sum_local_writes_after > :io_sum_local_writes_before;
+SELECT :io_sum_local_extends_after > :io_sum_local_extends_before;
+RESET temp_buffers;
+
+-- Test that reuse of strategy buffers and reads of blocks into these reused
+-- buffers while VACUUMing are tracked in pg_stat_io.
+
+-- Set wal_skip_threshold smaller than the expected size of
+-- test_io_vac_strategy so that, even if wal_level is minimal, VACUUM FULL will
+-- fsync the newly rewritten test_io_vac_strategy instead of writing it to WAL.
+-- Writing it to WAL will result in the newly written relation pages being in
+-- shared buffers -- preventing us from testing BAS_VACUUM BufferAccessStrategy
+-- reads.
+SET wal_skip_threshold = '1 kB';
+SELECT sum(reused) AS io_sum_vac_strategy_reuses_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(read) AS io_sum_vac_strategy_reads_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+CREATE TABLE test_io_vac_strategy(a int, b int) WITH (autovacuum_enabled = 'false');
+INSERT INTO test_io_vac_strategy SELECT i, i from generate_series(1, 8000)i;
+-- Ensure that the next VACUUM will need to perform IO by rewriting the table
+-- first with VACUUM (FULL).
+VACUUM (FULL) test_io_vac_strategy;
+VACUUM (PARALLEL 0) test_io_vac_strategy;
+SELECT pg_stat_force_next_flush();
+SELECT sum(reused) AS io_sum_vac_strategy_reuses_after FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(read) AS io_sum_vac_strategy_reads_after FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT :io_sum_vac_strategy_reads_after > :io_sum_vac_strategy_reads_before;
+SELECT :io_sum_vac_strategy_reuses_after > :io_sum_vac_strategy_reuses_before;
+RESET wal_skip_threshold;
+
+-- Test that extends done by a CTAS, which uses a BAS_BULKWRITE
+-- BufferAccessStrategy, are tracked in pg_stat_io.
+SELECT sum(extended) AS io_sum_bulkwrite_strategy_extends_before FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+CREATE TABLE test_io_bulkwrite_strategy AS SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+SELECT sum(extended) AS io_sum_bulkwrite_strategy_extends_after FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+SELECT :io_sum_bulkwrite_strategy_extends_after > :io_sum_bulkwrite_strategy_extends_before;
+
+-- Test IO stats reset
+SELECT sum(evicted) + sum(reused) + sum(extended) + sum(files_synced) + sum(read) + sum(written) AS io_stats_pre_reset FROM pg_stat_io \gset
+SELECT pg_stat_reset_shared('io');
+SELECT sum(evicted) + sum(reused) + sum(extended) + sum(files_synced) + sum(read) + sum(written) AS io_stats_post_reset FROM pg_stat_io \gset
+SELECT :io_stats_post_reset < :io_stats_pre_reset;
+
-- End of Stats Test
--
2.38.1
Attachment: v38-0002-Track-IO-operation-statistics-locally.patch (application/octet-stream)
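(Not part of the patch series — an example of how the view added by the preceding patch could be queried, using the column and io_context names it defines; the grouping here is just one plausible way to slice the stats:)

```sql
-- Per-backend-type IO on permanent relations through shared buffers,
-- roughly the breakdown pg_stat_bgwriter.buffers_backend conflates today.
SELECT backend_type,
       sum(written)      AS writes,
       sum(extended)     AS extends,
       sum(files_synced) AS fsyncs
  FROM pg_stat_io
 WHERE io_context = 'buffer pool'
   AND io_object = 'relation'
 GROUP BY backend_type
 ORDER BY backend_type;
```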
From d73145de3a99f9191a20bb67a2b3565d56145517 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Fri, 25 Nov 2022 15:23:18 -0500
Subject: [PATCH v38 2/4] Track IO operation statistics locally
Introduce "IOOp", an IO operation done by a backend, "IOObject", the
target object of the IO, and "IOContext", the context or location of the
IO operations on that object. For example, the checkpointer may write a
shared buffer out. This would be counted as an IOOp "written" on an
IOObject IOOBJECT_RELATION in IOContext IOCONTEXT_BUFFER_POOL by
BackendType "checkpointer".
Each IOOp (evict, reuse, read, write, extend, and fsync) is counted per
IOObject (relation, temp relation) per IOContext (bulkread, bulkwrite,
buffer pool, or vacuum) through a call to pgstat_count_io_op().
The primary concern of these statistics is IO operations on data blocks
during the course of normal database operations. IO operations done by,
for example, the archiver or syslogger are not counted in these
statistics. WAL IO, temporary file IO, and IO done directly through smgr*
functions (such as when building an index) are not yet counted but would
be useful future additions.
IOContext IOCONTEXT_BUFFER_POOL concerns operations on local and shared
buffers.
The IOCONTEXT_BULKREAD, IOCONTEXT_BULKWRITE, and IOCONTEXT_VACUUM
IOContexts concern IO operations on buffers as part of a
BufferAccessStrategy.
IOOP_EVICT IOOps are counted in IOCONTEXT_BUFFER_POOL when a buffer is
acquired or allocated through [Local]BufferAlloc() and no
BufferAccessStrategy is in use.
When a BufferAccessStrategy is in use, shared buffers added to the
strategy ring are counted as IOOP_EVICT IOOps in the
IOCONTEXT_[BULKREAD|BULKWRITE|VACUUM] IOContext. When one of these
buffers is reused, it is counted as an IOOP_REUSE IOOp in the
corresponding strategy IOContext.
IOOP_WRITE IOOps are counted in the BufferAccessStrategy IOContexts
whenever the reused dirty buffer is written out.
Stats on IOOps in all IOContexts for a given backend are counted in a
backend's local memory. A subsequent commit will expose functions for
aggregating and viewing these stats.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Maciek Sakrejda <m.sakrejda@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
src/backend/postmaster/checkpointer.c | 13 +
src/backend/storage/buffer/bufmgr.c | 92 ++++++-
src/backend/storage/buffer/freelist.c | 36 ++-
src/backend/storage/buffer/localbuf.c | 4 +
src/backend/storage/sync/sync.c | 2 +
src/backend/utils/activity/Makefile | 1 +
src/backend/utils/activity/meson.build | 1 +
src/backend/utils/activity/pgstat_io_ops.c | 272 +++++++++++++++++++++
src/include/pgstat.h | 80 ++++++
src/include/storage/bufmgr.h | 7 +-
src/tools/pgindent/typedefs.list | 6 +
11 files changed, 503 insertions(+), 11 deletions(-)
create mode 100644 src/backend/utils/activity/pgstat_io_ops.c
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 5fc076fc14..04a8f89637 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -1116,6 +1116,19 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
if (!AmBackgroundWriterProcess())
CheckpointerShmem->num_backend_fsync++;
LWLockRelease(CheckpointerCommLock);
+
+ /*
+ * We have no way of knowing if the current IOContext is
+ * IOCONTEXT_BUFFER_POOL or IOCONTEXT_[BULKREAD, BULKWRITE, VACUUM] at this
+ * point, so count the fsync as being in the IOCONTEXT_BUFFER_POOL
+ * IOContext. This is probably okay, because the number of backend
+ * fsyncs doesn't say anything about the efficacy of the
+ * BufferAccessStrategy. And counting both fsyncs done in
+ * IOCONTEXT_BUFFER_POOL and IOCONTEXT_[BULKREAD, BULKWRITE, VACUUM] under
+ * IOCONTEXT_BUFFER_POOL is likely clearer when investigating the number of
+ * backend fsyncs.
+ */
+ pgstat_count_io_op(IOOP_FSYNC, IOOBJECT_RELATION, IOCONTEXT_BUFFER_POOL);
return false;
}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index fa32f24e19..02b76e7a83 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -482,7 +482,8 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
BlockNumber blockNum,
BufferAccessStrategy strategy,
bool *foundPtr);
-static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+static void FlushBuffer(BufferDesc *buf, SMgrRelation reln,
+ IOContext io_context, IOObject io_object);
static void FindAndDropRelationBuffers(RelFileLocator rlocator,
ForkNumber forkNum,
BlockNumber nForkBlock,
@@ -823,6 +824,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
BufferDesc *bufHdr;
Block bufBlock;
bool found;
+ IOContext io_context;
+ IOObject io_object;
bool isExtend;
bool isLocalBuf = SmgrIsTemp(smgr);
@@ -986,10 +989,27 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
*/
Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID)); /* spinlock not needed */
- bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
+ if (isLocalBuf)
+ {
+ bufBlock = LocalBufHdrGetBlock(bufHdr);
+ /*
+ * Though a strategy object may be passed in, no strategy is employed
+ * when using local buffers. This could happen when doing, for example,
+ * CREATE TEMPORARY TABLE AS ...
+ */
+ io_context = IOCONTEXT_BUFFER_POOL;
+ io_object = IOOBJECT_TEMP_RELATION;
+ }
+ else
+ {
+ bufBlock = BufHdrGetBlock(bufHdr);
+ io_context = IOContextForStrategy(strategy);
+ io_object = IOOBJECT_RELATION;
+ }
if (isExtend)
{
+ pgstat_count_io_op(IOOP_EXTEND, io_object, io_context);
/* new buffers are zero-filled */
MemSet((char *) bufBlock, 0, BLCKSZ);
/* don't set checksum for all-zero page */
@@ -1015,6 +1035,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
instr_time io_start,
io_time;
+ pgstat_count_io_op(IOOP_READ, io_object, io_context);
+
if (track_io_timing)
INSTR_TIME_SET_CURRENT(io_start);
@@ -1122,6 +1144,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
bool *foundPtr)
{
bool from_ring;
+ IOContext io_context;
BufferTag newTag; /* identity of requested block */
uint32 newHash; /* hash value for newTag */
LWLock *newPartitionLock; /* buffer partition lock for it */
@@ -1188,9 +1211,12 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
*/
LWLockRelease(newPartitionLock);
+ io_context = IOContextForStrategy(strategy);
+
/* Loop here in case we have to try another victim buffer */
for (;;)
{
+
/*
* Ensure, while the spinlock's not yet held, that there's a free
* refcount entry.
@@ -1264,13 +1290,35 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
}
}
+ /*
+ * When a strategy is in use, only flushes of dirty buffers
+ * already in the strategy ring are counted as strategy writes
+ * (IOCONTEXT [BULKREAD|BULKWRITE|VACUUM] IOOP_WRITE) for the
+ * purpose of IO operation statistics tracking.
+ *
+ * If a shared buffer initially added to the ring must be
+ * flushed before being used, this is counted as an
+ * IOCONTEXT_BUFFER_POOL IOOP_WRITE.
+ *
+ * If a shared buffer which was added to the ring later because
+ * the current strategy buffer is pinned or in use or because
+ * all strategy buffers were dirty and rejected (for
+ * BAS_BULKREAD operations only) requires flushing, this is
+ * counted as an IOCONTEXT_BUFFER_POOL IOOP_WRITE (from_ring
+ * will be false).
+ *
+ * When a strategy is not in use, the write can only be a
+ * "regular" write of a dirty shared buffer (IOCONTEXT_BUFFER_POOL
+ * IOOP_WRITE).
+ */
+
/* OK, do the I/O */
TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(forkNum, blockNum,
smgr->smgr_rlocator.locator.spcOid,
smgr->smgr_rlocator.locator.dbOid,
smgr->smgr_rlocator.locator.relNumber);
- FlushBuffer(buf, NULL);
+ FlushBuffer(buf, NULL, io_context, IOOBJECT_RELATION);
LWLockRelease(BufferDescriptorGetContentLock(buf));
ScheduleBufferTagForWriteback(&BackendWritebackContext,
@@ -1442,6 +1490,28 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
UnlockBufHdr(buf, buf_state);
+ if (oldFlags & BM_VALID)
+ {
+ /*
+ * When a BufferAccessStrategy is in use, blocks evicted from shared
+ * buffers are counted as IOOP_EVICT IO Operations in the corresponding
+ * context (e.g. IOCONTEXT_BULKWRITE).
+ * Shared buffers are evicted by a strategy in two cases:
+ * - while initially claiming buffers for the strategy ring
+ * - to replace an existing strategy ring buffer because it is pinned or
+ * in use and cannot be reused
+ * Blocks evicted from buffers already in the strategy ring are counted
+ * as IOOP_REUSE IO Operations in the corresponding strategy context.
+ *
+ * At this point, we can accurately count evictions and reuses, because
+ * we have successfully claimed the valid buffer. Previously, we may
+ * have been forced to release the buffer due to concurrent pinners or
+ * erroring out.
+ */
+ pgstat_count_io_op(from_ring ? IOOP_REUSE : IOOP_EVICT,
+ IOOBJECT_RELATION, io_context);
+ }
+
if (oldPartitionLock != NULL)
{
BufTableDelete(&oldTag, oldHash);
@@ -2571,7 +2641,7 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOCONTEXT_BUFFER_POOL, IOOBJECT_RELATION);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
@@ -2821,7 +2891,7 @@ BufferGetTag(Buffer buffer, RelFileLocator *rlocator, ForkNumber *forknum,
* as the second parameter. If not, pass NULL.
*/
static void
-FlushBuffer(BufferDesc *buf, SMgrRelation reln)
+FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOContext io_context, IOObject io_object)
{
XLogRecPtr recptr;
ErrorContextCallback errcallback;
@@ -2901,6 +2971,8 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
*/
bufToWrite = PageSetChecksumCopy((Page) bufBlock, buf->tag.blockNum);
+ pgstat_count_io_op(IOOP_WRITE, IOOBJECT_RELATION, io_context);
+
if (track_io_timing)
INSTR_TIME_SET_CURRENT(io_start);
@@ -3552,6 +3624,8 @@ FlushRelationBuffers(Relation rel)
localpage,
false);
+ pgstat_count_io_op(IOOP_WRITE, IOOBJECT_TEMP_RELATION, IOCONTEXT_BUFFER_POOL);
+
buf_state &= ~(BM_DIRTY | BM_JUST_DIRTIED);
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
@@ -3587,7 +3661,7 @@ FlushRelationBuffers(Relation rel)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, RelationGetSmgr(rel));
+ FlushBuffer(bufHdr, RelationGetSmgr(rel), IOCONTEXT_BUFFER_POOL, IOOBJECT_RELATION);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr);
}
@@ -3685,7 +3759,7 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, srelent->srel);
+ FlushBuffer(bufHdr, srelent->srel, IOCONTEXT_BUFFER_POOL, IOOBJECT_RELATION);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr);
}
@@ -3895,7 +3969,7 @@ FlushDatabaseBuffers(Oid dbid)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOCONTEXT_BUFFER_POOL, IOOBJECT_RELATION);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr);
}
@@ -3922,7 +3996,7 @@ FlushOneBuffer(Buffer buffer)
Assert(LWLockHeldByMe(BufferDescriptorGetContentLock(bufHdr)));
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOCONTEXT_BUFFER_POOL, IOOBJECT_RELATION);
}
/*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 5299bb8711..263454cae9 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -15,6 +15,7 @@
*/
#include "postgres.h"
+#include "pgstat.h"
#include "port/atomics.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
@@ -601,7 +602,7 @@ FreeAccessStrategy(BufferAccessStrategy strategy)
/*
* GetBufferFromRing -- returns a buffer from the ring, or NULL if the
- * ring is empty.
+ * ring is empty or the buffer cannot be used.
*
* The bufhdr spin lock is held on the returned buffer.
*/
@@ -664,6 +665,39 @@ AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
strategy->buffers[strategy->current] = BufferDescriptorGetBuffer(buf);
}
+/*
+ * Utility function returning the IOContext of a given BufferAccessStrategy's
+ * strategy ring.
+ */
+IOContext
+IOContextForStrategy(BufferAccessStrategy strategy)
+{
+ if (!strategy)
+ return IOCONTEXT_BUFFER_POOL;
+
+ switch (strategy->btype)
+ {
+ case BAS_NORMAL:
+ /*
+ * Currently, GetAccessStrategy() returns NULL for
+ * BufferAccessStrategyType BAS_NORMAL, so this case is
+ * unreachable.
+ */
+ pg_unreachable();
+ return IOCONTEXT_BUFFER_POOL;
+ case BAS_BULKREAD:
+ return IOCONTEXT_BULKREAD;
+ case BAS_BULKWRITE:
+ return IOCONTEXT_BULKWRITE;
+ case BAS_VACUUM:
+ return IOCONTEXT_VACUUM;
+ }
+
+ elog(ERROR, "unrecognized BufferAccessStrategyType: %d", strategy->btype);
+
+ pg_unreachable();
+}
+
/*
* StrategyRejectBuffer -- consider rejecting a dirty buffer
*
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 30d67d1c40..5cfb531bb2 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -18,6 +18,7 @@
#include "access/parallel.h"
#include "catalog/catalog.h"
#include "executor/instrument.h"
+#include "pgstat.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "utils/guc_hooks.h"
@@ -226,6 +227,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
localpage,
false);
+ pgstat_count_io_op(IOOP_WRITE, IOOBJECT_TEMP_RELATION, IOCONTEXT_BUFFER_POOL);
+
/* Mark not-dirty now in case we error out below */
buf_state &= ~BM_DIRTY;
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
@@ -256,6 +259,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
ClearBufferTag(&bufHdr->tag);
buf_state &= ~(BM_VALID | BM_TAG_VALID);
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+ pgstat_count_io_op(IOOP_EVICT, IOOBJECT_TEMP_RELATION, IOCONTEXT_BUFFER_POOL);
}
hresult = (LocalBufferLookupEnt *)
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 9d6a9e9109..a1bb1cef54 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -432,6 +432,8 @@ ProcessSyncRequests(void)
total_elapsed += elapsed;
processed++;
+ pgstat_count_io_op(IOOP_FSYNC, IOOBJECT_RELATION, IOCONTEXT_BUFFER_POOL);
+
if (log_checkpoints)
elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f ms",
processed,
diff --git a/src/backend/utils/activity/Makefile b/src/backend/utils/activity/Makefile
index a2e8507fd6..0098785089 100644
--- a/src/backend/utils/activity/Makefile
+++ b/src/backend/utils/activity/Makefile
@@ -22,6 +22,7 @@ OBJS = \
pgstat_checkpointer.o \
pgstat_database.o \
pgstat_function.o \
+ pgstat_io_ops.o \
pgstat_relation.o \
pgstat_replslot.o \
pgstat_shmem.o \
diff --git a/src/backend/utils/activity/meson.build b/src/backend/utils/activity/meson.build
index 5b3b558a67..1038324c32 100644
--- a/src/backend/utils/activity/meson.build
+++ b/src/backend/utils/activity/meson.build
@@ -7,6 +7,7 @@ backend_sources += files(
'pgstat_checkpointer.c',
'pgstat_database.c',
'pgstat_function.c',
+ 'pgstat_io_ops.c',
'pgstat_relation.c',
'pgstat_replslot.c',
'pgstat_shmem.c',
diff --git a/src/backend/utils/activity/pgstat_io_ops.c b/src/backend/utils/activity/pgstat_io_ops.c
new file mode 100644
index 0000000000..015c65ea3a
--- /dev/null
+++ b/src/backend/utils/activity/pgstat_io_ops.c
@@ -0,0 +1,272 @@
+/* -------------------------------------------------------------------------
+ *
+ * pgstat_io_ops.c
+ * Implementation of IO operation statistics.
+ *
+ * This file contains the implementation of IO operation statistics. It is kept
+ * separate from pgstat.c to enforce the line between the statistics access /
+ * storage implementation and the details about individual types of
+ * statistics.
+ *
+ * Copyright (c) 2021-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/utils/activity/pgstat_io_ops.c
+ * -------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "utils/pgstat_internal.h"
+
+static PgStat_IOContextOps pending_IOOpStats;
+
+void
+pgstat_count_io_op(IOOp io_op, IOObject io_object, IOContext io_context)
+{
+ PgStat_IOOpCounters *pending_counters;
+
+ Assert(io_context < IOCONTEXT_NUM_TYPES);
+ Assert(io_op < IOOP_NUM_TYPES);
+ Assert(pgstat_expect_io_op(MyBackendType, io_context, io_object, io_op));
+
+ pending_counters = &pending_IOOpStats.data[io_context].data[io_object];
+
+ switch (io_op)
+ {
+ case IOOP_EVICT:
+ pending_counters->evictions++;
+ break;
+ case IOOP_EXTEND:
+ pending_counters->extends++;
+ break;
+ case IOOP_FSYNC:
+ pending_counters->fsyncs++;
+ break;
+ case IOOP_READ:
+ pending_counters->reads++;
+ break;
+ case IOOP_REUSE:
+ pending_counters->reuses++;
+ break;
+ case IOOP_WRITE:
+ pending_counters->writes++;
+ break;
+ }
+
+}
+
+const char *
+pgstat_io_context_desc(IOContext io_context)
+{
+ switch (io_context)
+ {
+ case IOCONTEXT_BULKREAD:
+ return "bulkread";
+ case IOCONTEXT_BULKWRITE:
+ return "bulkwrite";
+ case IOCONTEXT_BUFFER_POOL:
+ return "buffer pool";
+ case IOCONTEXT_VACUUM:
+ return "vacuum";
+ }
+
+ elog(ERROR, "unrecognized IOContext value: %d", io_context);
+
+ pg_unreachable();
+}
+
+const char *
+pgstat_io_object_desc(IOObject io_object)
+{
+ switch (io_object)
+ {
+ case IOOBJECT_RELATION:
+ return "relation";
+ case IOOBJECT_TEMP_RELATION:
+ return "temp relation";
+ }
+
+ elog(ERROR, "unrecognized IOObject value: %d", io_object);
+
+ pg_unreachable();
+}
+
+const char *
+pgstat_io_op_desc(IOOp io_op)
+{
+ switch (io_op)
+ {
+ case IOOP_EVICT:
+ return "evicted";
+ case IOOP_EXTEND:
+ return "extended";
+ case IOOP_FSYNC:
+ return "files synced";
+ case IOOP_READ:
+ return "read";
+ case IOOP_REUSE:
+ return "reused";
+ case IOOP_WRITE:
+ return "written";
+ }
+
+ elog(ERROR, "unrecognized IOOp value: %d", io_op);
+
+ pg_unreachable();
+}
+
+/*
+* IO Operation statistics are not collected for all BackendTypes.
+*
+* The following BackendTypes do not participate in the cumulative stats
+* subsystem or do not do IO operations worth reporting statistics on:
+* - Syslogger because it is not connected to shared memory
+* - Archiver because most relevant archiving IO is delegated to a
+* specialized command or module
+* - WAL Receiver and WAL Writer IO is not tracked in pg_stat_io for now
+*
+* Function returns true if BackendType participates in the cumulative stats
+* subsystem for IO Operations and false if it does not.
+*/
+bool
+pgstat_io_op_stats_collected(BackendType bktype)
+{
+ return bktype != B_INVALID && bktype != B_ARCHIVER && bktype != B_LOGGER &&
+ bktype != B_WAL_RECEIVER && bktype != B_WAL_WRITER;
+}
+
+
+/*
+ * Some BackendTypes do not perform IO operations in certain IOContexts. Some
+ * IOObjects are never operated on in some IOContexts. Check that the given
+ * BackendType is expected to do IO in the given IOContext and that the given
+ * IOObject is expected to be operated on in the given IOContext.
+ */
+bool
+pgstat_bktype_io_context_io_object_valid(BackendType bktype,
+ IOContext io_context, IOObject io_object)
+{
+ bool no_temp_rel;
+
+ /*
+ * Currently, IO operations on temporary relations can only occur in the
+ * IOCONTEXT_BUFFER_POOL IOContext.
+ */
+ if (io_context != IOCONTEXT_BUFFER_POOL &&
+ io_object == IOOBJECT_TEMP_RELATION)
+ return false;
+
+ /*
+ * In core Postgres, only regular backends and WAL Sender processes
+ * executing queries will use local buffers and operate on temporary
+ * relations. Parallel workers will not use local buffers (see
+ * InitLocalBuffers()); however, extensions leveraging background workers
+ * have no such limitation, so track IO Operations on
+ * IOOBJECT_TEMP_RELATION for BackendType B_BG_WORKER.
+ */
+ no_temp_rel = bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER ||
+ bktype == B_CHECKPOINTER || bktype == B_AUTOVAC_WORKER ||
+ bktype == B_STANDALONE_BACKEND || bktype == B_STARTUP;
+
+ if (no_temp_rel && io_context == IOCONTEXT_BUFFER_POOL &&
+ io_object == IOOBJECT_TEMP_RELATION)
+ return false;
+
+ /*
+ * Some BackendTypes do not currently perform any IO operations in certain
+ * IOContexts, and, while it may not be inherently incorrect for them to
+ * do so, excluding those rows from the view makes the view easier to use.
+ */
+ if ((bktype == B_CHECKPOINTER || bktype == B_BG_WRITER) &&
+ (io_context == IOCONTEXT_BULKREAD ||
+ io_context == IOCONTEXT_BULKWRITE ||
+ io_context == IOCONTEXT_VACUUM))
+ return false;
+
+ if (bktype == B_AUTOVAC_LAUNCHER && io_context == IOCONTEXT_VACUUM)
+ return false;
+
+ if ((bktype == B_AUTOVAC_WORKER || bktype == B_AUTOVAC_LAUNCHER) &&
+ io_context == IOCONTEXT_BULKWRITE)
+ return false;
+
+ return true;
+}
+
+/*
+ * Some BackendTypes will never do certain IOOps and some IOOps should not
+ * occur in certain IOContexts. Check that the given IOOp is valid for the
+ * given BackendType in the given IOContext. Note that there are currently no
+ * cases of an IOOp being invalid for a particular BackendType only within a
+ * certain IOContext.
+ */
+bool
+pgstat_io_op_valid(BackendType bktype, IOContext io_context, IOObject io_object, IOOp io_op)
+{
+ bool strategy_io_context;
+
+ /*
+ * Some BackendTypes should never track IO Operation statistics.
+ */
+ Assert(pgstat_io_op_stats_collected(bktype));
+
+ /*
+ * Some BackendTypes will not do certain IOOps.
+ */
+ if ((bktype == B_BG_WRITER || bktype == B_CHECKPOINTER) &&
+ (io_op == IOOP_READ || io_op == IOOP_EVICT))
+ return false;
+
+ if ((bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER ||
+ bktype == B_CHECKPOINTER) && io_op == IOOP_EXTEND)
+ return false;
+
+ /*
+ * Some IOOps are not valid in certain IOContexts and some IOOps are only
+ * valid in certain contexts.
+ */
+ if (io_context == IOCONTEXT_BULKREAD && io_op == IOOP_EXTEND)
+ return false;
+
+ strategy_io_context = io_context == IOCONTEXT_BULKREAD ||
+ io_context == IOCONTEXT_BULKWRITE || io_context == IOCONTEXT_VACUUM;
+
+ /*
+ * IOOP_REUSE is only relevant when a BufferAccessStrategy is in use.
+ */
+ if (!strategy_io_context && io_op == IOOP_REUSE)
+ return false;
+
+ /*
+ * IOOP_FSYNC IOOps done by a backend using a BufferAccessStrategy are
+ * counted in the IOCONTEXT_BUFFER_POOL IOContext. See comment in
+ * ForwardSyncRequest() for more details.
+ */
+ if (strategy_io_context && io_op == IOOP_FSYNC)
+ return false;
+
+ /*
+ * Temporary tables are not logged and thus do not require fsync'ing.
+ */
+ if (io_context == IOCONTEXT_BUFFER_POOL &&
+ io_object == IOOBJECT_TEMP_RELATION && io_op == IOOP_FSYNC)
+ return false;
+
+ return true;
+}
+
+bool
+pgstat_expect_io_op(BackendType bktype, IOContext io_context, IOObject io_object, IOOp io_op)
+{
+ if (!pgstat_io_op_stats_collected(bktype))
+ return false;
+
+ if (!pgstat_bktype_io_context_io_object_valid(bktype, io_context, io_object))
+ return false;
+
+ if (!pgstat_io_op_valid(bktype, io_context, io_object, io_op))
+ return false;
+
+ return true;
+}
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 9e2ce6f011..e2beafb9b2 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -14,6 +14,7 @@
#include "datatype/timestamp.h"
#include "portability/instr_time.h"
#include "postmaster/pgarch.h" /* for MAX_XFN_CHARS */
+#include "storage/buf.h"
#include "utils/backend_progress.h" /* for backward compatibility */
#include "utils/backend_status.h" /* for backward compatibility */
#include "utils/relcache.h"
@@ -276,6 +277,63 @@ typedef struct PgStat_CheckpointerStats
PgStat_Counter buf_fsync_backend;
} PgStat_CheckpointerStats;
+/*
+ * Types related to counting IO Operations for various IO Contexts
+ * When adding a new value, ensure that the proper assertions are added to
+ * pgstat_io_context_ops_assert_zero() and pgstat_io_op_assert_zero() (though
+ * the compiler will remind you about the latter)
+ */
+
+typedef enum IOOp
+{
+ IOOP_EVICT,
+ IOOP_EXTEND,
+ IOOP_FSYNC,
+ IOOP_READ,
+ IOOP_REUSE,
+ IOOP_WRITE,
+} IOOp;
+
+#define IOOP_NUM_TYPES (IOOP_WRITE + 1)
+
+typedef enum IOObject
+{
+ IOOBJECT_RELATION,
+ IOOBJECT_TEMP_RELATION,
+} IOObject;
+
+#define IOOBJECT_NUM_TYPES (IOOBJECT_TEMP_RELATION + 1)
+
+typedef enum IOContext
+{
+ IOCONTEXT_BULKREAD,
+ IOCONTEXT_BULKWRITE,
+ IOCONTEXT_BUFFER_POOL,
+ IOCONTEXT_VACUUM,
+} IOContext;
+
+#define IOCONTEXT_NUM_TYPES (IOCONTEXT_VACUUM + 1)
+
+typedef struct PgStat_IOOpCounters
+{
+ PgStat_Counter evictions;
+ PgStat_Counter extends;
+ PgStat_Counter fsyncs;
+ PgStat_Counter reads;
+ PgStat_Counter reuses;
+ PgStat_Counter writes;
+} PgStat_IOOpCounters;
+
+typedef struct PgStat_IOObjectOps
+{
+ PgStat_IOOpCounters data[IOOBJECT_NUM_TYPES];
+} PgStat_IOObjectOps;
+
+typedef struct PgStat_IOContextOps
+{
+ PgStat_IOObjectOps data[IOCONTEXT_NUM_TYPES];
+} PgStat_IOContextOps;
+
typedef struct PgStat_StatDBEntry
{
PgStat_Counter n_xact_commit;
@@ -453,6 +511,28 @@ extern void pgstat_report_checkpointer(void);
extern PgStat_CheckpointerStats *pgstat_fetch_stat_checkpointer(void);
+/*
+ * Functions in pgstat_io_ops.c
+ */
+
+extern void pgstat_count_io_op(IOOp io_op, IOObject io_object, IOContext io_context);
+extern const char *pgstat_io_context_desc(IOContext io_context);
+extern const char *pgstat_io_object_desc(IOObject io_object);
+extern const char *pgstat_io_op_desc(IOOp io_op);
+
+/* Validation functions in pgstat_io_ops.c */
+extern bool pgstat_io_op_stats_collected(BackendType bktype);
+extern bool pgstat_bktype_io_context_io_object_valid(BackendType bktype,
+ IOContext io_context, IOObject io_object);
+extern bool pgstat_io_op_valid(BackendType bktype, IOContext io_context,
+ IOObject io_object, IOOp io_op);
+extern bool pgstat_expect_io_op(BackendType bktype,
+ IOContext io_context, IOObject io_object, IOOp io_op);
+
+/* IO stats translation function in freelist.c */
+extern IOContext IOContextForStrategy(BufferAccessStrategy bas);
+
+
/*
* Functions in pgstat_database.c
*/
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index e1bd22441b..206f4c0b3e 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -23,7 +23,12 @@
typedef void *Block;
-/* Possible arguments for GetAccessStrategy() */
+/*
+ * Possible arguments for GetAccessStrategy().
+ *
+ * If adding a new BufferAccessStrategyType, also add a new IOContext so
+ * statistics on IO operations using this strategy are tracked.
+ */
typedef enum BufferAccessStrategyType
{
BAS_NORMAL, /* Normal random access */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 2f5802195d..187828eb90 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1106,7 +1106,10 @@ ID
INFIX
INT128
INTERFACE_INFO
+IOContext
IOFuncSelector
+IOObject
+IOOp
IPCompareMethod
ITEM
IV
@@ -2026,6 +2029,9 @@ PgStat_FetchConsistency
PgStat_FunctionCallUsage
PgStat_FunctionCounts
PgStat_HashKey
+PgStat_IOContextOps
+PgStat_IOObjectOps
+PgStat_IOOpCounters
PgStat_Kind
PgStat_KindInfo
PgStat_LocalState
--
2.38.1
Attachment: v38-0003-Aggregate-IO-operation-stats-per-BackendType.patch (application/octet-stream)
From 50861d120966988bb121fb1f74a8b25b4a994b21 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Thu, 3 Nov 2022 12:17:40 -0400
Subject: [PATCH v38 3/4] Aggregate IO operation stats per BackendType
Stats on IOOps for all IOContexts for a backend are tracked locally. Add
functionality for backends to flush these stats to shared memory and
accumulate them with those from all other backends, exited and live.
Also add reset and snapshot functions used by cumulative stats system
for management of these statistics.
The aggregated stats in shared memory could be extended in the future
with per-backend stats -- useful for per connection IO statistics and
monitoring.
Some BackendTypes will not flush their pending statistics at regular
intervals and explicitly call pgstat_flush_io_ops() during the course of
normal operations to flush their backend-local IO operation statistics
to shared memory in a timely manner.
Because not all BackendType, IOOp, IOObject, IOContext combinations are
valid, the validity of the stats is checked before flushing pending
stats and before reading in the existing stats file to shared memory.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Maciek Sakrejda <m.sakrejda@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
doc/src/sgml/monitoring.sgml | 2 +
src/backend/utils/activity/pgstat.c | 78 ++++++++
src/backend/utils/activity/pgstat_bgwriter.c | 7 +-
.../utils/activity/pgstat_checkpointer.c | 7 +-
src/backend/utils/activity/pgstat_io_ops.c | 169 +++++++++++++++++-
src/backend/utils/activity/pgstat_relation.c | 15 +-
src/backend/utils/activity/pgstat_shmem.c | 4 +
src/backend/utils/activity/pgstat_wal.c | 4 +-
src/backend/utils/adt/pgstatfuncs.c | 4 +-
src/include/miscadmin.h | 2 +
src/include/pgstat.h | 51 ++++++
src/include/utils/pgstat_internal.h | 43 +++++
src/tools/pgindent/typedefs.list | 4 +
13 files changed, 383 insertions(+), 7 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 5579b8b9e0..89fca710db 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -5394,6 +5394,8 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
the <structname>pg_stat_bgwriter</structname>
view, <literal>archiver</literal> to reset all the counters shown in
the <structname>pg_stat_archiver</structname> view,
+ <literal>io</literal> to reset all the counters shown in the
+ <structname>pg_stat_io</structname> view,
<literal>wal</literal> to reset all the counters shown in the
<structname>pg_stat_wal</structname> view or
<literal>recovery_prefetch</literal> to reset all the counters shown
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 1ebe3bbf29..12542ec58e 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -359,6 +359,15 @@ static const PgStat_KindInfo pgstat_kind_infos[PGSTAT_NUM_KINDS] = {
.snapshot_cb = pgstat_checkpointer_snapshot_cb,
},
+ [PGSTAT_KIND_IOOPS] = {
+ .name = "io_ops",
+
+ .fixed_amount = true,
+
+ .reset_all_cb = pgstat_io_ops_reset_all_cb,
+ .snapshot_cb = pgstat_io_ops_snapshot_cb,
+ },
+
[PGSTAT_KIND_SLRU] = {
.name = "slru",
@@ -582,6 +591,7 @@ pgstat_report_stat(bool force)
/* Don't expend a clock check if nothing to do */
if (dlist_is_empty(&pgStatPending) &&
+ !have_ioopstats &&
!have_slrustats &&
!pgstat_have_pending_wal())
{
@@ -628,6 +638,9 @@ pgstat_report_stat(bool force)
/* flush database / relation / function / ... stats */
partial_flush |= pgstat_flush_pending_entries(nowait);
+ /* flush IO Operations stats */
+ partial_flush |= pgstat_flush_io_ops(nowait);
+
/* flush wal stats */
partial_flush |= pgstat_flush_wal(nowait);
@@ -1321,6 +1334,14 @@ pgstat_write_statsfile(void)
pgstat_build_snapshot_fixed(PGSTAT_KIND_CHECKPOINTER);
write_chunk_s(fpout, &pgStatLocal.snapshot.checkpointer);
+ /*
+ * Write IO Operations stats struct
+ */
+ pgstat_build_snapshot_fixed(PGSTAT_KIND_IOOPS);
+ write_chunk_s(fpout, &pgStatLocal.snapshot.io_ops.stat_reset_timestamp);
+ for (int bktype = 0; bktype < BACKEND_NUM_TYPES; bktype++)
+ write_chunk_s(fpout, &pgStatLocal.snapshot.io_ops.stats[bktype]);
+
/*
* Write SLRU stats struct
*/
@@ -1415,6 +1436,49 @@ pgstat_write_statsfile(void)
}
}
+/*
+ * Assert that stats have not been counted for any combination of IOContext,
+ * IOObject, and IOOp which is not valid for the passed-in BackendType. The
+ * passed-in array of PgStat_IOOpCounters must contain stats from the
+ * BackendType specified by the second parameter. Caller is responsible for
+ * locking of the passed-in PgStatShared_IOContextOps, if needed.
+ */
+static void
+pgstat_backend_io_stats_assert_well_formed(PgStatShared_IOContextOps *backend_io_context_ops,
+ BackendType bktype)
+{
+ bool expect_backend_stats = true;
+
+ if (!pgstat_io_op_stats_collected(bktype))
+ expect_backend_stats = false;
+
+ for (IOContext io_context = IOCONTEXT_BULKREAD;
+ io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ {
+ PgStatShared_IOObjectOps *context = &backend_io_context_ops->data[io_context];
+
+ for (IOObject io_object = IOOBJECT_RELATION;
+ io_object < IOOBJECT_NUM_TYPES; io_object++)
+ {
+ PgStat_IOOpCounters *object = &context->data[io_object];
+
+ if (!expect_backend_stats ||
+ !pgstat_bktype_io_context_io_object_valid(bktype, io_context,
+ io_object))
+ {
+ pgstat_io_context_ops_assert_zero(object);
+ continue;
+ }
+
+ for (IOOp io_op = IOOP_EVICT; io_op < IOOP_NUM_TYPES; io_op++)
+ {
+ if (!pgstat_io_op_valid(bktype, io_context, io_object, io_op))
+ pgstat_io_op_assert_zero(object, io_op);
+ }
+ }
+ }
+}
+
/* helpers for pgstat_read_statsfile() */
static bool
read_chunk(FILE *fpin, void *ptr, size_t len)
@@ -1495,6 +1559,20 @@ pgstat_read_statsfile(void)
if (!read_chunk_s(fpin, &shmem->checkpointer.stats))
goto error;
+ /*
+ * Read IO Operations stats struct
+ */
+ if (!read_chunk_s(fpin, &shmem->io_ops.stat_reset_timestamp))
+ goto error;
+
+ for (int bktype = 0; bktype < BACKEND_NUM_TYPES; bktype++)
+ {
+ pgstat_backend_io_stats_assert_well_formed(&shmem->io_ops.stats[bktype],
+ (BackendType) bktype);
+ if (!read_chunk_s(fpin, &shmem->io_ops.stats[bktype].data))
+ goto error;
+ }
+
/*
* Read SLRU stats struct
*/
diff --git a/src/backend/utils/activity/pgstat_bgwriter.c b/src/backend/utils/activity/pgstat_bgwriter.c
index fbb1edc527..3d7f90a1b7 100644
--- a/src/backend/utils/activity/pgstat_bgwriter.c
+++ b/src/backend/utils/activity/pgstat_bgwriter.c
@@ -24,7 +24,7 @@ PgStat_BgWriterStats PendingBgWriterStats = {0};
/*
- * Report bgwriter statistics
+ * Report bgwriter and IO Operation statistics
*/
void
pgstat_report_bgwriter(void)
@@ -56,6 +56,11 @@ pgstat_report_bgwriter(void)
* Clear out the statistics buffer, so it can be re-used.
*/
MemSet(&PendingBgWriterStats, 0, sizeof(PendingBgWriterStats));
+
+ /*
+ * Report IO Operations statistics
+ */
+ pgstat_flush_io_ops(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_checkpointer.c b/src/backend/utils/activity/pgstat_checkpointer.c
index af8d513e7b..cfcf127210 100644
--- a/src/backend/utils/activity/pgstat_checkpointer.c
+++ b/src/backend/utils/activity/pgstat_checkpointer.c
@@ -24,7 +24,7 @@ PgStat_CheckpointerStats PendingCheckpointerStats = {0};
/*
- * Report checkpointer statistics
+ * Report checkpointer and IO Operation statistics
*/
void
pgstat_report_checkpointer(void)
@@ -62,6 +62,11 @@ pgstat_report_checkpointer(void)
* Clear out the statistics buffer, so it can be re-used.
*/
MemSet(&PendingCheckpointerStats, 0, sizeof(PendingCheckpointerStats));
+
+ /*
+ * Report IO Operation statistics
+ */
+ pgstat_flush_io_ops(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_io_ops.c b/src/backend/utils/activity/pgstat_io_ops.c
index 015c65ea3a..555172bd13 100644
--- a/src/backend/utils/activity/pgstat_io_ops.c
+++ b/src/backend/utils/activity/pgstat_io_ops.c
@@ -20,6 +20,42 @@
#include "utils/pgstat_internal.h"
static PgStat_IOContextOps pending_IOOpStats;
+bool have_ioopstats = false;
+
+
+/*
+ * Helper function to accumulate source PgStat_IOOpCounters into target
+ * PgStat_IOOpCounters. If either of the passed-in PgStat_IOOpCounters is a
+ * member of PgStatShared_IOContextOps, the caller is responsible for ensuring
+ * that the appropriate lock is held.
+ */
+static void
+pgstat_accum_io_op(PgStat_IOOpCounters *target, PgStat_IOOpCounters *source, IOOp io_op)
+{
+ switch (io_op)
+ {
+ case IOOP_EVICT:
+ target->evictions += source->evictions;
+ return;
+ case IOOP_EXTEND:
+ target->extends += source->extends;
+ return;
+ case IOOP_FSYNC:
+ target->fsyncs += source->fsyncs;
+ return;
+ case IOOP_READ:
+ target->reads += source->reads;
+ return;
+ case IOOP_REUSE:
+ target->reuses += source->reuses;
+ return;
+ case IOOP_WRITE:
+ target->writes += source->writes;
+ return;
+ }
+
+ elog(ERROR, "unrecognized IOOp value: %d", io_op);
+}
void
pgstat_count_io_op(IOOp io_op, IOObject io_object, IOContext io_context)
@@ -54,6 +90,87 @@ pgstat_count_io_op(IOOp io_op, IOObject io_object, IOContext io_context)
break;
}
+ have_ioopstats = true;
+}
+
+PgStat_BackendIOContextOps *
+pgstat_fetch_backend_io_context_ops(void)
+{
+ pgstat_snapshot_fixed(PGSTAT_KIND_IOOPS);
+
+ return &pgStatLocal.snapshot.io_ops;
+}
+
+/*
+ * Flush out locally pending IO Operation statistics entries
+ *
+ * If no stats have been recorded, this function returns false.
+ *
+ * If nowait is true and the lock could not be acquired, returns true
+ * without flushing the entries. Otherwise, flushes them and returns false.
+ */
+bool
+pgstat_flush_io_ops(bool nowait)
+{
+ PgStatShared_IOContextOps *type_shstats;
+ bool expect_backend_stats = true;
+
+ if (!have_ioopstats)
+ return false;
+
+ type_shstats =
+ &pgStatLocal.shmem->io_ops.stats[MyBackendType];
+
+ if (!nowait)
+ LWLockAcquire(&type_shstats->lock, LW_EXCLUSIVE);
+ else if (!LWLockConditionalAcquire(&type_shstats->lock, LW_EXCLUSIVE))
+ return true;
+
+ expect_backend_stats = pgstat_io_op_stats_collected(MyBackendType);
+
+ for (IOContext io_context = IOCONTEXT_BULKREAD;
+ io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ {
+ PgStatShared_IOObjectOps *shared_objs = &type_shstats->data[io_context];
+ PgStat_IOObjectOps *pending_objs = &pending_IOOpStats.data[io_context];
+
+ for (IOObject io_object = IOOBJECT_RELATION;
+ io_object < IOOBJECT_NUM_TYPES; io_object++)
+ {
+ PgStat_IOOpCounters *sharedent = &shared_objs->data[io_object];
+ PgStat_IOOpCounters *pendingent = &pending_objs->data[io_object];
+
+ if (!expect_backend_stats ||
+ !pgstat_bktype_io_context_io_object_valid(MyBackendType,
+ io_context, io_object))
+ {
+ pgstat_io_context_ops_assert_zero(sharedent);
+ pgstat_io_context_ops_assert_zero(pendingent);
+ continue;
+ }
+
+ for (IOOp io_op = IOOP_EVICT; io_op < IOOP_NUM_TYPES; io_op++)
+ {
+ if (!pgstat_io_op_valid(MyBackendType, io_context, io_object,
+ io_op))
+ {
+ pgstat_io_op_assert_zero(sharedent, io_op);
+ pgstat_io_op_assert_zero(pendingent, io_op);
+ continue;
+ }
+
+ pgstat_accum_io_op(sharedent, pendingent, io_op);
+ }
+ }
+ }
+
+ LWLockRelease(&type_shstats->lock);
+
+ memset(&pending_IOOpStats, 0, sizeof(pending_IOOpStats));
+
+ have_ioopstats = false;
+
+ return false;
}
const char *
@@ -116,11 +233,61 @@ pgstat_io_op_desc(IOOp io_op)
pg_unreachable();
}
+void
+pgstat_io_ops_reset_all_cb(TimestampTz ts)
+{
+ PgStatShared_BackendIOContextOps *backends_stats_shmem = &pgStatLocal.shmem->io_ops;
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ PgStatShared_IOContextOps *stats_shmem = &backends_stats_shmem->stats[i];
+
+ LWLockAcquire(&stats_shmem->lock, LW_EXCLUSIVE);
+
+ /*
+ * Use the lock in the first BackendType's PgStatShared_IOContextOps to
+ * protect the reset timestamp as well.
+ */
+ if (i == 0)
+ backends_stats_shmem->stat_reset_timestamp = ts;
+
+ memset(stats_shmem->data, 0, sizeof(stats_shmem->data));
+ LWLockRelease(&stats_shmem->lock);
+ }
+}
+
+void
+pgstat_io_ops_snapshot_cb(void)
+{
+ PgStatShared_BackendIOContextOps *backends_stats_shmem = &pgStatLocal.shmem->io_ops;
+ PgStat_BackendIOContextOps *backends_stats_snap = &pgStatLocal.snapshot.io_ops;
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ PgStatShared_IOContextOps *stats_shmem = &backends_stats_shmem->stats[i];
+ PgStat_IOContextOps *stats_snap = &backends_stats_snap->stats[i];
+
+ LWLockAcquire(&stats_shmem->lock, LW_SHARED);
+
+ /*
+ * Use the lock in the first BackendType's PgStat_IOContextOps to
+ * protect the reset timestamp as well.
+ */
+ if (i == 0)
+ backends_stats_snap->stat_reset_timestamp =
+ backends_stats_shmem->stat_reset_timestamp;
+
+ memcpy(stats_snap->data, stats_shmem->data, sizeof(stats_shmem->data));
+ LWLockRelease(&stats_shmem->lock);
+ }
+
+}
+
/*
* IO Operation statistics are not collected for all BackendTypes.
*
* The following BackendTypes do not participate in the cumulative stats
-* subsystem or do not do IO operations worth reporting statistics on:
+* subsystem or do not perform IO operations on which we currently report:
* - Syslogger because it is not connected to shared memory
* - Archiver because most relevant archiving IO is delegated to a
* specialized command or module
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index f92e16e7af..1c84e1a5f0 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -206,7 +206,7 @@ pgstat_drop_relation(Relation rel)
}
/*
- * Report that the table was just vacuumed.
+ * Report that the table was just vacuumed and flush IO Operation statistics.
*/
void
pgstat_report_vacuum(Oid tableoid, bool shared,
@@ -258,10 +258,18 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
}
pgstat_unlock_entry(entry_ref);
+
+ /*
+ * Flush IO Operations statistics now. pgstat_report_stat() will flush IO
+ * Operation stats, however this will not be called until after an entire
+ * autovacuum cycle is done -- which will likely vacuum many relations --
+ * or until the VACUUM command has processed all tables and committed.
+ */
+ pgstat_flush_io_ops(false);
}
/*
- * Report that the table was just analyzed.
+ * Report that the table was just analyzed and flush IO Operation statistics.
*
* Caller must provide new live- and dead-tuples estimates, as well as a
* flag indicating whether to reset the changes_since_analyze counter.
@@ -341,6 +349,9 @@ pgstat_report_analyze(Relation rel,
}
pgstat_unlock_entry(entry_ref);
+
+ /* see pgstat_report_vacuum() */
+ pgstat_flush_io_ops(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_shmem.c b/src/backend/utils/activity/pgstat_shmem.c
index 706692862c..4251079ae1 100644
--- a/src/backend/utils/activity/pgstat_shmem.c
+++ b/src/backend/utils/activity/pgstat_shmem.c
@@ -202,6 +202,10 @@ StatsShmemInit(void)
LWLockInitialize(&ctl->checkpointer.lock, LWTRANCHE_PGSTATS_DATA);
LWLockInitialize(&ctl->slru.lock, LWTRANCHE_PGSTATS_DATA);
LWLockInitialize(&ctl->wal.lock, LWTRANCHE_PGSTATS_DATA);
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ LWLockInitialize(&ctl->io_ops.stats[i].lock,
+ LWTRANCHE_PGSTATS_DATA);
}
else
{
diff --git a/src/backend/utils/activity/pgstat_wal.c b/src/backend/utils/activity/pgstat_wal.c
index 5a878bd115..9cac407b42 100644
--- a/src/backend/utils/activity/pgstat_wal.c
+++ b/src/backend/utils/activity/pgstat_wal.c
@@ -34,7 +34,7 @@ static WalUsage prevWalUsage;
/*
* Calculate how much WAL usage counters have increased and update
- * shared statistics.
+ * shared WAL and IO Operation statistics.
*
* Must be called by processes that generate WAL, that do not call
* pgstat_report_stat(), like walwriter.
@@ -43,6 +43,8 @@ void
pgstat_report_wal(bool force)
{
pgstat_flush_wal(force);
+
+ pgstat_flush_io_ops(force);
}
/*
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index ae3365d917..a135cad0ce 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -2090,6 +2090,8 @@ pg_stat_reset_shared(PG_FUNCTION_ARGS)
pgstat_reset_of_kind(PGSTAT_KIND_BGWRITER);
pgstat_reset_of_kind(PGSTAT_KIND_CHECKPOINTER);
}
+ else if (strcmp(target, "io") == 0)
+ pgstat_reset_of_kind(PGSTAT_KIND_IOOPS);
else if (strcmp(target, "recovery_prefetch") == 0)
XLogPrefetchResetStats();
else if (strcmp(target, "wal") == 0)
@@ -2098,7 +2100,7 @@ pg_stat_reset_shared(PG_FUNCTION_ARGS)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("unrecognized reset target: \"%s\"", target),
- errhint("Target must be \"archiver\", \"bgwriter\", \"recovery_prefetch\", or \"wal\".")));
+ errhint("Target must be \"archiver\", \"io\", \"bgwriter\", \"recovery_prefetch\", or \"wal\".")));
PG_RETURN_VOID();
}
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 795182fa51..d1a43a662e 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -331,6 +331,8 @@ typedef enum BackendType
B_WAL_WRITER,
} BackendType;
+#define BACKEND_NUM_TYPES (B_WAL_WRITER + 1)
+
extern PGDLLIMPORT BackendType MyBackendType;
extern const char *GetBackendTypeDesc(BackendType backendType);
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index e2beafb9b2..b6db8fa9a0 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -49,6 +49,7 @@ typedef enum PgStat_Kind
PGSTAT_KIND_ARCHIVER,
PGSTAT_KIND_BGWRITER,
PGSTAT_KIND_CHECKPOINTER,
+ PGSTAT_KIND_IOOPS,
PGSTAT_KIND_SLRU,
PGSTAT_KIND_WAL,
} PgStat_Kind;
@@ -334,6 +335,12 @@ typedef struct PgStat_IOContextOps
PgStat_IOObjectOps data[IOCONTEXT_NUM_TYPES];
} PgStat_IOContextOps;
+typedef struct PgStat_BackendIOContextOps
+{
+ TimestampTz stat_reset_timestamp;
+ PgStat_IOContextOps stats[BACKEND_NUM_TYPES];
+} PgStat_BackendIOContextOps;
+
typedef struct PgStat_StatDBEntry
{
PgStat_Counter n_xact_commit;
@@ -516,6 +523,7 @@ extern PgStat_CheckpointerStats *pgstat_fetch_stat_checkpointer(void);
*/
extern void pgstat_count_io_op(IOOp io_op, IOObject io_object, IOContext io_context);
+extern PgStat_BackendIOContextOps *pgstat_fetch_backend_io_context_ops(void);
extern const char *pgstat_io_context_desc(IOContext io_context);
extern const char * pgstat_io_object_desc(IOObject io_object);
extern const char *pgstat_io_op_desc(IOOp io_op);
@@ -532,6 +540,49 @@ extern bool pgstat_expect_io_op(BackendType bktype,
/* IO stats translation function in freelist.c */
extern IOContext IOContextForStrategy(BufferAccessStrategy bas);
+/*
+ * Functions to assert that invalid IO Operation counters are zero.
+ */
+static inline void
+pgstat_io_context_ops_assert_zero(PgStat_IOOpCounters *counters)
+{
+ Assert(counters->evictions == 0);
+ Assert(counters->extends == 0);
+ Assert(counters->fsyncs == 0);
+ Assert(counters->reads == 0);
+ Assert(counters->reuses == 0);
+ Assert(counters->writes == 0);
+}
+
+static inline void
+pgstat_io_op_assert_zero(PgStat_IOOpCounters *counters, IOOp io_op)
+{
+ switch (io_op)
+ {
+ case IOOP_EVICT:
+ Assert(counters->evictions == 0);
+ return;
+ case IOOP_EXTEND:
+ Assert(counters->extends == 0);
+ return;
+ case IOOP_FSYNC:
+ Assert(counters->fsyncs == 0);
+ return;
+ case IOOP_READ:
+ Assert(counters->reads == 0);
+ return;
+ case IOOP_REUSE:
+ Assert(counters->reuses == 0);
+ return;
+ case IOOP_WRITE:
+ Assert(counters->writes == 0);
+ return;
+ }
+
+ /* Should not reach here */
+ Assert(false);
+}
+
/*
* Functions in pgstat_database.c
diff --git a/src/include/utils/pgstat_internal.h b/src/include/utils/pgstat_internal.h
index e2c7b59324..96962a2405 100644
--- a/src/include/utils/pgstat_internal.h
+++ b/src/include/utils/pgstat_internal.h
@@ -329,6 +329,31 @@ typedef struct PgStatShared_Checkpointer
PgStat_CheckpointerStats reset_offset;
} PgStatShared_Checkpointer;
+
+typedef struct PgStatShared_IOObjectOps
+{
+ PgStat_IOOpCounters data[IOOBJECT_NUM_TYPES];
+} PgStatShared_IOObjectOps;
+
+typedef struct PgStatShared_IOContextOps
+{
+ /*
+ * lock protects ->data. If this PgStatShared_IOContextOps is
+ * PgStatShared_BackendIOContextOps->stats[0], lock also protects
+ * PgStatShared_BackendIOContextOps->stat_reset_timestamp.
+ */
+ LWLock lock;
+ PgStatShared_IOObjectOps data[IOCONTEXT_NUM_TYPES];
+} PgStatShared_IOContextOps;
+
+typedef struct PgStatShared_BackendIOContextOps
+{
+ /* ->stat_reset_timestamp is protected by ->stats[0].lock */
+ TimestampTz stat_reset_timestamp;
+ PgStatShared_IOContextOps stats[BACKEND_NUM_TYPES];
+} PgStatShared_BackendIOContextOps;
+
+
typedef struct PgStatShared_SLRU
{
/* lock protects ->stats */
@@ -419,6 +444,7 @@ typedef struct PgStat_ShmemControl
PgStatShared_Archiver archiver;
PgStatShared_BgWriter bgwriter;
PgStatShared_Checkpointer checkpointer;
+ PgStatShared_BackendIOContextOps io_ops;
PgStatShared_SLRU slru;
PgStatShared_Wal wal;
} PgStat_ShmemControl;
@@ -442,6 +468,8 @@ typedef struct PgStat_Snapshot
PgStat_CheckpointerStats checkpointer;
+ PgStat_BackendIOContextOps io_ops;
+
PgStat_SLRUStats slru[SLRU_NUM_ELEMENTS];
PgStat_WalStats wal;
@@ -549,6 +577,16 @@ extern void pgstat_database_reset_timestamp_cb(PgStatShared_Common *header, Time
extern bool pgstat_function_flush_cb(PgStat_EntryRef *entry_ref, bool nowait);
+/*
+ * Functions in pgstat_io_ops.c
+ */
+
+extern void pgstat_io_ops_reset_all_cb(TimestampTz ts);
+extern void pgstat_io_ops_snapshot_cb(void);
+extern bool pgstat_flush_io_ops(bool nowait);
+
+
+
/*
* Functions in pgstat_relation.c
*/
@@ -641,6 +679,11 @@ extern void pgstat_create_transactional(PgStat_Kind kind, Oid dboid, Oid objoid)
extern PGDLLIMPORT PgStat_LocalState pgStatLocal;
+/*
+ * Variables in pgstat_io_ops.c
+ */
+
+extern PGDLLIMPORT bool have_ioopstats;
/*
* Variables in pgstat_slru.c
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 187828eb90..b16e2cc8da 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2006,12 +2006,15 @@ PgFdwRelationInfo
PgFdwScanState
PgIfAddrCallback
PgStatShared_Archiver
+PgStatShared_BackendIOContextOps
PgStatShared_BgWriter
PgStatShared_Checkpointer
PgStatShared_Common
PgStatShared_Database
PgStatShared_Function
PgStatShared_HashEntry
+PgStatShared_IOContextOps
+PgStatShared_IOObjectOps
PgStatShared_Relation
PgStatShared_ReplSlot
PgStatShared_SLRU
@@ -2019,6 +2022,7 @@ PgStatShared_Subscription
PgStatShared_Wal
PgStat_ArchiverStats
PgStat_BackendFunctionEntry
+PgStat_BackendIOContextOps
PgStat_BackendSubEntry
PgStat_BgWriterStats
PgStat_CheckpointerStats
--
2.38.1
On Wed, Nov 23, 2022 at 12:43 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
Note that 001 fails to compile without 002:
../src/backend/storage/buffer/bufmgr.c:1257:43: error: ‘from_ring’ undeclared (first use in this function)
1257 | StrategyRejectBuffer(strategy, buf, from_ring))
Thanks!
I fixed this in version 38, attached in response to Andres upthread [1].
My "warnings" script informed me about these gripes from MSVC:
[03:42:30.607] c:\cirrus>call sh -c 'if grep ": warning " build.txt; then exit 1; fi; exit 0'
[03:42:30.749] c:\cirrus\src\backend\storage\buffer\freelist.c(699) : warning C4715: 'IOContextForStrategy': not all control paths return a value
[03:42:30.749] c:\cirrus\src\backend\utils\activity\pgstat_io_ops.c(190) : warning C4715: 'pgstat_io_context_desc': not all control paths return a value
[03:42:30.749] c:\cirrus\src\backend\utils\activity\pgstat_io_ops.c(204) : warning C4715: 'pgstat_io_object_desc': not all control paths return a value
[03:42:30.749] c:\cirrus\src\backend\utils\activity\pgstat_io_ops.c(226) : warning C4715: 'pgstat_io_op_desc': not all control paths return a value
[03:42:30.749] c:\cirrus\src\backend\utils\adt\pgstatfuncs.c(1816) : warning C4715: 'pgstat_io_op_get_index': not all control paths return a value
Thanks, I forgot to look at those warnings in CI.
I added pg_unreachable() and think it silenced the warnings.
In the docs table, you say things like:
| io_context vacuum refers to the IO operations incurred while vacuuming and analyzing...
but it's a bit unclear (maybe due to the way the docs are rendered).
I think it may be more clear to say "when <io_context> is
<vacuum>, ..."
So, because I use this language [column name] [column value] so often in
the docs, I would prefer a pattern that is as concise as possible. I
agree it may be hard to see due to the rendering. Currently, I am using
<varname> tags for the column name and <literal> tags for the column
value. Is there another tag type I could use to perhaps make this more
clear without adding additional words?
This is what the code looks like for the above docs text:
<varname>io_context</varname> <literal>vacuum</literal> refers to the IO
| acquiring the equivalent number of shared buffers
I don't think "equivalent" fits here, since it's actually acquiring a
different number of buffers.
I'm planning to do docs changes in a separate patchset after addressing
code feedback. I plan to change "equivalent" to "corresponding" here.
There's a missing period before " The difference is"
The sentence beginning "read plus extended for backend_types" is difficult to
parse due to having a bulleted list in its middle.
Will address in future version.
There aren't many references to "IOOps", which is good, because I
started to read it as "I oops".
Grep'ing for this in the code, I only use the word IOOp(s) in the code
when I very clearly want to use the type name -- and never in the docs.
But, yes, it does look like "I oops" :)
+ * Flush IO Operations statistics now. pgstat_report_stat() will flush IO
+ * Operation stats, however this will not be called after an entire
=> I think that's intended to say *until* after ?
Fixed in v38.
+ * Functions to assert that invalid IO Operation counters are zero.
=> There's a missing newline above this comment.
Fixed in v38.
+ Assert(counters->evictions == 0 && counters->extends == 0 &&
+ counters->fsyncs == 0 && counters->reads == 0 && counters->reuses
+ == 0 && counters->writes == 0);
=> It'd be more readable and also maybe help debugging if these were separate
assertions.
I have made this change.
+pgstat_io_op_stats_collected(BackendType bktype)
+{
+ return bktype != B_INVALID && bktype != B_ARCHIVER && bktype != B_LOGGER &&
+ bktype != B_WAL_RECEIVER && bktype != B_WAL_WRITER;
Similar: I'd prefer to see this as 5 "ifs" or a "switch" to return
false, else return true. But YMMV.
I don't know that separating it into multiple if statements or a switch
would make it more clear to me or help me with debugging here.
Separately, since this is used in non-assert builds, I would like to
ensure it is efficient. Do you know if a switch or if statements will
be compiled to the exact same thing as this at useful optimization
levels?
+ * CREATE TEMPORRARY TABLE AS ...
=> typo: temporary
Fixed in v38.
+ if (strategy_io_context && io_op == IOOP_FSYNC)
=> Extra space.
Fixed.
pgstat_count_io_op() has a superfluous newline before "}".
I couldn't find the one you are referencing.
Do you mind pasting in the code?
Thanks,
Melanie
[1]: /messages/by-id/CAAKRu_Zvaj_yFA_eiSRrLZsjhT0J8cJ044QhZfKuXq6WN5bu5g@mail.gmail.com
Thanks for the review, Maciek!
I've attached a new version 39 of the patch which addresses your docs
feedback from this email as well as docs feedback from Andres in [1] and
Justin in [2].
I've made some additional code changes addressing a few of their other
points as well, and I've moved the verify_heapam test to a plain sql
test in contrib/amcheck instead of putting it in the perl test.
This patchset also includes various cleanup, pgindenting, and addressing
the sgml indentation issue brought up in the thread.
On Mon, Nov 7, 2022 at 1:26 PM Maciek Sakrejda <m.sakrejda@gmail.com> wrote:
On Thu, Nov 3, 2022 at 10:00 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:
I'm reviewing the rendered docs now, and I noticed sentences like this
are a bit hard to scan: they force the reader to parse a big list of
backend types before even getting to the meat of what this is talking
about. Should we maybe reword this so that the backend list comes at
the end of the sentence? Or maybe even use a list (e.g., like in the
"state" column description in pg_stat_activity)?
Good idea with the bullet points.
For the lengthy lists, I've added bullet point lists to the docs for
several of the columns. It is quite long now but, hopefully, clearer?
Let me know if you think it improves the readability.
Hmm, I should have tried this before suggesting it. I think the lists
break up the flow of the column description too much. What do you
think about the attached (on top of your patches--attaching it as a
.diff to hopefully not confuse cfbot)? I kept the lists for backend
types but inlined the others as a middle ground. I also added a few
omitted periods and reworded "read plus extended" to avoid starting
the sentence with a (lowercase) varname (I think in general it's fine
to do that, but the more complicated sentence structure here makes it
easier to follow if the sentence starts with a capital).
Alternately, what do you think about pulling equivalencies to existing
views out of the main column descriptions, and adding them after the
main table as a sort of footnote? Most view docs don't have anything
like that, but pg_stat_replication does and it might be a good pattern
to follow.
Thoughts?
Thanks for including a patch!
In the attached v39, I've taken your suggestion of flattening some of
the lists and done some rewording as well. I have also moved the note
about equivalence with pg_stat_statements columns to the
pg_stat_statements documentation. The result is quite a bit different
than what I had before, so I would be interested to hear your thoughts.
My concern with the blue "note" section like you mentioned is that it
would be harder to read the lists of backend types than it was in the
tabular format.
+ <varname>io_context</varname>s. When a <quote>Buffer Access
+ Strategy</quote> reuses a buffer in the strategy ring, it must evict its
+ contents, incrementing <varname>reused</varname>. When a <quote>Buffer
+ Access Strategy</quote> adds a new shared buffer to the strategy ring
+ and this shared buffer is occupied, the <quote>Buffer Access
+ Strategy</quote> must evict the contents of the shared buffer,
+ incrementing <varname>evicted</varname>.
I think the parallel phrasing here makes this a little hard to follow.
Specifically, I think "must evict its contents" for the strategy case
sounds like a bad thing, but in fact this is a totally normal thing
that happens as part of strategy access, no? The idea is you probably
won't need that buffer again, so it's fine to evict it. I'm not sure
how to reword, but I think the current phrasing is misleading.
I had trouble rephrasing this. I changed a few words. I see what you
mean. It is worth noting that reusing strategy buffers when there are
buffers on the freelist may not be the best behavior, so I wouldn't
necessarily consider "reused" a good thing. However, I'm not sure how
much the user could really do about this. I would at least like this
phrasing to be clear (evicted is for shared buffers, reused is for
strategy buffers), so perhaps this section requires more work.
Oh, I see. I think the updated wording works better. Although I think
we can drop the quotes around "Buffer Access Strategy" here. They're
useful when defining the term originally, but after that I think it's
clearer to use the term unquoted.
Thanks! I've fixed this.
Just to understand this better myself, though: can you clarify when
"reused" is not a normal, expected part of the strategy execution? I
was under the impression that a ring buffer is used because each page
is needed only "once" (i.e., for one set of operations) for the
command using the strategy ring buffer. Naively, in that situation, it
seems better to reuse a no-longer-needed buffer than to claim another
buffer from the freelist (where other commands may eventually make
better use of it).
You are right: reused is a normal, expected part of strategy
execution. And you are correct: the idea behind reusing existing
strategy buffers instead of taking buffers off the freelist is to leave
those buffers for blocks that we might expect to be accessed more than
once.
In practice, however, if you happen to not be using many shared buffers,
and then do a large COPY, for example, you will end up doing a bunch of
writes (in order to reuse the strategy buffers) that you perhaps didn't
need to do at that time had you leveraged the freelist. I think the
decision about which tradeoff to make is quite contentious, though.
Some more notes on the docs patch:
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>io_context</structfield> <type>text</type>
+ </para>
+ <para>
+ The context or location of an IO operation.
+ </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <varname>io_context</varname> <literal>buffer pool</literal> refers to
+ IO operations on data in both the shared buffer pool and process-local
+ buffer pools used for temporary relation data.
+ </para>
+ <para>
+ Operations on temporary relations are tracked in
+ <varname>io_context</varname> <literal>buffer pool</literal> and
+ <varname>io_object</varname> <literal>temp relation</literal>.
+ </para>
+ <para>
+ Operations on permanent relations are tracked in
+ <varname>io_context</varname> <literal>buffer pool</literal> and
+ <varname>io_object</varname> <literal>relation</literal>.
+ </para>
+ </listitem>
For this column, you repeat "io_context" in the list describing the
possible values of the column. Enum-style columns in other tables
don't do that (e.g., the pg_stat_activty "state" column). I think it
might read better to omit "io_context" from the list.
I changed this.
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>io_object</structfield> <type>text</type>
+ </para>
+ <para>
+ Object operated on in a given <varname>io_context</varname> by a given
+ <varname>backend_type</varname>.
+ </para>
Is this a fixed set of objects we should list, like for io_context?
I've added this.
- Melanie
[1]: /messages/by-id/20221121003815.qnwlnz2lhkow2e5w@awork3.anarazel.de
[2]: /messages/by-id/20221123054329.GG11463@telsasoft.com
Attachments:
v39-0003-Aggregate-IO-operation-stats-per-BackendType.patchapplication/octet-stream; name=v39-0003-Aggregate-IO-operation-stats-per-BackendType.patchDownload
From 21b4564e3c6d0a26382e65f12d73d3d1f4dd13c9 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 29 Nov 2022 18:42:33 -0500
Subject: [PATCH v39 3/4] Aggregate IO operation stats per BackendType
Stats on IOOps on all IOObjects in all IOContexts for a backend are
tracked locally. Add functionality for backends to flush these stats to
shared memory and accumulate them with those from all other backends,
exited and live. Also add reset and snapshot functions used by
cumulative stats system for management of these statistics.
The aggregated stats in shared memory could be extended in the future
with per-backend stats -- useful for per connection IO statistics and
monitoring.
Some BackendTypes will not flush their pending statistics at regular
intervals and explicitly call pgstat_flush_io_ops() during the course of
normal operations to flush their backend-local IO operation statistics
to shared memory in a timely manner.
Because not all BackendType, IOOp, IOObject, IOContext combinations are
valid, the validity of the stats is checked before flushing pending
stats and before reading in the existing stats file to shared memory.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Maciek Sakrejda <m.sakrejda@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
doc/src/sgml/monitoring.sgml | 2 +
src/backend/utils/activity/pgstat.c | 75 ++++++++
src/backend/utils/activity/pgstat_bgwriter.c | 7 +-
.../utils/activity/pgstat_checkpointer.c | 7 +-
src/backend/utils/activity/pgstat_io_ops.c | 167 ++++++++++++++++++
src/backend/utils/activity/pgstat_relation.c | 15 +-
src/backend/utils/activity/pgstat_shmem.c | 4 +
src/backend/utils/activity/pgstat_wal.c | 4 +-
src/backend/utils/adt/pgstatfuncs.c | 4 +-
src/include/miscadmin.h | 2 +
src/include/pgstat.h | 51 ++++++
src/include/utils/pgstat_internal.h | 43 +++++
src/tools/pgindent/typedefs.list | 4 +
13 files changed, 379 insertions(+), 6 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 5579b8b9e0..89fca710db 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -5394,6 +5394,8 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
the <structname>pg_stat_bgwriter</structname>
view, <literal>archiver</literal> to reset all the counters shown in
the <structname>pg_stat_archiver</structname> view,
+ <literal>io</literal> to reset all the counters shown in the
+ <structname>pg_stat_io</structname> view,
<literal>wal</literal> to reset all the counters shown in the
<structname>pg_stat_wal</structname> view or
<literal>recovery_prefetch</literal> to reset all the counters shown
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 1ebe3bbf29..93c9a22061 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -359,6 +359,15 @@ static const PgStat_KindInfo pgstat_kind_infos[PGSTAT_NUM_KINDS] = {
.snapshot_cb = pgstat_checkpointer_snapshot_cb,
},
+ [PGSTAT_KIND_IOOPS] = {
+ .name = "io_ops",
+
+ .fixed_amount = true,
+
+ .reset_all_cb = pgstat_io_ops_reset_all_cb,
+ .snapshot_cb = pgstat_io_ops_snapshot_cb,
+ },
+
[PGSTAT_KIND_SLRU] = {
.name = "slru",
@@ -582,6 +591,7 @@ pgstat_report_stat(bool force)
/* Don't expend a clock check if nothing to do */
if (dlist_is_empty(&pgStatPending) &&
+ !have_ioopstats &&
!have_slrustats &&
!pgstat_have_pending_wal())
{
@@ -628,6 +638,9 @@ pgstat_report_stat(bool force)
/* flush database / relation / function / ... stats */
partial_flush |= pgstat_flush_pending_entries(nowait);
+ /* flush IO Operations stats */
+ partial_flush |= pgstat_flush_io_ops(nowait);
+
/* flush wal stats */
partial_flush |= pgstat_flush_wal(nowait);
@@ -1321,6 +1334,14 @@ pgstat_write_statsfile(void)
pgstat_build_snapshot_fixed(PGSTAT_KIND_CHECKPOINTER);
write_chunk_s(fpout, &pgStatLocal.snapshot.checkpointer);
+ /*
+ * Write IO Operations stats struct
+ */
+ pgstat_build_snapshot_fixed(PGSTAT_KIND_IOOPS);
+ write_chunk_s(fpout, &pgStatLocal.snapshot.io_ops.stat_reset_timestamp);
+ for (int bktype = 0; bktype < BACKEND_NUM_TYPES; bktype++)
+ write_chunk_s(fpout, &pgStatLocal.snapshot.io_ops.stats[bktype]);
+
/*
* Write SLRU stats struct
*/
@@ -1415,6 +1436,46 @@ pgstat_write_statsfile(void)
}
}
+/*
+ * Assert that stats have not been counted for any combination of IOContext,
+ * IOObject, and IOOp which is not valid for the passed-in BackendType. The
+ * passed-in array of PgStat_IOOpCounters must contain stats from the
+ * BackendType specified by the second parameter. Caller is responsible for
+ * locking of the passed-in PgStatShared_IOContextOps, if needed.
+ */
+static void
+pgstat_backend_io_stats_assert_well_formed(PgStatShared_IOContextOps *backend_io_context_ops,
+ BackendType bktype)
+{
+ bool expect_backend_stats = pgstat_io_op_stats_collected(bktype);
+
+ for (IOContext io_context = IOCONTEXT_BULKREAD;
+ io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ {
+ PgStatShared_IOObjectOps *context = &backend_io_context_ops->data[io_context];
+
+ for (IOObject io_object = IOOBJECT_RELATION;
+ io_object < IOOBJECT_NUM_TYPES; io_object++)
+ {
+ PgStat_IOOpCounters *object = &context->data[io_object];
+
+ if (!expect_backend_stats ||
+ !pgstat_bktype_io_context_io_object_valid(bktype, io_context,
+ io_object))
+ {
+ pgstat_io_context_ops_assert_zero(object);
+ continue;
+ }
+
+ for (IOOp io_op = IOOP_EVICT; io_op < IOOP_NUM_TYPES; io_op++)
+ {
+ if (!pgstat_io_op_valid(bktype, io_context, io_object, io_op))
+ pgstat_io_op_assert_zero(object, io_op);
+ }
+ }
+ }
+}
+
/* helpers for pgstat_read_statsfile() */
static bool
read_chunk(FILE *fpin, void *ptr, size_t len)
@@ -1495,6 +1556,20 @@ pgstat_read_statsfile(void)
if (!read_chunk_s(fpin, &shmem->checkpointer.stats))
goto error;
+ /*
+ * Read IO Operations stats struct
+ */
+ if (!read_chunk_s(fpin, &shmem->io_ops.stat_reset_timestamp))
+ goto error;
+
+ for (int bktype = 0; bktype < BACKEND_NUM_TYPES; bktype++)
+ {
+ pgstat_backend_io_stats_assert_well_formed(&shmem->io_ops.stats[bktype],
+ (BackendType) bktype);
+ if (!read_chunk_s(fpin, &shmem->io_ops.stats[bktype].data))
+ goto error;
+ }
+
/*
* Read SLRU stats struct
*/
diff --git a/src/backend/utils/activity/pgstat_bgwriter.c b/src/backend/utils/activity/pgstat_bgwriter.c
index fbb1edc527..3d7f90a1b7 100644
--- a/src/backend/utils/activity/pgstat_bgwriter.c
+++ b/src/backend/utils/activity/pgstat_bgwriter.c
@@ -24,7 +24,7 @@ PgStat_BgWriterStats PendingBgWriterStats = {0};
/*
- * Report bgwriter statistics
+ * Report bgwriter and IO Operation statistics
*/
void
pgstat_report_bgwriter(void)
@@ -56,6 +56,11 @@ pgstat_report_bgwriter(void)
* Clear out the statistics buffer, so it can be re-used.
*/
MemSet(&PendingBgWriterStats, 0, sizeof(PendingBgWriterStats));
+
+ /*
+ * Report IO Operations statistics
+ */
+ pgstat_flush_io_ops(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_checkpointer.c b/src/backend/utils/activity/pgstat_checkpointer.c
index af8d513e7b..cfcf127210 100644
--- a/src/backend/utils/activity/pgstat_checkpointer.c
+++ b/src/backend/utils/activity/pgstat_checkpointer.c
@@ -24,7 +24,7 @@ PgStat_CheckpointerStats PendingCheckpointerStats = {0};
/*
- * Report checkpointer statistics
+ * Report checkpointer and IO Operation statistics
*/
void
pgstat_report_checkpointer(void)
@@ -62,6 +62,11 @@ pgstat_report_checkpointer(void)
* Clear out the statistics buffer, so it can be re-used.
*/
MemSet(&PendingCheckpointerStats, 0, sizeof(PendingCheckpointerStats));
+
+ /*
+ * Report IO Operation statistics
+ */
+ pgstat_flush_io_ops(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_io_ops.c b/src/backend/utils/activity/pgstat_io_ops.c
index 86fc38d28b..1b1c16d9a3 100644
--- a/src/backend/utils/activity/pgstat_io_ops.c
+++ b/src/backend/utils/activity/pgstat_io_ops.c
@@ -20,6 +20,42 @@
#include "utils/pgstat_internal.h"
static PgStat_IOContextOps pending_IOOpStats;
+bool have_ioopstats = false;
+
+
+/*
+ * Helper function to accumulate source PgStat_IOOpCounters into target
+ * PgStat_IOOpCounters. If either of the passed-in PgStat_IOOpCounters are
+ * members of PgStatShared_IOContextOps, the caller is responsible for ensuring
+ * that the appropriate lock is held.
+ */
+static void
+pgstat_accum_io_op(PgStat_IOOpCounters *target, PgStat_IOOpCounters *source, IOOp io_op)
+{
+ switch (io_op)
+ {
+ case IOOP_EVICT:
+ target->evictions += source->evictions;
+ return;
+ case IOOP_EXTEND:
+ target->extends += source->extends;
+ return;
+ case IOOP_FSYNC:
+ target->fsyncs += source->fsyncs;
+ return;
+ case IOOP_READ:
+ target->reads += source->reads;
+ return;
+ case IOOP_REUSE:
+ target->reuses += source->reuses;
+ return;
+ case IOOP_WRITE:
+ target->writes += source->writes;
+ return;
+ }
+
+ elog(ERROR, "unrecognized IOOp value: %d", io_op);
+}
void
pgstat_count_io_op(IOOp io_op, IOObject io_object, IOContext io_context)
@@ -54,6 +90,87 @@ pgstat_count_io_op(IOOp io_op, IOObject io_object, IOContext io_context)
break;
}
+ have_ioopstats = true;
+}
+
+PgStat_BackendIOContextOps *
+pgstat_fetch_backend_io_context_ops(void)
+{
+ pgstat_snapshot_fixed(PGSTAT_KIND_IOOPS);
+
+ return &pgStatLocal.snapshot.io_ops;
+}
+
+/*
+ * Flush out locally pending IO Operation statistics entries
+ *
+ * If no stats have been recorded, this function returns false.
+ *
+ * If nowait is true, this function returns true if the lock could not be
+ * acquired. Otherwise, return false.
+ */
+bool
+pgstat_flush_io_ops(bool nowait)
+{
+ PgStatShared_IOContextOps *type_shstats;
+ bool expect_backend_stats = true;
+
+ if (!have_ioopstats)
+ return false;
+
+ type_shstats =
+ &pgStatLocal.shmem->io_ops.stats[MyBackendType];
+
+ if (!nowait)
+ LWLockAcquire(&type_shstats->lock, LW_EXCLUSIVE);
+ else if (!LWLockConditionalAcquire(&type_shstats->lock, LW_EXCLUSIVE))
+ return true;
+
+ expect_backend_stats = pgstat_io_op_stats_collected(MyBackendType);
+
+ for (IOContext io_context = IOCONTEXT_BULKREAD;
+ io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ {
+ PgStatShared_IOObjectOps *shared_objs = &type_shstats->data[io_context];
+ PgStat_IOObjectOps *pending_objs = &pending_IOOpStats.data[io_context];
+
+ for (IOObject io_object = IOOBJECT_RELATION;
+ io_object < IOOBJECT_NUM_TYPES; io_object++)
+ {
+ PgStat_IOOpCounters *sharedent = &shared_objs->data[io_object];
+ PgStat_IOOpCounters *pendingent = &pending_objs->data[io_object];
+
+ if (!expect_backend_stats ||
+ !pgstat_bktype_io_context_io_object_valid(MyBackendType,
+ io_context, io_object))
+ {
+ pgstat_io_context_ops_assert_zero(sharedent);
+ pgstat_io_context_ops_assert_zero(pendingent);
+ continue;
+ }
+
+ for (IOOp io_op = IOOP_EVICT; io_op < IOOP_NUM_TYPES; io_op++)
+ {
+ if (!pgstat_io_op_valid(MyBackendType, io_context, io_object,
+ io_op))
+ {
+ pgstat_io_op_assert_zero(sharedent, io_op);
+ pgstat_io_op_assert_zero(pendingent, io_op);
+ continue;
+ }
+
+ pgstat_accum_io_op(sharedent, pendingent, io_op);
+ }
+ }
+ }
+
+ LWLockRelease(&type_shstats->lock);
+
+ memset(&pending_IOOpStats, 0, sizeof(pending_IOOpStats));
+
+ have_ioopstats = false;
+
+ return false;
}
const char *
@@ -116,6 +233,56 @@ pgstat_io_op_desc(IOOp io_op)
pg_unreachable();
}
+void
+pgstat_io_ops_reset_all_cb(TimestampTz ts)
+{
+ PgStatShared_BackendIOContextOps *backends_stats_shmem = &pgStatLocal.shmem->io_ops;
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ PgStatShared_IOContextOps *stats_shmem = &backends_stats_shmem->stats[i];
+
+ LWLockAcquire(&stats_shmem->lock, LW_EXCLUSIVE);
+
+ /*
+ * Use the lock in the first BackendType's PgStatShared_IOContextOps to
+ * protect the reset timestamp as well.
+ */
+ if (i == 0)
+ backends_stats_shmem->stat_reset_timestamp = ts;
+
+ memset(stats_shmem->data, 0, sizeof(stats_shmem->data));
+ LWLockRelease(&stats_shmem->lock);
+ }
+}
+
+void
+pgstat_io_ops_snapshot_cb(void)
+{
+ PgStatShared_BackendIOContextOps *backends_stats_shmem = &pgStatLocal.shmem->io_ops;
+ PgStat_BackendIOContextOps *backends_stats_snap = &pgStatLocal.snapshot.io_ops;
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ PgStatShared_IOContextOps *stats_shmem = &backends_stats_shmem->stats[i];
+ PgStat_IOContextOps *stats_snap = &backends_stats_snap->stats[i];
+
+ LWLockAcquire(&stats_shmem->lock, LW_SHARED);
+
+ /*
+ * Use the lock in the first BackendType's PgStatShared_IOContextOps to
+ * protect the reset timestamp as well.
+ */
+ if (i == 0)
+ backends_stats_snap->stat_reset_timestamp =
+ backends_stats_shmem->stat_reset_timestamp;
+
+ memcpy(stats_snap->data, stats_shmem->data, sizeof(stats_shmem->data));
+ LWLockRelease(&stats_shmem->lock);
+ }
+
+}
+
/*
* IO Operation statistics are not collected for all BackendTypes.
*
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index f92e16e7af..1c84e1a5f0 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -206,7 +206,7 @@ pgstat_drop_relation(Relation rel)
}
/*
- * Report that the table was just vacuumed.
+ * Report that the table was just vacuumed and flush IO Operation statistics.
*/
void
pgstat_report_vacuum(Oid tableoid, bool shared,
@@ -258,10 +258,18 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
}
pgstat_unlock_entry(entry_ref);
+
+ /*
+ * Flush IO Operation statistics now. pgstat_report_stat() will flush IO
+ * Operation stats; however, it will not be called until after an entire
+ * autovacuum cycle is done -- which will likely vacuum many relations --
+ * or until the VACUUM command has processed all tables and committed.
+ */
+ pgstat_flush_io_ops(false);
}
/*
- * Report that the table was just analyzed.
+ * Report that the table was just analyzed and flush IO Operation statistics.
*
* Caller must provide new live- and dead-tuples estimates, as well as a
* flag indicating whether to reset the changes_since_analyze counter.
@@ -341,6 +349,9 @@ pgstat_report_analyze(Relation rel,
}
pgstat_unlock_entry(entry_ref);
+
+ /* see pgstat_report_vacuum() */
+ pgstat_flush_io_ops(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_shmem.c b/src/backend/utils/activity/pgstat_shmem.c
index 706692862c..4251079ae1 100644
--- a/src/backend/utils/activity/pgstat_shmem.c
+++ b/src/backend/utils/activity/pgstat_shmem.c
@@ -202,6 +202,10 @@ StatsShmemInit(void)
LWLockInitialize(&ctl->checkpointer.lock, LWTRANCHE_PGSTATS_DATA);
LWLockInitialize(&ctl->slru.lock, LWTRANCHE_PGSTATS_DATA);
LWLockInitialize(&ctl->wal.lock, LWTRANCHE_PGSTATS_DATA);
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ LWLockInitialize(&ctl->io_ops.stats[i].lock,
+ LWTRANCHE_PGSTATS_DATA);
}
else
{
diff --git a/src/backend/utils/activity/pgstat_wal.c b/src/backend/utils/activity/pgstat_wal.c
index 5a878bd115..9cac407b42 100644
--- a/src/backend/utils/activity/pgstat_wal.c
+++ b/src/backend/utils/activity/pgstat_wal.c
@@ -34,7 +34,7 @@ static WalUsage prevWalUsage;
/*
* Calculate how much WAL usage counters have increased and update
- * shared statistics.
+ * shared WAL and IO Operation statistics.
*
* Must be called by processes that generate WAL, that do not call
* pgstat_report_stat(), like walwriter.
@@ -43,6 +43,8 @@ void
pgstat_report_wal(bool force)
{
pgstat_flush_wal(force);
+
+ pgstat_flush_io_ops(force);
}
/*
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index ae3365d917..a135cad0ce 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -2090,6 +2090,8 @@ pg_stat_reset_shared(PG_FUNCTION_ARGS)
pgstat_reset_of_kind(PGSTAT_KIND_BGWRITER);
pgstat_reset_of_kind(PGSTAT_KIND_CHECKPOINTER);
}
+ else if (strcmp(target, "io") == 0)
+ pgstat_reset_of_kind(PGSTAT_KIND_IOOPS);
else if (strcmp(target, "recovery_prefetch") == 0)
XLogPrefetchResetStats();
else if (strcmp(target, "wal") == 0)
@@ -2098,7 +2100,7 @@ pg_stat_reset_shared(PG_FUNCTION_ARGS)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("unrecognized reset target: \"%s\"", target),
- errhint("Target must be \"archiver\", \"bgwriter\", \"recovery_prefetch\", or \"wal\".")));
+ errhint("Target must be \"archiver\", \"bgwriter\", \"io\", \"recovery_prefetch\", or \"wal\".")));
PG_RETURN_VOID();
}
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 795182fa51..aa14338221 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -331,6 +331,8 @@ typedef enum BackendType
B_WAL_WRITER,
} BackendType;
+#define BACKEND_NUM_TYPES (B_WAL_WRITER + 1)
+
extern PGDLLIMPORT BackendType MyBackendType;
extern const char *GetBackendTypeDesc(BackendType backendType);
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 8cbec3f59e..7521866519 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -49,6 +49,7 @@ typedef enum PgStat_Kind
PGSTAT_KIND_ARCHIVER,
PGSTAT_KIND_BGWRITER,
PGSTAT_KIND_CHECKPOINTER,
+ PGSTAT_KIND_IOOPS,
PGSTAT_KIND_SLRU,
PGSTAT_KIND_WAL,
} PgStat_Kind;
@@ -334,6 +335,12 @@ typedef struct PgStat_IOContextOps
PgStat_IOObjectOps data[IOCONTEXT_NUM_TYPES];
} PgStat_IOContextOps;
+typedef struct PgStat_BackendIOContextOps
+{
+ TimestampTz stat_reset_timestamp;
+ PgStat_IOContextOps stats[BACKEND_NUM_TYPES];
+} PgStat_BackendIOContextOps;
+
typedef struct PgStat_StatDBEntry
{
PgStat_Counter n_xact_commit;
@@ -516,6 +523,7 @@ extern PgStat_CheckpointerStats *pgstat_fetch_stat_checkpointer(void);
*/
extern void pgstat_count_io_op(IOOp io_op, IOObject io_object, IOContext io_context);
+extern PgStat_BackendIOContextOps *pgstat_fetch_backend_io_context_ops(void);
extern const char *pgstat_io_context_desc(IOContext io_context);
extern const char *pgstat_io_object_desc(IOObject io_object);
extern const char *pgstat_io_op_desc(IOOp io_op);
@@ -532,6 +540,49 @@ extern bool pgstat_expect_io_op(BackendType bktype,
/* IO stats translation function in freelist.c */
extern IOContext IOContextForStrategy(BufferAccessStrategy bas);
+/*
+ * Functions to assert that invalid IO Operation counters are zero.
+ */
+static inline void
+pgstat_io_context_ops_assert_zero(PgStat_IOOpCounters *counters)
+{
+ Assert(counters->evictions == 0);
+ Assert(counters->extends == 0);
+ Assert(counters->fsyncs == 0);
+ Assert(counters->reads == 0);
+ Assert(counters->reuses == 0);
+ Assert(counters->writes == 0);
+}
+
+static inline void
+pgstat_io_op_assert_zero(PgStat_IOOpCounters *counters, IOOp io_op)
+{
+ switch (io_op)
+ {
+ case IOOP_EVICT:
+ Assert(counters->evictions == 0);
+ return;
+ case IOOP_EXTEND:
+ Assert(counters->extends == 0);
+ return;
+ case IOOP_FSYNC:
+ Assert(counters->fsyncs == 0);
+ return;
+ case IOOP_READ:
+ Assert(counters->reads == 0);
+ return;
+ case IOOP_REUSE:
+ Assert(counters->reuses == 0);
+ return;
+ case IOOP_WRITE:
+ Assert(counters->writes == 0);
+ return;
+ }
+
+ /* Should not reach here */
+ Assert(false);
+}
+
/*
* Functions in pgstat_database.c
diff --git a/src/include/utils/pgstat_internal.h b/src/include/utils/pgstat_internal.h
index e2c7b59324..96962a2405 100644
--- a/src/include/utils/pgstat_internal.h
+++ b/src/include/utils/pgstat_internal.h
@@ -329,6 +329,31 @@ typedef struct PgStatShared_Checkpointer
PgStat_CheckpointerStats reset_offset;
} PgStatShared_Checkpointer;
+
+typedef struct PgStatShared_IOObjectOps
+{
+ PgStat_IOOpCounters data[IOOBJECT_NUM_TYPES];
+} PgStatShared_IOObjectOps;
+
+typedef struct PgStatShared_IOContextOps
+{
+ /*
+ * lock protects ->data. If this PgStatShared_IOContextOps is
+ * PgStatShared_BackendIOContextOps->stats[0], lock also protects
+ * PgStatShared_BackendIOContextOps->stat_reset_timestamp.
+ */
+ LWLock lock;
+ PgStatShared_IOObjectOps data[IOCONTEXT_NUM_TYPES];
+} PgStatShared_IOContextOps;
+
+typedef struct PgStatShared_BackendIOContextOps
+{
+ /* ->stat_reset_timestamp is protected by ->stats[0].lock */
+ TimestampTz stat_reset_timestamp;
+ PgStatShared_IOContextOps stats[BACKEND_NUM_TYPES];
+} PgStatShared_BackendIOContextOps;
+
+
typedef struct PgStatShared_SLRU
{
/* lock protects ->stats */
@@ -419,6 +444,7 @@ typedef struct PgStat_ShmemControl
PgStatShared_Archiver archiver;
PgStatShared_BgWriter bgwriter;
PgStatShared_Checkpointer checkpointer;
+ PgStatShared_BackendIOContextOps io_ops;
PgStatShared_SLRU slru;
PgStatShared_Wal wal;
} PgStat_ShmemControl;
@@ -442,6 +468,8 @@ typedef struct PgStat_Snapshot
PgStat_CheckpointerStats checkpointer;
+ PgStat_BackendIOContextOps io_ops;
+
PgStat_SLRUStats slru[SLRU_NUM_ELEMENTS];
PgStat_WalStats wal;
@@ -549,6 +577,16 @@ extern void pgstat_database_reset_timestamp_cb(PgStatShared_Common *header, Time
extern bool pgstat_function_flush_cb(PgStat_EntryRef *entry_ref, bool nowait);
+/*
+ * Functions in pgstat_io_ops.c
+ */
+
+extern void pgstat_io_ops_reset_all_cb(TimestampTz ts);
+extern void pgstat_io_ops_snapshot_cb(void);
+extern bool pgstat_flush_io_ops(bool nowait);
+
+
+
/*
* Functions in pgstat_relation.c
*/
@@ -641,6 +679,11 @@ extern void pgstat_create_transactional(PgStat_Kind kind, Oid dboid, Oid objoid)
extern PGDLLIMPORT PgStat_LocalState pgStatLocal;
+/*
+ * Variables in pgstat_io_ops.c
+ */
+
+extern PGDLLIMPORT bool have_ioopstats;
/*
* Variables in pgstat_slru.c
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 187828eb90..b16e2cc8da 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2006,12 +2006,15 @@ PgFdwRelationInfo
PgFdwScanState
PgIfAddrCallback
PgStatShared_Archiver
+PgStatShared_BackendIOContextOps
PgStatShared_BgWriter
PgStatShared_Checkpointer
PgStatShared_Common
PgStatShared_Database
PgStatShared_Function
PgStatShared_HashEntry
+PgStatShared_IOContextOps
+PgStatShared_IOObjectOps
PgStatShared_Relation
PgStatShared_ReplSlot
PgStatShared_SLRU
@@ -2019,6 +2022,7 @@ PgStatShared_Subscription
PgStatShared_Wal
PgStat_ArchiverStats
PgStat_BackendFunctionEntry
+PgStat_BackendIOContextOps
PgStat_BackendSubEntry
PgStat_BgWriterStats
PgStat_CheckpointerStats
--
2.38.1
Attachment: v39-0002-Track-IO-operation-statistics-locally.patch (application/octet-stream)
From 6e05fe70e53e893b34464e9f5a949ccdc439ca22 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 29 Nov 2022 18:42:23 -0500
Subject: [PATCH v39 2/4] Track IO operation statistics locally
Introduce "IOOp", an IO operation done by a backend, "IOObject", the
target object of the IO, and "IOContext", the context or location of the
IO operations on that object. For example, the checkpointer may write a
shared buffer out. This would be counted as an IOOp "written" on an
IOObject IOOBJECT_RELATION in IOContext IOCONTEXT_BUFFER_POOL by
BackendType "checkpointer".
Each IOOp (evict, reuse, read, write, extend, and fsync) is counted per
IOObject (relation, temp relation) per IOContext (bulkread, bulkwrite,
buffer pool, or vacuum) through a call to pgstat_count_io_op().
The primary concern of these statistics is IO operations on data blocks
during the course of normal database operations. IO operations done by,
for example, the archiver or syslogger are not counted in these
statistics. WAL IO, temporary file IO, and IO done directly though smgr*
functions (such as when building an index) are not yet counted but would
be useful future additions.
IOContext IOCONTEXT_BUFFER_POOL concerns operations on local and shared
buffers.
The IOCONTEXT_BULKREAD, IOCONTEXT_BULKWRITE, and IOCONTEXT_VACUUM
IOContexts concern IO operations on buffers as part of a
BufferAccessStrategy.
IOOP_EVICT IOOps are counted in IOCONTEXT_BUFFER_POOL when a buffer is
acquired or allocated through [Local]BufferAlloc() and no
BufferAccessStrategy is in use.
When a BufferAccessStrategy is in use, shared buffers added to the
strategy ring are counted as IOOP_EVICT IOOps in the
IOCONTEXT_[BULKREAD|BULKWRITE|VACUUM] IOContext. When one of these
buffers is reused, it is counted as an IOOP_REUSE IOOp in the
corresponding strategy IOContext.
IOOP_WRITE IOOps are counted in the BufferAccessStrategy IOContexts
whenever the reused dirty buffer is written out.
IO Operations on buffers containing temporary table data are counted as
operations on IOOBJECT_TEMP_RELATION IOObjects.
Stats on IOOps in all IOContexts for a given backend are counted in a
backend's local memory. A subsequent commit will expose functions for
aggregating and viewing these stats.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Maciek Sakrejda <m.sakrejda@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
src/backend/postmaster/checkpointer.c | 13 +
src/backend/storage/buffer/bufmgr.c | 94 ++++++-
src/backend/storage/buffer/freelist.c | 37 ++-
src/backend/storage/buffer/localbuf.c | 4 +
src/backend/storage/sync/sync.c | 2 +
src/backend/utils/activity/Makefile | 1 +
src/backend/utils/activity/meson.build | 1 +
src/backend/utils/activity/pgstat_io_ops.c | 281 +++++++++++++++++++++
src/include/pgstat.h | 80 ++++++
src/include/storage/bufmgr.h | 7 +-
src/tools/pgindent/typedefs.list | 6 +
11 files changed, 515 insertions(+), 11 deletions(-)
create mode 100644 src/backend/utils/activity/pgstat_io_ops.c
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 5fc076fc14..79e84b5a7a 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -1116,6 +1116,19 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
if (!AmBackgroundWriterProcess())
CheckpointerShmem->num_backend_fsync++;
LWLockRelease(CheckpointerCommLock);
+
+ /*
+ * We have no way of knowing if the current IOContext is
+ * IOCONTEXT_BUFFER_POOL or IOCONTEXT_[BULKREAD, BULKWRITE, VACUUM] at
+ * this point, so count the fsync as being in the
+ * IOCONTEXT_BUFFER_POOL IOContext. This is probably okay, because the
+ * number of backend fsyncs doesn't say anything about the efficacy of
+ * the BufferAccessStrategy. And counting both fsyncs done in
+ * IOCONTEXT_BUFFER_POOL and IOCONTEXT_[BULKREAD, BULKWRITE, VACUUM]
+ * under IOCONTEXT_BUFFER_POOL is likely clearer when investigating
+ * the number of backend fsyncs.
+ */
+ pgstat_count_io_op(IOOP_FSYNC, IOOBJECT_RELATION, IOCONTEXT_BUFFER_POOL);
return false;
}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index fa32f24e19..d824764850 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -482,7 +482,8 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
BlockNumber blockNum,
BufferAccessStrategy strategy,
bool *foundPtr);
-static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+static void FlushBuffer(BufferDesc *buf, SMgrRelation reln,
+ IOContext io_context, IOObject io_object);
static void FindAndDropRelationBuffers(RelFileLocator rlocator,
ForkNumber forkNum,
BlockNumber nForkBlock,
@@ -823,6 +824,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
BufferDesc *bufHdr;
Block bufBlock;
bool found;
+ IOContext io_context;
+ IOObject io_object;
bool isExtend;
bool isLocalBuf = SmgrIsTemp(smgr);
@@ -986,10 +989,28 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
*/
Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID)); /* spinlock not needed */
- bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
+ if (isLocalBuf)
+ {
+ bufBlock = LocalBufHdrGetBlock(bufHdr);
+
+ /*
+ * Though a strategy object may be passed in, no strategy is employed
+ * when using local buffers. This could happen when doing, for
+ * example, CREATE TEMPORARY TABLE AS ...
+ */
+ io_context = IOCONTEXT_BUFFER_POOL;
+ io_object = IOOBJECT_TEMP_RELATION;
+ }
+ else
+ {
+ bufBlock = BufHdrGetBlock(bufHdr);
+ io_context = IOContextForStrategy(strategy);
+ io_object = IOOBJECT_RELATION;
+ }
if (isExtend)
{
+ pgstat_count_io_op(IOOP_EXTEND, io_object, io_context);
/* new buffers are zero-filled */
MemSet((char *) bufBlock, 0, BLCKSZ);
/* don't set checksum for all-zero page */
@@ -1015,6 +1036,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
instr_time io_start,
io_time;
+ pgstat_count_io_op(IOOP_READ, io_object, io_context);
+
if (track_io_timing)
INSTR_TIME_SET_CURRENT(io_start);
@@ -1122,6 +1145,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
bool *foundPtr)
{
bool from_ring;
+ IOContext io_context;
BufferTag newTag; /* identity of requested block */
uint32 newHash; /* hash value for newTag */
LWLock *newPartitionLock; /* buffer partition lock for it */
@@ -1188,9 +1212,12 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
*/
LWLockRelease(newPartitionLock);
+ io_context = IOContextForStrategy(strategy);
+
/* Loop here in case we have to try another victim buffer */
for (;;)
{
+
/*
* Ensure, while the spinlock's not yet held, that there's a free
* refcount entry.
@@ -1264,13 +1291,35 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
}
}
+ /*
+ * When a strategy is in use, only flushes of dirty buffers
+ * already in the strategy ring are counted as strategy writes
+ * (IOCONTEXT [BULKREAD|BULKWRITE|VACUUM] IOOP_WRITE) for the
+ * purpose of IO operation statistics tracking.
+ *
+ * If a shared buffer initially added to the ring must be
+ * flushed before being used, this is counted as an
+ * IOCONTEXT_BUFFER_POOL IOOP_WRITE.
+ *
+ * If a shared buffer which was added to the ring later
+ * because the current strategy buffer is pinned or in use or
+ * because all strategy buffers were dirty and rejected (for
+ * BAS_BULKREAD operations only) requires flushing, this is
+ * counted as an IOCONTEXT_BUFFER_POOL IOOP_WRITE (from_ring
+ * will be false).
+ *
+ * When a strategy is not in use, the write can only be a
+ * "regular" write of a dirty shared buffer
+ * (IOCONTEXT_BUFFER_POOL IOOP_WRITE).
+ */
+
/* OK, do the I/O */
TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(forkNum, blockNum,
smgr->smgr_rlocator.locator.spcOid,
smgr->smgr_rlocator.locator.dbOid,
smgr->smgr_rlocator.locator.relNumber);
- FlushBuffer(buf, NULL);
+ FlushBuffer(buf, NULL, io_context, IOOBJECT_RELATION);
LWLockRelease(BufferDescriptorGetContentLock(buf));
ScheduleBufferTagForWriteback(&BackendWritebackContext,
@@ -1442,6 +1491,29 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
UnlockBufHdr(buf, buf_state);
+ if (oldFlags & BM_VALID)
+ {
+ /*
+ * When a BufferAccessStrategy is in use, blocks evicted from shared
+ * buffers are counted as IOOP_EVICT IO Operations in the corresponding
+ * context (e.g. IOCONTEXT_BULKWRITE). Shared buffers are evicted by a
+ * strategy in two cases: 1) while initially claiming buffers for the
+ * strategy ring 2) to replace an existing strategy ring buffer because
+ * it is pinned or in use and cannot be reused.
+ *
+ * Blocks evicted from buffers already in the strategy ring are
+ * counted as IOOP_REUSE IO Operations in the corresponding strategy
+ * context.
+ *
+ * At this point, we can accurately count evictions and reuses,
+ * because we have successfully claimed the valid buffer. Previously,
+ * we may have been forced to release the buffer due to concurrent
+ * pinners or erroring out.
+ */
+ pgstat_count_io_op(from_ring ? IOOP_REUSE : IOOP_EVICT,
+ IOOBJECT_RELATION, io_context);
+ }
+
if (oldPartitionLock != NULL)
{
BufTableDelete(&oldTag, oldHash);
@@ -2571,7 +2643,7 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOCONTEXT_BUFFER_POOL, IOOBJECT_RELATION);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
@@ -2821,7 +2893,7 @@ BufferGetTag(Buffer buffer, RelFileLocator *rlocator, ForkNumber *forknum,
* as the second parameter. If not, pass NULL.
*/
static void
-FlushBuffer(BufferDesc *buf, SMgrRelation reln)
+FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOContext io_context, IOObject io_object)
{
XLogRecPtr recptr;
ErrorContextCallback errcallback;
@@ -2901,6 +2973,8 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
*/
bufToWrite = PageSetChecksumCopy((Page) bufBlock, buf->tag.blockNum);
+ pgstat_count_io_op(IOOP_WRITE, IOOBJECT_RELATION, io_context);
+
if (track_io_timing)
INSTR_TIME_SET_CURRENT(io_start);
@@ -3552,6 +3626,8 @@ FlushRelationBuffers(Relation rel)
localpage,
false);
+ pgstat_count_io_op(IOOP_WRITE, IOOBJECT_TEMP_RELATION, IOCONTEXT_BUFFER_POOL);
+
buf_state &= ~(BM_DIRTY | BM_JUST_DIRTIED);
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
@@ -3587,7 +3663,7 @@ FlushRelationBuffers(Relation rel)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, RelationGetSmgr(rel));
+ FlushBuffer(bufHdr, RelationGetSmgr(rel), IOCONTEXT_BUFFER_POOL, IOOBJECT_RELATION);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr);
}
@@ -3685,7 +3761,7 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, srelent->srel);
+ FlushBuffer(bufHdr, srelent->srel, IOCONTEXT_BUFFER_POOL, IOOBJECT_RELATION);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr);
}
@@ -3895,7 +3971,7 @@ FlushDatabaseBuffers(Oid dbid)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOCONTEXT_BUFFER_POOL, IOOBJECT_RELATION);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr);
}
@@ -3922,7 +3998,7 @@ FlushOneBuffer(Buffer buffer)
Assert(LWLockHeldByMe(BufferDescriptorGetContentLock(bufHdr)));
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOCONTEXT_BUFFER_POOL, IOOBJECT_RELATION);
}
/*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 5299bb8711..c40e6662dc 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -15,6 +15,7 @@
*/
#include "postgres.h"
+#include "pgstat.h"
#include "port/atomics.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
@@ -601,7 +602,7 @@ FreeAccessStrategy(BufferAccessStrategy strategy)
/*
* GetBufferFromRing -- returns a buffer from the ring, or NULL if the
- * ring is empty.
+ * ring is empty / not usable.
*
* The bufhdr spin lock is held on the returned buffer.
*/
@@ -664,6 +665,40 @@ AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
strategy->buffers[strategy->current] = BufferDescriptorGetBuffer(buf);
}
+/*
+ * Utility function returning the IOContext of a given BufferAccessStrategy's
+ * strategy ring.
+ */
+IOContext
+IOContextForStrategy(BufferAccessStrategy strategy)
+{
+ if (!strategy)
+ return IOCONTEXT_BUFFER_POOL;
+
+ switch (strategy->btype)
+ {
+ case BAS_NORMAL:
+
+ /*
+ * Currently, GetAccessStrategy() returns NULL for
+ * BufferAccessStrategyType BAS_NORMAL, so this case is
+ * unreachable.
+ */
+ pg_unreachable();
+ return IOCONTEXT_BUFFER_POOL;
+ case BAS_BULKREAD:
+ return IOCONTEXT_BULKREAD;
+ case BAS_BULKWRITE:
+ return IOCONTEXT_BULKWRITE;
+ case BAS_VACUUM:
+ return IOCONTEXT_VACUUM;
+ }
+
+ elog(ERROR, "unrecognized BufferAccessStrategyType: %d", strategy->btype);
+
+ pg_unreachable();
+}
+
/*
* StrategyRejectBuffer -- consider rejecting a dirty buffer
*
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 30d67d1c40..5cfb531bb2 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -18,6 +18,7 @@
#include "access/parallel.h"
#include "catalog/catalog.h"
#include "executor/instrument.h"
+#include "pgstat.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "utils/guc_hooks.h"
@@ -226,6 +227,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
localpage,
false);
+ pgstat_count_io_op(IOOP_WRITE, IOOBJECT_TEMP_RELATION, IOCONTEXT_BUFFER_POOL);
+
/* Mark not-dirty now in case we error out below */
buf_state &= ~BM_DIRTY;
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
@@ -256,6 +259,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
ClearBufferTag(&bufHdr->tag);
buf_state &= ~(BM_VALID | BM_TAG_VALID);
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+ pgstat_count_io_op(IOOP_EVICT, IOOBJECT_TEMP_RELATION, IOCONTEXT_BUFFER_POOL);
}
hresult = (LocalBufferLookupEnt *)
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 9d6a9e9109..a1bb1cef54 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -432,6 +432,8 @@ ProcessSyncRequests(void)
total_elapsed += elapsed;
processed++;
+ pgstat_count_io_op(IOOP_FSYNC, IOOBJECT_RELATION, IOCONTEXT_BUFFER_POOL);
+
if (log_checkpoints)
elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f ms",
processed,
diff --git a/src/backend/utils/activity/Makefile b/src/backend/utils/activity/Makefile
index a2e8507fd6..0098785089 100644
--- a/src/backend/utils/activity/Makefile
+++ b/src/backend/utils/activity/Makefile
@@ -22,6 +22,7 @@ OBJS = \
pgstat_checkpointer.o \
pgstat_database.o \
pgstat_function.o \
+ pgstat_io_ops.o \
pgstat_relation.o \
pgstat_replslot.o \
pgstat_shmem.o \
diff --git a/src/backend/utils/activity/meson.build b/src/backend/utils/activity/meson.build
index 5b3b558a67..1038324c32 100644
--- a/src/backend/utils/activity/meson.build
+++ b/src/backend/utils/activity/meson.build
@@ -7,6 +7,7 @@ backend_sources += files(
'pgstat_checkpointer.c',
'pgstat_database.c',
'pgstat_function.c',
+ 'pgstat_io_ops.c',
'pgstat_relation.c',
'pgstat_replslot.c',
'pgstat_shmem.c',
diff --git a/src/backend/utils/activity/pgstat_io_ops.c b/src/backend/utils/activity/pgstat_io_ops.c
new file mode 100644
index 0000000000..86fc38d28b
--- /dev/null
+++ b/src/backend/utils/activity/pgstat_io_ops.c
@@ -0,0 +1,281 @@
+/* -------------------------------------------------------------------------
+ *
+ * pgstat_io_ops.c
+ * Implementation of IO operation statistics.
+ *
+ * This file contains the implementation of IO operation statistics. It is kept
+ * separate from pgstat.c to enforce the line between the statistics access /
+ * storage implementation and the details about individual types of
+ * statistics.
+ *
+ * Copyright (c) 2021-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/utils/activity/pgstat_io_ops.c
+ * -------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "utils/pgstat_internal.h"
+
+static PgStat_IOContextOps pending_IOOpStats;
+
+void
+pgstat_count_io_op(IOOp io_op, IOObject io_object, IOContext io_context)
+{
+ PgStat_IOOpCounters *pending_counters;
+
+ Assert(io_context < IOCONTEXT_NUM_TYPES);
+ Assert(io_op < IOOP_NUM_TYPES);
+ Assert(pgstat_expect_io_op(MyBackendType, io_context, io_object, io_op));
+
+ pending_counters = &pending_IOOpStats.data[io_context].data[io_object];
+
+ switch (io_op)
+ {
+ case IOOP_EVICT:
+ pending_counters->evictions++;
+ break;
+ case IOOP_EXTEND:
+ pending_counters->extends++;
+ break;
+ case IOOP_FSYNC:
+ pending_counters->fsyncs++;
+ break;
+ case IOOP_READ:
+ pending_counters->reads++;
+ break;
+ case IOOP_REUSE:
+ pending_counters->reuses++;
+ break;
+ case IOOP_WRITE:
+ pending_counters->writes++;
+ break;
+ }
+
+}
+
+const char *
+pgstat_io_context_desc(IOContext io_context)
+{
+ switch (io_context)
+ {
+ case IOCONTEXT_BULKREAD:
+ return "bulkread";
+ case IOCONTEXT_BULKWRITE:
+ return "bulkwrite";
+ case IOCONTEXT_BUFFER_POOL:
+ return "buffer pool";
+ case IOCONTEXT_VACUUM:
+ return "vacuum";
+ }
+
+ elog(ERROR, "unrecognized IOContext value: %d", io_context);
+
+ pg_unreachable();
+}
+
+const char *
+pgstat_io_object_desc(IOObject io_object)
+{
+ switch (io_object)
+ {
+ case IOOBJECT_RELATION:
+ return "relation";
+ case IOOBJECT_TEMP_RELATION:
+ return "temp relation";
+ }
+
+ elog(ERROR, "unrecognized IOObject value: %d", io_object);
+
+ pg_unreachable();
+}
+
+const char *
+pgstat_io_op_desc(IOOp io_op)
+{
+ switch (io_op)
+ {
+ case IOOP_EVICT:
+ return "evicted";
+ case IOOP_EXTEND:
+ return "extended";
+ case IOOP_FSYNC:
+ return "files synced";
+ case IOOP_READ:
+ return "read";
+ case IOOP_REUSE:
+ return "reused";
+ case IOOP_WRITE:
+ return "written";
+ }
+
+ elog(ERROR, "unrecognized IOOp value: %d", io_op);
+
+ pg_unreachable();
+}
+
+/*
+* IO Operation statistics are not collected for all BackendTypes.
+*
+* The following BackendTypes do not participate in the cumulative stats
+* subsystem or do not perform IO operations on which we currently report:
+* - Syslogger because it is not connected to shared memory
+* - Archiver because most relevant archiving IO is delegated to a
+* specialized command or module
+* - WAL Receiver and WAL Writer IO is not tracked in pg_stat_io for now
+*
+* Function returns true if BackendType participates in the cumulative stats
+* subsystem for IO Operations and false if it does not.
+*/
+bool
+pgstat_io_op_stats_collected(BackendType bktype)
+{
+ switch (bktype)
+ {
+ case B_INVALID:
+ case B_ARCHIVER:
+ case B_LOGGER:
+ case B_WAL_RECEIVER:
+ case B_WAL_WRITER:
+ return false;
+ default:
+ return true;
+ }
+}
+
+
+/*
+ * Some BackendTypes do not perform IO operations in certain IOContexts. Some
+ * IOObjects are never operated on in some IOContexts. Check that the given
+ * BackendType is expected to do IO in the given IOContext and that the given
+ * IOObject is expected to be operated on in the given IOContext.
+ */
+bool
+pgstat_bktype_io_context_io_object_valid(BackendType bktype,
+ IOContext io_context, IOObject io_object)
+{
+ bool no_temp_rel;
+
+ /*
+ * Currently, IO operations on temporary relations can only occur in the
+ * IOCONTEXT_BUFFER_POOL IOContext.
+ */
+ if (io_context != IOCONTEXT_BUFFER_POOL &&
+ io_object == IOOBJECT_TEMP_RELATION)
+ return false;
+
+ /*
+ * In core Postgres, only regular backends and WAL Sender processes
+ * executing queries will use local buffers and operate on temporary
+ * relations. Parallel workers will not use local buffers (see
+ * InitLocalBuffers()); however, extensions leveraging background workers
+ * have no such limitation, so track IO Operations on
+ * IOOBJECT_TEMP_RELATION for BackendType B_BG_WORKER.
+ */
+ no_temp_rel = bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER ||
+ bktype == B_CHECKPOINTER || bktype == B_AUTOVAC_WORKER ||
+ bktype == B_STANDALONE_BACKEND || bktype == B_STARTUP;
+
+ if (no_temp_rel && io_context == IOCONTEXT_BUFFER_POOL &&
+ io_object == IOOBJECT_TEMP_RELATION)
+ return false;
+
+ /*
+ * Some BackendTypes do not currently perform any IO operations in certain
+ * IOContexts, and, while it may not be inherently incorrect for them to
+ * do so, excluding those rows from the view makes the view easier to use.
+ */
+ if ((bktype == B_CHECKPOINTER || bktype == B_BG_WRITER) &&
+ (io_context == IOCONTEXT_BULKREAD ||
+ io_context == IOCONTEXT_BULKWRITE ||
+ io_context == IOCONTEXT_VACUUM))
+ return false;
+
+ if (bktype == B_AUTOVAC_LAUNCHER && io_context == IOCONTEXT_VACUUM)
+ return false;
+
+ if ((bktype == B_AUTOVAC_WORKER || bktype == B_AUTOVAC_LAUNCHER) &&
+ io_context == IOCONTEXT_BULKWRITE)
+ return false;
+
+ return true;
+}
+
+/*
+ * Some BackendTypes will never do certain IOOps and some IOOps should not
+ * occur in certain IOContexts. Check that the given IOOp is valid for the
+ * given BackendType in the given IOContext. Note that there are currently no
+ * cases of an IOOp being invalid for a particular BackendType only within a
+ * certain IOContext.
+ */
+bool
+pgstat_io_op_valid(BackendType bktype, IOContext io_context, IOObject io_object, IOOp io_op)
+{
+ bool strategy_io_context;
+
+ /*
+ * Some BackendTypes should never track IO Operation statistics.
+ */
+ Assert(pgstat_io_op_stats_collected(bktype));
+
+ /*
+ * Some BackendTypes will not do certain IOOps.
+ */
+ if ((bktype == B_BG_WRITER || bktype == B_CHECKPOINTER) &&
+ (io_op == IOOP_READ || io_op == IOOP_EVICT))
+ return false;
+
+ if ((bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER ||
+ bktype == B_CHECKPOINTER) && io_op == IOOP_EXTEND)
+ return false;
+
+ /*
+ * Some IOOps are not valid in certain IOContexts and some IOOps are only
+ * valid in certain contexts.
+ */
+ if (io_context == IOCONTEXT_BULKREAD && io_op == IOOP_EXTEND)
+ return false;
+
+ strategy_io_context = io_context == IOCONTEXT_BULKREAD ||
+ io_context == IOCONTEXT_BULKWRITE || io_context == IOCONTEXT_VACUUM;
+
+ /*
+ * IOOP_REUSE is only relevant when a BufferAccessStrategy is in use.
+ */
+ if (!strategy_io_context && io_op == IOOP_REUSE)
+ return false;
+
+ /*
+ * IOOP_FSYNC IOOps done by a backend using a BufferAccessStrategy are
+ * counted in the IOCONTEXT_BUFFER_POOL IOContext. See comment in
+ * ForwardSyncRequest() for more details.
+ */
+ if (strategy_io_context && io_op == IOOP_FSYNC)
+ return false;
+
+ /*
+ * Temporary tables are not logged and thus do not require fsync'ing.
+ */
+ if (io_context == IOCONTEXT_BUFFER_POOL &&
+ io_object == IOOBJECT_TEMP_RELATION && io_op == IOOP_FSYNC)
+ return false;
+
+ return true;
+}
+
+bool
+pgstat_expect_io_op(BackendType bktype, IOContext io_context, IOObject io_object, IOOp io_op)
+{
+ if (!pgstat_io_op_stats_collected(bktype))
+ return false;
+
+ if (!pgstat_bktype_io_context_io_object_valid(bktype, io_context, io_object))
+ return false;
+
+ if (!pgstat_io_op_valid(bktype, io_context, io_object, io_op))
+ return false;
+
+ return true;
+}
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 9e2ce6f011..8cbec3f59e 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -14,6 +14,7 @@
#include "datatype/timestamp.h"
#include "portability/instr_time.h"
#include "postmaster/pgarch.h" /* for MAX_XFN_CHARS */
+#include "storage/buf.h"
#include "utils/backend_progress.h" /* for backward compatibility */
#include "utils/backend_status.h" /* for backward compatibility */
#include "utils/relcache.h"
@@ -276,6 +277,63 @@ typedef struct PgStat_CheckpointerStats
PgStat_Counter buf_fsync_backend;
} PgStat_CheckpointerStats;
+/*
+ * Types related to counting IO Operations for various IO Contexts.
+ * When adding a new value, ensure that the proper assertions are added to
+ * pgstat_io_context_ops_assert_zero() and pgstat_io_op_assert_zero() (though
+ * the compiler will remind you about the latter).
+ */
+
+typedef enum IOOp
+{
+ IOOP_EVICT,
+ IOOP_EXTEND,
+ IOOP_FSYNC,
+ IOOP_READ,
+ IOOP_REUSE,
+ IOOP_WRITE,
+} IOOp;
+
+#define IOOP_NUM_TYPES (IOOP_WRITE + 1)
+
+typedef enum IOObject
+{
+ IOOBJECT_RELATION,
+ IOOBJECT_TEMP_RELATION,
+} IOObject;
+
+#define IOOBJECT_NUM_TYPES (IOOBJECT_TEMP_RELATION + 1)
+
+typedef enum IOContext
+{
+ IOCONTEXT_BULKREAD,
+ IOCONTEXT_BULKWRITE,
+ IOCONTEXT_BUFFER_POOL,
+ IOCONTEXT_VACUUM,
+} IOContext;
+
+#define IOCONTEXT_NUM_TYPES (IOCONTEXT_VACUUM + 1)
+
+typedef struct PgStat_IOOpCounters
+{
+ PgStat_Counter evictions;
+ PgStat_Counter extends;
+ PgStat_Counter fsyncs;
+ PgStat_Counter reads;
+ PgStat_Counter reuses;
+ PgStat_Counter writes;
+} PgStat_IOOpCounters;
+
+typedef struct PgStat_IOObjectOps
+{
+ PgStat_IOOpCounters data[IOOBJECT_NUM_TYPES];
+} PgStat_IOObjectOps;
+
+typedef struct PgStat_IOContextOps
+{
+ PgStat_IOObjectOps data[IOCONTEXT_NUM_TYPES];
+} PgStat_IOContextOps;
+
typedef struct PgStat_StatDBEntry
{
PgStat_Counter n_xact_commit;
@@ -453,6 +511,28 @@ extern void pgstat_report_checkpointer(void);
extern PgStat_CheckpointerStats *pgstat_fetch_stat_checkpointer(void);
+/*
+ * Functions in pgstat_io_ops.c
+ */
+
+extern void pgstat_count_io_op(IOOp io_op, IOObject io_object, IOContext io_context);
+extern const char *pgstat_io_context_desc(IOContext io_context);
+extern const char *pgstat_io_object_desc(IOObject io_object);
+extern const char *pgstat_io_op_desc(IOOp io_op);
+
+/* Validation functions in pgstat_io_ops.c */
+extern bool pgstat_io_op_stats_collected(BackendType bktype);
+extern bool pgstat_bktype_io_context_io_object_valid(BackendType bktype,
+ IOContext io_context, IOObject io_object);
+extern bool pgstat_io_op_valid(BackendType bktype, IOContext io_context,
+ IOObject io_object, IOOp io_op);
+extern bool pgstat_expect_io_op(BackendType bktype,
+ IOContext io_context, IOObject io_object, IOOp io_op);
+
+/* IO stats translation function in freelist.c */
+extern IOContext IOContextForStrategy(BufferAccessStrategy bas);
+
+
/*
* Functions in pgstat_database.c
*/
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index e1bd22441b..206f4c0b3e 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -23,7 +23,12 @@
typedef void *Block;
-/* Possible arguments for GetAccessStrategy() */
+/*
+ * Possible arguments for GetAccessStrategy().
+ *
+ * If adding a new BufferAccessStrategyType, also add a new IOContext so
+ * statistics on IO operations using this strategy are tracked.
+ */
typedef enum BufferAccessStrategyType
{
BAS_NORMAL, /* Normal random access */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 2f5802195d..187828eb90 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1106,7 +1106,10 @@ ID
INFIX
INT128
INTERFACE_INFO
+IOContext
IOFuncSelector
+IOObject
+IOOp
IPCompareMethod
ITEM
IV
@@ -2026,6 +2029,9 @@ PgStat_FetchConsistency
PgStat_FunctionCallUsage
PgStat_FunctionCounts
PgStat_HashKey
+PgStat_IOContextOps
+PgStat_IOObjectOps
+PgStat_IOOpCounters
PgStat_Kind
PgStat_KindInfo
PgStat_LocalState
--
2.38.1
Attachment: v39-0001-Remove-BufferAccessStrategyData-current_was_in_r.patch
From 9f332e7da38df572ff4db0b3dd947ba859ef4054 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 29 Nov 2022 18:42:15 -0500
Subject: [PATCH v39 1/4] Remove BufferAccessStrategyData->current_was_in_ring
It is a duplication of StrategyGetBuffer->from_ring.
---
src/backend/storage/buffer/bufmgr.c | 5 +++--
src/backend/storage/buffer/freelist.c | 22 ++++++++--------------
src/include/storage/buf_internals.h | 4 ++--
3 files changed, 13 insertions(+), 18 deletions(-)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 73d30bf619..fa32f24e19 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1121,6 +1121,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
BufferAccessStrategy strategy,
bool *foundPtr)
{
+ bool from_ring;
BufferTag newTag; /* identity of requested block */
uint32 newHash; /* hash value for newTag */
LWLock *newPartitionLock; /* buffer partition lock for it */
@@ -1200,7 +1201,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* Select a victim buffer. The buffer is returned with its header
* spinlock still held!
*/
- buf = StrategyGetBuffer(strategy, &buf_state);
+ buf = StrategyGetBuffer(strategy, &buf_state, &from_ring);
Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
@@ -1254,7 +1255,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
UnlockBufHdr(buf, buf_state);
if (XLogNeedsFlush(lsn) &&
- StrategyRejectBuffer(strategy, buf))
+ StrategyRejectBuffer(strategy, buf, from_ring))
{
/* Drop lock/pin and loop around for another buffer */
LWLockRelease(BufferDescriptorGetContentLock(buf));
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 990e081aae..5299bb8711 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -81,12 +81,6 @@ typedef struct BufferAccessStrategyData
*/
int current;
- /*
- * True if the buffer just returned by StrategyGetBuffer had been in the
- * ring already.
- */
- bool current_was_in_ring;
-
/*
* Array of buffer numbers. InvalidBuffer (that is, zero) indicates we
* have not yet selected a buffer for this ring slot. For allocation
@@ -198,13 +192,15 @@ have_free_buffer(void)
* return the buffer with the buffer header spinlock still held.
*/
BufferDesc *
-StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
+StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_ring)
{
BufferDesc *buf;
int bgwprocno;
int trycounter;
uint32 local_buf_state; /* to avoid repeated (de-)referencing */
+ *from_ring = false;
+
/*
* If given a strategy object, see whether it can select a buffer. We
* assume strategy objects don't need buffer_strategy_lock.
@@ -213,7 +209,10 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
{
buf = GetBufferFromRing(strategy, buf_state);
if (buf != NULL)
+ {
+ *from_ring = true;
return buf;
+ }
}
/*
@@ -625,10 +624,7 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
*/
bufnum = strategy->buffers[strategy->current];
if (bufnum == InvalidBuffer)
- {
- strategy->current_was_in_ring = false;
return NULL;
- }
/*
* If the buffer is pinned we cannot use it under any circumstances.
@@ -644,7 +640,6 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
if (BUF_STATE_GET_REFCOUNT(local_buf_state) == 0
&& BUF_STATE_GET_USAGECOUNT(local_buf_state) <= 1)
{
- strategy->current_was_in_ring = true;
*buf_state = local_buf_state;
return buf;
}
@@ -654,7 +649,6 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
* Tell caller to allocate a new buffer with the normal allocation
* strategy. He'll then replace this ring element via AddBufferToRing.
*/
- strategy->current_was_in_ring = false;
return NULL;
}
@@ -682,14 +676,14 @@ AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
* if this buffer should be written and re-used.
*/
bool
-StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
+StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool from_ring)
{
/* We only do this in bulkread mode */
if (strategy->btype != BAS_BULKREAD)
return false;
/* Don't muck with behavior of normal buffer-replacement strategy */
- if (!strategy->current_was_in_ring ||
+ if (!from_ring ||
strategy->buffers[strategy->current] != BufferDescriptorGetBuffer(buf))
return false;
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 406db6be78..7b67250747 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -392,10 +392,10 @@ extern void ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *
/* freelist.c */
extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
- uint32 *buf_state);
+ uint32 *buf_state, bool *from_ring);
extern void StrategyFreeBuffer(BufferDesc *buf);
extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
- BufferDesc *buf);
+ BufferDesc *buf, bool from_ring);
extern int StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
extern void StrategyNotifyBgWriter(int bgwprocno);
--
2.38.1
Attachment: v39-0004-Add-system-view-tracking-IO-ops-per-backend-type.patch
From e5a02cb83e8c96e517d03c1a173665a6957893c7 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 29 Nov 2022 18:42:39 -0500
Subject: [PATCH v39 4/4] Add system view tracking IO ops per backend type
Add pg_stat_io, a system view which tracks the number of IOOps
(evictions, reuses, reads, writes, extends, and fsyncs) done on each
IOObject (relation, temp relation) in each IOContext (shared buffers
and buffers reserved by a BufferAccessStrategy) by each type of backend
(e.g. client backend, checkpointer).
Some BackendTypes do not accumulate IO operations statistics and will
not be included in the view.
Some IOContexts are not used by some BackendTypes and will not be in the
view. For example, checkpointer does not use a BufferAccessStrategy
(currently), so there will be no rows for BufferAccessStrategy
IOContexts for checkpointer.
Some IOObjects are never operated on in some IOContexts or by some
BackendTypes. These rows are omitted from the view. For example,
checkpointer will never operate on IOOBJECT_TEMP_RELATION data, so those
rows are omitted.
Some IOOps are invalid in combination with certain IOContexts and
certain IOObjects. Those cells will be NULL in the view to distinguish
between 0 observed IOOps of that type and an invalid combination. For
example, temporary tables are not fsynced so cells for all BackendTypes
for IOOBJECT_TEMP_RELATION and IOOP_FSYNC will be NULL.
Some BackendTypes never perform certain IOOps. Those cells will also be
NULL in the view. For example, bgwriter should not perform reads.
View stats are populated with statistics incremented when a backend
performs an IO Operation and maintained by the cumulative statistics
subsystem.
Each row of the view shows stats for a particular BackendType, IOObject,
IOContext combination (e.g. a client backend's operations on permanent
relations in shared buffers) and each column in the view is the total
number of IO Operations done (e.g. writes). So a cell in the view would
be, for example, the number of blocks of relation data written from
shared buffers by client backends since the last stats reset.
In anticipation of tracking WAL IO and non-block-oriented IO (such as
temporary file IO), the "op_bytes" column specifies the unit of the "read",
"written", and "extended" columns for a given row.
Note that some of the cells in the view are redundant with fields in
pg_stat_bgwriter (e.g. buffers_backend), however these have been kept in
pg_stat_bgwriter for backwards compatibility. Deriving the redundant
pg_stat_bgwriter stats from the IO operations stats structures was also
problematic due to the separate reset targets for 'bgwriter' and 'io'.
Suggested by Andres Freund
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Maciek Sakrejda <m.sakrejda@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
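Assuming the column names this commit message describes (`written`, `extended`, `files_synced`; the authoritative definitions live in the patched system_views.sql), a query along these lines would surface writes done by backends that the checkpointer and bgwriter did not absorb. Illustrative only; actual output depends on the workload:

```sql
-- Writes per backend type, context, and object, excluding the dedicated
-- writer processes. NULL cells mark invalid combinations, not zero counts.
SELECT backend_type, io_context, io_object, written, extended, files_synced
FROM pg_stat_io
WHERE backend_type NOT IN ('checkpointer', 'background writer')
  AND written IS NOT NULL
ORDER BY written DESC;
```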
---
contrib/amcheck/expected/check_heap.out | 31 ++
contrib/amcheck/sql/check_heap.sql | 24 ++
doc/src/sgml/monitoring.sgml | 410 +++++++++++++++++++++++-
doc/src/sgml/pgstatstatements.sgml | 84 ++++-
src/backend/catalog/system_views.sql | 15 +
src/backend/utils/adt/pgstatfuncs.c | 144 +++++++++
src/include/catalog/pg_proc.dat | 9 +
src/test/regress/expected/rules.out | 12 +
src/test/regress/expected/stats.out | 225 +++++++++++++
src/test/regress/sql/stats.sql | 138 ++++++++
10 files changed, 1064 insertions(+), 28 deletions(-)
diff --git a/contrib/amcheck/expected/check_heap.out b/contrib/amcheck/expected/check_heap.out
index c010361025..667d5747a8 100644
--- a/contrib/amcheck/expected/check_heap.out
+++ b/contrib/amcheck/expected/check_heap.out
@@ -66,6 +66,19 @@ SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'ALL-VISIBLE');
INSERT INTO heaptest (a, b)
(SELECT gs, repeat('x', gs)
FROM generate_series(1,50) gs);
+-- pg_stat_io test:
+-- verify_heapam always uses a BAS_BULKREAD BufferAccessStrategy. This allows
+-- us to reliably test that pg_stat_io BULKREAD reads are being captured
+-- without relying on the size of shared buffers or on an expensive operation
+-- like CREATE DATABASE.
+--
+-- Create an alternative tablespace and move the heaptest table to it, causing
+-- it to be rewritten.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE test_stats LOCATION '';
+SELECT sum(read) AS stats_bulkreads_before
+ FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+ALTER TABLE heaptest SET TABLESPACE test_stats;
-- Check that valid options are not rejected nor corruption reported
-- for a non-empty table
SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'none');
@@ -88,6 +101,23 @@ SELECT * FROM verify_heapam(relation := 'heaptest', startblock := 0, endblock :=
-------+--------+--------+-----
(0 rows)
+-- verify_heapam will need to read in the page written out by ALTER TABLE ...
+-- SET TABLESPACE ... causing an additional bulkread, which should be reflected
+-- in pg_stat_io.
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(read) AS stats_bulkreads_after
+ FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+SELECT :stats_bulkreads_after > :stats_bulkreads_before;
+ ?column?
+----------
+ t
+(1 row)
+
CREATE ROLE regress_heaptest_role;
-- verify permissions are checked (error due to function not callable)
SET ROLE regress_heaptest_role;
@@ -195,6 +225,7 @@ ERROR: cannot check relation "test_foreign_table"
DETAIL: This operation is not supported for foreign tables.
-- cleanup
DROP TABLE heaptest;
+DROP TABLESPACE test_stats;
DROP TABLE test_partition;
DROP TABLE test_partitioned;
DROP OWNED BY regress_heaptest_role; -- permissions
diff --git a/contrib/amcheck/sql/check_heap.sql b/contrib/amcheck/sql/check_heap.sql
index 298de6886a..84ffffebf9 100644
--- a/contrib/amcheck/sql/check_heap.sql
+++ b/contrib/amcheck/sql/check_heap.sql
@@ -20,11 +20,26 @@ SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'NONE');
SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'ALL-FROZEN');
SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'ALL-VISIBLE');
+
-- Add some data so subsequent tests are not entirely trivial
INSERT INTO heaptest (a, b)
(SELECT gs, repeat('x', gs)
FROM generate_series(1,50) gs);
+-- pg_stat_io test:
+-- verify_heapam always uses a BAS_BULKREAD BufferAccessStrategy. This allows
+-- us to reliably test that pg_stat_io BULKREAD reads are being captured
+-- without relying on the size of shared buffers or on an expensive operation
+-- like CREATE DATABASE.
+--
+-- Create an alternative tablespace and move the heaptest table to it, causing
+-- it to be rewritten.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE test_stats LOCATION '';
+SELECT sum(read) AS stats_bulkreads_before
+ FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+ALTER TABLE heaptest SET TABLESPACE test_stats;
+
-- Check that valid options are not rejected nor corruption reported
-- for a non-empty table
SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'none');
@@ -32,6 +47,14 @@ SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'all-frozen');
SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'all-visible');
SELECT * FROM verify_heapam(relation := 'heaptest', startblock := 0, endblock := 0);
+-- verify_heapam will need to read in the page written out by ALTER TABLE ...
+-- SET TABLESPACE ... causing an additional bulkread, which should be reflected
+-- in pg_stat_io.
+SELECT pg_stat_force_next_flush();
+SELECT sum(read) AS stats_bulkreads_after
+ FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+SELECT :stats_bulkreads_after > :stats_bulkreads_before;
+
CREATE ROLE regress_heaptest_role;
-- verify permissions are checked (error due to function not callable)
@@ -110,6 +133,7 @@ SELECT * FROM verify_heapam('test_foreign_table',
-- cleanup
DROP TABLE heaptest;
+DROP TABLESPACE test_stats;
DROP TABLE test_partition;
DROP TABLE test_partitioned;
DROP OWNED BY regress_heaptest_role; -- permissions
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 89fca710db..840e4695ef 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -448,6 +448,15 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</entry>
</row>
+ <row>
+ <entry><structname>pg_stat_io</structname><indexterm><primary>pg_stat_io</primary></indexterm></entry>
+ <entry>One row per combination of backend type, IO context, and IO
+ object, showing statistics about IO operations. See
+ <link linkend="monitoring-pg-stat-io-view">
+ <structname>pg_stat_io</structname></link> for details.
+ </entry>
+ </row>
+
<row>
<entry><structname>pg_stat_wal</structname><indexterm><primary>pg_stat_wal</primary></indexterm></entry>
<entry>One row only, showing statistics about WAL activity. See
@@ -658,20 +667,20 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</para>
<para>
- The <structname>pg_statio_</structname> views are primarily useful to
- determine the effectiveness of the buffer cache. When the number
- of actual disk reads is much smaller than the number of buffer
- hits, then the cache is satisfying most read requests without
- invoking a kernel call. However, these statistics do not give the
- entire story: due to the way in which <productname>PostgreSQL</productname>
- handles disk I/O, data that is not in the
- <productname>PostgreSQL</productname> buffer cache might still reside in the
- kernel's I/O cache, and might therefore still be fetched without
- requiring a physical read. Users interested in obtaining more
- detailed information on <productname>PostgreSQL</productname> I/O behavior are
- advised to use the <productname>PostgreSQL</productname> statistics views
- in combination with operating system utilities that allow insight
- into the kernel's handling of I/O.
+ The <structname>pg_stat_io</structname> and
+ <structname>pg_statio_</structname> set of views are primarily useful to
+ determine the effectiveness of the buffer cache. When the number of actual
+ disk reads is much smaller than the number of buffer hits, then the cache is
+ satisfying most read requests without invoking a kernel call. However, these
+ statistics do not give the entire story: due to the way in which
+ <productname>PostgreSQL</productname> handles disk I/O, data that is not in
+ the <productname>PostgreSQL</productname> buffer cache might still reside in
+ the kernel's I/O cache, and might therefore still be fetched without
+ requiring a physical read. Users interested in obtaining more detailed
+ information on <productname>PostgreSQL</productname> I/O behavior are
+ advised to use the <productname>PostgreSQL</productname> statistics views in
+ combination with operating system utilities that allow insight into the
+ kernel's handling of I/O.
</para>
</sect2>
@@ -3604,13 +3613,12 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
<structfield>stats_reset</structfield> <type>timestamp with time zone</type>
</para>
<para>
- Time at which these statistics were last reset
+ Time at which these statistics were last reset.
</para></entry>
</row>
</tbody>
</tgroup>
</table>
-
<para>
Normally, WAL files are archived in order, oldest to newest, but that is
not guaranteed, and does not hold under special circumstances like when
@@ -3619,7 +3627,377 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
<structfield>last_archived_wal</structfield> have also been successfully
archived.
</para>
+ </sect2>
+
+ <sect2 id="monitoring-pg-stat-io-view">
+ <title><structname>pg_stat_io</structname></title>
+
+ <indexterm>
+ <primary>pg_stat_io</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_stat_io</structname> view has a row for each valid
+ combination of backend type, IO context, and IO object, containing
+ cluster-wide data on IO operations done by that backend type in that IO
+ context on that IO object. Currently, only a subset of IO operations is
+ tracked here.
+ WAL IO, IO on temporary files, and some forms of IO outside of shared
+ buffers (such as when building indexes or moving a table from one tablespace
+ to another) may be added in the future.
+ </para>
+
+ <table id="pg-stat-io-view" xreflabel="pg_stat_io">
+ <title><structname>pg_stat_io</structname> View</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>backend_type</structfield> <type>text</type></para>
+ <para>
+ Type of backend (e.g. background worker, autovacuum worker).
+ See <link linkend="monitoring-pg-stat-activity-view">
+ <structname>pg_stat_activity</structname></link> for more information on
+ <varname>backend_type</varname>s. Some <varname>backend_type</varname>s
+ do not accumulate IO operation statistics and will not be included in
+ the view.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>io_context</structfield> <type>text</type></para>
+ <para>
+ The context or location of an IO operation.
+ </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>buffer pool</literal> refers to IO operations on data in both
+ the shared buffer pool and process-local buffer pools used for
+ temporary relation data.
+ </para>
+ <para>
+ Operations on temporary relations are tracked in
+ <varname>io_context</varname> <literal>buffer pool</literal> for
+ <varname>io_object</varname> <literal>temp relation</literal>.
+ </para>
+ <para>
+ Operations on permanent relations are tracked in
+ <varname>io_context</varname> <literal>buffer pool</literal> for
+ <varname>io_object</varname> <literal>relation</literal>.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>vacuum</literal> refers to the IO operations incurred while
+ vacuuming and analyzing.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>bulkread</literal> refers to IO operations specially
+ designated as <literal>bulk reads</literal>, such as the sequential
+ scan of a large table.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>bulkwrite</literal> refers to IO operations specially
+ designated as <literal>bulk writes</literal>, such as
+ <command>COPY</command>.
+ </para>
+ </listitem>
+ </itemizedlist>
+ <para>
+ These last three <varname>io_context</varname>s are counted separately
+ because the autovacuum daemon, explicit <command>VACUUM</command>,
+ explicit <command>ANALYZE</command>, many bulk reads, and many bulk
+ writes acquire a fixed number of shared buffers and reuse them
+ circularly to avoid occupying an undue portion of the main shared
+ buffer pool. This pattern is called a <quote>Buffer Access
+ Strategy</quote> in the <productname>PostgreSQL</productname> source
+ code and the fixed-size ring buffer is referred to as a <quote>strategy
+ ring buffer</quote> for the purposes of this view's documentation.
+ These <varname>io_context</varname>s are referred to as <quote>strategy
+ contexts</quote> and IO operations on strategy contexts are referred to
+ as <quote>strategy operations</quote>.
+ </para>
+ <para>
+ Some <varname>io_context</varname>s are not used by some
+ <varname>backend_type</varname>s and will not be in the view. For
+ example, the checkpointer does not use a Buffer Access Strategy
+ (currently), so there will be no rows for
+ <varname>backend_type</varname> <literal>checkpointer</literal> in any
+ of the strategy <varname>io_context</varname>s.
+ </para>
+ <para>
+ Some IO operations are invalid in combination with certain
+ <varname>io_context</varname>s and <varname>io_object</varname>s. Those
+ cells will be NULL to distinguish between 0 observed IO operations of
+ that type and an invalid combination. For example, temporary tables are
+ not fsynced, so cells for all <varname>backend_type</varname>s for
+ <varname>io_object</varname> <literal>temp relation</literal> in
+ <varname>io_context</varname> <literal>buffer pool</literal> for
+ <varname>files_synced</varname> will be NULL. Some
+ <varname>backend_type</varname>s never perform certain IO operations.
+ Those cells will also be NULL in the view. For example,
+ <literal>background writer</literal> should not perform reads.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>io_object</structfield> <type>text</type></para>
+ <para>
+ Object operated on in a given <varname>io_context</varname> by a given
+ <varname>backend_type</varname>. Current values are
+ <literal>relation</literal>, which includes permanent relations, and
+ <literal>temp relation</literal> which includes temporary relations
+ created by <command>CREATE TEMPORARY TABLE...</command>.
+ </para>
+ <para>
+ Some <varname>backend_type</varname>s will never do IO operations on
+ some <varname>io_object</varname>s, either at all or in certain
+ <varname>io_context</varname>s. These rows are omitted from the view.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>read</structfield> <type>bigint</type></para>
+ <para>
+ Reads by a given <varname>backend_type</varname> of a given
+ <varname>io_object</varname> into buffers in a given
+ <varname>io_context</varname>.
+ </para>
+ <para>
+        Note that both the sum of
+        <varname>heap_blks_read</varname>,
+        <varname>idx_blks_read</varname>,
+        <varname>tidx_blks_read</varname>, and
+        <varname>toast_blks_read</varname>
+        in <link linkend="monitoring-pg-statio-all-tables-view">
+        <structname>pg_statio_all_tables</structname></link> and
+        <varname>blks_read</varname> in <link
+        linkend="monitoring-pg-stat-database-view">
+        <structname>pg_stat_database</structname></link> are similar to
+        <varname>read</varname> plus <varname>extended</varname> for all
+        <varname>io_context</varname>s for the following
+        <varname>backend_type</varname>s in <structname>pg_stat_io</structname>:
+ <itemizedlist>
+ <listitem><para><literal>autovacuum launcher</literal></para></listitem>
+ <listitem><para><literal>autovacuum worker</literal></para></listitem>
+ <listitem><para><literal>client backend</literal></para></listitem>
+ <listitem><para><literal>standalone backend</literal></para></listitem>
+ <listitem><para><literal>background worker</literal></para></listitem>
+ <listitem><para><literal>walsender</literal></para></listitem>
+ </itemizedlist>
+ The difference is that reads done as part of <command>CREATE
+ DATABASE</command> are not counted in
+ <structname>pg_statio_all_tables</structname> and
+ <structname>pg_stat_database</structname>.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>written</structfield> <type>bigint</type></para>
+ <para>
+        Writes of data to a given <varname>io_object</varname> by a given
+        <varname>backend_type</varname> in a given
+        <varname>io_context</varname>.
+ </para>
+ <para>
+        Normal client backends should be able to rely on auxiliary processes
+        like the checkpointer and background writer to write out as much
+        dirty data as possible. Large numbers of writes by
+        <varname>backend_type</varname> <literal>client backend</literal> in
+        <varname>io_context</varname> <literal>buffer pool</literal> and
+        <varname>io_object</varname> <literal>relation</literal> could indicate
+        a misconfiguration of shared buffers or of the checkpointer. More
+ information on checkpointer configuration can be found in <xref
+ linkend="wal-configuration"/>.
+ </para>
+ <para>
+ Note that the values of <varname>written</varname> for
+ <varname>backend_type</varname> <literal>background writer</literal> and
+ <varname>backend_type</varname> <literal>checkpointer</literal> are
+ equivalent to the values of <varname>buffers_clean</varname> and
+ <varname>buffers_checkpoint</varname>, respectively, in <link
+ linkend="monitoring-pg-stat-bgwriter-view">
+ <structname>pg_stat_bgwriter</structname></link>.
+ <varname>buffers_backend</varname> in
+ <structname>pg_stat_bgwriter</structname> is equivalent to
+ <structname>pg_stat_io</structname>'s <varname>written</varname> plus
+ <varname>extended</varname> for <varname>io_context</varname>s
+ <literal>buffer pool</literal>, <literal>bulkread</literal>,
+ <literal>bulkwrite</literal>, and <literal>vacuum</literal> on
+ <varname>io_object</varname> <literal>relation</literal> for
+ <varname>backend_type</varname>s:
+ <itemizedlist>
+ <listitem><para><literal>client backend</literal></para></listitem>
+ <listitem><para><literal>autovacuum worker</literal></para></listitem>
+ <listitem><para><literal>background worker</literal></para></listitem>
+ <listitem><para><literal>walsender</literal></para></listitem>
+ </itemizedlist>
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>extended</structfield> <type>bigint</type></para>
+ <para>
+ Extends of relations done by a given <varname>backend_type</varname> in
+ order to write data for a given <varname>io_object</varname> in a given
+ <varname>io_context</varname>.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>op_bytes</structfield> <type>bigint</type></para>
+ <para>
+ The number of bytes per unit of IO read, written, or extended. For
+ block-oriented IO of relation data, reads, writes, and extends are done
+ in <varname>block_size</varname> units, derived from the build-time
+ parameter <symbol>BLCKSZ</symbol>, which is <literal>8192</literal> by
+ default. Future values could include those derived from
+ <symbol>XLOG_BLCKSZ</symbol>, once WAL IO is tracked in this view, and
+ constant multipliers once non-block-oriented IO (e.g. temporary file IO)
+ is tracked here.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>evicted</structfield> <type>bigint</type></para>
+ <para>
+ Number of times a <varname>backend_type</varname> has evicted a block
+ from a shared or local buffer in order to reuse the buffer in this
+ <varname>io_context</varname>. Blocks are only evicted when there are no
+ unoccupied buffers.
+ </para>
+ <para>
+ <varname>evicted</varname> in <varname>io_context</varname>
+ <literal>buffer pool</literal> and <varname>io_object</varname>
+ <literal>relation</literal> counts the number of times a block from a
+        shared buffer was evicted so that it could be replaced with another block,
+ also in shared buffers.
+ </para>
+ <para>
+ A high <varname>evicted</varname> count in <varname>io_context</varname>
+ <literal>buffer pool</literal> and <varname>io_object</varname>
+ <literal>relation</literal> could indicate that shared buffers is too
+ small and should be set to a larger value.
+ </para>
+ <para>
+ <varname>evicted</varname> in <varname>io_context</varname>
+ <literal>vacuum</literal>, <literal>bulkread</literal>, and
+ <literal>bulkwrite</literal> counts the number of times occupied shared
+ buffers were added to the fixed-size strategy ring buffer, causing the
+ buffer contents to be evicted. If the current buffer in the ring is
+ pinned or in use by another backend, it may be replaced by a new shared
+ buffer. If this shared buffer contains valid data, that block must be
+ evicted and will count as <varname>evicted</varname>.
+ </para>
+ <para>
+ Seeing a large number of <varname>evicted</varname> in strategy
+ <varname>io_context</varname>s can provide insight into primary working
+ set cache misses.
+ </para>
+ <para>
+ <varname>evicted</varname> in <varname>io_context</varname>
+ <literal>buffer pool</literal> and <varname>io_object</varname>
+ <literal>temp relation</literal> counts the number of times a block of
+ data from an existing local buffer was evicted in order to replace it
+ with another block, also in local buffers.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>reused</structfield> <type>bigint</type></para>
+ <para>
+ The number of times an existing buffer in the strategy ring was reused
+ as part of an operation in the <literal>bulkread</literal>,
+ <literal>bulkwrite</literal>, or <literal>vacuum</literal>
+ <varname>io_context</varname>s. When a Buffer Access Strategy reuses a
+ buffer in the strategy ring, it evicts the buffer contents, incrementing
+ <varname>reused</varname>. When a Buffer Access Strategy adds a new
+ shared buffer to the strategy ring and this shared buffer is occupied,
+ the Buffer Access Strategy must evict the contents of the shared buffer,
+ incrementing <varname>evicted</varname>.
+ </para>
+ </entry>
+ </row>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>files_synced</structfield> <type>bigint</type></para>
+ <para>
+ Number of files <literal>fsync</literal>ed by a given
+ <varname>backend_type</varname> for the purpose of persisting data from
+ a given <varname>io_object</varname> dirtied in a given
+ <varname>io_context</varname>. <literal>fsync</literal>s are done at
+        segment boundaries, so <varname>op_bytes</varname> does not apply to the
+ <varname>files_synced</varname> column. <literal>fsync</literal>s done
+ by backends in order to persist data written in
+ <varname>io_context</varname> <literal>vacuum</literal>,
+ <varname>io_context</varname> <literal>bulkread</literal>, or
+ <varname>io_context</varname> <literal>bulkwrite</literal> are counted
+ as <varname>io_context</varname> <literal>buffer pool</literal>
+ <varname>io_object</varname> <literal>relation</literal>
+ <varname>files_synced</varname>.
+ </para>
+ <para>
+ Normal client backends should be able to rely on the checkpointer to
+ ensure data is persisted to permanent storage. Large numbers of
+ <varname>files_synced</varname> by <varname>backend_type</varname>
+ <literal>client backend</literal> could indicate a misconfiguration of
+        shared buffers or of the checkpointer. More information on checkpointer
+ configuration can be found in <xref linkend="wal-configuration"/>.
+ </para>
+ <para>
+ Note that the sum of <varname>files_synced</varname> for all
+ <varname>io_context</varname> <literal>buffer pool</literal>
+ <varname>io_object</varname> <literal>relation</literal> for all
+ <varname>backend_type</varname>s except <literal>checkpointer</literal>
+ is equivalent to <varname>buffers_backend_fsync</varname> in <link
+ linkend="monitoring-pg-stat-bgwriter-view">
+ <structname>pg_stat_bgwriter</structname></link>.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>stats_reset</structfield> <type>timestamp with time zone</type></para>
+ <para>
+ Time at which these statistics were last reset.
+ </para>
+ </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
</sect2>
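
As a cross-check of the equivalences described in the documentation above, the old <varname>buffers_backend</varname> counter can be approximated from the new view with a query along these lines (a sketch against the view as defined in this patch, not an authoritative recipe):

```sql
-- Approximate pg_stat_bgwriter.buffers_backend using pg_stat_io.
-- sum() skips the NULL cells of invalid combinations.
SELECT sum(written) + coalesce(sum(extended), 0) AS buffers_backend_equiv
FROM pg_stat_io
WHERE io_object = 'relation'
  AND io_context IN ('buffer pool', 'bulkread', 'bulkwrite', 'vacuum')
  AND backend_type IN ('client backend', 'autovacuum worker',
                       'background worker', 'walsender');
```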
<sect2 id="monitoring-pg-stat-bgwriter-view">
diff --git a/doc/src/sgml/pgstatstatements.sgml b/doc/src/sgml/pgstatstatements.sgml
index ea90365c7f..bc59ef6f3d 100644
--- a/doc/src/sgml/pgstatstatements.sgml
+++ b/doc/src/sgml/pgstatstatements.sgml
@@ -254,11 +254,27 @@
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>shared_blks_read</structfield> <type>bigint</type>
+ <structfield>shared_blks_read</structfield> <type>bigint</type></para>
+ <para>
+ Total number of shared blocks read by the statement.
</para>
<para>
- Total number of shared blocks read by the statement
- </para></entry>
+ <varname>shared_blks_read</varname> in
+ <structname>pg_stat_statements</structname> is equivalent to
+ <link linkend="monitoring-pg-stat-io-view"><structname>pg_stat_io</structname></link>'s
+ <varname>read</varname> for all <varname>io_context</varname>s with
+ <varname>io_object</varname> <literal>relation</literal> for
+ <varname>backend_type</varname>s:
+ <itemizedlist>
+ <listitem><para><literal>autovacuum launcher</literal></para></listitem>
+ <listitem><para><literal>autovacuum worker</literal></para></listitem>
+ <listitem><para><literal>client backend</literal></para></listitem>
+ <listitem><para><literal>standalone backend</literal></para></listitem>
+ <listitem><para><literal>background worker</literal></para></listitem>
+ <listitem><para><literal>walsender</literal></para></listitem>
+ </itemizedlist>
+ </para>
+ </entry>
</row>
<row>
@@ -272,11 +288,28 @@
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>shared_blks_written</structfield> <type>bigint</type>
- </para>
+ <structfield>shared_blks_written</structfield> <type>bigint</type></para>
<para>
Total number of shared blocks written by the statement
- </para></entry>
+ </para>
+ <para>
+ <varname>shared_blks_written</varname> in
+ <structname>pg_stat_statements</structname> is equivalent to
+ <link linkend="monitoring-pg-stat-io-view"><structname>pg_stat_io</structname></link>'s
+ <varname>written</varname> plus <varname>extended</varname> for all
+ <varname>io_context</varname>s with <varname>io_object</varname>
+ <literal>relation</literal> for <varname>backend_type</varname>s:
+ <itemizedlist>
+ <listitem><para><literal>autovacuum launcher</literal></para></listitem>
+ <listitem><para><literal>autovacuum worker</literal></para></listitem>
+ <listitem><para><literal>client backend</literal></para></listitem>
+ <listitem><para><literal>standalone backend</literal></para></listitem>
+ <listitem><para><literal>background worker</literal></para></listitem>
+ <listitem><para><literal>walsender</literal></para></listitem>
+ </itemizedlist>
+ </para>
+
+ </entry>
</row>
<row>
@@ -290,11 +323,24 @@
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>local_blks_read</structfield> <type>bigint</type>
- </para>
+ <structfield>local_blks_read</structfield> <type>bigint</type></para>
<para>
Total number of local blocks read by the statement
- </para></entry>
+ </para>
+ <para>
+ <varname>local_blks_read</varname> in
+ <structname>pg_stat_statements</structname> is equivalent to
+ <link linkend="monitoring-pg-stat-io-view"><structname>pg_stat_io</structname></link>'s
+ <varname>read</varname> for <varname>io_context</varname>
+ <literal>buffer pool</literal> with <varname>io_object</varname>
+ <literal>temp relation</literal> for <varname>backend_type</varname>s:
+ <itemizedlist>
+ <listitem><para><literal>client backend</literal></para></listitem>
+ <listitem><para><literal>background worker</literal></para></listitem>
+ <listitem><para><literal>walsender</literal></para></listitem>
+ </itemizedlist>
+ </para>
+ </entry>
</row>
<row>
@@ -308,11 +354,25 @@
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>local_blks_written</structfield> <type>bigint</type>
- </para>
+ <structfield>local_blks_written</structfield> <type>bigint</type></para>
<para>
Total number of local blocks written by the statement
- </para></entry>
+ </para>
+ <para>
+ <varname>local_blks_written</varname> in
+ <structname>pg_stat_statements</structname> is equivalent to
+ <link linkend="monitoring-pg-stat-io-view"><structname>pg_stat_io</structname></link>'s
+ <varname>written</varname> plus <varname>extended</varname> for
+ <varname>io_context</varname> <literal>buffer pool</literal> with
+ <varname>io_object</varname> <literal>temp relation</literal> for
+ <varname>backend_type</varname>s:
+ <itemizedlist>
+ <listitem><para><literal>client backend</literal></para></listitem>
+ <listitem><para><literal>background worker</literal></para></listitem>
+ <listitem><para><literal>walsender</literal></para></listitem>
+ </itemizedlist>
+ </para>
+ </entry>
</row>
<row>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 2d8104b090..296b3acf6e 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1117,6 +1117,21 @@ CREATE VIEW pg_stat_bgwriter AS
pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+CREATE VIEW pg_stat_io AS
+SELECT
+ b.backend_type,
+ b.io_context,
+ b.io_object,
+ b.read,
+ b.written,
+ b.extended,
+ b.op_bytes,
+ b.evicted,
+ b.reused,
+ b.files_synced,
+ b.stats_reset
+FROM pg_stat_get_io() b;
+
CREATE VIEW pg_stat_wal AS
SELECT
w.wal_records,
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index a135cad0ce..fda10fc429 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1731,6 +1731,150 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
}
+/*
+ * When adding a new column to the pg_stat_io view, add a new enum value
+ * here above IO_NUM_COLUMNS.
+ */
+typedef enum io_stat_col
+{
+ IO_COL_BACKEND_TYPE,
+ IO_COL_IO_CONTEXT,
+ IO_COL_IO_OBJECT,
+ IO_COL_READS,
+ IO_COL_WRITES,
+ IO_COL_EXTENDS,
+ IO_COL_CONVERSION,
+ IO_COL_EVICTIONS,
+ IO_COL_REUSES,
+ IO_COL_FSYNCS,
+ IO_COL_RESET_TIME,
+ IO_NUM_COLUMNS,
+} io_stat_col;
+
+/*
+ * When adding a new IOOp, add a new io_stat_col and add a case to this
+ * function returning the corresponding io_stat_col.
+ */
+static io_stat_col
+pgstat_io_op_get_index(IOOp io_op)
+{
+ switch (io_op)
+ {
+ case IOOP_EVICT:
+ return IO_COL_EVICTIONS;
+ case IOOP_READ:
+ return IO_COL_READS;
+ case IOOP_REUSE:
+ return IO_COL_REUSES;
+ case IOOP_WRITE:
+ return IO_COL_WRITES;
+ case IOOP_EXTEND:
+ return IO_COL_EXTENDS;
+ case IOOP_FSYNC:
+ return IO_COL_FSYNCS;
+ }
+
+ elog(ERROR, "unrecognized IOOp value: %d", io_op);
+
+ pg_unreachable();
+}
+
+Datum
+pg_stat_get_io(PG_FUNCTION_ARGS)
+{
+ PgStat_BackendIOContextOps *backends_io_stats;
+ ReturnSetInfo *rsinfo;
+ Datum reset_time;
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+
+ backends_io_stats = pgstat_fetch_backend_io_context_ops();
+
+ reset_time = TimestampTzGetDatum(backends_io_stats->stat_reset_timestamp);
+
+ for (BackendType bktype = B_INVALID; bktype < BACKEND_NUM_TYPES; bktype++)
+ {
+ Datum bktype_desc = CStringGetTextDatum(GetBackendTypeDesc(bktype));
+ bool expect_backend_stats = true;
+ PgStat_IOContextOps *io_context_ops = &backends_io_stats->stats[bktype];
+
+ /*
+ * For those BackendTypes without IO Operation stats, skip
+ * representing them in the view altogether. We still loop through
+ * their counters so that we can assert that all values are zero.
+ */
+ expect_backend_stats = pgstat_io_op_stats_collected(bktype);
+
+ for (IOContext io_context = IOCONTEXT_BULKREAD;
+ io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ {
+ const char *io_context_str = pgstat_io_context_desc(io_context);
+ PgStat_IOObjectOps *io_objs = &io_context_ops->data[io_context];
+
+ for (IOObject io_object = IOOBJECT_RELATION;
+ io_object < IOOBJECT_NUM_TYPES; io_object++)
+ {
+ PgStat_IOOpCounters *counters = &io_objs->data[io_object];
+ const char *io_obj_str = pgstat_io_object_desc(io_object);
+
+ Datum values[IO_NUM_COLUMNS] = {0};
+ bool nulls[IO_NUM_COLUMNS] = {0};
+
+ /*
+ * Some combinations of IOContext, IOObject, and BackendType
+ * are not valid for any type of IOOp. In such cases, omit the
+ * entire row from the view.
+ */
+ if (!expect_backend_stats ||
+ !pgstat_bktype_io_context_io_object_valid(bktype,
+ io_context, io_object))
+ {
+ pgstat_io_context_ops_assert_zero(counters);
+ continue;
+ }
+
+ values[IO_COL_BACKEND_TYPE] = bktype_desc;
+ values[IO_COL_IO_CONTEXT] = CStringGetTextDatum(io_context_str);
+ values[IO_COL_IO_OBJECT] = CStringGetTextDatum(io_obj_str);
+ values[IO_COL_READS] = Int64GetDatum(counters->reads);
+ values[IO_COL_WRITES] = Int64GetDatum(counters->writes);
+ values[IO_COL_EXTENDS] = Int64GetDatum(counters->extends);
+
+ /*
+ * Hard-code this to blocks until we have non-block-oriented
+ * IO represented in the view as well
+ */
+ values[IO_COL_CONVERSION] = Int64GetDatum(BLCKSZ);
+ values[IO_COL_EVICTIONS] = Int64GetDatum(counters->evictions);
+ values[IO_COL_REUSES] = Int64GetDatum(counters->reuses);
+ values[IO_COL_FSYNCS] = Int64GetDatum(counters->fsyncs);
+ values[IO_COL_RESET_TIME] = TimestampTzGetDatum(reset_time);
+
+ /*
+ * Some combinations of BackendType and IOOp, of IOContext and
+ * IOOp, and of IOObject and IOOp are not valid. Set these
+ * cells in the view NULL and assert that these stats are zero
+ * as expected.
+ */
+ for (IOOp io_op = IOOP_EVICT; io_op < IOOP_NUM_TYPES; io_op++)
+ {
+ if (!pgstat_io_op_valid(bktype, io_context, io_object,
+ io_op))
+ {
+ pgstat_io_op_assert_zero(counters, io_op);
+ nulls[pgstat_io_op_get_index(io_op)] = true;
+ }
+ }
+
+ tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
+ }
+ }
+ }
+
+ return (Datum) 0;
+}
+
/*
* Returns statistics of WAL activity
*/
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index f9301b2627..1416fa27d3 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5679,6 +5679,15 @@
proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
+{ oid => '8459', descr => 'statistics: per backend type IO statistics',
+ proname => 'pg_stat_get_io', provolatile => 'v',
+ prorows => '30', proretset => 't',
+ proparallel => 'r', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,text,text,int8,int8,int8,int8,int8,int8,int8,timestamptz}',
+ proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}',
+ proargnames => '{backend_type,io_context,io_object,read,written,extended,op_bytes,evicted,reused,files_synced,stats_reset}',
+ prosrc => 'pg_stat_get_io' },
+
{ oid => '1136', descr => 'statistics: information about WAL activity',
proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
proparallel => 'r', prorettype => 'record', proargtypes => '',
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 37c1c86473..5960d289a0 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1876,6 +1876,18 @@ pg_stat_gssapi| SELECT s.pid,
s.gss_enc AS encrypted
FROM pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid, query_id)
WHERE (s.client_port IS NOT NULL);
+pg_stat_io| SELECT b.backend_type,
+ b.io_context,
+ b.io_object,
+ b.read,
+ b.written,
+ b.extended,
+ b.op_bytes,
+ b.evicted,
+ b.reused,
+ b.files_synced,
+ b.stats_reset
+ FROM pg_stat_get_io() b(backend_type, io_context, io_object, read, written, extended, op_bytes, evicted, reused, files_synced, stats_reset);
pg_stat_progress_analyze| SELECT s.pid,
s.datid,
d.datname,
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index 1d84407a03..eed0017518 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -1126,4 +1126,229 @@ SELECT pg_stat_get_subscription_stats(NULL);
(1 row)
+-- Test that the following operations are tracked in pg_stat_io:
+-- - reads of target blocks into shared buffers
+-- - writes of shared buffers to permanent storage
+-- - extends of relations using shared buffers
+-- - fsyncs done to ensure the durability of data dirtying shared buffers
+-- There is no test for blocks evicted from shared buffers, because we cannot
+-- be sure of the state of shared buffers at the point the test is run.
+-- Create a regular table and insert some data to generate IOCONTEXT_BUFFER_POOL
+-- extends.
+SELECT sum(extended) AS io_sum_shared_extends_before
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'relation' \gset
+CREATE TABLE test_io_shared(a int);
+INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(extended) AS io_sum_shared_extends_after
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'relation' \gset
+SELECT :io_sum_shared_extends_after > :io_sum_shared_extends_before;
+ ?column?
+----------
+ t
+(1 row)
+
+-- After a checkpoint, there should be some additional IOCONTEXT_BUFFER_POOL writes
+-- and fsyncs.
+-- The second checkpoint ensures that stats from the first checkpoint have been
+-- reported and protects against any potential races amongst the table
+-- creation, a possible timing-triggered checkpoint, and the explicit
+-- checkpoint in the test.
+SELECT sum(written) AS io_sum_shared_writes_before
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'relation' \gset
+SELECT sum(files_synced) AS io_sum_shared_fsyncs_before
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'relation' \gset
+CHECKPOINT;
+CHECKPOINT;
+SELECT sum(written) AS io_sum_shared_writes_after
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'relation' \gset
+SELECT sum(files_synced) AS io_sum_shared_fsyncs_after
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'relation' \gset
+SELECT :io_sum_shared_writes_after > :io_sum_shared_writes_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT current_setting('fsync') = 'off' OR :io_sum_shared_fsyncs_after > :io_sum_shared_fsyncs_before;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Change the tablespace so that the table is rewritten directly, then SELECT
+-- from it to cause it to be read back into shared buffers.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE test_io_shared_stats_tblspc LOCATION '';
+SELECT sum(read) AS io_sum_shared_reads_before
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'relation' \gset
+ALTER TABLE test_io_shared SET TABLESPACE test_io_shared_stats_tblspc;
+-- SELECT from the table so that it is read into shared buffers and io_context
+-- 'buffer pool', io_object 'relation' reads are counted.
+SELECT COUNT(*) FROM test_io_shared;
+ count
+-------
+ 100
+(1 row)
+
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(read) AS io_sum_shared_reads_after
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'relation' \gset
+SELECT :io_sum_shared_reads_after > :io_sum_shared_reads_before;
+ ?column?
+----------
+ t
+(1 row)
+
+DROP TABLE test_io_shared;
+DROP TABLESPACE test_io_shared_stats_tblspc;
+-- Test that the following IOCONTEXT_LOCAL IOOps are tracked in pg_stat_io:
+-- - eviction of local buffers in order to reuse them
+-- - reads of temporary table blocks into local buffers
+-- - writes of local buffers to permanent storage
+-- - extends of temporary tables
+-- Set temp_buffers to a low value so that we can trigger writes with fewer
+-- inserted tuples. Do so in a new session in case temporary tables have been
+-- accessed by previous tests in this session.
+\c
+SET temp_buffers TO '1MB';
+CREATE TEMPORARY TABLE test_io_local(a int, b TEXT);
+SELECT sum(extended) AS io_sum_local_extends_before
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'temp relation' \gset
+SELECT sum(evicted) AS io_sum_local_evictions_before
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'temp relation' \gset
+SELECT sum(written) AS io_sum_local_writes_before
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'temp relation' \gset
+-- Insert tuples into the temporary table, generating extends in the stats.
+-- Insert enough values that we need to reuse and write out dirty local
+-- buffers, generating evictions and writes.
+INSERT INTO test_io_local SELECT generate_series(1, 8000) as id, repeat('a', 100);
+SELECT sum(read) AS io_sum_local_reads_before
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'temp relation' \gset
+-- Read in evicted buffers, generating reads.
+SELECT COUNT(*) FROM test_io_local;
+ count
+-------
+ 8000
+(1 row)
+
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(evicted) AS io_sum_local_evictions_after
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'temp relation' \gset
+SELECT sum(read) AS io_sum_local_reads_after
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'temp relation' \gset
+SELECT sum(written) AS io_sum_local_writes_after
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'temp relation' \gset
+SELECT sum(extended) AS io_sum_local_extends_after
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'temp relation' \gset
+SELECT :io_sum_local_evictions_after > :io_sum_local_evictions_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_local_reads_after > :io_sum_local_reads_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_local_writes_after > :io_sum_local_writes_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_local_extends_after > :io_sum_local_extends_before;
+ ?column?
+----------
+ t
+(1 row)
+
+RESET temp_buffers;
+-- Test that reuse of strategy buffers and reads of blocks into these reused
+-- buffers while VACUUMing are tracked in pg_stat_io.
+-- Set wal_skip_threshold smaller than the expected size of
+-- test_io_vac_strategy so that, even if wal_level is minimal, VACUUM FULL will
+-- fsync the newly rewritten test_io_vac_strategy instead of writing it to WAL.
+-- Writing it to WAL will result in the newly written relation pages being in
+-- shared buffers -- preventing us from testing BAS_VACUUM BufferAccessStrategy
+-- reads.
+SET wal_skip_threshold = '1 kB';
+SELECT sum(reused) AS io_sum_vac_strategy_reuses_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(read) AS io_sum_vac_strategy_reads_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+CREATE TABLE test_io_vac_strategy(a int, b int) WITH (autovacuum_enabled = 'false');
+INSERT INTO test_io_vac_strategy SELECT i, i from generate_series(1, 8000)i;
+-- Ensure that the next VACUUM will need to perform IO by rewriting the table
+-- first with VACUUM (FULL).
+VACUUM (FULL) test_io_vac_strategy;
+VACUUM (PARALLEL 0) test_io_vac_strategy;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(reused) AS io_sum_vac_strategy_reuses_after FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(read) AS io_sum_vac_strategy_reads_after FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT :io_sum_vac_strategy_reads_after > :io_sum_vac_strategy_reads_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_vac_strategy_reuses_after > :io_sum_vac_strategy_reuses_before;
+ ?column?
+----------
+ t
+(1 row)
+
+RESET wal_skip_threshold;
+-- Test that extends done by a CTAS, which uses a BAS_BULKWRITE
+-- BufferAccessStrategy, are tracked in pg_stat_io.
+SELECT sum(extended) AS io_sum_bulkwrite_strategy_extends_before FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+CREATE TABLE test_io_bulkwrite_strategy AS SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(extended) AS io_sum_bulkwrite_strategy_extends_after FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+SELECT :io_sum_bulkwrite_strategy_extends_after > :io_sum_bulkwrite_strategy_extends_before;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Test IO stats reset
+SELECT sum(evicted) + sum(reused) + sum(extended) + sum(files_synced) + sum(read) + sum(written) AS io_stats_pre_reset FROM pg_stat_io \gset
+SELECT pg_stat_reset_shared('io');
+ pg_stat_reset_shared
+----------------------
+
+(1 row)
+
+SELECT sum(evicted) + sum(reused) + sum(extended) + sum(files_synced) + sum(read) + sum(written) AS io_stats_post_reset FROM pg_stat_io \gset
+SELECT :io_stats_post_reset < :io_stats_pre_reset;
+ ?column?
+----------
+ t
+(1 row)
+
-- End of Stats Test
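
As a usage sketch building on the tests above (view and column names as defined in this patch), <varname>op_bytes</varname> lets the per-combination counts be converted to bytes while preserving the NULL markers for invalid combinations:

```sql
-- Per-combination IO volume in bytes; NULL cells remain NULL,
-- distinguishing invalid combinations from zero observed IO.
SELECT backend_type, io_context, io_object,
       read * op_bytes     AS bytes_read,
       written * op_bytes  AS bytes_written,
       extended * op_bytes AS bytes_extended
FROM pg_stat_io
ORDER BY backend_type, io_context, io_object;
```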
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index b4d6753c71..7e0437d928 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -536,4 +536,142 @@ SELECT pg_stat_get_replication_slot(NULL);
SELECT pg_stat_get_subscription_stats(NULL);
+-- Test that the following operations are tracked in pg_stat_io:
+-- - reads of target blocks into shared buffers
+-- - writes of shared buffers to permanent storage
+-- - extends of relations using shared buffers
+-- - fsyncs done to ensure the durability of data dirtying shared buffers
+
+-- There is no test for blocks evicted from shared buffers, because we cannot
+-- be sure of the state of shared buffers at the point the test is run.
+
+-- Create a regular table and insert some data to generate IOCONTEXT_BUFFER_POOL
+-- extends.
+SELECT sum(extended) AS io_sum_shared_extends_before
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'relation' \gset
+CREATE TABLE test_io_shared(a int);
+INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+SELECT sum(extended) AS io_sum_shared_extends_after
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'relation' \gset
+SELECT :io_sum_shared_extends_after > :io_sum_shared_extends_before;
+
+-- After a checkpoint, there should be some additional IOCONTEXT_BUFFER_POOL writes
+-- and fsyncs.
+-- The second checkpoint ensures that stats from the first checkpoint have been
+-- reported and protects against any potential races amongst the table
+-- creation, a possible timing-triggered checkpoint, and the explicit
+-- checkpoint in the test.
+SELECT sum(written) AS io_sum_shared_writes_before
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'relation' \gset
+SELECT sum(files_synced) AS io_sum_shared_fsyncs_before
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'relation' \gset
+CHECKPOINT;
+CHECKPOINT;
+SELECT sum(written) AS io_sum_shared_writes_after
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'relation' \gset
+SELECT sum(files_synced) AS io_sum_shared_fsyncs_after
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'relation' \gset
+
+SELECT :io_sum_shared_writes_after > :io_sum_shared_writes_before;
+SELECT current_setting('fsync') = 'off' OR :io_sum_shared_fsyncs_after > :io_sum_shared_fsyncs_before;
+
+-- Change the tablespace so that the table is rewritten directly, then SELECT
+-- from it to cause it to be read back into shared buffers.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE test_io_shared_stats_tblspc LOCATION '';
+SELECT sum(read) AS io_sum_shared_reads_before
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'relation' \gset
+ALTER TABLE test_io_shared SET TABLESPACE test_io_shared_stats_tblspc;
+-- SELECT from the table so that it is read into shared buffers and io_context
+-- 'buffer pool', io_object 'relation' reads are counted.
+SELECT COUNT(*) FROM test_io_shared;
+SELECT pg_stat_force_next_flush();
+SELECT sum(read) AS io_sum_shared_reads_after
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'relation' \gset
+SELECT :io_sum_shared_reads_after > :io_sum_shared_reads_before;
+DROP TABLE test_io_shared;
+DROP TABLESPACE test_io_shared_stats_tblspc;
+
+-- Test that the following IOCONTEXT_LOCAL IOOps are tracked in pg_stat_io:
+-- - eviction of local buffers in order to reuse them
+-- - reads of temporary table blocks into local buffers
+-- - writes of local buffers to permanent storage
+-- - extends of temporary tables
+
+-- Set temp_buffers to a low value so that we can trigger writes with fewer
+-- inserted tuples. Do so in a new session in case temporary tables have been
+-- accessed by previous tests in this session.
+\c
+SET temp_buffers TO '1MB';
+CREATE TEMPORARY TABLE test_io_local(a int, b TEXT);
+SELECT sum(extended) AS io_sum_local_extends_before
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'temp relation' \gset
+SELECT sum(evicted) AS io_sum_local_evictions_before
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'temp relation' \gset
+SELECT sum(written) AS io_sum_local_writes_before
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'temp relation' \gset
+-- Insert tuples into the temporary table, generating extends in the stats.
+-- Insert enough values that we need to reuse and write out dirty local
+-- buffers, generating evictions and writes.
+INSERT INTO test_io_local SELECT generate_series(1, 8000) as id, repeat('a', 100);
+
+SELECT sum(read) AS io_sum_local_reads_before
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'temp relation' \gset
+-- Read in evicted buffers, generating reads.
+SELECT COUNT(*) FROM test_io_local;
+SELECT pg_stat_force_next_flush();
+SELECT sum(evicted) AS io_sum_local_evictions_after
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'temp relation' \gset
+SELECT sum(read) AS io_sum_local_reads_after
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'temp relation' \gset
+SELECT sum(written) AS io_sum_local_writes_after
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'temp relation' \gset
+SELECT sum(extended) AS io_sum_local_extends_after
+ FROM pg_stat_io WHERE io_context = 'buffer pool' AND io_object = 'temp relation' \gset
+SELECT :io_sum_local_evictions_after > :io_sum_local_evictions_before;
+SELECT :io_sum_local_reads_after > :io_sum_local_reads_before;
+SELECT :io_sum_local_writes_after > :io_sum_local_writes_before;
+SELECT :io_sum_local_extends_after > :io_sum_local_extends_before;
+RESET temp_buffers;
+
+-- Test that reuse of strategy buffers and reads of blocks into these reused
+-- buffers while VACUUMing are tracked in pg_stat_io.
+
+-- Set wal_skip_threshold smaller than the expected size of
+-- test_io_vac_strategy so that, even if wal_level is minimal, VACUUM FULL will
+-- fsync the newly rewritten test_io_vac_strategy instead of writing it to WAL.
+-- Writing it to WAL will result in the newly written relation pages being in
+-- shared buffers -- preventing us from testing BAS_VACUUM BufferAccessStrategy
+-- reads.
+SET wal_skip_threshold = '1 kB';
+SELECT sum(reused) AS io_sum_vac_strategy_reuses_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(read) AS io_sum_vac_strategy_reads_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+CREATE TABLE test_io_vac_strategy(a int, b int) WITH (autovacuum_enabled = 'false');
+INSERT INTO test_io_vac_strategy SELECT i, i from generate_series(1, 8000)i;
+-- Ensure that the next VACUUM will need to perform IO by rewriting the table
+-- first with VACUUM (FULL).
+VACUUM (FULL) test_io_vac_strategy;
+VACUUM (PARALLEL 0) test_io_vac_strategy;
+SELECT pg_stat_force_next_flush();
+SELECT sum(reused) AS io_sum_vac_strategy_reuses_after FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(read) AS io_sum_vac_strategy_reads_after FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT :io_sum_vac_strategy_reads_after > :io_sum_vac_strategy_reads_before;
+SELECT :io_sum_vac_strategy_reuses_after > :io_sum_vac_strategy_reuses_before;
+RESET wal_skip_threshold;
+
+-- Test that extends done by a CTAS, which uses a BAS_BULKWRITE
+-- BufferAccessStrategy, are tracked in pg_stat_io.
+SELECT sum(extended) AS io_sum_bulkwrite_strategy_extends_before FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+CREATE TABLE test_io_bulkwrite_strategy AS SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+SELECT sum(extended) AS io_sum_bulkwrite_strategy_extends_after FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+SELECT :io_sum_bulkwrite_strategy_extends_after > :io_sum_bulkwrite_strategy_extends_before;
+
+-- Test IO stats reset
+SELECT sum(evicted) + sum(reused) + sum(extended) + sum(files_synced) + sum(read) + sum(written) AS io_stats_pre_reset FROM pg_stat_io \gset
+SELECT pg_stat_reset_shared('io');
+SELECT sum(evicted) + sum(reused) + sum(extended) + sum(files_synced) + sum(read) + sum(written) AS io_stats_post_reset FROM pg_stat_io \gset
+SELECT :io_stats_post_reset < :io_stats_pre_reset;
+
-- End of Stats Test
--
2.38.1
On Mon, Nov 28, 2022 at 09:08:36PM -0500, Melanie Plageman wrote:
+pgstat_io_op_stats_collected(BackendType bktype)
+{
+	return bktype != B_INVALID && bktype != B_ARCHIVER && bktype != B_LOGGER &&
+		bktype != B_WAL_RECEIVER && bktype != B_WAL_WRITER;

Similar: I'd prefer to see this as 5 "ifs" or a "switch" to return
false, else return true. But YMMV.

I don't know that separating it into multiple if statements or a switch
would make it more clear to me or help me with debugging here.

Separately, since this is used in non-assert builds, I would like to
ensure it is efficient. Do you know if a switch or if statements will
be compiled to the exact same thing as this at useful optimization
levels?
This doesn't seem like a detail worth much bother, but I did a test.
With -O2 (but not -O1 nor -Og) the assembly (gcc 9.4) is the same when
written like:
+ if (bktype == B_INVALID)
+ return false;
+ if (bktype == B_ARCHIVER)
+ return false;
+ if (bktype == B_LOGGER)
+ return false;
+ if (bktype == B_WAL_RECEIVER)
+ return false;
+ if (bktype == B_WAL_WRITER)
+ return false;
+
+ return true;
objdump --disassemble=pgstat_io_op_stats_collected src/backend/postgres_lib.a.p/utils_activity_pgstat_io_ops.c.o
0000000000000110 <pgstat_io_op_stats_collected>:
110: f3 0f 1e fa endbr64
114: b8 01 00 00 00 mov $0x1,%eax
119: 83 ff 0d cmp $0xd,%edi
11c: 77 10 ja 12e <pgstat_io_op_stats_collected+0x1e>
11e: b8 03 29 00 00 mov $0x2903,%eax
123: 89 f9 mov %edi,%ecx
125: 48 d3 e8 shr %cl,%rax
128: 48 f7 d0 not %rax
12b: 83 e0 01 and $0x1,%eax
12e: c3 retq
I was surprised, but the assembly is *not* the same when I used a switch{}.
I think it's fine to write however you want.
pgstat_count_io_op() has a superfluous newline before "}".
I couldn't find the one you are referencing.
Do you mind pasting in the code?
+ case IOOP_WRITE:
+ pending_counters->writes++;
+ break;
+ }
+ --> here <--
+}
--
Justin
On Tue, Nov 29, 2022 at 5:13 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
Thanks for the review, Maciek!
I've attached a new version 39 of the patch which addresses your docs
feedback from this email as well as docs feedback from Andres in [1] and
Justin in [2].
This looks great! Just a couple of minor comments.
You are right: reused is a normal, expected part of strategy
execution. And you are correct: the idea behind reusing existing
strategy buffers instead of taking buffers off the freelist is to leave
those buffers for blocks that we might expect to be accessed more than
once.

In practice, however, if you happen to not be using many shared buffers,
and then do a large COPY, for example, you will end up doing a bunch of
writes (in order to reuse the strategy buffers) that you perhaps didn't
need to do at that time had you leveraged the freelist. I think the
decision about which tradeoff to make is quite contentious, though.
Thanks for the explanation--that makes sense.
On Mon, Nov 7, 2022 at 1:26 PM Maciek Sakrejda <m.sakrejda@gmail.com> wrote:
Alternately, what do you think about pulling equivalencies to existing
views out of the main column descriptions, and adding them after the
main table as a sort of footnote? Most view docs don't have anything
like that, but pg_stat_replication does and it might be a good pattern
to follow.

Thoughts?
Thanks for including a patch!
In the attached v39, I've taken your suggestion of flattening some of
the lists and done some rewording as well. I have also moved the note
about equivalence with pg_stat_statements columns to the
pg_stat_statements documentation. The result is quite a bit different
than what I had before, so I would be interested to hear your thoughts.

My concern with the blue "note" section like you mentioned is that it
would be harder to read the lists of backend types than it was in the
tabular format.
Oh, I wasn't thinking of doing a separate "note": just additional
paragraphs of text after the table (like what pg_stat_replication has
before its "note", or the brief comment after the pg_stat_archiver
table). But I think the updated docs work also.
+ <para>
+ The context or location of an IO operation.
+ </para>
maybe "...of an IO operation:" (colon) instead?
+ default. Future values could include those derived from
+ <symbol>XLOG_BLCKSZ</symbol>, once WAL IO is tracked in this view, and
+ constant multipliers once non-block-oriented IO (e.g. temporary file IO)
+ is tracked here.
I know Lukas had commented that we should communicate that the goal is
to eventually provide relatively comprehensive I/O stats in this view
(you do that in the view description and I think that works), and this
is sort of along those lines, but I think speculative documentation
like this is not all that helpful. I'd drop this last sentence. Just
my two cents.
+ <para>
+ <varname>evicted</varname> in <varname>io_context</varname>
+ <literal>buffer pool</literal> and <varname>io_object</varname>
+ <literal>temp relation</literal> counts the number of times a block of
+ data from an existing local buffer was evicted in order to replace it
+ with another block, also in local buffers.
+ </para>
Doesn't this follow from the first sentence of the column description?
I think we could drop this, no?
Otherwise, the docs look good to me.
Thanks,
Maciek
Hi,
- I think it might be worth renaming IOCONTEXT_BUFFER_POOL to
IOCONTEXT_{NORMAL, PLAIN, DEFAULT}. I'd like at some point to track WAL IO,
temporary file IO etc, and it doesn't seem useful to define a version of
BUFFER_POOL for each of them. And it'd make it less confusing, because all
the other existing contexts are also in the buffer pool (for now, can't wait
for "bypass" or whatever to be tracked as well).
- given that IOContextForStrategy() is defined in freelist.c, and that
declaring it in pgstat.h requires including buf.h, I think it's probably
better to move IOContextForStrategy()'s declaration to freelist.h (doesn't
exist, but whatever the right one is)
- pgstat_backend_io_stats_assert_well_formed() doesn't seem to belong in
pgstat.c. Why not pgstat_io_ops.c?
- Do pgstat_io_context_ops_assert_zero(), pgstat_io_op_assert_zero() have to
be in pgstat.h?
I think the only non-trivial thing is the first point; the rest is stuff that I
can also evolve during commit.
Greetings,
Andres Freund
Attached is v40.
I have addressed the feedback from Justin [1] and Maciek [2] as well.
I took all of the suggestions regarding the docs that Maciek made,
including the following:
+ default. Future values could include those derived from
+ <symbol>XLOG_BLCKSZ</symbol>, once WAL IO is tracked in this view, and
+ constant multipliers once non-block-oriented IO (e.g. temporary file IO)
+ is tracked here.

I know Lukas had commented that we should communicate that the goal is
to eventually provide relatively comprehensive I/O stats in this view
(you do that in the view description and I think that works), and this
is sort of along those lines, but I think speculative documentation
like this is not all that helpful. I'd drop this last sentence. Just
my two cents.
I have removed this and added the relevant part of this as a comment to
the view generating function pg_stat_get_io().
On Mon, Dec 5, 2022 at 2:32 PM Andres Freund <andres@anarazel.de> wrote:
- I think it might be worth renaming IOCONTEXT_BUFFER_POOL to
IOCONTEXT_{NORMAL, PLAIN, DEFAULT}. I'd like at some point to track WAL IO,
temporary file IO etc, and it doesn't seem useful to define a version of
BUFFER_POOL for each of them. And it'd make it less confusing, because all
the other existing contexts are also in the buffer pool (for now, can't wait
for "bypass" or whatever to be tracked as well).
In attached v40, I've renamed IOCONTEXT_BUFFER_POOL to IOCONTEXT_NORMAL.
- given that IOContextForStrategy() is defined in freelist.c, and that
declaring it in pgstat.h requires including buf.h, I think it's probably
better to move IOContextForStrategy()'s declaration to freelist.h (doesn't
exist, but whatever the right one is)
I have moved it to buf_internals.h.
- pgstat_backend_io_stats_assert_well_formed() doesn't seem to belong in
pgstat.c. Why not pgstat_io_ops.c?
I put it in pgstat.c because it is only used there -- so I made it
static. I've moved it to pgstat_io_ops.c and declared it in
pgstat_internal.h.
- Do pgstat_io_context_ops_assert_zero(), pgstat_io_op_assert_zero() have to
be in pgstat.h?
They are used in pgstatfuncs.c, which I presume should not include
pgstat_internal.h. Or did you mean that I should not put them in a
header file at all?
- Melanie
[1]: /messages/by-id/20221130025113.GD24131@telsasoft.com
[2]: /messages/by-id/CAOtHd0BfFdMqO7-zDOk=iJTatzSDgVcgYcaR1_wk0GS4NN+RUQ@mail.gmail.com
Attachments:
v40-0002-Track-IO-operation-statistics-locally.patch (application/octet-stream)
From 171f5a5be7b93751378a9cd4d5d8b9731401e78e Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 5 Dec 2022 19:32:42 -0500
Subject: [PATCH v40 2/4] Track IO operation statistics locally
Introduce "IOOp", an IO operation done by a backend, "IOObject", the
target object of the IO, and "IOContext", the context or location of the
IO operations on that object. For example, the checkpointer may write a
shared buffer out. This would be counted as an IOOp "written" on an
IOObject IOOBJECT_RELATION in IOContext IOCONTEXT_NORMAL by BackendType
"checkpointer".
Each IOOp (evict, extend, fsync, read, reuse, and write) is counted per
IOObject (relation, temp relation) per IOContext (normal, bulkread,
bulkwrite, or vacuum) through a call to pgstat_count_io_op().
The primary concern of these statistics is IO operations on data blocks
during the course of normal database operations. IO operations done by,
for example, the archiver or syslogger are not counted in these
statistics. WAL IO, temporary file IO, and IO done directly through smgr*
functions (such as when building an index) are not yet counted but would
be useful future additions.
IOContext IOCONTEXT_NORMAL concerns operations on local and shared
buffers.
The IOCONTEXT_BULKREAD, IOCONTEXT_BULKWRITE, and IOCONTEXT_VACUUM
IOContexts concern IO operations on buffers as part of a
BufferAccessStrategy.
IOOP_EVICT IOOps are counted in IOCONTEXT_NORMAL when a buffer is
acquired or allocated through [Local]BufferAlloc() and no
BufferAccessStrategy is in use.
When a BufferAccessStrategy is in use, shared buffers added to the
strategy ring are counted as IOOP_EVICT IOOps in the
IOCONTEXT_[BULKREAD|BULKWRITE|VACUUM] IOContext. When one of these
buffers is reused, it is counted as an IOOP_REUSE IOOp in the
corresponding strategy IOContext.
IOOP_WRITE IOOps are counted in the BufferAccessStrategy IOContexts
whenever the reused dirty buffer is written out.
IO Operations on buffers containing temporary table data are counted as
operations on IOOBJECT_TEMP_RELATION IOObjects.
Stats on IOOps in all IOContexts for a given backend are counted in a
backend's local memory. A subsequent commit will expose functions for
aggregating and viewing these stats.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Maciek Sakrejda <m.sakrejda@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
src/backend/postmaster/checkpointer.c | 13 +
src/backend/storage/buffer/bufmgr.c | 93 ++++++-
src/backend/storage/buffer/freelist.c | 37 ++-
src/backend/storage/buffer/localbuf.c | 4 +
src/backend/storage/sync/sync.c | 2 +
src/backend/utils/activity/Makefile | 1 +
src/backend/utils/activity/meson.build | 1 +
src/backend/utils/activity/pgstat_io_ops.c | 280 +++++++++++++++++++++
src/include/pgstat.h | 119 +++++++++
src/include/storage/buf_internals.h | 2 +
src/include/storage/bufmgr.h | 7 +-
src/tools/pgindent/typedefs.list | 6 +
12 files changed, 554 insertions(+), 11 deletions(-)
create mode 100644 src/backend/utils/activity/pgstat_io_ops.c
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 5fc076fc14..783bca52fd 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -1116,6 +1116,19 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
if (!AmBackgroundWriterProcess())
CheckpointerShmem->num_backend_fsync++;
LWLockRelease(CheckpointerCommLock);
+
+ /*
+ * We have no way of knowing if the current IOContext is
+ * IOCONTEXT_NORMAL or IOCONTEXT_[BULKREAD, BULKWRITE, VACUUM] at this
+ * point, so count the fsync as being in the IOCONTEXT_NORMAL
+ * IOContext. This is probably okay, because the number of backend
+ * fsyncs doesn't say anything about the efficacy of the
+ * BufferAccessStrategy. And counting both fsyncs done in
+ * IOCONTEXT_NORMAL and IOCONTEXT_[BULKREAD, BULKWRITE, VACUUM] under
+ * IOCONTEXT_NORMAL is likely clearer when investigating the number of
+ * backend fsyncs.
+ */
+ pgstat_count_io_op(IOOP_FSYNC, IOOBJECT_RELATION, IOCONTEXT_NORMAL);
return false;
}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index fa32f24e19..f2e371e1d8 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -482,7 +482,8 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
BlockNumber blockNum,
BufferAccessStrategy strategy,
bool *foundPtr);
-static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+static void FlushBuffer(BufferDesc *buf, SMgrRelation reln,
+ IOContext io_context, IOObject io_object);
static void FindAndDropRelationBuffers(RelFileLocator rlocator,
ForkNumber forkNum,
BlockNumber nForkBlock,
@@ -823,6 +824,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
BufferDesc *bufHdr;
Block bufBlock;
bool found;
+ IOContext io_context;
+ IOObject io_object;
bool isExtend;
bool isLocalBuf = SmgrIsTemp(smgr);
@@ -986,10 +989,28 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
*/
Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID)); /* spinlock not needed */
- bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
+ if (isLocalBuf)
+ {
+ bufBlock = LocalBufHdrGetBlock(bufHdr);
+
+ /*
+ * Though a strategy object may be passed in, no strategy is employed
+ * when using local buffers. This could happen when doing, for
+ * example, CREATE TEMPORARY TABLE AS ...
+ */
+ io_context = IOCONTEXT_NORMAL;
+ io_object = IOOBJECT_TEMP_RELATION;
+ }
+ else
+ {
+ bufBlock = BufHdrGetBlock(bufHdr);
+ io_context = IOContextForStrategy(strategy);
+ io_object = IOOBJECT_RELATION;
+ }
if (isExtend)
{
+ pgstat_count_io_op(IOOP_EXTEND, io_object, io_context);
/* new buffers are zero-filled */
MemSet((char *) bufBlock, 0, BLCKSZ);
/* don't set checksum for all-zero page */
@@ -1015,6 +1036,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
instr_time io_start,
io_time;
+ pgstat_count_io_op(IOOP_READ, io_object, io_context);
+
if (track_io_timing)
INSTR_TIME_SET_CURRENT(io_start);
@@ -1122,6 +1145,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
bool *foundPtr)
{
bool from_ring;
+ IOContext io_context;
BufferTag newTag; /* identity of requested block */
uint32 newHash; /* hash value for newTag */
LWLock *newPartitionLock; /* buffer partition lock for it */
@@ -1188,6 +1212,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
*/
LWLockRelease(newPartitionLock);
+ io_context = IOContextForStrategy(strategy);
+
/* Loop here in case we have to try another victim buffer */
for (;;)
{
@@ -1264,13 +1290,35 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
}
}
+ /*
+ * When a strategy is in use, only flushes of dirty buffers
+ * already in the strategy ring are counted as strategy writes
+ * (IOCONTEXT [BULKREAD|BULKWRITE|VACUUM] IOOP_WRITE) for the
+ * purpose of IO operation statistics tracking.
+ *
+ * If a shared buffer initially added to the ring must be
+ * flushed before being used, this is counted as an
+ * IOCONTEXT_NORMAL IOOP_WRITE.
+ *
+ * If a shared buffer which was added to the ring later
+ * because the current strategy buffer is pinned or in use or
+ * because all strategy buffers were dirty and rejected (for
+ * BAS_BULKREAD operations only) requires flushing, this is
+ * counted as an IOCONTEXT_NORMAL IOOP_WRITE (from_ring will
+ * be false).
+ *
+ * When a strategy is not in use, the write can only be a
+ * "regular" write of a dirty shared buffer (IOCONTEXT_NORMAL
+ * IOOP_WRITE).
+ */
+
/* OK, do the I/O */
TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(forkNum, blockNum,
smgr->smgr_rlocator.locator.spcOid,
smgr->smgr_rlocator.locator.dbOid,
smgr->smgr_rlocator.locator.relNumber);
- FlushBuffer(buf, NULL);
+ FlushBuffer(buf, NULL, io_context, IOOBJECT_RELATION);
LWLockRelease(BufferDescriptorGetContentLock(buf));
ScheduleBufferTagForWriteback(&BackendWritebackContext,
@@ -1442,6 +1490,29 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
UnlockBufHdr(buf, buf_state);
+ if (oldFlags & BM_VALID)
+ {
+ /*
+ * When a BufferAccessStrategy is in use, blocks evicted from shared
+ * buffers are counted as IOOP_EVICT IO Operations in the
+ * corresponding context (e.g. IOCONTEXT_BULKWRITE). Shared buffers
+ * are evicted by a strategy in two cases: 1) while initially claiming
+ * buffers for the strategy ring 2) to replace an existing strategy
+ * ring buffer because it is pinned or in use and cannot be reused.
+ *
+ * Blocks evicted from buffers already in the strategy ring are
+ * counted as IOOP_REUSE IO Operations in the corresponding strategy
+ * context.
+ *
+ * At this point, we can accurately count evictions and reuses,
+ * because we have successfully claimed the valid buffer. Previously,
+ * we may have been forced to release the buffer due to concurrent
+ * pinners or erroring out.
+ */
+ pgstat_count_io_op(from_ring ? IOOP_REUSE : IOOP_EVICT,
+ IOOBJECT_RELATION, io_context);
+ }
+
if (oldPartitionLock != NULL)
{
BufTableDelete(&oldTag, oldHash);
@@ -2571,7 +2642,7 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOCONTEXT_NORMAL, IOOBJECT_RELATION);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
@@ -2821,7 +2892,7 @@ BufferGetTag(Buffer buffer, RelFileLocator *rlocator, ForkNumber *forknum,
* as the second parameter. If not, pass NULL.
*/
static void
-FlushBuffer(BufferDesc *buf, SMgrRelation reln)
+FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOContext io_context, IOObject io_object)
{
XLogRecPtr recptr;
ErrorContextCallback errcallback;
@@ -2901,6 +2972,8 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
*/
bufToWrite = PageSetChecksumCopy((Page) bufBlock, buf->tag.blockNum);
+ pgstat_count_io_op(IOOP_WRITE, IOOBJECT_RELATION, io_context);
+
if (track_io_timing)
INSTR_TIME_SET_CURRENT(io_start);
@@ -3552,6 +3625,8 @@ FlushRelationBuffers(Relation rel)
localpage,
false);
+ pgstat_count_io_op(IOOP_WRITE, IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL);
+
buf_state &= ~(BM_DIRTY | BM_JUST_DIRTIED);
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
@@ -3587,7 +3662,7 @@ FlushRelationBuffers(Relation rel)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, RelationGetSmgr(rel));
+ FlushBuffer(bufHdr, RelationGetSmgr(rel), IOCONTEXT_NORMAL, IOOBJECT_RELATION);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr);
}
@@ -3685,7 +3760,7 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, srelent->srel);
+ FlushBuffer(bufHdr, srelent->srel, IOCONTEXT_NORMAL, IOOBJECT_RELATION);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr);
}
@@ -3895,7 +3970,7 @@ FlushDatabaseBuffers(Oid dbid)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOCONTEXT_NORMAL, IOOBJECT_RELATION);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr);
}
@@ -3922,7 +3997,7 @@ FlushOneBuffer(Buffer buffer)
Assert(LWLockHeldByMe(BufferDescriptorGetContentLock(bufHdr)));
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOCONTEXT_NORMAL, IOOBJECT_RELATION);
}
/*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 5299bb8711..d318976b9e 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -15,6 +15,7 @@
*/
#include "postgres.h"
+#include "pgstat.h"
#include "port/atomics.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
@@ -601,7 +602,7 @@ FreeAccessStrategy(BufferAccessStrategy strategy)
/*
* GetBufferFromRing -- returns a buffer from the ring, or NULL if the
- * ring is empty.
+ * ring is empty / not usable.
*
* The bufhdr spin lock is held on the returned buffer.
*/
@@ -664,6 +665,40 @@ AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
strategy->buffers[strategy->current] = BufferDescriptorGetBuffer(buf);
}
+/*
+ * Utility function returning the IOContext of a given BufferAccessStrategy's
+ * strategy ring.
+ */
+IOContext
+IOContextForStrategy(BufferAccessStrategy strategy)
+{
+ if (!strategy)
+ return IOCONTEXT_NORMAL;
+
+ switch (strategy->btype)
+ {
+ case BAS_NORMAL:
+
+ /*
+ * Currently, GetAccessStrategy() returns NULL for
+ * BufferAccessStrategyType BAS_NORMAL, so this case is
+ * unreachable.
+ */
+ pg_unreachable();
+ return IOCONTEXT_NORMAL;
+ case BAS_BULKREAD:
+ return IOCONTEXT_BULKREAD;
+ case BAS_BULKWRITE:
+ return IOCONTEXT_BULKWRITE;
+ case BAS_VACUUM:
+ return IOCONTEXT_VACUUM;
+ }
+
+ elog(ERROR, "unrecognized BufferAccessStrategyType: %d", strategy->btype);
+
+ pg_unreachable();
+}
+
/*
* StrategyRejectBuffer -- consider rejecting a dirty buffer
*
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 30d67d1c40..e27d623174 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -18,6 +18,7 @@
#include "access/parallel.h"
#include "catalog/catalog.h"
#include "executor/instrument.h"
+#include "pgstat.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "utils/guc_hooks.h"
@@ -226,6 +227,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
localpage,
false);
+ pgstat_count_io_op(IOOP_WRITE, IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL);
+
/* Mark not-dirty now in case we error out below */
buf_state &= ~BM_DIRTY;
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
@@ -256,6 +259,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
ClearBufferTag(&bufHdr->tag);
buf_state &= ~(BM_VALID | BM_TAG_VALID);
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+ pgstat_count_io_op(IOOP_EVICT, IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL);
}
hresult = (LocalBufferLookupEnt *)
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 9d6a9e9109..684a1c3e21 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -432,6 +432,8 @@ ProcessSyncRequests(void)
total_elapsed += elapsed;
processed++;
+ pgstat_count_io_op(IOOP_FSYNC, IOOBJECT_RELATION, IOCONTEXT_NORMAL);
+
if (log_checkpoints)
elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f ms",
processed,
diff --git a/src/backend/utils/activity/Makefile b/src/backend/utils/activity/Makefile
index a2e8507fd6..0098785089 100644
--- a/src/backend/utils/activity/Makefile
+++ b/src/backend/utils/activity/Makefile
@@ -22,6 +22,7 @@ OBJS = \
pgstat_checkpointer.o \
pgstat_database.o \
pgstat_function.o \
+ pgstat_io_ops.o \
pgstat_relation.o \
pgstat_replslot.o \
pgstat_shmem.o \
diff --git a/src/backend/utils/activity/meson.build b/src/backend/utils/activity/meson.build
index 5b3b558a67..1038324c32 100644
--- a/src/backend/utils/activity/meson.build
+++ b/src/backend/utils/activity/meson.build
@@ -7,6 +7,7 @@ backend_sources += files(
'pgstat_checkpointer.c',
'pgstat_database.c',
'pgstat_function.c',
+ 'pgstat_io_ops.c',
'pgstat_relation.c',
'pgstat_replslot.c',
'pgstat_shmem.c',
diff --git a/src/backend/utils/activity/pgstat_io_ops.c b/src/backend/utils/activity/pgstat_io_ops.c
new file mode 100644
index 0000000000..6fbb6b185e
--- /dev/null
+++ b/src/backend/utils/activity/pgstat_io_ops.c
@@ -0,0 +1,280 @@
+/* -------------------------------------------------------------------------
+ *
+ * pgstat_io_ops.c
+ * Implementation of IO operation statistics.
+ *
+ * This file contains the implementation of IO operation statistics. It is kept
+ * separate from pgstat.c to enforce the line between the statistics access /
+ * storage implementation and the details about individual types of
+ * statistics.
+ *
+ * Copyright (c) 2021-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/utils/activity/pgstat_io_ops.c
+ * -------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "utils/pgstat_internal.h"
+
+static PgStat_IOContextOps pending_IOOpStats;
+
+void
+pgstat_count_io_op(IOOp io_op, IOObject io_object, IOContext io_context)
+{
+ PgStat_IOOpCounters *pending_counters;
+
+ Assert(io_context < IOCONTEXT_NUM_TYPES);
+ Assert(io_op < IOOP_NUM_TYPES);
+ Assert(pgstat_expect_io_op(MyBackendType, io_context, io_object, io_op));
+
+ pending_counters = &pending_IOOpStats.data[io_context].data[io_object];
+
+ switch (io_op)
+ {
+ case IOOP_EVICT:
+ pending_counters->evictions++;
+ break;
+ case IOOP_EXTEND:
+ pending_counters->extends++;
+ break;
+ case IOOP_FSYNC:
+ pending_counters->fsyncs++;
+ break;
+ case IOOP_READ:
+ pending_counters->reads++;
+ break;
+ case IOOP_REUSE:
+ pending_counters->reuses++;
+ break;
+ case IOOP_WRITE:
+ pending_counters->writes++;
+ break;
+ }
+}
+
+const char *
+pgstat_io_context_desc(IOContext io_context)
+{
+ switch (io_context)
+ {
+ case IOCONTEXT_BULKREAD:
+ return "bulkread";
+ case IOCONTEXT_BULKWRITE:
+ return "bulkwrite";
+ case IOCONTEXT_NORMAL:
+ return "normal";
+ case IOCONTEXT_VACUUM:
+ return "vacuum";
+ }
+
+ elog(ERROR, "unrecognized IOContext value: %d", io_context);
+
+ pg_unreachable();
+}
+
+const char *
+pgstat_io_object_desc(IOObject io_object)
+{
+ switch (io_object)
+ {
+ case IOOBJECT_RELATION:
+ return "relation";
+ case IOOBJECT_TEMP_RELATION:
+ return "temp relation";
+ }
+
+ elog(ERROR, "unrecognized IOObject value: %d", io_object);
+
+ pg_unreachable();
+}
+
+const char *
+pgstat_io_op_desc(IOOp io_op)
+{
+ switch (io_op)
+ {
+ case IOOP_EVICT:
+ return "evicted";
+ case IOOP_EXTEND:
+ return "extended";
+ case IOOP_FSYNC:
+ return "files synced";
+ case IOOP_READ:
+ return "read";
+ case IOOP_REUSE:
+ return "reused";
+ case IOOP_WRITE:
+ return "written";
+ }
+
+ elog(ERROR, "unrecognized IOOp value: %d", io_op);
+
+ pg_unreachable();
+}
+
+/*
+ * IO Operation statistics are not collected for all BackendTypes.
+ *
+ * The following BackendTypes do not participate in the cumulative stats
+ * subsystem or do not perform IO operations on which we currently report:
+ * - Syslogger because it is not connected to shared memory
+ * - Archiver because most relevant archiving IO is delegated to a
+ *   specialized command or module
+ * - WAL Receiver and WAL Writer IO is not tracked in pg_stat_io for now
+ *
+ * Function returns true if BackendType participates in the cumulative stats
+ * subsystem for IO Operations and false if it does not.
+ */
+bool
+pgstat_io_op_stats_collected(BackendType bktype)
+{
+ switch (bktype)
+ {
+ case B_INVALID:
+ case B_ARCHIVER:
+ case B_LOGGER:
+ case B_WAL_RECEIVER:
+ case B_WAL_WRITER:
+ return false;
+ default:
+ return true;
+ }
+}
+
+
+/*
+ * Some BackendTypes do not perform IO operations in certain IOContexts. Some
+ * IOObjects are never operated on in some IOContexts. Check that the given
+ * BackendType is expected to do IO in the given IOContext and that the given
+ * IOObject is expected to be operated on in the given IOContext.
+ */
+bool
+pgstat_bktype_io_context_io_object_valid(BackendType bktype,
+ IOContext io_context, IOObject io_object)
+{
+ bool no_temp_rel;
+
+ /*
+ * Currently, IO operations on temporary relations can only occur in the
+ * IOCONTEXT_NORMAL IOContext.
+ */
+ if (io_context != IOCONTEXT_NORMAL &&
+ io_object == IOOBJECT_TEMP_RELATION)
+ return false;
+
+ /*
+ * In core Postgres, only regular backends and WAL Sender processes
+ * executing queries will use local buffers and operate on temporary
+ * relations. Parallel workers will not use local buffers (see
+ * InitLocalBuffers()); however, extensions leveraging background workers
+ * have no such limitation, so track IO Operations on
+ * IOOBJECT_TEMP_RELATION for BackendType B_BG_WORKER.
+ */
+ no_temp_rel = bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER ||
+ bktype == B_CHECKPOINTER || bktype == B_AUTOVAC_WORKER ||
+ bktype == B_STANDALONE_BACKEND || bktype == B_STARTUP;
+
+ if (no_temp_rel && io_context == IOCONTEXT_NORMAL &&
+ io_object == IOOBJECT_TEMP_RELATION)
+ return false;
+
+ /*
+ * Some BackendTypes do not currently perform any IO operations in certain
+ * IOContexts, and, while it may not be inherently incorrect for them to
+ * do so, excluding those rows from the view makes the view easier to use.
+ */
+ if ((bktype == B_CHECKPOINTER || bktype == B_BG_WRITER) &&
+ (io_context == IOCONTEXT_BULKREAD ||
+ io_context == IOCONTEXT_BULKWRITE ||
+ io_context == IOCONTEXT_VACUUM))
+ return false;
+
+ if (bktype == B_AUTOVAC_LAUNCHER && io_context == IOCONTEXT_VACUUM)
+ return false;
+
+ if ((bktype == B_AUTOVAC_WORKER || bktype == B_AUTOVAC_LAUNCHER) &&
+ io_context == IOCONTEXT_BULKWRITE)
+ return false;
+
+ return true;
+}
+
+/*
+ * Some BackendTypes will never do certain IOOps and some IOOps should not
+ * occur in certain IOContexts. Check that the given IOOp is valid for the
+ * given BackendType in the given IOContext. Note that there are currently no
+ * cases of an IOOp being invalid for a particular BackendType only within a
+ * certain IOContext.
+ */
+bool
+pgstat_io_op_valid(BackendType bktype, IOContext io_context, IOObject io_object, IOOp io_op)
+{
+ bool strategy_io_context;
+
+ /*
+ * Some BackendTypes should never track IO Operation statistics.
+ */
+ Assert(pgstat_io_op_stats_collected(bktype));
+
+ /*
+ * Some BackendTypes will not do certain IOOps.
+ */
+ if ((bktype == B_BG_WRITER || bktype == B_CHECKPOINTER) &&
+ (io_op == IOOP_READ || io_op == IOOP_EVICT))
+ return false;
+
+ if ((bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER ||
+ bktype == B_CHECKPOINTER) && io_op == IOOP_EXTEND)
+ return false;
+
+ /*
+ * Some IOOps are not valid in certain IOContexts and some IOOps are only
+ * valid in certain contexts.
+ */
+ if (io_context == IOCONTEXT_BULKREAD && io_op == IOOP_EXTEND)
+ return false;
+
+ strategy_io_context = io_context == IOCONTEXT_BULKREAD ||
+ io_context == IOCONTEXT_BULKWRITE || io_context == IOCONTEXT_VACUUM;
+
+ /*
+ * IOOP_REUSE is only relevant when a BufferAccessStrategy is in use.
+ */
+ if (!strategy_io_context && io_op == IOOP_REUSE)
+ return false;
+
+ /*
+ * IOOP_FSYNC IOOps done by a backend using a BufferAccessStrategy are
+ * counted in the IOCONTEXT_NORMAL IOContext. See comment in
+ * ForwardSyncRequest() for more details.
+ */
+ if (strategy_io_context && io_op == IOOP_FSYNC)
+ return false;
+
+ /*
+ * Temporary tables are not logged and thus do not require fsync'ing.
+ */
+ if (io_context == IOCONTEXT_NORMAL &&
+ io_object == IOOBJECT_TEMP_RELATION && io_op == IOOP_FSYNC)
+ return false;
+
+ return true;
+}
+
+bool
+pgstat_expect_io_op(BackendType bktype, IOContext io_context, IOObject io_object, IOOp io_op)
+{
+ if (!pgstat_io_op_stats_collected(bktype))
+ return false;
+
+ if (!pgstat_bktype_io_context_io_object_valid(bktype, io_context, io_object))
+ return false;
+
+ if (!pgstat_io_op_valid(bktype, io_context, io_object, io_op))
+ return false;
+
+ return true;
+}
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 9e2ce6f011..a57e39042f 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -276,6 +276,63 @@ typedef struct PgStat_CheckpointerStats
PgStat_Counter buf_fsync_backend;
} PgStat_CheckpointerStats;
+/*
+ * Types related to counting IO Operations for various IO Contexts.
+ * When adding a new value, ensure that the proper assertions are added to
+ * pgstat_io_context_ops_assert_zero() and pgstat_io_op_assert_zero() (though
+ * the compiler will remind you about the latter).
+ */
+
+typedef enum IOOp
+{
+ IOOP_EVICT,
+ IOOP_EXTEND,
+ IOOP_FSYNC,
+ IOOP_READ,
+ IOOP_REUSE,
+ IOOP_WRITE,
+} IOOp;
+
+#define IOOP_NUM_TYPES (IOOP_WRITE + 1)
+
+typedef enum IOObject
+{
+ IOOBJECT_RELATION,
+ IOOBJECT_TEMP_RELATION,
+} IOObject;
+
+#define IOOBJECT_NUM_TYPES (IOOBJECT_TEMP_RELATION + 1)
+
+typedef enum IOContext
+{
+ IOCONTEXT_BULKREAD,
+ IOCONTEXT_BULKWRITE,
+ IOCONTEXT_NORMAL,
+ IOCONTEXT_VACUUM,
+} IOContext;
+
+#define IOCONTEXT_NUM_TYPES (IOCONTEXT_VACUUM + 1)
+
+typedef struct PgStat_IOOpCounters
+{
+ PgStat_Counter evictions;
+ PgStat_Counter extends;
+ PgStat_Counter fsyncs;
+ PgStat_Counter reads;
+ PgStat_Counter reuses;
+ PgStat_Counter writes;
+} PgStat_IOOpCounters;
+
+typedef struct PgStat_IOObjectOps
+{
+ PgStat_IOOpCounters data[IOOBJECT_NUM_TYPES];
+} PgStat_IOObjectOps;
+
+typedef struct PgStat_IOContextOps
+{
+ PgStat_IOObjectOps data[IOCONTEXT_NUM_TYPES];
+} PgStat_IOContextOps;
+
typedef struct PgStat_StatDBEntry
{
PgStat_Counter n_xact_commit;
@@ -453,6 +510,68 @@ extern void pgstat_report_checkpointer(void);
extern PgStat_CheckpointerStats *pgstat_fetch_stat_checkpointer(void);
+/*
+ * Functions in pgstat_io_ops.c
+ */
+
+extern void pgstat_count_io_op(IOOp io_op, IOObject io_object, IOContext io_context);
+extern const char *pgstat_io_context_desc(IOContext io_context);
+extern const char *pgstat_io_object_desc(IOObject io_object);
+extern const char *pgstat_io_op_desc(IOOp io_op);
+
+/* Validation functions in pgstat_io_ops.c */
+extern bool pgstat_io_op_stats_collected(BackendType bktype);
+extern bool pgstat_bktype_io_context_io_object_valid(BackendType bktype,
+ IOContext io_context, IOObject io_object);
+extern bool pgstat_io_op_valid(BackendType bktype, IOContext io_context,
+ IOObject io_object, IOOp io_op);
+extern bool pgstat_expect_io_op(BackendType bktype,
+ IOContext io_context, IOObject io_object, IOOp io_op);
+
+/*
+ * Functions to assert that invalid IO Operation counters are zero.
+ */
+static inline void
+pgstat_io_context_ops_assert_zero(PgStat_IOOpCounters *counters)
+{
+ Assert(counters->evictions == 0);
+ Assert(counters->extends == 0);
+ Assert(counters->fsyncs == 0);
+ Assert(counters->reads == 0);
+ Assert(counters->reuses == 0);
+ Assert(counters->writes == 0);
+}
+
+static inline void
+pgstat_io_op_assert_zero(PgStat_IOOpCounters *counters, IOOp io_op)
+{
+ switch (io_op)
+ {
+ case IOOP_EVICT:
+ Assert(counters->evictions == 0);
+ return;
+ case IOOP_EXTEND:
+ Assert(counters->extends == 0);
+ return;
+ case IOOP_FSYNC:
+ Assert(counters->fsyncs == 0);
+ return;
+ case IOOP_READ:
+ Assert(counters->reads == 0);
+ return;
+ case IOOP_REUSE:
+ Assert(counters->reuses == 0);
+ return;
+ case IOOP_WRITE:
+ Assert(counters->writes == 0);
+ return;
+ }
+
+ /* Should not reach here */
+ Assert(false);
+}
+
+
/*
* Functions in pgstat_database.c
*/
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 7b67250747..0c80ec9230 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -15,6 +15,7 @@
#ifndef BUFMGR_INTERNALS_H
#define BUFMGR_INTERNALS_H
+#include "pgstat.h"
#include "port/atomics.h"
#include "storage/buf.h"
#include "storage/bufmgr.h"
@@ -391,6 +392,7 @@ extern void IssuePendingWritebacks(WritebackContext *context);
extern void ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *tag);
/* freelist.c */
+extern IOContext IOContextForStrategy(BufferAccessStrategy bas);
extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
uint32 *buf_state, bool *from_ring);
extern void StrategyFreeBuffer(BufferDesc *buf);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index e1bd22441b..206f4c0b3e 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -23,7 +23,12 @@
typedef void *Block;
-/* Possible arguments for GetAccessStrategy() */
+/*
+ * Possible arguments for GetAccessStrategy().
+ *
+ * If adding a new BufferAccessStrategyType, also add a new IOContext so
+ * statistics on IO operations using this strategy are tracked.
+ */
typedef enum BufferAccessStrategyType
{
BAS_NORMAL, /* Normal random access */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 58daeca831..28362f00a4 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1105,7 +1105,10 @@ ID
INFIX
INT128
INTERFACE_INFO
+IOContext
IOFuncSelector
+IOObject
+IOOp
IPCompareMethod
ITEM
IV
@@ -2025,6 +2028,9 @@ PgStat_FetchConsistency
PgStat_FunctionCallUsage
PgStat_FunctionCounts
PgStat_HashKey
+PgStat_IOContextOps
+PgStat_IOObjectOps
+PgStat_IOOpCounters
PgStat_Kind
PgStat_KindInfo
PgStat_LocalState
--
2.38.1
Attachment: v40-0003-Aggregate-IO-operation-stats-per-BackendType.patch (application/octet-stream)

From 6370184c4223a7482857a1b8f9c507c1d3c81e3c Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 5 Dec 2022 19:38:07 -0500
Subject: [PATCH v40 3/4] Aggregate IO operation stats per BackendType
Stats on IOOps on all IOObjects in all IOContexts for a backend are
already tracked locally. Add functionality for backends to flush these
stats to shared memory and accumulate them with those from all other
backends, exited and live. Also add reset and snapshot functions used by
cumulative stats system for management of these statistics.
The aggregated stats in shared memory could be extended in the future
with per-backend stats -- useful for per-connection IO statistics and
monitoring.
Some BackendTypes do not flush their pending statistics at regular
intervals, so they explicitly call pgstat_flush_io_ops() during the
course of normal operations to flush their backend-local IO operation
statistics to shared memory in a timely manner.
Because not all combinations of BackendType, IOOp, IOObject, and
IOContext are valid, the stats are validated both before flushing
pending stats and before reading the existing stats file into shared
memory.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Maciek Sakrejda <m.sakrejda@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
doc/src/sgml/monitoring.sgml | 2 +
src/backend/utils/activity/pgstat.c | 35 +++
src/backend/utils/activity/pgstat_bgwriter.c | 7 +-
.../utils/activity/pgstat_checkpointer.c | 7 +-
src/backend/utils/activity/pgstat_io_ops.c | 207 ++++++++++++++++++
src/backend/utils/activity/pgstat_relation.c | 15 +-
src/backend/utils/activity/pgstat_shmem.c | 4 +
src/backend/utils/activity/pgstat_wal.c | 4 +-
src/backend/utils/adt/pgstatfuncs.c | 4 +-
src/include/miscadmin.h | 2 +
src/include/pgstat.h | 8 +
src/include/utils/pgstat_internal.h | 46 ++++
src/tools/pgindent/typedefs.list | 4 +
13 files changed, 339 insertions(+), 6 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 11a8ebe5ec..e0db24c154 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -5396,6 +5396,8 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
the <structname>pg_stat_bgwriter</structname>
view, <literal>archiver</literal> to reset all the counters shown in
the <structname>pg_stat_archiver</structname> view,
+ <literal>io</literal> to reset all the counters shown in the
+ <structname>pg_stat_io</structname> view,
<literal>wal</literal> to reset all the counters shown in the
<structname>pg_stat_wal</structname> view or
<literal>recovery_prefetch</literal> to reset all the counters shown
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 1ebe3bbf29..d2ba5fd9f3 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -359,6 +359,15 @@ static const PgStat_KindInfo pgstat_kind_infos[PGSTAT_NUM_KINDS] = {
.snapshot_cb = pgstat_checkpointer_snapshot_cb,
},
+ [PGSTAT_KIND_IOOPS] = {
+ .name = "io_ops",
+
+ .fixed_amount = true,
+
+ .reset_all_cb = pgstat_io_ops_reset_all_cb,
+ .snapshot_cb = pgstat_io_ops_snapshot_cb,
+ },
+
[PGSTAT_KIND_SLRU] = {
.name = "slru",
@@ -582,6 +591,7 @@ pgstat_report_stat(bool force)
/* Don't expend a clock check if nothing to do */
if (dlist_is_empty(&pgStatPending) &&
+ !have_ioopstats &&
!have_slrustats &&
!pgstat_have_pending_wal())
{
@@ -628,6 +638,9 @@ pgstat_report_stat(bool force)
/* flush database / relation / function / ... stats */
partial_flush |= pgstat_flush_pending_entries(nowait);
+ /* flush IO Operations stats */
+ partial_flush |= pgstat_flush_io_ops(nowait);
+
/* flush wal stats */
partial_flush |= pgstat_flush_wal(nowait);
@@ -1321,6 +1334,14 @@ pgstat_write_statsfile(void)
pgstat_build_snapshot_fixed(PGSTAT_KIND_CHECKPOINTER);
write_chunk_s(fpout, &pgStatLocal.snapshot.checkpointer);
+ /*
+ * Write IO Operations stats struct
+ */
+ pgstat_build_snapshot_fixed(PGSTAT_KIND_IOOPS);
+ write_chunk_s(fpout, &pgStatLocal.snapshot.io_ops.stat_reset_timestamp);
+ for (int bktype = 0; bktype < BACKEND_NUM_TYPES; bktype++)
+ write_chunk_s(fpout, &pgStatLocal.snapshot.io_ops.stats[bktype]);
+
/*
* Write SLRU stats struct
*/
@@ -1495,6 +1516,20 @@ pgstat_read_statsfile(void)
if (!read_chunk_s(fpin, &shmem->checkpointer.stats))
goto error;
+ /*
+ * Read IO Operations stats struct
+ */
+ if (!read_chunk_s(fpin, &shmem->io_ops.stat_reset_timestamp))
+ goto error;
+
+ for (int bktype = 0; bktype < BACKEND_NUM_TYPES; bktype++)
+ {
+ pgstat_backend_io_stats_assert_well_formed(&shmem->io_ops.stats[bktype],
+ (BackendType) bktype);
+ if (!read_chunk_s(fpin, &shmem->io_ops.stats[bktype].data))
+ goto error;
+ }
+
/*
* Read SLRU stats struct
*/
diff --git a/src/backend/utils/activity/pgstat_bgwriter.c b/src/backend/utils/activity/pgstat_bgwriter.c
index fbb1edc527..3d7f90a1b7 100644
--- a/src/backend/utils/activity/pgstat_bgwriter.c
+++ b/src/backend/utils/activity/pgstat_bgwriter.c
@@ -24,7 +24,7 @@ PgStat_BgWriterStats PendingBgWriterStats = {0};
/*
- * Report bgwriter statistics
+ * Report bgwriter and IO Operation statistics
*/
void
pgstat_report_bgwriter(void)
@@ -56,6 +56,11 @@ pgstat_report_bgwriter(void)
* Clear out the statistics buffer, so it can be re-used.
*/
MemSet(&PendingBgWriterStats, 0, sizeof(PendingBgWriterStats));
+
+ /*
+ * Report IO Operations statistics
+ */
+ pgstat_flush_io_ops(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_checkpointer.c b/src/backend/utils/activity/pgstat_checkpointer.c
index af8d513e7b..cfcf127210 100644
--- a/src/backend/utils/activity/pgstat_checkpointer.c
+++ b/src/backend/utils/activity/pgstat_checkpointer.c
@@ -24,7 +24,7 @@ PgStat_CheckpointerStats PendingCheckpointerStats = {0};
/*
- * Report checkpointer statistics
+ * Report checkpointer and IO Operation statistics
*/
void
pgstat_report_checkpointer(void)
@@ -62,6 +62,11 @@ pgstat_report_checkpointer(void)
* Clear out the statistics buffer, so it can be re-used.
*/
MemSet(&PendingCheckpointerStats, 0, sizeof(PendingCheckpointerStats));
+
+ /*
+ * Report IO Operation statistics
+ */
+ pgstat_flush_io_ops(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_io_ops.c b/src/backend/utils/activity/pgstat_io_ops.c
index 6fbb6b185e..66c8e3d035 100644
--- a/src/backend/utils/activity/pgstat_io_ops.c
+++ b/src/backend/utils/activity/pgstat_io_ops.c
@@ -20,6 +20,42 @@
#include "utils/pgstat_internal.h"
static PgStat_IOContextOps pending_IOOpStats;
+bool have_ioopstats = false;
+
+
+/*
+ * Helper function to accumulate source PgStat_IOOpCounters into target
+ * PgStat_IOOpCounters. If either of the passed-in PgStat_IOOpCounters are
+ * members of PgStatShared_IOContextOps, the caller is responsible for ensuring
+ * that the appropriate lock is held.
+ */
+static void
+pgstat_accum_io_op(PgStat_IOOpCounters *target, PgStat_IOOpCounters *source, IOOp io_op)
+{
+ switch (io_op)
+ {
+ case IOOP_EVICT:
+ target->evictions += source->evictions;
+ return;
+ case IOOP_EXTEND:
+ target->extends += source->extends;
+ return;
+ case IOOP_FSYNC:
+ target->fsyncs += source->fsyncs;
+ return;
+ case IOOP_READ:
+ target->reads += source->reads;
+ return;
+ case IOOP_REUSE:
+ target->reuses += source->reuses;
+ return;
+ case IOOP_WRITE:
+ target->writes += source->writes;
+ return;
+ }
+
+ elog(ERROR, "unrecognized IOOp value: %d", io_op);
+}
void
pgstat_count_io_op(IOOp io_op, IOObject io_object, IOContext io_context)
@@ -53,6 +89,88 @@ pgstat_count_io_op(IOOp io_op, IOObject io_object, IOContext io_context)
pending_counters->writes++;
break;
}
+
+ have_ioopstats = true;
+}
+
+PgStat_BackendIOContextOps *
+pgstat_fetch_backend_io_context_ops(void)
+{
+ pgstat_snapshot_fixed(PGSTAT_KIND_IOOPS);
+
+ return &pgStatLocal.snapshot.io_ops;
+}
+
+/*
+ * Flush out locally pending IO Operation statistics entries
+ *
+ * If no stats have been recorded, this function returns false.
+ *
+ * If nowait is true and the lock could not be acquired, this function
+ * returns true without flushing; otherwise the pending stats are flushed
+ * and false is returned.
+ */
+bool
+pgstat_flush_io_ops(bool nowait)
+{
+ PgStatShared_IOContextOps *type_shstats;
+ bool expect_backend_stats = true;
+
+ if (!have_ioopstats)
+ return false;
+
+ type_shstats =
+ &pgStatLocal.shmem->io_ops.stats[MyBackendType];
+
+ if (!nowait)
+ LWLockAcquire(&type_shstats->lock, LW_EXCLUSIVE);
+ else if (!LWLockConditionalAcquire(&type_shstats->lock, LW_EXCLUSIVE))
+ return true;
+
+ expect_backend_stats = pgstat_io_op_stats_collected(MyBackendType);
+
+ for (IOContext io_context = IOCONTEXT_BULKREAD;
+ io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ {
+ PgStatShared_IOObjectOps *shared_objs = &type_shstats->data[io_context];
+ PgStat_IOObjectOps *pending_objs = &pending_IOOpStats.data[io_context];
+
+ for (IOObject io_object = IOOBJECT_RELATION;
+ io_object < IOOBJECT_NUM_TYPES; io_object++)
+ {
+ PgStat_IOOpCounters *sharedent = &shared_objs->data[io_object];
+ PgStat_IOOpCounters *pendingent = &pending_objs->data[io_object];
+
+ if (!expect_backend_stats ||
+ !pgstat_bktype_io_context_io_object_valid(MyBackendType,
+ io_context, io_object))
+ {
+ pgstat_io_context_ops_assert_zero(sharedent);
+ pgstat_io_context_ops_assert_zero(pendingent);
+ continue;
+ }
+
+ for (IOOp io_op = IOOP_EVICT; io_op < IOOP_NUM_TYPES; io_op++)
+ {
+ if (!pgstat_io_op_valid(MyBackendType, io_context, io_object,
+ io_op))
+ {
+ pgstat_io_op_assert_zero(sharedent, io_op);
+ pgstat_io_op_assert_zero(pendingent, io_op);
+ continue;
+ }
+
+ pgstat_accum_io_op(sharedent, pendingent, io_op);
+ }
+ }
+ }
+
+ LWLockRelease(&type_shstats->lock);
+
+ memset(&pending_IOOpStats, 0, sizeof(pending_IOOpStats));
+
+ have_ioopstats = false;
+
+ return false;
}
const char *
@@ -115,6 +233,55 @@ pgstat_io_op_desc(IOOp io_op)
pg_unreachable();
}
+void
+pgstat_io_ops_reset_all_cb(TimestampTz ts)
+{
+ PgStatShared_BackendIOContextOps *backends_stats_shmem = &pgStatLocal.shmem->io_ops;
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ PgStatShared_IOContextOps *stats_shmem = &backends_stats_shmem->stats[i];
+
+ LWLockAcquire(&stats_shmem->lock, LW_EXCLUSIVE);
+
+ /*
+ * Use the lock in the first BackendType's PgStat_IOContextOps to
+ * protect the reset timestamp as well.
+ */
+ if (i == 0)
+ backends_stats_shmem->stat_reset_timestamp = ts;
+
+ memset(stats_shmem->data, 0, sizeof(stats_shmem->data));
+ LWLockRelease(&stats_shmem->lock);
+ }
+}
+
+void
+pgstat_io_ops_snapshot_cb(void)
+{
+ PgStatShared_BackendIOContextOps *backends_stats_shmem = &pgStatLocal.shmem->io_ops;
+ PgStat_BackendIOContextOps *backends_stats_snap = &pgStatLocal.snapshot.io_ops;
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ PgStatShared_IOContextOps *stats_shmem = &backends_stats_shmem->stats[i];
+ PgStat_IOContextOps *stats_snap = &backends_stats_snap->stats[i];
+
+ LWLockAcquire(&stats_shmem->lock, LW_SHARED);
+
+ /*
+ * Use the lock in the first BackendType's PgStat_IOContextOps to
+ * protect the reset timestamp as well.
+ */
+ if (i == 0)
+ backends_stats_snap->stat_reset_timestamp =
+ backends_stats_shmem->stat_reset_timestamp;
+
+ memcpy(stats_snap->data, stats_shmem->data, sizeof(stats_shmem->data));
+ LWLockRelease(&stats_shmem->lock);
+ }
+}
+
/*
* IO Operation statistics are not collected for all BackendTypes.
*
@@ -278,3 +445,43 @@ pgstat_expect_io_op(BackendType bktype, IOContext io_context, IOObject io_object
return true;
}
+
+/*
+ * Assert that stats have not been counted for any combination of IOContext,
+ * IOObject, and IOOp which is not valid for the passed-in BackendType. The
+ * passed-in array of PgStat_IOOpCounters must contain stats from the
+ * BackendType specified by the second parameter. Caller is responsible for
+ * locking of the passed-in PgStatShared_IOContextOps, if needed.
+ */
+void
+pgstat_backend_io_stats_assert_well_formed(PgStatShared_IOContextOps *backend_io_context_ops,
+ BackendType bktype)
+{
+ bool expect_backend_stats = pgstat_io_op_stats_collected(bktype);
+
+ for (IOContext io_context = IOCONTEXT_BULKREAD;
+ io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ {
+ PgStatShared_IOObjectOps *context = &backend_io_context_ops->data[io_context];
+
+ for (IOObject io_object = IOOBJECT_RELATION;
+ io_object < IOOBJECT_NUM_TYPES; io_object++)
+ {
+ PgStat_IOOpCounters *object = &context->data[io_object];
+
+ if (!expect_backend_stats ||
+ !pgstat_bktype_io_context_io_object_valid(bktype, io_context,
+ io_object))
+ {
+ pgstat_io_context_ops_assert_zero(object);
+ continue;
+ }
+
+ for (IOOp io_op = IOOP_EVICT; io_op < IOOP_NUM_TYPES; io_op++)
+ {
+ if (!pgstat_io_op_valid(bktype, io_context, io_object, io_op))
+ pgstat_io_op_assert_zero(object, io_op);
+ }
+ }
+ }
+}
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index f92e16e7af..1c84e1a5f0 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -206,7 +206,7 @@ pgstat_drop_relation(Relation rel)
}
/*
- * Report that the table was just vacuumed.
+ * Report that the table was just vacuumed and flush IO Operation statistics.
*/
void
pgstat_report_vacuum(Oid tableoid, bool shared,
@@ -258,10 +258,18 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
}
pgstat_unlock_entry(entry_ref);
+
+ /*
+ * Flush IO Operations statistics now. pgstat_report_stat() will flush IO
+ * Operation stats, however this will not be called until after an entire
+ * autovacuum cycle is done -- which will likely vacuum many relations --
+ * or until the VACUUM command has processed all tables and committed.
+ */
+ pgstat_flush_io_ops(false);
}
/*
- * Report that the table was just analyzed.
+ * Report that the table was just analyzed and flush IO Operation statistics.
*
* Caller must provide new live- and dead-tuples estimates, as well as a
* flag indicating whether to reset the changes_since_analyze counter.
@@ -341,6 +349,9 @@ pgstat_report_analyze(Relation rel,
}
pgstat_unlock_entry(entry_ref);
+
+ /* see pgstat_report_vacuum() */
+ pgstat_flush_io_ops(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_shmem.c b/src/backend/utils/activity/pgstat_shmem.c
index 706692862c..4251079ae1 100644
--- a/src/backend/utils/activity/pgstat_shmem.c
+++ b/src/backend/utils/activity/pgstat_shmem.c
@@ -202,6 +202,10 @@ StatsShmemInit(void)
LWLockInitialize(&ctl->checkpointer.lock, LWTRANCHE_PGSTATS_DATA);
LWLockInitialize(&ctl->slru.lock, LWTRANCHE_PGSTATS_DATA);
LWLockInitialize(&ctl->wal.lock, LWTRANCHE_PGSTATS_DATA);
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ LWLockInitialize(&ctl->io_ops.stats[i].lock,
+ LWTRANCHE_PGSTATS_DATA);
}
else
{
diff --git a/src/backend/utils/activity/pgstat_wal.c b/src/backend/utils/activity/pgstat_wal.c
index 5a878bd115..9cac407b42 100644
--- a/src/backend/utils/activity/pgstat_wal.c
+++ b/src/backend/utils/activity/pgstat_wal.c
@@ -34,7 +34,7 @@ static WalUsage prevWalUsage;
/*
* Calculate how much WAL usage counters have increased and update
- * shared statistics.
+ * shared WAL and IO Operation statistics.
*
* Must be called by processes that generate WAL, that do not call
* pgstat_report_stat(), like walwriter.
@@ -43,6 +43,8 @@ void
pgstat_report_wal(bool force)
{
pgstat_flush_wal(force);
+
+ pgstat_flush_io_ops(force);
}
/*
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index ae3365d917..a135cad0ce 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -2090,6 +2090,8 @@ pg_stat_reset_shared(PG_FUNCTION_ARGS)
pgstat_reset_of_kind(PGSTAT_KIND_BGWRITER);
pgstat_reset_of_kind(PGSTAT_KIND_CHECKPOINTER);
}
+ else if (strcmp(target, "io") == 0)
+ pgstat_reset_of_kind(PGSTAT_KIND_IOOPS);
else if (strcmp(target, "recovery_prefetch") == 0)
XLogPrefetchResetStats();
else if (strcmp(target, "wal") == 0)
@@ -2098,7 +2100,7 @@ pg_stat_reset_shared(PG_FUNCTION_ARGS)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("unrecognized reset target: \"%s\"", target),
- errhint("Target must be \"archiver\", \"bgwriter\", \"recovery_prefetch\", or \"wal\".")));
+ errhint("Target must be \"archiver\", \"bgwriter\", \"io\", \"recovery_prefetch\", or \"wal\".")));
PG_RETURN_VOID();
}
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 795182fa51..aa14338221 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -331,6 +331,8 @@ typedef enum BackendType
B_WAL_WRITER,
} BackendType;
+#define BACKEND_NUM_TYPES (B_WAL_WRITER + 1)
+
extern PGDLLIMPORT BackendType MyBackendType;
extern const char *GetBackendTypeDesc(BackendType backendType);
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index a57e39042f..357973ee8c 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -48,6 +48,7 @@ typedef enum PgStat_Kind
PGSTAT_KIND_ARCHIVER,
PGSTAT_KIND_BGWRITER,
PGSTAT_KIND_CHECKPOINTER,
+ PGSTAT_KIND_IOOPS,
PGSTAT_KIND_SLRU,
PGSTAT_KIND_WAL,
} PgStat_Kind;
@@ -333,6 +334,12 @@ typedef struct PgStat_IOContextOps
PgStat_IOObjectOps data[IOCONTEXT_NUM_TYPES];
} PgStat_IOContextOps;
+typedef struct PgStat_BackendIOContextOps
+{
+ TimestampTz stat_reset_timestamp;
+ PgStat_IOContextOps stats[BACKEND_NUM_TYPES];
+} PgStat_BackendIOContextOps;
+
typedef struct PgStat_StatDBEntry
{
PgStat_Counter n_xact_commit;
@@ -515,6 +522,7 @@ extern PgStat_CheckpointerStats *pgstat_fetch_stat_checkpointer(void);
*/
extern void pgstat_count_io_op(IOOp io_op, IOObject io_object, IOContext io_context);
+extern PgStat_BackendIOContextOps *pgstat_fetch_backend_io_context_ops(void);
extern const char *pgstat_io_context_desc(IOContext io_context);
extern const char *pgstat_io_object_desc(IOObject io_object);
extern const char *pgstat_io_op_desc(IOOp io_op);
diff --git a/src/include/utils/pgstat_internal.h b/src/include/utils/pgstat_internal.h
index e2c7b59324..923e324011 100644
--- a/src/include/utils/pgstat_internal.h
+++ b/src/include/utils/pgstat_internal.h
@@ -329,6 +329,31 @@ typedef struct PgStatShared_Checkpointer
PgStat_CheckpointerStats reset_offset;
} PgStatShared_Checkpointer;
+
+typedef struct PgStatShared_IOObjectOps
+{
+ PgStat_IOOpCounters data[IOOBJECT_NUM_TYPES];
+} PgStatShared_IOObjectOps;
+
+typedef struct PgStatShared_IOContextOps
+{
+ /*
+ * lock protects ->data. If this PgStatShared_IOContextOps is
+ * PgStatShared_BackendIOContextOps->stats[0], lock also protects
+ * PgStatShared_BackendIOContextOps->stat_reset_timestamp.
+ */
+ LWLock lock;
+ PgStatShared_IOObjectOps data[IOCONTEXT_NUM_TYPES];
+} PgStatShared_IOContextOps;
+
+typedef struct PgStatShared_BackendIOContextOps
+{
+ /* ->stats_reset_timestamp is protected by ->stats[0].lock */
+ TimestampTz stat_reset_timestamp;
+ PgStatShared_IOContextOps stats[BACKEND_NUM_TYPES];
+} PgStatShared_BackendIOContextOps;
+
+
typedef struct PgStatShared_SLRU
{
/* lock protects ->stats */
@@ -419,6 +444,7 @@ typedef struct PgStat_ShmemControl
PgStatShared_Archiver archiver;
PgStatShared_BgWriter bgwriter;
PgStatShared_Checkpointer checkpointer;
+ PgStatShared_BackendIOContextOps io_ops;
PgStatShared_SLRU slru;
PgStatShared_Wal wal;
} PgStat_ShmemControl;
@@ -442,6 +468,8 @@ typedef struct PgStat_Snapshot
PgStat_CheckpointerStats checkpointer;
+ PgStat_BackendIOContextOps io_ops;
+
PgStat_SLRUStats slru[SLRU_NUM_ELEMENTS];
PgStat_WalStats wal;
@@ -549,6 +577,19 @@ extern void pgstat_database_reset_timestamp_cb(PgStatShared_Common *header, Time
extern bool pgstat_function_flush_cb(PgStat_EntryRef *entry_ref, bool nowait);
+/*
+ * Functions in pgstat_io_ops.c
+ */
+
+extern void pgstat_io_ops_reset_all_cb(TimestampTz ts);
+extern void pgstat_io_ops_snapshot_cb(void);
+extern bool pgstat_flush_io_ops(bool nowait);
+extern void pgstat_backend_io_stats_assert_well_formed(
+ PgStatShared_IOContextOps *backend_io_context_ops,
+ BackendType bktype);
+
+
+
/*
* Functions in pgstat_relation.c
*/
@@ -641,6 +682,11 @@ extern void pgstat_create_transactional(PgStat_Kind kind, Oid dboid, Oid objoid)
extern PGDLLIMPORT PgStat_LocalState pgStatLocal;
+/*
+ * Variables in pgstat_io_ops.c
+ */
+
+extern PGDLLIMPORT bool have_ioopstats;
/*
* Variables in pgstat_slru.c
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 28362f00a4..2214e8e713 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2005,12 +2005,15 @@ PgFdwRelationInfo
PgFdwScanState
PgIfAddrCallback
PgStatShared_Archiver
+PgStatShared_BackendIOContextOps
PgStatShared_BgWriter
PgStatShared_Checkpointer
PgStatShared_Common
PgStatShared_Database
PgStatShared_Function
PgStatShared_HashEntry
+PgStatShared_IOContextOps
+PgStatShared_IOObjectOps
PgStatShared_Relation
PgStatShared_ReplSlot
PgStatShared_SLRU
@@ -2018,6 +2021,7 @@ PgStatShared_Subscription
PgStatShared_Wal
PgStat_ArchiverStats
PgStat_BackendFunctionEntry
+PgStat_BackendIOContextOps
PgStat_BackendSubEntry
PgStat_BgWriterStats
PgStat_CheckpointerStats
--
2.38.1
Attachment: v40-0001-Remove-BufferAccessStrategyData-current_was_in_r.patch (application/octet-stream)
From 207f3b3580ca69e897ddeef7c4afda9c88de8df6 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 5 Dec 2022 19:25:44 -0500
Subject: [PATCH v40 1/4] Remove BufferAccessStrategyData->current_was_in_ring
---
src/backend/storage/buffer/bufmgr.c | 5 +++--
src/backend/storage/buffer/freelist.c | 22 ++++++++--------------
src/include/storage/buf_internals.h | 4 ++--
3 files changed, 13 insertions(+), 18 deletions(-)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 73d30bf619..fa32f24e19 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1121,6 +1121,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
BufferAccessStrategy strategy,
bool *foundPtr)
{
+ bool from_ring;
BufferTag newTag; /* identity of requested block */
uint32 newHash; /* hash value for newTag */
LWLock *newPartitionLock; /* buffer partition lock for it */
@@ -1200,7 +1201,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* Select a victim buffer. The buffer is returned with its header
* spinlock still held!
*/
- buf = StrategyGetBuffer(strategy, &buf_state);
+ buf = StrategyGetBuffer(strategy, &buf_state, &from_ring);
Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
@@ -1254,7 +1255,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
UnlockBufHdr(buf, buf_state);
if (XLogNeedsFlush(lsn) &&
- StrategyRejectBuffer(strategy, buf))
+ StrategyRejectBuffer(strategy, buf, from_ring))
{
/* Drop lock/pin and loop around for another buffer */
LWLockRelease(BufferDescriptorGetContentLock(buf));
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 990e081aae..5299bb8711 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -81,12 +81,6 @@ typedef struct BufferAccessStrategyData
*/
int current;
- /*
- * True if the buffer just returned by StrategyGetBuffer had been in the
- * ring already.
- */
- bool current_was_in_ring;
-
/*
* Array of buffer numbers. InvalidBuffer (that is, zero) indicates we
* have not yet selected a buffer for this ring slot. For allocation
@@ -198,13 +192,15 @@ have_free_buffer(void)
* return the buffer with the buffer header spinlock still held.
*/
BufferDesc *
-StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
+StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_ring)
{
BufferDesc *buf;
int bgwprocno;
int trycounter;
uint32 local_buf_state; /* to avoid repeated (de-)referencing */
+ *from_ring = false;
+
/*
* If given a strategy object, see whether it can select a buffer. We
* assume strategy objects don't need buffer_strategy_lock.
@@ -213,7 +209,10 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
{
buf = GetBufferFromRing(strategy, buf_state);
if (buf != NULL)
+ {
+ *from_ring = true;
return buf;
+ }
}
/*
@@ -625,10 +624,7 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
*/
bufnum = strategy->buffers[strategy->current];
if (bufnum == InvalidBuffer)
- {
- strategy->current_was_in_ring = false;
return NULL;
- }
/*
* If the buffer is pinned we cannot use it under any circumstances.
@@ -644,7 +640,6 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
if (BUF_STATE_GET_REFCOUNT(local_buf_state) == 0
&& BUF_STATE_GET_USAGECOUNT(local_buf_state) <= 1)
{
- strategy->current_was_in_ring = true;
*buf_state = local_buf_state;
return buf;
}
@@ -654,7 +649,6 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
* Tell caller to allocate a new buffer with the normal allocation
* strategy. He'll then replace this ring element via AddBufferToRing.
*/
- strategy->current_was_in_ring = false;
return NULL;
}
@@ -682,14 +676,14 @@ AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
* if this buffer should be written and re-used.
*/
bool
-StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
+StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool from_ring)
{
/* We only do this in bulkread mode */
if (strategy->btype != BAS_BULKREAD)
return false;
/* Don't muck with behavior of normal buffer-replacement strategy */
- if (!strategy->current_was_in_ring ||
+ if (!from_ring ||
strategy->buffers[strategy->current] != BufferDescriptorGetBuffer(buf))
return false;
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 406db6be78..7b67250747 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -392,10 +392,10 @@ extern void ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *
/* freelist.c */
extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
- uint32 *buf_state);
+ uint32 *buf_state, bool *from_ring);
extern void StrategyFreeBuffer(BufferDesc *buf);
extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
- BufferDesc *buf);
+ BufferDesc *buf, bool from_ring);
extern int StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
extern void StrategyNotifyBgWriter(int bgwprocno);
--
2.38.1
Attachment: v40-0004-Add-system-view-tracking-IO-ops-per-backend-type.patch (application/octet-stream)
From 09ad8de114e3bcb4baac0e310a450a6ccc9c718e Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 5 Dec 2022 19:47:27 -0500
Subject: [PATCH v40 4/4] Add system view tracking IO ops per backend type
Add pg_stat_io, a system view which tracks the number of IOOps
(evictions, reuses, reads, writes, extensions, and fsyncs) done on each
IOObject (relation, temp relation) in each IOContext ("normal" and those
using a BufferAccessStrategy) by each type of backend (e.g. client
backend, checkpointer).
Some BackendTypes do not accumulate IO operation statistics and will
not be included in the view.
Some IOContexts are not used by some BackendTypes and will not be in the
view. For example, checkpointer does not use a BufferAccessStrategy
(currently), so there will be no rows for BufferAccessStrategy
IOContexts for checkpointer.
Some IOObjects are never operated on in some IOContexts or by some
BackendTypes. These rows are omitted from the view. For example,
checkpointer will never operate on IOOBJECT_TEMP_RELATION data, so those
rows are omitted.
Some IOOps are invalid in combination with certain IOContexts and
certain IOObjects. Those cells will be NULL in the view to distinguish
between 0 observed IOOps of that type and an invalid combination. For
example, temporary tables are not fsynced so cells for all BackendTypes
for IOOBJECT_TEMP_RELATION and IOOP_FSYNC will be NULL.
Some BackendTypes never perform certain IOOps. Those cells will also be
NULL in the view. For example, bgwriter should not perform reads.
View stats are populated with statistics incremented when a backend
performs an IO Operation and maintained by the cumulative statistics
subsystem.
Each row of the view shows stats for a particular BackendType, IOObject,
IOContext combination (e.g. a client backend's operations on permanent
relations in shared buffers) and each column in the view is the total
number of IO Operations done (e.g. writes). So a cell in the view would
be, for example, the number of blocks of relation data written from
shared buffers by client backends since the last stats reset.
In anticipation of tracking WAL IO and non-block-oriented IO (such as
temporary file IO), the "op_bytes" column specifies the unit of the "read",
"written", and "extended" columns for a given row.
Note that some of the cells in the view are redundant with fields in
pg_stat_bgwriter (e.g. buffers_backend), however these have been kept in
pg_stat_bgwriter for backwards compatibility. Deriving the redundant
pg_stat_bgwriter stats from the IO operations stats structures was also
problematic due to the separate reset targets for 'bgwriter' and 'io'.
Suggested by Andres Freund
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Maciek Sakrejda <m.sakrejda@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
contrib/amcheck/expected/check_heap.out | 31 ++
contrib/amcheck/sql/check_heap.sql | 24 ++
doc/src/sgml/monitoring.sgml | 401 +++++++++++++++++++++++-
doc/src/sgml/pgstatstatements.sgml | 84 ++++-
src/backend/catalog/system_views.sql | 15 +
src/backend/utils/adt/pgstatfuncs.c | 146 +++++++++
src/include/catalog/pg_proc.dat | 9 +
src/test/regress/expected/rules.out | 12 +
src/test/regress/expected/stats.out | 225 +++++++++++++
src/test/regress/sql/stats.sql | 138 ++++++++
10 files changed, 1057 insertions(+), 28 deletions(-)
diff --git a/contrib/amcheck/expected/check_heap.out b/contrib/amcheck/expected/check_heap.out
index c010361025..667d5747a8 100644
--- a/contrib/amcheck/expected/check_heap.out
+++ b/contrib/amcheck/expected/check_heap.out
@@ -66,6 +66,19 @@ SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'ALL-VISIBLE');
INSERT INTO heaptest (a, b)
(SELECT gs, repeat('x', gs)
FROM generate_series(1,50) gs);
+-- pg_stat_io test:
+-- verify_heapam always uses a BAS_BULKREAD BufferAccessStrategy. This allows
+-- us to reliably test that pg_stat_io BULKREAD reads are being captured
+-- without relying on the size of shared buffers or on an expensive operation
+-- like CREATE DATABASE.
+--
+-- Create an alternative tablespace and move the heaptest table to it, causing
+-- it to be rewritten.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE test_stats LOCATION '';
+SELECT sum(read) AS stats_bulkreads_before
+ FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+ALTER TABLE heaptest SET TABLESPACE test_stats;
-- Check that valid options are not rejected nor corruption reported
-- for a non-empty table
SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'none');
@@ -88,6 +101,23 @@ SELECT * FROM verify_heapam(relation := 'heaptest', startblock := 0, endblock :=
-------+--------+--------+-----
(0 rows)
+-- verify_heapam will need to read in the page written out by ALTER TABLE ...
+-- SET TABLESPACE ... causing an additional bulkread, which should be reflected
+-- in pg_stat_io.
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(read) AS stats_bulkreads_after
+ FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+SELECT :stats_bulkreads_after > :stats_bulkreads_before;
+ ?column?
+----------
+ t
+(1 row)
+
CREATE ROLE regress_heaptest_role;
-- verify permissions are checked (error due to function not callable)
SET ROLE regress_heaptest_role;
@@ -195,6 +225,7 @@ ERROR: cannot check relation "test_foreign_table"
DETAIL: This operation is not supported for foreign tables.
-- cleanup
DROP TABLE heaptest;
+DROP TABLESPACE test_stats;
DROP TABLE test_partition;
DROP TABLE test_partitioned;
DROP OWNED BY regress_heaptest_role; -- permissions
diff --git a/contrib/amcheck/sql/check_heap.sql b/contrib/amcheck/sql/check_heap.sql
index 298de6886a..84ffffebf9 100644
--- a/contrib/amcheck/sql/check_heap.sql
+++ b/contrib/amcheck/sql/check_heap.sql
@@ -20,11 +20,26 @@ SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'NONE');
SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'ALL-FROZEN');
SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'ALL-VISIBLE');
+
-- Add some data so subsequent tests are not entirely trivial
INSERT INTO heaptest (a, b)
(SELECT gs, repeat('x', gs)
FROM generate_series(1,50) gs);
+-- pg_stat_io test:
+-- verify_heapam always uses a BAS_BULKREAD BufferAccessStrategy. This allows
+-- us to reliably test that pg_stat_io BULKREAD reads are being captured
+-- without relying on the size of shared buffers or on an expensive operation
+-- like CREATE DATABASE.
+--
+-- Create an alternative tablespace and move the heaptest table to it, causing
+-- it to be rewritten.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE test_stats LOCATION '';
+SELECT sum(read) AS stats_bulkreads_before
+ FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+ALTER TABLE heaptest SET TABLESPACE test_stats;
+
-- Check that valid options are not rejected nor corruption reported
-- for a non-empty table
SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'none');
@@ -32,6 +47,14 @@ SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'all-frozen');
SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'all-visible');
SELECT * FROM verify_heapam(relation := 'heaptest', startblock := 0, endblock := 0);
+-- verify_heapam will need to read in the page written out by ALTER TABLE ...
+-- SET TABLESPACE ... causing an additional bulkread, which should be reflected
+-- in pg_stat_io.
+SELECT pg_stat_force_next_flush();
+SELECT sum(read) AS stats_bulkreads_after
+ FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+SELECT :stats_bulkreads_after > :stats_bulkreads_before;
+
CREATE ROLE regress_heaptest_role;
-- verify permissions are checked (error due to function not callable)
@@ -110,6 +133,7 @@ SELECT * FROM verify_heapam('test_foreign_table',
-- cleanup
DROP TABLE heaptest;
+DROP TABLESPACE test_stats;
DROP TABLE test_partition;
DROP TABLE test_partitioned;
DROP OWNED BY regress_heaptest_role; -- permissions
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index e0db24c154..37170a19f9 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -448,6 +448,15 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</entry>
</row>
+ <row>
+ <entry><structname>pg_stat_io</structname><indexterm><primary>pg_stat_io</primary></indexterm></entry>
+ <entry>A row for each IO Context for each backend type showing
+ statistics about backend IO operations. See
+ <link linkend="monitoring-pg-stat-io-view">
+ <structname>pg_stat_io</structname></link> for details.
+ </entry>
+ </row>
+
<row>
<entry><structname>pg_stat_wal</structname><indexterm><primary>pg_stat_wal</primary></indexterm></entry>
<entry>One row only, showing statistics about WAL activity. See
@@ -658,20 +667,20 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</para>
<para>
- The <structname>pg_statio_</structname> views are primarily useful to
- determine the effectiveness of the buffer cache. When the number
- of actual disk reads is much smaller than the number of buffer
- hits, then the cache is satisfying most read requests without
- invoking a kernel call. However, these statistics do not give the
- entire story: due to the way in which <productname>PostgreSQL</productname>
- handles disk I/O, data that is not in the
- <productname>PostgreSQL</productname> buffer cache might still reside in the
- kernel's I/O cache, and might therefore still be fetched without
- requiring a physical read. Users interested in obtaining more
- detailed information on <productname>PostgreSQL</productname> I/O behavior are
- advised to use the <productname>PostgreSQL</productname> statistics views
- in combination with operating system utilities that allow insight
- into the kernel's handling of I/O.
+ The <structname>pg_stat_io</structname> and
+ <structname>pg_statio_</structname> set of views are primarily useful to
+ determine the effectiveness of the buffer cache. When the number of actual
+ disk reads is much smaller than the number of buffer hits, then the cache is
+ satisfying most read requests without invoking a kernel call. However, these
+ statistics do not give the entire story: due to the way in which
+ <productname>PostgreSQL</productname> handles disk I/O, data that is not in
+ the <productname>PostgreSQL</productname> buffer cache might still reside in
+ the kernel's I/O cache, and might therefore still be fetched without
+ requiring a physical read. Users interested in obtaining more detailed
+ information on <productname>PostgreSQL</productname> I/O behavior are
+ advised to use the <productname>PostgreSQL</productname> statistics views in
+ combination with operating system utilities that allow insight into the
+ kernel's handling of I/O.
</para>
</sect2>
@@ -3606,13 +3615,12 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
<structfield>stats_reset</structfield> <type>timestamp with time zone</type>
</para>
<para>
- Time at which these statistics were last reset
+ Time at which these statistics were last reset.
</para></entry>
</row>
</tbody>
</tgroup>
</table>
-
<para>
Normally, WAL files are archived in order, oldest to newest, but that is
not guaranteed, and does not hold under special circumstances like when
@@ -3621,7 +3629,368 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
<structfield>last_archived_wal</structfield> have also been successfully
archived.
</para>
+ </sect2>
+
+ <sect2 id="monitoring-pg-stat-io-view">
+ <title><structname>pg_stat_io</structname></title>
+
+ <indexterm>
+ <primary>pg_stat_io</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_stat_io</structname> view has a row for each valid
+ backend type, IO context, IO object combination containing global data for
+ the cluster on IO operations done by that backend type in that IO context on
+ that IO object. Currently only a subset of IO operations are tracked here.
+ WAL IO, IO on temporary files, and some forms of IO outside of shared
+ buffers (such as when building indexes or moving a table from one tablespace
+ to another) may be added in the future.
+ </para>
+
+ <table id="pg-stat-io-view" xreflabel="pg_stat_io">
+ <title><structname>pg_stat_io</structname> View</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>backend_type</structfield> <type>text</type></para>
+ <para>
+ Type of backend (e.g. background worker, autovacuum worker).
+ See <link linkend="monitoring-pg-stat-activity-view">
+ <structname>pg_stat_activity</structname></link> for more information on
+ <varname>backend_type</varname>s. Some <varname>backend_type</varname>s
+ do not accumulate IO operation statistics and will not be included in
+ the view.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>io_context</structfield> <type>text</type></para>
+ <para>
+ The context of an IO operation or location of an IO object:
+ </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>normal</literal> refers to the default or standard type or
+ location of IO operations on IO objects.
+ </para>
+ <para>
+ Operations on temporary relations use a process-local buffer pool and
+ are counted as <varname>io_context</varname>
+ <literal>normal</literal>, <varname>io_object</varname>
+ <literal>temp relation</literal> operations.
+ </para>
+ <para>
+ IO operations on permanent relations are done by default in shared
+ buffers. These are tracked in <varname>io_context</varname>
+ <literal>normal</literal>, <varname>io_object</varname>
+ <literal>relation</literal>.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>vacuum</literal> refers to the IO operations incurred while
+ vacuuming and analyzing permanent relations.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>bulkread</literal> refers to IO operations on permanent
+ relations specially designated as <literal>bulkreads</literal>, such
+ as the sequential scan of a large table.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>bulkwrite</literal> refers to IO operations on permanent
+ relations specially designated as <literal>bulkwrites</literal>,
+ such as <command>COPY</command>.
+ </para>
+ </listitem>
+ </itemizedlist>
+ <para>
+ These last three <varname>io_context</varname>s are counted separately
+ because the autovacuum daemon, explicit <command>VACUUM</command>,
+ explicit <command>ANALYZE</command>, many bulk reads, and many bulk
+ writes acquire a fixed number of shared buffers and reuse them
+ circularly to avoid occupying an undue portion of the main shared
+ buffer pool. This pattern is called a <quote>Buffer Access
+ Strategy</quote> in the <productname>PostgreSQL</productname> source
+ code and the fixed-size ring buffer is referred to as a <quote>strategy
+ ring buffer</quote> for the purposes of this view's documentation.
+ These <varname>io_context</varname>s are referred to as <quote>strategy
+ contexts</quote> and IO operations on strategy contexts are referred to
+ as <quote>strategy operations</quote>.
+ </para>
+ <para>
+ Some <varname>io_context</varname>s are not used by some
+ <varname>backend_type</varname>s and will not be in the view. For
+ example, the checkpointer does not use a Buffer Access Strategy
+ (currently), so there will be no rows for
+ <varname>backend_type</varname> <literal>checkpointer</literal> in any
+ of the strategy <varname>io_context</varname>s.
+ </para>
+ <para>
+ Some IO operations are invalid in combination with certain
+ <varname>io_context</varname>s and <varname>io_object</varname>s. Those
+ cells will be NULL to distinguish between 0 observed IO operations of
+ that type and an invalid combination. For example, temporary tables are
+ not fsynced, so cells for all <varname>backend_type</varname>s for
+ <varname>io_object</varname> <literal>temp relation</literal> in
+ <varname>io_context</varname> <literal>normal</literal> for
+ <varname>files_synced</varname> will be NULL. Some
+ <varname>backend_type</varname>s never perform certain IO operations.
+ Those cells will also be NULL in the view. For example,
+ <literal>background writer</literal> should not perform reads.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>io_object</structfield> <type>text</type></para>
+ <para>
+ Object operated on in a given <varname>io_context</varname> by a given
+ <varname>backend_type</varname>. Current values are
+ <literal>relation</literal>, which includes permanent relations, and
+ <literal>temp relation</literal> which includes temporary relations
+ created by <command>CREATE TEMPORARY TABLE...</command>.
+ </para>
+ <para>
+ Some <varname>backend_type</varname>s will never do IO operations on
+ some <varname>io_object</varname>s, either at all or in certain
+ <varname>io_context</varname>s. These rows are omitted from the view.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>read</structfield> <type>bigint</type></para>
+ <para>
+ Reads by a given <varname>backend_type</varname> of a given
+ <varname>io_object</varname> into buffers in a given
+ <varname>io_context</varname>.
+ </para>
+ <para>
+ Note that the sum of
+ <varname>heap_blks_read</varname>,
+ <varname>idx_blks_read</varname>,
+ <varname>tidx_blks_read</varname>, and
+ <varname>toast_blks_read</varname>
+ in <link linkend="monitoring-pg-statio-all-tables-view">
+ <structname>pg_statio_all_tables</structname></link> as well as
+ <varname>blks_read</varname> in <link
+ linkend="monitoring-pg-stat-database-view">
+ <structname>pg_stat_database</structname></link> are both similar to
+ <varname>read</varname> plus <varname>extended</varname> for all
+ <varname>io_context</varname>s for the following
+ <varname>backend_type</varname>s in <structname>pg_stat_io</structname>:
+ <itemizedlist>
+ <listitem><para><literal>autovacuum launcher</literal></para></listitem>
+ <listitem><para><literal>autovacuum worker</literal></para></listitem>
+ <listitem><para><literal>client backend</literal></para></listitem>
+ <listitem><para><literal>standalone backend</literal></para></listitem>
+ <listitem><para><literal>background worker</literal></para></listitem>
+ <listitem><para><literal>walsender</literal></para></listitem>
+ </itemizedlist>
+ The difference is that reads done as part of <command>CREATE
+ DATABASE</command> are not counted in
+ <structname>pg_statio_all_tables</structname> and
+ <structname>pg_stat_database</structname>.
+ </para>
+ </entry>
+ </row>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>written</structfield> <type>bigint</type></para>
+ <para>
+ Writes by a given <varname>backend_type</varname> of a given
+ <varname>io_object</varname> of data from a given
+ <varname>io_context</varname>.
+ </para>
+ <para>
+ Normal client backends should be able to rely on auxiliary processes
+ like the checkpointer and background writer to write out dirty data as
+ much as possible. Large numbers of writes by
+ <varname>backend_type</varname> <literal>client backend</literal> in
+ <varname>io_context</varname> <literal>normal</literal> and
+ <varname>io_object</varname> <literal>relation</literal> could indicate
+ a misconfiguration of shared buffers or of checkpointer. More
+ information on checkpointer configuration can be found in <xref
+ linkend="wal-configuration"/>.
+ </para>
+ <para>
+ Note that the values of <varname>written</varname> for
+ <varname>backend_type</varname> <literal>background writer</literal> and
+ <varname>backend_type</varname> <literal>checkpointer</literal> are
+ equivalent to the values of <varname>buffers_clean</varname> and
+ <varname>buffers_checkpoint</varname>, respectively, in <link
+ linkend="monitoring-pg-stat-bgwriter-view">
+ <structname>pg_stat_bgwriter</structname></link>.
+ <varname>buffers_backend</varname> in
+ <structname>pg_stat_bgwriter</structname> is equivalent to
+ <structname>pg_stat_io</structname>'s <varname>written</varname> plus
+ <varname>extended</varname> for <varname>io_context</varname>s
+ <literal>normal</literal>, <literal>bulkread</literal>,
+ <literal>bulkwrite</literal>, and <literal>vacuum</literal> on
+ <varname>io_object</varname> <literal>relation</literal> for
+ <varname>backend_type</varname>s:
+ <itemizedlist>
+ <listitem><para><literal>client backend</literal></para></listitem>
+ <listitem><para><literal>autovacuum worker</literal></para></listitem>
+ <listitem><para><literal>background worker</literal></para></listitem>
+ <listitem><para><literal>walsender</literal></para></listitem>
+ </itemizedlist>
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>extended</structfield> <type>bigint</type></para>
+ <para>
+ Extends of relations done by a given <varname>backend_type</varname> in
+ order to write data for a given <varname>io_object</varname> in a given
+ <varname>io_context</varname>.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>op_bytes</structfield> <type>bigint</type></para>
+ <para>
+ The number of bytes per unit of IO read, written, or extended. For
+ block-oriented IO of relation data, reads, writes, and extends are done
+ in <varname>block_size</varname> units, derived from the build-time
+ parameter <symbol>BLCKSZ</symbol>, which is <literal>8192</literal> by
+ default.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>evicted</structfield> <type>bigint</type></para>
+ <para>
+ Number of times a <varname>backend_type</varname> has evicted a block
+ from a shared or local buffer in order to reuse the buffer in this
+ <varname>io_context</varname>. Blocks are only evicted when there are no
+ unoccupied buffers.
+ </para>
+ <para>
+ <varname>evicted</varname> in <varname>io_context</varname>
+ <literal>normal</literal> and <varname>io_object</varname>
+ <literal>relation</literal> counts the number of times a block from a
+ shared buffer was evicted so that it can be replaced with another block,
+ also in shared buffers.
+ </para>
+ <para>
+ A high <varname>evicted</varname> count in <varname>io_context</varname>
+ <literal>normal</literal> and <varname>io_object</varname>
+ <literal>relation</literal> could indicate that shared buffers is too
+ small and should be set to a larger value.
+ </para>
+ <para>
+ <varname>evicted</varname> in <varname>io_context</varname>
+ <literal>vacuum</literal>, <literal>bulkread</literal>, and
+ <literal>bulkwrite</literal> counts the number of times occupied shared
+ buffers were added to the fixed-size strategy ring buffer, causing the
+ buffer contents to be evicted. If the current buffer in the ring is
+ pinned or in use by another backend, it may be replaced by a new shared
+ buffer. If this shared buffer contains valid data, that block must be
+ evicted and will count as <varname>evicted</varname>.
+ </para>
+ <para>
+ Seeing a large number of <varname>evicted</varname> in strategy
+ <varname>io_context</varname>s can provide insight into primary working
+ set cache misses.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>reused</structfield> <type>bigint</type></para>
+ <para>
+ The number of times an existing buffer in the strategy ring was reused
+ as part of an operation in the <literal>bulkread</literal>,
+ <literal>bulkwrite</literal>, or <literal>vacuum</literal>
+ <varname>io_context</varname>s. When a Buffer Access Strategy reuses a
+ buffer in the strategy ring, it evicts the buffer contents, incrementing
+ <varname>reused</varname>. When a Buffer Access Strategy adds a new
+ shared buffer to the strategy ring and this shared buffer is occupied,
+ the Buffer Access Strategy must evict the contents of the shared buffer,
+ incrementing <varname>evicted</varname>.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>files_synced</structfield> <type>bigint</type></para>
+ <para>
+ Number of files <literal>fsync</literal>ed by a given
+ <varname>backend_type</varname> for the purpose of persisting data from
+ a given <varname>io_object</varname> dirtied in a given
+ <varname>io_context</varname>. <literal>fsync</literal>s are done at
+ segment boundaries so <varname>op_bytes</varname> does not apply to the
+ <varname>files_synced</varname> column. <literal>fsync</literal>s done
+ by backends in order to persist data written in
+ <varname>io_context</varname> <literal>vacuum</literal>,
+ <varname>io_context</varname> <literal>bulkread</literal>, or
+ <varname>io_context</varname> <literal>bulkwrite</literal> are counted
+ as <varname>io_context</varname> <literal>normal</literal>
+ <varname>io_object</varname> <literal>relation</literal>
+ <varname>files_synced</varname>.
+ </para>
+ <para>
+ Normal client backends should be able to rely on the checkpointer to
+ ensure data is persisted to permanent storage. Large numbers of
+ <varname>files_synced</varname> by <varname>backend_type</varname>
+ <literal>client backend</literal> could indicate a misconfiguration of
+ shared buffers or of checkpointer. More information on checkpointer
+ configuration can be found in <xref linkend="wal-configuration"/>.
+ </para>
+ <para>
+ Note that the sum of <varname>files_synced</varname> for all
+ <varname>io_context</varname> <literal>normal</literal>
+ <varname>io_object</varname> <literal>relation</literal> for all
+ <varname>backend_type</varname>s except <literal>checkpointer</literal>
+ is equivalent to <varname>buffers_backend_fsync</varname> in <link
+ linkend="monitoring-pg-stat-bgwriter-view">
+ <structname>pg_stat_bgwriter</structname></link>.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>stats_reset</structfield> <type>timestamp with time zone</type></para>
+ <para>
+ Time at which these statistics were last reset.
+ </para>
+ </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
</sect2>
<sect2 id="monitoring-pg-stat-bgwriter-view">
diff --git a/doc/src/sgml/pgstatstatements.sgml b/doc/src/sgml/pgstatstatements.sgml
index ea90365c7f..5df2b23df3 100644
--- a/doc/src/sgml/pgstatstatements.sgml
+++ b/doc/src/sgml/pgstatstatements.sgml
@@ -254,11 +254,27 @@
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>shared_blks_read</structfield> <type>bigint</type>
+ <structfield>shared_blks_read</structfield> <type>bigint</type></para>
+ <para>
+ Total number of shared blocks read by the statement.
</para>
<para>
- Total number of shared blocks read by the statement
- </para></entry>
+ <varname>shared_blks_read</varname> in
+ <structname>pg_stat_statements</structname> is equivalent to
+ <link linkend="monitoring-pg-stat-io-view"><structname>pg_stat_io</structname></link>'s
+ <varname>read</varname> for all <varname>io_context</varname>s with
+ <varname>io_object</varname> <literal>relation</literal> for
+ <varname>backend_type</varname>s:
+ <itemizedlist>
+ <listitem><para><literal>autovacuum launcher</literal></para></listitem>
+ <listitem><para><literal>autovacuum worker</literal></para></listitem>
+ <listitem><para><literal>client backend</literal></para></listitem>
+ <listitem><para><literal>standalone backend</literal></para></listitem>
+ <listitem><para><literal>background worker</literal></para></listitem>
+ <listitem><para><literal>walsender</literal></para></listitem>
+ </itemizedlist>
+ </para>
+ </entry>
</row>
<row>
@@ -272,11 +288,28 @@
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>shared_blks_written</structfield> <type>bigint</type>
- </para>
+ <structfield>shared_blks_written</structfield> <type>bigint</type></para>
<para>
Total number of shared blocks written by the statement
- </para></entry>
+ </para>
+ <para>
+ <varname>shared_blks_written</varname> in
+ <structname>pg_stat_statements</structname> is equivalent to
+ <link linkend="monitoring-pg-stat-io-view"><structname>pg_stat_io</structname></link>'s
+ <varname>written</varname> plus <varname>extended</varname> for all
+ <varname>io_context</varname>s with <varname>io_object</varname>
+ <literal>relation</literal> for <varname>backend_type</varname>s:
+ <itemizedlist>
+ <listitem><para><literal>autovacuum launcher</literal></para></listitem>
+ <listitem><para><literal>autovacuum worker</literal></para></listitem>
+ <listitem><para><literal>client backend</literal></para></listitem>
+ <listitem><para><literal>standalone backend</literal></para></listitem>
+ <listitem><para><literal>background worker</literal></para></listitem>
+ <listitem><para><literal>walsender</literal></para></listitem>
+ </itemizedlist>
+ </para>
+
+ </entry>
</row>
<row>
@@ -290,11 +323,24 @@
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>local_blks_read</structfield> <type>bigint</type>
- </para>
+ <structfield>local_blks_read</structfield> <type>bigint</type></para>
<para>
Total number of local blocks read by the statement
- </para></entry>
+ </para>
+ <para>
+ <varname>local_blks_read</varname> in
+ <structname>pg_stat_statements</structname> is equivalent to
+ <link linkend="monitoring-pg-stat-io-view"><structname>pg_stat_io</structname></link>'s
+ <varname>read</varname> for <varname>io_context</varname>
+ <literal>normal</literal> with <varname>io_object</varname>
+ <literal>temp relation</literal> for <varname>backend_type</varname>s:
+ <itemizedlist>
+ <listitem><para><literal>client backend</literal></para></listitem>
+ <listitem><para><literal>background worker</literal></para></listitem>
+ <listitem><para><literal>walsender</literal></para></listitem>
+ </itemizedlist>
+ </para>
+ </entry>
</row>
<row>
@@ -308,11 +354,25 @@
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>local_blks_written</structfield> <type>bigint</type>
- </para>
+ <structfield>local_blks_written</structfield> <type>bigint</type></para>
<para>
Total number of local blocks written by the statement
- </para></entry>
+ </para>
+ <para>
+ <varname>local_blks_written</varname> in
+ <structname>pg_stat_statements</structname> is equivalent to
+ <link linkend="monitoring-pg-stat-io-view"><structname>pg_stat_io</structname></link>'s
+ <varname>written</varname> plus <varname>extended</varname> for
+ <varname>io_context</varname> <literal>normal</literal> with
+ <varname>io_object</varname> <literal>temp relation</literal> for
+ <varname>backend_type</varname>s:
+ <itemizedlist>
+ <listitem><para><literal>client backend</literal></para></listitem>
+ <listitem><para><literal>background worker</literal></para></listitem>
+ <listitem><para><literal>walsender</literal></para></listitem>
+ </itemizedlist>
+ </para>
+ </entry>
</row>
<row>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 2d8104b090..296b3acf6e 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1117,6 +1117,21 @@ CREATE VIEW pg_stat_bgwriter AS
pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+CREATE VIEW pg_stat_io AS
+SELECT
+ b.backend_type,
+ b.io_context,
+ b.io_object,
+ b.read,
+ b.written,
+ b.extended,
+ b.op_bytes,
+ b.evicted,
+ b.reused,
+ b.files_synced,
+ b.stats_reset
+FROM pg_stat_get_io() b;
+
CREATE VIEW pg_stat_wal AS
SELECT
w.wal_records,
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index a135cad0ce..c4f03f7280 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1731,6 +1731,152 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
}
+/*
+ * When adding a new column to the pg_stat_io view, add a new enum value
+ * here above IO_NUM_COLUMNS.
+ */
+typedef enum io_stat_col
+{
+ IO_COL_BACKEND_TYPE,
+ IO_COL_IO_CONTEXT,
+ IO_COL_IO_OBJECT,
+ IO_COL_READS,
+ IO_COL_WRITES,
+ IO_COL_EXTENDS,
+ IO_COL_CONVERSION,
+ IO_COL_EVICTIONS,
+ IO_COL_REUSES,
+ IO_COL_FSYNCS,
+ IO_COL_RESET_TIME,
+ IO_NUM_COLUMNS,
+} io_stat_col;
+
+/*
+ * When adding a new IOOp, add a new io_stat_col and add a case to this
+ * function returning the corresponding io_stat_col.
+ */
+static io_stat_col
+pgstat_io_op_get_index(IOOp io_op)
+{
+ switch (io_op)
+ {
+ case IOOP_EVICT:
+ return IO_COL_EVICTIONS;
+ case IOOP_READ:
+ return IO_COL_READS;
+ case IOOP_REUSE:
+ return IO_COL_REUSES;
+ case IOOP_WRITE:
+ return IO_COL_WRITES;
+ case IOOP_EXTEND:
+ return IO_COL_EXTENDS;
+ case IOOP_FSYNC:
+ return IO_COL_FSYNCS;
+ }
+
+ elog(ERROR, "unrecognized IOOp value: %d", io_op);
+
+ pg_unreachable();
+}
+
+Datum
+pg_stat_get_io(PG_FUNCTION_ARGS)
+{
+ PgStat_BackendIOContextOps *backends_io_stats;
+ ReturnSetInfo *rsinfo;
+ Datum reset_time;
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+
+ backends_io_stats = pgstat_fetch_backend_io_context_ops();
+
+ reset_time = TimestampTzGetDatum(backends_io_stats->stat_reset_timestamp);
+
+ for (BackendType bktype = B_INVALID; bktype < BACKEND_NUM_TYPES; bktype++)
+ {
+ Datum bktype_desc = CStringGetTextDatum(GetBackendTypeDesc(bktype));
+ bool expect_backend_stats = true;
+ PgStat_IOContextOps *io_context_ops = &backends_io_stats->stats[bktype];
+
+ /*
+ * For those BackendTypes without IO Operation stats, skip
+ * representing them in the view altogether. We still loop through
+ * their counters so that we can assert that all values are zero.
+ */
+ expect_backend_stats = pgstat_io_op_stats_collected(bktype);
+
+ for (IOContext io_context = IOCONTEXT_BULKREAD;
+ io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ {
+ const char *io_context_str = pgstat_io_context_desc(io_context);
+ PgStat_IOObjectOps *io_objs = &io_context_ops->data[io_context];
+
+ for (IOObject io_object = IOOBJECT_RELATION;
+ io_object < IOOBJECT_NUM_TYPES; io_object++)
+ {
+ PgStat_IOOpCounters *counters = &io_objs->data[io_object];
+ const char *io_obj_str = pgstat_io_object_desc(io_object);
+
+ Datum values[IO_NUM_COLUMNS] = {0};
+ bool nulls[IO_NUM_COLUMNS] = {0};
+
+ /*
+ * Some combinations of IOContext, IOObject, and BackendType
+ * are not valid for any type of IOOp. In such cases, omit the
+ * entire row from the view.
+ */
+ if (!expect_backend_stats ||
+ !pgstat_bktype_io_context_io_object_valid(bktype,
+ io_context, io_object))
+ {
+ pgstat_io_context_ops_assert_zero(counters);
+ continue;
+ }
+
+ values[IO_COL_BACKEND_TYPE] = bktype_desc;
+ values[IO_COL_IO_CONTEXT] = CStringGetTextDatum(io_context_str);
+ values[IO_COL_IO_OBJECT] = CStringGetTextDatum(io_obj_str);
+ values[IO_COL_READS] = Int64GetDatum(counters->reads);
+ values[IO_COL_WRITES] = Int64GetDatum(counters->writes);
+ values[IO_COL_EXTENDS] = Int64GetDatum(counters->extends);
+
+ /*
+ * Hard-code this to the value of BLCKSZ for now. Future
+ * values could include XLOG_BLCKSZ, once WAL IO is tracked,
+ * and constant multipliers, once non-block-oriented IO (e.g.
+ * temporary file IO) is tracked.
+ */
+ values[IO_COL_CONVERSION] = Int64GetDatum(BLCKSZ);
+ values[IO_COL_EVICTIONS] = Int64GetDatum(counters->evictions);
+ values[IO_COL_REUSES] = Int64GetDatum(counters->reuses);
+ values[IO_COL_FSYNCS] = Int64GetDatum(counters->fsyncs);
+ values[IO_COL_RESET_TIME] = TimestampTzGetDatum(reset_time);
+
+ /*
+ * Some combinations of BackendType and IOOp, of IOContext and
+ * IOOp, and of IOObject and IOOp are not valid. Set these
+ * cells in the view NULL and assert that these stats are zero
+ * as expected.
+ */
+ for (IOOp io_op = IOOP_EVICT; io_op < IOOP_NUM_TYPES; io_op++)
+ {
+ if (!pgstat_io_op_valid(bktype, io_context, io_object,
+ io_op))
+ {
+ pgstat_io_op_assert_zero(counters, io_op);
+ nulls[pgstat_io_op_get_index(io_op)] = true;
+ }
+ }
+
+ tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
+ }
+ }
+ }
+
+ return (Datum) 0;
+}
+
/*
* Returns statistics of WAL activity
*/
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index f9301b2627..1416fa27d3 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5679,6 +5679,15 @@
proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
+{ oid => '8459', descr => 'statistics: per backend type IO statistics',
+ proname => 'pg_stat_get_io', provolatile => 'v',
+ prorows => '30', proretset => 't',
+ proparallel => 'r', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,text,text,int8,int8,int8,int8,int8,int8,int8,timestamptz}',
+ proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}',
+ proargnames => '{backend_type,io_context,io_object,read,written,extended,op_bytes,evicted,reused,files_synced,stats_reset}',
+ prosrc => 'pg_stat_get_io' },
+
{ oid => '1136', descr => 'statistics: information about WAL activity',
proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
proparallel => 'r', prorettype => 'record', proargtypes => '',
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 532ea36990..6ab300b1da 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1876,6 +1876,18 @@ pg_stat_gssapi| SELECT s.pid,
s.gss_enc AS encrypted
FROM pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid, query_id)
WHERE (s.client_port IS NOT NULL);
+pg_stat_io| SELECT b.backend_type,
+ b.io_context,
+ b.io_object,
+ b.read,
+ b.written,
+ b.extended,
+ b.op_bytes,
+ b.evicted,
+ b.reused,
+ b.files_synced,
+ b.stats_reset
+ FROM pg_stat_get_io() b(backend_type, io_context, io_object, read, written, extended, op_bytes, evicted, reused, files_synced, stats_reset);
pg_stat_progress_analyze| SELECT s.pid,
s.datid,
d.datname,
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index 1d84407a03..01070a53a4 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -1126,4 +1126,229 @@ SELECT pg_stat_get_subscription_stats(NULL);
(1 row)
+-- Test that the following operations are tracked in pg_stat_io:
+-- - reads of target blocks into shared buffers
+-- - writes of shared buffers to permanent storage
+-- - extends of relations using shared buffers
+-- - fsyncs done to ensure the durability of data dirtying shared buffers
+-- There is no test for blocks evicted from shared buffers, because we cannot
+-- be sure of the state of shared buffers at the point the test is run.
+-- Create a regular table and insert some data to generate IOCONTEXT_NORMAL
+-- extends.
+SELECT sum(extended) AS io_sum_shared_extends_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+CREATE TABLE test_io_shared(a int);
+INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(extended) AS io_sum_shared_extends_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT :io_sum_shared_extends_after > :io_sum_shared_extends_before;
+ ?column?
+----------
+ t
+(1 row)
+
+-- After a checkpoint, there should be some additional IOCONTEXT_NORMAL writes
+-- and fsyncs.
+-- The second checkpoint ensures that stats from the first checkpoint have been
+-- reported and protects against any potential races amongst the table
+-- creation, a possible timing-triggered checkpoint, and the explicit
+-- checkpoint in the test.
+SELECT sum(written) AS io_sum_shared_writes_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT sum(files_synced) AS io_sum_shared_fsyncs_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+CHECKPOINT;
+CHECKPOINT;
+SELECT sum(written) AS io_sum_shared_writes_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT sum(files_synced) AS io_sum_shared_fsyncs_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT :io_sum_shared_writes_after > :io_sum_shared_writes_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT current_setting('fsync') = 'off' OR :io_sum_shared_fsyncs_after > :io_sum_shared_fsyncs_before;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Change the tablespace so that the table is rewritten directly, then SELECT
+-- from it to cause it to be read back into shared buffers.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE test_io_shared_stats_tblspc LOCATION '';
+SELECT sum(read) AS io_sum_shared_reads_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+ALTER TABLE test_io_shared SET TABLESPACE test_io_shared_stats_tblspc;
+-- SELECT from the table so that it is read into shared buffers and io_context
+-- 'normal', io_object 'relation' reads are counted.
+SELECT COUNT(*) FROM test_io_shared;
+ count
+-------
+ 100
+(1 row)
+
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(read) AS io_sum_shared_reads_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT :io_sum_shared_reads_after > :io_sum_shared_reads_before;
+ ?column?
+----------
+ t
+(1 row)
+
+DROP TABLE test_io_shared;
+DROP TABLESPACE test_io_shared_stats_tblspc;
+-- Test that the following IOCONTEXT_LOCAL IOOps are tracked in pg_stat_io:
+-- - eviction of local buffers in order to reuse them
+-- - reads of temporary table blocks into local buffers
+-- - writes of local buffers to permanent storage
+-- - extends of temporary tables
+-- Set temp_buffers to a low value so that we can trigger writes with fewer
+-- inserted tuples. Do so in a new session in case temporary tables have been
+-- accessed by previous tests in this session.
+\c
+SET temp_buffers TO '1MB';
+CREATE TEMPORARY TABLE test_io_local(a int, b TEXT);
+SELECT sum(extended) AS io_sum_local_extends_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(evicted) AS io_sum_local_evictions_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(written) AS io_sum_local_writes_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+-- Insert tuples into the temporary table, generating extends in the stats.
+-- Insert enough values that we need to reuse and write out dirty local
+-- buffers, generating evictions and writes.
+INSERT INTO test_io_local SELECT generate_series(1, 8000) as id, repeat('a', 100);
+SELECT sum(read) AS io_sum_local_reads_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+-- Read in evicted buffers, generating reads.
+SELECT COUNT(*) FROM test_io_local;
+ count
+-------
+ 8000
+(1 row)
+
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(evicted) AS io_sum_local_evictions_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(read) AS io_sum_local_reads_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(written) AS io_sum_local_writes_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(extended) AS io_sum_local_extends_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT :io_sum_local_evictions_after > :io_sum_local_evictions_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_local_reads_after > :io_sum_local_reads_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_local_writes_after > :io_sum_local_writes_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_local_extends_after > :io_sum_local_extends_before;
+ ?column?
+----------
+ t
+(1 row)
+
+RESET temp_buffers;
+-- Test that reuse of strategy buffers and reads of blocks into these reused
+-- buffers while VACUUMing are tracked in pg_stat_io.
+-- Set wal_skip_threshold smaller than the expected size of
+-- test_io_vac_strategy so that, even if wal_level is minimal, VACUUM FULL will
+-- fsync the newly rewritten test_io_vac_strategy instead of writing it to WAL.
+-- Writing it to WAL will result in the newly written relation pages being in
+-- shared buffers -- preventing us from testing BAS_VACUUM BufferAccessStrategy
+-- reads.
+SET wal_skip_threshold = '1 kB';
+SELECT sum(reused) AS io_sum_vac_strategy_reuses_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(read) AS io_sum_vac_strategy_reads_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+CREATE TABLE test_io_vac_strategy(a int, b int) WITH (autovacuum_enabled = 'false');
+INSERT INTO test_io_vac_strategy SELECT i, i from generate_series(1, 8000)i;
+-- Ensure that the next VACUUM will need to perform IO by rewriting the table
+-- first with VACUUM (FULL).
+VACUUM (FULL) test_io_vac_strategy;
+VACUUM (PARALLEL 0) test_io_vac_strategy;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(reused) AS io_sum_vac_strategy_reuses_after FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(read) AS io_sum_vac_strategy_reads_after FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT :io_sum_vac_strategy_reads_after > :io_sum_vac_strategy_reads_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_vac_strategy_reuses_after > :io_sum_vac_strategy_reuses_before;
+ ?column?
+----------
+ t
+(1 row)
+
+RESET wal_skip_threshold;
+-- Test that extends done by a CTAS, which uses a BAS_BULKWRITE
+-- BufferAccessStrategy, are tracked in pg_stat_io.
+SELECT sum(extended) AS io_sum_bulkwrite_strategy_extends_before FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+CREATE TABLE test_io_bulkwrite_strategy AS SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(extended) AS io_sum_bulkwrite_strategy_extends_after FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+SELECT :io_sum_bulkwrite_strategy_extends_after > :io_sum_bulkwrite_strategy_extends_before;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Test IO stats reset
+SELECT sum(evicted) + sum(reused) + sum(extended) + sum(files_synced) + sum(read) + sum(written) AS io_stats_pre_reset FROM pg_stat_io \gset
+SELECT pg_stat_reset_shared('io');
+ pg_stat_reset_shared
+----------------------
+
+(1 row)
+
+SELECT sum(evicted) + sum(reused) + sum(extended) + sum(files_synced) + sum(read) + sum(written) AS io_stats_post_reset FROM pg_stat_io \gset
+SELECT :io_stats_post_reset < :io_stats_pre_reset;
+ ?column?
+----------
+ t
+(1 row)
+
-- End of Stats Test
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index b4d6753c71..962ae5b281 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -536,4 +536,142 @@ SELECT pg_stat_get_replication_slot(NULL);
SELECT pg_stat_get_subscription_stats(NULL);
+-- Test that the following operations are tracked in pg_stat_io:
+-- - reads of target blocks into shared buffers
+-- - writes of shared buffers to permanent storage
+-- - extends of relations using shared buffers
+-- - fsyncs done to ensure the durability of data dirtying shared buffers
+
+-- There is no test for blocks evicted from shared buffers, because we cannot
+-- be sure of the state of shared buffers at the point the test is run.
+
+-- Create a regular table and insert some data to generate IOCONTEXT_NORMAL
+-- extends.
+SELECT sum(extended) AS io_sum_shared_extends_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+CREATE TABLE test_io_shared(a int);
+INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+SELECT sum(extended) AS io_sum_shared_extends_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT :io_sum_shared_extends_after > :io_sum_shared_extends_before;
+
+-- After a checkpoint, there should be some additional IOCONTEXT_NORMAL writes
+-- and fsyncs.
+-- The second checkpoint ensures that stats from the first checkpoint have been
+-- reported and protects against any potential races amongst the table
+-- creation, a possible timing-triggered checkpoint, and the explicit
+-- checkpoint in the test.
+SELECT sum(written) AS io_sum_shared_writes_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT sum(files_synced) AS io_sum_shared_fsyncs_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+CHECKPOINT;
+CHECKPOINT;
+SELECT sum(written) AS io_sum_shared_writes_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT sum(files_synced) AS io_sum_shared_fsyncs_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+
+SELECT :io_sum_shared_writes_after > :io_sum_shared_writes_before;
+SELECT current_setting('fsync') = 'off' OR :io_sum_shared_fsyncs_after > :io_sum_shared_fsyncs_before;
+
+-- Change the tablespace so that the table is rewritten directly, then SELECT
+-- from it to cause it to be read back into shared buffers.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE test_io_shared_stats_tblspc LOCATION '';
+SELECT sum(read) AS io_sum_shared_reads_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+ALTER TABLE test_io_shared SET TABLESPACE test_io_shared_stats_tblspc;
+-- SELECT from the table so that it is read into shared buffers and io_context
+-- 'normal', io_object 'relation' reads are counted.
+SELECT COUNT(*) FROM test_io_shared;
+SELECT pg_stat_force_next_flush();
+SELECT sum(read) AS io_sum_shared_reads_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT :io_sum_shared_reads_after > :io_sum_shared_reads_before;
+DROP TABLE test_io_shared;
+DROP TABLESPACE test_io_shared_stats_tblspc;
+
+-- Test that the following IOCONTEXT_LOCAL IOOps are tracked in pg_stat_io:
+-- - eviction of local buffers in order to reuse them
+-- - reads of temporary table blocks into local buffers
+-- - writes of local buffers to permanent storage
+-- - extends of temporary tables
+
+-- Set temp_buffers to a low value so that we can trigger writes with fewer
+-- inserted tuples. Do so in a new session in case temporary tables have been
+-- accessed by previous tests in this session.
+\c
+SET temp_buffers TO '1MB';
+CREATE TEMPORARY TABLE test_io_local(a int, b TEXT);
+SELECT sum(extended) AS io_sum_local_extends_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(evicted) AS io_sum_local_evictions_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(written) AS io_sum_local_writes_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+-- Insert tuples into the temporary table, generating extends in the stats.
+-- Insert enough values that we need to reuse and write out dirty local
+-- buffers, generating evictions and writes.
+INSERT INTO test_io_local SELECT generate_series(1, 8000) as id, repeat('a', 100);
+
+SELECT sum(read) AS io_sum_local_reads_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+-- Read in evicted buffers, generating reads.
+SELECT COUNT(*) FROM test_io_local;
+SELECT pg_stat_force_next_flush();
+SELECT sum(evicted) AS io_sum_local_evictions_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(read) AS io_sum_local_reads_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(written) AS io_sum_local_writes_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(extended) AS io_sum_local_extends_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT :io_sum_local_evictions_after > :io_sum_local_evictions_before;
+SELECT :io_sum_local_reads_after > :io_sum_local_reads_before;
+SELECT :io_sum_local_writes_after > :io_sum_local_writes_before;
+SELECT :io_sum_local_extends_after > :io_sum_local_extends_before;
+RESET temp_buffers;
+
+-- Test that reuse of strategy buffers and reads of blocks into these reused
+-- buffers while VACUUMing are tracked in pg_stat_io.
+
+-- Set wal_skip_threshold smaller than the expected size of
+-- test_io_vac_strategy so that, even if wal_level is minimal, VACUUM FULL will
+-- fsync the newly rewritten test_io_vac_strategy instead of writing it to WAL.
+-- Writing it to WAL will result in the newly written relation pages being in
+-- shared buffers -- preventing us from testing BAS_VACUUM BufferAccessStrategy
+-- reads.
+SET wal_skip_threshold = '1 kB';
+SELECT sum(reused) AS io_sum_vac_strategy_reuses_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(read) AS io_sum_vac_strategy_reads_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+CREATE TABLE test_io_vac_strategy(a int, b int) WITH (autovacuum_enabled = 'false');
+INSERT INTO test_io_vac_strategy SELECT i, i from generate_series(1, 8000)i;
+-- Ensure that the next VACUUM will need to perform IO by rewriting the table
+-- first with VACUUM (FULL).
+VACUUM (FULL) test_io_vac_strategy;
+VACUUM (PARALLEL 0) test_io_vac_strategy;
+SELECT pg_stat_force_next_flush();
+SELECT sum(reused) AS io_sum_vac_strategy_reuses_after FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(read) AS io_sum_vac_strategy_reads_after FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT :io_sum_vac_strategy_reads_after > :io_sum_vac_strategy_reads_before;
+SELECT :io_sum_vac_strategy_reuses_after > :io_sum_vac_strategy_reuses_before;
+RESET wal_skip_threshold;
+
+-- Test that extends done by a CTAS, which uses a BAS_BULKWRITE
+-- BufferAccessStrategy, are tracked in pg_stat_io.
+SELECT sum(extended) AS io_sum_bulkwrite_strategy_extends_before FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+CREATE TABLE test_io_bulkwrite_strategy AS SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+SELECT sum(extended) AS io_sum_bulkwrite_strategy_extends_after FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+SELECT :io_sum_bulkwrite_strategy_extends_after > :io_sum_bulkwrite_strategy_extends_before;
+
+-- Test IO stats reset
+SELECT sum(evicted) + sum(reused) + sum(extended) + sum(files_synced) + sum(read) + sum(written) AS io_stats_pre_reset FROM pg_stat_io \gset
+SELECT pg_stat_reset_shared('io');
+SELECT sum(evicted) + sum(reused) + sum(extended) + sum(files_synced) + sum(read) + sum(written) AS io_stats_post_reset FROM pg_stat_io \gset
+SELECT :io_stats_post_reset < :io_stats_pre_reset;
+
-- End of Stats Test
--
2.38.1
In the pg_stat_statements docs, there are several column descriptions like
Total number of ... by the statement
You added an additional sentence to some describing the equivalent
pg_stat_io values, but you only added a period to the previous
sentence for shared_blks_read (for other columns, the additional
description just follows directly). These should be consistent.
Otherwise, the docs look good to me.
Hi,
On 2022-10-06 13:42:09 -0400, Melanie Plageman wrote:
Additionally, some minor notes:
- Since the stats are counting blocks, it would make sense to prefix the
  view columns with "blks_", and word them in the past tense (to match
  current style), i.e. "blks_written", "blks_read", "blks_extended",
  "blks_fsynced" (realistically one would combine this new view with
  other data, e.g. from pg_stat_database or pg_stat_statements, which all
  use the "blks_" prefix, and stop using pg_stat_bgwriter for this, which
  does not use such a prefix)
I have changed the column names to be in the past tense.
For a while I was convinced by the consistency argument (after Melanie
pointing it out to me). But the more I look, the less convinced I am. The
existing IO related stats in pg_stat_database, pg_stat_bgwriter aren't past
tense, just the ones in pg_stat_statements. pg_stat_database uses past tense
for tup_*, but not xact_*, deadlocks, checksum_failures etc.
And even pg_stat_statements isn't consistent about it - otherwise it'd be
'planned' instead of 'plans', 'called' instead of 'calls' etc.
I started to look at the naming "tense" issue again, after I got "confused"
about "extended", because that somehow makes me think about more detailed
stats or such, rather than files getting extended.
ISTM that 'evictions', 'extends', 'fsyncs', 'reads', 'reuses', 'writes' are
clearer than the past tense versions, and about as consistent with existing
columns.
FWIW, I've been hacking on this code a bunch, mostly around renaming things
and changing the 'stacking' of the patches. My current state is at
https://github.com/anarazel/postgres/tree/pg_stat_io
A bit more to do before posting the edited version...
Greetings,
Andres Freund
On Wed, Dec 28, 2022 at 6:56 PM Andres Freund <andres@anarazel.de> wrote:
FWIW, I've been hacking on this code a bunch, mostly around renaming things
and changing the 'stacking' of the patches. My current state is at
https://github.com/anarazel/postgres/tree/pg_stat_io
A bit more to do before posting the edited version...
Here is the bit more done.
I've attached a new version 42 which incorporates all of Andres' changes
on his branch (which I am considering version 41).
I have fixed various issues with counting fsyncs and added more comments
and done cosmetic cleanup.
The docs have substantial changes but still require more work:
- The comparisons between columns in pg_stat_io and pg_stat_statements
have been removed, since the granularity and lifetime are so
different, comparing them isn't quite correct.
- The lists of backend types still take up a lot of visual space in the
definitions, which doesn't look great. I'm not sure what to do about
that.
- Andres has pointed out that it is difficult to read the definitions of
the columns because of the added clutter of the interpretations and
the comparisons to other stats views. I'm not sure if I should cut
these. He and I tried adding that information as a note and in other
various table types, however none of the alternatives were an
improvement.
Besides docs, there is one large change to the code which I am currently
working on, which is to change PgStat_IOOpCounters into an array of
PgStatCounters instead of having individual members for each IOOp type.
I hadn't done this previously because the additional level of nesting
seemed confusing. However, it seems it would simplify the code quite a
bit and is probably worth doing.
- Melanie
Attachments:
v42-0004-Add-system-view-tracking-IO-ops-per-backend-type.patch
From c5f813b209023d2ad6247a17969f4410e7511a40 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 28 Dec 2022 12:09:15 -0800
Subject: [PATCH v42 4/4] Add system view tracking IO ops per backend type
Add pg_stat_io, a system view which tracks the number of IOOps
(evictions, reuses, reads, writes, extensions, and fsyncs) done on each
IOObject (relation, temp relation) in each IOContext ("normal" and those
using a BufferAccessStrategy) by each type of backend (e.g. client
backend, checkpointer).
Some BackendTypes do not accumulate IO operation statistics and will
not be included in the view.
Some IOContexts are not used by some BackendTypes and will not be in the
view. For example, checkpointer does not use a BufferAccessStrategy
(currently), so there will be no rows for BufferAccessStrategy
IOContexts for checkpointer.
Some IOObjects are never operated on in some IOContexts or by some
BackendTypes. These rows are omitted from the view. For example,
checkpointer will never operate on IOOBJECT_TEMP_RELATION data, so those
rows are omitted.
Some IOOps are invalid in combination with certain IOContexts and
certain IOObjects. Those cells will be NULL in the view to distinguish
between 0 observed IOOps of that type and an invalid combination. For
example, temporary tables are not fsynced so cells for all BackendTypes
for IOOBJECT_TEMP_RELATION and IOOP_FSYNC will be NULL.
Some BackendTypes never perform certain IOOps. Those cells will also be
NULL in the view. For example, bgwriter should not perform reads.
View stats are populated with statistics incremented when a backend
performs an IO Operation and maintained by the cumulative statistics
subsystem.
Each row of the view shows stats for a particular BackendType, IOObject,
IOContext combination (e.g. a client backend's operations on permanent
relations in shared buffers) and each column in the view is the total
number of IO Operations done (e.g. writes). So a cell in the view would
be, for example, the number of blocks of relation data written from
shared buffers by client backends since the last stats reset.
In anticipation of tracking WAL IO and non-block-oriented IO (such as
temporary file IO), the "op_bytes" column specifies the unit of the "read",
"written", and "extended" columns for a given row.
Note that some of the cells in the view are redundant with fields in
pg_stat_bgwriter (e.g. buffers_backend); however, these have been kept in
pg_stat_bgwriter for backwards compatibility. Deriving the redundant
pg_stat_bgwriter stats from the IO operations stats structures was also
problematic due to the separate reset targets for 'bgwriter' and 'io'.
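As an illustrative example of reading the view (column names as described above; results obviously depend on the workload), one could look for block writes and extends done by client backends themselves, which auxiliary processes would ideally absorb:

```sql
-- Illustrative only: blocks written/extended by client backends,
-- per IO context and object, since the last 'io' stats reset.
SELECT io_context, io_object, written, extended
FROM pg_stat_io
WHERE backend_type = 'client backend'
  AND io_object = 'relation';
```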
Suggested by Andres Freund
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Maciek Sakrejda <m.sakrejda@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
contrib/amcheck/expected/check_heap.out | 31 ++
contrib/amcheck/sql/check_heap.sql | 24 ++
doc/src/sgml/monitoring.sgml | 418 +++++++++++++++++++++++-
src/backend/catalog/system_views.sql | 15 +
src/backend/utils/adt/pgstatfuncs.c | 142 ++++++++
src/include/catalog/pg_proc.dat | 9 +
src/test/regress/expected/rules.out | 12 +
src/test/regress/expected/stats.out | 225 +++++++++++++
src/test/regress/sql/stats.sql | 138 ++++++++
src/tools/pgindent/typedefs.list | 1 +
10 files changed, 1001 insertions(+), 14 deletions(-)
diff --git a/contrib/amcheck/expected/check_heap.out b/contrib/amcheck/expected/check_heap.out
index c010361025..c44338fd6e 100644
--- a/contrib/amcheck/expected/check_heap.out
+++ b/contrib/amcheck/expected/check_heap.out
@@ -66,6 +66,19 @@ SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'ALL-VISIBLE');
INSERT INTO heaptest (a, b)
(SELECT gs, repeat('x', gs)
FROM generate_series(1,50) gs);
+-- pg_stat_io test:
+-- verify_heapam always uses a BAS_BULKREAD BufferAccessStrategy. This allows
+-- us to reliably test that pg_stat_io BULKREAD reads are being captured
+-- without relying on the size of shared buffers or on an expensive operation
+-- like CREATE DATABASE.
+--
+-- Create an alternative tablespace and move the heaptest table to it, causing
+-- it to be rewritten.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE test_stats LOCATION '';
+SELECT sum(read) AS stats_bulkreads_before
+ FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+ALTER TABLE heaptest SET TABLESPACE test_stats;
-- Check that valid options are not rejected nor corruption reported
-- for a non-empty table
SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'none');
@@ -88,6 +101,23 @@ SELECT * FROM verify_heapam(relation := 'heaptest', startblock := 0, endblock :=
-------+--------+--------+-----
(0 rows)
+-- verify_heapam should have read in the page written out by
+-- ALTER TABLE ... SET TABLESPACE ...
+-- causing an additional bulkread, which should be reflected in pg_stat_io.
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(read) AS stats_bulkreads_after
+ FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+SELECT :stats_bulkreads_after > :stats_bulkreads_before;
+ ?column?
+----------
+ t
+(1 row)
+
CREATE ROLE regress_heaptest_role;
-- verify permissions are checked (error due to function not callable)
SET ROLE regress_heaptest_role;
@@ -195,6 +225,7 @@ ERROR: cannot check relation "test_foreign_table"
DETAIL: This operation is not supported for foreign tables.
-- cleanup
DROP TABLE heaptest;
+DROP TABLESPACE test_stats;
DROP TABLE test_partition;
DROP TABLE test_partitioned;
DROP OWNED BY regress_heaptest_role; -- permissions
diff --git a/contrib/amcheck/sql/check_heap.sql b/contrib/amcheck/sql/check_heap.sql
index 298de6886a..210f9b22e2 100644
--- a/contrib/amcheck/sql/check_heap.sql
+++ b/contrib/amcheck/sql/check_heap.sql
@@ -20,11 +20,26 @@ SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'NONE');
SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'ALL-FROZEN');
SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'ALL-VISIBLE');
+
-- Add some data so subsequent tests are not entirely trivial
INSERT INTO heaptest (a, b)
(SELECT gs, repeat('x', gs)
FROM generate_series(1,50) gs);
+-- pg_stat_io test:
+-- verify_heapam always uses a BAS_BULKREAD BufferAccessStrategy. This allows
+-- us to reliably test that pg_stat_io BULKREAD reads are being captured
+-- without relying on the size of shared buffers or on an expensive operation
+-- like CREATE DATABASE.
+--
+-- Create an alternative tablespace and move the heaptest table to it, causing
+-- it to be rewritten.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE test_stats LOCATION '';
+SELECT sum(read) AS stats_bulkreads_before
+ FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+ALTER TABLE heaptest SET TABLESPACE test_stats;
+
-- Check that valid options are not rejected nor corruption reported
-- for a non-empty table
SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'none');
@@ -32,6 +47,14 @@ SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'all-frozen');
SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'all-visible');
SELECT * FROM verify_heapam(relation := 'heaptest', startblock := 0, endblock := 0);
+-- verify_heapam should have read in the page written out by
+-- ALTER TABLE ... SET TABLESPACE ...
+-- causing an additional bulkread, which should be reflected in pg_stat_io.
+SELECT pg_stat_force_next_flush();
+SELECT sum(read) AS stats_bulkreads_after
+ FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+SELECT :stats_bulkreads_after > :stats_bulkreads_before;
+
CREATE ROLE regress_heaptest_role;
-- verify permissions are checked (error due to function not callable)
@@ -110,6 +133,7 @@ SELECT * FROM verify_heapam('test_foreign_table',
-- cleanup
DROP TABLE heaptest;
+DROP TABLESPACE test_stats;
DROP TABLE test_partition;
DROP TABLE test_partitioned;
DROP OWNED BY regress_heaptest_role; -- permissions
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 710bd2c52e..b27c6c7bc7 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -469,6 +469,15 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</entry>
</row>
+ <row>
+ <entry><structname>pg_stat_io</structname><indexterm><primary>pg_stat_io</primary></indexterm></entry>
+      <entry>One row per IO context per backend type, showing
+ statistics about backend IO operations. See
+ <link linkend="monitoring-pg-stat-io-view">
+ <structname>pg_stat_io</structname></link> for details.
+ </entry>
+ </row>
+
<row>
<entry><structname>pg_stat_replication_slots</structname><indexterm><primary>pg_stat_replication_slots</primary></indexterm></entry>
<entry>One row per replication slot, showing statistics about the
@@ -665,20 +674,20 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</para>
<para>
- The <structname>pg_statio_</structname> views are primarily useful to
- determine the effectiveness of the buffer cache. When the number
- of actual disk reads is much smaller than the number of buffer
- hits, then the cache is satisfying most read requests without
- invoking a kernel call. However, these statistics do not give the
- entire story: due to the way in which <productname>PostgreSQL</productname>
- handles disk I/O, data that is not in the
- <productname>PostgreSQL</productname> buffer cache might still reside in the
- kernel's I/O cache, and might therefore still be fetched without
- requiring a physical read. Users interested in obtaining more
- detailed information on <productname>PostgreSQL</productname> I/O behavior are
- advised to use the <productname>PostgreSQL</productname> statistics views
- in combination with operating system utilities that allow insight
- into the kernel's handling of I/O.
+ The <structname>pg_stat_io</structname> and
+ <structname>pg_statio_</structname> set of views are primarily useful to
+ determine the effectiveness of the buffer cache. When the number of actual
+ disk reads is much smaller than the number of buffer hits, then the cache is
+ satisfying most read requests without invoking a kernel call. However, these
+ statistics do not give the entire story: due to the way in which
+ <productname>PostgreSQL</productname> handles disk I/O, data that is not in
+ the <productname>PostgreSQL</productname> buffer cache might still reside in
+ the kernel's I/O cache, and might therefore still be fetched without
+ requiring a physical read. Users interested in obtaining more detailed
+ information on <productname>PostgreSQL</productname> I/O behavior are
+ advised to use the <productname>PostgreSQL</productname> statistics views in
+ combination with operating system utilities that allow insight into the
+ kernel's handling of I/O.
</para>
</sect2>
@@ -3628,6 +3637,387 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
<structfield>last_archived_wal</structfield> have also been successfully
archived.
</para>
+ </sect2>
+
+ <sect2 id="monitoring-pg-stat-io-view">
+ <title><structname>pg_stat_io</structname></title>
+
+ <indexterm>
+ <primary>pg_stat_io</primary>
+ </indexterm>
+
+ <para>
+    The <structname>pg_stat_io</structname> view shows IO related
+    statistics. The statistics are tracked separately for each backend
+    type, for each IO context (the manner in which the IO is done, e.g.
+    using a Buffer Access Strategy), and for each IO object (the kind of
+    object operated on), with each combination returned as a separate row
+    (combinations that do not make sense are omitted).
+ </para>
+
+ <para>
+    Currently, IO on relations (e.g. tables, indexes) is tracked. However,
+    relation IO that bypasses shared buffers (e.g. when moving a table from
+    one tablespace to another) is not tracked.
+ </para>
+
+ <table id="pg-stat-io-view" xreflabel="pg_stat_io">
+ <title><structname>pg_stat_io</structname> View</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para>
+ </entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>backend_type</structfield> <type>text</type>
+ </para>
+ <para>
+ Type of backend (e.g. background worker, autovacuum worker).
+ See <link linkend="monitoring-pg-stat-activity-view">
+ <structname>pg_stat_activity</structname></link> for more information on
+ <varname>backend_type</varname>s. Some <varname>backend_type</varname>s
+ do not accumulate IO operation statistics and will not be included in
+ the view.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>io_context</structfield> <type>text</type>
+ </para>
+ <para>
+ The context of an IO operation or location of an IO object:
+ </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>normal</literal> refers to the default or standard type or
+ location of IO operations on IO objects.
+ </para>
+ <para>
+ Operations on temporary relations use a process-local buffer pool and
+ are counted as <varname>io_context</varname>
+          <literal>normal</literal>, <varname>io_object</varname>
+ <literal>temp relation</literal> operations.
+ </para>
+ <para>
+ IO operations on permanent relations are done by default in shared
+ buffers. These are tracked in <varname>io_context</varname>
+ <literal>normal</literal>, <varname>io_object</varname>
+ <literal>relation</literal>.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>vacuum</literal> refers to the IO operations incurred while
+ vacuuming and analyzing permanent relations.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>bulkread</literal> refers to IO operations on permanent
+ relations specially designated as <literal>bulkreads</literal>, such
+ as the sequential scan of a large table.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>bulkwrite</literal> refers to IO operations on permanent
+ relations specially designated as <literal>bulkwrites</literal>,
+ such as <command>COPY</command>.
+ </para>
+ </listitem>
+ </itemizedlist>
+ <para>
+ These last three <varname>io_context</varname>s are counted separately
+ because the autovacuum daemon, explicit <command>VACUUM</command>,
+ explicit <command>ANALYZE</command>, many bulk reads, and many bulk
+ writes acquire a limited number of shared buffers and reuse them
+ circularly to avoid occupying an undue portion of the main shared
+ buffer pool. This pattern is called a <quote>Buffer Access
+ Strategy</quote> in the <productname>PostgreSQL</productname> source
+ code and the fixed-size ring buffer is referred to as a <quote>strategy
+ ring buffer</quote> for the purposes of this view's documentation.
+ These <varname>io_context</varname>s are referred to as <quote>strategy
+ contexts</quote> and IO operations on strategy contexts are referred to
+ as <quote>strategy operations</quote>.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>io_object</structfield> <type>text</type>
+ </para>
+ <para>
+ Object operated on in a given <varname>io_context</varname> by a given
+ <varname>backend_type</varname>. Current values are
+ <literal>relation</literal>, which includes permanent relations, and
+ <literal>temp relation</literal> which includes temporary relations
+ created by <command>CREATE TEMPORARY TABLE...</command>.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>read</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Reads by a given <varname>backend_type</varname> of a given
+ <varname>io_object</varname> into buffers in a given
+ <varname>io_context</varname>.
+ </para>
+ <para>
+ Note that the sum of
+ <varname>heap_blks_read</varname>,
+ <varname>idx_blks_read</varname>,
+ <varname>tidx_blks_read</varname>, and
+ <varname>toast_blks_read</varname>
+ in <link linkend="monitoring-pg-statio-all-tables-view">
+ <structname>pg_statio_all_tables</structname></link> as well as
+ <varname>blks_read</varname> in <link
+ linkend="monitoring-pg-stat-database-view">
+ <structname>pg_stat_database</structname></link> are both similar to
+ <varname>read</varname> plus <varname>extended</varname> for all
+ <varname>io_context</varname>s for the following
+ <varname>backend_type</varname>s in <structname>pg_stat_io</structname>:
+ <itemizedlist>
+ <listitem><para><literal>autovacuum launcher</literal></para></listitem>
+ <listitem><para><literal>autovacuum worker</literal></para></listitem>
+ <listitem><para><literal>client backend</literal></para></listitem>
+ <listitem><para><literal>standalone backend</literal></para></listitem>
+ <listitem><para><literal>background worker</literal></para></listitem>
+ <listitem><para><literal>walsender</literal></para></listitem>
+ </itemizedlist>
+ The difference is that reads done as part of <command>CREATE
+ DATABASE</command> are not counted in
+ <structname>pg_statio_all_tables</structname> and
+ <structname>pg_stat_database</structname>.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>written</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Writes by a given <varname>backend_type</varname> of a given
+ <varname>io_object</varname> of data from a given
+ <varname>io_context</varname>.
+ </para>
+ <para>
+ Normal client backends should be able to rely on auxiliary processes
+ like the checkpointer and background writer to write out dirty data as
+ much as possible. Large numbers of writes by
+ <varname>backend_type</varname> <literal>client backend</literal> in
+ <varname>io_context</varname> <literal>normal</literal> and
+ <varname>io_object</varname> <literal>relation</literal> could indicate
+ a misconfiguration of shared buffers or of checkpointer. More
+ information on checkpointer configuration can be found in <xref
+ linkend="wal-configuration"/>.
+ </para>
+ <para>
+ Note that the values of <varname>written</varname> for
+ <varname>backend_type</varname> <literal>background writer</literal> and
+ <varname>backend_type</varname> <literal>checkpointer</literal>
+ correspond to the values of <varname>buffers_clean</varname> and
+ <varname>buffers_checkpoint</varname>, respectively, in <link
+ linkend="monitoring-pg-stat-bgwriter-view">
+ <structname>pg_stat_bgwriter</structname></link>.
+ <varname>buffers_backend</varname> in
+ <structname>pg_stat_bgwriter</structname> corresponds to
+ <structname>pg_stat_io</structname>'s <varname>written</varname> plus
+ <varname>extended</varname> for <varname>io_context</varname>s
+ <literal>normal</literal>, <literal>bulkread</literal>,
+ <literal>bulkwrite</literal>, and <literal>vacuum</literal> on
+ <varname>io_object</varname> <literal>relation</literal> for
+ <varname>backend_type</varname>s:
+ <itemizedlist>
+ <listitem><para><literal>client backend</literal></para></listitem>
+ <listitem><para><literal>autovacuum worker</literal></para></listitem>
+ <listitem><para><literal>background worker</literal></para></listitem>
+ <listitem><para><literal>walsender</literal></para></listitem>
+ </itemizedlist>
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>extended</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Extends of relations done by a given <varname>backend_type</varname> in
+ order to write data for a given <varname>io_object</varname> in a given
+ <varname>io_context</varname>.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>op_bytes</structfield> <type>bigint</type>
+ </para>
+ <para>
+ The number of bytes per unit of IO read, written, or extended. For
+ block-oriented IO of relation data, reads, writes, and extends are done
+ in <varname>block_size</varname> units, derived from the build-time
+ parameter <symbol>BLCKSZ</symbol>, which is <literal>8192</literal> by
+ default.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>evicted</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of times a <varname>backend_type</varname> has evicted a block
+ from a shared or local buffer in order to reuse the buffer in this
+ <varname>io_context</varname>. Blocks are only evicted when there are no
+ unoccupied buffers.
+ </para>
+ <para>
+ <varname>evicted</varname> in <varname>io_context</varname>
+ <literal>normal</literal> and <varname>io_object</varname>
+ <literal>relation</literal> counts the number of times a block from a
+ shared buffer was evicted so that it can be replaced with another block,
+ also in shared buffers.
+ </para>
+ <para>
+ A high <varname>evicted</varname> count in <varname>io_context</varname>
+ <literal>normal</literal> and <varname>io_object</varname>
+ <literal>relation</literal> could indicate that shared buffers is too
+ small and should be set to a larger value.
+ </para>
+ <para>
+ <varname>evicted</varname> in <varname>io_context</varname>
+ <literal>vacuum</literal>, <literal>bulkread</literal>, and
+ <literal>bulkwrite</literal> counts the number of times occupied shared
+ buffers were added to the size-limited strategy ring buffer, causing the
+ buffer contents to be evicted. If the to-be-used buffer in the ring is
+ pinned or in use by another backend, it may be replaced by a new shared
+ buffer. If this shared buffer contains valid data, that block must be
+ evicted and will count as <varname>evicted</varname>.
+ </para>
+ <para>
+ Seeing a large number of <varname>evicted</varname> in strategy
+ <varname>io_context</varname>s can provide insight into primary working
+ set cache misses.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>reused</structfield> <type>bigint</type>
+ </para>
+ <para>
+ The number of times an existing buffer in the strategy ring was reused
+ as part of an operation in the <literal>bulkread</literal>,
+ <literal>bulkwrite</literal>, or <literal>vacuum</literal>
+ <varname>io_context</varname>s. When a Buffer Access Strategy reuses a
+ buffer in the strategy ring, it evicts the buffer contents, incrementing
+ <varname>reused</varname>. When a Buffer Access Strategy adds a new
+ shared buffer to the strategy ring and this shared buffer is occupied,
+ the Buffer Access Strategy must evict the contents of the shared buffer,
+ incrementing <varname>evicted</varname>.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>files_synced</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of files <literal>fsync</literal>ed by a given
+ <varname>backend_type</varname> for the purpose of persisting data from
+ a given <varname>io_object</varname> dirtied in a given
+ <varname>io_context</varname>. <literal>fsync</literal>s are done at
+      segment boundaries, so <varname>op_bytes</varname> does not apply to the
+ <varname>files_synced</varname> column.
+
+ <literal>fsync</literal>s are always tracked in
+ <varname>io_context</varname> <literal>normal</literal>.
+ </para>
+ <para>
+ Normally client backends rely on the checkpointer to ensure data is
+ persisted to permanent storage. Large numbers of
+ <varname>files_synced</varname> by <varname>backend_type</varname>
+ <literal>client backend</literal> could indicate a misconfiguration of
+ shared buffers or of checkpointer. More information on checkpointer
+ configuration can be found in <xref linkend="wal-configuration"/>.
+ </para>
+ <para>
+ Note that the sum of <varname>files_synced</varname> for all
+ <varname>io_context</varname> <literal>normal</literal>
+ <varname>io_object</varname> <literal>relation</literal> for all
+ <varname>backend_type</varname>s except <literal>checkpointer</literal>
+ corresponds to <varname>buffers_backend_fsync</varname> in <link
+ linkend="monitoring-pg-stat-bgwriter-view">
+ <structname>pg_stat_bgwriter</structname></link>.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
+ </para>
+ <para>
+ Time at which these statistics were last reset.
+ </para>
+ </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ Some <varname>backend_type</varname>s do not perform IO operations in some
+ <varname>io_context</varname>s and/or <varname>io_object</varname>s. These
+ rows are omitted from the view. For example, the checkpointer does not use
+ a Buffer Access Strategy, so there will be no rows for
+ <varname>backend_type</varname> <literal>checkpointer</literal> in any of
+ the strategy <varname>io_context</varname>s.
+
+ On a more granular level, some IO operations are invalid in combination
+ with certain <varname>io_context</varname>s and
+ <varname>io_object</varname>s. Those cells will be NULL to distinguish
+ between 0 observed IO operations of that type and an invalid
+ combination. For example, temporary tables are not fsynced, so cells for
+ all <varname>backend_type</varname>s for <varname>io_object</varname>
+ <literal>temp relation</literal> in <varname>io_context</varname>
+ <literal>normal</literal> for <varname>files_synced</varname> will be
+ NULL. Some <varname>backend_type</varname>s never perform certain IO
+    operations. Those cells will also be NULL in the view. For example,
+    the <literal>background writer</literal> should not perform reads.
+ </para>
</sect2>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 447c9b970f..71646f5aef 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1117,6 +1117,21 @@ CREATE VIEW pg_stat_bgwriter AS
pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+CREATE VIEW pg_stat_io AS
+SELECT
+ b.backend_type,
+ b.io_context,
+ b.io_object,
+ b.read,
+ b.written,
+ b.extended,
+ b.op_bytes,
+ b.evicted,
+ b.reused,
+ b.files_synced,
+ b.stats_reset
+FROM pg_stat_get_io() b;
+
CREATE VIEW pg_stat_wal AS
SELECT
w.wal_records,
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 42b890b806..ad369cd7ec 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1234,6 +1234,148 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
}
+/*
+ * When adding a new column to the pg_stat_io view, add a new enum value
+ * here above IO_NUM_COLUMNS.
+ */
+typedef enum io_stat_col
+{
+ IO_COL_BACKEND_TYPE,
+ IO_COL_IO_CONTEXT,
+ IO_COL_IO_OBJECT,
+ IO_COL_READS,
+ IO_COL_WRITES,
+ IO_COL_EXTENDS,
+ IO_COL_CONVERSION,
+ IO_COL_EVICTIONS,
+ IO_COL_REUSES,
+ IO_COL_FSYNCS,
+ IO_COL_RESET_TIME,
+ IO_NUM_COLUMNS,
+} io_stat_col;
+
+/*
+ * When adding a new IOOp, add a new io_stat_col and add a case to this
+ * function returning the corresponding io_stat_col.
+ */
+static io_stat_col
+pgstat_get_io_op_index(IOOp io_op)
+{
+ switch (io_op)
+ {
+ case IOOP_EVICT:
+ return IO_COL_EVICTIONS;
+ case IOOP_READ:
+ return IO_COL_READS;
+ case IOOP_REUSE:
+ return IO_COL_REUSES;
+ case IOOP_WRITE:
+ return IO_COL_WRITES;
+ case IOOP_EXTEND:
+ return IO_COL_EXTENDS;
+ case IOOP_FSYNC:
+ return IO_COL_FSYNCS;
+ }
+
+ elog(ERROR, "unrecognized IOOp value: %d", io_op);
+ pg_unreachable();
+}
+
+Datum
+pg_stat_get_io(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsinfo;
+ PgStat_IO *backends_io_stats;
+ Datum reset_time;
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+
+ backends_io_stats = pgstat_fetch_stat_io();
+
+ reset_time = TimestampTzGetDatum(backends_io_stats->stat_reset_timestamp);
+
+ for (BackendType bktype = B_INVALID; bktype < BACKEND_NUM_TYPES; bktype++)
+ {
+ bool bktype_tracked;
+ Datum bktype_desc = CStringGetTextDatum(GetBackendTypeDesc(bktype));
+ PgStat_IOContextOps *io_context_ops = &backends_io_stats->stats[bktype];
+
+ /*
+ * For those BackendTypes without IO Operation stats, skip
+ * representing them in the view altogether. We still loop through
+ * their counters so that we can assert that all values are zero.
+ */
+ bktype_tracked = pgstat_tracks_io_bktype(bktype);
+
+ for (IOContext io_context = IOCONTEXT_BULKREAD;
+ io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ {
+ const char *context_name = pgstat_get_io_context_name(io_context);
+ const PgStat_IOObjectOps *io_objs = &io_context_ops->data[io_context];
+
+ for (IOObject io_obj = IOOBJECT_RELATION;
+ io_obj < IOOBJECT_NUM_TYPES; io_obj++)
+ {
+ const PgStat_IOOpCounters *counters = &io_objs->data[io_obj];
+ const char *obj_name = pgstat_get_io_object_name(io_obj);
+
+ Datum values[IO_NUM_COLUMNS] = {0};
+ bool nulls[IO_NUM_COLUMNS] = {0};
+
+ /*
+ * Some combinations of IOContext, IOObject, and BackendType
+ * are not valid for any type of IOOp. In such cases, omit the
+ * entire row from the view.
+ */
+ if (!bktype_tracked ||
+ !pgstat_tracks_io_object(bktype, io_context, io_obj))
+ {
+ Assert(pgstat_iszero_io_object(counters));
+ continue;
+ }
+
+ values[IO_COL_BACKEND_TYPE] = bktype_desc;
+ values[IO_COL_IO_CONTEXT] = CStringGetTextDatum(context_name);
+ values[IO_COL_IO_OBJECT] = CStringGetTextDatum(obj_name);
+ values[IO_COL_RESET_TIME] = TimestampTzGetDatum(reset_time);
+
+ /*
+ * Hard-code this to the value of BLCKSZ for now. Future
+ * values could include XLOG_BLCKSZ, once WAL IO is tracked,
+ * and constant multipliers, once non-block-oriented IO (e.g.
+ * temporary file IO) is tracked.
+ */
+ values[IO_COL_CONVERSION] = Int64GetDatum(BLCKSZ);
+
+ /*
+ * Some combinations of BackendType and IOOp, of IOContext and
+ * IOOp, and of IOObject and IOOp are not tracked. Set these
+ * cells in the view to NULL and assert that these stats are zero
+ * as expected.
+ */
+ for (IOOp io_op = IOOP_EVICT; io_op < IOOP_NUM_TYPES; io_op++)
+ {
+ int col_idx = pgstat_get_io_op_index(io_op);
+
+ nulls[col_idx] = !pgstat_tracks_io_op(bktype, io_context, io_obj, io_op);
+
+ if (!nulls[col_idx])
+ values[col_idx] =
+ Int64GetDatum(pgstat_get_io_op_value(counters, io_op));
+ else
+ Assert(pgstat_iszero_io_op(counters, io_op));
+ }
+
+ tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
+ values, nulls);
+ }
+ }
+ }
+
+ return (Datum) 0;
+}
+
/*
* Returns statistics of WAL activity
*/
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 7be9a50147..782f27523f 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5686,6 +5686,15 @@
proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
+{ oid => '8459', descr => 'statistics: per backend type IO statistics',
+ proname => 'pg_stat_get_io', provolatile => 'v',
+ prorows => '30', proretset => 't',
+ proparallel => 'r', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,text,text,int8,int8,int8,int8,int8,int8,int8,timestamptz}',
+ proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}',
+ proargnames => '{backend_type,io_context,io_object,read,written,extended,op_bytes,evicted,reused,files_synced,stats_reset}',
+ prosrc => 'pg_stat_get_io' },
+
{ oid => '1136', descr => 'statistics: information about WAL activity',
proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
proparallel => 'r', prorettype => 'record', proargtypes => '',
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index fb9f936d43..2d0e7dc5c5 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1876,6 +1876,18 @@ pg_stat_gssapi| SELECT s.pid,
s.gss_enc AS encrypted
FROM pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid, query_id)
WHERE (s.client_port IS NOT NULL);
+pg_stat_io| SELECT b.backend_type,
+ b.io_context,
+ b.io_object,
+ b.read,
+ b.written,
+ b.extended,
+ b.op_bytes,
+ b.evicted,
+ b.reused,
+ b.files_synced,
+ b.stats_reset
+ FROM pg_stat_get_io() b(backend_type, io_context, io_object, read, written, extended, op_bytes, evicted, reused, files_synced, stats_reset);
pg_stat_progress_analyze| SELECT s.pid,
s.datid,
d.datname,
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index 1d84407a03..01070a53a4 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -1126,4 +1126,229 @@ SELECT pg_stat_get_subscription_stats(NULL);
(1 row)
+-- Test that the following operations are tracked in pg_stat_io:
+-- - reads of target blocks into shared buffers
+-- - writes of shared buffers to permanent storage
+-- - extends of relations using shared buffers
+-- - fsyncs done to ensure the durability of data dirtying shared buffers
+-- There is no test for blocks evicted from shared buffers, because we cannot
+-- be sure of the state of shared buffers at the point the test is run.
+-- Create a regular table and insert some data to generate IOCONTEXT_NORMAL
+-- extends.
+SELECT sum(extended) AS io_sum_shared_extends_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+CREATE TABLE test_io_shared(a int);
+INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(extended) AS io_sum_shared_extends_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT :io_sum_shared_extends_after > :io_sum_shared_extends_before;
+ ?column?
+----------
+ t
+(1 row)
+
+-- After a checkpoint, there should be some additional IOCONTEXT_NORMAL writes
+-- and fsyncs.
+-- The second checkpoint ensures that stats from the first checkpoint have been
+-- reported and protects against any potential races amongst the table
+-- creation, a possible timing-triggered checkpoint, and the explicit
+-- checkpoint in the test.
+SELECT sum(written) AS io_sum_shared_writes_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT sum(files_synced) AS io_sum_shared_fsyncs_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+CHECKPOINT;
+CHECKPOINT;
+SELECT sum(written) AS io_sum_shared_writes_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT sum(files_synced) AS io_sum_shared_fsyncs_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT :io_sum_shared_writes_after > :io_sum_shared_writes_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT current_setting('fsync') = 'off' OR :io_sum_shared_fsyncs_after > :io_sum_shared_fsyncs_before;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Change the tablespace so that the table is rewritten directly, then SELECT
+-- from it to cause it to be read back into shared buffers.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE test_io_shared_stats_tblspc LOCATION '';
+SELECT sum(read) AS io_sum_shared_reads_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+ALTER TABLE test_io_shared SET TABLESPACE test_io_shared_stats_tblspc;
+-- SELECT from the table so that it is read into shared buffers and io_context
+-- 'normal', io_object 'relation' reads are counted.
+SELECT COUNT(*) FROM test_io_shared;
+ count
+-------
+ 100
+(1 row)
+
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(read) AS io_sum_shared_reads_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT :io_sum_shared_reads_after > :io_sum_shared_reads_before;
+ ?column?
+----------
+ t
+(1 row)
+
+DROP TABLE test_io_shared;
+DROP TABLESPACE test_io_shared_stats_tblspc;
+-- Test that the following IOCONTEXT_LOCAL IOOps are tracked in pg_stat_io:
+-- - eviction of local buffers in order to reuse them
+-- - reads of temporary table blocks into local buffers
+-- - writes of local buffers to permanent storage
+-- - extends of temporary tables
+-- Set temp_buffers to a low value so that we can trigger writes with fewer
+-- inserted tuples. Do so in a new session in case temporary tables have been
+-- accessed by previous tests in this session.
+\c
+SET temp_buffers TO '1MB';
+CREATE TEMPORARY TABLE test_io_local(a int, b TEXT);
+SELECT sum(extended) AS io_sum_local_extends_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(evicted) AS io_sum_local_evictions_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(written) AS io_sum_local_writes_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+-- Insert tuples into the temporary table, generating extends in the stats.
+-- Insert enough values that we need to reuse and write out dirty local
+-- buffers, generating evictions and writes.
+INSERT INTO test_io_local SELECT generate_series(1, 8000) as id, repeat('a', 100);
+SELECT sum(read) AS io_sum_local_reads_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+-- Read in evicted buffers, generating reads.
+SELECT COUNT(*) FROM test_io_local;
+ count
+-------
+ 8000
+(1 row)
+
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(evicted) AS io_sum_local_evictions_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(read) AS io_sum_local_reads_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(written) AS io_sum_local_writes_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(extended) AS io_sum_local_extends_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT :io_sum_local_evictions_after > :io_sum_local_evictions_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_local_reads_after > :io_sum_local_reads_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_local_writes_after > :io_sum_local_writes_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_local_extends_after > :io_sum_local_extends_before;
+ ?column?
+----------
+ t
+(1 row)
+
+RESET temp_buffers;
+-- Test that reuse of strategy buffers and reads of blocks into these reused
+-- buffers while VACUUMing are tracked in pg_stat_io.
+-- Set wal_skip_threshold smaller than the expected size of
+-- test_io_vac_strategy so that, even if wal_level is minimal, VACUUM FULL will
+-- fsync the newly rewritten test_io_vac_strategy instead of writing it to WAL.
+-- Writing it to WAL will result in the newly written relation pages being in
+-- shared buffers -- preventing us from testing BAS_VACUUM BufferAccessStrategy
+-- reads.
+SET wal_skip_threshold = '1 kB';
+SELECT sum(reused) AS io_sum_vac_strategy_reuses_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(read) AS io_sum_vac_strategy_reads_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+CREATE TABLE test_io_vac_strategy(a int, b int) WITH (autovacuum_enabled = 'false');
+INSERT INTO test_io_vac_strategy SELECT i, i from generate_series(1, 8000)i;
+-- Ensure that the next VACUUM will need to perform IO by rewriting the table
+-- first with VACUUM (FULL).
+VACUUM (FULL) test_io_vac_strategy;
+VACUUM (PARALLEL 0) test_io_vac_strategy;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(reused) AS io_sum_vac_strategy_reuses_after FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(read) AS io_sum_vac_strategy_reads_after FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT :io_sum_vac_strategy_reads_after > :io_sum_vac_strategy_reads_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_vac_strategy_reuses_after > :io_sum_vac_strategy_reuses_before;
+ ?column?
+----------
+ t
+(1 row)
+
+RESET wal_skip_threshold;
+-- Test that extends done by a CTAS, which uses a BAS_BULKWRITE
+-- BufferAccessStrategy, are tracked in pg_stat_io.
+SELECT sum(extended) AS io_sum_bulkwrite_strategy_extends_before FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+CREATE TABLE test_io_bulkwrite_strategy AS SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(extended) AS io_sum_bulkwrite_strategy_extends_after FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+SELECT :io_sum_bulkwrite_strategy_extends_after > :io_sum_bulkwrite_strategy_extends_before;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Test IO stats reset
+SELECT sum(evicted) + sum(reused) + sum(extended) + sum(files_synced) + sum(read) + sum(written) AS io_stats_pre_reset FROM pg_stat_io \gset
+SELECT pg_stat_reset_shared('io');
+ pg_stat_reset_shared
+----------------------
+
+(1 row)
+
+SELECT sum(evicted) + sum(reused) + sum(extended) + sum(files_synced) + sum(read) + sum(written) AS io_stats_post_reset FROM pg_stat_io \gset
+SELECT :io_stats_post_reset < :io_stats_pre_reset;
+ ?column?
+----------
+ t
+(1 row)
+
-- End of Stats Test
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index b4d6753c71..962ae5b281 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -536,4 +536,142 @@ SELECT pg_stat_get_replication_slot(NULL);
SELECT pg_stat_get_subscription_stats(NULL);
+-- Test that the following operations are tracked in pg_stat_io:
+-- - reads of target blocks into shared buffers
+-- - writes of shared buffers to permanent storage
+-- - extends of relations using shared buffers
+-- - fsyncs done to ensure the durability of data dirtying shared buffers
+
+-- There is no test for blocks evicted from shared buffers, because we cannot
+-- be sure of the state of shared buffers at the point the test is run.
+
+-- Create a regular table and insert some data to generate IOCONTEXT_NORMAL
+-- extends.
+SELECT sum(extended) AS io_sum_shared_extends_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+CREATE TABLE test_io_shared(a int);
+INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+SELECT sum(extended) AS io_sum_shared_extends_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT :io_sum_shared_extends_after > :io_sum_shared_extends_before;
+
+-- After a checkpoint, there should be some additional IOCONTEXT_NORMAL writes
+-- and fsyncs.
+-- The second checkpoint ensures that stats from the first checkpoint have been
+-- reported and protects against any potential races amongst the table
+-- creation, a possible timing-triggered checkpoint, and the explicit
+-- checkpoint in the test.
+SELECT sum(written) AS io_sum_shared_writes_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT sum(files_synced) AS io_sum_shared_fsyncs_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+CHECKPOINT;
+CHECKPOINT;
+SELECT sum(written) AS io_sum_shared_writes_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT sum(files_synced) AS io_sum_shared_fsyncs_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+
+SELECT :io_sum_shared_writes_after > :io_sum_shared_writes_before;
+SELECT current_setting('fsync') = 'off' OR :io_sum_shared_fsyncs_after > :io_sum_shared_fsyncs_before;
+
+-- Change the tablespace so that the table is rewritten directly, then SELECT
+-- from it to cause it to be read back into shared buffers.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE test_io_shared_stats_tblspc LOCATION '';
+SELECT sum(read) AS io_sum_shared_reads_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+ALTER TABLE test_io_shared SET TABLESPACE test_io_shared_stats_tblspc;
+-- SELECT from the table so that it is read into shared buffers and io_context
+-- 'normal', io_object 'relation' reads are counted.
+SELECT COUNT(*) FROM test_io_shared;
+SELECT pg_stat_force_next_flush();
+SELECT sum(read) AS io_sum_shared_reads_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT :io_sum_shared_reads_after > :io_sum_shared_reads_before;
+DROP TABLE test_io_shared;
+DROP TABLESPACE test_io_shared_stats_tblspc;
+
+-- Test that the following IOCONTEXT_LOCAL IOOps are tracked in pg_stat_io:
+-- - eviction of local buffers in order to reuse them
+-- - reads of temporary table blocks into local buffers
+-- - writes of local buffers to permanent storage
+-- - extends of temporary tables
+
+-- Set temp_buffers to a low value so that we can trigger writes with fewer
+-- inserted tuples. Do so in a new session in case temporary tables have been
+-- accessed by previous tests in this session.
+\c
+SET temp_buffers TO '1MB';
+CREATE TEMPORARY TABLE test_io_local(a int, b TEXT);
+SELECT sum(extended) AS io_sum_local_extends_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(evicted) AS io_sum_local_evictions_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(written) AS io_sum_local_writes_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+-- Insert tuples into the temporary table, generating extends in the stats.
+-- Insert enough values that we need to reuse and write out dirty local
+-- buffers, generating evictions and writes.
+INSERT INTO test_io_local SELECT generate_series(1, 8000) as id, repeat('a', 100);
+
+SELECT sum(read) AS io_sum_local_reads_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+-- Read in evicted buffers, generating reads.
+SELECT COUNT(*) FROM test_io_local;
+SELECT pg_stat_force_next_flush();
+SELECT sum(evicted) AS io_sum_local_evictions_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(read) AS io_sum_local_reads_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(written) AS io_sum_local_writes_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(extended) AS io_sum_local_extends_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT :io_sum_local_evictions_after > :io_sum_local_evictions_before;
+SELECT :io_sum_local_reads_after > :io_sum_local_reads_before;
+SELECT :io_sum_local_writes_after > :io_sum_local_writes_before;
+SELECT :io_sum_local_extends_after > :io_sum_local_extends_before;
+RESET temp_buffers;
+
+-- Test that reuse of strategy buffers and reads of blocks into these reused
+-- buffers while VACUUMing are tracked in pg_stat_io.
+
+-- Set wal_skip_threshold smaller than the expected size of
+-- test_io_vac_strategy so that, even if wal_level is minimal, VACUUM FULL will
+-- fsync the newly rewritten test_io_vac_strategy instead of writing it to WAL.
+-- Writing it to WAL will result in the newly written relation pages being in
+-- shared buffers -- preventing us from testing BAS_VACUUM BufferAccessStrategy
+-- reads.
+SET wal_skip_threshold = '1 kB';
+SELECT sum(reused) AS io_sum_vac_strategy_reuses_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(read) AS io_sum_vac_strategy_reads_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+CREATE TABLE test_io_vac_strategy(a int, b int) WITH (autovacuum_enabled = 'false');
+INSERT INTO test_io_vac_strategy SELECT i, i from generate_series(1, 8000)i;
+-- Ensure that the next VACUUM will need to perform IO by rewriting the table
+-- first with VACUUM (FULL).
+VACUUM (FULL) test_io_vac_strategy;
+VACUUM (PARALLEL 0) test_io_vac_strategy;
+SELECT pg_stat_force_next_flush();
+SELECT sum(reused) AS io_sum_vac_strategy_reuses_after FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(read) AS io_sum_vac_strategy_reads_after FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT :io_sum_vac_strategy_reads_after > :io_sum_vac_strategy_reads_before;
+SELECT :io_sum_vac_strategy_reuses_after > :io_sum_vac_strategy_reuses_before;
+RESET wal_skip_threshold;
+
+-- Test that extends done by a CTAS, which uses a BAS_BULKWRITE
+-- BufferAccessStrategy, are tracked in pg_stat_io.
+SELECT sum(extended) AS io_sum_bulkwrite_strategy_extends_before FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+CREATE TABLE test_io_bulkwrite_strategy AS SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+SELECT sum(extended) AS io_sum_bulkwrite_strategy_extends_after FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+SELECT :io_sum_bulkwrite_strategy_extends_after > :io_sum_bulkwrite_strategy_extends_before;
+
+-- Test IO stats reset
+SELECT sum(evicted) + sum(reused) + sum(extended) + sum(files_synced) + sum(read) + sum(written) AS io_stats_pre_reset FROM pg_stat_io \gset
+SELECT pg_stat_reset_shared('io');
+SELECT sum(evicted) + sum(reused) + sum(extended) + sum(files_synced) + sum(read) + sum(written) AS io_stats_post_reset FROM pg_stat_io \gset
+SELECT :io_stats_post_reset < :io_stats_pre_reset;
+
-- End of Stats Test
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 9336bf9796..ae871165cf 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3372,6 +3372,7 @@ intset_internal_node
intset_leaf_node
intset_node
intvKEY
+io_stat_col
itemIdCompact
itemIdCompactData
iterator
--
2.38.1
v42-0003-pgstat-Count-IO-for-relations.patch
From 5617306ae7fe1d4019eeb596497f765c907a4c2e Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 28 Dec 2022 12:49:35 -0800
Subject: [PATCH v42 3/4] pgstat: Count IO for relations
Count IOOps done on IOObjects in IOContexts by various BackendTypes
using the IO stats infrastructure introduced by a previous commit.
The primary concern of these statistics is IO operations on data blocks
during the course of normal database operations. IO operations done by,
for example, the archiver or syslogger are not counted in these
statistics. WAL IO, temporary file IO, and IO done directly through smgr*
functions (such as when building an index) are not yet counted but would
be useful future additions.
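As an illustration (not part of the patch), a monitoring query against the
new view might look like the sketch below. The column names and the
'vacuum'/'bulkwrite' context names match the view definition and the
regression tests in this series; the 'bulkread' context name and the idea
that strategy-context writes hint at an undersized shared_buffers are
assumptions for the example, not claims made by the patch:

```sql
-- Sketch: summarize IO done in strategy contexts, where writes by
-- backends can suggest buffers are being flushed by their dirtier
-- rather than by the checkpointer or bgwriter (heuristic only).
SELECT backend_type,
       io_context,
       io_object,
       read,
       written,
       extended,
       reused
FROM pg_stat_io
WHERE io_context IN ('bulkread', 'bulkwrite', 'vacuum')
  AND written > 0
ORDER BY written DESC;
```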
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Maciek Sakrejda <m.sakrejda@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
src/backend/storage/buffer/bufmgr.c | 102 ++++++++++++++++++++++----
src/backend/storage/buffer/freelist.c | 58 +++++++++++----
src/backend/storage/buffer/localbuf.c | 29 ++++++--
src/backend/storage/smgr/md.c | 25 +++++++
src/include/storage/buf_internals.h | 8 +-
src/include/storage/bufmgr.h | 7 +-
6 files changed, 189 insertions(+), 40 deletions(-)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 8075828e8a..3709d2e810 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -481,8 +481,9 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
ForkNumber forkNum,
BlockNumber blockNum,
BufferAccessStrategy strategy,
- bool *foundPtr);
-static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+ bool *foundPtr, IOContext *io_context);
+static void FlushBuffer(BufferDesc *buf, SMgrRelation reln,
+ IOContext io_context, IOObject io_object);
static void FindAndDropRelationBuffers(RelFileLocator rlocator,
ForkNumber forkNum,
BlockNumber nForkBlock,
@@ -823,6 +824,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
BufferDesc *bufHdr;
Block bufBlock;
bool found;
+ IOContext io_context;
+ IOObject io_object;
bool isExtend;
bool isLocalBuf = SmgrIsTemp(smgr);
@@ -855,7 +858,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (isLocalBuf)
{
- bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
+ bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found, &io_context);
if (found)
pgBufferUsage.local_blks_hit++;
else if (isExtend)
@@ -871,7 +874,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* not currently in memory.
*/
bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum,
- strategy, &found);
+ strategy, &found, &io_context);
if (found)
pgBufferUsage.shared_blks_hit++;
else if (isExtend)
@@ -986,7 +989,16 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
*/
Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID)); /* spinlock not needed */
- bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
+ if (isLocalBuf)
+ {
+ bufBlock = LocalBufHdrGetBlock(bufHdr);
+ io_object = IOOBJECT_TEMP_RELATION;
+ }
+ else
+ {
+ bufBlock = BufHdrGetBlock(bufHdr);
+ io_object = IOOBJECT_RELATION;
+ }
if (isExtend)
{
@@ -995,6 +1007,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
/* don't set checksum for all-zero page */
smgrextend(smgr, forkNum, blockNum, (char *) bufBlock, false);
+ pgstat_count_io_op(IOOP_EXTEND, io_object, io_context);
+
/*
* NB: we're *not* doing a ScheduleBufferTagForWriteback here;
* although we're essentially performing a write. At least on linux
@@ -1020,6 +1034,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
smgrread(smgr, forkNum, blockNum, (char *) bufBlock);
+ pgstat_count_io_op(IOOP_READ, io_object, io_context);
+
if (track_io_timing)
{
INSTR_TIME_SET_CURRENT(io_time);
@@ -1113,14 +1129,19 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* *foundPtr is actually redundant with the buffer's BM_VALID flag, but
* we keep it for simplicity in ReadBuffer.
*
+ * io_context is passed as an output paramter to avoid calling
+ * IOContextForStrategy() when there is a shared buffers hit and no IO
+ * statistics need be captured.
+ *
* No locks are held either at entry or exit.
*/
static BufferDesc *
BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
BlockNumber blockNum,
BufferAccessStrategy strategy,
- bool *foundPtr)
+ bool *foundPtr, IOContext *io_context)
{
+ bool from_ring;
BufferTag newTag; /* identity of requested block */
uint32 newHash; /* hash value for newTag */
LWLock *newPartitionLock; /* buffer partition lock for it */
@@ -1172,8 +1193,11 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
{
/*
* If we get here, previous attempts to read the buffer must
- * have failed ... but we shall bravely try again.
+ * have failed ... but we shall bravely try again. Set
+ * io_context since we will in fact need to count an IO
+ * Operation.
*/
+ *io_context = IOContextForStrategy(strategy);
*foundPtr = false;
}
}
@@ -1187,6 +1211,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
*/
LWLockRelease(newPartitionLock);
+ *io_context = IOContextForStrategy(strategy);
+
/* Loop here in case we have to try another victim buffer */
for (;;)
{
@@ -1200,7 +1226,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* Select a victim buffer. The buffer is returned with its header
* spinlock still held!
*/
- buf = StrategyGetBuffer(strategy, &buf_state);
+ buf = StrategyGetBuffer(strategy, &buf_state, &from_ring);
Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
@@ -1254,7 +1280,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
UnlockBufHdr(buf, buf_state);
if (XLogNeedsFlush(lsn) &&
- StrategyRejectBuffer(strategy, buf))
+ StrategyRejectBuffer(strategy, buf, from_ring))
{
/* Drop lock/pin and loop around for another buffer */
LWLockRelease(BufferDescriptorGetContentLock(buf));
@@ -1269,7 +1295,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
smgr->smgr_rlocator.locator.dbOid,
smgr->smgr_rlocator.locator.relNumber);
- FlushBuffer(buf, NULL);
+ FlushBuffer(buf, NULL, *io_context, IOOBJECT_RELATION);
LWLockRelease(BufferDescriptorGetContentLock(buf));
ScheduleBufferTagForWriteback(&BackendWritebackContext,
@@ -1441,6 +1467,28 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
UnlockBufHdr(buf, buf_state);
+ if (oldFlags & BM_VALID)
+ {
+ /*
+ * When a BufferAccessStrategy is in use, blocks evicted from shared
+ * buffers are counted as IOOP_EVICT in the corresponding context
+ * (e.g. IOCONTEXT_BULKWRITE). Shared buffers are evicted by a
+ * strategy in two cases: 1) while initially claiming buffers for the
+ * strategy ring 2) to replace an existing strategy ring buffer
+ * because it is pinned or in use and cannot be reused.
+ *
+ * Blocks evicted from buffers already in the strategy ring are
+ * counted as IOOP_REUSE in the corresponding strategy context.
+ *
+ * At this point, we can accurately count evictions and reuses,
+ * because we have successfully claimed the valid buffer. Previously,
+ * we may have been forced to release the buffer due to concurrent
+ * pinners or erroring out.
+ */
+ pgstat_count_io_op(from_ring ? IOOP_REUSE : IOOP_EVICT,
+ IOOBJECT_RELATION, *io_context);
+ }
+
if (oldPartitionLock != NULL)
{
BufTableDelete(&oldTag, oldHash);
@@ -2570,7 +2618,7 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOCONTEXT_NORMAL, IOOBJECT_RELATION);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
@@ -2820,7 +2868,7 @@ BufferGetTag(Buffer buffer, RelFileLocator *rlocator, ForkNumber *forknum,
* as the second parameter. If not, pass NULL.
*/
static void
-FlushBuffer(BufferDesc *buf, SMgrRelation reln)
+FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOContext io_context, IOObject io_object)
{
XLogRecPtr recptr;
ErrorContextCallback errcallback;
@@ -2912,6 +2960,26 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
bufToWrite,
false);
+ /*
+ * When a strategy is in use, only flushes of dirty buffers already in the
+ * strategy ring are counted as strategy writes (IOCONTEXT
+ * [BULKREAD|BULKWRITE|VACUUM] IOOP_WRITE) for the purpose of IO
+ * statistics tracking.
+ *
+ * If a shared buffer initially added to the ring must be flushed before
+ * being used, this is counted as an IOCONTEXT_NORMAL IOOP_WRITE.
+ *
+ * If a shared buffer which was added to the ring later because the
+ * current strategy buffer is pinned or in use or because all strategy
+ * buffers were dirty and rejected (for BAS_BULKREAD operations only)
+ * requires flushing, this is counted as an IOCONTEXT_NORMAL IOOP_WRITE
+ * (from_ring will be false).
+ *
+ * When a strategy is not in use, the write can only be a "regular" write
+ * of a dirty shared buffer (IOCONTEXT_NORMAL IOOP_WRITE).
+ */
+ pgstat_count_io_op(IOOP_WRITE, IOOBJECT_RELATION, io_context);
+
if (track_io_timing)
{
INSTR_TIME_SET_CURRENT(io_time);
@@ -3554,6 +3622,8 @@ FlushRelationBuffers(Relation rel)
buf_state &= ~(BM_DIRTY | BM_JUST_DIRTIED);
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+ pgstat_count_io_op(IOOP_WRITE, IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL);
+
/* Pop the error context stack */
error_context_stack = errcallback.previous;
}
@@ -3586,7 +3656,7 @@ FlushRelationBuffers(Relation rel)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, RelationGetSmgr(rel));
+ FlushBuffer(bufHdr, RelationGetSmgr(rel), IOCONTEXT_NORMAL, IOOBJECT_RELATION);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr);
}
@@ -3684,7 +3754,7 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, srelent->srel);
+ FlushBuffer(bufHdr, srelent->srel, IOCONTEXT_NORMAL, IOOBJECT_RELATION);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr);
}
@@ -3894,7 +3964,7 @@ FlushDatabaseBuffers(Oid dbid)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOCONTEXT_NORMAL, IOOBJECT_RELATION);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr);
}
@@ -3921,7 +3991,7 @@ FlushOneBuffer(Buffer buffer)
Assert(LWLockHeldByMe(BufferDescriptorGetContentLock(bufHdr)));
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOCONTEXT_NORMAL, IOOBJECT_RELATION);
}
/*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 7dec35801c..c690d5f15f 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -15,6 +15,7 @@
*/
#include "postgres.h"
+#include "pgstat.h"
#include "port/atomics.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
@@ -81,12 +82,6 @@ typedef struct BufferAccessStrategyData
*/
int current;
- /*
- * True if the buffer just returned by StrategyGetBuffer had been in the
- * ring already.
- */
- bool current_was_in_ring;
-
/*
* Array of buffer numbers. InvalidBuffer (that is, zero) indicates we
* have not yet selected a buffer for this ring slot. For allocation
@@ -198,13 +193,15 @@ have_free_buffer(void)
* return the buffer with the buffer header spinlock still held.
*/
BufferDesc *
-StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
+StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_ring)
{
BufferDesc *buf;
int bgwprocno;
int trycounter;
uint32 local_buf_state; /* to avoid repeated (de-)referencing */
+ *from_ring = false;
+
/*
* If given a strategy object, see whether it can select a buffer. We
* assume strategy objects don't need buffer_strategy_lock.
@@ -213,7 +210,10 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
{
buf = GetBufferFromRing(strategy, buf_state);
if (buf != NULL)
+ {
+ *from_ring = true;
return buf;
+ }
}
/*
@@ -602,7 +602,7 @@ FreeAccessStrategy(BufferAccessStrategy strategy)
/*
* GetBufferFromRing -- returns a buffer from the ring, or NULL if the
- * ring is empty.
+ * ring is empty / not usable.
*
* The bufhdr spin lock is held on the returned buffer.
*/
@@ -625,10 +625,7 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
*/
bufnum = strategy->buffers[strategy->current];
if (bufnum == InvalidBuffer)
- {
- strategy->current_was_in_ring = false;
return NULL;
- }
/*
* If the buffer is pinned we cannot use it under any circumstances.
@@ -644,7 +641,6 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
if (BUF_STATE_GET_REFCOUNT(local_buf_state) == 0
&& BUF_STATE_GET_USAGECOUNT(local_buf_state) <= 1)
{
- strategy->current_was_in_ring = true;
*buf_state = local_buf_state;
return buf;
}
@@ -654,7 +650,6 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
* Tell caller to allocate a new buffer with the normal allocation
* strategy. He'll then replace this ring element via AddBufferToRing.
*/
- strategy->current_was_in_ring = false;
return NULL;
}
@@ -670,6 +665,39 @@ AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
strategy->buffers[strategy->current] = BufferDescriptorGetBuffer(buf);
}
+/*
+ * Utility function returning the IOContext of a given BufferAccessStrategy's
+ * strategy ring.
+ */
+IOContext
+IOContextForStrategy(BufferAccessStrategy strategy)
+{
+ if (!strategy)
+ return IOCONTEXT_NORMAL;
+
+ switch (strategy->btype)
+ {
+ case BAS_NORMAL:
+
+ /*
+ * Currently, GetAccessStrategy() returns NULL for
+ * BufferAccessStrategyType BAS_NORMAL, so this case is
+ * unreachable.
+ */
+ pg_unreachable();
+ return IOCONTEXT_NORMAL;
+ case BAS_BULKREAD:
+ return IOCONTEXT_BULKREAD;
+ case BAS_BULKWRITE:
+ return IOCONTEXT_BULKWRITE;
+ case BAS_VACUUM:
+ return IOCONTEXT_VACUUM;
+ }
+
+ elog(ERROR, "unrecognized BufferAccessStrategyType: %d", strategy->btype);
+ pg_unreachable();
+}
+
/*
* StrategyRejectBuffer -- consider rejecting a dirty buffer
*
@@ -682,14 +710,14 @@ AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
* if this buffer should be written and re-used.
*/
bool
-StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
+StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool from_ring)
{
/* We only do this in bulkread mode */
if (strategy->btype != BAS_BULKREAD)
return false;
/* Don't muck with behavior of normal buffer-replacement strategy */
- if (!strategy->current_was_in_ring ||
+ if (!from_ring ||
strategy->buffers[strategy->current] != BufferDescriptorGetBuffer(buf))
return false;
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 8372acc383..f5e2138701 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -18,6 +18,7 @@
#include "access/parallel.h"
#include "catalog/catalog.h"
#include "executor/instrument.h"
+#include "pgstat.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "utils/guc_hooks.h"
@@ -100,14 +101,22 @@ PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
* LocalBufferAlloc -
* Find or create a local buffer for the given page of the given relation.
*
- * API is similar to bufmgr.c's BufferAlloc, except that we do not need
- * to do any locking since this is all local. Also, IO_IN_PROGRESS
- * does not get set. Lastly, we support only default access strategy
- * (hence, usage_count is always advanced).
+ * API is similar to bufmgr.c's BufferAlloc(). Note that, unlike BufferAlloc(),
+ * no locking is required and IO_IN_PROGRESS does not get set.
+ *
+ * Only the default access strategy is supported with local buffers, so no
+ * BufferAccessStrategy is passed to LocalBufferAlloc(). The selected buffer's
+ * usage_count is, therefore, unconditionally advanced. Also, the passed-in
+ * io_context is always set to IOCONTEXT_NORMAL. This indicates to the caller
+ * not to use the BufferAccessStrategy to set the io_context itself.
+ *
+ * This is important in cases like CREATE TEMPORARY TABLE AS ..., in which a
+ * BufferAccessStrategy object may have been created for the CTAS operation but
+ * it will not be used because it will operate on local buffers.
*/
BufferDesc *
LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
- bool *foundPtr)
+ bool *foundPtr, IOContext *io_context)
{
BufferTag newTag; /* identity of requested block */
LocalBufferLookupEnt *hresult;
@@ -127,6 +136,14 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
hresult = (LocalBufferLookupEnt *)
hash_search(LocalBufHash, (void *) &newTag, HASH_FIND, NULL);
+ /*
+ * IO Operations on local buffers are only done in IOCONTEXT_NORMAL. Set
+ * io_context here for convenience since there is no function call
+ * overhead to avoid in the case of a local buffer hit (like that of
+ * IOContextForStrategy()).
+ */
+ *io_context = IOCONTEXT_NORMAL;
+
if (hresult)
{
b = hresult->id;
@@ -230,6 +247,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
buf_state &= ~BM_DIRTY;
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+ pgstat_count_io_op(IOOP_WRITE, IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL);
pgBufferUsage.local_blks_written++;
}
@@ -256,6 +274,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
ClearBufferTag(&bufHdr->tag);
buf_state &= ~(BM_VALID | BM_TAG_VALID);
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+ pgstat_count_io_op(IOOP_EVICT, IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL);
}
hresult = (LocalBufferLookupEnt *)
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 60c9905eff..2115d7184a 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -983,6 +983,15 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
{
MdfdVec *v = &reln->md_seg_fds[forknum][segno - 1];
+ /*
+ * fsyncs done through mdimmedsync() should be tracked in a separate
+ * IOContext from those done through mdsyncfiletag() to differentiate
+ * between unavoidable client backend fsyncs (e.g. those done during
+ * index build) and those which ideally would have been done by the
+ * checkpointer or bgwriter. Since other IO operations bypassing the
+ * buffer manager could also be tracked in such an IOContext, wait
+ * until these are also tracked to track immediate fsyncs.
+ */
if (FileSync(v->mdfd_vfd, WAIT_EVENT_DATA_FILE_IMMEDIATE_SYNC) < 0)
ereport(data_sync_elevel(ERROR),
(errcode_for_file_access(),
@@ -1021,6 +1030,19 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false /* retryOnError */ ))
{
+ /*
+ * We have no way of knowing if the current IOContext is
+ * IOCONTEXT_NORMAL or IOCONTEXT_[BULKREAD, BULKWRITE, VACUUM] at this
+ * point, so count the fsync as being in the IOCONTEXT_NORMAL
+ * IOContext. This is probably okay, because the number of backend
+ * fsyncs doesn't say anything about the efficacy of the
+ * BufferAccessStrategy. And counting both fsyncs done in
+ * IOCONTEXT_NORMAL and IOCONTEXT_[BULKREAD, BULKWRITE, VACUUM] under
+ * IOCONTEXT_NORMAL is likely clearer when investigating the number of
+ * backend fsyncs.
+ */
+ pgstat_count_io_op(IOOP_FSYNC, IOOBJECT_RELATION, IOCONTEXT_NORMAL);
+
ereport(DEBUG1,
(errmsg_internal("could not forward fsync request because request queue is full")));
@@ -1410,6 +1432,9 @@ mdsyncfiletag(const FileTag *ftag, char *path)
if (need_to_close)
FileClose(file);
+ if (result >= 0)
+ pgstat_count_io_op(IOOP_FSYNC, IOOBJECT_RELATION, IOCONTEXT_NORMAL);
+
errno = save_errno;
return result;
}
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index ed8aa2519c..0b44814740 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -15,6 +15,7 @@
#ifndef BUFMGR_INTERNALS_H
#define BUFMGR_INTERNALS_H
+#include "pgstat.h"
#include "port/atomics.h"
#include "storage/buf.h"
#include "storage/bufmgr.h"
@@ -391,11 +392,12 @@ extern void IssuePendingWritebacks(WritebackContext *context);
extern void ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *tag);
/* freelist.c */
+extern IOContext IOContextForStrategy(BufferAccessStrategy bas);
extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
- uint32 *buf_state);
+ uint32 *buf_state, bool *from_ring);
extern void StrategyFreeBuffer(BufferDesc *buf);
extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
- BufferDesc *buf);
+ BufferDesc *buf, bool from_ring);
extern int StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
extern void StrategyNotifyBgWriter(int bgwprocno);
@@ -417,7 +419,7 @@ extern PrefetchBufferResult PrefetchLocalBuffer(SMgrRelation smgr,
ForkNumber forkNum,
BlockNumber blockNum);
extern BufferDesc *LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum,
- BlockNumber blockNum, bool *foundPtr);
+ BlockNumber blockNum, bool *foundPtr, IOContext *io_context);
extern void MarkLocalBufferDirty(Buffer buffer);
extern void DropRelationLocalBuffers(RelFileLocator rlocator,
ForkNumber forkNum,
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 33eadbc129..b8a18b8081 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -23,7 +23,12 @@
typedef void *Block;
-/* Possible arguments for GetAccessStrategy() */
+/*
+ * Possible arguments for GetAccessStrategy().
+ *
+ * If adding a new BufferAccessStrategyType, also add a new IOContext so
+ * IO statistics using this strategy are tracked.
+ */
typedef enum BufferAccessStrategyType
{
BAS_NORMAL, /* Normal random access */
--
2.38.1
[Attachment: v42-0001-pgindent-and-some-manual-cleanup-in-pgstat-relat.patch (application/octet-stream)]
From 4427f5e31a17f73da8083c4c2c8be6fe9a1a8607 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 9 Dec 2022 18:23:19 -0800
Subject: [PATCH v42 1/4] pgindent and some manual cleanup in pgstat related
code
---
src/backend/storage/buffer/bufmgr.c | 22 ++++++++++----------
src/backend/storage/buffer/localbuf.c | 4 ++--
src/backend/utils/activity/pgstat.c | 3 ++-
src/backend/utils/activity/pgstat_relation.c | 1 +
src/backend/utils/adt/pgstatfuncs.c | 2 +-
src/include/pgstat.h | 1 +
src/include/utils/pgstat_internal.h | 1 +
7 files changed, 19 insertions(+), 15 deletions(-)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 3fb38a25cf..8075828e8a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -516,7 +516,7 @@ PrefetchSharedBuffer(SMgrRelation smgr_reln,
/* create a tag so we can lookup the buffer */
InitBufferTag(&newTag, &smgr_reln->smgr_rlocator.locator,
- forkNum, blockNum);
+ forkNum, blockNum);
/* determine its hash code and partition lock ID */
newHash = BufTableHashCode(&newTag);
@@ -3297,8 +3297,8 @@ DropRelationsAllBuffers(SMgrRelation *smgr_reln, int nlocators)
uint32 buf_state;
/*
- * As in DropRelationBuffers, an unlocked precheck should be
- * safe and saves some cycles.
+ * As in DropRelationBuffers, an unlocked precheck should be safe and
+ * saves some cycles.
*/
if (!use_bsearch)
@@ -3425,8 +3425,8 @@ DropDatabaseBuffers(Oid dbid)
uint32 buf_state;
/*
- * As in DropRelationBuffers, an unlocked precheck should be
- * safe and saves some cycles.
+ * As in DropRelationBuffers, an unlocked precheck should be safe and
+ * saves some cycles.
*/
if (bufHdr->tag.dbOid != dbid)
continue;
@@ -3572,8 +3572,8 @@ FlushRelationBuffers(Relation rel)
bufHdr = GetBufferDescriptor(i);
/*
- * As in DropRelationBuffers, an unlocked precheck should be
- * safe and saves some cycles.
+ * As in DropRelationBuffers, an unlocked precheck should be safe and
+ * saves some cycles.
*/
if (!BufTagMatchesRelFileLocator(&bufHdr->tag, &rel->rd_locator))
continue;
@@ -3645,8 +3645,8 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
uint32 buf_state;
/*
- * As in DropRelationBuffers, an unlocked precheck should be
- * safe and saves some cycles.
+ * As in DropRelationBuffers, an unlocked precheck should be safe and
+ * saves some cycles.
*/
if (!use_bsearch)
@@ -3880,8 +3880,8 @@ FlushDatabaseBuffers(Oid dbid)
bufHdr = GetBufferDescriptor(i);
/*
- * As in DropRelationBuffers, an unlocked precheck should be
- * safe and saves some cycles.
+ * As in DropRelationBuffers, an unlocked precheck should be safe and
+ * saves some cycles.
*/
if (bufHdr->tag.dbOid != dbid)
continue;
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index b2720df6ea..8372acc383 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -610,8 +610,8 @@ AtProcExit_LocalBuffers(void)
{
/*
* We shouldn't be holding any remaining pins; if we are, and assertions
- * aren't enabled, we'll fail later in DropRelationBuffers while
- * trying to drop the temp rels.
+ * aren't enabled, we'll fail later in DropRelationBuffers while trying to
+ * drop the temp rels.
*/
CheckForLocalBufferLeaks();
}
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 7e9dc17e68..0fa5370bcd 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -426,7 +426,7 @@ pgstat_discard_stats(void)
ereport(DEBUG2,
(errcode_for_file_access(),
errmsg_internal("unlinked permanent statistics file \"%s\"",
- PGSTAT_STAT_PERMANENT_FILENAME)));
+ PGSTAT_STAT_PERMANENT_FILENAME)));
}
/*
@@ -986,6 +986,7 @@ pgstat_build_snapshot(void)
entry->data = MemoryContextAlloc(pgStatLocal.snapshot.context,
kind_info->shared_size);
+
/*
* Acquire the LWLock directly instead of using
* pg_stat_lock_entry_shared() which requires a reference.
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index 1730425de1..2e20b93c20 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -783,6 +783,7 @@ pgstat_relation_flush_cb(PgStat_EntryRef *entry_ref, bool nowait)
if (lstats->t_counts.t_numscans)
{
TimestampTz t = GetCurrentTransactionStopTimestamp();
+
if (t > tabentry->lastscan)
tabentry->lastscan = t;
}
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 6cddd74aa7..58bd1360b9 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -906,7 +906,7 @@ pg_stat_get_backend_client_addr(PG_FUNCTION_ARGS)
clean_ipv6_addr(beentry->st_clientaddr.addr.ss_family, remote_host);
PG_RETURN_DATUM(DirectFunctionCall1(inet_in,
- CStringGetDatum(remote_host)));
+ CStringGetDatum(remote_host)));
}
Datum
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index d3e965d744..5e3326a3b9 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -476,6 +476,7 @@ extern void pgstat_report_connect(Oid dboid);
extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dboid);
+
/*
* Functions in pgstat_function.c
*/
diff --git a/src/include/utils/pgstat_internal.h b/src/include/utils/pgstat_internal.h
index 08412d6404..12fd51f1ae 100644
--- a/src/include/utils/pgstat_internal.h
+++ b/src/include/utils/pgstat_internal.h
@@ -626,6 +626,7 @@ extern void pgstat_wal_snapshot_cb(void);
extern bool pgstat_subscription_flush_cb(PgStat_EntryRef *entry_ref, bool nowait);
extern void pgstat_subscription_reset_timestamp_cb(PgStatShared_Common *header, TimestampTz ts);
+
/*
* Functions in pgstat_xact.c
*/
--
2.38.1
[Attachment: v42-0002-pgstat-Infrastructure-to-track-IO-operations.patch (application/octet-stream)]
From 9fd9fafe1245bad0802772810f3450e651f7de63 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 5 Dec 2022 19:25:44 -0500
Subject: [PATCH v42 2/4] pgstat: Infrastructure to track IO operations
Introduce "IOOp", an IO operation done by a backend, "IOObject", the
target object of the IO, and "IOContext", the context or location of the
IO operations on that object. For example, the checkpointer may write a
shared buffer out. This would be considered an IOOp "written" on an
IOObject IOOBJECT_RELATION in IOContext IOCONTEXT_NORMAL by BackendType
"checkpointer".
Each IOOp (evict, extend, fsync, read, reuse, and write) can be counted
per IOObject (relation, temp relation) per IOContext (normal, bulkread,
bulkwrite, or vacuum) through a call to pgstat_count_io_op().
Note that this commit introduces the infrastructure to count IO
Operation statistics. A subsequent commit will add calls to
pgstat_count_io_op() in the appropriate locations.
IOContext IOCONTEXT_NORMAL concerns operations on local and shared
buffers, while IOCONTEXT_BULKREAD, IOCONTEXT_BULKWRITE, and
IOCONTEXT_VACUUM IOContexts concern IO operations on buffers as part of
a BufferAccessStrategy.
IOObject IOOBJECT_TEMP_RELATION concerns IO Operations on buffers
containing temporary table data, while IOObject IOOBJECT_RELATION
concerns IO Operations on buffers containing permanent relation data.
Stats on IOOps on all IOObjects in all IOContexts for a given backend
are first counted in a backend's local memory and then flushed to shared
memory and accumulated with those from all other backends, exited and
live.
Some BackendTypes will not flush their pending statistics at regular
intervals and explicitly call pgstat_flush_io_ops() during the course of
normal operations to flush their backend-local IO operation statistics
to shared memory in a timely manner.
Because not all BackendType, IOOp, IOObject, IOContext combinations are
valid, the validity of the stats is checked before flushing pending
stats and before reading in the existing stats file to shared memory.
The aggregated stats in shared memory could be extended in the future
with per-backend stats -- useful for per connection IO statistics and
monitoring.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Maciek Sakrejda <m.sakrejda@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
doc/src/sgml/monitoring.sgml | 2 +
src/backend/utils/activity/Makefile | 1 +
src/backend/utils/activity/meson.build | 1 +
src/backend/utils/activity/pgstat.c | 38 ++
src/backend/utils/activity/pgstat_bgwriter.c | 7 +-
.../utils/activity/pgstat_checkpointer.c | 7 +-
src/backend/utils/activity/pgstat_io.c | 446 ++++++++++++++++++
src/backend/utils/activity/pgstat_relation.c | 15 +-
src/backend/utils/activity/pgstat_shmem.c | 4 +
src/backend/utils/activity/pgstat_wal.c | 4 +-
src/backend/utils/adt/pgstatfuncs.c | 4 +-
src/include/miscadmin.h | 2 +
src/include/pgstat.h | 126 +++++
src/include/utils/pgstat_internal.h | 34 ++
src/tools/pgindent/typedefs.list | 8 +
15 files changed, 693 insertions(+), 6 deletions(-)
create mode 100644 src/backend/utils/activity/pgstat_io.c
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 5bcba0fdec..710bd2c52e 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -5403,6 +5403,8 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
the <structname>pg_stat_bgwriter</structname>
view, <literal>archiver</literal> to reset all the counters shown in
the <structname>pg_stat_archiver</structname> view,
+ <literal>io</literal> to reset all the counters shown in the
+ <structname>pg_stat_io</structname> view,
<literal>wal</literal> to reset all the counters shown in the
<structname>pg_stat_wal</structname> view or
<literal>recovery_prefetch</literal> to reset all the counters shown
diff --git a/src/backend/utils/activity/Makefile b/src/backend/utils/activity/Makefile
index a80eda3cf4..7d7482dde0 100644
--- a/src/backend/utils/activity/Makefile
+++ b/src/backend/utils/activity/Makefile
@@ -22,6 +22,7 @@ OBJS = \
pgstat_checkpointer.o \
pgstat_database.o \
pgstat_function.o \
+ pgstat_io.o \
pgstat_relation.o \
pgstat_replslot.o \
pgstat_shmem.o \
diff --git a/src/backend/utils/activity/meson.build b/src/backend/utils/activity/meson.build
index a2b872c24b..518ee3f798 100644
--- a/src/backend/utils/activity/meson.build
+++ b/src/backend/utils/activity/meson.build
@@ -9,6 +9,7 @@ backend_sources += files(
'pgstat_checkpointer.c',
'pgstat_database.c',
'pgstat_function.c',
+ 'pgstat_io.c',
'pgstat_relation.c',
'pgstat_replslot.c',
'pgstat_shmem.c',
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 0fa5370bcd..8451be0617 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -72,6 +72,7 @@
* - pgstat_checkpointer.c
* - pgstat_database.c
* - pgstat_function.c
+ * - pgstat_io.c
* - pgstat_relation.c
* - pgstat_replslot.c
* - pgstat_slru.c
@@ -359,6 +360,15 @@ static const PgStat_KindInfo pgstat_kind_infos[PGSTAT_NUM_KINDS] = {
.snapshot_cb = pgstat_checkpointer_snapshot_cb,
},
+ [PGSTAT_KIND_IO] = {
+ .name = "io_ops",
+
+ .fixed_amount = true,
+
+ .reset_all_cb = pgstat_io_reset_all_cb,
+ .snapshot_cb = pgstat_io_snapshot_cb,
+ },
+
[PGSTAT_KIND_SLRU] = {
.name = "slru",
@@ -582,6 +592,7 @@ pgstat_report_stat(bool force)
/* Don't expend a clock check if nothing to do */
if (dlist_is_empty(&pgStatPending) &&
+ !have_iostats &&
!have_slrustats &&
!pgstat_have_pending_wal())
{
@@ -628,6 +639,9 @@ pgstat_report_stat(bool force)
/* flush database / relation / function / ... stats */
partial_flush |= pgstat_flush_pending_entries(nowait);
+ /* flush IO stats */
+ partial_flush |= pgstat_flush_io(nowait);
+
/* flush wal stats */
partial_flush |= pgstat_flush_wal(nowait);
@@ -1322,6 +1336,15 @@ pgstat_write_statsfile(void)
pgstat_build_snapshot_fixed(PGSTAT_KIND_CHECKPOINTER);
write_chunk_s(fpout, &pgStatLocal.snapshot.checkpointer);
+ /*
+ * Write IO stats struct
+ */
+ pgstat_build_snapshot_fixed(PGSTAT_KIND_IO);
+ write_chunk_s(fpout, &pgStatLocal.snapshot.io.stat_reset_timestamp);
+ for (BackendType bktype = B_INVALID + 1; bktype < BACKEND_NUM_TYPES;
+ bktype++)
+ write_chunk_s(fpout, &pgStatLocal.snapshot.io.stats[bktype]);
+
/*
* Write SLRU stats struct
*/
@@ -1496,6 +1519,21 @@ pgstat_read_statsfile(void)
if (!read_chunk_s(fpin, &shmem->checkpointer.stats))
goto error;
+ /*
+ * Read IO stats struct
+ */
+ if (!read_chunk_s(fpin, &shmem->io.stat_reset_timestamp))
+ goto error;
+
+ for (BackendType bktype = B_INVALID + 1; bktype < BACKEND_NUM_TYPES;
+ bktype++)
+ {
+ Assert(pgstat_bktype_io_stats_valid(&shmem->io.stats[bktype],
+ bktype));
+ if (!read_chunk_s(fpin, &shmem->io.stats[bktype].data))
+ goto error;
+ }
+
/*
* Read SLRU stats struct
*/
diff --git a/src/backend/utils/activity/pgstat_bgwriter.c b/src/backend/utils/activity/pgstat_bgwriter.c
index 9247f2dda2..92be384b0d 100644
--- a/src/backend/utils/activity/pgstat_bgwriter.c
+++ b/src/backend/utils/activity/pgstat_bgwriter.c
@@ -24,7 +24,7 @@ PgStat_BgWriterStats PendingBgWriterStats = {0};
/*
- * Report bgwriter statistics
+ * Report bgwriter and IO statistics
*/
void
pgstat_report_bgwriter(void)
@@ -56,6 +56,11 @@ pgstat_report_bgwriter(void)
* Clear out the statistics buffer, so it can be re-used.
*/
MemSet(&PendingBgWriterStats, 0, sizeof(PendingBgWriterStats));
+
+ /*
+ * Report IO statistics
+ */
+ pgstat_flush_io(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_checkpointer.c b/src/backend/utils/activity/pgstat_checkpointer.c
index 3e9ab45103..26dec112f6 100644
--- a/src/backend/utils/activity/pgstat_checkpointer.c
+++ b/src/backend/utils/activity/pgstat_checkpointer.c
@@ -24,7 +24,7 @@ PgStat_CheckpointerStats PendingCheckpointerStats = {0};
/*
- * Report checkpointer statistics
+ * Report checkpointer and IO statistics
*/
void
pgstat_report_checkpointer(void)
@@ -62,6 +62,11 @@ pgstat_report_checkpointer(void)
* Clear out the statistics buffer, so it can be re-used.
*/
MemSet(&PendingCheckpointerStats, 0, sizeof(PendingCheckpointerStats));
+
+ /*
+ * Report IO statistics
+ */
+ pgstat_flush_io(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
new file mode 100644
index 0000000000..981372c24c
--- /dev/null
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -0,0 +1,446 @@
+/* -------------------------------------------------------------------------
+ *
+ * pgstat_io.c
+ * Implementation of IO statistics.
+ *
+ * This file contains the implementation of IO statistics. It is kept separate
+ * from pgstat.c to enforce the line between the statistics access / storage
+ * implementation and the details about individual types of statistics.
+ *
+ * Copyright (c) 2021-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/utils/activity/pgstat_io.c
+ * -------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "utils/pgstat_internal.h"
+
+
+static PgStat_IOContextOps pending_IOOpStats;
+bool have_iostats = false;
+
+
+void
+pgstat_count_io_op(IOOp io_op, IOObject io_object, IOContext io_context)
+{
+ PgStat_IOOpCounters *pending_counters;
+
+ Assert(io_context < IOCONTEXT_NUM_TYPES);
+ Assert(io_object < IOOBJECT_NUM_TYPES);
+ Assert(io_op < IOOP_NUM_TYPES);
+ Assert(pgstat_tracks_io_op(MyBackendType, io_context, io_object, io_op));
+
+ pending_counters = &pending_IOOpStats.data[io_context].data[io_object];
+
+ switch (io_op)
+ {
+ case IOOP_EVICT:
+ pending_counters->evictions++;
+ break;
+ case IOOP_EXTEND:
+ pending_counters->extends++;
+ break;
+ case IOOP_FSYNC:
+ pending_counters->fsyncs++;
+ break;
+ case IOOP_READ:
+ pending_counters->reads++;
+ break;
+ case IOOP_REUSE:
+ pending_counters->reuses++;
+ break;
+ case IOOP_WRITE:
+ pending_counters->writes++;
+ break;
+ }
+
+ have_iostats = true;
+}
+
+PgStat_IO *
+pgstat_fetch_stat_io(void)
+{
+ pgstat_snapshot_fixed(PGSTAT_KIND_IO);
+
+ return &pgStatLocal.snapshot.io;
+}
+
+/*
+ * Flush out locally pending IO statistics
+ *
+ * If no stats have been recorded, this function returns false.
+ *
+ * If nowait is true, this function returns true if the lock could not be
+ * acquired. Otherwise, return false.
+ */
+bool
+pgstat_flush_io(bool nowait)
+{
+ LWLock *bktype_lock;
+ PgStat_IOContextOps *bktype_shstats;
+
+ if (!have_iostats)
+ return false;
+
+ bktype_lock = &pgStatLocal.shmem->io.locks[MyBackendType];
+ bktype_shstats =
+ &pgStatLocal.shmem->io.stats[MyBackendType];
+
+ if (!nowait)
+ LWLockAcquire(bktype_lock, LW_EXCLUSIVE);
+ else if (!LWLockConditionalAcquire(bktype_lock, LW_EXCLUSIVE))
+ return true;
+
+ for (IOContext io_context = IOCONTEXT_BULKREAD;
+ io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ {
+ PgStat_IOObjectOps *shared_objs = &bktype_shstats->data[io_context];
+ PgStat_IOObjectOps *pending_objs = &pending_IOOpStats.data[io_context];
+
+ for (IOObject io_object = IOOBJECT_RELATION;
+ io_object < IOOBJECT_NUM_TYPES; io_object++)
+ {
+ PgStat_IOOpCounters *sharedent = &shared_objs->data[io_object];
+ PgStat_IOOpCounters *pendingent = &pending_objs->data[io_object];
+
+#define IO_ACC(fld) sharedent->fld += pendingent->fld
+ IO_ACC(evictions);
+ IO_ACC(extends);
+ IO_ACC(fsyncs);
+ IO_ACC(reads);
+ IO_ACC(reuses);
+ IO_ACC(writes);
+#undef IO_ACC
+ }
+ }
+
+ Assert(pgstat_bktype_io_stats_valid(bktype_shstats, MyBackendType));
+
+ LWLockRelease(bktype_lock);
+
+ memset(&pending_IOOpStats, 0, sizeof(pending_IOOpStats));
+
+ have_iostats = false;
+
+ return false;
+}
+
+const char *
+pgstat_get_io_context_name(IOContext io_context)
+{
+ switch (io_context)
+ {
+ case IOCONTEXT_BULKREAD:
+ return "bulkread";
+ case IOCONTEXT_BULKWRITE:
+ return "bulkwrite";
+ case IOCONTEXT_NORMAL:
+ return "normal";
+ case IOCONTEXT_VACUUM:
+ return "vacuum";
+ }
+
+ elog(ERROR, "unrecognized IOContext value: %d", io_context);
+ pg_unreachable();
+}
+
+const char *
+pgstat_get_io_object_name(IOObject io_object)
+{
+ switch (io_object)
+ {
+ case IOOBJECT_RELATION:
+ return "relation";
+ case IOOBJECT_TEMP_RELATION:
+ return "temp relation";
+ }
+
+ elog(ERROR, "unrecognized IOObject value: %d", io_object);
+ pg_unreachable();
+}
+
+const char *
+pgstat_get_io_op_name(IOOp io_op)
+{
+ switch (io_op)
+ {
+ case IOOP_EVICT:
+ return "evicted";
+ case IOOP_EXTEND:
+ return "extended";
+ case IOOP_FSYNC:
+ return "files synced";
+ case IOOP_READ:
+ return "read";
+ case IOOP_REUSE:
+ return "reused";
+ case IOOP_WRITE:
+ return "written";
+ }
+
+ elog(ERROR, "unrecognized IOOp value: %d", io_op);
+ pg_unreachable();
+}
+
+void
+pgstat_io_reset_all_cb(TimestampTz ts)
+{
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ LWLock *bktype_lock = &pgStatLocal.shmem->io.locks[i];
+ PgStat_IOContextOps *bktype_shstats = &pgStatLocal.shmem->io.stats[i];
+
+ LWLockAcquire(bktype_lock, LW_EXCLUSIVE);
+
+ /*
+ * Use the first BackendType's lock to protect the reset timestamp as
+ * well.
+ */
+ if (i == 0)
+ pgStatLocal.shmem->io.stat_reset_timestamp = ts;
+
+ memset(bktype_shstats, 0, sizeof(*bktype_shstats));
+ LWLockRelease(bktype_lock);
+ }
+}
+
+void
+pgstat_io_snapshot_cb(void)
+{
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ LWLock *bktype_lock = &pgStatLocal.shmem->io.locks[i];
+ PgStat_IOContextOps *bktype_shstats = &pgStatLocal.shmem->io.stats[i];
+ PgStat_IOContextOps *bktype_snap = &pgStatLocal.snapshot.io.stats[i];
+
+ LWLockAcquire(bktype_lock, LW_SHARED);
+
+ /*
+ * Use the first BackendType's lock to protect the reset timestamp as
+ * well.
+ */
+ if (i == 0)
+ pgStatLocal.snapshot.io.stat_reset_timestamp =
+ pgStatLocal.shmem->io.stat_reset_timestamp;
+
+ /* use struct assignment for better type safety */
+ *bktype_snap = *bktype_shstats;
+ LWLockRelease(bktype_lock);
+ }
+}
+
+/*
+ * IO statistics are not collected for all BackendTypes.
+ *
+ * The following BackendTypes do not participate in the cumulative stats
+ * subsystem or do not perform IO that we currently track:
+ * - Syslogger, because it is not connected to shared memory
+ * - Archiver, because most relevant archiving IO is delegated to a
+ *   specialized command or module
+ * - WAL Receiver and WAL Writer, whose IO is not tracked in pg_stat_io for now
+ *
+ * Returns true if the given BackendType participates in the cumulative stats
+ * subsystem for IO and false if it does not.
+ */
+bool
+pgstat_tracks_io_bktype(BackendType bktype)
+{
+ /*
+ * List every type so that new backend types trigger a warning about
+ * needing to adjust this switch.
+ */
+ switch (bktype)
+ {
+ case B_INVALID:
+ case B_ARCHIVER:
+ case B_LOGGER:
+ case B_WAL_RECEIVER:
+ case B_WAL_WRITER:
+ return false;
+
+ case B_AUTOVAC_LAUNCHER:
+ case B_AUTOVAC_WORKER:
+ case B_BACKEND:
+ case B_BG_WORKER:
+ case B_BG_WRITER:
+ case B_CHECKPOINTER:
+ case B_STANDALONE_BACKEND:
+ case B_STARTUP:
+ case B_WAL_SENDER:
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * Some BackendTypes do not perform IO in certain IOContexts. Some IOObjects
+ * are never operated on in some IOContexts. Check that the given BackendType
+ * is expected to do IO in the given IOContext and that the given IOObject is
+ * expected to be operated on in the given IOContext.
+ */
+bool
+pgstat_tracks_io_object(BackendType bktype, IOContext io_context,
+ IOObject io_object)
+{
+ bool no_temp_rel;
+
+ /*
+ * Some BackendTypes should never track IO statistics.
+ */
+ if (!pgstat_tracks_io_bktype(bktype))
+ return false;
+
+ /*
+ * Currently, IO on temporary relations can only occur in the
+ * IOCONTEXT_NORMAL IOContext.
+ */
+ if (io_context != IOCONTEXT_NORMAL &&
+ io_object == IOOBJECT_TEMP_RELATION)
+ return false;
+
+ /*
+ * In core Postgres, only regular backends and WAL Sender processes
+ * executing queries will use local buffers and operate on temporary
+ * relations. Parallel workers will not use local buffers (see
+ * InitLocalBuffers()); however, extensions leveraging background workers
+ * have no such limitation, so track IO on IOOBJECT_TEMP_RELATION for
+ * BackendType B_BG_WORKER.
+ */
+ no_temp_rel = bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER ||
+ bktype == B_CHECKPOINTER || bktype == B_AUTOVAC_WORKER ||
+ bktype == B_STANDALONE_BACKEND || bktype == B_STARTUP;
+
+ if (no_temp_rel && io_context == IOCONTEXT_NORMAL &&
+ io_object == IOOBJECT_TEMP_RELATION)
+ return false;
+
+ /*
+ * Some BackendTypes do not currently perform any IO in certain
+ * IOContexts, and, while it may not be inherently incorrect for them to
+ * do so, excluding those rows from the view makes the view easier to use.
+ */
+ if ((bktype == B_CHECKPOINTER || bktype == B_BG_WRITER) &&
+ (io_context == IOCONTEXT_BULKREAD ||
+ io_context == IOCONTEXT_BULKWRITE ||
+ io_context == IOCONTEXT_VACUUM))
+ return false;
+
+ if (bktype == B_AUTOVAC_LAUNCHER && io_context == IOCONTEXT_VACUUM)
+ return false;
+
+ if ((bktype == B_AUTOVAC_WORKER || bktype == B_AUTOVAC_LAUNCHER) &&
+ io_context == IOCONTEXT_BULKWRITE)
+ return false;
+
+ return true;
+}
+
+/*
+ * Some BackendTypes will never do certain IOOps and some IOOps should not
+ * occur in certain IOContexts. Check that the given IOOp is valid for the
+ * given BackendType in the given IOContext. Note that there are currently no
+ * cases of an IOOp being invalid for a particular BackendType only within a
+ * certain IOContext.
+ */
+bool
+pgstat_tracks_io_op(BackendType bktype, IOContext io_context,
+ IOObject io_object, IOOp io_op)
+{
+ bool strategy_io_context;
+
+ /* if (io_context, io_object) will never collect stats, we're done */
+ if (!pgstat_tracks_io_object(bktype, io_context, io_object))
+ return false;
+
+ /*
+ * Some BackendTypes will not do certain IOOps.
+ */
+ if ((bktype == B_BG_WRITER || bktype == B_CHECKPOINTER) &&
+ (io_op == IOOP_READ || io_op == IOOP_EVICT))
+ return false;
+
+ if ((bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER ||
+ bktype == B_CHECKPOINTER) && io_op == IOOP_EXTEND)
+ return false;
+
+ /*
+ * Some IOOps are not valid in certain IOContexts and some IOOps are only
+ * valid in certain contexts.
+ */
+ if (io_context == IOCONTEXT_BULKREAD && io_op == IOOP_EXTEND)
+ return false;
+
+ strategy_io_context = io_context == IOCONTEXT_BULKREAD ||
+ io_context == IOCONTEXT_BULKWRITE || io_context == IOCONTEXT_VACUUM;
+
+ /*
+ * IOOP_REUSE is only relevant when a BufferAccessStrategy is in use.
+ */
+ if (!strategy_io_context && io_op == IOOP_REUSE)
+ return false;
+
+ /*
+ * IOOP_FSYNC IOOps done by a backend using a BufferAccessStrategy are
+ * counted in the IOCONTEXT_NORMAL IOContext. See comment in
+ * ForwardSyncRequest() for more details.
+ */
+ if (strategy_io_context && io_op == IOOP_FSYNC)
+ return false;
+
+ /*
+ * Temporary tables are not logged and thus do not require fsync'ing.
+ */
+ if (io_context == IOCONTEXT_NORMAL &&
+ io_object == IOOBJECT_TEMP_RELATION && io_op == IOOP_FSYNC)
+ return false;
+
+ return true;
+}
+
+/*
+ * Check that stats have not been counted for any combination of IOContext,
+ * IOObject, and IOOp that is not tracked for the passed-in BackendType. The
+ * passed-in PgStat_IOContextOps must contain stats from the BackendType
+ * specified by the second parameter. Caller is responsible for locking the
+ * passed-in PgStat_IOContextOps, if needed.
+ */
+bool
+pgstat_bktype_io_stats_valid(PgStat_IOContextOps *context_ops,
+ BackendType bktype)
+{
+ bool bktype_tracked = pgstat_tracks_io_bktype(bktype);
+
+ for (IOContext io_context = IOCONTEXT_BULKREAD;
+ io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ {
+ PgStat_IOObjectOps *context = &context_ops->data[io_context];
+
+ for (IOObject io_object = IOOBJECT_RELATION;
+ io_object < IOOBJECT_NUM_TYPES; io_object++)
+ {
+ PgStat_IOOpCounters *object = &context->data[io_object];
+
+ if (!bktype_tracked ||
+ !pgstat_tracks_io_object(bktype, io_context,
+ io_object))
+ {
+ if (!pgstat_iszero_io_object(object))
+ return false;
+ continue;
+ }
+
+ for (IOOp io_op = IOOP_EVICT; io_op < IOOP_NUM_TYPES; io_op++)
+ {
+ if (!pgstat_tracks_io_op(bktype, io_context, io_object, io_op) &&
+ !pgstat_iszero_io_op(object, io_op))
+ return false;
+ }
+ }
+ }
+
+ return true;
+}
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index 2e20b93c20..f793ac1516 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -206,7 +206,7 @@ pgstat_drop_relation(Relation rel)
}
/*
- * Report that the table was just vacuumed.
+ * Report that the table was just vacuumed and flush IO statistics.
*/
void
pgstat_report_vacuum(Oid tableoid, bool shared,
@@ -258,10 +258,18 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
}
pgstat_unlock_entry(entry_ref);
+
+ /*
+ * Flush IO statistics now. pgstat_report_stat() will flush IO stats;
+ * however, it will not be called until after an entire autovacuum cycle
+ * is done -- which will likely vacuum many relations -- or until the
+ * VACUUM command has processed all tables and committed.
+ */
+ pgstat_flush_io(false);
}
/*
- * Report that the table was just analyzed.
+ * Report that the table was just analyzed and flush IO statistics.
*
* Caller must provide new live- and dead-tuples estimates, as well as a
* flag indicating whether to reset the mod_since_analyze counter.
@@ -341,6 +349,9 @@ pgstat_report_analyze(Relation rel,
}
pgstat_unlock_entry(entry_ref);
+
+ /* see pgstat_report_vacuum() */
+ pgstat_flush_io(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_shmem.c b/src/backend/utils/activity/pgstat_shmem.c
index c1506b53d0..09fffd0e82 100644
--- a/src/backend/utils/activity/pgstat_shmem.c
+++ b/src/backend/utils/activity/pgstat_shmem.c
@@ -202,6 +202,10 @@ StatsShmemInit(void)
LWLockInitialize(&ctl->checkpointer.lock, LWTRANCHE_PGSTATS_DATA);
LWLockInitialize(&ctl->slru.lock, LWTRANCHE_PGSTATS_DATA);
LWLockInitialize(&ctl->wal.lock, LWTRANCHE_PGSTATS_DATA);
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ LWLockInitialize(&ctl->io.locks[i],
+ LWTRANCHE_PGSTATS_DATA);
}
else
{
diff --git a/src/backend/utils/activity/pgstat_wal.c b/src/backend/utils/activity/pgstat_wal.c
index e7a82b5fed..e8598b2f4e 100644
--- a/src/backend/utils/activity/pgstat_wal.c
+++ b/src/backend/utils/activity/pgstat_wal.c
@@ -34,7 +34,7 @@ static WalUsage prevWalUsage;
/*
* Calculate how much WAL usage counters have increased and update
- * shared statistics.
+ * shared WAL and IO statistics.
*
* Must be called by processes that generate WAL, that do not call
* pgstat_report_stat(), like walwriter.
@@ -43,6 +43,8 @@ void
pgstat_report_wal(bool force)
{
pgstat_flush_wal(force);
+
+ pgstat_flush_io(force);
}
/*
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 58bd1360b9..42b890b806 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1593,6 +1593,8 @@ pg_stat_reset_shared(PG_FUNCTION_ARGS)
pgstat_reset_of_kind(PGSTAT_KIND_BGWRITER);
pgstat_reset_of_kind(PGSTAT_KIND_CHECKPOINTER);
}
+ else if (strcmp(target, "io") == 0)
+ pgstat_reset_of_kind(PGSTAT_KIND_IO);
else if (strcmp(target, "recovery_prefetch") == 0)
XLogPrefetchResetStats();
else if (strcmp(target, "wal") == 0)
@@ -1601,7 +1603,7 @@ pg_stat_reset_shared(PG_FUNCTION_ARGS)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("unrecognized reset target: \"%s\"", target),
- errhint("Target must be \"archiver\", \"bgwriter\", \"recovery_prefetch\", or \"wal\".")));
+ errhint("Target must be \"archiver\", \"io\", \"bgwriter\", \"recovery_prefetch\", or \"wal\".")));
PG_RETURN_VOID();
}
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 0ffeefc437..0aaf600a78 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -331,6 +331,8 @@ typedef enum BackendType
B_WAL_WRITER,
} BackendType;
+#define BACKEND_NUM_TYPES (B_WAL_WRITER + 1)
+
extern PGDLLIMPORT BackendType MyBackendType;
extern const char *GetBackendTypeDesc(BackendType backendType);
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 5e3326a3b9..859442f69b 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -48,6 +48,7 @@ typedef enum PgStat_Kind
PGSTAT_KIND_ARCHIVER,
PGSTAT_KIND_BGWRITER,
PGSTAT_KIND_CHECKPOINTER,
+ PGSTAT_KIND_IO,
PGSTAT_KIND_SLRU,
PGSTAT_KIND_WAL,
} PgStat_Kind;
@@ -276,6 +277,71 @@ typedef struct PgStat_CheckpointerStats
PgStat_Counter buf_fsync_backend;
} PgStat_CheckpointerStats;
+
+/*
+ * Types related to counting IO for various IO Contexts. When adding a new
+ * value, ensure that the proper paths are added to pgstat_iszero_io_object()
+ * and pgstat_iszero_io_op() (though the compiler will remind you about the
+ * latter).
+ */
+
+typedef enum IOContext
+{
+ IOCONTEXT_BULKREAD,
+ IOCONTEXT_BULKWRITE,
+ IOCONTEXT_NORMAL,
+ IOCONTEXT_VACUUM,
+} IOContext;
+
+#define IOCONTEXT_NUM_TYPES (IOCONTEXT_VACUUM + 1)
+
+typedef enum IOObject
+{
+ IOOBJECT_RELATION,
+ IOOBJECT_TEMP_RELATION,
+} IOObject;
+
+#define IOOBJECT_NUM_TYPES (IOOBJECT_TEMP_RELATION + 1)
+
+typedef enum IOOp
+{
+ IOOP_EVICT,
+ IOOP_EXTEND,
+ IOOP_FSYNC,
+ IOOP_READ,
+ IOOP_REUSE,
+ IOOP_WRITE,
+} IOOp;
+
+#define IOOP_NUM_TYPES (IOOP_WRITE + 1)
+
+typedef struct PgStat_IOOpCounters
+{
+ PgStat_Counter evictions;
+ PgStat_Counter extends;
+ PgStat_Counter fsyncs;
+ PgStat_Counter reads;
+ PgStat_Counter reuses;
+ PgStat_Counter writes;
+} PgStat_IOOpCounters;
+
+typedef struct PgStat_IOObjectOps
+{
+ PgStat_IOOpCounters data[IOOBJECT_NUM_TYPES];
+} PgStat_IOObjectOps;
+
+typedef struct PgStat_IOContextOps
+{
+ PgStat_IOObjectOps data[IOCONTEXT_NUM_TYPES];
+} PgStat_IOContextOps;
+
+typedef struct PgStat_IO
+{
+ TimestampTz stat_reset_timestamp;
+ PgStat_IOContextOps stats[BACKEND_NUM_TYPES];
+} PgStat_IO;
+
+
typedef struct PgStat_StatDBEntry
{
PgStat_Counter xact_commit;
@@ -453,6 +519,66 @@ extern void pgstat_report_checkpointer(void);
extern PgStat_CheckpointerStats *pgstat_fetch_stat_checkpointer(void);
+/*
+ * Functions in pgstat_io.c
+ */
+
+extern void pgstat_count_io_op(IOOp io_op, IOObject io_object, IOContext io_context);
+extern PgStat_IO *pgstat_fetch_stat_io(void);
+extern const char *pgstat_get_io_context_name(IOContext io_context);
+extern const char *pgstat_get_io_object_name(IOObject io_object);
+extern const char *pgstat_get_io_op_name(IOOp io_op);
+
+extern bool pgstat_tracks_io_bktype(BackendType bktype);
+extern bool pgstat_tracks_io_object(BackendType bktype,
+ IOContext io_context, IOObject io_object);
+extern bool pgstat_tracks_io_op(BackendType bktype, IOContext io_context,
+ IOObject io_object, IOOp io_op);
+
+/*
+ * Functions to check if counters are zero.
+ */
+static inline bool
+pgstat_iszero_io_object(const PgStat_IOOpCounters *counters)
+{
+ return
+ counters->evictions == 0 &&
+ counters->extends == 0 &&
+ counters->fsyncs == 0 &&
+ counters->reads == 0 &&
+ counters->reuses == 0 &&
+ counters->writes == 0;
+}
+
+static inline PgStat_Counter
+pgstat_get_io_op_value(const PgStat_IOOpCounters *counters, IOOp io_op)
+{
+ switch (io_op)
+ {
+ case IOOP_EVICT:
+ return counters->evictions;
+ case IOOP_EXTEND:
+ return counters->extends;
+ case IOOP_FSYNC:
+ return counters->fsyncs;
+ case IOOP_READ:
+ return counters->reads;
+ case IOOP_REUSE:
+ return counters->reuses;
+ case IOOP_WRITE:
+ return counters->writes;
+ }
+
+ pg_unreachable();
+}
+
+static inline bool
+pgstat_iszero_io_op(const PgStat_IOOpCounters *counters, IOOp io_op)
+{
+ return pgstat_get_io_op_value(counters, io_op) == 0;
+}
+
+
/*
* Functions in pgstat_database.c
*/
diff --git a/src/include/utils/pgstat_internal.h b/src/include/utils/pgstat_internal.h
index 12fd51f1ae..8f4f9b760c 100644
--- a/src/include/utils/pgstat_internal.h
+++ b/src/include/utils/pgstat_internal.h
@@ -329,6 +329,19 @@ typedef struct PgStatShared_Checkpointer
PgStat_CheckpointerStats reset_offset;
} PgStatShared_Checkpointer;
+/* shared version of PgStat_IO */
+typedef struct PgStatShared_IO
+{
+ /*
+ * locks[i] protects ->stats[i]. locks[0] also protects
+ * ->stat_reset_timestamp.
+ */
+ LWLock locks[BACKEND_NUM_TYPES];
+
+ TimestampTz stat_reset_timestamp;
+ PgStat_IOContextOps stats[BACKEND_NUM_TYPES];
+} PgStatShared_IO;
+
typedef struct PgStatShared_SLRU
{
/* lock protects ->stats */
@@ -419,6 +432,7 @@ typedef struct PgStat_ShmemControl
PgStatShared_Archiver archiver;
PgStatShared_BgWriter bgwriter;
PgStatShared_Checkpointer checkpointer;
+ PgStatShared_IO io;
PgStatShared_SLRU slru;
PgStatShared_Wal wal;
} PgStat_ShmemControl;
@@ -442,6 +456,8 @@ typedef struct PgStat_Snapshot
PgStat_CheckpointerStats checkpointer;
+ PgStat_IO io;
+
PgStat_SLRUStats slru[SLRU_NUM_ELEMENTS];
PgStat_WalStats wal;
@@ -549,6 +565,17 @@ extern void pgstat_database_reset_timestamp_cb(PgStatShared_Common *header, Time
extern bool pgstat_function_flush_cb(PgStat_EntryRef *entry_ref, bool nowait);
+/*
+ * Functions in pgstat_io.c
+ */
+
+extern void pgstat_io_reset_all_cb(TimestampTz ts);
+extern void pgstat_io_snapshot_cb(void);
+extern bool pgstat_flush_io(bool nowait);
+extern bool pgstat_bktype_io_stats_valid(PgStat_IOContextOps *context_ops,
+ BackendType bktype);
+
+
/*
* Functions in pgstat_relation.c
*/
@@ -643,6 +670,13 @@ extern void pgstat_create_transactional(PgStat_Kind kind, Oid dboid, Oid objoid)
extern PGDLLIMPORT PgStat_LocalState pgStatLocal;
+/*
+ * Variables in pgstat_io.c
+ */
+
+extern PGDLLIMPORT bool have_iostats;
+
+
/*
* Variables in pgstat_slru.c
*/
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 50d86cb01b..9336bf9796 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1106,7 +1106,10 @@ ID
INFIX
INT128
INTERFACE_INFO
+IOContext
IOFuncSelector
+IOObject
+IOOp
IPCompareMethod
ITEM
IV
@@ -2010,6 +2013,7 @@ PgStatShared_Common
PgStatShared_Database
PgStatShared_Function
PgStatShared_HashEntry
+PgStatShared_IO
PgStatShared_Relation
PgStatShared_ReplSlot
PgStatShared_SLRU
@@ -2027,6 +2031,10 @@ PgStat_FetchConsistency
PgStat_FunctionCallUsage
PgStat_FunctionCounts
PgStat_HashKey
+PgStat_IO
+PgStat_IOContextOps
+PgStat_IOObjectOps
+PgStat_IOOpCounters
PgStat_Kind
PgStat_KindInfo
PgStat_LocalState
--
2.38.1
On Mon, Jan 2, 2023 at 5:46 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
Besides docs, there is one large change to the code which I am currently
working on, which is to change PgStat_IOOpCounters into an array of
PgStat_Counter values instead of having individual members for each IOOp
type. I hadn't done this previously because the additional level of
nesting seemed confusing. However, it seems it would simplify the code
quite a bit and is probably worth doing.
As described above, attached v43 uses an array of PgStat_Counter values
for the IOOps instead of struct members.
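For readers following along, the refactor described above can be sketched in plain C. This is only an illustration, not the committed layout: the type and enum names mirror the patch, but the scaffolding (no pgstat locking, no shared memory, the helper `accum_io_ops`) is simplified and hypothetical. The point is that replacing one named struct member per IOOp with a counter array indexed by the IOOp enum turns the per-member `IO_ACC(...)` macro invocations into a single loop:

```c
#include <assert.h>
#include <string.h>

typedef long long PgStat_Counter;

typedef enum IOOp
{
	IOOP_EVICT,
	IOOP_EXTEND,
	IOOP_FSYNC,
	IOOP_READ,
	IOOP_REUSE,
	IOOP_WRITE,
} IOOp;

#define IOOP_NUM_TYPES (IOOP_WRITE + 1)

/*
 * Before: one named member per IOOp (evictions, extends, fsyncs, ...).
 * After: a single array indexed by IOOp, so generic code can loop over it.
 */
typedef struct PgStat_IOOpCounters
{
	PgStat_Counter counts[IOOP_NUM_TYPES];
} PgStat_IOOpCounters;

/*
 * Accumulating pending backend-local stats into the shared entry becomes
 * one loop instead of a macro invocation per named member.
 */
static void
accum_io_ops(PgStat_IOOpCounters *shared, const PgStat_IOOpCounters *pending)
{
	for (int op = 0; op < IOOP_NUM_TYPES; op++)
		shared->counts[op] += pending->counts[op];
}
```

The same loop shape would also replace the member-by-member comparisons in validity checks like pgstat_bktype_io_stats_valid(), which is the simplification the email refers to.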
Attachments:
v43-0001-pgindent-and-some-manual-cleanup-in-pgstat-relat.patch
From 0d151b5ebb65b38c89c87500885306fc4b2a2a63 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 9 Dec 2022 18:23:19 -0800
Subject: [PATCH v43 1/4] pgindent and some manual cleanup in pgstat related
code
---
src/backend/storage/buffer/bufmgr.c | 22 ++++++++++----------
src/backend/storage/buffer/localbuf.c | 4 ++--
src/backend/utils/activity/pgstat.c | 3 ++-
src/backend/utils/activity/pgstat_relation.c | 1 +
src/backend/utils/adt/pgstatfuncs.c | 2 +-
src/include/pgstat.h | 1 +
src/include/utils/pgstat_internal.h | 1 +
7 files changed, 19 insertions(+), 15 deletions(-)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 3fb38a25cf..8075828e8a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -516,7 +516,7 @@ PrefetchSharedBuffer(SMgrRelation smgr_reln,
/* create a tag so we can lookup the buffer */
InitBufferTag(&newTag, &smgr_reln->smgr_rlocator.locator,
- forkNum, blockNum);
+ forkNum, blockNum);
/* determine its hash code and partition lock ID */
newHash = BufTableHashCode(&newTag);
@@ -3297,8 +3297,8 @@ DropRelationsAllBuffers(SMgrRelation *smgr_reln, int nlocators)
uint32 buf_state;
/*
- * As in DropRelationBuffers, an unlocked precheck should be
- * safe and saves some cycles.
+ * As in DropRelationBuffers, an unlocked precheck should be safe and
+ * saves some cycles.
*/
if (!use_bsearch)
@@ -3425,8 +3425,8 @@ DropDatabaseBuffers(Oid dbid)
uint32 buf_state;
/*
- * As in DropRelationBuffers, an unlocked precheck should be
- * safe and saves some cycles.
+ * As in DropRelationBuffers, an unlocked precheck should be safe and
+ * saves some cycles.
*/
if (bufHdr->tag.dbOid != dbid)
continue;
@@ -3572,8 +3572,8 @@ FlushRelationBuffers(Relation rel)
bufHdr = GetBufferDescriptor(i);
/*
- * As in DropRelationBuffers, an unlocked precheck should be
- * safe and saves some cycles.
+ * As in DropRelationBuffers, an unlocked precheck should be safe and
+ * saves some cycles.
*/
if (!BufTagMatchesRelFileLocator(&bufHdr->tag, &rel->rd_locator))
continue;
@@ -3645,8 +3645,8 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
uint32 buf_state;
/*
- * As in DropRelationBuffers, an unlocked precheck should be
- * safe and saves some cycles.
+ * As in DropRelationBuffers, an unlocked precheck should be safe and
+ * saves some cycles.
*/
if (!use_bsearch)
@@ -3880,8 +3880,8 @@ FlushDatabaseBuffers(Oid dbid)
bufHdr = GetBufferDescriptor(i);
/*
- * As in DropRelationBuffers, an unlocked precheck should be
- * safe and saves some cycles.
+ * As in DropRelationBuffers, an unlocked precheck should be safe and
+ * saves some cycles.
*/
if (bufHdr->tag.dbOid != dbid)
continue;
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index b2720df6ea..8372acc383 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -610,8 +610,8 @@ AtProcExit_LocalBuffers(void)
{
/*
* We shouldn't be holding any remaining pins; if we are, and assertions
- * aren't enabled, we'll fail later in DropRelationBuffers while
- * trying to drop the temp rels.
+ * aren't enabled, we'll fail later in DropRelationBuffers while trying to
+ * drop the temp rels.
*/
CheckForLocalBufferLeaks();
}
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 7e9dc17e68..0fa5370bcd 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -426,7 +426,7 @@ pgstat_discard_stats(void)
ereport(DEBUG2,
(errcode_for_file_access(),
errmsg_internal("unlinked permanent statistics file \"%s\"",
- PGSTAT_STAT_PERMANENT_FILENAME)));
+ PGSTAT_STAT_PERMANENT_FILENAME)));
}
/*
@@ -986,6 +986,7 @@ pgstat_build_snapshot(void)
entry->data = MemoryContextAlloc(pgStatLocal.snapshot.context,
kind_info->shared_size);
+
/*
* Acquire the LWLock directly instead of using
* pg_stat_lock_entry_shared() which requires a reference.
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index 1730425de1..2e20b93c20 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -783,6 +783,7 @@ pgstat_relation_flush_cb(PgStat_EntryRef *entry_ref, bool nowait)
if (lstats->t_counts.t_numscans)
{
TimestampTz t = GetCurrentTransactionStopTimestamp();
+
if (t > tabentry->lastscan)
tabentry->lastscan = t;
}
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 6cddd74aa7..58bd1360b9 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -906,7 +906,7 @@ pg_stat_get_backend_client_addr(PG_FUNCTION_ARGS)
clean_ipv6_addr(beentry->st_clientaddr.addr.ss_family, remote_host);
PG_RETURN_DATUM(DirectFunctionCall1(inet_in,
- CStringGetDatum(remote_host)));
+ CStringGetDatum(remote_host)));
}
Datum
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index d3e965d744..5e3326a3b9 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -476,6 +476,7 @@ extern void pgstat_report_connect(Oid dboid);
extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dboid);
+
/*
* Functions in pgstat_function.c
*/
diff --git a/src/include/utils/pgstat_internal.h b/src/include/utils/pgstat_internal.h
index 08412d6404..12fd51f1ae 100644
--- a/src/include/utils/pgstat_internal.h
+++ b/src/include/utils/pgstat_internal.h
@@ -626,6 +626,7 @@ extern void pgstat_wal_snapshot_cb(void);
extern bool pgstat_subscription_flush_cb(PgStat_EntryRef *entry_ref, bool nowait);
extern void pgstat_subscription_reset_timestamp_cb(PgStatShared_Common *header, TimestampTz ts);
+
/*
* Functions in pgstat_xact.c
*/
--
2.38.1
v43-0002-pgstat-Infrastructure-to-track-IO-operations.patch
From 699a469c440a0ce9676f04126ec6080e8f7739df Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 5 Dec 2022 19:25:44 -0500
Subject: [PATCH v43 2/4] pgstat: Infrastructure to track IO operations
Introduce "IOOp", an IO operation done by a backend, "IOObject", the
target object of the IO, and "IOContext", the context or location of the
IO operations on that object. For example, the checkpointer may write a
shared buffer out. This would be considered an IOOp "written" on an
IOObject IOOBJECT_RELATION in IOContext IOCONTEXT_NORMAL by BackendType
"checkpointer".
Each IOOp (evict, extend, fsync, read, reuse, and write) can be counted
per IOObject (relation, temp relation) per IOContext (normal, bulkread,
bulkwrite, or vacuum) through a call to pgstat_count_io_op().
Note that this commit introduces the infrastructure to count IO
Operation statistics. A subsequent commit will add calls to
pgstat_count_io_op() in the appropriate locations.
IOContext IOCONTEXT_NORMAL concerns operations on local and shared
buffers, while IOCONTEXT_BULKREAD, IOCONTEXT_BULKWRITE, and
IOCONTEXT_VACUUM IOContexts concern IO operations on buffers as part of
a BufferAccessStrategy.
IOObject IOOBJECT_TEMP_RELATION concerns IO Operations on buffers
containing temporary table data, while IOObject IOOBJECT_RELATION
concerns IO Operations on buffers containing permanent relation data.
Stats on IOOps on all IOObjects in all IOContexts for a given backend
are first counted in a backend's local memory and then flushed to shared
memory and accumulated with those from all other backends, exited and
live.
Some BackendTypes do not flush their pending statistics at regular
intervals, so they explicitly call pgstat_flush_io() during the course
of normal operations to flush their backend-local IO operation
statistics to shared memory in a timely manner.
Because not all BackendType, IOOp, IOObject, IOContext combinations are
valid, the validity of the stats is checked before flushing pending
stats and before reading in the existing stats file to shared memory.
The aggregated stats in shared memory could be extended in the future
with per-backend stats -- useful for per connection IO statistics and
monitoring.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Maciek Sakrejda <m.sakrejda@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
doc/src/sgml/monitoring.sgml | 2 +
src/backend/utils/activity/Makefile | 1 +
src/backend/utils/activity/meson.build | 1 +
src/backend/utils/activity/pgstat.c | 38 ++
src/backend/utils/activity/pgstat_bgwriter.c | 7 +-
.../utils/activity/pgstat_checkpointer.c | 7 +-
src/backend/utils/activity/pgstat_io.c | 395 ++++++++++++++++++
src/backend/utils/activity/pgstat_relation.c | 15 +-
src/backend/utils/activity/pgstat_shmem.c | 4 +
src/backend/utils/activity/pgstat_wal.c | 4 +-
src/backend/utils/adt/pgstatfuncs.c | 4 +-
src/include/miscadmin.h | 2 +
src/include/pgstat.h | 78 ++++
src/include/utils/pgstat_internal.h | 34 ++
src/tools/pgindent/typedefs.list | 8 +
15 files changed, 594 insertions(+), 6 deletions(-)
create mode 100644 src/backend/utils/activity/pgstat_io.c
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 5bcba0fdec..710bd2c52e 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -5403,6 +5403,8 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
the <structname>pg_stat_bgwriter</structname>
view, <literal>archiver</literal> to reset all the counters shown in
the <structname>pg_stat_archiver</structname> view,
+ <literal>io</literal> to reset all the counters shown in the
+ <structname>pg_stat_io</structname> view,
<literal>wal</literal> to reset all the counters shown in the
<structname>pg_stat_wal</structname> view or
<literal>recovery_prefetch</literal> to reset all the counters shown
diff --git a/src/backend/utils/activity/Makefile b/src/backend/utils/activity/Makefile
index a80eda3cf4..7d7482dde0 100644
--- a/src/backend/utils/activity/Makefile
+++ b/src/backend/utils/activity/Makefile
@@ -22,6 +22,7 @@ OBJS = \
pgstat_checkpointer.o \
pgstat_database.o \
pgstat_function.o \
+ pgstat_io.o \
pgstat_relation.o \
pgstat_replslot.o \
pgstat_shmem.o \
diff --git a/src/backend/utils/activity/meson.build b/src/backend/utils/activity/meson.build
index a2b872c24b..518ee3f798 100644
--- a/src/backend/utils/activity/meson.build
+++ b/src/backend/utils/activity/meson.build
@@ -9,6 +9,7 @@ backend_sources += files(
'pgstat_checkpointer.c',
'pgstat_database.c',
'pgstat_function.c',
+ 'pgstat_io.c',
'pgstat_relation.c',
'pgstat_replslot.c',
'pgstat_shmem.c',
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 0fa5370bcd..8451be0617 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -72,6 +72,7 @@
* - pgstat_checkpointer.c
* - pgstat_database.c
* - pgstat_function.c
+ * - pgstat_io.c
* - pgstat_relation.c
* - pgstat_replslot.c
* - pgstat_slru.c
@@ -359,6 +360,15 @@ static const PgStat_KindInfo pgstat_kind_infos[PGSTAT_NUM_KINDS] = {
.snapshot_cb = pgstat_checkpointer_snapshot_cb,
},
+ [PGSTAT_KIND_IO] = {
+ .name = "io_ops",
+
+ .fixed_amount = true,
+
+ .reset_all_cb = pgstat_io_reset_all_cb,
+ .snapshot_cb = pgstat_io_snapshot_cb,
+ },
+
[PGSTAT_KIND_SLRU] = {
.name = "slru",
@@ -582,6 +592,7 @@ pgstat_report_stat(bool force)
/* Don't expend a clock check if nothing to do */
if (dlist_is_empty(&pgStatPending) &&
+ !have_iostats &&
!have_slrustats &&
!pgstat_have_pending_wal())
{
@@ -628,6 +639,9 @@ pgstat_report_stat(bool force)
/* flush database / relation / function / ... stats */
partial_flush |= pgstat_flush_pending_entries(nowait);
+ /* flush IO stats */
+ partial_flush |= pgstat_flush_io(nowait);
+
/* flush wal stats */
partial_flush |= pgstat_flush_wal(nowait);
@@ -1322,6 +1336,15 @@ pgstat_write_statsfile(void)
pgstat_build_snapshot_fixed(PGSTAT_KIND_CHECKPOINTER);
write_chunk_s(fpout, &pgStatLocal.snapshot.checkpointer);
+ /*
+ * Write IO stats struct
+ */
+ pgstat_build_snapshot_fixed(PGSTAT_KIND_IO);
+ write_chunk_s(fpout, &pgStatLocal.snapshot.io.stat_reset_timestamp);
+ for (BackendType bktype = B_INVALID + 1; bktype < BACKEND_NUM_TYPES;
+ bktype++)
+ write_chunk_s(fpout, &pgStatLocal.snapshot.io.stats[bktype]);
+
/*
* Write SLRU stats struct
*/
@@ -1496,6 +1519,21 @@ pgstat_read_statsfile(void)
if (!read_chunk_s(fpin, &shmem->checkpointer.stats))
goto error;
+ /*
+ * Read IO stats struct
+ */
+ if (!read_chunk_s(fpin, &shmem->io.stat_reset_timestamp))
+ goto error;
+
+ for (BackendType bktype = B_INVALID + 1; bktype < BACKEND_NUM_TYPES;
+ bktype++)
+ {
+ Assert(pgstat_bktype_io_stats_valid(&shmem->io.stats[bktype],
+ bktype));
+ if (!read_chunk_s(fpin, &shmem->io.stats[bktype].data))
+ goto error;
+ }
+
/*
* Read SLRU stats struct
*/
diff --git a/src/backend/utils/activity/pgstat_bgwriter.c b/src/backend/utils/activity/pgstat_bgwriter.c
index 9247f2dda2..92be384b0d 100644
--- a/src/backend/utils/activity/pgstat_bgwriter.c
+++ b/src/backend/utils/activity/pgstat_bgwriter.c
@@ -24,7 +24,7 @@ PgStat_BgWriterStats PendingBgWriterStats = {0};
/*
- * Report bgwriter statistics
+ * Report bgwriter and IO statistics
*/
void
pgstat_report_bgwriter(void)
@@ -56,6 +56,11 @@ pgstat_report_bgwriter(void)
* Clear out the statistics buffer, so it can be re-used.
*/
MemSet(&PendingBgWriterStats, 0, sizeof(PendingBgWriterStats));
+
+ /*
+ * Report IO statistics
+ */
+ pgstat_flush_io(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_checkpointer.c b/src/backend/utils/activity/pgstat_checkpointer.c
index 3e9ab45103..26dec112f6 100644
--- a/src/backend/utils/activity/pgstat_checkpointer.c
+++ b/src/backend/utils/activity/pgstat_checkpointer.c
@@ -24,7 +24,7 @@ PgStat_CheckpointerStats PendingCheckpointerStats = {0};
/*
- * Report checkpointer statistics
+ * Report checkpointer and IO statistics
*/
void
pgstat_report_checkpointer(void)
@@ -62,6 +62,11 @@ pgstat_report_checkpointer(void)
* Clear out the statistics buffer, so it can be re-used.
*/
MemSet(&PendingCheckpointerStats, 0, sizeof(PendingCheckpointerStats));
+
+ /*
+ * Report IO statistics
+ */
+ pgstat_flush_io(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
new file mode 100644
index 0000000000..9e14d8a491
--- /dev/null
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -0,0 +1,395 @@
+/* -------------------------------------------------------------------------
+ *
+ * pgstat_io.c
+ * Implementation of IO statistics.
+ *
+ * This file contains the implementation of IO statistics. It is kept separate
+ * from pgstat.c to enforce the line between the statistics access / storage
+ * implementation and the details about individual types of statistics.
+ *
+ * Copyright (c) 2021-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/utils/activity/pgstat_io.c
+ * -------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "utils/pgstat_internal.h"
+
+
+static PgStat_IOContextOps pending_IOOpStats;
+bool have_iostats = false;
+
+
+void
+pgstat_count_io_op(IOOp io_op, IOObject io_object, IOContext io_context)
+{
+ Assert(io_context < IOCONTEXT_NUM_TYPES);
+ Assert(io_object < IOOBJECT_NUM_TYPES);
+ Assert(io_op < IOOP_NUM_TYPES);
+ Assert(pgstat_tracks_io_op(MyBackendType, io_context, io_object, io_op));
+
+ pending_IOOpStats.data[io_context].data[io_object].data[io_op]++;
+
+ have_iostats = true;
+}
+
+PgStat_IO *
+pgstat_fetch_stat_io(void)
+{
+ pgstat_snapshot_fixed(PGSTAT_KIND_IO);
+
+ return &pgStatLocal.snapshot.io;
+}
+
+/*
+ * Flush out locally pending IO statistics
+ *
+ * If no stats have been recorded, this function returns false.
+ *
+ * If nowait is true and the lock could not be acquired, this function
+ * returns true without flushing anything; otherwise it returns false.
+ */
+bool
+pgstat_flush_io(bool nowait)
+{
+ LWLock *bktype_lock;
+ PgStat_IOContextOps *bktype_shstats;
+
+ if (!have_iostats)
+ return false;
+
+ bktype_lock = &pgStatLocal.shmem->io.locks[MyBackendType];
+ bktype_shstats =
+ &pgStatLocal.shmem->io.stats[MyBackendType];
+
+ if (!nowait)
+ LWLockAcquire(bktype_lock, LW_EXCLUSIVE);
+ else if (!LWLockConditionalAcquire(bktype_lock, LW_EXCLUSIVE))
+ return true;
+
+ for (IOContext io_context = IOCONTEXT_BULKREAD;
+ io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ for (IOObject io_object = IOOBJECT_RELATION;
+ io_object < IOOBJECT_NUM_TYPES; io_object++)
+ for (IOOp io_op = IOOP_EVICT;
+ io_op < IOOP_NUM_TYPES; io_op++)
+ bktype_shstats->data[io_context].data[io_object].data[io_op] +=
+ pending_IOOpStats.data[io_context].data[io_object].data[io_op];
+
+ Assert(pgstat_bktype_io_stats_valid(bktype_shstats, MyBackendType));
+
+ LWLockRelease(bktype_lock);
+
+ memset(&pending_IOOpStats, 0, sizeof(pending_IOOpStats));
+
+ have_iostats = false;
+
+ return false;
+}
+
+const char *
+pgstat_get_io_context_name(IOContext io_context)
+{
+ switch (io_context)
+ {
+ case IOCONTEXT_BULKREAD:
+ return "bulkread";
+ case IOCONTEXT_BULKWRITE:
+ return "bulkwrite";
+ case IOCONTEXT_NORMAL:
+ return "normal";
+ case IOCONTEXT_VACUUM:
+ return "vacuum";
+ }
+
+ elog(ERROR, "unrecognized IOContext value: %d", io_context);
+ pg_unreachable();
+}
+
+const char *
+pgstat_get_io_object_name(IOObject io_object)
+{
+ switch (io_object)
+ {
+ case IOOBJECT_RELATION:
+ return "relation";
+ case IOOBJECT_TEMP_RELATION:
+ return "temp relation";
+ }
+
+ elog(ERROR, "unrecognized IOObject value: %d", io_object);
+ pg_unreachable();
+}
+
+const char *
+pgstat_get_io_op_name(IOOp io_op)
+{
+ switch (io_op)
+ {
+ case IOOP_EVICT:
+ return "evicted";
+ case IOOP_EXTEND:
+ return "extended";
+ case IOOP_FSYNC:
+ return "files synced";
+ case IOOP_READ:
+ return "read";
+ case IOOP_REUSE:
+ return "reused";
+ case IOOP_WRITE:
+ return "written";
+ }
+
+ elog(ERROR, "unrecognized IOOp value: %d", io_op);
+ pg_unreachable();
+}
+
+void
+pgstat_io_reset_all_cb(TimestampTz ts)
+{
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ LWLock *bktype_lock = &pgStatLocal.shmem->io.locks[i];
+ PgStat_IOContextOps *bktype_shstats = &pgStatLocal.shmem->io.stats[i];
+
+ LWLockAcquire(bktype_lock, LW_EXCLUSIVE);
+
+ /*
+ * Use the lock in the first BackendType's PgStat_IOContextOps to
+ * protect the reset timestamp as well.
+ */
+ if (i == 0)
+ pgStatLocal.shmem->io.stat_reset_timestamp = ts;
+
+ memset(bktype_shstats, 0, sizeof(*bktype_shstats));
+ LWLockRelease(bktype_lock);
+ }
+}
+
+void
+pgstat_io_snapshot_cb(void)
+{
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ LWLock *bktype_lock = &pgStatLocal.shmem->io.locks[i];
+ PgStat_IOContextOps *bktype_shstats = &pgStatLocal.shmem->io.stats[i];
+ PgStat_IOContextOps *bktype_snap = &pgStatLocal.snapshot.io.stats[i];
+
+ LWLockAcquire(bktype_lock, LW_SHARED);
+
+ /*
+ * Use the lock in the first BackendType's PgStat_IOContextOps to
+ * protect the reset timestamp as well.
+ */
+ if (i == 0)
+ pgStatLocal.snapshot.io.stat_reset_timestamp =
+ pgStatLocal.shmem->io.stat_reset_timestamp;
+
+ /* using struct assignment due to better type safety */
+ *bktype_snap = *bktype_shstats;
+ LWLockRelease(bktype_lock);
+ }
+}
+
+/*
+ * IO statistics are not collected for all BackendTypes.
+ *
+ * The following BackendTypes do not participate in the cumulative stats
+ * subsystem or do not perform IO that we currently track:
+ * - Syslogger, because it is not connected to shared memory
+ * - Archiver, because most relevant archiving IO is delegated to a
+ *   specialized command or module
+ * - WAL Receiver and WAL Writer, whose IO is not tracked in pg_stat_io for now
+ *
+ * Returns true if the given BackendType participates in the cumulative stats
+ * subsystem for IO, and false otherwise.
+ */
+bool
+pgstat_tracks_io_bktype(BackendType bktype)
+{
+ /*
+ * List every type so that new backend types trigger a warning about
+ * needing to adjust this switch.
+ */
+ switch (bktype)
+ {
+ case B_INVALID:
+ case B_ARCHIVER:
+ case B_LOGGER:
+ case B_WAL_RECEIVER:
+ case B_WAL_WRITER:
+ return false;
+
+ case B_AUTOVAC_LAUNCHER:
+ case B_AUTOVAC_WORKER:
+ case B_BACKEND:
+ case B_BG_WORKER:
+ case B_BG_WRITER:
+ case B_CHECKPOINTER:
+ case B_STANDALONE_BACKEND:
+ case B_STARTUP:
+ case B_WAL_SENDER:
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * Some BackendTypes do not perform IO in certain IOContexts. Some IOObjects
+ * are never operated on in some IOContexts. Check that the given BackendType
+ * is expected to do IO in the given IOContext and that the given IOObject is
+ * expected to be operated on in the given IOContext.
+ */
+bool
+pgstat_tracks_io_object(BackendType bktype, IOContext io_context,
+ IOObject io_object)
+{
+ bool no_temp_rel;
+
+ /*
+ * Some BackendTypes should never track IO statistics.
+ */
+ if (!pgstat_tracks_io_bktype(bktype))
+ return false;
+
+ /*
+ * Currently, IO on temporary relations can only occur in the
+ * IOCONTEXT_NORMAL IOContext.
+ */
+ if (io_context != IOCONTEXT_NORMAL &&
+ io_object == IOOBJECT_TEMP_RELATION)
+ return false;
+
+ /*
+ * In core Postgres, only regular backends and WAL Sender processes
+ * executing queries will use local buffers and operate on temporary
+ * relations. Parallel workers will not use local buffers (see
+ * InitLocalBuffers()); however, extensions leveraging background workers
+ * have no such limitation, so track IO on IOOBJECT_TEMP_RELATION for
+ * BackendType B_BG_WORKER.
+ */
+ no_temp_rel = bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER ||
+ bktype == B_CHECKPOINTER || bktype == B_AUTOVAC_WORKER ||
+ bktype == B_STANDALONE_BACKEND || bktype == B_STARTUP;
+
+ if (no_temp_rel && io_context == IOCONTEXT_NORMAL &&
+ io_object == IOOBJECT_TEMP_RELATION)
+ return false;
+
+ /*
+ * Some BackendTypes do not currently perform any IO in certain
+ * IOContexts, and, while it may not be inherently incorrect for them to
+ * do so, excluding those rows from the view makes the view easier to use.
+ */
+ if ((bktype == B_CHECKPOINTER || bktype == B_BG_WRITER) &&
+ (io_context == IOCONTEXT_BULKREAD ||
+ io_context == IOCONTEXT_BULKWRITE ||
+ io_context == IOCONTEXT_VACUUM))
+ return false;
+
+ if (bktype == B_AUTOVAC_LAUNCHER && io_context == IOCONTEXT_VACUUM)
+ return false;
+
+ if ((bktype == B_AUTOVAC_WORKER || bktype == B_AUTOVAC_LAUNCHER) &&
+ io_context == IOCONTEXT_BULKWRITE)
+ return false;
+
+ return true;
+}
+
+/*
+ * Some BackendTypes will never do certain IOOps and some IOOps should not
+ * occur in certain IOContexts. Check that the given IOOp is valid for the
+ * given BackendType in the given IOContext. Note that there are currently no
+ * cases of an IOOp being invalid for a particular BackendType only within a
+ * certain IOContext.
+ */
+bool
+pgstat_tracks_io_op(BackendType bktype, IOContext io_context,
+ IOObject io_object, IOOp io_op)
+{
+ bool strategy_io_context;
+
+ /* if (io_context, io_object) will never collect stats, we're done */
+ if (!pgstat_tracks_io_object(bktype, io_context, io_object))
+ return false;
+
+ /*
+ * Some BackendTypes will not do certain IOOps.
+ */
+ if ((bktype == B_BG_WRITER || bktype == B_CHECKPOINTER) &&
+ (io_op == IOOP_READ || io_op == IOOP_EVICT))
+ return false;
+
+ if ((bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER ||
+ bktype == B_CHECKPOINTER) && io_op == IOOP_EXTEND)
+ return false;
+
+ /*
+ * Some IOOps are not valid in certain IOContexts and some IOOps are only
+ * valid in certain contexts.
+ */
+ if (io_context == IOCONTEXT_BULKREAD && io_op == IOOP_EXTEND)
+ return false;
+
+ strategy_io_context = io_context == IOCONTEXT_BULKREAD ||
+ io_context == IOCONTEXT_BULKWRITE || io_context == IOCONTEXT_VACUUM;
+
+ /*
+ * IOOP_REUSE is only relevant when a BufferAccessStrategy is in use.
+ */
+ if (!strategy_io_context && io_op == IOOP_REUSE)
+ return false;
+
+ /*
+ * IOOP_FSYNC IOOps done by a backend using a BufferAccessStrategy are
+ * counted in the IOCONTEXT_NORMAL IOContext. See comment in
+ * ForwardSyncRequest() for more details.
+ */
+ if (strategy_io_context && io_op == IOOP_FSYNC)
+ return false;
+
+ /*
+ * Temporary tables are not logged and thus do not require fsync'ing.
+ */
+ if (io_context == IOCONTEXT_NORMAL &&
+ io_object == IOOBJECT_TEMP_RELATION && io_op == IOOP_FSYNC)
+ return false;
+
+ return true;
+}
+
+/*
+ * Check that stats have not been counted for any combination of IOContext,
+ * IOObject, and IOOp which are not tracked for the passed-in BackendType. The
+ * passed-in PgStat_IOContextOps must contain stats for the
+ * BackendType specified by the second parameter. Caller is responsible for
+ * locking of the passed-in PgStat_IOContextOps, if needed.
+ */
+bool
+pgstat_bktype_io_stats_valid(PgStat_IOContextOps *context_ops,
+ BackendType bktype)
+{
+ bool bktype_tracked = pgstat_tracks_io_bktype(bktype);
+
+ for (IOContext io_context = IOCONTEXT_BULKREAD;
+ io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ {
+ for (IOObject io_object = IOOBJECT_RELATION;
+ io_object < IOOBJECT_NUM_TYPES; io_object++)
+ {
+ for (IOOp io_op = IOOP_EVICT; io_op < IOOP_NUM_TYPES; io_op++)
+ {
+ if ((!bktype_tracked || !pgstat_tracks_io_op(bktype, io_context, io_object, io_op)) &&
+ context_ops->data[io_context].data[io_object].data[io_op] != 0)
+ return false;
+ }
+ }
+ }
+
+ return true;
+}
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index 2e20b93c20..f793ac1516 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -206,7 +206,7 @@ pgstat_drop_relation(Relation rel)
}
/*
- * Report that the table was just vacuumed.
+ * Report that the table was just vacuumed and flush IO statistics.
*/
void
pgstat_report_vacuum(Oid tableoid, bool shared,
@@ -258,10 +258,18 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
}
pgstat_unlock_entry(entry_ref);
+
+ /*
+ * Flush IO statistics now. pgstat_report_stat() will also flush IO
+ * stats, but it will not be called until after an entire autovacuum
+ * cycle is done -- which will likely vacuum many relations -- or until
+ * the VACUUM command has processed all tables and committed.
+ */
+ pgstat_flush_io(false);
}
/*
- * Report that the table was just analyzed.
+ * Report that the table was just analyzed and flush IO statistics.
*
* Caller must provide new live- and dead-tuples estimates, as well as a
* flag indicating whether to reset the mod_since_analyze counter.
@@ -341,6 +349,9 @@ pgstat_report_analyze(Relation rel,
}
pgstat_unlock_entry(entry_ref);
+
+ /* see pgstat_report_vacuum() */
+ pgstat_flush_io(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_shmem.c b/src/backend/utils/activity/pgstat_shmem.c
index c1506b53d0..09fffd0e82 100644
--- a/src/backend/utils/activity/pgstat_shmem.c
+++ b/src/backend/utils/activity/pgstat_shmem.c
@@ -202,6 +202,10 @@ StatsShmemInit(void)
LWLockInitialize(&ctl->checkpointer.lock, LWTRANCHE_PGSTATS_DATA);
LWLockInitialize(&ctl->slru.lock, LWTRANCHE_PGSTATS_DATA);
LWLockInitialize(&ctl->wal.lock, LWTRANCHE_PGSTATS_DATA);
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ LWLockInitialize(&ctl->io.locks[i],
+ LWTRANCHE_PGSTATS_DATA);
}
else
{
diff --git a/src/backend/utils/activity/pgstat_wal.c b/src/backend/utils/activity/pgstat_wal.c
index e7a82b5fed..e8598b2f4e 100644
--- a/src/backend/utils/activity/pgstat_wal.c
+++ b/src/backend/utils/activity/pgstat_wal.c
@@ -34,7 +34,7 @@ static WalUsage prevWalUsage;
/*
* Calculate how much WAL usage counters have increased and update
- * shared statistics.
+ * shared WAL and IO statistics.
*
* Must be called by processes that generate WAL, that do not call
* pgstat_report_stat(), like walwriter.
@@ -43,6 +43,8 @@ void
pgstat_report_wal(bool force)
{
pgstat_flush_wal(force);
+
+ pgstat_flush_io(force);
}
/*
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 58bd1360b9..42b890b806 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1593,6 +1593,8 @@ pg_stat_reset_shared(PG_FUNCTION_ARGS)
pgstat_reset_of_kind(PGSTAT_KIND_BGWRITER);
pgstat_reset_of_kind(PGSTAT_KIND_CHECKPOINTER);
}
+ else if (strcmp(target, "io") == 0)
+ pgstat_reset_of_kind(PGSTAT_KIND_IO);
else if (strcmp(target, "recovery_prefetch") == 0)
XLogPrefetchResetStats();
else if (strcmp(target, "wal") == 0)
@@ -1601,7 +1603,7 @@ pg_stat_reset_shared(PG_FUNCTION_ARGS)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("unrecognized reset target: \"%s\"", target),
- errhint("Target must be \"archiver\", \"bgwriter\", \"recovery_prefetch\", or \"wal\".")));
+ errhint("Target must be \"archiver\", \"bgwriter\", \"io\", \"recovery_prefetch\", or \"wal\".")));
PG_RETURN_VOID();
}
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 0ffeefc437..0aaf600a78 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -331,6 +331,8 @@ typedef enum BackendType
B_WAL_WRITER,
} BackendType;
+#define BACKEND_NUM_TYPES (B_WAL_WRITER + 1)
+
extern PGDLLIMPORT BackendType MyBackendType;
extern const char *GetBackendTypeDesc(BackendType backendType);
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 5e3326a3b9..adf54b2e27 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -48,6 +48,7 @@ typedef enum PgStat_Kind
PGSTAT_KIND_ARCHIVER,
PGSTAT_KIND_BGWRITER,
PGSTAT_KIND_CHECKPOINTER,
+ PGSTAT_KIND_IO,
PGSTAT_KIND_SLRU,
PGSTAT_KIND_WAL,
} PgStat_Kind;
@@ -276,6 +277,66 @@ typedef struct PgStat_CheckpointerStats
PgStat_Counter buf_fsync_backend;
} PgStat_CheckpointerStats;
+
+/*
+ * Types related to counting IO for various IO Contexts. When adding a new
+ * value, ensure that the proper paths are added to pgstat_iszero_io_object()
+ * and pgstat_iszero_io_op() (though the compiler will remind you about the
+ * latter).
+ */
+
+typedef enum IOContext
+{
+ IOCONTEXT_BULKREAD,
+ IOCONTEXT_BULKWRITE,
+ IOCONTEXT_NORMAL,
+ IOCONTEXT_VACUUM,
+} IOContext;
+
+#define IOCONTEXT_NUM_TYPES (IOCONTEXT_VACUUM + 1)
+
+typedef enum IOObject
+{
+ IOOBJECT_RELATION,
+ IOOBJECT_TEMP_RELATION,
+} IOObject;
+
+#define IOOBJECT_NUM_TYPES (IOOBJECT_TEMP_RELATION + 1)
+
+typedef enum IOOp
+{
+ IOOP_EVICT,
+ IOOP_EXTEND,
+ IOOP_FSYNC,
+ IOOP_READ,
+ IOOP_REUSE,
+ IOOP_WRITE,
+} IOOp;
+
+#define IOOP_NUM_TYPES (IOOP_WRITE + 1)
+
+typedef struct PgStat_IOOps
+{
+ PgStat_Counter data[IOOP_NUM_TYPES];
+} PgStat_IOOps;
+
+typedef struct PgStat_IOObjectOps
+{
+ PgStat_IOOps data[IOOBJECT_NUM_TYPES];
+} PgStat_IOObjectOps;
+
+typedef struct PgStat_IOContextOps
+{
+ PgStat_IOObjectOps data[IOCONTEXT_NUM_TYPES];
+} PgStat_IOContextOps;
+
+typedef struct PgStat_IO
+{
+ TimestampTz stat_reset_timestamp;
+ PgStat_IOContextOps stats[BACKEND_NUM_TYPES];
+} PgStat_IO;
+
+
typedef struct PgStat_StatDBEntry
{
PgStat_Counter xact_commit;
@@ -453,6 +514,23 @@ extern void pgstat_report_checkpointer(void);
extern PgStat_CheckpointerStats *pgstat_fetch_stat_checkpointer(void);
+/*
+ * Functions in pgstat_io.c
+ */
+
+extern void pgstat_count_io_op(IOOp io_op, IOObject io_object, IOContext io_context);
+extern PgStat_IO *pgstat_fetch_stat_io(void);
+extern const char *pgstat_get_io_context_name(IOContext io_context);
+extern const char *pgstat_get_io_object_name(IOObject io_object);
+extern const char *pgstat_get_io_op_name(IOOp io_op);
+
+extern bool pgstat_tracks_io_bktype(BackendType bktype);
+extern bool pgstat_tracks_io_object(BackendType bktype,
+ IOContext io_context, IOObject io_object);
+extern bool pgstat_tracks_io_op(BackendType bktype, IOContext io_context,
+ IOObject io_object, IOOp io_op);
+
+
/*
* Functions in pgstat_database.c
*/
diff --git a/src/include/utils/pgstat_internal.h b/src/include/utils/pgstat_internal.h
index 12fd51f1ae..8f4f9b760c 100644
--- a/src/include/utils/pgstat_internal.h
+++ b/src/include/utils/pgstat_internal.h
@@ -329,6 +329,19 @@ typedef struct PgStatShared_Checkpointer
PgStat_CheckpointerStats reset_offset;
} PgStatShared_Checkpointer;
+/* shared version of PgStat_IO */
+typedef struct PgStatShared_IO
+{
+ /*
+ * locks[i] protects ->stats[i]. locks[0] also protects
+ * ->stat_reset_timestamp.
+ */
+ LWLock locks[BACKEND_NUM_TYPES];
+
+ TimestampTz stat_reset_timestamp;
+ PgStat_IOContextOps stats[BACKEND_NUM_TYPES];
+} PgStatShared_IO;
+
typedef struct PgStatShared_SLRU
{
/* lock protects ->stats */
@@ -419,6 +432,7 @@ typedef struct PgStat_ShmemControl
PgStatShared_Archiver archiver;
PgStatShared_BgWriter bgwriter;
PgStatShared_Checkpointer checkpointer;
+ PgStatShared_IO io;
PgStatShared_SLRU slru;
PgStatShared_Wal wal;
} PgStat_ShmemControl;
@@ -442,6 +456,8 @@ typedef struct PgStat_Snapshot
PgStat_CheckpointerStats checkpointer;
+ PgStat_IO io;
+
PgStat_SLRUStats slru[SLRU_NUM_ELEMENTS];
PgStat_WalStats wal;
@@ -549,6 +565,17 @@ extern void pgstat_database_reset_timestamp_cb(PgStatShared_Common *header, Time
extern bool pgstat_function_flush_cb(PgStat_EntryRef *entry_ref, bool nowait);
+/*
+ * Functions in pgstat_io.c
+ */
+
+extern void pgstat_io_reset_all_cb(TimestampTz ts);
+extern void pgstat_io_snapshot_cb(void);
+extern bool pgstat_flush_io(bool nowait);
+extern bool pgstat_bktype_io_stats_valid(PgStat_IOContextOps *context_ops,
+ BackendType bktype);
+
+
/*
* Functions in pgstat_relation.c
*/
@@ -643,6 +670,13 @@ extern void pgstat_create_transactional(PgStat_Kind kind, Oid dboid, Oid objoid)
extern PGDLLIMPORT PgStat_LocalState pgStatLocal;
+/*
+ * Variables in pgstat_io.c
+ */
+
+extern PGDLLIMPORT bool have_iostats;
+
+
/*
* Variables in pgstat_slru.c
*/
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 50d86cb01b..3a455311db 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1106,7 +1106,10 @@ ID
INFIX
INT128
INTERFACE_INFO
+IOContext
IOFuncSelector
+IOObject
+IOOp
IPCompareMethod
ITEM
IV
@@ -2010,6 +2013,7 @@ PgStatShared_Common
PgStatShared_Database
PgStatShared_Function
PgStatShared_HashEntry
+PgStatShared_IO
PgStatShared_Relation
PgStatShared_ReplSlot
PgStatShared_SLRU
@@ -2027,6 +2031,10 @@ PgStat_FetchConsistency
PgStat_FunctionCallUsage
PgStat_FunctionCounts
PgStat_HashKey
+PgStat_IO
+PgStat_IOContextOps
+PgStat_IOObjectOps
+PgStat_IOOps
PgStat_Kind
PgStat_KindInfo
PgStat_LocalState
--
2.38.1
Attachment: v43-0004-Add-system-view-tracking-IO-ops-per-backend-type.patch (application/octet-stream)
From bda748c9b87b71898f2d7e2a1a9f92d1d8b5dcbb Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 28 Dec 2022 12:09:15 -0800
Subject: [PATCH v43 4/4] Add system view tracking IO ops per backend type
Add pg_stat_io, a system view which tracks the number of IOOps
(evictions, reuses, reads, writes, extensions, and fsyncs) done on each
IOObject (relation, temp relation) in each IOContext ("normal" and those
using a BufferAccessStrategy) by each type of backend (e.g. client
backend, checkpointer).
Some BackendTypes do not accumulate IO operation statistics and will
not be included in the view.
Some IOContexts are not used by some BackendTypes and will not be in the
view. For example, checkpointer does not use a BufferAccessStrategy
(currently), so there will be no rows for BufferAccessStrategy
IOContexts for checkpointer.
Some IOObjects are never operated on in some IOContexts or by some
BackendTypes. These rows are omitted from the view. For example,
checkpointer will never operate on IOOBJECT_TEMP_RELATION data, so those
rows are omitted.
Some IOOps are invalid in combination with certain IOContexts and
certain IOObjects. Those cells will be NULL in the view to distinguish
between 0 observed IOOps of that type and an invalid combination. For
example, temporary tables are not fsynced so cells for all BackendTypes
for IOOBJECT_TEMP_RELATION and IOOP_FSYNC will be NULL.
Some BackendTypes never perform certain IOOps. Those cells will also be
NULL in the view. For example, bgwriter should not perform reads.
View stats are populated with statistics incremented when a backend
performs an IO Operation and maintained by the cumulative statistics
subsystem.
Each row of the view shows stats for a particular BackendType, IOObject,
IOContext combination (e.g. a client backend's operations on permanent
relations in shared buffers) and each column in the view is the total
number of IO Operations done (e.g. writes). So a cell in the view would
be, for example, the number of blocks of relation data written from
shared buffers by client backends since the last stats reset.
In anticipation of tracking WAL IO and non-block-oriented IO (such as
temporary file IO), the "op_bytes" column specifies the unit of the "read",
"written", and "extended" columns for a given row.
Note that some of the cells in the view are redundant with fields in
pg_stat_bgwriter (e.g. buffers_backend), however these have been kept in
pg_stat_bgwriter for backwards compatibility. Deriving the redundant
pg_stat_bgwriter stats from the IO operations stats structures was also
problematic due to the separate reset targets for 'bgwriter' and 'io'.
Suggested by Andres Freund
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Maciek Sakrejda <m.sakrejda@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
contrib/amcheck/expected/check_heap.out | 31 ++
contrib/amcheck/sql/check_heap.sql | 24 ++
doc/src/sgml/monitoring.sgml | 418 +++++++++++++++++++++++-
src/backend/catalog/system_views.sql | 15 +
src/backend/utils/adt/pgstatfuncs.c | 155 +++++++++
src/include/catalog/pg_proc.dat | 9 +
src/test/regress/expected/rules.out | 12 +
src/test/regress/expected/stats.out | 225 +++++++++++++
src/test/regress/sql/stats.sql | 138 ++++++++
src/tools/pgindent/typedefs.list | 1 +
10 files changed, 1014 insertions(+), 14 deletions(-)
diff --git a/contrib/amcheck/expected/check_heap.out b/contrib/amcheck/expected/check_heap.out
index c010361025..c44338fd6e 100644
--- a/contrib/amcheck/expected/check_heap.out
+++ b/contrib/amcheck/expected/check_heap.out
@@ -66,6 +66,19 @@ SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'ALL-VISIBLE');
INSERT INTO heaptest (a, b)
(SELECT gs, repeat('x', gs)
FROM generate_series(1,50) gs);
+-- pg_stat_io test:
+-- verify_heapam always uses a BAS_BULKREAD BufferAccessStrategy. This allows
+-- us to reliably test that pg_stat_io BULKREAD reads are being captured
+-- without relying on the size of shared buffers or on an expensive operation
+-- like CREATE DATABASE.
+--
+-- Create an alternative tablespace and move the heaptest table to it, causing
+-- it to be rewritten.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE test_stats LOCATION '';
+SELECT sum(read) AS stats_bulkreads_before
+ FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+ALTER TABLE heaptest SET TABLESPACE test_stats;
-- Check that valid options are not rejected nor corruption reported
-- for a non-empty table
SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'none');
@@ -88,6 +101,23 @@ SELECT * FROM verify_heapam(relation := 'heaptest', startblock := 0, endblock :=
-------+--------+--------+-----
(0 rows)
+-- verify_heapam should have read in the page written out by
+-- ALTER TABLE ... SET TABLESPACE ...
+-- causing an additional bulkread, which should be reflected in pg_stat_io.
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(read) AS stats_bulkreads_after
+ FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+SELECT :stats_bulkreads_after > :stats_bulkreads_before;
+ ?column?
+----------
+ t
+(1 row)
+
CREATE ROLE regress_heaptest_role;
-- verify permissions are checked (error due to function not callable)
SET ROLE regress_heaptest_role;
@@ -195,6 +225,7 @@ ERROR: cannot check relation "test_foreign_table"
DETAIL: This operation is not supported for foreign tables.
-- cleanup
DROP TABLE heaptest;
+DROP TABLESPACE test_stats;
DROP TABLE test_partition;
DROP TABLE test_partitioned;
DROP OWNED BY regress_heaptest_role; -- permissions
diff --git a/contrib/amcheck/sql/check_heap.sql b/contrib/amcheck/sql/check_heap.sql
index 298de6886a..210f9b22e2 100644
--- a/contrib/amcheck/sql/check_heap.sql
+++ b/contrib/amcheck/sql/check_heap.sql
@@ -20,11 +20,26 @@ SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'NONE');
SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'ALL-FROZEN');
SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'ALL-VISIBLE');
+
-- Add some data so subsequent tests are not entirely trivial
INSERT INTO heaptest (a, b)
(SELECT gs, repeat('x', gs)
FROM generate_series(1,50) gs);
+-- pg_stat_io test:
+-- verify_heapam always uses a BAS_BULKREAD BufferAccessStrategy. This allows
+-- us to reliably test that pg_stat_io BULKREAD reads are being captured
+-- without relying on the size of shared buffers or on an expensive operation
+-- like CREATE DATABASE.
+--
+-- Create an alternative tablespace and move the heaptest table to it, causing
+-- it to be rewritten.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE test_stats LOCATION '';
+SELECT sum(read) AS stats_bulkreads_before
+ FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+ALTER TABLE heaptest SET TABLESPACE test_stats;
+
-- Check that valid options are not rejected nor corruption reported
-- for a non-empty table
SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'none');
@@ -32,6 +47,14 @@ SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'all-frozen');
SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'all-visible');
SELECT * FROM verify_heapam(relation := 'heaptest', startblock := 0, endblock := 0);
+-- verify_heapam should have read in the page written out by
+-- ALTER TABLE ... SET TABLESPACE ...
+-- causing an additional bulkread, which should be reflected in pg_stat_io.
+SELECT pg_stat_force_next_flush();
+SELECT sum(read) AS stats_bulkreads_after
+ FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+SELECT :stats_bulkreads_after > :stats_bulkreads_before;
+
CREATE ROLE regress_heaptest_role;
-- verify permissions are checked (error due to function not callable)
@@ -110,6 +133,7 @@ SELECT * FROM verify_heapam('test_foreign_table',
-- cleanup
DROP TABLE heaptest;
+DROP TABLESPACE test_stats;
DROP TABLE test_partition;
DROP TABLE test_partitioned;
DROP OWNED BY regress_heaptest_role; -- permissions
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 710bd2c52e..b27c6c7bc7 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -469,6 +469,15 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</entry>
</row>
+ <row>
+ <entry><structname>pg_stat_io</structname><indexterm><primary>pg_stat_io</primary></indexterm></entry>
+ <entry>One row per combination of backend type, IO context, and IO
+ object, showing statistics about backend IO operations. See
+ <link linkend="monitoring-pg-stat-io-view">
+ <structname>pg_stat_io</structname></link> for details.
+ </entry>
+ </row>
+
<row>
<entry><structname>pg_stat_replication_slots</structname><indexterm><primary>pg_stat_replication_slots</primary></indexterm></entry>
<entry>One row per replication slot, showing statistics about the
@@ -665,20 +674,20 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</para>
<para>
- The <structname>pg_statio_</structname> views are primarily useful to
- determine the effectiveness of the buffer cache. When the number
- of actual disk reads is much smaller than the number of buffer
- hits, then the cache is satisfying most read requests without
- invoking a kernel call. However, these statistics do not give the
- entire story: due to the way in which <productname>PostgreSQL</productname>
- handles disk I/O, data that is not in the
- <productname>PostgreSQL</productname> buffer cache might still reside in the
- kernel's I/O cache, and might therefore still be fetched without
- requiring a physical read. Users interested in obtaining more
- detailed information on <productname>PostgreSQL</productname> I/O behavior are
- advised to use the <productname>PostgreSQL</productname> statistics views
- in combination with operating system utilities that allow insight
- into the kernel's handling of I/O.
+ The <structname>pg_stat_io</structname> and
+ <structname>pg_statio_</structname> set of views are primarily useful to
+ determine the effectiveness of the buffer cache. When the number of actual
+ disk reads is much smaller than the number of buffer hits, then the cache is
+ satisfying most read requests without invoking a kernel call. However, these
+ statistics do not give the entire story: due to the way in which
+ <productname>PostgreSQL</productname> handles disk I/O, data that is not in
+ the <productname>PostgreSQL</productname> buffer cache might still reside in
+ the kernel's I/O cache, and might therefore still be fetched without
+ requiring a physical read. Users interested in obtaining more detailed
+ information on <productname>PostgreSQL</productname> I/O behavior are
+ advised to use the <productname>PostgreSQL</productname> statistics views in
+ combination with operating system utilities that allow insight into the
+ kernel's handling of I/O.
</para>
</sect2>
@@ -3628,6 +3637,387 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
<structfield>last_archived_wal</structfield> have also been successfully
archived.
</para>
+ </sect2>
+
+ <sect2 id="monitoring-pg-stat-io-view">
+ <title><structname>pg_stat_io</structname></title>
+
+ <indexterm>
+ <primary>pg_stat_io</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_stat_io</structname> view shows IO statistics.
+ Statistics are tracked separately for each backend type, IO context, and
+ IO object, with each combination returned as a separate row (combinations
+ that can never occur are omitted).
+ </para>
+
+ <para>
+ Currently, only IO on relations (e.g. tables, indexes) is tracked.
+ However, relation IO that bypasses shared buffers (e.g. when moving a
+ table from one tablespace to another) is currently not tracked.
+ </para>
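+
+ <para>
+ For example, to get an overview of the IO done by ordinary client
+ backends, a query along the following lines can be used (using the
+ columns defined below):
+<programlisting>
+SELECT backend_type, io_context, io_object, read, written, extended
+  FROM pg_stat_io
+ WHERE backend_type = 'client backend';
+</programlisting>
+ </para>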
+
+ <table id="pg-stat-io-view" xreflabel="pg_stat_io">
+ <title><structname>pg_stat_io</structname> View</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para>
+ </entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>backend_type</structfield> <type>text</type>
+ </para>
+ <para>
+ Type of backend (e.g. background worker, autovacuum worker).
+ See <link linkend="monitoring-pg-stat-activity-view">
+ <structname>pg_stat_activity</structname></link> for more information on
+ <varname>backend_type</varname>s. Some <varname>backend_type</varname>s
+ do not accumulate IO operation statistics and will not be included in
+ the view.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>io_context</structfield> <type>text</type>
+ </para>
+ <para>
+ The context of an IO operation or location of an IO object:
+ </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>normal</literal> refers to the default or standard type or
+ location of IO operations on IO objects.
+ </para>
+ <para>
+ Operations on temporary relations use a process-local buffer pool and
+ are counted as <varname>io_context</varname>
+ <literal>normal</literal>, <varname>io_object</varname>
+ <literal>temp relation</literal> operations.
+ </para>
+ <para>
+ IO operations on permanent relations are done by default in shared
+ buffers. These are tracked in <varname>io_context</varname>
+ <literal>normal</literal>, <varname>io_object</varname>
+ <literal>relation</literal>.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>vacuum</literal> refers to the IO operations incurred while
+ vacuuming and analyzing permanent relations.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>bulkread</literal> refers to IO operations on permanent
+ relations specially designated as <literal>bulkreads</literal>, such
+ as the sequential scan of a large table.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>bulkwrite</literal> refers to IO operations on permanent
+ relations specially designated as <literal>bulkwrites</literal>,
+ such as <command>COPY</command>.
+ </para>
+ </listitem>
+ </itemizedlist>
+ <para>
+ These last three <varname>io_context</varname>s are counted separately
+ because the autovacuum daemon, explicit <command>VACUUM</command>,
+ explicit <command>ANALYZE</command>, many bulk reads, and many bulk
+ writes acquire a limited number of shared buffers and reuse them
+ circularly to avoid occupying an undue portion of the main shared
+ buffer pool. This pattern is called a <quote>Buffer Access
+ Strategy</quote> in the <productname>PostgreSQL</productname> source
+ code and the fixed-size ring buffer is referred to as a <quote>strategy
+ ring buffer</quote> for the purposes of this view's documentation.
+ These <varname>io_context</varname>s are referred to as <quote>strategy
+ contexts</quote>, and IO operations performed in strategy contexts are
+ referred to as <quote>strategy operations</quote>.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>io_object</structfield> <type>text</type>
+ </para>
+ <para>
+ Object operated on in a given <varname>io_context</varname> by a given
+ <varname>backend_type</varname>. Current values are
+ <literal>relation</literal>, which includes permanent relations, and
+ <literal>temp relation</literal>, which includes temporary relations
+ created by <command>CREATE TEMPORARY TABLE...</command>.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>read</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Reads by a given <varname>backend_type</varname> of a given
+ <varname>io_object</varname> into buffers in a given
+ <varname>io_context</varname>.
+ </para>
+ <para>
+ Note that the sum of
+ <varname>heap_blks_read</varname>,
+ <varname>idx_blks_read</varname>,
+ <varname>tidx_blks_read</varname>, and
+ <varname>toast_blks_read</varname>
+ in <link linkend="monitoring-pg-statio-all-tables-view">
+ <structname>pg_statio_all_tables</structname></link> as well as
+ <varname>blks_read</varname> in <link
+ linkend="monitoring-pg-stat-database-view">
+ <structname>pg_stat_database</structname></link> are both similar to
+ <varname>read</varname> plus <varname>extended</varname> for all
+ <varname>io_context</varname>s for the following
+ <varname>backend_type</varname>s in <structname>pg_stat_io</structname>:
+ <itemizedlist>
+ <listitem><para><literal>autovacuum launcher</literal></para></listitem>
+ <listitem><para><literal>autovacuum worker</literal></para></listitem>
+ <listitem><para><literal>client backend</literal></para></listitem>
+ <listitem><para><literal>standalone backend</literal></para></listitem>
+ <listitem><para><literal>background worker</literal></para></listitem>
+ <listitem><para><literal>walsender</literal></para></listitem>
+ </itemizedlist>
+ The difference is that reads done as part of <command>CREATE
+ DATABASE</command> are not counted in
+ <structname>pg_statio_all_tables</structname> and
+ <structname>pg_stat_database</structname>.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>written</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Writes by a given <varname>backend_type</varname> of a given
+ <varname>io_object</varname> of data from a given
+ <varname>io_context</varname>.
+ </para>
+ <para>
+ Client backends should normally be able to rely on auxiliary processes
+ like the checkpointer and the background writer to write out dirty data
+ as much as possible. Large numbers of writes by
+ <varname>backend_type</varname> <literal>client backend</literal> in
+ <varname>io_context</varname> <literal>normal</literal> and
+ <varname>io_object</varname> <literal>relation</literal> could indicate
+ a misconfiguration of shared buffers or of the checkpointer. More
+ information on checkpointer configuration can be found in <xref
+ linkend="wal-configuration"/>.
+ </para>
+ <para>
+ Note that the values of <varname>written</varname> for
+ <varname>backend_type</varname> <literal>background writer</literal> and
+ <varname>backend_type</varname> <literal>checkpointer</literal>
+ correspond to the values of <varname>buffers_clean</varname> and
+ <varname>buffers_checkpoint</varname>, respectively, in <link
+ linkend="monitoring-pg-stat-bgwriter-view">
+ <structname>pg_stat_bgwriter</structname></link>.
+ <varname>buffers_backend</varname> in
+ <structname>pg_stat_bgwriter</structname> corresponds to
+ <structname>pg_stat_io</structname>'s <varname>written</varname> plus
+ <varname>extended</varname> for <varname>io_context</varname>s
+ <literal>normal</literal>, <literal>bulkread</literal>,
+ <literal>bulkwrite</literal>, and <literal>vacuum</literal> on
+ <varname>io_object</varname> <literal>relation</literal> for
+ <varname>backend_type</varname>s:
+ <itemizedlist>
+ <listitem><para><literal>client backend</literal></para></listitem>
+ <listitem><para><literal>autovacuum worker</literal></para></listitem>
+ <listitem><para><literal>background worker</literal></para></listitem>
+ <listitem><para><literal>walsender</literal></para></listitem>
+ </itemizedlist>
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>extended</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Extends of relations done by a given <varname>backend_type</varname> in
+ order to write data for a given <varname>io_object</varname> in a given
+ <varname>io_context</varname>.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>op_bytes</structfield> <type>bigint</type>
+ </para>
+ <para>
+ The number of bytes per unit of IO read, written, or extended. For
+ block-oriented IO of relation data, reads, writes, and extends are done
+ in <varname>block_size</varname> units, derived from the build-time
+ parameter <symbol>BLCKSZ</symbol>, which is <literal>8192</literal> by
+ default.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>evicted</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of times a <varname>backend_type</varname> has evicted a block
+ from a shared or local buffer in order to reuse the buffer in this
+ <varname>io_context</varname>. Blocks are only evicted when there are no
+ unoccupied buffers.
+ </para>
+ <para>
+ <varname>evicted</varname> in <varname>io_context</varname>
+ <literal>normal</literal> and <varname>io_object</varname>
+ <literal>relation</literal> counts the number of times a block from a
+ shared buffer was evicted so that it can be replaced with another block,
+ also in shared buffers.
+ </para>
+ <para>
+ A high <varname>evicted</varname> count in <varname>io_context</varname>
+ <literal>normal</literal> and <varname>io_object</varname>
+ <literal>relation</literal> could indicate that shared buffers is too
+ small and should be set to a larger value.
+ </para>
+ <para>
+ <varname>evicted</varname> in <varname>io_context</varname>
+ <literal>vacuum</literal>, <literal>bulkread</literal>, and
+ <literal>bulkwrite</literal> counts the number of times occupied shared
+ buffers were added to the size-limited strategy ring buffer, causing the
+ buffer contents to be evicted. If the to-be-used buffer in the ring is
+ pinned or in use by another backend, it may be replaced by a new shared
+ buffer. If this shared buffer contains valid data, that block must be
+ evicted and will count as <varname>evicted</varname>.
+ </para>
+ <para>
+ A large <varname>evicted</varname> count in strategy
+ <varname>io_context</varname>s can provide insight into cache misses
+ on the primary working set.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>reused</structfield> <type>bigint</type>
+ </para>
+ <para>
+ The number of times an existing buffer in the strategy ring was reused
+ as part of an operation in the <literal>bulkread</literal>,
+ <literal>bulkwrite</literal>, or <literal>vacuum</literal>
+ <varname>io_context</varname>s. When a Buffer Access Strategy reuses a
+ buffer in the strategy ring, it evicts the buffer contents, incrementing
+ <varname>reused</varname>. When a Buffer Access Strategy adds a new
+ shared buffer to the strategy ring and this shared buffer is occupied,
+ the Buffer Access Strategy must evict the contents of the shared buffer,
+ incrementing <varname>evicted</varname>.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>files_synced</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of files <literal>fsync</literal>ed by a given
+ <varname>backend_type</varname> for the purpose of persisting data from
+ a given <varname>io_object</varname> dirtied in a given
+ <varname>io_context</varname>. <literal>fsync</literal>s are issued per
+ segment file rather than per block, so <varname>op_bytes</varname> does
+ not apply to the <varname>files_synced</varname> column.
+
+ <literal>fsync</literal>s are always tracked in
+ <varname>io_context</varname> <literal>normal</literal>.
+ </para>
+ <para>
+ Normally client backends rely on the checkpointer to ensure data is
+ persisted to permanent storage. Large numbers of
+ <varname>files_synced</varname> by <varname>backend_type</varname>
+ <literal>client backend</literal> could indicate a misconfiguration of
+ shared buffers or of checkpointer. More information on checkpointer
+ configuration can be found in <xref linkend="wal-configuration"/>.
+ </para>
+ <para>
+ Note that the sum of <varname>files_synced</varname> in
+ <varname>io_context</varname> <literal>normal</literal> for
+ <varname>io_object</varname> <literal>relation</literal> across all
+ <varname>backend_type</varname>s except <literal>checkpointer</literal>
+ corresponds to <varname>buffers_backend_fsync</varname> in <link
+ linkend="monitoring-pg-stat-bgwriter-view">
+ <structname>pg_stat_bgwriter</structname></link>.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
+ </para>
+ <para>
+ Time at which these statistics were last reset.
+ </para>
+ </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ Some <varname>backend_type</varname>s do not perform IO operations in some
+ <varname>io_context</varname>s and/or <varname>io_object</varname>s. These
+ rows are omitted from the view. For example, the checkpointer does not use
+ a Buffer Access Strategy, so there will be no rows for
+ <varname>backend_type</varname> <literal>checkpointer</literal> in any of
+ the strategy <varname>io_context</varname>s.
+
+ On a more granular level, some IO operations are invalid in combination
+ with certain <varname>io_context</varname>s and
+ <varname>io_object</varname>s. Those cells will be NULL to distinguish
+ between 0 observed IO operations of that type and an invalid
+ combination. For example, temporary tables are never
+ <literal>fsync</literal>ed, so the <varname>files_synced</varname> cells
+ for <varname>io_object</varname> <literal>temp relation</literal> in
+ <varname>io_context</varname> <literal>normal</literal> will be NULL for
+ all <varname>backend_type</varname>s. Some
+ <varname>backend_type</varname>s never perform certain IO operations.
+ Those cells will also be NULL in the view. For example, the
+ <literal>background writer</literal> does not perform reads.
+ </para>
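+
+ <para>
+ As a sketch of how this view relates to the older cumulative counters,
+ the following query computes a total comparable to
+ <varname>buffers_backend</varname> in
+ <structname>pg_stat_bgwriter</structname> (assuming the column names
+ defined above):
+<programlisting>
+SELECT sum(coalesce(written, 0) + coalesce(extended, 0))
+  FROM pg_stat_io
+ WHERE io_object = 'relation'
+   AND backend_type IN ('client backend', 'autovacuum worker',
+                        'background worker', 'walsender');
+</programlisting>
+ </para>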
</sect2>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 447c9b970f..71646f5aef 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1117,6 +1117,21 @@ CREATE VIEW pg_stat_bgwriter AS
pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+CREATE VIEW pg_stat_io AS
+SELECT
+ b.backend_type,
+ b.io_context,
+ b.io_object,
+ b.read,
+ b.written,
+ b.extended,
+ b.op_bytes,
+ b.evicted,
+ b.reused,
+ b.files_synced,
+ b.stats_reset
+FROM pg_stat_get_io() b;
+
CREATE VIEW pg_stat_wal AS
SELECT
w.wal_records,
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 42b890b806..0f8e48e44e 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1234,6 +1234,161 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
}
+/*
+ * When adding a new column to the pg_stat_io view, add a new enum value
+ * here above IO_NUM_COLUMNS.
+ */
+typedef enum io_stat_col
+{
+ IO_COL_BACKEND_TYPE,
+ IO_COL_IO_CONTEXT,
+ IO_COL_IO_OBJECT,
+ IO_COL_READS,
+ IO_COL_WRITES,
+ IO_COL_EXTENDS,
+ IO_COL_CONVERSION,
+ IO_COL_EVICTIONS,
+ IO_COL_REUSES,
+ IO_COL_FSYNCS,
+ IO_COL_RESET_TIME,
+ IO_NUM_COLUMNS,
+} io_stat_col;
+
+/*
+ * When adding a new IOOp, add a new io_stat_col and add a case to this
+ * function returning the corresponding io_stat_col.
+ */
+static io_stat_col
+pgstat_get_io_op_index(IOOp io_op)
+{
+ switch (io_op)
+ {
+ case IOOP_EVICT:
+ return IO_COL_EVICTIONS;
+ case IOOP_READ:
+ return IO_COL_READS;
+ case IOOP_REUSE:
+ return IO_COL_REUSES;
+ case IOOP_WRITE:
+ return IO_COL_WRITES;
+ case IOOP_EXTEND:
+ return IO_COL_EXTENDS;
+ case IOOP_FSYNC:
+ return IO_COL_FSYNCS;
+ }
+
+ elog(ERROR, "unrecognized IOOp value: %d", io_op);
+ pg_unreachable();
+}
+
+#ifdef USE_ASSERT_CHECKING
+static bool
+pgstat_iszero_io_object(const PgStat_IOOps *io_ops)
+{
+ for (IOOp io_op = IOOP_EVICT; io_op < IOOP_NUM_TYPES; io_op++)
+ {
+ if (io_ops->data[io_op] != 0)
+ return false;
+ }
+
+ return true;
+}
+#endif
+
+Datum
+pg_stat_get_io(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsinfo;
+ PgStat_IO *backends_io_stats;
+ Datum reset_time;
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+
+ backends_io_stats = pgstat_fetch_stat_io();
+
+ reset_time = TimestampTzGetDatum(backends_io_stats->stat_reset_timestamp);
+
+ for (BackendType bktype = B_INVALID; bktype < BACKEND_NUM_TYPES; bktype++)
+ {
+ bool bktype_tracked;
+ Datum bktype_desc = CStringGetTextDatum(GetBackendTypeDesc(bktype));
+ PgStat_IOContextOps *io_context_ops = &backends_io_stats->stats[bktype];
+
+ /*
+ * For those BackendTypes without IO Operation stats, skip
+ * representing them in the view altogether. We still loop through
+ * their counters so that we can assert that all values are zero.
+ */
+ bktype_tracked = pgstat_tracks_io_bktype(bktype);
+
+ for (IOContext io_context = IOCONTEXT_BULKREAD;
+ io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ {
+ const char *context_name = pgstat_get_io_context_name(io_context);
+ const PgStat_IOObjectOps *io_objs = &io_context_ops->data[io_context];
+
+ for (IOObject io_obj = IOOBJECT_RELATION;
+ io_obj < IOOBJECT_NUM_TYPES; io_obj++)
+ {
+ const PgStat_IOOps *object = &io_objs->data[io_obj];
+ const char *obj_name = pgstat_get_io_object_name(io_obj);
+
+ Datum values[IO_NUM_COLUMNS] = {0};
+ bool nulls[IO_NUM_COLUMNS] = {0};
+
+ /*
+ * Some combinations of IOContext, IOObject, and BackendType
+ * are not valid for any type of IOOp. In such cases, omit the
+ * entire row from the view.
+ */
+ if (!bktype_tracked ||
+ !pgstat_tracks_io_object(bktype, io_context, io_obj))
+ {
+ Assert(pgstat_iszero_io_object(object));
+ continue;
+ }
+
+ values[IO_COL_BACKEND_TYPE] = bktype_desc;
+ values[IO_COL_IO_CONTEXT] = CStringGetTextDatum(context_name);
+ values[IO_COL_IO_OBJECT] = CStringGetTextDatum(obj_name);
+ values[IO_COL_RESET_TIME] = TimestampTzGetDatum(reset_time);
+
+ /*
+ * Hard-code this to the value of BLCKSZ for now. Future
+ * values could include XLOG_BLCKSZ, once WAL IO is tracked,
+ * and constant multipliers, once non-block-oriented IO (e.g.
+ * temporary file IO) is tracked.
+ */
+ values[IO_COL_CONVERSION] = Int64GetDatum(BLCKSZ);
+
+ /*
+ * Some combinations of BackendType and IOOp, of IOContext and
+ * IOOp, and of IOObject and IOOp are not tracked. Set these
+ * cells in the view NULL and assert that these stats are zero
+ * as expected.
+ */
+ for (IOOp io_op = IOOP_EVICT; io_op < IOOP_NUM_TYPES; io_op++)
+ {
+ int col_idx = pgstat_get_io_op_index(io_op);
+
+ nulls[col_idx] = !pgstat_tracks_io_op(bktype, io_context, io_obj, io_op);
+
+ if (!nulls[col_idx])
+ values[col_idx] = Int64GetDatum(object->data[io_op]);
+ else
+ Assert(object->data[io_op] == 0);
+ }
+
+ tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
+ values, nulls);
+ }
+ }
+ }
+
+ return (Datum) 0;
+}
+
/*
* Returns statistics of WAL activity
*/
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 7be9a50147..782f27523f 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5686,6 +5686,15 @@
proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
+{ oid => '8459', descr => 'statistics: per backend type IO statistics',
+ proname => 'pg_stat_get_io', provolatile => 'v',
+ prorows => '30', proretset => 't',
+ proparallel => 'r', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,text,text,int8,int8,int8,int8,int8,int8,int8,timestamptz}',
+ proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}',
+ proargnames => '{backend_type,io_context,io_object,read,written,extended,op_bytes,evicted,reused,files_synced,stats_reset}',
+ prosrc => 'pg_stat_get_io' },
+
{ oid => '1136', descr => 'statistics: information about WAL activity',
proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
proparallel => 'r', prorettype => 'record', proargtypes => '',
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index fb9f936d43..2d0e7dc5c5 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1876,6 +1876,18 @@ pg_stat_gssapi| SELECT s.pid,
s.gss_enc AS encrypted
FROM pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid, query_id)
WHERE (s.client_port IS NOT NULL);
+pg_stat_io| SELECT b.backend_type,
+ b.io_context,
+ b.io_object,
+ b.read,
+ b.written,
+ b.extended,
+ b.op_bytes,
+ b.evicted,
+ b.reused,
+ b.files_synced,
+ b.stats_reset
+ FROM pg_stat_get_io() b(backend_type, io_context, io_object, read, written, extended, op_bytes, evicted, reused, files_synced, stats_reset);
pg_stat_progress_analyze| SELECT s.pid,
s.datid,
d.datname,
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index 1d84407a03..01070a53a4 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -1126,4 +1126,229 @@ SELECT pg_stat_get_subscription_stats(NULL);
(1 row)
+-- Test that the following operations are tracked in pg_stat_io:
+-- - reads of target blocks into shared buffers
+-- - writes of shared buffers to permanent storage
+-- - extends of relations using shared buffers
+-- - fsyncs done to ensure the durability of data dirtying shared buffers
+-- There is no test for blocks evicted from shared buffers, because we cannot
+-- be sure of the state of shared buffers at the point the test is run.
+-- Create a regular table and insert some data to generate IOCONTEXT_NORMAL
+-- extends.
+SELECT sum(extended) AS io_sum_shared_extends_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+CREATE TABLE test_io_shared(a int);
+INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(extended) AS io_sum_shared_extends_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT :io_sum_shared_extends_after > :io_sum_shared_extends_before;
+ ?column?
+----------
+ t
+(1 row)
+
+-- After a checkpoint, there should be some additional IOCONTEXT_NORMAL writes
+-- and fsyncs.
+-- The second checkpoint ensures that stats from the first checkpoint have been
+-- reported and protects against any potential races amongst the table
+-- creation, a possible timing-triggered checkpoint, and the explicit
+-- checkpoint in the test.
+SELECT sum(written) AS io_sum_shared_writes_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT sum(files_synced) AS io_sum_shared_fsyncs_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+CHECKPOINT;
+CHECKPOINT;
+SELECT sum(written) AS io_sum_shared_writes_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT sum(files_synced) AS io_sum_shared_fsyncs_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT :io_sum_shared_writes_after > :io_sum_shared_writes_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT current_setting('fsync') = 'off' OR :io_sum_shared_fsyncs_after > :io_sum_shared_fsyncs_before;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Change the tablespace so that the table is rewritten directly, then SELECT
+-- from it to cause it to be read back into shared buffers.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE test_io_shared_stats_tblspc LOCATION '';
+SELECT sum(read) AS io_sum_shared_reads_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+ALTER TABLE test_io_shared SET TABLESPACE test_io_shared_stats_tblspc;
+-- SELECT from the table so that it is read into shared buffers and io_context
+-- 'normal', io_object 'relation' reads are counted.
+SELECT COUNT(*) FROM test_io_shared;
+ count
+-------
+ 100
+(1 row)
+
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(read) AS io_sum_shared_reads_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT :io_sum_shared_reads_after > :io_sum_shared_reads_before;
+ ?column?
+----------
+ t
+(1 row)
+
+DROP TABLE test_io_shared;
+DROP TABLESPACE test_io_shared_stats_tblspc;
+-- Test that the following IOCONTEXT_LOCAL IOOps are tracked in pg_stat_io:
+-- - eviction of local buffers in order to reuse them
+-- - reads of temporary table blocks into local buffers
+-- - writes of local buffers to permanent storage
+-- - extends of temporary tables
+-- Set temp_buffers to a low value so that we can trigger writes with fewer
+-- inserted tuples. Do so in a new session in case temporary tables have been
+-- accessed by previous tests in this session.
+\c
+SET temp_buffers TO '1MB';
+CREATE TEMPORARY TABLE test_io_local(a int, b TEXT);
+SELECT sum(extended) AS io_sum_local_extends_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(evicted) AS io_sum_local_evictions_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(written) AS io_sum_local_writes_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+-- Insert tuples into the temporary table, generating extends in the stats.
+-- Insert enough values that we need to reuse and write out dirty local
+-- buffers, generating evictions and writes.
+INSERT INTO test_io_local SELECT generate_series(1, 8000) as id, repeat('a', 100);
+SELECT sum(read) AS io_sum_local_reads_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+-- Read in evicted buffers, generating reads.
+SELECT COUNT(*) FROM test_io_local;
+ count
+-------
+ 8000
+(1 row)
+
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(evicted) AS io_sum_local_evictions_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(read) AS io_sum_local_reads_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(written) AS io_sum_local_writes_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(extended) AS io_sum_local_extends_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT :io_sum_local_evictions_after > :io_sum_local_evictions_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_local_reads_after > :io_sum_local_reads_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_local_writes_after > :io_sum_local_writes_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_local_extends_after > :io_sum_local_extends_before;
+ ?column?
+----------
+ t
+(1 row)
+
+RESET temp_buffers;
+-- Test that reuse of strategy buffers and reads of blocks into these reused
+-- buffers while VACUUMing are tracked in pg_stat_io.
+-- Set wal_skip_threshold smaller than the expected size of
+-- test_io_vac_strategy so that, even if wal_level is minimal, VACUUM FULL will
+-- fsync the newly rewritten test_io_vac_strategy instead of writing it to WAL.
+-- Writing it to WAL will result in the newly written relation pages being in
+-- shared buffers -- preventing us from testing BAS_VACUUM BufferAccessStrategy
+-- reads.
+SET wal_skip_threshold = '1 kB';
+SELECT sum(reused) AS io_sum_vac_strategy_reuses_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(read) AS io_sum_vac_strategy_reads_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+CREATE TABLE test_io_vac_strategy(a int, b int) WITH (autovacuum_enabled = 'false');
+INSERT INTO test_io_vac_strategy SELECT i, i from generate_series(1, 8000)i;
+-- Ensure that the next VACUUM will need to perform IO by rewriting the table
+-- first with VACUUM (FULL).
+VACUUM (FULL) test_io_vac_strategy;
+VACUUM (PARALLEL 0) test_io_vac_strategy;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(reused) AS io_sum_vac_strategy_reuses_after FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(read) AS io_sum_vac_strategy_reads_after FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT :io_sum_vac_strategy_reads_after > :io_sum_vac_strategy_reads_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_vac_strategy_reuses_after > :io_sum_vac_strategy_reuses_before;
+ ?column?
+----------
+ t
+(1 row)
+
+RESET wal_skip_threshold;
+-- Test that extends done by a CTAS, which uses a BAS_BULKWRITE
+-- BufferAccessStrategy, are tracked in pg_stat_io.
+SELECT sum(extended) AS io_sum_bulkwrite_strategy_extends_before FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+CREATE TABLE test_io_bulkwrite_strategy AS SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(extended) AS io_sum_bulkwrite_strategy_extends_after FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+SELECT :io_sum_bulkwrite_strategy_extends_after > :io_sum_bulkwrite_strategy_extends_before;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Test IO stats reset
+SELECT sum(evicted) + sum(reused) + sum(extended) + sum(files_synced) + sum(read) + sum(written) AS io_stats_pre_reset FROM pg_stat_io \gset
+SELECT pg_stat_reset_shared('io');
+ pg_stat_reset_shared
+----------------------
+
+(1 row)
+
+SELECT sum(evicted) + sum(reused) + sum(extended) + sum(files_synced) + sum(read) + sum(written) AS io_stats_post_reset FROM pg_stat_io \gset
+SELECT :io_stats_post_reset < :io_stats_pre_reset;
+ ?column?
+----------
+ t
+(1 row)
+
-- End of Stats Test
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index b4d6753c71..962ae5b281 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -536,4 +536,142 @@ SELECT pg_stat_get_replication_slot(NULL);
SELECT pg_stat_get_subscription_stats(NULL);
+-- Test that the following operations are tracked in pg_stat_io:
+-- - reads of target blocks into shared buffers
+-- - writes of shared buffers to permanent storage
+-- - extends of relations using shared buffers
+-- - fsyncs done to ensure the durability of data dirtying shared buffers
+
+-- There is no test for blocks evicted from shared buffers, because we cannot
+-- be sure of the state of shared buffers at the point the test is run.
+
+-- Create a regular table and insert some data to generate IOCONTEXT_NORMAL
+-- extends.
+SELECT sum(extended) AS io_sum_shared_extends_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+CREATE TABLE test_io_shared(a int);
+INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+SELECT sum(extended) AS io_sum_shared_extends_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT :io_sum_shared_extends_after > :io_sum_shared_extends_before;
+
+-- After a checkpoint, there should be some additional IOCONTEXT_NORMAL writes
+-- and fsyncs.
+-- The second checkpoint ensures that stats from the first checkpoint have been
+-- reported and protects against any potential races amongst the table
+-- creation, a possible timing-triggered checkpoint, and the explicit
+-- checkpoint in the test.
+SELECT sum(written) AS io_sum_shared_writes_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT sum(files_synced) AS io_sum_shared_fsyncs_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+CHECKPOINT;
+CHECKPOINT;
+SELECT sum(written) AS io_sum_shared_writes_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT sum(files_synced) AS io_sum_shared_fsyncs_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+
+SELECT :io_sum_shared_writes_after > :io_sum_shared_writes_before;
+SELECT current_setting('fsync') = 'off' OR :io_sum_shared_fsyncs_after > :io_sum_shared_fsyncs_before;
+
+-- Change the tablespace so that the table is rewritten directly, then SELECT
+-- from it to cause it to be read back into shared buffers.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE test_io_shared_stats_tblspc LOCATION '';
+SELECT sum(read) AS io_sum_shared_reads_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+ALTER TABLE test_io_shared SET TABLESPACE test_io_shared_stats_tblspc;
+-- SELECT from the table so that it is read into shared buffers and io_context
+-- 'normal', io_object 'relation' reads are counted.
+SELECT COUNT(*) FROM test_io_shared;
+SELECT pg_stat_force_next_flush();
+SELECT sum(read) AS io_sum_shared_reads_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT :io_sum_shared_reads_after > :io_sum_shared_reads_before;
+DROP TABLE test_io_shared;
+DROP TABLESPACE test_io_shared_stats_tblspc;
+
+-- Test that the following IOCONTEXT_LOCAL IOOps are tracked in pg_stat_io:
+-- - eviction of local buffers in order to reuse them
+-- - reads of temporary table blocks into local buffers
+-- - writes of local buffers to permanent storage
+-- - extends of temporary tables
+
+-- Set temp_buffers to a low value so that we can trigger writes with fewer
+-- inserted tuples. Do so in a new session in case temporary tables have been
+-- accessed by previous tests in this session.
+\c
+SET temp_buffers TO '1MB';
+CREATE TEMPORARY TABLE test_io_local(a int, b TEXT);
+SELECT sum(extended) AS io_sum_local_extends_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(evicted) AS io_sum_local_evictions_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(written) AS io_sum_local_writes_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+-- Insert tuples into the temporary table, generating extends in the stats.
+-- Insert enough values that we need to reuse and write out dirty local
+-- buffers, generating evictions and writes.
+INSERT INTO test_io_local SELECT generate_series(1, 8000) as id, repeat('a', 100);
+
+SELECT sum(read) AS io_sum_local_reads_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+-- Read in evicted buffers, generating reads.
+SELECT COUNT(*) FROM test_io_local;
+SELECT pg_stat_force_next_flush();
+SELECT sum(evicted) AS io_sum_local_evictions_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(read) AS io_sum_local_reads_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(written) AS io_sum_local_writes_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(extended) AS io_sum_local_extends_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT :io_sum_local_evictions_after > :io_sum_local_evictions_before;
+SELECT :io_sum_local_reads_after > :io_sum_local_reads_before;
+SELECT :io_sum_local_writes_after > :io_sum_local_writes_before;
+SELECT :io_sum_local_extends_after > :io_sum_local_extends_before;
+RESET temp_buffers;
+
+-- Test that reuse of strategy buffers and reads of blocks into these reused
+-- buffers while VACUUMing are tracked in pg_stat_io.
+
+-- Set wal_skip_threshold smaller than the expected size of
+-- test_io_vac_strategy so that, even if wal_level is minimal, VACUUM FULL will
+-- fsync the newly rewritten test_io_vac_strategy instead of writing it to WAL.
+-- Writing it to WAL will result in the newly written relation pages being in
+-- shared buffers -- preventing us from testing BAS_VACUUM BufferAccessStrategy
+-- reads.
+SET wal_skip_threshold = '1 kB';
+SELECT sum(reused) AS io_sum_vac_strategy_reuses_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(read) AS io_sum_vac_strategy_reads_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+CREATE TABLE test_io_vac_strategy(a int, b int) WITH (autovacuum_enabled = 'false');
+INSERT INTO test_io_vac_strategy SELECT i, i from generate_series(1, 8000)i;
+-- Ensure that the next VACUUM will need to perform IO by rewriting the table
+-- first with VACUUM (FULL).
+VACUUM (FULL) test_io_vac_strategy;
+VACUUM (PARALLEL 0) test_io_vac_strategy;
+SELECT pg_stat_force_next_flush();
+SELECT sum(reused) AS io_sum_vac_strategy_reuses_after FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(read) AS io_sum_vac_strategy_reads_after FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT :io_sum_vac_strategy_reads_after > :io_sum_vac_strategy_reads_before;
+SELECT :io_sum_vac_strategy_reuses_after > :io_sum_vac_strategy_reuses_before;
+RESET wal_skip_threshold;
+
+-- Test that extends done by a CTAS, which uses a BAS_BULKWRITE
+-- BufferAccessStrategy, are tracked in pg_stat_io.
+SELECT sum(extended) AS io_sum_bulkwrite_strategy_extends_before FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+CREATE TABLE test_io_bulkwrite_strategy AS SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+SELECT sum(extended) AS io_sum_bulkwrite_strategy_extends_after FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+SELECT :io_sum_bulkwrite_strategy_extends_after > :io_sum_bulkwrite_strategy_extends_before;
+
+-- Test IO stats reset
+SELECT sum(evicted) + sum(reused) + sum(extended) + sum(files_synced) + sum(read) + sum(written) AS io_stats_pre_reset FROM pg_stat_io \gset
+SELECT pg_stat_reset_shared('io');
+SELECT sum(evicted) + sum(reused) + sum(extended) + sum(files_synced) + sum(read) + sum(written) AS io_stats_post_reset FROM pg_stat_io \gset
+SELECT :io_stats_post_reset < :io_stats_pre_reset;
+
-- End of Stats Test
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 3a455311db..9ce5092766 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3372,6 +3372,7 @@ intset_internal_node
intset_leaf_node
intset_node
intvKEY
+io_stat_col
itemIdCompact
itemIdCompactData
iterator
--
2.38.1
v43-0003-pgstat-Count-IO-for-relations.patch
From 7dd8b8f53d38cb26c199d7be6ecba01af0c59136 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 28 Dec 2022 12:49:35 -0800
Subject: [PATCH v43 3/4] pgstat: Count IO for relations
Count IOOps done on IOObjects in IOContexts by various BackendTypes
using the IO stats infrastructure introduced by a previous commit.
The primary concern of these statistics is IO operations on data blocks
during the course of normal database operations. IO operations done by,
for example, the archiver or syslogger are not counted in these
statistics. WAL IO, temporary file IO, and IO done directly though smgr*
functions (such as when building an index) are not yet counted but would
be useful future additions.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Maciek Sakrejda <m.sakrejda@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
src/backend/storage/buffer/bufmgr.c | 102 ++++++++++++++++++++++----
src/backend/storage/buffer/freelist.c | 58 +++++++++++----
src/backend/storage/buffer/localbuf.c | 29 ++++++--
src/backend/storage/smgr/md.c | 25 +++++++
src/include/storage/buf_internals.h | 8 +-
src/include/storage/bufmgr.h | 7 +-
6 files changed, 189 insertions(+), 40 deletions(-)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 8075828e8a..3709d2e810 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -481,8 +481,9 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
ForkNumber forkNum,
BlockNumber blockNum,
BufferAccessStrategy strategy,
- bool *foundPtr);
-static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+ bool *foundPtr, IOContext *io_context);
+static void FlushBuffer(BufferDesc *buf, SMgrRelation reln,
+ IOContext io_context, IOObject io_object);
static void FindAndDropRelationBuffers(RelFileLocator rlocator,
ForkNumber forkNum,
BlockNumber nForkBlock,
@@ -823,6 +824,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
BufferDesc *bufHdr;
Block bufBlock;
bool found;
+ IOContext io_context;
+ IOObject io_object;
bool isExtend;
bool isLocalBuf = SmgrIsTemp(smgr);
@@ -855,7 +858,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (isLocalBuf)
{
- bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
+ bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found, &io_context);
if (found)
pgBufferUsage.local_blks_hit++;
else if (isExtend)
@@ -871,7 +874,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* not currently in memory.
*/
bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum,
- strategy, &found);
+ strategy, &found, &io_context);
if (found)
pgBufferUsage.shared_blks_hit++;
else if (isExtend)
@@ -986,7 +989,16 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
*/
Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID)); /* spinlock not needed */
- bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
+ if (isLocalBuf)
+ {
+ bufBlock = LocalBufHdrGetBlock(bufHdr);
+ io_object = IOOBJECT_TEMP_RELATION;
+ }
+ else
+ {
+ bufBlock = BufHdrGetBlock(bufHdr);
+ io_object = IOOBJECT_RELATION;
+ }
if (isExtend)
{
@@ -995,6 +1007,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
/* don't set checksum for all-zero page */
smgrextend(smgr, forkNum, blockNum, (char *) bufBlock, false);
+ pgstat_count_io_op(IOOP_EXTEND, io_object, io_context);
+
/*
* NB: we're *not* doing a ScheduleBufferTagForWriteback here;
* although we're essentially performing a write. At least on linux
@@ -1020,6 +1034,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
smgrread(smgr, forkNum, blockNum, (char *) bufBlock);
+ pgstat_count_io_op(IOOP_READ, io_object, io_context);
+
if (track_io_timing)
{
INSTR_TIME_SET_CURRENT(io_time);
@@ -1113,14 +1129,19 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* *foundPtr is actually redundant with the buffer's BM_VALID flag, but
* we keep it for simplicity in ReadBuffer.
*
+ * io_context is passed as an output parameter to avoid calling
+ * IOContextForStrategy() when there is a shared buffers hit and no IO
+ * statistics need be captured.
+ *
* No locks are held either at entry or exit.
*/
static BufferDesc *
BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
BlockNumber blockNum,
BufferAccessStrategy strategy,
- bool *foundPtr)
+ bool *foundPtr, IOContext *io_context)
{
+ bool from_ring;
BufferTag newTag; /* identity of requested block */
uint32 newHash; /* hash value for newTag */
LWLock *newPartitionLock; /* buffer partition lock for it */
@@ -1172,8 +1193,11 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
{
/*
* If we get here, previous attempts to read the buffer must
- * have failed ... but we shall bravely try again.
+ * have failed ... but we shall bravely try again. Set
+ * io_context since we will in fact need to count an IO
+ * Operation.
*/
+ *io_context = IOContextForStrategy(strategy);
*foundPtr = false;
}
}
@@ -1187,6 +1211,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
*/
LWLockRelease(newPartitionLock);
+ *io_context = IOContextForStrategy(strategy);
+
/* Loop here in case we have to try another victim buffer */
for (;;)
{
@@ -1200,7 +1226,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* Select a victim buffer. The buffer is returned with its header
* spinlock still held!
*/
- buf = StrategyGetBuffer(strategy, &buf_state);
+ buf = StrategyGetBuffer(strategy, &buf_state, &from_ring);
Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
@@ -1254,7 +1280,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
UnlockBufHdr(buf, buf_state);
if (XLogNeedsFlush(lsn) &&
- StrategyRejectBuffer(strategy, buf))
+ StrategyRejectBuffer(strategy, buf, from_ring))
{
/* Drop lock/pin and loop around for another buffer */
LWLockRelease(BufferDescriptorGetContentLock(buf));
@@ -1269,7 +1295,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
smgr->smgr_rlocator.locator.dbOid,
smgr->smgr_rlocator.locator.relNumber);
- FlushBuffer(buf, NULL);
+ FlushBuffer(buf, NULL, *io_context, IOOBJECT_RELATION);
LWLockRelease(BufferDescriptorGetContentLock(buf));
ScheduleBufferTagForWriteback(&BackendWritebackContext,
@@ -1441,6 +1467,28 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
UnlockBufHdr(buf, buf_state);
+ if (oldFlags & BM_VALID)
+ {
+ /*
+ * When a BufferAccessStrategy is in use, blocks evicted from shared
+ * buffers are counted as IOOP_EVICT in the corresponding context
+ * (e.g. IOCONTEXT_BULKWRITE). Shared buffers are evicted by a
+ * strategy in two cases: 1) while initially claiming buffers for the
+ * strategy ring 2) to replace an existing strategy ring buffer
+ * because it is pinned or in use and cannot be reused.
+ *
+ * Blocks evicted from buffers already in the strategy ring are
+ * counted as IOOP_REUSE in the corresponding strategy context.
+ *
+ * At this point, we can accurately count evictions and reuses,
+ * because we have successfully claimed the valid buffer. Previously,
+ * we may have been forced to release the buffer due to concurrent
+ * pinners or erroring out.
+ */
+ pgstat_count_io_op(from_ring ? IOOP_REUSE : IOOP_EVICT,
+ IOOBJECT_RELATION, *io_context);
+ }
+
if (oldPartitionLock != NULL)
{
BufTableDelete(&oldTag, oldHash);
@@ -2570,7 +2618,7 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOCONTEXT_NORMAL, IOOBJECT_RELATION);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
@@ -2820,7 +2868,7 @@ BufferGetTag(Buffer buffer, RelFileLocator *rlocator, ForkNumber *forknum,
* as the second parameter. If not, pass NULL.
*/
static void
-FlushBuffer(BufferDesc *buf, SMgrRelation reln)
+FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOContext io_context, IOObject io_object)
{
XLogRecPtr recptr;
ErrorContextCallback errcallback;
@@ -2912,6 +2960,26 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
bufToWrite,
false);
+ /*
+ * When a strategy is in use, only flushes of dirty buffers already in the
+ * strategy ring are counted as strategy writes (IOCONTEXT
+ * [BULKREAD|BULKWRITE|VACUUM] IOOP_WRITE) for the purpose of IO
+ * statistics tracking.
+ *
+ * If a shared buffer initially added to the ring must be flushed before
+ * being used, this is counted as an IOCONTEXT_NORMAL IOOP_WRITE.
+ *
+ * If a shared buffer which was added to the ring later because the
+ * current strategy buffer is pinned or in use or because all strategy
+ * buffers were dirty and rejected (for BAS_BULKREAD operations only)
+ * requires flushing, this is counted as an IOCONTEXT_NORMAL IOOP_WRITE
+ * (from_ring will be false).
+ *
+ * When a strategy is not in use, the write can only be a "regular" write
+ * of a dirty shared buffer (IOCONTEXT_NORMAL IOOP_WRITE).
+ */
+ pgstat_count_io_op(IOOP_WRITE, IOOBJECT_RELATION, io_context);
+
if (track_io_timing)
{
INSTR_TIME_SET_CURRENT(io_time);
@@ -3554,6 +3622,8 @@ FlushRelationBuffers(Relation rel)
buf_state &= ~(BM_DIRTY | BM_JUST_DIRTIED);
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+ pgstat_count_io_op(IOOP_WRITE, IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL);
+
/* Pop the error context stack */
error_context_stack = errcallback.previous;
}
@@ -3586,7 +3656,7 @@ FlushRelationBuffers(Relation rel)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, RelationGetSmgr(rel));
+ FlushBuffer(bufHdr, RelationGetSmgr(rel), IOCONTEXT_NORMAL, IOOBJECT_RELATION);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr);
}
@@ -3684,7 +3754,7 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, srelent->srel);
+ FlushBuffer(bufHdr, srelent->srel, IOCONTEXT_NORMAL, IOOBJECT_RELATION);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr);
}
@@ -3894,7 +3964,7 @@ FlushDatabaseBuffers(Oid dbid)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOCONTEXT_NORMAL, IOOBJECT_RELATION);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr);
}
@@ -3921,7 +3991,7 @@ FlushOneBuffer(Buffer buffer)
Assert(LWLockHeldByMe(BufferDescriptorGetContentLock(bufHdr)));
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOCONTEXT_NORMAL, IOOBJECT_RELATION);
}
/*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 7dec35801c..c690d5f15f 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -15,6 +15,7 @@
*/
#include "postgres.h"
+#include "pgstat.h"
#include "port/atomics.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
@@ -81,12 +82,6 @@ typedef struct BufferAccessStrategyData
*/
int current;
- /*
- * True if the buffer just returned by StrategyGetBuffer had been in the
- * ring already.
- */
- bool current_was_in_ring;
-
/*
* Array of buffer numbers. InvalidBuffer (that is, zero) indicates we
* have not yet selected a buffer for this ring slot. For allocation
@@ -198,13 +193,15 @@ have_free_buffer(void)
* return the buffer with the buffer header spinlock still held.
*/
BufferDesc *
-StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
+StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_ring)
{
BufferDesc *buf;
int bgwprocno;
int trycounter;
uint32 local_buf_state; /* to avoid repeated (de-)referencing */
+ *from_ring = false;
+
/*
* If given a strategy object, see whether it can select a buffer. We
* assume strategy objects don't need buffer_strategy_lock.
@@ -213,7 +210,10 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
{
buf = GetBufferFromRing(strategy, buf_state);
if (buf != NULL)
+ {
+ *from_ring = true;
return buf;
+ }
}
/*
@@ -602,7 +602,7 @@ FreeAccessStrategy(BufferAccessStrategy strategy)
/*
* GetBufferFromRing -- returns a buffer from the ring, or NULL if the
- * ring is empty.
+ * ring is empty / not usable.
*
* The bufhdr spin lock is held on the returned buffer.
*/
@@ -625,10 +625,7 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
*/
bufnum = strategy->buffers[strategy->current];
if (bufnum == InvalidBuffer)
- {
- strategy->current_was_in_ring = false;
return NULL;
- }
/*
* If the buffer is pinned we cannot use it under any circumstances.
@@ -644,7 +641,6 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
if (BUF_STATE_GET_REFCOUNT(local_buf_state) == 0
&& BUF_STATE_GET_USAGECOUNT(local_buf_state) <= 1)
{
- strategy->current_was_in_ring = true;
*buf_state = local_buf_state;
return buf;
}
@@ -654,7 +650,6 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
* Tell caller to allocate a new buffer with the normal allocation
* strategy. He'll then replace this ring element via AddBufferToRing.
*/
- strategy->current_was_in_ring = false;
return NULL;
}
@@ -670,6 +665,39 @@ AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
strategy->buffers[strategy->current] = BufferDescriptorGetBuffer(buf);
}
+/*
+ * Utility function returning the IOContext of a given BufferAccessStrategy's
+ * strategy ring.
+ */
+IOContext
+IOContextForStrategy(BufferAccessStrategy strategy)
+{
+ if (!strategy)
+ return IOCONTEXT_NORMAL;
+
+ switch (strategy->btype)
+ {
+ case BAS_NORMAL:
+
+ /*
+ * Currently, GetAccessStrategy() returns NULL for
+ * BufferAccessStrategyType BAS_NORMAL, so this case is
+ * unreachable.
+ */
+ pg_unreachable();
+ return IOCONTEXT_NORMAL;
+ case BAS_BULKREAD:
+ return IOCONTEXT_BULKREAD;
+ case BAS_BULKWRITE:
+ return IOCONTEXT_BULKWRITE;
+ case BAS_VACUUM:
+ return IOCONTEXT_VACUUM;
+ }
+
+ elog(ERROR, "unrecognized BufferAccessStrategyType: %d", strategy->btype);
+ pg_unreachable();
+}
+
/*
* StrategyRejectBuffer -- consider rejecting a dirty buffer
*
@@ -682,14 +710,14 @@ AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
* if this buffer should be written and re-used.
*/
bool
-StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
+StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool from_ring)
{
/* We only do this in bulkread mode */
if (strategy->btype != BAS_BULKREAD)
return false;
/* Don't muck with behavior of normal buffer-replacement strategy */
- if (!strategy->current_was_in_ring ||
+ if (!from_ring ||
strategy->buffers[strategy->current] != BufferDescriptorGetBuffer(buf))
return false;
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 8372acc383..f5e2138701 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -18,6 +18,7 @@
#include "access/parallel.h"
#include "catalog/catalog.h"
#include "executor/instrument.h"
+#include "pgstat.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "utils/guc_hooks.h"
@@ -100,14 +101,22 @@ PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
* LocalBufferAlloc -
* Find or create a local buffer for the given page of the given relation.
*
- * API is similar to bufmgr.c's BufferAlloc, except that we do not need
- * to do any locking since this is all local. Also, IO_IN_PROGRESS
- * does not get set. Lastly, we support only default access strategy
- * (hence, usage_count is always advanced).
+ * API is similar to bufmgr.c's BufferAlloc(). Note that, unlike BufferAlloc(),
+ * no locking is required and IO_IN_PROGRESS does not get set.
+ *
+ * Only the default access strategy is supported with local buffers, so no
+ * BufferAccessStrategy is passed to LocalBufferAlloc(). The selected buffer's
+ * usage_count is, therefore, unconditionally advanced. Also, the passed-in
+ * io_context is always set to IOCONTEXT_NORMAL. This indicates to the caller
+ * not to use the BufferAccessStrategy to set the io_context itself.
+ *
+ * This is important in cases like CREATE TEMPORARY TABLE AS ..., in which a
+ * BufferAccessStrategy object may have been created for the CTAS operation but
+ * it will not be used because it will operate on local buffers.
*/
BufferDesc *
LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
- bool *foundPtr)
+ bool *foundPtr, IOContext *io_context)
{
BufferTag newTag; /* identity of requested block */
LocalBufferLookupEnt *hresult;
@@ -127,6 +136,14 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
hresult = (LocalBufferLookupEnt *)
hash_search(LocalBufHash, (void *) &newTag, HASH_FIND, NULL);
+ /*
+ * IO Operations on local buffers are only done in IOCONTEXT_NORMAL. Set
+ * io_context here for convenience since there is no function call
+ * overhead to avoid in the case of a local buffer hit (like that of
+ * IOContextForStrategy()).
+ */
+ *io_context = IOCONTEXT_NORMAL;
+
if (hresult)
{
b = hresult->id;
@@ -230,6 +247,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
buf_state &= ~BM_DIRTY;
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+ pgstat_count_io_op(IOOP_WRITE, IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL);
pgBufferUsage.local_blks_written++;
}
@@ -256,6 +274,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
ClearBufferTag(&bufHdr->tag);
buf_state &= ~(BM_VALID | BM_TAG_VALID);
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+ pgstat_count_io_op(IOOP_EVICT, IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL);
}
hresult = (LocalBufferLookupEnt *)
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 60c9905eff..2115d7184a 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -983,6 +983,15 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
{
MdfdVec *v = &reln->md_seg_fds[forknum][segno - 1];
+ /*
+ * fsyncs done through mdimmedsync() should be tracked in a separate
+ * IOContext from those done through mdsyncfiletag() to differentiate
+ * between unavoidable client backend fsyncs (e.g. those done during
+ * index build) and those which ideally would have been done by the
+ * checkpointer or bgwriter. Since other IO operations bypassing the
+ * buffer manager could also be tracked in such an IOContext, wait
+ * until these are also tracked to track immediate fsyncs.
+ */
if (FileSync(v->mdfd_vfd, WAIT_EVENT_DATA_FILE_IMMEDIATE_SYNC) < 0)
ereport(data_sync_elevel(ERROR),
(errcode_for_file_access(),
@@ -1021,6 +1030,19 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false /* retryOnError */ ))
{
+ /*
+ * We have no way of knowing if the current IOContext is
+ * IOCONTEXT_NORMAL or IOCONTEXT_[BULKREAD, BULKWRITE, VACUUM] at this
+ * point, so count the fsync as being in the IOCONTEXT_NORMAL
+ * IOContext. This is probably okay, because the number of backend
+ * fsyncs doesn't say anything about the efficacy of the
+ * BufferAccessStrategy. And counting both fsyncs done in
+ * IOCONTEXT_NORMAL and IOCONTEXT_[BULKREAD, BULKWRITE, VACUUM] under
+ * IOCONTEXT_NORMAL is likely clearer when investigating the number of
+ * backend fsyncs.
+ */
+ pgstat_count_io_op(IOOP_FSYNC, IOOBJECT_RELATION, IOCONTEXT_NORMAL);
+
ereport(DEBUG1,
(errmsg_internal("could not forward fsync request because request queue is full")));
@@ -1410,6 +1432,9 @@ mdsyncfiletag(const FileTag *ftag, char *path)
if (need_to_close)
FileClose(file);
+ if (result >= 0)
+ pgstat_count_io_op(IOOP_FSYNC, IOOBJECT_RELATION, IOCONTEXT_NORMAL);
+
errno = save_errno;
return result;
}
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index ed8aa2519c..0b44814740 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -15,6 +15,7 @@
#ifndef BUFMGR_INTERNALS_H
#define BUFMGR_INTERNALS_H
+#include "pgstat.h"
#include "port/atomics.h"
#include "storage/buf.h"
#include "storage/bufmgr.h"
@@ -391,11 +392,12 @@ extern void IssuePendingWritebacks(WritebackContext *context);
extern void ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *tag);
/* freelist.c */
+extern IOContext IOContextForStrategy(BufferAccessStrategy bas);
extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
- uint32 *buf_state);
+ uint32 *buf_state, bool *from_ring);
extern void StrategyFreeBuffer(BufferDesc *buf);
extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
- BufferDesc *buf);
+ BufferDesc *buf, bool from_ring);
extern int StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
extern void StrategyNotifyBgWriter(int bgwprocno);
@@ -417,7 +419,7 @@ extern PrefetchBufferResult PrefetchLocalBuffer(SMgrRelation smgr,
ForkNumber forkNum,
BlockNumber blockNum);
extern BufferDesc *LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum,
- BlockNumber blockNum, bool *foundPtr);
+ BlockNumber blockNum, bool *foundPtr, IOContext *io_context);
extern void MarkLocalBufferDirty(Buffer buffer);
extern void DropRelationLocalBuffers(RelFileLocator rlocator,
ForkNumber forkNum,
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 33eadbc129..b8a18b8081 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -23,7 +23,12 @@
typedef void *Block;
-/* Possible arguments for GetAccessStrategy() */
+/*
+ * Possible arguments for GetAccessStrategy().
+ *
+ * If adding a new BufferAccessStrategyType, also add a new IOContext so
+ * IO statistics using this strategy are tracked.
+ */
typedef enum BufferAccessStrategyType
{
BAS_NORMAL, /* Normal random access */
--
2.38.1
On Mon, Jan 2, 2023 at 8:15 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> On Mon, Jan 2, 2023 at 5:46 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> > Besides docs, there is one large change to the code which I am currently
> > working on, which is to change PgStat_IOOpCounters into an array of
> > PgStatCounters instead of having individual members for each IOOp type.
> > I hadn't done this previously because the additional level of nesting
> > seemed confusing. However, it seems it would simplify the code quite a
> > bit and is probably worth doing.
>
> As described above, attached v43 uses an array for the PgStatCounters of
> IOOps instead of struct members.
This wasn't quite a multi-dimensional array. Attached is v44, in which I
have removed all of the granular struct types -- PgStat_IOOps,
PgStat_IOContext, and PgStat_IOObject by collapsing them into a single
array of PgStat_Counters in a new struct PgStat_BackendIO. I needed to
keep this in addition to PgStat_IO to have a data type for backends to
track their stats in locally.
I've also done another round of cleanup.
- Melanie
Attachments:
v44-0004-Add-system-view-tracking-IO-ops-per-backend-type.patch
From 312b3d61c7f0fc55af232c96b1188ec64e868d7f Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 4 Jan 2023 17:20:23 -0500
Subject: [PATCH v44 4/4] Add system view tracking IO ops per backend type
Add pg_stat_io, a system view which tracks the number of IOOps
(evictions, reuses, reads, writes, extensions, and fsyncs) done on each
IOObject (relation, temp relation) in each IOContext ("normal" and those
using a BufferAccessStrategy) by each type of backend (e.g. client
backend, checkpointer).
Some BackendTypes do not accumulate IO operation statistics and will
not be included in the view.
Some IOContexts are not used by some BackendTypes and will not be in the
view. For example, checkpointer does not use a BufferAccessStrategy
(currently), so there will be no rows for BufferAccessStrategy
IOContexts for checkpointer.
Some IOObjects are never operated on in some IOContexts or by some
BackendTypes. These rows are omitted from the view. For example,
checkpointer will never operate on IOOBJECT_TEMP_RELATION data, so those
rows are omitted.
Some IOOps are invalid in combination with certain IOContexts and
certain IOObjects. Those cells will be NULL in the view to distinguish
between 0 observed IOOps of that type and an invalid combination. For
example, temporary tables are not fsynced so cells for all BackendTypes
for IOOBJECT_TEMP_RELATION and IOOP_FSYNC will be NULL.
Some BackendTypes never perform certain IOOps. Those cells will also be
NULL in the view. For example, bgwriter should not perform reads.
The view's statistics are populated from counters incremented when a
backend performs an IO operation and are maintained by the cumulative
statistics subsystem.
Each row of the view shows stats for a particular BackendType, IOObject,
IOContext combination (e.g. a client backend's operations on permanent
relations in shared buffers) and each column in the view is the total
number of IO Operations done (e.g. writes). So a cell in the view would
be, for example, the number of blocks of relation data written from
shared buffers by client backends since the last stats reset.
In anticipation of tracking WAL IO and non-block-oriented IO (such as
temporary file IO), the "op_bytes" column specifies the unit of the "read",
"written", and "extended" columns for a given row.
Note that some of the cells in the view are redundant with fields in
pg_stat_bgwriter (e.g. buffers_backend); however, these have been kept in
pg_stat_bgwriter for backwards compatibility. Deriving the redundant
pg_stat_bgwriter stats from the IO operations stats structures was also
problematic due to the separate reset targets for 'bgwriter' and 'io'.
Suggested by Andres Freund
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Maciek Sakrejda <m.sakrejda@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
contrib/amcheck/expected/check_heap.out | 31 ++
contrib/amcheck/sql/check_heap.sql | 24 ++
doc/src/sgml/monitoring.sgml | 418 +++++++++++++++++++++++-
src/backend/catalog/system_views.sql | 15 +
src/backend/utils/adt/pgstatfuncs.c | 154 +++++++++
src/include/catalog/pg_proc.dat | 9 +
src/test/regress/expected/rules.out | 12 +
src/test/regress/expected/stats.out | 225 +++++++++++++
src/test/regress/sql/stats.sql | 138 ++++++++
src/tools/pgindent/typedefs.list | 1 +
10 files changed, 1013 insertions(+), 14 deletions(-)
diff --git a/contrib/amcheck/expected/check_heap.out b/contrib/amcheck/expected/check_heap.out
index c010361025..c44338fd6e 100644
--- a/contrib/amcheck/expected/check_heap.out
+++ b/contrib/amcheck/expected/check_heap.out
@@ -66,6 +66,19 @@ SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'ALL-VISIBLE');
INSERT INTO heaptest (a, b)
(SELECT gs, repeat('x', gs)
FROM generate_series(1,50) gs);
+-- pg_stat_io test:
+-- verify_heapam always uses a BAS_BULKREAD BufferAccessStrategy. This allows
+-- us to reliably test that pg_stat_io BULKREAD reads are being captured
+-- without relying on the size of shared buffers or on an expensive operation
+-- like CREATE DATABASE.
+--
+-- Create an alternative tablespace and move the heaptest table to it, causing
+-- it to be rewritten.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE test_stats LOCATION '';
+SELECT sum(read) AS stats_bulkreads_before
+ FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+ALTER TABLE heaptest SET TABLESPACE test_stats;
-- Check that valid options are not rejected nor corruption reported
-- for a non-empty table
SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'none');
@@ -88,6 +101,23 @@ SELECT * FROM verify_heapam(relation := 'heaptest', startblock := 0, endblock :=
-------+--------+--------+-----
(0 rows)
+-- verify_heapam should have read in the page written out by
+-- ALTER TABLE ... SET TABLESPACE ...
+-- causing an additional bulkread, which should be reflected in pg_stat_io.
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(read) AS stats_bulkreads_after
+ FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+SELECT :stats_bulkreads_after > :stats_bulkreads_before;
+ ?column?
+----------
+ t
+(1 row)
+
CREATE ROLE regress_heaptest_role;
-- verify permissions are checked (error due to function not callable)
SET ROLE regress_heaptest_role;
@@ -195,6 +225,7 @@ ERROR: cannot check relation "test_foreign_table"
DETAIL: This operation is not supported for foreign tables.
-- cleanup
DROP TABLE heaptest;
+DROP TABLESPACE test_stats;
DROP TABLE test_partition;
DROP TABLE test_partitioned;
DROP OWNED BY regress_heaptest_role; -- permissions
diff --git a/contrib/amcheck/sql/check_heap.sql b/contrib/amcheck/sql/check_heap.sql
index 298de6886a..210f9b22e2 100644
--- a/contrib/amcheck/sql/check_heap.sql
+++ b/contrib/amcheck/sql/check_heap.sql
@@ -20,11 +20,26 @@ SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'NONE');
SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'ALL-FROZEN');
SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'ALL-VISIBLE');
+
-- Add some data so subsequent tests are not entirely trivial
INSERT INTO heaptest (a, b)
(SELECT gs, repeat('x', gs)
FROM generate_series(1,50) gs);
+-- pg_stat_io test:
+-- verify_heapam always uses a BAS_BULKREAD BufferAccessStrategy. This allows
+-- us to reliably test that pg_stat_io BULKREAD reads are being captured
+-- without relying on the size of shared buffers or on an expensive operation
+-- like CREATE DATABASE.
+--
+-- Create an alternative tablespace and move the heaptest table to it, causing
+-- it to be rewritten.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE test_stats LOCATION '';
+SELECT sum(read) AS stats_bulkreads_before
+ FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+ALTER TABLE heaptest SET TABLESPACE test_stats;
+
-- Check that valid options are not rejected nor corruption reported
-- for a non-empty table
SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'none');
@@ -32,6 +47,14 @@ SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'all-frozen');
SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'all-visible');
SELECT * FROM verify_heapam(relation := 'heaptest', startblock := 0, endblock := 0);
+-- verify_heapam should have read in the page written out by
+-- ALTER TABLE ... SET TABLESPACE ...
+-- causing an additional bulkread, which should be reflected in pg_stat_io.
+SELECT pg_stat_force_next_flush();
+SELECT sum(read) AS stats_bulkreads_after
+ FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+SELECT :stats_bulkreads_after > :stats_bulkreads_before;
+
CREATE ROLE regress_heaptest_role;
-- verify permissions are checked (error due to function not callable)
@@ -110,6 +133,7 @@ SELECT * FROM verify_heapam('test_foreign_table',
-- cleanup
DROP TABLE heaptest;
+DROP TABLESPACE test_stats;
DROP TABLE test_partition;
DROP TABLE test_partitioned;
DROP OWNED BY regress_heaptest_role; -- permissions
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 710bd2c52e..b27c6c7bc7 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -469,6 +469,15 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</entry>
</row>
+ <row>
+ <entry><structname>pg_stat_io</structname><indexterm><primary>pg_stat_io</primary></indexterm></entry>
+ <entry>One row for each combination of backend type, IO context, and
+ IO object, showing statistics about backend IO operations. See
+ <link linkend="monitoring-pg-stat-io-view">
+ <structname>pg_stat_io</structname></link> for details.
+ </entry>
+ </row>
+
<row>
<entry><structname>pg_stat_replication_slots</structname><indexterm><primary>pg_stat_replication_slots</primary></indexterm></entry>
<entry>One row per replication slot, showing statistics about the
@@ -665,20 +674,20 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</para>
<para>
- The <structname>pg_statio_</structname> views are primarily useful to
- determine the effectiveness of the buffer cache. When the number
- of actual disk reads is much smaller than the number of buffer
- hits, then the cache is satisfying most read requests without
- invoking a kernel call. However, these statistics do not give the
- entire story: due to the way in which <productname>PostgreSQL</productname>
- handles disk I/O, data that is not in the
- <productname>PostgreSQL</productname> buffer cache might still reside in the
- kernel's I/O cache, and might therefore still be fetched without
- requiring a physical read. Users interested in obtaining more
- detailed information on <productname>PostgreSQL</productname> I/O behavior are
- advised to use the <productname>PostgreSQL</productname> statistics views
- in combination with operating system utilities that allow insight
- into the kernel's handling of I/O.
+ The <structname>pg_stat_io</structname> and
+ <structname>pg_statio_</structname> set of views are primarily useful to
+ determine the effectiveness of the buffer cache. When the number of actual
+ disk reads is much smaller than the number of buffer hits, then the cache is
+ satisfying most read requests without invoking a kernel call. However, these
+ statistics do not give the entire story: due to the way in which
+ <productname>PostgreSQL</productname> handles disk I/O, data that is not in
+ the <productname>PostgreSQL</productname> buffer cache might still reside in
+ the kernel's I/O cache, and might therefore still be fetched without
+ requiring a physical read. Users interested in obtaining more detailed
+ information on <productname>PostgreSQL</productname> I/O behavior are
+ advised to use the <productname>PostgreSQL</productname> statistics views in
+ combination with operating system utilities that allow insight into the
+ kernel's handling of I/O.
</para>
</sect2>
@@ -3628,6 +3637,387 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
<structfield>last_archived_wal</structfield> have also been successfully
archived.
</para>
+ </sect2>
+
+ <sect2 id="monitoring-pg-stat-io-view">
+ <title><structname>pg_stat_io</structname></title>
+
+ <indexterm>
+ <primary>pg_stat_io</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_stat_io</structname> view shows IO-related
+ statistics. The statistics are tracked separately for each combination of
+ backend type, IO context, and IO object, with each combination
+ returned as a separate row (combinations that do not make sense are
+ omitted).
+ </para>
+
+ <para>
+ Currently, IO on relations (e.g. tables, indexes) is tracked. However,
+ relation IO that bypasses shared buffers (e.g. when moving a table from
+ one tablespace to another) is not tracked.
+ </para>
+
+ <table id="pg-stat-io-view" xreflabel="pg_stat_io">
+ <title><structname>pg_stat_io</structname> View</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para>
+ </entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>backend_type</structfield> <type>text</type>
+ </para>
+ <para>
+ Type of backend (e.g. background worker, autovacuum worker).
+ See <link linkend="monitoring-pg-stat-activity-view">
+ <structname>pg_stat_activity</structname></link> for more information on
+ <varname>backend_type</varname>s. Some <varname>backend_type</varname>s
+ do not accumulate IO operation statistics and will not be included in
+ the view.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>io_context</structfield> <type>text</type>
+ </para>
+ <para>
+ The context of an IO operation or location of an IO object:
+ </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>normal</literal> refers to the default or standard type or
+ location of IO operations on IO objects.
+ </para>
+ <para>
+ Operations on temporary relations use a process-local buffer pool and
+ are counted as <varname>io_context</varname>
+ <literal>normal</literal>, <varname>io_object</varname>
+ <literal>temp relation</literal> operations.
+ </para>
+ <para>
+ IO operations on permanent relations are done by default in shared
+ buffers. These are tracked in <varname>io_context</varname>
+ <literal>normal</literal>, <varname>io_object</varname>
+ <literal>relation</literal>.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>vacuum</literal> refers to the IO operations incurred while
+ vacuuming and analyzing permanent relations.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>bulkread</literal> refers to IO operations on permanent
+ relations specially designated as <literal>bulkreads</literal>, such
+ as the sequential scan of a large table.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>bulkwrite</literal> refers to IO operations on permanent
+ relations specially designated as <literal>bulkwrites</literal>,
+ such as <command>COPY</command>.
+ </para>
+ </listitem>
+ </itemizedlist>
+ <para>
+ These last three <varname>io_context</varname>s are counted separately
+ because the autovacuum daemon, explicit <command>VACUUM</command>,
+ explicit <command>ANALYZE</command>, many bulk reads, and many bulk
+ writes acquire a limited number of shared buffers and reuse them
+ circularly to avoid occupying an undue portion of the main shared
+ buffer pool. This pattern is called a <quote>Buffer Access
+ Strategy</quote> in the <productname>PostgreSQL</productname> source
+ code and the fixed-size ring buffer is referred to as a <quote>strategy
+ ring buffer</quote> for the purposes of this view's documentation.
+ These <varname>io_context</varname>s are referred to as <quote>strategy
+ contexts</quote> and IO operations on strategy contexts are referred to
+ as <quote>strategy operations</quote>.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>io_object</structfield> <type>text</type>
+ </para>
+ <para>
+ Object operated on in a given <varname>io_context</varname> by a given
+ <varname>backend_type</varname>. Current values are
+ <literal>relation</literal>, which includes permanent relations, and
+ <literal>temp relation</literal>, which includes temporary relations
+ created by <command>CREATE TEMPORARY TABLE...</command>.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>read</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Reads by a given <varname>backend_type</varname> of a given
+ <varname>io_object</varname> into buffers in a given
+ <varname>io_context</varname>.
+ </para>
+ <para>
+ Note that the sum of
+ <varname>heap_blks_read</varname>,
+ <varname>idx_blks_read</varname>,
+ <varname>tidx_blks_read</varname>, and
+ <varname>toast_blks_read</varname>
+ in <link linkend="monitoring-pg-statio-all-tables-view">
+ <structname>pg_statio_all_tables</structname></link> as well as
+ <varname>blks_read</varname> in <link
+ linkend="monitoring-pg-stat-database-view">
+ <structname>pg_stat_database</structname></link> are both similar to
+ <varname>read</varname> plus <varname>extended</varname> for all
+ <varname>io_context</varname>s for the following
+ <varname>backend_type</varname>s in <structname>pg_stat_io</structname>:
+ <itemizedlist>
+ <listitem><para><literal>autovacuum launcher</literal></para></listitem>
+ <listitem><para><literal>autovacuum worker</literal></para></listitem>
+ <listitem><para><literal>client backend</literal></para></listitem>
+ <listitem><para><literal>standalone backend</literal></para></listitem>
+ <listitem><para><literal>background worker</literal></para></listitem>
+ <listitem><para><literal>walsender</literal></para></listitem>
+ </itemizedlist>
+ The difference is that reads done as part of <command>CREATE
+ DATABASE</command> are not counted in
+ <structname>pg_statio_all_tables</structname> and
+ <structname>pg_stat_database</structname>.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>written</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Writes by a given <varname>backend_type</varname> of a given
+ <varname>io_object</varname> of data from a given
+ <varname>io_context</varname>.
+ </para>
+ <para>
+ Normal client backends should be able to rely on auxiliary processes
+ like the checkpointer and background writer to write out dirty data as
+ much as possible. Large numbers of writes by
+ <varname>backend_type</varname> <literal>client backend</literal> in
+ <varname>io_context</varname> <literal>normal</literal> and
+ <varname>io_object</varname> <literal>relation</literal> could indicate
+ a misconfiguration of shared buffers or of checkpointer. More
+ information on checkpointer configuration can be found in <xref
+ linkend="wal-configuration"/>.
+ </para>
+ <para>
+ Note that the values of <varname>written</varname> for
+ <varname>backend_type</varname> <literal>background writer</literal> and
+ <varname>backend_type</varname> <literal>checkpointer</literal>
+ correspond to the values of <varname>buffers_clean</varname> and
+ <varname>buffers_checkpoint</varname>, respectively, in <link
+ linkend="monitoring-pg-stat-bgwriter-view">
+ <structname>pg_stat_bgwriter</structname></link>.
+ <varname>buffers_backend</varname> in
+ <structname>pg_stat_bgwriter</structname> corresponds to
+ <structname>pg_stat_io</structname>'s <varname>written</varname> plus
+ <varname>extended</varname> for <varname>io_context</varname>s
+ <literal>normal</literal>, <literal>bulkread</literal>,
+ <literal>bulkwrite</literal>, and <literal>vacuum</literal> on
+ <varname>io_object</varname> <literal>relation</literal> for
+ <varname>backend_type</varname>s:
+ <itemizedlist>
+ <listitem><para><literal>client backend</literal></para></listitem>
+ <listitem><para><literal>autovacuum worker</literal></para></listitem>
+ <listitem><para><literal>background worker</literal></para></listitem>
+ <listitem><para><literal>walsender</literal></para></listitem>
+ </itemizedlist>
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>extended</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Extends of relations done by a given <varname>backend_type</varname> in
+ order to write data for a given <varname>io_object</varname> in a given
+ <varname>io_context</varname>.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>op_bytes</structfield> <type>bigint</type>
+ </para>
+ <para>
+ The number of bytes per unit of IO read, written, or extended. For
+ block-oriented IO of relation data, reads, writes, and extends are done
+ in <varname>block_size</varname> units, derived from the build-time
+ parameter <symbol>BLCKSZ</symbol>, which is <literal>8192</literal> by
+ default.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>evicted</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of times a <varname>backend_type</varname> has evicted a block
+ from a shared or local buffer in order to reuse the buffer in this
+ <varname>io_context</varname>. Blocks are only evicted when there are no
+ unoccupied buffers.
+ </para>
+ <para>
+ <varname>evicted</varname> in <varname>io_context</varname>
+ <literal>normal</literal> and <varname>io_object</varname>
+ <literal>relation</literal> counts the number of times a block from a
+ shared buffer was evicted so that it can be replaced with another block,
+ also in shared buffers.
+ </para>
+ <para>
+ A high <varname>evicted</varname> count in <varname>io_context</varname>
+ <literal>normal</literal> and <varname>io_object</varname>
+ <literal>relation</literal> could indicate that shared buffers is too
+ small and should be set to a larger value.
+ </para>
+ <para>
+ <varname>evicted</varname> in <varname>io_context</varname>
+ <literal>vacuum</literal>, <literal>bulkread</literal>, and
+ <literal>bulkwrite</literal> counts the number of times occupied shared
+ buffers were added to the size-limited strategy ring buffer, causing the
+ buffer contents to be evicted. If the to-be-used buffer in the ring is
+ pinned or in use by another backend, it may be replaced by a new shared
+ buffer. If this shared buffer contains valid data, that block must be
+ evicted and will count as <varname>evicted</varname>.
+ </para>
+ <para>
+ Seeing a large number of <varname>evicted</varname> in strategy
+ <varname>io_context</varname>s can provide insight into primary working
+ set cache misses.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>reused</structfield> <type>bigint</type>
+ </para>
+ <para>
+ The number of times an existing buffer in the strategy ring was reused
+ as part of an operation in the <literal>bulkread</literal>,
+ <literal>bulkwrite</literal>, or <literal>vacuum</literal>
+ <varname>io_context</varname>s. When a Buffer Access Strategy reuses a
+ buffer in the strategy ring, it evicts the buffer contents, incrementing
+ <varname>reused</varname>. When a Buffer Access Strategy adds a new
+ shared buffer to the strategy ring and this shared buffer is occupied,
+ the Buffer Access Strategy must evict the contents of the shared buffer,
+ incrementing <varname>evicted</varname>.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>files_synced</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of files <literal>fsync</literal>ed by a given
+ <varname>backend_type</varname> for the purpose of persisting data from
+ a given <varname>io_object</varname> dirtied in a given
+ <varname>io_context</varname>. <literal>fsync</literal>s are done at
+ segment boundaries so <varname>op_bytes</varname> does not apply to the
+ <varname>files_synced</varname> column.
+
+ <literal>fsync</literal>s are always tracked in
+ <varname>io_context</varname> <literal>normal</literal>.
+ </para>
+ <para>
+ Normally client backends rely on the checkpointer to ensure data is
+ persisted to permanent storage. Large numbers of
+ <varname>files_synced</varname> by <varname>backend_type</varname>
+ <literal>client backend</literal> could indicate a misconfiguration of
+ shared buffers or of checkpointer. More information on checkpointer
+ configuration can be found in <xref linkend="wal-configuration"/>.
+ </para>
+ <para>
+ Note that the sum of <varname>files_synced</varname> for all
+ <varname>io_context</varname> <literal>normal</literal>
+ <varname>io_object</varname> <literal>relation</literal> for all
+ <varname>backend_type</varname>s except <literal>checkpointer</literal>
+ corresponds to <varname>buffers_backend_fsync</varname> in <link
+ linkend="monitoring-pg-stat-bgwriter-view">
+ <structname>pg_stat_bgwriter</structname></link>.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
+ </para>
+ <para>
+ Time at which these statistics were last reset.
+ </para>
+ </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ Some <varname>backend_type</varname>s do not perform IO operations in some
+ <varname>io_context</varname>s and/or <varname>io_object</varname>s. These
+ rows are omitted from the view. For example, the checkpointer does not use
+ a Buffer Access Strategy, so there will be no rows for
+ <varname>backend_type</varname> <literal>checkpointer</literal> in any of
+ the strategy <varname>io_context</varname>s.
+
+ On a more granular level, some IO operations are invalid in combination
+ with certain <varname>io_context</varname>s and
+ <varname>io_object</varname>s. Those cells will be NULL to distinguish
+ between 0 observed IO operations of that type and an invalid
+ combination. For example, temporary tables are not fsynced, so cells for
+ all <varname>backend_type</varname>s for <varname>io_object</varname>
+ <literal>temp relation</literal> in <varname>io_context</varname>
+ <literal>normal</literal> for <varname>files_synced</varname> will be
+ NULL. Some <varname>backend_type</varname>s never perform certain IO
+ operations. Those cells will also be NULL in the view. For example,
+ <literal>background writer</literal> should not perform reads.
+ </para>
</sect2>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 447c9b970f..71646f5aef 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1117,6 +1117,21 @@ CREATE VIEW pg_stat_bgwriter AS
pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+CREATE VIEW pg_stat_io AS
+SELECT
+ b.backend_type,
+ b.io_context,
+ b.io_object,
+ b.read,
+ b.written,
+ b.extended,
+ b.op_bytes,
+ b.evicted,
+ b.reused,
+ b.files_synced,
+ b.stats_reset
+FROM pg_stat_get_io() b;
+
CREATE VIEW pg_stat_wal AS
SELECT
w.wal_records,
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 42b890b806..71c5ff9f1e 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1234,6 +1234,160 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
}
+/*
+ * When adding a new column to the pg_stat_io view, add a new enum value
+ * here above IO_NUM_COLUMNS.
+ */
+typedef enum io_stat_col
+{
+ IO_COL_BACKEND_TYPE,
+ IO_COL_IO_CONTEXT,
+ IO_COL_IO_OBJECT,
+ IO_COL_READS,
+ IO_COL_WRITES,
+ IO_COL_EXTENDS,
+ IO_COL_CONVERSION,
+ IO_COL_EVICTIONS,
+ IO_COL_REUSES,
+ IO_COL_FSYNCS,
+ IO_COL_RESET_TIME,
+ IO_NUM_COLUMNS,
+} io_stat_col;
+
+/*
+ * When adding a new IOOp, add a new io_stat_col and add a case to this
+ * function returning the corresponding io_stat_col.
+ */
+static io_stat_col
+pgstat_get_io_op_index(IOOp io_op)
+{
+ switch (io_op)
+ {
+ case IOOP_EVICT:
+ return IO_COL_EVICTIONS;
+ case IOOP_READ:
+ return IO_COL_READS;
+ case IOOP_REUSE:
+ return IO_COL_REUSES;
+ case IOOP_WRITE:
+ return IO_COL_WRITES;
+ case IOOP_EXTEND:
+ return IO_COL_EXTENDS;
+ case IOOP_FSYNC:
+ return IO_COL_FSYNCS;
+ }
+
+ elog(ERROR, "unrecognized IOOp value: %d", io_op);
+ pg_unreachable();
+}
+
+#ifdef USE_ASSERT_CHECKING
+static bool
+pgstat_iszero_io_object(const PgStat_Counter *obj)
+{
+ for (IOOp io_op = IOOP_EVICT; io_op < IOOP_NUM_TYPES; io_op++)
+ {
+ if (obj[io_op] != 0)
+ return false;
+ }
+
+ return true;
+}
+#endif
+
+Datum
+pg_stat_get_io(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsinfo;
+ PgStat_IO *backends_io_stats;
+ Datum reset_time;
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+
+ backends_io_stats = pgstat_fetch_stat_io();
+
+ reset_time = TimestampTzGetDatum(backends_io_stats->stat_reset_timestamp);
+
+ for (BackendType bktype = B_INVALID; bktype < BACKEND_NUM_TYPES; bktype++)
+ {
+ bool bktype_tracked;
+ Datum bktype_desc = CStringGetTextDatum(GetBackendTypeDesc(bktype));
+ PgStat_BackendIO *bktype_stats = &backends_io_stats->stats[bktype];
+
+ /*
+ * For those BackendTypes without IO Operation stats, skip
+ * representing them in the view altogether. We still loop through
+ * their counters so that we can assert that all values are zero.
+ */
+ bktype_tracked = pgstat_tracks_io_bktype(bktype);
+
+ for (IOContext io_context = IOCONTEXT_BULKREAD;
+ io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ {
+ const char *context_name = pgstat_get_io_context_name(io_context);
+
+ for (IOObject io_obj = IOOBJECT_RELATION;
+ io_obj < IOOBJECT_NUM_TYPES; io_obj++)
+ {
+ const char *obj_name = pgstat_get_io_object_name(io_obj);
+
+ Datum values[IO_NUM_COLUMNS] = {0};
+ bool nulls[IO_NUM_COLUMNS] = {0};
+
+ /*
+ * Some combinations of IOContext, IOObject, and BackendType
+ * are not valid for any type of IOOp. In such cases, omit the
+ * entire row from the view.
+ */
+ if (!bktype_tracked ||
+ !pgstat_tracks_io_object(bktype, io_context, io_obj))
+ {
+ Assert(pgstat_iszero_io_object(bktype_stats->data[io_context][io_obj]));
+ continue;
+ }
+
+ values[IO_COL_BACKEND_TYPE] = bktype_desc;
+ values[IO_COL_IO_CONTEXT] = CStringGetTextDatum(context_name);
+ values[IO_COL_IO_OBJECT] = CStringGetTextDatum(obj_name);
+ values[IO_COL_RESET_TIME] = TimestampTzGetDatum(reset_time);
+
+ /*
+ * Hard-code this to the value of BLCKSZ for now. Future
+ * values could include XLOG_BLCKSZ, once WAL IO is tracked,
+ * and constant multipliers, once non-block-oriented IO (e.g.
+ * temporary file IO) is tracked.
+ */
+ values[IO_COL_CONVERSION] = Int64GetDatum(BLCKSZ);
+
+ /*
+ * Some combinations of BackendType and IOOp, of IOContext and
+ * IOOp, and of IOObject and IOOp are not tracked. Set these
+ * cells in the view NULL and assert that these stats are zero
+ * as expected.
+ */
+ for (IOOp io_op = IOOP_EVICT; io_op < IOOP_NUM_TYPES; io_op++)
+ {
+ int col_idx = pgstat_get_io_op_index(io_op);
+
+ nulls[col_idx] = !pgstat_tracks_io_op(bktype, io_context, io_obj, io_op);
+
+ if (!nulls[col_idx])
+ values[col_idx] =
+ Int64GetDatum(bktype_stats->data[io_context][io_obj][io_op]);
+ else
+ Assert(bktype_stats->data[io_context][io_obj][io_op] == 0);
+ }
+
+ tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
+ values, nulls);
+ }
+ }
+ }
+
+ return (Datum) 0;
+}
+
/*
* Returns statistics of WAL activity
*/
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 7be9a50147..782f27523f 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5686,6 +5686,15 @@
proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
+{ oid => '8459', descr => 'statistics: per backend type IO statistics',
+ proname => 'pg_stat_get_io', provolatile => 'v',
+ prorows => '30', proretset => 't',
+ proparallel => 'r', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,text,text,int8,int8,int8,int8,int8,int8,int8,timestamptz}',
+ proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}',
+ proargnames => '{backend_type,io_context,io_object,read,written,extended,op_bytes,evicted,reused,files_synced,stats_reset}',
+ prosrc => 'pg_stat_get_io' },
+
{ oid => '1136', descr => 'statistics: information about WAL activity',
proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
proparallel => 'r', prorettype => 'record', proargtypes => '',
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index fb9f936d43..2d0e7dc5c5 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1876,6 +1876,18 @@ pg_stat_gssapi| SELECT s.pid,
s.gss_enc AS encrypted
FROM pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid, query_id)
WHERE (s.client_port IS NOT NULL);
+pg_stat_io| SELECT b.backend_type,
+ b.io_context,
+ b.io_object,
+ b.read,
+ b.written,
+ b.extended,
+ b.op_bytes,
+ b.evicted,
+ b.reused,
+ b.files_synced,
+ b.stats_reset
+ FROM pg_stat_get_io() b(backend_type, io_context, io_object, read, written, extended, op_bytes, evicted, reused, files_synced, stats_reset);
pg_stat_progress_analyze| SELECT s.pid,
s.datid,
d.datname,
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index 1d84407a03..01070a53a4 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -1126,4 +1126,229 @@ SELECT pg_stat_get_subscription_stats(NULL);
(1 row)
+-- Test that the following operations are tracked in pg_stat_io:
+-- - reads of target blocks into shared buffers
+-- - writes of shared buffers to permanent storage
+-- - extends of relations using shared buffers
+-- - fsyncs done to ensure the durability of data dirtying shared buffers
+-- There is no test for blocks evicted from shared buffers, because we cannot
+-- be sure of the state of shared buffers at the point the test is run.
+-- Create a regular table and insert some data to generate IOCONTEXT_NORMAL
+-- extends.
+SELECT sum(extended) AS io_sum_shared_extends_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+CREATE TABLE test_io_shared(a int);
+INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(extended) AS io_sum_shared_extends_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT :io_sum_shared_extends_after > :io_sum_shared_extends_before;
+ ?column?
+----------
+ t
+(1 row)
+
+-- After a checkpoint, there should be some additional IOCONTEXT_NORMAL writes
+-- and fsyncs.
+-- The second checkpoint ensures that stats from the first checkpoint have been
+-- reported and protects against any potential races amongst the table
+-- creation, a possible timing-triggered checkpoint, and the explicit
+-- checkpoint in the test.
+SELECT sum(written) AS io_sum_shared_writes_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT sum(files_synced) AS io_sum_shared_fsyncs_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+CHECKPOINT;
+CHECKPOINT;
+SELECT sum(written) AS io_sum_shared_writes_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT sum(files_synced) AS io_sum_shared_fsyncs_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT :io_sum_shared_writes_after > :io_sum_shared_writes_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT current_setting('fsync') = 'off' OR :io_sum_shared_fsyncs_after > :io_sum_shared_fsyncs_before;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Change the tablespace so that the table is rewritten directly, then SELECT
+-- from it to cause it to be read back into shared buffers.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE test_io_shared_stats_tblspc LOCATION '';
+SELECT sum(read) AS io_sum_shared_reads_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+ALTER TABLE test_io_shared SET TABLESPACE test_io_shared_stats_tblspc;
+-- SELECT from the table so that it is read into shared buffers and io_context
+-- 'normal', io_object 'relation' reads are counted.
+SELECT COUNT(*) FROM test_io_shared;
+ count
+-------
+ 100
+(1 row)
+
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(read) AS io_sum_shared_reads_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT :io_sum_shared_reads_after > :io_sum_shared_reads_before;
+ ?column?
+----------
+ t
+(1 row)
+
+DROP TABLE test_io_shared;
+DROP TABLESPACE test_io_shared_stats_tblspc;
+-- Test that the following IOCONTEXT_LOCAL IOOps are tracked in pg_stat_io:
+-- - eviction of local buffers in order to reuse them
+-- - reads of temporary table blocks into local buffers
+-- - writes of local buffers to permanent storage
+-- - extends of temporary tables
+-- Set temp_buffers to a low value so that we can trigger writes with fewer
+-- inserted tuples. Do so in a new session in case temporary tables have been
+-- accessed by previous tests in this session.
+\c
+SET temp_buffers TO '1MB';
+CREATE TEMPORARY TABLE test_io_local(a int, b TEXT);
+SELECT sum(extended) AS io_sum_local_extends_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(evicted) AS io_sum_local_evictions_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(written) AS io_sum_local_writes_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+-- Insert tuples into the temporary table, generating extends in the stats.
+-- Insert enough values that we need to reuse and write out dirty local
+-- buffers, generating evictions and writes.
+INSERT INTO test_io_local SELECT generate_series(1, 8000) as id, repeat('a', 100);
+SELECT sum(read) AS io_sum_local_reads_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+-- Read in evicted buffers, generating reads.
+SELECT COUNT(*) FROM test_io_local;
+ count
+-------
+ 8000
+(1 row)
+
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(evicted) AS io_sum_local_evictions_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(read) AS io_sum_local_reads_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(written) AS io_sum_local_writes_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(extended) AS io_sum_local_extends_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT :io_sum_local_evictions_after > :io_sum_local_evictions_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_local_reads_after > :io_sum_local_reads_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_local_writes_after > :io_sum_local_writes_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_local_extends_after > :io_sum_local_extends_before;
+ ?column?
+----------
+ t
+(1 row)
+
+RESET temp_buffers;
+-- Test that reuse of strategy buffers and reads of blocks into these reused
+-- buffers while VACUUMing are tracked in pg_stat_io.
+-- Set wal_skip_threshold smaller than the expected size of
+-- test_io_vac_strategy so that, even if wal_level is minimal, VACUUM FULL will
+-- fsync the newly rewritten test_io_vac_strategy instead of writing it to WAL.
+-- Writing it to WAL will result in the newly written relation pages being in
+-- shared buffers -- preventing us from testing BAS_VACUUM BufferAccessStrategy
+-- reads.
+SET wal_skip_threshold = '1 kB';
+SELECT sum(reused) AS io_sum_vac_strategy_reuses_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(read) AS io_sum_vac_strategy_reads_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+CREATE TABLE test_io_vac_strategy(a int, b int) WITH (autovacuum_enabled = 'false');
+INSERT INTO test_io_vac_strategy SELECT i, i from generate_series(1, 8000)i;
+-- Ensure that the next VACUUM will need to perform IO by rewriting the table
+-- first with VACUUM (FULL).
+VACUUM (FULL) test_io_vac_strategy;
+VACUUM (PARALLEL 0) test_io_vac_strategy;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(reused) AS io_sum_vac_strategy_reuses_after FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(read) AS io_sum_vac_strategy_reads_after FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT :io_sum_vac_strategy_reads_after > :io_sum_vac_strategy_reads_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_vac_strategy_reuses_after > :io_sum_vac_strategy_reuses_before;
+ ?column?
+----------
+ t
+(1 row)
+
+RESET wal_skip_threshold;
+-- Test that extends done by a CTAS, which uses a BAS_BULKWRITE
+-- BufferAccessStrategy, are tracked in pg_stat_io.
+SELECT sum(extended) AS io_sum_bulkwrite_strategy_extends_before FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+CREATE TABLE test_io_bulkwrite_strategy AS SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(extended) AS io_sum_bulkwrite_strategy_extends_after FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+SELECT :io_sum_bulkwrite_strategy_extends_after > :io_sum_bulkwrite_strategy_extends_before;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Test IO stats reset
+SELECT sum(evicted) + sum(reused) + sum(extended) + sum(files_synced) + sum(read) + sum(written) AS io_stats_pre_reset FROM pg_stat_io \gset
+SELECT pg_stat_reset_shared('io');
+ pg_stat_reset_shared
+----------------------
+
+(1 row)
+
+SELECT sum(evicted) + sum(reused) + sum(extended) + sum(files_synced) + sum(read) + sum(written) AS io_stats_post_reset FROM pg_stat_io \gset
+SELECT :io_stats_post_reset < :io_stats_pre_reset;
+ ?column?
+----------
+ t
+(1 row)
+
-- End of Stats Test
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index b4d6753c71..962ae5b281 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -536,4 +536,142 @@ SELECT pg_stat_get_replication_slot(NULL);
SELECT pg_stat_get_subscription_stats(NULL);
+-- Test that the following operations are tracked in pg_stat_io:
+-- - reads of target blocks into shared buffers
+-- - writes of shared buffers to permanent storage
+-- - extends of relations using shared buffers
+-- - fsyncs done to ensure the durability of data dirtying shared buffers
+
+-- There is no test for blocks evicted from shared buffers, because we cannot
+-- be sure of the state of shared buffers at the point the test is run.
+
+-- Create a regular table and insert some data to generate IOCONTEXT_NORMAL
+-- extends.
+SELECT sum(extended) AS io_sum_shared_extends_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+CREATE TABLE test_io_shared(a int);
+INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+SELECT sum(extended) AS io_sum_shared_extends_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT :io_sum_shared_extends_after > :io_sum_shared_extends_before;
+
+-- After a checkpoint, there should be some additional IOCONTEXT_NORMAL writes
+-- and fsyncs.
+-- The second checkpoint ensures that stats from the first checkpoint have been
+-- reported and protects against any potential races amongst the table
+-- creation, a possible timing-triggered checkpoint, and the explicit
+-- checkpoint in the test.
+SELECT sum(written) AS io_sum_shared_writes_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT sum(files_synced) AS io_sum_shared_fsyncs_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+CHECKPOINT;
+CHECKPOINT;
+SELECT sum(written) AS io_sum_shared_writes_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT sum(files_synced) AS io_sum_shared_fsyncs_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+
+SELECT :io_sum_shared_writes_after > :io_sum_shared_writes_before;
+SELECT current_setting('fsync') = 'off' OR :io_sum_shared_fsyncs_after > :io_sum_shared_fsyncs_before;
+
+-- Change the tablespace so that the table is rewritten directly, then SELECT
+-- from it to cause it to be read back into shared buffers.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE test_io_shared_stats_tblspc LOCATION '';
+SELECT sum(read) AS io_sum_shared_reads_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+ALTER TABLE test_io_shared SET TABLESPACE test_io_shared_stats_tblspc;
+-- SELECT from the table so that it is read into shared buffers and io_context
+-- 'normal', io_object 'relation' reads are counted.
+SELECT COUNT(*) FROM test_io_shared;
+SELECT pg_stat_force_next_flush();
+SELECT sum(read) AS io_sum_shared_reads_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT :io_sum_shared_reads_after > :io_sum_shared_reads_before;
+DROP TABLE test_io_shared;
+DROP TABLESPACE test_io_shared_stats_tblspc;
+
+-- Test that the following IOCONTEXT_LOCAL IOOps are tracked in pg_stat_io:
+-- - eviction of local buffers in order to reuse them
+-- - reads of temporary table blocks into local buffers
+-- - writes of local buffers to permanent storage
+-- - extends of temporary tables
+
+-- Set temp_buffers to a low value so that we can trigger writes with fewer
+-- inserted tuples. Do so in a new session in case temporary tables have been
+-- accessed by previous tests in this session.
+\c
+SET temp_buffers TO '1MB';
+CREATE TEMPORARY TABLE test_io_local(a int, b TEXT);
+SELECT sum(extended) AS io_sum_local_extends_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(evicted) AS io_sum_local_evictions_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(written) AS io_sum_local_writes_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+-- Insert tuples into the temporary table, generating extends in the stats.
+-- Insert enough values that we need to reuse and write out dirty local
+-- buffers, generating evictions and writes.
+INSERT INTO test_io_local SELECT generate_series(1, 8000) as id, repeat('a', 100);
+
+SELECT sum(read) AS io_sum_local_reads_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+-- Read in evicted buffers, generating reads.
+SELECT COUNT(*) FROM test_io_local;
+SELECT pg_stat_force_next_flush();
+SELECT sum(evicted) AS io_sum_local_evictions_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(read) AS io_sum_local_reads_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(written) AS io_sum_local_writes_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(extended) AS io_sum_local_extends_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT :io_sum_local_evictions_after > :io_sum_local_evictions_before;
+SELECT :io_sum_local_reads_after > :io_sum_local_reads_before;
+SELECT :io_sum_local_writes_after > :io_sum_local_writes_before;
+SELECT :io_sum_local_extends_after > :io_sum_local_extends_before;
+RESET temp_buffers;
+
+-- Test that reuse of strategy buffers and reads of blocks into these reused
+-- buffers while VACUUMing are tracked in pg_stat_io.
+
+-- Set wal_skip_threshold smaller than the expected size of
+-- test_io_vac_strategy so that, even if wal_level is minimal, VACUUM FULL will
+-- fsync the newly rewritten test_io_vac_strategy instead of writing it to WAL.
+-- Writing it to WAL will result in the newly written relation pages being in
+-- shared buffers -- preventing us from testing BAS_VACUUM BufferAccessStrategy
+-- reads.
+SET wal_skip_threshold = '1 kB';
+SELECT sum(reused) AS io_sum_vac_strategy_reuses_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(read) AS io_sum_vac_strategy_reads_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+CREATE TABLE test_io_vac_strategy(a int, b int) WITH (autovacuum_enabled = 'false');
+INSERT INTO test_io_vac_strategy SELECT i, i from generate_series(1, 8000)i;
+-- Ensure that the next VACUUM will need to perform IO by rewriting the table
+-- first with VACUUM (FULL).
+VACUUM (FULL) test_io_vac_strategy;
+VACUUM (PARALLEL 0) test_io_vac_strategy;
+SELECT pg_stat_force_next_flush();
+SELECT sum(reused) AS io_sum_vac_strategy_reuses_after FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(read) AS io_sum_vac_strategy_reads_after FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT :io_sum_vac_strategy_reads_after > :io_sum_vac_strategy_reads_before;
+SELECT :io_sum_vac_strategy_reuses_after > :io_sum_vac_strategy_reuses_before;
+RESET wal_skip_threshold;
+
+-- Test that extends done by a CTAS, which uses a BAS_BULKWRITE
+-- BufferAccessStrategy, are tracked in pg_stat_io.
+SELECT sum(extended) AS io_sum_bulkwrite_strategy_extends_before FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+CREATE TABLE test_io_bulkwrite_strategy AS SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+SELECT sum(extended) AS io_sum_bulkwrite_strategy_extends_after FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+SELECT :io_sum_bulkwrite_strategy_extends_after > :io_sum_bulkwrite_strategy_extends_before;
+
+-- Test IO stats reset
+SELECT sum(evicted) + sum(reused) + sum(extended) + sum(files_synced) + sum(read) + sum(written) AS io_stats_pre_reset FROM pg_stat_io \gset
+SELECT pg_stat_reset_shared('io');
+SELECT sum(evicted) + sum(reused) + sum(extended) + sum(files_synced) + sum(read) + sum(written) AS io_stats_post_reset FROM pg_stat_io \gset
+SELECT :io_stats_post_reset < :io_stats_pre_reset;
+
-- End of Stats Test
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 6779e84c2e..c4fc3d98ee 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3370,6 +3370,7 @@ intset_internal_node
intset_leaf_node
intset_node
intvKEY
+io_stat_col
itemIdCompact
itemIdCompactData
iterator
--
2.38.1
v44-0002-pgstat-Infrastructure-to-track-IO-operations.patch
From bb72ed77141821b3b0a946aa1f9424a04146d5a9 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 4 Jan 2023 17:20:41 -0500
Subject: [PATCH v44 2/4] pgstat: Infrastructure to track IO operations
Introduce "IOOp", an IO operation done by a backend, "IOObject", the
target object of the IO, and "IOContext", the context or location of the
IO operations on that object. For example, the checkpointer may write a
shared buffer out. This would be considered an IOOp "written" on an
IOObject IOOBJECT_RELATION in IOContext IOCONTEXT_NORMAL by BackendType
"checkpointer".
Each IOOp (evict, extend, fsync, read, reuse, and write) can be counted
per IOObject (relation, temp relation) per IOContext (normal, bulkread,
bulkwrite, or vacuum) through a call to pgstat_count_io_op().
Note that this commit introduces the infrastructure to count IO
Operation statistics. A subsequent commit will add calls to
pgstat_count_io_op() in the appropriate locations.
IOContext IOCONTEXT_NORMAL concerns operations on local and shared
buffers, while IOCONTEXT_BULKREAD, IOCONTEXT_BULKWRITE, and
IOCONTEXT_VACUUM IOContexts concern IO operations on buffers as part of
a BufferAccessStrategy.
IOObject IOOBJECT_TEMP_RELATION concerns IO Operations on buffers
containing temporary table data, while IOObject IOOBJECT_RELATION
concerns IO Operations on buffers containing permanent relation data.
Stats on IOOps on all IOObjects in all IOContexts for a given backend
are first counted in a backend's local memory and then flushed to shared
memory and accumulated with those from all other backends, exited and
live.
Some BackendTypes do not flush their pending statistics at regular
intervals, so they explicitly call pgstat_flush_io() during the course of
normal operations to flush their backend-local IO operation statistics to
shared memory in a timely manner.
Because not all BackendType, IOOp, IOObject, IOContext combinations are
valid, the validity of the stats is checked before flushing pending
stats and before reading in the existing stats file to shared memory.
The aggregated stats in shared memory could be extended in the future
with per-backend stats -- useful for per connection IO statistics and
monitoring.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Maciek Sakrejda <m.sakrejda@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
doc/src/sgml/monitoring.sgml | 2 +
src/backend/utils/activity/Makefile | 1 +
src/backend/utils/activity/meson.build | 1 +
src/backend/utils/activity/pgstat.c | 38 ++
src/backend/utils/activity/pgstat_bgwriter.c | 7 +-
.../utils/activity/pgstat_checkpointer.c | 7 +-
src/backend/utils/activity/pgstat_io.c | 400 ++++++++++++++++++
src/backend/utils/activity/pgstat_relation.c | 15 +-
src/backend/utils/activity/pgstat_shmem.c | 4 +
src/backend/utils/activity/pgstat_wal.c | 4 +-
src/backend/utils/adt/pgstatfuncs.c | 4 +-
src/include/miscadmin.h | 2 +
src/include/pgstat.h | 67 +++
src/include/utils/pgstat_internal.h | 34 ++
src/tools/pgindent/typedefs.list | 6 +
15 files changed, 586 insertions(+), 6 deletions(-)
create mode 100644 src/backend/utils/activity/pgstat_io.c
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 5bcba0fdec..710bd2c52e 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -5403,6 +5403,8 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
the <structname>pg_stat_bgwriter</structname>
view, <literal>archiver</literal> to reset all the counters shown in
the <structname>pg_stat_archiver</structname> view,
+ <literal>io</literal> to reset all the counters shown in the
+ <structname>pg_stat_io</structname> view,
<literal>wal</literal> to reset all the counters shown in the
<structname>pg_stat_wal</structname> view or
<literal>recovery_prefetch</literal> to reset all the counters shown
diff --git a/src/backend/utils/activity/Makefile b/src/backend/utils/activity/Makefile
index a80eda3cf4..7d7482dde0 100644
--- a/src/backend/utils/activity/Makefile
+++ b/src/backend/utils/activity/Makefile
@@ -22,6 +22,7 @@ OBJS = \
pgstat_checkpointer.o \
pgstat_database.o \
pgstat_function.o \
+ pgstat_io.o \
pgstat_relation.o \
pgstat_replslot.o \
pgstat_shmem.o \
diff --git a/src/backend/utils/activity/meson.build b/src/backend/utils/activity/meson.build
index a2b872c24b..518ee3f798 100644
--- a/src/backend/utils/activity/meson.build
+++ b/src/backend/utils/activity/meson.build
@@ -9,6 +9,7 @@ backend_sources += files(
'pgstat_checkpointer.c',
'pgstat_database.c',
'pgstat_function.c',
+ 'pgstat_io.c',
'pgstat_relation.c',
'pgstat_replslot.c',
'pgstat_shmem.c',
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 0fa5370bcd..4ae5ee51f6 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -72,6 +72,7 @@
* - pgstat_checkpointer.c
* - pgstat_database.c
* - pgstat_function.c
+ * - pgstat_io.c
* - pgstat_relation.c
* - pgstat_replslot.c
* - pgstat_slru.c
@@ -359,6 +360,15 @@ static const PgStat_KindInfo pgstat_kind_infos[PGSTAT_NUM_KINDS] = {
.snapshot_cb = pgstat_checkpointer_snapshot_cb,
},
+ [PGSTAT_KIND_IO] = {
+ .name = "io_ops",
+
+ .fixed_amount = true,
+
+ .reset_all_cb = pgstat_io_reset_all_cb,
+ .snapshot_cb = pgstat_io_snapshot_cb,
+ },
+
[PGSTAT_KIND_SLRU] = {
.name = "slru",
@@ -582,6 +592,7 @@ pgstat_report_stat(bool force)
/* Don't expend a clock check if nothing to do */
if (dlist_is_empty(&pgStatPending) &&
+ !have_iostats &&
!have_slrustats &&
!pgstat_have_pending_wal())
{
@@ -628,6 +639,9 @@ pgstat_report_stat(bool force)
/* flush database / relation / function / ... stats */
partial_flush |= pgstat_flush_pending_entries(nowait);
+ /* flush IO stats */
+ partial_flush |= pgstat_flush_io(nowait);
+
/* flush wal stats */
partial_flush |= pgstat_flush_wal(nowait);
@@ -1322,6 +1336,15 @@ pgstat_write_statsfile(void)
pgstat_build_snapshot_fixed(PGSTAT_KIND_CHECKPOINTER);
write_chunk_s(fpout, &pgStatLocal.snapshot.checkpointer);
+ /*
+ * Write IO stats struct
+ */
+ pgstat_build_snapshot_fixed(PGSTAT_KIND_IO);
+ write_chunk_s(fpout, &pgStatLocal.snapshot.io.stat_reset_timestamp);
+ for (BackendType bktype = B_INVALID + 1; bktype < BACKEND_NUM_TYPES;
+ bktype++)
+ write_chunk_s(fpout, &pgStatLocal.snapshot.io.stats[bktype].data);
+
/*
* Write SLRU stats struct
*/
@@ -1496,6 +1519,21 @@ pgstat_read_statsfile(void)
if (!read_chunk_s(fpin, &shmem->checkpointer.stats))
goto error;
+ /*
+ * Read IO stats struct
+ */
+ if (!read_chunk_s(fpin, &shmem->io.stat_reset_timestamp))
+ goto error;
+
+ for (BackendType bktype = B_INVALID + 1; bktype < BACKEND_NUM_TYPES;
+ bktype++)
+ {
+ Assert(pgstat_bktype_io_stats_valid(&shmem->io.stats[bktype],
+ bktype));
+ if (!read_chunk_s(fpin, &shmem->io.stats[bktype].data))
+ goto error;
+ }
+
/*
* Read SLRU stats struct
*/
diff --git a/src/backend/utils/activity/pgstat_bgwriter.c b/src/backend/utils/activity/pgstat_bgwriter.c
index 9247f2dda2..92be384b0d 100644
--- a/src/backend/utils/activity/pgstat_bgwriter.c
+++ b/src/backend/utils/activity/pgstat_bgwriter.c
@@ -24,7 +24,7 @@ PgStat_BgWriterStats PendingBgWriterStats = {0};
/*
- * Report bgwriter statistics
+ * Report bgwriter and IO statistics
*/
void
pgstat_report_bgwriter(void)
@@ -56,6 +56,11 @@ pgstat_report_bgwriter(void)
* Clear out the statistics buffer, so it can be re-used.
*/
MemSet(&PendingBgWriterStats, 0, sizeof(PendingBgWriterStats));
+
+ /*
+ * Report IO statistics
+ */
+ pgstat_flush_io(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_checkpointer.c b/src/backend/utils/activity/pgstat_checkpointer.c
index 3e9ab45103..26dec112f6 100644
--- a/src/backend/utils/activity/pgstat_checkpointer.c
+++ b/src/backend/utils/activity/pgstat_checkpointer.c
@@ -24,7 +24,7 @@ PgStat_CheckpointerStats PendingCheckpointerStats = {0};
/*
- * Report checkpointer statistics
+ * Report checkpointer and IO statistics
*/
void
pgstat_report_checkpointer(void)
@@ -62,6 +62,11 @@ pgstat_report_checkpointer(void)
* Clear out the statistics buffer, so it can be re-used.
*/
MemSet(&PendingCheckpointerStats, 0, sizeof(PendingCheckpointerStats));
+
+ /*
+ * Report IO statistics
+ */
+ pgstat_flush_io(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
new file mode 100644
index 0000000000..f6458cc66d
--- /dev/null
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -0,0 +1,400 @@
+/* -------------------------------------------------------------------------
+ *
+ * pgstat_io.c
+ * Implementation of IO statistics.
+ *
+ * This file contains the implementation of IO statistics. It is kept separate
+ * from pgstat.c to enforce the line between the statistics access / storage
+ * implementation and the details about individual types of statistics.
+ *
+ * Copyright (c) 2021-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/utils/activity/pgstat_io.c
+ * -------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "utils/pgstat_internal.h"
+
+
+static PgStat_BackendIO PendingIOStats;
+bool have_iostats = false;
+
+
+void
+pgstat_count_io_op(IOOp io_op, IOObject io_object, IOContext io_context)
+{
+ Assert(io_context < IOCONTEXT_NUM_TYPES);
+ Assert(io_object < IOOBJECT_NUM_TYPES);
+ Assert(io_op < IOOP_NUM_TYPES);
+ Assert(pgstat_tracks_io_op(MyBackendType, io_context, io_object, io_op));
+
+ PendingIOStats.data[io_context][io_object][io_op]++;
+
+ have_iostats = true;
+}
+
+PgStat_IO *
+pgstat_fetch_stat_io(void)
+{
+ pgstat_snapshot_fixed(PGSTAT_KIND_IO);
+
+ return &pgStatLocal.snapshot.io;
+}
+
+/*
+ * Flush out locally pending IO statistics
+ *
+ * If no stats have been recorded, this function returns false.
+ *
+ * If nowait is true and the lock could not be acquired, this function
+ * returns true without flushing. Otherwise, it returns false.
+ */
+bool
+pgstat_flush_io(bool nowait)
+{
+ LWLock *bktype_lock;
+ PgStat_BackendIO *bktype_shstats;
+
+ if (!have_iostats)
+ return false;
+
+ bktype_lock = &pgStatLocal.shmem->io.locks[MyBackendType];
+ bktype_shstats =
+ &pgStatLocal.shmem->io.stats[MyBackendType];
+
+ if (!nowait)
+ LWLockAcquire(bktype_lock, LW_EXCLUSIVE);
+ else if (!LWLockConditionalAcquire(bktype_lock, LW_EXCLUSIVE))
+ return true;
+
+ for (IOContext io_context = IOCONTEXT_FIRST;
+ io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ for (IOObject io_object = IOOBJECT_FIRST;
+ io_object < IOOBJECT_NUM_TYPES; io_object++)
+ for (IOOp io_op = IOOP_FIRST;
+ io_op < IOOP_NUM_TYPES; io_op++)
+ bktype_shstats->data[io_context][io_object][io_op] +=
+ PendingIOStats.data[io_context][io_object][io_op];
+
+ Assert(pgstat_bktype_io_stats_valid(bktype_shstats, MyBackendType));
+
+ LWLockRelease(bktype_lock);
+
+ memset(&PendingIOStats, 0, sizeof(PendingIOStats));
+
+ have_iostats = false;
+
+ return false;
+}
+
+const char *
+pgstat_get_io_context_name(IOContext io_context)
+{
+ switch (io_context)
+ {
+ case IOCONTEXT_BULKREAD:
+ return "bulkread";
+ case IOCONTEXT_BULKWRITE:
+ return "bulkwrite";
+ case IOCONTEXT_NORMAL:
+ return "normal";
+ case IOCONTEXT_VACUUM:
+ return "vacuum";
+ }
+
+ elog(ERROR, "unrecognized IOContext value: %d", io_context);
+ pg_unreachable();
+}
+
+const char *
+pgstat_get_io_object_name(IOObject io_object)
+{
+ switch (io_object)
+ {
+ case IOOBJECT_RELATION:
+ return "relation";
+ case IOOBJECT_TEMP_RELATION:
+ return "temp relation";
+ }
+
+ elog(ERROR, "unrecognized IOObject value: %d", io_object);
+ pg_unreachable();
+}
+
+const char *
+pgstat_get_io_op_name(IOOp io_op)
+{
+ switch (io_op)
+ {
+ case IOOP_EVICT:
+ return "evicted";
+ case IOOP_EXTEND:
+ return "extended";
+ case IOOP_FSYNC:
+ return "files synced";
+ case IOOP_READ:
+ return "read";
+ case IOOP_REUSE:
+ return "reused";
+ case IOOP_WRITE:
+ return "written";
+ }
+
+ elog(ERROR, "unrecognized IOOp value: %d", io_op);
+ pg_unreachable();
+}
+
+void
+pgstat_io_reset_all_cb(TimestampTz ts)
+{
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ LWLock *bktype_lock = &pgStatLocal.shmem->io.locks[i];
+ PgStat_BackendIO *bktype_shstats = &pgStatLocal.shmem->io.stats[i];
+
+ LWLockAcquire(bktype_lock, LW_EXCLUSIVE);
+
+ /*
+ * Use the lock in the first BackendType's PgStat_BackendIO to protect
+ * the reset timestamp as well.
+ */
+ if (i == 0)
+ pgStatLocal.shmem->io.stat_reset_timestamp = ts;
+
+ memset(bktype_shstats, 0, sizeof(*bktype_shstats));
+ LWLockRelease(bktype_lock);
+ }
+}
+
+void
+pgstat_io_snapshot_cb(void)
+{
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ LWLock *bktype_lock = &pgStatLocal.shmem->io.locks[i];
+ PgStat_BackendIO *bktype_shstats = &pgStatLocal.shmem->io.stats[i];
+ PgStat_BackendIO *bktype_snap = &pgStatLocal.snapshot.io.stats[i];
+
+ LWLockAcquire(bktype_lock, LW_SHARED);
+
+ /*
+ * Use the lock in the first BackendType's PgStat_BackendIO to protect
+ * the reset timestamp as well.
+ */
+ if (i == 0)
+ pgStatLocal.snapshot.io.stat_reset_timestamp =
+ pgStatLocal.shmem->io.stat_reset_timestamp;
+
+ /* using struct assignment due to better type safety */
+ *bktype_snap = *bktype_shstats;
+ LWLockRelease(bktype_lock);
+ }
+}
+
+/*
+ * IO statistics are not collected for all BackendTypes.
+ *
+ * The following BackendTypes do not participate in the cumulative stats
+ * subsystem or do not perform IO that we currently track:
+ * - Syslogger, because it is not connected to shared memory
+ * - Archiver, because most relevant archiving IO is delegated to a
+ *   specialized command or module
+ * - WAL Receiver and WAL Writer, because their IO is not tracked in
+ *   pg_stat_io for now
+ *
+ * Returns true if the given BackendType participates in the cumulative stats
+ * subsystem for IO and false otherwise.
+ */
+bool
+pgstat_tracks_io_bktype(BackendType bktype)
+{
+ /*
+ * List every type so that new backend types trigger a warning about
+ * needing to adjust this switch.
+ */
+ switch (bktype)
+ {
+ case B_INVALID:
+ case B_ARCHIVER:
+ case B_LOGGER:
+ case B_WAL_RECEIVER:
+ case B_WAL_WRITER:
+ return false;
+
+ case B_AUTOVAC_LAUNCHER:
+ case B_AUTOVAC_WORKER:
+ case B_BACKEND:
+ case B_BG_WORKER:
+ case B_BG_WRITER:
+ case B_CHECKPOINTER:
+ case B_STANDALONE_BACKEND:
+ case B_STARTUP:
+ case B_WAL_SENDER:
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * Some BackendTypes do not perform IO in certain IOContexts. Some IOObjects
+ * are never operated on in some IOContexts. Check that the given BackendType
+ * is expected to do IO in the given IOContext and that the given IOObject is
+ * expected to be operated on in the given IOContext.
+ */
+bool
+pgstat_tracks_io_object(BackendType bktype, IOContext io_context,
+ IOObject io_object)
+{
+ bool no_temp_rel;
+
+ /*
+ * Some BackendTypes should never track IO statistics.
+ */
+ if (!pgstat_tracks_io_bktype(bktype))
+ return false;
+
+ /*
+ * Currently, IO on temporary relations can only occur in the
+ * IOCONTEXT_NORMAL IOContext.
+ */
+ if (io_context != IOCONTEXT_NORMAL &&
+ io_object == IOOBJECT_TEMP_RELATION)
+ return false;
+
+ /*
+ * In core Postgres, only regular backends and WAL Sender processes
+ * executing queries will use local buffers and operate on temporary
+ * relations. Parallel workers will not use local buffers (see
+ * InitLocalBuffers()); however, extensions leveraging background workers
+ * have no such limitation, so track IO on IOOBJECT_TEMP_RELATION for
+ * BackendType B_BG_WORKER.
+ */
+ no_temp_rel = bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER ||
+ bktype == B_CHECKPOINTER || bktype == B_AUTOVAC_WORKER ||
+ bktype == B_STANDALONE_BACKEND || bktype == B_STARTUP;
+
+ if (no_temp_rel && io_context == IOCONTEXT_NORMAL &&
+ io_object == IOOBJECT_TEMP_RELATION)
+ return false;
+
+ /*
+ * Some BackendTypes do not currently perform any IO in certain
+ * IOContexts, and, while it may not be inherently incorrect for them to
+ * do so, excluding those rows from the view makes the view easier to use.
+ */
+ if ((bktype == B_CHECKPOINTER || bktype == B_BG_WRITER) &&
+ (io_context == IOCONTEXT_BULKREAD ||
+ io_context == IOCONTEXT_BULKWRITE ||
+ io_context == IOCONTEXT_VACUUM))
+ return false;
+
+ if (bktype == B_AUTOVAC_LAUNCHER && io_context == IOCONTEXT_VACUUM)
+ return false;
+
+ if ((bktype == B_AUTOVAC_WORKER || bktype == B_AUTOVAC_LAUNCHER) &&
+ io_context == IOCONTEXT_BULKWRITE)
+ return false;
+
+ return true;
+}
+
+/*
+ * Some BackendTypes will never do certain IOOps and some IOOps should not
+ * occur in certain IOContexts. Check that the given IOOp is valid for the
+ * given BackendType in the given IOContext. Note that there are currently no
+ * cases of an IOOp being invalid for a particular BackendType only within a
+ * certain IOContext.
+ */
+bool
+pgstat_tracks_io_op(BackendType bktype, IOContext io_context,
+ IOObject io_object, IOOp io_op)
+{
+ bool strategy_io_context;
+
+ /* if (io_context, io_object) will never collect stats, we're done */
+ if (!pgstat_tracks_io_object(bktype, io_context, io_object))
+ return false;
+
+ /*
+ * Some BackendTypes will not do certain IOOps.
+ */
+ if ((bktype == B_BG_WRITER || bktype == B_CHECKPOINTER) &&
+ (io_op == IOOP_READ || io_op == IOOP_EVICT))
+ return false;
+
+ if ((bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER ||
+ bktype == B_CHECKPOINTER) && io_op == IOOP_EXTEND)
+ return false;
+
+ /*
+ * Some IOOps are not valid in certain IOContexts and some IOOps are only
+ * valid in certain contexts.
+ */
+ if (io_context == IOCONTEXT_BULKREAD && io_op == IOOP_EXTEND)
+ return false;
+
+ strategy_io_context = io_context == IOCONTEXT_BULKREAD ||
+ io_context == IOCONTEXT_BULKWRITE || io_context == IOCONTEXT_VACUUM;
+
+ /*
+ * IOOP_REUSE is only relevant when a BufferAccessStrategy is in use.
+ */
+ if (!strategy_io_context && io_op == IOOP_REUSE)
+ return false;
+
+ /*
+ * IOOP_FSYNC IOOps done by a backend using a BufferAccessStrategy are
+ * counted in the IOCONTEXT_NORMAL IOContext. See comment in
+ * ForwardSyncRequest() for more details.
+ */
+ if (strategy_io_context && io_op == IOOP_FSYNC)
+ return false;
+
+ /*
+ * Temporary tables are not logged and thus do not require fsync'ing.
+ */
+ if (io_context == IOCONTEXT_NORMAL &&
+ io_object == IOOBJECT_TEMP_RELATION && io_op == IOOP_FSYNC)
+ return false;
+
+ return true;
+}
+
+/*
+ * Check that stats have not been counted for any combination of IOContext,
+ * IOObject, and IOOp which are not tracked for the passed-in BackendType. The
+ * passed-in PgStat_BackendIO must contain stats from the BackendType specified
+ * by the second parameter. Caller is responsible for locking the passed-in
+ * PgStat_BackendIO, if needed.
+ */
+bool
+pgstat_bktype_io_stats_valid(PgStat_BackendIO *backend_io,
+ BackendType bktype)
+{
+ bool bktype_tracked = pgstat_tracks_io_bktype(bktype);
+
+ for (IOContext io_context = IOCONTEXT_FIRST;
+ io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ {
+ for (IOObject io_object = IOOBJECT_FIRST;
+ io_object < IOOBJECT_NUM_TYPES; io_object++)
+ {
+ /*
+ * Don't bother trying to skip to the next loop iteration if
+ * pgstat_tracks_io_object() would return false here. We still need
+ * to validate that each counter is zero anyway.
+ */
+ for (IOOp io_op = IOOP_FIRST; io_op < IOOP_NUM_TYPES; io_op++)
+ {
+ if ((!bktype_tracked || !pgstat_tracks_io_op(bktype, io_context, io_object, io_op)) &&
+ backend_io->data[io_context][io_object][io_op] != 0)
+ return false;
+ }
+ }
+ }
+
+ return true;
+}
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index 2e20b93c20..f793ac1516 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -206,7 +206,7 @@ pgstat_drop_relation(Relation rel)
}
/*
- * Report that the table was just vacuumed.
+ * Report that the table was just vacuumed and flush IO statistics.
*/
void
pgstat_report_vacuum(Oid tableoid, bool shared,
@@ -258,10 +258,18 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
}
pgstat_unlock_entry(entry_ref);
+
+ /*
+ * Flush IO statistics now. pgstat_report_stat() will flush IO stats,
+ * however this will not be called until after an entire autovacuum cycle
+ * is done -- which will likely vacuum many relations -- or until the
+ * VACUUM command has processed all tables and committed.
+ */
+ pgstat_flush_io(false);
}
/*
- * Report that the table was just analyzed.
+ * Report that the table was just analyzed and flush IO statistics.
*
* Caller must provide new live- and dead-tuples estimates, as well as a
* flag indicating whether to reset the mod_since_analyze counter.
@@ -341,6 +349,9 @@ pgstat_report_analyze(Relation rel,
}
pgstat_unlock_entry(entry_ref);
+
+ /* see pgstat_report_vacuum() */
+ pgstat_flush_io(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_shmem.c b/src/backend/utils/activity/pgstat_shmem.c
index c1506b53d0..09fffd0e82 100644
--- a/src/backend/utils/activity/pgstat_shmem.c
+++ b/src/backend/utils/activity/pgstat_shmem.c
@@ -202,6 +202,10 @@ StatsShmemInit(void)
LWLockInitialize(&ctl->checkpointer.lock, LWTRANCHE_PGSTATS_DATA);
LWLockInitialize(&ctl->slru.lock, LWTRANCHE_PGSTATS_DATA);
LWLockInitialize(&ctl->wal.lock, LWTRANCHE_PGSTATS_DATA);
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ LWLockInitialize(&ctl->io.locks[i],
+ LWTRANCHE_PGSTATS_DATA);
}
else
{
diff --git a/src/backend/utils/activity/pgstat_wal.c b/src/backend/utils/activity/pgstat_wal.c
index e7a82b5fed..e8598b2f4e 100644
--- a/src/backend/utils/activity/pgstat_wal.c
+++ b/src/backend/utils/activity/pgstat_wal.c
@@ -34,7 +34,7 @@ static WalUsage prevWalUsage;
/*
* Calculate how much WAL usage counters have increased and update
- * shared statistics.
+ * shared WAL and IO statistics.
*
* Must be called by processes that generate WAL, that do not call
* pgstat_report_stat(), like walwriter.
@@ -43,6 +43,8 @@ void
pgstat_report_wal(bool force)
{
pgstat_flush_wal(force);
+
+ pgstat_flush_io(force);
}
/*
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 58bd1360b9..42b890b806 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1593,6 +1593,8 @@ pg_stat_reset_shared(PG_FUNCTION_ARGS)
pgstat_reset_of_kind(PGSTAT_KIND_BGWRITER);
pgstat_reset_of_kind(PGSTAT_KIND_CHECKPOINTER);
}
+ else if (strcmp(target, "io") == 0)
+ pgstat_reset_of_kind(PGSTAT_KIND_IO);
else if (strcmp(target, "recovery_prefetch") == 0)
XLogPrefetchResetStats();
else if (strcmp(target, "wal") == 0)
@@ -1601,7 +1603,7 @@ pg_stat_reset_shared(PG_FUNCTION_ARGS)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("unrecognized reset target: \"%s\"", target),
- errhint("Target must be \"archiver\", \"bgwriter\", \"recovery_prefetch\", or \"wal\".")));
+ errhint("Target must be \"archiver\", \"bgwriter\", \"io\", \"recovery_prefetch\", or \"wal\".")));
PG_RETURN_VOID();
}
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 0ffeefc437..0aaf600a78 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -331,6 +331,8 @@ typedef enum BackendType
B_WAL_WRITER,
} BackendType;
+#define BACKEND_NUM_TYPES (B_WAL_WRITER + 1)
+
extern PGDLLIMPORT BackendType MyBackendType;
extern const char *GetBackendTypeDesc(BackendType backendType);
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 5e3326a3b9..ea7e19c48d 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -48,6 +48,7 @@ typedef enum PgStat_Kind
PGSTAT_KIND_ARCHIVER,
PGSTAT_KIND_BGWRITER,
PGSTAT_KIND_CHECKPOINTER,
+ PGSTAT_KIND_IO,
PGSTAT_KIND_SLRU,
PGSTAT_KIND_WAL,
} PgStat_Kind;
@@ -276,6 +277,55 @@ typedef struct PgStat_CheckpointerStats
PgStat_Counter buf_fsync_backend;
} PgStat_CheckpointerStats;
+
+/*
+ * Types related to counting IO operations
+ */
+typedef enum IOContext
+{
+ IOCONTEXT_BULKREAD,
+ IOCONTEXT_BULKWRITE,
+ IOCONTEXT_NORMAL,
+ IOCONTEXT_VACUUM,
+} IOContext;
+
+#define IOCONTEXT_FIRST IOCONTEXT_BULKREAD
+#define IOCONTEXT_NUM_TYPES (IOCONTEXT_VACUUM + 1)
+
+typedef enum IOObject
+{
+ IOOBJECT_RELATION,
+ IOOBJECT_TEMP_RELATION,
+} IOObject;
+
+#define IOOBJECT_FIRST IOOBJECT_RELATION
+#define IOOBJECT_NUM_TYPES (IOOBJECT_TEMP_RELATION + 1)
+
+typedef enum IOOp
+{
+ IOOP_EVICT,
+ IOOP_EXTEND,
+ IOOP_FSYNC,
+ IOOP_READ,
+ IOOP_REUSE,
+ IOOP_WRITE,
+} IOOp;
+
+#define IOOP_FIRST IOOP_EVICT
+#define IOOP_NUM_TYPES (IOOP_WRITE + 1)
+
+typedef struct PgStat_BackendIO
+{
+ PgStat_Counter data[IOCONTEXT_NUM_TYPES][IOOBJECT_NUM_TYPES][IOOP_NUM_TYPES];
+} PgStat_BackendIO;
+
+typedef struct PgStat_IO
+{
+ TimestampTz stat_reset_timestamp;
+ PgStat_BackendIO stats[BACKEND_NUM_TYPES];
+} PgStat_IO;
+
+
typedef struct PgStat_StatDBEntry
{
PgStat_Counter xact_commit;
@@ -453,6 +503,23 @@ extern void pgstat_report_checkpointer(void);
extern PgStat_CheckpointerStats *pgstat_fetch_stat_checkpointer(void);
+/*
+ * Functions in pgstat_io.c
+ */
+
+extern void pgstat_count_io_op(IOOp io_op, IOObject io_object, IOContext io_context);
+extern PgStat_IO *pgstat_fetch_stat_io(void);
+extern const char *pgstat_get_io_context_name(IOContext io_context);
+extern const char *pgstat_get_io_object_name(IOObject io_object);
+extern const char *pgstat_get_io_op_name(IOOp io_op);
+
+extern bool pgstat_tracks_io_bktype(BackendType bktype);
+extern bool pgstat_tracks_io_object(BackendType bktype,
+ IOContext io_context, IOObject io_object);
+extern bool pgstat_tracks_io_op(BackendType bktype, IOContext io_context,
+ IOObject io_object, IOOp io_op);
+
+
/*
* Functions in pgstat_database.c
*/
diff --git a/src/include/utils/pgstat_internal.h b/src/include/utils/pgstat_internal.h
index 12fd51f1ae..11150bf449 100644
--- a/src/include/utils/pgstat_internal.h
+++ b/src/include/utils/pgstat_internal.h
@@ -329,6 +329,19 @@ typedef struct PgStatShared_Checkpointer
PgStat_CheckpointerStats reset_offset;
} PgStatShared_Checkpointer;
+/* shared version of PgStat_IO */
+typedef struct PgStatShared_IO
+{
+ /*
+ * locks[i] protects ->stats[i]. locks[0] also protects
+ * ->stat_reset_timestamp.
+ */
+ LWLock locks[BACKEND_NUM_TYPES];
+
+ TimestampTz stat_reset_timestamp;
+ PgStat_BackendIO stats[BACKEND_NUM_TYPES];
+} PgStatShared_IO;
+
typedef struct PgStatShared_SLRU
{
/* lock protects ->stats */
@@ -419,6 +432,7 @@ typedef struct PgStat_ShmemControl
PgStatShared_Archiver archiver;
PgStatShared_BgWriter bgwriter;
PgStatShared_Checkpointer checkpointer;
+ PgStatShared_IO io;
PgStatShared_SLRU slru;
PgStatShared_Wal wal;
} PgStat_ShmemControl;
@@ -442,6 +456,8 @@ typedef struct PgStat_Snapshot
PgStat_CheckpointerStats checkpointer;
+ PgStat_IO io;
+
PgStat_SLRUStats slru[SLRU_NUM_ELEMENTS];
PgStat_WalStats wal;
@@ -549,6 +565,17 @@ extern void pgstat_database_reset_timestamp_cb(PgStatShared_Common *header, Time
extern bool pgstat_function_flush_cb(PgStat_EntryRef *entry_ref, bool nowait);
+/*
+ * Functions in pgstat_io.c
+ */
+
+extern void pgstat_io_reset_all_cb(TimestampTz ts);
+extern void pgstat_io_snapshot_cb(void);
+extern bool pgstat_flush_io(bool nowait);
+extern bool pgstat_bktype_io_stats_valid(PgStat_BackendIO *backend_io,
+ BackendType bktype);
+
+
/*
* Functions in pgstat_relation.c
*/
@@ -643,6 +670,13 @@ extern void pgstat_create_transactional(PgStat_Kind kind, Oid dboid, Oid objoid)
extern PGDLLIMPORT PgStat_LocalState pgStatLocal;
+/*
+ * Variables in pgstat_io.c
+ */
+
+extern PGDLLIMPORT bool have_iostats;
+
+
/*
* Variables in pgstat_slru.c
*/
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 50d86cb01b..6779e84c2e 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1106,7 +1106,10 @@ ID
INFIX
INT128
INTERFACE_INFO
+IOContext
IOFuncSelector
+IOObject
+IOOp
IPCompareMethod
ITEM
IV
@@ -2010,6 +2013,7 @@ PgStatShared_Common
PgStatShared_Database
PgStatShared_Function
PgStatShared_HashEntry
+PgStatShared_IO
PgStatShared_Relation
PgStatShared_ReplSlot
PgStatShared_SLRU
@@ -2027,6 +2031,8 @@ PgStat_FetchConsistency
PgStat_FunctionCallUsage
PgStat_FunctionCounts
PgStat_HashKey
+PgStat_BackendIO
+PgStat_IO
PgStat_Kind
PgStat_KindInfo
PgStat_LocalState
--
2.38.1
Attachment: v44-0003-pgstat-Count-IO-for-relations.patch
From 28a64a8210958331704da01785ab1da0d022c783 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 4 Jan 2023 17:20:50 -0500
Subject: [PATCH v44 3/4] pgstat: Count IO for relations
Count IOOps done on IOObjects in IOContexts by various BackendTypes
using the IO stats infrastructure introduced by a previous commit.
The primary concern of these statistics is IO operations on data blocks
during the course of normal database operations. IO operations done by,
for example, the archiver or syslogger are not counted in these
statistics. WAL IO, temporary file IO, and IO done directly through smgr*
functions (such as when building an index) are not yet counted but would
be useful future additions.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Maciek Sakrejda <m.sakrejda@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
src/backend/storage/buffer/bufmgr.c | 109 ++++++++++++++++++++++----
src/backend/storage/buffer/freelist.c | 58 ++++++++++----
src/backend/storage/buffer/localbuf.c | 13 ++-
src/backend/storage/smgr/md.c | 25 ++++++
src/include/storage/buf_internals.h | 8 +-
src/include/storage/bufmgr.h | 7 +-
6 files changed, 184 insertions(+), 36 deletions(-)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 8075828e8a..d067afb420 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -481,8 +481,9 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
ForkNumber forkNum,
BlockNumber blockNum,
BufferAccessStrategy strategy,
- bool *foundPtr);
-static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+ bool *foundPtr, IOContext *io_context);
+static void FlushBuffer(BufferDesc *buf, SMgrRelation reln,
+ IOContext io_context, IOObject io_object);
static void FindAndDropRelationBuffers(RelFileLocator rlocator,
ForkNumber forkNum,
BlockNumber nForkBlock,
@@ -823,6 +824,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
BufferDesc *bufHdr;
Block bufBlock;
bool found;
+ IOContext io_context;
+ IOObject io_object;
bool isExtend;
bool isLocalBuf = SmgrIsTemp(smgr);
@@ -855,7 +858,14 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (isLocalBuf)
{
- bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
+ /*
+ * LocalBufferAlloc() will set the io_context to IOCONTEXT_NORMAL. We
+ * do not use a BufferAccessStrategy for IO of temporary tables.
+ * However, in some cases, the "strategy" may not be NULL, so we can't
+ * rely on IOContextForStrategy() to set the right IOContext for us.
+ * This may happen in cases like CREATE TEMPORARY TABLE AS...
+ */
+ bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found, &io_context);
if (found)
pgBufferUsage.local_blks_hit++;
else if (isExtend)
@@ -871,7 +881,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* not currently in memory.
*/
bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum,
- strategy, &found);
+ strategy, &found, &io_context);
if (found)
pgBufferUsage.shared_blks_hit++;
else if (isExtend)
@@ -986,7 +996,16 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
*/
Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID)); /* spinlock not needed */
- bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
+ if (isLocalBuf)
+ {
+ bufBlock = LocalBufHdrGetBlock(bufHdr);
+ io_object = IOOBJECT_TEMP_RELATION;
+ }
+ else
+ {
+ bufBlock = BufHdrGetBlock(bufHdr);
+ io_object = IOOBJECT_RELATION;
+ }
if (isExtend)
{
@@ -995,6 +1014,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
/* don't set checksum for all-zero page */
smgrextend(smgr, forkNum, blockNum, (char *) bufBlock, false);
+ pgstat_count_io_op(IOOP_EXTEND, io_object, io_context);
+
/*
* NB: we're *not* doing a ScheduleBufferTagForWriteback here;
* although we're essentially performing a write. At least on linux
@@ -1020,6 +1041,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
smgrread(smgr, forkNum, blockNum, (char *) bufBlock);
+ pgstat_count_io_op(IOOP_READ, io_object, io_context);
+
if (track_io_timing)
{
INSTR_TIME_SET_CURRENT(io_time);
@@ -1113,14 +1136,19 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* *foundPtr is actually redundant with the buffer's BM_VALID flag, but
* we keep it for simplicity in ReadBuffer.
*
+ * io_context is passed as an output parameter to avoid calling
+ * IOContextForStrategy() when there is a shared buffers hit and no IO
+ * statistics need be captured.
+ *
* No locks are held either at entry or exit.
*/
static BufferDesc *
BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
BlockNumber blockNum,
BufferAccessStrategy strategy,
- bool *foundPtr)
+ bool *foundPtr, IOContext *io_context)
{
+ bool from_ring;
BufferTag newTag; /* identity of requested block */
uint32 newHash; /* hash value for newTag */
LWLock *newPartitionLock; /* buffer partition lock for it */
@@ -1172,8 +1200,11 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
{
/*
* If we get here, previous attempts to read the buffer must
- * have failed ... but we shall bravely try again.
+ * have failed ... but we shall bravely try again. Set
+ * io_context since we will in fact need to count an IO
+ * Operation.
*/
+ *io_context = IOContextForStrategy(strategy);
*foundPtr = false;
}
}
@@ -1187,6 +1218,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
*/
LWLockRelease(newPartitionLock);
+ *io_context = IOContextForStrategy(strategy);
+
/* Loop here in case we have to try another victim buffer */
for (;;)
{
@@ -1200,7 +1233,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* Select a victim buffer. The buffer is returned with its header
* spinlock still held!
*/
- buf = StrategyGetBuffer(strategy, &buf_state);
+ buf = StrategyGetBuffer(strategy, &buf_state, &from_ring);
Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
@@ -1254,7 +1287,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
UnlockBufHdr(buf, buf_state);
if (XLogNeedsFlush(lsn) &&
- StrategyRejectBuffer(strategy, buf))
+ StrategyRejectBuffer(strategy, buf, from_ring))
{
/* Drop lock/pin and loop around for another buffer */
LWLockRelease(BufferDescriptorGetContentLock(buf));
@@ -1269,7 +1302,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
smgr->smgr_rlocator.locator.dbOid,
smgr->smgr_rlocator.locator.relNumber);
- FlushBuffer(buf, NULL);
+ FlushBuffer(buf, NULL, *io_context, IOOBJECT_RELATION);
LWLockRelease(BufferDescriptorGetContentLock(buf));
ScheduleBufferTagForWriteback(&BackendWritebackContext,
@@ -1441,6 +1474,28 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
UnlockBufHdr(buf, buf_state);
+ if (oldFlags & BM_VALID)
+ {
+ /*
+ * When a BufferAccessStrategy is in use, blocks evicted from shared
+ * buffers are counted as IOOP_EVICT in the corresponding context
+ * (e.g. IOCONTEXT_BULKWRITE). Shared buffers are evicted by a
+ * strategy in two cases: 1) while initially claiming buffers for the
+ * strategy ring, and 2) to replace an existing strategy ring buffer
+ * because it is pinned or in use and cannot be reused.
+ *
+ * Blocks evicted from buffers already in the strategy ring are
+ * counted as IOOP_REUSE in the corresponding strategy context.
+ *
+ * At this point, we can accurately count evictions and reuses,
+ * because we have successfully claimed the valid buffer. Previously,
+ * we may have been forced to release the buffer due to concurrent
+ * pinners or erroring out.
+ */
+ pgstat_count_io_op(from_ring ? IOOP_REUSE : IOOP_EVICT,
+ IOOBJECT_RELATION, *io_context);
+ }
+
if (oldPartitionLock != NULL)
{
BufTableDelete(&oldTag, oldHash);
@@ -2570,7 +2625,7 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOCONTEXT_NORMAL, IOOBJECT_RELATION);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
@@ -2820,7 +2875,7 @@ BufferGetTag(Buffer buffer, RelFileLocator *rlocator, ForkNumber *forknum,
* as the second parameter. If not, pass NULL.
*/
static void
-FlushBuffer(BufferDesc *buf, SMgrRelation reln)
+FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOContext io_context, IOObject io_object)
{
XLogRecPtr recptr;
ErrorContextCallback errcallback;
@@ -2912,6 +2967,26 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
bufToWrite,
false);
+ /*
+ * When a strategy is in use, only flushes of dirty buffers already in the
+ * strategy ring are counted as strategy writes (IOCONTEXT
+ * [BULKREAD|BULKWRITE|VACUUM] IOOP_WRITE) for the purpose of IO
+ * statistics tracking.
+ *
+ * If a shared buffer initially added to the ring must be flushed before
+ * being used, this is counted as an IOCONTEXT_NORMAL IOOP_WRITE.
+ *
+ * If a shared buffer which was added to the ring later because the
+ * current strategy buffer is pinned or in use or because all strategy
+ * buffers were dirty and rejected (for BAS_BULKREAD operations only)
+ * requires flushing, this is counted as an IOCONTEXT_NORMAL IOOP_WRITE
+ * (from_ring will be false).
+ *
+ * When a strategy is not in use, the write can only be a "regular" write
+ * of a dirty shared buffer (IOCONTEXT_NORMAL IOOP_WRITE).
+ */
+ pgstat_count_io_op(IOOP_WRITE, IOOBJECT_RELATION, io_context);
+
if (track_io_timing)
{
INSTR_TIME_SET_CURRENT(io_time);
@@ -3554,6 +3629,8 @@ FlushRelationBuffers(Relation rel)
buf_state &= ~(BM_DIRTY | BM_JUST_DIRTIED);
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+ pgstat_count_io_op(IOOP_WRITE, IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL);
+
/* Pop the error context stack */
error_context_stack = errcallback.previous;
}
@@ -3586,7 +3663,7 @@ FlushRelationBuffers(Relation rel)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, RelationGetSmgr(rel));
+ FlushBuffer(bufHdr, RelationGetSmgr(rel), IOCONTEXT_NORMAL, IOOBJECT_RELATION);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr);
}
@@ -3684,7 +3761,7 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, srelent->srel);
+ FlushBuffer(bufHdr, srelent->srel, IOCONTEXT_NORMAL, IOOBJECT_RELATION);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr);
}
@@ -3894,7 +3971,7 @@ FlushDatabaseBuffers(Oid dbid)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOCONTEXT_NORMAL, IOOBJECT_RELATION);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr);
}
@@ -3921,7 +3998,7 @@ FlushOneBuffer(Buffer buffer)
Assert(LWLockHeldByMe(BufferDescriptorGetContentLock(bufHdr)));
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOCONTEXT_NORMAL, IOOBJECT_RELATION);
}
/*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 7dec35801c..c690d5f15f 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -15,6 +15,7 @@
*/
#include "postgres.h"
+#include "pgstat.h"
#include "port/atomics.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
@@ -81,12 +82,6 @@ typedef struct BufferAccessStrategyData
*/
int current;
- /*
- * True if the buffer just returned by StrategyGetBuffer had been in the
- * ring already.
- */
- bool current_was_in_ring;
-
/*
* Array of buffer numbers. InvalidBuffer (that is, zero) indicates we
* have not yet selected a buffer for this ring slot. For allocation
@@ -198,13 +193,15 @@ have_free_buffer(void)
* return the buffer with the buffer header spinlock still held.
*/
BufferDesc *
-StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
+StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_ring)
{
BufferDesc *buf;
int bgwprocno;
int trycounter;
uint32 local_buf_state; /* to avoid repeated (de-)referencing */
+ *from_ring = false;
+
/*
* If given a strategy object, see whether it can select a buffer. We
* assume strategy objects don't need buffer_strategy_lock.
@@ -213,7 +210,10 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
{
buf = GetBufferFromRing(strategy, buf_state);
if (buf != NULL)
+ {
+ *from_ring = true;
return buf;
+ }
}
/*
@@ -602,7 +602,7 @@ FreeAccessStrategy(BufferAccessStrategy strategy)
/*
* GetBufferFromRing -- returns a buffer from the ring, or NULL if the
- * ring is empty.
+ * ring is empty / not usable.
*
* The bufhdr spin lock is held on the returned buffer.
*/
@@ -625,10 +625,7 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
*/
bufnum = strategy->buffers[strategy->current];
if (bufnum == InvalidBuffer)
- {
- strategy->current_was_in_ring = false;
return NULL;
- }
/*
* If the buffer is pinned we cannot use it under any circumstances.
@@ -644,7 +641,6 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
if (BUF_STATE_GET_REFCOUNT(local_buf_state) == 0
&& BUF_STATE_GET_USAGECOUNT(local_buf_state) <= 1)
{
- strategy->current_was_in_ring = true;
*buf_state = local_buf_state;
return buf;
}
@@ -654,7 +650,6 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
* Tell caller to allocate a new buffer with the normal allocation
* strategy. He'll then replace this ring element via AddBufferToRing.
*/
- strategy->current_was_in_ring = false;
return NULL;
}
@@ -670,6 +665,39 @@ AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
strategy->buffers[strategy->current] = BufferDescriptorGetBuffer(buf);
}
+/*
+ * Utility function returning the IOContext of a given BufferAccessStrategy's
+ * strategy ring.
+ */
+IOContext
+IOContextForStrategy(BufferAccessStrategy strategy)
+{
+ if (!strategy)
+ return IOCONTEXT_NORMAL;
+
+ switch (strategy->btype)
+ {
+ case BAS_NORMAL:
+
+ /*
+ * Currently, GetAccessStrategy() returns NULL for
+ * BufferAccessStrategyType BAS_NORMAL, so this case is
+ * unreachable.
+ */
+ pg_unreachable();
+ return IOCONTEXT_NORMAL;
+ case BAS_BULKREAD:
+ return IOCONTEXT_BULKREAD;
+ case BAS_BULKWRITE:
+ return IOCONTEXT_BULKWRITE;
+ case BAS_VACUUM:
+ return IOCONTEXT_VACUUM;
+ }
+
+ elog(ERROR, "unrecognized BufferAccessStrategyType: %d", strategy->btype);
+ pg_unreachable();
+}
+
/*
* StrategyRejectBuffer -- consider rejecting a dirty buffer
*
@@ -682,14 +710,14 @@ AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
* if this buffer should be written and re-used.
*/
bool
-StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
+StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool from_ring)
{
/* We only do this in bulkread mode */
if (strategy->btype != BAS_BULKREAD)
return false;
/* Don't muck with behavior of normal buffer-replacement strategy */
- if (!strategy->current_was_in_ring ||
+ if (!from_ring ||
strategy->buffers[strategy->current] != BufferDescriptorGetBuffer(buf))
return false;
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 8372acc383..2108bbe7d8 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -18,6 +18,7 @@
#include "access/parallel.h"
#include "catalog/catalog.h"
#include "executor/instrument.h"
+#include "pgstat.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "utils/guc_hooks.h"
@@ -107,7 +108,7 @@ PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
*/
BufferDesc *
LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
- bool *foundPtr)
+ bool *foundPtr, IOContext *io_context)
{
BufferTag newTag; /* identity of requested block */
LocalBufferLookupEnt *hresult;
@@ -127,6 +128,14 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
hresult = (LocalBufferLookupEnt *)
hash_search(LocalBufHash, (void *) &newTag, HASH_FIND, NULL);
+ /*
+ * IO operations on local buffers are only done in IOCONTEXT_NORMAL. Set
+ * io_context here, before a buffer hit could return early; this is
+ * convenient since we don't have to worry about the overhead of calling
+ * IOContextForStrategy().
+ */
+ *io_context = IOCONTEXT_NORMAL;
+
if (hresult)
{
b = hresult->id;
@@ -230,6 +239,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
buf_state &= ~BM_DIRTY;
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+ pgstat_count_io_op(IOOP_WRITE, IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL);
pgBufferUsage.local_blks_written++;
}
@@ -256,6 +266,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
ClearBufferTag(&bufHdr->tag);
buf_state &= ~(BM_VALID | BM_TAG_VALID);
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+ pgstat_count_io_op(IOOP_EVICT, IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL);
}
hresult = (LocalBufferLookupEnt *)
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 60c9905eff..2115d7184a 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -983,6 +983,15 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
{
MdfdVec *v = &reln->md_seg_fds[forknum][segno - 1];
+ /*
+	 * fsyncs done through mdimmedsync() should be tracked in a different
+	 * IOContext from those done through mdsyncfiletag() to differentiate
+	 * between unavoidable client backend fsyncs (e.g. those done during
+	 * index build) and those which ideally would have been done by the
+	 * checkpointer or bgwriter. Since other IO operations that bypass the
+	 * buffer manager could also be tracked in such an IOContext, wait to
+	 * track immediate fsyncs until those are tracked as well.
+ */
if (FileSync(v->mdfd_vfd, WAIT_EVENT_DATA_FILE_IMMEDIATE_SYNC) < 0)
ereport(data_sync_elevel(ERROR),
(errcode_for_file_access(),
@@ -1021,6 +1030,19 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false /* retryOnError */ ))
{
+ /*
+ * We have no way of knowing if the current IOContext is
+ * IOCONTEXT_NORMAL or IOCONTEXT_[BULKREAD, BULKWRITE, VACUUM] at this
+ * point, so count the fsync as being in the IOCONTEXT_NORMAL
+ * IOContext. This is probably okay, because the number of backend
+ * fsyncs doesn't say anything about the efficacy of the
+ * BufferAccessStrategy. And counting both fsyncs done in
+ * IOCONTEXT_NORMAL and IOCONTEXT_[BULKREAD, BULKWRITE, VACUUM] under
+ * IOCONTEXT_NORMAL is likely clearer when investigating the number of
+ * backend fsyncs.
+ */
+ pgstat_count_io_op(IOOP_FSYNC, IOOBJECT_RELATION, IOCONTEXT_NORMAL);
+
ereport(DEBUG1,
(errmsg_internal("could not forward fsync request because request queue is full")));
@@ -1410,6 +1432,9 @@ mdsyncfiletag(const FileTag *ftag, char *path)
if (need_to_close)
FileClose(file);
+ if (result >= 0)
+ pgstat_count_io_op(IOOP_FSYNC, IOOBJECT_RELATION, IOCONTEXT_NORMAL);
+
errno = save_errno;
return result;
}
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index ed8aa2519c..0b44814740 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -15,6 +15,7 @@
#ifndef BUFMGR_INTERNALS_H
#define BUFMGR_INTERNALS_H
+#include "pgstat.h"
#include "port/atomics.h"
#include "storage/buf.h"
#include "storage/bufmgr.h"
@@ -391,11 +392,12 @@ extern void IssuePendingWritebacks(WritebackContext *context);
extern void ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *tag);
/* freelist.c */
+extern IOContext IOContextForStrategy(BufferAccessStrategy bas);
extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
- uint32 *buf_state);
+ uint32 *buf_state, bool *from_ring);
extern void StrategyFreeBuffer(BufferDesc *buf);
extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
- BufferDesc *buf);
+ BufferDesc *buf, bool from_ring);
extern int StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
extern void StrategyNotifyBgWriter(int bgwprocno);
@@ -417,7 +419,7 @@ extern PrefetchBufferResult PrefetchLocalBuffer(SMgrRelation smgr,
ForkNumber forkNum,
BlockNumber blockNum);
extern BufferDesc *LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum,
- BlockNumber blockNum, bool *foundPtr);
+ BlockNumber blockNum, bool *foundPtr, IOContext *io_context);
extern void MarkLocalBufferDirty(Buffer buffer);
extern void DropRelationLocalBuffers(RelFileLocator rlocator,
ForkNumber forkNum,
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 33eadbc129..b8a18b8081 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -23,7 +23,12 @@
typedef void *Block;
-/* Possible arguments for GetAccessStrategy() */
+/*
+ * Possible arguments for GetAccessStrategy().
+ *
+ * If adding a new BufferAccessStrategyType, also add a new IOContext so
+ * IO statistics using this strategy are tracked.
+ */
typedef enum BufferAccessStrategyType
{
BAS_NORMAL, /* Normal random access */
--
2.38.1
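The IOContextForStrategy() declaration added above (and the `unrecognized BufferAccessStrategyType` elog visible at the top of this hunk) suggest a simple mapping from each BufferAccessStrategyType to the IOContext its IO is counted under. A standalone sketch of that mapping, using stand-in enum definitions rather than the actual PostgreSQL headers:

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-ins for the real enums in bufmgr.h / pgstat.h. */
typedef enum { BAS_NORMAL, BAS_BULKREAD, BAS_BULKWRITE, BAS_VACUUM } BufferAccessStrategyType;
typedef enum { IOCONTEXT_BULKREAD, IOCONTEXT_BULKWRITE, IOCONTEXT_NORMAL, IOCONTEXT_VACUUM } IOContext;

/* Map a strategy type to the IOContext its IO should be counted under. */
static IOContext
io_context_for_strategy(BufferAccessStrategyType btype)
{
	switch (btype)
	{
		case BAS_NORMAL:
			return IOCONTEXT_NORMAL;
		case BAS_BULKREAD:
			return IOCONTEXT_BULKREAD;
		case BAS_BULKWRITE:
			return IOCONTEXT_BULKWRITE;
		case BAS_VACUUM:
			return IOCONTEXT_VACUUM;
	}
	abort();					/* unrecognized strategy type */
}
```

Listing every enum value with no default case matches the convention the patch uses elsewhere: a new BufferAccessStrategyType then triggers a compiler warning here until a corresponding IOContext is added.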
Attachment: v44-0001-pgindent-and-some-manual-cleanup-in-pgstat-relat.patch (application/octet-stream)
From 3821676d2fcefab6d25292b84f652366c2c70710 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 9 Dec 2022 18:23:19 -0800
Subject: [PATCH v44 1/4] pgindent and some manual cleanup in pgstat related
code
---
src/backend/storage/buffer/bufmgr.c | 22 ++++++++++----------
src/backend/storage/buffer/localbuf.c | 4 ++--
src/backend/utils/activity/pgstat.c | 3 ++-
src/backend/utils/activity/pgstat_relation.c | 1 +
src/backend/utils/adt/pgstatfuncs.c | 2 +-
src/include/pgstat.h | 1 +
src/include/utils/pgstat_internal.h | 1 +
7 files changed, 19 insertions(+), 15 deletions(-)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 3fb38a25cf..8075828e8a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -516,7 +516,7 @@ PrefetchSharedBuffer(SMgrRelation smgr_reln,
/* create a tag so we can lookup the buffer */
InitBufferTag(&newTag, &smgr_reln->smgr_rlocator.locator,
- forkNum, blockNum);
+ forkNum, blockNum);
/* determine its hash code and partition lock ID */
newHash = BufTableHashCode(&newTag);
@@ -3297,8 +3297,8 @@ DropRelationsAllBuffers(SMgrRelation *smgr_reln, int nlocators)
uint32 buf_state;
/*
- * As in DropRelationBuffers, an unlocked precheck should be
- * safe and saves some cycles.
+ * As in DropRelationBuffers, an unlocked precheck should be safe and
+ * saves some cycles.
*/
if (!use_bsearch)
@@ -3425,8 +3425,8 @@ DropDatabaseBuffers(Oid dbid)
uint32 buf_state;
/*
- * As in DropRelationBuffers, an unlocked precheck should be
- * safe and saves some cycles.
+ * As in DropRelationBuffers, an unlocked precheck should be safe and
+ * saves some cycles.
*/
if (bufHdr->tag.dbOid != dbid)
continue;
@@ -3572,8 +3572,8 @@ FlushRelationBuffers(Relation rel)
bufHdr = GetBufferDescriptor(i);
/*
- * As in DropRelationBuffers, an unlocked precheck should be
- * safe and saves some cycles.
+ * As in DropRelationBuffers, an unlocked precheck should be safe and
+ * saves some cycles.
*/
if (!BufTagMatchesRelFileLocator(&bufHdr->tag, &rel->rd_locator))
continue;
@@ -3645,8 +3645,8 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
uint32 buf_state;
/*
- * As in DropRelationBuffers, an unlocked precheck should be
- * safe and saves some cycles.
+ * As in DropRelationBuffers, an unlocked precheck should be safe and
+ * saves some cycles.
*/
if (!use_bsearch)
@@ -3880,8 +3880,8 @@ FlushDatabaseBuffers(Oid dbid)
bufHdr = GetBufferDescriptor(i);
/*
- * As in DropRelationBuffers, an unlocked precheck should be
- * safe and saves some cycles.
+ * As in DropRelationBuffers, an unlocked precheck should be safe and
+ * saves some cycles.
*/
if (bufHdr->tag.dbOid != dbid)
continue;
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index b2720df6ea..8372acc383 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -610,8 +610,8 @@ AtProcExit_LocalBuffers(void)
{
/*
* We shouldn't be holding any remaining pins; if we are, and assertions
- * aren't enabled, we'll fail later in DropRelationBuffers while
- * trying to drop the temp rels.
+ * aren't enabled, we'll fail later in DropRelationBuffers while trying to
+ * drop the temp rels.
*/
CheckForLocalBufferLeaks();
}
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 7e9dc17e68..0fa5370bcd 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -426,7 +426,7 @@ pgstat_discard_stats(void)
ereport(DEBUG2,
(errcode_for_file_access(),
errmsg_internal("unlinked permanent statistics file \"%s\"",
- PGSTAT_STAT_PERMANENT_FILENAME)));
+ PGSTAT_STAT_PERMANENT_FILENAME)));
}
/*
@@ -986,6 +986,7 @@ pgstat_build_snapshot(void)
entry->data = MemoryContextAlloc(pgStatLocal.snapshot.context,
kind_info->shared_size);
+
/*
* Acquire the LWLock directly instead of using
* pg_stat_lock_entry_shared() which requires a reference.
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index 1730425de1..2e20b93c20 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -783,6 +783,7 @@ pgstat_relation_flush_cb(PgStat_EntryRef *entry_ref, bool nowait)
if (lstats->t_counts.t_numscans)
{
TimestampTz t = GetCurrentTransactionStopTimestamp();
+
if (t > tabentry->lastscan)
tabentry->lastscan = t;
}
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 6cddd74aa7..58bd1360b9 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -906,7 +906,7 @@ pg_stat_get_backend_client_addr(PG_FUNCTION_ARGS)
clean_ipv6_addr(beentry->st_clientaddr.addr.ss_family, remote_host);
PG_RETURN_DATUM(DirectFunctionCall1(inet_in,
- CStringGetDatum(remote_host)));
+ CStringGetDatum(remote_host)));
}
Datum
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index d3e965d744..5e3326a3b9 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -476,6 +476,7 @@ extern void pgstat_report_connect(Oid dboid);
extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dboid);
+
/*
* Functions in pgstat_function.c
*/
diff --git a/src/include/utils/pgstat_internal.h b/src/include/utils/pgstat_internal.h
index 08412d6404..12fd51f1ae 100644
--- a/src/include/utils/pgstat_internal.h
+++ b/src/include/utils/pgstat_internal.h
@@ -626,6 +626,7 @@ extern void pgstat_wal_snapshot_cb(void);
extern bool pgstat_subscription_flush_cb(PgStat_EntryRef *entry_ref, bool nowait);
extern void pgstat_subscription_reset_timestamp_cb(PgStatShared_Common *header, TimestampTz ts);
+
/*
* Functions in pgstat_xact.c
*/
--
2.38.1
Attached is v45 of the patchset. I've done some additional code cleanup
and changes. The most significant change, however, is to the docs. I've
split the docs out into their own patch for ease of review.
The docs patch here was edited and co-authored by Samay Sharma.
I'm not sure if the order of pg_stat_io in the docs is correct.
The significant changes are removal of all "correspondence" or
"equivalence"-related sections (those explaining how other IO stats were
the same or different from pg_stat_io columns).
I've tried to remove references to "strategies" and "Buffer Access
Strategy" as much as possible.
I've moved the advice and interpretation section to the bottom --
outside of the table of definitions. Since this page is primarily a
reference page, I agree with Samay that incorporating interpretation
into the column definitions adds clutter and confusion.
I think the best course would be to have an "Interpreting Statistics"
section.
I suggest a structure like the following for this section:
- Statistics Collection Configuration
- Viewing Statistics
- Statistics Views Reference
- Statistics Functions Reference
- Interpreting Statistics
As an aside, this section of the docs has some other structural issues
as well.
For example, I'm not sure it makes sense to have the dynamic statistics
views as sub-sections under 28.2, which is titled "The Cumulative
Statistics System."
In fact, the docs say this under Section 28.2:
https://www.postgresql.org/docs/current/monitoring-stats.html
"PostgreSQL also supports reporting dynamic information about exactly
what is going on in the system right now, such as the exact command
currently being executed by other server processes, and which other
connections exist in the system. This facility is independent of the
cumulative statistics system."
So, it is a bit weird that they are defined under the section titled
"The Cumulative Statistics System".
In this version of the patchset, I have not attempted a new structure
but instead moved the advice/interpretation for pg_stat_io to below the
table containing the column definitions.
- Melanie
Attachments:
Attachment: v45-0002-pgstat-Infrastructure-to-track-IO-operations.patch (text/x-patch)
From e87831a0ffe94af54b91285630dd6f1c497c368a Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 4 Jan 2023 17:20:41 -0500
Subject: [PATCH v45 2/5] pgstat: Infrastructure to track IO operations
Introduce "IOOp", an IO operation done by a backend, "IOObject", the
target object of the IO, and "IOContext", the context or location of the
IO operations on that object. For example, the checkpointer may write a
shared buffer out. This would be considered an IOOp "written" on an
IOObject IOOBJECT_RELATION in IOContext IOCONTEXT_NORMAL by BackendType
"checkpointer".
Each IOOp (evict, extend, fsync, read, reuse, and write) can be counted
per IOObject (relation, temp relation) per IOContext (normal, bulkread,
bulkwrite, or vacuum) through a call to pgstat_count_io_op().
Note that this commit introduces the infrastructure to count IO
Operation statistics. A subsequent commit will add calls to
pgstat_count_io_op() in the appropriate locations.
IOContext IOCONTEXT_NORMAL concerns operations on local and shared
buffers, while IOCONTEXT_BULKREAD, IOCONTEXT_BULKWRITE, and
IOCONTEXT_VACUUM IOContexts concern IO operations on buffers as part of
a BufferAccessStrategy.
IOObject IOOBJECT_TEMP_RELATION concerns IO Operations on buffers
containing temporary table data, while IOObject IOOBJECT_RELATION
concerns IO Operations on buffers containing permanent relation data.
Stats on IOOps for all IOObjects in all IOContexts for a given backend
are first counted in the backend's local memory and then flushed to
shared memory, where they are accumulated with those from all other
backends, both live and exited.
Some BackendTypes do not flush their pending statistics at regular
intervals, so they explicitly call pgstat_flush_io() during the course
of normal operations to flush their backend-local IO operation
statistics to shared memory in a timely manner.
Because not all BackendType, IOOp, IOObject, IOContext combinations are
valid, the validity of the stats is checked before flushing pending
stats and before reading in the existing stats file to shared memory.
The aggregated stats in shared memory could be extended in the future
with per-backend stats -- useful for per connection IO statistics and
monitoring.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
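The count-then-flush cycle this commit message describes (backend-local counters indexed by IOContext, IOObject, and IOOp, later accumulated into shared memory) can be sketched as a standalone simplification. The names below are stand-ins and the sketch omits locking and validity checks; it is not the patch's actual code:

```c
#include <assert.h>
#include <string.h>

/* Simplified stand-ins for the patch's enums (not the real definitions). */
typedef enum { CTX_NORMAL, CTX_BULKREAD, CTX_BULKWRITE, CTX_VACUUM, CTX_NUM } IOCtx;
typedef enum { OBJ_RELATION, OBJ_TEMP_RELATION, OBJ_NUM } IOObj;
typedef enum { OP_EVICT, OP_EXTEND, OP_FSYNC, OP_READ, OP_REUSE, OP_WRITE, OP_NUM } IOOpKind;

/* Backend-local pending counters, flushed to shared memory later. */
static long pending[CTX_NUM][OBJ_NUM][OP_NUM];
static int	have_pending = 0;

static void
count_io(IOCtx ctx, IOObj obj, IOOpKind op)
{
	pending[ctx][obj][op]++;
	have_pending = 1;
}

/* Accumulate pending counts into a (stand-in) shared array, then reset. */
static void
flush_io(long shared[CTX_NUM][OBJ_NUM][OP_NUM])
{
	if (!have_pending)
		return;
	for (int c = 0; c < CTX_NUM; c++)
		for (int o = 0; o < OBJ_NUM; o++)
			for (int p = 0; p < OP_NUM; p++)
				shared[c][o][p] += pending[c][o][p];
	memset(pending, 0, sizeof(pending));
	have_pending = 0;
}
```

The `have_pending` flag mirrors the patch's `have_iostats`: it lets the common no-IO case skip the flush loop (and, in the real code, the lock acquisition) entirely.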
---
doc/src/sgml/monitoring.sgml | 2 +
src/backend/utils/activity/Makefile | 1 +
src/backend/utils/activity/meson.build | 1 +
src/backend/utils/activity/pgstat.c | 26 ++
src/backend/utils/activity/pgstat_bgwriter.c | 7 +-
.../utils/activity/pgstat_checkpointer.c | 7 +-
src/backend/utils/activity/pgstat_io.c | 400 ++++++++++++++++++
src/backend/utils/activity/pgstat_relation.c | 15 +-
src/backend/utils/activity/pgstat_shmem.c | 4 +
src/backend/utils/activity/pgstat_wal.c | 4 +-
src/backend/utils/adt/pgstatfuncs.c | 4 +-
src/include/miscadmin.h | 2 +
src/include/pgstat.h | 67 +++
src/include/utils/pgstat_internal.h | 32 ++
src/tools/pgindent/typedefs.list | 6 +
15 files changed, 572 insertions(+), 6 deletions(-)
create mode 100644 src/backend/utils/activity/pgstat_io.c
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index cf220c3bcb..1691246e76 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -5408,6 +5408,8 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
the <structname>pg_stat_bgwriter</structname>
view, <literal>archiver</literal> to reset all the counters shown in
the <structname>pg_stat_archiver</structname> view,
+ <literal>io</literal> to reset all the counters shown in the
+ <structname>pg_stat_io</structname> view,
<literal>wal</literal> to reset all the counters shown in the
<structname>pg_stat_wal</structname> view or
<literal>recovery_prefetch</literal> to reset all the counters shown
diff --git a/src/backend/utils/activity/Makefile b/src/backend/utils/activity/Makefile
index a80eda3cf4..7d7482dde0 100644
--- a/src/backend/utils/activity/Makefile
+++ b/src/backend/utils/activity/Makefile
@@ -22,6 +22,7 @@ OBJS = \
pgstat_checkpointer.o \
pgstat_database.o \
pgstat_function.o \
+ pgstat_io.o \
pgstat_relation.o \
pgstat_replslot.o \
pgstat_shmem.o \
diff --git a/src/backend/utils/activity/meson.build b/src/backend/utils/activity/meson.build
index a2b872c24b..518ee3f798 100644
--- a/src/backend/utils/activity/meson.build
+++ b/src/backend/utils/activity/meson.build
@@ -9,6 +9,7 @@ backend_sources += files(
'pgstat_checkpointer.c',
'pgstat_database.c',
'pgstat_function.c',
+ 'pgstat_io.c',
'pgstat_relation.c',
'pgstat_replslot.c',
'pgstat_shmem.c',
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 0fa5370bcd..608c3b59da 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -72,6 +72,7 @@
* - pgstat_checkpointer.c
* - pgstat_database.c
* - pgstat_function.c
+ * - pgstat_io.c
* - pgstat_relation.c
* - pgstat_replslot.c
* - pgstat_slru.c
@@ -359,6 +360,15 @@ static const PgStat_KindInfo pgstat_kind_infos[PGSTAT_NUM_KINDS] = {
.snapshot_cb = pgstat_checkpointer_snapshot_cb,
},
+ [PGSTAT_KIND_IO] = {
+ .name = "io_ops",
+
+ .fixed_amount = true,
+
+ .reset_all_cb = pgstat_io_reset_all_cb,
+ .snapshot_cb = pgstat_io_snapshot_cb,
+ },
+
[PGSTAT_KIND_SLRU] = {
.name = "slru",
@@ -582,6 +592,7 @@ pgstat_report_stat(bool force)
/* Don't expend a clock check if nothing to do */
if (dlist_is_empty(&pgStatPending) &&
+ !have_iostats &&
!have_slrustats &&
!pgstat_have_pending_wal())
{
@@ -628,6 +639,9 @@ pgstat_report_stat(bool force)
/* flush database / relation / function / ... stats */
partial_flush |= pgstat_flush_pending_entries(nowait);
+ /* flush IO stats */
+ partial_flush |= pgstat_flush_io(nowait);
+
/* flush wal stats */
partial_flush |= pgstat_flush_wal(nowait);
@@ -1322,6 +1336,12 @@ pgstat_write_statsfile(void)
pgstat_build_snapshot_fixed(PGSTAT_KIND_CHECKPOINTER);
write_chunk_s(fpout, &pgStatLocal.snapshot.checkpointer);
+ /*
+ * Write IO stats struct
+ */
+ pgstat_build_snapshot_fixed(PGSTAT_KIND_IO);
+ write_chunk_s(fpout, &pgStatLocal.snapshot.io);
+
/*
* Write SLRU stats struct
*/
@@ -1496,6 +1516,12 @@ pgstat_read_statsfile(void)
if (!read_chunk_s(fpin, &shmem->checkpointer.stats))
goto error;
+ /*
+ * Read IO stats struct
+ */
+ if (!read_chunk_s(fpin, &shmem->io.stats))
+ goto error;
+
/*
* Read SLRU stats struct
*/
diff --git a/src/backend/utils/activity/pgstat_bgwriter.c b/src/backend/utils/activity/pgstat_bgwriter.c
index 9247f2dda2..92be384b0d 100644
--- a/src/backend/utils/activity/pgstat_bgwriter.c
+++ b/src/backend/utils/activity/pgstat_bgwriter.c
@@ -24,7 +24,7 @@ PgStat_BgWriterStats PendingBgWriterStats = {0};
/*
- * Report bgwriter statistics
+ * Report bgwriter and IO statistics
*/
void
pgstat_report_bgwriter(void)
@@ -56,6 +56,11 @@ pgstat_report_bgwriter(void)
* Clear out the statistics buffer, so it can be re-used.
*/
MemSet(&PendingBgWriterStats, 0, sizeof(PendingBgWriterStats));
+
+ /*
+ * Report IO statistics
+ */
+ pgstat_flush_io(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_checkpointer.c b/src/backend/utils/activity/pgstat_checkpointer.c
index 3e9ab45103..26dec112f6 100644
--- a/src/backend/utils/activity/pgstat_checkpointer.c
+++ b/src/backend/utils/activity/pgstat_checkpointer.c
@@ -24,7 +24,7 @@ PgStat_CheckpointerStats PendingCheckpointerStats = {0};
/*
- * Report checkpointer statistics
+ * Report checkpointer and IO statistics
*/
void
pgstat_report_checkpointer(void)
@@ -62,6 +62,11 @@ pgstat_report_checkpointer(void)
* Clear out the statistics buffer, so it can be re-used.
*/
MemSet(&PendingCheckpointerStats, 0, sizeof(PendingCheckpointerStats));
+
+ /*
+ * Report IO statistics
+ */
+ pgstat_flush_io(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
new file mode 100644
index 0000000000..8eac7d9e53
--- /dev/null
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -0,0 +1,400 @@
+/* -------------------------------------------------------------------------
+ *
+ * pgstat_io.c
+ * Implementation of IO statistics.
+ *
+ * This file contains the implementation of IO statistics. It is kept separate
+ * from pgstat.c to enforce the line between the statistics access / storage
+ * implementation and the details about individual types of statistics.
+ *
+ * Copyright (c) 2021-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/utils/activity/pgstat_io.c
+ * -------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "utils/pgstat_internal.h"
+
+
+static PgStat_BackendIO PendingIOStats;
+bool have_iostats = false;
+
+
+void
+pgstat_count_io_op(IOOp io_op, IOObject io_object, IOContext io_context)
+{
+ Assert(io_context < IOCONTEXT_NUM_TYPES);
+ Assert(io_object < IOOBJECT_NUM_TYPES);
+ Assert(io_op < IOOP_NUM_TYPES);
+ Assert(pgstat_tracks_io_op(MyBackendType, io_context, io_object, io_op));
+
+ PendingIOStats.data[io_context][io_object][io_op]++;
+
+ have_iostats = true;
+}
+
+PgStat_IO *
+pgstat_fetch_stat_io(void)
+{
+ pgstat_snapshot_fixed(PGSTAT_KIND_IO);
+
+ return &pgStatLocal.snapshot.io;
+}
+
+/*
+ * Flush out locally pending IO statistics
+ *
+ * If no stats have been recorded, this function returns false immediately.
+ *
+ * If nowait is true and the lock cannot be acquired, returns true without
+ * flushing; otherwise flushes pending stats and returns false.
+ */
+bool
+pgstat_flush_io(bool nowait)
+{
+ LWLock *bktype_lock;
+ PgStat_BackendIO *bktype_shstats;
+
+ if (!have_iostats)
+ return false;
+
+ bktype_lock = &pgStatLocal.shmem->io.locks[MyBackendType];
+ bktype_shstats =
+ &pgStatLocal.shmem->io.stats.stats[MyBackendType];
+
+ if (!nowait)
+ LWLockAcquire(bktype_lock, LW_EXCLUSIVE);
+ else if (!LWLockConditionalAcquire(bktype_lock, LW_EXCLUSIVE))
+ return true;
+
+ for (IOContext io_context = IOCONTEXT_FIRST;
+ io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ for (IOObject io_object = IOOBJECT_FIRST;
+ io_object < IOOBJECT_NUM_TYPES; io_object++)
+ for (IOOp io_op = IOOP_FIRST;
+ io_op < IOOP_NUM_TYPES; io_op++)
+ bktype_shstats->data[io_context][io_object][io_op] +=
+ PendingIOStats.data[io_context][io_object][io_op];
+
+ Assert(pgstat_bktype_io_stats_valid(bktype_shstats, MyBackendType));
+
+ LWLockRelease(bktype_lock);
+
+ memset(&PendingIOStats, 0, sizeof(PendingIOStats));
+
+ have_iostats = false;
+
+ return false;
+}
+
+const char *
+pgstat_get_io_context_name(IOContext io_context)
+{
+ switch (io_context)
+ {
+ case IOCONTEXT_BULKREAD:
+ return "bulkread";
+ case IOCONTEXT_BULKWRITE:
+ return "bulkwrite";
+ case IOCONTEXT_NORMAL:
+ return "normal";
+ case IOCONTEXT_VACUUM:
+ return "vacuum";
+ }
+
+ elog(ERROR, "unrecognized IOContext value: %d", io_context);
+ pg_unreachable();
+}
+
+const char *
+pgstat_get_io_object_name(IOObject io_object)
+{
+ switch (io_object)
+ {
+ case IOOBJECT_RELATION:
+ return "relation";
+ case IOOBJECT_TEMP_RELATION:
+ return "temp relation";
+ }
+
+ elog(ERROR, "unrecognized IOObject value: %d", io_object);
+ pg_unreachable();
+}
+
+const char *
+pgstat_get_io_op_name(IOOp io_op)
+{
+ switch (io_op)
+ {
+ case IOOP_EVICT:
+ return "evicted";
+ case IOOP_EXTEND:
+ return "extended";
+ case IOOP_FSYNC:
+ return "files synced";
+ case IOOP_READ:
+ return "read";
+ case IOOP_REUSE:
+ return "reused";
+ case IOOP_WRITE:
+ return "written";
+ }
+
+ elog(ERROR, "unrecognized IOOp value: %d", io_op);
+ pg_unreachable();
+}
+
+void
+pgstat_io_reset_all_cb(TimestampTz ts)
+{
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ LWLock *bktype_lock = &pgStatLocal.shmem->io.locks[i];
+ PgStat_BackendIO *bktype_shstats = &pgStatLocal.shmem->io.stats.stats[i];
+
+ LWLockAcquire(bktype_lock, LW_EXCLUSIVE);
+
+ /*
+ * Use the lock in the first BackendType's PgStat_BackendIO to protect
+ * the reset timestamp as well.
+ */
+ if (i == 0)
+ pgStatLocal.shmem->io.stats.stat_reset_timestamp = ts;
+
+ memset(bktype_shstats, 0, sizeof(*bktype_shstats));
+ LWLockRelease(bktype_lock);
+ }
+}
+
+void
+pgstat_io_snapshot_cb(void)
+{
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ LWLock *bktype_lock = &pgStatLocal.shmem->io.locks[i];
+ PgStat_BackendIO *bktype_shstats = &pgStatLocal.shmem->io.stats.stats[i];
+ PgStat_BackendIO *bktype_snap = &pgStatLocal.snapshot.io.stats[i];
+
+ LWLockAcquire(bktype_lock, LW_SHARED);
+
+ /*
+ * Use the lock in the first BackendType's PgStat_BackendIO to protect
+ * the reset timestamp as well.
+ */
+ if (i == 0)
+ pgStatLocal.snapshot.io.stat_reset_timestamp =
+ pgStatLocal.shmem->io.stats.stat_reset_timestamp;
+
+ /* using struct assignment due to better type safety */
+ *bktype_snap = *bktype_shstats;
+ LWLockRelease(bktype_lock);
+ }
+}
+
+/*
+* IO statistics are not collected for all BackendTypes.
+*
+* The following BackendTypes do not participate in the cumulative stats
+* subsystem or do not perform IO that we currently track:
+* - Syslogger because it is not connected to shared memory
+* - Archiver because most relevant archiving IO is delegated to a
+* specialized command or module
+* - WAL Receiver and WAL Writer IO is not tracked in pg_stat_io for now
+*
+* Function returns true if BackendType participates in the cumulative stats
+* subsystem for IO and false if it does not.
+*/
+bool
+pgstat_tracks_io_bktype(BackendType bktype)
+{
+ /*
+ * List every type so that new backend types trigger a warning about
+ * needing to adjust this switch.
+ */
+ switch (bktype)
+ {
+ case B_INVALID:
+ case B_ARCHIVER:
+ case B_LOGGER:
+ case B_WAL_RECEIVER:
+ case B_WAL_WRITER:
+ return false;
+
+ case B_AUTOVAC_LAUNCHER:
+ case B_AUTOVAC_WORKER:
+ case B_BACKEND:
+ case B_BG_WORKER:
+ case B_BG_WRITER:
+ case B_CHECKPOINTER:
+ case B_STANDALONE_BACKEND:
+ case B_STARTUP:
+ case B_WAL_SENDER:
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * Some BackendTypes do not perform IO in certain IOContexts. Some IOObjects
+ * are never operated on in some IOContexts. Check that the given BackendType
+ * is expected to do IO in the given IOContext and that the given IOObject is
+ * expected to be operated on in the given IOContext.
+ */
+bool
+pgstat_tracks_io_object(BackendType bktype, IOContext io_context,
+ IOObject io_object)
+{
+ bool no_temp_rel;
+
+ /*
+ * Some BackendTypes should never track IO statistics.
+ */
+ if (!pgstat_tracks_io_bktype(bktype))
+ return false;
+
+ /*
+ * Currently, IO on temporary relations can only occur in the
+ * IOCONTEXT_NORMAL IOContext.
+ */
+ if (io_context != IOCONTEXT_NORMAL &&
+ io_object == IOOBJECT_TEMP_RELATION)
+ return false;
+
+ /*
+ * In core Postgres, only regular backends and WAL Sender processes
+ * executing queries will use local buffers and operate on temporary
+ * relations. Parallel workers will not use local buffers (see
+ * InitLocalBuffers()); however, extensions leveraging background workers
+ * have no such limitation, so track IO on IOOBJECT_TEMP_RELATION for
+ * BackendType B_BG_WORKER.
+ */
+ no_temp_rel = bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER ||
+ bktype == B_CHECKPOINTER || bktype == B_AUTOVAC_WORKER ||
+ bktype == B_STANDALONE_BACKEND || bktype == B_STARTUP;
+
+ if (no_temp_rel && io_context == IOCONTEXT_NORMAL &&
+ io_object == IOOBJECT_TEMP_RELATION)
+ return false;
+
+ /*
+ * Some BackendTypes do not currently perform any IO in certain
+ * IOContexts, and, while it may not be inherently incorrect for them to
+ * do so, excluding those rows from the view makes the view easier to use.
+ */
+ if ((bktype == B_CHECKPOINTER || bktype == B_BG_WRITER) &&
+ (io_context == IOCONTEXT_BULKREAD ||
+ io_context == IOCONTEXT_BULKWRITE ||
+ io_context == IOCONTEXT_VACUUM))
+ return false;
+
+ if (bktype == B_AUTOVAC_LAUNCHER && io_context == IOCONTEXT_VACUUM)
+ return false;
+
+ if ((bktype == B_AUTOVAC_WORKER || bktype == B_AUTOVAC_LAUNCHER) &&
+ io_context == IOCONTEXT_BULKWRITE)
+ return false;
+
+ return true;
+}
+
+/*
+ * Some BackendTypes will never do certain IOOps and some IOOps should not
+ * occur in certain IOContexts. Check that the given IOOp is valid for the
+ * given BackendType in the given IOContext. Note that there are currently no
+ * cases of an IOOp being invalid for a particular BackendType only within a
+ * certain IOContext.
+ */
+bool
+pgstat_tracks_io_op(BackendType bktype, IOContext io_context,
+ IOObject io_object, IOOp io_op)
+{
+ bool strategy_io_context;
+
+ /* if (io_context, io_object) will never collect stats, we're done */
+ if (!pgstat_tracks_io_object(bktype, io_context, io_object))
+ return false;
+
+ /*
+ * Some BackendTypes will not do certain IOOps.
+ */
+ if ((bktype == B_BG_WRITER || bktype == B_CHECKPOINTER) &&
+ (io_op == IOOP_READ || io_op == IOOP_EVICT))
+ return false;
+
+ if ((bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER ||
+ bktype == B_CHECKPOINTER) && io_op == IOOP_EXTEND)
+ return false;
+
+ /*
+ * Some IOOps are not valid in certain IOContexts and some IOOps are only
+ * valid in certain contexts.
+ */
+ if (io_context == IOCONTEXT_BULKREAD && io_op == IOOP_EXTEND)
+ return false;
+
+ strategy_io_context = io_context == IOCONTEXT_BULKREAD ||
+ io_context == IOCONTEXT_BULKWRITE || io_context == IOCONTEXT_VACUUM;
+
+ /*
+ * IOOP_REUSE is only relevant when a BufferAccessStrategy is in use.
+ */
+ if (!strategy_io_context && io_op == IOOP_REUSE)
+ return false;
+
+ /*
+ * IOOP_FSYNC IOOps done by a backend using a BufferAccessStrategy are
+ * counted in the IOCONTEXT_NORMAL IOContext. See comment in
+ * register_dirty_segment() for more details.
+ */
+ if (strategy_io_context && io_op == IOOP_FSYNC)
+ return false;
+
+ /*
+ * Temporary tables are not logged and thus do not require fsync'ing.
+ */
+ if (io_context == IOCONTEXT_NORMAL &&
+ io_object == IOOBJECT_TEMP_RELATION && io_op == IOOP_FSYNC)
+ return false;
+
+ return true;
+}
+
+/*
+ * Check that stats have not been counted for any combination of IOContext,
+ * IOObject, and IOOp that is not tracked for the passed-in BackendType. The
+ * passed-in PgStat_BackendIO must contain stats from the BackendType specified
+ * by the second parameter. Caller is responsible for locking the passed-in
+ * PgStat_BackendIO, if needed.
+ */
+bool
+pgstat_bktype_io_stats_valid(PgStat_BackendIO *backend_io,
+ BackendType bktype)
+{
+ bool bktype_tracked = pgstat_tracks_io_bktype(bktype);
+
+ for (IOContext io_context = IOCONTEXT_FIRST;
+ io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ {
+ for (IOObject io_object = IOOBJECT_FIRST;
+ io_object < IOOBJECT_NUM_TYPES; io_object++)
+ {
+ /*
+ * Don't bother trying to skip to the next loop iteration if
+ * pgstat_tracks_io_object() would return false here. We still
+ * need to validate that each counter is zero anyway.
+ */
+ for (IOOp io_op = IOOP_FIRST; io_op < IOOP_NUM_TYPES; io_op++)
+ {
+ if ((!bktype_tracked || !pgstat_tracks_io_op(bktype, io_context, io_object, io_op)) &&
+ backend_io->data[io_context][io_object][io_op] != 0)
+ return false;
+ }
+ }
+ }
+
+ return true;
+}
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index 2e20b93c20..f793ac1516 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -206,7 +206,7 @@ pgstat_drop_relation(Relation rel)
}
/*
- * Report that the table was just vacuumed.
+ * Report that the table was just vacuumed and flush IO statistics.
*/
void
pgstat_report_vacuum(Oid tableoid, bool shared,
@@ -258,10 +258,18 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
}
pgstat_unlock_entry(entry_ref);
+
+ /*
+ * Flush IO statistics now. pgstat_report_stat() will also flush IO stats,
+ * but it will not be called until after an entire autovacuum cycle
+ * is done -- which will likely vacuum many relations -- or until the
+ * VACUUM command has processed all tables and committed.
+ */
+ pgstat_flush_io(false);
}
/*
- * Report that the table was just analyzed.
+ * Report that the table was just analyzed and flush IO statistics.
*
* Caller must provide new live- and dead-tuples estimates, as well as a
* flag indicating whether to reset the mod_since_analyze counter.
@@ -341,6 +349,9 @@ pgstat_report_analyze(Relation rel,
}
pgstat_unlock_entry(entry_ref);
+
+ /* see pgstat_report_vacuum() */
+ pgstat_flush_io(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_shmem.c b/src/backend/utils/activity/pgstat_shmem.c
index c1506b53d0..09fffd0e82 100644
--- a/src/backend/utils/activity/pgstat_shmem.c
+++ b/src/backend/utils/activity/pgstat_shmem.c
@@ -202,6 +202,10 @@ StatsShmemInit(void)
LWLockInitialize(&ctl->checkpointer.lock, LWTRANCHE_PGSTATS_DATA);
LWLockInitialize(&ctl->slru.lock, LWTRANCHE_PGSTATS_DATA);
LWLockInitialize(&ctl->wal.lock, LWTRANCHE_PGSTATS_DATA);
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ LWLockInitialize(&ctl->io.locks[i],
+ LWTRANCHE_PGSTATS_DATA);
}
else
{
diff --git a/src/backend/utils/activity/pgstat_wal.c b/src/backend/utils/activity/pgstat_wal.c
index e7a82b5fed..e8598b2f4e 100644
--- a/src/backend/utils/activity/pgstat_wal.c
+++ b/src/backend/utils/activity/pgstat_wal.c
@@ -34,7 +34,7 @@ static WalUsage prevWalUsage;
/*
* Calculate how much WAL usage counters have increased and update
- * shared statistics.
+ * shared WAL and IO statistics.
*
* Must be called by processes that generate WAL, that do not call
* pgstat_report_stat(), like walwriter.
@@ -43,6 +43,8 @@ void
pgstat_report_wal(bool force)
{
pgstat_flush_wal(force);
+
+ pgstat_flush_io(force);
}
/*
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 58bd1360b9..42b890b806 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1593,6 +1593,8 @@ pg_stat_reset_shared(PG_FUNCTION_ARGS)
pgstat_reset_of_kind(PGSTAT_KIND_BGWRITER);
pgstat_reset_of_kind(PGSTAT_KIND_CHECKPOINTER);
}
+ else if (strcmp(target, "io") == 0)
+ pgstat_reset_of_kind(PGSTAT_KIND_IO);
else if (strcmp(target, "recovery_prefetch") == 0)
XLogPrefetchResetStats();
else if (strcmp(target, "wal") == 0)
@@ -1601,7 +1603,7 @@ pg_stat_reset_shared(PG_FUNCTION_ARGS)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("unrecognized reset target: \"%s\"", target),
- errhint("Target must be \"archiver\", \"bgwriter\", \"recovery_prefetch\", or \"wal\".")));
+ errhint("Target must be \"archiver\", \"bgwriter\", \"io\", \"recovery_prefetch\", or \"wal\".")));
PG_RETURN_VOID();
}
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 0ffeefc437..0aaf600a78 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -331,6 +331,8 @@ typedef enum BackendType
B_WAL_WRITER,
} BackendType;
+#define BACKEND_NUM_TYPES (B_WAL_WRITER + 1)
+
extern PGDLLIMPORT BackendType MyBackendType;
extern const char *GetBackendTypeDesc(BackendType backendType);
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 5e3326a3b9..ea7e19c48d 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -48,6 +48,7 @@ typedef enum PgStat_Kind
PGSTAT_KIND_ARCHIVER,
PGSTAT_KIND_BGWRITER,
PGSTAT_KIND_CHECKPOINTER,
+ PGSTAT_KIND_IO,
PGSTAT_KIND_SLRU,
PGSTAT_KIND_WAL,
} PgStat_Kind;
@@ -276,6 +277,55 @@ typedef struct PgStat_CheckpointerStats
PgStat_Counter buf_fsync_backend;
} PgStat_CheckpointerStats;
+
+/*
+ * Types related to counting IO operations
+ */
+typedef enum IOContext
+{
+ IOCONTEXT_BULKREAD,
+ IOCONTEXT_BULKWRITE,
+ IOCONTEXT_NORMAL,
+ IOCONTEXT_VACUUM,
+} IOContext;
+
+#define IOCONTEXT_FIRST IOCONTEXT_BULKREAD
+#define IOCONTEXT_NUM_TYPES (IOCONTEXT_VACUUM + 1)
+
+typedef enum IOObject
+{
+ IOOBJECT_RELATION,
+ IOOBJECT_TEMP_RELATION,
+} IOObject;
+
+#define IOOBJECT_FIRST IOOBJECT_RELATION
+#define IOOBJECT_NUM_TYPES (IOOBJECT_TEMP_RELATION + 1)
+
+typedef enum IOOp
+{
+ IOOP_EVICT,
+ IOOP_EXTEND,
+ IOOP_FSYNC,
+ IOOP_READ,
+ IOOP_REUSE,
+ IOOP_WRITE,
+} IOOp;
+
+#define IOOP_FIRST IOOP_EVICT
+#define IOOP_NUM_TYPES (IOOP_WRITE + 1)
+
+typedef struct PgStat_BackendIO
+{
+ PgStat_Counter data[IOCONTEXT_NUM_TYPES][IOOBJECT_NUM_TYPES][IOOP_NUM_TYPES];
+} PgStat_BackendIO;
+
+typedef struct PgStat_IO
+{
+ TimestampTz stat_reset_timestamp;
+ PgStat_BackendIO stats[BACKEND_NUM_TYPES];
+} PgStat_IO;
+
+
typedef struct PgStat_StatDBEntry
{
PgStat_Counter xact_commit;
@@ -453,6 +503,23 @@ extern void pgstat_report_checkpointer(void);
extern PgStat_CheckpointerStats *pgstat_fetch_stat_checkpointer(void);
+/*
+ * Functions in pgstat_io.c
+ */
+
+extern void pgstat_count_io_op(IOOp io_op, IOObject io_object, IOContext io_context);
+extern PgStat_IO *pgstat_fetch_stat_io(void);
+extern const char *pgstat_get_io_context_name(IOContext io_context);
+extern const char *pgstat_get_io_object_name(IOObject io_object);
+extern const char *pgstat_get_io_op_name(IOOp io_op);
+
+extern bool pgstat_tracks_io_bktype(BackendType bktype);
+extern bool pgstat_tracks_io_object(BackendType bktype,
+ IOContext io_context, IOObject io_object);
+extern bool pgstat_tracks_io_op(BackendType bktype, IOContext io_context,
+ IOObject io_object, IOOp io_op);
+
+
/*
* Functions in pgstat_database.c
*/
diff --git a/src/include/utils/pgstat_internal.h b/src/include/utils/pgstat_internal.h
index 12fd51f1ae..bf8e4c3b8b 100644
--- a/src/include/utils/pgstat_internal.h
+++ b/src/include/utils/pgstat_internal.h
@@ -329,6 +329,17 @@ typedef struct PgStatShared_Checkpointer
PgStat_CheckpointerStats reset_offset;
} PgStatShared_Checkpointer;
+/* shared version of PgStat_IO */
+typedef struct PgStatShared_IO
+{
+ /*
+ * locks[i] protects stats.stats[i]. locks[0] also protects
+ * stats.stat_reset_timestamp.
+ */
+ LWLock locks[BACKEND_NUM_TYPES];
+ PgStat_IO stats;
+} PgStatShared_IO;
+
typedef struct PgStatShared_SLRU
{
/* lock protects ->stats */
@@ -419,6 +430,7 @@ typedef struct PgStat_ShmemControl
PgStatShared_Archiver archiver;
PgStatShared_BgWriter bgwriter;
PgStatShared_Checkpointer checkpointer;
+ PgStatShared_IO io;
PgStatShared_SLRU slru;
PgStatShared_Wal wal;
} PgStat_ShmemControl;
@@ -442,6 +454,8 @@ typedef struct PgStat_Snapshot
PgStat_CheckpointerStats checkpointer;
+ PgStat_IO io;
+
PgStat_SLRUStats slru[SLRU_NUM_ELEMENTS];
PgStat_WalStats wal;
@@ -549,6 +563,17 @@ extern void pgstat_database_reset_timestamp_cb(PgStatShared_Common *header, Time
extern bool pgstat_function_flush_cb(PgStat_EntryRef *entry_ref, bool nowait);
+/*
+ * Functions in pgstat_io.c
+ */
+
+extern void pgstat_io_reset_all_cb(TimestampTz ts);
+extern void pgstat_io_snapshot_cb(void);
+extern bool pgstat_flush_io(bool nowait);
+extern bool pgstat_bktype_io_stats_valid(PgStat_BackendIO *context_ops,
+ BackendType bktype);
+
+
/*
* Functions in pgstat_relation.c
*/
@@ -643,6 +668,13 @@ extern void pgstat_create_transactional(PgStat_Kind kind, Oid dboid, Oid objoid)
extern PGDLLIMPORT PgStat_LocalState pgStatLocal;
+/*
+ * Variables in pgstat_io.c
+ */
+
+extern PGDLLIMPORT bool have_iostats;
+
+
/*
* Variables in pgstat_slru.c
*/
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 23bafec5f7..7b66b1bc89 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1106,7 +1106,10 @@ ID
INFIX
INT128
INTERFACE_INFO
+IOContext
IOFuncSelector
+IOObject
+IOOp
IPCompareMethod
ITEM
IV
@@ -2016,6 +2019,7 @@ PgStatShared_Common
PgStatShared_Database
PgStatShared_Function
PgStatShared_HashEntry
+PgStatShared_IO
PgStatShared_Relation
PgStatShared_ReplSlot
PgStatShared_SLRU
@@ -2033,6 +2037,8 @@ PgStat_FetchConsistency
PgStat_FunctionCallUsage
PgStat_FunctionCounts
PgStat_HashKey
+PgStat_BackendIO
+PgStat_IO
PgStat_Kind
PgStat_KindInfo
PgStat_LocalState
--
2.34.1
Attachment: v45-0004-Add-system-view-tracking-IO-ops-per-backend-type.patch (text/x-patch)
From c19cd7aad51f75b4865b171a096d1ff1cbba414e Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 9 Jan 2023 14:42:25 -0500
Subject: [PATCH v45 4/5] Add system view tracking IO ops per backend type
Add pg_stat_io, a system view which tracks the number of IOOps
(evictions, reuses, reads, writes, extensions, and fsyncs) done on each
IOObject (relation, temp relation) in each IOContext ("normal" and those
using a BufferAccessStrategy) by each type of backend (e.g. client
backend, checkpointer).
Some BackendTypes do not accumulate IO operation statistics and will
not be included in the view.
Some IOContexts are not used by some BackendTypes and will not be in the
view. For example, checkpointer does not use a BufferAccessStrategy
(currently), so there will be no rows for BufferAccessStrategy
IOContexts for checkpointer.
Some IOObjects are never operated on in some IOContexts or by some
BackendTypes. These rows are omitted from the view. For example,
checkpointer will never operate on IOOBJECT_TEMP_RELATION data, so those
rows are omitted.
Some IOOps are invalid in combination with certain IOContexts and
certain IOObjects. Those cells will be NULL in the view to distinguish
between 0 observed IOOps of that type and an invalid combination. For
example, temporary tables are not fsynced so cells for all BackendTypes
for IOOBJECT_TEMP_RELATION and IOOP_FSYNC will be NULL.
Some BackendTypes never perform certain IOOps. Those cells will also be
NULL in the view. For example, bgwriter should not perform reads.
The view is populated from counters incremented whenever a backend
performs an IO operation; the counters are maintained by the cumulative
statistics subsystem.
Each row of the view shows stats for a particular BackendType, IOObject,
IOContext combination (e.g. a client backend's operations on permanent
relations in shared buffers) and each column in the view is the total
number of IO Operations done (e.g. writes). So a cell in the view would
be, for example, the number of blocks of relation data written from
shared buffers by client backends since the last stats reset.
In anticipation of tracking WAL IO and non-block-oriented IO (such as
temporary file IO), the "op_bytes" column specifies the unit of the "read",
"written", and "extended" columns for a given row.
Note that some of the cells in the view are redundant with fields in
pg_stat_bgwriter (e.g. buffers_backend); however, those fields have been
kept in pg_stat_bgwriter for backwards compatibility. Deriving the
redundant pg_stat_bgwriter stats from the IO operations stats structures
was also problematic due to the separate reset targets for 'bgwriter'
and 'io'.
Suggested by Andres Freund
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
contrib/amcheck/expected/check_heap.out | 31 ++++
contrib/amcheck/sql/check_heap.sql | 24 +++
src/backend/catalog/system_views.sql | 15 ++
src/backend/utils/adt/pgstatfuncs.c | 154 ++++++++++++++++
src/include/catalog/pg_proc.dat | 9 +
src/test/regress/expected/rules.out | 12 ++
src/test/regress/expected/stats.out | 225 ++++++++++++++++++++++++
src/test/regress/sql/stats.sql | 138 +++++++++++++++
src/tools/pgindent/typedefs.list | 1 +
9 files changed, 609 insertions(+)
diff --git a/contrib/amcheck/expected/check_heap.out b/contrib/amcheck/expected/check_heap.out
index c010361025..c44338fd6e 100644
--- a/contrib/amcheck/expected/check_heap.out
+++ b/contrib/amcheck/expected/check_heap.out
@@ -66,6 +66,19 @@ SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'ALL-VISIBLE');
INSERT INTO heaptest (a, b)
(SELECT gs, repeat('x', gs)
FROM generate_series(1,50) gs);
+-- pg_stat_io test:
+-- verify_heapam always uses a BAS_BULKREAD BufferAccessStrategy. This allows
+-- us to reliably test that pg_stat_io BULKREAD reads are being captured
+-- without relying on the size of shared buffers or on an expensive operation
+-- like CREATE DATABASE.
+--
+-- Create an alternative tablespace and move the heaptest table to it, causing
+-- it to be rewritten.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE test_stats LOCATION '';
+SELECT sum(read) AS stats_bulkreads_before
+ FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+ALTER TABLE heaptest SET TABLESPACE test_stats;
-- Check that valid options are not rejected nor corruption reported
-- for a non-empty table
SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'none');
@@ -88,6 +101,23 @@ SELECT * FROM verify_heapam(relation := 'heaptest', startblock := 0, endblock :=
-------+--------+--------+-----
(0 rows)
+-- verify_heapam should have read in the page written out by
+-- ALTER TABLE ... SET TABLESPACE ...
+-- causing an additional bulkread, which should be reflected in pg_stat_io.
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(read) AS stats_bulkreads_after
+ FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+SELECT :stats_bulkreads_after > :stats_bulkreads_before;
+ ?column?
+----------
+ t
+(1 row)
+
CREATE ROLE regress_heaptest_role;
-- verify permissions are checked (error due to function not callable)
SET ROLE regress_heaptest_role;
@@ -195,6 +225,7 @@ ERROR: cannot check relation "test_foreign_table"
DETAIL: This operation is not supported for foreign tables.
-- cleanup
DROP TABLE heaptest;
+DROP TABLESPACE test_stats;
DROP TABLE test_partition;
DROP TABLE test_partitioned;
DROP OWNED BY regress_heaptest_role; -- permissions
diff --git a/contrib/amcheck/sql/check_heap.sql b/contrib/amcheck/sql/check_heap.sql
index 298de6886a..210f9b22e2 100644
--- a/contrib/amcheck/sql/check_heap.sql
+++ b/contrib/amcheck/sql/check_heap.sql
@@ -20,11 +20,26 @@ SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'NONE');
SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'ALL-FROZEN');
SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'ALL-VISIBLE');
+
-- Add some data so subsequent tests are not entirely trivial
INSERT INTO heaptest (a, b)
(SELECT gs, repeat('x', gs)
FROM generate_series(1,50) gs);
+-- pg_stat_io test:
+-- verify_heapam always uses a BAS_BULKREAD BufferAccessStrategy. This allows
+-- us to reliably test that pg_stat_io BULKREAD reads are being captured
+-- without relying on the size of shared buffers or on an expensive operation
+-- like CREATE DATABASE.
+--
+-- Create an alternative tablespace and move the heaptest table to it, causing
+-- it to be rewritten.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE test_stats LOCATION '';
+SELECT sum(read) AS stats_bulkreads_before
+ FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+ALTER TABLE heaptest SET TABLESPACE test_stats;
+
-- Check that valid options are not rejected nor corruption reported
-- for a non-empty table
SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'none');
@@ -32,6 +47,14 @@ SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'all-frozen');
SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'all-visible');
SELECT * FROM verify_heapam(relation := 'heaptest', startblock := 0, endblock := 0);
+-- verify_heapam should have read in the page written out by
+-- ALTER TABLE ... SET TABLESPACE ...
+-- causing an additional bulkread, which should be reflected in pg_stat_io.
+SELECT pg_stat_force_next_flush();
+SELECT sum(read) AS stats_bulkreads_after
+ FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+SELECT :stats_bulkreads_after > :stats_bulkreads_before;
+
CREATE ROLE regress_heaptest_role;
-- verify permissions are checked (error due to function not callable)
@@ -110,6 +133,7 @@ SELECT * FROM verify_heapam('test_foreign_table',
-- cleanup
DROP TABLE heaptest;
+DROP TABLESPACE test_stats;
DROP TABLE test_partition;
DROP TABLE test_partitioned;
DROP OWNED BY regress_heaptest_role; -- permissions
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 447c9b970f..71646f5aef 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1117,6 +1117,21 @@ CREATE VIEW pg_stat_bgwriter AS
pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+CREATE VIEW pg_stat_io AS
+SELECT
+ b.backend_type,
+ b.io_context,
+ b.io_object,
+ b.read,
+ b.written,
+ b.extended,
+ b.op_bytes,
+ b.evicted,
+ b.reused,
+ b.files_synced,
+ b.stats_reset
+FROM pg_stat_get_io() b;
+
CREATE VIEW pg_stat_wal AS
SELECT
w.wal_records,
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 42b890b806..71c5ff9f1e 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1234,6 +1234,160 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
}
+/*
+ * When adding a new column to the pg_stat_io view, add a new enum value
+ * here, above IO_NUM_COLUMNS.
+ */
+typedef enum io_stat_col
+{
+ IO_COL_BACKEND_TYPE,
+ IO_COL_IO_CONTEXT,
+ IO_COL_IO_OBJECT,
+ IO_COL_READS,
+ IO_COL_WRITES,
+ IO_COL_EXTENDS,
+ IO_COL_CONVERSION,
+ IO_COL_EVICTIONS,
+ IO_COL_REUSES,
+ IO_COL_FSYNCS,
+ IO_COL_RESET_TIME,
+ IO_NUM_COLUMNS,
+} io_stat_col;
+
+/*
+ * When adding a new IOOp, add a new io_stat_col and add a case to this
+ * function returning the corresponding io_stat_col.
+ */
+static io_stat_col
+pgstat_get_io_op_index(IOOp io_op)
+{
+ switch (io_op)
+ {
+ case IOOP_EVICT:
+ return IO_COL_EVICTIONS;
+ case IOOP_READ:
+ return IO_COL_READS;
+ case IOOP_REUSE:
+ return IO_COL_REUSES;
+ case IOOP_WRITE:
+ return IO_COL_WRITES;
+ case IOOP_EXTEND:
+ return IO_COL_EXTENDS;
+ case IOOP_FSYNC:
+ return IO_COL_FSYNCS;
+ }
+
+ elog(ERROR, "unrecognized IOOp value: %d", io_op);
+ pg_unreachable();
+}
+
+#ifdef USE_ASSERT_CHECKING
+static bool
+pgstat_iszero_io_object(const PgStat_Counter *obj)
+{
+ for (IOOp io_op = IOOP_FIRST; io_op < IOOP_NUM_TYPES; io_op++)
+ {
+ if (obj[io_op] != 0)
+ return false;
+ }
+
+ return true;
+}
+#endif
+
+Datum
+pg_stat_get_io(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsinfo;
+ PgStat_IO *backends_io_stats;
+ Datum reset_time;
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+
+ backends_io_stats = pgstat_fetch_stat_io();
+
+ reset_time = TimestampTzGetDatum(backends_io_stats->stat_reset_timestamp);
+
+ for (BackendType bktype = B_INVALID; bktype < BACKEND_NUM_TYPES; bktype++)
+ {
+ bool bktype_tracked;
+ Datum bktype_desc = CStringGetTextDatum(GetBackendTypeDesc(bktype));
+ PgStat_BackendIO *bktype_stats = &backends_io_stats->stats[bktype];
+
+ /*
+ * For those BackendTypes without IO Operation stats, skip
+ * representing them in the view altogether. We still loop through
+ * their counters so that we can assert that all values are zero.
+ */
+ bktype_tracked = pgstat_tracks_io_bktype(bktype);
+
+ for (IOContext io_context = IOCONTEXT_FIRST;
+ io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ {
+ const char *context_name = pgstat_get_io_context_name(io_context);
+
+ for (IOObject io_obj = IOOBJECT_FIRST;
+ io_obj < IOOBJECT_NUM_TYPES; io_obj++)
+ {
+ const char *obj_name = pgstat_get_io_object_name(io_obj);
+
+ Datum values[IO_NUM_COLUMNS] = {0};
+ bool nulls[IO_NUM_COLUMNS] = {0};
+
+ /*
+ * Some combinations of IOContext, IOObject, and BackendType
+ * are not valid for any type of IOOp. In such cases, omit the
+ * entire row from the view.
+ */
+ if (!bktype_tracked ||
+ !pgstat_tracks_io_object(bktype, io_context, io_obj))
+ {
+ Assert(pgstat_iszero_io_object(bktype_stats->data[io_context][io_obj]));
+ continue;
+ }
+
+ values[IO_COL_BACKEND_TYPE] = bktype_desc;
+ values[IO_COL_IO_CONTEXT] = CStringGetTextDatum(context_name);
+ values[IO_COL_IO_OBJECT] = CStringGetTextDatum(obj_name);
+ values[IO_COL_RESET_TIME] = TimestampTzGetDatum(reset_time);
+
+ /*
+ * Hard-code this to the value of BLCKSZ for now. Future
+ * values could include XLOG_BLCKSZ, once WAL IO is tracked,
+ * and constant multipliers, once non-block-oriented IO (e.g.
+ * temporary file IO) is tracked.
+ */
+ values[IO_COL_CONVERSION] = Int64GetDatum(BLCKSZ);
+
+ /*
+ * Some combinations of BackendType and IOOp, of IOContext and
+ * IOOp, and of IOObject and IOOp are not tracked. Set these
+ * cells in the view NULL and assert that these stats are zero
+ * as expected.
+ */
+ for (IOOp io_op = IOOP_FIRST; io_op < IOOP_NUM_TYPES; io_op++)
+ {
+ int col_idx = pgstat_get_io_op_index(io_op);
+
+ nulls[col_idx] = !pgstat_tracks_io_op(bktype, io_context, io_obj, io_op);
+
+ if (!nulls[col_idx])
+ values[col_idx] =
+ Int64GetDatum(bktype_stats->data[io_context][io_obj][io_op]);
+ else
+ Assert(bktype_stats->data[io_context][io_obj][io_op] == 0);
+ }
+
+ tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
+ values, nulls);
+ }
+ }
+ }
+
+ return (Datum) 0;
+}
+
/*
* Returns statistics of WAL activity
*/
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 3810de7b22..1994a4ce36 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5690,6 +5690,15 @@
proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
+{ oid => '8459', descr => 'statistics: per backend type IO statistics',
+ proname => 'pg_stat_get_io', provolatile => 'v',
+ prorows => '30', proretset => 't',
+ proparallel => 'r', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,text,text,int8,int8,int8,int8,int8,int8,int8,timestamptz}',
+ proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}',
+ proargnames => '{backend_type,io_context,io_object,read,written,extended,op_bytes,evicted,reused,files_synced,stats_reset}',
+ prosrc => 'pg_stat_get_io' },
+
{ oid => '1136', descr => 'statistics: information about WAL activity',
proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
proparallel => 'r', prorettype => 'record', proargtypes => '',
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index fb9f936d43..2d0e7dc5c5 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1876,6 +1876,18 @@ pg_stat_gssapi| SELECT s.pid,
s.gss_enc AS encrypted
FROM pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid, query_id)
WHERE (s.client_port IS NOT NULL);
+pg_stat_io| SELECT b.backend_type,
+ b.io_context,
+ b.io_object,
+ b.read,
+ b.written,
+ b.extended,
+ b.op_bytes,
+ b.evicted,
+ b.reused,
+ b.files_synced,
+ b.stats_reset
+ FROM pg_stat_get_io() b(backend_type, io_context, io_object, read, written, extended, op_bytes, evicted, reused, files_synced, stats_reset);
pg_stat_progress_analyze| SELECT s.pid,
s.datid,
d.datname,
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index 1d84407a03..01070a53a4 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -1126,4 +1126,229 @@ SELECT pg_stat_get_subscription_stats(NULL);
(1 row)
+-- Test that the following operations are tracked in pg_stat_io:
+-- - reads of target blocks into shared buffers
+-- - writes of shared buffers to permanent storage
+-- - extends of relations using shared buffers
+-- - fsyncs done to ensure the durability of data dirtying shared buffers
+-- There is no test for blocks evicted from shared buffers, because we cannot
+-- be sure of the state of shared buffers at the point the test is run.
+-- Create a regular table and insert some data to generate IOCONTEXT_NORMAL
+-- extends.
+SELECT sum(extended) AS io_sum_shared_extends_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+CREATE TABLE test_io_shared(a int);
+INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(extended) AS io_sum_shared_extends_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT :io_sum_shared_extends_after > :io_sum_shared_extends_before;
+ ?column?
+----------
+ t
+(1 row)
+
+-- After a checkpoint, there should be some additional IOCONTEXT_NORMAL writes
+-- and fsyncs.
+-- The second checkpoint ensures that stats from the first checkpoint have been
+-- reported and protects against any potential races amongst the table
+-- creation, a possible timing-triggered checkpoint, and the explicit
+-- checkpoint in the test.
+SELECT sum(written) AS io_sum_shared_writes_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT sum(files_synced) AS io_sum_shared_fsyncs_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+CHECKPOINT;
+CHECKPOINT;
+SELECT sum(written) AS io_sum_shared_writes_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT sum(files_synced) AS io_sum_shared_fsyncs_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT :io_sum_shared_writes_after > :io_sum_shared_writes_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT current_setting('fsync') = 'off' OR :io_sum_shared_fsyncs_after > :io_sum_shared_fsyncs_before;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Change the tablespace so that the table is rewritten directly, then SELECT
+-- from it to cause it to be read back into shared buffers.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE test_io_shared_stats_tblspc LOCATION '';
+SELECT sum(read) AS io_sum_shared_reads_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+ALTER TABLE test_io_shared SET TABLESPACE test_io_shared_stats_tblspc;
+-- SELECT from the table so that it is read into shared buffers and io_context
+-- 'normal', io_object 'relation' reads are counted.
+SELECT COUNT(*) FROM test_io_shared;
+ count
+-------
+ 100
+(1 row)
+
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(read) AS io_sum_shared_reads_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT :io_sum_shared_reads_after > :io_sum_shared_reads_before;
+ ?column?
+----------
+ t
+(1 row)
+
+DROP TABLE test_io_shared;
+DROP TABLESPACE test_io_shared_stats_tblspc;
+-- Test that the following IOCONTEXT_LOCAL IOOps are tracked in pg_stat_io:
+-- - eviction of local buffers in order to reuse them
+-- - reads of temporary table blocks into local buffers
+-- - writes of local buffers to permanent storage
+-- - extends of temporary tables
+-- Set temp_buffers to a low value so that we can trigger writes with fewer
+-- inserted tuples. Do so in a new session in case temporary tables have been
+-- accessed by previous tests in this session.
+\c
+SET temp_buffers TO '1MB';
+CREATE TEMPORARY TABLE test_io_local(a int, b TEXT);
+SELECT sum(extended) AS io_sum_local_extends_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(evicted) AS io_sum_local_evictions_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(written) AS io_sum_local_writes_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+-- Insert tuples into the temporary table, generating extends in the stats.
+-- Insert enough values that we need to reuse and write out dirty local
+-- buffers, generating evictions and writes.
+INSERT INTO test_io_local SELECT generate_series(1, 8000) as id, repeat('a', 100);
+SELECT sum(read) AS io_sum_local_reads_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+-- Read in evicted buffers, generating reads.
+SELECT COUNT(*) FROM test_io_local;
+ count
+-------
+ 8000
+(1 row)
+
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(evicted) AS io_sum_local_evictions_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(read) AS io_sum_local_reads_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(written) AS io_sum_local_writes_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(extended) AS io_sum_local_extends_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT :io_sum_local_evictions_after > :io_sum_local_evictions_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_local_reads_after > :io_sum_local_reads_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_local_writes_after > :io_sum_local_writes_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_local_extends_after > :io_sum_local_extends_before;
+ ?column?
+----------
+ t
+(1 row)
+
+RESET temp_buffers;
+-- Test that reuse of strategy buffers and reads of blocks into these reused
+-- buffers while VACUUMing are tracked in pg_stat_io.
+-- Set wal_skip_threshold smaller than the expected size of
+-- test_io_vac_strategy so that, even if wal_level is minimal, VACUUM FULL will
+-- fsync the newly rewritten test_io_vac_strategy instead of writing it to WAL.
+-- Writing it to WAL will result in the newly written relation pages being in
+-- shared buffers -- preventing us from testing BAS_VACUUM BufferAccessStrategy
+-- reads.
+SET wal_skip_threshold = '1 kB';
+SELECT sum(reused) AS io_sum_vac_strategy_reuses_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(read) AS io_sum_vac_strategy_reads_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+CREATE TABLE test_io_vac_strategy(a int, b int) WITH (autovacuum_enabled = 'false');
+INSERT INTO test_io_vac_strategy SELECT i, i from generate_series(1, 8000)i;
+-- Ensure that the next VACUUM will need to perform IO by rewriting the table
+-- first with VACUUM (FULL).
+VACUUM (FULL) test_io_vac_strategy;
+VACUUM (PARALLEL 0) test_io_vac_strategy;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(reused) AS io_sum_vac_strategy_reuses_after FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(read) AS io_sum_vac_strategy_reads_after FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT :io_sum_vac_strategy_reads_after > :io_sum_vac_strategy_reads_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_vac_strategy_reuses_after > :io_sum_vac_strategy_reuses_before;
+ ?column?
+----------
+ t
+(1 row)
+
+RESET wal_skip_threshold;
+-- Test that extends done by a CTAS, which uses a BAS_BULKWRITE
+-- BufferAccessStrategy, are tracked in pg_stat_io.
+SELECT sum(extended) AS io_sum_bulkwrite_strategy_extends_before FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+CREATE TABLE test_io_bulkwrite_strategy AS SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(extended) AS io_sum_bulkwrite_strategy_extends_after FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+SELECT :io_sum_bulkwrite_strategy_extends_after > :io_sum_bulkwrite_strategy_extends_before;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Test IO stats reset
+SELECT sum(evicted) + sum(reused) + sum(extended) + sum(files_synced) + sum(read) + sum(written) AS io_stats_pre_reset FROM pg_stat_io \gset
+SELECT pg_stat_reset_shared('io');
+ pg_stat_reset_shared
+----------------------
+
+(1 row)
+
+SELECT sum(evicted) + sum(reused) + sum(extended) + sum(files_synced) + sum(read) + sum(written) AS io_stats_post_reset FROM pg_stat_io \gset
+SELECT :io_stats_post_reset < :io_stats_pre_reset;
+ ?column?
+----------
+ t
+(1 row)
+
-- End of Stats Test
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index b4d6753c71..962ae5b281 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -536,4 +536,142 @@ SELECT pg_stat_get_replication_slot(NULL);
SELECT pg_stat_get_subscription_stats(NULL);
+-- Test that the following operations are tracked in pg_stat_io:
+-- - reads of target blocks into shared buffers
+-- - writes of shared buffers to permanent storage
+-- - extends of relations using shared buffers
+-- - fsyncs done to ensure the durability of data dirtying shared buffers
+
+-- There is no test for blocks evicted from shared buffers, because we cannot
+-- be sure of the state of shared buffers at the point the test is run.
+
+-- Create a regular table and insert some data to generate IOCONTEXT_NORMAL
+-- extends.
+SELECT sum(extended) AS io_sum_shared_extends_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+CREATE TABLE test_io_shared(a int);
+INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+SELECT sum(extended) AS io_sum_shared_extends_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT :io_sum_shared_extends_after > :io_sum_shared_extends_before;
+
+-- After a checkpoint, there should be some additional IOCONTEXT_NORMAL writes
+-- and fsyncs.
+-- The second checkpoint ensures that stats from the first checkpoint have been
+-- reported and protects against any potential races amongst the table
+-- creation, a possible timing-triggered checkpoint, and the explicit
+-- checkpoint in the test.
+SELECT sum(written) AS io_sum_shared_writes_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT sum(files_synced) AS io_sum_shared_fsyncs_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+CHECKPOINT;
+CHECKPOINT;
+SELECT sum(written) AS io_sum_shared_writes_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT sum(files_synced) AS io_sum_shared_fsyncs_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+
+SELECT :io_sum_shared_writes_after > :io_sum_shared_writes_before;
+SELECT current_setting('fsync') = 'off' OR :io_sum_shared_fsyncs_after > :io_sum_shared_fsyncs_before;
+
+-- Change the tablespace so that the table is rewritten directly, then SELECT
+-- from it to cause it to be read back into shared buffers.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE test_io_shared_stats_tblspc LOCATION '';
+SELECT sum(read) AS io_sum_shared_reads_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+ALTER TABLE test_io_shared SET TABLESPACE test_io_shared_stats_tblspc;
+-- SELECT from the table so that it is read into shared buffers and io_context
+-- 'normal', io_object 'relation' reads are counted.
+SELECT COUNT(*) FROM test_io_shared;
+SELECT pg_stat_force_next_flush();
+SELECT sum(read) AS io_sum_shared_reads_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT :io_sum_shared_reads_after > :io_sum_shared_reads_before;
+DROP TABLE test_io_shared;
+DROP TABLESPACE test_io_shared_stats_tblspc;
+
+-- Test that the following IOCONTEXT_LOCAL IOOps are tracked in pg_stat_io:
+-- - eviction of local buffers in order to reuse them
+-- - reads of temporary table blocks into local buffers
+-- - writes of local buffers to permanent storage
+-- - extends of temporary tables
+
+-- Set temp_buffers to a low value so that we can trigger writes with fewer
+-- inserted tuples. Do so in a new session in case temporary tables have been
+-- accessed by previous tests in this session.
+\c
+SET temp_buffers TO '1MB';
+CREATE TEMPORARY TABLE test_io_local(a int, b TEXT);
+SELECT sum(extended) AS io_sum_local_extends_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(evicted) AS io_sum_local_evictions_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(written) AS io_sum_local_writes_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+-- Insert tuples into the temporary table, generating extends in the stats.
+-- Insert enough values that we need to reuse and write out dirty local
+-- buffers, generating evictions and writes.
+INSERT INTO test_io_local SELECT generate_series(1, 8000) as id, repeat('a', 100);
+
+SELECT sum(read) AS io_sum_local_reads_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+-- Read in evicted buffers, generating reads.
+SELECT COUNT(*) FROM test_io_local;
+SELECT pg_stat_force_next_flush();
+SELECT sum(evicted) AS io_sum_local_evictions_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(read) AS io_sum_local_reads_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(written) AS io_sum_local_writes_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(extended) AS io_sum_local_extends_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT :io_sum_local_evictions_after > :io_sum_local_evictions_before;
+SELECT :io_sum_local_reads_after > :io_sum_local_reads_before;
+SELECT :io_sum_local_writes_after > :io_sum_local_writes_before;
+SELECT :io_sum_local_extends_after > :io_sum_local_extends_before;
+RESET temp_buffers;
+
+-- Test that reuse of strategy buffers and reads of blocks into these reused
+-- buffers while VACUUMing are tracked in pg_stat_io.
+
+-- Set wal_skip_threshold smaller than the expected size of
+-- test_io_vac_strategy so that, even if wal_level is minimal, VACUUM FULL will
+-- fsync the newly rewritten test_io_vac_strategy instead of writing it to WAL.
+-- Writing it to WAL will result in the newly written relation pages being in
+-- shared buffers -- preventing us from testing BAS_VACUUM BufferAccessStrategy
+-- reads.
+SET wal_skip_threshold = '1 kB';
+SELECT sum(reused) AS io_sum_vac_strategy_reuses_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(read) AS io_sum_vac_strategy_reads_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+CREATE TABLE test_io_vac_strategy(a int, b int) WITH (autovacuum_enabled = 'false');
+INSERT INTO test_io_vac_strategy SELECT i, i from generate_series(1, 8000)i;
+-- Ensure that the next VACUUM will need to perform IO by rewriting the table
+-- first with VACUUM (FULL).
+VACUUM (FULL) test_io_vac_strategy;
+VACUUM (PARALLEL 0) test_io_vac_strategy;
+SELECT pg_stat_force_next_flush();
+SELECT sum(reused) AS io_sum_vac_strategy_reuses_after FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(read) AS io_sum_vac_strategy_reads_after FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT :io_sum_vac_strategy_reads_after > :io_sum_vac_strategy_reads_before;
+SELECT :io_sum_vac_strategy_reuses_after > :io_sum_vac_strategy_reuses_before;
+RESET wal_skip_threshold;
+
+-- Test that extends done by a CTAS, which uses a BAS_BULKWRITE
+-- BufferAccessStrategy, are tracked in pg_stat_io.
+SELECT sum(extended) AS io_sum_bulkwrite_strategy_extends_before FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+CREATE TABLE test_io_bulkwrite_strategy AS SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+SELECT sum(extended) AS io_sum_bulkwrite_strategy_extends_after FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+SELECT :io_sum_bulkwrite_strategy_extends_after > :io_sum_bulkwrite_strategy_extends_before;
+
+-- Test IO stats reset
+SELECT sum(evicted) + sum(reused) + sum(extended) + sum(files_synced) + sum(read) + sum(written) AS io_stats_pre_reset FROM pg_stat_io \gset
+SELECT pg_stat_reset_shared('io');
+SELECT sum(evicted) + sum(reused) + sum(extended) + sum(files_synced) + sum(read) + sum(written) AS io_stats_post_reset FROM pg_stat_io \gset
+SELECT :io_stats_post_reset < :io_stats_pre_reset;
+
-- End of Stats Test
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 7b66b1bc89..c4ecef2bf8 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3377,6 +3377,7 @@ intset_internal_node
intset_leaf_node
intset_node
intvKEY
+io_stat_col
itemIdCompact
itemIdCompactData
iterator
--
2.34.1
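The regression tests above all follow the same before/after delta pattern against a shared counter matrix keyed by object, context, and operation. As an illustrative model only — the string values mirror the patch's IOObject/IOContext names, but none of this is PostgreSQL code — the accounting the tests rely on can be sketched as:

```python
from collections import defaultdict

# Hypothetical labels mirroring the patch's IOObject/IOContext identifiers.
IOOBJECT_RELATION, IOOBJECT_TEMP_RELATION = "relation", "temp relation"
IOCONTEXT_NORMAL, IOCONTEXT_VACUUM = "normal", "vacuum"

# One counter per (io_object, io_context, io_op) cell, as pg_stat_io exposes.
counters = defaultdict(int)

def pgstat_count_io_op(io_op, io_object, io_context):
    counters[(io_object, io_context, io_op)] += 1

# Simulate an INSERT extending a shared-buffer relation by three blocks,
# bracketed by the before/after reads the tests take with \gset.
before = counters[(IOOBJECT_RELATION, IOCONTEXT_NORMAL, "extend")]
for _ in range(3):
    pgstat_count_io_op("extend", IOOBJECT_RELATION, IOCONTEXT_NORMAL)
after = counters[(IOOBJECT_RELATION, IOCONTEXT_NORMAL, "extend")]
assert after > before  # the property each regression test checks
```

The tests deliberately assert only `after > before`, never exact counts, since background activity may bump the same cells concurrently.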
Attachment: v45-0003-pgstat-Count-IO-for-relations.patch (text/x-patch, US-ASCII)
From eb5aab5662eaa4194fd159cf227d0082d48bd515 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 4 Jan 2023 17:20:50 -0500
Subject: [PATCH v45 3/5] pgstat: Count IO for relations
Count IOOps done on IOObjects in IOContexts by various BackendTypes
using the IO stats infrastructure introduced by a previous commit.
The primary concern of these statistics is IO operations on data blocks
during the course of normal database operations. IO operations done by,
for example, the archiver or syslogger are not counted in these
statistics. WAL IO, temporary file IO, and IO done directly through smgr*
functions (such as when building an index) are not yet counted but would
be useful future additions.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
src/backend/storage/buffer/bufmgr.c | 109 ++++++++++++++++++++++----
src/backend/storage/buffer/freelist.c | 58 ++++++++++----
src/backend/storage/buffer/localbuf.c | 13 ++-
src/backend/storage/smgr/md.c | 25 ++++++
src/include/storage/buf_internals.h | 8 +-
src/include/storage/bufmgr.h | 7 +-
6 files changed, 184 insertions(+), 36 deletions(-)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 8075828e8a..d067afb420 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -481,8 +481,9 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
ForkNumber forkNum,
BlockNumber blockNum,
BufferAccessStrategy strategy,
- bool *foundPtr);
-static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+ bool *foundPtr, IOContext *io_context);
+static void FlushBuffer(BufferDesc *buf, SMgrRelation reln,
+ IOContext io_context, IOObject io_object);
static void FindAndDropRelationBuffers(RelFileLocator rlocator,
ForkNumber forkNum,
BlockNumber nForkBlock,
@@ -823,6 +824,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
BufferDesc *bufHdr;
Block bufBlock;
bool found;
+ IOContext io_context;
+ IOObject io_object;
bool isExtend;
bool isLocalBuf = SmgrIsTemp(smgr);
@@ -855,7 +858,14 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (isLocalBuf)
{
- bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
+ /*
+ * LocalBufferAlloc() will set the io_context to IOCONTEXT_NORMAL. We
+ * do not use a BufferAccessStrategy for IO of temporary tables.
+ * However, in some cases, the "strategy" may not be NULL, so we can't
+ * rely on IOContextForStrategy() to set the right IOContext for us.
+ * This may happen in cases like CREATE TEMPORARY TABLE AS...
+ */
+ bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found, &io_context);
if (found)
pgBufferUsage.local_blks_hit++;
else if (isExtend)
@@ -871,7 +881,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* not currently in memory.
*/
bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum,
- strategy, &found);
+ strategy, &found, &io_context);
if (found)
pgBufferUsage.shared_blks_hit++;
else if (isExtend)
@@ -986,7 +996,16 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
*/
Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID)); /* spinlock not needed */
- bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
+ if (isLocalBuf)
+ {
+ bufBlock = LocalBufHdrGetBlock(bufHdr);
+ io_object = IOOBJECT_TEMP_RELATION;
+ }
+ else
+ {
+ bufBlock = BufHdrGetBlock(bufHdr);
+ io_object = IOOBJECT_RELATION;
+ }
if (isExtend)
{
@@ -995,6 +1014,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
/* don't set checksum for all-zero page */
smgrextend(smgr, forkNum, blockNum, (char *) bufBlock, false);
+ pgstat_count_io_op(IOOP_EXTEND, io_object, io_context);
+
/*
* NB: we're *not* doing a ScheduleBufferTagForWriteback here;
* although we're essentially performing a write. At least on linux
@@ -1020,6 +1041,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
smgrread(smgr, forkNum, blockNum, (char *) bufBlock);
+ pgstat_count_io_op(IOOP_READ, io_object, io_context);
+
if (track_io_timing)
{
INSTR_TIME_SET_CURRENT(io_time);
@@ -1113,14 +1136,19 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* *foundPtr is actually redundant with the buffer's BM_VALID flag, but
* we keep it for simplicity in ReadBuffer.
*
+ * io_context is passed as an output parameter to avoid calling
+ * IOContextForStrategy() when there is a shared buffers hit and no IO
+ * statistics need be captured.
+ *
* No locks are held either at entry or exit.
*/
static BufferDesc *
BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
BlockNumber blockNum,
BufferAccessStrategy strategy,
- bool *foundPtr)
+ bool *foundPtr, IOContext *io_context)
{
+ bool from_ring;
BufferTag newTag; /* identity of requested block */
uint32 newHash; /* hash value for newTag */
LWLock *newPartitionLock; /* buffer partition lock for it */
@@ -1172,8 +1200,11 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
{
/*
* If we get here, previous attempts to read the buffer must
- * have failed ... but we shall bravely try again.
+ * have failed ... but we shall bravely try again. Set
+ * io_context since we will in fact need to count an IO
+ * Operation.
*/
+ *io_context = IOContextForStrategy(strategy);
*foundPtr = false;
}
}
@@ -1187,6 +1218,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
*/
LWLockRelease(newPartitionLock);
+ *io_context = IOContextForStrategy(strategy);
+
/* Loop here in case we have to try another victim buffer */
for (;;)
{
@@ -1200,7 +1233,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* Select a victim buffer. The buffer is returned with its header
* spinlock still held!
*/
- buf = StrategyGetBuffer(strategy, &buf_state);
+ buf = StrategyGetBuffer(strategy, &buf_state, &from_ring);
Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
@@ -1254,7 +1287,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
UnlockBufHdr(buf, buf_state);
if (XLogNeedsFlush(lsn) &&
- StrategyRejectBuffer(strategy, buf))
+ StrategyRejectBuffer(strategy, buf, from_ring))
{
/* Drop lock/pin and loop around for another buffer */
LWLockRelease(BufferDescriptorGetContentLock(buf));
@@ -1269,7 +1302,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
smgr->smgr_rlocator.locator.dbOid,
smgr->smgr_rlocator.locator.relNumber);
- FlushBuffer(buf, NULL);
+ FlushBuffer(buf, NULL, *io_context, IOOBJECT_RELATION);
LWLockRelease(BufferDescriptorGetContentLock(buf));
ScheduleBufferTagForWriteback(&BackendWritebackContext,
@@ -1441,6 +1474,28 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
UnlockBufHdr(buf, buf_state);
+ if (oldFlags & BM_VALID)
+ {
+ /*
+ * When a BufferAccessStrategy is in use, blocks evicted from shared
+ * buffers are counted as IOOP_EVICT in the corresponding context
+ * (e.g. IOCONTEXT_BULKWRITE). Shared buffers are evicted by a
+ * strategy in two cases: 1) while initially claiming buffers for the
+ * strategy ring 2) to replace an existing strategy ring buffer
+ * because it is pinned or in use and cannot be reused.
+ *
+ * Blocks evicted from buffers already in the strategy ring are
+ * counted as IOOP_REUSE in the corresponding strategy context.
+ *
+ * At this point, we can accurately count evictions and reuses,
+ * because we have successfully claimed the valid buffer. Previously,
+ * we may have been forced to release the buffer due to concurrent
+ * pinners or erroring out.
+ */
+ pgstat_count_io_op(from_ring ? IOOP_REUSE : IOOP_EVICT,
+ IOOBJECT_RELATION, *io_context);
+ }
+
if (oldPartitionLock != NULL)
{
BufTableDelete(&oldTag, oldHash);
@@ -2570,7 +2625,7 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOCONTEXT_NORMAL, IOOBJECT_RELATION);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
@@ -2820,7 +2875,7 @@ BufferGetTag(Buffer buffer, RelFileLocator *rlocator, ForkNumber *forknum,
* as the second parameter. If not, pass NULL.
*/
static void
-FlushBuffer(BufferDesc *buf, SMgrRelation reln)
+FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOContext io_context, IOObject io_object)
{
XLogRecPtr recptr;
ErrorContextCallback errcallback;
@@ -2912,6 +2967,26 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
bufToWrite,
false);
+ /*
+ * When a strategy is in use, only flushes of dirty buffers already in the
+ * strategy ring are counted as strategy writes (IOCONTEXT
+ * [BULKREAD|BULKWRITE|VACUUM] IOOP_WRITE) for the purpose of IO
+ * statistics tracking.
+ *
+ * If a shared buffer initially added to the ring must be flushed before
+ * being used, this is counted as an IOCONTEXT_NORMAL IOOP_WRITE.
+ *
+ * If a shared buffer which was added to the ring later because the
+ * current strategy buffer is pinned or in use or because all strategy
+ * buffers were dirty and rejected (for BAS_BULKREAD operations only)
+ * requires flushing, this is counted as an IOCONTEXT_NORMAL IOOP_WRITE
+ * (from_ring will be false).
+ *
+ * When a strategy is not in use, the write can only be a "regular" write
+ * of a dirty shared buffer (IOCONTEXT_NORMAL IOOP_WRITE).
+ */
+ pgstat_count_io_op(IOOP_WRITE, IOOBJECT_RELATION, io_context);
+
if (track_io_timing)
{
INSTR_TIME_SET_CURRENT(io_time);
@@ -3554,6 +3629,8 @@ FlushRelationBuffers(Relation rel)
buf_state &= ~(BM_DIRTY | BM_JUST_DIRTIED);
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+ pgstat_count_io_op(IOOP_WRITE, IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL);
+
/* Pop the error context stack */
error_context_stack = errcallback.previous;
}
@@ -3586,7 +3663,7 @@ FlushRelationBuffers(Relation rel)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, RelationGetSmgr(rel));
+ FlushBuffer(bufHdr, RelationGetSmgr(rel), IOCONTEXT_NORMAL, IOOBJECT_RELATION);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr);
}
@@ -3684,7 +3761,7 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, srelent->srel);
+ FlushBuffer(bufHdr, srelent->srel, IOCONTEXT_NORMAL, IOOBJECT_RELATION);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr);
}
@@ -3894,7 +3971,7 @@ FlushDatabaseBuffers(Oid dbid)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOCONTEXT_NORMAL, IOOBJECT_RELATION);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr);
}
@@ -3921,7 +3998,7 @@ FlushOneBuffer(Buffer buffer)
Assert(LWLockHeldByMe(BufferDescriptorGetContentLock(bufHdr)));
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOCONTEXT_NORMAL, IOOBJECT_RELATION);
}
/*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 7dec35801c..c690d5f15f 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -15,6 +15,7 @@
*/
#include "postgres.h"
+#include "pgstat.h"
#include "port/atomics.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
@@ -81,12 +82,6 @@ typedef struct BufferAccessStrategyData
*/
int current;
- /*
- * True if the buffer just returned by StrategyGetBuffer had been in the
- * ring already.
- */
- bool current_was_in_ring;
-
/*
* Array of buffer numbers. InvalidBuffer (that is, zero) indicates we
* have not yet selected a buffer for this ring slot. For allocation
@@ -198,13 +193,15 @@ have_free_buffer(void)
* return the buffer with the buffer header spinlock still held.
*/
BufferDesc *
-StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
+StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_ring)
{
BufferDesc *buf;
int bgwprocno;
int trycounter;
uint32 local_buf_state; /* to avoid repeated (de-)referencing */
+ *from_ring = false;
+
/*
* If given a strategy object, see whether it can select a buffer. We
* assume strategy objects don't need buffer_strategy_lock.
@@ -213,7 +210,10 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
{
buf = GetBufferFromRing(strategy, buf_state);
if (buf != NULL)
+ {
+ *from_ring = true;
return buf;
+ }
}
/*
@@ -602,7 +602,7 @@ FreeAccessStrategy(BufferAccessStrategy strategy)
/*
* GetBufferFromRing -- returns a buffer from the ring, or NULL if the
- * ring is empty.
+ * ring is empty / not usable.
*
* The bufhdr spin lock is held on the returned buffer.
*/
@@ -625,10 +625,7 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
*/
bufnum = strategy->buffers[strategy->current];
if (bufnum == InvalidBuffer)
- {
- strategy->current_was_in_ring = false;
return NULL;
- }
/*
* If the buffer is pinned we cannot use it under any circumstances.
@@ -644,7 +641,6 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
if (BUF_STATE_GET_REFCOUNT(local_buf_state) == 0
&& BUF_STATE_GET_USAGECOUNT(local_buf_state) <= 1)
{
- strategy->current_was_in_ring = true;
*buf_state = local_buf_state;
return buf;
}
@@ -654,7 +650,6 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
* Tell caller to allocate a new buffer with the normal allocation
* strategy. He'll then replace this ring element via AddBufferToRing.
*/
- strategy->current_was_in_ring = false;
return NULL;
}
@@ -670,6 +665,39 @@ AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
strategy->buffers[strategy->current] = BufferDescriptorGetBuffer(buf);
}
+/*
+ * Utility function returning the IOContext of a given BufferAccessStrategy's
+ * strategy ring.
+ */
+IOContext
+IOContextForStrategy(BufferAccessStrategy strategy)
+{
+ if (!strategy)
+ return IOCONTEXT_NORMAL;
+
+ switch (strategy->btype)
+ {
+ case BAS_NORMAL:
+
+ /*
+ * Currently, GetAccessStrategy() returns NULL for
+ * BufferAccessStrategyType BAS_NORMAL, so this case is
+ * unreachable.
+ */
+ pg_unreachable();
+ return IOCONTEXT_NORMAL;
+ case BAS_BULKREAD:
+ return IOCONTEXT_BULKREAD;
+ case BAS_BULKWRITE:
+ return IOCONTEXT_BULKWRITE;
+ case BAS_VACUUM:
+ return IOCONTEXT_VACUUM;
+ }
+
+ elog(ERROR, "unrecognized BufferAccessStrategyType: %d", strategy->btype);
+ pg_unreachable();
+}
+
/*
* StrategyRejectBuffer -- consider rejecting a dirty buffer
*
@@ -682,14 +710,14 @@ AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
* if this buffer should be written and re-used.
*/
bool
-StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
+StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool from_ring)
{
/* We only do this in bulkread mode */
if (strategy->btype != BAS_BULKREAD)
return false;
/* Don't muck with behavior of normal buffer-replacement strategy */
- if (!strategy->current_was_in_ring ||
+ if (!from_ring ||
strategy->buffers[strategy->current] != BufferDescriptorGetBuffer(buf))
return false;
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 8372acc383..2108bbe7d8 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -18,6 +18,7 @@
#include "access/parallel.h"
#include "catalog/catalog.h"
#include "executor/instrument.h"
+#include "pgstat.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "utils/guc_hooks.h"
@@ -107,7 +108,7 @@ PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
*/
BufferDesc *
LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
- bool *foundPtr)
+ bool *foundPtr, IOContext *io_context)
{
BufferTag newTag; /* identity of requested block */
LocalBufferLookupEnt *hresult;
@@ -127,6 +128,14 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
hresult = (LocalBufferLookupEnt *)
hash_search(LocalBufHash, (void *) &newTag, HASH_FIND, NULL);
+ /*
+ * IO Operations on local buffers are only done in IOCONTEXT_NORMAL. Set
+ * io_context here, before a possible buffer hit returns, since unlike the
+ * shared-buffer path there is no IOContextForStrategy() overhead that we
+ * would want to avoid paying on a hit.
+ */
+ *io_context = IOCONTEXT_NORMAL;
+
if (hresult)
{
b = hresult->id;
@@ -230,6 +239,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
buf_state &= ~BM_DIRTY;
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+ pgstat_count_io_op(IOOP_WRITE, IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL);
pgBufferUsage.local_blks_written++;
}
@@ -256,6 +266,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
ClearBufferTag(&bufHdr->tag);
buf_state &= ~(BM_VALID | BM_TAG_VALID);
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+ pgstat_count_io_op(IOOP_EVICT, IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL);
}
hresult = (LocalBufferLookupEnt *)
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 60c9905eff..37bae4bf73 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -983,6 +983,15 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
{
MdfdVec *v = &reln->md_seg_fds[forknum][segno - 1];
+ /*
+ * fsyncs done through mdimmedsync() should be tracked in a separate
+ * IOContext from those done through mdsyncfiletag() to differentiate
+ * between unavoidable client backend fsyncs (e.g. those done during
+ * index build) and those which ideally would have been done by the
+ * checkpointer. Since other IO operations bypassing the buffer
+ * manager could also be tracked in such an IOContext, wait until
+ * these are also tracked to track immediate fsyncs.
+ */
if (FileSync(v->mdfd_vfd, WAIT_EVENT_DATA_FILE_IMMEDIATE_SYNC) < 0)
ereport(data_sync_elevel(ERROR),
(errcode_for_file_access(),
@@ -1021,6 +1030,19 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false /* retryOnError */ ))
{
+ /*
+ * We have no way of knowing if the current IOContext is
+ * IOCONTEXT_NORMAL or IOCONTEXT_[BULKREAD, BULKWRITE, VACUUM] at this
+ * point, so count the fsync as being in the IOCONTEXT_NORMAL
+ * IOContext. This is probably okay, because the number of backend
+ * fsyncs doesn't say anything about the efficacy of the
+ * BufferAccessStrategy. And counting both fsyncs done in
+ * IOCONTEXT_NORMAL and IOCONTEXT_[BULKREAD, BULKWRITE, VACUUM] under
+ * IOCONTEXT_NORMAL is likely clearer when investigating the number of
+ * backend fsyncs.
+ */
+ pgstat_count_io_op(IOOP_FSYNC, IOOBJECT_RELATION, IOCONTEXT_NORMAL);
+
ereport(DEBUG1,
(errmsg_internal("could not forward fsync request because request queue is full")));
@@ -1410,6 +1432,9 @@ mdsyncfiletag(const FileTag *ftag, char *path)
if (need_to_close)
FileClose(file);
+ if (result >= 0)
+ pgstat_count_io_op(IOOP_FSYNC, IOOBJECT_RELATION, IOCONTEXT_NORMAL);
+
errno = save_errno;
return result;
}
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index ed8aa2519c..0b44814740 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -15,6 +15,7 @@
#ifndef BUFMGR_INTERNALS_H
#define BUFMGR_INTERNALS_H
+#include "pgstat.h"
#include "port/atomics.h"
#include "storage/buf.h"
#include "storage/bufmgr.h"
@@ -391,11 +392,12 @@ extern void IssuePendingWritebacks(WritebackContext *context);
extern void ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *tag);
/* freelist.c */
+extern IOContext IOContextForStrategy(BufferAccessStrategy bas);
extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
- uint32 *buf_state);
+ uint32 *buf_state, bool *from_ring);
extern void StrategyFreeBuffer(BufferDesc *buf);
extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
- BufferDesc *buf);
+ BufferDesc *buf, bool from_ring);
extern int StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
extern void StrategyNotifyBgWriter(int bgwprocno);
@@ -417,7 +419,7 @@ extern PrefetchBufferResult PrefetchLocalBuffer(SMgrRelation smgr,
ForkNumber forkNum,
BlockNumber blockNum);
extern BufferDesc *LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum,
- BlockNumber blockNum, bool *foundPtr);
+ BlockNumber blockNum, bool *foundPtr, IOContext *io_context);
extern void MarkLocalBufferDirty(Buffer buffer);
extern void DropRelationLocalBuffers(RelFileLocator rlocator,
ForkNumber forkNum,
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 33eadbc129..b8a18b8081 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -23,7 +23,12 @@
typedef void *Block;
-/* Possible arguments for GetAccessStrategy() */
+/*
+ * Possible arguments for GetAccessStrategy().
+ *
+ * If adding a new BufferAccessStrategyType, also add a new IOContext so
+ * IO statistics using this strategy are tracked.
+ */
typedef enum BufferAccessStrategyType
{
BAS_NORMAL, /* Normal random access */
--
2.34.1
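The BufferAlloc() accounting in this patch distinguishes two ways a valid buffer can be replaced under a strategy: reusing a buffer already in the strategy ring (IOOP_REUSE) versus evicting one newly claimed from shared buffers (IOOP_EVICT). A minimal sketch of that decision — the function and dict names are illustrative, not PostgreSQL code — looks like:

```python
# Model of the from_ring flag threaded from StrategyGetBuffer() back to
# BufferAlloc(): a valid victim from the ring is a reuse, any other valid
# victim claimed for the ring is an eviction, both counted in the
# strategy's IOContext (e.g. "vacuum").
stats = {"reuse": 0, "evict": 0}

def count_valid_buffer_replacement(from_ring, io_context):
    op = "reuse" if from_ring else "evict"
    stats[op] += 1
    return (op, io_context)

assert count_valid_buffer_replacement(True, "vacuum") == ("reuse", "vacuum")
assert count_valid_buffer_replacement(False, "vacuum") == ("evict", "vacuum")
```

Replacing the old `current_was_in_ring` strategy field with an explicit `from_ring` out-parameter keeps this decision local to the one call chain that needs it.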
Attachment: v45-0005-pg_stat_io-documentation.patch (text/x-patch, US-ASCII)
From a2350adddce51f564a5f573b8b57f115bfd47ff4 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 9 Jan 2023 14:42:53 -0500
Subject: [PATCH v45 5/5] pg_stat_io documentation
Author: Melanie Plageman <melanieplageman@gmail.com>
Author: Samay Sharma <smilingsamay@gmail.com>
Reviewed-by: Maciek Sakrejda <m.sakrejda@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
doc/src/sgml/monitoring.sgml | 321 +++++++++++++++++++++++++++++++++--
1 file changed, 307 insertions(+), 14 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 1691246e76..0f4d664516 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -469,6 +469,16 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</entry>
</row>
+ <row>
+ <entry><structname>pg_stat_io</structname><indexterm><primary>pg_stat_io</primary></indexterm></entry>
+ <entry>
+ One row per backend type, context, target object combination showing
+ cluster-wide I/O statistics.
+ See <link linkend="monitoring-pg-stat-io-view">
+ <structname>pg_stat_io</structname></link> for details.
+ </entry>
+ </row>
+
<row>
<entry><structname>pg_stat_replication_slots</structname><indexterm><primary>pg_stat_replication_slots</primary></indexterm></entry>
<entry>One row per replication slot, showing statistics about the
@@ -665,20 +675,20 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</para>
<para>
- The <structname>pg_statio_</structname> views are primarily useful to
- determine the effectiveness of the buffer cache. When the number
- of actual disk reads is much smaller than the number of buffer
- hits, then the cache is satisfying most read requests without
- invoking a kernel call. However, these statistics do not give the
- entire story: due to the way in which <productname>PostgreSQL</productname>
- handles disk I/O, data that is not in the
- <productname>PostgreSQL</productname> buffer cache might still reside in the
- kernel's I/O cache, and might therefore still be fetched without
- requiring a physical read. Users interested in obtaining more
- detailed information on <productname>PostgreSQL</productname> I/O behavior are
- advised to use the <productname>PostgreSQL</productname> statistics views
- in combination with operating system utilities that allow insight
- into the kernel's handling of I/O.
+ The <structname>pg_stat_io</structname> and
+ <structname>pg_statio_</structname> set of views are especially useful for
+ determining the effectiveness of the buffer cache. When the number of actual
+ disk reads is much smaller than the number of buffer hits, then the cache is
+ satisfying most read requests without invoking a kernel call. However, these
+ statistics do not give the entire story: due to the way in which
+ <productname>PostgreSQL</productname> handles disk I/O, data that is not in
+ the <productname>PostgreSQL</productname> buffer cache might still reside in
+ the kernel's I/O cache, and might therefore still be fetched without
+ requiring a physical read. Users interested in obtaining more detailed
+ information on <productname>PostgreSQL</productname> I/O behavior are
+ advised to use the <productname>PostgreSQL</productname> statistics views in
+ combination with operating system utilities that allow insight into the
+ kernel's handling of I/O.
</para>
</sect2>
@@ -3633,6 +3643,289 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
<structfield>last_archived_wal</structfield> have also been successfully
archived.
</para>
+ </sect2>
+
+ <sect2 id="monitoring-pg-stat-io-view">
+ <title><structname>pg_stat_io</structname></title>
+
+ <indexterm>
+ <primary>pg_stat_io</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_stat_io</structname> view will contain one row for each
+ backend type, I/O context, and target I/O object combination showing
+ cluster-wide I/O statistics. Combinations which do not make sense are
+ omitted.
+ </para>
+
+ <para>
+ Currently, I/O on relations (e.g. tables, indexes) is tracked. However,
+ relation I/O which bypasses shared buffers (e.g. when moving a table from one
+ tablespace to another) is currently not tracked.
+ </para>
+
+ <table id="pg-stat-io-view" xreflabel="pg_stat_io">
+ <title><structname>pg_stat_io</structname> View</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para>
+ </entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>backend_type</structfield> <type>text</type>
+ </para>
+ <para>
+ Type of backend (e.g. background worker, autovacuum worker). See <link
+ linkend="monitoring-pg-stat-activity-view">
+ <structname>pg_stat_activity</structname></link> for more information
+ on <varname>backend_type</varname>s. Some
+ <varname>backend_type</varname>s do not accumulate I/O operation
+ statistics and will not be included in the view.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>io_context</structfield> <type>text</type>
+ </para>
+ <para>
+ The context of an I/O operation. Possible values are:
+ </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>normal</literal>: The default or standard
+ <varname>io_context</varname> for a type of I/O operation. For
+ example, by default, relation data is read into and written out from
+ shared buffers. Thus, reads and writes of relation data to and from
+ shared buffers are tracked in <varname>io_context</varname>
+ <literal>normal</literal>.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>vacuum</literal>: I/O operations done outside of shared
+ buffers incurred while vacuuming and analyzing permanent relations.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>bulkread</literal>: Qualifying large read I/O operations
+ done outside of shared buffers, for example, a sequential scan of a
+ large table.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>bulkwrite</literal>: Qualifying large write I/O operations
+ done outside of shared buffers, such as <command>COPY</command>.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>io_object</structfield> <type>text</type>
+ </para>
+ <para>
+ Target object of an I/O operation. Possible values are:
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>relation</literal>: This includes permanent relations.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>temp relation</literal>: This includes temporary relations.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>read</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of read operations in units of <varname>op_bytes</varname>.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>written</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of write operations in units of <varname>op_bytes</varname>.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>extended</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of relation extend operations in units of
+ <varname>op_bytes</varname>.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>op_bytes</structfield> <type>bigint</type>
+ </para>
+ <para>
+ The number of bytes per unit of I/O read, written, or extended.
+ </para>
+ <para>
+ Relation data reads, writes, and extends are done in
+ <varname>block_size</varname> units, derived from the build-time
+ parameter <symbol>BLCKSZ</symbol>, which is <literal>8192</literal> by
+ default.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>evicted</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of times a block has been evicted from a shared or local buffer.
+ </para>
+ <para>
+ In <varname>io_context</varname> <literal>normal</literal>, this counts
+ the number of times a block was evicted from a buffer and replaced with
+ another block. In <varname>io_context</varname>s
+ <literal>bulkwrite</literal>, <literal>bulkread</literal>, and
+ <literal>vacuum</literal>, this counts the number of times a block was
+ evicted from shared buffers in order to add the shared buffer to a
+ separate size-limited ring buffer.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>reused</structfield> <type>bigint</type>
+ </para>
+ <para>
+ The number of times an existing buffer in a size-limited ring buffer
+ outside of shared buffers was reused as part of an I/O operation in the
+ <literal>bulkread</literal>, <literal>bulkwrite</literal>, or
+ <literal>vacuum</literal> <varname>io_context</varname>s.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>files_synced</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of <literal>fsync</literal> calls. These are only tracked in
+ <varname>io_context</varname> <literal>normal</literal>.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
+ </para>
+ <para>
+ Time at which these statistics were last reset.
+ </para>
+ </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ Some <varname>backend_type</varname>s never perform I/O operations in some
+ <varname>io_context</varname>s and/or on some <varname>io_object</varname>s.
+ These rows are omitted from the view. For example, the checkpointer does not
+ checkpoint temporary tables, so there will be no rows for
+ <varname>backend_type</varname> <literal>checkpointer</literal> and
+ <varname>io_object</varname> <literal>temp relation</literal>.
+ </para>
+
+ <para>
+ In addition, some I/O operations will never be performed either by certain
+ <varname>backend_type</varname>s or in certain
+ <varname>io_context</varname>s or on certain <varname>io_object</varname>s.
+ These cells will be NULL. For example, temporary tables are not
+ <literal>fsync</literal>ed, so <varname>files_synced</varname> will be NULL
+ for <varname>io_object</varname> <literal>temp relation</literal>. Also, the
+ background writer does not perform reads, so <varname>read</varname> will be
+ NULL in rows for <varname>backend_type</varname> <literal>background
+ writer</literal>.
+ </para>
+
+ <para>
+ <structname>pg_stat_io</structname> can be used to inform database tuning.
+ For example:
+ <itemizedlist>
+ <listitem>
+ <para>
+ A high <varname>evicted</varname> count can indicate that shared buffers
+ should be increased.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ Client backends rely on the checkpointer to ensure data is persisted to
+ permanent storage. Large numbers of <varname>files_synced</varname> by
+ <literal>client backend</literal>s could indicate a misconfiguration of
+ shared buffers or of checkpointer. More information on checkpointer
+ configuration can be found in <xref linkend="wal-configuration"/>.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ Normally, client backends should be able to rely on auxiliary processes
+ like the checkpointer and background writer to write out dirty data as
+ much as possible. Large numbers of writes by client backends could
+ indicate a misconfiguration of shared buffers or of checkpointer. More
+ information on checkpointer configuration can be found in <xref
+ linkend="wal-configuration"/>.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </para>
+
</sect2>
--
2.34.1
Attachment: v45-0001-pgindent-and-some-manual-cleanup-in-pgstat-relat.patch
From c825b764df58ce622fb10d1b846a6e7db184183a Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 9 Dec 2022 18:23:19 -0800
Subject: [PATCH v45 1/5] pgindent and some manual cleanup in pgstat related
code
---
src/backend/storage/buffer/bufmgr.c | 22 ++++++++++----------
src/backend/storage/buffer/localbuf.c | 4 ++--
src/backend/utils/activity/pgstat.c | 3 ++-
src/backend/utils/activity/pgstat_relation.c | 1 +
src/backend/utils/adt/pgstatfuncs.c | 2 +-
src/include/pgstat.h | 1 +
src/include/utils/pgstat_internal.h | 1 +
7 files changed, 19 insertions(+), 15 deletions(-)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 3fb38a25cf..8075828e8a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -516,7 +516,7 @@ PrefetchSharedBuffer(SMgrRelation smgr_reln,
/* create a tag so we can lookup the buffer */
InitBufferTag(&newTag, &smgr_reln->smgr_rlocator.locator,
- forkNum, blockNum);
+ forkNum, blockNum);
/* determine its hash code and partition lock ID */
newHash = BufTableHashCode(&newTag);
@@ -3297,8 +3297,8 @@ DropRelationsAllBuffers(SMgrRelation *smgr_reln, int nlocators)
uint32 buf_state;
/*
- * As in DropRelationBuffers, an unlocked precheck should be
- * safe and saves some cycles.
+ * As in DropRelationBuffers, an unlocked precheck should be safe and
+ * saves some cycles.
*/
if (!use_bsearch)
@@ -3425,8 +3425,8 @@ DropDatabaseBuffers(Oid dbid)
uint32 buf_state;
/*
- * As in DropRelationBuffers, an unlocked precheck should be
- * safe and saves some cycles.
+ * As in DropRelationBuffers, an unlocked precheck should be safe and
+ * saves some cycles.
*/
if (bufHdr->tag.dbOid != dbid)
continue;
@@ -3572,8 +3572,8 @@ FlushRelationBuffers(Relation rel)
bufHdr = GetBufferDescriptor(i);
/*
- * As in DropRelationBuffers, an unlocked precheck should be
- * safe and saves some cycles.
+ * As in DropRelationBuffers, an unlocked precheck should be safe and
+ * saves some cycles.
*/
if (!BufTagMatchesRelFileLocator(&bufHdr->tag, &rel->rd_locator))
continue;
@@ -3645,8 +3645,8 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
uint32 buf_state;
/*
- * As in DropRelationBuffers, an unlocked precheck should be
- * safe and saves some cycles.
+ * As in DropRelationBuffers, an unlocked precheck should be safe and
+ * saves some cycles.
*/
if (!use_bsearch)
@@ -3880,8 +3880,8 @@ FlushDatabaseBuffers(Oid dbid)
bufHdr = GetBufferDescriptor(i);
/*
- * As in DropRelationBuffers, an unlocked precheck should be
- * safe and saves some cycles.
+ * As in DropRelationBuffers, an unlocked precheck should be safe and
+ * saves some cycles.
*/
if (bufHdr->tag.dbOid != dbid)
continue;
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index b2720df6ea..8372acc383 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -610,8 +610,8 @@ AtProcExit_LocalBuffers(void)
{
/*
* We shouldn't be holding any remaining pins; if we are, and assertions
- * aren't enabled, we'll fail later in DropRelationBuffers while
- * trying to drop the temp rels.
+ * aren't enabled, we'll fail later in DropRelationBuffers while trying to
+ * drop the temp rels.
*/
CheckForLocalBufferLeaks();
}
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 7e9dc17e68..0fa5370bcd 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -426,7 +426,7 @@ pgstat_discard_stats(void)
ereport(DEBUG2,
(errcode_for_file_access(),
errmsg_internal("unlinked permanent statistics file \"%s\"",
- PGSTAT_STAT_PERMANENT_FILENAME)));
+ PGSTAT_STAT_PERMANENT_FILENAME)));
}
/*
@@ -986,6 +986,7 @@ pgstat_build_snapshot(void)
entry->data = MemoryContextAlloc(pgStatLocal.snapshot.context,
kind_info->shared_size);
+
/*
* Acquire the LWLock directly instead of using
* pg_stat_lock_entry_shared() which requires a reference.
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index 1730425de1..2e20b93c20 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -783,6 +783,7 @@ pgstat_relation_flush_cb(PgStat_EntryRef *entry_ref, bool nowait)
if (lstats->t_counts.t_numscans)
{
TimestampTz t = GetCurrentTransactionStopTimestamp();
+
if (t > tabentry->lastscan)
tabentry->lastscan = t;
}
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 6cddd74aa7..58bd1360b9 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -906,7 +906,7 @@ pg_stat_get_backend_client_addr(PG_FUNCTION_ARGS)
clean_ipv6_addr(beentry->st_clientaddr.addr.ss_family, remote_host);
PG_RETURN_DATUM(DirectFunctionCall1(inet_in,
- CStringGetDatum(remote_host)));
+ CStringGetDatum(remote_host)));
}
Datum
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index d3e965d744..5e3326a3b9 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -476,6 +476,7 @@ extern void pgstat_report_connect(Oid dboid);
extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dboid);
+
/*
* Functions in pgstat_function.c
*/
diff --git a/src/include/utils/pgstat_internal.h b/src/include/utils/pgstat_internal.h
index 08412d6404..12fd51f1ae 100644
--- a/src/include/utils/pgstat_internal.h
+++ b/src/include/utils/pgstat_internal.h
@@ -626,6 +626,7 @@ extern void pgstat_wal_snapshot_cb(void);
extern bool pgstat_subscription_flush_cb(PgStat_EntryRef *entry_ref, bool nowait);
extern void pgstat_subscription_reset_timestamp_cb(PgStatShared_Common *header, TimestampTz ts);
+
/*
* Functions in pgstat_xact.c
*/
--
2.34.1
On Tue, 10 Jan 2023 at 02:41, Melanie Plageman
<melanieplageman@gmail.com> wrote:
Attached is v45 of the patchset. I've done some additional code cleanup
and changes. The most significant change, however, is the docs. I've
separated the docs into its own patch for ease of review. The docs patch
here was edited and co-authored by Samay Sharma. I'm not sure if the
order of pg_stat_io in the docs is correct.

The significant changes are removal of all "correspondence" or
"equivalence"-related sections (those explaining how other IO stats were
the same or different from pg_stat_io columns). I've tried to remove
references to "strategies" and "Buffer Access Strategy" as much as
possible.

I've moved the advice and interpretation section to the bottom --
outside of the table of definitions. Since this page is primarily a
reference page, I agree with Samay that incorporating interpretation
into the column definitions adds clutter and confusion.

I think the best course would be to have an "Interpreting Statistics"
section. I suggest a structure like the following for this section:

- Statistics Collection Configuration
- Viewing Statistics
- Statistics Views Reference
- Statistics Functions Reference
- Interpreting Statistics

As an aside, this section of the docs has some other structural issues
as well. For example, I'm not sure it makes sense to have the dynamic
statistics views as sub-sections under 28.2, which is titled "The
Cumulative Statistics System."

In fact the docs say this under Section 28.2
(https://www.postgresql.org/docs/current/monitoring-stats.html):

"PostgreSQL also supports reporting dynamic information about exactly
what is going on in the system right now, such as the exact command
currently being executed by other server processes, and which other
connections exist in the system. This facility is independent of the
cumulative statistics system."

So, it is a bit weird that they are defined under the section titled
"The Cumulative Statistics System".

In this version of the patchset, I have not attempted a new structure
but instead moved the advice/interpretation for pg_stat_io to below the
table containing the column definitions.
For some reason cfbot is not able to apply this patch as in [1],
please have a look and post an updated patch if required:
=== Applying patches on top of PostgreSQL commit ID
3c6fc58209f24b959ee18f5d19ef96403d08f15c ===
=== applying patch
./v45-0001-pgindent-and-some-manual-cleanup-in-pgstat-relat.patch
patching file src/backend/storage/buffer/bufmgr.c
patching file src/backend/storage/buffer/localbuf.c
patching file src/backend/utils/activity/pgstat.c
patching file src/backend/utils/activity/pgstat_relation.c
patching file src/backend/utils/adt/pgstatfuncs.c
patching file src/include/pgstat.h
patching file src/include/utils/pgstat_internal.h
=== applying patch ./v45-0002-pgstat-Infrastructure-to-track-IO-operations.patch
gpatch: **** Only garbage was found in the patch input.
[1]: http://cfbot.cputube.org/patch_41_3272.log
Regards,
Vignesh
Subject: [PATCH v45 4/5] Add system view tracking IO ops per backend type
The patch can/will fail with:
CREATE TABLESPACE test_io_shared_stats_tblspc LOCATION '';
+WARNING: tablespaces created by regression test cases should have names starting with "regress_"
CREATE TABLESPACE test_stats LOCATION '';
+WARNING: tablespaces created by regression test cases should have names starting with "regress_"
(I already sent patches to address the omission in cirrus.yml)
1760 : errhint("Target must be \"archiver\", \"io\", \"bgwriter\", \"recovery_prefetch\", or \"wal\".")));
=> Do you want to put these in order?
pgstat_get_io_op_name() isn't currently being hit by tests; actually,
it's completely unused.
FlushRelationBuffers() isn't being hit for local buffers.
+ <entry><structname>pg_stat_io</structname><indexterm><primary>pg_stat_io</primary></indexterm></entry> + <entry> + One row per backend type, context, target object combination showing + cluster-wide I/O statistics.
I suggest: "One row for each combination of .."
+ The <structname>pg_stat_io</structname> and + <structname>pg_statio_</structname> set of views are especially useful for + determining the effectiveness of the buffer cache. When the number of actual + disk reads is much smaller than the number of buffer hits, then the cache is + satisfying most read requests without invoking a kernel call.
I would change this to say "Postgres' own buffer cache is satisfying ..."
However, these + statistics do not give the entire story: due to the way in which + <productname>PostgreSQL</productname> handles disk I/O, data that is not in + the <productname>PostgreSQL</productname> buffer cache might still reside in + the kernel's I/O cache, and might therefore still be fetched without
I suggest referring to "the kernel's page cache".
+ The <structname>pg_stat_io</structname> view will contain one row for each + backend type, I/O context, and target I/O object combination showing + cluster-wide I/O statistics. Combinations which do not make sense are + omitted.
"..for each combination of .."
+ <varname>io_context</varname> for a type of I/O operation. For
"for I/O operations"
+ <literal>vacuum</literal>: I/O operations done outside of shared + buffers incurred while vacuuming and analyzing permanent relations.
s/incurred/performed/
+ <literal>bulkread</literal>: Qualifying large read I/O operations + done outside of shared buffers, for example, a sequential scan of a + large table.
I don't think it's correct to say that it's "outside of" shared-buffers.
s/Qualifying/Certain/
+ <literal>bulkwrite</literal>: Qualifying large write I/O operations + done outside of shared buffers, such as <command>COPY</command>.
Same
+ Target object of an I/O operation. Possible values are: + <itemizedlist> + <listitem> + <para> + <literal>relation</literal>: This includes permanent relations.
It says "includes permanent" but what it seems to mean is that it is
"exclusive of temporary relations".
+ <row> + <entry role="catalog_table_entry"> + <para role="column_definition"> + <structfield>read</structfield> <type>bigint</type> + </para> + <para> + Number of read operations in units of <varname>op_bytes</varname>.
This looks too much like it means "bytes".
Should say: "in number of blocks of size >op_bytes<"
But wait - is it the number of read operations "in units of op_bytes"
(which would mean this is already multiplied by op_bytes, and is in units
of bytes).
Or the "number of read operations" *of* op_bytes chunks? Which would
mean this is a "pure" number, and could be multiplied by op_bytes to
obtain a size in bytes.
+ Number of write operations in units of <varname>op_bytes</varname>.
+ Number of relation extend operations in units of + <varname>op_bytes</varname>.
same
+ In <varname>io_context</varname> <literal>normal</literal>, this counts + the number of times a block was evicted from a buffer and replaced with + another block. In <varname>io_context</varname>s + <literal>bulkwrite</literal>, <literal>bulkread</literal>, and + <literal>vacuum</literal>, this counts the number of times a block was + evicted from shared buffers in order to add the shared buffer to a + separate size-limited ring buffer.
This never defines what "evicted" means. Does it mean that a dirty
buffer was written out?
+ The number of times an existing buffer in a size-limited ring buffer + outside of shared buffers was reused as part of an I/O operation in the + <literal>bulkread</literal>, <literal>bulkwrite</literal>, or + <literal>vacuum</literal> <varname>io_context</varname>s.
Maybe say "as part of a bulk I/O operation (bulkread, bulkwrite, or
vacuum)."
+ <para> + <structname>pg_stat_io</structname> can be used to inform database tuning.
+ For example: + <itemizedlist> + <listitem> + <para> + A high <varname>evicted</varname> count can indicate that shared buffers + should be increased. + </para> + </listitem> + <listitem> + <para> + Client backends rely on the checkpointer to ensure data is persisted to + permanent storage. Large numbers of <varname>files_synced</varname> by + <literal>client backend</literal>s could indicate a misconfiguration of + shared buffers or of checkpointer. More information on checkpointer
of *the* checkpointer
+ Normally, client backends should be able to rely on auxiliary processes + like the checkpointer and background writer to write out dirty data as
*the* bg writer
+ much as possible. Large numbers of writes by client backends could + indicate a misconfiguration of shared buffers or of checkpointer. More
*the* ckpointer
Should this link to various docs for checkpointer/bgwriter?
Maybe the docs for ALTER/COPY/VACUUM/CREATE/etc should be updated to
refer to some central description of ring buffers. Maybe something
should be included in the appendix.
--
Justin
Attached is v46.
On Wed, Dec 28, 2022 at 6:56 PM Andres Freund <andres@anarazel.de> wrote:
On 2022-10-06 13:42:09 -0400, Melanie Plageman wrote:
Additionally, some minor notes:
- Since the stats are counting blocks, it would make sense to prefix the view columns with "blks_", and word them in the past tense (to match current style), i.e. "blks_written", "blks_read", "blks_extended", "blks_fsynced" (realistically one would combine this new view with other data e.g. from pg_stat_database or pg_stat_statements, which all use the "blks_" prefix, and stop using pg_stat_bgwriter for this which does not use such a prefix)
I have changed the column names to be in the past tense.
For a while I was convinced by the consistency argument (after Melanie
pointing it out to me). But the more I look, the less convinced I am. The
existing IO related stats in pg_stat_database, pg_stat_bgwriter aren't past
tense, just the ones in pg_stat_statements. pg_stat_database uses past tense
for tup_*, but not xact_*, deadlocks, checksum_failures etc.

And even pg_stat_statements isn't consistent about it - otherwise it'd
be 'planned' instead of 'plans', 'called' instead of 'calls' etc.

I started to look at the naming "tense" issue again, after I got
"confused" about "extended", because that somehow makes me think about
more detailed stats or such, rather than files getting extended.

ISTM that 'evictions', 'extends', 'fsyncs', 'reads', 'reuses', 'writes' are
clearer than the past tense versions, and about as consistent with existing
columns.
I have updated the column names to the above recommendation.
On Wed, Jan 11, 2023 at 11:32 AM vignesh C <vignesh21@gmail.com> wrote:
For some reason cfbot is not able to apply this patch as in [1],
please have a look and post an updated patch if required:
=== Applying patches on top of PostgreSQL commit ID
3c6fc58209f24b959ee18f5d19ef96403d08f15c ===
=== applying patch
./v45-0001-pgindent-and-some-manual-cleanup-in-pgstat-relat.patch
patching file src/backend/storage/buffer/bufmgr.c
patching file src/backend/storage/buffer/localbuf.c
patching file src/backend/utils/activity/pgstat.c
patching file src/backend/utils/activity/pgstat_relation.c
patching file src/backend/utils/adt/pgstatfuncs.c
patching file src/include/pgstat.h
patching file src/include/utils/pgstat_internal.h
=== applying patch ./v45-0002-pgstat-Infrastructure-to-track-IO-operations.patch
gpatch: **** Only garbage was found in the patch input.
This was an issue with cfbot that Thomas has now fixed as he describes
in [1].

[1]: https://www.postgresql.org/message-id/CA+hUKGLiY1e+1=pB7hXJOyGj1dJOfgde+HmiSnv3gDKayUFJMA@mail.gmail.com
On Wed, Jan 11, 2023 at 4:58 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
Subject: [PATCH v45 4/5] Add system view tracking IO ops per backend type
The patch can/will fail with:
CREATE TABLESPACE test_io_shared_stats_tblspc LOCATION '';
+WARNING: tablespaces created by regression test cases should have names starting with "regress_"
CREATE TABLESPACE test_stats LOCATION '';
+WARNING: tablespaces created by regression test cases should have names starting with "regress_"

(I already sent patches to address the omission in cirrus.yml)
Thanks. I've fixed this.
I make a tablespace in amcheck -- are there recommendations for naming
tablespaces in contrib also?
1760 : errhint("Target must be \"archiver\", \"io\", \"bgwriter\", \"recovery_prefetch\", or \"wal\".")));
=> Do you want to put these in order?
Thanks. Fixed.
pgstat_get_io_op_name() isn't currently being hit by tests; actually,
it's completely unused.
Deleted it.
FlushRelationBuffers() isn't being hit for local buffers.
I added a test.
+ <entry><structname>pg_stat_io</structname><indexterm><primary>pg_stat_io</primary></indexterm></entry> + <entry> + One row per backend type, context, target object combination showing + cluster-wide I/O statistics.

I suggest: "One row for each combination of .."

I have made this change.
+ The <structname>pg_stat_io</structname> and + <structname>pg_statio_</structname> set of views are especially useful for + determining the effectiveness of the buffer cache. When the number of actual + disk reads is much smaller than the number of buffer hits, then the cache is + satisfying most read requests without invoking a kernel call.

I would change this to say "Postgres' own buffer cache is satisfying ..."
So, this is existing copy to which I added the pg_stat_io view name and
re-flowed the indentation.
However, I think your suggestions are a good idea, so I've taken them
and just rewritten this paragraph altogether.
However, these + statistics do not give the entire story: due to the way in which + <productname>PostgreSQL</productname> handles disk I/O, data that is not in + the <productname>PostgreSQL</productname> buffer cache might still reside in + the kernel's I/O cache, and might therefore still be fetched without

I suggest referring to "the kernel's page cache".

Same applies here.
+ The <structname>pg_stat_io</structname> view will contain one row for each + backend type, I/O context, and target I/O object combination showing + cluster-wide I/O statistics. Combinations which do not make sense are + omitted.

"..for each combination of .."
I have changed this.
+ <varname>io_context</varname> for a type of I/O operation. For
"for I/O operations"
So I actually mean for a type of I/O operation -- that is, relation data
is normally written to a shared buffer but sometimes we bypass shared
buffers and just call write and sometimes we use a buffer access
strategy and write it to a special ring buffer (made up of buffers
stolen from shared buffers, but still). So I don't want to say "for I/O
operations" because I think that would imply that writes of relation
data will always be in the same IO Context.
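To make the distinction concrete, a query along these lines (a sketch against the proposed view, not part of the patch; column names taken from the v46 CREATE VIEW, so it uses "writes" rather than the doc patch's "written") would show the same kind of relation write surfacing under different I/O contexts:

```sql
-- Writes of permanent relation data by client backends, broken out by
-- I/O context: 'normal' for shared buffers vs. 'bulkread', 'bulkwrite',
-- and 'vacuum' for buffer access strategy ring buffers.
SELECT backend_type, io_context, writes
FROM pg_stat_io
WHERE backend_type = 'client backend'
  AND io_object = 'relation'
ORDER BY io_context;
```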
+ <literal>vacuum</literal>: I/O operations done outside of shared
+ buffers incurred while vacuuming and analyzing permanent relations.

s/incurred/performed/
I changed this.
+ <literal>bulkread</literal>: Qualifying large read I/O operations
+ done outside of shared buffers, for example, a sequential scan of a
+ large table.

I don't think it's correct to say that it's "outside of" shared buffers.
I suppose "outside of" gives the wrong idea. But I need to make clear
that this I/O is to and from buffers which are not a part of shared
buffers right now -- they may still be accessible from the same data
structures which access shared buffers but they are currently being used
in a different way.
s/Qualifying/Certain/
I feel like qualifying is more specific than certain, but I would be open
to changing it if there was a specific reason you don't like it.
+ <literal>bulkwrite</literal>: Qualifying large write I/O operations
+ done outside of shared buffers, such as <command>COPY</command>.

Same
+ Target object of an I/O operation. Possible values are:
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>relation</literal>: This includes permanent relations.

It says "includes permanent" but what it seems to mean is that it is
"exclusive of temporary relations".
I've changed this.
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>read</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of read operations in units of <varname>op_bytes</varname>.

This looks too much like it means "bytes".
Should say: "in number of blocks of size >op_bytes<"

But wait - is it the number of read operations "in units of op_bytes"
(which would mean this is already multiplied by op_bytes, and is in
units of bytes)?

Or is it the "number of read operations" *of* op_bytes chunks? Which would
mean this is a "pure" number, and could be multiplied by op_bytes to
obtain a size in bytes.
It is the number of read operations of op_bytes size -- thanks so much
for pointing this out. The wording was really unclear.
The idea is that you can do something like:
SELECT pg_size_pretty(reads * op_bytes) FROM pg_stat_io;
and get it in bytes.
The view will contain other types of IO that are not in BLCKSZ chunks,
which is where this column will be handy.
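Spelling out that conversion, a fuller sketch against the proposed view might be (column names per the v46 CREATE VIEW, so "reads"/"op_bytes"):

```sql
-- reads counts operations of op_bytes size, so multiplying the two
-- yields a size in bytes; skip rows where reads is not tracked (NULL).
SELECT backend_type, io_context, io_object,
       pg_size_pretty(reads * op_bytes) AS read_bytes
FROM pg_stat_io
WHERE reads IS NOT NULL;
```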
+ Number of write operations in units of <varname>op_bytes</varname>.
+ Number of relation extend operations in units of
+ <varname>op_bytes</varname>.

same
+ In <varname>io_context</varname> <literal>normal</literal>, this counts
+ the number of times a block was evicted from a buffer and replaced with
+ another block. In <varname>io_context</varname>s
+ <literal>bulkwrite</literal>, <literal>bulkread</literal>, and
+ <literal>vacuum</literal>, this counts the number of times a block was
+ evicted from shared buffers in order to add the shared buffer to a
+ separate size-limited ring buffer.

This never defines what "evicted" means. Does it mean that a dirty
buffer was written out?
Thanks. I've updated this.
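As a sketch of how one might compare evictions across contexts with the proposed view (hypothetical query; "evictions" is the column name in the v46 CREATE VIEW, vs. "evicted" in the doc patch):

```sql
-- Evictions in io_context 'normal' hint at shared_buffers pressure;
-- evictions in the strategy contexts merely reflect shared buffers
-- being pulled into size-limited ring buffers.
SELECT io_context, sum(evictions) AS total_evictions
FROM pg_stat_io
GROUP BY io_context
ORDER BY io_context;
```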
+ The number of times an existing buffer in a size-limited ring buffer
+ outside of shared buffers was reused as part of an I/O operation in the
+ <literal>bulkread</literal>, <literal>bulkwrite</literal>, or
+ <literal>vacuum</literal> <varname>io_context</varname>s.

Maybe say "as part of a bulk I/O operation (bulkread, bulkwrite, or
vacuum)."
I've changed this.
+ <para>
+ <structname>pg_stat_io</structname> can be used to inform database tuning.
+ For example:
+ <itemizedlist>
+ <listitem>
+ <para>
+ A high <varname>evicted</varname> count can indicate that shared buffers
+ should be increased.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ Client backends rely on the checkpointer to ensure data is persisted to
+ permanent storage. Large numbers of <varname>files_synced</varname> by
+ <literal>client backend</literal>s could indicate a misconfiguration of
+ shared buffers or of checkpointer. More information on checkpointer

of *the* checkpointer

+ Normally, client backends should be able to rely on auxiliary processes
+ like the checkpointer and background writer to write out dirty data as

*the* bg writer

+ much as possible. Large numbers of writes by client backends could
+ indicate a misconfiguration of shared buffers or of checkpointer. More

*the* ckpointer
I've made most of these changes.
Should this link to various docs for checkpointer/bgwriter?
I couldn't find docs related to tuning checkpointer outside of the WAL
configuration docs. There is the docs page for the CHECKPOINT command --
but I don't think that is very relevant here.
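For what it's worth, a rough check along the lines of that tuning advice could look like this against the proposed view (hypothetical query; v46 CREATE VIEW column names "writes"/"fsyncs"):

```sql
-- Writes and fsyncs performed by client backends themselves -- I/O
-- that the checkpointer and background writer would ideally absorb.
SELECT io_context, writes, fsyncs
FROM pg_stat_io
WHERE backend_type = 'client backend'
  AND io_object = 'relation';
```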
Maybe the docs for ALTER/COPY/VACUUM/CREATE/etc should be updated to
refer to some central description of ring buffers. Maybe something
should be included in the appendix.
I agree it would be nice to explain Buffer Access Strategies in the docs.
- Melanie
Attachments:
v46-0001-pgindent-and-some-manual-cleanup-in-pgstat-relat.patch (text/x-patch)
From 8960d4b3902374a999b3c7e572995d70b2cb0557 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 9 Dec 2022 18:23:19 -0800
Subject: [PATCH v46 1/5] pgindent and some manual cleanup in pgstat related
code
---
src/backend/storage/buffer/bufmgr.c | 22 ++++++++++----------
src/backend/storage/buffer/localbuf.c | 4 ++--
src/backend/utils/activity/pgstat.c | 3 ++-
src/backend/utils/activity/pgstat_relation.c | 1 +
src/backend/utils/adt/pgstatfuncs.c | 2 +-
src/include/pgstat.h | 1 +
src/include/utils/pgstat_internal.h | 1 +
7 files changed, 19 insertions(+), 15 deletions(-)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 3fb38a25cf..8075828e8a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -516,7 +516,7 @@ PrefetchSharedBuffer(SMgrRelation smgr_reln,
/* create a tag so we can lookup the buffer */
InitBufferTag(&newTag, &smgr_reln->smgr_rlocator.locator,
- forkNum, blockNum);
+ forkNum, blockNum);
/* determine its hash code and partition lock ID */
newHash = BufTableHashCode(&newTag);
@@ -3297,8 +3297,8 @@ DropRelationsAllBuffers(SMgrRelation *smgr_reln, int nlocators)
uint32 buf_state;
/*
- * As in DropRelationBuffers, an unlocked precheck should be
- * safe and saves some cycles.
+ * As in DropRelationBuffers, an unlocked precheck should be safe and
+ * saves some cycles.
*/
if (!use_bsearch)
@@ -3425,8 +3425,8 @@ DropDatabaseBuffers(Oid dbid)
uint32 buf_state;
/*
- * As in DropRelationBuffers, an unlocked precheck should be
- * safe and saves some cycles.
+ * As in DropRelationBuffers, an unlocked precheck should be safe and
+ * saves some cycles.
*/
if (bufHdr->tag.dbOid != dbid)
continue;
@@ -3572,8 +3572,8 @@ FlushRelationBuffers(Relation rel)
bufHdr = GetBufferDescriptor(i);
/*
- * As in DropRelationBuffers, an unlocked precheck should be
- * safe and saves some cycles.
+ * As in DropRelationBuffers, an unlocked precheck should be safe and
+ * saves some cycles.
*/
if (!BufTagMatchesRelFileLocator(&bufHdr->tag, &rel->rd_locator))
continue;
@@ -3645,8 +3645,8 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
uint32 buf_state;
/*
- * As in DropRelationBuffers, an unlocked precheck should be
- * safe and saves some cycles.
+ * As in DropRelationBuffers, an unlocked precheck should be safe and
+ * saves some cycles.
*/
if (!use_bsearch)
@@ -3880,8 +3880,8 @@ FlushDatabaseBuffers(Oid dbid)
bufHdr = GetBufferDescriptor(i);
/*
- * As in DropRelationBuffers, an unlocked precheck should be
- * safe and saves some cycles.
+ * As in DropRelationBuffers, an unlocked precheck should be safe and
+ * saves some cycles.
*/
if (bufHdr->tag.dbOid != dbid)
continue;
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index b2720df6ea..8372acc383 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -610,8 +610,8 @@ AtProcExit_LocalBuffers(void)
{
/*
* We shouldn't be holding any remaining pins; if we are, and assertions
- * aren't enabled, we'll fail later in DropRelationBuffers while
- * trying to drop the temp rels.
+ * aren't enabled, we'll fail later in DropRelationBuffers while trying to
+ * drop the temp rels.
*/
CheckForLocalBufferLeaks();
}
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 7e9dc17e68..0fa5370bcd 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -426,7 +426,7 @@ pgstat_discard_stats(void)
ereport(DEBUG2,
(errcode_for_file_access(),
errmsg_internal("unlinked permanent statistics file \"%s\"",
- PGSTAT_STAT_PERMANENT_FILENAME)));
+ PGSTAT_STAT_PERMANENT_FILENAME)));
}
/*
@@ -986,6 +986,7 @@ pgstat_build_snapshot(void)
entry->data = MemoryContextAlloc(pgStatLocal.snapshot.context,
kind_info->shared_size);
+
/*
* Acquire the LWLock directly instead of using
* pg_stat_lock_entry_shared() which requires a reference.
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index 1730425de1..2e20b93c20 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -783,6 +783,7 @@ pgstat_relation_flush_cb(PgStat_EntryRef *entry_ref, bool nowait)
if (lstats->t_counts.t_numscans)
{
TimestampTz t = GetCurrentTransactionStopTimestamp();
+
if (t > tabentry->lastscan)
tabentry->lastscan = t;
}
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 6cddd74aa7..58bd1360b9 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -906,7 +906,7 @@ pg_stat_get_backend_client_addr(PG_FUNCTION_ARGS)
clean_ipv6_addr(beentry->st_clientaddr.addr.ss_family, remote_host);
PG_RETURN_DATUM(DirectFunctionCall1(inet_in,
- CStringGetDatum(remote_host)));
+ CStringGetDatum(remote_host)));
}
Datum
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index d3e965d744..5e3326a3b9 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -476,6 +476,7 @@ extern void pgstat_report_connect(Oid dboid);
extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dboid);
+
/*
* Functions in pgstat_function.c
*/
diff --git a/src/include/utils/pgstat_internal.h b/src/include/utils/pgstat_internal.h
index 08412d6404..12fd51f1ae 100644
--- a/src/include/utils/pgstat_internal.h
+++ b/src/include/utils/pgstat_internal.h
@@ -626,6 +626,7 @@ extern void pgstat_wal_snapshot_cb(void);
extern bool pgstat_subscription_flush_cb(PgStat_EntryRef *entry_ref, bool nowait);
extern void pgstat_subscription_reset_timestamp_cb(PgStatShared_Common *header, TimestampTz ts);
+
/*
* Functions in pgstat_xact.c
*/
--
2.34.1
v46-0005-pg_stat_io-documentation.patch (text/x-patch)
From 6792d761ebf5e9261dab452130dcc2cd5ac48d0e Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 9 Jan 2023 14:42:53 -0500
Subject: [PATCH v46 5/5] pg_stat_io documentation
Author: Melanie Plageman <melanieplageman@gmail.com>
Author: Samay Sharma <smilingsamay@gmail.com>
Reviewed-by: Maciek Sakrejda <m.sakrejda@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
doc/src/sgml/monitoring.sgml | 318 +++++++++++++++++++++++++++++++++--
1 file changed, 304 insertions(+), 14 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 8d51ca3773..12246d46f4 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -469,6 +469,16 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</entry>
</row>
+ <row>
+ <entry><structname>pg_stat_io</structname><indexterm><primary>pg_stat_io</primary></indexterm></entry>
+ <entry>
+ One row for each combination of backend type, context, and target object
+ containing cluster-wide I/O statistics.
+ See <link linkend="monitoring-pg-stat-io-view">
+ <structname>pg_stat_io</structname></link> for details.
+ </entry>
+ </row>
+
<row>
<entry><structname>pg_stat_replication_slots</structname><indexterm><primary>pg_stat_replication_slots</primary></indexterm></entry>
<entry>One row per replication slot, showing statistics about the
@@ -665,20 +675,16 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</para>
<para>
- The <structname>pg_statio_</structname> views are primarily useful to
- determine the effectiveness of the buffer cache. When the number
- of actual disk reads is much smaller than the number of buffer
- hits, then the cache is satisfying most read requests without
- invoking a kernel call. However, these statistics do not give the
- entire story: due to the way in which <productname>PostgreSQL</productname>
- handles disk I/O, data that is not in the
- <productname>PostgreSQL</productname> buffer cache might still reside in the
- kernel's I/O cache, and might therefore still be fetched without
- requiring a physical read. Users interested in obtaining more
- detailed information on <productname>PostgreSQL</productname> I/O behavior are
- advised to use the <productname>PostgreSQL</productname> statistics views
- in combination with operating system utilities that allow insight
- into the kernel's handling of I/O.
+ The <structname>pg_stat_io</structname> and
+ <structname>pg_statio_</structname> set of views are useful for determining
+ the effectiveness of the buffer cache. They can be used to calculate a cache
+ hit ratio. Note that while <productname>PostgreSQL</productname>'s I/O
+ statistics capture most instances in which the kernel was invoked in order
+ to perform I/O, they do not differentiate between data which had to be
+ fetched from disk and that which already resided in the kernel page cache.
+ Users are advised to use the <productname>PostgreSQL</productname>
+ statistics views in combination with operating system utilities for a more
+ complete picture of their database's I/O performance.
</para>
</sect2>
@@ -3643,6 +3649,290 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
<structfield>last_archived_wal</structfield> have also been successfully
archived.
</para>
+ </sect2>
+
+ <sect2 id="monitoring-pg-stat-io-view">
+ <title><structname>pg_stat_io</structname></title>
+
+ <indexterm>
+ <primary>pg_stat_io</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_stat_io</structname> view will contain one row for each
+ combination of backend type, I/O context, and target I/O object showing
+ cluster-wide I/O statistics. Combinations which do not make sense are
+ omitted.
+ </para>
+
+ <para>
+ Currently, I/O on relations (e.g. tables, indexes) is tracked. However,
+ relation I/O which bypasses shared buffers (e.g. when moving a table from one
+ tablespace to another) is currently not tracked.
+ </para>
+
+ <table id="pg-stat-io-view" xreflabel="pg_stat_io">
+ <title><structname>pg_stat_io</structname> View</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para>
+ </entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>backend_type</structfield> <type>text</type>
+ </para>
+ <para>
+ Type of backend (e.g. background worker, autovacuum worker). See <link
+ linkend="monitoring-pg-stat-activity-view">
+ <structname>pg_stat_activity</structname></link> for more information
+ on <varname>backend_type</varname>s. Some
+ <varname>backend_type</varname>s do not accumulate I/O operation
+ statistics and will not be included in the view.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>io_context</structfield> <type>text</type>
+ </para>
+ <para>
+ The context of an I/O operation. Possible values are:
+ </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>normal</literal>: The default or standard
+ <varname>io_context</varname> for a type of I/O operation. For
+ example, by default, relation data is read into and written out from
+ shared buffers. Thus, reads and writes of relation data to and from
+ shared buffers are tracked in <varname>io_context</varname>
+ <literal>normal</literal>.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>vacuum</literal>: I/O operations performed outside of shared
+ buffers while vacuuming and analyzing permanent relations.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>bulkread</literal>: Qualifying large read I/O operations
+ done outside of shared buffers, for example, a sequential scan of a
+ large table.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>bulkwrite</literal>: Qualifying large write I/O operations
+ done outside of shared buffers, such as <command>COPY</command>.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>io_object</structfield> <type>text</type>
+ </para>
+ <para>
+ Target object of an I/O operation. Possible values are:
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>relation</literal>: Permanent relations.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>temp relation</literal>: Temporary relations.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>read</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of read operations of <varname>op_bytes</varname> size.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>written</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of write operations of <varname>op_bytes</varname> size.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>extended</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of relation extend operations of <varname>op_bytes</varname>
+ size.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>op_bytes</structfield> <type>bigint</type>
+ </para>
+ <para>
+ The number of bytes per unit of I/O read, written, or extended.
+ </para>
+ <para>
+ Relation data reads, writes, and extends are done in
+ <varname>block_size</varname> units, derived from the build-time
+ parameter <symbol>BLCKSZ</symbol>, which is <literal>8192</literal> by
+ default.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>evicted</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of times a block has been written out from a shared or local
+ buffer in order to make it available for another use.
+ </para>
+ <para>
+ In <varname>io_context</varname> <literal>normal</literal>, this counts
+ the number of times a block was evicted from a buffer and replaced with
+ another block. In <varname>io_context</varname>s
+ <literal>bulkwrite</literal>, <literal>bulkread</literal>, and
+ <literal>vacuum</literal>, this counts the number of times a block was
+ evicted from shared buffers in order to add the shared buffer to a
+ separate size-limited ring buffer for use in a bulk I/O operation.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>reused</structfield> <type>bigint</type>
+ </para>
+ <para>
+ The number of times an existing buffer in a size-limited ring buffer
+ outside of shared buffers was reused as part of an I/O operation in the
+ <literal>bulkread</literal>, <literal>bulkwrite</literal>, or
+ <literal>vacuum</literal> <varname>io_context</varname>s.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>files_synced</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of <literal>fsync</literal> calls. These are only tracked in
+ <varname>io_context</varname> <literal>normal</literal>.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
+ </para>
+ <para>
+ Time at which these statistics were last reset.
+ </para>
+ </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ Some <varname>backend_type</varname>s never perform I/O operations in some
+ <varname>io_context</varname>s and/or on some <varname>io_object</varname>s.
+ These rows are omitted from the view. For example, the checkpointer does not
+ checkpoint temporary tables, so there will be no rows for
+ <varname>backend_type</varname> <literal>checkpointer</literal> and
+ <varname>io_object</varname> <literal>temp relation</literal>.
+ </para>
+
+ <para>
+ In addition, some I/O operations will never be performed either by certain
+ <varname>backend_type</varname>s or in certain
+ <varname>io_context</varname>s or on certain <varname>io_object</varname>s.
+ These cells will be NULL. For example, temporary tables are not
+ <literal>fsync</literal>ed, so <varname>files_synced</varname> will be NULL
+ for <varname>io_object</varname> <literal>temp relation</literal>. Also, the
+ background writer does not perform reads, so <varname>read</varname> will be
+ NULL in rows for <varname>backend_type</varname> <literal>background
+ writer</literal>.
+ </para>
+
+ <para>
+ <structname>pg_stat_io</structname> can be used to inform database tuning.
+ For example:
+ <itemizedlist>
+ <listitem>
+ <para>
+ A high <varname>evicted</varname> count can indicate that shared buffers
+ should be increased.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ Client backends rely on the checkpointer to ensure data is persisted to
+ permanent storage. Large numbers of <varname>files_synced</varname> by
+ <literal>client backend</literal>s could indicate a misconfiguration of
+ shared buffers or of the checkpointer. More information on configuring
+ checkpointer can be found in <xref linkend="wal-configuration"/>.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ Normally, client backends should be able to rely on auxiliary processes
+ like the checkpointer and the background writer to write out dirty data
+ as much as possible. Large numbers of writes by client backends could
+ indicate a misconfiguration of shared buffers or of the checkpointer.
+ More information on configuring the checkpointer can be found in <xref
+ linkend="wal-configuration"/>.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </para>
+
</sect2>
--
2.34.1
v46-0004-Add-system-view-tracking-IO-ops-per-backend-type.patch (text/x-patch)
From 57c9a605e59175ad4ac8857e40ff9cddc7909812 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 9 Jan 2023 14:42:25 -0500
Subject: [PATCH v46 4/5] Add system view tracking IO ops per backend type
Add pg_stat_io, a system view which tracks the number of IOOps
(evictions, reuses, reads, writes, extensions, and fsyncs) done on each
IOObject (relation, temp relation) in each IOContext ("normal" and those
using a BufferAccessStrategy) by each type of backend (e.g. client
backend, checkpointer).
Some BackendTypes do not accumulate IO operations statistics and will
not be included in the view.
Some IOContexts are not used by some BackendTypes and will not be in the
view. For example, checkpointer does not use a BufferAccessStrategy
(currently), so there will be no rows for BufferAccessStrategy
IOContexts for checkpointer.
Some IOObjects are never operated on in some IOContexts or by some
BackendTypes. These rows are omitted from the view. For example,
checkpointer will never operate on IOOBJECT_TEMP_RELATION data, so those
rows are omitted.
Some IOOps are invalid in combination with certain IOContexts and
certain IOObjects. Those cells will be NULL in the view to distinguish
between 0 observed IOOps of that type and an invalid combination. For
example, temporary tables are not fsynced so cells for all BackendTypes
for IOOBJECT_TEMP_RELATION and IOOP_FSYNC will be NULL.
Some BackendTypes never perform certain IOOps. Those cells will also be
NULL in the view. For example, bgwriter should not perform reads.
View stats are populated with statistics incremented when a backend
performs an IO Operation and maintained by the cumulative statistics
subsystem.
Each row of the view shows stats for a particular BackendType, IOObject,
IOContext combination (e.g. a client backend's operations on permanent
relations in shared buffers) and each column in the view is the total
number of IO Operations done (e.g. writes). So a cell in the view would
be, for example, the number of blocks of relation data written from
shared buffers by client backends since the last stats reset.
In anticipation of tracking WAL IO and non-block-oriented IO (such as
temporary file IO), the "op_bytes" column specifies the unit of the
"reads", "writes", and "extends" columns for a given row.
Note that some of the cells in the view are redundant with fields in
pg_stat_bgwriter (e.g. buffers_backend), however these have been kept in
pg_stat_bgwriter for backwards compatibility. Deriving the redundant
pg_stat_bgwriter stats from the IO operations stats structures was also
problematic due to the separate reset targets for 'bgwriter' and 'io'.
Suggested by Andres Freund
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
contrib/amcheck/expected/check_heap.out | 31 +++
contrib/amcheck/sql/check_heap.sql | 24 +++
src/backend/catalog/system_views.sql | 15 ++
src/backend/utils/adt/pgstatfuncs.c | 154 +++++++++++++++
src/include/catalog/pg_proc.dat | 9 +
src/test/regress/expected/rules.out | 12 ++
src/test/regress/expected/stats.out | 246 ++++++++++++++++++++++++
src/test/regress/sql/stats.sql | 150 +++++++++++++++
src/tools/pgindent/typedefs.list | 1 +
9 files changed, 642 insertions(+)
diff --git a/contrib/amcheck/expected/check_heap.out b/contrib/amcheck/expected/check_heap.out
index c010361025..2cf63302d5 100644
--- a/contrib/amcheck/expected/check_heap.out
+++ b/contrib/amcheck/expected/check_heap.out
@@ -66,6 +66,19 @@ SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'ALL-VISIBLE');
INSERT INTO heaptest (a, b)
(SELECT gs, repeat('x', gs)
FROM generate_series(1,50) gs);
+-- pg_stat_io test:
+-- verify_heapam always uses a BAS_BULKREAD BufferAccessStrategy. This allows
+-- us to reliably test that pg_stat_io BULKREAD reads are being captured
+-- without relying on the size of shared buffers or on an expensive operation
+-- like CREATE DATABASE.
+--
+-- Create an alternative tablespace and move the heaptest table to it, causing
+-- it to be rewritten.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE test_stats LOCATION '';
+SELECT sum(reads) AS stats_bulkreads_before
+ FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+ALTER TABLE heaptest SET TABLESPACE test_stats;
-- Check that valid options are not rejected nor corruption reported
-- for a non-empty table
SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'none');
@@ -88,6 +101,23 @@ SELECT * FROM verify_heapam(relation := 'heaptest', startblock := 0, endblock :=
-------+--------+--------+-----
(0 rows)
+-- verify_heapam should have read in the page written out by
+-- ALTER TABLE ... SET TABLESPACE ...
+-- causing an additional bulkread, which should be reflected in pg_stat_io.
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(reads) AS stats_bulkreads_after
+ FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+SELECT :stats_bulkreads_after > :stats_bulkreads_before;
+ ?column?
+----------
+ t
+(1 row)
+
CREATE ROLE regress_heaptest_role;
-- verify permissions are checked (error due to function not callable)
SET ROLE regress_heaptest_role;
@@ -195,6 +225,7 @@ ERROR: cannot check relation "test_foreign_table"
DETAIL: This operation is not supported for foreign tables.
-- cleanup
DROP TABLE heaptest;
+DROP TABLESPACE test_stats;
DROP TABLE test_partition;
DROP TABLE test_partitioned;
DROP OWNED BY regress_heaptest_role; -- permissions
diff --git a/contrib/amcheck/sql/check_heap.sql b/contrib/amcheck/sql/check_heap.sql
index 298de6886a..a8e9c3c1e6 100644
--- a/contrib/amcheck/sql/check_heap.sql
+++ b/contrib/amcheck/sql/check_heap.sql
@@ -20,11 +20,26 @@ SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'NONE');
SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'ALL-FROZEN');
SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'ALL-VISIBLE');
+
-- Add some data so subsequent tests are not entirely trivial
INSERT INTO heaptest (a, b)
(SELECT gs, repeat('x', gs)
FROM generate_series(1,50) gs);
+-- pg_stat_io test:
+-- verify_heapam always uses a BAS_BULKREAD BufferAccessStrategy. This allows
+-- us to reliably test that pg_stat_io BULKREAD reads are being captured
+-- without relying on the size of shared buffers or on an expensive operation
+-- like CREATE DATABASE.
+--
+-- Create an alternative tablespace and move the heaptest table to it, causing
+-- it to be rewritten.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE test_stats LOCATION '';
+SELECT sum(reads) AS stats_bulkreads_before
+ FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+ALTER TABLE heaptest SET TABLESPACE test_stats;
+
-- Check that valid options are not rejected nor corruption reported
-- for a non-empty table
SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'none');
@@ -32,6 +47,14 @@ SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'all-frozen');
SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'all-visible');
SELECT * FROM verify_heapam(relation := 'heaptest', startblock := 0, endblock := 0);
+-- verify_heapam should have read in the page written out by
+-- ALTER TABLE ... SET TABLESPACE ...
+-- causing an additional bulkread, which should be reflected in pg_stat_io.
+SELECT pg_stat_force_next_flush();
+SELECT sum(reads) AS stats_bulkreads_after
+ FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+SELECT :stats_bulkreads_after > :stats_bulkreads_before;
+
CREATE ROLE regress_heaptest_role;
-- verify permissions are checked (error due to function not callable)
@@ -110,6 +133,7 @@ SELECT * FROM verify_heapam('test_foreign_table',
-- cleanup
DROP TABLE heaptest;
+DROP TABLESPACE test_stats;
DROP TABLE test_partition;
DROP TABLE test_partitioned;
DROP OWNED BY regress_heaptest_role; -- permissions
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 447c9b970f..494a8791a5 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1117,6 +1117,21 @@ CREATE VIEW pg_stat_bgwriter AS
pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+CREATE VIEW pg_stat_io AS
+SELECT
+ b.backend_type,
+ b.io_context,
+ b.io_object,
+ b.reads,
+ b.writes,
+ b.extends,
+ b.op_bytes,
+ b.evictions,
+ b.reuses,
+ b.fsyncs,
+ b.stats_reset
+FROM pg_stat_get_io() b;
+
CREATE VIEW pg_stat_wal AS
SELECT
w.wal_records,
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 5c8e8336bf..51b941d4c8 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1234,6 +1234,160 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
}
+/*
+ * When adding a new column to the pg_stat_io view, add a new enum value
+ * here above IO_NUM_COLUMNS.
+ */
+typedef enum io_stat_col
+{
+ IO_COL_BACKEND_TYPE,
+ IO_COL_IO_CONTEXT,
+ IO_COL_IO_OBJECT,
+ IO_COL_READS,
+ IO_COL_WRITES,
+ IO_COL_EXTENDS,
+ IO_COL_CONVERSION,
+ IO_COL_EVICTIONS,
+ IO_COL_REUSES,
+ IO_COL_FSYNCS,
+ IO_COL_RESET_TIME,
+ IO_NUM_COLUMNS,
+} io_stat_col;
+
+/*
+ * When adding a new IOOp, add a new io_stat_col and add a case to this
+ * function returning the corresponding io_stat_col.
+ */
+static io_stat_col
+pgstat_get_io_op_index(IOOp io_op)
+{
+ switch (io_op)
+ {
+ case IOOP_EVICT:
+ return IO_COL_EVICTIONS;
+ case IOOP_READ:
+ return IO_COL_READS;
+ case IOOP_REUSE:
+ return IO_COL_REUSES;
+ case IOOP_WRITE:
+ return IO_COL_WRITES;
+ case IOOP_EXTEND:
+ return IO_COL_EXTENDS;
+ case IOOP_FSYNC:
+ return IO_COL_FSYNCS;
+ }
+
+ elog(ERROR, "unrecognized IOOp value: %d", io_op);
+ pg_unreachable();
+}
+
+#ifdef USE_ASSERT_CHECKING
+static bool
+pgstat_iszero_io_object(const PgStat_Counter *obj)
+{
+ for (IOOp io_op = IOOP_EVICT; io_op < IOOP_NUM_TYPES; io_op++)
+ {
+ if (obj[io_op] != 0)
+ return false;
+ }
+
+ return true;
+}
+#endif
+
+Datum
+pg_stat_get_io(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsinfo;
+ PgStat_IO *backends_io_stats;
+ Datum reset_time;
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+
+ backends_io_stats = pgstat_fetch_stat_io();
+
+ reset_time = TimestampTzGetDatum(backends_io_stats->stat_reset_timestamp);
+
+ for (BackendType bktype = B_INVALID; bktype < BACKEND_NUM_TYPES; bktype++)
+ {
+ bool bktype_tracked;
+ Datum bktype_desc = CStringGetTextDatum(GetBackendTypeDesc(bktype));
+ PgStat_BackendIO *bktype_stats = &backends_io_stats->stats[bktype];
+
+ /*
+ * For those BackendTypes without IO Operation stats, skip
+ * representing them in the view altogether. We still loop through
+ * their counters so that we can assert that all values are zero.
+ */
+ bktype_tracked = pgstat_tracks_io_bktype(bktype);
+
+ for (IOContext io_context = IOCONTEXT_BULKREAD;
+ io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ {
+ const char *context_name = pgstat_get_io_context_name(io_context);
+
+ for (IOObject io_obj = IOOBJECT_RELATION;
+ io_obj < IOOBJECT_NUM_TYPES; io_obj++)
+ {
+ const char *obj_name = pgstat_get_io_object_name(io_obj);
+
+ Datum values[IO_NUM_COLUMNS] = {0};
+ bool nulls[IO_NUM_COLUMNS] = {0};
+
+ /*
+ * Some combinations of IOContext, IOObject, and BackendType
+ * are not valid for any type of IOOp. In such cases, omit the
+ * entire row from the view.
+ */
+ if (!bktype_tracked ||
+ !pgstat_tracks_io_object(bktype, io_context, io_obj))
+ {
+ Assert(pgstat_iszero_io_object(bktype_stats->data[io_context][io_obj]));
+ continue;
+ }
+
+ values[IO_COL_BACKEND_TYPE] = bktype_desc;
+ values[IO_COL_IO_CONTEXT] = CStringGetTextDatum(context_name);
+ values[IO_COL_IO_OBJECT] = CStringGetTextDatum(obj_name);
+ values[IO_COL_RESET_TIME] = TimestampTzGetDatum(reset_time);
+
+ /*
+ * Hard-code this to the value of BLCKSZ for now. Future
+ * values could include XLOG_BLCKSZ, once WAL IO is tracked,
+ * and constant multipliers, once non-block-oriented IO (e.g.
+ * temporary file IO) is tracked.
+ */
+ values[IO_COL_CONVERSION] = Int64GetDatum(BLCKSZ);
+
+ /*
+ * Some combinations of BackendType and IOOp, of IOContext and
+ * IOOp, and of IOObject and IOOp are not tracked. Set these
+ * cells in the view to NULL and assert that these stats are zero
+ * as expected.
+ */
+ for (IOOp io_op = IOOP_EVICT; io_op < IOOP_NUM_TYPES; io_op++)
+ {
+ int col_idx = pgstat_get_io_op_index(io_op);
+
+ nulls[col_idx] = !pgstat_tracks_io_op(bktype, io_context, io_obj, io_op);
+
+ if (!nulls[col_idx])
+ values[col_idx] =
+ Int64GetDatum(bktype_stats->data[io_context][io_obj][io_op]);
+ else
+ Assert(bktype_stats->data[io_context][io_obj][io_op] == 0);
+ }
+
+ tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
+ values, nulls);
+ }
+ }
+ }
+
+ return (Datum) 0;
+}
+
/*
* Returns statistics of WAL activity
*/
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 3810de7b22..57a889cf49 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5690,6 +5690,15 @@
proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
+{ oid => '8459', descr => 'statistics: per backend type IO statistics',
+ proname => 'pg_stat_get_io', provolatile => 'v',
+ prorows => '30', proretset => 't',
+ proparallel => 'r', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,text,text,int8,int8,int8,int8,int8,int8,int8,timestamptz}',
+ proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}',
+ proargnames => '{backend_type,io_context,io_object,reads,writes,extends,op_bytes,evictions,reuses,fsyncs,stats_reset}',
+ prosrc => 'pg_stat_get_io' },
+
{ oid => '1136', descr => 'statistics: information about WAL activity',
proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
proparallel => 'r', prorettype => 'record', proargtypes => '',
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index fb9f936d43..6ae7882864 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1876,6 +1876,18 @@ pg_stat_gssapi| SELECT s.pid,
s.gss_enc AS encrypted
FROM pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid, query_id)
WHERE (s.client_port IS NOT NULL);
+pg_stat_io| SELECT b.backend_type,
+ b.io_context,
+ b.io_object,
+ b.reads,
+ b.writes,
+ b.extends,
+ b.op_bytes,
+ b.evictions,
+ b.reuses,
+ b.fsyncs,
+ b.stats_reset
+ FROM pg_stat_get_io() b(backend_type, io_context, io_object, reads, writes, extends, op_bytes, evictions, reuses, fsyncs, stats_reset);
pg_stat_progress_analyze| SELECT s.pid,
s.datid,
d.datname,
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index 1d84407a03..a66fe86b05 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -1126,4 +1126,250 @@ SELECT pg_stat_get_subscription_stats(NULL);
(1 row)
+-- Test that the following operations are tracked in pg_stat_io:
+-- - reads of target blocks into shared buffers
+-- - writes of shared buffers to permanent storage
+-- - extends of relations using shared buffers
+-- - fsyncs done to ensure the durability of data dirtying shared buffers
+-- There is no test for blocks evicted from shared buffers, because we cannot
+-- be sure of the state of shared buffers at the point the test is run.
+-- Create a regular table and insert some data to generate IOCONTEXT_NORMAL
+-- extends.
+SELECT sum(extends) AS io_sum_shared_extends_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+CREATE TABLE test_io_shared(a int);
+INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(extends) AS io_sum_shared_extends_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT :io_sum_shared_extends_after > :io_sum_shared_extends_before;
+ ?column?
+----------
+ t
+(1 row)
+
+-- After a checkpoint, there should be some additional IOCONTEXT_NORMAL writes
+-- and fsyncs.
+-- The second checkpoint ensures that stats from the first checkpoint have been
+-- reported and protects against any potential races amongst the table
+-- creation, a possible timing-triggered checkpoint, and the explicit
+-- checkpoint in the test.
+SELECT sum(writes) AS io_sum_shared_writes_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT sum(fsyncs) AS io_sum_shared_fsyncs_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+CHECKPOINT;
+CHECKPOINT;
+SELECT sum(writes) AS io_sum_shared_writes_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT sum(fsyncs) AS io_sum_shared_fsyncs_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT :io_sum_shared_writes_after > :io_sum_shared_writes_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT current_setting('fsync') = 'off' OR :io_sum_shared_fsyncs_after > :io_sum_shared_fsyncs_before;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Change the tablespace so that the table is rewritten directly, then SELECT
+-- from it to cause it to be read back into shared buffers.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE regress_io_stats_tblspc LOCATION '';
+SELECT sum(reads) AS io_sum_shared_reads_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+ALTER TABLE test_io_shared SET TABLESPACE regress_io_stats_tblspc;
+-- SELECT from the table so that the data is read into shared buffers and
+-- io_context 'normal', io_object 'relation' reads are counted.
+SELECT COUNT(*) FROM test_io_shared;
+ count
+-------
+ 100
+(1 row)
+
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(reads) AS io_sum_shared_reads_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT :io_sum_shared_reads_after > :io_sum_shared_reads_before;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Drop the table so we can drop the tablespace later.
+DROP TABLE test_io_shared;
+-- Test that the following IOCONTEXT_LOCAL IOOps are tracked in pg_stat_io:
+-- - eviction of local buffers in order to reuse them
+-- - reads of temporary table blocks into local buffers
+-- - writes of local buffers to permanent storage
+-- - extends of temporary tables
+-- Set temp_buffers to a low value so that we can trigger writes with fewer
+-- inserted tuples. Do so in a new session in case temporary tables have been
+-- accessed by previous tests in this session.
+\c
+SET temp_buffers TO '1MB';
+CREATE TEMPORARY TABLE test_io_local(a int, b TEXT);
+SELECT sum(extends) AS io_sum_local_extends_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(evictions) AS io_sum_local_evictions_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(writes) AS io_sum_local_writes_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+-- Insert tuples into the temporary table, generating extends in the stats.
+-- Insert enough values that we need to reuse and write out dirty local
+-- buffers, generating evictions and writes.
+INSERT INTO test_io_local SELECT generate_series(1, 8000) as id, repeat('a', 100);
+SELECT sum(reads) AS io_sum_local_reads_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+-- Read in evicted buffers, generating reads.
+SELECT COUNT(*) FROM test_io_local;
+ count
+-------
+ 8000
+(1 row)
+
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(evictions) AS io_sum_local_evictions_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(reads) AS io_sum_local_reads_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(writes) AS io_sum_local_writes_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(extends) AS io_sum_local_extends_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT :io_sum_local_evictions_after > :io_sum_local_evictions_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_local_reads_after > :io_sum_local_reads_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_local_writes_after > :io_sum_local_writes_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_local_extends_after > :io_sum_local_extends_before;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Change the tablespace so that the temporary table is rewritten to other
+-- local buffers, exercising a different codepath than standard local buffer
+-- writes.
+ALTER TABLE test_io_local SET TABLESPACE regress_io_stats_tblspc;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(writes) AS io_sum_local_writes_new_tblspc
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT :io_sum_local_writes_new_tblspc > :io_sum_local_writes_after;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Drop the table so we can drop the tablespace later.
+DROP TABLE test_io_local;
+RESET temp_buffers;
+DROP TABLESPACE regress_io_stats_tblspc;
+-- Test that reuse of strategy buffers and reads of blocks into these reused
+-- buffers while VACUUMing are tracked in pg_stat_io.
+-- Set wal_skip_threshold smaller than the expected size of
+-- test_io_vac_strategy so that, even if wal_level is minimal, VACUUM FULL will
+-- fsync the newly rewritten test_io_vac_strategy instead of writing it to WAL.
+-- Writing it to WAL will result in the newly written relation pages being in
+-- shared buffers -- preventing us from testing BAS_VACUUM BufferAccessStrategy
+-- reads.
+SET wal_skip_threshold = '1 kB';
+SELECT sum(reuses) AS io_sum_vac_strategy_reuses_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(reads) AS io_sum_vac_strategy_reads_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+CREATE TABLE test_io_vac_strategy(a int, b int) WITH (autovacuum_enabled = 'false');
+INSERT INTO test_io_vac_strategy SELECT i, i from generate_series(1, 8000)i;
+-- Ensure that the next VACUUM will need to perform IO by rewriting the table
+-- first with VACUUM (FULL).
+VACUUM (FULL) test_io_vac_strategy;
+VACUUM (PARALLEL 0) test_io_vac_strategy;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(reuses) AS io_sum_vac_strategy_reuses_after FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(reads) AS io_sum_vac_strategy_reads_after FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT :io_sum_vac_strategy_reads_after > :io_sum_vac_strategy_reads_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_vac_strategy_reuses_after > :io_sum_vac_strategy_reuses_before;
+ ?column?
+----------
+ t
+(1 row)
+
+RESET wal_skip_threshold;
+-- Test that extends done by a CTAS, which uses a BAS_BULKWRITE
+-- BufferAccessStrategy, are tracked in pg_stat_io.
+SELECT sum(extends) AS io_sum_bulkwrite_strategy_extends_before FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+CREATE TABLE test_io_bulkwrite_strategy AS SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(extends) AS io_sum_bulkwrite_strategy_extends_after FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+SELECT :io_sum_bulkwrite_strategy_extends_after > :io_sum_bulkwrite_strategy_extends_before;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Test IO stats reset
+SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) AS io_stats_pre_reset FROM pg_stat_io \gset
+SELECT pg_stat_reset_shared('io');
+ pg_stat_reset_shared
+----------------------
+
+(1 row)
+
+SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) AS io_stats_post_reset FROM pg_stat_io \gset
+SELECT :io_stats_post_reset < :io_stats_pre_reset;
+ ?column?
+----------
+ t
+(1 row)
+
-- End of Stats Test
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index b4d6753c71..64b3da2765 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -536,4 +536,154 @@ SELECT pg_stat_get_replication_slot(NULL);
SELECT pg_stat_get_subscription_stats(NULL);
+-- Test that the following operations are tracked in pg_stat_io:
+-- - reads of target blocks into shared buffers
+-- - writes of shared buffers to permanent storage
+-- - extends of relations using shared buffers
+-- - fsyncs done to ensure the durability of data dirtying shared buffers
+
+-- There is no test for blocks evicted from shared buffers, because we cannot
+-- be sure of the state of shared buffers at the point the test is run.
+
+-- Create a regular table and insert some data to generate IOCONTEXT_NORMAL
+-- extends.
+SELECT sum(extends) AS io_sum_shared_extends_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+CREATE TABLE test_io_shared(a int);
+INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+SELECT sum(extends) AS io_sum_shared_extends_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT :io_sum_shared_extends_after > :io_sum_shared_extends_before;
+
+-- After a checkpoint, there should be some additional IOCONTEXT_NORMAL writes
+-- and fsyncs.
+-- The second checkpoint ensures that stats from the first checkpoint have been
+-- reported and protects against any potential races amongst the table
+-- creation, a possible timing-triggered checkpoint, and the explicit
+-- checkpoint in the test.
+SELECT sum(writes) AS io_sum_shared_writes_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT sum(fsyncs) AS io_sum_shared_fsyncs_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+CHECKPOINT;
+CHECKPOINT;
+SELECT sum(writes) AS io_sum_shared_writes_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT sum(fsyncs) AS io_sum_shared_fsyncs_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+
+SELECT :io_sum_shared_writes_after > :io_sum_shared_writes_before;
+SELECT current_setting('fsync') = 'off' OR :io_sum_shared_fsyncs_after > :io_sum_shared_fsyncs_before;
+
+-- Change the tablespace so that the table is rewritten directly, then SELECT
+-- from it to cause it to be read back into shared buffers.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE regress_io_stats_tblspc LOCATION '';
+SELECT sum(reads) AS io_sum_shared_reads_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+ALTER TABLE test_io_shared SET TABLESPACE regress_io_stats_tblspc;
+-- SELECT from the table so that the data is read into shared buffers and
+-- io_context 'normal', io_object 'relation' reads are counted.
+SELECT COUNT(*) FROM test_io_shared;
+SELECT pg_stat_force_next_flush();
+SELECT sum(reads) AS io_sum_shared_reads_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT :io_sum_shared_reads_after > :io_sum_shared_reads_before;
+-- Drop the table so we can drop the tablespace later.
+DROP TABLE test_io_shared;
+
+-- Test that the following IOCONTEXT_LOCAL IOOps are tracked in pg_stat_io:
+-- - eviction of local buffers in order to reuse them
+-- - reads of temporary table blocks into local buffers
+-- - writes of local buffers to permanent storage
+-- - extends of temporary tables
+
+-- Set temp_buffers to a low value so that we can trigger writes with fewer
+-- inserted tuples. Do so in a new session in case temporary tables have been
+-- accessed by previous tests in this session.
+\c
+SET temp_buffers TO '1MB';
+CREATE TEMPORARY TABLE test_io_local(a int, b TEXT);
+SELECT sum(extends) AS io_sum_local_extends_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(evictions) AS io_sum_local_evictions_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(writes) AS io_sum_local_writes_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+-- Insert tuples into the temporary table, generating extends in the stats.
+-- Insert enough values that we need to reuse and write out dirty local
+-- buffers, generating evictions and writes.
+INSERT INTO test_io_local SELECT generate_series(1, 8000) as id, repeat('a', 100);
+
+SELECT sum(reads) AS io_sum_local_reads_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+-- Read in evicted buffers, generating reads.
+SELECT COUNT(*) FROM test_io_local;
+SELECT pg_stat_force_next_flush();
+SELECT sum(evictions) AS io_sum_local_evictions_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(reads) AS io_sum_local_reads_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(writes) AS io_sum_local_writes_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(extends) AS io_sum_local_extends_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT :io_sum_local_evictions_after > :io_sum_local_evictions_before;
+SELECT :io_sum_local_reads_after > :io_sum_local_reads_before;
+SELECT :io_sum_local_writes_after > :io_sum_local_writes_before;
+SELECT :io_sum_local_extends_after > :io_sum_local_extends_before;
+
+-- Change the tablespace so that the temporary table is rewritten to other
+-- local buffers, exercising a different codepath than standard local buffer
+-- writes.
+ALTER TABLE test_io_local SET TABLESPACE regress_io_stats_tblspc;
+SELECT pg_stat_force_next_flush();
+SELECT sum(writes) AS io_sum_local_writes_new_tblspc
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT :io_sum_local_writes_new_tblspc > :io_sum_local_writes_after;
+-- Drop the table so we can drop the tablespace later.
+DROP TABLE test_io_local;
+RESET temp_buffers;
+DROP TABLESPACE regress_io_stats_tblspc;
+
+-- Test that reuse of strategy buffers and reads of blocks into these reused
+-- buffers while VACUUMing are tracked in pg_stat_io.
+
+-- Set wal_skip_threshold smaller than the expected size of
+-- test_io_vac_strategy so that, even if wal_level is minimal, VACUUM FULL will
+-- fsync the newly rewritten test_io_vac_strategy instead of writing it to WAL.
+-- Writing it to WAL will result in the newly written relation pages being in
+-- shared buffers -- preventing us from testing BAS_VACUUM BufferAccessStrategy
+-- reads.
+SET wal_skip_threshold = '1 kB';
+SELECT sum(reuses) AS io_sum_vac_strategy_reuses_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(reads) AS io_sum_vac_strategy_reads_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+CREATE TABLE test_io_vac_strategy(a int, b int) WITH (autovacuum_enabled = 'false');
+INSERT INTO test_io_vac_strategy SELECT i, i from generate_series(1, 8000)i;
+-- Ensure that the next VACUUM will need to perform IO by rewriting the table
+-- first with VACUUM (FULL).
+VACUUM (FULL) test_io_vac_strategy;
+VACUUM (PARALLEL 0) test_io_vac_strategy;
+SELECT pg_stat_force_next_flush();
+SELECT sum(reuses) AS io_sum_vac_strategy_reuses_after FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(reads) AS io_sum_vac_strategy_reads_after FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT :io_sum_vac_strategy_reads_after > :io_sum_vac_strategy_reads_before;
+SELECT :io_sum_vac_strategy_reuses_after > :io_sum_vac_strategy_reuses_before;
+RESET wal_skip_threshold;
+
+-- Test that extends done by a CTAS, which uses a BAS_BULKWRITE
+-- BufferAccessStrategy, are tracked in pg_stat_io.
+SELECT sum(extends) AS io_sum_bulkwrite_strategy_extends_before FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+CREATE TABLE test_io_bulkwrite_strategy AS SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+SELECT sum(extends) AS io_sum_bulkwrite_strategy_extends_after FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+SELECT :io_sum_bulkwrite_strategy_extends_after > :io_sum_bulkwrite_strategy_extends_before;
+
+-- Test IO stats reset
+SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) AS io_stats_pre_reset FROM pg_stat_io \gset
+SELECT pg_stat_reset_shared('io');
+SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) AS io_stats_post_reset FROM pg_stat_io \gset
+SELECT :io_stats_post_reset < :io_stats_pre_reset;
+
-- End of Stats Test
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 7b66b1bc89..c4ecef2bf8 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3377,6 +3377,7 @@ intset_internal_node
intset_leaf_node
intset_node
intvKEY
+io_stat_col
itemIdCompact
itemIdCompactData
iterator
--
2.34.1
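For anyone trying out these patches, a query along the following lines summarizes the new counters per backend type. This is only a sketch for illustration: the view, column, and io_context/io_object names are taken from the patch above, and the actual numbers of course depend on workload.

```sql
-- Summarize block IO per backend type from the pg_stat_io view added above,
-- restricted to ordinary relation IO ('normal' context, 'relation' object),
-- the same filter the regression tests use.
SELECT backend_type,
       sum(reads)   AS reads,
       sum(writes)  AS writes,
       sum(extends) AS extends,
       sum(fsyncs)  AS fsyncs
  FROM pg_stat_io
 WHERE io_context = 'normal'
   AND io_object = 'relation'
 GROUP BY backend_type
 ORDER BY backend_type;
```

Note that combinations of backend type, io_context, and io_object that are never valid are omitted from the view entirely (per pgstat_tracks_io_object()), so the aggregation only ever sees tracked rows.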
Attachment: v46-0003-pgstat-Count-IO-for-relations.patch (text/x-patch)
From 89ab692edfcf2af9e81fe6a45c3cad63158c8e20 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 4 Jan 2023 17:20:50 -0500
Subject: [PATCH v46 3/5] pgstat: Count IO for relations
Count IOOps done on IOObjects in IOContexts by various BackendTypes
using the IO stats infrastructure introduced by a previous commit.
The primary concern of these statistics is IO operations on data blocks
during the course of normal database operations. IO operations done by,
for example, the archiver or syslogger are not counted in these
statistics. WAL IO, temporary file IO, and IO done directly though smgr*
functions (such as when building an index) are not yet counted but would
be useful future additions.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
src/backend/storage/buffer/bufmgr.c | 109 ++++++++++++++++++++++----
src/backend/storage/buffer/freelist.c | 58 ++++++++++----
src/backend/storage/buffer/localbuf.c | 13 ++-
src/backend/storage/smgr/md.c | 25 ++++++
src/include/storage/buf_internals.h | 8 +-
src/include/storage/bufmgr.h | 7 +-
6 files changed, 184 insertions(+), 36 deletions(-)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 8075828e8a..47d2b4c522 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -481,8 +481,9 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
ForkNumber forkNum,
BlockNumber blockNum,
BufferAccessStrategy strategy,
- bool *foundPtr);
-static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+ bool *foundPtr, IOContext *io_context);
+static void FlushBuffer(BufferDesc *buf, SMgrRelation reln,
+ IOContext io_context, IOObject io_object);
static void FindAndDropRelationBuffers(RelFileLocator rlocator,
ForkNumber forkNum,
BlockNumber nForkBlock,
@@ -823,6 +824,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
BufferDesc *bufHdr;
Block bufBlock;
bool found;
+ IOContext io_context;
+ IOObject io_object;
bool isExtend;
bool isLocalBuf = SmgrIsTemp(smgr);
@@ -855,7 +858,14 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (isLocalBuf)
{
- bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
+ /*
+ * LocalBufferAlloc() will set the io_context to IOCONTEXT_NORMAL. We
+ * do not use a BufferAccessStrategy for I/O of temporary tables.
+ * However, in some cases, the "strategy" may not be NULL, so we can't
+ * rely on IOContextForStrategy() to set the right IOContext for us.
+ * This may happen in cases like CREATE TEMPORARY TABLE AS...
+ */
+ bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found, &io_context);
if (found)
pgBufferUsage.local_blks_hit++;
else if (isExtend)
@@ -871,7 +881,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* not currently in memory.
*/
bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum,
- strategy, &found);
+ strategy, &found, &io_context);
if (found)
pgBufferUsage.shared_blks_hit++;
else if (isExtend)
@@ -986,7 +996,16 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
*/
Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID)); /* spinlock not needed */
- bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
+ if (isLocalBuf)
+ {
+ bufBlock = LocalBufHdrGetBlock(bufHdr);
+ io_object = IOOBJECT_TEMP_RELATION;
+ }
+ else
+ {
+ bufBlock = BufHdrGetBlock(bufHdr);
+ io_object = IOOBJECT_RELATION;
+ }
if (isExtend)
{
@@ -995,6 +1014,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
/* don't set checksum for all-zero page */
smgrextend(smgr, forkNum, blockNum, (char *) bufBlock, false);
+ pgstat_count_io_op(IOOP_EXTEND, io_object, io_context);
+
/*
* NB: we're *not* doing a ScheduleBufferTagForWriteback here;
* although we're essentially performing a write. At least on linux
@@ -1020,6 +1041,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
smgrread(smgr, forkNum, blockNum, (char *) bufBlock);
+ pgstat_count_io_op(IOOP_READ, io_object, io_context);
+
if (track_io_timing)
{
INSTR_TIME_SET_CURRENT(io_time);
@@ -1113,14 +1136,19 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* *foundPtr is actually redundant with the buffer's BM_VALID flag, but
* we keep it for simplicity in ReadBuffer.
*
+ * io_context is passed as an output parameter to avoid calling
+ * IOContextForStrategy() when there is a shared buffers hit and no IO
+ * statistics need be captured.
+ *
* No locks are held either at entry or exit.
*/
static BufferDesc *
BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
BlockNumber blockNum,
BufferAccessStrategy strategy,
- bool *foundPtr)
+ bool *foundPtr, IOContext *io_context)
{
+ bool from_ring;
BufferTag newTag; /* identity of requested block */
uint32 newHash; /* hash value for newTag */
LWLock *newPartitionLock; /* buffer partition lock for it */
@@ -1172,8 +1200,11 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
{
/*
* If we get here, previous attempts to read the buffer must
- * have failed ... but we shall bravely try again.
+ * have failed ... but we shall bravely try again. Set
+ * io_context since we will in fact need to count an IO
+ * Operation.
*/
+ *io_context = IOContextForStrategy(strategy);
*foundPtr = false;
}
}
@@ -1187,6 +1218,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
*/
LWLockRelease(newPartitionLock);
+ *io_context = IOContextForStrategy(strategy);
+
/* Loop here in case we have to try another victim buffer */
for (;;)
{
@@ -1200,7 +1233,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* Select a victim buffer. The buffer is returned with its header
* spinlock still held!
*/
- buf = StrategyGetBuffer(strategy, &buf_state);
+ buf = StrategyGetBuffer(strategy, &buf_state, &from_ring);
Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
@@ -1254,7 +1287,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
UnlockBufHdr(buf, buf_state);
if (XLogNeedsFlush(lsn) &&
- StrategyRejectBuffer(strategy, buf))
+ StrategyRejectBuffer(strategy, buf, from_ring))
{
/* Drop lock/pin and loop around for another buffer */
LWLockRelease(BufferDescriptorGetContentLock(buf));
@@ -1269,7 +1302,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
smgr->smgr_rlocator.locator.dbOid,
smgr->smgr_rlocator.locator.relNumber);
- FlushBuffer(buf, NULL);
+ FlushBuffer(buf, NULL, *io_context, IOOBJECT_RELATION);
LWLockRelease(BufferDescriptorGetContentLock(buf));
ScheduleBufferTagForWriteback(&BackendWritebackContext,
@@ -1441,6 +1474,28 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
UnlockBufHdr(buf, buf_state);
+ if (oldFlags & BM_VALID)
+ {
+ /*
+ * When a BufferAccessStrategy is in use, blocks evicted from shared
+ * buffers are counted as IOOP_EVICT in the corresponding context
+ * (e.g. IOCONTEXT_BULKWRITE). Shared buffers are evicted by a
+ * strategy in two cases: 1) while initially claiming buffers for the
+ * strategy ring 2) to replace an existing strategy ring buffer
+ * because it is pinned or in use and cannot be reused.
+ *
+ * Blocks evicted from buffers already in the strategy ring are
+ * counted as IOOP_REUSE in the corresponding strategy context.
+ *
+ * At this point, we can accurately count evictions and reuses,
+ * because we have successfully claimed the valid buffer. Previously,
+ * we may have been forced to release the buffer due to concurrent
+ * pinners or erroring out.
+ */
+ pgstat_count_io_op(from_ring ? IOOP_REUSE : IOOP_EVICT,
+ IOOBJECT_RELATION, *io_context);
+ }
+
if (oldPartitionLock != NULL)
{
BufTableDelete(&oldTag, oldHash);
@@ -2570,7 +2625,7 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOCONTEXT_NORMAL, IOOBJECT_RELATION);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
@@ -2820,7 +2875,7 @@ BufferGetTag(Buffer buffer, RelFileLocator *rlocator, ForkNumber *forknum,
* as the second parameter. If not, pass NULL.
*/
static void
-FlushBuffer(BufferDesc *buf, SMgrRelation reln)
+FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOContext io_context, IOObject io_object)
{
XLogRecPtr recptr;
ErrorContextCallback errcallback;
@@ -2912,6 +2967,26 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
bufToWrite,
false);
+ /*
+ * When a strategy is in use, only flushes of dirty buffers already in the
+ * strategy ring are counted as strategy writes (IOCONTEXT
+ * [BULKREAD|BULKWRITE|VACUUM] IOOP_WRITE) for the purpose of IO
+ * statistics tracking.
+ *
+ * If a shared buffer initially added to the ring must be flushed before
+ * being used, this is counted as an IOCONTEXT_NORMAL IOOP_WRITE.
+ *
+ * If a shared buffer added to the ring later requires flushing (because
+ * the current strategy buffer is pinned or in use, or because all
+ * strategy buffers were dirty and rejected, the latter for BAS_BULKREAD
+ * operations only), this is counted as an IOCONTEXT_NORMAL IOOP_WRITE
+ * (from_ring will be false).
+ *
+ * When a strategy is not in use, the write can only be a "regular" write
+ * of a dirty shared buffer (IOCONTEXT_NORMAL IOOP_WRITE).
+ */
+ pgstat_count_io_op(IOOP_WRITE, IOOBJECT_RELATION, io_context);
+
if (track_io_timing)
{
INSTR_TIME_SET_CURRENT(io_time);
@@ -3554,6 +3629,8 @@ FlushRelationBuffers(Relation rel)
buf_state &= ~(BM_DIRTY | BM_JUST_DIRTIED);
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+ pgstat_count_io_op(IOOP_WRITE, IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL);
+
/* Pop the error context stack */
error_context_stack = errcallback.previous;
}
@@ -3586,7 +3663,7 @@ FlushRelationBuffers(Relation rel)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, RelationGetSmgr(rel));
+ FlushBuffer(bufHdr, RelationGetSmgr(rel), IOCONTEXT_NORMAL, IOOBJECT_RELATION);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr);
}
@@ -3684,7 +3761,7 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, srelent->srel);
+ FlushBuffer(bufHdr, srelent->srel, IOCONTEXT_NORMAL, IOOBJECT_RELATION);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr);
}
@@ -3894,7 +3971,7 @@ FlushDatabaseBuffers(Oid dbid)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOCONTEXT_NORMAL, IOOBJECT_RELATION);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr);
}
@@ -3921,7 +3998,7 @@ FlushOneBuffer(Buffer buffer)
Assert(LWLockHeldByMe(BufferDescriptorGetContentLock(bufHdr)));
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOCONTEXT_NORMAL, IOOBJECT_RELATION);
}
/*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 7dec35801c..c690d5f15f 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -15,6 +15,7 @@
*/
#include "postgres.h"
+#include "pgstat.h"
#include "port/atomics.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
@@ -81,12 +82,6 @@ typedef struct BufferAccessStrategyData
*/
int current;
- /*
- * True if the buffer just returned by StrategyGetBuffer had been in the
- * ring already.
- */
- bool current_was_in_ring;
-
/*
* Array of buffer numbers. InvalidBuffer (that is, zero) indicates we
* have not yet selected a buffer for this ring slot. For allocation
@@ -198,13 +193,15 @@ have_free_buffer(void)
* return the buffer with the buffer header spinlock still held.
*/
BufferDesc *
-StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
+StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_ring)
{
BufferDesc *buf;
int bgwprocno;
int trycounter;
uint32 local_buf_state; /* to avoid repeated (de-)referencing */
+ *from_ring = false;
+
/*
* If given a strategy object, see whether it can select a buffer. We
* assume strategy objects don't need buffer_strategy_lock.
@@ -213,7 +210,10 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
{
buf = GetBufferFromRing(strategy, buf_state);
if (buf != NULL)
+ {
+ *from_ring = true;
return buf;
+ }
}
/*
@@ -602,7 +602,7 @@ FreeAccessStrategy(BufferAccessStrategy strategy)
/*
* GetBufferFromRing -- returns a buffer from the ring, or NULL if the
- * ring is empty.
+ * ring is empty / not usable.
*
* The bufhdr spin lock is held on the returned buffer.
*/
@@ -625,10 +625,7 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
*/
bufnum = strategy->buffers[strategy->current];
if (bufnum == InvalidBuffer)
- {
- strategy->current_was_in_ring = false;
return NULL;
- }
/*
* If the buffer is pinned we cannot use it under any circumstances.
@@ -644,7 +641,6 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
if (BUF_STATE_GET_REFCOUNT(local_buf_state) == 0
&& BUF_STATE_GET_USAGECOUNT(local_buf_state) <= 1)
{
- strategy->current_was_in_ring = true;
*buf_state = local_buf_state;
return buf;
}
@@ -654,7 +650,6 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
* Tell caller to allocate a new buffer with the normal allocation
* strategy. He'll then replace this ring element via AddBufferToRing.
*/
- strategy->current_was_in_ring = false;
return NULL;
}
@@ -670,6 +665,39 @@ AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
strategy->buffers[strategy->current] = BufferDescriptorGetBuffer(buf);
}
+/*
+ * Utility function returning the IOContext of a given BufferAccessStrategy's
+ * strategy ring.
+ */
+IOContext
+IOContextForStrategy(BufferAccessStrategy strategy)
+{
+ if (!strategy)
+ return IOCONTEXT_NORMAL;
+
+ switch (strategy->btype)
+ {
+ case BAS_NORMAL:
+
+ /*
+ * Currently, GetAccessStrategy() returns NULL for
+ * BufferAccessStrategyType BAS_NORMAL, so this case is
+ * unreachable.
+ */
+ pg_unreachable();
+ return IOCONTEXT_NORMAL;
+ case BAS_BULKREAD:
+ return IOCONTEXT_BULKREAD;
+ case BAS_BULKWRITE:
+ return IOCONTEXT_BULKWRITE;
+ case BAS_VACUUM:
+ return IOCONTEXT_VACUUM;
+ }
+
+ elog(ERROR, "unrecognized BufferAccessStrategyType: %d", strategy->btype);
+ pg_unreachable();
+}
+
/*
* StrategyRejectBuffer -- consider rejecting a dirty buffer
*
@@ -682,14 +710,14 @@ AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
* if this buffer should be written and re-used.
*/
bool
-StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
+StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool from_ring)
{
/* We only do this in bulkread mode */
if (strategy->btype != BAS_BULKREAD)
return false;
/* Don't muck with behavior of normal buffer-replacement strategy */
- if (!strategy->current_was_in_ring ||
+ if (!from_ring ||
strategy->buffers[strategy->current] != BufferDescriptorGetBuffer(buf))
return false;
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 8372acc383..2108bbe7d8 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -18,6 +18,7 @@
#include "access/parallel.h"
#include "catalog/catalog.h"
#include "executor/instrument.h"
+#include "pgstat.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "utils/guc_hooks.h"
@@ -107,7 +108,7 @@ PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
*/
BufferDesc *
LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
- bool *foundPtr)
+ bool *foundPtr, IOContext *io_context)
{
BufferTag newTag; /* identity of requested block */
LocalBufferLookupEnt *hresult;
@@ -127,6 +128,14 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
hresult = (LocalBufferLookupEnt *)
hash_search(LocalBufHash, (void *) &newTag, HASH_FIND, NULL);
+ /*
+ * IO Operations on local buffers are only done in IOCONTEXT_NORMAL, so
+ * set io_context before checking for a buffer hit, even though a hit
+ * does not need it. This is convenient and cheap here, since there is
+ * no IOContextForStrategy() overhead to avoid for local buffers.
+ */
+ *io_context = IOCONTEXT_NORMAL;
+
if (hresult)
{
b = hresult->id;
@@ -230,6 +239,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
buf_state &= ~BM_DIRTY;
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+ pgstat_count_io_op(IOOP_WRITE, IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL);
pgBufferUsage.local_blks_written++;
}
@@ -256,6 +266,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
ClearBufferTag(&bufHdr->tag);
buf_state &= ~(BM_VALID | BM_TAG_VALID);
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+ pgstat_count_io_op(IOOP_EVICT, IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL);
}
hresult = (LocalBufferLookupEnt *)
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 60c9905eff..37bae4bf73 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -983,6 +983,15 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
{
MdfdVec *v = &reln->md_seg_fds[forknum][segno - 1];
+ /*
+ * fsyncs done through mdimmedsync() should be tracked in a separate
+ * IOContext from those done through mdsyncfiletag() in order to
+ * distinguish unavoidable client backend fsyncs (e.g. those done
+ * during index build) from those which ideally would have been done
+ * by the checkpointer. Since other IO operations bypassing the buffer
+ * manager could also be tracked in such an IOContext, hold off on
+ * counting immediate fsyncs until those are tracked as well.
+ */
if (FileSync(v->mdfd_vfd, WAIT_EVENT_DATA_FILE_IMMEDIATE_SYNC) < 0)
ereport(data_sync_elevel(ERROR),
(errcode_for_file_access(),
@@ -1021,6 +1030,19 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false /* retryOnError */ ))
{
+ /*
+ * We have no way of knowing if the current IOContext is
+ * IOCONTEXT_NORMAL or IOCONTEXT_[BULKREAD, BULKWRITE, VACUUM] at this
+ * point, so count the fsync as being in the IOCONTEXT_NORMAL
+ * IOContext. This is probably okay, because the number of backend
+ * fsyncs doesn't say anything about the efficacy of the
+ * BufferAccessStrategy. And counting both fsyncs done in
+ * IOCONTEXT_NORMAL and IOCONTEXT_[BULKREAD, BULKWRITE, VACUUM] under
+ * IOCONTEXT_NORMAL is likely clearer when investigating the number of
+ * backend fsyncs.
+ */
+ pgstat_count_io_op(IOOP_FSYNC, IOOBJECT_RELATION, IOCONTEXT_NORMAL);
+
ereport(DEBUG1,
(errmsg_internal("could not forward fsync request because request queue is full")));
@@ -1410,6 +1432,9 @@ mdsyncfiletag(const FileTag *ftag, char *path)
if (need_to_close)
FileClose(file);
+ if (result >= 0)
+ pgstat_count_io_op(IOOP_FSYNC, IOOBJECT_RELATION, IOCONTEXT_NORMAL);
+
errno = save_errno;
return result;
}
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index ed8aa2519c..0b44814740 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -15,6 +15,7 @@
#ifndef BUFMGR_INTERNALS_H
#define BUFMGR_INTERNALS_H
+#include "pgstat.h"
#include "port/atomics.h"
#include "storage/buf.h"
#include "storage/bufmgr.h"
@@ -391,11 +392,12 @@ extern void IssuePendingWritebacks(WritebackContext *context);
extern void ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *tag);
/* freelist.c */
+extern IOContext IOContextForStrategy(BufferAccessStrategy bas);
extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
- uint32 *buf_state);
+ uint32 *buf_state, bool *from_ring);
extern void StrategyFreeBuffer(BufferDesc *buf);
extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
- BufferDesc *buf);
+ BufferDesc *buf, bool from_ring);
extern int StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
extern void StrategyNotifyBgWriter(int bgwprocno);
@@ -417,7 +419,7 @@ extern PrefetchBufferResult PrefetchLocalBuffer(SMgrRelation smgr,
ForkNumber forkNum,
BlockNumber blockNum);
extern BufferDesc *LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum,
- BlockNumber blockNum, bool *foundPtr);
+ BlockNumber blockNum, bool *foundPtr, IOContext *io_context);
extern void MarkLocalBufferDirty(Buffer buffer);
extern void DropRelationLocalBuffers(RelFileLocator rlocator,
ForkNumber forkNum,
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 33eadbc129..b8a18b8081 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -23,7 +23,12 @@
typedef void *Block;
-/* Possible arguments for GetAccessStrategy() */
+/*
+ * Possible arguments for GetAccessStrategy().
+ *
+ * If adding a new BufferAccessStrategyType, also add a new IOContext so
+ * IO statistics using this strategy are tracked.
+ */
typedef enum BufferAccessStrategyType
{
BAS_NORMAL, /* Normal random access */
--
2.34.1
Attachment: v46-0002-pgstat-Infrastructure-to-track-IO-operations.patch (text/x-patch; charset=US-ASCII)
From 67a4cee981382eb63dd19c32458ef23da9eab5b9 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 4 Jan 2023 17:20:41 -0500
Subject: [PATCH v46 2/5] pgstat: Infrastructure to track IO operations
Introduce "IOOp", an IO operation done by a backend, "IOObject", the
target object of the IO, and "IOContext", the context or location of the
IO operations on that object. For example, the checkpointer may write a
shared buffer out. This would be considered an IOOP_WRITE IOOp on an
IOOBJECT_RELATION IOObject in the IOCONTEXT_NORMAL IOContext by
BackendType B_CHECKPOINTER.
Each IOOp (evict, extend, fsync, read, reuse, and write) can be counted
per IOObject (relation, temp relation) per IOContext (normal, bulkread,
bulkwrite, or vacuum) through a call to pgstat_count_io_op().
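
As a rough illustration of the counting scheme described above (enum members and array layout here are illustrative stand-ins, not the patch's actual definitions in pgstat.h), the backend-local pending counters amount to a three-dimensional array indexed by context, object, and op:

```c
#include <assert.h>

/* Illustrative enums mirroring the categories named above. */
typedef enum
{
	IOCONTEXT_BULKREAD,
	IOCONTEXT_BULKWRITE,
	IOCONTEXT_NORMAL,
	IOCONTEXT_VACUUM,
	IOCONTEXT_NUM_TYPES
} IOContext;

typedef enum
{
	IOOBJECT_RELATION,
	IOOBJECT_TEMP_RELATION,
	IOOBJECT_NUM_TYPES
} IOObject;

typedef enum
{
	IOOP_EVICT,
	IOOP_EXTEND,
	IOOP_FSYNC,
	IOOP_READ,
	IOOP_REUSE,
	IOOP_WRITE,
	IOOP_NUM_TYPES
} IOOp;

/* Backend-local pending counters, later flushed to shared memory. */
static long pending[IOCONTEXT_NUM_TYPES][IOOBJECT_NUM_TYPES][IOOP_NUM_TYPES];

/* Simplified analogue of pgstat_count_io_op(): bump one counter. */
static void
count_io_op(IOOp io_op, IOObject io_object, IOContext io_context)
{
	pending[io_context][io_object][io_op]++;
}
```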
Note that this commit introduces the infrastructure to count IO
Operation statistics. A subsequent commit will add calls to
pgstat_count_io_op() in the appropriate locations.
IOContext IOCONTEXT_NORMAL concerns operations on local and shared
buffers, while IOCONTEXT_BULKREAD, IOCONTEXT_BULKWRITE, and
IOCONTEXT_VACUUM IOContexts concern IO operations on buffers as part of
a BufferAccessStrategy.
IOObject IOOBJECT_TEMP_RELATION concerns IO Operations on buffers
containing temporary table data, while IOObject IOOBJECT_RELATION
concerns IO Operations on buffers containing permanent relation data.
Stats on IOOps on all IOObjects in all IOContexts for a given backend
are first counted in a backend's local memory and then flushed to shared
memory and accumulated with those from all other backends, exited and
live.
Some BackendTypes will not flush their pending statistics at regular
intervals and explicitly call pgstat_flush_io() during the course of
normal operations to flush their backend-local IO operation statistics
to shared memory in a timely manner.
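
The local-to-shared flush described above can be sketched roughly as follows (simplified: a flat counter array, no per-BackendType LWLock, no nowait mode; not the patch's actual code):

```c
#include <assert.h>
#include <string.h>

#define NCOUNTERS 48			/* illustrative: contexts x objects x ops */

static long pending_io[NCOUNTERS];	/* backend-local pending counters */
static long shared_io[NCOUNTERS];	/* stands in for the shared-memory stats */
static int	have_iostats = 0;	/* set whenever a pending counter is bumped */

/*
 * Accumulate the pending counters into the shared stats, then reset
 * the local state. The real pgstat_flush_io() additionally acquires a
 * per-BackendType lock and can bail out in nowait mode.
 */
static void
flush_io(void)
{
	if (!have_iostats)
		return;				/* nothing recorded since last flush */

	for (int i = 0; i < NCOUNTERS; i++)
		shared_io[i] += pending_io[i];

	memset(pending_io, 0, sizeof(pending_io));
	have_iostats = 0;
}
```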
Because not all BackendType, IOOp, IOObject, IOContext combinations are
valid, the validity of the stats is checked before flushing pending
stats and before reading in the existing stats file to shared memory.
The aggregated stats in shared memory could be extended in the future
with per-backend stats -- useful for per connection IO statistics and
monitoring.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
doc/src/sgml/monitoring.sgml | 2 +
src/backend/utils/activity/Makefile | 1 +
src/backend/utils/activity/meson.build | 1 +
src/backend/utils/activity/pgstat.c | 26 ++
src/backend/utils/activity/pgstat_bgwriter.c | 7 +-
.../utils/activity/pgstat_checkpointer.c | 7 +-
src/backend/utils/activity/pgstat_io.c | 377 ++++++++++++++++++
src/backend/utils/activity/pgstat_relation.c | 15 +-
src/backend/utils/activity/pgstat_shmem.c | 4 +
src/backend/utils/activity/pgstat_wal.c | 4 +-
src/backend/utils/adt/pgstatfuncs.c | 4 +-
src/include/miscadmin.h | 2 +
src/include/pgstat.h | 66 +++
src/include/utils/pgstat_internal.h | 32 ++
src/tools/pgindent/typedefs.list | 6 +
15 files changed, 548 insertions(+), 6 deletions(-)
create mode 100644 src/backend/utils/activity/pgstat_io.c
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 358d2ff90f..8d51ca3773 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -5418,6 +5418,8 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
the <structname>pg_stat_bgwriter</structname>
view, <literal>archiver</literal> to reset all the counters shown in
the <structname>pg_stat_archiver</structname> view,
+ <literal>io</literal> to reset all the counters shown in the
+ <structname>pg_stat_io</structname> view,
<literal>wal</literal> to reset all the counters shown in the
<structname>pg_stat_wal</structname> view or
<literal>recovery_prefetch</literal> to reset all the counters shown
diff --git a/src/backend/utils/activity/Makefile b/src/backend/utils/activity/Makefile
index a80eda3cf4..7d7482dde0 100644
--- a/src/backend/utils/activity/Makefile
+++ b/src/backend/utils/activity/Makefile
@@ -22,6 +22,7 @@ OBJS = \
pgstat_checkpointer.o \
pgstat_database.o \
pgstat_function.o \
+ pgstat_io.o \
pgstat_relation.o \
pgstat_replslot.o \
pgstat_shmem.o \
diff --git a/src/backend/utils/activity/meson.build b/src/backend/utils/activity/meson.build
index a2b872c24b..518ee3f798 100644
--- a/src/backend/utils/activity/meson.build
+++ b/src/backend/utils/activity/meson.build
@@ -9,6 +9,7 @@ backend_sources += files(
'pgstat_checkpointer.c',
'pgstat_database.c',
'pgstat_function.c',
+ 'pgstat_io.c',
'pgstat_relation.c',
'pgstat_replslot.c',
'pgstat_shmem.c',
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 0fa5370bcd..608c3b59da 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -72,6 +72,7 @@
* - pgstat_checkpointer.c
* - pgstat_database.c
* - pgstat_function.c
+ * - pgstat_io.c
* - pgstat_relation.c
* - pgstat_replslot.c
* - pgstat_slru.c
@@ -359,6 +360,15 @@ static const PgStat_KindInfo pgstat_kind_infos[PGSTAT_NUM_KINDS] = {
.snapshot_cb = pgstat_checkpointer_snapshot_cb,
},
+ [PGSTAT_KIND_IO] = {
+ .name = "io_ops",
+
+ .fixed_amount = true,
+
+ .reset_all_cb = pgstat_io_reset_all_cb,
+ .snapshot_cb = pgstat_io_snapshot_cb,
+ },
+
[PGSTAT_KIND_SLRU] = {
.name = "slru",
@@ -582,6 +592,7 @@ pgstat_report_stat(bool force)
/* Don't expend a clock check if nothing to do */
if (dlist_is_empty(&pgStatPending) &&
+ !have_iostats &&
!have_slrustats &&
!pgstat_have_pending_wal())
{
@@ -628,6 +639,9 @@ pgstat_report_stat(bool force)
/* flush database / relation / function / ... stats */
partial_flush |= pgstat_flush_pending_entries(nowait);
+ /* flush IO stats */
+ partial_flush |= pgstat_flush_io(nowait);
+
/* flush wal stats */
partial_flush |= pgstat_flush_wal(nowait);
@@ -1322,6 +1336,12 @@ pgstat_write_statsfile(void)
pgstat_build_snapshot_fixed(PGSTAT_KIND_CHECKPOINTER);
write_chunk_s(fpout, &pgStatLocal.snapshot.checkpointer);
+ /*
+ * Write IO stats struct
+ */
+ pgstat_build_snapshot_fixed(PGSTAT_KIND_IO);
+ write_chunk_s(fpout, &pgStatLocal.snapshot.io);
+
/*
* Write SLRU stats struct
*/
@@ -1496,6 +1516,12 @@ pgstat_read_statsfile(void)
if (!read_chunk_s(fpin, &shmem->checkpointer.stats))
goto error;
+ /*
+ * Read IO stats struct
+ */
+ if (!read_chunk_s(fpin, &shmem->io.stats))
+ goto error;
+
/*
* Read SLRU stats struct
*/
diff --git a/src/backend/utils/activity/pgstat_bgwriter.c b/src/backend/utils/activity/pgstat_bgwriter.c
index 9247f2dda2..92be384b0d 100644
--- a/src/backend/utils/activity/pgstat_bgwriter.c
+++ b/src/backend/utils/activity/pgstat_bgwriter.c
@@ -24,7 +24,7 @@ PgStat_BgWriterStats PendingBgWriterStats = {0};
/*
- * Report bgwriter statistics
+ * Report bgwriter and IO statistics
*/
void
pgstat_report_bgwriter(void)
@@ -56,6 +56,11 @@ pgstat_report_bgwriter(void)
* Clear out the statistics buffer, so it can be re-used.
*/
MemSet(&PendingBgWriterStats, 0, sizeof(PendingBgWriterStats));
+
+ /*
+ * Report IO statistics
+ */
+ pgstat_flush_io(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_checkpointer.c b/src/backend/utils/activity/pgstat_checkpointer.c
index 3e9ab45103..26dec112f6 100644
--- a/src/backend/utils/activity/pgstat_checkpointer.c
+++ b/src/backend/utils/activity/pgstat_checkpointer.c
@@ -24,7 +24,7 @@ PgStat_CheckpointerStats PendingCheckpointerStats = {0};
/*
- * Report checkpointer statistics
+ * Report checkpointer and IO statistics
*/
void
pgstat_report_checkpointer(void)
@@ -62,6 +62,11 @@ pgstat_report_checkpointer(void)
* Clear out the statistics buffer, so it can be re-used.
*/
MemSet(&PendingCheckpointerStats, 0, sizeof(PendingCheckpointerStats));
+
+ /*
+ * Report IO statistics
+ */
+ pgstat_flush_io(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
new file mode 100644
index 0000000000..f95ab9c94d
--- /dev/null
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -0,0 +1,377 @@
+/* -------------------------------------------------------------------------
+ *
+ * pgstat_io.c
+ * Implementation of IO statistics.
+ *
+ * This file contains the implementation of IO statistics. It is kept separate
+ * from pgstat.c to enforce the line between the statistics access / storage
+ * implementation and the details about individual types of statistics.
+ *
+ * Copyright (c) 2021-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/utils/activity/pgstat_io.c
+ * -------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "utils/pgstat_internal.h"
+
+
+static PgStat_BackendIO PendingIOStats;
+bool have_iostats = false;
+
+
+void
+pgstat_count_io_op(IOOp io_op, IOObject io_object, IOContext io_context)
+{
+ Assert(io_context < IOCONTEXT_NUM_TYPES);
+ Assert(io_object < IOOBJECT_NUM_TYPES);
+ Assert(io_op < IOOP_NUM_TYPES);
+ Assert(pgstat_tracks_io_op(MyBackendType, io_context, io_object, io_op));
+
+ PendingIOStats.data[io_context][io_object][io_op]++;
+
+ have_iostats = true;
+}
+
+PgStat_IO *
+pgstat_fetch_stat_io(void)
+{
+ pgstat_snapshot_fixed(PGSTAT_KIND_IO);
+
+ return &pgStatLocal.snapshot.io;
+}
+
+/*
+ * Flush out locally pending IO statistics.
+ *
+ * If no stats have been recorded, this function returns false.
+ *
+ * If nowait is true and the lock cannot be acquired immediately, this
+ * function returns true without flushing; otherwise it flushes the stats
+ * and returns false.
+ */
+bool
+pgstat_flush_io(bool nowait)
+{
+ LWLock *bktype_lock;
+ PgStat_BackendIO *bktype_shstats;
+
+ if (!have_iostats)
+ return false;
+
+ bktype_lock = &pgStatLocal.shmem->io.locks[MyBackendType];
+ bktype_shstats =
+ &pgStatLocal.shmem->io.stats.stats[MyBackendType];
+
+ if (!nowait)
+ LWLockAcquire(bktype_lock, LW_EXCLUSIVE);
+ else if (!LWLockConditionalAcquire(bktype_lock, LW_EXCLUSIVE))
+ return true;
+
+ for (IOContext io_context = IOCONTEXT_FIRST;
+ io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ for (IOObject io_object = IOOBJECT_FIRST;
+ io_object < IOOBJECT_NUM_TYPES; io_object++)
+ for (IOOp io_op = IOOP_FIRST;
+ io_op < IOOP_NUM_TYPES; io_op++)
+ bktype_shstats->data[io_context][io_object][io_op] +=
+ PendingIOStats.data[io_context][io_object][io_op];
+
+ Assert(pgstat_bktype_io_stats_valid(bktype_shstats, MyBackendType));
+
+ LWLockRelease(bktype_lock);
+
+ memset(&PendingIOStats, 0, sizeof(PendingIOStats));
+
+ have_iostats = false;
+
+ return false;
+}
+
+const char *
+pgstat_get_io_context_name(IOContext io_context)
+{
+ switch (io_context)
+ {
+ case IOCONTEXT_BULKREAD:
+ return "bulkread";
+ case IOCONTEXT_BULKWRITE:
+ return "bulkwrite";
+ case IOCONTEXT_NORMAL:
+ return "normal";
+ case IOCONTEXT_VACUUM:
+ return "vacuum";
+ }
+
+ elog(ERROR, "unrecognized IOContext value: %d", io_context);
+ pg_unreachable();
+}
+
+const char *
+pgstat_get_io_object_name(IOObject io_object)
+{
+ switch (io_object)
+ {
+ case IOOBJECT_RELATION:
+ return "relation";
+ case IOOBJECT_TEMP_RELATION:
+ return "temp relation";
+ }
+
+ elog(ERROR, "unrecognized IOObject value: %d", io_object);
+ pg_unreachable();
+}
+
+void
+pgstat_io_reset_all_cb(TimestampTz ts)
+{
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ LWLock *bktype_lock = &pgStatLocal.shmem->io.locks[i];
+ PgStat_BackendIO *bktype_shstats = &pgStatLocal.shmem->io.stats.stats[i];
+
+ LWLockAcquire(bktype_lock, LW_EXCLUSIVE);
+
+ /*
+ * Use the lock in the first BackendType's PgStat_BackendIO to protect
+ * the reset timestamp as well.
+ */
+ if (i == 0)
+ pgStatLocal.shmem->io.stats.stat_reset_timestamp = ts;
+
+ memset(bktype_shstats, 0, sizeof(*bktype_shstats));
+ LWLockRelease(bktype_lock);
+ }
+}
+
+void
+pgstat_io_snapshot_cb(void)
+{
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ LWLock *bktype_lock = &pgStatLocal.shmem->io.locks[i];
+ PgStat_BackendIO *bktype_shstats = &pgStatLocal.shmem->io.stats.stats[i];
+ PgStat_BackendIO *bktype_snap = &pgStatLocal.snapshot.io.stats[i];
+
+ LWLockAcquire(bktype_lock, LW_SHARED);
+
+ /*
+ * Use the lock in the first BackendType's PgStat_BackendIO to protect
+ * the reset timestamp as well.
+ */
+ if (i == 0)
+ pgStatLocal.snapshot.io.stat_reset_timestamp =
+ pgStatLocal.shmem->io.stats.stat_reset_timestamp;
+
+ /* using struct assignment due to better type safety */
+ *bktype_snap = *bktype_shstats;
+ LWLockRelease(bktype_lock);
+ }
+}
+
+/*
+ * IO statistics are not collected for all BackendTypes.
+ *
+ * The following BackendTypes either do not participate in the cumulative
+ * stats subsystem or do not perform IO that we currently track:
+ * - Syslogger, because it is not connected to shared memory
+ * - Archiver, because most relevant archiving IO is delegated to a
+ *   specialized command or module
+ * - WAL Receiver and WAL Writer, because their IO is not tracked in
+ *   pg_stat_io for now
+ *
+ * Returns true if the given BackendType participates in the cumulative
+ * stats subsystem for IO and false otherwise.
+ */
+bool
+pgstat_tracks_io_bktype(BackendType bktype)
+{
+ /*
+ * List every type so that new backend types trigger a warning about
+ * needing to adjust this switch.
+ */
+ switch (bktype)
+ {
+ case B_INVALID:
+ case B_ARCHIVER:
+ case B_LOGGER:
+ case B_WAL_RECEIVER:
+ case B_WAL_WRITER:
+ return false;
+
+ case B_AUTOVAC_LAUNCHER:
+ case B_AUTOVAC_WORKER:
+ case B_BACKEND:
+ case B_BG_WORKER:
+ case B_BG_WRITER:
+ case B_CHECKPOINTER:
+ case B_STANDALONE_BACKEND:
+ case B_STARTUP:
+ case B_WAL_SENDER:
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * Some BackendTypes do not perform IO in certain IOContexts. Some IOObjects
+ * are never operated on in some IOContexts. Check that the given BackendType
+ * is expected to do IO in the given IOContext and that the given IOObject is
+ * expected to be operated on in the given IOContext.
+ */
+bool
+pgstat_tracks_io_object(BackendType bktype, IOContext io_context,
+ IOObject io_object)
+{
+ bool no_temp_rel;
+
+ /*
+ * Some BackendTypes should never track IO statistics.
+ */
+ if (!pgstat_tracks_io_bktype(bktype))
+ return false;
+
+ /*
+ * Currently, IO on temporary relations can only occur in the
+ * IOCONTEXT_NORMAL IOContext.
+ */
+ if (io_context != IOCONTEXT_NORMAL &&
+ io_object == IOOBJECT_TEMP_RELATION)
+ return false;
+
+ /*
+ * In core Postgres, only regular backends and WAL Sender processes
+ * executing queries will use local buffers and operate on temporary
+ * relations. Parallel workers will not use local buffers (see
+ * InitLocalBuffers()); however, extensions leveraging background workers
+ * have no such limitation, so track IO on IOOBJECT_TEMP_RELATION for
+ * BackendType B_BG_WORKER.
+ */
+ no_temp_rel = bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER ||
+ bktype == B_CHECKPOINTER || bktype == B_AUTOVAC_WORKER ||
+ bktype == B_STANDALONE_BACKEND || bktype == B_STARTUP;
+
+ if (no_temp_rel && io_context == IOCONTEXT_NORMAL &&
+ io_object == IOOBJECT_TEMP_RELATION)
+ return false;
+
+ /*
+ * Some BackendTypes do not currently perform any IO in certain
+ * IOContexts, and, while it may not be inherently incorrect for them to
+ * do so, excluding those rows from the view makes the view easier to use.
+ */
+ if ((bktype == B_CHECKPOINTER || bktype == B_BG_WRITER) &&
+ (io_context == IOCONTEXT_BULKREAD ||
+ io_context == IOCONTEXT_BULKWRITE ||
+ io_context == IOCONTEXT_VACUUM))
+ return false;
+
+ if (bktype == B_AUTOVAC_LAUNCHER && io_context == IOCONTEXT_VACUUM)
+ return false;
+
+ if ((bktype == B_AUTOVAC_WORKER || bktype == B_AUTOVAC_LAUNCHER) &&
+ io_context == IOCONTEXT_BULKWRITE)
+ return false;
+
+ return true;
+}
+
+/*
+ * Some BackendTypes will never do certain IOOps and some IOOps should not
+ * occur in certain IOContexts. Check that the given IOOp is valid for the
+ * given BackendType in the given IOContext. Note that there are currently no
+ * cases of an IOOp being invalid for a particular BackendType only within a
+ * certain IOContext.
+ */
+bool
+pgstat_tracks_io_op(BackendType bktype, IOContext io_context,
+ IOObject io_object, IOOp io_op)
+{
+ bool strategy_io_context;
+
+ /* if (io_context, io_object) will never collect stats, we're done */
+ if (!pgstat_tracks_io_object(bktype, io_context, io_object))
+ return false;
+
+ /*
+ * Some BackendTypes will not do certain IOOps.
+ */
+ if ((bktype == B_BG_WRITER || bktype == B_CHECKPOINTER) &&
+ (io_op == IOOP_READ || io_op == IOOP_EVICT))
+ return false;
+
+ if ((bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER ||
+ bktype == B_CHECKPOINTER) && io_op == IOOP_EXTEND)
+ return false;
+
+ /*
+ * Some IOOps are not valid in certain IOContexts and some IOOps are only
+ * valid in certain contexts.
+ */
+ if (io_context == IOCONTEXT_BULKREAD && io_op == IOOP_EXTEND)
+ return false;
+
+ strategy_io_context = io_context == IOCONTEXT_BULKREAD ||
+ io_context == IOCONTEXT_BULKWRITE || io_context == IOCONTEXT_VACUUM;
+
+ /*
+ * IOOP_REUSE is only relevant when a BufferAccessStrategy is in use.
+ */
+ if (!strategy_io_context && io_op == IOOP_REUSE)
+ return false;
+
+ /*
+ * IOOP_FSYNC IOOps done by a backend using a BufferAccessStrategy are
+ * counted in the IOCONTEXT_NORMAL IOContext. See comment in
+ * register_dirty_segment() for more details.
+ */
+ if (strategy_io_context && io_op == IOOP_FSYNC)
+ return false;
+
+ /*
+ * Temporary tables are not logged and thus do not require fsync'ing.
+ */
+ if (io_context == IOCONTEXT_NORMAL &&
+ io_object == IOOBJECT_TEMP_RELATION && io_op == IOOP_FSYNC)
+ return false;
+
+ return true;
+}
+
+/*
+ * Check that stats have not been counted for any combination of IOContext,
+ * IOObject, and IOOp which are not tracked for the passed-in BackendType. The
+ * passed-in PgStat_BackendIO must contain stats from the BackendType specified
+ * by the second parameter. Caller is responsible for locking the passed-in
+ * PgStat_BackendIO, if needed.
+ */
+bool
+pgstat_bktype_io_stats_valid(PgStat_BackendIO *backend_io,
+ BackendType bktype)
+{
+ bool bktype_tracked = pgstat_tracks_io_bktype(bktype);
+
+ for (IOContext io_context = IOCONTEXT_FIRST;
+ io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ {
+ for (IOObject io_object = IOOBJECT_FIRST;
+ io_object < IOOBJECT_NUM_TYPES; io_object++)
+ {
+ /*
+ * Don't bother trying to skip to the next loop iteration if
+ * pgstat_tracks_io_object() would return false here. We still
+ * need to validate that each counter is zero anyway.
+ */
+ for (IOOp io_op = IOOP_FIRST; io_op < IOOP_NUM_TYPES; io_op++)
+ {
+ if ((!bktype_tracked || !pgstat_tracks_io_op(bktype, io_context, io_object, io_op)) &&
+ backend_io->data[io_context][io_object][io_op] != 0)
+ return false;
+ }
+ }
+ }
+
+ return true;
+}
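For readers following the validation logic above, here is a standalone sketch of the same "every untracked counter must be zero" check, using simplified two-dimensional enums and a stub tracking predicate (all names here are invented for illustration; they are not PostgreSQL's actual definitions):

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins for the real IOContext/IOOp enums (hypothetical). */
typedef enum { CTX_BULKREAD, CTX_NORMAL, CTX_NUM } Ctx;
typedef enum { OP_READ, OP_WRITE, OP_NUM } Op;

/* Stub tracking predicate: pretend bulkread contexts never count writes. */
static bool tracks(Ctx ctx, Op op)
{
    if (ctx == CTX_BULKREAD && op == OP_WRITE)
        return false;
    return true;
}

/*
 * Mirrors the shape of pgstat_bktype_io_stats_valid(): iterate every
 * (context, op) cell and require that untracked cells hold zero.
 */
static bool stats_valid(long counters[CTX_NUM][OP_NUM])
{
    for (Ctx ctx = 0; ctx < CTX_NUM; ctx++)
        for (Op op = 0; op < OP_NUM; op++)
            if (!tracks(ctx, op) && counters[ctx][op] != 0)
                return false;
    return true;
}

/* Convenience checks: one clean and one corrupted counter set. */
static bool demo_clean_is_valid(void)
{
    long c[CTX_NUM][OP_NUM] = {{4, 0}, {2, 9}};     /* bulkread write == 0 */
    return stats_valid(c);
}

static bool demo_corrupt_is_valid(void)
{
    long c[CTX_NUM][OP_NUM] = {{4, 1}, {2, 9}};     /* bulkread write != 0 */
    return stats_valid(c);
}
```

The real function adds a third dimension (IOObject) and short-circuits the whole backend type via pgstat_tracks_io_bktype(), but the invariant it enforces is the same.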
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index 2e20b93c20..f793ac1516 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -206,7 +206,7 @@ pgstat_drop_relation(Relation rel)
}
/*
- * Report that the table was just vacuumed.
+ * Report that the table was just vacuumed and flush IO statistics.
*/
void
pgstat_report_vacuum(Oid tableoid, bool shared,
@@ -258,10 +258,18 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
}
pgstat_unlock_entry(entry_ref);
+
+ /*
+ * Flush IO statistics now. pgstat_report_stat() will flush IO stats,
+ * however this will not be called until after an entire autovacuum cycle
+ * is done -- which will likely vacuum many relations -- or until the
+ * VACUUM command has processed all tables and committed.
+ */
+ pgstat_flush_io(false);
}
/*
- * Report that the table was just analyzed.
+ * Report that the table was just analyzed and flush IO statistics.
*
* Caller must provide new live- and dead-tuples estimates, as well as a
* flag indicating whether to reset the mod_since_analyze counter.
@@ -341,6 +349,9 @@ pgstat_report_analyze(Relation rel,
}
pgstat_unlock_entry(entry_ref);
+
+ /* see pgstat_report_vacuum() */
+ pgstat_flush_io(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_shmem.c b/src/backend/utils/activity/pgstat_shmem.c
index c1506b53d0..09fffd0e82 100644
--- a/src/backend/utils/activity/pgstat_shmem.c
+++ b/src/backend/utils/activity/pgstat_shmem.c
@@ -202,6 +202,10 @@ StatsShmemInit(void)
LWLockInitialize(&ctl->checkpointer.lock, LWTRANCHE_PGSTATS_DATA);
LWLockInitialize(&ctl->slru.lock, LWTRANCHE_PGSTATS_DATA);
LWLockInitialize(&ctl->wal.lock, LWTRANCHE_PGSTATS_DATA);
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ LWLockInitialize(&ctl->io.locks[i],
+ LWTRANCHE_PGSTATS_DATA);
}
else
{
diff --git a/src/backend/utils/activity/pgstat_wal.c b/src/backend/utils/activity/pgstat_wal.c
index e7a82b5fed..e8598b2f4e 100644
--- a/src/backend/utils/activity/pgstat_wal.c
+++ b/src/backend/utils/activity/pgstat_wal.c
@@ -34,7 +34,7 @@ static WalUsage prevWalUsage;
/*
* Calculate how much WAL usage counters have increased and update
- * shared statistics.
+ * shared WAL and IO statistics.
*
* Must be called by processes that generate WAL, that do not call
* pgstat_report_stat(), like walwriter.
@@ -43,6 +43,8 @@ void
pgstat_report_wal(bool force)
{
pgstat_flush_wal(force);
+
+ pgstat_flush_io(force);
}
/*
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 58bd1360b9..5c8e8336bf 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1593,6 +1593,8 @@ pg_stat_reset_shared(PG_FUNCTION_ARGS)
pgstat_reset_of_kind(PGSTAT_KIND_BGWRITER);
pgstat_reset_of_kind(PGSTAT_KIND_CHECKPOINTER);
}
+ else if (strcmp(target, "io") == 0)
+ pgstat_reset_of_kind(PGSTAT_KIND_IO);
else if (strcmp(target, "recovery_prefetch") == 0)
XLogPrefetchResetStats();
else if (strcmp(target, "wal") == 0)
@@ -1601,7 +1603,7 @@ pg_stat_reset_shared(PG_FUNCTION_ARGS)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("unrecognized reset target: \"%s\"", target),
- errhint("Target must be \"archiver\", \"bgwriter\", \"recovery_prefetch\", or \"wal\".")));
+ errhint("Target must be \"archiver\", \"bgwriter\", \"io\", \"recovery_prefetch\", or \"wal\".")));
PG_RETURN_VOID();
}
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 96b3a1e1a0..c309e0233d 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -332,6 +332,8 @@ typedef enum BackendType
B_WAL_WRITER,
} BackendType;
+#define BACKEND_NUM_TYPES (B_WAL_WRITER + 1)
+
extern PGDLLIMPORT BackendType MyBackendType;
extern const char *GetBackendTypeDesc(BackendType backendType);
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 5e3326a3b9..f2a66ed7fb 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -48,6 +48,7 @@ typedef enum PgStat_Kind
PGSTAT_KIND_ARCHIVER,
PGSTAT_KIND_BGWRITER,
PGSTAT_KIND_CHECKPOINTER,
+ PGSTAT_KIND_IO,
PGSTAT_KIND_SLRU,
PGSTAT_KIND_WAL,
} PgStat_Kind;
@@ -276,6 +277,55 @@ typedef struct PgStat_CheckpointerStats
PgStat_Counter buf_fsync_backend;
} PgStat_CheckpointerStats;
+
+/*
+ * Types related to counting IO operations
+ */
+typedef enum IOContext
+{
+ IOCONTEXT_BULKREAD,
+ IOCONTEXT_BULKWRITE,
+ IOCONTEXT_NORMAL,
+ IOCONTEXT_VACUUM,
+} IOContext;
+
+#define IOCONTEXT_FIRST IOCONTEXT_BULKREAD
+#define IOCONTEXT_NUM_TYPES (IOCONTEXT_VACUUM + 1)
+
+typedef enum IOObject
+{
+ IOOBJECT_RELATION,
+ IOOBJECT_TEMP_RELATION,
+} IOObject;
+
+#define IOOBJECT_FIRST IOOBJECT_RELATION
+#define IOOBJECT_NUM_TYPES (IOOBJECT_TEMP_RELATION + 1)
+
+typedef enum IOOp
+{
+ IOOP_EVICT,
+ IOOP_EXTEND,
+ IOOP_FSYNC,
+ IOOP_READ,
+ IOOP_REUSE,
+ IOOP_WRITE,
+} IOOp;
+
+#define IOOP_FIRST IOOP_EVICT
+#define IOOP_NUM_TYPES (IOOP_WRITE + 1)
+
+typedef struct PgStat_BackendIO
+{
+ PgStat_Counter data[IOCONTEXT_NUM_TYPES][IOOBJECT_NUM_TYPES][IOOP_NUM_TYPES];
+} PgStat_BackendIO;
+
+typedef struct PgStat_IO
+{
+ TimestampTz stat_reset_timestamp;
+ PgStat_BackendIO stats[BACKEND_NUM_TYPES];
+} PgStat_IO;
+
+
typedef struct PgStat_StatDBEntry
{
PgStat_Counter xact_commit;
@@ -453,6 +503,22 @@ extern void pgstat_report_checkpointer(void);
extern PgStat_CheckpointerStats *pgstat_fetch_stat_checkpointer(void);
+/*
+ * Functions in pgstat_io.c
+ */
+
+extern void pgstat_count_io_op(IOOp io_op, IOObject io_object, IOContext io_context);
+extern PgStat_IO *pgstat_fetch_stat_io(void);
+extern const char *pgstat_get_io_context_name(IOContext io_context);
+extern const char *pgstat_get_io_object_name(IOObject io_object);
+
+extern bool pgstat_tracks_io_bktype(BackendType bktype);
+extern bool pgstat_tracks_io_object(BackendType bktype,
+ IOContext io_context, IOObject io_object);
+extern bool pgstat_tracks_io_op(BackendType bktype, IOContext io_context,
+ IOObject io_object, IOOp io_op);
+
+
/*
* Functions in pgstat_database.c
*/
diff --git a/src/include/utils/pgstat_internal.h b/src/include/utils/pgstat_internal.h
index 12fd51f1ae..bf8e4c3b8b 100644
--- a/src/include/utils/pgstat_internal.h
+++ b/src/include/utils/pgstat_internal.h
@@ -329,6 +329,17 @@ typedef struct PgStatShared_Checkpointer
PgStat_CheckpointerStats reset_offset;
} PgStatShared_Checkpointer;
+/* shared version of PgStat_IO */
+typedef struct PgStatShared_IO
+{
+ /*
+ * locks[i] protects stats.stats[i]. locks[0] also protects
+ * stats.stat_reset_timestamp.
+ */
+ LWLock locks[BACKEND_NUM_TYPES];
+ PgStat_IO stats;
+} PgStatShared_IO;
+
typedef struct PgStatShared_SLRU
{
/* lock protects ->stats */
@@ -419,6 +430,7 @@ typedef struct PgStat_ShmemControl
PgStatShared_Archiver archiver;
PgStatShared_BgWriter bgwriter;
PgStatShared_Checkpointer checkpointer;
+ PgStatShared_IO io;
PgStatShared_SLRU slru;
PgStatShared_Wal wal;
} PgStat_ShmemControl;
@@ -442,6 +454,8 @@ typedef struct PgStat_Snapshot
PgStat_CheckpointerStats checkpointer;
+ PgStat_IO io;
+
PgStat_SLRUStats slru[SLRU_NUM_ELEMENTS];
PgStat_WalStats wal;
@@ -549,6 +563,17 @@ extern void pgstat_database_reset_timestamp_cb(PgStatShared_Common *header, Time
extern bool pgstat_function_flush_cb(PgStat_EntryRef *entry_ref, bool nowait);
+/*
+ * Functions in pgstat_io.c
+ */
+
+extern void pgstat_io_reset_all_cb(TimestampTz ts);
+extern void pgstat_io_snapshot_cb(void);
+extern bool pgstat_flush_io(bool nowait);
+extern bool pgstat_bktype_io_stats_valid(PgStat_BackendIO *context_ops,
+ BackendType bktype);
+
+
/*
* Functions in pgstat_relation.c
*/
@@ -643,6 +668,13 @@ extern void pgstat_create_transactional(PgStat_Kind kind, Oid dboid, Oid objoid)
extern PGDLLIMPORT PgStat_LocalState pgStatLocal;
+/*
+ * Variables in pgstat_io.c
+ */
+
+extern PGDLLIMPORT bool have_iostats;
+
+
/*
* Variables in pgstat_slru.c
*/
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 23bafec5f7..7b66b1bc89 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1106,7 +1106,10 @@ ID
INFIX
INT128
INTERFACE_INFO
+IOContext
IOFuncSelector
+IOObject
+IOOp
IPCompareMethod
ITEM
IV
@@ -2016,6 +2019,7 @@ PgStatShared_Common
PgStatShared_Database
PgStatShared_Function
PgStatShared_HashEntry
+PgStatShared_IO
PgStatShared_Relation
PgStatShared_ReplSlot
PgStatShared_SLRU
@@ -2033,6 +2037,8 @@ PgStat_FetchConsistency
PgStat_FunctionCallUsage
PgStat_FunctionCounts
PgStat_HashKey
+PgStat_IO
+PgStat_BackendIO
PgStat_Kind
PgStat_KindInfo
PgStat_LocalState
--
2.34.1
On Thu, Jan 12, 2023 at 09:19:36PM -0500, Melanie Plageman wrote:
On Wed, Jan 11, 2023 at 4:58 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
Subject: [PATCH v45 4/5] Add system view tracking IO ops per backend type
The patch can/will fail with:
CREATE TABLESPACE test_io_shared_stats_tblspc LOCATION '';
+WARNING: tablespaces created by regression test cases should have names starting with "regress_"

CREATE TABLESPACE test_stats LOCATION '';
+WARNING: tablespaces created by regression test cases should have names starting with "regress_"

(I already sent patches to address the omission in cirrus.yml)
Thanks. I've fixed this
I make a tablespace in amcheck -- are there recommendations for naming
tablespaces in contrib also?
That's the test_stats one I mentioned.
Check with -DENFORCE_REGRESSION_TEST_NAME_RESTRICTIONS
+ <literal>bulkread</literal>: Qualifying large read I/O operations
+ done outside of shared buffers, for example, a sequential scan of a
+ large table.

I don't think it's correct to say that it's "outside of" shared-buffers.
I suppose "outside of" gives the wrong idea. But I need to make clear
that this I/O is to and from buffers which are not a part of shared
buffers right now -- they may still be accessible from the same data
structures which access shared buffers but they are currently being used
in a different way.
This would be a good place to link to a description of the ringbuffer,
if we had one.
s/Qualifying/Certain/
I feel like qualifying is more specific than certain, but I would be open
to changing it if there was a specific reason you don't like it.
I suggested to change it because at first I started to interpret it as
"The act of qualifying large I/O ops .." rather than "Large I/O ops that
qualify..".
+ Number of read operations of <varname>op_bytes</varname> size.
This is still a bit too easy to misinterpret as being in units of bytes.
I suggest: Number of read operations (which are each of the size
specified in >op_bytes<).
+ in order to add the shared buffer to a separate size-limited ring buffer
separate comma
+ More information on configuring checkpointer can be found in Section 30.5.
*the* checkpointer (as in the following paragraph)
+ <varname>backend_type</varname> <literal>checkpointer</literal> and
+ <varname>io_object</varname> <literal>temp relation</literal>.
+ </para>
I still think it's a bit hard to understand the <varname>s adjacent to
<literal>s.
+ Some backend_types
+ in some io_contexts
+ on some io_objects
+ in certain io_contexts
+ on certain io_objects
Maybe these should not use underscores: Some backend types never
perform I/O operations in some I/O contexts and/or on some I/O objects.
+ for (BackendType bktype = B_INVALID; bktype < BACKEND_NUM_TYPES; bktype++)
+ for (IOContext io_context = IOCONTEXT_BULKREAD; io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ for (IOObject io_obj = IOOBJECT_RELATION; io_obj < IOOBJECT_NUM_TYPES; io_obj++)
+ for (IOOp io_op = IOOP_EVICT; io_op < IOOP_NUM_TYPES; io_op++)
These look a bit fragile due to starting at some hardcoded "first"
value. In other places you use "FIRST" symbols:
+ for (IOContext io_context = IOCONTEXT_FIRST; io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ for (IOObject io_object = IOOBJECT_FIRST; io_object < IOOBJECT_NUM_TYPES; io_object++)
+ for (IOOp io_op = IOOP_FIRST; io_op < IOOP_NUM_TYPES; io_op++)
I think that's marginally better, but I think having to define both
FIRST and NUM is excessive and doesn't make it less fragile. Not sure
what anyone else will say, but I'd prefer if it started at "0".
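The two iteration idioms under discussion can be compared in isolation; a minimal sketch with a hypothetical enum following the same _FIRST/_NUM_TYPES pattern (names invented for illustration):

```c
#include <assert.h>

/* Hypothetical enum mirroring the IOOp pattern being discussed. */
typedef enum { MYOP_EVICT, MYOP_EXTEND, MYOP_WRITE } MyOp;

#define MYOP_FIRST MYOP_EVICT
#define MYOP_NUM_TYPES (MYOP_WRITE + 1)

/* Idiom 1: iterate from a named _FIRST symbol. */
static int count_from_first(void)
{
    int n = 0;
    for (MyOp op = MYOP_FIRST; op < MYOP_NUM_TYPES; op++)
        n++;
    return n;
}

/* Idiom 2: rely on the first enumerator being 0 (the preference above). */
static int count_from_zero(void)
{
    int n = 0;
    for (int op = 0; op < MYOP_NUM_TYPES; op++)
        n++;
    return n;
}
```

Both loops visit the same three values; the trade-off is purely between the extra macro definition and the implicit assumption that the enum starts at zero and has no gaps.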
Thanks for working on this - I'm looking forward to updating my rrdtool
script for this soon. It'll be nice to finally distinguish huge number
of "backend ringbuffer writes during ALTER" from other backend writes.
Currently, that makes it look like something is terribly wrong.
--
Justin
Attached is v47.
On Fri, Jan 13, 2023 at 12:23 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
On Thu, Jan 12, 2023 at 09:19:36PM -0500, Melanie Plageman wrote:
On Wed, Jan 11, 2023 at 4:58 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
Subject: [PATCH v45 4/5] Add system view tracking IO ops per backend type
The patch can/will fail with:
CREATE TABLESPACE test_io_shared_stats_tblspc LOCATION '';
+WARNING: tablespaces created by regression test cases should have names starting with "regress_"

CREATE TABLESPACE test_stats LOCATION '';
+WARNING: tablespaces created by regression test cases should have names starting with "regress_"

(I already sent patches to address the omission in cirrus.yml)
Thanks. I've fixed this
I make a tablespace in amcheck -- are there recommendations for naming
tablespaces in contrib also?

That's the test_stats one I mentioned.
Check with -DENFORCE_REGRESSION_TEST_NAME_RESTRICTIONS
Thanks. I have now changed both tablespace names and checked using that
macro.
+ <literal>bulkread</literal>: Qualifying large read I/O operations
+ done outside of shared buffers, for example, a sequential scan of a
+ large table.

I don't think it's correct to say that it's "outside of" shared-buffers.
I suppose "outside of" gives the wrong idea. But I need to make clear
that this I/O is to and from buffers which are not a part of shared
buffers right now -- they may still be accessible from the same data
structures which access shared buffers but they are currently being used
in a different way.

This would be a good place to link to a description of the ringbuffer,
if we had one.
Indeed.
s/Qualifying/Certain/
I feel like qualifying is more specific than certain, but I would be open
to changing it if there was a specific reason you don't like it.

I suggested to change it because at first I started to interpret it as
"The act of qualifying large I/O ops .." rather than "Large I/O ops that
qualify..".
I have changed it to "certain".
+ Number of read operations of <varname>op_bytes</varname> size.
This is still a bit too easy to misinterpret as being in units of bytes.
I suggest: Number of read operations (which are each of the size
specified in >op_bytes<).
I have changed this.
+ in order to add the shared buffer to a separate size-limited ring buffer
separate comma
+ More information on configuring checkpointer can be found in Section 30.5.
*the* checkpointer (as in the following paragraph)
above items changed.
+ <varname>backend_type</varname> <literal>checkpointer</literal> and
+ <varname>io_object</varname> <literal>temp relation</literal>.
+ </para>

I still think it's a bit hard to understand the <varname>s adjacent to
<literal>s.
I agree it isn't great -- is there a different XML tag you suggest
instead of literal?
+ Some backend_types
+ in some io_contexts
+ on some io_objects
+ in certain io_contexts
+ on certain io_objects

Maybe these should not use underscores: Some backend types never
perform I/O operations in some I/O contexts and/or on some I/O objects.
I've changed this.
Also, taking another look, I forgot to update the docs' column name
tenses in the last version. That is now done.
+ for (BackendType bktype = B_INVALID; bktype < BACKEND_NUM_TYPES; bktype++)
+ for (IOContext io_context = IOCONTEXT_BULKREAD; io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ for (IOObject io_obj = IOOBJECT_RELATION; io_obj < IOOBJECT_NUM_TYPES; io_obj++)
+ for (IOOp io_op = IOOP_EVICT; io_op < IOOP_NUM_TYPES; io_op++)

These look a bit fragile due to starting at some hardcoded "first"
value. In other places you use "FIRST" symbols:

+ for (IOContext io_context = IOCONTEXT_FIRST; io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ for (IOObject io_object = IOOBJECT_FIRST; io_object < IOOBJECT_NUM_TYPES; io_object++)
+ for (IOOp io_op = IOOP_FIRST; io_op < IOOP_NUM_TYPES; io_op++)

I think that's marginally better, but I think having to define both
FIRST and NUM is excessive and doesn't make it less fragile. Not sure
what anyone else will say, but I'd prefer if it started at "0".
Thanks for catching the discrepancy in pg_stat_get_io(). I have changed
those instances to use _FIRST.
I think that having the loop start from the first enum value (except
when that value is something special like _INVALID like with
BackendType) is confusing. I agree that having multiple macros to allow
iteration through all enum values introduces some fragility. I'm not
sure about using the number 0 with the enum as the loop variable
data type. Is that a common pattern?
In this version, I have updated the loops in pg_stat_get_io() to use
_FIRST.
Thanks for working on this - I'm looking forward to updating my rrdtool
script for this soon. It'll be nice to finally distinguish huge number
of "backend ringbuffer writes during ALTER" from other backend writes.
Currently, that makes it look like something is terribly wrong.
Cool! I'm glad to know you will use it.
- Melanie
Attachments:
v47-0003-pgstat-Count-IO-for-relations.patch (text/x-patch; charset=US-ASCII)
From ed7b2975732d599d8c809e21c2e33554c77ffc43 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 4 Jan 2023 17:20:50 -0500
Subject: [PATCH v47 3/5] pgstat: Count IO for relations
Count IOOps done on IOObjects in IOContexts by various BackendTypes
using the IO stats infrastructure introduced by a previous commit.
The primary concern of these statistics is IO operations on data blocks
during the course of normal database operations. IO operations done by,
for example, the archiver or syslogger are not counted in these
statistics. WAL IO, temporary file IO, and IO done directly through smgr*
functions (such as when building an index) are not yet counted but would
be useful future additions.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
src/backend/storage/buffer/bufmgr.c | 109 ++++++++++++++++++++++----
src/backend/storage/buffer/freelist.c | 58 ++++++++++----
src/backend/storage/buffer/localbuf.c | 13 ++-
src/backend/storage/smgr/md.c | 25 ++++++
src/include/storage/buf_internals.h | 8 +-
src/include/storage/bufmgr.h | 7 +-
6 files changed, 184 insertions(+), 36 deletions(-)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 8075828e8a..47d2b4c522 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -481,8 +481,9 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
ForkNumber forkNum,
BlockNumber blockNum,
BufferAccessStrategy strategy,
- bool *foundPtr);
-static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+ bool *foundPtr, IOContext *io_context);
+static void FlushBuffer(BufferDesc *buf, SMgrRelation reln,
+ IOContext io_context, IOObject io_object);
static void FindAndDropRelationBuffers(RelFileLocator rlocator,
ForkNumber forkNum,
BlockNumber nForkBlock,
@@ -823,6 +824,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
BufferDesc *bufHdr;
Block bufBlock;
bool found;
+ IOContext io_context;
+ IOObject io_object;
bool isExtend;
bool isLocalBuf = SmgrIsTemp(smgr);
@@ -855,7 +858,14 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (isLocalBuf)
{
- bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
+ /*
+ * LocalBufferAlloc() will set the io_context to IOCONTEXT_NORMAL. We
+ * do not use a BufferAccessStrategy for I/O of temporary tables.
+ * However, in some cases, the "strategy" may not be NULL, so we can't
+ * rely on IOContextForStrategy() to set the right IOContext for us.
+ * This may happen in cases like CREATE TEMPORARY TABLE AS...
+ */
+ bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found, &io_context);
if (found)
pgBufferUsage.local_blks_hit++;
else if (isExtend)
@@ -871,7 +881,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* not currently in memory.
*/
bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum,
- strategy, &found);
+ strategy, &found, &io_context);
if (found)
pgBufferUsage.shared_blks_hit++;
else if (isExtend)
@@ -986,7 +996,16 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
*/
Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID)); /* spinlock not needed */
- bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
+ if (isLocalBuf)
+ {
+ bufBlock = LocalBufHdrGetBlock(bufHdr);
+ io_object = IOOBJECT_TEMP_RELATION;
+ }
+ else
+ {
+ bufBlock = BufHdrGetBlock(bufHdr);
+ io_object = IOOBJECT_RELATION;
+ }
if (isExtend)
{
@@ -995,6 +1014,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
/* don't set checksum for all-zero page */
smgrextend(smgr, forkNum, blockNum, (char *) bufBlock, false);
+ pgstat_count_io_op(IOOP_EXTEND, io_object, io_context);
+
/*
* NB: we're *not* doing a ScheduleBufferTagForWriteback here;
* although we're essentially performing a write. At least on linux
@@ -1020,6 +1041,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
smgrread(smgr, forkNum, blockNum, (char *) bufBlock);
+ pgstat_count_io_op(IOOP_READ, io_object, io_context);
+
if (track_io_timing)
{
INSTR_TIME_SET_CURRENT(io_time);
@@ -1113,14 +1136,19 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* *foundPtr is actually redundant with the buffer's BM_VALID flag, but
* we keep it for simplicity in ReadBuffer.
*
+ * io_context is passed as an output parameter to avoid calling
+ * IOContextForStrategy() when there is a shared buffers hit and no IO
+ * statistics need be captured.
+ *
* No locks are held either at entry or exit.
*/
static BufferDesc *
BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
BlockNumber blockNum,
BufferAccessStrategy strategy,
- bool *foundPtr)
+ bool *foundPtr, IOContext *io_context)
{
+ bool from_ring;
BufferTag newTag; /* identity of requested block */
uint32 newHash; /* hash value for newTag */
LWLock *newPartitionLock; /* buffer partition lock for it */
@@ -1172,8 +1200,11 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
{
/*
* If we get here, previous attempts to read the buffer must
- * have failed ... but we shall bravely try again.
+ * have failed ... but we shall bravely try again. Set
+ * io_context since we will in fact need to count an IO
+ * Operation.
*/
+ *io_context = IOContextForStrategy(strategy);
*foundPtr = false;
}
}
@@ -1187,6 +1218,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
*/
LWLockRelease(newPartitionLock);
+ *io_context = IOContextForStrategy(strategy);
+
/* Loop here in case we have to try another victim buffer */
for (;;)
{
@@ -1200,7 +1233,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* Select a victim buffer. The buffer is returned with its header
* spinlock still held!
*/
- buf = StrategyGetBuffer(strategy, &buf_state);
+ buf = StrategyGetBuffer(strategy, &buf_state, &from_ring);
Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
@@ -1254,7 +1287,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
UnlockBufHdr(buf, buf_state);
if (XLogNeedsFlush(lsn) &&
- StrategyRejectBuffer(strategy, buf))
+ StrategyRejectBuffer(strategy, buf, from_ring))
{
/* Drop lock/pin and loop around for another buffer */
LWLockRelease(BufferDescriptorGetContentLock(buf));
@@ -1269,7 +1302,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
smgr->smgr_rlocator.locator.dbOid,
smgr->smgr_rlocator.locator.relNumber);
- FlushBuffer(buf, NULL);
+ FlushBuffer(buf, NULL, *io_context, IOOBJECT_RELATION);
LWLockRelease(BufferDescriptorGetContentLock(buf));
ScheduleBufferTagForWriteback(&BackendWritebackContext,
@@ -1441,6 +1474,28 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
UnlockBufHdr(buf, buf_state);
+ if (oldFlags & BM_VALID)
+ {
+ /*
+ * When a BufferAccessStrategy is in use, blocks evicted from shared
+ * buffers are counted as IOOP_EVICT in the corresponding context
+ * (e.g. IOCONTEXT_BULKWRITE). Shared buffers are evicted by a
+ * strategy in two cases: 1) while initially claiming buffers for the
+ * strategy ring 2) to replace an existing strategy ring buffer
+ * because it is pinned or in use and cannot be reused.
+ *
+ * Blocks evicted from buffers already in the strategy ring are
+ * counted as IOOP_REUSE in the corresponding strategy context.
+ *
+ * At this point, we can accurately count evictions and reuses,
+ * because we have successfully claimed the valid buffer. Previously,
+ * we may have been forced to release the buffer due to concurrent
+ * pinners or erroring out.
+ */
+ pgstat_count_io_op(from_ring ? IOOP_REUSE : IOOP_EVICT,
+ IOOBJECT_RELATION, *io_context);
+ }
+
if (oldPartitionLock != NULL)
{
BufTableDelete(&oldTag, oldHash);
@@ -2570,7 +2625,7 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOCONTEXT_NORMAL, IOOBJECT_RELATION);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
@@ -2820,7 +2875,7 @@ BufferGetTag(Buffer buffer, RelFileLocator *rlocator, ForkNumber *forknum,
* as the second parameter. If not, pass NULL.
*/
static void
-FlushBuffer(BufferDesc *buf, SMgrRelation reln)
+FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOContext io_context, IOObject io_object)
{
XLogRecPtr recptr;
ErrorContextCallback errcallback;
@@ -2912,6 +2967,26 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
bufToWrite,
false);
+ /*
+ * When a strategy is in use, only flushes of dirty buffers already in the
+ * strategy ring are counted as strategy writes (IOCONTEXT
+ * [BULKREAD|BULKWRITE|VACUUM] IOOP_WRITE) for the purpose of IO
+ * statistics tracking.
+ *
+ * If a shared buffer initially added to the ring must be flushed before
+ * being used, this is counted as an IOCONTEXT_NORMAL IOOP_WRITE.
+ *
+ * If a shared buffer which was added to the ring later because the
+ * current strategy buffer is pinned or in use or because all strategy
+ * buffers were dirty and rejected (for BAS_BULKREAD operations only)
+ * requires flushing, this is counted as an IOCONTEXT_NORMAL IOOP_WRITE
+ * (from_ring will be false).
+ *
+ * When a strategy is not in use, the write can only be a "regular" write
+ * of a dirty shared buffer (IOCONTEXT_NORMAL IOOP_WRITE).
+ */
+ pgstat_count_io_op(IOOP_WRITE, IOOBJECT_RELATION, io_context);
+
if (track_io_timing)
{
INSTR_TIME_SET_CURRENT(io_time);
@@ -3554,6 +3629,8 @@ FlushRelationBuffers(Relation rel)
buf_state &= ~(BM_DIRTY | BM_JUST_DIRTIED);
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+ pgstat_count_io_op(IOOP_WRITE, IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL);
+
/* Pop the error context stack */
error_context_stack = errcallback.previous;
}
@@ -3586,7 +3663,7 @@ FlushRelationBuffers(Relation rel)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, RelationGetSmgr(rel));
+ FlushBuffer(bufHdr, RelationGetSmgr(rel), IOCONTEXT_NORMAL, IOOBJECT_RELATION);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr);
}
@@ -3684,7 +3761,7 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, srelent->srel);
+ FlushBuffer(bufHdr, srelent->srel, IOCONTEXT_NORMAL, IOOBJECT_RELATION);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr);
}
@@ -3894,7 +3971,7 @@ FlushDatabaseBuffers(Oid dbid)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOCONTEXT_NORMAL, IOOBJECT_RELATION);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr);
}
@@ -3921,7 +3998,7 @@ FlushOneBuffer(Buffer buffer)
Assert(LWLockHeldByMe(BufferDescriptorGetContentLock(bufHdr)));
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOCONTEXT_NORMAL, IOOBJECT_RELATION);
}
/*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 7dec35801c..c690d5f15f 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -15,6 +15,7 @@
*/
#include "postgres.h"
+#include "pgstat.h"
#include "port/atomics.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
@@ -81,12 +82,6 @@ typedef struct BufferAccessStrategyData
*/
int current;
- /*
- * True if the buffer just returned by StrategyGetBuffer had been in the
- * ring already.
- */
- bool current_was_in_ring;
-
/*
* Array of buffer numbers. InvalidBuffer (that is, zero) indicates we
* have not yet selected a buffer for this ring slot. For allocation
@@ -198,13 +193,15 @@ have_free_buffer(void)
* return the buffer with the buffer header spinlock still held.
*/
BufferDesc *
-StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
+StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_ring)
{
BufferDesc *buf;
int bgwprocno;
int trycounter;
uint32 local_buf_state; /* to avoid repeated (de-)referencing */
+ *from_ring = false;
+
/*
* If given a strategy object, see whether it can select a buffer. We
* assume strategy objects don't need buffer_strategy_lock.
@@ -213,7 +210,10 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
{
buf = GetBufferFromRing(strategy, buf_state);
if (buf != NULL)
+ {
+ *from_ring = true;
return buf;
+ }
}
/*
@@ -602,7 +602,7 @@ FreeAccessStrategy(BufferAccessStrategy strategy)
/*
* GetBufferFromRing -- returns a buffer from the ring, or NULL if the
- * ring is empty.
+ * ring is empty / not usable.
*
* The bufhdr spin lock is held on the returned buffer.
*/
@@ -625,10 +625,7 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
*/
bufnum = strategy->buffers[strategy->current];
if (bufnum == InvalidBuffer)
- {
- strategy->current_was_in_ring = false;
return NULL;
- }
/*
* If the buffer is pinned we cannot use it under any circumstances.
@@ -644,7 +641,6 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
if (BUF_STATE_GET_REFCOUNT(local_buf_state) == 0
&& BUF_STATE_GET_USAGECOUNT(local_buf_state) <= 1)
{
- strategy->current_was_in_ring = true;
*buf_state = local_buf_state;
return buf;
}
@@ -654,7 +650,6 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
* Tell caller to allocate a new buffer with the normal allocation
* strategy. He'll then replace this ring element via AddBufferToRing.
*/
- strategy->current_was_in_ring = false;
return NULL;
}
@@ -670,6 +665,39 @@ AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
strategy->buffers[strategy->current] = BufferDescriptorGetBuffer(buf);
}
+/*
+ * Utility function returning the IOContext of a given BufferAccessStrategy's
+ * strategy ring.
+ */
+IOContext
+IOContextForStrategy(BufferAccessStrategy strategy)
+{
+ if (!strategy)
+ return IOCONTEXT_NORMAL;
+
+ switch (strategy->btype)
+ {
+ case BAS_NORMAL:
+
+ /*
+ * Currently, GetAccessStrategy() returns NULL for
+ * BufferAccessStrategyType BAS_NORMAL, so this case is
+ * unreachable.
+ */
+ pg_unreachable();
+ return IOCONTEXT_NORMAL;
+ case BAS_BULKREAD:
+ return IOCONTEXT_BULKREAD;
+ case BAS_BULKWRITE:
+ return IOCONTEXT_BULKWRITE;
+ case BAS_VACUUM:
+ return IOCONTEXT_VACUUM;
+ }
+
+ elog(ERROR, "unrecognized BufferAccessStrategyType: %d", strategy->btype);
+ pg_unreachable();
+}
+
/*
* StrategyRejectBuffer -- consider rejecting a dirty buffer
*
@@ -682,14 +710,14 @@ AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
* if this buffer should be written and re-used.
*/
bool
-StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
+StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool from_ring)
{
/* We only do this in bulkread mode */
if (strategy->btype != BAS_BULKREAD)
return false;
/* Don't muck with behavior of normal buffer-replacement strategy */
- if (!strategy->current_was_in_ring ||
+ if (!from_ring ||
strategy->buffers[strategy->current] != BufferDescriptorGetBuffer(buf))
return false;
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 8372acc383..2108bbe7d8 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -18,6 +18,7 @@
#include "access/parallel.h"
#include "catalog/catalog.h"
#include "executor/instrument.h"
+#include "pgstat.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "utils/guc_hooks.h"
@@ -107,7 +108,7 @@ PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
*/
BufferDesc *
LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
- bool *foundPtr)
+ bool *foundPtr, IOContext *io_context)
{
BufferTag newTag; /* identity of requested block */
LocalBufferLookupEnt *hresult;
@@ -127,6 +128,14 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
hresult = (LocalBufferLookupEnt *)
hash_search(LocalBufHash, (void *) &newTag, HASH_FIND, NULL);
+ /*
+ * IO operations on local buffers are only done in IOCONTEXT_NORMAL. Set
+ * io_context here (rather than deferring it until after a buffer hit would
+ * have returned) for convenience, since we then don't need to worry about
+ * the overhead of calling IOContextForStrategy().
+ */
+ *io_context = IOCONTEXT_NORMAL;
+
if (hresult)
{
b = hresult->id;
@@ -230,6 +239,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
buf_state &= ~BM_DIRTY;
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+ pgstat_count_io_op(IOOP_WRITE, IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL);
pgBufferUsage.local_blks_written++;
}
@@ -256,6 +266,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
ClearBufferTag(&bufHdr->tag);
buf_state &= ~(BM_VALID | BM_TAG_VALID);
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+ pgstat_count_io_op(IOOP_EVICT, IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL);
}
hresult = (LocalBufferLookupEnt *)
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 60c9905eff..37bae4bf73 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -983,6 +983,15 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
{
MdfdVec *v = &reln->md_seg_fds[forknum][segno - 1];
+ /*
+ * fsyncs done through mdimmedsync() should be tracked in a separate
+ * IOContext from those done through mdsyncfiletag(), to differentiate
+ * between unavoidable client backend fsyncs (e.g. those done during
+ * index build) and those which ideally would have been done by the
+ * checkpointer. Since other IO operations that bypass the buffer
+ * manager could also be tracked in such an IOContext, defer counting
+ * immediate fsyncs until those are tracked as well.
if (FileSync(v->mdfd_vfd, WAIT_EVENT_DATA_FILE_IMMEDIATE_SYNC) < 0)
ereport(data_sync_elevel(ERROR),
(errcode_for_file_access(),
@@ -1021,6 +1030,19 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false /* retryOnError */ ))
{
+ /*
+ * We have no way of knowing if the current IOContext is
+ * IOCONTEXT_NORMAL or IOCONTEXT_[BULKREAD, BULKWRITE, VACUUM] at this
+ * point, so count the fsync as being in the IOCONTEXT_NORMAL
+ * IOContext. This is probably okay, because the number of backend
+ * fsyncs doesn't say anything about the efficacy of the
+ * BufferAccessStrategy. And counting both fsyncs done in
+ * IOCONTEXT_NORMAL and IOCONTEXT_[BULKREAD, BULKWRITE, VACUUM] under
+ * IOCONTEXT_NORMAL is likely clearer when investigating the number of
+ * backend fsyncs.
+ */
+ pgstat_count_io_op(IOOP_FSYNC, IOOBJECT_RELATION, IOCONTEXT_NORMAL);
+
ereport(DEBUG1,
(errmsg_internal("could not forward fsync request because request queue is full")));
@@ -1410,6 +1432,9 @@ mdsyncfiletag(const FileTag *ftag, char *path)
if (need_to_close)
FileClose(file);
+ if (result >= 0)
+ pgstat_count_io_op(IOOP_FSYNC, IOOBJECT_RELATION, IOCONTEXT_NORMAL);
+
errno = save_errno;
return result;
}
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index ed8aa2519c..0b44814740 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -15,6 +15,7 @@
#ifndef BUFMGR_INTERNALS_H
#define BUFMGR_INTERNALS_H
+#include "pgstat.h"
#include "port/atomics.h"
#include "storage/buf.h"
#include "storage/bufmgr.h"
@@ -391,11 +392,12 @@ extern void IssuePendingWritebacks(WritebackContext *context);
extern void ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *tag);
/* freelist.c */
+extern IOContext IOContextForStrategy(BufferAccessStrategy bas);
extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
- uint32 *buf_state);
+ uint32 *buf_state, bool *from_ring);
extern void StrategyFreeBuffer(BufferDesc *buf);
extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
- BufferDesc *buf);
+ BufferDesc *buf, bool from_ring);
extern int StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
extern void StrategyNotifyBgWriter(int bgwprocno);
@@ -417,7 +419,7 @@ extern PrefetchBufferResult PrefetchLocalBuffer(SMgrRelation smgr,
ForkNumber forkNum,
BlockNumber blockNum);
extern BufferDesc *LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum,
- BlockNumber blockNum, bool *foundPtr);
+ BlockNumber blockNum, bool *foundPtr, IOContext *io_context);
extern void MarkLocalBufferDirty(Buffer buffer);
extern void DropRelationLocalBuffers(RelFileLocator rlocator,
ForkNumber forkNum,
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 33eadbc129..b8a18b8081 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -23,7 +23,12 @@
typedef void *Block;
-/* Possible arguments for GetAccessStrategy() */
+/*
+ * Possible arguments for GetAccessStrategy().
+ *
+ * If adding a new BufferAccessStrategyType, also add a new IOContext so
+ * IO statistics using this strategy are tracked.
+ */
typedef enum BufferAccessStrategyType
{
BAS_NORMAL, /* Normal random access */
--
2.34.1
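The IOContextForStrategy() mapping added in the freelist.c hunk above can be sketched standalone roughly as follows. This is a simplified illustration, not the patch's code: the enum definitions are copied from the diff, and a pointer-to-enum stands in for the real BufferAccessStrategy struct (whose `btype` field the actual function reads).

```c
#include <assert.h>
#include <stddef.h>

/* Enum names taken from the patch; values here are illustrative. */
typedef enum { BAS_NORMAL, BAS_BULKREAD, BAS_BULKWRITE, BAS_VACUUM } BufferAccessStrategyType;
typedef enum { IOCONTEXT_BULKREAD, IOCONTEXT_BULKWRITE, IOCONTEXT_NORMAL, IOCONTEXT_VACUUM } IOContext;

/*
 * Sketch of IOContextForStrategy(): a NULL strategy means ordinary
 * shared-buffer IO, i.e. IOCONTEXT_NORMAL; otherwise the strategy ring
 * type selects the matching IOContext.
 */
static IOContext
io_context_for(const BufferAccessStrategyType *btype)
{
    if (btype == NULL)
        return IOCONTEXT_NORMAL;

    switch (*btype)
    {
        case BAS_BULKREAD:
            return IOCONTEXT_BULKREAD;
        case BAS_BULKWRITE:
            return IOCONTEXT_BULKWRITE;
        case BAS_VACUUM:
            return IOCONTEXT_VACUUM;
        default:
            /* BAS_NORMAL is unreachable in the patch (strategy is NULL then) */
            return IOCONTEXT_NORMAL;
    }
}
```

In the patch itself the BAS_NORMAL arm is marked pg_unreachable(), since GetAccessStrategy() returns NULL for that type.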
Attachment: v47-0002-pgstat-Infrastructure-to-track-IO-operations.patch (text/x-patch; charset=US-ASCII)
From 28a57a119e2d74e00b329da850c75584178e6967 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 4 Jan 2023 17:20:41 -0500
Subject: [PATCH v47 2/5] pgstat: Infrastructure to track IO operations
Introduce "IOOp", an IO operation done by a backend, "IOObject", the
target object of the IO, and "IOContext", the context or location of the
IO operations on that object. For example, the checkpointer may write a
shared buffer out. This would be considered an IOOP_WRITE IOOp on an
IOOBJECT_RELATION IOObject in the IOCONTEXT_NORMAL IOContext by
BackendType B_CHECKPOINTER.
Each IOOp (evict, extend, fsync, read, reuse, and write) can be counted
per IOObject (relation, temp relation) per IOContext (normal, bulkread,
bulkwrite, or vacuum) through a call to pgstat_count_io_op().
Note that this commit introduces the infrastructure to count IO
Operation statistics. A subsequent commit will add calls to
pgstat_count_io_op() in the appropriate locations.
IOContext IOCONTEXT_NORMAL concerns operations on local and shared
buffers, while IOCONTEXT_BULKREAD, IOCONTEXT_BULKWRITE, and
IOCONTEXT_VACUUM IOContexts concern IO operations on buffers as part of
a BufferAccessStrategy.
IOObject IOOBJECT_TEMP_RELATION concerns IO Operations on buffers
containing temporary table data, while IOObject IOOBJECT_RELATION
concerns IO Operations on buffers containing permanent relation data.
Stats on IOOps on all IOObjects in all IOContexts for a given backend
are first counted in a backend's local memory and then flushed to shared
memory and accumulated with those from all other backends, exited and
live.
Some BackendTypes will not flush their pending statistics at regular
intervals and explicitly call pgstat_flush_io_ops() during the course of
normal operations to flush their backend-local IO operation statistics
to shared memory in a timely manner.
Because not all BackendType, IOOp, IOObject, IOContext combinations are
valid, the validity of the stats is checked before flushing pending
stats and before reading in the existing stats file to shared memory.
The aggregated stats in shared memory could be extended in the future
with per-backend stats -- useful for per connection IO statistics and
monitoring.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
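The pending-counter design the commit message describes (count in backend-local memory, then fold into the shared per-BackendType array under a lock) can be sketched roughly as below. This is a minimal stand-in, not the patch's code: the array dimensions mirror IOCONTEXT_NUM_TYPES x IOOBJECT_NUM_TYPES x IOOP_NUM_TYPES, and the `shared` variable stands in for the LWLock-protected shared-memory slot.

```c
#include <assert.h>
#include <string.h>

/* Illustrative sizes matching the patch's enums: 4 contexts, 2 objects, 6 ops. */
enum { N_CONTEXT = 4, N_OBJECT = 2, N_OP = 6 };

typedef struct
{
    long data[N_CONTEXT][N_OBJECT][N_OP];
} BackendIO;

static BackendIO pending;   /* backend-local: cheap, lock-free to bump */
static BackendIO shared;    /* stand-in for the locked shared-memory slot */

/* Like pgstat_count_io_op(): only bump a local counter on the hot path. */
static void
count_io(int ctx, int obj, int op)
{
    pending.data[ctx][obj][op]++;
}

/* Like pgstat_flush_io(): fold local counts into shared state, then reset. */
static void
flush_io(void)
{
    for (int c = 0; c < N_CONTEXT; c++)
        for (int o = 0; o < N_OBJECT; o++)
            for (int op = 0; op < N_OP; op++)
                shared.data[c][o][op] += pending.data[c][o][op];
    memset(&pending, 0, sizeof(pending));
}
```

The real flush additionally takes the per-BackendType LWLock (conditionally when nowait is set) and asserts pgstat_bktype_io_stats_valid() on the result.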
---
doc/src/sgml/monitoring.sgml | 2 +
src/backend/utils/activity/Makefile | 1 +
src/backend/utils/activity/meson.build | 1 +
src/backend/utils/activity/pgstat.c | 26 ++
src/backend/utils/activity/pgstat_bgwriter.c | 7 +-
.../utils/activity/pgstat_checkpointer.c | 7 +-
src/backend/utils/activity/pgstat_io.c | 377 ++++++++++++++++++
src/backend/utils/activity/pgstat_relation.c | 15 +-
src/backend/utils/activity/pgstat_shmem.c | 4 +
src/backend/utils/activity/pgstat_wal.c | 4 +-
src/backend/utils/adt/pgstatfuncs.c | 4 +-
src/include/miscadmin.h | 2 +
src/include/pgstat.h | 66 +++
src/include/utils/pgstat_internal.h | 32 ++
src/tools/pgindent/typedefs.list | 6 +
15 files changed, 548 insertions(+), 6 deletions(-)
create mode 100644 src/backend/utils/activity/pgstat_io.c
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 358d2ff90f..8d51ca3773 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -5418,6 +5418,8 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
the <structname>pg_stat_bgwriter</structname>
view, <literal>archiver</literal> to reset all the counters shown in
the <structname>pg_stat_archiver</structname> view,
+ <literal>io</literal> to reset all the counters shown in the
+ <structname>pg_stat_io</structname> view,
<literal>wal</literal> to reset all the counters shown in the
<structname>pg_stat_wal</structname> view or
<literal>recovery_prefetch</literal> to reset all the counters shown
diff --git a/src/backend/utils/activity/Makefile b/src/backend/utils/activity/Makefile
index a80eda3cf4..7d7482dde0 100644
--- a/src/backend/utils/activity/Makefile
+++ b/src/backend/utils/activity/Makefile
@@ -22,6 +22,7 @@ OBJS = \
pgstat_checkpointer.o \
pgstat_database.o \
pgstat_function.o \
+ pgstat_io.o \
pgstat_relation.o \
pgstat_replslot.o \
pgstat_shmem.o \
diff --git a/src/backend/utils/activity/meson.build b/src/backend/utils/activity/meson.build
index a2b872c24b..518ee3f798 100644
--- a/src/backend/utils/activity/meson.build
+++ b/src/backend/utils/activity/meson.build
@@ -9,6 +9,7 @@ backend_sources += files(
'pgstat_checkpointer.c',
'pgstat_database.c',
'pgstat_function.c',
+ 'pgstat_io.c',
'pgstat_relation.c',
'pgstat_replslot.c',
'pgstat_shmem.c',
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 0fa5370bcd..608c3b59da 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -72,6 +72,7 @@
* - pgstat_checkpointer.c
* - pgstat_database.c
* - pgstat_function.c
+ * - pgstat_io.c
* - pgstat_relation.c
* - pgstat_replslot.c
* - pgstat_slru.c
@@ -359,6 +360,15 @@ static const PgStat_KindInfo pgstat_kind_infos[PGSTAT_NUM_KINDS] = {
.snapshot_cb = pgstat_checkpointer_snapshot_cb,
},
+ [PGSTAT_KIND_IO] = {
+ .name = "io_ops",
+
+ .fixed_amount = true,
+
+ .reset_all_cb = pgstat_io_reset_all_cb,
+ .snapshot_cb = pgstat_io_snapshot_cb,
+ },
+
[PGSTAT_KIND_SLRU] = {
.name = "slru",
@@ -582,6 +592,7 @@ pgstat_report_stat(bool force)
/* Don't expend a clock check if nothing to do */
if (dlist_is_empty(&pgStatPending) &&
+ !have_iostats &&
!have_slrustats &&
!pgstat_have_pending_wal())
{
@@ -628,6 +639,9 @@ pgstat_report_stat(bool force)
/* flush database / relation / function / ... stats */
partial_flush |= pgstat_flush_pending_entries(nowait);
+ /* flush IO stats */
+ partial_flush |= pgstat_flush_io(nowait);
+
/* flush wal stats */
partial_flush |= pgstat_flush_wal(nowait);
@@ -1322,6 +1336,12 @@ pgstat_write_statsfile(void)
pgstat_build_snapshot_fixed(PGSTAT_KIND_CHECKPOINTER);
write_chunk_s(fpout, &pgStatLocal.snapshot.checkpointer);
+ /*
+ * Write IO stats struct
+ */
+ pgstat_build_snapshot_fixed(PGSTAT_KIND_IO);
+ write_chunk_s(fpout, &pgStatLocal.snapshot.io);
+
/*
* Write SLRU stats struct
*/
@@ -1496,6 +1516,12 @@ pgstat_read_statsfile(void)
if (!read_chunk_s(fpin, &shmem->checkpointer.stats))
goto error;
+ /*
+ * Read IO stats struct
+ */
+ if (!read_chunk_s(fpin, &shmem->io.stats))
+ goto error;
+
/*
* Read SLRU stats struct
*/
diff --git a/src/backend/utils/activity/pgstat_bgwriter.c b/src/backend/utils/activity/pgstat_bgwriter.c
index 9247f2dda2..92be384b0d 100644
--- a/src/backend/utils/activity/pgstat_bgwriter.c
+++ b/src/backend/utils/activity/pgstat_bgwriter.c
@@ -24,7 +24,7 @@ PgStat_BgWriterStats PendingBgWriterStats = {0};
/*
- * Report bgwriter statistics
+ * Report bgwriter and IO statistics
*/
void
pgstat_report_bgwriter(void)
@@ -56,6 +56,11 @@ pgstat_report_bgwriter(void)
* Clear out the statistics buffer, so it can be re-used.
*/
MemSet(&PendingBgWriterStats, 0, sizeof(PendingBgWriterStats));
+
+ /*
+ * Report IO statistics
+ */
+ pgstat_flush_io(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_checkpointer.c b/src/backend/utils/activity/pgstat_checkpointer.c
index 3e9ab45103..26dec112f6 100644
--- a/src/backend/utils/activity/pgstat_checkpointer.c
+++ b/src/backend/utils/activity/pgstat_checkpointer.c
@@ -24,7 +24,7 @@ PgStat_CheckpointerStats PendingCheckpointerStats = {0};
/*
- * Report checkpointer statistics
+ * Report checkpointer and IO statistics
*/
void
pgstat_report_checkpointer(void)
@@ -62,6 +62,11 @@ pgstat_report_checkpointer(void)
* Clear out the statistics buffer, so it can be re-used.
*/
MemSet(&PendingCheckpointerStats, 0, sizeof(PendingCheckpointerStats));
+
+ /*
+ * Report IO statistics
+ */
+ pgstat_flush_io(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
new file mode 100644
index 0000000000..f95ab9c94d
--- /dev/null
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -0,0 +1,377 @@
+/* -------------------------------------------------------------------------
+ *
+ * pgstat_io.c
+ * Implementation of IO statistics.
+ *
+ * This file contains the implementation of IO statistics. It is kept separate
+ * from pgstat.c to enforce the line between the statistics access / storage
+ * implementation and the details about individual types of statistics.
+ *
+ * Copyright (c) 2021-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/utils/activity/pgstat_io.c
+ * -------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "utils/pgstat_internal.h"
+
+
+static PgStat_BackendIO PendingIOStats;
+bool have_iostats = false;
+
+
+void
+pgstat_count_io_op(IOOp io_op, IOObject io_object, IOContext io_context)
+{
+ Assert(io_context < IOCONTEXT_NUM_TYPES);
+ Assert(io_object < IOOBJECT_NUM_TYPES);
+ Assert(io_op < IOOP_NUM_TYPES);
+ Assert(pgstat_tracks_io_op(MyBackendType, io_context, io_object, io_op));
+
+ PendingIOStats.data[io_context][io_object][io_op]++;
+
+ have_iostats = true;
+}
+
+PgStat_IO *
+pgstat_fetch_stat_io(void)
+{
+ pgstat_snapshot_fixed(PGSTAT_KIND_IO);
+
+ return &pgStatLocal.snapshot.io;
+}
+
+/*
+ * Flush out locally pending IO statistics
+ *
+ * If no stats have been recorded, this function returns false.
+ *
+ * If nowait is true, this function returns true if the lock could not be
+ * acquired. Otherwise, return false.
+ */
+bool
+pgstat_flush_io(bool nowait)
+{
+ LWLock *bktype_lock;
+ PgStat_BackendIO *bktype_shstats;
+
+ if (!have_iostats)
+ return false;
+
+ bktype_lock = &pgStatLocal.shmem->io.locks[MyBackendType];
+ bktype_shstats =
+ &pgStatLocal.shmem->io.stats.stats[MyBackendType];
+
+ if (!nowait)
+ LWLockAcquire(bktype_lock, LW_EXCLUSIVE);
+ else if (!LWLockConditionalAcquire(bktype_lock, LW_EXCLUSIVE))
+ return true;
+
+ for (IOContext io_context = IOCONTEXT_FIRST;
+ io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ for (IOObject io_object = IOOBJECT_FIRST;
+ io_object < IOOBJECT_NUM_TYPES; io_object++)
+ for (IOOp io_op = IOOP_FIRST;
+ io_op < IOOP_NUM_TYPES; io_op++)
+ bktype_shstats->data[io_context][io_object][io_op] +=
+ PendingIOStats.data[io_context][io_object][io_op];
+
+ Assert(pgstat_bktype_io_stats_valid(bktype_shstats, MyBackendType));
+
+ LWLockRelease(bktype_lock);
+
+ memset(&PendingIOStats, 0, sizeof(PendingIOStats));
+
+ have_iostats = false;
+
+ return false;
+}
+
+const char *
+pgstat_get_io_context_name(IOContext io_context)
+{
+ switch (io_context)
+ {
+ case IOCONTEXT_BULKREAD:
+ return "bulkread";
+ case IOCONTEXT_BULKWRITE:
+ return "bulkwrite";
+ case IOCONTEXT_NORMAL:
+ return "normal";
+ case IOCONTEXT_VACUUM:
+ return "vacuum";
+ }
+
+ elog(ERROR, "unrecognized IOContext value: %d", io_context);
+ pg_unreachable();
+}
+
+const char *
+pgstat_get_io_object_name(IOObject io_object)
+{
+ switch (io_object)
+ {
+ case IOOBJECT_RELATION:
+ return "relation";
+ case IOOBJECT_TEMP_RELATION:
+ return "temp relation";
+ }
+
+ elog(ERROR, "unrecognized IOObject value: %d", io_object);
+ pg_unreachable();
+}
+
+void
+pgstat_io_reset_all_cb(TimestampTz ts)
+{
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ LWLock *bktype_lock = &pgStatLocal.shmem->io.locks[i];
+ PgStat_BackendIO *bktype_shstats = &pgStatLocal.shmem->io.stats.stats[i];
+
+ LWLockAcquire(bktype_lock, LW_EXCLUSIVE);
+
+ /*
+ * Use the lock in the first BackendType's PgStat_BackendIO to protect
+ * the reset timestamp as well.
+ */
+ if (i == 0)
+ pgStatLocal.shmem->io.stats.stat_reset_timestamp = ts;
+
+ memset(bktype_shstats, 0, sizeof(*bktype_shstats));
+ LWLockRelease(bktype_lock);
+ }
+}
+
+void
+pgstat_io_snapshot_cb(void)
+{
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ LWLock *bktype_lock = &pgStatLocal.shmem->io.locks[i];
+ PgStat_BackendIO *bktype_shstats = &pgStatLocal.shmem->io.stats.stats[i];
+ PgStat_BackendIO *bktype_snap = &pgStatLocal.snapshot.io.stats[i];
+
+ LWLockAcquire(bktype_lock, LW_SHARED);
+
+ /*
+ * Use the lock in the first BackendType's PgStat_BackendIO to protect
+ * the reset timestamp as well.
+ */
+ if (i == 0)
+ pgStatLocal.snapshot.io.stat_reset_timestamp =
+ pgStatLocal.shmem->io.stats.stat_reset_timestamp;
+
+ /* using struct assignment due to better type safety */
+ *bktype_snap = *bktype_shstats;
+ LWLockRelease(bktype_lock);
+ }
+}
+
+/*
+ * IO statistics are not collected for all BackendTypes.
+ *
+ * The following BackendTypes do not participate in the cumulative stats
+ * subsystem or do not perform IO that we currently track:
+ * - Syslogger, because it is not connected to shared memory
+ * - Archiver, because most relevant archiving IO is delegated to a
+ *   specialized command or module
+ * - WAL Receiver and WAL Writer, whose IO is not tracked in pg_stat_io for now
+ *
+ * This function returns true if the BackendType participates in the
+ * cumulative stats subsystem for IO and false if it does not.
+ */
+bool
+pgstat_tracks_io_bktype(BackendType bktype)
+{
+ /*
+ * List every type so that new backend types trigger a warning about
+ * needing to adjust this switch.
+ */
+ switch (bktype)
+ {
+ case B_INVALID:
+ case B_ARCHIVER:
+ case B_LOGGER:
+ case B_WAL_RECEIVER:
+ case B_WAL_WRITER:
+ return false;
+
+ case B_AUTOVAC_LAUNCHER:
+ case B_AUTOVAC_WORKER:
+ case B_BACKEND:
+ case B_BG_WORKER:
+ case B_BG_WRITER:
+ case B_CHECKPOINTER:
+ case B_STANDALONE_BACKEND:
+ case B_STARTUP:
+ case B_WAL_SENDER:
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * Some BackendTypes do not perform IO in certain IOContexts. Some IOObjects
+ * are never operated on in some IOContexts. Check that the given BackendType
+ * is expected to do IO in the given IOContext and that the given IOObject is
+ * expected to be operated on in the given IOContext.
+ */
+bool
+pgstat_tracks_io_object(BackendType bktype, IOContext io_context,
+ IOObject io_object)
+{
+ bool no_temp_rel;
+
+ /*
+ * Some BackendTypes should never track IO statistics.
+ */
+ if (!pgstat_tracks_io_bktype(bktype))
+ return false;
+
+ /*
+ * Currently, IO on temporary relations can only occur in the
+ * IOCONTEXT_NORMAL IOContext.
+ */
+ if (io_context != IOCONTEXT_NORMAL &&
+ io_object == IOOBJECT_TEMP_RELATION)
+ return false;
+
+ /*
+ * In core Postgres, only regular backends and WAL Sender processes
+ * executing queries will use local buffers and operate on temporary
+ * relations. Parallel workers will not use local buffers (see
+ * InitLocalBuffers()); however, extensions leveraging background workers
+ * have no such limitation, so track IO on IOOBJECT_TEMP_RELATION for
+ * BackendType B_BG_WORKER.
+ */
+ no_temp_rel = bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER ||
+ bktype == B_CHECKPOINTER || bktype == B_AUTOVAC_WORKER ||
+ bktype == B_STANDALONE_BACKEND || bktype == B_STARTUP;
+
+ if (no_temp_rel && io_context == IOCONTEXT_NORMAL &&
+ io_object == IOOBJECT_TEMP_RELATION)
+ return false;
+
+ /*
+ * Some BackendTypes do not currently perform any IO in certain
+ * IOContexts, and, while it may not be inherently incorrect for them to
+ * do so, excluding those rows from the view makes the view easier to use.
+ */
+ if ((bktype == B_CHECKPOINTER || bktype == B_BG_WRITER) &&
+ (io_context == IOCONTEXT_BULKREAD ||
+ io_context == IOCONTEXT_BULKWRITE ||
+ io_context == IOCONTEXT_VACUUM))
+ return false;
+
+ if (bktype == B_AUTOVAC_LAUNCHER && io_context == IOCONTEXT_VACUUM)
+ return false;
+
+ if ((bktype == B_AUTOVAC_WORKER || bktype == B_AUTOVAC_LAUNCHER) &&
+ io_context == IOCONTEXT_BULKWRITE)
+ return false;
+
+ return true;
+}
+
+/*
+ * Some BackendTypes will never do certain IOOps and some IOOps should not
+ * occur in certain IOContexts. Check that the given IOOp is valid for the
+ * given BackendType in the given IOContext. Note that there are currently no
+ * cases of an IOOp being invalid for a particular BackendType only within a
+ * certain IOContext.
+ */
+bool
+pgstat_tracks_io_op(BackendType bktype, IOContext io_context,
+ IOObject io_object, IOOp io_op)
+{
+ bool strategy_io_context;
+
+ /* if (io_context, io_object) will never collect stats, we're done */
+ if (!pgstat_tracks_io_object(bktype, io_context, io_object))
+ return false;
+
+ /*
+ * Some BackendTypes will not do certain IOOps.
+ */
+ if ((bktype == B_BG_WRITER || bktype == B_CHECKPOINTER) &&
+ (io_op == IOOP_READ || io_op == IOOP_EVICT))
+ return false;
+
+ if ((bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER ||
+ bktype == B_CHECKPOINTER) && io_op == IOOP_EXTEND)
+ return false;
+
+ /*
+ * Some IOOps are not valid in certain IOContexts and some IOOps are only
+ * valid in certain contexts.
+ */
+ if (io_context == IOCONTEXT_BULKREAD && io_op == IOOP_EXTEND)
+ return false;
+
+ strategy_io_context = io_context == IOCONTEXT_BULKREAD ||
+ io_context == IOCONTEXT_BULKWRITE || io_context == IOCONTEXT_VACUUM;
+
+ /*
+ * IOOP_REUSE is only relevant when a BufferAccessStrategy is in use.
+ */
+ if (!strategy_io_context && io_op == IOOP_REUSE)
+ return false;
+
+ /*
+ * IOOP_FSYNC IOOps done by a backend using a BufferAccessStrategy are
+ * counted in the IOCONTEXT_NORMAL IOContext. See comment in
+ * register_dirty_segment() for more details.
+ */
+ if (strategy_io_context && io_op == IOOP_FSYNC)
+ return false;
+
+ /*
+ * Temporary tables are not logged and thus do not require fsync'ing.
+ */
+ if (io_context == IOCONTEXT_NORMAL &&
+ io_object == IOOBJECT_TEMP_RELATION && io_op == IOOP_FSYNC)
+ return false;
+
+ return true;
+}
+
+/*
+ * Check that stats have not been counted for any combination of IOContext,
+ * IOObject, and IOOp which are not tracked for the passed-in BackendType. The
+ * passed-in PgStat_BackendIO must contain stats from the BackendType specified
+ * by the second parameter. Caller is responsible for locking the passed-in
+ * PgStat_BackendIO, if needed.
+ */
+bool
+pgstat_bktype_io_stats_valid(PgStat_BackendIO *backend_io,
+ BackendType bktype)
+{
+ bool bktype_tracked = pgstat_tracks_io_bktype(bktype);
+
+ for (IOContext io_context = IOCONTEXT_FIRST;
+ io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ {
+ for (IOObject io_object = IOOBJECT_FIRST;
+ io_object < IOOBJECT_NUM_TYPES; io_object++)
+ {
+ /*
+ * Don't bother trying to skip to the next loop iteration if
+ * pgstat_tracks_io_object() would return false here. We still
+ * need to validate that each counter is zero anyway.
+ */
+ for (IOOp io_op = IOOP_FIRST; io_op < IOOP_NUM_TYPES; io_op++)
+ {
+ if ((!bktype_tracked || !pgstat_tracks_io_op(bktype, io_context, io_object, io_op)) &&
+ backend_io->data[io_context][io_object][io_op] != 0)
+ return false;
+ }
+ }
+ }
+
+ return true;
+}
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index 2e20b93c20..f793ac1516 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -206,7 +206,7 @@ pgstat_drop_relation(Relation rel)
}
/*
- * Report that the table was just vacuumed.
+ * Report that the table was just vacuumed and flush IO statistics.
*/
void
pgstat_report_vacuum(Oid tableoid, bool shared,
@@ -258,10 +258,18 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
}
pgstat_unlock_entry(entry_ref);
+
+ /*
+ * Flush IO statistics now. pgstat_report_stat() will also flush IO stats,
+ * but it will not be called until after an entire autovacuum cycle
+ * is done -- which will likely vacuum many relations -- or until the
+ * VACUUM command has processed all tables and committed.
+ */
+ pgstat_flush_io(false);
}
/*
- * Report that the table was just analyzed.
+ * Report that the table was just analyzed and flush IO statistics.
*
* Caller must provide new live- and dead-tuples estimates, as well as a
* flag indicating whether to reset the mod_since_analyze counter.
@@ -341,6 +349,9 @@ pgstat_report_analyze(Relation rel,
}
pgstat_unlock_entry(entry_ref);
+
+ /* see pgstat_report_vacuum() */
+ pgstat_flush_io(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_shmem.c b/src/backend/utils/activity/pgstat_shmem.c
index c1506b53d0..09fffd0e82 100644
--- a/src/backend/utils/activity/pgstat_shmem.c
+++ b/src/backend/utils/activity/pgstat_shmem.c
@@ -202,6 +202,10 @@ StatsShmemInit(void)
LWLockInitialize(&ctl->checkpointer.lock, LWTRANCHE_PGSTATS_DATA);
LWLockInitialize(&ctl->slru.lock, LWTRANCHE_PGSTATS_DATA);
LWLockInitialize(&ctl->wal.lock, LWTRANCHE_PGSTATS_DATA);
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ LWLockInitialize(&ctl->io.locks[i],
+ LWTRANCHE_PGSTATS_DATA);
}
else
{
diff --git a/src/backend/utils/activity/pgstat_wal.c b/src/backend/utils/activity/pgstat_wal.c
index e7a82b5fed..e8598b2f4e 100644
--- a/src/backend/utils/activity/pgstat_wal.c
+++ b/src/backend/utils/activity/pgstat_wal.c
@@ -34,7 +34,7 @@ static WalUsage prevWalUsage;
/*
* Calculate how much WAL usage counters have increased and update
- * shared statistics.
+ * shared WAL and IO statistics.
*
* Must be called by processes that generate WAL, that do not call
* pgstat_report_stat(), like walwriter.
@@ -43,6 +43,8 @@ void
pgstat_report_wal(bool force)
{
pgstat_flush_wal(force);
+
+ pgstat_flush_io(force);
}
/*
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 58bd1360b9..5c8e8336bf 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1593,6 +1593,8 @@ pg_stat_reset_shared(PG_FUNCTION_ARGS)
pgstat_reset_of_kind(PGSTAT_KIND_BGWRITER);
pgstat_reset_of_kind(PGSTAT_KIND_CHECKPOINTER);
}
+ else if (strcmp(target, "io") == 0)
+ pgstat_reset_of_kind(PGSTAT_KIND_IO);
else if (strcmp(target, "recovery_prefetch") == 0)
XLogPrefetchResetStats();
else if (strcmp(target, "wal") == 0)
@@ -1601,7 +1603,7 @@ pg_stat_reset_shared(PG_FUNCTION_ARGS)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("unrecognized reset target: \"%s\"", target),
- errhint("Target must be \"archiver\", \"bgwriter\", \"recovery_prefetch\", or \"wal\".")));
+ errhint("Target must be \"archiver\", \"bgwriter\", \"io\", \"recovery_prefetch\", or \"wal\".")));
PG_RETURN_VOID();
}
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 96b3a1e1a0..c309e0233d 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -332,6 +332,8 @@ typedef enum BackendType
B_WAL_WRITER,
} BackendType;
+#define BACKEND_NUM_TYPES (B_WAL_WRITER + 1)
+
extern PGDLLIMPORT BackendType MyBackendType;
extern const char *GetBackendTypeDesc(BackendType backendType);
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 5e3326a3b9..f2a66ed7fb 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -48,6 +48,7 @@ typedef enum PgStat_Kind
PGSTAT_KIND_ARCHIVER,
PGSTAT_KIND_BGWRITER,
PGSTAT_KIND_CHECKPOINTER,
+ PGSTAT_KIND_IO,
PGSTAT_KIND_SLRU,
PGSTAT_KIND_WAL,
} PgStat_Kind;
@@ -276,6 +277,55 @@ typedef struct PgStat_CheckpointerStats
PgStat_Counter buf_fsync_backend;
} PgStat_CheckpointerStats;
+
+/*
+ * Types related to counting IO operations
+ */
+typedef enum IOContext
+{
+ IOCONTEXT_BULKREAD,
+ IOCONTEXT_BULKWRITE,
+ IOCONTEXT_NORMAL,
+ IOCONTEXT_VACUUM,
+} IOContext;
+
+#define IOCONTEXT_FIRST IOCONTEXT_BULKREAD
+#define IOCONTEXT_NUM_TYPES (IOCONTEXT_VACUUM + 1)
+
+typedef enum IOObject
+{
+ IOOBJECT_RELATION,
+ IOOBJECT_TEMP_RELATION,
+} IOObject;
+
+#define IOOBJECT_FIRST IOOBJECT_RELATION
+#define IOOBJECT_NUM_TYPES (IOOBJECT_TEMP_RELATION + 1)
+
+typedef enum IOOp
+{
+ IOOP_EVICT,
+ IOOP_EXTEND,
+ IOOP_FSYNC,
+ IOOP_READ,
+ IOOP_REUSE,
+ IOOP_WRITE,
+} IOOp;
+
+#define IOOP_FIRST IOOP_EVICT
+#define IOOP_NUM_TYPES (IOOP_WRITE + 1)
+
+typedef struct PgStat_BackendIO
+{
+ PgStat_Counter data[IOCONTEXT_NUM_TYPES][IOOBJECT_NUM_TYPES][IOOP_NUM_TYPES];
+} PgStat_BackendIO;
+
+typedef struct PgStat_IO
+{
+ TimestampTz stat_reset_timestamp;
+ PgStat_BackendIO stats[BACKEND_NUM_TYPES];
+} PgStat_IO;
+
+
typedef struct PgStat_StatDBEntry
{
PgStat_Counter xact_commit;
@@ -453,6 +503,22 @@ extern void pgstat_report_checkpointer(void);
extern PgStat_CheckpointerStats *pgstat_fetch_stat_checkpointer(void);
+/*
+ * Functions in pgstat_io.c
+ */
+
+extern void pgstat_count_io_op(IOOp io_op, IOObject io_object, IOContext io_context);
+extern PgStat_IO *pgstat_fetch_stat_io(void);
+extern const char *pgstat_get_io_context_name(IOContext io_context);
+extern const char *pgstat_get_io_object_name(IOObject io_object);
+
+extern bool pgstat_tracks_io_bktype(BackendType bktype);
+extern bool pgstat_tracks_io_object(BackendType bktype,
+ IOContext io_context, IOObject io_object);
+extern bool pgstat_tracks_io_op(BackendType bktype, IOContext io_context,
+ IOObject io_object, IOOp io_op);
+
+
/*
* Functions in pgstat_database.c
*/
diff --git a/src/include/utils/pgstat_internal.h b/src/include/utils/pgstat_internal.h
index 12fd51f1ae..bf8e4c3b8b 100644
--- a/src/include/utils/pgstat_internal.h
+++ b/src/include/utils/pgstat_internal.h
@@ -329,6 +329,17 @@ typedef struct PgStatShared_Checkpointer
PgStat_CheckpointerStats reset_offset;
} PgStatShared_Checkpointer;
+/* shared version of PgStat_IO */
+typedef struct PgStatShared_IO
+{
+ /*
+ * locks[i] protects stats.stats[i]. locks[0] also protects
+ * stats.stat_reset_timestamp.
+ */
+ LWLock locks[BACKEND_NUM_TYPES];
+ PgStat_IO stats;
+} PgStatShared_IO;
+
typedef struct PgStatShared_SLRU
{
/* lock protects ->stats */
@@ -419,6 +430,7 @@ typedef struct PgStat_ShmemControl
PgStatShared_Archiver archiver;
PgStatShared_BgWriter bgwriter;
PgStatShared_Checkpointer checkpointer;
+ PgStatShared_IO io;
PgStatShared_SLRU slru;
PgStatShared_Wal wal;
} PgStat_ShmemControl;
@@ -442,6 +454,8 @@ typedef struct PgStat_Snapshot
PgStat_CheckpointerStats checkpointer;
+ PgStat_IO io;
+
PgStat_SLRUStats slru[SLRU_NUM_ELEMENTS];
PgStat_WalStats wal;
@@ -549,6 +563,17 @@ extern void pgstat_database_reset_timestamp_cb(PgStatShared_Common *header, Time
extern bool pgstat_function_flush_cb(PgStat_EntryRef *entry_ref, bool nowait);
+/*
+ * Functions in pgstat_io.c
+ */
+
+extern void pgstat_io_reset_all_cb(TimestampTz ts);
+extern void pgstat_io_snapshot_cb(void);
+extern bool pgstat_flush_io(bool nowait);
+extern bool pgstat_bktype_io_stats_valid(PgStat_BackendIO *context_ops,
+ BackendType bktype);
+
+
/*
* Functions in pgstat_relation.c
*/
@@ -643,6 +668,13 @@ extern void pgstat_create_transactional(PgStat_Kind kind, Oid dboid, Oid objoid)
extern PGDLLIMPORT PgStat_LocalState pgStatLocal;
+/*
+ * Variables in pgstat_io.c
+ */
+
+extern PGDLLIMPORT bool have_iostats;
+
+
/*
* Variables in pgstat_slru.c
*/
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 23bafec5f7..7b66b1bc89 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1106,7 +1106,10 @@ ID
INFIX
INT128
INTERFACE_INFO
+IOContext
IOFuncSelector
+IOObject
+IOOp
IPCompareMethod
ITEM
IV
@@ -2016,6 +2019,7 @@ PgStatShared_Common
PgStatShared_Database
PgStatShared_Function
PgStatShared_HashEntry
+PgStatShared_IO
PgStatShared_Relation
PgStatShared_ReplSlot
PgStatShared_SLRU
@@ -2033,6 +2037,8 @@ PgStat_FetchConsistency
PgStat_FunctionCallUsage
PgStat_FunctionCounts
PgStat_HashKey
+PgStat_IO
+PgStat_BackendIO
PgStat_Kind
PgStat_KindInfo
PgStat_LocalState
--
2.34.1
Attachment: v47-0004-Add-system-view-tracking-IO-ops-per-backend-type.patch (text/x-patch, charset US-ASCII)
From b112a4114b9b61b19e6a62ed5fc82a01d8b1d1a2 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 9 Jan 2023 14:42:25 -0500
Subject: [PATCH v47 4/5] Add system view tracking IO ops per backend type
Add pg_stat_io, a system view which tracks the number of IOOps
(evictions, reuses, reads, writes, extensions, and fsyncs) done on each
IOObject (relation, temp relation) in each IOContext ("normal" and those
using a BufferAccessStrategy) by each type of backend (e.g. client
backend, checkpointer).
Some BackendTypes do not accumulate IO operations statistics and will
not be included in the view.
Some IOContexts are not used by some BackendTypes and will not be in the
view. For example, checkpointer does not use a BufferAccessStrategy
(currently), so there will be no rows for BufferAccessStrategy
IOContexts for checkpointer.
Some IOObjects are never operated on in some IOContexts or by some
BackendTypes. These rows are omitted from the view. For example,
checkpointer will never operate on IOOBJECT_TEMP_RELATION data, so those
rows are omitted.
Some IOOps are invalid in combination with certain IOContexts and
certain IOObjects. Those cells will be NULL in the view to distinguish
between 0 observed IOOps of that type and an invalid combination. For
example, temporary tables are not fsynced so cells for all BackendTypes
for IOOBJECT_TEMP_RELATION and IOOP_FSYNC will be NULL.
Some BackendTypes never perform certain IOOps. Those cells will also be
NULL in the view. For example, bgwriter should not perform reads.
The view's counters are incremented when a backend performs an IO
operation and are maintained by the cumulative statistics subsystem.
Each row of the view shows stats for a particular BackendType, IOObject,
IOContext combination (e.g. a client backend's operations on permanent
relations in shared buffers) and each column in the view is the total
number of IO Operations done (e.g. writes). So a cell in the view would
be, for example, the number of blocks of relation data written from
shared buffers by client backends since the last stats reset.
In anticipation of tracking WAL IO and non-block-oriented IO (such as
temporary file IO), the "op_bytes" column specifies the unit of the
"reads", "writes", and "extends" columns for a given row.
Note that some of the cells in the view are redundant with fields in
pg_stat_bgwriter (e.g. buffers_backend); however, those fields have been
kept in pg_stat_bgwriter for backwards compatibility. Deriving the redundant
pg_stat_bgwriter stats from the IO operations stats structures was also
problematic due to the separate reset targets for 'bgwriter' and 'io'.
Suggested by Andres Freund
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
contrib/amcheck/expected/check_heap.out | 31 +++
contrib/amcheck/sql/check_heap.sql | 24 +++
src/backend/catalog/system_views.sql | 15 ++
src/backend/utils/adt/pgstatfuncs.c | 154 +++++++++++++++
src/include/catalog/pg_proc.dat | 9 +
src/test/regress/expected/rules.out | 12 ++
src/test/regress/expected/stats.out | 246 ++++++++++++++++++++++++
src/test/regress/sql/stats.sql | 150 +++++++++++++++
src/tools/pgindent/typedefs.list | 1 +
9 files changed, 642 insertions(+)
diff --git a/contrib/amcheck/expected/check_heap.out b/contrib/amcheck/expected/check_heap.out
index c010361025..055107f6b5 100644
--- a/contrib/amcheck/expected/check_heap.out
+++ b/contrib/amcheck/expected/check_heap.out
@@ -66,6 +66,19 @@ SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'ALL-VISIBLE');
INSERT INTO heaptest (a, b)
(SELECT gs, repeat('x', gs)
FROM generate_series(1,50) gs);
+-- pg_stat_io test:
+-- verify_heapam always uses a BAS_BULKREAD BufferAccessStrategy. This allows
+-- us to reliably test that pg_stat_io BULKREAD reads are being captured
+-- without relying on the size of shared buffers or on an expensive operation
+-- like CREATE DATABASE.
+--
+-- Create an alternative tablespace and move the heaptest table to it, causing
+-- it to be rewritten.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE regress_test_stats_tblspc LOCATION '';
+SELECT sum(reads) AS stats_bulkreads_before
+ FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+ALTER TABLE heaptest SET TABLESPACE regress_test_stats_tblspc;
-- Check that valid options are not rejected nor corruption reported
-- for a non-empty table
SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'none');
@@ -88,6 +101,23 @@ SELECT * FROM verify_heapam(relation := 'heaptest', startblock := 0, endblock :=
-------+--------+--------+-----
(0 rows)
+-- verify_heapam should have read in the page written out by
+-- ALTER TABLE ... SET TABLESPACE ...
+-- causing an additional bulkread, which should be reflected in pg_stat_io.
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(reads) AS stats_bulkreads_after
+ FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+SELECT :stats_bulkreads_after > :stats_bulkreads_before;
+ ?column?
+----------
+ t
+(1 row)
+
CREATE ROLE regress_heaptest_role;
-- verify permissions are checked (error due to function not callable)
SET ROLE regress_heaptest_role;
@@ -195,6 +225,7 @@ ERROR: cannot check relation "test_foreign_table"
DETAIL: This operation is not supported for foreign tables.
-- cleanup
DROP TABLE heaptest;
+DROP TABLESPACE regress_test_stats_tblspc;
DROP TABLE test_partition;
DROP TABLE test_partitioned;
DROP OWNED BY regress_heaptest_role; -- permissions
diff --git a/contrib/amcheck/sql/check_heap.sql b/contrib/amcheck/sql/check_heap.sql
index 298de6886a..1cfd52bd13 100644
--- a/contrib/amcheck/sql/check_heap.sql
+++ b/contrib/amcheck/sql/check_heap.sql
@@ -20,11 +20,26 @@ SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'NONE');
SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'ALL-FROZEN');
SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'ALL-VISIBLE');
+
-- Add some data so subsequent tests are not entirely trivial
INSERT INTO heaptest (a, b)
(SELECT gs, repeat('x', gs)
FROM generate_series(1,50) gs);
+-- pg_stat_io test:
+-- verify_heapam always uses a BAS_BULKREAD BufferAccessStrategy. This allows
+-- us to reliably test that pg_stat_io BULKREAD reads are being captured
+-- without relying on the size of shared buffers or on an expensive operation
+-- like CREATE DATABASE.
+--
+-- Create an alternative tablespace and move the heaptest table to it, causing
+-- it to be rewritten.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE regress_test_stats_tblspc LOCATION '';
+SELECT sum(reads) AS stats_bulkreads_before
+ FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+ALTER TABLE heaptest SET TABLESPACE regress_test_stats_tblspc;
+
-- Check that valid options are not rejected nor corruption reported
-- for a non-empty table
SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'none');
@@ -32,6 +47,14 @@ SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'all-frozen');
SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'all-visible');
SELECT * FROM verify_heapam(relation := 'heaptest', startblock := 0, endblock := 0);
+-- verify_heapam should have read in the page written out by
+-- ALTER TABLE ... SET TABLESPACE ...
+-- causing an additional bulkread, which should be reflected in pg_stat_io.
+SELECT pg_stat_force_next_flush();
+SELECT sum(reads) AS stats_bulkreads_after
+ FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+SELECT :stats_bulkreads_after > :stats_bulkreads_before;
+
CREATE ROLE regress_heaptest_role;
-- verify permissions are checked (error due to function not callable)
@@ -110,6 +133,7 @@ SELECT * FROM verify_heapam('test_foreign_table',
-- cleanup
DROP TABLE heaptest;
+DROP TABLESPACE regress_test_stats_tblspc;
DROP TABLE test_partition;
DROP TABLE test_partitioned;
DROP OWNED BY regress_heaptest_role; -- permissions
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index d2a8c82900..f875742068 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1116,6 +1116,21 @@ CREATE VIEW pg_stat_bgwriter AS
pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+CREATE VIEW pg_stat_io AS
+SELECT
+ b.backend_type,
+ b.io_context,
+ b.io_object,
+ b.reads,
+ b.writes,
+ b.extends,
+ b.op_bytes,
+ b.evictions,
+ b.reuses,
+ b.fsyncs,
+ b.stats_reset
+FROM pg_stat_get_io() b;
+
CREATE VIEW pg_stat_wal AS
SELECT
w.wal_records,
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 5c8e8336bf..6a5a1dc79f 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1234,6 +1234,160 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
}
+/*
+ * When adding a new column to the pg_stat_io view, add a new enum value
+ * here above IO_NUM_COLUMNS.
+ */
+typedef enum io_stat_col
+{
+ IO_COL_BACKEND_TYPE,
+ IO_COL_IO_CONTEXT,
+ IO_COL_IO_OBJECT,
+ IO_COL_READS,
+ IO_COL_WRITES,
+ IO_COL_EXTENDS,
+ IO_COL_CONVERSION,
+ IO_COL_EVICTIONS,
+ IO_COL_REUSES,
+ IO_COL_FSYNCS,
+ IO_COL_RESET_TIME,
+ IO_NUM_COLUMNS,
+} io_stat_col;
+
+/*
+ * When adding a new IOOp, add a new io_stat_col and add a case to this
+ * function returning the corresponding io_stat_col.
+ */
+static io_stat_col
+pgstat_get_io_op_index(IOOp io_op)
+{
+ switch (io_op)
+ {
+ case IOOP_EVICT:
+ return IO_COL_EVICTIONS;
+ case IOOP_READ:
+ return IO_COL_READS;
+ case IOOP_REUSE:
+ return IO_COL_REUSES;
+ case IOOP_WRITE:
+ return IO_COL_WRITES;
+ case IOOP_EXTEND:
+ return IO_COL_EXTENDS;
+ case IOOP_FSYNC:
+ return IO_COL_FSYNCS;
+ }
+
+ elog(ERROR, "unrecognized IOOp value: %d", (int) io_op);
+ pg_unreachable();
+}
+
+#ifdef USE_ASSERT_CHECKING
+static bool
+pgstat_iszero_io_object(const PgStat_Counter *obj)
+{
+ for (IOOp io_op = IOOP_FIRST; io_op < IOOP_NUM_TYPES; io_op++)
+ {
+ if (obj[io_op] != 0)
+ return false;
+ }
+
+ return true;
+}
+#endif
+
+Datum
+pg_stat_get_io(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsinfo;
+ PgStat_IO *backends_io_stats;
+ Datum reset_time;
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+
+ backends_io_stats = pgstat_fetch_stat_io();
+
+ reset_time = TimestampTzGetDatum(backends_io_stats->stat_reset_timestamp);
+
+ for (BackendType bktype = B_INVALID; bktype < BACKEND_NUM_TYPES; bktype++)
+ {
+ bool bktype_tracked;
+ Datum bktype_desc = CStringGetTextDatum(GetBackendTypeDesc(bktype));
+ PgStat_BackendIO *bktype_stats = &backends_io_stats->stats[bktype];
+
+ /*
+ * For those BackendTypes without IO Operation stats, skip
+ * representing them in the view altogether. We still loop through
+ * their counters so that we can assert that all values are zero.
+ */
+ bktype_tracked = pgstat_tracks_io_bktype(bktype);
+
+ for (IOContext io_context = IOCONTEXT_FIRST;
+ io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ {
+ const char *context_name = pgstat_get_io_context_name(io_context);
+
+ for (IOObject io_obj = IOOBJECT_FIRST;
+ io_obj < IOOBJECT_NUM_TYPES; io_obj++)
+ {
+ const char *obj_name = pgstat_get_io_object_name(io_obj);
+
+ Datum values[IO_NUM_COLUMNS] = {0};
+ bool nulls[IO_NUM_COLUMNS] = {0};
+
+ /*
+ * Some combinations of IOContext, IOObject, and BackendType
+ * are not valid for any type of IOOp. In such cases, omit the
+ * entire row from the view.
+ */
+ if (!bktype_tracked ||
+ !pgstat_tracks_io_object(bktype, io_context, io_obj))
+ {
+ Assert(pgstat_iszero_io_object(bktype_stats->data[io_context][io_obj]));
+ continue;
+ }
+
+ values[IO_COL_BACKEND_TYPE] = bktype_desc;
+ values[IO_COL_IO_CONTEXT] = CStringGetTextDatum(context_name);
+ values[IO_COL_IO_OBJECT] = CStringGetTextDatum(obj_name);
+ values[IO_COL_RESET_TIME] = TimestampTzGetDatum(reset_time);
+
+ /*
+ * Hard-code this to the value of BLCKSZ for now. Future
+ * values could include XLOG_BLCKSZ, once WAL IO is tracked,
+ * and constant multipliers, once non-block-oriented IO (e.g.
+ * temporary file IO) is tracked.
+ */
+ values[IO_COL_CONVERSION] = Int64GetDatum(BLCKSZ);
+
+ /*
+ * Some combinations of BackendType and IOOp, of IOContext and
+ * IOOp, and of IOObject and IOOp are not tracked. Set these
+ * cells in the view NULL and assert that these stats are zero
+ * as expected.
+ */
+ for (IOOp io_op = IOOP_FIRST; io_op < IOOP_NUM_TYPES; io_op++)
+ {
+ int col_idx = pgstat_get_io_op_index(io_op);
+
+ nulls[col_idx] = !pgstat_tracks_io_op(bktype, io_context, io_obj, io_op);
+
+ if (!nulls[col_idx])
+ values[col_idx] =
+ Int64GetDatum(bktype_stats->data[io_context][io_obj][io_op]);
+ else
+ Assert(bktype_stats->data[io_context][io_obj][io_op] == 0);
+ }
+
+ tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
+ values, nulls);
+ }
+ }
+ }
+
+ return (Datum) 0;
+}
+
/*
* Returns statistics of WAL activity
*/
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 3810de7b22..57a889cf49 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5690,6 +5690,15 @@
proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
+{ oid => '8459', descr => 'statistics: per backend type IO statistics',
+ proname => 'pg_stat_get_io', provolatile => 'v',
+ prorows => '30', proretset => 't',
+ proparallel => 'r', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,text,text,int8,int8,int8,int8,int8,int8,int8,timestamptz}',
+ proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}',
+ proargnames => '{backend_type,io_context,io_object,reads,writes,extends,op_bytes,evictions,reuses,fsyncs,stats_reset}',
+ prosrc => 'pg_stat_get_io' },
+
{ oid => '1136', descr => 'statistics: information about WAL activity',
proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
proparallel => 'r', prorettype => 'record', proargtypes => '',
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index a969ae63eb..dd5ddffc4d 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1876,6 +1876,18 @@ pg_stat_gssapi| SELECT s.pid,
s.gss_enc AS encrypted
FROM pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid, query_id)
WHERE (s.client_port IS NOT NULL);
+pg_stat_io| SELECT b.backend_type,
+ b.io_context,
+ b.io_object,
+ b.reads,
+ b.writes,
+ b.extends,
+ b.op_bytes,
+ b.evictions,
+ b.reuses,
+ b.fsyncs,
+ b.stats_reset
+ FROM pg_stat_get_io() b(backend_type, io_context, io_object, reads, writes, extends, op_bytes, evictions, reuses, fsyncs, stats_reset);
pg_stat_progress_analyze| SELECT s.pid,
s.datid,
d.datname,
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index 1d84407a03..a66fe86b05 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -1126,4 +1126,250 @@ SELECT pg_stat_get_subscription_stats(NULL);
(1 row)
+-- Test that the following operations are tracked in pg_stat_io:
+-- - reads of target blocks into shared buffers
+-- - writes of shared buffers to permanent storage
+-- - extends of relations using shared buffers
+-- - fsyncs done to ensure the durability of data dirtying shared buffers
+-- There is no test for blocks evicted from shared buffers, because we cannot
+-- be sure of the state of shared buffers at the point the test is run.
+-- Create a regular table and insert some data to generate IOCONTEXT_NORMAL
+-- extends.
+SELECT sum(extends) AS io_sum_shared_extends_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+CREATE TABLE test_io_shared(a int);
+INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(extends) AS io_sum_shared_extends_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT :io_sum_shared_extends_after > :io_sum_shared_extends_before;
+ ?column?
+----------
+ t
+(1 row)
+
+-- After a checkpoint, there should be some additional IOCONTEXT_NORMAL writes
+-- and fsyncs.
+-- The second checkpoint ensures that stats from the first checkpoint have been
+-- reported and protects against any potential races amongst the table
+-- creation, a possible timing-triggered checkpoint, and the explicit
+-- checkpoint in the test.
+SELECT sum(writes) AS io_sum_shared_writes_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT sum(fsyncs) AS io_sum_shared_fsyncs_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+CHECKPOINT;
+CHECKPOINT;
+SELECT sum(writes) AS io_sum_shared_writes_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT sum(fsyncs) AS io_sum_shared_fsyncs_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT :io_sum_shared_writes_after > :io_sum_shared_writes_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT current_setting('fsync') = 'off' OR :io_sum_shared_fsyncs_after > :io_sum_shared_fsyncs_before;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Change the tablespace so that the table is rewritten directly, then SELECT
+-- from it to cause it to be read back into shared buffers.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE regress_io_stats_tblspc LOCATION '';
+SELECT sum(reads) AS io_sum_shared_reads_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+ALTER TABLE test_io_shared SET TABLESPACE regress_io_stats_tblspc;
+-- SELECT from the table so that the data is read into shared buffers and
+-- io_context 'normal', io_object 'relation' reads are counted.
+SELECT COUNT(*) FROM test_io_shared;
+ count
+-------
+ 100
+(1 row)
+
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(reads) AS io_sum_shared_reads_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT :io_sum_shared_reads_after > :io_sum_shared_reads_before;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Drop the table so we can drop the tablespace later.
+DROP TABLE test_io_shared;
+-- Test that the following IOCONTEXT_LOCAL IOOps are tracked in pg_stat_io:
+-- - eviction of local buffers in order to reuse them
+-- - reads of temporary table blocks into local buffers
+-- - writes of local buffers to permanent storage
+-- - extends of temporary tables
+-- Set temp_buffers to a low value so that we can trigger writes with fewer
+-- inserted tuples. Do so in a new session in case temporary tables have been
+-- accessed by previous tests in this session.
+\c
+SET temp_buffers TO '1MB';
+CREATE TEMPORARY TABLE test_io_local(a int, b TEXT);
+SELECT sum(extends) AS io_sum_local_extends_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(evictions) AS io_sum_local_evictions_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(writes) AS io_sum_local_writes_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+-- Insert tuples into the temporary table, generating extends in the stats.
+-- Insert enough values that we need to reuse and write out dirty local
+-- buffers, generating evictions and writes.
+INSERT INTO test_io_local SELECT generate_series(1, 8000) as id, repeat('a', 100);
+SELECT sum(reads) AS io_sum_local_reads_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+-- Read in evicted buffers, generating reads.
+SELECT COUNT(*) FROM test_io_local;
+ count
+-------
+ 8000
+(1 row)
+
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(evictions) AS io_sum_local_evictions_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(reads) AS io_sum_local_reads_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(writes) AS io_sum_local_writes_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(extends) AS io_sum_local_extends_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT :io_sum_local_evictions_after > :io_sum_local_evictions_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_local_reads_after > :io_sum_local_reads_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_local_writes_after > :io_sum_local_writes_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_local_extends_after > :io_sum_local_extends_before;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Change the tablespace so that the temporary table is rewritten to other
+-- local buffers, exercising a different codepath than standard local buffer
+-- writes.
+ALTER TABLE test_io_local SET TABLESPACE regress_io_stats_tblspc;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(writes) AS io_sum_local_writes_new_tblspc
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT :io_sum_local_writes_new_tblspc > :io_sum_local_writes_after;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Drop the table so we can drop the tablespace later.
+DROP TABLE test_io_local;
+RESET temp_buffers;
+DROP TABLESPACE regress_io_stats_tblspc;
+-- Test that reuse of strategy buffers and reads of blocks into these reused
+-- buffers while VACUUMing are tracked in pg_stat_io.
+-- Set wal_skip_threshold smaller than the expected size of
+-- test_io_vac_strategy so that, even if wal_level is minimal, VACUUM FULL will
+-- fsync the newly rewritten test_io_vac_strategy instead of writing it to WAL.
+-- Writing it to WAL will result in the newly written relation pages being in
+-- shared buffers -- preventing us from testing BAS_VACUUM BufferAccessStrategy
+-- reads.
+SET wal_skip_threshold = '1 kB';
+SELECT sum(reuses) AS io_sum_vac_strategy_reuses_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(reads) AS io_sum_vac_strategy_reads_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+CREATE TABLE test_io_vac_strategy(a int, b int) WITH (autovacuum_enabled = 'false');
+INSERT INTO test_io_vac_strategy SELECT i, i from generate_series(1, 8000)i;
+-- Ensure that the next VACUUM will need to perform IO by rewriting the table
+-- first with VACUUM (FULL).
+VACUUM (FULL) test_io_vac_strategy;
+VACUUM (PARALLEL 0) test_io_vac_strategy;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(reuses) AS io_sum_vac_strategy_reuses_after FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(reads) AS io_sum_vac_strategy_reads_after FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT :io_sum_vac_strategy_reads_after > :io_sum_vac_strategy_reads_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_vac_strategy_reuses_after > :io_sum_vac_strategy_reuses_before;
+ ?column?
+----------
+ t
+(1 row)
+
+RESET wal_skip_threshold;
+-- Test that extends done by a CTAS, which uses a BAS_BULKWRITE
+-- BufferAccessStrategy, are tracked in pg_stat_io.
+SELECT sum(extends) AS io_sum_bulkwrite_strategy_extends_before FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+CREATE TABLE test_io_bulkwrite_strategy AS SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(extends) AS io_sum_bulkwrite_strategy_extends_after FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+SELECT :io_sum_bulkwrite_strategy_extends_after > :io_sum_bulkwrite_strategy_extends_before;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Test IO stats reset
+SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) AS io_stats_pre_reset FROM pg_stat_io \gset
+SELECT pg_stat_reset_shared('io');
+ pg_stat_reset_shared
+----------------------
+
+(1 row)
+
+SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) AS io_stats_post_reset FROM pg_stat_io \gset
+SELECT :io_stats_post_reset < :io_stats_pre_reset;
+ ?column?
+----------
+ t
+(1 row)
+
-- End of Stats Test
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index b4d6753c71..64b3da2765 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -536,4 +536,154 @@ SELECT pg_stat_get_replication_slot(NULL);
SELECT pg_stat_get_subscription_stats(NULL);
+-- Test that the following operations are tracked in pg_stat_io:
+-- - reads of target blocks into shared buffers
+-- - writes of shared buffers to permanent storage
+-- - extends of relations using shared buffers
+-- - fsyncs done to ensure the durability of data dirtying shared buffers
+
+-- There is no test for blocks evicted from shared buffers, because we cannot
+-- be sure of the state of shared buffers at the point the test is run.
+
+-- Create a regular table and insert some data to generate IOCONTEXT_NORMAL
+-- extends.
+SELECT sum(extends) AS io_sum_shared_extends_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+CREATE TABLE test_io_shared(a int);
+INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+SELECT sum(extends) AS io_sum_shared_extends_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT :io_sum_shared_extends_after > :io_sum_shared_extends_before;
+
+-- After a checkpoint, there should be some additional IOCONTEXT_NORMAL writes
+-- and fsyncs.
+-- The second checkpoint ensures that stats from the first checkpoint have been
+-- reported and protects against any potential races amongst the table
+-- creation, a possible timing-triggered checkpoint, and the explicit
+-- checkpoint in the test.
+SELECT sum(writes) AS io_sum_shared_writes_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT sum(fsyncs) AS io_sum_shared_fsyncs_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+CHECKPOINT;
+CHECKPOINT;
+SELECT sum(writes) AS io_sum_shared_writes_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT sum(fsyncs) AS io_sum_shared_fsyncs_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+
+SELECT :io_sum_shared_writes_after > :io_sum_shared_writes_before;
+SELECT current_setting('fsync') = 'off' OR :io_sum_shared_fsyncs_after > :io_sum_shared_fsyncs_before;
+
+-- Change the tablespace so that the table is rewritten directly, then SELECT
+-- from it to cause it to be read back into shared buffers.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE regress_io_stats_tblspc LOCATION '';
+SELECT sum(reads) AS io_sum_shared_reads_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+ALTER TABLE test_io_shared SET TABLESPACE regress_io_stats_tblspc;
+-- SELECT from the table so that the data is read into shared buffers and
+-- io_context 'normal', io_object 'relation' reads are counted.
+SELECT COUNT(*) FROM test_io_shared;
+SELECT pg_stat_force_next_flush();
+SELECT sum(reads) AS io_sum_shared_reads_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT :io_sum_shared_reads_after > :io_sum_shared_reads_before;
+-- Drop the table so we can drop the tablespace later.
+DROP TABLE test_io_shared;
+
+-- Test that the following IOCONTEXT_LOCAL IOOps are tracked in pg_stat_io:
+-- - eviction of local buffers in order to reuse them
+-- - reads of temporary table blocks into local buffers
+-- - writes of local buffers to permanent storage
+-- - extends of temporary tables
+
+-- Set temp_buffers to a low value so that we can trigger writes with fewer
+-- inserted tuples. Do so in a new session in case temporary tables have been
+-- accessed by previous tests in this session.
+\c
+SET temp_buffers TO '1MB';
+CREATE TEMPORARY TABLE test_io_local(a int, b TEXT);
+SELECT sum(extends) AS io_sum_local_extends_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(evictions) AS io_sum_local_evictions_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(writes) AS io_sum_local_writes_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+-- Insert tuples into the temporary table, generating extends in the stats.
+-- Insert enough values that we need to reuse and write out dirty local
+-- buffers, generating evictions and writes.
+INSERT INTO test_io_local SELECT generate_series(1, 8000) as id, repeat('a', 100);
+
+SELECT sum(reads) AS io_sum_local_reads_before
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+-- Read in evicted buffers, generating reads.
+SELECT COUNT(*) FROM test_io_local;
+SELECT pg_stat_force_next_flush();
+SELECT sum(evictions) AS io_sum_local_evictions_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(reads) AS io_sum_local_reads_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(writes) AS io_sum_local_writes_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(extends) AS io_sum_local_extends_after
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT :io_sum_local_evictions_after > :io_sum_local_evictions_before;
+SELECT :io_sum_local_reads_after > :io_sum_local_reads_before;
+SELECT :io_sum_local_writes_after > :io_sum_local_writes_before;
+SELECT :io_sum_local_extends_after > :io_sum_local_extends_before;
+
+-- Change the tablespace so that the temporary table is rewritten to other
+-- local buffers, exercising a different codepath than standard local buffer
+-- writes.
+ALTER TABLE test_io_local SET TABLESPACE regress_io_stats_tblspc;
+SELECT pg_stat_force_next_flush();
+SELECT sum(writes) AS io_sum_local_writes_new_tblspc
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT :io_sum_local_writes_new_tblspc > :io_sum_local_writes_after;
+-- Drop the table so we can drop the tablespace later.
+DROP TABLE test_io_local;
+RESET temp_buffers;
+DROP TABLESPACE regress_io_stats_tblspc;
+
+-- Test that reuse of strategy buffers and reads of blocks into these reused
+-- buffers while VACUUMing are tracked in pg_stat_io.
+
+-- Set wal_skip_threshold smaller than the expected size of
+-- test_io_vac_strategy so that, even if wal_level is minimal, VACUUM FULL will
+-- fsync the newly rewritten test_io_vac_strategy instead of writing it to WAL.
+-- Writing it to WAL will result in the newly written relation pages being in
+-- shared buffers -- preventing us from testing BAS_VACUUM BufferAccessStrategy
+-- reads.
+SET wal_skip_threshold = '1 kB';
+SELECT sum(reuses) AS io_sum_vac_strategy_reuses_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(reads) AS io_sum_vac_strategy_reads_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+CREATE TABLE test_io_vac_strategy(a int, b int) WITH (autovacuum_enabled = 'false');
+INSERT INTO test_io_vac_strategy SELECT i, i from generate_series(1, 8000)i;
+-- Ensure that the next VACUUM will need to perform IO by rewriting the table
+-- first with VACUUM (FULL).
+VACUUM (FULL) test_io_vac_strategy;
+VACUUM (PARALLEL 0) test_io_vac_strategy;
+SELECT pg_stat_force_next_flush();
+SELECT sum(reuses) AS io_sum_vac_strategy_reuses_after FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(reads) AS io_sum_vac_strategy_reads_after FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT :io_sum_vac_strategy_reads_after > :io_sum_vac_strategy_reads_before;
+SELECT :io_sum_vac_strategy_reuses_after > :io_sum_vac_strategy_reuses_before;
+RESET wal_skip_threshold;
+
+-- Test that extends done by a CTAS, which uses a BAS_BULKWRITE
+-- BufferAccessStrategy, are tracked in pg_stat_io.
+SELECT sum(extends) AS io_sum_bulkwrite_strategy_extends_before FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+CREATE TABLE test_io_bulkwrite_strategy AS SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+SELECT sum(extends) AS io_sum_bulkwrite_strategy_extends_after FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+SELECT :io_sum_bulkwrite_strategy_extends_after > :io_sum_bulkwrite_strategy_extends_before;
+
+-- Test IO stats reset
+SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) AS io_stats_pre_reset FROM pg_stat_io \gset
+SELECT pg_stat_reset_shared('io');
+SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) AS io_stats_post_reset FROM pg_stat_io \gset
+SELECT :io_stats_post_reset < :io_stats_pre_reset;
+
-- End of Stats Test
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 7b66b1bc89..c4ecef2bf8 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3377,6 +3377,7 @@ intset_internal_node
intset_leaf_node
intset_node
intvKEY
+io_stat_col
itemIdCompact
itemIdCompactData
iterator
--
2.34.1
Attachment: v47-0001-pgindent-and-some-manual-cleanup-in-pgstat-relat.patch
From f8c9077631169a778c893fd16b7a973ad5725f2a Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 9 Dec 2022 18:23:19 -0800
Subject: [PATCH v47 1/5] pgindent and some manual cleanup in pgstat related
code
---
src/backend/storage/buffer/bufmgr.c | 22 ++++++++++----------
src/backend/storage/buffer/localbuf.c | 4 ++--
src/backend/utils/activity/pgstat.c | 3 ++-
src/backend/utils/activity/pgstat_relation.c | 1 +
src/backend/utils/adt/pgstatfuncs.c | 2 +-
src/include/pgstat.h | 1 +
src/include/utils/pgstat_internal.h | 1 +
7 files changed, 19 insertions(+), 15 deletions(-)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 3fb38a25cf..8075828e8a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -516,7 +516,7 @@ PrefetchSharedBuffer(SMgrRelation smgr_reln,
/* create a tag so we can lookup the buffer */
InitBufferTag(&newTag, &smgr_reln->smgr_rlocator.locator,
- forkNum, blockNum);
+ forkNum, blockNum);
/* determine its hash code and partition lock ID */
newHash = BufTableHashCode(&newTag);
@@ -3297,8 +3297,8 @@ DropRelationsAllBuffers(SMgrRelation *smgr_reln, int nlocators)
uint32 buf_state;
/*
- * As in DropRelationBuffers, an unlocked precheck should be
- * safe and saves some cycles.
+ * As in DropRelationBuffers, an unlocked precheck should be safe and
+ * saves some cycles.
*/
if (!use_bsearch)
@@ -3425,8 +3425,8 @@ DropDatabaseBuffers(Oid dbid)
uint32 buf_state;
/*
- * As in DropRelationBuffers, an unlocked precheck should be
- * safe and saves some cycles.
+ * As in DropRelationBuffers, an unlocked precheck should be safe and
+ * saves some cycles.
*/
if (bufHdr->tag.dbOid != dbid)
continue;
@@ -3572,8 +3572,8 @@ FlushRelationBuffers(Relation rel)
bufHdr = GetBufferDescriptor(i);
/*
- * As in DropRelationBuffers, an unlocked precheck should be
- * safe and saves some cycles.
+ * As in DropRelationBuffers, an unlocked precheck should be safe and
+ * saves some cycles.
*/
if (!BufTagMatchesRelFileLocator(&bufHdr->tag, &rel->rd_locator))
continue;
@@ -3645,8 +3645,8 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
uint32 buf_state;
/*
- * As in DropRelationBuffers, an unlocked precheck should be
- * safe and saves some cycles.
+ * As in DropRelationBuffers, an unlocked precheck should be safe and
+ * saves some cycles.
*/
if (!use_bsearch)
@@ -3880,8 +3880,8 @@ FlushDatabaseBuffers(Oid dbid)
bufHdr = GetBufferDescriptor(i);
/*
- * As in DropRelationBuffers, an unlocked precheck should be
- * safe and saves some cycles.
+ * As in DropRelationBuffers, an unlocked precheck should be safe and
+ * saves some cycles.
*/
if (bufHdr->tag.dbOid != dbid)
continue;
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index b2720df6ea..8372acc383 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -610,8 +610,8 @@ AtProcExit_LocalBuffers(void)
{
/*
* We shouldn't be holding any remaining pins; if we are, and assertions
- * aren't enabled, we'll fail later in DropRelationBuffers while
- * trying to drop the temp rels.
+ * aren't enabled, we'll fail later in DropRelationBuffers while trying to
+ * drop the temp rels.
*/
CheckForLocalBufferLeaks();
}
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 7e9dc17e68..0fa5370bcd 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -426,7 +426,7 @@ pgstat_discard_stats(void)
ereport(DEBUG2,
(errcode_for_file_access(),
errmsg_internal("unlinked permanent statistics file \"%s\"",
- PGSTAT_STAT_PERMANENT_FILENAME)));
+ PGSTAT_STAT_PERMANENT_FILENAME)));
}
/*
@@ -986,6 +986,7 @@ pgstat_build_snapshot(void)
entry->data = MemoryContextAlloc(pgStatLocal.snapshot.context,
kind_info->shared_size);
+
/*
* Acquire the LWLock directly instead of using
* pg_stat_lock_entry_shared() which requires a reference.
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index 1730425de1..2e20b93c20 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -783,6 +783,7 @@ pgstat_relation_flush_cb(PgStat_EntryRef *entry_ref, bool nowait)
if (lstats->t_counts.t_numscans)
{
TimestampTz t = GetCurrentTransactionStopTimestamp();
+
if (t > tabentry->lastscan)
tabentry->lastscan = t;
}
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 6cddd74aa7..58bd1360b9 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -906,7 +906,7 @@ pg_stat_get_backend_client_addr(PG_FUNCTION_ARGS)
clean_ipv6_addr(beentry->st_clientaddr.addr.ss_family, remote_host);
PG_RETURN_DATUM(DirectFunctionCall1(inet_in,
- CStringGetDatum(remote_host)));
+ CStringGetDatum(remote_host)));
}
Datum
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index d3e965d744..5e3326a3b9 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -476,6 +476,7 @@ extern void pgstat_report_connect(Oid dboid);
extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dboid);
+
/*
* Functions in pgstat_function.c
*/
diff --git a/src/include/utils/pgstat_internal.h b/src/include/utils/pgstat_internal.h
index 08412d6404..12fd51f1ae 100644
--- a/src/include/utils/pgstat_internal.h
+++ b/src/include/utils/pgstat_internal.h
@@ -626,6 +626,7 @@ extern void pgstat_wal_snapshot_cb(void);
extern bool pgstat_subscription_flush_cb(PgStat_EntryRef *entry_ref, bool nowait);
extern void pgstat_subscription_reset_timestamp_cb(PgStatShared_Common *header, TimestampTz ts);
+
/*
* Functions in pgstat_xact.c
*/
--
2.34.1
Attachment: v47-0005-pg_stat_io-documentation.patch
From 62beac2b02e1b724f96533d81515b2e9b3d1f5d9 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 9 Jan 2023 14:42:53 -0500
Subject: [PATCH v47 5/5] pg_stat_io documentation
Author: Melanie Plageman <melanieplageman@gmail.com>
Author: Samay Sharma <smilingsamay@gmail.com>
Reviewed-by: Maciek Sakrejda <m.sakrejda@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
doc/src/sgml/monitoring.sgml | 318 +++++++++++++++++++++++++++++++++--
1 file changed, 304 insertions(+), 14 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 8d51ca3773..ec8dba00f9 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -469,6 +469,16 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</entry>
</row>
+ <row>
+ <entry><structname>pg_stat_io</structname><indexterm><primary>pg_stat_io</primary></indexterm></entry>
+ <entry>
+ One row for each combination of backend type, context, and target object
+ containing cluster-wide I/O statistics.
+ See <link linkend="monitoring-pg-stat-io-view">
+ <structname>pg_stat_io</structname></link> for details.
+ </entry>
+ </row>
+
<row>
<entry><structname>pg_stat_replication_slots</structname><indexterm><primary>pg_stat_replication_slots</primary></indexterm></entry>
<entry>One row per replication slot, showing statistics about the
@@ -665,20 +675,16 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</para>
<para>
- The <structname>pg_statio_</structname> views are primarily useful to
- determine the effectiveness of the buffer cache. When the number
- of actual disk reads is much smaller than the number of buffer
- hits, then the cache is satisfying most read requests without
- invoking a kernel call. However, these statistics do not give the
- entire story: due to the way in which <productname>PostgreSQL</productname>
- handles disk I/O, data that is not in the
- <productname>PostgreSQL</productname> buffer cache might still reside in the
- kernel's I/O cache, and might therefore still be fetched without
- requiring a physical read. Users interested in obtaining more
- detailed information on <productname>PostgreSQL</productname> I/O behavior are
- advised to use the <productname>PostgreSQL</productname> statistics views
- in combination with operating system utilities that allow insight
- into the kernel's handling of I/O.
+ The <structname>pg_stat_io</structname> and
+ <structname>pg_statio_</structname> set of views are useful for determining
+ the effectiveness of the buffer cache. They can be used to calculate a cache
+ hit ratio. Note that while <productname>PostgreSQL</productname>'s I/O
+ statistics capture most instances in which the kernel was invoked in order
+ to perform I/O, they do not differentiate between data which had to be
+ fetched from disk and that which already resided in the kernel page cache.
+ Users are advised to use the <productname>PostgreSQL</productname>
+ statistics views in combination with operating system utilities for a more
+ complete picture of their database's I/O performance.
</para>
</sect2>
@@ -3643,6 +3649,290 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
<structfield>last_archived_wal</structfield> have also been successfully
archived.
</para>
+ </sect2>
+
+ <sect2 id="monitoring-pg-stat-io-view">
+ <title><structname>pg_stat_io</structname></title>
+
+ <indexterm>
+ <primary>pg_stat_io</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_stat_io</structname> view will contain one row for each
+ combination of backend type, I/O context, and target I/O object showing
+ cluster-wide I/O statistics. Combinations which do not make sense are
+ omitted.
+ </para>
+
+ <para>
+ Currently, I/O on relations (e.g. tables, indexes) is tracked. However,
+ relation I/O which bypasses shared buffers (e.g. when moving a table from one
+ tablespace to another) is currently not tracked.
+ </para>
+
+ <table id="pg-stat-io-view" xreflabel="pg_stat_io">
+ <title><structname>pg_stat_io</structname> View</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para>
+ </entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>backend_type</structfield> <type>text</type>
+ </para>
+ <para>
+ Type of backend (e.g. background worker, autovacuum worker). See <link
+ linkend="monitoring-pg-stat-activity-view">
+ <structname>pg_stat_activity</structname></link> for more information
+ on <varname>backend_type</varname>s. Some
+ <varname>backend_type</varname>s do not accumulate I/O operation
+ statistics and will not be included in the view.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>io_context</structfield> <type>text</type>
+ </para>
+ <para>
+ The context of an I/O operation. Possible values are:
+ </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>normal</literal>: The default or standard
+ <varname>io_context</varname> for a type of I/O operation. For
+ example, by default, relation data is read into and written out from
+ shared buffers. Thus, reads and writes of relation data to and from
+ shared buffers are tracked in <varname>io_context</varname>
+ <literal>normal</literal>.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>vacuum</literal>: I/O operations performed outside of shared
+ buffers while vacuuming and analyzing permanent relations.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>bulkread</literal>: Certain large read I/O operations
+ done outside of shared buffers, for example, a sequential scan of a
+ large table.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>bulkwrite</literal>: Certain large write I/O operations
+ done outside of shared buffers, such as <command>COPY</command>.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>io_object</structfield> <type>text</type>
+ </para>
+ <para>
+ Target object of an I/O operation. Possible values are:
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>relation</literal>: Permanent relations.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>temp relation</literal>: Temporary relations.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>reads</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of read operations, each of the size specified in
+ <varname>op_bytes</varname>.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>writes</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of write operations, each of the size specified in
+ <varname>op_bytes</varname>.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>extends</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of relation extend operations, each of the size specified in
+ <varname>op_bytes</varname>.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>op_bytes</structfield> <type>bigint</type>
+ </para>
+ <para>
+ The number of bytes per unit of I/O read, written, or extended.
+ </para>
+ <para>
+ Relation data reads, writes, and extends are done in
+ <varname>block_size</varname> units, derived from the build-time
+ parameter <symbol>BLCKSZ</symbol>, which is <literal>8192</literal> by
+ default.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>evictions</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of times a block has been written out from a shared or local
+ buffer in order to make it available for another use.
+ </para>
+ <para>
+ In <varname>io_context</varname> <literal>normal</literal>, this counts
+ the number of times a block was evicted from a buffer and replaced with
+ another block. In <varname>io_context</varname>s
+ <literal>bulkwrite</literal>, <literal>bulkread</literal>, and
+ <literal>vacuum</literal>, this counts the number of times a block was
+ evicted from shared buffers in order to add the shared buffer to a
+ separate, size-limited ring buffer for use in a bulk I/O operation.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>reuses</structfield> <type>bigint</type>
+ </para>
+ <para>
+ The number of times an existing buffer in a size-limited ring buffer
+ outside of shared buffers was reused as part of an I/O operation in the
+ <literal>bulkread</literal>, <literal>bulkwrite</literal>, or
+ <literal>vacuum</literal> <varname>io_context</varname>s.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>fsyncs</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of <literal>fsync</literal> calls. These are only tracked in
+ <varname>io_context</varname> <literal>normal</literal>.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
+ </para>
+ <para>
+ Time at which these statistics were last reset.
+ </para>
+ </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ Some backend types never perform I/O operations in some I/O contexts and/or
+ on some I/O objects. These rows are omitted from the view. For example, the
+ checkpointer does not checkpoint temporary tables, so there will be no rows
+ for <varname>backend_type</varname> <literal>checkpointer</literal> and
+ <varname>io_object</varname> <literal>temp relation</literal>.
+ </para>
+
+ <para>
+ In addition, some I/O operations will never be performed either by certain
+ backend types or in certain I/O contexts or on certain I/O objects. These
+ cells will be NULL. For example, temporary tables are not
+ <literal>fsync</literal>ed, so <varname>fsyncs</varname> will be NULL for
+ <varname>io_object</varname> <literal>temp relation</literal>. Also, the
+ background writer does not perform reads, so <varname>reads</varname> will
+ be NULL in rows for <varname>backend_type</varname> <literal>background
+ writer</literal>.
+ </para>
+
+ <para>
+ <structname>pg_stat_io</structname> can be used to inform database tuning.
+ For example:
+ <itemizedlist>
+ <listitem>
+ <para>
+ A high <varname>evictions</varname> count can indicate that shared
+ buffers should be increased.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ Client backends rely on the checkpointer to ensure data is persisted to
+ permanent storage. Large numbers of <varname>fsyncs</varname> by
+ <literal>client backend</literal>s could indicate a misconfiguration of
+ shared buffers or of the checkpointer. More information on configuring
+ the checkpointer can be found in <xref linkend="wal-configuration"/>.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ Normally, client backends should be able to rely on auxiliary processes
+ like the checkpointer and the background writer to write out dirty data
+ as much as possible. Large numbers of writes by client backends could
+ indicate a misconfiguration of shared buffers or of the checkpointer.
+ More information on configuring the checkpointer can be found in <xref
+ linkend="wal-configuration"/>.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </para>
+
</sect2>
--
2.34.1
Hi,
On 2023-01-13 13:38:15 -0500, Melanie Plageman wrote:
I think that's marginally better, but I think having to define both
FIRST and NUM is excessive and doesn't make it less fragile. Not sure
what anyone else will say, but I'd prefer if it started at "0".
The reason for using FIRST is to be able to define the loop variable as the
enum type, without assigning numeric values to an enum var. I prefer it
slightly.
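To make the tradeoff concrete, here is a minimal stand-in sketch of the FIRST/NUM convention (hypothetical EXCONTEXT_* names; the real code spells these IOCONTEXT_FIRST and IOCONTEXT_NUM_TYPES, and defines them inside the enum rather than as macros):

```c
#include <assert.h>

/* Stand-in enum; the members are placeholders, not PostgreSQL's IOContext. */
typedef enum ExampleIOContext
{
	EXCONTEXT_BULKREAD,
	EXCONTEXT_BULKWRITE,
	EXCONTEXT_NORMAL,
	EXCONTEXT_VACUUM,
} ExampleIOContext;

#define EXCONTEXT_FIRST		EXCONTEXT_BULKREAD
#define EXCONTEXT_NUM_TYPES (EXCONTEXT_VACUUM + 1)

/*
 * With a FIRST member, the loop variable can keep the enum type and never
 * needs a raw numeric initializer like "= 0".
 */
static int
count_contexts(void)
{
	int			n = 0;

	for (ExampleIOContext c = EXCONTEXT_FIRST; c < EXCONTEXT_NUM_TYPES; c++)
		n++;
	return n;
}
```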
From f8c9077631169a778c893fd16b7a973ad5725f2a Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 9 Dec 2022 18:23:19 -0800
Subject: [PATCH v47 1/5] pgindent and some manual cleanup in pgstat related
Applied.
Subject: [PATCH v47 2/5] pgstat: Infrastructure to track IO operations
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 0fa5370bcd..608c3b59da 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
Reminder to self: Need to bump PGSTAT_FILE_FORMAT_ID before commit.
Perhaps you could add a note about that to the commit message?
@@ -359,6 +360,15 @@ static const PgStat_KindInfo pgstat_kind_infos[PGSTAT_NUM_KINDS] = {
.snapshot_cb = pgstat_checkpointer_snapshot_cb,
	},
+	[PGSTAT_KIND_IO] = {
+ .name = "io_ops",
That should be "io" now I think?
+/*
+ * Check that stats have not been counted for any combination of IOContext,
+ * IOObject, and IOOp which are not tracked for the passed-in BackendType. The
+ * passed-in PgStat_BackendIO must contain stats from the BackendType specified
+ * by the second parameter. Caller is responsible for locking the passed-in
+ * PgStat_BackendIO, if needed.
+ */
Other PgStat_Backend* structs are just for pending data. Perhaps we could
rename it slightly to make that clearer? PgStat_BktypeIO?
PgStat_IOForBackendType? or a similar variation?
+bool
+pgstat_bktype_io_stats_valid(PgStat_BackendIO *backend_io,
+							 BackendType bktype)
+{
+	bool		bktype_tracked = pgstat_tracks_io_bktype(bktype);
+
+	for (IOContext io_context = IOCONTEXT_FIRST;
+		 io_context < IOCONTEXT_NUM_TYPES; io_context++)
+	{
+		for (IOObject io_object = IOOBJECT_FIRST;
+			 io_object < IOOBJECT_NUM_TYPES; io_object++)
+		{
+			/*
+			 * Don't bother trying to skip to the next loop iteration if
+			 * pgstat_tracks_io_object() would return false here. We still
+			 * need to validate that each counter is zero anyway.
+			 */
+			for (IOOp io_op = IOOP_FIRST; io_op < IOOP_NUM_TYPES; io_op++)
+			{
+				if ((!bktype_tracked || !pgstat_tracks_io_op(bktype, io_context, io_object, io_op)) &&
+					backend_io->data[io_context][io_object][io_op] != 0)
+					return false;
Hm, perhaps this could be broken up into multiple lines? Something like
/* no stats, so nothing to validate */
if (backend_io->data[io_context][io_object][io_op] == 0)
continue;
/* something went wrong if have stats for something not tracked */
if (!bktype_tracked ||
!pgstat_tracks_io_op(bktype, io_context, io_object, io_op))
return false;
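To sanity-check that restructuring, here is a standalone sketch of the validation loop with the two early-exit tests broken out as suggested. The enums, dimensions, and the tracking predicate are reduced stand-ins, not the real pgstat definitions:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Reduced stand-ins for the real IOContext/IOObject/IOOp enum counts. */
#define EX_NCONTEXT 2
#define EX_NOBJECT	2
#define EX_NOP		3

typedef struct ExampleBktypeIO
{
	long		data[EX_NCONTEXT][EX_NOBJECT][EX_NOP];
} ExampleBktypeIO;

/* Stand-in for pgstat_tracks_io_op(): pretend only context 0 is tracked. */
static bool
example_tracks_io_op(int ctx, int obj, int op)
{
	(void) obj;
	(void) op;
	return ctx == 0;
}

/*
 * Validation loop restructured per the suggestion: skip iterations with
 * nothing to validate first, then fail if a counter exists for an
 * untracked combination.
 */
static bool
example_io_stats_valid(ExampleBktypeIO *io, bool bktype_tracked)
{
	for (int ctx = 0; ctx < EX_NCONTEXT; ctx++)
		for (int obj = 0; obj < EX_NOBJECT; obj++)
			for (int op = 0; op < EX_NOP; op++)
			{
				/* no stats, so nothing to validate */
				if (io->data[ctx][obj][op] == 0)
					continue;

				/* something went wrong if we have stats for something not tracked */
				if (!bktype_tracked ||
					!example_tracks_io_op(ctx, obj, op))
					return false;
			}
	return true;
}
```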
+typedef struct PgStat_BackendIO
+{
+	PgStat_Counter data[IOCONTEXT_NUM_TYPES][IOOBJECT_NUM_TYPES][IOOP_NUM_TYPES];
+} PgStat_BackendIO;
Would it bother you if we swapped the order of iocontext and iobject here and
related places? It makes more sense to me semantically, and should now be
pretty easy, code wise.
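The swap is purely a declaration and loop-nesting change; storage and per-cell addressing stay one-to-one. A stand-in sketch (hypothetical names and reduced dimensions, not the patch's actual definitions):

```c
#include <assert.h>
#include <stddef.h>

/* Reduced stand-in dimensions for the real enum counts. */
#define EX_NCONTEXT 4
#define EX_NOBJECT	2
#define EX_NOP		8

/* Layout as currently in the patch: context-major. */
typedef struct ExampleIOContextMajor
{
	long		data[EX_NCONTEXT][EX_NOBJECT][EX_NOP];
} ExampleIOContextMajor;

/* Proposed layout: object-major; same cells, outer two indices swapped. */
typedef struct ExampleIOObjectMajor
{
	long		data[EX_NOBJECT][EX_NCONTEXT][EX_NOP];
} ExampleIOObjectMajor;

/* Swapping the outer dimensions changes only which loop nests outermost;
 * the total storage is identical. */
static int
layouts_same_size(void)
{
	return sizeof(ExampleIOContextMajor) == sizeof(ExampleIOObjectMajor);
}
```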
+/* shared version of PgStat_IO */
+typedef struct PgStatShared_IO
+{
Maybe /* PgStat_IO in shared memory */?
Subject: [PATCH v47 3/5] pgstat: Count IO for relations
Nearly happy with this now. See one minor nit below.
I don't love the counting in register_dirty_segment() and mdsyncfiletag(), but
I don't have a better idea, and it doesn't seem too horrible.
@@ -1441,6 +1474,28 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
UnlockBufHdr(buf, buf_state);
+	if (oldFlags & BM_VALID)
+	{
+		/*
+		 * When a BufferAccessStrategy is in use, blocks evicted from shared
+		 * buffers are counted as IOOP_EVICT in the corresponding context
+		 * (e.g. IOCONTEXT_BULKWRITE). Shared buffers are evicted by a
+		 * strategy in two cases: 1) while initially claiming buffers for the
+		 * strategy ring 2) to replace an existing strategy ring buffer
+		 * because it is pinned or in use and cannot be reused.
+		 *
+		 * Blocks evicted from buffers already in the strategy ring are
+		 * counted as IOOP_REUSE in the corresponding strategy context.
+		 *
+		 * At this point, we can accurately count evictions and reuses,
+		 * because we have successfully claimed the valid buffer. Previously,
+		 * we may have been forced to release the buffer due to concurrent
+		 * pinners or erroring out.
+		 */
+		pgstat_count_io_op(from_ring ? IOOP_REUSE : IOOP_EVICT,
+						   IOOBJECT_RELATION, *io_context);
+	}
+
 	if (oldPartitionLock != NULL)
 	{
 		BufTableDelete(&oldTag, oldHash);
There's no reason to do this while we still hold the buffer partition lock,
right? That's a highly contended lock, and we can just move the counting a few
lines down.
@@ -1410,6 +1432,9 @@ mdsyncfiletag(const FileTag *ftag, char *path)
if (need_to_close)
		FileClose(file);
+
+	if (result >= 0)
+		pgstat_count_io_op(IOOP_FSYNC, IOOBJECT_RELATION, IOCONTEXT_NORMAL);
+
I'd lean towards doing this unconditionally, it's still an fsync if it
failed... Not that it matters.
Subject: [PATCH v47 4/5] Add system view tracking IO ops per backend type
Note to self + commit message: Remember the need to do a catversion bump.
+-- pg_stat_io test:
+-- verify_heapam always uses a BAS_BULKREAD BufferAccessStrategy.
Maybe add that "whereas a sequential scan does not, see ..."?
This allows
+-- us to reliably test that pg_stat_io BULKREAD reads are being captured
+-- without relying on the size of shared buffers or on an expensive operation
+-- like CREATE DATABASE.
CREATE / DROP TABLESPACE is also pretty expensive, but I don't have a better
idea.
+-- Create an alternative tablespace and move the heaptest table to it, causing
+-- it to be rewritten.
IIRC the point of that is that it reliably evicts all the buffers from s_b,
correct? If so, mention that?
+Datum
+pg_stat_get_io(PG_FUNCTION_ARGS)
+{
+	ReturnSetInfo *rsinfo;
+	PgStat_IO  *backends_io_stats;
+	Datum		reset_time;
+
+	InitMaterializedSRF(fcinfo, 0);
+	rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+
+	backends_io_stats = pgstat_fetch_stat_io();
+
+	reset_time = TimestampTzGetDatum(backends_io_stats->stat_reset_timestamp);
+
+	for (BackendType bktype = B_INVALID; bktype < BACKEND_NUM_TYPES; bktype++)
+	{
+		bool		bktype_tracked;
+		Datum		bktype_desc = CStringGetTextDatum(GetBackendTypeDesc(bktype));
+		PgStat_BackendIO *bktype_stats = &backends_io_stats->stats[bktype];
+
+		/*
+		 * For those BackendTypes without IO Operation stats, skip
+		 * representing them in the view altogether. We still loop through
+		 * their counters so that we can assert that all values are zero.
+		 */
+		bktype_tracked = pgstat_tracks_io_bktype(bktype);
How about instead just doing Assert(pgstat_bktype_io_stats_valid(...))? That
deduplicates the logic for the asserts, and avoids doing the full loop when
assertions aren't enabled anyway?
Otherwise, see also the suggestion about formatting the assertions as I
suggested for 0002.
+-- After a checkpoint, there should be some additional IOCONTEXT_NORMAL writes
+-- and fsyncs.
+-- The second checkpoint ensures that stats from the first checkpoint have been
+-- reported and protects against any potential races amongst the table
+-- creation, a possible timing-triggered checkpoint, and the explicit
+-- checkpoint in the test.
There's a comment about the subsequent checkpoints earlier in the file, and I
think the comment is slightly more precise. Maybe just reference the earlier comment?
+-- Change the tablespace so that the table is rewritten directly, then SELECT
+-- from it to cause it to be read back into shared buffers.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE regress_io_stats_tblspc LOCATION '';
Perhaps worth doing this in tablespace.sql, to avoid the additional
checkpoints done as part of CREATE/DROP TABLESPACE?
Or, at least combine this with the CHECKPOINTs above?
+-- Drop the table so we can drop the tablespace later.
+DROP TABLE test_io_shared;
+-- Test that the following IOCONTEXT_LOCAL IOOps are tracked in pg_stat_io:
+-- - eviction of local buffers in order to reuse them
+-- - reads of temporary table blocks into local buffers
+-- - writes of local buffers to permanent storage
+-- - extends of temporary tables
+-- Set temp_buffers to a low value so that we can trigger writes with fewer
+-- inserted tuples. Do so in a new session in case temporary tables have been
+-- accessed by previous tests in this session.
+\c
+SET temp_buffers TO '1MB';
I'd set it to the actual minimum '100' (in pages). Perhaps that'd allow to
make test_io_local a bit smaller?
+CREATE TEMPORARY TABLE test_io_local(a int, b TEXT);
+SELECT sum(extends) AS io_sum_local_extends_before
+  FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(evictions) AS io_sum_local_evictions_before
+  FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(writes) AS io_sum_local_writes_before
+  FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+-- Insert tuples into the temporary table, generating extends in the stats.
+-- Insert enough values that we need to reuse and write out dirty local
+-- buffers, generating evictions and writes.
+INSERT INTO test_io_local SELECT generate_series(1, 8000) as id, repeat('a', 100);
+SELECT sum(reads) AS io_sum_local_reads_before
+  FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
Maybe add something like
SELECT pg_relation_size('test_io_local') / current_setting('block_size')::int8 > 100;
Better toast compression or such could easily make test_io_local smaller than
it is today. Seeing that it's too small would make it easier to understand the
failure.
+SELECT sum(evictions) AS io_sum_local_evictions_after
+  FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(reads) AS io_sum_local_reads_after
+  FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(writes) AS io_sum_local_writes_after
+  FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(extends) AS io_sum_local_extends_after
+  FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
This could just be one select with multiple columns?
I think if you use something like \gset io_sum_local_after_ you can also avoid
the need to repeat "io_sum_local_" so many times.
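A sketch of what that combined query could look like (illustrative only; the
variable prefix is an example, and this relies on psql's behavior of storing
each output column of \gset as a variable named prefix plus column name):

```
-- Combine the four aggregates into one query; \gset with a prefix stores
-- each output column as a psql variable named <prefix><column>.
SELECT sum(evictions) AS evictions,
       sum(reads) AS reads,
       sum(writes) AS writes,
       sum(extends) AS extends
  FROM pg_stat_io
  WHERE io_context = 'normal' AND io_object = 'temp relation' \gset io_sum_local_after_

-- The results are then available as :io_sum_local_after_evictions,
-- :io_sum_local_after_reads, and so on.
```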
+SELECT :io_sum_local_evictions_after > :io_sum_local_evictions_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_local_reads_after > :io_sum_local_reads_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_local_writes_after > :io_sum_local_writes_before;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_local_extends_after > :io_sum_local_extends_before;
+ ?column?
+----------
+ t
+(1 row)
Similar.
+SELECT sum(reuses) AS io_sum_vac_strategy_reuses_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
+SELECT sum(reads) AS io_sum_vac_strategy_reads_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset
There's quite a few more instances of this, so I'll now omit further mentions.
Greetings,
Andres Freund
On Fri, Jan 13, 2023 at 10:38 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:
Attached is v47.
I missed a couple of versions, but I think the docs are clearer now.
I'm torn on losing some of the detail, but overall I do think it's a
good trade-off. Moving some details out to after the table does keep
the bulk of the view documentation more readable, and the "inform
database tuning" part is great. I really like the idea of a separate
Interpreting Statistics section, but for now this works.
+ <literal>vacuum</literal>: I/O operations performed outside of shared
+ buffers while vacuuming and analyzing permanent relations.
Why only permanent relations? Are temporary relations treated
differently? I imagine if someone has a temp-table-heavy workload that
requires regularly vacuuming and analyzing those relations, this point
may be confusing without some additional explanation.
Other than that, this looks great.
Thanks,
Maciek
v48 attached.
On Fri, Jan 13, 2023 at 6:36 PM Andres Freund <andres@anarazel.de> wrote:
On 2023-01-13 13:38:15 -0500, Melanie Plageman wrote:
From f8c9077631169a778c893fd16b7a973ad5725f2a Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 9 Dec 2022 18:23:19 -0800
Subject: [PATCH v47 2/5] pgstat: Infrastructure to track IO operations

diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 0fa5370bcd..608c3b59da 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c

Reminder to self: Need to bump PGSTAT_FILE_FORMAT_ID before commit.
Perhaps you could add a note about that to the commit message?
done
@@ -359,6 +360,15 @@ static const PgStat_KindInfo pgstat_kind_infos[PGSTAT_NUM_KINDS] = {
.snapshot_cb = pgstat_checkpointer_snapshot_cb,
 },
+ [PGSTAT_KIND_IO] = {
+     .name = "io_ops",

That should be "io" now I think?
Oh no! I didn't notice this was broken. I've added pg_stat_have_stats()
to the IO stats tests now.
It would be nice if pgstat_get_kind_from_str() could be used in
pg_stat_reset_shared() to avoid having to remember to change both. It
doesn't really work because we want to be able to throw the error
message in pg_stat_reset_shared() when the user input is wrong -- not
the one in pgstat_get_kind_from_str().
Also:
- Since recovery_prefetch doesn't have a statistic kind, it doesn't fit
well into this paradigm
- Only a subset of the statistics kinds are reset through this function
- bgwriter and checkpointer share a reset target
I added a comment -- perhaps that's all I can do?
On a separate note, should we be setting have_[io/slru/etc]stats to
false in the reset all functions?
+/*
+ * Check that stats have not been counted for any combination of IOContext,
+ * IOObject, and IOOp which are not tracked for the passed-in BackendType. The
+ * passed-in PgStat_BackendIO must contain stats from the BackendType specified
+ * by the second parameter. Caller is responsible for locking the passed-in
+ * PgStat_BackendIO, if needed.
+ */

Other PgStat_Backend* structs are just for pending data. Perhaps we could
rename it slightly to make that clearer? PgStat_BktypeIO?
PgStat_IOForBackendType? or a similar variation?
I've done this.
+bool
+pgstat_bktype_io_stats_valid(PgStat_BackendIO *backend_io,
+                             BackendType bktype)
+{
+    bool        bktype_tracked = pgstat_tracks_io_bktype(bktype);
+
+    for (IOContext io_context = IOCONTEXT_FIRST;
+         io_context < IOCONTEXT_NUM_TYPES; io_context++)
+    {
+        for (IOObject io_object = IOOBJECT_FIRST;
+             io_object < IOOBJECT_NUM_TYPES; io_object++)
+        {
+            /*
+             * Don't bother trying to skip to the next loop iteration if
+             * pgstat_tracks_io_object() would return false here. We still
+             * need to validate that each counter is zero anyway.
+             */
+            for (IOOp io_op = IOOP_FIRST; io_op < IOOP_NUM_TYPES; io_op++)
+            {
+                if ((!bktype_tracked || !pgstat_tracks_io_op(bktype, io_context, io_object, io_op)) &&
+                    backend_io->data[io_context][io_object][io_op] != 0)
+                    return false;

Hm, perhaps this could be broken up into multiple lines? Something like
/* no stats, so nothing to validate */
if (backend_io->data[io_context][io_object][io_op] == 0)
    continue;

/* something went wrong if have stats for something not tracked */
if (!bktype_tracked ||
    !pgstat_tracks_io_op(bktype, io_context, io_object, io_op))
    return false;
I've done this.
+typedef struct PgStat_BackendIO
+{
+    PgStat_Counter data[IOCONTEXT_NUM_TYPES][IOOBJECT_NUM_TYPES][IOOP_NUM_TYPES];
+} PgStat_BackendIO;

Would it bother you if we swapped the order of iocontext and iobject here and
related places? It makes more sense to me semantically, and should now be
pretty easy, code wise.
So, thinking about this I started noticing inconsistencies in other
areas around this order:
For example: ordering of objects mentioned in commit messages and comments,
ordering of parameters (like in pgstat_count_io_op() [currently in
reverse order]).
I think we should make a final decision about this ordering and then
make everywhere consistent (including ordering in the view).
Currently the order is:
BackendType
IOContext
IOObject
IOOp
You are suggesting this order:
BackendType
IOObject
IOContext
IOOp
Could you explain what you find more natural about this ordering (as I
find the other more natural)?
This is one possible natural sentence with these objects:
During COPY, a client backend may read in data from a permanent
relation.
This order is:
IOContext
BackendType
IOOp
IOObject
I think English sentences are often structured subject, verb, object --
but in our case, we have an extra thing that doesn't fit neatly
(IOContext). Also, IOOp in a sentence would be in the middle (as the
verb). I made it last because a) it feels like the smallest unit b) it
would make the code a lot more annoying if it wasn't last.
WRT IOObject and IOContext, is there a future case for which having
IOObject first will be better or lead to fewer mistakes?
I actually see loads of places where this needs to be made consistent.
+/* shared version of PgStat_IO */
+typedef struct PgStatShared_IO
+{

Maybe /* PgStat_IO in shared memory */?
updated.
Subject: [PATCH v47 3/5] pgstat: Count IO for relations
Nearly happy with this now. See one minor nit below.
I don't love the counting in register_dirty_segment() and mdsyncfiletag(), but
I don't have a better idea, and it doesn't seem too horrible.
You don't like it because such things shouldn't be in md.c -- since we
went to the trouble of having function pointers and making it general?
@@ -1441,6 +1474,28 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
UnlockBufHdr(buf, buf_state);
+    if (oldFlags & BM_VALID)
+    {
+        /*
+         * When a BufferAccessStrategy is in use, blocks evicted from shared
+         * buffers are counted as IOOP_EVICT in the corresponding context
+         * (e.g. IOCONTEXT_BULKWRITE). Shared buffers are evicted by a
+         * strategy in two cases: 1) while initially claiming buffers for the
+         * strategy ring 2) to replace an existing strategy ring buffer
+         * because it is pinned or in use and cannot be reused.
+         *
+         * Blocks evicted from buffers already in the strategy ring are
+         * counted as IOOP_REUSE in the corresponding strategy context.
+         *
+         * At this point, we can accurately count evictions and reuses,
+         * because we have successfully claimed the valid buffer. Previously,
+         * we may have been forced to release the buffer due to concurrent
+         * pinners or erroring out.
+         */
+        pgstat_count_io_op(from_ring ? IOOP_REUSE : IOOP_EVICT,
+                           IOOBJECT_RELATION, *io_context);
+    }
+
     if (oldPartitionLock != NULL)
     {
         BufTableDelete(&oldTag, oldHash);

There's no reason to do this while we still hold the buffer partition lock,
right? That's a highly contended lock, and we can just move the counting a few
lines down.
Thanks, I've done this.
@@ -1410,6 +1432,9 @@ mdsyncfiletag(const FileTag *ftag, char *path)
if (need_to_close)
         FileClose(file);
+
+    if (result >= 0)
+        pgstat_count_io_op(IOOP_FSYNC, IOOBJECT_RELATION, IOCONTEXT_NORMAL);
+

I'd lean towards doing this unconditionally, it's still an fsync if it
failed... Not that it matters.
Good point. We still incurred the costs if not benefited from the
effects. I've updated this.
Subject: [PATCH v47 4/5] Add system view tracking IO ops per backend type
Note to self + commit message: Remember the need to do a catversion bump.
Noted.
+-- pg_stat_io test:
+-- verify_heapam always uses a BAS_BULKREAD BufferAccessStrategy.

Maybe add that "whereas a sequential scan does not, see ..."?
Updated.
This allows
+-- us to reliably test that pg_stat_io BULKREAD reads are being captured
+-- without relying on the size of shared buffers or on an expensive operation
+-- like CREATE DATABASE.

CREATE / DROP TABLESPACE is also pretty expensive, but I don't have a better
idea.
I've added a comment.
+-- Create an alternative tablespace and move the heaptest table to it, causing
+-- it to be rewritten.

IIRC the point of that is that it reliably evicts all the buffers from s_b,
correct? If so, mention that?
Done.
+Datum
+pg_stat_get_io(PG_FUNCTION_ARGS)
+{
+    ReturnSetInfo *rsinfo;
+    PgStat_IO  *backends_io_stats;
+    Datum       reset_time;
+
+    InitMaterializedSRF(fcinfo, 0);
+    rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+
+    backends_io_stats = pgstat_fetch_stat_io();
+
+    reset_time = TimestampTzGetDatum(backends_io_stats->stat_reset_timestamp);
+
+    for (BackendType bktype = B_INVALID; bktype < BACKEND_NUM_TYPES; bktype++)
+    {
+        bool        bktype_tracked;
+        Datum       bktype_desc = CStringGetTextDatum(GetBackendTypeDesc(bktype));
+        PgStat_BackendIO *bktype_stats = &backends_io_stats->stats[bktype];
+
+        /*
+         * For those BackendTypes without IO Operation stats, skip
+         * representing them in the view altogether. We still loop through
+         * their counters so that we can assert that all values are zero.
+         */
+        bktype_tracked = pgstat_tracks_io_bktype(bktype);

How about instead just doing Assert(pgstat_bktype_io_stats_valid(...))? That
deduplicates the logic for the asserts, and avoids doing the full loop when
assertions aren't enabled anyway?
I've done this and added a comment.
+-- After a checkpoint, there should be some additional IOCONTEXT_NORMAL writes
+-- and fsyncs.
+-- The second checkpoint ensures that stats from the first checkpoint have been
+-- reported and protects against any potential races amongst the table
+-- creation, a possible timing-triggered checkpoint, and the explicit
+-- checkpoint in the test.

There's a comment about the subsequent checkpoints earlier in the file, and I
think the comment is slightly more precise. Maybe just reference the earlier
comment?

+-- Change the tablespace so that the table is rewritten directly, then SELECT
+-- from it to cause it to be read back into shared buffers.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE regress_io_stats_tblspc LOCATION '';

Perhaps worth doing this in tablespace.sql, to avoid the additional
checkpoints done as part of CREATE/DROP TABLESPACE?

Or, at least combine this with the CHECKPOINTs above?
I see a checkpoint is requested when dropping the tablespace if not all
the files in it are deleted. It seems like if the DROP TABLE for the
permanent table is before the explicit checkpoints in the test, then the
DROP TABLESPACE will not cause an additional checkpoint. Is this what
you are suggesting? Dropping the temporary table should not have an
effect on this.
+-- Drop the table so we can drop the tablespace later.
+DROP TABLE test_io_shared;
+-- Test that the following IOCONTEXT_LOCAL IOOps are tracked in pg_stat_io:
+-- - eviction of local buffers in order to reuse them
+-- - reads of temporary table blocks into local buffers
+-- - writes of local buffers to permanent storage
+-- - extends of temporary tables
+-- Set temp_buffers to a low value so that we can trigger writes with fewer
+-- inserted tuples. Do so in a new session in case temporary tables have been
+-- accessed by previous tests in this session.
+\c
+SET temp_buffers TO '1MB';

I'd set it to the actual minimum '100' (in pages). Perhaps that'd allow to
make test_io_local a bit smaller?
I've done this.
+CREATE TEMPORARY TABLE test_io_local(a int, b TEXT);
+SELECT sum(extends) AS io_sum_local_extends_before
+  FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(evictions) AS io_sum_local_evictions_before
+  FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(writes) AS io_sum_local_writes_before
+  FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+-- Insert tuples into the temporary table, generating extends in the stats.
+-- Insert enough values that we need to reuse and write out dirty local
+-- buffers, generating evictions and writes.
+INSERT INTO test_io_local SELECT generate_series(1, 8000) as id, repeat('a', 100);
+SELECT sum(reads) AS io_sum_local_reads_before
+  FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset

Maybe add something like
SELECT pg_relation_size('test_io_local') / current_setting('block_size')::int8 > 100;
Better toast compression or such could easily make test_io_local smaller than
it is today. Seeing that it's too small would make it easier to understand the
failure.
Good idea. So, I used pg_table_size() because it seems like
pg_relation_size() does not include the toast relations. However, I'm
not sure this is a good idea, because pg_table_size() includes FSM and
visibility map. Should I write a query to get the toast relation name
and add pg_relation_size() of that relation and the main relation?
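One possible shape for such a query (an illustrative sketch, not tested here;
it assumes pg_class.reltoastrelid is zero when the table has no TOAST
relation, in which case pg_relation_size returns NULL and COALESCE falls back
to zero):

```
-- Sum only the main relation and its TOAST relation, excluding the FSM and
-- visibility map that pg_table_size() would include.
SELECT (pg_relation_size('test_io_local') +
        COALESCE(pg_relation_size(reltoastrelid), 0)) /
       current_setting('block_size')::int8 > 100
  FROM pg_class
  WHERE relname = 'test_io_local';
```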
+SELECT sum(evictions) AS io_sum_local_evictions_after
+  FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(reads) AS io_sum_local_reads_after
+  FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(writes) AS io_sum_local_writes_after
+  FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT sum(extends) AS io_sum_local_extends_after
+  FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset

This could just be one select with multiple columns?
I think if you use something like \gset io_sum_local_after_ you can also avoid
the need to repeat "io_sum_local_" so many times.
Thanks. I didn't realize. I've fixed this throughout the test file.
On Mon, Jan 16, 2023 at 4:42 PM Maciek Sakrejda <m.sakrejda@gmail.com> wrote:
I missed a couple of versions, but I think the docs are clearer now.
I'm torn on losing some of the detail, but overall I do think it's a
good trade-off. Moving some details out to after the table does keep
the bulk of the view documentation more readable, and the "inform
database tuning" part is great. I really like the idea of a separate
Interpreting Statistics section, but for now this works.

+ <literal>vacuum</literal>: I/O operations performed outside of shared
+ buffers while vacuuming and analyzing permanent relations.

Why only permanent relations? Are temporary relations treated
differently? I imagine if someone has a temp-table-heavy workload that
requires regularly vacuuming and analyzing those relations, this point
may be confusing without some additional explanation.
Ah, yes. This is a bit confusing. We don't use buffer access strategies
when operating on temp relations, so vacuuming them is counted in IO
Context normal. I've added this information to the docs but now that
definition is a bit long. Perhaps it should be a note? That seems like
it would draw too much attention to this detail, though...
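To illustrate the distinction (a hypothetical query, using the io_context and
io_object column names as they appear elsewhere in the test file):

```
-- Vacuuming a temporary table is counted under io_context 'normal' with
-- io_object 'temp relation'; only permanent-relation vacuums appear under
-- the 'vacuum' context, since local buffers use no buffer access strategy.
SELECT io_object, io_context, sum(reads) AS reads, sum(writes) AS writes
  FROM pg_stat_io
  WHERE io_context = 'vacuum' OR io_object = 'temp relation'
  GROUP BY io_object, io_context;
```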
- Melanie
Attachments:
v48-0002-pgstat-Count-IO-for-relations.patch (text/x-patch, US-ASCII)
From 7a7a254c218df04f513cef4b7b2c38725d58a8a4 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 4 Jan 2023 17:20:50 -0500
Subject: [PATCH v48 2/4] pgstat: Count IO for relations
Count IOOps done on IOObjects in IOContexts by various BackendTypes
using the IO stats infrastructure introduced by a previous commit.
The primary concern of these statistics is IO operations on data blocks
during the course of normal database operations. IO operations done by,
for example, the archiver or syslogger are not counted in these
statistics. WAL IO, temporary file IO, and IO done directly though smgr*
functions (such as when building an index) are not yet counted but would
be useful future additions.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
src/backend/storage/buffer/bufmgr.c | 109 ++++++++++++++++++++++----
src/backend/storage/buffer/freelist.c | 58 ++++++++++----
src/backend/storage/buffer/localbuf.c | 13 ++-
src/backend/storage/smgr/md.c | 24 ++++++
src/include/storage/buf_internals.h | 8 +-
src/include/storage/bufmgr.h | 7 +-
6 files changed, 183 insertions(+), 36 deletions(-)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 8075828e8a..484f422b72 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -481,8 +481,9 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
ForkNumber forkNum,
BlockNumber blockNum,
BufferAccessStrategy strategy,
- bool *foundPtr);
-static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+ bool *foundPtr, IOContext *io_context);
+static void FlushBuffer(BufferDesc *buf, SMgrRelation reln,
+ IOContext io_context, IOObject io_object);
static void FindAndDropRelationBuffers(RelFileLocator rlocator,
ForkNumber forkNum,
BlockNumber nForkBlock,
@@ -823,6 +824,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
BufferDesc *bufHdr;
Block bufBlock;
bool found;
+ IOContext io_context;
+ IOObject io_object;
bool isExtend;
bool isLocalBuf = SmgrIsTemp(smgr);
@@ -855,7 +858,14 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (isLocalBuf)
{
- bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
+ /*
+ * LocalBufferAlloc() will set the io_context to IOCONTEXT_NORMAL. We
+ * do not use a BufferAccessStrategy for I/O of temporary tables.
+ * However, in some cases, the "strategy" may not be NULL, so we can't
+ * rely on IOContextForStrategy() to set the right IOContext for us.
+ * This may happen in cases like CREATE TEMPORARY TABLE AS...
+ */
+ bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found, &io_context);
if (found)
pgBufferUsage.local_blks_hit++;
else if (isExtend)
@@ -871,7 +881,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* not currently in memory.
*/
bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum,
- strategy, &found);
+ strategy, &found, &io_context);
if (found)
pgBufferUsage.shared_blks_hit++;
else if (isExtend)
@@ -986,7 +996,16 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
*/
Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID)); /* spinlock not needed */
- bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
+ if (isLocalBuf)
+ {
+ bufBlock = LocalBufHdrGetBlock(bufHdr);
+ io_object = IOOBJECT_TEMP_RELATION;
+ }
+ else
+ {
+ bufBlock = BufHdrGetBlock(bufHdr);
+ io_object = IOOBJECT_RELATION;
+ }
if (isExtend)
{
@@ -995,6 +1014,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
/* don't set checksum for all-zero page */
smgrextend(smgr, forkNum, blockNum, (char *) bufBlock, false);
+ pgstat_count_io_op(IOOP_EXTEND, io_object, io_context);
+
/*
* NB: we're *not* doing a ScheduleBufferTagForWriteback here;
* although we're essentially performing a write. At least on linux
@@ -1020,6 +1041,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
smgrread(smgr, forkNum, blockNum, (char *) bufBlock);
+ pgstat_count_io_op(IOOP_READ, io_object, io_context);
+
if (track_io_timing)
{
INSTR_TIME_SET_CURRENT(io_time);
@@ -1113,14 +1136,19 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* *foundPtr is actually redundant with the buffer's BM_VALID flag, but
* we keep it for simplicity in ReadBuffer.
*
+ * io_context is passed as an output parameter to avoid calling
+ * IOContextForStrategy() when there is a shared buffers hit and no IO
+ * statistics need be captured.
+ *
* No locks are held either at entry or exit.
*/
static BufferDesc *
BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
BlockNumber blockNum,
BufferAccessStrategy strategy,
- bool *foundPtr)
+ bool *foundPtr, IOContext *io_context)
{
+ bool from_ring;
BufferTag newTag; /* identity of requested block */
uint32 newHash; /* hash value for newTag */
LWLock *newPartitionLock; /* buffer partition lock for it */
@@ -1172,8 +1200,11 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
{
/*
* If we get here, previous attempts to read the buffer must
- * have failed ... but we shall bravely try again.
+ * have failed ... but we shall bravely try again. Set
+ * io_context since we will in fact need to count an IO
+ * Operation.
*/
+ *io_context = IOContextForStrategy(strategy);
*foundPtr = false;
}
}
@@ -1187,6 +1218,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
*/
LWLockRelease(newPartitionLock);
+ *io_context = IOContextForStrategy(strategy);
+
/* Loop here in case we have to try another victim buffer */
for (;;)
{
@@ -1200,7 +1233,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* Select a victim buffer. The buffer is returned with its header
* spinlock still held!
*/
- buf = StrategyGetBuffer(strategy, &buf_state);
+ buf = StrategyGetBuffer(strategy, &buf_state, &from_ring);
Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
@@ -1254,7 +1287,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
UnlockBufHdr(buf, buf_state);
if (XLogNeedsFlush(lsn) &&
- StrategyRejectBuffer(strategy, buf))
+ StrategyRejectBuffer(strategy, buf, from_ring))
{
/* Drop lock/pin and loop around for another buffer */
LWLockRelease(BufferDescriptorGetContentLock(buf));
@@ -1269,7 +1302,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
smgr->smgr_rlocator.locator.dbOid,
smgr->smgr_rlocator.locator.relNumber);
- FlushBuffer(buf, NULL);
+ FlushBuffer(buf, NULL, *io_context, IOOBJECT_RELATION);
LWLockRelease(BufferDescriptorGetContentLock(buf));
ScheduleBufferTagForWriteback(&BackendWritebackContext,
@@ -1450,6 +1483,28 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
LWLockRelease(newPartitionLock);
+ if (oldFlags & BM_VALID)
+ {
+ /*
+ * When a BufferAccessStrategy is in use, blocks evicted from shared
+ * buffers are counted as IOOP_EVICT in the corresponding context
+ * (e.g. IOCONTEXT_BULKWRITE). Shared buffers are evicted by a
+ * strategy in two cases: 1) while initially claiming buffers for the
+ * strategy ring 2) to replace an existing strategy ring buffer
+ * because it is pinned or in use and cannot be reused.
+ *
+ * Blocks evicted from buffers already in the strategy ring are
+ * counted as IOOP_REUSE in the corresponding strategy context.
+ *
+ * At this point, we can accurately count evictions and reuses,
+ * because we have successfully claimed the valid buffer. Previously,
+ * we may have been forced to release the buffer due to concurrent
+ * pinners or erroring out.
+ */
+ pgstat_count_io_op(from_ring ? IOOP_REUSE : IOOP_EVICT,
+ IOOBJECT_RELATION, *io_context);
+ }
+
/*
* Buffer contents are currently invalid. Try to obtain the right to
* start I/O. If StartBufferIO returns false, then someone else managed
@@ -2570,7 +2625,7 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOCONTEXT_NORMAL, IOOBJECT_RELATION);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
@@ -2820,7 +2875,7 @@ BufferGetTag(Buffer buffer, RelFileLocator *rlocator, ForkNumber *forknum,
* as the second parameter. If not, pass NULL.
*/
static void
-FlushBuffer(BufferDesc *buf, SMgrRelation reln)
+FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOContext io_context, IOObject io_object)
{
XLogRecPtr recptr;
ErrorContextCallback errcallback;
@@ -2912,6 +2967,26 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
bufToWrite,
false);
+ /*
+ * When a strategy is in use, only flushes of dirty buffers already in the
+ * strategy ring are counted as strategy writes (IOCONTEXT
+ * [BULKREAD|BULKWRITE|VACUUM] IOOP_WRITE) for the purpose of IO
+ * statistics tracking.
+ *
+ * If a shared buffer initially added to the ring must be flushed before
+ * being used, this is counted as an IOCONTEXT_NORMAL IOOP_WRITE.
+ *
+ * If a shared buffer which was added to the ring later because the
+ * current strategy buffer is pinned or in use or because all strategy
+ * buffers were dirty and rejected (for BAS_BULKREAD operations only)
+ * requires flushing, this is counted as an IOCONTEXT_NORMAL IOOP_WRITE
+ * (from_ring will be false).
+ *
+ * When a strategy is not in use, the write can only be a "regular" write
+ * of a dirty shared buffer (IOCONTEXT_NORMAL IOOP_WRITE).
+ */
+ pgstat_count_io_op(IOOP_WRITE, IOOBJECT_RELATION, io_context);
+
if (track_io_timing)
{
INSTR_TIME_SET_CURRENT(io_time);
@@ -3554,6 +3629,8 @@ FlushRelationBuffers(Relation rel)
buf_state &= ~(BM_DIRTY | BM_JUST_DIRTIED);
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+ pgstat_count_io_op(IOOP_WRITE, IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL);
+
/* Pop the error context stack */
error_context_stack = errcallback.previous;
}
@@ -3586,7 +3663,7 @@ FlushRelationBuffers(Relation rel)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, RelationGetSmgr(rel));
+ FlushBuffer(bufHdr, RelationGetSmgr(rel), IOCONTEXT_NORMAL, IOOBJECT_RELATION);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr);
}
@@ -3684,7 +3761,7 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, srelent->srel);
+ FlushBuffer(bufHdr, srelent->srel, IOCONTEXT_NORMAL, IOOBJECT_RELATION);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr);
}
@@ -3894,7 +3971,7 @@ FlushDatabaseBuffers(Oid dbid)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOCONTEXT_NORMAL, IOOBJECT_RELATION);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr);
}
@@ -3921,7 +3998,7 @@ FlushOneBuffer(Buffer buffer)
Assert(LWLockHeldByMe(BufferDescriptorGetContentLock(bufHdr)));
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOCONTEXT_NORMAL, IOOBJECT_RELATION);
}
/*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 7dec35801c..c690d5f15f 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -15,6 +15,7 @@
*/
#include "postgres.h"
+#include "pgstat.h"
#include "port/atomics.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
@@ -81,12 +82,6 @@ typedef struct BufferAccessStrategyData
*/
int current;
- /*
- * True if the buffer just returned by StrategyGetBuffer had been in the
- * ring already.
- */
- bool current_was_in_ring;
-
/*
* Array of buffer numbers. InvalidBuffer (that is, zero) indicates we
* have not yet selected a buffer for this ring slot. For allocation
@@ -198,13 +193,15 @@ have_free_buffer(void)
* return the buffer with the buffer header spinlock still held.
*/
BufferDesc *
-StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
+StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_ring)
{
BufferDesc *buf;
int bgwprocno;
int trycounter;
uint32 local_buf_state; /* to avoid repeated (de-)referencing */
+ *from_ring = false;
+
/*
* If given a strategy object, see whether it can select a buffer. We
* assume strategy objects don't need buffer_strategy_lock.
@@ -213,7 +210,10 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
{
buf = GetBufferFromRing(strategy, buf_state);
if (buf != NULL)
+ {
+ *from_ring = true;
return buf;
+ }
}
/*
@@ -602,7 +602,7 @@ FreeAccessStrategy(BufferAccessStrategy strategy)
/*
* GetBufferFromRing -- returns a buffer from the ring, or NULL if the
- * ring is empty.
+ * ring is empty / not usable.
*
* The bufhdr spin lock is held on the returned buffer.
*/
@@ -625,10 +625,7 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
*/
bufnum = strategy->buffers[strategy->current];
if (bufnum == InvalidBuffer)
- {
- strategy->current_was_in_ring = false;
return NULL;
- }
/*
* If the buffer is pinned we cannot use it under any circumstances.
@@ -644,7 +641,6 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
if (BUF_STATE_GET_REFCOUNT(local_buf_state) == 0
&& BUF_STATE_GET_USAGECOUNT(local_buf_state) <= 1)
{
- strategy->current_was_in_ring = true;
*buf_state = local_buf_state;
return buf;
}
@@ -654,7 +650,6 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
* Tell caller to allocate a new buffer with the normal allocation
* strategy. He'll then replace this ring element via AddBufferToRing.
*/
- strategy->current_was_in_ring = false;
return NULL;
}
@@ -670,6 +665,39 @@ AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
strategy->buffers[strategy->current] = BufferDescriptorGetBuffer(buf);
}
+/*
+ * Utility function returning the IOContext of a given BufferAccessStrategy's
+ * strategy ring.
+ */
+IOContext
+IOContextForStrategy(BufferAccessStrategy strategy)
+{
+ if (!strategy)
+ return IOCONTEXT_NORMAL;
+
+ switch (strategy->btype)
+ {
+ case BAS_NORMAL:
+
+ /*
+ * Currently, GetAccessStrategy() returns NULL for
+ * BufferAccessStrategyType BAS_NORMAL, so this case is
+ * unreachable.
+ */
+ pg_unreachable();
+ return IOCONTEXT_NORMAL;
+ case BAS_BULKREAD:
+ return IOCONTEXT_BULKREAD;
+ case BAS_BULKWRITE:
+ return IOCONTEXT_BULKWRITE;
+ case BAS_VACUUM:
+ return IOCONTEXT_VACUUM;
+ }
+
+ elog(ERROR, "unrecognized BufferAccessStrategyType: %d", strategy->btype);
+ pg_unreachable();
+}
+
/*
* StrategyRejectBuffer -- consider rejecting a dirty buffer
*
@@ -682,14 +710,14 @@ AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
* if this buffer should be written and re-used.
*/
bool
-StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
+StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool from_ring)
{
/* We only do this in bulkread mode */
if (strategy->btype != BAS_BULKREAD)
return false;
/* Don't muck with behavior of normal buffer-replacement strategy */
- if (!strategy->current_was_in_ring ||
+ if (!from_ring ||
strategy->buffers[strategy->current] != BufferDescriptorGetBuffer(buf))
return false;
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 8372acc383..2108bbe7d8 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -18,6 +18,7 @@
#include "access/parallel.h"
#include "catalog/catalog.h"
#include "executor/instrument.h"
+#include "pgstat.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "utils/guc_hooks.h"
@@ -107,7 +108,7 @@ PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
*/
BufferDesc *
LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
- bool *foundPtr)
+ bool *foundPtr, IOContext *io_context)
{
BufferTag newTag; /* identity of requested block */
LocalBufferLookupEnt *hresult;
@@ -127,6 +128,14 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
hresult = (LocalBufferLookupEnt *)
hash_search(LocalBufHash, (void *) &newTag, HASH_FIND, NULL);
+ /*
+ * IO Operations on local buffers are only done in IOCONTEXT_NORMAL. Set
+ * io_context here (instead of after a buffer hit would have returned) for
+ * convenience since we don't have to worry about the overhead of calling
+ * IOContextForStrategy().
+ */
+ *io_context = IOCONTEXT_NORMAL;
+
if (hresult)
{
b = hresult->id;
@@ -230,6 +239,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
buf_state &= ~BM_DIRTY;
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+ pgstat_count_io_op(IOOP_WRITE, IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL);
pgBufferUsage.local_blks_written++;
}
@@ -256,6 +266,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
ClearBufferTag(&bufHdr->tag);
buf_state &= ~(BM_VALID | BM_TAG_VALID);
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+ pgstat_count_io_op(IOOP_EVICT, IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL);
}
hresult = (LocalBufferLookupEnt *)
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 60c9905eff..58ae2af55a 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -983,6 +983,15 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
{
MdfdVec *v = &reln->md_seg_fds[forknum][segno - 1];
+ /*
+ * fsyncs done through mdimmedsync() should be tracked in a separate
+ * IOContext from those done through mdsyncfiletag() to differentiate
+ * between unavoidable client backend fsyncs (e.g. those done during
+ * index build) and those which ideally would have been done by the
+ * checkpointer. Since other IO operations bypassing the buffer
+ * manager could also be tracked in such an IOContext, wait until
+ * these are also tracked to track immediate fsyncs.
+ */
if (FileSync(v->mdfd_vfd, WAIT_EVENT_DATA_FILE_IMMEDIATE_SYNC) < 0)
ereport(data_sync_elevel(ERROR),
(errcode_for_file_access(),
@@ -1021,6 +1030,19 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false /* retryOnError */ ))
{
+ /*
+ * We have no way of knowing if the current IOContext is
+ * IOCONTEXT_NORMAL or IOCONTEXT_[BULKREAD, BULKWRITE, VACUUM] at this
+ * point, so count the fsync as being in the IOCONTEXT_NORMAL
+ * IOContext. This is probably okay, because the number of backend
+ * fsyncs doesn't say anything about the efficacy of the
+ * BufferAccessStrategy. And counting both fsyncs done in
+ * IOCONTEXT_NORMAL and IOCONTEXT_[BULKREAD, BULKWRITE, VACUUM] under
+ * IOCONTEXT_NORMAL is likely clearer when investigating the number of
+ * backend fsyncs.
+ */
+ pgstat_count_io_op(IOOP_FSYNC, IOOBJECT_RELATION, IOCONTEXT_NORMAL);
+
ereport(DEBUG1,
(errmsg_internal("could not forward fsync request because request queue is full")));
@@ -1410,6 +1432,8 @@ mdsyncfiletag(const FileTag *ftag, char *path)
if (need_to_close)
FileClose(file);
+ pgstat_count_io_op(IOOP_FSYNC, IOOBJECT_RELATION, IOCONTEXT_NORMAL);
+
errno = save_errno;
return result;
}
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index ed8aa2519c..0b44814740 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -15,6 +15,7 @@
#ifndef BUFMGR_INTERNALS_H
#define BUFMGR_INTERNALS_H
+#include "pgstat.h"
#include "port/atomics.h"
#include "storage/buf.h"
#include "storage/bufmgr.h"
@@ -391,11 +392,12 @@ extern void IssuePendingWritebacks(WritebackContext *context);
extern void ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *tag);
/* freelist.c */
+extern IOContext IOContextForStrategy(BufferAccessStrategy bas);
extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
- uint32 *buf_state);
+ uint32 *buf_state, bool *from_ring);
extern void StrategyFreeBuffer(BufferDesc *buf);
extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
- BufferDesc *buf);
+ BufferDesc *buf, bool from_ring);
extern int StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
extern void StrategyNotifyBgWriter(int bgwprocno);
@@ -417,7 +419,7 @@ extern PrefetchBufferResult PrefetchLocalBuffer(SMgrRelation smgr,
ForkNumber forkNum,
BlockNumber blockNum);
extern BufferDesc *LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum,
- BlockNumber blockNum, bool *foundPtr);
+ BlockNumber blockNum, bool *foundPtr, IOContext *io_context);
extern void MarkLocalBufferDirty(Buffer buffer);
extern void DropRelationLocalBuffers(RelFileLocator rlocator,
ForkNumber forkNum,
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 33eadbc129..b8a18b8081 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -23,7 +23,12 @@
typedef void *Block;
-/* Possible arguments for GetAccessStrategy() */
+/*
+ * Possible arguments for GetAccessStrategy().
+ *
+ * If adding a new BufferAccessStrategyType, also add a new IOContext so
+ * IO statistics using this strategy are tracked.
+ */
typedef enum BufferAccessStrategyType
{
BAS_NORMAL, /* Normal random access */
--
2.34.1
v48-0003-Add-system-view-tracking-IO-ops-per-backend-type.patch
From 879cce732348d599445802f00e8650ee521fb239 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 9 Jan 2023 14:42:25 -0500
Subject: [PATCH v48 3/4] Add system view tracking IO ops per backend type
Add pg_stat_io, a system view which tracks the number of IOOps
(evictions, reuses, reads, writes, extensions, and fsyncs) done on each
IOObject (relation, temp relation) in each IOContext ("normal" and those
using a BufferAccessStrategy) by each type of backend (e.g. client
backend, checkpointer).
Some BackendTypes do not accumulate IO operation statistics and will
not be included in the view.
Some IOContexts are not used by some BackendTypes and will not be in the
view. For example, checkpointer does not use a BufferAccessStrategy
(currently), so there will be no rows for BufferAccessStrategy
IOContexts for checkpointer.
Some IOObjects are never operated on in some IOContexts or by some
BackendTypes. These rows are omitted from the view. For example,
checkpointer will never operate on IOOBJECT_TEMP_RELATION data, so those
rows are omitted.
Some IOOps are invalid in combination with certain IOContexts and
certain IOObjects. Those cells will be NULL in the view to distinguish
between 0 observed IOOps of that type and an invalid combination. For
example, temporary tables are not fsynced so cells for all BackendTypes
for IOOBJECT_TEMP_RELATION and IOOP_FSYNC will be NULL.
Some BackendTypes never perform certain IOOps. Those cells will also be
NULL in the view. For example, bgwriter should not perform reads.
View stats are populated with statistics incremented when a backend
performs an IO Operation and maintained by the cumulative statistics
subsystem.
Each row of the view shows stats for a particular BackendType, IOObject,
IOContext combination (e.g. a client backend's operations on permanent
relations in shared buffers) and each column in the view is the total
number of IO Operations done (e.g. writes). So a cell in the view would
be, for example, the number of blocks of relation data written from
shared buffers by client backends since the last stats reset.
In anticipation of tracking WAL IO and non-block-oriented IO (such as
temporary file IO), the "op_bytes" column specifies the unit of the
"reads", "writes", and "extends" columns for a given row.
Note that some of the cells in the view are redundant with fields in
pg_stat_bgwriter (e.g. buffers_backend), however these have been kept in
pg_stat_bgwriter for backwards compatibility. Deriving the redundant
pg_stat_bgwriter stats from the IO operations stats structures was also
problematic due to the separate reset targets for 'bgwriter' and 'io'.
Suggested by Andres Freund
Catalog version should be bumped.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
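
For illustration, once this patch is applied, per-backend-type IO activity on permanent relations could be inspected with a query like the following (column names as defined by the view in this patch; the query itself is just a usage sketch, not part of the patch):

```sql
SELECT backend_type, io_context,
       reads, writes, extends, op_bytes, fsyncs
FROM pg_stat_io
WHERE io_object = 'relation'
ORDER BY backend_type, io_context;
```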
---
contrib/amcheck/expected/check_heap.out | 34 ++++
contrib/amcheck/sql/check_heap.sql | 27 +++
src/backend/catalog/system_views.sql | 15 ++
src/backend/utils/adt/pgstatfuncs.c | 140 ++++++++++++++
src/include/catalog/pg_proc.dat | 9 +
src/test/regress/expected/rules.out | 12 ++
src/test/regress/expected/stats.out | 234 ++++++++++++++++++++++++
src/test/regress/sql/stats.sql | 148 +++++++++++++++
src/tools/pgindent/typedefs.list | 1 +
9 files changed, 620 insertions(+)
diff --git a/contrib/amcheck/expected/check_heap.out b/contrib/amcheck/expected/check_heap.out
index c010361025..e4785141a6 100644
--- a/contrib/amcheck/expected/check_heap.out
+++ b/contrib/amcheck/expected/check_heap.out
@@ -66,6 +66,22 @@ SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'ALL-VISIBLE');
INSERT INTO heaptest (a, b)
(SELECT gs, repeat('x', gs)
FROM generate_series(1,50) gs);
+-- pg_stat_io test:
+-- verify_heapam always uses a BAS_BULKREAD BufferAccessStrategy, whereas a
+-- sequential scan does so only if the table is large enough when compared to
+-- shared buffers (see initscan()). CREATE DATABASE ... also unconditionally
+-- uses a BAS_BULKREAD strategy, but we have chosen to use a tablespace and
+-- verify_heapam to provide coverage instead of adding another expensive
+-- operation to the main regression test suite.
+--
+-- Create an alternative tablespace and move the heaptest table to it, causing
+-- it to be rewritten and all of its blocks to be reliably evicted from shared
+-- buffers -- guaranteeing actual reads when we next select from it.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE regress_test_stats_tblspc LOCATION '';
+SELECT sum(reads) AS stats_bulkreads_before
+ FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+ALTER TABLE heaptest SET TABLESPACE regress_test_stats_tblspc;
-- Check that valid options are not rejected nor corruption reported
-- for a non-empty table
SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'none');
@@ -88,6 +104,23 @@ SELECT * FROM verify_heapam(relation := 'heaptest', startblock := 0, endblock :=
-------+--------+--------+-----
(0 rows)
+-- verify_heapam should have read in the page written out by
+-- ALTER TABLE ... SET TABLESPACE ...
+-- causing an additional bulkread, which should be reflected in pg_stat_io.
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(reads) AS stats_bulkreads_after
+ FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+SELECT :stats_bulkreads_after > :stats_bulkreads_before;
+ ?column?
+----------
+ t
+(1 row)
+
CREATE ROLE regress_heaptest_role;
-- verify permissions are checked (error due to function not callable)
SET ROLE regress_heaptest_role;
@@ -195,6 +228,7 @@ ERROR: cannot check relation "test_foreign_table"
DETAIL: This operation is not supported for foreign tables.
-- cleanup
DROP TABLE heaptest;
+DROP TABLESPACE regress_test_stats_tblspc;
DROP TABLE test_partition;
DROP TABLE test_partitioned;
DROP OWNED BY regress_heaptest_role; -- permissions
diff --git a/contrib/amcheck/sql/check_heap.sql b/contrib/amcheck/sql/check_heap.sql
index 298de6886a..6794ca4eb0 100644
--- a/contrib/amcheck/sql/check_heap.sql
+++ b/contrib/amcheck/sql/check_heap.sql
@@ -20,11 +20,29 @@ SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'NONE');
SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'ALL-FROZEN');
SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'ALL-VISIBLE');
+
-- Add some data so subsequent tests are not entirely trivial
INSERT INTO heaptest (a, b)
(SELECT gs, repeat('x', gs)
FROM generate_series(1,50) gs);
+-- pg_stat_io test:
+-- verify_heapam always uses a BAS_BULKREAD BufferAccessStrategy, whereas a
+-- sequential scan does so only if the table is large enough when compared to
+-- shared buffers (see initscan()). CREATE DATABASE ... also unconditionally
+-- uses a BAS_BULKREAD strategy, but we have chosen to use a tablespace and
+-- verify_heapam to provide coverage instead of adding another expensive
+-- operation to the main regression test suite.
+--
+-- Create an alternative tablespace and move the heaptest table to it, causing
+-- it to be rewritten and all of its blocks to be reliably evicted from shared
+-- buffers -- guaranteeing actual reads when we next select from it.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE regress_test_stats_tblspc LOCATION '';
+SELECT sum(reads) AS stats_bulkreads_before
+ FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+ALTER TABLE heaptest SET TABLESPACE regress_test_stats_tblspc;
+
-- Check that valid options are not rejected nor corruption reported
-- for a non-empty table
SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'none');
@@ -32,6 +50,14 @@ SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'all-frozen');
SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'all-visible');
SELECT * FROM verify_heapam(relation := 'heaptest', startblock := 0, endblock := 0);
+-- verify_heapam should have read in the page written out by
+-- ALTER TABLE ... SET TABLESPACE ...
+-- causing an additional bulkread, which should be reflected in pg_stat_io.
+SELECT pg_stat_force_next_flush();
+SELECT sum(reads) AS stats_bulkreads_after
+ FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+SELECT :stats_bulkreads_after > :stats_bulkreads_before;
+
CREATE ROLE regress_heaptest_role;
-- verify permissions are checked (error due to function not callable)
@@ -110,6 +136,7 @@ SELECT * FROM verify_heapam('test_foreign_table',
-- cleanup
DROP TABLE heaptest;
+DROP TABLESPACE regress_test_stats_tblspc;
DROP TABLE test_partition;
DROP TABLE test_partitioned;
DROP OWNED BY regress_heaptest_role; -- permissions
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index d2a8c82900..f875742068 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1116,6 +1116,21 @@ CREATE VIEW pg_stat_bgwriter AS
pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+CREATE VIEW pg_stat_io AS
+SELECT
+ b.backend_type,
+ b.io_context,
+ b.io_object,
+ b.reads,
+ b.writes,
+ b.extends,
+ b.op_bytes,
+ b.evictions,
+ b.reuses,
+ b.fsyncs,
+ b.stats_reset
+FROM pg_stat_get_io() b;
+
CREATE VIEW pg_stat_wal AS
SELECT
w.wal_records,
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 6df9f06a20..284bb2a698 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1234,6 +1234,146 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
}
+/*
+ * When adding a new column to the pg_stat_io view, add a new enum value
+ * here above IO_NUM_COLUMNS.
+*/
+typedef enum io_stat_col
+{
+ IO_COL_BACKEND_TYPE,
+ IO_COL_IO_CONTEXT,
+ IO_COL_IO_OBJECT,
+ IO_COL_READS,
+ IO_COL_WRITES,
+ IO_COL_EXTENDS,
+ IO_COL_CONVERSION,
+ IO_COL_EVICTIONS,
+ IO_COL_REUSES,
+ IO_COL_FSYNCS,
+ IO_COL_RESET_TIME,
+ IO_NUM_COLUMNS,
+} io_stat_col;
+
+/*
+ * When adding a new IOOp, add a new io_stat_col and add a case to this
+ * function returning the corresponding io_stat_col.
+ */
+static io_stat_col
+pgstat_get_io_op_index(IOOp io_op)
+{
+ switch (io_op)
+ {
+ case IOOP_EVICT:
+ return IO_COL_EVICTIONS;
+ case IOOP_READ:
+ return IO_COL_READS;
+ case IOOP_REUSE:
+ return IO_COL_REUSES;
+ case IOOP_WRITE:
+ return IO_COL_WRITES;
+ case IOOP_EXTEND:
+ return IO_COL_EXTENDS;
+ case IOOP_FSYNC:
+ return IO_COL_FSYNCS;
+ }
+
+ elog(ERROR, "unrecognized IOOp value: %d", io_op);
+ pg_unreachable();
+}
+
+Datum
+pg_stat_get_io(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsinfo;
+ PgStat_IO *backends_io_stats;
+ Datum reset_time;
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+
+ backends_io_stats = pgstat_fetch_stat_io();
+
+ reset_time = TimestampTzGetDatum(backends_io_stats->stat_reset_timestamp);
+
+ for (BackendType bktype = B_INVALID; bktype < BACKEND_NUM_TYPES; bktype++)
+ {
+ Datum bktype_desc = CStringGetTextDatum(GetBackendTypeDesc(bktype));
+ PgStat_BktypeIO *bktype_stats = &backends_io_stats->stats[bktype];
+
+ /*
+ * In Assert builds, we can afford an extra loop through all of the
+ * counters checking that only expected stats are non-zero, since it
+ * keeps the non-Assert code cleaner.
+ */
+ Assert(pgstat_bktype_io_stats_valid(bktype_stats, bktype));
+
+ /*
+ * For those BackendTypes without IO Operation stats, skip
+ * representing them in the view altogether.
+ */
+ if (!pgstat_tracks_io_bktype(bktype))
+ continue;
+
+ for (IOContext io_context = IOCONTEXT_FIRST;
+ io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ {
+ const char *context_name = pgstat_get_io_context_name(io_context);
+
+ for (IOObject io_obj = IOOBJECT_FIRST;
+ io_obj < IOOBJECT_NUM_TYPES; io_obj++)
+ {
+ const char *obj_name = pgstat_get_io_object_name(io_obj);
+
+ Datum values[IO_NUM_COLUMNS] = {0};
+ bool nulls[IO_NUM_COLUMNS] = {0};
+
+ /*
+ * Some combinations of IOContext, IOObject, and BackendType
+ * are not valid for any type of IOOp. In such cases, omit the
+ * entire row from the view.
+ */
+ if (!pgstat_tracks_io_object(bktype, io_context, io_obj))
+ continue;
+
+ values[IO_COL_BACKEND_TYPE] = bktype_desc;
+ values[IO_COL_IO_CONTEXT] = CStringGetTextDatum(context_name);
+ values[IO_COL_IO_OBJECT] = CStringGetTextDatum(obj_name);
+ values[IO_COL_RESET_TIME] = TimestampTzGetDatum(reset_time);
+
+ /*
+ * Hard-code this to the value of BLCKSZ for now. Future
+ * values could include XLOG_BLCKSZ, once WAL IO is tracked,
+ * and constant multipliers, once non-block-oriented IO (e.g.
+ * temporary file IO) is tracked.
+ */
+ values[IO_COL_CONVERSION] = Int64GetDatum(BLCKSZ);
+ for (IOOp io_op = IOOP_FIRST; io_op < IOOP_NUM_TYPES; io_op++)
+ {
+ int col_idx = pgstat_get_io_op_index(io_op);
+
+ /*
+ * Some combinations of BackendType and IOOp, of IOContext
+ * and IOOp, and of IOObject and IOOp are not tracked. Set
+ * these cells in the view NULL.
+ */
+ nulls[col_idx] = !pgstat_tracks_io_op(bktype, io_context, io_obj, io_op);
+
+ if (nulls[col_idx])
+ continue;
+
+ values[col_idx] =
+ Int64GetDatum(bktype_stats->data[io_context][io_obj][io_op]);
+ }
+
+ tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
+ values, nulls);
+ }
+ }
+ }
+
+ return (Datum) 0;
+}
+
/*
* Returns statistics of WAL activity
*/
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 3810de7b22..57a889cf49 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5690,6 +5690,15 @@
proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
+{ oid => '8459', descr => 'statistics: per backend type IO statistics',
+ proname => 'pg_stat_get_io', provolatile => 'v',
+ prorows => '30', proretset => 't',
+ proparallel => 'r', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,text,text,int8,int8,int8,int8,int8,int8,int8,timestamptz}',
+ proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}',
+ proargnames => '{backend_type,io_context,io_object,reads,writes,extends,op_bytes,evictions,reuses,fsyncs,stats_reset}',
+ prosrc => 'pg_stat_get_io' },
+
{ oid => '1136', descr => 'statistics: information about WAL activity',
proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
proparallel => 'r', prorettype => 'record', proargtypes => '',
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index a969ae63eb..dd5ddffc4d 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1876,6 +1876,18 @@ pg_stat_gssapi| SELECT s.pid,
s.gss_enc AS encrypted
FROM pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid, query_id)
WHERE (s.client_port IS NOT NULL);
+pg_stat_io| SELECT b.backend_type,
+ b.io_context,
+ b.io_object,
+ b.reads,
+ b.writes,
+ b.extends,
+ b.op_bytes,
+ b.evictions,
+ b.reuses,
+ b.fsyncs,
+ b.stats_reset
+ FROM pg_stat_get_io() b(backend_type, io_context, io_object, reads, writes, extends, op_bytes, evictions, reuses, fsyncs, stats_reset);
pg_stat_progress_analyze| SELECT s.pid,
s.datid,
d.datname,
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index 1d84407a03..3bd4e66fa8 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -1126,4 +1126,238 @@ SELECT pg_stat_get_subscription_stats(NULL);
(1 row)
+-- Test that the following operations are tracked in pg_stat_io:
+-- - reads of target blocks into shared buffers
+-- - writes of shared buffers to permanent storage
+-- - extends of relations using shared buffers
+-- - fsyncs done to ensure the durability of data dirtying shared buffers
+-- There is no test for blocks evicted from shared buffers, because we cannot
+-- be sure of the state of shared buffers at the point the test is run.
+-- Create a regular table and insert some data to generate IOCONTEXT_NORMAL
+-- extends.
+SELECT sum(extends) AS io_sum_shared_before_extends
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+CREATE TABLE test_io_shared(a int);
+INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(extends) AS io_sum_shared_after_extends
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT :io_sum_shared_after_extends > :io_sum_shared_before_extends;
+ ?column?
+----------
+ t
+(1 row)
+
+-- After a checkpoint, there should be some additional IOCONTEXT_NORMAL writes
+-- and fsyncs.
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+ FROM pg_stat_io
+ WHERE io_context = 'normal' AND io_object = 'relation' \gset io_sum_shared_before_
+-- See comment above for rationale for two explicit CHECKPOINTs.
+CHECKPOINT;
+CHECKPOINT;
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+ FROM pg_stat_io
+ WHERE io_context = 'normal' AND io_object = 'relation' \gset io_sum_shared_after_
+SELECT :io_sum_shared_after_writes > :io_sum_shared_before_writes;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT current_setting('fsync') = 'off'
+ OR :io_sum_shared_after_fsyncs > :io_sum_shared_before_fsyncs;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Change the tablespace so that the table is rewritten directly, then SELECT
+-- from it to cause it to be read back into shared buffers.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE regress_io_stats_tblspc LOCATION '';
+SELECT sum(reads) AS io_sum_shared_before_reads
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+ALTER TABLE test_io_shared SET TABLESPACE regress_io_stats_tblspc;
+-- SELECT from the table so that the data is read into shared buffers and
+-- io_context 'normal', io_object 'relation' reads are counted.
+SELECT COUNT(*) FROM test_io_shared;
+ count
+-------
+ 100
+(1 row)
+
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(reads) AS io_sum_shared_after_reads
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT :io_sum_shared_after_reads > :io_sum_shared_before_reads;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Drop the table so we can drop the tablespace later.
+DROP TABLE test_io_shared;
+-- Test that the following IOCONTEXT_LOCAL IOOps are tracked in pg_stat_io:
+-- - eviction of local buffers in order to reuse them
+-- - reads of temporary table blocks into local buffers
+-- - writes of local buffers to permanent storage
+-- - extends of temporary tables
+-- Set temp_buffers to its minimum so that we can trigger writes with fewer
+-- inserted tuples. Do so in a new session in case temporary tables have been
+-- accessed by previous tests in this session.
+\c
+SET temp_buffers TO 100;
+CREATE TEMPORARY TABLE test_io_local(a int, b TEXT);
+SELECT sum(extends) AS extends, sum(evictions) AS evictions, sum(writes) AS writes
+ FROM pg_stat_io
+ WHERE io_context = 'normal' AND io_object = 'temp relation' \gset io_sum_local_before_
+-- Insert tuples into the temporary table, generating extends in the stats.
+-- Insert enough values that we need to reuse and write out dirty local
+-- buffers, generating evictions and writes.
+INSERT INTO test_io_local SELECT generate_series(1, 5000) as id, repeat('a', 200);
+-- Ensure the table is large enough to exceed our temp_buffers setting.
+SELECT pg_table_size('test_io_local') / current_setting('block_size')::int8 > 100;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT sum(reads) AS io_sum_local_before_reads
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+-- Read in evicted buffers, generating reads.
+SELECT COUNT(*) FROM test_io_local;
+ count
+-------
+ 5000
+(1 row)
+
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(evictions) AS evictions,
+ sum(reads) AS reads,
+ sum(writes) AS writes,
+ sum(extends) AS extends
+ FROM pg_stat_io
+ WHERE io_context = 'normal' AND io_object = 'temp relation' \gset io_sum_local_after_
+SELECT :io_sum_local_after_evictions > :io_sum_local_before_evictions,
+ :io_sum_local_after_reads > :io_sum_local_before_reads,
+ :io_sum_local_after_writes > :io_sum_local_before_writes,
+ :io_sum_local_after_extends > :io_sum_local_before_extends;
+ ?column? | ?column? | ?column? | ?column?
+----------+----------+----------+----------
+ t | t | t | t
+(1 row)
+
+-- Change the tablespaces so that the temporary table is rewritten to other
+-- local buffers, exercising a different codepath than standard local buffer
+-- writes.
+ALTER TABLE test_io_local SET TABLESPACE regress_io_stats_tblspc;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(writes) AS io_sum_local_new_tblspc_writes
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT :io_sum_local_new_tblspc_writes > :io_sum_local_after_writes;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Drop the table so we can drop the tablespace later.
+DROP TABLE test_io_local;
+RESET temp_buffers;
+DROP TABLESPACE regress_io_stats_tblspc;
+-- Test that reuse of strategy buffers and reads of blocks into these reused
+-- buffers while VACUUMing are tracked in pg_stat_io.
+-- Set wal_skip_threshold smaller than the expected size of
+-- test_io_vac_strategy so that, even if wal_level is minimal, VACUUM FULL will
+-- fsync the newly rewritten test_io_vac_strategy instead of writing it to WAL.
+-- Writing it to WAL will result in the newly written relation pages being in
+-- shared buffers -- preventing us from testing BAS_VACUUM BufferAccessStrategy
+-- reads.
+SET wal_skip_threshold = '1 kB';
+SELECT sum(reuses) AS reuses, sum(reads) AS reads
+ FROM pg_stat_io WHERE io_context = 'vacuum' \gset io_sum_vac_strategy_before_
+CREATE TABLE test_io_vac_strategy(a int, b int) WITH (autovacuum_enabled = 'false');
+INSERT INTO test_io_vac_strategy SELECT i, i from generate_series(1, 8000)i;
+-- Ensure that the next VACUUM will need to perform IO by rewriting the table
+-- first with VACUUM (FULL).
+VACUUM (FULL) test_io_vac_strategy;
+VACUUM (PARALLEL 0) test_io_vac_strategy;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(reuses) AS reuses, sum(reads) AS reads
+ FROM pg_stat_io WHERE io_context = 'vacuum' \gset io_sum_vac_strategy_after_
+SELECT :io_sum_vac_strategy_after_reads > :io_sum_vac_strategy_before_reads,
+ :io_sum_vac_strategy_after_reuses > :io_sum_vac_strategy_before_reuses;
+ ?column? | ?column?
+----------+----------
+ t | t
+(1 row)
+
+RESET wal_skip_threshold;
+-- Test that extends done by a CTAS, which uses a BAS_BULKWRITE
+-- BufferAccessStrategy, are tracked in pg_stat_io.
+SELECT sum(extends) AS io_sum_bulkwrite_strategy_extends_before
+ FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+CREATE TABLE test_io_bulkwrite_strategy AS SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(extends) AS io_sum_bulkwrite_strategy_extends_after
+ FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+SELECT :io_sum_bulkwrite_strategy_extends_after > :io_sum_bulkwrite_strategy_extends_before;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Test IO stats reset
+SELECT pg_stat_have_stats('io', 0, 0);
+ pg_stat_have_stats
+--------------------
+ t
+(1 row)
+
+SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) AS io_stats_pre_reset
+ FROM pg_stat_io \gset
+SELECT pg_stat_reset_shared('io');
+ pg_stat_reset_shared
+----------------------
+
+(1 row)
+
+SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) AS io_stats_post_reset
+ FROM pg_stat_io \gset
+SELECT :io_stats_post_reset < :io_stats_pre_reset;
+ ?column?
+----------
+ t
+(1 row)
+
-- End of Stats Test
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index b4d6753c71..163ed38faf 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -536,4 +536,152 @@ SELECT pg_stat_get_replication_slot(NULL);
SELECT pg_stat_get_subscription_stats(NULL);
+-- Test that the following operations are tracked in pg_stat_io:
+-- - reads of target blocks into shared buffers
+-- - writes of shared buffers to permanent storage
+-- - extends of relations using shared buffers
+-- - fsyncs done to ensure the durability of data dirtying shared buffers
+
+-- There is no test for blocks evicted from shared buffers, because we cannot
+-- be sure of the state of shared buffers at the point the test is run.
+
+-- Create a regular table and insert some data to generate IOCONTEXT_NORMAL
+-- extends.
+SELECT sum(extends) AS io_sum_shared_before_extends
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+CREATE TABLE test_io_shared(a int);
+INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+SELECT sum(extends) AS io_sum_shared_after_extends
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT :io_sum_shared_after_extends > :io_sum_shared_before_extends;
+
+-- After a checkpoint, there should be some additional IOCONTEXT_NORMAL writes
+-- and fsyncs.
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+ FROM pg_stat_io
+ WHERE io_context = 'normal' AND io_object = 'relation' \gset io_sum_shared_before_
+-- See comment above for rationale for two explicit CHECKPOINTs.
+CHECKPOINT;
+CHECKPOINT;
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+ FROM pg_stat_io
+ WHERE io_context = 'normal' AND io_object = 'relation' \gset io_sum_shared_after_
+
+SELECT :io_sum_shared_after_writes > :io_sum_shared_before_writes;
+SELECT current_setting('fsync') = 'off'
+ OR :io_sum_shared_after_fsyncs > :io_sum_shared_before_fsyncs;
+
+-- Change the tablespace so that the table is rewritten directly, then SELECT
+-- from it to cause it to be read back into shared buffers.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE regress_io_stats_tblspc LOCATION '';
+SELECT sum(reads) AS io_sum_shared_before_reads
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+ALTER TABLE test_io_shared SET TABLESPACE regress_io_stats_tblspc;
+-- SELECT from the table so that the data is read into shared buffers and
+-- io_context 'normal', io_object 'relation' reads are counted.
+SELECT COUNT(*) FROM test_io_shared;
+SELECT pg_stat_force_next_flush();
+SELECT sum(reads) AS io_sum_shared_after_reads
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT :io_sum_shared_after_reads > :io_sum_shared_before_reads;
+-- Drop the table so we can drop the tablespace later.
+DROP TABLE test_io_shared;
+
+-- Test that the following IOCONTEXT_LOCAL IOOps are tracked in pg_stat_io:
+-- - eviction of local buffers in order to reuse them
+-- - reads of temporary table blocks into local buffers
+-- - writes of local buffers to permanent storage
+-- - extends of temporary tables
+
+-- Set temp_buffers to its minimum so that we can trigger writes with fewer
+-- inserted tuples. Do so in a new session in case temporary tables have been
+-- accessed by previous tests in this session.
+\c
+SET temp_buffers TO 100;
+CREATE TEMPORARY TABLE test_io_local(a int, b TEXT);
+SELECT sum(extends) AS extends, sum(evictions) AS evictions, sum(writes) AS writes
+ FROM pg_stat_io
+ WHERE io_context = 'normal' AND io_object = 'temp relation' \gset io_sum_local_before_
+-- Insert tuples into the temporary table, generating extends in the stats.
+-- Insert enough values that we need to reuse and write out dirty local
+-- buffers, generating evictions and writes.
+INSERT INTO test_io_local SELECT generate_series(1, 5000) as id, repeat('a', 200);
+-- Ensure the table is large enough to exceed our temp_buffers setting.
+SELECT pg_table_size('test_io_local') / current_setting('block_size')::int8 > 100;
+
+SELECT sum(reads) AS io_sum_local_before_reads
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+-- Read in evicted buffers, generating reads.
+SELECT COUNT(*) FROM test_io_local;
+SELECT pg_stat_force_next_flush();
+SELECT sum(evictions) AS evictions,
+ sum(reads) AS reads,
+ sum(writes) AS writes,
+ sum(extends) AS extends
+ FROM pg_stat_io
+ WHERE io_context = 'normal' AND io_object = 'temp relation' \gset io_sum_local_after_
+SELECT :io_sum_local_after_evictions > :io_sum_local_before_evictions,
+ :io_sum_local_after_reads > :io_sum_local_before_reads,
+ :io_sum_local_after_writes > :io_sum_local_before_writes,
+ :io_sum_local_after_extends > :io_sum_local_before_extends;
+
+-- Change the tablespaces so that the temporary table is rewritten to other
+-- local buffers, exercising a different codepath than standard local buffer
+-- writes.
+ALTER TABLE test_io_local SET TABLESPACE regress_io_stats_tblspc;
+SELECT pg_stat_force_next_flush();
+SELECT sum(writes) AS io_sum_local_new_tblspc_writes
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT :io_sum_local_new_tblspc_writes > :io_sum_local_after_writes;
+-- Drop the table so we can drop the tablespace later.
+DROP TABLE test_io_local;
+RESET temp_buffers;
+DROP TABLESPACE regress_io_stats_tblspc;
+
+-- Test that reuse of strategy buffers and reads of blocks into these reused
+-- buffers while VACUUMing are tracked in pg_stat_io.
+
+-- Set wal_skip_threshold smaller than the expected size of
+-- test_io_vac_strategy so that, even if wal_level is minimal, VACUUM FULL will
+-- fsync the newly rewritten test_io_vac_strategy instead of writing it to WAL.
+-- Writing it to WAL will result in the newly written relation pages being in
+-- shared buffers -- preventing us from testing BAS_VACUUM BufferAccessStrategy
+-- reads.
+SET wal_skip_threshold = '1 kB';
+SELECT sum(reuses) AS reuses, sum(reads) AS reads
+ FROM pg_stat_io WHERE io_context = 'vacuum' \gset io_sum_vac_strategy_before_
+CREATE TABLE test_io_vac_strategy(a int, b int) WITH (autovacuum_enabled = 'false');
+INSERT INTO test_io_vac_strategy SELECT i, i from generate_series(1, 8000)i;
+-- Ensure that the next VACUUM will need to perform IO by rewriting the table
+-- first with VACUUM (FULL).
+VACUUM (FULL) test_io_vac_strategy;
+VACUUM (PARALLEL 0) test_io_vac_strategy;
+SELECT pg_stat_force_next_flush();
+SELECT sum(reuses) AS reuses, sum(reads) AS reads
+ FROM pg_stat_io WHERE io_context = 'vacuum' \gset io_sum_vac_strategy_after_
+SELECT :io_sum_vac_strategy_after_reads > :io_sum_vac_strategy_before_reads,
+ :io_sum_vac_strategy_after_reuses > :io_sum_vac_strategy_before_reuses;
+RESET wal_skip_threshold;
+
+-- Test that extends done by a CTAS, which uses a BAS_BULKWRITE
+-- BufferAccessStrategy, are tracked in pg_stat_io.
+SELECT sum(extends) AS io_sum_bulkwrite_strategy_extends_before
+ FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+CREATE TABLE test_io_bulkwrite_strategy AS SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+SELECT sum(extends) AS io_sum_bulkwrite_strategy_extends_after
+ FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+SELECT :io_sum_bulkwrite_strategy_extends_after > :io_sum_bulkwrite_strategy_extends_before;
+
+-- Test IO stats reset
+SELECT pg_stat_have_stats('io', 0, 0);
+SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) AS io_stats_pre_reset
+ FROM pg_stat_io \gset
+SELECT pg_stat_reset_shared('io');
+SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) AS io_stats_post_reset
+ FROM pg_stat_io \gset
+SELECT :io_stats_post_reset < :io_stats_pre_reset;
+
-- End of Stats Test
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 1be6e07980..a399e0a5e4 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3377,6 +3377,7 @@ intset_internal_node
intset_leaf_node
intset_node
intvKEY
+io_stat_col
itemIdCompact
itemIdCompactData
iterator
--
2.34.1
Attachment: v48-0004-pg_stat_io-documentation.patch (text/x-patch; charset=US-ASCII)
From 21d63527d9e62afb5649bcbf162c3be860408f66 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 9 Jan 2023 14:42:53 -0500
Subject: [PATCH v48 4/4] pg_stat_io documentation
Author: Melanie Plageman <melanieplageman@gmail.com>
Author: Samay Sharma <smilingsamay@gmail.com>
Reviewed-by: Maciek Sakrejda <m.sakrejda@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
doc/src/sgml/monitoring.sgml | 321 +++++++++++++++++++++++++++++++++--
1 file changed, 307 insertions(+), 14 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 8d51ca3773..d0ca41e204 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -469,6 +469,16 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</entry>
</row>
+ <row>
+ <entry><structname>pg_stat_io</structname><indexterm><primary>pg_stat_io</primary></indexterm></entry>
+ <entry>
+ One row for each combination of backend type, context, and target object
+ containing cluster-wide I/O statistics.
+ See <link linkend="monitoring-pg-stat-io-view">
+ <structname>pg_stat_io</structname></link> for details.
+ </entry>
+ </row>
+
<row>
<entry><structname>pg_stat_replication_slots</structname><indexterm><primary>pg_stat_replication_slots</primary></indexterm></entry>
<entry>One row per replication slot, showing statistics about the
@@ -665,20 +675,16 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</para>
<para>
- The <structname>pg_statio_</structname> views are primarily useful to
- determine the effectiveness of the buffer cache. When the number
- of actual disk reads is much smaller than the number of buffer
- hits, then the cache is satisfying most read requests without
- invoking a kernel call. However, these statistics do not give the
- entire story: due to the way in which <productname>PostgreSQL</productname>
- handles disk I/O, data that is not in the
- <productname>PostgreSQL</productname> buffer cache might still reside in the
- kernel's I/O cache, and might therefore still be fetched without
- requiring a physical read. Users interested in obtaining more
- detailed information on <productname>PostgreSQL</productname> I/O behavior are
- advised to use the <productname>PostgreSQL</productname> statistics views
- in combination with operating system utilities that allow insight
- into the kernel's handling of I/O.
+ The <structname>pg_stat_io</structname> and
+ <structname>pg_statio_</structname> set of views are useful for determining
+ the effectiveness of the buffer cache. They can be used to calculate a cache
+ hit ratio. Note that while <productname>PostgreSQL</productname>'s I/O
+ statistics capture most instances in which the kernel was invoked in order
+ to perform I/O, they do not differentiate between data which had to be
+ fetched from disk and that which already resided in the kernel page cache.
+ Users are advised to use the <productname>PostgreSQL</productname>
+ statistics views in combination with operating system utilities for a more
+ complete picture of their database's I/O performance.
</para>
</sect2>
@@ -3643,6 +3649,293 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
<structfield>last_archived_wal</structfield> have also been successfully
archived.
</para>
+ </sect2>
+
+ <sect2 id="monitoring-pg-stat-io-view">
+ <title><structname>pg_stat_io</structname></title>
+
+ <indexterm>
+ <primary>pg_stat_io</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_stat_io</structname> view will contain one row for each
+ combination of backend type, I/O context, and target I/O object showing
+ cluster-wide I/O statistics. Combinations which do not make sense are
+ omitted.
+ </para>
+
+ <para>
+ Currently, I/O on relations (e.g. tables, indexes) is tracked. However,
+ relation I/O which bypasses shared buffers (e.g. when moving a table from one
+ tablespace to another) is currently not tracked.
+ </para>
+
+ <table id="pg-stat-io-view" xreflabel="pg_stat_io">
+ <title><structname>pg_stat_io</structname> View</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para>
+ </entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>backend_type</structfield> <type>text</type>
+ </para>
+ <para>
+ Type of backend (e.g. background worker, autovacuum worker). See <link
+ linkend="monitoring-pg-stat-activity-view">
+ <structname>pg_stat_activity</structname></link> for more information
+ on <varname>backend_type</varname>s. Some
+ <varname>backend_type</varname>s do not accumulate I/O operation
+ statistics and will not be included in the view.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>io_context</structfield> <type>text</type>
+ </para>
+ <para>
+ The context of an I/O operation. Possible values are:
+ </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>normal</literal>: The default or standard
+ <varname>io_context</varname> for a type of I/O operation. For
+ example, by default, relation data is read into and written out from
+ shared buffers. Thus, reads and writes of relation data to and from
+ shared buffers are tracked in <varname>io_context</varname>
+ <literal>normal</literal>.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>vacuum</literal>: I/O operations performed outside of shared
+ buffers while vacuuming and analyzing permanent relations. Temporary
+ table vacuums use the same local buffer pool as other temporary table
+ IO operations and are tracked in <varname>io_context</varname>
+ <literal>normal</literal>.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>bulkread</literal>: Certain large read I/O operations
+ done outside of shared buffers, for example, a sequential scan of a
+ large table.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>bulkwrite</literal>: Certain large write I/O operations
+ done outside of shared buffers, such as <command>COPY</command>.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>io_object</structfield> <type>text</type>
+ </para>
+ <para>
+ Target object of an I/O operation. Possible values are:
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>relation</literal>: Permanent relations.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>temp relation</literal>: Temporary relations.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>reads</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of read operations, each of the size specified in
+ <varname>op_bytes</varname>.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>writes</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of write operations, each of the size specified in
+ <varname>op_bytes</varname>.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>extends</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of relation extend operations, each of the size specified in
+ <varname>op_bytes</varname>.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>op_bytes</structfield> <type>bigint</type>
+ </para>
+ <para>
+ The number of bytes per unit of I/O read, written, or extended.
+ </para>
+ <para>
+ Relation data reads, writes, and extends are done in
+ <varname>block_size</varname> units, derived from the build-time
+ parameter <symbol>BLCKSZ</symbol>, which is <literal>8192</literal> by
+ default.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>evictions</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of times a block has been written out from a shared or local
+ buffer in order to make it available for another use.
+ </para>
+ <para>
+ In <varname>io_context</varname> <literal>normal</literal>, this counts
+ the number of times a block was evicted from a buffer and replaced with
+ another block. In <varname>io_context</varname>s
+ <literal>bulkwrite</literal>, <literal>bulkread</literal>, and
+ <literal>vacuum</literal>, this counts the number of times a block was
+ evicted from shared buffers in order to add the shared buffer to a
+ separate, size-limited ring buffer for use in a bulk I/O operation.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>reuses</structfield> <type>bigint</type>
+ </para>
+ <para>
+ The number of times an existing buffer in a size-limited ring buffer
+ outside of shared buffers was reused as part of an I/O operation in the
+ <literal>bulkread</literal>, <literal>bulkwrite</literal>, or
+ <literal>vacuum</literal> <varname>io_context</varname>s.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>fsyncs</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of <literal>fsync</literal> calls. These are only tracked in
+ <varname>io_context</varname> <literal>normal</literal>.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
+ </para>
+ <para>
+ Time at which these statistics were last reset.
+ </para>
+ </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ Some backend types never perform I/O operations in some I/O contexts and/or
+ on some I/O objects. These rows are omitted from the view. For example, the
+ checkpointer does not checkpoint temporary tables, so there will be no rows
+ for <varname>backend_type</varname> <literal>checkpointer</literal> and
+ <varname>io_object</varname> <literal>temp relation</literal>.
+ </para>
+
+ <para>
+ In addition, some I/O operations will never be performed either by certain
+ backend types or in certain I/O contexts or on certain I/O objects. These
+ cells will be NULL. For example, temporary tables are not
+ <literal>fsync</literal>ed, so <varname>fsyncs</varname> will be NULL for
+ <varname>io_object</varname> <literal>temp relation</literal>. Also, the
+ background writer does not perform reads, so <varname>reads</varname> will
+ be NULL in rows for <varname>backend_type</varname> <literal>background
+ writer</literal>.
+ </para>
+
+ <para>
+ <structname>pg_stat_io</structname> can be used to inform database tuning.
+ For example:
+ <itemizedlist>
+ <listitem>
+ <para>
+ A high <varname>evictions</varname> count can indicate that shared
+ buffers should be increased.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ Client backends rely on the checkpointer to ensure data is persisted to
+ permanent storage. Large numbers of <varname>fsyncs</varname> by
+ <literal>client backend</literal>s could indicate a misconfiguration of
+ shared buffers or of the checkpointer. More information on configuring
+ the checkpointer can be found in <xref linkend="wal-configuration"/>.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ Normally, client backends should be able to rely on auxiliary processes
+ like the checkpointer and the background writer to write out dirty data
+ as much as possible. Large numbers of writes by client backends could
+ indicate a misconfiguration of shared buffers or of the checkpointer.
+ More information on configuring the checkpointer can be found in <xref
+ linkend="wal-configuration"/>.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </para>
+
</sect2>
--
2.34.1
Attachment: v48-0001-pgstat-Infrastructure-to-track-IO-operations.patch (text/x-patch; charset=US-ASCII)
From f77f8e0eeb4377cbffd2d52c9455f05e41468bec Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 4 Jan 2023 17:20:41 -0500
Subject: [PATCH v48 1/4] pgstat: Infrastructure to track IO operations
Introduce "IOOp", an IO operation done by a backend, "IOObject", the
target object of the IO, and "IOContext", the context or location of the
IO operations on that object. For example, the checkpointer may write a
shared buffer out. This would be considered an IOOP_WRITE IOOp on an
IOOBJECT_RELATION IOObject in the IOCONTEXT_NORMAL IOContext by
BackendType B_CHECKPOINTER.
Each IOOp (evict, extend, fsync, read, reuse, and write) can be counted
per IOObject (relation, temp relation) per IOContext (normal, bulkread,
bulkwrite, or vacuum) through a call to pgstat_count_io_op().
Note that this commit introduces the infrastructure to count IO
Operation statistics. A subsequent commit will add calls to
pgstat_count_io_op() in the appropriate locations.
IOContext IOCONTEXT_NORMAL concerns operations on local and shared
buffers, while IOCONTEXT_BULKREAD, IOCONTEXT_BULKWRITE, and
IOCONTEXT_VACUUM IOContexts concern IO operations on buffers as part of
a BufferAccessStrategy.
IOObject IOOBJECT_TEMP_RELATION concerns IO Operations on buffers
containing temporary table data, while IOObject IOOBJECT_RELATION
concerns IO Operations on buffers containing permanent relation data.
Stats on IOOps on all IOObjects in all IOContexts for a given backend
are first counted in a backend's local memory and then flushed to shared
memory and accumulated with those from all other backends, exited and
live.
Some BackendTypes do not flush their pending statistics at regular
intervals; they instead explicitly call pgstat_flush_io() during the
course of normal operations to flush their backend-local IO operation
statistics to shared memory in a timely manner.
Because not all BackendType, IOOp, IOObject, IOContext combinations are
valid, the validity of the stats is checked before flushing pending
stats and before reading in the existing stats file to shared memory.
The aggregated stats in shared memory could be extended in the future
with per-backend stats -- useful for per connection IO statistics and
monitoring.
PGSTAT_FILE_FORMAT_ID should be bumped with this commit.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
doc/src/sgml/monitoring.sgml | 2 +
src/backend/utils/activity/Makefile | 1 +
src/backend/utils/activity/meson.build | 1 +
src/backend/utils/activity/pgstat.c | 26 ++
src/backend/utils/activity/pgstat_bgwriter.c | 7 +-
.../utils/activity/pgstat_checkpointer.c | 7 +-
src/backend/utils/activity/pgstat_io.c | 385 ++++++++++++++++++
src/backend/utils/activity/pgstat_relation.c | 15 +-
src/backend/utils/activity/pgstat_shmem.c | 4 +
src/backend/utils/activity/pgstat_wal.c | 4 +-
src/backend/utils/adt/pgstatfuncs.c | 11 +-
src/include/miscadmin.h | 2 +
src/include/pgstat.h | 68 ++++
src/include/utils/pgstat_internal.h | 30 ++
src/tools/pgindent/typedefs.list | 6 +
15 files changed, 562 insertions(+), 7 deletions(-)
create mode 100644 src/backend/utils/activity/pgstat_io.c
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 358d2ff90f..8d51ca3773 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -5418,6 +5418,8 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
the <structname>pg_stat_bgwriter</structname>
view, <literal>archiver</literal> to reset all the counters shown in
the <structname>pg_stat_archiver</structname> view,
+ <literal>io</literal> to reset all the counters shown in the
+ <structname>pg_stat_io</structname> view,
<literal>wal</literal> to reset all the counters shown in the
<structname>pg_stat_wal</structname> view or
<literal>recovery_prefetch</literal> to reset all the counters shown
diff --git a/src/backend/utils/activity/Makefile b/src/backend/utils/activity/Makefile
index a80eda3cf4..7d7482dde0 100644
--- a/src/backend/utils/activity/Makefile
+++ b/src/backend/utils/activity/Makefile
@@ -22,6 +22,7 @@ OBJS = \
pgstat_checkpointer.o \
pgstat_database.o \
pgstat_function.o \
+ pgstat_io.o \
pgstat_relation.o \
pgstat_replslot.o \
pgstat_shmem.o \
diff --git a/src/backend/utils/activity/meson.build b/src/backend/utils/activity/meson.build
index a2b872c24b..518ee3f798 100644
--- a/src/backend/utils/activity/meson.build
+++ b/src/backend/utils/activity/meson.build
@@ -9,6 +9,7 @@ backend_sources += files(
'pgstat_checkpointer.c',
'pgstat_database.c',
'pgstat_function.c',
+ 'pgstat_io.c',
'pgstat_relation.c',
'pgstat_replslot.c',
'pgstat_shmem.c',
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 0fa5370bcd..60fc4e761f 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -72,6 +72,7 @@
* - pgstat_checkpointer.c
* - pgstat_database.c
* - pgstat_function.c
+ * - pgstat_io.c
* - pgstat_relation.c
* - pgstat_replslot.c
* - pgstat_slru.c
@@ -359,6 +360,15 @@ static const PgStat_KindInfo pgstat_kind_infos[PGSTAT_NUM_KINDS] = {
.snapshot_cb = pgstat_checkpointer_snapshot_cb,
},
+ [PGSTAT_KIND_IO] = {
+ .name = "io",
+
+ .fixed_amount = true,
+
+ .reset_all_cb = pgstat_io_reset_all_cb,
+ .snapshot_cb = pgstat_io_snapshot_cb,
+ },
+
[PGSTAT_KIND_SLRU] = {
.name = "slru",
@@ -582,6 +592,7 @@ pgstat_report_stat(bool force)
/* Don't expend a clock check if nothing to do */
if (dlist_is_empty(&pgStatPending) &&
+ !have_iostats &&
!have_slrustats &&
!pgstat_have_pending_wal())
{
@@ -628,6 +639,9 @@ pgstat_report_stat(bool force)
/* flush database / relation / function / ... stats */
partial_flush |= pgstat_flush_pending_entries(nowait);
+ /* flush IO stats */
+ partial_flush |= pgstat_flush_io(nowait);
+
/* flush wal stats */
partial_flush |= pgstat_flush_wal(nowait);
@@ -1322,6 +1336,12 @@ pgstat_write_statsfile(void)
pgstat_build_snapshot_fixed(PGSTAT_KIND_CHECKPOINTER);
write_chunk_s(fpout, &pgStatLocal.snapshot.checkpointer);
+ /*
+ * Write IO stats struct
+ */
+ pgstat_build_snapshot_fixed(PGSTAT_KIND_IO);
+ write_chunk_s(fpout, &pgStatLocal.snapshot.io);
+
/*
* Write SLRU stats struct
*/
@@ -1496,6 +1516,12 @@ pgstat_read_statsfile(void)
if (!read_chunk_s(fpin, &shmem->checkpointer.stats))
goto error;
+ /*
+ * Read IO stats struct
+ */
+ if (!read_chunk_s(fpin, &shmem->io.stats))
+ goto error;
+
/*
* Read SLRU stats struct
*/
diff --git a/src/backend/utils/activity/pgstat_bgwriter.c b/src/backend/utils/activity/pgstat_bgwriter.c
index 9247f2dda2..92be384b0d 100644
--- a/src/backend/utils/activity/pgstat_bgwriter.c
+++ b/src/backend/utils/activity/pgstat_bgwriter.c
@@ -24,7 +24,7 @@ PgStat_BgWriterStats PendingBgWriterStats = {0};
/*
- * Report bgwriter statistics
+ * Report bgwriter and IO statistics
*/
void
pgstat_report_bgwriter(void)
@@ -56,6 +56,11 @@ pgstat_report_bgwriter(void)
* Clear out the statistics buffer, so it can be re-used.
*/
MemSet(&PendingBgWriterStats, 0, sizeof(PendingBgWriterStats));
+
+ /*
+ * Report IO statistics
+ */
+ pgstat_flush_io(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_checkpointer.c b/src/backend/utils/activity/pgstat_checkpointer.c
index 3e9ab45103..26dec112f6 100644
--- a/src/backend/utils/activity/pgstat_checkpointer.c
+++ b/src/backend/utils/activity/pgstat_checkpointer.c
@@ -24,7 +24,7 @@ PgStat_CheckpointerStats PendingCheckpointerStats = {0};
/*
- * Report checkpointer statistics
+ * Report checkpointer and IO statistics
*/
void
pgstat_report_checkpointer(void)
@@ -62,6 +62,11 @@ pgstat_report_checkpointer(void)
* Clear out the statistics buffer, so it can be re-used.
*/
MemSet(&PendingCheckpointerStats, 0, sizeof(PendingCheckpointerStats));
+
+ /*
+ * Report IO statistics
+ */
+ pgstat_flush_io(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
new file mode 100644
index 0000000000..9859db0581
--- /dev/null
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -0,0 +1,385 @@
+/* -------------------------------------------------------------------------
+ *
+ * pgstat_io.c
+ * Implementation of IO statistics.
+ *
+ * This file contains the implementation of IO statistics. It is kept separate
+ * from pgstat.c to enforce the line between the statistics access / storage
+ * implementation and the details about individual types of statistics.
+ *
+ * Copyright (c) 2021-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/utils/activity/pgstat_io.c
+ * -------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "utils/pgstat_internal.h"
+
+
+static PgStat_BktypeIO PendingIOStats;
+bool have_iostats = false;
+
+/*
+ * Check that stats have not been counted for any combination of IOContext,
+ * IOObject, and IOOp which are not tracked for the passed-in BackendType. The
+ * passed-in PgStat_BktypeIO must contain stats from the BackendType specified
+ * by the second parameter. Caller is responsible for locking the passed-in
+ * PgStat_BktypeIO, if needed.
+ */
+bool
+pgstat_bktype_io_stats_valid(PgStat_BktypeIO *backend_io,
+ BackendType bktype)
+{
+ bool bktype_tracked = pgstat_tracks_io_bktype(bktype);
+
+ for (IOContext io_context = IOCONTEXT_FIRST;
+ io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ {
+ for (IOObject io_object = IOOBJECT_FIRST;
+ io_object < IOOBJECT_NUM_TYPES; io_object++)
+ {
+ /*
+ * Don't bother trying to skip to the next loop iteration if
+ * pgstat_tracks_io_object() would return false here. We still
+ * need to validate that each counter is zero anyway.
+ */
+ for (IOOp io_op = IOOP_FIRST; io_op < IOOP_NUM_TYPES; io_op++)
+ {
+ /* No stats, so nothing to validate */
+ if (backend_io->data[io_context][io_object][io_op] == 0)
+ continue;
+
+ /* There are stats and there shouldn't be */
+ if (!bktype_tracked ||
+ !pgstat_tracks_io_op(bktype, io_context, io_object, io_op))
+ return false;
+ }
+ }
+ }
+
+ return true;
+}
+
+void
+pgstat_count_io_op(IOOp io_op, IOObject io_object, IOContext io_context)
+{
+ Assert(io_context < IOCONTEXT_NUM_TYPES);
+ Assert(io_object < IOOBJECT_NUM_TYPES);
+ Assert(io_op < IOOP_NUM_TYPES);
+ Assert(pgstat_tracks_io_op(MyBackendType, io_context, io_object, io_op));
+
+ PendingIOStats.data[io_context][io_object][io_op]++;
+
+ have_iostats = true;
+}
+
+PgStat_IO *
+pgstat_fetch_stat_io(void)
+{
+ pgstat_snapshot_fixed(PGSTAT_KIND_IO);
+
+ return &pgStatLocal.snapshot.io;
+}
+
+/*
+ * Flush out locally pending IO statistics
+ *
+ * If no stats have been recorded, this function returns false.
+ *
+ * If nowait is true and the lock could not be acquired, this function
+ * returns true without flushing; otherwise it returns false.
+ */
+bool
+pgstat_flush_io(bool nowait)
+{
+ LWLock *bktype_lock;
+ PgStat_BktypeIO *bktype_shstats;
+
+ if (!have_iostats)
+ return false;
+
+ bktype_lock = &pgStatLocal.shmem->io.locks[MyBackendType];
+ bktype_shstats =
+ &pgStatLocal.shmem->io.stats.stats[MyBackendType];
+
+ if (!nowait)
+ LWLockAcquire(bktype_lock, LW_EXCLUSIVE);
+ else if (!LWLockConditionalAcquire(bktype_lock, LW_EXCLUSIVE))
+ return true;
+
+ for (IOContext io_context = IOCONTEXT_FIRST;
+ io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ for (IOObject io_object = IOOBJECT_FIRST;
+ io_object < IOOBJECT_NUM_TYPES; io_object++)
+ for (IOOp io_op = IOOP_FIRST;
+ io_op < IOOP_NUM_TYPES; io_op++)
+ bktype_shstats->data[io_context][io_object][io_op] +=
+ PendingIOStats.data[io_context][io_object][io_op];
+
+ Assert(pgstat_bktype_io_stats_valid(bktype_shstats, MyBackendType));
+
+ LWLockRelease(bktype_lock);
+
+ memset(&PendingIOStats, 0, sizeof(PendingIOStats));
+
+ have_iostats = false;
+
+ return false;
+}
+
+const char *
+pgstat_get_io_context_name(IOContext io_context)
+{
+ switch (io_context)
+ {
+ case IOCONTEXT_BULKREAD:
+ return "bulkread";
+ case IOCONTEXT_BULKWRITE:
+ return "bulkwrite";
+ case IOCONTEXT_NORMAL:
+ return "normal";
+ case IOCONTEXT_VACUUM:
+ return "vacuum";
+ }
+
+ elog(ERROR, "unrecognized IOContext value: %d", io_context);
+ pg_unreachable();
+}
+
+const char *
+pgstat_get_io_object_name(IOObject io_object)
+{
+ switch (io_object)
+ {
+ case IOOBJECT_RELATION:
+ return "relation";
+ case IOOBJECT_TEMP_RELATION:
+ return "temp relation";
+ }
+
+ elog(ERROR, "unrecognized IOObject value: %d", io_object);
+ pg_unreachable();
+}
+
+void
+pgstat_io_reset_all_cb(TimestampTz ts)
+{
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ LWLock *bktype_lock = &pgStatLocal.shmem->io.locks[i];
+ PgStat_BktypeIO *bktype_shstats = &pgStatLocal.shmem->io.stats.stats[i];
+
+ LWLockAcquire(bktype_lock, LW_EXCLUSIVE);
+
+ /*
+ * Use the lock in the first BackendType's PgStat_BktypeIO to protect
+ * the reset timestamp as well.
+ */
+ if (i == 0)
+ pgStatLocal.shmem->io.stats.stat_reset_timestamp = ts;
+
+ memset(bktype_shstats, 0, sizeof(*bktype_shstats));
+ LWLockRelease(bktype_lock);
+ }
+}
+
+void
+pgstat_io_snapshot_cb(void)
+{
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ LWLock *bktype_lock = &pgStatLocal.shmem->io.locks[i];
+ PgStat_BktypeIO *bktype_shstats = &pgStatLocal.shmem->io.stats.stats[i];
+ PgStat_BktypeIO *bktype_snap = &pgStatLocal.snapshot.io.stats[i];
+
+ LWLockAcquire(bktype_lock, LW_SHARED);
+
+ /*
+ * Use the lock in the first BackendType's PgStat_BktypeIO to protect
+ * the reset timestamp as well.
+ */
+ if (i == 0)
+ pgStatLocal.snapshot.io.stat_reset_timestamp =
+ pgStatLocal.shmem->io.stats.stat_reset_timestamp;
+
+ /* using struct assignment due to better type safety */
+ *bktype_snap = *bktype_shstats;
+ LWLockRelease(bktype_lock);
+ }
+}
+
+/*
+ * IO statistics are not collected for all BackendTypes.
+ *
+ * The following BackendTypes do not participate in the cumulative stats
+ * subsystem or do not perform IO that we currently track:
+ * - Syslogger because it is not connected to shared memory
+ * - Archiver because most relevant archiving IO is delegated to a
+ *   specialized command or module
+ * - WAL Receiver and WAL Writer IO is not tracked in pg_stat_io for now
+ *
+ * Returns true if the BackendType participates in the cumulative stats
+ * subsystem for IO and false if it does not.
+ *
+ * When adding a new BackendType, also consider adding relevant restrictions to
+ * pgstat_tracks_io_object() and pgstat_tracks_io_op().
+ */
+bool
+pgstat_tracks_io_bktype(BackendType bktype)
+{
+ /*
+ * List every type so that new backend types trigger a warning about
+ * needing to adjust this switch.
+ */
+ switch (bktype)
+ {
+ case B_INVALID:
+ case B_ARCHIVER:
+ case B_LOGGER:
+ case B_WAL_RECEIVER:
+ case B_WAL_WRITER:
+ return false;
+
+ case B_AUTOVAC_LAUNCHER:
+ case B_AUTOVAC_WORKER:
+ case B_BACKEND:
+ case B_BG_WORKER:
+ case B_BG_WRITER:
+ case B_CHECKPOINTER:
+ case B_STANDALONE_BACKEND:
+ case B_STARTUP:
+ case B_WAL_SENDER:
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * Some BackendTypes do not perform IO in certain IOContexts. Some IOObjects
+ * are never operated on in some IOContexts. Check that the given BackendType
+ * is expected to do IO in the given IOContext and that the given IOObject is
+ * expected to be operated on in the given IOContext.
+ */
+bool
+pgstat_tracks_io_object(BackendType bktype, IOContext io_context,
+ IOObject io_object)
+{
+ bool no_temp_rel;
+
+ /*
+ * Some BackendTypes should never track IO statistics.
+ */
+ if (!pgstat_tracks_io_bktype(bktype))
+ return false;
+
+ /*
+ * Currently, IO on temporary relations can only occur in the
+ * IOCONTEXT_NORMAL IOContext.
+ */
+ if (io_context != IOCONTEXT_NORMAL &&
+ io_object == IOOBJECT_TEMP_RELATION)
+ return false;
+
+ /*
+ * In core Postgres, only regular backends and WAL Sender processes
+ * executing queries will use local buffers and operate on temporary
+ * relations. Parallel workers will not use local buffers (see
+ * InitLocalBuffers()); however, extensions leveraging background workers
+ * have no such limitation, so track IO on IOOBJECT_TEMP_RELATION for
+ * BackendType B_BG_WORKER.
+ */
+ no_temp_rel = bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER ||
+ bktype == B_CHECKPOINTER || bktype == B_AUTOVAC_WORKER ||
+ bktype == B_STANDALONE_BACKEND || bktype == B_STARTUP;
+
+ if (no_temp_rel && io_context == IOCONTEXT_NORMAL &&
+ io_object == IOOBJECT_TEMP_RELATION)
+ return false;
+
+ /*
+ * Some BackendTypes do not currently perform any IO in certain
+ * IOContexts, and, while it may not be inherently incorrect for them to
+ * do so, excluding those rows from the view makes the view easier to use.
+ */
+ if ((bktype == B_CHECKPOINTER || bktype == B_BG_WRITER) &&
+ (io_context == IOCONTEXT_BULKREAD ||
+ io_context == IOCONTEXT_BULKWRITE ||
+ io_context == IOCONTEXT_VACUUM))
+ return false;
+
+ if (bktype == B_AUTOVAC_LAUNCHER && io_context == IOCONTEXT_VACUUM)
+ return false;
+
+ if ((bktype == B_AUTOVAC_WORKER || bktype == B_AUTOVAC_LAUNCHER) &&
+ io_context == IOCONTEXT_BULKWRITE)
+ return false;
+
+ return true;
+}
+
+/*
+ * Some BackendTypes will never do certain IOOps and some IOOps should not
+ * occur in certain IOContexts or on certain IOObjects. Check that the given
+ * IOOp is valid for the given BackendType in the given IOContext and on the
+ * given IOObject. Note that there are currently no cases of an IOOp being
+ * invalid for a particular BackendType only within a certain IOContext and/or
+ * only on a certain IOObject.
+ */
+bool
+pgstat_tracks_io_op(BackendType bktype, IOContext io_context,
+ IOObject io_object, IOOp io_op)
+{
+ bool strategy_io_context;
+
+ /* if (io_context, io_object) will never collect stats, we're done */
+ if (!pgstat_tracks_io_object(bktype, io_context, io_object))
+ return false;
+
+ /*
+ * Some BackendTypes will not do certain IOOps.
+ */
+ if ((bktype == B_BG_WRITER || bktype == B_CHECKPOINTER) &&
+ (io_op == IOOP_READ || io_op == IOOP_EVICT))
+ return false;
+
+ if ((bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER ||
+ bktype == B_CHECKPOINTER) && io_op == IOOP_EXTEND)
+ return false;
+
+ /*
+ * Some IOOps are not valid in certain IOContexts and some IOOps are only
+ * valid in certain contexts.
+ */
+ if (io_context == IOCONTEXT_BULKREAD && io_op == IOOP_EXTEND)
+ return false;
+
+ strategy_io_context = io_context == IOCONTEXT_BULKREAD ||
+ io_context == IOCONTEXT_BULKWRITE || io_context == IOCONTEXT_VACUUM;
+
+ /*
+ * IOOP_REUSE is only relevant when a BufferAccessStrategy is in use.
+ */
+ if (!strategy_io_context && io_op == IOOP_REUSE)
+ return false;
+
+ /*
+ * IOOP_FSYNC IOOps done by a backend using a BufferAccessStrategy are
+ * counted in the IOCONTEXT_NORMAL IOContext. See comment in
+ * register_dirty_segment() for more details.
+ */
+ if (strategy_io_context && io_op == IOOP_FSYNC)
+ return false;
+
+ /*
+ * Temporary tables are not logged and thus do not require fsync'ing.
+ */
+ if (io_context == IOCONTEXT_NORMAL &&
+ io_object == IOOBJECT_TEMP_RELATION && io_op == IOOP_FSYNC)
+ return false;
+
+ return true;
+}
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index 2e20b93c20..f793ac1516 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -206,7 +206,7 @@ pgstat_drop_relation(Relation rel)
}
/*
- * Report that the table was just vacuumed.
+ * Report that the table was just vacuumed and flush IO statistics.
*/
void
pgstat_report_vacuum(Oid tableoid, bool shared,
@@ -258,10 +258,18 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
}
pgstat_unlock_entry(entry_ref);
+
+ /*
+ * Flush IO statistics now. pgstat_report_stat() will flush IO stats,
+ * however this will not be called until after an entire autovacuum cycle
+ * is done -- which will likely vacuum many relations -- or until the
+ * VACUUM command has processed all tables and committed.
+ */
+ pgstat_flush_io(false);
}
/*
- * Report that the table was just analyzed.
+ * Report that the table was just analyzed and flush IO statistics.
*
* Caller must provide new live- and dead-tuples estimates, as well as a
* flag indicating whether to reset the mod_since_analyze counter.
@@ -341,6 +349,9 @@ pgstat_report_analyze(Relation rel,
}
pgstat_unlock_entry(entry_ref);
+
+ /* see pgstat_report_vacuum() */
+ pgstat_flush_io(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_shmem.c b/src/backend/utils/activity/pgstat_shmem.c
index c1506b53d0..09fffd0e82 100644
--- a/src/backend/utils/activity/pgstat_shmem.c
+++ b/src/backend/utils/activity/pgstat_shmem.c
@@ -202,6 +202,10 @@ StatsShmemInit(void)
LWLockInitialize(&ctl->checkpointer.lock, LWTRANCHE_PGSTATS_DATA);
LWLockInitialize(&ctl->slru.lock, LWTRANCHE_PGSTATS_DATA);
LWLockInitialize(&ctl->wal.lock, LWTRANCHE_PGSTATS_DATA);
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ LWLockInitialize(&ctl->io.locks[i],
+ LWTRANCHE_PGSTATS_DATA);
}
else
{
diff --git a/src/backend/utils/activity/pgstat_wal.c b/src/backend/utils/activity/pgstat_wal.c
index e7a82b5fed..e8598b2f4e 100644
--- a/src/backend/utils/activity/pgstat_wal.c
+++ b/src/backend/utils/activity/pgstat_wal.c
@@ -34,7 +34,7 @@ static WalUsage prevWalUsage;
/*
* Calculate how much WAL usage counters have increased and update
- * shared statistics.
+ * shared WAL and IO statistics.
*
* Must be called by processes that generate WAL, that do not call
* pgstat_report_stat(), like walwriter.
@@ -43,6 +43,8 @@ void
pgstat_report_wal(bool force)
{
pgstat_flush_wal(force);
+
+ pgstat_flush_io(force);
}
/*
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 58bd1360b9..6df9f06a20 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1576,7 +1576,12 @@ pg_stat_reset(PG_FUNCTION_ARGS)
PG_RETURN_VOID();
}
-/* Reset some shared cluster-wide counters */
+/*
+ * Reset some shared cluster-wide counters
+ *
+ * When adding a new reset target, ideally the name should match that in
+ * pgstat_kind_infos, if relevant.
+ */
Datum
pg_stat_reset_shared(PG_FUNCTION_ARGS)
{
@@ -1593,6 +1598,8 @@ pg_stat_reset_shared(PG_FUNCTION_ARGS)
pgstat_reset_of_kind(PGSTAT_KIND_BGWRITER);
pgstat_reset_of_kind(PGSTAT_KIND_CHECKPOINTER);
}
+ else if (strcmp(target, "io") == 0)
+ pgstat_reset_of_kind(PGSTAT_KIND_IO);
else if (strcmp(target, "recovery_prefetch") == 0)
XLogPrefetchResetStats();
else if (strcmp(target, "wal") == 0)
@@ -1601,7 +1608,7 @@ pg_stat_reset_shared(PG_FUNCTION_ARGS)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("unrecognized reset target: \"%s\"", target),
- errhint("Target must be \"archiver\", \"bgwriter\", \"recovery_prefetch\", or \"wal\".")));
+ errhint("Target must be \"archiver\", \"bgwriter\", \"io\", \"recovery_prefetch\", or \"wal\".")));
PG_RETURN_VOID();
}
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 96b3a1e1a0..c309e0233d 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -332,6 +332,8 @@ typedef enum BackendType
B_WAL_WRITER,
} BackendType;
+#define BACKEND_NUM_TYPES (B_WAL_WRITER + 1)
+
extern PGDLLIMPORT BackendType MyBackendType;
extern const char *GetBackendTypeDesc(BackendType backendType);
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 5e3326a3b9..d0fa47ec07 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -48,6 +48,7 @@ typedef enum PgStat_Kind
PGSTAT_KIND_ARCHIVER,
PGSTAT_KIND_BGWRITER,
PGSTAT_KIND_CHECKPOINTER,
+ PGSTAT_KIND_IO,
PGSTAT_KIND_SLRU,
PGSTAT_KIND_WAL,
} PgStat_Kind;
@@ -276,6 +277,55 @@ typedef struct PgStat_CheckpointerStats
PgStat_Counter buf_fsync_backend;
} PgStat_CheckpointerStats;
+
+/*
+ * Types related to counting IO operations
+ */
+typedef enum IOContext
+{
+ IOCONTEXT_BULKREAD,
+ IOCONTEXT_BULKWRITE,
+ IOCONTEXT_NORMAL,
+ IOCONTEXT_VACUUM,
+} IOContext;
+
+#define IOCONTEXT_FIRST IOCONTEXT_BULKREAD
+#define IOCONTEXT_NUM_TYPES (IOCONTEXT_VACUUM + 1)
+
+typedef enum IOObject
+{
+ IOOBJECT_RELATION,
+ IOOBJECT_TEMP_RELATION,
+} IOObject;
+
+#define IOOBJECT_FIRST IOOBJECT_RELATION
+#define IOOBJECT_NUM_TYPES (IOOBJECT_TEMP_RELATION + 1)
+
+typedef enum IOOp
+{
+ IOOP_EVICT,
+ IOOP_EXTEND,
+ IOOP_FSYNC,
+ IOOP_READ,
+ IOOP_REUSE,
+ IOOP_WRITE,
+} IOOp;
+
+#define IOOP_FIRST IOOP_EVICT
+#define IOOP_NUM_TYPES (IOOP_WRITE + 1)
+
+typedef struct PgStat_BktypeIO
+{
+ PgStat_Counter data[IOCONTEXT_NUM_TYPES][IOOBJECT_NUM_TYPES][IOOP_NUM_TYPES];
+} PgStat_BktypeIO;
+
+typedef struct PgStat_IO
+{
+ TimestampTz stat_reset_timestamp;
+ PgStat_BktypeIO stats[BACKEND_NUM_TYPES];
+} PgStat_IO;
+
+
typedef struct PgStat_StatDBEntry
{
PgStat_Counter xact_commit;
@@ -453,6 +503,24 @@ extern void pgstat_report_checkpointer(void);
extern PgStat_CheckpointerStats *pgstat_fetch_stat_checkpointer(void);
+/*
+ * Functions in pgstat_io.c
+ */
+
+extern bool pgstat_bktype_io_stats_valid(PgStat_BktypeIO *context_ops,
+ BackendType bktype);
+extern void pgstat_count_io_op(IOOp io_op, IOObject io_object, IOContext io_context);
+extern PgStat_IO *pgstat_fetch_stat_io(void);
+extern const char *pgstat_get_io_context_name(IOContext io_context);
+extern const char *pgstat_get_io_object_name(IOObject io_object);
+
+extern bool pgstat_tracks_io_bktype(BackendType bktype);
+extern bool pgstat_tracks_io_object(BackendType bktype,
+ IOContext io_context, IOObject io_object);
+extern bool pgstat_tracks_io_op(BackendType bktype, IOContext io_context,
+ IOObject io_object, IOOp io_op);
+
+
/*
* Functions in pgstat_database.c
*/
diff --git a/src/include/utils/pgstat_internal.h b/src/include/utils/pgstat_internal.h
index 12fd51f1ae..6badb2fde4 100644
--- a/src/include/utils/pgstat_internal.h
+++ b/src/include/utils/pgstat_internal.h
@@ -329,6 +329,17 @@ typedef struct PgStatShared_Checkpointer
PgStat_CheckpointerStats reset_offset;
} PgStatShared_Checkpointer;
+/* Shared-memory ready PgStat_IO */
+typedef struct PgStatShared_IO
+{
+ /*
+ * locks[i] protects stats.stats[i]. locks[0] also protects
+ * stats.stat_reset_timestamp.
+ */
+ LWLock locks[BACKEND_NUM_TYPES];
+ PgStat_IO stats;
+} PgStatShared_IO;
+
typedef struct PgStatShared_SLRU
{
/* lock protects ->stats */
@@ -419,6 +430,7 @@ typedef struct PgStat_ShmemControl
PgStatShared_Archiver archiver;
PgStatShared_BgWriter bgwriter;
PgStatShared_Checkpointer checkpointer;
+ PgStatShared_IO io;
PgStatShared_SLRU slru;
PgStatShared_Wal wal;
} PgStat_ShmemControl;
@@ -442,6 +454,8 @@ typedef struct PgStat_Snapshot
PgStat_CheckpointerStats checkpointer;
+ PgStat_IO io;
+
PgStat_SLRUStats slru[SLRU_NUM_ELEMENTS];
PgStat_WalStats wal;
@@ -549,6 +563,15 @@ extern void pgstat_database_reset_timestamp_cb(PgStatShared_Common *header, Time
extern bool pgstat_function_flush_cb(PgStat_EntryRef *entry_ref, bool nowait);
+/*
+ * Functions in pgstat_io.c
+ */
+
+extern bool pgstat_flush_io(bool nowait);
+extern void pgstat_io_reset_all_cb(TimestampTz ts);
+extern void pgstat_io_snapshot_cb(void);
+
+
/*
* Functions in pgstat_relation.c
*/
@@ -643,6 +666,13 @@ extern void pgstat_create_transactional(PgStat_Kind kind, Oid dboid, Oid objoid)
extern PGDLLIMPORT PgStat_LocalState pgStatLocal;
+/*
+ * Variables in pgstat_io.c
+ */
+
+extern PGDLLIMPORT bool have_iostats;
+
+
/*
* Variables in pgstat_slru.c
*/
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 23bafec5f7..1be6e07980 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1106,7 +1106,10 @@ ID
INFIX
INT128
INTERFACE_INFO
+IOContext
IOFuncSelector
+IOObject
+IOOp
IPCompareMethod
ITEM
IV
@@ -2016,6 +2019,7 @@ PgStatShared_Common
PgStatShared_Database
PgStatShared_Function
PgStatShared_HashEntry
+PgStatShared_IO
PgStatShared_Relation
PgStatShared_ReplSlot
PgStatShared_SLRU
@@ -2025,6 +2029,7 @@ PgStat_ArchiverStats
PgStat_BackendFunctionEntry
PgStat_BackendSubEntry
PgStat_BgWriterStats
+PgStat_BktypeIO
PgStat_CheckpointerStats
PgStat_Counter
PgStat_EntryRef
@@ -2033,6 +2038,7 @@ PgStat_FetchConsistency
PgStat_FunctionCallUsage
PgStat_FunctionCounts
PgStat_HashKey
+PgStat_IO
PgStat_Kind
PgStat_KindInfo
PgStat_LocalState
--
2.34.1
Hi,
On 2023-01-17 12:22:14 -0500, Melanie Plageman wrote:
> > > @@ -359,6 +360,15 @@ static const PgStat_KindInfo pgstat_kind_infos[PGSTAT_NUM_KINDS] = {
> > >         .snapshot_cb = pgstat_checkpointer_snapshot_cb,
> > >     },
> > > +   [PGSTAT_KIND_IO] = {
> > > +       .name = "io_ops",
> >
> > That should be "io" now I think?
>
> Oh no! I didn't notice this was broken. I've added pg_stat_have_stats()
> to the IO stats tests now.
>
> It would be nice if pgstat_get_kind_from_str() could be used in
> pg_stat_reset_shared() to avoid having to remember to change both.

It's hard to make that work, because of the historical behaviour of that
function :(

> Also:
> - Since recovery_prefetch doesn't have a statistic kind, it doesn't fit
>   well into this paradigm

I think that needs a rework anyway - it went in at about the same time as the
shared mem stats patch, so it doesn't quite cohere.

> On a separate note, should we be setting have_[io/slru/etc]stats to
> false in the reset all functions?

That'd not work reliably, because other backends won't do the same. I don't
see a benefit in doing it differently in the local connection than the other
connections.

> > > +typedef struct PgStat_BackendIO
> > > +{
> > > +   PgStat_Counter data[IOCONTEXT_NUM_TYPES][IOOBJECT_NUM_TYPES][IOOP_NUM_TYPES];
> > > +} PgStat_BackendIO;
> >
> > Would it bother you if we swapped the order of iocontext and iobject here and
> > related places? It makes more sense to me semantically, and should now be
> > pretty easy, code wise.
>
> So, thinking about this I started noticing inconsistencies in other
> areas around this order:
> For example: ordering of objects mentioned in commit messages and comments,
> ordering of parameters (like in pgstat_count_io_op() [currently in
> reverse order]).
>
> I think we should make a final decision about this ordering and then
> make everywhere consistent (including ordering in the view).
>
> Currently the order is:
> BackendType
> IOContext
> IOObject
> IOOp
>
> You are suggesting this order:
> BackendType
> IOObject
> IOContext
> IOOp
>
> Could you explain what you find more natural about this ordering (as I
> find the other more natural)?

The object we're performing IO on determines more things than the context. So
it just seems like the natural hierarchical fit. The context is a sub-category
of the object. Consider how it'll look like if we also have objects for 'wal',
'temp files'. It'll make sense to group by just the object, but it won't make
sense to group by just the context.

If it were trivial to do I'd use a different IOContext for each IOObject. But
it'd make it much harder. So there'll just be a bunch of values of IOContext
that'll only be used for one or a subset of the IOObjects.

The reason to put BackendType at the top is pragmatic - one backend is of a
single type, but can do IO for all kinds of objects/contexts. So any other
hierarchy would make the locking etc much harder.

> This is one possible natural sentence with these objects:
>
> During COPY, a client backend may read in data from a permanent
> relation.
>
> This order is:
> IOContext
> BackendType
> IOOp
> IOObject
>
> I think English sentences are often structured subject, verb, object --
> but in our case, we have an extra thing that doesn't fit neatly
> (IOContext).

"..., to avoid polluting the buffer cache it uses the bulk (read|write)
strategy".

> Also, IOOp in a sentence would be in the middle (as the
> verb). I made it last because a) it feels like the smallest unit b) it
> would make the code a lot more annoying if it wasn't last.
Yea, I think pragmatically that is the right choice.
> Subject: [PATCH v47 3/5] pgstat: Count IO for relations

Nearly happy with this now. See one minor nit below.

> > I don't love the counting in register_dirty_segment() and mdsyncfiletag(), but
> > I don't have a better idea, and it doesn't seem too horrible.
>
> You don't like it because such things shouldn't be in md.c -- since we
> went to the trouble of having function pointers and making it general?
It's more of a gut feeling than well reasoned ;)
> > > +-- Change the tablespace so that the table is rewritten directly, then SELECT
> > > +-- from it to cause it to be read back into shared buffers.
> > > +SET allow_in_place_tablespaces = true;
> > > +CREATE TABLESPACE regress_io_stats_tblspc LOCATION '';
> >
> > Perhaps worth doing this in tablespace.sql, to avoid the additional
> > checkpoints done as part of CREATE/DROP TABLESPACE?
> >
> > Or, at least combine this with the CHECKPOINTs above?
>
> I see a checkpoint is requested when dropping the tablespace if not all
> the files in it are deleted. It seems like if the DROP TABLE for the
> permanent table is before the explicit checkpoints in the test, then the
> DROP TABLESPACE will not cause an additional checkpoint.

Unfortunately, that's not how it works :(. See the comment above mdunlink():

 * For regular relations, we don't unlink the first segment file of the rel,
 * but just truncate it to zero length, and record a request to unlink it after
 * the next checkpoint. Additional segments can be unlinked immediately,
 * however. Leaving the empty file in place prevents that relfilenumber
 * from being reused. The scenario this protects us from is:
...

> Is this what you are suggesting? Dropping the temporary table should not
> have an effect on this.

I was wondering about simply moving that portion of the test to
tablespace.sql, where we already created a tablespace.

An alternative would be to propose splitting tablespace.sql into one portion
running at the start of parallel_schedule, and one at the end. Historically,
we needed tablespace.sql to be optional due to causing problems when
replicating to another instance on the same machine, but now we have
allow_in_place_tablespaces.

> > > SELECT pg_relation_size('test_io_local') / current_setting('block_size')::int8 > 100;
> >
> > Better toast compression or such could easily make test_io_local smaller than
> > it's today. Seeing that it's too small would make it easier to understand the
> > failure.
>
> Good idea. So, I used pg_table_size() because it seems like
> pg_relation_size() does not include the toast relations. However, I'm
> not sure this is a good idea, because pg_table_size() includes FSM and
> visibility map. Should I write a query to get the toast relation name
> and add pg_relation_size() of that relation and the main relation?

I think it's the right thing to just include the relation size. Your queries
IIRC won't use the toast table or other forks. So I'd leave it at just
pg_relation_size().
Greetings,
Andres Freund
v49 attached
On Tue, Jan 17, 2023 at 2:12 PM Andres Freund <andres@anarazel.de> wrote:
> On 2023-01-17 12:22:14 -0500, Melanie Plageman wrote:
> > > > +typedef struct PgStat_BackendIO
> > > > +{
> > > > +   PgStat_Counter data[IOCONTEXT_NUM_TYPES][IOOBJECT_NUM_TYPES][IOOP_NUM_TYPES];
> > > > +} PgStat_BackendIO;
> > >
> > > Would it bother you if we swapped the order of iocontext and iobject here and
> > > related places? It makes more sense to me semantically, and should now be
> > > pretty easy, code wise.
> >
> > So, thinking about this I started noticing inconsistencies in other
> > areas around this order:
> > For example: ordering of objects mentioned in commit messages and comments,
> > ordering of parameters (like in pgstat_count_io_op() [currently in
> > reverse order]).
> >
> > I think we should make a final decision about this ordering and then
> > make everywhere consistent (including ordering in the view).
> >
> > Currently the order is:
> > BackendType
> > IOContext
> > IOObject
> > IOOp
> >
> > You are suggesting this order:
> > BackendType
> > IOObject
> > IOContext
> > IOOp
> >
> > Could you explain what you find more natural about this ordering (as I
> > find the other more natural)?
>
> The object we're performing IO on determines more things than the context. So
> it just seems like the natural hierarchical fit. The context is a sub-category
> of the object. Consider how it'll look like if we also have objects for 'wal',
> 'temp files'. It'll make sense to group by just the object, but it won't make
> sense to group by just the context.
>
> If it were trivial to do I'd use a different IOContext for each IOObject. But
> it'd make it much harder. So there'll just be a bunch of values of IOContext
> that'll only be used for one or a subset of the IOObjects.
>
> The reason to put BackendType at the top is pragmatic - one backend is of a
> single type, but can do IO for all kinds of objects/contexts. So any other
> hierarchy would make the locking etc much harder.
>
> > This is one possible natural sentence with these objects:
> >
> > During COPY, a client backend may read in data from a permanent
> > relation.
> >
> > This order is:
> > IOContext
> > BackendType
> > IOOp
> > IOObject
> >
> > I think English sentences are often structured subject, verb, object --
> > but in our case, we have an extra thing that doesn't fit neatly
> > (IOContext).
>
> "..., to avoid polluting the buffer cache it uses the bulk (read|write)
> strategy".
>
> > Also, IOOp in a sentence would be in the middle (as the
> > verb). I made it last because a) it feels like the smallest unit b) it
> > would make the code a lot more annoying if it wasn't last.
>
> Yea, I think pragmatically that is the right choice.
I have changed the order and updated all the places using
PgStat_BktypeIO, as well as all the locations that should be ordered
consistently (that I could find in the pass I did) -- e.g. the view
definition, function signatures, comments, commit messages, etc.
> > > > +-- Change the tablespace so that the table is rewritten directly, then SELECT
> > > > +-- from it to cause it to be read back into shared buffers.
> > > > +SET allow_in_place_tablespaces = true;
> > > > +CREATE TABLESPACE regress_io_stats_tblspc LOCATION '';
> > >
> > > Perhaps worth doing this in tablespace.sql, to avoid the additional
> > > checkpoints done as part of CREATE/DROP TABLESPACE?
> > >
> > > Or, at least combine this with the CHECKPOINTs above?
> >
> > I see a checkpoint is requested when dropping the tablespace if not all
> > the files in it are deleted. It seems like if the DROP TABLE for the
> > permanent table is before the explicit checkpoints in the test, then the
> > DROP TABLESPACE will not cause an additional checkpoint.
>
> Unfortunately, that's not how it works :(. See the comment above mdunlink():
>
>  * For regular relations, we don't unlink the first segment file of the rel,
>  * but just truncate it to zero length, and record a request to unlink it after
>  * the next checkpoint. Additional segments can be unlinked immediately,
>  * however. Leaving the empty file in place prevents that relfilenumber
>  * from being reused. The scenario this protects us from is:
> ...
>
> > Is this what you are suggesting? Dropping the temporary table should not
> > have an effect on this.
>
> I was wondering about simply moving that portion of the test to
> tablespace.sql, where we already created a tablespace.
>
> An alternative would be to propose splitting tablespace.sql into one portion
> running at the start of parallel_schedule, and one at the end. Historically,
> we needed tablespace.sql to be optional due to causing problems when
> replicating to another instance on the same machine, but now we have
> allow_in_place_tablespaces.
It seems like the best way would be to split up the tablespace test file
as you suggested and drop the tablespace at the end of the regression
test suite. There could be other tests that could use a tablespace.
Though what I wrote is, in a sense, tablespace test coverage: if this
rewriting behavior ever stopped happening on ALTER TABLE ... SET
TABLESPACE, we would want to come up with a new test that exercises
that code to count those IO stats, rather than simply deleting it from
the tablespace tests.
> > > > SELECT pg_relation_size('test_io_local') / current_setting('block_size')::int8 > 100;
> > >
> > > Better toast compression or such could easily make test_io_local smaller than
> > > it's today. Seeing that it's too small would make it easier to understand the
> > > failure.
> >
> > Good idea. So, I used pg_table_size() because it seems like
> > pg_relation_size() does not include the toast relations. However, I'm
> > not sure this is a good idea, because pg_table_size() includes FSM and
> > visibility map. Should I write a query to get the toast relation name
> > and add pg_relation_size() of that relation and the main relation?
>
> I think it's the right thing to just include the relation size. Your queries
> IIRC won't use the toast table or other forks. So I'd leave it at just
> pg_relation_size().
I did notice that this test wasn't using the toast table for the
toastable column -- but you mentioned better toast compression possibly
affecting future test stability, so I'm confused.
- Melanie
Attachments:
v49-0001-pgstat-Infrastructure-to-track-IO-operations.patch (text/x-patch)
From 2e29ec2d41fee3fd299c271ade82f8270a16474b Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 17 Jan 2023 16:10:34 -0500
Subject: [PATCH v49 1/4] pgstat: Infrastructure to track IO operations
Introduce "IOOp", an IO operation done by a backend, "IOObject", the
target object of the IO, and "IOContext", the context or location of the
IO operations on that object. For example, the checkpointer may write a
shared buffer out. This would be considered an IOOP_WRITE IOOp on an
IOOBJECT_RELATION IOObject in the IOCONTEXT_NORMAL IOContext by
BackendType B_CHECKPOINTER.
Each BackendType counts IOOps (evict, extend, fsync, read, reuse, and
write) per IOObject (relation, temp relation) per IOContext (normal,
bulkread, bulkwrite, or vacuum) through a call to pgstat_count_io_op().
Note that this commit introduces the infrastructure to count IO
Operation statistics. A subsequent commit will add calls to
pgstat_count_io_op() in the appropriate locations.
IOObject IOOBJECT_TEMP_RELATION concerns IO Operations on buffers
containing temporary table data, while IOObject IOOBJECT_RELATION
concerns IO Operations on buffers containing permanent relation data.
IOContext IOCONTEXT_NORMAL concerns operations on local and shared
buffers, while IOCONTEXT_BULKREAD, IOCONTEXT_BULKWRITE, and
IOCONTEXT_VACUUM IOContexts concern IO operations on buffers as part of
a BufferAccessStrategy.
Stats on IOOps on all IOObjects in all IOContexts for a given backend
are first counted in a backend's local memory and then flushed to shared
memory and accumulated with those from all other backends, exited and
live.
Some BackendTypes will not flush their pending statistics at regular
intervals and explicitly call pgstat_flush_io_ops() during the course of
normal operations to flush their backend-local IO operation statistics
to shared memory in a timely manner.
Because not all BackendType, IOObject, IOContext, IOOp combinations are
valid, the validity of the stats is checked before flushing pending
stats and before reading in the existing stats file to shared memory.
The aggregated stats in shared memory could be extended in the future
with per-backend stats -- useful for per connection IO statistics and
monitoring.
PGSTAT_FILE_FORMAT_ID should be bumped with this commit.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
doc/src/sgml/monitoring.sgml | 2 +
src/backend/utils/activity/Makefile | 1 +
src/backend/utils/activity/meson.build | 1 +
src/backend/utils/activity/pgstat.c | 26 ++
src/backend/utils/activity/pgstat_bgwriter.c | 7 +-
.../utils/activity/pgstat_checkpointer.c | 7 +-
src/backend/utils/activity/pgstat_io.c | 386 ++++++++++++++++++
src/backend/utils/activity/pgstat_relation.c | 15 +-
src/backend/utils/activity/pgstat_shmem.c | 4 +
src/backend/utils/activity/pgstat_wal.c | 4 +-
src/backend/utils/adt/pgstatfuncs.c | 11 +-
src/include/miscadmin.h | 2 +
src/include/pgstat.h | 68 +++
src/include/utils/pgstat_internal.h | 30 ++
src/tools/pgindent/typedefs.list | 6 +
15 files changed, 563 insertions(+), 7 deletions(-)
create mode 100644 src/backend/utils/activity/pgstat_io.c
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 358d2ff90f..8d51ca3773 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -5418,6 +5418,8 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
the <structname>pg_stat_bgwriter</structname>
view, <literal>archiver</literal> to reset all the counters shown in
the <structname>pg_stat_archiver</structname> view,
+ <literal>io</literal> to reset all the counters shown in the
+ <structname>pg_stat_io</structname> view,
<literal>wal</literal> to reset all the counters shown in the
<structname>pg_stat_wal</structname> view or
<literal>recovery_prefetch</literal> to reset all the counters shown
diff --git a/src/backend/utils/activity/Makefile b/src/backend/utils/activity/Makefile
index a80eda3cf4..7d7482dde0 100644
--- a/src/backend/utils/activity/Makefile
+++ b/src/backend/utils/activity/Makefile
@@ -22,6 +22,7 @@ OBJS = \
pgstat_checkpointer.o \
pgstat_database.o \
pgstat_function.o \
+ pgstat_io.o \
pgstat_relation.o \
pgstat_replslot.o \
pgstat_shmem.o \
diff --git a/src/backend/utils/activity/meson.build b/src/backend/utils/activity/meson.build
index a2b872c24b..518ee3f798 100644
--- a/src/backend/utils/activity/meson.build
+++ b/src/backend/utils/activity/meson.build
@@ -9,6 +9,7 @@ backend_sources += files(
'pgstat_checkpointer.c',
'pgstat_database.c',
'pgstat_function.c',
+ 'pgstat_io.c',
'pgstat_relation.c',
'pgstat_replslot.c',
'pgstat_shmem.c',
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 0fa5370bcd..60fc4e761f 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -72,6 +72,7 @@
* - pgstat_checkpointer.c
* - pgstat_database.c
* - pgstat_function.c
+ * - pgstat_io.c
* - pgstat_relation.c
* - pgstat_replslot.c
* - pgstat_slru.c
@@ -359,6 +360,15 @@ static const PgStat_KindInfo pgstat_kind_infos[PGSTAT_NUM_KINDS] = {
.snapshot_cb = pgstat_checkpointer_snapshot_cb,
},
+ [PGSTAT_KIND_IO] = {
+ .name = "io",
+
+ .fixed_amount = true,
+
+ .reset_all_cb = pgstat_io_reset_all_cb,
+ .snapshot_cb = pgstat_io_snapshot_cb,
+ },
+
[PGSTAT_KIND_SLRU] = {
.name = "slru",
@@ -582,6 +592,7 @@ pgstat_report_stat(bool force)
/* Don't expend a clock check if nothing to do */
if (dlist_is_empty(&pgStatPending) &&
+ !have_iostats &&
!have_slrustats &&
!pgstat_have_pending_wal())
{
@@ -628,6 +639,9 @@ pgstat_report_stat(bool force)
/* flush database / relation / function / ... stats */
partial_flush |= pgstat_flush_pending_entries(nowait);
+ /* flush IO stats */
+ partial_flush |= pgstat_flush_io(nowait);
+
/* flush wal stats */
partial_flush |= pgstat_flush_wal(nowait);
@@ -1322,6 +1336,12 @@ pgstat_write_statsfile(void)
pgstat_build_snapshot_fixed(PGSTAT_KIND_CHECKPOINTER);
write_chunk_s(fpout, &pgStatLocal.snapshot.checkpointer);
+ /*
+ * Write IO stats struct
+ */
+ pgstat_build_snapshot_fixed(PGSTAT_KIND_IO);
+ write_chunk_s(fpout, &pgStatLocal.snapshot.io);
+
/*
* Write SLRU stats struct
*/
@@ -1496,6 +1516,12 @@ pgstat_read_statsfile(void)
if (!read_chunk_s(fpin, &shmem->checkpointer.stats))
goto error;
+ /*
+ * Read IO stats struct
+ */
+ if (!read_chunk_s(fpin, &shmem->io.stats))
+ goto error;
+
/*
* Read SLRU stats struct
*/
diff --git a/src/backend/utils/activity/pgstat_bgwriter.c b/src/backend/utils/activity/pgstat_bgwriter.c
index 9247f2dda2..92be384b0d 100644
--- a/src/backend/utils/activity/pgstat_bgwriter.c
+++ b/src/backend/utils/activity/pgstat_bgwriter.c
@@ -24,7 +24,7 @@ PgStat_BgWriterStats PendingBgWriterStats = {0};
/*
- * Report bgwriter statistics
+ * Report bgwriter and IO statistics
*/
void
pgstat_report_bgwriter(void)
@@ -56,6 +56,11 @@ pgstat_report_bgwriter(void)
* Clear out the statistics buffer, so it can be re-used.
*/
MemSet(&PendingBgWriterStats, 0, sizeof(PendingBgWriterStats));
+
+ /*
+ * Report IO statistics
+ */
+ pgstat_flush_io(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_checkpointer.c b/src/backend/utils/activity/pgstat_checkpointer.c
index 3e9ab45103..26dec112f6 100644
--- a/src/backend/utils/activity/pgstat_checkpointer.c
+++ b/src/backend/utils/activity/pgstat_checkpointer.c
@@ -24,7 +24,7 @@ PgStat_CheckpointerStats PendingCheckpointerStats = {0};
/*
- * Report checkpointer statistics
+ * Report checkpointer and IO statistics
*/
void
pgstat_report_checkpointer(void)
@@ -62,6 +62,11 @@ pgstat_report_checkpointer(void)
* Clear out the statistics buffer, so it can be re-used.
*/
MemSet(&PendingCheckpointerStats, 0, sizeof(PendingCheckpointerStats));
+
+ /*
+ * Report IO statistics
+ */
+ pgstat_flush_io(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
new file mode 100644
index 0000000000..b606f23eb8
--- /dev/null
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -0,0 +1,386 @@
+/* -------------------------------------------------------------------------
+ *
+ * pgstat_io.c
+ * Implementation of IO statistics.
+ *
+ * This file contains the implementation of IO statistics. It is kept separate
+ * from pgstat.c to enforce the line between the statistics access / storage
+ * implementation and the details about individual types of statistics.
+ *
+ * Copyright (c) 2021-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/utils/activity/pgstat_io.c
+ * -------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "utils/pgstat_internal.h"
+
+
+static PgStat_BktypeIO PendingIOStats;
+bool have_iostats = false;
+
+/*
+ * Check that stats have not been counted for any combination of IOObject,
+ * IOContext, and IOOp which are not tracked for the passed-in BackendType. The
+ * passed-in PgStat_BktypeIO must contain stats from the BackendType specified
+ * by the second parameter. Caller is responsible for locking the passed-in
+ * PgStat_BktypeIO, if needed.
+ */
+bool
+pgstat_bktype_io_stats_valid(PgStat_BktypeIO *backend_io,
+ BackendType bktype)
+{
+ bool bktype_tracked = pgstat_tracks_io_bktype(bktype);
+
+ for (IOObject io_object = IOOBJECT_FIRST;
+ io_object < IOOBJECT_NUM_TYPES; io_object++)
+ {
+ for (IOContext io_context = IOCONTEXT_FIRST;
+ io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ {
+ /*
+ * Don't bother trying to skip to the next loop iteration if
+ * pgstat_tracks_io_object() would return false here. We still
+ * need to validate that each counter is zero anyway.
+ */
+ for (IOOp io_op = IOOP_FIRST; io_op < IOOP_NUM_TYPES; io_op++)
+ {
+ /* No stats, so nothing to validate */
+ if (backend_io->data[io_object][io_context][io_op] == 0)
+ continue;
+
+ /* There are stats and there shouldn't be */
+ if (!bktype_tracked ||
+ !pgstat_tracks_io_op(bktype, io_object, io_context, io_op))
+ return false;
+ }
+ }
+ }
+
+ return true;
+}
+
+void
+pgstat_count_io_op(IOObject io_object, IOContext io_context, IOOp io_op)
+{
+ Assert(io_object < IOOBJECT_NUM_TYPES);
+ Assert(io_context < IOCONTEXT_NUM_TYPES);
+ Assert(io_op < IOOP_NUM_TYPES);
+ Assert(pgstat_tracks_io_op(MyBackendType, io_object, io_context, io_op));
+
+ PendingIOStats.data[io_object][io_context][io_op]++;
+
+ have_iostats = true;
+}
+
+PgStat_IO *
+pgstat_fetch_stat_io(void)
+{
+ pgstat_snapshot_fixed(PGSTAT_KIND_IO);
+
+ return &pgStatLocal.snapshot.io;
+}
+
+/*
+ * Flush out locally pending IO statistics
+ *
+ * If no stats have been recorded, this function returns false.
+ *
+ * If nowait is true, this function returns true if the lock could not be
+ * acquired. Otherwise, return false.
+ */
+bool
+pgstat_flush_io(bool nowait)
+{
+ LWLock *bktype_lock;
+ PgStat_BktypeIO *bktype_shstats;
+
+ if (!have_iostats)
+ return false;
+
+ bktype_lock = &pgStatLocal.shmem->io.locks[MyBackendType];
+ bktype_shstats =
+ &pgStatLocal.shmem->io.stats.stats[MyBackendType];
+
+ if (!nowait)
+ LWLockAcquire(bktype_lock, LW_EXCLUSIVE);
+ else if (!LWLockConditionalAcquire(bktype_lock, LW_EXCLUSIVE))
+ return true;
+
+ for (IOObject io_object = IOOBJECT_FIRST;
+ io_object < IOOBJECT_NUM_TYPES; io_object++)
+ for (IOContext io_context = IOCONTEXT_FIRST;
+ io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ for (IOOp io_op = IOOP_FIRST;
+ io_op < IOOP_NUM_TYPES; io_op++)
+ bktype_shstats->data[io_object][io_context][io_op] +=
+ PendingIOStats.data[io_object][io_context][io_op];
+
+ Assert(pgstat_bktype_io_stats_valid(bktype_shstats, MyBackendType));
+
+ LWLockRelease(bktype_lock);
+
+ memset(&PendingIOStats, 0, sizeof(PendingIOStats));
+
+ have_iostats = false;
+
+ return false;
+}
+
+const char *
+pgstat_get_io_context_name(IOContext io_context)
+{
+ switch (io_context)
+ {
+ case IOCONTEXT_BULKREAD:
+ return "bulkread";
+ case IOCONTEXT_BULKWRITE:
+ return "bulkwrite";
+ case IOCONTEXT_NORMAL:
+ return "normal";
+ case IOCONTEXT_VACUUM:
+ return "vacuum";
+ }
+
+ elog(ERROR, "unrecognized IOContext value: %d", io_context);
+ pg_unreachable();
+}
+
+const char *
+pgstat_get_io_object_name(IOObject io_object)
+{
+ switch (io_object)
+ {
+ case IOOBJECT_RELATION:
+ return "relation";
+ case IOOBJECT_TEMP_RELATION:
+ return "temp relation";
+ }
+
+ elog(ERROR, "unrecognized IOObject value: %d", io_object);
+ pg_unreachable();
+}
+
+void
+pgstat_io_reset_all_cb(TimestampTz ts)
+{
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ LWLock *bktype_lock = &pgStatLocal.shmem->io.locks[i];
+ PgStat_BktypeIO *bktype_shstats = &pgStatLocal.shmem->io.stats.stats[i];
+
+ LWLockAcquire(bktype_lock, LW_EXCLUSIVE);
+
+ /*
+ * Use the lock in the first BackendType's PgStat_BktypeIO to protect
+ * the reset timestamp as well.
+ */
+ if (i == 0)
+ pgStatLocal.shmem->io.stats.stat_reset_timestamp = ts;
+
+ memset(bktype_shstats, 0, sizeof(*bktype_shstats));
+ LWLockRelease(bktype_lock);
+ }
+}
+
+void
+pgstat_io_snapshot_cb(void)
+{
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ LWLock *bktype_lock = &pgStatLocal.shmem->io.locks[i];
+ PgStat_BktypeIO *bktype_shstats = &pgStatLocal.shmem->io.stats.stats[i];
+ PgStat_BktypeIO *bktype_snap = &pgStatLocal.snapshot.io.stats[i];
+
+ LWLockAcquire(bktype_lock, LW_SHARED);
+
+ /*
+ * Use the lock in the first BackendType's PgStat_BktypeIO to protect
+ * the reset timestamp as well.
+ */
+ if (i == 0)
+ pgStatLocal.snapshot.io.stat_reset_timestamp =
+ pgStatLocal.shmem->io.stats.stat_reset_timestamp;
+
+ /* using struct assignment due to better type safety */
+ *bktype_snap = *bktype_shstats;
+ LWLockRelease(bktype_lock);
+ }
+}
+
+/*
+* IO statistics are not collected for all BackendTypes.
+*
+* The following BackendTypes do not participate in the cumulative stats
* subsystem or do not perform IO that we currently track:
+* - Syslogger because it is not connected to shared memory
+* - Archiver because most relevant archiving IO is delegated to a
+* specialized command or module
+* - WAL Receiver and WAL Writer IO is not tracked in pg_stat_io for now
+*
+* Function returns true if BackendType participates in the cumulative stats
+* subsystem for IO and false if it does not.
+*
+* When adding a new BackendType, also consider adding relevant restrictions to
+* pgstat_tracks_io_object() and pgstat_tracks_io_op().
+*/
+bool
+pgstat_tracks_io_bktype(BackendType bktype)
+{
+ /*
+ * List every type so that new backend types trigger a warning about
+ * needing to adjust this switch.
+ */
+ switch (bktype)
+ {
+ case B_INVALID:
+ case B_ARCHIVER:
+ case B_LOGGER:
+ case B_WAL_RECEIVER:
+ case B_WAL_WRITER:
+ return false;
+
+ case B_AUTOVAC_LAUNCHER:
+ case B_AUTOVAC_WORKER:
+ case B_BACKEND:
+ case B_BG_WORKER:
+ case B_BG_WRITER:
+ case B_CHECKPOINTER:
+ case B_STANDALONE_BACKEND:
+ case B_STARTUP:
+ case B_WAL_SENDER:
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * Some BackendTypes do not perform IO on certain IOObjects or in certain
+ * IOContexts. Some IOObjects are never operated on in some IOContexts. Check
+ * that the given BackendType is expected to do IO in the given IOContext and
+ * on the given IOObject and that the given IOObject is expected to be operated
+ * on in the given IOContext.
+ */
+bool
+pgstat_tracks_io_object(BackendType bktype, IOObject io_object,
+ IOContext io_context)
+{
+ bool no_temp_rel;
+
+ /*
+ * Some BackendTypes should never track IO statistics.
+ */
+ if (!pgstat_tracks_io_bktype(bktype))
+ return false;
+
+ /*
+ * Currently, IO on temporary relations can only occur in the
+ * IOCONTEXT_NORMAL IOContext.
+ */
+ if (io_context != IOCONTEXT_NORMAL &&
+ io_object == IOOBJECT_TEMP_RELATION)
+ return false;
+
+ /*
+ * In core Postgres, only regular backends and WAL Sender processes
+ * executing queries will use local buffers and operate on temporary
+ * relations. Parallel workers will not use local buffers (see
+ * InitLocalBuffers()); however, extensions leveraging background workers
+ * have no such limitation, so track IO on IOOBJECT_TEMP_RELATION for
+ * BackendType B_BG_WORKER.
+ */
+ no_temp_rel = bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER ||
+ bktype == B_CHECKPOINTER || bktype == B_AUTOVAC_WORKER ||
+ bktype == B_STANDALONE_BACKEND || bktype == B_STARTUP;
+
+ if (no_temp_rel && io_context == IOCONTEXT_NORMAL &&
+ io_object == IOOBJECT_TEMP_RELATION)
+ return false;
+
+ /*
+ * Some BackendTypes do not currently perform any IO in certain
+ * IOContexts, and, while it may not be inherently incorrect for them to
+ * do so, excluding those rows from the view makes the view easier to use.
+ */
+ if ((bktype == B_CHECKPOINTER || bktype == B_BG_WRITER) &&
+ (io_context == IOCONTEXT_BULKREAD ||
+ io_context == IOCONTEXT_BULKWRITE ||
+ io_context == IOCONTEXT_VACUUM))
+ return false;
+
+ if (bktype == B_AUTOVAC_LAUNCHER && io_context == IOCONTEXT_VACUUM)
+ return false;
+
+ if ((bktype == B_AUTOVAC_WORKER || bktype == B_AUTOVAC_LAUNCHER) &&
+ io_context == IOCONTEXT_BULKWRITE)
+ return false;
+
+ return true;
+}
+
+/*
+ * Some BackendTypes will never do certain IOOps and some IOOps should not
+ * occur in certain IOContexts or on certain IOObjects. Check that the given
+ * IOOp is valid for the given BackendType in the given IOContext and on the
+ * given IOObject. Note that there are currently no cases of an IOOp being
+ * invalid for a particular BackendType only within a certain IOContext and/or
+ * only on a certain IOObject.
+ */
+bool
+pgstat_tracks_io_op(BackendType bktype, IOObject io_object,
+ IOContext io_context, IOOp io_op)
+{
+ bool strategy_io_context;
+
+ /* if (io_context, io_object) will never collect stats, we're done */
+ if (!pgstat_tracks_io_object(bktype, io_object, io_context))
+ return false;
+
+ /*
+ * Some BackendTypes will not do certain IOOps.
+ */
+ if ((bktype == B_BG_WRITER || bktype == B_CHECKPOINTER) &&
+ (io_op == IOOP_READ || io_op == IOOP_EVICT))
+ return false;
+
+ if ((bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER ||
+ bktype == B_CHECKPOINTER) && io_op == IOOP_EXTEND)
+ return false;
+
+ /*
+ * Some IOOps are not valid in certain IOContexts and some IOOps are only
+ * valid in certain contexts.
+ */
+ if (io_context == IOCONTEXT_BULKREAD && io_op == IOOP_EXTEND)
+ return false;
+
+ strategy_io_context = io_context == IOCONTEXT_BULKREAD ||
+ io_context == IOCONTEXT_BULKWRITE || io_context == IOCONTEXT_VACUUM;
+
+ /*
+ * IOOP_REUSE is only relevant when a BufferAccessStrategy is in use.
+ */
+ if (!strategy_io_context && io_op == IOOP_REUSE)
+ return false;
+
+ /*
+ * IOOP_FSYNC IOOps done by a backend using a BufferAccessStrategy are
+ * counted in the IOCONTEXT_NORMAL IOContext. See comment in
+ * register_dirty_segment() for more details.
+ */
+ if (strategy_io_context && io_op == IOOP_FSYNC)
+ return false;
+
+ /*
+ * Temporary tables are not logged and thus do not require fsync'ing.
+ */
+ if (io_context == IOCONTEXT_NORMAL &&
+ io_object == IOOBJECT_TEMP_RELATION && io_op == IOOP_FSYNC)
+ return false;
+
+ return true;
+}
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index 2e20b93c20..f793ac1516 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -206,7 +206,7 @@ pgstat_drop_relation(Relation rel)
}
/*
- * Report that the table was just vacuumed.
+ * Report that the table was just vacuumed and flush IO statistics.
*/
void
pgstat_report_vacuum(Oid tableoid, bool shared,
@@ -258,10 +258,18 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
}
pgstat_unlock_entry(entry_ref);
+
+ /*
+ * Flush IO statistics now. pgstat_report_stat() will flush IO stats,
+ * however this will not be called until after an entire autovacuum cycle
+ * is done -- which will likely vacuum many relations -- or until the
+ * VACUUM command has processed all tables and committed.
+ */
+ pgstat_flush_io(false);
}
/*
- * Report that the table was just analyzed.
+ * Report that the table was just analyzed and flush IO statistics.
*
* Caller must provide new live- and dead-tuples estimates, as well as a
* flag indicating whether to reset the mod_since_analyze counter.
@@ -341,6 +349,9 @@ pgstat_report_analyze(Relation rel,
}
pgstat_unlock_entry(entry_ref);
+
+ /* see pgstat_report_vacuum() */
+ pgstat_flush_io(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_shmem.c b/src/backend/utils/activity/pgstat_shmem.c
index c1506b53d0..09fffd0e82 100644
--- a/src/backend/utils/activity/pgstat_shmem.c
+++ b/src/backend/utils/activity/pgstat_shmem.c
@@ -202,6 +202,10 @@ StatsShmemInit(void)
LWLockInitialize(&ctl->checkpointer.lock, LWTRANCHE_PGSTATS_DATA);
LWLockInitialize(&ctl->slru.lock, LWTRANCHE_PGSTATS_DATA);
LWLockInitialize(&ctl->wal.lock, LWTRANCHE_PGSTATS_DATA);
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ LWLockInitialize(&ctl->io.locks[i],
+ LWTRANCHE_PGSTATS_DATA);
}
else
{
diff --git a/src/backend/utils/activity/pgstat_wal.c b/src/backend/utils/activity/pgstat_wal.c
index e7a82b5fed..e8598b2f4e 100644
--- a/src/backend/utils/activity/pgstat_wal.c
+++ b/src/backend/utils/activity/pgstat_wal.c
@@ -34,7 +34,7 @@ static WalUsage prevWalUsage;
/*
* Calculate how much WAL usage counters have increased and update
- * shared statistics.
+ * shared WAL and IO statistics.
*
* Must be called by processes that generate WAL, that do not call
* pgstat_report_stat(), like walwriter.
@@ -43,6 +43,8 @@ void
pgstat_report_wal(bool force)
{
pgstat_flush_wal(force);
+
+ pgstat_flush_io(force);
}
/*
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 58bd1360b9..6df9f06a20 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1576,7 +1576,12 @@ pg_stat_reset(PG_FUNCTION_ARGS)
PG_RETURN_VOID();
}
-/* Reset some shared cluster-wide counters */
+/*
+ * Reset some shared cluster-wide counters
+ *
+ * When adding a new reset target, ideally the name should match that in
+ * pgstat_kind_infos, if relevant.
+ */
Datum
pg_stat_reset_shared(PG_FUNCTION_ARGS)
{
@@ -1593,6 +1598,8 @@ pg_stat_reset_shared(PG_FUNCTION_ARGS)
pgstat_reset_of_kind(PGSTAT_KIND_BGWRITER);
pgstat_reset_of_kind(PGSTAT_KIND_CHECKPOINTER);
}
+ else if (strcmp(target, "io") == 0)
+ pgstat_reset_of_kind(PGSTAT_KIND_IO);
else if (strcmp(target, "recovery_prefetch") == 0)
XLogPrefetchResetStats();
else if (strcmp(target, "wal") == 0)
@@ -1601,7 +1608,7 @@ pg_stat_reset_shared(PG_FUNCTION_ARGS)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("unrecognized reset target: \"%s\"", target),
- errhint("Target must be \"archiver\", \"bgwriter\", \"recovery_prefetch\", or \"wal\".")));
+ errhint("Target must be \"archiver\", \"bgwriter\", \"io\", \"recovery_prefetch\", or \"wal\".")));
PG_RETURN_VOID();
}
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 96b3a1e1a0..c309e0233d 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -332,6 +332,8 @@ typedef enum BackendType
B_WAL_WRITER,
} BackendType;
+#define BACKEND_NUM_TYPES (B_WAL_WRITER + 1)
+
extern PGDLLIMPORT BackendType MyBackendType;
extern const char *GetBackendTypeDesc(BackendType backendType);
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 5e3326a3b9..9f09caa05f 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -48,6 +48,7 @@ typedef enum PgStat_Kind
PGSTAT_KIND_ARCHIVER,
PGSTAT_KIND_BGWRITER,
PGSTAT_KIND_CHECKPOINTER,
+ PGSTAT_KIND_IO,
PGSTAT_KIND_SLRU,
PGSTAT_KIND_WAL,
} PgStat_Kind;
@@ -276,6 +277,55 @@ typedef struct PgStat_CheckpointerStats
PgStat_Counter buf_fsync_backend;
} PgStat_CheckpointerStats;
+
+/*
+ * Types related to counting IO operations
+ */
+typedef enum IOObject
+{
+ IOOBJECT_RELATION,
+ IOOBJECT_TEMP_RELATION,
+} IOObject;
+
+#define IOOBJECT_FIRST IOOBJECT_RELATION
+#define IOOBJECT_NUM_TYPES (IOOBJECT_TEMP_RELATION + 1)
+
+typedef enum IOContext
+{
+ IOCONTEXT_BULKREAD,
+ IOCONTEXT_BULKWRITE,
+ IOCONTEXT_NORMAL,
+ IOCONTEXT_VACUUM,
+} IOContext;
+
+#define IOCONTEXT_FIRST IOCONTEXT_BULKREAD
+#define IOCONTEXT_NUM_TYPES (IOCONTEXT_VACUUM + 1)
+
+typedef enum IOOp
+{
+ IOOP_EVICT,
+ IOOP_EXTEND,
+ IOOP_FSYNC,
+ IOOP_READ,
+ IOOP_REUSE,
+ IOOP_WRITE,
+} IOOp;
+
+#define IOOP_FIRST IOOP_EVICT
+#define IOOP_NUM_TYPES (IOOP_WRITE + 1)
+
+typedef struct PgStat_BktypeIO
+{
+ PgStat_Counter data[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+} PgStat_BktypeIO;
+
+typedef struct PgStat_IO
+{
+ TimestampTz stat_reset_timestamp;
+ PgStat_BktypeIO stats[BACKEND_NUM_TYPES];
+} PgStat_IO;
+
+
typedef struct PgStat_StatDBEntry
{
PgStat_Counter xact_commit;
@@ -453,6 +503,24 @@ extern void pgstat_report_checkpointer(void);
extern PgStat_CheckpointerStats *pgstat_fetch_stat_checkpointer(void);
+/*
+ * Functions in pgstat_io.c
+ */
+
+extern bool pgstat_bktype_io_stats_valid(PgStat_BktypeIO *context_ops,
+ BackendType bktype);
+extern void pgstat_count_io_op(IOObject io_object, IOContext io_context, IOOp io_op);
+extern PgStat_IO *pgstat_fetch_stat_io(void);
+extern const char *pgstat_get_io_context_name(IOContext io_context);
+extern const char *pgstat_get_io_object_name(IOObject io_object);
+
+extern bool pgstat_tracks_io_bktype(BackendType bktype);
+extern bool pgstat_tracks_io_object(BackendType bktype,
+ IOObject io_object, IOContext io_context);
+extern bool pgstat_tracks_io_op(BackendType bktype, IOObject io_object,
+ IOContext io_context, IOOp io_op);
+
+
/*
* Functions in pgstat_database.c
*/
diff --git a/src/include/utils/pgstat_internal.h b/src/include/utils/pgstat_internal.h
index 12fd51f1ae..6badb2fde4 100644
--- a/src/include/utils/pgstat_internal.h
+++ b/src/include/utils/pgstat_internal.h
@@ -329,6 +329,17 @@ typedef struct PgStatShared_Checkpointer
PgStat_CheckpointerStats reset_offset;
} PgStatShared_Checkpointer;
+/* Shared-memory ready PgStat_IO */
+typedef struct PgStatShared_IO
+{
+ /*
+ * locks[i] protects stats.stats[i]. locks[0] also protects
+ * stats.stat_reset_timestamp.
+ */
+ LWLock locks[BACKEND_NUM_TYPES];
+ PgStat_IO stats;
+} PgStatShared_IO;
+
typedef struct PgStatShared_SLRU
{
/* lock protects ->stats */
@@ -419,6 +430,7 @@ typedef struct PgStat_ShmemControl
PgStatShared_Archiver archiver;
PgStatShared_BgWriter bgwriter;
PgStatShared_Checkpointer checkpointer;
+ PgStatShared_IO io;
PgStatShared_SLRU slru;
PgStatShared_Wal wal;
} PgStat_ShmemControl;
@@ -442,6 +454,8 @@ typedef struct PgStat_Snapshot
PgStat_CheckpointerStats checkpointer;
+ PgStat_IO io;
+
PgStat_SLRUStats slru[SLRU_NUM_ELEMENTS];
PgStat_WalStats wal;
@@ -549,6 +563,15 @@ extern void pgstat_database_reset_timestamp_cb(PgStatShared_Common *header, Time
extern bool pgstat_function_flush_cb(PgStat_EntryRef *entry_ref, bool nowait);
+/*
+ * Functions in pgstat_io.c
+ */
+
+extern bool pgstat_flush_io(bool nowait);
+extern void pgstat_io_reset_all_cb(TimestampTz ts);
+extern void pgstat_io_snapshot_cb(void);
+
+
/*
* Functions in pgstat_relation.c
*/
@@ -643,6 +666,13 @@ extern void pgstat_create_transactional(PgStat_Kind kind, Oid dboid, Oid objoid)
extern PGDLLIMPORT PgStat_LocalState pgStatLocal;
+/*
+ * Variables in pgstat_io.c
+ */
+
+extern PGDLLIMPORT bool have_iostats;
+
+
/*
* Variables in pgstat_slru.c
*/
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 23bafec5f7..1be6e07980 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1106,7 +1106,10 @@ ID
INFIX
INT128
INTERFACE_INFO
+IOContext
IOFuncSelector
+IOObject
+IOOp
IPCompareMethod
ITEM
IV
@@ -2016,6 +2019,7 @@ PgStatShared_Common
PgStatShared_Database
PgStatShared_Function
PgStatShared_HashEntry
+PgStatShared_IO
PgStatShared_Relation
PgStatShared_ReplSlot
PgStatShared_SLRU
@@ -2025,6 +2029,7 @@ PgStat_ArchiverStats
PgStat_BackendFunctionEntry
PgStat_BackendSubEntry
PgStat_BgWriterStats
+PgStat_BktypeIO
PgStat_CheckpointerStats
PgStat_Counter
PgStat_EntryRef
@@ -2033,6 +2038,7 @@ PgStat_FetchConsistency
PgStat_FunctionCallUsage
PgStat_FunctionCounts
PgStat_HashKey
+PgStat_IO
PgStat_Kind
PgStat_KindInfo
PgStat_LocalState
--
2.34.1
v49-0004-pg_stat_io-documentation.patch (text/x-patch)
From 86be2a8ef4e800061ca57f0ba42ac4ebc0c4ac91 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 17 Jan 2023 16:34:27 -0500
Subject: [PATCH v49 4/4] pg_stat_io documentation
Author: Melanie Plageman <melanieplageman@gmail.com>
Author: Samay Sharma <smilingsamay@gmail.com>
Reviewed-by: Maciek Sakrejda <m.sakrejda@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
doc/src/sgml/monitoring.sgml | 321 +++++++++++++++++++++++++++++++++--
1 file changed, 307 insertions(+), 14 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 8d51ca3773..b875fc3f12 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -469,6 +469,16 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</entry>
</row>
+ <row>
+ <entry><structname>pg_stat_io</structname><indexterm><primary>pg_stat_io</primary></indexterm></entry>
+ <entry>
+ One row for each combination of backend type, context, and target object
+ containing cluster-wide I/O statistics.
+ See <link linkend="monitoring-pg-stat-io-view">
+ <structname>pg_stat_io</structname></link> for details.
+ </entry>
+ </row>
+
<row>
<entry><structname>pg_stat_replication_slots</structname><indexterm><primary>pg_stat_replication_slots</primary></indexterm></entry>
<entry>One row per replication slot, showing statistics about the
@@ -665,20 +675,16 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</para>
<para>
- The <structname>pg_statio_</structname> views are primarily useful to
- determine the effectiveness of the buffer cache. When the number
- of actual disk reads is much smaller than the number of buffer
- hits, then the cache is satisfying most read requests without
- invoking a kernel call. However, these statistics do not give the
- entire story: due to the way in which <productname>PostgreSQL</productname>
- handles disk I/O, data that is not in the
- <productname>PostgreSQL</productname> buffer cache might still reside in the
- kernel's I/O cache, and might therefore still be fetched without
- requiring a physical read. Users interested in obtaining more
- detailed information on <productname>PostgreSQL</productname> I/O behavior are
- advised to use the <productname>PostgreSQL</productname> statistics views
- in combination with operating system utilities that allow insight
- into the kernel's handling of I/O.
+ The <structname>pg_stat_io</structname> and
+ <structname>pg_statio_</structname> set of views are useful for determining
+ the effectiveness of the buffer cache. They can be used to calculate a cache
+ hit ratio. Note that while <productname>PostgreSQL</productname>'s I/O
+ statistics capture most instances in which the kernel was invoked in order
+ to perform I/O, they do not differentiate between data which had to be
+ fetched from disk and that which already resided in the kernel page cache.
+ Users are advised to use the <productname>PostgreSQL</productname>
+ statistics views in combination with operating system utilities for a more
+ complete picture of their database's I/O performance.
</para>
</sect2>
@@ -3643,6 +3649,293 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
<structfield>last_archived_wal</structfield> have also been successfully
archived.
</para>
+ </sect2>
+
+ <sect2 id="monitoring-pg-stat-io-view">
+ <title><structname>pg_stat_io</structname></title>
+
+ <indexterm>
+ <primary>pg_stat_io</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_stat_io</structname> view will contain one row for each
+ combination of backend type, target I/O object, and I/O context, showing
+ cluster-wide I/O statistics. Combinations which do not make sense are
+ omitted.
+ </para>
+
+ <para>
+ Currently, I/O on relations (e.g. tables, indexes) is tracked. However,
+ relation I/O which bypasses shared buffers (e.g. when moving a table from one
+ tablespace to another) is currently not tracked.
+ </para>
+
+ <table id="pg-stat-io-view" xreflabel="pg_stat_io">
+ <title><structname>pg_stat_io</structname> View</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para>
+ </entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>backend_type</structfield> <type>text</type>
+ </para>
+ <para>
+ Type of backend (e.g. background worker, autovacuum worker). See <link
+ linkend="monitoring-pg-stat-activity-view">
+ <structname>pg_stat_activity</structname></link> for more information
+ on <varname>backend_type</varname>s. Some
+ <varname>backend_type</varname>s do not accumulate I/O operation
+ statistics and will not be included in the view.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>io_object</structfield> <type>text</type>
+ </para>
+ <para>
+ Target object of an I/O operation. Possible values are:
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>relation</literal>: Permanent relations.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>temp relation</literal>: Temporary relations.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>io_context</structfield> <type>text</type>
+ </para>
+ <para>
+ The context of an I/O operation. Possible values are:
+ </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>normal</literal>: The default or standard
+ <varname>io_context</varname> for a type of I/O operation. For
+ example, by default, relation data is read into and written out from
+ shared buffers. Thus, reads and writes of relation data to and from
+ shared buffers are tracked in <varname>io_context</varname>
+ <literal>normal</literal>.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>vacuum</literal>: I/O operations performed outside of shared
+ buffers while vacuuming and analyzing permanent relations. Temporary
+ table vacuums use the same local buffer pool as other temporary table
+ I/O operations and are tracked in <varname>io_context</varname>
+ <literal>normal</literal>.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>bulkread</literal>: Certain large read I/O operations
+ done outside of shared buffers, for example, a sequential scan of a
+ large table.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>bulkwrite</literal>: Certain large write I/O operations
+ done outside of shared buffers, such as <command>COPY</command>.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>reads</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of read operations, each of the size specified in
+ <varname>op_bytes</varname>.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>writes</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of write operations, each of the size specified in
+ <varname>op_bytes</varname>.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>extends</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of relation extend operations, each of the size specified in
+ <varname>op_bytes</varname>.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>op_bytes</structfield> <type>bigint</type>
+ </para>
+ <para>
+ The number of bytes per unit of I/O read, written, or extended.
+ </para>
+ <para>
+ Relation data reads, writes, and extends are done in
+ <varname>block_size</varname> units, derived from the build-time
+ parameter <symbol>BLCKSZ</symbol>, which is <literal>8192</literal> by
+ default.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>evictions</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of times a block has been written out from a shared or local
+ buffer in order to make it available for another use.
+ </para>
+ <para>
+ In <varname>io_context</varname> <literal>normal</literal>, this counts
+ the number of times a block was evicted from a buffer and replaced with
+ another block. In <varname>io_context</varname>s
+ <literal>bulkwrite</literal>, <literal>bulkread</literal>, and
+ <literal>vacuum</literal>, this counts the number of times a block was
+ evicted from shared buffers in order to add the shared buffer to a
+ separate, size-limited ring buffer for use in a bulk I/O operation.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>reuses</structfield> <type>bigint</type>
+ </para>
+ <para>
+ The number of times an existing buffer in a size-limited ring buffer
+ outside of shared buffers was reused as part of an I/O operation in the
+ <literal>bulkread</literal>, <literal>bulkwrite</literal>, or
+ <literal>vacuum</literal> <varname>io_context</varname>s.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>fsyncs</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of <literal>fsync</literal> calls. These are only tracked in
+ <varname>io_context</varname> <literal>normal</literal>.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
+ </para>
+ <para>
+ Time at which these statistics were last reset.
+ </para>
+ </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ Some backend types never perform I/O operations on some I/O objects and/or
+ in some I/O contexts. These rows are omitted from the view. For example, the
+ checkpointer does not checkpoint temporary tables, so there will be no rows
+ for <varname>backend_type</varname> <literal>checkpointer</literal> and
+ <varname>io_object</varname> <literal>temp relation</literal>.
+ </para>
+
+ <para>
+ In addition, some I/O operations will never be performed either by certain
+ backend types or on certain I/O objects and/or in certain I/O contexts.
+ These cells will be NULL. For example, temporary tables are not
+ <literal>fsync</literal>ed, so <varname>fsyncs</varname> will be NULL for
+ <varname>io_object</varname> <literal>temp relation</literal>. Also, the
+ background writer does not perform reads, so <varname>reads</varname> will
+ be NULL in rows for <varname>backend_type</varname> <literal>background
+ writer</literal>.
+ </para>
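[Reviewer sketch, not part of the patch — the NULL-cell convention described above can be observed directly once the view exists. For example, for the background writer the <varname>reads</varname> column should always be NULL rather than zero:]

```sql
-- Inspect the background writer's rows; reads is NULL because that
-- backend type never performs reads, distinguishing "impossible" from
-- "zero observed operations".
SELECT backend_type, io_object, io_context, reads, writes, fsyncs
  FROM pg_stat_io
 WHERE backend_type = 'background writer';
```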
+
+ <para>
+ <structname>pg_stat_io</structname> can be used to inform database tuning.
+ For example:
+ <itemizedlist>
+ <listitem>
+ <para>
+ A high <varname>evictions</varname> count can indicate that shared
+ buffers should be increased.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ Client backends rely on the checkpointer to ensure data is persisted to
+ permanent storage. Large numbers of <varname>fsyncs</varname> by
+ <literal>client backend</literal>s could indicate a misconfiguration of
+ shared buffers or of the checkpointer. More information on configuring
+ the checkpointer can be found in <xref linkend="wal-configuration"/>.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ Normally, client backends should be able to rely on auxiliary processes
+ like the checkpointer and the background writer to write out dirty data
+ as much as possible. Large numbers of writes by client backends could
+ indicate a misconfiguration of shared buffers or of the checkpointer.
+ More information on configuring the checkpointer can be found in <xref
+ linkend="wal-configuration"/>.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </para>
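[Reviewer sketch, not part of the patch — the tuning guidance above could translate into a query such as the following, which surfaces eviction-heavy backend types in the <literal>normal</literal> context:]

```sql
-- Which backend types evict the most shared buffers during normal
-- operation? Persistently high counts across backends may suggest
-- that shared_buffers is undersized.
SELECT backend_type,
       sum(evictions) AS evictions,
       sum(writes)    AS writes,
       sum(fsyncs)    AS fsyncs
  FROM pg_stat_io
 WHERE io_object = 'relation' AND io_context = 'normal'
 GROUP BY backend_type
 ORDER BY evictions DESC NULLS LAST;
```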
+
</sect2>
--
2.34.1
Attachment: v49-0002-pgstat-Count-IO-for-relations.patch (text/x-patch, US-ASCII)
From cb2dd852c8435537ed9a9a148c719e81c0dc22ce Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 17 Jan 2023 16:25:31 -0500
Subject: [PATCH v49 2/4] pgstat: Count IO for relations
Count IOOps done on IOObjects in IOContexts by various BackendTypes
using the IO stats infrastructure introduced by a previous commit.
The primary concern of these statistics is IO operations on data blocks
during the course of normal database operations. IO operations done by,
for example, the archiver or syslogger are not counted in these
statistics. WAL IO, temporary file IO, and IO done directly through smgr*
functions (such as when building an index) are not yet counted but would
be useful future additions.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
src/backend/storage/buffer/bufmgr.c | 111 ++++++++++++++++++++++----
src/backend/storage/buffer/freelist.c | 58 ++++++++++----
src/backend/storage/buffer/localbuf.c | 13 ++-
src/backend/storage/smgr/md.c | 24 ++++++
src/include/storage/buf_internals.h | 8 +-
src/include/storage/bufmgr.h | 7 +-
6 files changed, 185 insertions(+), 36 deletions(-)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 8075828e8a..ff12bc2ba6 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -481,8 +481,9 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
ForkNumber forkNum,
BlockNumber blockNum,
BufferAccessStrategy strategy,
- bool *foundPtr);
-static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+ bool *foundPtr, IOContext *io_context);
+static void FlushBuffer(BufferDesc *buf, SMgrRelation reln,
+ IOObject io_object, IOContext io_context);
static void FindAndDropRelationBuffers(RelFileLocator rlocator,
ForkNumber forkNum,
BlockNumber nForkBlock,
@@ -823,6 +824,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
BufferDesc *bufHdr;
Block bufBlock;
bool found;
+ IOContext io_context;
+ IOObject io_object;
bool isExtend;
bool isLocalBuf = SmgrIsTemp(smgr);
@@ -855,7 +858,14 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (isLocalBuf)
{
- bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
+ /*
+ * LocalBufferAlloc() will set the io_context to IOCONTEXT_NORMAL. We
+ * do not use a BufferAccessStrategy for I/O of temporary tables.
+ * However, in some cases, the "strategy" may not be NULL, so we can't
+ * rely on IOContextForStrategy() to set the right IOContext for us.
+ * This may happen in cases like CREATE TEMPORARY TABLE AS...
+ */
+ bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found, &io_context);
if (found)
pgBufferUsage.local_blks_hit++;
else if (isExtend)
@@ -871,7 +881,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* not currently in memory.
*/
bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum,
- strategy, &found);
+ strategy, &found, &io_context);
if (found)
pgBufferUsage.shared_blks_hit++;
else if (isExtend)
@@ -986,7 +996,16 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
*/
Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID)); /* spinlock not needed */
- bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
+ if (isLocalBuf)
+ {
+ bufBlock = LocalBufHdrGetBlock(bufHdr);
+ io_object = IOOBJECT_TEMP_RELATION;
+ }
+ else
+ {
+ bufBlock = BufHdrGetBlock(bufHdr);
+ io_object = IOOBJECT_RELATION;
+ }
if (isExtend)
{
@@ -995,6 +1014,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
/* don't set checksum for all-zero page */
smgrextend(smgr, forkNum, blockNum, (char *) bufBlock, false);
+ pgstat_count_io_op(io_object, io_context, IOOP_EXTEND);
+
/*
* NB: we're *not* doing a ScheduleBufferTagForWriteback here;
* although we're essentially performing a write. At least on linux
@@ -1020,6 +1041,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
smgrread(smgr, forkNum, blockNum, (char *) bufBlock);
+ pgstat_count_io_op(io_object, io_context, IOOP_READ);
+
if (track_io_timing)
{
INSTR_TIME_SET_CURRENT(io_time);
@@ -1113,14 +1136,19 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* *foundPtr is actually redundant with the buffer's BM_VALID flag, but
* we keep it for simplicity in ReadBuffer.
*
+ * io_context is passed as an output parameter to avoid calling
+ * IOContextForStrategy() when there is a shared buffers hit and no IO
+ * statistics need be captured.
+ *
* No locks are held either at entry or exit.
*/
static BufferDesc *
BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
BlockNumber blockNum,
BufferAccessStrategy strategy,
- bool *foundPtr)
+ bool *foundPtr, IOContext *io_context)
{
+ bool from_ring;
BufferTag newTag; /* identity of requested block */
uint32 newHash; /* hash value for newTag */
LWLock *newPartitionLock; /* buffer partition lock for it */
@@ -1172,8 +1200,11 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
{
/*
* If we get here, previous attempts to read the buffer must
- * have failed ... but we shall bravely try again.
+ * have failed ... but we shall bravely try again. Set
+ * io_context since we will in fact need to count an IO
+ * Operation.
*/
+ *io_context = IOContextForStrategy(strategy);
*foundPtr = false;
}
}
@@ -1187,6 +1218,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
*/
LWLockRelease(newPartitionLock);
+ *io_context = IOContextForStrategy(strategy);
+
/* Loop here in case we have to try another victim buffer */
for (;;)
{
@@ -1200,7 +1233,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* Select a victim buffer. The buffer is returned with its header
* spinlock still held!
*/
- buf = StrategyGetBuffer(strategy, &buf_state);
+ buf = StrategyGetBuffer(strategy, &buf_state, &from_ring);
Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
@@ -1254,7 +1287,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
UnlockBufHdr(buf, buf_state);
if (XLogNeedsFlush(lsn) &&
- StrategyRejectBuffer(strategy, buf))
+ StrategyRejectBuffer(strategy, buf, from_ring))
{
/* Drop lock/pin and loop around for another buffer */
LWLockRelease(BufferDescriptorGetContentLock(buf));
@@ -1269,7 +1302,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
smgr->smgr_rlocator.locator.dbOid,
smgr->smgr_rlocator.locator.relNumber);
- FlushBuffer(buf, NULL);
+ FlushBuffer(buf, NULL, IOOBJECT_RELATION, *io_context);
LWLockRelease(BufferDescriptorGetContentLock(buf));
ScheduleBufferTagForWriteback(&BackendWritebackContext,
@@ -1450,6 +1483,28 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
LWLockRelease(newPartitionLock);
+ if (oldFlags & BM_VALID)
+ {
+ /*
+ * When a BufferAccessStrategy is in use, blocks evicted from shared
+ * buffers are counted as IOOP_EVICT in the corresponding context
+ * (e.g. IOCONTEXT_BULKWRITE). Shared buffers are evicted by a
+ * strategy in two cases: 1) while initially claiming buffers for the
+ * strategy ring 2) to replace an existing strategy ring buffer
+ * because it is pinned or in use and cannot be reused.
+ *
+ * Blocks evicted from buffers already in the strategy ring are
+ * counted as IOOP_REUSE in the corresponding strategy context.
+ *
+ * At this point, we can accurately count evictions and reuses,
+ * because we have successfully claimed the valid buffer. Previously,
+ * we may have been forced to release the buffer due to concurrent
+ * pinners or erroring out.
+ */
+ pgstat_count_io_op(IOOBJECT_RELATION, *io_context,
+ from_ring ? IOOP_REUSE : IOOP_EVICT);
+ }
+
/*
* Buffer contents are currently invalid. Try to obtain the right to
* start I/O. If StartBufferIO returns false, then someone else managed
@@ -2570,7 +2625,7 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOOBJECT_RELATION, IOCONTEXT_NORMAL);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
@@ -2820,7 +2875,8 @@ BufferGetTag(Buffer buffer, RelFileLocator *rlocator, ForkNumber *forknum,
* as the second parameter. If not, pass NULL.
*/
static void
-FlushBuffer(BufferDesc *buf, SMgrRelation reln)
+FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
+ IOContext io_context)
{
XLogRecPtr recptr;
ErrorContextCallback errcallback;
@@ -2912,6 +2968,26 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
bufToWrite,
false);
+ /*
+ * When a strategy is in use, only flushes of dirty buffers already in the
+ * strategy ring are counted as strategy writes (IOCONTEXT
+ * [BULKREAD|BULKWRITE|VACUUM] IOOP_WRITE) for the purpose of IO
+ * statistics tracking.
+ *
+ * If a shared buffer initially added to the ring must be flushed before
+ * being used, this is counted as an IOCONTEXT_NORMAL IOOP_WRITE.
+ *
+ * If a shared buffer which was added to the ring later because the
+ * current strategy buffer is pinned or in use or because all strategy
+ * buffers were dirty and rejected (for BAS_BULKREAD operations only)
+ * requires flushing, this is counted as an IOCONTEXT_NORMAL IOOP_WRITE
+ * (from_ring will be false).
+ *
+ * When a strategy is not in use, the write can only be a "regular" write
+ * of a dirty shared buffer (IOCONTEXT_NORMAL IOOP_WRITE).
+ */
+ pgstat_count_io_op(IOOBJECT_RELATION, io_context, IOOP_WRITE);
+
if (track_io_timing)
{
INSTR_TIME_SET_CURRENT(io_time);
@@ -3554,6 +3630,8 @@ FlushRelationBuffers(Relation rel)
buf_state &= ~(BM_DIRTY | BM_JUST_DIRTIED);
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+ pgstat_count_io_op(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_WRITE);
+
/* Pop the error context stack */
error_context_stack = errcallback.previous;
}
@@ -3586,7 +3664,8 @@ FlushRelationBuffers(Relation rel)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, RelationGetSmgr(rel));
+ FlushBuffer(bufHdr, RelationGetSmgr(rel), IOOBJECT_RELATION, IOCONTEXT_NORMAL);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr);
}
@@ -3684,7 +3763,7 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, srelent->srel);
+ FlushBuffer(bufHdr, srelent->srel, IOOBJECT_RELATION, IOCONTEXT_NORMAL);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr);
}
@@ -3894,7 +3973,7 @@ FlushDatabaseBuffers(Oid dbid)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOOBJECT_RELATION, IOCONTEXT_NORMAL);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr);
}
@@ -3921,7 +4000,7 @@ FlushOneBuffer(Buffer buffer)
Assert(LWLockHeldByMe(BufferDescriptorGetContentLock(bufHdr)));
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOOBJECT_RELATION, IOCONTEXT_NORMAL);
}
/*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 7dec35801c..c690d5f15f 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -15,6 +15,7 @@
*/
#include "postgres.h"
+#include "pgstat.h"
#include "port/atomics.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
@@ -81,12 +82,6 @@ typedef struct BufferAccessStrategyData
*/
int current;
- /*
- * True if the buffer just returned by StrategyGetBuffer had been in the
- * ring already.
- */
- bool current_was_in_ring;
-
/*
* Array of buffer numbers. InvalidBuffer (that is, zero) indicates we
* have not yet selected a buffer for this ring slot. For allocation
@@ -198,13 +193,15 @@ have_free_buffer(void)
* return the buffer with the buffer header spinlock still held.
*/
BufferDesc *
-StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
+StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_ring)
{
BufferDesc *buf;
int bgwprocno;
int trycounter;
uint32 local_buf_state; /* to avoid repeated (de-)referencing */
+ *from_ring = false;
+
/*
* If given a strategy object, see whether it can select a buffer. We
* assume strategy objects don't need buffer_strategy_lock.
@@ -213,7 +210,10 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
{
buf = GetBufferFromRing(strategy, buf_state);
if (buf != NULL)
+ {
+ *from_ring = true;
return buf;
+ }
}
/*
@@ -602,7 +602,7 @@ FreeAccessStrategy(BufferAccessStrategy strategy)
/*
* GetBufferFromRing -- returns a buffer from the ring, or NULL if the
- * ring is empty.
+ * ring is empty / not usable.
*
* The bufhdr spin lock is held on the returned buffer.
*/
@@ -625,10 +625,7 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
*/
bufnum = strategy->buffers[strategy->current];
if (bufnum == InvalidBuffer)
- {
- strategy->current_was_in_ring = false;
return NULL;
- }
/*
* If the buffer is pinned we cannot use it under any circumstances.
@@ -644,7 +641,6 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
if (BUF_STATE_GET_REFCOUNT(local_buf_state) == 0
&& BUF_STATE_GET_USAGECOUNT(local_buf_state) <= 1)
{
- strategy->current_was_in_ring = true;
*buf_state = local_buf_state;
return buf;
}
@@ -654,7 +650,6 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
* Tell caller to allocate a new buffer with the normal allocation
* strategy. He'll then replace this ring element via AddBufferToRing.
*/
- strategy->current_was_in_ring = false;
return NULL;
}
@@ -670,6 +665,39 @@ AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
strategy->buffers[strategy->current] = BufferDescriptorGetBuffer(buf);
}
+/*
+ * Utility function returning the IOContext of a given BufferAccessStrategy's
+ * strategy ring.
+ */
+IOContext
+IOContextForStrategy(BufferAccessStrategy strategy)
+{
+ if (!strategy)
+ return IOCONTEXT_NORMAL;
+
+ switch (strategy->btype)
+ {
+ case BAS_NORMAL:
+
+ /*
+ * Currently, GetAccessStrategy() returns NULL for
+ * BufferAccessStrategyType BAS_NORMAL, so this case is
+ * unreachable.
+ */
+ pg_unreachable();
+ return IOCONTEXT_NORMAL;
+ case BAS_BULKREAD:
+ return IOCONTEXT_BULKREAD;
+ case BAS_BULKWRITE:
+ return IOCONTEXT_BULKWRITE;
+ case BAS_VACUUM:
+ return IOCONTEXT_VACUUM;
+ }
+
+ elog(ERROR, "unrecognized BufferAccessStrategyType: %d", strategy->btype);
+ pg_unreachable();
+}
+
/*
* StrategyRejectBuffer -- consider rejecting a dirty buffer
*
@@ -682,14 +710,14 @@ AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
* if this buffer should be written and re-used.
*/
bool
-StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
+StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool from_ring)
{
/* We only do this in bulkread mode */
if (strategy->btype != BAS_BULKREAD)
return false;
/* Don't muck with behavior of normal buffer-replacement strategy */
- if (!strategy->current_was_in_ring ||
+ if (!from_ring ||
strategy->buffers[strategy->current] != BufferDescriptorGetBuffer(buf))
return false;
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 8372acc383..8e286db5df 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -18,6 +18,7 @@
#include "access/parallel.h"
#include "catalog/catalog.h"
#include "executor/instrument.h"
+#include "pgstat.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "utils/guc_hooks.h"
@@ -107,7 +108,7 @@ PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
*/
BufferDesc *
LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
- bool *foundPtr)
+ bool *foundPtr, IOContext *io_context)
{
BufferTag newTag; /* identity of requested block */
LocalBufferLookupEnt *hresult;
@@ -127,6 +128,14 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
hresult = (LocalBufferLookupEnt *)
hash_search(LocalBufHash, (void *) &newTag, HASH_FIND, NULL);
+ /*
+ * IO Operations on local buffers are only done in IOCONTEXT_NORMAL. Set
+ * io_context here (instead of after a buffer hit would have returned) for
+ * convenience since we don't have to worry about the overhead of calling
+ * IOContextForStrategy().
+ */
+ *io_context = IOCONTEXT_NORMAL;
+
if (hresult)
{
b = hresult->id;
@@ -230,6 +239,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
buf_state &= ~BM_DIRTY;
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+ pgstat_count_io_op(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_WRITE);
pgBufferUsage.local_blks_written++;
}
@@ -256,6 +266,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
ClearBufferTag(&bufHdr->tag);
buf_state &= ~(BM_VALID | BM_TAG_VALID);
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+ pgstat_count_io_op(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_EVICT);
}
hresult = (LocalBufferLookupEnt *)
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 60c9905eff..8da813600c 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -983,6 +983,15 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
{
MdfdVec *v = &reln->md_seg_fds[forknum][segno - 1];
+ /*
+ * fsyncs done through mdimmedsync() should be tracked in a separate
+ * IOContext than those done through mdsyncfiletag() to differentiate
+ * between unavoidable client backend fsyncs (e.g. those done during
+ * index build) and those which ideally would have been done by the
+ * checkpointer. Since other IO operations bypassing the buffer
+ * manager could also be tracked in such an IOContext, wait until
+ * these are also tracked to track immediate fsyncs.
+ */
if (FileSync(v->mdfd_vfd, WAIT_EVENT_DATA_FILE_IMMEDIATE_SYNC) < 0)
ereport(data_sync_elevel(ERROR),
(errcode_for_file_access(),
@@ -1021,6 +1030,19 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false /* retryOnError */ ))
{
+ /*
+ * We have no way of knowing if the current IOContext is
+ * IOCONTEXT_NORMAL or IOCONTEXT_[BULKREAD, BULKWRITE, VACUUM] at this
+ * point, so count the fsync as being in the IOCONTEXT_NORMAL
+ * IOContext. This is probably okay, because the number of backend
+ * fsyncs doesn't say anything about the efficacy of the
+ * BufferAccessStrategy. And counting both fsyncs done in
+ * IOCONTEXT_NORMAL and IOCONTEXT_[BULKREAD, BULKWRITE, VACUUM] under
+ * IOCONTEXT_NORMAL is likely clearer when investigating the number of
+ * backend fsyncs.
+ */
+ pgstat_count_io_op(IOOBJECT_RELATION, IOCONTEXT_NORMAL, IOOP_FSYNC);
+
ereport(DEBUG1,
(errmsg_internal("could not forward fsync request because request queue is full")));
@@ -1410,6 +1432,8 @@ mdsyncfiletag(const FileTag *ftag, char *path)
if (need_to_close)
FileClose(file);
+ pgstat_count_io_op(IOOBJECT_RELATION, IOCONTEXT_NORMAL, IOOP_FSYNC);
+
errno = save_errno;
return result;
}
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index ed8aa2519c..0b44814740 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -15,6 +15,7 @@
#ifndef BUFMGR_INTERNALS_H
#define BUFMGR_INTERNALS_H
+#include "pgstat.h"
#include "port/atomics.h"
#include "storage/buf.h"
#include "storage/bufmgr.h"
@@ -391,11 +392,12 @@ extern void IssuePendingWritebacks(WritebackContext *context);
extern void ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *tag);
/* freelist.c */
+extern IOContext IOContextForStrategy(BufferAccessStrategy bas);
extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
- uint32 *buf_state);
+ uint32 *buf_state, bool *from_ring);
extern void StrategyFreeBuffer(BufferDesc *buf);
extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
- BufferDesc *buf);
+ BufferDesc *buf, bool from_ring);
extern int StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
extern void StrategyNotifyBgWriter(int bgwprocno);
@@ -417,7 +419,7 @@ extern PrefetchBufferResult PrefetchLocalBuffer(SMgrRelation smgr,
ForkNumber forkNum,
BlockNumber blockNum);
extern BufferDesc *LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum,
- BlockNumber blockNum, bool *foundPtr);
+ BlockNumber blockNum, bool *foundPtr, IOContext *io_context);
extern void MarkLocalBufferDirty(Buffer buffer);
extern void DropRelationLocalBuffers(RelFileLocator rlocator,
ForkNumber forkNum,
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 33eadbc129..b8a18b8081 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -23,7 +23,12 @@
typedef void *Block;
-/* Possible arguments for GetAccessStrategy() */
+/*
+ * Possible arguments for GetAccessStrategy().
+ *
+ * If adding a new BufferAccessStrategyType, also add a new IOContext so
+ * IO statistics using this strategy are tracked.
+ */
typedef enum BufferAccessStrategyType
{
BAS_NORMAL, /* Normal random access */
--
2.34.1
Attachment: v49-0003-Add-system-view-tracking-IO-ops-per-backend-type.patch (text/x-patch, US-ASCII)
From d40934679b00fd1e157bd6942d7f3faf8be5ea8e Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 17 Jan 2023 16:28:27 -0500
Subject: [PATCH v49 3/4] Add system view tracking IO ops per backend type
Add pg_stat_io, a system view which tracks the number of IOOps
(evictions, reuses, reads, writes, extensions, and fsyncs) done on each
IOObject (relation, temp relation) in each IOContext ("normal" and those
using a BufferAccessStrategy) by each type of backend (e.g. client
backend, checkpointer).
Some BackendTypes do not accumulate IO operations statistics and will
not be included in the view.
Some IOObjects are never operated on in some IOContexts or by some
BackendTypes. These rows are omitted from the view. For example,
checkpointer will never operate on IOOBJECT_TEMP_RELATION data, so those
rows are omitted.
Some IOContexts are not used by some BackendTypes and will not be in the
view. For example, checkpointer does not use a BufferAccessStrategy
(currently), so there will be no rows for BufferAccessStrategy
IOContexts for checkpointer.
Some IOOps are invalid in combination with certain IOContexts and
certain IOObjects. Those cells will be NULL in the view to distinguish
between 0 observed IOOps of that type and an invalid combination. For
example, temporary tables are not fsynced so cells for all BackendTypes
for IOOBJECT_TEMP_RELATION and IOOP_FSYNC will be NULL.
Some BackendTypes never perform certain IOOps. Those cells will also be
NULL in the view. For example, bgwriter should not perform reads.
View stats are populated with statistics incremented when a backend
performs an IO Operation and maintained by the cumulative statistics
subsystem.
Each row of the view shows stats for a particular BackendType, IOObject,
IOContext combination (e.g. a client backend's operations on permanent
relations in shared buffers) and each column in the view is the total
number of IO Operations done (e.g. writes). So a cell in the view would
be, for example, the number of blocks of relation data written from
shared buffers by client backends since the last stats reset.
In anticipation of tracking WAL IO and non-block-oriented IO (such as
temporary file IO), the "op_bytes" column specifies the unit of the
"reads", "writes", and "extends" columns for a given row.
Note that some of the cells in the view are redundant with fields in
pg_stat_bgwriter (e.g. buffers_backend), however these have been kept in
pg_stat_bgwriter for backwards compatibility. Deriving the redundant
pg_stat_bgwriter stats from the IO operations stats structures was also
problematic due to the separate reset targets for 'bgwriter' and 'io'.
Suggested by Andres Freund
Catalog version should be bumped.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
contrib/amcheck/expected/check_heap.out | 34 ++++
contrib/amcheck/sql/check_heap.sql | 27 +++
src/backend/catalog/system_views.sql | 15 ++
src/backend/utils/adt/pgstatfuncs.c | 141 ++++++++++++++
src/include/catalog/pg_proc.dat | 9 +
src/test/regress/expected/rules.out | 12 ++
src/test/regress/expected/stats.out | 234 ++++++++++++++++++++++++
src/test/regress/sql/stats.sql | 148 +++++++++++++++
src/tools/pgindent/typedefs.list | 1 +
9 files changed, 621 insertions(+)
diff --git a/contrib/amcheck/expected/check_heap.out b/contrib/amcheck/expected/check_heap.out
index c010361025..e4785141a6 100644
--- a/contrib/amcheck/expected/check_heap.out
+++ b/contrib/amcheck/expected/check_heap.out
@@ -66,6 +66,22 @@ SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'ALL-VISIBLE');
INSERT INTO heaptest (a, b)
(SELECT gs, repeat('x', gs)
FROM generate_series(1,50) gs);
+-- pg_stat_io test:
+-- verify_heapam always uses a BAS_BULKREAD BufferAccessStrategy, whereas a
+-- sequential scan does so only if the table is large enough when compared to
+-- shared buffers (see initscan()). CREATE DATABASE ... also unconditionally
+-- uses a BAS_BULKREAD strategy, but we have chosen to use a tablespace and
+-- verify_heapam to provide coverage instead of adding another expensive
+-- operation to the main regression test suite.
+--
+-- Create an alternative tablespace and move the heaptest table to it, causing
+-- it to be rewritten and all the blocks to be reliably evicted from shared
+-- buffers -- guaranteeing actual reads when we next select from it.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE regress_test_stats_tblspc LOCATION '';
+SELECT sum(reads) AS stats_bulkreads_before
+ FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+ALTER TABLE heaptest SET TABLESPACE regress_test_stats_tblspc;
-- Check that valid options are not rejected nor corruption reported
-- for a non-empty table
SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'none');
@@ -88,6 +104,23 @@ SELECT * FROM verify_heapam(relation := 'heaptest', startblock := 0, endblock :=
-------+--------+--------+-----
(0 rows)
+-- verify_heapam should have read in the page written out by
+-- ALTER TABLE ... SET TABLESPACE ...
+-- causing an additional bulkread, which should be reflected in pg_stat_io.
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(reads) AS stats_bulkreads_after
+ FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+SELECT :stats_bulkreads_after > :stats_bulkreads_before;
+ ?column?
+----------
+ t
+(1 row)
+
CREATE ROLE regress_heaptest_role;
-- verify permissions are checked (error due to function not callable)
SET ROLE regress_heaptest_role;
@@ -195,6 +228,7 @@ ERROR: cannot check relation "test_foreign_table"
DETAIL: This operation is not supported for foreign tables.
-- cleanup
DROP TABLE heaptest;
+DROP TABLESPACE regress_test_stats_tblspc;
DROP TABLE test_partition;
DROP TABLE test_partitioned;
DROP OWNED BY regress_heaptest_role; -- permissions
diff --git a/contrib/amcheck/sql/check_heap.sql b/contrib/amcheck/sql/check_heap.sql
index 298de6886a..6794ca4eb0 100644
--- a/contrib/amcheck/sql/check_heap.sql
+++ b/contrib/amcheck/sql/check_heap.sql
@@ -20,11 +20,29 @@ SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'NONE');
SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'ALL-FROZEN');
SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'ALL-VISIBLE');
+
-- Add some data so subsequent tests are not entirely trivial
INSERT INTO heaptest (a, b)
(SELECT gs, repeat('x', gs)
FROM generate_series(1,50) gs);
+-- pg_stat_io test:
+-- verify_heapam always uses a BAS_BULKREAD BufferAccessStrategy, whereas a
+-- sequential scan does so only if the table is large enough when compared to
+-- shared buffers (see initscan()). CREATE DATABASE ... also unconditionally
+-- uses a BAS_BULKREAD strategy, but we have chosen to use a tablespace and
+-- verify_heapam to provide coverage instead of adding another expensive
+-- operation to the main regression test suite.
+--
+-- Create an alternative tablespace and move the heaptest table to it, causing
+-- it to be rewritten and all the blocks to be reliably evicted from shared
+-- buffers -- guaranteeing actual reads when we next select from it.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE regress_test_stats_tblspc LOCATION '';
+SELECT sum(reads) AS stats_bulkreads_before
+ FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+ALTER TABLE heaptest SET TABLESPACE regress_test_stats_tblspc;
+
-- Check that valid options are not rejected nor corruption reported
-- for a non-empty table
SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'none');
@@ -32,6 +50,14 @@ SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'all-frozen');
SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'all-visible');
SELECT * FROM verify_heapam(relation := 'heaptest', startblock := 0, endblock := 0);
+-- verify_heapam should have read in the page written out by
+-- ALTER TABLE ... SET TABLESPACE ...
+-- causing an additional bulkread, which should be reflected in pg_stat_io.
+SELECT pg_stat_force_next_flush();
+SELECT sum(reads) AS stats_bulkreads_after
+ FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+SELECT :stats_bulkreads_after > :stats_bulkreads_before;
+
CREATE ROLE regress_heaptest_role;
-- verify permissions are checked (error due to function not callable)
@@ -110,6 +136,7 @@ SELECT * FROM verify_heapam('test_foreign_table',
-- cleanup
DROP TABLE heaptest;
+DROP TABLESPACE regress_test_stats_tblspc;
DROP TABLE test_partition;
DROP TABLE test_partitioned;
DROP OWNED BY regress_heaptest_role; -- permissions
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index d2a8c82900..70699f4b85 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1116,6 +1116,21 @@ CREATE VIEW pg_stat_bgwriter AS
pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+CREATE VIEW pg_stat_io AS
+SELECT
+ b.backend_type,
+ b.io_object,
+ b.io_context,
+ b.reads,
+ b.writes,
+ b.extends,
+ b.op_bytes,
+ b.evictions,
+ b.reuses,
+ b.fsyncs,
+ b.stats_reset
+FROM pg_stat_get_io() b;
+
CREATE VIEW pg_stat_wal AS
SELECT
w.wal_records,
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 6df9f06a20..5b79d703b7 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1234,6 +1234,147 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
}
+/*
+ * When adding a new column to the pg_stat_io view, add a new enum value
+ * here above IO_NUM_COLUMNS.
+ */
+typedef enum io_stat_col
+{
+ IO_COL_BACKEND_TYPE,
+ IO_COL_IO_OBJECT,
+ IO_COL_IO_CONTEXT,
+ IO_COL_READS,
+ IO_COL_WRITES,
+ IO_COL_EXTENDS,
+ IO_COL_CONVERSION,
+ IO_COL_EVICTIONS,
+ IO_COL_REUSES,
+ IO_COL_FSYNCS,
+ IO_COL_RESET_TIME,
+ IO_NUM_COLUMNS,
+} io_stat_col;
+
+/*
+ * When adding a new IOOp, add a new io_stat_col and add a case to this
+ * function returning the corresponding io_stat_col.
+ */
+static io_stat_col
+pgstat_get_io_op_index(IOOp io_op)
+{
+ switch (io_op)
+ {
+ case IOOP_EVICT:
+ return IO_COL_EVICTIONS;
+ case IOOP_READ:
+ return IO_COL_READS;
+ case IOOP_REUSE:
+ return IO_COL_REUSES;
+ case IOOP_WRITE:
+ return IO_COL_WRITES;
+ case IOOP_EXTEND:
+ return IO_COL_EXTENDS;
+ case IOOP_FSYNC:
+ return IO_COL_FSYNCS;
+ }
+
+ elog(ERROR, "unrecognized IOOp value: %d", io_op);
+ pg_unreachable();
+}
+
+Datum
+pg_stat_get_io(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsinfo;
+ PgStat_IO *backends_io_stats;
+ Datum reset_time;
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+
+ backends_io_stats = pgstat_fetch_stat_io();
+
+ reset_time = TimestampTzGetDatum(backends_io_stats->stat_reset_timestamp);
+
+ for (BackendType bktype = B_INVALID; bktype < BACKEND_NUM_TYPES; bktype++)
+ {
+ Datum bktype_desc = CStringGetTextDatum(GetBackendTypeDesc(bktype));
+ PgStat_BktypeIO *bktype_stats = &backends_io_stats->stats[bktype];
+
+ /*
+ * In Assert builds, we can afford an extra loop through all of the
+ * counters checking that only expected stats are non-zero, since it
+ * keeps the non-Assert code cleaner.
+ */
+ Assert(pgstat_bktype_io_stats_valid(bktype_stats, bktype));
+
+ /*
+ * For those BackendTypes without IO Operation stats, skip
+ * representing them in the view altogether.
+ */
+ if (!pgstat_tracks_io_bktype(bktype))
+ continue;
+
+ for (IOObject io_obj = IOOBJECT_FIRST;
+ io_obj < IOOBJECT_NUM_TYPES; io_obj++)
+ {
+ const char *obj_name = pgstat_get_io_object_name(io_obj);
+
+ for (IOContext io_context = IOCONTEXT_FIRST;
+ io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ {
+ const char *context_name = pgstat_get_io_context_name(io_context);
+
+ Datum values[IO_NUM_COLUMNS] = {0};
+ bool nulls[IO_NUM_COLUMNS] = {0};
+
+ /*
+ * Some combinations of BackendType, IOObject, and IOContext
+ * are not valid for any type of IOOp. In such cases, omit the
+ * entire row from the view.
+ */
+ if (!pgstat_tracks_io_object(bktype, io_obj, io_context))
+ continue;
+
+ values[IO_COL_BACKEND_TYPE] = bktype_desc;
+ values[IO_COL_IO_CONTEXT] = CStringGetTextDatum(context_name);
+ values[IO_COL_IO_OBJECT] = CStringGetTextDatum(obj_name);
+ values[IO_COL_RESET_TIME] = TimestampTzGetDatum(reset_time);
+
+ /*
+ * Hard-code this to the value of BLCKSZ for now. Future
+ * values could include XLOG_BLCKSZ, once WAL IO is tracked,
+ * and constant multipliers, once non-block-oriented IO (e.g.
+ * temporary file IO) is tracked.
+ */
+ values[IO_COL_CONVERSION] = Int64GetDatum(BLCKSZ);
+
+ for (IOOp io_op = IOOP_FIRST; io_op < IOOP_NUM_TYPES; io_op++)
+ {
+ int col_idx = pgstat_get_io_op_index(io_op);
+
+ /*
+ * Some combinations of BackendType and IOOp, of IOContext
+ * and IOOp, and of IOObject and IOOp are not tracked. Set
+ * these cells in the view to NULL.
+ */
+ nulls[col_idx] = !pgstat_tracks_io_op(bktype, io_obj, io_context, io_op);
+
+ if (nulls[col_idx])
+ continue;
+
+ values[col_idx] =
+ Int64GetDatum(bktype_stats->data[io_obj][io_context][io_op]);
+ }
+
+ tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
+ values, nulls);
+ }
+ }
+ }
+
+ return (Datum) 0;
+}
+
/*
* Returns statistics of WAL activity
*/
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 3810de7b22..2155d93b44 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5690,6 +5690,15 @@
proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
+{ oid => '8459', descr => 'statistics: per backend type IO statistics',
+ proname => 'pg_stat_get_io', provolatile => 'v',
+ prorows => '30', proretset => 't',
+ proparallel => 'r', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,text,text,int8,int8,int8,int8,int8,int8,int8,timestamptz}',
+ proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}',
+ proargnames => '{backend_type,io_object,io_context,reads,writes,extends,op_bytes,evictions,reuses,fsyncs,stats_reset}',
+ prosrc => 'pg_stat_get_io' },
+
{ oid => '1136', descr => 'statistics: information about WAL activity',
proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
proparallel => 'r', prorettype => 'record', proargtypes => '',
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index a969ae63eb..8a7ed673c2 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1876,6 +1876,18 @@ pg_stat_gssapi| SELECT s.pid,
s.gss_enc AS encrypted
FROM pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid, query_id)
WHERE (s.client_port IS NOT NULL);
+pg_stat_io| SELECT b.backend_type,
+ b.io_object,
+ b.io_context,
+ b.reads,
+ b.writes,
+ b.extends,
+ b.op_bytes,
+ b.evictions,
+ b.reuses,
+ b.fsyncs,
+ b.stats_reset
+ FROM pg_stat_get_io() b(backend_type, io_object, io_context, reads, writes, extends, op_bytes, evictions, reuses, fsyncs, stats_reset);
pg_stat_progress_analyze| SELECT s.pid,
s.datid,
d.datname,
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index 1d84407a03..46bc79e740 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -1126,4 +1126,238 @@ SELECT pg_stat_get_subscription_stats(NULL);
(1 row)
+-- Test that the following operations are tracked in pg_stat_io:
+-- - reads of target blocks into shared buffers
+-- - writes of shared buffers to permanent storage
+-- - extends of relations using shared buffers
+-- - fsyncs done to ensure the durability of data dirtying shared buffers
+-- There is no test for blocks evicted from shared buffers, because we cannot
+-- be sure of the state of shared buffers at the point the test is run.
+-- Create a regular table and insert some data to generate IOCONTEXT_NORMAL
+-- extends.
+SELECT sum(extends) AS io_sum_shared_before_extends
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+CREATE TABLE test_io_shared(a int);
+INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(extends) AS io_sum_shared_after_extends
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT :io_sum_shared_after_extends > :io_sum_shared_before_extends;
+ ?column?
+----------
+ t
+(1 row)
+
+-- After a checkpoint, there should be some additional IOCONTEXT_NORMAL writes
+-- and fsyncs.
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+ FROM pg_stat_io
+ WHERE io_context = 'normal' AND io_object = 'relation' \gset io_sum_shared_before_
+-- See comment above for rationale for two explicit CHECKPOINTs.
+CHECKPOINT;
+CHECKPOINT;
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+ FROM pg_stat_io
+ WHERE io_context = 'normal' AND io_object = 'relation' \gset io_sum_shared_after_
+SELECT :io_sum_shared_after_writes > :io_sum_shared_before_writes;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT current_setting('fsync') = 'off'
+ OR :io_sum_shared_after_fsyncs > :io_sum_shared_before_fsyncs;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Change the tablespace so that the table is rewritten directly, then SELECT
+-- from it to cause it to be read back into shared buffers.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE regress_io_stats_tblspc LOCATION '';
+SELECT sum(reads) AS io_sum_shared_before_reads
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+ALTER TABLE test_io_shared SET TABLESPACE regress_io_stats_tblspc;
+-- SELECT from the table so that the data is read into shared buffers and
+-- io_context 'normal', io_object 'relation' reads are counted.
+SELECT COUNT(*) FROM test_io_shared;
+ count
+-------
+ 100
+(1 row)
+
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(reads) AS io_sum_shared_after_reads
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT :io_sum_shared_after_reads > :io_sum_shared_before_reads;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Drop the table so we can drop the tablespace later.
+DROP TABLE test_io_shared;
+-- Test that the following IOCONTEXT_LOCAL IOOps are tracked in pg_stat_io:
+-- - eviction of local buffers in order to reuse them
+-- - reads of temporary table blocks into local buffers
+-- - writes of local buffers to permanent storage
+-- - extends of temporary tables
+-- Set temp_buffers to its minimum so that we can trigger writes with fewer
+-- inserted tuples. Do so in a new session in case temporary tables have been
+-- accessed by previous tests in this session.
+\c
+SET temp_buffers TO 100;
+CREATE TEMPORARY TABLE test_io_local(a int, b TEXT);
+SELECT sum(extends) AS extends, sum(evictions) AS evictions, sum(writes) AS writes
+ FROM pg_stat_io
+ WHERE io_context = 'normal' AND io_object = 'temp relation' \gset io_sum_local_before_
+-- Insert tuples into the temporary table, generating extends in the stats.
+-- Insert enough values that we need to reuse and write out dirty local
+-- buffers, generating evictions and writes.
+INSERT INTO test_io_local SELECT generate_series(1, 5000) as id, repeat('a', 200);
+-- Ensure the table is large enough to exceed our temp_buffers setting.
+SELECT pg_relation_size('test_io_local') / current_setting('block_size')::int8 > 100;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT sum(reads) AS io_sum_local_before_reads
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+-- Read in evicted buffers, generating reads.
+SELECT COUNT(*) FROM test_io_local;
+ count
+-------
+ 5000
+(1 row)
+
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(evictions) AS evictions,
+ sum(reads) AS reads,
+ sum(writes) AS writes,
+ sum(extends) AS extends
+ FROM pg_stat_io
+ WHERE io_context = 'normal' AND io_object = 'temp relation' \gset io_sum_local_after_
+SELECT :io_sum_local_after_evictions > :io_sum_local_before_evictions,
+ :io_sum_local_after_reads > :io_sum_local_before_reads,
+ :io_sum_local_after_writes > :io_sum_local_before_writes,
+ :io_sum_local_after_extends > :io_sum_local_before_extends;
+ ?column? | ?column? | ?column? | ?column?
+----------+----------+----------+----------
+ t | t | t | t
+(1 row)
+
+-- Change the tablespaces so that the temporary table is rewritten to other
+-- local buffers, exercising a different codepath than standard local buffer
+-- writes.
+ALTER TABLE test_io_local SET TABLESPACE regress_io_stats_tblspc;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(writes) AS io_sum_local_new_tblspc_writes
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT :io_sum_local_new_tblspc_writes > :io_sum_local_after_writes;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Drop the table so we can drop the tablespace later.
+DROP TABLE test_io_local;
+RESET temp_buffers;
+DROP TABLESPACE regress_io_stats_tblspc;
+-- Test that reuse of strategy buffers and reads of blocks into these reused
+-- buffers while VACUUMing are tracked in pg_stat_io.
+-- Set wal_skip_threshold smaller than the expected size of
+-- test_io_vac_strategy so that, even if wal_level is minimal, VACUUM FULL will
+-- fsync the newly rewritten test_io_vac_strategy instead of writing it to WAL.
+-- Writing it to WAL will result in the newly written relation pages being in
+-- shared buffers -- preventing us from testing BAS_VACUUM BufferAccessStrategy
+-- reads.
+SET wal_skip_threshold = '1 kB';
+SELECT sum(reuses) AS reuses, sum(reads) AS reads
+ FROM pg_stat_io WHERE io_context = 'vacuum' \gset io_sum_vac_strategy_before_
+CREATE TABLE test_io_vac_strategy(a int, b int) WITH (autovacuum_enabled = 'false');
+INSERT INTO test_io_vac_strategy SELECT i, i from generate_series(1, 8000)i;
+-- Ensure that the next VACUUM will need to perform IO by rewriting the table
+-- first with VACUUM (FULL).
+VACUUM (FULL) test_io_vac_strategy;
+VACUUM (PARALLEL 0) test_io_vac_strategy;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(reuses) AS reuses, sum(reads) AS reads
+ FROM pg_stat_io WHERE io_context = 'vacuum' \gset io_sum_vac_strategy_after_
+SELECT :io_sum_vac_strategy_after_reads > :io_sum_vac_strategy_before_reads,
+ :io_sum_vac_strategy_after_reuses > :io_sum_vac_strategy_before_reuses;
+ ?column? | ?column?
+----------+----------
+ t | t
+(1 row)
+
+RESET wal_skip_threshold;
+-- Test that extends done by a CTAS, which uses a BAS_BULKWRITE
+-- BufferAccessStrategy, are tracked in pg_stat_io.
+SELECT sum(extends) AS io_sum_bulkwrite_strategy_extends_before
+ FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+CREATE TABLE test_io_bulkwrite_strategy AS SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(extends) AS io_sum_bulkwrite_strategy_extends_after
+ FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+SELECT :io_sum_bulkwrite_strategy_extends_after > :io_sum_bulkwrite_strategy_extends_before;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Test IO stats reset
+SELECT pg_stat_have_stats('io', 0, 0);
+ pg_stat_have_stats
+--------------------
+ t
+(1 row)
+
+SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) AS io_stats_pre_reset
+ FROM pg_stat_io \gset
+SELECT pg_stat_reset_shared('io');
+ pg_stat_reset_shared
+----------------------
+
+(1 row)
+
+SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) AS io_stats_post_reset
+ FROM pg_stat_io \gset
+SELECT :io_stats_post_reset < :io_stats_pre_reset;
+ ?column?
+----------
+ t
+(1 row)
+
-- End of Stats Test
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index b4d6753c71..4465649211 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -536,4 +536,152 @@ SELECT pg_stat_get_replication_slot(NULL);
SELECT pg_stat_get_subscription_stats(NULL);
+-- Test that the following operations are tracked in pg_stat_io:
+-- - reads of target blocks into shared buffers
+-- - writes of shared buffers to permanent storage
+-- - extends of relations using shared buffers
+-- - fsyncs done to ensure the durability of data dirtying shared buffers
+
+-- There is no test for blocks evicted from shared buffers, because we cannot
+-- be sure of the state of shared buffers at the point the test is run.
+
+-- Create a regular table and insert some data to generate IOCONTEXT_NORMAL
+-- extends.
+SELECT sum(extends) AS io_sum_shared_before_extends
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+CREATE TABLE test_io_shared(a int);
+INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+SELECT sum(extends) AS io_sum_shared_after_extends
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT :io_sum_shared_after_extends > :io_sum_shared_before_extends;
+
+-- After a checkpoint, there should be some additional IOCONTEXT_NORMAL writes
+-- and fsyncs.
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+ FROM pg_stat_io
+ WHERE io_context = 'normal' AND io_object = 'relation' \gset io_sum_shared_before_
+-- See comment above for rationale for two explicit CHECKPOINTs.
+CHECKPOINT;
+CHECKPOINT;
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+ FROM pg_stat_io
+ WHERE io_context = 'normal' AND io_object = 'relation' \gset io_sum_shared_after_
+
+SELECT :io_sum_shared_after_writes > :io_sum_shared_before_writes;
+SELECT current_setting('fsync') = 'off'
+ OR :io_sum_shared_after_fsyncs > :io_sum_shared_before_fsyncs;
+
+-- Change the tablespace so that the table is rewritten directly, then SELECT
+-- from it to cause it to be read back into shared buffers.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE regress_io_stats_tblspc LOCATION '';
+SELECT sum(reads) AS io_sum_shared_before_reads
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+ALTER TABLE test_io_shared SET TABLESPACE regress_io_stats_tblspc;
+-- SELECT from the table so that the data is read into shared buffers and
+-- io_context 'normal', io_object 'relation' reads are counted.
+SELECT COUNT(*) FROM test_io_shared;
+SELECT pg_stat_force_next_flush();
+SELECT sum(reads) AS io_sum_shared_after_reads
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT :io_sum_shared_after_reads > :io_sum_shared_before_reads;
+-- Drop the table so we can drop the tablespace later.
+DROP TABLE test_io_shared;
+
+-- Test that the following IOCONTEXT_LOCAL IOOps are tracked in pg_stat_io:
+-- - eviction of local buffers in order to reuse them
+-- - reads of temporary table blocks into local buffers
+-- - writes of local buffers to permanent storage
+-- - extends of temporary tables
+
+-- Set temp_buffers to its minimum so that we can trigger writes with fewer
+-- inserted tuples. Do so in a new session in case temporary tables have been
+-- accessed by previous tests in this session.
+\c
+SET temp_buffers TO 100;
+CREATE TEMPORARY TABLE test_io_local(a int, b TEXT);
+SELECT sum(extends) AS extends, sum(evictions) AS evictions, sum(writes) AS writes
+ FROM pg_stat_io
+ WHERE io_context = 'normal' AND io_object = 'temp relation' \gset io_sum_local_before_
+-- Insert tuples into the temporary table, generating extends in the stats.
+-- Insert enough values that we need to reuse and write out dirty local
+-- buffers, generating evictions and writes.
+INSERT INTO test_io_local SELECT generate_series(1, 5000) as id, repeat('a', 200);
+-- Ensure the table is large enough to exceed our temp_buffers setting.
+SELECT pg_relation_size('test_io_local') / current_setting('block_size')::int8 > 100;
+
+SELECT sum(reads) AS io_sum_local_before_reads
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+-- Read in evicted buffers, generating reads.
+SELECT COUNT(*) FROM test_io_local;
+SELECT pg_stat_force_next_flush();
+SELECT sum(evictions) AS evictions,
+ sum(reads) AS reads,
+ sum(writes) AS writes,
+ sum(extends) AS extends
+ FROM pg_stat_io
+ WHERE io_context = 'normal' AND io_object = 'temp relation' \gset io_sum_local_after_
+SELECT :io_sum_local_after_evictions > :io_sum_local_before_evictions,
+ :io_sum_local_after_reads > :io_sum_local_before_reads,
+ :io_sum_local_after_writes > :io_sum_local_before_writes,
+ :io_sum_local_after_extends > :io_sum_local_before_extends;
+
+-- Change the tablespaces so that the temporary table is rewritten to other
+-- local buffers, exercising a different codepath than standard local buffer
+-- writes.
+ALTER TABLE test_io_local SET TABLESPACE regress_io_stats_tblspc;
+SELECT pg_stat_force_next_flush();
+SELECT sum(writes) AS io_sum_local_new_tblspc_writes
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT :io_sum_local_new_tblspc_writes > :io_sum_local_after_writes;
+-- Drop the table so we can drop the tablespace later.
+DROP TABLE test_io_local;
+RESET temp_buffers;
+DROP TABLESPACE regress_io_stats_tblspc;
+
+-- Test that reuse of strategy buffers and reads of blocks into these reused
+-- buffers while VACUUMing are tracked in pg_stat_io.
+
+-- Set wal_skip_threshold smaller than the expected size of
+-- test_io_vac_strategy so that, even if wal_level is minimal, VACUUM FULL will
+-- fsync the newly rewritten test_io_vac_strategy instead of writing it to WAL.
+-- Writing it to WAL will result in the newly written relation pages being in
+-- shared buffers -- preventing us from testing BAS_VACUUM BufferAccessStrategy
+-- reads.
+SET wal_skip_threshold = '1 kB';
+SELECT sum(reuses) AS reuses, sum(reads) AS reads
+ FROM pg_stat_io WHERE io_context = 'vacuum' \gset io_sum_vac_strategy_before_
+CREATE TABLE test_io_vac_strategy(a int, b int) WITH (autovacuum_enabled = 'false');
+INSERT INTO test_io_vac_strategy SELECT i, i from generate_series(1, 8000)i;
+-- Ensure that the next VACUUM will need to perform IO by rewriting the table
+-- first with VACUUM (FULL).
+VACUUM (FULL) test_io_vac_strategy;
+VACUUM (PARALLEL 0) test_io_vac_strategy;
+SELECT pg_stat_force_next_flush();
+SELECT sum(reuses) AS reuses, sum(reads) AS reads
+ FROM pg_stat_io WHERE io_context = 'vacuum' \gset io_sum_vac_strategy_after_
+SELECT :io_sum_vac_strategy_after_reads > :io_sum_vac_strategy_before_reads,
+ :io_sum_vac_strategy_after_reuses > :io_sum_vac_strategy_before_reuses;
+RESET wal_skip_threshold;
+
+-- Test that extends done by a CTAS, which uses a BAS_BULKWRITE
+-- BufferAccessStrategy, are tracked in pg_stat_io.
+SELECT sum(extends) AS io_sum_bulkwrite_strategy_extends_before
+ FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+CREATE TABLE test_io_bulkwrite_strategy AS SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+SELECT sum(extends) AS io_sum_bulkwrite_strategy_extends_after
+ FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+SELECT :io_sum_bulkwrite_strategy_extends_after > :io_sum_bulkwrite_strategy_extends_before;
+
+-- Test IO stats reset
+SELECT pg_stat_have_stats('io', 0, 0);
+SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) AS io_stats_pre_reset
+ FROM pg_stat_io \gset
+SELECT pg_stat_reset_shared('io');
+SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) AS io_stats_post_reset
+ FROM pg_stat_io \gset
+SELECT :io_stats_post_reset < :io_stats_pre_reset;
+
-- End of Stats Test
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 1be6e07980..a399e0a5e4 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3377,6 +3377,7 @@ intset_internal_node
intset_leaf_node
intset_node
intvKEY
+io_stat_col
itemIdCompact
itemIdCompactData
iterator
--
2.34.1
On Tue, Jan 17, 2023 at 9:22 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> On Mon, Jan 16, 2023 at 4:42 PM Maciek Sakrejda <m.sakrejda@gmail.com> wrote:
> > I missed a couple of versions, but I think the docs are clearer now.
> > I'm torn on losing some of the detail, but overall I do think it's a
> > good trade-off. Moving some details out to after the table does keep
> > the bulk of the view documentation more readable, and the "inform
> > database tuning" part is great. I really like the idea of a separate
> > Interpreting Statistics section, but for now this works.
> >
> > > +        <literal>vacuum</literal>: I/O operations performed outside of shared
> > > +        buffers while vacuuming and analyzing permanent relations.
> >
> > Why only permanent relations? Are temporary relations treated
> > differently? I imagine if someone has a temp-table-heavy workload that
> > requires regularly vacuuming and analyzing those relations, this point
> > may be confusing without some additional explanation.
>
> Ah, yes. This is a bit confusing. We don't use buffer access strategies
> when operating on temp relations, so vacuuming them is counted in IO
> Context normal. I've added this information to the docs but now that
> definition is a bit long. Perhaps it should be a note? That seems like
> it would draw too much attention to this detail, though...

Thanks for clarifying. I think the updated definition still works:
it's still shorter than the `normal` context definition.
On Wed, 18 Jan 2023 at 03:30, Melanie Plageman
<melanieplageman@gmail.com> wrote:
v49 attached
On Tue, Jan 17, 2023 at 2:12 PM Andres Freund <andres@anarazel.de> wrote:
On 2023-01-17 12:22:14 -0500, Melanie Plageman wrote:
+typedef struct PgStat_BackendIO
+{
+	PgStat_Counter data[IOCONTEXT_NUM_TYPES][IOOBJECT_NUM_TYPES][IOOP_NUM_TYPES];
+} PgStat_BackendIO;

Would it bother you if we swapped the order of iocontext and iobject here and
related places? It makes more sense to me semantically, and should now be
pretty easy, code wise.

So, thinking about this I started noticing inconsistencies in other
areas around this order:
For example: ordering of objects mentioned in commit messages and comments,
ordering of parameters (like in pgstat_count_io_op() [currently in
reverse order]).

I think we should make a final decision about this ordering and then
make everything consistent (including ordering in the view).

Currently the order is:
BackendType
IOContext
IOObject
IOOp

You are suggesting this order:
BackendType
IOObject
IOContext
IOOpCould you explain what you find more natural about this ordering (as I
find the other more natural)?The object we're performing IO on determines more things than the context. So
it just seems like the natural hierarchical fit. The context is a sub-category
of the object. Consider how it'll look like if we also have objects for 'wal',
'temp files'. It'll make sense to group by just the object, but it won't make
sense to group by just the context.

If it were trivial to do I'd use a different IOContext for each IOObject. But
it'd make it much harder. So there'll just be a bunch of values of IOContext
that'll only be used for one or a subset of the IOObjects.

The reason to put BackendType at the top is pragmatic - one backend is of a
single type, but can do IO for all kinds of objects/contexts. So any other
hierarchy would make the locking etc. much harder.

This is one possible natural sentence with these objects:
During COPY, a client backend may read in data from a permanent
relation.
This order is:
IOContext
BackendType
IOOp
IOObject

I think English sentences are often structured subject, verb, object --
but in our case, we have an extra thing that doesn't fit neatly
(IOContext).

"..., to avoid polluting the buffer cache it uses the bulk (read|write)
strategy".

Also, IOOp in a sentence would be in the middle (as the
verb). I made it last because a) it feels like the smallest unit b) it
would make the code a lot more annoying if it wasn't last.

Yea, I think pragmatically that is the right choice.
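For concreteness, the agreed hierarchy (BackendType at the top, then object-major counters, IOOp last) can be sketched as a standalone toy. The names and sizes here are simplified placeholders, not the patch's actual declarations:

```c
#include <assert.h>
#include <string.h>

/*
 * Sketch of the agreed layout: one fixed-size counter block per backend
 * type, indexed object-major (object, then context, then op). Since a
 * backend is of exactly one type but can touch all objects/contexts, a
 * single per-backend-type lock covers its whole counter block on flush.
 */
#define NOBJECTS  2				/* relation, temp relation */
#define NCONTEXTS 4				/* normal, bulkread, bulkwrite, vacuum */
#define NOPS      6				/* evict, extend, fsync, read, reuse, write */

typedef struct BktypeIO
{
	unsigned long data[NOBJECTS][NCONTEXTS][NOPS];
} BktypeIO;

static BktypeIO pending;		/* backend-local pending counters */

/* Count one IO operation in backend-local memory (no locking needed). */
static void
count_io_op(int object, int context, int op)
{
	pending.data[object][context][op]++;
}

/* Accumulate pending counters into the shared block, then reset them. */
static void
flush_io(BktypeIO *shared)
{
	/* the real code would take the per-backend-type LWLock here */
	for (int o = 0; o < NOBJECTS; o++)
		for (int c = 0; c < NCONTEXTS; c++)
			for (int op = 0; op < NOPS; op++)
				shared->data[o][c][op] += pending.data[o][c][op];
	memset(&pending, 0, sizeof(pending));
}
```

With IOOp innermost, counting stays a single array increment, and grouping by object alone (the outermost data dimension) remains meaningful even if contexts only apply to some objects.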
I have changed the order and updated all the places using
PgStat_BktypeIO as well as in all locations in which it should be
ordered for consistency (that I could find in the pass I did) -- e.g.
the view definition, function signatures, comments, commit messages,
etc.

+-- Change the tablespace so that the table is rewritten directly, then SELECT
+-- from it to cause it to be read back into shared buffers.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE regress_io_stats_tblspc LOCATION '';

Perhaps worth doing this in tablespace.sql, to avoid the additional
checkpoints done as part of CREATE/DROP TABLESPACE?

Or, at least combine this with the CHECKPOINTs above?
I see a checkpoint is requested when dropping the tablespace if not all
the files in it are deleted. It seems like if the DROP TABLE for the
permanent table is before the explicit checkpoints in the test, then the
DROP TABLESPACE will not cause an additional checkpoint.

Unfortunately, that's not how it works :(. See the comment above mdunlink():
* For regular relations, we don't unlink the first segment file of the rel,
* but just truncate it to zero length, and record a request to unlink it after
* the next checkpoint. Additional segments can be unlinked immediately,
* however. Leaving the empty file in place prevents that relfilenumber
* from being reused. The scenario this protects us from is:
...

Is this what you are suggesting? Dropping the temporary table should not
have an effect on this.

I was wondering about simply moving that portion of the test to
tablespace.sql, where we already created a tablespace.

An alternative would be to propose splitting tablespace.sql into one portion
running at the start of parallel_schedule, and one at the end. Historically,
we needed tablespace.sql to be optional due to causing problems when
replicating to another instance on the same machine, but now we have
allow_in_place_tablespaces.

It seems like the best way would be to split up the tablespace test file
as you suggested and drop the tablespace at the end of the regression
test suite. There could be other tests that could use a tablespace.
Though what I wrote is kind of tablespace test coverage, if this
rewriting behavior no longer happened when doing alter table set
tablespace, we would want to come up with a new test which exercised
that code to count those IO stats, not simply delete it from the
tablespace tests.

SELECT pg_relation_size('test_io_local') / current_setting('block_size')::int8 > 100;

Better toast compression or such could easily make test_io_local smaller than
it is today. Seeing that it's too small would make it easier to understand the
failure.

Good idea. So, I used pg_table_size() because it seems like
pg_relation_size() does not include the toast relations. However, I'm
not sure this is a good idea, because pg_table_size() includes FSM and
visibility map. Should I write a query to get the toast relation name
and add pg_relation_size() of that relation and the main relation?

I think it's the right thing to just include the relation size. Your queries
IIRC won't use the toast table or other forks. So I'd leave it at just
pg_relation_size().

I did notice that this test wasn't using the toast table for the
toastable column -- but you mentioned better toast compression affecting
the future test stability, so I'm confused.
The patch does not apply on top of HEAD as in [1], please post a rebased patch:
=== Applying patches on top of PostgreSQL commit ID
4f74f5641d53559ec44e74d5bf552e167fdd5d20 ===
=== applying patch
./v49-0003-Add-system-view-tracking-IO-ops-per-backend-type.patch
....
patching file src/test/regress/expected/rules.out
Hunk #1 FAILED at 1876.
1 out of 1 hunk FAILED -- saving rejects to file
src/test/regress/expected/rules.out.rej
[1]: http://cfbot.cputube.org/patch_41_3272.log
Regards,
Vignesh
On Thu, Jan 19, 2023 at 6:18 AM vignesh C <vignesh21@gmail.com> wrote:
The patch does not apply on top of HEAD as in [1], please post a rebased patch:
=== Applying patches on top of PostgreSQL commit ID
4f74f5641d53559ec44e74d5bf552e167fdd5d20 ===
=== applying patch
./v49-0003-Add-system-view-tracking-IO-ops-per-backend-type.patch
....
patching file src/test/regress/expected/rules.out
Hunk #1 FAILED at 1876.
1 out of 1 hunk FAILED -- saving rejects to file
src/test/regress/expected/rules.out.rej
Yes, it conflicted with 47bb9db75996232. rebased v50 is attached.
On Tue, Jan 17, 2023 at 5:00 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
+-- Change the tablespace so that the table is rewritten directly, then SELECT
+-- from it to cause it to be read back into shared buffers.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE regress_io_stats_tblspc LOCATION '';

Perhaps worth doing this in tablespace.sql, to avoid the additional
checkpoints done as part of CREATE/DROP TABLESPACE?

Or, at least combine this with the CHECKPOINTs above?
I see a checkpoint is requested when dropping the tablespace if not all
the files in it are deleted. It seems like if the DROP TABLE for the
permanent table is before the explicit checkpoints in the test, then the
DROP TABLESPACE will not cause an additional checkpoint.Unfortunately, that's not how it works :(. See the comment above mdunlink():
* For regular relations, we don't unlink the first segment file of the rel,
* but just truncate it to zero length, and record a request to unlink it after
* the next checkpoint. Additional segments can be unlinked immediately,
* however. Leaving the empty file in place prevents that relfilenumber
* from being reused. The scenario this protects us from is:
...

Is this what you are suggesting? Dropping the temporary table should not
have an effect on this.

I was wondering about simply moving that portion of the test to
tablespace.sql, where we already created a tablespace.

An alternative would be to propose splitting tablespace.sql into one portion
running at the start of parallel_schedule, and one at the end. Historically,
we needed tablespace.sql to be optional due to causing problems when
replicating to another instance on the same machine, but now we have
allow_in_place_tablespaces.

It seems like the best way would be to split up the tablespace test file
as you suggested and drop the tablespace at the end of the regression
test suite. There could be other tests that could use a tablespace.
Though what I wrote is kind of tablespace test coverage, if this
rewriting behavior no longer happened when doing alter table set
tablespace, we would want to come up with a new test which exercised
that code to count those IO stats, not simply delete it from the
tablespace tests.
I have added a patch to the set which creates the regress_tblspace
(formerly created in tablespace.sql) in test_setup.sql. I then moved the
tablespace test to the end of the parallel schedule so that my test (and
others) could use the regress_tblspace.
I modified some of the tablespace.sql tests to be more specific in terms
of the objects they are looking for so that tests using the tablespace
are not forced to drop all of the objects they make in the tablespace.
Note that I did not proactively change all tests in tablespace.sql that
may fail in this way -- only those that failed because of the tables I
created (and did not drop) from regress_tblspace.
- Melanie
Attachments:
v50-0001-Create-regress_tblspc-in-test_setup.patch
From 3976128fab1467fde8ee7e1bc1f54b023e96e35d Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Thu, 19 Jan 2023 15:45:54 -0500
Subject: [PATCH v50 1/5] Create regress_tblspc in test_setup
Other tests may want to use a tablespace. Now that we have
allow_in_place_tablespaces, move the tablespace test to the end of the
parallel schedule and create the main tablespace it uses in test setup.
Modify the tablespace tests a bit to check for specific relations in the
test and not simply for the absence or presence of any objects in the
tablespace in case other tests leave objects around in the tablespace.
---
src/test/regress/expected/tablespace.out | 63 +++++++++++++-----------
src/test/regress/expected/test_setup.out | 3 ++
src/test/regress/parallel_schedule | 9 ++--
src/test/regress/sql/tablespace.sql | 40 ++++++++++-----
src/test/regress/sql/test_setup.sql | 4 ++
5 files changed, 73 insertions(+), 46 deletions(-)
diff --git a/src/test/regress/expected/tablespace.out b/src/test/regress/expected/tablespace.out
index c52cf1cfcf..a2aa95bd97 100644
--- a/src/test/regress/expected/tablespace.out
+++ b/src/test/regress/expected/tablespace.out
@@ -22,8 +22,6 @@ SELECT spcoptions FROM pg_tablespace WHERE spcname = 'regress_tblspacewith';
-- drop the tablespace so we can re-use the location
DROP TABLESPACE regress_tblspacewith;
--- create a tablespace we can use
-CREATE TABLESPACE regress_tblspace LOCATION '';
-- This returns a relative path as of an effect of allow_in_place_tablespaces,
-- masking the tablespace OID used in the path name.
SELECT regexp_replace(pg_tablespace_location(oid), '(pg_tblspc)/(\d+)', '\1/NNN')
@@ -83,11 +81,14 @@ REINDEX (TABLESPACE regress_tblspace) INDEX regress_tblspace_test_tbl_idx;
REINDEX (TABLESPACE regress_tblspace) TABLE regress_tblspace_test_tbl;
ROLLBACK;
-- no relation moved to the new tablespace
-SELECT c.relname FROM pg_class c, pg_tablespace s
+SELECT c.relname <> 'regress_tblspace_test_tbl_idx',
+ c.relname <> 'regress_tblspace_test_tbl'
+ FROM pg_class c, pg_tablespace s
WHERE c.reltablespace = s.oid AND s.spcname = 'regress_tblspace';
- relname
----------
-(0 rows)
+ ?column? | ?column?
+----------+----------
+ t | t
+(1 row)
-- check that all indexes are moved to a new tablespace with different
-- relfilenode.
@@ -102,40 +103,46 @@ SELECT relfilenode as toast_filenode FROM pg_class
WHERE i.indrelid = c.reltoastrelid AND
c.relname = 'regress_tblspace_test_tbl') \gset
REINDEX (TABLESPACE regress_tblspace) TABLE regress_tblspace_test_tbl;
-SELECT c.relname FROM pg_class c, pg_tablespace s
- WHERE c.reltablespace = s.oid AND s.spcname = 'regress_tblspace'
- ORDER BY c.relname;
- relname
--------------------------------
- regress_tblspace_test_tbl_idx
+SELECT 'regress_tblspace_test_tbl_idx' IN
+ (SELECT c.relname
+ FROM pg_class c, pg_tablespace s
+ WHERE c.reltablespace = s.oid AND s.spcname = 'regress_tblspace');
+ ?column?
+----------
+ t
(1 row)
ALTER TABLE regress_tblspace_test_tbl SET TABLESPACE regress_tblspace;
ALTER TABLE regress_tblspace_test_tbl SET TABLESPACE pg_default;
-SELECT c.relname FROM pg_class c, pg_tablespace s
- WHERE c.reltablespace = s.oid AND s.spcname = 'regress_tblspace'
- ORDER BY c.relname;
- relname
--------------------------------
- regress_tblspace_test_tbl_idx
+SELECT 'regress_tblspace_test_tbl_idx' IN
+ (SELECT c.relname
+ FROM pg_class c, pg_tablespace s
+ WHERE c.reltablespace = s.oid AND s.spcname = 'regress_tblspace');
+ ?column?
+----------
+ t
(1 row)
-- Move back to the default tablespace.
ALTER INDEX regress_tblspace_test_tbl_idx SET TABLESPACE pg_default;
-SELECT c.relname FROM pg_class c, pg_tablespace s
+SELECT c.relname <> 'regress_tblspace_test_tbl_idx',
+ c.relname <> 'regress_tblspace_test_tbl'
+ FROM pg_class c, pg_tablespace s
WHERE c.reltablespace = s.oid AND s.spcname = 'regress_tblspace'
ORDER BY c.relname;
- relname
----------
-(0 rows)
+ ?column? | ?column?
+----------+----------
+ t | t
+(1 row)
REINDEX (TABLESPACE regress_tblspace, CONCURRENTLY) TABLE regress_tblspace_test_tbl;
-SELECT c.relname FROM pg_class c, pg_tablespace s
- WHERE c.reltablespace = s.oid AND s.spcname = 'regress_tblspace'
- ORDER BY c.relname;
- relname
--------------------------------
- regress_tblspace_test_tbl_idx
+SELECT 'regress_tblspace_test_tbl_idx' IN
+ (SELECT c.relname
+ FROM pg_class c, pg_tablespace s
+ WHERE c.reltablespace = s.oid AND s.spcname = 'regress_tblspace');
+ ?column?
+----------
+ t
(1 row)
SELECT relfilenode = :main_filenode AS main_same FROM pg_class
diff --git a/src/test/regress/expected/test_setup.out b/src/test/regress/expected/test_setup.out
index 391b36d131..4f54fe20ec 100644
--- a/src/test/regress/expected/test_setup.out
+++ b/src/test/regress/expected/test_setup.out
@@ -18,6 +18,9 @@ SET synchronous_commit = on;
-- and most of the core regression tests still expect that.
--
GRANT ALL ON SCHEMA public TO public;
+-- Create a tablespace we can use in tests.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE regress_tblspace LOCATION '';
--
-- These tables have traditionally been referenced by many tests,
-- so create and populate them. Insert only non-error values here.
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index a930dfe48c..15e015b3d6 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -11,11 +11,6 @@
# required setup steps
test: test_setup
-# run tablespace by itself, and early, because it forces a checkpoint;
-# we'd prefer not to have checkpoints later in the tests because that
-# interferes with crash-recovery testing.
-test: tablespace
-
# ----------
# The first group of parallel tests
# ----------
@@ -132,3 +127,7 @@ test: event_trigger oidjoins
# this test also uses event triggers, so likewise run it by itself
test: fast_default
+
+# run tablespace test at the end because it drops the tablespace created during
+# setup that other tests may use.
+test: tablespace
diff --git a/src/test/regress/sql/tablespace.sql b/src/test/regress/sql/tablespace.sql
index 21db433f2a..1e03d679b2 100644
--- a/src/test/regress/sql/tablespace.sql
+++ b/src/test/regress/sql/tablespace.sql
@@ -20,8 +20,6 @@ SELECT spcoptions FROM pg_tablespace WHERE spcname = 'regress_tblspacewith';
-- drop the tablespace so we can re-use the location
DROP TABLESPACE regress_tblspacewith;
--- create a tablespace we can use
-CREATE TABLESPACE regress_tblspace LOCATION '';
-- This returns a relative path as of an effect of allow_in_place_tablespaces,
-- masking the tablespace OID used in the path name.
SELECT regexp_replace(pg_tablespace_location(oid), '(pg_tblspc)/(\d+)', '\1/NNN')
@@ -66,7 +64,9 @@ REINDEX (TABLESPACE regress_tblspace) INDEX regress_tblspace_test_tbl_idx;
REINDEX (TABLESPACE regress_tblspace) TABLE regress_tblspace_test_tbl;
ROLLBACK;
-- no relation moved to the new tablespace
-SELECT c.relname FROM pg_class c, pg_tablespace s
+SELECT c.relname <> 'regress_tblspace_test_tbl_idx',
+ c.relname <> 'regress_tblspace_test_tbl'
+ FROM pg_class c, pg_tablespace s
WHERE c.reltablespace = s.oid AND s.spcname = 'regress_tblspace';
-- check that all indexes are moved to a new tablespace with different
@@ -74,6 +74,7 @@ SELECT c.relname FROM pg_class c, pg_tablespace s
-- Save first the existing relfilenode for the toast and main relations.
SELECT relfilenode as main_filenode FROM pg_class
WHERE relname = 'regress_tblspace_test_tbl_idx' \gset
+
SELECT relfilenode as toast_filenode FROM pg_class
WHERE oid =
(SELECT i.indexrelid
@@ -81,24 +82,37 @@ SELECT relfilenode as toast_filenode FROM pg_class
pg_index i
WHERE i.indrelid = c.reltoastrelid AND
c.relname = 'regress_tblspace_test_tbl') \gset
+
REINDEX (TABLESPACE regress_tblspace) TABLE regress_tblspace_test_tbl;
-SELECT c.relname FROM pg_class c, pg_tablespace s
- WHERE c.reltablespace = s.oid AND s.spcname = 'regress_tblspace'
- ORDER BY c.relname;
+
+SELECT 'regress_tblspace_test_tbl_idx' IN
+ (SELECT c.relname
+ FROM pg_class c, pg_tablespace s
+ WHERE c.reltablespace = s.oid AND s.spcname = 'regress_tblspace');
+
ALTER TABLE regress_tblspace_test_tbl SET TABLESPACE regress_tblspace;
ALTER TABLE regress_tblspace_test_tbl SET TABLESPACE pg_default;
-SELECT c.relname FROM pg_class c, pg_tablespace s
- WHERE c.reltablespace = s.oid AND s.spcname = 'regress_tblspace'
- ORDER BY c.relname;
+
+SELECT 'regress_tblspace_test_tbl_idx' IN
+ (SELECT c.relname
+ FROM pg_class c, pg_tablespace s
+ WHERE c.reltablespace = s.oid AND s.spcname = 'regress_tblspace');
+
-- Move back to the default tablespace.
ALTER INDEX regress_tblspace_test_tbl_idx SET TABLESPACE pg_default;
-SELECT c.relname FROM pg_class c, pg_tablespace s
+SELECT c.relname <> 'regress_tblspace_test_tbl_idx',
+ c.relname <> 'regress_tblspace_test_tbl'
+ FROM pg_class c, pg_tablespace s
WHERE c.reltablespace = s.oid AND s.spcname = 'regress_tblspace'
ORDER BY c.relname;
+
REINDEX (TABLESPACE regress_tblspace, CONCURRENTLY) TABLE regress_tblspace_test_tbl;
-SELECT c.relname FROM pg_class c, pg_tablespace s
- WHERE c.reltablespace = s.oid AND s.spcname = 'regress_tblspace'
- ORDER BY c.relname;
+
+SELECT 'regress_tblspace_test_tbl_idx' IN
+ (SELECT c.relname
+ FROM pg_class c, pg_tablespace s
+ WHERE c.reltablespace = s.oid AND s.spcname = 'regress_tblspace');
+
SELECT relfilenode = :main_filenode AS main_same FROM pg_class
WHERE relname = 'regress_tblspace_test_tbl_idx';
SELECT relfilenode = :toast_filenode as toast_same FROM pg_class
diff --git a/src/test/regress/sql/test_setup.sql b/src/test/regress/sql/test_setup.sql
index 02c0c84c3a..8439b38d21 100644
--- a/src/test/regress/sql/test_setup.sql
+++ b/src/test/regress/sql/test_setup.sql
@@ -23,6 +23,10 @@ SET synchronous_commit = on;
--
GRANT ALL ON SCHEMA public TO public;
+-- Create a tablespace we can use in tests.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE regress_tblspace LOCATION '';
+
--
-- These tables have traditionally been referenced by many tests,
-- so create and populate them. Insert only non-error values here.
--
2.34.1
v50-0002-pgstat-Infrastructure-to-track-IO-operations.patch
From e01f99ceb59fbc85654cf552e4bb527f8076c07a Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 17 Jan 2023 16:10:34 -0500
Subject: [PATCH v50 2/5] pgstat: Infrastructure to track IO operations
Introduce "IOOp", an IO operation done by a backend, "IOObject", the
target object of the IO, and "IOContext", the context or location of the
IO operations on that object. For example, the checkpointer may write a
shared buffer out. This would be considered an IOOP_WRITE IOOp on an
IOOBJECT_RELATION IOObject in the IOCONTEXT_NORMAL IOContext by
BackendType B_CHECKPOINTER.
Each BackendType counts IOOps (evict, extend, fsync, read, reuse, and
write) per IOObject (relation, temp relation) per IOContext (normal,
bulkread, bulkwrite, or vacuum) through a call to pgstat_count_io_op().
Note that this commit introduces the infrastructure to count IO
Operation statistics. A subsequent commit will add calls to
pgstat_count_io_op() in the appropriate locations.
IOObject IOOBJECT_TEMP_RELATION concerns IO Operations on buffers
containing temporary table data, while IOObject IOOBJECT_RELATION
concerns IO Operations on buffers containing permanent relation data.
IOContext IOCONTEXT_NORMAL concerns operations on local and shared
buffers, while IOCONTEXT_BULKREAD, IOCONTEXT_BULKWRITE, and
IOCONTEXT_VACUUM IOContexts concern IO operations on buffers as part of
a BufferAccessStrategy.
Stats on IOOps on all IOObjects in all IOContexts for a given backend
are first counted in a backend's local memory and then flushed to shared
memory and accumulated with those from all other backends, exited and
live.
Some BackendTypes will not flush their pending statistics at regular
intervals and explicitly call pgstat_flush_io_ops() during the course of
normal operations to flush their backend-local IO operation statistics
to shared memory in a timely manner.
Because not all BackendType, IOObject, IOContext, IOOp combinations are
valid, the validity of the stats is checked before flushing pending
stats and before reading in the existing stats file to shared memory.
The aggregated stats in shared memory could be extended in the future
with per-backend stats -- useful for per connection IO statistics and
monitoring.
PGSTAT_FILE_FORMAT_ID should be bumped with this commit.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
doc/src/sgml/monitoring.sgml | 2 +
src/backend/utils/activity/Makefile | 1 +
src/backend/utils/activity/meson.build | 1 +
src/backend/utils/activity/pgstat.c | 26 ++
src/backend/utils/activity/pgstat_bgwriter.c | 7 +-
.../utils/activity/pgstat_checkpointer.c | 7 +-
src/backend/utils/activity/pgstat_io.c | 386 ++++++++++++++++++
src/backend/utils/activity/pgstat_relation.c | 15 +-
src/backend/utils/activity/pgstat_shmem.c | 4 +
src/backend/utils/activity/pgstat_wal.c | 4 +-
src/backend/utils/adt/pgstatfuncs.c | 11 +-
src/include/miscadmin.h | 2 +
src/include/pgstat.h | 68 +++
src/include/utils/pgstat_internal.h | 30 ++
src/tools/pgindent/typedefs.list | 6 +
15 files changed, 563 insertions(+), 7 deletions(-)
create mode 100644 src/backend/utils/activity/pgstat_io.c
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index e3a783abd0..c47d057a1d 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -5434,6 +5434,8 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
the <structname>pg_stat_bgwriter</structname>
view, <literal>archiver</literal> to reset all the counters shown in
the <structname>pg_stat_archiver</structname> view,
+ <literal>io</literal> to reset all the counters shown in the
+ <structname>pg_stat_io</structname> view,
<literal>wal</literal> to reset all the counters shown in the
<structname>pg_stat_wal</structname> view or
<literal>recovery_prefetch</literal> to reset all the counters shown
diff --git a/src/backend/utils/activity/Makefile b/src/backend/utils/activity/Makefile
index a80eda3cf4..7d7482dde0 100644
--- a/src/backend/utils/activity/Makefile
+++ b/src/backend/utils/activity/Makefile
@@ -22,6 +22,7 @@ OBJS = \
pgstat_checkpointer.o \
pgstat_database.o \
pgstat_function.o \
+ pgstat_io.o \
pgstat_relation.o \
pgstat_replslot.o \
pgstat_shmem.o \
diff --git a/src/backend/utils/activity/meson.build b/src/backend/utils/activity/meson.build
index a2b872c24b..518ee3f798 100644
--- a/src/backend/utils/activity/meson.build
+++ b/src/backend/utils/activity/meson.build
@@ -9,6 +9,7 @@ backend_sources += files(
'pgstat_checkpointer.c',
'pgstat_database.c',
'pgstat_function.c',
+ 'pgstat_io.c',
'pgstat_relation.c',
'pgstat_replslot.c',
'pgstat_shmem.c',
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 0fa5370bcd..60fc4e761f 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -72,6 +72,7 @@
* - pgstat_checkpointer.c
* - pgstat_database.c
* - pgstat_function.c
+ * - pgstat_io.c
* - pgstat_relation.c
* - pgstat_replslot.c
* - pgstat_slru.c
@@ -359,6 +360,15 @@ static const PgStat_KindInfo pgstat_kind_infos[PGSTAT_NUM_KINDS] = {
.snapshot_cb = pgstat_checkpointer_snapshot_cb,
},
+ [PGSTAT_KIND_IO] = {
+ .name = "io",
+
+ .fixed_amount = true,
+
+ .reset_all_cb = pgstat_io_reset_all_cb,
+ .snapshot_cb = pgstat_io_snapshot_cb,
+ },
+
[PGSTAT_KIND_SLRU] = {
.name = "slru",
@@ -582,6 +592,7 @@ pgstat_report_stat(bool force)
/* Don't expend a clock check if nothing to do */
if (dlist_is_empty(&pgStatPending) &&
+ !have_iostats &&
!have_slrustats &&
!pgstat_have_pending_wal())
{
@@ -628,6 +639,9 @@ pgstat_report_stat(bool force)
/* flush database / relation / function / ... stats */
partial_flush |= pgstat_flush_pending_entries(nowait);
+ /* flush IO stats */
+ partial_flush |= pgstat_flush_io(nowait);
+
/* flush wal stats */
partial_flush |= pgstat_flush_wal(nowait);
@@ -1322,6 +1336,12 @@ pgstat_write_statsfile(void)
pgstat_build_snapshot_fixed(PGSTAT_KIND_CHECKPOINTER);
write_chunk_s(fpout, &pgStatLocal.snapshot.checkpointer);
+ /*
+ * Write IO stats struct
+ */
+ pgstat_build_snapshot_fixed(PGSTAT_KIND_IO);
+ write_chunk_s(fpout, &pgStatLocal.snapshot.io);
+
/*
* Write SLRU stats struct
*/
@@ -1496,6 +1516,12 @@ pgstat_read_statsfile(void)
if (!read_chunk_s(fpin, &shmem->checkpointer.stats))
goto error;
+ /*
+ * Read IO stats struct
+ */
+ if (!read_chunk_s(fpin, &shmem->io.stats))
+ goto error;
+
/*
* Read SLRU stats struct
*/
diff --git a/src/backend/utils/activity/pgstat_bgwriter.c b/src/backend/utils/activity/pgstat_bgwriter.c
index 9247f2dda2..92be384b0d 100644
--- a/src/backend/utils/activity/pgstat_bgwriter.c
+++ b/src/backend/utils/activity/pgstat_bgwriter.c
@@ -24,7 +24,7 @@ PgStat_BgWriterStats PendingBgWriterStats = {0};
/*
- * Report bgwriter statistics
+ * Report bgwriter and IO statistics
*/
void
pgstat_report_bgwriter(void)
@@ -56,6 +56,11 @@ pgstat_report_bgwriter(void)
* Clear out the statistics buffer, so it can be re-used.
*/
MemSet(&PendingBgWriterStats, 0, sizeof(PendingBgWriterStats));
+
+ /*
+ * Report IO statistics
+ */
+ pgstat_flush_io(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_checkpointer.c b/src/backend/utils/activity/pgstat_checkpointer.c
index 3e9ab45103..26dec112f6 100644
--- a/src/backend/utils/activity/pgstat_checkpointer.c
+++ b/src/backend/utils/activity/pgstat_checkpointer.c
@@ -24,7 +24,7 @@ PgStat_CheckpointerStats PendingCheckpointerStats = {0};
/*
- * Report checkpointer statistics
+ * Report checkpointer and IO statistics
*/
void
pgstat_report_checkpointer(void)
@@ -62,6 +62,11 @@ pgstat_report_checkpointer(void)
* Clear out the statistics buffer, so it can be re-used.
*/
MemSet(&PendingCheckpointerStats, 0, sizeof(PendingCheckpointerStats));
+
+ /*
+ * Report IO statistics
+ */
+ pgstat_flush_io(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
new file mode 100644
index 0000000000..b606f23eb8
--- /dev/null
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -0,0 +1,386 @@
+/* -------------------------------------------------------------------------
+ *
+ * pgstat_io.c
+ * Implementation of IO statistics.
+ *
+ * This file contains the implementation of IO statistics. It is kept separate
+ * from pgstat.c to enforce the line between the statistics access / storage
+ * implementation and the details about individual types of statistics.
+ *
+ * Copyright (c) 2021-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/utils/activity/pgstat_io.c
+ * -------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "utils/pgstat_internal.h"
+
+
+static PgStat_BktypeIO PendingIOStats;
+bool have_iostats = false;
+
+/*
+ * Check that stats have not been counted for any combination of IOObject,
+ * IOContext, and IOOp which are not tracked for the passed-in BackendType. The
+ * passed-in PgStat_BktypeIO must contain stats from the BackendType specified
+ * by the second parameter. Caller is responsible for locking the passed-in
+ * PgStat_BktypeIO, if needed.
+ */
+bool
+pgstat_bktype_io_stats_valid(PgStat_BktypeIO *backend_io,
+ BackendType bktype)
+{
+ bool bktype_tracked = pgstat_tracks_io_bktype(bktype);
+
+ for (IOObject io_object = IOOBJECT_FIRST;
+ io_object < IOOBJECT_NUM_TYPES; io_object++)
+ {
+ for (IOContext io_context = IOCONTEXT_FIRST;
+ io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ {
+ /*
+ * Don't bother trying to skip to the next loop iteration if
+ * pgstat_tracks_io_object() would return false here. We still
+ * need to validate that each counter is zero anyway.
+ */
+ for (IOOp io_op = IOOP_FIRST; io_op < IOOP_NUM_TYPES; io_op++)
+ {
+ /* No stats, so nothing to validate */
+ if (backend_io->data[io_object][io_context][io_op] == 0)
+ continue;
+
+ /* There are stats and there shouldn't be */
+ if (!bktype_tracked ||
+ !pgstat_tracks_io_op(bktype, io_object, io_context, io_op))
+ return false;
+ }
+ }
+ }
+
+ return true;
+}
+
+void
+pgstat_count_io_op(IOObject io_object, IOContext io_context, IOOp io_op)
+{
+ Assert(io_object < IOOBJECT_NUM_TYPES);
+ Assert(io_context < IOCONTEXT_NUM_TYPES);
+ Assert(io_op < IOOP_NUM_TYPES);
+ Assert(pgstat_tracks_io_op(MyBackendType, io_object, io_context, io_op));
+
+ PendingIOStats.data[io_object][io_context][io_op]++;
+
+ have_iostats = true;
+}
+
+PgStat_IO *
+pgstat_fetch_stat_io(void)
+{
+ pgstat_snapshot_fixed(PGSTAT_KIND_IO);
+
+ return &pgStatLocal.snapshot.io;
+}
+
+/*
+ * Flush out locally pending IO statistics
+ *
+ * If no stats have been recorded, this function returns false.
+ *
+ * If nowait is true, this function returns true if the lock could not be
+ * acquired. Otherwise, return false.
+ */
+bool
+pgstat_flush_io(bool nowait)
+{
+ LWLock *bktype_lock;
+ PgStat_BktypeIO *bktype_shstats;
+
+ if (!have_iostats)
+ return false;
+
+ bktype_lock = &pgStatLocal.shmem->io.locks[MyBackendType];
+ bktype_shstats =
+ &pgStatLocal.shmem->io.stats.stats[MyBackendType];
+
+ if (!nowait)
+ LWLockAcquire(bktype_lock, LW_EXCLUSIVE);
+ else if (!LWLockConditionalAcquire(bktype_lock, LW_EXCLUSIVE))
+ return true;
+
+ for (IOObject io_object = IOOBJECT_FIRST;
+ io_object < IOOBJECT_NUM_TYPES; io_object++)
+ for (IOContext io_context = IOCONTEXT_FIRST;
+ io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ for (IOOp io_op = IOOP_FIRST;
+ io_op < IOOP_NUM_TYPES; io_op++)
+ bktype_shstats->data[io_object][io_context][io_op] +=
+ PendingIOStats.data[io_object][io_context][io_op];
+
+ Assert(pgstat_bktype_io_stats_valid(bktype_shstats, MyBackendType));
+
+ LWLockRelease(bktype_lock);
+
+ memset(&PendingIOStats, 0, sizeof(PendingIOStats));
+
+ have_iostats = false;
+
+ return false;
+}
+
+const char *
+pgstat_get_io_context_name(IOContext io_context)
+{
+ switch (io_context)
+ {
+ case IOCONTEXT_BULKREAD:
+ return "bulkread";
+ case IOCONTEXT_BULKWRITE:
+ return "bulkwrite";
+ case IOCONTEXT_NORMAL:
+ return "normal";
+ case IOCONTEXT_VACUUM:
+ return "vacuum";
+ }
+
+ elog(ERROR, "unrecognized IOContext value: %d", io_context);
+ pg_unreachable();
+}
+
+const char *
+pgstat_get_io_object_name(IOObject io_object)
+{
+ switch (io_object)
+ {
+ case IOOBJECT_RELATION:
+ return "relation";
+ case IOOBJECT_TEMP_RELATION:
+ return "temp relation";
+ }
+
+ elog(ERROR, "unrecognized IOObject value: %d", io_object);
+ pg_unreachable();
+}
+
+void
+pgstat_io_reset_all_cb(TimestampTz ts)
+{
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ LWLock *bktype_lock = &pgStatLocal.shmem->io.locks[i];
+ PgStat_BktypeIO *bktype_shstats = &pgStatLocal.shmem->io.stats.stats[i];
+
+ LWLockAcquire(bktype_lock, LW_EXCLUSIVE);
+
+ /*
+ * Use the lock in the first BackendType's PgStat_BktypeIO to protect
+ * the reset timestamp as well.
+ */
+ if (i == 0)
+ pgStatLocal.shmem->io.stats.stat_reset_timestamp = ts;
+
+ memset(bktype_shstats, 0, sizeof(*bktype_shstats));
+ LWLockRelease(bktype_lock);
+ }
+}
+
+void
+pgstat_io_snapshot_cb(void)
+{
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ LWLock *bktype_lock = &pgStatLocal.shmem->io.locks[i];
+ PgStat_BktypeIO *bktype_shstats = &pgStatLocal.shmem->io.stats.stats[i];
+ PgStat_BktypeIO *bktype_snap = &pgStatLocal.snapshot.io.stats[i];
+
+ LWLockAcquire(bktype_lock, LW_SHARED);
+
+ /*
+ * Use the lock in the first BackendType's PgStat_BktypeIO to protect
+ * the reset timestamp as well.
+ */
+ if (i == 0)
+ pgStatLocal.snapshot.io.stat_reset_timestamp =
+ pgStatLocal.shmem->io.stats.stat_reset_timestamp;
+
+ /* using struct assignment due to better type safety */
+ *bktype_snap = *bktype_shstats;
+ LWLockRelease(bktype_lock);
+ }
+}
+
+/*
+ * IO statistics are not collected for all BackendTypes.
+ *
+ * The following BackendTypes do not participate in the cumulative stats
+ * subsystem or do not perform the IO we currently track:
+ * - Syslogger because it is not connected to shared memory
+ * - Archiver because most relevant archiving IO is delegated to a
+ *   specialized command or module
+ * - WAL Receiver and WAL Writer IO is not tracked in pg_stat_io for now
+ *
+ * Function returns true if the given BackendType participates in the
+ * cumulative stats subsystem for IO and false if it does not.
+ *
+ * When adding a new BackendType, also consider adding relevant restrictions
+ * to pgstat_tracks_io_object() and pgstat_tracks_io_op().
+ */
+bool
+pgstat_tracks_io_bktype(BackendType bktype)
+{
+ /*
+ * List every type so that new backend types trigger a warning about
+ * needing to adjust this switch.
+ */
+ switch (bktype)
+ {
+ case B_INVALID:
+ case B_ARCHIVER:
+ case B_LOGGER:
+ case B_WAL_RECEIVER:
+ case B_WAL_WRITER:
+ return false;
+
+ case B_AUTOVAC_LAUNCHER:
+ case B_AUTOVAC_WORKER:
+ case B_BACKEND:
+ case B_BG_WORKER:
+ case B_BG_WRITER:
+ case B_CHECKPOINTER:
+ case B_STANDALONE_BACKEND:
+ case B_STARTUP:
+ case B_WAL_SENDER:
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * Some BackendTypes do not perform IO on certain IOObjects or in certain
+ * IOContexts. Some IOObjects are never operated on in some IOContexts. Check
+ * that the given BackendType is expected to do IO in the given IOContext and
+ * on the given IOObject and that the given IOObject is expected to be operated
+ * on in the given IOContext.
+ */
+bool
+pgstat_tracks_io_object(BackendType bktype, IOObject io_object,
+ IOContext io_context)
+{
+ bool no_temp_rel;
+
+ /*
+ * Some BackendTypes should never track IO statistics.
+ */
+ if (!pgstat_tracks_io_bktype(bktype))
+ return false;
+
+ /*
+ * Currently, IO on temporary relations can only occur in the
+ * IOCONTEXT_NORMAL IOContext.
+ */
+ if (io_context != IOCONTEXT_NORMAL &&
+ io_object == IOOBJECT_TEMP_RELATION)
+ return false;
+
+ /*
+ * In core Postgres, only regular backends and WAL Sender processes
+ * executing queries will use local buffers and operate on temporary
+ * relations. Parallel workers will not use local buffers (see
+ * InitLocalBuffers()); however, extensions leveraging background workers
+ * have no such limitation, so track IO on IOOBJECT_TEMP_RELATION for
+ * BackendType B_BG_WORKER.
+ */
+ no_temp_rel = bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER ||
+ bktype == B_CHECKPOINTER || bktype == B_AUTOVAC_WORKER ||
+ bktype == B_STANDALONE_BACKEND || bktype == B_STARTUP;
+
+ if (no_temp_rel && io_context == IOCONTEXT_NORMAL &&
+ io_object == IOOBJECT_TEMP_RELATION)
+ return false;
+
+ /*
+ * Some BackendTypes do not currently perform any IO in certain
+ * IOContexts, and, while it may not be inherently incorrect for them to
+ * do so, excluding those rows from the view makes the view easier to use.
+ */
+ if ((bktype == B_CHECKPOINTER || bktype == B_BG_WRITER) &&
+ (io_context == IOCONTEXT_BULKREAD ||
+ io_context == IOCONTEXT_BULKWRITE ||
+ io_context == IOCONTEXT_VACUUM))
+ return false;
+
+ if (bktype == B_AUTOVAC_LAUNCHER && io_context == IOCONTEXT_VACUUM)
+ return false;
+
+ if ((bktype == B_AUTOVAC_WORKER || bktype == B_AUTOVAC_LAUNCHER) &&
+ io_context == IOCONTEXT_BULKWRITE)
+ return false;
+
+ return true;
+}
+
+/*
+ * Some BackendTypes will never do certain IOOps and some IOOps should not
+ * occur in certain IOContexts or on certain IOObjects. Check that the given
+ * IOOp is valid for the given BackendType in the given IOContext and on the
+ * given IOObject. Note that there are currently no cases of an IOOp being
+ * invalid for a particular BackendType only within a certain IOContext and/or
+ * only on a certain IOObject.
+ */
+bool
+pgstat_tracks_io_op(BackendType bktype, IOObject io_object,
+ IOContext io_context, IOOp io_op)
+{
+ bool strategy_io_context;
+
+ /* if (bktype, io_object, io_context) will never collect stats, we're done */
+ if (!pgstat_tracks_io_object(bktype, io_object, io_context))
+ return false;
+
+ /*
+ * Some BackendTypes will not do certain IOOps.
+ */
+ if ((bktype == B_BG_WRITER || bktype == B_CHECKPOINTER) &&
+ (io_op == IOOP_READ || io_op == IOOP_EVICT))
+ return false;
+
+ if ((bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER ||
+ bktype == B_CHECKPOINTER) && io_op == IOOP_EXTEND)
+ return false;
+
+ /*
+ * Some IOOps are not valid in certain IOContexts and some IOOps are only
+ * valid in certain contexts.
+ */
+ if (io_context == IOCONTEXT_BULKREAD && io_op == IOOP_EXTEND)
+ return false;
+
+ strategy_io_context = io_context == IOCONTEXT_BULKREAD ||
+ io_context == IOCONTEXT_BULKWRITE || io_context == IOCONTEXT_VACUUM;
+
+ /*
+ * IOOP_REUSE is only relevant when a BufferAccessStrategy is in use.
+ */
+ if (!strategy_io_context && io_op == IOOP_REUSE)
+ return false;
+
+ /*
+ * IOOP_FSYNC IOOps done by a backend using a BufferAccessStrategy are
+ * counted in the IOCONTEXT_NORMAL IOContext. See comment in
+ * register_dirty_segment() for more details.
+ */
+ if (strategy_io_context && io_op == IOOP_FSYNC)
+ return false;
+
+ /*
+ * Temporary tables are not logged and thus do not require fsync'ing.
+ */
+ if (io_context == IOCONTEXT_NORMAL &&
+ io_object == IOOBJECT_TEMP_RELATION && io_op == IOOP_FSYNC)
+ return false;
+
+ return true;
+}
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index 2e20b93c20..f793ac1516 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -206,7 +206,7 @@ pgstat_drop_relation(Relation rel)
}
/*
- * Report that the table was just vacuumed.
+ * Report that the table was just vacuumed and flush IO statistics.
*/
void
pgstat_report_vacuum(Oid tableoid, bool shared,
@@ -258,10 +258,18 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
}
pgstat_unlock_entry(entry_ref);
+
+ /*
+ * Flush IO statistics now. pgstat_report_stat() will also flush IO
+ * stats, but it is not called until after an entire autovacuum cycle
+ * is done -- which will likely vacuum many relations -- or until the
+ * VACUUM command has processed all tables and committed.
+ */
+ pgstat_flush_io(false);
}
/*
- * Report that the table was just analyzed.
+ * Report that the table was just analyzed and flush IO statistics.
*
* Caller must provide new live- and dead-tuples estimates, as well as a
* flag indicating whether to reset the mod_since_analyze counter.
@@ -341,6 +349,9 @@ pgstat_report_analyze(Relation rel,
}
pgstat_unlock_entry(entry_ref);
+
+ /* see pgstat_report_vacuum() */
+ pgstat_flush_io(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_shmem.c b/src/backend/utils/activity/pgstat_shmem.c
index c1506b53d0..09fffd0e82 100644
--- a/src/backend/utils/activity/pgstat_shmem.c
+++ b/src/backend/utils/activity/pgstat_shmem.c
@@ -202,6 +202,10 @@ StatsShmemInit(void)
LWLockInitialize(&ctl->checkpointer.lock, LWTRANCHE_PGSTATS_DATA);
LWLockInitialize(&ctl->slru.lock, LWTRANCHE_PGSTATS_DATA);
LWLockInitialize(&ctl->wal.lock, LWTRANCHE_PGSTATS_DATA);
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ LWLockInitialize(&ctl->io.locks[i],
+ LWTRANCHE_PGSTATS_DATA);
}
else
{
diff --git a/src/backend/utils/activity/pgstat_wal.c b/src/backend/utils/activity/pgstat_wal.c
index e7a82b5fed..e8598b2f4e 100644
--- a/src/backend/utils/activity/pgstat_wal.c
+++ b/src/backend/utils/activity/pgstat_wal.c
@@ -34,7 +34,7 @@ static WalUsage prevWalUsage;
/*
* Calculate how much WAL usage counters have increased and update
- * shared statistics.
+ * shared WAL and IO statistics.
*
* Must be called by processes that generate WAL, that do not call
* pgstat_report_stat(), like walwriter.
@@ -43,6 +43,8 @@ void
pgstat_report_wal(bool force)
{
pgstat_flush_wal(force);
+
+ pgstat_flush_io(force);
}
/*
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 6737493402..924698e6ae 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1587,7 +1587,12 @@ pg_stat_reset(PG_FUNCTION_ARGS)
PG_RETURN_VOID();
}
-/* Reset some shared cluster-wide counters */
+/*
+ * Reset some shared cluster-wide counters
+ *
+ * When adding a new reset target, ideally the name should match that in
+ * pgstat_kind_infos, if relevant.
+ */
Datum
pg_stat_reset_shared(PG_FUNCTION_ARGS)
{
@@ -1604,6 +1609,8 @@ pg_stat_reset_shared(PG_FUNCTION_ARGS)
pgstat_reset_of_kind(PGSTAT_KIND_BGWRITER);
pgstat_reset_of_kind(PGSTAT_KIND_CHECKPOINTER);
}
+ else if (strcmp(target, "io") == 0)
+ pgstat_reset_of_kind(PGSTAT_KIND_IO);
else if (strcmp(target, "recovery_prefetch") == 0)
XLogPrefetchResetStats();
else if (strcmp(target, "wal") == 0)
@@ -1612,7 +1619,7 @@ pg_stat_reset_shared(PG_FUNCTION_ARGS)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("unrecognized reset target: \"%s\"", target),
- errhint("Target must be \"archiver\", \"bgwriter\", \"recovery_prefetch\", or \"wal\".")));
+ errhint("Target must be \"archiver\", \"bgwriter\", \"io\", \"recovery_prefetch\", or \"wal\".")));
PG_RETURN_VOID();
}
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 96b3a1e1a0..c309e0233d 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -332,6 +332,8 @@ typedef enum BackendType
B_WAL_WRITER,
} BackendType;
+#define BACKEND_NUM_TYPES (B_WAL_WRITER + 1)
+
extern PGDLLIMPORT BackendType MyBackendType;
extern const char *GetBackendTypeDesc(BackendType backendType);
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 5e3326a3b9..9f09caa05f 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -48,6 +48,7 @@ typedef enum PgStat_Kind
PGSTAT_KIND_ARCHIVER,
PGSTAT_KIND_BGWRITER,
PGSTAT_KIND_CHECKPOINTER,
+ PGSTAT_KIND_IO,
PGSTAT_KIND_SLRU,
PGSTAT_KIND_WAL,
} PgStat_Kind;
@@ -276,6 +277,55 @@ typedef struct PgStat_CheckpointerStats
PgStat_Counter buf_fsync_backend;
} PgStat_CheckpointerStats;
+
+/*
+ * Types related to counting IO operations
+ */
+typedef enum IOObject
+{
+ IOOBJECT_RELATION,
+ IOOBJECT_TEMP_RELATION,
+} IOObject;
+
+#define IOOBJECT_FIRST IOOBJECT_RELATION
+#define IOOBJECT_NUM_TYPES (IOOBJECT_TEMP_RELATION + 1)
+
+typedef enum IOContext
+{
+ IOCONTEXT_BULKREAD,
+ IOCONTEXT_BULKWRITE,
+ IOCONTEXT_NORMAL,
+ IOCONTEXT_VACUUM,
+} IOContext;
+
+#define IOCONTEXT_FIRST IOCONTEXT_BULKREAD
+#define IOCONTEXT_NUM_TYPES (IOCONTEXT_VACUUM + 1)
+
+typedef enum IOOp
+{
+ IOOP_EVICT,
+ IOOP_EXTEND,
+ IOOP_FSYNC,
+ IOOP_READ,
+ IOOP_REUSE,
+ IOOP_WRITE,
+} IOOp;
+
+#define IOOP_FIRST IOOP_EVICT
+#define IOOP_NUM_TYPES (IOOP_WRITE + 1)
+
+typedef struct PgStat_BktypeIO
+{
+ PgStat_Counter data[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+} PgStat_BktypeIO;
+
+typedef struct PgStat_IO
+{
+ TimestampTz stat_reset_timestamp;
+ PgStat_BktypeIO stats[BACKEND_NUM_TYPES];
+} PgStat_IO;
+
+
typedef struct PgStat_StatDBEntry
{
PgStat_Counter xact_commit;
@@ -453,6 +503,24 @@ extern void pgstat_report_checkpointer(void);
extern PgStat_CheckpointerStats *pgstat_fetch_stat_checkpointer(void);
+/*
+ * Functions in pgstat_io.c
+ */
+
+extern bool pgstat_bktype_io_stats_valid(PgStat_BktypeIO *context_ops,
+ BackendType bktype);
+extern void pgstat_count_io_op(IOObject io_object, IOContext io_context, IOOp io_op);
+extern PgStat_IO *pgstat_fetch_stat_io(void);
+extern const char *pgstat_get_io_context_name(IOContext io_context);
+extern const char *pgstat_get_io_object_name(IOObject io_object);
+
+extern bool pgstat_tracks_io_bktype(BackendType bktype);
+extern bool pgstat_tracks_io_object(BackendType bktype,
+ IOObject io_object, IOContext io_context);
+extern bool pgstat_tracks_io_op(BackendType bktype, IOObject io_object,
+ IOContext io_context, IOOp io_op);
+
+
/*
* Functions in pgstat_database.c
*/
diff --git a/src/include/utils/pgstat_internal.h b/src/include/utils/pgstat_internal.h
index 12fd51f1ae..6badb2fde4 100644
--- a/src/include/utils/pgstat_internal.h
+++ b/src/include/utils/pgstat_internal.h
@@ -329,6 +329,17 @@ typedef struct PgStatShared_Checkpointer
PgStat_CheckpointerStats reset_offset;
} PgStatShared_Checkpointer;
+/* Shared-memory ready PgStat_IO */
+typedef struct PgStatShared_IO
+{
+ /*
+ * locks[i] protects stats.stats[i]. locks[0] also protects
+ * stats.stat_reset_timestamp.
+ */
+ LWLock locks[BACKEND_NUM_TYPES];
+ PgStat_IO stats;
+} PgStatShared_IO;
+
typedef struct PgStatShared_SLRU
{
/* lock protects ->stats */
@@ -419,6 +430,7 @@ typedef struct PgStat_ShmemControl
PgStatShared_Archiver archiver;
PgStatShared_BgWriter bgwriter;
PgStatShared_Checkpointer checkpointer;
+ PgStatShared_IO io;
PgStatShared_SLRU slru;
PgStatShared_Wal wal;
} PgStat_ShmemControl;
@@ -442,6 +454,8 @@ typedef struct PgStat_Snapshot
PgStat_CheckpointerStats checkpointer;
+ PgStat_IO io;
+
PgStat_SLRUStats slru[SLRU_NUM_ELEMENTS];
PgStat_WalStats wal;
@@ -549,6 +563,15 @@ extern void pgstat_database_reset_timestamp_cb(PgStatShared_Common *header, Time
extern bool pgstat_function_flush_cb(PgStat_EntryRef *entry_ref, bool nowait);
+/*
+ * Functions in pgstat_io.c
+ */
+
+extern bool pgstat_flush_io(bool nowait);
+extern void pgstat_io_reset_all_cb(TimestampTz ts);
+extern void pgstat_io_snapshot_cb(void);
+
+
/*
* Functions in pgstat_relation.c
*/
@@ -643,6 +666,13 @@ extern void pgstat_create_transactional(PgStat_Kind kind, Oid dboid, Oid objoid)
extern PGDLLIMPORT PgStat_LocalState pgStatLocal;
+/*
+ * Variables in pgstat_io.c
+ */
+
+extern PGDLLIMPORT bool have_iostats;
+
+
/*
* Variables in pgstat_slru.c
*/
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 09316039e4..65be0dea1b 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1106,7 +1106,10 @@ ID
INFIX
INT128
INTERFACE_INFO
+IOContext
IOFuncSelector
+IOObject
+IOOp
IPCompareMethod
ITEM
IV
@@ -2015,6 +2018,7 @@ PgStatShared_Common
PgStatShared_Database
PgStatShared_Function
PgStatShared_HashEntry
+PgStatShared_IO
PgStatShared_Relation
PgStatShared_ReplSlot
PgStatShared_SLRU
@@ -2024,6 +2028,7 @@ PgStat_ArchiverStats
PgStat_BackendFunctionEntry
PgStat_BackendSubEntry
PgStat_BgWriterStats
+PgStat_BktypeIO
PgStat_CheckpointerStats
PgStat_Counter
PgStat_EntryRef
@@ -2032,6 +2037,7 @@ PgStat_FetchConsistency
PgStat_FunctionCallUsage
PgStat_FunctionCounts
PgStat_HashKey
+PgStat_IO
PgStat_Kind
PgStat_KindInfo
PgStat_LocalState
--
2.34.1
Attachment: v50-0005-pg_stat_io-documentation.patch (text/x-patch)
From a6a90045ae3b68b0fc627e27f740e27d77ea3810 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 17 Jan 2023 16:34:27 -0500
Subject: [PATCH v50 5/5] pg_stat_io documentation
Author: Melanie Plageman <melanieplageman@gmail.com>
Author: Samay Sharma <smilingsamay@gmail.com>
Reviewed-by: Maciek Sakrejda <m.sakrejda@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
doc/src/sgml/monitoring.sgml | 321 +++++++++++++++++++++++++++++++++--
1 file changed, 307 insertions(+), 14 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index c47d057a1d..2f4e6e89bc 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -469,6 +469,16 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</entry>
</row>
+ <row>
+ <entry><structname>pg_stat_io</structname><indexterm><primary>pg_stat_io</primary></indexterm></entry>
+ <entry>
+ One row for each combination of backend type, context, and target object
+ containing cluster-wide I/O statistics.
+ See <link linkend="monitoring-pg-stat-io-view">
+ <structname>pg_stat_io</structname></link> for details.
+ </entry>
+ </row>
+
<row>
<entry><structname>pg_stat_replication_slots</structname><indexterm><primary>pg_stat_replication_slots</primary></indexterm></entry>
<entry>One row per replication slot, showing statistics about the
@@ -665,20 +675,16 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</para>
<para>
- The <structname>pg_statio_</structname> views are primarily useful to
- determine the effectiveness of the buffer cache. When the number
- of actual disk reads is much smaller than the number of buffer
- hits, then the cache is satisfying most read requests without
- invoking a kernel call. However, these statistics do not give the
- entire story: due to the way in which <productname>PostgreSQL</productname>
- handles disk I/O, data that is not in the
- <productname>PostgreSQL</productname> buffer cache might still reside in the
- kernel's I/O cache, and might therefore still be fetched without
- requiring a physical read. Users interested in obtaining more
- detailed information on <productname>PostgreSQL</productname> I/O behavior are
- advised to use the <productname>PostgreSQL</productname> statistics views
- in combination with operating system utilities that allow insight
- into the kernel's handling of I/O.
+ The <structname>pg_stat_io</structname> and
+ <structname>pg_statio_</structname> set of views are useful for determining
+ the effectiveness of the buffer cache. They can be used to calculate a cache
+ hit ratio. Note that while <productname>PostgreSQL</productname>'s I/O
+ statistics capture most instances in which the kernel was invoked in order
+ to perform I/O, they do not differentiate between data which had to be
+ fetched from disk and that which already resided in the kernel page cache.
+ Users are advised to use the <productname>PostgreSQL</productname>
+ statistics views in combination with operating system utilities for a more
+ complete picture of their database's I/O performance.
</para>
</sect2>
@@ -3659,6 +3665,293 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
<structfield>last_archived_wal</structfield> have also been successfully
archived.
</para>
+ </sect2>
+
+ <sect2 id="monitoring-pg-stat-io-view">
+ <title><structname>pg_stat_io</structname></title>
+
+ <indexterm>
+ <primary>pg_stat_io</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_stat_io</structname> view will contain one row for each
+ combination of backend type, target I/O object, and I/O context, showing
+ cluster-wide I/O statistics. Combinations which do not make sense are
+ omitted.
+ </para>
+
+ <para>
+ Currently, I/O on relations (e.g. tables, indexes) is tracked. However,
+ relation I/O which bypasses shared buffers (e.g. when moving a table from one
+ tablespace to another) is currently not tracked.
+ </para>
+
+ <table id="pg-stat-io-view" xreflabel="pg_stat_io">
+ <title><structname>pg_stat_io</structname> View</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para>
+ </entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>backend_type</structfield> <type>text</type>
+ </para>
+ <para>
+ Type of backend (e.g. background worker, autovacuum worker). See <link
+ linkend="monitoring-pg-stat-activity-view">
+ <structname>pg_stat_activity</structname></link> for more information
+ on <varname>backend_type</varname>s. Some
+ <varname>backend_type</varname>s do not accumulate I/O operation
+ statistics and will not be included in the view.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>io_object</structfield> <type>text</type>
+ </para>
+ <para>
+ Target object of an I/O operation. Possible values are:
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>relation</literal>: Permanent relations.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>temp relation</literal>: Temporary relations.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>io_context</structfield> <type>text</type>
+ </para>
+ <para>
+ The context of an I/O operation. Possible values are:
+ </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>normal</literal>: The default or standard
+ <varname>io_context</varname> for a type of I/O operation. For
+ example, by default, relation data is read into and written out from
+ shared buffers. Thus, reads and writes of relation data to and from
+ shared buffers are tracked in <varname>io_context</varname>
+ <literal>normal</literal>.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>vacuum</literal>: I/O operations performed outside of shared
+ buffers while vacuuming and analyzing permanent relations. Temporary
+ table vacuums use the same local buffer pool as other temporary table
+ IO operations and are tracked in <varname>io_context</varname>
+ <literal>normal</literal>.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>bulkread</literal>: Certain large read I/O operations
+ done outside of shared buffers, for example, a sequential scan of a
+ large table.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>bulkwrite</literal>: Certain large write I/O operations
+ done outside of shared buffers, such as <command>COPY</command>.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>reads</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of read operations, each of the size specified in
+ <varname>op_bytes</varname>.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>writes</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of write operations, each of the size specified in
+ <varname>op_bytes</varname>.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>extends</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of relation extend operations, each of the size specified in
+ <varname>op_bytes</varname>.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>op_bytes</structfield> <type>bigint</type>
+ </para>
+ <para>
+ The number of bytes per unit of I/O read, written, or extended.
+ </para>
+ <para>
+ Relation data reads, writes, and extends are done in
+ <varname>block_size</varname> units, derived from the build-time
+ parameter <symbol>BLCKSZ</symbol>, which is <literal>8192</literal> by
+ default.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>evictions</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of times a block has been written out from a shared or local
+ buffer in order to make it available for another use.
+ </para>
+ <para>
+ In <varname>io_context</varname> <literal>normal</literal>, this counts
+ the number of times a block was evicted from a buffer and replaced with
+ another block. In <varname>io_context</varname>s
+ <literal>bulkwrite</literal>, <literal>bulkread</literal>, and
+ <literal>vacuum</literal>, this counts the number of times a block was
+ evicted from shared buffers in order to add the shared buffer to a
+ separate, size-limited ring buffer for use in a bulk I/O operation.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>reuses</structfield> <type>bigint</type>
+ </para>
+ <para>
+ The number of times an existing buffer in a size-limited ring buffer
+ outside of shared buffers was reused as part of an I/O operation in the
+ <literal>bulkread</literal>, <literal>bulkwrite</literal>, or
+ <literal>vacuum</literal> <varname>io_context</varname>s.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>fsyncs</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of <literal>fsync</literal> calls. These are only tracked in
+ <varname>io_context</varname> <literal>normal</literal>.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
+ </para>
+ <para>
+ Time at which these statistics were last reset.
+ </para>
+ </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ Some backend types never perform I/O operations on some I/O objects and/or
+ in some I/O contexts. These rows are omitted from the view. For example, the
+ checkpointer does not checkpoint temporary tables, so there will be no rows
+ for <varname>backend_type</varname> <literal>checkpointer</literal> and
+ <varname>io_object</varname> <literal>temp relation</literal>.
+ </para>
+
+ <para>
+ In addition, some I/O operations will never be performed either by certain
+ backend types or on certain I/O objects and/or in certain I/O contexts.
+ These cells will be NULL. For example, temporary tables are not
+ <literal>fsync</literal>ed, so <varname>fsyncs</varname> will be NULL for
+ <varname>io_object</varname> <literal>temp relation</literal>. Also, the
+ background writer does not perform reads, so <varname>reads</varname> will
+ be NULL in rows for <varname>backend_type</varname> <literal>background
+ writer</literal>.
+ </para>
+
+ <para>
+ <structname>pg_stat_io</structname> can be used to inform database tuning.
+ For example:
+ <itemizedlist>
+ <listitem>
+ <para>
+ A high <varname>evictions</varname> count can indicate that shared
+ buffers should be increased.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ Client backends rely on the checkpointer to ensure data is persisted to
+ permanent storage. Large numbers of <varname>fsyncs</varname> by
+ <literal>client backend</literal>s could indicate a misconfiguration of
+ shared buffers or of the checkpointer. More information on configuring
+ the checkpointer can be found in <xref linkend="wal-configuration"/>.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ Normally, client backends should be able to rely on auxiliary processes
+ like the checkpointer and the background writer to write out dirty data
+ as much as possible. Large numbers of writes by client backends could
+ indicate a misconfiguration of shared buffers or of the checkpointer.
+ More information on configuring the checkpointer can be found in <xref
+ linkend="wal-configuration"/>.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </para>
+
</sect2>
--
2.34.1
Attachment: v50-0004-Add-system-view-tracking-IO-ops-per-backend-type.patch (text/x-patch)
From 471d98971d39d33a9e75bc1463f8f7d6c2973dfe Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 17 Jan 2023 16:28:27 -0500
Subject: [PATCH v50 4/5] Add system view tracking IO ops per backend type
Add pg_stat_io, a system view which tracks the number of IOOps
(evictions, reuses, reads, writes, extensions, and fsyncs) done on each
IOObject (relation, temp relation) in each IOContext ("normal" and those
using a BufferAccessStrategy) by each type of backend (e.g. client
backend, checkpointer).
Some BackendTypes do not accumulate IO operations statistics and will
not be included in the view.
Some IOObjects are never operated on in some IOContexts or by some
BackendTypes. These rows are omitted from the view. For example,
checkpointer will never operate on IOOBJECT_TEMP_RELATION data, so those
rows are omitted.
Some IOContexts are not used by some BackendTypes and will not be in the
view. For example, checkpointer does not use a BufferAccessStrategy
(currently), so there will be no rows for BufferAccessStrategy
IOContexts for checkpointer.
Some IOOps are invalid in combination with certain IOContexts and
certain IOObjects. Those cells will be NULL in the view to distinguish
between 0 observed IOOps of that type and an invalid combination. For
example, temporary tables are not fsynced so cells for all BackendTypes
for IOOBJECT_TEMP_RELATION and IOOP_FSYNC will be NULL.
Some BackendTypes never perform certain IOOps. Those cells will also be
NULL in the view. For example, bgwriter should not perform reads.
View stats are populated with statistics incremented when a backend
performs an IO Operation and maintained by the cumulative statistics
subsystem.
Each row of the view shows stats for a particular BackendType, IOObject,
IOContext combination (e.g. a client backend's operations on permanent
relations in shared buffers) and each column in the view is the total
number of IO Operations done (e.g. writes). So a cell in the view would
be, for example, the number of blocks of relation data written from
shared buffers by client backends since the last stats reset.
In anticipation of tracking WAL IO and non-block-oriented IO (such as
temporary file IO), the "op_bytes" column specifies the unit of the
"reads", "writes", and "extends" columns for a given row.
Note that some of the cells in the view are redundant with fields in
pg_stat_bgwriter (e.g. buffers_backend), however these have been kept in
pg_stat_bgwriter for backwards compatibility. Deriving the redundant
pg_stat_bgwriter stats from the IO operations stats structures was also
problematic due to the separate reset targets for 'bgwriter' and 'io'.
Suggested by Andres Freund
Catalog version should be bumped.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
contrib/amcheck/expected/check_heap.out | 34 ++++
contrib/amcheck/sql/check_heap.sql | 27 +++
src/backend/catalog/system_views.sql | 15 ++
src/backend/utils/adt/pgstatfuncs.c | 141 +++++++++++++++
src/include/catalog/pg_proc.dat | 9 +
src/test/regress/expected/rules.out | 12 ++
src/test/regress/expected/stats.out | 227 ++++++++++++++++++++++++
src/test/regress/sql/stats.sql | 141 +++++++++++++++
src/tools/pgindent/typedefs.list | 1 +
9 files changed, 607 insertions(+)
diff --git a/contrib/amcheck/expected/check_heap.out b/contrib/amcheck/expected/check_heap.out
index c010361025..e4785141a6 100644
--- a/contrib/amcheck/expected/check_heap.out
+++ b/contrib/amcheck/expected/check_heap.out
@@ -66,6 +66,22 @@ SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'ALL-VISIBLE');
INSERT INTO heaptest (a, b)
(SELECT gs, repeat('x', gs)
FROM generate_series(1,50) gs);
+-- pg_stat_io test:
+-- verify_heapam always uses a BAS_BULKREAD BufferAccessStrategy, whereas a
+-- sequential scan does so only if the table is large enough when compared to
+-- shared buffers (see initscan()). CREATE DATABASE ... also unconditionally
+-- uses a BAS_BULKREAD strategy, but we have chosen to use a tablespace and
+-- verify_heapam to provide coverage instead of adding another expensive
+-- operation to the main regression test suite.
+--
+-- Create an alternative tablespace and move the heaptest table to it, causing
+-- it to be rewritten and all of its blocks to be reliably evicted from shared
+-- buffers -- guaranteeing actual reads when we next select from it.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE regress_test_stats_tblspc LOCATION '';
+SELECT sum(reads) AS stats_bulkreads_before
+ FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+ALTER TABLE heaptest SET TABLESPACE regress_test_stats_tblspc;
-- Check that valid options are not rejected nor corruption reported
-- for a non-empty table
SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'none');
@@ -88,6 +104,23 @@ SELECT * FROM verify_heapam(relation := 'heaptest', startblock := 0, endblock :=
-------+--------+--------+-----
(0 rows)
+-- verify_heapam should have read in the page written out by
+-- ALTER TABLE ... SET TABLESPACE ...
+-- causing an additional bulkread, which should be reflected in pg_stat_io.
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(reads) AS stats_bulkreads_after
+ FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+SELECT :stats_bulkreads_after > :stats_bulkreads_before;
+ ?column?
+----------
+ t
+(1 row)
+
CREATE ROLE regress_heaptest_role;
-- verify permissions are checked (error due to function not callable)
SET ROLE regress_heaptest_role;
@@ -195,6 +228,7 @@ ERROR: cannot check relation "test_foreign_table"
DETAIL: This operation is not supported for foreign tables.
-- cleanup
DROP TABLE heaptest;
+DROP TABLESPACE regress_test_stats_tblspc;
DROP TABLE test_partition;
DROP TABLE test_partitioned;
DROP OWNED BY regress_heaptest_role; -- permissions
diff --git a/contrib/amcheck/sql/check_heap.sql b/contrib/amcheck/sql/check_heap.sql
index 298de6886a..6794ca4eb0 100644
--- a/contrib/amcheck/sql/check_heap.sql
+++ b/contrib/amcheck/sql/check_heap.sql
@@ -20,11 +20,29 @@ SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'NONE');
SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'ALL-FROZEN');
SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'ALL-VISIBLE');
+
-- Add some data so subsequent tests are not entirely trivial
INSERT INTO heaptest (a, b)
(SELECT gs, repeat('x', gs)
FROM generate_series(1,50) gs);
+-- pg_stat_io test:
+-- verify_heapam always uses a BAS_BULKREAD BufferAccessStrategy, whereas a
+-- sequential scan does so only if the table is large enough when compared to
+-- shared buffers (see initscan()). CREATE DATABASE ... also unconditionally
+-- uses a BAS_BULKREAD strategy, but we have chosen to use a tablespace and
+-- verify_heapam to provide coverage instead of adding another expensive
+-- operation to the main regression test suite.
+--
+-- Create an alternative tablespace and move the heaptest table to it, causing
+-- it to be rewritten and all of its blocks to be reliably evicted from shared
+-- buffers -- guaranteeing actual reads when we next select from it.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE regress_test_stats_tblspc LOCATION '';
+SELECT sum(reads) AS stats_bulkreads_before
+ FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+ALTER TABLE heaptest SET TABLESPACE regress_test_stats_tblspc;
+
-- Check that valid options are not rejected nor corruption reported
-- for a non-empty table
SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'none');
@@ -32,6 +50,14 @@ SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'all-frozen');
SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'all-visible');
SELECT * FROM verify_heapam(relation := 'heaptest', startblock := 0, endblock := 0);
+-- verify_heapam should have read in the page written out by
+-- ALTER TABLE ... SET TABLESPACE ...
+-- causing an additional bulkread, which should be reflected in pg_stat_io.
+SELECT pg_stat_force_next_flush();
+SELECT sum(reads) AS stats_bulkreads_after
+ FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+SELECT :stats_bulkreads_after > :stats_bulkreads_before;
+
CREATE ROLE regress_heaptest_role;
-- verify permissions are checked (error due to function not callable)
@@ -110,6 +136,7 @@ SELECT * FROM verify_heapam('test_foreign_table',
-- cleanup
DROP TABLE heaptest;
+DROP TABLESPACE regress_test_stats_tblspc;
DROP TABLE test_partition;
DROP TABLE test_partitioned;
DROP OWNED BY regress_heaptest_role; -- permissions
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 8608e3fa5b..34ca0e739f 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1117,6 +1117,21 @@ CREATE VIEW pg_stat_bgwriter AS
pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+CREATE VIEW pg_stat_io AS
+SELECT
+ b.backend_type,
+ b.io_object,
+ b.io_context,
+ b.reads,
+ b.writes,
+ b.extends,
+ b.op_bytes,
+ b.evictions,
+ b.reuses,
+ b.fsyncs,
+ b.stats_reset
+FROM pg_stat_get_io() b;
+
CREATE VIEW pg_stat_wal AS
SELECT
w.wal_records,
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 924698e6ae..9d707c3521 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1245,6 +1245,147 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
}
+/*
+ * When adding a new column to the pg_stat_io view, add a new enum value
+ * here above IO_NUM_COLUMNS.
+ */
+typedef enum io_stat_col
+{
+ IO_COL_BACKEND_TYPE,
+ IO_COL_IO_OBJECT,
+ IO_COL_IO_CONTEXT,
+ IO_COL_READS,
+ IO_COL_WRITES,
+ IO_COL_EXTENDS,
+ IO_COL_CONVERSION,
+ IO_COL_EVICTIONS,
+ IO_COL_REUSES,
+ IO_COL_FSYNCS,
+ IO_COL_RESET_TIME,
+ IO_NUM_COLUMNS,
+} io_stat_col;
+
+/*
+ * When adding a new IOOp, add a new io_stat_col and add a case to this
+ * function returning the corresponding io_stat_col.
+ */
+static io_stat_col
+pgstat_get_io_op_index(IOOp io_op)
+{
+ switch (io_op)
+ {
+ case IOOP_EVICT:
+ return IO_COL_EVICTIONS;
+ case IOOP_READ:
+ return IO_COL_READS;
+ case IOOP_REUSE:
+ return IO_COL_REUSES;
+ case IOOP_WRITE:
+ return IO_COL_WRITES;
+ case IOOP_EXTEND:
+ return IO_COL_EXTENDS;
+ case IOOP_FSYNC:
+ return IO_COL_FSYNCS;
+ }
+
+ elog(ERROR, "unrecognized IOOp value: %d", io_op);
+ pg_unreachable();
+}
+
+Datum
+pg_stat_get_io(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsinfo;
+ PgStat_IO *backends_io_stats;
+ Datum reset_time;
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+
+ backends_io_stats = pgstat_fetch_stat_io();
+
+ reset_time = TimestampTzGetDatum(backends_io_stats->stat_reset_timestamp);
+
+ for (BackendType bktype = B_INVALID; bktype < BACKEND_NUM_TYPES; bktype++)
+ {
+ Datum bktype_desc = CStringGetTextDatum(GetBackendTypeDesc(bktype));
+ PgStat_BktypeIO *bktype_stats = &backends_io_stats->stats[bktype];
+
+ /*
+ * In Assert builds, we can afford an extra loop through all of the
+ * counters checking that only expected stats are non-zero, since it
+ * keeps the non-Assert code cleaner.
+ */
+ Assert(pgstat_bktype_io_stats_valid(bktype_stats, bktype));
+
+ /*
+ * For those BackendTypes without IO Operation stats, skip
+ * representing them in the view altogether.
+ */
+ if (!pgstat_tracks_io_bktype(bktype))
+ continue;
+
+ for (IOObject io_obj = IOOBJECT_FIRST;
+ io_obj < IOOBJECT_NUM_TYPES; io_obj++)
+ {
+ const char *obj_name = pgstat_get_io_object_name(io_obj);
+
+ for (IOContext io_context = IOCONTEXT_FIRST;
+ io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ {
+ const char *context_name = pgstat_get_io_context_name(io_context);
+
+ Datum values[IO_NUM_COLUMNS] = {0};
+ bool nulls[IO_NUM_COLUMNS] = {0};
+
+ /*
+ * Some combinations of BackendType, IOObject, and IOContext
+ * are not valid for any type of IOOp. In such cases, omit the
+ * entire row from the view.
+ */
+ if (!pgstat_tracks_io_object(bktype, io_obj, io_context))
+ continue;
+
+ values[IO_COL_BACKEND_TYPE] = bktype_desc;
+ values[IO_COL_IO_CONTEXT] = CStringGetTextDatum(context_name);
+ values[IO_COL_IO_OBJECT] = CStringGetTextDatum(obj_name);
+ values[IO_COL_RESET_TIME] = TimestampTzGetDatum(reset_time);
+
+ /*
+ * Hard-code this to the value of BLCKSZ for now. Future
+ * values could include XLOG_BLCKSZ, once WAL IO is tracked,
+ * and constant multipliers, once non-block-oriented IO (e.g.
+ * temporary file IO) is tracked.
+ */
+ values[IO_COL_CONVERSION] = Int64GetDatum(BLCKSZ);
+
+ for (IOOp io_op = IOOP_FIRST; io_op < IOOP_NUM_TYPES; io_op++)
+ {
+ int col_idx = pgstat_get_io_op_index(io_op);
+
+ /*
+ * Some combinations of BackendType and IOOp, of IOContext
+ * and IOOp, and of IOObject and IOOp are not tracked. Set
+ * these cells in the view to NULL.
+ */
+ nulls[col_idx] = !pgstat_tracks_io_op(bktype, io_obj, io_context, io_op);
+
+ if (nulls[col_idx])
+ continue;
+
+ values[col_idx] =
+ Int64GetDatum(bktype_stats->data[io_obj][io_context][io_op]);
+ }
+
+ tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
+ values, nulls);
+ }
+ }
+ }
+
+ return (Datum) 0;
+}
+
/*
* Returns statistics of WAL activity
*/
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 86eb8e8c58..2e804c5bd4 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5690,6 +5690,15 @@
proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
+{ oid => '8459', descr => 'statistics: per backend type IO statistics',
+ proname => 'pg_stat_get_io', provolatile => 'v',
+ prorows => '30', proretset => 't',
+ proparallel => 'r', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,text,text,int8,int8,int8,int8,int8,int8,int8,timestamptz}',
+ proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}',
+ proargnames => '{backend_type,io_object,io_context,reads,writes,extends,op_bytes,evictions,reuses,fsyncs,stats_reset}',
+ prosrc => 'pg_stat_get_io' },
+
{ oid => '1136', descr => 'statistics: information about WAL activity',
proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
proparallel => 'r', prorettype => 'record', proargtypes => '',
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index e7a2f5856a..174b725fff 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1876,6 +1876,18 @@ pg_stat_gssapi| SELECT pid,
gss_enc AS encrypted
FROM pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid, query_id)
WHERE (client_port IS NOT NULL);
+pg_stat_io| SELECT backend_type,
+ io_object,
+ io_context,
+ reads,
+ writes,
+ extends,
+ op_bytes,
+ evictions,
+ reuses,
+ fsyncs,
+ stats_reset
+ FROM pg_stat_get_io() b(backend_type, io_object, io_context, reads, writes, extends, op_bytes, evictions, reuses, fsyncs, stats_reset);
pg_stat_progress_analyze| SELECT s.pid,
s.datid,
d.datname,
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index 1d84407a03..3ad38da0dd 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -1126,4 +1126,231 @@ SELECT pg_stat_get_subscription_stats(NULL);
(1 row)
+-- Test that the following operations are tracked in pg_stat_io:
+-- - reads of target blocks into shared buffers
+-- - writes of shared buffers to permanent storage
+-- - extends of relations using shared buffers
+-- - fsyncs done to ensure the durability of data dirtying shared buffers
+-- There is no test for blocks evicted from shared buffers, because we cannot
+-- be sure of the state of shared buffers at the point the test is run.
+-- Create a regular table and insert some data to generate IOCONTEXT_NORMAL
+-- extends.
+SELECT sum(extends) AS io_sum_shared_before_extends
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+CREATE TABLE test_io_shared(a int);
+INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(extends) AS io_sum_shared_after_extends
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT :io_sum_shared_after_extends > :io_sum_shared_before_extends;
+ ?column?
+----------
+ t
+(1 row)
+
+-- After a checkpoint, there should be some additional IOCONTEXT_NORMAL writes
+-- and fsyncs.
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+ FROM pg_stat_io
+ WHERE io_context = 'normal' AND io_object = 'relation' \gset io_sum_shared_before_
+-- See comment above for rationale for two explicit CHECKPOINTs.
+CHECKPOINT;
+CHECKPOINT;
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+ FROM pg_stat_io
+ WHERE io_context = 'normal' AND io_object = 'relation' \gset io_sum_shared_after_
+SELECT :io_sum_shared_after_writes > :io_sum_shared_before_writes;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT current_setting('fsync') = 'off'
+ OR :io_sum_shared_after_fsyncs > :io_sum_shared_before_fsyncs;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Change the tablespace so that the table is rewritten directly, then SELECT
+-- from it to cause it to be read back into shared buffers.
+SELECT sum(reads) AS io_sum_shared_before_reads
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+ALTER TABLE test_io_shared SET TABLESPACE regress_tblspace;
+-- SELECT from the table so that the data is read into shared buffers and
+-- io_context 'normal', io_object 'relation' reads are counted.
+SELECT COUNT(*) FROM test_io_shared;
+ count
+-------
+ 100
+(1 row)
+
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(reads) AS io_sum_shared_after_reads
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT :io_sum_shared_after_reads > :io_sum_shared_before_reads;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Test that the following IOCONTEXT_LOCAL IOOps are tracked in pg_stat_io:
+-- - eviction of local buffers in order to reuse them
+-- - reads of temporary table blocks into local buffers
+-- - writes of local buffers to permanent storage
+-- - extends of temporary tables
+-- Set temp_buffers to its minimum so that we can trigger writes with fewer
+-- inserted tuples. Do so in a new session in case temporary tables have been
+-- accessed by previous tests in this session.
+\c
+SET temp_buffers TO 100;
+CREATE TEMPORARY TABLE test_io_local(a int, b TEXT);
+SELECT sum(extends) AS extends, sum(evictions) AS evictions, sum(writes) AS writes
+ FROM pg_stat_io
+ WHERE io_context = 'normal' AND io_object = 'temp relation' \gset io_sum_local_before_
+-- Insert tuples into the temporary table, generating extends in the stats.
+-- Insert enough values that we need to reuse and write out dirty local
+-- buffers, generating evictions and writes.
+INSERT INTO test_io_local SELECT generate_series(1, 5000) as id, repeat('a', 200);
+-- Ensure the table is large enough to exceed our temp_buffers setting.
+SELECT pg_relation_size('test_io_local') / current_setting('block_size')::int8 > 100;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT sum(reads) AS io_sum_local_before_reads
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+-- Read in evicted buffers, generating reads.
+SELECT COUNT(*) FROM test_io_local;
+ count
+-------
+ 5000
+(1 row)
+
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(evictions) AS evictions,
+ sum(reads) AS reads,
+ sum(writes) AS writes,
+ sum(extends) AS extends
+ FROM pg_stat_io
+ WHERE io_context = 'normal' AND io_object = 'temp relation' \gset io_sum_local_after_
+SELECT :io_sum_local_after_evictions > :io_sum_local_before_evictions,
+ :io_sum_local_after_reads > :io_sum_local_before_reads,
+ :io_sum_local_after_writes > :io_sum_local_before_writes,
+ :io_sum_local_after_extends > :io_sum_local_before_extends;
+ ?column? | ?column? | ?column? | ?column?
+----------+----------+----------+----------
+ t | t | t | t
+(1 row)
+
+-- Change the tablespaces so that the temporary table is rewritten to other
+-- local buffers, exercising a different codepath than standard local buffer
+-- writes.
+ALTER TABLE test_io_local SET TABLESPACE regress_tblspace;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(writes) AS io_sum_local_new_tblspc_writes
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT :io_sum_local_new_tblspc_writes > :io_sum_local_after_writes;
+ ?column?
+----------
+ t
+(1 row)
+
+RESET temp_buffers;
+-- Test that reuse of strategy buffers and reads of blocks into these reused
+-- buffers while VACUUMing are tracked in pg_stat_io.
+-- Set wal_skip_threshold smaller than the expected size of
+-- test_io_vac_strategy so that, even if wal_level is minimal, VACUUM FULL will
+-- fsync the newly rewritten test_io_vac_strategy instead of writing it to WAL.
+-- Writing it to WAL will result in the newly written relation pages being in
+-- shared buffers -- preventing us from testing BAS_VACUUM BufferAccessStrategy
+-- reads.
+SET wal_skip_threshold = '1 kB';
+SELECT sum(reuses) AS reuses, sum(reads) AS reads
+ FROM pg_stat_io WHERE io_context = 'vacuum' \gset io_sum_vac_strategy_before_
+CREATE TABLE test_io_vac_strategy(a int, b int) WITH (autovacuum_enabled = 'false');
+INSERT INTO test_io_vac_strategy SELECT i, i from generate_series(1, 8000)i;
+-- Ensure that the next VACUUM will need to perform IO by rewriting the table
+-- first with VACUUM (FULL).
+VACUUM (FULL) test_io_vac_strategy;
+VACUUM (PARALLEL 0) test_io_vac_strategy;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(reuses) AS reuses, sum(reads) AS reads
+ FROM pg_stat_io WHERE io_context = 'vacuum' \gset io_sum_vac_strategy_after_
+SELECT :io_sum_vac_strategy_after_reads > :io_sum_vac_strategy_before_reads,
+ :io_sum_vac_strategy_after_reuses > :io_sum_vac_strategy_before_reuses;
+ ?column? | ?column?
+----------+----------
+ t | t
+(1 row)
+
+RESET wal_skip_threshold;
+-- Test that extends done by a CTAS, which uses a BAS_BULKWRITE
+-- BufferAccessStrategy, are tracked in pg_stat_io.
+SELECT sum(extends) AS io_sum_bulkwrite_strategy_extends_before
+ FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+CREATE TABLE test_io_bulkwrite_strategy AS SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(extends) AS io_sum_bulkwrite_strategy_extends_after
+ FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+SELECT :io_sum_bulkwrite_strategy_extends_after > :io_sum_bulkwrite_strategy_extends_before;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Test IO stats reset
+SELECT pg_stat_have_stats('io', 0, 0);
+ pg_stat_have_stats
+--------------------
+ t
+(1 row)
+
+SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) AS io_stats_pre_reset
+ FROM pg_stat_io \gset
+SELECT pg_stat_reset_shared('io');
+ pg_stat_reset_shared
+----------------------
+
+(1 row)
+
+SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) AS io_stats_post_reset
+ FROM pg_stat_io \gset
+SELECT :io_stats_post_reset < :io_stats_pre_reset;
+ ?column?
+----------
+ t
+(1 row)
+
-- End of Stats Test
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index b4d6753c71..5badd09a1c 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -536,4 +536,145 @@ SELECT pg_stat_get_replication_slot(NULL);
SELECT pg_stat_get_subscription_stats(NULL);
+-- Test that the following operations are tracked in pg_stat_io:
+-- - reads of target blocks into shared buffers
+-- - writes of shared buffers to permanent storage
+-- - extends of relations using shared buffers
+-- - fsyncs done to ensure the durability of data dirtying shared buffers
+
+-- There is no test for blocks evicted from shared buffers, because we cannot
+-- be sure of the state of shared buffers at the point the test is run.
+
+-- Create a regular table and insert some data to generate IOCONTEXT_NORMAL
+-- extends.
+SELECT sum(extends) AS io_sum_shared_before_extends
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+CREATE TABLE test_io_shared(a int);
+INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+SELECT sum(extends) AS io_sum_shared_after_extends
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT :io_sum_shared_after_extends > :io_sum_shared_before_extends;
+
+-- After a checkpoint, there should be some additional IOCONTEXT_NORMAL writes
+-- and fsyncs.
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+ FROM pg_stat_io
+ WHERE io_context = 'normal' AND io_object = 'relation' \gset io_sum_shared_before_
+-- See comment above for rationale for two explicit CHECKPOINTs.
+CHECKPOINT;
+CHECKPOINT;
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+ FROM pg_stat_io
+ WHERE io_context = 'normal' AND io_object = 'relation' \gset io_sum_shared_after_
+
+SELECT :io_sum_shared_after_writes > :io_sum_shared_before_writes;
+SELECT current_setting('fsync') = 'off'
+ OR :io_sum_shared_after_fsyncs > :io_sum_shared_before_fsyncs;
+
+-- Change the tablespace so that the table is rewritten directly, then SELECT
+-- from it to cause it to be read back into shared buffers.
+SELECT sum(reads) AS io_sum_shared_before_reads
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+ALTER TABLE test_io_shared SET TABLESPACE regress_tblspace;
+-- SELECT from the table so that the data is read into shared buffers and
+-- io_context 'normal', io_object 'relation' reads are counted.
+SELECT COUNT(*) FROM test_io_shared;
+SELECT pg_stat_force_next_flush();
+SELECT sum(reads) AS io_sum_shared_after_reads
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT :io_sum_shared_after_reads > :io_sum_shared_before_reads;
+
+-- Test that the following IOCONTEXT_LOCAL IOOps are tracked in pg_stat_io:
+-- - eviction of local buffers in order to reuse them
+-- - reads of temporary table blocks into local buffers
+-- - writes of local buffers to permanent storage
+-- - extends of temporary tables
+
+-- Set temp_buffers to its minimum so that we can trigger writes with fewer
+-- inserted tuples. Do so in a new session in case temporary tables have been
+-- accessed by previous tests in this session.
+\c
+SET temp_buffers TO 100;
+CREATE TEMPORARY TABLE test_io_local(a int, b TEXT);
+SELECT sum(extends) AS extends, sum(evictions) AS evictions, sum(writes) AS writes
+ FROM pg_stat_io
+ WHERE io_context = 'normal' AND io_object = 'temp relation' \gset io_sum_local_before_
+-- Insert tuples into the temporary table, generating extends in the stats.
+-- Insert enough values that we need to reuse and write out dirty local
+-- buffers, generating evictions and writes.
+INSERT INTO test_io_local SELECT generate_series(1, 5000) as id, repeat('a', 200);
+-- Ensure the table is large enough to exceed our temp_buffers setting.
+SELECT pg_relation_size('test_io_local') / current_setting('block_size')::int8 > 100;
+
+SELECT sum(reads) AS io_sum_local_before_reads
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+-- Read in evicted buffers, generating reads.
+SELECT COUNT(*) FROM test_io_local;
+SELECT pg_stat_force_next_flush();
+SELECT sum(evictions) AS evictions,
+ sum(reads) AS reads,
+ sum(writes) AS writes,
+ sum(extends) AS extends
+ FROM pg_stat_io
+ WHERE io_context = 'normal' AND io_object = 'temp relation' \gset io_sum_local_after_
+SELECT :io_sum_local_after_evictions > :io_sum_local_before_evictions,
+ :io_sum_local_after_reads > :io_sum_local_before_reads,
+ :io_sum_local_after_writes > :io_sum_local_before_writes,
+ :io_sum_local_after_extends > :io_sum_local_before_extends;
+
+-- Change the tablespaces so that the temporary table is rewritten to other
+-- local buffers, exercising a different codepath than standard local buffer
+-- writes.
+ALTER TABLE test_io_local SET TABLESPACE regress_tblspace;
+SELECT pg_stat_force_next_flush();
+SELECT sum(writes) AS io_sum_local_new_tblspc_writes
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT :io_sum_local_new_tblspc_writes > :io_sum_local_after_writes;
+RESET temp_buffers;
+
+-- Test that reuse of strategy buffers and reads of blocks into these reused
+-- buffers while VACUUMing are tracked in pg_stat_io.
+
+-- Set wal_skip_threshold smaller than the expected size of
+-- test_io_vac_strategy so that, even if wal_level is minimal, VACUUM FULL will
+-- fsync the newly rewritten test_io_vac_strategy instead of writing it to WAL.
+-- Writing it to WAL will result in the newly written relation pages being in
+-- shared buffers -- preventing us from testing BAS_VACUUM BufferAccessStrategy
+-- reads.
+SET wal_skip_threshold = '1 kB';
+SELECT sum(reuses) AS reuses, sum(reads) AS reads
+ FROM pg_stat_io WHERE io_context = 'vacuum' \gset io_sum_vac_strategy_before_
+CREATE TABLE test_io_vac_strategy(a int, b int) WITH (autovacuum_enabled = 'false');
+INSERT INTO test_io_vac_strategy SELECT i, i from generate_series(1, 8000)i;
+-- Ensure that the next VACUUM will need to perform IO by rewriting the table
+-- first with VACUUM (FULL).
+VACUUM (FULL) test_io_vac_strategy;
+VACUUM (PARALLEL 0) test_io_vac_strategy;
+SELECT pg_stat_force_next_flush();
+SELECT sum(reuses) AS reuses, sum(reads) AS reads
+ FROM pg_stat_io WHERE io_context = 'vacuum' \gset io_sum_vac_strategy_after_
+SELECT :io_sum_vac_strategy_after_reads > :io_sum_vac_strategy_before_reads,
+ :io_sum_vac_strategy_after_reuses > :io_sum_vac_strategy_before_reuses;
+RESET wal_skip_threshold;
+
+-- Test that extends done by a CTAS, which uses a BAS_BULKWRITE
+-- BufferAccessStrategy, are tracked in pg_stat_io.
+SELECT sum(extends) AS io_sum_bulkwrite_strategy_extends_before
+ FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+CREATE TABLE test_io_bulkwrite_strategy AS SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+SELECT sum(extends) AS io_sum_bulkwrite_strategy_extends_after
+ FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+SELECT :io_sum_bulkwrite_strategy_extends_after > :io_sum_bulkwrite_strategy_extends_before;
+
+-- Test IO stats reset
+SELECT pg_stat_have_stats('io', 0, 0);
+SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) AS io_stats_pre_reset
+ FROM pg_stat_io \gset
+SELECT pg_stat_reset_shared('io');
+SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) AS io_stats_post_reset
+ FROM pg_stat_io \gset
+SELECT :io_stats_post_reset < :io_stats_pre_reset;
+
-- End of Stats Test
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 65be0dea1b..970a0cfd1d 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3376,6 +3376,7 @@ intset_internal_node
intset_leaf_node
intset_node
intvKEY
+io_stat_col
itemIdCompact
itemIdCompactData
iterator
--
2.34.1
Attachment: v50-0003-pgstat-Count-IO-for-relations.patch (text/x-patch)
From 2c138faa33ff93756b6bda70e68798c6ab4afbe6 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 17 Jan 2023 16:25:31 -0500
Subject: [PATCH v50 3/5] pgstat: Count IO for relations
Count IOOps done on IOObjects in IOContexts by various BackendTypes
using the IO stats infrastructure introduced by a previous commit.
The primary concern of these statistics is IO operations on data blocks
during the course of normal database operations. IO operations done by,
for example, the archiver or syslogger are not counted in these
statistics. WAL IO, temporary file IO, and IO done directly through smgr*
functions (such as when building an index) are not yet counted but would
be useful future additions.
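For instance, once these counts are collected, a per-backend-type summary of
relation block IO can be read from the view added earlier in the series
(a sketch; values are installation-dependent):

```sql
SELECT backend_type,
       sum(reads) AS reads, sum(writes) AS writes, sum(extends) AS extends
FROM pg_stat_io
WHERE io_object = 'relation'
GROUP BY backend_type
ORDER BY backend_type;
```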
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
src/backend/storage/buffer/bufmgr.c | 111 ++++++++++++++++++++++----
src/backend/storage/buffer/freelist.c | 58 ++++++++++----
src/backend/storage/buffer/localbuf.c | 13 ++-
src/backend/storage/smgr/md.c | 24 ++++++
src/include/storage/buf_internals.h | 8 +-
src/include/storage/bufmgr.h | 7 +-
6 files changed, 185 insertions(+), 36 deletions(-)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 8075828e8a..ff12bc2ba6 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -481,8 +481,9 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
ForkNumber forkNum,
BlockNumber blockNum,
BufferAccessStrategy strategy,
- bool *foundPtr);
-static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+ bool *foundPtr, IOContext *io_context);
+static void FlushBuffer(BufferDesc *buf, SMgrRelation reln,
+ IOObject io_object, IOContext io_context);
static void FindAndDropRelationBuffers(RelFileLocator rlocator,
ForkNumber forkNum,
BlockNumber nForkBlock,
@@ -823,6 +824,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
BufferDesc *bufHdr;
Block bufBlock;
bool found;
+ IOContext io_context;
+ IOObject io_object;
bool isExtend;
bool isLocalBuf = SmgrIsTemp(smgr);
@@ -855,7 +858,14 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (isLocalBuf)
{
- bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
+ /*
+ * LocalBufferAlloc() will set the io_context to IOCONTEXT_NORMAL. We
+ * do not use a BufferAccessStrategy for I/O of temporary tables.
+ * However, in some cases, the "strategy" may not be NULL, so we can't
+ * rely on IOContextForStrategy() to set the right IOContext for us.
+ * This may happen in cases like CREATE TEMPORARY TABLE AS...
+ */
+ bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found, &io_context);
if (found)
pgBufferUsage.local_blks_hit++;
else if (isExtend)
@@ -871,7 +881,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* not currently in memory.
*/
bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum,
- strategy, &found);
+ strategy, &found, &io_context);
if (found)
pgBufferUsage.shared_blks_hit++;
else if (isExtend)
@@ -986,7 +996,16 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
*/
Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID)); /* spinlock not needed */
- bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
+ if (isLocalBuf)
+ {
+ bufBlock = LocalBufHdrGetBlock(bufHdr);
+ io_object = IOOBJECT_TEMP_RELATION;
+ }
+ else
+ {
+ bufBlock = BufHdrGetBlock(bufHdr);
+ io_object = IOOBJECT_RELATION;
+ }
if (isExtend)
{
@@ -995,6 +1014,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
/* don't set checksum for all-zero page */
smgrextend(smgr, forkNum, blockNum, (char *) bufBlock, false);
+ pgstat_count_io_op(io_object, io_context, IOOP_EXTEND);
+
/*
* NB: we're *not* doing a ScheduleBufferTagForWriteback here;
* although we're essentially performing a write. At least on linux
@@ -1020,6 +1041,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
smgrread(smgr, forkNum, blockNum, (char *) bufBlock);
+ pgstat_count_io_op(io_object, io_context, IOOP_READ);
+
if (track_io_timing)
{
INSTR_TIME_SET_CURRENT(io_time);
@@ -1113,14 +1136,19 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* *foundPtr is actually redundant with the buffer's BM_VALID flag, but
* we keep it for simplicity in ReadBuffer.
*
+ * io_context is passed as an output parameter to avoid calling
+ * IOContextForStrategy() when there is a shared buffers hit and no IO
+ * statistics need be captured.
+ *
* No locks are held either at entry or exit.
*/
static BufferDesc *
BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
BlockNumber blockNum,
BufferAccessStrategy strategy,
- bool *foundPtr)
+ bool *foundPtr, IOContext *io_context)
{
+ bool from_ring;
BufferTag newTag; /* identity of requested block */
uint32 newHash; /* hash value for newTag */
LWLock *newPartitionLock; /* buffer partition lock for it */
@@ -1172,8 +1200,11 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
{
/*
* If we get here, previous attempts to read the buffer must
- * have failed ... but we shall bravely try again.
+ * have failed ... but we shall bravely try again. Set
+ * io_context since we will in fact need to count an IO
+ * Operation.
*/
+ *io_context = IOContextForStrategy(strategy);
*foundPtr = false;
}
}
@@ -1187,6 +1218,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
*/
LWLockRelease(newPartitionLock);
+ *io_context = IOContextForStrategy(strategy);
+
/* Loop here in case we have to try another victim buffer */
for (;;)
{
@@ -1200,7 +1233,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* Select a victim buffer. The buffer is returned with its header
* spinlock still held!
*/
- buf = StrategyGetBuffer(strategy, &buf_state);
+ buf = StrategyGetBuffer(strategy, &buf_state, &from_ring);
Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
@@ -1254,7 +1287,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
UnlockBufHdr(buf, buf_state);
if (XLogNeedsFlush(lsn) &&
- StrategyRejectBuffer(strategy, buf))
+ StrategyRejectBuffer(strategy, buf, from_ring))
{
/* Drop lock/pin and loop around for another buffer */
LWLockRelease(BufferDescriptorGetContentLock(buf));
@@ -1269,7 +1302,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
smgr->smgr_rlocator.locator.dbOid,
smgr->smgr_rlocator.locator.relNumber);
- FlushBuffer(buf, NULL);
+ FlushBuffer(buf, NULL, IOOBJECT_RELATION, *io_context);
LWLockRelease(BufferDescriptorGetContentLock(buf));
ScheduleBufferTagForWriteback(&BackendWritebackContext,
@@ -1450,6 +1483,28 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
LWLockRelease(newPartitionLock);
+ if (oldFlags & BM_VALID)
+ {
+ /*
+ * When a BufferAccessStrategy is in use, blocks evicted from shared
+ * buffers are counted as IOOP_EVICT in the corresponding context
+ * (e.g. IOCONTEXT_BULKWRITE). Shared buffers are evicted by a
+ * strategy in two cases: 1) while initially claiming buffers for the
+ * strategy ring 2) to replace an existing strategy ring buffer
+ * because it is pinned or in use and cannot be reused.
+ *
+ * Blocks evicted from buffers already in the strategy ring are
+ * counted as IOOP_REUSE in the corresponding strategy context.
+ *
+ * At this point, we can accurately count evictions and reuses,
+ * because we have successfully claimed the valid buffer. Previously,
+ * we may have been forced to release the buffer due to concurrent
+ * pinners or erroring out.
+ */
+ pgstat_count_io_op(IOOBJECT_RELATION, *io_context,
+ from_ring ? IOOP_REUSE : IOOP_EVICT);
+ }
+
/*
* Buffer contents are currently invalid. Try to obtain the right to
* start I/O. If StartBufferIO returns false, then someone else managed
@@ -2570,7 +2625,7 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOOBJECT_RELATION, IOCONTEXT_NORMAL);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
@@ -2820,7 +2875,8 @@ BufferGetTag(Buffer buffer, RelFileLocator *rlocator, ForkNumber *forknum,
* as the second parameter. If not, pass NULL.
*/
static void
-FlushBuffer(BufferDesc *buf, SMgrRelation reln)
+FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
+ IOContext io_context)
{
XLogRecPtr recptr;
ErrorContextCallback errcallback;
@@ -2912,6 +2968,26 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
bufToWrite,
false);
+ /*
+ * When a strategy is in use, only flushes of dirty buffers already in the
+ * strategy ring are counted as strategy writes
+ * (IOCONTEXT_[BULKREAD|BULKWRITE|VACUUM] IOOP_WRITE) for the purpose of IO
+ * statistics tracking.
+ *
+ * If a shared buffer initially added to the ring must be flushed before
+ * being used, this is counted as an IOCONTEXT_NORMAL IOOP_WRITE.
+ *
+ * If a shared buffer which was added to the ring later because the
+ * current strategy buffer is pinned or in use or because all strategy
+ * buffers were dirty and rejected (for BAS_BULKREAD operations only)
+ * requires flushing, this is counted as an IOCONTEXT_NORMAL IOOP_WRITE
+ * (from_ring will be false).
+ *
+ * When a strategy is not in use, the write can only be a "regular" write
+ * of a dirty shared buffer (IOCONTEXT_NORMAL IOOP_WRITE).
+ */
+ pgstat_count_io_op(IOOBJECT_RELATION, io_context, IOOP_WRITE);
+
if (track_io_timing)
{
INSTR_TIME_SET_CURRENT(io_time);
@@ -3554,6 +3630,8 @@ FlushRelationBuffers(Relation rel)
buf_state &= ~(BM_DIRTY | BM_JUST_DIRTIED);
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+ pgstat_count_io_op(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_WRITE);
+
/* Pop the error context stack */
error_context_stack = errcallback.previous;
}
@@ -3586,7 +3664,7 @@
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, RelationGetSmgr(rel));
+ FlushBuffer(bufHdr, RelationGetSmgr(rel), IOOBJECT_RELATION, IOCONTEXT_NORMAL);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr);
}
@@ -3684,7 +3763,7 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, srelent->srel);
+ FlushBuffer(bufHdr, srelent->srel, IOOBJECT_RELATION, IOCONTEXT_NORMAL);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr);
}
@@ -3894,7 +3973,7 @@ FlushDatabaseBuffers(Oid dbid)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOOBJECT_RELATION, IOCONTEXT_NORMAL);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr);
}
@@ -3921,7 +4000,7 @@ FlushOneBuffer(Buffer buffer)
Assert(LWLockHeldByMe(BufferDescriptorGetContentLock(bufHdr)));
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOOBJECT_RELATION, IOCONTEXT_NORMAL);
}
/*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 7dec35801c..c690d5f15f 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -15,6 +15,7 @@
*/
#include "postgres.h"
+#include "pgstat.h"
#include "port/atomics.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
@@ -81,12 +82,6 @@ typedef struct BufferAccessStrategyData
*/
int current;
- /*
- * True if the buffer just returned by StrategyGetBuffer had been in the
- * ring already.
- */
- bool current_was_in_ring;
-
/*
* Array of buffer numbers. InvalidBuffer (that is, zero) indicates we
* have not yet selected a buffer for this ring slot. For allocation
@@ -198,13 +193,15 @@ have_free_buffer(void)
* return the buffer with the buffer header spinlock still held.
*/
BufferDesc *
-StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
+StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_ring)
{
BufferDesc *buf;
int bgwprocno;
int trycounter;
uint32 local_buf_state; /* to avoid repeated (de-)referencing */
+ *from_ring = false;
+
/*
* If given a strategy object, see whether it can select a buffer. We
* assume strategy objects don't need buffer_strategy_lock.
@@ -213,7 +210,10 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
{
buf = GetBufferFromRing(strategy, buf_state);
if (buf != NULL)
+ {
+ *from_ring = true;
return buf;
+ }
}
/*
@@ -602,7 +602,7 @@ FreeAccessStrategy(BufferAccessStrategy strategy)
/*
* GetBufferFromRing -- returns a buffer from the ring, or NULL if the
- * ring is empty.
+ * ring is empty / not usable.
*
* The bufhdr spin lock is held on the returned buffer.
*/
@@ -625,10 +625,7 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
*/
bufnum = strategy->buffers[strategy->current];
if (bufnum == InvalidBuffer)
- {
- strategy->current_was_in_ring = false;
return NULL;
- }
/*
* If the buffer is pinned we cannot use it under any circumstances.
@@ -644,7 +641,6 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
if (BUF_STATE_GET_REFCOUNT(local_buf_state) == 0
&& BUF_STATE_GET_USAGECOUNT(local_buf_state) <= 1)
{
- strategy->current_was_in_ring = true;
*buf_state = local_buf_state;
return buf;
}
@@ -654,7 +650,6 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
* Tell caller to allocate a new buffer with the normal allocation
* strategy. He'll then replace this ring element via AddBufferToRing.
*/
- strategy->current_was_in_ring = false;
return NULL;
}
@@ -670,6 +665,39 @@ AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
strategy->buffers[strategy->current] = BufferDescriptorGetBuffer(buf);
}
+/*
+ * Utility function returning the IOContext of a given BufferAccessStrategy's
+ * strategy ring.
+ */
+IOContext
+IOContextForStrategy(BufferAccessStrategy strategy)
+{
+ if (!strategy)
+ return IOCONTEXT_NORMAL;
+
+ switch (strategy->btype)
+ {
+ case BAS_NORMAL:
+
+ /*
+ * Currently, GetAccessStrategy() returns NULL for
+ * BufferAccessStrategyType BAS_NORMAL, so this case is
+ * unreachable.
+ */
+ pg_unreachable();
+ return IOCONTEXT_NORMAL;
+ case BAS_BULKREAD:
+ return IOCONTEXT_BULKREAD;
+ case BAS_BULKWRITE:
+ return IOCONTEXT_BULKWRITE;
+ case BAS_VACUUM:
+ return IOCONTEXT_VACUUM;
+ }
+
+ elog(ERROR, "unrecognized BufferAccessStrategyType: %d", strategy->btype);
+ pg_unreachable();
+}
+
/*
* StrategyRejectBuffer -- consider rejecting a dirty buffer
*
@@ -682,14 +710,14 @@ AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
* if this buffer should be written and re-used.
*/
bool
-StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
+StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool from_ring)
{
/* We only do this in bulkread mode */
if (strategy->btype != BAS_BULKREAD)
return false;
/* Don't muck with behavior of normal buffer-replacement strategy */
- if (!strategy->current_was_in_ring ||
+ if (!from_ring ||
strategy->buffers[strategy->current] != BufferDescriptorGetBuffer(buf))
return false;
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 8372acc383..8e286db5df 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -18,6 +18,7 @@
#include "access/parallel.h"
#include "catalog/catalog.h"
#include "executor/instrument.h"
+#include "pgstat.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "utils/guc_hooks.h"
@@ -107,7 +108,7 @@ PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
*/
BufferDesc *
LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
- bool *foundPtr)
+ bool *foundPtr, IOContext *io_context)
{
BufferTag newTag; /* identity of requested block */
LocalBufferLookupEnt *hresult;
@@ -127,6 +128,14 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
hresult = (LocalBufferLookupEnt *)
hash_search(LocalBufHash, (void *) &newTag, HASH_FIND, NULL);
+ /*
+ * IO Operations on local buffers are only done in IOCONTEXT_NORMAL. Set
+ * io_context here (instead of after a buffer hit would have returned) for
+ * convenience since we don't have to worry about the overhead of calling
+ * IOContextForStrategy().
+ */
+ *io_context = IOCONTEXT_NORMAL;
+
if (hresult)
{
b = hresult->id;
@@ -230,6 +239,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
buf_state &= ~BM_DIRTY;
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+ pgstat_count_io_op(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_WRITE);
pgBufferUsage.local_blks_written++;
}
@@ -256,6 +266,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
ClearBufferTag(&bufHdr->tag);
buf_state &= ~(BM_VALID | BM_TAG_VALID);
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+ pgstat_count_io_op(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_EVICT);
}
hresult = (LocalBufferLookupEnt *)
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 60c9905eff..8da813600c 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -983,6 +983,15 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
{
MdfdVec *v = &reln->md_seg_fds[forknum][segno - 1];
+ /*
+ * fsyncs done through mdimmedsync() should be tracked in a separate
+ * IOContext than those done through mdsyncfiletag() to differentiate
+ * between unavoidable client backend fsyncs (e.g. those done during
+ * index build) and those which ideally would have been done by the
+ * checkpointer. Since other IO operations bypassing the buffer
+ * manager could also be tracked in such an IOContext, wait until
+ * these are also tracked to track immediate fsyncs.
+ */
if (FileSync(v->mdfd_vfd, WAIT_EVENT_DATA_FILE_IMMEDIATE_SYNC) < 0)
ereport(data_sync_elevel(ERROR),
(errcode_for_file_access(),
@@ -1021,6 +1030,19 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false /* retryOnError */ ))
{
+ /*
+ * We have no way of knowing if the current IOContext is
+ * IOCONTEXT_NORMAL or IOCONTEXT_[BULKREAD, BULKWRITE, VACUUM] at this
+ * point, so count the fsync as being in the IOCONTEXT_NORMAL
+ * IOContext. This is probably okay, because the number of backend
+ * fsyncs doesn't say anything about the efficacy of the
+ * BufferAccessStrategy. And counting both fsyncs done in
+ * IOCONTEXT_NORMAL and IOCONTEXT_[BULKREAD, BULKWRITE, VACUUM] under
+ * IOCONTEXT_NORMAL is likely clearer when investigating the number of
+ * backend fsyncs.
+ */
+ pgstat_count_io_op(IOOBJECT_RELATION, IOCONTEXT_NORMAL, IOOP_FSYNC);
+
ereport(DEBUG1,
(errmsg_internal("could not forward fsync request because request queue is full")));
@@ -1410,6 +1432,8 @@ mdsyncfiletag(const FileTag *ftag, char *path)
if (need_to_close)
FileClose(file);
+ pgstat_count_io_op(IOOBJECT_RELATION, IOCONTEXT_NORMAL, IOOP_FSYNC);
+
errno = save_errno;
return result;
}
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index ed8aa2519c..0b44814740 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -15,6 +15,7 @@
#ifndef BUFMGR_INTERNALS_H
#define BUFMGR_INTERNALS_H
+#include "pgstat.h"
#include "port/atomics.h"
#include "storage/buf.h"
#include "storage/bufmgr.h"
@@ -391,11 +392,12 @@ extern void IssuePendingWritebacks(WritebackContext *context);
extern void ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *tag);
/* freelist.c */
+extern IOContext IOContextForStrategy(BufferAccessStrategy bas);
extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
- uint32 *buf_state);
+ uint32 *buf_state, bool *from_ring);
extern void StrategyFreeBuffer(BufferDesc *buf);
extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
- BufferDesc *buf);
+ BufferDesc *buf, bool from_ring);
extern int StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
extern void StrategyNotifyBgWriter(int bgwprocno);
@@ -417,7 +419,7 @@ extern PrefetchBufferResult PrefetchLocalBuffer(SMgrRelation smgr,
ForkNumber forkNum,
BlockNumber blockNum);
extern BufferDesc *LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum,
- BlockNumber blockNum, bool *foundPtr);
+ BlockNumber blockNum, bool *foundPtr, IOContext *io_context);
extern void MarkLocalBufferDirty(Buffer buffer);
extern void DropRelationLocalBuffers(RelFileLocator rlocator,
ForkNumber forkNum,
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 33eadbc129..b8a18b8081 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -23,7 +23,12 @@
typedef void *Block;
-/* Possible arguments for GetAccessStrategy() */
+/*
+ * Possible arguments for GetAccessStrategy().
+ *
+ * If adding a new BufferAccessStrategyType, also add a new IOContext so
+ * IO statistics using this strategy are tracked.
+ */
typedef enum BufferAccessStrategyType
{
BAS_NORMAL, /* Normal random access */
--
2.34.1
On Thu, Jan 19, 2023 at 4:28 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
On Thu, Jan 19, 2023 at 6:18 AM vignesh C <vignesh21@gmail.com> wrote:
The patch does not apply on top of HEAD as in [1], please post a rebased patch:
=== Applying patches on top of PostgreSQL commit ID
4f74f5641d53559ec44e74d5bf552e167fdd5d20 ===
=== applying patch
./v49-0003-Add-system-view-tracking-IO-ops-per-backend-type.patch
....
patching file src/test/regress/expected/rules.out
Hunk #1 FAILED at 1876.
1 out of 1 hunk FAILED -- saving rejects to file
src/test/regress/expected/rules.out.rej

Yes, it conflicted with 47bb9db75996232. rebased v50 is attached.
Oh dear-- an extra FlushBuffer() snuck in there somehow.
Removed it in attached v51.
Also, I fixed an issue in my tablespace.sql updates.
- Melanie
Attachments:
v51-0005-pg_stat_io-documentation.patch (text/x-patch)
From f477dfb566a47350bc78dcfb925db429e79fd657 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 17 Jan 2023 16:34:27 -0500
Subject: [PATCH v51 5/5] pg_stat_io documentation
Author: Melanie Plageman <melanieplageman@gmail.com>
Author: Samay Sharma <smilingsamay@gmail.com>
Reviewed-by: Maciek Sakrejda <m.sakrejda@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
doc/src/sgml/monitoring.sgml | 321 +++++++++++++++++++++++++++++++++--
1 file changed, 307 insertions(+), 14 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index c47d057a1d..2f4e6e89bc 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -469,6 +469,16 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</entry>
</row>
+ <row>
+ <entry><structname>pg_stat_io</structname><indexterm><primary>pg_stat_io</primary></indexterm></entry>
+ <entry>
+ One row for each combination of backend type, context, and target object
+ containing cluster-wide I/O statistics.
+ See <link linkend="monitoring-pg-stat-io-view">
+ <structname>pg_stat_io</structname></link> for details.
+ </entry>
+ </row>
+
<row>
<entry><structname>pg_stat_replication_slots</structname><indexterm><primary>pg_stat_replication_slots</primary></indexterm></entry>
<entry>One row per replication slot, showing statistics about the
@@ -665,20 +675,16 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</para>
<para>
- The <structname>pg_statio_</structname> views are primarily useful to
- determine the effectiveness of the buffer cache. When the number
- of actual disk reads is much smaller than the number of buffer
- hits, then the cache is satisfying most read requests without
- invoking a kernel call. However, these statistics do not give the
- entire story: due to the way in which <productname>PostgreSQL</productname>
- handles disk I/O, data that is not in the
- <productname>PostgreSQL</productname> buffer cache might still reside in the
- kernel's I/O cache, and might therefore still be fetched without
- requiring a physical read. Users interested in obtaining more
- detailed information on <productname>PostgreSQL</productname> I/O behavior are
- advised to use the <productname>PostgreSQL</productname> statistics views
- in combination with operating system utilities that allow insight
- into the kernel's handling of I/O.
+ The <structname>pg_stat_io</structname> and
+ <structname>pg_statio_</structname> set of views are useful for determining
+ the effectiveness of the buffer cache. They can be used to calculate a cache
+ hit ratio. Note that while <productname>PostgreSQL</productname>'s I/O
+ statistics capture most instances in which the kernel was invoked in order
+ to perform I/O, they do not differentiate between data which had to be
+ fetched from disk and that which already resided in the kernel page cache.
+ Users are advised to use the <productname>PostgreSQL</productname>
+ statistics views in combination with operating system utilities for a more
+ complete picture of their database's I/O performance.
</para>
</sect2>
@@ -3659,6 +3665,293 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
<structfield>last_archived_wal</structfield> have also been successfully
archived.
</para>
+ </sect2>
+
+ <sect2 id="monitoring-pg-stat-io-view">
+ <title><structname>pg_stat_io</structname></title>
+
+ <indexterm>
+ <primary>pg_stat_io</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_stat_io</structname> view will contain one row for each
+ combination of backend type, target I/O object, and I/O context, showing
+ cluster-wide I/O statistics. Combinations which do not make sense are
+ omitted.
+ </para>
+
+ <para>
+ Currently, I/O on relations (e.g. tables, indexes) is tracked. However,
+ relation I/O which bypasses shared buffers (e.g. when moving a table from one
+ tablespace to another) is currently not tracked.
+ </para>
+
+ <table id="pg-stat-io-view" xreflabel="pg_stat_io">
+ <title><structname>pg_stat_io</structname> View</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para>
+ </entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>backend_type</structfield> <type>text</type>
+ </para>
+ <para>
+ Type of backend (e.g. background worker, autovacuum worker). See <link
+ linkend="monitoring-pg-stat-activity-view">
+ <structname>pg_stat_activity</structname></link> for more information
+ on <varname>backend_type</varname>s. Some
+ <varname>backend_type</varname>s do not accumulate I/O operation
+ statistics and will not be included in the view.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>io_object</structfield> <type>text</type>
+ </para>
+ <para>
+ Target object of an I/O operation. Possible values are:
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>relation</literal>: Permanent relations.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>temp relation</literal>: Temporary relations.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>io_context</structfield> <type>text</type>
+ </para>
+ <para>
+ The context of an I/O operation. Possible values are:
+ </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>normal</literal>: The default or standard
+ <varname>io_context</varname> for a type of I/O operation. For
+ example, by default, relation data is read into and written out from
+ shared buffers. Thus, reads and writes of relation data to and from
+ shared buffers are tracked in <varname>io_context</varname>
+ <literal>normal</literal>.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>vacuum</literal>: I/O operations performed outside of shared
+ buffers while vacuuming and analyzing permanent relations. Temporary
+ table vacuums use the same local buffer pool as other temporary table
+ IO operations and are tracked in <varname>io_context</varname>
+ <literal>normal</literal>.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>bulkread</literal>: Certain large read I/O operations
+ done outside of shared buffers, for example, a sequential scan of a
+ large table.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>bulkwrite</literal>: Certain large write I/O operations
+ done outside of shared buffers, such as <command>COPY</command>.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>reads</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of read operations, each of the size specified in
+ <varname>op_bytes</varname>.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>writes</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of write operations, each of the size specified in
+ <varname>op_bytes</varname>.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>extends</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of relation extend operations, each of the size specified in
+ <varname>op_bytes</varname>.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>op_bytes</structfield> <type>bigint</type>
+ </para>
+ <para>
+ The number of bytes per unit of I/O read, written, or extended.
+ </para>
+ <para>
+ Relation data reads, writes, and extends are done in
+ <varname>block_size</varname> units, derived from the build-time
+ parameter <symbol>BLCKSZ</symbol>, which is <literal>8192</literal> by
+ default.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>evictions</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of times a block has been written out from a shared or local
+ buffer in order to make it available for another use.
+ </para>
+ <para>
+ In <varname>io_context</varname> <literal>normal</literal>, this counts
+ the number of times a block was evicted from a buffer and replaced with
+ another block. In <varname>io_context</varname>s
+ <literal>bulkwrite</literal>, <literal>bulkread</literal>, and
+ <literal>vacuum</literal>, this counts the number of times a block was
+ evicted from shared buffers in order to add the shared buffer to a
+ separate, size-limited ring buffer for use in a bulk I/O operation.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>reuses</structfield> <type>bigint</type>
+ </para>
+ <para>
+ The number of times an existing buffer in a size-limited ring buffer
+ outside of shared buffers was reused as part of an I/O operation in the
+ <literal>bulkread</literal>, <literal>bulkwrite</literal>, or
+ <literal>vacuum</literal> <varname>io_context</varname>s.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>fsyncs</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of <literal>fsync</literal> calls. These are only tracked in
+ <varname>io_context</varname> <literal>normal</literal>.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
+ </para>
+ <para>
+ Time at which these statistics were last reset.
+ </para>
+ </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ Some backend types never perform I/O operations on some I/O objects and/or
+ in some I/O contexts. These rows are omitted from the view. For example, the
+ checkpointer does not checkpoint temporary tables, so there will be no rows
+ for <varname>backend_type</varname> <literal>checkpointer</literal> and
+ <varname>io_object</varname> <literal>temp relation</literal>.
+ </para>
+
+ <para>
+ In addition, some I/O operations will never be performed either by certain
+ backend types or on certain I/O objects and/or in certain I/O contexts.
+ These cells will be NULL. For example, temporary tables are not
+ <literal>fsync</literal>ed, so <varname>fsyncs</varname> will be NULL for
+ <varname>io_object</varname> <literal>temp relation</literal>. Also, the
+ background writer does not perform reads, so <varname>reads</varname> will
+ be NULL in rows for <varname>backend_type</varname> <literal>background
+ writer</literal>.
+ </para>
+
+ <para>
+ <structname>pg_stat_io</structname> can be used to inform database tuning.
+ For example:
+ <itemizedlist>
+ <listitem>
+ <para>
+ A high <varname>evictions</varname> count can indicate that shared
+ buffers should be increased.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ Client backends rely on the checkpointer to ensure data is persisted to
+ permanent storage. Large numbers of <varname>fsyncs</varname> by
+ <literal>client backend</literal>s could indicate a misconfiguration of
+ shared buffers or of the checkpointer. More information on configuring
+ the checkpointer can be found in <xref linkend="wal-configuration"/>.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ Normally, client backends should be able to rely on auxiliary processes
+ like the checkpointer and the background writer to write out dirty data
+ as much as possible. Large numbers of writes by client backends could
+ indicate a misconfiguration of shared buffers or of the checkpointer.
+ More information on configuring the checkpointer can be found in <xref
+ linkend="wal-configuration"/>.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </para>
+
</sect2>
--
2.34.1
v51-0001-Create-regress_tblspc-in-test_setup.patch (text/x-patch)
From bc2a91fbe68180d47179388b8303badcfdd5259c Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Thu, 19 Jan 2023 15:45:54 -0500
Subject: [PATCH v51 1/5] Create regress_tblspc in test_setup
Other tests may want to use a tablespace. Now that we have
allow_in_place_tablespaces, create the main tablespace in test_setup and
move the tablespace test to the end of the parallel schedule.
Also modify the tablespace tests to check for the presence or absence of
specific relations rather than of any objects at all, since other tests
may now leave objects behind in the tablespace.
---
src/test/regress/expected/tablespace.out | 65 +++++++++++++-----------
src/test/regress/expected/test_setup.out | 3 ++
src/test/regress/parallel_schedule | 9 ++--
src/test/regress/sql/tablespace.sql | 42 ++++++++++-----
src/test/regress/sql/test_setup.sql | 4 ++
5 files changed, 75 insertions(+), 48 deletions(-)
diff --git a/src/test/regress/expected/tablespace.out b/src/test/regress/expected/tablespace.out
index c52cf1cfcf..007c00bfff 100644
--- a/src/test/regress/expected/tablespace.out
+++ b/src/test/regress/expected/tablespace.out
@@ -22,8 +22,6 @@ SELECT spcoptions FROM pg_tablespace WHERE spcname = 'regress_tblspacewith';
-- drop the tablespace so we can re-use the location
DROP TABLESPACE regress_tblspacewith;
--- create a tablespace we can use
-CREATE TABLESPACE regress_tblspace LOCATION '';
-- This returns a relative path as of an effect of allow_in_place_tablespaces,
-- masking the tablespace OID used in the path name.
SELECT regexp_replace(pg_tablespace_location(oid), '(pg_tblspc)/(\d+)', '\1/NNN')
@@ -83,11 +81,14 @@ REINDEX (TABLESPACE regress_tblspace) INDEX regress_tblspace_test_tbl_idx;
REINDEX (TABLESPACE regress_tblspace) TABLE regress_tblspace_test_tbl;
ROLLBACK;
-- no relation moved to the new tablespace
-SELECT c.relname FROM pg_class c, pg_tablespace s
+SELECT c.relname <> 'regress_tblspace_test_tbl_idx',
+ c.relname <> 'regress_tblspace_test_tbl'
+ FROM pg_class c, pg_tablespace s
WHERE c.reltablespace = s.oid AND s.spcname = 'regress_tblspace';
- relname
----------
-(0 rows)
+ ?column? | ?column?
+----------+----------
+ t | t
+(1 row)
-- check that all indexes are moved to a new tablespace with different
-- relfilenode.
@@ -102,40 +103,46 @@ SELECT relfilenode as toast_filenode FROM pg_class
WHERE i.indrelid = c.reltoastrelid AND
c.relname = 'regress_tblspace_test_tbl') \gset
REINDEX (TABLESPACE regress_tblspace) TABLE regress_tblspace_test_tbl;
-SELECT c.relname FROM pg_class c, pg_tablespace s
- WHERE c.reltablespace = s.oid AND s.spcname = 'regress_tblspace'
- ORDER BY c.relname;
- relname
--------------------------------
- regress_tblspace_test_tbl_idx
+SELECT 'regress_tblspace_test_tbl_idx' IN
+ (SELECT c.relname
+ FROM pg_class c, pg_tablespace s
+ WHERE c.reltablespace = s.oid AND s.spcname = 'regress_tblspace');
+ ?column?
+----------
+ t
(1 row)
ALTER TABLE regress_tblspace_test_tbl SET TABLESPACE regress_tblspace;
ALTER TABLE regress_tblspace_test_tbl SET TABLESPACE pg_default;
-SELECT c.relname FROM pg_class c, pg_tablespace s
- WHERE c.reltablespace = s.oid AND s.spcname = 'regress_tblspace'
- ORDER BY c.relname;
- relname
--------------------------------
- regress_tblspace_test_tbl_idx
+SELECT 'regress_tblspace_test_tbl_idx' IN
+ (SELECT c.relname
+ FROM pg_class c, pg_tablespace s
+ WHERE c.reltablespace = s.oid AND s.spcname = 'regress_tblspace');
+ ?column?
+----------
+ t
(1 row)
-- Move back to the default tablespace.
ALTER INDEX regress_tblspace_test_tbl_idx SET TABLESPACE pg_default;
-SELECT c.relname FROM pg_class c, pg_tablespace s
+SELECT c.relname <> 'regress_tblspace_test_tbl_idx',
+ c.relname <> 'regress_tblspace_test_tbl'
+ FROM pg_class c, pg_tablespace s
WHERE c.reltablespace = s.oid AND s.spcname = 'regress_tblspace'
ORDER BY c.relname;
- relname
----------
-(0 rows)
+ ?column? | ?column?
+----------+----------
+ t | t
+(1 row)
REINDEX (TABLESPACE regress_tblspace, CONCURRENTLY) TABLE regress_tblspace_test_tbl;
-SELECT c.relname FROM pg_class c, pg_tablespace s
- WHERE c.reltablespace = s.oid AND s.spcname = 'regress_tblspace'
- ORDER BY c.relname;
- relname
--------------------------------
- regress_tblspace_test_tbl_idx
+SELECT 'regress_tblspace_test_tbl_idx' IN
+ (SELECT c.relname
+ FROM pg_class c, pg_tablespace s
+ WHERE c.reltablespace = s.oid AND s.spcname = 'regress_tblspace');
+ ?column?
+----------
+ t
(1 row)
SELECT relfilenode = :main_filenode AS main_same FROM pg_class
@@ -331,7 +338,7 @@ CREATE TABLE testschema.part1 PARTITION OF testschema.part FOR VALUES IN (1);
CREATE INDEX part_a_idx ON testschema.part (a) TABLESPACE regress_tblspace;
CREATE TABLE testschema.part2 PARTITION OF testschema.part FOR VALUES IN (2);
SELECT relname, spcname FROM pg_catalog.pg_tablespace t, pg_catalog.pg_class c
- where c.reltablespace = t.oid AND c.relname LIKE 'part%_idx';
+ where c.reltablespace = t.oid AND c.relname LIKE 'part%_idx' ORDER BY relname;
relname | spcname
-------------+------------------
part1_a_idx | regress_tblspace
diff --git a/src/test/regress/expected/test_setup.out b/src/test/regress/expected/test_setup.out
index 391b36d131..4f54fe20ec 100644
--- a/src/test/regress/expected/test_setup.out
+++ b/src/test/regress/expected/test_setup.out
@@ -18,6 +18,9 @@ SET synchronous_commit = on;
-- and most of the core regression tests still expect that.
--
GRANT ALL ON SCHEMA public TO public;
+-- Create a tablespace we can use in tests.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE regress_tblspace LOCATION '';
--
-- These tables have traditionally been referenced by many tests,
-- so create and populate them. Insert only non-error values here.
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index a930dfe48c..15e015b3d6 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -11,11 +11,6 @@
# required setup steps
test: test_setup
-# run tablespace by itself, and early, because it forces a checkpoint;
-# we'd prefer not to have checkpoints later in the tests because that
-# interferes with crash-recovery testing.
-test: tablespace
-
# ----------
# The first group of parallel tests
# ----------
@@ -132,3 +127,7 @@ test: event_trigger oidjoins
# this test also uses event triggers, so likewise run it by itself
test: fast_default
+
+# run tablespace test at the end because it drops the tablespace created during
+# setup that other tests may use.
+test: tablespace
diff --git a/src/test/regress/sql/tablespace.sql b/src/test/regress/sql/tablespace.sql
index 21db433f2a..58a279e2f9 100644
--- a/src/test/regress/sql/tablespace.sql
+++ b/src/test/regress/sql/tablespace.sql
@@ -20,8 +20,6 @@ SELECT spcoptions FROM pg_tablespace WHERE spcname = 'regress_tblspacewith';
-- drop the tablespace so we can re-use the location
DROP TABLESPACE regress_tblspacewith;
--- create a tablespace we can use
-CREATE TABLESPACE regress_tblspace LOCATION '';
-- This returns a relative path as of an effect of allow_in_place_tablespaces,
-- masking the tablespace OID used in the path name.
SELECT regexp_replace(pg_tablespace_location(oid), '(pg_tblspc)/(\d+)', '\1/NNN')
@@ -66,7 +64,9 @@ REINDEX (TABLESPACE regress_tblspace) INDEX regress_tblspace_test_tbl_idx;
REINDEX (TABLESPACE regress_tblspace) TABLE regress_tblspace_test_tbl;
ROLLBACK;
-- no relation moved to the new tablespace
-SELECT c.relname FROM pg_class c, pg_tablespace s
+SELECT c.relname <> 'regress_tblspace_test_tbl_idx',
+ c.relname <> 'regress_tblspace_test_tbl'
+ FROM pg_class c, pg_tablespace s
WHERE c.reltablespace = s.oid AND s.spcname = 'regress_tblspace';
-- check that all indexes are moved to a new tablespace with different
@@ -74,6 +74,7 @@ SELECT c.relname FROM pg_class c, pg_tablespace s
-- Save first the existing relfilenode for the toast and main relations.
SELECT relfilenode as main_filenode FROM pg_class
WHERE relname = 'regress_tblspace_test_tbl_idx' \gset
+
SELECT relfilenode as toast_filenode FROM pg_class
WHERE oid =
(SELECT i.indexrelid
@@ -81,24 +82,37 @@ SELECT relfilenode as toast_filenode FROM pg_class
pg_index i
WHERE i.indrelid = c.reltoastrelid AND
c.relname = 'regress_tblspace_test_tbl') \gset
+
REINDEX (TABLESPACE regress_tblspace) TABLE regress_tblspace_test_tbl;
-SELECT c.relname FROM pg_class c, pg_tablespace s
- WHERE c.reltablespace = s.oid AND s.spcname = 'regress_tblspace'
- ORDER BY c.relname;
+
+SELECT 'regress_tblspace_test_tbl_idx' IN
+ (SELECT c.relname
+ FROM pg_class c, pg_tablespace s
+ WHERE c.reltablespace = s.oid AND s.spcname = 'regress_tblspace');
+
ALTER TABLE regress_tblspace_test_tbl SET TABLESPACE regress_tblspace;
ALTER TABLE regress_tblspace_test_tbl SET TABLESPACE pg_default;
-SELECT c.relname FROM pg_class c, pg_tablespace s
- WHERE c.reltablespace = s.oid AND s.spcname = 'regress_tblspace'
- ORDER BY c.relname;
+
+SELECT 'regress_tblspace_test_tbl_idx' IN
+ (SELECT c.relname
+ FROM pg_class c, pg_tablespace s
+ WHERE c.reltablespace = s.oid AND s.spcname = 'regress_tblspace');
+
-- Move back to the default tablespace.
ALTER INDEX regress_tblspace_test_tbl_idx SET TABLESPACE pg_default;
-SELECT c.relname FROM pg_class c, pg_tablespace s
+SELECT c.relname <> 'regress_tblspace_test_tbl_idx',
+ c.relname <> 'regress_tblspace_test_tbl'
+ FROM pg_class c, pg_tablespace s
WHERE c.reltablespace = s.oid AND s.spcname = 'regress_tblspace'
ORDER BY c.relname;
+
REINDEX (TABLESPACE regress_tblspace, CONCURRENTLY) TABLE regress_tblspace_test_tbl;
-SELECT c.relname FROM pg_class c, pg_tablespace s
- WHERE c.reltablespace = s.oid AND s.spcname = 'regress_tblspace'
- ORDER BY c.relname;
+
+SELECT 'regress_tblspace_test_tbl_idx' IN
+ (SELECT c.relname
+ FROM pg_class c, pg_tablespace s
+ WHERE c.reltablespace = s.oid AND s.spcname = 'regress_tblspace');
+
SELECT relfilenode = :main_filenode AS main_same FROM pg_class
WHERE relname = 'regress_tblspace_test_tbl_idx';
SELECT relfilenode = :toast_filenode as toast_same FROM pg_class
@@ -225,7 +239,7 @@ CREATE TABLE testschema.part1 PARTITION OF testschema.part FOR VALUES IN (1);
CREATE INDEX part_a_idx ON testschema.part (a) TABLESPACE regress_tblspace;
CREATE TABLE testschema.part2 PARTITION OF testschema.part FOR VALUES IN (2);
SELECT relname, spcname FROM pg_catalog.pg_tablespace t, pg_catalog.pg_class c
- where c.reltablespace = t.oid AND c.relname LIKE 'part%_idx';
+ where c.reltablespace = t.oid AND c.relname LIKE 'part%_idx' ORDER BY relname;
\d testschema.part
\d+ testschema.part
\d testschema.part1
diff --git a/src/test/regress/sql/test_setup.sql b/src/test/regress/sql/test_setup.sql
index 02c0c84c3a..8439b38d21 100644
--- a/src/test/regress/sql/test_setup.sql
+++ b/src/test/regress/sql/test_setup.sql
@@ -23,6 +23,10 @@ SET synchronous_commit = on;
--
GRANT ALL ON SCHEMA public TO public;
+-- Create a tablespace we can use in tests.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE regress_tblspace LOCATION '';
+
--
-- These tables have traditionally been referenced by many tests,
-- so create and populate them. Insert only non-error values here.
--
2.34.1
Attachment: v51-0002-pgstat-Infrastructure-to-track-IO-operations.patch (text/x-patch; charset=US-ASCII)
From 1921e23c4650f497266c0f73114248a57e58778e Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 17 Jan 2023 16:10:34 -0500
Subject: [PATCH v51 2/5] pgstat: Infrastructure to track IO operations
Introduce "IOOp", an IO operation done by a backend, "IOObject", the
target object of the IO, and "IOContext", the context or location of the
IO operations on that object. For example, the checkpointer may write a
shared buffer out. This would be considered an IOOP_WRITE IOOp on an
IOOBJECT_RELATION IOObject in the IOCONTEXT_NORMAL IOContext by
BackendType B_CHECKPOINTER.
Each BackendType counts IOOps (evict, extend, fsync, read, reuse, and
write) per IOObject (relation, temp relation) per IOContext (normal,
bulkread, bulkwrite, or vacuum) through a call to pgstat_count_io_op().
Note that this commit introduces the infrastructure to count IO
Operation statistics. A subsequent commit will add calls to
pgstat_count_io_op() in the appropriate locations.
IOObject IOOBJECT_TEMP_RELATION concerns IO Operations on buffers
containing temporary table data, while IOObject IOOBJECT_RELATION
concerns IO Operations on buffers containing permanent relation data.
IOContext IOCONTEXT_NORMAL concerns operations on local and shared
buffers, while IOCONTEXT_BULKREAD, IOCONTEXT_BULKWRITE, and
IOCONTEXT_VACUUM IOContexts concern IO operations on buffers as part of
a BufferAccessStrategy.
Stats on IOOps on all IOObjects in all IOContexts for a given backend
are first counted in a backend's local memory and then flushed to shared
memory and accumulated with those from all other backends, exited and
live.
Some BackendTypes do not flush their pending statistics at regular
intervals, so they explicitly call pgstat_flush_io_ops() during the
course of normal operations in order to flush their backend-local IO
operation statistics to shared memory in a timely manner.
Because not all BackendType, IOObject, IOContext, IOOp combinations are
valid, the validity of the stats is checked before flushing pending
stats and before reading in the existing stats file to shared memory.
The aggregated stats in shared memory could be extended in the future
with per-backend stats -- useful for per-connection IO statistics and
monitoring.
PGSTAT_FILE_FORMAT_ID should be bumped with this commit.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
doc/src/sgml/monitoring.sgml | 2 +
src/backend/utils/activity/Makefile | 1 +
src/backend/utils/activity/meson.build | 1 +
src/backend/utils/activity/pgstat.c | 26 ++
src/backend/utils/activity/pgstat_bgwriter.c | 7 +-
.../utils/activity/pgstat_checkpointer.c | 7 +-
src/backend/utils/activity/pgstat_io.c | 386 ++++++++++++++++++
src/backend/utils/activity/pgstat_relation.c | 15 +-
src/backend/utils/activity/pgstat_shmem.c | 4 +
src/backend/utils/activity/pgstat_wal.c | 4 +-
src/backend/utils/adt/pgstatfuncs.c | 11 +-
src/include/miscadmin.h | 2 +
src/include/pgstat.h | 68 +++
src/include/utils/pgstat_internal.h | 30 ++
src/tools/pgindent/typedefs.list | 6 +
15 files changed, 563 insertions(+), 7 deletions(-)
create mode 100644 src/backend/utils/activity/pgstat_io.c
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index e3a783abd0..c47d057a1d 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -5434,6 +5434,8 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
the <structname>pg_stat_bgwriter</structname>
view, <literal>archiver</literal> to reset all the counters shown in
the <structname>pg_stat_archiver</structname> view,
+ <literal>io</literal> to reset all the counters shown in the
+ <structname>pg_stat_io</structname> view,
<literal>wal</literal> to reset all the counters shown in the
<structname>pg_stat_wal</structname> view or
<literal>recovery_prefetch</literal> to reset all the counters shown
diff --git a/src/backend/utils/activity/Makefile b/src/backend/utils/activity/Makefile
index a80eda3cf4..7d7482dde0 100644
--- a/src/backend/utils/activity/Makefile
+++ b/src/backend/utils/activity/Makefile
@@ -22,6 +22,7 @@ OBJS = \
pgstat_checkpointer.o \
pgstat_database.o \
pgstat_function.o \
+ pgstat_io.o \
pgstat_relation.o \
pgstat_replslot.o \
pgstat_shmem.o \
diff --git a/src/backend/utils/activity/meson.build b/src/backend/utils/activity/meson.build
index a2b872c24b..518ee3f798 100644
--- a/src/backend/utils/activity/meson.build
+++ b/src/backend/utils/activity/meson.build
@@ -9,6 +9,7 @@ backend_sources += files(
'pgstat_checkpointer.c',
'pgstat_database.c',
'pgstat_function.c',
+ 'pgstat_io.c',
'pgstat_relation.c',
'pgstat_replslot.c',
'pgstat_shmem.c',
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 0fa5370bcd..60fc4e761f 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -72,6 +72,7 @@
* - pgstat_checkpointer.c
* - pgstat_database.c
* - pgstat_function.c
+ * - pgstat_io.c
* - pgstat_relation.c
* - pgstat_replslot.c
* - pgstat_slru.c
@@ -359,6 +360,15 @@ static const PgStat_KindInfo pgstat_kind_infos[PGSTAT_NUM_KINDS] = {
.snapshot_cb = pgstat_checkpointer_snapshot_cb,
},
+ [PGSTAT_KIND_IO] = {
+ .name = "io",
+
+ .fixed_amount = true,
+
+ .reset_all_cb = pgstat_io_reset_all_cb,
+ .snapshot_cb = pgstat_io_snapshot_cb,
+ },
+
[PGSTAT_KIND_SLRU] = {
.name = "slru",
@@ -582,6 +592,7 @@ pgstat_report_stat(bool force)
/* Don't expend a clock check if nothing to do */
if (dlist_is_empty(&pgStatPending) &&
+ !have_iostats &&
!have_slrustats &&
!pgstat_have_pending_wal())
{
@@ -628,6 +639,9 @@ pgstat_report_stat(bool force)
/* flush database / relation / function / ... stats */
partial_flush |= pgstat_flush_pending_entries(nowait);
+ /* flush IO stats */
+ partial_flush |= pgstat_flush_io(nowait);
+
/* flush wal stats */
partial_flush |= pgstat_flush_wal(nowait);
@@ -1322,6 +1336,12 @@ pgstat_write_statsfile(void)
pgstat_build_snapshot_fixed(PGSTAT_KIND_CHECKPOINTER);
write_chunk_s(fpout, &pgStatLocal.snapshot.checkpointer);
+ /*
+ * Write IO stats struct
+ */
+ pgstat_build_snapshot_fixed(PGSTAT_KIND_IO);
+ write_chunk_s(fpout, &pgStatLocal.snapshot.io);
+
/*
* Write SLRU stats struct
*/
@@ -1496,6 +1516,12 @@ pgstat_read_statsfile(void)
if (!read_chunk_s(fpin, &shmem->checkpointer.stats))
goto error;
+ /*
+ * Read IO stats struct
+ */
+ if (!read_chunk_s(fpin, &shmem->io.stats))
+ goto error;
+
/*
* Read SLRU stats struct
*/
diff --git a/src/backend/utils/activity/pgstat_bgwriter.c b/src/backend/utils/activity/pgstat_bgwriter.c
index 9247f2dda2..92be384b0d 100644
--- a/src/backend/utils/activity/pgstat_bgwriter.c
+++ b/src/backend/utils/activity/pgstat_bgwriter.c
@@ -24,7 +24,7 @@ PgStat_BgWriterStats PendingBgWriterStats = {0};
/*
- * Report bgwriter statistics
+ * Report bgwriter and IO statistics
*/
void
pgstat_report_bgwriter(void)
@@ -56,6 +56,11 @@ pgstat_report_bgwriter(void)
* Clear out the statistics buffer, so it can be re-used.
*/
MemSet(&PendingBgWriterStats, 0, sizeof(PendingBgWriterStats));
+
+ /*
+ * Report IO statistics
+ */
+ pgstat_flush_io(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_checkpointer.c b/src/backend/utils/activity/pgstat_checkpointer.c
index 3e9ab45103..26dec112f6 100644
--- a/src/backend/utils/activity/pgstat_checkpointer.c
+++ b/src/backend/utils/activity/pgstat_checkpointer.c
@@ -24,7 +24,7 @@ PgStat_CheckpointerStats PendingCheckpointerStats = {0};
/*
- * Report checkpointer statistics
+ * Report checkpointer and IO statistics
*/
void
pgstat_report_checkpointer(void)
@@ -62,6 +62,11 @@ pgstat_report_checkpointer(void)
* Clear out the statistics buffer, so it can be re-used.
*/
MemSet(&PendingCheckpointerStats, 0, sizeof(PendingCheckpointerStats));
+
+ /*
+ * Report IO statistics
+ */
+ pgstat_flush_io(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
new file mode 100644
index 0000000000..b606f23eb8
--- /dev/null
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -0,0 +1,386 @@
+/* -------------------------------------------------------------------------
+ *
+ * pgstat_io.c
+ * Implementation of IO statistics.
+ *
+ * This file contains the implementation of IO statistics. It is kept separate
+ * from pgstat.c to enforce the line between the statistics access / storage
+ * implementation and the details about individual types of statistics.
+ *
+ * Copyright (c) 2021-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/utils/activity/pgstat_io.c
+ * -------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "utils/pgstat_internal.h"
+
+
+static PgStat_BktypeIO PendingIOStats;
+bool have_iostats = false;
+
+/*
+ * Check that stats have not been counted for any combination of IOObject,
+ * IOContext, and IOOp which are not tracked for the passed-in BackendType. The
+ * passed-in PgStat_BktypeIO must contain stats from the BackendType specified
+ * by the second parameter. Caller is responsible for locking the passed-in
+ * PgStat_BktypeIO, if needed.
+ */
+bool
+pgstat_bktype_io_stats_valid(PgStat_BktypeIO *backend_io,
+ BackendType bktype)
+{
+ bool bktype_tracked = pgstat_tracks_io_bktype(bktype);
+
+ for (IOObject io_object = IOOBJECT_FIRST;
+ io_object < IOOBJECT_NUM_TYPES; io_object++)
+ {
+ for (IOContext io_context = IOCONTEXT_FIRST;
+ io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ {
+ /*
+ * Don't bother trying to skip to the next loop iteration if
+ * pgstat_tracks_io_object() would return false here. We still
+ * need to validate that each counter is zero anyway.
+ */
+ for (IOOp io_op = IOOP_FIRST; io_op < IOOP_NUM_TYPES; io_op++)
+ {
+ /* No stats, so nothing to validate */
+ if (backend_io->data[io_object][io_context][io_op] == 0)
+ continue;
+
+ /* There are stats and there shouldn't be */
+ if (!bktype_tracked ||
+ !pgstat_tracks_io_op(bktype, io_object, io_context, io_op))
+ return false;
+ }
+ }
+ }
+
+ return true;
+}
+
+void
+pgstat_count_io_op(IOObject io_object, IOContext io_context, IOOp io_op)
+{
+ Assert(io_object < IOOBJECT_NUM_TYPES);
+ Assert(io_context < IOCONTEXT_NUM_TYPES);
+ Assert(io_op < IOOP_NUM_TYPES);
+ Assert(pgstat_tracks_io_op(MyBackendType, io_object, io_context, io_op));
+
+ PendingIOStats.data[io_object][io_context][io_op]++;
+
+ have_iostats = true;
+}
+
+PgStat_IO *
+pgstat_fetch_stat_io(void)
+{
+ pgstat_snapshot_fixed(PGSTAT_KIND_IO);
+
+ return &pgStatLocal.snapshot.io;
+}
+
+/*
+ * Flush out locally pending IO statistics
+ *
+ * If no stats have been recorded, this function returns false.
+ *
+ * If nowait is true, this function returns true if the lock could not be
+ * acquired. Otherwise, return false.
+ */
+bool
+pgstat_flush_io(bool nowait)
+{
+ LWLock *bktype_lock;
+ PgStat_BktypeIO *bktype_shstats;
+
+ if (!have_iostats)
+ return false;
+
+ bktype_lock = &pgStatLocal.shmem->io.locks[MyBackendType];
+ bktype_shstats =
+ &pgStatLocal.shmem->io.stats.stats[MyBackendType];
+
+ if (!nowait)
+ LWLockAcquire(bktype_lock, LW_EXCLUSIVE);
+ else if (!LWLockConditionalAcquire(bktype_lock, LW_EXCLUSIVE))
+ return true;
+
+ for (IOObject io_object = IOOBJECT_FIRST;
+ io_object < IOOBJECT_NUM_TYPES; io_object++)
+ for (IOContext io_context = IOCONTEXT_FIRST;
+ io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ for (IOOp io_op = IOOP_FIRST;
+ io_op < IOOP_NUM_TYPES; io_op++)
+ bktype_shstats->data[io_object][io_context][io_op] +=
+ PendingIOStats.data[io_object][io_context][io_op];
+
+ Assert(pgstat_bktype_io_stats_valid(bktype_shstats, MyBackendType));
+
+ LWLockRelease(bktype_lock);
+
+ memset(&PendingIOStats, 0, sizeof(PendingIOStats));
+
+ have_iostats = false;
+
+ return false;
+}
+
+const char *
+pgstat_get_io_context_name(IOContext io_context)
+{
+ switch (io_context)
+ {
+ case IOCONTEXT_BULKREAD:
+ return "bulkread";
+ case IOCONTEXT_BULKWRITE:
+ return "bulkwrite";
+ case IOCONTEXT_NORMAL:
+ return "normal";
+ case IOCONTEXT_VACUUM:
+ return "vacuum";
+ }
+
+ elog(ERROR, "unrecognized IOContext value: %d", io_context);
+ pg_unreachable();
+}
+
+const char *
+pgstat_get_io_object_name(IOObject io_object)
+{
+ switch (io_object)
+ {
+ case IOOBJECT_RELATION:
+ return "relation";
+ case IOOBJECT_TEMP_RELATION:
+ return "temp relation";
+ }
+
+ elog(ERROR, "unrecognized IOObject value: %d", io_object);
+ pg_unreachable();
+}
+
+void
+pgstat_io_reset_all_cb(TimestampTz ts)
+{
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ LWLock *bktype_lock = &pgStatLocal.shmem->io.locks[i];
+ PgStat_BktypeIO *bktype_shstats = &pgStatLocal.shmem->io.stats.stats[i];
+
+ LWLockAcquire(bktype_lock, LW_EXCLUSIVE);
+
+ /*
+ * Use the lock in the first BackendType's PgStat_BktypeIO to protect
+ * the reset timestamp as well.
+ */
+ if (i == 0)
+ pgStatLocal.shmem->io.stats.stat_reset_timestamp = ts;
+
+ memset(bktype_shstats, 0, sizeof(*bktype_shstats));
+ LWLockRelease(bktype_lock);
+ }
+}
+
+void
+pgstat_io_snapshot_cb(void)
+{
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ {
+ LWLock *bktype_lock = &pgStatLocal.shmem->io.locks[i];
+ PgStat_BktypeIO *bktype_shstats = &pgStatLocal.shmem->io.stats.stats[i];
+ PgStat_BktypeIO *bktype_snap = &pgStatLocal.snapshot.io.stats[i];
+
+ LWLockAcquire(bktype_lock, LW_SHARED);
+
+ /*
+ * Use the lock in the first BackendType's PgStat_BktypeIO to protect
+ * the reset timestamp as well.
+ */
+ if (i == 0)
+ pgStatLocal.snapshot.io.stat_reset_timestamp =
+ pgStatLocal.shmem->io.stats.stat_reset_timestamp;
+
+ /* using struct assignment due to better type safety */
+ *bktype_snap = *bktype_shstats;
+ LWLockRelease(bktype_lock);
+ }
+}
+
+/*
+* IO statistics are not collected for all BackendTypes.
+*
+* The following BackendTypes do not participate in the cumulative stats
+* subsystem or do not perform IO that we currently track:
+* - Syslogger because it is not connected to shared memory
+* - Archiver because most relevant archiving IO is delegated to a
+* specialized command or module
+* - WAL Receiver and WAL Writer IO is not tracked in pg_stat_io for now
+*
+* Function returns true if BackendType participates in the cumulative stats
+* subsystem for IO and false if it does not.
+*
+* When adding a new BackendType, also consider adding relevant restrictions to
+* pgstat_tracks_io_object() and pgstat_tracks_io_op().
+*/
+bool
+pgstat_tracks_io_bktype(BackendType bktype)
+{
+ /*
+ * List every type so that new backend types trigger a warning about
+ * needing to adjust this switch.
+ */
+ switch (bktype)
+ {
+ case B_INVALID:
+ case B_ARCHIVER:
+ case B_LOGGER:
+ case B_WAL_RECEIVER:
+ case B_WAL_WRITER:
+ return false;
+
+ case B_AUTOVAC_LAUNCHER:
+ case B_AUTOVAC_WORKER:
+ case B_BACKEND:
+ case B_BG_WORKER:
+ case B_BG_WRITER:
+ case B_CHECKPOINTER:
+ case B_STANDALONE_BACKEND:
+ case B_STARTUP:
+ case B_WAL_SENDER:
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * Some BackendTypes do not perform IO on certain IOObjects or in certain
+ * IOContexts. Some IOObjects are never operated on in some IOContexts. Check
+ * that the given BackendType is expected to do IO in the given IOContext and
+ * on the given IOObject and that the given IOObject is expected to be operated
+ * on in the given IOContext.
+ */
+bool
+pgstat_tracks_io_object(BackendType bktype, IOObject io_object,
+ IOContext io_context)
+{
+ bool no_temp_rel;
+
+ /*
+ * Some BackendTypes should never track IO statistics.
+ */
+ if (!pgstat_tracks_io_bktype(bktype))
+ return false;
+
+ /*
+ * Currently, IO on temporary relations can only occur in the
+ * IOCONTEXT_NORMAL IOContext.
+ */
+ if (io_context != IOCONTEXT_NORMAL &&
+ io_object == IOOBJECT_TEMP_RELATION)
+ return false;
+
+ /*
+ * In core Postgres, only regular backends and WAL Sender processes
+ * executing queries will use local buffers and operate on temporary
+ * relations. Parallel workers will not use local buffers (see
+ * InitLocalBuffers()); however, extensions leveraging background workers
+ * have no such limitation, so track IO on IOOBJECT_TEMP_RELATION for
+ * BackendType B_BG_WORKER.
+ */
+ no_temp_rel = bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER ||
+ bktype == B_CHECKPOINTER || bktype == B_AUTOVAC_WORKER ||
+ bktype == B_STANDALONE_BACKEND || bktype == B_STARTUP;
+
+ if (no_temp_rel && io_context == IOCONTEXT_NORMAL &&
+ io_object == IOOBJECT_TEMP_RELATION)
+ return false;
+
+ /*
+ * Some BackendTypes do not currently perform any IO in certain
+ * IOContexts, and, while it may not be inherently incorrect for them to
+ * do so, excluding those rows from the view makes the view easier to use.
+ */
+ if ((bktype == B_CHECKPOINTER || bktype == B_BG_WRITER) &&
+ (io_context == IOCONTEXT_BULKREAD ||
+ io_context == IOCONTEXT_BULKWRITE ||
+ io_context == IOCONTEXT_VACUUM))
+ return false;
+
+ if (bktype == B_AUTOVAC_LAUNCHER && io_context == IOCONTEXT_VACUUM)
+ return false;
+
+ if ((bktype == B_AUTOVAC_WORKER || bktype == B_AUTOVAC_LAUNCHER) &&
+ io_context == IOCONTEXT_BULKWRITE)
+ return false;
+
+ return true;
+}
+
+/*
+ * Some BackendTypes will never do certain IOOps and some IOOps should not
+ * occur in certain IOContexts or on certain IOObjects. Check that the given
+ * IOOp is valid for the given BackendType in the given IOContext and on the
+ * given IOObject. Note that there are currently no cases of an IOOp being
+ * invalid for a particular BackendType only within a certain IOContext and/or
+ * only on a certain IOObject.
+ */
+bool
+pgstat_tracks_io_op(BackendType bktype, IOObject io_object,
+ IOContext io_context, IOOp io_op)
+{
+ bool strategy_io_context;
+
+ /* if (io_context, io_object) will never collect stats, we're done */
+ if (!pgstat_tracks_io_object(bktype, io_object, io_context))
+ return false;
+
+ /*
+ * Some BackendTypes will not do certain IOOps.
+ */
+ if ((bktype == B_BG_WRITER || bktype == B_CHECKPOINTER) &&
+ (io_op == IOOP_READ || io_op == IOOP_EVICT))
+ return false;
+
+ if ((bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER ||
+ bktype == B_CHECKPOINTER) && io_op == IOOP_EXTEND)
+ return false;
+
+ /*
+ * Some IOOps are not valid in certain IOContexts and some IOOps are only
+ * valid in certain contexts.
+ */
+ if (io_context == IOCONTEXT_BULKREAD && io_op == IOOP_EXTEND)
+ return false;
+
+ strategy_io_context = io_context == IOCONTEXT_BULKREAD ||
+ io_context == IOCONTEXT_BULKWRITE || io_context == IOCONTEXT_VACUUM;
+
+ /*
+ * IOOP_REUSE is only relevant when a BufferAccessStrategy is in use.
+ */
+ if (!strategy_io_context && io_op == IOOP_REUSE)
+ return false;
+
+ /*
+ * IOOP_FSYNC IOOps done by a backend using a BufferAccessStrategy are
+ * counted in the IOCONTEXT_NORMAL IOContext. See comment in
+ * register_dirty_segment() for more details.
+ */
+ if (strategy_io_context && io_op == IOOP_FSYNC)
+ return false;
+
+ /*
+ * Temporary tables are not logged and thus do not require fsync'ing.
+ */
+ if (io_context == IOCONTEXT_NORMAL &&
+ io_object == IOOBJECT_TEMP_RELATION && io_op == IOOP_FSYNC)
+ return false;
+
+ return true;
+}
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index 2e20b93c20..f793ac1516 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -206,7 +206,7 @@ pgstat_drop_relation(Relation rel)
}
/*
- * Report that the table was just vacuumed.
+ * Report that the table was just vacuumed and flush IO statistics.
*/
void
pgstat_report_vacuum(Oid tableoid, bool shared,
@@ -258,10 +258,18 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
}
pgstat_unlock_entry(entry_ref);
+
+ /*
+ * Flush IO statistics now. pgstat_report_stat() will flush IO stats,
+ * however this will not be called until after an entire autovacuum cycle
+ * is done -- which will likely vacuum many relations -- or until the
+ * VACUUM command has processed all tables and committed.
+ */
+ pgstat_flush_io(false);
}
/*
- * Report that the table was just analyzed.
+ * Report that the table was just analyzed and flush IO statistics.
*
* Caller must provide new live- and dead-tuples estimates, as well as a
* flag indicating whether to reset the mod_since_analyze counter.
@@ -341,6 +349,9 @@ pgstat_report_analyze(Relation rel,
}
pgstat_unlock_entry(entry_ref);
+
+ /* see pgstat_report_vacuum() */
+ pgstat_flush_io(false);
}
/*
diff --git a/src/backend/utils/activity/pgstat_shmem.c b/src/backend/utils/activity/pgstat_shmem.c
index c1506b53d0..09fffd0e82 100644
--- a/src/backend/utils/activity/pgstat_shmem.c
+++ b/src/backend/utils/activity/pgstat_shmem.c
@@ -202,6 +202,10 @@ StatsShmemInit(void)
LWLockInitialize(&ctl->checkpointer.lock, LWTRANCHE_PGSTATS_DATA);
LWLockInitialize(&ctl->slru.lock, LWTRANCHE_PGSTATS_DATA);
LWLockInitialize(&ctl->wal.lock, LWTRANCHE_PGSTATS_DATA);
+
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ LWLockInitialize(&ctl->io.locks[i],
+ LWTRANCHE_PGSTATS_DATA);
}
else
{
diff --git a/src/backend/utils/activity/pgstat_wal.c b/src/backend/utils/activity/pgstat_wal.c
index e7a82b5fed..e8598b2f4e 100644
--- a/src/backend/utils/activity/pgstat_wal.c
+++ b/src/backend/utils/activity/pgstat_wal.c
@@ -34,7 +34,7 @@ static WalUsage prevWalUsage;
/*
* Calculate how much WAL usage counters have increased and update
- * shared statistics.
+ * shared WAL and IO statistics.
*
* Must be called by processes that generate WAL, that do not call
* pgstat_report_stat(), like walwriter.
@@ -43,6 +43,8 @@ void
pgstat_report_wal(bool force)
{
pgstat_flush_wal(force);
+
+ pgstat_flush_io(force);
}
/*
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 6737493402..924698e6ae 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1587,7 +1587,12 @@ pg_stat_reset(PG_FUNCTION_ARGS)
PG_RETURN_VOID();
}
-/* Reset some shared cluster-wide counters */
+/*
+ * Reset some shared cluster-wide counters
+ *
+ * When adding a new reset target, ideally the name should match that in
+ * pgstat_kind_infos, if relevant.
+ */
Datum
pg_stat_reset_shared(PG_FUNCTION_ARGS)
{
@@ -1604,6 +1609,8 @@ pg_stat_reset_shared(PG_FUNCTION_ARGS)
pgstat_reset_of_kind(PGSTAT_KIND_BGWRITER);
pgstat_reset_of_kind(PGSTAT_KIND_CHECKPOINTER);
}
+ else if (strcmp(target, "io") == 0)
+ pgstat_reset_of_kind(PGSTAT_KIND_IO);
else if (strcmp(target, "recovery_prefetch") == 0)
XLogPrefetchResetStats();
else if (strcmp(target, "wal") == 0)
@@ -1612,7 +1619,7 @@ pg_stat_reset_shared(PG_FUNCTION_ARGS)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("unrecognized reset target: \"%s\"", target),
- errhint("Target must be \"archiver\", \"bgwriter\", \"recovery_prefetch\", or \"wal\".")));
+ errhint("Target must be \"archiver\", \"bgwriter\", \"io\", \"recovery_prefetch\", or \"wal\".")));
PG_RETURN_VOID();
}
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 96b3a1e1a0..c309e0233d 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -332,6 +332,8 @@ typedef enum BackendType
B_WAL_WRITER,
} BackendType;
+#define BACKEND_NUM_TYPES (B_WAL_WRITER + 1)
+
extern PGDLLIMPORT BackendType MyBackendType;
extern const char *GetBackendTypeDesc(BackendType backendType);
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 5e3326a3b9..9f09caa05f 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -48,6 +48,7 @@ typedef enum PgStat_Kind
PGSTAT_KIND_ARCHIVER,
PGSTAT_KIND_BGWRITER,
PGSTAT_KIND_CHECKPOINTER,
+ PGSTAT_KIND_IO,
PGSTAT_KIND_SLRU,
PGSTAT_KIND_WAL,
} PgStat_Kind;
@@ -276,6 +277,55 @@ typedef struct PgStat_CheckpointerStats
PgStat_Counter buf_fsync_backend;
} PgStat_CheckpointerStats;
+
+/*
+ * Types related to counting IO operations
+ */
+typedef enum IOObject
+{
+ IOOBJECT_RELATION,
+ IOOBJECT_TEMP_RELATION,
+} IOObject;
+
+#define IOOBJECT_FIRST IOOBJECT_RELATION
+#define IOOBJECT_NUM_TYPES (IOOBJECT_TEMP_RELATION + 1)
+
+typedef enum IOContext
+{
+ IOCONTEXT_BULKREAD,
+ IOCONTEXT_BULKWRITE,
+ IOCONTEXT_NORMAL,
+ IOCONTEXT_VACUUM,
+} IOContext;
+
+#define IOCONTEXT_FIRST IOCONTEXT_BULKREAD
+#define IOCONTEXT_NUM_TYPES (IOCONTEXT_VACUUM + 1)
+
+typedef enum IOOp
+{
+ IOOP_EVICT,
+ IOOP_EXTEND,
+ IOOP_FSYNC,
+ IOOP_READ,
+ IOOP_REUSE,
+ IOOP_WRITE,
+} IOOp;
+
+#define IOOP_FIRST IOOP_EVICT
+#define IOOP_NUM_TYPES (IOOP_WRITE + 1)
+
+typedef struct PgStat_BktypeIO
+{
+ PgStat_Counter data[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+} PgStat_BktypeIO;
+
+typedef struct PgStat_IO
+{
+ TimestampTz stat_reset_timestamp;
+ PgStat_BktypeIO stats[BACKEND_NUM_TYPES];
+} PgStat_IO;
+
+
typedef struct PgStat_StatDBEntry
{
PgStat_Counter xact_commit;
@@ -453,6 +503,24 @@ extern void pgstat_report_checkpointer(void);
extern PgStat_CheckpointerStats *pgstat_fetch_stat_checkpointer(void);
+/*
+ * Functions in pgstat_io.c
+ */
+
+extern bool pgstat_bktype_io_stats_valid(PgStat_BktypeIO *context_ops,
+ BackendType bktype);
+extern void pgstat_count_io_op(IOObject io_object, IOContext io_context, IOOp io_op);
+extern PgStat_IO *pgstat_fetch_stat_io(void);
+extern const char *pgstat_get_io_context_name(IOContext io_context);
+extern const char *pgstat_get_io_object_name(IOObject io_object);
+
+extern bool pgstat_tracks_io_bktype(BackendType bktype);
+extern bool pgstat_tracks_io_object(BackendType bktype,
+ IOObject io_object, IOContext io_context);
+extern bool pgstat_tracks_io_op(BackendType bktype, IOObject io_object,
+ IOContext io_context, IOOp io_op);
+
+
/*
* Functions in pgstat_database.c
*/
diff --git a/src/include/utils/pgstat_internal.h b/src/include/utils/pgstat_internal.h
index 12fd51f1ae..6badb2fde4 100644
--- a/src/include/utils/pgstat_internal.h
+++ b/src/include/utils/pgstat_internal.h
@@ -329,6 +329,17 @@ typedef struct PgStatShared_Checkpointer
PgStat_CheckpointerStats reset_offset;
} PgStatShared_Checkpointer;
+/* Shared-memory ready PgStat_IO */
+typedef struct PgStatShared_IO
+{
+ /*
+ * locks[i] protects stats.stats[i]. locks[0] also protects
+ * stats.stat_reset_timestamp.
+ */
+ LWLock locks[BACKEND_NUM_TYPES];
+ PgStat_IO stats;
+} PgStatShared_IO;
+
typedef struct PgStatShared_SLRU
{
/* lock protects ->stats */
@@ -419,6 +430,7 @@ typedef struct PgStat_ShmemControl
PgStatShared_Archiver archiver;
PgStatShared_BgWriter bgwriter;
PgStatShared_Checkpointer checkpointer;
+ PgStatShared_IO io;
PgStatShared_SLRU slru;
PgStatShared_Wal wal;
} PgStat_ShmemControl;
@@ -442,6 +454,8 @@ typedef struct PgStat_Snapshot
PgStat_CheckpointerStats checkpointer;
+ PgStat_IO io;
+
PgStat_SLRUStats slru[SLRU_NUM_ELEMENTS];
PgStat_WalStats wal;
@@ -549,6 +563,15 @@ extern void pgstat_database_reset_timestamp_cb(PgStatShared_Common *header, Time
extern bool pgstat_function_flush_cb(PgStat_EntryRef *entry_ref, bool nowait);
+/*
+ * Functions in pgstat_io.c
+ */
+
+extern bool pgstat_flush_io(bool nowait);
+extern void pgstat_io_reset_all_cb(TimestampTz ts);
+extern void pgstat_io_snapshot_cb(void);
+
+
/*
* Functions in pgstat_relation.c
*/
@@ -643,6 +666,13 @@ extern void pgstat_create_transactional(PgStat_Kind kind, Oid dboid, Oid objoid)
extern PGDLLIMPORT PgStat_LocalState pgStatLocal;
+/*
+ * Variables in pgstat_io.c
+ */
+
+extern PGDLLIMPORT bool have_iostats;
+
+
/*
* Variables in pgstat_slru.c
*/
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 09316039e4..65be0dea1b 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1106,7 +1106,10 @@ ID
INFIX
INT128
INTERFACE_INFO
+IOContext
IOFuncSelector
+IOObject
+IOOp
IPCompareMethod
ITEM
IV
@@ -2015,6 +2018,7 @@ PgStatShared_Common
PgStatShared_Database
PgStatShared_Function
PgStatShared_HashEntry
+PgStatShared_IO
PgStatShared_Relation
PgStatShared_ReplSlot
PgStatShared_SLRU
@@ -2024,6 +2028,7 @@ PgStat_ArchiverStats
PgStat_BackendFunctionEntry
PgStat_BackendSubEntry
PgStat_BgWriterStats
+PgStat_BktypeIO
PgStat_CheckpointerStats
PgStat_Counter
PgStat_EntryRef
@@ -2032,6 +2037,7 @@ PgStat_FetchConsistency
PgStat_FunctionCallUsage
PgStat_FunctionCounts
PgStat_HashKey
+PgStat_IO
PgStat_Kind
PgStat_KindInfo
PgStat_LocalState
--
2.34.1
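As a minimal standalone sketch (not part of the patch; the helper name and folding the `*_NUM_TYPES` counts into the enums are simplifications for illustration), this is how the `PgStat_BktypeIO` counter array introduced above is indexed by `[IOObject][IOContext][IOOp]`:

```c
#include <assert.h>

/* Mirrors the enums added to pgstat.h, with NUM_TYPES folded in for brevity. */
typedef enum { IOOBJECT_RELATION, IOOBJECT_TEMP_RELATION,
               IOOBJECT_NUM_TYPES } IOObject;
typedef enum { IOCONTEXT_BULKREAD, IOCONTEXT_BULKWRITE, IOCONTEXT_NORMAL,
               IOCONTEXT_VACUUM, IOCONTEXT_NUM_TYPES } IOContext;
typedef enum { IOOP_EVICT, IOOP_EXTEND, IOOP_FSYNC, IOOP_READ, IOOP_REUSE,
               IOOP_WRITE, IOOP_NUM_TYPES } IOOp;

typedef long long PgStat_Counter;

/* Per-backend-type counter block, shaped like PgStat_BktypeIO. */
typedef struct
{
    PgStat_Counter data[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
} BktypeIO;

/* Hypothetical stand-in for pgstat_count_io_op(): bump a single cell. */
static void
count_io_op(BktypeIO *io, IOObject obj, IOContext ctx, IOOp op)
{
    io->data[obj][ctx][op]++;
}
```

In the real patch one such block exists per `BackendType` (see `PgStat_IO.stats[BACKEND_NUM_TYPES]`), which is what lets the SQL view later break the counters out per backend type.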
Attachment: v51-0003-pgstat-Count-IO-for-relations.patch (text/x-patch; charset=US-ASCII)
From 9ec02a2201ffc465be17fe0e9ee49d4e42a9d89f Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 17 Jan 2023 16:25:31 -0500
Subject: [PATCH v51 3/5] pgstat: Count IO for relations
Count IOOps done on IOObjects in IOContexts by various BackendTypes
using the IO stats infrastructure introduced by a previous commit.
The primary concern of these statistics is IO operations on data blocks
during the course of normal database operations. IO operations done by,
for example, the archiver or syslogger are not counted in these
statistics. WAL IO, temporary file IO, and IO done directly though smgr*
functions (such as when building an index) are not yet counted but would
be useful future additions.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
src/backend/storage/buffer/bufmgr.c | 110 ++++++++++++++++++++++----
src/backend/storage/buffer/freelist.c | 58 ++++++++++----
src/backend/storage/buffer/localbuf.c | 13 ++-
src/backend/storage/smgr/md.c | 24 ++++++
src/include/storage/buf_internals.h | 8 +-
src/include/storage/bufmgr.h | 7 +-
6 files changed, 184 insertions(+), 36 deletions(-)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 8075828e8a..fffd846098 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -481,8 +481,9 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
ForkNumber forkNum,
BlockNumber blockNum,
BufferAccessStrategy strategy,
- bool *foundPtr);
-static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+ bool *foundPtr, IOContext *io_context);
+static void FlushBuffer(BufferDesc *buf, SMgrRelation reln,
+ IOObject io_object, IOContext io_context);
static void FindAndDropRelationBuffers(RelFileLocator rlocator,
ForkNumber forkNum,
BlockNumber nForkBlock,
@@ -823,6 +824,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
BufferDesc *bufHdr;
Block bufBlock;
bool found;
+ IOContext io_context;
+ IOObject io_object;
bool isExtend;
bool isLocalBuf = SmgrIsTemp(smgr);
@@ -855,7 +858,14 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (isLocalBuf)
{
- bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
+ /*
+ * LocalBufferAlloc() will set the io_context to IOCONTEXT_NORMAL. We
+ * do not use a BufferAccessStrategy for I/O of temporary tables.
+ * However, in some cases, the "strategy" may not be NULL, so we can't
+ * rely on IOContextForStrategy() to set the right IOContext for us.
+ * This may happen in cases like CREATE TEMPORARY TABLE AS...
+ */
+ bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found, &io_context);
if (found)
pgBufferUsage.local_blks_hit++;
else if (isExtend)
@@ -871,7 +881,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* not currently in memory.
*/
bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum,
- strategy, &found);
+ strategy, &found, &io_context);
if (found)
pgBufferUsage.shared_blks_hit++;
else if (isExtend)
@@ -986,7 +996,16 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
*/
Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID)); /* spinlock not needed */
- bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
+ if (isLocalBuf)
+ {
+ bufBlock = LocalBufHdrGetBlock(bufHdr);
+ io_object = IOOBJECT_TEMP_RELATION;
+ }
+ else
+ {
+ bufBlock = BufHdrGetBlock(bufHdr);
+ io_object = IOOBJECT_RELATION;
+ }
if (isExtend)
{
@@ -995,6 +1014,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
/* don't set checksum for all-zero page */
smgrextend(smgr, forkNum, blockNum, (char *) bufBlock, false);
+ pgstat_count_io_op(io_object, io_context, IOOP_EXTEND);
+
/*
* NB: we're *not* doing a ScheduleBufferTagForWriteback here;
* although we're essentially performing a write. At least on linux
@@ -1020,6 +1041,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
smgrread(smgr, forkNum, blockNum, (char *) bufBlock);
+ pgstat_count_io_op(io_object, io_context, IOOP_READ);
+
if (track_io_timing)
{
INSTR_TIME_SET_CURRENT(io_time);
@@ -1113,14 +1136,19 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* *foundPtr is actually redundant with the buffer's BM_VALID flag, but
* we keep it for simplicity in ReadBuffer.
*
+ * io_context is passed as an output parameter to avoid calling
+ * IOContextForStrategy() when there is a shared buffers hit and no IO
+ * statistics need be captured.
+ *
* No locks are held either at entry or exit.
*/
static BufferDesc *
BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
BlockNumber blockNum,
BufferAccessStrategy strategy,
- bool *foundPtr)
+ bool *foundPtr, IOContext *io_context)
{
+ bool from_ring;
BufferTag newTag; /* identity of requested block */
uint32 newHash; /* hash value for newTag */
LWLock *newPartitionLock; /* buffer partition lock for it */
@@ -1172,8 +1200,11 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
{
/*
* If we get here, previous attempts to read the buffer must
- * have failed ... but we shall bravely try again.
+ * have failed ... but we shall bravely try again. Set
+ * io_context since we will in fact need to count an IO
+ * Operation.
*/
+ *io_context = IOContextForStrategy(strategy);
*foundPtr = false;
}
}
@@ -1187,6 +1218,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
*/
LWLockRelease(newPartitionLock);
+ *io_context = IOContextForStrategy(strategy);
+
/* Loop here in case we have to try another victim buffer */
for (;;)
{
@@ -1200,7 +1233,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* Select a victim buffer. The buffer is returned with its header
* spinlock still held!
*/
- buf = StrategyGetBuffer(strategy, &buf_state);
+ buf = StrategyGetBuffer(strategy, &buf_state, &from_ring);
Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
@@ -1254,7 +1287,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
UnlockBufHdr(buf, buf_state);
if (XLogNeedsFlush(lsn) &&
- StrategyRejectBuffer(strategy, buf))
+ StrategyRejectBuffer(strategy, buf, from_ring))
{
/* Drop lock/pin and loop around for another buffer */
LWLockRelease(BufferDescriptorGetContentLock(buf));
@@ -1269,7 +1302,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
smgr->smgr_rlocator.locator.dbOid,
smgr->smgr_rlocator.locator.relNumber);
- FlushBuffer(buf, NULL);
+ FlushBuffer(buf, NULL, IOOBJECT_RELATION, *io_context);
LWLockRelease(BufferDescriptorGetContentLock(buf));
ScheduleBufferTagForWriteback(&BackendWritebackContext,
@@ -1450,6 +1483,28 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
LWLockRelease(newPartitionLock);
+ if (oldFlags & BM_VALID)
+ {
+ /*
+ * When a BufferAccessStrategy is in use, blocks evicted from shared
+ * buffers are counted as IOOP_EVICT in the corresponding context
+ * (e.g. IOCONTEXT_BULKWRITE). Shared buffers are evicted by a
+ * strategy in two cases: 1) while initially claiming buffers for the
+ * strategy ring 2) to replace an existing strategy ring buffer
+ * because it is pinned or in use and cannot be reused.
+ *
+ * Blocks evicted from buffers already in the strategy ring are
+ * counted as IOOP_REUSE in the corresponding strategy context.
+ *
+ * At this point, we can accurately count evictions and reuses,
+ * because we have successfully claimed the valid buffer. Previously,
+ * we may have been forced to release the buffer due to concurrent
+ * pinners or erroring out.
+ */
+ pgstat_count_io_op(IOOBJECT_RELATION, *io_context,
+ from_ring ? IOOP_REUSE : IOOP_EVICT);
+ }
+
/*
* Buffer contents are currently invalid. Try to obtain the right to
* start I/O. If StartBufferIO returns false, then someone else managed
@@ -2570,7 +2625,7 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOOBJECT_RELATION, IOCONTEXT_NORMAL);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
@@ -2820,7 +2875,8 @@ BufferGetTag(Buffer buffer, RelFileLocator *rlocator, ForkNumber *forknum,
* as the second parameter. If not, pass NULL.
*/
static void
-FlushBuffer(BufferDesc *buf, SMgrRelation reln)
+FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
+ IOContext io_context)
{
XLogRecPtr recptr;
ErrorContextCallback errcallback;
@@ -2912,6 +2968,26 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
bufToWrite,
false);
+ /*
+ * When a strategy is in use, only flushes of dirty buffers already in the
+ * strategy ring are counted as strategy writes (IOCONTEXT
+ * [BULKREAD|BULKWRITE|VACUUM] IOOP_WRITE) for the purpose of IO
+ * statistics tracking.
+ *
+ * If a shared buffer initially added to the ring must be flushed before
+ * being used, this is counted as an IOCONTEXT_NORMAL IOOP_WRITE.
+ *
+ * If a shared buffer which was added to the ring later because the
+ * current strategy buffer is pinned or in use or because all strategy
+ * buffers were dirty and rejected (for BAS_BULKREAD operations only)
+ * requires flushing, this is counted as an IOCONTEXT_NORMAL IOOP_WRITE
+ * (from_ring will be false).
+ *
+ * When a strategy is not in use, the write can only be a "regular" write
+ * of a dirty shared buffer (IOCONTEXT_NORMAL IOOP_WRITE).
+ */
+ pgstat_count_io_op(IOOBJECT_RELATION, io_context, IOOP_WRITE);
+
if (track_io_timing)
{
INSTR_TIME_SET_CURRENT(io_time);
@@ -3554,6 +3630,8 @@ FlushRelationBuffers(Relation rel)
buf_state &= ~(BM_DIRTY | BM_JUST_DIRTIED);
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+ pgstat_count_io_op(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_WRITE);
+
/* Pop the error context stack */
error_context_stack = errcallback.previous;
}
@@ -3586,7 +3664,7 @@ FlushRelationBuffers(Relation rel)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, RelationGetSmgr(rel));
+ FlushBuffer(bufHdr, RelationGetSmgr(rel), IOOBJECT_RELATION, IOCONTEXT_NORMAL);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr);
}
@@ -3684,7 +3762,7 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, srelent->srel);
+ FlushBuffer(bufHdr, srelent->srel, IOOBJECT_RELATION, IOCONTEXT_NORMAL);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr);
}
@@ -3894,7 +3972,7 @@ FlushDatabaseBuffers(Oid dbid)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOOBJECT_RELATION, IOCONTEXT_NORMAL);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr);
}
@@ -3921,7 +3999,7 @@ FlushOneBuffer(Buffer buffer)
Assert(LWLockHeldByMe(BufferDescriptorGetContentLock(bufHdr)));
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, IOOBJECT_RELATION, IOCONTEXT_NORMAL);
}
/*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 7dec35801c..c690d5f15f 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -15,6 +15,7 @@
*/
#include "postgres.h"
+#include "pgstat.h"
#include "port/atomics.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
@@ -81,12 +82,6 @@ typedef struct BufferAccessStrategyData
*/
int current;
- /*
- * True if the buffer just returned by StrategyGetBuffer had been in the
- * ring already.
- */
- bool current_was_in_ring;
-
/*
* Array of buffer numbers. InvalidBuffer (that is, zero) indicates we
* have not yet selected a buffer for this ring slot. For allocation
@@ -198,13 +193,15 @@ have_free_buffer(void)
* return the buffer with the buffer header spinlock still held.
*/
BufferDesc *
-StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
+StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_ring)
{
BufferDesc *buf;
int bgwprocno;
int trycounter;
uint32 local_buf_state; /* to avoid repeated (de-)referencing */
+ *from_ring = false;
+
/*
* If given a strategy object, see whether it can select a buffer. We
* assume strategy objects don't need buffer_strategy_lock.
@@ -213,7 +210,10 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
{
buf = GetBufferFromRing(strategy, buf_state);
if (buf != NULL)
+ {
+ *from_ring = true;
return buf;
+ }
}
/*
@@ -602,7 +602,7 @@ FreeAccessStrategy(BufferAccessStrategy strategy)
/*
* GetBufferFromRing -- returns a buffer from the ring, or NULL if the
- * ring is empty.
+ * ring is empty / not usable.
*
* The bufhdr spin lock is held on the returned buffer.
*/
@@ -625,10 +625,7 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
*/
bufnum = strategy->buffers[strategy->current];
if (bufnum == InvalidBuffer)
- {
- strategy->current_was_in_ring = false;
return NULL;
- }
/*
* If the buffer is pinned we cannot use it under any circumstances.
@@ -644,7 +641,6 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
if (BUF_STATE_GET_REFCOUNT(local_buf_state) == 0
&& BUF_STATE_GET_USAGECOUNT(local_buf_state) <= 1)
{
- strategy->current_was_in_ring = true;
*buf_state = local_buf_state;
return buf;
}
@@ -654,7 +650,6 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
* Tell caller to allocate a new buffer with the normal allocation
* strategy. He'll then replace this ring element via AddBufferToRing.
*/
- strategy->current_was_in_ring = false;
return NULL;
}
@@ -670,6 +665,39 @@ AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
strategy->buffers[strategy->current] = BufferDescriptorGetBuffer(buf);
}
+/*
+ * Utility function returning the IOContext of a given BufferAccessStrategy's
+ * strategy ring.
+ */
+IOContext
+IOContextForStrategy(BufferAccessStrategy strategy)
+{
+ if (!strategy)
+ return IOCONTEXT_NORMAL;
+
+ switch (strategy->btype)
+ {
+ case BAS_NORMAL:
+
+ /*
+ * Currently, GetAccessStrategy() returns NULL for
+ * BufferAccessStrategyType BAS_NORMAL, so this case is
+ * unreachable.
+ */
+ pg_unreachable();
+ return IOCONTEXT_NORMAL;
+ case BAS_BULKREAD:
+ return IOCONTEXT_BULKREAD;
+ case BAS_BULKWRITE:
+ return IOCONTEXT_BULKWRITE;
+ case BAS_VACUUM:
+ return IOCONTEXT_VACUUM;
+ }
+
+ elog(ERROR, "unrecognized BufferAccessStrategyType: %d", strategy->btype);
+ pg_unreachable();
+}
+
/*
* StrategyRejectBuffer -- consider rejecting a dirty buffer
*
@@ -682,14 +710,14 @@ AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
* if this buffer should be written and re-used.
*/
bool
-StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
+StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool from_ring)
{
/* We only do this in bulkread mode */
if (strategy->btype != BAS_BULKREAD)
return false;
/* Don't muck with behavior of normal buffer-replacement strategy */
- if (!strategy->current_was_in_ring ||
+ if (!from_ring ||
strategy->buffers[strategy->current] != BufferDescriptorGetBuffer(buf))
return false;
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 8372acc383..8e286db5df 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -18,6 +18,7 @@
#include "access/parallel.h"
#include "catalog/catalog.h"
#include "executor/instrument.h"
+#include "pgstat.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "utils/guc_hooks.h"
@@ -107,7 +108,7 @@ PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
*/
BufferDesc *
LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
- bool *foundPtr)
+ bool *foundPtr, IOContext *io_context)
{
BufferTag newTag; /* identity of requested block */
LocalBufferLookupEnt *hresult;
@@ -127,6 +128,14 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
hresult = (LocalBufferLookupEnt *)
hash_search(LocalBufHash, (void *) &newTag, HASH_FIND, NULL);
+ /*
+ * IO Operations on local buffers are only done in IOCONTEXT_NORMAL. Set
+ * io_context here (rather than only in the buffer-miss path) for
+ * convenience, since we don't have to worry about the overhead of
+ * calling IOContextForStrategy().
+ */
+ *io_context = IOCONTEXT_NORMAL;
+
if (hresult)
{
b = hresult->id;
@@ -230,6 +239,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
buf_state &= ~BM_DIRTY;
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+ pgstat_count_io_op(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_WRITE);
pgBufferUsage.local_blks_written++;
}
@@ -256,6 +266,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
ClearBufferTag(&bufHdr->tag);
buf_state &= ~(BM_VALID | BM_TAG_VALID);
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+ pgstat_count_io_op(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_EVICT);
}
hresult = (LocalBufferLookupEnt *)
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 60c9905eff..8da813600c 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -983,6 +983,15 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
{
MdfdVec *v = &reln->md_seg_fds[forknum][segno - 1];
+ /*
+ * fsyncs done through mdimmedsync() should be tracked in a separate
+ * IOContext from those done through mdsyncfiletag() to differentiate
+ * between unavoidable client backend fsyncs (e.g. those done during
+ * index build) and those which ideally would have been done by the
+ * checkpointer. Since other IO operations bypassing the buffer
+ * manager could also be tracked in such an IOContext, defer counting
+ * immediate fsyncs until those operations are tracked as well.
+ */
if (FileSync(v->mdfd_vfd, WAIT_EVENT_DATA_FILE_IMMEDIATE_SYNC) < 0)
ereport(data_sync_elevel(ERROR),
(errcode_for_file_access(),
@@ -1021,6 +1030,19 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false /* retryOnError */ ))
{
+ /*
+ * We have no way of knowing if the current IOContext is
+ * IOCONTEXT_NORMAL or IOCONTEXT_[BULKREAD, BULKWRITE, VACUUM] at this
+ * point, so count the fsync as being in the IOCONTEXT_NORMAL
+ * IOContext. This is probably okay, because the number of backend
+ * fsyncs doesn't say anything about the efficacy of the
+ * BufferAccessStrategy. And counting both fsyncs done in
+ * IOCONTEXT_NORMAL and IOCONTEXT_[BULKREAD, BULKWRITE, VACUUM] under
+ * IOCONTEXT_NORMAL is likely clearer when investigating the number of
+ * backend fsyncs.
+ */
+ pgstat_count_io_op(IOOBJECT_RELATION, IOCONTEXT_NORMAL, IOOP_FSYNC);
+
ereport(DEBUG1,
(errmsg_internal("could not forward fsync request because request queue is full")));
@@ -1410,6 +1432,8 @@ mdsyncfiletag(const FileTag *ftag, char *path)
if (need_to_close)
FileClose(file);
+ pgstat_count_io_op(IOOBJECT_RELATION, IOCONTEXT_NORMAL, IOOP_FSYNC);
+
errno = save_errno;
return result;
}
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index ed8aa2519c..0b44814740 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -15,6 +15,7 @@
#ifndef BUFMGR_INTERNALS_H
#define BUFMGR_INTERNALS_H
+#include "pgstat.h"
#include "port/atomics.h"
#include "storage/buf.h"
#include "storage/bufmgr.h"
@@ -391,11 +392,12 @@ extern void IssuePendingWritebacks(WritebackContext *context);
extern void ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *tag);
/* freelist.c */
+extern IOContext IOContextForStrategy(BufferAccessStrategy bas);
extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
- uint32 *buf_state);
+ uint32 *buf_state, bool *from_ring);
extern void StrategyFreeBuffer(BufferDesc *buf);
extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
- BufferDesc *buf);
+ BufferDesc *buf, bool from_ring);
extern int StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
extern void StrategyNotifyBgWriter(int bgwprocno);
@@ -417,7 +419,7 @@ extern PrefetchBufferResult PrefetchLocalBuffer(SMgrRelation smgr,
ForkNumber forkNum,
BlockNumber blockNum);
extern BufferDesc *LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum,
- BlockNumber blockNum, bool *foundPtr);
+ BlockNumber blockNum, bool *foundPtr, IOContext *io_context);
extern void MarkLocalBufferDirty(Buffer buffer);
extern void DropRelationLocalBuffers(RelFileLocator rlocator,
ForkNumber forkNum,
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 33eadbc129..b8a18b8081 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -23,7 +23,12 @@
typedef void *Block;
-/* Possible arguments for GetAccessStrategy() */
+/*
+ * Possible arguments for GetAccessStrategy().
+ *
+ * If adding a new BufferAccessStrategyType, also add a new IOContext so
+ * IO statistics using this strategy are tracked.
+ */
typedef enum BufferAccessStrategyType
{
BAS_NORMAL, /* Normal random access */
--
2.34.1
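Once the counters added by the patch above are exposed through the pg_stat_io view (added by the next patch in the series), they could be inspected and reset along these lines. This is a sketch only: the column names (`backend_type`, `io_object`, `io_context`, `reads`, `writes`, `extends`, `op_bytes`) are assumed from the view patch's commit message and regression tests, and `'client backend'` is the backend-type string produced by GetBackendTypeDesc():

```sql
-- How much block IO have client backends done themselves, per object and
-- context (e.g. writes a BufferAccessStrategy forced on the backend)?
SELECT backend_type, io_object, io_context,
       reads, writes, extends, op_bytes
FROM pg_stat_io
WHERE backend_type = 'client backend'
ORDER BY io_object, io_context;

-- Reset only the IO statistics, independently of the 'bgwriter' target:
SELECT pg_stat_reset_shared('io');
```

The separate `'io'` reset target is the reason the commit message below notes that the redundant pg_stat_bgwriter columns could not simply be derived from these structures.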
Attachment: v51-0004-Add-system-view-tracking-IO-ops-per-backend-type.patch (text/x-patch; charset=US-ASCII)
From cc70c22486f3d68c70e1059547a49179f007a51a Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 17 Jan 2023 16:28:27 -0500
Subject: [PATCH v51 4/5] Add system view tracking IO ops per backend type
Add pg_stat_io, a system view which tracks the number of IOOps
(evictions, reuses, reads, writes, extensions, and fsyncs) done on each
IOObject (relation, temp relation) in each IOContext ("normal" and those
using a BufferAccessStrategy) by each type of backend (e.g. client
backend, checkpointer).
Some BackendTypes do not accumulate IO operations statistics and will
not be included in the view.
Some IOObjects are never operated on in some IOContexts or by some
BackendTypes. These rows are omitted from the view. For example,
checkpointer will never operate on IOOBJECT_TEMP_RELATION data, so those
rows are omitted.
Some IOContexts are not used by some BackendTypes and will not be in the
view. For example, checkpointer does not use a BufferAccessStrategy
(currently), so there will be no rows for BufferAccessStrategy
IOContexts for checkpointer.
Some IOOps are invalid in combination with certain IOContexts and
certain IOObjects. Those cells will be NULL in the view to distinguish
between 0 observed IOOps of that type and an invalid combination. For
example, temporary tables are not fsynced so cells for all BackendTypes
for IOOBJECT_TEMP_RELATION and IOOP_FSYNC will be NULL.
Some BackendTypes never perform certain IOOps. Those cells will also be
NULL in the view. For example, bgwriter should not perform reads.
The view's statistics are incremented when a backend performs an IO
operation and are maintained by the cumulative statistics subsystem.
Each row of the view shows stats for a particular BackendType, IOObject,
IOContext combination (e.g. a client backend's operations on permanent
relations in shared buffers) and each column in the view is the total
number of IO Operations done (e.g. writes). So a cell in the view would
be, for example, the number of blocks of relation data written from
shared buffers by client backends since the last stats reset.
In anticipation of tracking WAL IO and non-block-oriented IO (such as
temporary file IO), the "op_bytes" column specifies the unit of the
"reads", "writes", and "extends" columns for a given row.
Note that some of the cells in the view are redundant with fields in
pg_stat_bgwriter (e.g. buffers_backend), however these have been kept in
pg_stat_bgwriter for backwards compatibility. Deriving the redundant
pg_stat_bgwriter stats from the IO operations stats structures was also
problematic due to the separate reset targets for 'bgwriter' and 'io'.
Suggested by Andres Freund
Catalog version should be bumped.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
contrib/amcheck/expected/check_heap.out | 34 ++++
contrib/amcheck/sql/check_heap.sql | 27 +++
src/backend/catalog/system_views.sql | 15 ++
src/backend/utils/adt/pgstatfuncs.c | 141 +++++++++++++++
src/include/catalog/pg_proc.dat | 9 +
src/test/regress/expected/rules.out | 12 ++
src/test/regress/expected/stats.out | 227 ++++++++++++++++++++++++
src/test/regress/sql/stats.sql | 141 +++++++++++++++
src/tools/pgindent/typedefs.list | 1 +
9 files changed, 607 insertions(+)
diff --git a/contrib/amcheck/expected/check_heap.out b/contrib/amcheck/expected/check_heap.out
index c010361025..e4785141a6 100644
--- a/contrib/amcheck/expected/check_heap.out
+++ b/contrib/amcheck/expected/check_heap.out
@@ -66,6 +66,22 @@ SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'ALL-VISIBLE');
INSERT INTO heaptest (a, b)
(SELECT gs, repeat('x', gs)
FROM generate_series(1,50) gs);
+-- pg_stat_io test:
+-- verify_heapam always uses a BAS_BULKREAD BufferAccessStrategy, whereas a
+-- sequential scan does so only if the table is large enough when compared to
+-- shared buffers (see initscan()). CREATE DATABASE ... also unconditionally
+-- uses a BAS_BULKREAD strategy, but we have chosen to use a tablespace and
+-- verify_heapam to provide coverage instead of adding another expensive
+-- operation to the main regression test suite.
+--
+-- Create an alternative tablespace and move the heaptest table to it, causing
+-- it to be rewritten and all the blocks to be reliably evicted from shared
+-- buffers -- guaranteeing actual reads when we next select from it.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE regress_test_stats_tblspc LOCATION '';
+SELECT sum(reads) AS stats_bulkreads_before
+ FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+ALTER TABLE heaptest SET TABLESPACE regress_test_stats_tblspc;
-- Check that valid options are not rejected nor corruption reported
-- for a non-empty table
SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'none');
@@ -88,6 +104,23 @@ SELECT * FROM verify_heapam(relation := 'heaptest', startblock := 0, endblock :=
-------+--------+--------+-----
(0 rows)
+-- verify_heapam should have read in the page written out by
+-- ALTER TABLE ... SET TABLESPACE ...
+-- causing an additional bulkread, which should be reflected in pg_stat_io.
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(reads) AS stats_bulkreads_after
+ FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+SELECT :stats_bulkreads_after > :stats_bulkreads_before;
+ ?column?
+----------
+ t
+(1 row)
+
CREATE ROLE regress_heaptest_role;
-- verify permissions are checked (error due to function not callable)
SET ROLE regress_heaptest_role;
@@ -195,6 +228,7 @@ ERROR: cannot check relation "test_foreign_table"
DETAIL: This operation is not supported for foreign tables.
-- cleanup
DROP TABLE heaptest;
+DROP TABLESPACE regress_test_stats_tblspc;
DROP TABLE test_partition;
DROP TABLE test_partitioned;
DROP OWNED BY regress_heaptest_role; -- permissions
diff --git a/contrib/amcheck/sql/check_heap.sql b/contrib/amcheck/sql/check_heap.sql
index 298de6886a..6794ca4eb0 100644
--- a/contrib/amcheck/sql/check_heap.sql
+++ b/contrib/amcheck/sql/check_heap.sql
@@ -20,11 +20,29 @@ SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'NONE');
SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'ALL-FROZEN');
SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'ALL-VISIBLE');
+
-- Add some data so subsequent tests are not entirely trivial
INSERT INTO heaptest (a, b)
(SELECT gs, repeat('x', gs)
FROM generate_series(1,50) gs);
+-- pg_stat_io test:
+-- verify_heapam always uses a BAS_BULKREAD BufferAccessStrategy, whereas a
+-- sequential scan does so only if the table is large enough when compared to
+-- shared buffers (see initscan()). CREATE DATABASE ... also unconditionally
+-- uses a BAS_BULKREAD strategy, but we have chosen to use a tablespace and
+-- verify_heapam to provide coverage instead of adding another expensive
+-- operation to the main regression test suite.
+--
+-- Create an alternative tablespace and move the heaptest table to it, causing
+-- it to be rewritten and all the blocks to be reliably evicted from shared
+-- buffers -- guaranteeing actual reads when we next select from it.
+SET allow_in_place_tablespaces = true;
+CREATE TABLESPACE regress_test_stats_tblspc LOCATION '';
+SELECT sum(reads) AS stats_bulkreads_before
+ FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+ALTER TABLE heaptest SET TABLESPACE regress_test_stats_tblspc;
+
-- Check that valid options are not rejected nor corruption reported
-- for a non-empty table
SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'none');
@@ -32,6 +50,14 @@ SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'all-frozen');
SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'all-visible');
SELECT * FROM verify_heapam(relation := 'heaptest', startblock := 0, endblock := 0);
+-- verify_heapam should have read in the page written out by
+-- ALTER TABLE ... SET TABLESPACE ...
+-- causing an additional bulkread, which should be reflected in pg_stat_io.
+SELECT pg_stat_force_next_flush();
+SELECT sum(reads) AS stats_bulkreads_after
+ FROM pg_stat_io WHERE io_context = 'bulkread' \gset
+SELECT :stats_bulkreads_after > :stats_bulkreads_before;
+
CREATE ROLE regress_heaptest_role;
-- verify permissions are checked (error due to function not callable)
@@ -110,6 +136,7 @@ SELECT * FROM verify_heapam('test_foreign_table',
-- cleanup
DROP TABLE heaptest;
+DROP TABLESPACE regress_test_stats_tblspc;
DROP TABLE test_partition;
DROP TABLE test_partitioned;
DROP OWNED BY regress_heaptest_role; -- permissions
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 8608e3fa5b..34ca0e739f 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1117,6 +1117,21 @@ CREATE VIEW pg_stat_bgwriter AS
pg_stat_get_buf_alloc() AS buffers_alloc,
pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
+CREATE VIEW pg_stat_io AS
+SELECT
+ b.backend_type,
+ b.io_object,
+ b.io_context,
+ b.reads,
+ b.writes,
+ b.extends,
+ b.op_bytes,
+ b.evictions,
+ b.reuses,
+ b.fsyncs,
+ b.stats_reset
+FROM pg_stat_get_io() b;
+
CREATE VIEW pg_stat_wal AS
SELECT
w.wal_records,
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 924698e6ae..9d707c3521 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1245,6 +1245,147 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
}
+/*
+ * When adding a new column to the pg_stat_io view, add a new enum value
+ * here above IO_NUM_COLUMNS.
+ */
+typedef enum io_stat_col
+{
+ IO_COL_BACKEND_TYPE,
+ IO_COL_IO_OBJECT,
+ IO_COL_IO_CONTEXT,
+ IO_COL_READS,
+ IO_COL_WRITES,
+ IO_COL_EXTENDS,
+ IO_COL_CONVERSION,
+ IO_COL_EVICTIONS,
+ IO_COL_REUSES,
+ IO_COL_FSYNCS,
+ IO_COL_RESET_TIME,
+ IO_NUM_COLUMNS,
+} io_stat_col;
+
+/*
+ * When adding a new IOOp, add a new io_stat_col and add a case to this
+ * function returning the corresponding io_stat_col.
+ */
+static io_stat_col
+pgstat_get_io_op_index(IOOp io_op)
+{
+ switch (io_op)
+ {
+ case IOOP_EVICT:
+ return IO_COL_EVICTIONS;
+ case IOOP_READ:
+ return IO_COL_READS;
+ case IOOP_REUSE:
+ return IO_COL_REUSES;
+ case IOOP_WRITE:
+ return IO_COL_WRITES;
+ case IOOP_EXTEND:
+ return IO_COL_EXTENDS;
+ case IOOP_FSYNC:
+ return IO_COL_FSYNCS;
+ }
+
+ elog(ERROR, "unrecognized IOOp value: %d", io_op);
+ pg_unreachable();
+}
+
+Datum
+pg_stat_get_io(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsinfo;
+ PgStat_IO *backends_io_stats;
+ Datum reset_time;
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+
+ backends_io_stats = pgstat_fetch_stat_io();
+
+ reset_time = TimestampTzGetDatum(backends_io_stats->stat_reset_timestamp);
+
+ for (BackendType bktype = B_INVALID; bktype < BACKEND_NUM_TYPES; bktype++)
+ {
+ Datum bktype_desc = CStringGetTextDatum(GetBackendTypeDesc(bktype));
+ PgStat_BktypeIO *bktype_stats = &backends_io_stats->stats[bktype];
+
+ /*
+ * In Assert builds, we can afford an extra loop through all of the
+ * counters checking that only expected stats are non-zero, since it
+ * keeps the non-Assert code cleaner.
+ */
+ Assert(pgstat_bktype_io_stats_valid(bktype_stats, bktype));
+
+ /*
+ * For those BackendTypes without IO Operation stats, skip
+ * representing them in the view altogether.
+ */
+ if (!pgstat_tracks_io_bktype(bktype))
+ continue;
+
+ for (IOObject io_obj = IOOBJECT_FIRST;
+ io_obj < IOOBJECT_NUM_TYPES; io_obj++)
+ {
+ const char *obj_name = pgstat_get_io_object_name(io_obj);
+
+ for (IOContext io_context = IOCONTEXT_FIRST;
+ io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ {
+ const char *context_name = pgstat_get_io_context_name(io_context);
+
+ Datum values[IO_NUM_COLUMNS] = {0};
+ bool nulls[IO_NUM_COLUMNS] = {0};
+
+ /*
+ * Some combinations of BackendType, IOObject, and IOContext
+ * are not valid for any type of IOOp. In such cases, omit the
+ * entire row from the view.
+ */
+ if (!pgstat_tracks_io_object(bktype, io_obj, io_context))
+ continue;
+
+ values[IO_COL_BACKEND_TYPE] = bktype_desc;
+ values[IO_COL_IO_CONTEXT] = CStringGetTextDatum(context_name);
+ values[IO_COL_IO_OBJECT] = CStringGetTextDatum(obj_name);
+ values[IO_COL_RESET_TIME] = TimestampTzGetDatum(reset_time);
+
+ /*
+ * Hard-code this to the value of BLCKSZ for now. Future
+ * values could include XLOG_BLCKSZ, once WAL IO is tracked,
+ * and constant multipliers, once non-block-oriented IO (e.g.
+ * temporary file IO) is tracked.
+ */
+ values[IO_COL_CONVERSION] = Int64GetDatum(BLCKSZ);
+
+ for (IOOp io_op = IOOP_FIRST; io_op < IOOP_NUM_TYPES; io_op++)
+ {
+ int col_idx = pgstat_get_io_op_index(io_op);
+
+ /*
+ * Some combinations of BackendType and IOOp, of IOContext
+ * and IOOp, and of IOObject and IOOp are not tracked. Set
+ * these cells in the view to NULL.
+ */
+ nulls[col_idx] = !pgstat_tracks_io_op(bktype, io_obj, io_context, io_op);
+
+ if (nulls[col_idx])
+ continue;
+
+ values[col_idx] =
+ Int64GetDatum(bktype_stats->data[io_obj][io_context][io_op]);
+ }
+
+ tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
+ values, nulls);
+ }
+ }
+ }
+
+ return (Datum) 0;
+}
+
/*
* Returns statistics of WAL activity
*/
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 86eb8e8c58..2e804c5bd4 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5690,6 +5690,15 @@
proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
+{ oid => '8459', descr => 'statistics: per backend type IO statistics',
+ proname => 'pg_stat_get_io', provolatile => 'v',
+ prorows => '30', proretset => 't',
+ proparallel => 'r', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{text,text,text,int8,int8,int8,int8,int8,int8,int8,timestamptz}',
+ proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}',
+ proargnames => '{backend_type,io_object,io_context,reads,writes,extends,op_bytes,evictions,reuses,fsyncs,stats_reset}',
+ prosrc => 'pg_stat_get_io' },
+
{ oid => '1136', descr => 'statistics: information about WAL activity',
proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
proparallel => 'r', prorettype => 'record', proargtypes => '',
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index e7a2f5856a..174b725fff 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1876,6 +1876,18 @@ pg_stat_gssapi| SELECT pid,
gss_enc AS encrypted
FROM pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid, query_id)
WHERE (client_port IS NOT NULL);
+pg_stat_io| SELECT backend_type,
+ io_object,
+ io_context,
+ reads,
+ writes,
+ extends,
+ op_bytes,
+ evictions,
+ reuses,
+ fsyncs,
+ stats_reset
+ FROM pg_stat_get_io() b(backend_type, io_object, io_context, reads, writes, extends, op_bytes, evictions, reuses, fsyncs, stats_reset);
pg_stat_progress_analyze| SELECT s.pid,
s.datid,
d.datname,
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index 1d84407a03..3ad38da0dd 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -1126,4 +1126,231 @@ SELECT pg_stat_get_subscription_stats(NULL);
(1 row)
+-- Test that the following operations are tracked in pg_stat_io:
+-- - reads of target blocks into shared buffers
+-- - writes of shared buffers to permanent storage
+-- - extends of relations using shared buffers
+-- - fsyncs done to ensure the durability of data dirtying shared buffers
+-- There is no test for blocks evicted from shared buffers, because we cannot
+-- be sure of the state of shared buffers at the point the test is run.
+-- Create a regular table and insert some data to generate IOCONTEXT_NORMAL
+-- extends.
+SELECT sum(extends) AS io_sum_shared_before_extends
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+CREATE TABLE test_io_shared(a int);
+INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(extends) AS io_sum_shared_after_extends
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT :io_sum_shared_after_extends > :io_sum_shared_before_extends;
+ ?column?
+----------
+ t
+(1 row)
+
+-- After a checkpoint, there should be some additional IOCONTEXT_NORMAL writes
+-- and fsyncs.
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+ FROM pg_stat_io
+ WHERE io_context = 'normal' AND io_object = 'relation' \gset io_sum_shared_before_
+-- See comment above for rationale for two explicit CHECKPOINTs.
+CHECKPOINT;
+CHECKPOINT;
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+ FROM pg_stat_io
+ WHERE io_context = 'normal' AND io_object = 'relation' \gset io_sum_shared_after_
+SELECT :io_sum_shared_after_writes > :io_sum_shared_before_writes;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT current_setting('fsync') = 'off'
+ OR :io_sum_shared_after_fsyncs > :io_sum_shared_before_fsyncs;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Change the tablespace so that the table is rewritten directly, then SELECT
+-- from it to cause it to be read back into shared buffers.
+SELECT sum(reads) AS io_sum_shared_before_reads
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+ALTER TABLE test_io_shared SET TABLESPACE regress_tblspace;
+-- SELECT from the table so that the data is read into shared buffers and
+-- io_context 'normal', io_object 'relation' reads are counted.
+SELECT COUNT(*) FROM test_io_shared;
+ count
+-------
+ 100
+(1 row)
+
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(reads) AS io_sum_shared_after_reads
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT :io_sum_shared_after_reads > :io_sum_shared_before_reads;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Test that the following IOCONTEXT_LOCAL IOOps are tracked in pg_stat_io:
+-- - eviction of local buffers in order to reuse them
+-- - reads of temporary table blocks into local buffers
+-- - writes of local buffers to permanent storage
+-- - extends of temporary tables
+-- Set temp_buffers to its minimum so that we can trigger writes with fewer
+-- inserted tuples. Do so in a new session in case temporary tables have been
+-- accessed by previous tests in this session.
+\c
+SET temp_buffers TO 100;
+CREATE TEMPORARY TABLE test_io_local(a int, b TEXT);
+SELECT sum(extends) AS extends, sum(evictions) AS evictions, sum(writes) AS writes
+ FROM pg_stat_io
+ WHERE io_context = 'normal' AND io_object = 'temp relation' \gset io_sum_local_before_
+-- Insert tuples into the temporary table, generating extends in the stats.
+-- Insert enough values that we need to reuse and write out dirty local
+-- buffers, generating evictions and writes.
+INSERT INTO test_io_local SELECT generate_series(1, 5000) as id, repeat('a', 200);
+-- Ensure the table is large enough to exceed our temp_buffers setting.
+SELECT pg_relation_size('test_io_local') / current_setting('block_size')::int8 > 100;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT sum(reads) AS io_sum_local_before_reads
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+-- Read in evicted buffers, generating reads.
+SELECT COUNT(*) FROM test_io_local;
+ count
+-------
+ 5000
+(1 row)
+
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(evictions) AS evictions,
+ sum(reads) AS reads,
+ sum(writes) AS writes,
+ sum(extends) AS extends
+ FROM pg_stat_io
+ WHERE io_context = 'normal' AND io_object = 'temp relation' \gset io_sum_local_after_
+SELECT :io_sum_local_after_evictions > :io_sum_local_before_evictions,
+ :io_sum_local_after_reads > :io_sum_local_before_reads,
+ :io_sum_local_after_writes > :io_sum_local_before_writes,
+ :io_sum_local_after_extends > :io_sum_local_before_extends;
+ ?column? | ?column? | ?column? | ?column?
+----------+----------+----------+----------
+ t | t | t | t
+(1 row)
+
+-- Change the tablespaces so that the temporary table is rewritten to other
+-- local buffers, exercising a different codepath than standard local buffer
+-- writes.
+ALTER TABLE test_io_local SET TABLESPACE regress_tblspace;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(writes) AS io_sum_local_new_tblspc_writes
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT :io_sum_local_new_tblspc_writes > :io_sum_local_after_writes;
+ ?column?
+----------
+ t
+(1 row)
+
+RESET temp_buffers;
+-- Test that reuse of strategy buffers and reads of blocks into these reused
+-- buffers while VACUUMing are tracked in pg_stat_io.
+-- Set wal_skip_threshold smaller than the expected size of
+-- test_io_vac_strategy so that, even if wal_level is minimal, VACUUM FULL will
+-- fsync the newly rewritten test_io_vac_strategy instead of writing it to WAL.
+-- Writing it to WAL will result in the newly written relation pages being in
+-- shared buffers -- preventing us from testing BAS_VACUUM BufferAccessStrategy
+-- reads.
+SET wal_skip_threshold = '1 kB';
+SELECT sum(reuses) AS reuses, sum(reads) AS reads
+ FROM pg_stat_io WHERE io_context = 'vacuum' \gset io_sum_vac_strategy_before_
+CREATE TABLE test_io_vac_strategy(a int, b int) WITH (autovacuum_enabled = 'false');
+INSERT INTO test_io_vac_strategy SELECT i, i from generate_series(1, 8000)i;
+-- Ensure that the next VACUUM will need to perform IO by rewriting the table
+-- first with VACUUM (FULL).
+VACUUM (FULL) test_io_vac_strategy;
+VACUUM (PARALLEL 0) test_io_vac_strategy;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(reuses) AS reuses, sum(reads) AS reads
+ FROM pg_stat_io WHERE io_context = 'vacuum' \gset io_sum_vac_strategy_after_
+SELECT :io_sum_vac_strategy_after_reads > :io_sum_vac_strategy_before_reads,
+ :io_sum_vac_strategy_after_reuses > :io_sum_vac_strategy_before_reuses;
+ ?column? | ?column?
+----------+----------
+ t | t
+(1 row)
+
+RESET wal_skip_threshold;
+-- Test that extends done by a CTAS, which uses a BAS_BULKWRITE
+-- BufferAccessStrategy, are tracked in pg_stat_io.
+SELECT sum(extends) AS io_sum_bulkwrite_strategy_extends_before
+ FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+CREATE TABLE test_io_bulkwrite_strategy AS SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(extends) AS io_sum_bulkwrite_strategy_extends_after
+ FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+SELECT :io_sum_bulkwrite_strategy_extends_after > :io_sum_bulkwrite_strategy_extends_before;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Test IO stats reset
+SELECT pg_stat_have_stats('io', 0, 0);
+ pg_stat_have_stats
+--------------------
+ t
+(1 row)
+
+SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) AS io_stats_pre_reset
+ FROM pg_stat_io \gset
+SELECT pg_stat_reset_shared('io');
+ pg_stat_reset_shared
+----------------------
+
+(1 row)
+
+SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) AS io_stats_post_reset
+ FROM pg_stat_io \gset
+SELECT :io_stats_post_reset < :io_stats_pre_reset;
+ ?column?
+----------
+ t
+(1 row)
+
-- End of Stats Test
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index b4d6753c71..5badd09a1c 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -536,4 +536,145 @@ SELECT pg_stat_get_replication_slot(NULL);
SELECT pg_stat_get_subscription_stats(NULL);
+-- Test that the following operations are tracked in pg_stat_io:
+-- - reads of target blocks into shared buffers
+-- - writes of shared buffers to permanent storage
+-- - extends of relations using shared buffers
+-- - fsyncs done to ensure the durability of data dirtying shared buffers
+
+-- There is no test for blocks evicted from shared buffers, because we cannot
+-- be sure of the state of shared buffers at the point the test is run.
+
+-- Create a regular table and insert some data to generate IOCONTEXT_NORMAL
+-- extends.
+SELECT sum(extends) AS io_sum_shared_before_extends
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+CREATE TABLE test_io_shared(a int);
+INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+SELECT sum(extends) AS io_sum_shared_after_extends
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT :io_sum_shared_after_extends > :io_sum_shared_before_extends;
+
+-- After a checkpoint, there should be some additional IOCONTEXT_NORMAL writes
+-- and fsyncs.
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+ FROM pg_stat_io
+ WHERE io_context = 'normal' AND io_object = 'relation' \gset io_sum_shared_before_
+-- See comment above for rationale for two explicit CHECKPOINTs.
+CHECKPOINT;
+CHECKPOINT;
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+ FROM pg_stat_io
+ WHERE io_context = 'normal' AND io_object = 'relation' \gset io_sum_shared_after_
+
+SELECT :io_sum_shared_after_writes > :io_sum_shared_before_writes;
+SELECT current_setting('fsync') = 'off'
+ OR :io_sum_shared_after_fsyncs > :io_sum_shared_before_fsyncs;
+
+-- Change the tablespace so that the table is rewritten directly, then SELECT
+-- from it to cause it to be read back into shared buffers.
+SELECT sum(reads) AS io_sum_shared_before_reads
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+ALTER TABLE test_io_shared SET TABLESPACE regress_tblspace;
+-- SELECT from the table so that the data is read into shared buffers and
+-- io_context 'normal', io_object 'relation' reads are counted.
+SELECT COUNT(*) FROM test_io_shared;
+SELECT pg_stat_force_next_flush();
+SELECT sum(reads) AS io_sum_shared_after_reads
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT :io_sum_shared_after_reads > :io_sum_shared_before_reads;
+
+-- Test that the following IOCONTEXT_LOCAL IOOps are tracked in pg_stat_io:
+-- - eviction of local buffers in order to reuse them
+-- - reads of temporary table blocks into local buffers
+-- - writes of local buffers to permanent storage
+-- - extends of temporary tables
+
+-- Set temp_buffers to its minimum so that we can trigger writes with fewer
+-- inserted tuples. Do so in a new session in case temporary tables have been
+-- accessed by previous tests in this session.
+\c
+SET temp_buffers TO 100;
+CREATE TEMPORARY TABLE test_io_local(a int, b TEXT);
+SELECT sum(extends) AS extends, sum(evictions) AS evictions, sum(writes) AS writes
+ FROM pg_stat_io
+ WHERE io_context = 'normal' AND io_object = 'temp relation' \gset io_sum_local_before_
+-- Insert tuples into the temporary table, generating extends in the stats.
+-- Insert enough values that we need to reuse and write out dirty local
+-- buffers, generating evictions and writes.
+INSERT INTO test_io_local SELECT generate_series(1, 5000) as id, repeat('a', 200);
+-- Ensure the table is large enough to exceed our temp_buffers setting.
+SELECT pg_relation_size('test_io_local') / current_setting('block_size')::int8 > 100;
+
+SELECT sum(reads) AS io_sum_local_before_reads
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+-- Read in evicted buffers, generating reads.
+SELECT COUNT(*) FROM test_io_local;
+SELECT pg_stat_force_next_flush();
+SELECT sum(evictions) AS evictions,
+ sum(reads) AS reads,
+ sum(writes) AS writes,
+ sum(extends) AS extends
+ FROM pg_stat_io
+ WHERE io_context = 'normal' AND io_object = 'temp relation' \gset io_sum_local_after_
+SELECT :io_sum_local_after_evictions > :io_sum_local_before_evictions,
+ :io_sum_local_after_reads > :io_sum_local_before_reads,
+ :io_sum_local_after_writes > :io_sum_local_before_writes,
+ :io_sum_local_after_extends > :io_sum_local_before_extends;
+
+-- Change the tablespaces so that the temporary table is rewritten to other
+-- local buffers, exercising a different codepath than standard local buffer
+-- writes.
+ALTER TABLE test_io_local SET TABLESPACE regress_tblspace;
+SELECT pg_stat_force_next_flush();
+SELECT sum(writes) AS io_sum_local_new_tblspc_writes
+ FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset
+SELECT :io_sum_local_new_tblspc_writes > :io_sum_local_after_writes;
+RESET temp_buffers;
+
+-- Test that reuse of strategy buffers and reads of blocks into these reused
+-- buffers while VACUUMing are tracked in pg_stat_io.
+
+-- Set wal_skip_threshold smaller than the expected size of
+-- test_io_vac_strategy so that, even if wal_level is minimal, VACUUM FULL will
+-- fsync the newly rewritten test_io_vac_strategy instead of writing it to WAL.
+-- Writing it to WAL will result in the newly written relation pages being in
+-- shared buffers -- preventing us from testing BAS_VACUUM BufferAccessStrategy
+-- reads.
+SET wal_skip_threshold = '1 kB';
+SELECT sum(reuses) AS reuses, sum(reads) AS reads
+ FROM pg_stat_io WHERE io_context = 'vacuum' \gset io_sum_vac_strategy_before_
+CREATE TABLE test_io_vac_strategy(a int, b int) WITH (autovacuum_enabled = 'false');
+INSERT INTO test_io_vac_strategy SELECT i, i from generate_series(1, 8000)i;
+-- Ensure that the next VACUUM will need to perform IO by rewriting the table
+-- first with VACUUM (FULL).
+VACUUM (FULL) test_io_vac_strategy;
+VACUUM (PARALLEL 0) test_io_vac_strategy;
+SELECT pg_stat_force_next_flush();
+SELECT sum(reuses) AS reuses, sum(reads) AS reads
+ FROM pg_stat_io WHERE io_context = 'vacuum' \gset io_sum_vac_strategy_after_
+SELECT :io_sum_vac_strategy_after_reads > :io_sum_vac_strategy_before_reads,
+ :io_sum_vac_strategy_after_reuses > :io_sum_vac_strategy_before_reuses;
+RESET wal_skip_threshold;
+
+-- Test that extends done by a CTAS, which uses a BAS_BULKWRITE
+-- BufferAccessStrategy, are tracked in pg_stat_io.
+SELECT sum(extends) AS io_sum_bulkwrite_strategy_extends_before
+ FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+CREATE TABLE test_io_bulkwrite_strategy AS SELECT i FROM generate_series(1,100)i;
+SELECT pg_stat_force_next_flush();
+SELECT sum(extends) AS io_sum_bulkwrite_strategy_extends_after
+ FROM pg_stat_io WHERE io_context = 'bulkwrite' \gset
+SELECT :io_sum_bulkwrite_strategy_extends_after > :io_sum_bulkwrite_strategy_extends_before;
+
+-- Test IO stats reset
+SELECT pg_stat_have_stats('io', 0, 0);
+SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) AS io_stats_pre_reset
+ FROM pg_stat_io \gset
+SELECT pg_stat_reset_shared('io');
+SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) AS io_stats_post_reset
+ FROM pg_stat_io \gset
+SELECT :io_stats_post_reset < :io_stats_pre_reset;
+
-- End of Stats Test
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 65be0dea1b..970a0cfd1d 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3376,6 +3376,7 @@ intset_internal_node
intset_leaf_node
intset_node
intvKEY
+io_stat_col
itemIdCompact
itemIdCompactData
iterator
--
2.34.1
Hello.
At Thu, 19 Jan 2023 21:15:34 -0500, Melanie Plageman <melanieplageman@gmail.com> wrote in
Oh dear-- an extra FlushBuffer() snuck in there somehow.
Removed it in attached v51.
Also, I fixed an issue in my tablespace.sql updates
I only looked at 0002 and 0004.
(Sorry for the random order of the comments.)
0002:
+ Assert(pgstat_bktype_io_stats_valid(bktype_shstats, MyBackendType));
This is relatively complex checking. We already assert out increments
of invalid counters, so this is checking whether some unrelated code
clobbered them, which we do only when consistency is critical. Is
there any need to do that here? I saw another occurrence of the same
assertion.
-/* Reset some shared cluster-wide counters */
+/*
+ * Reset some shared cluster-wide counters
+ *
+ * When adding a new reset target, ideally the name should match that in
+ * pgstat_kind_infos, if relevant.
+ */
I'm not sure the addition is useful..
+pgstat_count_io_op(IOObject io_object, IOContext io_context, IOOp io_op)
+{
+ Assert(io_object < IOOBJECT_NUM_TYPES);
+ Assert(io_context < IOCONTEXT_NUM_TYPES);
+ Assert(io_op < IOOP_NUM_TYPES);
+ Assert(pgstat_tracks_io_op(MyBackendType, io_object, io_context, io_op));
Is there any reason for not checking the value ranges in the
bottom-most functions? Out-of-range values can lead to out-of-bounds
access, so I don't think we should continue execution with them.
+ no_temp_rel = bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER ||
+ bktype == B_CHECKPOINTER || bktype == B_AUTOVAC_WORKER ||
+ bktype == B_STANDALONE_BACKEND || bktype == B_STARTUP;
I'm not sure I like omitting parentheses in such a long Boolean
expression on the right-hand side.
+ write_chunk_s(fpout, &pgStatLocal.snapshot.io);
+ if (!read_chunk_s(fpin, &shmem->io.stats))
The names of the functions hardly make sense alone to me. How about
write_struct()/read_struct()? (I personally prefer to use
write_chunk() directly..)
+ PgStat_BktypeIO
This patch abbreviates "backend" as "bk", but "be" is used in many
places. I think the naming should follow the existing precedent.
0004:
system_views.sql:
+FROM pg_stat_get_io() b;
What does the "b" stand for? (Backend? Then "s" or "i" seems
straightforward.)
+ nulls[col_idx] = !pgstat_tracks_io_op(bktype, io_obj, io_context, io_op);
+
+ if (nulls[col_idx])
+ continue;
+
+ values[col_idx] =
+ Int64GetDatum(bktype_stats->data[io_obj][io_context][io_op]);
This is a bit hard to read since it requires following the condition
flow. The following is simpler and, I think, closer to our standard:
if (pgstat_tracks_io_op())
	values[col_idx] =
		Int64GetDatum(bktype_stats->data[io_obj][io_context][io_op]);
else
	nulls[col_idx] = true;
+ Number of read operations in units of <varname>op_bytes</varname>.
I may be the only one who sees the name as ambiguous between "total
number of handled bytes" and "bytes handled per operation". Can't it
be op_blocksize or just block_size?
+ b.io_object,
+ b.io_context,
It's unclear to me why only these two columns are prefixed with
"io". Wouldn't "object_type" and just "context" work instead?
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Tue, 24 Jan 2023 17:22:03 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
+pgstat_count_io_op(IOObject io_object, IOContext io_context, IOOp io_op)
+{
+ Assert(io_object < IOOBJECT_NUM_TYPES);
+ Assert(io_context < IOCONTEXT_NUM_TYPES);
+ Assert(io_op < IOOP_NUM_TYPES);
+ Assert(pgstat_tracks_io_op(MyBackendType, io_object, io_context, io_op));
Is there any reason for not checking the value ranges at the
bottom-most functions? They can lead to out-of-bounds access so I
To make sure, the "They" means "out-of-range io_object/context/op
values"..
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hi,
On 2023-01-24 17:22:03 +0900, Kyotaro Horiguchi wrote:
Hello.
At Thu, 19 Jan 2023 21:15:34 -0500, Melanie Plageman <melanieplageman@gmail.com> wrote in
Oh dear-- an extra FlushBuffer() snuck in there somehow.
Removed it in attached v51.
Also, I fixed an issue in my tablespace.sql updates.
I only looked at 0002 and 0004.
(Sorry for the random order of the comments..)
0002:
+ Assert(pgstat_bktype_io_stats_valid(bktype_shstats, MyBackendType));
This is relatively complex checking. We already assert out increments
of invalid counters. Thus this is checking whether some unrelated code
clobbered them, which we do only when consistency is critical. Is
there any need to do that here? I saw another occurrence of the same
assertion.
I found it useful to find problems.
+ no_temp_rel = bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER ||
+ bktype == B_CHECKPOINTER || bktype == B_AUTOVAC_WORKER ||
+ bktype == B_STANDALONE_BACKEND || bktype == B_STARTUP;
I'm not sure I like to omit parentheses for such a long Boolean
expression on the right side.
What parens would help?
+ write_chunk_s(fpout, &pgStatLocal.snapshot.io);
+ if (!read_chunk_s(fpin, &shmem->io.stats))
The names of the functions hardly make sense alone to me. How about
write_struct()/read_struct()? (I personally prefer to use
write_chunk() directly..)
That's not related to this patch - there's several existing callers for
it. And write_struct wouldn't be better imo, because it's not just for
structs.
+ PgStat_BktypeIO
This patch abbreviates "backend" as "bk" but "be" is used in many
places. I think that naming should follow the predecessors.
The precedents aren't consistent unfortunately :)
+ Number of read operations in units of <varname>op_bytes</varname>.
I may be the only one who sees the name as ambiguous between "total
number of handled bytes" and "bytes handled per operation". Can't it
be op_blocksize or just block_size?
+ b.io_object,
+ b.io_context,
No, block wouldn't be helpful - we'd like to use this for something that isn't
uniform blocks.
Greetings,
Andres Freund
At Tue, 24 Jan 2023 14:35:12 -0800, Andres Freund <andres@anarazel.de> wrote in
0002:
+ Assert(pgstat_bktype_io_stats_valid(bktype_shstats, MyBackendType));
This is relatively complex checking. We already assert out increments
of invalid counters. Thus this is checking whether some unrelated code
clobbered them, which we do only when consistency is critical. Is
there any need to do that here? I saw another occurrence of the same
assertion.
I found it useful to find problems.
Okay.
+ no_temp_rel = bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER ||
+ bktype == B_CHECKPOINTER || bktype == B_AUTOVAC_WORKER ||
+ bktype == B_STANDALONE_BACKEND || bktype == B_STARTUP;
I'm not sure I like to omit parentheses for such a long Boolean
expression on the right side.
What parens would help?
I thought about the following.
no_temp_rel =
(bktype == B_AUTOVAC_LAUNCHER ||
bktype == B_BG_WRITER ||
bktype == B_CHECKPOINTER ||
bktype == B_AUTOVAC_WORKER ||
bktype == B_STANDALONE_BACKEND ||
bktype == B_STARTUP);
+ write_chunk_s(fpout, &pgStatLocal.snapshot.io);
+ if (!read_chunk_s(fpin, &shmem->io.stats))
The names of the functions hardly make sense alone to me. How about
write_struct()/read_struct()? (I personally prefer to use
write_chunk() directly..)
That's not related to this patch - there's several existing callers for
it. And write_struct wouldn't be better imo, because it's not just for
structs.
Hmm. Then what does the "_s" stand for?
+ PgStat_BktypeIO
This patch abbreviates "backend" as "bk" but "be" is used in many
places. I think that naming should follow the predecessors.
The precedents aren't consistent unfortunately :)
Uuuummmmm. Okay, it's just that I like "be" there! Anyway, I don't strongly
push that.
+ Number of read operations in units of <varname>op_bytes</varname>.
I may be the only one who sees the name as ambiguous between "total
number of handled bytes" and "bytes handled per operation". Can't it
be op_blocksize or just block_size?
+ b.io_object,
+ b.io_context,
No, block wouldn't be helpful - we'd like to use this for something that isn't
uniform blocks.
What does the field show in that case? The mean operation size? Or
one row per operation size? If the former, the name looks somewhat
wrong. If the latter, block_size seems to make sense.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hi,
I did another read through the series. I do have some minor changes, but
they're minor. I think this is ready for commit. I plan to start pushing
tomorrow.
The changes I made are:
- the tablespace test changes didn't quite work in isolation / needed a bit of
polishing
- moved the tablespace changes to later in the series
- split the tests out of the commit adding the view into its own commit
- minor code formatting things (e.g. didn't like nested for()s without {})
On 2023-01-25 16:56:17 +0900, Kyotaro Horiguchi wrote:
At Tue, 24 Jan 2023 14:35:12 -0800, Andres Freund <andres@anarazel.de> wrote in
+ write_chunk_s(fpout, &pgStatLocal.snapshot.io);
+ if (!read_chunk_s(fpin, &shmem->io.stats))
The names of the functions hardly make sense alone to me. How about
write_struct()/read_struct()? (I personally prefer to use
write_chunk() directly..)
That's not related to this patch - there's several existing callers for
it. And write_struct wouldn't be better imo, because it's not just for
structs.
Hmm. Then what does the "_s" stand for?
Size. It's a macro that just forwards to read_chunk()/write_chunk().
+ Number of read operations in units of <varname>op_bytes</varname>.
I may be the only one who sees the name as ambiguous between "total
number of handled bytes" and "bytes handled per operation". Can't it
be op_blocksize or just block_size?
+ b.io_object,
+ b.io_context,
No, block wouldn't be helpful - we'd like to use this for something that isn't
uniform blocks.
What does the field show in that case? The mean operation size? Or
one row per operation size? If the former, the name looks somewhat
wrong. If the latter, block_size seems to make sense.
1, so that it's clear that the rest are in bytes.
Greetings,
Andres Freund
At Tue, 7 Feb 2023 22:38:14 -0800, Andres Freund <andres@anarazel.de> wrote in
Hi,
I did another read through the series. I do have some minor changes, but
they're minor. I think this is ready for commit. I plan to start pushing
tomorrow.The changes I made are:
- the tablespace test changes didn't quite work in isolation / needed a bit of
polishing
- moved the tablespace changes to later in the series
- split the tests out of the commit adding the view into its own commit
- minor code formatting things (e.g. didn't like nested for()s without {})
On 2023-01-25 16:56:17 +0900, Kyotaro Horiguchi wrote:
At Tue, 24 Jan 2023 14:35:12 -0800, Andres Freund <andres@anarazel.de> wrote in
+ write_chunk_s(fpout, &pgStatLocal.snapshot.io);
+ if (!read_chunk_s(fpin, &shmem->io.stats))
The names of the functions hardly make sense alone to me. How about
write_struct()/read_struct()? (I personally prefer to use
write_chunk() directly..)
That's not related to this patch - there's several existing callers for
it. And write_struct wouldn't be better imo, because it's not just for
structs.
Hmm. Then what does the "_s" stand for?
Size. It's a macro that just forwards to read_chunk()/write_chunk().
I know what the macros do. But, I'm fine with the names as they are
there since before this patch. Sorry for the noise.
+ Number of read operations in units of <varname>op_bytes</varname>.
I may be the only one who sees the name as ambiguous between "total
number of handled bytes" and "bytes handled per operation". Can't it
be op_blocksize or just block_size?
+ b.io_object,
+ b.io_context,
No, block wouldn't be helpful - we'd like to use this for something that isn't
uniform blocks.
What does the field show in that case? The mean operation size? Or
one row per operation size? If the former, the name looks somewhat
wrong. If the latter, block_size seems to make sense.
1, so that it's clear that the rest are in bytes.
Thanks. Okay, I guess the documentation will be changed as necessary.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hi,
On 2023-02-07 22:38:14 -0800, Andres Freund wrote:
I did another read through the series. I do have some minor changes, but
they're minor. I think this is ready for commit. I plan to start pushing
tomorrow.
Pushed the first (and biggest) commit. More tomorrow.
Already can't wait to see incremental improvements of this version of
pg_stat_io ;). Tracking buffer hits. Tracking Wal IO. Tracking relation IO
bypassing shared buffers. Per connection IO statistics. Tracking IO time.
Greetings,
Andres Freund
Hi,
On 2023-02-08 21:03:19 -0800, Andres Freund wrote:
Pushed the first (and biggest) commit. More tomorrow.
Just pushed the actual pg_stat_io view, the splitting of the tablespace test,
and the pg_stat_io tests.
Yay!
Thanks all for patch and review!
Already can't wait to see incremental improvements of this version of
pg_stat_io ;). Tracking buffer hits. Tracking Wal IO. Tracking relation IO
bypassing shared buffers. Per connection IO statistics. Tracking IO time.
That's still the case.
Greetings,
Andres Freund
Hi,
On 2023-02-11 10:24:37 -0800, Andres Freund wrote:
Just pushed the actual pg_stat_io view, the splitting of the tablespace test,
and the pg_stat_io tests.
One thing I started to wonder about since is whether we should remove the io_
prefix from io_object, io_context. The prefixes make sense on the C level, but
it's not clear to me that that's also the case on the table level.
Greetings,
Andres Freund
On Tue, Feb 14, 2023 at 11:08 AM Andres Freund <andres@anarazel.de> wrote:
One thing I started to wonder about since is whether we should remove the io_
prefix from io_object, io_context. The prefixes make sense on the C level, but
it's not clear to me that that's also the case on the table level.
Yeah, +1. It's hard to argue that there would be any confusion,
considering `io_` is in the name of the view.
(Unless, I suppose, some other, non-I/O, "some_object" or
"some_context" column were to be introduced to this view in the
future. But that doesn't seem likely?)
At Tue, 14 Feb 2023 22:35:01 -0800, Maciek Sakrejda <m.sakrejda@gmail.com> wrote in
On Tue, Feb 14, 2023 at 11:08 AM Andres Freund <andres@anarazel.de> wrote:
One thing I started to wonder about since is whether we should remove the io_
prefix from io_object, io_context. The prefixes make sense on the C level, but
it's not clear to me that that's also the case on the table level.
Yeah, +1. It's hard to argue that there would be any confusion,
considering `io_` is in the name of the view.
We usually add such prefixes to the columns of system views and
catalogs, but it seems that's not the case for the stats views. Thus
+1 from me, too.
(Unless, I suppose, some other, non-I/O, "some_object" or
"some_context" column were to be introduced to this view in the
future. But that doesn't seem likely?)
I don't think that can happen. As for cross-view ambiguity, that is
already present. Many columns in stats views share the same names with
some other views.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Sat, Feb 11, 2023 at 10:24:37AM -0800, Andres Freund wrote:
On 2023-02-08 21:03:19 -0800, Andres Freund wrote:
Pushed the first (and biggest) commit. More tomorrow.
Just pushed the actual pg_stat_io view, the splitting of the tablespace test,
and the pg_stat_io tests.
pg_stat_io says:
* Some BackendTypes do not currently perform any IO in certain
* IOContexts, and, while it may not be inherently incorrect for them to
* do so, excluding those rows from the view makes the view easier to use.
if (bktype == B_AUTOVAC_LAUNCHER && io_context == IOCONTEXT_VACUUM)
return false;
if ((bktype == B_AUTOVAC_WORKER || bktype == B_AUTOVAC_LAUNCHER) &&
io_context == IOCONTEXT_BULKWRITE)
return false;
What about these combinations? Aren't these also "can't happen"?
relation | bulkread | autovacuum worker
relation | bulkread | autovacuum launcher
relation | vacuum | startup
--
Justin
On Tue, Feb 21, 2023 at 07:50:35PM -0600, Justin Pryzby wrote:
On Sat, Feb 11, 2023 at 10:24:37AM -0800, Andres Freund wrote:
On 2023-02-08 21:03:19 -0800, Andres Freund wrote:
Pushed the first (and biggest) commit. More tomorrow.
Just pushed the actual pg_stat_io view, the splitting of the tablespace test,
and the pg_stat_io tests.
pg_stat_io says:
* Some BackendTypes do not currently perform any IO in certain
* IOContexts, and, while it may not be inherently incorrect for them to
* do so, excluding those rows from the view makes the view easier to use.
if (bktype == B_AUTOVAC_LAUNCHER && io_context == IOCONTEXT_VACUUM)
return false;
if ((bktype == B_AUTOVAC_WORKER || bktype == B_AUTOVAC_LAUNCHER) &&
io_context == IOCONTEXT_BULKWRITE)
return false;
What about these combinations? Aren't these also "can't happen"?
relation | bulkread | autovacuum worker
relation | bulkread | autovacuum launcher
relation | vacuum | startup
Nevermind - at least these are possible.
(gdb) p MyBackendType
$1 = B_AUTOVAC_WORKER
(gdb) p io_object
$2 = IOOBJECT_RELATION
(gdb) p io_context
$3 = IOCONTEXT_BULKREAD
(gdb) p io_op
$4 = IOOP_EVICT
(gdb) bt
...
#9 0x0000557b2f6097a3 in ReadBufferExtended (reln=0x7ff5ccee36b8, forkNum=forkNum@entry=MAIN_FORKNUM, blockNum=blockNum@entry=16, mode=mode@entry=RBM_NORMAL, strategy=0x557b305fb568) at ../src/include/utils/rel.h:573
#10 0x0000557b2f3057c0 in heapgetpage (sscan=sscan@entry=0x557b305fb158, block=block@entry=16) at ../src/backend/access/heap/heapam.c:405
#11 0x0000557b2f305d6c in heapgettup_pagemode (scan=scan@entry=0x557b305fb158, dir=dir@entry=ForwardScanDirection, nkeys=0, key=0x0) at ../src/backend/access/heap/heapam.c:885
#12 0x0000557b2f306956 in heap_getnext (sscan=sscan@entry=0x557b305fb158, direction=direction@entry=ForwardScanDirection) at ../src/backend/access/heap/heapam.c:1122
#13 0x0000557b2f59be0c in do_autovacuum () at ../src/backend/postmaster/autovacuum.c:2061
#14 0x0000557b2f59ccf7 in AutoVacWorkerMain (argc=argc@entry=0, argv=argv@entry=0x0) at ../src/backend/postmaster/autovacuum.c:1716
#15 0x0000557b2f59cdd8 in StartAutoVacWorker () at ../src/backend/postmaster/autovacuum.c:1494
#16 0x0000557b2f5a561a in StartAutovacuumWorker () at ../src/backend/postmaster/postmaster.c:5481
#17 0x0000557b2f5a5a39 in process_pm_pmsignal () at ../src/backend/postmaster/postmaster.c:5192
#18 0x0000557b2f5a5d7e in ServerLoop () at ../src/backend/postmaster/postmaster.c:1770
#19 0x0000557b2f5a73da in PostmasterMain (argc=9, argv=<optimized out>) at ../src/backend/postmaster/postmaster.c:1463
#20 0x0000557b2f4dfc39 in main (argc=9, argv=0x557b30568f50) at ../src/backend/main/main.c:200
--
Justin
Andres Freund <andres@anarazel.de> writes:
Pushed the first (and biggest) commit. More tomorrow.
I hadn't run my buildfarm-compile-warning scraper for a little while,
but I just did, and I find that this commit is causing warnings on
no fewer than 14 buildfarm animals. They all look like
ayu | 2023-02-25 23:02:08 | pgstat_io.c:40:14: warning: comparison of constant 2 with expression of type 'IOObject' (aka 'enum IOObject') is always true [-Wtautological-constant-out-of-range-compare]
ayu | 2023-02-25 23:02:08 | pgstat_io.c:43:16: warning: comparison of constant 4 with expression of type 'IOContext' (aka 'enum IOContext') is always true [-Wtautological-constant-out-of-range-compare]
ayu | 2023-02-25 23:02:08 | pgstat_io.c:70:19: warning: comparison of constant 2 with expression of type 'IOObject' (aka 'enum IOObject') is always true [-Wtautological-constant-out-of-range-compare]
ayu | 2023-02-25 23:02:08 | pgstat_io.c:71:20: warning: comparison of constant 4 with expression of type 'IOContext' (aka 'enum IOContext') is always true [-Wtautological-constant-out-of-range-compare]
ayu | 2023-02-25 23:02:08 | pgstat_io.c:115:14: warning: comparison of constant 2 with expression of type 'IOObject' (aka 'enum IOObject') is always true [-Wtautological-constant-out-of-range-compare]
ayu | 2023-02-25 23:02:08 | pgstat_io.c:118:16: warning: comparison of constant 4 with expression of type 'IOContext' (aka 'enum IOContext') is always true [-Wtautological-constant-out-of-range-compare]
ayu | 2023-02-25 23:02:08 | pgstatfuncs.c:1329:12: warning: comparison of constant 2 with expression of type 'IOObject' (aka 'enum IOObject') is always true [-Wtautological-constant-out-of-range-compare]
ayu | 2023-02-25 23:02:08 | pgstatfuncs.c:1334:17: warning: comparison of constant 4 with expression of type 'IOContext' (aka 'enum IOContext') is always true [-Wtautological-constant-out-of-range-compare]
That is, these compilers think that comparisons like
io_object < IOOBJECT_NUM_TYPES
io_context < IOCONTEXT_NUM_TYPES
are constant-true. This seems not good; if they were to actually
act on this observation, by removing those loop-ending tests,
we'd have a problem.
The issue seems to be that code like this:
typedef enum IOContext
{
IOCONTEXT_BULKREAD,
IOCONTEXT_BULKWRITE,
IOCONTEXT_NORMAL,
IOCONTEXT_VACUUM,
} IOContext;
#define IOCONTEXT_FIRST IOCONTEXT_BULKREAD
#define IOCONTEXT_NUM_TYPES (IOCONTEXT_VACUUM + 1)
is far too cute for its own good. I'm not sure about how to fix it
either. I thought of defining
#define IOCONTEXT_LAST IOCONTEXT_VACUUM
and make the loop conditions like "io_context <= IOCONTEXT_LAST",
but that doesn't actually fix the problem.
(Even aside from that, I do not find this coding even a little bit
mistake-proof: you still have to remember to update the #define
when adding another enum value.)
We have similar code involving enum ForkNumber but it looks to me
like the loop variables are always declared as plain "int". That
might be the path of least resistance here.
regards, tom lane
I wrote:
The issue seems to be that code like this:
...
is far too cute for its own good.
Oh, there's another thing here that qualifies as too-cute: loops like
for (IOObject io_object = IOOBJECT_FIRST;
io_object < IOOBJECT_NUM_TYPES; io_object++)
make it look like we could define these enums as 1-based rather
than 0-based, but if we did this code would fail, because it's
confusing "the number of values" with "1 more than the last value".
Again, we could fix that with tests like "io_context <= IOCONTEXT_LAST",
but I don't see the point of adding more macros rather than removing
some. We do need IOOBJECT_NUM_TYPES to declare array sizes with,
so I think we should nuke the "xxx_FIRST" macros as being not worth
the electrons they're written on, and write these loops like
for (int io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++)
which is not actually adding any assumptions that you don't already
make by using io_object as a C array subscript.
regards, tom lane
Hi,
On 2023-02-26 13:20:00 -0500, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
Pushed the first (and biggest) commit. More tomorrow.
I hadn't run my buildfarm-compile-warning scraper for a little while,
but I just did, and I find that this commit is causing warnings on
no fewer than 14 buildfarm animals. They all look like
ayu | 2023-02-25 23:02:08 | pgstat_io.c:40:14: warning: comparison of constant 2 with expression of type 'IOObject' (aka 'enum IOObject') is always true [-Wtautological-constant-out-of-range-compare]
ayu | 2023-02-25 23:02:08 | pgstat_io.c:43:16: warning: comparison of constant 4 with expression of type 'IOContext' (aka 'enum IOContext') is always true [-Wtautological-constant-out-of-range-compare]
ayu | 2023-02-25 23:02:08 | pgstat_io.c:70:19: warning: comparison of constant 2 with expression of type 'IOObject' (aka 'enum IOObject') is always true [-Wtautological-constant-out-of-range-compare]
ayu | 2023-02-25 23:02:08 | pgstat_io.c:71:20: warning: comparison of constant 4 with expression of type 'IOContext' (aka 'enum IOContext') is always true [-Wtautological-constant-out-of-range-compare]
ayu | 2023-02-25 23:02:08 | pgstat_io.c:115:14: warning: comparison of constant 2 with expression of type 'IOObject' (aka 'enum IOObject') is always true [-Wtautological-constant-out-of-range-compare]
ayu | 2023-02-25 23:02:08 | pgstat_io.c:118:16: warning: comparison of constant 4 with expression of type 'IOContext' (aka 'enum IOContext') is always true [-Wtautological-constant-out-of-range-compare]
ayu | 2023-02-25 23:02:08 | pgstatfuncs.c:1329:12: warning: comparison of constant 2 with expression of type 'IOObject' (aka 'enum IOObject') is always true [-Wtautological-constant-out-of-range-compare]
ayu | 2023-02-25 23:02:08 | pgstatfuncs.c:1334:17: warning: comparison of constant 4 with expression of type 'IOContext' (aka 'enum IOContext') is always true [-Wtautological-constant-out-of-range-compare]
What other animals? If it had been just ayu / clang 4, I'd not be sure it's
worth doing much here.
That is, these compilers think that comparisons like
io_object < IOOBJECT_NUM_TYPES
io_context < IOCONTEXT_NUM_TYPES
are constant-true. This seems not good; if they were to actually
act on this observation, by removing those loop-ending tests,
we'd have a problem.
It'd at least be obvious breakage :/
The issue seems to be that code like this:
typedef enum IOContext
{
IOCONTEXT_BULKREAD,
IOCONTEXT_BULKWRITE,
IOCONTEXT_NORMAL,
IOCONTEXT_VACUUM,
} IOContext;
#define IOCONTEXT_FIRST IOCONTEXT_BULKREAD
#define IOCONTEXT_NUM_TYPES (IOCONTEXT_VACUUM + 1)
is far too cute for its own good. I'm not sure about how to fix it
either. I thought of defining
#define IOCONTEXT_LAST IOCONTEXT_VACUUM
and make the loop conditions like "io_context <= IOCONTEXT_LAST",
but that doesn't actually fix the problem.
(Even aside from that, I do not find this coding even a little bit
mistake-proof: you still have to remember to update the #define
when adding another enum value.)
But the alternative is going around and updating N places, or having a LAST
member in the enum, which then means either adding pointless case
statements or adding default: cases, which prevents the compiler from warning
when a new case is added.
I haven't dug up an old enough compiler yet; what happens if
IOOBJECT_NUM_TYPES is redefined to ((int) IOOBJECT_TEMP_RELATION + 1)?
We have similar code involving enum ForkNumber but it looks to me
like the loop variables are always declared as plain "int". That
might be the path of least resistance here.
IIRC that caused some even longer lines due to casting the integer to the enum
in some other lines. Perhaps we should just cast for the < comparison?
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
On 2023-02-26 13:20:00 -0500, Tom Lane wrote:
I hadn't run my buildfarm-compile-warning scraper for a little while,
but I just did, and I find that this commit is causing warnings on
no fewer than 14 buildfarm animals. They all look like
What other animals? If it had been just ayu / clang 4, I'd not be sure it's
worth doing much here.
ayu
batfish
demoiselle
desmoxytes
dragonet
idiacanthus
mantid
petalura
phycodurus
pogona
wobbegong
Some of those are yours ;-)
Actually there are only 11, because I miscounted before, but
there are new compilers in that group not only old ones.
desmoxytes is gcc 10, for instance.
regards, tom lane
Hi,
On 2023-02-26 14:40:00 -0500, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
On 2023-02-26 13:20:00 -0500, Tom Lane wrote:
I hadn't run my buildfarm-compile-warning scraper for a little while,
but I just did, and I find that this commit is causing warnings on
no fewer than 14 buildfarm animals. They all look like
What other animals? If it had been just ayu / clang 4, I'd not be sure it's
worth doing much here.
ayu
batfish
demoiselle
desmoxytes
dragonet
idiacanthus
mantid
petalura
phycodurus
pogona
wobbegong
Some of those are yours ;-)
Actually there are only 11, because I miscounted before, but
there are new compilers in that group not only old ones.
desmoxytes is gcc 10, for instance.
I think on mine the warnings come from the clang to generate bitcode, rather
than gcc. The parallel make output makes that a bit hard to see though, as
commands and warnings are interspersed.
They're all animals for testing older LLVM versions. They're using
pretty old clang versions. phycodurus and dragonet are clang 3.9, petalura and
desmoxytes is clang 4, idiacanthus and pogona are clang 5.
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
They're all animals for testing older LLVM versions. They're using
pretty old clang versions. phycodurus and dragonet are clang 3.9, petalura and
desmoxytes is clang 4, idiacanthus and pogona are clang 5.
[ shrug ... ] If I thought this was actually good code, I might
agree with ignoring these warnings; but I think what it mostly is
is misleading overcomplication.
regards, tom lane
On 2023-02-26 15:08:33 -0500, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
They're all animals for testing older LLVM versions. They're using
pretty old clang versions. phycodurus and dragonet are clang 3.9, petalura and
desmoxytes is clang 4, idiacanthus and pogona are clang 5.
[ shrug ... ] If I thought this was actually good code, I might
agree with ignoring these warnings; but I think what it mostly is
is misleading overcomplication.
I don't mind removing *_FIRST et al by using 0. None of the proposals for
getting rid of *_NUM_* seemed a cure actually better than the disease.
Adding a cast to int of the loop iteration variable seems to work and is only
noticeably, not intolerably, ugly.
One thing that's odd is that the warnings don't appear reliably. The
"io_op < IOOP_NUM_TYPES" comparison in pgstatfuncs.c doesn't trigger any
with clang-4.
Greetings,
Andres Freund
On Sun, Feb 26, 2023 at 12:33:03PM -0800, Andres Freund wrote:
On 2023-02-26 15:08:33 -0500, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
They're all animals for testing older LLVM versions. They're using
pretty old clang versions. phycodurus and dragonet are clang 3.9, petalura and
desmoxytes is clang 4, idiacanthus and pogona are clang 5.
[ shrug ... ] If I thought this was actually good code, I might
agree with ignoring these warnings; but I think what it mostly is
is misleading overcomplication.
I don't mind removing *_FIRST et al by using 0. None of the proposals for
getting rid of *_NUM_* seemed a cure actually better than the disease.
I am also fine with removing *_FIRST and allowing those electrons to
move on to bigger and better things :)
Adding a cast to int of the loop iteration variable seems to work and is only
noticeably, not intolerably, ugly.
One thing that's odd is that the warnings don't appear reliably. The
"io_op < IOOP_NUM_TYPES" comparison in pgstatfuncs.c doesn't trigger any
with clang-4.
Using an int and casting all over the place certainly doesn't make the
code more attractive, but I am fine with this if it seems like the least
bad solution.
I didn't want to write a patch with this (ints instead of enums as loop
control variable) without being able to reproduce the warnings myself
and confirm the patch silences them. However, I wasn't able to reproduce
the warnings myself. I tried to do so with a minimal repro on godbolt,
and even with
-Wtautological-constant-out-of-range-compare -Wall -Wextra -Weverything -Werror
I couldn't get clang 4 or 5 (or a number of other compilers I randomly
picked from the dropdown) to produce the warnings.
- Melanie
On Sun, Feb 26, 2023 at 04:11:45PM -0500, Melanie Plageman wrote:
On Sun, Feb 26, 2023 at 12:33:03PM -0800, Andres Freund wrote:
On 2023-02-26 15:08:33 -0500, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
They're all animals for testing older LLVM versions. They're using
pretty old clang versions. phycodurus and dragonet are clang 3.9, petalura and
desmoxytes is clang 4, idiacanthus and pogona are clang 5.
[ shrug ... ] If I thought this was actually good code, I might
agree with ignoring these warnings; but I think what it mostly is
is misleading overcomplication.
I don't mind removing *_FIRST et al by using 0. None of the proposals for
getting rid of *_NUM_* seemed a cure actually better than the disease.
I am also fine with removing *_FIRST and allowing those electrons to
move on to bigger and better things :)
Adding a cast to int of the loop iteration variable seems to work and is only
noticeably, not intolerably, ugly.
One thing that's odd is that the warnings don't appear reliably. The
"io_op < IOOP_NUM_TYPES" comparison in pgstatfuncs.c doesn't trigger any
with clang-4.
Using an int and casting all over the place certainly doesn't make the
code more attractive, but I am fine with this if it seems like the least
bad solution.
I didn't want to write a patch with this (ints instead of enums as loop
control variable) without being able to reproduce the warnings myself
and confirm the patch silences them. However, I wasn't able to reproduce
the warnings myself. I tried to do so with a minimal repro on godbolt,
and even with
-Wtautological-constant-out-of-range-compare -Wall -Wextra -Weverything -Werror
I couldn't get clang 4 or 5 (or a number of other compilers I randomly
picked from the dropdown) to produce the warnings.
Just kidding: it reproduces if the defined enum has two or fewer values.
Interesting...
After discovering this, tried out various solutions including one Andres
suggested:
for (IOOp io_op = 0; (int) io_op < IOOP_NUM_TYPES; io_op++)
and it does silence the warning. What do you think?
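[Editor's note: a sketch of the cast variant under discussion, with hypothetical Demo* names rather than the real IOOp enum. The cast appears only in the comparison, so the loop variable stays enum-typed when passed to functions or used as a subscript.]

```c
#include <assert.h>

/* Hypothetical 0-based enum standing in for IOOp. */
typedef enum DemoOp
{
	DEMO_EVICT,
	DEMO_READ,
	DEMO_WRITE,
} DemoOp;

#define DEMO_OP_NUM_TYPES (DEMO_WRITE + 1)

static int
demo_count_nonzero(const int ops[DEMO_OP_NUM_TYPES])
{
	int			n = 0;

	/* Cast only in the comparison; op remains DemoOp elsewhere. */
	for (DemoOp op = 0; (int) op < DEMO_OP_NUM_TYPES; op++)
	{
		if (ops[op] != 0)
			n++;
	}

	return n;
}
```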
- Melanie
On Sun, Feb 26, 2023 at 1:52 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I wrote:
The issue seems to be that code like this:
...
is far too cute for its own good.
Oh, there's another thing here that qualifies as too-cute: loops like
for (IOObject io_object = IOOBJECT_FIRST;
io_object < IOOBJECT_NUM_TYPES; io_object++)
make it look like we could define these enums as 1-based rather
than 0-based, but if we did this code would fail, because it's
confusing "the number of values" with "1 more than the last value".
Again, we could fix that with tests like "io_context <= IOCONTEXT_LAST",
but I don't see the point of adding more macros rather than removing
some. We do need IOOBJECT_NUM_TYPES to declare array sizes with,
so I think we should nuke the "xxx_FIRST" macros as being not worth
the electrons they're written on, and write these loops like
for (int io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++)
which is not actually adding any assumptions that you don't already
make by using io_object as a C array subscript.
Attached is a patch to remove the *_FIRST macros.
I was going to add in code to change
for (IOObject io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++)
to
for (IOObject io_object = 0; (int) io_object < IOOBJECT_NUM_TYPES;
io_object++)
but then I couldn't remember why we didn't just do
for (int io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++)
I recall that when passing that loop variable into a function I was
getting a compiler warning that required me to cast the value back to an
enum to silence it:
pgstat_tracks_io_op(bktype, (IOObject) io_object,
io_context, io_op))
However, I am now unable to reproduce that warning.
Moreover, I see in cases like table_block_relation_size() with
ForkNumber, the variable i is passed with no cast to smgrnblocks().
- Melanie
Attachment: v1-0001-Remove-potentially-misleading-_FIRST-macros.patch
From cce6dc75e9e4fc9adc018a1d05874be5f3be96ae Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 27 Feb 2023 08:22:53 -0500
Subject: [PATCH v1 1/2] Remove potentially misleading *_FIRST macros
28e626bde00ef introduced IO statistic enums IOOp, IOObject, and
IOContext along with macros *_FIRST intended for use when looping
through the enumerated values of each. Per discussion in [1] these
macros are confusing and error-prone. Remove them.
[1] https://www.postgresql.org/message-id/23770.1677437567%40sss.pgh.pa.us
---
src/backend/utils/activity/pgstat_io.c | 17 ++++++-----------
src/backend/utils/adt/pgstatfuncs.c | 10 ++++------
src/include/pgstat.h | 3 ---
3 files changed, 10 insertions(+), 20 deletions(-)
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 0e07e0848d..c478b126fa 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -36,18 +36,16 @@ pgstat_bktype_io_stats_valid(PgStat_BktypeIO *backend_io,
{
bool bktype_tracked = pgstat_tracks_io_bktype(bktype);
- for (IOObject io_object = IOOBJECT_FIRST;
- io_object < IOOBJECT_NUM_TYPES; io_object++)
+ for (IOObject io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++)
{
- for (IOContext io_context = IOCONTEXT_FIRST;
- io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ for (IOContext io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
{
/*
* Don't bother trying to skip to the next loop iteration if
* pgstat_tracks_io_object() would return false here. We still
* need to validate that each counter is zero anyway.
*/
- for (IOOp io_op = IOOP_FIRST; io_op < IOOP_NUM_TYPES; io_op++)
+ for (IOOp io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
{
/* No stats, so nothing to validate */
if (backend_io->data[io_object][io_context][io_op] == 0)
@@ -111,14 +109,11 @@ pgstat_flush_io(bool nowait)
else if (!LWLockConditionalAcquire(bktype_lock, LW_EXCLUSIVE))
return true;
- for (IOObject io_object = IOOBJECT_FIRST;
- io_object < IOOBJECT_NUM_TYPES; io_object++)
+ for (IOObject io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++)
{
- for (IOContext io_context = IOCONTEXT_FIRST;
- io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ for (IOContext io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
{
- for (IOOp io_op = IOOP_FIRST;
- io_op < IOOP_NUM_TYPES; io_op++)
+ for (IOOp io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
bktype_shstats->data[io_object][io_context][io_op] +=
PendingIOStats.data[io_object][io_context][io_op];
}
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 9d707c3521..12eda4ade0 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1306,7 +1306,7 @@ pg_stat_get_io(PG_FUNCTION_ARGS)
reset_time = TimestampTzGetDatum(backends_io_stats->stat_reset_timestamp);
- for (BackendType bktype = B_INVALID; bktype < BACKEND_NUM_TYPES; bktype++)
+ for (BackendType bktype = 0; bktype < BACKEND_NUM_TYPES; bktype++)
{
Datum bktype_desc = CStringGetTextDatum(GetBackendTypeDesc(bktype));
PgStat_BktypeIO *bktype_stats = &backends_io_stats->stats[bktype];
@@ -1325,13 +1325,11 @@ pg_stat_get_io(PG_FUNCTION_ARGS)
if (!pgstat_tracks_io_bktype(bktype))
continue;
- for (IOObject io_obj = IOOBJECT_FIRST;
- io_obj < IOOBJECT_NUM_TYPES; io_obj++)
+ for (IOObject io_obj = 0; io_obj < IOOBJECT_NUM_TYPES; io_obj++)
{
const char *obj_name = pgstat_get_io_object_name(io_obj);
- for (IOContext io_context = IOCONTEXT_FIRST;
- io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ for (IOContext io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
{
const char *context_name = pgstat_get_io_context_name(io_context);
@@ -1359,7 +1357,7 @@ pg_stat_get_io(PG_FUNCTION_ARGS)
*/
values[IO_COL_CONVERSION] = Int64GetDatum(BLCKSZ);
- for (IOOp io_op = IOOP_FIRST; io_op < IOOP_NUM_TYPES; io_op++)
+ for (IOOp io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
{
int col_idx = pgstat_get_io_op_index(io_op);
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index db9675884f..f43fac09ed 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -287,7 +287,6 @@ typedef enum IOObject
IOOBJECT_TEMP_RELATION,
} IOObject;
-#define IOOBJECT_FIRST IOOBJECT_RELATION
#define IOOBJECT_NUM_TYPES (IOOBJECT_TEMP_RELATION + 1)
typedef enum IOContext
@@ -298,7 +297,6 @@ typedef enum IOContext
IOCONTEXT_VACUUM,
} IOContext;
-#define IOCONTEXT_FIRST IOCONTEXT_BULKREAD
#define IOCONTEXT_NUM_TYPES (IOCONTEXT_VACUUM + 1)
typedef enum IOOp
@@ -311,7 +309,6 @@ typedef enum IOOp
IOOP_WRITE,
} IOOp;
-#define IOOP_FIRST IOOP_EVICT
#define IOOP_NUM_TYPES (IOOP_WRITE + 1)
typedef struct PgStat_BktypeIO
--
2.37.2
Melanie Plageman <melanieplageman@gmail.com> writes:
Attached is a patch to remove the *_FIRST macros.
I was going to add in code to change
for (IOObject io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++)
to
for (IOObject io_object = 0; (int) io_object < IOOBJECT_NUM_TYPES; io_object++)
I don't really like that proposal. ISTM it's just silencing the
messenger rather than addressing the underlying problem, namely that
there's no guarantee that an IOObject variable can hold the value
IOOBJECT_NUM_TYPES, which it had better do if you want the loop to
terminate. Admittedly it's quite unlikely that these three enums would
grow to the point that that becomes an actual hazard for them --- but
IMO it's still bad practice and a bad precedent for future code.
but then I couldn't remember why we didn't just do
for (int io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++)
I recall that when passing that loop variable into a function I was
getting a compiler warning that required me to cast the value back to an
enum to silence it:
pgstat_tracks_io_op(bktype, (IOObject) io_object,
io_context, io_op))
However, I am now unable to reproduce that warning.
Moreover, I see in cases like table_block_relation_size() with
ForkNumber, the variable i is passed with no cast to smgrnblocks().
Yeah, my druthers would be to just do it the way we do comparable
things with ForkNumber. I don't feel like we need to invent a
better way here.
The risk of needing to cast when using the "int" loop variable
as an enum is obviously the downside of that approach, but we have
not seen any indication that any compilers actually do warn.
It's interesting that you did see such a warning ... I wonder which
compiler you were using at the time?
regards, tom lane
On Mon, Feb 27, 2023 at 10:30 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Melanie Plageman <melanieplageman@gmail.com> writes:
Attached is a patch to remove the *_FIRST macros.
I was going to add in code to change
for (IOObject io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++)
to
for (IOObject io_object = 0; (int) io_object < IOOBJECT_NUM_TYPES; io_object++)
I don't really like that proposal. ISTM it's just silencing the
messenger rather than addressing the underlying problem, namely that
there's no guarantee that an IOObject variable can hold the value
IOOBJECT_NUM_TYPES, which it had better do if you want the loop to
terminate. Admittedly it's quite unlikely that these three enums would
grow to the point that that becomes an actual hazard for them --- but
IMO it's still bad practice and a bad precedent for future code.
That's fair. Patch attached.
but then I couldn't remember why we didn't just do
for (int io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++)
I recall that when passing that loop variable into a function I was
getting a compiler warning that required me to cast the value back to an
enum to silence it:
pgstat_tracks_io_op(bktype, (IOObject) io_object,
io_context, io_op))
However, I am now unable to reproduce that warning.
Moreover, I see in cases like table_block_relation_size() with
ForkNumber, the variable i is passed with no cast to smgrnblocks().
Yeah, my druthers would be to just do it the way we do comparable
things with ForkNumber. I don't feel like we need to invent a
better way here.
The risk of needing to cast when using the "int" loop variable
as an enum is obviously the downside of that approach, but we have
not seen any indication that any compilers actually do warn.
It's interesting that you did see such a warning ... I wonder which
compiler you were using at the time?
so, pretty much any version of clang I tried with
-Wsign-conversion produces a warning.
<source>:35:32: warning: implicit conversion changes signedness: 'int'
to 'IOOp' (aka 'enum IOOp') [-Wsign-conversion]
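The warning scenario can be reproduced in miniature. The names below are illustrative, not the real pgstat API: an int loop variable is passed to an enum-typed parameter, which clang's -Wsign-conversion can flag on targets where the enum is given an unsigned underlying type. The code itself is well-defined either way.

```c
#include <assert.h>

/* Illustrative enum, not the real pgstat definitions. */
typedef enum IOOp
{
	IOOP_READ,
	IOOP_WRITE,
} IOOp;

#define IOOP_NUM_TYPES (IOOP_WRITE + 1)

static int	ops_seen[IOOP_NUM_TYPES];

/* Parameter is enum-typed, like pgstat_tracks_io_op()'s. */
static void
track_io_op(IOOp op)
{
	ops_seen[op]++;
}

/* Driving the enum-typed parameter from an int loop variable: clang
 * with -Wsign-conversion may warn at the marked call because the
 * signed int is implicitly converted to the enum type. */
static void
track_all_ops(void)
{
	for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
		track_io_op(io_op);		/* possible -Wsign-conversion site */
}
```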
I didn't do the casts in the attached patch since they aren't done elsewhere.
- Melanie
Attachments:
Change-IO-stats-enum-loop-variables-to-ints.patch (text/x-patch)
From 34fee4a3d1d1353aa38a95b3afc2d302a5f586ff Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 27 Feb 2023 08:48:11 -0500
Subject: [PATCH v1 2/2] Change IO stats enum loop variables to ints
Per [1], using an enum as the loop variable with a macro-defined
termination condition of #_enum_values + 1 is not guaranteed to be safe
- as compilers are free to make enums as small as a char.
Some (older) compilers will notice this when building with
-Wtautological-constant-out-of-range-compare.
[1] https://www.postgresql.org/message-id/354645.1677511842%40sss.pgh.pa.us
---
src/backend/utils/activity/pgstat_io.c | 12 ++++++------
src/backend/utils/adt/pgstatfuncs.c | 8 ++++----
2 files changed, 10 insertions(+), 10 deletions(-)
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index c478b126fa..c4199d18c8 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -36,16 +36,16 @@ pgstat_bktype_io_stats_valid(PgStat_BktypeIO *backend_io,
{
bool bktype_tracked = pgstat_tracks_io_bktype(bktype);
- for (IOObject io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++)
+ for (int io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++)
{
- for (IOContext io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
{
/*
* Don't bother trying to skip to the next loop iteration if
* pgstat_tracks_io_object() would return false here. We still
* need to validate that each counter is zero anyway.
*/
- for (IOOp io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
+ for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
{
/* No stats, so nothing to validate */
if (backend_io->data[io_object][io_context][io_op] == 0)
@@ -109,11 +109,11 @@ pgstat_flush_io(bool nowait)
else if (!LWLockConditionalAcquire(bktype_lock, LW_EXCLUSIVE))
return true;
- for (IOObject io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++)
+ for (int io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++)
{
- for (IOContext io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
{
- for (IOOp io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
+ for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
bktype_shstats->data[io_object][io_context][io_op] +=
PendingIOStats.data[io_object][io_context][io_op];
}
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 12eda4ade0..b61a12382b 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1306,7 +1306,7 @@ pg_stat_get_io(PG_FUNCTION_ARGS)
reset_time = TimestampTzGetDatum(backends_io_stats->stat_reset_timestamp);
- for (BackendType bktype = 0; bktype < BACKEND_NUM_TYPES; bktype++)
+ for (int bktype = 0; bktype < BACKEND_NUM_TYPES; bktype++)
{
Datum bktype_desc = CStringGetTextDatum(GetBackendTypeDesc(bktype));
PgStat_BktypeIO *bktype_stats = &backends_io_stats->stats[bktype];
@@ -1325,11 +1325,11 @@ pg_stat_get_io(PG_FUNCTION_ARGS)
if (!pgstat_tracks_io_bktype(bktype))
continue;
- for (IOObject io_obj = 0; io_obj < IOOBJECT_NUM_TYPES; io_obj++)
+ for (int io_obj = 0; io_obj < IOOBJECT_NUM_TYPES; io_obj++)
{
const char *obj_name = pgstat_get_io_object_name(io_obj);
- for (IOContext io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
{
const char *context_name = pgstat_get_io_context_name(io_context);
@@ -1357,7 +1357,7 @@ pg_stat_get_io(PG_FUNCTION_ARGS)
*/
values[IO_COL_CONVERSION] = Int64GetDatum(BLCKSZ);
- for (IOOp io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
+ for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
{
int col_idx = pgstat_get_io_op_index(io_op);
--
2.37.2
Melanie Plageman <melanieplageman@gmail.com> writes:
On Mon, Feb 27, 2023 at 10:30 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
The risk of needing to cast when using the "int" loop variable
as an enum is obviously the downside of that approach, but we have
not seen any indication that any compilers actually do warn.
It's interesting that you did see such a warning ... I wonder which
compiler you were using at the time?
so, pretty much any version of clang I tried with
-Wsign-conversion produces a warning.
<source>:35:32: warning: implicit conversion changes signedness: 'int'
to 'IOOp' (aka 'enum IOOp') [-Wsign-conversion]
Oh, interesting --- so it's not about the implicit conversion to enum
but just about signedness. I bet we could silence that by making the
loop variables be "unsigned int". I doubt it's worth any extra keystrokes
though, because we are not at all clean about sign-conversion warnings.
I tried enabling -Wsign-conversion on Apple's clang 14.0.0 just now,
and counted 13462 such warnings just in the core build :-(. I don't
foresee anybody trying to clean that up.
I didn't do the casts in the attached patch since they aren't done elsewhere.
Agreed. I'll push this along with the earlier patch if there are
not objections.
regards, tom lane
On 2023-02-27 14:58:30 -0500, Tom Lane wrote:
Agreed. I'll push this along with the earlier patch if there are
not objections.
None here.
Andres Freund <andres@anarazel.de> writes:
Just pushed the actual pg_stat_io view, the splitting of the tablespace test,
and the pg_stat_io tests.
One of the test cases is flapping a bit:
diff -U3 /home/pg/build-farm-15/buildroot/HEAD/pgsql.build/src/test/regress/expected/stats.out /home/pg/build-farm-15/buildroot/HEAD/pgsql.build/src/test/regress/results/stats.out
--- /home/pg/build-farm-15/buildroot/HEAD/pgsql.build/src/test/regress/expected/stats.out 2023-03-04 21:30:05.891579466 +0100
+++ /home/pg/build-farm-15/buildroot/HEAD/pgsql.build/src/test/regress/results/stats.out 2023-03-04 21:34:26.745552661 +0100
@@ -1201,7 +1201,7 @@
SELECT :io_sum_shared_after_reads > :io_sum_shared_before_reads;
?column?
----------
- t
+ f
(1 row)
DROP TABLE test_io_shared;
There are two instances of this today [1][2], and I've seen it before
but failed to note down where.
regards, tom lane
[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grison&dt=2023-03-04%2021%3A19%3A39
[2]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mule&dt=2023-03-04%2020%3A30%3A05
At Sat, 04 Mar 2023 18:21:09 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote in
Andres Freund <andres@anarazel.de> writes:
Just pushed the actual pg_stat_io view, the splitting of the tablespace test,
and the pg_stat_io tests.
One of the test cases is flapping a bit:
diff -U3 /home/pg/build-farm-15/buildroot/HEAD/pgsql.build/src/test/regress/expected/stats.out /home/pg/build-farm-15/buildroot/HEAD/pgsql.build/src/test/regress/results/stats.out
--- /home/pg/build-farm-15/buildroot/HEAD/pgsql.build/src/test/regress/expected/stats.out 2023-03-04 21:30:05.891579466 +0100
+++ /home/pg/build-farm-15/buildroot/HEAD/pgsql.build/src/test/regress/results/stats.out 2023-03-04 21:34:26.745552661 +0100
@@ -1201,7 +1201,7 @@
SELECT :io_sum_shared_after_reads > :io_sum_shared_before_reads;
?column?
----------
- t
+ f
(1 row)
DROP TABLE test_io_shared;
There are two instances of this today [1][2], and I've seen it before
but failed to note down where.
The concurrent autoanalyze below is logged as performing at least one
page read from the table. It is unclear, however, how that analyze
operation resulted in 19 hits and 2 reads on the (I think) single-page
relation.
In any case, I think we need to avoid such concurrent autovacuum/analyze.
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grison&dt=2023-03-04%2021%3A19%3A39
2023-03-04 22:36:27.781 CET [4073:106] pg_regress/stats LOG: statement: ALTER TABLE test_io_shared SET TABLESPACE regress_tblspace;
2023-03-04 22:36:27.838 CET [4073:107] pg_regress/stats LOG: statement: SELECT COUNT(*) FROM test_io_shared;
2023-03-04 22:36:27.864 CET [4255:5] LOG: automatic analyze of table "regression.public.test_io_shared"
avg read rate: 5.208 MB/s, avg write rate: 5.208 MB/s
buffer usage: 17 hits, 2 misses, 2 dirtied
2023-03-04 22:36:28.024 CET [4073:108] pg_regress/stats LOG: statement: SELECT pg_stat_force_next_flush();
2023-03-04 22:36:28.024 CET [4073:108] pg_regress/stats LOG: statement: SELECT pg_stat_force_next_flush();
2023-03-04 22:36:28.027 CET [4073:109] pg_regress/stats LOG: statement: SELECT sum(reads) AS io_sum_shared_after_reads
FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation'
[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grison&dt=2023-03-04%2021%3A19%3A39
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mule&dt=2023-03-04%2020%3A30%3A05
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Mon, 06 Mar 2023 15:24:25 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
In any case, I think we need to avoid such concurrent autovacuum/analyze.
If it is correct, I believe the attached fix works.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
fix_stats_test.diff (text/x-patch)
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index 937b2101b3..023ec5ecc4 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -1137,7 +1137,7 @@ SELECT pg_stat_get_subscription_stats(NULL);
-- extends.
SELECT sum(extends) AS io_sum_shared_before_extends
FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
-CREATE TABLE test_io_shared(a int);
+CREATE TABLE test_io_shared(a int) WITH (autovacuum_enabled = 'false');
INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
SELECT pg_stat_force_next_flush();
pg_stat_force_next_flush
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index 74e592aa8a..aa6552befd 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -549,7 +549,7 @@ SELECT pg_stat_get_subscription_stats(NULL);
-- extends.
SELECT sum(extends) AS io_sum_shared_before_extends
FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
-CREATE TABLE test_io_shared(a int);
+CREATE TABLE test_io_shared(a int) WITH (autovacuum_enabled = 'false');
INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
SELECT pg_stat_force_next_flush();
SELECT sum(extends) AS io_sum_shared_after_extends
On Mon, Mar 6, 2023 at 1:48 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
At Mon, 06 Mar 2023 15:24:25 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
In any case, I think we need to avoid such concurrent autovacuum/analyze.
If it is correct, I believe the attached fix works.
Thanks for investigating this!
Yes, this fix looks correct and makes sense to me.
On Mon, Mar 6, 2023 at 1:24 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
At Sat, 04 Mar 2023 18:21:09 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote in
Andres Freund <andres@anarazel.de> writes:
Just pushed the actual pg_stat_io view, the splitting of the tablespace test,
and the pg_stat_io tests.
One of the test cases is flapping a bit:
diff -U3 /home/pg/build-farm-15/buildroot/HEAD/pgsql.build/src/test/regress/expected/stats.out /home/pg/build-farm-15/buildroot/HEAD/pgsql.build/src/test/regress/results/stats.out
--- /home/pg/build-farm-15/buildroot/HEAD/pgsql.build/src/test/regress/expected/stats.out 2023-03-04 21:30:05.891579466 +0100
+++ /home/pg/build-farm-15/buildroot/HEAD/pgsql.build/src/test/regress/results/stats.out 2023-03-04 21:34:26.745552661 +0100
@@ -1201,7 +1201,7 @@
SELECT :io_sum_shared_after_reads > :io_sum_shared_before_reads;
?column?
----------
- t
+ f
(1 row)
DROP TABLE test_io_shared;
There are two instances of this today [1][2], and I've seen it before
but failed to note down where.
The concurrent autoanalyze below is logged as performing at least one
page read from the table. It is unclear, however, how that analyze
operation resulted in 19 hits and 2 reads on the (I think) single-page
relation.
Yes, it is a single page.
I think there could be a few different reasons why it is 2 misses/2
dirtied, but the one that seems most likely is that I/O of other
relations done during this autovac/analyze of this relation is counted
in the same global variables (like catalog tables).
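The accounting described here can be sketched in miniature. This is a simplified model with made-up dimensions and no locking, not the actual PostgreSQL code: every I/O a backend performs, whichever relation it touched, bumps the same backend-local counters, which are later rolled into shared totals by a triple loop like the one in pgstat_flush_io().

```c
#include <assert.h>
#include <string.h>

/* Made-up dimensions standing in for IOObject/IOContext/IOOp. */
#define NOBJ 2
#define NCTX 4
#define NOP  2

typedef struct BktypeIO
{
	long		data[NOBJ][NCTX][NOP];
} BktypeIO;

static BktypeIO pending_stats;	/* backend-local, like PendingIOStats */
static BktypeIO shared_stats;	/* shared slot; locking omitted here */

/* All of one backend's I/O lands in the same per-backend counters --
 * which is why an autovacuum worker's catalog reads show up in the
 * same cells as reads of the table it is processing. */
static void
count_io(int obj, int ctx, int op)
{
	pending_stats.data[obj][ctx][op]++;
}

/* Roll pending counters into the shared totals, then reset the
 * local side, mimicking pgstat_flush_io()'s triple loop. */
static void
flush_io(void)
{
	for (int obj = 0; obj < NOBJ; obj++)
		for (int ctx = 0; ctx < NCTX; ctx++)
			for (int op = 0; op < NOP; op++)
				shared_stats.data[obj][ctx][op] +=
					pending_stats.data[obj][ctx][op];
	memset(&pending_stats, 0, sizeof(pending_stats));
}
```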
- Melanie
Hi,
On 2023-03-06 10:09:24 -0500, Melanie Plageman wrote:
On Mon, Mar 6, 2023 at 1:48 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:At Mon, 06 Mar 2023 15:24:25 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
In any case, I think we need to avoid such concurrent autovacuum/analyze.
If it is correct, I believe the attached fix works.
Thanks for investigating this!
Yes, this fix looks correct and makes sense to me.
Wouldn't it be better to just perform the section from the ALTER TABLE till
the DROP TABLE in a transaction? Then there couldn't be any other accesses in
just that section. I'm not convinced it's good to disallow all concurrent
activity in other parts of the test.
Greetings,
Andres Freund
On Mon, Mar 06, 2023 at 11:09:19AM -0800, Andres Freund wrote:
Hi,
On 2023-03-06 10:09:24 -0500, Melanie Plageman wrote:
On Mon, Mar 6, 2023 at 1:48 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:At Mon, 06 Mar 2023 15:24:25 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
In any case, I think we need to avoid such concurrent autovacuum/analyze.
If it is correct, I believe the attached fix works.
Thanks for investigating this!
Yes, this fix looks correct and makes sense to me.
Wouldn't it be better to just perform the section from the ALTER TABLE till
the DROP TABLE in a transaction? Then there couldn't be any other accesses in
just that section. I'm not convinced it's good to disallow all concurrent
activity in other parts of the test.
You mean for test coverage reasons? Because the table in question only
exists for a few operations in this test file.
- Melanie
Hi,
On 2023-03-06 14:24:09 -0500, Melanie Plageman wrote:
On Mon, Mar 06, 2023 at 11:09:19AM -0800, Andres Freund wrote:
On 2023-03-06 10:09:24 -0500, Melanie Plageman wrote:
On Mon, Mar 6, 2023 at 1:48 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:At Mon, 06 Mar 2023 15:24:25 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
In any case, I think we need to avoid such concurrent autovacuum/analyze.
If it is correct, I believe the attached fix works.
Thanks for investigating this!
Yes, this fix looks correct and makes sense to me.
Wouldn't it be better to just perform the section from the ALTER TABLE till
the DROP TABLE in a transaction? Then there couldn't be any other accesses in
just that section. I'm not convinced it's good to disallow all concurrent
activity in other parts of the test.
You mean for test coverage reasons? Because the table in question only
exists for a few operations in this test file.
That, but also because it's simply more reliable. autovacuum=off doesn't
protect against an anti-wraparound vacuum or such. Or a concurrent test somehow
triggering a read. Or ...
Greetings,
Andres Freund
On Mon, Mar 6, 2023 at 2:34 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2023-03-06 14:24:09 -0500, Melanie Plageman wrote:
On Mon, Mar 06, 2023 at 11:09:19AM -0800, Andres Freund wrote:
On 2023-03-06 10:09:24 -0500, Melanie Plageman wrote:
On Mon, Mar 6, 2023 at 1:48 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:At Mon, 06 Mar 2023 15:24:25 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
In any case, I think we need to avoid such concurrent autovacuum/analyze.
If it is correct, I believe the attached fix works.
Thanks for investigating this!
Yes, this fix looks correct and makes sense to me.
Wouldn't it be better to just perform the section from the ALTER TABLE till
the DROP TABLE in a transaction? Then there couldn't be any other accesses in
just that section. I'm not convinced it's good to disallow all concurrent
activity in other parts of the test.You mean for test coverage reasons? Because the table in question only
exists for a few operations in this test file.
That, but also because it's simply more reliable. autovacuum=off doesn't
protect against an anti-wraparound vacuum or such. Or a concurrent test somehow
triggering a read. Or ...
Good point. Attached is what you suggested. I committed the transaction
before the drop table so that the statistics would be visible when we
queried pg_stat_io.
- Melanie
Attachments:
v1-0001-Fix-flakey-pg_stat_io-test.patch (text/x-patch)
From 78ca019624fbd7d6e2a4d94970b804fc834731b4 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 6 Mar 2023 15:16:03 -0500
Subject: [PATCH v1] Fix flakey pg_stat_io test
Wrap test of pg_stat_io's tracking of shared buffer reads in a
transaction to prevent concurrent accesses (e.g. by autovacuum) leading
to incorrect test failures.
Discussion: https://www.postgresql.org/message-id/20230306190919.ai6mxdq3sygyyths%40awork3.anarazel.de
---
src/test/regress/expected/stats.out | 4 ++++
src/test/regress/sql/stats.sql | 4 ++++
2 files changed, 8 insertions(+)
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index 937b2101b3..fb5adb0fd7 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -1181,6 +1181,9 @@ SELECT current_setting('fsync') = 'off'
-- from it to cause it to be read back into shared buffers.
SELECT sum(reads) AS io_sum_shared_before_reads
FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+-- Do this in a transaction to prevent any other concurrent access to our newly
+-- rewritten table, guaranteeing our test will pass.
+BEGIN;
ALTER TABLE test_io_shared SET TABLESPACE regress_tblspace;
-- SELECT from the table so that the data is read into shared buffers and
-- io_context 'normal', io_object 'relation' reads are counted.
@@ -1190,6 +1193,7 @@ SELECT COUNT(*) FROM test_io_shared;
100
(1 row)
+COMMIT;
SELECT pg_stat_force_next_flush();
pg_stat_force_next_flush
--------------------------
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index 74e592aa8a..84604d8fa0 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -576,10 +576,14 @@ SELECT current_setting('fsync') = 'off'
-- from it to cause it to be read back into shared buffers.
SELECT sum(reads) AS io_sum_shared_before_reads
FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+-- Do this in a transaction to prevent any other concurrent access to our newly
+-- rewritten table, guaranteeing our test will pass.
+BEGIN;
ALTER TABLE test_io_shared SET TABLESPACE regress_tblspace;
-- SELECT from the table so that the data is read into shared buffers and
-- io_context 'normal', io_object 'relation' reads are counted.
SELECT COUNT(*) FROM test_io_shared;
+COMMIT;
SELECT pg_stat_force_next_flush();
SELECT sum(reads) AS io_sum_shared_after_reads
FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
--
2.37.2
At Mon, 6 Mar 2023 15:21:14 -0500, Melanie Plageman <melanieplageman@gmail.com> wrote in
On Mon, Mar 6, 2023 at 2:34 PM Andres Freund <andres@anarazel.de> wrote:
That, but also because it's simply more reliable. autovacuum=off doesn't
protect against an anti-wraparound vacuum or such. Or a concurrent test somehow
triggering a read. Or ...
Good point. Attached is what you suggested. I committed the transaction
before the drop table so that the statistics would be visible when we
queried pg_stat_io.
While I don't believe an anti-wraparound vacuum can occur during testing,
Melanie's solution (moving the commit by a few lines) seems to work
(verified by manual testing).
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hi,
On 2023-03-06 15:21:14 -0500, Melanie Plageman wrote:
Good point. Attached is what you suggested. I committed the transaction
before the drop table so that the statistics would be visible when we
queried pg_stat_io.
Pushed, thanks for the report, analysis and fix, Tom, Horiguchi-san, Melanie.
Greetings,
Andres Freund
On Tue, Mar 07, 2023 at 10:18:44AM -0800, Andres Freund wrote:
Hi,
On 2023-03-06 15:21:14 -0500, Melanie Plageman wrote:
Good point. Attached is what you suggested. I committed the transaction
before the drop table so that the statistics would be visible when we
queried pg_stat_io.
Pushed, thanks for the report, analysis and fix, Tom, Horiguchi-san, Melanie.
There's a 2nd portion of the test that's still flapping, at least on
cirrusci.
The issue that Tom mentioned is at:
SELECT :io_sum_shared_after_writes > :io_sum_shared_before_writes;
But what I've seen on cirrusci is at:
SELECT :io_sum_shared_after_writes > :io_sum_shared_before_writes;
https://api.cirrus-ci.com/v1/artifact/task/6701069548388352/log/src/test/recovery/tmp_check/regression.diffs
https://api.cirrus-ci.com/v1/artifact/task/5355168397524992/log/src/test/recovery/tmp_check/regression.diffs
https://api.cirrus-ci.com/v1/artifact/task/6142435751886848/testrun/build/testrun/recovery/027_stream_regress/log/regress_log_027_stream_regress
It'd be neat if cfbot could show a histogram of test failures, although
I'm not entirely sure what granularity would be most useful: the test
that failed (027_regress) or the way it failed (:after_write >
:before_writes). Maybe it's enough to show the test, with links to its
recent failures.
--
Justin
Hi,
On 2023-03-09 06:51:31 -0600, Justin Pryzby wrote:
On Tue, Mar 07, 2023 at 10:18:44AM -0800, Andres Freund wrote:
Hi,
On 2023-03-06 15:21:14 -0500, Melanie Plageman wrote:
Good point. Attached is what you suggested. I committed the transaction
before the drop table so that the statistics would be visible when we
queried pg_stat_io.
Pushed, thanks for the report, analysis and fix, Tom, Horiguchi-san, Melanie.
There's a 2nd portion of the test that's still flapping, at least on
cirrusci.
The issue that Tom mentioned is at:
SELECT :io_sum_shared_after_writes > :io_sum_shared_before_writes;
But what I've seen on cirrusci is at:
SELECT :io_sum_shared_after_writes > :io_sum_shared_before_writes;
Seems you meant to copy a different line for Tom's (s/writes/reads/)?
Hm. I guess the explanation here is that the buffers were already all written
out by another backend. Which is made more likely by your patch.
I found a few more occurrences and chatted with Melanie. Melanie will come up
with a fix I think.
Greetings,
Andres Freund
On Thu, Mar 9, 2023 at 2:43 PM Andres Freund <andres@anarazel.de> wrote:
On 2023-03-09 06:51:31 -0600, Justin Pryzby wrote:
On Tue, Mar 07, 2023 at 10:18:44AM -0800, Andres Freund wrote:
There's a 2nd portion of the test that's still flapping, at least on
cirrusci.
The issue that Tom mentioned is at:
SELECT :io_sum_shared_after_writes > :io_sum_shared_before_writes;
But what I've seen on cirrusci is at:
SELECT :io_sum_shared_after_writes > :io_sum_shared_before_writes;
Seems you meant to copy a different line for Tom's (s/writes/reads/)?
Hm. I guess the explanation here is that the buffers were already all written
out by another backend. Which is made more likely by your patch.
I found a few more occurrences and chatted with Melanie. Melanie will come up
with a fix I think.
So, what this test is relying on is that either the checkpointer or
another backend will flush the pages of test_io_shared which we dirtied
above in the test. The test specifically checks for IOCONTEXT_NORMAL
writes. It could fail if some other backend is doing a bulkread or
bulkwrite and flushes these buffers first in a strategy context.
This will happen more often when shared buffers is small.
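For anyone digging into one of these failures, the relevant breakdown can be
pulled straight from the view (a sketch, assuming a running server with
pg_stat_io as in PG 16; not part of the regression test itself):

```sql
-- Break down shared-relation writes by backend type and IO context.
-- Writes landing in a strategy context ('bulkread'/'bulkwrite') are
-- exactly the ones a test filtering on io_context = 'normal' misses.
SELECT backend_type, io_context, sum(writes) AS writes
  FROM pg_stat_io
 WHERE io_object = 'relation'
 GROUP BY backend_type, io_context
 ORDER BY backend_type, io_context;
```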
I tried to come up with a reliable test which was limited to
IOCONTEXT_NORMAL. I thought if we could guarantee a dirty buffer would
be pinned using a cursor, that we could then issue a checkpoint and
guarantee a flush that way. However, I don't see a way to guarantee that
no one flushes the buffer between dirtying it and pinning it with the
cursor.
So, I think our best bet is to just change the test to pass if there are
any writes in any contexts. By moving the sum(writes) before the INSERT
and keeping the checkpoint, we can guarantee that, one way or another,
some buffers will be flushed. This essentially covers the same code anyway.
Patch attached.
- Melanie
Attachments:
Stabilize-pg_stat_io-writes-test.patch (text/x-patch)
From 0d2904f2cf3b6cbf016e5701aaa2bc6997b505cc Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Fri, 10 Mar 2023 14:26:37 -0500
Subject: [PATCH v1] Stabilize pg_stat_io writes test
Counting writes only for io_context = 'normal' is unreliable, as
backends using a buffer access strategy could flush all of the dirty
buffers out from under the other backends and checkpointer. Change the
test to count writes in any context. This achieves roughly the same
coverage anyway.
Reported-by: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://www.postgresql.org/message-id/ZAnWU8WbXEDjrfUE%40telsasoft.com
---
src/test/regress/expected/stats.out | 8 ++++----
src/test/regress/sql/stats.sql | 9 ++++-----
2 files changed, 8 insertions(+), 9 deletions(-)
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index 186c296299..e90940f676 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -1137,6 +1137,9 @@ SELECT pg_stat_get_subscription_stats(NULL);
-- extends.
SELECT sum(extends) AS io_sum_shared_before_extends
FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+ FROM pg_stat_io
+ WHERE io_object = 'relation' \gset io_sum_shared_before_
CREATE TABLE test_io_shared(a int);
INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
SELECT pg_stat_force_next_flush();
@@ -1155,15 +1158,12 @@ SELECT :io_sum_shared_after_extends > :io_sum_shared_before_extends;
-- After a checkpoint, there should be some additional IOCONTEXT_NORMAL writes
-- and fsyncs.
-SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
- FROM pg_stat_io
- WHERE io_context = 'normal' AND io_object = 'relation' \gset io_sum_shared_before_
-- See comment above for rationale for two explicit CHECKPOINTs.
CHECKPOINT;
CHECKPOINT;
SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
FROM pg_stat_io
- WHERE io_context = 'normal' AND io_object = 'relation' \gset io_sum_shared_after_
+ WHERE io_object = 'relation' \gset io_sum_shared_after_
SELECT :io_sum_shared_after_writes > :io_sum_shared_before_writes;
?column?
----------
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index d7f873cfc9..b94410e49e 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -549,6 +549,9 @@ SELECT pg_stat_get_subscription_stats(NULL);
-- extends.
SELECT sum(extends) AS io_sum_shared_before_extends
FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' \gset
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+ FROM pg_stat_io
+ WHERE io_object = 'relation' \gset io_sum_shared_before_
CREATE TABLE test_io_shared(a int);
INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
SELECT pg_stat_force_next_flush();
@@ -558,16 +561,12 @@ SELECT :io_sum_shared_after_extends > :io_sum_shared_before_extends;
-- After a checkpoint, there should be some additional IOCONTEXT_NORMAL writes
-- and fsyncs.
-SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
- FROM pg_stat_io
- WHERE io_context = 'normal' AND io_object = 'relation' \gset io_sum_shared_before_
-- See comment above for rationale for two explicit CHECKPOINTs.
CHECKPOINT;
CHECKPOINT;
SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
FROM pg_stat_io
- WHERE io_context = 'normal' AND io_object = 'relation' \gset io_sum_shared_after_
-
+ WHERE io_object = 'relation' \gset io_sum_shared_after_
SELECT :io_sum_shared_after_writes > :io_sum_shared_before_writes;
SELECT current_setting('fsync') = 'off'
OR :io_sum_shared_after_fsyncs > :io_sum_shared_before_fsyncs;
--
2.37.2
On Thu, Mar 09, 2023 at 11:43:01AM -0800, Andres Freund wrote:
On 2023-03-09 06:51:31 -0600, Justin Pryzby wrote:
On Tue, Mar 07, 2023 at 10:18:44AM -0800, Andres Freund wrote:
Hi,
On 2023-03-06 15:21:14 -0500, Melanie Plageman wrote:
Good point. Attached is what you suggested. I committed the transaction
before the drop table so that the statistics would be visible when we
queried pg_stat_io.
Pushed, thanks for the report, analysis and fix, Tom, Horiguchi-san, Melanie.
There's a 2nd portion of the test that's still flapping, at least on
cirrusci.
The issue that Tom mentioned is at:
SELECT :io_sum_shared_after_writes > :io_sum_shared_before_writes;
But what I've seen on cirrusci is at:
SELECT :io_sum_shared_after_writes > :io_sum_shared_before_writes;
Seems you meant to copy a different line for Tom's (s/writes/reads/)?
Seems so
Hm. I guess the explanation here is that the buffers were already all written
out by another backend. Which is made more likely by your patch.
FYI: that patch would've made it more likely for each backend to write
out its *own* dirty pages of TOAST ... but the two other failures that I
mentioned were for patches which wouldn't have affected this at all.
--
Justin
On Fri, Mar 10, 2023 at 3:19 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
On Thu, Mar 09, 2023 at 11:43:01AM -0800, Andres Freund wrote:
Hm. I guess the explanation here is that the buffers were already all written
out by another backend. Which is made more likely by your patch.
FYI: that patch would've made it more likely for each backend to write
out its *own* dirty pages of TOAST ... but the two other failures that I
mentioned were for patches which wouldn't have affected this at all.
I think your patch made it more likely that a backend needing to flush a
buffer in order to fit its own data would be doing so in a buffer access
strategy IO context.
Your patch makes those toast table writes use a BAS_BULKWRITE strategy
(see GetBulkInsertState()), and when those backends are looking for
buffers to put their data in, they have to evict other data (theirs and
others'), but all of this is tracked in io_context = 'bulkwrite' -- and
the test only counted writes done in io_context 'normal'. But it is good
that your patch did that! It helped us to see that this test is not
reliable.
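To make that concrete, here is a sketch (assuming a running server with
pg_stat_io) of a query showing the writes that end up attributed to the
bulkwrite context rather than 'normal':

```sql
-- Writes done under a BAS_BULKWRITE strategy are counted here, not in
-- io_context = 'normal', which is why the original test could miss them.
SELECT backend_type, sum(writes) AS bulkwrite_writes
  FROM pg_stat_io
 WHERE io_context = 'bulkwrite' AND io_object = 'relation'
 GROUP BY backend_type;
```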
The other times this test failed in cfbot were for a patch that had many
failures and might have something wrong with its code, IIRC.
Thanks again for the report!
- Melanie
Hello,
I found that the 'standalone backend' backend type is not documented
right now.
Adding something like (from commit message) would be helpful:
Both the bootstrap backend and single user mode backends will have
backend_type STANDALONE_BACKEND.
--
Pavel Luzanov
Postgres Professional: https://postgrespro.com
On Mon, Apr 3, 2023 at 12:13 AM Pavel Luzanov <p.luzanov@postgrespro.ru> wrote:
Hello,
I found that the 'standalone backend' backend type is not documented
right now.
Adding something like (from commit message) would be helpful:
Both the bootstrap backend and single user mode backends will have
backend_type STANDALONE_BACKEND.
Thanks for the report.
Attached is a tiny patch to add standalone backend type to
pg_stat_activity documentation (referenced by pg_stat_io).
I mentioned both the bootstrap process and single user mode process in
the docs, though I can't imagine that the bootstrap process is relevant
for pg_stat_activity.
I also noticed that the pg_stat_activity docs call background workers
"parallel workers" (though it also mentions that extensions could have
other background workers registered), but this seems a bit weird because
pg_stat_activity uses GetBackendTypeDesc() and this prints "background
worker" for type B_BG_WORKER. Background workers doing parallelism tasks
is what users will most often see in pg_stat_activity, but I feel like
it is confusing to have it documented as something different than what
would appear in the view. Unless I am misunderstanding something...
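An easy way to compare what the view actually reports against what the docs
list (a sketch, assuming a running server):

```sql
-- Distinct backend_type strings currently visible in pg_stat_activity;
-- compare these against the list in the monitoring docs.
SELECT backend_type, count(*) AS n
  FROM pg_stat_activity
 GROUP BY backend_type
 ORDER BY backend_type;
```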
- Melanie
Attachments:
v1-0001-Document-standalone-backend-type-in-pg_stat_activ.patch (text/x-patch)
From d9218d082397d9b87a3e126bce4a45e9ec720ff2 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 3 Apr 2023 16:38:47 -0400
Subject: [PATCH v1] Document standalone backend type in pg_stat_activity
Reported-by: Pavel Luzanov <p.luzanov@postgrespro.ru>
Discussion: https://www.postgresql.org/message-id/fcbe2851-f1fb-9863-54bc-a95dc7a0d946%40postgrespro.ru
---
doc/src/sgml/monitoring.sgml | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index d5a45f996d..a00fe9c6a3 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -989,10 +989,12 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<literal>parallel worker</literal>, <literal>background writer</literal>,
<literal>client backend</literal>, <literal>checkpointer</literal>,
<literal>archiver</literal>,
- <literal>startup</literal>, <literal>walreceiver</literal>,
- <literal>walsender</literal> and <literal>walwriter</literal>.
- In addition, background workers registered by extensions may have
- additional types.
+ <literal>startup</literal>,
+ <literal>standalone backend</literal> (which includes both the
+ <xref linkend="app-postgres-single-user"/> process and bootstrap
+ process), <literal>walreceiver</literal>, <literal>walsender</literal>
+ and <literal>walwriter</literal>. In addition, background workers
+ registered by extensions may have additional types.
</para></entry>
</row>
</tbody>
--
2.37.2
On 03.04.2023 23:50, Melanie Plageman wrote:
Attached is a tiny patch to add standalone backend type to
pg_stat_activity documentation (referenced by pg_stat_io).
I mentioned both the bootstrap process and single user mode process in
the docs, though I can't imagine that the bootstrap process is relevant
for pg_stat_activity.
After a little thought... I'm not sure about the term 'bootstrap
process'. I can't find this term in the documentation.
Do I understand correctly that this is a postmaster? If so, then the
postmaster process is not shown in pg_stat_activity.
Perhaps it may be worth adding a description of the standalone backend
to pg_stat_io, not to pg_stat_activity.
Something like: backend_type is all types from pg_stat_activity plus
'standalone backend', which is used for the postmaster process and in
single user mode.
I also noticed that the pg_stat_activity docs call background workers
"parallel workers" (though it also mentions that extensions could have
other background workers registered), but this seems a bit weird because
pg_stat_activity uses GetBackendTypeDesc() and this prints "background
worker" for type B_BG_WORKER. Background workers doing parallelism tasks
is what users will most often see in pg_stat_activity, but I feel like
it is confusing to have it documented as something different than what
would appear in the view. Unless I am misunderstanding something...
'parallel worker' appears in the pg_stat_activity for parallel queries.
I think it's right here.
--
Pavel Luzanov
Postgres Professional: https://postgrespro.com
On Tue, Apr 4, 2023 at 4:35 PM Pavel Luzanov <p.luzanov@postgrespro.ru> wrote:
On 03.04.2023 23:50, Melanie Plageman wrote:
Attached is a tiny patch to add standalone backend type to
pg_stat_activity documentation (referenced by pg_stat_io).
I mentioned both the bootstrap process and single user mode process in
the docs, though I can't imagine that the bootstrap process is relevant
for pg_stat_activity.
After a little thought... I'm not sure about the term 'bootstrap
process'. I can't find this term in the documentation.
There are various mentions of "bootstrap" peppered throughout the docs
but no concise summary of what it is. For example, initdb docs mention
the "bootstrap backend" [1].
Interestingly, 910cab820d0 added "Bootstrap superuser" in November. This
doesn't really cover what bootstrapping is itself, but I wonder if that
is useful? If so, you could propose a glossary entry for it?
(preferably in a new thread)
Do I understand correctly that this is a postmaster? If so, then the
postmaster process is not shown in pg_stat_activity.
No, bootstrap process is for initializing the template database. You
will not be able to see pg_stat_activity when it is running.
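For context, a simplified sketch of how the bootstrap backend gets run (not
something users invoke by hand; the actual flags initdb passes differ, and
the paths here are placeholders):

```shell
# initdb feeds the postgres.bki commands into a backend started in
# bootstrap mode to create template1; paths are placeholders.
postgres --boot -D /path/to/datadir < /path/to/share/postgres.bki
```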
Perhaps it may be worth adding a description of the standalone backend
to pg_stat_io, not to pg_stat_activity.
Something like: backend_type is all types from pg_stat_activity plus
'standalone backend',
which is used for the postmaster process and in a single user mode.
You can query pg_stat_activity from single user mode, so it is relevant
to pg_stat_activity also. I take your point that bootstrap mode isn't
relevant for pg_stat_activity, but I am hesitant to add that distinction
to the pg_stat_io docs since the reason you won't see it in
pg_stat_activity is because it is ephemeral and before a user can access
the database and not because stats are not tracked for it.
Can you think of a way to convey this?
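If it helps, the single-user case is straightforward to check (paths are
placeholders; requires an existing data directory with no server running
on it):

```shell
# Start a standalone backend on an existing data directory.
postgres --single -D /path/to/datadir postgres
# At the "backend>" prompt, pg_stat_activity is queryable and should
# report backend_type 'standalone backend' for this process.
```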
I also noticed that the pg_stat_activity docs call background workers
"parallel workers" (though it also mentions that extensions could have
other background workers registered), but this seems a bit weird because
pg_stat_activity uses GetBackendTypeDesc() and this prints "background
worker" for type B_BG_WORKER. Background workers doing parallelism tasks
is what users will most often see in pg_stat_activity, but I feel like
it is confusing to have it documented as something different than what
would appear in the view. Unless I am misunderstanding something...'parallel worker' appears in the pg_stat_activity for parallel queries.
I think it's right here.
Ah, I didn't read the code closely enough in pg_stat_get_activity().
Even though there is no BackendType for which GetBackendTypeDesc() returns
"parallel worker", we go out of our way to be specific using
GetBackgroundWorkerTypeByPid():
    /* Add backend type */
    if (beentry->st_backendType == B_BG_WORKER)
    {
        const char *bgw_type;

        bgw_type = GetBackgroundWorkerTypeByPid(beentry->st_procpid);
        if (bgw_type)
            values[17] = CStringGetTextDatum(bgw_type);
        else
            nulls[17] = true;
    }
    else
        values[17] =
            CStringGetTextDatum(GetBackendTypeDesc(beentry->st_backendType));
- Melanie
[1]: https://www.postgresql.org/docs/current/app-initdb.html
On 05.04.2023 03:41, Melanie Plageman wrote:
On Tue, Apr 4, 2023 at 4:35 PM Pavel Luzanov <p.luzanov@postgrespro.ru> wrote:
After a little thought... I'm not sure about the term 'bootstrap
process'. I can't find this term in the documentation.
There are various mentions of "bootstrap" peppered throughout the docs
but no concise summary of what it is. For example, initdb docs mention
the "bootstrap backend" [1].
Interestingly, 910cab820d0 added "Bootstrap superuser" in November. This
doesn't really cover what bootstrapping is itself, but I wonder if that
is useful? If so, you could propose a glossary entry for it?
(preferably in a new thread)
I'm not sure if this is the reason for adding a new entry in the glossary.
Do I understand correctly that this is a postmaster? If so, then the
postmaster process is not shown in pg_stat_activity.
No, bootstrap process is for initializing the template database. You
will not be able to see pg_stat_activity when it is running.
Oh, it's clear to me now. Thank you for the explanation.
You can query pg_stat_activity from single user mode, so it is relevant
to pg_stat_activity also. I take your point that bootstrap mode isn't
relevant for pg_stat_activity, but I am hesitant to add that distinction
to the pg_stat_io docs since the reason you won't see it in
pg_stat_activity is because it is ephemeral and before a user can access
the database and not because stats are not tracked for it.
Can you think of a way to convey this?
See my attempt attached.
I'm not sure about the wording. But I think we can avoid the term
'bootstrap process' by replacing it with "database cluster
initialization", which should be clear to everyone.
--
Pavel Luzanov
Postgres Professional: https://postgrespro.com
Attachments:
v2-0001-PATCH-v2-Document-standalone-backend-type-in-pg_s.patch (text/x-patch)
From ff76b81a9d52581f6fdaf9c1f3885e8272d2ae3c Mon Sep 17 00:00:00 2001
From: Pavel Luzanov <p.luzanov@postgrespro.ru>
Date: Mon, 10 Apr 2023 10:25:52 +0300
Subject: [PATCH v2] [PATCH v2] Document standalone backend type in
pg_stat_activity
Reported-by: Pavel Luzanov <p.luzanov@postgrespro.ru>
Discussion: https://www.postgresql.org/message-id/fcbe2851-f1fb-9863-54bc-a95dc7a0d946%40postgrespro.ru
---
doc/src/sgml/monitoring.sgml | 3 +++
1 file changed, 3 insertions(+)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 3f33a1c56c..45e20efbfb 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -991,6 +991,9 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<literal>archiver</literal>,
<literal>startup</literal>, <literal>walreceiver</literal>,
<literal>walsender</literal> and <literal>walwriter</literal>.
+ The special type <literal>standalone backend</literal> is used
+ when initializing a database cluster by <xref linkend="app-initdb"/>
+ and when running in the <xref linkend="app-postgres-single-user"/>.
In addition, background workers registered by extensions may have
additional types.
</para></entry>
--
2.34.1
On Mon, Apr 10, 2023 at 3:41 AM Pavel Luzanov <p.luzanov@postgrespro.ru> wrote:
On 05.04.2023 03:41, Melanie Plageman wrote:
On Tue, Apr 4, 2023 at 4:35 PM Pavel Luzanov <p.luzanov@postgrespro.ru> wrote:
After a little thought... I'm not sure about the term 'bootstrap
process'. I can't find this term in the documentation.
There are various mentions of "bootstrap" peppered throughout the docs
but no concise summary of what it is. For example, initdb docs mention
the "bootstrap backend" [1].
Interestingly, 910cab820d0 added "Bootstrap superuser" in November. This
doesn't really cover what bootstrapping is itself, but I wonder if that
is useful? If so, you could propose a glossary entry for it?
(preferably in a new thread)
I'm not sure if this is the reason for adding a new entry in the glossary.
Do I understand correctly that this is a postmaster? If so, then the
postmaster process is not shown in pg_stat_activity.
No, bootstrap process is for initializing the template database. You
will not be able to see pg_stat_activity when it is running.
Oh, it's clear to me now. Thank you for the explanation.
You can query pg_stat_activity from single user mode, so it is relevant
to pg_stat_activity also. I take your point that bootstrap mode isn't
relevant for pg_stat_activity, but I am hesitant to add that distinction
to the pg_stat_io docs since the reason you won't see it in
pg_stat_activity is because it is ephemeral and before a user can access
the database and not because stats are not tracked for it.
Can you think of a way to convey this?
See my attempt attached.
I'm not sure about the wording. But I think we can avoid the term
'bootstrap process'
by replacing it with "database cluster initialization", which should be
clear to everyone.
I like that idea.
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 3f33a1c56c..45e20efbfb 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -991,6 +991,9 @@ postgres 27093 0.0 0.0 30096 2752 ?
Ss 11:34 0:00 postgres: ser
<literal>archiver</literal>,
<literal>startup</literal>, <literal>walreceiver</literal>,
<literal>walsender</literal> and <literal>walwriter</literal>.
+ The special type <literal>standalone backend</literal> is used
I think referring to it as a "special type" is a bit confusing. I think
you can just start the sentence with "standalone backend". You could
even include it in the main list of backend_types since it is possible
to see it in pg_stat_activity when in single user mode.
+ when initializing a database cluster by <xref linkend="app-initdb"/>
+ and when running in the <xref linkend="app-postgres-single-user"/>.
In addition, background workers registered by extensions may have
additional types.
</para></entry>
I like the rest of this.
I copied the committer who most recently touched pg_stat_io (Michael
Paquier) to see if we could get someone interested in committing this
docs update.
- Melanie
On 24.04.2023 23:53, Melanie Plageman wrote:
I copied the committer who most recently touched pg_stat_io (Michael
Paquier) to see if we could get someone interested in committing this
docs update.
Let me explain my motivation for suggesting this update.
pg_stat_io is a very impressive feature. So I decided to try it.
I see 4 rows for some 'standalone backend' out of the 30 total rows of
the view.
My attempt to find a description of 'standalone backend' in the docs
turned up nothing. The pg_stat_io page references pg_stat_activity
for backend types, but the pg_stat_activity page doesn't say anything
about 'standalone backend'.
I think this question will come up often unless it is clarified in the docs.
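The rows in question are easy to surface (a sketch, assuming a running
server on a release with pg_stat_io):

```sql
-- The undocumented type shows up alongside the documented ones.
SELECT DISTINCT backend_type
  FROM pg_stat_io
 ORDER BY backend_type;
```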
--
Pavel Luzanov
Postgres Professional: https://postgrespro.com