Get rid of pgstat_count_backend_io_op*() functions

Started by Bertrand Drouvot4 months ago10 messages
#1Bertrand Drouvot
bertranddrouvot.pg@gmail.com
1 attachment(s)

Hi hackers,

This patch removes the functions that are incrementing the backend IO stats.

Instead, it now copies the IO pending stats to the backend IO pending stats when
the pending IO stats are flushed. This behaves the same way as for some relation
and database stats (see pgstat_relation_flush_cb()).

It's done that way to avoid incrementing the "same" counters twice as it produces
increased overhead in profiles (reported by Andres in [1]/messages/by-id/7fhpds4xqk6bnudzmzkqi33pinsxammpljwde5gfkjdygvejrj@ojkzfr7dxkmm).

Please note that per-backend statistics have to be last when we define the
PGSTAT_KIND_% values in pgstat_kind.h, as some of their pending stats are
populated while other stats kinds are flushing.

[1]: /messages/by-id/7fhpds4xqk6bnudzmzkqi33pinsxammpljwde5gfkjdygvejrj@ojkzfr7dxkmm

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v1-0001-Get-rid-of-pgstat_count_backend_io_op-functions.patchtext/x-diff; charset=us-asciiDownload
From 7d00ae64adaed2f1c10c45af79f9eefea86c126a Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Date: Mon, 1 Sep 2025 10:54:35 +0000
Subject: [PATCH v1] Get rid of pgstat_count_backend_io_op*() functions

This commit removes the functions that are incrementing the backend IO stats.
Instead, it now copies the IO pending stats to the backend IO pending stats when
the pending IO stats are flushed. This behaves the same way as for some relation
and database stats.

It's done that way to avoid incrementing the "same" counters twice as it produces
increased overhead in profiles.

Note that per-backend statistics have to be last when we define the PGSTAT_KIND_%
values in pgstat_kind.h, as some of their pending stats are populated while other
stats kinds are flushing.

Reported-by: Andres Freund <andres@anarazel.de>
---
 src/backend/utils/activity/pgstat_backend.c | 42 +--------------------
 src/backend/utils/activity/pgstat_io.c      | 18 +++++----
 src/include/pgstat.h                        | 14 +++----
 src/include/utils/pgstat_kind.h             | 21 +++++++----
 4 files changed, 31 insertions(+), 64 deletions(-)
  59.4% src/backend/utils/activity/
  23.5% src/include/utils/
  17.0% src/include/

diff --git a/src/backend/utils/activity/pgstat_backend.c b/src/backend/utils/activity/pgstat_backend.c
index 8714a85e2d9..b746b55192e 100644
--- a/src/backend/utils/activity/pgstat_backend.c
+++ b/src/backend/utils/activity/pgstat_backend.c
@@ -36,8 +36,8 @@
  * reported within critical sections so we use static memory in order to avoid
  * memory allocation.
  */
-static PgStat_BackendPending PendingBackendStats;
-static bool backend_has_iostats = false;
+PgStat_BackendPending PendingBackendStats;
+bool		backend_has_iostats = false;
 
 /*
  * WAL usage counters saved from pgWalUsage at the previous call to
@@ -47,44 +47,6 @@ static bool backend_has_iostats = false;
  */
 static WalUsage prevBackendWalUsage;
 
-/*
- * Utility routines to report I/O stats for backends, kept here to avoid
- * exposing PendingBackendStats to the outside world.
- */
-void
-pgstat_count_backend_io_op_time(IOObject io_object, IOContext io_context,
-								IOOp io_op, instr_time io_time)
-{
-	Assert(track_io_timing || track_wal_io_timing);
-
-	if (!pgstat_tracks_backend_bktype(MyBackendType))
-		return;
-
-	Assert(pgstat_tracks_io_op(MyBackendType, io_object, io_context, io_op));
-
-	INSTR_TIME_ADD(PendingBackendStats.pending_io.pending_times[io_object][io_context][io_op],
-				   io_time);
-
-	backend_has_iostats = true;
-	pgstat_report_fixed = true;
-}
-
-void
-pgstat_count_backend_io_op(IOObject io_object, IOContext io_context,
-						   IOOp io_op, uint32 cnt, uint64 bytes)
-{
-	if (!pgstat_tracks_backend_bktype(MyBackendType))
-		return;
-
-	Assert(pgstat_tracks_io_op(MyBackendType, io_object, io_context, io_op));
-
-	PendingBackendStats.pending_io.counts[io_object][io_context][io_op] += cnt;
-	PendingBackendStats.pending_io.bytes[io_object][io_context][io_op] += bytes;
-
-	backend_has_iostats = true;
-	pgstat_report_fixed = true;
-}
-
 /*
  * Returns statistics of a backend by proc number.
  */
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 13ae57ed649..0520f4a1d33 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -76,9 +76,6 @@ pgstat_count_io_op(IOObject io_object, IOContext io_context, IOOp io_op,
 	PendingIOStats.counts[io_object][io_context][io_op] += cnt;
 	PendingIOStats.bytes[io_object][io_context][io_op] += bytes;
 
-	/* Add the per-backend counts */
-	pgstat_count_backend_io_op(io_object, io_context, io_op, cnt, bytes);
-
 	have_iostats = true;
 	pgstat_report_fixed = true;
 }
@@ -151,10 +148,6 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
 
 		INSTR_TIME_ADD(PendingIOStats.pending_times[io_object][io_context][io_op],
 					   io_time);
-
-		/* Add the per-backend count */
-		pgstat_count_backend_io_op_time(io_object, io_context, io_op,
-										io_time);
 	}
 
 	pgstat_count_io_op(io_object, io_context, io_op, cnt, bytes);
@@ -184,6 +177,9 @@ pgstat_flush_io(bool nowait)
  *
  * If nowait is true, this function returns true if the lock could not be
  * acquired. Otherwise, return false.
+ *
+ * The stats are copied to the corresponding pending backend stats when
+ * successfully flushing.
  */
 bool
 pgstat_io_flush_cb(bool nowait)
@@ -227,6 +223,14 @@ pgstat_io_flush_cb(bool nowait)
 
 	Assert(pgstat_bktype_io_stats_valid(bktype_shstats, MyBackendType));
 
+	/* Do the same for the backend stats */
+	if (pgstat_tracks_backend_bktype(MyBackendType))
+	{
+		PendingBackendStats.pending_io = PendingIOStats;
+		backend_has_iostats = true;
+		pgstat_report_fixed = true;
+	}
+
 	LWLockRelease(bktype_lock);
 
 	memset(&PendingIOStats, 0, sizeof(PendingIOStats));
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 202bd2d5ace..7c2c4a0adc2 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -548,15 +548,6 @@ extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
  * Functions in pgstat_backend.c
  */
 
-/* used by pgstat_io.c for I/O stats tracked in backends */
-extern void pgstat_count_backend_io_op_time(IOObject io_object,
-											IOContext io_context,
-											IOOp io_op,
-											instr_time io_time);
-extern void pgstat_count_backend_io_op(IOObject io_object,
-									   IOContext io_context,
-									   IOOp io_op, uint32 cnt,
-									   uint64 bytes);
 extern PgStat_Backend *pgstat_fetch_stat_backend(ProcNumber procNumber);
 extern PgStat_Backend *pgstat_fetch_stat_backend_by_pid(int pid,
 														BackendType *bktype);
@@ -800,6 +791,11 @@ extern PGDLLIMPORT bool pgstat_track_counts;
 extern PGDLLIMPORT int pgstat_track_functions;
 extern PGDLLIMPORT int pgstat_fetch_consistency;
 
+/*
+ * Variables in pgstat_backend.c
+ */
+extern PGDLLIMPORT PgStat_BackendPending PendingBackendStats;
+extern PGDLLIMPORT bool backend_has_iostats;
 
 /*
  * Variables in pgstat_bgwriter.c
diff --git a/src/include/utils/pgstat_kind.h b/src/include/utils/pgstat_kind.h
index eb5f0b3ae6d..4d38f5d922c 100644
--- a/src/include/utils/pgstat_kind.h
+++ b/src/include/utils/pgstat_kind.h
@@ -29,18 +29,23 @@
 #define PGSTAT_KIND_FUNCTION	3	/* per-function statistics */
 #define PGSTAT_KIND_REPLSLOT	4	/* per-slot statistics */
 #define PGSTAT_KIND_SUBSCRIPTION	5	/* per-subscription statistics */
-#define PGSTAT_KIND_BACKEND	6	/* per-backend statistics */
 
 /* stats for fixed-numbered objects */
-#define PGSTAT_KIND_ARCHIVER	7
-#define PGSTAT_KIND_BGWRITER	8
-#define PGSTAT_KIND_CHECKPOINTER	9
-#define PGSTAT_KIND_IO	10
-#define PGSTAT_KIND_SLRU	11
-#define PGSTAT_KIND_WAL	12
+#define PGSTAT_KIND_ARCHIVER	6
+#define PGSTAT_KIND_BGWRITER	7
+#define PGSTAT_KIND_CHECKPOINTER	8
+#define PGSTAT_KIND_IO	9
+#define PGSTAT_KIND_SLRU	10
+#define PGSTAT_KIND_WAL	11
+
+/*
+ * per-backend statistics has to be last, as some of their pending stats are
+ * populated while other stats kinds are flushing.
+ */
+#define PGSTAT_KIND_BACKEND	12	/* per-backend statistics */
 
 #define PGSTAT_KIND_BUILTIN_MIN PGSTAT_KIND_DATABASE
-#define PGSTAT_KIND_BUILTIN_MAX PGSTAT_KIND_WAL
+#define PGSTAT_KIND_BUILTIN_MAX PGSTAT_KIND_BACKEND
 #define PGSTAT_KIND_BUILTIN_SIZE (PGSTAT_KIND_BUILTIN_MAX + 1)
 
 /* Custom stats kinds */
-- 
2.34.1

#2Michael Paquier
michael@paquier.xyz
In reply to: Bertrand Drouvot (#1)
Re: Get rid of pgstat_count_backend_io_op*() functions

On Mon, Sep 01, 2025 at 02:07:27PM +0000, Bertrand Drouvot wrote:

Instead, it now copies the IO pending stats to the backend IO pending stats when
the pending IO stats are flushed. This behaves the same way as for some relation
and database stats (see pgstat_relation_flush_cb()).

It's done that way to avoid incrementing the "same" counters twice as it produces
increased overhead in profiles (reported by Andres in [1]).

So, is the complaint about the addition of the extra incrementations
for backend counters on top of the existing IO counters in v17,
leading to more instructions generated, the addition of the functions,
or both things at the same time? Do you have an example of profile
and/or workload where this actually matters? I am aware that you are
doing crazy things in this area, so..

Also, how much does it matter for v18? Removing both function calls
and counter incrementations implies refactoring that touches both the
I/O stats and the backend counterparts, which may not be a a good idea
during RC. I'd be OK in beta 1~2 for this amount of work, the timing
of the discussion is what it is. So, depending on the arguments
raised this may warrant a revert for the IO part of the backend stats.

Please note that per-backend statistics have to be last when we define the
PGSTAT_KIND_% values in pgstat_kind.h, as some of their pending stats are
populated while other stats kinds are flushing.

+/*
+ * per-backend statistics has to be last, as some of their pending stats are
+ * populated while other stats kinds are flushing.
+ */
+#define PGSTAT_KIND_BACKEND	12	/* per-backend statistics */

I don't think that adding ordering dependencies for the stats kinds is
a good concept in general. What makes you think that we could not
have a custom pgstats kind willing to increment backend counters as
well, say if we have a table AM with its own stats, for example?

Removing both the function calls and the extra counter incrementations
means to do the same thing as the WAL stats, where we have one
structure in charge of incrementing the counters, then the data diffs
are fed when flushing the entries in all the stats kinds. So this
would imply the introduction of an equivalent to WalUsage, but applied
to the IO stats, like a IOUsage, or something like that.
--
Michael

#3Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Michael Paquier (#2)
Re: Get rid of pgstat_count_backend_io_op*() functions

Hi,

On Tue, Sep 02, 2025 at 08:19:22AM +0900, Michael Paquier wrote:

On Mon, Sep 01, 2025 at 02:07:27PM +0000, Bertrand Drouvot wrote:

Instead, it now copies the IO pending stats to the backend IO pending stats when
the pending IO stats are flushed. This behaves the same way as for some relation
and database stats (see pgstat_relation_flush_cb()).

It's done that way to avoid incrementing the "same" counters twice as it produces
increased overhead in profiles (reported by Andres in [1]).

So, is the complaint about the addition of the extra incrementations
for backend counters on top of the existing IO counters in v17,
leading to more instructions generated, the addition of the functions,
or both things at the same time? Do you have an example of profile
and/or workload where this actually matters? I am aware that you are
doing crazy things in this area, so..

I did not observe such noticeable overhead myself, so I guess that those questions
are for Andres. That said, I think it makes fully sense to avoid incrementing
the "same" counters twice (and specially in hot path).

Please note that per-backend statistics have to be last when we define the
PGSTAT_KIND_% values in pgstat_kind.h, as some of their pending stats are
populated while other stats kinds are flushing.

+/*
+ * per-backend statistics has to be last, as some of their pending stats are
+ * populated while other stats kinds are flushing.
+ */
+#define PGSTAT_KIND_BACKEND	12	/* per-backend statistics */

I don't think that adding ordering dependencies for the stats kinds is
a good concept in general.

I agree that it has to be simple. I don't think that just ensuring that the
backends are last introduces much complexity.

What makes you think that we could not
have a custom pgstats kind willing to increment backend counters as
well, say if we have a table AM with its own stats, for example?

Nothing, but the issue would be the same if we keep the backends
at their current position. That's right that it's "only" solving the in-core
ordering.

Removing both the function calls and the extra counter incrementations
means to do the same thing as the WAL stats, where we have one
structure in charge of incrementing the counters, then the data diffs
are fed when flushing the entries in all the stats kinds. So this
would imply the introduction of an equivalent to WalUsage, but applied
to the IO stats, like a IOUsage, or something like that.

Yeah, that sounds more complicated at first but I think that would solve
the "ordering" issue mentioned above.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#4Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Bertrand Drouvot (#3)
1 attachment(s)
Re: Get rid of pgstat_count_backend_io_op*() functions

Hi,

On Tue, Sep 02, 2025 at 07:41:59AM +0000, Bertrand Drouvot wrote:

On Tue, Sep 02, 2025 at 08:19:22AM +0900, Michael Paquier wrote:

Removing both the function calls and the extra counter incrementations
means to do the same thing as the WAL stats, where we have one
structure in charge of incrementing the counters, then the data diffs
are fed when flushing the entries in all the stats kinds. So this
would imply the introduction of an equivalent to WalUsage, but applied
to the IO stats, like a IOUsage, or something like that.

Yeah, that sounds more complicated at first but I think that would solve
the "ordering" issue mentioned above.

Attached is a draft version that implements an equivalent of WalUsage, that way:

- no ordering concerns anymore
- we can also get rid of PgStat_PendingIO ("replaced" by IOUsage)
- we can also get rid of PgStat_BackendPending (since both IO and WAL stats now
use the WAL's approach)

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v2-0001-Get-rid-of-pgstat_count_backend_io_op-functions.patchtext/x-diff; charset=us-asciiDownload
From 4362bfbd9972e98c2c369434f9cfa9d51dd7aca0 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Date: Mon, 1 Sep 2025 10:54:35 +0000
Subject: [PATCH v2] Get rid of pgstat_count_backend_io_op*() functions

This commit removes the functions that are incrementing the backend IO stats.
Instead, it now uses a "global" IO stat counters and each IO stats kind (global
and per-backend) compute the diff during their dedicated flush with a previous
copy of the global counters. That's the same idea as for WAL usage.

It's done that way to avoid incrementing the "same" counters twice as it produces
increased overhead in profiles.

Reported-by: Andres Freund <andres@anarazel.de>
---
 src/backend/utils/activity/pgstat.c         |  2 +
 src/backend/utils/activity/pgstat_backend.c | 86 +++++++--------------
 src/backend/utils/activity/pgstat_io.c      | 56 +++++++++-----
 src/include/pgstat.h                        | 35 +++------
 src/include/utils/pgstat_internal.h         |  1 +
 src/tools/pgindent/typedefs.list            |  3 +-
 6 files changed, 84 insertions(+), 99 deletions(-)
  84.8% src/backend/utils/activity/
  14.4% src/include/

diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index ffb5b8cce34..d499d052c54 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -114,6 +114,7 @@
 #include "utils/pgstat_internal.h"
 #include "utils/timestamp.h"
 
+IOUsage		pgIOUsage;
 
 /* ----------
  * Timer definitions.
@@ -440,6 +441,7 @@ static const PgStat_KindInfo pgstat_kind_builtin_infos[PGSTAT_KIND_BUILTIN_SIZE]
 		.shared_data_off = offsetof(PgStatShared_IO, stats),
 		.shared_data_len = sizeof(((PgStatShared_IO *) 0)->stats),
 
+		.init_backend_cb = pgstat_io_init_backend_cb,
 		.flush_static_cb = pgstat_io_flush_cb,
 		.init_shmem_cb = pgstat_io_init_shmem_cb,
 		.reset_all_cb = pgstat_io_reset_all_cb,
diff --git a/src/backend/utils/activity/pgstat_backend.c b/src/backend/utils/activity/pgstat_backend.c
index 8714a85e2d9..124aee7f6ae 100644
--- a/src/backend/utils/activity/pgstat_backend.c
+++ b/src/backend/utils/activity/pgstat_backend.c
@@ -12,7 +12,7 @@
  * of pgstats.  Entries are created each time a process is spawned, and are
  * dropped when the process exits.  These are not written to the pgstats file
  * on disk.  Pending statistics are managed without direct interactions with
- * PgStat_EntryRef->pending, relying on PendingBackendStats instead so as it
+ * PgStat_EntryRef->pending, relying on global counters instead so as it
  * is possible to report data within critical sections.
  *
  * Copyright (c) 2001-2025, PostgreSQL Global Development Group
@@ -32,12 +32,13 @@
 #include "utils/pgstat_internal.h"
 
 /*
- * Backend statistics counts waiting to be flushed out. These counters may be
- * reported within critical sections so we use static memory in order to avoid
- * memory allocation.
+ * IO counters saved from pgIOUsage at the previous call to
+ * pgstat_report_stat(). This is used to calculate how much IO usage
+ * happens between pgstat_report_stat() calls, by subtracting
+ * the previous counters from the current ones.
  */
-static PgStat_BackendPending PendingBackendStats;
-static bool backend_has_iostats = false;
+static IOUsage prevBackendIOUsage;
+bool		backend_has_iostats = false;
 
 /*
  * WAL usage counters saved from pgWalUsage at the previous call to
@@ -47,44 +48,6 @@ static bool backend_has_iostats = false;
  */
 static WalUsage prevBackendWalUsage;
 
-/*
- * Utility routines to report I/O stats for backends, kept here to avoid
- * exposing PendingBackendStats to the outside world.
- */
-void
-pgstat_count_backend_io_op_time(IOObject io_object, IOContext io_context,
-								IOOp io_op, instr_time io_time)
-{
-	Assert(track_io_timing || track_wal_io_timing);
-
-	if (!pgstat_tracks_backend_bktype(MyBackendType))
-		return;
-
-	Assert(pgstat_tracks_io_op(MyBackendType, io_object, io_context, io_op));
-
-	INSTR_TIME_ADD(PendingBackendStats.pending_io.pending_times[io_object][io_context][io_op],
-				   io_time);
-
-	backend_has_iostats = true;
-	pgstat_report_fixed = true;
-}
-
-void
-pgstat_count_backend_io_op(IOObject io_object, IOContext io_context,
-						   IOOp io_op, uint32 cnt, uint64 bytes)
-{
-	if (!pgstat_tracks_backend_bktype(MyBackendType))
-		return;
-
-	Assert(pgstat_tracks_io_op(MyBackendType, io_object, io_context, io_op));
-
-	PendingBackendStats.pending_io.counts[io_object][io_context][io_op] += cnt;
-	PendingBackendStats.pending_io.bytes[io_object][io_context][io_op] += bytes;
-
-	backend_has_iostats = true;
-	pgstat_report_fixed = true;
-}
-
 /*
  * Returns statistics of a backend by proc number.
  */
@@ -166,7 +129,6 @@ pgstat_flush_backend_entry_io(PgStat_EntryRef *entry_ref)
 {
 	PgStatShared_Backend *shbackendent;
 	PgStat_BktypeIO *bktype_shstats;
-	PgStat_PendingIO pending_io;
 
 	/*
 	 * This function can be called even if nothing at all has happened for IO
@@ -178,7 +140,6 @@ pgstat_flush_backend_entry_io(PgStat_EntryRef *entry_ref)
 
 	shbackendent = (PgStatShared_Backend *) entry_ref->shared_stats;
 	bktype_shstats = &shbackendent->stats.io_stats;
-	pending_io = PendingBackendStats.pending_io;
 
 	for (int io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++)
 	{
@@ -186,24 +147,31 @@ pgstat_flush_backend_entry_io(PgStat_EntryRef *entry_ref)
 		{
 			for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
 			{
-				instr_time	time;
+				instr_time	time_diff;
 
 				bktype_shstats->counts[io_object][io_context][io_op] +=
-					pending_io.counts[io_object][io_context][io_op];
+					pgIOUsage.counts[io_object][io_context][io_op] -
+					prevBackendIOUsage.counts[io_object][io_context][io_op];
+
 				bktype_shstats->bytes[io_object][io_context][io_op] +=
-					pending_io.bytes[io_object][io_context][io_op];
-				time = pending_io.pending_times[io_object][io_context][io_op];
+					pgIOUsage.bytes[io_object][io_context][io_op] -
+					prevBackendIOUsage.bytes[io_object][io_context][io_op];
+
+				INSTR_TIME_SET_ZERO(time_diff);
+				INSTR_TIME_ACCUM_DIFF(time_diff,
+									  pgIOUsage.pending_times[io_object][io_context][io_op],
+									  prevBackendIOUsage.pending_times[io_object][io_context][io_op]);
 
 				bktype_shstats->times[io_object][io_context][io_op] +=
-					INSTR_TIME_GET_MICROSEC(time);
+					INSTR_TIME_GET_MICROSEC(time_diff);
 			}
 		}
 	}
 
 	/*
-	 * Clear out the statistics buffer, so it can be re-used.
+	 * Save the current counters for the subsequent calculation of IO usage.
 	 */
-	MemSet(&PendingBackendStats.pending_io, 0, sizeof(PgStat_PendingIO));
+	prevBackendIOUsage = pgIOUsage;
 
 	backend_has_iostats = false;
 }
@@ -334,15 +302,21 @@ pgstat_create_backend(ProcNumber procnum)
 	memset(&shstatent->stats, 0, sizeof(shstatent->stats));
 	pgstat_unlock_entry(entry_ref);
 
-	MemSet(&PendingBackendStats, 0, sizeof(PgStat_BackendPending));
-	backend_has_iostats = false;
-
 	/*
 	 * Initialize prevBackendWalUsage with pgWalUsage so that
 	 * pgstat_backend_flush_cb() can calculate how much pgWalUsage counters
 	 * are increased by subtracting prevBackendWalUsage from pgWalUsage.
 	 */
 	prevBackendWalUsage = pgWalUsage;
+
+	/*
+	 * Initialize prevBackendIOUsage with pgIOUsage so that
+	 * pgstat_backend_flush_cb() can calculate how much IO counters are
+	 * increased by subtracting prevBackendIOUsage from pgIOUsage.
+	 */
+	prevBackendIOUsage = pgIOUsage;
+
+	backend_has_iostats = false;
 }
 
 /*
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 13ae57ed649..aa20bd2590e 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -20,9 +20,16 @@
 #include "storage/bufmgr.h"
 #include "utils/pgstat_internal.h"
 
-static PgStat_PendingIO PendingIOStats;
 static bool have_iostats = false;
 
+/*
+ * IO counters saved from pgIOUsage at the previous call to
+ * pgstat_report_stat(). This is used to calculate how much IO usage
+ * happens between pgstat_report_stat() calls, by subtracting
+ * the previous counters from the current ones.
+ */
+static IOUsage prevIOUsage;
+
 /*
  * Check that stats have not been counted for any combination of IOObject,
  * IOContext, and IOOp which are not tracked for the passed-in BackendType. If
@@ -73,13 +80,11 @@ pgstat_count_io_op(IOObject io_object, IOContext io_context, IOOp io_op,
 	Assert(pgstat_is_ioop_tracked_in_bytes(io_op) || bytes == 0);
 	Assert(pgstat_tracks_io_op(MyBackendType, io_object, io_context, io_op));
 
-	PendingIOStats.counts[io_object][io_context][io_op] += cnt;
-	PendingIOStats.bytes[io_object][io_context][io_op] += bytes;
-
-	/* Add the per-backend counts */
-	pgstat_count_backend_io_op(io_object, io_context, io_op, cnt, bytes);
+	pgIOUsage.counts[io_object][io_context][io_op] += cnt;
+	pgIOUsage.bytes[io_object][io_context][io_op] += bytes;
 
 	have_iostats = true;
+	backend_has_iostats = true;
 	pgstat_report_fixed = true;
 }
 
@@ -149,12 +154,8 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
 			}
 		}
 
-		INSTR_TIME_ADD(PendingIOStats.pending_times[io_object][io_context][io_op],
+		INSTR_TIME_ADD(pgIOUsage.pending_times[io_object][io_context][io_op],
 					   io_time);
-
-		/* Add the per-backend count */
-		pgstat_count_backend_io_op_time(io_object, io_context, io_op,
-										io_time);
 	}
 
 	pgstat_count_io_op(io_object, io_context, io_op, cnt, bytes);
@@ -209,27 +210,35 @@ pgstat_io_flush_cb(bool nowait)
 		{
 			for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
 			{
-				instr_time	time;
+				instr_time	time_diff;
 
 				bktype_shstats->counts[io_object][io_context][io_op] +=
-					PendingIOStats.counts[io_object][io_context][io_op];
+					pgIOUsage.counts[io_object][io_context][io_op] -
+					prevIOUsage.counts[io_object][io_context][io_op];
 
 				bktype_shstats->bytes[io_object][io_context][io_op] +=
-					PendingIOStats.bytes[io_object][io_context][io_op];
+					pgIOUsage.bytes[io_object][io_context][io_op] -
+					prevIOUsage.bytes[io_object][io_context][io_op];
 
-				time = PendingIOStats.pending_times[io_object][io_context][io_op];
+				INSTR_TIME_SET_ZERO(time_diff);
+				INSTR_TIME_ACCUM_DIFF(time_diff,
+									  pgIOUsage.pending_times[io_object][io_context][io_op],
+									  prevIOUsage.pending_times[io_object][io_context][io_op]);
 
 				bktype_shstats->times[io_object][io_context][io_op] +=
-					INSTR_TIME_GET_MICROSEC(time);
+					INSTR_TIME_GET_MICROSEC(time_diff);
 			}
 		}
 	}
 
 	Assert(pgstat_bktype_io_stats_valid(bktype_shstats, MyBackendType));
 
-	LWLockRelease(bktype_lock);
+	/*
+	 * Save the current counters for the subsequent calculation of IO usage.
+	 */
+	prevIOUsage = pgIOUsage;
 
-	memset(&PendingIOStats, 0, sizeof(PendingIOStats));
+	LWLockRelease(bktype_lock);
 
 	have_iostats = false;
 
@@ -274,6 +283,17 @@ pgstat_get_io_object_name(IOObject io_object)
 	pg_unreachable();
 }
 
+void
+pgstat_io_init_backend_cb(void)
+{
+	/*
+	 * Initialize prevIOUsage with pgIOUsage so that pgstat_io_flush_cb() can
+	 * calculate how much pgIOUsage counters are increased by subtracting
+	 * prevIOUsage from pgIOUsage.
+	 */
+	prevIOUsage = pgIOUsage;
+}
+
 void
 pgstat_io_init_shmem_cb(void *stats)
 {
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index f402b17295c..660bb6c6876 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -326,12 +326,12 @@ typedef struct PgStat_BktypeIO
 	PgStat_Counter times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
 } PgStat_BktypeIO;
 
-typedef struct PgStat_PendingIO
+typedef struct IOUsage
 {
 	uint64		bytes[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
 	PgStat_Counter counts[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
 	instr_time	pending_times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
-} PgStat_PendingIO;
+} IOUsage;
 
 typedef struct PgStat_IO
 {
@@ -492,18 +492,6 @@ typedef struct PgStat_Backend
 	PgStat_WalCounters wal_counters;
 } PgStat_Backend;
 
-/* ---------
- * PgStat_BackendPending	Non-flushed backend stats.
- * ---------
- */
-typedef struct PgStat_BackendPending
-{
-	/*
-	 * Backend statistics store the same amount of IO data as PGSTAT_KIND_IO.
-	 */
-	PgStat_PendingIO pending_io;
-} PgStat_BackendPending;
-
 /*
  * Functions in pgstat.c
  */
@@ -548,15 +536,6 @@ extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
  * Functions in pgstat_backend.c
  */
 
-/* used by pgstat_io.c for I/O stats tracked in backends */
-extern void pgstat_count_backend_io_op_time(IOObject io_object,
-											IOContext io_context,
-											IOOp io_op,
-											instr_time io_time);
-extern void pgstat_count_backend_io_op(IOObject io_object,
-									   IOContext io_context,
-									   IOOp io_op, uint32 cnt,
-									   uint64 bytes);
 extern PgStat_Backend *pgstat_fetch_stat_backend(ProcNumber procNumber);
 extern PgStat_Backend *pgstat_fetch_stat_backend_by_pid(int pid,
 														BackendType *bktype);
@@ -800,6 +779,10 @@ extern PGDLLIMPORT bool pgstat_track_counts;
 extern PGDLLIMPORT int pgstat_track_functions;
 extern PGDLLIMPORT int pgstat_fetch_consistency;
 
+/*
+ * Variables in pgstat_backend.c
+ */
+extern PGDLLIMPORT bool backend_has_iostats;
 
 /*
  * Variables in pgstat_bgwriter.c
@@ -838,4 +821,10 @@ extern PGDLLIMPORT PgStat_Counter pgStatTransactionIdleTime;
 /* updated by the traffic cop and in errfinish() */
 extern PGDLLIMPORT SessionEndType pgStatSessionEndCause;
 
+/*
+ * Variables in pgstat_io.c
+ */
+
+extern PGDLLIMPORT IOUsage pgIOUsage;
+
 #endif							/* PGSTAT_H */
diff --git a/src/include/utils/pgstat_internal.h b/src/include/utils/pgstat_internal.h
index 6cf00008f63..99a94560fcc 100644
--- a/src/include/utils/pgstat_internal.h
+++ b/src/include/utils/pgstat_internal.h
@@ -666,6 +666,7 @@ extern bool pgstat_function_flush_cb(PgStat_EntryRef *entry_ref, bool nowait);
  * Functions in pgstat_io.c
  */
 
+extern void pgstat_io_init_backend_cb(void);
 extern void pgstat_flush_io(bool nowait);
 
 extern bool pgstat_io_flush_cb(bool nowait);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index a13e8162890..36eadeaf281 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1309,6 +1309,7 @@ InvalMessageArray
 InvalidationInfo
 InvalidationMsgsGroup
 IoMethodOps
+IOUsage
 IpcMemoryId
 IpcMemoryKey
 IpcMemoryState
@@ -2226,7 +2227,6 @@ PgStatShared_Subscription
 PgStatShared_Wal
 PgStat_ArchiverStats
 PgStat_Backend
-PgStat_BackendPending
 PgStat_BackendSubEntry
 PgStat_BgWriterStats
 PgStat_BktypeIO
@@ -2242,7 +2242,6 @@ PgStat_IO
 PgStat_KindInfo
 PgStat_LocalState
 PgStat_PendingDroppedStatsItem
-PgStat_PendingIO
 PgStat_SLRUStats
 PgStat_ShmemControl
 PgStat_Snapshot
-- 
2.34.1

#5Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#2)
Re: Get rid of pgstat_count_backend_io_op*() functions

Hi,

On 2025-09-02 08:19:22 +0900, Michael Paquier wrote:

On Mon, Sep 01, 2025 at 02:07:27PM +0000, Bertrand Drouvot wrote:

Instead, it now copies the IO pending stats to the backend IO pending stats when
the pending IO stats are flushed. This behaves the same way as for some relation
and database stats (see pgstat_relation_flush_cb()).

It's done that way to avoid incrementing the "same" counters twice as it produces
increased overhead in profiles (reported by Andres in [1]).

So, is the complaint about the addition of the extra incrementations
for backend counters on top of the existing IO counters in v17,
leading to more instructions generated, the addition of the functions,
or both things at the same time?

The latter, I think.

Do you have an example of profile and/or workload where this actually
matters?

It's not a large regression by any means - it shows up when micro-benchmarking
seqscans that are consumed where quickly (e.g. OFFSET large; or count(*)).

Also, how much does it matter for v18?

I think it's ok for 18. We just don't want to go further down the wrong way. I
objected to this approach in the context of the tuple level counters, where it
matters more, because obviously those are incremented a lot more often.

Removing both the function calls and the extra counter incrementations
means to do the same thing as the WAL stats, where we have one
structure in charge of incrementing the counters, then the data diffs
are fed when flushing the entries in all the stats kinds.

I think that's the wrong direction to go. Diffing stats is far from cheap and
gets more expensive the more detail you add to stats.

EXPLAIN ANALYZE spends a large chunk of the time doing diffing of buffer
access stats, for example. We need to work towards doing less of that stuff,
not more.

Greetings,

Andres Freund

#6Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#5)
Re: Get rid of pgstat_count_backend_io_op*() functions

On Tue, Sep 02, 2025 at 12:42:54PM -0400, Andres Freund wrote:

On 2025-09-02 08:19:22 +0900, Michael Paquier wrote:

On Mon, Sep 01, 2025 at 02:07:27PM +0000, Bertrand Drouvot wrote:

Instead, it now copies the IO pending stats to the backend IO pending stats when
the pending IO stats are flushed. This behaves the same way as for some relation
and database stats (see pgstat_relation_flush_cb()).

It's done that way to avoid incrementing the "same" counters twice as it produces
increased overhead in profiles (reported by Andres in [1]).

So, is the complaint about the addition of the extra incrementations
for backend counters on top of the existing IO counters in v17,
leading to more instructions generated, the addition of the functions,
or both things at the same time?

The latter, I think.

Another option would be more inlining, applying the same to the
pgstat_io.c parts. It sounds to me like that limiting the number of
incrementations to happen only once when shared across multiple stats
kinds would be beneficial anyway, so we could just do that. My whole
picture would then come to:
- Split the IO counters (PendingIOStats) into a separate file, with
perhaps inline functions to have a more elegant layer, somewhat
relevant for pgstat_report_fixed as well.
- Update the stats kinds to apply the diffs in the counters at flush
time.

Removing both the function calls and the extra counter incrementations
means to do the same thing as the WAL stats, where we have one
structure in charge of incrementing the counters, then the data diffs
are fed when flushing the entries in all the stats kinds.

I think that's the wrong direction to go. Diffing stats is far from cheap and
gets more expensive the more detail you add to stats.

Even if we only do the diffs calculations when flushing the pending
stats in the flush callbacks? That would mean to calculate the diffs
when stats reports are forced at transaction commit or when the delay
threshold is reached in non-force mode for pgstat_report_stat().

EXPLAIN ANALYZE spends a large chunk of the time doing diffing of buffer
access stats, for example. We need to work towards doing less of that stuff,
not more.

Yep, agreed. I've seen people complaining about the overhead that can
happen in the EXPLAIN path, even for stuff aimed to be maintained
outside of core, that uses the backend hooks.
--
Michael

#7Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Michael Paquier (#6)
Re: Get rid of pgstat_count_backend_io_op*() functions

Hi,

On Wed, Sep 03, 2025 at 02:47:51PM +0900, Michael Paquier wrote:

On Tue, Sep 02, 2025 at 12:42:54PM -0400, Andres Freund wrote:

I think that's the wrong direction to go. Diffing stats is far from cheap and
gets more expensive the more detail you add to stats.

Even if we only do the diffs calculations when flushing the pending
stats in the flush callbacks?

If my math are correct we have 3 � 5 � 8 = 120 array positions and for each
we'd do the diff on counts, bytes and times. That means 360 diff operations per
flush. That means 720 diff operations to flush the global and per backend IO
stats. That's certainly more expensive than "just" copying the pending stats
as proposed in v1.

As far the ordering concern for v1, what about:

- let backend kind enum defined as 6
- move the backend flush outside of the loop in pgstat_report_stat()

That way the backend are flushed last and that solves your concern about custom
stats kind.

The backend would not be the "only" exception in pgstat_report_stat(), the db
stats are already handled as exception with pgstat_update_dbstats().

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#8Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Bertrand Drouvot (#7)
1 attachment(s)
Re: Get rid of pgstat_count_backend_io_op*() functions

Hi,

On Wed, Sep 03, 2025 at 07:33:37AM +0000, Bertrand Drouvot wrote:

Hi,

On Wed, Sep 03, 2025 at 02:47:51PM +0900, Michael Paquier wrote:

On Tue, Sep 02, 2025 at 12:42:54PM -0400, Andres Freund wrote:

I think that's the wrong direction to go. Diffing stats is far from cheap and
gets more expensive the more detail you add to stats.

Even if we only do the diffs calculations when flushing the pending
stats in the flush callbacks?

If my math are correct we have 3 � 5 � 8 = 120 array positions and for each
we'd do the diff on counts, bytes and times. That means 360 diff operations per
flush. That means 720 diff operations to flush the global and per backend IO
stats. That's certainly more expensive than "just" copying the pending stats
as proposed in v1.

As far the ordering concern for v1, what about:

- let backend kind enum defined as 6
- move the backend flush outside of the loop in pgstat_report_stat()

That way the backend are flushed last and that solves your concern about custom
stats kind.

The backend would not be the "only" exception in pgstat_report_stat(), the db
stats are already handled as exception with pgstat_update_dbstats().

That would give something like v3 attached, thoughts?

Once we agree on the approach to deal with per backend pending stats, I'll make
use of the same in [1]/messages/by-id/aJDBwNlyiFuJOoPx@ip-10-97-1-34.eu-west-3.compute.internal and [2]/messages/by-id/aJrxug4LCg4Hm5Mm@ip-10-97-1-34.eu-west-3.compute.internal.

[1]: /messages/by-id/aJDBwNlyiFuJOoPx@ip-10-97-1-34.eu-west-3.compute.internal
[2]: /messages/by-id/aJrxug4LCg4Hm5Mm@ip-10-97-1-34.eu-west-3.compute.internal

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v3-0001-Get-rid-of-pgstat_count_backend_io_op-functions.patchtext/x-diff; charset=us-asciiDownload
From ffe65b418be46a8b4d9f5e51891967abbff96f86 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Date: Mon, 1 Sep 2025 10:54:35 +0000
Subject: [PATCH v3] Get rid of pgstat_count_backend_io_op*() functions

This commit removes the functions that are incrementing the backend IO stats.
Instead, it now copies the IO pending stats to the backend IO pending stats when
the pending IO stats are flushed. This behaves the same way as for some relation
and database stats.

It's done that way to avoid incrementing the "same" counters twice as it produces
increased overhead in profiles.

Note that per-backend statistics have to be last in pgstat_report_stat() as some
of their pending stats are populated during other stats kinds flush.

Reported-by: Andres Freund <andres@anarazel.de>
---
 src/backend/utils/activity/pgstat.c         | 14 ++++++-
 src/backend/utils/activity/pgstat_backend.c | 42 +--------------------
 src/backend/utils/activity/pgstat_io.c      | 18 +++++----
 src/include/pgstat.h                        | 14 +++----
 4 files changed, 31 insertions(+), 57 deletions(-)
  81.2% src/backend/utils/activity/
  18.7% src/include/

diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 73c2ced3f4e..f57a66d89b0 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -768,10 +768,15 @@ pgstat_report_stat(bool force)
 	/* flush of other stats kinds */
 	if (pgstat_report_fixed)
 	{
+		const PgStat_KindInfo *kind_info;
+
 		for (PgStat_Kind kind = PGSTAT_KIND_MIN; kind <= PGSTAT_KIND_MAX; kind++)
 		{
-			const PgStat_KindInfo *kind_info = pgstat_get_kind_info(kind);
 
+			kind_info = pgstat_get_kind_info(kind);
+
+			if (kind == PGSTAT_KIND_BACKEND)
+				continue;
 			if (!kind_info)
 				continue;
 			if (!kind_info->flush_static_cb)
@@ -779,6 +784,13 @@ pgstat_report_stat(bool force)
 
 			partial_flush |= kind_info->flush_static_cb(nowait);
 		}
+
+		/*
+		 * Do per-backend last as some of their pending stats are populated
+		 * during the flush of other stats kinds.
+		 */
+		kind_info = pgstat_get_kind_info(PGSTAT_KIND_BACKEND);
+		partial_flush |= kind_info->flush_static_cb(nowait);
 	}
 
 	last_flush = now;
diff --git a/src/backend/utils/activity/pgstat_backend.c b/src/backend/utils/activity/pgstat_backend.c
index 07a1116671b..ea80053304d 100644
--- a/src/backend/utils/activity/pgstat_backend.c
+++ b/src/backend/utils/activity/pgstat_backend.c
@@ -36,8 +36,8 @@
  * reported within critical sections so we use static memory in order to avoid
  * memory allocation.
  */
-static PgStat_BackendPending PendingBackendStats;
-static bool backend_has_iostats = false;
+PgStat_BackendPending PendingBackendStats;
+bool		backend_has_iostats = false;
 
 /*
  * WAL usage counters saved from pgWalUsage at the previous call to
@@ -47,44 +47,6 @@ static bool backend_has_iostats = false;
  */
 static WalUsage prevBackendWalUsage;
 
-/*
- * Utility routines to report I/O stats for backends, kept here to avoid
- * exposing PendingBackendStats to the outside world.
- */
-void
-pgstat_count_backend_io_op_time(IOObject io_object, IOContext io_context,
-								IOOp io_op, instr_time io_time)
-{
-	Assert(track_io_timing || track_wal_io_timing);
-
-	if (!pgstat_tracks_backend_bktype(MyBackendType))
-		return;
-
-	Assert(pgstat_tracks_io_op(MyBackendType, io_object, io_context, io_op));
-
-	INSTR_TIME_ADD(PendingBackendStats.pending_io.pending_times[io_object][io_context][io_op],
-				   io_time);
-
-	backend_has_iostats = true;
-	pgstat_report_fixed = true;
-}
-
-void
-pgstat_count_backend_io_op(IOObject io_object, IOContext io_context,
-						   IOOp io_op, uint32 cnt, uint64 bytes)
-{
-	if (!pgstat_tracks_backend_bktype(MyBackendType))
-		return;
-
-	Assert(pgstat_tracks_io_op(MyBackendType, io_object, io_context, io_op));
-
-	PendingBackendStats.pending_io.counts[io_object][io_context][io_op] += cnt;
-	PendingBackendStats.pending_io.bytes[io_object][io_context][io_op] += bytes;
-
-	backend_has_iostats = true;
-	pgstat_report_fixed = true;
-}
-
 /*
  * Returns statistics of a backend by proc number.
  */
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 13ae57ed649..0520f4a1d33 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -76,9 +76,6 @@ pgstat_count_io_op(IOObject io_object, IOContext io_context, IOOp io_op,
 	PendingIOStats.counts[io_object][io_context][io_op] += cnt;
 	PendingIOStats.bytes[io_object][io_context][io_op] += bytes;
 
-	/* Add the per-backend counts */
-	pgstat_count_backend_io_op(io_object, io_context, io_op, cnt, bytes);
-
 	have_iostats = true;
 	pgstat_report_fixed = true;
 }
@@ -151,10 +148,6 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
 
 		INSTR_TIME_ADD(PendingIOStats.pending_times[io_object][io_context][io_op],
 					   io_time);
-
-		/* Add the per-backend count */
-		pgstat_count_backend_io_op_time(io_object, io_context, io_op,
-										io_time);
 	}
 
 	pgstat_count_io_op(io_object, io_context, io_op, cnt, bytes);
@@ -184,6 +177,9 @@ pgstat_flush_io(bool nowait)
  *
  * If nowait is true, this function returns true if the lock could not be
  * acquired. Otherwise, return false.
+ *
+ * The stats are copied to the corresponding pending backend stats when
+ * successfully flushing.
  */
 bool
 pgstat_io_flush_cb(bool nowait)
@@ -227,6 +223,14 @@ pgstat_io_flush_cb(bool nowait)
 
 	Assert(pgstat_bktype_io_stats_valid(bktype_shstats, MyBackendType));
 
+	/* Do the same for the backend stats */
+	if (pgstat_tracks_backend_bktype(MyBackendType))
+	{
+		PendingBackendStats.pending_io = PendingIOStats;
+		backend_has_iostats = true;
+		pgstat_report_fixed = true;
+	}
+
 	LWLockRelease(bktype_lock);
 
 	memset(&PendingIOStats, 0, sizeof(PendingIOStats));
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index f402b17295c..fe0fdaf4883 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -548,15 +548,6 @@ extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
  * Functions in pgstat_backend.c
  */
 
-/* used by pgstat_io.c for I/O stats tracked in backends */
-extern void pgstat_count_backend_io_op_time(IOObject io_object,
-											IOContext io_context,
-											IOOp io_op,
-											instr_time io_time);
-extern void pgstat_count_backend_io_op(IOObject io_object,
-									   IOContext io_context,
-									   IOOp io_op, uint32 cnt,
-									   uint64 bytes);
 extern PgStat_Backend *pgstat_fetch_stat_backend(ProcNumber procNumber);
 extern PgStat_Backend *pgstat_fetch_stat_backend_by_pid(int pid,
 														BackendType *bktype);
@@ -800,6 +791,11 @@ extern PGDLLIMPORT bool pgstat_track_counts;
 extern PGDLLIMPORT int pgstat_track_functions;
 extern PGDLLIMPORT int pgstat_fetch_consistency;
 
+/*
+ * Variables in pgstat_backend.c
+ */
+extern PGDLLIMPORT PgStat_BackendPending PendingBackendStats;
+extern PGDLLIMPORT bool backend_has_iostats;
 
 /*
  * Variables in pgstat_bgwriter.c
-- 
2.34.1

#9Michael Paquier
michael@paquier.xyz
In reply to: Bertrand Drouvot (#8)
Re: Get rid of pgstat_count_backend_io_op*() functions

On Wed, Sep 24, 2025 at 07:48:32AM +0000, Bertrand Drouvot wrote:

On Wed, Sep 03, 2025 at 07:33:37AM +0000, Bertrand Drouvot wrote:

As far the ordering concern for v1, what about:

- let backend kind enum defined as 6
- move the backend flush outside of the loop in pgstat_report_stat()

That way the backend are flushed last and that solves your concern about custom
stats kind.

The backend would not be the "only" exception in pgstat_report_stat(), the db
stats are already handled as exception with pgstat_update_dbstats().

That would give something like v3 attached, thoughts?

Once we agree on the approach to deal with per backend pending stats, I'll make
use of the same in [1] and [2].

         for (PgStat_Kind kind = PGSTAT_KIND_MIN; kind <= PGSTAT_KIND_MAX; kind++)
         {
[...]
+            if (kind == PGSTAT_KIND_BACKEND)
+                continue;
             if (!kind_info)
                 continue;
             if (!kind_info->flush_static_cb)
             partial_flush |= kind_info->flush_static_cb(nowait);
         }
+
+        /*
+         * Do per-backend last as some of their pending stats are populated
+         * during the flush of other stats kinds.
+         */
+        kind_info = pgstat_get_kind_info(PGSTAT_KIND_BACKEND);
+        partial_flush |= kind_info->flush_static_cb(nowait);
     }

Hmm. I really have an issue with the ordering dependency this
introduces between stats kinds, assumed to happen inside the code. Up
to now, we have always considered stats kinds as rather independent
pieces at flush time, simplifying its code (I am not saying counter
increment time). I get that you are doing what you do here because
the IO pending statistics (PendingIOStats) are reset each time the
stat IO flush callback finishes its job, and we'd lose the amount of
stats saved between two reports.

Putting aside the style issue (aka I like the pgstat_count_backend_*
interface, rather than copying the pending data), I think that we lack
data to be able to clearly identify how and actually how much we
should try to optimize and change this code:
- Should we try to use more inlining?
- Should we try to use more static structures, without any functions
at all.
- Should we try to prefer copies? In which case, could it make sense
to introduce a concept of dependency between stats kinds? This way,
we could order how the flush actions are taken rather than having a
for loop based on the sole stats kind IDs.
- Should we try to reduce diff operations at flush time? This is
something done by the WAL stats and worry about their costs, with
diffs calculated each time between the current and previous states?
With a non-forced report delay of 10s, this does not worry me, short
transactions would, as we force reports each time a transaction
finishes.

For the first and second point, more optimization could be done for
more stats kinds. For the last point, we'd want to reconsider how
pgstat_wal.c does its job. As the proposal stands, at least it seems
to me, this sacrifices code readability, though I get that the opinion
of each may differ on the matter. And perhaps there are practices
that we may be able to use in core for all the stats kinds, as well.
That would mean doing benchmarks with specific workloads on big
hardware, likely. So, what I am seeing currently here is an
incomplete picture.
--
Michael

#10Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Michael Paquier (#9)
Re: Get rid of pgstat_count_backend_io_op*() functions

Hi,

On Thu, Sep 25, 2025 at 03:42:33PM +0900, Michael Paquier wrote:

On Wed, Sep 24, 2025 at 07:48:32AM +0000, Bertrand Drouvot wrote:

On Wed, Sep 03, 2025 at 07:33:37AM +0000, Bertrand Drouvot wrote:

As far the ordering concern for v1, what about:

- let backend kind enum defined as 6
- move the backend flush outside of the loop in pgstat_report_stat()

That way the backend are flushed last and that solves your concern about custom
stats kind.

The backend would not be the "only" exception in pgstat_report_stat(), the db
stats are already handled as exception with pgstat_update_dbstats().

That would give something like v3 attached, thoughts?

Once we agree on the approach to deal with per backend pending stats, I'll make
use of the same in [1] and [2].

for (PgStat_Kind kind = PGSTAT_KIND_MIN; kind <= PGSTAT_KIND_MAX; kind++)
{
[...]
+            if (kind == PGSTAT_KIND_BACKEND)
+                continue;
if (!kind_info)
continue;
if (!kind_info->flush_static_cb)
partial_flush |= kind_info->flush_static_cb(nowait);
}
+
+        /*
+         * Do per-backend last as some of their pending stats are populated
+         * during the flush of other stats kinds.
+         */
+        kind_info = pgstat_get_kind_info(PGSTAT_KIND_BACKEND);
+        partial_flush |= kind_info->flush_static_cb(nowait);
}

Hmm. I really have an issue with the ordering dependency this
introduces between stats kinds, assumed to happen inside the code. Up
to now, we have always considered stats kinds as rather independent
pieces at flush time, simplifying its code (I am not saying counter
increment time).

We already have such dependencies for some databases/relations stats but that
works well because all of them are "variable" stats and their flush is based
on the pgStatPending list.

I get that you are doing what you do here because
the IO pending statistics (PendingIOStats) are reset each time the
stat IO flush callback finishes its job, and we'd lose the amount of
stats saved between two reports.

We don't lose the stats, but it takes one more reporting cyle to
get them flushed. The sequence (without the ordering being forced) is:

1. static flush of IO backend (happens before IO because backend's enum < IO's enum)
2. static flush of IO -> pending IOs are copied to backend's pending IOs

later on (i.e next report cycle):

3. static flush of IO backend
4. static flush of IO -> pending IOs are copied to backend's pending IO

During 3. we now see the backend IO stats coming from 2.

So the stats would still be reported correctly, just with a one-cycle backend
IO reporting stats delay.

Putting aside the style issue (aka I like the pgstat_count_backend_*
interface, rather than copying the pending data), I think that we lack
data to be able to clearly identify how and actually how much we
should try to optimize and change this code:
- Should we try to use more inlining?
- Should we try to use more static structures, without any functions
at all.
- Should we try to prefer copies?

The fundamental issue is that as we add more stats to PGSTAT_KIND_BACKEND,
we increasingly end up doing duplicate work: the same operation gets counted in
both the specialized stat kind (IO, relation, etc.) and the backend stats.

I think that backend stats should serve as a consolidated view of what a particular
backend is doing, not as an independent counting mechanism.

It makes more sense for backend stats to copy from the specialized pending stats
rather than having every operation increment multiple independent counters.

The specific optimization techniques you mention (inlining, static structures, etc.)
don't address the fundamental architectural issue of duplicate counting.

In which case, could it make sense
to introduce a concept of dependency between stats kinds? This way,
we could order how the flush actions are taken rather than having a
for loop based on the sole stats kind IDs.

We could, but I don't think we need that complexity yet.
Backend stats are inherently special they're a "consolidated" view of all activity
for a specific backend, which naturally makes them dependent on other stat kinds.

This makes backend stats naturally suited to be "last" in the flush order.

I'm not sure we'll need such ordering dependencies for other kinds. And if we need
more later on, we could always work on a more sophisticated approach.

The v3 implementation is explicit about this ordering (with comments and clear
code structure).

- Should we try to reduce diff operations at flush time? This is
something done by the WAL stats and worry about their costs, with
diffs calculated each time between the current and previous states?
With a non-forced report delay of 10s, this does not worry me, short
transactions would, as we force reports each time a transaction
finishes.

Right, that might be a concern for short transactions.

The copy approach makes more sense to me: it is just a simple memory copy operation.
That produces less operations than iterating through all the counter arrays doing
arithmetic. I think it's hard to beat (even if we "could" reduce diff operations
at flush time).

For the first and second point, more optimization could be done for
more stats kinds. For the last point, we'd want to reconsider how
pgstat_wal.c does its job. As the proposal stands, at least it seems
to me, this sacrifices code readability, though I get that the opinion
of each may differ on the matter. And perhaps there are practices
that we may be able to use in core for all the stats kinds, as well.

I disagree on the readability point. The copy approach is actually quite readable
and is very clear about what's happening.

The diff approach requires understanding:

- Why we maintain separate previous state variables
- How the diff calculations work across multi-dimensional arrays
- The complex time arithmetic with INSTR_TIME_ACCUM_DIFF

Regarding performance, I think it's hard to beat a simple memory copy.
Even with potential optimizations to other approaches, you're still doing more
work (arithmetic operations across arrays) versus a straightforward copy operation.

As for incrementing counters twice "just" for code readability, that seems backwards
to me.

That would mean doing benchmarks with specific workloads on big
hardware, likely.

I don't think we need to benchmark each approach.

I suspect benchmarks would show that avoiding duplicate work in the hot path (I/O operations)
provides benefits.

So, what I am seeing currently here is an incomplete picture.

I think the picture is actually quite complete for the specific problem we're solving.

Backend stats are unique in that they're designed to "aggregate" activity from
other stat kinds and that's what creates the duplicate counting issue.

The "incomplete picture" would be if we were trying to optimize all stat kinds
uniformly, but that's not the problem here.

The problem is specifically that backend stats create duplicate work by counting
the same operations that are already being counted in their respective dedicated
stats kind.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com