Flush some statistics within running transactions
Hi hackers,
Long running transactions can accumulate significant statistics (WAL, IO, ...)
that remain unflushed until the transaction ends. This delays visibility of
resource usage in monitoring views like pg_stat_io and pg_stat_wal.
This patch series introduces the ability to $SUBJECT (suggested in [1]) to:
- improve monitoring of long running transactions
- avoid missing places where we should flush statistics (like the one fixed in
039549d70f6)
The patch series is made of 3 sub-patches:
0001: Add pgstat_report_anytime_stat() for periodic stats flushing
It introduces pgstat_report_anytime_stat(), which flushes non transactional
statistics even inside active transactions. A new timeout handler fires every
second to call this function, ensuring timely stats visibility without waiting
for transaction completion.
Implementation details:
- Add PgStat_FlushBehavior enum to classify stats kinds:
* FLUSH_ANYTIME: Stats that can always be flushed (WAL, IO, ...)
* FLUSH_AT_TXN_BOUNDARY: Stats requiring transaction boundaries
- Modify pgstat_flush_pending_entries() and pgstat_flush_fixed_stats() to accept
a boolean anytime_only parameter:
* When false: flushes all stats (existing behavior)
* When true: flushes only FLUSH_ANYTIME stats and skips FLUSH_AT_TXN_BOUNDARY
stats
- Register ANYTIME_STATS_UPDATE_TIMEOUT that fires every 1 second, calling
pgstat_report_anytime_stat(false)
Remarks:
- The force parameter in pgstat_report_anytime_stat() is currently unused (always
called with force=false) but reserved for future use cases requiring immediate flushing.
The 1 second flush interval is currently hardcoded, but we could imagine increasing
it or making it configurable. I ran some benchmarks and did not observe any noticeable
performance regression, even with a large number of pending entries.
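As a rough, user-level illustration of what 0001 changes (not part of the patch; the
table name "t" is just an example), one could run something like this with the patch
applied and watch the counters move while the transaction is still open:

```
-- Session 1: long-running transaction generating WAL and IO
BEGIN;
INSERT INTO t SELECT generate_series(1, 1000000);
-- ... the transaction stays open, no COMMIT yet ...

-- Session 2: without 0001 these counters would only move once session 1 ends;
-- with the 1 second timeout they should advance while it is still running
SELECT wal_records, wal_bytes FROM pg_stat_wal;
SELECT pg_sleep(2);
SELECT wal_records, wal_bytes FROM pg_stat_wal;
```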
0002: Remove useless calls to flush some stats
Now that some stats can be flushed outside of transaction boundaries, remove the
now-redundant explicit flush calls. Those calls were only in place because,
before 0001, stats were flushed solely at transaction boundaries.
Remarks:
- it reverts 039549d70f6 (keeping only its tests)
- it can't be done for the checkpointer and bgworker, for example, because they
don't have a flush callback to call
- it can't be done for auxiliary processes (the walsummarizer, for example) because
they currently do not register the new timeout handler
- we may want to improve the current behavior to "fix" the two points above
0003: Add FLUSH_MIXED support and implement it for RELATION stats
This patch extends the non transactional stats infrastructure to support statistics
kinds with mixed transaction behavior: some fields are transactional (e.g., tuple
inserts/updates/deletes) while others are non transactional (e.g., sequential scans,
blocks read, ...).
It introduces FLUSH_MIXED as a third flush behavior type, alongside FLUSH_ANYTIME
and FLUSH_AT_TXN_BOUNDARY. For FLUSH_MIXED kinds, a new flush_anytime_cb callback
enables partial flushing of only the non transactional fields during running
transactions.
Some tests are also added.
Implementation details:
- Add FLUSH_MIXED to PgStat_FlushBehavior enum
- Add flush_anytime_cb to PgStat_KindInfo for partial flushing callback
- Update pgstat_flush_pending_entries() to call flush_anytime_cb for
FLUSH_MIXED entries when in anytime_only mode
- Keep FLUSH_MIXED entries in the pending list after partial flush, as
transactional fields still need to be flushed at transaction boundary
RELATION stats make use of FLUSH_MIXED:
- Change RELATION from FLUSH_AT_TXN_BOUNDARY to FLUSH_MIXED
- Implement pgstat_relation_flush_anytime_cb() to flush only read-related
stats: numscans, tuples_returned, tuples_fetched, blocks_fetched,
blocks_hit
- Clear these fields after flushing to prevent double counting when
pgstat_relation_flush_cb() runs at transaction commit
- Transactional stats (tuples_inserted, tuples_updated, tuples_deleted,
live_tuples, dead_tuples) remain pending until transaction boundary
Remark:
We could also imagine adding a new flush_anytime_static_cb() callback for
future FLUSH_MIXED fixed-amount stats.
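To illustrate the RELATION behavior, here is a condensed, two-session version of what
the added isolation test checks, using the regular monitoring view rather than the
pg_stat_get_*() functions (assuming a table test_stat_tab as in the test):

```
-- Session 1: scan a table inside a transaction that stays open
BEGIN;
SELECT count(*) FROM test_stat_tab;
-- ... no COMMIT yet ...

-- Session 2: after more than PGSTAT_ANYTIME_FLUSH_INTERVAL (1s), the read-related
-- counters should already be visible, while the transactional ones only move once
-- session 1 commits
SELECT pg_sleep(1.5);
SELECT seq_scan, seq_tup_read, n_tup_ins, n_tup_upd, n_tup_del
  FROM pg_stat_user_tables
 WHERE relname = 'test_stat_tab';
```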
[1]: /messages/by-id/erpzwxoptqhuptdrtehqydzjapvroumkhh7lc6poclbhe7jk7l@l3yfsq5q4pw7
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v1-0001-Add-pgstat_report_anytime_stat-for-periodic-stats.patch (text/x-diff)
From 2acc48f3c101b3230090abb53b0e05cc1d8af85f Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Date: Mon, 5 Jan 2026 09:41:39 +0000
Subject: [PATCH v1 1/3] Add pgstat_report_anytime_stat() for periodic stats
flushing
Long running transactions can accumulate significant statistics (WAL, IO, ...)
that remain unflushed until the transaction ends. This delays visibility of
resource usage in monitoring views like pg_stat_io and pg_stat_wal.
This commit introduces pgstat_report_anytime_stat(), which flushes
non transactional statistics even inside active transactions. A new timeout
handler fires every second to call this function, ensuring timely stats visibility
without waiting for transaction completion.
Implementation details:
- Add PgStat_FlushBehavior enum to classify stats kinds:
* FLUSH_ANYTIME: Stats that can always be flushed (WAL, IO, ...)
* FLUSH_AT_TXN_BOUNDARY: Stats requiring transaction boundaries
- Modify pgstat_flush_pending_entries() and pgstat_flush_fixed_stats()
to accept a boolean anytime_only parameter:
* When false: flushes all stats (existing behavior)
* When true: flushes only FLUSH_ANYTIME stats and skips FLUSH_AT_TXN_BOUNDARY stats
- Register ANYTIME_STATS_UPDATE_TIMEOUT that fires every 1 second, calling
pgstat_report_anytime_stat(false)
The force parameter in pgstat_report_anytime_stat() is currently unused (always
called with force=false) but reserved for future use cases requiring immediate
flushing.
---
src/backend/tcop/postgres.c | 18 +++++
src/backend/utils/activity/pgstat.c | 119 ++++++++++++++++++++++++----
src/backend/utils/init/globals.c | 1 +
src/backend/utils/init/postinit.c | 15 ++++
src/include/miscadmin.h | 1 +
src/include/pgstat.h | 4 +
src/include/utils/pgstat_internal.h | 11 +++
src/include/utils/timeout.h | 1 +
src/tools/pgindent/typedefs.list | 1 +
9 files changed, 154 insertions(+), 17 deletions(-)
9.7% src/backend/tcop/
70.2% src/backend/utils/activity/
9.3% src/backend/utils/init/
6.0% src/include/utils/
4.3% src/include/
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index e54bf1e760f..6a91543f80a 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3530,6 +3530,24 @@ ProcessInterrupts(void)
pgstat_report_stat(true);
}
+ /*
+ * Flush stats outside of transaction boundary if the timeout fired.
+ * Unlike transactional stats, these can be flushed even inside a running
+ * transaction.
+ */
+ if (AnytimeStatsUpdateTimeoutPending)
+ {
+ AnytimeStatsUpdateTimeoutPending = false;
+
+ /* Skip if completely idle */
+ if (!DoingCommandRead || IsTransactionOrTransactionBlock())
+ pgstat_report_anytime_stat(false);
+
+ /* Schedule next timeout */
+ enable_timeout_after(ANYTIME_STATS_UPDATE_TIMEOUT,
+ PGSTAT_ANYTIME_FLUSH_INTERVAL);
+ }
+
if (ProcSignalBarrierPending)
ProcessProcSignalBarrier();
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 11bb71cad5a..f7942e47475 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -187,7 +187,8 @@ static void pgstat_init_snapshot_fixed(void);
static void pgstat_reset_after_failure(void);
-static bool pgstat_flush_pending_entries(bool nowait);
+static bool pgstat_flush_pending_entries(bool nowait, bool anytime_only);
+static bool pgstat_flush_fixed_stats(bool nowait, bool anytime_only);
static void pgstat_prep_snapshot(void);
static void pgstat_build_snapshot(void);
@@ -288,6 +289,7 @@ static const PgStat_KindInfo pgstat_kind_builtin_infos[PGSTAT_KIND_BUILTIN_SIZE]
.fixed_amount = false,
.write_to_file = true,
+ .flush_behavior = FLUSH_AT_TXN_BOUNDARY,
/* so pg_stat_database entries can be seen in all databases */
.accessed_across_databases = true,
@@ -305,6 +307,7 @@ static const PgStat_KindInfo pgstat_kind_builtin_infos[PGSTAT_KIND_BUILTIN_SIZE]
.fixed_amount = false,
.write_to_file = true,
+ .flush_behavior = FLUSH_AT_TXN_BOUNDARY,
.shared_size = sizeof(PgStatShared_Relation),
.shared_data_off = offsetof(PgStatShared_Relation, stats),
@@ -321,6 +324,7 @@ static const PgStat_KindInfo pgstat_kind_builtin_infos[PGSTAT_KIND_BUILTIN_SIZE]
.fixed_amount = false,
.write_to_file = true,
+ .flush_behavior = FLUSH_AT_TXN_BOUNDARY,
.shared_size = sizeof(PgStatShared_Function),
.shared_data_off = offsetof(PgStatShared_Function, stats),
@@ -336,6 +340,7 @@ static const PgStat_KindInfo pgstat_kind_builtin_infos[PGSTAT_KIND_BUILTIN_SIZE]
.fixed_amount = false,
.write_to_file = true,
+ .flush_behavior = FLUSH_AT_TXN_BOUNDARY,
.accessed_across_databases = true,
@@ -353,6 +358,7 @@ static const PgStat_KindInfo pgstat_kind_builtin_infos[PGSTAT_KIND_BUILTIN_SIZE]
.fixed_amount = false,
.write_to_file = true,
+ .flush_behavior = FLUSH_AT_TXN_BOUNDARY,
/* so pg_stat_subscription_stats entries can be seen in all databases */
.accessed_across_databases = true,
@@ -370,6 +376,7 @@ static const PgStat_KindInfo pgstat_kind_builtin_infos[PGSTAT_KIND_BUILTIN_SIZE]
.fixed_amount = false,
.write_to_file = false,
+ .flush_behavior = FLUSH_ANYTIME,
.accessed_across_databases = true,
@@ -388,6 +395,7 @@ static const PgStat_KindInfo pgstat_kind_builtin_infos[PGSTAT_KIND_BUILTIN_SIZE]
.fixed_amount = true,
.write_to_file = true,
+ .flush_behavior = FLUSH_ANYTIME,
.snapshot_ctl_off = offsetof(PgStat_Snapshot, archiver),
.shared_ctl_off = offsetof(PgStat_ShmemControl, archiver),
@@ -404,6 +412,7 @@ static const PgStat_KindInfo pgstat_kind_builtin_infos[PGSTAT_KIND_BUILTIN_SIZE]
.fixed_amount = true,
.write_to_file = true,
+ .flush_behavior = FLUSH_ANYTIME,
.snapshot_ctl_off = offsetof(PgStat_Snapshot, bgwriter),
.shared_ctl_off = offsetof(PgStat_ShmemControl, bgwriter),
@@ -420,6 +429,7 @@ static const PgStat_KindInfo pgstat_kind_builtin_infos[PGSTAT_KIND_BUILTIN_SIZE]
.fixed_amount = true,
.write_to_file = true,
+ .flush_behavior = FLUSH_ANYTIME,
.snapshot_ctl_off = offsetof(PgStat_Snapshot, checkpointer),
.shared_ctl_off = offsetof(PgStat_ShmemControl, checkpointer),
@@ -436,6 +446,7 @@ static const PgStat_KindInfo pgstat_kind_builtin_infos[PGSTAT_KIND_BUILTIN_SIZE]
.fixed_amount = true,
.write_to_file = true,
+ .flush_behavior = FLUSH_ANYTIME,
.snapshot_ctl_off = offsetof(PgStat_Snapshot, io),
.shared_ctl_off = offsetof(PgStat_ShmemControl, io),
@@ -453,6 +464,7 @@ static const PgStat_KindInfo pgstat_kind_builtin_infos[PGSTAT_KIND_BUILTIN_SIZE]
.fixed_amount = true,
.write_to_file = true,
+ .flush_behavior = FLUSH_ANYTIME,
.snapshot_ctl_off = offsetof(PgStat_Snapshot, slru),
.shared_ctl_off = offsetof(PgStat_ShmemControl, slru),
@@ -470,6 +482,7 @@ static const PgStat_KindInfo pgstat_kind_builtin_infos[PGSTAT_KIND_BUILTIN_SIZE]
.fixed_amount = true,
.write_to_file = true,
+ .flush_behavior = FLUSH_ANYTIME,
.snapshot_ctl_off = offsetof(PgStat_Snapshot, wal),
.shared_ctl_off = offsetof(PgStat_ShmemControl, wal),
@@ -775,23 +788,11 @@ pgstat_report_stat(bool force)
partial_flush = false;
/* flush of variable-numbered stats tracked in pending entries list */
- partial_flush |= pgstat_flush_pending_entries(nowait);
+ partial_flush |= pgstat_flush_pending_entries(nowait, false);
/* flush of other stats kinds */
if (pgstat_report_fixed)
- {
- for (PgStat_Kind kind = PGSTAT_KIND_MIN; kind <= PGSTAT_KIND_MAX; kind++)
- {
- const PgStat_KindInfo *kind_info = pgstat_get_kind_info(kind);
-
- if (!kind_info)
- continue;
- if (!kind_info->flush_static_cb)
- continue;
-
- partial_flush |= kind_info->flush_static_cb(nowait);
- }
- }
+ partial_flush |= pgstat_flush_fixed_stats(nowait, false);
last_flush = now;
@@ -1345,9 +1346,14 @@ pgstat_delete_pending_entry(PgStat_EntryRef *entry_ref)
/*
* Flush out pending variable-numbered stats.
+ *
+ * If anytime_only is true, only flushes FLUSH_ANYTIME entries.
+ * This is safe to call inside transactions.
+ *
+ * If anytime_only is false, flushes all entries.
*/
static bool
-pgstat_flush_pending_entries(bool nowait)
+pgstat_flush_pending_entries(bool nowait, bool anytime_only)
{
bool have_pending = false;
dlist_node *cur = NULL;
@@ -1377,6 +1383,20 @@ pgstat_flush_pending_entries(bool nowait)
Assert(!kind_info->fixed_amount);
Assert(kind_info->flush_pending_cb != NULL);
+ /* Skip transactional stats if we're in anytime_only mode */
+ if (anytime_only && kind_info->flush_behavior == FLUSH_AT_TXN_BOUNDARY)
+ {
+ have_pending = true;
+
+ if (dlist_has_next(&pgStatPending, cur))
+ next = dlist_next_node(&pgStatPending, cur);
+ else
+ next = NULL;
+
+ cur = next;
+ continue;
+ }
+
/* flush the stats, if possible */
did_flush = kind_info->flush_pending_cb(entry_ref, nowait);
@@ -1397,11 +1417,42 @@ pgstat_flush_pending_entries(bool nowait)
cur = next;
}
- Assert(dlist_is_empty(&pgStatPending) == !have_pending);
+ /*
+ * When in anytime_only mode, the list may not be empty because
+ * FLUSH_AT_TXN_BOUNDARY entries were skipped.
+ */
+ Assert(!anytime_only || dlist_is_empty(&pgStatPending) == !have_pending);
return have_pending;
}
+/*
+ * Flush fixed-amount stats.
+ *
+ * If anytime_only is true, only flushes FLUSH_ANYTIME stats (safe inside transactions).
+ * If anytime_only is false, flushes all stats with flush_static_cb.
+ */
+static bool
+pgstat_flush_fixed_stats(bool nowait, bool anytime_only)
+{
+ bool partial_flush = false;
+
+ for (PgStat_Kind kind = PGSTAT_KIND_MIN; kind <= PGSTAT_KIND_MAX; kind++)
+ {
+ const PgStat_KindInfo *kind_info = pgstat_get_kind_info(kind);
+
+ if (!kind_info || !kind_info->flush_static_cb)
+ continue;
+
+ /* Skip transactional stats if we're in anytime_only mode */
+ if (anytime_only && kind_info->flush_behavior == FLUSH_AT_TXN_BOUNDARY)
+ continue;
+
+ partial_flush |= kind_info->flush_static_cb(nowait);
+ }
+
+ return partial_flush;
+}
/* ------------------------------------------------------------
* Helper / infrastructure functions
@@ -2119,3 +2170,37 @@ assign_stats_fetch_consistency(int newval, void *extra)
if (pgstat_fetch_consistency != newval)
force_stats_snapshot_clear = true;
}
+
+/*
+ * Flush non-transactional stats
+ *
+ * This is safe to call even inside a transaction. It only flushes stats
+ * kinds marked as FLUSH_ANYTIME.
+ *
+ * This allows long running transactions to report activity without waiting
+ * for transaction to finish.
+ */
+void
+pgstat_report_anytime_stat(bool force)
+{
+ bool nowait = !force;
+
+ pgstat_assert_is_up();
+
+ /*
+ * Exit if no pending stats at all. This avoids unnecessary work when
+ * backends are idle or in sessions without stats accumulation.
+ *
+ * Note: This check isn't precise as there might be only transactional
+ * stats pending, which we'll skip during the flush. However, maintaining
+ * precise tracking would add complexity that does not seem worth it from
+ * a performance point of view (no noticeable performance regression has
+ * been observed with the current implementation).
+ */
+ if (dlist_is_empty(&pgStatPending) && !pgstat_report_fixed)
+ return;
+
+ /* Flush stats outside of transaction boundary */
+ pgstat_flush_pending_entries(nowait, true);
+ pgstat_flush_fixed_stats(nowait, true);
+}
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 36ad708b360..ad44826c39e 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -40,6 +40,7 @@ volatile sig_atomic_t IdleSessionTimeoutPending = false;
volatile sig_atomic_t ProcSignalBarrierPending = false;
volatile sig_atomic_t LogMemoryContextPending = false;
volatile sig_atomic_t IdleStatsUpdateTimeoutPending = false;
+volatile sig_atomic_t AnytimeStatsUpdateTimeoutPending = false;
volatile uint32 InterruptHoldoffCount = 0;
volatile uint32 QueryCancelHoldoffCount = 0;
volatile uint32 CritSectionCount = 0;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 3f401faf3de..cb0f6aecad1 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -82,6 +82,7 @@ static void TransactionTimeoutHandler(void);
static void IdleSessionTimeoutHandler(void);
static void IdleStatsUpdateTimeoutHandler(void);
static void ClientCheckTimeoutHandler(void);
+static void AnytimeStatsUpdateTimeoutHandler(void);
static bool ThereIsAtLeastOneRole(void);
static void process_startup_options(Port *port, bool am_superuser);
static void process_settings(Oid databaseid, Oid roleid);
@@ -765,6 +766,9 @@ InitPostgres(const char *in_dbname, Oid dboid,
RegisterTimeout(CLIENT_CONNECTION_CHECK_TIMEOUT, ClientCheckTimeoutHandler);
RegisterTimeout(IDLE_STATS_UPDATE_TIMEOUT,
IdleStatsUpdateTimeoutHandler);
+ RegisterTimeout(ANYTIME_STATS_UPDATE_TIMEOUT,
+ AnytimeStatsUpdateTimeoutHandler);
+ enable_timeout_after(ANYTIME_STATS_UPDATE_TIMEOUT, PGSTAT_ANYTIME_FLUSH_INTERVAL);
}
/*
@@ -1446,3 +1450,14 @@ ThereIsAtLeastOneRole(void)
return result;
}
+
+/*
+ * Timeout handler for flushing non-transactional stats.
+ */
+static void
+AnytimeStatsUpdateTimeoutHandler(void)
+{
+ AnytimeStatsUpdateTimeoutPending = true;
+ InterruptPending = true;
+ SetLatch(MyLatch);
+}
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index db559b39c4d..8aeb9628871 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -96,6 +96,7 @@ extern PGDLLIMPORT volatile sig_atomic_t IdleSessionTimeoutPending;
extern PGDLLIMPORT volatile sig_atomic_t ProcSignalBarrierPending;
extern PGDLLIMPORT volatile sig_atomic_t LogMemoryContextPending;
extern PGDLLIMPORT volatile sig_atomic_t IdleStatsUpdateTimeoutPending;
+extern PGDLLIMPORT volatile sig_atomic_t AnytimeStatsUpdateTimeoutPending;
extern PGDLLIMPORT volatile sig_atomic_t CheckClientConnectionPending;
extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost;
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index fff7ecc2533..86e65397614 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -35,6 +35,9 @@
/* Default directory to store temporary statistics data in */
#define PG_STAT_TMP_DIR "pg_stat_tmp"
+/* When to call pgstat_report_anytime_stat() again */
+#define PGSTAT_ANYTIME_FLUSH_INTERVAL 1000
+
/* Values for track_functions GUC variable --- order is significant! */
typedef enum TrackFunctionsLevel
{
@@ -533,6 +536,7 @@ extern void pgstat_initialize(void);
/* Functions called from backends */
extern long pgstat_report_stat(bool force);
+extern void pgstat_report_anytime_stat(bool force);
extern void pgstat_force_next_flush(void);
extern void pgstat_reset_counters(void);
diff --git a/src/include/utils/pgstat_internal.h b/src/include/utils/pgstat_internal.h
index 9b8fbae00ed..02f4f13fc0f 100644
--- a/src/include/utils/pgstat_internal.h
+++ b/src/include/utils/pgstat_internal.h
@@ -224,6 +224,14 @@ typedef struct PgStat_SubXactStatus
PgStat_TableXactStatus *first; /* head of list for this subxact */
} PgStat_SubXactStatus;
+/*
+ * Flush behavior for statistics kinds.
+ */
+typedef enum PgStat_FlushBehavior
+{
+ FLUSH_ANYTIME, /* All fields can flush anytime */
+ FLUSH_AT_TXN_BOUNDARY, /* All fields need transaction boundary */
+} PgStat_FlushBehavior;
/*
* Metadata for a specific kind of statistics.
@@ -251,6 +259,9 @@ typedef struct PgStat_KindInfo
*/
bool track_entry_count:1;
+ /* Flush behavior */
+ PgStat_FlushBehavior flush_behavior;
+
/*
* The size of an entry in the shared stats hash table (pointed to by
* PgStatShared_HashEntry->body). For fixed-numbered statistics, this is
diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h
index 0965b590b34..10723bb664c 100644
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -35,6 +35,7 @@ typedef enum TimeoutId
IDLE_SESSION_TIMEOUT,
IDLE_STATS_UPDATE_TIMEOUT,
CLIENT_CONNECTION_CHECK_TIMEOUT,
+ ANYTIME_STATS_UPDATE_TIMEOUT,
STARTUP_PROGRESS_TIMEOUT,
/* First user-definable timeout reason */
USER_TIMEOUT,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 09e7f1d420e..9aabb325f16 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2261,6 +2261,7 @@ PgStat_Counter
PgStat_EntryRef
PgStat_EntryRefHashEntry
PgStat_FetchConsistency
+PgStat_FlushBehavior
PgStat_FunctionCallUsage
PgStat_FunctionCounts
PgStat_HashKey
--
2.34.1
v1-0002-Remove-useless-calls-to-flush-some-stats.patch (text/x-diff)
From 2c9e50c45138319660c5aa6860873ffdeebb7a67 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Date: Tue, 6 Jan 2026 11:06:31 +0000
Subject: [PATCH v1 2/3] Remove useless calls to flush some stats
Now that some stats can be flushed outside of transaction boundaries, remove
useless calls to report/flush some stats. Those calls were in place because
before commit <XXXX> stats were flushed only at transaction boundaries.
Note that:
- it reverts 039549d70f6 (it just keeps its tests)
- it can't be done for checkpointer and bgworker for example because they don't
have a flush callback to call
- it can't be done for auxiliary process (walsummarizer for example) because they
currently do not register the new timeout handler
---
src/backend/replication/walreceiver.c | 10 ------
src/backend/replication/walsender.c | 36 ++------------------
src/backend/utils/activity/pgstat_relation.c | 13 -------
3 files changed, 2 insertions(+), 57 deletions(-)
75.3% src/backend/replication/
24.6% src/backend/utils/activity/
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index a41453530a1..266379c780a 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -553,16 +553,6 @@ WalReceiverMain(const void *startup_data, size_t startup_data_len)
*/
bool requestReply = false;
- /*
- * Report pending statistics to the cumulative stats
- * system. This location is useful for the report as it
- * is not within a tight loop in the WAL receiver, to
- * avoid bloating pgstats with requests, while also making
- * sure that the reports happen each time a status update
- * is sent.
- */
- pgstat_report_wal(false);
-
/*
* Check if time since last receive from primary has
* reached the configured limit.
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 1ab09655a70..c33185bd337 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -94,14 +94,10 @@
#include "utils/lsyscache.h"
#include "utils/memutils.h"
#include "utils/pg_lsn.h"
-#include "utils/pgstat_internal.h"
#include "utils/ps_status.h"
#include "utils/timeout.h"
#include "utils/timestamp.h"
-/* Minimum interval used by walsender for stats flushes, in ms */
-#define WALSENDER_STATS_FLUSH_INTERVAL 1000
-
/*
* Maximum data payload in a WAL data message. Must be >= XLOG_BLCKSZ.
*
@@ -1826,7 +1822,6 @@ WalSndWaitForWal(XLogRecPtr loc)
int wakeEvents;
uint32 wait_event = 0;
static XLogRecPtr RecentFlushPtr = InvalidXLogRecPtr;
- TimestampTz last_flush = 0;
/*
* Fast path to avoid acquiring the spinlock in case we already know we
@@ -1847,7 +1842,6 @@ WalSndWaitForWal(XLogRecPtr loc)
{
bool wait_for_standby_at_stop = false;
long sleeptime;
- TimestampTz now;
/* Clear any already-pending wakeups */
ResetLatch(MyLatch);
@@ -1958,8 +1952,7 @@ WalSndWaitForWal(XLogRecPtr loc)
* new WAL to be generated. (But if we have nothing to send, we don't
* want to wake on socket-writable.)
*/
- now = GetCurrentTimestamp();
- sleeptime = WalSndComputeSleeptime(now);
+ sleeptime = WalSndComputeSleeptime(GetCurrentTimestamp());
wakeEvents = WL_SOCKET_READABLE;
@@ -1968,15 +1961,6 @@ WalSndWaitForWal(XLogRecPtr loc)
Assert(wait_event != 0);
- /* Report IO statistics, if needed */
- if (TimestampDifferenceExceeds(last_flush, now,
- WALSENDER_STATS_FLUSH_INTERVAL))
- {
- pgstat_flush_io(false);
- (void) pgstat_flush_backend(false, PGSTAT_BACKEND_FLUSH_IO);
- last_flush = now;
- }
-
WalSndWait(wakeEvents, sleeptime, wait_event);
}
@@ -2879,8 +2863,6 @@ WalSndCheckTimeOut(void)
static void
WalSndLoop(WalSndSendDataCallback send_data)
{
- TimestampTz last_flush = 0;
-
/*
* Initialize the last reply timestamp. That enables timeout processing
* from hereon.
@@ -2975,9 +2957,6 @@ WalSndLoop(WalSndSendDataCallback send_data)
* WalSndWaitForWal() handle any other blocking; idle receivers need
* its additional actions. For physical replication, also block if
* caught up; its send_data does not block.
- *
- * The IO statistics are reported in WalSndWaitForWal() for the
- * logical WAL senders.
*/
if ((WalSndCaughtUp && send_data != XLogSendLogical &&
!streamingDoneSending) ||
@@ -2985,7 +2964,6 @@ WalSndLoop(WalSndSendDataCallback send_data)
{
long sleeptime;
int wakeEvents;
- TimestampTz now;
if (!streamingDoneReceiving)
wakeEvents = WL_SOCKET_READABLE;
@@ -2996,21 +2974,11 @@ WalSndLoop(WalSndSendDataCallback send_data)
* Use fresh timestamp, not last_processing, to reduce the chance
* of reaching wal_sender_timeout before sending a keepalive.
*/
- now = GetCurrentTimestamp();
- sleeptime = WalSndComputeSleeptime(now);
+ sleeptime = WalSndComputeSleeptime(GetCurrentTimestamp());
if (pq_is_send_pending())
wakeEvents |= WL_SOCKET_WRITEABLE;
- /* Report IO statistics, if needed */
- if (TimestampDifferenceExceeds(last_flush, now,
- WALSENDER_STATS_FLUSH_INTERVAL))
- {
- pgstat_flush_io(false);
- (void) pgstat_flush_backend(false, PGSTAT_BACKEND_FLUSH_IO);
- last_flush = now;
- }
-
/* Sleep until something happens or we time out */
WalSndWait(wakeEvents, sleeptime, WAIT_EVENT_WAL_SENDER_MAIN);
}
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index bc8c43b96aa..feae2ae5f44 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -260,15 +260,6 @@ pgstat_report_vacuum(Relation rel, PgStat_Counter livetuples,
}
pgstat_unlock_entry(entry_ref);
-
- /*
- * Flush IO statistics now. pgstat_report_stat() will flush IO stats,
- * however this will not be called until after an entire autovacuum cycle
- * is done -- which will likely vacuum many relations -- or until the
- * VACUUM command has processed all tables and committed.
- */
- pgstat_flush_io(false);
- (void) pgstat_flush_backend(false, PGSTAT_BACKEND_FLUSH_IO);
}
/*
@@ -360,10 +351,6 @@ pgstat_report_analyze(Relation rel,
}
pgstat_unlock_entry(entry_ref);
-
- /* see pgstat_report_vacuum() */
- pgstat_flush_io(false);
- (void) pgstat_flush_backend(false, PGSTAT_BACKEND_FLUSH_IO);
}
/*
--
2.34.1
v1-0003-Add-FLUSH_MIXED-support-and-implement-it-for-RELA.patch (text/x-diff)
From af71e6472727b4e18ca369e21e2b4667d4cd172b Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Date: Thu, 8 Jan 2026 09:17:38 +0000
Subject: [PATCH v1 3/3] Add FLUSH_MIXED support and implement it for RELATION
stats
This commit extends the non transactional stats infrastructure to support statistics
kinds with mixed transaction behavior: some fields are transactional (e.g., tuple
inserts/updates/deletes) while others are non transactional (e.g., sequential scans
blocks read, ...).
It introduces FLUSH_MIXED as a third flush behavior type, alongside FLUSH_ANYTIME
and FLUSH_AT_TXN_BOUNDARY. For FLUSH_MIXED kinds, a new flush_anytime_cb callback
enables partial flushing of only the non transactional fields during running
transactions.
Some tests are also added.
Implementation details:
- Add FLUSH_MIXED to PgStat_FlushBehavior enum
- Add flush_anytime_cb to PgStat_KindInfo for partial flushing callback
- Update pgstat_flush_pending_entries() to call flush_anytime_cb for
FLUSH_MIXED entries when in anytime_only mode
- Keep FLUSH_MIXED entries in the pending list after partial flush, as
transactional fields still need to be flushed at transaction boundary
RELATION stats are making use of FLUSH_MIXED:
- Change RELATION from TXN_ALL to FLUSH_MIXED
- Implement pgstat_relation_flush_anytime_cb() to flush only read related
stats: numscans, tuples_returned, tuples_fetched, blocks_fetched,
blocks_hit
- Clear these fields after flushing to prevent double counting when
pgstat_relation_flush_cb() runs at transaction commit
- Transactional stats (tuples_inserted, tuples_updated, tuples_deleted,
live_tuples, dead_tuples) remain pending until transaction boundary
Remark:
We could also imagine adding a new flush_anytime_static_cb() callback for
future FLUSH_MIXED fixed amount stats.
---
src/backend/utils/activity/pgstat.c | 36 ++++++---
src/backend/utils/activity/pgstat_relation.c | 82 ++++++++++++++++++++
src/include/utils/pgstat_internal.h | 8 ++
src/test/isolation/expected/stats.out | 40 ++++++++++
src/test/isolation/expected/stats_1.out | 40 ++++++++++
src/test/isolation/specs/stats.spec | 12 +++
6 files changed, 209 insertions(+), 9 deletions(-)
56.6% src/backend/utils/activity/
4.7% src/include/utils/
34.0% src/test/isolation/expected/
4.6% src/test/isolation/specs/
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index f7942e47475..191e0ceac88 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -307,7 +307,7 @@ static const PgStat_KindInfo pgstat_kind_builtin_infos[PGSTAT_KIND_BUILTIN_SIZE]
.fixed_amount = false,
.write_to_file = true,
- .flush_behavior = FLUSH_AT_TXN_BOUNDARY,
+ .flush_behavior = FLUSH_MIXED,
.shared_size = sizeof(PgStatShared_Relation),
.shared_data_off = offsetof(PgStatShared_Relation, stats),
@@ -315,6 +315,7 @@ static const PgStat_KindInfo pgstat_kind_builtin_infos[PGSTAT_KIND_BUILTIN_SIZE]
.pending_size = sizeof(PgStat_TableStatus),
.flush_pending_cb = pgstat_relation_flush_cb,
+ .flush_anytime_cb = pgstat_relation_flush_anytime_cb,
.delete_pending_cb = pgstat_relation_delete_pending_cb,
.reset_timestamp_cb = pgstat_relation_reset_timestamp_cb,
},
@@ -1347,10 +1348,11 @@ pgstat_delete_pending_entry(PgStat_EntryRef *entry_ref)
/*
* Flush out pending variable-numbered stats.
*
- * If anytime_only is true, only flushes FLUSH_ANYTIME entries.
+ * If anytime_only is true, only flushes FLUSH_ANYTIME and FLUSH_MIXED entries,
+ * using flush_anytime_cb for FLUSH_MIXED.
* This is safe to call inside transactions.
*
- * If anytime_only is false, flushes all entries.
+ * If anytime_only is false, flushes all entries using flush_pending_cb.
*/
static bool
pgstat_flush_pending_entries(bool nowait, bool anytime_only)
@@ -1378,6 +1380,7 @@ pgstat_flush_pending_entries(bool nowait, bool anytime_only)
PgStat_Kind kind = key.kind;
const PgStat_KindInfo *kind_info = pgstat_get_kind_info(kind);
bool did_flush;
+ bool is_partial_flush = false;
dlist_node *next;
Assert(!kind_info->fixed_amount);
@@ -1397,8 +1400,21 @@ pgstat_flush_pending_entries(bool nowait, bool anytime_only)
continue;
}
- /* flush the stats, if possible */
- did_flush = kind_info->flush_pending_cb(entry_ref, nowait);
+ /* flush the stats (with the appropriate callback), if possible */
+ if (anytime_only &&
+ kind_info->flush_behavior == FLUSH_MIXED &&
+ kind_info->flush_anytime_cb != NULL)
+ {
+ /* Partial flush of non-transactional fields only */
+ did_flush = kind_info->flush_anytime_cb(entry_ref, nowait);
+ is_partial_flush = true;
+ }
+ else
+ {
+ /* Full flush */
+ did_flush = kind_info->flush_pending_cb(entry_ref, nowait);
+ is_partial_flush = false;
+ }
Assert(did_flush || nowait);
@@ -1408,8 +1424,8 @@ pgstat_flush_pending_entries(bool nowait, bool anytime_only)
else
next = NULL;
- /* if successfully flushed, remove entry */
- if (did_flush)
+ /* if successfull non partial flush, remove entry */
+ if (did_flush && !is_partial_flush)
pgstat_delete_pending_entry(entry_ref);
else
have_pending = true;
@@ -1418,8 +1434,10 @@ pgstat_flush_pending_entries(bool nowait, bool anytime_only)
}
/*
- * When in anytime_only mode, the list may not be empty because
- * FLUSH_AT_TXN_BOUNDARY entries were skipped.
+ * When in anytime_only mode, the list may not be empty even after
+ * successful flushes because FLUSH_AT_TXN_BOUNDARY entries were skipped
+ * or FLUSH_MIXED entries had partial flushes and remain for transaction
+ * boundary.
*/
Assert(!anytime_only || dlist_is_empty(&pgStatPending) == !have_pending);
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index feae2ae5f44..6d6f333039e 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -887,6 +887,88 @@ pgstat_relation_flush_cb(PgStat_EntryRef *entry_ref, bool nowait)
return true;
}
+/*
+ * Flush only non-transactional relation stats.
+ *
+ * This is called periodically during running transactions to make some
+ * statistics visible without waiting for the transaction to finish.
+ *
+ * Transactional stats (inserts/updates/deletes and their effects on live/dead
+ * tuple counts) remain in pending until the transaction ends, at which point
+ * pgstat_relation_flush_cb() will flush them.
+ *
+ * If nowait is true and the lock could not be immediately acquired, returns
+ * false without flushing the entry. Otherwise returns true.
+ */
+bool
+pgstat_relation_flush_anytime_cb(PgStat_EntryRef *entry_ref, bool nowait)
+{
+ Oid dboid;
+ PgStat_TableStatus *lstats; /* pending stats entry */
+ PgStatShared_Relation *shtabstats;
+ PgStat_StatTabEntry *tabentry; /* table entry of shared stats */
+ PgStat_StatDBEntry *dbentry; /* pending database entry */
+ bool has_nontxn_stats = false;
+
+ dboid = entry_ref->shared_entry->key.dboid;
+ lstats = (PgStat_TableStatus *) entry_ref->pending;
+ shtabstats = (PgStatShared_Relation *) entry_ref->shared_stats;
+
+ /*
+ * Check if there are any non-transactional stats to flush. Avoid
+ * unnecessarily locking the entry if nothing accumulated.
+ */
+ if (lstats->counts.numscans > 0 ||
+ lstats->counts.tuples_returned > 0 ||
+ lstats->counts.tuples_fetched > 0 ||
+ lstats->counts.blocks_fetched > 0 ||
+ lstats->counts.blocks_hit > 0)
+ has_nontxn_stats = true;
+
+ if (!has_nontxn_stats)
+ return true;
+
+ if (!pgstat_lock_entry(entry_ref, nowait))
+ return false;
+
+ /* Add only the non-transactional values to the shared entry */
+ tabentry = &shtabstats->stats;
+
+ tabentry->numscans += lstats->counts.numscans;
+ if (lstats->counts.numscans)
+ {
+ TimestampTz t = GetCurrentTimestamp();
+
+ if (t > tabentry->lastscan)
+ tabentry->lastscan = t;
+ }
+ tabentry->tuples_returned += lstats->counts.tuples_returned;
+ tabentry->tuples_fetched += lstats->counts.tuples_fetched;
+ tabentry->blocks_fetched += lstats->counts.blocks_fetched;
+ tabentry->blocks_hit += lstats->counts.blocks_hit;
+
+ pgstat_unlock_entry(entry_ref);
+
+ /* Also update the corresponding fields in database stats */
+ dbentry = pgstat_prep_database_pending(dboid);
+ dbentry->tuples_returned += lstats->counts.tuples_returned;
+ dbentry->tuples_fetched += lstats->counts.tuples_fetched;
+ dbentry->blocks_fetched += lstats->counts.blocks_fetched;
+ dbentry->blocks_hit += lstats->counts.blocks_hit;
+
+ /*
+ * Clear the flushed fields from pending stats to prevent double-counting
+ * when pgstat_relation_flush_cb() runs at transaction boundary.
+ */
+ lstats->counts.numscans = 0;
+ lstats->counts.tuples_returned = 0;
+ lstats->counts.tuples_fetched = 0;
+ lstats->counts.blocks_fetched = 0;
+ lstats->counts.blocks_hit = 0;
+
+ return true;
+}
+
void
pgstat_relation_delete_pending_cb(PgStat_EntryRef *entry_ref)
{
diff --git a/src/include/utils/pgstat_internal.h b/src/include/utils/pgstat_internal.h
index 02f4f13fc0f..85d92f4c945 100644
--- a/src/include/utils/pgstat_internal.h
+++ b/src/include/utils/pgstat_internal.h
@@ -231,6 +231,7 @@ typedef enum PgStat_FlushBehavior
{
FLUSH_ANYTIME, /* All fields can flush anytime */
FLUSH_AT_TXN_BOUNDARY, /* All fields need transaction boundary */
+ FLUSH_MIXED, /* MIXED so needs callbacks */
} PgStat_FlushBehavior;
/*
@@ -262,6 +263,12 @@ typedef struct PgStat_KindInfo
/* Flush behavior */
PgStat_FlushBehavior flush_behavior;
+ /*
+ * For PGSTAT_FLUSH_MIXED kinds: callback to flush only some fields. If
+ * NULL for a MIXED kind, treated as PGSTAT_FLUSH_AT_TXN_BOUNDARY.
+ */
+ bool (*flush_anytime_cb) (PgStat_EntryRef *entry_ref, bool nowait);
+
/*
* The size of an entry in the shared stats hash table (pointed to by
* PgStatShared_HashEntry->body). For fixed-numbered statistics, this is
@@ -774,6 +781,7 @@ extern void AtPrepare_PgStat_Relations(PgStat_SubXactStatus *xact_state);
extern void PostPrepare_PgStat_Relations(PgStat_SubXactStatus *xact_state);
extern bool pgstat_relation_flush_cb(PgStat_EntryRef *entry_ref, bool nowait);
+extern bool pgstat_relation_flush_anytime_cb(PgStat_EntryRef *entry_ref, bool nowait);
extern void pgstat_relation_delete_pending_cb(PgStat_EntryRef *entry_ref);
extern void pgstat_relation_reset_timestamp_cb(PgStatShared_Common *header, TimestampTz ts);
diff --git a/src/test/isolation/expected/stats.out b/src/test/isolation/expected/stats.out
index cfad309ccf3..6d62b30e4a7 100644
--- a/src/test/isolation/expected/stats.out
+++ b/src/test/isolation/expected/stats.out
@@ -2245,6 +2245,46 @@ seq_scan|seq_tup_read|n_tup_ins|n_tup_upd|n_tup_del|n_live_tup|n_dead_tup|vacuum
(1 row)
+starting permutation: s2_begin s2_table_select s1_sleep s1_table_stats s2_table_drop s2_commit
+pg_stat_force_next_flush
+------------------------
+
+(1 row)
+
+step s2_begin: BEGIN;
+step s2_table_select: SELECT * FROM test_stat_tab ORDER BY key, value;
+key|value
+---+-----
+k0 | 1
+(1 row)
+
+step s1_sleep: SELECT pg_sleep(1.5);
+pg_sleep
+--------
+
+(1 row)
+
+step s1_table_stats:
+ SELECT
+ pg_stat_get_numscans(tso.oid) AS seq_scan,
+ pg_stat_get_tuples_returned(tso.oid) AS seq_tup_read,
+ pg_stat_get_tuples_inserted(tso.oid) AS n_tup_ins,
+ pg_stat_get_tuples_updated(tso.oid) AS n_tup_upd,
+ pg_stat_get_tuples_deleted(tso.oid) AS n_tup_del,
+ pg_stat_get_live_tuples(tso.oid) AS n_live_tup,
+ pg_stat_get_dead_tuples(tso.oid) AS n_dead_tup,
+ pg_stat_get_vacuum_count(tso.oid) AS vacuum_count
+ FROM test_stat_oid AS tso
+ WHERE tso.name = 'test_stat_tab'
+
+seq_scan|seq_tup_read|n_tup_ins|n_tup_upd|n_tup_del|n_live_tup|n_dead_tup|vacuum_count
+--------+------------+---------+---------+---------+----------+----------+------------
+ 1| 1| 1| 0| 0| 1| 0| 0
+(1 row)
+
+step s2_table_drop: DROP TABLE test_stat_tab;
+step s2_commit: COMMIT;
+
starting permutation: s1_track_counts_off s1_table_stats s1_track_counts_on
pg_stat_force_next_flush
------------------------
diff --git a/src/test/isolation/expected/stats_1.out b/src/test/isolation/expected/stats_1.out
index e1d937784cb..2fade10e817 100644
--- a/src/test/isolation/expected/stats_1.out
+++ b/src/test/isolation/expected/stats_1.out
@@ -2253,6 +2253,46 @@ seq_scan|seq_tup_read|n_tup_ins|n_tup_upd|n_tup_del|n_live_tup|n_dead_tup|vacuum
(1 row)
+starting permutation: s2_begin s2_table_select s1_sleep s1_table_stats s2_table_drop s2_commit
+pg_stat_force_next_flush
+------------------------
+
+(1 row)
+
+step s2_begin: BEGIN;
+step s2_table_select: SELECT * FROM test_stat_tab ORDER BY key, value;
+key|value
+---+-----
+k0 | 1
+(1 row)
+
+step s1_sleep: SELECT pg_sleep(1.5);
+pg_sleep
+--------
+
+(1 row)
+
+step s1_table_stats:
+ SELECT
+ pg_stat_get_numscans(tso.oid) AS seq_scan,
+ pg_stat_get_tuples_returned(tso.oid) AS seq_tup_read,
+ pg_stat_get_tuples_inserted(tso.oid) AS n_tup_ins,
+ pg_stat_get_tuples_updated(tso.oid) AS n_tup_upd,
+ pg_stat_get_tuples_deleted(tso.oid) AS n_tup_del,
+ pg_stat_get_live_tuples(tso.oid) AS n_live_tup,
+ pg_stat_get_dead_tuples(tso.oid) AS n_dead_tup,
+ pg_stat_get_vacuum_count(tso.oid) AS vacuum_count
+ FROM test_stat_oid AS tso
+ WHERE tso.name = 'test_stat_tab'
+
+seq_scan|seq_tup_read|n_tup_ins|n_tup_upd|n_tup_del|n_live_tup|n_dead_tup|vacuum_count
+--------+------------+---------+---------+---------+----------+----------+------------
+ 0| 0| 1| 0| 0| 1| 0| 0
+(1 row)
+
+step s2_table_drop: DROP TABLE test_stat_tab;
+step s2_commit: COMMIT;
+
starting permutation: s1_track_counts_off s1_table_stats s1_track_counts_on
pg_stat_force_next_flush
------------------------
diff --git a/src/test/isolation/specs/stats.spec b/src/test/isolation/specs/stats.spec
index da16710da0f..1b0168e6176 100644
--- a/src/test/isolation/specs/stats.spec
+++ b/src/test/isolation/specs/stats.spec
@@ -50,6 +50,8 @@ step s1_rollback { ROLLBACK; }
step s1_prepare_a { PREPARE TRANSACTION 'a'; }
step s1_commit_prepared_a { COMMIT PREPARED 'a'; }
step s1_rollback_prepared_a { ROLLBACK PREPARED 'a'; }
+# Has to be greater than PGSTAT_ANYTIME_FLUSH_INTERVAL
+step s1_sleep { SELECT pg_sleep(1.5); }
# Function stats steps
step s1_ff { SELECT pg_stat_force_next_flush(); }
@@ -138,6 +140,7 @@ step s2_commit { COMMIT; }
step s2_commit_prepared_a { COMMIT PREPARED 'a'; }
step s2_rollback_prepared_a { ROLLBACK PREPARED 'a'; }
step s2_ff { SELECT pg_stat_force_next_flush(); }
+step s2_table_drop { DROP TABLE test_stat_tab; }
# Function stats steps
step s2_track_funcs_all { SET track_functions = 'all'; }
@@ -435,6 +438,15 @@ permutation
s1_table_drop
s1_table_stats
+### Check that some stats are updated (seq_scan and seq_tup_read)
+### while the transaction is still running
+permutation
+ s2_begin
+ s2_table_select
+ s1_sleep
+ s1_table_stats
+ s2_table_drop
+ s2_commit
### Check that we don't count changes with track counts off, but allow access
### to prior stats
--
2.34.1
Hi,
Thanks for these patches!
I took a quick look at the patches and I have some general comments.
Long running transactions can accumulate significant statistics (WAL, IO, ...)
that remain unflushed until the transaction ends. This delays visibility of
resource usage in monitoring views like pg_stat_io and pg_stat_wal.
+1. I do think this is a good idea. Long-running transactions cause accumulated
stats to appear as spikes in monitoring tools rather than as gradual activity.
This would help level out, though not eliminate, those artificial spikes.
The 1 second flush interval is currently hardcoded but we could imagine increase
it or make it configurable.
Someone may want to turn this off as well. I think a GUC will be needed.
RELATION stats are making use of FLUSH_MIXED:
stats: numscans, tuples_returned, tuples_fetched, blocks_fetched,
blocks_hit
I’m concerned that fields being temporarily out of sync might impact monitoring
calculations, if the formula is dealing with fields that have
different flush strategies.
That said, minor discrepancies are usually tolerable for monitoring
data analysis.
For the numscans, should we not also update the scan timestamp?
--
Sami Imseih
Amazon Web Services (AWS)
Hi,
On Wed, Jan 14, 2026 at 09:54:17PM -0600, Sami Imseih wrote:
I took a quick look at the patches and I have some general comments.
Thanks!
Long running transactions can accumulate significant statistics (WAL, IO, ...)
that remain unflushed until the transaction ends. This delays visibility of
resource usage in monitoring views like pg_stat_io and pg_stat_wal.

+1. I do think this is a good idea. Long-running transactions cause accumulated
stats to appear as spikes in monitoring tools rather than as gradual activity.
This would help level out, though not eliminate, those artificial spikes.
Yeah.
The 1 second flush interval is currently hardcoded but we could imagine increase
it or make it configurable.

Someone may want to turn this off as well. I think a GUC will be needed.
I gave this more thought and I wonder if this should be configurable at all.
I mean, we don't do it for PGSTAT_MIN_INTERVAL, PGSTAT_MAX_INTERVAL and
PGSTAT_IDLE_INTERVAL. We could imagine making it configurable if it produced a
noticeable performance impact, but that's not what I observed.
RELATION stats are making use of FLUSH_MIXED:
stats: numscans, tuples_returned, tuples_fetched, blocks_fetched,
blocks_hit

I’m concerned that fields being temporarily out of sync might impact monitoring
calculations, if the formula is dealing with fields that have
different flush strategies.
That's a good point. Maybe we should document the fields flush strategy?
That said, minor discrepancies are usually tolerable for monitoring
data analysis.

For the numscans, should we not also update the scan timestamp?
The problem is that we could not call GetCurrentTransactionStopTimestamp(), so
we would need to call GetCurrentTimestamp() instead. I'm not sure that calling
GetCurrentTimestamp() every second would be a real issue though, and if it is
maybe we could increase this 1s value.
That said I agree that having seq_scan being updated and not last_seq_scan is not
that great.
Maybe we should keep this in mind and see what to do depending on where this thread
is going (I mean, if the current proposed design has to be changed).
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
The 1 second flush interval is currently hardcoded but we could imagine increase
it or make it configurable.

Someone may want to turn this off as well. I think a GUC will be needed.
I gave this more thoughts and I wonder if this should be configurable at all.
I mean, we don't do it for PGSTAT_MIN_INTERVAL, PGSTAT_MAX_INTERVAL and
PGSTAT_IDLE_INTERVAL. We could imagine make it configurable if it produces
noticeable performance impact but that's not what I observed.
Is there a reason we need a new constant (PGSTAT_ANYTIME_FLUSH_INTERVAL)
for anytime flushes and can't rely on the existing PGSTAT_MIN_INTERVAL?
Also, how did you benchmark? I am less concerned about long-running
transactions and background processes, and more about short, high-concurrency
transactions seeing additional overhead due to the additional flushing. Is the
latter a concern?
stats: numscans, tuples_returned, tuples_fetched, blocks_fetched,
blocks_hit

I’m concerned that fields being temporarily out of sync might impact monitoring
calculations, if the formula is dealing with fields that have
different flush strategies.

That's a good point. Maybe we should document the fields flush strategy?
Yeah, we will need to document this.
That said, minor discrepancies are usually tolerable for monitoring
data analysis.

For the numscans, should we not also update the scan timestamp?
The problem is that we could not call GetCurrentTransactionStopTimestamp(), so
we would need to call GetCurrentTimestamp() instead. I'm not sure that calling
GetCurrentTimestamp() every second would be a real issue though, and if it is
maybe we could increase this 1s value.
That said I agree that having seq_scan being updated and not last_seq_scan is not
that great.
with v3, I checked by running seq scans in a long-running transaction,
and I observed both of these values being updated at the same time. I think
this is OK.
# pgstat_relation_flush_anytime_cb
```
tabentry->numscans += lstats->counts.numscans;
if (lstats->counts.numscans)
{
TimestampTz t = GetCurrentTimestamp();
if (t > tabentry->lastscan)
tabentry->lastscan = t;
}
```
and
# pgstat_relation_flush_cb
```
if (lstats->counts.numscans)
{
TimestampTz t = GetCurrentTransactionStopTimestamp();
if (t > tabentry->lastscan)
tabentry->lastscan = t;
}
```
--
Sami Imseih
Amazon Web Services (AWS)
with v3 , I checked by running seq scans in a long running transaction,
Sorry I mean 0003
--
Sami Imseih
Amazon Web Services (AWS)
Hi,
On Thu, Jan 15, 2026 at 11:25:18AM -0600, Sami Imseih wrote:
The 1 second flush interval is currently hardcoded but we could imagine increase
it or make it configurable.

Someone may want to turn this off as well. I think a GUC will be needed.
I gave this more thoughts and I wonder if this should be configurable at all.
I mean, we don't do it for PGSTAT_MIN_INTERVAL, PGSTAT_MAX_INTERVAL and
PGSTAT_IDLE_INTERVAL. We could imagine make it configurable if it produces
noticeable performance impact but that's not what I observed.

Is there a reason we need a new constant (PGSTAT_ANYTIME_FLUSH_INTERVAL)
for anytime flushes and can't rely on the existing PGSTAT_MIN_INTERVAL?
It currently gives flexibility for testing. If we agree that 1s is the right value
and that it should not be configurable, then yeah, we could replace it with
PGSTAT_MIN_INTERVAL.
Also, How did you benchmark? I am less concerned about long running
transactions,
background processes and more about short/high concurrency transactions seeing
additional overhead due to additional flushing. Is that latter a concern?
I ran 3 kinds of tests:
1/
pgbench -c 32 -j 4 -T 60 -f short.sql -n -r $DB
with short.sql:
\set t1 random(1, 100)
\set t2 random(1, 100)
\set t3 random(1, 100)
\set t4 random(1, 100)
\set t5 random(1, 100)
\set t6 random(1, 100)
\set t7 random(1, 100)
\set t8 random(1, 100)
\set t9 random(1, 100)
\set t10 random(1, 100)
\set row random(1, 1000)
BEGIN;
UPDATE t:t1 SET val = val + 1 WHERE id = :row;
UPDATE t:t2 SET val = val + 1 WHERE id = :row;
UPDATE t:t3 SET val = val + 1 WHERE id = :row;
UPDATE t:t4 SET val = val + 1 WHERE id = :row;
UPDATE t:t5 SET val = val + 1 WHERE id = :row;
UPDATE t:t6 SET val = val + 1 WHERE id = :row;
UPDATE t:t7 SET val = val + 1 WHERE id = :row;
UPDATE t:t8 SET val = val + 1 WHERE id = :row;
UPDATE t:t9 SET val = val + 1 WHERE id = :row;
UPDATE t:t10 SET val = val + 1 WHERE id = :row;
COMMIT;
2/
psql $DB -f long.sql
with long.sql:
DO $$
BEGIN
FOR i IN 1..100 LOOP
EXECUTE format('TRUNCATE TABLE t%s', i);
EXECUTE format('INSERT INTO t%s SELECT generate_series(1, 1000000)', i);
EXECUTE format('UPDATE t%s SET val = val + 1', i);
EXECUTE format('SELECT COUNT(1) FROM t%s', i);
END LOOP;
END $$;
3/
pgbench -i -s 50 $DB
pgbench -c 32 -j 4 -T 60 -N -n -r $DB
I don't think this feature could add a noticeable performance impact, so the tests
have been that simple. Do you think we should worry more?
I’m concerned that fields being temporarily out of sync might impact monitoring
calculations, if the formula is dealing with fields that have
different flush strategies.That's a good point. Maybe we should document the fields flush strategy?
Yeah, we will need to document this.
Will do in the next version.
I checked by running seq scans in a long running transaction,
and I observed both for these values being updated at the same time. I think
this is OK.
I do think the same.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
I took a look at 0001 in depth.
I don't think this feature could add a noticeable performance impact, so the tests
have been that simple. Do you think we should worry more?
One observation is that there's no coordination between ANYTIME and
TXN_BOUNDARY flushes. While PGSTAT_MIN_INTERVAL prevents a backend from
flushing more than once per second, a backend can still perform both an
ANYTIME flush and a TXN_BOUNDARY flush within the same 1-second window.
Not saying this will be a real problem in the real world, but we definitely
took measures in the current implementation to avoid this scenario.
A few other comments on 0001
+ /* Skip if completely idle */
+ if (!DoingCommandRead || IsTransactionOrTransactionBlock())
+ pgstat_report_anytime_stat(false);
Does this need to be conditional? Worst case, we return right away with an empty
list. Best case, we are consistently flushing.
+ /*
+ * When in anytime_only mode, the list may not be empty because
+ * FLUSH_AT_TXN_BOUNDARY entries were skipped.
+ */
+ Assert(!anytime_only || dlist_is_empty(&pgStatPending) ==
!have_pending);
Checking for !anytime_only is unnecessary here.
"list_is_empty(&pgStatPending) == !have_pending"
should be true regardless of ANYTIME or TXN_BOUNDARY, right?
Below are a couple of edits for comments I felt would improve
readability of the code.
1/
/*
- * Flush non-transactional stats
- *
- * This is safe to call even inside a transaction. It only flushes stats
- * kinds marked as FLUSH_ANYTIME.
- *
- * This allows long running transactions to report activity without waiting
- * for transaction to finish.
+ * Flushes only FLUSH_ANYTIME stats using non-blocking locks. Transactional
+ * stats (FLUSH_AT_TXN_BOUNDARY) remain pending until transaction boundary.
+ * Safe to call inside transactions.
*/
2/
 typedef enum PgStat_FlushBehavior
 {
-	FLUSH_ANYTIME,			/* All fields can flush anytime */
-	FLUSH_AT_TXN_BOUNDARY,	/* All fields need transaction boundary */
+	FLUSH_ANYTIME,			/* All fields can be flushed anytime,
+							 * including within transactions */
+	FLUSH_AT_TXN_BOUNDARY,	/* All fields can only be flushed at
+							 * transaction boundary */
 } PgStat_FlushBehavior;
I will start looking at the remaining patches next.
--
Sami Imseih
Amazon Web Services (AWS)
Hi,
On Fri, Jan 16, 2026 at 10:44:48AM -0600, Sami Imseih wrote:
I took a look at 0001 in depth.
Thanks!
I don't think this feature could add a noticeable performance impact, so the tests
have been that simple. Do you think we should worry more?

One observation is there's no coordination between ANYTIME and
TXN_BOUNDARY flushes. While PGSTAT_MIN_INTERVAL
prevents a backend from flushing more than once per second, a backend can
still perform both an ANYTIME flush and a TXN_BOUNDARY flush within
the same 1-second window. Not saying this will be a real problem in
the real-world,
but we definitely took measures in the current implementation to avoid
this scenario.
Right. I think that the PGSTAT_MIN_INTERVAL throttling was put in place to prevent
flushing too frequently when the backend has a high commit rate. But here, while
it's true that we don't follow that rule (meaning a backend could flush more than
once per second), that would be at most twice (given that ANYTIME flushes every
second). So, I'm not sure that this single extra flush is worth worrying about.
Plus, coordinating the two would certainly need an extra GetCurrentTimestamp()
call, so I'm not sure it's worth it.
A few other comments on 0001
+ /* Skip if completely idle */
+ if (!DoingCommandRead || IsTransactionOrTransactionBlock())
+ pgstat_report_anytime_stat(false);

Does this need to be conditional? worst case, we return right away with an empty
list. Best case, is we are consistently flushing.
Yeah, I think we could remove this check and just rely on the ones in
pgstat_report_anytime_stat(). Done in the attached.
+ Assert(!anytime_only || dlist_is_empty(&pgStatPending) ==
!have_pending);

Checking for !anytime_only is unnecessary here.
"list_is_empty(&pgStatPending) == !have_pending"
should be true regardless of ANYTIME or TXN_BOUNDARY, right?
Right, thanks for catching it, it was remaining garbage from my dev iterations.
Below are a couple of edits for comments I felt would improve
readability of the code.
Done as suggested.
I will start looking at the remaining patches next.
Thanks!
Note that I also updated the doc in 0003 for the stats that have mixed fields.
BTW, I think that we could also mark the Function stats kind as flush-anytime,
thoughts?
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v2-0001-Add-pgstat_report_anytime_stat-for-periodic-stats.patch (text/x-diff)
From 605cae0291397047b09aa025b742bdcaf9bdd528 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Date: Mon, 5 Jan 2026 09:41:39 +0000
Subject: [PATCH v2 1/3] Add pgstat_report_anytime_stat() for periodic stats
flushing
Long running transactions can accumulate significant statistics (WAL, IO, ...)
that remain unflushed until the transaction ends. This delays visibility of
resource usage in monitoring views like pg_stat_io and pg_stat_wal.
This commit introduces pgstat_report_anytime_stat(), which flushes
non transactional statistics even inside active transactions. A new timeout
handler fires every second to call this function, ensuring timely stats visibility
without waiting for transaction completion.
Implementation details:
- Add PgStat_FlushBehavior enum to classify stats kinds:
* FLUSH_ANYTIME: Stats that can always be flushed (WAL, IO, ...)
* FLUSH_AT_TXN_BOUNDARY: Stats requiring transaction boundaries
- Modify pgstat_flush_pending_entries() and pgstat_flush_fixed_stats()
to accept a boolean anytime_only parameter:
* When false: flushes all stats (existing behavior)
* When true: flushes only FLUSH_ANYTIME stats and skips FLUSH_AT_TXN_BOUNDARY stats
- Register ANYTIME_STATS_UPDATE_TIMEOUT that fires every 1 second, calling
pgstat_report_anytime_stat(false)
The force parameter in pgstat_report_anytime_stat() is currently unused (always
called with force=false) but reserved for future use cases requiring immediate
flushing.
---
src/backend/tcop/postgres.c | 16 ++++
src/backend/utils/activity/pgstat.c | 113 ++++++++++++++++++++++++----
src/backend/utils/init/globals.c | 1 +
src/backend/utils/init/postinit.c | 15 ++++
src/include/miscadmin.h | 1 +
src/include/pgstat.h | 4 +
src/include/utils/pgstat_internal.h | 13 ++++
src/include/utils/timeout.h | 1 +
src/tools/pgindent/typedefs.list | 1 +
9 files changed, 149 insertions(+), 16 deletions(-)
8.2% src/backend/tcop/
69.3% src/backend/utils/activity/
9.7% src/backend/utils/init/
7.7% src/include/utils/
4.5% src/include/
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index e54bf1e760f..9c4a9078ee0 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3530,6 +3530,22 @@ ProcessInterrupts(void)
pgstat_report_stat(true);
}
+ /*
+ * Flush stats outside of transaction boundary if the timeout fired.
+ * Unlike transactional stats, these can be flushed even inside a running
+ * transaction.
+ */
+ if (AnytimeStatsUpdateTimeoutPending)
+ {
+ AnytimeStatsUpdateTimeoutPending = false;
+
+ pgstat_report_anytime_stat(false);
+
+ /* Schedule next timeout */
+ enable_timeout_after(ANYTIME_STATS_UPDATE_TIMEOUT,
+ PGSTAT_ANYTIME_FLUSH_INTERVAL);
+ }
+
if (ProcSignalBarrierPending)
ProcessProcSignalBarrier();
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 11bb71cad5a..0f45a7d165e 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -187,7 +187,8 @@ static void pgstat_init_snapshot_fixed(void);
static void pgstat_reset_after_failure(void);
-static bool pgstat_flush_pending_entries(bool nowait);
+static bool pgstat_flush_pending_entries(bool nowait, bool anytime_only);
+static bool pgstat_flush_fixed_stats(bool nowait, bool anytime_only);
static void pgstat_prep_snapshot(void);
static void pgstat_build_snapshot(void);
@@ -288,6 +289,7 @@ static const PgStat_KindInfo pgstat_kind_builtin_infos[PGSTAT_KIND_BUILTIN_SIZE]
.fixed_amount = false,
.write_to_file = true,
+ .flush_behavior = FLUSH_AT_TXN_BOUNDARY,
/* so pg_stat_database entries can be seen in all databases */
.accessed_across_databases = true,
@@ -305,6 +307,7 @@ static const PgStat_KindInfo pgstat_kind_builtin_infos[PGSTAT_KIND_BUILTIN_SIZE]
.fixed_amount = false,
.write_to_file = true,
+ .flush_behavior = FLUSH_AT_TXN_BOUNDARY,
.shared_size = sizeof(PgStatShared_Relation),
.shared_data_off = offsetof(PgStatShared_Relation, stats),
@@ -321,6 +324,7 @@ static const PgStat_KindInfo pgstat_kind_builtin_infos[PGSTAT_KIND_BUILTIN_SIZE]
.fixed_amount = false,
.write_to_file = true,
+ .flush_behavior = FLUSH_AT_TXN_BOUNDARY,
.shared_size = sizeof(PgStatShared_Function),
.shared_data_off = offsetof(PgStatShared_Function, stats),
@@ -336,6 +340,7 @@ static const PgStat_KindInfo pgstat_kind_builtin_infos[PGSTAT_KIND_BUILTIN_SIZE]
.fixed_amount = false,
.write_to_file = true,
+ .flush_behavior = FLUSH_AT_TXN_BOUNDARY,
.accessed_across_databases = true,
@@ -353,6 +358,7 @@ static const PgStat_KindInfo pgstat_kind_builtin_infos[PGSTAT_KIND_BUILTIN_SIZE]
.fixed_amount = false,
.write_to_file = true,
+ .flush_behavior = FLUSH_AT_TXN_BOUNDARY,
/* so pg_stat_subscription_stats entries can be seen in all databases */
.accessed_across_databases = true,
@@ -370,6 +376,7 @@ static const PgStat_KindInfo pgstat_kind_builtin_infos[PGSTAT_KIND_BUILTIN_SIZE]
.fixed_amount = false,
.write_to_file = false,
+ .flush_behavior = FLUSH_ANYTIME,
.accessed_across_databases = true,
@@ -388,6 +395,7 @@ static const PgStat_KindInfo pgstat_kind_builtin_infos[PGSTAT_KIND_BUILTIN_SIZE]
.fixed_amount = true,
.write_to_file = true,
+ .flush_behavior = FLUSH_ANYTIME,
.snapshot_ctl_off = offsetof(PgStat_Snapshot, archiver),
.shared_ctl_off = offsetof(PgStat_ShmemControl, archiver),
@@ -404,6 +412,7 @@ static const PgStat_KindInfo pgstat_kind_builtin_infos[PGSTAT_KIND_BUILTIN_SIZE]
.fixed_amount = true,
.write_to_file = true,
+ .flush_behavior = FLUSH_ANYTIME,
.snapshot_ctl_off = offsetof(PgStat_Snapshot, bgwriter),
.shared_ctl_off = offsetof(PgStat_ShmemControl, bgwriter),
@@ -420,6 +429,7 @@ static const PgStat_KindInfo pgstat_kind_builtin_infos[PGSTAT_KIND_BUILTIN_SIZE]
.fixed_amount = true,
.write_to_file = true,
+ .flush_behavior = FLUSH_ANYTIME,
.snapshot_ctl_off = offsetof(PgStat_Snapshot, checkpointer),
.shared_ctl_off = offsetof(PgStat_ShmemControl, checkpointer),
@@ -436,6 +446,7 @@ static const PgStat_KindInfo pgstat_kind_builtin_infos[PGSTAT_KIND_BUILTIN_SIZE]
.fixed_amount = true,
.write_to_file = true,
+ .flush_behavior = FLUSH_ANYTIME,
.snapshot_ctl_off = offsetof(PgStat_Snapshot, io),
.shared_ctl_off = offsetof(PgStat_ShmemControl, io),
@@ -453,6 +464,7 @@ static const PgStat_KindInfo pgstat_kind_builtin_infos[PGSTAT_KIND_BUILTIN_SIZE]
.fixed_amount = true,
.write_to_file = true,
+ .flush_behavior = FLUSH_ANYTIME,
.snapshot_ctl_off = offsetof(PgStat_Snapshot, slru),
.shared_ctl_off = offsetof(PgStat_ShmemControl, slru),
@@ -470,6 +482,7 @@ static const PgStat_KindInfo pgstat_kind_builtin_infos[PGSTAT_KIND_BUILTIN_SIZE]
.fixed_amount = true,
.write_to_file = true,
+ .flush_behavior = FLUSH_ANYTIME,
.snapshot_ctl_off = offsetof(PgStat_Snapshot, wal),
.shared_ctl_off = offsetof(PgStat_ShmemControl, wal),
@@ -775,23 +788,11 @@ pgstat_report_stat(bool force)
partial_flush = false;
/* flush of variable-numbered stats tracked in pending entries list */
- partial_flush |= pgstat_flush_pending_entries(nowait);
+ partial_flush |= pgstat_flush_pending_entries(nowait, false);
/* flush of other stats kinds */
if (pgstat_report_fixed)
- {
- for (PgStat_Kind kind = PGSTAT_KIND_MIN; kind <= PGSTAT_KIND_MAX; kind++)
- {
- const PgStat_KindInfo *kind_info = pgstat_get_kind_info(kind);
-
- if (!kind_info)
- continue;
- if (!kind_info->flush_static_cb)
- continue;
-
- partial_flush |= kind_info->flush_static_cb(nowait);
- }
- }
+ partial_flush |= pgstat_flush_fixed_stats(nowait, false);
last_flush = now;
@@ -1345,9 +1346,14 @@ pgstat_delete_pending_entry(PgStat_EntryRef *entry_ref)
/*
* Flush out pending variable-numbered stats.
+ *
+ * If anytime_only is true, only flushes FLUSH_ANYTIME entries.
+ * This is safe to call inside transactions.
+ *
+ * If anytime_only is false, flushes all entries.
*/
static bool
-pgstat_flush_pending_entries(bool nowait)
+pgstat_flush_pending_entries(bool nowait, bool anytime_only)
{
bool have_pending = false;
dlist_node *cur = NULL;
@@ -1377,6 +1383,20 @@ pgstat_flush_pending_entries(bool nowait)
Assert(!kind_info->fixed_amount);
Assert(kind_info->flush_pending_cb != NULL);
+ /* Skip transactional stats if we're in anytime_only mode */
+ if (anytime_only && kind_info->flush_behavior == FLUSH_AT_TXN_BOUNDARY)
+ {
+ have_pending = true;
+
+ if (dlist_has_next(&pgStatPending, cur))
+ next = dlist_next_node(&pgStatPending, cur);
+ else
+ next = NULL;
+
+ cur = next;
+ continue;
+ }
+
/* flush the stats, if possible */
did_flush = kind_info->flush_pending_cb(entry_ref, nowait);
@@ -1397,11 +1417,42 @@ pgstat_flush_pending_entries(bool nowait)
cur = next;
}
+ /*
+ * When in anytime_only mode, the list may not be empty because
+ * FLUSH_AT_TXN_BOUNDARY entries were skipped.
+ */
Assert(dlist_is_empty(&pgStatPending) == !have_pending);
return have_pending;
}
+/*
+ * Flush fixed-amount stats.
+ *
+ * If anytime_only is true, only flushes FLUSH_ANYTIME stats (safe inside transactions).
+ * If anytime_only is false, flushes all stats with flush_static_cb.
+ */
+static bool
+pgstat_flush_fixed_stats(bool nowait, bool anytime_only)
+{
+ bool partial_flush = false;
+
+ for (PgStat_Kind kind = PGSTAT_KIND_MIN; kind <= PGSTAT_KIND_MAX; kind++)
+ {
+ const PgStat_KindInfo *kind_info = pgstat_get_kind_info(kind);
+
+ if (!kind_info || !kind_info->flush_static_cb)
+ continue;
+
+ /* Skip transactional stats if we're in anytime_only mode */
+ if (anytime_only && kind_info->flush_behavior == FLUSH_AT_TXN_BOUNDARY)
+ continue;
+
+ partial_flush |= kind_info->flush_static_cb(nowait);
+ }
+
+ return partial_flush;
+}
/* ------------------------------------------------------------
* Helper / infrastructure functions
@@ -2119,3 +2170,33 @@ assign_stats_fetch_consistency(int newval, void *extra)
if (pgstat_fetch_consistency != newval)
force_stats_snapshot_clear = true;
}
+
+/*
+ * Flushes only FLUSH_ANYTIME stats using non-blocking locks. Transactional
+ * stats (FLUSH_AT_TXN_BOUNDARY) remain pending until transaction boundary.
+ * Safe to call inside transactions.
+ */
+void
+pgstat_report_anytime_stat(bool force)
+{
+ bool nowait = !force;
+
+ pgstat_assert_is_up();
+
+ /*
+ * Exit if no pending stats at all. This avoids unnecessary work when
+ * backends are idle or in sessions without stats accumulation.
+ *
+ * Note: This check isn't precise as there might be only transactional
+ * stats pending, which we'll skip during the flush. However, maintaining
+ * precise tracking would add complexity that does not seem worth it from
+ * a performance point of view (no noticeable performance regression has
+ * been observed with the current implementation).
+ */
+ if (dlist_is_empty(&pgStatPending) && !pgstat_report_fixed)
+ return;
+
+ /* Flush stats outside of transaction boundary */
+ pgstat_flush_pending_entries(nowait, true);
+ pgstat_flush_fixed_stats(nowait, true);
+}
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 36ad708b360..ad44826c39e 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -40,6 +40,7 @@ volatile sig_atomic_t IdleSessionTimeoutPending = false;
volatile sig_atomic_t ProcSignalBarrierPending = false;
volatile sig_atomic_t LogMemoryContextPending = false;
volatile sig_atomic_t IdleStatsUpdateTimeoutPending = false;
+volatile sig_atomic_t AnytimeStatsUpdateTimeoutPending = false;
volatile uint32 InterruptHoldoffCount = 0;
volatile uint32 QueryCancelHoldoffCount = 0;
volatile uint32 CritSectionCount = 0;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 3f401faf3de..cb0f6aecad1 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -82,6 +82,7 @@ static void TransactionTimeoutHandler(void);
static void IdleSessionTimeoutHandler(void);
static void IdleStatsUpdateTimeoutHandler(void);
static void ClientCheckTimeoutHandler(void);
+static void AnytimeStatsUpdateTimeoutHandler(void);
static bool ThereIsAtLeastOneRole(void);
static void process_startup_options(Port *port, bool am_superuser);
static void process_settings(Oid databaseid, Oid roleid);
@@ -765,6 +766,9 @@ InitPostgres(const char *in_dbname, Oid dboid,
RegisterTimeout(CLIENT_CONNECTION_CHECK_TIMEOUT, ClientCheckTimeoutHandler);
RegisterTimeout(IDLE_STATS_UPDATE_TIMEOUT,
IdleStatsUpdateTimeoutHandler);
+ RegisterTimeout(ANYTIME_STATS_UPDATE_TIMEOUT,
+ AnytimeStatsUpdateTimeoutHandler);
+ enable_timeout_after(ANYTIME_STATS_UPDATE_TIMEOUT, PGSTAT_ANYTIME_FLUSH_INTERVAL);
}
/*
@@ -1446,3 +1450,14 @@ ThereIsAtLeastOneRole(void)
return result;
}
+
+/*
+ * Timeout handler for flushing non-transactional stats.
+ */
+static void
+AnytimeStatsUpdateTimeoutHandler(void)
+{
+ AnytimeStatsUpdateTimeoutPending = true;
+ InterruptPending = true;
+ SetLatch(MyLatch);
+}
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index db559b39c4d..8aeb9628871 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -96,6 +96,7 @@ extern PGDLLIMPORT volatile sig_atomic_t IdleSessionTimeoutPending;
extern PGDLLIMPORT volatile sig_atomic_t ProcSignalBarrierPending;
extern PGDLLIMPORT volatile sig_atomic_t LogMemoryContextPending;
extern PGDLLIMPORT volatile sig_atomic_t IdleStatsUpdateTimeoutPending;
+extern PGDLLIMPORT volatile sig_atomic_t AnytimeStatsUpdateTimeoutPending;
extern PGDLLIMPORT volatile sig_atomic_t CheckClientConnectionPending;
extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost;
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index fff7ecc2533..86e65397614 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -35,6 +35,9 @@
/* Default directory to store temporary statistics data in */
#define PG_STAT_TMP_DIR "pg_stat_tmp"
+/* When to call pgstat_report_anytime_stat() again */
+#define PGSTAT_ANYTIME_FLUSH_INTERVAL 1000
+
/* Values for track_functions GUC variable --- order is significant! */
typedef enum TrackFunctionsLevel
{
@@ -533,6 +536,7 @@ extern void pgstat_initialize(void);
/* Functions called from backends */
extern long pgstat_report_stat(bool force);
+extern void pgstat_report_anytime_stat(bool force);
extern void pgstat_force_next_flush(void);
extern void pgstat_reset_counters(void);
diff --git a/src/include/utils/pgstat_internal.h b/src/include/utils/pgstat_internal.h
index 9b8fbae00ed..63feae640d1 100644
--- a/src/include/utils/pgstat_internal.h
+++ b/src/include/utils/pgstat_internal.h
@@ -224,6 +224,16 @@ typedef struct PgStat_SubXactStatus
PgStat_TableXactStatus *first; /* head of list for this subxact */
} PgStat_SubXactStatus;
+/*
+ * Flush behavior for statistics kinds.
+ */
+typedef enum PgStat_FlushBehavior
+{
+ FLUSH_ANYTIME, /* All fields can be flushed anytime,
+ * including within transactions */
+ FLUSH_AT_TXN_BOUNDARY, /* All fields can only be flushed at
+ * transaction boundary */
+} PgStat_FlushBehavior;
/*
* Metadata for a specific kind of statistics.
@@ -251,6 +261,9 @@ typedef struct PgStat_KindInfo
*/
bool track_entry_count:1;
+ /* Flush behavior */
+ PgStat_FlushBehavior flush_behavior;
+
/*
* The size of an entry in the shared stats hash table (pointed to by
* PgStatShared_HashEntry->body). For fixed-numbered statistics, this is
diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h
index 0965b590b34..10723bb664c 100644
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -35,6 +35,7 @@ typedef enum TimeoutId
IDLE_SESSION_TIMEOUT,
IDLE_STATS_UPDATE_TIMEOUT,
CLIENT_CONNECTION_CHECK_TIMEOUT,
+ ANYTIME_STATS_UPDATE_TIMEOUT,
STARTUP_PROGRESS_TIMEOUT,
/* First user-definable timeout reason */
USER_TIMEOUT,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 3f3a888fd0e..610b35a9b31 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2268,6 +2268,7 @@ PgStat_Counter
PgStat_EntryRef
PgStat_EntryRefHashEntry
PgStat_FetchConsistency
+PgStat_FlushBehavior
PgStat_FunctionCallUsage
PgStat_FunctionCounts
PgStat_HashKey
--
2.34.1
v2-0002-Remove-useless-calls-to-flush-some-stats.patch (text/x-diff)
From e012b211f86cba606375c8730f12a4d25dae93d4 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Date: Tue, 6 Jan 2026 11:06:31 +0000
Subject: [PATCH v2 2/3] Remove useless calls to flush some stats
Now that some stats can be flushed outside of transaction boundaries, remove
useless calls to report/flush some stats. Those calls were in place because
before commit <XXXX> stats were flushed only at transaction boundaries.
Note that:
- it reverts 039549d70f6 (it just keeps its tests)
- it can't be done for checkpointer and bgworker for example because they don't
have a flush callback to call
- it can't be done for auxiliary process (walsummarizer for example) because they
currently do not register the new timeout handler
---
src/backend/replication/walreceiver.c | 10 ------
src/backend/replication/walsender.c | 36 ++------------------
src/backend/utils/activity/pgstat_relation.c | 13 -------
3 files changed, 2 insertions(+), 57 deletions(-)
75.3% src/backend/replication/
24.6% src/backend/utils/activity/
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index a41453530a1..266379c780a 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -553,16 +553,6 @@ WalReceiverMain(const void *startup_data, size_t startup_data_len)
*/
bool requestReply = false;
- /*
- * Report pending statistics to the cumulative stats
- * system. This location is useful for the report as it
- * is not within a tight loop in the WAL receiver, to
- * avoid bloating pgstats with requests, while also making
- * sure that the reports happen each time a status update
- * is sent.
- */
- pgstat_report_wal(false);
-
/*
* Check if time since last receive from primary has
* reached the configured limit.
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 1ab09655a70..c33185bd337 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -94,14 +94,10 @@
#include "utils/lsyscache.h"
#include "utils/memutils.h"
#include "utils/pg_lsn.h"
-#include "utils/pgstat_internal.h"
#include "utils/ps_status.h"
#include "utils/timeout.h"
#include "utils/timestamp.h"
-/* Minimum interval used by walsender for stats flushes, in ms */
-#define WALSENDER_STATS_FLUSH_INTERVAL 1000
-
/*
* Maximum data payload in a WAL data message. Must be >= XLOG_BLCKSZ.
*
@@ -1826,7 +1822,6 @@ WalSndWaitForWal(XLogRecPtr loc)
int wakeEvents;
uint32 wait_event = 0;
static XLogRecPtr RecentFlushPtr = InvalidXLogRecPtr;
- TimestampTz last_flush = 0;
/*
* Fast path to avoid acquiring the spinlock in case we already know we
@@ -1847,7 +1842,6 @@ WalSndWaitForWal(XLogRecPtr loc)
{
bool wait_for_standby_at_stop = false;
long sleeptime;
- TimestampTz now;
/* Clear any already-pending wakeups */
ResetLatch(MyLatch);
@@ -1958,8 +1952,7 @@ WalSndWaitForWal(XLogRecPtr loc)
* new WAL to be generated. (But if we have nothing to send, we don't
* want to wake on socket-writable.)
*/
- now = GetCurrentTimestamp();
- sleeptime = WalSndComputeSleeptime(now);
+ sleeptime = WalSndComputeSleeptime(GetCurrentTimestamp());
wakeEvents = WL_SOCKET_READABLE;
@@ -1968,15 +1961,6 @@ WalSndWaitForWal(XLogRecPtr loc)
Assert(wait_event != 0);
- /* Report IO statistics, if needed */
- if (TimestampDifferenceExceeds(last_flush, now,
- WALSENDER_STATS_FLUSH_INTERVAL))
- {
- pgstat_flush_io(false);
- (void) pgstat_flush_backend(false, PGSTAT_BACKEND_FLUSH_IO);
- last_flush = now;
- }
-
WalSndWait(wakeEvents, sleeptime, wait_event);
}
@@ -2879,8 +2863,6 @@ WalSndCheckTimeOut(void)
static void
WalSndLoop(WalSndSendDataCallback send_data)
{
- TimestampTz last_flush = 0;
-
/*
* Initialize the last reply timestamp. That enables timeout processing
* from hereon.
@@ -2975,9 +2957,6 @@ WalSndLoop(WalSndSendDataCallback send_data)
* WalSndWaitForWal() handle any other blocking; idle receivers need
* its additional actions. For physical replication, also block if
* caught up; its send_data does not block.
- *
- * The IO statistics are reported in WalSndWaitForWal() for the
- * logical WAL senders.
*/
if ((WalSndCaughtUp && send_data != XLogSendLogical &&
!streamingDoneSending) ||
@@ -2985,7 +2964,6 @@ WalSndLoop(WalSndSendDataCallback send_data)
{
long sleeptime;
int wakeEvents;
- TimestampTz now;
if (!streamingDoneReceiving)
wakeEvents = WL_SOCKET_READABLE;
@@ -2996,21 +2974,11 @@ WalSndLoop(WalSndSendDataCallback send_data)
* Use fresh timestamp, not last_processing, to reduce the chance
* of reaching wal_sender_timeout before sending a keepalive.
*/
- now = GetCurrentTimestamp();
- sleeptime = WalSndComputeSleeptime(now);
+ sleeptime = WalSndComputeSleeptime(GetCurrentTimestamp());
if (pq_is_send_pending())
wakeEvents |= WL_SOCKET_WRITEABLE;
- /* Report IO statistics, if needed */
- if (TimestampDifferenceExceeds(last_flush, now,
- WALSENDER_STATS_FLUSH_INTERVAL))
- {
- pgstat_flush_io(false);
- (void) pgstat_flush_backend(false, PGSTAT_BACKEND_FLUSH_IO);
- last_flush = now;
- }
-
/* Sleep until something happens or we time out */
WalSndWait(wakeEvents, sleeptime, WAIT_EVENT_WAL_SENDER_MAIN);
}
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index bc8c43b96aa..feae2ae5f44 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -260,15 +260,6 @@ pgstat_report_vacuum(Relation rel, PgStat_Counter livetuples,
}
pgstat_unlock_entry(entry_ref);
-
- /*
- * Flush IO statistics now. pgstat_report_stat() will flush IO stats,
- * however this will not be called until after an entire autovacuum cycle
- * is done -- which will likely vacuum many relations -- or until the
- * VACUUM command has processed all tables and committed.
- */
- pgstat_flush_io(false);
- (void) pgstat_flush_backend(false, PGSTAT_BACKEND_FLUSH_IO);
}
/*
@@ -360,10 +351,6 @@ pgstat_report_analyze(Relation rel,
}
pgstat_unlock_entry(entry_ref);
-
- /* see pgstat_report_vacuum() */
- pgstat_flush_io(false);
- (void) pgstat_flush_backend(false, PGSTAT_BACKEND_FLUSH_IO);
}
/*
--
2.34.1
v2-0003-Add-FLUSH_MIXED-support-and-implement-it-for-RELA.patch (text/x-diff)
From d71d13b8e5a938a8a94362121fa937f9026fb51a Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Date: Mon, 19 Jan 2026 06:27:55 +0000
Subject: [PATCH v2 3/3] Add FLUSH_MIXED support and implement it for RELATION
stats
This commit extends the non transactional stats infrastructure to support statistics
kinds with mixed transaction behavior: some fields are transactional (e.g., tuple
inserts/updates/deletes) while others are non transactional (e.g., sequential scans
blocks read, ...).
It introduces FLUSH_MIXED as a third flush behavior type, alongside FLUSH_ANYTIME
and FLUSH_AT_TXN_BOUNDARY. For FLUSH_MIXED kinds, a new flush_anytime_cb callback
enables partial flushing of only the non transactional fields during running
transactions.
Some tests are also added.
Implementation details:
- Add FLUSH_MIXED to PgStat_FlushBehavior enum
- Add flush_anytime_cb to PgStat_KindInfo for partial flushing callback
- Update pgstat_flush_pending_entries() to call flush_anytime_cb for
FLUSH_MIXED entries when in anytime_only mode
- Keep FLUSH_MIXED entries in the pending list after partial flush, as
transactional fields still need to be flushed at transaction boundary
RELATION stats are making use of FLUSH_MIXED:
- Change RELATION from TXN_ALL to FLUSH_MIXED
- Implement pgstat_relation_flush_anytime_cb() to flush only read related
stats: numscans, tuples_returned, tuples_fetched, blocks_fetched,
blocks_hit
- Clear these fields after flushing to prevent double counting when
pgstat_relation_flush_cb() runs at transaction commit
- Transactional stats (tuples_inserted, tuples_updated, tuples_deleted,
live_tuples, dead_tuples) remain pending until transaction boundary
Remark:
We could also imagine adding a new flush_anytime_static_cb() callback for
future FLUSH_MIXED fixed amount stats.
---
doc/src/sgml/monitoring.sgml | 30 +++++++
src/backend/utils/activity/pgstat.c | 36 ++++++---
src/backend/utils/activity/pgstat_relation.c | 82 ++++++++++++++++++++
src/include/utils/pgstat_internal.h | 9 +++
src/test/isolation/expected/stats.out | 40 ++++++++++
src/test/isolation/expected/stats_1.out | 40 ++++++++++
src/test/isolation/specs/stats.spec | 12 +++
7 files changed, 237 insertions(+), 12 deletions(-)
14.5% doc/src/sgml/
47.2% src/backend/utils/activity/
4.7% src/include/utils/
29.4% src/test/isolation/expected/
4.0% src/test/isolation/specs/
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 817fd9f4ca7..15b55016b66 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3730,6 +3730,16 @@ description | Waiting for a newly initialized WAL file to reach durable storage
</tgroup>
</table>
+ <note>
+ <para>
+ All the statistics are updated while the transactions are in progress, except
+ for <structfield>xact_commit</structfield>, <structfield>xact_rollback</structfield>,
+ <structfield>tup_inserted</structfield>, <structfield>tup_updated</structfield> and
+ <structfield>tup_deleted</structfield> that are updated only when the transactions
+ finish.
+ </para>
+ </note>
+
</sect2>
<sect2 id="monitoring-pg-stat-database-conflicts-view">
@@ -4186,6 +4196,17 @@ description | Waiting for a newly initialized WAL file to reach durable storage
</tgroup>
</table>
+ <note>
+ <para>
+ The <structfield>seq_scan</structfield>, <structfield>last_seq_scan</structfield>,
+ <structfield>seq_tup_read</structfield>, <structfield>idx_scan</structfield>,
+ <structfield>last_idx_scan</structfield> and <structfield>idx_tup_fetch</structfield>
+ are updated while the transactions are in progress. This means that we can see
+ those statistics being updated without having to wait until the transaction
+ finishes.
+ </para>
+ </note>
+
</sect2>
<sect2 id="monitoring-pg-stat-all-indexes-view">
@@ -4367,6 +4388,15 @@ description | Waiting for a newly initialized WAL file to reach durable storage
tuples (see <xref linkend="indexes-multicolumn"/>).
</para>
</note>
+ <note>
+ <para>
+ The <structfield>idx_scan</structfield>, <structfield>last_idx_scan</structfield>,
+ <structfield>idx_tup_read</structfield> and <structfield>idx_tup_fetch</structfield>
+ are updated while the transactions are in progress. This means that we can see
+ those statistics being updated without having to wait until the transaction
+ finishes.
+ </para>
+ </note>
<tip>
<para>
<command>EXPLAIN ANALYZE</command> outputs the total number of index
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 0f45a7d165e..5b93683ea9b 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -289,7 +289,7 @@ static const PgStat_KindInfo pgstat_kind_builtin_infos[PGSTAT_KIND_BUILTIN_SIZE]
.fixed_amount = false,
.write_to_file = true,
- .flush_behavior = FLUSH_AT_TXN_BOUNDARY,
+ .flush_behavior = FLUSH_ANYTIME,
/* so pg_stat_database entries can be seen in all databases */
.accessed_across_databases = true,
@@ -307,7 +307,7 @@ static const PgStat_KindInfo pgstat_kind_builtin_infos[PGSTAT_KIND_BUILTIN_SIZE]
.fixed_amount = false,
.write_to_file = true,
- .flush_behavior = FLUSH_AT_TXN_BOUNDARY,
+ .flush_behavior = FLUSH_MIXED,
.shared_size = sizeof(PgStatShared_Relation),
.shared_data_off = offsetof(PgStatShared_Relation, stats),
@@ -315,6 +315,7 @@ static const PgStat_KindInfo pgstat_kind_builtin_infos[PGSTAT_KIND_BUILTIN_SIZE]
.pending_size = sizeof(PgStat_TableStatus),
.flush_pending_cb = pgstat_relation_flush_cb,
+ .flush_anytime_cb = pgstat_relation_flush_anytime_cb,
.delete_pending_cb = pgstat_relation_delete_pending_cb,
.reset_timestamp_cb = pgstat_relation_reset_timestamp_cb,
},
@@ -1347,10 +1348,11 @@ pgstat_delete_pending_entry(PgStat_EntryRef *entry_ref)
/*
* Flush out pending variable-numbered stats.
*
- * If anytime_only is true, only flushes FLUSH_ANYTIME entries.
+ * If anytime_only is true, only flushes FLUSH_ANYTIME and FLUSH_MIXED entries,
+ * using flush_anytime_cb for FLUSH_MIXED.
* This is safe to call inside transactions.
*
- * If anytime_only is false, flushes all entries.
+ * If anytime_only is false, flushes all entries using flush_pending_cb.
*/
static bool
pgstat_flush_pending_entries(bool nowait, bool anytime_only)
@@ -1378,6 +1380,7 @@ pgstat_flush_pending_entries(bool nowait, bool anytime_only)
PgStat_Kind kind = key.kind;
const PgStat_KindInfo *kind_info = pgstat_get_kind_info(kind);
bool did_flush;
+ bool is_partial_flush = false;
dlist_node *next;
Assert(!kind_info->fixed_amount);
@@ -1397,8 +1400,21 @@ pgstat_flush_pending_entries(bool nowait, bool anytime_only)
continue;
}
- /* flush the stats, if possible */
- did_flush = kind_info->flush_pending_cb(entry_ref, nowait);
+ /* flush the stats (with the appropriate callback), if possible */
+ if (anytime_only &&
+ kind_info->flush_behavior == FLUSH_MIXED &&
+ kind_info->flush_anytime_cb != NULL)
+ {
+ /* Partial flush of non-transactional fields only */
+ did_flush = kind_info->flush_anytime_cb(entry_ref, nowait);
+ is_partial_flush = true;
+ }
+ else
+ {
+ /* Full flush */
+ did_flush = kind_info->flush_pending_cb(entry_ref, nowait);
+ is_partial_flush = false;
+ }
Assert(did_flush || nowait);
@@ -1408,8 +1424,8 @@ pgstat_flush_pending_entries(bool nowait, bool anytime_only)
else
next = NULL;
- /* if successfully flushed, remove entry */
- if (did_flush)
+ /* if successful and not a partial flush, remove entry */
+ if (did_flush && !is_partial_flush)
pgstat_delete_pending_entry(entry_ref);
else
have_pending = true;
@@ -1417,10 +1433,6 @@ pgstat_flush_pending_entries(bool nowait, bool anytime_only)
cur = next;
}
- /*
- * When in anytime_only mode, the list may not be empty because
- * FLUSH_AT_TXN_BOUNDARY entries were skipped.
- */
Assert(dlist_is_empty(&pgStatPending) == !have_pending);
return have_pending;
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index feae2ae5f44..6d6f333039e 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -887,6 +887,88 @@ pgstat_relation_flush_cb(PgStat_EntryRef *entry_ref, bool nowait)
return true;
}
+/*
+ * Flush only non-transactional relation stats.
+ *
+ * This is called periodically during running transactions to make some
+ * statistics visible without waiting for the transaction to finish.
+ *
+ * Transactional stats (inserts/updates/deletes and their effects on live/dead
+ * tuple counts) remain in pending until the transaction ends, at which point
+ * pgstat_relation_flush_cb() will flush them.
+ *
+ * If nowait is true and the lock could not be immediately acquired, returns
+ * false without flushing the entry. Otherwise returns true.
+ */
+bool
+pgstat_relation_flush_anytime_cb(PgStat_EntryRef *entry_ref, bool nowait)
+{
+ Oid dboid;
+ PgStat_TableStatus *lstats; /* pending stats entry */
+ PgStatShared_Relation *shtabstats;
+ PgStat_StatTabEntry *tabentry; /* table entry of shared stats */
+ PgStat_StatDBEntry *dbentry; /* pending database entry */
+ bool has_nontxn_stats = false;
+
+ dboid = entry_ref->shared_entry->key.dboid;
+ lstats = (PgStat_TableStatus *) entry_ref->pending;
+ shtabstats = (PgStatShared_Relation *) entry_ref->shared_stats;
+
+ /*
+ * Check if there are any non-transactional stats to flush. Avoid
+ * unnecessarily locking the entry if nothing accumulated.
+ */
+ if (lstats->counts.numscans > 0 ||
+ lstats->counts.tuples_returned > 0 ||
+ lstats->counts.tuples_fetched > 0 ||
+ lstats->counts.blocks_fetched > 0 ||
+ lstats->counts.blocks_hit > 0)
+ has_nontxn_stats = true;
+
+ if (!has_nontxn_stats)
+ return true;
+
+ if (!pgstat_lock_entry(entry_ref, nowait))
+ return false;
+
+ /* Add only the non-transactional values to the shared entry */
+ tabentry = &shtabstats->stats;
+
+ tabentry->numscans += lstats->counts.numscans;
+ if (lstats->counts.numscans)
+ {
+ TimestampTz t = GetCurrentTimestamp();
+
+ if (t > tabentry->lastscan)
+ tabentry->lastscan = t;
+ }
+ tabentry->tuples_returned += lstats->counts.tuples_returned;
+ tabentry->tuples_fetched += lstats->counts.tuples_fetched;
+ tabentry->blocks_fetched += lstats->counts.blocks_fetched;
+ tabentry->blocks_hit += lstats->counts.blocks_hit;
+
+ pgstat_unlock_entry(entry_ref);
+
+ /* Also update the corresponding fields in database stats */
+ dbentry = pgstat_prep_database_pending(dboid);
+ dbentry->tuples_returned += lstats->counts.tuples_returned;
+ dbentry->tuples_fetched += lstats->counts.tuples_fetched;
+ dbentry->blocks_fetched += lstats->counts.blocks_fetched;
+ dbentry->blocks_hit += lstats->counts.blocks_hit;
+
+ /*
+ * Clear the flushed fields from pending stats to prevent double-counting
+ * when pgstat_relation_flush_cb() runs at transaction boundary.
+ */
+ lstats->counts.numscans = 0;
+ lstats->counts.tuples_returned = 0;
+ lstats->counts.tuples_fetched = 0;
+ lstats->counts.blocks_fetched = 0;
+ lstats->counts.blocks_hit = 0;
+
+ return true;
+}
+
void
pgstat_relation_delete_pending_cb(PgStat_EntryRef *entry_ref)
{
diff --git a/src/include/utils/pgstat_internal.h b/src/include/utils/pgstat_internal.h
index 63feae640d1..c80b8162b37 100644
--- a/src/include/utils/pgstat_internal.h
+++ b/src/include/utils/pgstat_internal.h
@@ -233,6 +233,8 @@ typedef enum PgStat_FlushBehavior
* including within transactions */
FLUSH_AT_TXN_BOUNDARY, /* All fields can only be flushed at
* transaction boundary */
+ FLUSH_MIXED, /* Mix of fields that can be flushed anytime
+ * or only at transaction boundary */
} PgStat_FlushBehavior;
/*
@@ -264,6 +266,12 @@ typedef struct PgStat_KindInfo
/* Flush behavior */
PgStat_FlushBehavior flush_behavior;
+ /*
+ * For PGSTAT_FLUSH_MIXED kinds: callback to flush only some fields. If
+ * NULL for a MIXED kind, treated as PGSTAT_FLUSH_AT_TXN_BOUNDARY.
+ */
+ bool (*flush_anytime_cb) (PgStat_EntryRef *entry_ref, bool nowait);
+
/*
* The size of an entry in the shared stats hash table (pointed to by
* PgStatShared_HashEntry->body). For fixed-numbered statistics, this is
@@ -776,6 +784,7 @@ extern void AtPrepare_PgStat_Relations(PgStat_SubXactStatus *xact_state);
extern void PostPrepare_PgStat_Relations(PgStat_SubXactStatus *xact_state);
extern bool pgstat_relation_flush_cb(PgStat_EntryRef *entry_ref, bool nowait);
+extern bool pgstat_relation_flush_anytime_cb(PgStat_EntryRef *entry_ref, bool nowait);
extern void pgstat_relation_delete_pending_cb(PgStat_EntryRef *entry_ref);
extern void pgstat_relation_reset_timestamp_cb(PgStatShared_Common *header, TimestampTz ts);
diff --git a/src/test/isolation/expected/stats.out b/src/test/isolation/expected/stats.out
index cfad309ccf3..6d62b30e4a7 100644
--- a/src/test/isolation/expected/stats.out
+++ b/src/test/isolation/expected/stats.out
@@ -2245,6 +2245,46 @@ seq_scan|seq_tup_read|n_tup_ins|n_tup_upd|n_tup_del|n_live_tup|n_dead_tup|vacuum
(1 row)
+starting permutation: s2_begin s2_table_select s1_sleep s1_table_stats s2_table_drop s2_commit
+pg_stat_force_next_flush
+------------------------
+
+(1 row)
+
+step s2_begin: BEGIN;
+step s2_table_select: SELECT * FROM test_stat_tab ORDER BY key, value;
+key|value
+---+-----
+k0 | 1
+(1 row)
+
+step s1_sleep: SELECT pg_sleep(1.5);
+pg_sleep
+--------
+
+(1 row)
+
+step s1_table_stats:
+ SELECT
+ pg_stat_get_numscans(tso.oid) AS seq_scan,
+ pg_stat_get_tuples_returned(tso.oid) AS seq_tup_read,
+ pg_stat_get_tuples_inserted(tso.oid) AS n_tup_ins,
+ pg_stat_get_tuples_updated(tso.oid) AS n_tup_upd,
+ pg_stat_get_tuples_deleted(tso.oid) AS n_tup_del,
+ pg_stat_get_live_tuples(tso.oid) AS n_live_tup,
+ pg_stat_get_dead_tuples(tso.oid) AS n_dead_tup,
+ pg_stat_get_vacuum_count(tso.oid) AS vacuum_count
+ FROM test_stat_oid AS tso
+ WHERE tso.name = 'test_stat_tab'
+
+seq_scan|seq_tup_read|n_tup_ins|n_tup_upd|n_tup_del|n_live_tup|n_dead_tup|vacuum_count
+--------+------------+---------+---------+---------+----------+----------+------------
+ 1| 1| 1| 0| 0| 1| 0| 0
+(1 row)
+
+step s2_table_drop: DROP TABLE test_stat_tab;
+step s2_commit: COMMIT;
+
starting permutation: s1_track_counts_off s1_table_stats s1_track_counts_on
pg_stat_force_next_flush
------------------------
diff --git a/src/test/isolation/expected/stats_1.out b/src/test/isolation/expected/stats_1.out
index e1d937784cb..2fade10e817 100644
--- a/src/test/isolation/expected/stats_1.out
+++ b/src/test/isolation/expected/stats_1.out
@@ -2253,6 +2253,46 @@ seq_scan|seq_tup_read|n_tup_ins|n_tup_upd|n_tup_del|n_live_tup|n_dead_tup|vacuum
(1 row)
+starting permutation: s2_begin s2_table_select s1_sleep s1_table_stats s2_table_drop s2_commit
+pg_stat_force_next_flush
+------------------------
+
+(1 row)
+
+step s2_begin: BEGIN;
+step s2_table_select: SELECT * FROM test_stat_tab ORDER BY key, value;
+key|value
+---+-----
+k0 | 1
+(1 row)
+
+step s1_sleep: SELECT pg_sleep(1.5);
+pg_sleep
+--------
+
+(1 row)
+
+step s1_table_stats:
+ SELECT
+ pg_stat_get_numscans(tso.oid) AS seq_scan,
+ pg_stat_get_tuples_returned(tso.oid) AS seq_tup_read,
+ pg_stat_get_tuples_inserted(tso.oid) AS n_tup_ins,
+ pg_stat_get_tuples_updated(tso.oid) AS n_tup_upd,
+ pg_stat_get_tuples_deleted(tso.oid) AS n_tup_del,
+ pg_stat_get_live_tuples(tso.oid) AS n_live_tup,
+ pg_stat_get_dead_tuples(tso.oid) AS n_dead_tup,
+ pg_stat_get_vacuum_count(tso.oid) AS vacuum_count
+ FROM test_stat_oid AS tso
+ WHERE tso.name = 'test_stat_tab'
+
+seq_scan|seq_tup_read|n_tup_ins|n_tup_upd|n_tup_del|n_live_tup|n_dead_tup|vacuum_count
+--------+------------+---------+---------+---------+----------+----------+------------
+ 0| 0| 1| 0| 0| 1| 0| 0
+(1 row)
+
+step s2_table_drop: DROP TABLE test_stat_tab;
+step s2_commit: COMMIT;
+
starting permutation: s1_track_counts_off s1_table_stats s1_track_counts_on
pg_stat_force_next_flush
------------------------
diff --git a/src/test/isolation/specs/stats.spec b/src/test/isolation/specs/stats.spec
index da16710da0f..1b0168e6176 100644
--- a/src/test/isolation/specs/stats.spec
+++ b/src/test/isolation/specs/stats.spec
@@ -50,6 +50,8 @@ step s1_rollback { ROLLBACK; }
step s1_prepare_a { PREPARE TRANSACTION 'a'; }
step s1_commit_prepared_a { COMMIT PREPARED 'a'; }
step s1_rollback_prepared_a { ROLLBACK PREPARED 'a'; }
+# Has to be greater than PGSTAT_ANYTIME_FLUSH_INTERVAL
+step s1_sleep { SELECT pg_sleep(1.5); }
# Function stats steps
step s1_ff { SELECT pg_stat_force_next_flush(); }
@@ -138,6 +140,7 @@ step s2_commit { COMMIT; }
step s2_commit_prepared_a { COMMIT PREPARED 'a'; }
step s2_rollback_prepared_a { ROLLBACK PREPARED 'a'; }
step s2_ff { SELECT pg_stat_force_next_flush(); }
+step s2_table_drop { DROP TABLE test_stat_tab; }
# Function stats steps
step s2_track_funcs_all { SET track_functions = 'all'; }
@@ -435,6 +438,15 @@ permutation
s1_table_drop
s1_table_stats
+### Check that some stats are updated (seq_scan and seq_tup_read)
+### while the transaction is still running
+permutation
+ s2_begin
+ s2_table_select
+ s1_sleep
+ s1_table_stats
+ s2_table_drop
+ s2_commit
### Check that we don't count changes with track counts off, but allow access
### to prior stats
--
2.34.1
Thanks for the updates!
I don't think this feature could add a noticeable performance impact, so the tests
have been that simple. Do you think we should worry more?

One observation is there's no coordination between ANYTIME and
TXN_BOUNDARY flushes. While PGSTAT_MIN_INTERVAL
prevents a backend from flushing more than once per second, a backend can
still perform both an ANYTIME flush and a TXN_BOUNDARY flush within
the same 1-second window. Not saying this will be a real problem in
the real world, but we definitely took measures in the current
implementation to avoid this scenario.

Right. I think that the PGSTAT_MIN_INTERVAL throttling was put in place to prevent
flushing too frequently when the backend has a high commit rate. But here, while
it's true that we don't follow that rule (meaning a backend could flush more than
once per second), that would be at most twice (given that ANYTIME flushes every
second). So, I'm not sure that this single extra flush is worth worrying about.
Plus we'd certainly need an extra GetCurrentTimestamp() call, so I'm not sure it's
worth it.
Yeah, all PGSTAT_MIN_INTERVAL does is throttle pgstat_flush_pending_entries.
Even in the current state, it does not limit how many kinds are flushed, etc.
I consider the ANYTIME flushes the same as just adding another stats kind.
So, I am not really worried about either.
I have some more comments:
-- v2-0001
#1.
+/* When to call pgstat_report_anytime_stat() again */
+#define PGSTAT_ANYTIME_FLUSH_INTERVAL 1000
+
We should just use PGSTAT_MIN_INTERVAL.
#2.
Instead of ".flush_behavior", maybe ".flush_mode"? "mode" in the name is better
for configuration fields.
#3.
+/*
+ * Flush behavior for statistics kinds.
+ */
+typedef enum PgStat_FlushBehavior
+{
+	FLUSH_ANYTIME,			/* All fields can be flushed anytime,
+							 * including within transactions */
+	FLUSH_AT_TXN_BOUNDARY,	/* All fields can only be flushed at
+							 * transaction boundary */
+} PgStat_FlushBehavior;
FLUSH_AT_TXN_BOUNDARY should be the first value in PgStat_FlushBehavior.
Otherwise kinds (built-in or custom) that do not specify a flush_behavior
will default to FLUSH_ANYTIME. I don't think this is what we want.
FLUSH_AT_TXN_BOUNDARY should be the default.
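In other words, something along these lines (keeping the v2 names; the point is
that the zero-valued member becomes the implicit default for any kind that leaves
flush_behavior unset in its PgStat_KindInfo initializer):
```
typedef enum PgStat_FlushBehavior
{
	FLUSH_AT_TXN_BOUNDARY,		/* all fields can only be flushed at
								 * transaction boundary; zero value, so this
								 * is what kinds that don't set
								 * .flush_behavior get by default */
	FLUSH_ANYTIME,				/* all fields can be flushed anytime,
								 * including within transactions */
} PgStat_FlushBehavior;
```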
#4. Can we add a test here? Maybe generate some WAL inside a long running
transaction and make sure the stats are updated after > 1 second.
-- v2-0002
No comments for this one. With ANYTIME, indeed those flushes are not needed.
-- v2-0003
#1. Should we maybe make this a bit longer? Maybe 2 or 3 seconds?
May make the tests slightly longer, but maybe better for test stability.
```
+step s1_sleep: SELECT pg_sleep(1.5);
+pg_sleep
+--------
```
#2.
+ /*
+ * Check if there are any non-transactional stats to flush. Avoid
+ * unnecessarily locking the entry if nothing accumulated.
+ */
+ if (lstats->counts.numscans > 0 ||
+ lstats->counts.tuples_returned > 0 ||
+ lstats->counts.tuples_fetched > 0 ||
+ lstats->counts.blocks_fetched > 0 ||
+ lstats->counts.blocks_hit > 0)
+ has_nontxn_stats = true;
+
+ if (!has_nontxn_stats)
+ return true;
Can we just do this without a has_nontxn_stats?
This is also the same pattern as a regular flush, although in that case
`pg_memory_is_all_zeros` is used.
```
if (lstats->counts.numscans == 0 &&
lstats->counts.tuples_returned == 0 &&
lstats->counts.tuples_fetched == 0 &&
lstats->counts.blocks_fetched == 0 &&
lstats->counts.blocks_hit == 0)
return true;
```
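For context, the zero-entry short-circuit in the regular flush path looks roughly
like this (a sketch; the exact condition in pgstat_relation_flush_cb may differ).
It checks the whole counts struct, which is why the partial flush above needs the
explicit per-field condition instead:
```
	/* regular flush path: nothing at all accumulated for this entry */
	if (pg_memory_is_all_zeros(&lstats->counts, sizeof(lstats->counts)))
		return true;
```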
#3.
+ are updated while the transactions are in progress. This means that we can see
+ those statistics being updated without having to wait until the transaction
+ finishes.
+ </para>
The "This means ...... " line used several times does not add value, IMO.
"are updated while the transactions are in progress." is sufficient.
#4.
+ <note>
+ <para>
+ All the statistics are updated while the transactions are in progress, except
+ for <structfield>xact_commit</structfield>, <structfield>xact_rollback</structfield>,
+ <structfield>tup_inserted</structfield>, <structfield>tup_updated</structfield> and
+ <structfield>tup_deleted</structfield> that are updated only when the transactions
+ finish.
+ </para>
+ </note>
pgstat_relation_flush_anytime_cb() flushes only these 5 fields, so only the
fields below are "updated while the transactions are in progress", right?
numscans
tuples_returned
tuples_fetched
blocks_fetched
blocks_hit
--
Sami Imseih
Amazon Web Services (AWS)
Hello
@@ -264,6 +266,12 @@ typedef struct PgStat_KindInfo
/* Flush behavior */
PgStat_FlushBehavior flush_behavior;
+ /*
+ * For PGSTAT_FLUSH_MIXED kinds: callback to flush only some fields. If
+ * NULL for a MIXED kind, treated as PGSTAT_FLUSH_AT_TXN_BOUNDARY.
+ */
+ bool (*flush_anytime_cb) (PgStat_EntryRef *entry_ref, bool nowait);
+
The comment seems to use incorrect names; shouldn't they be FLUSH_MIXED and
FLUSH_AT_TXN_BOUNDARY, without the PGSTAT_ prefix?
Hi,
On Tue, Jan 20, 2026 at 01:27:55PM -0600, Sami Imseih wrote:
I have some more comments:
Thanks!
-- v2-0001
#1.
+/* When to call pgstat_report_anytime_stat() again */
+#define PGSTAT_ANYTIME_FLUSH_INTERVAL 1000
+

We should just use PGSTAT_MIN_INTERVAL.
Okay, done. We can still switch to a dedicated one if we feel the need later on.
#2.
instead of ".flush_behavior", maybe ".flush_mode"? "mode" in the name is better
for configuration fields.
Sounds good.
#3.
FLUSH_AT_TXN_BOUNDARY should be the first value in PgStat_FlushBehavior.
Otherwise kinds ( built-in or custom ) that do not specify a flush_behavior
will default to FLUSH_ANYTIME. I don't think this is what we want.
FLUSH_AT_TXN_BOUNDARY should be the default.
Good point, agreed and done.
#4. Can we add a test here? Maybe generate some wal inside a long
running transaction and
make sure the stats are updated after > 1 second
I'm not sure; that's also somewhat the purpose of 0002 (with 039549d70f6 being
reverted).
0001 and 0002 could be merged and pushed as one commit. That said I'm not opposed
if you feel strongly about it.
-- v2-0003
#1. Should we maybe make this a bit longer? maybe 2 or 3 seconds?
May make the tests slightly longer, but maybe better for test stability.
```
+step s1_sleep: SELECT pg_sleep(1.5);
+pg_sleep
+--------
```
Not sure, we could increase if we see the test failing.
#2.

+ /*
+ * Check if there are any non-transactional stats to flush. Avoid
+ * unnecessarily locking the entry if nothing accumulated.
+ */
+ if (lstats->counts.numscans > 0 ||
+ lstats->counts.tuples_returned > 0 ||
+ lstats->counts.tuples_fetched > 0 ||
+ lstats->counts.blocks_fetched > 0 ||
+ lstats->counts.blocks_hit > 0)
+ has_nontxn_stats = true;
+
+ if (!has_nontxn_stats)
+ return true;
Can we just do this without a has_nontxn_stats?
Yeah.
#3.

+ are updated while the transactions are in progress. This means that we can see
+ those statistics being updated without having to wait until the transaction
+ finishes.
+ </para>

The "This means ...... " line used several times does not add value, IMO.
"are updated while the transactions are in progress." is sufficient.
Removed.
#4.

+ <note>
+ <para>
+ All the statistics are updated while the transactions are in progress, except
+ for <structfield>xact_commit</structfield>, <structfield>xact_rollback</structfield>,
+ <structfield>tup_inserted</structfield>, <structfield>tup_updated</structfield> and
+ <structfield>tup_deleted</structfield> that are updated only when the transactions
+ finish.
+ </para>
+ </note>

pgstat_relation_flush_anytime_cb() flushes only these 5 fields, so only the
fields below are "updated while the transactions are in progress", right?

numscans
tuples_returned
tuples_fetched
blocks_fetched
blocks_hit
No, 0003 also changes the flush mode for the database KIND. All the fields that
I mentioned are inherited from relation stats and are flushed only at transaction
boundaries (so they don't appear in pg_stat_database until the transaction
finishes). Does that make sense? (If the database kind is not switched to
flush any time, then none of them would appear while the transaction is in
progress, not even the ones inherited from relation stats.)
PFA v3, which also takes care of Zsolt's comment (thanks!) made up-thread.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v3-0001-Add-pgstat_report_anytime_stat-for-periodic-stats.patch (text/x-diff)
From b9652e6e6031ff88f9b789f99c5c19a38fa61d0c Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Date: Mon, 5 Jan 2026 09:41:39 +0000
Subject: [PATCH v3 1/3] Add pgstat_report_anytime_stat() for periodic stats
flushing
Long running transactions can accumulate significant statistics (WAL, IO, ...)
that remain unflushed until the transaction ends. This delays visibility of
resource usage in monitoring views like pg_stat_io and pg_stat_wal.
This commit introduces pgstat_report_anytime_stat(), which flushes
non transactional statistics even inside active transactions. A new timeout
handler fires every second to call this function, ensuring timely stats visibility
without waiting for transaction completion.
Implementation details:
- Add PgStat_FlushMode enum to classify stats kinds:
* FLUSH_ANYTIME: Stats that can always be flushed (WAL, IO, ...)
* FLUSH_AT_TXN_BOUNDARY: Stats requiring transaction boundaries
- Modify pgstat_flush_pending_entries() and pgstat_flush_fixed_stats()
to accept a boolean anytime_only parameter:
* When false: flushes all stats (existing behavior)
* When true: flushes only FLUSH_ANYTIME stats and skips FLUSH_AT_TXN_BOUNDARY stats
- This relies on the existing PGSTAT_MIN_INTERVAL to fire every 1 second, calling
pgstat_report_anytime_stat(false)
The force parameter in pgstat_report_anytime_stat() is currently unused (always
called with force=false) but reserved for future use cases requiring immediate
flushing.
---
src/backend/tcop/postgres.c | 16 ++++
src/backend/utils/activity/pgstat.c | 111 +++++++++++++++++++++++-----
src/backend/utils/init/globals.c | 1 +
src/backend/utils/init/postinit.c | 15 ++++
src/include/miscadmin.h | 1 +
src/include/pgstat.h | 4 +
src/include/utils/pgstat_internal.h | 16 ++++
src/include/utils/timeout.h | 1 +
src/tools/pgindent/typedefs.list | 1 +
9 files changed, 148 insertions(+), 18 deletions(-)
8.1% src/backend/tcop/
68.4% src/backend/utils/activity/
9.6% src/backend/utils/init/
9.2% src/include/utils/
4.1% src/include/
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index e54bf1e760f..132fae61423 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3530,6 +3530,22 @@ ProcessInterrupts(void)
pgstat_report_stat(true);
}
+ /*
+ * Flush stats outside of transaction boundary if the timeout fired.
+ * Unlike transactional stats, these can be flushed even inside a running
+ * transaction.
+ */
+ if (AnytimeStatsUpdateTimeoutPending)
+ {
+ AnytimeStatsUpdateTimeoutPending = false;
+
+ pgstat_report_anytime_stat(false);
+
+ /* Schedule next timeout */
+ enable_timeout_after(ANYTIME_STATS_UPDATE_TIMEOUT,
+ PGSTAT_MIN_INTERVAL);
+ }
+
if (ProcSignalBarrierPending)
ProcessProcSignalBarrier();
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 11bb71cad5a..ab4d9088a9a 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -122,8 +122,6 @@
* ----------
*/
-/* minimum interval non-forced stats flushes.*/
-#define PGSTAT_MIN_INTERVAL 1000
/* how long until to block flushing pending stats updates */
#define PGSTAT_MAX_INTERVAL 60000
/* when to call pgstat_report_stat() again, even when idle */
@@ -187,7 +185,8 @@ static void pgstat_init_snapshot_fixed(void);
static void pgstat_reset_after_failure(void);
-static bool pgstat_flush_pending_entries(bool nowait);
+static bool pgstat_flush_pending_entries(bool nowait, bool anytime_only);
+static bool pgstat_flush_fixed_stats(bool nowait, bool anytime_only);
static void pgstat_prep_snapshot(void);
static void pgstat_build_snapshot(void);
@@ -288,6 +287,7 @@ static const PgStat_KindInfo pgstat_kind_builtin_infos[PGSTAT_KIND_BUILTIN_SIZE]
.fixed_amount = false,
.write_to_file = true,
+ .flush_mode = FLUSH_AT_TXN_BOUNDARY,
/* so pg_stat_database entries can be seen in all databases */
.accessed_across_databases = true,
@@ -305,6 +305,7 @@ static const PgStat_KindInfo pgstat_kind_builtin_infos[PGSTAT_KIND_BUILTIN_SIZE]
.fixed_amount = false,
.write_to_file = true,
+ .flush_mode = FLUSH_AT_TXN_BOUNDARY,
.shared_size = sizeof(PgStatShared_Relation),
.shared_data_off = offsetof(PgStatShared_Relation, stats),
@@ -321,6 +322,7 @@ static const PgStat_KindInfo pgstat_kind_builtin_infos[PGSTAT_KIND_BUILTIN_SIZE]
.fixed_amount = false,
.write_to_file = true,
+ .flush_mode = FLUSH_AT_TXN_BOUNDARY,
.shared_size = sizeof(PgStatShared_Function),
.shared_data_off = offsetof(PgStatShared_Function, stats),
@@ -336,6 +338,7 @@ static const PgStat_KindInfo pgstat_kind_builtin_infos[PGSTAT_KIND_BUILTIN_SIZE]
.fixed_amount = false,
.write_to_file = true,
+ .flush_mode = FLUSH_AT_TXN_BOUNDARY,
.accessed_across_databases = true,
@@ -353,6 +356,7 @@ static const PgStat_KindInfo pgstat_kind_builtin_infos[PGSTAT_KIND_BUILTIN_SIZE]
.fixed_amount = false,
.write_to_file = true,
+ .flush_mode = FLUSH_AT_TXN_BOUNDARY,
/* so pg_stat_subscription_stats entries can be seen in all databases */
.accessed_across_databases = true,
@@ -370,6 +374,7 @@ static const PgStat_KindInfo pgstat_kind_builtin_infos[PGSTAT_KIND_BUILTIN_SIZE]
.fixed_amount = false,
.write_to_file = false,
+ .flush_mode = FLUSH_ANYTIME,
.accessed_across_databases = true,
@@ -388,6 +393,7 @@ static const PgStat_KindInfo pgstat_kind_builtin_infos[PGSTAT_KIND_BUILTIN_SIZE]
.fixed_amount = true,
.write_to_file = true,
+ .flush_mode = FLUSH_ANYTIME,
.snapshot_ctl_off = offsetof(PgStat_Snapshot, archiver),
.shared_ctl_off = offsetof(PgStat_ShmemControl, archiver),
@@ -404,6 +410,7 @@ static const PgStat_KindInfo pgstat_kind_builtin_infos[PGSTAT_KIND_BUILTIN_SIZE]
.fixed_amount = true,
.write_to_file = true,
+ .flush_mode = FLUSH_ANYTIME,
.snapshot_ctl_off = offsetof(PgStat_Snapshot, bgwriter),
.shared_ctl_off = offsetof(PgStat_ShmemControl, bgwriter),
@@ -420,6 +427,7 @@ static const PgStat_KindInfo pgstat_kind_builtin_infos[PGSTAT_KIND_BUILTIN_SIZE]
.fixed_amount = true,
.write_to_file = true,
+ .flush_mode = FLUSH_ANYTIME,
.snapshot_ctl_off = offsetof(PgStat_Snapshot, checkpointer),
.shared_ctl_off = offsetof(PgStat_ShmemControl, checkpointer),
@@ -436,6 +444,7 @@ static const PgStat_KindInfo pgstat_kind_builtin_infos[PGSTAT_KIND_BUILTIN_SIZE]
.fixed_amount = true,
.write_to_file = true,
+ .flush_mode = FLUSH_ANYTIME,
.snapshot_ctl_off = offsetof(PgStat_Snapshot, io),
.shared_ctl_off = offsetof(PgStat_ShmemControl, io),
@@ -453,6 +462,7 @@ static const PgStat_KindInfo pgstat_kind_builtin_infos[PGSTAT_KIND_BUILTIN_SIZE]
.fixed_amount = true,
.write_to_file = true,
+ .flush_mode = FLUSH_ANYTIME,
.snapshot_ctl_off = offsetof(PgStat_Snapshot, slru),
.shared_ctl_off = offsetof(PgStat_ShmemControl, slru),
@@ -470,6 +480,7 @@ static const PgStat_KindInfo pgstat_kind_builtin_infos[PGSTAT_KIND_BUILTIN_SIZE]
.fixed_amount = true,
.write_to_file = true,
+ .flush_mode = FLUSH_ANYTIME,
.snapshot_ctl_off = offsetof(PgStat_Snapshot, wal),
.shared_ctl_off = offsetof(PgStat_ShmemControl, wal),
@@ -775,23 +786,11 @@ pgstat_report_stat(bool force)
partial_flush = false;
/* flush of variable-numbered stats tracked in pending entries list */
- partial_flush |= pgstat_flush_pending_entries(nowait);
+ partial_flush |= pgstat_flush_pending_entries(nowait, false);
/* flush of other stats kinds */
if (pgstat_report_fixed)
- {
- for (PgStat_Kind kind = PGSTAT_KIND_MIN; kind <= PGSTAT_KIND_MAX; kind++)
- {
- const PgStat_KindInfo *kind_info = pgstat_get_kind_info(kind);
-
- if (!kind_info)
- continue;
- if (!kind_info->flush_static_cb)
- continue;
-
- partial_flush |= kind_info->flush_static_cb(nowait);
- }
- }
+ partial_flush |= pgstat_flush_fixed_stats(nowait, false);
last_flush = now;
@@ -1345,9 +1344,14 @@ pgstat_delete_pending_entry(PgStat_EntryRef *entry_ref)
/*
* Flush out pending variable-numbered stats.
+ *
+ * If anytime_only is true, only flushes FLUSH_ANYTIME entries.
+ * This is safe to call inside transactions.
+ *
+ * If anytime_only is false, flushes all entries.
*/
static bool
-pgstat_flush_pending_entries(bool nowait)
+pgstat_flush_pending_entries(bool nowait, bool anytime_only)
{
bool have_pending = false;
dlist_node *cur = NULL;
@@ -1377,6 +1381,20 @@ pgstat_flush_pending_entries(bool nowait)
Assert(!kind_info->fixed_amount);
Assert(kind_info->flush_pending_cb != NULL);
+ /* Skip transactional stats if we're in anytime_only mode */
+ if (anytime_only && kind_info->flush_mode == FLUSH_AT_TXN_BOUNDARY)
+ {
+ have_pending = true;
+
+ if (dlist_has_next(&pgStatPending, cur))
+ next = dlist_next_node(&pgStatPending, cur);
+ else
+ next = NULL;
+
+ cur = next;
+ continue;
+ }
+
/* flush the stats, if possible */
did_flush = kind_info->flush_pending_cb(entry_ref, nowait);
@@ -1402,6 +1420,33 @@ pgstat_flush_pending_entries(bool nowait)
return have_pending;
}
+/*
+ * Flush fixed-amount stats.
+ *
+ * If anytime_only is true, only flushes FLUSH_ANYTIME stats (safe inside transactions).
+ * If anytime_only is false, flushes all stats with flush_static_cb.
+ */
+static bool
+pgstat_flush_fixed_stats(bool nowait, bool anytime_only)
+{
+ bool partial_flush = false;
+
+ for (PgStat_Kind kind = PGSTAT_KIND_MIN; kind <= PGSTAT_KIND_MAX; kind++)
+ {
+ const PgStat_KindInfo *kind_info = pgstat_get_kind_info(kind);
+
+ if (!kind_info || !kind_info->flush_static_cb)
+ continue;
+
+ /* Skip transactional stats if we're in anytime_only mode */
+ if (anytime_only && kind_info->flush_mode == FLUSH_AT_TXN_BOUNDARY)
+ continue;
+
+ partial_flush |= kind_info->flush_static_cb(nowait);
+ }
+
+ return partial_flush;
+}
/* ------------------------------------------------------------
* Helper / infrastructure functions
@@ -2119,3 +2164,33 @@ assign_stats_fetch_consistency(int newval, void *extra)
if (pgstat_fetch_consistency != newval)
force_stats_snapshot_clear = true;
}
+
+/*
+ * Flushes only FLUSH_ANYTIME stats using non-blocking locks. Transactional
+ * stats (FLUSH_AT_TXN_BOUNDARY) remain pending until transaction boundary.
+ * Safe to call inside transactions.
+ */
+void
+pgstat_report_anytime_stat(bool force)
+{
+ bool nowait = !force;
+
+ pgstat_assert_is_up();
+
+ /*
+ * Exit if no pending stats at all. This avoids unnecessary work when
+ * backends are idle or in sessions without stats accumulation.
+ *
+ * Note: This check isn't precise as there might be only transactional
+ * stats pending, which we'll skip during the flush. However, maintaining
+ * precise tracking would add complexity that does not seem worth it from
+ * a performance point of view (no noticeable performance regression has
+ * been observed with the current implementation).
+ */
+ if (dlist_is_empty(&pgStatPending) && !pgstat_report_fixed)
+ return;
+
+ /* Flush stats outside of transaction boundary */
+ pgstat_flush_pending_entries(nowait, true);
+ pgstat_flush_fixed_stats(nowait, true);
+}
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 36ad708b360..ad44826c39e 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -40,6 +40,7 @@ volatile sig_atomic_t IdleSessionTimeoutPending = false;
volatile sig_atomic_t ProcSignalBarrierPending = false;
volatile sig_atomic_t LogMemoryContextPending = false;
volatile sig_atomic_t IdleStatsUpdateTimeoutPending = false;
+volatile sig_atomic_t AnytimeStatsUpdateTimeoutPending = false;
volatile uint32 InterruptHoldoffCount = 0;
volatile uint32 QueryCancelHoldoffCount = 0;
volatile uint32 CritSectionCount = 0;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 3f401faf3de..6076f531c4a 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -82,6 +82,7 @@ static void TransactionTimeoutHandler(void);
static void IdleSessionTimeoutHandler(void);
static void IdleStatsUpdateTimeoutHandler(void);
static void ClientCheckTimeoutHandler(void);
+static void AnytimeStatsUpdateTimeoutHandler(void);
static bool ThereIsAtLeastOneRole(void);
static void process_startup_options(Port *port, bool am_superuser);
static void process_settings(Oid databaseid, Oid roleid);
@@ -765,6 +766,9 @@ InitPostgres(const char *in_dbname, Oid dboid,
RegisterTimeout(CLIENT_CONNECTION_CHECK_TIMEOUT, ClientCheckTimeoutHandler);
RegisterTimeout(IDLE_STATS_UPDATE_TIMEOUT,
IdleStatsUpdateTimeoutHandler);
+ RegisterTimeout(ANYTIME_STATS_UPDATE_TIMEOUT,
+ AnytimeStatsUpdateTimeoutHandler);
+ enable_timeout_after(ANYTIME_STATS_UPDATE_TIMEOUT, PGSTAT_MIN_INTERVAL);
}
/*
@@ -1446,3 +1450,14 @@ ThereIsAtLeastOneRole(void)
return result;
}
+
+/*
+ * Timeout handler for flushing non-transactional stats.
+ */
+static void
+AnytimeStatsUpdateTimeoutHandler(void)
+{
+ AnytimeStatsUpdateTimeoutPending = true;
+ InterruptPending = true;
+ SetLatch(MyLatch);
+}
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index db559b39c4d..8aeb9628871 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -96,6 +96,7 @@ extern PGDLLIMPORT volatile sig_atomic_t IdleSessionTimeoutPending;
extern PGDLLIMPORT volatile sig_atomic_t ProcSignalBarrierPending;
extern PGDLLIMPORT volatile sig_atomic_t LogMemoryContextPending;
extern PGDLLIMPORT volatile sig_atomic_t IdleStatsUpdateTimeoutPending;
+extern PGDLLIMPORT volatile sig_atomic_t AnytimeStatsUpdateTimeoutPending;
extern PGDLLIMPORT volatile sig_atomic_t CheckClientConnectionPending;
extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost;
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index fff7ecc2533..1651f16f966 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -35,6 +35,9 @@
/* Default directory to store temporary statistics data in */
#define PG_STAT_TMP_DIR "pg_stat_tmp"
+/* Minimum interval between non-forced stats flushes, in ms */
+#define PGSTAT_MIN_INTERVAL 1000
+
/* Values for track_functions GUC variable --- order is significant! */
typedef enum TrackFunctionsLevel
{
@@ -533,6 +536,7 @@ extern void pgstat_initialize(void);
/* Functions called from backends */
extern long pgstat_report_stat(bool force);
+extern void pgstat_report_anytime_stat(bool force);
extern void pgstat_force_next_flush(void);
extern void pgstat_reset_counters(void);
diff --git a/src/include/utils/pgstat_internal.h b/src/include/utils/pgstat_internal.h
index 9b8fbae00ed..46ce90c9624 100644
--- a/src/include/utils/pgstat_internal.h
+++ b/src/include/utils/pgstat_internal.h
@@ -224,6 +224,19 @@ typedef struct PgStat_SubXactStatus
PgStat_TableXactStatus *first; /* head of list for this subxact */
} PgStat_SubXactStatus;
+/*
+ * Flush mode for statistics kinds.
+ *
+ * FLUSH_AT_TXN_BOUNDARY has to be listed first because we want it to be the
+ * default (zero-initialized) value.
+ */
+typedef enum PgStat_FlushMode
+{
+ FLUSH_AT_TXN_BOUNDARY, /* All fields can only be flushed at
+ * transaction boundary */
+ FLUSH_ANYTIME, /* All fields can be flushed anytime,
+ * including within transactions */
+} PgStat_FlushMode;
/*
* Metadata for a specific kind of statistics.
@@ -251,6 +264,9 @@ typedef struct PgStat_KindInfo
*/
bool track_entry_count:1;
+ /* Flush mode */
+ PgStat_FlushMode flush_mode;
+
/*
* The size of an entry in the shared stats hash table (pointed to by
* PgStatShared_HashEntry->body). For fixed-numbered statistics, this is
diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h
index 0965b590b34..10723bb664c 100644
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -35,6 +35,7 @@ typedef enum TimeoutId
IDLE_SESSION_TIMEOUT,
IDLE_STATS_UPDATE_TIMEOUT,
CLIENT_CONNECTION_CHECK_TIMEOUT,
+ ANYTIME_STATS_UPDATE_TIMEOUT,
STARTUP_PROGRESS_TIMEOUT,
/* First user-definable timeout reason */
USER_TIMEOUT,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 3f3a888fd0e..d3912b43fdc 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2268,6 +2268,7 @@ PgStat_Counter
PgStat_EntryRef
PgStat_EntryRefHashEntry
PgStat_FetchConsistency
+PgStat_FlushMode
PgStat_FunctionCallUsage
PgStat_FunctionCounts
PgStat_HashKey
--
2.34.1
v3-0002-Remove-useless-calls-to-flush-some-stats.patch
From 6bce329c6597b7bbf72dfa1674e1f0e796a7621d Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Date: Tue, 6 Jan 2026 11:06:31 +0000
Subject: [PATCH v3 2/3] Remove useless calls to flush some stats
Now that some stats can be flushed outside of transaction boundaries, remove
useless calls to report/flush some stats. Those calls were in place because
before commit <XXXX> stats were flushed only at transaction boundaries.
Note that:
- it reverts 039549d70f6 (it just keeps its tests)
- it can't be done for checkpointer and bgworker for example because they don't
have a flush callback to call
- it can't be done for auxiliary process (walsummarizer for example) because they
currently do not register the new timeout handler
---
src/backend/replication/walreceiver.c | 10 ------
src/backend/replication/walsender.c | 36 ++------------------
src/backend/utils/activity/pgstat_relation.c | 13 -------
3 files changed, 2 insertions(+), 57 deletions(-)
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index a41453530a1..266379c780a 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -553,16 +553,6 @@ WalReceiverMain(const void *startup_data, size_t startup_data_len)
*/
bool requestReply = false;
- /*
- * Report pending statistics to the cumulative stats
- * system. This location is useful for the report as it
- * is not within a tight loop in the WAL receiver, to
- * avoid bloating pgstats with requests, while also making
- * sure that the reports happen each time a status update
- * is sent.
- */
- pgstat_report_wal(false);
-
/*
* Check if time since last receive from primary has
* reached the configured limit.
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 1ab09655a70..c33185bd337 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -94,14 +94,10 @@
#include "utils/lsyscache.h"
#include "utils/memutils.h"
#include "utils/pg_lsn.h"
-#include "utils/pgstat_internal.h"
#include "utils/ps_status.h"
#include "utils/timeout.h"
#include "utils/timestamp.h"
-/* Minimum interval used by walsender for stats flushes, in ms */
-#define WALSENDER_STATS_FLUSH_INTERVAL 1000
-
/*
* Maximum data payload in a WAL data message. Must be >= XLOG_BLCKSZ.
*
@@ -1826,7 +1822,6 @@ WalSndWaitForWal(XLogRecPtr loc)
int wakeEvents;
uint32 wait_event = 0;
static XLogRecPtr RecentFlushPtr = InvalidXLogRecPtr;
- TimestampTz last_flush = 0;
/*
* Fast path to avoid acquiring the spinlock in case we already know we
@@ -1847,7 +1842,6 @@ WalSndWaitForWal(XLogRecPtr loc)
{
bool wait_for_standby_at_stop = false;
long sleeptime;
- TimestampTz now;
/* Clear any already-pending wakeups */
ResetLatch(MyLatch);
@@ -1958,8 +1952,7 @@ WalSndWaitForWal(XLogRecPtr loc)
* new WAL to be generated. (But if we have nothing to send, we don't
* want to wake on socket-writable.)
*/
- now = GetCurrentTimestamp();
- sleeptime = WalSndComputeSleeptime(now);
+ sleeptime = WalSndComputeSleeptime(GetCurrentTimestamp());
wakeEvents = WL_SOCKET_READABLE;
@@ -1968,15 +1961,6 @@ WalSndWaitForWal(XLogRecPtr loc)
Assert(wait_event != 0);
- /* Report IO statistics, if needed */
- if (TimestampDifferenceExceeds(last_flush, now,
- WALSENDER_STATS_FLUSH_INTERVAL))
- {
- pgstat_flush_io(false);
- (void) pgstat_flush_backend(false, PGSTAT_BACKEND_FLUSH_IO);
- last_flush = now;
- }
-
WalSndWait(wakeEvents, sleeptime, wait_event);
}
@@ -2879,8 +2863,6 @@ WalSndCheckTimeOut(void)
static void
WalSndLoop(WalSndSendDataCallback send_data)
{
- TimestampTz last_flush = 0;
-
/*
* Initialize the last reply timestamp. That enables timeout processing
* from hereon.
@@ -2975,9 +2957,6 @@ WalSndLoop(WalSndSendDataCallback send_data)
* WalSndWaitForWal() handle any other blocking; idle receivers need
* its additional actions. For physical replication, also block if
* caught up; its send_data does not block.
- *
- * The IO statistics are reported in WalSndWaitForWal() for the
- * logical WAL senders.
*/
if ((WalSndCaughtUp && send_data != XLogSendLogical &&
!streamingDoneSending) ||
@@ -2985,7 +2964,6 @@ WalSndLoop(WalSndSendDataCallback send_data)
{
long sleeptime;
int wakeEvents;
- TimestampTz now;
if (!streamingDoneReceiving)
wakeEvents = WL_SOCKET_READABLE;
@@ -2996,21 +2974,11 @@ WalSndLoop(WalSndSendDataCallback send_data)
* Use fresh timestamp, not last_processing, to reduce the chance
* of reaching wal_sender_timeout before sending a keepalive.
*/
- now = GetCurrentTimestamp();
- sleeptime = WalSndComputeSleeptime(now);
+ sleeptime = WalSndComputeSleeptime(GetCurrentTimestamp());
if (pq_is_send_pending())
wakeEvents |= WL_SOCKET_WRITEABLE;
- /* Report IO statistics, if needed */
- if (TimestampDifferenceExceeds(last_flush, now,
- WALSENDER_STATS_FLUSH_INTERVAL))
- {
- pgstat_flush_io(false);
- (void) pgstat_flush_backend(false, PGSTAT_BACKEND_FLUSH_IO);
- last_flush = now;
- }
-
/* Sleep until something happens or we time out */
WalSndWait(wakeEvents, sleeptime, WAIT_EVENT_WAL_SENDER_MAIN);
}
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index bc8c43b96aa..feae2ae5f44 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -260,15 +260,6 @@ pgstat_report_vacuum(Relation rel, PgStat_Counter livetuples,
}
pgstat_unlock_entry(entry_ref);
-
- /*
- * Flush IO statistics now. pgstat_report_stat() will flush IO stats,
- * however this will not be called until after an entire autovacuum cycle
- * is done -- which will likely vacuum many relations -- or until the
- * VACUUM command has processed all tables and committed.
- */
- pgstat_flush_io(false);
- (void) pgstat_flush_backend(false, PGSTAT_BACKEND_FLUSH_IO);
}
/*
@@ -360,10 +351,6 @@ pgstat_report_analyze(Relation rel,
}
pgstat_unlock_entry(entry_ref);
-
- /* see pgstat_report_vacuum() */
- pgstat_flush_io(false);
- (void) pgstat_flush_backend(false, PGSTAT_BACKEND_FLUSH_IO);
}
/*
--
2.34.1
v3-0003-Add-FLUSH_MIXED-support-and-implement-it-for-RELA.patch
From 121f38f0fd1105a15d7acf8c25db23a4196b84da Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Date: Mon, 19 Jan 2026 06:27:55 +0000
Subject: [PATCH v3 3/3] Add FLUSH_MIXED support and implement it for RELATION
stats
This commit extends the non transactional stats infrastructure to support statistics
kinds with mixed transaction behavior: some fields are transactional (e.g., tuple
inserts/updates/deletes) while others are non transactional (e.g., sequential scans
blocks read, ...).
It introduces FLUSH_MIXED as a third flush mode type, alongside FLUSH_ANYTIME
and FLUSH_AT_TXN_BOUNDARY. For FLUSH_MIXED kinds, a new flush_anytime_cb callback
enables partial flushing of only the non transactional fields during running
transactions.
Some tests are also added.
Implementation details:
- Add FLUSH_MIXED to PgStat_FlushMode enum
- Add flush_anytime_cb to PgStat_KindInfo for partial flushing callback
- Update pgstat_flush_pending_entries() to call flush_anytime_cb for
FLUSH_MIXED entries when in anytime_only mode
- Keep FLUSH_MIXED entries in the pending list after partial flush, as
transactional fields still need to be flushed at transaction boundary
RELATION stats are making use of FLUSH_MIXED:
- Change RELATION from FLUSH_AT_TXN_BOUNDARY to FLUSH_MIXED
- Implement pgstat_relation_flush_anytime_cb() to flush only read related
stats: numscans, tuples_returned, tuples_fetched, blocks_fetched,
blocks_hit
- Clear these fields after flushing to prevent double counting when
pgstat_relation_flush_cb() runs at transaction commit
- Transactional stats (tuples_inserted, tuples_updated, tuples_deleted,
live_tuples, dead_tuples) remain pending until transaction boundary
The DATABASE kind is also changed from FLUSH_AT_TXN_BOUNDARY to FLUSH_ANYTIME, so
that some stats inherited from relations stats are also visible while the transaction
is in progress.
Remark:
We could also imagine adding a new flush_anytime_static_cb() callback for
future FLUSH_MIXED fixed amount stats.
---
doc/src/sgml/monitoring.sgml | 26 +++++++
src/backend/utils/activity/pgstat.c | 32 ++++++--
src/backend/utils/activity/pgstat_relation.c | 78 ++++++++++++++++++++
src/include/utils/pgstat_internal.h | 9 +++
src/test/isolation/expected/stats.out | 40 ++++++++++
src/test/isolation/expected/stats_1.out | 40 ++++++++++
src/test/isolation/specs/stats.spec | 12 +++
7 files changed, 229 insertions(+), 8 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 817fd9f4ca7..94fd2b76136 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3730,6 +3730,16 @@ description | Waiting for a newly initialized WAL file to reach durable storage
</tgroup>
</table>
+ <note>
+ <para>
+ All the statistics are updated while transactions are in progress, except
+ for <structfield>xact_commit</structfield>, <structfield>xact_rollback</structfield>,
+ <structfield>tup_inserted</structfield>, <structfield>tup_updated</structfield> and
+ <structfield>tup_deleted</structfield>, which are updated only when the transactions
+ finish.
+ </para>
+ </note>
+
</sect2>
<sect2 id="monitoring-pg-stat-database-conflicts-view">
@@ -4186,6 +4196,15 @@ description | Waiting for a newly initialized WAL file to reach durable storage
</tgroup>
</table>
+ <note>
+ <para>
+ The <structfield>seq_scan</structfield>, <structfield>last_seq_scan</structfield>,
+ <structfield>seq_tup_read</structfield>, <structfield>idx_scan</structfield>,
+ <structfield>last_idx_scan</structfield> and <structfield>idx_tup_fetch</structfield>
+ columns are updated while transactions are in progress.
+ </para>
+ </note>
+
</sect2>
<sect2 id="monitoring-pg-stat-all-indexes-view">
@@ -4367,6 +4386,13 @@ description | Waiting for a newly initialized WAL file to reach durable storage
tuples (see <xref linkend="indexes-multicolumn"/>).
</para>
</note>
+ <note>
+ <para>
+ The <structfield>idx_scan</structfield>, <structfield>last_idx_scan</structfield>,
+ <structfield>idx_tup_read</structfield> and <structfield>idx_tup_fetch</structfield>
+ columns are updated while transactions are in progress.
+ </para>
+ </note>
<tip>
<para>
<command>EXPLAIN ANALYZE</command> outputs the total number of index
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index ab4d9088a9a..6733a739c56 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -287,7 +287,7 @@ static const PgStat_KindInfo pgstat_kind_builtin_infos[PGSTAT_KIND_BUILTIN_SIZE]
.fixed_amount = false,
.write_to_file = true,
- .flush_mode = FLUSH_AT_TXN_BOUNDARY,
+ .flush_mode = FLUSH_ANYTIME,
/* so pg_stat_database entries can be seen in all databases */
.accessed_across_databases = true,
@@ -305,7 +305,7 @@ static const PgStat_KindInfo pgstat_kind_builtin_infos[PGSTAT_KIND_BUILTIN_SIZE]
.fixed_amount = false,
.write_to_file = true,
- .flush_mode = FLUSH_AT_TXN_BOUNDARY,
+ .flush_mode = FLUSH_MIXED,
.shared_size = sizeof(PgStatShared_Relation),
.shared_data_off = offsetof(PgStatShared_Relation, stats),
@@ -313,6 +313,7 @@ static const PgStat_KindInfo pgstat_kind_builtin_infos[PGSTAT_KIND_BUILTIN_SIZE]
.pending_size = sizeof(PgStat_TableStatus),
.flush_pending_cb = pgstat_relation_flush_cb,
+ .flush_anytime_cb = pgstat_relation_flush_anytime_cb,
.delete_pending_cb = pgstat_relation_delete_pending_cb,
.reset_timestamp_cb = pgstat_relation_reset_timestamp_cb,
},
@@ -1345,10 +1346,11 @@ pgstat_delete_pending_entry(PgStat_EntryRef *entry_ref)
/*
* Flush out pending variable-numbered stats.
*
- * If anytime_only is true, only flushes FLUSH_ANYTIME entries.
+ * If anytime_only is true, only flushes FLUSH_ANYTIME and FLUSH_MIXED entries,
+ * using flush_anytime_cb for FLUSH_MIXED.
* This is safe to call inside transactions.
*
- * If anytime_only is false, flushes all entries.
+ * If anytime_only is false, flushes all entries using flush_pending_cb.
*/
static bool
pgstat_flush_pending_entries(bool nowait, bool anytime_only)
@@ -1376,6 +1378,7 @@ pgstat_flush_pending_entries(bool nowait, bool anytime_only)
PgStat_Kind kind = key.kind;
const PgStat_KindInfo *kind_info = pgstat_get_kind_info(kind);
bool did_flush;
+ bool is_partial_flush = false;
dlist_node *next;
Assert(!kind_info->fixed_amount);
@@ -1395,8 +1398,21 @@ pgstat_flush_pending_entries(bool nowait, bool anytime_only)
continue;
}
- /* flush the stats, if possible */
- did_flush = kind_info->flush_pending_cb(entry_ref, nowait);
+ /* flush the stats (with the appropriate callback), if possible */
+ if (anytime_only &&
+ kind_info->flush_mode == FLUSH_MIXED &&
+ kind_info->flush_anytime_cb != NULL)
+ {
+ /* Partial flush of non-transactional fields only */
+ did_flush = kind_info->flush_anytime_cb(entry_ref, nowait);
+ is_partial_flush = true;
+ }
+ else
+ {
+ /* Full flush */
+ did_flush = kind_info->flush_pending_cb(entry_ref, nowait);
+ is_partial_flush = false;
+ }
Assert(did_flush || nowait);
@@ -1406,8 +1422,8 @@ pgstat_flush_pending_entries(bool nowait, bool anytime_only)
else
next = NULL;
- /* if successfully flushed, remove entry */
- if (did_flush)
+ /* if successful and not a partial flush, remove the entry */
+ if (did_flush && !is_partial_flush)
pgstat_delete_pending_entry(entry_ref);
else
have_pending = true;
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index feae2ae5f44..d6b799c4354 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -887,6 +887,84 @@ pgstat_relation_flush_cb(PgStat_EntryRef *entry_ref, bool nowait)
return true;
}
+/*
+ * Flush only non-transactional relation stats.
+ *
+ * This is called periodically during running transactions to make some
+ * statistics visible without waiting for the transaction to finish.
+ *
+ * Transactional stats (inserts/updates/deletes and their effects on live/dead
+ * tuple counts) remain in pending until the transaction ends, at which point
+ * pgstat_relation_flush_cb() will flush them.
+ *
+ * If nowait is true and the lock could not be immediately acquired, returns
+ * false without flushing the entry. Otherwise returns true.
+ */
+bool
+pgstat_relation_flush_anytime_cb(PgStat_EntryRef *entry_ref, bool nowait)
+{
+ Oid dboid;
+ PgStat_TableStatus *lstats; /* pending stats entry */
+ PgStatShared_Relation *shtabstats;
+ PgStat_StatTabEntry *tabentry; /* table entry of shared stats */
+ PgStat_StatDBEntry *dbentry; /* pending database entry */
+
+ dboid = entry_ref->shared_entry->key.dboid;
+ lstats = (PgStat_TableStatus *) entry_ref->pending;
+ shtabstats = (PgStatShared_Relation *) entry_ref->shared_stats;
+
+ /*
+ * Check if there are any non-transactional stats to flush. Avoid
+ * unnecessarily locking the entry if nothing accumulated.
+ */
+ if (!(lstats->counts.numscans > 0 ||
+ lstats->counts.tuples_returned > 0 ||
+ lstats->counts.tuples_fetched > 0 ||
+ lstats->counts.blocks_fetched > 0 ||
+ lstats->counts.blocks_hit > 0))
+ return true;
+
+ if (!pgstat_lock_entry(entry_ref, nowait))
+ return false;
+
+ /* Add only the non-transactional values to the shared entry */
+ tabentry = &shtabstats->stats;
+
+ tabentry->numscans += lstats->counts.numscans;
+ if (lstats->counts.numscans)
+ {
+ TimestampTz t = GetCurrentTimestamp();
+
+ if (t > tabentry->lastscan)
+ tabentry->lastscan = t;
+ }
+ tabentry->tuples_returned += lstats->counts.tuples_returned;
+ tabentry->tuples_fetched += lstats->counts.tuples_fetched;
+ tabentry->blocks_fetched += lstats->counts.blocks_fetched;
+ tabentry->blocks_hit += lstats->counts.blocks_hit;
+
+ pgstat_unlock_entry(entry_ref);
+
+ /* Also update the corresponding fields in database stats */
+ dbentry = pgstat_prep_database_pending(dboid);
+ dbentry->tuples_returned += lstats->counts.tuples_returned;
+ dbentry->tuples_fetched += lstats->counts.tuples_fetched;
+ dbentry->blocks_fetched += lstats->counts.blocks_fetched;
+ dbentry->blocks_hit += lstats->counts.blocks_hit;
+
+ /*
+ * Clear the flushed fields from pending stats to prevent double-counting
+ * when pgstat_relation_flush_cb() runs at transaction boundary.
+ */
+ lstats->counts.numscans = 0;
+ lstats->counts.tuples_returned = 0;
+ lstats->counts.tuples_fetched = 0;
+ lstats->counts.blocks_fetched = 0;
+ lstats->counts.blocks_hit = 0;
+
+ return true;
+}
+
void
pgstat_relation_delete_pending_cb(PgStat_EntryRef *entry_ref)
{
diff --git a/src/include/utils/pgstat_internal.h b/src/include/utils/pgstat_internal.h
index 46ce90c9624..5f339f6d2ef 100644
--- a/src/include/utils/pgstat_internal.h
+++ b/src/include/utils/pgstat_internal.h
@@ -236,6 +236,8 @@ typedef enum PgStat_FlushMode
* transaction boundary */
FLUSH_ANYTIME, /* All fields can be flushed anytime,
* including within transactions */
+ FLUSH_MIXED, /* Mix of fields that can be flushed anytime
+ * or only at transaction boundary */
} PgStat_FlushMode;
/*
@@ -267,6 +269,12 @@ typedef struct PgStat_KindInfo
/* Flush mode */
PgStat_FlushMode flush_mode;
+ /*
+ * For FLUSH_MIXED kinds: callback to flush only some fields. If NULL for
+ * a MIXED kind, treated as FLUSH_AT_TXN_BOUNDARY.
+ */
+ bool (*flush_anytime_cb) (PgStat_EntryRef *entry_ref, bool nowait);
+
/*
* The size of an entry in the shared stats hash table (pointed to by
* PgStatShared_HashEntry->body). For fixed-numbered statistics, this is
@@ -779,6 +787,7 @@ extern void AtPrepare_PgStat_Relations(PgStat_SubXactStatus *xact_state);
extern void PostPrepare_PgStat_Relations(PgStat_SubXactStatus *xact_state);
extern bool pgstat_relation_flush_cb(PgStat_EntryRef *entry_ref, bool nowait);
+extern bool pgstat_relation_flush_anytime_cb(PgStat_EntryRef *entry_ref, bool nowait);
extern void pgstat_relation_delete_pending_cb(PgStat_EntryRef *entry_ref);
extern void pgstat_relation_reset_timestamp_cb(PgStatShared_Common *header, TimestampTz ts);
diff --git a/src/test/isolation/expected/stats.out b/src/test/isolation/expected/stats.out
index cfad309ccf3..6d62b30e4a7 100644
--- a/src/test/isolation/expected/stats.out
+++ b/src/test/isolation/expected/stats.out
@@ -2245,6 +2245,46 @@ seq_scan|seq_tup_read|n_tup_ins|n_tup_upd|n_tup_del|n_live_tup|n_dead_tup|vacuum
(1 row)
+starting permutation: s2_begin s2_table_select s1_sleep s1_table_stats s2_table_drop s2_commit
+pg_stat_force_next_flush
+------------------------
+
+(1 row)
+
+step s2_begin: BEGIN;
+step s2_table_select: SELECT * FROM test_stat_tab ORDER BY key, value;
+key|value
+---+-----
+k0 | 1
+(1 row)
+
+step s1_sleep: SELECT pg_sleep(1.5);
+pg_sleep
+--------
+
+(1 row)
+
+step s1_table_stats:
+ SELECT
+ pg_stat_get_numscans(tso.oid) AS seq_scan,
+ pg_stat_get_tuples_returned(tso.oid) AS seq_tup_read,
+ pg_stat_get_tuples_inserted(tso.oid) AS n_tup_ins,
+ pg_stat_get_tuples_updated(tso.oid) AS n_tup_upd,
+ pg_stat_get_tuples_deleted(tso.oid) AS n_tup_del,
+ pg_stat_get_live_tuples(tso.oid) AS n_live_tup,
+ pg_stat_get_dead_tuples(tso.oid) AS n_dead_tup,
+ pg_stat_get_vacuum_count(tso.oid) AS vacuum_count
+ FROM test_stat_oid AS tso
+ WHERE tso.name = 'test_stat_tab'
+
+seq_scan|seq_tup_read|n_tup_ins|n_tup_upd|n_tup_del|n_live_tup|n_dead_tup|vacuum_count
+--------+------------+---------+---------+---------+----------+----------+------------
+ 1| 1| 1| 0| 0| 1| 0| 0
+(1 row)
+
+step s2_table_drop: DROP TABLE test_stat_tab;
+step s2_commit: COMMIT;
+
starting permutation: s1_track_counts_off s1_table_stats s1_track_counts_on
pg_stat_force_next_flush
------------------------
diff --git a/src/test/isolation/expected/stats_1.out b/src/test/isolation/expected/stats_1.out
index e1d937784cb..2fade10e817 100644
--- a/src/test/isolation/expected/stats_1.out
+++ b/src/test/isolation/expected/stats_1.out
@@ -2253,6 +2253,46 @@ seq_scan|seq_tup_read|n_tup_ins|n_tup_upd|n_tup_del|n_live_tup|n_dead_tup|vacuum
(1 row)
+starting permutation: s2_begin s2_table_select s1_sleep s1_table_stats s2_table_drop s2_commit
+pg_stat_force_next_flush
+------------------------
+
+(1 row)
+
+step s2_begin: BEGIN;
+step s2_table_select: SELECT * FROM test_stat_tab ORDER BY key, value;
+key|value
+---+-----
+k0 | 1
+(1 row)
+
+step s1_sleep: SELECT pg_sleep(1.5);
+pg_sleep
+--------
+
+(1 row)
+
+step s1_table_stats:
+ SELECT
+ pg_stat_get_numscans(tso.oid) AS seq_scan,
+ pg_stat_get_tuples_returned(tso.oid) AS seq_tup_read,
+ pg_stat_get_tuples_inserted(tso.oid) AS n_tup_ins,
+ pg_stat_get_tuples_updated(tso.oid) AS n_tup_upd,
+ pg_stat_get_tuples_deleted(tso.oid) AS n_tup_del,
+ pg_stat_get_live_tuples(tso.oid) AS n_live_tup,
+ pg_stat_get_dead_tuples(tso.oid) AS n_dead_tup,
+ pg_stat_get_vacuum_count(tso.oid) AS vacuum_count
+ FROM test_stat_oid AS tso
+ WHERE tso.name = 'test_stat_tab'
+
+seq_scan|seq_tup_read|n_tup_ins|n_tup_upd|n_tup_del|n_live_tup|n_dead_tup|vacuum_count
+--------+------------+---------+---------+---------+----------+----------+------------
+ 0| 0| 1| 0| 0| 1| 0| 0
+(1 row)
+
+step s2_table_drop: DROP TABLE test_stat_tab;
+step s2_commit: COMMIT;
+
starting permutation: s1_track_counts_off s1_table_stats s1_track_counts_on
pg_stat_force_next_flush
------------------------
diff --git a/src/test/isolation/specs/stats.spec b/src/test/isolation/specs/stats.spec
index da16710da0f..1b0168e6176 100644
--- a/src/test/isolation/specs/stats.spec
+++ b/src/test/isolation/specs/stats.spec
@@ -50,6 +50,8 @@ step s1_rollback { ROLLBACK; }
step s1_prepare_a { PREPARE TRANSACTION 'a'; }
step s1_commit_prepared_a { COMMIT PREPARED 'a'; }
step s1_rollback_prepared_a { ROLLBACK PREPARED 'a'; }
+# Has to be greater than PGSTAT_MIN_INTERVAL (1s)
+step s1_sleep { SELECT pg_sleep(1.5); }
# Function stats steps
step s1_ff { SELECT pg_stat_force_next_flush(); }
@@ -138,6 +140,7 @@ step s2_commit { COMMIT; }
step s2_commit_prepared_a { COMMIT PREPARED 'a'; }
step s2_rollback_prepared_a { ROLLBACK PREPARED 'a'; }
step s2_ff { SELECT pg_stat_force_next_flush(); }
+step s2_table_drop { DROP TABLE test_stat_tab; }
# Function stats steps
step s2_track_funcs_all { SET track_functions = 'all'; }
@@ -435,6 +438,15 @@ permutation
s1_table_drop
s1_table_stats
+### Check that some stats are updated (seq_scan and seq_tup_read)
+### while the transaction is still running
+permutation
+ s2_begin
+ s2_table_select
+ s1_sleep
+ s1_table_stats
+ s2_table_drop
+ s2_commit
### Check that we don't count changes with track counts off, but allow access
### to prior stats
--
2.34.1
Thanks for the updated patches!
No, 0003 also changes the flush mode for the database KIND. All the fields that
I mentioned are inherited from relations stats and are flushed only at transaction
boundaries (so they don't appear in pg_stat_database until the transaction
finishes). Does that make sense?
Yes, I understand it clearly now.
But, the Note under pg_stat_database reads like this:
"All the statistics are updated while the transactions are in progress,
except for xact_commit, xact_rollback, tup_inserted, tup_updated
and tup_deleted that are updated only when the transactions finish."
But that is not true for all pg_stat_database fields, such as session_time,
active_time, idle_in_transaction_time, etc. From what I can tell, some of these
fields are only updated when the connection is closed. For example,
in one session run "select pg_sleep(10)" and in another session monitor
pg_stat_database.active_time. That will not be updated until the session
is closed.
This is because these are not relation stats, which makes sense. The
Note section should elaborate more on this, right?
--
Sami Imseih
Amazon Web Services (AWS)
On Wed, Jan 21, 2026 at 10:34:09AM +0000, Bertrand Drouvot wrote:
No, 0003 also changes the flush mode for the database KIND. All the fields that
I mentioned are inherited from relations stats and are flushed only at transaction
boundaries (so they don't appear in pg_stat_database until the transaction
finishes). Does that make sense? (if the database kind is not switched to
flush any time then none would appear while the transaction is in progress, even
the ones inherited from relations stats).

PFA v3, also taking care of Zsolt's comment (thanks!) done up-thread.
While reading through 0001, I got to wonder which properties
and/or assumptions of a stats kind one has to rely on to decide what
flush_mode should be set to. To put it more simply, why don't we just
do a periodic pgstat_report_stat(false) call that would flush all the
stats for all stats kinds based on the new timeout registered,
expanding a bit the flush we currently do when idle in
ProcessInterrupts()? It seems that one point of contention should be
that we should be careful with entries in the shmem hash table that
have been created in a transactional way, but we may already flush
them while we are in a transaction state, no? Are there any fields in
a stats kind that we may not want to flush? If yes, it sounds to
me that it would be better to document these in the structures to
explain the reason why a flush mode is chosen over the other.
I am also not convinced that we have to be that aggressive with these
extra flushes. The target is long-running analytical queries, that
could take minutes or even hours. Using the same value as
PGSTAT_IDLE_INTERVAL (10s), perhaps renaming the value while on it,
would be a more natural fit. A 1s vs 10s report interval does not
really matter for long analytical queries, where I'd imagine data
being picked up on at least a 30s interval, at the shortest. Of
course, one may want to get a more "live" representation of the data
with more aggressive flushes, but is the extra granularity really helpful
for long-running queries, given that it stresses the shmem state more?
--
Michael
No, 0003 also changes the flush mode for the database KIND. All the fields that
I mentioned are inherited from relations stats and are flushed only at transaction
boundaries (so they don't appear in pg_stat_database until the transaction
finishes). Does that make sense? (if the database kind is not switched to
flush any time then none would appear while the transaction is in progress, even
the ones inherited from relations stats).

PFA v3, also taking care of Zsolt's comment (thanks!) done up-thread.
While reading through 0001, I got to question on which properties
and/or assumptions of a stats kind one has to rely on to decide to
what flush_mode should be set. To put is simpler, why don't we just
do a periodic pgstat_report_stat(false) call that would flush all the
stats for all stats kinds based on the new timeout registered,
expanding a bit the flush we currently do when idle in
ProcessInterrupts()?
There are some important cases in which we would want to
distinguish between a "transaction boundary" flush vs an
"anytime" flush.
For example, xact_commit/rollback. I would want those
fields to be in sync with tuples_inserted/updated/deleted
to allow for accurate calculations like number of inserts
per commit, etc.
Another one would be n_mod_since_analyze; that should
only be updated after commit (and not after a rollback). Otherwise,
it may throw autoanalyze threshold calculations way off. Same
for n_dead_tup and autovacuum.
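For context, the analyze trigger is based on that counter; a rough paraphrase
of the check done in autovacuum.c (not the exact code) shows why a premature
flush would matter:

/*
 * Rough paraphrase of the autoanalyze trigger computed in
 * relation_needs_vacanalyze() (autovacuum.c):
 *
 *   anlthresh = autovacuum_analyze_threshold
 *             + autovacuum_analyze_scale_factor * reltuples;
 *   doanalyze = (n_mod_since_analyze > anlthresh);
 *
 * If n_mod_since_analyze were flushed before the transaction's fate is
 * known, doanalyze could become true based on changes that may still be
 * rolled back.
 */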
I am also not convinced that we have to be that aggressive with these
extra flushes. The target is long-running analytical queries, that
could take minutes or even hours. Using the same value as
PGSTAT_IDLE_INTERVAL (10s),
PGSTAT_IDLE_INTERVAL is flushing an idle backend every 10 seconds
IIUC. So this value only applies when outside of a transaction.
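For reference, a rough sketch of how that works today, paraphrased and
simplified from PostgresMain() (the real code also handles the
aborted-transaction case, omitted here):

static bool idle_stats_update_pending = false;	/* flag kept in postgres.c */

/*
 * Paraphrased/simplified: the stats report and IDLE_STATS_UPDATE_TIMEOUT
 * are only armed when the backend goes idle outside of a transaction
 * block, so PGSTAT_IDLE_INTERVAL never kicks in while a transaction is
 * running.
 */
if (!IsTransactionOrTransactionBlock())
{
	long		stats_timeout;

	stats_timeout = pgstat_report_stat(false);
	if (stats_timeout > 0)
	{
		if (!idle_stats_update_pending)
		{
			idle_stats_update_pending = true;
			enable_timeout_after(IDLE_STATS_UPDATE_TIMEOUT, stats_timeout);
		}
	}
	else if (idle_stats_update_pending)
	{
		/* everything got flushed, no need for the timeout anymore */
		disable_timeout(IDLE_STATS_UPDATE_TIMEOUT, false);
		idle_stats_update_pending = false;
	}
}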
A 1s vs 10s report interval does not really matter for long analytical queries.
Sure, Bertrand mentioned early in the thread that the anytime flushes
could be made configurable. Perhaps that is a good idea: we could
default to something large like a 10s interval for anytime flushes, but allow
the user to configure more frequent flushes (although I would think
that 1 sec is the minimum we should allow).
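If we go that route, the GUC itself should be pretty mechanical; a hypothetical
sketch of a guc_tables.c entry (the name, backing variable and bounds below are
made up for illustration):

/* hypothetical entry in ConfigureNamesInt[], guc_tables.c */
{
	{"stats_anytime_flush_interval", PGC_USERSET, STATS_CUMULATIVE,
		gettext_noop("Sets the minimum interval between statistics flushes done within transactions."),
		NULL,
		GUC_UNIT_MS
	},
	&pgstat_anytime_flush_interval,	/* hypothetical variable */
	10000, 1000, INT_MAX,			/* default 10s, floor of 1s */
	NULL, NULL, NULL
},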
--
Sami Imseih
Amazon Web Services (AWS)
On Thu, Jan 22, 2026 at 10:41 AM Sami Imseih <samimseih@gmail.com> wrote:
No, 0003 also changes the flush mode for the database KIND. All the fields that
I mentioned are inherited from relations stats and are flushed only at transaction
boundaries (so they don't appear in pg_stat_database until the transaction
finishes). Does that make sense? (if the database kind is not switched to
flush any time then none would appear while the transaction is in progress, even
the ones inherited from relations stats).

PFA v3, also taking care of Zsolt's comment (thanks!) done up-thread.
While reading through 0001, I got to question on which properties
and/or assumptions of a stats kind one has to rely on to decide to
what flush_mode should be set. To put is simpler, why don't we just
do a periodic pgstat_report_stat(false) call that would flush all the
stats for all stats kinds based on the new timeout registered,
expanding a bit the flush we currently do when idle in
ProcessInterrupts()?

There are some important cases in which we would want to
distinguish between a "transaction boundary" flush vs an
"anytime" flush.

For example, xact_commit/rollback. I would want those
fields to be in sync with tuples_inserted/updated/deleted
to allow for accurate calculations like number of inserts
per commit, etc.

Another one would be n_mod_since_analyze, That should
only be updated after commit (or not after rollback). Otherwise,
it may throw autovanalyze threshold calculations way off. Same
for n_dead_tup and autovacuum.

I am also not convinced that we have to be that aggressive with these
extra flushes. The target is long-running analytical queries, that
could take minutes or even hours. Using the same value as
PGSTAT_IDLE_INTERVAL (10s),

PGSTAT_IDLE_INTERVAL is flushing an idle backend every 10 seconds
IIUC. So this value only applies when outside of a transaction.

A 1s vs 10s report interval does not really matter for long analytical queries.
Sure, Bertrand mentioned early in the thread that the anytime flushes
could be made configurable. Perhaps that is a good idea where we can
default with something large like 10s intervals for anytime flushes, but allow
the user to configure a more frequent flushes ( although I would think
that 1 sec is the minimum we should allow ).
+1 on adding an option to control the interval. With a fixed interval
(for example, 1s), log_lock_waits messages could be emitted that frequently,
which may be annoying for some users.
Of course, it would be even better if these periodic wakeups did not trigger
log_lock_waits messages at all, though.
Regards,
--
Fujii Masao
On Wed, Jan 21, 2026 at 07:41:30PM -0600, Sami Imseih wrote:
Another one would be n_mod_since_analyze, That should
only be updated after commit (or not after rollback). Otherwise,
it may throw autovanalyze threshold calculations way off. Same
for n_dead_tup and autovacuum.
Point taken. It sounds like it is going to be super important to
document in the patch these kinds of current expectations, so that one
does not flip the flush mode one way or another incorrectly, or
assign an incorrect flush mode when adding a new stats kind. It's
probably worth documenting that the end-of-transaction flush should be
the default norm, while the out-of-transaction case should be an
exception one needs to be careful of.
Sure, Bertrand mentioned early in the thread that the anytime flushes
could be made configurable. Perhaps that is a good idea where we can
default with something large like 10s intervals for anytime flushes, but allow
the user to configure a more frequent flushes ( although I would think
that 1 sec is the minimum we should allow ).
Sure, I am just mentioning that we should not be that aggressive for
everybody. If this can be made configurable on a call-basis, even if
it means a new GUC, that may be better in the long run.
--
Michael
Hi,
On Thu, Jan 22, 2026 at 10:56:48AM +0900, Fujii Masao wrote:
On Thu, Jan 22, 2026 at 10:41 AM Sami Imseih <samimseih@gmail.com> wrote:
Sure, Bertrand mentioned early in the thread that the anytime flushes
could be made configurable. Perhaps that is a good idea where we can
default with something large like 10s intervals for anytime flushes, but allow
the user to configure a more frequent flushes ( although I would think
that 1 sec is the minimum we should allow ).

+1 on adding an option to control the interval. With a fixed interval
(for example, 1s), log_lock_waits messages could be emitted that frequently,
which may be annoying for some users.

Of course, it would be even better if these periodic wakeups did not trigger
log_lock_waits messages at all, though.
pgstat_report_anytime_stat() is called with the force parameter set to false,
which means that the flushes are done with nowait = true, meaning that
LWLockConditionalAcquire() is used. In that case, do you still see cases where
log_lock_waits messages could be triggered by the new flush?
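For reference, that nowait path boils down to a conditional lock acquisition,
roughly like this (paraphrased from pgstat_lock_entry() in pgstat_shmem.c):

/*
 * Paraphrase of pgstat_lock_entry(): with nowait = true the shared entry
 * lock is only taken if it is immediately available, so the periodic
 * flush never waits on a contended entry (it simply retries later).
 */
bool
pgstat_lock_entry(PgStat_EntryRef *entry_ref, bool nowait)
{
	LWLock	   *lock = &entry_ref->shared_stats->lock;

	if (nowait)
		return LWLockConditionalAcquire(lock, LW_EXCLUSIVE);

	LWLockAcquire(lock, LW_EXCLUSIVE);
	return true;
}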
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,
On Thu, Jan 22, 2026 at 11:28:06AM +0900, Michael Paquier wrote:
On Wed, Jan 21, 2026 at 07:41:30PM -0600, Sami Imseih wrote:
Another one would be n_mod_since_analyze, That should
only be updated after commit (or not after rollback). Otherwise,
it may throw autovanalyze threshold calculations way off. Same
for n_dead_tup and autovacuum.

Point taken. It sounds like it is going to be super important to
document in the patch these kind of current expectations, so as one
does not flip the flush mode one way or another incorrectly, or
assigns an incorrect flush mode when adding a new stats kind. It's
probably worth documenting that the end-of-transaction flush should be
the default norm, while the out-of-transaction case should be an
exception one needs to be careful of.
Agreed, I'll add more explanations around that.
Sure, Bertrand mentioned early in the thread that the anytime flushes
could be made configurable. Perhaps that is a good idea where we can
default with something large like 10s intervals for anytime flushes, but allow
the user to configure a more frequent flushes ( although I would think
that 1 sec is the minimum we should allow ).

Sure, I am just mentioning that we should not be that aggressive for
everybody.
I'm not opposed to increasing the flush interval, but I suppose most of the monitoring
tools are sampling at a 1s frequency. So, if we set the flush interval to, say, 10s,
that would result in "spikes" every 10s. That's misleading, because it's not a
spike in activity, it's a delay in reporting.

I think that would make sense if we expected the 1s interval to have a negative
impact, but that's not what I expect, nor what I have observed.
If this can be made configurable on a call-basis, even if
it means a new GUC, that may be better in the long run.
If we think that the 1s interval is a problem, we could go in that direction.
Though it might be better to hardcode a larger value instead of letting the users
set values that could be problematic.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,
On Wed, Jan 21, 2026 at 05:41:13PM -0600, Sami Imseih wrote:
Thanks for the updated patches!
No, 0003 also changes the flush mode for the database KIND. All the fields that
I mentioned are inherited from relations stats and are flushed only at transaction
boundaries (so they don't appear in pg_stat_database until the transaction
finishes). Does that make sense?

yes, I understand it clearly now.
But, the Note under pg_stat_database reads like this:
"All the statistics are updated while the transactions are in progress,
except for xact_commit, xact_rollback, tup_inserted, tup_updated
and tup_deleted that are updated only when the transactions finish."But that is not true for all pg_stat_database fields, such as session_time,
active_time, idle_in_transaction_time, etc. From what I can tell some of their
fields are updated when the connection is closed. For example
in one session run "select pg_sleep(10)" and in another session monitor
pg_stat_database.active_time. That will not be updated until the session
is closed.

This is because these are not relation stats, which makes sense. The
Note section should elaborate more on this, right?
Yeah, so, while pgstat_database_flush_cb() is now called every second (if there
are pending stats), not all the stats would have their pending entries updated.
For example, pgstat_update_dbstats() updates some of them: xact_commit, xact_rollback,
blk_read_time, blk_write_time, session_time, active_time and idle_in_transaction_time
but only at transaction boundaries. Indeed, pgstat_update_dbstats() is only called
during pgstat_report_stat() and not during pgstat_report_anytime_stat().
I think that we could:
1. Update the doc as you suggest
or
2. Call a modified version of pgstat_update_dbstats() in pgstat_report_anytime_stat()
that would update blk_read_time, blk_write_time, session_time, active_time and
idle_in_transaction_time but that would require an extra GetCurrentTimestamp()
call.
or
3. Call a modified version of pgstat_update_dbstats() in pgstat_report_anytime_stat()
that would update the same fields as in 2. except session_time, avoiding the need
for an extra GetCurrentTimestamp() call (rough sketch below).
I'm tempted to vote for 1. as I'm not sure of the added value of 2. and 3.,
thoughts?
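For illustration only, a minimal sketch of what 3. could look like (the function
below does not exist; field and accumulator names are borrowed from
pgstat_update_dbstats() and are only indicative):

/*
 * Hypothetical trimmed-down variant of pgstat_update_dbstats() that could
 * be called from pgstat_report_anytime_stat().  It deliberately skips
 * xact_commit, xact_rollback and session_time (the latter to avoid an
 * extra GetCurrentTimestamp() call).
 */
static void
pgstat_update_dbstats_anytime(void)
{
	PgStat_StatDBEntry *dbentry = pgstat_prep_database_pending(MyDatabaseId);

	/* I/O timings can be made visible mid-transaction */
	dbentry->blk_read_time += pgStatBlockReadTime;
	dbentry->blk_write_time += pgStatBlockWriteTime;
	pgStatBlockReadTime = 0;
	pgStatBlockWriteTime = 0;

	/* active/idle-in-transaction time, but not session_time */
	dbentry->active_time += pgStatActiveTime;
	dbentry->idle_in_transaction_time += pgStatTransactionIdleTime;
	pgStatActiveTime = 0;
	pgStatTransactionIdleTime = 0;
}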
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Thu, Jan 22, 2026 at 4:43 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
Hi,
On Thu, Jan 22, 2026 at 10:56:48AM +0900, Fujii Masao wrote:
On Thu, Jan 22, 2026 at 10:41 AM Sami Imseih <samimseih@gmail.com> wrote:
Sure, Bertrand mentioned early in the thread that the anytime flushes
could be made configurable. Perhaps that is a good idea where we can
default with something large like 10s intervals for anytime flushes, but allow
the user to configure a more frequent flushes ( although I would think
that 1 sec is the minimum we should allow ).

+1 on adding an option to control the interval. With a fixed interval
(for example, 1s), log_lock_waits messages could be emitted that frequently,
which may be annoying for some users.

Of course, it would be even better if these periodic wakeups did not trigger
log_lock_waits messages at all, though.

pgstat_report_anytime_stat() is called with the force parameter set to false,
means that the flushes are done with nowait = true means that LWLockConditionalAcquire()
is used. In that case, do you still see cases where log_lock_waits messages could
be triggered due to the new flush?
I haven't read the patch in detail yet, but after applying patch 0001 and
causing a lock wait (for example, using the steps below), I observed that
log_lock_waits messages are emitted every second.
[session 1]
create table tbl as select id from generate_series(1, 10) id;
begin;
select * from tbl where id = 1 for update;
[session 2]
begin;
select * from tbl where id = 1 for update;
With this setup, the following messages were logged once per second:
LOG: process 72199 still waiting for ShareLock on transaction 771
after 63034.119 ms
DETAIL: Process holding the lock: 72190. Wait queue: 72199.
Regards,
--
Fujii Masao
Hi,
On Thu, Jan 22, 2026 at 09:12:18PM +0900, Fujii Masao wrote:
On Thu, Jan 22, 2026 at 4:43 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

pgstat_report_anytime_stat() is called with the force parameter set to false,
means that the flushes are done with nowait = true means that LWLockConditionalAcquire()
is used. In that case, do you still see cases where log_lock_waits messages could
be triggered due to the new flush?I haven't read the patch in detail yet, but after applying patch 0001 and
causing a lock wait (for example, using the steps below), I observed that
log_lock_waits messages are emitted every second.

[session 1]
create table tbl as select id from generate_series(1, 10) id;
begin;
select * from tbl where id = 1 for update;

[session 2]
begin;
select * from tbl where id = 1 for update;

With this setup, the following messages were logged once per second:
LOG: process 72199 still waiting for ShareLock on transaction 771
after 63034.119 ms
DETAIL: Process holding the lock: 72190. Wait queue: 72199.
Thanks!
I see, the WaitLatch() in ProcSleep() is "woken up" every 1s due to the
enable_timeout_after(ANYTIME_STATS_UPDATE_TIMEOUT,...) being set unconditionally
in ProcessInterrupts(). We need to be more restrictive as to when to enable the
timeout; I'll fix that in the next version.
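Just thinking out loud, being more restrictive could take a shape like the
sketch below, though the lock-wait case you reported may need more than this
(pgstat_have_pending_stats() is a hypothetical helper standing for "pending
entries exist or pgstat_report_fixed is set"):

/*
 * Hypothetical sketch: only re-arm the anytime-stats timeout when the
 * backend actually has something pending, so that backends sitting in
 * e.g. ProcSleep() are not woken up every second for nothing.
 */
if (AnytimeStatsUpdateTimeoutPending)
{
	AnytimeStatsUpdateTimeoutPending = false;
	pgstat_report_anytime_stat(false);

	if (pgstat_have_pending_stats())	/* hypothetical helper */
		enable_timeout_after(ANYTIME_STATS_UPDATE_TIMEOUT,
							 PGSTAT_MIN_INTERVAL);
}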
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
For example, pgstat_update_dbstats() updates some of them: xact_commit, xact_rollback,
blk_read_time, blk_write_time, session_time, active_time and idle_in_transaction_time
but only at transaction boundaries. Indeed, pgstat_update_dbstats() is only called
during pgstat_report_stat() and not during pgstat_report_anytime_stat().

I think that we could:
1. Update the doc as you suggest
I am thinking the _time related fields are OK to be non-anytime fields,
since they have overhead and can also be actively monitored from
pg_stat_activity if someone really needs real-time information.
The other session related counters don't need special consideration,
and the parallel counters are anytime.
So, the documentation can mention that the _time related fields are flushed
only at their appropriate times.
Maybe something general like this:
"Some statistics are updated while a transaction is in progress.
Statistics that either do
not depend on transactions or require transactional consistency are
updated only
when the transaction ends. Statistics that require transactional consistency
include xact_commit, xact_rollback, tup_inserted, tup_updated, and tup_deleted."
What do you think?
--
Sami Imseih
Amazon Web Services (AWS)