Improve pg_stat_statements scalability

Started by Sami Imseih2 months ago25 messageshackers

samimseih@gmail.com

2 months ago

Hi,

pg_stat_statements has well-known scaling problems under
high concurrency. This patch series is an initial proposal
for how to $SUBJECT.

The three scaling problems:

1. Per-entry SpinLock contention. Every counter update acquires
a SpinLock on the entry. Hot queries executed across many
backends contend on the same lock, and the hold time grows as
we add more counters to the struct. Higher core counts mean
more CPUs contending for the same spinlock, making the problem
worse on modern hardware. See this complaint here [3]/messages/by-id/btsjlfnqge3y6yypkwe7yvhv2tcopt6pug7gigz6xaha2iemkw@lflv3psi7xoz.

2. Deallocation under exclusive LWLock. The hash table is
fixed-size, bounded by pg_stat_statements.max. When full, a
single backend performs least-frequently-used eviction of
the bottom 5% while holding an exclusive LWLock, blocking all
other backends from reading or writing pg_stat_statements.
This happens inline during query execution, and workloads with
many unique queries trigger it frequently. Making matters
worse, pg_stat_statements.max is a static GUC, requiring a
full server restart to change, and that is disruptive in
production systems.

3. Query text file bloat and GC. Deallocated entries leave dead
text in the external file. When the file grows to 2x the size
of live text, GC rewrites the entire file under exclusive lock.
Disk I/O under exclusive lock is the worst combination, and
frequent deallocations (problem #2) trigger more GC.

Using the pgstat, "statistics collector", system to improve
pg_stat_statements scalability has been discussed previously
[4]: /messages/by-id/aKF0V-T8-XAxj47T@paquier.xyz
to shared memory, avoiding spinlock contention on the write
path. Also, the storage is a dshash (partitioned hash table),
which reduces contention on lookups and allows dynamic resizing
without restart. The prerequisite work to make this usable by
extensions has been building over two release cycles:

- PG18: Pluggable cumulative statistics API
(pgstat_register_kind) [7949d959]

- PG19: Serialization callbacks (to_serialized_data,
from_serialized_data, finish) for custom stats kinds
[4ba012a8ed9]

This patch builds on the prerequisite work to address the
scalability issues mentioned above.

The series consists of two patches:

---

[0001]: pgstat: add pgstat_flush_pending() and pg_stat_flush_pending(pid)

Adds infrastructure for flushing pending statistics to shared
memory on demand. pgstat_flush_pending() flushes all pending
entries in the calling backend immediately. Unlike
pgstat_report_stat(), it can be called mid-transaction, making
it suitable for view functions that need fresh shared stats
before the transaction ends.

pg_stat_flush_pending(pid) is the SQL-callable interface to
the same function.

This patch is related to the discussion in [1]/messages/by-id/acNTfL1xO_UUXkZQ@paquier.xyz about flushing
stats within running transactions. The pg_stat_statements
modernization provides a good example where this is useful.
At least for the pg_stat_statements tests, we need a way to
flush statistics within a transaction. I plan to spin this off
to a dedicated thread, but include it here for now so this
patchset can be tested.

[0002]: pg_stat_statements: modernize entry storage with pgstat kind
pgstat kind

This is the main patch. It replaces the private shared-memory
hash table with the pgstat subsystem's dshash (registered as a
custom stats kind).

The pgstats hash key has an objid:

typedef struct PgStat_HashKey
{
PgStat_Kind kind; /* statistics entry kind */
Oid dboid; /* database ID. InvalidOid for shared
objects. */
uint64 objid; /* object ID (table, function, etc.), or
* identifier. */
} PgStat_HashKey;

So the entry objid is computed by combining hashes from userid, queryid and
toplevel.

Note that because of 0001 there are no changes required to the
existing regression tests. We may need to add some new tests,
but I have not thought too much about that yet.

Key design changes:

- No SpinLock on the execution path. Counters accumulate
per-backend and merge into shared memory on flush. Stddev
and related fields use Welford's algorithm as before, but since the stats
are first updated locally, we need
a way to merge the stats on flush, so we use Chan's parallel
algorithm for this purpose [5]https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm.

- No fixed shared memory allocation. Entries live in the pgstat
dshash, which grows dynamically. pg_stat_statements.max
becomes PGC_SIGHUP, and therefore can be changed dynamically.

- Throttled inline eviction. When entry count reaches
pgss_max, a backend attempts eviction using a conditional
lock and a shared timestamp that ensures at most one eviction
cycle per 10 seconds. Other backends simply skip entry
creation without blocking. This is acceptable because the
current upstream behavior already suffers from the same data
loss under heavy churn. Newly created entries are immediately
candidates for eviction and are frequently removed in the
next deallocation cycle. The throttled approach makes this
trade-off explicit and avoids the cost of blocking all backends
behind an exclusive lock to create space for entries that may
just be removed shortly. See benchmark results below for
measuring the retention of "hot" entries.

The aging decay mechanism remains in place, but the sticky
entries are no longer needed for this patch, and we simply
don't evict calls == 0.

A background worker for eviction was also considered. The
current inline approach was chosen because it avoids the
added complexity while still preventing other backends from
blocking. If a background worker has more merit, I am open
to that discussion.

- Query texts in DSA memory with disk fallback. A new GUC
pg_stat_statements.query_text_memory (default 64 MB) controls
DSA shared memory for query text storage. When enabled and
not exhausted, new texts are stored in DSA instead of the
external file, eliminating file I/O on the read path. If DSA
is disabled (set to 0) or full, texts fall back to the
existing file-based storage. Entry eviction and reset properly
free DSA-allocated texts, and GC of the text file skips
DSA-backed entries.

- Serialization via pgstat callbacks. to_serialized_data /
from_serialized_data handle saving and restoring stats across
restarts, replacing the module's bespoke shutdown/startup
logic.

- pg_stat_statements_info gains columns. num_entries (current
count), last_eviction_time, query_text_memory_bytes (DSA
currently allocated), and query_text_file_bytes (overflow
file size) provide operational visibility into eviction
behavior and memory usage. Some of these fields were discussed
in this thread [6]/messages/by-id/CAP53PkzYZ8YxH0o+Garw9fWdFRoEtmQKT09-q=2RVMW8uVS5Nw@mail.gmail.com as they are relevant even to the current
state of pg_stat_statements.

The LWLock is retained but narrowed to protect only query text
file operations and eviction. Entry-level locking is handled
by the pgstat subsystem's built-in mechanism.

---

Benchmark:

Attached is a benchmark script that runs three workloads:

- 5k: 80% hot (1000 distinct) + 20% churn (4000 distinct).
Total fits within max, no eviction expected.
- 100k: 80% hot (1000 distinct) + 20% churn (100000 distinct).
Exceeds max, continuous eviction pressure.
- spinlock: single SELECT; from all 256 clients. Pure
contention on one entry.

The benchmark collects pg_stat_activity in the background to
Measure wait events. It also polls pg_stat_statements to check
For hot and churn query retention, so we can measure hot
Query retention and also measure the overhead of querying
pg_stat_statements.

Benchmark Results:

Environment: x86_64, 16 CPUs, 29 GB RAM, Linux 6.1 (EC2).
PostgreSQL 19devel, release build, cassert=off.
pg_stat_statements.max = 5500, 256 clients, 16 threads,
5 min per test.

Test | un-patched TPS | Patch TPS | Delta
---------+----------------+-----------+------
5k | 241,078 | 239,001 | -0.9%
100k | 180,817 | 240,314 | +33%
spinlock | 328,075 | 319,466 | -3%

Wait event samples (collected every 1s for 300s):

Notice the huge difference in wait events. Essentially, the
pg_stat_statements waits are eliminated with this design and
there is a 33% performance improvement.

Entry retention under 100k (heavy eviction):

Under heavy churn, the unpatched eviction behavior retains
hot entries and discards churn entries after 1-5 calls. The patched
design throttles eviction to at most once per 10 seconds, so entries
survive longer between cycles and accumulate more calls. In the patched
design, the churn entries reach 1-192 calls before eviction. Hot entries
benefit from the elimination of exclusive-lock blocking, accumulating 52-58k
calls versus 42-44k.

In the non-churn case (5k test), both designs are equivalent
in performance.

More benchmarking across different patterns is needed, but I believe this
is a good start in terms of numbers and accuracy.

---

Open questions:

1. The attached patches use PGSTAT_KIND_EXPERIMENTAL for the
custom kind ID. For commit, we'd want a proper kind number.

2. The 10-second eviction throttle is a compile-time constant
(EVICTION_INTERVAL_MS). Should this be a GUC, or is a fixed
interval sufficient?

3. The current design skips entry creation when at capacity and
eviction is throttled. An alternative would be to allow
temporary overshoot (soft limit) and never reject entries.
Thoughts on the trade-off? However, for heavy churn and
many unique entries we can easily overshoot the max by
magnitudes higher which I don't think is a good idea.

4. The default size of query text memory. The patch has it at 64MB.

[1]: /messages/by-id/acNTfL1xO_UUXkZQ@paquier.xyz
[3]: /messages/by-id/btsjlfnqge3y6yypkwe7yvhv2tcopt6pug7gigz6xaha2iemkw@lflv3psi7xoz
[4]: /messages/by-id/aKF0V-T8-XAxj47T@paquier.xyz
[5]: https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm
[6]: /messages/by-id/CAP53PkzYZ8YxH0o+Garw9fWdFRoEtmQKT09-q=2RVMW8uVS5Nw@mail.gmail.com

Patches and benchmark script attached.

--
Sami Imseih
Amazon Web Services (AWS)

Sami Imseih

samimseih@gmail.com

2 months ago

In reply to: Sami Imseih (#1)

Re: Improve pg_stat_statements scalability

spinlock | 328,075 | 319,466 | -3%

One thing I forgot to mention above is for the spinlock contention
test, I plan on running this on a machine with much higher
core count, where I have seen this be a much larger issue in
the past.

--
Sami Imseih
Amazon Web Services (AWS)

Sami Imseih

samimseih@gmail.com

about 2 months ago

In reply to: Sami Imseih (#2)

Re: Improve pg_stat_statements scalability

Hi,

I had to fix an issue with the serialization callback that was causing
CI failures.

I also made re-organized the patches a bit:

v1-0001: This will not be part of the main patchset, but will be be a dependency
for the remaining patches. This is, as mentioned above, an in-transaction flush
required [1]https://commitfest.postgresql.org/patch/6781/

v1-0002: This now only moves the pg_stat_statements hash to a custom
pgstat kind

v1-0003: This moves the query text file to a DSA + file storage.

v1-0004: Adds new columns to pg_stat_statements_info

[1]: https://commitfest.postgresql.org/patch/6781/

--
Sami Imseih
Amazon Web Services (AWS)

Sami Imseih

samimseih@gmail.com

about 2 months ago

In reply to: Sami Imseih (#3)

Re: Improve pg_stat_statements scalability

There was a failure on FreeBSD [1]https://cirrus-ci.com/task/5948422076760064. The test uses
debug_parallel_query=regress which forces parallel plans. What
happens is the parallel worker calls pg_stat_statements() (marked
PARALLEL SAFE), tries to flush pending stats, but the leader is the one
that actually accumulated those stats.

I fixed this by:

1. Setting max_parallel_workers_per_gather = 0 in
pg_stat_statements.conf, and only enabling it during
parallel.sql when we actually want to track a parallel query.

2. Bumping pg_s_s to version 1.14 and marking pg_stat_statements()
and pg_stat_statements_reset() as PARALLEL RESTRICTED to ensure
these functions only execute in the leader, which is the process
that accumulates the pending stats. These could also be marked UNSAFE,
but RESTRICTED seems better since it doesn't completely prevent
parallel plans if these functions are used with other tables; although
it's hard to imagine a real-world case where this would matter.

This also means old versions would have this issue with
debug_parallel_query, but I don't think we should change
function definitions for older versions, in case a user
downgrades pg_s_s versions. Maybe others have a
different opinion?

[1]: https://cirrus-ci.com/task/5948422076760064

--
Sami

Lukas Fittl

lukas@fittl.com

about 2 months ago

In reply to: Sami Imseih (#1)

Re: Improve pg_stat_statements scalability

On Mon, May 11, 2026 at 4:54 PM Sami Imseih <samimseih@gmail.com> wrote:

pg_stat_statements has well-known scaling problems under
high concurrency. This patch series is an initial proposal
for how to $SUBJECT.

Agreed. Let's fix this for Postgres 20.

Thank you for putting together initial patches!

The three scaling problems:

1. Per-entry SpinLock contention. Every counter update acquires
a SpinLock on the entry. Hot queries executed across many
backends contend on the same lock, and the hold time grows as
we add more counters to the struct. Higher core counts mean
more CPUs contending for the same spinlock, making the problem
worse on modern hardware. See this complaint here [3].

2. Deallocation under exclusive LWLock. The hash table is
fixed-size, bounded by pg_stat_statements.max. When full, a
single backend performs least-frequently-used eviction of
the bottom 5% while holding an exclusive LWLock, blocking all
other backends from reading or writing pg_stat_statements.
This happens inline during query execution, and workloads with
many unique queries trigger it frequently. Making matters
worse, pg_stat_statements.max is a static GUC, requiring a
full server restart to change, and that is disruptive in
production systems.

3. Query text file bloat and GC. Deallocated entries leave dead
text in the external file. When the file grows to 2x the size
of live text, GC rewrites the entire file under exclusive lock.
Disk I/O under exclusive lock is the worst combination, and
frequent deallocations (problem #2) trigger more GC.

Yep, agreed on these problems.

Using the pgstat, "statistics collector", system to improve
pg_stat_statements scalability has been discussed previously
[4]. It writes stats locally first and flushes periodically
to shared memory, avoiding spinlock contention on the write
path. Also, the storage is a dshash (partitioned hash table),
which reduces contention on lookups and allows dynamic resizing
without restart. The prerequisite work to make this usable by
extensions has been building over two release cycles:

- PG18: Pluggable cumulative statistics API
(pgstat_register_kind) [7949d959]

- PG19: Serialization callbacks (to_serialized_data,
from_serialized_data, finish) for custom stats kinds
[4ba012a8ed9]

This patch builds on the prerequisite work to address the
scalability issues mentioned above.

The series consists of two patches:

---

[0001] pgstat: add pgstat_flush_pending() and pg_stat_flush_pending(pid)

Adds infrastructure for flushing pending statistics to shared
memory on demand. pgstat_flush_pending() flushes all pending
entries in the calling backend immediately. Unlike
pgstat_report_stat(), it can be called mid-transaction, making
it suitable for view functions that need fresh shared stats
before the transaction ends.

pg_stat_flush_pending(pid) is the SQL-callable interface to
the same function.

This patch is related to the discussion in [1] about flushing
stats within running transactions. The pg_stat_statements
modernization provides a good example where this is useful.
At least for the pg_stat_statements tests, we need a way to
flush statistics within a transaction. I plan to spin this off
to a dedicated thread, but include it here for now so this
patchset can be tested.

I noticed the following in the 0001 patch (to be clear, looking at the
v2 version you posted later), which I wasn't sure about:

diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index b2ca28f83ba..848687a9f7e 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -828,64 +828,76 @@ pgstat_relation_flush_cb(PgStat_EntryRef *entry_ref, bool nowait)

...
if (t > tabentry->lastscan)
tabentry->lastscan = t;
}
tabentry->tuples_returned += lstats->counts.tuples_returned;
tabentry->tuples_fetched += lstats->counts.tuples_fetched;
-    tabentry->tuples_inserted += lstats->counts.tuples_inserted;
-    tabentry->tuples_updated += lstats->counts.tuples_updated;
-    tabentry->tuples_deleted += lstats->counts.tuples_deleted;
tabentry->tuples_hot_updated += lstats->counts.tuples_hot_updated;
tabentry->tuples_newpage_updated += lstats->counts.tuples_newpage_updated;
+    tabentry->blocks_fetched += lstats->counts.blocks_fetched;
+    tabentry->blocks_hit += lstats->counts.blocks_hit;

If I follow correctly, you're moving tuples_(inserted|updated|deleted)
here to special case them so they respect transaction rollbacks, but
you don't do that for tuples_hot_updated and tuples_newpage_updated.
Shouldn't those also be ignored if a transaction rolls back?

[0002] pg_stat_statements: modernize entry storage with
pgstat kind

This is the main patch. It replaces the private shared-memory
hash table with the pgstat subsystem's dshash (registered as a
custom stats kind).

I've taken a first pass at this, and noticed an issue (again, in the
v2 patch for clarity):

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c
b/contrib/pg_stat_statements/pg_stat_statements.c

index 92315627916..0e6e65e3e51 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
...
@@ -1254,9 +1066,62 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
}
}

...
+/*
+ * Store query text into a shared entry via the external text file.
+ *
+ * Caller must hold the entry lock.  Does nothing if text is already present.
+ */
+static void
+pgss_store_query_text(PgStatShared_Pgss *shared_entry,
+                      const char *query, int query_len, int encoding)
+{
+    Size        query_offset;
+    int            gc_count;
+
+    if (shared_entry->query_len >= 0)
+        return;
+
+    LWLockAcquire(&pgss->lock.lock, LW_SHARED);
+    if (qtext_store(query, query_len, &query_offset, &gc_count))
+    {
+        shared_entry->query_offset = query_offset;
+        shared_entry->query_len = query_len;
+        shared_entry->encoding = encoding;
+    }
+    LWLockRelease(&pgss->lock.lock);
+}
+

...
/*
@@ -1313,192 +1180,171 @@ pgss_store(const char *query, int64 queryId,
key.queryid = queryId;
key.toplevel = (nesting_level == 0);

-    /* Lookup the hash table entry with shared lock. */
-    LWLockAcquire(&pgss->lock.lock, LW_SHARED);
-
-    entry = (pgssEntry *) hash_search(pgss_hash, &key, HASH_FIND, NULL);
-
-    /* Create new entry, if not present */
-    if (!entry)
+    /*
+     * If jstate is set, create the shared entry and store normalized query
+     * text.  Don't increment counters; entries with zero calls are not
+     * returned by pg_stat_statements_internal().
+     */
+    if (jstate)
{
-        Size        query_offset;
-        int            gc_count;
-        bool        stored;
-        bool        do_gc;
+        const char *store_query;

- /*
- * Create a new, normalized query string if caller asked. We don't
- * need to hold the lock while doing this work. (Note: in any case,
- * it's possible that someone else creates a duplicate hashtable entry
- * in the interval where we don't hold the lock below. That case is
- * handled by entry_alloc.)
- */
- if (jstate)
...

-        /* Append new query text to file with only shared lock held */
-        stored = qtext_store(norm_query ? norm_query : query, query_len,
-                             &query_offset, &gc_count);
-
-        /*
-         * Determine whether we need to garbage collect external query texts
-         * while the shared lock is still held.  This micro-optimization
-         * avoids taking the time to decide this while holding exclusive lock.
-         */
-        do_gc = need_gc_qtexts();
-
-        /* Need exclusive lock to make a new hashtable entry - promote */
-        LWLockRelease(&pgss->lock.lock);
-        LWLockAcquire(&pgss->lock.lock, LW_EXCLUSIVE);
+        norm_query = generate_normalized_query(jstate, query,
+                                               query_location,
+                                               &query_len);
+        store_query = norm_query ? norm_query : query;

-        /*
-         * A garbage collection may have occurred while we weren't holding the
-         * lock.  In the unlikely event that this happens, the query text we
-         * stored above will have been garbage collected, so write it again.
-         * This should be infrequent enough that doing it while holding
-         * exclusive lock isn't a performance problem.
-         */
-        if (!stored || pgss->gc_count != gc_count)
-            stored = qtext_store(norm_query ? norm_query : query, query_len,
-                                 &query_offset, NULL);
+        entry_ref = pgstat_get_entry_ref_locked(PGSTAT_KIND_PGSS,
+                                                key.dbid,
+                                                pgss_objid(&key),
+                                                true);
+        if (!entry_ref)
+        {
+            if (norm_query)
+                pfree(norm_query);
+            return;
+        }

-        /* If we failed to write to the text file, give up */
-        if (!stored)
-            goto done;
+        shared_entry = (PgStatShared_Pgss *) entry_ref->shared_stats;
+        pgss_entry_init(shared_entry, &key, encoding);
+        pgss_store_query_text(shared_entry, store_query, query_len, encoding);

It appears you've moved the equivalent of the "if (!entry)" check into
the pgss_store_query_text function, and we now unconditionally call
generate_normalized_query.

Is there a reason we couldn't just defer that work until we've
confirmed we actually need to save the query text?

The pgstats hash key has an objid:

typedef struct PgStat_HashKey
{
PgStat_Kind kind; /* statistics entry kind */
Oid dboid; /* database ID. InvalidOid for shared
objects. */
uint64 objid; /* object ID (table, function, etc.), or
* identifier. */
} PgStat_HashKey;

So the entry objid is computed by combining hashes from userid, queryid and
toplevel.

Makes sense.

Note that because of 0001 there are no changes required to the
existing regression tests. We may need to add some new tests,
but I have not thought too much about that yet.

Key design changes:

- No SpinLock on the execution path. Counters accumulate
per-backend and merge into shared memory on flush. Stddev
and related fields use Welford's algorithm as before, but since the stats
are first updated locally, we need
a way to merge the stats on flush, so we use Chan's parallel
algorithm for this purpose [5].

- No fixed shared memory allocation. Entries live in the pgstat
dshash, which grows dynamically. pg_stat_statements.max
becomes PGC_SIGHUP, and therefore can be changed dynamically.

Big +1 on these first two design changes - its clear that we get a
benefit here, and its nice to simplify/streamline the code as well.

- Throttled inline eviction. When entry count reaches
pgss_max, a backend attempts eviction using a conditional
lock and a shared timestamp that ensures at most one eviction
cycle per 10 seconds. Other backends simply skip entry
creation without blocking. This is acceptable because the
current upstream behavior already suffers from the same data
loss under heavy churn. Newly created entries are immediately
candidates for eviction and are frequently removed in the
next deallocation cycle.

I feel this is problematic as presented (looking at pgss_maybe_evict
in v2), because even on a somewhat steady workload you have a high
risk of losing the 5001st entry (if max is 5000).

What if we instead trigger deallocation early, e.g. at 90% of entries
being full? I think this could also go well with a background worker
doing the work.

The throttled approach makes this
trade-off explicit and avoids the cost of blocking all backends
behind an exclusive lock to create space for entries that may
just be removed shortly. See benchmark results below for
measuring the retention of "hot" entries.

The aging decay mechanism remains in place, but the sticky
entries are no longer needed for this patch, and we simply
don't evict calls == 0.

A background worker for eviction was also considered. The
current inline approach was chosen because it avoids the
added complexity while still preventing other backends from
blocking. If a background worker has more merit, I am open
to that discussion.

I think a background worker might simplify things around deallocation
and query text GC handling - we have precedent that contrib extensions
utilize background workers, e.g. pg_prewarm, pg_stash_advice, etc.

- Query texts in DSA memory with disk fallback. A new GUC
pg_stat_statements.query_text_memory (default 64 MB) controls
DSA shared memory for query text storage. When enabled and
not exhausted, new texts are stored in DSA instead of the
external file, eliminating file I/O on the read path. If DSA
is disabled (set to 0) or full, texts fall back to the
existing file-based storage. Entry eviction and reset properly
free DSA-allocated texts, and GC of the text file skips
DSA-backed entries.

I feel like this design introduces a performance cliff at the
in-memory => disk boundary that will be surprising and hard to
understand. We had previously discussed doing an in-memory only
design, but I could see that being harder to get agreement on in
practice with the large query text files many of us have seen on
customer systems.

What if we instead designed this as an in-memory buffering mechanism?

i.e. new query texts are first written to the in-memory buffer, and
then flushed to disk. The in-memory buffer is kept as DSA like you
have it, so many concurrent sessions suddenly running a new query only
have one of them actually saving it to the buffer. We could then have
a background writer write that out to disk in batches, to make room
for new entries in the in-memory buffer.

If we wanted to get fancy we could even introduce compression when we
write it out to disk, since that work would no longer have to be
happening inline with the query execution - it could instead be
happening in the background worker.

Open questions:

1. The attached patches use PGSTAT_KIND_EXPERIMENTAL for the
custom kind ID. For commit, we'd want a proper kind number.

Makes sense to me - I assume the question is when we assign the real ID?

I think we could separately consider whether pg_stat_statements should
live in core (maybe?), but that seems like a different patch set to
me.

2. The 10-second eviction throttle is a compile-time constant
(EVICTION_INTERVAL_MS). Should this be a GUC, or is a fixed
interval sufficient?

I suspect its probably slightly too frequent, but its a bit hard to
say for sure. Personally, I don't think we need a GUC for it
necessarily, at least not from how I currently look at it.

3. The current design skips entry creation when at capacity and
eviction is throttled. An alternative would be to allow
temporary overshoot (soft limit) and never reject entries.
Thoughts on the trade-off? However, for heavy churn and
many unique entries we can easily overshoot the max by
magnitudes higher which I don't think is a good idea.

Per earlier note, I think we should change this so that we trigger
eviction early before actually reaching the max.

It seems reasonable to still drop entries when we actually exceed the
max, instead of blocking on the deallocation finishing (as we do
today).

4. The default size of query text memory. The patch has it at 64MB.

I think we can work on other parts of the design first, but I wonder
if we should auto-scale that based on shared_buffers (and allow
overriding) - 64MB default seems high for very small instances.

Thanks,
Lukas

--
Lukas Fittl

Lukas Fittl

lukas@fittl.com

about 2 months ago

In reply to: Lukas Fittl (#5)

Re: Improve pg_stat_statements scalability

On Thu, May 21, 2026 at 9:38 PM Lukas Fittl <lukas@fittl.com> wrote:

On Mon, May 11, 2026 at 4:54 PM Sami Imseih <samimseih@gmail.com> wrote:

pg_stat_statements has well-known scaling problems under
high concurrency. This patch series is an initial proposal
for how to $SUBJECT.

Agreed. Let's fix this for Postgres 20.

Thank you for putting together initial patches!

For archive's sake and others reading along, we had a productive
discussion about this today at PGConf.Dev.

I've created a new wiki page combining the prior 2025 discussion, and
notes from today:

https://wiki.postgresql.org/wiki/Scalability_of_pg_stat_statements

Thanks,
Lukas

--
Lukas Fittl

Julien Rouhaud

rjuju123@gmail.com

about 2 months ago

In reply to: Lukas Fittl (#6)

Re: Improve pg_stat_statements scalability

Hi,

On Fri, May 22, 2026 at 06:15:00PM -0700, Lukas Fittl wrote:

For archive's sake and others reading along, we had a productive
discussion about this today at PGConf.Dev.

I've created a new wiki page combining the prior 2025 discussion, and
notes from today:

https://wiki.postgresql.org/wiki/Scalability_of_pg_stat_statements

Thanks a lot Lukas!

Just a small clarification, I wasn't asking for a way to remember the last time
a query was executed (although I think it's a very good thing to have), but
remembering the time each query text was saved.

In a reasonable system (that is a system where the number of entries doesn't
grow much more than pg_stat_statements.max, ie, you can actually use the
current version pg_stat_statements) you will likely get a lot of entries even
if you filter out the entries that didn't get executed in the last X minutes.
However, the vast majority of them should not be new queries. So if you
maintain an external system that snapshot pg_stat_statements once in a while
you only care about the (possibly) missing query texts, which should be a very
small fraction of all the records you otherwise need.

Sami Imseih

samimseih@gmail.com

about 2 months ago

In reply to: Lukas Fittl (#6)

Re: Improve pg_stat_statements scalability

pg_stat_statements has well-known scaling problems under
high concurrency. This patch series is an initial proposal
for how to $SUBJECT.

Agreed. Let's fix this for Postgres 20.

Thank you for putting together initial patches!

For archive's sake and others reading along, we had a productive
discussion about this today at PGConf.Dev.

I've created a new wiki page combining the prior 2025 discussion, and
notes from today:

https://wiki.postgresql.org/wiki/Scalability_of_pg_stat_statements

Thank you, Lukas!

--
Sami

Sami Imseih

samimseih@gmail.com

about 2 months ago

In reply to: Sami Imseih (#8)

Re: Improve pg_stat_statements scalability

I've created a new wiki page combining the prior 2025 discussion, and
notes from today:

https://wiki.postgresql.org/wiki/Scalability_of_pg_stat_statements

Thank you, Lukas!

Thanks for the feedback on v2, and for the productive unconference
discussion in PGConf.dev. I've been iterating on the patches/benchmarks
and wanted to post an updated series that addresses the major design
points raised.

Addressing the feedback from the unconference session:

1/ Eviction

The concern in the design presented in which entries are skipped until
a single threaded eviction completed was that on a steady workload with
max=5000, the 5001st unique query would always be immediately evicted.
This was considered unacceptable, and rightly so. It may be fine for
high churn, pathological cases, but for slow churn cases it would be
unacceptable.

Andres mentioned that eviction should occur in parallel, which I
understood as multiple backends should evict a subset of the entries in
the dshash concurrently. To do this, I implemented a parallel
clock-sweep in which eviction sweeps a partition decrementing
refcounts, and only entries that have decayed to zero are evicted. A
genuinely new query that just arrived has a fresh refcount and survives
the sweep, and if it becomes popular in the meantime, it will survive
for longer. So the algorithm wants an entry to prove itself to remain
considered "hot", else it's swept.

There's a 5% headroom using the already existing
USAGE_DEALLOC_PERCENT in which we evict until count (across all
partitions) drops to 95% of max, so there's always room for new
arrivals before the next sweep.

I chose not to use a background worker for eviction, as discussed
earlier, based on the consensus that backpressure is important, and a
background worker being asynchronous in nature will not provide that.
Also, it could be complicated where a background worker cannot be spun
up for whatever reason.

However, to implement parallel eviction as described above, I needed 2
core changes:

- pgstat_drop_entry must be able to optionally tolerate a dropped
entry, as there could be a delay between the time an entry is marked
as dropped and garbage collection, and within that time multiple
evictions may attempt to drop the same entry.

- Implement a seq scan API for dshash that scans a specific partition.
This will allow eviction to cycle through partitions, i.e.
clock-sweep.

The eviction does not attempt to make room in the bucket for an entry
that triggered it. The main point is to keep forward progress in
making room. We don't need to be more strict here. Also the fact that
pg_stat_statements.max is dynamic means a user can increase this value
to manage high churn without a restart.

The dealloc counter in pg_stat_statements_info now counts individual
entries evicted rather than the number of times eviction was invoked.
I think this is more useful, but it does change the semantics between
versions. With the new design, a single eviction pass can remove many
entries across a partition, so counting evictions no longer tells
you much. Open to other thoughts here on this.

Profiling revealed that pgstat_request_entry_refs_gc(), which is the
standard pattern for less frequently removed entries in other places,
was too expensive when called on every entry drop or every partition
sweep. Now it is only called once per full rotation across all
partitions, which showed much better results in benchmarks.

2/ DSA only query text storage

Rather than a "performance cliff" that Lukas mentioned above when we
switch between DSA and disk for query text storage, the consensus is
to just store all query text in memory. Andres made a point that even
now, transient memory usage for loading the query text for the purpose
of garbage-collection or reading the query stats in
pg_stat_statements means that a user's machine must have enough memory
to handle this. So, why not just throw all query text in memory. The
memory is capped by a new GUC pg_stat_statements.query_text_memory
(default of 4MB but up for discussion).

It is also possible to store more entries due to .max than
.query_text_memory can support, so empty query text columns could be
possible. In the case a user increases .query_text_memory if they
observe empty query columns, the next time an entry is touched, it
will backfill the query text and normalize the string if it can. The
last part could be improved and actually guaranteed if we make
JumbleState available to all hooks (I did not work on this part, but
open for discussion).

Some other comments from Lukas's earlier review:

It appears you've moved the equivalent of the "if (!entry)" check
into the pgss_store_query_text function, and we now unconditionally
call generate_normalized_query.

Fixed.

v3 series is now 5 patches:

0001: pgstat: Introduce pg_stat_report_anytime()

Nothing changes from v2.

0002: pgstat: tolerate already-dropped entries in pgstat_drop_entry()

Required for the parallel eviction design. With multiple backends
sweeping different partitions concurrently, the same entry can be
targeted for drop more than once before garbage collection runs.
A skip_dropped flag makes this safe rather than throwing ERROR.

0003: dshash: add partition-scoped sequential scan

Adds dshash_seq_init_partition() to restrict a scan to a single
partition. This is the building block for per-partition clock-sweep;
a backend only locks and sweeps one partition at a time.

0004: pg_stat_statements: modernize entry storage with pgstat kind

The main patch. Replaces ShmemInitHash with dshash via DSM registry,
and replaces per-entry spinlock counter updates with a custom pgstat
kind that uses the core pgstat infrastructure.

Eviction changes from qsort-all-entries to clock-sweep with an atomic
rotating hand. Each entry carries a refcount (capped at 10) that
decays on sweep; entries reaching zero are evicted. Hot queries keep
their refcount topped up proportionally to access frequency.

pg_stat_statements.max becomes PGC_SIGHUP.

0005: pg_stat_statements: store query text in DSA instead of file

Moves query text from pgss_query_texts.stat into a DSA area via
GetNamedDSA. Adds pg_stat_statements.query_text_memory (PGC_SIGHUP,
default 4MB) controlling DSA size. Eliminates the GC machinery
entirely. When DSA is exhausted, entries are still tracked but query
text is stored as NULL. A backfill mechanism recovers text on
subsequent executions once space becomes available.

Benchmark:

Attached are the benchmark scripts I used for v3 (and will keep
using going forward) with the results in the benchmark_v3.txt file comparing
patch vs upstream.

The benchmark performs various workloads: "high churn", "light churn",
"multi stmt", and simple "select1". I am also tracking query retention
(hot/cold entry retention) to verify the clock-sweep behaves as
expected. I attached the .sql scripts used to benchmark.

The select1 result shows a 1.0% regression. This workload has no
contention, so it purely measures pgstat infrastructure overhead;
perf profiling shows pgstat_get_entry_ref() at the top. However, on
machines with higher core count the upstream spinlock on the counters
becomes a bottleneck, which is where the dshash design should win
back this overhead and maybe more. I still plan to benchmark this on
a larger machine.

You will notice that cold_calls in the patched churn case are much
lower (805 vs 4,458). This is because entries get evicted sooner
under per-partition sweep. Hot and cold query retention lines up w
ith current upstream (1000/1000 hot entries survive continuous churn).

The deallocs count is much higher in the patch (11.9M vs 38.5K in
high churn). per-partition sweep fires frequently to keep the table
at target capacity, whereas upstream batches fewer, larger deallocations.
We can maybe look into reducing USAGE_DEALLOC_PERCENT to
increase retention of "colder" entries.

We also see much less LWLock contention in the patched churn case.
The top wait is PgStatsDSA (502 total) vs pg_stat_statements (7,757)
in upstream, a 15x reduction.

Looking forward to your feedback!

--
Sami Imseih
Amazon Web Services (AWS)

#10

Zsolt Parragi

zsolt.parragi@percona.com

about 2 months ago

In reply to: Sami Imseih (#9)

Re: Improve pg_stat_statements scalability

Hello!

I noticed that the patch has some issues with minmax_only, and these differences seem to be unintended to me:

1. minmax only resets updates mean/sum_var_time (but then when it calculates them again, it also uses the data before the minmax reset)

+		/*
+		 * Reset only min/max timing values and update minmax_stats_since.
+		 * Iterate matching entries in the dshash, reset the corresponding
+		 * pgstat counters' min/max fields.
+		 */
...
+						shared->counters.mean_time[kind] = 0;
+						shared->counters.sum_var_time[kind] = 0;

2. min_time doesn't recover after a minmax reset:

+		if (pending->min_time[kind] < shared->min_time[kind])
+			shared->min_time[kind] = pending->min_time[kind];

As it is only updated if it's smaller, but it is 0 after it, and because the call count remains > 0 it doesn't get initialized with the current value.

+		if ((int) pg_atomic_read_u64(&pgss_shared->nentries) <= pgss_max * (100 - USAGE_DEALLOC_PERCENT) / 100)
+			break;

isn't there a possible overflow here? The similar deleted check had explicit uint64 casts for pgss_max multiplication.

- /*
- * Don't proceed if file does not exceed 512 bytes per possible entry.
- */
- if ((uint64) extent < (uint64) 512 * pgss_max)
- return false;

#11

Kyotaro Horiguchi

horikyota.ntt@gmail.com

about 2 months ago

In reply to: Sami Imseih (#9)

Re: Improve pg_stat_statements scalability

Hello.

I only reviewed v3-0001 so far, and the comments below are limited to
that patch. Some of them are higher-level observations rather than
comments on individual code fragments.

+Datum
+pg_stat_report_anytime(PG_FUNCTION_ARGS)

The name pg_stat_report_anytime() feels a bit confusing to me.

While "report" makes sense from an implementation perspective because
the function ultimately reports local statistics to the shared
statistics subsystem, that meaning is not very obvious from the
perspective of a SQL-visible function.

Likewise, I'm not sure that "report_anytime" clearly conveys what the
function actually does. Although this is not intended for general
users, we already have pg_stat_force_next_flush(), and it seems
preferable to follow a similar naming convention.

With that in mind, perhaps something like
pg_stat_flush_anytime_stats() would be more descriptive.

+void
+pgstat_report_anytime_stat(void)

In the previous thread, I understood that there was discussion about
introducing a flush interval of around one second, and that the
direction later shifted toward reusing PGSTAT_MIN_INTERVAL.

While the introduction of a manually triggered API changes when
flushes are initiated, it does not seem to change the need for
limiting the flush frequency itself.

As far as I can see, the current implementation does not apply any
throttling to calls of this API. In theory, a large number of backends
could invoke it frequently and generate a high flush rate. For
example, if 1000 backends were to call it once per second, that would
result in 1000 shared-stats updates per second.

Whether such a usage pattern would occur in practice is a separate
question. However, given that existing pgstat code uses
PGSTAT_MIN_INTERVAL to limit flush frequency, it seems to me that
anytime stats should probably retain a similar restriction.

-		Assert(did_flush || nowait);
+		/*
+		 * When nowait is false we block for the lock, so the only reason a
+		 * flush_pending_cb can legitimately return false is that the entry
+		 * has active transaction state that must not be freed yet (e.g.
+		 * relation stats with trans != NULL).  That situation only arises
+		 * mid-transaction, hence the IsTransactionOrTransactionBlock() check.
+		 */
+		Assert(did_flush || nowait || IsTransactionOrTransactionBlock());

Given the current design, the assertion itself looks reasonable to me,
since we're now allowing flush_pending_cb() to return false during
mid-transaction flushes. However, I find the comment a bit difficult
to reconcile with the actual condition. The comment explains a
specific reason why flush_pending_cb() may legitimately return false,
while the assertion itself only distinguishes between transaction and
non-transaction contexts.

As I understand it, the intent here is to keep the existing
requirement outside a transaction while relaxing it for
mid-transaction flushes. If so, I wonder whether expressing the
assertion as

Assert(IsTransactionOrTransactionBlock() ||
(did_flush || nowait));

would make that intent a bit more obvious.

If that's indeed the intent, perhaps the comment could also be phrased
in terms of "the assertion is relaxed during a transaction" rather
than describing a specific reason why flush_pending_cb() may return
false.

 	 * Ignore entries that didn't accumulate any actual counts, such as
-	 * indexes that were opened by the planner but not used.
+	 * indexes that were opened by the planner but not used.  The entry cannot
+	 * be freed if there is active transaction state, since
+	 * AtEOXact_PgStat_Relations will still merge counters into it.
 	 */
 	if (pg_memory_is_all_zeros(&lstats->counts,
 							   sizeof(struct PgStat_TableCounts)))
-		return true;
+		return (lstats->trans == NULL);

I may be missing something, but I think there may be a more
fundamental issue behind this change.

Historically, flushing and freeing an entry were effectively the same
decision because flushing only happened after transaction end. With
anytime flushes, however, those become separate concerns. A callback
may be able to flush everything that is appropriate for the current
flush context, while the caller may still need to keep the entry
around for later transaction-end processing.

That makes it hard to reason about checks such as
pg_memory_is_all_zeros(&lstats->counts, ...). This check still appears
to tell whether the counters themselves are zero, but the proposed
change makes it look as if that is no longer enough to determine whether
the entry is "empty", because entry lifetime is also folded into the
callback result.

I wonder whether it would be cleaner for the caller to make the lifetime
decision, based on the kind of flush being performed, and for the
callback to be told that flush context explicitly. For example, the
callback could be passed whether this is an anytime flush or a
transaction-boundary flush, flush the counters that are appropriate for
that context, and then return whether any counters remain.

That way, the callback would not need to infer entry lifetime from
transaction state, and the return value could keep a simpler meaning.

Regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#12

Michael Paquier

michael@paquier.xyz

about 1 month ago

In reply to: Kyotaro Horiguchi (#11)

Re: Improve pg_stat_statements scalability

On Mon, Jun 01, 2026 at 01:58:58PM +0900, Kyotaro Horiguchi wrote:

As far as I can see, the current implementation does not apply any
throttling to calls of this API. In theory, a large number of backends
could invoke it frequently and generate a high flush rate. For
example, if 1000 backends were to call it once per second, that would
result in 1000 shared-stats updates per second.

Whether such a usage pattern would occur in practice is a separate
question. However, given that existing pgstat code uses
PGSTAT_MIN_INTERVAL to limit flush frequency, it seems to me that
anytime stats should probably retain a similar restriction.

Hmm, OK. One cost in this decision is that it could randomly make the
tests randomly slower, which is one reason why this patch has been
added to this thread.

Historically, flushing and freeing an entry were effectively the same
decision because flushing only happened after transaction end. With
anytime flushes, however, those become separate concerns. A callback
may be able to flush everything that is appropriate for the current
flush context, while the caller may still need to keep the entry
around for later transaction-end processing.

Noted.

That makes it hard to reason about checks such as
pg_memory_is_all_zeros(&lstats->counts, ...). This check still appears
to tell whether the counters themselves are zero, but the proposed
change makes it look as if that is no longer enough to determine whether
the entry is "empty", because entry lifetime is also folded into the
callback result.

I wonder whether it would be cleaner for the caller to make the lifetime
decision, based on the kind of flush being performed, and for the
callback to be told that flush context explicitly. For example, the
callback could be passed whether this is an anytime flush or a
transaction-boundary flush, flush the counters that are appropriate for
that context, and then return whether any counters remain.

So you would suggest to extend the flush callback(s) with a context
value instead of having each flush callback do their own decisions
depending on if we are in a transaction block or not, right?

Another question that I would have for you is: do you think that we
should try to keep pgstat_report_stat() as the sole entry point for
the flush of the stats? Or should we try to keep the same approach as
what this patch is doing with a new routine that loops through the
flush callbacks? FWIW, I am kind of finding the approach of having a
single entry point rather than two as more appealing with the
long-term picture in mind. That's just a single take, where we could
just pgstat_report_stat() an extra argument based on the context where
it is called, lifting its current !IsTransactionOrTransactionBlock()
requirement.

As you are the original author of this area of this code, I'd be
really interested in your opinion here.
--
Michael

#13

Sami Imseih

samimseih@gmail.com

about 1 month ago

In reply to: Michael Paquier (#12)

Re: Improve pg_stat_statements scalability

As far as I can see, the current implementation does not apply any
throttling to calls of this API. In theory, a large number of backends
could invoke it frequently and generate a high flush rate. For
example, if 1000 backends were to call it once per second, that would
result in 1000 shared-stats updates per second.

Whether such a usage pattern would occur in practice is a separate
question. However, given that existing pgstat code uses
PGSTAT_MIN_INTERVAL to limit flush frequency, it seems to me that
anytime stats should probably retain a similar restriction.

Hmm, OK. One cost in this decision is that it could randomly make the
tests randomly slower, which is one reason why this patch has been
added to this thread.

The throttling does make the tests longer, but we can do similar to what
was done in f1e251be80a, and disable PGSTAT_MIN_INTERVAL if
injection points are enabled.

-       /* 50 tries with 100ms sleep between tries makes 5 sec total wait */
-       for (tries = 0; tries < 50; tries++)
+       /*
+        * Retry up to 50 times with 100ms between attempts (max 5s
total). Can be
+        * reduced to 3 attempts (max 0.3s total) to speed up tests.
+        */
+       int                     ntries = 50;
+
+#ifdef USE_INJECTION_POINTS
+       if (IS_INJECTION_POINT_ATTACHED("procarray-reduce-count"))
+               ntries = 3;
+#endif
+

--
Sami Imseih
Amazon Web Services (AWS)

#14

Kyotaro Horiguchi

horikyota.ntt@gmail.com

about 1 month ago

In reply to: Michael Paquier (#12)

Re: Improve pg_stat_statements scalability

Hello,

At Tue, 2 Jun 2026 15:30:40 +0900, Michael Paquier <michael@paquier.xyz> wrote in

Hmm, OK. One cost in this decision is that it could randomly make the
tests randomly slower, which is one reason why this patch has been
added to this thread.

That's a fair point.

If this interface is primarily intended for testing purposes, then I
don't think throttling is particularly important.

On the other hand, if it is expected to be used outside testing as
well, I think some form of throttling would still make sense. In that
case, I think pg_stat_force_next_flush() should probably apply to
anytime flushes as well.

I wonder whether it would be cleaner for the caller to make the lifetime
decision, based on the kind of flush being performed, and for the
callback to be told that flush context explicitly. For example, the
callback could be passed whether this is an anytime flush or a
transaction-boundary flush, flush the counters that are appropriate for
that context, and then return whether any counters remain.

So you would suggest to extend the flush callback(s) with a context
value instead of having each flush callback do their own decisions
depending on if we are in a transaction block or not, right?

Yes, that's roughly what I had in mind.

My concern was that the callback result used to answer a simple
question: whether anything remained after the flush.

The flush context is already known by the caller, so it seems cleaner
to keep that information separate. The callback can simply report
whether anything remains.

That keeps the meaning of the callback result straightforward. The
caller can make decisions based on the flush context, while the
callback only reports the state of the counters themselves.

One alternative would be to introduce a separate callback for anytime
flushes. However, since transaction-boundary flushes ultimately need
to flush everything anyway, it seems to me that passing a context flag
would be sufficient.

Another question that I would have for you is: do you think that we
should try to keep pgstat_report_stat() as the sole entry point for
the flush of the stats? Or should we try to keep the same approach as
what this patch is doing with a new routine that loops through the
flush callbacks? FWIW, I am kind of finding the approach of having a
single entry point rather than two as more appealing with the
long-term picture in mind. That's just a single take, where we could
just pgstat_report_stat() an extra argument based on the context where
it is called, lifting its current !IsTransactionOrTransactionBlock()
requirement.

As you are the original author of this area of this code, I'd be
really interested in your opinion here.

I tend to agree with the single-entry-point approach.

The basic operation is common; the anytime case differs only in the
context in which the flush is requested. The flush-timing logic in
pgstat_report_stat() may become slightly more complicated with anytime
flushes, but after that point I think the existing path should
continue to work with only minor adjustments, by carrying an explicit
context value down to the callbacks.

Regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#15

Michael Paquier

michael@paquier.xyz

about 1 month ago

In reply to: Kyotaro Horiguchi (#14)

Re: Improve pg_stat_statements scalability

On Wed, Jun 03, 2026 at 05:10:39PM +0900, Kyotaro Horiguchi wrote:

One alternative would be to introduce a separate callback for anytime
flushes. However, since transaction-boundary flushes ultimately need
to flush everything anyway, it seems to me that passing a context flag
would be sufficient.

Nah. I am not really in favor of an extra callback. I feel that this
could lead to more mistakes when implementing new stats kinds, so
doing your approach of using a value based on the context of the flush
works for me.

I tend to agree with the single-entry-point approach.

The basic operation is common; the anytime case differs only in the
context in which the flush is requested. The flush-timing logic in
pgstat_report_stat() may become slightly more complicated with anytime
flushes, but after that point I think the existing path should
continue to work with only minor adjustments, by carrying an explicit
context value down to the callbacks.

Yeah, same feeling here. That would be the way I'd use to approach
this design, and to aim for one single entry point for all the stats
reports and the flushes, for all the contexts.
--
Michael

#16

Sami Imseih

samimseih@gmail.com

about 1 month ago

In reply to: Michael Paquier (#15)

Re: Improve pg_stat_statements scalability

Hi,

Sorry for the delay, and thank you for the comments!

On Wed, Jun 03, 2026 at 05:10:39PM +0900, Kyotaro Horiguchi wrote:

One alternative would be to introduce a separate callback for anytime
flushes. However, since transaction-boundary flushes ultimately need
to flush everything anyway, it seems to me that passing a context flag
would be sufficient.

Nah. I am not really in favor of an extra callback. I feel that this
could lead to more mistakes when implementing new stats kinds, so
doing your approach of using a value based on the context of the flush
works for me.

I will proceed with this approach.

I tend to agree with the single-entry-point approach.

The basic operation is common; the anytime case differs only in the
context in which the flush is requested. The flush-timing logic in
pgstat_report_stat() may become slightly more complicated with anytime
flushes, but after that point I think the existing path should
continue to work with only minor adjustments, by carrying an explicit
context value down to the callbacks.

Yeah, same feeling here. That would be the way I'd use to approach
this design, and to aim for one single entry point for all the stats
reports and the flushes, for all the contexts.

We can keep the scope of the work for now limited to the testing
infrastructure and allow pg_stat_force_next_flush() to call
pgstat_report_stat() while within a transaction. This will not require
throttling.

0004: pg_stat_statements: modernize entry storage with pgstat kind

The main patch. Replaces ShmemInitHash with dshash via DSM registry,

v3-0004 introduces a secondary dshash meant to track the query
dsa_pointer, but its larger purpose is to avoid having
pg_stat_statements_internal or eviction do a full scan on the
core dshash to look up its own entries; which means filtering
out entries for other kinds.

After discussing this approach with several people at recent
conferences (Lukas Fittl and Robert Haas), I think we should do
something more robust in the core stats system instead.

I have implemented an approach, which I will share in the next
revision, where every stats kind is written to its own dedicated
dshash rather than into the common one. This is a new
configuration in PgStat_KindInfo where an extension (or even a
core stats kind) can request to be placed in a dedicated hash,
else in the common hash.

I like this better, especially for kinds with unique access
patterns like pg_stat_statements, where the view function and
eviction need to iterate over all their own entries frequently.
A dedicated hash eliminates the full-scan problem without a
secondary coordination dshash.

The implementation is straightforward and does not require
significant API changes.

--
Sami Imseih
Amazon Web Services (AWS)

#17

Michael Paquier

michael@paquier.xyz

22 days ago

In reply to: Sami Imseih (#16)

Re: Improve pg_stat_statements scalability

On Fri, Jun 12, 2026 at 06:21:42PM -0500, Sami Imseih wrote:

I have implemented an approach, which I will share in the next
revision, where every stats kind is written to its own dedicated
dshash rather than into the common one. This is a new
configuration in PgStat_KindInfo where an extension (or even a
core stats kind) can request to be placed in a dedicated hash,
else in the common hash.

I like this better, especially for kinds with unique access
patterns like pg_stat_statements, where the view function and
eviction need to iterate over all their own entries frequently.
A dedicated hash eliminates the full-scan problem without a
secondary coordination dshash.

I'm finding this alternative quite appealing, indeed. For PGSS, I was
wondering about how it would be possible to avoid a secondary hash,
and I did not consider an approach where we could bypass the main one.
--
Michael

#18

Sami Imseih

samimseih@gmail.com

19 days ago

In reply to: Michael Paquier (#17)

Re: Improve pg_stat_statements scalability

Hi,

Sorry for the long silence, but I have been spending time on
this and would like to share my updates.

On Wed, Jun 03, 2026 at 05:10:39PM +0900, Kyotaro Horiguchi wrote:

One alternative would be to introduce a separate callback for anytime
flushes. However, since transaction-boundary flushes ultimately need
to flush everything anyway, it seems to me that passing a context flag
would be sufficient.

Nah. I am not really in favor of an extra callback. I feel that this
could lead to more mistakes when implementing new stats kinds, so
doing your approach of using a value based on the context of the flush
works for me.

I will proceed with this approach.

After further thought, I don't think passing a context flag to the
callback is necessary here.

0001 makes pg_stat_force_next_flush() a single entry point for force
flushing. If called inside a transaction, it flushes whatever can
safely be flushed at that point.

The only restriction is in pgstat_relation_flush_cb(), where transactional
counters cannot be flushed while there is active transaction state
for that entry (lstats->trans != NULL). That state is already directly
available to the callback, so I don't see what value passing a context
that most callbacks will not have a need for will add?

I like this better, especially for kinds with unique access
patterns like pg_stat_statements, where the view function and
eviction need to iterate over all their own entries frequently.
A dedicated hash eliminates the full-scan problem without a
secondary coordination dshash.

I'm finding this alternative quite appealing, indeed. For PGSS, I was
wondering about how it would be possible to avoid a secondary hash,
and I did not consider an approach where we could bypass the main one.

0002 does this. own_hash is the name I chose for the option, but maybe
there is something better. This can only be defined for variable-numbered
stats. There is an API used by core and available to extensions,
pgstat_get_hash_for_kind, so they can retrieve the hash which is
needed for passing to dshash_* related calls.

Also, I realized that we need an API to drop a stats entry during an
in-progress dshash scan. The existing APIs (pgstat_drop_entry()
and pgstat_drop_matching_entries()) cannot be used for this because
they initiate their own dshash lookup or scan internally, which
conflicts with a scan the caller already has open. pgstat_drop_current()
operates directly on the entry the caller already holds from dshash_seq_next().

Also, pgstat_has_pending() lets the caller check whether an entry
has unflushed local data without doing a dshash lookup, which would
also conflict with the open scan.

This will be needed to implement a pg_stat_statements_reset() because
we must be able to scan through the kind's dshash and delete directly
and without undoing a pending change. But I also think there are
other places where this will be needed depending on the final
design of eviction. Attached 0003 does this.

There is also a piece of infrastructure missing that I have not yet
included. Currently, setting pg_stat_statements.save to false prevents
auxiliary data (query text, etc.) from being saved to disk, but does
not prevent the stat entries themselves from being written to the core
stats file, since .write_to_file is set at kind registration time and
cannot change.

I think we need a way for a kind to override .write_to_file at
runtime. PgStat_KindInfo must be a defined as a const.
One option would be a callback to override .write_to_file, which
an extension can then perhaps use to make more granular decisions
about which stats to serialize to the core stats file. Another
idea is an override options API, but I feel better about the
callback.

Now, as far as eviction goes. I have spent time with the help of
my benchmarking suite [1]https://github.com/samimseih/pg_stat_statements_benchmark looking into various strategies. My main
finding is that performing parallel evictions as discussed and
proposed in v3 will not work. In the high churn case, we will
not be able to put enough back pressure to keep the growth
of the dshash in check.

So, the idea that I have now is what I am attaching in 0004.

The important design points are:

1. We need to hard-cap growth of the dshash, so we cannot avoid
skipping new entries if we hit the cap. Right now the hard-cap
is pgss_max * 2.

2. We need to give new entries time to live before they become
eviction candidates. This helps in low-churn-but-some-churn cases.

3. Least-recently-used will be the eviction strategy, with a
threshold to keep new entries alive long enough to build up
their popularity before becoming candidates.

0004 is the work-in-progress idea for now. It uses a LWLock to
serialize eviction and new entries, but it also does eviction
throttling. The numbers I have on the benchmarks I have done
show really good numbers/scalability and I will be running
more benchmarks across all the workloads I have in [1]https://github.com/samimseih/pg_stat_statements_benchmark and
will share the results.

I also have some code on my machine that implements something
inspired by Redis LFU [2]https://github.com/redis/redis/blob/unstable/src/evict.c, in which instead of a single process
performing an eviction, the processes are constantly sampling
and updating eviction threshold and evictions are done in-line.
This requires a small ring buffer which we track new entries
( they are also in dshash ) and if they do not gain enough
popularity, they become candidates from removal from the dshash and
room is made for the next new entry. Not exactly the Redis
implementation, but something like that that can work for
pg_stat_statements. The bonus points for this is it can be done
without a global lock as evictions are in line. I will also include
this in my benchmarks.

I think the infrastructure patches are going to be needed regardless
of which direction to take, so I would like to see those get
committed while we work out the eviction details.

[1]: https://github.com/samimseih/pg_stat_statements_benchmark
[2]: https://github.com/redis/redis/blob/unstable/src/evict.c

--
Sami Imseih
Amazon Web Services (AWS)

#19

Sami Imseih

samimseih@gmail.com

19 days ago

In reply to: Sami Imseih (#18)

Re: Improve pg_stat_statements scalability

Now, as far as eviction goes. I have spent time with the help of
my benchmarking suite [1] looking into various strategies. My main
finding is that performing parallel evictions as discussed and
proposed in v3 will not work. In the high churn case, we will
not be able to put enough back pressure to keep the growth
of the dshash in check.

So, the idea that I have now is what I am attaching in 0004.

The important design points are:

1. We need to hard-cap growth of the dshash, so we cannot avoid
skipping new entries if we hit the cap. Right now the hard-cap
is pgss_max * 2.

2. We need to give new entries time to live before they become
eviction candidates. This helps in low-churn-but-some-churn cases.

3. Least-recently-used will be the eviction strategy, with a
threshold to keep new entries alive long enough to build up
their popularity before becoming candidates.

One subtle but important point is that eviction decisions are
driven by a new atomic counter that tracks live entries, which
is incremented when a new entry is created and decremented when
and an entry is dropped by eviction. This is a counter that
pg_stat_statements defines. The reason being is the core
pgstat_get_entry_count() is the number of entries in the dshash,
which also includes dropped-but-still-referenced by other backend
entries. So this core counter cannot be used to drive when an eviction
occurs, but it's used to set a hard cap on the growth of the dshash.

--
Sami Imseih
Amazon Web Services (AWS)

#20

Bertrand Drouvot

bertranddrouvot.pg@gmail.com

15 days ago

In reply to: Sami Imseih (#18)

Re: Improve pg_stat_statements scalability

Hi,

On Fri, Jun 26, 2026 at 05:33:28PM -0500, Sami Imseih wrote:

Hi,

Sorry for the long silence, but I have been spending time on
this and would like to share my updates.

On Wed, Jun 03, 2026 at 05:10:39PM +0900, Kyotaro Horiguchi wrote:

One alternative would be to introduce a separate callback for anytime
flushes. However, since transaction-boundary flushes ultimately need
to flush everything anyway, it seems to me that passing a context flag
would be sufficient.

Nah. I am not really in favor of an extra callback. I feel that this
could lead to more mistakes when implementing new stats kinds, so
doing your approach of using a value based on the context of the flush
works for me.

I will proceed with this approach.

After further thought, I don't think passing a context flag to the
callback is necessary here.

I think this is needed so that custom stats extensions know that it exists (as
they would get a compilation errors and would need to make use of it).

Indeed, if we don't add a flag then it's easy for a custom stats kind to
crash on assert-enabled build or produces weird behavior on non-assert enabled
build. All it has to do is, for example, call GetCurrentTransactionStopTimestamp()
at flush time, say:

static bool
my_extension_flush_pending_cb(PgStat_EntryRef *entry_ref, bool nowait)
{
....
PgStatShared_myextensiontEntry *shared;
....
shared->last_flush_time = GetCurrentTransactionStopTimestamp();
...
}

Then:

BEGIN;
do stuff that would produce the extension to flush stats;
SELECT pg_stat_force_next_flush();

Would:

TRAP: failed Assert("s->state == TRANS_DEFAULT || s->state == TRANS_COMMIT || s->state == TRANS_ABORT || s->state == TRANS_PREPARE"), File: "../src/backend/access/transam/xact.c", Line: 901, PID: 1111147
postgres: postgres postgres [local] SELECT(ExceptionalCondition+0xa4)[0x646bb1bbf2d4]
postgres: postgres postgres [local] SELECT(GetCurrentTransactionStopTimestamp+0x53)[0x646bb1546683]

By adding a flag, extensions owner would be aware that a change is needed.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#21

Kyotaro Horiguchi

horikyota.ntt@gmail.com

14 days ago

In reply to: Sami Imseih (#18)

#22

Sami Imseih

samimseih@gmail.com

14 days ago