Improve pg_stat_statements scalability
Hi,
pg_stat_statements has well-known scaling problems under
high concurrency. This patch series is an initial proposal
for how to $SUBJECT.
The three scaling problems:
1. Per-entry SpinLock contention. Every counter update acquires
a SpinLock on the entry. Hot queries executed across many
backends contend on the same lock, and the hold time grows as
we add more counters to the struct. Higher core counts mean
more CPUs contending for the same spinlock, making the problem
worse on modern hardware. See this complaint here [3]/messages/by-id/btsjlfnqge3y6yypkwe7yvhv2tcopt6pug7gigz6xaha2iemkw@lflv3psi7xoz.
2. Deallocation under exclusive LWLock. The hash table is
fixed-size, bounded by pg_stat_statements.max. When full, a
single backend performs least-frequently-used eviction of
the bottom 5% while holding an exclusive LWLock, blocking all
other backends from reading or writing pg_stat_statements.
This happens inline during query execution, and workloads with
many unique queries trigger it frequently. Making matters
worse, pg_stat_statements.max is a static GUC, requiring a
full server restart to change, and that is disruptive in
production systems.
3. Query text file bloat and GC. Deallocated entries leave dead
text in the external file. When the file grows to 2x the size
of live text, GC rewrites the entire file under exclusive lock.
Disk I/O under exclusive lock is the worst combination, and
frequent deallocations (problem #2) trigger more GC.
Using the pgstat, "statistics collector", system to improve
pg_stat_statements scalability has been discussed previously
[4]: /messages/by-id/aKF0V-T8-XAxj47T@paquier.xyz
to shared memory, avoiding spinlock contention on the write
path. Also, the storage is a dshash (partitioned hash table),
which reduces contention on lookups and allows dynamic resizing
without restart. The prerequisite work to make this usable by
extensions has been building over two release cycles:
- PG18: Pluggable cumulative statistics API
(pgstat_register_kind) [7949d959]
- PG19: Serialization callbacks (to_serialized_data,
from_serialized_data, finish) for custom stats kinds
[4ba012a8ed9]
This patch builds on the prerequisite work to address the
scalability issues mentioned above.
The series consists of two patches:
---
[0001]: pgstat: add pgstat_flush_pending() and pg_stat_flush_pending(pid)
Adds infrastructure for flushing pending statistics to shared
memory on demand. pgstat_flush_pending() flushes all pending
entries in the calling backend immediately. Unlike
pgstat_report_stat(), it can be called mid-transaction, making
it suitable for view functions that need fresh shared stats
before the transaction ends.
pg_stat_flush_pending(pid) is the SQL-callable interface to
the same function.
This patch is related to the discussion in [1]/messages/by-id/acNTfL1xO_UUXkZQ@paquier.xyz about flushing
stats within running transactions. The pg_stat_statements
modernization provides a good example where this is useful.
At least for the pg_stat_statements tests, we need a way to
flush statistics within a transaction. I plan to spin this off
to a dedicated thread, but include it here for now so this
patchset can be tested.
[0002]: pg_stat_statements: modernize entry storage with pgstat kind
pgstat kind
This is the main patch. It replaces the private shared-memory
hash table with the pgstat subsystem's dshash (registered as a
custom stats kind).
The pgstats hash key has an objid:
typedef struct PgStat_HashKey
{
PgStat_Kind kind; /* statistics entry kind */
Oid dboid; /* database ID. InvalidOid for shared
objects. */
uint64 objid; /* object ID (table, function, etc.), or
* identifier. */
} PgStat_HashKey;
So the entry objid is computed by combining hashes from userid, queryid and
toplevel.
Note that because of 0001 there are no changes required to the
existing regression tests. We may need to add some new tests,
but I have not thought too much about that yet.
Key design changes:
- No SpinLock on the execution path. Counters accumulate
per-backend and merge into shared memory on flush. Stddev
and related fields use Welford's algorithm as before, but since the stats
are first updated locally, we need
a way to merge the stats on flush, so we use Chan's parallel
algorithm for this purpose [5]https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm.
- No fixed shared memory allocation. Entries live in the pgstat
dshash, which grows dynamically. pg_stat_statements.max
becomes PGC_SIGHUP, and therefore can be changed dynamically.
- Throttled inline eviction. When entry count reaches
pgss_max, a backend attempts eviction using a conditional
lock and a shared timestamp that ensures at most one eviction
cycle per 10 seconds. Other backends simply skip entry
creation without blocking. This is acceptable because the
current upstream behavior already suffers from the same data
loss under heavy churn. Newly created entries are immediately
candidates for eviction and are frequently removed in the
next deallocation cycle. The throttled approach makes this
trade-off explicit and avoids the cost of blocking all backends
behind an exclusive lock to create space for entries that may
just be removed shortly. See benchmark results below for
measuring the retention of "hot" entries.
The aging decay mechanism remains in place, but the sticky
entries are no longer needed for this patch, and we simply
don't evict calls == 0.
A background worker for eviction was also considered. The
current inline approach was chosen because it avoids the
added complexity while still preventing other backends from
blocking. If a background worker has more merit, I am open
to that discussion.
- Query texts in DSA memory with disk fallback. A new GUC
pg_stat_statements.query_text_memory (default 64 MB) controls
DSA shared memory for query text storage. When enabled and
not exhausted, new texts are stored in DSA instead of the
external file, eliminating file I/O on the read path. If DSA
is disabled (set to 0) or full, texts fall back to the
existing file-based storage. Entry eviction and reset properly
free DSA-allocated texts, and GC of the text file skips
DSA-backed entries.
- Serialization via pgstat callbacks. to_serialized_data /
from_serialized_data handle saving and restoring stats across
restarts, replacing the module's bespoke shutdown/startup
logic.
- pg_stat_statements_info gains columns. num_entries (current
count), last_eviction_time, query_text_memory_bytes (DSA
currently allocated), and query_text_file_bytes (overflow
file size) provide operational visibility into eviction
behavior and memory usage. Some of these fields were discussed
in this thread [6]/messages/by-id/CAP53PkzYZ8YxH0o+Garw9fWdFRoEtmQKT09-q=2RVMW8uVS5Nw@mail.gmail.com as they are relevant even to the current
state of pg_stat_statements.
The LWLock is retained but narrowed to protect only query text
file operations and eviction. Entry-level locking is handled
by the pgstat subsystem's built-in mechanism.
---
Benchmark:
Attached is a benchmark script that runs three workloads:
- 5k: 80% hot (1000 distinct) + 20% churn (4000 distinct).
Total fits within max, no eviction expected.
- 100k: 80% hot (1000 distinct) + 20% churn (100000 distinct).
Exceeds max, continuous eviction pressure.
- spinlock: single SELECT; from all 256 clients. Pure
contention on one entry.
The benchmark collects pg_stat_activity in the background to
Measure wait events. It also polls pg_stat_statements to check
For hot and churn query retention, so we can measure hot
Query retention and also measure the overhead of querying
pg_stat_statements.
Benchmark Results:
Environment: x86_64, 16 CPUs, 29 GB RAM, Linux 6.1 (EC2).
PostgreSQL 19devel, release build, cassert=off.
pg_stat_statements.max = 5500, 256 clients, 16 threads,
5 min per test.
Test | un-patched TPS | Patch TPS | Delta
---------+----------------+-----------+------
5k | 241,078 | 239,001 | -0.9%
100k | 180,817 | 240,314 | +33%
spinlock | 328,075 | 319,466 | -3%
Wait event samples (collected every 1s for 300s):
Test | un-patched | Patch
---------+-----------------------------+-----------------
5k | ClientRead 1907 | ClientRead 2106
100k | pg_stat_statements 20416 | ClientRead 1978
| ClientRead 12874 |
spinlock | ClientRead 1804 | ClientRead 1749
Notice the huge difference in wait events. Essentially, the
pg_stat_statements waits are eliminated with this design and
there is a 33% performance improvement.
Entry retention under 100k (heavy eviction):
query type | un-patched | Patch
-----------+-----------------------------+---------------------------
hot | 1000 entries, 42-44k calls | 1000 entries, 52-58k calls
churn | 4436 entries, 1-5 calls | 4430 entries, 1-192 calls
deallocs | 37,972 | 30
Under heavy churn, the unpatched eviction behavior retains
hot entries and discards churn entries after 1-5 calls. The patched
design throttles eviction to at most once per 10 seconds, so entries
survive longer between cycles and accumulate more calls. In the patched
design, the churn entries reach 1-192 calls before eviction. Hot entries
benefit from the elimination of exclusive-lock blocking, accumulating 52-58k
calls versus 42-44k.
In the non-churn case (5k test), both designs are equivalent
in performance.
More benchmarking across different patterns is needed, but I believe this
is a good start in terms of numbers and accuracy.
---
Open questions:
1. The attached patches use PGSTAT_KIND_EXPERIMENTAL for the
custom kind ID. For commit, we'd want a proper kind number.
2. The 10-second eviction throttle is a compile-time constant
(EVICTION_INTERVAL_MS). Should this be a GUC, or is a fixed
interval sufficient?
3. The current design skips entry creation when at capacity and
eviction is throttled. An alternative would be to allow
temporary overshoot (soft limit) and never reject entries.
Thoughts on the trade-off? However, for heavy churn and
many unique entries we can easily overshoot the max by
magnitudes higher which I don't think is a good idea.
4. The default size of query text memory. The patch has it at 64MB.
[1]: /messages/by-id/acNTfL1xO_UUXkZQ@paquier.xyz
[3]: /messages/by-id/btsjlfnqge3y6yypkwe7yvhv2tcopt6pug7gigz6xaha2iemkw@lflv3psi7xoz
[4]: /messages/by-id/aKF0V-T8-XAxj47T@paquier.xyz
[5]: https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm
[6]: /messages/by-id/CAP53PkzYZ8YxH0o+Garw9fWdFRoEtmQKT09-q=2RVMW8uVS5Nw@mail.gmail.com
Patches and benchmark script attached.
--
Sami Imseih
Amazon Web Services (AWS)
Attachments:
0002-pg_stat_statements-modernize-entry-storage-with-pgst.patchapplication/octet-stream; name=0002-pg_stat_statements-modernize-entry-storage-with-pgst.patchDownload+1295-803
0001-pgstat-add-pgstat_flush_pending-and-pg_stat_flush_pe.patchapplication/octet-stream; name=0001-pgstat-add-pgstat_flush_pending-and-pg_stat_flush_pe.patchDownload+118-1
spinlock | 328,075 | 319,466 | -3%
One thing I forgot to mention above is for the spinlock contention
test, I plan on running this on a machine with much higher
core count, where I have seen this be a much larger issue in
the past.
--
Sami Imseih
Amazon Web Services (AWS)
Hi,
I had to fix an issue with the serialization callback that was causing
CI failures.
I also made re-organized the patches a bit:
v1-0001: This will not be part of the main patchset, but will be be a dependency
for the remaining patches. This is, as mentioned above, an in-transaction flush
required [1]https://commitfest.postgresql.org/patch/6781/
v1-0002: This now only moves the pg_stat_statements hash to a custom
pgstat kind
v1-0003: This moves the query text file to a DSA + file storage.
v1-0004: Adds new columns to pg_stat_statements_info
[1]: https://commitfest.postgresql.org/patch/6781/
--
Sami Imseih
Amazon Web Services (AWS)
Attachments:
v1-0003-pg_stat_statements-add-DSA-based-query-text-stora.patchapplication/octet-stream; name=v1-0003-pg_stat_statements-add-DSA-based-query-text-stora.patchDownload+278-41
v1-0002-pg_stat_statements-modernize-entry-storage-with-p.patchapplication/octet-stream; name=v1-0002-pg_stat_statements-modernize-entry-storage-with-p.patchDownload+998-816
v1-0004-pg_stat_statements-extend-pg_stat_statements_info.patchapplication/octet-stream; name=v1-0004-pg_stat_statements-extend-pg_stat_statements_info.patchDownload+75-2
v1-0001-pgstat-Introduce-pg_stat_report_anytime-for-mid-t.patchapplication/octet-stream; name=v1-0001-pgstat-Introduce-pg_stat_report_anytime-for-mid-t.patchDownload+431-36
There was a failure on FreeBSD [1]https://cirrus-ci.com/task/5948422076760064. The test uses
debug_parallel_query=regress which forces parallel plans. What
happens is the parallel worker calls pg_stat_statements() (marked
PARALLEL SAFE), tries to flush pending stats, but the leader is the one
that actually accumulated those stats.
I fixed this by:
1. Setting max_parallel_workers_per_gather = 0 in
pg_stat_statements.conf, and only enabling it during
parallel.sql when we actually want to track a parallel query.
2. Bumping pg_s_s to version 1.14 and marking pg_stat_statements()
and pg_stat_statements_reset() as PARALLEL RESTRICTED to ensure
these functions only execute in the leader, which is the process
that accumulates the pending stats. These could also be marked UNSAFE,
but RESTRICTED seems better since it doesn't completely prevent
parallel plans if these functions are used with other tables; although
it's hard to imagine a real-world case where this would matter.
This also means old versions would have this issue with
debug_parallel_query, but I don't think we should change
function definitions for older versions, in case a user
downgrades pg_s_s versions. Maybe others have a
different opinion?
[1]: https://cirrus-ci.com/task/5948422076760064
--
Sami
Attachments:
v2-0004-pg_stat_statements-extend-pg_stat_statements_info.patchapplication/octet-stream; name=v2-0004-pg_stat_statements-extend-pg_stat_statements_info.patchDownload+75-2
v2-0001-pgstat-Introduce-pg_stat_report_anytime-for-mid-t.patchapplication/octet-stream; name=v2-0001-pgstat-Introduce-pg_stat_report_anytime-for-mid-t.patchDownload+431-36
v2-0003-pg_stat_statements-add-DSA-based-query-text-stora.patchapplication/octet-stream; name=v2-0003-pg_stat_statements-add-DSA-based-query-text-stora.patchDownload+278-41
v2-0002-pg_stat_statements-modernize-entry-storage-with-p.patchapplication/octet-stream; name=v2-0002-pg_stat_statements-modernize-entry-storage-with-p.patchDownload+1011-817
On Mon, May 11, 2026 at 4:54 PM Sami Imseih <samimseih@gmail.com> wrote:
pg_stat_statements has well-known scaling problems under
high concurrency. This patch series is an initial proposal
for how to $SUBJECT.
Agreed. Let's fix this for Postgres 20.
Thank you for putting together initial patches!
The three scaling problems:
1. Per-entry SpinLock contention. Every counter update acquires
a SpinLock on the entry. Hot queries executed across many
backends contend on the same lock, and the hold time grows as
we add more counters to the struct. Higher core counts mean
more CPUs contending for the same spinlock, making the problem
worse on modern hardware. See this complaint here [3].2. Deallocation under exclusive LWLock. The hash table is
fixed-size, bounded by pg_stat_statements.max. When full, a
single backend performs least-frequently-used eviction of
the bottom 5% while holding an exclusive LWLock, blocking all
other backends from reading or writing pg_stat_statements.
This happens inline during query execution, and workloads with
many unique queries trigger it frequently. Making matters
worse, pg_stat_statements.max is a static GUC, requiring a
full server restart to change, and that is disruptive in
production systems.3. Query text file bloat and GC. Deallocated entries leave dead
text in the external file. When the file grows to 2x the size
of live text, GC rewrites the entire file under exclusive lock.
Disk I/O under exclusive lock is the worst combination, and
frequent deallocations (problem #2) trigger more GC.
Yep, agreed on these problems.
Using the pgstat, "statistics collector", system to improve
pg_stat_statements scalability has been discussed previously
[4]. It writes stats locally first and flushes periodically
to shared memory, avoiding spinlock contention on the write
path. Also, the storage is a dshash (partitioned hash table),
which reduces contention on lookups and allows dynamic resizing
without restart. The prerequisite work to make this usable by
extensions has been building over two release cycles:- PG18: Pluggable cumulative statistics API
(pgstat_register_kind) [7949d959]- PG19: Serialization callbacks (to_serialized_data,
from_serialized_data, finish) for custom stats kinds
[4ba012a8ed9]This patch builds on the prerequisite work to address the
scalability issues mentioned above.The series consists of two patches:
---
[0001] pgstat: add pgstat_flush_pending() and pg_stat_flush_pending(pid)
Adds infrastructure for flushing pending statistics to shared
memory on demand. pgstat_flush_pending() flushes all pending
entries in the calling backend immediately. Unlike
pgstat_report_stat(), it can be called mid-transaction, making
it suitable for view functions that need fresh shared stats
before the transaction ends.pg_stat_flush_pending(pid) is the SQL-callable interface to
the same function.This patch is related to the discussion in [1] about flushing
stats within running transactions. The pg_stat_statements
modernization provides a good example where this is useful.
At least for the pg_stat_statements tests, we need a way to
flush statistics within a transaction. I plan to spin this off
to a dedicated thread, but include it here for now so this
patchset can be tested.
I noticed the following in the 0001 patch (to be clear, looking at the
v2 version you posted later), which I wasn't sure about:
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c index b2ca28f83ba..848687a9f7e 100644 --- a/src/backend/utils/activity/pgstat_relation.c +++ b/src/backend/utils/activity/pgstat_relation.c @@ -828,64 +828,76 @@ pgstat_relation_flush_cb(PgStat_EntryRef *entry_ref, bool nowait)... if (t > tabentry->lastscan) tabentry->lastscan = t; } tabentry->tuples_returned += lstats->counts.tuples_returned; tabentry->tuples_fetched += lstats->counts.tuples_fetched; - tabentry->tuples_inserted += lstats->counts.tuples_inserted; - tabentry->tuples_updated += lstats->counts.tuples_updated; - tabentry->tuples_deleted += lstats->counts.tuples_deleted; tabentry->tuples_hot_updated += lstats->counts.tuples_hot_updated; tabentry->tuples_newpage_updated += lstats->counts.tuples_newpage_updated; + tabentry->blocks_fetched += lstats->counts.blocks_fetched; + tabentry->blocks_hit += lstats->counts.blocks_hit;
If I follow correctly, you're moving tuples_(inserted|updated|deleted)
here to special case them so they respect transaction rollbacks, but
you don't do that for tuples_hot_updated and tuples_newpage_updated.
Shouldn't those also be ignored if a transaction rolls back?
[0002] pg_stat_statements: modernize entry storage with
pgstat kindThis is the main patch. It replaces the private shared-memory
hash table with the pgstat subsystem's dshash (registered as a
custom stats kind).
I've taken a first pass at this, and noticed an issue (again, in the
v2 patch for clarity):
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c
b/contrib/pg_stat_statements/pg_stat_statements.c
index 92315627916..0e6e65e3e51 100644 --- a/contrib/pg_stat_statements/pg_stat_statements.c +++ b/contrib/pg_stat_statements/pg_stat_statements.c ... @@ -1254,9 +1066,62 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString, } }... +/* + * Store query text into a shared entry via the external text file. + * + * Caller must hold the entry lock. Does nothing if text is already present. + */ +static void +pgss_store_query_text(PgStatShared_Pgss *shared_entry, + const char *query, int query_len, int encoding) +{ + Size query_offset; + int gc_count; + + if (shared_entry->query_len >= 0) + return; + + LWLockAcquire(&pgss->lock.lock, LW_SHARED); + if (qtext_store(query, query_len, &query_offset, &gc_count)) + { + shared_entry->query_offset = query_offset; + shared_entry->query_len = query_len; + shared_entry->encoding = encoding; + } + LWLockRelease(&pgss->lock.lock); +} +...
/*
@@ -1313,192 +1180,171 @@ pgss_store(const char *query, int64 queryId,
key.queryid = queryId;
key.toplevel = (nesting_level == 0);- /* Lookup the hash table entry with shared lock. */ - LWLockAcquire(&pgss->lock.lock, LW_SHARED); - - entry = (pgssEntry *) hash_search(pgss_hash, &key, HASH_FIND, NULL); - - /* Create new entry, if not present */ - if (!entry) + /* + * If jstate is set, create the shared entry and store normalized query + * text. Don't increment counters; entries with zero calls are not + * returned by pg_stat_statements_internal(). + */ + if (jstate) { - Size query_offset; - int gc_count; - bool stored; - bool do_gc; + const char *store_query;- /*
- * Create a new, normalized query string if caller asked. We don't
- * need to hold the lock while doing this work. (Note: in any case,
- * it's possible that someone else creates a duplicate hashtable entry
- * in the interval where we don't hold the lock below. That case is
- * handled by entry_alloc.)
- */
- if (jstate)
...- /* Append new query text to file with only shared lock held */ - stored = qtext_store(norm_query ? norm_query : query, query_len, - &query_offset, &gc_count); - - /* - * Determine whether we need to garbage collect external query texts - * while the shared lock is still held. This micro-optimization - * avoids taking the time to decide this while holding exclusive lock. - */ - do_gc = need_gc_qtexts(); - - /* Need exclusive lock to make a new hashtable entry - promote */ - LWLockRelease(&pgss->lock.lock); - LWLockAcquire(&pgss->lock.lock, LW_EXCLUSIVE); + norm_query = generate_normalized_query(jstate, query, + query_location, + &query_len); + store_query = norm_query ? norm_query : query;- /* - * A garbage collection may have occurred while we weren't holding the - * lock. In the unlikely event that this happens, the query text we - * stored above will have been garbage collected, so write it again. - * This should be infrequent enough that doing it while holding - * exclusive lock isn't a performance problem. - */ - if (!stored || pgss->gc_count != gc_count) - stored = qtext_store(norm_query ? norm_query : query, query_len, - &query_offset, NULL); + entry_ref = pgstat_get_entry_ref_locked(PGSTAT_KIND_PGSS, + key.dbid, + pgss_objid(&key), + true); + if (!entry_ref) + { + if (norm_query) + pfree(norm_query); + return; + }- /* If we failed to write to the text file, give up */ - if (!stored) - goto done; + shared_entry = (PgStatShared_Pgss *) entry_ref->shared_stats; + pgss_entry_init(shared_entry, &key, encoding); + pgss_store_query_text(shared_entry, store_query, query_len, encoding);
It appears you've moved the equivalent of the "if (!entry)" check into
the pgss_store_query_text function, and we now unconditionally call
generate_normalized_query.
Is there a reason we couldn't just defer that work until we've
confirmed we actually need to save the query text?
The pgstats hash key has an objid:
typedef struct PgStat_HashKey
{
PgStat_Kind kind; /* statistics entry kind */
Oid dboid; /* database ID. InvalidOid for shared
objects. */
uint64 objid; /* object ID (table, function, etc.), or
* identifier. */
} PgStat_HashKey;So the entry objid is computed by combining hashes from userid, queryid and
toplevel.
Makes sense.
Note that because of 0001 there are no changes required to the
existing regression tests. We may need to add some new tests,
but I have not thought too much about that yet.Key design changes:
- No SpinLock on the execution path. Counters accumulate
per-backend and merge into shared memory on flush. Stddev
and related fields use Welford's algorithm as before, but since the stats
are first updated locally, we need
a way to merge the stats on flush, so we use Chan's parallel
algorithm for this purpose [5].- No fixed shared memory allocation. Entries live in the pgstat
dshash, which grows dynamically. pg_stat_statements.max
becomes PGC_SIGHUP, and therefore can be changed dynamically.
Big +1 on these first two design changes - its clear that we get a
benefit here, and its nice to simplify/streamline the code as well.
- Throttled inline eviction. When entry count reaches
pgss_max, a backend attempts eviction using a conditional
lock and a shared timestamp that ensures at most one eviction
cycle per 10 seconds. Other backends simply skip entry
creation without blocking. This is acceptable because the
current upstream behavior already suffers from the same data
loss under heavy churn. Newly created entries are immediately
candidates for eviction and are frequently removed in the
next deallocation cycle.
I feel this is problematic as presented (looking at pgss_maybe_evict
in v2), because even on a somewhat steady workload you have a high
risk of losing the 5001st entry (if max is 5000).
What if we instead trigger deallocation early, e.g. at 90% of entries
being full? I think this could also go well with a background worker
doing the work.
The throttled approach makes this
trade-off explicit and avoids the cost of blocking all backends
behind an exclusive lock to create space for entries that may
just be removed shortly. See benchmark results below for
measuring the retention of "hot" entries.The aging decay mechanism remains in place, but the sticky
entries are no longer needed for this patch, and we simply
don't evict calls == 0.A background worker for eviction was also considered. The
current inline approach was chosen because it avoids the
added complexity while still preventing other backends from
blocking. If a background worker has more merit, I am open
to that discussion.
I think a background worker might simplify things around deallocation
and query text GC handling - we have precedent that contrib extensions
utilize background workers, e.g. pg_prewarm, pg_stash_advice, etc.
- Query texts in DSA memory with disk fallback. A new GUC
pg_stat_statements.query_text_memory (default 64 MB) controls
DSA shared memory for query text storage. When enabled and
not exhausted, new texts are stored in DSA instead of the
external file, eliminating file I/O on the read path. If DSA
is disabled (set to 0) or full, texts fall back to the
existing file-based storage. Entry eviction and reset properly
free DSA-allocated texts, and GC of the text file skips
DSA-backed entries.
I feel like this design introduces a performance cliff at the
in-memory => disk boundary that will be surprising and hard to
understand. We had previously discussed doing an in-memory only
design, but I could see that being harder to get agreement on in
practice with the large query text files many of us have seen on
customer systems.
What if we instead designed this as an in-memory buffering mechanism?
i.e. new query texts are first written to the in-memory buffer, and
then flushed to disk. The in-memory buffer is kept as DSA like you
have it, so many concurrent sessions suddenly running a new query only
have one of them actually saving it to the buffer. We could then have
a background writer write that out to disk in batches, to make room
for new entries in the in-memory buffer.
If we wanted to get fancy we could even introduce compression when we
write it out to disk, since that work would no longer have to be
happening inline with the query execution - it could instead be
happening in the background worker.
Open questions:
1. The attached patches use PGSTAT_KIND_EXPERIMENTAL for the
custom kind ID. For commit, we'd want a proper kind number.
Makes sense to me - I assume the question is when we assign the real ID?
I think we could separately consider whether pg_stat_statements should
live in core (maybe?), but that seems like a different patch set to
me.
2. The 10-second eviction throttle is a compile-time constant
(EVICTION_INTERVAL_MS). Should this be a GUC, or is a fixed
interval sufficient?
I suspect its probably slightly too frequent, but its a bit hard to
say for sure. Personally, I don't think we need a GUC for it
necessarily, at least not from how I currently look at it.
3. The current design skips entry creation when at capacity and
eviction is throttled. An alternative would be to allow
temporary overshoot (soft limit) and never reject entries.
Thoughts on the trade-off? However, for heavy churn and
many unique entries we can easily overshoot the max by
magnitudes higher which I don't think is a good idea.
Per earlier note, I think we should change this so that we trigger
eviction early before actually reaching the max.
It seems reasonable to still drop entries when we actually exceed the
max, instead of blocking on the deallocation finishing (as we do
today).
4. The default size of query text memory. The patch has it at 64MB.
I think we can work on other parts of the design first, but I wonder
if we should auto-scale that based on shared_buffers (and allow
overriding) - 64MB default seems high for very small instances.
Thanks,
Lukas
--
Lukas Fittl
On Thu, May 21, 2026 at 9:38 PM Lukas Fittl <lukas@fittl.com> wrote:
On Mon, May 11, 2026 at 4:54 PM Sami Imseih <samimseih@gmail.com> wrote:
pg_stat_statements has well-known scaling problems under
high concurrency. This patch series is an initial proposal
for how to $SUBJECT.Agreed. Let's fix this for Postgres 20.
Thank you for putting together initial patches!
For archive's sake and others reading along, we had a productive
discussion about this today at PGConf.Dev.
I've created a new wiki page combining the prior 2025 discussion, and
notes from today:
https://wiki.postgresql.org/wiki/Scalability_of_pg_stat_statements
Thanks,
Lukas
--
Lukas Fittl
Hi,
On Fri, May 22, 2026 at 06:15:00PM -0700, Lukas Fittl wrote:
For archive's sake and others reading along, we had a productive
discussion about this today at PGConf.Dev.I've created a new wiki page combining the prior 2025 discussion, and
notes from today:https://wiki.postgresql.org/wiki/Scalability_of_pg_stat_statements
Thanks a lot Lukas!
Just a small clarification, I wasn't asking for a way to remember the last time
a query was executed (although I think it's a very good thing to have), but
remembering the time each query text was saved.
In a reasonable system (that is a system where the number of entries doesn't
grow much more than pg_stat_statements.max, ie, you can actually use the
current version pg_stat_statements) you will likely get a lot of entries even
if you filter out the entries that didn't get executed in the last X minutes.
However, the vast majority of them should not be new queries. So if you
maintain an external system that snapshot pg_stat_statements once in a while
you only care about the (possibly) missing query texts, which should be a very
small fraction of all the records you otherwise need.
pg_stat_statements has well-known scaling problems under
high concurrency. This patch series is an initial proposal
for how to $SUBJECT.Agreed. Let's fix this for Postgres 20.
Thank you for putting together initial patches!
For archive's sake and others reading along, we had a productive
discussion about this today at PGConf.Dev.I've created a new wiki page combining the prior 2025 discussion, and
notes from today:https://wiki.postgresql.org/wiki/Scalability_of_pg_stat_statements
Thank you, Lukas!
--
Sami
I've created a new wiki page combining the prior 2025 discussion, and
notes from today:https://wiki.postgresql.org/wiki/Scalability_of_pg_stat_statements
Thank you, Lukas!
Thanks for the feedback on v2, and for the productive unconference
discussion in PGConf.dev. I've been iterating on the patches/benchmarks
and wanted to post an updated series that addresses the major design
points raised.
Addressing the feedback from the unconference session:
1/ Eviction
The concern in the design presented in which entries are skipped until
a single threaded eviction completed was that on a steady workload with
max=5000, the 5001st unique query would always be immediately evicted.
This was considered unacceptable, and rightly so. It may be fine for
high churn, pathological cases, but for slow churn cases it would be
unacceptable.
Andres mentioned that eviction should occur in parallel, which I
understood as multiple backends should evict a subset of the entries in
the dshash concurrently. To do this, I implemented a parallel
clock-sweep in which eviction sweeps a partition decrementing
refcounts, and only entries that have decayed to zero are evicted. A
genuinely new query that just arrived has a fresh refcount and survives
the sweep, and if it becomes popular in the meantime, it will survive
for longer. So the algorithm wants an entry to prove itself to remain
considered "hot", else it's swept.
There's a 5% headroom using the already existing
USAGE_DEALLOC_PERCENT in which we evict until count (across all
partitions) drops to 95% of max, so there's always room for new
arrivals before the next sweep.
I chose not to use a background worker for eviction, as discussed
earlier, based on the consensus that backpressure is important, and a
background worker being asynchronous in nature will not provide that.
Also, it could be complicated where a background worker cannot be spun
up for whatever reason.
However, to implement parallel eviction as described above, I needed 2
core changes:
- pgstat_drop_entry must be able to optionally tolerate a dropped
entry, as there could be a delay between the time an entry is marked
as dropped and garbage collection, and within that time multiple
evictions may attempt to drop the same entry.
- Implement a seq scan API for dshash that scans a specific partition.
This will allow eviction to cycle through partitions, i.e.
clock-sweep.
The eviction does not attempt to make room in the bucket for an entry
that triggered it. The main point is to keep forward progress in
making room. We don't need to be more strict here. Also the fact that
pg_stat_statements.max is dynamic means a user can increase this value
to manage high churn without a restart.
The dealloc counter in pg_stat_statements_info now counts individual
entries evicted rather than the number of times eviction was invoked.
I think this is more useful, but it does change the semantics between
versions. With the new design, a single eviction pass can remove many
entries across a partition, so counting evictions no longer tells
you much. Open to other thoughts here on this.
Profiling revealed that pgstat_request_entry_refs_gc(), which is the
standard pattern for less frequently removed entries in other places,
was too expensive when called on every entry drop or every partition
sweep. Now it is only called once per full rotation across all
partitions, which showed much better results in benchmarks.
2/ DSA only query text storage
Rather than a "performance cliff" that Lukas mentioned above when we
switch between DSA and disk for query text storage, the consensus is
to just store all query text in memory. Andres made a point that even
now, transient memory usage for loading the query text for the purpose
of garbage-collection or reading the query stats in
pg_stat_statements means that a user's machine must have enough memory
to handle this. So, why not just throw all query text in memory. The
memory is capped by a new GUC pg_stat_statements.query_text_memory
(default of 4MB but up for discussion).
It is also possible to store more entries due to .max than
.query_text_memory can support, so empty query text columns could be
possible. In the case a user increases .query_text_memory if they
observe empty query columns, the next time an entry is touched, it
will backfill the query text and normalize the string if it can. The
last part could be improved and actually guaranteed if we make
JumbleState available to all hooks (I did not work on this part, but
open for discussion).
Some other comments from Lukas's earlier review:
It appears you've moved the equivalent of the "if (!entry)" check
into the pgss_store_query_text function, and we now unconditionally
call generate_normalized_query.
Fixed.
v3 series is now 5 patches:
0001: pgstat: Introduce pg_stat_report_anytime()
Nothing changes from v2.
0002: pgstat: tolerate already-dropped entries in pgstat_drop_entry()
Required for the parallel eviction design. With multiple backends
sweeping different partitions concurrently, the same entry can be
targeted for drop more than once before garbage collection runs.
A skip_dropped flag makes this safe rather than throwing ERROR.
0003: dshash: add partition-scoped sequential scan
Adds dshash_seq_init_partition() to restrict a scan to a single
partition. This is the building block for per-partition clock-sweep;
a backend only locks and sweeps one partition at a time.
0004: pg_stat_statements: modernize entry storage with pgstat kind
The main patch. Replaces ShmemInitHash with dshash via DSM registry,
and replaces per-entry spinlock counter updates with a custom pgstat
kind that uses the core pgstat infrastructure.
Eviction changes from qsort-all-entries to clock-sweep with an atomic
rotating hand. Each entry carries a refcount (capped at 10) that
decays on sweep; entries reaching zero are evicted. Hot queries keep
their refcount topped up proportionally to access frequency.
pg_stat_statements.max becomes PGC_SIGHUP.
0005: pg_stat_statements: store query text in DSA instead of file
Moves query text from pgss_query_texts.stat into a DSA area via
GetNamedDSA. Adds pg_stat_statements.query_text_memory (PGC_SIGHUP,
default 4MB) controlling DSA size. Eliminates the GC machinery
entirely. When DSA is exhausted, entries are still tracked but query
text is stored as NULL. A backfill mechanism recovers text on
subsequent executions once space becomes available.
Benchmark:
Attached are the benchmark scripts I used for v3 (and will keep
using going forward) with the results in the benchmark_v3.txt file comparing
patch vs upstream.
The benchmark performs various workloads: "high churn", "light churn",
"multi stmt", and simple "select1". I am also tracking query retention
(hot/cold entry retention) to verify the clock-sweep behaves as
expected. I attached the .sql scripts used to benchmark.
The select1 result shows a 1.0% regression. This workload has no
contention, so it purely measures pgstat infrastructure overhead;
perf profiling shows pgstat_get_entry_ref() at the top. However, on
machines with higher core count the upstream spinlock on the counters
becomes a bottleneck, which is where the dshash design should win
back this overhead and maybe more. I still plan to benchmark this on
a larger machine.
You will notice that cold_calls in the patched churn case are much
lower (805 vs 4,458). This is because entries get evicted sooner
under per-partition sweep. Hot and cold query retention lines up w
ith current upstream (1000/1000 hot entries survive continuous churn).
The deallocs count is much higher in the patch (11.9M vs 38.5K in
high churn). per-partition sweep fires frequently to keep the table
at target capacity, whereas upstream batches fewer, larger deallocations.
We can maybe look into reducing USAGE_DEALLOC_PERCENT to
increase retention of "colder" entries.
We also see much less LWLock contention in the patched churn case.
The top wait is PgStatsDSA (502 total) vs pg_stat_statements (7,757)
in upstream, a 15x reduction.
Looking forward to your feedback!
--
Sami Imseih
Amazon Web Services (AWS)
Attachments:
benchmark_v3.txttext/plain; charset=US-ASCII; name=benchmark_v3.txtDownload
v3-0001-pgstat-Introduce-pg_stat_report_anytime-for-mid-t.patchapplication/x-patch; name=v3-0001-pgstat-Introduce-pg_stat_report_anytime-for-mid-t.patchDownload+431-36
v3-0002-pgstat-tolerate-already-dropped-entries-in-pgstat.patchapplication/x-patch; name=v3-0002-pgstat-tolerate-already-dropped-entries-in-pgstat.patchDownload+19-12
v3-0003-dshash-add-partition-scoped-sequential-scan.patchapplication/x-patch; name=v3-0003-dshash-add-partition-scoped-sequential-scan.patchDownload+45-14
v3-0005-pg_stat_statements-store-query-text-in-DSA-instea.patchapplication/x-patch; name=v3-0005-pg_stat_statements-store-query-text-in-DSA-instea.patchDownload+285-556
v3-0004-pg_stat_statements-modernize-entry-storage-with-p.patchapplication/x-patch; name=v3-0004-pg_stat_statements-modernize-entry-storage-with-p.patchDownload+1229-1329
Hello!
I noticed that the patch has some issues with minmax_only, and these differences seem to be unintended to me:
1. minmax only resets updates mean/sum_var_time (but then when it calculates them again, it also uses the data before the minmax reset)
+ /*
+ * Reset only min/max timing values and update minmax_stats_since.
+ * Iterate matching entries in the dshash, reset the corresponding
+ * pgstat counters' min/max fields.
+ */
...
+ shared->counters.mean_time[kind] = 0;
+ shared->counters.sum_var_time[kind] = 0;
2. min_time doesn't recover after a minmax reset:
+ if (pending->min_time[kind] < shared->min_time[kind])
+ shared->min_time[kind] = pending->min_time[kind];
As it is only updated if it's smaller, but it is 0 after it, and because the call count remains > 0 it doesn't get initialized with the current value.
+ if ((int) pg_atomic_read_u64(&pgss_shared->nentries) <= pgss_max * (100 - USAGE_DEALLOC_PERCENT) / 100)
+ break;
isn't there a possible overflow here? The similar deleted check had explicit uint64 casts for pgss_max multiplication.
- /*
- * Don't proceed if file does not exceed 512 bytes per possible entry.
- */
- if ((uint64) extent < (uint64) 512 * pgss_max)
- return false;