pg_buffercache: Add per-relation summary stats
Hi,
See attached a patch that implements a new function,
pg_buffercache_relation_stats(), which returns per-relfilenode
statistics on the number of buffers, how many are dirtied/pinned, and
their avg usage count.
This can be used in monitoring scripts to know which relations are
kept in shared buffers, to better understand performance issues that
occur due to relations getting evicted from the cache. In our own
monitoring tool (pganalyze) we've offered functionality like this
based on the existing pg_buffercache() function for a bit over a year
now [0], and people have found this very valuable - but it doesn't
work for larger database servers.
Specifically, performing a query that gets this information can be
prohibitively expensive when using large shared_buffers, and even on
the default 128MB shared buffers there is a measurable difference:
postgres=# WITH pg_buffercache_relation_stats AS (
SELECT relfilenode, reltablespace, reldatabase, relforknumber,
COUNT(*) AS buffers,
COUNT(*) FILTER (WHERE isdirty) AS buffers_dirty,
COUNT(*) FILTER (WHERE pinning_backends > 0) AS buffers_pinned,
AVG(usagecount) AS usagecount_avg
FROM pg_buffercache
WHERE reldatabase IS NOT NULL
GROUP BY 1, 2, 3, 4
)
SELECT * FROM pg_buffercache_relation_stats WHERE relfilenode = 2659;
 relfilenode | reltablespace | reldatabase | relforknumber | buffers | buffers_dirty | buffers_pinned |   usagecount_avg
-------------+---------------+-------------+---------------+---------+---------------+----------------+--------------------
        2659 |          1663 |           5 |             0 |       8 |             0 |              0 | 5.0000000000000000
        2659 |          1663 |           1 |             0 |       7 |             0 |              0 | 5.0000000000000000
        2659 |          1663 |      229553 |             0 |       7 |             0 |              0 | 5.0000000000000000
(3 rows)
Time: 20.991 ms
postgres=# SELECT * FROM pg_buffercache_relation_stats() WHERE
relfilenode = 2659;
 relfilenode | reltablespace | reldatabase | relforknumber | buffers | buffers_dirty | buffers_pinned | usagecount_avg
-------------+---------------+-------------+---------------+---------+---------------+----------------+----------------
        2659 |          1663 |           1 |             0 |       7 |             0 |              0 |              5
        2659 |          1663 |      229553 |             0 |       7 |             0 |              0 |              5
        2659 |          1663 |           5 |             0 |       8 |             0 |              0 |              5
(3 rows)
Time: 2.912 ms
With the new function, the aggregation gets done before putting the data
into the tuplestore used for the set-returning function.
Thanks,
Lukas
[0]: https://pganalyze.com/blog/tracking-postgres-buffer-cache-statistics
--
Lukas Fittl
Attachments:
v1-0001-pg_buffercache-Add-pg_buffercache_relation_stats-.patch
On Sun, Mar 1, 2026 at 12:59 AM Lukas Fittl <lukas@fittl.com> wrote:
Hi,
See attached a patch that implements a new function,
pg_buffercache_relation_stats(), which returns per-relfilenode
statistics on the number of buffers, how many are dirtied/pinned, and
their avg usage count.
[..]
(3 rows)
Time: 20.991 ms
[..vs]
Time: 2.912 ms
Hi Lukas, I have glanced at the patch briefly and couldn't find any
issues - the patch looks solid. However, I'm not sure if e.g. launching a
whole NBuffers scan, let's say every 5 minutes, doesn't cause latency
spikes on the system? I mean, introducing such a function seems to
invite users to use pg_buffercache, and I'm wondering if such a regular
pattern doesn't cause issues? (This is not FUD :), just more like a
question based on your observation.)
Also, have you quantified what the breaking point of the previous query
was? (You wrote "larger database servers", but was that like 128GB+
shared_buffers? And if so, what would be the difference in terms of
runtime there -- also like ~7x?)
-J.
On Mon, Mar 2, 2026 at 2:16 AM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
Hi Lukas, I have glanced at the patch briefly and couldn't find any
issues - the patch looks solid. However, I'm not sure if e.g. launching a
whole NBuffers scan, let's say every 5 minutes, doesn't cause latency
spikes on the system? I mean, introducing such a function seems to
invite users to use pg_buffercache, and I'm wondering if such a regular
pattern doesn't cause issues? (This is not FUD :), just more like a
question based on your observation.)
Thanks for taking a look!
I think it's definitely the kind of thing you'd want to have people
opt in to (we currently only do it when someone enables pg_buffercache
on their own, and allow turning it off completely), but on systems
that experience performance issues due to what's in cache (and have
some CPU capacity to spare), it seems to be more helpful than harmful.
To be clear on the sample size, this is a subset of our user base
where we have experience with this, probably across 50-100 installs,
roughly speaking, of small to medium sized production systems (size as
in shared_buffers).
Also, FWIW, this isn't going to be helpful if your cache contents change
completely once a minute, but in practice I think it's more the
unexpected effects of e.g. large data loads or background processes
that mess with the cache, where tracking this over time can help you
find the root cause of slowness - we currently run this on a 10 minute
schedule when enabled, and that seems to work in terms of
understanding large swings in cache contents. I think even if you ran
this once an hour it could be helpful, and with the patch it would give
you the data you're interested in ("what's in the cache") without
causing a large temporary file to be created.
Also, have you quantified what the breaking point of the previous query
was? (You wrote "larger database servers", but was that like 128GB+
shared_buffers? And if so, what would be the difference in terms of
runtime there -- also like ~7x?)
We've currently set the default limit of where we measure this with
our tool at 200GB, but that's mainly because the temporary file that
gets written out with pg_buffercache today to do the grouping just
becomes noticeably large.
I'll work on sharing more numbers in the following days for larger
servers, to show the benefit of the patch.
Thanks,
Lukas
--
Lukas Fittl
Hi,
On Sat, Feb 28, 2026 at 03:58:34PM -0800, Lukas Fittl wrote:
Hi,
See attached a patch that implements a new function,
pg_buffercache_relation_stats(), which returns per-relfilenode
statistics on the number of buffers, how many are dirtied/pinned, and
their avg usage count.
[..]
Thanks for the patch!
A few comments:
=== 1
+typedef struct
+{
+ RelFileNumber relfilenumber;
+ Oid reltablespace;
+ Oid reldatabase;
+ ForkNumber forknum;
+} BufferRelStatsKey;
What about making use of RelFileLocator (instead of 3 members relfilenumber,
reltablespace and reldatabase)?
=== 2
+ <para>
+ The <function>pg_buffercache_relation_stats()</function> function returns a
+ set of rows summarizing the state of all shared buffers, aggregated by
+ relation and fork number. Similar and more detailed information is
+ provided by the <structname>pg_buffercache</structname> view, but
+ <function>pg_buffercache_relation_stats()</function> is significantly
+ cheaper.
+ </para>
I'm not 100% sure about the name of the function since the stats are "reset"
after a rewrite. What about pg_buffercache_relfilenode or
pg_buffercache_aggregated?
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi Lukas,
I have read the patch, and I have a few questions/comments while going through it:
Could this use RelFileLocator plus ForkNumber instead of open-coding BufferRelStatsKey? That seems closer to existing PostgreSQL abstractions for physical relation identity.
I wonder whether pg_buffercache_relation_stats() is the best name here. The function is really aggregating by relation file identity plus fork, and it is producing a summary of the current buffer contents rather than what many readers might assume from “relation stats”. Would something with summary be clearer than stats?
Why are OUT relforknumber and OUT relfilenode exposed as int2 and oid respectively? Internally these are represented as ForkNumber and RelFileNumber, so I wonder whether the SQL interface should reflect that more clearly, or at least whether the current choice should be explained.
The comment says, “Hash key for pg_buffercache_relation_stats — groups by relation identity”, but that seems imprecise. It is really grouping by relfilenode plus fork, i.e. physical relation-file identity rather than relation identity in a more logical sense.
Is PARALLEL SAFE actually desirable here, as opposed to merely technically safe? A parallel query could cause multiple workers to perform full shared-buffer scans independently, which does not seem obviously desirable for this kind of diagnostic function.
Best regards,
Haibo Yan
On Feb 28, 2026, at 3:58 PM, Lukas Fittl <lukas@fittl.com> wrote:
Hi,
See attached a patch that implements a new function,
pg_buffercache_relation_stats(), which returns per-relfilenode
statistics on the number of buffers, how many are dirtied/pinned, and
their avg usage count.
[..]
<v1-0001-pg_buffercache-Add-pg_buffercache_relation_stats-.patch>
Hi Lukas,
On Sat, Feb 28, 2026 at 3:59 PM Lukas Fittl <lukas@fittl.com> wrote:
Hi,
See attached a patch that implements a new function,
pg_buffercache_relation_stats(), which returns per-relfilenode
statistics on the number of buffers, how many are dirtied/pinned, and
their avg usage count.
Thank you for the proposal!
Paul A Jungwirth, Khoa Nguyen, and I reviewed this patch through the
Patch Review Workshop, and I'd like to share our comments.
[..]
Overall, we find that the proposed feature is useful. The proposed way
is much cheaper, especially when the number of per-relation stats is
not large.
Here are review comments on the v1 patch:
---
- pg_buffercache--1.5--1.6.sql pg_buffercache--1.6--1.7.sql
+ pg_buffercache--1.5--1.6.sql pg_buffercache--1.6--1.7.sql \
+ pg_buffercache--1.7--1.8.sql
Since commit 4b203d499c6 bumped the version from 1.6 to 1.7 last
November, we think we don't need to bump the version again for this new
feature.
---
+/*
+ * Hash key for pg_buffercache_relation_stats — groups by relation identity.
+ */
+typedef struct
+{
+ RelFileNumber relfilenumber;
+ Oid reltablespace;
+ Oid reldatabase;
+ ForkNumber forknum;
+} BufferRelStatsKey;
+
+/*
+ * Hash entry for pg_buffercache_relation_stats — accumulates per-relation
+ * buffer statistics.
+ */
+typedef struct
+{
+ BufferRelStatsKey key; /* must be first */
+ int32 buffers;
+ int32 buffers_dirty;
+ int32 buffers_pinned;
+ int64 usagecount_total;
+} BufferRelStatsEntry;
Can we move these typedefs above function prototypes as other typedefs
are defined there?
---
+ relstats_hash = hash_create("pg_buffercache relation stats",
+ 128,
+ &hash_ctl,
+ HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
It might be worth considering simplehash.h for even better performance.
---
+ while ((entry = (BufferRelStatsEntry *) hash_seq_search(&hash_seq)) != NULL)
+ {
+ if (entry->buffers == 0)
+ continue;
+
We might want to put CHECK_FOR_INTERRUPTS() here too as the number of
entries can be as many as NBuffers in principle.
---
We've discussed that there might be room for improvement in the function
name. For example, pg_buffercache_relations instead of
pg_buffercache_relation_stats might be a good name, since everything
in this module is stats. If we drop "_stats" then "relation" should be
plural, to match other functions in the module ("pages", "os_pages",
"numa_pages", "usage_counts").
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Wed, Mar 25, 2026 at 12:40 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
[..]
I know we already have a couple of hand-aggregation functions but I am
hesitant to add more of these. Question is where do we stop? For
example, the current function is useless if someone wants to find the
parts of a relation which are hot since it doesn't include page
numbers. Do we write another function for the same? Or we add page
numbers to this function and then there's hardly any aggregation
happening. What if somebody wanted to perform an aggregation more
complex than just count() like average number of buffers per relation
or distribution of relation buffers in the cache, do they write
separate functions?
Another problem is the maintenance cost these functions bring. For
example, with the resizable shared buffer project we have another
function to stress test.
Looking at the function, I see it uses a hash table to aggregate the
data. To some extent it's duplicating the functionality we already
have - aggregates using hashing. Are we going to duplicate
functionality everywhere we require aggregation on top of a system
function? These functions will then be missing any optimizations we do
to hash aggregation in future. Can we instead investigate the reason
the aggregation on top of pg_buffercache output requires so much more
time than doing it in the function and fix that as much as we can? I
know some slowness will come from tuplestore APIs, tuple formation and
deformation but I won't expect it to be 10 times slower.
--
Best Wishes,
Ashutosh Bapat
Hi Ashutosh,
On Tue, Mar 24, 2026 at 11:24 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
I know we already have a couple of hand-aggregation functions but I am
hesitant to add more of these. Question is where do we stop? For
example, the current function is useless if someone wants to find the
parts of a relation which are hot since it doesn't include page
numbers. Do we write another function for the same? Or we add page
numbers to this function and then there's hardly any aggregation
happening. What if somebody wanted to perform an aggregation more
complex than just count() like average number of buffers per relation
or distribution of relation buffers in the cache, do they write
separate functions?
I think the problem this solves, which is a very common question I
hear from end users, is "how much of this table/index is in cache?" and
"was our query slow because the cache contents changed?".
It can't provide a perfect answer to all questions regarding what's in
the cache (i.e. it won't tell you which part of the table is cached),
but it's in line with other statistics we already provide in
pg_stat_user_tables etc., which are all aggregate counts, not further
breakdowns.
It's also a reasonable compromise on providing something usable that
can be shown on dashboards, as I've seen in collecting this
information using the existing methods from small production systems
in practice over the last ~1.5 years.
Another problem is the maintenance cost these functions bring. For
example, with the resizable shared buffer project we have another
function to stress test.
Can you expand on how your testing would be impacted? I hear you on not
adding many unnecessary functions, but the basic paradigm of how it
iterates over buffers here is very similar to the other functions in
pg_buffercache - it's just shifting the aggregation to a different
level.
Looking at the function, I see it uses a hash table to aggregate the
data. To some extent it's duplicating the functionality we already
have - aggregates using hashing. Are we going to duplicate
functionality everywhere we require aggregation on top of a system
function? These functions will then be missing any optimizations we do
to hash aggregation in future. Can we instead investigate the reason
the aggregation on top of pg_buffercache output requires so much more
time than doing it in the function and fix that as much as we can? I
know some slowness will come from tuplestore APIs, tuple formation and
deformation but I won't expect it to be 10 times slower.
I don't think this is fixable outside the function, and I'd be
surprised if you could get comparable performance, unless you had an
extreme case where it was < 100 buffers per relation. There are just
too many layers involved where we'd keep the full set of buffer
entries vs the grouped version. I'm happy to be convinced otherwise,
but I won't be the one pushing forward that effort myself.
Thanks,
Lukas
--
Lukas Fittl
Hi Haibo,
Thanks for your review!
On Mon, Mar 16, 2026 at 9:21 PM Haibo Yan <tristan.yim@gmail.com> wrote:
Could this use RelFileLocator plus ForkNumber instead of open-coding BufferRelStatsKey? That seems closer to existing PostgreSQL abstractions for physical relation identity.
Yes, that was noted by other reviewers as well, and makes sense.
I wonder whether pg_buffercache_relation_stats() is the best name here. The function is really aggregating by relation file identity plus fork, and it is producing a summary of the current buffer contents rather than what many readers might assume from “relation stats”. Would something with summary be clearer than stats?
Per the most recent feedback, I'll rename this to
"pg_buffercache_relations" for now.
Why are OUT relforknumber and OUT relfilenode exposed as int2 and oid respectively? Internally these are represented as ForkNumber and RelFileNumber, so I wonder whether the SQL interface should reflect that more clearly, or at least whether the current choice should be explained.
This is consistent with how pg_buffercache_pages represents them - I
think those are the correct mappings, given that ForkNumber is an int
we know to be small in practice, and RelFileNumber is a typedef of Oid.
The comment says, “Hash key for pg_buffercache_relation_stats — groups by relation identity”, but that seems imprecise. It is really grouping by relfilenode plus fork, i.e. physical relation-file identity rather than relation identity in a more logical sense.
Good point. I'll adapt this to "groups by relation file" for now.
Is PARALLEL SAFE actually desirable here, as opposed to merely technically safe? A parallel query could cause multiple workers to perform full shared-buffer scans independently, which does not seem obviously desirable for this kind of diagnostic function.
I see your point, but I don't think a parallel plan would happen in
practice when just the function is being queried. Since other
pg_buffercache functions are also PARALLEL SAFE, I'll keep this as is
for now - if we want to adjust it we should be consistent I think.
Thanks,
Lukas
--
Lukas Fittl
Hi Bertrand,
Thanks for your review, and sorry it took me a while to get to it -
other patches took more of my attention unexpectedly.
On Mon, Mar 9, 2026 at 2:15 AM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
=== 1
[..]
What about making use of RelFileLocator (instead of 3 members relfilenumber,
reltablespace and reldatabase)?
Agreed, and noted by others later too - will adjust.
=== 2
[..]
I'm not 100% sure about the name of the function since the stats are "reset"
after a rewrite. What about pg_buffercache_relfilenode or
pg_buffercache_aggregated?
I've gone with "pg_buffercache_relations" for now in v2, but I could
also see "pg_buffercache_relfilenodes" making sense, if we wanted to
be more clear about the fact that these are physical relation files,
not logical relations.
Thanks,
Lukas
--
Lukas Fittl
Hi Masahiko-san, Paul and Khoa,
Thanks for the review!
On Tue, Mar 24, 2026 at 12:09 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
[..]
Since commit 4b203d499c6 bumped the version from 1.6 to 1.7 last
November, we think we don't need to bump the version again for this new
feature.
Makes sense, adjusted.
Can we move these typedefs above function prototypes as other typedefs
are defined there?
Makes sense, done.
[..]
It might be worth considering simplehash.h for even better performance.
Good point, adjusted.
[..]
We might want to put CHECK_FOR_INTERRUPTS() here too as the number of
entries can be as many as NBuffers in principle.
Sure, that makes sense.
We've discussed that there might be room for improvement in the function
name. For example, pg_buffercache_relations instead of
pg_buffercache_relation_stats might be a good name, since everything
in this module is stats. If we drop "_stats" then "relation" should be
plural, to match other functions in the module ("pages", "os_pages",
"numa_pages", "usage_counts").
I've renamed this to "pg_buffercache_relations", though per Bertrand's
earlier email, I could also see that it makes sense to convey more
clearly that we're returning physical relfilenodes, not logical
relations.
See attached v2 that incorporates the review feedback.
Thank you all for reviewing!
Thanks,
Lukas
--
Lukas Fittl
Attachments:
v2-0001-pg_buffercache-Add-pg_buffercache_relations-funct.patch
On Tue, Mar 24, 2026 at 11:47 PM Lukas Fittl <lukas@fittl.com> wrote:
[..]
Regarding the proposed statistics, I find them reasonably useful for
many users. I'm not sure we need to draw a strict line on what belongs
in the module. If a proposed function does exactly what most
pg_buffercache users want or are already writing themselves, that is
good enough motivation to include it.
I think pg_visibility is a good precedent here. In that module, we
have both pg_visibility_map() and pg_visibility_map_summary(), even
though we can retrieve the exact same results as the latter by simply
using the former:
select sum(all_visible::int), sum(all_frozen::int) from
pg_visibility_map('test') ;
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Wed, Mar 25, 2026 at 10:19 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
[...]
I think pg_visibility is a good precedent here. In that module, we
have both pg_visibility_map() and pg_visibility_map_summary(), even
though we can retrieve the exact same results as the latter by simply
using the former:

select sum(all_visible::int), sum(all_frozen::int) from
pg_visibility_map('test');

A summary may still be ok, but this proposal is going a bit farther:
it's grouping by one subset, which should really be done by GROUP BY in
SQL. And I do
I am afraid that at some point, we will start finding all of these to
be a maintenance burden. At that point, removing them will become a
real pain for backward compatibility reasons. For example:
1. The proposed function is going to add one more test to an already
huge testing exercise for shared buffers resizing.
2. If we change the way to manage buffer cache e.g. use a tree based
cache instead of hash + array cache, each of the functions which
traverses the buffer cache array is going to add work - adjusting it
to the new data structure - and make a hard project even harder. In
this case we have other ways to get the summary, so the code level
scan of buffer cache is entirely avoidable.
If I am the only one opposing it, and there are more senior
contributors in favour of adding this function, we can accept it.
--
Best Wishes,
Ashutosh Bapat
Thanks for the patch, I like the idea.
On Sat, Feb 28, 2026 at 03:58:34PM -0800, Lukas Fittl wrote:
This can be used in monitoring scripts to know which relations are
kept in shared buffers, to understand performance issues better that
occur due to relations getting evicted from the cache. In our own
monitoring tool (pganalyze) we've offered a functionality like this
based on the existing pg_buffercache() function for a bit over a year
now [0], and people have found this very valuable - but it doesn't
work for larger database servers.
I see how a relation's footprint in the buffer cache can be useful, e.g.
how many buffers per relation are used as well as how many of them are
dirty. But how can one benefit from the number of pinned buffers and the
average usage count per relation - is there a clear understanding of
what to do about those numbers?
The <function>pg_buffercache_summary()</function> function returns a
single row summarizing the state of all shared buffers.
The <function>pg_buffercache_relations()</function> function returns
a set of rows summarizing the state of all shared buffers, aggregated by
relation and fork number.
Maybe it was already asked before, but given that those two functions
(pg_buffercache_summary and pg_buffercache_relations) seem to have a
very similar goal and the summary just happens in different ways, is
there a way to somehow unify them and have one function instead of two
separate ones? It could be a unified function implementation with two
different interfaces to call, or even a single pg_buffercache_summary
with an argument specifying how to summarize.
Also, would it be useful to have a numa-aware counterpart for the
per-relation summary? Something like: this relation has so and so many
buffers, and 50% of them are located on the node 1, while the other 50%
on the node 2.
On 3/26/26 05:21, Ashutosh Bapat wrote:
[...]
A summary may still be ok, but this proposal is going a bit farther:
it's grouping by one subset, which should really be done by GROUP BY in
SQL. And I do

I am afraid that at some point, we will start finding all of these to
be a maintenance burden. At that point, removing them will become a
real pain for the backward compatibility reason. For example
1. The proposed function is going to add one more test to an already
huge testing exercise for shared buffers resizing.
2. If we change the way to manage buffer cache e.g. use a tree based
cache instead of hash + array cache, each of the functions which
traverses the buffer cache array is going to add work - adjusting it
to the new data structure - and make a hard project even harder. In
this case we have other ways to get the summary, so the code level
scan of buffer cache is entirely avoidable.

If I am the only one opposing it, and there are more senior
contributors in favour of adding this function, we can accept it.
I understand this argument - we have SQL, which allows us to process the
data in a flexible way, without hard-coding all interesting groupings.
The question is whether this particular grouping is special enough to
warrant a custom *faster* function.
The main argument here seems to be the performance, and the initial
message demonstrates a 10x speedup (2ms vs. 20ms) on a cluster with
128MB shared buffers. Unless I misunderstood what config it uses.
I gave it a try on an azure VM with 32GB shared buffers, to make it a
bit more realistic, and my timings are 10ms vs. 700ms. But I also wonder
if the original timings really were from a cluster with 128MB, because
for me that shows 0.3ms vs. 3ms (so an order of magnitude faster than
what was reported). But I suppose that's also hw specific.
Nevertheless, it is much faster. I haven't profiled this but I assume
it's thanks to not having to write the entries into a tuplestore (and
possibly into a tempfile).
But is it actually needed / worth it? I wonder what timings Lukas
observes when running this on larger clusters, because in a later email
he says:
... we currently run this on a 10 minute schedule when enabled, and
that seems to work in terms of understanding large swings in cache
contents.
I'm all in for optimizing stuff, but if you're running a monitoring task
every 10 minutes, does it matter if it's running for 1 or 5 seconds? I
find that a bit hard to believe.
Let's assume it's worth it. I wonder what similar summaries might be
interesting for users. I'd probably want to see a per-database summary,
especially on a shared / multi-tenant cluster. But AFAICS I can
calculate that from the pg_buffercache_relations() result, except that
I'll have to recalculate the usagecount.
I don't have a clear opinion on whether we should do this. I kinda doubt
it's a significant maintenance burden. It'd add one more place the patch
for on-line resizing of shared buffers needs to worry about. Surely that
should not be very difficult, considering there are ~5 other places in
this very extension doing this already?
One thing we lose by doing ad hoc aggregation (instead of just relying
on the regular SQL aggregation operators) is the memory limit. There's
a simple in-memory hash table, no spilling to disk etc. The simple
pg_buffercache view does not have this issue, because the tuplestore
will spill to disk after hitting work_mem. Simplehash won't.

The entries are ~48B, so there would need to be buffers for ~100k
(relfilenode, forknum) combinations to overflow 4MB. That's not very
common, but I've seen systems with more relations than this. It would
be good to show some numbers demonstrating it's not an issue.
A couple minor comments about the code:
1) Isn't this check unnecessary? All entries should have buffers > 0.
if (entry->buffers == 0)
continue;
2) Shouldn't this check BM_TAG_VALID too? Or is BM_VALID enough to look
at the bufHdr->tag?
/* Skip unused/invalid buffers */
if (!(buf_state & BM_VALID))
continue;
3) I think "buffers" argument should be renamed to "buffers_unused" for
consistency with pg_buffercache_summary.
Overall, I'm -0.1 on this. I'm not opposed to doing this, but I'm also
not quite convinced it's worth it.
regards
--
Tomas Vondra
On Sat, Mar 28, 2026 at 4:28 AM Tomas Vondra <tomas@vondra.me> wrote:
[...]
I gave it a try on an azure VM with 32GB shared buffers, to make it a
bit more realistic, and my timings are 10ms vs. 700ms. But I also wonder
if the original timings really were from a cluster with 128MB, because
for me that shows 0.3ms vs. 3ms (so an order of magnitude faster than
what was reported). But I suppose that's also hw specific.

Nevertheless, it is much faster. I haven't profiled this but I assume
it's thanks to not having to write the entries into a tuplestore (and
possibly into a tempfile).
In parallel, Palak Chaturvedi and I developed a quick patch to
modernise pg_buffercache_pages() and use a tuplestore, so that it
doesn't have to rely on NBuffers being the same between the start of
the scan, when memory is allocated, and when the scan ends - a
condition possible when resizing the buffer cache. It seems to improve
the timings by about 10-30% on my laptop for a 128MB buffer cache.
Without this patch the time taken to execute Lukas's query varies
between 10-15ms on my laptop; with this patch it varies between 8-9ms,
so the timing is also more stable as a side effect. It's not the 10x
improvement we are looking for, but it looks like a step in the right
direction. That improvement seems to come purely from avoiding
creating a heap tuple. I wonder if there are some places up in the
execution tree where full heap tuples get formed again instead of
continuing to use minimal tuples, or places where we perform some
extra actions that are not required.
I didn't dig into the history to find out why we didn't modernize
pg_buffercache_pages(). I don't see any hazard though.
Lukas's patch allocates the hash table entirely in memory, whereas a
tuplestore restricts memory usage to work_mem, so it might cause the
function to use more memory than the user expects when the size of the
hash table grows beyond work_mem.
--
Best Wishes,
Ashutosh Bapat
Attachments:
v20260328-0001-pg_buffercache_pages-modernization-and-opt.patch
On Fri, Mar 27, 2026 at 3:58 PM Tomas Vondra <tomas@vondra.me> wrote:
[...]
But is it actually needed / worth it? I wonder what timings does Lukas
observe when running this on larger clusters. Because in a later email
he says:

... we currently run this on a 10 minute schedule when enabled, and
that seems to work in terms of understanding large swings in cache
contents.

I'm all in for optimizing stuff, but if you're running a monitoring task
every 10 minutes, does it matter if it's running for 1 or 5 seconds? I
find that a bit hard to believe.
I imagined such a query is just one of many monitoring queries running
concurrently, so the cumulative overhead can still matter.
I don't have clear opinion if we should do this. I kinda doubt it's a
significant maintenance burden. It'd add one more place the patch for
on-line resizing of shared buffers needs to worry about. Surely that
should not be very difficult, considering there are ~5 other places in
this very extension doing this already?
Yeah, I've not looked at the online shared buffer resizing patch, but
I hope that the patch somewhat abstracts the access to shared buffers
that might be being resized, so that we don't need to worry about the
complex parts when writing code accessing the shared buffers.
One thing we lose by doing ad hoc aggregation (instead of just relying
on the regular SQL aggregation operators) is the memory limit. There's
a simple in-memory hash table, no spilling to disk etc. The simple
pg_buffercache view does not have this issue, because the tuplestore
will spill to disk after hitting work_mem. Simplehash won't.

The entries are ~48B, so there would need to be buffers for ~100k
(relfilenode, forknum) combinations to overflow 4MB. It's not very
common, but I've seen systems with more relations than this. Would be
good to show some numbers showing it's not an issue.
Good point. I agree that we should not introduce the function in a way
that there is a risk of using excessive memory while not respecting
work_mem or other GUC parameters.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Sat, Mar 28, 2026 at 11:07 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
[...]
I understand this argument - we have SQL, which allows us to process the
data in a flexible way, without hard-coding all interesting groupings.
The question is whether this particular grouping is special enough to
warrant a custom *faster* function.
Well-said. Thanks.
[...]
I'm all in for optimizing stuff, but if you're running a monitoring task
every 10 minutes, does it matter if it's running for 1 or 5 seconds? I
find that a bit hard to believe.

I imagined such a query is just one of many monitoring queries running
concurrently, so the cumulative overhead can still matter.
What kind of cumulative overhead do you see? Reduced TPS, increased
memory/CPU consumption? I think it would be good to see some metrics
as evidence of this, rather than relying on the assumption.
I don't have a clear opinion if we should do this. I kinda doubt it's a
significant maintenance burden. It'd add one more place the patch for
on-line resizing of shared buffers needs to worry about. Surely that
should not be very difficult, considering there are ~5 other places in
this very extension doing this already?

Yeah, I've not looked at the online shared buffer resizing patch, but
I hope that the patch somewhat abstracts the access to shared buffers
that might be being resized so that we don't need to worry about the
complex part when writing code accessing the shared buffers.
The abstraction code isn't there in the patch yet, but I agree that the
regular scan code shouldn't need a lot of changes in the resizing
implementation patch. I have been toying with the idea of providing an
abstraction to walk the buffer cache (foreach_buffer, for example),
which may hide any complexity, if required. So, yes, there will be
some abstraction as you envision. However, testing will still be
required. It's the testing effort, the increase in test time etc. that
I am worried about. However, I may be overestimating it.
[...]
Good point. I agree that we should not introduce the function in a way
that there is a risk of using excessive memory while not respecting
work_mem or other GUC parameters.
+1.
--
Best Wishes,
Ashutosh Bapat
On Fri, Mar 27, 2026 at 10:37 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Fri, Mar 27, 2026 at 3:58 PM Tomas Vondra <tomas@vondra.me> wrote:
Yeah, I agree that is problematic regarding work_mem.
FWIW, I could see two methods to address that specifically, if we
wanted the special purpose function:
1) Error out if our hash table grows too large and require the user to
increase work_mem to get the data - seems inconvenient, but might be
okay if we are typically below the work_mem limit anyway (I haven't
run the numbers on that yet)
2) Implement disk spill logic using a LogicalTapeSet or similar - I
think that'd be substantially more code, doesn't seem worth it just
for this (but if a situation like this recurs, we could consider a
more generalized facility)
Thanks,
Lukas
--
Lukas Fittl
On Sat, Mar 28, 2026 at 9:12 AM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
On Sat, Mar 28, 2026 at 11:07 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Fri, Mar 27, 2026 at 3:58 PM Tomas Vondra <tomas@vondra.me> wrote:
On 3/26/26 05:21, Ashutosh Bapat wrote:
The main argument here seems to be the performance, and the initial
message demonstrates a 10x speedup (2ms vs. 20ms) on a cluster with
128MB shared buffers. Unless I misunderstood what config it uses.

I gave it a try on an Azure VM with 32GB shared buffers, to make it a
bit more realistic, and my timings are 10ms vs. 700ms. But I also wonder
if the original timings really were from a cluster with 128MB, because
for me that shows 0.3ms vs. 3ms (so an order of magnitude faster than
what was reported). But I suppose that's also hw specific.

Nevertheless, it is much faster. I haven't profiled this, but I assume
it's thanks to not having to write the entries into a tuplestore (and
possibly into a tempfile).

But is it actually needed / worth it? I wonder what timings Lukas
observes when running this on larger clusters. Because in a later email
he says:

... we currently run this on a 10 minute schedule when enabled, and
that seems to work in terms of understanding large swings in cache
contents.

I'm all in for optimizing stuff, but if you're running a monitoring task
every 10 minutes, does it matter if it's running for 1 or 5 seconds? I
find that a bit hard to believe.

I imagined such a query is just one of many monitoring queries running
concurrently, so the cumulative overhead can still matter.

What kind of cumulative overhead do you see? Reduced TPS, increased
memory/CPU consumption? I think it would be good to see some metric
evidence of this, rather than relying on the assumption.
On my part, the overhead that I've specifically seen in the field,
besides CPU utilization (which isn't great, but could be worse) is
temporary file use for large shared_buffers, due to writing out one
row to the tuplestore per buffer entry.
Here is an example from a production database, running Postgres 16
with 200GB shared_buffers:
SHOW shared_buffers;
shared_buffers
----------------
207873040kB
(1 row)
EXPLAIN (ANALYZE, BUFFERS) SELECT reldatabase, relfilenode, count(*)
FROM pg_buffercache
WHERE reldatabase IS NOT NULL
GROUP BY 1, 2;
                                                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=17.46..19.46 rows=200 width=16) (actual time=12999.311..12999.526 rows=1683 loops=1)
   Group Key: p.reldatabase, p.relfilenode
   Batches: 1  Memory Usage: 209kB
   Buffers: temp read=158595 written=158595
   I/O Timings: temp read=246.513 write=1013.565
   ->  Function Scan on pg_buffercache_pages p  (cost=0.00..10.00 rows=995 width=8) (actual time=5403.944..8864.129 rows=25984130 loops=1)
         Filter: (reldatabase IS NOT NULL)
         Buffers: temp read=158595 written=158595
         I/O Timings: temp read=246.513 write=1013.565
 Planning:
   Buffers: shared hit=5
 Planning Time: 0.101 ms
 Execution Time: 13200.972 ms
(13 rows)
In this case we used a ~1.2GB temporary file to write out 25 million
rows, for what could have been a ~100kB allocation in memory instead
(~40-byte BufferRelStatsEntry in v2 * 2048 slots in simplehash).
Thanks,
Lukas
--
Lukas Fittl