Add pg_stat_vfdcache view for VFD cache statistics
Hello hackers,
This comes from Tomas's patch idea on his website [1]; I thought this patch makes sense to have.
PostgreSQL's virtual file descriptor (VFD) layer maintains a
per-backend cache of open file descriptors, bounded by
max_files_per_process (default 1000). When the cache is full, the
least-recently-used entry is evicted and its OS fd closed so that a new
file can be opened. On the next access to that file, open() must be
called again, incurring a syscall that a larger cache would have
avoided.
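To make the eviction mechanics concrete, here is a toy model of an LRU-bounded descriptor cache (illustrative only: the names and flat-array structure are assumptions for this sketch, not the actual fd.c implementation, which tracks LRU with a ring over the VFD array):

```c
#include <assert.h>
#include <string.h>

/* Toy LRU-bounded fd cache; illustrative only, not the fd.c code. */
#define CACHE_CAP 3

static int cache[CACHE_CAP];            /* most recently used first */
static int ncached = 0;
static long hits, misses, evictions;

/* Access file "id"; returns 1 on a cache hit, 0 on a miss that
 * would require a real open() syscall. */
static int access_file(int id)
{
    for (int i = 0; i < ncached; i++)
    {
        if (cache[i] == id)
        {
            /* hit: move entry to the front (most recently used) */
            memmove(&cache[1], &cache[0], i * sizeof(int));
            cache[0] = id;
            hits++;
            return 1;
        }
    }

    /* miss: if full, evict the least-recently-used (last) entry */
    if (ncached == CACHE_CAP)
    {
        evictions++;
        ncached--;
    }
    memmove(&cache[1], &cache[0], ncached * sizeof(int));
    cache[0] = id;
    ncached++;
    misses++;
    return 0;
}
```

With a capacity of 3, cycling through files 1-4 in a loop never hits: every access evicts the entry that will be needed soonest, which is exactly the pathology a larger max_files_per_process avoids.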
A trivial example is partitioned tables: a table with 1500
partitions needs more than 1500 file descriptors per full scan (main
fork, vm, ...), which exceeds the default limit and causes
evictions and reopens.
The problem is well understood and the fix is straightforward: raise
max_files_per_process. Tomas showed up to a 4-5x throughput
improvement in [1]; on my end I see somewhat less than that,
depending on the query, but the idea is clear.
AFAIK there is currently no way from inside PostgreSQL to know whether fd
cache pressure is occurring.
The implementation is trivial: because the VFD cache is strictly per-backend,
the counters are also per-backend and require no shared memory or
locking. Three macros (pgstat_count_vfd_hit/miss/eviction) update
fields in PendingVfdCacheStats directly from fd.c.
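In sketch form (the macro names follow the description above; the struct fields and variable name are assumptions about the patch, not quoted from it):

```c
#include <assert.h>

/* Per-backend pending counters; field names are assumptions. */
typedef struct PendingVfdCacheStats
{
    long hits;
    long misses;
    long evictions;
} PendingVfdCacheStats;

static PendingVfdCacheStats pending_vfd_stats;

/* Each backend touches only its own struct, so plain increments
 * suffice: no shared memory, no locking, no atomics. */
#define pgstat_count_vfd_hit()      (pending_vfd_stats.hits++)
#define pgstat_count_vfd_miss()     (pending_vfd_stats.misses++)
#define pgstat_count_vfd_eviction() (pending_vfd_stats.evictions++)
```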
I find this useful; I would love to hear others' thoughts on whether it is.
[1]: https://vondra.me/posts/patch-idea-statistics-for-file-descriptor-cache
Regards,
Ayoub
Attachments:
0001-Add-pg_stat_vfdcache-view-for-VFD-cache-statistics.patch (text/x-patch)
On Sat, Mar 21, 2026 at 5:59 PM KAZAR Ayoub <ma_kazar@esi.dz> wrote:
Hello hackers,
This comes from Tomas's patch idea on his website [1]; I thought this patch makes sense to have.
PostgreSQL's virtual file descriptor (VFD) maintains a
per-backend cache of open file descriptors, bounded by
max_files_per_process (default 1000). When the cache is full, the
least-recently-used entry is evicted so its OS fd is closed, so a new
file can be opened. On the next access to that file, open() must be
called again, incurring a syscall that a larger cache would have
avoided.
A trivial example is with partitioned tables: a table with 1500
partitions requires even more than 1500 file descriptors per full scan (main
fork, vm ...), which is more than the default limit, causing potential evictions and reopens.
The problem is well understood and the fix is straightforward: raise
max_files_per_process. Tomas showed a 4-5x throughput
improvement in [1]; on my end I see somewhat less than that, depending on the query, but the idea is clear.
AFAIK there is currently no way from inside PostgreSQL to know whether fd cache pressure is occurring.
The implementation is trivial: because the VFD cache is strictly per-backend, the counters are also
per-backend and require no shared memory or locking. Three macros (pgstat_count_vfd_hit/miss/eviction) update fields in PendingVfdCacheStats directly from fd.c.
I find this useful; I would love to hear others' thoughts on whether it is.
Hi,
My $0.02: for this to be useful it would need to allow viewing the
global VFD cache picture (across all backends), not just the *current* backend's.
Applications wouldn't call this function anyway, because they would have to be
modified.
In order to get that, you should collect the hits/misses in a local
pending pgstat area (see e.g. pgstat_io, or the simpler pgstat_bgwriter/
checkpointer), as you already do with PendingVfdCacheStats, but then copy them
to the shared-memory pgstat area (with some LWLock* protection) that would be
queryable.
-J.
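The pending-then-flush pattern Jakub describes can be modeled in miniature as follows (single-process sketch; in the real pgstat machinery the shared side lives in shared memory and is protected by an LWLock, which the comments mark):

```c
#include <assert.h>

typedef struct VfdCacheCounters
{
    long hits;
    long misses;
} VfdCacheCounters;

static VfdCacheCounters pending; /* per-backend, lock-free */
static VfdCacheCounters shared;  /* shared memory in the real thing */

/* Fold the backend-local pending counters into the shared totals,
 * then reset the pending side. */
static void flush_vfd_stats(void)
{
    /* LWLockAcquire(stats_lock, LW_EXCLUSIVE); -- in the real code */
    shared.hits += pending.hits;
    shared.misses += pending.misses;
    /* LWLockRelease(stats_lock); */

    pending.hits = 0;
    pending.misses = 0;
}
```

The point of the split is that the hot path (counting) stays lock-free, and the lock is taken only at the comparatively rare flush points.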
Hi!
On 23.03.2026 12:22, Jakub Wartak wrote:
On Sat, Mar 21, 2026 at 5:59 PM KAZAR Ayoub <ma_kazar@esi.dz> wrote:
Hello hackers,
This comes from Tomas's patch idea on his website [1]; I thought this patch makes sense to have.
PostgreSQL's virtual file descriptor (VFD) maintains a
per-backend cache of open file descriptors, bounded by
max_files_per_process (default 1000). When the cache is full, the
least-recently-used entry is evicted so its OS fd is closed, so a new
file can be opened. On the next access to that file, open() must be
called again, incurring a syscall that a larger cache would have
avoided.
That's one use-case. The other one that I've recently come across is
just knowing how many VFD cache entries there are in the first place.
While the number of open files is bounded by max_files_per_process, the
number of cache entries is unbounded. Large databases can easily have
hundreds of thousands of files due to our segmentation scheme.
Workloads that access a big portion of these files can end up spending
very considerable amounts of memory on the VFD cache. For example, with
100,000 VFD entries per backend * 80 bytes per VFD = ~7.6 MiB. With 1000
backends that's almost 10 GiB just for VFD entries, assuming that each
backend over time accumulates that many files.
A production database I looked at recently had ~300,000 files and many
thousands of backends. It spent close to 30 GiB on the VFD cache.
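As a quick sanity check on the arithmetic above (80 bytes per entry is the approximate figure cited here, not a guaranteed constant):

```c
#include <assert.h>

/* Estimated per-backend VFD cache size in MiB. */
static double vfd_cache_mib(long entries, long bytes_per_entry)
{
    return (double) entries * bytes_per_entry / (1024.0 * 1024.0);
}
```

At 100,000 entries this gives ~7.6 MiB per backend, and about 7.5 GiB across 1000 backends, consistent with the rounded figures above.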
I've looked at struct vfd and some simple changes to the struct would
already cut memory consumption in half. I can look into that.
Thoughts?
A trivial example is partitioned tables: a table with 1500
partitions needs more than 1500 file descriptors per full scan (main
fork, vm, ...), which exceeds the default limit and causes evictions and reopens.
The problem is well understood and the fix is straightforward: raise
max_files_per_process. Tomas showed up to a 4-5x throughput
improvement in [1]; on my end I see somewhat less than that, depending on the query, but the idea is clear.
The question is what the kernel makes of that, especially in the
aforementioned case where the number of total files and backends is large.
In the Linux kernel each process that opens some file gets its own struct
file. sizeof(struct file) is ~200 bytes. Hence, increasing
max_files_per_process can measurably impact memory consumption if
changed carelessly. We should document that.
But I guess in most cases it's rather about changing it from 1k to 2k,
rather than changing it from 1k to 100k.
AFAIK there is currently no way from inside PostgreSQL to know whether fd cache pressure is occurring.
The implementation is trivial: because the VFD cache is strictly per-backend, the counters are also
per-backend and require no shared memory or locking. Three macros (pgstat_count_vfd_hit/miss/eviction) update fields in PendingVfdCacheStats directly from fd.c.
I find this useful; I would love to hear others' thoughts on whether it is.
Hi,
My $0.02: for this to be useful it would need to allow viewing the
global VFD cache picture (across all backends), not just the *current* backend's.
Applications wouldn't call this function anyway, because they would have to be
modified.
+1
In order to get that, you should collect the hits/misses in a local
pending pgstat area (see e.g. pgstat_io, or the simpler pgstat_bgwriter/
checkpointer), as you already do with PendingVfdCacheStats, but then copy them
to the shared-memory pgstat area (with some LWLock* protection) that would be
queryable.
I would include here the sum of VFD cache entries across all backends and
the total VFD cache size.
--
David Geier
Hello David and Jakub,
On Mon, Mar 23, 2026 at 1:35 PM David Geier <geidav.pg@gmail.com> wrote:
Hi!
On 23.03.2026 12:22, Jakub Wartak wrote:
On Sat, Mar 21, 2026 at 5:59 PM KAZAR Ayoub <ma_kazar@esi.dz> wrote:
Hello hackers,
This comes from Tomas's patch idea on his website [1]; I thought this
patch makes sense to have.
PostgreSQL's virtual file descriptor (VFD) maintains a
per-backend cache of open file descriptors, bounded by
max_files_per_process (default 1000). When the cache is full, the
least-recently-used entry is evicted so its OS fd is closed, so a new
file can be opened. On the next access to that file, open() must be
called again, incurring a syscall that a larger cache would have
avoided.
That's one use-case. The other one that I've recently come across is
just knowing how many VFD cache entries there are in the first place.
While the number of open files is bounded by max_files_per_process, the
number of cache entries is unbounded. Large databases can easily have
hundreds of thousands of files due to our segmentation scheme.
Workloads that access a big portion of these files can end up spending
very considerable amounts of memory on the VFD cache. For example, with
100,000 VFD entries per backend * 80 bytes per VFD = ~7.6 MiB. With 1000
backends that's almost 10 GiB just for VFD entries, assuming that each
backend over time accumulates that many files.
A production database I looked at recently had ~300,000 files and many
thousands of backends. It spent close to 30 GiB on the VFD cache.
I've looked at struct vfd and some simple changes to the struct would
already cut memory consumption in half. I can look into that.
Thoughts?
Looking forward to this.
What also bothers me in that space is that if a backend allocates 100K
entries in the VFD cache, the cache is never shrunk again: it only grows
(when it needs more than its lifetime maximum) until the backend dies.
Reusing free entries instead of allocating new ones is useful, but a
spike in file openings leaves a long-lived backend holding a cache far
larger than it will ever need again. I don't imagine this is common,
though; what do you think from your experience?
In the Linux kernel each process that opens some file gets its own struct
file. sizeof(struct file) is ~200 bytes. Hence, increasing
max_files_per_process can measurably impact memory consumption if
changed carelessly. We should document that.
But I guess in most cases it's rather about changing it from 1k to 2k,
rather than changing it from 1k to 100k.
Indeed, I agree with this.
AFAIK there is currently no way from inside PostgreSQL to know whether
fd cache pressure is occurring.
The implementation is trivial: because the VFD cache is strictly
per-backend, the counters are also per-backend and require no shared
memory or locking. Three macros (pgstat_count_vfd_hit/miss/eviction)
update fields in PendingVfdCacheStats directly from fd.c.
I find this useful; I would love to hear others' thoughts on whether it is.
Hi,
My $0.02: for this to be useful it would need to allow viewing the
global VFD cache picture (across all backends), not just the *current*
backend's.
Applications wouldn't call this function anyway, because they would have
to be modified.
+1
Would it be reasonable to have both? I changed the way I was thinking
about it.
It's clear one would want global VFD cache monitoring, since its
configuration is global but its effect is mostly backend-specific.
That leads me to think it could also be useful to maintain per-backend
metrics, to help identify which backends are putting pressure on the VFD cache.
In order to get that, you should collect the hits/misses in a local
pending pgstat area (see e.g. pgstat_io, or the simpler pgstat_bgwriter/
checkpointer), as you already do with PendingVfdCacheStats, but then copy them
to the shared-memory pgstat area (with some LWLock* protection) that would be
queryable.
I would include here the sum of VFD cache entries across all backends and
the total VFD cache size.
I'll be doing this soon.
--
David Geier
Regards,
Ayoub Kazar
On 24.03.2026 00:36, KAZAR Ayoub wrote:
I've looked at struct vfd and some simple changes to the struct would
already cut memory consumption in half. I can look into that.
Thoughts?
Looking forward to this.
I'll try to come up with something in the next days.
What also bothers me in that space is that if a backend allocates 100K
entries in the VFD cache, the cache is never shrunk again: it only grows
(when it needs more than its lifetime maximum) until the backend dies.
Reusing free entries instead of allocating new ones is useful, but a
spike in file openings leaves a long-lived backend holding a cache far
larger than it will ever need again. I don't imagine this is common,
though; what do you think from your experience?
Currently the cache is directly mapped by the VFD index. That means we
could only resize down to the maximum used VFD index.
Being able to resize independently of the maximum VFD index would
require changing to a hash map like simplehash.h. I can take a look at how
invasive such a change would be.
--
David Geier
Hello,
On Tue, Mar 24, 2026 at 3:09 PM David Geier <geidav.pg@gmail.com> wrote:
On 24.03.2026 00:36, KAZAR Ayoub wrote:
I've looked at struct vfd and some simple changes to the struct would
already cut memory consumption in half. I can look into that.
Thoughts?
Looking forward to this.
I'll try to come up with something in the next days.
What also bothers me in that space is that if a backend allocates 100K
entries in the VFD cache, the cache is never shrunk again: it only grows
(when it needs more than its lifetime maximum) until the backend dies.
Reusing free entries instead of allocating new ones is useful, but a
spike in file openings leaves a long-lived backend holding a cache far
larger than it will ever need again. I don't imagine this is common,
though; what do you think from your experience?
Currently the cache is directly mapped by the VFD index. That means we
could only resize down to the maximum used VFD index.
Being able to resize independently of the maximum VFD index would
require changing to a hash map like simplehash.h. I can take a look at how
invasive such a change would be.
--
David Geier
I've implemented the recommended global stats view on the VFD cache; the
implementation should also be straightforward, as it follows the same
cumulative shared statistics infrastructure that pgstat_bgwriter and
others use.
The attached v2 patch also contains what David suggested: global cache
size and entry count in the view.
Kind regards,
Ayoub
Attachments:
v2-0001-Add-pg_stat_vfdcache-view-for-VFD-cache-statistics.patch (text/x-patch)
Rebased v2.
On Sun, Mar 29, 2026 at 8:23 PM KAZAR Ayoub <ma_kazar@esi.dz> wrote:
Show quoted text
Hello,
On Tue, Mar 24, 2026 at 3:09 PM David Geier <geidav.pg@gmail.com> wrote:On 24.03.2026 00:36, KAZAR Ayoub wrote:
I've looked at struct vfd and some simple changes to the struct would
already cut memory consumption in half. I can look into that.Thoughts?
Looking forward to this.
I try to come up with something the next days.
What also bothers me in that space is if a backend allocates 100K
entries
in VFD cache, that cache is never shrank ever again,
the cache only grows (if it needs more than its lifetime maximum) untilthe
backend dies, although this is useful as entries are reused if free
instead
of
allocating entries, whether a spike in files openings effects a longliving
backend to keep holding a useless amount of
cache size it will need in the future, i don't imagine this to be common
though, what do you think about this issue from your experience ?Currently the cache is directly mapped by the VFD index. That means we
could only resize down to the maximum used VFD index.Being able to resize independently of the maximum VFD index would
require changing to a hash map like simplehash.h. I can take a look how
invasive such a change would be.--
David GeierI've implemented the recommended global stats view on vfd cache, the
implementation should be also straightforward as it follows the same
cumulative shared statistics infrastructure like pgstat_bgwriter and others
do.Attached is v2 patch also contains what David suggested for global cache
size and entries in the view.Kind regards,
Ayoub
Attachments:
v2-0001-Add-pg_stat_vfdcache-view-for-VFD-cache-statistics.patchtext/x-patch; charset=US-ASCII; name=v2-0001-Add-pg_stat_vfdcache-view-for-VFD-cache-statistics.patchDownload+655-4
Hi!
On 29.03.2026 21:46, KAZAR Ayoub wrote:
I've looked at struct vfd and some simple changes to the struct would
already cut memory consumption in half. I can look into that.
Thoughts?
Looking forward to this.
I'll try to come up with something in the next days.
What also bothers me in that space is that if a backend allocates 100K
entries in the VFD cache, the cache is never shrunk again: it only grows
(when it needs more than its lifetime maximum) until the backend dies.
Reusing free entries instead of allocating new ones is useful, but a
spike in file openings leaves a long-lived backend holding a cache far
larger than it will ever need again. I don't imagine this is common,
though; what do you think from your experience?
Currently the cache is directly mapped by the VFD index. That means we
could only resize down to the maximum used VFD index.
Being able to resize independently of the maximum VFD index would
require changing to a hash map like simplehash.h. I can take a look at how
invasive such a change would be.
That would actually be doable without too much code churn. It would,
however, add some overhead to each cache entry: 1 byte for the
simplehash.h status and 4 bytes for the hash, if we want to avoid
rehashing on each access. Probably we can get away without storing the
hash. We would then have to monitor the cache size and recreate the hash
table when the size has shrunk enough.
Another alternative is pallocing vfd entries so that we can freely move
them around in the vfd cache array. That would mean an ABI change (File
would be an 8-byte pointer instead of a 4-byte integer) but give us much
more flexibility for possible improvements.
For example, apart from compacting without hash map, this would allow us
to allocate variable amounts of memory per entry to, e.g.
- store the file name inline in the struct as a variable-length array
(instead of a pstrdup() pointer) and
- depending on type allocate more or less memory. This is because
non-temporary files don't use the ResOwner and the file size.
I've implemented the recommended global stats view on the VFD cache; the
implementation should also be straightforward, as it follows the same
cumulative shared statistics infrastructure that pgstat_bgwriter and
others use.
The attached v2 patch also contains what David suggested: global cache
size and entry count in the view.
I'll review the patch in the next days. On quick inspection I saw that you
compute the size of a struct vfd by using sizeof(). That works, except
for the filename which is stored as a pointer to a pstrdup() piece of
memory. I guess you can just say:
sizeof(VfdCache[i]) + GetMemoryChunkSpace(VfdCache[i].fileName)
--
David Geier
Hi,
Thanks for working on this, I think having some stats about the vfd
cache would be quite helpful. I took a quick look at the patch, and in
general it goes in the right direction.
Here's a couple comments / suggestions:
1) sgml docs
- Formatting seems a bit wrong, both for the entry in the first table,
and then for the view description later. Clearly different from the
nearby tables after the documentation is built. Did you generate the
tables somehow?
- The ordering seems a bit random, but I'd argue the pg_stat_vfdcache
view should go before pg_stat_wal (at least in the first table).
2) system_views.sql
- missing the REVOKE command
- We don't align the AS clauses with spaces (I like to align my queries,
but here it's a question of consistency with the other commands, and the
alignment makes future diffs larger)
3) fd.c
- Wouldn't it be better to have pgstat_count_vfd_access(hit bool)? That
way you wouldn't even need the new else, it'd be enough to do
pgstat_count_vfd_access(!FileIsNotOpen(file)) at the beginning.
- GetVfdCacheOccupancy does not seem to be called from anywhere
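The single-entry-point variant suggested above might look roughly like this (hypothetical sketch; FileIsNotOpen() is the existing fd.c test, everything else here is made up for illustration):

```c
#include <assert.h>
#include <stdbool.h>

static long vfd_hits, vfd_misses;

/* One call site counts both outcomes; the caller passes whether
 * the file was already open, e.g.
 * pgstat_count_vfd_access(!FileIsNotOpen(file)). */
static inline void pgstat_count_vfd_access(bool hit)
{
    if (hit)
        vfd_hits++;
    else
        vfd_misses++;
}
```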
4) pgstat_vfdcache.c
- I'm not sure usagecount makes sense for these stats, because that's
for cases with a single writer. These stats are written by backends, so
it probably needs a lwlock.
- I'm not sure updating PgStatShared_Backend from pgstat_vfdcache.c is a
good idea, when it's already synced from pgstat_backend. We don't do
that for WAL either, but there's pgstat_flush_backend_entry_wal in
pgstat_backend.c. I suppose vfdcache should do it the same way.
5) pg_proc.dat
- formatting seems a bit inconsistent
6) pgstat.h
- Aren't evictions mostly the same as misses, at least after a while?
- I think it would be useful to report how many file descriptors we
are allowed to open (it's less than max_files_per_process, depending
on the ulimits etc.)
- I know io_uring can consume quite a few descriptors, and it can cause
issues, I wonder if this would make it easier to observe
I also suggest to split the patch into smaller patches, to make it
easier to review and evaluate. Not because of size - the patch is fairly
small. But it's better to not mix multiple features with different
cost/benefit trade-offs, because then it's possible to evaluate them
separately. Maybe even commit the first part and continue discussion
about the following one(s).
This patch seems to mix two different types of stats - global stats of
the vfd cache, and then also per-backend stats. Those seem like very
different things, both in terms of overhead and benefits.
The global cache stats is going to be virtually free (at least the
hits/misses, I'm not sure about the number of entries and bytes), and
it's obviously useful for tuning the max_files_per_process GUC. I'd even
contemplate getting this into PG19, maybe.
The per-backend stats seem like a much harder sell to me, but I can be
convinced. Maybe it's not an issue in terms of overhead, maybe the stats
we get from that are worth it. Not sure. But I'd keep it in a separate
0002 patch, on top of 0001 with just the "global" stats.
regards
--
Tomas Vondra
Hello,
Thanks for the review!
On Tue, Mar 31, 2026 at 8:27 PM Tomas Vondra <tomas@vondra.me> wrote:
Hi,
Thanks for working on this, I think having some stats about the vfd
cache would be quite helpful. I took a quick look at the patch, and in
general it goes in the right direction.
Here's a couple of comments / suggestions:
6) pgstat.h
- Aren't evictions mostly the same as misses, at least after a while?
Correct; the window where they differ is pretty much meaningless, so I
removed it.
- I think it would be useful to report how many file descriptors we
are allowed to open (it's less than max_files_per_process, depending
on the ulimits etc.)
Agreed. This should be max_safe_fds as calculated by the postmaster. I
added it, but let me know whether exporting max_safe_fds the way I did
is acceptable.
- I know io_uring can consume quite a few descriptors, and it can cause
issues, I wonder if this would make it easier to observe
I also suggest to split the patch into smaller patches, to make it
easier to review and evaluate. Not because of size - the patch is fairly
small. But it's better to not mix multiple features with different
cost/benefit trade-offs, because then it's possible to evaluate them
separately. Maybe even commit the first part and continue discussion
about the following one(s).
This patch seems to mix two different types of stats - global stats of
the vfd cache, and then also per-backend stats. Those seem like very
different things, both in terms of overhead and benefits.
The global cache stats is going to be virtually free (at least the
hits/misses, I'm not sure about the number of entries and bytes), and
it's obviously useful for tuning the max_files_per_process GUC. I'd even
contemplate getting this into PG19, maybe.
The per-backend stats seem like a much harder sell to me, but I can be
convinced. Maybe it's not an issue in terms of overhead, maybe the stats
we get from that are worth it. Not sure. But I'd keep it in a separate
0002 patch, on top of 0001 with just the "global" stats.
regards
--
Tomas Vondra
I fixed the style-related issues and followed your suggestion to split
the patch: global stats first, then the per-backend stats (cache size
and entry count reporting). Attached is v3-0001, which does just the
global stats counting.
Once we make sure this is correct, I'll proceed with the per-backend
stats patch.
Regards,
Ayoub