Don't keep closed WAL segments in page cache after replay
Hi,
I've been looking at page cache usage as some of our replicas were
under memory pressure (no inactive pages available), which led to WAL
replay lag as the recovery process had to read from disk. One thing
I noticed was that the last WAL files remain in the page cache even
after having been replayed.
This can be checked with vmtouch:
vmtouch pg_wal/*
Files: 141
Directories: 2
Resident Pages: 290816/290816 1G/1G 100%
And page-types shows a replayed WAL file in the active LRU:
page-types -Cl -f 000000010000001B00000076
page-count MB long-symbolic-flags
4096 16 referenced,uptodate,lru,active
From my understanding, once replayed on a replica, WAL segment files
won't be re-read. So keeping them in the page cache seems like an
unnecessary strain on memory (more so as they appear to be in
the active LRU).
This patch adds a POSIX_FADV_DONTNEED before closing a WAL segment,
immediately releasing its cached pages. With this, the page cache usage of
pg_wal stays under wal_segment_size:
vmtouch pg_wal/*
Files: 88
Directories: 2
Resident Pages: 3220/262144 12M/1G 1.23%
Regards,
Anthonin
Attachments:
v01-0001-Don-t-keep-closed-WAL-segments-in-page-cache-aft.patch
On 2025/07/02 19:10, Anthonin Bonnefoy wrote:
From my understanding, once replayed on a replica, WAL segment files
won't be re-read. So keeping it in the pagecache seems like an
unnecessary strain on the memory (more so that they appear to be in
the active LRU).
WAL files that have already been replayed can still be read again
for WAL archiving (if archive_mode = always) or for replication
(if the standby is acting as a streaming replication sender or
a logical replication publisher). No?
This patch adds a POSIX_FADV_DONTNEED before closing a WAL segment,
immediately releasing cached pages.
Maybe we should do this only on a standby where WAL archiving
isn't working and it isn't acting as a sender or publisher.
Regards,
--
Fujii Masao
NTT DATA Japan Corporation
On 2025/07/02 22:24, Fujii Masao wrote:
On 2025/07/02 19:10, Anthonin Bonnefoy wrote:
WAL files that have already been replayed can still be read again
for WAL archiving (if archive_mode = always) or for replication
(if the standby is acting as a streaming replication sender or
a logical replication publisher). No?
Also, the WAL summarizer might read those WAL files as well.
Regards,
--
Fujii Masao
NTT DATA Japan Corporation
Thanks for the comments!
On Wed, Jul 2, 2025 at 7:12 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
WAL files that have already been replayed can still be read again
for WAL archiving (if archive_mode = always) or for replication
(if the standby is acting as a streaming replication sender or
a logical replication publisher). No?
Also, the WAL summarizer might read those WAL files as well.
True, I'd forgotten we could do this on standbys. For archive_mode
and the WAL summarizer, the check can be done with "XLogArchiveMode !=
ARCHIVE_MODE_ALWAYS" and "!summarize_wal".
For replication, it looks a bit trickier. Checking if there's an
active walsender process seems like a good approach, but I haven't
found an existing way to do this. Checking WalSndCtl->walsnds for used
slots isn't great as this should stay in walsender_private.h. A better
way would be to check how many elements there are in
ProcGlobal->walsenderFreeProcs: if there are max_wal_senders elements in
the list, then there are no active walsender processes. There's
already HaveNFreeProcs that could provide this information, though
it currently only does so for the freeProcs list. I've modified
HaveNFreeProcs to take a proc_free_list type as argument so it can be
used to get the number of free slots in walsenderFreeProcs.
One possible impact of this approach is that when a cascading replication
stream starts (either for the first time or after a disconnect), the WAL
files will need to be read from disk instead of already being in the page
cache. Though I think that's a better default behaviour: I would
expect that most replicas don't have cascading replication, so
removing closed WAL segments would benefit most replicas.
Regards,
Anthonin Bonnefoy
Attachments:
v02-0001-Don-t-keep-closed-WAL-segments-in-page-cache-aft.patch
On Thu, 3 Jul 2025 at 17:57, Anthonin Bonnefoy
<anthonin.bonnefoy@datadoghq.com> wrote:
Hi!
Looking at v2. You need to add ProcFreeList to
`src/bin/pgindent/typedefs.list` to avoid extra-spacing.
Checking WalSndCtl->walsnds for used
slots isn't great as this should stay in walsender_private.h. A better
way would be to check how many elements there are in
ProcGlobal->walsenderFreeProcs. If there's max_wal_sender elements in
the list, then it means there's no active walsender processes.
This does not immediately strike me as good reasoning. We have, for
example, the pg_stat_get_wal_senders function (or WalSndInitStopping
from walsender.h), which accesses exactly WalSndCtl->walsnds.
Why don't we simply add another utility function that returns the
number of active walsenders?
Other than that, the patch looks good.
--
Best regards,
Kirill Reshke
Thanks for the review!
On Tue, Feb 17, 2026 at 9:38 AM Kirill Reshke <reshkekirill@gmail.com> wrote:
This does not immediately strike me as good reasoning. We have, for
example, pg_stat_get_wal_senders (or WalSndInitStopping from
walsenders.h) function which accesses exactly WalSndCtl->walsnds.
Why don't we simply have another utility function that will return the
number of active walsenders?
True, it is an option. I've switched to this approach and
created a WalSndRunning function. We only need to know whether there's
at least one walsender running, no need for a precise number.
I've also added StandbyMode as a condition to restrict this to
replicas. XLogPageRead may be used by the primary when starting up,
and will likely re-read the WAL, so releasing cached pages should be
avoided on the primary.
Attachments:
v3-0001-Don-t-keep-closed-WAL-segments-in-page-cache-afte.patch
Hi,
The idea looks good and efficient, albeit I have some feedback. The first one is about logical replication slots.
The patch checks whether there is an active walsender process. But it is possible to create a replication slot and wait for a subscriber to connect to it. During this time, with the patch, PostgreSQL will drop the closed WAL segments from memory, and once the subscriber connects it has to read the WAL files from disk. But it's a trade-off and can be decided by others too.
+/*
+ * Return true if there's at least one active walsender process
+ */
+bool
+WalSndRunning(void)
+{
+	int		i;
+
+	for (i = 0; i < max_wal_senders; i++)
+	{
+		WalSnd	   *walsnd = &WalSndCtl->walsnds[i];
+
+		SpinLockAcquire(&walsnd->mutex);
+		if (walsnd->pid > 0)
+		{
+			SpinLockRelease(&walsnd->mutex);
+			return true;
+		}
+		SpinLockRelease(&walsnd->mutex);
+	}
+	return false;
+}
+
Secondly, using a spinlock to check for running walsender processes can lead to an inefficient recovery process. Assuming a database with max_wal_senders set to 20+ and producing more than 4-5TB of WAL a day, it can cause an additional 100-200 spinlock acquisitions per second on the walreceiver. Put simply, WalSndRunning() scans every walsender slot with spinlocks on every segment switch, contending with all active walsenders updating their own slots. On high-throughput standbys this creates unnecessary cross-process spinlock contention in the recovery hot path, the exact path that should be as lean as possible for fast replay. Maybe you can implement a single pg_atomic_uint32 counter in WalSndCtlData and achieve the same result with zero contention.
Regards.
On Mon, Mar 2, 2026 at 9:15 AM Hüseyin Demir <huseyin.d3r@gmail.com> wrote:
The idea looks good and efficient, albeit I have some feedback. The first one is about logical replication slots.
The patch checks whether there is an active walsender process. But it is possible to create a replication slot and wait for a subscriber to connect to it. During this time, with the patch, PostgreSQL will drop the closed WAL segments from memory, and once the subscriber connects it has to read the WAL files from disk. But it's a trade-off and can be decided by others too.
Good point. This is also the case for physical replication slots: if
there's at least one used replication slot, then it's very likely a
walsender will start at some point and will need to read the WAL. And
having to read a large amount of WAL from disk is likely not
desirable, so it's probably better to add this as a condition.
Secondly, using a spinlock to check for running walsender processes can lead to an inefficient recovery process. Assuming a database with max_wal_senders set to 20+ and producing more than 4-5TB of WAL a day, it can cause an additional 100-200 spinlock acquisitions per second on the walreceiver. Put simply, WalSndRunning() scans every walsender slot with spinlocks on every segment switch, contending with all active walsenders updating their own slots. On high-throughput standbys this creates unnecessary cross-process spinlock contention in the recovery hot path, the exact path that should be as lean as possible for fast replay. Maybe you can implement a single pg_atomic_uint32 counter in WalSndCtlData and achieve the same result with zero contention.
True. I think I assumed that a segment closing is rare enough that
it would be fine to go through all walsenders, but there are certainly
clusters that can generate a large number of segments.
I've updated the patch with the suggested approach:
- a new atomic counter tracking the number of active walsenders
- a similar atomic counter for the number of used replication slots;
otherwise, we would have a similar issue of going through all
max_replication_slots to check whether one is used
- both are now used as conditions to decide whether to send POSIX_FADV_DONTNEED
Regards,
Anthonin Bonnefoy
Attachments:
v4-0001-Don-t-keep-closed-WAL-segments-in-page-cache-afte.patch
Hi, Anthonin
Date: Wed, 04 Mar 2026 09:37:27 +0800
On Tue, 03 Mar 2026 at 15:13, Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com> wrote:
I noticed two different comment styles in the v4 patch:
1.
+/*
+ * Return the number of used replication slots
+ */
+int
+ReplicationSlotNumUsed(void)
2.
+/*
+ * Returns the number of active walsender processes
+ */
+int
+WalSndNumActive(void)
Both "Return ..." and "Returns ..." styles exist in the PostgreSQL codebase,
but within the same patch, would it be better to use a consistent style?
I'd like to use the imperative/singular form. What do you think?
--
Regards,
Japin Li
ChengDu WenWu Information Technology Co., Ltd.
On Wed, Mar 4, 2026 at 2:40 AM Japin Li <japinli@hotmail.com> wrote:
Both "Return ..." and "Returns ..." styles exist in the PostgreSQL codebase,
but within the same patch, would it be better to use a consistent style?
That makes sense, I've updated the comment. I usually look around the
code to copy the surrounding comment style but, as you said, both are
present in the PostgreSQL codebase.
Regards,
Anthonin Bonnefoy
Attachments:
v5-0001-Don-t-keep-closed-WAL-segments-in-page-cache-afte.patch
Hi,
On 2026-03-04 08:38:24 +0100, Anthonin Bonnefoy wrote:
From ad0a3cfe10bdd2cccc4274849c4a77898b06e13c Mon Sep 17 00:00:00 2001
From: Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com>
Date: Wed, 2 Jul 2025 09:58:52 +0200
Subject: Don't keep closed WAL segments in page cache after replay

On a standby, the recovery process reads the WAL segments, applies
changes and closes the segment. When closed, the segments will still be
in page cache memory until they are evicted due to inactivity. The
segments may be re-read if archive_mode is set to always, wal_summarizer
is enabled or if the standby is used for replication and has an active
walsender. The presence of a replication slot is also a likely indicator
that a walsender will be started and will need to read the WAL segments.

Outside of those circumstances, the WAL segments won't be re-read and
keeping them in the page cache generates unnecessary memory pressure.
A POSIX_FADV_DONTNEED is sent before closing a replayed WAL segment to
immediately free any cached pages.
I am quite sceptical that this is a good idea.
Have you actually measured benefits? I skimmed the thread and didn't see
anything. It's pretty cheap for the kernel to replace a clean page from the
page cache with different content.
If you [crash-]restart the replica this will make it way more expensive. If
you have twophase commits where we need to read 2PC details from the WAL, this
will make it more expensive. If somebody takes a base backup, this ...
I think you'd have to have pretty convincing benchmarks showing that this is a
good idea before we should even remotely consider applying this.
Greetings,
Andres Freund