Two issues leading to discrepancies in FSM data on the standby server

Started by Alexey Makhmutov · about 1 month ago · 9 messages · hackers
#1 Alexey Makhmutov
a.makhmutov@postgrespro.ru

We've recently observed a significant increase in response time for
insert operations after switching to a replica server. The collected
information pointed to a discrepancy in the FSM data on the replica
side, which became visible to the insert sessions once the autovacuum
process pulled incorrect data from leaf blocks into the FSM root. The
situation looked like the case discussed in
/messages/by-id/20180802172857.5skoexsilnjvgruk@alvherre.pgsql, which
was supposed to be fixed by 'ab7dbd681' (which introduced an FSM update
during 'heap_xlog_visible' invocation). However, in our case and in
synthetic tests we were able to see data blocks marked as 'all visible'
but still having incorrect FSM records.

After analyzing the code I've noticed that during recovery FSM data is
updated in XLogRecordPageWithFreeSpace, which uses MarkBufferDirtyHint
to mark the FSM block as modified. However, if data checksums are
enabled, this call is a no-op during recovery: it just exits
immediately without marking the block as dirty. The logic here is that
no new WAL data can be generated during recovery, so changes to hint
bits must not mark the block as dirty, to avoid the risk of torn pages
being written. This seems logical, but it is not well aligned with the
FSM case, as FSM blocks can simply be zeroed if a checksum mismatch is
detected. Currently, changes to an FSM block can be lost if changes to
that particular block occur rarely enough to allow its eviction from
the buffer cache. To be persisted, a modification needs to be performed
while the FSM block is still kept in buffers and marked dirty after
receiving its FPI. If the block has already been cleaned, the change
won't be persisted and the stored FSM blocks may remain in an obsolete
state. In our case the table had its 'fillfactor' parameter set below
80, so during insert bursts each FSM block on the replica side was
modified only on its first access since the last checkpoint (with FPI)
and then again when processing the XLOG_HEAP2_VISIBLE record once the
data was marked as 'all visible'. This gives plenty of time to clean up
the buffer between these moments, so the second change was just never
written to disk. As a result, a large number of FSM leaf blocks were
left with incorrect data, which caused the problem after switchover.

Given that the FSM is ready to handle torn page writes and that
XLogRecordPageWithFreeSpace is called only during recovery, there seems
to be no reason to use MarkBufferDirtyHint here instead of a regular
MarkBufferDirty call. The code already tries to limit updates to the
FSM (e.g. by updating it only after 80% of the space is used for
regular DML), so we probably want to ensure that these updates are
actually persisted.

The second issue I noticed (not related to our observed problem)
concerns 'heap_xlog_visible': this function uses 'PageGetFreeSpace'
instead of 'PageGetHeapFreeSpace' to get the amount of free space for
regular heap blocks. This seems like a bug, as 'PageGetHeapFreeSpace'
is used in every other case where we need to get the free space of a
heap page. Use of the incorrect function could also cause incorrect
data to be written to the FSM on the replica: if a block still has free
space but has already reached the MaxHeapTuplesPerPage limit, then it
should be marked in the FSM as unavailable for new rows; otherwise an
inserter will need to check it and update its FSM data as well.

Attached are separate patches which fix both problems: calling
'MarkBufferDirty' instead of 'MarkBufferDirtyHint' in the first case
and replacing 'PageGetFreeSpace' with 'PageGetHeapFreeSpace' in the
second.

Two synthetic test cases are also attached which simulate these
situations: 'test_case1.zip' simulates the lost FSM update on the
replica side and 'test_case2.zip' simulates incorrect FSM data on the
standby server for blocks with a large number of redirect slots. In
both cases the 'test_prepare.sh' script can be edited to specify the
path to the PG installation and the port numbers. Then invoke
'test_prepare.sh' to prepare the two databases. For the first case, the
second script 'test_run.sh' then needs to be invoked to show the large
number of blocks being visited for a simple insert; for the second test
case, the state of the FSM (for a single block) is simply displayed at
the end of 'test_prepare.sh'.

Thanks,
Alexey

Attachments:

0001-Mark-modified-FSM-buffer-as-dirty-during-recovery.patch (text/x-patch, +1 -2)
0002-Use-PageGetHeapFreeSpace-in-heap_xlog_visible.patch (text/x-patch, +1 -2)
test_case1.zip (application/zip)
test_case2.zip (application/zip)
#2 Andrey Borodin
amborodin@acm.org
In reply to: Alexey Makhmutov (#1)
Re: Two issues leading to discrepancies in FSM data on the standby server

Hi!

Very interesting cases!

On 20 Mar 2026, at 06:32, Alexey Makhmutov <a.makhmutov@postgrespro.ru> wrote:

Attached are separate patches which fix both problems: calling 'MarkBufferDirty' instead of 'MarkBufferDirtyHint' in the first case and replacing 'PageGetFreeSpace' with 'PageGetHeapFreeSpace' in the second.

Patch 0001 - MarkBufferDirty() instead of MarkBufferDirtyHint() in XLogRecordPageWithFreeSpace().
Yes, MarkBufferDirtyHint() is a no-op in recovery, and this is the only use of MarkBufferDirtyHint()
in redo that I found.
Originally e981653 used MarkBufferDirty(), but 96ef3b8 flipped it to MarkBufferDirtyHint().
Neither of these commits provided a comment on why this version was chosen. I think if we fix this we should add a comment.

Patch 0002 - PageGetHeapFreeSpace instead of PageGetFreeSpace in heap_xlog_visible.
This seems to be just an oversight in ab7dbd6. Every other call to XLogRecordPageWithFreeSpace() uses PageGetHeapFreeSpace().
This seems correct to me, but a bit odd. Why don't indexes update the FSM via this routine?

It seems indexes do not log free pages at all, relying on index vacuum to rebuild the FSM on the standby.

Nice catch!

Best regards, Andrey Borodin.

#3 Alexey Makhmutov
a.makhmutov@postgrespro.ru
In reply to: Andrey Borodin (#2)
Re: Two issues leading to discrepancies in FSM data on the standby server

Hi Andrey!

Thank you for your attention to this patch!

Originally e981653 used MarkBufferDirty(), but 96ef3b8 flipped it to MarkBufferDirtyHint().
Neither of these commits provided a comment on why this version was chosen. I think if we fix this we should add a comment.

I think the reason for the change in 96ef3b8 (from 'MarkBufferDirty' to
'MarkBufferDirtyHint') may be described in the next commit (9df56f6),
in its README update:

New WAL records cannot be written during recovery, so hint bits set
during recovery must not dirty the page if the buffer is not already
dirty, when checksums are enabled. Systems in Hot-Standby mode may
benefit from hint bits being set, but with checksums enabled, a page
cannot be dirtied after setting a hint bit (due to the torn page risk).
So, it must wait for full-page images containing the hint bit updates to
arrive from the master.

So it seems logical that any changes to data not protected by WAL
(which includes the VM and FSM as well) should use MarkBufferDirtyHint,
which does not set the dirty flag during recovery. However, as FSM
blocks can simply be zeroed in case of a checksum mismatch, I think
it's perfectly fine to use regular MarkBufferDirty here.

I've updated the first patch to add a comment explaining the reason for
using MarkBufferDirty instead of MarkBufferDirtyHint here.

As for the second issue and its patch - it seems to be resolved in
current master by a881cc9, which removed the entire 'heap_xlog_visible'
function, as all-visibility information is now sent with the
XLOG_HEAP2_PRUNE_VACUUM_CLEANUP message and its handler already uses
PageGetHeapFreeSpace. The problem is still relevant for pre-19
versions, so I will probably move it to a separate thread on -bugs.

Thanks,
Alexey

Attachments:

0001-Mark-modified-FSM-buffer-as-dirty-during-recovery.patch (text/x-patch, +11 -2)
#4 Melanie Plageman
melanieplageman@gmail.com
In reply to: Alexey Makhmutov (#3)
Re: Two issues leading to discrepancies in FSM data on the standby server

On Mon, Apr 6, 2026 at 10:26 AM Alexey Makhmutov
<a.makhmutov@postgrespro.ru> wrote:

Originally in e981653 was used MarkBufferDirty() but 96ef3b8 flipped to MarkBufferDirtyHint().
Neither of these commits provided a comment on why this version was chosen. I think if we fix it we must comment things.

I think that reason of change in 96ef3b8 (changing of 'MarkBufferDirty'
to 'MarkBufferDirtyHint') may be described in the next commit (9df56f6),
during the README update:

New WAL records cannot be written during recovery, so hint bits set
during recovery must not dirty the page if the buffer is not already
dirty, when checksums are enabled. Systems in Hot-Standby mode may
benefit from hint bits being set, but with checksums enabled, a page
cannot be dirtied after setting a hint bit (due to the torn page risk).
So, it must wait for full-page images containing the hint bit updates to
arrive from the master.

So, it seems logical, that any changes to the data not protected by the
WAL (which includes VM and FSM as well) should use MarkBufferDirtyHint,
which does not set dirty flag during recovery. However, as FSM blocks
could be just zeroed in case of checksums mismatch, so I think it's
perfectly fine to use regular MarkBufferDirty here.

Yea, I agree that this seems like simply an oversight in 96ef3b8. And
it seems safe to use MarkBufferDirty() here instead.

- Melanie

#5 Alexander Korotkov
aekorotkov@gmail.com
In reply to: Melanie Plageman (#4)
Re: Two issues leading to discrepancies in FSM data on the standby server

On Tue, Apr 14, 2026 at 7:22 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

On Mon, Apr 6, 2026 at 10:26 AM Alexey Makhmutov
<a.makhmutov@postgrespro.ru> wrote:

Originally in e981653 was used MarkBufferDirty() but 96ef3b8 flipped to MarkBufferDirtyHint().
Neither of these commits provided a comment on why this version was chosen. I think if we fix it we must comment things.

I think that reason of change in 96ef3b8 (changing of 'MarkBufferDirty'
to 'MarkBufferDirtyHint') may be described in the next commit (9df56f6),
during the README update:

New WAL records cannot be written during recovery, so hint bits set
during recovery must not dirty the page if the buffer is not already
dirty, when checksums are enabled. Systems in Hot-Standby mode may
benefit from hint bits being set, but with checksums enabled, a page
cannot be dirtied after setting a hint bit (due to the torn page risk).
So, it must wait for full-page images containing the hint bit updates to
arrive from the master.

So, it seems logical, that any changes to the data not protected by the
WAL (which includes VM and FSM as well) should use MarkBufferDirtyHint,
which does not set dirty flag during recovery. However, as FSM blocks
could be just zeroed in case of checksums mismatch, so I think it's
perfectly fine to use regular MarkBufferDirty here.

Yea, I agree that this seems like simply an oversight in 96ef3b8. And
it seems safe to use MarkBufferDirty() here instead.

I also think that usage of MarkBufferDirty() here is safe, if I
understand correctly:
1) When wal_log_hints = on, it should be completely safe. Even if we
have a torn page after a crash, during recovery an FPI from the primary
should come first.
2) When wal_log_hints = off, we can end up with torn pages not covered
by an FPI. Without checksums, the FSM can tolerate torn pages. With
checksums, they would result in zeroed pages; the FSM can tolerate that
as well, and the latter shouldn't happen too frequently. So we should
finally get a much better FSM state than we have now.

Should we push it to all supported branches?

------
Regards,
Alexander Korotkov
Supabase

#6 Melanie Plageman
melanieplageman@gmail.com
In reply to: Alexander Korotkov (#5)
Re: Two issues leading to discrepancies in FSM data on the standby server

On Tue, Apr 21, 2026 at 9:49 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Tue, Apr 14, 2026 at 7:22 PM Melanie Plageman

Yea, I agree that this seems like simply an oversight in 96ef3b8. And
it seems safe to use MarkBufferDirty() here instead.

I also think that usage of MarkBufferDirty() here is safe. If I
understood correctly.
1) When wal_log_hints = on, should be completely safe. Even if we
have torn page after the crash, during recovery FPI from the primary
should come first.

I think FPIs from primary don't really matter here, since we are only
talking about MarkBufferDirty() in XLogRecordPageWithFreeSpace(). If
we change it to MarkBufferDirty() on the standby and the machine
crashes mid-write leading to a checksum error, we'll just zero it out
-- which is really your point 2. While FPIs from the primary will
overwrite the standby's FSM page, they don't provide torn-page
protection for modifications made by the standby as you could read the
page between the torn write and replaying any FPI from the primary.

2) When wal_log_hints = off, we can end up with torn pages not covered
by FPI. Without checksums, FSM can tolerate torn pages. With
checksums, that would result in zeroed pages. FSM can tolerate that
as well. And the last shouldn't happen too frequently. So, we should
finally get way better FSM state than it is now.

Yes, I think the bottom line is that we can't get checksum errors
reading FSM pages because of ZERO_ON_ERROR, so there is no reason to
do MarkBufferDirtyHint() in recovery for FSM. It only leads to losing
changes to the page.

Should we push it to all supported branches?

I haven't looked at the code paths in previous versions, but as long
as they are reading FSM pages with RBM_ZERO_ON_ERROR, I think it is
safe to do so. It is a bug that is causing overly optimistic FSM
numbers, but it's not a correctness issue like wrong results/data
corruption etc. So, I think you could make an argument either way
about fixing it.

I don't know how much of Alexey's reported issue was this vs
PageGetFreeSpace() in heap_xlog_visible(). The MarkBufferDirty()
change is easy to fix, so it probably makes sense to do so. I haven't
investigated more about the PageGetFreeSpace() issue.

- Melanie

#7 Alexander Korotkov
aekorotkov@gmail.com
In reply to: Melanie Plageman (#6)
Re: Two issues leading to discrepancies in FSM data on the standby server

On Tue, Apr 21, 2026 at 5:42 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

On Tue, Apr 21, 2026 at 9:49 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Tue, Apr 14, 2026 at 7:22 PM Melanie Plageman

Yea, I agree that this seems like simply an oversight in 96ef3b8. And
it seems safe to use MarkBufferDirty() here instead.

I also think that usage of MarkBufferDirty() here is safe. If I
understood correctly.
1) When wal_log_hints = on, should be completely safe. Even if we
have torn page after the crash, during recovery FPI from the primary
should come first.

I think FPIs from primary don't really matter here, since we are only
talking about MarkBufferDirty() in XLogRecordPageWithFreeSpace(). If
we change it to MarkBufferDirty() on the standby and the machine
crashes mid-write leading to a checksum error, we'll just zero it out
-- which is really your point 2. While FPIs from the primary will
overwrite the standby's FSM page, they don't provide torn-page
protection for modifications made by the standby as you could read the
page between the torn write and replaying any FPI from the primary.

It's probably not so important in this context, but I'd like to verify
my thoughts further. My idea is that the standby's changes to the FSM
mirror the primary's changes to the FSM, even though FSM changes don't
have their own WAL records and are instead decoded from other WAL
records. Thus, if some FSM page on the primary gets changed, the
primary emits an FPI for the first change after a checkpoint. The
standby's restartpoints are synchronized with the primary's
checkpoints, and the standby's FSM changes mirror those on the primary,
so the standby should also have its first change of an FSM page after a
restartpoint covered by an FPI received from the primary. Thus the
consistency of FSM pages should be guaranteed in a similar way to every
other WAL-logged page, except that FSM pages are not directly
WAL-logged, but have their changes decoded from main-fork WAL records.

The weak point I see in the reasoning above is the assumption that FSM
changes on the standby fully mirror FSM changes on the primary. I
didn't really check this invariant. But other than that, could you
please re-check these thoughts and let me know what you think?

2) When wal_log_hints = off, we can end up with torn pages not covered
by FPI. Without checksums, FSM can tolerate torn pages. With
checksums, that would result in zeroed pages. FSM can tolerate that
as well. And the last shouldn't happen too frequently. So, we should
finally get way better FSM state than it is now.

Yes, I think the bottom line is that we can't get checksum errors
reading FSM pages because of ZERO_ON_ERROR, so there is no reason to
do MarkBufferDirtyHint() in recovery for FSM. It only leads to losing
changes to the page.

Should we push it to all supported branches?

I haven't looked at the code paths in previous versions, but as long
as they are reading FSM pages with RBM_ZERO_ON_ERROR, I think it is
safe to do so.

I've checked that since 96ef3b8 we only read FSM pages with RBM_ZERO_ON_ERROR.

It is a bug that is causing overly optimistic FSM
numbers, but it's not a correctness issue like wrong results/data
corruption etc. So, I think you could make an argument either way
about fixing it.

It has the user-visible effect of increased insertion time after
replica promotion. I think this is reason enough to backpatch.

I don't know how much of Alexey's reported issue was this vs
PageGetFreeSpace() in heap_xlog_visible(). The MarkBufferDirty()
change is easy to fix, so it probably makes sense to do so. I haven't
investigated more about the PageGetFreeSpace() issue.

Makes sense. I suggest Alexey could clarify this.

------
Regards,
Alexander Korotkov
Supabase

#8 Melanie Plageman
melanieplageman@gmail.com
In reply to: Alexander Korotkov (#7)
Re: Two issues leading to discrepancies in FSM data on the standby server

On Wed, Apr 22, 2026 at 8:32 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Tue, Apr 21, 2026 at 5:42 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

I also think that usage of MarkBufferDirty() here is safe. If I
understood correctly.
1) When wal_log_hints = on, should be completely safe. Even if we
have torn page after the crash, during recovery FPI from the primary
should come first.

I think FPIs from primary don't really matter here, since we are only
talking about MarkBufferDirty() in XLogRecordPageWithFreeSpace(). If
we change it to MarkBufferDirty() on the standby and the machine
crashes mid-write leading to a checksum error, we'll just zero it out
-- which is really your point 2. While FPIs from the primary will
overwrite the standby's FSM page, they don't provide torn-page
protection for modifications made by the standby as you could read the
page between the torn write and replaying any FPI from the primary.

It's probably not so important in this context, but I'd like to verify
my thoughts further. My idea is that standby's changes of FSM are
mirroring primary's changes of FSM, even that FSM changes don't have
own WAL-records and being decoded from other WAL records. Thus, if
some FSM page on primary gets changed then primary emits FPI for the
first change after checkpoint. The standby restartpoints are
synchronized with primary's checkpoints, and the FSM changes mirrors
FSM changes on primary. Standby should also have its first change of
FSM page after the restartpoint covered by FPI received from primary.
So, the consistency of FSM pages should be guaranteed in the similar
way to every other WAL-logged pages, except FSM pages are not directly
WAL-logged, but got their changes decoded from main fork WAL-records.

The weak point I see in the reasoning above is the assumption that FSM
changes on standby fully mirrors FSM changes on primary. I didn't
really check this invariant. But other than that, could you please,
re-check this thoughts and let me know what do you think?

Yea, I don't think they are really being mirrored. The conditions for
updating the FSM in normal operation vs in recovery are different.
When doing an INSERT, we can look for freespace on heap page 5, find
it doesn't have enough and then update the freespace map to reflect
that. Then, let's say we eventually find space for the INSERTed tuple
on heap page 10000. In normal operation, we update the FSM on
rejection, not when we find freespace. So we do not update the FSM for
heap page 10000 to reflect how much freespace it now has remaining.
Later, when replaying the INSERT, if the heuristic conditions are met
(action == BLK_NEEDS_REDO && freespace < BLCKSZ / 5) for freespace on
heap block 10000, we may then update the FSM for page 10000. The local
changes to the FSM page don't mark the buffer dirty and thus aren't
safe from loss if the page is evicted and definitely not safe from
being wiped out by a future FPI since they don't update the FSM page
LSN.

It is a bug that is causing overly optimistic FSM
numbers, but it's not a correctness issue like wrong results/data
corruption etc. So, I think you could make an argument either way
about fixing it.

It has user-visible effect of increased insertion time after replica
promotion. I think this is quite a reason to backpatch.

Right, since it's not invasive, I'm fine with backpatching.

I don't know how much of Alexey's reported issue was this vs
PageGetFreeSpace() in heap_xlog_visible(). The MarkBufferDirty()
change is easy to fix, so it probably makes sense to do so. I haven't
investigated more about the PageGetFreeSpace() issue.

Makes sense. I suggest Alexey could clarify this.

It would also be good if he could add a 19 open item for the
PageGetFreeSpace() thing (in the issues affecting back branches
section) or register a patch for the next commitfest (as a bug), so we
don't forget about it.

- Melanie

#9 Alexey Makhmutov
a.makhmutov@postgrespro.ru
In reply to: Melanie Plageman (#8)
Re: Two issues leading to discrepancies in FSM data on the standby server

Hi Alexander and Melanie,

Thank you very much for your attention to this thread!

On 4/22/26 17:15, Melanie Plageman wrote:

I don't know how much of Alexey's reported issue was this vs
PageGetFreeSpace() in heap_xlog_visible(). The MarkBufferDirty()
change is easy to fix, so it probably makes sense to do so. I haven't
investigated more about the PageGetFreeSpace() issue.

Makes sense. I suggest Alexey could clarify this.

The issue with the 'PageGetFreeSpace' call is completely unrelated to
the other issue discussed in this thread (i.e. marking the FSM buffer
as dirty).

In my view it's just a small oversight in ab7dbd6, which used
'PageGetFreeSpace' instead of 'PageGetHeapFreeSpace'. The latter
function calls 'PageGetFreeSpace' and additionally checks that there is
space in the line pointer section (i.e. checks that MaxHeapTuplesPerPage
is not exceeded). If MaxHeapTuplesPerPage items have already been
created, then the free space for a heap block needs to be reported as
0, as no new tuple can be added to the page. Such a case can be
observed if a page has a large number of redirect slots created as a
result of HOT cleanups. Basically, all code working with heap pages
should use 'PageGetHeapFreeSpace' rather than 'PageGetFreeSpace', and
'heap_xlog_visible' is the only place in the code where
'PageGetFreeSpace' is used for heap pages. I think the situation where
the results of these two functions differ is rarely observed, but it
seems logical to just call the correct function in all cases.

This issue is present in all supported versions (PG 14-18), but is not
present in current master for PG19 (as 'heap_xlog_visible' was removed
in a881cc9).

Would be good also if he could add a 19 open item for the
PageGetFreeSpace() thing (in the issues affecting back branches
section) or register a patch for next commitfest (as a bug), so we
don't forget about it.

Sure, I could just move this second patch (with a test to reproduce the
problem) to a different thread and register a new commitfest entry for
it. The target version will be 14 (the first supported version with the
code). Since it cannot be applied on top of current master (the problem
was already fixed for PG19 by a881cc9), it will probably be displayed
in 'needs rebase' status.

Thanks,
Alexey