Bug? pg_rewind produces unusable but starting database with standby recovery
Hello!
My colleague found this issue while running tests with pg_rewind: in
some situations using pg_rewind can result in incomplete recovery on
the target, with the server seemingly starting normally. Error happen
only later when users actually try to query the database.
I think it would be better to handle this properly and report an error
during recovery instead of completing the startup and reporting errors
when users try to access the data. This also seems to be the intention
of these conditions, based on the comment a few lines above my changes
in the attached patch.
Originally we noticed this with only the wal summarizer enabled, no
other settings were changed. Later we realized the issue is
reproducible without it, so it could affect earlier postgres versions.
The wal summarizer simply delayed recycling on the target, preventing
pg_rewind from exiting early with an error.
It is also reproducible by changing wal size settings and ensuring the
source recycles records, while the target keeps them. This requires an
asymmetric configuration which is not ideal, but since `summarize_wal`
is only available in pg17+, I based my test case on this condition
instead.
All we need is a situation where some wal segments are missing on the
standby, which seems to be a possibility in the "backup-from-replica"
scenario described above?
The fix also seems simple: relax the conditions used for the "WAL ends
before consistent recovery point" error to catch this case.
What do you think?
Attachments:
0001-Enforce-minRecoveryPoint-check-regardless-of-archive.patchapplication/octet-stream; name=0001-Enforce-minRecoveryPoint-check-regardless-of-archive.patchDownload+198-13
Hello!
I want to add one slightly related issue to the thread, attached as
0002. The patches are independent, one can be applied without the
other. The only relation is that we found 0002 because of the changes
in 0001. After reviewing the details, however, I believe 0002 isn't a
regression caused by 0001, but rather a previously hidden bug that
0001 exposed.
Sometimes pg_rewind can generate a state where the stated
minRecoveryPoint is beyond the actual available wal. In the original
recovery checks, postgres simply continued on this point. With the
modified 0001 version, it detects that we didn't reach the stated
minRecoveryPoint and reports an error.
Regardless of 0001, pg_rewind shouldn't result in inconsistent output
like that. The attached 0002 aims to solve this by capturing the
expected minRecoveryPoint earlier, before traversing the wal files.
The previous, opposite order could add new segment files between
retrieving the file list and querying the LSN, which results in
missing WAL data in the output.
Attachments:
0002-pg_rewind-fix-remote-source-WAL-race-condition.patchapplication/octet-stream; name=0002-pg_rewind-fix-remote-source-WAL-race-condition.patchDownload+207-13
0001-Enforce-minRecoveryPoint-check-regardless-of-archive.patchapplication/octet-stream; name=0001-Enforce-minRecoveryPoint-check-regardless-of-archive.patchDownload+198-13
On 15 May 2026, at 12:50, Zsolt Parragi <zsolt.parragi@percona.com> wrote:
Sometimes pg_rewind can generate a state where the stated
minRecoveryPoint is beyond the actual available wal.
Hi Zsolt,
I looked at both patches. Some notes.
1. Patch 0001 (unconditional minRecoveryPoint check) looks necessary and
safe to me.
The new check fires only when LocalMinRecoveryPoint is valid, and that
only happens when InArchiveRecovery is true. In plain crash recovery
(no backup_label, no signal files) StartupXLOG sets LocalMinRecoveryPoint
to InvalidXLogRecPtr, so "EndOfLog < LocalMinRecoveryPoint" is always
false and the check is a no-op. InArchiveRecovery becomes true only when
a backup_label is present, or when archive recovery was requested. In
all those cases minRecoveryPoint > EndOfLog really does mean we stopped
short of consistency, so erroring out is the only correct option.
The only case the old code missed is: backup_label present, but
backupEndRequired is false and no signal files. The single producer of
that combination is "BACKUP METHOD: pg_rewind" (streamed base backups
set backupEndRequired and were already caught). So the new behavior is
limited exactly to the pg_rewind hole, with no regression for streamed
backups or crash recovery.
2. Patch 0002 (remote source WAL race) I think goes the wrong way.
Moving the consistency LSN capture before collecting the file list
breaks the rule that backup stop relies on: the consistency point must
be established AFTER the data pages are read, never before. The source
is a live server and pg_rewind reads pages from disk one by one with
pg_read_binary_file(). A page modified on the source after we captured
flush LSN, but before we copy it, lands on the target with an LSN
greater than minRecoveryPoint. The target can then reach minRecoveryPoint
and be declared consistent while holding pages from the future whose WAL
was never replayed - a torn, cross-page inconsistent state.
The good half of the patch is insert LSN -> flush LSN. Note that
pg_rewind reads pages from disk, and by the WAL-before-data rule any
on-disk page has LSN <= flush LSN. So flush LSN is the tight upper bound
on what we can copy: insert LSN overshoots (may point past any copied
page, possibly into WAL still only in shared memory), flush-before-copy
undershoots (torn pages), and flush-after-copy is exactly right: it
covers every copied page and is durable.
So I think the fix should keep the capture where it was (end of
perform_rewind, after the copy) and only switch insert LSN to flush LSN.
Missing WAL up to minRecoveryPoint is then handled like a base backup:
the target fetches it via streaming or restore_command, and patch 0001
turns the unreachable case into a clean FATAL instead of silent
corruption. This also means the 012_remote_wal_race test, which checks
that the copied pg_wal alone covers minRecoveryPoint, encodes the
questionable assumption and would need to be reworked.
3. Longer term it would be nice to converge with backup stop.
pg_rewind already writes a backup_label and the target already recovers
like a backup. What is hand-rolled is the stop/consistency point: rewind
computes an LSN itself and stores it into minRecoveryPoint. Both issues
above live in that hand-rolled part. Establishing the consistency target
the same way pg_backup_stop() does (after the copy), or even driving it
from pg_backup_start()/pg_backup_stop() on the live source, would make
the consistency-after-copy invariant automatic and let StartupXLOG
enforce it uniformly.
Best regards, Andrey Borodin.
Hi,
2. Patch 0002 (remote source WAL race) I think goes the wrong way.
Moving the consistency LSN capture before collecting the file list
breaks the rule that backup stop relies on: the consistency point must
be established AFTER the data pages are read, never before. The source
is a live server and pg_rewind reads pages from disk one by one with
pg_read_binary_file(). A page modified on the source after we captured
flush LSN, but before we copy it, lands on the target with an LSN
greater than minRecoveryPoint. The target can then reach minRecoveryPoint
and be declared consistent while holding pages from the future whose WAL
was never replayed - a torn, cross-page inconsistent state.The good half of the patch is insert LSN -> flush LSN. Note that
pg_rewind reads pages from disk, and by the WAL-before-data rule any
on-disk page has LSN <= flush LSN. So flush LSN is the tight upper bound
on what we can copy: insert LSN overshoots (may point past any copied
page, possibly into WAL still only in shared memory), flush-before-copy
undershoots (torn pages), and flush-after-copy is exactly right: it
covers every copied page and is durable.So I think the fix should keep the capture where it was (end of
perform_rewind, after the copy) and only switch insert LSN to flush LSN.
Missing WAL up to minRecoveryPoint is then handled like a base backup:
the target fetches it via streaming or restore_command, and patch 0001
turns the unreachable case into a clean FATAL instead of silent
corruption. This also means the 012_remote_wal_race test, which checks
that the copied pg_wal alone covers minRecoveryPoint, encodes the
questionable assumption and would need to be reworked.
+1. I reported a related bug and wrote a patch doing "insert LSN -> flush LSN" in:
https://commitfest.postgresql.org/patch/6679/
--
Regards,
ChangAo Chen
Thank you both for the review! After reviewing the issue again, I
agree that my 0002 was wrong.
0001 received a trivial rebase as there were no conflicts, and I
simply removed 0002 in favor of ChangAo Chen's thread.