Timeline switching with partial WAL records can break replica recovery
[FIX] Timeline switching with partial WAL records can break replica recovery
Hi Hackers,
I've encountered an issue where physical replicas may enter an infinite WAL
request loop under the following conditions:
1. A promotion occurs during a multi-block WAL record write.
2. The resulting timeline is very short (begin/end in the same WAL block).
== Current behavior ==
When a WAL record crosses block boundaries and gets interrupted by a
promotion:
1. The partial record remains in the original timeline's WAL and is not
copied
to the next timeline.
2. The new timeline begins at the end of the last valid completed record.
3. Replicas consuming the WAL stream may:
a) Encounter the partial record in timeline N,
b) Recovery fetches a non-full record from walrcv and asks for more data,
c) Fetch the timeline history, including timeline N+1,
d) Request the same record from timeline N+1,
e) Receive data up to the end of timeline N+1 (which, notably, does not
contain
this record),
f) Repeat indefinitely for the incomplete record on timeline N+1.
I hope this illustration helps clarify the issue:
block block, the record is written
block
| | up to that point
|
v v
v
--+-----------------------------+-------------------------------------------+---
: | I I | I
| :
--+-----------------------------+-------------------------------------------+---
^ ^ ^
| └ end of TLI = N + 1, └ the point up to which the
record
| start of TLI = N + 2 must be fully written,
| recovery waits data until this
LSN.
|
└ end of TLI = N,
start of TLI = N + 1,
start of the record.
== Background theory ==
1. When a record crosses a page boundary, it's logically divided into parts
called "contrecords".
2. If an incomplete record is detected during crash recovery:
a) a special XLOG_OVERWRITE_CONTRECORD record is written after the
incomplete
record,
b) the page header where the rest of the record would reside is flagged with
XLP_FIRST_IS_OVERWRITTEN_CONTRECORD,
c) this ensures PostgreSQL gracefully ignores the incomplete record in
subsequent recovery attempts,
3. If and incomplete record is detected during promotion recovery:
a) the new timeline starts at the end of the last valid record (before the
incomplete one),
b) the last block containing the record is copied to the new timeline, but
the
record itself is zeroed out.
c) the new timeline contains neither the XLOG_OVERWRITE_CONTRECORD marker
nor
the XLP_FIRST_IS_OVERWRITTEN_CONTRECORD page header flag.
== Historical context ==
1. The initial contrecord implementation [1]/messages/by-id/202108232252.dh7uxf6oxwcy@alvherre.pgsql proposed:
a) copying the last block containing the last valid record to the new
timeline,
and zeroing the partially written record,
b) writing XLOG_OVERWRITE_CONTRECORD in the new timeline.
2. Later, XLOG_OVERWRITE_CONTRECORD writing was avoided due to concerns
about
zero-gaps in WAL [2]/messages/by-id/CAFiTN-t7umki=PK8dT1tcPV=mOUe2vNhHML6b3T7W7qqvvajjg@mail.gmail.com.
== Proposed Solution ==
I propose preserving WAL's append-only linear nature by graceful handling of
incomplete records during timeline switches:
1. Timeline Finalization:
Before switching timelines write an XLOG_OVERWRITE_CONTRECORD record to
mark
the incomplete record in the current timeline. Only then initialize the
new
timeline and continue recovery. Since no concurrent WAL writes can occur
during this phase, the operation is safe.
2. Other processes may read InsertTLI from shared memory during the switch.
This
can cause processes to read the current timeline instead of the new one
we're
switching to.
This behaviour may occur in the walsummarizer process.
If the walsummarizer fetches InsertTLI of the current timeline (which is
set
while a XLOG_OVERWRITE_CONTRECORD is being written), it can result in
latest_lsn < read_upto (which becomes walrcv->flushedUpto in this case),
triggering an assertion failure. This assertion ensures that the
read-up-to
LSN is correctly updated. We can safely handle this edge case by
replacing
the assertion with an if-statement.
I wrote a TAP test to reproduce the bug, but it doesn’t trigger the issue
consistently. To help investigate, I’ve also added additional logs with some
debug output.
I would appreciate review and feedback.
[1]: /messages/by-id/202108232252.dh7uxf6oxwcy@alvherre.pgsql
[2]: /messages/by-id/CAFiTN-t7umki=PK8dT1tcPV=mOUe2vNhHML6b3T7W7qqvvajjg@mail.gmail.com
/messages/by-id/CAFiTN-t7umki=PK8dT1tcPV=mOUe2vNhHML6b3T7W7qqvvajjg@mail.gmail.com
--
Regards,
Alyona Vinter
Attachments:
v1-0001-Fix-missing-XLOG_OVERWRITE_CONTRECORD-in-timeline_switches.patchtext/x-patch; charset=US-ASCII; name=v1-0001-Fix-missing-XLOG_OVERWRITE_CONTRECORD-in-timeline_switches.patchDownload
From 9a200daba7a44a157865eb9f63078d484ee44b5f Mon Sep 17 00:00:00 2001
From: Alena Vinter <dlaaren8@gmail.com>
Date: Mon, 26 May 2025 13:18:35 +0700
Subject: [PATCH v1 1/2] Handle WAL timeline switches with incomplete records
When switching timelines with incomplete WAL records at the end of the
old timeline, physical replicas could enter an infinite recovery loop by
repeatedly requesting the same WAL data. This occurs because the
incomplete record isn't properly marked with XLOG_OVERWRITE_CONTRECORD
record, causing replicas to retry fetching it.
To fix this, we preserve WAL's append-only nature by writing an
XLOG_OVERWRITE_CONTRECORD to explicitly mark incomplete records before
initializing the new timeline. This ensures replicas can properly detect
transition to the new timeline without getting stuck.
---
src/backend/access/transam/xlog.c | 179 +++++++++++-----------
src/backend/access/transam/xlogrecovery.c | 3 +-
2 files changed, 93 insertions(+), 89 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 47ffc0a2307..7a3242c62bc 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6044,6 +6044,7 @@ StartupXLOG(void)
EndOfLogTLI = endOfRecoveryInfo->endOfLogTLI;
abortedRecPtr = endOfRecoveryInfo->abortedRecPtr;
missingContrecPtr = endOfRecoveryInfo->missingContrecPtr;
+ newTLI = endOfRecoveryInfo->lastRecTLI;
/*
* Reset ps status display, so as no information related to recovery shows
@@ -6116,6 +6117,97 @@ StartupXLOG(void)
*/
SetInstallXLogFileSegmentActive();
+ /*
+ * Actually, if WAL ended in an incomplete record, skip the parts that
+ * made it through and start writing after the portion that persisted.
+ * (It's critical to first write an OVERWRITE_CONTRECORD message, which
+ * we'll do as soon as we're open for writing new WAL.)
+ */
+ if (!XLogRecPtrIsInvalid(missingContrecPtr) &&
+ endOfRecoveryInfo->endOfLog == abortedRecPtr)
+ {
+ Assert(!XLogRecPtrIsInvalid(abortedRecPtr));
+ EndOfLog = missingContrecPtr;
+ }
+
+ /*
+ * Tricky point here: lastPage contains the *last* block that the LastRec
+ * record spans, not the one it starts in. The last block is indeed the
+ * one we want to use.
+ */
+ if (EndOfLog % XLOG_BLCKSZ != 0)
+ {
+ char *page;
+ int len;
+ int firstIdx;
+
+ firstIdx = XLogRecPtrToBufIdx(EndOfLog);
+ len = EndOfLog - endOfRecoveryInfo->lastPageBeginPtr;
+ Assert(len < XLOG_BLCKSZ);
+
+ /* Copy the valid part of the last block, and zero the rest */
+ page = &XLogCtl->pages[firstIdx * XLOG_BLCKSZ];
+ memcpy(page, endOfRecoveryInfo->lastPage, len);
+ memset(page + len, 0, XLOG_BLCKSZ - len);
+
+ pg_atomic_write_u64(&XLogCtl->xlblocks[firstIdx], endOfRecoveryInfo->lastPageBeginPtr + XLOG_BLCKSZ);
+ pg_atomic_write_u64(&XLogCtl->InitializedUpTo, endOfRecoveryInfo->lastPageBeginPtr + XLOG_BLCKSZ);
+ XLogCtl->InitializedFrom = endOfRecoveryInfo->lastPageBeginPtr;
+ }
+ else
+ {
+ /*
+ * There is no partial block to copy. Just set InitializedUpTo, and
+ * let the first attempt to insert a log record to initialize the next
+ * buffer.
+ */
+ pg_atomic_write_u64(&XLogCtl->InitializedUpTo, EndOfLog);
+ XLogCtl->InitializedFrom = EndOfLog;
+ }
+ pg_atomic_write_u64(&XLogCtl->InitializeReserved, pg_atomic_read_u64(&XLogCtl->InitializedUpTo));
+
+ /*
+ * Prepare to write WAL starting at EndOfLog location, and init xlog
+ * buffer cache using the block containing the last record from the
+ * previous incarnation.
+ */
+ Insert = &XLogCtl->Insert;
+ Insert->PrevBytePos = XLogRecPtrToBytePos(endOfRecoveryInfo->lastRec);
+ Insert->CurrBytePos = XLogRecPtrToBytePos(EndOfLog);
+
+ /*
+ * Update local and shared status. This is OK to do without any locks
+ * because no other process can be reading or writing WAL yet.
+ */
+ LogwrtResult.Write = LogwrtResult.Flush = EndOfLog;
+ pg_atomic_write_u64(&XLogCtl->logInsertResult, EndOfLog);
+ pg_atomic_write_u64(&XLogCtl->logWriteResult, EndOfLog);
+ pg_atomic_write_u64(&XLogCtl->logFlushResult, EndOfLog);
+ XLogCtl->LogwrtRqst.Write = EndOfLog;
+ XLogCtl->LogwrtRqst.Flush = EndOfLog;
+
+ /* Enable WAL writes for this backend only. */
+ LocalSetXLogInsertAllowed();
+
+ /* If necessary, write overwrite-contrecord before doing anything else */
+ if (!XLogRecPtrIsInvalid(abortedRecPtr) &&
+ !XLogRecPtrIsInvalid(missingContrecPtr) &&
+ EndOfLog == missingContrecPtr)
+ {
+ SpinLockAcquire(&XLogCtl->info_lck);
+ XLogCtl->InsertTimeLineID = newTLI;
+ XLogCtl->PrevTimeLineID = endOfRecoveryInfo->lastRecTLI;
+ SpinLockRelease(&XLogCtl->info_lck);
+
+ EndOfLog = CreateOverwriteContrecordRecord(abortedRecPtr, missingContrecPtr, newTLI);
+ /*
+ * Ensure next records are written to the next timeline segment by
+ * closing the current segment.
+ */
+ if (openLogFile >= 0)
+ XLogFileClose();
+ }
+
/*
* Consider whether we need to assign a new timeline ID.
*
@@ -6130,7 +6222,6 @@ StartupXLOG(void)
*
* In a normal crash recovery, we can just extend the timeline we were in.
*/
- newTLI = endOfRecoveryInfo->lastRecTLI;
if (ArchiveRecoveryRequested)
{
newTLI = findNewestTimeLine(recoveryTargetTLI) + 1;
@@ -6177,82 +6268,6 @@ StartupXLOG(void)
XLogCtl->PrevTimeLineID = endOfRecoveryInfo->lastRecTLI;
SpinLockRelease(&XLogCtl->info_lck);
- /*
- * Actually, if WAL ended in an incomplete record, skip the parts that
- * made it through and start writing after the portion that persisted.
- * (It's critical to first write an OVERWRITE_CONTRECORD message, which
- * we'll do as soon as we're open for writing new WAL.)
- */
- if (!XLogRecPtrIsInvalid(missingContrecPtr))
- {
- /*
- * We should only have a missingContrecPtr if we're not switching to a
- * new timeline. When a timeline switch occurs, WAL is copied from the
- * old timeline to the new only up to the end of the last complete
- * record, so there can't be an incomplete WAL record that we need to
- * disregard.
- */
- Assert(newTLI == endOfRecoveryInfo->lastRecTLI);
- Assert(!XLogRecPtrIsInvalid(abortedRecPtr));
- EndOfLog = missingContrecPtr;
- }
-
- /*
- * Prepare to write WAL starting at EndOfLog location, and init xlog
- * buffer cache using the block containing the last record from the
- * previous incarnation.
- */
- Insert = &XLogCtl->Insert;
- Insert->PrevBytePos = XLogRecPtrToBytePos(endOfRecoveryInfo->lastRec);
- Insert->CurrBytePos = XLogRecPtrToBytePos(EndOfLog);
-
- /*
- * Tricky point here: lastPage contains the *last* block that the LastRec
- * record spans, not the one it starts in. The last block is indeed the
- * one we want to use.
- */
- if (EndOfLog % XLOG_BLCKSZ != 0)
- {
- char *page;
- int len;
- int firstIdx;
-
- firstIdx = XLogRecPtrToBufIdx(EndOfLog);
- len = EndOfLog - endOfRecoveryInfo->lastPageBeginPtr;
- Assert(len < XLOG_BLCKSZ);
-
- /* Copy the valid part of the last block, and zero the rest */
- page = &XLogCtl->pages[firstIdx * XLOG_BLCKSZ];
- memcpy(page, endOfRecoveryInfo->lastPage, len);
- memset(page + len, 0, XLOG_BLCKSZ - len);
-
- pg_atomic_write_u64(&XLogCtl->xlblocks[firstIdx], endOfRecoveryInfo->lastPageBeginPtr + XLOG_BLCKSZ);
- pg_atomic_write_u64(&XLogCtl->InitializedUpTo, endOfRecoveryInfo->lastPageBeginPtr + XLOG_BLCKSZ);
- XLogCtl->InitializedFrom = endOfRecoveryInfo->lastPageBeginPtr;
- }
- else
- {
- /*
- * There is no partial block to copy. Just set InitializedUpTo, and
- * let the first attempt to insert a log record to initialize the next
- * buffer.
- */
- pg_atomic_write_u64(&XLogCtl->InitializedUpTo, EndOfLog);
- XLogCtl->InitializedFrom = EndOfLog;
- }
- pg_atomic_write_u64(&XLogCtl->InitializeReserved, pg_atomic_read_u64(&XLogCtl->InitializedUpTo));
-
- /*
- * Update local and shared status. This is OK to do without any locks
- * because no other process can be reading or writing WAL yet.
- */
- LogwrtResult.Write = LogwrtResult.Flush = EndOfLog;
- pg_atomic_write_u64(&XLogCtl->logInsertResult, EndOfLog);
- pg_atomic_write_u64(&XLogCtl->logWriteResult, EndOfLog);
- pg_atomic_write_u64(&XLogCtl->logFlushResult, EndOfLog);
- XLogCtl->LogwrtRqst.Write = EndOfLog;
- XLogCtl->LogwrtRqst.Flush = EndOfLog;
-
/*
* Preallocate additional log files, if wanted.
*/
@@ -6296,16 +6311,6 @@ StartupXLOG(void)
/* Shut down xlogreader */
ShutdownWalRecovery();
- /* Enable WAL writes for this backend only. */
- LocalSetXLogInsertAllowed();
-
- /* If necessary, write overwrite-contrecord before doing anything else */
- if (!XLogRecPtrIsInvalid(abortedRecPtr))
- {
- Assert(!XLogRecPtrIsInvalid(missingContrecPtr));
- CreateOverwriteContrecordRecord(abortedRecPtr, missingContrecPtr, newTLI);
- }
-
/*
* Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
* record before resource manager writes cleanup WAL records or checkpoint
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 6ce979f2d8b..9b024697026 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -3174,8 +3174,7 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
* complete record, so if we did this, we would later create an
* overwrite contrecord in the wrong place, breaking everything.
*/
- if (!ArchiveRecoveryRequested &&
- !XLogRecPtrIsInvalid(xlogreader->abortedRecPtr))
+ if (!XLogRecPtrIsInvalid(xlogreader->abortedRecPtr))
{
abortedRecPtr = xlogreader->abortedRecPtr;
missingContrecPtr = xlogreader->missingContrecPtr;
--
2.49.0
v1-0002-Replace-assert-with-if-check-in-walsummarizer.patchtext/x-patch; charset=US-ASCII; name=v1-0002-Replace-assert-with-if-check-in-walsummarizer.patchDownload
From 594473e38150fcc51b96d9c611b60c16f366d3a2 Mon Sep 17 00:00:00 2001
From: Alena Vinter <dlaaren8@gmail.com>
Date: Sun, 15 Jun 2025 01:29:23 +0700
Subject: [PATCH v1 2/2] Removed assertion in walsummarizer
Fixes an edge case in the walsummarizer process where fetching InsertTLI
during a timeline switch could lead to latest_lsn < read_upto (which
becomes walrcv->flushedUpto in this case), previously triggering an
assertion failure. We now handle this safely by replacing the assertion
with conditional logic.
---
src/backend/postmaster/walsummarizer.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
index 0fec4f1f871..4939df4c415 100644
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -1551,8 +1551,8 @@ summarizer_read_local_xlog_page(XLogReaderState *state,
if (private_data->tli == latest_tli)
{
/* Still the current timeline, update max LSN. */
- Assert(latest_lsn >= private_data->read_upto);
- private_data->read_upto = latest_lsn;
+ if (latest_lsn >= private_data->read_upto)
+ private_data->read_upto = latest_lsn;
}
else
{
--
2.49.0
I've done more research and identified that replicas enter
an indefinite loop in the 'XLogReadPage' function.
The loop works as follows:
0. timeline N contains a partially written record with LSN = targetRecPtr;
1. In 'XLogReadPage' we attempt to read the next page, which has to
contain the rest of the unfinished record;
2. In 'WaitForWALToBecomeAvailable' walrcv is requested to fetch
records starting from LSN = targetRecPtr on timeline N + 1;
3. Walrcv retrieves data up to the end of page containing the end of
timeline N + 1;
4. Then, in 'WaitForWALToBecomeAvailable', replica switches to
XLOG_FROM_ARCHIVE state, and the function returns true;
5. Execution continues in 'XLogReadPage';
6. The page at addr = targetPagePtr is checked for validity, but we
get an 'invalid magic number' error because walrcv hasn't retrieved
this page;
7. Execution jumps to 'next_record_is_invalid' label;
8. Since we are in StandBy mode, the process retries from the beginning.
See the attachments for more colorful illustration this time =)
From my point of view, the first solution which I described in my
previous message still seems like a good choice.
I've also found the current solution in commit [1]https://github.com/postgres/postgres/commit/6cf1647d87e7cd423d71525a8759b75c4e4a47ec. With all due
respect, but it seems to treat the symptom rather than the underlying
issue.
[1]: https://github.com/postgres/postgres/commit/6cf1647d87e7cd423d71525a8759b75c4e4a47ec
https://github.com/postgres/postgres/commit/6cf1647d87e7cd423d71525a8759b75c4e4a47ec
Attachments:
how_replicas_enter_indefinite_loop_1.jpgimage/jpeg; name=how_replicas_enter_indefinite_loop_1.jpgDownload
�����Exif MM * � ? � � 2 � � � � �1 � �i ( samsung 2025:06:17 17:07:13 H H SM-A556E A556EXXS7AYC8 '� �� �� ? �
��"