Startup PANIC on standby promotion due to zero-filled WAL segment

Started by Alena Vinter · 20 days ago · 8 messages
#1 Alena Vinter
dlaaren8@gmail.com
3 attachment(s)

Hi hackers,

During replication, when a new timeline is detected, PostgreSQL creates a
new zero-filled WAL segment on the new timeline instead of copying the
partial segment from the previous timeline. This diverges from the behavior
during timeline switches at startup.
This discrepancy can cause problems — especially under slow replication.
Consider the following scenario:

   last record in TLI N | | timeline switch point
                        v v
|-----TLI N---------------|0000000000000000000
                          |
|-----TLI N+1--00000000000|0000000000000000000

If a standby is promoted before the WAL segment containing the last record
of the previous timeline has been fully copied to the new timeline, startup
may fail. We have observed this in production, where recovery aborts with
"PANIC: invalid magic number 0000 in WAL segment ..."

I’ve attached:
* a patch and a TAP test that reproduce the issue;
* a draft patch that, on timeline switch during recovery, copies the
remainder of the current WAL segment from the old timeline — matching the
behavior used after crash recovery at startup.
All existing regression tests pass with the patch applied, but I plan to
add more targeted test cases.
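
To make the intended fix concrete, here is a rough standalone sketch of
what "copy the remainder of the current segment" means. This is only an
illustration, not the xlog.c code: the helper name, the file names, and
the fixed 16MB segment size are all made up. The point is that the new
timeline's segment starts as a byte-for-byte copy of the old timeline's
segment up to the switch point, with the rest zero-filled, instead of
being zero-filled from the start:

#include <stdio.h>
#include <stdlib.h>

#define SEGMENT_SIZE (16 * 1024 * 1024)    /* default WAL segment size */

/*
 * Illustration: create the segment for the new timeline as a copy of the
 * old timeline's segment up to the switch offset; the remainder stays
 * zero-filled.  The real code (XLogInitNewTimeline) also takes care of
 * durable writes and atomic installation, which this sketch ignores.
 */
static int
init_segment_by_copy(const char *old_seg, const char *new_seg,
                     size_t switch_offset)
{
    FILE   *src = fopen(old_seg, "rb");
    FILE   *dst = fopen(new_seg, "wb");
    char   *buf = calloc(1, SEGMENT_SIZE);  /* zero-filled by calloc */
    int     ret = -1;

    if (src && dst && buf &&
        fread(buf, 1, switch_offset, src) == switch_offset &&
        fwrite(buf, 1, SEGMENT_SIZE, dst) == SEGMENT_SIZE)
        ret = 0;

    free(buf);
    if (src)
        fclose(src);
    if (dst)
        fclose(dst);
    return ret;
}

int
main(void)
{
    /* hypothetical segment names for a switch from TLI 1 to TLI 2 */
    return init_segment_by_copy("000000010000000000000003",
                                "000000020000000000000003",
                                0x300000) == 0 ? 0 : 1;
}

The draft patch simply reuses the existing XLogInitNewTimeline() for
this, since it already implements that behavior for the startup path.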

I’d appreciate your feedback. In particular:
* Is this behavior (not copying the segment during replication) intentional?
* Are there edge cases I might be overlooking?

---
Best wishes,
Alena Vinter

Attachments:

recovery_tli_switch_bug_reproduction.diff (text/x-patch):
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index ac802ae85b4..8bc48b77430 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -463,6 +463,9 @@ WalReceiverMain(const void *startup_data, size_t startup_data_len)
 							WalRcvComputeNextWakeup(WALRCV_WAKEUP_PING, now);
 							XLogWalRcvProcessMsg(buf[0], &buf[1], len - 1,
 												 startpointTLI);
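+							/* Reproduction hack: stall the receiver while streaming TLI 2 */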
+							while (startpointTLI == 2)
+								ProcessInterrupts();
 						}
 						else if (len == 0)
 							break;
recovery_tli_switch_test.pl (application/x-perl)
v1-recovery_tli_switch_bug_fix.diff (text/x-patch):
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 430a38b1a21..2cc37ef0c5b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5262,7 +5262,7 @@ str_time(pg_time_t tnow, char *buf, size_t bufsize)
 /*
  * Initialize the first WAL segment on new timeline.
  */
-static void
+void
 XLogInitNewTimeline(TimeLineID endTLI, XLogRecPtr endOfLog, TimeLineID newTLI)
 {
 	char		xlogfname[MAXFNAMELEN];
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 38b594d2170..33da9d68e5c 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -4235,6 +4235,10 @@ rescanLatestTimeLine(TimeLineID replayTLI, XLogRecPtr replayLSN)
 	list_free_deep(expectedTLEs);
 	expectedTLEs = newExpectedTLEs;
 
+	SetInstallXLogFileSegmentActive();
+	XLogInitNewTimeline(oldtarget, replayLSN, newtarget);
+	ResetInstallXLogFileSegmentActive();
+
 	/*
 	 * As in StartupXLOG(), try to ensure we have all the history files
 	 * between the old target and new target in pg_wal.
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 605280ed8fb..87cd59d74b9 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -210,6 +210,9 @@ extern bool XLogNeedsFlush(XLogRecPtr record);
 extern int	XLogFileInit(XLogSegNo logsegno, TimeLineID logtli);
 extern int	XLogFileOpen(XLogSegNo segno, TimeLineID tli);
 
+extern void XLogInitNewTimeline(TimeLineID endTLI, XLogRecPtr endOfLog,
+								TimeLineID newTLI);
+
 extern void CheckXLogRemoved(XLogSegNo segno, TimeLineID tli);
 extern XLogSegNo XLogGetLastRemovedSegno(void);
 extern XLogSegNo XLogGetOldestSegno(TimeLineID tli);
#2 Michael Paquier
michael@paquier.xyz
In reply to: Alena Vinter (#1)
Re: Startup PANIC on standby promotion due to zero-filled WAL segment

On Tue, Dec 23, 2025 at 02:02:15PM +0700, Alena Vinter wrote:

> If a standby is promoted before the WAL segment containing the last record
> of the previous timeline has been fully copied to the new timeline, startup
> may fail. We have observed this in production, where recovery aborts with
> "PANIC: invalid magic number 0000 in WAL segment ..."
>
> I’ve attached:
> * a patch and a TAP test that reproduce the issue;
> * a draft patch that, on timeline switch during recovery, copies the
> remainder of the current WAL segment from the old timeline — matching the
> behavior used after crash recovery at startup.
> All existing regression tests pass with the patch applied, but I plan to
> add more targeted test cases.
>
> I’d appreciate your feedback. In particular:
> * Is this behavior (not copying the segment during replication) intentional?
> * Are there edge cases I might be overlooking?

The failure pattern is different in v18/master vs the rest of the
world: with your test, v17 and older branches just wait at the end for
the standby node to start. Anyway, the problem is the same as far as I
can see, with the test generating the following once patched:
2025-12-23 17:08:37.494 JST startup[32689] LOG: unexpected pageaddr
0/0305E000 in WAL segment 000000020000000000000003, LSN 0/03060000,
offset 393216
2025-12-23 17:08:37.494 JST startup[32689] FATAL: according to
history file, WAL location 0/0305FFD0 belongs to timeline 1, but
previous recovered WAL file came from timeline 2

This would be right, because you are losing the records of the first
INSERT and TLI 1 diverges on the primary. Now, the reason why you are
losing these records is the way the test is set up: fsync is off on
the primary, hence you are forcing what looks like a corruption
scenario by promoting a node with some of its WAL records missing. I
am unconvinced by the problem the way you are showing it. This
primarily shows that setting fsync=off is a bad idea to force a
divergence in timelines, with the segment missing while the records
should be there.

Perhaps it is a matter of proving your point in a cleaner way? I am
open to your potential arguments, but I don't see a problem here based
on the test you are sending; I am just seeing something that should
not be done.

I am not asking how you are able to see these failures in your
Postgres setups, but perhaps there is something in your HA flow that
you should not do, especially if you do the same things as in this
test.. Just saying.
--
Michael

#3 Alena Vinter
dlaaren8@gmail.com
In reply to: Michael Paquier (#2)
Re: Startup PANIC on standby promotion due to zero-filled WAL segment

Hi Michael,

Thanks for the review. To clarify: TLI 1 does not diverge — it is fully
replicated to the standby before the timeline switch. The test then
intentionally slows down replication on TLI 2 (e.g., by delaying WAL
shipping), reproducing the scenario I illustrated. As far as I’m aware,
`fsync` is `on` by default, and the test does not modify it — so no WAL
records are lost due to unsafe flushing.

The core issue is that the new timeline’s segment is zero-initialized
instead of copying the same segment from the previous timeline (as done in
crash-recovery startup). As a result, startup cannot finish recovery due
to non-replicated end of WAL causing failures like “invalid magic number”.

---
Alena Vinter

#4 Michael Paquier
michael@paquier.xyz
In reply to: Alena Vinter (#3)
Re: Startup PANIC on standby promotion due to zero-filled WAL segment

On Tue, Dec 23, 2025 at 04:33:30PM +0700, Alena Vinter wrote:

> Thanks for the review. To clarify: TLI 1 does not diverge — it is fully
> replicated to the standby before the timeline switch. The test then
> intentionally slows down replication on TLI 2 (e.g., by delaying WAL
> shipping), reproducing the scenario I illustrated. As far as I’m aware,
> `fsync` is `on` by default, and the test does not modify it — so no WAL
> records are lost due to unsafe flushing.

Don't think so, based on what is in the tree:
$ git grep "fsync = " -- *.pm
src/test/perl/PostgreSQL/Test/Cluster.pm: print $conf "fsync = off\n";

> The core issue is that the new timeline’s segment is zero-initialized
> instead of being copied from the same segment on the previous timeline
> (as is done at startup after crash recovery). As a result, startup
> cannot finish recovery, because the end of WAL was never replicated,
> and it fails with errors like “invalid magic number”.

The following addition to your proposed test is telling me an entirely
different story, making the test pass as the records of TLI 1 are
around:
 my $node_primary = PostgreSQL::Test::Cluster->new('primary');
 $node_primary->init(allows_streaming => 1);
+$node_primary->append_conf('postgresql.conf', 'fsync=on');
 $node_primary->start;
--
Michael
#5 Alena Vinter
dlaaren8@gmail.com
In reply to: Michael Paquier (#4)
Re: Startup PANIC on standby promotion due to zero-filled WAL segment

> Don't think so, based on what is in the tree:
> $ git grep "fsync = " -- *.pm
> src/test/perl/PostgreSQL/Test/Cluster.pm: print $conf "fsync = off\n";

Oh, I didn’t know it’s set in the `init` function — I apologize.

> The following addition to your proposed test is telling me an entirely
> different story, making the test pass as the records of TLI 1 are
> around:
> my $node_primary = PostgreSQL::Test::Cluster->new('primary');
> $node_primary->init(allows_streaming => 1);
> +$node_primary->append_conf('postgresql.conf', 'fsync=on');
> $node_primary->start;

I’ve tried it this way, and yes, it works fine. Now I’m really interested
in how this parameter prevents the startup-PANIC scenario. Thank you very
much!

But I’m still unclear on why the segment isn’t copied during replication,
as it is in crash recovery (I’d prefer uniform behavior across both
paths). Could you help me figure out the answer to that question?

---
Alena Vinter

#6 Alena Vinter
dlaaren8@gmail.com
In reply to: Alena Vinter (#5)
2 attachment(s)
Re: Startup PANIC on standby promotion due to zero-filled WAL segment

Michael, I left my pipeline running the TAP test until it failed — and
after some time, it did fail. I then changed the test slightly, and simply
by adding a short sleep, I was able to reproduce the same failure more
reliably. Moreover, attempting to restart the standby server after a failed
promotion triggers startup PANIC again.

Attachments:

v2-recovery_tli_switch_test.pl (application/x-perl)
recovery_tli_switch_test_replica.log (text/x-log)
#7 Michael Paquier
michael@paquier.xyz
In reply to: Alena Vinter (#6)
Re: Startup PANIC on standby promotion due to zero-filled WAL segment

On Tue, Dec 23, 2025 at 08:49:20PM +0700, Alena Vinter wrote:

> Michael, I left my pipeline running the TAP test until it failed — and
> after some time, it did fail. I then changed the test slightly, and simply
> by adding a short sleep, I was able to reproduce the same failure more
> reliably. Moreover, attempting to restart the standby server after a failed
> promotion triggers startup PANIC again.

This is a better argument, yes. ProcessInterrupts() is just a way to
force the WAL receiver to do nothing. We could see the same if a WAL
receiver repeatedly fails a palloc() or some other allocation, shutting
it down before it is able to stream any changes, and we could also have
a test with an injection point that forces an error based on a specific
timeline number, or something like that.

Hmm. Like in the case where the WAL receiver is not able to connect
to a primary, shouldn't we prevent the promotion request from being
processed at all? So while you have your finger on something here, I
don't think that your suggested solution is a good or correct one: it
sounds to me that the startup process assumes that the WAL receiver is
doing some work, then the promotion request comes in and we assume that
it is OK to proceed with the promotion while we should obviously not
do so, because the WAL receiver has streamed zero contents from TLI 2.
It sounds to me that we should let the startup process know that
something is wrong with the WAL receiver, meaning that it may be up to
the WAL receiver to save some information in shared memory so that the
startup process does not allow the promotion to go through at all.
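
To sketch that shape (illustration only, with made-up names; in
PostgreSQL the state would live in WalRcvData behind its spinlock
rather than C11 atomics): the WAL receiver publishes whether it has
flushed anything on the timeline it is streaming, and the startup
process refuses to act on a promotion request until that is the case.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical state the WAL receiver would publish in shared memory */
typedef struct
{
    _Atomic uint32_t streamingTLI;   /* timeline currently streamed */
    atomic_bool flushedAnyWAL;       /* anything flushed on that TLI? */
} WalRcvProgress;

static WalRcvProgress progress;

/* WAL receiver side: called after a successful flush of received WAL */
static void
walrcv_report_flush(uint32_t tli)
{
    atomic_store(&progress.streamingTLI, tli);
    atomic_store(&progress.flushedAnyWAL, true);
}

/* Startup side: gate the promotion request on receiver progress */
static bool
promotion_allowed(uint32_t replayTLI)
{
    return atomic_load(&progress.streamingTLI) == replayTLI &&
        atomic_load(&progress.flushedAnyWAL);
}

int
main(void)
{
    /* Nothing streamed on TLI 2 yet: the promotion must be refused */
    printf("allowed: %d\n", (int) promotion_allowed(2));    /* 0 */

    walrcv_report_flush(2);
    printf("allowed: %d\n", (int) promotion_allowed(2));    /* 1 */
    return 0;
}

The real interlock would of course also need to handle the case where
streaming is not configured at all.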
--
Michael

#8 Alena Vinter
dlaaren8@gmail.com
In reply to: Michael Paquier (#7)
2 attachment(s)
Re: Startup PANIC on standby promotion due to zero-filled WAL segment

I like the idea of preventing promotion to avoid such failures -- it sounds
reasonable.

However, we still have a problem: if the standby is stopped while TLI 2
has not yet been fully replicated, it will fail to start with:
"FATAL: according to history file, WAL location Y belongs to timeline X,
but previous recovered WAL file came from timeline X+1".
This happens even if no promotion is attempted — just a plain restart of
the standby. So the issue isn’t only about when to allow promotion.

Regarding my proposed solution: could you clarify why it isn’t correct? I’d
appreciate more detail so I can address your concerns.

---
Alena Vinter

Attachments:

recovery_tli_switch_test_without_standby_promotion_replica.log (text/x-log)
recovery_tli_switch_test_without_standby_promotion.pl (application/x-perl)