Startup PANIC on standby promotion due to zero-filled WAL segment

Started by Alena Vinter3 months ago10 messages

dlaaren8@gmail.com

3 months ago

Hi hackers,

During replication, when a new timeline is detected, PostgreSQL creates a
new zero-filled WAL segment on the new timeline instead of copying the
partial segment from the previous timeline. This diverges from the behavior
during timeline switches at startup.
This discrepancy can cause problems — especially under slow replication.
Consider the following scenario:

last record in TLI | | timeline switch point
v v
|-----TLI N---------------|0000000000000000000
|
|-----TLI N+1--00000000000|0000000000000000000

If a standby is promoted before the WAL segment containing the last record
of the previous timeline has been fully copied to the new timeline, startup
may fail. We have observed this in production, where recovery aborts with
"PANIC: invalid magic number 0000 in WAL segment ..."

I’ve attached:
* a patch and a TAP test that reproduce the issue;
* a draft patch that, on timeline switch during recovery, copies the
remainder of the current WAL segment from the old timeline — matching the
behavior used after crash recovery at startup.
All existing regression tests pass with the patch applied, but I plan to
add more targeted test cases.

I’d appreciate your feedback. In particular:
* Is this behavior (not copying the segment during replication) intentional?
* Are there edge cases I might be overlooking?

---
Best wishes,
Alena Vinter

Michael Paquier

michael@paquier.xyz

3 months ago

In reply to: Alena Vinter (#1)

Re: Startup PANIC on standby promotion due to zero-filled WAL segment

On Tue, Dec 23, 2025 at 02:02:15PM +0700, Alena Vinter wrote:

If a standby is promoted before the WAL segment containing the last record
of the previous timeline has been fully copied to the new timeline, startup
may fail. We have observed this in production, where recovery aborts with
"PANIC: invalid magic number 0000 in WAL segment ..."

I’ve attached:
* a patch and a TAP test that reproduce the issue;
* a draft patch that, on timeline switch during recovery, copies the
remainder of the current WAL segment from the old timeline — matching the
behavior used after crash recovery at startup.
All existing regression tests pass with the patch applied, but I plan to
add more targeted test cases.

I’d appreciate your feedback. In particular:
* Is this behavior (not copying the segment during replication) intentional?
* Are there edge cases I might be overlooking?

The failure pattern is different in v18/master vs the rest of the
world. v17 and older branches just wait for the standby node to start
at the end with your test. Anyway, the problem is the same as far as
I can see, with the test generating the following post-patch:
2025-12-23 17:08:37.494 JST startup[32689] LOG: unexpected pageaddr
0/0305E000 in WAL segment 000000020000000000000003, LSN 0/03060000,
offset 393216
2025-12-23 17:08:37.494 JST startup[32689] FATAL: according to
history file, WAL location 0/0305FFD0 belongs to timeline 1, but
previous recovered WAL file came from timeline 2

This would be right, because you are losing the records of the first
INSERT and TLI 1 diverges on the primary. Now, the reason why you are
losing these records is because of the way the test is set up. fsync
is off on the primary, hence you are forcing what looks like a
corruption scenario by forcing a node to be promoted with some of its
WAL records missing. I am unconvinced with the problem the way you
are showing it. This primarily shows that setting fsync=off is a bad
idea to force a divergence in timelines, with the segment missing
while the records should be there.

Perhaps it is a matter of proving your point in a cleaner way? I am
open to your potential arguments, but I don't see something here based
on the test you are sending; I am just seeing something that should
not be done.

I am not asking how you are able to see these failures in your
Postgres setups, but perhaps there is something in your HA flow that
you should not do, especially if you do the same things as in this
test.. Just saying.
--
Michael

Alena Vinter

dlaaren8@gmail.com

3 months ago

In reply to: Michael Paquier (#2)

Re: Startup PANIC on standby promotion due to zero-filled WAL segment

Hi Michael,

Thanks for the review. To clarify: TLI 1 does not diverge — it is fully
replicated to the standby before the timeline switch. The test then
intentionally slows down replication on TLI 2 (e.g., by delaying WAL
shipping), reproducing the scenario I illustrated. As far as I’m aware,
`fsync` is `on` by default, and the test does not modify it — so no WAL
records are lost due to unsafe flushing.

The core issue is that the new timeline’s segment is zero-initialized
instead of copying the same segment from the previous timeline (as done in
crash-recovery startup). As a result, startup cannot finish recovery due
to non-replicated end of WAL causing failures like “invalid magic number”.

---
Alena Vinter

Michael Paquier

michael@paquier.xyz

3 months ago

In reply to: Alena Vinter (#3)

Re: Startup PANIC on standby promotion due to zero-filled WAL segment

On Tue, Dec 23, 2025 at 04:33:30PM +0700, Alena Vinter wrote:

Thanks for the review. To clarify: TLI 1 does not diverge — it is fully
replicated to the standby before the timeline switch. The test then
intentionally slows down replication on TLI 2 (e.g., by delaying WAL
shipping), reproducing the scenario I illustrated. As far as I’m aware,
`fsync` is `on` by default, and the test does not modify it — so no WAL
records are lost due to unsafe flushing.

Don't think so, based on what is in the tree:
$ git grep "fsync = " -- *.pm
src/test/perl/PostgreSQL/Test/Cluster.pm: print $conf "fsync = off\n";

The core issue is that the new timeline’s segment is zero-initialized
instead of copying the same segment from the previous timeline (as done in
crash-recovery startup). As a result, startup cannot finish recovery due
to non-replicated end of WAL causing failures like “invalid magic number”.

The following addition to your proposed test is telling me an entirely
 different story, making the test pass as the records of TLI 1 are
 around:
 my $node_primary = PostgreSQL::Test::Cluster->new('primary');
 $node_primary->init(allows_streaming => 1);
+#$node_primary->append_conf('postgresql.conf', 'fsync=on');
 $node_primary->start;
--
Michael

Alena Vinter

dlaaren8@gmail.com

3 months ago

In reply to: Michael Paquier (#4)

Re: Startup PANIC on standby promotion due to zero-filled WAL segment

Don't think so, based on what is in the tree:
$ git grep "fsync = " -- *.pm
src/test/perl/PostgreSQL/Test/Cluster.pm: print $conf "fsync = off\n";

Oh, didn’t know it's in the `init` function — I apologize.

The following addition to your proposed test is telling me an entirely
different story, making the test pass as the records of TLI 1 are
around:
my $node_primary = PostgreSQL::Test::Cluster->new('primary');
$node_primary->init(allows_streaming => 1);
+#$node_primary->append_conf('postgresql.conf', 'fsync=on');
$node_primary->start;

I've tried this way, and yes, this works fine. Now I'm really interested in
how this parameter prevents the scenario with startup panic. Thank you very
much!!!
But I’m still unclear why the segment isn’t copied during replication, as
it is in crash recovery (I’d prefer uniform behavior across both
paths). Could you help me figure out the answer to that question?
---
Alena Vinter

Alena Vinter

dlaaren8@gmail.com

3 months ago

In reply to: Alena Vinter (#5)

Re: Startup PANIC on standby promotion due to zero-filled WAL segment

Michael, I left my pipeline running the TAP test until it failed — and
after some time, it did fail. I then changed the test slightly, and simply
by adding a short sleep, I was able to reproduce the same failure more
reliably. Moreover, attempting to restart the standby server after a failed
promotion triggers startup PANIC again.

Michael Paquier

michael@paquier.xyz

3 months ago

In reply to: Alena Vinter (#6)

Re: Startup PANIC on standby promotion due to zero-filled WAL segment

On Tue, Dec 23, 2025 at 08:49:20PM +0700, Alena Vinter wrote:

Michael, I left my pipeline running the TAP test until it failed — and
after some time, it did fail. I then changed the test slightly, and simply
by adding a short sleep, I was able to reproduce the same failure more
reliably. Moreover, attempting to restart the standby server after a failed
promotion triggers startup PANIC again.

This is a better argument, yes. ProcessInterrupts() is just a way to
force the WAL receiver to do nothing. We could see the same if a WAL
receiver fails a palloc() or an allocation repeatedly, shutting it
down before it is able to stream any changes, and we could also have a
test with an injection point that forces an error based on a specific
specific timeline number, or something like that.

Hmm. Like in the case where the WAL receiver is not able to connect
to a primary, shouldn't we prevent the promotion request to process at
all? So while you have your finger on something here, I don't think
that your suggested solution is a good nor correct one: it sounds to
me that the startup process assumes that the WAL receiver is doing
some work, then the promotion request comes it and we assume that it
is OK to process through the promotion while we should obviously not
do so, because the WAL receiver has streamed zero contents from TLI 2.
It sounds to me that we should let the startup process know that
something is wrong with the WAL receiver, meaning that it may be up to
the WAL receiver to save some information in shared memory so as the
startup process should not allow the promotion to go through at all.
--
Michael

Alena Vinter

dlaaren8@gmail.com

3 months ago

In reply to: Michael Paquier (#7)

Re: Startup PANIC on standby promotion due to zero-filled WAL segment

I like the idea of preventing promotion to avoid such failures -- it sounds
reasonable.

However, we still have the problem: if the standby is stopped with
non-replicated TLI 2, it will fail to start:
"FATAL: according to history file, WAL location Y belongs to timeline X,
but previous recovered WAL file came from timeline X+1".
This happens even if no promotion is attempted — just a plain restart of
the standby. So the issue isn’t only about when to allow promotion.

Regarding my proposed solution: could you clarify why it isn’t correct? I’d
appreciate more detail so I can address your concerns.

---
Alena Vinter

Alena Vinter

dlaaren8@gmail.com

about 2 months ago

In reply to: Alena Vinter (#8)

Re: Startup PANIC on standby promotion due to zero-filled WAL segment

I revisited the issue and examined the problem and the proposed solution in
more depth. It’s now clear to me why the approach won’t work: if multiple
timelines are discovered at once (for example, if a replica is stopped, a
standby is promoted several times, and only then the replica resumes
replication), the current design may copy part of an earlier timeline from
the wrong predecessor into the current segment.

I’ll get back with a more correct solution soon. In the meantime,
suggestions are welcome!
---
Alena Vinter

#10

Michael Paquier

michael@paquier.xyz

about 2 months ago

In reply to: Alena Vinter (#9)

Re: Startup PANIC on standby promotion due to zero-filled WAL segment

On Wed, Jan 14, 2026 at 06:38:27PM +0700, Alena Vinter wrote:

I revisited the issue and examined the problem and the proposed solution in
more depth. It’s now clear to me why the approach won’t work: if multiple
timelines are discovered at once (for example, if a replica is stopped, a
standby is promoted several times, and only then the replica resumes
replication), the current design may copy part of an earlier timeline from
the wrong predecessor into the current segment.

This case enforces standby to be used after an equivalent of crash
recovery. I doubt that somebody in their right mind would do that.

Assuming that they do, at quick glance, I don't quite see why the
solution of preventing the promotion request "$node_primary->promote"
would not be a bad one: we just do not want the promotion to be
acklowledged in the startup process until the first record of the
new timeline has been written down. FWIW, I have spent a couple of
minutes to look at what one solution could look like, finishing
with the ugly hack attached, for reference. It's definitely not
thought through, TBH, just to show what one idea could look like if we
have not received and written the first record of the new timeline
yet..
--
Michael

Startup PANIC on standby promotion due to zero-filled WAL segment

Attachments:

Attachments:

Attachments:

Attachments: