Fix XLogFileReadAnyTLI silently applying divergent WAL from wrong timeline
Hi hackers!
We hit an interesting bug(?) in archive recovery during a planned HA
switchover: one of the standbys followed the wrong timeline.
I think I saw this problem before in pgsql-hackers but googling didn't help.
Here's the switchover algorithm employed:
0. Old Primary's synchronous_standby_names is narrowed to only New Primary.
1. Standbys' primary_conninfo (except New Primary's) is pointed at New
Primary, so they are now cascading.
2. Old Primary is told to shut down, asynchronously.
3. After 5 seconds New Primary is promoted.
4. When Old Primary has shut down, it is rewound if the shutdown took
more than 5s.
Here are logs from the standby that stayed on the Old Primary's timeline, sorry for the verbosity:
2026-02-18 11:50:01.173 MSK,,,2534,,69957d39.9e6,1,,2026-02-18 11:50:01 MSK,,0,LOG,00000,"started streaming WAL from primary at 2EC2/A3000000 on timeline 9",,,,,,,,,"","walreceiver",,0
2026-02-18 11:52:01.017 MSK,,,2318,,69957d35.90e,7,,2026-02-18 11:49:57 MSK,,0,LOG,00000,"received SIGHUP, reloading configuration files",,,,,,,,,"","postmaster",,0
2026-02-18 11:52:01.018 MSK,,,2318,,69957d35.90e,8,,2026-02-18 11:49:57 MSK,,0,LOG,00000,"parameter ""primary_conninfo"" changed to ""host=rc1a-skip.mdb.yandexcloud.net port=5432 user=repl keepalives_idle=15 keepalives_interval=3 keepalives_count=5 application_name=rc1d_skip_mdb_yandexcloud_n>
2026-02-18 11:52:01.020 MSK,,,2320,,69957d36.910,26,,2026-02-18 11:49:58 MSK,1/0,0,LOG,00000,"WAL receiver process shutdown requested",,,,,,,,,"","startup",,0
2026-02-18 11:52:01.020 MSK,,,2534,,69957d39.9e6,2,,2026-02-18 11:50:01 MSK,,0,FATAL,57P01,"terminating walreceiver process due to administrator command",,,,,,,,,"","walreceiver",,0
2026-02-18 11:52:01.186 MSK,,,4227,,69957db1.1083,1,,2026-02-18 11:52:01 MSK,,0,LOG,00000,"started streaming WAL from primary at 2EC2/A7000000 on timeline 9",,,,,,,,,"","walreceiver",,0
2026-02-18 11:52:12.193 MSK,,,4227,,69957db1.1083,2,,2026-02-18 11:52:01 MSK,,0,LOG,00000,"replication terminated by primary server","End of WAL reached on timeline 9 at 2EC2/A71F0820.",,,,,,,,"","walreceiver",,0
2026-02-18 11:52:12.198 MSK,,,4227,,69957db1.1083,3,,2026-02-18 11:52:01 MSK,,0,LOG,00000,"fetching timeline history file for timeline 10 from primary server",,,,,,,,,"","walreceiver",,0
2026-02-18 11:52:12.207 MSK,,,4227,,69957db1.1083,4,,2026-02-18 11:52:01 MSK,,0,FATAL,57P01,"terminating walreceiver process due to administrator command",,,,,,,,,"","walreceiver",,0
2026-02-18 11:52:12.585 MSK,,,2320,,69957d36.910,27,,2026-02-18 11:49:58 MSK,1/0,0,LOG,00000,"restored log file ""0000000A.history"" from archive",,,,,,,,,"","startup",,0
2026-02-18 11:52:12.588 MSK,,,2320,,69957d36.910,28,,2026-02-18 11:49:58 MSK,1/0,0,LOG,00000,"new target timeline is 10",,,,,,,,,"","startup",,0
2026-02-18 11:52:12.861 MSK,,,2320,,69957d36.910,29,,2026-02-18 11:49:58 MSK,1/0,0,LOG,00000,"restored log file ""0000000900002EC2000000A7"" from archive",,,,,,,,,"","startup",,0
2026-02-18 11:52:13.122 MSK,,,2320,,69957d36.910,30,,2026-02-18 11:49:58 MSK,1/0,0,LOG,00000,"unexpected pageaddr 2EC2/8000000 in log segment 0000000900002EC2000000A8, offset 0",,,,,,,,,"","startup",,0
2026-02-18 11:52:13.178 MSK,,,4382,,69957dbd.111e,1,,2026-02-18 11:52:13 MSK,,0,LOG,00000,"started streaming WAL from primary at 2EC2/A8000000 on timeline 10",,,,,,,,,"","walreceiver",,0
2026-02-18 11:52:13.178 MSK,,,4382,,69957dbd.111e,2,,2026-02-18 11:52:13 MSK,,0,FATAL,08P01,"could not receive data from WAL stream: ERROR: requested starting point 2EC2/A8000000 is ahead of the WAL flush position of this server 2EC2/A71F6AE0",,,,,,,,,"","walreceiver",,0
One interesting part: the walreceiver that was "fetching timeline
history file for timeline 10 from primary" was killed, probably by the
startup process reaching the end of the timeline. (That's a separate
issue worth looking at, but survivable: the archive is there as a
fallback.)
We restored the history file successfully from the archive and decided
to follow timeline 10. But 0000000_A_00002EC2000000A7 was not archived
yet, while 0000000_9_00002EC2000000A7 already was. So we took
0000000900002EC2000000A7 and applied it. Now we are beyond the timeline
switch point!
The archive contained the new timeline's history file (so recovery
correctly identified TL10 as the target) but the WAL segment at the
switch-point position had not yet been archived for TL10. The same
segment number existed on TL9 (the old primary continued writing after
promotion). Recovery applied it:
restored log file "0000000900002EC2000000A7" from archive
That segment carries post-divergence data from the old primary.
Applying it is wrong, and the damage is silent: we cannot reach
timeline 10 now.
Root cause: XLogFileReadAnyTLI iterates expectedTLEs newest-first and
falls back to older timelines when a segment is not found. The
existing beginseg check is meant to prevent this, but it only guards
the final timeline's switch point (targetBeginSeg = target TL's
begin_seg). In a three-timeline chain (TL1 -> TL2 -> TL3), an absent
TL2 segment at TL2's own switch point is not guarded because
segno < TL3.begin_seg, so TL1's divergent segment is used.
Fix: replace the targetBeginSeg pre-check with a found_eligible flag.
The invariant is simple: for any segment, exactly one timeline owns it
-- the newest one whose begin_seg <= segno. If that segment is absent,
recovery must stop, not fall back. Once the loop identifies the first
eligible timeline and fails to open its segment, it breaks rather than
continuing to older timelines.
Two TAP tests cover this:
- t/052: two-timeline case (the original scenario)
- t/053: three-timeline case that the old targetBeginSeg guard missed
I think the same fix is correct even if there are multiple switch
points in one segment.
I recognise this changes existing behaviour. Someone who intentionally
archives only a parent timeline's segments and relies on fallback will
now see recovery stall instead of silently proceeding. I think that
outcome is strictly better: the current behaviour opens a window for
applying divergent data with no error, which is worse than a clear
stall waiting for the right segment.
Two related issues I'll address in separate threads later:
- recovery_target_timeline='latest' picks by timeline ID. In a
partition scenario the numerically highest TL may not have the most
data. Choosing by end LSN would be safer. Maybe even at some point
we will have to stop using incremented numbers for timelines...
- The startup-kills-walreceiver interaction that created the fallback
to archive in the first place.
Patch attached. WDYT?
Best regards, Andrey Borodin.
Attachments:
0001-Fix-archive-recovery-falling-back-to-wrong-timeline-.patch (+248 -1)
On Fri, 20 Feb 2026 at 15:09, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
We hit an interesting bug(?) in archive recovery during a planned HA
switchover. One of standbys followed wrong timeline.
I think I saw this problem before in pgsql-hackers but googling didn't help.
I saw something very similar some time ago. [1]
[1]: /messages/by-id/CANwKhkMN3QwAcvuDZHb6wsvLRtkweBiYso-KLFykkQVWuQLcOw@mail.gmail.com
Regards,
Ants Aasma
On 20 Feb 2026, at 22:53, Ants Aasma <ants.aasma@cybertec.at> wrote:
I saw something very similar some time ago. [1]
Thanks for the link. Yes, this thread is about the same bug and your
patch is functionally equivalent to what I just proposed.
Basically, it all boils down to a simple invariant: "When restoring to
a specific timeline we should not take detours to other timelines."
Best regards, Andrey Borodin.
On 20 Feb 2026, at 23:36, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
Basically, it all boils down to a simple invariant: "When restoring to
a specific timeline we should not take detours to other timelines."
Here is patch 0002 -- an optimization, independent of 0001.
After walrcv_endstreaming() returns, walreceiver fetches the new
timeline's history file (WalRcvFetchTimeLineHistoryFiles) before
transitioning to WALRCV_WAITING. During this window walreceiver
remains in WALRCV_STREAMING.
Startup sleeps in WAIT_EVENT_RECOVERY_WAL_STREAM receiving no new
data. When it wakes and finds WalRcvStreaming()==true in the
XLOG_FROM_STREAM handler, it kills walreceiver. The new walreceiver
must reconnect and re-request the same switch -- wasteful but
harmless. (In the original report this appears as "terminating
walreceiver process due to administrator command" at 11:52:12.)
Fix: add WALRCV_SWITCHING_TIMELINE. Walreceiver enters it just
before WalRcvFetchTimeLineHistoryFiles(). WalRcvStreaming() returns
false for this state, so startup backs off instead of killing
walreceiver. WakeupRecovery() is called immediately after the
transition so startup exits its indefinite RECOVERY_WAL_STREAM sleep
without waiting for WalRcvWaitForStartPosition().
A guard in RequestXLogStreaming() is also required: because
WALRCV_SWITCHING_TIMELINE is not "streaming", the XLOG_FROM_STREAM
failure path no longer calls XLogShutdownWalRcv() before retrying
archive. When startup cycles back to RequestXLogStreaming(),
walreceiver may still be in WALRCV_SWITCHING_TIMELINE, which would
Assert-fail the STOPPED||WAITING check. The guard returns early in
that case.
One concern: WALRCV_SWITCHING_TIMELINE is not protected by
wal_receiver_timeout, which only runs inside the streaming loop.
Before this patch startup's kill provided an implicit bound on the
history fetch; now only TCP-level timeouts apply. I think this
warrants a follow-up, but it is out of scope here.
Test 054 uses an injection point to freeze walreceiver in
WALRCV_SWITCHING_TIMELINE and verifies startup enters
RecoveryRetrieveRetryInterval rather than killing walreceiver.
WDYT?
Best regards, Andrey Borodin.