BUG: Cascading standby fails to reconnect after falling back to archive recovery
Hi hackers,
I've encountered a bug in PostgreSQL's streaming replication where cascading
standbys fail to reconnect after falling back to archive recovery. The issue
occurs when the upstream standby uses archive-only recovery.
The standby requests streaming from the wrong WAL position (the next segment
boundary instead of the current position), causing connection failures with
this error:
ERROR: requested starting point 0/A000000 is ahead of the WAL flush
position of this server 0/9000000
Attached are two shell scripts that reliably reproduce the issue on
PostgreSQL 17.x and 18.x:
1. reproducer_restart_upstream_portable.sh - triggers by restarting the upstream
2. reproducer_cascade_restart_portable.sh - triggers by restarting the cascade
The scripts set up this topology:
- Primary with archiving enabled
- Standby using only archive recovery (no streaming from primary)
- Cascading standby streaming from the archive-only standby
When the cascade loses its streaming connection and falls back to archive
recovery, it cannot reconnect. The issue appears to be in xlogrecovery.c
around line 3880, where the position passed to RequestXLogStreaming()
determines which segment boundary is requested.
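To see why 0/A000000 is a suspicious start point: it sits exactly on a 16MB segment boundary. A minimal sketch (the XLogRecPtr typedef, WAL_SEG_SIZE constant, and seg_offset() helper here are simplified stand-ins for the server's XLogSegmentOffset() arithmetic, not the actual definitions):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative only: a 64-bit WAL position and the default 16MB
 * wal_segment_size. */
typedef uint64_t XLogRecPtr;
#define WAL_SEG_SIZE ((uint64_t) 16 * 1024 * 1024)

/* Byte offset of a WAL position within its segment. */
static uint64_t
seg_offset(XLogRecPtr ptr)
{
    return ptr % WAL_SEG_SIZE;
}
```

A position like 0/A000000 (0x0A000000) has offset zero in its segment, i.e. it names the first byte of a segment that the upstream, whose flush position is still in the 0/9xxxxxx segment, does not have yet.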
The cascade restart reproducer shows that even restarting the cascade itself
triggers the bug, which affects routine maintenance operations.
Scripts require PostgreSQL binaries in PATH and use ports 15432-15434.
Best regards,
Marco
Attachments:
pgsql-hackers-bug-report-final.md (text/markdown)
On Thu, Jan 29, 2026 at 2:03 AM Marco Nenciarini
<marco.nenciarini@enterprisedb.com> wrote:
Thanks for the report!
I was also able to reproduce this issue on the master branch.
Interestingly, I couldn't reproduce it on v11 using the same test case.
This makes me wonder whether the issue was introduced in v12 or later.
Do you see the same behavior in your environment?
Regards,
--
Fujii Masao
Hi Marco,
On Thu, Jan 29, 2026 at 1:03 AM Marco Nenciarini
<marco.nenciarini@enterprisedb.com> wrote:
Thanks for your report. I can reliably reproduce the issue on HEAD
using your scripts. I’ve analyzed the problem and am proposing a patch
to fix it.
--- Analysis
When a cascading standby streams from an archive-only upstream:
1. The upstream's GetStandbyFlushRecPtr() returns only replay position
(no received-but-not-replayed buffer since there's no walreceiver)
2. When streaming ends and the cascade falls back to archive recovery,
it can restore WAL segments from its own archive access
3. The cascade's read position (RecPtr) advances beyond what the
upstream has replayed
4. On reconnect, the cascade requests streaming from RecPtr, which the
upstream rejects as "ahead of flush position"
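The selection in step 1 can be sketched as follows (a simplified model of what GetStandbyFlushRecPtr() is described as doing above; the names and the omitted timeline check are illustrative, not the actual walsender code):

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

/* Simplified model: serve the received-and-flushed position when a
 * walreceiver has one, otherwise fall back to the replay position.
 * On an archive-only standby there is no walreceiver, so receive_ptr
 * stays invalid and only the replay position is offered. */
static XLogRecPtr
standby_flush_ptr(XLogRecPtr receive_ptr, XLogRecPtr replay_ptr)
{
    if (receive_ptr != InvalidXLogRecPtr && receive_ptr > replay_ptr)
        return receive_ptr;
    return replay_ptr;
}
```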
--- Proposed Fix
Track the last confirmed flush position from streaming
(lastStreamedFlush) and clamp the streaming start request when it
exceeds that position:
- Same timeline: clamp to lastStreamedFlush if RecPtr > lastStreamedFlush
- Timeline switch: fall back to timeline switchpoint as safe boundary
This ensures the cascade requests from a position the upstream
definitely has, rather than assuming the upstream can serve whatever
the cascade restored locally from archive.
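The same-timeline clamp can be sketched like this (hypothetical helper; the real patch works inside WaitForWALToBecomeAvailable() and also handles the timeline-switch case, which is omitted here):

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/* Hypothetical sketch of the clamp: never request streaming beyond the
 * last flush position the upstream confirmed while we were streaming;
 * positions reached via our own archive access are not guaranteed to
 * exist upstream yet.  A zero lastStreamedFlush means "never streamed",
 * in which case the request is left alone. */
static XLogRecPtr
clamp_stream_start(XLogRecPtr rec_ptr, XLogRecPtr last_streamed_flush)
{
    if (last_streamed_flush != 0 && rec_ptr > last_streamed_flush)
        return last_streamed_flush;
    return rec_ptr;
}
```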
I’m not a fan of using sleep in TAP tests, but I haven’t found a
better way to reproduce this behavior yet.
--
Best,
Xuneng
Attachments:
v1-0001-Fix-cascading-standby-reconnect-failure-after-arc.patch (application/octet-stream, +122 -1)
On Thu, Jan 29, 2026 at 9:22 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
I haven't read the patch yet, but doesn't lastStreamedFlush represent
the same LSN as tliRecPtr or replayLSN (the arguments to
WaitForWALToBecomeAvailable())? If so, we may not need to introduce
a new variable to track this LSN.
The choice of which LSN is used as the replication start point has varied
over time to handle corner cases (for example, commit 06687198018).
That makes me wonder whether we should first better understand
why WaitForWALToBecomeAvailable() currently uses RecPtr as
the starting point.
BTW, with v1 patch, I was able to reproduce the issue using the following steps:
--------------------------------------------
initdb -D data
mkdir arch
cat <<EOF >> data/postgresql.conf
archive_mode = on
archive_command = 'cp %p ../arch/%f'
restore_command = 'cp ../arch/%f %p'
EOF
pg_ctl -D data start
pg_basebackup -D sby1 -c fast
cp -a sby1 sby2
cat <<EOF >> sby1/postgresql.conf
port = 5433
EOF
touch sby1/standby.signal
pg_ctl -D sby1 start
cat <<EOF >> sby2/postgresql.conf
port = 5434
primary_conninfo = 'port=5433'
EOF
touch sby2/standby.signal
pg_ctl -D sby2 start
pgbench -i -s2
pg_ctl -D sby2 restart
--------------------------------------------
In this case, after restarting the standby connecting to another
(cascading) standby, I observed the following error.
FATAL: could not receive data from WAL stream: ERROR: requested
starting point 0/04000000 is ahead of the WAL flush position of this
server 0/03FFE8D0
Regards,
--
Fujii Masao
Hi Fujii-san,
Thanks for looking into this.
On Fri, Jan 30, 2026 at 11:12 AM Fujii Masao <masao.fujii@gmail.com> wrote:
> I haven't read the patch yet, but doesn't lastStreamedFlush represent
> the same LSN as tliRecPtr or replayLSN (the arguments to
> WaitForWALToBecomeAvailable())? If so, we may not need to introduce
> a new variable to track this LSN.
I think they refer to different types of LSNs. I don’t have access to my
computer at the moment, but I’ll look into it and get back to you shortly.
Best,
Xuneng
Hi,
On Fri, Jan 30, 2026 at 11:12 AM Fujii Masao <masao.fujii@gmail.com> wrote:
> I haven't read the patch yet, but doesn't lastStreamedFlush represent
> the same LSN as tliRecPtr or replayLSN (the arguments to
> WaitForWALToBecomeAvailable())? If so, we may not need to introduce
> a new variable to track this LSN.
lastStreamedFlush is the upstream's confirmed flush point from the
last streaming session—what the sender guaranteed it had. tliRecPtr is
the LSN of the start of the current WAL record, which is used to
determine which timeline that record belongs to (tliOfPointInHistory),
and replayLSN is how far we've applied locally. After archive fallback,
both tliRecPtr and replayLSN can be ahead of what the upstream has, so
they can't safely cap a reconnect. lastStreamedFlush is used as the
upstream-capability bound.
> The choice of which LSN is used as the replication start point has varied
> over time to handle corner cases (for example, commit 06687198018).
> That makes me wonder whether we should first better understand
> why WaitForWALToBecomeAvailable() currently uses RecPtr as
> the starting point.
AFAICS, fix 06687198018 addresses a scenario where a standby gets
stuck reading a continuation record that spans multiple pages/segments
when the pages must come from different sources.
The problem: if the first page is read successfully from local pg_wal
but the second page contains garbage from a recycled segment, the old
code would enter an infinite loop. This happened because:
Late failure detection: Page header validation occurred inside
XLogReadRecord(), which triggered ReadRecord()'s retry-from-beginning
logic—restarting the entire record read from local sources without
ever trying streaming.
Wrong streaming start position: Even if streaming was eventually
attempted, it started from tliRecPtr (record start) rather than RecPtr
(current read position), potentially re-requesting segments the
primary had already recycled.
The fix has two parts:
Early page header validation: Validate the page header immediately
after reading, before returning to the caller. If garbage is detected
(typically via xlp_pageaddr mismatch), jump directly to
next_record_is_invalid to try an alternative source (streaming),
bypassing ReadRecord()'s retry loop.
Correct streaming start position: Change from ptr = tliRecPtr to ptr =
RecPtr, so streaming begins at the position where data is actually
needed. The record start position (tliRecPtr) is still used for
timeline determination, but no longer for the streaming start LSN.
Together, these changes ensure the standby escapes the local-read
retry loop and fetches the continuation data from the correct position
via streaming.
> BTW, with v1 patch, I was able to reproduce the issue using the following steps:
>
> initdb -D data
> mkdir arch
> cat <<EOF >> data/postgresql.conf
> archive_mode = on
> archive_command = 'cp %p ../arch/%f'
> restore_command = 'cp ../arch/%f %p'
> EOF
> pg_ctl -D data start
> pg_basebackup -D sby1 -c fast
> cp -a sby1 sby2
> cat <<EOF >> sby1/postgresql.conf
> port = 5433
> EOF
> touch sby1/standby.signal
> pg_ctl -D sby1 start
> cat <<EOF >> sby2/postgresql.conf
> port = 5434
> primary_conninfo = 'port=5433'
> EOF
> touch sby2/standby.signal
> pg_ctl -D sby2 start
> pgbench -i -s2
> pg_ctl -D sby2 restart
>
> In this case, after restarting the standby connecting to another
> (cascading) standby, I observed the following error.
>
> FATAL: could not receive data from WAL stream: ERROR: requested
> starting point 0/04000000 is ahead of the WAL flush position of this
> server 0/03FFE8D0
After sby2 restarts, its WAL read position (RecPtr) is set to the
segment boundary 0/04000000, but the upstream sby1 (an archive-only
standby with no walreceiver) can only serve up to its replay position
0/03FFE8D0. The cascade requests WAL ahead of what the upstream can
provide.
The issue is that no in-memory state survives the restart to cap the
streaming start request. Before restart, the walreceiver knew what the
upstream had confirmed; after restart, that information is lost.
One potential solution is a "handshake clamp": after connecting,
obtain the upstream's current flush LSN from IDENTIFY_SYSTEM and clamp
the streaming start position to Min(startpoint, primaryFlush) before
sending START_REPLICATION. But I think this is somewhat complicated.
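The clamp itself would be a one-liner (sketch; handshake_clamp is a hypothetical name, and the real complexity is in plumbing the IDENTIFY_SYSTEM flush LSN to this point):

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;
#define Min(a, b) ((a) < (b) ? (a) : (b))

/* Hypothetical "handshake clamp": cap the START_REPLICATION start point
 * at the flush LSN the upstream reported via IDENTIFY_SYSTEM, so we
 * never ask for WAL the upstream has not flushed. */
static XLogRecPtr
handshake_clamp(XLogRecPtr startpoint, XLogRecPtr primary_flush)
{
    return Min(startpoint, primary_flush);
}
```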
--
Best,
Xuneng
On Thu, Jan 29, 2026 at 11:33 AM Fujii Masao wrote:
> Interestingly, I couldn't reproduce it on v11 using the same test case.
> This makes me wonder whether the issue was introduced in v12 or later.
I did some investigation on this. The bug actually reproduces on v11
with the same setup (archive-only upstream standby + cascading standby
with restore_command). I ran the test and got the exact same error:
FATAL: could not receive data from WAL stream:
ERROR: requested starting point 0/4000000 is ahead of the WAL
flush position of this server 0/3FFEED8
The reason Fujii-san might not have seen it on v11 is likely related
to how pg_basebackup works with recovery.conf. In the PG12+
reproducer, sby1 has no primary_conninfo (just standby.signal), making
it archive-only. But in PG11, when adapting the test, if sby1 retains
any primary_conninfo from the basebackup setup, it would stream from
the primary and its flush position would stay current, masking the bug.
I bisected further and found that the check causing the rejection was
introduced by commit abfd192b1b5 ("Allow a streaming replication
standby to follow a timeline switch", 2012-12-13, Heikki Linnakangas),
which first appeared in PG 9.3. That commit added this validation in
StartReplication():
    if (am_cascading_walsender)
        FlushPtr = GetStandbyFlushRecPtr();
    else
        FlushPtr = GetFlushRecPtr();

    if (FlushPtr < cmd->startpoint)
        ereport(ERROR,
                errmsg("requested starting point ... is ahead of "
                       "the WAL flush position ..."));
Before that commit (PG 9.2), the walsender had no such check and would
just start sending from whatever position was requested, waiting for the
data to become available if needed. So the bug has existed since PG 9.3,
not since PG 12.
The check itself is correct -- you shouldn't serve WAL you don't have.
The real issue is on the requesting side: the cascading standby asks for
a position it advanced to via archive recovery, which the upstream hasn't
reached yet.
Best regards,
Marco
On Thu, Jan 29, 2026 at 12:33 PM Fujii Masao <masao.fujii@gmail.com> wrote:
Attached is a v2 patch that implements the "handshake clamp" approach
Xuneng suggested. Rather than tracking lastStreamedFlush in
process-local state (which doesn't survive a cascade restart, as
Fujii-san demonstrated), it uses the WAL flush position already
returned by IDENTIFY_SYSTEM.
The walreceiver now checks the upstream's flush position before issuing
START_REPLICATION. If the requested startpoint is ahead (on the same
timeline), it waits for wal_retrieve_retry_interval and retries. This
works across restarts since it queries the upstream's live position on
every connection attempt, and requires no new state variables.
When timelines differ, we let START_REPLICATION handle the timeline
negotiation as before.
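The decision the walreceiver makes on each connection attempt can be sketched like this (hypothetical names; the actual patch performs this check before issuing START_REPLICATION):

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;
typedef enum { STREAM_NOW, WAIT_AND_RETRY } StartAction;

/* Hypothetical sketch of the v2 logic: on the same timeline, compare the
 * requested start point with the flush LSN from IDENTIFY_SYSTEM; if the
 * upstream is behind, sleep wal_retrieve_retry_interval and retry rather
 * than issuing a START_REPLICATION that is known to fail. */
static StartAction
decide_start(XLogRecPtr startpoint, XLogRecPtr upstream_flush,
             int same_timeline)
{
    if (same_timeline && startpoint > upstream_flush)
        return WAIT_AND_RETRY;
    return STREAM_NOW;   /* timeline switches are left to START_REPLICATION */
}
```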
The patch includes a TAP test (053_cascade_reconnect.pl) that
reproduces the scenario and verifies the fix.
Attachments:
v2-0001-Fix-cascading-standby-reconnect-failure-after-arc.patch (text/x-patch, +203 -8)
Hi,
Thanks for the patch.
On Tue, Mar 17, 2026 at 5:49 AM Marco Nenciarini
<marco.nenciarini@enterprisedb.com> wrote:
I haven’t looked into it in detail yet, but it looks good overall.
I’ll test it further and verify that the issue has been resolved.
--
Best,
Xuneng
On Tue, Mar 17, 2026 at 9:04 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
One thing I’m not sure about is whether we need to create a standalone
test file for this patch, or if it would fit well within existing TAP
tests.
I found several places for integration:
001_stream_rep.pl: it already has a primary -> standby -> cascading
standby setup, and it even touches primary_conninfo reload behavior.
But it is already a large mixed-purpose file, and this bug needs a
fairly specific archive-fallback reconnection story. Adding it there
would make that file even less focused.
025_stuck_on_old_timeline.pl: this is the nearest thematic neighbor
since it combines cascading replication and archive/stream
interactions. But it is really about timeline-following after
promotion, not “downstream advances via archive and then must
reconnect to an upstream that is still behind”.
048_vacuum_horizon_floor.pl: it already exercises stopping and
restarting walreceiver via primary_conninfo reload, but it has nothing
to do with archive fallback or cascading reconnect logic.
The failure scenario is specific enough, and the three-node setup plus
archive fallback plus reconnect check seems to be a coherent
reproducer on its own.
--
Best,
Xuneng
I agree, a standalone test file is the right call here.
I looked at the same candidates. 025_stuck_on_old_timeline.pl is the
closest thematic match, but its archive command intentionally copies
only history files and the whole test revolves around promotion and
timeline following. Adapting it would mean replacing the archive
command and skipping the promotion, which defeats its original purpose.
The reconnect-after-archive-fallback scenario is distinct enough to
justify its own file, and at 143 lines it's reasonably small.
Best regards,
Marco
Since this bug dates back to 9.3, the fix will likely need backpatching.
The v2 patch changes the walrcv_identify_system() signature, which would
be an ABI break on stable branches (walrcv_identify_system_fn is a
function pointer in the WalReceiverFunctionsType struct).
Attached is a backpatch-compatible variant that avoids the API change.
Instead of adding a parameter, libpqrcv_identify_system() stores the
flush position in a new global variable (WalRcvIdentifySystemLsn), and
the walreceiver reads it directly. The fix logic and TAP test are
otherwise identical.
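The ABI-preserving shape can be sketched as follows (illustrative stub; the real libpqrcv_identify_system() issues IDENTIFY_SYSTEM and parses the result set, and the reset-before-call is a defensive step, not committed code):

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/* Backpatch variant: rather than adding an out parameter to
 * walrcv_identify_system() (an ABI break, since it is a function pointer
 * in WalReceiverFunctionsType), stash the reported flush LSN in a global
 * that the walreceiver reads after the call. */
static XLogRecPtr WalRcvIdentifySystemLsn = 0;

static void
identify_system_stub(XLogRecPtr reported_flush)
{
    WalRcvIdentifySystemLsn = 0;    /* reset before each attempt */
    /* ... issue IDENTIFY_SYSTEM and parse the result here ... */
    WalRcvIdentifySystemLsn = reported_flush;
}
```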
For master I'd still prefer the v2 approach with the extended signature,
since it's cleaner and there's no ABI constraint.
Best regards,
Marco
Attachments:
v2-backpatch-0001-Fix-cascading-standby-reconnect-failure-after-archiv.patch (text/x-patch, +202 -1)
On Tue, Mar 17, 2026 at 4:13 PM Marco Nenciarini
<marco.nenciarini@enterprisedb.com> wrote:
I’ve applied the patch and verified the fix using the two scripts you
provided earlier, as well as the failing test from v1 provided by
Fujii-san. I’ve also made some small improvements to the TAP test:
1) Added a positive synchronization point using wait_for_event() on
walreceiver / WalReceiverUpstreamCatchup, so the test now proves it
enters the reconnect-behind-upstream window before asserting outcomes.
2) Replaced broad log scanning with a scoped log window:
- capture logfile offset after rotation
- use slurp_file(..., $offset) for post-restart assertions only
- assert absence of the old “requested starting point … ahead of the
WAL flush position” error in that bounded window.
Please check it.
--
Best,
Xuneng
Attachments:
v3-0001-Fix-cascading-standby-reconnect-failure-after-arc.patch (application/octet-stream, +210 -8)
Thanks for verifying the fix and improving the test, Xuneng.
The wait_for_event() synchronization is a nice addition — it gives
deterministic proof that the walreceiver actually entered the
upstream-catchup path. The scoped log window with slurp_file() is
also cleaner than the broad log_contains() I had before.
The v3 test improvements look good to me.
Best regards,
Marco
On Tue, Mar 17, 2026 at 5:31 PM Marco Nenciarini
<marco.nenciarini@enterprisedb.com> wrote:
I think that the ABI concern for backpatching is valid, and the
proposed workaround looks reasonable to me. Resetting
WalRcvIdentifySystemLsn before walrcv_identify_system() seems like a
sensible defensive move, so I've added it in v3. The TAP test has been
updated as well.
--
Best,
Xuneng
Attachments:
v3-backpatch-0001-Fix-cascading-standby-reconnect-failure-after-arc.patch (application/octet-stream, +212 -1)
On Tue, Mar 17, 2026 at 7:56 PM Marco Nenciarini
<marco.nenciarini@enterprisedb.com> wrote:
Thanks for checking. I think we also need to add the new TAP test to
meson.build for the master patch.
--
Best,
Xuneng
Attachments:
v3-0001-Fix-cascading-standby-reconnect-failure-after-arc.patch (application/octet-stream, +211 -8)
On Tue, Mar 17, 2026 at 8:20 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
After thinking about this more, I'm less convinced by polling at
wal_retrieve_retry_interval. If the upstream stalls for a long time, or
permanently, the walreceiver can loop indefinitely, leaving the startup
process effectively pinned in the streaming path instead of switching
to other WAL sources. In that case, repeated "ahead of flush position"
log entries can also become noisy. On the other hand, if the upstream
catches up quickly, the walreceiver still won't notice until the next
interval, adding up to one full wal_retrieve_retry_interval of
unnecessary latency.
--
Best,
Xuneng
On Wed, Mar 18, 2026 at 2:51 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Good points, Xuneng.
For the log noise: we could emit the first "ahead of flush position"
message at LOG level, then demote subsequent attempts to DEBUG1 until
the condition clears. That keeps the initial occurrence visible for
diagnostics without flooding the log during a long wait.
For the indefinite loop: I agree that unbounded polling is not ideal.
The gap this fix targets is bounded in practice: the startup process
alternates between archive recovery and streaming attempts, so at
each streaming attempt the cascade is at most one WAL segment ahead
of the upstream. If the gap is larger than that, something more
fundamental is wrong and the walreceiver should get out of the way
so the startup process can fall back to other WAL sources.
We could cap the wait with a threshold: if startpoint is more than
one wal_segment_size ahead of the upstream's flush position, skip the
wait and let START_REPLICATION proceed normally (and fail), so the
walreceiver exits and the startup process can switch to archive.
That way we absorb the one-segment gap that arises naturally from
archive recovery, without masking larger problems.
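The proposed cap could be sketched as follows, assuming the default 16MB
segment size (the names are hypothetical stand-ins for the walreceiver
code, not the actual patch):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for PostgreSQL's XLogRecPtr and wal_segment_size. */
typedef uint64_t XLogRecPtr;
#define WAL_SEGMENT_SIZE ((uint64_t) 16 * 1024 * 1024)  /* default 16MB */

/*
 * Decide whether the walreceiver should wait for the upstream to catch up.
 * Wait only when startpoint is ahead of the upstream flush position by at
 * most one WAL segment -- the gap that archive recovery naturally creates.
 * A larger gap indicates a more fundamental problem, so skip the wait and
 * let START_REPLICATION fail, returning control to the startup process.
 */
static bool
should_wait_for_upstream(XLogRecPtr startpoint, XLogRecPtr upstream_flush)
{
    if (startpoint <= upstream_flush)
        return false;           /* not ahead; stream immediately */
    return (startpoint - upstream_flush) <= WAL_SEGMENT_SIZE;
}
```

With the LSNs from the original report (startpoint 0/A000000, upstream
flush 0/9000000), the gap is exactly one 16MB segment, so the sketch would
wait; a two-segment gap would skip the wait.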
Thoughts on whether wal_segment_size is the right bound, or if
something else would be more appropriate?
Best regards,
Marco
Here are the v4 patches implementing what I described above.
On top of Xuneng's v3 (keeping the wait_for_event and scoped log
window test improvements), the main changes are:
- The wait is now capped at one wal_segment_size. If the gap is
larger, we skip the wait and let START_REPLICATION fail normally
so the startup process can fall back to archive. This avoids
indefinite polling when the upstream is fundamentally behind.
- The first "ahead of flush position" message is logged at LOG,
subsequent ones at DEBUG1, to cut down on noise during a long wait.
Two patches attached: v4-0001 for master (extends the
walrcv_identify_system API with an optional server_lsn output
parameter) and v4-backpatch-0001 for stable branches (uses a global
variable to preserve ABI, per Alvaro's suggestion).
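The API extension boils down to a NULL-able out parameter, so existing
callers pass NULL and are unchanged while the fix can request the flush
position. A minimal self-contained model (the types and values here are
hypothetical; the real callback is walrcv_identify_system in
walreceiver.h):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for PostgreSQL's types. */
typedef uint64_t XLogRecPtr;
typedef int TimeLineID;

/*
 * Model of an IDENTIFY_SYSTEM wrapper extended with an optional
 * server_lsn output parameter.  Callers that don't need the server's
 * current LSN pass NULL and see no behavioral change.
 */
static TimeLineID
identify_system(char **primary_sysid, XLogRecPtr *server_lsn)
{
    static char sysid[] = "7000000000000000000";   /* fake system identifier */

    *primary_sysid = sysid;
    if (server_lsn != NULL)
        *server_lsn = (XLogRecPtr) 0x9000000;      /* fake flush position */
    return 1;                                      /* fake timeline */
}
```

This is why the master patch can extend the function signature directly,
while the back-branch version resorts to a global variable: stable
branches cannot change the callback's signature without breaking ABI.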
Both pass the new TAP test.
Best regards,
Marco
Attachments:
v4-backpatch-0001-Fix-cascading-standby-reconnect-failure-after-arc.patch
v4-0001-Fix-cascading-standby-reconnect-failure-after-arc.patch
Hi Marco,
On Wed, Mar 18, 2026 at 4:34 PM Marco Nenciarini
<marco.nenciarini@enterprisedb.com> wrote:
On Wed, Mar 18, 2026 at 2:51 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
On Tue, Mar 17, 2026 at 8:20 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
On Tue, Mar 17, 2026 at 7:56 PM Marco Nenciarini
<marco.nenciarini@enterprisedb.com> wrote:

Thanks for verifying the fix and improving the test, Xuneng.
The wait_for_event() synchronization is a nice addition — it gives
deterministic proof that the walreceiver actually entered the
upstream-catchup path. The scoped log window with slurp_file() is
also cleaner than the broad log_contains() I had before.

After thinking about this more, I'm less convinced by polling at
wal_retrieve_retry_interval. If the upstream stalls for a
long time, or permanently, the walreceiver can loop indefinitely,
leaving startup effectively pinned in the streaming path instead of
switching to other WAL sources. In that case, repeated “ahead of flush
position” log entries can also become noisy. On the other hand, if the
upstream catches up quickly, walreceiver still won’t notice until the
next interval, adding unnecessary latency of up to one full
wal_retrieve_retry_interval.

Good points, Xuneng.
For the log noise: we could emit the first "ahead of flush position"
message at LOG level, then demote subsequent attempts to DEBUG1 until
the condition clears. That keeps the initial occurrence visible for
diagnostics without flooding the log during a long wait.

For the indefinite loop: I agree that unbounded polling is not ideal.
The gap this fix targets is bounded in practice: the startup process
alternates between archive recovery and streaming attempts, so at
each streaming attempt the cascade is at most one WAL segment ahead
of the upstream. If the gap is larger than that, something more
fundamental is wrong and the walreceiver should get out of the way
so the startup process can fall back to other WAL sources.
I am not sure about this bound here. It seems to me that the gap could
be several segments due to upstream lag. Under that assumption, it
also seems less than ideal to simply clamp the pointer to the upstream
server's flush LSN and proceed with the handshake, since the potential
duplication of segments could be large.
We could cap the wait with a threshold: if startpoint is more than
one wal_segment_size ahead of the upstream's flush position, skip the
wait and let START_REPLICATION proceed normally (and fail), so the
walreceiver exits and the startup process can switch to archive.
That way we absorb the one-segment gap that arises naturally from
archive recovery, without masking larger problems.

Thoughts on whether wal_segment_size is the right bound, or if
something else would be more appropriate?
Even with only a one-segment gap, if the upstream server’s flush LSN
does not advance, we would remain stuck polling indefinitely.
--
Best,
Xuneng