t/035_standby_logical_decoding.pl might fail on attempt to read wrong timeline

Started by Alexander Lakhinabout 2 months ago24 messageshackers

exclusion@gmail.com

about 2 months ago

Hello hackers,

I'd like to present a recent failure of 035_standby_logical_decoding on
buildfarm [1]https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=scorpion&dt=2026-06-03%2008%3A31%3A43:
### Promoting node "standby"
# Running: pg_ctl --pgdata
/home/bf/bf-build/scorpion/HEAD/pgsql.build/testrun/recovery/035_standby_logical_decoding/data/t_035_standby_logical_decoding_standby_data/pgdata
--log
/home/bf/bf-build/scorpion/HEAD/pgsql.build/testrun/recovery/035_standby_logical_decoding/log/035_standby_logical_decoding_standby.log
promote
waiting for server to promote.... done
server promoted
Waiting for replication conn cascading_standby's replay_lsn to pass 0/0648A290 on standby
done
[10:40:16.996](0.446s) ok 84 - got expected output from SQL decoding session on promoted standby
[10:40:16.997](0.001s) # pump_until: process terminated unexpectedly when searching for "(?^s:^.*COMMIT.*COMMIT$)" with
stream: ""
[10:40:16.997](0.000s) not ok 85 - got 2 COMMIT from pg_recvlogical output
[10:40:16.997](0.000s) #   Failed test 'got 2 COMMIT from pg_recvlogical output'
#   at /home/bf/bf-build/scorpion/HEAD/pgsql/src/test/recovery/t/035_standby_logical_decoding.pl line 1036.
[10:40:16.997](0.000s) not ok 86 - got same expected output from pg_recvlogical decoding session
[10:40:16.997](0.000s) #   Failed test 'got same expected output from pg_recvlogical decoding session'
#   at /home/bf/bf-build/scorpion/HEAD/pgsql/src/test/recovery/t/035_standby_logical_decoding.pl line 1040.
[10:40:16.998](0.000s) #          got: ''
#     expected: 'BEGIN
# table public.decoding_test: INSERT: x[integer]:1 y[text]:'1'
# table public.decoding_test: INSERT: x[integer]:2 y[text]:'2'
# table public.decoding_test: INSERT: x[integer]:3 y[text]:'3'
# table public.decoding_test: INSERT: x[integer]:4 y[text]:'4'
# COMMIT
# BEGIN
# table public.decoding_test: INSERT: x[integer]:5 y[text]:'5'
# table public.decoding_test: INSERT: x[integer]:6 y[text]:'6'
# table public.decoding_test: INSERT: x[integer]:7 y[text]:'7'
# COMMIT'
[10:40:17.009](0.011s) ok 87 - got expected output from SQL decoding session on cascading standby
[10:40:17.009](0.000s) ok 88 - got 2 COMMIT from pg_recvlogical output

035_standby_logical_decoding_standby.log contains:
2026-06-03 10:40:16.745 CEST [3955206][startup][29/0:0] LOG: received promote request
...
2026-06-03 10:40:16.784 CEST [3955206][startup][29/0:0] LOG: archive recovery complete
2026-06-03 10:40:16.862 CEST [3963805][walsender][25/0:0] LOG: starting logical decoding for slot "promotion_activeslot"
2026-06-03 10:40:16.862 CEST [3963805][walsender][25/0:0] DETAIL: Streaming transactions committing after 0/06487A20,
reading WAL from 0/064879E8.
2026-06-03 10:40:16.862 CEST [3963805][walsender][25/0:0] STATEMENT: START_REPLICATION SLOT "promotion_activeslot"
LOGICAL 0/00000000 ("include-xids" '0', "skip-empty-xacts" '1')
2026-06-03 10:40:16.862 CEST [3963805][walsender][25/0:0] ERROR: requested WAL segment 000000010000000000000006 has
already been removed
2026-06-03 10:40:16.862 CEST [3963805][walsender][25/0:0] STATEMENT: START_REPLICATION SLOT "promotion_activeslot"
LOGICAL 0/00000000 ("include-xids" '0', "skip-empty-xacts" '1')
2026-06-03 10:40:16.862 CEST [3963805][walsender][25/0:0] LOG: released logical replication slot "promotion_activeslot"
2026-06-03 10:40:16.862 CEST [3963805][walsender][:0] LOG: disconnection: session time: 0:00:00.212 user=bf
database=testdb host=[local]
...

That is, walsender requested WAL segment for timeline 1, while in a
successful run, it reads WAL for timeline 2.

I've managed to reproduce this failure with:
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1502,2 +1502,3 @@ StartLogicalReplication(StartReplicationCmd *cmd)
      ReplicationSlotAcquire(cmd->slotname, true, true);
+pg_usleep(200000);

Being executed in a loop, the test fails for me within 100 iterations
(within 10 on a slower VM):
ITERATION 23
# +++ tap check in src/test/recovery +++
t/035_standby_logical_decoding.pl .. 84/? # pump_until: process terminated unexpectedly when searching for
"(?^s:^.*COMMIT.*COMMIT$)" with stream: ""

# Failed test 'got 2 COMMIT from pg_recvlogical output'
# at t/035_standby_logical_decoding.pl line 1038.

#   Failed test 'got same expected output from pg_recvlogical decoding session'
#   at t/035_standby_logical_decoding.pl line 1042.
#          got: ''
#     expected: 'BEGIN
# table public.decoding_test: INSERT: x[integer]:1 y[text]:'1'
...

As far as I can see, the timeline is chosen in logical_read_xlog_page()
depending on the recovery state:
am_cascading_walsender = RecoveryInProgress();

        if (am_cascading_walsender)
                GetXLogReplayRecPtr(&currTLI);
        else
                currTLI = GetWALInsertionTimeLine();

With diagnostic logging added here, a failed run shows:
2026-06-06 07:23:01.558 UTC walsender[165403] 035_standby_logical_decoding.pl LOG: starting logical decoding for slot
"promotion_activeslot"
2026-06-06 07:23:01.558 UTC walsender[165403] 035_standby_logical_decoding.pl DETAIL: Streaming transactions committing
after 0/0647C7E8, reading WAL from 0/0647C7B0.
2026-06-06 07:23:01.558 UTC walsender[165403] 035_standby_logical_decoding.pl STATEMENT: START_REPLICATION SLOT
"promotion_activeslot" LOGICAL 0/00000000 ("include-xids" '0', "skip-empty-xacts" '1')
2026-06-06 07:23:01.559 UTC walsender[165403] 035_standby_logical_decoding.pl LOG: !!!logical_read_xlog_page|
am_cascading_walsender: 1, currTLI: 1
2026-06-06 07:23:01.559 UTC walsender[165403] 035_standby_logical_decoding.pl STATEMENT: START_REPLICATION SLOT
"promotion_activeslot" LOGICAL 0/00000000 ("include-xids" '0', "skip-empty-xacts" '1')
2026-06-06 07:23:01.559 UTC walsender[165403] 035_standby_logical_decoding.pl LOG: !!!WalSndSegmentOpen| nextSegNo: 6,
path: pg_wal/000000010000000000000006
2026-06-06 07:23:01.559 UTC walsender[165403] 035_standby_logical_decoding.pl STATEMENT: START_REPLICATION SLOT
"promotion_activeslot" LOGICAL 0/00000000 ("include-xids" '0', "skip-empty-xacts" '1')
2026-06-06 07:23:01.559 UTC walsender[165403] 035_standby_logical_decoding.pl ERROR: requested WAL segment
000000010000000000000006 has already been removed

In a successful run, am_cascading_walsender/RecoveryInProgress() == false and currTLI == 2:
2026-06-06 07:28:38.498 UTC walsender[169751] 035_standby_logical_decoding.pl LOG: starting logical decoding for slot
"promotion_activeslot"
2026-06-06 07:28:38.498 UTC walsender[169751] 035_standby_logical_decoding.pl DETAIL: Streaming transactions committing
after 0/0647C7E8, reading WAL from 0/0647C7B0.
2026-06-06 07:28:38.498 UTC walsender[169751] 035_standby_logical_decoding.pl STATEMENT: START_REPLICATION SLOT
"promotion_activeslot" LOGICAL 0/00000000 ("include-xids" '0', "skip-empty-xacts" '1')
2026-06-06 07:28:38.498 UTC walsender[169751] 035_standby_logical_decoding.pl LOG: !!!logical_read_xlog_page|
am_cascading_walsender: 0, currTLI: 2
2026-06-06 07:28:38.498 UTC walsender[169751] 035_standby_logical_decoding.pl STATEMENT: START_REPLICATION SLOT
"promotion_activeslot" LOGICAL 0/00000000 ("include-xids" '0', "skip-empty-xacts" '1')
2026-06-06 07:28:38.498 UTC walsender[169751] 035_standby_logical_decoding.pl LOG: !!!WalSndSegmentOpen| nextSegNo: 6,
path: pg_wal/000000020000000000000006

Reproduced at the commit fcd77d532, which added the test.

[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=scorpion&dt=2026-06-03%2008%3A31%3A43

Best regards,
Alexander

Bertrand Drouvot

bertranddrouvot.pg@gmail.com

about 2 months ago

In reply to: Alexander Lakhin (#1)

Re: t/035_standby_logical_decoding.pl might fail on attempt to read wrong timeline

Hi Alexander,

On Sat, Jun 06, 2026 at 12:00:00PM +0300, Alexander Lakhin wrote:

Hello hackers,

That is, walsender requested WAL segment for timeline 1, while in a
successful run, it reads WAL for timeline 2.

I've managed to reproduce this failure with:

Thanks for the report and the repro!

As far as I can see, the timeline is chosen in logical_read_xlog_page()
depending on the recovery state:
        am_cascading_walsender = RecoveryInProgress();

        if (am_cascading_walsender)
                GetXLogReplayRecPtr(&currTLI);
        else
                currTLI = GetWALInsertionTimeLine();

Yeah, it looks like there is a race condition here. I think we should check if
the insertion timeline has already been set (like the walsummarizer is doing).

I'll work on a fix early next week.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Xuneng Zhou

xunengzhou@gmail.com

about 2 months ago

In reply to: Bertrand Drouvot (#2)

Re: t/035_standby_logical_decoding.pl might fail on attempt to read wrong timeline

Hi Bertrand,

On Sat, Jun 6, 2026 at 7:07 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

Hi Alexander,

On Sat, Jun 06, 2026 at 12:00:00PM +0300, Alexander Lakhin wrote:

Hello hackers,

That is, walsender requested WAL segment for timeline 1, while in a
successful run, it reads WAL for timeline 2.

I've managed to reproduce this failure with:

Thanks for the report and the repro!

As far as I can see, the timeline is chosen in logical_read_xlog_page()
depending on the recovery state:
am_cascading_walsender = RecoveryInProgress();

if (am_cascading_walsender)
GetXLogReplayRecPtr(&currTLI);
else
currTLI = GetWALInsertionTimeLine();

Yeah, it looks like there is a race condition here. I think we should check if
the insertion timeline has already been set (like the walsummarizer is doing).

I'll work on a fix early next week.

This looks like the right direction to fix. We may want to apply
similar logic to read_local_xlog_page_guts as well. Although the
failure is reported in walsender, SQL logical decoding uses the local
WAL reader and has the same recovery/TLI pattern.

--
Regards,
Xuneng Zhou
HighGo Software Co., Ltd.

Hayato Kuroda (Fujitsu)

kuroda.hayato@fujitsu.com

about 2 months ago

In reply to: Bertrand Drouvot (#2)

RE: t/035_standby_logical_decoding.pl might fail on attempt to read wrong timeline

Hi Alexander, Bertrand, Xuneng,

Thanks for seeing the failure. Our team also recognized but could not find the reason.

Yeah, it looks like there is a race condition here. I think we should check if
the insertion timeline has already been set (like the walsummarizer is doing).

Sorry for stupid question; I tried to reproduce the failure but could not, see attached.

IIUC, the issue can happen if the walsender must read the WAL record generated
after the promotion but the timeline could not be updated.

However, I think logical_read_xlog_page() is called after the new WAL records
are generated, i.e., am_cascading_walsender has already been false at that time.
So not sure where is the race?

Best regards,
Hayato Kuroda
FUJITSU LIMITED

Xuneng Zhou

xunengzhou@gmail.com

about 2 months ago

In reply to: Hayato Kuroda (Fujitsu) (#4)

Re: t/035_standby_logical_decoding.pl might fail on attempt to read wrong timeline

Hi Kuroda-san,

On Mon, Jun 8, 2026 at 12:25 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

Hi Alexander, Bertrand, Xuneng,

Thanks for seeing the failure. Our team also recognized but could not find the reason.

Yeah, it looks like there is a race condition here. I think we should check if
the insertion timeline has already been set (like the walsummarizer is doing).

Sorry for stupid question; I tried to reproduce the failure but could not, see attached.

IIUC, the issue can happen if the walsender must read the WAL record generated
after the promotion but the timeline could not be updated.

I think the race is that the logical walsender is trying to read the
pre-promotion wal from the slot's restart_lsn.

1) In 035_standby_logical_decoding.pl, the active slot /
pg_recvlogical process is started before:
INSERT rows 1..4 # before promotion
promote standby
INSERT rows 5..7 # after promotion

2) Alexander’s reproducer sleeps immediately after:
ReplicationSlotAcquire(cmd->slotname, true, true);
pg_usleep(200000);

So with the delay:
slot is active but walsender has not yet create decoding context and
called xlogbeginread();

3) Then the test continues and promotes the standby while the
walsender still sleeping, therefore promotion could happen before the
walsender starts its actual wal read.

4) The logical walsender is trying to read the pre-promotion wal from
the slot's restart_lsn. The failing log implies this:
Streaming transactions committing after 0/06487A20,
reading WAL from 0/064879E8.

This line is emitted from:
errdetail("Streaming transactions committing after %X/%08X, reading
WAL from %X/%08X.",
slot->data.confirmed_flush,
slot->data.restart_lsn)

Function StartLogicalReplication() initiates the xlogreader at the restart lsn:
XLogBeginRead(logical_decoding_ctx->reader,
MyReplicationSlot->data.restart_lsn);

Because the slot was created and acquired before promotion, its
restart_lsn is from the standby-era state which points to wal required
for decoding the pre-promotion stream. In the test, the first expected
output is rows 1..4, inserted before promotion.

The failing output was empty:
got: ''
expected:
BEGIN
rows 1..4
COMMIT
BEGIN
rows 5..7
COMMIT

Empty output suggests that the walsender died before emitting even the
first pre-promotion transaction.

5) The walsender try to read wal but failed
Failed run:
LOG: !!!WalSndSegmentOpen| nextSegNo: 6,
path: pg_wal/000000010000000000000006
ERROR: requested WAL segment
000000010000000000000006 has already been removed

Passed run
LOG: !!!WalSndSegmentOpen| nextSegNo: 6,
path: pg_wal/000000020000000000000006

As the log suggested, the walsender of the successful run readed the
wal from timeline 2 but failed to do so for timeline1.
This comment in xlogutils.is almost exactly the issue:

We care about timelines ... reading xlog generated prior to a promotion
...
The server copied the segment to the new timeline ... there's no guarantee
the old segment will still exist. It may have been deleted or renamed with
a .partial suffix

So the need for tli1 does not always imply that the wal was generated
after promotion. It could be bytes from before the switchpoint, but
they now be available through the new tli segment file.

6) XLogReadDetermineTimeline() needs the current system-wide TLI.

The current code in does roughly:

am_cascading_walsender = RecoveryInProgress();

if (am_cascading_walsender)
GetXLogReplayRecPtr(&currTLI);
else
currTLI = GetWALInsertionTimeLine();

During promotion, we set the wal insertion timeline before marking
recovery as fully done. So there is a window where:

- InsertTimeLineID is already the new timeline, e.g. tli 2.
- RecoveryInProgress() can still return true in another backend.
- A logical walsender that started before promotion can wake up and
choose its read timeline using the old recovery-state test.

In the race, replay tli 1 is chosen even though the promotion has
created the new tli 2 wal segment copy. Then WalSndSegmentOpen() tries
to open 000000010000000000000006, which may already be gone or
renamed, while the correct file is 000000020000000000000006.

7) Bertrand's proposal to fix this
Recognizing the intermediate promotion state -- if recovery still
shows progress but the insertion tli is already advanced, use that
insertion tli for the timeline to be chosen.

However, I think logical_read_xlog_page() is called after the new WAL records
are generated, i.e., am_cascading_walsender has already been false at that time.
So not sure where is the race?

This function is not only called after new post-promotion records are
generated. When START_REPLICATION begins, CreateDecodingContext() /
XLogBeginRead() starts from the slot's restart_lsn, which can be well
before promotion.

--
Regards,
Xuneng Zhou
HighGo Software Co., Ltd.

Bertrand Drouvot

bertranddrouvot.pg@gmail.com

about 2 months ago

In reply to: Hayato Kuroda (Fujitsu) (#4)

Re: t/035_standby_logical_decoding.pl might fail on attempt to read wrong timeline

Hi,

On Mon, Jun 08, 2026 at 04:25:45AM +0000, Hayato Kuroda (Fujitsu) wrote:

Hi Alexander, Bertrand, Xuneng,

Thanks for seeing the failure. Our team also recognized but could not find the reason.

Yeah, it looks like there is a race condition here. I think we should check if
the insertion timeline has already been set (like the walsummarizer is doing).

IIUC, the issue can happen if the walsender must read the WAL record generated
after the promotion but the timeline could not be updated.

However, I think logical_read_xlog_page() is called after the new WAL records
are generated, i.e., am_cascading_walsender has already been false at that time.
So not sure where is the race?

I ended up with this conclusion:

During promotion, there is a window where RecoveryInProgress() still
returns true but old timeline WAL segments have already been removed or
recycled by RemoveNonParentXlogFiles() in CleanupAfterArchiveRecovery().
This is because, in StartupXLOG(), WAL segments are cleaned up before
SharedRecoveryState transitions to RECOVERY_STATE_DONE.

If a walsender performing logical decoding calls logical_read_xlog_page()
during this window, it would get the old timeline from GetXLogReplayRecPtr(),
then attempt to open a WAL segment on that old timeline which no longer exists.

Attached:

0001: To fix this race

Fix by checking GetWALInsertionTimeLineIfSet() when RecoveryInProgress()
returns true. If InsertTimeLineID is already set (non-zero), the new timeline is
established and we use it directly, avoiding attempts to read from segments that
may have been removed.

0002: Adding a test in 035_standby_logical_decoding.pl

It makes use of a new injection point "promotion-after-wal-segment-cleanup" in
StartupXLOG(), right after CleanupAfterArchiveRecovery() removes old timeline
WAL segments but before SharedRecoveryState is set to RECOVERY_STATE_DONE.

The test fails without the fix in 0001 so it also somehow proves that the
diagnostic is right.

0003: Apply the same timeline fix to read_local_xlog_page_guts()

Indeed, it could hit the same race as mentioned by Xuneng-San.

0004: Add a test for 0003

Remark:

As far as the backpatching down to 16, it looks like 0001 to 0004 could be
backpatched as they are down to 17. For 16, we may want to also introduce
GetWALInsertionTimeLineIfSet().

I can have a closer look for the backpatch once we agree on how to fix those
races on master.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Xuneng Zhou

xunengzhou@gmail.com

about 2 months ago

In reply to: Xuneng Zhou (#5)

Re: t/035_standby_logical_decoding.pl might fail on attempt to read wrong timeline

On Mon, Jun 8, 2026 at 4:31 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi Kuroda-san,

On Mon, Jun 8, 2026 at 12:25 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

Hi Alexander, Bertrand, Xuneng,

Thanks for seeing the failure. Our team also recognized but could not find the reason.

Yeah, it looks like there is a race condition here. I think we should check if
the insertion timeline has already been set (like the walsummarizer is doing).

Sorry for stupid question; I tried to reproduce the failure but could not, see attached.

IIUC, the issue can happen if the walsender must read the WAL record generated
after the promotion but the timeline could not be updated.

I think the race is that the logical walsender is trying to read the
pre-promotion wal from the slot's restart_lsn.

1) In 035_standby_logical_decoding.pl, the active slot /
pg_recvlogical process is started before:
INSERT rows 1..4 # before promotion
promote standby
INSERT rows 5..7 # after promotion

2) Alexander’s reproducer sleeps immediately after:
ReplicationSlotAcquire(cmd->slotname, true, true);
pg_usleep(200000);

So with the delay:
slot is active but walsender has not yet create decoding context and
called xlogbeginread();

3) Then the test continues and promotes the standby while the
walsender still sleeping, therefore promotion could happen before the
walsender starts its actual wal read.

4) The logical walsender is trying to read the pre-promotion wal from
the slot's restart_lsn. The failing log implies this:
Streaming transactions committing after 0/06487A20,
reading WAL from 0/064879E8.

This line is emitted from:
errdetail("Streaming transactions committing after %X/%08X, reading
WAL from %X/%08X.",
slot->data.confirmed_flush,
slot->data.restart_lsn)

Function StartLogicalReplication() initiates the xlogreader at the restart lsn:
XLogBeginRead(logical_decoding_ctx->reader,
MyReplicationSlot->data.restart_lsn);

Because the slot was created and acquired before promotion, its
restart_lsn is from the standby-era state which points to wal required
for decoding the pre-promotion stream. In the test, the first expected
output is rows 1..4, inserted before promotion.

The failing output was empty:
got: ''
expected:
BEGIN
rows 1..4
COMMIT
BEGIN
rows 5..7
COMMIT

Empty output suggests that the walsender died before emitting even the
first pre-promotion transaction.

5) The walsender try to read wal but failed
Failed run:
LOG: !!!WalSndSegmentOpen| nextSegNo: 6,
path: pg_wal/000000010000000000000006
ERROR: requested WAL segment
000000010000000000000006 has already been removed

Passed run
LOG: !!!WalSndSegmentOpen| nextSegNo: 6,
path: pg_wal/000000020000000000000006

As the log suggested, the walsender of the successful run readed the
wal from timeline 2 but failed to do so for timeline1.
This comment in xlogutils.is almost exactly the issue:

We care about timelines ... reading xlog generated prior to a promotion
...
The server copied the segment to the new timeline ... there's no guarantee
the old segment will still exist. It may have been deleted or renamed with
a .partial suffix

So the need for tli1 does not always imply that the wal was generated
after promotion. It could be bytes from before the switchpoint, but
they now be available through the new tli segment file.

6) XLogReadDetermineTimeline() needs the current system-wide TLI.

The current code in does roughly:

am_cascading_walsender = RecoveryInProgress();

if (am_cascading_walsender)
GetXLogReplayRecPtr(&currTLI);
else
currTLI = GetWALInsertionTimeLine();

During promotion, we set the wal insertion timeline before marking
recovery as fully done. So there is a window where:

- InsertTimeLineID is already the new timeline, e.g. tli 2.
- RecoveryInProgress() can still return true in another backend.
- A logical walsender that started before promotion can wake up and
choose its read timeline using the old recovery-state test.

In the race, replay tli 1 is chosen even though the promotion has
created the new tli 2 wal segment copy. Then WalSndSegmentOpen() tries
to open 000000010000000000000006, which may already be gone or
renamed, while the correct file is 000000020000000000000006.

7) Bertrand's proposal to fix this
Recognizing the intermediate promotion state -- if recovery still
shows progress but the insertion tli is already advanced, use that
insertion tli for the timeline to be chosen.

However, I think logical_read_xlog_page() is called after the new WAL records
are generated, i.e., am_cascading_walsender has already been false at that time.
So not sure where is the race?

This function is not only called after new post-promotion records are
generated. When START_REPLICATION begins, CreateDecodingContext() /
XLogBeginRead() starts from the slot's restart_lsn, which can be well
before promotion.

I tweaked the reproducer based on the theory outlined above. The main
changes from the original reproducer are:

1) blocks at logical-walsender-after-slot-acquire in walsender.c,
before the decoding context is created and before the reader starts
from restart_lsn, matching the delay set by Alexander

2) Forces the first read to occur during promotion. It inserts rows
1..4, waits for replay, starts promotion with pg_promote(false), holds
startup at startup-logical-decoding-status-change-end-of-recovery,
then wakes the walsender.

The test failed on HEAD and passed after applying the patch by Bertrand.

--
Regards,
Xuneng Zhou
HighGo Software Co., Ltd.

Bertrand Drouvot

bertranddrouvot.pg@gmail.com

about 2 months ago

In reply to: Xuneng Zhou (#7)

Re: t/035_standby_logical_decoding.pl might fail on attempt to read wrong timeline

Hi,

On Mon, Jun 08, 2026 at 09:00:17PM +0800, Xuneng Zhou wrote:

I tweaked the reproducer based on the theory outlined above. The main
changes from the original reproducer are:

1) blocks at logical-walsender-after-slot-acquire in walsender.c,
before the decoding context is created and before the reader starts
from restart_lsn, matching the delay set by Alexander

2) Forces the first read to occur during promotion. It inserts rows
1..4, waits for replay, starts promotion with pg_promote(false), holds
startup at startup-logical-decoding-status-change-end-of-recovery,
then wakes the walsender.

Yeah, so this existing startup-logical-decoding-status-change-end-of-recovery
injection point already exists in the code tree and is also called after
CleanupAfterArchiveRecovery() and before RECOVERY_STATE_DONE change in
StartupXLOG().

So this is the same window as the new injection point that was added in the new
tests in v1-0002 shared up-thread [1]/messages/by-id/aiaBtENl7tTf2MM8@bdtpg.

That said, I think I prefer the v1-0002 shared up-thread [1]/messages/by-id/aiaBtENl7tTf2MM8@bdtpg approach for the
tests as:

- the injection point name clearly describes the tested condition
- no new injection point is added in walsender.c (it pauses startup mid-promotion
and lets the walsender connect)
- the test relies on one injection point (not two)

[1]: /messages/by-id/aiaBtENl7tTf2MM8@bdtpg

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Xuneng Zhou

xunengzhou@gmail.com

about 2 months ago

In reply to: Bertrand Drouvot (#8)

Re: t/035_standby_logical_decoding.pl might fail on attempt to read wrong timeline

On Mon, Jun 8, 2026 at 10:22 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

Hi,

On Mon, Jun 08, 2026 at 09:00:17PM +0800, Xuneng Zhou wrote:

I tweaked the reproducer based on the theory outlined above. The main
changes from the original reproducer are:

1) blocks at logical-walsender-after-slot-acquire in walsender.c,
before the decoding context is created and before the reader starts
from restart_lsn, matching the delay set by Alexander

2) Forces the first read to occur during promotion. It inserts rows
1..4, waits for replay, starts promotion with pg_promote(false), holds
startup at startup-logical-decoding-status-change-end-of-recovery,
then wakes the walsender.

Yeah, so this existing startup-logical-decoding-status-change-end-of-recovery
injection point already exists in the code tree and is also called after
CleanupAfterArchiveRecovery() and before RECOVERY_STATE_DONE change in
StartupXLOG().

So this is the same window as the new injection point that was added in the new
tests in v1-0002 shared up-thread [1].

That said, I think I prefer the v1-0002 shared up-thread [1] approach for the
tests as:

- the injection point name clearly describes the tested condition
- no new injection point is added in walsender.c (it pauses startup mid-promotion
and lets the walsender connect)
- the test relies on one injection point (not two)

[1]: /messages/by-id/aiaBtENl7tTf2MM8@bdtpg

Thanks for the clarification. I haven't reviewed the patch set other
than applying the patch 1 yet, but I plan to do so tomorrow.

--
Regards,
Xuneng Zhou
HighGo Software Co., Ltd.

#10

Hayato Kuroda (Fujitsu)

kuroda.hayato@fujitsu.com

about 2 months ago

In reply to: Bertrand Drouvot (#6)

RE: t/035_standby_logical_decoding.pl might fail on attempt to read wrong timeline

Dear Bertrand, Xuneng,

Thanks for sharing the reproducer and opinion. I could now understand.

While seeing, I felt the same issue can happen even for the streaming case.
Please see attached reproducer and possible fix.

Can you also include that point? Or this thread discusses only for the logical case?
If separated, I can take initiative for the streaming part, after the logical decoding
case.

Best regards,
Hayato Kuroda
FUJITSU LIMITED

#11

Bertrand Drouvot

bertranddrouvot.pg@gmail.com

about 2 months ago

In reply to: Hayato Kuroda (Fujitsu) (#10)

Re: t/035_standby_logical_decoding.pl might fail on attempt to read wrong timeline

Hi Hayato-san,

On Tue, Jun 09, 2026 at 04:04:20AM +0000, Hayato Kuroda (Fujitsu) wrote:

Dear Bertrand, Xuneng,

Thanks for sharing the reproducer and opinion. I could now understand.

While seeing, I felt the same issue can happen even for the streaming case.
Please see attached reproducer and possible fix.

I just had a quick look and I feel that's probably an issue too (good find!).
I think that your fix should use GetWALInsertionTimeLineIfSet() though (but I
only did a quick scan on your patch).

Can you also include that point? Or this thread discusses only for the logical case?
If separated, I can take initiative for the streaming part, after the logical decoding
case.

I would says it's probably better to keep this thread focusing on Alexander's issue
(logical decoding). Once this thread is fixed, then I'd suggest to open a dedicated
thread for the streaming one (mentioning this thread and probably trying to mimic
its fix as close as possible).

Also I would suggest not to post "real" patches here as they conflict with mine
in the cfbot (means the cfbot won't tell me anymore if my patches need a rebase).
Please use .txt files or use "nocfbot" instead if you feel the need.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#12

Bertrand Drouvot

bertranddrouvot.pg@gmail.com

about 2 months ago

In reply to: Bertrand Drouvot (#6)

Re: t/035_standby_logical_decoding.pl might fail on attempt to read wrong timeline

Hi,

On Mon, Jun 08, 2026 at 08:47:48AM +0000, Bertrand Drouvot wrote:

Attached:

0001: To fix this race

Fix by checking GetWALInsertionTimeLineIfSet() when RecoveryInProgress()
returns true. If InsertTimeLineID is already set (non-zero), the new timeline is
established and we use it directly, avoiding attempts to read from segments that
may have been removed.

0002: Adding a test in 035_standby_logical_decoding.pl

It makes use of a new injection point "promotion-after-wal-segment-cleanup" in
StartupXLOG(), right after CleanupAfterArchiveRecovery() removes old timeline
WAL segments but before SharedRecoveryState is set to RECOVERY_STATE_DONE.

The test fails without the fix in 0001 so it also somehow proves that the
diagnostic is right.

0003: Apply the same timeline fix to read_local_xlog_page_guts()

Indeed, it could hit the same race as mentioned by Xuneng-San.

0004: Add a test for 0003

Re-attaching v1 patches here to have the cfbot focusing on those.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#13

Xuneng Zhou

xunengzhou@gmail.com

about 2 months ago

In reply to: Bertrand Drouvot (#11)

Re: t/035_standby_logical_decoding.pl might fail on attempt to read wrong timeline

On Tue, Jun 9, 2026 at 1:57 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

Hi Hayato-san,

On Tue, Jun 09, 2026 at 04:04:20AM +0000, Hayato Kuroda (Fujitsu) wrote:

Dear Bertrand, Xuneng,

Thanks for sharing the reproducer and opinion. I could now understand.

While seeing, I felt the same issue can happen even for the streaming case.
Please see attached reproducer and possible fix.

Good catch!

I just had a quick look and I feel that's probably an issue too (good find!).
I think that your fix should use GetWALInsertionTimeLineIfSet() though (but I
only did a quick scan on your patch).

Can you also include that point? Or this thread discusses only for the logical case?
If separated, I can take initiative for the streaming part, after the logical decoding
case.

I would says it's probably better to keep this thread focusing on Alexander's issue
(logical decoding). Once this thread is fixed, then I'd suggest to open a dedicated
thread for the streaming one (mentioning this thread and probably trying to mimic
its fix as close as possible).

+1 for starting a dedicated thread. Please CC me in that thread in
case I miss it.

--
Regards,
Xuneng Zhou
HighGo Software Co., Ltd.

#14

Hayato Kuroda (Fujitsu)

kuroda.hayato@fujitsu.com

about 2 months ago

In reply to: Bertrand Drouvot (#12)

RE: t/035_standby_logical_decoding.pl might fail on attempt to read wrong timeline

Dear Bertrand,

Thanks for re-posting. Let's focus on the logical decoding stuff.

0004: Add a test for 0003

I found a comment for the test.

```
+# Issue SQL decoding (read_local_xlog_page_guts path) on the pre-connected
+# session.
+$decode_session->query_until(
+       qr/decoding_started/, qq(
+       \\echo decoding_started
+       SELECT count(*) FROM pg_logical_slot_peek_changes('race_slot_sql', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+));
```

Not sure, pg_logical_slot_peek_changes() can be finished without waiting anything,
right? So why do we use query_until() and check the output here?
IIUC, the command can immediately raise an ERROR if the race happened. So isn't it
enough to use safe_psql() to ensure the SQL function can work?

Best regards,
Hayato Kuroda
FUJITSU LIMITED

#15

Xuneng Zhou

xunengzhou@gmail.com

about 2 months ago

In reply to: Xuneng Zhou (#9)

Re: t/035_standby_logical_decoding.pl might fail on attempt to read wrong timeline

On Mon, Jun 8, 2026 at 10:34 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

On Mon, Jun 8, 2026 at 10:22 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

Hi,

On Mon, Jun 08, 2026 at 09:00:17PM +0800, Xuneng Zhou wrote:

I tweaked the reproducer based on the theory outlined above. The main
changes from the original reproducer are:

1) blocks at logical-walsender-after-slot-acquire in walsender.c,
before the decoding context is created and before the reader starts
from restart_lsn, matching the delay set by Alexander

2) Forces the first read to occur during promotion. It inserts rows
1..4, waits for replay, starts promotion with pg_promote(false), holds
startup at startup-logical-decoding-status-change-end-of-recovery,
then wakes the walsender.

Yeah, so this existing startup-logical-decoding-status-change-end-of-recovery
injection point already exists in the code tree and is also called after
CleanupAfterArchiveRecovery() and before RECOVERY_STATE_DONE change in
StartupXLOG().

So this is the same window as the new injection point that was added in the new
tests in v1-0002 shared up-thread [1].

That said, I think I prefer the v1-0002 shared up-thread [1] approach for the
tests as:

- the injection point name clearly describes the tested condition
- no new injection point is added in walsender.c (it pauses startup mid-promotion
and lets the walsender connect)
- the test relies on one injection point (not two)

[1]: /messages/by-id/aiaBtENl7tTf2MM8@bdtpg

Thanks for the clarification. I haven't reviewed the patch set other
than applying the patch 1 yet, but I plan to do so tomorrow.

I've readed through the patch set. They look good overall. Here're
some comments on them:

1) In the commit messages and comments for all four patches, the
reason why the target WAL segment cannot be found on the old timeline
is described as follows:

"old timeline WAL segments have already been removed or
recycled by RemoveNonParentXlogFiles() in CleanupAfterArchiveRecovery()."

Is mentioning the 'remove' case only a bit narrow?

The timeline-selection comment says this explicitly:
"there's no guarantee the old segment will still exist. It may have been
deleted or renamed with a .partial suffix"

How about phrasing it like:
old timeline WAL files may have been removed, recycled, or renamed to .partial.

After running the reproducer provided by Hayato-san, the standby’s
pg_wal directory looked like this following the failure:
000000010000000000000003.partial
00000002.history
000000020000000000000003
000000020000000000000004

So in this repro, the requested file:

000000010000000000000003

was not unlinked as a regular "removed" file. It had been renamed to:

000000010000000000000003.partial

but the log says this explicitly:
ERROR: requested WAL segment 000000010000000000000003 has already been removed

It appears inconsistent to me...

2) Injection points in tests 0002 and 0004

- the injection point name clearly describes the tested condition

I think this is true. It better reflects the intention than
startup-logical-decoding-status-change-end-of-recovery does. However,
I don't know whether this alone warrants a new injection point.

- no new injection point is added in walsender.c (it pauses startup mid-promotion
and lets the walsender connect)
- the test relies on one injection point (not two)

I am unsure that the injection point could replace the walsender-side
synchronization need.

The race has two moving processes:

- startup process: promotion cleanup / RECOVERY_STATE_DONE transition
- walsender process: logical_read_xlog_page() timeline choice

A startup injection point controls only this:
startup is paused in the vulnerable promotion window

It does not prove this:
walsender has reached logical_read_xlog_page() while startup is paused

3) Stricter synchronization point in both tests
Both tests use this condition "active_pid IS NOT NULL" for
synchronization at the walsender side. However, it only proves that
pg_recvlogical has connected walsender has acquired the logical slot,
not necessarily the walsender is paused after acquiring the slot and
before the promotion window is set. There are several potential states
for walsender in this condition:

walsender is just after ReplicationSlotAcquire()
it has called XLogBeginRead()
it is already inside logical_read_xlog_page()
it already opened the WAL segment
it already failed or succeeded

The test cannot distinguish those states reliably.

So we may still need another injection point for synchronization at
the walsender side

4) Stricter result checks in test files
The surrounding 035 test is stricter than the test in 0002. It first
waits for COMMITs, then compares exact decoded output. Should we
adhere to this pattern too?

0004's \d+ check seems somewhat loose to me. Should we capture the
query result and compare it to the expected count, or select data and
compare the decoded rows?

--
Regards,
Xuneng Zhou
HighGo Software Co., Ltd.

#16

Bertrand Drouvot

bertranddrouvot.pg@gmail.com

about 2 months ago

In reply to: Xuneng Zhou (#15)

Re: t/035_standby_logical_decoding.pl might fail on attempt to read wrong timeline

Hi,

On Tue, Jun 09, 2026 at 03:49:50PM +0800, Xuneng Zhou wrote:

On Mon, Jun 8, 2026 at 10:34 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

On Mon, Jun 8, 2026 at 10:22 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

I've readed through the patch set. They look good overall.

Thanks for the review!

Here're
some comments on them:

1) In the commit messages and comments for all four patches, the
reason why the target WAL segment cannot be found on the old timeline
is described as follows:

"old timeline WAL segments have already been removed or
recycled by RemoveNonParentXlogFiles() in CleanupAfterArchiveRecovery()."

Is mentioning the 'remove' case only a bit narrow?

The timeline-selection comment says this explicitly:
"there's no guarantee the old segment will still exist. It may have been
deleted or renamed with a .partial suffix"

How about phrasing it like:
old timeline WAL files may have been removed, recycled, or renamed to .partial.

After running the reproducer provided by Hayato-san, the standby’s
pg_wal directory looked like this following the failure:
000000010000000000000003.partial
00000002.history
000000020000000000000003
000000020000000000000004

So in this repro, the requested file:

000000010000000000000003

was not unlinked as a regular "removed" file. It had been renamed to:

000000010000000000000003.partial

but the log says this explicitly:
ERROR: requested WAL segment 000000010000000000000003 has already been removed

It appears inconsistent to me...

I'm not sure. The error message says "has already been removed" and the commit
messages and comments says"removed or recycled": those are consistent with the
error message. We're describing the symptom from the walsender's perspective,
not the exact file operation that caused it.

2) Injection points in tests 0002 and 0004

It does not prove this:
walsender has reached logical_read_xlog_page() while startup is paused

3) Stricter synchronization point in both tests
Both tests use this condition "active_pid IS NOT NULL" for
synchronization at the walsender side. However, it only proves that
pg_recvlogical has connected walsender has acquired the logical slot,
not necessarily the walsender is paused after acquiring the slot and
before the promotion window is set. There are several potential states
for walsender in this condition:

walsender is just after ReplicationSlotAcquire()
it has called XLogBeginRead()
it is already inside logical_read_xlog_page()
it already opened the WAL segment
it already failed or succeeded

The test cannot distinguish those states reliably.

So we may still need another injection point for synchronization at
the walsender side

I agree that with v1 the test could have been fragile. It's fixed in v2 without
having to add a second injection point. All we have to do is to ensure that
the decoding occurred while the startup is paused on the new injection point.

0002 does that by starting the new walsender and doing the decoding while the
startup is paused
0004 does that by ensuring the pre-connected session triggers the decoding while
the startup is paused

4) Stricter result checks in test files
The surrounding 035 test is stricter than the test in 0002. It first
waits for COMMITs, then compares exact decoded output. Should we
adhere to this pattern too?

That's done in v2 (and also adress Hayato-san feedback).

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#17

Xuneng Zhou

xunengzhou@gmail.com

about 1 month ago

In reply to: Bertrand Drouvot (#16)

Re: t/035_standby_logical_decoding.pl might fail on attempt to read wrong timeline

On Tue, Jun 9, 2026 at 7:01 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

Hi,

On Tue, Jun 09, 2026 at 03:49:50PM +0800, Xuneng Zhou wrote:

On Mon, Jun 8, 2026 at 10:34 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

On Mon, Jun 8, 2026 at 10:22 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

I've readed through the patch set. They look good overall.

Thanks for the review!

Here're
some comments on them:

1) In the commit messages and comments for all four patches, the
reason why the target WAL segment cannot be found on the old timeline
is described as follows:

"old timeline WAL segments have already been removed or
recycled by RemoveNonParentXlogFiles() in CleanupAfterArchiveRecovery()."

Is mentioning the 'remove' case only a bit narrow?

The timeline-selection comment says this explicitly:
"there's no guarantee the old segment will still exist. It may have been
deleted or renamed with a .partial suffix"

How about phrasing it like:
old timeline WAL files may have been removed, recycled, or renamed to .partial.

After running the reproducer provided by Hayato-san, the standby’s
pg_wal directory looked like this following the failure:
000000010000000000000003.partial
00000002.history
000000020000000000000003
000000020000000000000004

So in this repro, the requested file:

000000010000000000000003

was not unlinked as a regular "removed" file. It had been renamed to:

000000010000000000000003.partial

but the log says this explicitly:
ERROR: requested WAL segment 000000010000000000000003 has already been removed

It appears inconsistent to me...

I'm not sure. The error message says "has already been removed" and the commit
messages and comments says"removed or recycled": those are consistent with the
error message. We're describing the symptom from the walsender's perspective,
not the exact file operation that caused it.

2) Injection points in tests 0002 and 0004

It does not prove this:
walsender has reached logical_read_xlog_page() while startup is paused

3) Stricter synchronization point in both tests
Both tests use this condition "active_pid IS NOT NULL" for
synchronization at the walsender side. However, it only proves that
pg_recvlogical has connected walsender has acquired the logical slot,
not necessarily the walsender is paused after acquiring the slot and
before the promotion window is set. There are several potential states
for walsender in this condition:

walsender is just after ReplicationSlotAcquire()
it has called XLogBeginRead()
it is already inside logical_read_xlog_page()
it already opened the WAL segment
it already failed or succeeded

The test cannot distinguish those states reliably.

So we may still need another injection point for synchronization at
the walsender side

I agree that with v1 the test could have been fragile. It's fixed in v2 without
having to add a second injection point. All we have to do is to ensure that
the decoding occurred while the startup is paused on the new injection point.

0002 does that by starting the new walsender and doing the decoding while the
startup is paused
0004 does that by ensuring the pre-connected session triggers the decoding while
the startup is paused

That should work, and it’s cleverer. I was fixated on the idea that we
needed to start the walsender, pause it, suspend the startup process
to enter the promotion window, and then resume the walsender. The
essential thing is just to ensure that the startup remains paused
until decoding output is observed.

--
Regards,
Xuneng Zhou
HighGo Software Co., Ltd.

#18

Bertrand Drouvot

bertranddrouvot.pg@gmail.com

about 1 month ago

In reply to: Xuneng Zhou (#17)

Re: t/035_standby_logical_decoding.pl might fail on attempt to read wrong timeline

Hi,

On Wed, Jun 10, 2026 at 04:36:14PM +0800, Xuneng Zhou wrote:

On Tue, Jun 9, 2026 at 7:01 PM Bertrand Drouvot

0002 does that by starting the new walsender and doing the decoding while the
startup is paused
0004 does that by ensuring the pre-connected session triggers the decoding while
the startup is paused

The
essential thing is just to ensure that the startup remains paused
until decoding output is observed.

Right, thanks for confirming. That's exactly what v2 is doing.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#19

Michael Paquier

michael@paquier.xyz

about 1 month ago

In reply to: Bertrand Drouvot (#18)

Re: t/035_standby_logical_decoding.pl might fail on attempt to read wrong timeline

On Wed, Jun 10, 2026 at 05:28:00PM +0000, Bertrand Drouvot wrote:

On Wed, Jun 10, 2026 at 04:36:14PM +0800, Xuneng Zhou wrote:

The
essential thing is just to ensure that the startup remains paused
until decoding output is observed.

Right, thanks for confirming. That's exactly what v2 is doing.

I have looked at this thread, and my first impression was that this
could be a data integrity issue while decoding changes due to the
transient errors one could see across the promotion requests.

But it's less severe than I thought initially: we have an availability
problem here, down to v16, with a correct recovery possible once the
promotion request has completed. That could be indeed surprising for
users that have HA setups with standbys doing logical decoding.. The
SQL function path is less worrying to me, there are as far as I know
few users of it compared to the "native" path with sync workers.

read_local_xlog_page_guts() does not only impact SQL-callable logirep
functions, even it is the spot that should be hit most of the time
(again, the RecoveryInProgress() vs promotion window is super narrow).
At quick glance, things are:
- walinspect.
- Slot advance.
- Slot creation (?), but it feels even narrower.

With two items dealt with on this thread for these two callback paths
changed, moving on the part related to physical replication into its
own thread would be better. This requires an entirely different
analysis and a different lookup.

The backpatch of PG16 is straight-forward and adding
GetWALInsertionTimeLineIfSet() down there does not look like an issue.
Not having any tests in v16 feels sad, but that's life. It does not
prevent addressing the availability issue on this branch.

I'll go take it up from here.
--
Michael

#20

Michael Paquier

michael@paquier.xyz

about 1 month ago

In reply to: Michael Paquier (#19)

Re: t/035_standby_logical_decoding.pl might fail on attempt to read wrong timeline

On Thu, Jun 11, 2026 at 10:15:01AM +0900, Michael Paquier wrote:

The backpatch of PG16 is straight-forward and adding
GetWALInsertionTimeLineIfSet() down there does not look like an issue.
Not having any tests in v16 feels sad, but that's life. It does not
prevent addressing the availability issue on this branch.

Backpatch of GetWALInsertionTimeLineIfSet() done in v16.

I'll go take it up from here.

I have spent some time on the first part of this patch series,
focusing on the noisiest path for logical replication, and applied it
down to v17. I have been able to reproduce the problem using
Alexander's sleep trick, and let 035 run with the same trick and the
patch applied for a few hundred runs.

Note that REL_17_STABLE has currently 4 failures in the buildfarm, due
to me messing up with the existence of injection_points. I'll
probably just remove the test on this branch, 105b2cb33617 adding the
requirement on the module only in v18~.
--
Michael

#21

Xuneng Zhou

xunengzhou@gmail.com

about 1 month ago

In reply to: Michael Paquier (#19)

#22

Michael Paquier

michael@paquier.xyz

about 1 month ago

In reply to: Xuneng Zhou (#21)

#23

Bertrand Drouvot