Newly created replication slot may be invalidated by checkpoint

Started by suyu.cmj · 3 months ago · 47 messages
#1 suyu.cmj
mengjuan.cmj@alibaba-inc.com
1 attachment(s)

Hi, all,
I'd like to discuss an issue about getting the minimal restart_lsn for WAL segment removal during checkpoint. The discussion [1] fixed the issue with the unexpected removal of old WAL segments after a checkpoint followed by an immediate restart. Commit 2090edc6f32f652a2c introduced a change whereby the minimal restart_lsn is obtained at the start of checkpoint creation. If a replication slot is created and performs a WAL reservation concurrently, the WAL segment containing the new slot's restart_lsn can be removed by the ongoing checkpoint. In the attached patch I add a Perl test to reproduce this scenario.
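To make the window concrete, here is my rough sketch of the relevant ordering in CreateCheckPoint() with that commit applied (heavily simplified and quoted from memory, not an exact excerpt):
```
/*
 * Rough sketch of CreateCheckPoint() with commit 2090edc6f32f652a2c applied,
 * heavily simplified; only the calls relevant to this race are shown.
 */
XLogRecPtr	slotsMinReqLSN;
XLogSegNo	_logSegNo;

/* The slots' minimum restart_lsn is captured once, at checkpoint start. */
slotsMinReqLSN = XLogGetReplicationSlotMinimumLSN();

/*
 * This can run for a long time.  A slot that is created and reserves WAL
 * meanwhile is not reflected in slotsMinReqLSN.
 */
CheckPointGuts(checkPoint.redo, flags);

/* Much later, old segments are removed based on that early value. */
XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
KeepLogSeg(recptr, slotsMinReqLSN, &_logSegNo);
InvalidateObsoleteReplicationSlots(RS_INVAL_WAL_REMOVED, _logSegNo,
								   InvalidOid, InvalidTransactionId);
RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr, checkPoint.ThisTimeLineID);
```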
Additionally, while studying InvalidatePossiblyObsoleteSlot(), I noticed a behavioral difference between PG15 (and earlier) and PG16 (and later). In PG15 and earlier, while attempting to acquire a slot, if the slot's restart_lsn advanced to be greater than oldestLSN, the slot would not be marked invalid. Starting in PG16, whether a slot is marked invalid is determined solely based on initial_restart_lsn: even if the slot's restart_lsn advances above oldestLSN while waiting, the slot will still be marked invalid. The initial_restart_lsn is recorded to report the correct invalidation cause (see discussion [2]), but why not decide whether to mark the slot as invalid based on the slot's current restart_lsn? If a slot's restart_lsn has already advanced sufficiently, shouldn't we refrain from invalidating it?
[1]: /messages/by-id/1d12d2-67235980-35-19a406a0@63439497
[2]: /messages/by-id/ZaTjW2Xh+TQUCOH0@ip-10-97-1-34.eu-west-3.compute.internal
Looking forward to your feedback.
Best Regards,
suyu.cmj

Attachments:

0001-Newly-created-replication-slot-may-be-invalidated.patch (application/octet-stream)
#2 Vitaly Davydov
v.davydov@postgrespro.ru
In reply to: suyu.cmj (#1)
3 attachment(s)
Re: Newly created replication slot may be invalidated by checkpoint

Hi suyu.cmj

The commit 2090edc6f32f652a2c introduced a change that the
minimal restart_lsn is obtained at the start of checkpoint creation. If a
replication slot is created and performs a WAL reservation concurrently, the
WAL segment contains the new slot's restart_lsn could be removed by the ongoing
checkpoint.

Thank you for reporting this issue. I agree, the issue with slot invalidation
seems to take place in REL_17_STABLE and earlier, but it is not reproducible in
18+ versions because of a different implementation. The problem may appear if
the first persistent slot is created during a checkpoint, when the slots' oldest
LSN is invalid. I'm not sure how it works when some other persistent slots
exist. Probably, invalidation is still possible if the reservation happens with
an LSN older than the oldest LSN of the existing slots.

In 17 and earlier versions, when a checkpoint is started it takes the slots'
oldest LSN using XLogGetReplicationSlotMinimumLSN(). This value is used later
for WAL segment removal. If a new slot reserves WAL between the capture of the
slots' oldest LSN and the WAL removal, it may be invalidated. It happens because
ReplicationSlotReserveWal() checks XLogCtl->lastRemovedSegNo but the segments
are not yet removed. There is a subtle case when the WAL reservation completes
at the same time the checkpointer is between KeepLogSeg and RemoveOldXlogFiles,
where XLogCtl->lastRemovedSegNo is updated. The slot will not be invalidated,
but the segments reserved by the new slot may be removed, I guess.

In 17 and earlier we tried to create a compatible solution, where the oldest
LSN was taken before slots are synced to disk. In the master branch we added a
new last_saved_restart_lsn field to the ReplicationSlot structure, which seems
to be a better solution.

I prepared a simple fix [1] for 17 and earlier versions. It seems to fix the
problem with the creation of the first persistent slot. I also think it should
behave as it did before the patch that introduced this bug.

I also did some changes in the original test script, for the 17 ([2]) and 18
([3]) versions.

I continue to investigate and test it.

[1]: 0001-Fix-invalidation-when-slot-is-created-during-checkpo.patch
[2]: v2-17-0001-Newly-created-replication-slot-may-be-invalidated-by.patch
[3]: v2-18-0001-Newly-created-replication-slot-may-be-invalidated-by.patch

With best regards,
Vitaly

Attachments:

v2-18-0001-Newly-created-replication-slot-may-be-invalidated-by.patch (text/x-patch)
0001-Fix-invalidation-when-slot-is-created-during-checkpo.patch (text/x-patch)
v2-17-0001-Newly-created-replication-slot-may-be-invalidated-by.patch (text/x-patch)
#3 Amit Kapila
amit.kapila16@gmail.com
In reply to: Vitaly Davydov (#2)
Re: Newly created replication slot may be invalidated by checkpoint

On Wed, Sep 17, 2025 at 4:19 PM Vitaly Davydov <v.davydov@postgrespro.ru> wrote:

[1] 0001-Fix-invalidation-when-slot-is-created-during-checkpo.patch

- /* Calculate how many segments are kept by slots. */
- keep = slotsMinReqLSN;
+ /*
+ * Calculate how many segments are kept by slots. Keep the wal using
+ * the minimal value from the current reserved LSN and the reserved LSN at
+ * the moment of checkpoint start (before CheckPointReplicationSlots).
+ */
+ keep = XLogGetReplicationSlotMinimumLSN();
+ if (!XLogRecPtrIsInvalid(slotsMinReqLSN))
+ keep = Min(keep, slotsMinReqLSN);

Can we add why we need this additional calculation here?

I have one question regarding commit 2090edc6f32f652a2c:
*
    if (InvalidateObsoleteReplicationSlots(RS_INVAL_WAL_REMOVED,
                                           _logSegNo, InvalidOid,
                                           InvalidTransactionId))
    {
+       /*
+        * Recalculate the current minimum LSN to be used in the WAL segment
+        * cleanup.  Then, we must synchronize the replication slots again in
+        * order to make this LSN safe to use.
+        */
+       slotsMinReqLSN = XLogGetReplicationSlotMinimumLSN();
+       CheckPointReplicationSlots(shutdown);
+
        /*
         * Some slots have been invalidated; recalculate the old-segment
         * horizon, starting again from RedoRecPtr.
         */
        XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
-       KeepLogSeg(recptr, &_logSegNo);
+       KeepLogSeg(recptr, slotsMinReqLSN, &_logSegNo);

After invalidating the slots, we recalculate slotsMinReqLSN with the latest
value of XLogGetReplicationSlotMinimumLSN(). Can't it produce a more recent
value of a slot's restart_lsn that has not been flushed, so that we may end up
removing the corresponding WAL? We should probably add some comments as to why
such a race doesn't exist.

--
With Regards,
Amit Kapila.

#4 Amit Kapila
amit.kapila16@gmail.com
In reply to: suyu.cmj (#1)
Re: Newly created replication slot may be invalidated by checkpoint

On Mon, Sep 15, 2025 at 8:11 PM suyu.cmj <mengjuan.cmj@alibaba-inc.com> wrote:

Additionally, while studying the InvalidatePossiblyObsoleteSlot(), I noticed a behavioral difference between PG15 (and earlier) and PG16 (and later). In PG15 and earlier, while attempting to acquire a slot, if the slot's restart_lsn advanced to be greater than oldestLSN, the slot would not be marked invalid. Starting in PG16, whether a slot is marked invalid is determined solely based on initial_restart_lsn, even if the slot's restart_lsn advances above oldestLSN while waiting, the slot will still be marked invalid. The initial_restart_lsn is recorded to report the correct invalidation cause (see discussion [2]), but why not decide whether to mark the slot as invalid based on the slot's current restart_lsn? If a slot's restart_lsn has already advanced sufficiently, shouldn't we refrain from invalidating it?

I haven't tried to reproduce it but I see your point. I agree that if this is
happening then we should not invalidate such slots. This is a different
problem, related to a different commit than what was fixed in
2090edc6f32f652a2c. I suggest you either start a new thread for this or report
it in the original thread where this change was made.

--
With Regards,
Amit Kapila.

#5 蔡梦娟(玊于)
mengjuan.cmj@alibaba-inc.com
In reply to: Amit Kapila (#4)
Re: Newly created replication slot may be invalidated by checkpoint

Hi Amit,
Thank you for your reply. Following your suggestion, I have started a new discussion thread for this issue:
/messages/by-id/f492465f-657e-49af-8317-987460cb68b0.mengjuan.cmj@alibaba-inc.com
Best regards,
suyu.cmj

#6 Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#3)
Re: Newly created replication slot may be invalidated by checkpoint

On Tue, Sep 23, 2025 at 12:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Sep 17, 2025 at 4:19 PM Vitaly Davydov <v.davydov@postgrespro.ru> wrote:

[1] 0001-Fix-invalidation-when-slot-is-created-during-checkpo.patch

- /* Calculate how many segments are kept by slots. */
- keep = slotsMinReqLSN;
+ /*
+ * Calculate how many segments are kept by slots. Keep the wal using
+ * the minimal value from the current reserved LSN and the reserved LSN at
+ * the moment of checkpoint start (before CheckPointReplicationSlots).
+ */
+ keep = XLogGetReplicationSlotMinimumLSN();
+ if (!XLogRecPtrIsInvalid(slotsMinReqLSN))
+ keep = Min(keep, slotsMinReqLSN);

Can we add why we need this additional calculation here?

I was thinking some more about this solution. Won't it lead to the
same problem if ReplicationSlotReserveWal() calls
ReplicationSlotsComputeRequiredLSN() after the above calculation of
checkpointer?

--
With Regards,
Amit Kapila.

#7 Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Amit Kapila (#6)
1 attachment(s)
RE: Newly created replication slot may be invalidated by checkpoint

Dear Amit, Vitaly,

I was thinking some more about this solution. Won't it lead to the
same problem if ReplicationSlotReserveWal() calls
ReplicationSlotsComputeRequiredLSN() after the above calculation of
checkpointer?

Exactly. I verified that with your patch the invalidation can still happen if we
cannot finish the LSN computation before KeepLogSeg().

The attached file can be applied atop the 0001-Fix-invalidation-... and
v2-17-0001-Newly-created-replication... patches. It demonstrates invalidation of
the given slot.

Best regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments:

0001-Stop-using-injection_points-for-checkpointer.patch (application/octet-stream)
#8 Vitaly Davydov
v.davydov@postgrespro.ru
In reply to: Hayato Kuroda (Fujitsu) (#7)
1 attachment(s)
RE: Newly created replication slot may be invalidated by checkpoint

Dear Amit, Hayato

On Wednesday, September 24, 2025 14:31 MSK, "Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote:

I was thinking some more about this solution. Won't it lead to the
same problem if ReplicationSlotReserveWal() calls
ReplicationSlotsComputeRequiredLSN() after the above calculation of
checkpointer?

Exactly. I verified that in your patch, the invalidation can still happen if
we cannot finish the LSN computation before the KeepLogSegments().

Yes. The moment when the WAL reservation takes place is the call of
ReplicationSlotsComputeRequiredLSN(), which updates the oldest slots' LSN
(XLogCtl->replicationSlotMinLSN). If it occurs between KeepLogSeg and
RemoveOldXlogFiles, such a reservation will not be taken into account. This
behaviour existed before commit 2090edc6f32f652a2c, but the probability of such
a race condition was very low due to the short time period between KeepLogSeg
and RemoveOldXlogFiles. Commit 2090edc6f32f652a2c increased the probability of
such a race condition because CheckPointGuts can take a long time to execute.

The attached patch doesn't solve the original problem completely, but it
decreases the probability of such a race condition to what it was before the
commit. I propose to apply this patch and then think about how to resolve this
race condition, which seems to exist in 18 and master as well.

I updated the patch by improving some comments as suggested by Amit.

With best regards,
Vitaly

Attachments:

v2-0001-Fix-invalidation-when-slot-is-created-during-checkpo.patch (text/x-patch)
#9 Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Vitaly Davydov (#8)
RE: Newly created replication slot may be invalidated by checkpoint

Dear Vitaly,

I propose to apply this patch and then to think how to resolve this race
condition, which seems to take place in 18 and master as well.

No, I think this invalidation can't happen in PG18/HEAD.
This is because in CheckPointGuts()->CheckPointReplicationSlots()->
ReplicationSlotsComputeRequiredLSN(), slots are examined one by one to determine
whether their restart_lsn has advanced since the last check. If any slot has
advanced, protection is applied starting from the oldest restart_lsn.
Crucially, this check is performed before WAL removal. The function call was
introduced in commit ca307d5cec.
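In code, that ordering looks roughly like this (a simplified sketch from memory, not an exact excerpt; the exact calls and signatures differ slightly between branches):
```
/* Simplified ordering on HEAD after ca307d5cec (sketch, not an excerpt). */
CheckPointGuts(checkPoint.redo, flags);
/*
 * ... which internally runs CheckPointReplicationSlots(): slots are saved to
 * disk and ReplicationSlotsComputeRequiredLSN() is called, so every
 * reservation made so far is published before any removal decision.
 */

/* The horizon is computed only afterwards, from the refreshed minimum. */
XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
KeepLogSeg(recptr, &_logSegNo);	/* reads XLogGetReplicationSlotMinimumLSN() */
RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr, checkPoint.ThisTimeLineID);
```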

Further analysis shows that it is also safe if a slot is being created and the
WAL advances after CheckPointGuts() but before the segments to remove are
determined. In this case the restart_lsn points to the CHECKPOINT_REDO record
generated by the current CHECKPOINT. That record and later ones won't be
discarded.

Best regards,
Hayato Kuroda
FUJITSU LIMITED

#10 Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Hayato Kuroda (Fujitsu) (#9)
RE: Newly created replication slot may be invalidated by checkpoint

Dear Vitaly,

Would you have enough time to work on and fix the issue?
One idea is to compute the required LSN by the system at the slot checkpoint. This
partially follows what PG18/HEAD does but seems hacky and difficult to accept.

Best regards,
Hayato Kuroda
FUJITSU LIMITED

#11 Vitaly Davydov
v.davydov@postgrespro.ru
In reply to: Hayato Kuroda (Fujitsu) (#10)
RE: Newly created replication slot may be invalidated by checkpoint

Dear Hayato,

Would you have enough time to work on and fix the issue?
One idea is to compute the required LSN by the system at the slot checkpoint. This
partially follows what PG18/HEAD does but seems hacky and difficult to accept.

I'm working on the issue. Give me, please, a couple of days to finalize my work.

In short, my idea is to call ReplicationSlotsComputeRequiredLSN() right before
the slotsMinReqLSN assignment in CreateCheckPoint in 17 and earlier versions. At
this moment, we already have the new redo LSN. I consider that the WAL
reservation happens when we assign restart_lsn to a slot. With that in mind, I
distinguish two cases: the WAL reservation happens either before or after the
new redo pointer assignment. If we reserve the WAL after the new redo pointer,
the redo pointer will protect the slot's reservation, as you've mentioned. The
problem appears when we reserve the WAL before the new redo pointer, but the
update of XLogCtl->replicationSlotMinLSN has not yet happened. When we assign
slotsMinReqLSN, we use XLogCtl->replicationSlotMinLSN. Calling
ReplicationSlotsComputeRequiredLSN before the slotsMinReqLSN assignment can
help: it guarantees that slots whose WAL reservation happened before the new
redo pointer are protected by slotsMinReqLSN, while slots whose WAL reservation
happened after the new redo pointer are protected by the redo pointer itself. I
think this is about the same as what you proposed.
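In code, the idea is roughly the following (just a sketch to illustrate the ordering, not the actual patch):
```
/*
 * Sketch of the proposed ordering in CreateCheckPoint(), 17 and earlier.
 * At this point checkPoint.redo (the new redo LSN) has already been assigned.
 */

/*
 * Refresh the slots' minimum LSN so that reservations made before the new
 * redo pointer become visible ...
 */
ReplicationSlotsComputeRequiredLSN();

/*
 * ... and only then capture it for the WAL-removal horizon.  Reservations
 * made after this point are protected by the new redo pointer itself.
 */
slotsMinReqLSN = XLogGetReplicationSlotMinimumLSN();
```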

These reasonings apply to physical slots, but it seems to be OK for logical
slots as well. One thing I'm not sure about is when we create a logical slot
during recovery. In this case, GetXLogReplayRecPtr() is used, and I'm not sure
that the redo pointer will protect such a slot in CreateRestartPoint.

With best regards,
Vitaly

#12 Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Vitaly Davydov (#11)
1 attachment(s)
RE: Newly created replication slot may be invalidated by checkpoint

Dear Vitaly,

Would you have enough time to work on and fix the issue?
One idea is to compute the required LSN by the system at the slot checkpoint. This
partially follows what PG18/HEAD does but seems hacky and difficult to accept.

I'm working on the issue. Give me, please, a couple of days to finalize my work.

Oh, sorry. I was rude.

In short, I think to call ReplicationSlotsComputeRequiredLSN() right before
slotsMinReqLSN assignment in CreateCheckPoint in 17 and earlier versions. At
this moment, we already have a new redo lsn. I consider, that the WAL
reservation happens when we assign restart_lsn to a slot. Taking into account
this consideration, I distinguish two cases - WAL reservation happens before or
after new redo ptr assignment. If we reserve the WAL after new redo ptr, it will
protect the slot's reservation, as you've mentioned. The problem appears, when
we reserve the WAL before a new redo ptr, but the update of
XLogCtl->replicationSlotMinLSN was not yet hapenned. When we assign
slotsMinReqLSN, we use XLogCtl->replicationSlotMinLSN. The call of
ReplicationSlotsComputeRequiredLSN before slotsMinReqLSN assignment can help.
It will be guaranteed, that those slots with WAL reservation before a new redo
ptr will be protected by slotsMinReqLSN, but slots with wal reservation after
a new redo ptr will be protected by the redo ptr. I think it is about the same
as you proposed.

Per my understanding, this happens because there is a lag between when the
slot's restart_lsn is set and when it is protected by the system. Your idea is
to ensure the restart_lsn is protected by the system before obtaining the
on-memory LSN, right?

These reasonings are applied to physical slots, but it seems to be ok for
logical slots. One moment, I'm not sure, when we create a logical slot in
recovery. In this case, GetXLogReplayRecPtr() is used. I'm not sure, that
redo ptr will protect such slot in CreateRestartPoint.

I considered a reproducer for a logical slot on a standby instance. Similar to
the physical case, an injection point is used while reserving the WAL, and the
reserved WAL would be discarded by the restartpoint command.
One difference from the physical case is that the invalidated slot is not
retained, because it is still ephemeral at that time.

After adding the fix [1], I confirmed my test cases pass, but we should
understand more about the standby behaviour.

[1]:
```
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7675,6 +7675,7 @@ CreateRestartPoint(int flags)
        MemSet(&CheckpointStats, 0, sizeof(CheckpointStats));
        CheckpointStats.ckpt_start_t = GetCurrentTimestamp();

+ XLogGetReplicationSlotMinimumLSN();
```

Best regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments:

0001-Reproduce-the-slot-invalidation-on-standby.patch (application/octet-stream)
#13 Vitaly Davydov
v.davydov@postgrespro.ru
In reply to: Hayato Kuroda (Fujitsu) (#12)
2 attachment(s)
RE: Newly created replication slot may be invalidated by checkpoint

Dear Hayato, All

On Friday, October 03, 2025 14:14 MSK, "Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote:

I'm working on the issue. Give me, please, a couple of days to finalize my work.

Oh, sorry. I was rude.

It is okay. I really appreciate your help.

Per my understanding, this happened because there is a lag that restart_lsn of
the slot is set, and it is protected by the system. Your idea is to ensure the
restart_lsn is protected by the system before obtaining on-memory LSN, right?

I'm not sure what you mean by on-memory LSN, but the issue happens because we
have a lag between the restart_lsn assignment and the update of
XLogCtl->replicationSlotMinLSN, which is used to protect the WAL. Yes, I propose
to ensure that the protection happens when we assign restart_lsn. It seems wrong
that we invalidate slots by their restart_lsn but protect the WAL for slots
using XLogCtl->replicationSlotMinLSN.

Below I try to summarize the problem and propose a patch which fixes it.

The issue was originally reported at [1] and it seems to appear in 17 and
earlier versions. The issue is not reproducible in 18+ versions.

The issue may appear when we create a persistent slot during a checkpoint. The
WAL reservation for a slot happens in ReplicationSlotReserveWal and is executed
in three steps (sketched in the excerpt below):
1. Assignment of slot->data.restart_lsn
2. Update of XLogCtl->replicationSlotMinLSN
3. Check whether the WAL segment at restart_lsn has been removed; go to step 1 if it has.
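For reference, the reservation loop looks roughly like this (simplified to the physical-slot case and quoted from memory, not an exact excerpt):
```
/* ReplicationSlotReserveWal(), simplified to the physical-slot case. */
ReplicationSlot *slot = MyReplicationSlot;

while (true)
{
	XLogSegNo	segno;
	XLogRecPtr	restart_lsn = GetRedoRecPtr();

	/* step 1: assign the slot's restart_lsn */
	SpinLockAcquire(&slot->mutex);
	slot->data.restart_lsn = restart_lsn;
	SpinLockRelease(&slot->mutex);

	/* step 2: publish it via XLogCtl->replicationSlotMinLSN */
	ReplicationSlotsComputeRequiredLSN();

	/*
	 * step 3: retry only if the reserved segment was already removed;
	 * segments the checkpointer is still about to remove are not seen here.
	 */
	XLByteToSeg(slot->data.restart_lsn, segno, wal_segment_size);
	if (XLogGetLastRemovedSegno() < segno)
		break;
}
```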

When the checkpointer calculates the oldest LSN, which is used as the horizon
when removing old WAL segments, it takes XLogCtl->replicationSlotMinLSN.
A race condition may happen when the slot's restart_lsn is already assigned
but XLogCtl->replicationSlotMinLSN is not yet updated. Consider the following
scenario with two processes executing in parallel (the checkpointer and a
backend where a new slot is being created):

1. Assignment of slot->data.restart_lsn in the backend from GetRedoRecPtr()
2. Assignment of a new redo LSN in the checkpointer
3. Assignment of slotsMinReqLSN from XLogCtl->replicationSlotMinLSN in the checkpointer
4. Update of XLogCtl->replicationSlotMinLSN in the backend.
5. Calculation of the WAL horizon for old-segment cleanup (KeepLogSeg before
the call of InvalidateObsoleteReplicationSlots) in the checkpointer.
6. Exit from ReplicationSlotReserveWal in the backend, since the reserved WAL
segments are not removed at this moment (XLogGetLastRemovedSegno() < segno).
7. The call of InvalidateObsoleteReplicationSlots in the checkpointer invalidates
the slot being created, because its restart_lsn is less than the calculated
WAL horizon (the min of slotsMinReqLSN and RedoRecPtr).

To fix the issue I propose to consider the following assumptions:
1. Slots do not cross WAL segment borders backward when moving.
2. Old WAL segments are removed in the checkpointer only.
3. The following LSNs are initially assigned during slot reservation:
- GetRedoRecPtr() for physical slots
- GetXLogInsertRecPtr() for logical slots
- GetXLogReplayRecPtr() for logical slots in recovery

Taking these assumptions into account, I would like to propose the fix [2].

The idea is to consider that the WAL reservation happens when we assign
restart_lsn to the slot. The call of ReplicationSlotsComputeRequiredLSN() does
not have to be executed immediately in the backend where the slot is being
created concurrently. In the checkpointer we have to guarantee that the WAL
horizon calculations are based on the actual restart_lsn values of existing
slots. If we call ReplicationSlotsComputeRequiredLSN() in the checkpointer after
the new REDO assignment and before the calculation of the WAL horizon, the value
of XLogCtl->replicationSlotMinLSN will correctly define the oldest LSN for
existing slots. If the WAL reservation by a new slot happens during the
checkpoint before the new REDO assignment, it is guaranteed that its restart_lsn
will be taken into account when we call ReplicationSlotsComputeRequiredLSN() in
the checkpointer. If the WAL reservation happens after the new redo LSN
assignment, the slot's restart_lsn will be protected by this new redo LSN,
because this LSN will be less than or equal to the initial restart_lsn (see
assumption 3).

There is one subtle thing. Since the restart_lsn assignment is not atomic, the
following scenario may theoretically happen:
1. Read GetRedoRecPtr() in the backend (ReplicationSlotReserveWal)
2. Assign a new redo LSN in the checkpointer
3. Call ReplicationSlotsComputeRequiredLSN() in the checkpointer
4. Assign the old redo LSN to restart_lsn

In this scenario, the restart_lsn will point to a previous redo LSN and it will
not be protected by the new redo LSN. This scenario is unlikely, but it can
happen theoretically. I have no ideas how to deal with it, except assigning
restart_lsn under the XLogCtl->info_lck lock to prevent concurrent modification
of XLogCtl->RedoRecPtr until it is assigned to the restart_lsn of the slot being
created.

In the case of recovery, when GetXLogReplayRecPtr() is used, the protection by
the redo LSN seems to work as well, because the new redo LSN is taken from the
latest replayed checkpoint. Thus, it is guaranteed that GetXLogReplayRecPtr()
will not be less than the new redo LSN if it is called right after the
assignment of the redo LSN in CreateRestartPoint().

I also think that the loop in ReplicationSlotReserveWal which checks the current
restart_lsn's segment against XLogGetLastRemovedSegno() is not necessary,
because it is guaranteed that the assigned restart_lsn will be protected. Let's
keep it unchanged until this suggestion is clarified.

The proposed solution doesn't break the fix in ca307d5cec (unexpected removal of
old WAL segments after a checkpoint). Since we call
ReplicationSlotsComputeRequiredLSN() before CheckPointReplicationSlots(), the
restart_lsn values of existing slots saved to disk will not be less than the
previously computed XLogCtl->replicationSlotMinLSN; they may only be advanced to
greater values concurrently. For new slots with a restart_lsn assignment after
ReplicationSlotsComputeRequiredLSN(), the current redo LSN will protect the WAL.

The fix for REL_17_STABLE is in [2]. The regression test is in [3].

I apologize for such a long summary.

[1]: /messages/by-id/15922-68ca9280-4f-37de2c40@245457797
[2]: v3-0002-Fix-invalidation-when-slot-is-created-during-checkpo.patch
[3]: v3-0001-Newly-created-replication-slot-may-be-invalidated-by.patch

With best regards,
Vitaly

Attachments:

v3-0001-Newly-created-replication-slot-may-be-invalidated-by.patch (text/x-patch)
v3-0002-Fix-invalidation-when-slot-is-created-during-checkpo.patch (text/x-patch)
#14 Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Vitaly Davydov (#13)
RE: Newly created replication slot may be invalidated by checkpoint

Dear Vitaly,

Per my understanding, this happened because there is a lag that restart_lsn of
the slot is set, and it is protected by the system. Your idea is to ensure the
restart_lsn is protected by the system before obtaining on-memory LSN, right?

Not sure what you mean by on-memory LSN, but, the issue happens because we have
a lag between restart_lsn assignment and update of
XLogCtl->replicationSlotsMinLSN which is used to protect the WAL.

Sorry I should say "before obtaining replicationSlotMinLSN".

Yes, I propose to ensure that the protection happens when we assign restart_lsn.
It seems to be wrong that we invalidate slots by its restart_lsn but protect the
wal for slots using XLogCtl->replicationSlotsMinLSN.

Seems valid. There is another corner case where another restart_lsn can be set
in between, but it would have a larger LSN than RedoRecPtr, right?

Below I tried to write some summary and propose the patch which fixes the
problem.

Sorry, but it is too long for me to understand properly :-(.

There is one subtle thing. Once, the operation of restart_lsn assignment is not
an atomic, the following scenario may happen theoretically:
1. Read GetRedoRecPtr() in the backend (ReplicationSlotReserveWal)
2. Assign a new redo LSN in the checkpointer
3. Call ReplicationSlotsComputeRequiredLSN() in the checkpointer
3. Assign the old redo LSN to restart_lsn

In this scenario, the restart_lsn will point to a previous redo LSN and it will
be not protected by the new redo LSN. This scenario is unlikely, but it can
happen theoretically. I have no ideas how to deal with it, except of assigning
restart_lsn under XLogCtl->info_lck lock to avoid concurrent modification of
XLogCtl->RecoRecPtr until it is assigned to restart_lsn of a creating slot.

Oh, your point is there is another race condition, right? Do you have the reproducer for it?

Best regards,
Hayato Kuroda
FUJITSU LIMITED

#15 Vitaly Davydov
v.davydov@postgrespro.ru
In reply to: Hayato Kuroda (Fujitsu) (#14)
1 attachment(s)
RE: Newly created replication slot may be invalidated by checkpoint

Dear Hayato,

On Tuesday, October 07, 2025 14:53 MSK, "Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote:

Yes, I
propose
to ensure that the protection happens when we assign restart_lsn. It seems to be
wrong that we invalidate slots by its restart_lsn but protect the wal for
slots using XLogCtl->replicationSlotsMinLSN.

Seems valid. There is another corner case that another restart_lsn can be set in-between,
but they have larger LSN than RedoRecPtr, right?

Here we are talking about newly created slots. There should be no other process
that can change restart_lsn during slot creation. Once the slot is successfully
created, it can only be advanced to greater values. During an advance, the old
restart_lsn will protect the slot, because it will be taken into account by the
checkpoint.

There is one subtle thing. Once, the operation of restart_lsn assignment is not
an atomic, the following scenario may happen theoretically:
1. Read GetRedoRecPtr() in the backend (ReplicationSlotReserveWal)
2. Assign a new redo LSN in the checkpointer
3. Call ReplicationSlotsComputeRequiredLSN() in the checkpointer
3. Assign the old redo LSN to restart_lsn

In this scenario, the restart_lsn will point to a previous redo LSN and it will
be not protected by the new redo LSN. This scenario is unlikely, but it can
happen theoretically. I have no ideas how to deal with it, except of assigning
restart_lsn under XLogCtl->info_lck lock to avoid concurrent modification of
XLogCtl->RecoRecPtr until it is assigned to restart_lsn of a creating slot.

Oh, your point is there is another race condition, right? Do you have the reproducer for it?

The attached test for the master branch demonstrates a possible but very rare
race condition, because we read the redo record pointer and assign it to the
slot's restart_lsn non-atomically. The scenario proposed above seems to be
incomplete.

With best regards,
Vitaly

Attachments:

0001-Test-a-specific-case-of-physical-slot-invalidation.patch (text/x-patch)
#16 Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Vitaly Davydov (#15)
1 attachment(s)
RE: Newly created replication slot may be invalidated by checkpoint

Dear Vitaly,

Thanks for sharing the reproducer. Agreed, this is a real but minor issue.

Firstly I considered an ad-hoc way, which sets the candidate restart_lsn as
replicationSlotMinLSN before using it as slot->data.restart_lsn. PSA the idea.
It can fix your reproducer.

However, it still has a corner case: assume the checkpointer finishes computing
the WAL to remove before the restart_lsn is set as the system-wide one, and then
tries to invalidate slots after the slot's restart_lsn is set. In this case the
checkpointer detects the slot being created, whose restart_lsn is older than the
oldest one, terminates the backend, and invalidates the slot.

This can be reproduced by moving 1) checkpoint-before-old-wal-removal to
in between KeepLogSeg() and InvalidateObsoleteReplicationSlots(), and
2) physical-slot-reserve-wal-get-redo to before XLogMaybeSetReplicationSlotMinimumLSN().

For now, I cannot come up with a good fix. How about others?

BTW, can you update meson.build as well when you add .pl test code? Otherwise, it
cannot be run by meson builds.

Best regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments:

idea.diffs (application/octet-stream)
#17 Alexander Korotkov
aekorotkov@gmail.com
In reply to: Vitaly Davydov (#13)
1 attachment(s)
Re: Newly created replication slot may be invalidated by checkpoint

On Mon, Oct 6, 2025 at 6:46 PM Vitaly Davydov <v.davydov@postgrespro.ru> wrote:

There is one subtle thing. Once, the operation of restart_lsn assignment is not
an atomic, the following scenario may happen theoretically:
1. Read GetRedoRecPtr() in the backend (ReplicationSlotReserveWal)
2. Assign a new redo LSN in the checkpointer
3. Call ReplicationSlotsComputeRequiredLSN() in the checkpointer
3. Assign the old redo LSN to restart_lsn

In this scenario, the restart_lsn will point to a previous redo LSN and it will
be not protected by the new redo LSN. This scenario is unlikely, but it can
happen theoretically. I have no ideas how to deal with it, except of assigning
restart_lsn under XLogCtl->info_lck lock to avoid concurrent modification of
XLogCtl->RecoRecPtr until it is assigned to restart_lsn of a creating slot.

In case of recovery, when GetXLogReplayRecPtr() is used, the protection by
redo LSN seems to work as well, because a new redo LSN is taken from the latest
replayed checkpoint. Thus, it is guaranteed that GetXLogReplayRecPtr() will not
be less than the new redo LSN, if it is called right after assignment of redo
LSN in CreateRestartPoint().

Thank you for highlighting this scenario. I've reviewed it. I think
we could avoid it by covering appropriate parts of
ReplicationSlotReserveWal() and Create{Check|Restart}Point() by a new
LWLock. The draft patch is attached. What do you think?

------
Regards,
Alexander Korotkov
Supabase

Attachments:

ReplicationSlotReserveWALLock.patch (application/octet-stream)
#18 Amit Kapila
amit.kapila16@gmail.com
In reply to: Alexander Korotkov (#17)
Re: Newly created replication slot may be invalidated by checkpoint

On Wed, Nov 5, 2025 at 3:48 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Mon, Oct 6, 2025 at 6:46 PM Vitaly Davydov <v.davydov@postgrespro.ru> wrote:

There is one subtle thing. Once, the operation of restart_lsn assignment is not
an atomic, the following scenario may happen theoretically:
1. Read GetRedoRecPtr() in the backend (ReplicationSlotReserveWal)
2. Assign a new redo LSN in the checkpointer
3. Call ReplicationSlotsComputeRequiredLSN() in the checkpointer
3. Assign the old redo LSN to restart_lsn

In this scenario, the restart_lsn will point to a previous redo LSN and it will
be not protected by the new redo LSN. This scenario is unlikely, but it can
happen theoretically. I have no ideas how to deal with it, except of assigning
restart_lsn under XLogCtl->info_lck lock to avoid concurrent modification of
XLogCtl->RecoRecPtr until it is assigned to restart_lsn of a creating slot.

In case of recovery, when GetXLogReplayRecPtr() is used, the protection by
redo LSN seems to work as well, because a new redo LSN is taken from the latest
replayed checkpoint. Thus, it is guaranteed that GetXLogReplayRecPtr() will not
be less than the new redo LSN, if it is called right after assignment of redo
LSN in CreateRestartPoint().

Thank you for highlighting this scenario. I've reviewed it. I think
we could avoid it by covering appropriate parts of
ReplicationSlotReserveWal() and Create{Check|Restart}Point() by a new
LWLock. The draft patch is attached. What do you think?

The fix seems to be only provided for the back branches, but IIUC the
problem can happen in HEAD as well. In HEAD, how about acquiring
ReplicationSlotAllocationLock in exclusive mode during
ReplicationSlotReserveWal? This lock is acquired in SHARE mode in
CheckPointReplicationSlots. So, this should make our calculations
correct and avoid invalidating the newly created slot.

I feel that with the proposed patches for back branches, the code is
deviating too much and also becomes a bit complicated, which means it
could be difficult to maintain in the future. Can we consider
reverting the original fix 2090edc6f32f652a2c995ca5f7e65748ae1e4c5d
and making it the same as we did in HEAD
(ca307d5cec90a4fde62a50fafc8ab607ff1d8664)? I know this would lead to
ABI breakage, but only for extensions using sizeof(ReplicationSlot),
if any. We can try to identify how many extensions rely on
sizeof(ReplicationSlot) and then decide accordingly. We recently did
something similar for another back-branch fix [1], which required adding
members at the end of a structure.

OTOH, if we really want to go in the direction of deviating the back-branch
code further, then we can review your fix, but I am hesitant to go in that
direction due to the additional complexity and maintenance burden.

[1]: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=45c357e0e85d2dffe7af5440806150124a725a01

--
With Regards,
Amit Kapila.

#19 Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#18)
Re: Newly created replication slot may be invalidated by checkpoint

On Mon, Nov 10, 2025 at 2:41 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Nov 5, 2025 at 3:48 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Mon, Oct 6, 2025 at 6:46 PM Vitaly Davydov <v.davydov@postgrespro.ru> wrote:

There is one subtle thing. Once, the operation of restart_lsn assignment is not
an atomic, the following scenario may happen theoretically:
1. Read GetRedoRecPtr() in the backend (ReplicationSlotReserveWal)
2. Assign a new redo LSN in the checkpointer
3. Call ReplicationSlotsComputeRequiredLSN() in the checkpointer
3. Assign the old redo LSN to restart_lsn

In this scenario, the restart_lsn will point to a previous redo LSN and it will
be not protected by the new redo LSN. This scenario is unlikely, but it can
happen theoretically. I have no ideas how to deal with it, except of assigning
restart_lsn under XLogCtl->info_lck lock to avoid concurrent modification of
XLogCtl->RecoRecPtr until it is assigned to restart_lsn of a creating slot.

In case of recovery, when GetXLogReplayRecPtr() is used, the protection by
redo LSN seems to work as well, because a new redo LSN is taken from the latest
replayed checkpoint. Thus, it is guaranteed that GetXLogReplayRecPtr() will not
be less than the new redo LSN, if it is called right after assignment of redo
LSN in CreateRestartPoint().

Thank you for highlighting this scenario. I've reviewed it. I think
we could avoid it by covering appropriate parts of
ReplicationSlotReserveWal() and Create{Check|Restart}Point() by a new
LWLock. The draft patch is attached. What do you think?

The fix seems to be only provided for bank branches, but IIUC the
problem can happen in HEAD as well. In Head, how about acquiring
ReplicationSlotAllocationLock in Exclusive mode during
ReplicationSlotReserveWal? This lock is acquired in SHARE mode in
CheckPointReplicationSlots. So, this should make our calculations
correct and avoid invalidating the newly created slot.

We need to check whether a similar change is required in
reserve_wal_for_local_slot() as well for sync slots.

--
With Regards,
Amit Kapila.

#20 Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Amit Kapila (#18)
RE: Newly created replication slot may be invalidated by checkpoint

Dear Amit, Alexander,

The fix seems to be only provided for bank branches, but IIUC the
problem can happen in HEAD as well.

Yes, I confirmed this can happen. [1] can be applied atop HEAD and reproduces
the invalidation.

In Head, how about acquiring
ReplicationSlotAllocationLock in Exclusive mode during
ReplicationSlotReserveWal? This lock is acquired in SHARE mode in
CheckPointReplicationSlots. So, this should make our calculations
correct and avoid invalidating the newly created slot.

I feel this can fix the issue. The idea serializes CheckPointReplicationSlots()
and ReplicationSlotReserveWal(), and the upcoming
XLogGetReplicationSlotMinimumLSN() can take care of the newly created slot.

Also, since CheckPointReplicationSlots() is called after setting RedoRecPtr, it
is OK if ReplicationSlotReserveWal() is called after CheckPointReplicationSlots().
In this case the candidate restart_lsn has a larger LSN than RedoRecPtr and its
WAL won't be removed by the checkpoint.

I feel with the proposed patches for back branches, the code is
deviating too much and also makes it a bit complicated, which means it
could be difficult to maintain it in the future. Can we consider
reverting the original fix 2090edc6f32f652a2c995ca5f7e65748ae1e4c5d
and make it the same as we did in HEAD
ca307d5cec90a4fde62a50fafc8ab607ff1d8664?

Is it allowed to add new LWLocks in back branches? I'm afraid some other
extensions might be affected.

[1]: /messages/by-id/30938d-68ffa500-b-6328cc00@139466859

Best regards,
Hayato Kuroda
FUJITSU LIMITED

#21 Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Amit Kapila (#19)
2 attachment(s)
RE: Newly created replication slot may be invalidated by checkpoint

Dear Amit,

We need to check whether a similar change is required in
reserve_wal_for_local_slot() as well for sync slots.

Good point. I confirmed that a similar issue can happen while synchronizing
slots. PSA the reproducer.

The 0001-PG17... patch simulates the issue where WAL segments are discarded
between the time the slot's restart_lsn is set and the time the record it points
to becomes protected. It can be applied atop PG17.
The 0001-HEAD... patch simulates the issue where the restart_lsn obtained from
the primary is discarded on the standby. It can be applied atop HEAD.

Best regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments:

0001-PG17-Invalidate-newly-synchronized-slots.patch (application/octet-stream)
0001-HEAD-invalidate-newly-synchronized-slot.patch (application/octet-stream)
#22 Vitaly Davydov
v.davydov@postgrespro.ru
In reply to: Hayato Kuroda (Fujitsu) (#21)
1 attachment(s)
RE: Newly created replication slot may be invalidated by checkpoint

On Friday, October 31, 2025 12:48 MSK, "Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote:

Firstly I considered an ad-hoc way, which sets the candidate restart_lsn as
replicationSlotMinLSN before using as slot->data.restart_lsn. PSA the idea.
It can fix your reproducer.

Hayato-san, thank you for the idea. Yes, it can fix the reproducer, but consider
the case when another backend calls ReplicationSlotsComputeRequiredLSN during
the creation of a new slot. It may overwrite the value which was set by
XLogMaybeSetReplicationSlotMinimumLSN, so the slot may still be invalidated.

On Wednesday, November 05, 2025 13:17 MSK, Alexander Korotkov <aekorotkov@gmail.com> wrote:

Thank you for highlighting this scenario. I've reviewed it. I think
we could avoid it by covering appropriate parts of
ReplicationSlotReserveWal() and Create{Check|Restart}Point() by a new
LWLock. The draft patch is attached. What do you think?

Alexander, thank you for the idea to use locks. I think the lock may fix the
problem because the redo record pointer cannot change during the WAL reservation
by a slot. Once we serialize the redo record pointer assignment and the WAL
reservation by a slot, we fix this issue. Unfortunately, I cannot apply the
proposed test due to a deadlock when waiting for injection points.

I'm thinking about how to fix it without LWLocks. The problem here is that the
slot may be invalidated immediately after the restart_lsn assignment. Without
the invalidation, we could re-check that the assigned restart_lsn is still
acceptable. To achieve this, we may implement a two-phase reservation in
ReplicationSlotReserveWal: assign slot->data.restart_lsn, then compare it with
RedoRecPtr to make sure it is not less than the RedoRecPtr at the moment of
comparison. If it is less, repeat the loop; if it is OK, confirm the assigned
restart_lsn. The invalidation function would ignore slots with unconfirmed
restart_lsns. It would be better for the backend to suffer than the
checkpointer. Probably a special flag like slot->is_creating may help (see the
rough sketch below).
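Roughly (hypothetical code, not a patch; the is_creating flag does not exist today):
```
/* Hypothetical two-phase reservation inside ReplicationSlotReserveWal(). */
slot->is_creating = true;			/* hypothetical flag: invalidation skips us */

for (;;)
{
	XLogRecPtr	candidate = GetRedoRecPtr();

	SpinLockAcquire(&slot->mutex);
	slot->data.restart_lsn = candidate; /* phase 1: tentative assignment */
	SpinLockRelease(&slot->mutex);

	ReplicationSlotsComputeRequiredLSN();

	/* phase 2: confirm only if the redo pointer has not moved past us */
	if (candidate >= GetRedoRecPtr())
		break;
}

slot->is_creating = false;			/* reservation confirmed */
```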

On Monday, November 10, 2025 12:11 MSK, Amit Kapila <amit.kapila16@gmail.com> wrote:

The fix seems to be only provided for bank branches, but IIUC the
problem can happen in HEAD as well. In Head, how about acquiring
ReplicationSlotAllocationLock in Exclusive mode during
ReplicationSlotReserveWal? This lock is acquired in SHARE mode in
CheckPointReplicationSlots. So, this should make our calculations
correct and avoid invalidating the newly created slot.

Amit, I'm not sure how this would help to avoid the change of the redo record
pointer during the WAL reservation by a slot, because it doesn't serialize the
redo record pointer assignment and the slot's WAL reservation, while the core
issue is that the redo record pointer changes while we reserve the WAL for a
slot.

On Monday, November 10, 2025 12:11 MSK, Amit Kapila <amit.kapila16@gmail.com> wrote:

I feel with the proposed patches for back branches, the code is
deviating too much and also makes it a bit complicated, which means it
could be difficult to maintain it in the future. Can we consider
reverting the original fix 2090edc6f32f652a2c995ca5f7e65748ae1e4c5d
and make it the same as we did in HEAD
ca307d5cec90a4fde62a50fafc8ab607ff1d8664? I know this would lead to
ABI breakage, but only for extensions using sizeof(ReplicationSlot),
if any. We can try to identify how many extensions rely on
sizeof(ReplicationSlot) and then decide accordingly? We recently did
something similar for another backbranch fix [1] which requires adding
members at the end of structure.

If a decision to revert the patch is considered, I propose to consider one more
way to implement a replacement patch where last_saved_restart_lsn is allocated
in a separate array in shared memory rather than in the ReplicationSlot struct,
whose size would remain unchanged. The logic would be the same except for how
this value is accessed. I think I proposed such a patch earlier, but it was
rejected due to unwillingness to change data allocations in shared memory. If
needed, I can prepare the patch.

The attached patch is just an update of the original test. It adds the test to
the meson build file and adds one more scenario to check.

With best regards,
Vitaly

Attachments:

v2-Test-a-specific-case-of-physical-slot-invalidation.patch (text/x-patch)
#23 Zhijie Hou (Fujitsu)
houzj.fnst@fujitsu.com
In reply to: Vitaly Davydov (#22)
1 attachment(s)
RE: Newly created replication slot may be invalidated by checkpoint

On Wednesday, November 12, 2025 11:55 PM Vitaly Davydov <v.davydov@postgrespro.ru> wrote:

On Monday, November 10, 2025 12:11 MSK, Amit Kapila
<amit.kapila16@gmail.com> wrote:

The fix seems to be only provided for bank branches, but IIUC the
problem can happen in HEAD as well. In Head, how about acquiring
ReplicationSlotAllocationLock in Exclusive mode during
ReplicationSlotReserveWal? This lock is acquired in SHARE mode in
CheckPointReplicationSlots. So, this should make our calculations
correct and avoid invalidating the newly created slot.

Amit, I'm not sure how it may help to avoid the change of redo rec ptr during
wal reservation by a slot, because it doesn't serialize redo rec ptr assignment
and slot's wal reservation, but the core issue is in reco rec ptr change when we
reserve the wal by a slot.

I think the main issue here lies in the possibility that the minimum restart_lsn
obtained during a checkpoint could be less than the WAL position that is being
reserved concurrently. So, instead of serializing the redo assignment and WAL
reservation, Amit proposes serializing the CheckPointReplicationSlots() and WAL
reservation. This would ensure the following:

1) If the WAL reservation occurs first, the checkpoint must wait for the
restart_lsn to be updated before proceeding with WAL removal. This guarantees
that the most recent restart_lsn position is detected.

2) If the checkpoint calls CheckPointReplicationSlots() first, then any
subsequent WAL reservation must take a position later than the redo pointer.

I find this approach promising, and thus, I am sharing a POC for this approach on
HEAD. (I will keep testing the patch locally)
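The core of the change is roughly the following (a minimal sketch of the idea, not the exact patch; the logical and recovery cases are omitted):
```
/*
 * Sketch only: serialize WAL reservation against CheckPointReplicationSlots(),
 * which, as noted above, acquires ReplicationSlotAllocationLock in LW_SHARED
 * mode.
 */
static void
ReplicationSlotReserveWal(void)
{
	ReplicationSlot *slot = MyReplicationSlot;
	XLogRecPtr	restart_lsn;

	LWLockAcquire(ReplicationSlotAllocationLock, LW_EXCLUSIVE);

	restart_lsn = SlotIsPhysical(slot) ? GetRedoRecPtr()
		: GetXLogInsertRecPtr();

	SpinLockAcquire(&slot->mutex);
	slot->data.restart_lsn = restart_lsn;
	SpinLockRelease(&slot->mutex);

	/* publish the reservation before the checkpointer computes its horizon */
	ReplicationSlotsComputeRequiredLSN();

	LWLockRelease(ReplicationSlotAllocationLock);
}
```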

BTW, I am not adding a test using an injection point because, once we acquire an
lwlock in the WAL reserve function, it becomes impractical to insert an
injection point there. The reason is that the injection point function
internally calls CHECK_FOR_INTERRUPTS(), and the lightweight lock holds off
interrupts, preventing their execution.

On Monday, November 10, 2025 12:11 MSK, Amit Kapila
<amit.kapila16@gmail.com> wrote:

I feel with the proposed patches for back branches, the code is
deviating too much and also makes it a bit complicated, which means it
could be difficult to maintain it in the future. Can we consider
reverting the original fix 2090edc6f32f652a2c995ca5f7e65748ae1e4c5d
and make it the same as we did in HEAD
ca307d5cec90a4fde62a50fafc8ab607ff1d8664? I know this would lead to
ABI breakage, but only for extensions using sizeof(ReplicationSlot),
if any. We can try to identify how many extensions rely on
sizeof(ReplicationSlot) and then decide accordingly? We recently did
something similar for another backbranch fix [1] which requires adding
members at the end of structure.

If a decision to revert the patch is considered, I propose to consider one more
way to implement a replacement patch where last_saved_restart_lsn is
allocated in a separate array in shmem but not in ReplicationSlot struct, which
size will be unchanged. The logic will be the same except of accessing this
value.
I think, I proposed such patch but it was rejected due to unwillingness to
change data allocations in the shmem. If needed, I may prepare the patch.

Personally, allocating additional shared memory space seems somewhat hacky. I
prefer aligning the code with the current HEAD, as Amit suggested, to ensure
consistency and facilitate easier maintenance in the future. I can prepare
patches for back branches as well once an agreement is reached.

Best Regards,
Hou zj

Attachments:

v1-0001-Fix-unexpected-WAL-removal.patch (application/octet-stream)
#24 Amit Kapila
amit.kapila16@gmail.com
In reply to: Zhijie Hou (Fujitsu) (#23)
Re: Newly created replication slot may be invalidated by checkpoint

On Thu, Nov 13, 2025 at 8:03 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Wednesday, November 12, 2025 11:55 PM Vitaly Davydov <v.davydov@postgrespro.ru> wrote:

On Monday, November 10, 2025 12:11 MSK, Amit Kapila
<amit.kapila16@gmail.com> wrote:

The fix seems to be only provided for bank branches, but IIUC the
problem can happen in HEAD as well. In Head, how about acquiring
ReplicationSlotAllocationLock in Exclusive mode during
ReplicationSlotReserveWal? This lock is acquired in SHARE mode in
CheckPointReplicationSlots. So, this should make our calculations
correct and avoid invalidating the newly created slot.

Amit, I'm not sure how it may help to avoid the change of redo rec ptr during
wal reservation by a slot, because it doesn't serialize redo rec ptr assignment
and slot's wal reservation, but the core issue is in reco rec ptr change when we
reserve the wal by a slot.

I think the main issue here lies in the possibility that the minimum restart_lsn
obtained during a checkpoint could be less than the WAL position that is being
reserved concurrently. So, instead of serializing the redo assignment and WAL
reservation, Amit proposes serializing the CheckPointReplicationSlots() and WAL
reservation. This would ensure the following:

1) If the WAL reservation occurs first, the checkpoint must wait for the
restart_lsn to be updated before proceeding with WAL removal. This guarantees
that the most recent restart_lsn position is detected.

2) If the checkpoint calls CheckPointReplicationSlots() first, then any
subsequent WAL reservation must take a position later than the redo pointer.

Your understanding is correct. The same problem can happen in HEAD and in the
back branches, but in different ways, because the code varies a lot between
branches. Now, if we again put in a different fix for the newly discovered
problem, it will take the back-branch code even further away, adding more to
the maintenance burden. So, I thought we should discuss making the code the
same in HEAD and the back branches. Can we do some research to see if any of
the available extensions use sizeof(ReplicationSlot)? Even if we don't find
any such extension, there will always be a chance that some closed-source
extension relies on it, but I think it is still worth considering changing the
ABI in the back branches for this fix. This code area is quite subtle: bugs
have existed here for a long time, and the initial fix also didn't close all
the holes.

--
With Regards,
Amit Kapila.

#25 Vitaly Davydov
v.davydov@postgrespro.ru
In reply to: Amit Kapila (#24)
1 attachment(s)
Re: Newly created replication slot may be invalidated by checkpoint

Re-sending the original message due to a delivery failure...

Hi Zhijie Hou, Amit

I think the main issue here lies in the possibility that the minimum restart_lsn
obtained during a checkpoint could be less than the WAL position that is being
reserved concurrently. So, instead of serializing the redo assignment and WAL
reservation, Amit proposes serializing the CheckPointReplicationSlots() and WAL
reservation. This would ensure the following:

1) If the WAL reservation occurs first, the checkpoint must wait for the
restart_lsn to be updated before proceeding with WAL removal. This guarantees
that the most recent restart_lsn position is detected.

2) If the checkpoint calls CheckPointReplicationSlots() first, then any
subsequent WAL reservation must take a position later than the redo pointer.

Thank you for the explanation. I agree, Amit's patch solves the problem and is
the most promising solution. It is less prone to new bugs, and there is no need
to avoid locks for maximum possible performance. I tried to find some corner
cases but failed.

FYI, there is another lock-less solution for ReplicationSlotReserveWal with a
two-phase reservation, attached as a diff file. It seems to solve the redo
record pointer race condition, but it is not complete and is more prone to
bugs. I just share it here in the hope that such an approach may be of
interest.

While investigating the solution I came up with some questions, shared below.
I do not ask for an answer, but I think they may be considered when preparing
the patch.

1) Looking at ReplicationSlotReserveWal, I'm not sure why we assign restart_lsn
in a loop and exit the loop if XLogGetLastRemovedSegno() < segno is true. I
could understand it if we compared restart_lsn with
XLogCtl->replicationSlotMinLSN to handle parallel calls of
ReplicationSlotsComputeRequiredLSN, as described in the comment. The
old-segment removal happens much later, in the checkpointer. I'm not sure
whether the comment describes the case improperly or the code is wrong.

2) Why do we call XLogSetReplicationSlotMinimumLSN outside the control lock
section? It may lead to some race conditions when two or more backends create,
advance or drop slots in parallel. I'm not sure the control and allocation
locks properly serialise the updates.

With regards,
Vitaly

Attachments:

lockless.diff (text/x-patch)
#26 Zhijie Hou (Fujitsu)
houzj.fnst@fujitsu.com
In reply to: Vitaly Davydov (#25)
1 attachment(s)
RE: Newly created replication slot may be invalidated by checkpoint

On Friday, November 14, 2025 3:58 AM Vitaly Davydov <v.davydov@postgrespro.ru> wrote:

Re-sending the original message due to a fail...

Hi Zhijie Hou, Amit

I think the main issue here lies in the possibility that the minimum restart_lsn
obtained during a checkpoint could be less than the WAL position that is being
reserved concurrently. So, instead of serializing the redo assignment and WAL
reservation, Amit proposes serializing the CheckPointReplicationSlots() and WAL
reservation. This would ensure the following:

1) If the WAL reservation occurs first, the checkpoint must wait for the
restart_lsn to be updated before proceeding with WAL removal. This guarantees
that the most recent restart_lsn position is detected.

2) If the checkpoint calls CheckPointReplicationSlots() first, then any
subsequent WAL reservation must take a position later than the redo pointer.

Thank you for the explanation. I agree, the Amit's patch solves the problem
and it is the most promising solution. It is less risky to new bugs and there
is no need to avoid locks for a maximum possible performance. I tried to find
some corner cases but I failed.

...

When investigating the solution I come to some questions. Below I shared
them.
I do not ask for an answer but I think, they may be considered when preparing
the patch.

Thanks for raising two good questions.

1) Looking at ReplicationSlotReserveWal, I'm not sure why we assign restart_lsn
in a loop and exit the loop once XLogGetLastRemovedSegno() < segno is true. I could
understand it if we compared restart_lsn with XLogCtl->replicationSlotMinLSN
to handle parallel calls of ReplicationSlotsComputeRequiredLSN, as described
in the comment. The removal of old segments happens much later, in the checkpointer.
I'm not sure whether the comment describes the case improperly or the code is
wrong.

After considering this more, I think we no longer need this check and the loop in
ReplicationSlotReserveWal() after fixing all the race conditions that could cause
the WAL being reserved to be removed. (We have ensured that the checkpoint
cannot remove the WAL being reserved by a newly created slot.)

2) Why do we call XLogSetReplicationSlotMinimumLSN outside of the control lock section?
It may lead to race conditions when two or more backends create, advance or
drop slots in parallel. I am not sure the control and allocation locks properly
serialise the updates.

I researched this question and found that there is indeed another race condition here
that could cause a newly created slot to be invalidated. The scenario is that if
a backend is advancing the restart_lsn of a slot and has reached
XLogSetReplicationSlotMinimumLSN but has not updated the minimum LSN yet, while
another backend creates a new slot, then the minimum LSN could be set to a more
recent value than the WAL position reserved by the newly created slot, causing
the new slot to be invalidated by the next checkpoint.

The steps to reproduce are as follows:

1. Create a slot 'advtest' for later advancement.
select pg_create_logical_replication_slot('advtest', 'test_decoding');

2. Start a backend to create a slot (s), but block it before updating the
restart_lsn in ReplicationSlotReserveWal().
select pg_create_logical_replication_slot('s', 'test_decoding');

3. Start another backend to generate some new WAL files and advance the
slot (advtest) to the latest position, but block it from updating the LSN in
XLogSetReplicationSlotMinimumLSN().
select pg_switch_wal();
select pg_switch_wal();
SELECT pg_log_standby_snapshot();
SELECT pg_log_standby_snapshot();
select pg_replication_slot_advance('advtest', pg_current_wal_lsn());
select pg_replication_slot_advance('advtest', pg_current_wal_lsn());

4. Release the backend creating slot (s).
5. Execute a checkpoint, but block it before it calls XLogGetReplicationSlotMinimumLSN().
6. Release the advancement backend; the minimum LSN will then be set to a new position.
7. Release the checkpoint; the WALs required by the slot (s) are removed.

This is similar to the concurrent slot_xmin update issue in another thread
(Assertion failure in SnapBuildInitialSnapshot() [1]). I think it's better to use
the same solution for both issues, e.g., we can increase the lock level of
ReplicationSlotControlLock to exclusive in ReplicationSlotsComputeRequiredLSN()
to block concurrent LSN updates.
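
To illustrate the proposed lock-level change, here is a simplified sketch of what
ReplicationSlotsComputeRequiredLSN() could look like with the control lock taken
exclusively and the minimum LSN published before the lock is released. This is only
an illustration of the idea, not the actual patch, and it omits the extra
bookkeeping the real function does:

void
ReplicationSlotsComputeRequiredLSN(void)
{
    int         i;
    XLogRecPtr  min_required = InvalidXLogRecPtr;

    /*
     * Exclusive (instead of shared), to block concurrent slot creation or
     * advancement while the minimum is recomputed and published.
     */
    LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);

    for (i = 0; i < max_replication_slots; i++)
    {
        ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];
        XLogRecPtr  restart_lsn;

        if (!s->in_use)
            continue;

        SpinLockAcquire(&s->mutex);
        restart_lsn = s->data.restart_lsn;
        SpinLockRelease(&s->mutex);

        if (!XLogRecPtrIsInvalid(restart_lsn) &&
            (XLogRecPtrIsInvalid(min_required) || restart_lsn < min_required))
            min_required = restart_lsn;
    }

    /* Publish the minimum while still holding the lock. */
    XLogSetReplicationSlotMinimumLSN(min_required);

    LWLockRelease(ReplicationSlotControlLock);
}

With the lock held across both the computation and XLogSetReplicationSlotMinimumLSN(),
a concurrent WAL reservation cannot interleave between the two steps.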

Attach v2 patch that improves the fix based on above analysis.

(I am still testing some other scenarios related to slotsync to ensure the
current fix is complete)

[1]: /messages/by-id/CAA4eK1KcA=DrC3NTifig-x5DPXaxDEMLSZSz9gWS16m_d-+6rQ@mail.gmail.com

Best Regards,
Hou zj

Attachments:

v2-0001-Fix-race-conditions-causing-invalidation-of-newly.patchapplication/octet-stream; name=v2-0001-Fix-race-conditions-causing-invalidation-of-newly.patch
#27Zhijie Hou (Fujitsu)
Zhijie Hou (Fujitsu)
houzj.fnst@fujitsu.com
In reply to: Zhijie Hou (Fujitsu) (#26)
1 attachment(s)
RE: Newly created replication slot may be invalidated by checkpoint

On Monday, November 17, 2025 6:50 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote:

Attach v2 patch that improves the fix based on above analysis.

(I am still testing some other scenarios related to slotsync to ensure the
current fix is complete)

I have been testing whether taking the ReplicationSlotAllocationLock exclusively can
prevent a newly synced slot from being invalidated, but found that it is
insufficient.

During slotsync, since we fetch the remote LSN (queried from the publisher) as
the initial restart_lsn for a newly synced slot and we sync the slots in an
asynchronous manner, the synced restart_lsn could be behind the standby's redo
pointer. So, there is a race condition in which the checkpoint removes the
required WALs and invalidates the newly synced slot:

1. Assuming there is no slot on the standby, the minimum slot LSN would be invalid,
and the checkpoint reaches InvalidateObsoleteReplicationSlots() with
oldestSegno = the segno of the redo pointer.
2. The slotsync worker successfully syncs slot A, which has an old restart_lsn.
3. The checkpoint then finds that the newly synced slot A should be
invalidated due to its old restart_lsn.

Unlike in ReplicationSlotReserveWal(), holding only the
ReplicationSlotAllocationLock does not safeguard a newly synced slot if a
checkpoint occurs before the WAL reservation during slotsync, as the initial
restart_lsn provided could be prior to the redo pointer.

To address this issue, I changed the WAL reservation to compare the
remote restart_lsn with both the minimum slot LSN and the redo pointer. If either
local LSN is less than the remote restart_lsn, we update the local slot with the
remote value. Otherwise, we use the minimum of the two local LSNs.
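
For clarity, here is a rough sketch of the comparison described above. The helper
name and signature are hypothetical (mine, not from the patch), handling of invalid
LSNs and the surrounding slotsync logic are omitted, and the actual patch may differ:

/*
 * Hypothetical helper illustrating the restart_lsn choice described above.
 */
static XLogRecPtr
choose_initial_restart_lsn(XLogRecPtr remote_restart_lsn,
                           XLogRecPtr slots_min_lsn,    /* local minimum slot LSN */
                           XLogRecPtr redo_ptr)         /* local redo pointer */
{
    /*
     * If either local position is older than the remote restart_lsn, the
     * WAL at the remote position is still retained locally, so the remote
     * value can be used directly.
     */
    if (slots_min_lsn < remote_restart_lsn || redo_ptr < remote_restart_lsn)
        return remote_restart_lsn;

    /*
     * Otherwise the WAL before the remote position may no longer exist;
     * fall back to the older of the two local positions, which is still
     * guaranteed to be available.
     */
    return Min(slots_min_lsn, redo_ptr);
}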

Additionally, we acquire both ReplicationSlotAllocationLock and
ReplicationSlotControlLock to prevent the checkpoint from updating the redo
pointer and other backend processes from updating the minimum slot LSN.

Here is the V3 patch that fixes all the race conditions found so far.

------

Apart from addressing the fix for HEAD, I would like to ask whether anyone holds
a differing opinion regarding reverting the original fix 2090edc6f32f652a2c in
the back branches and applying the same fix as on HEAD.

I searched for extensions that rely on the size of ReplicationSlot and did not
find any, so I think it is OK to implement the same fix in the back branches.

If there are no alternative views, I will proceed with preparing the patches for
the back branches in an upcoming version.

Best Regards,
Hou zj

Attachments:

v3-0001-Fix-race-conditions-causing-invalidation-of-newly.patchapplication/octet-stream; name=v3-0001-Fix-race-conditions-causing-invalidation-of-newly.patch
#28Zhijie Hou (Fujitsu)
Zhijie Hou (Fujitsu)
houzj.fnst@fujitsu.com
In reply to: Zhijie Hou (Fujitsu) (#27)
2 attachment(s)
RE: Newly created replication slot may be invalidated by checkpoint

On Wednesday, November 19, 2025 4:24 PM Hou, Zhijie<houzj.fnst@fujitsu.com> wrote:

On Monday, November 17, 2025 6:50 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

Attach v2 patch that improves the fix based on above analysis.

(I am still testing some other scenarios related to slotsync to ensure the
current fix is complete)

I have been testing whether taking the ReplicationSlotAllocationLock exclusively can
prevent a newly synced slot from being invalidated, but found that it is
insufficient.

During slotsync, since we fetch the remote LSN (queried from the publisher) as
the initial restart_lsn for a newly synced slot and we sync the slots in an
asynchronous manner, the synced restart_lsn could be behind the standby's redo
pointer. So, there is a race condition in which the checkpoint removes the
required WALs and invalidates the newly synced slot:

1. Assuming there is no slot on the standby, the minimum slot LSN would be invalid,
and the checkpoint reaches InvalidateObsoleteReplicationSlots() with
oldestSegno = the segno of the redo pointer.
2. The slotsync worker successfully syncs slot A, which has an old restart_lsn.
3. The checkpoint then finds that the newly synced slot A should be
invalidated due to its old restart_lsn.

..

Here is the V3 patch that fixes all the race conditions found so far.

I am attaching a TAP test in 0002 that includes an injection point to reproduce
this scenario. The test expects that, while syncing a slot to the standby server,
if the WAL before the remote restart_lsn is at risk of being removed by a
checkpoint, the slot cannot be synced. Without the fix in 0001, the test would
fail because the slot is synced to the standby but immediately invalidated by the
checkpoint.

I am not adamant about merging this test; I am sharing it just for reference.

Besides, I fixed a bug in v3-0001 where the restart_lsn was set to a wrong value.

------

Apart from addressing the fix for HEAD, I would like to ask whether anyone holds
a differing opinion regarding reverting the original fix 2090edc6f32f652a2c in
the back branches and applying the same fix as on HEAD.

I searched for extensions that rely on the size of ReplicationSlot and did not
find any, so I think it is OK to implement the same fix in the back branches.

If there are no alternative views, I will proceed with preparing the patches for
the back branches in an upcoming version.

Best Regards,
Hou zj

Attachments:

v4-0002-Add-a-tap-test-using-injection-point.patchapplication/octet-stream; name=v4-0002-Add-a-tap-test-using-injection-point.patch
v4-0001-Fix-race-conditions-causing-invalidation-of-newly.patchapplication/octet-stream; name=v4-0001-Fix-race-conditions-causing-invalidation-of-newly.patch
#29Vitaly Davydov
Vitaly Davydov
v.davydov@postgrespro.ru
In reply to: Zhijie Hou (Fujitsu) (#28)
1 attachment(s)
RE: Newly created replication slot may be invalidated by checkpoint

Hi Zhijie Hou

I'm not sure whether my previous email was successfully sent; I do not see it in the
pgsql-hackers mailing list. Re-sending it again, sorry for that.

Thank you for preparing the patch!

The new lock schema (Allocation, Control) seems to work, without any
deadlocks. This is guaranteed by the lock orders: (1) the Allocation lock is followed
by the Control lock, or (2) the Control lock is taken without the Allocation lock. I've
attached a doc [1] where I tried to describe the lock schema in order to analyze
possible deadlocks. I haven't found any evident problems.
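
For reference, the two allowed orderings can be sketched as follows (illustrative
fragments only; the lock modes shown are assumptions, not taken from the patch):

/* Ordering (1): the allocation lock first, the control lock nested inside. */
LWLockAcquire(ReplicationSlotAllocationLock, LW_EXCLUSIVE);
LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
/* ... work ... */
LWLockRelease(ReplicationSlotControlLock);
LWLockRelease(ReplicationSlotAllocationLock);

/* Ordering (2): the control lock taken on its own. */
LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
/* ... work ... */
LWLockRelease(ReplicationSlotControlLock);

/* The control lock is never held while waiting for the allocation lock,
 * so no lock-order cycle (and hence no deadlock) can form. */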

Concerning reserve_wal_for_local_slot: it seems it is used in the synchronization
of failover logical slots. To me, it is tricky to change the restart_lsn of a
synced logical slot to RedoRecPtr, because it may lead to problems with
logical replication using such a slot after the replica is promoted. But it seems
to be an architectural problem and is not related to the problems solved by
the patch.

The change of lock mode to EXCLUSIVE in ReplicationSlotsComputeRequiredLSN may
affect performance when a lot of slots are advanced during a small period
of time. It may affect walsender performance, since the walsender advances the
logical or physical slots when it receives a confirmation from the replica. I guess
slot advancement may be a pretty frequent operation.

There is an idea to keep the SHARED lock but put XLogSetReplicationSlotMinimumLSN
under this lock and provide some guarantee that we do not advance slots into
the past, taking into account that a slot can be advanced backwards if it
doesn't cross WAL segment boundaries (it happens). In this case, if a concurrent
process advances an existing slot, its old restart_lsn will protect the WAL. In
the case of WAL reservation we may use an EXCLUSIVE lock in
XLogSetReplicationSlotMinimumLSN.

Furthermore, I believe ReplicationSlotsComputeRequiredLSN is required for the
checkpointer to calculate the oldest WAL segment, but it is not required to be
called every time a slot is advanced. It may affect the reports (the
GetWALAvailability function), but I think that is not a big problem to deal with.

Some typos in the patch commit message:

1) A typo: yet updated estart_lsn, while the

2) If a backend advances a slot's restart_lsn reaches
XLogSetReplicationSlotMinimumLSN... - maybe put 'and' before 'reaches'?

[1]: lockschema.md

With best regards,
Vitaly

Attachments:

lockschema.mdtext/markdown
#30Zhijie Hou (Fujitsu)
Zhijie Hou (Fujitsu)
houzj.fnst@fujitsu.com
In reply to: Vitaly Davydov (#29)
1 attachment(s)
RE: Newly created replication slot may be invalidated by checkpoint

On Thursday, November 20, 2025 4:26 PM Vitaly Davydov <v.davydov@postgrespro.ru> wrote:

Hi Zhijie Hou

I'm not sure, my previous email was successfully sent. I do not see it in the
pgsql-hackers mailing list. Re-sending it again, sorry for that.

Thank you for preparing the patch!

The new lock schema (Allocation, Control) seems to be working, without any
deadlocks. It is guaranteed by the lock orders - (1) Allocation lock is followed
by the Control lock, or (2) the Control lock without the Allocation lock. I've
attached the doc [1] where I tried to describe the lock schema to analyze
possible deadlocks. I haven't found any evident problems.

Thanks a lot for the test and analysis.

Concerning reserve_wal_for_local_slot. It seems it is used in synchronization
of failover logical slots. For me, it is tricky to change restart_lsn of a synced
logical slot to RedoRecPtr, because it may lead to problems with logical
replication using such slot after the replica promotion. But it seems it is the
architectural problem and it is not related to the problems, solved by the
patch.

I think this is not an issue, because if we use the redo pointer instead of the
remote restart_lsn as the initial value, the synced slot won't be marked as
sync-ready, so the user cannot use it after promotion (this is also well documented).
This is also the existing behavior before the patch: if the required WALs were
removed, the oldest available WAL was used as the initial value, similarly
resulting in the slot not being sync-ready.

The change of lock mode to EXCLUSIVE in
ReplicationSlotsComputeRequiredLSN may affect the performance when a lot
of slots are advanced during some small period of time. It may affect the
walsender performance. It advances the logical or physical slots when receive
a confirmation from the replica. I guess, the slot advancement may be pretty
frequent operation.

Yes, I had the same thought and considered a simple alternative (similar to your
suggestion below): use an exclusive lock only when updating the slot's restart_lsn
during WAL reservation, while continuing to use a shared lock in the computation
function, and additionally place XLogSetReplicationSlotMinimumLSN() under the lock.
This approach would also help serialize the process.

There is an idea to keep the SHARED lock but put
XLogSetReplicationSlotMinimumLSN under this lock and implement some
guarantees that we do not advance slots to the past, taking into account that
the slot can be advanced in past if it doesn't cross wal segment boundaries (it
happens). In this case, if a concurrent process advances an existing slot, its old
restart_lsn will protect the wal. In case of wal reservation we may use
EXCLUSIVE lock in XLogSetReplicationSlotMinimumLSN.

Furthremore, I believe, ReplicationSlotsComputeRequiredLSN is required for
checkpointer to calculate the oldest wal segment but it is not required to be
called every time when a slot is advanced. It may affect the reports -
GetWALAvailability function, but I think it is not a big problem to deal with it.

I also think it's not a problem.

Some typos in the patch commit message:

1) A typo: yet updated estart_lsn, while the

2) If a backend advances a slot's restart_lsn reaches
XLogSetReplicationSlotMinimumLSN... - may be to put 'and' before reaches?

Fixed.

Best Regards,
Hou zj

Attachments:

v5-0001-Fix-race-conditions-causing-invalidation-of-newly.patchapplication/octet-stream; name=v5-0001-Fix-race-conditions-causing-invalidation-of-newly.patch
#31Amit Kapila
Amit Kapila
amit.kapila16@gmail.com
In reply to: Zhijie Hou (Fujitsu) (#30)
Re: Newly created replication slot may be invalidated by checkpoint

On Thu, Nov 20, 2025 at 4:07 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Thursday, November 20, 2025 4:26 PM Vitaly Davydov <v.davydov@postgrespro.ru> wrote:

Concerning reserve_wal_for_local_slot. It seems it is used in synchronization
of failover logical slots. For me, it is tricky to change restart_lsn of a synced
logical slot to RedoRecPtr, because it may lead to problems with logical
replication using such slot after the replica promotion. But it seems it is the
architectural problem and it is not related to the problems, solved by the
patch.

I think this is not an issue because if we use the redo pointer instead of the
remote restart_lsn as the initial value, the synced slot won't be marked as
sync-ready, so user cannot use it after promotion (also well documented). This
is also the existing behavior before the patch, e.g., if the required WALs were
removed, the oldest available WAL was used as the initial value, similarly
resulting in the slot not being sync-ready.

Would it be better to discuss this in a separate thread? Though this
is related to original problem but still in a separate part of code
(slotsync) which I think can have a separate fix especially when the
fix is also somewhat different.

The change of lock mode to EXCLUSIVE in
ReplicationSlotsComputeRequiredLSN may affect the performance when a lot
of slots are advanced during some small period of time. It may affect the
walsender performance. It advances the logical or physical slots when receive
a confirmation from the replica. I guess, the slot advancement may be pretty
frequent operation.

Yes, I had the same thought and considered a simple alternative (similar to your
suggestion below): use an exclusive lock only when updating the slot.restart_lsn
during WAL reservation, while continuing to use a shared lock in the computation
function. Additionally, place XLogSetReplicationSlotMinimumLSN() under the lock.
This approach will also help serialize the process.

Can we discuss this as well in a separate thread?

--
With Regards,
Amit Kapila.

#32Zhijie Hou (Fujitsu)
Zhijie Hou (Fujitsu)
houzj.fnst@fujitsu.com
In reply to: Amit Kapila (#31)
3 attachment(s)
RE: Newly created replication slot may be invalidated by checkpoint

On Thursday, November 20, 2025 6:44 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Nov 20, 2025 at 4:07 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
wrote:

On Thursday, November 20, 2025 4:26 PM Vitaly Davydov

<v.davydov@postgrespro.ru> wrote:

Concerning reserve_wal_for_local_slot. It seems it is used in
synchronization of failover logical slots. For me, it is tricky to
change restart_lsn of a synced logical slot to RedoRecPtr, because
it may lead to problems with logical replication using such slot
after the replica promotion. But it seems it is the architectural
problem and it is not related to the problems, solved by the patch.

I think this is not an issue, because if we use the redo pointer
instead of the remote restart_lsn as the initial value, the synced
slot won't be marked as sync-ready, so the user cannot use it after
promotion (this is also well documented). This is also the existing behavior
before the patch: if the required WALs were removed, the oldest
available WAL was used as the initial value, similarly resulting in the slot
not being sync-ready.

Would it be better to discuss this in a separate thread? Though this is related
to original problem but still in a separate part of code
(slotsync) which I think can have a separate fix especially when the fix is also
somewhat different.

The change of lock mode to EXCLUSIVE in
ReplicationSlotsComputeRequiredLSN may affect the performance when a
lot of slots are advanced during some small period of time. It may
affect the walsender performance. It advances the logical or
physical slots when receive a confirmation from the replica. I
guess, the slot advancement may be pretty frequent operation.

Yes, I had the same thought and considered a simple alternative
(similar to your suggestion below): use an exclusive lock only when
updating the slot.restart_lsn during WAL reservation, while continuing
to use a shared lock in the computation function. Additionally, place

XLogSetReplicationSlotMinimumLSN() under the lock.

This approach will also help serialize the process.

Can we discuss this as well in a separate thread?

OK, I think it makes sense to start separate threads.

I have split the patches based on the different bugs they
address and am sharing them here for reference.

Best Regards,
Hou zj

Attachments:

v6-0003-Fix-race-conditions-causing-invalidation-of-newly.patchapplication/octet-stream; name=v6-0003-Fix-race-conditions-causing-invalidation-of-newly.patch
v6-0001-Fix-the-race-condition-between-slot-wal-reservati.patchapplication/octet-stream; name=v6-0001-Fix-the-race-condition-between-slot-wal-reservati.patch
v6-0002-Fix-the-race-condition-of-updating-slot-minimum-L.patchapplication/octet-stream; name=v6-0002-Fix-the-race-condition-of-updating-slot-minimum-L.patch
#33Masahiko Sawada
Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Zhijie Hou (Fujitsu) (#32)
Re: Newly created replication slot may be invalidated by checkpoint

On Fri, Nov 21, 2025 at 12:14 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Thursday, November 20, 2025 6:44 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Nov 20, 2025 at 4:07 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
wrote:

On Thursday, November 20, 2025 4:26 PM Vitaly Davydov

<v.davydov@postgrespro.ru> wrote:

Concerning reserve_wal_for_local_slot. It seems it is used in
synchronization of failover logical slots. For me, it is tricky to
change restart_lsn of a synced logical slot to RedoRecPtr, because
it may lead to problems with logical replication using such slot
after the replica promotion. But it seems it is the architectural
problem and it is not related to the problems, solved by the patch.

I think this is not an issue because if we use the redo pointer
instead of the remote restart_lsn as the initial value, the synced
slot won't be marked as sync-ready, so user cannot use it after
promotion (also well documented). This is also the existing behavior
before the patch, e.g., if the required WALs were removed, the oldest
available WAL was used as the initial value, similarly resulting in the slot not

being sync-ready.

Would it be better to discuss this in a separate thread? Though this is related
to original problem but still in a separate part of code
(slotsync) which I think can have a separate fix especially when the fix is also
somewhat different.

The change of lock mode to EXCLUSIVE in
ReplicationSlotsComputeRequiredLSN may affect the performance when a
lot of slots are advanced during some small period of time. It may
affect the walsender performance. It advances the logical or
physical slots when receive a confirmation from the replica. I
guess, the slot advancement may be pretty frequent operation.

Yes, I had the same thought and considered a simple alternative
(similar to your suggestion below): use an exclusive lock only when
updating the slot.restart_lsn during WAL reservation, while continuing
to use a shared lock in the computation function. Additionally, place

XLogSetReplicationSlotMinimumLSN() under the lock.

This approach will also help serialize the process.

Can we discuss this as well in a separate thread?

OK, I think it makes sense to start separate threads.

I have split the patches based on the different bugs they
address and am sharing them here for reference.

I'm reviewing the 0001 patch and the problem it addresses. While the
proposed patch addresses the race condition between a checkpoint and a
newly created slot, could the same issue happen between a checkpoint
and copying a slot? I'm trying to understand when we have to acquire
ReplicationSlotAllocationLock in exclusive mode in the new lock scheme.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#34Zhijie Hou (Fujitsu)
Zhijie Hou (Fujitsu)
houzj.fnst@fujitsu.com
In reply to: Masahiko Sawada (#33)
RE: Newly created replication slot may be invalidated by checkpoint

On Tuesday, December 2, 2025 1:03 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Fri, Nov 21, 2025 at 12:14 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

OK, I think it makes sense to start separate threads.

I have split the patches based on the different bugs they
address and am sharing them here for reference.

I'm reviewing the 0001 patch and the problem that can be addressed by
that patch. While the proposed patch addresses the race condition
between a checkpointing and newly created slot, could the same issue
happen between the checkpointing and copying a slot? I'm trying to
understand when we have to acquire ReplicationSlotAllocationLock in an
exclusive mode in the new lock scheme.

Thanks for reviewing !

I think the situation is somewhat different in copy_replication_slot(). As
noted in the comments [1], it's considered acceptable for WALs preceding the
initial restart_lsn to be removed, since the latest restart_lsn will be copied
again in the second phase, so the latest WAL being reserved is safe. Aside from this
specific case, I think it's necessary to acquire
ReplicationSlotAllocationLock when reserving WALs for newly created slots.

[1]:

/*
* We need to prevent the source slot's reserved WAL from being removed,
* but we don't want to lock that slot for very long, and it can advance
* in the meantime. So obtain the source slot's data, and create a new
* slot using its restart_lsn. Afterwards we lock the source slot again
* and verify that the data we copied (name, type) has not changed
* incompatibly. No inconvenient WAL removal can occur once the new slot
* is created -- but since WAL removal could have occurred before we
* managed to create the new slot, we advance the new slot's restart_lsn
* to the source slot's updated restart_lsn the second time we lock it.
*/

Best Regards,
Hou zj

#35Masahiko Sawada
Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Zhijie Hou (Fujitsu) (#34)
Re: Newly created replication slot may be invalidated by checkpoint

On Mon, Dec 1, 2025 at 10:19 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Tuesday, December 2, 2025 1:03 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Fri, Nov 21, 2025 at 12:14 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

OK, I think it makes sense to start separate threads.

I have split the patches based on the different bugs they
address and am sharing them here for reference.

I'm reviewing the 0001 patch and the problem that can be addressed by
that patch. While the proposed patch addresses the race condition
between a checkpointing and newly created slot, could the same issue
happen between the checkpointing and copying a slot? I'm trying to
understand when we have to acquire ReplicationSlotAllocationLock in an
exclusive mode in the new lock scheme.

Thanks for reviewing !

I think the situation is somewhat different in the copy_replication_slot(). As
noted in the comments[1], it's considered acceptable for WALs preceding the
initial restart_lsn to be removed since the latest restart_lsn will be copied
again in the second phase, so latest WAL being reserved is safe.

Right. But does it mean that the new slot could be invalidated while
being copied if the first copied restart_lsn becomes less than a new
redo ptr set by a concurrent checkpoint? I thought the problem the
0001 patch is trying to fix is that the slot could end up being
invalidated by a concurrent checkpoint even while being created, so I
wonder if the same problem could occur.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#36Zhijie Hou (Fujitsu)
Zhijie Hou (Fujitsu)
houzj.fnst@fujitsu.com
In reply to: Masahiko Sawada (#35)
RE: Newly created replication slot may be invalidated by checkpoint

On Wednesday, December 3, 2025 12:26 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Dec 1, 2025 at 10:19 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Tuesday, December 2, 2025 1:03 AM Masahiko Sawada

<sawada.mshk@gmail.com> wrote:

On Fri, Nov 21, 2025 at 12:14 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

OK, I think it makes sense to start separate threads.

I have split the patches based on the different bugs they address
and am sharing them here for reference.

I'm reviewing the 0001 patch and the problem that can be addressed
by that patch. While the proposed patch addresses the race condition
between a checkpointing and newly created slot, could the same issue
happen between the checkpointing and copying a slot? I'm trying to
understand when we have to acquire ReplicationSlotAllocationLock in
an exclusive mode in the new lock scheme.

Thanks for reviewing !

I think the situation is somewhat different in the
copy_replication_slot(). As noted in the comments[1], it's considered
acceptable for WALs preceding the initial restart_lsn to be removed
since the latest restart_lsn will be copied again in the second phase, so

latest WAL being reserved is safe.

Right. But does it mean that the new slot could be invalidated while being
copied if the first copied restart_lsn becomes less than a new redo ptr set by a
concurrent checkpoint? I thought the problem the
0001 patch is trying to fix is that the slot could end up being invalidated by a
concurrent checkpoint even while being created, so I wonder if the same
problem could occur.

I think the invalidation cannot occur when copying because:

Currently, there are no CHECK_FOR_INTERRUPTS() calls between the initial
restart_lsn copy (first phase) and the latest restart_lsn copy (second phase).
As a result, even if a checkpoint attempts to invalidate a slot and sends a
SIGTERM to the backend, the backend will first update the restart_lsn during the
second phase before responding to the signal. Consequently, during the next
cycle of InvalidatePossiblyObsoleteSlot(), the checkpoint will observe the
updated restart_lsn and skip the invalidation.

For logical slots, although invoking the output plugin's startup callback presents
a slight chance of processing the signal (when using third-party plugins), an
invalidation attempt in that scenario results in the slot being dropped immediately,
because the slot is still in the RS_EPHEMERAL state, so it cannot end up invalidated.

While, theoretically, slot invalidation could occur if the code changes in the
future, addressing that possibility could be considered an independent
improvement task. What do you think?

Best Regards,
Hou zj

#37Masahiko Sawada
Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Zhijie Hou (Fujitsu) (#36)
Re: Newly created replication slot may be invalidated by checkpoint

On Tue, Dec 2, 2025 at 10:15 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Wednesday, December 3, 2025 12:26 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Dec 1, 2025 at 10:19 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Tuesday, December 2, 2025 1:03 AM Masahiko Sawada

<sawada.mshk@gmail.com> wrote:

On Fri, Nov 21, 2025 at 12:14 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

OK, I think it makes sense to start separate threads.

I have split the patches based on the different bugs they address
and am sharing them here for reference.

I'm reviewing the 0001 patch and the problem that can be addressed
by that patch. While the proposed patch addresses the race condition
between a checkpointing and newly created slot, could the same issue
happen between the checkpointing and copying a slot? I'm trying to
understand when we have to acquire ReplicationSlotAllocationLock in
an exclusive mode in the new lock scheme.

Thanks for reviewing !

I think the situation is somewhat different in the
copy_replication_slot(). As noted in the comments[1], it's considered
acceptable for WALs preceding the initial restart_lsn to be removed
since the latest restart_lsn will be copied again in the second phase, so

latest WAL being reserved is safe.

Right. But does it mean that the new slot could be invalidated while being
copied if the first copied restart_lsn becomes less than a new redo ptr set by a
concurrent checkpoint? I thought the problem the
0001 patch is trying to fix is that the slot could end up being invalidated by a
concurrent checkpoint even while being created, so I wonder if the same
problem could occur.

I think the invalidation cannot occur when copying because:

Currently, there are no CHECK_FOR_INTERRUPTS() calls between the initial
restart_lsn copy (first phase) and the latest restart_lsn copy (second phase).
As a result, even if a checkpoint attempts to invalidate a slot and sends a
SIGTERM to the backend, the backend will first update the restart_lsn during the
second phase before responding to the signal. Consequently, during the next
cycle of InvalidatePossiblyObsoleteSlot(), the checkpoint will observe the
updated restart_lsn and skip the invalidation.

For logical slots, although invoking the output plugin startup callback presents
a slight chance of processing the signal (when using third-party plugins), slot
invalidation in this scenario results in immediate slot dropping, because the
slot is in RS_EPHEMERAL state, thus preventing invalidation.

Thank you for the analysis. I agree.

While theoretically, slot invalidation could occur if the code changes in the
future, addressing that possibility could be considered an independent
improvement task. What do you think ?

Okay. I think it might also make sense for HEAD to use the
RS_EPHEMERAL state for physical slots too, to avoid them being invalidated
during creation, which can probably be discussed later. For back
branches, the proposed idea of acquiring ReplicationSlotAllocationLock
in exclusive mode would be better. I think we might also want a
comment in CheckPointReplicationSlots() that refers to
ReplicationSlotReserveWal().

Regarding whether we revert the original fix 2090edc6f32 and make it
the same as we did in HEAD (ca307d5cec90a4f), we would need to change the size
of the ReplicationSlot struct. I'm concerned about whether it is really safe to
change it, because the data resides in shared memory. For example,
we typically iterate over all replication slots as follows:

for (i = 0; i < max_replication_slots; i++)
{
    ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];
    ...
}

I'm concerned that the arithmetic for calculating the slot address would be
affected by a change in the size of ReplicationSlot.
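
For illustration only (not from the patches under discussion), the address computed
by that loop is plain base-plus-offset arithmetic, so it changes whenever
sizeof(ReplicationSlot) changes:

/* Equivalent to &ReplicationSlotCtl->replication_slots[i]: the offset of
 * slot i depends directly on sizeof(ReplicationSlot), so any binary built
 * against a different struct layout would address the wrong slot. */
ReplicationSlot *s = (ReplicationSlot *)
    ((char *) ReplicationSlotCtl->replication_slots + i * sizeof(ReplicationSlot));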

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#38Amit Kapila
Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#37)
Re: Newly created replication slot may be invalidated by checkpoint

On Thu, Dec 4, 2025 at 2:04 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Dec 2, 2025 at 10:15 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

I think the invalidation cannot occur when copying because:

Currently, there are no CHECK_FOR_INTERRUPTS() calls between the initial
restart_lsn copy (first phase) and the latest restart_lsn copy (second phase).
As a result, even if a checkpoint attempts to invalidate a slot and sends a
SIGTERM to the backend, the backend will first update the restart_lsn during the
second phase before responding to the signal. Consequently, during the next
cycle of InvalidatePossiblyObsoleteSlot(), the checkpoint will observe the
updated restart_lsn and skip the invalidation.

For logical slots, although invoking the output plugin startup callback presents
a slight chance of processing the signal (when using third-party plugins), slot
invalidation in this scenario results in immediate slot dropping, because the
slot is in RS_EPHEMERAL state, thus preventing invalidation.

Thank you for the analysis. I agree.

While theoretically, slot invalidation could occur if the code changes in the
future, addressing that possibility could be considered an independent
improvement task. What do you think ?

Okay. I find that it also might make sense for HEAD to use
RS_EPHEMERAL state for physical slots too to avoid being invalidated
during creation, which probably can be discussed later. For back
branches, the proposed idea of acquiring ReplicationSlotAllocationLock
in an exclusive mode would be better. I think we might want to have a
comment in CheckPointReplicationSlots() too that refers to
ReplicationSlotReserveWal().

Regarding whether we revert the original fix 2090edc6f32 and make it
the same as we did in HEAD ca307d5cec90a4f, we need to change the size
of ReplicationSlot struct. I'm concerned that it's really safe to
change it because the data resides on the shared memory. For example,
we typically iterate over all replication slots as follow:

for (i = 0; i < max_replication_slots; i++)
{
ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];

I'm concerned that the arithmetic for calculating the slot address is
affected by the size of ReplicationSlot change.

Yes, this is a valid concern. I think we can go ahead with the
0001 fix in HEAD and 18. We can discuss the fix for
back branches prior to 18 separately.

--
With Regards,
Amit Kapila.

#39Zhijie Hou (Fujitsu)
Zhijie Hou (Fujitsu)
houzj.fnst@fujitsu.com
In reply to: Amit Kapila (#38)
2 attachment(s)
RE: Newly created replication slot may be invalidated by checkpoint

On Thursday, December 4, 2025 1:58 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Dec 4, 2025 at 2:04 AM Masahiko Sawada
<sawada.mshk@gmail.com> wrote:

On Tue, Dec 2, 2025 at 10:15 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

I think the invalidation cannot occur when copying because:

Currently, there are no CHECK_FOR_INTERRUPTS() calls between the
initial restart_lsn copy (first phase) and the latest restart_lsn copy (second phase).
As a result, even if a checkpoint attempts to invalidate a slot and
sends a SIGTERM to the backend, the backend will first update the
restart_lsn during the second phase before responding to the signal.
Consequently, during the next cycle of
InvalidatePossiblyObsoleteSlot(), the checkpoint will observe the updated
restart_lsn and skip the invalidation.

For logical slots, although invoking the output plugin startup
callback presents a slight chance of processing the signal (when
using third-party plugins), slot invalidation in this scenario
results in immediate slot dropping, because the slot is in RS_EPHEMERAL

state, thus preventing invalidation.

Thank you for the analysis. I agree.

While theoretically, slot invalidation could occur if the code
changes in the future, addressing that possibility could be
considered an independent improvement task. What do you think ?

Okay. I find that it also might make sense for HEAD to use
RS_EPHEMERAL state for physical slots too to avoid being invalidated
during creation, which probably can be discussed later. For back
branches, the proposed idea of acquiring ReplicationSlotAllocationLock
in an exclusive mode would be better. I think we might want to have a
comment in CheckPointReplicationSlots() too that refers to
ReplicationSlotReserveWal().

Regarding whether we revert the original fix 2090edc6f32 and make it
the same as we did in HEAD ca307d5cec90a4f, we need to change the size
of ReplicationSlot struct. I'm concerned that it's really safe to
change it because the data resides on the shared memory. For example,
we typically iterate over all replication slots as follow:

for (i = 0; i < max_replication_slots; i++) {
ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];

I'm concerned that the arithmetic for calculating the slot address is
affected by the size of ReplicationSlot change.

Yes, this is a valid concern. I think we can go-ahead with fixing the 0001's-fix
in HEAD and 18. We can discuss separately the fix for back-branches prior to
18.

Here are the updated patches for HEAD and 18. I did not add tests since, after
applying the patch and resolving the issue, the only observable behavior is that
the checkpoint will wait for another backend to create a slot due to the lwlock,
so it does not seem worth testing solely for an lwlock wait event (I could not find
similar tests).

Best Regards,
Hou zj

Attachments:

v7_HEAD-0001-Fix-the-race-condition-between-slot-wal-reservati.patchapplication/octet-stream; name=v7_HEAD-0001-Fix-the-race-condition-between-slot-wal-reservati.patch
v7_PG18-0001-Fix-the-race-condition-between-slot-wal-rese.patchapplication/octet-stream; name=v7_PG18-0001-Fix-the-race-condition-between-slot-wal-rese.patch
#40Amit Kapila
Amit Kapila
amit.kapila16@gmail.com
In reply to: Zhijie Hou (Fujitsu) (#39)
1 attachment(s)
Re: Newly created replication slot may be invalidated by checkpoint

On Thu, Dec 4, 2025 at 12:12 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Thursday, December 4, 2025 1:58 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Dec 4, 2025 at 2:04 AM Masahiko Sawada
<sawada.mshk@gmail.com> wrote:

On Tue, Dec 2, 2025 at 10:15 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

I think the invalidation cannot occur when copying because:

Currently, there are no CHECK_FOR_INTERRUPTS() calls between the
initial restart_lsn copy (first phase) and the latest restart_lsn copy (second phase).
As a result, even if a checkpoint attempts to invalidate a slot and
sends a SIGTERM to the backend, the backend will first update the
restart_lsn during the second phase before responding to the signal.
Consequently, during the next cycle of
InvalidatePossiblyObsoleteSlot(), the checkpoint will observe the updated
restart_lsn and skip the invalidation.

For logical slots, although invoking the output plugin startup
callback presents a slight chance of processing the signal (when
using third-party plugins), slot invalidation in this scenario
results in immediate slot dropping, because the slot is in RS_EPHEMERAL

state, thus preventing invalidation.

Thank you for the analysis. I agree.

While theoretically, slot invalidation could occur if the code
changes in the future, addressing that possibility could be
considered an independent improvement task. What do you think ?

Okay. I find that it also might make sense for HEAD to use
RS_EPHEMERAL state for physical slots too to avoid being invalidated
during creation, which probably can be discussed later. For back
branches, the proposed idea of acquiring ReplicationSlotAllocationLock
in an exclusive mode would be better. I think we might want to have a
comment in CheckPointReplicationSlots() too that refers to
ReplicationSlotReserveWal().

Regarding whether we revert the original fix 2090edc6f32 and make it
the same as we did in HEAD ca307d5cec90a4f, we need to change the size
of ReplicationSlot struct. I'm concerned that it's really safe to
change it because the data resides on the shared memory. For example,
we typically iterate over all replication slots as follow:

for (i = 0; i < max_replication_slots; i++) {
ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];

I'm concerned that the arithmetic for calculating the slot address is
affected by the size of ReplicationSlot change.

Yes, this is a valid concern. I think we can go-ahead with fixing the 0001's-fix
in HEAD and 18. We can discuss separately the fix for back-branches prior to
18.

Here are the updated patches for HEAD and 18. I did not add tests since, after
applying the patch and resolving the issue, the only observable behavior is that
the checkpoint will wait for another backend to create a slot due to the lwlock
lock, so it seems not worth to test solely lwlock wait event (I could not find similar
tests).

Fair enough. The patch looks mostly good to me; attached are minor
comment improvements on top of the HEAD patch. I'll do some more testing
before pushing.

Sawada-san/Vitaly, do you have any opinion on the patch or the direction
of the fix? The idea is to get this fixed for HEAD and 18, then continue the
discussion for the other back branches and the remaining patches.

--
With Regards,
Amit Kapila.

Attachments:

v7_amit_1.txttext/plain; charset=US-ASCII; name=v7_amit_1.txt
#41Zhijie Hou (Fujitsu)
Zhijie Hou (Fujitsu)
houzj.fnst@fujitsu.com
In reply to: Amit Kapila (#40)
2 attachment(s)
RE: Newly created replication slot may be invalidated by checkpoint

On Friday, December 5, 2025 8:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Fair enough. The patch looks mostly good to me, attached are minor comment
improvements atop the HEAD patch. I'll do some more testing before push.

Thanks for the diff, it looks good to me.

Here are the updated patches for HEAD and 18.

Best Regards,
Hou zj

Attachments:

v8_HEAD-0001-Fix-the-race-condition-between-slot-wal-rese.patchapplication/octet-stream; name=v8_HEAD-0001-Fix-the-race-condition-between-slot-wal-rese.patch
v8_PG18-0001-Fix-the-race-condition-between-slot-wal-rese.patchapplication/octet-stream; name=v8_PG18-0001-Fix-the-race-condition-between-slot-wal-rese.patch
#42Masahiko Sawada
Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#40)
Re: Newly created replication slot may be invalidated by checkpoint

On Fri, Dec 5, 2025 at 4:10 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Dec 4, 2025 at 12:12 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Thursday, December 4, 2025 1:58 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Dec 4, 2025 at 2:04 AM Masahiko Sawada
<sawada.mshk@gmail.com> wrote:

On Tue, Dec 2, 2025 at 10:15 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

I think the invalidation cannot occur when copying because:

Currently, there are no CHECK_FOR_INTERRUPTS() calls between the
initial restart_lsn copy (first phase) and the latest restart_lsn copy (second phase).
As a result, even if a checkpoint attempts to invalidate a slot and
sends a SIGTERM to the backend, the backend will first update the
restart_lsn during the second phase before responding to the signal.
Consequently, during the next cycle of
InvalidatePossiblyObsoleteSlot(), the checkpoint will observe the updated
restart_lsn and skip the invalidation.

For logical slots, although invoking the output plugin startup
callback presents a slight chance of processing the signal (when
using third-party plugins), slot invalidation in this scenario
results in immediate slot dropping, because the slot is in RS_EPHEMERAL

state, thus preventing invalidation.

Thank you for the analysis. I agree.

While theoretically, slot invalidation could occur if the code
changes in the future, addressing that possibility could be
considered an independent improvement task. What do you think ?

Okay. I find that it also might make sense for HEAD to use
RS_EPHEMERAL state for physical slots too to avoid being invalidated
during creation, which probably can be discussed later. For back
branches, the proposed idea of acquiring ReplicationSlotAllocationLock
in an exclusive mode would be better. I think we might want to have a
comment in CheckPointReplicationSlots() too that refers to
ReplicationSlotReserveWal().

Regarding whether we revert the original fix 2090edc6f32 and make it
the same as we did in HEAD ca307d5cec90a4f, we need to change the size
of ReplicationSlot struct. I'm concerned that it's really safe to
change it because the data resides on the shared memory. For example,
we typically iterate over all replication slots as follow:

for (i = 0; i < max_replication_slots; i++) {
ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];

I'm concerned that the arithmetic for calculating the slot address is
affected by the size of ReplicationSlot change.

Yes, this is a valid concern. I think we can go-ahead with fixing the 0001's-fix
in HEAD and 18. We can discuss separately the fix for back-branches prior to
18.

Here are the updated patches for HEAD and 18. I did not add tests since, after
applying the patch and resolving the issue, the only observable behavior is that
the checkpoint will wait for another backend to create a slot due to the lwlock
lock, so it seems not worth to test solely lwlock wait event (I could not find similar
tests).

Fair enough. The patch looks mostly good to me, attached are minor
comment improvements atop the HEAD patch. I'll do some more testing
before push.

Sawada-san/Vitaly, do you have any opinion on patch or the direction
to fix? The idea is to get this fixed for HEAD and 18, then continue
discussion for other bank-branches and the remaining patches.

+1

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#43Amit Kapila
Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#42)
Re: Newly created replication slot may be invalidated by checkpoint

On Mon, Dec 8, 2025 at 12:53 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Fri, Dec 5, 2025 at 4:10 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Dec 4, 2025 at 12:12 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

Here are the updated patches for HEAD and 18. I did not add tests since, after
applying the patch and resolving the issue, the only observable behavior is that
the checkpoint will wait for another backend to create a slot due to the lwlock
lock, so it seems not worth to test solely lwlock wait event (I could not find similar
tests).

Fair enough. The patch looks mostly good to me, attached are minor
comment improvements atop the HEAD patch. I'll do some more testing
before push.

Sawada-san/Vitaly, do you have any opinion on patch or the direction
to fix? The idea is to get this fixed for HEAD and 18, then continue
discussion for other bank-branches and the remaining patches.

+1

Thanks, pushed. I'll continue thinking about how to fix this in branches
prior to 18 and about the other problems reported in this thread.

--
With Regards,
Amit Kapila.

#44Zhijie Hou (Fujitsu)
Zhijie Hou (Fujitsu)
houzj.fnst@fujitsu.com
In reply to: Amit Kapila (#43)
1 attachment(s)
RE: Newly created replication slot may be invalidated by checkpoint

On Monday, December 8, 2025 5:47 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Dec 8, 2025 at 12:53 PM Masahiko Sawada
<sawada.mshk@gmail.com> wrote:

On Fri, Dec 5, 2025 at 4:10 AM Amit Kapila <amit.kapila16@gmail.com>

wrote:

On Thu, Dec 4, 2025 at 12:12 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

Here are the updated patches for HEAD and 18. I did not add tests
since, after applying the patch and resolving the issue, the only
observable behavior is that the checkpoint will wait for another
backend to create a slot due to the lwlock lock, so it seems not
worth to test solely lwlock wait event (I could not find similar tests).

Fair enough. The patch looks mostly good to me, attached are minor
comment improvements atop the HEAD patch. I'll do some more testing
before push.

Sawada-san/Vitaly, do you have any opinion on patch or the direction
to fix? The idea is to get this fixed for HEAD and 18, then continue
discussion for other bank-branches and the remaining patches.

+1

Thanks, Pushed. I'll continue thinking on how to fix it in branches prior to 18
and other problems reported in this thread.

Thanks for pushing. I thought about whether it's possible to apply a similar fix
to the back branches, and one approach could be to take ReplicationSlotAllocationLock
in two places: acquire an exclusive lock during WAL reservation, and a shared
lock during the minimum LSN calculation at checkpoint time, to serialize the process.

The logic is similar to HEAD: it ensures that, if the WAL reservation
occurs first, the checkpoint waits until the restart_lsn is updated before
calculating the minimum LSN. If the checkpoint runs first, subsequent WAL
reservations pick a position at or after the latest checkpoint's redo pointer.
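
To make the idea concrete, here is a rough sketch of the two lock sites; these are
illustrative fragments rather than the actual patch, only the physical-slot case of
ReplicationSlotReserveWal() is shown, and the retry loop and error handling are
omitted (slotsMinReqLSN is the local variable the checkpoint code already uses for
this value):

/* In ReplicationSlotReserveWal(), while creating a slot: */
LWLockAcquire(ReplicationSlotAllocationLock, LW_EXCLUSIVE);

restart_lsn = GetRedoRecPtr();      /* at or after the latest redo pointer */

SpinLockAcquire(&slot->mutex);
slot->data.restart_lsn = restart_lsn;
SpinLockRelease(&slot->mutex);

/* Publish the new minimum before the lock is released. */
ReplicationSlotsComputeRequiredLSN();

LWLockRelease(ReplicationSlotAllocationLock);

/* In the checkpointer, before deciding which WAL segments can be removed: */
LWLockAcquire(ReplicationSlotAllocationLock, LW_SHARED);
slotsMinReqLSN = XLogGetReplicationSlotMinimumLSN();
LWLockRelease(ReplicationSlotAllocationLock);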

Here is the patch based on PG17 for reference.

Best Regards,
Hou zj

Attachments:

v9_PG17-0001-Prevent-invalidation-of-newly-created-replic.patchapplication/octet-stream; name=v9_PG17-0001-Prevent-invalidation-of-newly-created-replic.patch
#45Vitaly Davydov
Vitaly Davydov
v.davydov@postgrespro.ru
In reply to: Zhijie Hou (Fujitsu) (#44)
RE: Newly created replication slot may be invalidated by checkpoint

On Monday, December 08, 2025 13:24 MSK, "Zhijie Hou (Fujitsu)" <houzj.fnst@fujitsu.com> wrote:

On Monday, December 8, 2025 5:47 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Sawada-san/Vitaly, do you have any opinion on patch or the direction
to fix? The idea is to get this fixed for HEAD and 18, then continue
discussion for other bank-branches and the remaining patches.

Hi Amit, Zhijie Hou

Thank you for preparing and committing the 0001 patch. I'm OK with it. I did some
automated testing of the patch and haven't found any problems. As I understand it,
the other two patches (0002, 0003) are still in review.

In my previous email I wrote about copy_replication_slot, where restart_lsn is
assigned without any locks, but I'm not sure that email was successfully
delivered. Masahiko Sawada mentioned it in one of the latest emails as
well. I also read the answer but have not completely understood it yet,
sorry (I need some more time to investigate). Anyway, I would prefer to use locks
in create_physical_replication_slot rather than rely on signal handling, which
may change in the future.

One more thing: when we copy a logical replication slot,
DecodingContextFindStartpoint reads the WAL from the specified restart_lsn, which
may be removed by a concurrent checkpoint. It can produce an error and stop the slot
copy, I guess. This behaviour may not be desirable.

With best regards,
Vitaly

#46Zhijie Hou (Fujitsu)
Zhijie Hou (Fujitsu)
houzj.fnst@fujitsu.com
In reply to: Vitaly Davydov (#45)
RE: Newly created replication slot may be invalidated by checkpoint

On Tuesday, December 9, 2025 12:40 AM Vitaly Davydov <v.davydov@postgrespro.ru> wrote:

On Monday, December 08, 2025 13:24 MSK, "Zhijie Hou (Fujitsu)"
<houzj.fnst@fujitsu.com> wrote:

On Monday, December 8, 2025 5:47 PM Amit Kapila

<amit.kapila16@gmail.com> wrote:

Sawada-san/Vitaly, do you have any opinion on patch or the
direction to fix? The idea is to get this fixed for HEAD and 18,
then continue discussion for other bank-branches and the remaining

patches.

Hi Amit, Zhijie Hou

Thank you for preparing and comiting 0001 patch. I'm ok with it. I did some
auto testing of the patch and haven't found any problems. As I realized,
another two patches (0002, 0003) are still in review.

Thanks for testing!

In my previous email I wrote about copy_replication_slot, where restart_lsn is
assigned without any locks, but I'm not sure that email was successfully
delivered. Masahiko Sagawa mentioned about it in one of the latest emails as
well. I also read the answer but not completely understood it at the moment,
sorry (need some more time to investigate). Anyway, I would prefer to use
locks in create_physical_replication_slot rather than rely on signals handling
which may be changed in the future.

If we want to improve that, taking a lock when updating restart_lsn does not work,
because the initial restart_lsn is an old position copied from another slot,
whose WAL could have already been removed, so unlike the mechanism
mentioned in 006dd4b2, the lock cannot provide the same guarantee. I think we
might need some other solution for it, which can be discussed separately.
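
(For reference, the assignment being discussed is roughly the following fragment of
create_physical_replication_slot() in slotfuncs.c; this is a simplified paraphrase,
and argument lists and details vary across branches:)

	if (immediately_reserve)
	{
		/* Reserve WAL as the user asked for it */
		if (XLogRecPtrIsInvalid(restart_lsn))
			ReplicationSlotReserveWal();	/* fresh reservation */
		else
			MyReplicationSlot->data.restart_lsn = restart_lsn;	/* value copied from the source slot */

		/* Write this slot to disk */
		ReplicationSlotMarkDirty();
		ReplicationSlotSave();
	}

Since the copied value may already point at removed WAL, serializing the assignment
itself would not help.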

One more thing: when we copy a logical replication slot,
DecodingContextFindStartpoint reads the WAL from the specified restart_lsn,
which may be removed by a concurrent checkpoint. I guess this can produce an error
and stop the slot copying. This behaviour may not be desirable.

Given that we don't search for a starting point for copied slots (we
pass find_startpoint=false when copying), I don't think this issue exists.
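
(For reference, the relevant logic in create_logical_replication_slot() looks
roughly like the sketch below; this is simplified, and details differ across
branches:)

	/*
	 * Sketch of create_logical_replication_slot() in slotfuncs.c: the
	 * startpoint search only happens when the caller asks for it, and
	 * copy_replication_slot() passes find_startpoint = false, so
	 * DecodingContextFindStartpoint() never reads WAL from the copied
	 * restart_lsn.
	 */
	ctx = CreateInitDecodingContext(plugin, NIL, false, restart_lsn, ...);

	if (find_startpoint)
		DecodingContextFindStartpoint(ctx);

	FreeDecodingContext(ctx);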

Best Regards,
Hou zj

#47Amit Kapila
Amit Kapila
amit.kapila16@gmail.com
In reply to: Zhijie Hou (Fujitsu) (#44)
Re: Newly created replication slot may be invalidated by checkpoint

On Mon, Dec 8, 2025 at 3:54 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Monday, December 8, 2025 5:47 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Dec 8, 2025 at 12:53 PM Masahiko Sawada
<sawada.mshk@gmail.com> wrote:

On Fri, Dec 5, 2025 at 4:10 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Dec 4, 2025 at 12:12 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

Here are the updated patches for HEAD and 18. I did not add tests
since, after applying the patch and resolving the issue, the only
observable behavior is that the checkpoint will wait for another
backend to create a slot due to the LWLock, so it does not seem worth
testing solely for the LWLock wait event (I could not find similar tests).

Fair enough. The patch looks mostly good to me; attached are minor
comment improvements atop the HEAD patch. I'll do some more testing
before pushing.

Sawada-san/Vitaly, do you have any opinion on the patch or the direction
to fix? The idea is to get this fixed for HEAD and 18, then continue
discussion for the other back-branches and the remaining patches.

+1

Thanks, pushed. I'll continue thinking about how to fix it in branches prior to 18
and about the other problems reported in this thread.

Thanks for pushing. I thought about whether it's possible to apply a similar fix
to back-branches, and one approach could be to take ReplicationSlotAllocationLock
at two places: e.g., acquire an exclusive lock during WAL reservation, and a shared
lock during the minimum LSN calculation at checkpoints, to serialize the two.

+ *
+ * Additionally, acquiring the Allocation lock to serialize the minimum LSN
+ * calculation with concurrent slot WAL reservation. This ensures that the
+ * WAL position being reserved is either included in the minimum LSN or is
+ * beyond or equal to the redo pointer of the current checkpoint (See
+ * ReplicationSlotReserveWal for details).
  */
+ LWLockAcquire(ReplicationSlotAllocationLock, LW_SHARED);
  slotsMinReqLSN = XLogGetReplicationSlotMinimumLSN();
+ LWLockRelease(ReplicationSlotAllocationLock);
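
(To make the other half of that idea concrete, a purely hypothetical sketch of the
reservation side, not the actual patch, might look like this inside
ReplicationSlotReserveWal(); the position calculation is glossed over:)

	XLogRecPtr	restart_lsn;

	LWLockAcquire(ReplicationSlotAllocationLock, LW_EXCLUSIVE);

	/*
	 * Pick the reserved position while holding the lock: a concurrent
	 * checkpoint either already accounted for it in the shared-lock
	 * minimum-LSN calculation, or the position is at or beyond the
	 * checkpoint's redo pointer, so the segment cannot be recycled.
	 */
	restart_lsn = GetXLogInsertRecPtr();	/* simplified; physical slots would use the redo pointer */

	SpinLockAcquire(&MyReplicationSlot->mutex);
	MyReplicationSlot->data.restart_lsn = restart_lsn;
	SpinLockRelease(&MyReplicationSlot->mutex);

	/* Make the new minimum visible to later checkpoints. */
	ReplicationSlotsComputeRequiredLSN();

	LWLockRelease(ReplicationSlotAllocationLock);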

Yeah, this will fix the reported issue, but doesn't it look odd to take
an unrelated lock here? I mean, it gives the impression that anyone who calls
XLogGetReplicationSlotMinimumLSN() should acquire
ReplicationSlotAllocationLock in LW_SHARED mode. If we want to go in
this direction and don't have better ideas for a fix, then we should add
comments saying that this is a special case and shouldn't be used as an
example for other places.

The other idea to fix this problem was suggested by Alexander in his
email [1], which is to introduce a new ReplicationSlotReserveWALLock
for this purpose. I think introducing a new LWLock in back branches could be
questionable. Did you evaluate the pros and cons of that
approach?

Yet another possibility is that we don't fix this in back branches
prior to 18, but I am not sure how frequently it can impact users. Suyu, can
you please tell us how you found this problem in the first place? Was it
via code review, or did you hit it in production or while doing
some related tests?

BTW, I have asked a question regarding commit 2090edc6f32f652a2c in
email [2]. Did you get a chance to look at that?

[1]: /messages/by-id/CAPpHfduZY7_pRCrbLdsLty4zP5x2EDmwk4CYiofiyjdt1iK+zA@mail.gmail.com
[2]: /messages/by-id/CAA4eK1+wrNSee6PKQ0+DtUu_W0GdvewskpAEK5EiX6r3E+2Sxw@mail.gmail.com

--
With Regards,
Amit Kapila.