Newly created replication slot may be invalidated by checkpoint

Started by 蔡梦娟(玊于)6 months ago71 messages

mengjuan.cmj@alibaba-inc.com

6 months ago

Hi, all,
I'd like to discuss an issue about getting the minimal restart_lsn for WAL segments removal during checkpoint. The discussion [1]/messages/by-id/1d12d2-67235980-35-19a406a0@63439497 </messages/by-id/1d12d2-67235980-35-19a406a0@63439497 > fixed the issue with the unexpected removal of old WAL segments after checkpoint, followed by an immediate restart. The commit 2090edc6f32f652a2c introduced a change that the minimal restart_lsn is obtained at the start of checkpoint creation. If a replication slot is created and performs a WAL reservation concurrently, the WAL segment contains the new slot's restart_lsn could be removed by the ongoing checkpoint. In the attached patch I add a perl test to reproduce this scenario.
Additionally, while studying the InvalidatePossiblyObsoleteSlot(), I noticed a behavioral difference between PG15 (and earlier) and PG16 (and later). In PG15 and earlier, while attempting to acquire a slot, if the slot's restart_lsn advanced to be greater than oldestLSN, the slot would not be marked invalid. Starting in PG16, whether a slot is marked invalid is determined solely based on initial_restart_lsn, even if the slot's restart_lsn advances above oldestLSN while waiting, the slot will still be marked invalid. The initial_restart_lsn is recorded to report the correct invalidation cause (see discussion [2]/messages/by-id/ZaTjW2Xh+TQUCOH0@ip-10-97-1-34.eu-west-3.compute.internal </messages/by-id/ZaTjW2Xh+TQUCOH0@ip-10-97-1-34.eu-west-3.compute.internal > Looking forward to your feedback. Best Regards, suyu.cmj), but why not decide whether to mark the slot as invalid based on the slot's current restart_lsn? If a slot's restart_lsn has already advanced sufficiently, shouldn't we refrain from invalidating it?
[1]: /messages/by-id/1d12d2-67235980-35-19a406a0@63439497 </messages/by-id/1d12d2-67235980-35-19a406a0@63439497 >
[2]: /messages/by-id/ZaTjW2Xh+TQUCOH0@ip-10-97-1-34.eu-west-3.compute.internal </messages/by-id/ZaTjW2Xh+TQUCOH0@ip-10-97-1-34.eu-west-3.compute.internal > Looking forward to your feedback. Best Regards, suyu.cmj
Looking forward to your feedback.
Best Regards,
suyu.cmj

Vitaly Davydov

v.davydov@postgrespro.ru

6 months ago

In reply to: 蔡梦娟(玊于) (#1)

Re: Newly created replication slot may be invalidated by checkpoint

Hi suyu.cmj

The commit 2090edc6f32f652a2c introduced a change that the
minimal restart_lsn is obtained at the start of checkpoint creation. If a
replication slot is created and performs a WAL reservation concurrently, the
WAL segment contains the new slot's restart_lsn could be removed by the ongoing
checkpoint.

Thank you for reporting this issue. I agree, the issue with slot invalidation
seems to take place in REL_17_STABLE and earlier, but it is not reproducible in
18+ versions because of different implementation. The problem may appear if
the first persistent slot is created during checkpoint, when slot's oldest lsn
is invalid. I'm not sure how it works when some other persistent slots exist.
Probably, invalidation is still possible if the reservation happens with lsn
older than the oldest lsn of existing slots.

In 17 and earlier verions, when checkpoint is started in takes slot's oldest lsn
using XLogGetReplicationSlotMinimumLSN(). This value will be used later in WAL
segments removal. If a new slot reserved the WAL between getting of slots'
oldest lsn and WAL removal, it may be invalidated. It happens because
ReplicationSlotReserveWal() checks XLogCtl->lastRemovedSegNo but the segments
are not yet removed. There is a subtle thing, when the wal reservation completes
at the same time when the checkpointer is between KeepLogSeg and
RemoveOldXlogFiles where XLogCtl->lastRemovedSegNo is updated. The slot will not
be invalidated but the segments, reserved by the new slot, may be removed, I guess.

In 17 and earlier we tried to create a compatible solution, when oldest lsn was
taken before slot syncing to disk. In the master branch we added a new
last_saved_restart_lsn into ReplicationSlot structure which seems to be a better
solution.

I prepared a simple fix [1]0001-Fix-invalidation-when-slot-is-created-during-checkpo.patch for 17 and earlier versions. It seems it fixes the
problem with first persistent slot creation. I also think, it should work as it
was before the patch that added this bug.

I also did some changes in the original test script, for 17 ([2]v2-17-0001-Newly-created-replication-slot-may-be-invalidated-by.patch) and 18 ([3]v2-18-0001-Newly-created-replication-slot-may-be-invalidated-by.patch)
versions.

I continue to investigate and test it.

[1]: 0001-Fix-invalidation-when-slot-is-created-during-checkpo.patch
[2]: v2-17-0001-Newly-created-replication-slot-may-be-invalidated-by.patch
[3]: v2-18-0001-Newly-created-replication-slot-may-be-invalidated-by.patch

With best regards,
Vitaly

Amit Kapila

amit.kapila16@gmail.com

6 months ago

In reply to: Vitaly Davydov (#2)

Re: Newly created replication slot may be invalidated by checkpoint

On Wed, Sep 17, 2025 at 4:19 PM Vitaly Davydov <v.davydov@postgrespro.ru> wrote:

[1] 0001-Fix-invalidation-when-slot-is-created-during-checkpo.patch

- /* Calculate how many segments are kept by slots. */
- keep = slotsMinReqLSN;
+ /*
+ * Calculate how many segments are kept by slots. Keep the wal using
+ * the minimal value from the current reserved LSN and the reserved LSN at
+ * the moment of checkpoint start (before CheckPointReplicationSlots).
+ */
+ keep = XLogGetReplicationSlotMinimumLSN();
+ if (!XLogRecPtrIsInvalid(slotsMinReqLSN))
+ keep = Min(keep, slotsMinReqLSN);

Can we add why we need this additional calculation here?

I have one question regarding commit 2090edc6f32f652a2c:
*
    if (InvalidateObsoleteReplicationSlots(RS_INVAL_WAL_REMOVED,
                                           _logSegNo, InvalidOid,
                                           InvalidTransactionId))
    {
+       /*
+        * Recalculate the current minimum LSN to be used in the WAL segment
+        * cleanup.  Then, we must synchronize the replication slots again in
+        * order to make this LSN safe to use.
+        */
+       slotsMinReqLSN = XLogGetReplicationSlotMinimumLSN();
+       CheckPointReplicationSlots(shutdown);
+
        /*
         * Some slots have been invalidated; recalculate the old-segment
         * horizon, starting again from RedoRecPtr.
         */
        XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
-       KeepLogSeg(recptr, &_logSegNo);
+       KeepLogSeg(recptr, slotsMinReqLSN, &_logSegNo);

After invalidating the slots, we recalculate the slotsMinReqLSN with
the latest value of XLogGetReplicationSlotMinimumLSN(). Can't it
generate a more recent value of slot's restart_lsn which has not been
flushed and we may end up removing the corresponding WAL? We should
probably add some comments as to why such a race doesn't exist.

--
With Regards,
Amit Kapila.

Amit Kapila

amit.kapila16@gmail.com

6 months ago

In reply to: 蔡梦娟(玊于) (#1)

Re: Newly created replication slot may be invalidated by checkpoint

On Mon, Sep 15, 2025 at 8:11 PM suyu.cmj <mengjuan.cmj@alibaba-inc.com> wrote:

Additionally, while studying the InvalidatePossiblyObsoleteSlot(), I noticed a behavioral difference between PG15 (and earlier) and PG16 (and later). In PG15 and earlier, while attempting to acquire a slot, if the slot's restart_lsn advanced to be greater than oldestLSN, the slot would not be marked invalid. Starting in PG16, whether a slot is marked invalid is determined solely based on initial_restart_lsn, even if the slot's restart_lsn advances above oldestLSN while waiting, the slot will still be marked invalid. The initial_restart_lsn is recorded to report the correct invalidation cause (see discussion [2]), but why not decide whether to mark the slot as invalid based on the slot's current restart_lsn? If a slot's restart_lsn has already advanced sufficiently, shouldn't we refrain from invalidating it?

I haven't tried to reproduce it but I see your point. I agree that if
this is happening then we should not invalidate such slots. This is a
different problem related to a different commit than what is fixd as
2090edc6f32f652a2c. I suggest you to either start a new thread for
this or report in the original thread where this change was made.

--
With Regards,
Amit Kapila.

蔡梦娟(玊于)

mengjuan.cmj@alibaba-inc.com

6 months ago

In reply to: Amit Kapila (#4)

Re: Newly created replication slot may be invalidated by checkpoint

Hi Amit,
Thank you for your reply. Following your suggestion, I have started a new discussion thread for this issue:
/messages/by-id/f492465f-657e-49af-8317-987460cb68b0.mengjuan.cmj@alibaba-inc.com </messages/by-id/f492465f-657e-49af-8317-987460cb68b0.mengjuan.cmj@alibaba-inc.com >
Best regards,
suyu.cmj

Amit Kapila

amit.kapila16@gmail.com

6 months ago

In reply to: Amit Kapila (#3)

Re: Newly created replication slot may be invalidated by checkpoint

On Tue, Sep 23, 2025 at 12:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Sep 17, 2025 at 4:19 PM Vitaly Davydov <v.davydov@postgrespro.ru> wrote:

[1] 0001-Fix-invalidation-when-slot-is-created-during-checkpo.patch
- /* Calculate how many segments are kept by slots. */
- keep = slotsMinReqLSN;
+ /*
+ * Calculate how many segments are kept by slots. Keep the wal using
+ * the minimal value from the current reserved LSN and the reserved LSN at
+ * the moment of checkpoint start (before CheckPointReplicationSlots).
+ */
+ keep = XLogGetReplicationSlotMinimumLSN();
+ if (!XLogRecPtrIsInvalid(slotsMinReqLSN))
+ keep = Min(keep, slotsMinReqLSN);
Can we add why we need this additional calculation here?

I was thinking some more about this solution. Won't it lead to the
same problem if ReplicationSlotReserveWal() calls
ReplicationSlotsComputeRequiredLSN() after the above calculation of
checkpointer?

--
With Regards,
Amit Kapila.

Hayato Kuroda (Fujitsu)

kuroda.hayato@fujitsu.com

6 months ago

In reply to: Amit Kapila (#6)

RE: Newly created replication slot may be invalidated by checkpoint

Dear Amit, Vitaly,

I was thinking some more about this solution. Won't it lead to the
same problem if ReplicationSlotReserveWal() calls
ReplicationSlotsComputeRequiredLSN() after the above calculation of
checkpointer?

Exactly. I verified that in your patch, the invalidation can still happen if we
cannot finish the LSN computation before the KeepLogSegments().

Attached file can be applied atop 0001-Fix-invalidation-... and
v2-17-0001-Newly-created-replication... patches. It could invalidate the given
slot.

Best regards,
Hayato Kuroda
FUJITSU LIMITED

Vitaly Davydov

v.davydov@postgrespro.ru

6 months ago

In reply to: Hayato Kuroda (Fujitsu) (#7)

RE: Newly created replication slot may be invalidated by checkpoint

Dear Amit, Hayato

On Wednesday, September 24, 2025 14:31 MSK, "Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote:

I was thinking some more about this solution. Won't it lead to the
same problem if ReplicationSlotReserveWal() calls
ReplicationSlotsComputeRequiredLSN() after the above calculation of
checkpointer?

Exactly. I verified that in your patch, the invalidation can still happen if
we cannot finish the LSN computation before the KeepLogSegments().

Yes. The moment, when WAL reservation takes place is the call of
ReplicationSlotsComputeRequiredLSN which updates the oldest slots' lsn
(XLogCtl->replicationSlotMinLSN). If it occurs at the moment between KeepLogSeg
and RemoveOldXlogFiles, such reservation will not be taken into account. This
behaviour seems to be before commit 2090edc6f32f652a2c, but the probability of
such race condition was too slow due to the short time period between KeepLogSeg
and RemoveOldXlogFiles. The commit 2090edc6f32f652a2c increased the probability
of such race condition because CheckPointGuts can take greater time to execute.

The attached patch doesn't solve the original problem completely but it
decreases the probability of such race condition, as it was before the commit.
I propose to apply this patch and then to think how to resolve this race
condition, which seems to take place in 18 and master as well.

I updated the patch by improving some comments as suggested by Amit.

With best regards,
Vitaly

Hayato Kuroda (Fujitsu)

kuroda.hayato@fujitsu.com

6 months ago

In reply to: Vitaly Davydov (#8)

RE: Newly created replication slot may be invalidated by checkpoint

Dear Vitaly,

I propose to apply this patch and then to think how to resolve this race
condition, which seems to take place in 18 and master as well.

No, I think this invalidation can't happen in PG18/HEAD.
This is because in CheckpointGuts()->CheckPointReplicationSlots()->
ReplicationSlotsComputeRequiredLSN(), slots are examined one by one to determine
whether their restart_lsn has advanced since the last check. If any slot has
advanced, protection is applied starting from the oldest restart_lsn.
Crucially, this check is performed before WAL removal. The function call was
introduced in commit ca307d5cec.

Further analysis shows that it is also safe if a slot is being created and WAL
advances after CheckpointGuts() but before the removal segments are determined.
In this case the restart_lsn points the CHECKPOINT_REDO generated by the current
CHECKPOINT. This and later records won't be discarded.

Best regards,
Hayato Kuroda
FUJITSU LIMITED

#10

Hayato Kuroda (Fujitsu)

kuroda.hayato@fujitsu.com

5 months ago

In reply to: Hayato Kuroda (Fujitsu) (#9)

RE: Newly created replication slot may be invalidated by checkpoint

Dear Vitaly,

Would you have enough time to work on and fix the issue?
One idea is to compute the required LSN by the system at the slot checkpoint. This
partially follows what PG18/HEAD does but seems hacky and difficult to accept.

Best regards,
Hayato Kuroda
FUJITSU LIMITED

#11

Vitaly Davydov

v.davydov@postgrespro.ru

5 months ago

In reply to: Hayato Kuroda (Fujitsu) (#10)

RE: Newly created replication slot may be invalidated by checkpoint

Dear Hayato,

Would you have enough time to work on and fix the issue?
One idea is to compute the required LSN by the system at the slot checkpoint. This
partially follows what PG18/HEAD does but seems hacky and difficult to accept.

I'm working on the issue. Give me, please, a couple of days to finalize my work.

In short, I think to call ReplicationSlotsComputeRequiredLSN() right before
slotsMinReqLSN assignment in CreateCheckPoint in 17 and earlier versions. At
this moment, we already have a new redo lsn. I consider, that the WAL
reservation happens when we assign restart_lsn to a slot. Taking into account
this consideration, I distinguish two cases - WAL reservation happens before or
after new redo ptr assignment. If we reserve the WAL after new redo ptr, it will
protect the slot's reservation, as you've mentioned. The problem appears, when
we reserve the WAL before a new redo ptr, but the update of
XLogCtl->replicationSlotMinLSN was not yet hapenned. When we assign
slotsMinReqLSN, we use XLogCtl->replicationSlotMinLSN. The call of
ReplicationSlotsComputeRequiredLSN before slotsMinReqLSN assignment can help.
It will be guaranteed, that those slots with WAL reservation before a new redo
ptr will be protected by slotsMinReqLSN, but slots with wal reservation after
a new redo ptr will be protected by the redo ptr. I think it is about the same
as you proposed.

These reasonings are applied to physical slots, but it seems to be ok for
logical slots. One moment, I'm not sure, when we create a logical slot in
recovery. In this case, GetXLogReplayRecPtr() is used. I'm not sure, that
redo ptr will protect such slot in CreateRestartPoint.

With best regards,
Vitaly

#12

Hayato Kuroda (Fujitsu)

kuroda.hayato@fujitsu.com

5 months ago

In reply to: Vitaly Davydov (#11)

RE: Newly created replication slot may be invalidated by checkpoint

Dear Vitaly,

Would you have enough time to work on and fix the issue?
One idea is to compute the required LSN by the system at the slot checkpoint.

This

partially follows what PG18/HEAD does but seems hacky and difficult to accept.

I'm working on the issue. Give me, please, a couple of days to finalize my work.

Oh, sorry. I was rude.

In short, I think to call ReplicationSlotsComputeRequiredLSN() right before
slotsMinReqLSN assignment in CreateCheckPoint in 17 and earlier versions. At
this moment, we already have a new redo lsn. I consider, that the WAL
reservation happens when we assign restart_lsn to a slot. Taking into account
this consideration, I distinguish two cases - WAL reservation happens before or
after new redo ptr assignment. If we reserve the WAL after new redo ptr, it will
protect the slot's reservation, as you've mentioned. The problem appears, when
we reserve the WAL before a new redo ptr, but the update of
XLogCtl->replicationSlotMinLSN was not yet hapenned. When we assign
slotsMinReqLSN, we use XLogCtl->replicationSlotMinLSN. The call of
ReplicationSlotsComputeRequiredLSN before slotsMinReqLSN assignment can
help.
It will be guaranteed, that those slots with WAL reservation before a new redo
ptr will be protected by slotsMinReqLSN, but slots with wal reservation after
a new redo ptr will be protected by the redo ptr. I think it is about the same
as you proposed.

Per my understanding, this happened because there is a lag that restart_lsn of
the slot is set, and it is protected by the system. Your idea is to ensure the
restart_lsn is protected by the system before obtaining on-memory LSN, right?

These reasonings are applied to physical slots, but it seems to be ok for
logical slots. One moment, I'm not sure, when we create a logical slot in
recovery. In this case, GetXLogReplayRecPtr() is used. I'm not sure, that
redo ptr will protect such slot in CreateRestartPoint.

I considered a reproducer for the logical slot on the standby instance. Similar
with the physical one, the injection point while reserving the WAL is used, and
it would be discarded by the restartpoint command.
One difference with physical is that invalidated slot does not retain, because
it is the ephemeral at that time.

After adding the fix [1]``` --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -7675,6 +7675,7 @@ CreateRestartPoint(int flags) MemSet(&CheckpointStats, 0, sizeof(CheckpointStats)); CheckpointStats.ckpt_start_t = GetCurrentTimestamp();, I confirmed my testcases are passed, but we should
understand more about the standby stuff.

[1]:
```
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7675,6 +7675,7 @@ CreateRestartPoint(int flags)
        MemSet(&CheckpointStats, 0, sizeof(CheckpointStats));
        CheckpointStats.ckpt_start_t = GetCurrentTimestamp();

+ XLogGetReplicationSlotMinimumLSN();
```

Best regards,
Hayato Kuroda
FUJITSU LIMITED

#13

Vitaly Davydov

v.davydov@postgrespro.ru

5 months ago

In reply to: Hayato Kuroda (Fujitsu) (#12)

RE: Newly created replication slot may be invalidated by checkpoint

Dear Hayato, All

On Friday, October 03, 2025 14:14 MSK, "Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote:

I'm working on the issue. Give me, please, a couple of days to finalize my work.

Oh, sorry. I was rude.

It is okay. I very appreciate your help.

Per my understanding, this happened because there is a lag that restart_lsn of
the slot is set, and it is protected by the system. Your idea is to ensure the
restart_lsn is protected by the system before obtaining on-memory LSN, right?

Not sure what you mean by on-memory LSN, but, the issue happens because we have
a lag between restart_lsn assignment and update of
XLogCtl->replicationSlotsMinLSN which is used to protect the WAL. Yes, I propose
to ensure that the protection happens when we assign restart_lsn. It seems to be
wrong that we invalidate slots by its restart_lsn but protect the wal for
slots using XLogCtl->replicationSlotsMinLSN.

Below I tried to write some summary and propose the patch which fixes the problem.

The issue was originally reported at [1]/messages/by-id/15922-68ca9280-4f-37de2c40@245457797 and it seems to appear in 17 and
earlier versions. The issue is not reproducible in 18+ versions.

The issue may appear when we create a persistent slot during checkpoint. The WAL
reservation in slots happens in ReplicationSlotReserveWal and executed in three
steps:
1. Assignment of slot->data.restart_lsn
2. Update of XLogCtl->replicationSlotMinLSN
3. Check if WAL segments at restart_lsn are removed, go to step 1 if removed.

When the checkpointer calculates the oldest lsn which is used as the lsn horizon
when removing old WAL segments, it takes XLogCtl->replicationSlotMinLSN.
There is a race condition may happen when slot's restart_lsn is already assigned
but XLogCtl->replicationSlotMinLSN is not updated yet. Consider the following
scenario with two processes executing in parallel (checkpointer and backend,
where a new slot is creating):

1. Assign of slot.data->restart_lsn in the backend from GetRedoRecPtr()
2. Assign a new redo LSN in the checkpointer
3. Assign slotsMinReqLSN from XLogCtl->replicationSlotMinLSN in the checkpointer
4. Update of XLogCtl->replicationSlotMinLSN in the backend.
5. Calculation of the WAL horizon for old segments cleanup (KeepLogSeg before
the call of InvalidateObsoleteReplicationSlots) in the checkpointer.
6. Exit from ReplicationSlotReserveWal in the backend, once the reserved WAL
segments are not removed at this moment (XLogGetLastRemovedSegno() < segno).
7. Call of InvalidateObsoleteReplicationSlots in the checkpointer will invalidate
the creating slot because its restart_lsn will be less than the calculated
WAL horizon (the min of slotsMinReqLSN and RedoRecPtr).

To fix the issue I propose to consider the following assumptions:
1. Slots do not cross WAL segment borders backward when moving.
2. Old WAL segments are removed in the checkpointer only.
3. The following LSNs are initially assigned during slot reservation:
- GetRedoRecPtr() for physical slots
- GetXLogInsertRecPtr() for logical slots
- GetXLogReplayRecPtr() for logical slots in recovery

Taking into account these assumptions, I would like to propose the fix [2]v3-0002-Fix-invalidation-when-slot-is-created-during-checkpo.patch.

There is an idea to think that the WAL reservation happens when we assign
restart_lsn to the slot. The call of ReplicationSlotsComputeRequiredLSN() is not
required to be executed immediately in the backend where the slot is creating
concurrently. In the checkpointer we have to guarantee that we do WAL horizon
calculations based on actual values of restart_lsn of existing slots. If
we call ReplicationSlotsComputeRequiredLSN() in the checkpointer after a new
REDO assignment and before the calculation of WAL horizon, the value of
XLogCtl->replicationSlotMinLSN will correctly define the oldest LSN for existing
slots. If the WAL reservation by a new slot happens during checkpoint before
a new REDO assignment, it is guaranteed that its restart_lsn will be accounted
when we call ReplicationSlotsComputeRequiredLSN() in the checkpointer. If the
WAL reservation happens after a new redo LSN assignment, the slot's restart_lsn
will be protected by this new redo LSN, because this LSN will be lesser or equal
to initial restart_lsn (see assumption 3).

There is one subtle thing. Once, the operation of restart_lsn assignment is not
an atomic, the following scenario may happen theoretically:
1. Read GetRedoRecPtr() in the backend (ReplicationSlotReserveWal)
2. Assign a new redo LSN in the checkpointer
3. Call ReplicationSlotsComputeRequiredLSN() in the checkpointer
3. Assign the old redo LSN to restart_lsn

In this scenario, the restart_lsn will point to a previous redo LSN and it will
be not protected by the new redo LSN. This scenario is unlikely, but it can
happen theoretically. I have no ideas how to deal with it, except of assigning
restart_lsn under XLogCtl->info_lck lock to avoid concurrent modification of
XLogCtl->RecoRecPtr until it is assigned to restart_lsn of a creating slot.

In case of recovery, when GetXLogReplayRecPtr() is used, the protection by
redo LSN seems to work as well, because a new redo LSN is taken from the latest
replayed checkpoint. Thus, it is guaranteed that GetXLogReplayRecPtr() will not
be less than the new redo LSN, if it is called right after assignment of redo
LSN in CreateRestartPoint().

I also think that the cycle in ReplicationSlotReserveWal which checks for the
current restart_lsn to be greater than the XLogGetLastRemovedSegno() is not
necessary because it is guaranteed that the assigned restart_lsn will be
protected. Lets keep it unchanged until this suggestion will be clarified.

The proposed solution doesn't break the fix in ca307d5cec (unexpected removal of
old WAL segments after checkpoint). Once we call
ReplicationSlotsComputeRequiredLSN() before CheckPointReplicationSlots(), the
saved to disk restart_lsn values of existing slots will be not less than
the previously computed XLogCtl->replicationSlotMinLSN. They just may be
advanced to greater values concurrently. For new slots with restart_lsn
assignment after ReplicationSlotsComputeRequiredLSN(), the current redo LSN will
protect the WAL.

The fix for REL_17_STABLE is in [2]v3-0002-Fix-invalidation-when-slot-is-created-during-checkpo.patch. The regression test is in [3]v3-0001-Newly-created-replication-slot-may-be-invalidated-by.patch.

I apologize for so long summary.

[1]: /messages/by-id/15922-68ca9280-4f-37de2c40@245457797
[2]: v3-0002-Fix-invalidation-when-slot-is-created-during-checkpo.patch
[3]: v3-0001-Newly-created-replication-slot-may-be-invalidated-by.patch

With best regards,
Vitaly

#14

Hayato Kuroda (Fujitsu)

kuroda.hayato@fujitsu.com

5 months ago

In reply to: Vitaly Davydov (#13)

RE: Newly created replication slot may be invalidated by checkpoint

Dear Vitaly,

Per my understanding, this happened because there is a lag that restart_lsn of
the slot is set, and it is protected by the system. Your idea is to ensure the
restart_lsn is protected by the system before obtaining on-memory LSN, right?

Not sure what you mean by on-memory LSN, but, the issue happens because we
have
a lag between restart_lsn assignment and update of
XLogCtl->replicationSlotsMinLSN which is used to protect the WAL.

Sorry I should say "before obtaining replicationSlotMinLSN".

Yes, I
propose
to ensure that the protection happens when we assign restart_lsn. It seems to be
wrong that we invalidate slots by its restart_lsn but protect the wal for
slots using XLogCtl->replicationSlotsMinLSN.

Seems valid. There is another corner case that another restart_lsn can be set in-between,
but they have larger LSN than RedoRecPtr, right?

Below I tried to write some summary and propose the patch which fixes the
problem.

Sorry but it is too long to understand properly for me :-(.

There is one subtle thing. Once, the operation of restart_lsn assignment is not
an atomic, the following scenario may happen theoretically:
1. Read GetRedoRecPtr() in the backend (ReplicationSlotReserveWal)
2. Assign a new redo LSN in the checkpointer
3. Call ReplicationSlotsComputeRequiredLSN() in the checkpointer
3. Assign the old redo LSN to restart_lsn

In this scenario, the restart_lsn will point to a previous redo LSN and it will
be not protected by the new redo LSN. This scenario is unlikely, but it can
happen theoretically. I have no ideas how to deal with it, except of assigning
restart_lsn under XLogCtl->info_lck lock to avoid concurrent modification of
XLogCtl->RecoRecPtr until it is assigned to restart_lsn of a creating slot.

Oh, your point is there is another race condition, right? Do you have the reproducer for it?

Best regards,
Hayato Kuroda
FUJITSU LIMITED

#15

Vitaly Davydov

v.davydov@postgrespro.ru

4 months ago

In reply to: Hayato Kuroda (Fujitsu) (#14)

RE: Newly created replication slot may be invalidated by checkpoint

Dear Hayato,

On Tuesday, October 07, 2025 14:53 MSK, "Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote:

Yes, I
propose
to ensure that the protection happens when we assign restart_lsn. It seems to be
wrong that we invalidate slots by its restart_lsn but protect the wal for
slots using XLogCtl->replicationSlotsMinLSN.

Seems valid. There is another corner case that another restart_lsn can be set in-between,
but they have larger LSN than RedoRecPtr, right?

Here we talk about new creating slots. There should be no other processes that
can change restart_lsn during slot creation. Once, the slot is successfully
created it can be advanced to the greater values only. During advance, the old
restart_lsn will protect the slot, because it will be taken into account in
the checkpoint.

There is one subtle thing. Once, the operation of restart_lsn assignment is not
an atomic, the following scenario may happen theoretically:
1. Read GetRedoRecPtr() in the backend (ReplicationSlotReserveWal)
2. Assign a new redo LSN in the checkpointer
3. Call ReplicationSlotsComputeRequiredLSN() in the checkpointer
3. Assign the old redo LSN to restart_lsn

In this scenario, the restart_lsn will point to a previous redo LSN and it will
be not protected by the new redo LSN. This scenario is unlikely, but it can
happen theoretically. I have no ideas how to deal with it, except of assigning
restart_lsn under XLogCtl->info_lck lock to avoid concurrent modification of
XLogCtl->RecoRecPtr until it is assigned to restart_lsn of a creating slot.

Oh, your point is there is another race condition, right? Do you have the reproducer for it?

The attached test for the master branch demonstrates a possible but very
rare race condition, because we read and assign slot's restart_lsn from redo rec
ptr non-atomically. The proposed scenario (see above) seems to be not complete.

With best regards,
Vitaly

#16

Hayato Kuroda (Fujitsu)

kuroda.hayato@fujitsu.com

4 months ago

In reply to: Vitaly Davydov (#15)

RE: Newly created replication slot may be invalidated by checkpoint

Dear Vitaly,

Thanks for sharing the reproducer. Agreed this is a real but minor issue.

Firstly I considered an ad-hoc way, which sets the candidate restart_lsn as
replicationSlotMinLSN before using as slot->data.restart_lsn. PSA the idea.
It can fix your reproducer.

However it still has a corner case; Assuming the checkpointer finishes computing
removal WALs before setting the restart_lsn to system_wide one, and checkpointer
tries to invalidate slots after restart_lsn of the slot is set.
In this case the checkpointer detects the creating slot and its restart_lsn is
older than oldest one. The checkpointer terminates the backend and
invalidates the slot.

This can be reproduced by moving 1) checkpoint-before-old-wal-removal to
in-between KeepLogSeg() and InvalidateObsoleteReplicationSlots(), and
2) physical-slot-reserve-wal-get-redo before the XLogMaybeSetReplicationSlotMinimumLSN().

For now, I cannot come up with the good fix. How about others?

BTW, can you update meson.build as well when you add .pl test code? Otherwise, it
cannot be run for meson builders.

Best regards,
Hayato Kuroda
FUJITSU LIMITED

#17

Alexander Korotkov

aekorotkov@gmail.com

4 months ago

In reply to: Vitaly Davydov (#13)

Re: Newly created replication slot may be invalidated by checkpoint

On Mon, Oct 6, 2025 at 6:46 PM Vitaly Davydov <v.davydov@postgrespro.ru> wrote:

There is one subtle thing. Once, the operation of restart_lsn assignment is not
an atomic, the following scenario may happen theoretically:
1. Read GetRedoRecPtr() in the backend (ReplicationSlotReserveWal)
2. Assign a new redo LSN in the checkpointer
3. Call ReplicationSlotsComputeRequiredLSN() in the checkpointer
3. Assign the old redo LSN to restart_lsn

In this scenario, the restart_lsn will point to a previous redo LSN and it will
be not protected by the new redo LSN. This scenario is unlikely, but it can
happen theoretically. I have no ideas how to deal with it, except of assigning
restart_lsn under XLogCtl->info_lck lock to avoid concurrent modification of
XLogCtl->RecoRecPtr until it is assigned to restart_lsn of a creating slot.

In case of recovery, when GetXLogReplayRecPtr() is used, the protection by
redo LSN seems to work as well, because a new redo LSN is taken from the latest
replayed checkpoint. Thus, it is guaranteed that GetXLogReplayRecPtr() will not
be less than the new redo LSN, if it is called right after assignment of redo
LSN in CreateRestartPoint().

Thank you for highlighting this scenario. I've reviewed it. I think
we could avoid it by covering appropriate parts of
ReplicationSlotReserveWal() and Create{Check|Restart}Point() by a new
LWLock. The draft patch is attached. What do you think?

------
Regards,
Alexander Korotkov
Supabase

#18

Amit Kapila

amit.kapila16@gmail.com

4 months ago

In reply to: Alexander Korotkov (#17)

Re: Newly created replication slot may be invalidated by checkpoint

On Wed, Nov 5, 2025 at 3:48 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Mon, Oct 6, 2025 at 6:46 PM Vitaly Davydov <v.davydov@postgrespro.ru> wrote:

There is one subtle thing. Once, the operation of restart_lsn assignment is not
an atomic, the following scenario may happen theoretically:
1. Read GetRedoRecPtr() in the backend (ReplicationSlotReserveWal)
2. Assign a new redo LSN in the checkpointer
3. Call ReplicationSlotsComputeRequiredLSN() in the checkpointer
3. Assign the old redo LSN to restart_lsn

In this scenario, the restart_lsn will point to a previous redo LSN and it will
be not protected by the new redo LSN. This scenario is unlikely, but it can
happen theoretically. I have no ideas how to deal with it, except of assigning
restart_lsn under XLogCtl->info_lck lock to avoid concurrent modification of
XLogCtl->RecoRecPtr until it is assigned to restart_lsn of a creating slot.

In case of recovery, when GetXLogReplayRecPtr() is used, the protection by
redo LSN seems to work as well, because a new redo LSN is taken from the latest
replayed checkpoint. Thus, it is guaranteed that GetXLogReplayRecPtr() will not
be less than the new redo LSN, if it is called right after assignment of redo
LSN in CreateRestartPoint().

Thank you for highlighting this scenario. I've reviewed it. I think
we could avoid it by covering appropriate parts of
ReplicationSlotReserveWal() and Create{Check|Restart}Point() by a new
LWLock. The draft patch is attached. What do you think?

The fix seems to be only provided for bank branches, but IIUC the
problem can happen in HEAD as well. In Head, how about acquiring
ReplicationSlotAllocationLock in Exclusive mode during
ReplicationSlotReserveWal? This lock is acquired in SHARE mode in
CheckPointReplicationSlots. So, this should make our calculations
correct and avoid invalidating the newly created slot.

I feel with the proposed patches for back branches, the code is
deviating too much and also makes it a bit complicated, which means it
could be difficult to maintain it in the future. Can we consider
reverting the original fix 2090edc6f32f652a2c995ca5f7e65748ae1e4c5d
and make it the same as we did in HEAD
ca307d5cec90a4fde62a50fafc8ab607ff1d8664? I know this would lead to
ABI breakage, but only for extensions using sizeof(ReplicationSlot),
if any. We can try to identify how many extensions rely on
sizeof(ReplicationSlot) and then decide accordingly? We recently did
something similar for another backbranch fix [1]https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=45c357e0e85d2dffe7af5440806150124a725a01 which requires adding
members at the end of structure.

OTOH, if we really want to go in the direction of deviating the back
branch code further, then we can review your fix, but I am hesitant to
go in that direction due to additional complexity and maintenance
burden.

[1]: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=45c357e0e85d2dffe7af5440806150124a725a01

--
With Regards,
Amit Kapila.

#19

Amit Kapila

amit.kapila16@gmail.com

4 months ago

In reply to: Amit Kapila (#18)

Re: Newly created replication slot may be invalidated by checkpoint

On Mon, Nov 10, 2025 at 2:41 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Nov 5, 2025 at 3:48 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Mon, Oct 6, 2025 at 6:46 PM Vitaly Davydov <v.davydov@postgrespro.ru> wrote:

There is one subtle thing. Once, the operation of restart_lsn assignment is not
an atomic, the following scenario may happen theoretically:
1. Read GetRedoRecPtr() in the backend (ReplicationSlotReserveWal)
2. Assign a new redo LSN in the checkpointer
3. Call ReplicationSlotsComputeRequiredLSN() in the checkpointer
3. Assign the old redo LSN to restart_lsn

In this scenario, the restart_lsn will point to a previous redo LSN and it will
be not protected by the new redo LSN. This scenario is unlikely, but it can
happen theoretically. I have no ideas how to deal with it, except of assigning
restart_lsn under XLogCtl->info_lck lock to avoid concurrent modification of
XLogCtl->RecoRecPtr until it is assigned to restart_lsn of a creating slot.

In case of recovery, when GetXLogReplayRecPtr() is used, the protection by
redo LSN seems to work as well, because a new redo LSN is taken from the latest
replayed checkpoint. Thus, it is guaranteed that GetXLogReplayRecPtr() will not
be less than the new redo LSN, if it is called right after assignment of redo
LSN in CreateRestartPoint().

Thank you for highlighting this scenario. I've reviewed it. I think
we could avoid it by covering appropriate parts of
ReplicationSlotReserveWal() and Create{Check|Restart}Point() by a new
LWLock. The draft patch is attached. What do you think?

The fix seems to be only provided for bank branches, but IIUC the
problem can happen in HEAD as well. In Head, how about acquiring
ReplicationSlotAllocationLock in Exclusive mode during
ReplicationSlotReserveWal? This lock is acquired in SHARE mode in
CheckPointReplicationSlots. So, this should make our calculations
correct and avoid invalidating the newly created slot.

We need to check whether a similar change is required in
reserve_wal_for_local_slot() as well for sync slots.

--
With Regards,
Amit Kapila.

#20

Hayato Kuroda (Fujitsu)

kuroda.hayato@fujitsu.com

4 months ago

In reply to: Amit Kapila (#18)

RE: Newly created replication slot may be invalidated by checkpoint

Dear Amit, Alexander,

The fix seems to be only provided for bank branches, but IIUC the
problem can happen in HEAD as well.

Yes, I confirmed this can happen. [1]/messages/by-id/30938d-68ffa500-b-6328cc00@139466859 can be applied atop HEAD and
reproduce the invalidation.

In Head, how about acquiring
ReplicationSlotAllocationLock in Exclusive mode during
ReplicationSlotReserveWal? This lock is acquired in SHARE mode in
CheckPointReplicationSlots. So, this should make our calculations
correct and avoid invalidating the newly created slot.

I feel this can fix the issue. The idea can serialize CheckpointReplicationSlots()
and ReplicationSlotReserveWal(), and upcoming XLogGetReplicationSlotMinimumLSN()
can take care the newly created slot.

Also, since CheckpointReplicationSlots() is called after setting RedoRecPtr, it
is OK if ReplicationSlotReserveWal() is called after the CheckpointReplicationSlots().
In this case the candidate restart_lsn has larger LSN than RedoRecPtr and won't
be removed in the checkpoint.

I feel with the proposed patches for back branches, the code is
deviating too much and also makes it a bit complicated, which means it
could be difficult to maintain it in the future. Can we consider
reverting the original fix 2090edc6f32f652a2c995ca5f7e65748ae1e4c5d
and make it the same as we did in HEAD
ca307d5cec90a4fde62a50fafc8ab607ff1d8664?

Is it allowed to add new LWLocks for backbranched? I'm afraid some other extensions
might be affected.

[1]: /messages/by-id/30938d-68ffa500-b-6328cc00@139466859

Best regards,
Hayato Kuroda
FUJITSU LIMITED

#21

Hayato Kuroda (Fujitsu)

kuroda.hayato@fujitsu.com

4 months ago

In reply to: Amit Kapila (#19)

#22

Vitaly Davydov

v.davydov@postgrespro.ru

4 months ago