logical: fix recomputation required LSN on restart_lsn-only advancement

Started by Chao Li, 2 months ago, 5 messages, hackers

li.evan.chao@gmail.com

2 months ago

Hi,

While reading the logical replication code, I found an issue in LogicalConfirmReceivedLocation().

In LogicalConfirmReceivedLocation(), updated_restart is tracked independently from updated_xmin, and the slot is marked dirty and saved when either one changed. But after that, ReplicationSlotsComputeRequiredLSN() is still only called inside "if (updated_xmin)".

So for the restart-only case:

* updated_restart = true
* updated_xmin = false
* ReplicationSlotSave() runs
* ReplicationSlotsComputeRequiredLSN() does not run because updated_xmin is false

That means the global retention point managed by XLogSetReplicationSlotMinimumLSN() can stay stale until some later unrelated event recomputes it. Since ReplicationSlotsComputeRequiredLSN() derives the global minimum from each slot's restart_lsn, skipping it after a restart-only advance can retain excess WAL and may lead to WAL bloat.

This patch fixes the problem by moving ReplicationSlotsComputeRequiredLSN() under “if (updated_restart)”.
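
To make the restart-only case concrete, here is a minimal toy model in C. It is only a sketch and not the actual PostgreSQL code: ToySlot, toy_required_lsn, toy_compute_required_lsn() and toy_confirm() are hypothetical names, XLogRecPtr is simplified to a plain integer, and the "patched" flag exists only to compare the two conditions side by side.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins; the real definitions live in the PostgreSQL tree. */
typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

/* Hypothetical, stripped-down slot: only the field this sketch needs. */
typedef struct ToySlot
{
    XLogRecPtr  restart_lsn;
} ToySlot;

/*
 * Cached global minimum. Stands in for the value that
 * ReplicationSlotsComputeRequiredLSN() hands to
 * XLogSetReplicationSlotMinimumLSN().
 */
static XLogRecPtr toy_required_lsn = InvalidXLogRecPtr;

/* Toy analogue of the recomputation: minimum valid restart_lsn over all slots. */
static void
toy_compute_required_lsn(const ToySlot *slots, size_t nslots)
{
    XLogRecPtr  min = InvalidXLogRecPtr;

    for (size_t i = 0; i < nslots; i++)
    {
        XLogRecPtr  lsn = slots[i].restart_lsn;

        if (lsn != InvalidXLogRecPtr &&
            (min == InvalidXLogRecPtr || lsn < min))
            min = lsn;
    }
    toy_required_lsn = min;
}

/*
 * Toy analogue of the tail of LogicalConfirmReceivedLocation().
 * updated_xmin is passed in by the caller because xmin handling is
 * outside the scope of this sketch.
 */
static void
toy_confirm(ToySlot *slots, size_t nslots, size_t me,
            XLogRecPtr new_restart, bool updated_xmin, bool patched)
{
    bool        updated_restart = false;

    if (new_restart != InvalidXLogRecPtr &&
        new_restart > slots[me].restart_lsn)
    {
        slots[me].restart_lsn = new_restart;
        updated_restart = true;
    }

    /* The real code marks the slot dirty and saves it if either flag is set. */

    if (patched)
    {
        /* Proposed condition: recompute when restart_lsn moved. */
        if (updated_restart)
            toy_compute_required_lsn(slots, nslots);
    }
    else
    {
        /* Current condition: recompute only when xmin moved. */
        if (updated_xmin)
            toy_compute_required_lsn(slots, nslots);
    }
}
```

In this model, with two slots at 100 and 200 and a cached minimum of 100, a restart-only advance of the first slot to 150 (updated_xmin = false) leaves the cached minimum at 100 under the current condition. Repeating the advance to 180 with the patched condition recomputes the minimum to 180.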

Looks like this issue has been there for a long time, so if this analysis is correct, it may also be worth back-patching.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

Hu Xunqi

huxunqi.08@gmail.com

2 months ago

In reply to: Chao Li (#1)

Re: logical: fix recomputation required LSN on restart_lsn-only advancement

On Tue, Apr 21, 2026 at 10:16 AM Chao Li <li.evan.chao@gmail.com> wrote:

Hi,

While reading logical replication code, I found an issue in
LogicalConfirmReceivedLocation().

In LogicalConfirmReceivedLocation(), updated_restart is tracked
independently from updated_xmin, and the slot is marked dirty and saved
when either one changed. But after that,
ReplicationSlotsComputeRequiredLSN() is still only called inside "if
(updated_xmin)".

So for the restart-only case:

* updated_restart = true
* updated_xmin = false
* ReplicationSlotSave() runs
* ReplicationSlotsComputeRequiredLSN() does not run because updated_xmin
is false

That means the global retention point managed by
XLogSetReplicationSlotMinimumLSN() can stay stale until some later
unrelated event recomputes it. Since ReplicationSlotsComputeRequiredLSN()
derives the global minimum from slot restart_lsn, skipping it after a
restart-only advance can retain excess WAL and may lead to WAL bloat.

This patch fixes the problem by moving
ReplicationSlotsComputeRequiredLSN() under “if (updated_restart)”.

Looks like this issue has been there for a long time, so if this analysis
is correct, it may also be worth back-patching.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

I think this change is reasonable.
This makes the recomputation condition match the state it actually depends
on.

Regards,
Xunqi Hu

Antonin Houska

ah@cybertec.at

2 months ago

In reply to: Chao Li (#1)

Re: logical: fix recomputation required LSN on restart_lsn-only advancement

Chao Li <li.evan.chao@gmail.com> wrote:

While reading logical replication code, I found an issue in LogicalConfirmReceivedLocation().

In LogicalConfirmReceivedLocation(), updated_restart is tracked independently from updated_xmin, and the slot is marked dirty and saved when either one changed. But after that, ReplicationSlotsComputeRequiredLSN() is still only called inside "if (updated_xmin)".

So for the restart-only case:

* updated_restart = true
* updated_xmin = false
* ReplicationSlotSave() runs
* ReplicationSlotsComputeRequiredLSN() does not run because updated_xmin is false

That means the global retention point managed by XLogSetReplicationSlotMinimumLSN() can stay stale until some later unrelated event recomputes it. Since ReplicationSlotsComputeRequiredLSN() derives the global minimum from slot restart_lsn, skipping it after a restart-only advance can retain excess WAL and may lead to WAL bloat.

This patch fixes the problem by moving ReplicationSlotsComputeRequiredLSN() under “if (updated_restart)”.

FYI, this overlaps with another post in the REPACK thread [1].

Looks like this issue has been there for a long time, so if this analysis is correct, it may also be worth back-patching.

As REPACK in PG 19 does not let xmin advance (that should be fixed in the
future), I think it makes sense to apply [1] to v19. However, during logical
replication, xmin (IMO) gets updated rather often, so the problem should not
be that severe in earlier versions.

[1]: /messages/by-id/TYRPR01MB14195633567DA00ABD42570B794592@TYRPR01MB14195.jpnprd01.prod.outlook.com

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

Chao Li

li.evan.chao@gmail.com

2 months ago

In reply to: Antonin Houska (#3)

Re: logical: fix recomputation required LSN on restart_lsn-only advancement

On Apr 21, 2026, at 15:09, Antonin Houska <ah@cybertec.at> wrote:

Chao Li <li.evan.chao@gmail.com> wrote:

While reading logical replication code, I found an issue in LogicalConfirmReceivedLocation().

In LogicalConfirmReceivedLocation(), updated_restart is tracked independently from updated_xmin, and the slot is marked dirty and saved when either one changed. But after that, ReplicationSlotsComputeRequiredLSN() is still only called inside "if (updated_xmin)".

So for the restart-only case:

* updated_restart = true
* updated_xmin = false
* ReplicationSlotSave() runs
* ReplicationSlotsComputeRequiredLSN() does not run because updated_xmin is false

That means the global retention point managed by XLogSetReplicationSlotMinimumLSN() can stay stale until some later unrelated event recomputes it. Since ReplicationSlotsComputeRequiredLSN() derives the global minimum from slot restart_lsn, skipping it after a restart-only advance can retain excess WAL and may lead to WAL bloat.

This patch fixes the problem by moving ReplicationSlotsComputeRequiredLSN() under “if (updated_restart)”.

FYI, this overlaps with another post in the REPACK thread [1].

Looks like this issue has been there for a long time, so if this analysis is correct, it may also be worth back-patching.

As REPACK in PG 19 does not let xmin advance (that should be fixed in the
future), I think it makes sense to apply [1] to v19. However, during logical
replication, xmin (IMO) gets updated rather often, so the problem should not
be that severe in earlier versions.

[1] /messages/by-id/TYRPR01MB14195633567DA00ABD42570B794592@TYRPR01MB14195.jpnprd01.prod.outlook.com

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

Thanks for pointing that out. I will review that patch.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

Alvaro Herrera

alvherre@2ndquadrant.com

25 days ago

In reply to: Chao Li (#1)

Re: logical: fix recomputation required LSN on restart_lsn-only advancement

On 2026-Apr-21, Chao Li wrote:

While reading logical replication code, I found an issue in
LogicalConfirmReceivedLocation().

In LogicalConfirmReceivedLocation(), updated_restart is tracked
independently from updated_xmin, and the slot is marked dirty and
saved when either one changed. But after that,
ReplicationSlotsComputeRequiredLSN() is still only called inside "if
(updated_xmin)".

Have you seen this causing issues in any cases beyond REPACK? I'm
wondering about your suggestion to backpatch this change:

Looks like this issue has been there for a long time, so if this
analysis is correct, it may also be worth back-patching.

If REPACK is the only affected party, then we don't need to care; as
Antonin said, the xmin advances frequently enough in other cases, so it
shouldn't normally be a problem ...

--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
