Fix slotsync worker busy loop causing repeated log messages

Started by Fujii Masao · about 2 months ago · 14 messages · pgsql-hackers
#1 Fujii Masao
masao.fujii@gmail.com

Hi,

While testing the replication slot sync feature, I observed that the slotsync
worker can suddenly start emitting four log messages every 200 ms, even when
both primary and standby are idle. For example:

--------------------------
2026-02-27 23:06:46.123 JST [80789] LOG: starting logical decoding
for slot "logical_slot"
2026-02-27 23:06:46.123 JST [80789] DETAIL: Streaming transactions
committing after 0/03000140, reading WAL from 0/03000098.
2026-02-27 23:06:46.123 JST [80789] LOG: logical decoding found
consistent point at 0/03000098
2026-02-27 23:06:46.123 JST [80789] DETAIL: There are no running transactions.
2026-02-27 23:06:46.330 JST [80789] LOG: starting logical decoding
for slot "logical_slot"
2026-02-27 23:06:46.330 JST [80789] DETAIL: Streaming transactions
committing after 0/03000140, reading WAL from 0/03000098.
2026-02-27 23:06:46.330 JST [80789] LOG: logical decoding found
consistent point at 0/03000098
2026-02-27 23:06:46.330 JST [80789] DETAIL: There are no running transactions.
2026-02-27 23:06:46.536 JST [80789] LOG: starting logical decoding
for slot "logical_slot"
2026-02-27 23:06:46.536 JST [80789] DETAIL: Streaming transactions
committing after 0/03000140, reading WAL from 0/03000098.
2026-02-27 23:06:46.536 JST [80789] LOG: logical decoding found
consistent point at 0/03000098
2026-02-27 23:06:46.536 JST [80789] DETAIL: There are no running transactions.
--------------------------

These messages repeat roughly every 200 ms.

I created the replication slot sync environment as follows:

--------------------------
initdb -D data --encoding=UTF8 --locale=C
cat <<EOF >> data/postgresql.conf
wal_level = logical
synchronized_standby_slots = 'physical_slot'
EOF
pg_ctl -D data start
pg_receivewal --create-slot -S physical_slot
pg_recvlogical --create-slot -S logical_slot -P pgoutput --enable-failover -d postgres
psql -c "CREATE PUBLICATION mypub"

pg_basebackup -D sby1 -c fast -R -S physical_slot -d "dbname=postgres"
cat <<EOF >> sby1/postgresql.conf
port = 5433
sync_replication_slots = on
hot_standby_feedback = on
EOF
pg_ctl -D sby1 start
--------------------------

After that, I executed the following, and then the issue occurred:

--------------------------
SELECT pg_logical_emit_message(true, 'abc', 'xyz');

SELECT pg_replication_slot_advance('logical_slot', max(lsn)) FROM
pg_logical_slot_peek_binary_changes('logical_slot', NULL, NULL,
'proto_version', '3', 'publication_names', 'mypub', 'messages',
'true', 'binary', 'false', 'streaming', 'false');

-- Wait for the log message: newly created replication slot
-- "logical_slot" is sync-ready now

SELECT pg_replication_slot_advance('logical_slot', max(lsn)) FROM
pg_logical_slot_peek_binary_changes('logical_slot', NULL, NULL,
'proto_version', '3', 'publication_names', 'mypub', 'messages',
'true', 'binary', 'false', 'streaming', 'false');
--------------------------

While the issue is happening, the failover logical slot shows:

[PRIMARY]
=# SELECT slot_name, restart_lsn, confirmed_flush_lsn from
pg_replication_slots where slot_name = 'logical_slot';
slot_name | restart_lsn | confirmed_flush_lsn
--------------+-------------+---------------------
logical_slot | 0/03000140 | 0/03000140

[STANDBY]
=# SELECT slot_name, restart_lsn, confirmed_flush_lsn from
pg_replication_slots where slot_name = 'logical_slot';
slot_name | restart_lsn | confirmed_flush_lsn
--------------+-------------+---------------------
logical_slot | 0/03000098 | 0/03000140

confirmed_flush_lsn matches on both servers, but restart_lsn differs.

Normally, the slotsync worker updates the standby slot using the primary's slot
state. However, when confirmed_flush_lsn matches but restart_lsn does not,
the worker does not actually update the standby slot. Despite that, the current
code of update_local_synced_slot() appears to treat this situation as if
an update occurred. As a result, the worker sleeps only for the minimum
interval (200 ms) before retrying. In the next cycle, it again assumes
an update happened, and continues looping with the short sleep interval,
causing the repeated logical decoding log messages. Based on a quick analysis,
this seems to be the root cause.

I think update_local_synced_slot() should return false (i.e., no update
happened) when confirmed_flush_lsn is equal but restart_lsn differs between
primary and standby. That would allow the worker to use the normal sleep
interval instead of the minimum one.

I've attached a PoC patch implementing this change.

Thoughts?

Regards,

--
Fujii Masao

Attachments:

v1-0001-Fix-slotsync-worker-busy-loop-causing-repeated-lo.patch (+5 −4)
#2 Amit Kapila
amit.kapila16@gmail.com
In reply to: Fujii Masao (#1)
Re: Fix slotsync worker busy loop causing repeated log messages

On Fri, Feb 27, 2026 at 8:34 PM Fujii Masao <masao.fujii@gmail.com> wrote:

Normally, the slotsync worker updates the standby slot using the primary's slot
state. However, when confirmed_flush_lsn matches but restart_lsn does not,
the worker does not actually update the standby slot. Despite that, the current
code of update_local_synced_slot() appears to treat this situation as if
an update occurred. As a result, the worker sleeps only for the minimum
interval (200 ms) before retrying. In the next cycle, it again assumes
an update happened, and continues looping with the short sleep interval,
causing the repeated logical decoding log messages. Based on a quick analysis,
this seems to be the root cause.

I think update_local_synced_slot() should return false (i.e., no update
happened) when confirmed_flush_lsn is equal but restart_lsn differs between
primary and standby.

We expect that in such a case update_local_synced_slot() should
advance local_slot's 'restart_lsn' via
LogicalSlotAdvanceAndCheckSnapState(), otherwise, it won't go in the
cheap code path next time. Normally, restart_lsn advancement should
happen when we process XLOG_RUNNING_XACTS and call
SnapBuildProcessRunningXacts(). In this particular case as both
restart_lsn and confirmed_flush_lsn are the same (0/03000140), the
machinery may not be processing XLOG_RUNNING_XACTS record. I have not
debugged the exact case yet but you can try by emitting some more
records on publisher, it should let the standby advance the slot. It
is possible that we can do something like you are proposing to silence
the LOG messages but we should know what is going on here.

--
With Regards,
Amit Kapila.

#3 Zhijie Hou (Fujitsu)
houzj.fnst@fujitsu.com
In reply to: Amit Kapila (#2)
RE: Fix slotsync worker busy loop causing repeated log messages

On Saturday, February 28, 2026 1:03 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Feb 27, 2026 at 8:34 PM Fujii Masao <masao.fujii@gmail.com> wrote:

Normally, the slotsync worker updates the standby slot using the
primary's slot state. However, when confirmed_flush_lsn matches but
restart_lsn does not, the worker does not actually update the standby
slot. Despite that, the current code of update_local_synced_slot()
appears to treat this situation as if an update occurred. As a result,
the worker sleeps only for the minimum interval (200 ms) before
retrying. In the next cycle, it again assumes an update happened, and
continues looping with the short sleep interval, causing the repeated
logical decoding log messages. Based on a quick analysis, this seems to be

the root cause.

I think update_local_synced_slot() should return false (i.e., no
update
happened) when confirmed_flush_lsn is equal but restart_lsn differs
between primary and standby.

We expect that in such a case update_local_synced_slot() should advance
local_slot's 'restart_lsn' via LogicalSlotAdvanceAndCheckSnapState(),
otherwise, it won't go in the cheap code path next time. Normally, restart_lsn
advancement should happen when we process XLOG_RUNNING_XACTS and
call SnapBuildProcessRunningXacts(). In this particular case as both
restart_lsn and confirmed_flush_lsn are the same (0/03000140), the
machinery may not be processing XLOG_RUNNING_XACTS record. I have not
debugged the exact case yet but you can try by emitting some more records
on publisher, it should let the standby advance the slot. It is possible that we
can do something like you are proposing to silence the LOG messages but we
should know what is going on here.

I reproduced and debugged this issue where a replication slot's restart_lsn
fails to advance. In my environment, I found it only occurs when a synced
slot first builds a consistent snapshot. The problematic code path is in
SnapBuildProcessRunningXacts():

if (builder->state < SNAPBUILD_CONSISTENT)
{
/* returns false if there's no point in performing cleanup just yet */
if (!SnapBuildFindSnapshot(builder, lsn, running))
return;
}

When a synced slot reaches consistency for the first time with no running
transactions, SnapBuildFindSnapshot() returns false, causing the function to
return without updating the candidate restart_lsn.

So, an alternative approach is to improve this logic by updating the candidate
restart_lsn in this case instead of returning early. See the attached patch for
details. This can fix the issue on my machine.

Best Regards,
Hou zj

Attachments:

v1-0001-Advance-restart_lsn-when-reaching-consistency-wit.patch (+19 −9)
#4 Amit Kapila
amit.kapila16@gmail.com
In reply to: Zhijie Hou (Fujitsu) (#3)
Re: Fix slotsync worker busy loop causing repeated log messages

On Tue, Mar 3, 2026 at 1:12 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Saturday, February 28, 2026 1:03 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Feb 27, 2026 at 8:34 PM Fujii Masao <masao.fujii@gmail.com> wrote:

Normally, the slotsync worker updates the standby slot using the
primary's slot state. However, when confirmed_flush_lsn matches but
restart_lsn does not, the worker does not actually update the standby
slot. Despite that, the current code of update_local_synced_slot()
appears to treat this situation as if an update occurred. As a result,
the worker sleeps only for the minimum interval (200 ms) before
retrying. In the next cycle, it again assumes an update happened, and
continues looping with the short sleep interval, causing the repeated
logical decoding log messages. Based on a quick analysis, this seems to be

the root cause.

I think update_local_synced_slot() should return false (i.e., no
update
happened) when confirmed_flush_lsn is equal but restart_lsn differs
between primary and standby.

We expect that in such a case update_local_synced_slot() should advance
local_slot's 'restart_lsn' via LogicalSlotAdvanceAndCheckSnapState(),
otherwise, it won't go in the cheap code path next time. Normally, restart_lsn
advancement should happen when we process XLOG_RUNNING_XACTS and
call SnapBuildProcessRunningXacts(). In this particular case as both
restart_lsn and confirmed_flush_lsn are the same (0/03000140), the
machinery may not be processing XLOG_RUNNING_XACTS record. I have not
debugged the exact case yet but you can try by emitting some more records
on publisher, it should let the standby advance the slot. It is possible that we
can do something like you are proposing to silence the LOG messages but we
should know what is going on here.

I reproduced and debugged this issue where a replication slot's restart_lsn
fails to advance. In my environment, I found it only occurs when a synced
slot first builds a consistent snapshot. The problematic code path is in
SnapBuildProcessRunningXacts():

if (builder->state < SNAPBUILD_CONSISTENT)
{
/* returns false if there's no point in performing cleanup just yet */
if (!SnapBuildFindSnapshot(builder, lsn, running))
return;
}

When a synced slot reaches consistency for the first time with no running
transactions, SnapBuildFindSnapshot() returns false, causing the function to
return without updating the candidate restart_lsn.

So, an alternative approach is to improve this logic by updating the candidate
restart_lsn in this case instead of returning early.

But why not return 'true' from SnapBuildFindSnapshot() in that case?
The comment atop SnapBuildFindSnapshot() says: "Returns true if there
is a point in performing internal maintenance/cleanup using the
xl_running_xacts record.". Doesn't updating restart_lsn fall under
that category?

However, I have a question: even if we haven't advanced it in the
first cycle, why is restart_lsn not advancing in consecutive sync
cycles?

--
With Regards,
Amit Kapila.

#5 Zhijie Hou (Fujitsu)
houzj.fnst@fujitsu.com
In reply to: Amit Kapila (#4)
RE: Fix slotsync worker busy loop causing repeated log messages

On Tuesday, March 3, 2026 6:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Mar 3, 2026 at 1:12 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
wrote:

On Saturday, February 28, 2026 1:03 PM Amit Kapila

<amit.kapila16@gmail.com> wrote:

On Fri, Feb 27, 2026 at 8:34 PM Fujii Masao <masao.fujii@gmail.com>

wrote:

Normally, the slotsync worker updates the standby slot using the
primary's slot state. However, when confirmed_flush_lsn matches
but restart_lsn does not, the worker does not actually update the
standby slot. Despite that, the current code of
update_local_synced_slot() appears to treat this situation as if
an update occurred. As a result, the worker sleeps only for the
minimum interval (200 ms) before retrying. In the next cycle, it
again assumes an update happened, and continues looping with the
short sleep interval, causing the repeated logical decoding log
messages. Based on a quick analysis, this seems to be

the root cause.

I think update_local_synced_slot() should return false (i.e., no
update
happened) when confirmed_flush_lsn is equal but restart_lsn
differs between primary and standby.

We expect that in such a case update_local_synced_slot() should
advance local_slot's 'restart_lsn' via
LogicalSlotAdvanceAndCheckSnapState(),
otherwise, it won't go in the cheap code path next time. Normally,
restart_lsn advancement should happen when we process
XLOG_RUNNING_XACTS and call SnapBuildProcessRunningXacts(). In this
particular case as both restart_lsn and confirmed_flush_lsn are the
same (0/03000140), the machinery may not be processing
XLOG_RUNNING_XACTS record. I have not debugged the exact case yet
but you can try by emitting some more records on publisher, it
should let the standby advance the slot. It is possible that we can
do something like you are proposing to silence the LOG messages but we

should know what is going on here.

I reproduced and debugged this issue where a replication slot's
restart_lsn fails to advance. In my environment, I found it only
occurs when a synced slot first builds a consistent snapshot. The
problematic code path is in
SnapBuildProcessRunningXacts():

if (builder->state < SNAPBUILD_CONSISTENT)
{
/* returns false if there's no point in performing cleanup just yet */
if (!SnapBuildFindSnapshot(builder, lsn, running))
return;
}

When a synced slot reaches consistency for the first time with no
running transactions, SnapBuildFindSnapshot() returns false, causing
the function to return without updating the candidate restart_lsn.

So, an alternative approach is to improve this logic by updating the
candidate restart_lsn in this case instead of returning early.

But why not return 'true' from SnapBuildFindSnapshot() in that case?
The comment atop SnapBuildFindSnapshot() says: "Returns true if there is a
point in performing internal maintenance/cleanup using the xl_running_xacts
record.". Doesn't updating restart_lsn fall under that category?

After re-thinking the return value semantics, I realized that
SnapBuildFindSnapshot() may no longer need a return value. In the revised
logic, the function would always return true, making the return value
redundant.

To elaborate: the original SnapBuildFindSnapshot() returned:
- true: when snapshot is built incrementally (after waiting for running
transactions to complete)
- false: when snapshot is built immediately (restoring from disk or no
running transactions)

My initial thought was to use this return value to decide whether to advance
restart_lsn to the latest LSN (advance when restoring a snapshot or with no running
transactions). However, this decision can be made without relying on the
return value (please see the v2 patch).

Therefore, we can change the function return type to void. The only other
consumer of this return value is SnapBuildPurgeOlderTxn(), which can instead
skip cleanup when no transaction data has accumulated (i.e., when the number
of committed transactions is zero), eliminating the need for the return value
altogether.

However, I have a question: even if we haven't advanced it in the first
cycle, why is restart_lsn not advancing in consecutive sync cycles?

Because no walsender was active during the reproduction steps, the slot
remained inactive on the publisher and its restart_lsn didn't advance. As a
result, the slotsync worker stalled, continuously retrying the first cycle of
snapshot building.

Best Regards,
Hou zj

Attachments:

v2-0001-Advance-restart_lsn-when-reaching-consistency-wit.patch (+41 −46)
#6 cca5507
2624345507@qq.com
In reply to: Zhijie Hou (Fujitsu) (#5)
RE: Fix slotsync worker busy loop causing repeated log messages

Hi,

I have a small question about this:

```
@@ -1143,11 +1146,7 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	 * our snapshot so others or we, after a restart, can use it.
 	 */
 	if (builder->state < SNAPBUILD_CONSISTENT)
-	{
-		/* returns false if there's no point in performing cleanup just yet */
-		if (!SnapBuildFindSnapshot(builder, lsn, running))
-			return;
-	}
+		SnapBuildFindSnapshot(builder, lsn, running);
 	else
 		SnapBuildSerialize(builder, lsn);
```

Why don't we call SnapBuildSerialize() when reaching consistency
in SnapBuildFindSnapshot()? I think we can serialize the snapshot,
at least when we are not building a full snapshot.

--
Regards,
ChangAo Chen

#7 Zhijie Hou (Fujitsu)
houzj.fnst@fujitsu.com
In reply to: Zhijie Hou (Fujitsu) (#5)
RE: Fix slotsync worker busy loop causing repeated log messages

On Tuesday, March 3, 2026 8:25 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote:

I reproduced and debugged this issue where a replication slot's
restart_lsn fails to advance. In my environment, I found it only
occurs when a synced slot first builds a consistent snapshot. The
problematic code path is in
SnapBuildProcessRunningXacts():

if (builder->state < SNAPBUILD_CONSISTENT)
{
/* returns false if there's no point in performing cleanup just yet */
if (!SnapBuildFindSnapshot(builder, lsn, running))
return;
}

When a synced slot reaches consistency for the first time with no
running transactions, SnapBuildFindSnapshot() returns false, causing
the function to return without updating the candidate restart_lsn.

However, I have a question: even if we haven't advanced it in
the first cycle, why is restart_lsn not advancing in consecutive sync
cycles?

Because no walsender was active during the reproduction steps, the slot
remained inactive on the publisher and its restart_lsn didn't advance. As a
result, the slotsync worker stalled, continuously retrying the first cycle
of snapshot building.

I've analyzed the issues further and identified two distinct cases that can
prevent the slotsync worker from advancing restart_lsn:

1) The first is the one I mentioned earlier: when the slotsync worker builds a
consistent snapshot for the first time, it neither advances restart_lsn to that
LSN nor serializes the snapshot. If the remote slot on the primary remains
unchanged, the slotsync worker will repeatedly report a WARNING, since it
always perceives itself as reaching a consistent point for the first time at
that same LSN.

The solution I came up with is to try to advance the restart_lsn upon reaching
consistency; the detailed analysis is as follows.

First, consider how a consistent snapshot is built for the first time:

a) a consistent snapshot is built incrementally, by waiting for running
transactions to finish.
b) a consistent snapshot is built by restoring a serialized snapshot.
c) a consistent snapshot is built because an xl_running_xacts record showed no
running transactions.

Currently, we do not advance the restart_lsn in any of the above cases, which
causes the restart_lsn to fall behind and the slotsync worker to repeatedly
emit LOG messages. In my reproduction, it reached case c), but cases a) and b)
can cause the same issue.

To improve this, I am thinking of advancing the restart_lsn in all of the
above cases, when appropriate:

For case a), we cannot unconditionally advance restart_lsn to the LSN of the
last xl_running_xacts, because without collecting the information on all
previous transactions from the old restart_lsn we could not immediately
rebuild a snapshot at this LSN after a restart. The solution I came up with is
to serialize the snapshot in this case and then advance the restart_lsn, so
that we can avoid collecting all the previous transaction information.

For case b), we can directly advance the restart_lsn, since with a serialized
snapshot we can reach consistency after restarting as well.

For case c), similar to case b), it's OK to advance the restart_lsn.

In all of cases a), b), and c), there is one general situation where we cannot
advance: when there are transactions that were decoded but not yet committed
before reaching the consistent point. Their data may still need to be
replicated, so we cannot simply advance beyond them.

To implement the above, we still need the return value of
SnapBuildFindSnapshot() to distinguish case a) from cases b) and c): the
function returns true for case a) and false for cases b) and c). The comment
atop SnapBuildFindSnapshot() can be updated, because I think the return value
is not directly related to whether to maintain restart_lsn and clean up the
transaction data.

See the attachment, which implements the above.

2) The other occurs when using pg_logical_slot_peek_changes() (including the
binary version) on the primary. This function allows reading WAL beyond the
current confirmed_flush_lsn without actually advancing the confirmed position.
If the starting position of an xl_running_xacts record happens to be exactly
at the confirmed_flush_lsn, the function can advance restart_lsn to that point.

However, during slot synchronization, the standby cannot read WAL beyond
confirmed_flush_lsn. This means it cannot access the final xl_running_xacts
record needed to advance restart_lsn to the latest position, causing the
advancement to fail.

In Fujii-san's example:

[PRIMARY]
=# SELECT slot_name, restart_lsn, confirmed_flush_lsn from
pg_replication_slots where slot_name = 'logical_slot';
slot_name | restart_lsn | confirmed_flush_lsn
--------------+-------------+---------------------
logical_slot | 0/03000140 | 0/03000140

[STANDBY]
=# SELECT slot_name, restart_lsn, confirmed_flush_lsn from
pg_replication_slots where slot_name = 'logical_slot';
slot_name | restart_lsn | confirmed_flush_lsn
--------------+-------------+---------------------
logical_slot | 0/03000098 | 0/03000140

0/03000140 is the start position of the last xl_running_xacts.

One way to improve this is to change the advancement function to read the last
record as well:

diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 603a2b94d05..309feaf2219 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -2134,7 +2134,7 @@ LogicalSlotAdvanceAndCheckSnapState(XLogRecPtr moveto,
 		InvalidateSystemCaches();
 		/* Decode records until we reach the requested target */
-		while (ctx->reader->EndRecPtr < moveto)
+		while (ctx->reader->EndRecPtr <= moveto)

Note that I am not insisting on the approach of changing the advancement
logic; I am trying to analyze the root cause and explore some alternatives for
reference.

Best Regards,
Hou zj

Attachments:

v3-0001-Advance-restart_lsn-when-reaching-consistency-wit.patch (+54 −32)
#8 Fujii Masao
masao.fujii@gmail.com
In reply to: Amit Kapila (#2)
Re: Fix slotsync worker busy loop causing repeated log messages

On Sat, Feb 28, 2026 at 2:03 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Feb 27, 2026 at 8:34 PM Fujii Masao <masao.fujii@gmail.com> wrote:

Normally, the slotsync worker updates the standby slot using the primary's slot
state. However, when confirmed_flush_lsn matches but restart_lsn does not,
the worker does not actually update the standby slot. Despite that, the current
code of update_local_synced_slot() appears to treat this situation as if
an update occurred. As a result, the worker sleeps only for the minimum
interval (200 ms) before retrying. In the next cycle, it again assumes
an update happened, and continues looping with the short sleep interval,
causing the repeated logical decoding log messages. Based on a quick analysis,
this seems to be the root cause.

I think update_local_synced_slot() should return false (i.e., no update
happened) when confirmed_flush_lsn is equal but restart_lsn differs between
primary and standby.

We expect that in such a case update_local_synced_slot() should
advance local_slot's 'restart_lsn' via
LogicalSlotAdvanceAndCheckSnapState(), otherwise, it won't go in the
cheap code path next time. Normally, restart_lsn advancement should
happen when we process XLOG_RUNNING_XACTS and call
SnapBuildProcessRunningXacts(). In this particular case as both
restart_lsn and confirmed_flush_lsn are the same (0/03000140), the
machinery may not be processing XLOG_RUNNING_XACTS record. I have not
debugged the exact case yet but you can try by emitting some more
records on publisher, it should let the standby advance the slot. It
is possible that we can do something like you are proposing to silence
the LOG messages but we should know what is going on here.

I agree it's important to investigate why restart_lsn doesn't advance
in that case and fix that issue.

Separately, I think mistreating the slot as updated is another problem:
the local slot is treated as updated even when it isn't, which causes
the slotsync worker to use the minimum sleep interval. We should
address this independently.

LogicalSlotAdvanceAndCheckSnapState() doesn't always update the local slot,
but update_local_synced_slot() currently assumes it does. That seems to be
the root cause. The attached 0001 patch makes
LogicalSlotAdvanceAndCheckSnapState() return whether the slot was actually
updated, and updates update_local_synced_slot() to act based on that result.

I've also attached 0002, which reduces the log level of some logical decoding
messages (e.g., "logical decoding found consistent point") from LOG to
DEBUG1. These are low-level, developer-oriented messages and can be noisy
when they occur regularly. This is the same issue I reported earlier at [1].

How about applying these patches while we continue working on
the restart_lsn issue?

Regards,

[1]: /messages/by-id/CAHGQGwED5kJxZ_MdCh=WLa5M7ekXdnzCO1ZCQhQNCEdfaEPwFQ@mail.gmail.com

--
Fujii Masao

Attachments:

v2-0001-Fix-slotsync-worker-busy-loop-causing-repeated-lo.patch (+35 −12)
v2-0002-Reduce-log-level-of-some-logical-decoding-message.patch (+5 −5)
#9 Amit Kapila
amit.kapila16@gmail.com
In reply to: Fujii Masao (#8)
Re: Fix slotsync worker busy loop causing repeated log messages

On Thu, Mar 19, 2026 at 7:08 PM Fujii Masao <masao.fujii@gmail.com> wrote:

On Sat, Feb 28, 2026 at 2:03 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Feb 27, 2026 at 8:34 PM Fujii Masao <masao.fujii@gmail.com> wrote:

Normally, the slotsync worker updates the standby slot using the primary's slot
state. However, when confirmed_flush_lsn matches but restart_lsn does not,
the worker does not actually update the standby slot. Despite that, the current
code of update_local_synced_slot() appears to treat this situation as if
an update occurred. As a result, the worker sleeps only for the minimum
interval (200 ms) before retrying. In the next cycle, it again assumes
an update happened, and continues looping with the short sleep interval,
causing the repeated logical decoding log messages. Based on a quick analysis,
this seems to be the root cause.

I think update_local_synced_slot() should return false (i.e., no update
happened) when confirmed_flush_lsn is equal but restart_lsn differs between
primary and standby.

We expect that in such a case update_local_synced_slot() should
advance local_slot's 'restart_lsn' via
LogicalSlotAdvanceAndCheckSnapState(), otherwise, it won't go in the
cheap code path next time. Normally, restart_lsn advancement should
happen when we process XLOG_RUNNING_XACTS and call
SnapBuildProcessRunningXacts(). In this particular case as both
restart_lsn and confirmed_flush_lsn are the same (0/03000140), the
machinery may not be processing XLOG_RUNNING_XACTS record. I have not
debugged the exact case yet but you can try by emitting some more
records on publisher, it should let the standby advance the slot. It
is possible that we can do something like you are proposing to silence
the LOG messages but we should know what is going on here.

I agree it's important to investigate why restart_lsn doesn't advance
in that case and fix that issue.

Separately, I think mistreating the slot as updated is another problem:
the local slot is treated as updated even when it isn't, which causes
the slotsync worker to use the minimum sleep interval. We should
address this independently.

LogicalSlotAdvanceAndCheckSnapState() doesn't always update the local slot,
but update_local_synced_slot() currently assumes it does. That seems to be
the root cause. The attached 0001 patch makes
LogicalSlotAdvanceAndCheckSnapState() return whether the slot was actually
updated, and updates update_local_synced_slot() to act based on that result.

I've also attached 0002, which reduces the log level of some logical decoding
messages (e.g., "logical decoding found consistent point") from LOG to
DEBUG1. These are low-level, developer-oriented messages and can be noisy
when they occur regularly. This is the same issue I reported earlier at [1].

How about applying these patches while we continue working on
the restart_lsn issue?

*
@@ -2174,7 +2193,10 @@ LogicalSlotAdvanceAndCheckSnapState(XLogRecPtr moveto,

  if (XLogRecPtrIsValid(ctx->reader->EndRecPtr))
  {
- LogicalConfirmReceivedLocation(moveto);
+ bool slot_updated = LogicalConfirmReceivedLocation(moveto);
+
+ if (updated_xmin_or_lsn)
+ *updated_xmin_or_lsn = slot_updated;

BTW, LogicalSlotAdvanceAndCheckSnapState() could also advance the slot
while processing a running_xacts record, so I'm not sure we can rely only on
the explicit call to LogicalConfirmReceivedLocation() above to ascertain
the same.

Sorry, I couldn't get the chance to look at the patches proposed by
Hou-san to fix this issue but I'll look at it after the feature
freeze.

* For 0002, I think the current log level has been used since logical
decoding was introduced, so if we want to change those messages, we should
start a separate thread. BTW, as a developer, I find those LOG-level
messages quite useful while debugging issues, so I am in favor of
keeping them as they are. However, if the consensus is otherwise, we can
change them as well.

--
With Regards,
Amit Kapila.

#10Zhijie Hou (Fujitsu)
houzj.fnst@fujitsu.com
In reply to: Amit Kapila (#9)
RE: Fix slotsync worker busy loop causing repeated log messages

On Sunday, March 22, 2026 1:47 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Mar 19, 2026 at 7:08 PM Fujii Masao <masao.fujii@gmail.com>
wrote:

On Sat, Feb 28, 2026 at 2:03 PM Amit Kapila <amit.kapila16@gmail.com>

wrote:

On Fri, Feb 27, 2026 at 8:34 PM Fujii Masao <masao.fujii@gmail.com>

wrote:

Normally, the slotsync worker updates the standby slot using the
primary's slot state. However, when confirmed_flush_lsn matches
but restart_lsn does not, the worker does not actually update the
standby slot. Despite that, the current code of
update_local_synced_slot() appears to treat this situation as if
an update occurred. As a result, the worker sleeps only for the
minimum interval (200 ms) before retrying. In the next cycle, it
again assumes an update happened, and continues looping with the
short sleep interval, causing the repeated logical decoding log messages.

Based on a quick analysis, this seems to be the root cause.

I think update_local_synced_slot() should return false (i.e., no
update
happened) when confirmed_flush_lsn is equal but restart_lsn
differs between primary and standby.

We expect that in such a case update_local_synced_slot() should
advance local_slot's 'restart_lsn' via
LogicalSlotAdvanceAndCheckSnapState(), otherwise, it won't go in the
cheap code path next time. Normally, restart_lsn advancement should
happen when we process XLOG_RUNNING_XACTS and call
SnapBuildProcessRunningXacts(). In this particular case as both
restart_lsn and confirmed_flush_lsn are the same (0/03000140), the
machinery may not be processing XLOG_RUNNING_XACTS record. I have
not debugged the exact case yet but you can try by emitting some
more records on publisher, it should let the standby advance the
slot. It is possible that we can do something like you are proposing
to silence the LOG messages but we should know what is going on here.

I agree it's important to investigate why restart_lsn doesn't advance
in that case and fix that issue.

Separately, I think mistreating the slot as updated is another problem:
the local slot is treated as updated even when it isn't, which causes
the slotsync worker to use the minimum sleep interval. We should
address this independently.

LogicalSlotAdvanceAndCheckSnapState() doesn't always update the local
slot, but update_local_synced_slot() currently assumes it does. That
seems to be the root cause. The attached 0001 patch makes
LogicalSlotAdvanceAndCheckSnapState() return whether the slot was
actually updated, and updates update_local_synced_slot() to act based on

that result.

I've also attached 0002, which reduces the log level of some logical
decoding messages (e.g., "logical decoding found consistent point")
from LOG to DEBUG1. These are low-level, developer-oriented messages
and can be noisy when they occur regularly. This is the same issue I

reported earlier at [1].

How about applying these patches while we continue working on the
restart_lsn issue?

*
@@ -2174,7 +2193,10 @@
LogicalSlotAdvanceAndCheckSnapState(XLogRecPtr moveto,

if (XLogRecPtrIsValid(ctx->reader->EndRecPtr))
{
- LogicalConfirmReceivedLocation(moveto);
+ bool slot_updated = LogicalConfirmReceivedLocation(moveto);
+
+ if (updated_xmin_or_lsn)
+ *updated_xmin_or_lsn = slot_updated;

BTW, LogicalSlotAdvanceAndCheckSnapState() could also advance slot while
processing running_xact record, so not sure we can rely only on the explicit
call LogicalConfirmReceivedLocation() above to ascertain the same.

Perhaps we could simply compare the slot's old and new LSN/xmin to determine
whether updated_xmin_or_lsn needs to be set. This would avoid the need to hook
into each individual update path at the lower level. Attaching a patch for reference.

Sorry, I couldn't get the chance to look at the patches proposed by Hou-san to
fix this issue but I'll look at it after the feature freeze.

The other patches attempt to advance restart_lsn in more cases by modifying the
logical decoding logic. However, those scenarios are relatively rare and may not
produce a significant improvement outside the slot synchronization context. So, I
think modifying the slotsync worker to correctly increase the polling interval
instead is also a good starting point.

Best Regards,
Hou zj

Attachments:

v3-0001-Fix-slotsync-worker-busy-loop-causing-repeated-lo.patch
#11Amit Kapila
amit.kapila16@gmail.com
In reply to: Zhijie Hou (Fujitsu) (#10)
Re: Fix slotsync worker busy loop causing repeated log messages

On Fri, Apr 10, 2026 at 8:28 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Sunday, March 22, 2026 1:47 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

*
@@ -2174,7 +2193,10 @@
LogicalSlotAdvanceAndCheckSnapState(XLogRecPtr moveto,

if (XLogRecPtrIsValid(ctx->reader->EndRecPtr))
{
- LogicalConfirmReceivedLocation(moveto);
+ bool slot_updated = LogicalConfirmReceivedLocation(moveto);
+
+ if (updated_xmin_or_lsn)
+ *updated_xmin_or_lsn = slot_updated;

BTW, LogicalSlotAdvanceAndCheckSnapState() could also advance slot while
processing running_xact record, so not sure we can rely only on the explicit
call LogicalConfirmReceivedLocation() above to ascertain the same.

Perhaps we could simply compare the slot's old and new LSN/xmin to determine
whether updated_xmin_or_lsn needs to be set. This would avoid the need to hook
into each individual update path at the lower level. Attaching a patch for reference.

Yes, this sounds like a good idea; I will review it further.

Sorry, I couldn't get the chance to look at the patches proposed by Hou-san to
fix this issue but I'll look at it after the feature freeze.

The other patches attempt to advance restart_lsn in more cases by modifying the
logical decoding logic. However, those scenarios are relatively rare and may not
produce significant improvement outside the slot synchronization context. So, I
think modifying the slot sync worker to correctly increase the polling interval instead
is also a good starting point.

Even in slot synchronization, this is a very rare case: the slot on the
primary is not being continuously advanced, as it normally would be via
the walsender; otherwise, we wouldn't have seen these LOG messages at
this frequency.

--
With Regards,
Amit Kapila.

#12shveta malik
shveta.malik@gmail.com
In reply to: Zhijie Hou (Fujitsu) (#10)
Re: Fix slotsync worker busy loop causing repeated log messages

On Fri, Apr 10, 2026 at 8:28 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:


The other patches attempt to advance restart_lsn in more cases by modifying the
logical decoding logic. However, those scenarios are relatively rare and may not
produce significant improvement outside the slot synchronization context. So, I
think modifying the slot sync worker to correctly increase the polling interval instead
is also a good starting point.

After reviewing the entire thread and patches, the above solution
appears to be the simplest and least intrusive, especially since the
issue is limited to the slotsync worker. So +1 for this idea.

thanks
Shveta

#13Amit Kapila
amit.kapila16@gmail.com
In reply to: Zhijie Hou (Fujitsu) (#10)
Re: Fix slotsync worker busy loop causing repeated log messages

On Fri, Apr 10, 2026 at 8:28 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:


Perhaps we could simply compare the slot's old and new LSN/xmin to determine
whether updated_xmin_or_lsn needs to be set.

The patch works for me. Can we change the comment as follows: "It is
possible that the slot's xmin or LSNs are not updated, when the
synced slot has reached consistent snapshot state or cannot build the
one at all."?

--
With Regards,
Amit Kapila.

#14Zhijie Hou (Fujitsu)
houzj.fnst@fujitsu.com
In reply to: Amit Kapila (#13)
RE: Fix slotsync worker busy loop causing repeated log messages

On Friday, April 10, 2026 5:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Apr 10, 2026 at 8:28 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
wrote:

Perhaps we could simply compare the slot's old and new LSN/xmin to
determine whether updated_xmin_or_lsn needs to be set.

The patch works for me. Can we change the comment as follows: "It is
possible that the slot's xmin or LSNs are not updated, when the synced slot
has reached consistent snapshot state or cannot build the one at all."?

Changed as suggested.

Here are the patches for both HEAD and back branches.

Best Regards,
Hou zj

Attachments:

v4-0001-Fix-slotsync-worker-busy-loop-causing-repeated-lo.patch
v4_PG17_18-0001-Fix-slotsync-worker-busy-loop-causing-rep.patch