Add progressive backoff to XactLockTableWait functions
Hi hackers,
This patch implements progressive backoff in XactLockTableWait() and
ConditionalXactLockTableWait().
As Kevin reported in this thread [1], XactLockTableWait() can enter a
tight polling loop during logical replication slot creation on standby
servers, sleeping for fixed 1ms intervals that can continue for a long
time. This creates significant CPU overhead.
The patch implements a time-based threshold approach based on Fujii’s
idea [1]: keep sleeping for 1ms until the total sleep time reaches 10
seconds, then start exponential backoff (doubling the sleep duration
each cycle) up to a maximum of 10 seconds per sleep. This balances
responsiveness for normal operations (which typically complete within
seconds) against CPU efficiency for the long waits in some logical
replication scenarios.
[1]: /messages/by-id/CAM45KeELdjhS-rGuvN=ZLJ_asvZACucZ9LZWVzH7bGcD12DDwg@mail.gmail.com
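The schedule described above can be sketched as a small standalone C model. It is not the patch itself: BackoffState, backoff_init(), backoff_next() and count_sleeps() are hypothetical names introduced here for illustration, USECS_PER_SEC is redefined locally, and only nominal sleep durations are counted (nothing actually sleeps, and time spent outside the sleep is ignored).

```c
#include <assert.h>

#define USECS_PER_SEC 1000000L

/* Hypothetical state holder; the patch keeps these as local variables. */
typedef struct BackoffState
{
    long total_sleep_us;    /* time slept before backoff started */
    long sleep_us;          /* duration of the next sleep */
    int  do_backoff;        /* nonzero once the threshold is reached */
} BackoffState;

static const long backoff_threshold_us = 10 * USECS_PER_SEC;   /* 10 seconds */
static const long max_sleep_us = 10 * USECS_PER_SEC;           /* 10 seconds */

static void
backoff_init(BackoffState *s)
{
    s->total_sleep_us = 0;
    s->sleep_us = 1000;     /* start with 1ms */
    s->do_backoff = 0;
}

/* Return the duration to sleep now, then advance the state for next time. */
static long
backoff_next(BackoffState *s)
{
    long this_sleep = s->sleep_us;

    /* Track the total only until backoff starts. */
    if (!s->do_backoff)
    {
        s->total_sleep_us += this_sleep;
        if (s->total_sleep_us >= backoff_threshold_us)
            s->do_backoff = 1;
    }

    /* Double the sleep once the threshold is reached, capped at the maximum. */
    if (s->do_backoff && s->sleep_us < max_sleep_us)
    {
        s->sleep_us *= 2;
        if (s->sleep_us > max_sleep_us)
            s->sleep_us = max_sleep_us;
    }
    return this_sleep;
}

/* Count how many sleeps are needed to cover wait_us of nominal sleep time. */
static long
count_sleeps(long wait_us)
{
    BackoffState s;
    long slept = 0;
    long n = 0;

    assert(wait_us >= 0);
    backoff_init(&s);
    while (slept < wait_us)
    {
        slept += backoff_next(&s);
        n++;
    }
    return n;
}
```

Under these assumptions, count_sleeps(60 * USECS_PER_SEC) gives 10017 sleeps for a 60-second wait (10000 at 1ms, 13 doubling steps from 2ms to 8192ms, then 10s sleeps), compared with 60000 for a fixed 1ms sleep.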
Best regards,
Xuneng
Attachments:
0001-Add-progressive-backoff-to-XactLockTableWait.patch (application/octet-stream)
From cd90ff97b12d3c2e74da6cfa4b0b8939c6f6dbb6 Mon Sep 17 00:00:00 2001
From: alterego665 <824662526@qq.com>
Date: Sun, 8 Jun 2025 20:28:17 +0800
Subject: [PATCH] Add progressive backoff to XactLockTableWait functions
XactLockTableWait() and ConditionalXactLockTableWait() currently use
a fixed 1ms sleep when waiting for transaction completion. In logical
replication scenarios, particularly during CREATE REPLICATION SLOT,
these functions may wait for very long periods (minutes to hours) for
old transactions to complete, leading to excessive CPU usage due to
frequent polling.
This patch implements progressive backoff: keep sleeping for 1ms until
total sleep time reaches 10 seconds, then start doubling the sleep duration
each cycle, up to a maximum of 10 seconds per sleep. This balances
responsiveness for normal operations (which typically complete within seconds)
against CPU efficiency for long waits common in logical replication scenarios.
---
src/backend/storage/lmgr/lmgr.c | 48 ++++++++++++++++++++++++++++++---
1 file changed, 44 insertions(+), 4 deletions(-)
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index 3f6bf70bd3c..495fa607932 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -667,6 +667,13 @@ XactLockTableWait(TransactionId xid, Relation rel, ItemPointer ctid,
XactLockTableWaitInfo info;
ErrorContextCallback callback;
bool first = true;
+ long total_sleep_us = 0;
+ long sleep_us = 1000; /* Start with 1ms */
+ bool do_backoff = false;
+
+ /* Progressive backoff threshold */
+ const long backoff_threshold_us = 10 * USECS_PER_SEC; /* 10 seconds */
+ const long max_sleep_us = 10 * USECS_PER_SEC; /* 10 seconds */
/*
* If an operation is specified, set up our verbose error context
@@ -713,13 +720,25 @@ XactLockTableWait(TransactionId xid, Relation rel, ItemPointer ctid,
* as when building snapshots for logical decoding. It is possible to
* see a transaction in ProcArray before it registers itself in the
* locktable. The topmost transaction in that case is the same xid,
- * so we try again after a short sleep. (Don't sleep the first time
- * through, to avoid slowing down the normal case.)
+ * so we try again after a progressive sleep. (Don't sleep the first
+ * time through, to avoid slowing down the normal case.)
*/
if (!first)
{
CHECK_FOR_INTERRUPTS();
- pg_usleep(1000L);
+ pg_usleep(sleep_us);
+
+ /* Track total only until we start doing backoff */
+ if (!do_backoff)
+ {
+ total_sleep_us += sleep_us;
+ if (total_sleep_us >= backoff_threshold_us)
+ do_backoff = true;
+ }
+
+ /* Exponential backoff once threshold is reached */
+ if (do_backoff && sleep_us < max_sleep_us)
+ sleep_us = Min(sleep_us * 2, max_sleep_us);
}
first = false;
xid = SubTransGetTopmostTransaction(xid);
@@ -734,12 +753,21 @@ XactLockTableWait(TransactionId xid, Relation rel, ItemPointer ctid,
*
* As above, but only lock if we can get the lock without blocking.
* Returns true if the lock was acquired.
+ *
+ * Uses the same progressive backoff as XactLockTableWait.
*/
bool
ConditionalXactLockTableWait(TransactionId xid, bool logLockFailure)
{
LOCKTAG tag;
bool first = true;
+ long total_sleep_us = 0;
+ long sleep_us = 1000; /* Start with 1ms */
+ bool do_backoff = false;
+
+ /* Progressive backoff threshold */
+ const long backoff_threshold_us = 10 * USECS_PER_SEC; /* 10 seconds */
+ const long max_sleep_us = 10 * USECS_PER_SEC; /* 10 seconds */
for (;;)
{
@@ -762,7 +790,19 @@ ConditionalXactLockTableWait(TransactionId xid, bool logLockFailure)
if (!first)
{
CHECK_FOR_INTERRUPTS();
- pg_usleep(1000L);
+ pg_usleep(sleep_us);
+
+ /* Track total only until we start doing backoff */
+ if (!do_backoff)
+ {
+ total_sleep_us += sleep_us;
+ if (total_sleep_us >= backoff_threshold_us)
+ do_backoff = true;
+ }
+
+ /* Exponential backoff once threshold is reached */
+ if (do_backoff && sleep_us < max_sleep_us)
+ sleep_us = Min(sleep_us * 2, max_sleep_us);
}
first = false;
xid = SubTransGetTopmostTransaction(xid);
--
2.48.1
On 2025/06/08 23:33, Xuneng Zhou wrote:
Hi hackers,
This patch implements progressive backoff in XactLockTableWait() and
ConditionalXactLockTableWait().
As Kevin reported in this thread [1], XactLockTableWait() can enter a
tight polling loop during logical replication slot creation on standby
servers, sleeping for fixed 1ms intervals that can continue for a long
time. This creates significant CPU overhead.
The patch implements a time-based threshold approach based on Fujii’s
idea [1]: keep sleeping for 1ms until the total sleep time reaches 10
seconds, then start exponential backoff (doubling the sleep duration
each cycle) up to a maximum of 10 seconds per sleep. This balances
responsiveness for normal operations (which typically complete within
seconds) against CPU efficiency for the long waits in some logical
replication scenarios.
Thanks for the patch!
When I first suggested this idea, I used 10s as an example for
the maximum sleep time. But thinking more about it now, 10s might
be too long. Even if the target transaction has already finished,
XactLockTableWait() could still wait up to 10 seconds,
which seems excessive.
What about using 1s instead? That value is already used as a max
sleep time in other places, like WaitExceedsMaxStandbyDelay().
If we agree on 1s as the max, then using exponential backoff from
1ms to 1s after the threshold might not be necessary. It might
be simpler and sufficient to just sleep for 1s once we hit
the threshold.
Based on that, I think a change like the following could work well.
Thoughts?
----------------------------------------
XactLockTableWaitInfo info;
ErrorContextCallback callback;
bool first = true;
+ int left_till_hibernate = 5000;
<snip>
if (!first)
{
CHECK_FOR_INTERRUPTS();
- pg_usleep(1000L);
+
+ if (left_till_hibernate > 0)
+ {
+ pg_usleep(1000L);
+ left_till_hibernate--;
+ }
+ else
+ pg_usleep(1000000L);
----------------------------------------
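To put rough numbers on the change sketched above, here is a minimal standalone C model of the sleep schedule. The function names are hypothetical and introduced only for illustration; the model counts nominal sleep durations and ignores both the time spent outside the sleep and any oversleeping by the real pg_usleep().

```c
#include <assert.h>

/* Nominal duration of the next sleep, mirroring the snippet above. */
static long
threshold_next_sleep_us(int *left_till_hibernate)
{
    if (*left_till_hibernate > 0)
    {
        (*left_till_hibernate)--;
        return 1000L;           /* 1ms while the budget lasts */
    }
    return 1000000L;            /* 1s afterwards */
}

/* Sleeps needed to cover wait_us with the threshold-based schedule. */
static long
count_threshold_sleeps(long wait_us)
{
    int  left_till_hibernate = 5000;
    long slept = 0;
    long n = 0;

    assert(wait_us >= 0);
    while (slept < wait_us)
    {
        slept += threshold_next_sleep_us(&left_till_hibernate);
        n++;
    }
    return n;
}

/* Sleeps needed to cover wait_us with the current fixed 1ms sleep. */
static long
count_fixed_sleeps(long wait_us)
{
    assert(wait_us >= 0);
    return (wait_us + 999) / 1000;
}
```

For a 60-second wait this model gives 60000 sleeps with the fixed 1ms interval and 5055 with the threshold variant (5000 at 1ms, then 55 at 1s). The price is that, once past the threshold, a finished transaction may go unnoticed for up to one 1s sleep.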
Regards,
--
Fujii Masao
NTT DATA Japan Corporation
Hi,
Thanks for the feedback!
On Thu, Jun 12, 2025 at 10:02 PM Fujii Masao <masao.fujii@oss.nttdata.com>
wrote:
When I first suggested this idea, I used 10s as an example for
the maximum sleep time. But thinking more about it now, 10s might
be too long. Even if the target transaction has already finished,
XactLockTableWait() could still wait up to 10 seconds,
which seems excessive.
+1, this could be a problem
What about using 1s instead? That value is already used as a max
sleep time in other places, like WaitExceedsMaxStandbyDelay().
1s should be generally good
If we agree on 1s as the max, then using exponential backoff from
1ms to 1s after the threshold might not be necessary. It might
be simpler and sufficient to just sleep for 1s once we hit
the threshold.
That makes sense to me.
Based on that, I think a change like the following could work well.
Thoughts?
I'll update the patch accordingly.
Best regards,
Xuneng
Hi,
Attached is v2 of the patch to add threshold-based sleep to
XactLockTableWait functions.
Changes from v1:
- Simplified approach based on Fujii's feedback [1]: instead of exponential
  backoff, we now sleep 1ms for the first 5 seconds, then switch directly
  to 1s sleeps
- Reduced the threshold from 10 seconds to 5 seconds to avoid excessive
  delays
[1]: /messages/by-id/7c72c5d1-4d2f-46f7-8b68-dd96905f8c42@oss.nttdata.com
Best regards,
Xuneng
Attachments:
v2-0001-Add-threshold-based-sleep-to-XactLockTableWait-functions.patch (application/octet-stream)
From 3dfcb99c8208d6c121d9e32231d1b111a498cbd2 Mon Sep 17 00:00:00 2001
From: alterego665 <824662526@qq.com>
Date: Sun, 15 Jun 2025 15:47:30 +0800
Subject: [PATCH v2] Add threshold-based sleep to XactLockTableWait functions
XactLockTableWait() and ConditionalXactLockTableWait() currently use
a fixed 1ms sleep when waiting for transaction completion. In logical
replication scenarios, particularly during CREATE REPLICATION SLOT,
these functions may wait for very long periods (minutes to hours) for
old transactions to complete, leading to excessive CPU usage due to
frequent polling.
This patch implements a threshold-based approach: sleep for 1ms for
the first 5 seconds (5000 iterations), then switch to 1s sleeps for
the remainder of the wait. This balances responsiveness for normal
operations (which typically complete within seconds) against CPU
efficiency for long waits common in logical replication scenarios.
---
src/backend/storage/lmgr/lmgr.c | 26 ++++++++++++++++++++++----
1 file changed, 22 insertions(+), 4 deletions(-)
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index 3f6bf70bd3c..c81b2fbe849 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -667,6 +667,7 @@ XactLockTableWait(TransactionId xid, Relation rel, ItemPointer ctid,
XactLockTableWaitInfo info;
ErrorContextCallback callback;
bool first = true;
+ int left_till_hibernate = 5000;
/*
* If an operation is specified, set up our verbose error context
@@ -713,13 +714,22 @@ XactLockTableWait(TransactionId xid, Relation rel, ItemPointer ctid,
* as when building snapshots for logical decoding. It is possible to
* see a transaction in ProcArray before it registers itself in the
* locktable. The topmost transaction in that case is the same xid,
- * so we try again after a short sleep. (Don't sleep the first time
- * through, to avoid slowing down the normal case.)
+ * so we try again after a sleep. We sleep 1ms for the first 5 seconds
+ * to keep normal operations responsive, then 1s to reduce CPU overhead
+ * during long waits. (Don't sleep the first time through, to avoid
+ * slowing down the normal case.)
*/
if (!first)
{
CHECK_FOR_INTERRUPTS();
- pg_usleep(1000L);
+
+ if (left_till_hibernate > 0)
+ {
+ pg_usleep(1000L);
+ left_till_hibernate--;
+ }
+ else
+ pg_usleep(1000000L); /* 1s */
}
first = false;
xid = SubTransGetTopmostTransaction(xid);
@@ -740,6 +750,7 @@ ConditionalXactLockTableWait(TransactionId xid, bool logLockFailure)
{
LOCKTAG tag;
bool first = true;
+ int left_till_hibernate = 5000;
for (;;)
{
@@ -762,7 +773,14 @@ ConditionalXactLockTableWait(TransactionId xid, bool logLockFailure)
if (!first)
{
CHECK_FOR_INTERRUPTS();
- pg_usleep(1000L);
+
+ if (left_till_hibernate > 0)
+ {
+ pg_usleep(1000L);
+ left_till_hibernate--;
+ }
+ else
+ pg_usleep(1000000L);
}
first = false;
xid = SubTransGetTopmostTransaction(xid);
--
2.48.1
Hi,
Although it’s clear that replacing tight 1 ms polling loops will reduce CPU
usage, I was curious about the hard numbers. To that end, I ran a 60 s
logical replication slot creation workload on a standby (8-core, 16 GB AMD
system) with three different XactLockTableWait() variants, and collected
both profiling traces and hardware-counter metrics.
1. Hardware‐counter results
[image: image.png]
- CPU cycles drop by 58% moving from the fixed 1 ms sleep to exponential
backoff, and by another 25% with the 1 s threshold variant.
- Cache misses and context switches see similarly large reductions.
- IPC remains around 0.45, dipping slightly under longer sleeps.
2. Flame‐graph
See attached files
Best regards,
Xuneng
Attachments:
image.png (image/png)