Optimize shared LWLock acquisition for high-core-count systems

Started by Zhou, Zhiguo8 months ago7 messages

zhiguo.zhou@intel.com

8 months ago

1 attachment(s)

Hi Hackers,

I am reaching out to solicit your insights and comments on this patch
addressing a significant performance bottleneck in LWLock acquisition
observed on high-core-count systems. During performance analysis of
HammerDB/TPCC (192 virtual users, 757 warehouses) on a 384-vCPU Intel
system, we found that LWLockAttemptLock consumed 7.12% of total CPU
cycles. This bottleneck becomes even more pronounced (up to 30% of
cycles) after applying lock-free WAL optimizations[1]Lock-free XLog Reservation from WAL: /messages/by-id/PH7PR11MB5796659F654F9BE983F3AD97EF142@PH7PR11MB5796.namprd11.prod.outlook.com[2]Increase NUM_XLOGINSERT_LOCKS: /messages/by-id/3b11fdc2-9793-403d-b3d4-67ff9a00d447@postgrespro.ru.

Problem Analysis:
The current LWLock implementation uses separate atomic operations for
state checking and modification. For shared locks (84% of
LWLockAttemptLock calls), this requires:
1.Atomic read of lock->state
2.State modification
3.Atomic compare-exchange (with retries on contention)

This design causes excessive atomic operations on contended locks, which
are particularly expensive on high-core-count systems where cache-line
bouncing amplifies synchronization costs.

Optimization Approach:
The patch optimizes shared lock acquisition by:
1.Merging state read and update into a single atomic add operation
2.Extending LW_SHARED_MASK by 1 bit and shifting LW_VAL_EXCLUSIVE
3.Adding a willwait parameter to control optimization usage

Key implementation details:
- For LW_SHARED with willwait=true: Uses atomic fetch-add to increment
reference count
- Maintains backward compatibility through state mask adjustments
- Preserves existing behavior for:
1) Exclusive locks
2) Non-waiting cases (LWLockConditionalAcquire)
- Bounds shared lock count to MAX_BACKENDS*2 (handled via mask extension)

Performance Impact:
Testing on a 384-vCPU Intel system shows:
- *8%* NOPM improvement in HammerDB/TPCC with this optimization alone
- *46%* cumulative improvement when combined with lock-free WAL
optimizations[1]Lock-free XLog Reservation from WAL: /messages/by-id/PH7PR11MB5796659F654F9BE983F3AD97EF142@PH7PR11MB5796.namprd11.prod.outlook.com[2]Increase NUM_XLOGINSERT_LOCKS: /messages/by-id/3b11fdc2-9793-403d-b3d4-67ff9a00d447@postgrespro.ru

Patch Contents:
1.Extends shared mask and shifts exclusive lock value
2.Adds willwait parameter to control optimization
3.Updates lock acquisition/release logic
4.Maintains all existing assertions and safety checks

The optimization is particularly effective for contended shared locks,
which are common in buffer mapping, lock manager, and shared buffer
access patterns.

Please review this patch for consideration in upcoming PostgreSQL releases.

[1]: Lock-free XLog Reservation from WAL: /messages/by-id/PH7PR11MB5796659F654F9BE983F3AD97EF142@PH7PR11MB5796.namprd11.prod.outlook.com
/messages/by-id/PH7PR11MB5796659F654F9BE983F3AD97EF142@PH7PR11MB5796.namprd11.prod.outlook.com
[2]: Increase NUM_XLOGINSERT_LOCKS: /messages/by-id/3b11fdc2-9793-403d-b3d4-67ff9a00d447@postgrespro.ru
/messages/by-id/3b11fdc2-9793-403d-b3d4-67ff9a00d447@postgrespro.ru

Regards,
Zhiguo

Attachments:

0001-Optimize-shared-LWLock-acquisition-for-high-core-cou.patchtext/plain; charset=UTF-8; name=0001-Optimize-shared-LWLock-acquisition-for-high-core-cou.patchDownload

From 7fa7181655627d7ed5eafb33341621d1965c5312 Mon Sep 17 00:00:00 2001
From: Zhiguo Zhou <zhiguo.zhou@intel.com>
Date: Thu, 29 May 2025 16:55:42 +0800
Subject: [PATCH] Optimize shared LWLock acquisition for high-core-count
 systems

This patch introduces optimizations to reduce lock acquisition overhead in
LWLock by merging the read and update operations for the LW_SHARED lock's
state. This eliminates the need for separate atomic instructions, which is
critical for improving performance on high-core-count systems.

Key changes:
- Extended LW_SHARED_MASK by 1 bit and shifted LW_VAL_EXCLUSIVE by 1 bit to
  ensure compatibility with the upper bound of MAX_BACKENDS * 2.
- Added a `willwait` parameter to `LWLockAttemptLock` to disable the
  optimization when the caller is unwilling to wait, avoiding conflicts
  between the reference count and the LW_VAL_EXCLUSIVE flag.
- Updated `LWLockReleaseInternal` to use `pg_atomic_fetch_and_u32` for
  clearing lock state flags atomically.
- Adjusted related functions (`LWLockAcquire`, `LWLockConditionalAcquire`,
  `LWLockAcquireOrWait`) to pass the `willwait` parameter appropriately.

These changes improve scalability and reduce contention in workloads with
frequent LWLock operations on servers with many cores.
---
 src/backend/storage/lmgr/lwlock.c | 73 ++++++++++++++++++++++++-------
 1 file changed, 57 insertions(+), 16 deletions(-)

diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 46f44bc4511..4c29016ce35 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -97,20 +97,41 @@
 #define LW_FLAG_BITS				3
 #define LW_FLAG_MASK				(((1<<LW_FLAG_BITS)-1)<<(32-LW_FLAG_BITS))
 
-/* assumes MAX_BACKENDS is a (power of 2) - 1, checked below */
-#define LW_VAL_EXCLUSIVE			(MAX_BACKENDS + 1)
+/*
+ * already (power of 2)-1, i.e. suitable for a mask
+ *
+ * Originally, the LW_SHARED lock reference count was maintained in bits
+ * [MAX_BACKEND_BITS-1:0] of LWLock.state, with a theoretical maximum of
+ * MAX_BACKENDS (when all MAX_BACKENDS processes hold the lock concurrently).
+ *
+ * To reduce lock acquisition overhead, we optimized LWLockAttemptLock by
+ * merging the read and update operations for the LW_SHARED lock's state.
+ * This eliminates the need for separate atomic instructions - a critical
+ * improvement given the high cost of atomic operations on high-core-count
+ * systems.
+ *
+ * This optimization introduces a scenario where the reference count may
+ * temporarily increment even when a reader fails to acquire an exclusive lock.
+ * However, since each process retries lock acquisition up to *twice* before
+ * waiting on a semaphore, the reference count is bounded by MAX_BACKENDS * 2.
+ *
+ * To ensure compatibility with this upper bound:
+ * 1. LW_SHARED_MASK has been extended by 1 bit
+ * 2. LW_VAL_EXCLUSIVE is left-shifted by 1 bit
+ */
+#define LW_SHARED_MASK				((MAX_BACKENDS << 1) + 1)
+#define LW_VAL_EXCLUSIVE			(LW_SHARED_MASK + 1)
+#define LW_LOCK_MASK				(LW_SHARED_MASK	| LW_VAL_EXCLUSIVE)
 #define LW_VAL_SHARED				1
 
-/* already (power of 2)-1, i.e. suitable for a mask */
-#define LW_SHARED_MASK				MAX_BACKENDS
-#define LW_LOCK_MASK				(MAX_BACKENDS | LW_VAL_EXCLUSIVE)
+/* assumes MAX_BACKENDS is a (power of 2) - 1, checked below */
 
 
 StaticAssertDecl(((MAX_BACKENDS + 1) & MAX_BACKENDS) == 0,
 				 "MAX_BACKENDS + 1 needs to be a power of 2");
 
-StaticAssertDecl((MAX_BACKENDS & LW_FLAG_MASK) == 0,
-				 "MAX_BACKENDS and LW_FLAG_MASK overlap");
+StaticAssertDecl((LW_SHARED_MASK & LW_FLAG_MASK) == 0,
+				 "LW_SHARED_MASK and LW_FLAG_MASK overlap");
 
 StaticAssertDecl((LW_VAL_EXCLUSIVE & LW_FLAG_MASK) == 0,
 				 "LW_VAL_EXCLUSIVE and LW_FLAG_MASK overlap");
@@ -277,6 +298,8 @@ PRINT_LWDEBUG(const char *where, LWLock *lock, LWLockMode mode)
 	if (Trace_lwlocks)
 	{
 		uint32		state = pg_atomic_read_u32(&lock->state);
+		uint32		excl = (state & LW_VAL_EXCLUSIVE) != 0;
+		uint32		shared = excl ? 0 : state & LW_SHARED_MASK;
 
 		ereport(LOG,
 				(errhidestmt(true),
@@ -284,8 +307,8 @@ PRINT_LWDEBUG(const char *where, LWLock *lock, LWLockMode mode)
 				 errmsg_internal("%d: %s(%s %p): excl %u shared %u haswaiters %u waiters %u rOK %d",
 								 MyProcPid,
 								 where, T_NAME(lock), lock,
-								 (state & LW_VAL_EXCLUSIVE) != 0,
-								 state & LW_SHARED_MASK,
+								 excl,
+								 shared,
 								 (state & LW_FLAG_HAS_WAITERS) != 0,
 								 pg_atomic_read_u32(&lock->nwaiters),
 								 (state & LW_FLAG_RELEASE_OK) != 0)));
@@ -790,15 +813,30 @@ GetLWLockIdentifier(uint32 classId, uint16 eventId)
  * This function will not block waiting for a lock to become free - that's the
  * caller's job.
  *
+ * willwait: true if the caller is willing to wait for the lock to become free
+ *          false if the caller is not willing to wait.
+ *
  * Returns true if the lock isn't free and we need to wait.
  */
 static bool
-LWLockAttemptLock(LWLock *lock, LWLockMode mode)
+LWLockAttemptLock(LWLock *lock, LWLockMode mode, bool willwait)
 {
 	uint32		old_state;
 
 	Assert(mode == LW_EXCLUSIVE || mode == LW_SHARED);
 
+	/*
+	 * To avoid conflicts between the reference count and the LW_VAL_EXCLUSIVE
+	 * flag, this optimization is disabled when willwait is false. See detailed
+	 * comments in this file where LW_SHARED_MASK is defined for more explaination.
+	 */
+	if (willwait && mode == LW_SHARED)
+	{
+		old_state = pg_atomic_fetch_add_u32(&lock->state, LW_VAL_SHARED);
+		Assert((old_state & LW_LOCK_MASK) != LW_LOCK_MASK);
+		return (old_state & LW_VAL_EXCLUSIVE) != 0;
+	}
+
 	/*
 	 * Read once outside the loop, later iterations will get the newer value
 	 * via compare & exchange.
@@ -1242,7 +1280,7 @@ LWLockAcquire(LWLock *lock, LWLockMode mode)
 		 * Try to grab the lock the first time, we're not in the waitqueue
 		 * yet/anymore.
 		 */
-		mustwait = LWLockAttemptLock(lock, mode);
+		mustwait = LWLockAttemptLock(lock, mode, true);
 
 		if (!mustwait)
 		{
@@ -1265,7 +1303,7 @@ LWLockAcquire(LWLock *lock, LWLockMode mode)
 		LWLockQueueSelf(lock, mode);
 
 		/* we're now guaranteed to be woken up if necessary */
-		mustwait = LWLockAttemptLock(lock, mode);
+		mustwait = LWLockAttemptLock(lock, mode, true);
 
 		/* ok, grabbed the lock the second time round, need to undo queueing */
 		if (!mustwait)
@@ -1368,7 +1406,7 @@ LWLockConditionalAcquire(LWLock *lock, LWLockMode mode)
 	HOLD_INTERRUPTS();
 
 	/* Check for the lock */
-	mustwait = LWLockAttemptLock(lock, mode);
+	mustwait = LWLockAttemptLock(lock, mode, false);
 
 	if (mustwait)
 	{
@@ -1435,13 +1473,13 @@ LWLockAcquireOrWait(LWLock *lock, LWLockMode mode)
 	 * NB: We're using nearly the same twice-in-a-row lock acquisition
 	 * protocol as LWLockAcquire(). Check its comments for details.
 	 */
-	mustwait = LWLockAttemptLock(lock, mode);
+	mustwait = LWLockAttemptLock(lock, mode, true);
 
 	if (mustwait)
 	{
 		LWLockQueueSelf(lock, LW_WAIT_UNTIL_FREE);
 
-		mustwait = LWLockAttemptLock(lock, mode);
+		mustwait = LWLockAttemptLock(lock, mode, true);
 
 		if (mustwait)
 		{
@@ -1843,7 +1881,10 @@ LWLockReleaseInternal(LWLock *lock, LWLockMode mode)
 	 * others, even if we still have to wakeup other waiters.
 	 */
 	if (mode == LW_EXCLUSIVE)
-		oldstate = pg_atomic_sub_fetch_u32(&lock->state, LW_VAL_EXCLUSIVE);
+	{
+		oldstate = pg_atomic_fetch_and_u32(&lock->state, ~LW_LOCK_MASK);
+		oldstate &= ~LW_LOCK_MASK;
+	}
 	else
 		oldstate = pg_atomic_sub_fetch_u32(&lock->state, LW_VAL_SHARED);
 
-- 
2.43.0

David Rowley

dgrowleyml@gmail.com

8 months ago

In reply to: Zhou, Zhiguo (#1)

Re: Optimize shared LWLock acquisition for high-core-count systems

On Fri, 30 May 2025 at 23:31, Zhou, Zhiguo <zhiguo.zhou@intel.com> wrote:

Please review this patch for consideration in upcoming PostgreSQL releases.

Please add the patch to https://commitfest.postgresql.org/53/

David

Zhou, Zhiguo

zhiguo.zhou@intel.com

8 months ago

In reply to: David Rowley (#2)

Re: Optimize shared LWLock acquisition for high-core-count systems

Thanks David! I've added the patch referred by this link:
https://commitfest.postgresql.org/patch/5784/

Regards,
Zhiguo

Show quoted text

On 5/30/2025 8:02 PM, David Rowley wrote:

On Fri, 30 May 2025 at 23:31, Zhou, Zhiguo <zhiguo.zhou@intel.com> wrote:

Please review this patch for consideration in upcoming PostgreSQL releases.

Please add the patch to https://commitfest.postgresql.org/53/

David

Yura Sokolov

y.sokolov@postgrespro.ru

6 months ago

In reply to: Zhou, Zhiguo (#1)

1 attachment(s)

Re: Optimize shared LWLock acquisition for high-core-count systems

30.05.2025 14:30, Zhou, Zhiguo пишет:

Hi Hackers,

I am reaching out to solicit your insights and comments on this patch
addressing a significant performance bottleneck in LWLock acquisition
observed on high-core-count systems. During performance analysis of
HammerDB/TPCC (192 virtual users, 757 warehouses) on a 384-vCPU Intel
system, we found that LWLockAttemptLock consumed 7.12% of total CPU
cycles. This bottleneck becomes even more pronounced (up to 30% of
cycles) after applying lock-free WAL optimizations[1][2].

Problem Analysis:
The current LWLock implementation uses separate atomic operations for
state checking and modification. For shared locks (84% of
LWLockAttemptLock calls), this requires:
1.Atomic read of lock->state
2.State modification
3.Atomic compare-exchange (with retries on contention)

This design causes excessive atomic operations on contended locks, which
are particularly expensive on high-core-count systems where cache-line
bouncing amplifies synchronization costs.

Optimization Approach:
The patch optimizes shared lock acquisition by:
1.Merging state read and update into a single atomic add operation
2.Extending LW_SHARED_MASK by 1 bit and shifting LW_VAL_EXCLUSIVE
3.Adding a willwait parameter to control optimization usage

Key implementation details:
- For LW_SHARED with willwait=true: Uses atomic fetch-add to increment
reference count
- Maintains backward compatibility through state mask adjustments
- Preserves existing behavior for:
1) Exclusive locks
2) Non-waiting cases (LWLockConditionalAcquire)
- Bounds shared lock count to MAX_BACKENDS*2 (handled via mask extension)

Performance Impact:
Testing on a 384-vCPU Intel system shows:
- *8%* NOPM improvement in HammerDB/TPCC with this optimization alone
- *46%* cumulative improvement when combined with lock-free WAL
optimizations[1][2]

Patch Contents:
1.Extends shared mask and shifts exclusive lock value
2.Adds willwait parameter to control optimization
3.Updates lock acquisition/release logic
4.Maintains all existing assertions and safety checks

The optimization is particularly effective for contended shared locks,
which are common in buffer mapping, lock manager, and shared buffer
access patterns.

Please review this patch for consideration in upcoming PostgreSQL releases.

[1] Lock-free XLog Reservation from WAL:
/messages/by-id/PH7PR11MB5796659F654F9BE983F3AD97EF142@PH7PR11MB5796.namprd11.prod.outlook.com
[2] Increase NUM_XLOGINSERT_LOCKS:
/messages/by-id/3b11fdc2-9793-403d-b3d4-67ff9a00d447@postgrespro.ru

Good day, Zhou.

Could you explain, why your patch is correct?

As far as I understand, it is clearly not correct at this time:
- SHARED lock count may be incremented many times, because of for(;;) loop
in LWLockAcquire and because LWLockAttemptLock is called twice per each
loop iteration in case lock is held in Exclusive mode by someone else.

If your patch is correct, then where I'm wrong?

When I tried to do same thing, I did sub_fetch immediately in case of
acquisition failure. And did no add_fetch at all if lock is held in
Exclusive mode.

BTW, there's way to optimize EXCLUSIVE lock as well since there's no need
to do CAS if lock is held by someone else.

See my version in attach...

--
regards
Yura Sokolov aka funny-falcon

Attachments:

v1-0001-Optimise-LWLockAttemptLock-by-separate-SHARED-and.patchtext/x-patch; charset=UTF-8; name=v1-0001-Optimise-LWLockAttemptLock-by-separate-SHARED-and.patchDownload

From a87fba3b9a26d2fb5d56bcde709b3315631dd8f6 Mon Sep 17 00:00:00 2001
From: Yura Sokolov <y.sokolov@postgrespro.ru>
Date: Wed, 9 Jul 2025 10:49:38 +0300
Subject: [PATCH v1] Optimise LWLockAttemptLock by separate SHARED and
 EXCLUSIVE acquiring code.

Don't try to write anything if it doesn't look as lock_free.
LWLockAcquire and other functions calls AttemptLock at least twice for
correctness. "Do always CAS" doesn't pay.

Use atomic_fetch_add for SHARED lock acquiring.
---
 src/backend/storage/lmgr/lwlock.c | 67 ++++++++++++++++---------------
 1 file changed, 34 insertions(+), 33 deletions(-)

diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 46f44bc4511..b591831961d 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -796,6 +796,8 @@ static bool
 LWLockAttemptLock(LWLock *lock, LWLockMode mode)
 {
 	uint32		old_state;
+	uint32		desired_state;
+	bool		lock_free;
 
 	Assert(mode == LW_EXCLUSIVE || mode == LW_SHARED);
 
@@ -805,51 +807,50 @@ LWLockAttemptLock(LWLock *lock, LWLockMode mode)
 	 */
 	old_state = pg_atomic_read_u32(&lock->state);
 
-	/* loop until we've determined whether we could acquire the lock or not */
-	while (true)
+	if (mode == LW_SHARED)
 	{
-		uint32		desired_state;
-		bool		lock_free;
-
-		desired_state = old_state;
-
-		if (mode == LW_EXCLUSIVE)
-		{
-			lock_free = (old_state & LW_LOCK_MASK) == 0;
-			if (lock_free)
-				desired_state += LW_VAL_EXCLUSIVE;
-		}
-		else
+		while (true)
 		{
 			lock_free = (old_state & LW_VAL_EXCLUSIVE) == 0;
 			if (lock_free)
-				desired_state += LW_VAL_SHARED;
-		}
+			{
+				old_state = pg_atomic_fetch_add_u32(&lock->state,
+													LW_VAL_SHARED);
+				if ((old_state & LW_VAL_EXCLUSIVE) == 0)
+					return false;
 
-		/*
-		 * Attempt to swap in the state we are expecting. If we didn't see
-		 * lock to be free, that's just the old value. If we saw it as free,
-		 * we'll attempt to mark it acquired. The reason that we always swap
-		 * in the value is that this doubles as a memory barrier. We could try
-		 * to be smarter and only swap in values if we saw the lock as free,
-		 * but benchmark haven't shown it as beneficial so far.
-		 *
-		 * Retry if the value changed since we last looked at it.
-		 */
-		if (pg_atomic_compare_exchange_u32(&lock->state,
-										   &old_state, desired_state))
+				old_state = pg_atomic_sub_fetch_u32(&lock->state,
+													LW_VAL_SHARED);
+			}
+			else
+			{
+				pg_read_barrier();
+				return true;
+			}
+		}
+	}
+
+	while (true)
+	{
+		lock_free = (old_state & LW_LOCK_MASK) == 0;
+		desired_state = old_state | LW_VAL_EXCLUSIVE;
+
+		if (lock_free)
 		{
-			if (lock_free)
+			if (pg_atomic_compare_exchange_u32(&lock->state, &old_state,
+											   desired_state))
 			{
 				/* Great! Got the lock. */
 #ifdef LOCK_DEBUG
-				if (mode == LW_EXCLUSIVE)
-					lock->owner = MyProc;
+				lock->owner = MyProc;
 #endif
 				return false;
 			}
-			else
-				return true;	/* somebody else has the lock */
+		}
+		else
+		{
+			pg_read_barrier();
+			return true;
 		}
 	}
 	pg_unreachable();
-- 
2.43.0

Zhou, Zhiguo

zhiguo.zhou@intel.com

6 months ago

In reply to: Yura Sokolov (#4)

Re: Optimize shared LWLock acquisition for high-core-count systems

On 7/9/2025 3:56 PM, Yura Sokolov wrote:

30.05.2025 14:30, Zhou, Zhiguo пишет:

Hi Hackers,

I am reaching out to solicit your insights and comments on this patch
addressing a significant performance bottleneck in LWLock acquisition
observed on high-core-count systems. During performance analysis of
HammerDB/TPCC (192 virtual users, 757 warehouses) on a 384-vCPU Intel
system, we found that LWLockAttemptLock consumed 7.12% of total CPU
cycles. This bottleneck becomes even more pronounced (up to 30% of
cycles) after applying lock-free WAL optimizations[1][2].

Problem Analysis:
The current LWLock implementation uses separate atomic operations for
state checking and modification. For shared locks (84% of
LWLockAttemptLock calls), this requires:
1.Atomic read of lock->state
2.State modification
3.Atomic compare-exchange (with retries on contention)

This design causes excessive atomic operations on contended locks, which
are particularly expensive on high-core-count systems where cache-line
bouncing amplifies synchronization costs.

Optimization Approach:
The patch optimizes shared lock acquisition by:
1.Merging state read and update into a single atomic add operation
2.Extending LW_SHARED_MASK by 1 bit and shifting LW_VAL_EXCLUSIVE
3.Adding a willwait parameter to control optimization usage

Key implementation details:
- For LW_SHARED with willwait=true: Uses atomic fetch-add to increment
reference count
- Maintains backward compatibility through state mask adjustments
- Preserves existing behavior for:
1) Exclusive locks
2) Non-waiting cases (LWLockConditionalAcquire)
- Bounds shared lock count to MAX_BACKENDS*2 (handled via mask extension)

Performance Impact:
Testing on a 384-vCPU Intel system shows:
- *8%* NOPM improvement in HammerDB/TPCC with this optimization alone
- *46%* cumulative improvement when combined with lock-free WAL
optimizations[1][2]

Patch Contents:
1.Extends shared mask and shifts exclusive lock value
2.Adds willwait parameter to control optimization
3.Updates lock acquisition/release logic
4.Maintains all existing assertions and safety checks

The optimization is particularly effective for contended shared locks,
which are common in buffer mapping, lock manager, and shared buffer
access patterns.

Please review this patch for consideration in upcoming PostgreSQL releases.

[1] Lock-free XLog Reservation from WAL:
/messages/by-id/PH7PR11MB5796659F654F9BE983F3AD97EF142@PH7PR11MB5796.namprd11.prod.outlook.com
[2] Increase NUM_XLOGINSERT_LOCKS:
/messages/by-id/3b11fdc2-9793-403d-b3d4-67ff9a00d447@postgrespro.ru

Good day, Zhou.

Could you explain, why your patch is correct?

As far as I understand, it is clearly not correct at this time:
- SHARED lock count may be incremented many times, because of for(;;) loop
in LWLockAcquire and because LWLockAttemptLock is called twice per each
loop iteration in case lock is held in Exclusive mode by someone else.

If your patch is correct, then where I'm wrong?

When I tried to do same thing, I did sub_fetch immediately in case of
acquisition failure. And did no add_fetch at all if lock is held in
Exclusive mode.

BTW, there's way to optimize EXCLUSIVE lock as well since there's no need
to do CAS if lock is held by someone else.

See my version in attach...

Good day, Yura.

Thanks for your comments which identifies this critical safety condition
– you're absolutely right that we must guarantee the shared reference
count never overflows into the exclusive bit. Let me clarify the safety
mechanism:

When a reader encounters an exclusive lock during acquisition
(triggering the for(;;) loop), it does temporarily increment the shared
count twice – once per LWLockAttemptLock attempt. However, both
increments occur before the reader waits on its semaphore
(PGSemaphoreLock(proc->sem)). Crucially, when the exclusive holder
releases the lock via LWLockReleaseInternal, it resets the entire lock
state (line 1883: pg_atomic_fetch_and_u32(&lock->state,
~LW_LOCK_MASK)). This clears all reader references, including any
"overcounted" increments from blocked readers.

Thus, when blocked readers wake:

1. They retry acquisition on a zero-initialized state
2. Each ultimately increments only once for successful acquisition
3. The transient "overcount" (≤ MAX_BACKENDS × 2) stays safely within
LW_SHARED_MASK

The key invariants are:

- LW_SHARED_MASK = (MAX_BACKENDS << 1) + 1
- Exclusive release resets all shared bits
- Readers never persist >1 reference after wakeup

Does this resolve the concern? I appreciate you flagging this subtlety –
please correct me if I've misunderstood your scenario or misinterpreted
the code.

And I'd appreciate you for sharing your implementation – I particularly
agree with your optimization for exclusive lock acquisition.
Avoiding the CAS loop when the lock is already held (by checking state
early) is a clever reduction of atomic operations, which we know are
costly on high-core-count systems. I’ll prioritize evaluating this for
our HammerDB/TPROC-C workload and share benchmark results soon.

Regarding shared locks: Your version (using sub_fetch on acquisition
failure) does align more cleanly with the original state machine by
avoiding transient overcounts. I initially came up with a similar
approach but shifted to the single-atomic-increment design to minimize
atomic instructions – a critical priority for our 384-core benchmarks
where atomic ops dominate contention.

Let’s reconcile these strengths:

1. I’ll test your patch head-to-head against our current version in HCC
TPROC-C workloads.
2. If the atomic savings in your exclusive path yield meaningful gains,
we will try to integrate it into our patch immediately.
1. For shared locks: if your design shows comparable performance while
simplifying correctness, it’s a compelling option.

Really appreciate you driving this optimization further!

Regards,
Zhiguo

Yura Sokolov

y.sokolov@postgrespro.ru

6 months ago

In reply to: Zhou, Zhiguo (#5)

Re: Optimize shared LWLock acquisition for high-core-count systems

10.07.2025 18:57, Zhou, Zhiguo пишет:

On 7/9/2025 3:56 PM, Yura Sokolov wrote:

30.05.2025 14:30, Zhou, Zhiguo пишет:

Hi Hackers,

I am reaching out to solicit your insights and comments on this patch
addressing a significant performance bottleneck in LWLock acquisition
observed on high-core-count systems. During performance analysis of
HammerDB/TPCC (192 virtual users, 757 warehouses) on a 384-vCPU Intel
system, we found that LWLockAttemptLock consumed 7.12% of total CPU
cycles. This bottleneck becomes even more pronounced (up to 30% of
cycles) after applying lock-free WAL optimizations[1][2].

Problem Analysis:
The current LWLock implementation uses separate atomic operations for
state checking and modification. For shared locks (84% of
LWLockAttemptLock calls), this requires:
1.Atomic read of lock->state
2.State modification
3.Atomic compare-exchange (with retries on contention)

This design causes excessive atomic operations on contended locks, which
are particularly expensive on high-core-count systems where cache-line
bouncing amplifies synchronization costs.

Optimization Approach:
The patch optimizes shared lock acquisition by:
1.Merging state read and update into a single atomic add operation
2.Extending LW_SHARED_MASK by 1 bit and shifting LW_VAL_EXCLUSIVE
3.Adding a willwait parameter to control optimization usage

Key implementation details:
- For LW_SHARED with willwait=true: Uses atomic fetch-add to increment
reference count
- Maintains backward compatibility through state mask adjustments
- Preserves existing behavior for:
1) Exclusive locks
2) Non-waiting cases (LWLockConditionalAcquire)
- Bounds shared lock count to MAX_BACKENDS*2 (handled via mask extension)

Performance Impact:
Testing on a 384-vCPU Intel system shows:
- *8%* NOPM improvement in HammerDB/TPCC with this optimization alone
- *46%* cumulative improvement when combined with lock-free WAL
optimizations[1][2]

Patch Contents:
1.Extends shared mask and shifts exclusive lock value
2.Adds willwait parameter to control optimization
3.Updates lock acquisition/release logic
4.Maintains all existing assertions and safety checks

The optimization is particularly effective for contended shared locks,
which are common in buffer mapping, lock manager, and shared buffer
access patterns.

Please review this patch for consideration in upcoming PostgreSQL releases.

[1] Lock-free XLog Reservation from WAL:
/messages/by-id/PH7PR11MB5796659F654F9BE983F3AD97EF142@PH7PR11MB5796.namprd11.prod.outlook.com
[2] Increase NUM_XLOGINSERT_LOCKS:
/messages/by-id/3b11fdc2-9793-403d-b3d4-67ff9a00d447@postgrespro.ru

Good day, Zhou.

Could you explain, why your patch is correct?

As far as I understand, it is clearly not correct at this time:
- SHARED lock count may be incremented many times, because of for(;;) loop
in LWLockAcquire and because LWLockAttemptLock is called twice per each
loop iteration in case lock is held in Exclusive mode by someone else.

If your patch is correct, then where I'm wrong?

When I tried to do same thing, I did sub_fetch immediately in case of
acquisition failure. And did no add_fetch at all if lock is held in
Exclusive mode.

BTW, there's way to optimize EXCLUSIVE lock as well since there's no need
to do CAS if lock is held by someone else.

See my version in attach...

Good day, Yura.

Thanks for your comments which identifies this critical safety condition
– you're absolutely right that we must guarantee the shared reference
count never overflows into the exclusive bit. Let me clarify the safety
mechanism:

When a reader encounters an exclusive lock during acquisition
(triggering the for(;;) loop), it does temporarily increment the shared
count twice – once per LWLockAttemptLock attempt. However, both
increments occur before the reader waits on its semaphore
(PGSemaphoreLock(proc->sem)). Crucially, when the exclusive holder
releases the lock via LWLockReleaseInternal, it resets the entire lock
state (line 1883: pg_atomic_fetch_and_u32(&lock->state,
~LW_LOCK_MASK)). This clears all reader references, including any
"overcounted" increments from blocked readers.

I see my mistake now: I misread this pg_atomic_fetch_and as
pg_atomic_fetch_add.

Clever trick. But rather unintuitive. It is hard to mean about its safety.
It have to be described in details in code comments and commit message. But
you completely missed description of this important nuance in first patch
version.

Thus, when blocked readers wake:

1. They retry acquisition on a zero-initialized state
2. Each ultimately increments only once for successful acquisition
3. The transient "overcount" (≤ MAX_BACKENDS × 2) stays safely within
LW_SHARED_MASK

The key invariants are:

- LW_SHARED_MASK = (MAX_BACKENDS << 1) + 1
- Exclusive release resets all shared bits
- Readers never persist >1 reference after wakeup

Does this resolve the concern? I appreciate you flagging this subtlety –
please correct me if I've misunderstood your scenario or misinterpreted
the code.

And I'd appreciate you for sharing your implementation – I particularly
agree with your optimization for exclusive lock acquisition.
Avoiding the CAS loop when the lock is already held (by checking state
early) is a clever reduction of atomic operations, which we know are
costly on high-core-count systems. I’ll prioritize evaluating this for
our HammerDB/TPROC-C workload and share benchmark results soon.

This done the same for shared lock: if lock is locked as exclusive, no
fetch_add is performed. And read before fetch_add is not expensive, I
believe. But I didn't test it on a such huge machine as yours, so I could
be mistaken.

Regarding shared locks: Your version (using sub_fetch on acquisition
failure) does align more cleanly with the original state machine by
avoiding transient overcounts. I initially came up with a similar
approach but shifted to the single-atomic-increment design to minimize
atomic instructions – a critical priority for our 384-core benchmarks
where atomic ops dominate contention.

Let’s reconcile these strengths:

1. I’ll test your patch head-to-head against our current version in HCC
TPROC-C workloads.
2. If the atomic savings in your exclusive path yield meaningful gains,
we will try to integrate it into our patch immediately.
1. For shared locks: if your design shows comparable performance while
simplifying correctness, it’s a compelling option.

Really appreciate you driving this optimization further!

I appreciate your work too!
Scalability has long starving increased attention.

--
regards
Yura Sokolov aka funny-falcon

Zhou, Zhiguo

zhiguo.zhou@intel.com

6 months ago

In reply to: Yura Sokolov (#6)

1 attachment(s)

Re: Optimize shared LWLock acquisition for high-core-count systems

On 7/11/2025 4:35 PM, Yura Sokolov wrote:

10.07.2025 18:57, Zhou, Zhiguo пишет:

On 7/9/2025 3:56 PM, Yura Sokolov wrote:

30.05.2025 14:30, Zhou, Zhiguo пишет:

Hi Hackers,

I am reaching out to solicit your insights and comments on this patch
addressing a significant performance bottleneck in LWLock acquisition
observed on high-core-count systems. During performance analysis of
HammerDB/TPCC (192 virtual users, 757 warehouses) on a 384-vCPU Intel
system, we found that LWLockAttemptLock consumed 7.12% of total CPU
cycles. This bottleneck becomes even more pronounced (up to 30% of
cycles) after applying lock-free WAL optimizations[1][2].

Problem Analysis:
The current LWLock implementation uses separate atomic operations for
state checking and modification. For shared locks (84% of
LWLockAttemptLock calls), this requires:
1.Atomic read of lock->state
2.State modification
3.Atomic compare-exchange (with retries on contention)

This design causes excessive atomic operations on contended locks, which
are particularly expensive on high-core-count systems where cache-line
bouncing amplifies synchronization costs.

Optimization Approach:
The patch optimizes shared lock acquisition by:
1.Merging state read and update into a single atomic add operation
2.Extending LW_SHARED_MASK by 1 bit and shifting LW_VAL_EXCLUSIVE
3.Adding a willwait parameter to control optimization usage

Key implementation details:
- For LW_SHARED with willwait=true: Uses atomic fetch-add to increment
reference count
- Maintains backward compatibility through state mask adjustments
- Preserves existing behavior for:
1) Exclusive locks
2) Non-waiting cases (LWLockConditionalAcquire)
- Bounds shared lock count to MAX_BACKENDS*2 (handled via mask extension)

Performance Impact:
Testing on a 384-vCPU Intel system shows:
- *8%* NOPM improvement in HammerDB/TPCC with this optimization alone
- *46%* cumulative improvement when combined with lock-free WAL
optimizations[1][2]

Patch Contents:
1.Extends shared mask and shifts exclusive lock value
2.Adds willwait parameter to control optimization
3.Updates lock acquisition/release logic
4.Maintains all existing assertions and safety checks

The optimization is particularly effective for contended shared locks,
which are common in buffer mapping, lock manager, and shared buffer
access patterns.

Please review this patch for consideration in upcoming PostgreSQL releases.

[1] Lock-free XLog Reservation from WAL:
/messages/by-id/PH7PR11MB5796659F654F9BE983F3AD97EF142@PH7PR11MB5796.namprd11.prod.outlook.com
[2] Increase NUM_XLOGINSERT_LOCKS:
/messages/by-id/3b11fdc2-9793-403d-b3d4-67ff9a00d447@postgrespro.ru

Good day, Zhou.

Could you explain, why your patch is correct?

As far as I understand, it is clearly not correct at this time:
- SHARED lock count may be incremented many times, because of for(;;) loop
in LWLockAcquire and because LWLockAttemptLock is called twice per each
loop iteration in case lock is held in Exclusive mode by someone else.

If your patch is correct, then where I'm wrong?

When I tried to do same thing, I did sub_fetch immediately in case of
acquisition failure. And did no add_fetch at all if lock is held in
Exclusive mode.

We rigorously tested your suggested approach of using sub_fetch on
acquisition failure to avoid temporary overcounting with [1]Lock-free XLog Reservation from WAL: /messages/by-id/PH7PR11MB5796659F654F9BE983F3AD97EF142@PH7PR11MB5796.namprd11.prod.outlook.com[2]Increase NUM_XLOGINSERT_LOCKS: /messages/by-id/3b11fdc2-9793-403d-b3d4-67ff9a00d447@postgrespro.ru as the
baseline. Performance results on our 384-vCPU Intel system running
HammerDB/TPCC (192 VU, 757 warehouses) showed an NOPM improvement of
8.36% which is lower than the 20.81% brought by this optimization.

BTW, there's way to optimize EXCLUSIVE lock as well since there's no need
to do CAS if lock is held by someone else.

See my version in attach...

Good day, Yura.

Thanks for your comments which identifies this critical safety condition
– you're absolutely right that we must guarantee the shared reference
count never overflows into the exclusive bit. Let me clarify the safety
mechanism:

When a reader encounters an exclusive lock during acquisition
(triggering the for(;;) loop), it does temporarily increment the shared
count twice – once per LWLockAttemptLock attempt. However, both
increments occur before the reader waits on its semaphore
(PGSemaphoreLock(proc->sem)). Crucially, when the exclusive holder
releases the lock via LWLockReleaseInternal, it resets the entire lock
state (line 1883: pg_atomic_fetch_and_u32(&lock->state,
~LW_LOCK_MASK)). This clears all reader references, including any
"overcounted" increments from blocked readers.

I see my mistake now: I misread this pg_atomic_fetch_and as
pg_atomic_fetch_add.

Clever trick. But rather unintuitive. It is hard to mean about its safety.
It have to be described in details in code comments and commit message. But
you completely missed description of this important nuance in first patch
version.

We've significantly expanded the code comments and commit message
(attached) to explicitly document:
- The transient "overcount" scenario when readers encounter exclusive locks
- The safety boundary (MAX_BACKENDS * 2 references)
- The critical reset mechanism in LWLockReleaseInternal where
pg_atomic_fetch_and_u32(&lock->state, ~LW_LOCK_MASK) clears all references
- Conditions under which the optimization activates (only when
willwait=true)

Thus, when blocked readers wake:

1. They retry acquisition on a zero-initialized state
2. Each ultimately increments only once for successful acquisition
3. The transient "overcount" (≤ MAX_BACKENDS × 2) stays safely within
LW_SHARED_MASK

The key invariants are:

- LW_SHARED_MASK = (MAX_BACKENDS << 1) + 1
- Exclusive release resets all shared bits
- Readers never persist >1 reference after wakeup

Does this resolve the concern? I appreciate you flagging this subtlety –
please correct me if I've misunderstood your scenario or misinterpreted
the code.

And I'd appreciate you for sharing your implementation – I particularly
agree with your optimization for exclusive lock acquisition.
Avoiding the CAS loop when the lock is already held (by checking state
early) is a clever reduction of atomic operations, which we know are
costly on high-core-count systems. I’ll prioritize evaluating this for
our HammerDB/TPROC-C workload and share benchmark results soon.

While testing your exclusive lock optimization (avoiding CAS when lock
is held), we observed neutral performance (compared with this
optimization) in TPCC as the workload isn't exclusive-lock intensive. We
recognize its potential value and will:
- Find another dedicated benchmark to quantify its gains
- Propose it as a separate optimization once validated

This done the same for shared lock: if lock is locked as exclusive, no
fetch_add is performed. And read before fetch_add is not expensive, I
believe. But I didn't test it on a such huge machine as yours, so I could
be mistaken.

We also tried to early give up acquiring shared lock when it is locked
as exclusive by inserting an atomic_read before the fetch_add. The
performance test shows an improvement of 8.4% which is lower than the
20.81% brought by the fetch_add-only implementation, suggesting that the
additional atomic_read operation is also critical on the high-core-count
systems.

Regarding shared locks: Your version (using sub_fetch on acquisition
failure) does align more cleanly with the original state machine by
avoiding transient overcounts. I initially came up with a similar
approach but shifted to the single-atomic-increment design to minimize
atomic instructions – a critical priority for our 384-core benchmarks
where atomic ops dominate contention.

Let’s reconcile these strengths:

1. I’ll test your patch head-to-head against our current version in HCC
TPROC-C workloads.
2. If the atomic savings in your exclusive path yield meaningful gains,
we will try to integrate it into our patch immediately.
1. For shared locks: if your design shows comparable performance while
simplifying correctness, it’s a compelling option.

Really appreciate you driving this optimization further!

I appreciate your work too!
Scalability has long starving increased attention.

Good day, Yura!

Thank you for your thoughtful review and critical feedback on the LWLock
optimization patch. Your insights into the potential shared count
overflow issue were especially valuable and prompted us to thoroughly
re-examine the safety mechanisms.

We've preserved the original optimization due to its demonstrated
performance advantages but incorporated your critical documentation
enhancements. The revised patch now includes:

- Comprehensive safety explanations in lwlock.c comments
- Detailed commit message describing transient states

We believe this addresses the correctness concerns of our readers while
delivering significant scalability gains. We welcome further scrutiny
and would appreciate your thoughts on the revised documentation.

Thank you again for your expertise in strengthening this optimization!

Regards,
Zhiguo

Attachments:

v1-0001-Optimize-shared-LWLock-acquisition-for-high-core-cou.patchtext/plain; charset=UTF-8; name=v1-0001-Optimize-shared-LWLock-acquisition-for-high-core-cou.patchDownload

From c77f2755dca4dec83a5f16df634b07d3e8b91a96 Mon Sep 17 00:00:00 2001
From: Zhiguo Zhou <zhiguo.zhou@intel.com>
Date: Thu, 29 May 2025 16:55:42 +0800
Subject: [PATCH v1] Optimize shared LWLock acquisition for high-core-count
 systems

This patch introduces optimizations to reduce lock acquisition overhead in
LWLock by merging the read and update operations for the LW_SHARED lock's
state. This eliminates the need for separate atomic instructions, which is
critical for improving performance on high-core-count systems.

Key changes:
- Extended LW_SHARED_MASK by 1 bit and shifted LW_VAL_EXCLUSIVE by 1 bit to
  ensure compatibility with the upper bound of MAX_BACKENDS * 2.
- Added a `willwait` parameter to `LWLockAttemptLock` to disable the
  optimization when the caller is unwilling to wait, avoiding conflicts
  between the reference count and the LW_VAL_EXCLUSIVE flag.
- Updated `LWLockReleaseInternal` to use `pg_atomic_fetch_and_u32` for
  clearing lock state flags atomically.
- Adjusted related functions (`LWLockAcquire`, `LWLockConditionalAcquire`,
  `LWLockAcquireOrWait`) to pass the `willwait` parameter appropriately.

Key optimization ideas:
It is only activated when willwait=true, ensuring that the reference
count does not grow unchecked and overflow into the LW_VAL_EXCLUSIVE bit.

Three scenarios can occur when acquiring a shared lock:
1) Lock is free: atomically increment reference count and acquire
2) Lock held in shared mode: atomically increment reference count and acquire
3) Lock held exclusively: atomically increment reference count but fail to acquire

Scenarios 1 and 2 work as expected - we successfully increment the count
and acquire the lock.

Scenario 3 is counterintuitive: we increment the reference count even though
we cannot acquire the lock due to the exclusive holder. This creates a
temporarily invalid reference count, but it's acceptable because:
- The LW_VAL_EXCLUSIVE flag takes precedence in determining lock state
- Each process retries at most twice before blocking on a semaphore
- This bounds the "overcounted" references to MAX_BACKENDS * 2
- The bound fits within LW_SHARED_MASK capacity
- The lock->state including "overcounted" references is reset when the exclusive
  lock is released.

These changes improve scalability and reduce contention in workloads with
frequent LWLock operations on servers with many cores.
---
 src/backend/storage/lmgr/lwlock.c | 104 +++++++++++++++++++++++++-----
 1 file changed, 88 insertions(+), 16 deletions(-)

diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 46f44bc4511..613a35021b8 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -97,20 +97,41 @@
 #define LW_FLAG_BITS				3
 #define LW_FLAG_MASK				(((1<<LW_FLAG_BITS)-1)<<(32-LW_FLAG_BITS))
 
-/* assumes MAX_BACKENDS is a (power of 2) - 1, checked below */
-#define LW_VAL_EXCLUSIVE			(MAX_BACKENDS + 1)
+/*
+ * already (power of 2)-1, i.e. suitable for a mask
+ *
+ * Originally, the LW_SHARED lock reference count was maintained in bits
+ * [MAX_BACKEND_BITS-1:0] of LWLock.state, with a theoretical maximum of
+ * MAX_BACKENDS (when all MAX_BACKENDS processes hold the lock concurrently).
+ *
+ * To reduce lock acquisition overhead, we optimized LWLockAttemptLock by
+ * merging the read and update operations for the LW_SHARED lock's state.
+ * This eliminates the need for separate atomic instructions - a critical
+ * improvement given the high cost of atomic operations on high-core-count
+ * systems.
+ *
+ * This optimization introduces a scenario where the reference count may
+ * temporarily increment even when a reader fails to acquire an exclusive lock.
+ * However, since each process retries lock acquisition up to *twice* before
+ * waiting on a semaphore, the reference count is bounded by MAX_BACKENDS * 2.
+ *
+ * To ensure compatibility with this upper bound:
+ * 1. LW_SHARED_MASK has been extended by 1 bit
+ * 2. LW_VAL_EXCLUSIVE is left-shifted by 1 bit
+ */
+#define LW_SHARED_MASK				((MAX_BACKENDS << 1) + 1)
+#define LW_VAL_EXCLUSIVE			(LW_SHARED_MASK + 1)
+#define LW_LOCK_MASK				(LW_SHARED_MASK	| LW_VAL_EXCLUSIVE)
 #define LW_VAL_SHARED				1
 
-/* already (power of 2)-1, i.e. suitable for a mask */
-#define LW_SHARED_MASK				MAX_BACKENDS
-#define LW_LOCK_MASK				(MAX_BACKENDS | LW_VAL_EXCLUSIVE)
+/* assumes MAX_BACKENDS is a (power of 2) - 1, checked below */
 
 
 StaticAssertDecl(((MAX_BACKENDS + 1) & MAX_BACKENDS) == 0,
 				 "MAX_BACKENDS + 1 needs to be a power of 2");
 
-StaticAssertDecl((MAX_BACKENDS & LW_FLAG_MASK) == 0,
-				 "MAX_BACKENDS and LW_FLAG_MASK overlap");
+StaticAssertDecl((LW_SHARED_MASK & LW_FLAG_MASK) == 0,
+				 "LW_SHARED_MASK and LW_FLAG_MASK overlap");
 
 StaticAssertDecl((LW_VAL_EXCLUSIVE & LW_FLAG_MASK) == 0,
 				 "LW_VAL_EXCLUSIVE and LW_FLAG_MASK overlap");
@@ -277,6 +298,8 @@ PRINT_LWDEBUG(const char *where, LWLock *lock, LWLockMode mode)
 	if (Trace_lwlocks)
 	{
 		uint32		state = pg_atomic_read_u32(&lock->state);
+		uint32		excl = (state & LW_VAL_EXCLUSIVE) != 0;
+		uint32		shared = excl ? 0 : state & LW_SHARED_MASK;
 
 		ereport(LOG,
 				(errhidestmt(true),
@@ -284,8 +307,8 @@ PRINT_LWDEBUG(const char *where, LWLock *lock, LWLockMode mode)
 				 errmsg_internal("%d: %s(%s %p): excl %u shared %u haswaiters %u waiters %u rOK %d",
 								 MyProcPid,
 								 where, T_NAME(lock), lock,
-								 (state & LW_VAL_EXCLUSIVE) != 0,
-								 state & LW_SHARED_MASK,
+								 excl,
+								 shared,
 								 (state & LW_FLAG_HAS_WAITERS) != 0,
 								 pg_atomic_read_u32(&lock->nwaiters),
 								 (state & LW_FLAG_RELEASE_OK) != 0)));
@@ -790,14 +813,53 @@ GetLWLockIdentifier(uint32 classId, uint16 eventId)
  * This function will not block waiting for a lock to become free - that's the
  * caller's job.
  *
+ * willwait: true if the caller is willing to wait for the lock to become free
+ *          false if the caller is not willing to wait.
+ *
  * Returns true if the lock isn't free and we need to wait.
  */
 static bool
-LWLockAttemptLock(LWLock *lock, LWLockMode mode)
+LWLockAttemptLock(LWLock *lock, LWLockMode mode, bool willwait)
 {
 	uint32		old_state;
 
 	Assert(mode == LW_EXCLUSIVE || mode == LW_SHARED);
+	/*
+	 * Optimized shared lock acquisition using atomic fetch-and-add.
+	 *
+	 * This optimization aims to lower the cost of acquiring shared locks
+	 * by reducing the number of atomic operations, which can be expensive
+	 * on systems with many CPU cores.
+	 *
+	 * It is only activated when willwait=true, ensuring that the reference
+	 * count does not grow unchecked and overflow into the LW_VAL_EXCLUSIVE bit.
+	 *
+	 * Three scenarios can occur when acquiring a shared lock:
+	 * 1) Lock is free: atomically increment reference count and acquire
+	 * 2) Lock held in shared mode: atomically increment reference count and acquire
+	 * 3) Lock held exclusively: atomically increment reference count but fail to acquire
+	 *
+	 * Scenarios 1 and 2 work as expected - we successfully increment the count
+	 * and acquire the lock.
+	 *
+	 * Scenario 3 is counterintuitive: we increment the reference count even though
+	 * we cannot acquire the lock due to the exclusive holder. This creates a
+	 * temporarily invalid reference count, but it's acceptable because:
+	 * - The LW_VAL_EXCLUSIVE flag takes precedence in determining lock state
+	 * - Each process retries at most twice before blocking on a semaphore
+	 * - This bounds the "overcounted" references to MAX_BACKENDS * 2
+	 * - The bound fits within LW_SHARED_MASK capacity
+	 * - The lock->state including "overcounted" references is reset when the exclusive
+	 *   lock is released.
+	 *
+	 * See LW_SHARED_MASK definition comments for additional details.
+	 */
+	if (willwait && mode == LW_SHARED)
+	{
+		old_state = pg_atomic_fetch_add_u32(&lock->state, LW_VAL_SHARED);
+		Assert((old_state & LW_LOCK_MASK) != LW_LOCK_MASK);
+		return (old_state & LW_VAL_EXCLUSIVE) != 0;
+	}
 
 	/*
 	 * Read once outside the loop, later iterations will get the newer value
@@ -1242,7 +1304,7 @@ LWLockAcquire(LWLock *lock, LWLockMode mode)
 		 * Try to grab the lock the first time, we're not in the waitqueue
 		 * yet/anymore.
 		 */
-		mustwait = LWLockAttemptLock(lock, mode);
+		mustwait = LWLockAttemptLock(lock, mode, true);
 
 		if (!mustwait)
 		{
@@ -1265,7 +1327,7 @@ LWLockAcquire(LWLock *lock, LWLockMode mode)
 		LWLockQueueSelf(lock, mode);
 
 		/* we're now guaranteed to be woken up if necessary */
-		mustwait = LWLockAttemptLock(lock, mode);
+		mustwait = LWLockAttemptLock(lock, mode, true);
 
 		/* ok, grabbed the lock the second time round, need to undo queueing */
 		if (!mustwait)
@@ -1296,6 +1358,7 @@ LWLockAcquire(LWLock *lock, LWLockMode mode)
 
 		for (;;)
 		{
+			/* When signaled, lock->state has been zero-initialized by the previous holder */
 			PGSemaphoreLock(proc->sem);
 			if (proc->lwWaiting == LW_WS_NOT_WAITING)
 				break;
@@ -1368,7 +1431,7 @@ LWLockConditionalAcquire(LWLock *lock, LWLockMode mode)
 	HOLD_INTERRUPTS();
 
 	/* Check for the lock */
-	mustwait = LWLockAttemptLock(lock, mode);
+	mustwait = LWLockAttemptLock(lock, mode, false);
 
 	if (mustwait)
 	{
@@ -1435,13 +1498,13 @@ LWLockAcquireOrWait(LWLock *lock, LWLockMode mode)
 	 * NB: We're using nearly the same twice-in-a-row lock acquisition
 	 * protocol as LWLockAcquire(). Check its comments for details.
 	 */
-	mustwait = LWLockAttemptLock(lock, mode);
+	mustwait = LWLockAttemptLock(lock, mode, true);
 
 	if (mustwait)
 	{
 		LWLockQueueSelf(lock, LW_WAIT_UNTIL_FREE);
 
-		mustwait = LWLockAttemptLock(lock, mode);
+		mustwait = LWLockAttemptLock(lock, mode, true);
 
 		if (mustwait)
 		{
@@ -1461,6 +1524,7 @@ LWLockAcquireOrWait(LWLock *lock, LWLockMode mode)
 
 			for (;;)
 			{
+				/* When signaled, lock->state has been zero-initialized by the previous holder */
 				PGSemaphoreLock(proc->sem);
 				if (proc->lwWaiting == LW_WS_NOT_WAITING)
 					break;
@@ -1843,7 +1907,15 @@ LWLockReleaseInternal(LWLock *lock, LWLockMode mode)
 	 * others, even if we still have to wakeup other waiters.
 	 */
 	if (mode == LW_EXCLUSIVE)
-		oldstate = pg_atomic_sub_fetch_u32(&lock->state, LW_VAL_EXCLUSIVE);
+	{
+		/*
+		 * To release the exclusive lock, all bits of LW_LOCK_MASK,
+		 * including any "overcounted" increments from blocked readers,
+		 * are cleared.
+		 */
+		oldstate = pg_atomic_fetch_and_u32(&lock->state, ~LW_LOCK_MASK);
+		oldstate &= ~LW_LOCK_MASK;
+	}
 	else
 		oldstate = pg_atomic_sub_fetch_u32(&lock->state, LW_VAL_SHARED);
 
-- 
2.43.0