Optimize LWLock scalability via ReadBiasedLWLock for heavily-shared locks

Started by Zhou, Zhiguo · 10 months ago · 5 messages · hackers
#1 Zhou, Zhiguo
zhiguo.zhou@intel.com

Hi Hackers,

This patch addresses severe LWLock contention observed on high-core systems
where hundreds of processors concurrently access frequently-shared locks.
Specifically for ProcArrayLock (exhibiting 93.5% shared-mode acquires), we
implement a new ReadBiasedLWLock mechanism to eliminate the atomic operation
bottleneck.

Key aspects:
1. Problem: Previous optimizations[1] left LWLockAttemptLock/Release consuming
   ~25% total CPU cycles on 384-vCPU systems due to contention on a single
   lock-state cache line. Shared lock attempts showed 37x higher cumulative
   latency than exclusive mode for ProcArrayLock.

2. Solution: ReadBiasedLWLock partitions lock state across 16 cache lines
   (READ_BIASED_LOCK_STATE_COUNT):
   - Readers acquire/release only their designated LWLock (indexed by
     pid % 16) using a single atomic operation
   - Writers pay higher cost by acquiring all 16 sub-locks exclusively
   - Maintains LWLock's "acquiring process must release" semantics (see the
     rough sketch after this list)

3. Performance: HammerDB/TPCC shows 35.3% NOPM improvement over baseline
   - Lock acquisition CPU cycles reduced from 16.7% to 7.4%
   - Lock release cycles reduced from 7.9% to 2.2%

4. Implementation:
   - Core infrastructure for ReadBiasedLWLock
   - ProcArrayLock converted as proof-of-concept
   - Maintains full LWLock API compatibility

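To make point 2 above concrete, below is a rough, illustrative sketch of how
such a partitioned lock could be built on top of the existing LWLock API. The
type and function names are purely for explanation and need not match what
the attached patch actually does:

/*
 * Illustrative sketch only -- names and layout are for explanation and
 * need not match the attached patch exactly.
 */
#include "postgres.h"
#include "miscadmin.h"          /* MyProcPid */
#include "storage/lwlock.h"     /* LWLockPadded, LWLockAcquire, ... */

#define READ_BIASED_LOCK_STATE_COUNT 16

typedef struct ReadBiasedLWLock
{
	/* one padded LWLock per cache line so readers do not collide */
	LWLockPadded	locks[READ_BIASED_LOCK_STATE_COUNT];
} ReadBiasedLWLock;

/* Readers touch only the sub-lock selected by their pid. */
static void
ReadBiasedLWLockAcquireShared(ReadBiasedLWLock *rblock)
{
	LWLockAcquire(&rblock->locks[MyProcPid % READ_BIASED_LOCK_STATE_COUNT].lock,
				  LW_SHARED);
}

/* The acquiring process releases the same sub-lock it acquired. */
static void
ReadBiasedLWLockReleaseShared(ReadBiasedLWLock *rblock)
{
	LWLockRelease(&rblock->locks[MyProcPid % READ_BIASED_LOCK_STATE_COUNT].lock);
}

/* Writers must own all sub-locks, so exclusive acquisition is ~16x costlier. */
static void
ReadBiasedLWLockAcquireExclusive(ReadBiasedLWLock *rblock)
{
	for (int i = 0; i < READ_BIASED_LOCK_STATE_COUNT; i++)
		LWLockAcquire(&rblock->locks[i].lock, LW_EXCLUSIVE);
}

The reader fast path thus touches a single cache line, while a writer has to
iterate over all sixteen, which is where the higher exclusive-acquisition cost
listed under the known considerations below comes from.
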
Known considerations:
- Increased writer acquisition cost (acceptable given rarity of exclusive
  acquisitions for biased locks like ProcArrayLock)
- Memory overhead: 16x size increase per converted lock
- Currently validated for ProcArrayLock; other heavily-shared locks may be
  candidates after further analysis

This is a preliminary version for community feedback. We're actively:
1. Refining the implementation details
2. Expanding test coverage
3. Investigating additional lock candidates
4. Optimizing writer-fast-path opportunities

Test results, profiling data, and design details can be shared upon request.
We appreciate all comments and suggestions for improvement.

[1]: Optimize shared LWLock acquisition for high-core-count systems:
/messages/by-id/73d53acf-4f66-41df-b438-5c2e6115d4de@intel.com

Regards,

Zhiguo

Attachments:

v0-0001-Optimize-lock-acquisition-release-with-ReadBiased.patch (text/plain, +260/-119)
#2 Zhou, Zhiguo
zhiguo.zhou@intel.com
In reply to: Zhou, Zhiguo (#1)
Re: Optimize LWLock scalability via ReadBiasedLWLock for heavily-shared locks

As a follow-up to the previous mail, the proposed patch has been refined and
logically split into two distinct patches for clearer review and testing: the
first introduces the new ReadBiasedLWLock mechanism, and the second converts
ProcArrayLock from a plain LWLock to the new ReadBiasedLWLock type. This
patchset passes the standard make-check regression tests. We are now eager to
submit these updated patches for community review and welcome any feedback or
discussion on the implementation approach. Thanks!

Regards,
Zhiguo

Attachments:

v1-0001-Introduce-ReadBiasedLWLock-for-high-concurrency-read.patch (text/plain, +375/-70)
v1-0002-Convert-ProcArrayLock-to-ReadBiasedLWLock-for-improv.patch (text/plain, +117/-118)
#3 Andres Freund
andres@anarazel.de
In reply to: Zhou, Zhiguo (#1)
Re: Optimize LWLock scalability via ReadBiasedLWLock for heavily-shared locks

Hi,

On 2025-06-26 13:07:49 +0800, Zhou, Zhiguo wrote:

Known considerations:
- Increased writer acquisition cost (acceptable given rarity of exclusive
  acquisitions for biased locks like ProcArrayLock)

Unfortunately I have a very hard time believing that that's unacceptable -
there are plenty of workloads (many write-intensive ones) where exclusive
locks on ProcArrayLock are the bottleneck.

Greetings,

Andres Freund

#4 Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#3)
Re: Optimize LWLock scalability via ReadBiasedLWLock for heavily-shared locks

Hi,

On 2025-07-01 09:57:18 -0400, Andres Freund wrote:

Unfortunately I have a very hard time believing that that's unacceptable -
there are plenty of workloads (many write-intensive ones) where exclusive
locks on ProcArrayLock are the bottleneck.

Ooops, s/unacceptable/acceptable/

Greetings,

Andres Freund

#5 Zhou, Zhiguo
zhiguo.zhou@intel.com
In reply to: Andres Freund (#4)
Re: Optimize LWLock scalability via ReadBiasedLWLock for heavily-shared locks

Hi Andres,

On 7/1/2025 10:06 PM, Andres Freund wrote:

Unfortunately I have a very hard time believing that that's unacceptable -
there are plenty of workloads (many write-intensive ones) where exclusive
locks on ProcArrayLock are the bottleneck.

Ooops, s/unacceptable/acceptable/

Greetings,

Andres Freund

Thank you for raising this important concern about potential impacts on
write-intensive workloads. You're absolutely right to question whether
increased exclusive acquisition costs are acceptable. To address your point:

1. We acknowledge this is an aggressive optimization with inherent
trade-offs. While it addresses severe shared-acquisition bottlenecks
(particularly relevant for large-core systems), we fully recognize that
the increased exclusive acquisition cost could be problematic for
write-heavy scenarios. Our position is that for locks with highly
skewed access patterns like ProcArrayLock (where 93.5% of acquisitions
are shared), this trade-off may be worthwhile.

2. Our focus on HammerDB/TPCC stems from customer workloads where
ProcArrayLock contention in shared mode is demonstrably the dominant
bottleneck. Our profiling shows:
- 37x higher cumulative latency for shared acquires vs exclusive
- 16.7% of CPU cycles consumed by lock acquisition pre-optimization

This suggests that for this specific lock in OLTP contexts, mitigating
shared contention is critical.

3. To better understand scenarios where this trade-off might be unfavorable,
could you share specific write-intensive workloads you're concerned
about? We would prioritize evaluating this patch against:
a) Benchmarks known to stress ProcArrayLock in exclusive mode
b) Production-like workloads where you anticipate exclusive acquisitions
might become problematic

We're committed to testing this rigorously and exploring mitigation
strategies if needed.

4. Before implementing ReadBiasedLWLock, we explored less invasive
alternatives in [1]. That approach maintained the existing exclusive
lock path while optimizing shared acquisitions with a single atomic
operation (a rough sketch of that idea follows below). Would you be
willing to review that approach first? We believe it might offer a more
balanced solution while still addressing the core contention issue
identified in TPCC.

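For reference, a shared acquisition done with a single atomic operation could
look roughly like the sketch below, as it might appear inside lwlock.c (where
LW_VAL_SHARED and LW_VAL_EXCLUSIVE are defined). The function name and details
are illustrative only and not necessarily what the patch in [1] implements:

/*
 * Illustrative only: acquire an LWLock in shared mode with one atomic
 * fetch-add instead of a compare-and-swap loop.
 */
static bool
LWLockAttemptLockSharedSketch(LWLock *lock)
{
	uint32		old_state;

	/* optimistically register one shared holder with a single fetch-add */
	old_state = pg_atomic_fetch_add_u32(&lock->state, LW_VAL_SHARED);

	if (old_state & LW_VAL_EXCLUSIVE)
	{
		/* an exclusive holder is present: undo our reference and fail */
		pg_atomic_fetch_sub_u32(&lock->state, LW_VAL_SHARED);
		return false;
	}

	return true;
}

The common case is a single fetch-add; only when a writer already holds the
lock does the reader undo its reference and fall back to the slow path.
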
We appreciate your expertise here and want to ensure we don't simply
shift the bottleneck from readers to writers. Your guidance on suitable
stress tests for exclusive acquisition overhead would be invaluable to
our next steps.

[1]: Optimize shared LWLock acquisition for high-core-count systems:
/messages/by-id/73d53acf-4f66-41df-b438-5c2e6115d4de@intel.com

Regards,
Zhiguo