[RFC] Lock-free XLog Reservation from WAL

Started by Zhou, Zhiguoabout 1 year ago47 messages

zhiguo.zhou@intel.com

about 1 year ago

Hi all,

I am reaching out to solicit your insights and comments on a recent proposal regarding the "Lock-free XLog Reservation from WAL." We have identified some challenges with the current WAL insertions, which require space reservations in the WAL buffer which involve updating two shared-memory statuses in XLogCtlInsert: CurrBytePos (the start position of the current XLog) and PrevBytePos (the prev-link to the previous XLog). Currently, the use of XLogCtlInsert.insertpos_lck ensures consistency but introduces lock contention, hindering the parallelism of XLog insertions.

To address this issue, we propose the following changes:

1. Removal of PrevBytePos: This will allow increments of the CurrBytePos (a single uint64 field) to be implemented with an atomic operation (fetch_add).
2. Updating Prev-Link of next XLog: Based on the fact that the prev-link of the next XLog always points to the head of the current Xlog,we will slightly exceed the reserved memory range of the current XLog to update the prev-link of the next XLog, regardless of which backend acquires the next memory space. The next XLog inserter will wait until its prev-link is updated for CRC calculation before starting its own XLog copy into the WAL.
3. Breaking Sequential Write Convention: Each backend will update the prev-link of its next XLog first, then return to the header position for the current log insertion. This change will reduce the dependency of XLog writes on previous ones (compared with the sequential writes).
4. Revised GetXLogBuffer: To support #3, we need update this function to separate the LSN it intends to access from the LSN it expects to update in the insertingAt field.
5. Increase NUM_XLOGINSERT_LOCKS: With the above changes, increasing NUM_XLOGINSERT_LOCKS, for example to 128, could effectively enhance the parallelism.

The attached patch could pass the regression tests (make check, make check-world), and in the performance test of this POC on SPR (480 vCPU) shows that this optimization could help the TPCC benchmark better scale with the core count and as a result the performance with full cores enabled could be improved by 2.04x.

Before we proceed with further patch validation and refinement work, we are eager to hear the community's thoughts and comments on this optimization so that we can confirm our current work aligns with expectations.

Zhou, Zhiguo

zhiguo.zhou@intel.com

about 1 year ago

In reply to: Zhou, Zhiguo (#1)

Hi all,

To address this issue, we propose the following changes:

Zhou, Zhiguo

zhiguo.zhou@intel.com

about 1 year ago

In reply to: Zhou, Zhiguo (#2)

RE: [RFC] Lock-free XLog Reservation from WAL

This message is a duplicate of PH7PR11MB5796659F654F9BE983F3AD97EF142@PH7PR11MB5796.namprd11.prod.outlook.com. Please consider dropping this thread and review the original one instead.

Sorry for your inconvenience.

-----Original Message-----
From: Zhou, Zhiguo <zhiguo.zhou@intel.com>
Sent: Thursday, January 2, 2025 3:20 PM
To: pgsql-hackers@lists.postgresql.org
Subject: [RFC] Lock-free XLog Reservation from WAL

Hi all,

To address this issue, we propose the following changes:

Yura Sokolov

y.sokolov@postgrespro.ru

about 1 year ago

In reply to: Zhou, Zhiguo (#1)

Re: [RFC] Lock-free XLog Reservation from WAL

02.01.2025 10:05, Zhou, Zhiguo wrote:

Hi all,

I am reaching out to solicit your insights and comments on a recent proposal regarding the "Lock-free XLog Reservation from WAL." We have identified some challenges with the current WAL insertions, which require space reservations in the WAL buffer which involve updating two shared-memory statuses in XLogCtlInsert: CurrBytePos (the start position of the current XLog) and PrevBytePos (the prev-link to the previous XLog). Currently, the use of XLogCtlInsert.insertpos_lck ensures consistency but introduces lock contention, hindering the parallelism of XLog insertions.

To address this issue, we propose the following changes:

1. Removal of PrevBytePos: This will allow increments of the CurrBytePos (a single uint64 field) to be implemented with an atomic operation (fetch_add).
2. Updating Prev-Link of next XLog: Based on the fact that the prev-link of the next XLog always points to the head of the current Xlog,we will slightly exceed the reserved memory range of the current XLog to update the prev-link of the next XLog, regardless of which backend acquires the next memory space. The next XLog inserter will wait until its prev-link is updated for CRC calculation before starting its own XLog copy into the WAL.
3. Breaking Sequential Write Convention: Each backend will update the prev-link of its next XLog first, then return to the header position for the current log insertion. This change will reduce the dependency of XLog writes on previous ones (compared with the sequential writes).
4. Revised GetXLogBuffer: To support #3, we need update this function to separate the LSN it intends to access from the LSN it expects to update in the insertingAt field.
5. Increase NUM_XLOGINSERT_LOCKS: With the above changes, increasing NUM_XLOGINSERT_LOCKS, for example to 128, could effectively enhance the parallelism.

The attached patch could pass the regression tests (make check, make check-world), and in the performance test of this POC on SPR (480 vCPU) shows that this optimization could help the TPCC benchmark better scale with the core count and as a result the performance with full cores enabled could be improved by 2.04x.

Before we proceed with further patch validation and refinement work, we are eager to hear the community's thoughts and comments on this optimization so that we can confirm our current work aligns with expectations.

Good day, Zhiguo.

Idea looks great.

Minor issue:
- you didn't remove use of `insertpos_lck` from `ReserveXLogSwitch`.

I initially thought it became un-synchronized against
`ReserveXLogInsertLocation`, but looking closer I found it is
synchronized with `WALInsertLockAcquireExclusive`.
Since there are no other `insertpos_lck` usages after your patch, I
don't see why it should exists and be used in `ReserveXLogSwitch`.

Still I'd prefer to see CAS loop in this place to be consistent with
other non-locking access. And it will allow to get rid of
`WALInsertLockAcquireExclusive`, (though probably it is not a big issue).

Major issue:
- `SetPrevRecPtr` and `GetPrevRecPtr` do non-atomic write/read with on
platforms where MAXALIGN != 8 or without native 64 load/store. Branch
with 'memcpy` is rather obvious, but even pointer de-referencing on
"lucky case" is not safe either.

I have no idea how to fix it at the moment.

Readability issue:
- It would be good to add `Assert(ptr >= upto)` into `GetXLogBuffer`.
I had hard time to recognize `upto` is strictly not in the future.
- Certainly, final version have to have fixed and improved comments.
Many patch's ideas are strictly non-obvious. I had hard time to
recognize patch is not a piece of ... (excuse me for the swear sentence).

Indeed, patch is much better than it looks on first sight.
I came with alternative idea yesterday, but looking closer to your patch
today I see it is superior to mine (if atomic access will be fixed).

----

regards,
Yura Sokolov aka funny-falcon

wenhui qiu

qiuwenhuifx@gmail.com

about 1 year ago

In reply to: Yura Sokolov (#4)

Re: [RFC] Lock-free XLog Reservation from WAL

Hi
Thank you for your path，NUM_XLOGINSERT_LOCKS increase to 128，I think it
will be challenged，do we make it guc ？

On Fri, 3 Jan 2025 at 20:36, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

Show quoted text

02.01.2025 10:05, Zhou, Zhiguo wrote:

Hi all,

I am reaching out to solicit your insights and comments on a recent

proposal regarding the "Lock-free XLog Reservation from WAL." We have
identified some challenges with the current WAL insertions, which require
space reservations in the WAL buffer which involve updating two
shared-memory statuses in XLogCtlInsert: CurrBytePos (the start position of
the current XLog) and PrevBytePos (the prev-link to the previous XLog).
Currently, the use of XLogCtlInsert.insertpos_lck ensures consistency but
introduces lock contention, hindering the parallelism of XLog insertions.

To address this issue, we propose the following changes:

1. Removal of PrevBytePos: This will allow increments of the CurrBytePos

(a single uint64 field) to be implemented with an atomic operation
(fetch_add).

2. Updating Prev-Link of next XLog: Based on the fact that the prev-link

of the next XLog always points to the head of the current Xlog,we will
slightly exceed the reserved memory range of the current XLog to update the
prev-link of the next XLog, regardless of which backend acquires the next
memory space. The next XLog inserter will wait until its prev-link is
updated for CRC calculation before starting its own XLog copy into the WAL.

3. Breaking Sequential Write Convention: Each backend will update the

prev-link of its next XLog first, then return to the header position for
the current log insertion. This change will reduce the dependency of XLog
writes on previous ones (compared with the sequential writes).

4. Revised GetXLogBuffer: To support #3, we need update this function to

separate the LSN it intends to access from the LSN it expects to update in
the insertingAt field.

5. Increase NUM_XLOGINSERT_LOCKS: With the above changes, increasing

NUM_XLOGINSERT_LOCKS, for example to 128, could effectively enhance the
parallelism.

The attached patch could pass the regression tests (make check, make

check-world), and in the performance test of this POC on SPR (480 vCPU)
shows that this optimization could help the TPCC benchmark better scale
with the core count and as a result the performance with full cores enabled
could be improved by 2.04x.

Before we proceed with further patch validation and refinement work, we

are eager to hear the community's thoughts and comments on this
optimization so that we can confirm our current work aligns with
expectations.

Good day, Zhiguo.

Idea looks great.

Minor issue:
- you didn't remove use of `insertpos_lck` from `ReserveXLogSwitch`.

I initially thought it became un-synchronized against
`ReserveXLogInsertLocation`, but looking closer I found it is
synchronized with `WALInsertLockAcquireExclusive`.
Since there are no other `insertpos_lck` usages after your patch, I
don't see why it should exists and be used in `ReserveXLogSwitch`.

Still I'd prefer to see CAS loop in this place to be consistent with
other non-locking access. And it will allow to get rid of
`WALInsertLockAcquireExclusive`, (though probably it is not a big issue).

Major issue:
- `SetPrevRecPtr` and `GetPrevRecPtr` do non-atomic write/read with on
platforms where MAXALIGN != 8 or without native 64 load/store. Branch
with 'memcpy` is rather obvious, but even pointer de-referencing on
"lucky case" is not safe either.

I have no idea how to fix it at the moment.

Readability issue:
- It would be good to add `Assert(ptr >= upto)` into `GetXLogBuffer`.
I had hard time to recognize `upto` is strictly not in the future.
- Certainly, final version have to have fixed and improved comments.
Many patch's ideas are strictly non-obvious. I had hard time to
recognize patch is not a piece of ... (excuse me for the swear sentence).

Indeed, patch is much better than it looks on first sight.
I came with alternative idea yesterday, but looking closer to your patch
today I see it is superior to mine (if atomic access will be fixed).

----

regards,
Yura Sokolov aka funny-falcon

Zhou, Zhiguo

zhiguo.zhou@intel.com

about 1 year ago

In reply to: wenhui qiu (#5)

Re: [RFC] Lock-free XLog Reservation from WAL

Hi Yura and Wenhui,

Thanks for kindly reviewing this work!

On 1/3/2025 9:01 PM, wenhui qiu wrote:

Hi
Thank you for your path，NUM_XLOGINSERT_LOCKS increase to 128，I
think it will be challenged，do we make it guc ？

I noticed there have been some discussions (for example, [1]/messages/by-id/2266698.1704854297@sss.pgh.pa.us and its
responses) about making NUM_XLOGINSERT_LOCKS a GUC, which seems to be a
controversial proposal. Given that, we may first focus on the lock-free
XLog reservation implementation, and leave the increase of
NUM_XLOGINSERT_LOCKS for a future patch, where we would provide more
quantitative evidence for the various implementations. WDYT?

On Fri, 3 Jan 2025 at 20:36, Yura Sokolov <y.sokolov@postgrespro.ru
<mailto:y.sokolov@postgrespro.ru>> wrote:

Good day, Zhiguo.

Idea looks great.

Minor issue:
- you didn't remove use of `insertpos_lck` from `ReserveXLogSwitch`.

I initially thought it became un-synchronized against
`ReserveXLogInsertLocation`, but looking closer I found it is
synchronized with `WALInsertLockAcquireExclusive`.
Since there are no other `insertpos_lck` usages after your patch, I
don't see why it should exists and be used in `ReserveXLogSwitch`.

Still I'd prefer to see CAS loop in this place to be consistent with
other non-locking access. And it will allow to get rid of
`WALInsertLockAcquireExclusive`, (though probably it is not a big
issue).

Exactly, it should be safe to remove `insertpos_lck`. And I agree with
you on getting rid of `WALInsertLockAcquireExclusive` with CAS loop
which should significantly reduce the synchronization cost here
especially when we intend to increase NUM_XLOGINSERT_LOCKS. I will try
it in the next version of patch.

Major issue:
- `SetPrevRecPtr` and `GetPrevRecPtr` do non-atomic write/read with on
platforms where MAXALIGN != 8 or without native 64 load/store. Branch
with 'memcpy` is rather obvious, but even pointer de-referencing on
"lucky case" is not safe either.

I have no idea how to fix it at the moment.

Indeed, non-atomic write/read operations can lead to safety issues in
some situations. My initial thought is to define a bit near the
prev-link to flag the completion of the update. In this way, we could
allow non-atomic or even discontinuous write/read operations on the
prev-link, while simultaneously guaranteeing its atomicity through
atomic operations (as well as memory barriers) on the flag bit. What do
you think of this as a viable solution?

Readability issue:
- It would be good to add `Assert(ptr >= upto)` into `GetXLogBuffer`.
I had hard time to recognize `upto` is strictly not in the future.
- Certainly, final version have to have fixed and improved comments.
Many patch's ideas are strictly non-obvious. I had hard time to
recognize patch is not a piece of ... (excuse me for the swear
sentence).

Thanks for the suggestion and patience. It's really more readable after
inserting the assertion, I will fix it and improve other comments in the
following patches.

Indeed, patch is much better than it looks on first sight.
I came with alternative idea yesterday, but looking closer to your
patch
today I see it is superior to mine (if atomic access will be fixed).

----

regards,
Yura Sokolov aka funny-falcon

Regards,
Zhiguo

[1]: /messages/by-id/2266698.1704854297@sss.pgh.pa.us

wenhui qiu

qiuwenhuifx@gmail.com

about 1 year ago

In reply to: Zhou, Zhiguo (#6)

Re: [RFC] Lock-free XLog Reservation from WAL

HI Zhiguo
Thank you for your reply ,Then you'll have to prove that 128 is the
optimal value, otherwise they'll have a hard time agreeing with you on this
patch.

Thanks

On Mon, Jan 6, 2025 at 2:46 PM Zhou, Zhiguo <zhiguo.zhou@intel.com> wrote:

Show quoted text

Hi Yura and Wenhui,

Thanks for kindly reviewing this work!

On 1/3/2025 9:01 PM, wenhui qiu wrote:

Hi
Thank you for your path，NUM_XLOGINSERT_LOCKS increase to 128，I
think it will be challenged，do we make it guc ？

I noticed there have been some discussions (for example, [1] and its
responses) about making NUM_XLOGINSERT_LOCKS a GUC, which seems to be a
controversial proposal. Given that, we may first focus on the lock-free
XLog reservation implementation, and leave the increase of
NUM_XLOGINSERT_LOCKS for a future patch, where we would provide more
quantitative evidence for the various implementations. WDYT?

On Fri, 3 Jan 2025 at 20:36, Yura Sokolov <y.sokolov@postgrespro.ru
<mailto:y.sokolov@postgrespro.ru>> wrote:

Good day, Zhiguo.

Idea looks great.

Minor issue:
- you didn't remove use of `insertpos_lck` from `ReserveXLogSwitch`.

I initially thought it became un-synchronized against
`ReserveXLogInsertLocation`, but looking closer I found it is
synchronized with `WALInsertLockAcquireExclusive`.
Since there are no other `insertpos_lck` usages after your patch, I
don't see why it should exists and be used in `ReserveXLogSwitch`.

Still I'd prefer to see CAS loop in this place to be consistent with
other non-locking access. And it will allow to get rid of
`WALInsertLockAcquireExclusive`, (though probably it is not a big
issue).

Exactly, it should be safe to remove `insertpos_lck`. And I agree with
you on getting rid of `WALInsertLockAcquireExclusive` with CAS loop
which should significantly reduce the synchronization cost here
especially when we intend to increase NUM_XLOGINSERT_LOCKS. I will try
it in the next version of patch.

Major issue:
- `SetPrevRecPtr` and `GetPrevRecPtr` do non-atomic write/read with

on

platforms where MAXALIGN != 8 or without native 64 load/store. Branch
with 'memcpy` is rather obvious, but even pointer de-referencing on
"lucky case" is not safe either.

I have no idea how to fix it at the moment.

Indeed, non-atomic write/read operations can lead to safety issues in
some situations. My initial thought is to define a bit near the
prev-link to flag the completion of the update. In this way, we could
allow non-atomic or even discontinuous write/read operations on the
prev-link, while simultaneously guaranteeing its atomicity through
atomic operations (as well as memory barriers) on the flag bit. What do
you think of this as a viable solution?

Readability issue:
- It would be good to add `Assert(ptr >= upto)` into `GetXLogBuffer`.
I had hard time to recognize `upto` is strictly not in the future.
- Certainly, final version have to have fixed and improved comments.
Many patch's ideas are strictly non-obvious. I had hard time to
recognize patch is not a piece of ... (excuse me for the swear
sentence).

Thanks for the suggestion and patience. It's really more readable after
inserting the assertion, I will fix it and improve other comments in the
following patches.

Indeed, patch is much better than it looks on first sight.
I came with alternative idea yesterday, but looking closer to your
patch
today I see it is superior to mine (if atomic access will be fixed).

----

regards,
Yura Sokolov aka funny-falcon

Regards,
Zhiguo

[1]
/messages/by-id/2266698.1704854297@sss.pgh.pa.us

Zhou, Zhiguo

zhiguo.zhou@intel.com

about 1 year ago

In reply to: wenhui qiu (#7)

Re: [RFC] Lock-free XLog Reservation from WAL

Maybe we could leave the NUM_XLOGINSERT_LOCKS unchanged in this patch,
as it is not a hard dependency of the lock-free algorithm. And when this
patch has been fully accepted, we could then investigate the more proper
way of increasing NUM_XLOGINSERT_LOCKS. WDYT?

Show quoted text

On 1/6/2025 4:35 PM, wenhui qiu wrote:

HI Zhiguo
Thank you for your reply ,Then you'll have to prove that 128 is the
optimal value, otherwise they'll have a hard time agreeing with you on
this patch.

Thanks

On Mon, Jan 6, 2025 at 2:46 PM Zhou, Zhiguo <zhiguo.zhou@intel.com
<mailto:zhiguo.zhou@intel.com>> wrote:

Hi Yura and Wenhui,

Thanks for kindly reviewing this work!

On 1/3/2025 9:01 PM, wenhui qiu wrote:

Hi
Thank you for your path，NUM_XLOGINSERT_LOCKS increase to

128，I

think it will be challenged，do we make it guc ？

I noticed there have been some discussions (for example, [1] and its
responses) about making NUM_XLOGINSERT_LOCKS a GUC, which seems to be a
controversial proposal. Given that, we may first focus on the lock-free
XLog reservation implementation, and leave the increase of
NUM_XLOGINSERT_LOCKS for a future patch, where we would provide more
quantitative evidence for the various implementations. WDYT?

On Fri, 3 Jan 2025 at 20:36, Yura Sokolov

<y.sokolov@postgrespro.ru <mailto:y.sokolov@postgrespro.ru>

<mailto:y.sokolov@postgrespro.ru

<mailto:y.sokolov@postgrespro.ru>>> wrote:

Good day, Zhiguo.

Idea looks great.

Minor issue:
- you didn't remove use of `insertpos_lck` from

`ReserveXLogSwitch`.

I initially thought it became un-synchronized against
`ReserveXLogInsertLocation`, but looking closer I found it is
synchronized with `WALInsertLockAcquireExclusive`.
Since there are no other `insertpos_lck` usages after your

patch, I

don't see why it should exists and be used in

`ReserveXLogSwitch`.

Still I'd prefer to see CAS loop in this place to be

consistent with

other non-locking access. And it will allow to get rid of
`WALInsertLockAcquireExclusive`, (though probably it is not a big
issue).

Exactly, it should be safe to remove `insertpos_lck`. And I agree with
you on getting rid of `WALInsertLockAcquireExclusive` with CAS loop
which should significantly reduce the synchronization cost here
especially when we intend to increase NUM_XLOGINSERT_LOCKS. I will try
it in the next version of patch.

Major issue:
- `SetPrevRecPtr` and `GetPrevRecPtr` do non-atomic write/

read with on

platforms where MAXALIGN != 8 or without native 64 load/

store. Branch

with 'memcpy` is rather obvious, but even pointer de-

referencing on

"lucky case" is not safe either.

I have no idea how to fix it at the moment.

Indeed, non-atomic write/read operations can lead to safety issues in
some situations. My initial thought is to define a bit near the
prev-link to flag the completion of the update. In this way, we could
allow non-atomic or even discontinuous write/read operations on the
prev-link, while simultaneously guaranteeing its atomicity through
atomic operations (as well as memory barriers) on the flag bit. What do
you think of this as a viable solution?

Readability issue:
- It would be good to add `Assert(ptr >= upto)` into

`GetXLogBuffer`.

I had hard time to recognize `upto` is strictly not in the

future.

- Certainly, final version have to have fixed and improved

comments.

Many patch's ideas are strictly non-obvious. I had hard time to
recognize patch is not a piece of ... (excuse me for the swear
sentence).

Thanks for the suggestion and patience. It's really more readable after
inserting the assertion, I will fix it and improve other comments in
the
following patches.

Indeed, patch is much better than it looks on first sight.
I came with alternative idea yesterday, but looking closer to

your

patch
today I see it is superior to mine (if atomic access will be

fixed).

----

regards,
Yura Sokolov aka funny-falcon

Regards,
Zhiguo

[1] https://www.postgresql.org/message-
id/2266698.1704854297%40sss.pgh.pa.us <https://www.postgresql.org/
message-id/2266698.1704854297%40sss.pgh.pa.us>

wenhui qiu

qiuwenhuifx@gmail.com

about 1 year ago

In reply to: Zhou, Zhiguo (#8)

Re: [RFC] Lock-free XLog Reservation from WAL

HI Zhiguo

Maybe we could leave the NUM_XLOGINSERT_LOCKS unchanged in this patch,
as it is not a hard dependency of the lock-free algorithm. And when this
patch has been fully accepted, we could then investigate the more proper
way of increasing NUM_XLOGINSERT_LOCKS. WDYT?

If the value is not a strong dependency, then the best way is not to change
it.

Thanks

On Mon, Jan 6, 2025 at 4:49 PM Zhou, Zhiguo <zhiguo.zhou@intel.com> wrote:

Show quoted text

Maybe we could leave the NUM_XLOGINSERT_LOCKS unchanged in this patch,
as it is not a hard dependency of the lock-free algorithm. And when this
patch has been fully accepted, we could then investigate the more proper
way of increasing NUM_XLOGINSERT_LOCKS. WDYT?

On 1/6/2025 4:35 PM, wenhui qiu wrote:

HI Zhiguo
Thank you for your reply ,Then you'll have to prove that 128 is the
optimal value, otherwise they'll have a hard time agreeing with you on
this patch.

Thanks

On Mon, Jan 6, 2025 at 2:46 PM Zhou, Zhiguo <zhiguo.zhou@intel.com
<mailto:zhiguo.zhou@intel.com>> wrote:

Hi Yura and Wenhui,

Thanks for kindly reviewing this work!

On 1/3/2025 9:01 PM, wenhui qiu wrote:

Hi
Thank you for your path，NUM_XLOGINSERT_LOCKS increase to

128，I

think it will be challenged，do we make it guc ？

I noticed there have been some discussions (for example, [1] and its
responses) about making NUM_XLOGINSERT_LOCKS a GUC, which seems to

be a

controversial proposal. Given that, we may first focus on the

lock-free

XLog reservation implementation, and leave the increase of
NUM_XLOGINSERT_LOCKS for a future patch, where we would provide more
quantitative evidence for the various implementations. WDYT?

On Fri, 3 Jan 2025 at 20:36, Yura Sokolov

<y.sokolov@postgrespro.ru <mailto:y.sokolov@postgrespro.ru>

<mailto:y.sokolov@postgrespro.ru

<mailto:y.sokolov@postgrespro.ru>>> wrote:

Good day, Zhiguo.

Idea looks great.

Minor issue:
- you didn't remove use of `insertpos_lck` from

`ReserveXLogSwitch`.

I initially thought it became un-synchronized against
`ReserveXLogInsertLocation`, but looking closer I found it is
synchronized with `WALInsertLockAcquireExclusive`.
Since there are no other `insertpos_lck` usages after your

patch, I

don't see why it should exists and be used in

`ReserveXLogSwitch`.

Still I'd prefer to see CAS loop in this place to be

consistent with

other non-locking access. And it will allow to get rid of
`WALInsertLockAcquireExclusive`, (though probably it is not a

big

issue).

Exactly, it should be safe to remove `insertpos_lck`. And I agree

with

you on getting rid of `WALInsertLockAcquireExclusive` with CAS loop
which should significantly reduce the synchronization cost here
especially when we intend to increase NUM_XLOGINSERT_LOCKS. I will

try

it in the next version of patch.

Major issue:
- `SetPrevRecPtr` and `GetPrevRecPtr` do non-atomic write/

read with on

platforms where MAXALIGN != 8 or without native 64 load/

store. Branch

with 'memcpy` is rather obvious, but even pointer de-

referencing on

"lucky case" is not safe either.

I have no idea how to fix it at the moment.

Indeed, non-atomic write/read operations can lead to safety issues in
some situations. My initial thought is to define a bit near the
prev-link to flag the completion of the update. In this way, we could
allow non-atomic or even discontinuous write/read operations on the
prev-link, while simultaneously guaranteeing its atomicity through
atomic operations (as well as memory barriers) on the flag bit. What

do

you think of this as a viable solution?

Readability issue:
- It would be good to add `Assert(ptr >= upto)` into

`GetXLogBuffer`.

I had hard time to recognize `upto` is strictly not in the

future.

- Certainly, final version have to have fixed and improved

comments.

Many patch's ideas are strictly non-obvious. I had hard time

to

recognize patch is not a piece of ... (excuse me for the swear
sentence).

Thanks for the suggestion and patience. It's really more readable

after

inserting the assertion, I will fix it and improve other comments in
the
following patches.

Indeed, patch is much better than it looks on first sight.
I came with alternative idea yesterday, but looking closer to

your

patch
today I see it is superior to mine (if atomic access will be

fixed).

----

regards,
Yura Sokolov aka funny-falcon

Regards,
Zhiguo

[1] https://www.postgresql.org/message-
id/2266698.1704854297%40sss.pgh.pa.us <https://www.postgresql.org/
message-id/2266698.1704854297%40sss.pgh.pa.us>

#10

Yura Sokolov

y.sokolov@postgrespro.ru

about 1 year ago

In reply to: Zhou, Zhiguo (#6)

Re: [RFC] Lock-free XLog Reservation from WAL

On 6 Jan 2025, at 09:46, Zhou, Zhiguo <zhiguo.zhou@intel.com> wrote:

Hi Yura and Wenhui,

Thanks for kindly reviewing this work!

On 1/3/2025 9:01 PM, wenhui qiu wrote:

Hi
Thank you for your path，NUM_XLOGINSERT_LOCKS increase to 128，I think it will be challenged，do we make it guc ？

I noticed there have been some discussions (for example, [1] and its responses) about making NUM_XLOGINSERT_LOCKS a GUC, which seems to be a controversial proposal. Given that, we may first focus on the lock-free XLog reservation implementation, and leave the increase of NUM_XLOGINSERT_LOCKS for a future patch, where we would provide more quantitative evidence for the various implementations. WDYT?

On Fri, 3 Jan 2025 at 20:36, Yura Sokolov <y.sokolov@postgrespro.ru <mailto:y.sokolov@postgrespro.ru><mailto:y.sokolov@postgrespro.ru>> wrote:
Good day, Zhiguo.
Idea looks great.
Minor issue:
- you didn't remove use of `insertpos_lck` from `ReserveXLogSwitch`.
I initially thought it became un-synchronized against
`ReserveXLogInsertLocation`, but looking closer I found it is
synchronized with `WALInsertLockAcquireExclusive`.
Since there are no other `insertpos_lck` usages after your patch, I
don't see why it should exists and be used in `ReserveXLogSwitch`.
Still I'd prefer to see CAS loop in this place to be consistent with
other non-locking access. And it will allow to get rid of
`WALInsertLockAcquireExclusive`, (though probably it is not a big
issue).

Exactly, it should be safe to remove `insertpos_lck`. And I agree with you on getting rid of `WALInsertLockAcquireExclusive` with CAS loop which should significantly reduce the synchronization cost here especially when we intend to increase NUM_XLOGINSERT_LOCKS. I will try it in the next version of patch.

Major issue:
- `SetPrevRecPtr` and `GetPrevRecPtr` do non-atomic write/read with on
platforms where MAXALIGN != 8 or without native 64 load/store. Branch
with 'memcpy` is rather obvious, but even pointer de-referencing on
"lucky case" is not safe either.
I have no idea how to fix it at the moment.

Indeed, non-atomic write/read operations can lead to safety issues in some situations. My initial thought is to define a bit near the prev-link to flag the completion of the update. In this way, we could allow non-atomic or even discontinuous write/read operations on the prev-link, while simultaneously guaranteeing its atomicity through atomic operations (as well as memory barriers) on the flag bit. What do you think of this as a viable solution?

Readability issue:
- It would be good to add `Assert(ptr >= upto)` into `GetXLogBuffer`.
I had hard time to recognize `upto` is strictly not in the future.
- Certainly, final version have to have fixed and improved comments.
Many patch's ideas are strictly non-obvious. I had hard time to
recognize patch is not a piece of ... (excuse me for the swear
sentence).

Thanks for the suggestion and patience. It's really more readable after inserting the assertion, I will fix it and improve other comments in the following patches.

Indeed, patch is much better than it looks on first sight.
I came with alternative idea yesterday, but looking closer to your
patch
today I see it is superior to mine (if atomic access will be fixed).

[1] /messages/by-id/2266698.1704854297@sss.pgh.pa.us

Good day, Zhiguo.

Here’s my attempt to organise link to previous record without messing with xlog buffers:
- link is stored in lock-free hash table instead.

I don’t claim it is any better than using xlog buffers.
It is just alternative vision.

Some tricks in implementation:
- Relying on byte-position nature, it could be converted to 32 bit unique
value with `(uint32)(pos ^ (pos>>32))`. Certainly it is not totally unique,
but it is certainly unique among 32GB consecutive log.
- PrevBytePos could be calculated as a difference between positions, and
this difference is certainly less than 4GB, so it also could be stored as 32
bit value (PrevSize).
- Since xlog records are aligned we could use lowest bit of PrevSize as a lock.
- While Cuckoo Hashing could suffer from un-solvable cycle conflicts, this implementation relies on concurrent deleters which will eventually break such cycles if any.

I have a version without 32bit conversion trick, and it is a bit lighter on atomic instructions count, but it performs badly in absence of native 64bit atomics.

——
regards
Yura Sokolov aka funny-falcon

#11

Zhou, Zhiguo

zhiguo.zhou@intel.com

about 1 year ago

In reply to: Yura Sokolov (#10)

Re: [RFC] Lock-free XLog Reservation from WAL

On 1/7/2025 10:49 AM, Юрий Соколов wrote:

On 6 Jan 2025, at 09:46, Zhou, Zhiguo <zhiguo.zhou@intel.com> wrote:

Hi Yura and Wenhui,

Thanks for kindly reviewing this work!

On 1/3/2025 9:01 PM, wenhui qiu wrote:

Hi
Thank you for your path，NUM_XLOGINSERT_LOCKS increase to 128，I
think it will be challenged，do we make it guc ？

I noticed there have been some discussions (for example, [1] and its
responses) about making NUM_XLOGINSERT_LOCKS a GUC, which seems to be
a controversial proposal. Given that, we may first focus on the lock-
free XLog reservation implementation, and leave the increase of
NUM_XLOGINSERT_LOCKS for a future patch, where we would provide more
quantitative evidence for the various implementations. WDYT?

On Fri, 3 Jan 2025 at 20:36, Yura Sokolov <y.sokolov@postgrespro.ru
<mailto:y.sokolov@postgrespro.ru><mailto:y.sokolov@postgrespro.ru
<mailto:y.sokolov@postgrespro.ru>>> wrote:
   Good day, Zhiguo.
   Idea looks great.
   Minor issue:
   - you didn't remove use of `insertpos_lck` from `ReserveXLogSwitch`.
   I initially thought it became un-synchronized against
   `ReserveXLogInsertLocation`, but looking closer I found it is
   synchronized with `WALInsertLockAcquireExclusive`.
   Since there are no other `insertpos_lck` usages after your patch, I
   don't see why it should exists and be used in `ReserveXLogSwitch`.
   Still I'd prefer to see CAS loop in this place to be consistent with
   other non-locking access. And it will allow to get rid of
   `WALInsertLockAcquireExclusive`, (though probably it is not a big
   issue).

Exactly, it should be safe to remove `insertpos_lck`. And I agree with
you on getting rid of `WALInsertLockAcquireExclusive` with CAS loop
which should significantly reduce the synchronization cost here
especially when we intend to increase NUM_XLOGINSERT_LOCKS. I will try
it in the next version of patch.

   Major issue:
   - `SetPrevRecPtr` and `GetPrevRecPtr` do non-atomic write/read with on
   platforms where MAXALIGN != 8 or without native 64 load/store. Branch
   with 'memcpy` is rather obvious, but even pointer de-referencing on
   "lucky case" is not safe either.
   I have no idea how to fix it at the moment.

Indeed, non-atomic write/read operations can lead to safety issues in
some situations. My initial thought is to define a bit near the prev-
link to flag the completion of the update. In this way, we could allow
non-atomic or even discontinuous write/read operations on the prev-
link, while simultaneously guaranteeing its atomicity through atomic
operations (as well as memory barriers) on the flag bit. What do you
think of this as a viable solution?

   Readability issue:
   - It would be good to add `Assert(ptr >= upto)` into `GetXLogBuffer`.
   I had hard time to recognize `upto` is strictly not in the future.
   - Certainly, final version have to have fixed and improved comments.
   Many patch's ideas are strictly non-obvious. I had hard time to
   recognize patch is not a piece of ... (excuse me for the swear
   sentence).

Thanks for the suggestion and patience. It's really more readable
after inserting the assertion, I will fix it and improve other
comments in the following patches.

   Indeed, patch is much better than it looks on first sight.
   I came with alternative idea yesterday, but looking closer to your
   patch
   today I see it is superior to mine (if atomic access will be fixed).

[1]https://www.postgresql.org/message-
id/2266698.1704854297%40sss.pgh.pa.us <https://www.postgresql.org/
message-id/2266698.1704854297%40sss.pgh.pa.us>

Good day, Zhiguo.

Here’s my attempt to organise link to previous record without messing
with xlog buffers:
- link is stored in lock-free hash table instead.

I don’t claim it is any better than using xlog buffers.
It is just alternative vision.

Some tricks in implementation:
- Relying on byte-position nature, it could be converted to 32 bit unique
value with `(uint32)(pos ^ (pos>>32))`. Certainly it is not totally
unique,
but it is certainly unique among 32GB consecutive log.
- PrevBytePos could be calculated as a difference between positions, and
this difference is certainly less than 4GB, so it also could be
stored as 32
bit value (PrevSize).
- Since xlog records are aligned we could use lowest bit of PrevSize as
a lock.
- While Cuckoo Hashing could suffer from un-solvable cycle conflicts,
this implementation relies on concurrent deleters which will eventually
break such cycles if any.

I have a version without 32bit conversion trick, and it is a bit lighter
on atomic instructions count, but it performs badly in absence of native
64bit atomics.

——
regards
Yura Sokolov aka funny-falcon

Good day, Yura!

Your implementation based on the lock-free hash table is truly
impressive! One of the aspects I particularly admire is how your
solution doesn't require breaking the current convention of XLog
insertion, whose revision is quite error-prone and ungraceful. My minor
concern is that the limited number of entries (256) in the hash table
would be a bottleneck for parallel memory reservation, but I believe
this is not a critical issue.

I will soon try to evaluate the performance impact of your patch on my
device with the TPCC benchmark and also profile it to see if there are
any changes that could be made to further improve it.

BTW, do you have a plan to merge this patch to the master branch? Thanks!

Regards,
Zhiguo

#12

Yura Sokolov

y.sokolov@postgrespro.ru

about 1 year ago

In reply to: Zhou, Zhiguo (#11)

Re: [RFC] Lock-free XLog Reservation from WAL

09.01.2025 19:03, Zhou, Zhiguo пишет:

On 1/7/2025 10:49 AM, Юрий Соколов wrote:

On 6 Jan 2025, at 09:46, Zhou, Zhiguo <zhiguo.zhou@intel.com> wrote:

Hi Yura and Wenhui,

Thanks for kindly reviewing this work!

On 1/3/2025 9:01 PM, wenhui qiu wrote:

Hi
Thank you for your path，NUM_XLOGINSERT_LOCKS increase to 128，I
think it will be challenged，do we make it guc ？

I noticed there have been some discussions (for example, [1] and its
responses) about making NUM_XLOGINSERT_LOCKS a GUC, which seems to be
a controversial proposal. Given that, we may first focus on the lock-
free XLog reservation implementation, and leave the increase of
NUM_XLOGINSERT_LOCKS for a future patch, where we would provide more
quantitative evidence for the various implementations. WDYT?

On Fri, 3 Jan 2025 at 20:36, Yura Sokolov <y.sokolov@postgrespro.ru
<mailto:y.sokolov@postgrespro.ru><mailto:y.sokolov@postgrespro.ru
<mailto:y.sokolov@postgrespro.ru>>> wrote:
   Good day, Zhiguo.
   Idea looks great.
   Minor issue:
   - you didn't remove use of `insertpos_lck` from `ReserveXLogSwitch`.
   I initially thought it became un-synchronized against
   `ReserveXLogInsertLocation`, but looking closer I found it is
   synchronized with `WALInsertLockAcquireExclusive`.
   Since there are no other `insertpos_lck` usages after your patch, I
   don't see why it should exists and be used in `ReserveXLogSwitch`.
   Still I'd prefer to see CAS loop in this place to be consistent with
   other non-locking access. And it will allow to get rid of
   `WALInsertLockAcquireExclusive`, (though probably it is not a big
   issue).

Exactly, it should be safe to remove `insertpos_lck`. And I agree
with you on getting rid of `WALInsertLockAcquireExclusive` with CAS
loop which should significantly reduce the synchronization cost here
especially when we intend to increase NUM_XLOGINSERT_LOCKS. I will
try it in the next version of patch.

   Major issue:
   - `SetPrevRecPtr` and `GetPrevRecPtr` do non-atomic write/read
with on
   platforms where MAXALIGN != 8 or without native 64 load/store.
Branch
   with 'memcpy` is rather obvious, but even pointer de-referencing on
   "lucky case" is not safe either.
   I have no idea how to fix it at the moment.

Indeed, non-atomic write/read operations can lead to safety issues in
some situations. My initial thought is to define a bit near the prev-
link to flag the completion of the update. In this way, we could
allow non-atomic or even discontinuous write/read operations on the
prev- link, while simultaneously guaranteeing its atomicity through
atomic operations (as well as memory barriers) on the flag bit. What
do you think of this as a viable solution?

   Readability issue:
   - It would be good to add `Assert(ptr >= upto)` into
`GetXLogBuffer`.
   I had hard time to recognize `upto` is strictly not in the future.
   - Certainly, final version have to have fixed and improved comments.
   Many patch's ideas are strictly non-obvious. I had hard time to
   recognize patch is not a piece of ... (excuse me for the swear
   sentence).

Thanks for the suggestion and patience. It's really more readable
after inserting the assertion, I will fix it and improve other
comments in the following patches.

   Indeed, patch is much better than it looks on first sight.
   I came with alternative idea yesterday, but looking closer to your
   patch
   today I see it is superior to mine (if atomic access will be fixed).

[1]https://www.postgresql.org/message-
id/2266698.1704854297%40sss.pgh.pa.us <https://www.postgresql.org/
message-id/2266698.1704854297%40sss.pgh.pa.us>

Good day, Zhiguo.

Here’s my attempt to organise link to previous record without messing
with xlog buffers:
- link is stored in lock-free hash table instead.

I don’t claim it is any better than using xlog buffers.
It is just alternative vision.

Some tricks in implementation:
- Relying on byte-position nature, it could be converted to 32 bit unique
   value with `(uint32)(pos ^ (pos>>32))`. Certainly it is not totally
unique,
   but it is certainly unique among 32GB consecutive log.
- PrevBytePos could be calculated as a difference between positions, and
   this difference is certainly less than 4GB, so it also could be
stored as 32
   bit value (PrevSize).
- Since xlog records are aligned we could use lowest bit of PrevSize
as a lock.
- While Cuckoo Hashing could suffer from un-solvable cycle conflicts,
this implementation relies on concurrent deleters which will
eventually break such cycles if any.

I have a version without 32bit conversion trick, and it is a bit
lighter on atomic instructions count, but it performs badly in absence
of native 64bit atomics.

——
regards
Yura Sokolov aka funny-falcon

Good day, Yura!

Your implementation based on the lock-free hash table is truly
impressive! One of the aspects I particularly admire is how your
solution doesn't require breaking the current convention of XLog
insertion, whose revision is quite error-prone and ungraceful.

That is main benefit of my approach. Though it is not strictly better
than yours.

My minor
concern is that the limited number of entries (256) in the hash table
would be a bottleneck for parallel memory reservation, but I believe
this is not a critical issue.

If you consider hash-table fillrate, than 256 is quite enough for 128
concurrent inserters.

But I agree 8 items on cache line could lead to false-sharing.
Items could be stretched to 16 bytes (and then CurrPosId could be fully
unique), so there's just 4 entry per cache line.

I will soon try to evaluate the performance impact of your patch on my
device with the TPCC benchmark and also profile it to see if there are
any changes that could be made to further improve it.

It would be great. On my notebook (Mac Air M1) I don't see any benefits
neither from mine, nor from yours patch ))
My colleague will also test it on 20 core virtual machine (but
backported to v15).

BTW, do you have a plan to merge this patch to the master branch? Thanks!

I'm not committer )) We are both will struggle to make something
committed for many months ;-)

BTW, your version could make alike trick for guaranteed atomicity:
- change XLogRecord's `XLogRecPtr xl_prev` to `uint32 xl_prev_offset`
and store offset to prev record's start.

Since there are two limits:

#define XLogRecordMaxSize (1020 * 1024 * 1024)
#define WalSegMaxSize 1024 * 1024 * 1024

offset to previous record could not be larger than 2GB.

Yes, it is format change, that some backup utilities will have to adopt.
But it saves 4 bytes in XLogRecord (that could be spent to store
FullTransactionId instead of TransactionId) and it is better compressible.

And your version than will not need the case when this value is split
among two buffers (since MAXALIGN is not less than 4), and PostgreSQL
already relies on 4 byte read/write atomicity (in some places even
without use of pg_atomic_uint32).

----

regards
Sokolov Yura aka funny-falcon

#13

Matthias van de Meent

boekewurm+postgres@gmail.com

about 1 year ago

In reply to: Yura Sokolov (#12)

Re: [RFC] Lock-free XLog Reservation from WAL

On Fri, 10 Jan 2025 at 13:42, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

BTW, your version could make alike trick for guaranteed atomicity:
- change XLogRecord's `XLogRecPtr xl_prev` to `uint32 xl_prev_offset`
and store offset to prev record's start.

-1, I don't think that is possible without degrading what our current
WAL system protects against.

For intra-record torn write protection we have the checksum, but that
same protection doesn't cover the multiple WAL records on each page.
That is what the xl_prev pointer is used for - detecting that this
part of the page doesn't contain the correct data (e.g. the data of a
previous version of this recycled segment).
If we replaced xl_prev with just an offset into the segment, then this
protection would be much less effective, as the previous version of
the segment realistically used the same segment offsets at the same
offsets into the file.

To protect against torn writes while still only using record segment
offsets, you'd have zero and then fsync any segment before reusing it,
which would severely reduce the benefits we get from recycling
segments.
Note that we can't expect the page header to help here, as write tears
can happen at nearly any offset into the page - not just 8k intervals
- and so the page header is not always representative of the origins
of all bytes on the page - only the first 24 (if even that).

Kind regards,

Matthias van de Meent

#14

Yura Sokolov

y.sokolov@postgrespro.ru

about 1 year ago

In reply to: Matthias van de Meent (#13)

Re: [RFC] Lock-free XLog Reservation from WAL

10.01.2025 19:53, Matthias van de Meent пишет:

On Fri, 10 Jan 2025 at 13:42, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

BTW, your version could make alike trick for guaranteed atomicity:
- change XLogRecord's `XLogRecPtr xl_prev` to `uint32 xl_prev_offset`
and store offset to prev record's start.

-1, I don't think that is possible without degrading what our current
WAL system protects against.

For intra-record torn write protection we have the checksum, but that
same protection doesn't cover the multiple WAL records on each page.
That is what the xl_prev pointer is used for - detecting that this
part of the page doesn't contain the correct data (e.g. the data of a
previous version of this recycled segment).
If we replaced xl_prev with just an offset into the segment, then this
protection would be much less effective, as the previous version of
the segment realistically used the same segment offsets at the same
offsets into the file.

Well, to protect against "torn write" it is enough to have "self-lsn"
field, not "prev-lsn". So 8 byte "self-lsn" + "offset-to-prev" would work.

But this way header will be increased by 4 bytes compared to current
one, not decreased.

Just thought:
If XLogRecord alignment were stricter (for example, 32 bytes), then LSN
could mean not byte-offset, but 32byte-offset. Then low 32bits of LSN
will cover 128GB of WAL logs. For most installations re-use distance for
WAL segments doubdfully longer than 128GB. But I believe, there are some
with larger one. So it is not reliable.

To protect against torn writes while still only using record segment
offsets, you'd have zero and then fsync any segment before reusing it,
which would severely reduce the benefits we get from recycling
segments.
Note that we can't expect the page header to help here, as write tears
can happen at nearly any offset into the page - not just 8k intervals
- and so the page header is not always representative of the origins
of all bytes on the page - only the first 24 (if even that).

-----

regards,
Yura

#15

Zhou, Zhiguo

zhiguo.zhou@intel.com

about 1 year ago

In reply to: Yura Sokolov (#12)

Re: [RFC] Lock-free XLog Reservation from WAL

Good day, Yura!

On 1/10/2025 8:42 PM, Yura Sokolov wrote:

If you consider hash-table fillrate, than 256 is quite enough for 128
concurrent inserters.

The profile of your patch didn't show significant hotspots in the hash
table functions, so I believe the 256 entries should be enough.

I will soon try to evaluate the performance impact of your patch on my
device with the TPCC benchmark and also profile it to see if there are
any changes that could be made to further improve it.

It would be great. On my notebook (Mac Air M1) I don't see any benefits
neither from mine, nor from yours patch ))
My colleague will also test it on 20 core virtual machine (but
backported to v15).

I've tested the performance impact of our patches on an Intel Sapphire
Rapids device with 480 vCPUs using a HammerDB TPC-C workload (256 VUs).
The results show a 72.3% improvement (average of 3 rounds, RSD: 1.5%)
with your patch and a 76.0% boost (average of 3 rounds, RSD: 2.95%) with
mine, applied to the latest codebase. This optimization is most
effective on systems with over 64 cores, as our core-scaling experiments
suggest minimal impact on lower-core setups like your notebook or a
20-core VM.

BTW, do you have a plan to merge this patch to the master branch? Thanks!

I'm not committer )) We are both will struggle to make something
committed for many months ;-)

BTW, your version could make alike trick for guaranteed atomicity:
- change XLogRecord's `XLogRecPtr xl_prev` to `uint32 xl_prev_offset`
and store offset to prev record's start.

Since there are two limits:

#define XLogRecordMaxSize (1020 * 1024 * 1024)
#define WalSegMaxSize 1024 * 1024 * 1024

offset to previous record could not be larger than 2GB.

Yes, it is format change, that some backup utilities will have to adopt.
But it saves 4 bytes in XLogRecord (that could be spent to store
FullTransactionId instead of TransactionId) and it is better compressible.

And your version than will not need the case when this value is split
among two buffers (since MAXALIGN is not less than 4), and PostgreSQL
already relies on 4 byte read/write atomicity (in some places even
without use of pg_atomic_uint32).

----

regards
Sokolov Yura aka funny-falcon

Thanks for the great suggestion!

I think we've arrived at a critical juncture where we need to decide
which patch to move forward with for our optimization efforts. I've
evaluated the pros and cons of my implementation:

Pros:
-Achieves an additional 4% performance improvement.

Cons:
-Breaks the current convention of XLog insertions.
-TAP tests are not fully passed, requiring time to resolve.
-May necessitate changes to the format and backup tools, potentially
leading to backward compatibility issues.

Given these considerations, I believe your implementation is superior to
mine. I'd greatly appreciate it if you could share your insights on this
matter.

Regards,
Zhiguo

#16

Yura Sokolov

y.sokolov@postgrespro.ru

about 1 year ago

In reply to: Zhou, Zhiguo (#15)

Re: [RFC] Lock-free XLog Reservation from WAL

14.01.2025 17:49, Zhou, Zhiguo пишет:

Good day, Yura!

On 1/10/2025 8:42 PM, Yura Sokolov wrote:

If you consider hash-table fillrate, than 256 is quite enough for 128
concurrent inserters.

The profile of your patch didn't show significant hotspots in the hash
table functions, so I believe the 256 entries should be enough.

I will soon try to evaluate the performance impact of your patch on
my device with the TPCC benchmark and also profile it to see if there
are any changes that could be made to further improve it.

It would be great. On my notebook (Mac Air M1) I don't see any
benefits neither from mine, nor from yours patch ))
My colleague will also test it on 20 core virtual machine (but
backported to v15).

I've tested the performance impact of our patches on an Intel Sapphire
Rapids device with 480 vCPUs using a HammerDB TPC-C workload (256 VUs).
The results show a 72.3% improvement (average of 3 rounds, RSD: 1.5%)
with your patch and a 76.0% boost (average of 3 rounds, RSD: 2.95%) with
mine, applied to the latest codebase. This optimization is most
effective on systems with over 64 cores, as our core-scaling experiments
suggest minimal impact on lower-core setups like your notebook or a 20-
core VM.

BTW, do you have a plan to merge this patch to the master branch?
Thanks!

I'm not committer )) We are both will struggle to make something
committed for many months ;-)

BTW, your version could make alike trick for guaranteed atomicity:
- change XLogRecord's `XLogRecPtr xl_prev` to `uint32 xl_prev_offset`
and store offset to prev record's start.

Since there are two limits:

#define XLogRecordMaxSize (1020 * 1024 * 1024)
#define WalSegMaxSize 1024 * 1024 * 1024

offset to previous record could not be larger than 2GB.

Yes, it is format change, that some backup utilities will have to adopt.
But it saves 4 bytes in XLogRecord (that could be spent to store
FullTransactionId instead of TransactionId) and it is better
compressible.

And your version than will not need the case when this value is split
among two buffers (since MAXALIGN is not less than 4), and PostgreSQL
already relies on 4 byte read/write atomicity (in some places even
without use of pg_atomic_uint32).

----

regards
Sokolov Yura aka funny-falcon

Thanks for the great suggestion!

I think we've arrived at a critical juncture where we need to decide
which patch to move forward with for our optimization efforts. I've
evaluated the pros and cons of my implementation:

Pros:
-Achieves an additional 4% performance improvement.

Cons:
-Breaks the current convention of XLog insertions.
-TAP tests are not fully passed, requiring time to resolve.
-May necessitate changes to the format and backup tools, potentially
leading to backward compatibility issues.

Given these considerations, I believe your implementation is superior to
mine. I'd greatly appreciate it if you could share your insights on this
matter.

Good day, Zhiguo.

Excuse me, I feel sneaky a bit, but I've started another thread just
about increase of NUM_XLOGINSERT_LOCK, because I can measure its effect
even on my working notebook (it is another one: Ryzen 5825U limited to
@2GHz).

/messages/by-id/flat/3b11fdc2-9793-403d-b3d4-67ff9a00d447@postgrespro.ru

-----

regards
Yura Sokolov aka funny-falcon

#17

Zhou, Zhiguo

zhiguo.zhou@intel.com

about 1 year ago

In reply to: Yura Sokolov (#16)

Re: [RFC] Lock-free XLog Reservation from WAL

On 1/16/2025 10:00 PM, Yura Sokolov wrote:

Good day, Zhiguo.

Excuse me, I feel sneaky a bit, but I've started another thread just
about increase of NUM_XLOGINSERT_LOCK, because I can measure its effect
even on my working notebook (it is another one: Ryzen 5825U limited to
@2GHz).

/messages/by-id/flat/3b11fdc2-9793-403d-
b3d4-67ff9a00d447%40postgrespro.ru

-----

regards
Yura Sokolov aka funny-falcon

Good day, Yura!

Thank you for keeping me informed. I appreciate your proactive approach
and understand the importance of exploring different angles for
optimization. Your patch is indeed fundamental to our ongoing work on
the lock-free xlog reservation, and I'm eager to see how it can further
enhance our efforts.

I will proceed to test the performance impact of your latest patch when
combined with the lock-free xlog reservation patch. This will help us
determine if there's potential for additional optimization.
Concurrently, with your permission, I'll try to refine the
hash-table-based implementation for your further review. WDYT?

Regards,
Zhiguo

#18

Yura Sokolov

y.sokolov@postgrespro.ru

about 1 year ago

In reply to: Zhou, Zhiguo (#17)

Re: [RFC] Lock-free XLog Reservation from WAL

17.01.2025 17:00, Zhou, Zhiguo пишет:

On 1/16/2025 10:00 PM, Yura Sokolov wrote:

Good day, Zhiguo.

Excuse me, I feel sneaky a bit, but I've started another thread just
about increase of NUM_XLOGINSERT_LOCK, because I can measure its
effect even on my working notebook (it is another one: Ryzen 5825U
limited to @2GHz).

/messages/by-id/flat/3b11fdc2-9793-403d-
b3d4-67ff9a00d447%40postgrespro.ru

-----

regards
Yura Sokolov aka funny-falcon

Good day, Yura!

Thank you for keeping me informed. I appreciate your proactive approach
and understand the importance of exploring different angles for
optimization. Your patch is indeed fundamental to our ongoing work on
the lock-free xlog reservation, and I'm eager to see how it can further
enhance our efforts.

I will proceed to test the performance impact of your latest patch when
combined with the lock-free xlog reservation patch. This will help us
determine if there's potential for additional optimization.
Concurrently, with your permission, I'll try to refine the hash-table-
based implementation for your further review. WDYT?

Certainly.

And I will sent my version of 64bit operations on hash-table entries...

tomorrow.

Today is 3am at the moment...

I was doing "removal of WALBufMappingLock" [1]/messages/by-id/flat/39b39e7a-41b4-4f34-b3f5-db735e74a723@postgrespro.ru
and I want to sleep a lot...

[1]: /messages/by-id/flat/39b39e7a-41b4-4f34-b3f5-db735e74a723@postgrespro.ru
/messages/by-id/flat/39b39e7a-41b4-4f34-b3f5-db735e74a723@postgrespro.ru

-----
regards
Yura Sokolov aka funny-falcon

#19

Yura Sokolov

y.sokolov@postgrespro.ru

about 1 year ago

In reply to: Zhou, Zhiguo (#17)

Re: [RFC] Lock-free XLog Reservation from WAL

17.01.2025 17:00, Zhou, Zhiguo пишет:

On 1/16/2025 10:00 PM, Yura Sokolov wrote:

Good day, Zhiguo.

Excuse me, I feel sneaky a bit, but I've started another thread just
about increase of NUM_XLOGINSERT_LOCK, because I can measure its
effect even on my working notebook (it is another one: Ryzen 5825U
limited to @2GHz).

/messages/by-id/flat/3b11fdc2-9793-403d-
b3d4-67ff9a00d447%40postgrespro.ru

-----

regards
Yura Sokolov aka funny-falcon

Good day, Yura!

Thank you for keeping me informed. I appreciate your proactive approach
and understand the importance of exploring different angles for
optimization. Your patch is indeed fundamental to our ongoing work on
the lock-free xlog reservation, and I'm eager to see how it can further
enhance our efforts.

I will proceed to test the performance impact of your latest patch when
combined with the lock-free xlog reservation patch. This will help us
determine if there's potential for additional optimization.
Concurrently, with your permission, I'll try to refine the hash-table-
based implementation for your further review. WDYT?

Good day, Zhiguo

Here's version of "hash-table reservation" with both 32bit and 64bit
operations (depending on PG_HAVE_ATOMIC_U64_SIMULATION, or may be
switched by hand).

64bit version uses other protocol with a bit lesser atomic operations. I
suppose it could be a bit faster. But I can't prove it now.

btw, you wrote:

Major issue:
- `SetPrevRecPtr` and `GetPrevRecPtr` do non-atomic write/read

with on

platforms where MAXALIGN != 8 or without native 64 load/store.

Branch

with 'memcpy` is rather obvious, but even pointer de-referencing on
"lucky case" is not safe either.

I have no idea how to fix it at the moment.

Indeed, non-atomic write/read operations can lead to safety issues in
some situations. My initial thought is to define a bit near the
prev-link to flag the completion of the update. In this way, we could
allow non-atomic or even discontinuous write/read operations on the
prev-link, while simultaneously guaranteeing its atomicity through
atomic operations (as well as memory barriers) on the flag bit. What
do you think of this as a viable solution?

There is a way to order operations:
- since SetPrevRecPtr stores start of record as LSN, its lower 32bits
are certainly non-zero (record could not start at the beginning of a page).
- so SetPrevRecPtr should write high 32bits, issue write barrier, and
then write lower 32bits,
- and then GetPrevRecPtr should first read lower 32bits, and if it is
not zero, then issue read barrier and read upper 32bits.

This way you will always read correct prev-rec-ptr on platform without
64bit atomics. (because MAXALING >= 4 and PostgreSQL requires 4 byte
atomicity for several years).

------
regards
Yura Sokolov aka funny-falcon

#20

Zhou, Zhiguo

zhiguo.zhou@intel.com

about 1 year ago

In reply to: Yura Sokolov (#19)

Re: [RFC] Lock-free XLog Reservation from WAL

On 1/19/2025 10:56 PM, Yura Sokolov wrote:

17.01.2025 17:00, Zhou, Zhiguo пишет:

On 1/16/2025 10:00 PM, Yura Sokolov wrote:

Good day, Zhiguo.

Excuse me, I feel sneaky a bit, but I've started another thread just
about increase of NUM_XLOGINSERT_LOCK, because I can measure its
effect even on my working notebook (it is another one: Ryzen 5825U
limited to @2GHz).

/messages/by-id/flat/3b11fdc2-9793-403d-
b3d4-67ff9a00d447%40postgrespro.ru

-----

regards
Yura Sokolov aka funny-falcon

Good day, Yura!

Thank you for keeping me informed. I appreciate your proactive
approach and understand the importance of exploring different angles
for optimization. Your patch is indeed fundamental to our ongoing work
on the lock-free xlog reservation, and I'm eager to see how it can
further enhance our efforts.

I will proceed to test the performance impact of your latest patch
when combined with the lock-free xlog reservation patch. This will
help us determine if there's potential for additional optimization.
Concurrently, with your permission, I'll try to refine the hash-table-
based implementation for your further review. WDYT?

Good day, Zhiguo

Here's version of "hash-table reservation" with both 32bit and 64bit
operations (depending on PG_HAVE_ATOMIC_U64_SIMULATION, or may be
switched by hand).

64bit version uses other protocol with a bit lesser atomic operations. I
suppose it could be a bit faster. But I can't prove it now.

btw, you wrote:

Major issue:
     - `SetPrevRecPtr` and `GetPrevRecPtr` do non-atomic write/read

with on

     platforms where MAXALIGN != 8 or without native 64 load/store.

Branch

     with 'memcpy` is rather obvious, but even pointer de-referencing on
     "lucky case" is not safe either.

     I have no idea how to fix it at the moment.

Indeed, non-atomic write/read operations can lead to safety issues in
some situations. My initial thought is to define a bit near the
prev-link to flag the completion of the update. In this way, we could
allow non-atomic or even discontinuous write/read operations on the
prev-link, while simultaneously guaranteeing its atomicity through
atomic operations (as well as memory barriers) on the flag bit. What
do you think of this as a viable solution?

There is a way to order operations:
- since SetPrevRecPtr stores start of record as LSN, its lower 32bits
are certainly non-zero (record could not start at the beginning of a page).
- so SetPrevRecPtr should write high 32bits, issue write barrier, and
then write lower 32bits,
- and then GetPrevRecPtr should first read lower 32bits, and if it is
not zero, then issue read barrier and read upper 32bits.

This way you will always read correct prev-rec-ptr on platform without
64bit atomics. (because MAXALING >= 4 and PostgreSQL requires 4 byte
atomicity for several years).

------
regards
Yura Sokolov aka funny-falcon

Good day, Yura.

Thank you for your patch! It has been incredibly helpful and serves as a
great guide for my revisions. I particularly appreciate your insight
into writing the prev-rec-ptr atomically. It's a brilliant approach, and
I will definitely try implementing it in my development work. Besides,
please take some well-deserved rest. Thanks!

Regards,
Zhiguo

#21