Get rid of WALBufMappingLock

Started by Yura Sokolovabout 1 year ago47 messages
Jump to latest
#1Yura Sokolov
y.sokolov@postgrespro.ru

Good day, hackers.

During discussion of Increasing NUM_XLOGINSERT_LOCKS [1]/messages/by-id/flat/3b11fdc2-9793-403d-b3d4-67ff9a00d447@postgrespro.ru, Andres Freund
used benchmark which creates WAL records very intensively. While I this
it is not completely fair (1MB log records are really rare), it pushed
me to analyze write-side waiting of XLog machinery.

First I tried to optimize WaitXLogInsertionsToFinish, but without great
success (yet).

While profiling, I found a lot of time is spend in the memory clearing
under global WALBufMappingLock:

MemSet((char *) NewPage, 0, XLOG_BLCKSZ);

It is obvious scalability bottleneck.

So "challenge was accepted".

Certainly, backend should initialize pages without exclusive lock. But
which way to ensure pages were initialized? In other words, how to
ensure XLogCtl->InitializedUpTo is correct.

I've tried to play around WALBufMappingLock with holding it for a short
time and spinning on XLogCtl->xlblocks[nextidx]. But in the end I found
WALBufMappingLock is useless at all.

Instead of holding lock, it is better to allow backends to cooperate:
- I bound ConditionVariable to each xlblocks entry,
- every backend now checks every required block pointed by
InitializedUpto was successfully initialized or sleeps on its condvar,
- when backend sure block is initialized, it tries to update
InitializedUpTo with conditional variable.

Andres's benchmark looks like:

c=100 && install/bin/psql -c checkpoint -c 'select pg_switch_wal()'
postgres && install/bin/pgbench -n -M prepared -c$c -j$c -f <(echo
"SELECT pg_logical_emit_message(true, 'test', repeat('0',
1024*1024));";) -P1 -T45 postgres

So, it generate 1M records as fast as possible for 45 seconds.

Test machine is Ryzen 5825U (8c/16th) limited to 2GHz.
Config:

max_connections = 1000
shared_buffers = 1024MB
fsync = off
wal_sync_method = fdatasync
full_page_writes = off
wal_buffers = 1024MB
checkpoint_timeout = 1d

Results are: "average for 45 sec" /"1 second max outlier"

Results for master @ d3d098316913 :
25 clients: 2908 /3230
50 clients: 2759 /3130
100 clients: 2641 /2933
200 clients: 2419 /2707
400 clients: 1928 /2377
800 clients: 1689 /2266

With v0-0001-Get-rid-of-WALBufMappingLock.patch :
25 clients: 3103 /3583
50 clients: 3183 /3706
100 clients: 3106 /3559
200 clients: 2902 /3427
400 clients: 2303 /2717
800 clients: 1925 /2329

Combined with v0-0002-several-attempts-to-lock-WALInsertLocks.patch

No WALBufMappingLock + attempts on XLogInsertLock:
25 clients: 3518 /3750
50 clients: 3355 /3548
100 clients: 3226 /3460
200 clients: 3092 /3299
400 clients: 2575 /2801
800 clients: 1946 /2341

This results are with untouched NUM_XLOGINSERT_LOCKS == 8.

[1]: /messages/by-id/flat/3b11fdc2-9793-403d-b3d4-67ff9a00d447@postgrespro.ru
/messages/by-id/flat/3b11fdc2-9793-403d-b3d4-67ff9a00d447@postgrespro.ru

PS.
Increasing NUM_XLOGINSERT_LOCKS to 64 gives:
25 clients: 3457 /3624
50 clients: 3215 /3500
100 clients: 2750 /3000
200 clients: 2535 /2729
400 clients: 2163 /2400
800 clients: 1700 /2060

While doing this on master:
25 clients 2645 /2953
50 clients: 2562 /2968
100 clients: 2364 /2756
200 clients: 2266 /2564
400 clients: 1868 /2228
800 clients: 1527 /2133

So, patched version with increased NUM_XLOGINSERT_LOCKS looks no worse
than unpatched without increasing num of locks.

-------
regards
Yura Sokolov aka funny-falcon

Attachments:

v0-0001-Get-rid-of-WALBufMappingLock.patchtext/x-patch; charset=UTF-8; name=v0-0001-Get-rid-of-WALBufMappingLock.patchDownload+122-73
v0-0002-several-attempts-to-lock-WALInsertLocks.patchtext/x-patch; charset=UTF-8; name=v0-0002-several-attempts-to-lock-WALInsertLocks.patchDownload+28-20
#2Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Yura Sokolov (#1)
Re: Get rid of WALBufMappingLock

19.01.2025 03:11, Yura Sokolov пишет:

Good day, hackers.

During discussion of Increasing NUM_XLOGINSERT_LOCKS [1], Andres Freund
used benchmark which creates WAL records very intensively. While I this
it is not completely fair (1MB log records are really rare), it pushed
me to analyze write-side waiting of XLog machinery.

First I tried to optimize WaitXLogInsertionsToFinish, but without great
success (yet).

While profiling, I found a lot of time is spend in the memory clearing
under global WALBufMappingLock:

    MemSet((char *) NewPage, 0, XLOG_BLCKSZ);

It is obvious scalability bottleneck.

So "challenge was accepted".

Certainly, backend should initialize pages without exclusive lock. But
which way to ensure pages were initialized? In other words, how to
ensure XLogCtl->InitializedUpTo is correct.

I've tried to play around WALBufMappingLock with holding it for a short
time and spinning on XLogCtl->xlblocks[nextidx]. But in the end I found
WALBufMappingLock is useless at all.

Instead of holding lock, it is better to allow backends to cooperate:
- I bound ConditionVariable to each xlblocks entry,
- every backend now checks every required block pointed by
InitializedUpto was successfully initialized or sleeps on its condvar,
- when backend sure block is initialized, it tries to update
InitializedUpTo with conditional variable.

Andres's benchmark looks like:

  c=100 && install/bin/psql -c checkpoint -c 'select pg_switch_wal()'
postgres && install/bin/pgbench -n -M prepared -c$c -j$c -f <(echo
"SELECT pg_logical_emit_message(true, 'test', repeat('0',
1024*1024));";) -P1 -T45 postgres

So, it generate 1M records as fast as possible for 45 seconds.

Test machine is Ryzen 5825U (8c/16th) limited to 2GHz.
Config:

  max_connections = 1000
  shared_buffers = 1024MB
  fsync = off
  wal_sync_method = fdatasync
  full_page_writes = off
  wal_buffers = 1024MB
  checkpoint_timeout = 1d

Results are: "average for 45 sec"  /"1 second max outlier"

Results for master @ d3d098316913 :
  25  clients: 2908  /3230
  50  clients: 2759  /3130
  100 clients: 2641  /2933
  200 clients: 2419  /2707
  400 clients: 1928  /2377
  800 clients: 1689  /2266

With v0-0001-Get-rid-of-WALBufMappingLock.patch :
  25  clients: 3103  /3583
  50  clients: 3183  /3706
  100 clients: 3106  /3559
  200 clients: 2902  /3427
  400 clients: 2303  /2717
  800 clients: 1925  /2329

Combined with v0-0002-several-attempts-to-lock-WALInsertLocks.patch

No WALBufMappingLock + attempts on XLogInsertLock:
  25  clients: 3518  /3750
  50  clients: 3355  /3548
  100 clients: 3226  /3460
  200 clients: 3092  /3299
  400 clients: 2575  /2801
  800 clients: 1946  /2341

This results are with untouched NUM_XLOGINSERT_LOCKS == 8.

[1] /messages/by-id/flat/3b11fdc2-9793-403d-
b3d4-67ff9a00d447%40postgrespro.ru

PS.
Increasing NUM_XLOGINSERT_LOCKS to 64 gives:
  25  clients: 3457  /3624
  50  clients: 3215  /3500
  100 clients: 2750  /3000
  200 clients: 2535  /2729
  400 clients: 2163  /2400
  800 clients: 1700  /2060

While doing this on master:
  25  clients  2645  /2953
  50  clients: 2562  /2968
  100 clients: 2364  /2756
  200 clients: 2266  /2564
  400 clients: 1868  /2228
  800 clients: 1527  /2133

So, patched version with increased NUM_XLOGINSERT_LOCKS looks no worse
than unpatched without increasing num of locks.

I'm too brave... or too sleepy (it's 3:30am)...
But I took the risk of sending a patch to commitfest:
https://commitfest.postgresql.org/52/5511/

------
regards
Yura Sokolov aka funny-falcon

#3Alexander Korotkov
aekorotkov@gmail.com
In reply to: Yura Sokolov (#1)
Re: Get rid of WALBufMappingLock

Hi!

On Sun, Jan 19, 2025 at 2:11 AM Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

During discussion of Increasing NUM_XLOGINSERT_LOCKS [1], Andres Freund
used benchmark which creates WAL records very intensively. While I this
it is not completely fair (1MB log records are really rare), it pushed
me to analyze write-side waiting of XLog machinery.

First I tried to optimize WaitXLogInsertionsToFinish, but without great
success (yet).

While profiling, I found a lot of time is spend in the memory clearing
under global WALBufMappingLock:

MemSet((char *) NewPage, 0, XLOG_BLCKSZ);

It is obvious scalability bottleneck.

So "challenge was accepted".

Certainly, backend should initialize pages without exclusive lock. But
which way to ensure pages were initialized? In other words, how to
ensure XLogCtl->InitializedUpTo is correct.

I've tried to play around WALBufMappingLock with holding it for a short
time and spinning on XLogCtl->xlblocks[nextidx]. But in the end I found
WALBufMappingLock is useless at all.

Instead of holding lock, it is better to allow backends to cooperate:
- I bound ConditionVariable to each xlblocks entry,
- every backend now checks every required block pointed by
InitializedUpto was successfully initialized or sleeps on its condvar,
- when backend sure block is initialized, it tries to update
InitializedUpTo with conditional variable.

Looks reasonable for me, but having ConditionVariable per xlog buffer
seems overkill for me. Find an attached revision, where I've
implemented advancing InitializedUpTo without ConditionVariable.
After initialization of each buffer there is attempt to do CAS for
InitializedUpTo in a loop. So, multiple processes will try to advance
InitializedUpTo, they could hijack initiative from each other, but
there is always a leader which will finish the work.

There is only one ConditionVariable to wait for InitializedUpTo being advanced.

I didn't benchmark my version, just checked that tests passed.

------
Regards,
Alexander Korotkov
Supabase

Attachments:

v1-0002-several-attempts-to-lock-WALInsertLocks.patchapplication/octet-stream; name=v1-0002-several-attempts-to-lock-WALInsertLocks.patchDownload+28-20
v1-0001-Get-rid-of-WALBufMappingLock.patchapplication/octet-stream; name=v1-0001-Get-rid-of-WALBufMappingLock.patchDownload+111-74
#4Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Alexander Korotkov (#3)
Re: Get rid of WALBufMappingLock

07.02.2025 01:26, Alexander Korotkov пишет:

Hi!

On Sun, Jan 19, 2025 at 2:11 AM Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

During discussion of Increasing NUM_XLOGINSERT_LOCKS [1], Andres Freund
used benchmark which creates WAL records very intensively. While I this
it is not completely fair (1MB log records are really rare), it pushed
me to analyze write-side waiting of XLog machinery.

First I tried to optimize WaitXLogInsertionsToFinish, but without great
success (yet).

While profiling, I found a lot of time is spend in the memory clearing
under global WALBufMappingLock:

MemSet((char *) NewPage, 0, XLOG_BLCKSZ);

It is obvious scalability bottleneck.

So "challenge was accepted".

Certainly, backend should initialize pages without exclusive lock. But
which way to ensure pages were initialized? In other words, how to
ensure XLogCtl->InitializedUpTo is correct.

I've tried to play around WALBufMappingLock with holding it for a short
time and spinning on XLogCtl->xlblocks[nextidx]. But in the end I found
WALBufMappingLock is useless at all.

Instead of holding lock, it is better to allow backends to cooperate:
- I bound ConditionVariable to each xlblocks entry,
- every backend now checks every required block pointed by
InitializedUpto was successfully initialized or sleeps on its condvar,
- when backend sure block is initialized, it tries to update
InitializedUpTo with conditional variable.

Looks reasonable for me, but having ConditionVariable per xlog buffer
seems overkill for me. Find an attached revision, where I've
implemented advancing InitializedUpTo without ConditionVariable.
After initialization of each buffer there is attempt to do CAS for
InitializedUpTo in a loop. So, multiple processes will try to advance
InitializedUpTo, they could hijack initiative from each other, but
there is always a leader which will finish the work.

There is only one ConditionVariable to wait for InitializedUpTo being advanced.

I didn't benchmark my version, just checked that tests passed.

Good day, Alexander.

I've got mixed but quite close result for both approaches (single or many
ConditionVariable) on the notebook. Since I have no access to larger
machine, I can't prove "many" is way better (or discover it worse).

Given patch after cleanup looks a bit smaller and clearer, I agree to keep
just single condition variable.

Cleaned version is attached.

I've changed condition for broadcast a bit ("less" instead "not equal"):
- buffer's border may already go into future,
- and then other backend will reach not yet initialized buffer and will
broadcast.

-------
regards
Yura Sokolov aka funny-falcon

Attachments:

v2-0001-Get-rid-of-WALBufMappingLock.patchtext/x-patch; charset=UTF-8; name=v2-0001-Get-rid-of-WALBufMappingLock.patchDownload+90-63
v2-0002-several-attempts-to-lock-WALInsertLocks.patchtext/x-patch; charset=UTF-8; name=v2-0002-several-attempts-to-lock-WALInsertLocks.patchDownload+28-20
#5Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Yura Sokolov (#4)
Re: Get rid of WALBufMappingLock

07.02.2025 14:02, Yura Sokolov пишет:

07.02.2025 01:26, Alexander Korotkov пишет:

Hi!

On Sun, Jan 19, 2025 at 2:11 AM Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

During discussion of Increasing NUM_XLOGINSERT_LOCKS [1], Andres Freund
used benchmark which creates WAL records very intensively. While I this
it is not completely fair (1MB log records are really rare), it pushed
me to analyze write-side waiting of XLog machinery.

First I tried to optimize WaitXLogInsertionsToFinish, but without great
success (yet).

While profiling, I found a lot of time is spend in the memory clearing
under global WALBufMappingLock:

MemSet((char *) NewPage, 0, XLOG_BLCKSZ);

It is obvious scalability bottleneck.

So "challenge was accepted".

Certainly, backend should initialize pages without exclusive lock. But
which way to ensure pages were initialized? In other words, how to
ensure XLogCtl->InitializedUpTo is correct.

I've tried to play around WALBufMappingLock with holding it for a short
time and spinning on XLogCtl->xlblocks[nextidx]. But in the end I found
WALBufMappingLock is useless at all.

Instead of holding lock, it is better to allow backends to cooperate:
- I bound ConditionVariable to each xlblocks entry,
- every backend now checks every required block pointed by
InitializedUpto was successfully initialized or sleeps on its condvar,
- when backend sure block is initialized, it tries to update
InitializedUpTo with conditional variable.

Looks reasonable for me, but having ConditionVariable per xlog buffer
seems overkill for me. Find an attached revision, where I've
implemented advancing InitializedUpTo without ConditionVariable.
After initialization of each buffer there is attempt to do CAS for
InitializedUpTo in a loop. So, multiple processes will try to advance
InitializedUpTo, they could hijack initiative from each other, but
there is always a leader which will finish the work.

There is only one ConditionVariable to wait for InitializedUpTo being advanced.

I didn't benchmark my version, just checked that tests passed.

Good day, Alexander.

Seems I was mistaken twice.

I've got mixed but quite close result for both approaches (single or many
ConditionVariable) on the notebook. Since I have no access to larger
machine, I can't prove "many" is way better (or discover it worse).

1. "many condvars" (my variant) is strictly worse with num locks = 64 and
when pg_logical_emit_message emits just 1kB instead of 1MB.

Therefore, "single condvar" is strictly better.

Given patch after cleanup looks a bit smaller and clearer, I agree to keep
just single condition variable.

Cleaned version is attached.

I've changed condition for broadcast a bit ("less" instead "not equal"):
- buffer's border may already go into future,
- and then other backend will reach not yet initialized buffer and will
broadcast.

2. I've inserted abort if "buffer's border went into future", and wasn't
able to trigger it.

So I returned update-loop's body to your variant.

-------
regards
Yura Sokolov aka funny-falcon

Attachments:

v3-0001-Get-rid-of-WALBufMappingLock.patchtext/x-patch; charset=UTF-8; name=v3-0001-Get-rid-of-WALBufMappingLock.patchDownload+91-63
v3-0002-several-attempts-to-lock-WALInsertLocks.patchtext/x-patch; charset=UTF-8; name=v3-0002-several-attempts-to-lock-WALInsertLocks.patchDownload+28-20
#6Alexander Korotkov
aekorotkov@gmail.com
In reply to: Yura Sokolov (#5)
Re: Get rid of WALBufMappingLock

On Fri, Feb 7, 2025 at 1:24 PM Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

07.02.2025 14:02, Yura Sokolov пишет:

07.02.2025 01:26, Alexander Korotkov пишет:

Hi!

On Sun, Jan 19, 2025 at 2:11 AM Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

During discussion of Increasing NUM_XLOGINSERT_LOCKS [1], Andres Freund
used benchmark which creates WAL records very intensively. While I this
it is not completely fair (1MB log records are really rare), it pushed
me to analyze write-side waiting of XLog machinery.

First I tried to optimize WaitXLogInsertionsToFinish, but without great
success (yet).

While profiling, I found a lot of time is spend in the memory clearing
under global WALBufMappingLock:

MemSet((char *) NewPage, 0, XLOG_BLCKSZ);

It is obvious scalability bottleneck.

So "challenge was accepted".

Certainly, backend should initialize pages without exclusive lock. But
which way to ensure pages were initialized? In other words, how to
ensure XLogCtl->InitializedUpTo is correct.

I've tried to play around WALBufMappingLock with holding it for a short
time and spinning on XLogCtl->xlblocks[nextidx]. But in the end I found
WALBufMappingLock is useless at all.

Instead of holding lock, it is better to allow backends to cooperate:
- I bound ConditionVariable to each xlblocks entry,
- every backend now checks every required block pointed by
InitializedUpto was successfully initialized or sleeps on its condvar,
- when backend sure block is initialized, it tries to update
InitializedUpTo with conditional variable.

Looks reasonable for me, but having ConditionVariable per xlog buffer
seems overkill for me. Find an attached revision, where I've
implemented advancing InitializedUpTo without ConditionVariable.
After initialization of each buffer there is attempt to do CAS for
InitializedUpTo in a loop. So, multiple processes will try to advance
InitializedUpTo, they could hijack initiative from each other, but
there is always a leader which will finish the work.

There is only one ConditionVariable to wait for InitializedUpTo being advanced.

I didn't benchmark my version, just checked that tests passed.

Good day, Alexander.

Seems I was mistaken twice.

I've got mixed but quite close result for both approaches (single or many
ConditionVariable) on the notebook. Since I have no access to larger
machine, I can't prove "many" is way better (or discover it worse).

1. "many condvars" (my variant) is strictly worse with num locks = 64 and
when pg_logical_emit_message emits just 1kB instead of 1MB.

Therefore, "single condvar" is strictly better.

Given patch after cleanup looks a bit smaller and clearer, I agree to keep
just single condition variable.

Cleaned version is attached.

I've changed condition for broadcast a bit ("less" instead "not equal"):
- buffer's border may already go into future,
- and then other backend will reach not yet initialized buffer and will
broadcast.

2. I've inserted abort if "buffer's border went into future", and wasn't
able to trigger it.

So I returned update-loop's body to your variant.

Good, thank you. I think 0001 patch is generally good, but needs some
further polishing, e.g. more comments explaining how does it work.

Regarding 0002 patch, it looks generally reasonable. But are 2
attempts always optimal? Are there cases of regression, or cases when
more attempts are even better? Could we have there some
self-adjusting mechanism like what we have for spinlocks?

------
Regards,
Alexander Korotkov
Supabase

#7Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#6)
Re: Get rid of WALBufMappingLock

On Fri, Feb 7, 2025 at 1:39 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

Good, thank you. I think 0001 patch is generally good, but needs some
further polishing, e.g. more comments explaining how does it work.

Two things comes to my mind worth rechecking about 0001.
1) Are XLogCtl->InitializeReserved, XLogCtl->InitializedUpTo and
XLogCtl->xlblocks always page-aligned? Because algorithm seems to be
sensitive to that. If so, I would propose to explicitly comment that
and add corresponding asserts.
2) Check if there are concurrency issues between
AdvanceXLInsertBuffer() and switching to the new WAL file.

------
Regards,
Alexander Korotkov
Supabase

#8Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Alexander Korotkov (#7)
Re: Get rid of WALBufMappingLock

08.02.2025 13:07, Alexander Korotkov пишет:

On Fri, Feb 7, 2025 at 1:39 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

Good, thank you. I think 0001 patch is generally good, but needs some
further polishing, e.g. more comments explaining how does it work.

I tried to add more comments. I'm not good at, so recommendations are welcome.

Two things comes to my mind worth rechecking about 0001.
1) Are XLogCtl->InitializeReserved, XLogCtl->InitializedUpTo and
XLogCtl->xlblocks always page-aligned? Because algorithm seems to be
sensitive to that. If so, I would propose to explicitly comment that
and add corresponding asserts.

They're certainly page aligned, since they are page borders.
I added assert on alignment of InitializeReserved for the sanity.

2) Check if there are concurrency issues between
AdvanceXLInsertBuffer() and switching to the new WAL file.

There are no issues:
1. CopyXLogRecordToWAL for isLogSwitch follows same protocol, ie uses
GetXLogBuffer to zero-out WAL page.
2. WALINSERT_SPECIAL_SWITCH forces exclusive lock on all insertion locks,
so switching wal is not concurrent. (Although, there is no need in this
exclusiveness, imho.)

Regarding 0002 patch, it looks generally reasonable. But are 2
attempts always optimal? Are there cases of regression, or cases when
more attempts are even better? Could we have there some
self-adjusting mechanism like what we have for spinlocks?

Well, I chose to perform 3 probes (2 conditional attempts + 1
unconditional) based on intuition. I have some experience in building hash
tables, and cuckoo-hashing theory tells 2 probes is usually enough to reach
50% fill-rate, and 3 probes enough for ~75% fill rate. Since each probe is
cache miss, it is hardly sensible to do more probes.

3 probes did better than 2 in other benchmark [1]/messages/by-id/3b11fdc2-9793-403d-b3d4-67ff9a00d447@postgrespro.ru, although there were
NUM_XLOGINSERT_LOCK increased.

Excuse me for not bencmarking different choices here. I'll try to do
measurements in next days.

[1]: /messages/by-id/3b11fdc2-9793-403d-b3d4-67ff9a00d447@postgrespro.ru

-------
regards
Yura Sokolov aka funny-falcon

Attachments:

v4-0001-Get-rid-of-WALBufMappingLock.patchtext/x-patch; charset=UTF-8; name=v4-0001-Get-rid-of-WALBufMappingLock.patchDownload+120-61
v4-0002-several-attempts-to-lock-WALInsertLocks.patchtext/x-patch; charset=UTF-8; name=v4-0002-several-attempts-to-lock-WALInsertLocks.patchDownload+28-19
#9Alexander Korotkov
aekorotkov@gmail.com
In reply to: Yura Sokolov (#8)
Re: Get rid of WALBufMappingLock

On Wed, Feb 12, 2025 at 8:16 PM Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

08.02.2025 13:07, Alexander Korotkov пишет:

On Fri, Feb 7, 2025 at 1:39 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

Good, thank you. I think 0001 patch is generally good, but needs some
further polishing, e.g. more comments explaining how does it work.

I tried to add more comments. I'm not good at, so recommendations are welcome.

Two things comes to my mind worth rechecking about 0001.
1) Are XLogCtl->InitializeReserved, XLogCtl->InitializedUpTo and
XLogCtl->xlblocks always page-aligned? Because algorithm seems to be
sensitive to that. If so, I would propose to explicitly comment that
and add corresponding asserts.

They're certainly page aligned, since they are page borders.
I added assert on alignment of InitializeReserved for the sanity.

2) Check if there are concurrency issues between
AdvanceXLInsertBuffer() and switching to the new WAL file.

There are no issues:
1. CopyXLogRecordToWAL for isLogSwitch follows same protocol, ie uses
GetXLogBuffer to zero-out WAL page.
2. WALINSERT_SPECIAL_SWITCH forces exclusive lock on all insertion locks,
so switching wal is not concurrent. (Although, there is no need in this
exclusiveness, imho.)

Good, thank you. I've also revised commit message and comments.

But I see another issue with this patch. In the worst case, we do
XLogWrite() by ourselves, and it could potentially could error out.
Without patch, that would cause WALBufMappingLock be released and
XLogCtl->InitializedUpTo not advanced. With the patch, that would
cause other processes infinitely waiting till we finish the
initialization.

Possible solution would be to save position of the page to be
initialized, and set it back to XLogCtl->InitializeReserved on error
(everywhere we do LWLockReleaseAll()). We also must check that on
error we only set XLogCtl->InitializeReserved to the past, because
there could be multiple concurrent failures. Also we need to
broadcast XLogCtl->InitializedUpToCondVar to wake up waiters.

Regarding 0002 patch, it looks generally reasonable. But are 2
attempts always optimal? Are there cases of regression, or cases when
more attempts are even better? Could we have there some
self-adjusting mechanism like what we have for spinlocks?

Well, I chose to perform 3 probes (2 conditional attempts + 1
unconditional) based on intuition. I have some experience in building hash
tables, and cuckoo-hashing theory tells 2 probes is usually enough to reach
50% fill-rate, and 3 probes enough for ~75% fill rate. Since each probe is
cache miss, it is hardly sensible to do more probes.

3 probes did better than 2 in other benchmark [1], although there were
NUM_XLOGINSERT_LOCK increased.

Excuse me for not bencmarking different choices here. I'll try to do
measurements in next days.

[1] /messages/by-id/3b11fdc2-9793-403d-b3d4-67ff9a00d447@postgrespro.ru

Ok, let's wait for your measurements.

------
Regards,
Alexander Korotkov
Supabase

Attachments:

v5-0002-several-attempts-to-lock-WALInsertLocks.patchapplication/octet-stream; name=v5-0002-several-attempts-to-lock-WALInsertLocks.patchDownload+28-20
v5-0001-Get-rid-of-WALBufMappingLock.patchapplication/octet-stream; name=v5-0001-Get-rid-of-WALBufMappingLock.patchDownload+111-74
#10Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Alexander Korotkov (#9)
Re: Get rid of WALBufMappingLock

13.02.2025 12:34, Alexander Korotkov пишет:

On Wed, Feb 12, 2025 at 8:16 PM Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

08.02.2025 13:07, Alexander Korotkov пишет:

On Fri, Feb 7, 2025 at 1:39 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

Good, thank you. I think 0001 patch is generally good, but needs some
further polishing, e.g. more comments explaining how does it work.

I tried to add more comments. I'm not good at, so recommendations are welcome.

Two things comes to my mind worth rechecking about 0001.
1) Are XLogCtl->InitializeReserved, XLogCtl->InitializedUpTo and
XLogCtl->xlblocks always page-aligned? Because algorithm seems to be
sensitive to that. If so, I would propose to explicitly comment that
and add corresponding asserts.

They're certainly page aligned, since they are page borders.
I added assert on alignment of InitializeReserved for the sanity.

2) Check if there are concurrency issues between
AdvanceXLInsertBuffer() and switching to the new WAL file.

There are no issues:
1. CopyXLogRecordToWAL for isLogSwitch follows same protocol, ie uses
GetXLogBuffer to zero-out WAL page.
2. WALINSERT_SPECIAL_SWITCH forces exclusive lock on all insertion locks,
so switching wal is not concurrent. (Although, there is no need in this
exclusiveness, imho.)

Good, thank you. I've also revised commit message and comments.

But I see another issue with this patch. In the worst case, we do
XLogWrite() by ourselves, and it could potentially could error out.
Without patch, that would cause WALBufMappingLock be released and
XLogCtl->InitializedUpTo not advanced. With the patch, that would
cause other processes infinitely waiting till we finish the
initialization.

Possible solution would be to save position of the page to be
initialized, and set it back to XLogCtl->InitializeReserved on error
(everywhere we do LWLockReleaseAll()). We also must check that on
error we only set XLogCtl->InitializeReserved to the past, because
there could be multiple concurrent failures. Also we need to
broadcast XLogCtl->InitializedUpToCondVar to wake up waiters.

The single place where AdvanceXLInsertBuffer is called outside of critical
section is in XLogBackgroundFlush. All other call stacks will issue server
restart if XLogWrite will raise error inside of AdvanceXLInsertBuffer.

XLogBackgroundFlush explicitely avoids writing buffers by passing
opportunistic=true parameter.

Therefore, error in XLogWrite will not cause hang in AdvanceXLInsertBuffer
since server will shutdown/restart.

Perhaps, we just need to insert `Assert(CritSectionCount > 0);` before call
to XLogWrite here?

-------
regards
Yura Sokolov aka funny-falcon

#11Alexander Korotkov
aekorotkov@gmail.com
In reply to: Yura Sokolov (#10)
Re: Get rid of WALBufMappingLock

On Thu, Feb 13, 2025 at 11:45 AM Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

13.02.2025 12:34, Alexander Korotkov пишет:

On Wed, Feb 12, 2025 at 8:16 PM Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

08.02.2025 13:07, Alexander Korotkov пишет:

On Fri, Feb 7, 2025 at 1:39 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

Good, thank you. I think 0001 patch is generally good, but needs some
further polishing, e.g. more comments explaining how does it work.

I tried to add more comments. I'm not good at, so recommendations are welcome.

Two things comes to my mind worth rechecking about 0001.
1) Are XLogCtl->InitializeReserved, XLogCtl->InitializedUpTo and
XLogCtl->xlblocks always page-aligned? Because algorithm seems to be
sensitive to that. If so, I would propose to explicitly comment that
and add corresponding asserts.

They're certainly page aligned, since they are page borders.
I added assert on alignment of InitializeReserved for the sanity.

2) Check if there are concurrency issues between
AdvanceXLInsertBuffer() and switching to the new WAL file.

There are no issues:
1. CopyXLogRecordToWAL for isLogSwitch follows same protocol, ie uses
GetXLogBuffer to zero-out WAL page.
2. WALINSERT_SPECIAL_SWITCH forces exclusive lock on all insertion locks,
so switching wal is not concurrent. (Although, there is no need in this
exclusiveness, imho.)

Good, thank you. I've also revised commit message and comments.

But I see another issue with this patch. In the worst case, we do
XLogWrite() by ourselves, and it could potentially could error out.
Without patch, that would cause WALBufMappingLock be released and
XLogCtl->InitializedUpTo not advanced. With the patch, that would
cause other processes infinitely waiting till we finish the
initialization.

Possible solution would be to save position of the page to be
initialized, and set it back to XLogCtl->InitializeReserved on error
(everywhere we do LWLockReleaseAll()). We also must check that on
error we only set XLogCtl->InitializeReserved to the past, because
there could be multiple concurrent failures. Also we need to
broadcast XLogCtl->InitializedUpToCondVar to wake up waiters.

The single place where AdvanceXLInsertBuffer is called outside of critical
section is in XLogBackgroundFlush. All other call stacks will issue server
restart if XLogWrite will raise error inside of AdvanceXLInsertBuffer.

XLogBackgroundFlush explicitely avoids writing buffers by passing
opportunistic=true parameter.

Therefore, error in XLogWrite will not cause hang in AdvanceXLInsertBuffer
since server will shutdown/restart.

Perhaps, we just need to insert `Assert(CritSectionCount > 0);` before call
to XLogWrite here?

You're correct. I just reflected this in the next revision of the patch.

------
Regards,
Alexander Korotkov
Supabase

Attachments:

v6-0001-Get-rid-of-WALBufMappingLock.patchapplication/octet-stream; name=v6-0001-Get-rid-of-WALBufMappingLock.patchDownload+111-74
v6-0002-several-attempts-to-lock-WALInsertLocks.patchapplication/octet-stream; name=v6-0002-several-attempts-to-lock-WALInsertLocks.patchDownload+28-20
#12Pavel Borisov
pashkin.elfe@gmail.com
In reply to: Alexander Korotkov (#11)
Re: Get rid of WALBufMappingLock

Hi, Yura and Alexander!

On Thu, 13 Feb 2025 at 14:08, Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Thu, Feb 13, 2025 at 11:45 AM Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

13.02.2025 12:34, Alexander Korotkov пишет:

On Wed, Feb 12, 2025 at 8:16 PM Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

08.02.2025 13:07, Alexander Korotkov пишет:

On Fri, Feb 7, 2025 at 1:39 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

Good, thank you. I think 0001 patch is generally good, but needs some
further polishing, e.g. more comments explaining how does it work.

I tried to add more comments. I'm not good at, so recommendations are welcome.

Two things comes to my mind worth rechecking about 0001.
1) Are XLogCtl->InitializeReserved, XLogCtl->InitializedUpTo and
XLogCtl->xlblocks always page-aligned? Because algorithm seems to be
sensitive to that. If so, I would propose to explicitly comment that
and add corresponding asserts.

They're certainly page aligned, since they are page borders.
I added assert on alignment of InitializeReserved for the sanity.

2) Check if there are concurrency issues between
AdvanceXLInsertBuffer() and switching to the new WAL file.

There are no issues:
1. CopyXLogRecordToWAL for isLogSwitch follows same protocol, ie uses
GetXLogBuffer to zero-out WAL page.
2. WALINSERT_SPECIAL_SWITCH forces exclusive lock on all insertion locks,
so switching wal is not concurrent. (Although, there is no need in this
exclusiveness, imho.)

Good, thank you. I've also revised commit message and comments.

But I see another issue with this patch. In the worst case, we do
XLogWrite() by ourselves, and it could potentially could error out.
Without patch, that would cause WALBufMappingLock be released and
XLogCtl->InitializedUpTo not advanced. With the patch, that would
cause other processes infinitely waiting till we finish the
initialization.

Possible solution would be to save position of the page to be
initialized, and set it back to XLogCtl->InitializeReserved on error
(everywhere we do LWLockReleaseAll()). We also must check that on
error we only set XLogCtl->InitializeReserved to the past, because
there could be multiple concurrent failures. Also we need to
broadcast XLogCtl->InitializedUpToCondVar to wake up waiters.

The single place where AdvanceXLInsertBuffer is called outside of critical
section is in XLogBackgroundFlush. All other call stacks will issue server
restart if XLogWrite will raise error inside of AdvanceXLInsertBuffer.

XLogBackgroundFlush explicitely avoids writing buffers by passing
opportunistic=true parameter.

Therefore, error in XLogWrite will not cause hang in AdvanceXLInsertBuffer
since server will shutdown/restart.

Perhaps, we just need to insert `Assert(CritSectionCount > 0);` before call
to XLogWrite here?

You're correct. I just reflected this in the next revision of the patch.

I've looked at the patchset v6.

The overall goal to avoid locking looks good. Both patches look right
to me. The only thing I'm slightly concerned about if patchset could
demonstrate performance differences in any real workload. I appreciate
it as a beautiful optimization anyway.

I have the following proposals to change the code and comments:

For patch 0001:

- Struct XLBlocks contains just one pg_atomic_uint64 member. Is it
still needed as a struct? These changes make a significant volume of
changes to the patch, being noop. Maybe it was inherited from v1 and
not needed anymore.

- Furthermore when xlblocks became a struct comments like:

and xlblocks values certainly do. xlblocks values are changed

need to be changed to xlblocks.bound. This could be avoided by
changing back xlblocks from type XLBlocks * to pg_atomic_uint64 *

- It's worth more detailed commenting
InitializedUpTo/InitializedUpToCondVar than:
+        * It is updated to successfully initialized buffer's
identities, perhaps
+        * waiting on conditional variable bound to buffer.

"perhaps waiting" could also be in style "maybe/even while AAA waits BBB"

"lock-free with cooperation with" -> "lock-free accompanied by changes to..."

- Comment inside typedef struct XLogCtlData:
/* 1st byte ptr-s + XLOG_BLCKSZ (and condvar * for) */
need to be returned back
/* 1st byte ptr-s + XLOG_BLCKSZ */

- Commented out code for cleanup in the final patch:
//ConditionVariableBroadcast(&XLogCtl->xlblocks[nextidx].condvar);

- in AdvanceXLInsertBuffer()
npages initialised to 0 but it is not increased anymore
Block under

if (XLOG_DEBUG && npages > 0)

became unreachable

(InvalidXLogRecPtr + 1) is essentially 0+1 and IMO this semantically
calls for adding #define FirstValidXLogRecPtr 1

Typo in a commit message: %s/usign/using/g

For patch 0002:

I think Yura's explanation from above in this thread need to get place
in a commit message and in a comment to this:

int attempts = 2;

Comments around:
"try another lock next time" could be modified to reflect that we do
repeat twice

Kind regards,
Pavel Borisov
Supabase

#13Alexander Korotkov
aekorotkov@gmail.com
In reply to: Pavel Borisov (#12)
Re: Get rid of WALBufMappingLock

Hi, Pavel!

On Thu, Feb 13, 2025 at 6:39 PM Pavel Borisov <pashkin.elfe@gmail.com> wrote:

On Thu, 13 Feb 2025 at 14:08, Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Thu, Feb 13, 2025 at 11:45 AM Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

13.02.2025 12:34, Alexander Korotkov пишет:

On Wed, Feb 12, 2025 at 8:16 PM Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

08.02.2025 13:07, Alexander Korotkov пишет:

On Fri, Feb 7, 2025 at 1:39 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

Good, thank you. I think 0001 patch is generally good, but needs some
further polishing, e.g. more comments explaining how does it work.

I tried to add more comments. I'm not good at, so recommendations are welcome.

Two things comes to my mind worth rechecking about 0001.
1) Are XLogCtl->InitializeReserved, XLogCtl->InitializedUpTo and
XLogCtl->xlblocks always page-aligned? Because algorithm seems to be
sensitive to that. If so, I would propose to explicitly comment that
and add corresponding asserts.

They're certainly page aligned, since they are page borders.
I added assert on alignment of InitializeReserved for the sanity.

2) Check if there are concurrency issues between
AdvanceXLInsertBuffer() and switching to the new WAL file.

There are no issues:
1. CopyXLogRecordToWAL for isLogSwitch follows same protocol, ie uses
GetXLogBuffer to zero-out WAL page.
2. WALINSERT_SPECIAL_SWITCH forces exclusive lock on all insertion locks,
so switching wal is not concurrent. (Although, there is no need in this
exclusiveness, imho.)

Good, thank you. I've also revised commit message and comments.

But I see another issue with this patch. In the worst case, we do
XLogWrite() by ourselves, and it could potentially could error out.
Without patch, that would cause WALBufMappingLock be released and
XLogCtl->InitializedUpTo not advanced. With the patch, that would
cause other processes infinitely waiting till we finish the
initialization.

Possible solution would be to save position of the page to be
initialized, and set it back to XLogCtl->InitializeReserved on error
(everywhere we do LWLockReleaseAll()). We also must check that on
error we only set XLogCtl->InitializeReserved to the past, because
there could be multiple concurrent failures. Also we need to
broadcast XLogCtl->InitializedUpToCondVar to wake up waiters.

The single place where AdvanceXLInsertBuffer is called outside of critical
section is in XLogBackgroundFlush. All other call stacks will issue server
restart if XLogWrite will raise error inside of AdvanceXLInsertBuffer.

XLogBackgroundFlush explicitely avoids writing buffers by passing
opportunistic=true parameter.

Therefore, error in XLogWrite will not cause hang in AdvanceXLInsertBuffer
since server will shutdown/restart.

Perhaps, we just need to insert `Assert(CritSectionCount > 0);` before call
to XLogWrite here?

You're correct. I just reflected this in the next revision of the patch.

I've looked at the patchset v6.

Oh, sorry, I really did wrong. I've done git format-patch for wrong
local branch for v5 and v6. Patches I've sent for v5 and v6 are
actually the same as my v1. This is really pity. Please, find the
right version of patchset attached.

------
Regards,
Alexander Korotkov
Supabase

Attachments:

v7-0001-Get-rid-of-WALBufMappingLock.patchapplication/octet-stream; name=v7-0001-Get-rid-of-WALBufMappingLock.patchDownload+136-53
v7-0002-several-attempts-to-lock-WALInsertLocks.patchapplication/octet-stream; name=v7-0002-several-attempts-to-lock-WALInsertLocks.patchDownload+28-19
#14Pavel Borisov
pashkin.elfe@gmail.com
In reply to: Alexander Korotkov (#13)
Re: Get rid of WALBufMappingLock

Hi, Alexander!

On Fri, 14 Feb 2025 at 00:59, Alexander Korotkov <aekorotkov@gmail.com> wrote:

Hi, Pavel!

On Thu, Feb 13, 2025 at 6:39 PM Pavel Borisov <pashkin.elfe@gmail.com> wrote:

On Thu, 13 Feb 2025 at 14:08, Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Thu, Feb 13, 2025 at 11:45 AM Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

13.02.2025 12:34, Alexander Korotkov пишет:

On Wed, Feb 12, 2025 at 8:16 PM Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

08.02.2025 13:07, Alexander Korotkov пишет:

On Fri, Feb 7, 2025 at 1:39 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

Good, thank you. I think 0001 patch is generally good, but needs some
further polishing, e.g. more comments explaining how does it work.

I tried to add more comments. I'm not good at, so recommendations are welcome.

Two things comes to my mind worth rechecking about 0001.
1) Are XLogCtl->InitializeReserved, XLogCtl->InitializedUpTo and
XLogCtl->xlblocks always page-aligned? Because algorithm seems to be
sensitive to that. If so, I would propose to explicitly comment that
and add corresponding asserts.

They're certainly page aligned, since they are page borders.
I added assert on alignment of InitializeReserved for the sanity.

2) Check if there are concurrency issues between
AdvanceXLInsertBuffer() and switching to the new WAL file.

There are no issues:
1. CopyXLogRecordToWAL for isLogSwitch follows same protocol, ie uses
GetXLogBuffer to zero-out WAL page.
2. WALINSERT_SPECIAL_SWITCH forces exclusive lock on all insertion locks,
so switching wal is not concurrent. (Although, there is no need in this
exclusiveness, imho.)

Good, thank you. I've also revised commit message and comments.

But I see another issue with this patch. In the worst case, we do
XLogWrite() by ourselves, and it could potentially could error out.
Without patch, that would cause WALBufMappingLock be released and
XLogCtl->InitializedUpTo not advanced. With the patch, that would
cause other processes infinitely waiting till we finish the
initialization.

Possible solution would be to save position of the page to be
initialized, and set it back to XLogCtl->InitializeReserved on error
(everywhere we do LWLockReleaseAll()). We also must check that on
error we only set XLogCtl->InitializeReserved to the past, because
there could be multiple concurrent failures. Also we need to
broadcast XLogCtl->InitializedUpToCondVar to wake up waiters.

The single place where AdvanceXLInsertBuffer is called outside of critical
section is in XLogBackgroundFlush. All other call stacks will issue server
restart if XLogWrite will raise error inside of AdvanceXLInsertBuffer.

XLogBackgroundFlush explicitely avoids writing buffers by passing
opportunistic=true parameter.

Therefore, error in XLogWrite will not cause hang in AdvanceXLInsertBuffer
since server will shutdown/restart.

Perhaps, we just need to insert `Assert(CritSectionCount > 0);` before call
to XLogWrite here?

You're correct. I just reflected this in the next revision of the patch.

I've looked at the patchset v6.

Oh, sorry, I really did wrong. I've done git format-patch for wrong
local branch for v5 and v6. Patches I've sent for v5 and v6 are
actually the same as my v1. This is really pity. Please, find the
right version of patchset attached.

I've rechecked v7. In v6 a proposal from [1]/messages/by-id/d6799557-e352-42c8-80cc-ed36e3b8893c@postgrespro.ru was not reflected. Now it
landed in v7.

Other changes are not regarding code behavior. The things from my
previous review that still could apply to v7:

For 0001:

Comment change proposed:
"lock-free with cooperation with" -> "lock-free accompanied by changes
to..." (maybe other variant)

I propose a new define:
#define FirstValidXLogRecPtr 1
While FirstValidXLogRecPtr = InvalidXLogRecPtr + 1 is true in the code
that has no semantical meaning and it's better to avoid using direct
arithmetics to relate meaning of FirstValidXLogRecPtr from
InvalidXLogRecPtr.

For 0002 both comments proposals from my message applied to v6 apply
to v7 as well

[1]: /messages/by-id/d6799557-e352-42c8-80cc-ed36e3b8893c@postgrespro.ru

Regards,
Pavel Borisov
Supabase

#15Alexander Korotkov
aekorotkov@gmail.com
In reply to: Pavel Borisov (#14)
Re: Get rid of WALBufMappingLock

On Fri, Feb 14, 2025 at 11:45 AM Pavel Borisov <pashkin.elfe@gmail.com> wrote:

On Fri, 14 Feb 2025 at 00:59, Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Thu, Feb 13, 2025 at 6:39 PM Pavel Borisov <pashkin.elfe@gmail.com> wrote:

On Thu, 13 Feb 2025 at 14:08, Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Thu, Feb 13, 2025 at 11:45 AM Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

13.02.2025 12:34, Alexander Korotkov пишет:

On Wed, Feb 12, 2025 at 8:16 PM Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

08.02.2025 13:07, Alexander Korotkov пишет:

On Fri, Feb 7, 2025 at 1:39 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

Good, thank you. I think 0001 patch is generally good, but needs some
further polishing, e.g. more comments explaining how does it work.

I tried to add more comments. I'm not good at, so recommendations are welcome.

Two things comes to my mind worth rechecking about 0001.
1) Are XLogCtl->InitializeReserved, XLogCtl->InitializedUpTo and
XLogCtl->xlblocks always page-aligned? Because algorithm seems to be
sensitive to that. If so, I would propose to explicitly comment that
and add corresponding asserts.

They're certainly page aligned, since they are page borders.
I added assert on alignment of InitializeReserved for the sanity.

2) Check if there are concurrency issues between
AdvanceXLInsertBuffer() and switching to the new WAL file.

There are no issues:
1. CopyXLogRecordToWAL for isLogSwitch follows same protocol, ie uses
GetXLogBuffer to zero-out WAL page.
2. WALINSERT_SPECIAL_SWITCH forces exclusive lock on all insertion locks,
so switching wal is not concurrent. (Although, there is no need in this
exclusiveness, imho.)

Good, thank you. I've also revised commit message and comments.

But I see another issue with this patch. In the worst case, we do
XLogWrite() by ourselves, and it could potentially could error out.
Without patch, that would cause WALBufMappingLock be released and
XLogCtl->InitializedUpTo not advanced. With the patch, that would
cause other processes infinitely waiting till we finish the
initialization.

Possible solution would be to save position of the page to be
initialized, and set it back to XLogCtl->InitializeReserved on error
(everywhere we do LWLockReleaseAll()). We also must check that on
error we only set XLogCtl->InitializeReserved to the past, because
there could be multiple concurrent failures. Also we need to
broadcast XLogCtl->InitializedUpToCondVar to wake up waiters.

The single place where AdvanceXLInsertBuffer is called outside of critical
section is in XLogBackgroundFlush. All other call stacks will issue server
restart if XLogWrite will raise error inside of AdvanceXLInsertBuffer.

XLogBackgroundFlush explicitely avoids writing buffers by passing
opportunistic=true parameter.

Therefore, error in XLogWrite will not cause hang in AdvanceXLInsertBuffer
since server will shutdown/restart.

Perhaps, we just need to insert `Assert(CritSectionCount > 0);` before call
to XLogWrite here?

You're correct. I just reflected this in the next revision of the patch.

I've looked at the patchset v6.

Oh, sorry, I really did wrong. I've done git format-patch for wrong
local branch for v5 and v6. Patches I've sent for v5 and v6 are
actually the same as my v1. This is really pity. Please, find the
right version of patchset attached.

I've rechecked v7. In v6 a proposal from [1] was not reflected. Now it
landed in v7.

Other changes are not regarding code behavior. The things from my
previous review that still could apply to v7:

For 0001:

Comment change proposed:
"lock-free with cooperation with" -> "lock-free accompanied by changes
to..." (maybe other variant)

Good catch. I've rephrased this comment even more.

I propose a new define:
#define FirstValidXLogRecPtr 1
While FirstValidXLogRecPtr = InvalidXLogRecPtr + 1 is true in the code
that has no semantical meaning and it's better to avoid using direct
arithmetics to relate meaning of FirstValidXLogRecPtr from
InvalidXLogRecPtr.

Makes sense, but I'm not sure if this change is required at all. I've
reverted this to the state of master, and everything seems to work.

For 0002 both comments proposals from my message applied to v6 apply
to v7 as well

Thank you for pointing. For now, I'm concentrated on improvements on
0001. Probably Yura could work on your notes to 0002.

------
Regards,
Alexander Korotkov
Supabase

Attachments:

v8-0001-Get-rid-of-WALBufMappingLock.patchapplication/octet-stream; name=v8-0001-Get-rid-of-WALBufMappingLock.patchDownload+132-49
v8-0002-several-attempts-to-lock-WALInsertLocks.patchapplication/octet-stream; name=v8-0002-several-attempts-to-lock-WALInsertLocks.patchDownload+28-19
#16Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Alexander Korotkov (#15)
Re: Get rid of WALBufMappingLock

14.02.2025 13:24, Alexander Korotkov пишет:

On Fri, Feb 14, 2025 at 11:45 AM Pavel Borisov <pashkin.elfe@gmail.com> wrote:

On Fri, 14 Feb 2025 at 00:59, Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Thu, Feb 13, 2025 at 6:39 PM Pavel Borisov <pashkin.elfe@gmail.com> wrote:

On Thu, 13 Feb 2025 at 14:08, Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Thu, Feb 13, 2025 at 11:45 AM Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

13.02.2025 12:34, Alexander Korotkov пишет:

On Wed, Feb 12, 2025 at 8:16 PM Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

08.02.2025 13:07, Alexander Korotkov пишет:

On Fri, Feb 7, 2025 at 1:39 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

Good, thank you. I think 0001 patch is generally good, but needs some
further polishing, e.g. more comments explaining how does it work.

I tried to add more comments. I'm not good at, so recommendations are welcome.

Two things comes to my mind worth rechecking about 0001.
1) Are XLogCtl->InitializeReserved, XLogCtl->InitializedUpTo and
XLogCtl->xlblocks always page-aligned? Because algorithm seems to be
sensitive to that. If so, I would propose to explicitly comment that
and add corresponding asserts.

They're certainly page aligned, since they are page borders.
I added assert on alignment of InitializeReserved for the sanity.

2) Check if there are concurrency issues between
AdvanceXLInsertBuffer() and switching to the new WAL file.

There are no issues:
1. CopyXLogRecordToWAL for isLogSwitch follows same protocol, ie uses
GetXLogBuffer to zero-out WAL page.
2. WALINSERT_SPECIAL_SWITCH forces exclusive lock on all insertion locks,
so switching wal is not concurrent. (Although, there is no need in this
exclusiveness, imho.)

Good, thank you. I've also revised commit message and comments.

But I see another issue with this patch. In the worst case, we do
XLogWrite() by ourselves, and it could potentially could error out.
Without patch, that would cause WALBufMappingLock be released and
XLogCtl->InitializedUpTo not advanced. With the patch, that would
cause other processes infinitely waiting till we finish the
initialization.

Possible solution would be to save position of the page to be
initialized, and set it back to XLogCtl->InitializeReserved on error
(everywhere we do LWLockReleaseAll()). We also must check that on
error we only set XLogCtl->InitializeReserved to the past, because
there could be multiple concurrent failures. Also we need to
broadcast XLogCtl->InitializedUpToCondVar to wake up waiters.

The single place where AdvanceXLInsertBuffer is called outside of critical
section is in XLogBackgroundFlush. All other call stacks will issue server
restart if XLogWrite will raise error inside of AdvanceXLInsertBuffer.

XLogBackgroundFlush explicitely avoids writing buffers by passing
opportunistic=true parameter.

Therefore, error in XLogWrite will not cause hang in AdvanceXLInsertBuffer
since server will shutdown/restart.

Perhaps, we just need to insert `Assert(CritSectionCount > 0);` before call
to XLogWrite here?

You're correct. I just reflected this in the next revision of the patch.

I've looked at the patchset v6.

Oh, sorry, I really did wrong. I've done git format-patch for wrong
local branch for v5 and v6. Patches I've sent for v5 and v6 are
actually the same as my v1. This is really pity. Please, find the
right version of patchset attached.

I've rechecked v7. In v6 a proposal from [1] was not reflected. Now it
landed in v7.

Other changes are not regarding code behavior. The things from my
previous review that still could apply to v7:

For 0001:

Comment change proposed:
"lock-free with cooperation with" -> "lock-free accompanied by changes
to..." (maybe other variant)

Good catch. I've rephrased this comment even more.

I propose a new define:
#define FirstValidXLogRecPtr 1
While FirstValidXLogRecPtr = InvalidXLogRecPtr + 1 is true in the code
that has no semantical meaning and it's better to avoid using direct
arithmetics to relate meaning of FirstValidXLogRecPtr from
InvalidXLogRecPtr.

Makes sense, but I'm not sure if this change is required at all. I've
reverted this to the state of master, and everything seems to work.

For 0002 both comments proposals from my message applied to v6 apply
to v7 as well

Thank you for pointing. For now, I'm concentrated on improvements on
0001. Probably Yura could work on your notes to 0002.

I wrote good commit message for 0002 with calculated probabilities and
simple Ruby program which calculates them to explain choice of 2
conditional attempts. (At least I hope the message is good). And added
simple comment before `int attempts = 2;`

Also I simplified 0002 a bit to look a bit prettier (ie without goto), and
added static assert on NUM_XLOGINSERT_LOCKS being power of 2.

(0001 patch is same as for v8)

-------
regards
Yura Sokolov aka funny-falcon

Attachments:

v9-0001-Get-rid-of-WALBufMappingLock.patchtext/x-patch; charset=UTF-8; name=v9-0001-Get-rid-of-WALBufMappingLock.patchDownload+132-49
v9-0002-Several-attempts-to-lock-WALInsertLocks.patchtext/x-patch; charset=UTF-8; name=v9-0002-Several-attempts-to-lock-WALInsertLocks.patchDownload+26-20
#17Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Yura Sokolov (#16)
Re: Get rid of WALBufMappingLock

14.02.2025 17:09, Yura Sokolov пишет:

14.02.2025 13:24, Alexander Korotkov пишет:

On Fri, Feb 14, 2025 at 11:45 AM Pavel Borisov <pashkin.elfe@gmail.com> wrote:

On Fri, 14 Feb 2025 at 00:59, Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Thu, Feb 13, 2025 at 6:39 PM Pavel Borisov <pashkin.elfe@gmail.com> wrote:

On Thu, 13 Feb 2025 at 14:08, Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Thu, Feb 13, 2025 at 11:45 AM Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

13.02.2025 12:34, Alexander Korotkov пишет:

On Wed, Feb 12, 2025 at 8:16 PM Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

08.02.2025 13:07, Alexander Korotkov пишет:

On Fri, Feb 7, 2025 at 1:39 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

Good, thank you. I think 0001 patch is generally good, but needs some
further polishing, e.g. more comments explaining how does it work.

I tried to add more comments. I'm not good at, so recommendations are welcome.

Two things comes to my mind worth rechecking about 0001.
1) Are XLogCtl->InitializeReserved, XLogCtl->InitializedUpTo and
XLogCtl->xlblocks always page-aligned? Because algorithm seems to be
sensitive to that. If so, I would propose to explicitly comment that
and add corresponding asserts.

They're certainly page aligned, since they are page borders.
I added assert on alignment of InitializeReserved for the sanity.

2) Check if there are concurrency issues between
AdvanceXLInsertBuffer() and switching to the new WAL file.

There are no issues:
1. CopyXLogRecordToWAL for isLogSwitch follows same protocol, ie uses
GetXLogBuffer to zero-out WAL page.
2. WALINSERT_SPECIAL_SWITCH forces exclusive lock on all insertion locks,
so switching wal is not concurrent. (Although, there is no need in this
exclusiveness, imho.)

Good, thank you. I've also revised commit message and comments.

But I see another issue with this patch. In the worst case, we do
XLogWrite() by ourselves, and it could potentially could error out.
Without patch, that would cause WALBufMappingLock be released and
XLogCtl->InitializedUpTo not advanced. With the patch, that would
cause other processes infinitely waiting till we finish the
initialization.

Possible solution would be to save position of the page to be
initialized, and set it back to XLogCtl->InitializeReserved on error
(everywhere we do LWLockReleaseAll()). We also must check that on
error we only set XLogCtl->InitializeReserved to the past, because
there could be multiple concurrent failures. Also we need to
broadcast XLogCtl->InitializedUpToCondVar to wake up waiters.

The single place where AdvanceXLInsertBuffer is called outside of critical
section is in XLogBackgroundFlush. All other call stacks will issue server
restart if XLogWrite will raise error inside of AdvanceXLInsertBuffer.

XLogBackgroundFlush explicitely avoids writing buffers by passing
opportunistic=true parameter.

Therefore, error in XLogWrite will not cause hang in AdvanceXLInsertBuffer
since server will shutdown/restart.

Perhaps, we just need to insert `Assert(CritSectionCount > 0);` before call
to XLogWrite here?

You're correct. I just reflected this in the next revision of the patch.

I've looked at the patchset v6.

Oh, sorry, I really did wrong. I've done git format-patch for wrong
local branch for v5 and v6. Patches I've sent for v5 and v6 are
actually the same as my v1. This is really pity. Please, find the
right version of patchset attached.

I've rechecked v7. In v6 a proposal from [1] was not reflected. Now it
landed in v7.

Other changes are not regarding code behavior. The things from my
previous review that still could apply to v7:

For 0001:

Comment change proposed:
"lock-free with cooperation with" -> "lock-free accompanied by changes
to..." (maybe other variant)

Good catch. I've rephrased this comment even more.

I propose a new define:
#define FirstValidXLogRecPtr 1
While FirstValidXLogRecPtr = InvalidXLogRecPtr + 1 is true in the code
that has no semantical meaning and it's better to avoid using direct
arithmetics to relate meaning of FirstValidXLogRecPtr from
InvalidXLogRecPtr.

Makes sense, but I'm not sure if this change is required at all. I've
reverted this to the state of master, and everything seems to work.

For 0002 both comments proposals from my message applied to v6 apply
to v7 as well

Thank you for pointing. For now, I'm concentrated on improvements on
0001. Probably Yura could work on your notes to 0002.

I wrote good commit message for 0002 with calculated probabilities and
simple Ruby program which calculates them to explain choice of 2
conditional attempts. (At least I hope the message is good). And added
simple comment before `int attempts = 2;`

Also I simplified 0002 a bit to look a bit prettier (ie without goto), and
added static assert on NUM_XLOGINSERT_LOCKS being power of 2.

(0001 patch is same as for v8)

-------
regards
Yura Sokolov aka funny-falcon

Oops, forgot to add StaticAssert into v9-0002.

Attachments:

v10-0001-Get-rid-of-WALBufMappingLock.patchtext/x-patch; charset=UTF-8; name=v10-0001-Get-rid-of-WALBufMappingLock.patchDownload+132-49
v10-0002-Several-attempts-to-lock-WALInsertLocks.patchtext/x-patch; charset=UTF-8; name=v10-0002-Several-attempts-to-lock-WALInsertLocks.patchDownload+28-20
#18Alexander Korotkov
aekorotkov@gmail.com
In reply to: Yura Sokolov (#17)
Re: Get rid of WALBufMappingLock

Hi!

On Fri, Feb 14, 2025 at 4:11 PM Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

14.02.2025 17:09, Yura Sokolov пишет:

14.02.2025 13:24, Alexander Korotkov пишет:

On Fri, Feb 14, 2025 at 11:45 AM Pavel Borisov <pashkin.elfe@gmail.com> wrote:

On Fri, 14 Feb 2025 at 00:59, Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Thu, Feb 13, 2025 at 6:39 PM Pavel Borisov <pashkin.elfe@gmail.com> wrote:

On Thu, 13 Feb 2025 at 14:08, Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Thu, Feb 13, 2025 at 11:45 AM Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

13.02.2025 12:34, Alexander Korotkov пишет:

On Wed, Feb 12, 2025 at 8:16 PM Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

08.02.2025 13:07, Alexander Korotkov пишет:

On Fri, Feb 7, 2025 at 1:39 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

Good, thank you. I think 0001 patch is generally good, but needs some
further polishing, e.g. more comments explaining how does it work.

I tried to add more comments. I'm not good at, so recommendations are welcome.

Two things comes to my mind worth rechecking about 0001.
1) Are XLogCtl->InitializeReserved, XLogCtl->InitializedUpTo and
XLogCtl->xlblocks always page-aligned? Because algorithm seems to be
sensitive to that. If so, I would propose to explicitly comment that
and add corresponding asserts.

They're certainly page aligned, since they are page borders.
I added assert on alignment of InitializeReserved for the sanity.

2) Check if there are concurrency issues between
AdvanceXLInsertBuffer() and switching to the new WAL file.

There are no issues:
1. CopyXLogRecordToWAL for isLogSwitch follows same protocol, ie uses
GetXLogBuffer to zero-out WAL page.
2. WALINSERT_SPECIAL_SWITCH forces exclusive lock on all insertion locks,
so switching wal is not concurrent. (Although, there is no need in this
exclusiveness, imho.)

Good, thank you. I've also revised commit message and comments.

But I see another issue with this patch. In the worst case, we do
XLogWrite() by ourselves, and it could potentially could error out.
Without patch, that would cause WALBufMappingLock be released and
XLogCtl->InitializedUpTo not advanced. With the patch, that would
cause other processes infinitely waiting till we finish the
initialization.

Possible solution would be to save position of the page to be
initialized, and set it back to XLogCtl->InitializeReserved on error
(everywhere we do LWLockReleaseAll()). We also must check that on
error we only set XLogCtl->InitializeReserved to the past, because
there could be multiple concurrent failures. Also we need to
broadcast XLogCtl->InitializedUpToCondVar to wake up waiters.

The single place where AdvanceXLInsertBuffer is called outside of critical
section is in XLogBackgroundFlush. All other call stacks will issue server
restart if XLogWrite will raise error inside of AdvanceXLInsertBuffer.

XLogBackgroundFlush explicitely avoids writing buffers by passing
opportunistic=true parameter.

Therefore, error in XLogWrite will not cause hang in AdvanceXLInsertBuffer
since server will shutdown/restart.

Perhaps, we just need to insert `Assert(CritSectionCount > 0);` before call
to XLogWrite here?

You're correct. I just reflected this in the next revision of the patch.

I've looked at the patchset v6.

Oh, sorry, I really did wrong. I've done git format-patch for wrong
local branch for v5 and v6. Patches I've sent for v5 and v6 are
actually the same as my v1. This is really pity. Please, find the
right version of patchset attached.

I've rechecked v7. In v6 a proposal from [1] was not reflected. Now it
landed in v7.

Other changes are not regarding code behavior. The things from my
previous review that still could apply to v7:

For 0001:

Comment change proposed:
"lock-free with cooperation with" -> "lock-free accompanied by changes
to..." (maybe other variant)

Good catch. I've rephrased this comment even more.

I propose a new define:
#define FirstValidXLogRecPtr 1
While FirstValidXLogRecPtr = InvalidXLogRecPtr + 1 is true in the code
that has no semantical meaning and it's better to avoid using direct
arithmetics to relate meaning of FirstValidXLogRecPtr from
InvalidXLogRecPtr.

Makes sense, but I'm not sure if this change is required at all. I've
reverted this to the state of master, and everything seems to work.

For 0002 both comments proposals from my message applied to v6 apply
to v7 as well

Thank you for pointing. For now, I'm concentrated on improvements on
0001. Probably Yura could work on your notes to 0002.

I wrote good commit message for 0002 with calculated probabilities and
simple Ruby program which calculates them to explain choice of 2
conditional attempts. (At least I hope the message is good). And added
simple comment before `int attempts = 2;`

Also I simplified 0002 a bit to look a bit prettier (ie without goto), and
added static assert on NUM_XLOGINSERT_LOCKS being power of 2.

(0001 patch is same as for v8)

Oops, forgot to add StaticAssert into v9-0002.

Thank you. I'm planning to push 0001 if there is no objections. And
I'm planning to do more review/revision of 0002.

------
Regards,
Alexander Korotkov
Supabase

#19Kirill Reshke
reshkekirill@gmail.com
In reply to: Yura Sokolov (#17)
Re: Get rid of WALBufMappingLock

Hi!
I spotted a typo in v10:

+ /*
+ * Page at nextidx wasn't initialized yet, so we cann't move
+ * InitializedUpto further. It will be moved by backend which
+ * will initialize nextidx.
+ */

cann't - > can't

moved by backend -> moved by the backend

--
Best regards,
Kirill Reshke

#20Victor Yegorov
vyegorov@gmail.com
In reply to: Yura Sokolov (#17)
Re: Get rid of WALBufMappingLock

Hey.

I find “Get rid of WALBufMappingLock" commit message misleading, 'cos Lock
it's being replaced by CV, actually.

Should the subject be changed to “Replace WALBufMappingLock with
ConditionVariable” instead?

--
Victor Yegorov

#21Pavel Borisov
pashkin.elfe@gmail.com
In reply to: Victor Yegorov (#20)
#22Pavel Borisov
pashkin.elfe@gmail.com
In reply to: Pavel Borisov (#21)
#23Pavel Borisov
pashkin.elfe@gmail.com
In reply to: Pavel Borisov (#22)
#24Pavel Borisov
pashkin.elfe@gmail.com
In reply to: Pavel Borisov (#23)
#25Alexander Korotkov
aekorotkov@gmail.com
In reply to: Pavel Borisov (#24)
#26Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alexander Korotkov (#25)
#27Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#26)
#28Alexander Korotkov
aekorotkov@gmail.com
In reply to: Michael Paquier (#27)
#29Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#28)
#30Michael Paquier
michael@paquier.xyz
In reply to: Alexander Korotkov (#29)
#31Andrey Borodin
amborodin@acm.org
In reply to: Alexander Korotkov (#29)
#32Alexander Korotkov
aekorotkov@gmail.com
In reply to: Michael Paquier (#30)
#33Michael Paquier
michael@paquier.xyz
In reply to: Alexander Korotkov (#32)
#34Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Andrey Borodin (#31)
#35Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Alexander Korotkov (#32)
#36Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Michael Paquier (#33)
#37Michael Paquier
michael@paquier.xyz
In reply to: Alvaro Herrera (#36)
#38Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alvaro Herrera (#36)
#39Alexander Korotkov
aekorotkov@gmail.com
In reply to: Michael Paquier (#37)
#40Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#38)
#41Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#40)
#42Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Alexander Korotkov (#40)
#43Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Tomas Vondra (#42)
#44Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Tomas Vondra (#42)
#45Alexander Korotkov
aekorotkov@gmail.com
In reply to: Yura Sokolov (#44)
#46Noah Misch
noah@leadboat.com
In reply to: Alexander Korotkov (#45)
#47Alexander Korotkov
aekorotkov@gmail.com
In reply to: Noah Misch (#46)