BufferAlloc: don't take two simultaneous locks

Started by Yura Sokolov · over 4 years ago · 71 messages · pgsql-hackers
#1 Yura Sokolov
y.sokolov@postgrespro.ru

Good day.

I found an opportunity in the Buffer Manager code, in the BufferAlloc
function:
- When a valid buffer is evicted, BufferAlloc acquires two partition
lwlocks: one for the partition the evicted block is in, and one for the
partition of the new block's placement.

This doesn't matter when there is a small number of concurrent
replacements. But when a lot of concurrent backends are replacing
buffers, a complex dependency net quickly arises.

It can easily be seen with select-only pgbench at scale 100 and
shared_buffers 128MB: scale 100 produces a 1.5GB table, which certainly
doesn't fit in shared buffers. In this setup performance starts to
degrade at ~100 connections. Even with shared_buffers 1GB it slowly
degrades after 150 connections.

But strictly speaking, there is no need to hold both locks
simultaneously. The buffer is pinned, so other processes cannot select
it for eviction. Once its tag is cleared and the buffer is removed from
the old partition, other processes will not find it. Therefore it is
safe to release the old partition lock before acquiring the new
partition lock.

If another process concurrently inserts the same new buffer, the old
buffer is placed on the buffer manager's freelist.

An additional optimisation: when the old buffer is reused, there is no
need to put its BufferLookupEnt onto dynahash's freelist. That reduces
lock contention a bit more. To accomplish this, FreeListData.nentries
is changed to pg_atomic_u32/pg_atomic_u64 and atomic
increment/decrement is used.

Remark: there was a bug in `hash_update_hash_key`: nentries was not
kept in sync if the freelist partitions differed. This bug was never
triggered because the single caller of `hash_update_hash_key` doesn't
move entries between partitions.

Here are some test results.

- pgbench at scale 100 was run with --select-only (since we want to
test the buffer manager alone). It produces a 1.5GB table.
- two shared_buffers values were tested: 128MB and 1GB.
- the second-best result was taken among five runs

Tests were made in three system configurations:
- a notebook with an i7-1165G7 (limited to 2.8GHz to avoid overheating)
- a Xeon X5675 6-core, 2-socket NUMA system (12 cores/24 threads)
- the same Xeon X5675 restricted to a single socket
(with numactl -m 0 -N 0)

Results for i7-1165G7:

conns | master | patched | master 1G | patched 1G
--------+------------+------------+------------+------------
1 | 29667 | 29079 | 29425 | 29411
2 | 55577 | 55553 | 57974 | 57223
3 | 87393 | 87924 | 87246 | 89210
5 | 136222 | 136879 | 133775 | 133949
7 | 179865 | 176734 | 178297 | 175559
17 | 215953 | 214708 | 222908 | 223651
27 | 211162 | 213014 | 220506 | 219752
53 | 211620 | 218702 | 220906 | 225218
83 | 213488 | 221799 | 219075 | 228096
107 | 212018 | 222110 | 222502 | 227825
139 | 207068 | 220812 | 218191 | 226712
163 | 203716 | 220793 | 213498 | 226493
191 | 199248 | 217486 | 210994 | 221026
211 | 195887 | 217356 | 209601 | 219397
239 | 193133 | 215695 | 209023 | 218773
271 | 190686 | 213668 | 207181 | 219137
307 | 188066 | 214120 | 205392 | 218782
353 | 185449 | 213570 | 202120 | 217786
397 | 182173 | 212168 | 201285 | 216489

Results for 1-socket X5675:

conns | master | patched | master 1G | patched 1G
--------+------------+------------+------------+------------
1 | 16864 | 16584 | 17419 | 17630
2 | 32764 | 32735 | 34593 | 34000
3 | 47258 | 46022 | 49570 | 47432
5 | 64487 | 64929 | 68369 | 68885
7 | 81932 | 82034 | 87543 | 87538
17 | 114502 | 114218 | 127347 | 127448
27 | 116030 | 115758 | 130003 | 128890
53 | 116814 | 117197 | 131142 | 131080
83 | 114438 | 116704 | 130198 | 130985
107 | 113255 | 116910 | 129932 | 131468
139 | 111577 | 116929 | 129012 | 131782
163 | 110477 | 116818 | 128628 | 131697
191 | 109237 | 116672 | 127833 | 131586
211 | 108248 | 116396 | 127474 | 131650
239 | 107443 | 116237 | 126731 | 131760
271 | 106434 | 115813 | 126009 | 131526
307 | 105077 | 115542 | 125279 | 131421
353 | 104516 | 115277 | 124491 | 131276
397 | 103016 | 114842 | 123624 | 131019

Results for 2-socket X5675:

conns | master | patched | master 1G | patched 1G
--------+------------+------------+------------+------------
1 | 16323 | 16280 | 16959 | 17598
2 | 30510 | 31431 | 33763 | 31690
3 | 45051 | 45834 | 48896 | 47991
5 | 71800 | 73208 | 78077 | 74714
7 | 89792 | 89980 | 95986 | 96662
17 | 178319 | 177979 | 195566 | 196143
27 | 210475 | 205209 | 226966 | 235249
53 | 222857 | 220256 | 252673 | 251041
83 | 219652 | 219938 | 250309 | 250464
107 | 218468 | 219849 | 251312 | 251425
139 | 210486 | 217003 | 250029 | 250695
163 | 204068 | 218424 | 248234 | 252940
191 | 200014 | 218224 | 246622 | 253331
211 | 197608 | 218033 | 245331 | 253055
239 | 195036 | 218398 | 243306 | 253394
271 | 192780 | 217747 | 241406 | 253148
307 | 189490 | 217607 | 239246 | 253373
353 | 186104 | 216697 | 236952 | 253034
397 | 183507 | 216324 | 234764 | 252872

As can be seen, the patched version degrades much more slowly than
master (or even doesn't degrade at all with 1GB shared buffers on the
older processor).

PS.

There is room for further improvement:
- the buffer manager's freelist could be partitioned
- dynahash's freelists could be sized/aligned to the CPU cache line
- in fact, there is no need for dynahash at all. It would be better to
make a custom hash table using BufferDescs as entries; BufferDesc has
spare space for a next link and a hash value.

regards,
Yura Sokolov
y.sokolov@postgrespro.ru
funny.falcon@gmail.com

Attachments:

v0-0001-bufmgr-do-not-acquire-two-partition-lo.patch (text/x-patch) +404 -146
1socket.gif (image/gif)
2socket.gif (image/gif)
notebook.gif (image/gif)
res.zip (application/zip)
#2 Zhihong Yu
zyu@yugabyte.com
In reply to: Yura Sokolov (#1)
Re: BufferAlloc: don't take two simultaneous locks

On Fri, Oct 1, 2021 at 3:26 PM Yura Sokolov <y.sokolov@postgrespro.ru>
wrote:

[...]

Hi,
Improvement is impressive.

For BufTableFreeDeleted(), since it only has one caller, maybe that
caller can invoke hash_return_to_freelist() directly.

For free_list_decrement_nentries():

+ Assert(hctl->freeList[freelist_idx].nentries.value < MAX_NENTRIES);

Is the assertion necessary? There is a similar assertion in
free_list_increment_nentries() which would maintain
hctl->freeList[freelist_idx].nentries.value <= MAX_NENTRIES.

Cheers

#3 Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Zhihong Yu (#2)
Re: BufferAlloc: don't take two simultaneous locks

On Fri, 01/10/2021 at 15:46 -0700, Zhihong Yu wrote:

[...]

Hi,
Improvement is impressive.

Thank you!

For BufTableFreeDeleted(), since it only has one call, maybe its
caller can invoke hash_return_to_freelist() directly.

It would be a dirty break of abstraction. Everywhere else we talk to
BufTable, and here there would suddenly be hash_* ... ugh.

For free_list_decrement_nentries():

+ Assert(hctl->freeList[freelist_idx].nentries.value < MAX_NENTRIES);

Is the assertion necessary? There is a similar assertion in
free_list_increment_nentries() which would maintain
hctl->freeList[freelist_idx].nentries.value <= MAX_NENTRIES.

The assertion in free_list_decrement_nentries is absolutely necessary:
it is the direct translation of Assert(nentries >= 0) from signed to
unsigned types. (Since there are no signed atomics in pg, I had to
convert the signed `long nentries` to an unsigned `pg_atomic_uXX
nentries`.)

The assertion in free_list_increment_nentries is not necessary. But it
doesn't hurt either: it is just an Assert that doesn't compile into
production code.

regards

Yura Sokolov
y.sokolov@postgrespro.ru
funny.falcon@gmail.com

#4 Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Yura Sokolov (#1)
Re: BufferAlloc: don't take two simultaneous locks

On Sat, 02/10/2021 at 01:25 +0300, Yura Sokolov wrote:

[...]

Here is a fixed version:
- in the first version, InvalidateBuffer's BufTableDelete was not
paired with BufTableFreeDeleted.

regards,
Yura Sokolov
y.sokolov@postgrespro.ru
funny.falcon@gmail.com

Attachments:

v1-0001-bufmgr-do-not-acquire-two-partition-lo.patch (text/x-patch) +410 -147
#5 Andrey Borodin
amborodin@acm.org
In reply to: Yura Sokolov (#4)
Re: BufferAlloc: don't take two simultaneous locks

On 21 Dec 2021, at 10:23, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

<v1-0001-bufmgr-do-not-acquire-two-partition-lo.patch>

Hi Yura!

I've taken a look at the patch. The idea seems reasonable to me: clearing/evicting the old buffer and placing the new one seem to be different units of work, so there is no need to couple the two partition locks together. And the claimed performance impact is fascinating! Though I haven't verified it yet.

At first glance the API change in BufTable does not seem obvious to me. Is void *oldelem actually a BufferTag *, or maybe a BufferLookupEnt *? What if we would like to use or manipulate oldelem in the future?

And the name BufTableFreeDeleted() confuses me a bit. You know, in C we usually free(), but in C++ we delete [], and here we do both... Just to be sure.

Thanks!

Best regards, Andrey Borodin.

#6 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Andrey Borodin (#5)
Re: BufferAlloc: don't take two simultaneous locks

At Sat, 22 Jan 2022 12:56:14 +0500, Andrey Borodin <x4mmm@yandex-team.ru> wrote in

I've taken a look at the patch. The idea seems reasonable to me:
clearing/evicting the old buffer and placing the new one seem to be
different units of work, so there is no need to couple the two
partition locks together. And the claimed performance impact is
fascinating! Though I haven't verified it yet.

The need for holding both locks came, it seems to me, from the fact
that the function was moving a buffer between two pages, and that
there is a moment where the buftable holds two entries for one buffer.
It seems to me this patch tries to move the victim buffer to the new
page via an "unallocated" state, keeping the buftable from holding
duplicate entries for the same buffer. The outline of the story sounds
reasonable.

At first glance the API change in BufTable does not seem obvious to
me. Is void *oldelem actually a BufferTag *, or maybe a
BufferLookupEnt *? What if we would like to use or manipulate oldelem
in the future?

And the name BufTableFreeDeleted() confuses me a bit. You know, in C
we usually free(), but in C++ we delete [], and here we do
both... Just to be sure.

Honestly, I don't like the API change at all, as it allows a dynahash
to be in a (even if only tentatively) broken state, and bufmgr touches
too many dynahash internals. Couldn't we get a good extent of the
benefit without such invasive changes?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#7 Michail Nikolaev
michail.nikolaev@gmail.com
In reply to: Kyotaro Horiguchi (#6)
Re: BufferAlloc: don't take two simultaneous locks

Hello, Yura.

The test results look promising. But it seems like the naming and the
dynahash API change are a little confusing.

1) I think it is better to split the main part and the atomic nentries
optimization into separate commits.
2) Also, it would be nice to fix the hash_update_hash_key bug :)
3) Do we really need the SIZEOF_LONG check? I think pg_atomic_uint64
is fine these days.
4) It looks like hash_insert_with_hash_nocheck could potentially break
the hash table. Would it be better to replace it with
hash_search_with_hash_value and a HASH_ATTACH action?
5) In that case, hash_delete_skip_freelist becomes
hash_search_with_hash_value with HASH_DETACH.
6) And then hash_return_to_freelist -> hash_dispose_detached_entry?

Another approach is a new version of hash_update_hash_key with
callbacks. That is probably the most "correct" way to keep the hash
table implementation details closed. It should be doable, I think.

Thanks,
Michail.

#8 Michail Nikolaev
michail.nikolaev@gmail.com
In reply to: Michail Nikolaev (#7)
Re: BufferAlloc: don't take two simultaneous locks

Hello, Yura.

One additional point:

1332: Assert((oldFlags & (BM_PIN_COUNT_WAITER | BM_IO_IN_PROGRESS)) == 0);
1333: CLEAR_BUFFERTAG(buf->tag);
1334: buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK);
1335: UnlockBufHdr(buf, buf_state);

I think there is no point in unlocking the buffer here, because it
will be locked again a few moments later (and no one is able to find
it in the meantime). Of course, it should be unlocked in case of a
collision.

BTW, I still think it is better to introduce some kind of
hash_update_hash_key and use it.

It may look like this:

// should be called with oldPartitionLock acquired
// newPartitionLock hold on return
// oldPartitionLock and newPartitionLock are not taken at the same time
// if newKeyPtr is present - existingEntry is removed
bool hash_update_hash_key_or_remove(
    HTAB *hashp,
    void *existingEntry,
    const void *newKeyPtr,
    uint32 newHashValue,
    LWLock *oldPartitionLock,
    LWLock *newPartitionLock
);

Thanks,
Michail.

#9 Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Michail Nikolaev (#8)
Re: BufferAlloc: don't take two simultaneous locks

On Sun, 06/02/2022 at 19:34 +0300, Michail Nikolaev wrote:

Hello, Yura.

One additional point:

1332: Assert((oldFlags & (BM_PIN_COUNT_WAITER | BM_IO_IN_PROGRESS)) == 0);
1333: CLEAR_BUFFERTAG(buf->tag);
1334: buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK);
1335: UnlockBufHdr(buf, buf_state);

I think there is no point in unlocking the buffer here, because it
will be locked again a few moments later (and no one is able to find
it in the meantime). Of course, it should be unlocked in case of a
collision.

UnlockBufHdr actually writes buf_state. Until it is called, the buffer
is in an intermediate state and it is ... locked.

We have to write the state with BM_TAG_VALID cleared before we call
BufTableDelete and release oldPartitionLock, to maintain consistency.

Perhaps this could be cheated, and there would be no harm in skipping
the state write at this point. But I'm not confident enough to do it.

BTW, I still think is better to introduce some kind of
hash_update_hash_key and use it.

It may look like this:

// should be called with oldPartitionLock acquired
// newPartitionLock hold on return
// oldPartitionLock and newPartitionLock are not taken at the same time
// if newKeyPtr is present - existingEntry is removed
bool hash_update_hash_key_or_remove(
HTAB *hashp,
void *existingEntry,
const void *newKeyPtr,
uint32 newHashValue,
LWLock *oldPartitionLock,
LWLock *newPartitionLock
);

Interesting suggestion, thanks. I'll think about it.
It has the downside of bringing LWLock knowledge into dynahash.c,
but otherwise it looks smart.

---------

regards,
Yura Sokolov

#10 Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Michail Nikolaev (#8)
Re: BufferAlloc: don't take two simultaneous locks

Hello, all.

I thought about simplifying the patch, and tested a version without
the BufTable and dynahash API changes at all.

It performs surprisingly well. It is just a bit worse than v1, since
there is more contention around dynahash's freelist, but most of the
improvement remains.

I'll finish benchmarking and will attach graphs with the next message.
The patch is attached here.

------

regards,
Yura Sokolov
Postgres Professional
y.sokolov@postgrespro.ru
funny.falcon@gmail.com

Attachments:

v2-0001-bufmgr-do-not-acquire-two-partition-lo.patch (text/x-patch) +82 -98
#11 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Yura Sokolov (#10)
Re: BufferAlloc: don't take two simultaneous locks

At Wed, 16 Feb 2022 10:40:56 +0300, Yura Sokolov <y.sokolov@postgrespro.ru> wrote in

Hello, all.

I thought about simplifying the patch, and tested a version without
the BufTable and dynahash API changes at all.

It performs surprisingly well. It is just a bit worse than v1, since
there is more contention around dynahash's freelist, but most of the
improvement remains.

I'll finish benchmarking and will attach graphs with the next message.
The patch is attached here.

Thanks for the new patch. The patch as a whole looks fine to me, but
some comments need to be revised.

(existing comments)

* To change the association of a valid buffer, we'll need to have
* exclusive lock on both the old and new mapping partitions.

...

* Somebody could have pinned or re-dirtied the buffer while we were
* doing the I/O and making the new hashtable entry. If so, we can't
* recycle this buffer; we must undo everything we've done and start
* over with a new victim buffer.

We no longer take a lock on the new partition, and at this point there
is no new hash entry (unless someone else has already created one).

+	 * Clear out the buffer's tag and flags.  We must do this to ensure that
+	 * linear scans of the buffer array don't think the buffer is valid. We

The reason we can clear out the tag is that it's safe to use the victim
buffer at this point. This comment needs to mention that reason.

+	 *
+	 * Since we are single pinner, there should no be PIN_COUNT_WAITER or
+	 * IO_IN_PROGRESS (flags that were not cleared in previous code).
+	 */
+	Assert((oldFlags & (BM_PIN_COUNT_WAITER | BM_IO_IN_PROGRESS)) == 0);

This seems to be a test for potential bugs in other functions. As
the comment says, we are sure that no other processes are pinning
the buffer, and the existing code doesn't seem to care about that
condition. Is it really needed?

+	/*
+	 * Try to make a hashtable entry for the buffer under its new tag. This
+	 * could fail because while we were writing someone else allocated another

The most significant point of this patch is the reason the victim
buffer is protected from being stolen until it is set up with its new
tag. I think we need an explanation of that protection here.

+	 * buffer for the same block we want to read in. Note that we have not yet
+	 * removed the hashtable entry for the old tag.

Since we have removed the hash table entry for the old tag at this
point, the comment is now wrong.

+		 * the first place.  First, give up the buffer we were planning to use
+		 * and put it to free lists.
..
+		StrategyFreeBuffer(buf);

This is one downside of this patch. But it seems to me that the odds
are low that many buffers are freed in a short time by this logic. By
the way, it would be better if the sentence starting with "First" were
a separate comment section.

(existing comment)
| * Okay, it's finally safe to rename the buffer.

We don't "rename" the buffer here. And the safety is already
established at the end of the oldPartitionLock section. So it could
be just something like "Now allocate the victim buffer for the new
tag"?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#12Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Kyotaro Horiguchi (#11)
Re: BufferAlloc: don't take two simultaneous locks

Good day, Kyotaro Horiguchi and hackers.

On Thu, 17/02/2022 at 14:16 +0900, Kyotaro Horiguchi wrote:

At Wed, 16 Feb 2022 10:40:56 +0300, Yura Sokolov <y.sokolov@postgrespro.ru> wrote in

Hello, all.

I thought about patch simplification, and tested version
without BufTable and dynahash api change at all.

It performs surprisingly well. It is just a bit worse
than v1 since there is more contention around dynahash's
freelist, but most of the improvement remains.

I'll finish benchmarking and will attach graphs with
next message. Patch is attached here.

Thanks for the new patch. The patch as a whole looks fine to me. But
some comments need to be revised.

Thank you for review and remarks.

(existing comments)

* To change the association of a valid buffer, we'll need to have
* exclusive lock on both the old and new mapping partitions.

...

* Somebody could have pinned or re-dirtied the buffer while we were
* doing the I/O and making the new hashtable entry. If so, we can't
* recycle this buffer; we must undo everything we've done and start
* over with a new victim buffer.

We no longer take a lock on the new partition, and at this point there
is no new hash entry (unless someone else has already created one).

fixed

+        * Clear out the buffer's tag and flags.  We must do this to ensure that
+        * linear scans of the buffer array don't think the buffer is valid. We

The reason we can clear out the tag is that it's safe to use the victim
buffer at this point. This comment needs to mention that reason.

Tried to describe.

+        *
+        * Since we are single pinner, there should no be PIN_COUNT_WAITER or
+        * IO_IN_PROGRESS (flags that were not cleared in previous code).
+        */
+       Assert((oldFlags & (BM_PIN_COUNT_WAITER | BM_IO_IN_PROGRESS)) == 0);

This seems to be a test for potential bugs in other functions. As
the comment says, we are sure that no other processes are pinning
the buffer, and the existing code doesn't seem to care about that
condition. Is it really needed?

Ok, I agree this check is excessive.
These two flags were not cleared in the previous code, and I didn't
understand why. Probably it is just a historical accident.

+       /*
+        * Try to make a hashtable entry for the buffer under its new tag. This
+        * could fail because while we were writing someone else allocated another

The most significant point of this patch is the reason the victim
buffer is protected from being stolen until it is set up with its new
tag. I think we need an explanation of that protection here.

I don't clearly understand what you mean :( . I would appreciate your
suggestion for this comment.

+        * buffer for the same block we want to read in. Note that we have not yet
+        * removed the hashtable entry for the old tag.

Since we have removed the hash table entry for the old tag at this
point, the comment is now wrong.

Thanks, changed.

+                * the first place.  First, give up the buffer we were planning to use
+                * and put it to free lists.
..
+               StrategyFreeBuffer(buf);

This is one downside of this patch. But it seems to me that the odds
are low that many buffers are freed in a short time by this logic. By
the way, it would be better if the sentence starting with "First" were
a separate comment section.

Split the comment.

(existing comment)
| * Okay, it's finally safe to rename the buffer.

We don't "rename" the buffer here. And the safety is already
established at the end of the oldPartitionLock section. So it could
be just something like "Now allocate the victim buffer for the new
tag"?

Changed to "Now it is safe to use victim buffer for new tag."

There is also a tiny code change at block-reuse finalization: instead
of a LockBufHdr+UnlockBufHdr pair I use a single atomic_fetch_or
protected with WaitBufHdrUnlocked. I've tried to explain its safety.
Please check it.

Benchmarks:
- base point is 6ce16088bfed97f9.
- notebook with i7-1165G7 and server with Xeon 8354H (1&2 sockets)
- pgbench select only scale 100 (1.5GB on disk)
- two shared_buffers values: 128MB and 1GB.
- enabled hugepages
- second best result from five runs

Notebook:
  conns |     master |   patch_v3 |  master 1G | patch_v3 1G
--------+------------+------------+------------+------------
      1 |      29508 |      29481 |      31774 |       32305
      2 |      57139 |      56694 |      63393 |       62968
      3 |      89759 |      90861 |     101873 |      102399
      5 |     133491 |     134573 |     145451 |      145009
      7 |     156739 |     155832 |     164633 |      164562
     17 |     216863 |     216379 |     251923 |      251017
     27 |     209532 |     209802 |     244202 |      243709
     53 |     212615 |     213552 |     248107 |      250317
     83 |     214446 |     218230 |     252414 |      252337
    107 |     211276 |     217109 |     252762 |      250328
    139 |     208070 |     214265 |     248350 |      249684
    163 |     206764 |     214594 |     247369 |      250323
    191 |     205478 |     213511 |     244597 |      246877
    211 |     200976 |     212976 |     244035 |      245032
    239 |     196588 |     211519 |     243897 |      245055
    271 |     195813 |     209631 |     237457 |      242771
    307 |     192724 |     208074 |     237658 |      241759
    353 |     187847 |     207189 |     234548 |      239008
    397 |     186942 |     205317 |     230465 |      238782

I don't know why the numbers changed since my first message ))
But there is still no slowdown, and a measurable gain at 128MB shared
buffers.

Xeon 1 socket

  conns |     master |   patch_v3 |  master 1G | patch_v3 1G
--------+------------+------------+------------+------------
      1 |      41975 |      41799 |      52898 |       52715
      2 |      77693 |      77531 |      97571 |       98547
      3 |     114713 |     114533 |     142709 |      143579
      5 |     188898 |     187241 |     239322 |      236682
      7 |     261516 |     260249 |     329119 |      328562
     17 |     521821 |     518981 |     672390 |      662987
     27 |     555487 |     557019 |     674630 |      675703
     53 |     868213 |     897097 |    1190734 |     1202575
     83 |     868232 |     881705 |    1164997 |     1157764
    107 |     850477 |     855169 |    1140597 |     1128027
    139 |     816311 |     826756 |    1101471 |     1096197
    163 |     794788 |     805946 |    1078445 |     1071535
    191 |     765934 |     783209 |    1059497 |     1039936
    211 |     738656 |     786171 |    1083356 |     1049025
    239 |     713124 |     837040 |    1104629 |     1125969
    271 |     692138 |     847741 |    1094432 |     1131968
    307 |     682919 |     847939 |    1086306 |     1124649
    353 |     679449 |     844596 |    1071482 |     1125980
    397 |     676217 |     833009 |    1058937 |     1113496

There is a small slowdown at some connection counts (17, 107-191).
It is reproducible. Probably it is due to one more atomic write, or
perhaps to other scheduling issues (processes block less on the buffer
manager but compete more for other resources). I could not reliably
determine why, because the change is too small, and `perf record`
harms performance too much at this point.

This is the reason I've changed finalization to atomic_or instead
of a Lock+Unlock pair. The change helped a bit, but didn't remove
the slowdown completely.

Xeon 2 socket

  conns |         m0 |   patch_v3 |      m0 1G | patch_v3 1G
--------+------------+------------+------------+------------
      1 |      44317 |      43747 |      53920 |       53759
      2 |      81193 |      79976 |      99138 |       99213
      3 |     120755 |     114481 |     148102 |      146494
      5 |     190007 |     187384 |     232078 |      229627
      7 |     258602 |     256657 |     325545 |      322417
     17 |     551814 |     549041 |     692312 |      688204
     27 |     787353 |     787916 |    1023509 |     1020995
     53 |     973880 |     996019 |    1228274 |     1246128
     83 |    1108442 |    1258589 |    1596292 |     1662586
    107 |    1072188 |    1317229 |    1542401 |     1684603
    139 |    1000446 |    1272759 |    1490757 |     1672507
    163 |     967378 |    1224513 |    1461468 |     1660012
    191 |     926010 |    1178067 |    1435317 |     1645886
    211 |     909919 |    1148862 |    1417437 |     1629487
    239 |     895944 |    1108579 |    1393530 |     1616824
    271 |     880545 |    1078280 |    1374878 |     1608412
    307 |     865560 |    1056988 |    1355164 |     1601066
    353 |     857591 |    1033980 |    1330069 |     1586769
    397 |     840374 |    1016690 |    1312257 |     1573376

regards,
Yura Sokolov
Postgres Professional
y.sokolov@postgrespro.ru
funny.falcon@gmail.com

Attachments:

v3-0001-PGPRO-5616-bufmgr-do-not-acquire-two-partition-lo.patch (text/x-patch, +105 −102)
v3-2socket.gif (image/gif)
v3-1socket.gif (image/gif)
v3-notebook.gif (image/gif)
res.zip (application/zip, +1 −0)
#13Simon Riggs
simon@2ndQuadrant.com
In reply to: Yura Sokolov (#12)
Re: BufferAlloc: don't take two simultaneous locks

On Mon, 21 Feb 2022 at 08:06, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

Good day, Kyotaro Horiguchi and hackers.

On Thu, 17/02/2022 at 14:16 +0900, Kyotaro Horiguchi wrote:

At Wed, 16 Feb 2022 10:40:56 +0300, Yura Sokolov <y.sokolov@postgrespro.ru> wrote in

Hello, all.

I thought about patch simplification, and tested version
without BufTable and dynahash api change at all.

It performs surprisingly well. It is just a bit worse
than v1 since there is more contention around dynahash's
freelist, but most of the improvement remains.

I'll finish benchmarking and will attach graphs with
next message. Patch is attached here.

Thanks for the new patch. The patch as a whole looks fine to me. But
some comments need to be revised.

Thank you for review and remarks.

v3 gets the buffer partition locking right, well done, great results!

In v3, the comment at line 1279 still implies we take both locks
together, which is not now the case.

Dynahash actions are still possible. You now have the BufTableDelete
before the BufTableInsert, which opens up the possibility I discussed
here:
/messages/by-id/CANbhV-F0H-8oB_A+m=55hP0e0QRL=RdDDQuSXMTFt6JPrdX+pQ@mail.gmail.com
(Apologies for raising a similar topic, I hadn't noticed this thread
before; thanks to Horiguchi-san for pointing this out).

v1 had a horrible API (sorry!) where you returned the entry and then
explicitly re-used it. I think we *should* make changes to dynahash,
but not with the API you proposed.

Proposal for new BufTable API
BufTableReuse() - similar to BufTableDelete() but does NOT put entry
back on freelist, we remember it in a private single item cache in
dynahash
BufTableAssign() - similar to BufTableInsert() but can only be
executed directly after BufTableReuse(), fails with ERROR otherwise.
Takes the entry from single item cache and re-assigns it to new tag

In dynahash we have two new modes that match the above
HASH_REUSE - used by BufTableReuse(), similar to HASH_REMOVE, but
places entry on the single item cache, avoiding freelist
HASH_ASSIGN - used by BufTableAssign(), similar to HASH_ENTER, but
uses the entry from the single item cache, rather than asking freelist
This last call can fail if someone else already inserted the tag, in
which case it adds the single item cache entry back onto freelist

Notice that single item cache is not in shared memory, so on abort we
should give it back, so we probably need an extra API call for that
also to avoid leaking an entry.

Doing it this way allows us to
* avoid touching freelists altogether in the common path - we know we
are about to reassign the entry, so we do remember it - no contention
from other backends, no borrowing etc..
* avoid sharing the private details outside of the dynahash module
* allows us to use the same technique elsewhere that we have
partitioned hash tables

This approach is cleaner than v1, but should also perform better
because there will be a 1:1 relationship between a buffer and its
dynahash entry, most of the time.

With these changes, I think we will be able to *reduce* the number of
freelists for partitioned dynahash from 32 to maybe 8, as originally
speculated by Robert in 2016:
/messages/by-id/CA+TgmoZkg-04rcNRURt=jAG0Cs5oPyB-qKxH4wqX09e-oXy-nw@mail.gmail.com
since the freelists will be much less contended with the above approach

It would be useful to see performance with a higher number of connections, >400.

--
Simon Riggs http://www.EnterpriseDB.com/

#14Andres Freund
andres@anarazel.de
In reply to: Yura Sokolov (#12)
Re: BufferAlloc: don't take two simultaneous locks

Hi,

On 2022-02-21 11:06:49 +0300, Yura Sokolov wrote:

From 04b07d0627ec65ba3327dc8338d59dbd15c405d8 Mon Sep 17 00:00:00 2001
From: Yura Sokolov <y.sokolov@postgrespro.ru>
Date: Mon, 21 Feb 2022 08:49:03 +0300
Subject: [PATCH v3] [PGPRO-5616] bufmgr: do not acquire two partition locks.

Acquiring two partition locks leads to complex dependency chain that hurts
at high concurrency level.

There is no need to hold both lock simultaneously. Buffer is pinned so
other processes could not select it for eviction. If tag is cleared and
buffer removed from old partition other processes will not find it.
Therefore it is safe to release old partition lock before acquiring
new partition lock.

Yes, the current design is pretty nonsensical. It leads to really absurd stuff
like holding the relation extension lock while we write out old buffer
contents etc.

+	 * We have pinned buffer and we are single pinner at the moment so there
+	 * is no other pinners.

Seems redundant.

We hold buffer header lock and exclusive partition
+	 * lock if tag is valid. Given these statements it is safe to clear tag
+	 * since no other process can inspect it to the moment.
+	 */

Could we share code with InvalidateBuffer here? It's not quite the same code,
but nearly the same.

+	 * The usage_count starts out at 1 so that the buffer can survive one
+	 * clock-sweep pass.
+	 *
+	 * We use direct atomic OR instead of Lock+Unlock since no other backend
+	 * could be interested in the buffer. But StrategyGetBuffer,
+	 * Flush*Buffers, Drop*Buffers are scanning all buffers and locks them to
+	 * compare tag, and UnlockBufHdr does raw write to state. So we have to
+	 * spin if we found buffer locked.

So basically the first half of the paragraph is wrong, because no, we
can't?

+	 * Note that we write tag unlocked. It is also safe since there is always
+	 * check for BM_VALID when tag is compared.
*/
buf->tag = newTag;
-	buf_state &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED |
-				   BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT |
-				   BUF_USAGECOUNT_MASK);
if (relpersistence == RELPERSISTENCE_PERMANENT || forkNum == INIT_FORKNUM)
-		buf_state |= BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
+		new_bits = BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
else
-		buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;
-
-	UnlockBufHdr(buf, buf_state);
+		new_bits = BM_TAG_VALID | BUF_USAGECOUNT_ONE;
-	if (oldPartitionLock != NULL)
+	buf_state = pg_atomic_fetch_or_u32(&buf->state, new_bits);
+	while (unlikely(buf_state & BM_LOCKED))

I don't think it's safe to atomically OR in arbitrary bits. If somebody
else has locked the buffer header at this moment, it'll lead to
completely bogus results, because unlocking overwrites concurrently
written contents (of which there shouldn't be any, but here there are)...

And OR'ing contents in also doesn't make sense because it doesn't work
to actually unset any contents?

Why don't you just use LockBufHdr/UnlockBufHdr?

Greetings,

Andres Freund

#15Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Andres Freund (#14)
Re: BufferAlloc: don't take two simultaneous locks

At Fri, 25 Feb 2022 00:04:55 -0800, Andres Freund <andres@anarazel.de> wrote in

Why don't you just use LockBufHdr/UnlockBufHdr?

FWIW, v2 looked fine to me in regard to this point.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#16Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Simon Riggs (#13)
Re: BufferAlloc: don't take two simultaneous locks

Hello, Simon.

On Fri, 25/02/2022 at 04:35 +0000, Simon Riggs wrote:

On Mon, 21 Feb 2022 at 08:06, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

Good day, Kyotaro Horiguchi and hackers.

On Thu, 17/02/2022 at 14:16 +0900, Kyotaro Horiguchi wrote:

At Wed, 16 Feb 2022 10:40:56 +0300, Yura Sokolov <y.sokolov@postgrespro.ru> wrote in

Hello, all.

I thought about patch simplification, and tested version
without BufTable and dynahash api change at all.

It performs surprisingly well. It is just a bit worse
than v1 since there is more contention around dynahash's
freelist, but most of the improvement remains.

I'll finish benchmarking and will attach graphs with
next message. Patch is attached here.

Thanks for the new patch. The patch as a whole looks fine to me. But
some comments need to be revised.

Thank you for review and remarks.

v3 gets the buffer partition locking right, well done, great results!

In v3, the comment at line 1279 still implies we take both locks
together, which is not now the case.

Dynahash actions are still possible. You now have the BufTableDelete
before the BufTableInsert, which opens up the possibility I discussed
here:
/messages/by-id/CANbhV-F0H-8oB_A+m=55hP0e0QRL=RdDDQuSXMTFt6JPrdX+pQ@mail.gmail.com
(Apologies for raising a similar topic, I hadn't noticed this thread
before; thanks to Horiguchi-san for pointing this out).

v1 had a horrible API (sorry!) where you returned the entry and then
explicitly re-used it. I think we *should* make changes to dynahash,
but not with the API you proposed.

Proposal for new BufTable API
BufTableReuse() - similar to BufTableDelete() but does NOT put entry
back on freelist, we remember it in a private single item cache in
dynahash
BufTableAssign() - similar to BufTableInsert() but can only be
executed directly after BufTableReuse(), fails with ERROR otherwise.
Takes the entry from single item cache and re-assigns it to new tag

In dynahash we have two new modes that match the above
HASH_REUSE - used by BufTableReuse(), similar to HASH_REMOVE, but
places entry on the single item cache, avoiding freelist
HASH_ASSIGN - used by BufTableAssign(), similar to HASH_ENTER, but
uses the entry from the single item cache, rather than asking freelist
This last call can fail if someone else already inserted the tag, in
which case it adds the single item cache entry back onto freelist

Notice that single item cache is not in shared memory, so on abort we
should give it back, so we probably need an extra API call for that
also to avoid leaking an entry.

Why is there a need for this? How could a backend be forced to abort
between BufTableReuse and BufTableAssign on this code path? I don't
see any CHECK_FOR_INTERRUPTS on the way, but maybe I'm missing
something.

Doing it this way allows us to
* avoid touching freelists altogether in the common path - we know we
are about to reassign the entry, so we do remember it - no contention
from other backends, no borrowing etc..
* avoid sharing the private details outside of the dynahash module
* allows us to use the same technique elsewhere that we have
partitioned hash tables

This approach is cleaner than v1, but should also perform better
because there will be a 1:1 relationship between a buffer and its
dynahash entry, most of the time.

Thank you for suggestion. Yes, it is much clearer than my initial proposal.

Should I incorporate it to v4 patch? Perhaps, it could be a separate
commit in new version.

With these changes, I think we will be able to *reduce* the number of
freelists for partitioned dynahash from 32 to maybe 8, as originally
speculated by Robert in 2016:
/messages/by-id/CA+TgmoZkg-04rcNRURt=jAG0Cs5oPyB-qKxH4wqX09e-oXy-nw@mail.gmail.com
since the freelists will be much less contended with the above approach

It would be useful to see performance with a higher number of connections, >400.

--
Simon Riggs http://www.EnterpriseDB.com/

------

regards,
Yura Sokolov

#17Simon Riggs
simon@2ndQuadrant.com
In reply to: Yura Sokolov (#16)
Re: BufferAlloc: don't take two simultaneous locks

On Fri, 25 Feb 2022 at 09:24, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

This approach is cleaner than v1, but should also perform better
because there will be a 1:1 relationship between a buffer and its
dynahash entry, most of the time.

Thank you for suggestion. Yes, it is much clearer than my initial proposal.

Should I incorporate it to v4 patch? Perhaps, it could be a separate
commit in new version.

I don't insist that you do that, but since the API changes are a few
hours' work, ISTM better to include them in one patch for combined perf
testing. It would be better to put all changes in this area into PG15
than to split them across multiple releases.

Why is there a need for this? How could a backend be forced to abort
between BufTableReuse and BufTableAssign on this code path? I don't
see any CHECK_FOR_INTERRUPTS on the way, but maybe I'm missing
something.

Sounds reasonable.

--
Simon Riggs http://www.EnterpriseDB.com/

#18Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Andres Freund (#14)
Re: BufferAlloc: don't take two simultaneous locks

Hello, Andres

On Fri, 25/02/2022 at 00:04 -0800, Andres Freund wrote:

Hi,

On 2022-02-21 11:06:49 +0300, Yura Sokolov wrote:

From 04b07d0627ec65ba3327dc8338d59dbd15c405d8 Mon Sep 17 00:00:00 2001
From: Yura Sokolov <y.sokolov@postgrespro.ru>
Date: Mon, 21 Feb 2022 08:49:03 +0300
Subject: [PATCH v3] [PGPRO-5616] bufmgr: do not acquire two partition locks.

Acquiring two partition locks leads to complex dependency chain that hurts
at high concurrency level.

There is no need to hold both lock simultaneously. Buffer is pinned so
other processes could not select it for eviction. If tag is cleared and
buffer removed from old partition other processes will not find it.
Therefore it is safe to release old partition lock before acquiring
new partition lock.

Yes, the current design is pretty nonsensical. It leads to really absurd stuff
like holding the relation extension lock while we write out old buffer
contents etc.

+	 * We have pinned buffer and we are single pinner at the moment so there
+	 * is no other pinners.

Seems redundant.

We hold buffer header lock and exclusive partition
+	 * lock if tag is valid. Given these statements it is safe to clear tag
+	 * since no other process can inspect it to the moment.
+	 */

Could we share code with InvalidateBuffer here? It's not quite the same code,
but nearly the same.

+	 * The usage_count starts out at 1 so that the buffer can survive one
+	 * clock-sweep pass.
+	 *
+	 * We use direct atomic OR instead of Lock+Unlock since no other backend
+	 * could be interested in the buffer. But StrategyGetBuffer,
+	 * Flush*Buffers, Drop*Buffers are scanning all buffers and locks them to
+	 * compare tag, and UnlockBufHdr does raw write to state. So we have to
+	 * spin if we found buffer locked.

So basically the first half of the paragraph is wrong, because no, we
can't?

Logically, there are no backends that could be interested in the buffer.
Physically they do LockBufHdr/UnlockBufHdr just to check that they are not
interested.

+	 * Note that we write tag unlocked. It is also safe since there is always
+	 * check for BM_VALID when tag is compared.
*/
buf->tag = newTag;
-	buf_state &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED |
-				   BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT |
-				   BUF_USAGECOUNT_MASK);
if (relpersistence == RELPERSISTENCE_PERMANENT || forkNum == INIT_FORKNUM)
-		buf_state |= BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
+		new_bits = BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
else
-		buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;
-
-	UnlockBufHdr(buf, buf_state);
+		new_bits = BM_TAG_VALID | BUF_USAGECOUNT_ONE;
-	if (oldPartitionLock != NULL)
+	buf_state = pg_atomic_fetch_or_u32(&buf->state, new_bits);
+	while (unlikely(buf_state & BM_LOCKED))

I don't think it's safe to atomically OR in arbitrary bits. If somebody
else has locked the buffer header at this moment, it'll lead to
completely bogus results, because unlocking overwrites concurrently
written contents (of which there shouldn't be any, but here there are)...

That is why there is a safety loop for the case where buf->state was
locked just after the first optimistic atomic_fetch_or. 99.999% of the
time this loop will have no work to do. But in case another backend did
lock buf->state, the loop waits until it releases the lock and retries
the atomic_fetch_or.

And OR'ing contents in also doesn't make sense because it doesn't work
to actually unset any contents?

Sorry, I didn't understand that sentence :((

Why don't you just use LockBufHdr/UnlockBufHdr?

This pair makes two atomic writes to memory. Two writes are heavier than
one write in this version (if the optimistic case succeeds).

But I thought to use Lock+UnlockBufHdr instead of the safety loop:

buf_state = pg_atomic_fetch_or_u32(&buf->state, new_bits);
if (unlikely(buf_state & BM_LOCKED))
{
buf_state = LockBufHdr(buf);
UnlockBufHdr(buf, buf_state | new_bits);
}

I agree the code is cleaner this way. Will do in the next version.

-----

regards,
Yura Sokolov

#19Andres Freund
andres@anarazel.de
In reply to: Yura Sokolov (#18)
Re: BufferAlloc: don't take two simultaneous locks

Hi,

On 2022-02-25 12:51:22 +0300, Yura Sokolov wrote:

+	 * The usage_count starts out at 1 so that the buffer can survive one
+	 * clock-sweep pass.
+	 *
+	 * We use direct atomic OR instead of Lock+Unlock since no other backend
+	 * could be interested in the buffer. But StrategyGetBuffer,
+	 * Flush*Buffers, Drop*Buffers are scanning all buffers and locks them to
+	 * compare tag, and UnlockBufHdr does raw write to state. So we have to
+	 * spin if we found buffer locked.

So basically the first half of the paragraph is wrong, because no, we
can't?

Logically, there are no backends that could be interested in the buffer.
Physically they do LockBufHdr/UnlockBufHdr just to check that they are not
interested.

Yea, but that's still being interested in the buffer...

+	 * Note that we write tag unlocked. It is also safe since there is always
+	 * check for BM_VALID when tag is compared.
*/
buf->tag = newTag;
-	buf_state &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED |
-				   BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT |
-				   BUF_USAGECOUNT_MASK);
if (relpersistence == RELPERSISTENCE_PERMANENT || forkNum == INIT_FORKNUM)
-		buf_state |= BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
+		new_bits = BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
else
-		buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;
-
-	UnlockBufHdr(buf, buf_state);
+		new_bits = BM_TAG_VALID | BUF_USAGECOUNT_ONE;
-	if (oldPartitionLock != NULL)
+	buf_state = pg_atomic_fetch_or_u32(&buf->state, new_bits);
+	while (unlikely(buf_state & BM_LOCKED))

I don't think it's safe to atomically OR in arbitrary bits. If somebody
else has locked the buffer header at this moment, it'll lead to
completely bogus results, because unlocking overwrites concurrently
written contents (of which there shouldn't be any, but here there are)...

That is why there is a safety loop for the case where buf->state was
locked just after the first optimistic atomic_fetch_or. 99.999% of the
time this loop will have no work to do. But in case another backend did
lock buf->state, the loop waits until it releases the lock and retries
the atomic_fetch_or.

And OR'ing contents in also doesn't make sense because it doesn't work
to actually unset any contents?

Sorry, I didn't understand sentence :((

You're OR'ing multiple bits into buf->state. LockBufHdr() only ORs in
BM_LOCKED. ORing BM_LOCKED is fine:
Either the buffer is not already locked, in which case it just sets the
BM_LOCKED bit, acquiring the lock. Or it doesn't change anything, because
BM_LOCKED already was set.

But OR'ing in multiple bits is *not* fine, because it'll actually change the
contents of ->state while the buffer header is locked.

Why don't you just use LockBufHdr/UnlockBufHdr?

This pair makes two atomic writes to memory. Two writes are heavier than
one write in this version (if the optimistic case succeeds).

UnlockBufHdr doesn't use a locked atomic op. It uses a write barrier and an
unlocked write.

Greetings,

Andres Freund

#20Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Andres Freund (#19)
Re: BufferAlloc: don't take two simultaneous locks

On Fri, 25/02/2022 at 09:01 -0800, Andres Freund wrote:

Hi,

On 2022-02-25 12:51:22 +0300, Yura Sokolov wrote:

+	 * The usage_count starts out at 1 so that the buffer can survive one
+	 * clock-sweep pass.
+	 *
+	 * We use direct atomic OR instead of Lock+Unlock since no other backend
+	 * could be interested in the buffer. But StrategyGetBuffer,
+	 * Flush*Buffers, Drop*Buffers are scanning all buffers and locks them to
+	 * compare tag, and UnlockBufHdr does raw write to state. So we have to
+	 * spin if we found buffer locked.

So basically the first half of the paragraph is wrong, because no, we
can't?

Logically, there are no backends that could be interested in the buffer.
Physically they do LockBufHdr/UnlockBufHdr just to check that they are not
interested.

Yea, but that's still being interested in the buffer...

+	 * Note that we write tag unlocked. It is also safe since there is always
+	 * check for BM_VALID when tag is compared.
*/
buf->tag = newTag;
-	buf_state &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED |
-				   BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT |
-				   BUF_USAGECOUNT_MASK);
if (relpersistence == RELPERSISTENCE_PERMANENT || forkNum == INIT_FORKNUM)
-		buf_state |= BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
+		new_bits = BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
else
-		buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;
-
-	UnlockBufHdr(buf, buf_state);
+		new_bits = BM_TAG_VALID | BUF_USAGECOUNT_ONE;
-	if (oldPartitionLock != NULL)
+	buf_state = pg_atomic_fetch_or_u32(&buf->state, new_bits);
+	while (unlikely(buf_state & BM_LOCKED))

I don't think it's safe to atomically OR in arbitrary bits. If somebody
else has locked the buffer header at this moment, it'll lead to
completely bogus results, because unlocking overwrites concurrently
written contents (of which there shouldn't be any, but here there are)...

That is why there is a safety loop for the case where buf->state was locked
just after the first optimistic atomic_fetch_or. 99.999% of the time this
loop has nothing to do. But in case another backend did lock buf->state, the
loop waits until it releases the lock and retries the atomic_fetch_or.

And or'ing contents in also doesn't make sense because it doesn't work to
actually unset any contents?

Sorry, I didn't understand that sentence :((

You're OR'ing multiple bits into buf->state. LockBufHdr() only ORs in
BM_LOCKED. ORing BM_LOCKED is fine:
Either the buffer is not already locked, in which case it just sets the
BM_LOCKED bit, acquiring the lock. Or it doesn't change anything, because
BM_LOCKED already was set.

But OR'ing in multiple bits is *not* fine, because it'll actually change the
contents of ->state while the buffer header is locked.

First, both states are valid: the one before the atomic_or and the one after.
Second, there are no checks of buffer->state while the buffer header is locked:
all LockBufHdr callers use the result returned by LockBufHdr. (I just checked that.)
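For illustration, the lost-update hazard under discussion can be reproduced in a tiny single-threaded model; the bit names and the plain-store unlock below are illustrative stand-ins for the LockBufHdr/UnlockBufHdr protocol, not PostgreSQL's code:

```c
#include <stdatomic.h>
#include <stdint.h>

#define BM_LOCKED    (1u << 31)  /* illustrative header spinlock bit */
#define BM_TAG_VALID (1u << 30)

/*
 * Model: the lock holder works with a saved copy of the state word and
 * "unlocks" by storing that copy back (lock bit cleared), the way an
 * UnlockBufHdr-style plain write does.
 */
static inline uint32_t
demo_lost_update(_Atomic uint32_t *state)
{
    uint32_t saved = atomic_fetch_or(state, BM_LOCKED);  /* acquire */

    /* OR-ing BM_LOCKED again changes nothing: the bit is already set,
     * which is why single-bit lock acquisition via OR is safe. */
    atomic_fetch_or(state, BM_LOCKED);

    /* A concurrent multi-bit OR, however, mutates the locked word... */
    atomic_fetch_or(state, BM_TAG_VALID);

    /* ...and the holder's unlock overwrites it with the saved copy,
     * silently dropping BM_TAG_VALID. */
    atomic_store(state, saved & ~BM_LOCKED);
    return atomic_load(state);
}
```

This shows both halves of the argument: the overwrite-on-unlock really does discard concurrently OR-ed bits, and the retry loop in the patch exists precisely to re-apply them afterwards.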

Why don't you just use LockBufHdr/UnlockBufHdr?

This pair makes two atomic writes to memory. Two writes are heavier than
the single write in this version (when the optimistic case succeeds).

UnlockBufHdr doesn't use a locked atomic op. It uses a write barrier and an
unlocked write.

A write barrier is not free on any platform.

Well, while I don't see a problem with modifying buffer->state, there is a
problem with modifying buffer->tag: I missed that Drop*Buffers doesn't check
the BM_TAG_VALID flag. Therefore I have to either add this check to those
places, or return to the LockBufHdr+UnlockBufHdr pair.

For patch simplicity I'll return to the LockBufHdr+UnlockBufHdr pair. But it
has a measurable impact at low connection counts on many-socket machines.


Greetings,

Andres Freund
