Addressing buffer private reference count scalability issue
Hi Hackers,
This patch addresses a performance issue pointed out by Andres Freund [1].
1.
Benchmark buffer pinning: benchmark code, a few functions that can be used in
Postgres queries, and a python script that runs them and produces CSV files
and SVG plots for the current build.
2.
Refactoring reference counting: Before starting to change code and
potentially breaking things, I considered it prudent to isolate the
reference counting code to limit the damage. This code was part of an
8k+ LOC file.
3.
Using simplehash: Simply replacing the HTAB with a simplehash, and adding
a new set of macros SH_ENTRY_EMPTY, SH_MAKE_EMPTY, SH_MAKE_IN_USE to allow
using the InvalidBuffer special value instead of allocating extra space for
a validity flag. Here I assume that the buffer number sequence is
independent enough from the array size, so I use the buffer as the hash key
directly, omitting the hash function call.
4.
Compact PrivateRefCountEntry: The original implementation used a 4-byte
key and an 8-byte value. The reference count uses 32 bits, and it is
unreasonable to expect one backend to pin the same buffer 1 billion times.
The lock mode uses 32 bits but can only take 4 values. So I packed them
into a single uint32, giving 30 bits for the count and 2 bits for the lock
mode. This makes each entry 8 bytes long; on 64-bit CPUs this is more than
a 1/3 reduction in memory. It also aligns the array with 64-bit words, so
copying one entry can be completed in one instruction and every entry is
aligned.
5.
REFCOUNT_ARRAY_ENTRIES=0: since the simplehash is basically an array
lookup, it is worth trying to remove the array completely and keep only the
hash. For small values we would be trading a few branches for a
buffer % SIZE; for the prefetch use case, where pins and unpins happen in
FIFO fashion, it saves an 8-entry array scan and some extra data moves.
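The bit-packing described in item 4 can be sketched as a small standalone
example. The macro names follow the patch excerpts quoted later in this
thread; BufferLockMode is simplified here to a plain enum for illustration:

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t int32;
typedef uint32_t uint32;

/* simplified stand-in for the real BufferLockMode enum */
typedef enum { BUFFER_LOCK_NONE, BUFFER_LOCK_SHARE, BUFFER_LOCK_EXCLUSIVE } BufferLockMode;

/* {count[31:2], lock[1:0]} packed into one uint32 */
#define PRIVATEREFCOUNT_LOCKMODE_MASK 0x3
#define ONE_PRIVATE_REFERENCE 4	/* 1 << 2: pinning once adds 4 */

#define PrivateRefCountGetLockMode(d)    ((BufferLockMode)((d) & PRIVATEREFCOUNT_LOCKMODE_MASK))
#define PrivateRefCountSetLockMode(d, m) ((d) = ((d) & ~PRIVATEREFCOUNT_LOCKMODE_MASK) | (m))
#define PrivateRefCountGetRefCount(d)    ((int32)((d) >> 2))
#define PrivateRefCountIsZero(d)         ((d) < ONE_PRIVATE_REFERENCE)
```

Incrementing and decrementing the count is plain addition/subtraction of
ONE_PRIVATE_REFERENCE, so the lock-mode bits ride along untouched.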
In addition to the patch I am including:
- A bash script to apply and benchmark the patches sequentially. You might
have to adjust REPO_ROOT; in my case it is derived relative to the script
path, which lives under $REPO_ROOT/.patches/pins/.
- A compare-patches.py script that can be copied to
src/test/modules/test_buffer_pin to process the benchmark CSVs into figures
showing one metric across different patches, instead of different metrics
for one patch as benchmark.py produces.
- A nicely formatted post about this [2].
Regards,
Alexandre
[1]: /messages/by-id/s5p7iou7pdhxhvmv4rohmskwqmr36dc4rybvwlep5yvwrjs4pa@6oxsemms5mw4
[2]: https://afelipe.hashnode.dev/postgres-backend-buffer-pinning-algorithm
Attachments:
v1-0003-Using-simplehash.patch (+81/-65)
v1-0004-Compact-PrivateRefCountEntry.patch (+70/-98)
v1-0002-Refactoring-reference-counting.patch (+727/-596)
v1-0001-Benchmark-buffer-pinning.patch (+733/-1)
v1-0005-REFCOUNT_ARRAY_ENTRIES-0.patch (+49/-14)
run-all.sh
compare-patches.py
Hi,
On 2026-03-08 16:09:07 +0000, Alexandre Felipe wrote:
2.
Refactoring reference counting: Before starting to change code and
potentially breaking things I considered it prudent to isolate it to limit
the damage. This code was part of an 8k+ LOC file.
Not allowing, at least the fast paths, to be inlined will make the most common
cases measurably slower, I've tried that.
3.
Using simplehash: Simply replacing the HTAB for a simplehash, and adding
a new set of macros SH_ENTRY_EMPTY, SH_MAKE_EMPTY, SH_MAKE_IN_USE.
Yea, we'll need to do that. Peter Geoghegan has a patch for it as well.
Here I assume that the buffer buffer sequence is independent enough from the
array size, so I use the buffer as the hash key directly, omitting a hash
function call.
I doubt that that's good enough. The hash table code relies on the bits being
well mixed, but you won't get that with buffer ids.
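For illustration: PostgreSQL's murmurhash32() (which the v2 patch later
adopts) is essentially the MurmurHash3 32-bit finalizer. A standalone sketch
of it, showing why mixing matters here: each step is invertible (xor-shift
and multiply by an odd constant), so the whole finalizer is a bijection that
keeps distinct buffer ids distinct while spreading their bits, whereas an
identity "hash" leaves ids that differ only in high bits colliding under a
power-of-two bucket mask:

```c
#include <assert.h>
#include <stdint.h>

/* MurmurHash3 32-bit finalizer, matching PostgreSQL's murmurhash32() */
static uint32_t
murmurhash32(uint32_t h)
{
	h ^= h >> 16;
	h *= 0x85ebca6b;
	h ^= h >> 13;
	h *= 0xc2b2ae35;
	h ^= h >> 16;
	return h;
}
```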
4.
Compact PrivateRefCountEntry: The original implementation used a 4-byte
key and 8-byte value. Reference count uses 32 bits, and it is unreasonable
to expect one backend to pin the same buffer 1 billion times. The lock mode
uses 32 bits but can only assume 4 values. So I packed them in one single
uint32, giving 30 bits for count and 2 bits for lock mode. This makes the
entries 8-byte long, on 64-bit CPUs this represents more than a 1/3
reduction in memory. This makes the array aligned with the 64-bit words,
copying one entry can be completed in one instruction, and every entry will
be aligned.
5.
I'm rather sceptical that the overhead of needing to shift and mask is worth
it. I suspect we'll also want to add more state for each entry (e.g. I think
it may be worth getting rid of the io-in-progress resowner).
REFCOUNT_ARRAY_ENTRIES=0: since the simplehash is basically some array
lookup, it is worth trying to remove it completely and keep only the hash.
For small values we would be trading a few branches for a buffer % SIZE,
for the use case of prefetch where pin/unpin in a FIFO fashion, it will
save an 8-entry array lookup, and some extra data moves.
I doubt that that's ok, in the vast majority of access we will have 0-2
buffers pinned. And even when we have pinned more buffers, it's *exceedingly*
common to access the same entry repeatedly (e.g. pin, lock, unlock, unpin),
adding a few cycles to each of those repeated accesses will quickly show up.
From 56bfdd6d7296397a7b3f0b282e239fdc86d6580d Mon Sep 17 00:00:00 2001
From: Alexandre Felipe <o.alexandre.felipe@gmail.com>
Date: Fri, 6 Mar 2026 17:15:44 +0000
Subject: [PATCH 4/5] Compact PrivateRefCountEntry

The previous implementation used an 8-byte (64-bit) entry to store
a uint32 count and a uint32 lock mode. That is fine when we store
the data separate from the key (buffer). But in the simplehash
{key, value} are stored together, so each entry is 12 bytes.
This is somewhat awkward, as we have to either pad the entry to 16 bytes,
or accesses will alternate between aligned and misaligned addresses.

However, we are probably not going to need even 16 bits for the count,
and 2 bits is enough for the lock mode. So we can easily pack these
into a single uint32.
I wouldn't want to rely on a 16bit pin counter anyway.
Incrementing/decrementing the count continues the same, just using
4 instead of 1; lock mode access will require one or two additional
bitwise operations. No bit-shifts are required.
I don't know how that last sentence can be true, given that:
-struct PrivateRefCountEntry
+#define PRIVATEREFCOUNT_LOCKMODE_MASK 0x3
+#define ONE_PRIVATE_REFERENCE 4 /* 1 << 2 */
+
+#define PrivateRefCountGetLockMode(d) ((BufferLockMode)((d) & PRIVATEREFCOUNT_LOCKMODE_MASK))
+#define PrivateRefCountSetLockMode(d, m) ((d) = ((d) & ~PRIVATEREFCOUNT_LOCKMODE_MASK) | (m))
+#define PrivateRefCountGetRefCount(d) ((int32)((d) >> 2))
+#define PrivateRefCountIsZero(d) ((d) < ONE_PRIVATE_REFERENCE)
Involves shifts at least in PrivateRefCountGetRefCount()..
Greetings,
Andres Freund
Hi Hackers,
Please find attached version 2 of this patch and the results of the
ReadBuffer/ReleaseBuffer benchmark.

For those who prefer plain text, here goes an excerpt:
num_buffers  pattern       baseline    patch
1            random          194.76   183.50
             sequential      183.75   168.02
2            random          204.74   187.02
             sequential      195.27   169.58
3            random          222.62   192.57
             sequential      212.81   169.55
7            random          233.96   219.64
             sequential      221.94   171.94
11           random          365.50   247.96
             sequential      318.24   183.96
19           random          390.58   267.96
             sequential      369.51   195.35
100          random          424.03   286.33
             sequential      420.74   280.23
988          random          445.75   301.37
             sequential      453.61   313.30
10000        random          587.13   418.91
             sequential      497.92   394.10
On Sun, Mar 8, 2026 at 5:09 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2026-03-08 16:09:07 +0000, Alexandre Felipe wrote:
2.
Refactoring reference counting: Before starting to change code and
potentially breaking things I considered it prudent to isolate it to limit
the damage. This code was part of an 8k+ LOC file.
Not allowing, at least the fast paths, to be inlined will make the most
common cases measurably slower, I've tried that.
I made it inline this time. I placed it in a separate file; I am including
the .h at the top and the .c.
Here I assume that the buffer number sequence is independent enough from
the array size, so I use the buffer as the hash key directly, omitting a
hash function call.

I doubt that that's good enough. The hash table code relies on the bits
being well mixed, but you won't get that with buffer ids.
Introduced murmurhash32, it is noticeably worse for sequential access
(by buffer index), but has some mixing.
I'm rather sceptical that the overhead of needing to shift and mask is
worth it. I suspect we'll also want to add more state for each entry
(e.g. I think it may be worth getting rid of the io-in-progress resowner).
The io-in-progress resowner name suggests that it runs only for reads...
so maybe it could be taken out of the way of buffer hits?
REFCOUNT_ARRAY_ENTRIES=0: since the simplehash is basically some array
lookup, it is worth trying to remove it completely and keep only the hash.
For small values we would be trading a few branches for a buffer % SIZE,
for the use case of prefetch where pin/unpin in a FIFO fashion, it will
save an 8-entry array lookup, and some extra data moves.
I doubt that that's ok, in the vast majority of access we will have 0-2
buffers pinned. And even when we have pinned more buffers, it's
*exceedingly* common to access the same entry repeatedly (e.g. pin, lock,
unlock, unpin), adding a few cycles to each of those repeated accesses will
quickly show up.
I brought back the array, but I eliminated the linear search.
1. USE_REFCOUNT_CACHE_ENTRY will enable the last entry cache.
2a. the dynamic array case,
REFCOUNT_ARRAY_MAX_SIZE!=REFCOUNT_ARRAY_INITIAL_SIZE,
will grow the array when it reaches a certain level of occupation.
I have set the default occupation level to 86% so that, if enabled, for a
random input it will grow when we have about 2*size pins in total.
If we find a sequential pattern then it will grow without growing the hash
table. For the array lookup I don't use a hash, so for a small number of
pins it will be very fast.
2b. the static case,
REFCOUNT_ARRAY_MAX_SIZE==REFCOUNT_ARRAY_INITIAL_SIZE,
will use a static array, just as we had before, and will not perform the
linear search. It still has to read the size and mask the input.

I tested the 4 variations and the winner is the static array without the
last-entry cache. I increased the array size from 8 to 32, since you
suggested before that this could help. At that size the old code would have
had the tradeoff of a longer linear search, so it may help even more now.
Incrementing/decrementing the count continues the same, just using
4 instead of 1; lock mode access will require one or two additional
bitwise operations. No bit-shifts are required.
I don't know how that last sentence can be true, given that:
-struct PrivateRefCountEntry
+#define PRIVATEREFCOUNT_LOCKMODE_MASK 0x3
+#define ONE_PRIVATE_REFERENCE 4 /* 1 << 2 */
+
+#define PrivateRefCountGetLockMode(d) ((BufferLockMode)((d) & PRIVATEREFCOUNT_LOCKMODE_MASK))
+#define PrivateRefCountSetLockMode(d, m) ((d) = ((d) & ~PRIVATEREFCOUNT_LOCKMODE_MASK) | (m))
+#define PrivateRefCountGetRefCount(d) ((int32)((d) >> 2))
+#define PrivateRefCountIsZero(d) ((d) < ONE_PRIVATE_REFERENCE)
Involves shifts at least in PrivateRefCountGetRefCount()..
Yes, there was some unintended code there.
I am adding SharedBufferHasSingleRef to replace the (...) == 1 checks.
The actual count is used only for debugging, so the shift there shouldn't
be a problem.
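The shift-free single-reference check could look like this. This is a
hypothetical sketch of the SharedBufferHasSingleRef idea, not the patch's
actual body; with the packed {count[31:2], lock[1:0]} layout, masking off
the lock bits and comparing against one reference avoids the shift:

```c
#include <assert.h>
#include <stdint.h>

#define PRIVATEREFCOUNT_LOCKMODE_MASK 0x3u
#define ONE_PRIVATE_REFERENCE 4u	/* 1 << 2 */

/*
 * True iff the packed word encodes count == 1, regardless of lock mode.
 * (d & ~MASK) strips the 2 lock bits, so the comparison needs no shift.
 */
static inline int
SharedBufferHasSingleRef(uint32_t d)
{
	return (d & ~PRIVATEREFCOUNT_LOCKMODE_MASK) == ONE_PRIVATE_REFERENCE;
}
```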
Regards,
Alexandre
Attachments:
compare-read.png
v2-0004-Compact-PrivateRefCountEntry.patch (+76/-113)
v2-0002-Refactoring-reference-counting.patch (+751/-633)
v2-0003-Using-simplehash.patch (+84/-65)
v2-0001-Benchmark-buffer-pinning.patch (+605/-1)
v2-0005-Array-direct-mapping.patch (+129/-175)
Hi,
On 2026-03-11 18:03:34 +0000, Alexandre Felipe wrote:
Hi Hackers,
Please find attached version 2 of this patch and the results of the
ReadBuffer/ReleaseBuffer benchmark.

For those who prefer plain text here goes an excerpt:
num_buffers  pattern       baseline    patch
1            random          194.76   183.50
             sequential      183.75   168.02
2            random          204.74   187.02
             sequential      195.27   169.58
3            random          222.62   192.57
             sequential      212.81   169.55
7            random          233.96   219.64
             sequential      221.94   171.94
11           random          365.50   247.96
             sequential      318.24   183.96
19           random          390.58   267.96
             sequential      369.51   195.35
100          random          424.03   286.33
             sequential      420.74   280.23
988          random          445.75   301.37
             sequential      453.61   313.30
10000        random          587.13   418.91
             sequential      497.92   394.10
Nice numbers!
It'd be good to also evaluate in the context of queries, as such a focused
microbenchmark will have much higher L1/L2 cache hit ratios than workloads
that actually look at data.
Here I assume that the buffer number sequence is independent enough from
the array size, so I use the buffer as the hash key directly, omitting a
hash function call.

I doubt that that's good enough. The hash table code relies on the bits
being well mixed, but you won't get that with buffer ids.

Introduced murmurhash32, it is noticeably worse for sequential access
(by buffer index), but has some mixing.
It's not surprising it's worse, I just don't see how we could get away without
some mixing. It's not at all crazy to have access patterns that just differ
in the higher bits of the buffer number (e.g. a nestloop join where the inner
side always needs a certain number of buffers). We need to be somewhat
resistant to that causing a lot of collisions (which will trigger a hashtable
growth cycle, without that necessarily fixing the issue!).
I'm rather sceptical that the overhead of needing to shift and mask is
worth it. I suspect we'll also want to add more state for each entry
(e.g. I think it may be worth getting rid of the io-in-progress resowner).

The io-in-progress resowner name suggests that it runs only for reads...
so maybe it could be taken out of the way of buffer hits?
It's not used for hits.
But it's used for both reads and writes.
REFCOUNT_ARRAY_ENTRIES=0: since the simplehash is basically some array
lookup, it is worth trying to remove it completely and keep only the hash.
For small values we would be trading a few branches for a buffer % SIZE,
for the use case of prefetch where pin/unpin in a FIFO fashion, it will
save an 8-entry array lookup, and some extra data moves.

I doubt that that's ok, in the vast majority of access we will have 0-2
buffers pinned. And even when we have pinned more buffers, it's
*exceedingly* common to access the same entry repeatedly (e.g. pin, lock,
unlock, unpin), adding a few cycles to each of those repeated accesses will
quickly show up.

I brought back the array, but I eliminated the linear search.
Why? In my benchmarks allowing vectorization helped a decent amount in real
queries, because it does away with all the branch misses.
1. USE_REFCOUNT_CACHE_ENTRY will enable the last entry cache.
2a. the dynamic array case
REFCOUNT_ARRAY_MAX_SIZE!=REFCOUNT_ARRAY_INITIAL_SIZE
will grow the array when it reaches a certain level of occupation.
I have set the default occupation level to 86% so that, if enabled, for a
random input it will grow when we have about 2*size pins in total.
If we find a sequential pattern then it will grow without growing the hash
table.
For the array lookup I don't use a hash, so for a small number of pins
it will be very fast.
I doubt it makes sense to basically have two levels of hash tables.
2b. the static case
REFCOUNT_ARRAY_MAX_SIZE==REFCOUNT_ARRAY_INITIAL_SIZE
will use a static array, just as we had before and will not perform the
linear search. It still has to read the size and mask the input.

I tested the 4 variations and the winner is with the static array without
the cache for the last entry. I increased the array size from 8 to 32,
since you suggested before that this could help. At that point it would
have the tradeoff of a longer linear search, so it may help even more now.
Does your benchmark actually test repeated accesses to the same refcount? As
mentioned, those are very very common (access sequences like Pin, (Lock,
Unlock)+, Unpin are extremely common).
From 5de90fcd14e5d32c3165c3e7a278adaa44f4d9d1 Mon Sep 17 00:00:00 2001
From: Alexandre Felipe <o.alexandre.felipe@gmail.com>
Date: Wed, 11 Mar 2026 12:05:50 +0000
Subject: [PATCH 2/5] Refactoring reference counting

This moves the private reference count logic out of
backend/storage/buffer/buffmgr.c, which is still 8000+
lines long. Preparing for the changes to come.
I don't think the new naming scheme (GetSharedBufferEntry etc) is good, as
that does not *at all* communicate that this is backend private state.
I'd strongly advise separating moving code from a large scale rename. I
certainly won't waste time trying to see the difference between what was just
moved and what was changed at the same time.
diff --git a/.gitignore b/.gitignore
index 4e911395fe..fddb7f861d 100644
--- a/.gitignore
+++ b/.gitignore
@@ -43,3 +43,5 @@ lib*.pc
 /Release/
 /tmp_install/
 /portlock/
+
+.*
\ No newline at end of file
What? Certainly not.
From bf146598cd94110da255c6fb45b4a5c58002a91d Mon Sep 17 00:00:00 2001
From: Alexandre Felipe <o.alexandre.felipe@gmail.com>
Date: Wed, 11 Mar 2026 12:07:54 +0000
Subject: [PATCH 3/5] Using simplehash

This patch replaces the HTAB implementation with a simplehash
as suggested by Andres Freund [1]. The simplehash implements a templated
open addressing hash, which has entry size information at compile time.

The access strategy of the simplehash is very close to the plain array,
which was seen to be very efficient compared to the HTAB implementation,
with the additional advantage of using the key value to initialize the
search closer to where the key actually is, instead of always scanning
the entire array.

Adds one macro to the simplehash implementation to enable overriding the
empty position detection, so that buffer == InvalidBuffer can be used
instead of requiring a separate status field.

[1] /messages/by-id/s5p7iou7pdhxhvmv4rohmskwqmr36dc4rybvwlep5yvwrjs4pa@6oxsemms5mw4
---
src/backend/storage/buffer/bufmgr_refcnt.c | 89 +++++++++++-----------
src/include/lib/simplehash.h | 59 +++++++++-----
2 files changed, 84 insertions(+), 64 deletions(-)
I don't think it's a good idea to introduce new simplehash infrastructure as
part of this larger change. You also haven't documented the new stuff.
From 4b96a723bfc3a6fe78e05bc3ce454b0b160679d1 Mon Sep 17 00:00:00 2001
From: Alexandre Felipe <o.alexandre.felipe@gmail.com>
Date: Wed, 11 Mar 2026 12:08:03 +0000
Subject: [PATCH 4/5] Compact PrivateRefCountEntry

The previous implementation used an 8-byte (64-bit) entry to store
a uint32 count and a uint32 lock mode. That is fine when we store
the data separate from the key (buffer). But in the simplehash
{key, value} are stored together, so each entry is 12 bytes.
This is somewhat awkward, as we have to either pad the entry to 16 bytes,
or accesses will alternate between aligned and misaligned addresses.

Lock can assume only 4 values, and 2^30 is a decent limit for the
number of pins on a single buffer. So this change packs the
{count[31:2], lock[1:0]} into a single uint32.

Incrementing/decrementing the count continues the same, just using
4 instead of 1; lock mode access will require one or two additional
bitwise operations. The exact count requires one shift, and is used
only for debugging. A special function is provided to check whether
count == 1.
Have you actually evaluated the benefit from this? Pretty sceptical it's worth
it.
Greetings,
Andres Freund
On Wed, Mar 11, 2026 at 6:55 PM Andres Freund <andres@anarazel.de> wrote:
Nice numbers!
It'd be good to also evaluate in the context of queries, as such a focused
microbenchmark will have much higher L1/L2 cache hit ratios than workloads
that actually look at data.
I ran with the query you first used to illustrate the issue, and it was
slower. I went back and tried a strict copy of the code and it was still
3% slower. Passing CFLAGS += -falign-functions=64 shows no degradation.
Maybe LTO is what we need here?
It's not surprising it's worse, I just don't see how we could get away
without some mixing.
I will keep that for the future.
It's not at all crazy to have access patterns that just differ in the
higher bits of the buffer number (e.g. a nestloop join where the inner
side always needs a certain number of buffers). We need to be somewhat
resistant to that causing a lot of collisions (which will trigger a
hashtable growth cycle, without that necessarily fixing the issue!).
Yes, I understand that possibility.
I brought back the array, but I eliminated the linear search.
Why? In my benchmarks allowing vectorization helped a decent amount in real
queries, because it does away with all the branch misses.
I basically treated the array as a hash table with a weak hash function and
delegated collisions to simplehash. In the worst case it would do a simple
array[buffer & mask]; it never fails with one branch, and would fail in ~3%
of the cases with two branches. It would also eliminate the branches
necessary to check the 1-entry cache.
But notice that this was the last patch; if you like everything except
that, it is just a matter of picking 02, 03, 04.
1. USE_REFCOUNT_CACHE_ENTRY will enable the last entry cache.
2a. the dynamic array case
REFCOUNT_ARRAY_MAX_SIZE!=REFCOUNT_ARRAY_INITIAL_SIZE
will grow the array when it reaches a certain level of occupation.
I have set the default occupation level to 86% so that, if enabled, for a
random input it will grow when we have about 2*size pins in total.
If we find a sequential pattern then it will grow without growing the hash
table. For the array lookup I don't use a hash, so for a small number of
pins it will be very fast.

I doubt it makes sense to basically have two levels of hash tables.
2b. the static case
REFCOUNT_ARRAY_MAX_SIZE==REFCOUNT_ARRAY_INITIAL_SIZE
will use a static array, just as we had before and will not perform the
linear search. It still has to read the size and mask the input.

I tested the 4 variations and the winner is with the static array without
the cache for the last entry. I increased the array size from 8 to 32,
since you suggested before that this could help. At that point it would
have the tradeoff of a longer linear search, so it may help even more now.
Does your benchmark actually test repeated accesses to the same refcount?
As mentioned, those are very very common (access sequences like Pin,
(Lock, Unlock)+, Unpin are extremely common).
Yes, that is the case of a FIFO with one buffer; I didn't include the
lock/unlock tests here. I did it at some point and it
I don't think the new naming scheme (GetSharedBufferEntry etc) is good, as
that does not *at all* communicate that this is backend private state.
I suspected that; GetEntryForPrivateReferenceCountOfSharedBuffer would be
more accurate, right? I will probably stick with the original names.
I'd strongly advise separating moving code from a large scale rename. I
certainly won't waste time trying to see the difference between what was
just moved and what was changed at the same time.
Would you prefer not moving the code at all? One of the main reasons
for this was the changes in data structure in patch 04, which I will not
include in the next version.
diff --git a/.gitignore b/.gitignore
index 4e911395fe..fddb7f861d 100644
--- a/.gitignore
+++ b/.gitignore
@@ -43,3 +43,5 @@ lib*.pc
 /Release/
 /tmp_install/
 /portlock/
+
+.*
\ No newline at end of file

What? Certainly not.
Do you mean we should certainly not exclude hidden files from git?
I usually build with prefix postgresql/.build/patch-*/, so whenever I
check out something I have to keep adding this again.
I don't think it's a good idea to introduce new simplehash infrastructure
as part of this larger change.
Do you think it is worth doing that as a separate patch? Then we get it
out of the way of this one, which will probably go through a few more
versions.
You also haven't documented the new stuff.
Do you mean as source comments, or is there separate documentation
for this?
The previous implementation used an 8-byte (64-bit) entry to store
a uint32 count and a uint32 lock mode. That is fine when we store
the data separate from the key (buffer). But in the simplehash
{key, value} are stored together, so each entry is 12 bytes.
This is somewhat awkward, as we have to either pad the entry to 16 bytes,
or accesses will alternate between aligned and misaligned addresses.

Lock can assume only 4 values, and 2^30 is a decent limit for the
number of pins on a single buffer. So this change packs the
{count[31:2], lock[1:0]} into a single uint32.

Incrementing/decrementing the count continues the same, just using
4 instead of 1; lock mode access will require one or two additional
bitwise operations. The exact count requires one shift, and is used
only for debugging. A special function is provided to check whether
count == 1.

Have you actually evaluated the benefit from this? Pretty sceptical
it's worth it.
I tested, and I agree: not worth it from a speed perspective.
At this point the only part left is the introduction of the simplehash.
However, what I will try is to store just the buffer number in the hash
and keep a separate array for the entries; who knows, that may work
better.
Hi Andres,
I don't think it's a good idea to introduce new simplehash infrastructure
as part of this larger change.
I am submitting the change I did before on simplehash for empty entry
detection.
You also haven't documented the new stuff.
What and where was I supposed to document?
01 is the change I did before; 02 applies it to the refcount in bufmgr.
Exploring the code base a bit more, 03 removes status from nodeMemoize
and 04 removes status from pg_rewind/filemap.c.
05 was a bit trickier because InvalidBlockNumber is not 0, so I had to
make entries empty after allocation, which uses memset 0 by default.
I grepped for a list and there are 21 files in total, so I will stop here.
In my benchmarks allowing vectorization helped a decent amount in real
queries, because it does away with all the branch misses.
Interesting, did you compile with the default configuration?
I used gcc 11 (a bit old, I know) and yes, it removes some branches
but still uses a loop with cmov (conditional move); in some cases it
unrolls, but it still uses a cmov for each entry (the machine I tested on
is quite feature rich, e.g. avx512cd avx512bw avx512vl avx512_vnni).
So I am wondering whether the impact I see is the same impact as you see.
If we go for vectorisation we could do a vectorized loop,
e.g. 16 iterations on 16 x 32-bit vectors with early exit,
but that would inevitably make the few-entries case slightly slower.
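The branch-free scan being discussed can be sketched like this; it is an
illustrative example, not code from any of the patches. With a fixed trip
count and no early exit there are no data-dependent branches to mispredict,
and the compiler is free to turn the loop into conditional moves or SIMD
compares:

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t Buffer;
#define REFCOUNT_ARRAY_ENTRIES 8

/*
 * Fixed-trip-count search with no early exit: every iteration executes,
 * so the only control flow is the countable loop itself, which the
 * compiler can vectorize (e.g. 8 x 32-bit compares in one AVX2 op).
 */
static int
refcount_array_search(const Buffer *keys, Buffer buffer)
{
	int		match = -1;

	for (int i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++)
		match = (keys[i] == buffer) ? i : match;	/* cmov, not a branch */

	return match;				/* index of the match, or -1 */
}
```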
Do you have any other localised issue like this that could be worth looking
into?
Regards,
Alexandre