RFC: Packing the buffer lookup table
Hi,
Some time ago I noticed that every buffer table entry is quite large at 40
bytes (+8): 16 bytes of HASHELEMENT header (of which the last 4 bytes are
padding), 20 bytes of BufferTag, and 4 bytes for the offset into the shared
buffers array, with generally 8 more bytes used for the bucket pointers.
(32-bit systems: 32 (+4) bytes)
Does anyone know why we must have the buffer tag in the buffer table?
It seems to me we can follow the offset pointer into the shared BufferDesc
array whenever we find out we need to compare the tags (as opposed to just
the hash, which is stored and present in HASHELEMENT). If we decide to just
follow the pointer, we can immediately shave 16 bytes (40%) off the lookup
table's per-element size, or 24 if we pack the 4-byte shared buffer offset
into the unused bytes in HASHELEMENT, reducing the memory usage of that
hash table by ~50%: we'd have 16 bytes for every element plus its
shared-buffer offset, plus 8 bytes for every bucket pointer (of which
there are approximately as many as there are elements), resulting in
24 bytes per allocated element.
(This was also discussed on Discord, in the Hackers Mentoring server [0].)
Together that results in the following prototype patchset. 0001 adds the
ability for dynahash users to opt in to using the 4-byte alignment hole in
HASHELEMENT (by providing size- and alignment info that dynahash uses to
partially move the entry into the alignment hole), 0002 uses that feature
to get the per-element size of the buffer lookup hash table down to 16
bytes (+8B for bucket pointers), or 12 (+4) on 32-bit systems.
An alternative approach to current patch 0001 (which introduces an
"element data offset" to determine where to start looking for the key)
would be to add an option that allows "0-length" keys/entries when there
is alignment space, and to make the hash/compare functions handle the
writing and reading of key data (thus removing the new data dependencies
in the hash lookup function). I'm not sure that's a winning idea, though,
as it requires the user of the API to have knowledge about the internals
of dynahash, rather than dynahash internally optimizing usage based on a
clearer picture of what the hash entry needs.
Does anyone have an idea on how to best benchmark this kind of patch, apart
from "running pgbench"? Other ideas on how to improve this? Specific
concerns?
Kind regards,
Matthias van de Meent
[0]: https://discord.com/channels/1258108670710124574/1318997914777026580
Attachments:
v0-0002-Buftable-Reduce-size-of-buffer-table-entries-by-6.patch
v0-0001-Dynahash-Allow-improved-packing-of-hash-elements.patch
On Wed, Jan 29, 2025 at 11:49 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> Hi,
> Some time ago I noticed that every buffer table entry is quite large at 40 bytes (+8): 16 bytes of HASHELEMENT header (of which the last 4 bytes are padding), 20 bytes of BufferTag, and 4 bytes for the offset into the shared buffers array, with generally 8 more bytes used for the bucket pointers. (32-bit systems: 32 (+4) bytes)
> Does anyone know why we must have the buffer tag in the buffer table?
> It seems to me we can follow the offset pointer into the shared BufferDesc array whenever we find out we need to compare the tags (as opposed to just the hash, which is stored and present in HASHELEMENT). If we decide to just follow the pointer, we can immediately shave 16 bytes (40%) off the lookup table's per-element size, or 24 if we pack the 4-byte shared buffer offset into the unused bytes in HASHELEMENT, reducing the memory usage of that hash table by ~50%: ...
So every buffer table entry is 40 bytes, and every buffer itself is 8
KB. Also, every BufferDesc is padded to 64 bytes (on 64-bit systems).
So buffer table entry is < 0.5% of total buffer space.
I assume this means that the benefit of your patch doesn't come from
memory savings, but maybe from spatial locality? As in, saving 0.5% of
memory doesn't seem like a huge win by itself, but maybe the hash table
will see fewer cache misses, since the entries are smaller, so they'll
be packed closer together. Now you can fit 2.7 hash entries on a cache
line, instead of 1.6.
Except, to check the buffer tag itself, now you have to dereference
the offset pointer into the shared BufferDesc array. And every
BufferDesc is padded to 64 bytes (on 64-bit systems), a cache line.
So, now, instead of having the buffer tag adjacent to the hash entry,
you have a memory stall.
If the buffer tags match (i.e., no collision), this is still a
possible win, because you'd have to load the BufferDesc anyway (since
this is the buffer you're looking for).
> Does anyone have an idea on how to best benchmark this kind of patch, apart from "running pgbench"? Other ideas on how to improve this? Specific concerns?
What's the expected advantage of reducing the buffer table's entry
size? Better CPU cache usage when there's no collision, but worse in
the less-likely case where there is a collision?
What I mean is, regarding benchmarks: what's the best case scenario
for this kind of patch, and what sort of performance difference would
you expect to see?
Thanks,
James
Zhang Mingli
www.hashdata.xyz
On Jan 30, 2025 at 15:49 +0800, Matthias van de Meent <boekewurm+postgres@gmail.com>, wrote:
> Some time ago I noticed that every buffer table entry is quite large at 40 bytes (+8): 16 bytes of HASHELEMENT header (of which the last 4 bytes are padding), 20 bytes of BufferTag, and 4 bytes for the offset into the shared buffers array, with generally 8 more bytes used for the bucket pointers. (32-bit systems: 32 (+4) bytes)
> Does anyone know why we must have the buffer tag in the buffer table?
> It seems to me we can follow the offset pointer into the shared BufferDesc array whenever we find out we need to compare the tags (as opposed to just the hash, which is stored and present in HASHELEMENT). If we decide to just follow the pointer, we can immediately shave 16 bytes (40%) off the lookup table's per-element size, or 24 if we pack the 4-byte shared buffer offset into the unused bytes in HASHELEMENT, reducing the memory usage of that hash table by ~50%: We'd have 16 bytes for every ELEMENT+shared_buffer_offset, plus 8 bytes for every bucket pointer (of which there are approximately as many as there are elements), resulting in 24 bytes /max_alloc elements.
Hi,
Thanks for your insights.
While the buffer tag consumes a relatively small amount of space compared to the overall shared buffer architecture (including the BufferDescriptors array and page buffers), every little bit helps, I think.
Regarding the code, I've read through some. Here are my initial thoughts:
```
int
BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id)
{
	const int	surrogateid = SurrogateBuffer;
	BufferLookupEnt *result;
	bool		found;

	Assert(buf_id >= 0);				/* -1 is reserved for not-in-table */
	Assert(tagPtr->blockNum != P_NEW);	/* invalid tag */

	MyBackendBufLookup = tagPtr;
	result = (BufferLookupEnt *)
		hash_search_with_hash_value(SharedBufHash,
									&surrogateid,
									hashcode,
									HASH_ENTER,
									&found);
	MyBackendBufLookup = NULL;

	if (found)			/* found something already in the table */
	{
		Assert(result->id != SurrogateBuffer);
		return result->id;
	}
```
In the BufTableInsert function, it appears that the key changes from BufferTag to an integer surrogateid.
Given that multiple buckets exist based on the hash code, we need to iterate through the bucket lists to find a slot by comparing the keys; if surrogateid is set to -1, will the comparison function always return false? (See the match(ELEMENTKEY(currBucket), keyPtr, keysize) == 0 call inside hash_search_with_hash_value:)
```
	while (currBucket != NULL)
	{
		if (currBucket->hashvalue == hashvalue &&
			match(ELEMENTKEY(currBucket), keyPtr, keysize) == 0)
			break;
		prevBucketPtr = &(currBucket->link);
		currBucket = *prevBucketPtr;
	}
```
Additionally, I'm curious about the purpose of MyBackendBufLookup, which is set and reset around the hash_search_with_hash_value call. Is there a concurrency consideration here, even though we have a lock before the buffer insertion?
And, a potential drawback: databases built on PostgreSQL might manipulate the buffer table directly (e.g., reading it for specific purposes).
In that case, the buffer tags stored in the table would reveal the information without needing to reference the buffer descriptor array.
While I understand that Postgres makes no promises about that, it's just a consideration.
On Sat, 1 Feb 2025 at 06:01, Zhang Mingli <zmlpostgres@gmail.com> wrote:
> Zhang Mingli
> www.hashdata.xyz
> On Jan 30, 2025 at 15:49 +0800, Matthias van de Meent <boekewurm+postgres@gmail.com>, wrote:
> Hi,
> Thanks for your insights.
> While the buffer tag consumes a relatively small amount of space in the
> overall shared buffer architecture, including the BufferDescriptors array
> and page buffers, but every little helps, I think.
> Regarding the code, I've read through some. Here are my initial thoughts:
> ```
> int
> BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id)
> [...]
> ```
> In the BufTableInsert function, it appears that the key changes from
> BufferTag to an integer surrogateid.
> Given that multiple buckets exist based on the hash code, we need to
> iterate through the bucket lists to find a slot by comparing the keys, and
> if surrogateid is set to -1, will the comparison function always return
> false?
Please note the use of our own compare function, which grabs the stored
value of 'key' and uses that buffer ID to find the buffertag of the
indicated buffer (or, in case of surrogateId, the buffertag that was
supplied by the caller of BufTableInsert()).
> Additionally, I'm curious about the purpose of MyBackendBufLookup, which
> is set and reset around the hash_search_with_hash_value call. Is there a
> concurrency consideration here, even though we have a lock before the
> buffer insertion?
In multi-threaded PostgreSQL this might need additional consideration, but
in non-reentrant, non-threaded PostgreSQL we shouldn't have any callers of
the BufTableInsert/Delete/Search() functions from inside those same
functions.
> And, a potential drawback? : databases built on PostgreSQL might
> manipulate the buffer table directly (e.g., reading it for specific
> purposes).
> In this case, the buffer tags stored in the table would reveal the infos
> without needing to reference the buffer descriptor array.
> While I understand that Postgres doesn’t have promise about that, just a
> consideration.
The buffer lookup table reference is stored in a variable that's private to
buftable.c. Any user of the table that's not using the BufTable*-functions
would need to be doing some very weird trickery w.r.t. finding out where
the table base pointer is stored and how to access it (though attaching
through shared memory is technically possible, I wouldn't suggest that as a
valid and supported method).
Additionally, anyone who wants to scan that table would have to lock all
partitions for the duration of the scan to correctly scan all entries,
which would be catastrophic for the performance of a cluster that does any
amount of buffer swapping.
It is probably much cheaper to scan the bufferdesc array, as that has
per-entry locks, and the definitions for that array are shared in the
relevant headers, and are thus usable without resorting to magic values.
Kind regards,
Matthias van de Meent
On Fri, 31 Jan 2025 at 18:23, James Hunter <james.hunter.pg@gmail.com> wrote:
> On Wed, Jan 29, 2025 at 11:49 PM Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
>> Hi,
>> Some time ago I noticed that every buffer table entry is quite large at 40 bytes (+8): 16 bytes of HASHELEMENT header (of which the last 4 bytes are padding), 20 bytes of BufferTag, and 4 bytes for the offset into the shared buffers array, with generally 8 more bytes used for the bucket pointers. (32-bit systems: 32 (+4) bytes)
>> Does anyone know why we must have the buffer tag in the buffer table?
>> It seems to me we can follow the offset pointer into the shared BufferDesc array whenever we find out we need to compare the tags (as opposed to just the hash, which is stored and present in HASHELEMENT). If we decide to just follow the pointer, we can immediately shave 16 bytes (40%) off the lookup table's per-element size, or 24 if we pack the 4-byte shared buffer offset into the unused bytes in HASHELEMENT, reducing the memory usage of that hash table by ~50%: ...
> So every buffer table entry is 40 bytes, and every buffer itself is 8
> KB. Also, every BufferDesc is padded to 64 bytes (on 64-bit systems).
> So buffer table entry is < 0.5% of total buffer space.
Correct. But note that most buffers generally aren't all being
accessed, and thus aren't (fully) present in CPU caches.
> I assume this means that the benefit of your patch doesn't come from
> memory savings, but maybe from spatial locality? As in, saving 0.5% of
> memory doesn't seem like a huge win by itself, but maybe the hash table
> will see fewer cache misses, since the entries are smaller, so they'll
> be packed closer together. Now you can fit 2.7 hash entries on a cache
> line, instead of 1.6.
Yes - though it's actually 4 entries per cache line: while the bucket
array uses space proportional to the total number of entries, we won't
access that bucket array again once we've used it, so any subsequent
memory accesses are 2.5x more likely to hit a cache line, compared to
what's currently on master.
> Except, to check the buffer tag itself, now you have to dereference
> the offset pointer into the shared BufferDesc array. And every
> BufferDesc is padded to 64 bytes (on 64-bit systems), a cache line.
> So, now, instead of having the buffer tag adjacent to the hash entry,
> you have a memory stall.
Correct - though that only happens for hits and hash collisions. Hash
collisions are considered rare (assuming linear hashing, it'd be
1/2^16), and for hits you would load that buffer tag anyway, so the
cost is a potential cache miss _while holding the buftable lock_, be
it in shared or exclusive mode (most likely: shared, as we don't do
blind updates to the hash table).
> If the buffer tags match (i.e., no collision), this is still a
> possible win, because you'd have to load the BufferDesc anyway (since
> this is the buffer you're looking for).
Exactly.
>> Does anyone have an idea on how to best benchmark this kind of patch, apart from "running pgbench"? Other ideas on how to improve this? Specific concerns?
> What's the expected advantage of reducing the buffer table's entry
> size? Better CPU cache usage when there's no collision, but worse in
> the less-likely case where there is a collision?
Yes, plus a reduced overhead of defining large shared buffers. Over
in [0], people discuss changing the size of shared_buffers on the fly.
When we have that capability, the overhead of allocating large shared
buffers can become a meaningful fraction of total shared memory in
reduced-shared_buffers systems: Currently, we have ±144
bytes/shared_buffers entry in shared_memory allocations, excluding the
page itself. This means that if someone wants to be able to scale
their buffers with a factor 1:100 min:max, the size of those shared
buffers' metadata allocations is 1.7x as large as the size of the
buffers in its smallest configuration. [^1]
While 100x scaling is definitely not a reasonable expectation right
now, I think reducing this overhead factor is going to be very useful
in the long run. See also [2], where cache misses in dynahash are
probably near 25% of the total workload - by reducing the size of
these structs we can increase the amount of data we can fit in the
same amount of buffers, thus increasing performance for some
cache-size limited workloads that hit the lookup tables.
> What I mean is, regarding benchmarks: what's the best case scenario
> for this kind of patch, and what sort of performance difference would
> you expect to see?
I assume that for the provided patch:
- there will be no (or minimal) regressions, and improved performance
when the buffer mapping table now fits in l2/l3 caches (if it
previously didn't).
- there can be some regressions if the additional offset calculation
is performance-critical.
- there can be some regressions if an additional cache miss during
redirection while holding the lock is performance-critical.
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
[0]: /messages/by-id/CA+TgmoaGCFPhMjz7veJOeef30=KdpOxgywcLwNbr-Gny-mXwcg@mail.gmail.com
[^1] I make the assumption that when we resize the storage of pages of
shared_buffers we don't also resize the buffer mapping table and the
bufferdesc array; as blocking processes from accessing page data based
on buffer descriptors' state is easier than totally rewiring all
systems that currently assume constant-size bufferdescs. With effort
it might be possible to resize these, but I don't see an easy path
forward for that, while the 8KB page of every shared_buffers slot is
relatively easier to remove from memory.
[2]: /messages/by-id/CA+CZih5R_ZMjcuMHM3Ub10vTNDLEBjW8VsO3h-y2WKb=oK4fWw@mail.gmail.com
On Thu, 30 Jan 2025 at 08:48, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> Together that results in the following prototype patchset.
Here's an alternative patch, which replaces dynahash in the buffer
lookup table with an open-coded replacement that has fewer
indirections during lookups, and allows for much-improved locality.
Overall, it uses less memory than our current dynahash in all cases;
compared to my previous patch, however, it uses less memory in some
cases and more in others.
The main design is a separate chaining hash table (similar to and
derived from code of Dynahash) to provide efficient access to free
elements. The innovation over Dynahash is that in this new hash table
the stored elements are slotted between the bucket's element pointers,
thus allowing access to hash elements without further cache misses if
the element that stores the bucket's data is stored on the same cache
line (which is quite possible). Further, the implementation uses
int-based offset pointers, reducing pointer-size overhead on 64-bit
systems and limiting the hash table to 2^31 elements (for a buffer
mapping table that seems fine, given our current buffer count limit of
2^30-1).
Finally, it shows up with accurate memory usage in
pg_shmem_allocations due to the use of a single allocation,
improving user debuggability of resources vs. dynahash.
0001 implements this hash table, with some minimal optimizations.
0002 adds the optimization where the hash table prefers to pick the
local hash element if it's not already in use.
Future optimizations could implement memory prefetching (so that
buffer tag compares won't have cache misses for memory stalls),
cacheline-local entry selection, and other optimizations, but that'll
need more effort which I don't want to put into it before getting a
green light on "this might actually be a good step in the right
direction".
As to why simplehash was not chosen: Simplehash doesn't (can't?)
support partitioning due to its open addressing model. By taking the
locality of open addressing and combining it with the partitioning
capabilities of separate chaining we get the best of both worlds (or
at least closer than dynahash got).
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
PS. This isn't a perfect solution; it still causes approximately
random memory accesses when we look up sequential pages of a relation.
However, the cost of these memory accesses can now be significantly
reduced without breaking APIs or significantly impacting memory
overheads. Replacing the hash table with another one seems fairly
straightforward, while replacing it with e.g. radixtree.h would
require a very significant shared-memory allocation effort that I'm
not comfortable writing myself.
Attachments:
v1-0001-BufTable-Use-optimized-hash-table-for-buffer-look.patch
v1-0002-BufTable-Optimize-allocation-of-BufTable-entries.patch
Hi,
On 2025-01-30 08:48:56 +0100, Matthias van de Meent wrote:
> Some time ago I noticed that every buffer table entry is quite large at 40
> bytes (+8): 16 bytes of HASHELEMENT header (of which the last 4 bytes are
> padding), 20 bytes of BufferTag, and 4 bytes for the offset into the shared
> buffers array, with generally 8 more bytes used for the bucket pointers.
> (32-bit systems: 32 (+4) bytes)
> Does anyone know why we must have the buffer tag in the buffer table?
It turns out to actually substantially improve the CPU cache hit ratio on
concurrent workloads. The BufferDesc is obviously frequently modified;
each modification (pin, lock) will pull the cacheline into modified state
within a single core, causing higher latency when accessing that data on
other cores. That's obviously not great for a crucial hashtable... I
think it mainly matters for things like inner index pages etc.
It turns out that these misses due to dirtying already cost us rather
dearly when running read-heavy workloads on bigger machines, mainly due
to nbtree code doing things like BufferGetBlockNumber().
It'd be interesting to benchmark how a separate, more densely packed
BufferTag array, shared by the hashtable and the BufferDesc itself, would
perform. Obviously that'd be a somewhat painful change.
> It seems to me we can follow the offset pointer into the shared BufferDesc
> array whenever we find out we need to compare the tags (as opposed to just
> the hash, which is stored and present in HASHELEMENT). If we decide to just
> follow the pointer, we can immediately shave 16 bytes (40%) off the lookup
> table's per-element size, or 24 if we pack the 4-byte shared buffer offset
> into the unused bytes in HASHELEMENT, reducing the memory usage of that
> hash table by ~50%: We'd have 16 bytes for every
> ELEMENT+shared_buffer_offset, plus 8 bytes for every bucket pointer (of
> which there are approximately as many as there are elements), resulting in
> 24 bytes /max_alloc elements.
It does make the code more branchy...
> Does anyone have an idea on how to best benchmark this kind of patch, apart
> from "running pgbench"? Other ideas on how to improve this? Specific
> concerns?
I'd recommend benchmarking at least the following workloads, all fully shared
buffer cache resident:
1) sequential scans
Tests: very low likelihood of cpu cache hit rates
2) sequential scan with a parameterized index nested loop over a small table
Tests: mixes low-cache hit likelihood (seqscan) with high cache hit rate
(index scan)
3) small range index scan with param nestloop join to index scan
Tests: Very high cache hit ratio
It's unfortunately fairly important to test these both with a single client
*and* a large number of clients on a multi-socket server. The latter makes
cache miss latency much more visible.
It might be worth looking at perf c2c profiles for before/after; I'd
expect them to change noticeably. It might also be worth looking at
perf stat for cache misses, HITM events, etc.
Greetings,
Andres Freund
Hi,
On 2025-02-04 19:58:36 +0100, Matthias van de Meent wrote:
> On Thu, 30 Jan 2025 at 08:48, Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
>> Together that results in the following prototype patchset.
> Here's an alternative patch, which replaces dynahash in the buffer
> lookup table with an open-coded replacement that has fewer
> indirections during lookups, and allows for a much increased locality.
> Overall, it has less memory usage than our current dynahash in all
> cases; but compared to my previous patch it improves memory only in
> some cases, in others it uses more memory.
> The main design is a separate chaining hash table (similar to and
> derived from code of Dynahash) to provide efficient access to free
> elements.
Why do you want to use a separate chaining hash table? For something as
performance sensitive as this the lower cache hit ratio seems like a strong
reason to not go for such a hash table "architecture"?
I think it'd be better to work towards not using a hash table for the buffer
mapping table than to write a dedicated hash table implementation for it
though. A hash table
a) doesn't have ordered access support, which causes a bunch of O(NBuffers)
operations and makes things like readahead etc way more expensive
b) doesn't benefit from spatial locality, even though buffer lookups have a
very high degree of spatial locality
Greetings,
Andres Freund
On Wed, 5 Feb 2025 at 02:22, Andres Freund <andres@anarazel.de> wrote:
> Hi,
> On 2025-02-04 19:58:36 +0100, Matthias van de Meent wrote:
>> On Thu, 30 Jan 2025 at 08:48, Matthias van de Meent
>> <boekewurm+postgres@gmail.com> wrote:
>>> Together that results in the following prototype patchset.
>> Here's an alternative patch, which replaces dynahash in the buffer
>> lookup table with an open-coded replacement that has fewer
>> indirections during lookups, and allows for a much increased locality.
>> Overall, it has less memory usage than our current dynahash in all
>> cases; but compared to my previous patch it improves memory only in
>> some cases, in others it uses more memory.
>> The main design is a separate chaining hash table (similar to and
>> derived from code of Dynahash) to provide efficient access to free
>> elements.
> Why do you want to use a separate chaining hash table? For something as
> performance sensitive as this the lower cache hit ratio seems like a strong
> reason to not go for such a hash table "architecture"?
I think it's hard to defend this patchset against a perfect solution,
but lacking that perfect solution (and given the current,
definitely-not-great one), I'll share my thoughts on why I chose a
different hashmap. While wordy, I'm not trying to defend or attack
anything or anyone, just explaining my reasoning.
My main reasons are:
1.) this is a mostly drop-in replacement that, in the general case, is
better in every metric vs dynahash; which is probably preferred over
continuing the status quo given that it's now really starting to show
its limits (see e.g. the hash_search_with_hash_value thread).
2.) I'm not experienced enough with writing non-constant-sized
allocations in shared memory to get something like a radixtree.h
implemented in shared memory. Given that threading is still some
releases away, I thought a chaining hash table (which only needs
constant size allocations and thus requires only a simple slab
allocator) would be OK if it still improved on the status quo.
Note that radixtree's implementation has nodes sizing from 16 bytes up
to 1000+ bytes. Building an efficient allocator for that is difficult
(see: bin-packing problem), assuming it needs to interface with shmem
and thus statically pre-allocated.
3.) simplehash.h can't currently be used due to requiring resizing on
certain intervals, and we can't just dynamically re-allocate the
table. This makes the implementation unusable.
4.) simplehash also invalidates cache lines containing otherwise
unmodified elements due to its relocation strategy, which isn't great
either for cache sharing.
5.) A big reason why dynahash is slow, apart from random unpredictable
memory accesses, is said to be because it needs to follow 3 pointers
before it has its first element: The hash table has a pointer to a
segment, which is a pointer to a bucket, which is a pointer to an
ELEMENT - and the hash table itself is an allocated struct and thus
needs to be loaded through a pointer as well. It isn't the only reason
(e.g. bucket chains are expensive, too) but probably one large reason
why it's so slow.
As for what the buffer lookup table worries about, I considered the following:

cache locality (how quickly can you find an element, starting with no info):
    dynahash (--), newhash (-), simplehash (+), radixtree (~?)
cache line dirtying (how many cache lines are dirtied by insert/delete operations):
    dynahash (+++), newhash (++), simplehash (--), radixtree (++ ?)
spatial locality (how close are "near" elements to each other in the structure?):
    dynahash (---), newhash (--?), simplehash (-?), radixtree (++)
partitioning (can the structure be written to by multiple backends concurrently?):
    dynahash (+++), newhash (+++), simplehash (---), radixtree (--?)
memory usage (memory overhead allocated for random buffers):
    dynahash (~), newhash (+), simplehash (++), radixtree (-?)
> I think it'd be better to work towards not using a hash table for the buffer
> mapping table than to write a dedicated hash table implementation for it
> though. A hash table
> a) doesn't have ordered access support, which causes a bunch of O(NBuffers)
> operations and makes things like readahead etc way more expensive
> b) doesn't benefit from spatial locality, even though buffer lookups have a
> very high degree of spatial locality
While I hear those arguments, I'm not sure it's as easy as you make it
seem. You yourself have said that "dynahash is [a] really poor hash
table implementation." This new hash table was 'only' a few days'
work, and seems to have fixed at least 2 of the major issues we see in
dynahash: it reduces the pointer chasing for the first result to 2 in
the optimal case, and (with some further changes which I have yet to
share) it may achieve much higher average locality vs. the current,
unmodified dynahash.
I did look at other solutions (specifically, adapting radixtree.h
and/or simplehash.h) but, as I've mentioned above, the memory tooling
required for those solutions was a major turnoff, as it significantly
increases the complexity of such a project.
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
On Wed, 5 Feb 2025 at 02:14, Andres Freund <andres@anarazel.de> wrote:
> Hi,
> On 2025-01-30 08:48:56 +0100, Matthias van de Meent wrote:
>> Some time ago I noticed that every buffer table entry is quite large at 40
>> bytes (+8): 16 bytes of HASHELEMENT header (of which the last 4 bytes are
>> padding), 20 bytes of BufferTag, and 4 bytes for the offset into the shared
>> buffers array, with generally 8 more bytes used for the bucket pointers.
>> (32-bit systems: 32 (+4) bytes)
>> Does anyone know why we must have the buffer tag in the buffer table?
> It turns out to actually substantially improve CPU cache hit ratio on
> concurrent workloads. The BufferDesc is obviously frequently modified. Each
> modification (pin, lock) will pull the cacheline into modified state, within a
> single core. Causing higher latency when accessing that data on other
> cores. That's obviously not great for a crucial hashtable... I think it
> mainly matters for things like inner index pages etc.
> It turns out that these misses due to dirtying already costs us rather dearly
> when running read-heavy workloads on bigger machines, mainly due to nbtree
> code doing things like BufferGetBlockNumber().
That is something I hadn't thought about, but indeed, you're right
that this wouldn't be great.
> It'd be interesting to benchmark how a separate, more densely packed,
> BufferTag array, shared by the hashtable and the BufferDesc itself. Obviously
> that'd be a somewhat painful change.
Such a patch is actually not that bad, as surprisingly few files
actually touch on BufferDesc (let alone BufferDesc->tag - though the
number of places with changes is still 100+). I've prototyped with it,
and it removes another 12 bytes from the overhead of each buffer
(assuming we want to pack BufferDesc at 32 bytes, as that's possible).
>> Does anyone have an idea on how to best benchmark this kind of patch, apart
>> from "running pgbench"? Other ideas on how to improve this? Specific
>> concerns?
> I'd recommend benchmarking at least the following workloads, all fully shared
> buffer cache resident:
> [...]
Thanks for the suggestions!
> It's unfortunately fairly important to test these both with a single client
> *and* a large number of clients on a multi-socket server. The latter makes
> cache miss latency much more visible.
> It might be worth looking at perf c2c profiles for before/after, I'd expect it
> to change noticeably. Might also be worth looking at perf stat for cache
> misses, hitm, etc.
Hmm. I'll see if I can get some hardware to test this.
FYI, I've pushed a newer version of the 'newhash' approach to GitHub,
at [0]. It extracts the buffer tags from the BufferDesc into their own
array to reduce the problems with false sharing due to pins, and has
some more code to try to forcefully increase the locality of hash
elements.
There are still a few more potential changes that can increase cache
locality of hash elements with the buckets that store their data, but
I have no immediate plans for that.
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
[0]: https://github.com/MMeent/postgres/tree/feat/new-buftable-hash