RFC: Packing the buffer lookup table
Hi,
Some time ago I noticed that every buffer table entry is quite large at 40
bytes (+8): 16 bytes of HASHELEMENT header (of which the last 4 bytes are
padding), 20 bytes of BufferTag, and 4 bytes for the offset into the shared
buffers array, with generally 8 more bytes used for the bucket pointers.
(32-bit systems: 32 (+4) bytes)
Does anyone know why we must have the buffer tag in the buffer table?
It seems to me we can follow the offset pointer into the shared BufferDesc
array whenever we find out we need to compare the tags (as opposed to just
the hash, which is stored and present in HASHELEMENT). If we decide to just
follow the pointer, we can immediately shave 16 bytes (40%) off the lookup
table's per-element size, or 24 if we pack the 4-byte shared buffer offset
into the unused bytes in HASHELEMENT, reducing the memory usage of that
hash table by ~50%: We'd have 16 bytes for every
ELEMENT+shared_buffer_offset, plus 8 bytes for every bucket pointer (of
which there are approximately as many as there are elements), resulting in
24 bytes /max_alloc elements.
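For reference, this is the shape of what's being counted: dynahash's
HASHELEMENT header followed by buf_table.c's BufferLookupEnt (the byte-count
comments are my annotations, for a typical 64-bit build):

```
typedef struct HASHELEMENT
{
	struct HASHELEMENT *link;	/* 8 bytes: next element in this bucket */
	uint32		hashvalue;		/* 4 bytes: stored hash of the key */
	/* + 4 bytes of alignment padding -> 16 bytes total */
} HASHELEMENT;

/* entry for buffer lookup hashtable (buf_table.c) */
typedef struct
{
	BufferTag	key;			/* 20 bytes: tag of the disk page */
	int			id;				/* 4 bytes: offset into shared buffers */
} BufferLookupEnt;				/* 24 bytes -> 40 bytes per element, plus
								 * ~8 bytes of bucket pointer */
```

Dropping the tag (and re-reading it through the BufferDesc that 'id' points
to) leaves only the 4-byte id, which can then slide into the padding hole
shown above.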
(This was also discussed on Discord in the Hackers Mentoring server, over
at [0].)
Together that results in the following prototype patchset. 0001 adds the
ability for dynahash users to opt in to using the 4-byte alignment hole in
HASHELEMENT (by providing size- and alignment info that dynahash uses to
partially move the entry into the alignment hole), 0002 uses that feature
to get the per-element size of the buffer lookup hash table down to 16
bytes (+8B for bucket pointers), or 12 (+4) on 32-bit systems.
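Condensed from patch 0002 below (abridged), the opt-in from a dynahash
user's point of view ends up looking like this in InitBufTable(), with
BufTableHashValue/BufTableHashCompare being the patch's id-to-tag-aware
hash and compare callbacks:

```
	HASHCTL		info;

	/* key and entry are both just the 4-byte buffer id */
	info.keysize = sizeof(BufferLookupEnt);
	info.entrysize = sizeof(BufferLookupEnt);
	info.entryalign = ALIGNOF_INT;	/* allow packing into HASHELEMENT's hole */
	info.num_partitions = NUM_BUFFER_PARTITIONS;
	info.hash = BufTableHashValue;
	info.match = BufTableHashCompare;

	SharedBufHash = ShmemInitHash("Shared Buffer Lookup Table",
								  size, size,
								  &info,
								  HASH_ELEM | HASH_PARTITION | HASH_FUNCTION
								  | HASH_COMPARE | HASH_ALIGN);
```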
An alternative approach to current patch 1 (which introduces an "element
data offset" to determine where to start looking for the key) would be to
add an option that allows "0-length" keys/entries when there is alignment
space, and to make the hash/compare functions handle writing/reading of the
key data (thus removing the new data dependencies in the hash lookup
function). I'm not sure that's a winning idea, though, as it requires the
user of the API to have knowledge about dynahash's internals, rather than
dynahash internally optimizing its layout based on a clearer picture of
what the hash entry needs.
Does anyone have an idea on how to best benchmark this kind of patch, apart
from "running pgbench"? Other ideas on how to improve this? Specific
concerns?
Kind regards,
Matthias van de Meent
[0]: https://discord.com/channels/1258108670710124574/1318997914777026580
Attachments:
v0-0002-Buftable-Reduce-size-of-buffer-table-entries-by-6.patch
From be1923e29a0ca53adae8853f7ca7b4534fa58258 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Fri, 20 Dec 2024 18:38:52 +0100
Subject: [PATCH v0 2/2] Buftable: Reduce size of buffer table entries by 60%
By using the stored id as the identifier of which BufferDesc we point to
and thus which BufferTag we need to compare against, we can remove
that BufferTag from the BufferLookupEnt.
On 32-bit systems, that reduces the per-element size from 32 bytes to 12
bytes, a 62% reduction. On 64-bit systems we additionally fit the now
4-byte entry data into the empty space of the hash table element, thus
reducing the per-element size down to 16 bytes from 40 bytes; a 60%
reduction.
---
src/include/storage/buf_internals.h | 2 +-
src/backend/storage/buffer/buf_table.c | 112 ++++++++++++++++++++++---
src/backend/storage/buffer/bufmgr.c | 4 +-
3 files changed, 105 insertions(+), 13 deletions(-)
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 1a65342177d..db752bf8368 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -439,7 +439,7 @@ extern void InitBufTable(int size);
extern uint32 BufTableHashCode(BufferTag *tagPtr);
extern int BufTableLookup(BufferTag *tagPtr, uint32 hashcode);
extern int BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id);
-extern void BufTableDelete(BufferTag *tagPtr, uint32 hashcode);
+extern void BufTableDelete(int buffer, uint32 hashcode);
/* localbuf.c */
extern bool PinLocalBuffer(BufferDesc *buf_hdr, bool adjust_usagecount);
diff --git a/src/backend/storage/buffer/buf_table.c b/src/backend/storage/buffer/buf_table.c
index a50955d5286..0cda2652807 100644
--- a/src/backend/storage/buffer/buf_table.c
+++ b/src/backend/storage/buffer/buf_table.c
@@ -22,16 +22,90 @@
#include "postgres.h"
#include "storage/buf_internals.h"
+#include "common/hashfn.h"
/* entry for buffer lookup hashtable */
typedef struct
{
- BufferTag key; /* Tag of a disk page */
- int id; /* Associated buffer ID */
+ int id; /* Buffer offset into BufferDescriptors */
} BufferLookupEnt;
+/*
+ * We don't store Buffer in BufTable, but offsets into BufferDescriptors, so
+ * we must use our own InvalidBuffer equivalent.
+ */
+#define SurrogateBuffer (-1)
+
static HTAB *SharedBufHash;
+static BufferTag *MyBackendBufLookup;
+
+static inline BufferTag *
+BufTableIdToBufTag(int id)
+{
+ if (id == SurrogateBuffer)
+ {
+ Assert(PointerIsValid(MyBackendBufLookup));
+ return MyBackendBufLookup;
+ }
+ else
+ {
+ BufferDesc *buf = GetBufferDescriptor(id);
+ /* can't be in the buffer table if the tag ain't valid */
+ Assert(pg_atomic_read_u32(&buf->state) & BM_TAG_VALID);
+ return &buf->tag;
+ }
+}
+
+static uint32
+BufTableHashValue(const void *key, Size keysize)
+{
+ int id = *(int *) key;
+
+ /*
+ * Key hashes are stored, and are provided during search/insert/delete
+ * operations, so the only opportunity for this to happen is the hash
+ * table doing hash operations (which we don't want to happen). Therefore,
+ * assert we don't get called, but correctly handle other operations.
+ */
+ Assert(false);
+
+ if (id == SurrogateBuffer)
+ {
+ if (MyBackendBufLookup == NULL)
+ {
+ elog(ERROR, "Surrogate buffer ID should have a lookup buffer tag set aside");
+ }
+
+ return tag_hash(MyBackendBufLookup, sizeof(BufferTag));
+ }
+
+ return tag_hash(BufTableIdToBufTag(id), keysize);
+}
+
+static int
+BufTableHashCompare(const void *key1, const void *key2, Size keysize)
+{
+ int id1 = *(int *) key1;
+ int id2 = *(int *) key2;
+
+ /*
+ * Hash table searches always provide existing entries as argument 1;
+ * and we can't have a missing ID as argument.
+ */
+ Assert(id1 != SurrogateBuffer);
+
+ if (id1 == id2)
+ return 0;
+
+ /* a concrete buffer ID was supplied, and it isn't this entry's ID */
+ if (id2 != SurrogateBuffer)
+ return 1;
+
+ return memcmp(BufTableIdToBufTag(id1),
+ BufTableIdToBufTag(id2),
+ sizeof(BufferTag));
+}
/*
* Estimate space needed for mapping hashtable
@@ -40,7 +114,9 @@ static HTAB *SharedBufHash;
Size
BufTableShmemSize(int size)
{
- return hash_estimate_size(size, sizeof(BufferLookupEnt));
+ return hash_estimate_size_aligned(size,
+ sizeof(BufferLookupEnt),
+ ALIGNOF_INT);
}
/*
@@ -55,14 +131,18 @@ InitBufTable(int size)
/* assume no locking is needed yet */
/* BufferTag maps to Buffer */
- info.keysize = sizeof(BufferTag);
+ info.keysize = sizeof(BufferLookupEnt);
info.entrysize = sizeof(BufferLookupEnt);
+ info.entryalign = ALIGNOF_INT;
info.num_partitions = NUM_BUFFER_PARTITIONS;
+ info.hash = BufTableHashValue;
+ info.match = BufTableHashCompare;
SharedBufHash = ShmemInitHash("Shared Buffer Lookup Table",
size, size,
&info,
- HASH_ELEM | HASH_BLOBS | HASH_PARTITION);
+ HASH_ELEM | HASH_PARTITION | HASH_FUNCTION
+ | HASH_COMPARE | HASH_ALIGN);
}
/*
@@ -77,7 +157,7 @@ InitBufTable(int size)
uint32
BufTableHashCode(BufferTag *tagPtr)
{
- return get_hash_value(SharedBufHash, tagPtr);
+ return tag_hash(tagPtr, sizeof(BufferTag));
}
/*
@@ -89,15 +169,19 @@ BufTableHashCode(BufferTag *tagPtr)
int
BufTableLookup(BufferTag *tagPtr, uint32 hashcode)
{
+ const int surrogateid = SurrogateBuffer;
BufferLookupEnt *result;
+ MyBackendBufLookup = tagPtr;
result = (BufferLookupEnt *)
hash_search_with_hash_value(SharedBufHash,
- tagPtr,
+ &surrogateid,
hashcode,
HASH_FIND,
NULL);
+ MyBackendBufLookup = NULL;
+
if (!result)
return -1;
@@ -117,21 +201,29 @@ BufTableLookup(BufferTag *tagPtr, uint32 hashcode)
int
BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id)
{
+ const int surrogateid = SurrogateBuffer;
BufferLookupEnt *result;
bool found;
Assert(buf_id >= 0); /* -1 is reserved for not-in-table */
Assert(tagPtr->blockNum != P_NEW); /* invalid tag */
+ MyBackendBufLookup = tagPtr;
+
result = (BufferLookupEnt *)
hash_search_with_hash_value(SharedBufHash,
- tagPtr,
+ &surrogateid,
hashcode,
HASH_ENTER,
&found);
+ MyBackendBufLookup = NULL;
+
if (found) /* found something already in the table */
+ {
+ Assert(result->id != SurrogateBuffer);
return result->id;
+ }
result->id = buf_id;
@@ -145,13 +237,13 @@ BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id)
* Caller must hold exclusive lock on BufMappingLock for tag's partition
*/
void
-BufTableDelete(BufferTag *tagPtr, uint32 hashcode)
+BufTableDelete(int buffer, uint32 hashcode)
{
BufferLookupEnt *result;
result = (BufferLookupEnt *)
hash_search_with_hash_value(SharedBufHash,
- tagPtr,
+ &buffer,
hashcode,
HASH_REMOVE,
NULL);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 5008641baff..25398467a2b 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1856,7 +1856,7 @@ retry:
* Remove the buffer from the lookup hashtable, if it was in there.
*/
if (oldFlags & BM_TAG_VALID)
- BufTableDelete(&oldTag, oldHash);
+ BufTableDelete(buf->buf_id, oldHash);
/*
* Done with mapping lock.
@@ -1935,7 +1935,7 @@ InvalidateVictimBuffer(BufferDesc *buf_hdr)
Assert(BUF_STATE_GET_REFCOUNT(buf_state) > 0);
/* finally delete buffer from the buffer mapping table */
- BufTableDelete(&tag, hash);
+ BufTableDelete(buf_hdr->buf_id, hash);
LWLockRelease(partition_lock);
--
2.45.2
v0-0001-Dynahash-Allow-improved-packing-of-hash-elements.patch
From 3f4cd31478d3e725ab3900eac8c9e7f8982727c5 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Wed, 18 Dec 2024 16:43:27 +0100
Subject: [PATCH v0 1/2] Dynahash: Allow improved packing of hash elements
On 64-bit builds, there is a 4-byte gap in each Element, which is
wasteful when the entries align to 4 bytes, too.
This adds APIs to indicate entries align by 4 bytes, thus reducing
the space wastage by some bytes per entry (when the alignment works
out).
This will be used in future commits to reduce the size of each
buf_table element by 24 bytes down to 16 bytes, a reduction of 60%.
---
src/include/utils/hsearch.h | 10 +++-
src/backend/utils/hash/dynahash.c | 90 ++++++++++++++++++++++++-------
2 files changed, 79 insertions(+), 21 deletions(-)
diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
index 932cc4f34d9..0856e1209f3 100644
--- a/src/include/utils/hsearch.h
+++ b/src/include/utils/hsearch.h
@@ -74,6 +74,9 @@ typedef struct HASHCTL
/* Used if HASH_ELEM flag is set (which is now required): */
Size keysize; /* hash key length in bytes */
Size entrysize; /* total user element size in bytes */
+ /* Used if HASH_ALIGN flag is set: */
+ int entryalign; /* alignment of entries; either ALIGNOF_INT or
+ * MAXIMUM_ALIGNOF */
/* Used if HASH_FUNCTION flag is set: */
HashValueFunc hash; /* hash function */
/* Used if HASH_COMPARE flag is set: */
@@ -103,6 +106,7 @@ typedef struct HASHCTL
#define HASH_SHARED_MEM 0x0800 /* Hashtable is in shared memory */
#define HASH_ATTACH 0x1000 /* Do not initialize hctl */
#define HASH_FIXED_SIZE 0x2000 /* Initial size is a hard limit */
+#define HASH_ALIGN 0x4000 /* Pack-align elements */
/* max_dsize value to indicate expansible directory */
#define NO_MAX_DSIZE (-1)
@@ -149,7 +153,11 @@ extern void hash_seq_init_with_hash_value(HASH_SEQ_STATUS *status,
extern void *hash_seq_search(HASH_SEQ_STATUS *status);
extern void hash_seq_term(HASH_SEQ_STATUS *status);
extern void hash_freeze(HTAB *hashp);
-extern Size hash_estimate_size(long num_entries, Size entrysize);
+
+#define hash_estimate_size(num_entries, size) \
+ hash_estimate_size_aligned(num_entries, size, MAXIMUM_ALIGNOF)
+extern Size hash_estimate_size_aligned(long num_entries, Size entrysize,
+ Size entryalign);
extern long hash_select_dirsize(long num_entries);
extern Size hash_get_shared_size(HASHCTL *info, int flags);
extern void AtEOXact_HashTables(bool isCommit);
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index cd5a00132fc..f74c3ba0682 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -190,6 +190,10 @@ struct HASHHDR
/* These fields are fixed at hashtable creation */
Size keysize; /* hash key length in bytes */
Size entrysize; /* total user element size in bytes */
+ int entryalign; /* offset of user data into element, to pack
+ * data tightly */
+ int entryoffset; /* offset of entry into element, may point
+ * into the alignment gap in HASHELEMENT */
long num_partitions; /* # partitions (must be power of 2), or 0 */
long max_dsize; /* 'dsize' limit if directory is fixed size */
long ssize; /* segment size --- must be power of 2 */
@@ -234,6 +238,7 @@ struct HTAB
/* We keep local copies of these fixed values to reduce contention */
Size keysize; /* hash key length in bytes */
+ Size keyoff; /* key start offset off the base entry pointer */
long ssize; /* segment size --- must be power of 2 */
int sshift; /* segment shift = log2(ssize) */
};
@@ -241,7 +246,7 @@ struct HTAB
/*
* Key (also entry) part of a HASHELEMENT
*/
-#define ELEMENTKEY(helem) (((char *)(helem)) + MAXALIGN(sizeof(HASHELEMENT)))
+#define ELEMENTKEY(helem, off) (((char *)(helem)) + off)
/*
* Obtain element pointer given pointer to key
@@ -270,7 +275,7 @@ static bool dir_realloc(HTAB *hashp);
static bool expand_table(HTAB *hashp);
static HASHBUCKET get_hash_entry(HTAB *hashp, int freelist_idx);
static void hdefault(HTAB *hashp);
-static int choose_nelem_alloc(Size entrysize);
+static int choose_nelem_alloc(Size entrysize, Size entryalign);
static bool init_htab(HTAB *hashp, long nelem);
static void hash_corrupted(HTAB *hashp) pg_attribute_noreturn();
static uint32 hash_initial_lookup(HTAB *hashp, uint32 hashvalue,
@@ -361,6 +366,8 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
Assert(flags & HASH_ELEM);
Assert(info->keysize > 0);
Assert(info->entrysize >= info->keysize);
+ Assert(flags & (~HASH_ALIGN) || info->entryalign == ALIGNOF_INT ||
+ info->entryalign == MAXIMUM_ALIGNOF);
/*
* For shared hash tables, we have a local hash header (HTAB struct) that
@@ -558,8 +565,37 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
hctl->keysize = info->keysize;
hctl->entrysize = info->entrysize;
+ if (flags & HASH_ALIGN)
+ {
+ if (info->entryalign != MAXIMUM_ALIGNOF &&
+ info->entryalign != ALIGNOF_INT)
+ {
+ elog(ERROR, "Invalid hash table entry alignment %d",
+ info->entryalign);
+ }
+
+ hctl->entryalign = info->entryalign;
+
+ if (info->entryalign == MAXIMUM_ALIGNOF)
+ hctl->entryoffset = sizeof(HASHELEMENT);
+ else if (info->entryalign == ALIGNOF_INT)
+ {
+ const Size startOfGap = offsetof(HASHELEMENT, hashvalue) + sizeof(uint32);
+ Size aligned = TYPEALIGN(ALIGNOF_INT, startOfGap);
+
+ hctl->entryoffset = aligned;
+ }
+ }
+ else
+ {
+ hctl->entryalign = MAXIMUM_ALIGNOF;
+ hctl->entryoffset = sizeof(HASHELEMENT);
+ }
+ Assert(TYPEALIGN(hctl->entryalign, hctl->entryoffset) == hctl->entryoffset);
+
/* make local copies of heavily-used constant fields */
hashp->keysize = hctl->keysize;
+ hashp->keyoff = hctl->entryoffset;
hashp->ssize = hctl->ssize;
hashp->sshift = hctl->sshift;
@@ -653,15 +689,19 @@ hdefault(HTAB *hashp)
* elements to add to the hash table when we need more.
*/
static int
-choose_nelem_alloc(Size entrysize)
+choose_nelem_alloc(Size entrysize, Size entryalign)
{
int nelem_alloc;
Size elementSize;
Size allocSize;
+ Assert(entryalign == 4 || entryalign == MAXIMUM_ALIGNOF);
/* Each element has a HASHELEMENT header plus user data. */
/* NB: this had better match element_alloc() */
- elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(entrysize);
+ elementSize = offsetof(HASHELEMENT, hashvalue) + sizeof(uint32);
+ elementSize = TYPEALIGN(entryalign, elementSize);
+
+ elementSize = MAXALIGN(elementSize + entrysize);
/*
* The idea here is to choose nelem_alloc at least 32, but round up so
@@ -756,7 +796,7 @@ init_htab(HTAB *hashp, long nelem)
}
/* Choose number of entries to allocate at a time */
- hctl->nelem_alloc = choose_nelem_alloc(hctl->entrysize);
+ hctl->nelem_alloc = choose_nelem_alloc(hctl->entrysize, hctl->entryalign);
#ifdef HASH_DEBUG
fprintf(stderr, "init_htab:\n%s%p\n%s%ld\n%s%ld\n%s%d\n%s%ld\n%s%u\n%s%x\n%s%x\n%s%ld\n",
@@ -772,6 +812,8 @@ init_htab(HTAB *hashp, long nelem)
return true;
}
+
+
/*
* Estimate the space needed for a hashtable containing the given number
* of entries of given size.
@@ -780,7 +822,7 @@ init_htab(HTAB *hashp, long nelem)
* NB: assumes that all hash structure parameters have default values!
*/
Size
-hash_estimate_size(long num_entries, Size entrysize)
+hash_estimate_size_aligned(long num_entries, Size entrysize, Size entryalign)
{
Size size;
long nBuckets,
@@ -807,9 +849,12 @@ hash_estimate_size(long num_entries, Size entrysize)
size = add_size(size, mul_size(nSegments,
MAXALIGN(DEF_SEGSIZE * sizeof(HASHBUCKET))));
/* elements --- allocated in groups of choose_nelem_alloc() entries */
- elementAllocCnt = choose_nelem_alloc(entrysize);
+ elementAllocCnt = choose_nelem_alloc(entrysize, entryalign);
nElementAllocs = (num_entries - 1) / elementAllocCnt + 1;
- elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(entrysize);
+
+ elementSize = offsetof(HASHELEMENT, hashvalue) + sizeof(uint32);
+ elementSize = TYPEALIGN(entryalign, elementSize);
+ elementSize = MAXALIGN(elementSize + entrysize);
size = add_size(size,
mul_size(nElementAllocs,
mul_size(elementAllocCnt, elementSize)));
@@ -974,6 +1019,7 @@ hash_search_with_hash_value(HTAB *hashp,
HASHHDR *hctl = hashp->hctl;
int freelist_idx = FREELIST_IDX(hctl, hashvalue);
Size keysize;
+ Size keyoff;
HASHBUCKET currBucket;
HASHBUCKET *prevBucketPtr;
HashCompareFunc match;
@@ -1014,11 +1060,12 @@ hash_search_with_hash_value(HTAB *hashp,
*/
match = hashp->match; /* save one fetch in inner loop */
keysize = hashp->keysize; /* ditto */
+ keyoff = hashp->keyoff; /* ditto */
while (currBucket != NULL)
{
if (currBucket->hashvalue == hashvalue &&
- match(ELEMENTKEY(currBucket), keyPtr, keysize) == 0)
+ match(ELEMENTKEY(currBucket, keyoff), keyPtr, keysize) == 0)
break;
prevBucketPtr = &(currBucket->link);
currBucket = *prevBucketPtr;
@@ -1038,7 +1085,7 @@ hash_search_with_hash_value(HTAB *hashp,
{
case HASH_FIND:
if (currBucket != NULL)
- return ELEMENTKEY(currBucket);
+ return ELEMENTKEY(currBucket, keyoff);
return NULL;
case HASH_REMOVE:
@@ -1067,7 +1114,7 @@ hash_search_with_hash_value(HTAB *hashp,
* element, because someone else is going to reuse it the next
* time something is added to the table
*/
- return ELEMENTKEY(currBucket);
+ return ELEMENTKEY(currBucket, keyoff);
}
return NULL;
@@ -1075,7 +1122,7 @@ hash_search_with_hash_value(HTAB *hashp,
case HASH_ENTER_NULL:
/* Return existing element if found, else create one */
if (currBucket != NULL)
- return ELEMENTKEY(currBucket);
+ return ELEMENTKEY(currBucket, keyoff);
/* disallow inserts if frozen */
if (hashp->frozen)
@@ -1105,7 +1152,7 @@ hash_search_with_hash_value(HTAB *hashp,
/* copy key into record */
currBucket->hashvalue = hashvalue;
- hashp->keycopy(ELEMENTKEY(currBucket), keyPtr, keysize);
+ hashp->keycopy(ELEMENTKEY(currBucket, keyoff), keyPtr, keysize);
/*
* Caller is expected to fill the data field on return. DO NOT
@@ -1114,7 +1161,7 @@ hash_search_with_hash_value(HTAB *hashp,
* caller's data structure.
*/
- return ELEMENTKEY(currBucket);
+ return ELEMENTKEY(currBucket, keyoff);
}
elog(ERROR, "unrecognized hash action code: %d", (int) action);
@@ -1149,6 +1196,7 @@ hash_update_hash_key(HTAB *hashp,
HASHELEMENT *existingElement = ELEMENT_FROM_KEY(existingEntry);
uint32 newhashvalue;
Size keysize;
+ Size keyoff;
uint32 bucket;
uint32 newbucket;
HASHBUCKET currBucket;
@@ -1202,11 +1250,12 @@ hash_update_hash_key(HTAB *hashp,
*/
match = hashp->match; /* save one fetch in inner loop */
keysize = hashp->keysize; /* ditto */
+ keyoff = hashp->keyoff; /* ditto */
while (currBucket != NULL)
{
if (currBucket->hashvalue == newhashvalue &&
- match(ELEMENTKEY(currBucket), newKeyPtr, keysize) == 0)
+ match(ELEMENTKEY(currBucket, keyoff), newKeyPtr, keysize) == 0)
break;
prevBucketPtr = &(currBucket->link);
currBucket = *prevBucketPtr;
@@ -1240,7 +1289,7 @@ hash_update_hash_key(HTAB *hashp,
/* copy new key into record */
currBucket->hashvalue = newhashvalue;
- hashp->keycopy(ELEMENTKEY(currBucket), newKeyPtr, keysize);
+ hashp->keycopy(ELEMENTKEY(currBucket, keyoff), newKeyPtr, keysize);
/* rest of record is untouched */
@@ -1440,7 +1489,7 @@ hash_seq_search(HASH_SEQ_STATUS *status)
status->curEntry = curElem->link;
if (status->hashvalue != curElem->hashvalue)
continue;
- return (void *) ELEMENTKEY(curElem);
+ return (void *) ELEMENTKEY(curElem, status->hashp->keyoff);
}
hash_seq_term(status);
@@ -1453,7 +1502,7 @@ hash_seq_search(HASH_SEQ_STATUS *status)
status->curEntry = curElem->link;
if (status->curEntry == NULL) /* end of this bucket */
++status->curBucket;
- return ELEMENTKEY(curElem);
+ return ELEMENTKEY(curElem, status->hashp->keyoff);
}
/*
@@ -1507,7 +1556,7 @@ hash_seq_search(HASH_SEQ_STATUS *status)
if (status->curEntry == NULL) /* end of this bucket */
++curBucket;
status->curBucket = curBucket;
- return ELEMENTKEY(curElem);
+ return ELEMENTKEY(curElem, status->hashp->keyoff);
}
void
@@ -1716,7 +1765,8 @@ element_alloc(HTAB *hashp, int nelem, int freelist_idx)
return false;
/* Each element has a HASHELEMENT header plus user data. */
- elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize);
+ Assert(hctl->entryoffset >= offsetof(HASHELEMENT, hashvalue) + sizeof(uint32));
+ elementSize = MAXALIGN(hctl->entryoffset + hctl->entrysize);
CurrentDynaHashCxt = hashp->hcxt;
firstElement = (HASHELEMENT *) hashp->alloc(nelem * elementSize);
--
2.45.2
On Wed, Jan 29, 2025 at 11:49 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
Hi,
Some time ago I noticed that every buffer table entry is quite large at 40 bytes (+8): 16 bytes of HASHELEMENT header (of which the last 4 bytes are padding), 20 bytes of BufferTag, and 4 bytes for the offset into the shared buffers array, with generally 8 more bytes used for the bucket pointers. (32-bit systems: 32 (+4) bytes)
Does anyone know why we must have the buffer tag in the buffer table?
It seems to me we can follow the offset pointer into the shared BufferDesc array whenever we find out we need to compare the tags (as opposed to just the hash, which is stored and present in HASHELEMENT). If we decide to just follow the pointer, we can immediately shave 16 bytes (40%) off the lookup table's per-element size, or 24 if we pack the 4-byte shared buffer offset into the unused bytes in HASHELEMENT, reducing the memory usage of that hash table by ~50%: ...
So every buffer table entry is 40 bytes, and every buffer itself is 8
KB. Also, every BufferDesc is padded to 64 bytes (on 64-bit systems).
So a buffer table entry is < 0.5% of the total buffer space.
I assume this means that the benefit of your patch doesn't come from
memory savings, but maybe from spatial locality? As in, saving 0.5% of
memory doesn't seem like a huge win by itself, but maybe the hash table
will see fewer cache misses, since the entries are smaller, so they'll
be packed closer together. Now you can fit 2.7 hash entries on a cache
line, instead of 1.6.
Except, to check the buffer tag itself, now you have to dereference
the offset pointer into the shared BufferDesc array. And every
BufferDesc is padded to 64 bytes (on 64-bit systems), a cache line.
So, now, instead of having the buffer tag adjacent to the hash entry,
you have a memory stall.
If the buffer tags match (i.e., no collision), this is still a
possible win, because you'd have to load the BufferDesc anyway (since
this is the buffer you're looking for).
Does anyone have an idea on how to best benchmark this kind of patch, apart from "running pgbench"? Other ideas on how to improve this? Specific concerns?
What's the expected advantage of reducing the buffer table's entry
size? Better CPU cache usage when there's no collision, but worse in
the less-likely case where there is a collision?
What I mean is, regarding benchmarks: what's the best case scenario
for this kind of patch, and what sort of performance difference would
you expect to see?
Thanks,
James
Zhang Mingli
www.hashdata.xyz
On Jan 30, 2025 at 15:49 +0800, Matthias van de Meent <boekewurm+postgres@gmail.com>, wrote:
Some time ago I noticed that every buffer table entry is quite large at 40 bytes (+8): 16 bytes of HASHELEMENT header (of which the last 4 bytes are padding), 20 bytes of BufferTag, and 4 bytes for the offset into the shared buffers array, with generally 8 more bytes used for the bucket pointers. (32-bit systems: 32 (+4) bytes)
Does anyone know why we must have the buffer tag in the buffer table?
It seems to me we can follow the offset pointer into the shared BufferDesc array whenever we find out we need to compare the tags (as opposed to just the hash, which is stored and present in HASHELEMENT). If we decide to just follow the pointer, we can immediately shave 16 bytes (40%) off the lookup table's per-element size, or 24 if we pack the 4-byte shared buffer offset into the unused bytes in HASHELEMENT, reducing the memory usage of that hash table by ~50%: We'd have 16 bytes for every ELEMENT+shared_buffer_offset, plus 8 bytes for every bucket pointer (of which there are approximately as many as there are elements), resulting in 24 bytes /max_alloc elements.
Hi,
Thanks for your insights.
The buffer tag consumes a relatively small amount of space in the overall shared buffer architecture (which includes the BufferDescriptors array and the page buffers themselves), but every little helps, I think.
Regarding the code, I've read through some. Here are my initial thoughts:
```
int
BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id)
{
const int surrogateid = SurrogateBuffer;
BufferLookupEnt *result;
bool found;
Assert(buf_id >= 0); /* -1 is reserved for not-in-table */
Assert(tagPtr->blockNum != P_NEW); /* invalid tag */
MyBackendBufLookup = tagPtr;
result = (BufferLookupEnt *)
hash_search_with_hash_value(SharedBufHash,
&surrogateid,
hashcode,
HASH_ENTER,
&found);
MyBackendBufLookup = NULL;
if (found) /* found something already in the table */
{
Assert(result->id != SurrogateBuffer);
return result->id;
}
```
In the BufTableInsert function, it appears that the key changes from a BufferTag to an integer surrogateid.
Given that multiple buckets exist based on the hash code, we need to iterate through the bucket's list to find a slot by comparing keys; if surrogateid is set to -1, will the comparison function always return false?
(That is, match(ELEMENTKEY(currBucket), keyPtr, keysize) == 0 inside hash_search_with_hash_value.)
```
while (currBucket != NULL)
{
if (currBucket->hashvalue == hashvalue &&
match(ELEMENTKEY(currBucket), keyPtr, keysize) == 0)
break;
prevBucketPtr = &(currBucket->link);
currBucket = *prevBucketPtr;
```
Additionally, I'm curious about the purpose of MyBackendBufLookup, which is set and reset around the hash_search_with_hash_value call. Is there a concurrency consideration here, even though we have a lock before the buffer insertion?
Also, a potential drawback: databases built on PostgreSQL might manipulate the buffer table directly (e.g., reading it for specific purposes).
In that case, the buffer tags stored in the table would reveal that information without needing to reference the buffer descriptor array.
I understand that Postgres doesn't make any promises about that; it's just a consideration.
On Sat, 1 Feb 2025 at 06:01, Zhang Mingli <zmlpostgres@gmail.com> wrote:
Zhang Mingli
www.hashdata.xyz
On Jan 30, 2025 at 15:49 +0800, Matthias van de Meent <
boekewurm+postgres@gmail.com>, wrote:
Hi,
Thanks for your insights.
The buffer tag consumes a relatively small amount of space in the
overall shared buffer architecture (which includes the BufferDescriptors
array and the page buffers themselves), but every little helps, I think.
Regarding the code, I've read through some. Here are my initial thoughts:
```
int
BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id)
[...]
```
In the BufTableInsert function, it appears that the key changes from a
BufferTag to an integer surrogateid.
Given that multiple buckets exist based on the hash code, we need to
iterate through the bucket's list to find a slot by comparing keys; if
surrogateid is set to -1, will the comparison function always return
false?
Please note the use of our own compare function, which grabs the stored
value of 'key' and uses that buffer ID to find the buffertag of the
indicated buffer (or, in case of surrogateId, the buffertag that was
supplied by the caller of BufTableInsert()).
Additionally, I'm curious about the purpose of MyBackendBufLookup, which
is set and reset around the hash_search_with_hash_value call. Is there a
concurrency consideration here, even though we have a lock before the
buffer insertion?
In multi-threaded PostgreSQL this might need additional consideration, but
in non-reentrant, non-threaded PostgreSQL we shouldn't have any callers to
the BufTableInsert/Delete/Search() functions from inside those same
functions.
Also, a potential drawback: databases built on PostgreSQL might
manipulate the buffer table directly (e.g., reading it for specific
purposes).
In that case, the buffer tags stored in the table would reveal that
information without needing to reference the buffer descriptor array.
I understand that Postgres doesn't make any promises about that; it's
just a consideration.
The buffer lookup table reference is stored in a variable that's private to
buf_table.c. Any user of the table that's not using the BufTable*-functions
would need to be doing some very weird trickery w.r.t. finding out where
the table base pointer is stored and how to access it (though attaching
through shared memory is technically possible, I wouldn't suggest that as a
valid and supported method).
Additionally, anyone who wants to scan that table would have to lock all
partitions for the duration of the scan to correctly scan all entries,
which would be catastrophic for the performance of a cluster that does any
amount of buffer swapping.
It is probably much cheaper to scan the bufferdesc array, as that has
per-entry locks, and the definitions for that array are shared in the
relevant headers, and are thus usable without resorting to magic values.
Kind regards,
Matthias van de Meent
On Fri, 31 Jan 2025 at 18:23, James Hunter <james.hunter.pg@gmail.com> wrote:
On Wed, Jan 29, 2025 at 11:49 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
Hi,
Some time ago I noticed that every buffer table entry is quite large at 40 bytes (+8): 16 bytes of HASHELEMENT header (of which the last 4 bytes are padding), 20 bytes of BufferTag, and 4 bytes for the offset into the shared buffers array, with generally 8 more bytes used for the bucket pointers. (32-bit systems: 32 (+4) bytes)
Does anyone know why we must have the buffer tag in the buffer table?
It seems to me we can follow the offset pointer into the shared BufferDesc array whenever we find out we need to compare the tags (as opposed to just the hash, which is stored and present in HASHELEMENT). If we decide to just follow the pointer, we can immediately shave 16 bytes (40%) off the lookup table's per-element size, or 24 if we pack the 4-byte shared buffer offset into the unused bytes in HASHELEMENT, reducing the memory usage of that hash table by ~50%: ...
So every buffer table entry is 40 bytes, and every buffer itself is 8
KB. Also, every BufferDesc is padded to 64 bytes (on 64-bit systems).
So a buffer table entry is < 0.5% of the total buffer space.
Correct. But note that most buffers generally aren't all being
accessed, and thus aren't (fully) present in CPU caches.
I assume this means that the benefit of your patch doesn't come from
memory savings, but maybe from spatial locality? As in, saving 0.5% of
memory doesn't seem like a huge win by itself, but maybe the hash table
will see fewer cache misses, since the entries are smaller, so they'll
be packed closer together. Now you can fit 2.7 hash entries on a cache
line, instead of 1.6.
Yes - though it's actually 4 entries per cache line: while the bucket
array uses space proportional to the total number of entries, we won't
access that bucket array again once we've followed it, so any subsequent
memory accesses are 2.5x more likely to hit a cached line than they
would be on current master.
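(For concreteness, assuming 64-byte cache lines: 64 / 40 bytes = 1.6
entries per line on master, versus 64 / 16 bytes = 4 entries per line
with the packed layout; the 2.7 figure presumably corresponds to 64 / 24,
i.e. counting the 8-byte bucket pointer against each packed entry, but
those pointers live in a separate array that is only touched once per
lookup.)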
Except, to check the buffer tag itself, now you have to dereference
the offset pointer into the shared BufferDesc array. And every
BufferDesc is padded to 64 bytes (on 64-bit systems), a cache line.
So, now, instead of having the buffer tag adjacent to the hash entry,
you have a memory stall.
Correct - though that only happens for hits and hash collisions. Hash
collisions are considered rare (assuming linear hashing, it'd be
1/2^16), and for hits you would load that buffer tag anyway, so the
cost is a potential cache miss _while holding the buftable lock_, be
it in shared or exclusive mode (most likely: shared, as we don't do
blind updates to the hash table).
If the buffer tags match (i.e., no collision), this is still a
possible win, because you'd have to load the BufferDesc anyway (since
this is the buffer you're looking for).
Exactly.
Does anyone have an idea on how to best benchmark this kind of patch, apart from "running pgbench"? Other ideas on how to improve this? Specific concerns?
What's the expected advantage of reducing the buffer table's entry
size? Better CPU cache usage when there's no collision, but worse in
the less-likely case where there is a collision?
Yes, plus reduced overhead when defining a large shared_buffers. Over
in [0] people discuss changing the size of shared_buffers on-the-fly.
When we have that capability, the overhead of allocating large shared
buffers can become a meaningful fraction of total shared memory in
reduced-shared_buffers systems: Currently, we have ±144
bytes/shared_buffers entry in shared_memory allocations, excluding the
page itself. This means that if someone wants to be able to scale
their buffers with a factor 1:100 min:max, the size of those shared
buffers' metadata allocations is 1.7x as large as the size of the
buffers in its smallest configuration. [^1]
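(Back-of-the-envelope, under the same assumption as footnote [^1] that the
metadata arrays stay sized for the maximum: at 1:100 scaling, each resident
8192-byte page carries 100 * ~144 = ~14400 bytes of metadata, i.e. roughly
1.75x the size of the page data itself.)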
While it is currently definitely not a reasonable expectation to
expect 100x scaling, I think reducing this overhead factor is going to
be very useful in the long run. See also [2], where cache misses in
dynahash are probably near 25% of the total workload - by reducing the
size of these structs we can increase the amount of data we can fit in
the same amount of buffers, thus increasing performance for some
cache-size limited workloads that hit the lookup tables.
What I mean is, regarding benchmarks: what's the best case scenario
for this kind of patch, and what sort of performance difference would
you expect to see?
I assume that for the provided patch:
- there will be no (or minimal) regressions, and improved performance
when the buffer mapping table now fits in l2/l3 caches (if it
previously didn't).
- there can be some regressions if the additional offset calculation
is performance-critical.
- there can be some regressions if an additional cache miss during
redirection while holding the lock is performance-critical.
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
[0]: /messages/by-id/CA+TgmoaGCFPhMjz7veJOeef30=KdpOxgywcLwNbr-Gny-mXwcg@mail.gmail.com
[^1] I make the assumption that when we resize the storage of pages of
shared_buffers we don't also resize the buffer mapping table and the
bufferdesc array, as blocking processes from accessing page data based
on buffer descriptors' state is easier than totally rewiring all
systems that currently assume constant-size bufferdescs. With effort
it might be possible to resize these, but I don't see an easy path
forward for that, while the 8KB page of every shared_buffers slot is
relatively easier to remove from memory.
[2]: /messages/by-id/CA+CZih5R_ZMjcuMHM3Ub10vTNDLEBjW8VsO3h-y2WKb=oK4fWw@mail.gmail.com
On Thu, 30 Jan 2025 at 08:48, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
Together that results in the following prototype patchset.
Here's an alternative patch, which replaces dynahash in the buffer
lookup table with an open-coded replacement that has fewer indirections
during lookups and allows for much improved locality.
Overall, it uses less memory than our current dynahash in all cases; but
compared to my previous patch it improves memory usage only in some
cases, and in others it uses more.
The main design is a separate-chaining hash table (similar to, and
derived from, dynahash's code), with freelists to provide efficient
access to free elements. The innovation over dynahash is that the stored
elements are slotted in between the buckets' element pointers, which
allows accessing a hash element without further cache misses whenever
the element that stores the bucket's data lies on the same cache line
as the bucket itself (very possible). Further, the implementation uses
int-based offsets rather than pointers, reducing pointer-size overhead
on 64-bit systems and limiting the hash table to 2^31 elements (which
seems fine for a buffer mapping table, given our current buffer count
limit of 2^30-1).
Finally, it shows up with accurate memory usage in
pg_shmem_allocations because it uses a single allocation, improving
the debuggability of resource usage vs. dynahash.
0001 implements this hash table, with some minimal optimizations.
0002 adds the optimization where the hash table prefers to pick the
local hash element if it's not already in use.
Future optimizations could implement memory prefetching (so that
buffer tag compares won't have cache misses for memory stalls),
cacheline-local entry selection, and other optimizations, but that'll
need more effort which I don't want to put into it before getting a
green light on "this might actually be a good step in the right
direction".
As to why simplehash was not chosen: Simplehash doesn't (can't?)
support partitioning due to its open addressing model. By taking the
locality of open addressing and combining it with the partitioning
capabilities of separate chaining we get the best of both worlds (or
at least closer than dynahash got).
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
PS. This isn't a perfect solution; it still causes approximately
random memory accesses when we look up sequential pages of a relation.
However, the cost of these memory accesses can now be significantly
reduced without breaking APIs or significantly impacting memory
overheads. Replacing this hashtable with another one seems fairly
straightforward, while replacing it with e.g. radixtree.h would require
a very significant amount of shared-memory allocation work that I'm not
comfortable writing myself.
Attachments:
v1-0001-BufTable-Use-optimized-hash-table-for-buffer-look.patch
From 0b7e0fd25b55c8d869afcbc7b86535f87b32442f Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Tue, 4 Feb 2025 17:03:19 +0100
Subject: [PATCH v1 1/2] BufTable: Use optimized hash table for buffer lookups
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
This uses up to 1/3rd the memory and one less redirection vs the current
dynahash.c table. In the worst case for 64-bit systems, it'll only use 16%
less memory, in the best case ±58%.
Future commits will further optimize allocation of entries to reduce cache
misses.
---
src/backend/storage/buffer/buf_table.c | 459 ++++++++++++++++++++++---
src/backend/storage/buffer/bufmgr.c | 4 +-
src/include/storage/buf_internals.h | 2 +-
3 files changed, 414 insertions(+), 51 deletions(-)
diff --git a/src/backend/storage/buffer/buf_table.c b/src/backend/storage/buffer/buf_table.c
index a50955d5286..b237822285a 100644
--- a/src/backend/storage/buffer/buf_table.c
+++ b/src/backend/storage/buffer/buf_table.c
@@ -21,17 +21,117 @@
*/
#include "postgres.h"
+#include "common/hashfn.h"
#include "storage/buf_internals.h"
+#include "utils/dynahash.h"
-/* entry for buffer lookup hashtable */
-typedef struct
+/*
+ * Hash element: Singly linked list with hash of value contained, and
+ * the reference to the buffer header that contains the to-be-hashed value.
+ *
+ * Note: when negative, buffer is a sentinel value indicating it as a freelist
+ * entry for list at offset (-1 - offset). The user should lock that freelist
+ * to acquire that slot, but not before confirming that the value hasn't
+ * been updated since.
+ */
+typedef struct HashElement
+{
+ int32 next;
+ uint32 hashValue;
+ pg_atomic_uint32 buffer;
+} HashElement;
+
+/*
+ * Free list entry for the hash.
+ *
+ * Note tha
+ */
+typedef struct HashFreeListEntry
+{
+ int32 next;
+ int32 prev;
+ pg_atomic_uint32 tag;
+} HashFreeListEntry;
+
+StaticAssertDecl(offsetof(HashFreeListEntry, tag) == offsetof(HashElement, buffer),
+ "tag and buffer should be at the same offset into HashFreeListEntry and HashElement");
+
+typedef struct HashBucket
+{
+ int32 bucketStartElement;
+ union {
+ HashElement element;
+ HashFreeListEntry freeEntry;
+ };
+} HashBucket;
+
+
+typedef struct FreeList
+{
+ slock_t mutex;
+ int32 firstFree;
+ int32 lastFree;
+ int32 myTag;
+} FreeList;
+
+
+typedef union FreeListPadded
+{
+ FreeList list;
+ char padding[PG_CACHE_LINE_SIZE];
+} FreeListPadded;
+
+typedef struct BufTableHeader
+{
+ uint32 numBuckets;
+ uint32 mask;
+} BufTableHeader;
+
+typedef struct BufTable {
+ union {
+ BufTableHeader hdr;
+ char pad[CACHELINEALIGN(sizeof(BufTableHeader))];
+ };
+ FreeListPadded freeList[NUM_BUFFER_PARTITIONS];
+ HashBucket buckets[FLEXIBLE_ARRAY_MEMBER];
+} BufTable;
+
+static BufTable *bufLookupTable;
+
+#define NumBucketsFromSize(size) (1 << my_log2((size * 5) / 4))
+
+#define FreeListIdxFromHash(hash) (hash % NUM_BUFFER_PARTITIONS)
+
+static inline FreeList *
+BufTableGetFreelist(BufTable *table, uint32 idx)
+{
+ Assert(idx < NUM_BUFFER_PARTITIONS);
+ return &table->freeList[idx].list;
+}
+
+static inline HashBucket *
+BufTableGetBucket(BufTable *table, uint32 hash)
{
- BufferTag key; /* Tag of a disk page */
- int id; /* Associated buffer ID */
-} BufferLookupEnt;
+ int32 offset = (hash & table->hdr.mask);
+
+ Assert(offset < table->hdr.numBuckets);
+
+ return &table->buckets[offset];
+}
-static HTAB *SharedBufHash;
+static inline HashElement *
+BufTableGetElement(BufTable *table, int32 offset)
+{
+ Assert(offset >= 0 && offset <= table->hdr.numBuckets);
+ return &table->buckets[offset].element;
+}
+static inline HashFreeListEntry *
+BufTableGetFreeEntry(BufTable *table, int32 offset)
+{
+ Assert(offset >= 0 && offset <= table->hdr.numBuckets);
+ return &table->buckets[offset].freeEntry;
+}
/*
* Estimate space needed for mapping hashtable
@@ -40,7 +140,53 @@ static HTAB *SharedBufHash;
Size
BufTableShmemSize(int size)
{
- return hash_estimate_size(size, sizeof(BufferLookupEnt));
+ uint64 nbuckets = NumBucketsFromSize(size);
+ Size alloc = 0;
+
+ alloc = add_size(alloc, PG_CACHE_LINE_SIZE);
+ alloc = add_size(alloc, offsetof(BufTable, buckets));
+ alloc = add_size(alloc, mul_size(nbuckets, sizeof(HashBucket)));
+
+ return alloc;
+}
+
+
+static inline void
+add_to_freelist(BufTable *table, HashFreeListEntry *freeEntry, uint32 index)
+{
+ uint32 freeListIdx;
+ FreeList *freeList;
+ int32 next;
+ int32 freeListTag;
+ Assert(freeEntry == BufTableGetFreeEntry(table, index));
+
+ freeListIdx = FreeListIdxFromHash(index);
+ freeList = BufTableGetFreelist(bufLookupTable, freeListIdx);
+ freeListTag = (-1 - (int32) freeListIdx);
+ freeEntry->next = freeListTag;
+ freeEntry->prev = freeListTag;
+
+ S_LOCK(&freeList->mutex);
+ next = freeList->firstFree;
+ if (next < 0)
+ {
+ pg_atomic_write_u32(&freeEntry->tag, (uint32) freeListTag);
+ freeList->firstFree = (int32) index;
+ freeList->lastFree = (int32) index;
+ }
+ else
+ {
+ HashFreeListEntry *prevEntry;
+
+ prevEntry = BufTableGetFreeEntry(table, next);
+ Assert(pg_atomic_read_u32(&prevEntry->tag) == (uint32) freeListTag);
+
+ pg_atomic_write_u32(&freeEntry->tag, (uint32) freeListTag);
+
+ freeEntry->prev = freeList->firstFree;
+ freeList->firstFree = prevEntry->next = (int32) index;
+ }
+ S_UNLOCK(&freeList->mutex);
}
/*
@@ -50,21 +196,132 @@ BufTableShmemSize(int size)
void
InitBufTable(int size)
{
- HASHCTL info;
+ void *baseptr;
+ bool found = false;
+ uint32 nbuckets = NumBucketsFromSize(size);
+
+ baseptr = ShmemInitStruct("Shared Buffer Lookup Table",
+ BufTableShmemSize(size),
+ &found);
+
+ /*
+ * Runs only in postmaster during startup; there is no concurrent process
+ * that can or is allowed to initialize this.
+ */
+ Assert(!found);
+
+ if (found)
+ return;
+
+ Assert(PointerIsAligned(baseptr, void *));
+ bufLookupTable = (void *) CACHELINEALIGN((char *) baseptr);
+
+ /* nbuckets is always a power of 2, so mask is always nbuckets-1 */
+ bufLookupTable->hdr.numBuckets = nbuckets;
+ bufLookupTable->hdr.mask = nbuckets - 1;
+
+ for (int i = 0; i < NUM_BUFFER_PARTITIONS; i++)
+ {
+ FreeList *list = &bufLookupTable->freeList[i].list;
+
+ S_INIT_LOCK(&list->mutex);
+ list->myTag = (-1 - i);
+
+ list->firstFree = list->myTag;
+ list->lastFree = list->myTag;
+ }
+
+ /* build per-partition freelists */
+ for (int i = 0; i < nbuckets; i++)
+ {
+ HashBucket *bucket = BufTableGetBucket(bufLookupTable, i);
+ bucket->bucketStartElement = -1;
+ add_to_freelist(bufLookupTable, &bucket->freeEntry, i);
+ }
+}
+
+/* Convert a hash value to a bucket number */
+static inline int32
+calc_bucket(BufTable *table, uint32 hash_val)
+{
+ uint32 bucket;
+
+ bucket = hash_val & table->hdr.mask;
+
+ return (int32) bucket;
+}
+
+static inline HashElement *
+initial_lookup(BufTable *table, uint32 hashvalue, HashBucket **hashBucket, int32 *idx)
+{
+ HashBucket *bucket;
+ int32 bucketno;
+ int32 startElementOffset;
+
+ bucketno = calc_bucket(table, hashvalue);
+
+ bucket = BufTableGetBucket(bufLookupTable, bucketno);
+ startElementOffset = bucket->bucketStartElement;
+ *idx = startElementOffset;
+ *hashBucket = bucket;
+
+ if (startElementOffset < 0)
+ return NULL;
+
+ if (likely(startElementOffset == bucketno))
+ return &bucket->element;
+
+ return BufTableGetElement(table, startElementOffset);
+}
+
+/*
+ * Get a new free element from a freelist.
+ *
+ * NOTE: This assumes there are free elements available. If that assumption
+ * does not hold this code will loop infinitely.
+ */
+static HashElement *
+buftable_get_free_element(BufTable *table, uint32 hashValue, int32 *pInt)
+{
+ FreeList *freeList;
+ HashFreeListEntry *entry;
+
+ do {
+ for (;; hashValue++)
+ {
+ freeList = BufTableGetFreelist(table, FreeListIdxFromHash(hashValue));
+
+ S_LOCK(&freeList->mutex);
+ if (freeList->firstFree != freeList->myTag)
+ break;
+ S_UNLOCK(&freeList->mutex);
+ }
+
+ entry = BufTableGetFreeEntry(table, freeList->firstFree);
+ *pInt = freeList->firstFree;
+ } while (false);
+
+ Assert(entry->next == freeList->myTag);
- /* assume no locking is needed yet */
+ /* unlink the entry from the linked list*/
+ if (entry->prev != freeList->myTag)
+ {
+ HashFreeListEntry *prevEntry = BufTableGetFreeEntry(table, entry->prev);
+ prevEntry->next = entry->next;
+ freeList->firstFree = entry->prev;
+ }
+ else
+ {
+ Assert(freeList->firstFree == freeList->lastFree);
+ Assert(entry->next == entry->prev);
+ }
- /* BufferTag maps to Buffer */
- info.keysize = sizeof(BufferTag);
- info.entrysize = sizeof(BufferLookupEnt);
- info.num_partitions = NUM_BUFFER_PARTITIONS;
+ S_UNLOCK(&freeList->mutex);
- SharedBufHash = ShmemInitHash("Shared Buffer Lookup Table",
- size, size,
- &info,
- HASH_ELEM | HASH_BLOBS | HASH_PARTITION);
+ return (HashElement *) entry;
}
+
/*
* BufTableHashCode
* Compute the hash code associated with a BufferTag
@@ -77,7 +334,7 @@ InitBufTable(int size)
uint32
BufTableHashCode(BufferTag *tagPtr)
{
- return get_hash_value(SharedBufHash, tagPtr);
+ return hash_bytes((void *) tagPtr, sizeof(BufferTag));
}
/*
@@ -89,19 +346,40 @@ BufTableHashCode(BufferTag *tagPtr)
int
BufTableLookup(BufferTag *tagPtr, uint32 hashcode)
{
- BufferLookupEnt *result;
+ BufTable *table = bufLookupTable;
+ HashBucket *currBucket;
+ HashElement *matchElement;
+ int32 curIdx;
+
+ /*
+ * Do the initial lookup
+ */
+ matchElement = initial_lookup(table, hashcode, &currBucket,
+ &curIdx);
+
+ /* follow the bucket's chain */
+ while (matchElement != NULL)
+ {
+ if (matchElement->hashValue == hashcode && BufferTagsEqual(
+ tagPtr,
+ &GetBufferDescriptor(pg_atomic_read_u32(&matchElement->buffer))->tag
+ ))
+ break;
- result = (BufferLookupEnt *)
- hash_search_with_hash_value(SharedBufHash,
- tagPtr,
- hashcode,
- HASH_FIND,
- NULL);
+ if (matchElement->next < 0)
+ {
+ matchElement = NULL;
+ break;
+ }
- if (!result)
+ curIdx = matchElement->next;
+ matchElement = BufTableGetElement(table, curIdx);
+ }
+
+ if (!matchElement)
return -1;
- return result->id;
+ return (int) pg_atomic_read_u32(&matchElement->buffer);
}
/*
@@ -117,23 +395,64 @@ BufTableLookup(BufferTag *tagPtr, uint32 hashcode)
int
BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id)
{
- BufferLookupEnt *result;
- bool found;
+ BufTable *table = bufLookupTable;
+ HashBucket *currBucket;
+ HashElement *matchElement;
+ int32 *prevNextPtr;
+ int32 matchElemIdx;
Assert(buf_id >= 0); /* -1 is reserved for not-in-table */
Assert(tagPtr->blockNum != P_NEW); /* invalid tag */
- result = (BufferLookupEnt *)
- hash_search_with_hash_value(SharedBufHash,
- tagPtr,
- hashcode,
- HASH_ENTER,
- &found);
+ /*
+ * Do the initial lookup
+ */
+ matchElement = initial_lookup(table, hashcode, &currBucket,
+ &matchElemIdx);
+ prevNextPtr = &currBucket->bucketStartElement;
+
+ while (matchElement != NULL)
+ {
+ /*
+ * Make sure it's a valid element in the chain. We have a lock on
+ * that chain, so any breakage is a broken invariant.
+ */
+ Assert(0 <= (int32) pg_atomic_read_u32(&matchElement->buffer));
+
+ if (matchElement->hashValue == hashcode && BufferTagsEqual(
+ tagPtr,
+ &GetBufferDescriptor(pg_atomic_read_u32(&matchElement->buffer))->tag
+ ))
+ break;
- if (found) /* found something already in the table */
- return result->id;
+ prevNextPtr = &matchElement->next;
- result->id = buf_id;
+ if (matchElement->next < 0)
+ {
+ Assert(matchElement->next == -1);
+ matchElement = NULL;
+ break;
+ }
+
+ matchElemIdx = matchElement->next;
+ matchElement = BufTableGetElement(table, matchElemIdx);
+ }
+
+ /* Return existing buffer ID if we found a pre-existing buffer */
+ if (matchElement != NULL)
+ return (int) pg_atomic_read_u32(&matchElement->buffer);
+
+ /* else, create a new one */
+ matchElement = buftable_get_free_element(table, hashcode, &matchElemIdx);
+
+ /* link into hashbucket chain */
+ *prevNextPtr = matchElemIdx;
+
+ /* copy data into the element */
+ matchElement->next = -1; /* it's the tail of the bucket */
+ matchElement->hashValue = hashcode;
+
+ pg_atomic_write_u32(&matchElement->buffer, (uint32) ((int32) buf_id));
return -1;
}
@@ -145,17 +464,61 @@ BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id)
* Caller must hold exclusive lock on BufMappingLock for tag's partition
*/
void
-BufTableDelete(BufferTag *tagPtr, uint32 hashcode)
+BufTableDelete(BufferTag *tagPtr, uint32 hashcode, int buffer)
{
- BufferLookupEnt *result;
+ BufTable *table = bufLookupTable;
+ HashBucket *currBucket;
+ HashElement *matchElement;
+ int32 *prevNextPtr;
+ int32 matchElemIdx;
+
+ /*
+ * Do the initial lookup
+ */
+ matchElement = initial_lookup(table, hashcode, &currBucket,
+ &matchElemIdx);
+ prevNextPtr = &currBucket->bucketStartElement;
+
+ while (matchElement != NULL)
+ {
+ if (matchElement->hashValue == hashcode &&
+ pg_atomic_read_u32(&matchElement->buffer) == buffer)
+ break;
+
+ prevNextPtr = &matchElement->next;
+
+ if (matchElement->next < 0)
+ {
+ matchElement = NULL;
+ break;
+ }
+
+ matchElemIdx = matchElement->next;
+ matchElement = BufTableGetElement(table, matchElemIdx);
+ }
+
+ if (matchElement != NULL)
+ {
+ /*
+ * We can't test the buffer tag, because that may have been cleared
+ * by the caller by now.
+ */
+ Assert(0 <= (int32) pg_atomic_read_u32(&matchElement->buffer));
- result = (BufferLookupEnt *)
- hash_search_with_hash_value(SharedBufHash,
- tagPtr,
- hashcode,
- HASH_REMOVE,
- NULL);
+ /* unlink the element from the singly-linked list */
+ *prevNextPtr = matchElement->next;
- if (!result) /* shouldn't happen */
- elog(ERROR, "shared buffer hash table corrupted");
+ /*
+ * Remove record from hash bucket's chain.
+ *
+ * Note that bucket chains are singly linked lists, but freelists
+ * doubly linked lists.
+ */
+ add_to_freelist(bufLookupTable, (HashFreeListEntry *) matchElement,
+ matchElemIdx);
+ }
+ else
+ {
+ elog(PANIC, "shared buffer hash table corrupted");
+ }
}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index ee83669992b..362e2d2f134 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1856,7 +1856,7 @@ retry:
* Remove the buffer from the lookup hashtable, if it was in there.
*/
if (oldFlags & BM_TAG_VALID)
- BufTableDelete(&oldTag, oldHash);
+ BufTableDelete(&oldTag, oldHash, buf->buf_id);
/*
* Done with mapping lock.
@@ -1935,7 +1935,7 @@ InvalidateVictimBuffer(BufferDesc *buf_hdr)
Assert(BUF_STATE_GET_REFCOUNT(buf_state) > 0);
/* finally delete buffer from the buffer mapping table */
- BufTableDelete(&tag, hash);
+ BufTableDelete(&tag, hash, buf_hdr->buf_id);
LWLockRelease(partition_lock);
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 1a65342177d..5739bc44c00 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -439,7 +439,7 @@ extern void InitBufTable(int size);
extern uint32 BufTableHashCode(BufferTag *tagPtr);
extern int BufTableLookup(BufferTag *tagPtr, uint32 hashcode);
extern int BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id);
-extern void BufTableDelete(BufferTag *tagPtr, uint32 hashcode);
+extern void BufTableDelete(BufferTag *tagPtr, uint32 hashcode, int buffer);
/* localbuf.c */
extern bool PinLocalBuffer(BufferDesc *buf_hdr, bool adjust_usagecount);
--
2.46.0
v1-0002-BufTable-Optimize-allocation-of-BufTable-entries.patch
From 48e2de1758b93dedc060171acf54d45cb64a06fb Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Tue, 4 Feb 2025 17:51:11 +0100
Subject: [PATCH v1 2/2] BufTable: Optimize allocation of BufTable entries
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Previously the slots were assigned ±randomly based on freelist order;
now we first try to use the slot in the bucket itself before acquiring
slots from the freelists. This reduces the number of pointers we need
to chase for lookups.
---
src/backend/storage/buffer/buf_table.c | 137 ++++++++++++++++++-------
1 file changed, 102 insertions(+), 35 deletions(-)
diff --git a/src/backend/storage/buffer/buf_table.c b/src/backend/storage/buffer/buf_table.c
index b237822285a..b9c302186a7 100644
--- a/src/backend/storage/buffer/buf_table.c
+++ b/src/backend/storage/buffer/buf_table.c
@@ -25,6 +25,8 @@
#include "storage/buf_internals.h"
#include "utils/dynahash.h"
+#define NUM_FREELISTS NUM_BUFFER_PARTITIONS
+
/*
* Hash element: Singly linked list with hash of value contained, and
* the reference to the buffer header that contains the to-be-hashed value.
@@ -69,9 +71,9 @@ typedef struct HashBucket
typedef struct FreeList
{
slock_t mutex;
- int32 firstFree;
- int32 lastFree;
- int32 myTag;
+ int32 prev; /* first free */
+ int32 next; /* last free */
+ int32 freeListTag;
} FreeList;
@@ -92,7 +94,7 @@ typedef struct BufTable {
BufTableHeader hdr;
char pad[CACHELINEALIGN(sizeof(BufTableHeader))];
};
- FreeListPadded freeList[NUM_BUFFER_PARTITIONS];
+ FreeListPadded freeList[NUM_FREELISTS];
HashBucket buckets[FLEXIBLE_ARRAY_MEMBER];
} BufTable;
@@ -100,12 +102,12 @@ static BufTable *bufLookupTable;
#define NumBucketsFromSize(size) (1 << my_log2((size * 5) / 4))
-#define FreeListIdxFromHash(hash) (hash % NUM_BUFFER_PARTITIONS)
+#define FreeListIdxFromHash(hash) (hash % NUM_FREELISTS)
static inline FreeList *
BufTableGetFreelist(BufTable *table, uint32 idx)
{
- Assert(idx < NUM_BUFFER_PARTITIONS);
+ Assert(idx < NUM_FREELISTS);
return &table->freeList[idx].list;
}
@@ -151,7 +153,7 @@ BufTableShmemSize(int size)
}
-static inline void
+static void
add_to_freelist(BufTable *table, HashFreeListEntry *freeEntry, uint32 index)
{
uint32 freeListIdx;
@@ -167,12 +169,14 @@ add_to_freelist(BufTable *table, HashFreeListEntry *freeEntry, uint32 index)
freeEntry->prev = freeListTag;
S_LOCK(&freeList->mutex);
- next = freeList->firstFree;
- if (next < 0)
+ next = freeList->prev;
+ if (next < 0) /* list is empty */
{
pg_atomic_write_u32(&freeEntry->tag, (uint32) freeListTag);
- freeList->firstFree = (int32) index;
- freeList->lastFree = (int32) index;
+ freeList->prev = (int32) index;
+ freeList->next = (int32) index;
+ freeEntry->next = freeListTag;
+ freeEntry->prev = freeListTag;
}
else
{
@@ -183,8 +187,8 @@ add_to_freelist(BufTable *table, HashFreeListEntry *freeEntry, uint32 index)
pg_atomic_write_u32(&freeEntry->tag, (uint32) freeListTag);
- freeEntry->prev = freeList->firstFree;
- freeList->firstFree = prevEntry->next = (int32) index;
+ freeEntry->prev = freeList->prev;
+ freeList->prev = prevEntry->next = (int32) index;
}
S_UNLOCK(&freeList->mutex);
}
@@ -220,15 +224,15 @@ InitBufTable(int size)
bufLookupTable->hdr.numBuckets = nbuckets;
bufLookupTable->hdr.mask = nbuckets - 1;
- for (int i = 0; i < NUM_BUFFER_PARTITIONS; i++)
+ for (int i = 0; i < NUM_FREELISTS; i++)
{
FreeList *list = &bufLookupTable->freeList[i].list;
S_INIT_LOCK(&list->mutex);
- list->myTag = (-1 - i);
+ list->freeListTag = (-1 - i);
- list->firstFree = list->myTag;
- list->lastFree = list->myTag;
+ list->prev = list->freeListTag;
+ list->next = list->freeListTag;
}
/* build per-partition freelists */
@@ -281,39 +285,81 @@ initial_lookup(BufTable *table, uint32 hashvalue, HashBucket **hashBucket, int32
* does not hold this code will loop infinitely.
*/
static HashElement *
-buftable_get_free_element(BufTable *table, uint32 hashValue, int32 *pInt)
+buftable_get_free_element(BufTable *table, HashBucket *bucket, uint32 hashValue,
+ int32 *pInt)
{
FreeList *freeList;
HashFreeListEntry *entry;
+ uint32 freeListIdx = FreeListIdxFromHash(hashValue);
+ freeList = BufTableGetFreelist(table, freeListIdx);
- do {
- for (;; hashValue++)
+ /*
+ * First, try to fit the newly added element into the bucket's
+ * element slot, if there's space.
+ */
+ if (freeList->freeListTag == (int32) pg_atomic_read_u32(&bucket->freeEntry.tag))
+ {
+ S_LOCK(&freeList->mutex);
+ if (likely(freeList->freeListTag == (int32) pg_atomic_read_u32(&bucket->freeEntry.tag)))
{
- freeList = BufTableGetFreelist(table, FreeListIdxFromHash(hashValue));
+ *pInt = calc_bucket(table, hashValue);
+ entry = &bucket->freeEntry;
- S_LOCK(&freeList->mutex);
- if (freeList->firstFree != freeList->myTag)
- break;
- S_UNLOCK(&freeList->mutex);
+ goto freeElementFound;
}
+ S_UNLOCK(&freeList->mutex);
+ }
- entry = BufTableGetFreeEntry(table, freeList->firstFree);
- *pInt = freeList->firstFree;
- } while (false);
+ /*
+ * If all else failed, fall back to pulling the elements from the
+ * freelists directly.
+ */
+ for (;;)
+ {
+ S_LOCK(&freeList->mutex);
+ if (freeList->prev != freeList->freeListTag)
+ break;
+ S_UNLOCK(&freeList->mutex);
+ hashValue++;
+ freeListIdx = FreeListIdxFromHash(hashValue);
+ freeList = BufTableGetFreelist(table, freeListIdx);
+ }
- Assert(entry->next == freeList->myTag);
+ entry = BufTableGetFreeEntry(table, freeList->prev);
+ *pInt = freeList->prev;
- /* unlink the entry from the linked list*/
- if (entry->prev != freeList->myTag)
+ /*
+ * preconditions:
+ *
+ * - entry is set
+ * - *pInt is set to the offset of the entry
+ * - freeList contains the freeList associated with entry
+ * - freeList->mutex is locked
+ */
+freeElementFound:
+ /*
+ * Unlink the entry from the (doubly!) linked list.
+ */
+ if (entry->prev != freeList->freeListTag)
{
HashFreeListEntry *prevEntry = BufTableGetFreeEntry(table, entry->prev);
+ Assert(pg_atomic_read_u32(&prevEntry->tag) == freeList->freeListTag);
prevEntry->next = entry->next;
- freeList->firstFree = entry->prev;
}
else
{
- Assert(freeList->firstFree == freeList->lastFree);
- Assert(entry->next == entry->prev);
+ freeList->next = entry->next;
+ }
+
+ if (entry->next != freeList->freeListTag)
+ {
+ HashFreeListEntry *nextEntry = BufTableGetFreeEntry(table, entry->next);
+ Assert(pg_atomic_read_u32(&nextEntry->tag) == freeList->freeListTag);
+ nextEntry->prev = entry->prev;
+ }
+ else
+ {
+ freeList->prev = entry->prev;
}
S_UNLOCK(&freeList->mutex);
@@ -443,7 +489,8 @@ BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id)
return (int) pg_atomic_read_u32(&matchElement->buffer);
/* else, create a new one */
- matchElement = buftable_get_free_element(table, hashcode, &matchElemIdx);
+ matchElement = buftable_get_free_element(table, currBucket, hashcode,
+ &matchElemIdx);
/* link into hashbucket chain */
*prevNextPtr = matchElemIdx;
@@ -505,6 +552,26 @@ BufTableDelete(BufferTag *tagPtr, uint32 hashcode, int buffer)
*/
Assert(0 <= (int32) pg_atomic_read_u32(&matchElement->buffer));
+ /*
+ * If we're dropping the optimally located hash entry, move the
+ * next element's slot data, and drop that next element instead.
+ */
+ if (&currBucket->element == matchElement && matchElement->next >= 0)
+ {
+ int32 nextIdx = matchElement->next;
+ HashElement *next = BufTableGetElement(table, nextIdx);
+
+ /* copy payloads into the "old deleted" slot */
+ pg_atomic_write_u32(&matchElement->buffer,
+ pg_atomic_read_u32(&next->buffer));
+ matchElement->hashValue = next->hashValue;
+
+ /* Mark the next element for deletion */
+ prevNextPtr = &matchElement->next;
+ matchElement = next;
+ matchElemIdx = nextIdx;
+ }
+
/* unlink the element from the singly-linked list */
*prevNextPtr = matchElement->next;
--
2.46.0
Hi,
On 2025-01-30 08:48:56 +0100, Matthias van de Meent wrote:
Some time ago I noticed that every buffer table entry is quite large at 40
bytes (+8): 16 bytes of HASHELEMENT header (of which the last 4 bytes are
padding), 20 bytes of BufferTag, and 4 bytes for the offset into the shared
buffers array, with generally 8 more bytes used for the bucket pointers.
(32-bit systems: 32 (+4) bytes)
Does anyone know why we must have the buffer tag in the buffer table?
It turns out that having the tag there actually substantially improves the
CPU cache hit ratio on concurrent workloads. The BufferDesc is obviously
frequently modified: each modification (pin, lock) will pull the cacheline
into modified state within a single core, causing higher latency when
accessing that data on other cores. That's obviously not great for a
crucial hashtable... I think it mainly matters for things like inner index
pages etc.
It turns out that these misses due to dirtying already cost us rather dearly
when running read-heavy workloads on bigger machines, mainly due to nbtree
code doing things like BufferGetBlockNumber().
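To make that trade-off concrete, here is a minimal, hypothetical sketch
(the struct and helper names are invented, not taken from either patch)
of the two entry layouts under discussion; only the tag comparison
differs:

#include "postgres.h"
#include "storage/buf_internals.h"

/*
 * Hypothetical sketch: the struct and helper names are invented and not
 * taken from either patch.  With the tag stored in the lookup entry, a
 * tag comparison never touches the BufferDesc cacheline that pins and
 * locks keep dirtying; with a packed entry, every full comparison has to
 * dereference the (frequently modified) BufferDesc.
 */
typedef struct LookupEntWithTag
{
	BufferTag	key;		/* read-only copy of the tag, 20 bytes */
	int			buf_id;
} LookupEntWithTag;

typedef struct LookupEntPacked
{
	int			buf_id;		/* the tag lives only in the BufferDesc */
} LookupEntPacked;

static inline bool
inline_tag_matches(const LookupEntWithTag *ent, const BufferTag *tag)
{
	/* stays within the read-mostly hash table memory */
	return BufferTagsEqual(&ent->key, tag);
}

static inline bool
packed_tag_matches(const LookupEntPacked *ent, const BufferTag *tag)
{
	/* pulls in a cacheline that other backends dirty via pins and locks */
	return BufferTagsEqual(&GetBufferDescriptor(ent->buf_id)->tag, tag);
}

The packed variant saves 20 bytes per entry; the price is that tag
comparisons during the lookup have to read a contended cacheline.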
It'd be interesting to benchmark how a separate, more densely packed,
BufferTag array, shared by the hashtable and the BufferDesc itself, would perform. Obviously
that'd be a somewhat painful change.
It seems to me we can follow the offset pointer into the shared BufferDesc
array whenever we find out we need to compare the tags (as opposed to just
the hash, which is stored and present in HASHELEMENT). If we decide to just
follow the pointer, we can immediately shave 16 bytes (40%) off the lookup
table's per-element size, or 24 if we pack the 4-byte shared buffer offset
into the unused bytes in HASHELEMENT, reducing the memory usage of that
hash table by ~50%: We'd have 16 bytes for every
ELEMENT+shared_buffer_offset, plus 8 bytes for every bucket pointer (of
which there are approximately as many as there are elements), resulting in
24 bytes /max_alloc elements.
It does make the code more branchy...
Does anyone have an idea on how to best benchmark this kind of patch, apart
from "running pgbench"? Other ideas on how to improve this? Specific
concerns?
I'd recommend benchmarking at least the following workloads, all fully shared
buffer cache resident:
1) sequential scans
Tests: very low likelihood of CPU cache hits
2) sequential scan with a parameterized index nested loop over a small table
Tests: mixes low-cache hit likelihood (seqscan) with high cache hit rate
(index scan)
3) small range index scan with param nestloop join to index scan
Tests: Very high cache hit ratio
It's unfortunately fairly important to test these both with a single client
*and* a large number of clients on a multi-socket server. The latter makes
cache miss latency much more visible.
It might be worth looking at perf c2c profiles for before/after, I'd expect it
to change noticeably. Might also be worth looking at perf stat for cache
misses, hitm, etc.
Greetings,
Andres Freund
Hi,
On 2025-02-04 19:58:36 +0100, Matthias van de Meent wrote:
On Thu, 30 Jan 2025 at 08:48, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
Together that results in the following prototype patchset.
Here's an alternative patch, which replaces dynahash in the buffer
lookup table with an open-coded replacement that has fewer
indirections during lookups, and allows for a much increased locality.
Overall, it has less memory usage than our current dynahash in all
cases; but compared to my previous patch it improves memory only in
some cases, in others it uses more memory.
The main design is a separate chaining hash table (similar to and
derived from code of Dynahash) to provide efficient access to free
elements.
Why do you want to use a separate chaining hash table? For something as
performance sensitive as this the lower cache hit ratio seems like a strong
reason to not go for such a hash table "architecture"?
I think it'd be better to work towards not using a hash table for the buffer
mapping table than to write a dedicated hash table implementation for it
though. A hash table
a) doesn't have ordered access support, which causes a bunch of O(NBuffers)
operations and makes things like readahead etc way more expensive
b) doesn't benefit from spatial locality, even though buffer lookups have a
very high degree of spatial locality
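As a toy illustration of point (a), here is a hedged sketch (the function
name is invented; the real code lives in bufmgr.c and is considerably more
involved):

#include "postgres.h"
#include "storage/buf_internals.h"

/*
 * Toy sketch, not actual PostgreSQL code (the real path in bufmgr.c is
 * loosely along these lines): without ordered access by BufferTag, the
 * only way to find "all buffers of this relation" is an O(NBuffers) scan
 * over every buffer header.  An ordered structure keyed on BufferTag
 * could instead jump to the first matching tag and stop after the last.
 * Locking, pinning and the actual invalidation are omitted.
 */
static void
drop_relation_buffers_sketch(RelFileLocator rlocator)
{
	for (int i = 0; i < NBuffers; i++)
	{
		BufferDesc *bufHdr = GetBufferDescriptor(i);

		if (BufTagMatchesRelFileLocator(&bufHdr->tag, &rlocator))
		{
			/* real code would lock the header and invalidate the buffer */
		}
	}
}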
Greetings,
Andres Freund
On Wed, 5 Feb 2025 at 02:22, Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2025-02-04 19:58:36 +0100, Matthias van de Meent wrote:
On Thu, 30 Jan 2025 at 08:48, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
Together that results in the following prototype patchset.
Here's an alternative patch, which replaces dynahash in the buffer
lookup table with an open-coded replacement that has fewer
indirections during lookups, and allows for a much increased locality.
Overall, it has less memory usage than our current dynahash in all
cases; but compared to my previous patch it improves memory only in
some cases, in others it uses more memory.
The main design is a separate chaining hash table (similar to and
derived from code of Dynahash) to provide efficient access to free
elements.
Why do you want to use a separate chaining hash table? For something as
performance sensitive as this the lower cache hit ratio seems like a strong
reason to not go for such a hash table "architecture"?
I think it's hard to defend this patchset against a perfect solution,
but lacking that perfect solution (and given that the current solution
is definitely not great), I'll share my thoughts on why I chose a
different hashmap. While wordy, I'm not trying to defend or attack
anything or anyone, just explaining my reasoning.
My main reasons are:
1.) this is a mostly drop-in replacement that, in the general case, is
better than dynahash in every metric; that is probably preferable to
continuing the status quo, given that dynahash is now really starting
to show its limits (see e.g. the hash_search_with_hash_value thread).
2.) I'm not experienced enough with non-constant-sized shared memory
allocation to get something like radixtree.h implemented in shared
memory. Given that threading is still some releases away, I thought a
chaining hash table (which only needs constant-size allocations and
thus requires only a simple slab allocator; see the sketch after this
list) would be OK if it still improved on the status quo.
Note that radixtree's implementation has node sizes ranging from 16 bytes
up to 1000+ bytes. Building an efficient allocator for that is difficult
(see: the bin-packing problem), assuming it needs to live in shmem and
thus be statically pre-allocated.
3.) simplehash.h can't currently be used because it requires resizing
once the table gets too full, and we can't just dynamically re-allocate
the shared table; that makes it unusable here.
4.) simplehash also invalidates cache lines containing otherwise
unmodified elements due to its relocation strategy, which isn't great
either for cache sharing.
5.) A big reason why dynahash is slow, apart from random unpredictable
memory accesses, is said to be that it needs to follow 3 pointers before
it reaches its first element: the hash table points to a segment, which
points to a bucket, which points to an ELEMENT - and the hash table
itself is an allocated struct and thus needs to be loaded through a
pointer as well. That isn't the only reason (e.g. bucket chains are
expensive, too), but it is probably one large reason why it's so slow.
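To make point 2 concrete (this is the sketch referenced there): with
constant-size elements, a statically pre-allocated array plus an
index-based freelist is the entire "allocator". The names below are
invented and all locking is omitted; none of this is code from either
patch.

#include "postgres.h"

/*
 * Minimal sketch of a fixed-size element pool suitable for shared memory.
 * Because every element has the same size, allocation is just popping an
 * index off a freelist; no bin packing is needed, unlike for a radix
 * tree's variable-size nodes.  Concurrency control is omitted.
 */
typedef struct PoolElement
{
	int32		next_free;		/* index of the next free slot, or -1 */
	/* ... fixed-size payload would go here ... */
} PoolElement;

typedef struct ElementPool
{
	int32		first_free;		/* head of the freelist, or -1 if empty */
	PoolElement elements[FLEXIBLE_ARRAY_MEMBER];
} ElementPool;

static int32
pool_alloc(ElementPool *pool)
{
	int32		idx = pool->first_free;

	if (idx >= 0)
		pool->first_free = pool->elements[idx].next_free;
	return idx;					/* -1 means the pool is exhausted */
}

static void
pool_free(ElementPool *pool, int32 idx)
{
	pool->elements[idx].next_free = pool->first_free;
	pool->first_free = idx;
}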
As for the properties the buffer lookup table cares about, I considered
the following (graded roughly from --- to +++):

cache locality: how quickly can you find an element, starting with no info?
    dynahash (--), newhash (-), simplehash (+), radixtree (~?)
cache line dirtying: how many cache lines are dirtied by insert/delete
operations?
    dynahash (+++), newhash (++), simplehash (--), radixtree (++ ?)
spatial locality: how close are "near" elements to each other in the
structure?
    dynahash (---), newhash (--?), simplehash (-?), radixtree (++)
partitioning: can the structure be written to by multiple backends
concurrently?
    dynahash (+++), newhash (+++), simplehash (---), radixtree (--?)
memory usage: how much memory overhead is allocated for random buffers?
    dynahash (~), newhash (+), simplehash (++), radixtree (-?)
I think it'd be better to work towards not using a hash table for the buffer
mapping table than to write a dedicated hash table implementation for it
though. A hash table
a) doesn't have ordered access support, which causes a bunch of O(NBuffers)
operations and makes things like readahead etc way more expensive
b) doesn't benefit from spatial locality, even though buffer lookups have a
very high degree of spatial locality
While I hear those arguments, I'm not sure it's as easy as you make it
seem. You yourself have said that "dynahash is [a] really poor hash
table implementation." This new hash table was 'only' a few days'
work, and seems to have fixed at least 2 of the major issues we see in
dynahash: it reduces the pointer chasing for the first result to 2 in
the optimal case, and (with some further changes which I have yet to
share) it may achieve a much higher average locality than the current,
unmodified dynahash.
I did look at other solutions (specifically, adapting radixtree.h
and/or simplehash.h), but as I've mentioned above, the memory tooling
required for these solutions was a major turnoff, as it significantly
increases the complexity of such a project.
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
On Wed, 5 Feb 2025 at 02:14, Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2025-01-30 08:48:56 +0100, Matthias van de Meent wrote:
Some time ago I noticed that every buffer table entry is quite large at 40
bytes (+8): 16 bytes of HASHELEMENT header (of which the last 4 bytes are
padding), 20 bytes of BufferTag, and 4 bytes for the offset into the shared
buffers array, with generally 8 more bytes used for the bucket pointers.
(32-bit systems: 32 (+4) bytes)
Does anyone know why we must have the buffer tag in the buffer table?
It turns out to actually substantially improve CPU cache hit ratio on
concurrent workloads. The BufferDesc is obviously frequently modified. Each
modification (pin, lock) will pull the cacheline into modified state, within a
single core. Causing higher latency when accessing that data on other
cores. That's obviously not great for a crucial hashtable... I think it
mainly matters for things like inner index pages etc.
It turns out that these misses due to dirtying already cost us rather dearly
when running read-heavy workloads on bigger machines, mainly due to nbtree
code doing things like BufferGetBlockNumber().
That is something I hadn't thought about, but indeed, you're right
that this wouldn't be great.
It'd be interesting to benchmark how a separate, more densely packed,
BufferTag array, shared by the hashtable and the BufferDesc itself, would perform. Obviously
that'd be a somewhat painful change.
Such a patch is actually not that bad, as surprisingly few files
actually touch BufferDesc (let alone BufferDesc->tag - though the
number of places with changes is still 100+). I've prototyped it, and
it removes another 12 bytes from the overhead of each buffer (assuming
we want to pack BufferDesc to 32 bytes, which is possible).
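As a rough sketch of that direction (the names here are invented, not
taken from the prototype): the tags would live in a dense, read-mostly
array of their own, indexed by buffer id and shared by the lookup table
and any code that only needs the tag.

#include "postgres.h"
#include "storage/buf_internals.h"

/*
 * Hypothetical layout: BufferTags stored in their own dense array,
 * indexed by buffer id.  Code that only needs the tag (such as a
 * BufferGetBlockNumber()-style lookup) could read it here without
 * touching the frequently-dirtied BufferDesc cacheline, and BufferDesc
 * itself can shrink accordingly.
 */
static BufferTag *SharedBufferTags;		/* NBuffers entries, in shmem */

static inline BufferTag *
GetSharedBufferTag(int buf_id)
{
	Assert(buf_id >= 0 && buf_id < NBuffers);
	return &SharedBufferTags[buf_id];
}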
Does anyone have an idea on how to best benchmark this kind of patch, apart
from "running pgbench"? Other ideas on how to improve this? Specific
concerns?
I'd recommend benchmarking at least the following workloads, all fully shared
buffer cache resident:
[...]
Thanks for the suggestions!
It's unfortunately fairly important to test these both with a single client
*and* a large number of clients on a multi-socket server. The latter makes
cache miss latency much more visible.
It might be worth looking at perf c2c profiles for before/after, I'd expect it
to change noticeably. Might also be worth looking at perf stat for cache
misses, hitm, etc.
Hmm. I'll see if I can get some hardware to test this.
FYI, I've pushed a newer version of the 'newhash' approach to Github,
at [0]https://github.com/MMeent/postgres/tree/feat/new-buftable-hash. It extracts the buffer tags from the BufferDesc into their own
array to reduce the problems with false sharing due to pins, and has
some more code to try and forcefully increase the locality of hash
elements.
There are still a few more potential changes that can increase cache
locality of hash elements with the buckets that store their data, but
I have no immediate plans for that.
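For readers who haven't opened the patch, here is a simplified sketch of
the bucket/element layout it uses (field names follow the patch, but
padding, the freelist overlay and the exact field order are left out or
may differ). The point of the "optimally located" slot is that a lookup
hitting it needs no pointer chase beyond loading the bucket itself:

#include "postgres.h"
#include "port/atomics.h"

/*
 * Simplified sketch of the patch's hash table layout.  Each bucket embeds
 * one element slot; further elements hang off a singly linked chain of
 * array indexes, and free slots are recycled through per-partition
 * freelists (not shown).
 */
typedef struct HashElement
{
	int32		next;			/* index of the next chain element, or < 0 */
	uint32		hashValue;		/* full hash of the BufferTag */
	pg_atomic_uint32 buffer;	/* offset into the shared buffer array */
} HashElement;

typedef struct HashBucket
{
	HashElement element;		/* the "optimally located" slot, used first */
	/* in the patch, free slots are overlaid with freelist link data */
} HashBucket;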
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
[0]: https://github.com/MMeent/postgres/tree/feat/new-buftable-hash