Use simplehash.h instead of dynahash in SMgr
Hackers,
Last year, when working on making compactify_tuples() go faster for
19c60ad69, I did quite a bit of benchmarking of the recovery process.
The next thing that was slow after compactify_tuples() was the hash
lookups done in smgropen().
Currently, we use dynahash hash tables to store the SMgrRelation so we
can perform fast lookups by RelFileNodeBackend. However, I had in mind
that a simplehash table might perform better. So I tried it...
The attached converts the hash table lookups done in smgr.c to use
simplehash instead of dynahash.
This does require a few changes in simplehash.h to make it work. The
reason is that RelationData.rd_smgr points directly into the hash table
entries. That works fine for dynahash, as that implementation never
reallocates existing items or moves them around in the table. simplehash,
however, moves entries around all the time, so we can't keep pointers
directly to the hash entries and expect them to remain valid after adding
or removing anything else from the table.
To work around that, I've just made an additional type that serves as the
hash entry and holds a pointer to the SMgrRelationData along with the hash
status and hash value. It's just 16 bytes (or 12 on 32-bit machines). I
opted to keep the hash key in the SMgrRelationData rather than duplicating
it, as that keeps the SMgrEntry struct nice and small. We only need to
dereference the SMgrRelation pointer when we find an entry with the same
hash value. The chances are quite good that an entry with the same hash
value is the one we want, so the additional dereference to compare the key
won't happen very often.
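For reference, the entry type in the attached patch ends up looking like
this:

typedef struct SMgrEntry
{
	int			status;			/* Hash table status */
	uint32		hash;			/* Hash value (cached) */
	SMgrRelation data;			/* Pointer to the SMgrRelationData */
} SMgrEntry;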
I did experiment with putting the hash key in SMgrEntry and found it
to be quite a bit slower. I also did try to use hash_bytes() but
found building a hash function that uses murmurhash32 to be quite a
bit faster.
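For reference, the hash function in the attached patch just combines
per-field murmurhash32() values, rotating the accumulated key left one bit
between fields:

static inline uint32
relfilenodebackend_hash(RelFileNodeBackend *rnode)
{
	uint32		hashkey;

	hashkey = murmurhash32((uint32) rnode->node.spcNode);

	/* rotate hashkey left 1 bit at each step */
	hashkey = (hashkey << 1) | ((hashkey & 0x80000000) ? 1 : 0);
	hashkey ^= murmurhash32((uint32) rnode->node.dbNode);

	hashkey = (hashkey << 1) | ((hashkey & 0x80000000) ? 1 : 0);
	hashkey ^= murmurhash32((uint32) rnode->node.relNode);

	hashkey = (hashkey << 1) | ((hashkey & 0x80000000) ? 1 : 0);
	hashkey ^= murmurhash32((uint32) rnode->backend);

	return hashkey;
}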
Benchmarking
===========
I did some of that. It made my test case about 10% faster.
The test case was basically inserting 100 million rows one at a time
into a hash partitioned table with 1000 partitions and 2 int columns
and a primary key on one of those columns. It was about 12GB of WAL. I
used a hash partitioned table in the hope of creating a fairly
random-looking SMgr hash table access pattern, hopefully something
similar to what might happen in the real world.
Over 10 runs of recovery, master took an average of 124.89 seconds.
The patched version took 113.59 seconds. About 10% faster.
I bumped shared_buffers up to 10GB, max_wal_size to 20GB and
checkpoint_timeout to 60 mins.
To make the benchmark easier to repeat, I applied the attached
recovery_panic.patch.txt. This just PANICs at the end of recovery so that
the database shuts down before performing the end-of-recovery checkpoint.
Just start the database up again to do another run.
I did 10 runs. The end of recovery log message reported:
master (aa271209f)
CPU: user: 117.89 s, system: 5.70 s, elapsed: 123.65 s
CPU: user: 117.81 s, system: 5.74 s, elapsed: 123.62 s
CPU: user: 119.39 s, system: 5.75 s, elapsed: 125.20 s
CPU: user: 117.98 s, system: 4.39 s, elapsed: 122.41 s
CPU: user: 117.92 s, system: 4.79 s, elapsed: 122.76 s
CPU: user: 119.84 s, system: 4.75 s, elapsed: 124.64 s
CPU: user: 120.60 s, system: 5.82 s, elapsed: 126.49 s
CPU: user: 118.74 s, system: 5.71 s, elapsed: 124.51 s
CPU: user: 124.29 s, system: 6.79 s, elapsed: 131.14 s
CPU: user: 118.73 s, system: 5.67 s, elapsed: 124.47 s
master + v1 patch
CPU: user: 106.90 s, system: 4.45 s, elapsed: 111.39 s
CPU: user: 107.31 s, system: 5.98 s, elapsed: 113.35 s
CPU: user: 107.14 s, system: 5.58 s, elapsed: 112.77 s
CPU: user: 105.79 s, system: 5.64 s, elapsed: 111.48 s
CPU: user: 105.78 s, system: 5.80 s, elapsed: 111.63 s
CPU: user: 113.18 s, system: 6.21 s, elapsed: 119.45 s
CPU: user: 107.74 s, system: 4.57 s, elapsed: 112.36 s
CPU: user: 107.42 s, system: 4.62 s, elapsed: 112.09 s
CPU: user: 106.54 s, system: 4.65 s, elapsed: 111.24 s
CPU: user: 113.24 s, system: 6.86 s, elapsed: 120.16 s
I wrote this patch a few days ago. I'm only posting it now as I know a
couple of other people have expressed an interest in working on this.
I didn't really want any duplicate efforts, so thought I'd better post
it now before someone else goes and writes a similar patch.
I'll park this here and have another look at it when the PG15 branch opens.
David
Attachments:
recovery_panic.patch.txt (text/plain)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index adfc6f67e2..aa7accbe1b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7616,6 +7616,7 @@ StartupXLOG(void)
(errmsg("last completed transaction was at log time %s",
timestamptz_to_str(xtime))));
+ elog(PANIC, "recovery PANIC");
InRedo = false;
}
else
v1-0001-Use-simplehash.h-hashtables-in-SMgr.patch (application/octet-stream)
From d8737afb5d368015522b57f502bf1eced4220689 Mon Sep 17 00:00:00 2001
From: "dgrowley@gmail.com" <dgrowley@gmail.com>
Date: Thu, 22 Apr 2021 17:03:46 +1200
Subject: [PATCH v1] Use simplehash.h hashtables in SMgr
The hash table lookups done in SMgr can quite often be a bottleneck during
crash recovery. Traditionally these use dynahash. Here we swap dynahash
out and use simplehash instead. This improves lookup performance.
Some changes are required from simplehash.h here to make this work. The
reason for this is that code external to smgr.c does point to the hashed
SMgrRelation. Since simplehash does reallocate the bucket array when
increasing the size of the table and also shuffle entries around during
deletes, code pointing directly into hash entries would be a bad idea. To
overcome this issue we only store a pointer to the SMgrRelationData in the
hash table entry and maintain a separate allocation for that data. This
does mean an additional pointer dereference during lookups, but only when
the hash value matches, so the significant majority of the time that will
only be done for the item we are actually looking for.
Since the hash table key is stored in the referenced SMgrRelation, we need
to add two new macros to allow simplehash to allocate the memory for the
SMgrEntry during inserts before it tries to set the key. A new macro has
also been added to allow simplehash implementations to perform cleanup
when items are removed from the table.
---
src/backend/storage/smgr/smgr.c | 173 +++++++++++++++++++++++++-------
src/include/lib/simplehash.h | 48 ++++++++-
2 files changed, 182 insertions(+), 39 deletions(-)
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 4dc24649df..64a26e06c6 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -18,14 +18,50 @@
#include "postgres.h"
#include "access/xlog.h"
+#include "common/hashfn.h"
#include "lib/ilist.h"
#include "storage/bufmgr.h"
#include "storage/ipc.h"
#include "storage/md.h"
#include "storage/smgr.h"
-#include "utils/hsearch.h"
#include "utils/inval.h"
+/* Hash table entry type for SMgrRelationHash */
+typedef struct SMgrEntry
+{
+ int status; /* Hash table status */
+ uint32 hash; /* Hash value (cached) */
+ SMgrRelation data; /* Pointer to the SMgrRelationData */
+} SMgrEntry;
+
+static inline uint32 relfilenodebackend_hash(RelFileNodeBackend *rnode);
+static void smgr_entry_cleanup(SMgrRelation reln);
+
+/*
+ * Because simplehash.h does not provide a stable pointer to hash table
+ * entries, we don't make the element type a SMgrRelation directly, instead we
+ * use an SMgrEntry type which has a pointer to the data field. simplehash can
+ * move entries around when adding or removing items from the hash table so
+ * having the SMgrRelation as a pointer inside the SMgrEntry allows external
+ * code to keep their own pointers to the SMgrRelation. Relcache does this.
+ * We use the SH_ENTRY_INITIALIZER to allocate memory for the SMgrRelationData
+ * when a new entry is created. We also define SH_ENTRY_CLEANUP to execute
+ * some cleanup when removing an item from the table.
+ */
+#define SH_PREFIX smgrtable
+#define SH_ELEMENT_TYPE SMgrEntry
+#define SH_KEY_TYPE RelFileNodeBackend
+#define SH_KEY data->smgr_rnode
+#define SH_HASH_KEY(tb, key) relfilenodebackend_hash(&key)
+#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(RelFileNodeBackend)) == 0)
+#define SH_SCOPE static inline
+#define SH_STORE_HASH
+#define SH_GET_HASH(tb, a) a->hash
+#define SH_ENTRY_INITIALIZER(a) a->data = MemoryContextAlloc(TopMemoryContext, sizeof(SMgrRelationData))
+#define SH_ENTRY_CLEANUP(a) smgr_entry_cleanup(a->data)
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
/*
* This struct of function pointers defines the API between smgr.c and
@@ -91,13 +127,62 @@ static const int NSmgr = lengthof(smgrsw);
* Each backend has a hashtable that stores all extant SMgrRelation objects.
* In addition, "unowned" SMgrRelation objects are chained together in a list.
*/
-static HTAB *SMgrRelationHash = NULL;
+static smgrtable_hash *SMgrRelationHash = NULL;
static dlist_head unowned_relns;
/* local function prototypes */
static void smgrshutdown(int code, Datum arg);
+/*
+ * relfilenodebackend_hash
+ * Custom rolled hash function for simplehash table.
+ *
+ * smgropen() is often a bottleneck in CPU bound workloads during crash
+ * recovery. We make use of this custom hash function rather than using
+ * hash_bytes as it gives us a little bit more performance.
+ *
+ * XXX What if sizeof(Oid) is not 4?
+ */
+static inline uint32
+relfilenodebackend_hash(RelFileNodeBackend *rnode)
+{
+ uint32 hashkey;
+
+ hashkey = murmurhash32((uint32) rnode->node.spcNode);
+
+ /* rotate hashkey left 1 bit at each step */
+ hashkey = (hashkey << 1) | ((hashkey & 0x80000000) ? 1 : 0);
+ hashkey ^= murmurhash32((uint32) rnode->node.dbNode);
+
+ hashkey = (hashkey << 1) | ((hashkey & 0x80000000) ? 1 : 0);
+ hashkey ^= murmurhash32((uint32) rnode->node.relNode);
+
+ hashkey = (hashkey << 1) | ((hashkey & 0x80000000) ? 1 : 0);
+ hashkey ^= murmurhash32((uint32) rnode->backend);
+
+ return hashkey;
+}
+
+/*
+ * smgr_entry_cleanup
+ * Cleanup code for simplehash.h to execute when removing an item from
+ * the hash table.
+ */
+static void
+smgr_entry_cleanup(SMgrRelation reln)
+{
+ /*
+ * Unhook the owner pointer, if any. We only do this when we're certain
+ * the entry is removed from the hash table. This allows us to leave the
+ * owner attached if the hash table delete were to fail for some reason.
+ */
+ if (reln->smgr_owner)
+ *reln->smgr_owner = NULL;
+
+ pfree(reln);
+}
+
/*
* smgrinit(), smgrshutdown() -- Initialize or shut down storage
@@ -147,31 +232,26 @@ smgropen(RelFileNode rnode, BackendId backend)
{
RelFileNodeBackend brnode;
SMgrRelation reln;
+ SMgrEntry *entry;
bool found;
- if (SMgrRelationHash == NULL)
+ if (unlikely(SMgrRelationHash == NULL))
{
/* First time through: initialize the hash table */
- HASHCTL ctl;
-
- ctl.keysize = sizeof(RelFileNodeBackend);
- ctl.entrysize = sizeof(SMgrRelationData);
- SMgrRelationHash = hash_create("smgr relation table", 400,
- &ctl, HASH_ELEM | HASH_BLOBS);
+ SMgrRelationHash = smgrtable_create(TopMemoryContext, 400, NULL);
dlist_init(&unowned_relns);
}
/* Look up or create an entry */
brnode.node = rnode;
brnode.backend = backend;
- reln = (SMgrRelation) hash_search(SMgrRelationHash,
- (void *) &brnode,
- HASH_ENTER, &found);
+ entry = smgrtable_insert(SMgrRelationHash, brnode, &found);
+ reln = entry->data;
/* Initialize it if not present before */
if (!found)
{
- /* hash_search already filled in the lookup key */
+ /* smgrtable_insert already filled in the lookup key */
reln->smgr_owner = NULL;
reln->smgr_targblock = InvalidBlockNumber;
for (int i = 0; i <= MAX_FORKNUM; ++i)
@@ -250,10 +330,11 @@ smgrexists(SMgrRelation reln, ForkNumber forknum)
}
/*
- * smgrclose() -- Close and delete an SMgrRelation object.
+ * smgrclose() -- Close and delete an SMgrRelation object but don't
+ * remove from the SMgrRelationHash table.
*/
-void
-smgrclose(SMgrRelation reln)
+static inline void
+smgrclose_internal(SMgrRelation reln)
{
SMgrRelation *owner;
ForkNumber forknum;
@@ -266,17 +347,18 @@ smgrclose(SMgrRelation reln)
if (!owner)
dlist_delete(&reln->node);
- if (hash_search(SMgrRelationHash,
- (void *) &(reln->smgr_rnode),
- HASH_REMOVE, NULL) == NULL)
- elog(ERROR, "SMgrRelation hashtable corrupted");
+}
- /*
- * Unhook the owner pointer, if any. We do this last since in the remote
- * possibility of failure above, the SMgrRelation object will still exist.
- */
- if (owner)
- *owner = NULL;
+/*
+ * smgrclose() -- Close and delete an SMgrRelation object.
+ */
+void
+smgrclose(SMgrRelation reln)
+{
+ smgrclose_internal(reln);
+
+ if (!smgrtable_delete(SMgrRelationHash, reln->smgr_rnode))
+ elog(ERROR, "SMgrRelation hashtable corrupted");
}
/*
@@ -285,17 +367,25 @@ smgrclose(SMgrRelation reln)
void
smgrcloseall(void)
{
- HASH_SEQ_STATUS status;
- SMgrRelation reln;
+ smgrtable_iterator iterator;
+ SMgrEntry *entry;
/* Nothing to do if hashtable not set up */
- if (SMgrRelationHash == NULL)
+ if (unlikely(SMgrRelationHash == NULL))
return;
- hash_seq_init(&status, SMgrRelationHash);
+ smgrtable_start_iterate(SMgrRelationHash, &iterator);
+ while ((entry = smgrtable_iterate(SMgrRelationHash, &iterator)) != NULL)
+ smgrclose_internal(entry->data);
- while ((reln = (SMgrRelation) hash_seq_search(&status)) != NULL)
- smgrclose(reln);
+ /*
+ * Finally, remove all entries from the hash table. This is done last and
+ * in a single operation as we're unable to remove multiple entries in the
+ * above loop due to deletes moving elements around in the table.
+ * Additionally, it is much more efficient to just wipe out all entries
+ * rather than doing individual deletes of each entry.
+ */
+ smgrtable_truncate(SMgrRelationHash);
}
/*
@@ -309,17 +399,24 @@ smgrcloseall(void)
void
smgrclosenode(RelFileNodeBackend rnode)
{
- SMgrRelation reln;
+ SMgrEntry *entry;
/* Nothing to do if hashtable not set up */
if (SMgrRelationHash == NULL)
return;
+ entry = smgrtable_lookup(SMgrRelationHash, rnode);
+ if (entry != NULL)
+ {
+ /* Delete the entry, but skip the hash table delete... */
+ smgrclose_internal(entry->data);
- reln = (SMgrRelation) hash_search(SMgrRelationHash,
- (void *) &rnode,
- HASH_FIND, NULL);
- if (reln != NULL)
- smgrclose(reln);
+ /*
+ * ... as we can remove from the hash table directly due to already
+ * having a pointer to the exact entry we want to delete. This saves
+ * an additional table lookup.
+ */
+ smgrtable_delete_item(SMgrRelationHash, entry);
+ }
}
/*
diff --git a/src/include/lib/simplehash.h b/src/include/lib/simplehash.h
index da51781e98..569104def0 100644
--- a/src/include/lib/simplehash.h
+++ b/src/include/lib/simplehash.h
@@ -50,6 +50,10 @@
* - SH_HASH_KEY(table, key) - generate hash for the key
* - SH_STORE_HASH - if defined the hash is stored in the elements
* - SH_GET_HASH(tb, a) - return the field to store the hash in
+ * - SH_ENTRY_INITIALIZER(a) - if defined, the code in this macro is called
+ * for new entries
+ * - SH_ENTRY_CLEANUP(a) - if defined, the code in this macro is called
+ * when an entry is removed from the hash table.
*
* The element type is required to contain a "status" member that can store
* the range of values defined in the SH_STATUS enum.
@@ -115,6 +119,7 @@
#define SH_LOOKUP SH_MAKE_NAME(lookup)
#define SH_LOOKUP_HASH SH_MAKE_NAME(lookup_hash)
#define SH_GROW SH_MAKE_NAME(grow)
+#define SH_TRUNCATE SH_MAKE_NAME(truncate)
#define SH_START_ITERATE SH_MAKE_NAME(start_iterate)
#define SH_START_ITERATE_AT SH_MAKE_NAME(start_iterate_at)
#define SH_ITERATE SH_MAKE_NAME(iterate)
@@ -224,6 +229,9 @@ SH_SCOPE void SH_DELETE_ITEM(SH_TYPE * tb, SH_ELEMENT_TYPE * entry);
/* bool <prefix>_delete(<prefix>_hash *tb, <key> key) */
SH_SCOPE bool SH_DELETE(SH_TYPE * tb, SH_KEY_TYPE key);
+/* void <prefix>_truncate(<prefix>_hash *tb) */
+SH_SCOPE void SH_TRUNCATE(SH_TYPE * tb);
+
/* void <prefix>_start_iterate(<prefix>_hash *tb, <prefix>_iterator *iter) */
SH_SCOPE void SH_START_ITERATE(SH_TYPE * tb, SH_ITERATOR * iter);
@@ -634,6 +642,9 @@ restart:
if (entry->status == SH_STATUS_EMPTY)
{
tb->members++;
+#ifdef SH_ENTRY_INITIALIZER
+ SH_ENTRY_INITIALIZER(entry);
+#endif
entry->SH_KEY = key;
#ifdef SH_STORE_HASH
SH_GET_HASH(tb, entry) = hash;
@@ -721,6 +732,9 @@ restart:
/* and fill the now empty spot */
tb->members++;
+#ifdef SH_ENTRY_INITIALIZER
+ SH_ENTRY_INITIALIZER(entry);
+#endif
entry->SH_KEY = key;
#ifdef SH_STORE_HASH
@@ -856,7 +870,9 @@ SH_DELETE(SH_TYPE * tb, SH_KEY_TYPE key)
SH_ELEMENT_TYPE *lastentry = entry;
tb->members--;
-
+#ifdef SH_ENTRY_CLEANUP
+ SH_ENTRY_CLEANUP(entry);
+#endif
/*
* Backward shift following elements till either an empty element
* or an element at its optimal position is encountered.
@@ -919,6 +935,9 @@ SH_DELETE_ITEM(SH_TYPE * tb, SH_ELEMENT_TYPE * entry)
curelem = entry - &tb->data[0];
tb->members--;
+#ifdef SH_ENTRY_CLEANUP
+ SH_ENTRY_CLEANUP(entry);
+#endif
/*
* Backward shift following elements till either an empty element or an
@@ -959,6 +978,30 @@ SH_DELETE_ITEM(SH_TYPE * tb, SH_ELEMENT_TYPE * entry)
}
}
+/*
+ * Remove all entries from the table making the table empty.
+ */
+SH_SCOPE void
+SH_TRUNCATE(SH_TYPE * tb)
+{
+ int i;
+
+ for (i = 0; i < tb->size; i++)
+ {
+ SH_ELEMENT_TYPE *entry = &tb->data[i];
+ if (entry->status != SH_STATUS_EMPTY)
+ {
+ entry->status = SH_STATUS_EMPTY;
+
+#ifdef SH_ENTRY_CLEANUP
+ SH_ENTRY_CLEANUP(entry);
+#endif
+ }
+ }
+
+ tb->members = 0;
+}
+
/*
* Initialize iterator.
*/
@@ -1133,6 +1176,8 @@ SH_STAT(SH_TYPE * tb)
#undef SH_DECLARE
#undef SH_DEFINE
#undef SH_GET_HASH
+#undef SH_ENTRY_INITIALIZER
+#undef SH_ENTRY_CLEANUP
#undef SH_STORE_HASH
#undef SH_USE_NONDEFAULT_ALLOCATOR
#undef SH_EQUAL
@@ -1166,6 +1211,7 @@ SH_STAT(SH_TYPE * tb)
#undef SH_LOOKUP
#undef SH_LOOKUP_HASH
#undef SH_GROW
+#undef SH_TRUNCATE
#undef SH_START_ITERATE
#undef SH_START_ITERATE_AT
#undef SH_ITERATE
--
2.27.0
Hi, David
It is quite an interesting result. Simplehash, being open-addressing with
linear probing, is friendly to the CPU cache. I'd recommend defining
SH_FILLFACTOR with a value lower than the default (0.9). I believe 0.75 is
the most suitable for this kind of hash table.
+ /* rotate hashkey left 1 bit at each step */
+ hashkey = (hashkey << 1) | ((hashkey & 0x80000000) ? 1 : 0);
+ hashkey ^= murmurhash32((uint32) rnode->node.dbNode);
Why do you use such a strange rotation expression? I know compilers are
able to translate `h = (h << 1) | (h >> 31)` into a single rotate
instruction. Do they recognize the construction in your code as well?
Your construction looks more like a "multiply-modulo" operation in a 32-bit
Galois field. It is a widely used operation in cryptography, but there it
is done modulo some primitive polynomial, and 0x100000001 is not such a
polynomial. 0x1000000c5 is, therefore it should be:
hashkey = (hashkey << 1) | ((hashkey & 0x80000000) ? 0xc5 : 0);
or
hashkey = (hashkey << 1) | ((uint32)((int32)hashkey >> 31) & 0xc5);
But why not just use hash_combine(uint32 a, uint32 b) instead (defined
in hashfn.h)? Yep, it could be a bit slower, but is it critical?
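For reference, hash_combine() mixes the two values roughly like this
(quoting from memory; see hashfn.h for the authoritative definition):

static inline uint32
hash_combine(uint32 a, uint32 b)
{
	/* golden-ratio style mixing of b into accumulator a */
	a ^= b + 0x9e3779b9 + (a << 6) + (a >> 2);
	return a;
}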
- * smgrclose() -- Close and delete an SMgrRelation object.
+ * smgrclose() -- Close and delete an SMgrRelation object but don't
+ *                remove from the SMgrRelationHash table.
I believe `smgrclose_internal()` should be in this comment.
Still, I don't believe it is worth separating smgrclose_internal from
smgrclose. Is there a measurable performance improvement from this
change? Even if there is, it will be smaller with SH_FILLFACTOR 0.75.
I also don't support the modifications to simplehash.h for
SH_ENTRY_INITIALIZER, SH_ENTRY_CLEANUP and SH_TRUNCATE. The initialization
could comfortably live in smgropen() and the cleanup in smgrclose(). And
then SH_TRUNCATE doesn't mean much.
Summary:
regards,
Yura Sokolov
Thanks for having a look at this.
"On Sun, 25 Apr 2021 at 10:27, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:
It is quite an interesting result. Simplehash, being open-addressing with
linear probing, is friendly to the CPU cache. I'd recommend defining
SH_FILLFACTOR with a value lower than the default (0.9). I believe 0.75 is
the most suitable for this kind of hash table.
You might be right there, although, with the particular benchmark I'm
using the size of the table does not change as a result of that. I'd
need to experiment with varying numbers of relations to see if
dropping the fillfactor helps or hinders performance.
FWIW, the hash stats at the end of recovery are:
LOG: redo done at 3/C6E34F0 system usage: CPU: user: 107.00 s,
system: 5.61 s, elapsed: 112.67 s
LOG: size: 4096, members: 2032, filled: 0.496094, total chain: 997,
max chain: 5, avg chain: 0.490650, total_collisions: 422,
max_collisions: 3, avg_collisions: 0.207677
Perhaps if I try using a number of relations somewhere between 2048 *
0.75 and 2048 * 0.9 then I might see some gains. Because I have 2032,
the hash table grew up to 4096 buckets.
I did a quick test dropping the fillfactor down to 0.4. The aim there
was just to see if having 8192 buckets in this test would make it
faster or slower:
LOG: redo done at 3/C6E34F0 system usage: CPU: user: 109.61 s,
system: 4.28 s, elapsed: 113.93 s
LOG: size: 8192, members: 2032, filled: 0.248047, total chain: 303,
max chain: 2, avg chain: 0.149114, total_collisions: 209,
max_collisions: 2, avg_collisions: 0.102854
it was slightly slower. I guess since the SMgrEntry is just 16 bytes
wide that 4 of these will sit on each cache line which means there is
a 75% chance that the next bucket over is on the same cache line.
Since the average chain length is just 0.49 then we'll mostly just
need to look at a single cache line to find the entry with the correct
hash key.
+ /* rotate hashkey left 1 bit at each step */
+ hashkey = (hashkey << 1) | ((hashkey & 0x80000000) ? 1 : 0);
+ hashkey ^= murmurhash32((uint32) rnode->node.dbNode);
Why do you use such a strange rotation expression? I know compilers are
able to translate `h = (h << 1) | (h >> 31)` into a single rotate
instruction. Do they recognize the construction in your code as well?
Not sure about all compilers, I only checked the earliest version of
clang and gcc at godbolt.org and they both use a single "rol"
instruction. https://godbolt.org/z/1GqdE6T3q
Your construction looks more like a "multiply-modulo" operation in a 32-bit
Galois field. It is a widely used operation in cryptography, but there it
is done modulo some primitive polynomial, and 0x100000001 is not such a
polynomial. 0x1000000c5 is, therefore it should be:
hashkey = (hashkey << 1) | ((hashkey & 0x80000000) ? 0xc5 : 0);
or
hashkey = (hashkey << 1) | ((uint32)((int32)hashkey >> 31) & 0xc5);
That does not really make sense to me. If you're shifting a 32-bit
variable left 31 places then why would you AND with 0xc5? The only
possible result is 1 or 0 depending on if the most significant bit is
on or off. I see gcc and clang are unable to optimise that into an
"rol" instruction. If I swap the "0xc5" for "1", then they're able to
optimise the expression.
But why don't just use hash_combine(uint32 a, uint32 b) instead (defined
in hashfn.h)? Yep, it could be a bit slower, but is it critical?
I had that function in the corner of my eye when writing this, but
TBH, the hash function performance was just too big a factor to slow
it down any further by using the more expensive hash_combine()
function. I saw pretty good performance gains from writing my own hash
function rather than using hash_bytes(). I didn't want to detract from
that by using hash_combine(). Rotating the bits left 1 slot seems
good enough for hash join and hash aggregate, so I don't have any
reason to believe it's a bad way to combine the hash values. Do you?
If you grep the source for "hashkey = (hashkey << 1) | ((hashkey &
0x80000000) ? 1 : 0);", then you'll see where else we do the same
rotate left trick.
- * smgrclose() -- Close and delete an SMgrRelation object.
+ * smgrclose() -- Close and delete an SMgrRelation object but don't
+ *                remove from the SMgrRelationHash table.
I believe `smgrclose_internal()` should be in this comment.
Oops. Yeah, that's a mistake.
Still I don't believe it worth to separate smgrclose_internal from
smgrclose. Is there measurable performance improvement from this
change? Even if there is, it will be lesser with SH_FILLFACTOR 0.75 .
The reason I did that is due to the fact that smgrcloseall() loops
over the entire hash table and removes each entry one by one. The
problem is that if I do a smgrtable_delete or smgrtable_delete_item in
that loop then I'd need to restart the loop each time. Be aware that
a simplehash delete can move entries earlier in the table, so it might
cause us to miss entries during the loop. Restarting the loop each
iteration is not going to be very efficient, so instead, I opted to
make a version of smgrclose() that does not remove from the table so
that I can just wipe out all table entries at the end of the loop. I
called that smgrclose_internal(). Maybe there's a better name, but I
don't really see any realistic way of not having some version that
skips the hash table delete. I was hoping the 5 line comment I added
to smgrcloseall() would explain the reason for the code being written
that way.
An additional small benefit is that smgrclosenode() can get away with
a single hashtable lookup rather than having to lookup the entry again
with smgrtable_delete(). Using smgrtable_delete_item() deletes by
bucket rather than key value which should be a good bit faster in many
cases. I think the SH_ENTRY_CLEANUP macro is quite useful here as I
don't need to worry about NULLing out the smgr_owner in yet another
location where I do a hash delete.
As well I don't support modification simplehash.h for
SH_ENTRY_INITIALIZER,
SH_ENTRY_CLEANUP and SH_TRUNCATE. The initialization could comfortably
live in smgropen and the cleanup in smgrclose. And then SH_TRUNCATE
doesn't mean much.
Can you share what you've got in mind here?
The problem I'm solving with SH_ENTRY_INITIALIZER is the fact that in
SH_INSERT_HASH_INTERNAL(), when we add a new item, we do entry->SH_KEY
= key; to set the new entry's key. Since I have SH_KEY defined as:
#define SH_KEY data->smgr_rnode
then I need some way to allocate the memory for ->data before the key
is set. Doing that in smgropen() is too late. We've already crashed by
then for referencing uninitialised memory.
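For reference, the relevant defines from the v1 patch are:

#define SH_KEY_TYPE RelFileNodeBackend
#define SH_KEY data->smgr_rnode
#define SH_ENTRY_INITIALIZER(a) a->data = MemoryContextAlloc(TopMemoryContext, sizeof(SMgrRelationData))
#define SH_ENTRY_CLEANUP(a) smgr_entry_cleanup(a->data)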
I did try putting the key in SMgrEntry but found the performance to be
quite a bit worse than keeping the SMgrEntry down to 16 bytes. That
makes sense to me as we only need to compare the key when we find an
entry with the same hash value as the one we're looking for. There's a
pretty high chance of that being the entry we want. If I got my hash
function right then the odds are about 1 in 4 billion of it not being
the one we want. The only additional price we pay when we get two
entries with the same hash value is an additional pointer dereference
and a key comparison.
David
David Rowley wrote 2021-04-25 05:23:
Thanks for having a look at this.
"On Sun, 25 Apr 2021 at 10:27, Yura Sokolov <y.sokolov@postgrespro.ru>
wrote:It is quite interesting result. Simplehash being open-addressing with
linear probing is friendly for cpu cache. I'd recommend to define
SH_FILLFACTOR with value lower than default (0.9). I believe 0.75 is
suitable most for such kind of hash table.You might be right there, although, with the particular benchmark I'm
using the size of the table does not change as a result of that. I'd
need to experiment with varying numbers of relations to see if
dropping the fillfactor helps or hinders performance.
FWIW, the hash stats at the end of recovery are:
LOG: redo done at 3/C6E34F0 system usage: CPU: user: 107.00 s,
system: 5.61 s, elapsed: 112.67 s
LOG: size: 4096, members: 2032, filled: 0.496094, total chain: 997,
max chain: 5, avg chain: 0.490650, total_collisions: 422,
max_collisions: 3, avg_collisions: 0.207677
Perhaps if I try using a number of relations somewhere between 2048 *
0.75 and 2048 * 0.9 then I might see some gains. Because I have 2032,
the hash table grew up to 4096 buckets.
I did a quick test dropping the fillfactor down to 0.4. The aim there
was just to see if having 8192 buckets in this test would make it
faster or slower:
LOG: redo done at 3/C6E34F0 system usage: CPU: user: 109.61 s,
system: 4.28 s, elapsed: 113.93 s
LOG: size: 8192, members: 2032, filled: 0.248047, total chain: 303,
max chain: 2, avg chain: 0.149114, total_collisions: 209,
max_collisions: 2, avg_collisions: 0.102854
it was slightly slower.
Certainly. That is because in the unmodified case you've got a fillfactor
of 0.49, since the table has just grown. Somewhere below roughly 0.6 there
is no gain from a lower fillfactor. But if you test it when it is closer
to the upper bound, you will notice a difference. Try to test it with 3600
nodes, for example, if going down to 1800 nodes is not possible.
+ /* rotate hashkey left 1 bit at each step */
+ hashkey = (hashkey << 1) | ((hashkey & 0x80000000) ? 1 : 0);
+ hashkey ^= murmurhash32((uint32) rnode->node.dbNode);
Why do you use such a strange rotation expression? I know compilers are
able to translate `h = (h << 1) | (h >> 31)` into a single rotate
instruction. Do they recognize the construction in your code as well?
Not sure about all compilers, I only checked the earliest version of
clang and gcc at godbolt.org and they both use a single "rol"
instruction. https://godbolt.org/z/1GqdE6T3q
Yep, it looks like all compilers recognize such a construction, with the
single exception of the old icc compiler (both 13.0.1 and 16.0.3):
https://godbolt.org/z/qsrjY5Eof
and all compilers recognize `(h << 1) | (h >> 31)` well.
Your construction looks more like a "multiply-modulo" operation in a 32-bit
Galois field. It is a widely used operation in cryptography, but there it
is done modulo some primitive polynomial, and 0x100000001 is not such a
polynomial. 0x1000000c5 is, therefore it should be:
hashkey = (hashkey << 1) | ((hashkey & 0x80000000) ? 0xc5 : 0);
or
hashkey = (hashkey << 1) | ((uint32)((int32)hashkey >> 31) & 0xc5);
That does not really make sense to me. If you're shifting a 32-bit
variable left 31 places then why would you AND with 0xc5? The only
possible result is 1 or 0 depending on if the most significant bit is
on or off.
That is why there is a cast to signed int before shifting:
`(int32)hashkey >> 31`. The shift is then also signed, i.e. arithmetic,
and the result is 0 or 0xffffffff.
But why not just use hash_combine(uint32 a, uint32 b) instead (defined
in hashfn.h)? Yep, it could be a bit slower, but is it critical?
I had that function in the corner of my eye when writing this, but
TBH, the hash function performance was just too big a factor to slow
it down any further by using the more expensive hash_combine()
function. I saw pretty good performance gains from writing my own hash
function rather than using hash_bytes(). I didn't want to detract from
that by using hash_combine(). Rotating the bits left 1 slot seems
good enough for hash join and hash aggregate, so I don't have any
reason to believe it's a bad way to combine the hash values. Do you?
Well, if I think about it a bit more, these hash values could be combined
using just addition: `hash(a) + hash(b) + hash(c)`.
I was thinking more about consistency in the codebase. But it looks like
both ways (`hash_combine(a,b)` and `rotl(a,1)^b`) are used in the code:
- hash_combine is used once (three lines) in hashTupleDesc() in tupledesc.c
- rotl+xor is used six times:
-- three times (three lines) in execGrouping.c with a construction like
yours
-- three times in jsonb_util.c, multirangetypes.c and rangetypes.c with
`(h << 1) | (h >> 31)`.
Therefore I withdraw my recommendation here.
Looks like there is an opportunity for a micropatch to unify hash combining :-)
If you grep the source for "hashkey = (hashkey << 1) | ((hashkey &
0x80000000) ? 1 : 0);", then you'll see where else we do the same
rotate left trick.
- * smgrclose() -- Close and delete an SMgrRelation object.
+ * smgrclose() -- Close and delete an SMgrRelation object but don't
+ *                remove from the SMgrRelationHash table.
I believe `smgrclose_internal()` should be in this comment.
Oops. Yeah, that's a mistake.
Still I don't believe it worth to separate smgrclose_internal from
smgrclose. Is there measurable performance improvement from this
change? Even if there is, it will be lesser with SH_FILLFACTOR 0.75.
The reason I did that is due to the fact that smgrcloseall() loops
over the entire hash table and removes each entry one by one. The
problem is that if I do a smgrtable_delete or smgrtable_delete_item in
that loop then I'd need to restart the loop each time. Be aware that
a simplehash delete can move entries earlier in the table, so it might
cause us to miss entries during the loop. Restarting the loop each
iteration is not going to be very efficient, so instead, I opted to
make a version of smgrclose() that does not remove from the table so
that I can just wipe out all table entries at the end of the loop. I
called that smgrclose_internal().
If you read the comments in SH_START_ITERATE, you'll see:
 * Search for the first empty element. As deletions during iterations are
 * supported, we want to start/end at an element that cannot be affected
 * by elements being shifted.
 * Iterate backwards, that allows the current element to be deleted, even
 * if there are backward shifts.
Therefore, it is safe to delete during iteration, and it doesn't lead to
nor require a loop restart.
An additional small benefit is that smgrclosenode() can get away with
a single hashtable lookup rather than having to lookup the entry again
with smgrtable_delete(). Using smgrtable_delete_item() deletes by
bucket rather than key value which should be a good bit faster in many
cases. I think the SH_ENTRY_CLEANUP macro is quite useful here as I
don't need to worry about NULLing out the smgr_owner in yet another
location where I do a hash delete.
I doubt it makes sense, since smgrclosenode() is called only from
LocalExecuteInvalidationMessage(), i.e. when another backend drops some
relation. There is no useful performance gain from it.
As well I don't support modification simplehash.h for
SH_ENTRY_INITIALIZER,
SH_ENTRY_CLEANUP and SH_TRUNCATE. The initialization could comfortably
live in smgropen and the cleanup in smgrclose. And then SH_TRUNCATE
doesn't mean much.
Can you share what you've got in mind here?
The problem I'm solving with SH_ENTRY_INITIALIZER is the fact that in
SH_INSERT_HASH_INTERNAL(), when we add a new item, we do entry->SH_KEY
= key; to set the new entry's key. Since I have SH_KEY defined as:
#define SH_KEY data->smgr_rnode
then I need some way to allocate the memory for ->data before the key
is set. Doing that in smgropen() is too late. We've already crashed by
then for referencing uninitialised memory.
Oh, now I see.
I could suggest a workaround:
- use entry->hash as the whole key value and manually resolve hash
collisions with chaining.
But that looks ugly: using a hash table and still resolving collisions
manually. Therefore perhaps SH_ENTRY_INITIALIZER makes sense.
But SH_ENTRY_CLEANUP is abused in the patch: it is not symmetric with
SH_ENTRY_INITIALIZER. It smells bad. `smgr_owner` is better cleaned up
the way it is cleaned now in smgrclose(), because that is less obscure.
And SH_ENTRY_CLEANUP should be just `pfree(a->data)`.
And there is still no reason to have SH_TRUNCATE.
I did try putting the key in SMgrEntry but found the performance to be
quite a bit worse than keeping the SMgrEntry down to 16 bytes. That
makes sense to me as we only need to compare the key when we find an
entry with the same hash value as the one we're looking for. There's a
pretty high chance of that being the entry we want. If I got my hash
function right then the odds are about 1 in 4 billion of it not being
the one we want. The only additional price we pay when we get two
entries with the same hash value is an additional pointer dereference
and a key comparison.
That makes sense: the whole benefit of simplehash is cache locality, and
that is gained with a smaller entry.
regards,
Yura Sokolov
On Sun, 25 Apr 2021 at 18:48, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:
If you read the comments in SH_START_ITERATE, you'll see:
 * Search for the first empty element. As deletions during iterations are
 * supported, we want to start/end at an element that cannot be affected
 * by elements being shifted.
 * Iterate backwards, that allows the current element to be deleted, even
 * if there are backward shifts.
Therefore, it is safe to delete during iteration, and it doesn't lead to
nor require a loop restart.
I had only skimmed that with a pre-loaded assumption that it wouldn't
be safe. I didn't do a very good job of reading it as I failed to
notice that the lack of guarantees is about deleting items other than the
current one. I didn't consider the option of finding a free bucket
then looping backwards to avoid missing entries that are moved up
during a delete.
With that, I changed the patch to get rid of the SH_TRUNCATE and got
rid of the smgrclose_internal which skips the hash delete. The code
is now much more similar to how it was.
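With deletes being safe during iteration, smgrcloseall() can simply close
and delete each entry as it walks the table, along these lines (just a
sketch; see the attached v2 patch for the actual code):

void
smgrcloseall(void)
{
	smgrtable_iterator iterator;
	SMgrEntry  *entry;

	/* Nothing to do if hashtable not set up */
	if (SMgrRelationHash == NULL)
		return;

	/*
	 * simplehash iterates backwards, so it's safe for smgrclose() to
	 * delete the current entry from the table as we go.
	 */
	smgrtable_start_iterate(SMgrRelationHash, &iterator);
	while ((entry = smgrtable_iterate(SMgrRelationHash, &iterator)) != NULL)
		smgrclose(entry->data);
}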
As for the hashing stuff, I added a new function to pg_bitutils.h to
rotate left, and I'm using that instead of the expression that was taken
from nodeHash.c.
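That helper is essentially the usual rotate-left idiom, something like
this (a sketch; the real definition is in the attached v2 patch):

static inline uint32
pg_rotate_left32(uint32 word, int n)
{
	/* rotate left by n bits (for 0 < n < 32); compiles to a single rol */
	return (word << n) | (word >> (32 - n));
}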
For the hash function, I've done some further benchmarking with:
1) The attached v2 patch
2) The attached v2 patch plus use_hash_combine.patch.txt, which uses
hash_combine() instead of pg_rotate_left32()ing the hashkey each time.
3) The attached v2 with use_hash_bytes.patch.txt applied.
4) Master
I've also included the hash stats from each version of the hash function.
I hope the numbers help indicate the reason I picked the hash function
that I did.
1) v2 patch.
CPU: user: 108.23 s, system: 6.97 s, elapsed: 115.63 s
CPU: user: 114.78 s, system: 6.88 s, elapsed: 121.71 s
CPU: user: 107.53 s, system: 5.70 s, elapsed: 113.28 s
CPU: user: 108.43 s, system: 5.73 s, elapsed: 114.22 s
CPU: user: 106.18 s, system: 5.73 s, elapsed: 111.96 s
CPU: user: 108.04 s, system: 5.29 s, elapsed: 113.39 s
CPU: user: 107.64 s, system: 5.64 s, elapsed: 113.34 s
CPU: user: 106.64 s, system: 5.58 s, elapsed: 112.27 s
CPU: user: 107.91 s, system: 5.40 s, elapsed: 113.36 s
CPU: user: 115.35 s, system: 6.60 s, elapsed: 122.01 s
Median = 113.375 s
LOG: size: 4096, members: 2032, filled: 0.496094, total chain: 997,
max chain: 5, avg chain: 0.490650, total_collisions: 422,
max_collisions: 3, avg_collisions: 0.207677
2) v2 patch + use_hash_combine.patch.txt
CPU: user: 113.22 s, system: 5.52 s, elapsed: 118.80 s
CPU: user: 116.63 s, system: 5.87 s, elapsed: 122.56 s
CPU: user: 115.33 s, system: 5.73 s, elapsed: 121.12 s
CPU: user: 113.11 s, system: 5.61 s, elapsed: 118.78 s
CPU: user: 112.56 s, system: 5.51 s, elapsed: 118.13 s
CPU: user: 114.55 s, system: 5.80 s, elapsed: 120.40 s
CPU: user: 121.79 s, system: 6.45 s, elapsed: 128.29 s
CPU: user: 113.98 s, system: 4.50 s, elapsed: 118.52 s
CPU: user: 113.24 s, system: 5.63 s, elapsed: 118.93 s
CPU: user: 114.11 s, system: 5.60 s, elapsed: 119.78 s
Median = 119.355 s
LOG: size: 4096, members: 2032, filled: 0.496094, total chain: 971,
max chain: 6, avg chain: 0.477854, total_collisions: 433,
max_collisions: 4, avg_collisions: 0.213091
3) v2 patch + use_hash_bytes.patch.txt
CPU: user: 120.87 s, system: 6.69 s, elapsed: 127.62 s
CPU: user: 112.40 s, system: 4.68 s, elapsed: 117.14 s
CPU: user: 113.19 s, system: 5.44 s, elapsed: 118.69 s
CPU: user: 112.15 s, system: 4.73 s, elapsed: 116.93 s
CPU: user: 111.10 s, system: 5.59 s, elapsed: 116.74 s
CPU: user: 112.03 s, system: 5.74 s, elapsed: 117.82 s
CPU: user: 113.69 s, system: 4.33 s, elapsed: 118.07 s
CPU: user: 113.30 s, system: 4.19 s, elapsed: 117.55 s
CPU: user: 112.77 s, system: 5.57 s, elapsed: 118.39 s
CPU: user: 112.25 s, system: 4.59 s, elapsed: 116.88 s
Median = 117.685 s
LOG: size: 4096, members: 2032, filled: 0.496094, total chain: 900,
max chain: 4, avg chain: 0.442913, total_collisions: 415,
max_collisions: 4, avg_collisions: 0.204232
4) master
CPU: user: 117.89 s, system: 5.7 s, elapsed: 123.65 s
CPU: user: 117.81 s, system: 5.74 s, elapsed: 123.62 s
CPU: user: 119.39 s, system: 5.75 s, elapsed: 125.2 s
CPU: user: 117.98 s, system: 4.39 s, elapsed: 122.41 s
CPU: user: 117.92 s, system: 4.79 s, elapsed: 122.76 s
CPU: user: 119.84 s, system: 4.75 s, elapsed: 124.64 s
CPU: user: 120.6 s, system: 5.82 s, elapsed: 126.49 s
CPU: user: 118.74 s, system: 5.71 s, elapsed: 124.51 s
CPU: user: 124.29 s, system: 6.79 s, elapsed: 131.14 s
CPU: user: 118.73 s, system: 5.67 s, elapsed: 124.47 s
Median = 124.49 s
You can see that the bare v2 patch is quite a bit faster than any of
the alternatives. We'd be better off with hash_bytes than with
hash_combine(), both for performance and for the seemingly better job
the hash function does at reducing hash collisions.
David
Attachments:
use_hash_bytes.patch.txt (text/plain)
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 8310f73212..33e5cadd03 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -34,8 +34,6 @@ typedef struct SMgrEntry
SMgrRelation data; /* Pointer to the SMgrRelationData */
} SMgrEntry;
-static inline uint32 relfilenodebackend_hash(RelFileNodeBackend *rnode);
-
/*
* Because simplehash.h does not provide a stable pointer to hash table
* entries, we don't make the element type a SMgrRelation directly, instead we
@@ -51,7 +49,7 @@ static inline uint32 relfilenodebackend_hash(RelFileNodeBackend *rnode);
#define SH_ELEMENT_TYPE SMgrEntry
#define SH_KEY_TYPE RelFileNodeBackend
#define SH_KEY data->smgr_rnode
-#define SH_HASH_KEY(tb, key) relfilenodebackend_hash(&key)
+#define SH_HASH_KEY(tb, key) hash_bytes((const unsigned char *) &key, sizeof(RelFileNodeBackend))
#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(RelFileNodeBackend)) == 0)
#define SH_SCOPE static inline
#define SH_STORE_HASH
@@ -133,37 +131,6 @@ static dlist_head unowned_relns;
/* local function prototypes */
static void smgrshutdown(int code, Datum arg);
-/*
- * relfilenodebackend_hash
- * Custom rolled hash function for simplehash table.
- *
- * smgropen() is often a bottleneck in CPU bound workloads during crash
- * recovery. We make use of this custom hash function rather than using
- * hash_bytes as it gives us a little bit more performance.
- *
- * XXX What if sizeof(Oid) is not 4?
- */
-static inline uint32
-relfilenodebackend_hash(RelFileNodeBackend *rnode)
-{
- uint32 hashkey;
-
- hashkey = murmurhash32((uint32) rnode->node.spcNode);
-
- /* rotate hashkey left 1 bit at each step */
- hashkey = pg_rotate_left32(hashkey, 1);
- hashkey ^= murmurhash32((uint32) rnode->node.dbNode);
-
- hashkey = pg_rotate_left32(hashkey, 1);
- hashkey ^= murmurhash32((uint32) rnode->node.relNode);
-
- hashkey = pg_rotate_left32(hashkey, 1);
- hashkey ^= murmurhash32((uint32) rnode->backend);
-
- return hashkey;
-}
-
-
/*
* smgrinit(), smgrshutdown() -- Initialize or shut down storage
* managers.
v2-0001-Use-simplehash.h-hashtables-in-SMgr.patch (application/octet-stream)
From 3b4f2c964e40c36824ed18658a53f98568e20af7 Mon Sep 17 00:00:00 2001
From: "dgrowley@gmail.com" <dgrowley@gmail.com>
Date: Thu, 22 Apr 2021 17:03:46 +1200
Subject: [PATCH v2] Use simplehash.h hashtables in SMgr
The hash table lookups done in SMgr can quite often be a bottleneck during
crash recovery. Traditionally these use dynahash. Here we swap dynahash
out and use simplehash instead. This improves lookup performance.
Some changes are required from simplehash.h here to make this work. The
reason for this is that code external to smgr.c does point to the hashed
SMgrRelation. Since simplehash does reallocate the bucket array when
increasing the size of the table and also shuffle entries around during
deletes, code pointing directly into hash entries would be a bad idea. To
overcome this issue we only store a pointer to the SMgrRelationData in the
hash table entry and maintain a separate allocation for that data. This
does mean an additional pointer dereference during lookups, but only when
the hash value matches, so the significant majority of the time that will
only be done for the item we are actually looking for.
Since the hash table key is stored in the referenced SMgrRelation, we need
to add two new macros to allow simplehash to allocate the memory for the
SMgrEntry during inserts before it tries to set the key. A new macro has
also been added to allow simplehash implementations to perform cleanup
when items are removed from the table.
---
src/backend/storage/smgr/smgr.c | 111 ++++++++++++++++++++++++--------
src/include/lib/simplehash.h | 20 +++++-
src/include/port/pg_bitutils.h | 13 +++-
3 files changed, 114 insertions(+), 30 deletions(-)
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 4dc24649df..8310f73212 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -18,14 +18,49 @@
#include "postgres.h"
#include "access/xlog.h"
+#include "common/hashfn.h"
#include "lib/ilist.h"
#include "storage/bufmgr.h"
#include "storage/ipc.h"
#include "storage/md.h"
#include "storage/smgr.h"
-#include "utils/hsearch.h"
#include "utils/inval.h"
+/* Hash table entry type for SMgrRelationHash */
+typedef struct SMgrEntry
+{
+ int status; /* Hash table status */
+ uint32 hash; /* Hash value (cached) */
+ SMgrRelation data; /* Pointer to the SMgrRelationData */
+} SMgrEntry;
+
+static inline uint32 relfilenodebackend_hash(RelFileNodeBackend *rnode);
+
+/*
+ * Because simplehash.h does not provide a stable pointer to hash table
+ * entries, we don't make the element type a SMgrRelation directly, instead we
+ * use an SMgrEntry type which has a pointer to the data field. simplehash can
+ * move entries around when adding or removing items from the hash table so
+ * having the SMgrRelation as a pointer inside the SMgrEntry allows external
+ * code to keep their own pointers to the SMgrRelation. Relcache does this.
+ * We use the SH_ENTRY_INITIALIZER to allocate memory for the SMgrRelationData
+ * when a new entry is created. We also define SH_ENTRY_CLEANUP to execute
+ * some cleanup when removing an item from the table.
+ */
+#define SH_PREFIX smgrtable
+#define SH_ELEMENT_TYPE SMgrEntry
+#define SH_KEY_TYPE RelFileNodeBackend
+#define SH_KEY data->smgr_rnode
+#define SH_HASH_KEY(tb, key) relfilenodebackend_hash(&key)
+#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(RelFileNodeBackend)) == 0)
+#define SH_SCOPE static inline
+#define SH_STORE_HASH
+#define SH_GET_HASH(tb, a) a->hash
+#define SH_ENTRY_INITIALIZER(a) a->data = MemoryContextAlloc(TopMemoryContext, sizeof(SMgrRelationData))
+#define SH_ENTRY_CLEANUP(a) pfree(a->data)
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
/*
* This struct of function pointers defines the API between smgr.c and
@@ -91,13 +126,43 @@ static const int NSmgr = lengthof(smgrsw);
* Each backend has a hashtable that stores all extant SMgrRelation objects.
* In addition, "unowned" SMgrRelation objects are chained together in a list.
*/
-static HTAB *SMgrRelationHash = NULL;
+static smgrtable_hash *SMgrRelationHash = NULL;
static dlist_head unowned_relns;
/* local function prototypes */
static void smgrshutdown(int code, Datum arg);
+/*
+ * relfilenodebackend_hash
+ * Custom rolled hash function for simplehash table.
+ *
+ * smgropen() is often a bottleneck in CPU bound workloads during crash
+ * recovery. We make use of this custom hash function rather than using
+ * hash_bytes as it gives us a little bit more performance.
+ *
+ * XXX What if sizeof(Oid) is not 4?
+ */
+static inline uint32
+relfilenodebackend_hash(RelFileNodeBackend *rnode)
+{
+ uint32 hashkey;
+
+ hashkey = murmurhash32((uint32) rnode->node.spcNode);
+
+ /* rotate hashkey left 1 bit at each step */
+ hashkey = pg_rotate_left32(hashkey, 1);
+ hashkey ^= murmurhash32((uint32) rnode->node.dbNode);
+
+ hashkey = pg_rotate_left32(hashkey, 1);
+ hashkey ^= murmurhash32((uint32) rnode->node.relNode);
+
+ hashkey = pg_rotate_left32(hashkey, 1);
+ hashkey ^= murmurhash32((uint32) rnode->backend);
+
+ return hashkey;
+}
+
/*
* smgrinit(), smgrshutdown() -- Initialize or shut down storage
@@ -147,31 +212,26 @@ smgropen(RelFileNode rnode, BackendId backend)
{
RelFileNodeBackend brnode;
SMgrRelation reln;
+ SMgrEntry *entry;
bool found;
- if (SMgrRelationHash == NULL)
+ if (unlikely(SMgrRelationHash == NULL))
{
/* First time through: initialize the hash table */
- HASHCTL ctl;
-
- ctl.keysize = sizeof(RelFileNodeBackend);
- ctl.entrysize = sizeof(SMgrRelationData);
- SMgrRelationHash = hash_create("smgr relation table", 400,
- &ctl, HASH_ELEM | HASH_BLOBS);
+ SMgrRelationHash = smgrtable_create(TopMemoryContext, 400, NULL);
dlist_init(&unowned_relns);
}
/* Look up or create an entry */
brnode.node = rnode;
brnode.backend = backend;
- reln = (SMgrRelation) hash_search(SMgrRelationHash,
- (void *) &brnode,
- HASH_ENTER, &found);
+ entry = smgrtable_insert(SMgrRelationHash, brnode, &found);
+ reln = entry->data;
/* Initialize it if not present before */
if (!found)
{
- /* hash_search already filled in the lookup key */
+ /* smgrtable_insert already filled in the lookup key */
reln->smgr_owner = NULL;
reln->smgr_targblock = InvalidBlockNumber;
for (int i = 0; i <= MAX_FORKNUM; ++i)
@@ -266,9 +326,7 @@ smgrclose(SMgrRelation reln)
if (!owner)
dlist_delete(&reln->node);
- if (hash_search(SMgrRelationHash,
- (void *) &(reln->smgr_rnode),
- HASH_REMOVE, NULL) == NULL)
+ if (!smgrtable_delete(SMgrRelationHash, reln->smgr_rnode))
elog(ERROR, "SMgrRelation hashtable corrupted");
/*
@@ -285,17 +343,17 @@ smgrclose(SMgrRelation reln)
void
smgrcloseall(void)
{
- HASH_SEQ_STATUS status;
- SMgrRelation reln;
+ smgrtable_iterator iterator;
+ SMgrEntry *entry;
/* Nothing to do if hashtable not set up */
if (SMgrRelationHash == NULL)
return;
- hash_seq_init(&status, SMgrRelationHash);
+ smgrtable_start_iterate(SMgrRelationHash, &iterator);
- while ((reln = (SMgrRelation) hash_seq_search(&status)) != NULL)
- smgrclose(reln);
+ while ((entry = smgrtable_iterate(SMgrRelationHash, &iterator)) != NULL)
+ smgrclose(entry->data);
}
/*
@@ -309,17 +367,14 @@ smgrcloseall(void)
void
smgrclosenode(RelFileNodeBackend rnode)
{
- SMgrRelation reln;
+ SMgrEntry *entry;
/* Nothing to do if hashtable not set up */
if (SMgrRelationHash == NULL)
return;
-
- reln = (SMgrRelation) hash_search(SMgrRelationHash,
- (void *) &rnode,
- HASH_FIND, NULL);
- if (reln != NULL)
- smgrclose(reln);
+ entry = smgrtable_lookup(SMgrRelationHash, rnode);
+ if (entry != NULL)
+ smgrclose(entry->data);
}
/*
diff --git a/src/include/lib/simplehash.h b/src/include/lib/simplehash.h
index da51781e98..4fce182de6 100644
--- a/src/include/lib/simplehash.h
+++ b/src/include/lib/simplehash.h
@@ -50,6 +50,11 @@
* - SH_HASH_KEY(table, key) - generate hash for the key
* - SH_STORE_HASH - if defined the hash is stored in the elements
* - SH_GET_HASH(tb, a) - return the field to store the hash in
+ * - SH_ENTRY_INITIALIZER(a) - if defined, the code in this macro is called
+ * for new entries directly before any other internal code makes any
+ * changes to setup the new entry
+ * - SH_ENTRY_CLEANUP(a) - if defined, the code in this macro is called
+ * when an entry is removed from the hash table
*
* The element type is required to contain a "status" member that can store
* the range of values defined in the SH_STATUS enum.
@@ -634,6 +639,9 @@ restart:
if (entry->status == SH_STATUS_EMPTY)
{
tb->members++;
+#ifdef SH_ENTRY_INITIALIZER
+ SH_ENTRY_INITIALIZER(entry);
+#endif
entry->SH_KEY = key;
#ifdef SH_STORE_HASH
SH_GET_HASH(tb, entry) = hash;
@@ -721,6 +729,9 @@ restart:
/* and fill the now empty spot */
tb->members++;
+#ifdef SH_ENTRY_INITIALIZER
+ SH_ENTRY_INITIALIZER(entry);
+#endif
entry->SH_KEY = key;
#ifdef SH_STORE_HASH
@@ -856,7 +867,9 @@ SH_DELETE(SH_TYPE * tb, SH_KEY_TYPE key)
SH_ELEMENT_TYPE *lastentry = entry;
tb->members--;
-
+#ifdef SH_ENTRY_CLEANUP
+ SH_ENTRY_CLEANUP(entry);
+#endif
/*
* Backward shift following elements till either an empty element
* or an element at its optimal position is encountered.
@@ -919,6 +932,9 @@ SH_DELETE_ITEM(SH_TYPE * tb, SH_ELEMENT_TYPE * entry)
curelem = entry - &tb->data[0];
tb->members--;
+#ifdef SH_ENTRY_CLEANUP
+ SH_ENTRY_CLEANUP(entry);
+#endif
/*
* Backward shift following elements till either an empty element or an
@@ -1133,6 +1149,8 @@ SH_STAT(SH_TYPE * tb)
#undef SH_DECLARE
#undef SH_DEFINE
#undef SH_GET_HASH
+#undef SH_ENTRY_INITIALIZER
+#undef SH_ENTRY_CLEANUP
#undef SH_STORE_HASH
#undef SH_USE_NONDEFAULT_ALLOCATOR
#undef SH_EQUAL
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index f9b77ec278..581957fe55 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -215,7 +215,8 @@ extern int (*pg_popcount64) (uint64 word);
extern uint64 pg_popcount(const char *buf, int bytes);
/*
- * Rotate the bits of "word" to the right by n bits.
+ * pg_rotate_right32
+ * Rotate the bits of 'word' to the right by 'n' bits.
*/
static inline uint32
pg_rotate_right32(uint32 word, int n)
@@ -223,4 +224,14 @@ pg_rotate_right32(uint32 word, int n)
return (word >> n) | (word << (sizeof(word) * BITS_PER_BYTE - n));
}
+/*
+ * pg_rotate_left32
+ * Rotate the bits of 'word' to the left by 'n' bits.
+ */
+static inline uint32
+pg_rotate_left32(uint32 word, int n)
+{
+ return (word << n) | (word >> (sizeof(word) * BITS_PER_BYTE - n));
+}
+
#endif /* PG_BITUTILS_H */
--
2.21.0.windows.1
use_hash_combine.patch.txt (text/plain)
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 8310f73212..f291ded795 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -147,18 +147,19 @@ static inline uint32
relfilenodebackend_hash(RelFileNodeBackend *rnode)
{
uint32 hashkey;
+ uint32 hashkey2;
hashkey = murmurhash32((uint32) rnode->node.spcNode);
/* rotate hashkey left 1 bit at each step */
- hashkey = pg_rotate_left32(hashkey, 1);
- hashkey ^= murmurhash32((uint32) rnode->node.dbNode);
+ hashkey2 = murmurhash32((uint32) rnode->node.dbNode);
+ hashkey = hash_combine(hashkey, hashkey2);
- hashkey = pg_rotate_left32(hashkey, 1);
- hashkey ^= murmurhash32((uint32) rnode->node.relNode);
+ hashkey2 = murmurhash32((uint32) rnode->node.relNode);
+ hashkey = hash_combine(hashkey, hashkey2);
- hashkey = pg_rotate_left32(hashkey, 1);
- hashkey ^= murmurhash32((uint32) rnode->backend);
+ hashkey2 = murmurhash32((uint32) rnode->backend);
+ hashkey = hash_combine(hashkey, hashkey2);
return hashkey;
}
recovery_panic.patch.txt (text/plain)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index adfc6f67e2..5d34e06eab 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7615,7 +7615,8 @@ StartupXLOG(void)
ereport(LOG,
(errmsg("last completed transaction was at log time %s",
timestamptz_to_str(xtime))));
-
+ smgrstats();
+ elog(PANIC, "recovery PANIC");
InRedo = false;
}
else
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 64a26e06c6..850b51e316 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -183,6 +183,12 @@ smgr_entry_cleanup(SMgrRelation reln)
pfree(reln);
}
+void
+smgrstats(void)
+{
+ if (SMgrRelationHash != NULL)
+ smgrtable_stat(SMgrRelationHash);
+}
/*
* smgrinit(), smgrshutdown() -- Initialize or shut down storage
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index a6fbf7b6a6..ac010af74a 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -77,6 +77,7 @@ typedef SMgrRelationData *SMgrRelation;
#define SmgrIsTemp(smgr) \
RelFileNodeBackendIsTemp((smgr)->smgr_rnode)
+extern void smgrstats(void);
extern void smgrinit(void);
extern SMgrRelation smgropen(RelFileNode rnode, BackendId backend);
extern bool smgrexists(SMgrRelation reln, ForkNumber forknum);
David Rowley wrote 2021-04-25 16:36:
Looks much better! Simpler is almost always better.
Minor remarks:
The comment for SH_ENTRY_INITIALIZER could be improved. Maybe something like:
- SH_ENTRY_INITIALIZER(a) - if defined, this macro is called for new entries
  before the key or hash is stored in them. For example, it can be used to
  make necessary memory allocations.
`pg_rotate_left32(x, 1) == pg_rotate_right32(x, 31)`, therefore there's
no need to add `pg_rotate_left32` at all. Moreover, for hash combining
there's not much difference between `pg_rotate_left32(x, 1)` and
`pg_rotate_right32(x, 1)`. (To be honest, there is a bit of difference
due to the murmur construction, but it should not be very big.)
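As a quick standalone check of that identity (illustrative only; rotl32/rotr32 below simply restate pg_bitutils-style definitions rather than using them):

#include <assert.h>
#include <stdint.h>

static inline uint32_t rotl32(uint32_t w, int n) { return (w << n) | (w >> (32 - n)); }
static inline uint32_t rotr32(uint32_t w, int n) { return (w >> n) | (w << (32 - n)); }

int
main(void)
{
    uint32_t x = 0xdeadbeefU;

    /* rotating left by 1 bit is the same as rotating right by 31 bits */
    assert(rotl32(x, 1) == rotr32(x, 31));
    return 0;
}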
If your test is so sensitive to hash function speed, then I'd suggest
trying something even simpler:

static inline uint32
relfilenodebackend_hash(RelFileNodeBackend *rnode)
{
    uint32 h = 0;

#define step(x) h ^= (uint32)(x) * 0x85ebca6b; h = pg_rotate_right32(h, 11); h *= 9;

    step(rnode->node.relNode);
    step(rnode->node.spcNode);  // spcNode could be different for the same relNode
                                // only during table movement. Does it pay to hash it?
    step(rnode->node.dbNode);
    step(rnode->backend);       // does it matter to hash backend? It equals
                                // InvalidBackendId for non-temporary relations,
                                // and temporary relations in the same database
                                // never have the same relNode (have they?).

    return murmurhash32(h);
}
I'd like to see the benchmark code. It's quite interesting that this
place became measurable at all.
regards,
Yura Sokolov.
On Mon, 26 Apr 2021 at 05:03, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:
If your test so sensitive to hash function speed, then I'd suggest
to try something even simpler:static inline uint32
relfilenodebackend_hash(RelFileNodeBackend *rnode)
{
uint32 h = 0;
#define step(x) h ^= (uint32)(x) * 0x85ebca6b; h = pg_rotate_right(h,
11); h *= 9;
step(rnode->node.relNode);
step(rnode->node.spcNode); // spcNode could be different for same
relNode only
// during table movement. Does it pay
to hash it?
step(rnode->node.dbNode);
step(rnode->backend); // does it matter to hash backend?
// It equals to InvalidBackendId for
non-temporary relations
// and temporary relations in same
database never have same
// relNode (have they?).
return murmurhash32(hashkey);
}
I tried that and it got a median result of 113.795 seconds over 14
runs with this recovery benchmark test.
LOG: size: 4096, members: 2032, filled: 0.496094, total chain: 1014,
max chain: 6, avg chain: 0.499016, total_collisions: 428,
max_collisions: 3, avg_collisions: 0.210630
I also tried the following hash function just to see how much
performance might be left from speeding it up:
static inline uint32
relfilenodebackend_hash(RelFileNodeBackend *rnode)
{
    uint32 h;

    h = pg_rotate_right32((uint32) rnode->node.relNode, 16) ^ ((uint32) rnode->node.dbNode);
    return murmurhash32(h);
}
I got a median of 112.685 seconds over 14 runs with:
LOG: size: 4096, members: 2032, filled: 0.496094, total chain: 1044,
max chain: 7, avg chain: 0.513780, total_collisions: 438,
max_collisions: 3, avg_collisions: 0.215551
So it looks like there might not be too much left to gain, given that
v2 was 113.375 seconds (median over 10 runs).
I'd like to see the benchmark code. It's quite interesting that this
place became measurable at all.
Sure.
$ cat recoverybench_insert_hash.sh
#!/bin/bash
pg_ctl stop -D pgdata -m smart
pg_ctl start -D pgdata -l pg.log -w
psql -f setup1.sql postgres > /dev/null
psql -c "create table log_wal (lsn pg_lsn not null);" postgres > /dev/null
psql -c "insert into log_wal values(pg_current_wal_lsn());" postgres > /dev/null
psql -c "insert into hp select x,0 from generate_series(1,100000000)
x;" postgres > /dev/null
psql -c "insert into log_wal values(pg_current_wal_lsn());" postgres > /dev/null
psql -c "select 'Used ' ||
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), lsn)) || ' of
WAL' from log_wal limit 1;" postgres
pg_ctl stop -D pgdata -m immediate -w
echo Starting Postgres...
pg_ctl start -D pgdata -l pg.log
$ cat setup1.sql
drop table if exists hp;
create table hp (a int primary key, b int not null) partition by hash(a);
select 'create table hp'||x||' partition of hp for values with (modulus 1000, remainder '||x||');' from generate_Series(0,999) x;
\gexec
config:
shared_buffers = 10GB
checkpoint_timeout = 60min
max_wal_size = 20GB
min_wal_size = 20GB
For subsequent runs, if you apply the patch that does the PANIC at the
end of recovery, you'll just need to start the database up again to
perform recovery again. You can then just tail -f on your postgres
logs to watch for the "redo done" message which will show you the time
spent doing recovery.
David.
Hi,
On 2021-04-25 01:27:24 +0300, Yura Sokolov wrote:
It is quite interesting result. Simplehash being open-addressing with
linear probing is friendly for cpu cache. I'd recommend to define
SH_FILLFACTOR with value lower than default (0.9). I believe 0.75 is
suitable most for such kind of hash table.
It's not a "plain" linear probing hash table (although it is on the lookup
side). During insertions collisions are reordered so that the average distance
from the "optimal" position is ~even ("robin hood hashing"). That allows a
higher load factor than a plain linear probed hash table (for which IIRC
there's data to show 0.75 to be a good default load factor).
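For readers unfamiliar with the scheme, here is a hedged, self-contained toy sketch of that insertion rule (not simplehash's actual code; all names are made up, and it assumes the table never becomes completely full): on a collision, the incoming entry displaces a resident that sits closer to its ideal bucket, which keeps probe distances roughly even.

#include <stdbool.h>
#include <stdint.h>

#define NSLOTS 8                /* toy table size, power of two */

typedef struct Slot
{
    bool     used;
    uint32_t hash;
    int      value;
} Slot;

/* how far the entry stored at 'pos' is from its ideal bucket */
static int
probe_distance(uint32_t hash, int pos)
{
    int ideal = (int) (hash % NSLOTS);

    return (pos - ideal + NSLOTS) % NSLOTS;
}

/* robin hood insert: displace residents that are closer to their ideal slot */
static void
rh_insert(Slot table[NSLOTS], uint32_t hash, int value)
{
    Slot incoming = {true, hash, value};
    int  pos = (int) (hash % NSLOTS);
    int  dist = 0;

    for (;;)
    {
        if (!table[pos].used)
        {
            table[pos] = incoming;      /* free slot found */
            return;
        }
        if (probe_distance(table[pos].hash, pos) < dist)
        {
            /* resident is "richer" (closer to home): swap and carry it onward */
            Slot tmp = table[pos];

            table[pos] = incoming;
            incoming = tmp;
            dist = probe_distance(incoming.hash, pos);
        }
        pos = (pos + 1) % NSLOTS;
        dist++;
    }
}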
There of course may still be a benefit in lowering the load factor, but I'd
not start there.
David's tests aren't really suited to benchmarking the load factor, but to me
the stats he showed didn't highlight a need to lower the load factor. Lowering
the fill factor does influence the cache hit ratio...
Greetings,
Andres Freund
Andres Freund wrote 2021-04-26 21:46:
Hi,
On 2021-04-25 01:27:24 +0300, Yura Sokolov wrote:
It is quite interesting result. Simplehash being open-addressing with
linear probing is friendly for cpu cache. I'd recommend to define
SH_FILLFACTOR with value lower than default (0.9). I believe 0.75 is
suitable most for such kind of hash table.
It's not a "plain" linear probing hash table (although it is on the lookup
side). During insertions collisions are reordered so that the average distance
from the "optimal" position is ~even ("robin hood hashing"). That allows a
higher load factor than a plain linear probed hash table (for which IIRC
there's data to show 0.75 to be a good default load factor).
Even for Robin Hood hashing a 0.9 fill factor is too high. It leads to too
much movement on insertion/deletion and a longer average collision chain.
Note that Robin Hood doesn't optimize the average case. Indeed, Robin Hood
usually has a worse (longer) average collision chain than simple linear
probing. Robin Hood hashing optimizes the worst case, ie it guarantees
there is no unnecessarily long collision chain.
See the discussion on the Rust hash table fill factor when it was Robin Hood:
https://github.com/rust-lang/rust/issues/38003
There of course may still be a benefit in lowering the load factor, but I'd
not start there.
David's tests aren't really suited to benchmarking the load factor, but to me
the stats he showed didn't highlight a need to lower the load factor. Lowering
the fill factor does influence the cache hit ratio...
Greetings,
Andres Freund
regards,
Yura.
Hi,
On 2021-04-26 22:44:13 +0300, Yura Sokolov wrote:
Even for Robin Hood hashing 0.9 fill factor is too high. It leads to too
much movements on insertion/deletion and longer average collision chain.
That's true for modification heavy cases - but most hash tables in PG,
including the smgr one, are quite read heavy. For workloads where
there's a lot of smgr activity, the other overheads in relation
creation/drop handling are magnitudes more expensive than the collision
handling.
Note that simplehash.h also grows when the distance gets too big and
when there are too many elements to move, not just based on the fill
factor.
I kinda wish we had a chained hashtable implementation with the same
interface as simplehash. It's very use-case dependent which approach is
better, and right now we might be forcing some users to choose linear
probing because simplehash.h is still faster than dynahash, even though
chaining would actually be more appropriate.
Note that Robin Hood doesn't optimize average case.
Right.
See discussion on Rust hash table fill factor when it were Robin Hood:
https://github.com/rust-lang/rust/issues/38003
The first sentence actually confirms my point above, about it being a
question of read vs write heavy.
Greetings,
Andres Freund
David Rowley wrote 2021-04-26 09:43:
I tried that and it got a median result of 113.795 seconds over 14
runs with this recovery benchmark test.
LOG: size: 4096, members: 2032, filled: 0.496094, total chain: 1014,
max chain: 6, avg chain: 0.499016, total_collisions: 428,
max_collisions: 3, avg_collisions: 0.210630
I also tried the following hash function just to see how much
performance might be left from speeding it up:

static inline uint32
relfilenodebackend_hash(RelFileNodeBackend *rnode)
{
    uint32 h;

    h = pg_rotate_right32((uint32) rnode->node.relNode, 16) ^ ((uint32) rnode->node.dbNode);
    return murmurhash32(h);
}

I got a median of 112.685 seconds over 14 runs with:
LOG: size: 4096, members: 2032, filled: 0.496094, total chain: 1044,
max chain: 7, avg chain: 0.513780, total_collisions: 438,
max_collisions: 3, avg_collisions: 0.215551
The best result is with just:
return (uint32)rnode->node.relNode;
ie, relNode could be taken without mixing at all.
relNode is unique inside single database, and almost unique among whole
cluster since it is Oid.
I'd like to see the benchmark code. It's quite interesting that this
place became measurable at all.
Sure.
$ cat recoverybench_insert_hash.sh
....
David.
So, I've repeated the benchmark with different numbers of partitions (I
tried to catch a higher fillfactor) and a smaller amount of inserted data
(since I don't want to stress my SSD).
partitions/ | dynahash | dynahash +  | simplehash | simplehash + |
fillfactor  |          | simple func |            | simple func  |
------------+----------+-------------+------------+--------------+
 3500/0.43  |  3.73s   |    3.54s    |   3.58s    |    3.34s     |
 3200/0.78  |  3.64s   |    3.46s    |   3.47s    |    3.25s     |
 1500/0.74  |  3.18s   |    2.97s    |   3.03s    |    2.79s     |
Fillfactor is the effective fillfactor in simplehash with that number of
partitions.
I wasn't able to measure with a fillfactor close to 0.9 since it looks like
simplehash tends to grow much earlier due to SH_GROW_MAX_MOVE.
The simple function is a hash function that returns only rnode->node.relNode.
I've tested it with both simplehash and dynahash.
For dynahash, a custom match function was also made.
Conclusion:
- the trivial hash function gives better results for both simplehash and
dynahash,
- simplehash improves performance for both the complex and the trivial hash
function,
- simplehash + the trivial function performs best.
I'd like to hear other people's comments on the trivial hash function. But
since the generation of relation Oids is not subject to human intervention,
I'd recommend sticking with the trivial one.
Since the patch is simple, harmless and gives a measurable improvement,
I think it is ready for the commitfest.
regards,
Yura Sokolov.
Postgres Professional https://www.postgrespro.com
PS. David, please send the patch once again, since my mail client reattached
the files in previous messages and the commitfest robot could think I'm the
author.
On Thu, 29 Apr 2021 at 00:28, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:
The best result is with just:
return (uint32)rnode->node.relNode;
ie, relNode could be taken without mixing at all.
relNode is unique inside single database, and almost unique among whole
cluster
since it is Oid.
I admit to having tried that too just to almost eliminate the cost of
hashing. I just didn't consider it something we'd actually do.
The system catalogues are quite likely to all have the same
relfilenode in all databases, so for workloads that have a large
number of databases, looking up WAL records that touch the catalogues
is going to be pretty terrible.
The simplified hash function I wrote with just the relNode and dbNode
would be the least I'd be willing to entertain. However, I just
wouldn't be surprised if there was a good reason for that being bad
too.
So, I've repeated the benchmark with different numbers of partitions (I
tried to catch a higher fillfactor) and a smaller amount of inserted data
(since I don't want to stress my SSD).

partitions/ | dynahash | dynahash +  | simplehash | simplehash + |
fillfactor  |          | simple func |            | simple func  |
------------+----------+-------------+------------+--------------+
 3500/0.43  |  3.73s   |    3.54s    |   3.58s    |    3.34s     |
 3200/0.78  |  3.64s   |    3.46s    |   3.47s    |    3.25s     |
 1500/0.74  |  3.18s   |    2.97s    |   3.03s    |    2.79s     |

Fillfactor is the effective fillfactor in simplehash with that number of
partitions.
I wasn't able to measure with a fillfactor close to 0.9 since it looks like
simplehash tends to grow much earlier due to SH_GROW_MAX_MOVE.
Thanks for testing that.
I also ran some tests last night to test the 0.75 vs 0.9 fillfactor to
see if it made a difference. The test was similar to last time, but I
trimmed the number of rows inserted from 100 million down to 25
million. Last time I tested with 1000 partitions, this time with each
of: 100 200 300 400 500 600 700 800 900 1000 partitions. There didn't
seem to be any point of testing lower than that as the minimum hash
table size is 512.
The averages over 10 runs were:
nparts ff75 ff90
100 21.898 22.226
200 23.105 25.493
300 25.274 24.251
400 25.139 25.611
500 25.738 25.454
600 26.656 26.82
700 27.577 27.102
800 27.608 27.546
900 27.284 28.186
1000 29 28.153
Or to summarise a bit, we could just look at the sum of all the
results per fillfactor:
sum ff75 2592.79
sum ff90 2608.42 100.6%
fillfactor 75 did come out slightly faster, but only just. It seems
close enough that it might be better just to keep the 90 to save a
little memory and improve caching elsewhere. Also, from below, you
can see that for the 300, 500, 600, 700 and 1000 table tests, the hash
tables ended up the same size, yet there's a bit of variability in the
timing results. So the 0.6% gain might just be noise.
I don't think it's worth making the fillfactor 75.
drowley@amd3990x:~/recoverylogs$ grep -rH -m 1 "collisions"
ff75_tb100.log:LOG: size: 1024, members: 231, filled: 0.225586, total
chain: 33, max chain: 2, avg chain: 0.142857, total_collisions: 20,
max_collisions: 2, avg_collisions: 0.086580
ff90_tb100.log:LOG: size: 512, members: 231, filled: 0.451172, total
chain: 66, max chain: 2, avg chain: 0.285714, total_collisions: 36,
max_collisions: 2, avg_collisions: 0.155844
ff75_tb200.log:LOG: size: 1024, members: 431, filled: 0.420898, total
chain: 160, max chain: 4, avg chain: 0.371230, total_collisions: 81,
max_collisions: 3, avg_collisions: 0.187935
ff90_tb200.log:LOG: size: 512, members: 431, filled: 0.841797, total
chain: 942, max chain: 9, avg chain: 2.185615, total_collisions: 134,
max_collisions: 3, avg_collisions: 0.310905
ff90_tb300.log:LOG: size: 1024, members: 631, filled: 0.616211, total
chain: 568, max chain: 9, avg chain: 0.900158, total_collisions: 158,
max_collisions: 4, avg_collisions: 0.250396
ff75_tb300.log:LOG: size: 1024, members: 631, filled: 0.616211, total
chain: 568, max chain: 9, avg chain: 0.900158, total_collisions: 158,
max_collisions: 4, avg_collisions: 0.250396
ff75_tb400.log:LOG: size: 2048, members: 831, filled: 0.405762, total
chain: 341, max chain: 4, avg chain: 0.410349, total_collisions: 162,
max_collisions: 3, avg_collisions: 0.194946
ff90_tb400.log:LOG: size: 1024, members: 831, filled: 0.811523, total
chain: 1747, max chain: 15, avg chain: 2.102286, total_collisions:
269, max_collisions: 3, avg_collisions: 0.323706
ff75_tb500.log:LOG: size: 2048, members: 1031, filled: 0.503418,
total chain: 568, max chain: 5, avg chain: 0.550921, total_collisions:
219, max_collisions: 4, avg_collisions: 0.212415
ff90_tb500.log:LOG: size: 2048, members: 1031, filled: 0.503418,
total chain: 568, max chain: 5, avg chain: 0.550921, total_collisions:
219, max_collisions: 4, avg_collisions: 0.212415
ff75_tb600.log:LOG: size: 2048, members: 1231, filled: 0.601074,
total chain: 928, max chain: 7, avg chain: 0.753859, total_collisions:
298, max_collisions: 4, avg_collisions: 0.242080
ff90_tb600.log:LOG: size: 2048, members: 1231, filled: 0.601074,
total chain: 928, max chain: 7, avg chain: 0.753859, total_collisions:
298, max_collisions: 4, avg_collisions: 0.242080
ff75_tb700.log:LOG: size: 2048, members: 1431, filled: 0.698730,
total chain: 1589, max chain: 9, avg chain: 1.110412,
total_collisions: 391, max_collisions: 4, avg_collisions: 0.273235
ff90_tb700.log:LOG: size: 2048, members: 1431, filled: 0.698730,
total chain: 1589, max chain: 9, avg chain: 1.110412,
total_collisions: 391, max_collisions: 4, avg_collisions: 0.273235
ff75_tb800.log:LOG: size: 4096, members: 1631, filled: 0.398193,
total chain: 628, max chain: 6, avg chain: 0.385040, total_collisions:
296, max_collisions: 3, avg_collisions: 0.181484
ff90_tb800.log:LOG: size: 2048, members: 1631, filled: 0.796387,
total chain: 2903, max chain: 12, avg chain: 1.779890,
total_collisions: 515, max_collisions: 3, avg_collisions: 0.315757
ff75_tb900.log:LOG: size: 4096, members: 1831, filled: 0.447021,
total chain: 731, max chain: 5, avg chain: 0.399235, total_collisions:
344, max_collisions: 3, avg_collisions: 0.187875
ff90_tb900.log:LOG: size: 2048, members: 1831, filled: 0.894043,
total chain: 6364, max chain: 14, avg chain: 3.475696,
total_collisions: 618, max_collisions: 4, avg_collisions: 0.337520
ff75_tb1000.log:LOG: size: 4096, members: 2031, filled: 0.495850,
total chain: 1024, max chain: 6, avg chain: 0.504185,
total_collisions: 416, max_collisions: 3, avg_collisions: 0.204825
ff90_tb1000.log:LOG: size: 4096, members: 2031, filled: 0.495850,
total chain: 1024, max chain: 6, avg chain: 0.504185,
total_collisions: 416, max_collisions: 3, avg_collisions: 0.204825
Another line of thought for making it go faster would be to do
something like get rid of the hash status field from SMgrEntry. That
could be either coded into a single bit we'd borrow from the hash
value, or it could be coded into the least significant bit of the data
field. A pointer to palloc'd memory should always be MAXALIGNed,
which means at least the lower two bits are always zero. We'd just
need to make sure and do something like "data & ~((uintptr_t) 3)" to
trim off the hash status bits before dereferencing the pointer. That
would make the SMgrEntry 12 bytes on a 64-bit machine. However, it
would also mean that some entries would span 2 cache lines, which
might affect performance a bit.
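A hedged sketch of that idea (not from any posted patch; the slot_* helper names are made up), relying only on the two low bits of a MAXALIGNed palloc'd pointer being zero:

#include "postgres.h"
#include "storage/smgr.h"

#define SLOT_STATUS_MASK    ((uintptr_t) 3)     /* the two spare low bits */

static inline int
slot_status(uintptr_t tagged)
{
    return (int) (tagged & SLOT_STATUS_MASK);
}

static inline SMgrRelation
slot_pointer(uintptr_t tagged)
{
    /* strip the status bits before dereferencing */
    return (SMgrRelation) (tagged & ~SLOT_STATUS_MASK);
}

static inline uintptr_t
slot_make(SMgrRelation data, int status)
{
    /* 'data' comes from palloc, so its two low bits are zero; status must fit in 2 bits */
    return (uintptr_t) data | (uintptr_t) status;
}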
PS. David, please send the patch once again, since my mail client reattached
the files in previous messages and the commitfest robot could think I'm the
author.
Authors are listed manually in the CF app. The app will pickup .patch
files from the latest email in the thread and the CF bot will test
those. So it does pay to be pretty careful when attaching patches to
threads that are in the CF app. That's the reason I added the .txt
extension to the recovery panic patch. The CF bot machines would have
complained about that.
David Rowley wrote 2021-04-29 02:51:
On Thu, 29 Apr 2021 at 00:28, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:
The best result is with just:
return (uint32)rnode->node.relNode;
ie, relNode could be taken without mixing at all.
relNode is unique inside single database, and almost unique among whole
cluster since it is Oid.
I admit to having tried that too just to almost eliminate the cost of
hashing. I just didn't consider it something we'd actually do.
The system catalogues are quite likely to all have the same
relfilenode in all databases, so for workloads that have a large
number of databases, looking up WAL records that touch the catalogues
is going to be pretty terrible.
The only workload that could touch system catalogues in different
databases is recovery (and autovacuum?). Client backends can't
be connected to more than one database.
But nevertheless, you're right. I oversimplified it intentionally.
I wrote originally:
hashcode = (uint32)rnode->node.dbNode * 0xc2b2ae35;
hashcode ^= (uint32)rnode->node.relNode;
return hashcode;
But before sending the mail I cut the dbNode part.
Still, the main point is that relNode could be put unmixed into the final
value. That way fewer collisions happen.
The simplified hash function I wrote with just the relNode and dbNode
would be the least I'd be willing to entertain. However, I just
wouldn't be surprised if there was a good reason for that being bad too.
So, I've repeated the benchmark with different numbers of partitions (I
tried to catch a higher fillfactor) and a smaller amount of inserted data
(since I don't want to stress my SSD).

partitions/ | dynahash | dynahash +  | simplehash | simplehash + |
fillfactor  |          | simple func |            | simple func  |
------------+----------+-------------+------------+--------------+
 3500/0.43  |  3.73s   |    3.54s    |   3.58s    |    3.34s     |
 3200/0.78  |  3.64s   |    3.46s    |   3.47s    |    3.25s     |
 1500/0.74  |  3.18s   |    2.97s    |   3.03s    |    2.79s     |

Fillfactor is the effective fillfactor in simplehash with that number of
partitions.
I wasn't able to measure with a fillfactor close to 0.9 since it looks like
simplehash tends to grow much earlier due to SH_GROW_MAX_MOVE.
Thanks for testing that.
I also ran some tests last night to test the 0.75 vs 0.9 fillfactor to
see if it made a difference. The test was similar to last time, but I
trimmed the number of rows inserted from 100 million down to 25
million. Last time I tested with 1000 partitions, this time with each
of: 100 200 300 400 500 600 700 800 900 1000 partitions. There didn't
seem to be any point of testing lower than that as the minimum hash
table size is 512.
The averages over 10 runs were:
nparts ff75 ff90
100 21.898 22.226
200 23.105 25.493
300 25.274 24.251
400 25.139 25.611
500 25.738 25.454
600 26.656 26.82
700 27.577 27.102
800 27.608 27.546
900 27.284 28.186
1000 29 28.153
Or to summarise a bit, we could just look at the sum of all the
results per fillfactor:
sum ff75 2592.79
sum ff90 2608.42 (100.6%)
fillfactor 75 did come out slightly faster, but only just. It seems
close enough that it might be better just to keep the 90 to save a
little memory and improve caching elsewhere. Also, from below, you
can see that for the 300, 500, 600, 700 and 1000 table tests, the hash
tables ended up the same size, yet there's a bit of variability in the
timing results. So the 0.6% gain might just be noise.
I don't think it's worth making the fillfactor 75.
To be clear: I didn't change SH_FILLFACTOR. It was equal to 0.9.
I just wasn't able to catch a table with a fill factor of more than 0.78.
Looks like you've got it with 900 partitions :-)
Another line of thought for making it go faster would be to do
something like get rid of the hash status field from SMgrEntry. That
could be either coded into a single bit we'd borrow from the hash
value, or it could be coded into the least significant bit of the data
field. A pointer to palloc'd memory should always be MAXALIGNed,
which means at least the lower two bits are always zero. We'd just
need to make sure and do something like "data & ~((uintptr_t) 3)" to
trim off the hash status bits before dereferencing the pointer. That
would make the SMgrEntry 12 bytes on a 64-bit machine. However, it
would also mean that some entries would span 2 cache lines, which
might affect performance a bit.
Then the data pointer will itself be unaligned to 8 bytes. While x86 is
mostly indifferent to this, I doubt this memory economy will pay
off.
regards,
Yura Sokolov.
On Thu, 29 Apr 2021 at 12:30, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:
David Rowley писал 2021-04-29 02:51:
Another line of thought for making it go faster would be to do
something like get rid of the hash status field from SMgrEntry. That
could be either coded into a single bit we'd borrow from the hash
value, or it could be coded into the least significant bit of the data
field. A pointer to palloc'd memory should always be MAXALIGNed,
which means at least the lower two bits are always zero. We'd just
need to make sure and do something like "data & ~((uintptr_t) 3)" to
trim off the hash status bits before dereferencing the pointer. That
would make the SMgrEntry 12 bytes on a 64-bit machine. However, it
would also mean that some entries would span 2 cache lines, which
might affect performance a bit.
Then the data pointer will itself be unaligned to 8 bytes. While x86 is
mostly indifferent to this, I doubt this memory economy will pay
off.
Actually, I didn't think very hard about that. The struct would still
be 16 bytes and just have padding so the data pointer was aligned to 8
bytes (assuming a 64-bit machine).
David
I've attached an updated patch. I forgot to call SH_ENTRY_CLEANUP,
when it's defined, during SH_RESET.
I also tidied up a couple of comments and changed the code to use
pg_rotate_right32(.., 31) instead of adding a new function for
pg_rotate_left32 and calling that to shift left 1 bit.
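As an aside, the reason that works is that for a 32-bit value, rotating right by 31 bits is the same as rotating left by 1 bit. A minimal sketch of the identity (not the actual pg_bitutils.h implementation):

static inline uint32
rotate_left_1(uint32 x)
{
	/* equivalent to pg_rotate_right32(x, 31) for 32-bit values */
	return (x << 1) | (x >> 31);
}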
David
Attachments:
v3-0001-Use-simplehash.h-hashtables-in-SMgr.patch (application/octet-stream)
From ce604dfa855e765383656baf37bbc5ce51bb6034 Mon Sep 17 00:00:00 2001
From: "dgrowley@gmail.com" <dgrowley@gmail.com>
Date: Thu, 22 Apr 2021 17:03:46 +1200
Subject: [PATCH v3] Use simplehash.h hashtables in SMgr
The hash table lookups done in SMgr can quite often be a bottleneck during
crash recovery. Traditionally these use dynahash. Here we swap dynahash
out and use simplehash instead. This improves lookup performance.
Some changes are required in simplehash.h here to make this work. The
reason for this is that code external to smgr.c does point to the hashed
SMgrRelation. Since simplehash does reallocate the bucket array when
increasing the size of the table and also shuffles entries around during
deletes, code pointing directly into hash entries would be a bad idea. To
overcome this issue we only store a pointer to the SMgrRelationData in the
hash table entry and maintain a separate allocation for that data. This
does mean an additional pointer dereference during lookups, but only when
the hash value matches, so the significant majority of the time that will
only be done for the item we are actually looking for.
Since the hash table key is stored in the referenced SMgrRelation, we need
to add two new macros to allow simplehash to allocate the memory for the
SMgrEntry during inserts before it tries to set the key. A new macro has
also been added to allow simplehash implementations to perform cleanup
when items are removed from the table.
---
src/backend/storage/smgr/smgr.c | 111 ++++++++++++++++++++++++--------
src/include/lib/simplehash.h | 32 +++++++++
2 files changed, 115 insertions(+), 28 deletions(-)
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 4dc24649df..209218f781 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -18,14 +18,49 @@
#include "postgres.h"
#include "access/xlog.h"
+#include "common/hashfn.h"
#include "lib/ilist.h"
#include "storage/bufmgr.h"
#include "storage/ipc.h"
#include "storage/md.h"
#include "storage/smgr.h"
-#include "utils/hsearch.h"
#include "utils/inval.h"
+/* Hash table entry type for SMgrRelationHash */
+typedef struct SMgrEntry
+{
+ int status; /* Hash table status */
+ uint32 hash; /* Hash value (cached) */
+ SMgrRelation data; /* Pointer to the SMgrRelationData */
+} SMgrEntry;
+
+static inline uint32 relfilenodebackend_hash(RelFileNodeBackend *rnode);
+
+/*
+ * Because simplehash.h does not provide a stable pointer to hash table
+ * entries, we don't make the element type a SMgrRelation directly, instead we
+ * use an SMgrEntry type which has a pointer to the data field. simplehash can
+ * move entries around when adding or removing items from the hash table so
+ * having the SMgrRelation as a pointer inside the SMgrEntry allows external
+ * code to keep their own pointers to the SMgrRelation. Relcache does this.
+ * We use the SH_ENTRY_INITIALIZER to allocate memory for the SMgrRelationData
+ * when a new entry is created. We also define SH_ENTRY_CLEANUP to execute
+ * some cleanup when removing an item from the table.
+ */
+#define SH_PREFIX smgrtable
+#define SH_ELEMENT_TYPE SMgrEntry
+#define SH_KEY_TYPE RelFileNodeBackend
+#define SH_KEY data->smgr_rnode
+#define SH_HASH_KEY(tb, key) relfilenodebackend_hash(&key)
+#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(RelFileNodeBackend)) == 0)
+#define SH_SCOPE static inline
+#define SH_STORE_HASH
+#define SH_GET_HASH(tb, a) a->hash
+#define SH_ENTRY_INITIALIZER(a) a->data = MemoryContextAlloc(TopMemoryContext, sizeof(SMgrRelationData))
+#define SH_ENTRY_CLEANUP(a) pfree(a->data)
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
/*
* This struct of function pointers defines the API between smgr.c and
@@ -91,13 +126,43 @@ static const int NSmgr = lengthof(smgrsw);
* Each backend has a hashtable that stores all extant SMgrRelation objects.
* In addition, "unowned" SMgrRelation objects are chained together in a list.
*/
-static HTAB *SMgrRelationHash = NULL;
+static smgrtable_hash *SMgrRelationHash = NULL;
static dlist_head unowned_relns;
/* local function prototypes */
static void smgrshutdown(int code, Datum arg);
+/*
+ * relfilenodebackend_hash
+ * Custom rolled hash function for simplehash table.
+ *
+ * smgropen() is often a bottleneck in CPU bound workloads during crash
+ * recovery. We make use of this custom hash function rather than using
+ * hash_bytes as it gives us a little bit more performance.
+ *
+ * XXX What if sizeof(Oid) is not 4?
+ */
+static inline uint32
+relfilenodebackend_hash(RelFileNodeBackend *rnode)
+{
+ uint32 hashkey;
+
+ hashkey = murmurhash32((uint32) rnode->node.spcNode);
+
+ /* rotate hashkey left 1 bit at each step */
+ hashkey = pg_rotate_right32(hashkey, 31);
+ hashkey ^= murmurhash32((uint32) rnode->node.dbNode);
+
+ hashkey = pg_rotate_right32(hashkey, 31);
+ hashkey ^= murmurhash32((uint32) rnode->node.relNode);
+
+ hashkey = pg_rotate_right32(hashkey, 31);
+ hashkey ^= murmurhash32((uint32) rnode->backend);
+
+ return hashkey;
+}
+
/*
* smgrinit(), smgrshutdown() -- Initialize or shut down storage
@@ -147,31 +212,26 @@ smgropen(RelFileNode rnode, BackendId backend)
{
RelFileNodeBackend brnode;
SMgrRelation reln;
+ SMgrEntry *entry;
bool found;
- if (SMgrRelationHash == NULL)
+ if (unlikely(SMgrRelationHash == NULL))
{
/* First time through: initialize the hash table */
- HASHCTL ctl;
-
- ctl.keysize = sizeof(RelFileNodeBackend);
- ctl.entrysize = sizeof(SMgrRelationData);
- SMgrRelationHash = hash_create("smgr relation table", 400,
- &ctl, HASH_ELEM | HASH_BLOBS);
+ SMgrRelationHash = smgrtable_create(TopMemoryContext, 400, NULL);
dlist_init(&unowned_relns);
}
/* Look up or create an entry */
brnode.node = rnode;
brnode.backend = backend;
- reln = (SMgrRelation) hash_search(SMgrRelationHash,
- (void *) &brnode,
- HASH_ENTER, &found);
+ entry = smgrtable_insert(SMgrRelationHash, brnode, &found);
+ reln = entry->data;
/* Initialize it if not present before */
if (!found)
{
- /* hash_search already filled in the lookup key */
+ /* smgrtable_insert already filled in the lookup key */
reln->smgr_owner = NULL;
reln->smgr_targblock = InvalidBlockNumber;
for (int i = 0; i <= MAX_FORKNUM; ++i)
@@ -266,9 +326,7 @@ smgrclose(SMgrRelation reln)
if (!owner)
dlist_delete(&reln->node);
- if (hash_search(SMgrRelationHash,
- (void *) &(reln->smgr_rnode),
- HASH_REMOVE, NULL) == NULL)
+ if (!smgrtable_delete(SMgrRelationHash, reln->smgr_rnode))
elog(ERROR, "SMgrRelation hashtable corrupted");
/*
@@ -285,17 +343,17 @@ smgrclose(SMgrRelation reln)
void
smgrcloseall(void)
{
- HASH_SEQ_STATUS status;
- SMgrRelation reln;
+ smgrtable_iterator iterator;
+ SMgrEntry *entry;
/* Nothing to do if hashtable not set up */
if (SMgrRelationHash == NULL)
return;
- hash_seq_init(&status, SMgrRelationHash);
+ smgrtable_start_iterate(SMgrRelationHash, &iterator);
- while ((reln = (SMgrRelation) hash_seq_search(&status)) != NULL)
- smgrclose(reln);
+ while ((entry = smgrtable_iterate(SMgrRelationHash, &iterator)) != NULL)
+ smgrclose(entry->data);
}
/*
@@ -309,17 +367,14 @@ smgrcloseall(void)
void
smgrclosenode(RelFileNodeBackend rnode)
{
- SMgrRelation reln;
+ SMgrEntry *entry;
/* Nothing to do if hashtable not set up */
if (SMgrRelationHash == NULL)
return;
-
- reln = (SMgrRelation) hash_search(SMgrRelationHash,
- (void *) &rnode,
- HASH_FIND, NULL);
- if (reln != NULL)
- smgrclose(reln);
+ entry = smgrtable_lookup(SMgrRelationHash, rnode);
+ if (entry != NULL)
+ smgrclose(entry->data);
}
/*
diff --git a/src/include/lib/simplehash.h b/src/include/lib/simplehash.h
index da51781e98..2c4fc7e8c4 100644
--- a/src/include/lib/simplehash.h
+++ b/src/include/lib/simplehash.h
@@ -50,6 +50,13 @@
* - SH_HASH_KEY(table, key) - generate hash for the key
* - SH_STORE_HASH - if defined the hash is stored in the elements
* - SH_GET_HASH(tb, a) - return the field to store the hash in
+ * - SH_ENTRY_INITIALIZER(a) - if defined, the code in this macro is called
+ * for new entries directly before any other internal code makes any
+ * changes to setup the new entry. This could be used to do things like
+ * initialize memory for the bucket.
+ * - SH_ENTRY_CLEANUP(a) - if defined, the code in this macro is called
+ * when an entry is removed from the hash table. This could be used to
+ * free memory allocated in the bucket by SH_ENTRY_INITIALIZER
*
* The element type is required to contain a "status" member that can store
* the range of values defined in the SH_STATUS enum.
@@ -464,6 +471,17 @@ SH_DESTROY(SH_TYPE * tb)
SH_SCOPE void
SH_RESET(SH_TYPE * tb)
{
+#ifdef SH_ENTRY_CLEANUP
+ /* Execute the cleanup code when SH_ENTRY_CLEANUP has been defined */
+ for (int i = 0; i < tb->size; i++)
+ {
+ SH_ELEMENT_TYPE *entry = &tb->data[i];
+
+ if (entry->status == SH_STATUS_IN_USE)
+ SH_ENTRY_CLEANUP(entry);
+ }
+#endif
+
memset(tb->data, 0, sizeof(SH_ELEMENT_TYPE) * tb->size);
tb->members = 0;
}
@@ -634,6 +652,9 @@ restart:
if (entry->status == SH_STATUS_EMPTY)
{
tb->members++;
+#ifdef SH_ENTRY_INITIALIZER
+ SH_ENTRY_INITIALIZER(entry);
+#endif
entry->SH_KEY = key;
#ifdef SH_STORE_HASH
SH_GET_HASH(tb, entry) = hash;
@@ -721,6 +742,9 @@ restart:
/* and fill the now empty spot */
tb->members++;
+#ifdef SH_ENTRY_INITIALIZER
+ SH_ENTRY_INITIALIZER(entry);
+#endif
entry->SH_KEY = key;
#ifdef SH_STORE_HASH
@@ -856,6 +880,9 @@ SH_DELETE(SH_TYPE * tb, SH_KEY_TYPE key)
SH_ELEMENT_TYPE *lastentry = entry;
tb->members--;
+#ifdef SH_ENTRY_CLEANUP
+ SH_ENTRY_CLEANUP(entry);
+#endif
/*
* Backward shift following elements till either an empty element
@@ -919,6 +946,9 @@ SH_DELETE_ITEM(SH_TYPE * tb, SH_ELEMENT_TYPE * entry)
curelem = entry - &tb->data[0];
tb->members--;
+#ifdef SH_ENTRY_CLEANUP
+ SH_ENTRY_CLEANUP(entry);
+#endif
/*
* Backward shift following elements till either an empty element or an
@@ -1133,6 +1163,8 @@ SH_STAT(SH_TYPE * tb)
#undef SH_DECLARE
#undef SH_DEFINE
#undef SH_GET_HASH
+#undef SH_ENTRY_INITIALIZER
+#undef SH_ENTRY_CLEANUP
#undef SH_STORE_HASH
#undef SH_USE_NONDEFAULT_ALLOCATOR
#undef SH_EQUAL
--
2.27.0
Hi David,
You're probably aware of this, but just to make it explicit: Jakub
Wartak was testing performance of recovery, and one of the bottlenecks
he found in some of his cases was dynahash as used by SMgr. It seems
quite possible that this work would benefit some of his test workloads.
He last posted about it here:
/messages/by-id/VI1PR0701MB69608CBCE44D80857E59572EF6CA0@VI1PR0701MB6960.eurprd07.prod.outlook.com
though the fraction of dynahash-from-SMgr is not as high there as in
some of the other reports I saw.
--
Álvaro Herrera       Valdivia, Chile
Hi David, Alvaro, -hackers
Hi David,
You're probably aware of this, but just to make it explicit: Jakub Wartak was
testing performance of recovery, and one of the bottlenecks he found in
some of his cases was dynahash as used by SMgr. It seems quite possible
that this work would benefit some of his test workloads.
I might be a little bit out of the loop, but as Alvaro stated, Thomas did plenty of excellent work related to recovery performance in that thread. In my humble opinion, and if I'm not mistaken (I'm speculating here), it might *not* be how the Smgr hash works, but how often it is being exercised, and that would also explain the relatively lower than expected(?) gains here. There are at least two very important emails from him that I'm aware of that touch the topic of reordering/compacting/batching calls to Smgr:
/messages/by-id/CA+hUKG+2Vw3UAVNJSfz5_zhRcHUWEBDrpB7pyQ85Yroep0AKbw@mail.gmail.com
/messages/by-id/CA+hUKGK4StQ=eXGZ-5hTdYCmSfJ37yzLp9yW9U5uH6526H+Ueg@mail.gmail.com
Another potential option that we've discussed is that the redo generation itself is likely a brake on efficient recovery performance today (e.g. INSERT-SELECT on a table with indexes generates interleaved WAL records that touch an often limited set of blocks, which usually puts Smgr in the spotlight).
-Jakub Wartak.
Hi Jakub,
On Wed, 5 May 2021 at 20:16, Jakub Wartak <Jakub.Wartak@tomtom.com> wrote:
I might be a little bit out of the loop, but as Alvaro stated, Thomas did plenty of excellent work related to recovery performance in that thread. In my humble opinion, and if I'm not mistaken (I'm speculating here), it might *not* be how the Smgr hash works, but how often it is being exercised, and that would also explain the relatively lower than expected(?) gains here. There are at least two very important emails from him that I'm aware of that touch the topic of reordering/compacting/batching calls to Smgr:
/messages/by-id/CA+hUKG+2Vw3UAVNJSfz5_zhRcHUWEBDrpB7pyQ85Yroep0AKbw@mail.gmail.com
/messages/by-id/CA+hUKGK4StQ=eXGZ-5hTdYCmSfJ37yzLp9yW9U5uH6526H+Ueg@mail.gmail.com
I'm not much of an expert here and I didn't follow the recovery
prefetching stuff closely. So, with that in mind, I think there is a
lot that could be done along the lines of what Thomas is mentioning.
Batching WAL records up by filenode then replaying each filenode one
by one when our batching buffer is full. There could be some sort of
parallel options there too, where workers replay a filenode each.
However, that wouldn't really work for recovery on a hot
standby. We'd need to ensure we replay the commit record for each
transaction last. I think you'd have to batch by filenode and
transaction in that case. Each batch might be pretty small on a
typical OLTP workload, so it might not help much there, or it might
hinder.
But having said that, I don't think any of those possibilities should
stop us speeding up smgropen().
Another potential option that we've discussed is that the redo generation itself is likely a brake on efficient recovery performance today (e.g. INSERT-SELECT on a table with indexes generates interleaved WAL records that touch an often limited set of blocks, which usually puts Smgr in the spotlight).
I'm not quite sure if I understand what you mean here. Is this
queuing up WAL records during transactions and flushing them out to
WAL every so often after rearranging them into an order that's more
optimal for replay?
David
Hey David,
I think you'd have to batch by filenode and transaction in that case. Each batch might be pretty small on a typical OLTP workload, so it might not help much there, or it might hinder.
True, it is very workload dependent; I was chasing mainly multi-value INSERTs and INSERT-SELECT that often hit the same $block, certainly not OLTP. I would even say that INSERT-as-SELECT would be more suited to DWH-like processing.
But having said that, I don't think any of those possibilities should stop us speeding up smgropen().
Of course! I've tried a couple of much smaller ideas, but without any big gains. I was able to squeeze out something like 300-400k function calls per second (WAL records/s); that was the point, I think, where smgropen() got abused.
Another potential option that we've discussed is that the redo generation
itself is likely a brake on efficient recovery performance today (e.g.
INSERT-SELECT on a table with indexes generates interleaved WAL records
that touch an often limited set of blocks, which usually puts Smgr in the
spotlight).
I'm not quite sure if I understand what you mean here. Is this queuing up
WAL records during transactions and flushing them out to WAL every so
often after rearranging them into an order that's more optimal for replay?
Why not both? 😉 We were very concentrated on the standby side, but on the primary side one could also change how WAL records are generated:
1) Minimization of records for the same repeated $block, e.g. the Heap2 table_multi_insert() API already does this, and it matters for generating a more optimal stream for replay:
postgres@test=# create table t (id bigint primary key);
postgres@test=# insert into t select generate_series(1, 10);
results in many calls, due to interleaving heap and btree records for the same block from the Smgr perspective (this is especially visible on highly indexed tables) =>
rmgr: Btree len (rec/tot): 64/ 64, tx: 17243284, lsn: 4/E7000108, prev 4/E70000A0, desc: INSERT_LEAF off 1, blkref #0: rel 1663/16384/32794 blk 1
rmgr: Heap len (rec/tot): 63/ 63, tx: 17243284, lsn: 4/E7000148, prev 4/E7000108, desc: INSERT off 2 flags 0x00, blkref #0: rel 1663/16384/32791 blk 0
rmgr: Btree len (rec/tot): 64/ 64, tx: 17243284, lsn: 4/E7000188, prev 4/E7000148, desc: INSERT_LEAF off 2, blkref #0: rel 1663/16384/32794 blk 1
rmgr: Heap len (rec/tot): 63/ 63, tx: 17243284, lsn: 4/E70001C8, prev 4/E7000188, desc: INSERT off 3 flags 0x00, blkref #0: rel 1663/16384/32791 blk 0
rmgr: Btree len (rec/tot): 64/ 64, tx: 17243284, lsn: 4/E7000208, prev 4/E70001C8, desc: INSERT_LEAF off 3, blkref #0: rel 1663/16384/32794 blk 1
rmgr: Heap len (rec/tot): 63/ 63, tx: 17243284, lsn: 4/E7000248, prev 4/E7000208, desc: INSERT off 4 flags 0x00, blkref #0: rel 1663/16384/32791 blk 0
rmgr: Btree len (rec/tot): 64/ 64, tx: 17243284, lsn: 4/E7000288, prev 4/E7000248, desc: INSERT_LEAF off 4, blkref #0: rel 1663/16384/32794 blk 1
rmgr: Heap len (rec/tot): 63/ 63, tx: 17243284, lsn: 4/E70002C8, prev 4/E7000288, desc: INSERT off 5 flags 0x00, blkref #0: rel 1663/16384/32791 blk 0
[..]
Similar stuff happens for UPDATE. It basically prevents the recent-buffer optimization that avoids repeated calls to smgropen().
And here's a sample from the already existing table_multi_insert v2 API (Heap2) with the obvious elimination of unnecessary individual calls to smgropen() via one big MULTI_INSERT instead (for CTAS/COPY/REFRESH MV):
postgres@test=# create table t (id bigint primary key);
postgres@test=# copy (select generate_series (1, 10)) to '/tmp/t';
postgres@test=# copy t from '/tmp/t';
=>
rmgr: Heap2 len (rec/tot): 210/ 210, tx: 17243290, lsn: 4/E9000028, prev 4/E8004410, desc: MULTI_INSERT+INIT 10 tuples flags 0x02, blkref #0: rel 1663/16384/32801 blk 0
rmgr: Btree len (rec/tot): 102/ 102, tx: 17243290, lsn: 4/E9000100, prev 4/E9000028, desc: NEWROOT lev 0, blkref #0: rel 1663/16384/32804 blk 1, blkref #2: rel 1663/16384/32804 blk 0
rmgr: Btree len (rec/tot): 64/ 64, tx: 17243290, lsn: 4/E9000168, prev 4/E9000100, desc: INSERT_LEAF off 1, blkref #0: rel 1663/16384/32804 blk 1
rmgr: Btree len (rec/tot): 64/ 64, tx: 17243290, lsn: 4/E90001A8, prev 4/E9000168, desc: INSERT_LEAF off 2, blkref #0: rel 1663/16384/32804 blk 1
rmgr: Btree len (rec/tot): 64/ 64, tx: 17243290, lsn: 4/E90001E8, prev 4/E90001A8, desc: INSERT_LEAF off 3, blkref #0: rel 1663/16384/32804 blk 1
[..]
Here the Btree activity is very localized (at least when concurrent sessions are not generating WAL) and it enables Thomas's recent-buffer optimization to kick in.
DELETE is much simpler (thanks to not chewing out those Btree records) and, also thanks to Thomas's recent-buffer, should theoretically put much less stress on smgropen() already:
rmgr: Heap len (rec/tot): 54/ 54, tx: 17243296, lsn: 4/ED000028, prev 4/EC002800, desc: DELETE off 1 flags 0x00 KEYS_UPDATED , blkref #0: rel 1663/16384/32808 blk 0
rmgr: Heap len (rec/tot): 54/ 54, tx: 17243296, lsn: 4/ED000060, prev 4/ED000028, desc: DELETE off 2 flags 0x00 KEYS_UPDATED , blkref #0: rel 1663/16384/32808 blk 0
rmgr: Heap len (rec/tot): 54/ 54, tx: 17243296, lsn: 4/ED000098, prev 4/ED000060, desc: DELETE off 3 flags 0x00 KEYS_UPDATED , blkref #0: rel 1663/16384/32808 blk 0
rmgr: Heap len (rec/tot): 54/ 54, tx: 17243296, lsn: 4/ED0000D0, prev 4/ED000098, desc: DELETE off 4 flags 0x00 KEYS_UPDATED , blkref #0: rel 1663/16384/32808 blk 0
[..]
2) So what's missing - I may be wrong on this one - is something like an "index_multi_inserts" Btree2 API to avoid repeatedly overwhelming smgropen() on the recovery side for the same index's $buffer. Not sure it is worth the effort though, especially since recent-buffer fixes that:
rmgr: Btree len (rec/tot): 64/ 64, tx: 17243290, lsn: 4/E9000168, prev 4/E9000100, desc: INSERT_LEAF off 1, blkref #0: rel 1663/16384/32804 blk 1
rmgr: Btree len (rec/tot): 64/ 64, tx: 17243290, lsn: 4/E90001A8, prev 4/E9000168, desc: INSERT_LEAF off 2, blkref #0: rel 1663/16384/32804 blk 1
rmgr: Btree len (rec/tot): 64/ 64, tx: 17243290, lsn: 4/E90001E8, prev 4/E90001A8, desc: INSERT_LEAF off 3, blkref #0: rel 1663/16384/32804 blk 1
right?
3) Concurrent DML sessions mixing WAL records: buffering on the backend's side of things (a private "thread" of WAL - in private memory - that would simply be "copied" into the logwriter's main WAL buffer when committing or when the buffer is full). It would seem like a very interesting idea to limit the interleaving of concurrent sessions' WAL records and exploit the recent-buffer enhancement to avoid repeating the same calls to Smgr, wouldn't it? (I'm just mentioning it as I saw you were benchmarking it here and called out this idea.)
I could be wrong about many of those simplifications; in any case, please consult with Thomas as he knows much better and is a much more trusted source than me 😉
-J.
The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: tested, passed
Documentation: not tested
I believe it is ready for committer.
The new status of this patch is: Ready for Committer
I'd been thinking of this patch again. When testing with simplehash,
I found that the width of the hash bucket type was fairly critical for
getting good performance from simplehash.h. With simplehash.h I
didn't manage to narrow this any further than 16 bytes. I needed to
store the 32-bit hash value and a pointer to the data. On a 64-bit
machine, with padding, that's 16 bytes. I've been thinking about a way to
narrow this down further to just 8 bytes and also solve the stable
pointer problem at the same time...
I've come up with a new hash table implementation that I've called
generichash. It works similarly to simplehash with regard to the
linear probing, only instead of storing the data in the hash bucket,
we just store a uint32 index that indexes off into an array. To keep
the pointers in that array stable, we cannot resize the array as the
table grows. Instead, I just allocate another array of the same size.
Since these arrays are always sized as powers of 2, it's very fast to
index into them using the uint32 index that's stored in the bucket.
Unused buckets just store the special index of 0xFFFFFFFF.
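In other words, something along these lines (a simplified sketch with made-up names; the actual GH_* macros and structs in the attached patch differ and do more):

#define ITEMS_PER_SEGMENT     256          /* must be a power of 2 */
#define UNUSED_BUCKET_INDEX   0xFFFFFFFF   /* sentinel for an empty bucket */

typedef struct Element { int payload; } Element;   /* stand-in for the stored type */

typedef struct Segment
{
	Element		items[ITEMS_PER_SEGMENT];
} Segment;

typedef struct Bucket
{
	uint32		hashvalue;	/* cached hash */
	uint32		index;		/* index into the segments, or UNUSED_BUCKET_INDEX */
} Bucket;

/*
 * Segments are never resized or moved, so the address returned here stays
 * stable even when the bucket array is rebuilt as the table grows.
 */
static inline Element *
index_to_element(Segment **segments, uint32 index)
{
	uint32		segno = index / ITEMS_PER_SEGMENT;	/* power-of-2 divide, a shift */
	uint32		itemno = index % ITEMS_PER_SEGMENT;

	return &segments[segno]->items[itemno];
}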
I've also proposed to use this hash table implementation over in [1]
to speed up LockReleaseAll(). The 0001 patch here is just the same as
the patch from [1].
The 0002 patch includes using a generichash hash table for SMgr.
The performance using generichash.h is about the same as with the
simplehash.h version of the patch, although the test was not done on
the same version of master.
Master (97b713418)
drowley@amd3990x:~$ tail -f pg.log | grep "redo done"
CPU: user: 124.85 s, system: 6.83 s, elapsed: 131.74 s
CPU: user: 115.01 s, system: 4.76 s, elapsed: 119.83 s
CPU: user: 122.13 s, system: 6.41 s, elapsed: 128.60 s
CPU: user: 113.85 s, system: 6.11 s, elapsed: 120.02 s
CPU: user: 121.40 s, system: 6.28 s, elapsed: 127.74 s
CPU: user: 113.71 s, system: 5.80 s, elapsed: 119.57 s
CPU: user: 113.96 s, system: 5.90 s, elapsed: 119.92 s
CPU: user: 122.74 s, system: 6.21 s, elapsed: 129.01 s
CPU: user: 122.00 s, system: 6.38 s, elapsed: 128.44 s
CPU: user: 113.06 s, system: 6.14 s, elapsed: 119.25 s
CPU: user: 114.42 s, system: 4.35 s, elapsed: 118.82 s
Median: 120.02 s
master + v1 + v2
drowley@amd3990x:~$ tail -n 0 -f pg.log | grep "redo done"
CPU: user: 107.75 s, system: 4.61 s, elapsed: 112.41 s
CPU: user: 108.07 s, system: 4.49 s, elapsed: 112.61 s
CPU: user: 106.89 s, system: 5.55 s, elapsed: 112.49 s
CPU: user: 107.42 s, system: 5.64 s, elapsed: 113.12 s
CPU: user: 106.85 s, system: 4.42 s, elapsed: 111.31 s
CPU: user: 107.36 s, system: 4.76 s, elapsed: 112.16 s
CPU: user: 107.20 s, system: 4.47 s, elapsed: 111.72 s
CPU: user: 106.94 s, system: 5.89 s, elapsed: 112.88 s
CPU: user: 115.32 s, system: 6.12 s, elapsed: 121.49 s
CPU: user: 108.02 s, system: 4.48 s, elapsed: 112.54 s
CPU: user: 106.93 s, system: 4.54 s, elapsed: 111.51 s
Median: 112.49 s
So about a 6.69% speedup
David
[1]: /messages/by-id/CAApHDvoKqWRxw5nnUPZ8+mAJKHPOPxYGoY1gQdh0WeS4+biVhg@mail.gmail.com
Attachments:
v1-0001-Add-a-new-hash-table-type-which-has-stable-pointe.patch (application/octet-stream)
From 3974822be2b094c229bad9f638d3189f0892b81d Mon Sep 17 00:00:00 2001
From: David Rowley <dgrowley@gmail.com>
Date: Fri, 18 Jun 2021 04:22:11 +1200
Subject: [PATCH v1 1/2] Add a new hash table type which has stable pointers
This is named generichash. It's similar to, and takes most of the code
from, simplehash.h, but provides stable pointers to hashed elements.
simplehash will move these around, which means it's not possible to have
anything point to your hash entry.
generichash.h allocates elements in "segments", by default 256 at a time.
When those are filled another segment is allocated. When items are
removed from the table, new items will try to fill from the lowest segment
with available space. This should help reduce fragmentation of the data.
Sequential scans over the table should remain fast. We use a bitmap to
record which elements of each segment are in use. This allows us to
quickly loop over only used elements and skip to the next segment.
Make use of this new hash table type to help speed up the locallock table
in lock.c
---
src/backend/storage/lmgr/lock.c | 115 ++-
src/backend/utils/cache/relcache.c | 9 +-
src/include/lib/generichash.h | 1409 ++++++++++++++++++++++++++++
src/include/storage/lock.h | 2 +-
4 files changed, 1484 insertions(+), 51 deletions(-)
create mode 100644 src/include/lib/generichash.h
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 108b4d9023..081a06b417 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -37,6 +37,7 @@
#include "access/twophase_rmgr.h"
#include "access/xact.h"
#include "access/xlog.h"
+#include "common/hashfn.h"
#include "miscadmin.h"
#include "pg_trace.h"
#include "pgstat.h"
@@ -270,6 +271,19 @@ typedef struct
static volatile FastPathStrongRelationLockData *FastPathStrongRelationLocks;
+#define GH_PREFIX locallocktable
+#define GH_ELEMENT_TYPE LOCALLOCK
+#define GH_KEY_TYPE LOCALLOCKTAG
+#define GH_KEY tag
+#define GH_HASH_KEY(tb, key) hash_bytes((unsigned char *) &key, sizeof(LOCALLOCKTAG))
+#define GH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(LOCALLOCKTAG)) == 0)
+#define GH_ALLOCATE(b) MemoryContextAllocExtended(TopMemoryContext, b, MCXT_ALLOC_HUGE)
+#define GH_ALLOCATE_ZERO(b) MemoryContextAllocExtended(TopMemoryContext, b, MCXT_ALLOC_HUGE | MCXT_ALLOC_ZERO)
+#define GH_FREE(p) pfree(p)
+#define GH_SCOPE static inline
+#define GH_DECLARE
+#define GH_DEFINE
+#include "lib/generichash.h"
/*
* Pointers to hash tables containing lock state
@@ -279,7 +293,7 @@ static volatile FastPathStrongRelationLockData *FastPathStrongRelationLocks;
*/
static HTAB *LockMethodLockHash;
static HTAB *LockMethodProcLockHash;
-static HTAB *LockMethodLocalHash;
+static locallocktable_hash *LockMethodLocalHash;
/* private state for error cleanup */
@@ -467,15 +481,9 @@ InitLocks(void)
* ought to be empty in the postmaster, but for safety let's zap it.)
*/
if (LockMethodLocalHash)
- hash_destroy(LockMethodLocalHash);
+ locallocktable_destroy(LockMethodLocalHash);
- info.keysize = sizeof(LOCALLOCKTAG);
- info.entrysize = sizeof(LOCALLOCK);
-
- LockMethodLocalHash = hash_create("LOCALLOCK hash",
- 16,
- &info,
- HASH_ELEM | HASH_BLOBS);
+ LockMethodLocalHash = locallocktable_create(16);
}
@@ -606,22 +614,37 @@ LockHeldByMe(const LOCKTAG *locktag, LOCKMODE lockmode)
localtag.lock = *locktag;
localtag.mode = lockmode;
- locallock = (LOCALLOCK *) hash_search(LockMethodLocalHash,
- (void *) &localtag,
- HASH_FIND, NULL);
+ locallock = locallocktable_lookup(LockMethodLocalHash, localtag);
return (locallock && locallock->nLocks > 0);
}
#ifdef USE_ASSERT_CHECKING
/*
- * GetLockMethodLocalHash -- return the hash of local locks, for modules that
- * evaluate assertions based on all locks held.
+ * GetLockMethodLocalLocks -- returns an array of all LOCALLOCKs stored in
+ * LockMethodLocalHash.
+ *
+ * The caller must pfree the return value when done. *size is set to the
+ * number of elements in the returned array.
*/
-HTAB *
-GetLockMethodLocalHash(void)
+LOCALLOCK **
+GetLockMethodLocalLocks(uint32 *size)
{
- return LockMethodLocalHash;
+ locallocktable_iterator iterator;
+ LOCALLOCK **locallocks;
+ LOCALLOCK *locallock;
+ uint32 i = 0;
+
+ locallocks = (LOCALLOCK **) palloc(sizeof(LOCALLOCK *) *
+ LockMethodLocalHash->members);
+
+ locallocktable_start_iterate(LockMethodLocalHash, &iterator);
+ while ((locallock = locallocktable_iterate(LockMethodLocalHash,
+ &iterator)) != NULL)
+ locallocks[i++] = locallock;
+
+ *size = i;
+ return locallocks;
}
#endif
@@ -661,9 +684,7 @@ LockHasWaiters(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
localtag.lock = *locktag;
localtag.mode = lockmode;
- locallock = (LOCALLOCK *) hash_search(LockMethodLocalHash,
- (void *) &localtag,
- HASH_FIND, NULL);
+ locallock = locallocktable_lookup(LockMethodLocalHash, localtag);
/*
* let the caller print its own error message, too. Do not ereport(ERROR).
@@ -823,9 +844,7 @@ LockAcquireExtended(const LOCKTAG *locktag,
localtag.lock = *locktag;
localtag.mode = lockmode;
- locallock = (LOCALLOCK *) hash_search(LockMethodLocalHash,
- (void *) &localtag,
- HASH_ENTER, &found);
+ locallock = locallocktable_insert(LockMethodLocalHash, localtag, &found);
/*
* if it's a new locallock object, initialize it
@@ -1390,9 +1409,7 @@ RemoveLocalLock(LOCALLOCK *locallock)
SpinLockRelease(&FastPathStrongRelationLocks->mutex);
}
- if (!hash_search(LockMethodLocalHash,
- (void *) &(locallock->tag),
- HASH_REMOVE, NULL))
+ if (!locallocktable_delete(LockMethodLocalHash, locallock->tag))
elog(WARNING, "locallock table corrupted");
/*
@@ -2002,9 +2019,7 @@ LockRelease(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
localtag.lock = *locktag;
localtag.mode = lockmode;
- locallock = (LOCALLOCK *) hash_search(LockMethodLocalHash,
- (void *) &localtag,
- HASH_FIND, NULL);
+ locallock = locallocktable_lookup(LockMethodLocalHash, localtag);
/*
* let the caller print its own error message, too. Do not ereport(ERROR).
@@ -2178,7 +2193,7 @@ LockRelease(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
void
LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
{
- HASH_SEQ_STATUS status;
+ locallocktable_iterator iterator;
LockMethod lockMethodTable;
int i,
numLockModes;
@@ -2216,9 +2231,10 @@ LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
* pointers. Fast-path locks are cleaned up during the locallock table
* scan, though.
*/
- hash_seq_init(&status, LockMethodLocalHash);
+ locallocktable_start_iterate(LockMethodLocalHash, &iterator);
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+ while ((locallock = locallocktable_iterate(LockMethodLocalHash,
+ &iterator)) != NULL)
{
/*
* If the LOCALLOCK entry is unused, we must've run out of shared
@@ -2452,15 +2468,16 @@ LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
void
LockReleaseSession(LOCKMETHODID lockmethodid)
{
- HASH_SEQ_STATUS status;
+ locallocktable_iterator iterator;
LOCALLOCK *locallock;
if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
elog(ERROR, "unrecognized lock method: %d", lockmethodid);
- hash_seq_init(&status, LockMethodLocalHash);
+ locallocktable_start_iterate(LockMethodLocalHash, &iterator);
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+ while ((locallock = locallocktable_iterate(LockMethodLocalHash,
+ &iterator)) != NULL)
{
/* Ignore items that are not of the specified lock method */
if (LOCALLOCK_LOCKMETHOD(*locallock) != lockmethodid)
@@ -2484,12 +2501,13 @@ LockReleaseCurrentOwner(LOCALLOCK **locallocks, int nlocks)
{
if (locallocks == NULL)
{
- HASH_SEQ_STATUS status;
+ locallocktable_iterator iterator;
LOCALLOCK *locallock;
- hash_seq_init(&status, LockMethodLocalHash);
+ locallocktable_start_iterate(LockMethodLocalHash, &iterator);
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+ while ((locallock = locallocktable_iterate(LockMethodLocalHash,
+ &iterator)) != NULL)
ReleaseLockIfHeld(locallock, false);
}
else
@@ -2583,12 +2601,13 @@ LockReassignCurrentOwner(LOCALLOCK **locallocks, int nlocks)
if (locallocks == NULL)
{
- HASH_SEQ_STATUS status;
+ locallocktable_iterator iterator;
LOCALLOCK *locallock;
- hash_seq_init(&status, LockMethodLocalHash);
+ locallocktable_start_iterate(LockMethodLocalHash, &iterator);
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+ while ((locallock = locallocktable_iterate(LockMethodLocalHash,
+ &iterator)) != NULL)
LockReassignOwner(locallock, parent);
}
else
@@ -3220,7 +3239,7 @@ LockRefindAndRelease(LockMethod lockMethodTable, PGPROC *proc,
void
AtPrepare_Locks(void)
{
- HASH_SEQ_STATUS status;
+ locallocktable_iterator iterator;
LOCALLOCK *locallock;
/*
@@ -3229,9 +3248,10 @@ AtPrepare_Locks(void)
* Fast-path locks are an exception, however: we move any such locks to
* the main table before allowing PREPARE TRANSACTION to succeed.
*/
- hash_seq_init(&status, LockMethodLocalHash);
+ locallocktable_start_iterate(LockMethodLocalHash, &iterator);
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+ while ((locallock = locallocktable_iterate(LockMethodLocalHash,
+ &iterator)) != NULL)
{
TwoPhaseLockRecord record;
LOCALLOCKOWNER *lockOwners = locallock->lockOwners;
@@ -3331,7 +3351,7 @@ void
PostPrepare_Locks(TransactionId xid)
{
PGPROC *newproc = TwoPhaseGetDummyProc(xid, false);
- HASH_SEQ_STATUS status;
+ locallocktable_iterator iterator;
LOCALLOCK *locallock;
LOCK *lock;
PROCLOCK *proclock;
@@ -3354,9 +3374,10 @@ PostPrepare_Locks(TransactionId xid)
* pointing to the same proclock, and we daren't end up with any dangling
* pointers.
*/
- hash_seq_init(&status, LockMethodLocalHash);
+ locallocktable_start_iterate(LockMethodLocalHash, &iterator);
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+ while ((locallock = locallocktable_iterate(LockMethodLocalHash,
+ &iterator)) != NULL)
{
LOCALLOCKOWNER *lockOwners = locallock->lockOwners;
bool haveSessionLock;
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index d55ae016d0..85b1c52870 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -3004,12 +3004,13 @@ void
AssertPendingSyncs_RelationCache(void)
{
HASH_SEQ_STATUS status;
- LOCALLOCK *locallock;
+ LOCALLOCK **locallocks;
Relation *rels;
int maxrels;
int nrels;
RelIdCacheEnt *idhentry;
int i;
+ uint32 nlocallocks;
/*
* Open every relation that this transaction has locked. If, for some
@@ -3022,9 +3023,10 @@ AssertPendingSyncs_RelationCache(void)
maxrels = 1;
rels = palloc(maxrels * sizeof(*rels));
nrels = 0;
- hash_seq_init(&status, GetLockMethodLocalHash());
- while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+ locallocks = GetLockMethodLocalLocks(&nlocallocks);
+ for (i = 0; i < nlocallocks; i++)
{
+ LOCALLOCK *locallock = locallocks[i];
Oid relid;
Relation r;
@@ -3044,6 +3046,7 @@ AssertPendingSyncs_RelationCache(void)
}
rels[nrels++] = r;
}
+ pfree(locallocks);
hash_seq_init(&status, RelationIdCache);
while ((idhentry = (RelIdCacheEnt *) hash_seq_search(&status)) != NULL)
diff --git a/src/include/lib/generichash.h b/src/include/lib/generichash.h
new file mode 100644
index 0000000000..3e075d1676
--- /dev/null
+++ b/src/include/lib/generichash.h
@@ -0,0 +1,1409 @@
+/*
+ * generichash.h
+ *
+ * A hashtable implementation which can be included into .c files to
+ * provide a fast hash table implementation specific to the given type.
+ *
+ * GH_ELEMENT_TYPE defines the data type that the hashtable stores. Each
+ * instance of GH_ELEMENT_TYPE which is stored in the hash table is done so
+ * inside a GH_SEGMENT. These GH_SEGMENTs are allocated on demand and
+ * store GH_ITEMS_PER_SEGMENT each. After items are removed from the hash
+ * table, the next inserted item's data will be stored in the earliest free
+ * item in the earliest free segment. This helps keep the actual data
+ * compact even when the bucket array has become large.
+ *
+ * The bucket array is an array of GH_BUCKET and is dynamically allocated
+ * and may grow as more items are added to the table. The GH_BUCKET type
+ * is very narrow and stores just 2 uint32 values. One of these is the
+ * hash value and the other is the index into the segments which are used
+ * to directly look up the stored GH_ELEMENT_TYPE type.
+ *
+ * During inserts, hash table collisions are dealt with using linear
+ * probing, this means that instead of doing something like chaining with a
+ * linked list, we use the first free bucket which comes after the optimal
+ * bucket. This is much more CPU cache efficient than traversing a linked
+ * list. When we're unable to use the most optimal bucket, we may also
+ * move the contents of subsequent buckets around so that we keep items as
+ * close to their most optimal position as possible. This prevents
+ * excessively long linear probes during lookups.
+ *
+ * During hash table deletes, we must attempt to move the contents of
+ * buckets that are not in their optimal position up to either their
+ * optimal position, or as close as we can get to it. During lookups, this
+ * means that we can stop searching for a non-existing item as soon as we
+ * find an empty bucket.
+ *
+ * Empty buckets are denoted by their 'index' field being set to
+ * GH_UNUSED_BUCKET_INDEX. This is done rather than adding a special field
+ * so that we can keep the GH_BUCKET type as narrow as possible.
+ * Conveniently sizeof(GH_BUCKET) is 8, which allows 8 of these to fit on a
+ * single 64-byte cache line. It's important to keep this type as narrow as
+ * possible so that we can perform hash lookups by hitting as few
+ * cache lines as possible.
+ *
+ * The implementation here is similar to simplehash.h but has the following
+ * benefits:
+ *
+ * - Pointers to elements are stable and are not moved around like they are
+ * in simplehash.h
+ * - Sequential scans of the hash table remain very fast even when the
+ * table is sparsely populated.
+ * - Moving the contents of buckets around during inserts and deletes is
+ * generally cheaper here due to GH_BUCKET being very narrow.
+ *
+ * If none of the above points are important for the given use case then,
+ * please consider using simplehash.h instead.
+ *
+ *
+ * Portions Copyright (c) 2021, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/lib/generichash.h
+ *
+ */
+
+#include "port/pg_bitutils.h"
+
+/* helpers */
+#define GH_MAKE_PREFIX(a) CppConcat(a,_)
+#define GH_MAKE_NAME(name) GH_MAKE_NAME_(GH_MAKE_PREFIX(GH_PREFIX),name)
+#define GH_MAKE_NAME_(a,b) CppConcat(a,b)
+
+/* type declarations */
+#define GH_TYPE GH_MAKE_NAME(hash)
+#define GH_BUCKET GH_MAKE_NAME(bucket)
+#define GH_SEGMENT GH_MAKE_NAME(segment)
+#define GH_ITERATOR GH_MAKE_NAME(iterator)
+
+/* function declarations */
+#define GH_CREATE GH_MAKE_NAME(create)
+#define GH_DESTROY GH_MAKE_NAME(destroy)
+#define GH_RESET GH_MAKE_NAME(reset)
+#define GH_INSERT GH_MAKE_NAME(insert)
+#define GH_INSERT_HASH GH_MAKE_NAME(insert_hash)
+#define GH_DELETE GH_MAKE_NAME(delete)
+#define GH_LOOKUP GH_MAKE_NAME(lookup)
+#define GH_LOOKUP_HASH GH_MAKE_NAME(lookup_hash)
+#define GH_GROW GH_MAKE_NAME(grow)
+#define GH_START_ITERATE GH_MAKE_NAME(start_iterate)
+#define GH_ITERATE GH_MAKE_NAME(iterate)
+
+/* internal helper functions (no externally visible prototypes) */
+#define GH_NEXT_ONEBIT GH_MAKE_NAME(next_onebit)
+#define GH_NEXT_ZEROBIT GH_MAKE_NAME(next_zerobit)
+#define GH_INDEX_TO_ELEMENT GH_MAKE_NAME(index_to_element)
+#define GH_MARK_SEGMENT_ITEM_USED GH_MAKE_NAME(mark_segment_item_used)
+#define GH_MARK_SEGMENT_ITEM_UNUSED GH_MAKE_NAME(mark_segment_item_unused)
+#define GH_GET_NEXT_UNUSED_ENTRY GH_MAKE_NAME(get_next_unused_entry)
+#define GH_REMOVE_ENTRY GH_MAKE_NAME(remove_entry)
+#define GH_SET_BUCKET_IN_USE GH_MAKE_NAME(set_bucket_in_use)
+#define GH_SET_BUCKET_EMPTY GH_MAKE_NAME(set_bucket_empty)
+#define GH_IS_BUCKET_IN_USE GH_MAKE_NAME(is_bucket_in_use)
+#define GH_COMPUTE_PARAMETERS GH_MAKE_NAME(compute_parameters)
+#define GH_NEXT GH_MAKE_NAME(next)
+#define GH_PREV GH_MAKE_NAME(prev)
+#define GH_DISTANCE_FROM_OPTIMAL GH_MAKE_NAME(distance)
+#define GH_INITIAL_BUCKET GH_MAKE_NAME(initial_bucket)
+#define GH_INSERT_HASH_INTERNAL GH_MAKE_NAME(insert_hash_internal)
+#define GH_LOOKUP_HASH_INTERNAL GH_MAKE_NAME(lookup_hash_internal)
+
+/*
+ * When allocating memory to store instances of GH_ELEMENT_TYPE, how many
+ * should we allocate at once? This must be a power of 2 and at least
+ * GH_BITS_PER_WORD.
+ */
+#ifndef GH_ITEMS_PER_SEGMENT
+#define GH_ITEMS_PER_SEGMENT 256
+#endif
+
+/* A special index to set GH_BUCKET->index to when it's not in use */
+#define GH_UNUSED_BUCKET_INDEX PG_UINT32_MAX
+
+/*
+ * Macros for translating a bucket's index into the segment and another to
+ * determine the item number within the segment.
+ */
+#define GH_INDEX_SEGMENT(i) (i) / GH_ITEMS_PER_SEGMENT
+#define GH_INDEX_ITEM(i) (i) % GH_ITEMS_PER_SEGMENT
+
+ /*
+ * How many elements do we need in the bitmap array to store a bit for each
+ * of GH_ITEMS_PER_SEGMENT. Keep the word size native to the processor.
+ */
+#if SIZEOF_VOID_P >= 8
+
+#define GH_BITS_PER_WORD 64
+#define GH_BITMAP_WORD uint64
+#define GH_RIGHTMOST_ONE_POS(x) pg_rightmost_one_pos64(x)
+
+#else
+
+#define GH_BITS_PER_WORD 32
+#define GH_BITMAP_WORD uint32
+#define GH_RIGHTMOST_ONE_POS(x) pg_rightmost_one_pos32(x)
+
+#endif
+
+/* Sanity check on GH_ITEMS_PER_SEGMENT setting */
+#if GH_ITEMS_PER_SEGMENT < GH_BITS_PER_WORD
+#error "GH_ITEMS_PER_SEGMENT must be >= GH_BITS_PER_WORD"
+#endif
+
+/* Ensure GH_ITEMS_PER_SEGMENT is a power of 2 */
+#if GH_ITEMS_PER_SEGMENT & (GH_ITEMS_PER_SEGMENT - 1) != 0
+#error "GH_ITEMS_PER_SEGMENT must be a power of 2"
+#endif
+
+#define GH_BITMAP_WORDS (GH_ITEMS_PER_SEGMENT / GH_BITS_PER_WORD)
+#define GH_WORDNUM(x) ((x) / GH_BITS_PER_WORD)
+#define GH_BITNUM(x) ((x) % GH_BITS_PER_WORD)
+
+/* generate forward declarations necessary to use the hash table */
+#ifdef GH_DECLARE
+
+typedef struct GH_BUCKET
+{
+ uint32 hashvalue; /* Hash value for this bucket */
+ uint32 index; /* Index to the actual data */
+} GH_BUCKET;
+
+typedef struct GH_SEGMENT
+{
+ uint32 nitems; /* Number of items stored */
+ GH_BITMAP_WORD used_items[GH_BITMAP_WORDS]; /* A 1-bit for each used item
+ * in the items array */
+ GH_ELEMENT_TYPE items[GH_ITEMS_PER_SEGMENT]; /* the actual data */
+} GH_SEGMENT;
+
+/* type definitions */
+
+/*
+ * GH_TYPE
+ * Hash table metadata type
+ */
+typedef struct GH_TYPE
+{
+ /*
+ * Size of bucket array. Note that the maximum number of elements is
+ * lower (GH_MAX_FILLFACTOR)
+ */
+ uint32 size;
+
+ /* mask for bucket and size calculations, based on size */
+ uint32 sizemask;
+
+ /* the number of elements stored */
+ uint32 members;
+
+ /* boundary after which to grow hashtable */
+ uint32 grow_threshold;
+
+ /* how many elements are there in the segments array */
+ uint32 nsegments;
+
+ /* the number of elements in the used_segments array */
+ uint32 used_segment_words;
+
+ /*
+ * The first segment we should search in for an empty slot. This will be
+ * the first segment that GH_GET_NEXT_UNUSED_ENTRY will search in when
+ * looking for an unused entry. We'll increase the value of this when we
+ * fill a segment and we'll lower it down when we delete an item from a
+ * segment lower than this value.
+ */
+ uint32 first_free_segment;
+
+ /* dynamically allocated array of hash buckets */
+ GH_BUCKET *buckets;
+
+ /* an array of segment pointers to store data */
+ GH_SEGMENT **segments;
+
+ /*
+ * A bitmap of non-empty segments. A 1-bit denotes that the corresponding
+ * segment is non-empty.
+ */
+ GH_BITMAP_WORD *used_segments;
+
+#ifdef GH_HAVE_PRIVATE_DATA
+ /* user defined data, useful for callbacks */
+ void *private_data;
+#endif
+} GH_TYPE;
+
+/*
+ * GH_ITERATOR
+ * Used when looping over the contents of the hash table.
+ */
+typedef struct GH_ITERATOR
+{
+ int32 cursegidx; /* current segment. -1 means not started */
+ int32 curitemidx; /* current item within cursegidx, -1 means not
+ * started */
+ uint32 found_members; /* number of items visited so far in the loop */
+ uint32 total_members; /* number of items that existed at the start
+ * iteration. */
+} GH_ITERATOR;
+
+/* externally visible function prototypes */
+
+#ifdef GH_HAVE_PRIVATE_DATA
+/* <prefix>_hash <prefix>_create(uint32 nbuckets, void *private_data) */
+GH_SCOPE GH_TYPE *GH_CREATE(uint32 nbuckets, void *private_data);
+#else
+/* <prefix>_hash <prefix>_create(uint32 nbuckets) */
+GH_SCOPE GH_TYPE *GH_CREATE(uint32 nbuckets);
+#endif
+
+/* void <prefix>_destroy(<prefix>_hash *tb) */
+GH_SCOPE void GH_DESTROY(GH_TYPE * tb);
+
+/* void <prefix>_reset(<prefix>_hash *tb) */
+GH_SCOPE void GH_RESET(GH_TYPE * tb);
+
+/* void <prefix>_grow(<prefix>_hash *tb) */
+GH_SCOPE void GH_GROW(GH_TYPE * tb, uint32 newsize);
+
+/* <element> *<prefix>_insert(<prefix>_hash *tb, <key> key, bool *found) */
+GH_SCOPE GH_ELEMENT_TYPE *GH_INSERT(GH_TYPE * tb, GH_KEY_TYPE key,
+ bool *found);
+
+/*
+ * <element> *<prefix>_insert_hash(<prefix>_hash *tb, <key> key, uint32 hash,
+ * bool *found)
+ */
+GH_SCOPE GH_ELEMENT_TYPE *GH_INSERT_HASH(GH_TYPE * tb, GH_KEY_TYPE key,
+ uint32 hash, bool *found);
+
+/* <element> *<prefix>_lookup(<prefix>_hash *tb, <key> key) */
+GH_SCOPE GH_ELEMENT_TYPE *GH_LOOKUP(GH_TYPE * tb, GH_KEY_TYPE key);
+
+/* <element> *<prefix>_lookup_hash(<prefix>_hash *tb, <key> key, uint32 hash) */
+GH_SCOPE GH_ELEMENT_TYPE *GH_LOOKUP_HASH(GH_TYPE * tb, GH_KEY_TYPE key,
+ uint32 hash);
+
+/* bool <prefix>_delete(<prefix>_hash *tb, <key> key) */
+GH_SCOPE bool GH_DELETE(GH_TYPE * tb, GH_KEY_TYPE key);
+
+/* void <prefix>_start_iterate(<prefix>_hash *tb, <prefix>_iterator *iter) */
+GH_SCOPE void GH_START_ITERATE(GH_TYPE * tb, GH_ITERATOR * iter);
+
+/* <element> *<prefix>_iterate(<prefix>_hash *tb, <prefix>_iterator *iter) */
+GH_SCOPE GH_ELEMENT_TYPE *GH_ITERATE(GH_TYPE * tb, GH_ITERATOR * iter);
+
+#endif /* GH_DECLARE */
+
+/* generate implementation of the hash table */
+#ifdef GH_DEFINE
+
+/*
+ * The maximum size for the hash table. This must be a power of 2. We cannot
+ * make this PG_UINT32_MAX + 1 because we use GH_UNUSED_BUCKET_INDEX denote an
+ * empty bucket. Doing so would mean we could accidentally set a used
+ * bucket's index to GH_UNUSED_BUCKET_INDEX.
+ */
+#define GH_MAX_SIZE ((uint32) PG_INT32_MAX + 1)
+
+/* normal fillfactor, unless already close to maximum */
+#ifndef GH_FILLFACTOR
+#define GH_FILLFACTOR (0.9)
+#endif
+/* increase fillfactor if we otherwise would error out */
+#define GH_MAX_FILLFACTOR (0.98)
+/* grow if actual and optimal location bigger than */
+#ifndef GH_GROW_MAX_DIB
+#define GH_GROW_MAX_DIB 25
+#endif
+/*
+ * Grow if more than this number of buckets needs to be moved when inserting.
+ */
+#ifndef GH_GROW_MAX_MOVE
+#define GH_GROW_MAX_MOVE 150
+#endif
+#ifndef GH_GROW_MIN_FILLFACTOR
+/* but do not grow due to GH_GROW_MAX_* if below */
+#define GH_GROW_MIN_FILLFACTOR 0.1
+#endif
+
+/*
+ * Wrap the following definitions in include guards, to avoid multiple
+ * definition errors if this header is included more than once. The rest of
+ * the file deliberately has no include guards, because it can be included
+ * with different parameters to define functions and types with non-colliding
+ * names.
+ */
+#ifndef GENERICHASH_H
+#define GENERICHASH_H
+
+#ifdef FRONTEND
+#define gh_error(...) pg_log_error(__VA_ARGS__)
+#define gh_log(...) pg_log_info(__VA_ARGS__)
+#else
+#define gh_error(...) elog(ERROR, __VA_ARGS__)
+#define gh_log(...) elog(LOG, __VA_ARGS__)
+#endif
+
+#endif /* GENERICHASH_H */
+
+/*
+ * Gets the position of the first 1-bit which comes after 'prevbit' in the
+ * 'words' array. 'nwords' is the size of the 'words' array.
+ */
+static inline int32
+GH_NEXT_ONEBIT(GH_BITMAP_WORD * words, uint32 nwords, int32 prevbit)
+{
+ uint32 wordnum;
+
+ prevbit++;
+
+ wordnum = GH_WORDNUM(prevbit);
+ if (wordnum < nwords)
+ {
+ GH_BITMAP_WORD mask = (~(GH_BITMAP_WORD) 0) << GH_BITNUM(prevbit);
+ GH_BITMAP_WORD word = words[wordnum] & mask;
+
+ if (word != 0)
+ return wordnum * GH_BITS_PER_WORD + GH_RIGHTMOST_ONE_POS(word);
+
+ for (++wordnum; wordnum < nwords; wordnum++)
+ {
+ word = words[wordnum];
+
+ if (word != 0)
+ {
+ int32 result = wordnum * GH_BITS_PER_WORD;
+
+ result += GH_RIGHTMOST_ONE_POS(word);
+ return result;
+ }
+ }
+ }
+ return -1;
+}
+
+/*
+ * Gets the position of the first 0-bit which comes after 'prevbit' in the
+ * 'words' array. 'nwords' is the size of the 'words' array.
+ *
+ * This is similar to GH_NEXT_ONEBIT but flips the bits before operating on
+ * each GH_BITMAP_WORD.
+ */
+static inline int32
+GH_NEXT_ZEROBIT(GH_BITMAP_WORD * words, uint32 nwords, int32 prevbit)
+{
+ uint32 wordnum;
+
+ prevbit++;
+
+ wordnum = GH_WORDNUM(prevbit);
+ if (wordnum < nwords)
+ {
+ GH_BITMAP_WORD mask = (~(GH_BITMAP_WORD) 0) << GH_BITNUM(prevbit);
+ GH_BITMAP_WORD word = ~(words[wordnum] & mask); /* flip bits */
+
+ if (word != 0)
+ return wordnum * GH_BITS_PER_WORD + GH_RIGHTMOST_ONE_POS(word);
+
+ for (++wordnum; wordnum < nwords; wordnum++)
+ {
+ word = ~words[wordnum]; /* flip bits */
+
+ if (word != 0)
+ {
+ int32 result = wordnum * GH_BITS_PER_WORD;
+
+ result += GH_RIGHTMOST_ONE_POS(word);
+ return result;
+ }
+ }
+ }
+ return -1;
+}
+
+/*
+ * Finds the hash table entry for a given GH_BUCKET's 'index'.
+ */
+static inline GH_ELEMENT_TYPE *
+GH_INDEX_TO_ELEMENT(GH_TYPE * tb, uint32 index)
+{
+ GH_SEGMENT *seg;
+ uint32 segidx;
+ uint32 item;
+
+ segidx = GH_INDEX_SEGMENT(index);
+ item = GH_INDEX_ITEM(index);
+
+ Assert(segidx < tb->nsegments);
+
+ seg = tb->segments[segidx];
+
+ Assert(seg != NULL);
+
+ /* ensure this segment is marked as used */
+ Assert(seg->used_items[GH_WORDNUM(item)] & (((GH_BITMAP_WORD) 1) << GH_BITNUM(item)));
+
+ return &seg->items[item];
+}
+
+static inline void
+GH_MARK_SEGMENT_ITEM_USED(GH_TYPE * tb, GH_SEGMENT * seg, uint32 segidx,
+ uint32 segitem)
+{
+ uint32 word = GH_WORDNUM(segitem);
+ uint32 bit = GH_BITNUM(segitem);
+
+ /* ensure this item is not marked as used */
+ Assert((seg->used_items[word] & (((GH_BITMAP_WORD) 1) << bit)) == 0);
+
+ /* switch on the used bit */
+ seg->used_items[word] |= (((GH_BITMAP_WORD) 1) << bit);
+
+ /* if the segment was previously empty then mark it as used */
+ if (seg->nitems == 0)
+ {
+ word = GH_WORDNUM(segidx);
+ bit = GH_BITNUM(segidx);
+
+ /* switch on the used bit for this segment */
+ tb->used_segments[word] |= (((GH_BITMAP_WORD) 1) << bit);
+ }
+ seg->nitems++;
+}
+
+static inline void
+GH_MARK_SEGMENT_ITEM_UNUSED(GH_TYPE * tb, GH_SEGMENT * seg, uint32 segidx,
+ uint32 segitem)
+{
+ uint32 word = GH_WORDNUM(segitem);
+ uint32 bit = GH_BITNUM(segitem);
+
+ /* ensure this item is marked as used */
+ Assert((seg->used_items[word] & (((GH_BITMAP_WORD) 1) << bit)) != 0);
+
+ /* switch off the used bit */
+ seg->used_items[word] &= ~(((GH_BITMAP_WORD) 1) << bit);
+
+ /* when removing the last item mark the segment as unused */
+ if (seg->nitems == 1)
+ {
+ word = GH_WORDNUM(segidx);
+ bit = GH_BITNUM(segidx);
+
+ /* switch off the used bit for this segment */
+ tb->used_segments[word] &= ~(((GH_BITMAP_WORD) 1) << bit);
+ }
+
+ seg->nitems--;
+}
+
+/*
+ * Returns the first unused entry from the first non-full segment and set
+ * *index to the index of the returned entry.
+ */
+static inline GH_ELEMENT_TYPE *
+GH_GET_NEXT_UNUSED_ENTRY(GH_TYPE * tb, uint32 *index)
+{
+ GH_SEGMENT *seg;
+ uint32 segidx = tb->first_free_segment;
+ uint32 itemidx;
+
+ seg = tb->segments[segidx];
+
+ /* find the first segment with an unused item */
+ while (seg != NULL && seg->nitems == GH_ITEMS_PER_SEGMENT)
+ seg = tb->segments[++segidx];
+
+ tb->first_free_segment = segidx;
+
+ /* allocate the segment if it's not already */
+ if (seg == NULL)
+ {
+ seg = GH_ALLOCATE(sizeof(GH_SEGMENT));
+ tb->segments[segidx] = seg;
+
+ seg->nitems = 0;
+ memset(seg->used_items, 0, sizeof(seg->used_items));
+ /* no need to zero the items array */
+
+ /* use the first slot in this segment */
+ itemidx = 0;
+ }
+ else
+ {
+ /* find the first unused item in this segment */
+ itemidx = GH_NEXT_ZEROBIT(seg->used_items, GH_BITMAP_WORDS, -1);
+ Assert(itemidx >= 0);
+ }
+
+ /* this is a good spot to ensure nitems matches the bits in used_items */
+ Assert(seg->nitems == pg_popcount((const char *) seg->used_items, GH_ITEMS_PER_SEGMENT / 8));
+
+ GH_MARK_SEGMENT_ITEM_USED(tb, seg, segidx, itemidx);
+
+ *index = segidx * GH_ITEMS_PER_SEGMENT + itemidx;
+ return &seg->items[itemidx];
+
+}
+
+/*
+ * Remove the entry denoted by 'index' from its segment.
+ */
+static inline void
+GH_REMOVE_ENTRY(GH_TYPE * tb, uint32 index)
+{
+ GH_SEGMENT *seg;
+ uint32 segidx = GH_INDEX_SEGMENT(index);
+ uint32 item = GH_INDEX_ITEM(index);
+
+ Assert(segidx < tb->nsegments);
+ seg = tb->segments[segidx];
+ Assert(seg != NULL);
+
+ GH_MARK_SEGMENT_ITEM_UNUSED(tb, seg, segidx, item);
+
+ /*
+ * Lower the first free segment index to point to this segment so that the
+ * next insert will store in this segment. If it's already pointing to an
+ * earlier segment, then leave it be.
+ */
+ if (tb->first_free_segment > segidx)
+ tb->first_free_segment = segidx;
+}
+
+/*
+ * Set 'bucket' as in use by 'index'.
+ */
+static inline void
+GH_SET_BUCKET_IN_USE(GH_BUCKET * bucket, uint32 index)
+{
+ bucket->index = index;
+}
+
+/*
+ * Mark 'bucket' as unused.
+ */
+static inline void
+GH_SET_BUCKET_EMPTY(GH_BUCKET * bucket)
+{
+ bucket->index = GH_UNUSED_BUCKET_INDEX;
+}
+
+/*
+ * Return true if 'bucket' is in use.
+ */
+static inline bool
+GH_IS_BUCKET_IN_USE(GH_BUCKET * bucket)
+{
+ return bucket->index != GH_UNUSED_BUCKET_INDEX;
+}
+
+/*
+ * Compute sizing parameters for hashtable. Called when creating and growing
+ * the hashtable.
+ */
+static inline void
+GH_COMPUTE_PARAMETERS(GH_TYPE * tb, uint32 newsize)
+{
+ uint32 size;
+
+ /*
+ * Ensure the bucket array size has not exceeded GH_MAX_SIZE or wrapped
+ * back to zero.
+ */
+ if (newsize == 0 || newsize > GH_MAX_SIZE)
+ gh_error("hash table too large");
+
+ /*
+ * Ensure we don't build a table that can't store an entire single segment
+ * worth of data.
+ */
+ size = Max(newsize, GH_ITEMS_PER_SEGMENT);
+
+ /* round up size to the next power of 2 */
+ size = pg_nextpower2_32(size);
+
+ /* now set size */
+ tb->size = size;
+ tb->sizemask = tb->size - 1;
+
+ /* calculate how many segments we'll need to store 'size' items */
+ tb->nsegments = pg_nextpower2_32(size / GH_ITEMS_PER_SEGMENT);
+
+ /*
+ * Calculate the number of bitmap words needed to store a bit for each
+ * segment.
+ */
+ tb->used_segment_words = (tb->nsegments + GH_BITS_PER_WORD - 1) / GH_BITS_PER_WORD;
+
+ /*
+ * Compute the next threshold at which we need to grow the hash table
+ * again.
+ */
+ if (tb->size == GH_MAX_SIZE)
+ tb->grow_threshold = (uint32) (((double) tb->size) * GH_MAX_FILLFACTOR);
+ else
+ tb->grow_threshold = (uint32) (((double) tb->size) * GH_FILLFACTOR);
+}
+
+/* return the optimal bucket for the hash */
+static inline uint32
+GH_INITIAL_BUCKET(GH_TYPE * tb, uint32 hash)
+{
+ return hash & tb->sizemask;
+}
+
+/* return the next bucket after the current, handling wraparound */
+static inline uint32
+GH_NEXT(GH_TYPE * tb, uint32 curelem, uint32 startelem)
+{
+ curelem = (curelem + 1) & tb->sizemask;
+
+ Assert(curelem != startelem);
+
+ return curelem;
+}
+
+/* return the bucket before the current, handling wraparound */
+static inline uint32
+GH_PREV(GH_TYPE * tb, uint32 curelem, uint32 startelem)
+{
+ curelem = (curelem - 1) & tb->sizemask;
+
+ Assert(curelem != startelem);
+
+ return curelem;
+}
+
+/* return the distance between a bucket and its optimal position */
+static inline uint32
+GH_DISTANCE_FROM_OPTIMAL(GH_TYPE * tb, uint32 optimal, uint32 bucket)
+{
+ if (optimal <= bucket)
+ return bucket - optimal;
+ else
+ return (tb->size + bucket) - optimal;
+}
+
+/*
+ * Create a hash table with 'nbuckets' buckets.
+ */
+GH_SCOPE GH_TYPE *
+#ifdef GH_HAVE_PRIVATE_DATA
+GH_CREATE(uint32 nbuckets, void *private_data)
+#else
+GH_CREATE(uint32 nbuckets)
+#endif
+{
+ GH_TYPE *tb;
+ uint32 size;
+ uint32 i;
+
+ tb = GH_ALLOCATE_ZERO(sizeof(GH_TYPE));
+
+#ifdef GH_HAVE_PRIVATE_DATA
+ tb->private_data = private_data;
+#endif
+
+ /* increase nbuckets by fillfactor, as we want to store nbuckets elements */
+ size = (uint32) Min((double) GH_MAX_SIZE, ((double) nbuckets) / GH_FILLFACTOR);
+
+ GH_COMPUTE_PARAMETERS(tb, size);
+
+ tb->buckets = GH_ALLOCATE(sizeof(GH_BUCKET) * tb->size);
+
+ /* ensure all the buckets are set to empty */
+ for (i = 0; i < tb->size; i++)
+ GH_SET_BUCKET_EMPTY(&tb->buckets[i]);
+
+ tb->segments = GH_ALLOCATE_ZERO(sizeof(GH_SEGMENT *) * tb->nsegments);
+ tb->used_segments = GH_ALLOCATE_ZERO(sizeof(GH_BITMAP_WORD) * tb->used_segment_words);
+ return tb;
+}
+
+/* destroy a previously created hash table */
+GH_SCOPE void
+GH_DESTROY(GH_TYPE * tb)
+{
+ GH_FREE(tb->buckets);
+
+ /* Free each segment one by one */
+ for (uint32 n = 0; n < tb->nsegments; n++)
+ {
+ if (tb->segments[n] != NULL)
+ GH_FREE(tb->segments[n]);
+ }
+
+ GH_FREE(tb->segments);
+ GH_FREE(tb->used_segments);
+
+ pfree(tb);
+}
+
+/* reset the contents of a previously created hash table */
+GH_SCOPE void
+GH_RESET(GH_TYPE * tb)
+{
+ int32 i = -1;
+ uint32 x;
+
+ /* reset each used segment one by one */
+ while ((i = GH_NEXT_ONEBIT(tb->used_segments, tb->used_segment_words,
+ i)) >= 0)
+ {
+ GH_SEGMENT *seg = tb->segments[i];
+
+ Assert(seg != NULL);
+
+ seg->nitems = 0;
+ memset(seg->used_items, 0, sizeof(seg->used_items));
+ }
+
+ /* empty every bucket */
+ for (x = 0; x < tb->size; x++)
+ GH_SET_BUCKET_EMPTY(&tb->buckets[x]);
+
+ /* zero the used segment bits */
+ memset(tb->used_segments, 0, sizeof(GH_BITMAP_WORD) * tb->used_segment_words);
+
+ /* and mark the table as having zero members */
+ tb->members = 0;
+
+ /* ensure we start putting any new items in the first segment */
+ tb->first_free_segment = 0;
+}
+
+/*
+ * Grow a hash table to at least 'newsize' buckets.
+ *
+ * Usually this will automatically be called by insertions/deletions, when
+ * necessary. But resizing to the exact input size can be advantageous
+ * performance-wise, when known at some point.
+ */
+GH_SCOPE void
+GH_GROW(GH_TYPE * tb, uint32 newsize)
+{
+ uint32 oldsize = tb->size;
+ uint32 oldnsegments = tb->nsegments;
+ uint32 oldusedsegmentwords = tb->used_segment_words;
+ GH_BUCKET *oldbuckets = tb->buckets;
+ GH_SEGMENT **oldsegments = tb->segments;
+ GH_BITMAP_WORD *oldusedsegments = tb->used_segments;
+ GH_BUCKET *newbuckets;
+ uint32 i;
+ uint32 startelem = 0;
+ uint32 copyelem;
+
+ Assert(oldsize == pg_nextpower2_32(oldsize));
+
+ /* compute parameters for new table */
+ GH_COMPUTE_PARAMETERS(tb, newsize);
+
+ tb->buckets = GH_ALLOCATE(sizeof(GH_BUCKET) * tb->size);
+
+ /* Ensure all the buckets are set to empty */
+ for (i = 0; i < tb->size; i++)
+ GH_SET_BUCKET_EMPTY(&tb->buckets[i]);
+
+ newbuckets = tb->buckets;
+
+ /*
+ * Copy buckets from the old buckets to newbuckets. We theoretically could
+ * use GH_INSERT here, to avoid code duplication, but that's more general
+ * than we need. We neither want tb->members increased, nor do we need to
+ * deal with deleted elements, nor do we need to compare keys. So a
+ * special-cased implementation is a lot faster. Resizing can be time
+ * consuming and frequent, so it's worthwhile to optimize.
+ *
+ * To be able to simply move buckets over, we have to start not at the
+ * first bucket (i.e. oldbuckets[0]), but find the first bucket that's
+ * either empty or is occupied by an entry at its optimal position. Such a
+ * bucket has to exist in any table with a load factor under 1, as not all
+ * buckets are occupied, i.e. there always has to be an empty bucket. By
+ * starting at such a bucket we can move the entries to the larger table,
+ * without having to deal with conflicts.
+ */
+
+ /* search for the first element in the hash that's not wrapped around */
+ for (i = 0; i < oldsize; i++)
+ {
+ GH_BUCKET *oldbucket = &oldbuckets[i];
+ uint32 hash;
+ uint32 optimal;
+
+ if (!GH_IS_BUCKET_IN_USE(oldbucket))
+ {
+ startelem = i;
+ break;
+ }
+
+ hash = oldbucket->hashvalue;
+ optimal = GH_INITIAL_BUCKET(tb, hash);
+
+ if (optimal == i)
+ {
+ startelem = i;
+ break;
+ }
+ }
+
+ /* and copy all elements in the old table */
+ copyelem = startelem;
+ for (i = 0; i < oldsize; i++)
+ {
+ GH_BUCKET *oldbucket = &oldbuckets[copyelem];
+
+ if (GH_IS_BUCKET_IN_USE(oldbucket))
+ {
+ uint32 hash;
+ uint32 startelem;
+ uint32 curelem;
+ GH_BUCKET *newbucket;
+
+ hash = oldbucket->hashvalue;
+ startelem = GH_INITIAL_BUCKET(tb, hash);
+ curelem = startelem;
+
+ /* find empty element to put data into */
+ for (;;)
+ {
+ newbucket = &newbuckets[curelem];
+
+ if (!GH_IS_BUCKET_IN_USE(newbucket))
+ break;
+
+ curelem = GH_NEXT(tb, curelem, startelem);
+ }
+
+ /* copy entry to new slot */
+ memcpy(newbucket, oldbucket, sizeof(GH_BUCKET));
+ }
+
+ /* can't use GH_NEXT here, would use new size */
+ copyelem++;
+ if (copyelem >= oldsize)
+ copyelem = 0;
+ }
+
+ GH_FREE(oldbuckets);
+
+ /*
+ * Enlarge the segment array so we can store enough segments for the new
+ * hash table capacity.
+ */
+ tb->segments = GH_ALLOCATE(sizeof(GH_SEGMENT *) * tb->nsegments);
+ memcpy(tb->segments, oldsegments, sizeof(GH_SEGMENT *) * oldnsegments);
+ /* zero the newly extended part of the array */
+ memset(&tb->segments[oldnsegments], 0, sizeof(GH_SEGMENT *) *
+ (tb->nsegments - oldnsegments));
+ GH_FREE(oldsegments);
+
+ /*
+ * The majority of tables will only ever need one bitmap word to store
+ * used segments, so we only bother to reallocate the used_segments array
+ * if the number of bitmap words has actually changed.
+ */
+ if (tb->used_segment_words != oldusedsegmentwords)
+ {
+ tb->used_segments = GH_ALLOCATE(sizeof(GH_BITMAP_WORD) *
+ tb->used_segment_words);
+ memcpy(tb->used_segments, oldusedsegments, sizeof(GH_BITMAP_WORD) *
+ oldusedsegmentwords);
+ memset(&tb->used_segments[oldusedsegmentwords], 0,
+ sizeof(GH_BITMAP_WORD) * (tb->used_segment_words -
+ oldusedsegmentwords));
+
+ GH_FREE(oldusedsegments);
+ }
+}
+
+/*
+ * This is a separate static inline function, so it can reliably be inlined
+ * into its wrapper functions even if GH_SCOPE is extern.
+ */
+static inline GH_ELEMENT_TYPE *
+GH_INSERT_HASH_INTERNAL(GH_TYPE * tb, GH_KEY_TYPE key, uint32 hash, bool *found)
+{
+ uint32 startelem;
+ uint32 curelem;
+ GH_BUCKET *buckets;
+ uint32 insertdist;
+
+restart:
+ insertdist = 0;
+
+ /*
+ * To avoid doing the grow check inside the loop, we do the grow check
+ * regardless of whether the key is present. This also lets us avoid having
+ * to re-find our position in the hashtable after resizing.
+ *
+ * Note that this is also reached when resizing the table due to
+ * GH_GROW_MAX_DIB / GH_GROW_MAX_MOVE.
+ */
+ if (unlikely(tb->members >= tb->grow_threshold))
+ {
+ /* this may wrap back to 0 when we're already at GH_MAX_SIZE */
+ GH_GROW(tb, tb->size * 2);
+ }
+
+ /* perform the insert starting the bucket search at optimal location */
+ buckets = tb->buckets;
+ startelem = GH_INITIAL_BUCKET(tb, hash);
+ curelem = startelem;
+ for (;;)
+ {
+ GH_BUCKET *bucket = &buckets[curelem];
+ GH_ELEMENT_TYPE *entry;
+ uint32 curdist;
+ uint32 curhash;
+ uint32 curoptimal;
+
+ /* any empty bucket can directly be used */
+ if (!GH_IS_BUCKET_IN_USE(bucket))
+ {
+ uint32 index;
+
+ /* and add the new entry */
+ tb->members++;
+
+ entry = GH_GET_NEXT_UNUSED_ENTRY(tb, &index);
+ entry->GH_KEY = key;
+ bucket->hashvalue = hash;
+ GH_SET_BUCKET_IN_USE(bucket, index);
+ *found = false;
+ return entry;
+ }
+
+ curhash = bucket->hashvalue;
+
+ if (curhash == hash)
+ {
+ /*
+ * The hash value matches so we just need to ensure the key
+ * matches too. To do that, we need to look up the entry in the
+ * segments using the index stored in the bucket.
+ */
+ entry = GH_INDEX_TO_ELEMENT(tb, bucket->index);
+
+ /* if we find a match, we're done */
+ if (GH_EQUAL(tb, key, entry->GH_KEY))
+ {
+ Assert(GH_IS_BUCKET_IN_USE(bucket));
+ *found = true;
+ return entry;
+ }
+ }
+
+ /*
+ * For non-empty, non-matching buckets we have to decide whether to
+ * skip over or move the colliding entry. When the colliding
+ * element's distance to its optimal position is smaller than the
+ * to-be-inserted entry's, we shift the colliding entry (and its
+ * followers) forward by one bucket to make room for the new entry.
+ */
+ curoptimal = GH_INITIAL_BUCKET(tb, curhash);
+ curdist = GH_DISTANCE_FROM_OPTIMAL(tb, curoptimal, curelem);
+
+ if (insertdist > curdist)
+ {
+ GH_ELEMENT_TYPE *entry;
+ GH_BUCKET *lastbucket = bucket;
+ uint32 emptyelem = curelem;
+ uint32 moveelem;
+ int32 emptydist = 0;
+ uint32 index;
+
+ /* find next empty bucket */
+ for (;;)
+ {
+ GH_BUCKET *emptybucket;
+
+ emptyelem = GH_NEXT(tb, emptyelem, startelem);
+ emptybucket = &buckets[emptyelem];
+
+ if (!GH_IS_BUCKET_IN_USE(emptybucket))
+ {
+ lastbucket = emptybucket;
+ break;
+ }
+
+ /*
+ * To avoid negative consequences from overly imbalanced
+ * hashtables, grow the hashtable if collisions would require
+ * us to move a lot of entries. The most likely cause of such
+ * imbalance is filling a (currently) small table, from a
+ * currently big one, in hashtable order. Don't grow if the
+ * hashtable would be too empty, to prevent quick space
+ * explosion for some weird edge cases.
+ */
+ if (unlikely(++emptydist > GH_GROW_MAX_MOVE) &&
+ ((double) tb->members / tb->size) >= GH_GROW_MIN_FILLFACTOR)
+ {
+ tb->grow_threshold = 0;
+ goto restart;
+ }
+ }
+
+ /* shift forward, starting at last occupied element */
+
+ /*
+ * TODO: This could be optimized to be one memcpy in many cases,
+ * excepting wrapping around at the end of the bucket array. Hasn't shown up
+ * in profiles so far though.
+ */
+ moveelem = emptyelem;
+ while (moveelem != curelem)
+ {
+ GH_BUCKET *movebucket;
+
+ moveelem = GH_PREV(tb, moveelem, startelem);
+ movebucket = &buckets[moveelem];
+
+ memcpy(lastbucket, movebucket, sizeof(GH_BUCKET));
+ lastbucket = movebucket;
+ }
+
+ /* and add the new entry */
+ tb->members++;
+
+ entry = GH_GET_NEXT_UNUSED_ENTRY(tb, &index);
+ entry->GH_KEY = key;
+ bucket->hashvalue = hash;
+ GH_SET_BUCKET_IN_USE(bucket, index);
+ *found = false;
+ return entry;
+ }
+
+ curelem = GH_NEXT(tb, curelem, startelem);
+ insertdist++;
+
+ /*
+ * To avoid negative consequences from overly imbalanced hashtables,
+ * grow the hashtable if collisions lead to large runs. The most
+ * likely cause of such imbalance is filling a (currently) small
+ * table, from a currently big one, in hashtable order. Don't grow if
+ * the hashtable would be too empty, to prevent quick space explosion
+ * for some weird edge cases.
+ */
+ if (unlikely(insertdist > GH_GROW_MAX_DIB) &&
+ ((double) tb->members / tb->size) >= GH_GROW_MIN_FILLFACTOR)
+ {
+ tb->grow_threshold = 0;
+ goto restart;
+ }
+ }
+}
+
+/*
+ * Insert the key into the hashtable, set *found to true if the key already
+ * exists, false otherwise. Returns the hashtable entry in either case.
+ */
+GH_SCOPE GH_ELEMENT_TYPE *
+GH_INSERT(GH_TYPE * tb, GH_KEY_TYPE key, bool *found)
+{
+ uint32 hash = GH_HASH_KEY(tb, key);
+
+ return GH_INSERT_HASH_INTERNAL(tb, key, hash, found);
+}
+
+/*
+ * Insert the key into the hashtable using an already-calculated hash. Set
+ * *found to true if the key already exists, false otherwise. Returns the
+ * hashtable entry in either case.
+ */
+GH_SCOPE GH_ELEMENT_TYPE *
+GH_INSERT_HASH(GH_TYPE * tb, GH_KEY_TYPE key, uint32 hash, bool *found)
+{
+ return GH_INSERT_HASH_INTERNAL(tb, key, hash, found);
+}
+
+/*
+ * This is a separate static inline function, so it can reliably be inlined
+ * into its wrapper functions even if GH_SCOPE is extern.
+ */
+static inline GH_ELEMENT_TYPE *
+GH_LOOKUP_HASH_INTERNAL(GH_TYPE * tb, GH_KEY_TYPE key, uint32 hash)
+{
+ const uint32 startelem = GH_INITIAL_BUCKET(tb, hash);
+ uint32 curelem = startelem;
+
+ for (;;)
+ {
+ GH_BUCKET *bucket = &tb->buckets[curelem];
+
+ if (!GH_IS_BUCKET_IN_USE(bucket))
+ return NULL;
+
+ if (bucket->hashvalue == hash)
+ {
+ GH_ELEMENT_TYPE *entry;
+
+ /*
+ * The hash value matches so we just need to ensure the key
+ * matches too. To do that, we need to look up the entry in the
+ * segments using the index stored in the bucket.
+ */
+ entry = GH_INDEX_TO_ELEMENT(tb, bucket->index);
+
+ /* if we find a match, we're done */
+ if (GH_EQUAL(tb, key, entry->GH_KEY))
+ return entry;
+ }
+
+ /*
+ * TODO: we could stop the search based on distance. If the current
+ * bucket's distance-from-optimal is smaller than what we've skipped
+ * already, the entry doesn't exist.
+ */
+
+ curelem = GH_NEXT(tb, curelem, startelem);
+ }
+}
+
+/*
+ * Lookup an entry in the hash table. Returns NULL if key not present.
+ */
+GH_SCOPE GH_ELEMENT_TYPE *
+GH_LOOKUP(GH_TYPE * tb, GH_KEY_TYPE key)
+{
+ uint32 hash = GH_HASH_KEY(tb, key);
+
+ return GH_LOOKUP_HASH_INTERNAL(tb, key, hash);
+}
+
+/*
+ * Lookup an entry in the hash table using an already-calculated hash.
+ *
+ * Returns NULL if key not present.
+ */
+GH_SCOPE GH_ELEMENT_TYPE *
+GH_LOOKUP_HASH(GH_TYPE * tb, GH_KEY_TYPE key, uint32 hash)
+{
+ return GH_LOOKUP_HASH_INTERNAL(tb, key, hash);
+}
+
+/*
+ * Delete an entry from hash table by key. Returns whether to-be-deleted key
+ * was present.
+ */
+GH_SCOPE bool
+GH_DELETE(GH_TYPE * tb, GH_KEY_TYPE key)
+{
+ uint32 hash = GH_HASH_KEY(tb, key);
+ uint32 startelem = GH_INITIAL_BUCKET(tb, hash);
+ uint32 curelem = startelem;
+
+ for (;;)
+ {
+ GH_BUCKET *bucket = &tb->buckets[curelem];
+
+ if (!GH_IS_BUCKET_IN_USE(bucket))
+ return false;
+
+ if (bucket->hashvalue == hash)
+ {
+ GH_ELEMENT_TYPE *entry;
+
+ entry = GH_INDEX_TO_ELEMENT(tb, bucket->index);
+
+ if (GH_EQUAL(tb, key, entry->GH_KEY))
+ {
+ GH_BUCKET *lastbucket = bucket;
+
+ /* mark the entry as unused */
+ GH_REMOVE_ENTRY(tb, bucket->index);
+ /* and mark the bucket unused */
+ GH_SET_BUCKET_EMPTY(bucket);
+
+ tb->members--;
+
+ /*
+ * Backward shift following buckets till either an empty
+ * bucket or a bucket at its optimal position is encountered.
+ *
+ * While that sounds expensive, the average chain length is
+ * short, and deletions would otherwise require tombstones.
+ */
+ for (;;)
+ {
+ GH_BUCKET *curbucket;
+ uint32 curhash;
+ uint32 curoptimal;
+
+ curelem = GH_NEXT(tb, curelem, startelem);
+ curbucket = &tb->buckets[curelem];
+
+ if (!GH_IS_BUCKET_IN_USE(curbucket))
+ break;
+
+ curhash = curbucket->hashvalue;
+ curoptimal = GH_INITIAL_BUCKET(tb, curhash);
+
+ /* current is at optimal position, done */
+ if (curoptimal == curelem)
+ {
+ GH_SET_BUCKET_EMPTY(lastbucket);
+ break;
+ }
+
+ /* shift */
+ memcpy(lastbucket, curbucket, sizeof(GH_BUCKET));
+ GH_SET_BUCKET_EMPTY(curbucket);
+
+ lastbucket = curbucket;
+ }
+
+ return true;
+ }
+ }
+ /* TODO: return false; if the distance is too big */
+
+ curelem = GH_NEXT(tb, curelem, startelem);
+ }
+}
+
+/*
+ * Initialize iterator.
+ */
+GH_SCOPE void
+GH_START_ITERATE(GH_TYPE * tb, GH_ITERATOR * iter)
+{
+ iter->cursegidx = -1;
+ iter->curitemidx = -1;
+ iter->found_members = 0;
+ iter->total_members = tb->members;
+}
+
+/*
+ * Iterate over all entries in the hashtable. Return the next occupied entry,
+ * or NULL if there are no more entries.
+ *
+ * During iteration, only the current entry and any entries previously
+ * visited in the loop may be deleted. Deletion of items not yet visited
+ * is prohibited, as are insertions of new entries.
+ */
+GH_SCOPE GH_ELEMENT_TYPE *
+GH_ITERATE(GH_TYPE * tb, GH_ITERATOR * iter)
+{
+ /*
+ * Bail if we've already visited all members. This check allows us to
+ * exit quickly in cases where the table is large but it only contains a
+ * small number of records. This also means that inserts into the table
+ * are not possible during iteration. If that is done then we may not
+ * visit all items in the table. Rather than ever removing this check to
+ * allow table insertions during iteration, we should add another iterator
+ * where insertions are safe.
+ */
+ if (iter->found_members == iter->total_members)
+ return NULL;
+
+ for (;;)
+ {
+ GH_SEGMENT *seg;
+
+ /* need a new segment? */
+ if (iter->curitemidx == -1)
+ {
+ iter->cursegidx = GH_NEXT_ONEBIT(tb->used_segments,
+ tb->used_segment_words,
+ iter->cursegidx);
+
+ /* no more segments with items? We're done */
+ if (iter->cursegidx == -1)
+ return NULL;
+ }
+
+ seg = tb->segments[iter->cursegidx];
+
+ /* if the segment has items then it certainly shouldn't be NULL */
+ Assert(seg != NULL);
+ /* advance to the next used item in this segment */
+ iter->curitemidx = GH_NEXT_ONEBIT(seg->used_items, GH_BITMAP_WORDS,
+ iter->curitemidx);
+ if (iter->curitemidx >= 0)
+ {
+ iter->found_members++;
+ return &seg->items[iter->curitemidx];
+ }
+
+ /*
+ * GH_NEXT_ONEBIT returns -1 when there are no more bits. We just
+ * loop again to fetch the next segment.
+ */
+ }
+}
+
+#endif /* GH_DEFINE */
+
+/* undefine external parameters, so next hash table can be defined */
+#undef GH_PREFIX
+#undef GH_KEY_TYPE
+#undef GH_KEY
+#undef GH_ELEMENT_TYPE
+#undef GH_HASH_KEY
+#undef GH_SCOPE
+#undef GH_DECLARE
+#undef GH_DEFINE
+#undef GH_EQUAL
+#undef GH_ALLOCATE
+#undef GH_ALLOCATE_ZERO
+#undef GH_FREE
+
+/* undefine locally declared macros */
+#undef GH_MAKE_PREFIX
+#undef GH_MAKE_NAME
+#undef GH_MAKE_NAME_
+#undef GH_ITEMS_PER_SEGMENT
+#undef GH_UNUSED_BUCKET_INDEX
+#undef GH_INDEX_SEGMENT
+#undef GH_INDEX_ITEM
+#undef GH_BITS_PER_WORD
+#undef GH_BITMAP_WORD
+#undef GH_RIGHTMOST_ONE_POS
+#undef GH_BITMAP_WORDS
+#undef GH_WORDNUM
+#undef GH_BITNUM
+#undef GH_RAW_ALLOCATOR
+#undef GH_MAX_SIZE
+#undef GH_FILLFACTOR
+#undef GH_MAX_FILLFACTOR
+#undef GH_GROW_MAX_DIB
+#undef GH_GROW_MAX_MOVE
+#undef GH_GROW_MIN_FILLFACTOR
+
+/* types */
+#undef GH_TYPE
+#undef GH_BUCKET
+#undef GH_SEGMENT
+#undef GH_ITERATOR
+
+/* external function names */
+#undef GH_CREATE
+#undef GH_DESTROY
+#undef GH_RESET
+#undef GH_INSERT
+#undef GH_INSERT_HASH
+#undef GH_DELETE
+#undef GH_LOOKUP
+#undef GH_LOOKUP_HASH
+#undef GH_GROW
+#undef GH_START_ITERATE
+#undef GH_ITERATE
+
+/* internal function names */
+#undef GH_NEXT_ONEBIT
+#undef GH_NEXT_ZEROBIT
+#undef GH_INDEX_TO_ELEMENT
+#undef GH_MARK_SEGMENT_ITEM_USED
+#undef GH_MARK_SEGMENT_ITEM_UNUSED
+#undef GH_GET_NEXT_UNUSED_ENTRY
+#undef GH_REMOVE_ENTRY
+#undef GH_SET_BUCKET_IN_USE
+#undef GH_SET_BUCKET_EMPTY
+#undef GH_IS_BUCKET_IN_USE
+#undef GH_COMPUTE_PARAMETERS
+#undef GH_NEXT
+#undef GH_PREV
+#undef GH_DISTANCE_FROM_OPTIMAL
+#undef GH_INITIAL_BUCKET
+#undef GH_INSERT_HASH_INTERNAL
+#undef GH_LOOKUP_HASH_INTERNAL
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index 9b2a421c32..a268879b1c 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -561,7 +561,7 @@ extern void LockReleaseCurrentOwner(LOCALLOCK **locallocks, int nlocks);
extern void LockReassignCurrentOwner(LOCALLOCK **locallocks, int nlocks);
extern bool LockHeldByMe(const LOCKTAG *locktag, LOCKMODE lockmode);
#ifdef USE_ASSERT_CHECKING
-extern HTAB *GetLockMethodLocalHash(void);
+extern LOCALLOCK **GetLockMethodLocalLocks(uint32 *size);
#endif
extern bool LockHasWaiters(const LOCKTAG *locktag,
LOCKMODE lockmode, bool sessionLock);
--
2.27.0
v1-0002-Use-generichash.h-hashtables-in-SMgr.patch
From d3689d5e3bdd1b9aac57a104540ccfabc4f33997 Mon Sep 17 00:00:00 2001
From: David Rowley <dgrowley@gmail.com>
Date: Tue, 22 Jun 2021 00:31:53 +1200
Subject: [PATCH v1 2/2] Use generichash.h hashtables in SMgr
The hash table lookups done in SMgr can quite often be a bottleneck during
crash recovery. Traditionally these use dynahash. Here we swap dynahash
out and use generichash instead. This improves lookup performance.
---
src/backend/storage/smgr/smgr.c | 82 +++++++++++++++++++++++----------
1 file changed, 58 insertions(+), 24 deletions(-)
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 4dc24649df..77dd402479 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -18,14 +18,30 @@
#include "postgres.h"
#include "access/xlog.h"
+#include "common/hashfn.h"
#include "lib/ilist.h"
#include "storage/bufmgr.h"
#include "storage/ipc.h"
#include "storage/md.h"
#include "storage/smgr.h"
-#include "utils/hsearch.h"
#include "utils/inval.h"
-
+#include "utils/memutils.h"
+
+static inline uint32 relfilenodebackend_hash(RelFileNodeBackend *rnode);
+
+#define GH_PREFIX smgrtable
+#define GH_ELEMENT_TYPE SMgrRelationData
+#define GH_KEY_TYPE RelFileNodeBackend
+#define GH_KEY smgr_rnode
+#define GH_HASH_KEY(tb, key) relfilenodebackend_hash(&key)
+#define GH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(RelFileNodeBackend)) == 0)
+#define GH_ALLOCATE(b) MemoryContextAllocExtended(TopMemoryContext, b, MCXT_ALLOC_HUGE)
+#define GH_ALLOCATE_ZERO(b) MemoryContextAllocExtended(TopMemoryContext, b, MCXT_ALLOC_HUGE | MCXT_ALLOC_ZERO)
+#define GH_FREE(p) pfree(p)
+#define GH_SCOPE static inline
+#define GH_DECLARE
+#define GH_DEFINE
+#include "lib/generichash.h"
/*
* This struct of function pointers defines the API between smgr.c and
@@ -91,13 +107,43 @@ static const int NSmgr = lengthof(smgrsw);
* Each backend has a hashtable that stores all extant SMgrRelation objects.
* In addition, "unowned" SMgrRelation objects are chained together in a list.
*/
-static HTAB *SMgrRelationHash = NULL;
+static smgrtable_hash *SMgrRelationHash = NULL;
static dlist_head unowned_relns;
/* local function prototypes */
static void smgrshutdown(int code, Datum arg);
+/*
+ * relfilenodebackend_hash
+ * Custom rolled hash function for the generichash table.
+ *
+ * smgropen() is often a bottleneck in CPU bound workloads during crash
+ * recovery. We make use of this custom hash function rather than using
+ * hash_bytes as it gives us a little bit more performance.
+ *
+ * XXX What if sizeof(Oid) is not 4?
+ */
+static inline uint32
+relfilenodebackend_hash(RelFileNodeBackend *rnode)
+{
+ uint32 hashkey;
+
+ hashkey = murmurhash32((uint32) rnode->node.spcNode);
+
+ /* rotate hashkey left 1 bit at each step */
+ hashkey = pg_rotate_right32(hashkey, 31);
+ hashkey ^= murmurhash32((uint32) rnode->node.dbNode);
+
+ hashkey = pg_rotate_right32(hashkey, 31);
+ hashkey ^= murmurhash32((uint32) rnode->node.relNode);
+
+ hashkey = pg_rotate_right32(hashkey, 31);
+ hashkey ^= murmurhash32((uint32) rnode->backend);
+
+ return hashkey;
+}
+
/*
* smgrinit(), smgrshutdown() -- Initialize or shut down storage
@@ -149,29 +195,22 @@ smgropen(RelFileNode rnode, BackendId backend)
SMgrRelation reln;
bool found;
- if (SMgrRelationHash == NULL)
+ if (unlikely(SMgrRelationHash == NULL))
{
/* First time through: initialize the hash table */
- HASHCTL ctl;
-
- ctl.keysize = sizeof(RelFileNodeBackend);
- ctl.entrysize = sizeof(SMgrRelationData);
- SMgrRelationHash = hash_create("smgr relation table", 400,
- &ctl, HASH_ELEM | HASH_BLOBS);
+ SMgrRelationHash = smgrtable_create(400);
dlist_init(&unowned_relns);
}
/* Look up or create an entry */
brnode.node = rnode;
brnode.backend = backend;
- reln = (SMgrRelation) hash_search(SMgrRelationHash,
- (void *) &brnode,
- HASH_ENTER, &found);
+ reln = smgrtable_insert(SMgrRelationHash, brnode, &found);
/* Initialize it if not present before */
if (!found)
{
- /* hash_search already filled in the lookup key */
+ /* smgrtable_insert already filled in the lookup key */
reln->smgr_owner = NULL;
reln->smgr_targblock = InvalidBlockNumber;
for (int i = 0; i <= MAX_FORKNUM; ++i)
@@ -266,9 +305,7 @@ smgrclose(SMgrRelation reln)
if (!owner)
dlist_delete(&reln->node);
- if (hash_search(SMgrRelationHash,
- (void *) &(reln->smgr_rnode),
- HASH_REMOVE, NULL) == NULL)
+ if (!smgrtable_delete(SMgrRelationHash, reln->smgr_rnode))
elog(ERROR, "SMgrRelation hashtable corrupted");
/*
@@ -285,16 +322,16 @@ smgrclose(SMgrRelation reln)
void
smgrcloseall(void)
{
- HASH_SEQ_STATUS status;
+ smgrtable_iterator iterator;
SMgrRelation reln;
/* Nothing to do if hashtable not set up */
if (SMgrRelationHash == NULL)
return;
- hash_seq_init(&status, SMgrRelationHash);
+ smgrtable_start_iterate(SMgrRelationHash, &iterator);
- while ((reln = (SMgrRelation) hash_seq_search(&status)) != NULL)
+ while ((reln = smgrtable_iterate(SMgrRelationHash, &iterator)) != NULL)
smgrclose(reln);
}
@@ -314,10 +351,7 @@ smgrclosenode(RelFileNodeBackend rnode)
/* Nothing to do if hashtable not set up */
if (SMgrRelationHash == NULL)
return;
-
- reln = (SMgrRelation) hash_search(SMgrRelationHash,
- (void *) &rnode,
- HASH_FIND, NULL);
+ reln = smgrtable_lookup(SMgrRelationHash, rnode);
if (reln != NULL)
smgrclose(reln);
}
--
2.27.0
On Mon, Jun 21, 2021 at 10:15 AM David Rowley <dgrowleyml@gmail.com> wrote:
I've come up with a new hash table implementation that I've called
generichash.
At the risk of kibitzing the least-important detail of this proposal,
I'm not very happy with the names of our hash implementations.
simplehash is not especially simple, and dynahash is not particularly
dynamic, especially now that the main place we use it is for
shared-memory hash tables that can't be resized. Likewise, generichash
doesn't really give any kind of clue about how this hash table is
different from any of the others. I don't know how possible it is to
do better here; naming things is one of the two hard problems in
computer science. In a perfect world, though, our hash table
implementations would be named in such a way that somebody might be
able to look at the names and guess on that basis which one is
best-suited to a given task.
--
Robert Haas
EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes:
On Mon, Jun 21, 2021 at 10:15 AM David Rowley <dgrowleyml@gmail.com> wrote:
I've come up with a new hash table implementation that I've called
generichash.
At the risk of kibitzing the least-important detail of this proposal,
I'm not very happy with the names of our hash implementations.
I kind of wonder if we really need four different hash table
implementations (this being the third "generic" one, plus hash join
has its own, and I may have forgotten others). Should we instead
think about revising simplehash to gain the benefits of this patch?
regards, tom lane
On Tue, 22 Jun 2021 at 02:53, Robert Haas <robertmhaas@gmail.com> wrote:
At the risk of kibitzing the least-important detail of this proposal,
I'm not very happy with the names of our hash implementations.
simplehash is not especially simple, and dynahash is not particularly
dynamic, especially now that the main place we use it is for
shared-memory hash tables that can't be resized. Likewise, generichash
doesn't really give any kind of clue about how this hash table is
different from any of the others. I don't know how possible it is to
do better here; naming things is one of the two hard problems in
computer science. In a perfect world, though, our hash table
implementations would be named in such a way that somebody might be
able to look at the names and guess on that basis which one is
best-suited to a given task.
I'm certainly open to better names. I did almost call it stablehash,
in regards to the pointers to elements not moving around like they do
with simplehash.
I think more generally, hash table implementations are complex enough
that it's pretty much impossible to give them a short enough
meaningful name. Most papers just end up assigning a name to some
technique, e.g. Robinhood, Cuckoo, etc.
Both simplehash and generichash use a variant of Robinhood hashing.
simplehash uses open addressing and generichash does not. If Andres
had called it "robinhoodhash" instead of simplehash, then someone
might come along and complain that his
implementation is broken because it does not implement tombstoning.
Maybe Andres thought he'd avoid that by not claiming that it's an
implementation of a Robinhood hash table. That seems pretty wise to
me. Naming it simplehash was a pretty simple way of avoiding that
problem.
Anyway, I'm open to better names, but I don't think the name should
drive the implementation. If the implementation does not fit the name
perfectly, then the name should change rather than the implementation.
Personally, I think we should call it RowleyHash, but I think others
might object. ;-)
David
On Tue, 22 Jun 2021 at 03:43, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I kind of wonder if we really need four different hash table
implementations (this being the third "generic" one, plus hash join
has its own, and I may have forgotten others). Should we instead
think about revising simplehash to gain the benefits of this patch?
hmm, yeah. I definitely agree with trying to have as much reusable
code as we can when we can. It certainly reduces maintenance and bugs
tend to be found more quickly too. It's a very worthy cause.
I did happen to think of this when I was copying swathes of code out
of simplehash.h. However, I decided that the two implementations are
sufficiently different that if I tried to merge them both into one .h
file, we'd have some unreadable and unmaintainable mess. I just don't
think their DNA is compatible enough for the two to be mated
successfully. For example, simplehash uses open addressing and
generichash does not. This means that things like iterating over the
table works completely differently. Lookups in generichash need to
perform an extra step to fetch the actual data from the segment
arrays. I think it would certainly be possible to merge the two, but
I just don't think it would be easy code to work on if we did that.
The good thing is that the API between the two is very similar
and it's quite easy to swap one for the other. I did make changes
around memory allocation because I was too cheap to zero memory when
I didn't need to, and simplehash does not have any means of allocating
memory without zeroing it.
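Just to illustrate how close the two are, here's roughly what
instantiating the same table with each header looks like. The struct,
prefixes and parameter values below are made up for the example; the
GH_ parameters are the ones generichash.h accepts and the SH_ ones are
simplehash.h's:
#include "postgres.h"
#include "common/hashfn.h"
typedef struct DemoEntry
{
	uint32		key;
	char		status;		/* simplehash keeps its entry status here */
	int			value;
} DemoEntry;
/* simplehash.h version */
#define SH_PREFIX demo_sh
#define SH_ELEMENT_TYPE DemoEntry
#define SH_KEY_TYPE uint32
#define SH_KEY key
#define SH_HASH_KEY(tb, key) murmurhash32(key)
#define SH_EQUAL(tb, a, b) ((a) == (b))
#define SH_SCOPE static inline
#define SH_DECLARE
#define SH_DEFINE
#include "lib/simplehash.h"
/* generichash.h version; the main additions are the allocator callbacks */
#define GH_PREFIX demo_gh
#define GH_ELEMENT_TYPE DemoEntry
#define GH_KEY_TYPE uint32
#define GH_KEY key
#define GH_HASH_KEY(tb, key) murmurhash32(key)
#define GH_EQUAL(tb, a, b) ((a) == (b))
#define GH_ALLOCATE(b) palloc(b)
#define GH_ALLOCATE_ZERO(b) palloc0(b)
#define GH_FREE(p) pfree(p)
#define GH_SCOPE static inline
#define GH_DECLARE
#define GH_DEFINE
#include "lib/generichash.h"
After that, the call sites only differ in the prefix, e.g.
demo_sh_insert()/demo_sh_lookup() vs demo_gh_insert()/demo_gh_lookup().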
I also think that there's just no one-size-fits-all hash table type.
simplehash will not perform well when the size of the stored element
is very large. There's simply too much memcpying to move data around
during insert/delete. simplehash will also have terrible iteration
performance in sparsely populated tables. However, simplehash will be
pretty much unbeatable for lookups where the element type is very
small, e.g single Datum, or an int. The CPU cache efficiency there
will be pretty much unbeatable.
I tried to document the advantages of each in the file header
comments. I should probably also add something to simplehash.h's
comments to mention generichash.h
David
On Tue, Jun 22, 2021 at 1:55 PM David Rowley <dgrowleyml@gmail.com> wrote:
On Tue, 22 Jun 2021 at 03:43, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I kind of wonder if we really need four different hash table
implementations (this being the third "generic" one, plus hash join
has its own, and I may have forgotten others). Should we instead
think about revising simplehash to gain the benefits of this patch?
hmm, yeah. I definitely agree with trying to have as much reusable
code as we can when we can. It certainly reduces maintenance and bugs
tend to be found more quickly too. It's a very worthy cause.
It is indeed really hard to decide when you have a new thing, and when
you need a new way to parameterise the existing generic thing. I
wondered about this how-many-hash-tables-does-it-take question a lot
when writing dshash.c (a chaining hash table that can live in weird
dsa.c memory, backed by DSM segments created on the fly that may be
mapped at different addresses in each backend, and has dynahash-style
partition locking), and this was around the time Andres was talking
about simplehash. In retrospect, I'd probably kick out the built-in
locking and partitions and instead let callers create their own
partitioning scheme on top from N tables, and that'd remove one quirk,
leaving only the freaky pointers and allocator. I recall from a
previous life that Boost's unordered_map template is smart enough to
support running in shared memory mapped at different addresses just
through parameterisation that controls the way it deals with internal
pointers (boost::unordered_map<..., ShmemAllocator>), which seemed
pretty clever to me, and it might be achievable to do the same with a
generic hash table for us that could take over dshash's specialness.
One idea I had at the time is that the right number of hash table
implementations in our tree is two: one for chaining (like dynahash)
and one for open addressing/probing (like simplehash), and that
everything else should be hoisted out (locking, partitioning) or made
into template parameters through the generic programming technique
that simplehash.h has demonstrated (allocators, magic pointer type for
internal pointers, plus of course the inlinable ops). But that was
before we'd really fully adopted the idea of this style of template
code. (I also assumed the weird memory stuff would be temporary and
we'd move to threads, but that's another topic for another thread.)
It seems like you'd disagree with this, and you'd say the right number
is three. But it's also possible to argue for one...
A more superficial comment: I don't like calling hash tables "hash".
I blame perl.
On Tue, 22 Jun 2021 at 14:49, Thomas Munro <thomas.munro@gmail.com> wrote:
One idea I had at the time is that the right number of hash table
implementations in our tree is two: one for chaining (like dynahash)
and one for open addressing/probing (like simplehash), and that
everything else should be hoisted out (locking, partitioning) or made
into template parameters through the generic programming technique
that simplehash.h has demonstrated (allocators, magic pointer type for
internal pointers, plus of course the inlinable ops). But that was
before we'd really fully adopted the idea of this style of template
code. (I also assumed the weird memory stuff would be temporary and
we'd move to threads, but that's another topic for another thread.)
It seems like you'd disagree with this, and you'd say the right number
is three. But it's also possible to argue for one...
I guess we could also ask ourselves how many join algorithms we need.
We have 3.something. None of which is perfect for every job. That's
why we have multiple. I wonder why this is different.
Just for anyone who missed it, the reason I wrote generichash and
didn't just use simplehash is that it's not possible to keep pointers
to entries in a simplehash table because these get shuffled around
during insert/delete. For the locallock stuff over on [1] we need the
LOCALLOCK object to be stable as we point to these from the resource
manager. Likewise here for SMgr, we point to SMgrRelationData objects
from RelationData. We can't have the hash table implementation swap
these out from under us.
Additionally, I coded generichash to fix the very slow hash seq scan
problem that we have in LockReleaseAll() when a transaction has run in
the backend that took lots of locks and caused the locallock hash
table to bloat. Later when we run transactions that just grab a few
locks it takes us a relatively long time to do LockReleaseAll()
because we have to skip all those empty hash table buckets in the
bloated table. (See iterate_sparse_table.png and
iterate_very_sparse_table.png)
I just finished writing a benchmark suite for comparing simplehash to
generichash. I did this as a standalone C program. See the attached
hashbench.tar.gz. You can run the tests with just ./test.sh. Just be
careful if compiling manually as test.sh passes -DHAVE__BUILTIN_CTZ
-DHAVE_LONG_INT_64 which have quite a big effect on the performance of
generichash due to it using pg_rightmost_one_pos64() when searching
the bitmaps for used items.
I've attached graphs showing the results I got from running test.sh on
my AMD 3990x machine. Because the size of the struct being hashed
matters a lot to the performance of simplehash, I ran tests with 8,
16, 32, 64, 128, 256-byte structs. This matters because simplehash
does memcpy() on this when moving stuff around during insert/delere.
The size of the "payload" matters a bit less to generichash.
You can see that the lookup performance of generichash is very similar to
simplehash's. The insert/delete test shows that generichash is very
slightly slower from 8-128 bytes but wins when simplehash has to
tackle 256 bytes of data.
The seq scan tests show that simplehash is better when the table is
full of items, but it's terrible when the bucket array is only sparsely
populated. I needed generichash to be fast at this for
LockReleaseAll(). I might be able to speed up generichash iteration
when the table is full a bit more by checking if the segment is full
and skipping to the next item rather than consulting the bitmap. That
will slow down the sparse case a bit though. Not sure if it's worth
it.
Anyway, what I hope to show here is that there is no one-size-fits-all
hash table.
David
[1]: /messages/by-id/CAApHDvoKqWRxw5nnUPZ8+mAJKHPOPxYGoY1gQdh0WeS4+biVhg@mail.gmail.com
Attachments:
On Tue, Jun 22, 2021 at 6:51 PM David Rowley <dgrowleyml@gmail.com> wrote:
I guess we could also ask ourselves how many join algorithms we need.
David and I discussed this a bit off-list, and I just wanted to share
how I understand the idea so far in case it helps someone else. There
are essentially three subcomponents working together:
1. A data structure similar in some ways to a C++ std::deque<T>,
which gives O(1) access to elements by index, is densely packed to
enable cache-friendly scanning of all elements, has stable addresses
(as long as you only add new elements at the end or overwrite existing
slots), and is internally backed by an array of pointers to a set of
chunks.
2. A bitmapset that tracks unused elements in 1, making it easy to
find the lowest-index hole when looking for a place to put a new one
by linear search for a 1 bit, so that we tend towards maximum density
despite having random frees from time to time (seems good, the same
idea is used in kernels to allocate the lowest unused file descriptor
number).
3. A hash table that has as elements indexes into 1. It somehow hides
the difference between keys (what callers look things up with) and
keys reachable by following an index into 1 (where elements' keys
live).
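To make 1 and 2 a little more concrete, a stripped-down sketch might
look like the following (every name here is invented for illustration
and it assumes a GCC/Clang __builtin_ctzll; it's not code from David's
patch):
#include <stdint.h>
#define ITEMS_PER_SEGMENT 64	/* illustrative; one bitmap word per segment */
typedef struct Thing			/* stand-in for the stored element type */
{
	int			key;
	int			value;
} Thing;
typedef struct Segment
{
	uint64_t	used;			/* presence bitmap, one bit per slot */
	Thing		items[ITEMS_PER_SEGMENT];	/* dense, stable addresses */
} Segment;
/* Claim the lowest free slot in 'seg', or return -1 if the segment is full */
static int
segment_alloc_slot(Segment *seg)
{
	uint64_t	free_bits = ~seg->used;
	int			slot;
	if (free_bits == 0)
		return -1;
	slot = __builtin_ctzll(free_bits);	/* lowest 0 bit of 'used' */
	seg->used |= UINT64_C(1) << slot;
	return slot;
}
/* Release a slot so that the next allocation can reuse it */
static void
segment_free_slot(Segment *seg, int slot)
{
	seg->used &= ~(UINT64_C(1) << slot);
}
Component 3 would then be an ordinary hash table whose entries are just
small indexes (segment number and slot number packed together) pointing
back into this pool.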
One thought is that you could do 1 as a separate component as the
"primary" data structure, and use a plain old simplehash for 3 as a
kind of index into it, but use pointers (rather than indexes) to
objects in 1 as elements. I don't know if it's better.
On Wed, 23 Jun 2021 at 12:17, Thomas Munro <thomas.munro@gmail.com> wrote:
David and I discussed this a bit off-list, and I just wanted to share
how I understand the idea so far in case it helps someone else. There
are essentially three subcomponents working together:
Thanks for taking an interest in this. I started looking at your idea
and I've now changed my mind from just not liking it to thinking that
the whole idea is just completely horrible :-(
It gets really messy with all the nested pre-processor stuff around
fetching the element from the segmented array inside simplehash. One
problem is that simplehash needs the address of the segments despite
simplehash not knowing anything about segments. I've tried to make
that work by passing in the generic hash struct as simplehash's
private_data. This ends up with deeply nested macros all defined in
different files. I pity the future person debugging that.
There is also a problem of how to reference simplehash functions
inside the generichash code. It's not possible to do things like
SH_CREATE() because all those macros are undefined at the end of
simplehash.h. It's no good to hardcode the names either as GH_PREFIX
must be used, else it wouldn't be possible to use more than 1
differently defined hash table per .c file. Fixing this means either
modifying simplehash.h to not undefine all the name macros at the end
maybe with SH_NOUNDEF or creating another set of macros to build the
names for the simplehash functions inside the generic hash code. I
don't like either of those ideas.
There are also a bunch of changes / API breakages that need to be done
to make this work with simplehash.h.
1) Since I really need 8-byte buckets in the hash table to make this
as fast as possible, I want to use the array index for the hash status
and that means changing the simplehash API to allow that to work.
This requires something like SH_IS_BUCKET_INUSE, SH_SET_BUCKET_INUSE,
SH_SET_BUCKET_EMPTY.
2) I need to add a new memory allocation function to not zero the
memory. At the moment all hash buckets are emptied when creating a
table by zeroing the bucket memory. If someone defines
SH_SET_BUCKET_EMPTY to do something that says 0 memory is not empty,
then that won't work. So I need to allocate the bucket memory then
call SH_SET_BUCKET_EMPTY on each bucket.
3) I'll need to replace SH_KEY with something more complex. Since the
simplehash bucket will just have a uint32 hashvalue and uint32 index,
the hash key is not stored in the bucket, it's stored over in the
segment. I'll need to replace SH_KEY with SH_GETKEY and SH_SETKEY.
These will need to consult the simplehash's private_data so that the
element can be found in the segmented array.
Also, simplehash internally manages when the hash table needs to grow.
I'll need to perform separate checks to see if the segmented array
also must grow. It's a bit annoying to double up those checks as
they're in a very hot path, being done every time someone inserts
into the table.
2. A bitmapset that tracks unused elements in 1, making it easy to
find the lowest-index hole when looking for a place to put a new one
by linear search for a 1 bit, so that we tend towards maximum density
despite having random frees from time to time (seems good, the same
idea is used in kernels to allocate the lowest unused file descriptor
number).
I didn't use Bitmapsets. I wanted the bitmaps to be allocated in the
same chunk of memory as the segments of the array. Also, because
bitmapset's nwords is variable, then they can't really do any loop
unrolling. Since in my implementation the number of bitmap words is
known at compile-time, the compiler has the flexibility to do loop
unrolling. The bitmap manipulation is one of the biggest overheads in
generichash.h. I'd prefer to keep that as fast as possible.
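Just to show the shape of what I mean (this is only a sketch, not the
patch's actual GH_NEXT_ONEBIT), with the word count being a
compile-time constant the compiler is free to unroll the loop, and
pg_rightmost_one_pos64() from port/pg_bitutils.h does the per-word
work:
#include "postgres.h"
#include "port/pg_bitutils.h"
#define BITMAP_WORDS 4		/* known at compile time in each generated table */
/* Return the next bit set at a position greater than 'prevbit', or -1 */
static inline int
bitmap_next_onebit(const uint64 *words, int prevbit)
{
	prevbit++;
	for (int w = prevbit / 64; w < BITMAP_WORDS; w++)
	{
		uint64		word = words[w];
		/* in the first word examined, ignore bits below 'prevbit' */
		if (w == prevbit / 64)
			word &= ~UINT64CONST(0) << (prevbit % 64);
		if (word != 0)
			return w * 64 + pg_rightmost_one_pos64(word);
	}
	return -1;				/* no more bits set */
}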
3. A hash table that has as elements indexes into 1. It somehow hides
the difference between keys (what callers look things up with) and
keys reachable by following an index into 1 (where elements' keys
live).
I think that can be done, but it would require looking up the
segmented array twice instead of once. The first time would be when
we compare the keys after seeing the hash values match. The final time
would be in the calling code to translate the index to the pointer.
Hopefully the compiler would be able to optimize that to a single
lookup.
One thought is that you could do 1 as a separate component as the
"primary" data structure, and use a plain old simplehash for 3 as a
kind of index into it, but use pointers (rather than indexes) to
objects in 1 as elements. I don't know if it's better.
Using pointers would double the bucket width on a 64 bit machine. I
don't want to do that. Also, to determine the segment from
the pointer we would have to loop over each segment to check if the
pointer belongs there. With the index we can determine the segment
directly by bit-shifting the index.
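As a sketch of that (the names and the 256 are only for illustration;
the patch does this with GH_ITEMS_PER_SEGMENT and the GH_INDEX_SEGMENT
/ GH_INDEX_ITEM macros):
#include <stdint.h>
#define ITEMS_PER_SEGMENT 256	/* illustrative; must be a power of two */
typedef struct HashBucket
{
	uint32_t	hashvalue;		/* full 32-bit hash of the key */
	uint32_t	index;			/* index into the segmented element array */
} HashBucket;					/* 8 bytes, vs 12-16 if we stored a pointer */
/* which segment the index lives in: reduces to a right shift */
static inline uint32_t
index_segment(uint32_t index)
{
	return index / ITEMS_PER_SEGMENT;
}
/* position within that segment: reduces to a bitwise AND */
static inline uint32_t
index_item(uint32_t index)
{
	return index % ITEMS_PER_SEGMENT;
}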
So, with all that, I really don't think it's a great idea to try and
have this use simplehash.h code. I plan to pursue the idea I proposed
of having separate hash table code that is coded properly to have
stable pointers into the data rather than trying to contort
simplehash's code into working that way.
David
On Wed, Jun 30, 2021 at 11:14 PM David Rowley <dgrowleyml@gmail.com> wrote:
On Wed, 23 Jun 2021 at 12:17, Thomas Munro <thomas.munro@gmail.com> wrote:
Thanks for taking an interest in this. I started looking at your idea
and I've now changed my mind from just not liking it to thinking that
the whole idea is just completely horrible :-(
Hah.
I accept that trying to make a thing that "wraps" these data
structures and provides a simple interface is probably really quite
horrible with preprocessor voodoo.
I was mainly questioning how bad it would be if we had a generic
segmented array component (seems like a great idea, which I'm sure
would find other uses, I recall wanting to write that myself before),
and then combined that with the presence map idea to make a dense
object pool (ditto), but then, in each place where we need something
like this, just used a plain old hash table to point directly to
objects in it whenever we needed that, open coding the logic to keep
it in sync (I mean, just the way that people usually use hash tables).
That way, the object pool can give you very fast scans over all
objects in cache friendly order (no linked lists), and the hash table
doesn't know/care about its existence. In other words, small reusable
components that each do one thing well and are not coupled together.
I think I understand now that you really, really want small index
numbers and not 64 bit pointers in the hash table. Hmm.
It gets really messy with all the nested pre-processor stuff around
fetching the element from the segmented array inside simplehash. One
problem is that simplehash needs the address of the segments despite
simplehash not knowing anything about segments. I've tried to make
that work by passing in the generic hash struct as simplehash's
private_data. This ends up with deeply nested macros all defined in
different files. I pitty the future person debugging that.
Yeah, that sounds terrible.
There are also a bunch of changes / API breakages that need to be done
to make this work with simplehash.h.
1) Since I really need 8-byte buckets in the hash table to make this
as fast as possible, I want to use the array index for the hash status
and that means changing the simplehash API to allow that to work.
This requires something like SH_IS_BUCKET_INUSE, SH_SET_BUCKET_INUSE,
SH_SET_BUCKET_EMPTY.
+1 for doing customisable "is in use" checks one day anyway, as a
separate project. Not sure if any current users could shrink their
structs in practice because, at a glance, the same amount of space
might be used by padding anyway, but when a case like that shows up...
2. A bitmapset that tracks unused elements in 1, making it easy to
find the lowest-index hole when looking for a place to put a new one
by linear search for a 1 bit, so that we tend towards maximum density
despite having random frees from time to time (seems good, the same
idea is used in kernels to allocate the lowest unused file descriptor
number).
I didn't use Bitmapsets. I wanted the bitmaps to be allocated in the
same chunk of memory as the segments of the array. Also, because
bitmapset's nwords is variable, then they can't really do any loop
unrolling. Since in my implementation the number of bitmap words are
known at compile-time, the compiler has the flexibility to do loop
unrolling. The bitmap manipulation is one of the biggest overheads in
generichash.h. I'd prefer to keep that as fast as possible.
I think my hands autocompleted "bitmapset", I really meant to write
just "bitmap" as I didn't think you were using the actual thing called
bitmapset, but point taken.
So, with all that. I really don't think it's a great idea to try and
have this use simplehash.h code. I plan to pursue the idea I proposed
with having seperate hash table code that is coded properly to have
stable pointers into the data rather than trying to contort
simplehash's code into working that way.
Fair enough.
It's not that I don't believe it's a good idea to be able to perform
cache-friendly iteration over densely packed objects... that part
sounds great... it's just that it's not obvious to me that it should
be a *hashtable's* job to provide that access path. Perhaps I lack
imagination and we'll have to agree to differ.
On Thu, 1 Jul 2021 at 13:00, Thomas Munro <thomas.munro@gmail.com> wrote:
On Wed, Jun 30, 2021 at 11:14 PM David Rowley <dgrowleyml@gmail.com> wrote:
1) Since I really need 8-byte buckets in the hash table to make this
as fast as possible, I want to use the array index for the hash status
and that means changing the simplehash API to allow that to work.
This requires something like SH_IS_BUCKET_INUSE, SH_SET_BUCKET_INUSE,
SH_SET_BUCKET_EMPTY.
+1 for doing customisable "is in use" checks one day anyway, as a
separate project. Not sure if any current users could shrink their
structs in practice because, at a glance, the same amount of space
might be used by padding anyway, but when a case like that shows up...
Yeah, I did look at that when messing with simplehash when working on
Result Cache a few months ago. I found all current usages have at
least a free byte, so I wasn't motivated to allow custom statuses to
be defined.
There's probably a small tidy up to do in simplehash maybe along with
that patch. If you look at SH_GROW, for example, you'll see various
formations of:
if (oldentry->status != SH_STATUS_IN_USE)
if (oldentry->status == SH_STATUS_IN_USE)
if (newentry->status == SH_STATUS_EMPTY)
I'm not all that sure why there's a need to distinguish !=
SH_STATUS_IN_USE from == SH_STATUS_EMPTY. I can only imagine that
Andres was messing around with tombstoning and at one point had a 3rd
status in a development version. There are some minor inefficiencies
as a result of this, e.g. in SH_DELETE, the code does:
if (entry->status == SH_STATUS_EMPTY)
return false;
if (entry->status == SH_STATUS_IN_USE &&
SH_COMPARE_KEYS(tb, hash, key, entry))
That SH_STATUS_IN_USE check is always true.
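Presumably, given the early return above and there being only two
statuses, that second condition could be reduced to just the key
comparison:
if (SH_COMPARE_KEYS(tb, hash, key, entry))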
David
On Tue, Jun 22, 2021 at 02:15:26AM +1200, David Rowley wrote:
[...]
I've come up with a new hash table implementation that I've called
generichash. It works similarly to simplehash in regards to the
Hi David,
Are you planning to work on this in this CF?
This is marked as "Ready for committer" but it doesn't apply anymore.
--
Jaime Casanova
Director de Servicios Profesionales
SystemGuards - Consultores de PostgreSQL
On Fri, 24 Sept 2021 at 20:26, Jaime Casanova
<jcasanov@systemguards.com.ec> wrote:
Are you planning to work on this in this CF?
This is marked as "Ready for committer" but it doesn't apply anymore.
I've attached an updated patch. Since this patch is pretty different
from the one that was marked as ready for committer, I'll move this to
needs review.
However, I'm a bit disinclined to go ahead with this patch at all.
Thomas made it quite clear he's not for the patch, and on discussing
the patch with Andres, it turned out he does not like the idea either.
Andres' argument was along the lines of bitmaps being slow. The hash
table uses bitmaps to record which items in each segment are in use. I
don't really agree with him about that, so we'd likely need some more
comments to help reach a consensus about whether we want this or not.
Maybe Andres has more comments, so I've included him here.
David
Attachments:
v2-0001-Use-densehash.h-hashtables-in-SMgr.patch
From 50268dcf6484af095fb7485758de1b44e9375a51 Mon Sep 17 00:00:00 2001
From: David Rowley <dgrowley@gmail.com>
Date: Tue, 3 Aug 2021 16:10:29 +1200
Subject: [PATCH v2] Use densehash.h hashtables in SMgr
The hash table lookups done in SMgr can quite often be a bottleneck during
crash recovery. Traditionally these use dynahash. Here we swap dynahash
out and use densehash instead. This improves lookup performance.
---
src/backend/storage/smgr/smgr.c | 82 +-
src/include/lib/densehash.h | 1436 +++++++++++++++++++++++++++++++
2 files changed, 1496 insertions(+), 22 deletions(-)
create mode 100644 src/include/lib/densehash.h
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 0fcef4994b..3fa9c21c4b 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -18,6 +18,7 @@
#include "postgres.h"
#include "access/xlogutils.h"
+#include "common/hashfn.h"
#include "lib/ilist.h"
#include "storage/bufmgr.h"
#include "storage/ipc.h"
@@ -25,6 +26,25 @@
#include "storage/smgr.h"
#include "utils/hsearch.h"
#include "utils/inval.h"
+#include "utils/memutils.h"
+
+static inline uint32 relfilenodebackend_hash(RelFileNodeBackend *rnode);
+
+#define DH_PREFIX smgrtable
+#define DH_ELEMENT_TYPE SMgrRelationData
+#define DH_KEY_TYPE RelFileNodeBackend
+#define DH_KEY smgr_rnode
+#define DH_HASH_KEY(tb, key) relfilenodebackend_hash(&key)
+#define DH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(RelFileNodeBackend)) == 0)
+#define DH_SCOPE static inline
+#define DH_STORE_HASH
+#define DH_GET_HASH(tb, a) a->hash
+#define DH_ALLOCATE(b) MemoryContextAlloc(TopMemoryContext, (b))
+#define DH_ALLOCATE_ZERO(b) MemoryContextAllocZero(TopMemoryContext, (b))
+#define DH_FREE(p) pfree(p)
+#define DH_DEFINE
+#define DH_DECLARE
+#include "lib/densehash.h"
/*
@@ -91,13 +111,43 @@ static const int NSmgr = lengthof(smgrsw);
* Each backend has a hashtable that stores all extant SMgrRelation objects.
* In addition, "unowned" SMgrRelation objects are chained together in a list.
*/
-static HTAB *SMgrRelationHash = NULL;
+static smgrtable_hash *SMgrRelationHash = NULL;
static dlist_head unowned_relns;
/* local function prototypes */
static void smgrshutdown(int code, Datum arg);
+/*
+ * relfilenodebackend_hash
+ * Custom rolled hash function for the densehash table.
+ *
+ * smgropen() is often a bottleneck in CPU bound workloads during crash
+ * recovery. We make use of this custom hash function rather than using
+ * hash_bytes as it gives us a little bit more performance.
+ *
+ * XXX What if sizeof(Oid) is not 4?
+ */
+static inline uint32
+relfilenodebackend_hash(RelFileNodeBackend *rnode)
+{
+ uint32 hashkey;
+
+ hashkey = murmurhash32((uint32) rnode->node.spcNode);
+
+ /* rotate hashkey left 1 bit at each step */
+ hashkey = pg_rotate_right32(hashkey, 31);
+ hashkey ^= murmurhash32((uint32) rnode->node.dbNode);
+
+ hashkey = pg_rotate_right32(hashkey, 31);
+ hashkey ^= murmurhash32((uint32) rnode->node.relNode);
+
+ hashkey = pg_rotate_right32(hashkey, 31);
+ hashkey ^= murmurhash32((uint32) rnode->backend);
+
+ return hashkey;
+}
+
/*
* smgrinit(), smgrshutdown() -- Initialize or shut down storage
@@ -149,29 +199,22 @@ smgropen(RelFileNode rnode, BackendId backend)
SMgrRelation reln;
bool found;
- if (SMgrRelationHash == NULL)
+ if (unlikely(SMgrRelationHash == NULL))
{
/* First time through: initialize the hash table */
- HASHCTL ctl;
-
- ctl.keysize = sizeof(RelFileNodeBackend);
- ctl.entrysize = sizeof(SMgrRelationData);
- SMgrRelationHash = hash_create("smgr relation table", 400,
- &ctl, HASH_ELEM | HASH_BLOBS);
+ SMgrRelationHash = smgrtable_create(400);
dlist_init(&unowned_relns);
}
/* Look up or create an entry */
brnode.node = rnode;
brnode.backend = backend;
- reln = (SMgrRelation) hash_search(SMgrRelationHash,
- (void *) &brnode,
- HASH_ENTER, &found);
+ reln = smgrtable_insert(SMgrRelationHash, brnode, &found);
/* Initialize it if not present before */
if (!found)
{
- /* hash_search already filled in the lookup key */
+ /* smgrtable_insert already filled in the lookup key */
reln->smgr_owner = NULL;
reln->smgr_targblock = InvalidBlockNumber;
for (int i = 0; i <= MAX_FORKNUM; ++i)
@@ -266,9 +309,7 @@ smgrclose(SMgrRelation reln)
if (!owner)
dlist_delete(&reln->node);
- if (hash_search(SMgrRelationHash,
- (void *) &(reln->smgr_rnode),
- HASH_REMOVE, NULL) == NULL)
+ if (!smgrtable_delete(SMgrRelationHash, reln->smgr_rnode))
elog(ERROR, "SMgrRelation hashtable corrupted");
/*
@@ -285,16 +326,16 @@ smgrclose(SMgrRelation reln)
void
smgrcloseall(void)
{
- HASH_SEQ_STATUS status;
+ smgrtable_iterator iterator;
SMgrRelation reln;
/* Nothing to do if hashtable not set up */
if (SMgrRelationHash == NULL)
return;
- hash_seq_init(&status, SMgrRelationHash);
+ smgrtable_start_iterate(SMgrRelationHash, &iterator);
- while ((reln = (SMgrRelation) hash_seq_search(&status)) != NULL)
+ while ((reln = smgrtable_iterate(SMgrRelationHash, &iterator)) != NULL)
smgrclose(reln);
}
@@ -314,10 +355,7 @@ smgrclosenode(RelFileNodeBackend rnode)
/* Nothing to do if hashtable not set up */
if (SMgrRelationHash == NULL)
return;
-
- reln = (SMgrRelation) hash_search(SMgrRelationHash,
- (void *) &rnode,
- HASH_FIND, NULL);
+ reln = smgrtable_lookup(SMgrRelationHash, rnode);
if (reln != NULL)
smgrclose(reln);
}
diff --git a/src/include/lib/densehash.h b/src/include/lib/densehash.h
new file mode 100644
index 0000000000..26fab94479
--- /dev/null
+++ b/src/include/lib/densehash.h
@@ -0,0 +1,1436 @@
+/*
+ * densehash.h
+ *
+ * A hashtable implementation which can be included into .c files to
+ * provide a fast hash table implementation specific to the given type.
+ *
+ * DH_ELEMENT_TYPE defines the data type that the hashtable stores. These
+ * are allocated DH_ITEMS_PER_SEGMENT at a time and stored inside a
+ * DH_SEGMENT. Each DH_SEGMENT is allocated on demand only when there are
+ * no free slots to store another DH_ELEMENT_TYPE in an existing segment.
+ * After items are removed from the hash table, the next inserted item's
+ * data will be stored in the earliest free item in the earliest segment
+ * with a free slot. This helps keep the actual data compact, or "dense"
+ * even when the bucket array has become large.
+ *
+ * The bucket array is an array of DH_BUCKET and is dynamically allocated
+ * and may grow as more items are added to the table. The DH_BUCKET type
+ * is very narrow and stores just 2 uint32 values. One of these is the
+ * hash value and the other is the index into the segments which are used
+ * to directly look up the stored DH_ELEMENT_TYPE type.
+ *
+ * During inserts, hash table collisions are dealt with using linear
+ * probing, this means that instead of doing something like chaining with a
+ * linked list, we use the first free bucket which comes after the optimal
+ * bucket. This is much more CPU cache efficient than traversing a linked
+ * list. When we're unable to use the most optimal bucket, we may also
+ * move the contents of subsequent buckets around so that we keep items as
+ * close to their most optimal position as possible. This prevents
+ * excessively long linear probes during lookups.
+ *
+ * During hash table deletes, we must attempt to move the contents of
+ * buckets that are not in their optimal position up to either their
+ * optimal position, or as close as we can get to it. During lookups, this
+ * means that we can stop searching for a non-existing item as soon as we
+ * find an empty bucket.
+ *
+ * Empty buckets are denoted by their 'index' field being set to
+ * DH_UNUSED_BUCKET_INDEX. This is done rather than adding a special field
+ * so that we can keep the DH_BUCKET type as narrow as possible.
+ * Conveniently sizeof(DH_BUCKET) is 8, which allows 8 of these to fit on a
+ * single 64-byte cache line. It's important to keep this type as narrow as
+ * possible so that we can perform hash lookups by hitting as few
+ * cache lines as possible.
+ *
+ * The implementation here is similar to simplehash.h but has the following
+ * benefits:
+ *
+ * - Pointers to elements are stable and are not moved around like they are
+ * in simplehash.h
+ * - Sequential scans of the hash table remain very fast even when the
+ * table is sparsely populated.
+ * - Both simplehash.h and densehash.h may move items around during inserts
+ * and deletes. If DH_ELEMENT_TYPE is large, since simplehash.h stores
+ * the data in the hash bucket, these operations may become expensive in
+ * simplehash.h. In densehash.h these remain fairly cheap as the bucket
+ * is always 8 bytes wide due to the hash entry being stored in the
+ * DH_SEGMENT.
+ *
+ * If none of the above points are important for the given use case then,
+ * please consider using simplehash.h instead.
+ *
+ *
+ * Portions Copyright (c) 2021, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/lib/densehash.h
+ *
+ */
+
+#include "port/pg_bitutils.h"
+
+/* helpers */
+#define DH_MAKE_PREFIX(a) CppConcat(a,_)
+#define DH_MAKE_NAME(name) DH_MAKE_NAME_(DH_MAKE_PREFIX(DH_PREFIX),name)
+#define DH_MAKE_NAME_(a,b) CppConcat(a,b)
+
+/* type declarations */
+#define DH_TYPE DH_MAKE_NAME(hash)
+#define DH_BUCKET DH_MAKE_NAME(bucket)
+#define DH_SEGMENT DH_MAKE_NAME(segment)
+#define DH_ITERATOR DH_MAKE_NAME(iterator)
+
+/* function declarations */
+#define DH_CREATE DH_MAKE_NAME(create)
+#define DH_DESTROY DH_MAKE_NAME(destroy)
+#define DH_RESET DH_MAKE_NAME(reset)
+#define DH_INSERT DH_MAKE_NAME(insert)
+#define DH_INSERT_HASH DH_MAKE_NAME(insert_hash)
+#define DH_DELETE DH_MAKE_NAME(delete)
+#define DH_LOOKUP DH_MAKE_NAME(lookup)
+#define DH_LOOKUP_HASH DH_MAKE_NAME(lookup_hash)
+#define DH_GROW DH_MAKE_NAME(grow)
+#define DH_START_ITERATE DH_MAKE_NAME(start_iterate)
+#define DH_ITERATE DH_MAKE_NAME(iterate)
+
+/* internal helper functions (no externally visible prototypes) */
+#define DH_NEXT_ONEBIT DH_MAKE_NAME(next_onebit)
+#define DH_NEXT_ZEROBIT DH_MAKE_NAME(next_zerobit)
+#define DH_INDEX_TO_ELEMENT DH_MAKE_NAME(index_to_element)
+#define DH_MARK_SEGMENT_ITEM_USED DH_MAKE_NAME(mark_segment_item_used)
+#define DH_MARK_SEGMENT_ITEM_UNUSED DH_MAKE_NAME(mark_segment_item_unused)
+#define DH_GET_NEXT_UNUSED_ENTRY DH_MAKE_NAME(get_next_unused_entry)
+#define DH_REMOVE_ENTRY DH_MAKE_NAME(remove_entry)
+#define DH_SET_BUCKET_IN_USE DH_MAKE_NAME(set_bucket_in_use)
+#define DH_SET_BUCKET_EMPTY DH_MAKE_NAME(set_bucket_empty)
+#define DH_IS_BUCKET_IN_USE DH_MAKE_NAME(is_bucket_in_use)
+#define DH_COMPUTE_PARAMETERS DH_MAKE_NAME(compute_parameters)
+#define DH_NEXT DH_MAKE_NAME(next)
+#define DH_PREV DH_MAKE_NAME(prev)
+#define DH_DISTANCE_FROM_OPTIMAL DH_MAKE_NAME(distance)
+#define DH_INITIAL_BUCKET DH_MAKE_NAME(initial_bucket)
+#define DH_INSERT_HASH_INTERNAL DH_MAKE_NAME(insert_hash_internal)
+#define DH_LOOKUP_HASH_INTERNAL DH_MAKE_NAME(lookup_hash_internal)
+
+/*
+ * When allocating memory to store instances of DH_ELEMENT_TYPE, how many
+ * should we allocate at once? This must be a power of 2 and at least
+ * DH_BITS_PER_WORD.
+ */
+#ifndef DH_ITEMS_PER_SEGMENT
+#define DH_ITEMS_PER_SEGMENT 256
+#endif
+
+/* A special index to set DH_BUCKET->index to when it's not in use */
+#define DH_UNUSED_BUCKET_INDEX PG_UINT32_MAX
+
+/*
+ * Macros for translating a bucket's index into the segment index and another
+ * to determine the item number within the segment.
+ */
+#define DH_INDEX_SEGMENT(i) (i) / DH_ITEMS_PER_SEGMENT
+#define DH_INDEX_ITEM(i) (i) % DH_ITEMS_PER_SEGMENT
+
+ /*
+ * How many elements do we need in the bitmap array to store a bit for each
+ * of DH_ITEMS_PER_SEGMENT. Keep the word size native to the processor.
+ */
+#if SIZEOF_VOID_P >= 8
+
+#define DH_BITS_PER_WORD 64
+#define DH_BITMAP_WORD uint64
+#define DH_RIGHTMOST_ONE_POS(x) pg_rightmost_one_pos64(x)
+
+#else
+
+#define DH_BITS_PER_WORD 32
+#define DH_BITMAP_WORD uint32
+#define DH_RIGHTMOST_ONE_POS(x) pg_rightmost_one_pos32(x)
+
+#endif
+
+/* Sanity check on DH_ITEMS_PER_SEGMENT setting */
+#if DH_ITEMS_PER_SEGMENT < DH_BITS_PER_WORD
+#error "DH_ITEMS_PER_SEGMENT must be >= than DH_BITS_PER_WORD"
+#endif
+
+/* Ensure DH_ITEMS_PER_SEGMENT is a power of 2 */
+#if DH_ITEMS_PER_SEGMENT & (DH_ITEMS_PER_SEGMENT - 1) != 0
+#error "DH_ITEMS_PER_SEGMENT must be a power of 2"
+#endif
+
+#define DH_BITMAP_WORDS (DH_ITEMS_PER_SEGMENT / DH_BITS_PER_WORD)
+#define DH_WORDNUM(x) ((x) / DH_BITS_PER_WORD)
+#define DH_BITNUM(x) ((x) % DH_BITS_PER_WORD)
+
+/* generate forward declarations necessary to use the hash table */
+#ifdef DH_DECLARE
+
+typedef struct DH_BUCKET
+{
+ uint32 hashvalue; /* Hash value for this bucket */
+ uint32 index; /* Index to the actual data */
+} DH_BUCKET;
+
+typedef struct DH_SEGMENT
+{
+ uint32 nitems; /* Number of items stored */
+ DH_BITMAP_WORD used_items[DH_BITMAP_WORDS]; /* A 1-bit for each used item
+ * in the items array */
+ DH_ELEMENT_TYPE items[DH_ITEMS_PER_SEGMENT]; /* the actual data */
+} DH_SEGMENT;
+
+/* type definitions */
+
+/*
+ * DH_TYPE
+ * Hash table metadata type
+ */
+typedef struct DH_TYPE
+{
+ /*
+ * Size of bucket array. Note that the maximum number of elements is
+ * lower (DH_MAX_FILLFACTOR)
+ */
+ uint32 size;
+
+ /* mask for bucket and size calculations, based on size */
+ uint32 sizemask;
+
+ /* the number of elements stored */
+ uint32 members;
+
+ /* boundary after which to grow hashtable */
+ uint32 grow_threshold;
+
+ /* how many elements are there in the segments array */
+ uint32 nsegments;
+
+ /* the number of elements in the used_segments array */
+ uint32 used_segment_words;
+
+ /*
+ * The first segment we should search in for an empty slot. This will be
+ * the first segment that DH_GET_NEXT_UNUSED_ENTRY will search in when
+ * looking for an unused entry. We'll increase the value of this when we
+ * fill a segment and we'll lower it down when we delete an item from a
+ * segment lower than this value.
+ */
+ uint32 first_free_segment;
+
+ /* dynamically allocated array of hash buckets */
+ DH_BUCKET *buckets;
+
+ /* an array of segment pointers to store data */
+ DH_SEGMENT **segments;
+
+ /*
+ * A bitmap of non-empty segments. A 1-bit denotes that the corresponding
+ * segment is non-empty.
+ */
+ DH_BITMAP_WORD *used_segments;
+
+#ifdef DH_HAVE_PRIVATE_DATA
+ /* user defined data, useful for callbacks */
+ void *private_data;
+#endif
+} DH_TYPE;
+
+/*
+ * DH_ITERATOR
+ * Used when looping over the contents of the hash table.
+ */
+typedef struct DH_ITERATOR
+{
+ int32 cursegidx; /* current segment. -1 means not started */
+ int32 curitemidx; /* current item within cursegidx, -1 means not
+ * started */
+ uint32 found_members; /* number of items visited so far in the loop */
+ uint32 total_members; /* number of items that existed at the start
+ * iteration. */
+} DH_ITERATOR;
+
+/* externally visible function prototypes */
+
+#ifdef DH_HAVE_PRIVATE_DATA
+/* <prefix>_hash <prefix>_create(uint32 nbuckets, void *private_data) */
+DH_SCOPE DH_TYPE *DH_CREATE(uint32 nbuckets, void *private_data);
+#else
+/* <prefix>_hash <prefix>_create(uint32 nbuckets) */
+DH_SCOPE DH_TYPE *DH_CREATE(uint32 nbuckets);
+#endif
+
+/* void <prefix>_destroy(<prefix>_hash *tb) */
+DH_SCOPE void DH_DESTROY(DH_TYPE * tb);
+
+/* void <prefix>_reset(<prefix>_hash *tb) */
+DH_SCOPE void DH_RESET(DH_TYPE * tb);
+
+/* void <prefix>_grow(<prefix>_hash *tb) */
+DH_SCOPE void DH_GROW(DH_TYPE * tb, uint32 newsize);
+
+/* <element> *<prefix>_insert(<prefix>_hash *tb, <key> key, bool *found) */
+DH_SCOPE DH_ELEMENT_TYPE *DH_INSERT(DH_TYPE * tb, DH_KEY_TYPE key,
+ bool *found);
+
+/*
+ * <element> *<prefix>_insert_hash(<prefix>_hash *tb, <key> key, uint32 hash,
+ * bool *found)
+ */
+DH_SCOPE DH_ELEMENT_TYPE *DH_INSERT_HASH(DH_TYPE * tb, DH_KEY_TYPE key,
+ uint32 hash, bool *found);
+
+/* <element> *<prefix>_lookup(<prefix>_hash *tb, <key> key) */
+DH_SCOPE DH_ELEMENT_TYPE *DH_LOOKUP(DH_TYPE * tb, DH_KEY_TYPE key);
+
+/* <element> *<prefix>_lookup_hash(<prefix>_hash *tb, <key> key, uint32 hash) */
+DH_SCOPE DH_ELEMENT_TYPE *DH_LOOKUP_HASH(DH_TYPE * tb, DH_KEY_TYPE key,
+ uint32 hash);
+
+/* bool <prefix>_delete(<prefix>_hash *tb, <key> key) */
+DH_SCOPE bool DH_DELETE(DH_TYPE * tb, DH_KEY_TYPE key);
+
+/* void <prefix>_start_iterate(<prefix>_hash *tb, <prefix>_iterator *iter) */
+DH_SCOPE void DH_START_ITERATE(DH_TYPE * tb, DH_ITERATOR * iter);
+
+/* <element> *<prefix>_iterate(<prefix>_hash *tb, <prefix>_iterator *iter) */
+DH_SCOPE DH_ELEMENT_TYPE *DH_ITERATE(DH_TYPE * tb, DH_ITERATOR * iter);
+
+#endif /* DH_DECLARE */
+
+/* generate implementation of the hash table */
+#ifdef DH_DEFINE
+
+/*
+ * The maximum size for the hash table. This must be a power of 2. We cannot
+ * make this PG_UINT32_MAX + 1 because we use DH_UNUSED_BUCKET_INDEX to denote an
+ * empty bucket. Doing so would mean we could accidentally set a used
+ * bucket's index to DH_UNUSED_BUCKET_INDEX.
+ */
+#define DH_MAX_SIZE ((uint32) PG_INT32_MAX + 1)
+
+/* normal fillfactor, unless already close to maximum */
+#ifndef DH_FILLFACTOR
+#define DH_FILLFACTOR (0.9)
+#endif
+/* increase fillfactor if we otherwise would error out */
+#define DH_MAX_FILLFACTOR (0.98)
+/* grow if actual and optimal location bigger than */
+#ifndef DH_GROW_MAX_DIB
+#define DH_GROW_MAX_DIB 25
+#endif
+/*
+ * Grow if more than this number of buckets needs to be moved when inserting.
+ */
+#ifndef DH_GROW_MAX_MOVE
+#define DH_GROW_MAX_MOVE 150
+#endif
+#ifndef DH_GROW_MIN_FILLFACTOR
+/* but do not grow due to DH_GROW_MAX_* if below */
+#define DH_GROW_MIN_FILLFACTOR 0.1
+#endif
+
+/*
+ * Wrap the following definitions in include guards, to avoid multiple
+ * definition errors if this header is included more than once. The rest of
+ * the file deliberately has no include guards, because it can be included
+ * with different parameters to define functions and types with non-colliding
+ * names.
+ */
+#ifndef DENSEHASH_H
+#define DENSEHASH_H
+
+#ifdef FRONTEND
+#define gh_error(...) pg_log_error(__VA_ARGS__)
+#define gh_log(...) pg_log_info(__VA_ARGS__)
+#else
+#define gh_error(...) elog(ERROR, __VA_ARGS__)
+#define gh_log(...) elog(LOG, __VA_ARGS__)
+#endif
+
+#endif /* DENSEHASH_H */
+
+/*
+ * Gets the position of the first 1-bit which comes after 'prevbit' in the
+ * 'words' array. 'nwords' is the size of the 'words' array.
+ */
+static inline int32
+DH_NEXT_ONEBIT(DH_BITMAP_WORD * words, uint32 nwords, int32 prevbit)
+{
+ uint32 wordnum;
+
+ prevbit++;
+
+ wordnum = DH_WORDNUM(prevbit);
+ if (wordnum < nwords)
+ {
+ DH_BITMAP_WORD mask = (~(DH_BITMAP_WORD) 0) << DH_BITNUM(prevbit);
+ DH_BITMAP_WORD word = words[wordnum] & mask;
+
+ if (word != 0)
+ return wordnum * DH_BITS_PER_WORD + DH_RIGHTMOST_ONE_POS(word);
+
+ for (++wordnum; wordnum < nwords; wordnum++)
+ {
+ word = words[wordnum];
+
+ if (word != 0)
+ {
+ int32 result = wordnum * DH_BITS_PER_WORD;
+
+ result += DH_RIGHTMOST_ONE_POS(word);
+ return result;
+ }
+ }
+ }
+ return -1;
+}
+
+/*
+ * Gets the position of the first 0-bit which comes after 'prevbit' in the
+ * 'words' array. 'nwords' is the size of the 'words' array.
+ *
+ * This is similar to DH_NEXT_ONEBIT but flips the bits before operating on
+ * each DH_BITMAP_WORD.
+ */
+static inline int32
+DH_NEXT_ZEROBIT(DH_BITMAP_WORD * words, uint32 nwords, int32 prevbit)
+{
+ uint32 wordnum;
+
+ prevbit++;
+
+ wordnum = DH_WORDNUM(prevbit);
+ if (wordnum < nwords)
+ {
+ DH_BITMAP_WORD mask = (~(DH_BITMAP_WORD) 0) << DH_BITNUM(prevbit);
+ DH_BITMAP_WORD word = ~(words[wordnum] & mask); /* flip bits */
+
+ if (word != 0)
+ return wordnum * DH_BITS_PER_WORD + DH_RIGHTMOST_ONE_POS(word);
+
+ for (++wordnum; wordnum < nwords; wordnum++)
+ {
+ word = ~words[wordnum]; /* flip bits */
+
+ if (word != 0)
+ {
+ int32 result = wordnum * DH_BITS_PER_WORD;
+
+ result += DH_RIGHTMOST_ONE_POS(word);
+ return result;
+ }
+ }
+ }
+ return -1;
+}
+
+/*
+ * Finds the hash table entry for a given DH_BUCKET's 'index'.
+ */
+static inline DH_ELEMENT_TYPE *
+DH_INDEX_TO_ELEMENT(DH_TYPE * tb, uint32 index)
+{
+ DH_SEGMENT *seg;
+ uint32 segidx;
+ uint32 item;
+
+ segidx = DH_INDEX_SEGMENT(index);
+ item = DH_INDEX_ITEM(index);
+
+ Assert(segidx < tb->nsegments);
+
+ seg = tb->segments[segidx];
+
+ Assert(seg != NULL);
+
+ /* ensure this segment is marked as used */
+ Assert(seg->used_items[DH_WORDNUM(item)] & (((DH_BITMAP_WORD) 1) << DH_BITNUM(item)));
+
+ return &seg->items[item];
+}
+
+static inline void
+DH_MARK_SEGMENT_ITEM_USED(DH_TYPE * tb, DH_SEGMENT * seg, uint32 segidx,
+ uint32 segitem)
+{
+ uint32 word = DH_WORDNUM(segitem);
+ uint32 bit = DH_BITNUM(segitem);
+
+ /* ensure this item is not marked as used */
+ Assert((seg->used_items[word] & (((DH_BITMAP_WORD) 1) << bit)) == 0);
+
+ /* switch on the used bit */
+ seg->used_items[word] |= (((DH_BITMAP_WORD) 1) << bit);
+
+ /* if the segment was previously empty then mark it as used */
+ if (seg->nitems == 0)
+ {
+ word = DH_WORDNUM(segidx);
+ bit = DH_BITNUM(segidx);
+
+ /* switch on the used bit for this segment */
+ tb->used_segments[word] |= (((DH_BITMAP_WORD) 1) << bit);
+ }
+ seg->nitems++;
+}
+
+static inline void
+DH_MARK_SEGMENT_ITEM_UNUSED(DH_TYPE * tb, DH_SEGMENT * seg, uint32 segidx,
+ uint32 segitem)
+{
+ uint32 word = DH_WORDNUM(segitem);
+ uint32 bit = DH_BITNUM(segitem);
+
+ /* ensure this item is marked as used */
+ Assert((seg->used_items[word] & (((DH_BITMAP_WORD) 1) << bit)) != 0);
+
+ /* switch off the used bit */
+ seg->used_items[word] &= ~(((DH_BITMAP_WORD) 1) << bit);
+
+ /* when removing the last item mark the segment as unused */
+ if (seg->nitems == 1)
+ {
+ word = DH_WORDNUM(segidx);
+ bit = DH_BITNUM(segidx);
+
+ /* switch off the used bit for this segment */
+ tb->used_segments[word] &= ~(((DH_BITMAP_WORD) 1) << bit);
+ }
+
+ seg->nitems--;
+}
+
+/*
+ * Returns the first unused entry from the first non-full segment and sets
+ * *index to the index of the returned entry.
+ */
+static inline DH_ELEMENT_TYPE *
+DH_GET_NEXT_UNUSED_ENTRY(DH_TYPE * tb, uint32 *index)
+{
+ DH_SEGMENT *seg;
+ uint32 segidx = tb->first_free_segment;
+ uint32 itemidx;
+
+ seg = tb->segments[segidx];
+
+ /* find the first segment with an unused item */
+ while (seg != NULL && seg->nitems == DH_ITEMS_PER_SEGMENT)
+ seg = tb->segments[++segidx];
+
+ tb->first_free_segment = segidx;
+
+ /* allocate the segment if it's not already */
+ if (seg == NULL)
+ {
+ seg = DH_ALLOCATE(sizeof(DH_SEGMENT));
+ tb->segments[segidx] = seg;
+
+ seg->nitems = 0;
+ memset(seg->used_items, 0, sizeof(seg->used_items));
+ /* no need to zero the items array */
+
+ /* use the first slot in this segment */
+ itemidx = 0;
+ }
+ else
+ {
+ /* find the first unused item in this segment */
+ itemidx = DH_NEXT_ZEROBIT(seg->used_items, DH_BITMAP_WORDS, -1);
+ Assert(itemidx >= 0);
+ }
+
+ /* this is a good spot to ensure nitems matches the bits in used_items */
+ Assert(seg->nitems == pg_popcount((const char *) seg->used_items, DH_ITEMS_PER_SEGMENT / 8));
+
+ DH_MARK_SEGMENT_ITEM_USED(tb, seg, segidx, itemidx);
+
+ *index = segidx * DH_ITEMS_PER_SEGMENT + itemidx;
+ return &seg->items[itemidx];
+
+}
+
+/*
+ * Remove the entry denoted by 'index' from its segment.
+ */
+static inline void
+DH_REMOVE_ENTRY(DH_TYPE * tb, uint32 index)
+{
+ DH_SEGMENT *seg;
+ uint32 segidx = DH_INDEX_SEGMENT(index);
+ uint32 item = DH_INDEX_ITEM(index);
+
+ Assert(segidx < tb->nsegments);
+ seg = tb->segments[segidx];
+ Assert(seg != NULL);
+
+ DH_MARK_SEGMENT_ITEM_UNUSED(tb, seg, segidx, item);
+
+ /*
+ * Lower the first free segment index to point to this segment so that the
+ * next insert will store in this segment. If it's already set to a lower
+ * segment number then don't adjust as we want to consume slots from the
+ * earliest segment first.
+ */
+ if (tb->first_free_segment > segidx)
+ tb->first_free_segment = segidx;
+}
+
+/*
+ * Set 'bucket' as in use by 'index'.
+ */
+static inline void
+DH_SET_BUCKET_IN_USE(DH_BUCKET * bucket, uint32 index)
+{
+ bucket->index = index;
+}
+
+/*
+ * Mark 'bucket' as unused.
+ */
+static inline void
+DH_SET_BUCKET_EMPTY(DH_BUCKET * bucket)
+{
+ bucket->index = DH_UNUSED_BUCKET_INDEX;
+}
+
+/*
+ * Return true if 'bucket' is in use.
+ */
+static inline bool
+DH_IS_BUCKET_IN_USE(DH_BUCKET * bucket)
+{
+ return bucket->index != DH_UNUSED_BUCKET_INDEX;
+}
+
+ /*
+ * Compute sizing parameters for hashtable. Called when creating and growing
+ * the hashtable.
+ */
+static inline void
+DH_COMPUTE_PARAMETERS(DH_TYPE * tb, uint32 newsize)
+{
+ uint32 size;
+
+ /*
+ * Ensure the bucket array size has not exceeded DH_MAX_SIZE or wrapped
+ * back to zero.
+ */
+ if (newsize == 0 || newsize > DH_MAX_SIZE)
+ gh_error("hash table too large");
+
+ /*
+ * Ensure we don't build a table that can't store an entire single segment
+ * worth of data.
+ */
+ size = Max(newsize, DH_ITEMS_PER_SEGMENT);
+
+ /* round up size to the next power of 2 */
+ size = pg_nextpower2_32(size);
+
+ /* now set size */
+ tb->size = size;
+ tb->sizemask = tb->size - 1;
+
+ /* calculate how many segments we'll need to store 'size' items */
+ tb->nsegments = pg_nextpower2_32(size / DH_ITEMS_PER_SEGMENT);
+
+ /*
+ * Calculate the number of bitmap words needed to store a bit for each
+ * segment.
+ */
+ tb->used_segment_words = (tb->nsegments + DH_BITS_PER_WORD - 1) / DH_BITS_PER_WORD;
+
+ /*
+ * Compute the next threshold at which we need to grow the hash table
+ * again.
+ */
+ if (tb->size == DH_MAX_SIZE)
+ tb->grow_threshold = (uint32) (((double) tb->size) * DH_MAX_FILLFACTOR);
+ else
+ tb->grow_threshold = (uint32) (((double) tb->size) * DH_FILLFACTOR);
+}
+
+/* return the optimal bucket for the hash */
+static inline uint32
+DH_INITIAL_BUCKET(DH_TYPE * tb, uint32 hash)
+{
+ return hash & tb->sizemask;
+}
+
+/* return the next bucket after the current, handling wraparound */
+static inline uint32
+DH_NEXT(DH_TYPE * tb, uint32 curelem, uint32 startelem)
+{
+ curelem = (curelem + 1) & tb->sizemask;
+
+ Assert(curelem != startelem);
+
+ return curelem;
+}
+
+/* return the bucket before the current, handling wraparound */
+static inline uint32
+DH_PREV(DH_TYPE * tb, uint32 curelem, uint32 startelem)
+{
+ curelem = (curelem - 1) & tb->sizemask;
+
+ Assert(curelem != startelem);
+
+ return curelem;
+}
+
+/* return the distance between a bucket and its optimal position */
+static inline uint32
+DH_DISTANCE_FROM_OPTIMAL(DH_TYPE * tb, uint32 optimal, uint32 bucket)
+{
+ if (optimal <= bucket)
+ return bucket - optimal;
+ else
+ return (tb->size + bucket) - optimal;
+}
+
+/*
+ * Create a hash table with 'nbuckets' buckets.
+ */
+DH_SCOPE DH_TYPE *
+#ifdef DH_HAVE_PRIVATE_DATA
+DH_CREATE(uint32 nbuckets, void *private_data)
+#else
+DH_CREATE(uint32 nbuckets)
+#endif
+{
+ DH_TYPE *tb;
+ uint32 size;
+ uint32 i;
+
+ tb = DH_ALLOCATE_ZERO(sizeof(DH_TYPE));
+
+#ifdef DH_HAVE_PRIVATE_DATA
+ tb->private_data = private_data;
+#endif
+
+ /* increase nelements by fillfactor, want to store nelements elements */
+ size = (uint32) Min((double) DH_MAX_SIZE, ((double) nbuckets) / DH_FILLFACTOR);
+
+ DH_COMPUTE_PARAMETERS(tb, size);
+
+ tb->buckets = DH_ALLOCATE(sizeof(DH_BUCKET) * tb->size);
+
+ /* ensure all the buckets are set to empty */
+ for (i = 0; i < tb->size; i++)
+ DH_SET_BUCKET_EMPTY(&tb->buckets[i]);
+
+ tb->segments = DH_ALLOCATE_ZERO(sizeof(DH_SEGMENT *) * tb->nsegments);
+ tb->used_segments = DH_ALLOCATE_ZERO(sizeof(DH_BITMAP_WORD) * tb->used_segment_words);
+ return tb;
+}
+
+/* destroy a previously created hash table */
+DH_SCOPE void
+DH_DESTROY(DH_TYPE * tb)
+{
+ DH_FREE(tb->buckets);
+
+ /* Free each segment one by one */
+ for (uint32 n = 0; n < tb->nsegments; n++)
+ {
+ if (tb->segments[n] != NULL)
+ DH_FREE(tb->segments[n]);
+ }
+
+ DH_FREE(tb->segments);
+ DH_FREE(tb->used_segments);
+
+ pfree(tb);
+}
+
+/* reset the contents of a previously created hash table */
+DH_SCOPE void
+DH_RESET(DH_TYPE * tb)
+{
+ int32 i = -1;
+ uint32 x;
+
+ /* reset each used segment one by one */
+ while ((i = DH_NEXT_ONEBIT(tb->used_segments, tb->used_segment_words,
+ i)) >= 0)
+ {
+ DH_SEGMENT *seg = tb->segments[i];
+
+ Assert(seg != NULL);
+
+ seg->nitems = 0;
+ memset(seg->used_items, 0, sizeof(seg->used_items));
+ }
+
+ /* empty every bucket */
+ for (x = 0; x < tb->size; x++)
+ DH_SET_BUCKET_EMPTY(&tb->buckets[x]);
+
+ /* zero the used segment bits */
+ memset(tb->used_segments, 0, sizeof(DH_BITMAP_WORD) * tb->used_segment_words);
+
+ /* and mark the table as having zero members */
+ tb->members = 0;
+
+ /* ensure we start putting any new items in the first segment */
+ tb->first_free_segment = 0;
+}
+
+/*
+ * Grow a hash table to at least 'newsize' buckets.
+ *
+ * Usually this will automatically be called by insertions/deletions, when
+ * necessary. But resizing to the exact input size can be advantageous
+ * performance-wise, when known at some point.
+ */
+DH_SCOPE void
+DH_GROW(DH_TYPE * tb, uint32 newsize)
+{
+ uint32 oldsize = tb->size;
+ uint32 oldnsegments = tb->nsegments;
+ uint32 oldusedsegmentwords = tb->used_segment_words;
+ DH_BUCKET *oldbuckets = tb->buckets;
+ DH_SEGMENT **oldsegments = tb->segments;
+ DH_BITMAP_WORD *oldusedsegments = tb->used_segments;
+ DH_BUCKET *newbuckets;
+ uint32 i;
+ uint32 startelem = 0;
+ uint32 copyelem;
+
+ Assert(oldsize == pg_nextpower2_32(oldsize));
+
+ /* compute parameters for new table */
+ DH_COMPUTE_PARAMETERS(tb, newsize);
+
+ tb->buckets = DH_ALLOCATE(sizeof(DH_BUCKET) * tb->size);
+
+ /* Ensure all the buckets are set to empty */
+ for (i = 0; i < tb->size; i++)
+ DH_SET_BUCKET_EMPTY(&tb->buckets[i]);
+
+ newbuckets = tb->buckets;
+
+ /*
+ * Copy buckets from the old buckets to newbuckets. We theoretically could
+ * use DH_INSERT here, to avoid code duplication, but that's more general
+ * than we need. We neither want tb->members increased, nor do we need to
+ * do deal with deleted elements, nor do we need to compare keys. So a
+ * special-cased implementation is a lot faster. Resizing can be time
+ * consuming and frequent, that's worthwhile to optimize.
+ *
+ * To be able to simply move buckets over, we have to start not at the
+ * first bucket (i.e oldbuckets[0]), but find the first bucket that's
+ * either empty or is occupied by an entry at its optimal position. Such a
+ * bucket has to exist in any table with a load factor under 1, as not all
+ * buckets are occupied, i.e. there always has to be an empty bucket. By
+ * starting at such a bucket we can move the entries to the larger table,
+ * without having to deal with conflicts.
+ */
+
+ /* search for the first element in the hash that's not wrapped around */
+ for (i = 0; i < oldsize; i++)
+ {
+ DH_BUCKET *oldbucket = &oldbuckets[i];
+ uint32 hash;
+ uint32 optimal;
+
+ if (!DH_IS_BUCKET_IN_USE(oldbucket))
+ {
+ startelem = i;
+ break;
+ }
+
+ hash = oldbucket->hashvalue;
+ optimal = DH_INITIAL_BUCKET(tb, hash);
+
+ if (optimal == i)
+ {
+ startelem = i;
+ break;
+ }
+ }
+
+ /* and copy all elements in the old table */
+ copyelem = startelem;
+ for (i = 0; i < oldsize; i++)
+ {
+ DH_BUCKET *oldbucket = &oldbuckets[copyelem];
+
+ if (DH_IS_BUCKET_IN_USE(oldbucket))
+ {
+ uint32 hash;
+ uint32 startelem;
+ uint32 curelem;
+ DH_BUCKET *newbucket;
+
+ hash = oldbucket->hashvalue;
+ startelem = DH_INITIAL_BUCKET(tb, hash);
+ curelem = startelem;
+
+ /* find empty element to put data into */
+ for (;;)
+ {
+ newbucket = &newbuckets[curelem];
+
+ if (!DH_IS_BUCKET_IN_USE(newbucket))
+ break;
+
+ curelem = DH_NEXT(tb, curelem, startelem);
+ }
+
+ /* copy entry to new slot */
+ memcpy(newbucket, oldbucket, sizeof(DH_BUCKET));
+ }
+
+ /* can't use DH_NEXT here, would use new size */
+ copyelem++;
+ if (copyelem >= oldsize)
+ copyelem = 0;
+ }
+
+ DH_FREE(oldbuckets);
+
+ /*
+ * Enlarge the segment array so we can store enough segments for the new
+ * hash table capacity.
+ */
+ tb->segments = DH_ALLOCATE(sizeof(DH_SEGMENT *) * tb->nsegments);
+ memcpy(tb->segments, oldsegments, sizeof(DH_SEGMENT *) * oldnsegments);
+ /* zero the newly extended part of the array */
+ memset(&tb->segments[oldnsegments], 0, sizeof(DH_SEGMENT *) *
+ (tb->nsegments - oldnsegments));
+ DH_FREE(oldsegments);
+
+ /*
+ * The majority of tables will only ever need one bitmap word to store
+ * used segments, so we only bother to reallocate the used_segments array
+ * if the number of bitmap words has actually changed.
+ */
+ if (tb->used_segment_words != oldusedsegmentwords)
+ {
+ tb->used_segments = DH_ALLOCATE(sizeof(DH_BITMAP_WORD) *
+ tb->used_segment_words);
+ memcpy(tb->used_segments, oldusedsegments, sizeof(DH_BITMAP_WORD) *
+ oldusedsegmentwords);
+ memset(&tb->used_segments[oldusedsegmentwords], 0,
+ sizeof(DH_BITMAP_WORD) * (tb->used_segment_words -
+ oldusedsegmentwords));
+
+ DH_FREE(oldusedsegments);
+ }
+}
+
+/*
+ * This is a separate static inline function, so it can reliably be inlined
+ * into its wrapper functions even if DH_SCOPE is extern.
+ */
+static inline DH_ELEMENT_TYPE *
+DH_INSERT_HASH_INTERNAL(DH_TYPE * tb, DH_KEY_TYPE key, uint32 hash, bool *found)
+{
+ uint32 startelem;
+ uint32 curelem;
+ DH_BUCKET *buckets;
+ uint32 insertdist;
+
+restart:
+ insertdist = 0;
+
+ /*
+ * To avoid doing the grow check inside the loop, we do the grow check
+ * regardless of if the key is present. This also lets us avoid having to
+ * re-find our position in the hashtable after resizing.
+ *
+ * Note that this is also reached when resizing the table due to
+ * DH_GROW_MAX_DIB / DH_GROW_MAX_MOVE.
+ */
+ if (unlikely(tb->members >= tb->grow_threshold))
+ {
+ /* this may wrap back to 0 when we're already at DH_MAX_SIZE */
+ DH_GROW(tb, tb->size * 2);
+ }
+
+ /* perform the insert starting the bucket search at optimal location */
+ buckets = tb->buckets;
+ startelem = DH_INITIAL_BUCKET(tb, hash);
+ curelem = startelem;
+ for (;;)
+ {
+ DH_BUCKET *bucket = &buckets[curelem];
+ DH_ELEMENT_TYPE *entry;
+ uint32 curdist;
+ uint32 curhash;
+ uint32 curoptimal;
+
+ /* any empty bucket can directly be used */
+ if (!DH_IS_BUCKET_IN_USE(bucket))
+ {
+ uint32 index;
+
+ /* and add the new entry */
+ tb->members++;
+
+ entry = DH_GET_NEXT_UNUSED_ENTRY(tb, &index);
+ entry->DH_KEY = key;
+ bucket->hashvalue = hash;
+ DH_SET_BUCKET_IN_USE(bucket, index);
+ *found = false;
+ return entry;
+ }
+
+ curhash = bucket->hashvalue;
+
+ if (curhash == hash)
+ {
+ /*
+ * The hash value matches so we just need to ensure the key
+ * matches too. To do that, we need to lookup the entry in the
+ * segments using the index stored in the bucket.
+ */
+ entry = DH_INDEX_TO_ELEMENT(tb, bucket->index);
+
+ /* if we find a match, we're done */
+ if (DH_EQUAL(tb, key, entry->DH_KEY))
+ {
+ Assert(DH_IS_BUCKET_IN_USE(bucket));
+ *found = true;
+ return entry;
+ }
+ }
+
+ /*
+ * For non-empty, non-matching buckets we have to decide whether to
+ * skip over or move the colliding entry. When the colliding
+ * element's distance to its optimal position is smaller than the
+ * to-be-inserted entry's, we shift the colliding entry (and its
+ * followers) one bucket closer to their optimal position.
+ */
+ curoptimal = DH_INITIAL_BUCKET(tb, curhash);
+ curdist = DH_DISTANCE_FROM_OPTIMAL(tb, curoptimal, curelem);
+
+ if (insertdist > curdist)
+ {
+ DH_ELEMENT_TYPE *entry;
+ DH_BUCKET *lastbucket = bucket;
+ uint32 emptyelem = curelem;
+ uint32 moveelem;
+ int32 emptydist = 0;
+ uint32 index;
+
+ /* find next empty bucket */
+ for (;;)
+ {
+ DH_BUCKET *emptybucket;
+
+ emptyelem = DH_NEXT(tb, emptyelem, startelem);
+ emptybucket = &buckets[emptyelem];
+
+ if (!DH_IS_BUCKET_IN_USE(emptybucket))
+ {
+ lastbucket = emptybucket;
+ break;
+ }
+
+ /*
+ * To avoid negative consequences from overly imbalanced
+ * hashtables, grow the hashtable if collisions would require
+ * us to move a lot of entries. The most likely cause of such
+ * imbalance is filling a (currently) small table, from a
+ * currently big one, in hashtable order. Don't grow if the
+ * hashtable would be too empty, to prevent quick space
+ * explosion for some weird edge cases.
+ */
+ if (unlikely(++emptydist > DH_GROW_MAX_MOVE) &&
+ ((double) tb->members / tb->size) >= DH_GROW_MIN_FILLFACTOR)
+ {
+ tb->grow_threshold = 0;
+ goto restart;
+ }
+ }
+
+ /* shift forward, starting at last occupied element */
+
+ /*
+ * TODO: This could be optimized to be one memcpy in many cases,
+ * excepting wrapping around at the end of ->data. Hasn't shown up
+ * in profiles so far though.
+ */
+ moveelem = emptyelem;
+ while (moveelem != curelem)
+ {
+ DH_BUCKET *movebucket;
+
+ moveelem = DH_PREV(tb, moveelem, startelem);
+ movebucket = &buckets[moveelem];
+
+ memcpy(lastbucket, movebucket, sizeof(DH_BUCKET));
+ lastbucket = movebucket;
+ }
+
+ /* and add the new entry */
+ tb->members++;
+
+ entry = DH_GET_NEXT_UNUSED_ENTRY(tb, &index);
+ entry->DH_KEY = key;
+ bucket->hashvalue = hash;
+ DH_SET_BUCKET_IN_USE(bucket, index);
+ *found = false;
+ return entry;
+ }
+
+ curelem = DH_NEXT(tb, curelem, startelem);
+ insertdist++;
+
+ /*
+ * To avoid negative consequences from overly imbalanced hashtables,
+ * grow the hashtable if collisions lead to large runs. The most
+ * likely cause of such imbalance is filling a (currently) small
+ * table, from a currently big one, in hashtable order. Don't grow if
+ * the hashtable would be too empty, to prevent quick space explosion
+ * for some weird edge cases.
+ */
+ if (unlikely(insertdist > DH_GROW_MAX_DIB) &&
+ ((double) tb->members / tb->size) >= DH_GROW_MIN_FILLFACTOR)
+ {
+ tb->grow_threshold = 0;
+ goto restart;
+ }
+ }
+}
+
+/*
+ * Insert the key into the hashtable, set *found to true if the key already
+ * exists, false otherwise. Returns the hashtable entry in either case.
+ */
+DH_SCOPE DH_ELEMENT_TYPE *
+DH_INSERT(DH_TYPE * tb, DH_KEY_TYPE key, bool *found)
+{
+ uint32 hash = DH_HASH_KEY(tb, key);
+
+ return DH_INSERT_HASH_INTERNAL(tb, key, hash, found);
+}
+
+/*
+ * Insert the key into the hashtable using an already-calculated hash. Set
+ * *found to true if the key already exists, false otherwise. Returns the
+ * hashtable entry in either case.
+ */
+DH_SCOPE DH_ELEMENT_TYPE *
+DH_INSERT_HASH(DH_TYPE * tb, DH_KEY_TYPE key, uint32 hash, bool *found)
+{
+ return DH_INSERT_HASH_INTERNAL(tb, key, hash, found);
+}
+
+/*
+ * This is a separate static inline function, so it can reliably be inlined
+ * into its wrapper functions even if DH_SCOPE is extern.
+ */
+static inline DH_ELEMENT_TYPE *
+DH_LOOKUP_HASH_INTERNAL(DH_TYPE * tb, DH_KEY_TYPE key, uint32 hash)
+{
+ const uint32 startelem = DH_INITIAL_BUCKET(tb, hash);
+ uint32 curelem = startelem;
+
+ for (;;)
+ {
+ DH_BUCKET *bucket = &tb->buckets[curelem];
+
+ if (!DH_IS_BUCKET_IN_USE(bucket))
+ return NULL;
+
+ if (bucket->hashvalue == hash)
+ {
+ DH_ELEMENT_TYPE *entry;
+
+ /*
+ * The hash value matches so we just need to ensure the key
+ * matches too. To do that, we need to lookup the entry in the
+ * segments using the index stored in the bucket.
+ */
+ entry = DH_INDEX_TO_ELEMENT(tb, bucket->index);
+
+ /* if we find a match, we're done */
+ if (DH_EQUAL(tb, key, entry->DH_KEY))
+ return entry;
+ }
+
+ /*
+ * TODO: we could stop search based on distance. If the current
+ * bucket's distance-from-optimal is smaller than what we've skipped
+ * already, the entry doesn't exist.
+ */
+
+ curelem = DH_NEXT(tb, curelem, startelem);
+ }
+}
+
+/*
+ * Lookup an entry in the hash table. Returns NULL if key not present.
+ */
+DH_SCOPE DH_ELEMENT_TYPE *
+DH_LOOKUP(DH_TYPE * tb, DH_KEY_TYPE key)
+{
+ uint32 hash = DH_HASH_KEY(tb, key);
+
+ return DH_LOOKUP_HASH_INTERNAL(tb, key, hash);
+}
+
+/*
+ * Lookup an entry in the hash table using an already-calculated hash.
+ *
+ * Returns NULL if key not present.
+ */
+DH_SCOPE DH_ELEMENT_TYPE *
+DH_LOOKUP_HASH(DH_TYPE * tb, DH_KEY_TYPE key, uint32 hash)
+{
+ return DH_LOOKUP_HASH_INTERNAL(tb, key, hash);
+}
+
+/*
+ * Delete an entry from hash table by key. Returns whether to-be-deleted key
+ * was present.
+ */
+DH_SCOPE bool
+DH_DELETE(DH_TYPE * tb, DH_KEY_TYPE key)
+{
+ uint32 hash = DH_HASH_KEY(tb, key);
+ uint32 startelem = DH_INITIAL_BUCKET(tb, hash);
+ uint32 curelem = startelem;
+
+ for (;;)
+ {
+ DH_BUCKET *bucket = &tb->buckets[curelem];
+
+ if (!DH_IS_BUCKET_IN_USE(bucket))
+ return false;
+
+ if (bucket->hashvalue == hash)
+ {
+ DH_ELEMENT_TYPE *entry;
+
+ entry = DH_INDEX_TO_ELEMENT(tb, bucket->index);
+
+ if (DH_EQUAL(tb, key, entry->DH_KEY))
+ {
+ DH_BUCKET *lastbucket = bucket;
+
+ /* mark the entry as unused */
+ DH_REMOVE_ENTRY(tb, bucket->index);
+ /* and mark the bucket unused */
+ DH_SET_BUCKET_EMPTY(bucket);
+
+ tb->members--;
+
+ /*
+ * Backward shift following buckets till either an empty
+ * bucket or a bucket at its optimal position is encountered.
+ *
+ * While that sounds expensive, the average chain length is
+ * short, and deletions would otherwise require tombstones.
+ */
+ for (;;)
+ {
+ DH_BUCKET *curbucket;
+ uint32 curhash;
+ uint32 curoptimal;
+
+ curelem = DH_NEXT(tb, curelem, startelem);
+ curbucket = &tb->buckets[curelem];
+
+ if (!DH_IS_BUCKET_IN_USE(curbucket))
+ break;
+
+ curhash = curbucket->hashvalue;
+ curoptimal = DH_INITIAL_BUCKET(tb, curhash);
+
+ /* current is at optimal position, done */
+ if (curoptimal == curelem)
+ {
+ DH_SET_BUCKET_EMPTY(lastbucket);
+ break;
+ }
+
+ /* shift */
+ memcpy(lastbucket, curbucket, sizeof(DH_BUCKET));
+ DH_SET_BUCKET_EMPTY(curbucket);
+
+ lastbucket = curbucket;
+ }
+
+ return true;
+ }
+ }
+ /* TODO: return false; if the distance is too big */
+
+ curelem = DH_NEXT(tb, curelem, startelem);
+ }
+}
+
+/*
+ * Initialize iterator.
+ */
+DH_SCOPE void
+DH_START_ITERATE(DH_TYPE * tb, DH_ITERATOR * iter)
+{
+ iter->cursegidx = -1;
+ iter->curitemidx = -1;
+ iter->found_members = 0;
+ iter->total_members = tb->members;
+}
+
+/*
+ * Iterate over all entries in the hashtable. Return the next occupied entry,
+ * or NULL if there are no more entries.
+ *
+ * During iteration, only the current entry in the hash table and any entry
+ * which was previously visited in the loop may be deleted. Deletion of items
+ * not yet visited is prohibited as are insertions of new entries.
+ */
+DH_SCOPE DH_ELEMENT_TYPE *
+DH_ITERATE(DH_TYPE * tb, DH_ITERATOR * iter)
+{
+ /*
+ * Bail if we've already visited all members. This check allows us to
+ * exit quickly in cases where the table is large but it only contains a
+ * small number of records. This also means that inserts into the table
+ * are not possible during iteration. If that is done then we may not
+ * visit all items in the table. Rather than ever removing this check to
+ * allow table insertions during iteration, we should add another iterator
+ * where insertions are safe.
+ */
+ if (iter->found_members == iter->total_members)
+ return NULL;
+
+ for (;;)
+ {
+ DH_SEGMENT *seg;
+
+ /* need a new segment? */
+ if (iter->curitemidx == -1)
+ {
+ iter->cursegidx = DH_NEXT_ONEBIT(tb->used_segments,
+ tb->used_segment_words,
+ iter->cursegidx);
+
+ /* no more segments with items? We're done */
+ if (iter->cursegidx == -1)
+ return NULL;
+ }
+
+ seg = tb->segments[iter->cursegidx];
+
+ /* if the segment has items then it certainly shouldn't be NULL */
+ Assert(seg != NULL);
+
+ /*
+ * Advance to the next used item in this segment. For full segments
+ * we bypass the bitmap and just skip to the next item, otherwise we
+ * consult the bitmap to find the next used item.
+ */
+ if (seg->nitems == DH_ITEMS_PER_SEGMENT)
+ {
+ if (iter->curitemidx == DH_ITEMS_PER_SEGMENT - 1)
+ iter->curitemidx = -1;
+ else
+ {
+ iter->curitemidx++;
+ iter->found_members++;
+ return &seg->items[iter->curitemidx];
+ }
+ }
+ else
+ {
+ iter->curitemidx = DH_NEXT_ONEBIT(seg->used_items,
+ DH_BITMAP_WORDS,
+ iter->curitemidx);
+
+ if (iter->curitemidx >= 0)
+ {
+ iter->found_members++;
+ return &seg->items[iter->curitemidx];
+ }
+ }
+
+ /*
+ * DH_NEXT_ONEBIT returns -1 when there are no more bits. We just
+ * loop again to fetch the next segment.
+ */
+ }
+}
+
+#endif /* DH_DEFINE */
+
+/* undefine external parameters, so next hash table can be defined */
+#undef DH_PREFIX
+#undef DH_KEY_TYPE
+#undef DH_KEY
+#undef DH_ELEMENT_TYPE
+#undef DH_HASH_KEY
+#undef DH_SCOPE
+#undef DH_DECLARE
+#undef DH_DEFINE
+#undef DH_EQUAL
+#undef DH_ALLOCATE
+#undef DH_ALLOCATE_ZERO
+#undef DH_FREE
+
+/* undefine locally declared macros */
+#undef DH_MAKE_PREFIX
+#undef DH_MAKE_NAME
+#undef DH_MAKE_NAME_
+#undef DH_ITEMS_PER_SEGMENT
+#undef DH_UNUSED_BUCKET_INDEX
+#undef DH_INDEX_SEGMENT
+#undef DH_INDEX_ITEM
+#undef DH_BITS_PER_WORD
+#undef DH_BITMAP_WORD
+#undef DH_RIGHTMOST_ONE_POS
+#undef DH_BITMAP_WORDS
+#undef DH_WORDNUM
+#undef DH_BITNUM
+#undef DH_RAW_ALLOCATOR
+#undef DH_MAX_SIZE
+#undef DH_FILLFACTOR
+#undef DH_MAX_FILLFACTOR
+#undef DH_GROW_MAX_DIB
+#undef DH_GROW_MAX_MOVE
+#undef DH_GROW_MIN_FILLFACTOR
+
+/* types */
+#undef DH_TYPE
+#undef DH_BUCKET
+#undef DH_SEGMENT
+#undef DH_ITERATOR
+
+/* external function names */
+#undef DH_CREATE
+#undef DH_DESTROY
+#undef DH_RESET
+#undef DH_INSERT
+#undef DH_INSERT_HASH
+#undef DH_DELETE
+#undef DH_LOOKUP
+#undef DH_LOOKUP_HASH
+#undef DH_GROW
+#undef DH_START_ITERATE
+#undef DH_ITERATE
+
+/* internal function names */
+#undef DH_NEXT_ONEBIT
+#undef DH_NEXT_ZEROBIT
+#undef DH_INDEX_TO_ELEMENT
+#undef DH_MARK_SEGMENT_ITEM_USED
+#undef DH_MARK_SEGMENT_ITEM_UNUSED
+#undef DH_GET_NEXT_UNUSED_ENTRY
+#undef DH_REMOVE_ENTRY
+#undef DH_SET_BUCKET_IN_USE
+#undef DH_SET_BUCKET_EMPTY
+#undef DH_IS_BUCKET_IN_USE
+#undef DH_COMPUTE_PARAMETERS
+#undef DH_NEXT
+#undef DH_PREV
+#undef DH_DISTANCE_FROM_OPTIMAL
+#undef DH_INITIAL_BUCKET
+#undef DH_INSERT_HASH_INTERNAL
+#undef DH_LOOKUP_HASH_INTERNAL
--
2.30.2
On Mon, Sep 27, 2021 at 04:30:25PM +1300, David Rowley wrote:
On Fri, 24 Sept 2021 at 20:26, Jaime Casanova
<jcasanov@systemguards.com.ec> wrote:
Are you planning to work on this in this CF?
This is marked as "Ready for committer" but it doesn't apply anymore.
I've attached an updated patch. Since this patch is pretty different
from the one that was marked as ready for committer, I'll move this to
needs review.
However, I'm a bit disinclined to go ahead with this patch at all.
Thomas made it quite clear he's not for the patch, and on discussing
the patch with Andres, it turned out he does not like the idea either.
Andres' argument was along the lines of bitmaps being slow. The hash
table uses bitmaps to record which items in each segment are in use. I
don't really agree with him about that, so we'd likely need some more
comments to help reach a consensus on whether we want this or not.
Maybe Andres has more comments, so I've included him here.
Hi David,
Thanks for the updated patch.
Based on your comments I will mark this patch as withdrawn at midday of
my monday unless someone objects to that.
--
Jaime Casanova
Director de Servicios Profesionales
SystemGuards - Consultores de PostgreSQL
On Mon, 4 Oct 2021 at 20:37, Jaime Casanova
<jcasanov@systemguards.com.ec> wrote:
Based on your comments I will mark this patch as withdrawn at midday of
my monday unless someone objects to that.
I really think we need a hash table implementation that's faster than
dynahash and supports stable pointers to elements (simplehash does not
have stable pointers). I think withdrawing this won't help us move
towards getting that.
Thomas voiced his concerns here about having an extra hash table
implementation and then also concerns that I've coded the hash table
code to be fast to iterate over the hashed items. To be honest, I
think both Andres and Thomas must be misunderstanding the bitmap part.
I get the impression that they both think the bitmap is solely there
to make iterations faster, but in reality it's primarily there as a
compact freelist and can also be used to make iterations over sparsely
populated tables fast. For the freelist we look for 0-bits, and we
look for 1-bits during iteration.
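To show what I mean with a single 64-bit word, here's a stand-alone toy
(not code from the patch; the helper macro is just the GCC/Clang builtin):

#include <stdint.h>
#include <stdio.h>

/* position of the lowest 1-bit; 'w' must be non-zero */
#define RIGHTMOST_ONE(w) __builtin_ctzll(w)

int
main(void)
{
	uint64_t	used_items = 0;	/* one bit per slot in a 64-item segment */
	int			slot;

	/* freelist behaviour: the next free slot is the lowest 0-bit */
	slot = RIGHTMOST_ONE(~used_items);	/* slot 0 */
	used_items |= UINT64_C(1) << slot;
	slot = RIGHTMOST_ONE(~used_items);	/* slot 1 */
	used_items |= UINT64_C(1) << slot;

	/* iteration behaviour: walk the 1-bits, visiting only the used slots */
	for (uint64_t w = used_items; w != 0; w &= w - 1)
		printf("visiting slot %d\n", RIGHTMOST_ONE(w));

	return 0;
}

The same word drives both operations, which is why I don't see the bitmap
as an iteration-only frill.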
I think I'd much rather talk about the concerns here than just
withdraw this. Even if what I have today just serves as something to
aid discussion.
It would also be good to get the points Andres raised with me off-list
on this thread. I think his primary concern was that bitmaps are
slow, but I don't really think maintaining full pointers into freed
items is going to improve the performance of this.
David
Good day, David and all.
On Tue, 05/10/2021 at 11:07 +1300, David Rowley wrote:
On Mon, 4 Oct 2021 at 20:37, Jaime Casanova
<jcasanov@systemguards.com.ec> wrote:
Based on your comments I will mark this patch as withdrawn at midday of
my monday unless someone objects to that.
I really think we need a hash table implementation that's faster than
dynahash and supports stable pointers to elements (simplehash does not
have stable pointers). I think withdrawing this won't help us move
towards getting that.
I agree with you. I believe densehash could replace both dynahash and
simplehash. The shared memory usages of dynahash should be reworked onto
some other, less dynamic hash structure, so there would be densehash for
local hashes and a "statichash" for static shared memory.
densehash's slight slowness compared to simplehash in some operations
isn't worth keeping simplehash alongside densehash.
Thomas voiced his concerns here about having an extra hash table
implementation and then also concerns that I've coded the hash table
code to be fast to iterate over the hashed items. To be honest, I
think both Andres and Thomas must be misunderstanding the bitmap part.
I get the impression that they both think the bitmap is solely there
to make iterations faster, but in reality it's primarily there as a
compact freelist and can also be used to make iterations over sparsely
populated tables fast. For the freelist we look for 0-bits, and we
look for 1-bits during iteration.
I think this part is overengineered. More below.
I think I'd much rather talk about the concerns here than just
withdraw this. Even if what I have today just serves as something to
aid discussion.
It would also be good to get the points Andres raised with me off-list
on this thread. I think his primary concern was that bitmaps are
slow, but I don't really think maintaining full pointers into freed
items is going to improve the performance of this.
David
First on "quirks" in the patch I was able to see:
DH_NEXT_ZEROBIT:
DH_BITMAP_WORD mask = (~(DH_BITMAP_WORD) 0) << DH_BITNUM(prevbit);
DH_BITMAP_WORD word = ~(words[wordnum] & mask); /* flip bits */
really should be
DH_BITMAP_WORD mask = (~(DH_BITMAP_WORD) 0) << DH_BITNUM(prevbit);
DH_BITMAP_WORD word = (~words[wordnum]) & mask; /* flip bits */
But it does no harm because DH_NEXT_ZEROBIT is always called with
`prevbit = -1`, which is incremented to `0`. Therefore `mask` is always
`0xffff...ff`.
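To illustrate with a made-up word (a case the current callers can never
hit): suppose words[0] = 0b0111 (items 0-2 used) and the search starts
from bit 2, so mask = ~0 << 2:

    patch: ~(0b0111 & mask) = ~0b0100 = ...11111011 -> reports bit 0, which is in use
    fixed: (~0b0111) & mask = ...11111000           -> reports bit 3, the real free item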
DH_INDEX_TO_ELEMENT
/* ensure this segment is marked as used */
should be
/* ensure this item is marked as used in the segment */
DH_GET_NEXT_UNUSED_ENTRY
/* find the first segment with an unused item */
while (seg != NULL && seg->nitems == DH_ITEMS_PER_SEGMENT)
seg = tb->segments[++segidx];
There is no protection for `++segidx <= tb->nsegments`. I understand it
cannot happen because `grow_threshold` is always less than
`nsegments * DH_ITEMS_PER_SEGMENT`, but at least a comment should be
left explaining why the check can safely be omitted.
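Something along these lines is what I have in mind (only a sketch of the
comment/assertion, not a proper patch):

	/* find the first segment with an unused item */
	while (seg != NULL && seg->nitems == DH_ITEMS_PER_SEGMENT)
	{
		/*
		 * We cannot run off the end of the segments array here:
		 * grow_threshold is always below nsegments * DH_ITEMS_PER_SEGMENT,
		 * so some later slot must be NULL or hold a segment with free items.
		 */
		Assert(segidx + 1 < tb->nsegments);
		seg = tb->segments[++segidx];
	}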
Now architecture notes:
I don't believe there is a need for a configurable DH_ITEMS_PER_SEGMENT.
I don't even believe it should be anything other than 16 (or 8). A
segment then needs only one `used_items` word, which simplifies the code
a lot. There is not much difference in overhead between 1/16 and 1/256.
And then, I believe, a segment doesn't need both `nitems` and
`used_items`: the condition "segment is full" simply becomes
`used_items == 0xffff`.
Next, I think it is better to make a real free list instead of looping
to search for one, i.e. add a `uint32 DH_SEGMENT->next` field and
maintain a list starting from `first_free_segment`.
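Maintaining that list is cheap; roughly (again only a sketch, assuming
the segment struct above gains a `uint32 next` field):

/* insert path: if the slot we just used made the segment full, pop it */
static inline void
dh_freelist_segment_filled(DH_TYPE *tb, DH_SEGMENT *seg)
{
	if (seg->used_items == 0xFFFF)
		tb->first_free_segment = seg->next;
}

/* delete path: a previously full segment regains a slot, push it back */
static inline void
dh_freelist_segment_unfilled(DH_TYPE *tb, DH_SEGMENT *seg, uint32 segidx)
{
	seg->next = tb->first_free_segment;
	tb->first_free_segment = segidx;
}

This does give up the "use the lowest-numbered segment first" behaviour,
which is what the next point is about.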
If the concern is to "allocate from lower-numbered segments first", then
a min-heap could be used instead. It is possible to create a very
efficient non-balanced "binary heap" with just two fields
(`uint32 left, right`). An algorithmic PoC in Ruby is attached.
There is also an allocation concern: AllocSet tends to allocate in
power-of-2 sizes. Using power-of-2 segments with a header
(nitems/used_items) will certainly waste close to 2x the space on every
segment if the element size is also a power of 2, and a bit less for
other element sizes.
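For a concrete (made-up) example: with 16 elements of 64 bytes the items
array alone is 1024 bytes; an 8-byte header makes the request 1032 bytes,
which AllocSet rounds up to a 2048-byte chunk, so nearly half of every
segment allocation is wasted. Without the header the request stays at
exactly 1024 bytes. (This assumes the segment is small enough to go
through AllocSet's power-of-2 rounding path rather than being allocated
as a dedicated block.)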
There could be two workarounds:
- make the segment a bit less capable (15 elements instead of 16), or
- move the header from the segment itself into the `DH_TYPE->segments`
array.
I think the second option is preferable:
- `DH_TYPE->segments[x]` is inevitably accessed on every operation, so
why not store some info there?
- if nitems/used_items live in `DH_TYPE->segments[x]`, then hashtable
iteration doesn't need a bitmap at all - there is no need for the
`DH_TYPE->used_segments` bitmap. The absence of this bitmap will reduce
overhead on the usual operations (insert/delete) as well.
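i.e. roughly this shape (a sketch, names invented):

/* per-segment bookkeeping lives in the directory entry, not in the segment */
typedef struct DH_SEGMENT_SLOT
{
	DH_ELEMENT_TYPE *items;		/* bare array of DH_ITEMS_PER_SEGMENT items */
	uint16		used_items;		/* one bit per item; 0 means "skip when iterating" */
} DH_SEGMENT_SLOT;

/*
 * DH_TYPE->segments would then be an array of DH_SEGMENT_SLOT, replacing both
 * the DH_SEGMENT pointer array and the used_segments bitmap: iteration simply
 * skips slots whose used_items is zero.
 */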
I hope this was useful.
regards
Yura Sokolov
y.sokolov@postgrespro.ru
funny.falcon@gmail.com
Attachments:
On Tue, Oct 05, 2021 at 11:07:48AM +1300, David Rowley wrote:
I think I'd much rather talk about the concerns here than just
withdraw this. Even if what I have today just serves as something to
aid discussion.
Hmm. This last update was two months ago, and the patch does not
apply anymore. I am marking it as RwF for now.
--
Michael
Hi,
On 2021-04-25 03:58:38 +1200, David Rowley wrote:
Currently, we use dynahash hash tables to store the SMgrRelation so we
can perform fast lookups by RelFileNodeBackend. However, I had in mind
that a simplehash table might perform better. So I tried it...
The test case was basically inserting 100 million rows one at a time
into a hash partitioned table with 1000 partitions and 2 int columns
and a primary key on one of those columns. It was about 12GB of WAL. I
used a hash partitioned table in the hope to create a fairly
random-looking SMgr hash table access pattern. Hopefully something
similar to what might happen in the real world.
A potentially stupid question: Do we actually need to do smgr lookups in this
path? Afaict nearly all of the buffer lookups here will end up as cache hits in
shared buffers, correct?
Afaict we'll do two smgropens in a lot of paths:
1) XLogReadBufferExtended() does smgropen so it can do smgrnblocks()
2) ReadBufferWithoutRelcache() does an smgropen()
It's pretty sad that we constantly do two smgropen()s to start with. But in
the cache hit path we don't actually need an smgropen in either case afaict.
ReadBufferWithoutRelcache() does an smgropen, because that's
ReadBuffer_common()'s API. Which in turn has that API because it wants to use
RelationGetSmgr() when coming from ReadBufferExtended(). It doesn't seem
awful to allow smgr to be NULL and to pass in the rlocator in addition.
XLogReadBufferExtended() does an smgropen() so it can do smgrcreate() and
smgrnblocks(). But neither is needed in the cache hit case, I think. We could
do a "read-only" lookup in s_b, and only do the current logic in case that
fails?
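Roughly what I have in mind for the hit path, entirely untested and
hand-waving over the pin/lock dance (rlocator/forkNum/blockNum coming
from the caller):

	BufferTag	tag;
	uint32		hash;
	LWLock	   *partition_lock;
	int			buf_id;

	/* build the tag from the relfilelocator alone, no SMgrRelation needed */
	InitBufferTag(&tag, &rlocator, forkNum, blockNum);
	hash = BufTableHashCode(&tag);
	partition_lock = BufMappingPartitionLock(hash);

	LWLockAcquire(partition_lock, LW_SHARED);
	buf_id = BufTableLookup(&tag, hash);
	LWLockRelease(partition_lock);

	if (buf_id >= 0)
	{
		/* hit: pin the buffer and return it, smgropen() never runs */
	}
	else
	{
		/* miss: fall back to the existing smgropen() + smgrnblocks() path */
	}

Obviously the real thing would need to pin before dropping the partition
lock, but you get the idea.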
Greetings,
Andres Freund